
Voice Input Basics

The core of BiBi Keyboard is high-quality speech recognition. It supports multiple ASR engines and recognition modes, so you can use voice input in any app.

How It Works

Voice input has three stages:

  1. Recording: the app records your voice. Depending on settings, recording either stops automatically on silence or is stopped manually.
  2. Recognition: audio is sent to an ASR engine (cloud or local) and transcribed into text.
  3. Output: the transcript can optionally be refined by AI post-processing, then inserted into the current editor.
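The three stages above can be sketched as a small pipeline. This is an illustrative sketch only, not BiBi Keyboard's actual code: `run_voice_input` and its four callback parameters are hypothetical names chosen for this example.

```python
from typing import Callable, Optional


def run_voice_input(
    record: Callable[[], bytes],
    recognize: Callable[[bytes], str],
    insert: Callable[[str], None],
    post_process: Optional[Callable[[str], str]] = None,
) -> str:
    """Run one voice-input round. All callbacks are hypothetical stand-ins."""
    audio = record()                # Stage 1: recording (auto-stop or manual stop)
    text = recognize(audio)         # Stage 2: recognition by a cloud or local ASR engine
    if post_process is not None:    # Stage 3a: optional AI post-processing
        text = post_process(text)
    insert(text)                    # Stage 3b: insert into the current editor
    return text


# Usage with stub callbacks standing in for the real recorder, engine, and editor.
editor = []
run_voice_input(lambda: b"pcm-bytes", lambda audio: "hello world",
                editor.append, post_process=str.capitalize)
print(editor)  # ['Hello world']
```

Passing the stages in as callables keeps the sketch independent of any particular provider, which mirrors how the same flow applies to every engine listed below.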

Supported ASR Providers

BiBi Keyboard supports 12 ASR providers, grouped into cloud and local:

Cloud ASR

| Provider | Streaming | Duration limit (non-streaming) | Notes |
| --- | --- | --- | --- |
| Volcengine | Yes | 1 hour | New users often get free quota; supports bidirectional streaming |
| SiliconFlow | No | 20 min | Built-in free ASR (SenseVoiceSmall / TeleSpeechASR); supports Qwen3-Omni transcription (own key) |
| ElevenLabs | Yes | 20 min | High-accuracy English; supports both file and streaming |
| OpenAI | No | 20 min | Default gpt-4o-mini-transcribe; you can use any compatible Audio Transcriptions model |
| DashScope (Alibaba) | Yes | 3 min | qwen3-asr-flash; supports streaming and non-streaming |
| Gemini (Google) | No | 4 hours | File-based multimodal speech understanding |
| Soniox | Yes | 1 hour | Supports multi-language prompts; both streaming and file modes |
| Zhipu (GLM) | No | 20 min | GLM-ASR; supports context prompt parameters |

Local ASR (Offline)

| Provider | Streaming | Duration limit (non-streaming) | Notes |
| --- | --- | --- | --- |
| SenseVoice | Pseudo ¹ | 5 min | Based on sherpa-onnx; multilingual |
| FunASR Nano | No | 5 min | sherpa-onnx offline recognition (no pseudo-streaming preview) |
| TeleSpeech | Pseudo ¹ | 5 min | Based on sherpa-onnx; optimized for CN |
| Paraformer | Yes | Unlimited ² | Pure local streaming recognition |

Notes

¹ Pseudo-streaming: shows partial results based on VAD segmentation, but it is not true real-time streaming.

² Streaming mode has no duration limit (continuous recognition).

The "Duration limit (non-streaming)" here is the app's single-segment recording cap used to control segmented recording behavior. It does not represent provider billing limits or total free quota. For example: Volcengine often provides ~20 hours of free quota for new users; SiliconFlow provides a built-in free ASR service with no total duration quota. For other providers, check their consoles for quota/billing.

For more details on supported models, recommended configs and updated quotas, see the Providers & Models Guide.
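To illustrate the pseudo-streaming idea from note ¹, here is a minimal sketch that splits audio at silence and emits a growing partial transcript each time a speech segment closes. It assumes a simple energy-threshold VAD over per-frame energies; the app's real VAD, its thresholds, and the names `vad_segments` and `pseudo_stream` are assumptions made for this example.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def vad_segments(energies: Sequence[float], threshold: float = 0.1,
                 min_silence: int = 3) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges of speech.

    A segment closes after `min_silence` consecutive frames below `threshold`.
    Both parameters are illustrative defaults, not values from the app.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    silence = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i           # speech begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence:
                # End the segment at the first silent frame of the run.
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:           # audio ended while still in speech
        segments.append((start, len(energies) - silence))
    return segments


def pseudo_stream(energies: Sequence[float],
                  recognize_segment: Callable[[int, int], str]) -> Iterator[str]:
    """Yield the cumulative transcript after each completed VAD segment."""
    parts: List[str] = []
    for start, end in vad_segments(energies):
        parts.append(recognize_segment(start, end))
        yield " ".join(parts)


frames = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0]
print(vad_segments(frames))  # [(1, 3), (6, 8)]
```

A partial result only appears once a whole segment has been recognized, which is why this approach gives a preview but is not true real-time streaming.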

Cloud vs Local

Advantages of cloud ASR

  • Higher accuracy: large cloud models often perform better
  • Multilingual: better for code-switching, dialects, and more languages
  • No maintenance: no model download; updates are handled by the provider

Advantages of local ASR

  • Fully offline: no network required; privacy-friendly
  • Lower latency: no network transfer
  • No data usage: good for limited networks
  • No API quota: no need to worry about API costs and limits

Streaming vs Non-streaming

Streaming recognition

How it works: upload audio while recording, and get partial results in real time.

Pros:

  • ✅ real-time feedback
  • ✅ no duration limit
  • ✅ lower latency

Supported engines:

  • Cloud: Volcengine, Soniox, DashScope, ElevenLabs
  • Local: Paraformer

Non-streaming recognition (file upload)

How it works: upload the whole audio file after recording stops.

Pros:

  • ✅ potentially higher quality (global analysis on full audio)
  • ✅ simpler and more stable
  • ✅ supports more providers

Cons:

  • ⚠️ duration limit (see tables above)
  • ⚠️ recognition starts only after recording stops

Suggestions

  • For providers that support both modes, switch under Settings → ASR Settings → [Provider Settings].
  • Streaming is great for long recordings and low-latency feedback.
  • File mode is great for short audio when accuracy matters more.

Segmented Recording

For non-streaming engines, if a recording exceeds the app's single-segment limit for that provider, BiBi Keyboard automatically performs segmented recording.

How it works

  1. Auto split: near the limit, the current segment is cut and a new segment starts
  2. Background upload: segments are uploaded/recognized in background while recording continues
  3. Seamless UX: UI stays in recording state without noticeable interruption
  4. Merge results: transcripts from segments are concatenated automatically

Per-provider segment limits (app-side cap)

| Provider | Segment cap | Notes |
| --- | --- | --- |
| Volcengine | 1 hour | Official max per request is ~2h; app uses 1h as a safety margin |
| SiliconFlow | 20 min | App default; unrelated to billing/quota |
| ElevenLabs | 20 min | App default to avoid failures on very long audio |
| OpenAI | 20 min | App default; tune model/usage as needed |
| DashScope | 3 min | Default qwen3-asr-flash; app segment cap is 3 min |
| Gemini | 4 hours | Official max is ~9.5h; app uses 4h as a safety margin |
| Soniox | 1 hour | No strict official max found; app defaults to 1h |
| SenseVoice | 5 min | Local performance cap to avoid excessive RAM/time |
| FunASR Nano | 5 min | Local performance cap to avoid excessive RAM/time |
| TeleSpeech | 5 min | Local performance cap to avoid excessive RAM/time |

Notes

  • Streaming engines (Paraformer, etc.) have no duration limit.
  • Segmented recording works only in non-streaming mode.
  • Each segment may incur a separate API call cost (for cloud providers).
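As a rough illustration of the auto-split and merge steps in this section, the sketch below plans segment boundaries from a cap and joins the per-segment transcripts. The cap values come from the table above; the function names and the single-space separator used when merging are assumptions for this example, and the app's actual cut points (it splits "near" the limit) may differ.

```python
from typing import List, Tuple

# App-side segment caps in seconds, copied from the table above (subset).
SEGMENT_CAP_SECONDS = {
    "DashScope": 3 * 60,
    "OpenAI": 20 * 60,
    "Volcengine": 60 * 60,
}


def plan_segments(total_seconds: float, cap_seconds: float) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive segments no longer than the cap."""
    if cap_seconds <= 0:
        raise ValueError("cap_seconds must be positive")
    total = float(total_seconds)
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < total:
        end = min(start + cap_seconds, total)
        segments.append((start, end))
        start = end
    return segments


def merge_transcripts(parts: List[str]) -> str:
    """Concatenate per-segment transcripts, skipping empty ones."""
    return " ".join(p.strip() for p in parts if p.strip())


# A 7-minute recording with DashScope's 3-minute cap yields three segments,
# so a cloud provider may see up to three separate API calls.
plan = plan_segments(7 * 60, SEGMENT_CAP_SECONDS["DashScope"])
print(plan)       # [(0.0, 180.0), (180.0, 360.0), (360.0, 420.0)]
print(len(plan))  # 3
```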

Backup ASR Engine (Parallel Primary/Backup)

If your primary ASR occasionally times out or fails, you can enable a backup ASR engine. BiBi Keyboard records only once, then sends the same audio to both the primary and the backup engine. If the primary returns a non-empty final result in time, that result is used; otherwise the app falls back to the backup result.

How to enable

  1. Open Settings → ASR Settings
  2. Find "Backup speech recognition engine" and enable "Enable backup engine"
  3. Tap "Backup provider" and choose a provider different from your primary one
  4. Make sure the backup provider is also configured (API key / local model files, etc.)

Notes

This runs two engines in parallel. Even if the primary result is used, the backup may still trigger an API request/cost (depending on vendor billing and cancellation behavior).
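The selection rule for parallel primary/backup recognition can be sketched with `asyncio`. This is a hedged illustration in Python, not the app's implementation: `recognize_with_backup`, the engine callables, and the 5-second default timeout are all assumptions for this example.

```python
import asyncio
from typing import Awaitable, Callable

Engine = Callable[[bytes], Awaitable[str]]  # hypothetical engine signature


async def recognize_with_backup(audio: bytes, primary: Engine, backup: Engine,
                                primary_timeout: float = 5.0) -> str:
    """Send the same audio to both engines; prefer a timely, non-empty primary result."""
    primary_task = asyncio.create_task(primary(audio))
    backup_task = asyncio.create_task(backup(audio))
    # Mark a backup failure as retrieved so it is not logged when the result is unused.
    backup_task.add_done_callback(lambda t: t.cancelled() or t.exception())
    try:
        text = await asyncio.wait_for(primary_task, timeout=primary_timeout)
        if text.strip():
            # Primary succeeded; the backup request may already have been billed.
            backup_task.cancel()
            return text
    except Exception:
        pass  # primary timed out or failed: fall through to the backup
    return await backup_task


async def _demo() -> str:
    async def slow_primary(audio: bytes) -> str:
        await asyncio.sleep(1.0)
        return "primary text"

    async def fast_backup(audio: bytes) -> str:
        return "backup text"

    return await recognize_with_backup(b"pcm", slow_primary, fast_backup,
                                       primary_timeout=0.05)


print(asyncio.run(_demo()))  # backup text
```

Because both requests start immediately, cancelling the backup after the primary wins does not guarantee the provider skips the charge, which matches the note above.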

One-tap Setup Options

1. No config needed

  1. Open the app; it defaults to SiliconFlow free service
  2. Under Settings → ASR Settings → SiliconFlow, switch between the two free models (FunAudioLLM/SenseVoiceSmall and TeleAI/TeleSpeechASR)

2. Configure a cloud provider

Using Volcengine as an example:

  1. Create an account in the Volcengine console
  2. Create an app and obtain App Key and Access Key
  3. In BiBi Keyboard, go to Settings → ASR Settings → Provider and select Volcengine
  4. Fill in credentials and save

3. Auto-configure local models

  1. Download SenseVoice Small from the model release
  2. Extract to Android/data/com.brycewg.asrkb/files/sensevoice/
  3. The app will automatically select SenseVoice under Settings → ASR Settings → Provider

Tip

First load of local models may take a few seconds. You can enable "Preload model" (SenseVoice / FunASR Nano / TeleSpeech / Paraformer all support it) so the model is loaded when the keyboard or floating ball is first shown, reducing the first-recognition latency.

Local Punctuation (Optional)

TeleSpeech and Paraformer can add punctuation with an extra shared punctuation model. If the model is missing, recognition still works, but results may look more "spoken" (less punctuated).

  1. Open Settings → ASR Settings
  2. Go to the TeleSpeech or Paraformer section
  3. Under the punctuation model section, tap "Download model" (or import the ZIP)

Download source

When downloading local models, you can choose a download source and see the latency of each one. Picking a lower-latency source is usually more stable.

Recognition Enhancements (Optional)

  • Offline denoise for non-streaming ASR: Settings → Input Settings → Offline denoise for non-streaming ASR (applies to file-mode and local offline recognition)

Released under the Apache 2.0 License.