Voice Input Basics
The core of BiBi Keyboard is high-quality speech recognition. It supports multiple ASR engines and recognition modes, so you can use voice input in any app.
How It Works
Voice input has three stages:
- Recording: the app records your voice. Depending on your settings, it either auto-stops on silence or waits for you to stop it manually.
- Recognition: audio is sent to an ASR engine (cloud or local) and transcribed into text.
- Output: the transcript can optionally be refined by AI post-processing, then inserted into the current editor.
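For the technically curious, here is a minimal Kotlin sketch of that pipeline. The interface names (AudioRecorder, AsrEngine, PostProcessor) are illustrative placeholders, not BiBi Keyboard's actual classes:

```kotlin
// Minimal sketch of the three stages. AudioRecorder, AsrEngine and
// PostProcessor are illustrative placeholder interfaces, not BiBi
// Keyboard's actual classes.
interface AudioRecorder {
    /** Records until silence (VAD) or a manual stop and returns raw audio. */
    suspend fun record(): ByteArray
}

interface AsrEngine {
    /** Transcribes a finished recording with a cloud or local engine. */
    suspend fun transcribe(audio: ByteArray): String
}

interface PostProcessor {
    /** Optional AI refinement of the raw transcript. */
    suspend fun refine(transcript: String): String
}

suspend fun voiceInput(
    recorder: AudioRecorder,
    engine: AsrEngine,
    postProcessor: PostProcessor?,          // null when post-processing is disabled
    insertIntoEditor: (String) -> Unit,
) {
    val audio = recorder.record()                      // 1. Recording
    val transcript = engine.transcribe(audio)          // 2. Recognition
    val text = postProcessor?.refine(transcript) ?: transcript
    insertIntoEditor(text)                             // 3. Output into the current editor
}
```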
Supported ASR Providers
BiBi Keyboard supports 12 ASR providers, grouped into cloud and local:
Cloud ASR
| Provider | Streaming | Duration limit (non-streaming) | Notes |
|---|---|---|---|
| Volcengine | ✅ | 1 hour | New users often get free quota; supports bidirectional streaming |
| SiliconFlow | ❌ | 20 min | Built-in free ASR (SenseVoiceSmall / TeleSpeechASR); supports Qwen3-Omni transcription (own key) |
| ElevenLabs | ✅ | 20 min | High-accuracy English; supports both file and streaming |
| OpenAI | ❌ | 20 min | Default gpt-4o-mini-transcribe; you can use any compatible Audio Transcriptions model |
| DashScope (Alibaba) | ✅ | 3 min | qwen3-asr-flash; supports streaming and non-streaming |
| Gemini (Google) | ❌ | 4 hours | File-based multimodal speech understanding |
| Soniox | ✅ | 1 hour | Supports multi-language prompts; both streaming and file modes |
| Zhipu (GLM) | ❌ | 20 min | GLM-ASR; supports context prompt parameters |
Local ASR (Offline)
| Provider | Streaming | Duration limit (non-streaming) | Notes |
|---|---|---|---|
| SenseVoice | Pseudo ¹ | 5 min | Based on sherpa-onnx; multilingual |
| TeleSpeech | Pseudo ¹ | 5 min | Based on sherpa-onnx; optimized for CN |
| Paraformer | ✅ | Unlimited ² | Pure local streaming recognition |
| Zipformer | ✅ | Unlimited ² | Pure local streaming recognition |
Notes
¹ Pseudo-streaming: shows partial results based on VAD segmentation, but it is not true real-time streaming.
² Streaming mode has no duration limit (continuous recognition).
The "Duration limit (non-streaming)" here is the app's single-segment recording cap used to control segmented recording behavior. It does not represent provider billing limits or total free quota. For example: Volcengine often provides ~20 hours of free quota for new users; SiliconFlow provides a built-in free ASR service with no total duration quota. For other providers, check their consoles for quota/billing.
For more details on supported models, recommended configs and updated quotas, see the Providers & Models Guide.
Cloud vs Local
Advantages of cloud ASR
- Higher accuracy: large cloud models often perform better
- Multilingual: better for code-switching, dialects, and more languages
- No maintenance: no model download; updates are handled by the provider
Advantages of local ASR
- Fully offline: no network required; privacy-friendly
- Lower latency: no network transfer
- No data usage: nothing is uploaded, so it works well on metered or limited connections
- No API quota: no need to worry about API costs and limits
Streaming vs Non-streaming
Streaming recognition
How it works: upload audio while recording, and get partial results in real time.
Pros:
- ✅ real-time feedback
- ✅ no duration limit
- ✅ lower latency
Supported engines:
- Cloud: Volcengine, Soniox, DashScope, ElevenLabs
- Local: Paraformer, Zipformer
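To make the idea concrete, here is a hedged Kotlin sketch of what a streaming engine contract can look like: audio chunks are pushed up while partial transcripts come back, modeled with kotlinx.coroutines Flow. StreamingAsrEngine and PartialResult are illustrative names, not the app's real API:

```kotlin
import kotlinx.coroutines.flow.*

// Sketch of a streaming contract: audio chunks are pushed while partial
// transcripts come back. StreamingAsrEngine and PartialResult are
// illustrative names, not the app's real API.
data class PartialResult(val text: String, val isFinal: Boolean)

interface StreamingAsrEngine {
    /** Consumes microphone chunks as they are captured and emits partial results in real time. */
    fun transcribe(audioChunks: Flow<ByteArray>): Flow<PartialResult>
}

// Usage: keep the on-screen preview updated as partials arrive.
suspend fun showLivePreview(
    engine: StreamingAsrEngine,
    micChunks: Flow<ByteArray>,
    updatePreview: (String) -> Unit,
) {
    engine.transcribe(micChunks).collect { result ->
        updatePreview(result.text)   // refresh the preview with each partial transcript
    }
}
```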
Non-streaming recognition (file upload)
How it works: upload the whole audio file after recording stops.
Pros:
- ✅ potentially higher quality (global analysis on full audio)
- ✅ simpler and more stable
- ✅ supports more providers
Cons:
- ⚠️ duration limit (see tables above)
- ⚠️ recognition starts only after recording stops
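As an illustration of file mode, the sketch below posts a finished recording to an OpenAI-compatible /v1/audio/transcriptions endpoint with OkHttp (multipart `file` + `model` fields, as in the public OpenAI Audio API). It is a generic example, not BiBi Keyboard's actual networking code:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import java.io.File

// Sketch of a non-streaming request against an OpenAI-compatible
// /v1/audio/transcriptions endpoint. Generic example, not the app's code.
fun transcribeFile(apiKey: String, audio: File, model: String = "gpt-4o-mini-transcribe"): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("model", model)
        .addFormDataPart("file", audio.name, audio.asRequestBody("audio/wav".toMediaType()))
        .build()

    val request = Request.Builder()
        .url("https://api.openai.com/v1/audio/transcriptions")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Transcription failed: HTTP ${response.code}" }
        // The endpoint returns JSON like {"text": "..."}; parse it properly in real code.
        return response.body!!.string()
    }
}
```

In practice you would parse the JSON response and add retries; the sketch only shows the shape of a file-upload request.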
Suggestions
- For providers that support both modes, switch under Settings → ASR Settings → [Provider Settings].
- Streaming is great for long recordings and low-latency feedback.
- File mode is great for short audio when accuracy matters more.
Segmented Recording
For non-streaming engines, if a recording exceeds the app's single-segment limit for that provider, BiBi Keyboard automatically performs segmented recording.
How it works
- Auto split: near the limit, the current segment is cut and a new segment starts
- Background upload: segments are uploaded/recognized in background while recording continues
- Seamless UX: UI stays in recording state without noticeable interruption
- Merge results: transcripts from segments are concatenated automatically
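A rough Kotlin sketch of the auto-split idea, using illustrative names (SegmentedRecorder, recordSegment, recognize) rather than the app's internals:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Deferred
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll

// Sketch of the auto-split idea for non-streaming engines. All names here
// are illustrative, not the app's internals.
class SegmentedRecorder(
    private val scope: CoroutineScope,
    private val segmentCapMillis: Long,                                 // per-provider cap (see table below)
    private val recordSegment: suspend (maxMillis: Long) -> ByteArray,  // returns at the cap or on stop
    private val recognize: suspend (audio: ByteArray) -> String,        // one non-streaming ASR call
) {
    /** Keeps recording until [shouldStop] returns true, recognizing each
     *  finished segment in the background and merging the transcripts. */
    suspend fun record(shouldStop: () -> Boolean): String {
        val pending = mutableListOf<Deferred<String>>()
        while (!shouldStop()) {
            val segment = recordSegment(segmentCapMillis)   // auto split near the cap
            pending += scope.async { recognize(segment) }   // background upload/recognition
        }
        return pending.awaitAll().joinToString(" ")         // merge results in order
    }
}
```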
Per-provider segment limits (app-side cap)
| Provider | Segment cap | Notes |
|---|---|---|
| Volcengine | 1 hour | Official max per request is ~2h; app uses 1h as a safety margin |
| SiliconFlow | 20 min | App default; unrelated to billing/quota |
| ElevenLabs | 20 min | App default to avoid failures on very long audio |
| OpenAI | 20 min | App default; tune model/usage as needed |
| DashScope | 3 min | Default qwen3-asr-flash; app segment cap is 3 min |
| Gemini | 4 hours | Official max is ~9.5h; app uses 4h as a safety margin |
| Soniox | 1 hour | No strict official max found; app defaults to 1h |
| SenseVoice | 5 min | Local performance cap to avoid excessive RAM/time |
| TeleSpeech | 5 min | Local performance cap to avoid excessive RAM/time |
Notes
- Streaming engines (Paraformer, Zipformer, etc.) have no duration limit.
- Segmented recording works only in non-streaming mode.
- Each segment may incur a separate API call cost (for cloud providers).
One-tap Setup Options
1. Use the free service (recommended for beginners)
No config needed:
- Open the app; it defaults to the SiliconFlow free service
- Under Settings → ASR Settings → SiliconFlow, switch between the two free models (FunAudioLLM/SenseVoiceSmall and TeleAI/TeleSpeechASR)
2. Configure a cloud provider
Using Volcengine as an example:
- Create an account in the Volcengine console
- Create an app and obtain the App Key and Access Key
- In BiBi Keyboard, go to Settings → ASR Settings → Provider and select Volcengine
- Fill in credentials and save
3. Auto-configure local models
- Download SenseVoice Small from the model release
- Extract to Android/data/com.brycewg.asrkb/files/sensevoice/
- The app will automatically select SenseVoice under Settings → ASR Settings → Provider
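If you want to double-check the model location programmatically, a minimal check could look like the sketch below. getExternalFilesDir("sensevoice") maps to Android/data/com.brycewg.asrkb/files/sensevoice/; the assumption that the extracted model ships as .onnx files is for illustration only:

```kotlin
import android.content.Context
import java.io.File

// Sketch: detect an extracted SenseVoice model under the app's external
// files directory. The ".onnx" file check is an assumption for illustration.
fun hasLocalSenseVoiceModel(context: Context): Boolean {
    val modelDir: File? = context.getExternalFilesDir("sensevoice")
    return modelDir != null && modelDir.isDirectory &&
        (modelDir.listFiles()?.any { it.extension == "onnx" } ?: false)
}
```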
Tip
The first load of a local model may take a few seconds. You can enable "Preload model" (supported by SenseVoice, TeleSpeech, Paraformer, and Zipformer) so the model is loaded when the keyboard or floating ball is first shown, reducing first-recognition latency.
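Conceptually, "Preload model" works like the sketch below: start loading when the keyboard or floating ball first appears, then reuse the result at recognition time. LocalModel and loadModelFromDisk are illustrative names, not the app's actual implementation:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Deferred
import kotlinx.coroutines.async

// Sketch of the preload idea: kick off model loading early so the first
// recognition does not pay the load cost. Names are illustrative only.
class ModelPreloader(
    private val scope: CoroutineScope,
    private val loadModelFromDisk: suspend () -> LocalModel,
) {
    private var loading: Deferred<LocalModel>? = null

    /** Call when the keyboard or floating ball is first shown. */
    fun preload() {
        if (loading == null) loading = scope.async { loadModelFromDisk() }
    }

    /** Call at recognition time; waits only if preloading hasn't finished yet. */
    suspend fun model(): LocalModel =
        (loading ?: scope.async { loadModelFromDisk() }.also { loading = it }).await()
}

class LocalModel // placeholder for a loaded local recognizer
```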
Related
- Floating Ball - voice input anywhere
- AI Post-processing - refine transcripts with LLM
- Recording Modes - press-and-hold vs tap-to-toggle
- Auto-stop on Silence (VAD) - stop recording automatically
