# Web Whisper

🎶 Convert any audio to text 📝

A lightweight user interface for OpenAI's Whisper, right in your browser!

Currently working on an improved version: see Web Whisper Plus.
## Features

- Record and transcribe audio right from your browser.
- Run it 100% locally, or make use of the OpenAI Whisper API.
- Ability to switch between API and LOCAL mode.
- Upload any media file (video, audio) in any format and transcribe it.
- Option to cut audio to X seconds before transcription.
- Option to disable file uploads.
- Enter a video URL to transcribe it to text (uses yt-dlp to fetch the video).
- Select the input audio language.
- Auto-detect the input audio language.
- Option to speed up audio by 2x for faster results (this negatively impacts accuracy).
- Translate the input audio transcription to English.
- Download the `.srt` subtitle file generated from the audio.
- Option to enable transcription history.
- Configure Whisper:
  - Choose the Whisper model you want to use (tiny, base, small...).
  - Configure the number of threads and processors to use.
- Docker Compose for easy self-hosting.
- Privacy respecting (when run locally):
  - Everything happens locally; no third parties are involved.
  - Option to delete all files immediately after processing.
  - Option to keep files for later use / download.
- Uses the C++ port of Whisper from whisper.cpp:
  - No GPU needed; it runs on the CPU.
  - No need for complex installations.
- Backend written in Go.
- Lightweight and beautiful UI:
  - Frontend written with Svelte and Tailwind CSS.
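To make the `.srt` download concrete: SubRip is a plain-text format of numbered cues with `HH:MM:SS,mmm` timestamps. A minimal sketch of how such a file can be assembled (the function names here are illustrative, not taken from this project's codebase):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, time range, text, then a blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"


print(srt_cue(1, 0.0, 3.5, "Hello from Web Whisper!"), end="")
# 1
# 00:00:00,000 --> 00:00:03,500
# Hello from Web Whisper!
```

Concatenating one such cue per transcribed segment yields a valid `.srt` file.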
## Roadmap

- Ability to transcribe videos from a URL using the API.
- Summarize transcriptions via the ChatGPT API.
## Test it!

You can easily self-host your own instance with Docker (locally or on a server).

I have also made a testing instance available at: https://whisper.r3d.red

Note that this instance is limited:

- Maximum of 10-second audio recordings.
- File uploads are disabled.
- Uses the `base` model.
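The repository ships `example.docker-compose.yml` and `example.env`, plus separate `backend.Dockerfile` and `frontend.Dockerfile` files, so self-hosting amounts to copying the examples and running Docker Compose. As a hypothetical illustration only (the service names, port mapping, and volume below are assumptions, not the project's actual example file; use `example.docker-compose.yml` from the repo as your starting point):

```yaml
# Hypothetical sketch - the real configuration lives in example.docker-compose.yml.
services:
  backend:
    build:
      context: .
      dockerfile: backend.Dockerfile   # shipped in the repo root
    env_file: .env                     # copy example.env to .env first
    volumes:
      - whisper-data:/data             # assumed path: persists models/uploads

  frontend:
    build:
      context: .
      dockerfile: frontend.Dockerfile  # shipped in the repo root
    ports:
      - "8080:80"                      # assumed port mapping
    depends_on:
      - backend

volumes:
  whisper-data:
```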
## Screenshots

*Logo generated with Stable Diffusion*

Screenshots are available for the following views:

- Main page
- Video options
- Recording
- Transcription options
- Processing
- Result
## Other information

### How fast is this?

Whisper.cpp usually produces results faster than the Python implementation, though speed depends heavily on your machine's resources, the length of the media, and the file size. Here is a small benchmark:

Processor | RAM (GB) | Threads | Processors | Audio length | File size | Elapsed time |
---|---|---|---|---|---|---|
i7 | 16 | 4 | 1 | 30 m | 7 MB | 7 m 38 s |
i7 | 16 | 8 | 1 | 30 s | < 1 MB | 5 s |
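A handy way to read the benchmark above is the real-time factor: elapsed processing time divided by audio length, where values below 1.0 mean faster than real time. A quick sketch using the two rows from the table:

```python
def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; < 1.0 is faster than real time."""
    return elapsed_seconds / audio_seconds


# Row 1: 30 min of audio processed in 7 min 38 s with 4 threads.
print(round(real_time_factor(7 * 60 + 38, 30 * 60), 2))  # 0.25
# Row 2: 30 s of audio processed in 5 s with 8 threads.
print(round(real_time_factor(5, 30), 2))  # 0.17
```

So on this machine transcription ran roughly four to six times faster than real time.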
### What is the difference between models?

There are several models, which differ in size. The size difference comes from the number of parameters: the more parameters a model has, the better it can "understand" what it is listening to, so it makes fewer errors. Smaller models make more mistakes (e.g. confusing similar-sounding words).

Also note that with bigger models, both transcription time and memory usage increase:
Model | Disk | Mem (since v1.6.1) |
---|---|---|
tiny | 75 MB | ~125 MB |
base | 142 MB | ~210 MB |
small | 466 MB | ~600 MB |
medium | 1.5 GB | ~1.7 GB |
large | 2.9 GB | ~3.3 GB |
*Table from the whisper.cpp repo.*
### How accurate is this?

Not all languages reach the same accuracy with Whisper. The original Whisper repository includes a chart of languages and their WER (Word Error Rate): the smaller the WER, the better the model understands the language.

*Image from the original Whisper repo.*
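For reference, WER is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch (not the scoring code used by the Whisper authors, who also normalize text before comparing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions turn ref[:i] into an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions turn an empty reference into hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)


# One dropped word out of six reference words -> WER of 1/6.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```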
## Similar projects

- Whisper WASM - if you want to run Whisper directly in your browser without needing a server, you can use this project. Note that its performance is considerably worse than the native version.