Project URL: https://github.com/BharatKalluri/speechshift
Goal
Bring world-class, high-speed voice-to-text typing to Linux.
There is nothing like Super Whisper for Linux. The idea is very simple: an application that triggers on a keyboard shortcut, converts speech to text, and pastes the text into the window in focus. So instead of typing, you can just talk.
The Why
It is said that the average typing speed is 60-80 wpm, while with voice we get up to 120-140 wpm. That’s a pretty good win. Voice-based typing should have been a lot more straightforward, but unfortunately it’s not. When Whisper came out, apart from automatic subtitle generation, the other thing I was genuinely excited about was speech to text for typing.
Interestingly, this is not a solved problem yet on our phones. But on Macs, it seems to be a solved problem. There is a plethora of choices for STT (speech to text) for typing
are some of them. You’ll find people developing apps all the time for this and selling lifetime subscriptions.
Lifetime subscriptions work, usually, because these apps do the transcribing on the user’s machine.
You’ll also see them list Apple silicon as a minimum system requirement. I think with Apple silicon, it’s not really hard to run Whisper or Parakeet locally.
But my measly ThinkPad running Linux has neither Apple silicon nor a GPU. So I’ll have to make do with slow transcription if I do decide to run everything locally.
Anyway, I thought I’d fix this by creating Speechshift: a CLI tool for Linux that lets you use your voice to type at the speed of talk.
Whisper is not that great for audio transcription
Whisper was released in 2022. It’s been a while, although it does not feel like it. Whisper base and tiny transcribe very poorly, although they are amazingly fast.
To get quality, I think we need to use the medium or the large model. But the medium model is slow to execute and fairly heavy, so transcription speeds are very low. It takes a lot of time to process very little text. Not a great experience. The larger the model, the better the accuracy, but the longer the wait, so it doesn’t work for typing. There are other models by Nvidia which seem to perform better. I’ll need to test those.
Pre-processing for audio
With Whisper, one of the problems is that we need to do a lot of audio pre-processing before the audio file can be passed to the model. The two most important stages are VAD and normalization.
VAD stands for voice activity detection. It tries to figure out where the audio contains voice, and wherever it does not find voice it effectively removes that part. That makes sure Whisper processes less audio and gets a corresponding performance boost. The other one is normalization, where you strip off the very high and very low frequencies (strictly speaking, this is band-pass filtering rather than loudness normalization) so that unnecessary audio data is not passed to Whisper.
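To make the two stages concrete, here is a minimal sketch in plain Python. It is not Speechshift’s actual code: the function names, the 30 ms frame size, the energy threshold, and the 80-3800 Hz band are all assumptions picked for illustration, and the VAD is a naive energy gate rather than a trained model.

```python
import math

def band_limit(samples, sample_rate=16000, low_hz=80.0, high_hz=3800.0):
    """Attenuate content below low_hz and above high_hz.

    Uses a first-order high-pass followed by a first-order low-pass.
    The cutoffs are illustrative assumptions, not tuned values.
    """
    dt = 1.0 / sample_rate
    rc_hp = 1.0 / (2 * math.pi * low_hz)
    rc_lp = 1.0 / (2 * math.pi * high_hz)
    a_hp = rc_hp / (rc_hp + dt)   # high-pass coefficient
    a_lp = dt / (rc_lp + dt)      # low-pass coefficient
    out = []
    hp_y = 0.0    # previous high-pass output
    prev_x = 0.0  # previous input sample
    lp_y = 0.0    # previous low-pass output
    for x in samples:
        hp_y = a_hp * (hp_y + x - prev_x)
        prev_x = x
        lp_y += a_lp * (hp_y - lp_y)
        out.append(lp_y)
    return out

def rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def drop_silence(samples, sample_rate=16000, frame_ms=30, threshold=0.01):
    """Naive energy-based VAD: keep only frames whose RMS reaches threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) >= threshold:
            kept.extend(frame)
    return kept

def preprocess(samples, sample_rate=16000):
    """Filter first, then cut silent frames, before handing audio to the model."""
    return drop_silence(band_limit(samples, sample_rate), sample_rate)
```

Samples are assumed to be floats in the range -1.0 to 1.0 at 16 kHz mono. An energy gate like this will let loud background noise through, which is one reason a model-based VAD is usually the better choice in practice.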
These are the two things I could figure out. I am sure there is more to do in this segment to make sure we can squeeze the most performance out of Whisper, but right now this is what we are doing. Things tend to be slightly better but there is a lot of scope for improvement.
Perception & UX
We seem to have a very low tolerance for how long we allow a process to run. If transcribing, say, 100 words takes 5-odd seconds, it still feels very slow, so anything above three seconds is usually a bad idea for something like this.
Making sure that transcriptions finish within 3 seconds is going to be an absolute challenge. Local models tend to perform very slowly, and that is something we’ll have to live with until better CPUs come to everyone. Even online services tend to take a bit of time to process audio. Since I don’t have a very powerful system, I thought I’d move to an online transcription service, AssemblyAI being the most popular one. But AssemblyAI also seems to take quite a bit of time to process a small amount of audio, and that lag tends to be very irritating.
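To put a number on that budget, here is a small back-of-the-envelope sketch. It assumes transcription only starts once the recording has ended, and it uses 130 wpm as a speaking rate (the midpoint of the 120-140 wpm range mentioned earlier). The function names are made up for this illustration.

```python
def speech_seconds(words, speaking_wpm=130):
    """How long it takes to say `words` words at the given speaking rate."""
    return words / speaking_wpm * 60

def max_real_time_factor(words, budget_s=3.0, speaking_wpm=130):
    """Largest processing_time / audio_duration ratio that still fits the budget."""
    return budget_s / speech_seconds(words, speaking_wpm)

def fits_budget(audio_s, real_time_factor, budget_s=3.0):
    """True if a clip of audio_s seconds is transcribed within budget_s."""
    return audio_s * real_time_factor <= budget_s

if __name__ == "__main__":
    audio = speech_seconds(100)
    print(f"100 words is roughly {audio:.1f} s of audio")
    print(f"real-time factor needed: {max_real_time_factor(100):.3f}")
    # A transcription that takes 5 s for the same clip misses the budget.
    print(fits_budget(audio, 5.0 / audio))
```

Under these assumptions, 100 spoken words is about 46 seconds of audio, so the whole pipeline has to run at roughly 15 times faster than real time to stay under three seconds.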
One more issue is that I am running the entire system as a daemon. On Linux, you trigger a recording by pressing Command+Shift+R and end it by pressing Command+Shift+R again. There is no real way to cancel a recording midway. This is a feature I will need to implement as soon as possible because it is a very normal requirement. But right now, as a consequence of not having a UI, there is no way to cut a recording midway.
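One possible shape for the cancel feature, sketched as a tiny state machine. This is not how Speechshift is implemented today: the `Recorder` class, its method names, and the idea of binding a second shortcut to `on_cancel` are all assumptions for illustration, and the `transcribe` callback stands in for whatever backend does the real work.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    RECORDING = "recording"

class Recorder:
    """Toggle-style recorder with a separate cancel action (hypothetical)."""

    def __init__(self, transcribe):
        self.state = State.IDLE
        self.frames = []
        self.transcribe = transcribe  # callback: list of chunks -> text

    def on_toggle(self):
        """Same shortcut starts and ends a recording."""
        if self.state is State.IDLE:
            self.frames = []
            self.state = State.RECORDING
            return None
        audio, self.frames = self.frames, []
        self.state = State.IDLE
        return self.transcribe(audio)

    def on_audio(self, chunk):
        """Called by the audio backend; chunks are ignored while idle."""
        if self.state is State.RECORDING:
            self.frames.append(chunk)

    def on_cancel(self):
        """Second shortcut: throw the audio away without transcribing."""
        self.frames = []
        self.state = State.IDLE
```

The point of the design is that cancel needs no UI: a second global shortcut that discards the buffer and returns to idle is enough, and pressing it while idle is harmless.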