transcribe.cpp

Apr 2026 - now

I'm super excited to share transcribe.cpp today.

transcribe.cpp is a ggml-based transcription library that supports most of the latest transcription models. Every model published under the handy-computer HF org has been numerically validated and WER-tested to match the reference implementation. Inference is accelerated via Vulkan, Metal, CUDA, and TinyBLAS.

I'm the author and maintainer of Handy. This library grew from the pains of distributing a cross-platform speech-to-text application to many people.

This is a v0.1.0 release, which means there are rough edges I cannot discover alone! Please report them, and let's fix them together!

Motivation

Let me say this. I think distributing a cross-platform application with the current ASR inference stack is terrible.

You've basically got whisper.cpp and ONNX. That's it. You could roll MLX in for Apple devices, but now you have to support two different engines and port models to each. I've been a fan of ONNX for getting model support into Handy quickly, but so much performance is left on the table when it runs CPU-only.

There are a few random libraries out there which claim to support a lot of models, but they have unknown authors, and unknown testing, as far as I've seen. They leave me with more questions than answers.

When will they stop maintaining this library? Has the creator thought about bindings so you can actually use it in a real desktop or mobile app? Is this effectively demo code? Have they benchmarked it? Is it faster than ONNX?

And this is what led to transcribe.cpp. As Handy's maintainer I needed a library I could trust. Where I could download a file and run inference on it. Where I can know that the output of the model in the engine is as good as the reference implementation. The inference should run on the GPU for the best performance. It should be trivially embeddable in Handy; it cannot be a huge PyTorch lib. It must be something that works on Mac, Windows, and Linux. And ggml seemed like by far the best way forward. It has a strong community and a great distribution story.

So what do you get?

You get a fast and accurate inference engine with wide-ranging model support.

  • Support for 16 ASR Families (60+ models) with more coming
  • Acceleration via Vulkan, Metal, CUDA, and TinyBLAS
  • Every model has been numerically verified and WER tested
  • Support for Streaming Transcription
  • Support for Batch Transcription
  • More or less a drop-in whisper.cpp replacement
  • Maintainer-supported bindings in 4 Languages
    • Python
    • JavaScript/TypeScript
    • Rust
    • ObjC/Swift

Wide Model Support

We intend to support as many state-of-the-art transcription models as possible. As of today, we support most of the modern transcription models that are publicly available. A few are still missing, but they will be added soon.

Acceleration Support

One of my top goals was to run any ASR model I wanted on Vulkan. In my opinion this is the floor for any application shipping local inference. For every model we support, there is a corresponding benchmark run from a Ryzen 4750U (CPU + Vulkan) on Fedora as well as on my M4 Max.

Numerically Verified

I also wanted to make sure that inference in transcribe.cpp is accurate and as close to the reference implementation as possible. This largely came from a huge degree of uncertainty about inference accuracy when using .onnx models I found on Hugging Face. To ensure the inference we do is correct, we numerically validate every model against the reference. On top of numerical validation, we run full WER sweeps to make sure that whatever the reference outputs, we output the same thing. That means every model has been run through thousands of utterances and is very close to, or the same as, the reference. The results are published in the transcribe.cpp repo as well as with each model on Hugging Face.
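
To make those two checks concrete, here is a minimal Python sketch of what they measure. This is not the test harness from the transcribe.cpp repo. The function names, the sample values, and the lowercase/whitespace text normalization are all assumptions made for illustration. Numerical validation is shown as the largest element-wise difference between a reference output and the engine's output, and WER as the word-level edit distance divided by the number of reference words.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical values, not real model outputs.
    print(max_abs_diff([1.0, 2.0, 3.0], [1.0, 2.5, 3.0]))
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

In a setup like this, a model would pass when the numerical difference stays under a chosen tolerance and its WER over the test set lands on, or very near, the WER of the reference implementation. The tolerance itself is a judgment call that depends on the model and precision.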

Drop-In whisper.cpp Replacement

transcribe.cpp is more or less a drop-in replacement for whisper.cpp. The main reason for this is that Handy used whisper.cpp, and I needed to ship an update with transcribe.cpp that would replace it. I needed to keep some compatibility with the very popular .bin files which run in whisper.cpp and shipped with Handy. transcribe.cpp can run them. There are some flags and features in whisper.cpp which we do not support yet. But I think for the vast majority of use cases our whisper implementation is solid and can replace whisper.cpp while having about equal performance.

Real Distribution

Language bindings were on my mind from the beginning. While this library is written in C/C++, I needed bindings in Rust. I also knew that distributing local transcription as widely as possible requires, at minimum, decent first-party bindings. I've chosen 4 languages that I think are fairly representative of where people will use the library. I welcome others to contribute bindings directly to the project as well, assuming that they are willing to take on the maintenance burden of doing so.

And of course, at the end of the day, a lot of the decisions were driven by Handy. As a result of Handy being popular, I intend to maintain this library, just as I've done my best to maintain Handy. I intend to be someone who continues to maintain open source projects and contribute to the ecosystem where I can.

This library never would have existed without Handy because I wouldn't have had the problem of trying to support a bunch of different ASR models. I would have never learned all the use cases that people have for ASR. I've done my best to cover the ones that I hear about the most. Certainly, there are cases in the library that are not currently handled. If there are things that I missed, you are free to contribute to the library!

Making Local Speech to Text More Accessible

transcribe.cpp is aimed squarely at making locally run ASR easier. We know that transcription can run extremely accurately on most devices, and there should be no need to send your voice to a cloud service. An RK3566 can run models via transcribe.cpp faster than real time on its anemic CPU. Faster than real time transcription with SOTA models runs in a handful of watts. It's not a hope or a dream, it's a fact.
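
"Faster than real time" is usually expressed as a real-time factor (RTF): the time spent transcribing divided by the duration of the audio. Anything below 1.0 is faster than real time. A tiny Python sketch of that arithmetic follows. The function name and the example numbers are made up for illustration and are not measurements from an RK3566 or any other device.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Time spent transcribing divided by the length of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when transcription finishes in less time than the audio lasts."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


if __name__ == "__main__":
    # Hypothetical run: 12 seconds of compute for a 30 second clip.
    print(real_time_factor(12.0, 30.0))
    print(is_faster_than_real_time(12.0, 30.0))
```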

I think as we look forward to the future, more inference will start happening locally for one reason or another. This brings the distribution story front and center. In order to have more applications running inference locally, we need to make running inference easier. Certainly transcribe.cpp does not solve this on the whole, and there is a long way to go, but I hope it's a small step forward. I've certainly learned a lot.

Gratitude

I am extremely thankful for all the folks who have supported this project.

First and foremost, thank you to Mozilla AI, their BiR program, and Davide from Mozilla AI. This project was largely a problem in my head that I came to them with, and they decided to support me in solving it. At the time transcribe.cpp wasn't even a concrete idea; I was just exploring how to solve accelerated distribution in Handy. So a huge thanks to them for their support and for helping to bring this project into existence.

ggml. This project wouldn't be possible without ggml and all of the contributors to it. Thank you all so much for the work you've done. I think ggml really does amazing work in helping to make distributing local inference applications easy and possible.

Modal has also been a critical help for me. I reached out to them, and they gave me credits. These credits go towards the WER testing and towards ensuring the library works well on CUDA. It is an immense help being able to verify the correctness of the work.

Blacksmith helps to power some of the CI/CD for transcribe.cpp. Again I reached out to them and they immediately responded with credits. Of course CI/CD is critical for making sure everything put out has been tested to at least some degree.

Thanks to Hugging Face, both for being a pillar of the local AI community and for providing the handy-computer org with private storage, so I could upload models at will.

AI Assisted?

Yes, absolutely. I don't think it's possible for a single individual to write an engine of this size from scratch using ggml in a handful of months without outside assistance. Were any of the words here written using AI? Nope. They came from my mouth or my fingers.