Moonshine: Industry-leading edge ASR

October 21, 2024

Moonshine outperforms speech-to-text models from OpenAI, Google, NVIDIA, and Meta on the OpenASR Leaderboard while running 5x faster¹ on edge devices.

I built the data collection and preprocessing pipelines that we used to train Moonshine, delivering over 200K hours of labeled data. We needed a LOT of good data, and we had to move fast. The pipeline I constructed was massively distributed, allowing us to intake hundreds of terabytes of raw audio data, label it, and clean it within the span of several weeks.

Moonshine is optimized for low-latency edge deployment on mobile phones, single-board computers, and embedded devices. Unlike Whisper, which pads every input to a fixed 30-second window, Moonshine uses rotary position embeddings (RoPE) in the encoder, so it can handle variable-length input. This means its compute requirements and inference latency scale with the actual length of the input audio.
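To make the padding point concrete, here is a rough back-of-the-envelope sketch, not a benchmark. It only compares how many seconds of audio a fixed-window model has to process against the real clip length. The function names are made up for illustration, and treating cost as proportional to duration is a simplifying assumption; real latency depends on the architecture and hardware.

```python
import math

WINDOW_S = 30.0  # fixed window length mentioned above


def padded_duration(audio_s: float, window_s: float = WINDOW_S) -> float:
    """Seconds a fixed-window model processes: the clip padded up to whole windows."""
    n_windows = max(1, math.ceil(audio_s / window_s))
    return n_windows * window_s


def useful_fraction(audio_s: float, window_s: float = WINDOW_S) -> float:
    """Share of the padded input that is real audio (assumes cost ~ duration)."""
    return audio_s / padded_duration(audio_s, window_s)


if __name__ == "__main__":
    for clip in (1.0, 3.0, 10.0, 30.0):
        print(f"{clip:5.1f}s clip -> {padded_duration(clip):5.1f}s processed, "
              f"{useful_fraction(clip):.0%} of it real audio")
```

Under that assumption, short utterances such as voice commands are where variable-length input saves the most work, because most of a padded window would be silence.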

You can run Moonshine with our inference library, download the open weights on Hugging Face, or read the whitepaper for more technical detail.

¹ Compared to OpenAI Whisper base.