Speech recognition

OpenAI Whisper (has nice C++ & CoreML ports), ocotillo & SpeechBrain are nice.

Notes

One reason voice assistants don't seem to stick for most people is that they're actually command line interfaces, but even less discoverable because they don't provide any visible feedback at all.
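
Several of the links below (WER are we?, Speech-to-Text Benchmark) compare systems by word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (real benchmarks usually lowercase and strip punctuation first); the function name `word_error_rate` is my own, not taken from any of the linked projects:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

WER can exceed 1.0 when the hypothesis contains many inserted words, which is worth keeping in mind when reading benchmark tables.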

Links

HN: Facebook open-sources a speech-recognition system and a machine learning library (2018)
DeepSpeech - Open source Speech-To-Text engine, using a model trained by machine learning techniques, based on Baidu's Deep Speech research paper. (Examples)
Online speech recognition with wav2letter@anywhere (2020)
wav2letter++ - Fast, open source speech processing toolkit from the Speech team at Facebook AI Research built to facilitate research in end-to-end models for speech recognition.
Kaldi - Speech Recognition Toolkit.
Building an end-to-end Speech Recognition model in PyTorch (HN)
Real-Time Voice Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time.
Kaldi Active Grammar - Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time.
SpecAugment with PyTorch - PyTorch Implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
Dragonfly - Speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software.
Gentle - Robust yet lenient forced-aligner built on Kaldi. A tool for aligning speech with text.
Porcupine - On-device wake word detection powered by deep learning.
Eesen - End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding.
Ask HN: Is there any work being done in speech-to-code with deep learning? (2020)
Silero Models - Pre-trained STT models and benchmarks made embarrassingly simple. (HN)
High-quality pre-trained speech-to-text models now available on Torch Hub (HN)
Wavenet For Speech Denoising - Neural network for end-to-end speech denoising, as described in: "A Wavenet For Speech Denoising".
Vosk - Speech recognition toolkit with state-of-the-art accuracy and low latency in Rust.
Voicegain - Speech-to-text Platform and APIs. Speech Recognition.
LibreASR - On-Premises, Streaming Speech Recognition System. (HN)
WORLD - High-quality speech analysis, manipulation and synthesis system. (Web)
ESPnet - End-to-end speech processing toolkit. (Docs)
Speaker Diarization - Process to answer the question of 'who spoke when?' in an audio file.
SpeechRecognition - Local auto speech recognition project based on Kaldi and ALSA.
Athena - Open-source implementation of sequence-to-sequence based speech processing engine.
PyTorch end-to-end speech recognition
Cheetah - On-device streaming speech-to-text engine powered by deep learning.
WaveRNN - PyTorch implementation of Deepmind's WaveRNN model from Efficient Neural Audio Synthesis.
Conformer - PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition.
A Review of End-to-End Architectures for Speech Recognition (2021)
libfvad - Voice activity detection (VAD) library, based on WebRTC's VAD engine.
ASR with PyTorch - Experimental code for speech recognition using PyTorch and Kaldi.
YSDA Speech Processing Course
Paper List for Speech Translation
Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition (2020) (Code)
Lyra: A New Very Low-Bitrate Codec for Speech Compression (2021)
Parrot.PY - Computer interaction using audio and speech recognition.
SpeechBrain Toolkit - PyTorch-based Speech Toolkit. (Web)
Vosk API - Offline open source speech recognition toolkit. (Rust API)
Lyra - Very Low-Bitrate Codec for Speech Compression.
lasr - PyTorch Lightning implementation of Automatic Speech Recognition.
Speech Recognition from Scratch
Common Voice - Mozilla's initiative to help teach machines how real people speak.
FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement (2021) (Code)
DeepSpeech2 in PyTorch using PyTorch Lightning
Speech and Language Processing Book (2021) - Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (2020 Version)
voice2json - Command-line tools for speech and intent recognition on Linux. (Web)
wav2vec Unsupervised: Speech recognition without supervision (2021)
Online Speech recognition using RNN-Transducer
Openspeech - Open-Source Toolkit for End-to-End Speech Recognition.
Unsupervised Speech Decomposition via Triple Information Bottleneck (2020) (Code)
AudioCLIP: Extending CLIP to Image, Text and Audio (2021) (Code)
Wav2vec: Semi and Unsupervised Speech Recognition (HN)
WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit. (Docs)
Why Hasn’t the iPhone Moment Happened Yet for Voice UIs (2021)
LeBenchmark: a reproducible framework for assessing SSL from speech
INTERSPEECH 2021
WER are we? - Tracking states of the art(s) and recent results on speech recognition.
GigaSpeech - Large, modern dataset for speech recognition.
Coqui STT - Deep learning toolkit for Speech-to-Text, battle-tested in research and production. (Docs) (Rust lib)
Coqui - Startup providing open speech tech for everyone. (GitHub)
Open Speech Corpora - List of accessible speech corpora for ASR, TTS, and other Speech Technologies.
An Overview of Multi-Task Learning in Speech Recognition (2020)
Coqui Inference Engine - Library for efficiently deploying speech models.
PDF to Speech - Deep-learning powered accessibility application which turns PDFs into audio files.
ASV-Subtools - Open Source Tools for Speaker Recognition.
VoiceFixer - General Speech Restoration.
speechmetrics - Wrapper around speech quality metrics MOSNet, BSSEval, STOI, PESQ, SRMR, SISDR.
Silero VAD - Pre-trained enterprise-grade Voice Activity Detector, Language Classifier and Spoken Number Detector.
A New AI Lexicon: Voice (2021) - The Legacies and Limits of Automated Voice Analysis.
Octopus - On-device speech-to-index engine powered by deep learning.
Open Audio Search - Full text search engine with automatic speech recognition for podcasts.
HuBERT: How to Apply BERT to Speech, Visually Explained (2021)
Happy Scribe - Audio Transcription & Video Subtitles.
Speech Recognition Papers
Steerable discovery of neural audio effects (2021) (Code)
audapolis - Editor for spoken-word media with transcription.
Shennong - Python toolbox for speech features extraction.
Paderbox - Collection of utilities for audio / speech processing.
Icefall - Speech recognition recipes using k2. (Docs)
k2 - FSA/FST algorithms, differentiable, with PyTorch compatibility.
ViSQOL (Virtual Speech Quality Objective Listener) - Objective, full-reference metric for perceived audio quality.
Espresso - Fast End-to-End Neural Speech Recognition Toolkit.
UniSpeech - Large Scale Self-Supervised Learning for Speech
NISQA: Speech Quality and Naturalness Assessment
Optimization techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition
Conformer: Convolution-augmented Transformer for Speech Recognition (2020) (Code)
CAT: Crf-based Asr Toolkit - Complete workflow for CRF-based data-efficient end-to-end speech recognition.
Neural HMMs are all you need (for high-quality attention-free TTS) (2022) (Code)
End-to-End Speech Translation Progress - Tracking the progress in end-to-end speech translation.
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020) (Code)
S3PRL - Self-Supervised Speech Pre-training and Representation Learning Toolkit.
pyannote-audio - Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding.
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (2021) (Code)
Speech recognition polyfill - Polyfill for the SpeechRecognition standard on web, using Speechly as the underlying API.
Speech-to-Text Benchmark
Hyperion - Speaker Recognition Toolkit based on PyTorch and numpy.
textlesslib - Library for Textless Spoken Language Processing.
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech (2021) (Code)
HuggingSound - Toolkit for speech-related tasks based on HuggingFace's tools.
hear - macOS speech recognition via the command line.
PaddleSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation (2021) (Code)
Edinburgh Speech Tools
rVADfast - Python library for an unsupervised, fast method for robust voice activity detection.
NeuralSpeech - Research project in Microsoft Research Asia focusing on neural network based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc.
Speech Super-resolution Evaluation and Benchmarking
Real Time Speech Recognition with Gradio (HN)
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques (2021) (Code)
CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (2022) (Code)
Real Time Speech Enhancement in the Waveform Domain (2020) (Code)
Vosk-Browser - Opinionated speech recognition library for the browser using a WebAssembly build of Vosk.
VocalSound: A Dataset for Improving Human Vocal Sounds Recognition
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022) (HN) (Code) (Code)
George Hotz | Programming | speech recognition (2022)
CoquiSTT + Signal = Love (death to voice messages) (2022)
ocotillo - PyTorch-based ML model that does state-of-the-art English speech transcription.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing (2021) (Code)
pyctcdecode - Fast and lightweight python-based CTC beam search decoder for speech recognition.
Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022) (Code)
Squeezeformer - PyTorch implementation of "Squeezeformer: An Efficient Transformer for Automatic Speech Recognition".
Masked Autoencoders that Listen (2022) (Code)
SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech (2022) (Code)
Speech Enhancement and Dereverberation with Diffusion-based Generative Models
SSAST: Self-Supervised Audio Spectrogram Transformer
OpenAI Whisper - General-purpose speech recognition. Approaches human level robustness and accuracy on English speech recognition. (Web) (HN) (Notes) (Paper + Code walkthroughs) (Demo) (Demo Code) (Failure case)
Whisper ASR Web service
React-Speech-Recognition - Speech recognition for your React app.
Stage-Whisper - Easy to use AI transcriber, powered by OpenAI's Whisper.
Whisper.cpp - High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model. (HN)
Real-time speech recognition using next-gen Kaldi with ncnn
Gecko - Tool for Effective Annotation of Human Conversations. (Web)
OpenAI Whisper - CPU - Improving transcription performance of OpenAI Whisper for CPU based deployment.
Whispering - Streaming transcriber with whisper.
Buzz - Transcribe and translate audio offline on your personal computer. Powered by OpenAI's Whisper.
FastWhisper - Optimized implementation of OpenAI's Whisper for multilingual transcription.
I record myself on audio 24x7 and use an AI to process the information (2022) (HN)
Transcribe File - Free transcription service powered by Whisper AI. (HN)
SpeechRecognition - Speech recognition module for Python, supporting several engines and APIs, online and offline.
TransFusion: Transcribing Speech with Multinomial Diffusion
Convmelspec: Melspectrograms for On-Device Audio Machine Learning (2022)
Lightning Echo - Production-ready audio and video transcription app that can run on your laptop or in the cloud.
Best way to transcribe audio snippets (2022)
Whisper's transcription plus Pyannote's Diarization
WhisperX - Timestamp-Accurate Automatic Speech Recognition.
Speech-to-text with Whisper: How I Use It & Why (2022)
SPTK - Speech Signal Processing Toolkit.
Speechbox - Speech processing tools, such as punctuation restoration.
Zero-shot Punctuation Insertion using Whisper
Zero-shot Audio Classification using Whisper
Whisper CLI, built with Rust
Whisper-rs - Rust bindings to whisper.cpp.
OpenAI's Whisper ported to CoreML
Real Time Whisper Transcription
Multilingual Automatic Speech Recognition with Word-level Timestamps
Quick Caption - Transcribe and generate caption files (SRT and FCPXML) without manually entering time codes.
ArchiSound - Collection of pre-trained audio models, in PyTorch.
MacWhisper - Quickly and easily transcribe audio files into text with OpenAI's state-of-the-art transcription technology Whisper. (HN)
Whisper.cpp example running fully in the browser (HN)
Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis (2023) (Code)
CLAP (Contrastive Language-Audio Pretraining) - Neural network model that learns acoustic concepts from natural language supervision.
Speaker Diarization Using OpenAI Whisper
WaaS - Self-host Whisper As a Service with GUI and queueing. (HN)
Faster Whisper transcription with CTranslate2
FunASR: Fundamental End-to-End Speech Recognition Toolkit
Streamlit UI for OpenAI's Whisper
Whisperer - Go from raw audio files to a speaker separated text-audio datasets automatically.
Whisper Node - Node.js bindings for OpenAI's Whisper.
Stabilizing Timestamps for Whisper
TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion (2023)
Universal Speech Model (2023) (HN)
Ermine.ai - Local Audio Transcription. (Code) (Reddit) (HN)
Whisper Playground - Build real time speech2text web apps using OpenAI's Whisper.
fan_transcribe - Fan out audio transcription tasks via OpenAI Whisper and Modal Labs.
Whisper JAX - Optimized JAX code for OpenAI's Whisper Model.
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers (2023) (Code) (Web)
Bark Web UI - Web UI for the Bark Text-to-Speech.
generate-subtitles - Generate transcripts for audio and video content with a user friendly UI, powered by OpenAI's Whisper.
Awesome Whisper
Willow - Open-source privacy-focused voice assistant hardware. (HN)
Listen, Think, and Understand (2023) (Code)
Must-read paper and tutorial list for speech separation based on neural networks
Pengi - Audio Language Model for Audio Tasks.
Ecoute - Live transcription tool that provides real-time transcripts.
Whisper Web - ML-powered speech recognition directly in your browser! Built with 🤗 Transformers.js.
Diffusion-based Generative Speech Source Separation (2023) (Code)
EasyMMS - Simple Python package to easily use Meta's Massively Multilingual Speech (MMS) project.
SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks (2023) (Code)
Speech Prompts Adapters
AudioPaLM: A Large Language Model That Can Speak and Listen (2023) (HN)
OpenAI Whisper API - OpenAI Whisper API based on Node.js / Bun.sh in a Docker Container + Google Cloud Run Example.
bigWav.app - Free audio transcription | Private audio transcription & annotation.
Coalesce - Audio editor which makes slicing dialogue as easy as editing text.
Whisper Burn - Rust implementation of OpenAI's Whisper model using the burn framework.
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation (2023) (Code)
Subs AI - Subtitles generation tool (Web-UI + CLI + Python package) powered by OpenAI's Whisper and its variants.
Whisper.api - Open-source, self-hosted speech-to-text with fast transcription. (HN)
SeamlessM4T, a Multimodal AI Model for Speech and Text Translation (2023) (HN)
SwiftWhisper - Easiest way to transcribe audio in Swift.
whisper_streaming - Whisper real time streaming for long speech-to-text transcription and translation.
SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
RepCodec: A Speech Representation Codec for Speech Tokenization
