Speech recognition
OpenAI Whisper (which has good C++ and CoreML ports), ocotillo, and SpeechBrain are nice.
Notes
Links
- HN: Facebook open-sources a speech-recognition system and a machine learning library (2018)
- DeepSpeech - Open source Speech-To-Text engine, using a model trained by machine learning techniques, based on Baidu's Deep Speech research paper. (Examples)
- Online speech recognition with wav2letter@anywhere (2020)
- wav2letter++ - Fast, open source speech processing toolkit from the Speech team at Facebook AI Research built to facilitate research in end-to-end models for speech recognition.
- Kaldi - Speech Recognition Toolkit.
- Building an end-to-end Speech Recognition model in PyTorch (HN)
- Real-Time Voice Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time.
- Kaldi Active Grammar - Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time.
- SpecAugment with PyTorch - PyTorch implementation of Google Brain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
- Dragonfly - Speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software.
- Gentle - Robust yet lenient forced-aligner built on Kaldi. A tool for aligning speech with text.
- Porcupine - On-device wake word detection powered by deep learning.
- Eesen - End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding.
- Ask HN: Is there any work being done in speech-to-code with deep learning? (2020)
- Silero Models - Pre-trained STT models and benchmarks made embarrassingly simple. (HN)
- High-quality pre-trained speech-to-text models now available on Torch Hub (HN)
- Wavenet For Speech Denoising - Neural network for end-to-end speech denoising, as described in: "A Wavenet For Speech Denoising".
- Vosk - Speech recognition toolkit with state-of-the-art accuracy and low latency in Rust.
- Voicegain - Speech-to-text platform and APIs.
- LibreASR - On-Premises, Streaming Speech Recognition System. (HN)
- WORLD - High-quality speech analysis, manipulation and synthesis system. (Web)
- ESPnet - End-to-end speech processing toolkit. (Docs)
- Speaker Diarization - Process to answer the question of 'who spoke when?' in an audio file.
- SpeechRecognition - Local auto speech recognition project based on Kaldi and ALSA.
- Athena - Open-source implementation of sequence-to-sequence based speech processing engine.
- PyTorch end-to-end speech recognition
- Cheetah - On-device streaming speech-to-text engine powered by deep learning.
- WaveRNN - PyTorch implementation of DeepMind's WaveRNN model from Efficient Neural Audio Synthesis.
- Conformer - PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition.
- A Review of End-to-End Architectures for Speech Recognition (2021)
- libfvad - Voice activity detection (VAD) library, based on WebRTC's VAD engine.
- ASR with PyTorch - Experimental code for speech recognition using PyTorch and Kaldi.
- YSDA Speech Processing Course
- Paper List for Speech Translation
- Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition (2020) (Code)
- Lyra: A New Very Low-Bitrate Codec for Speech Compression (2021)
- Parrot.PY - Computer interaction using audio and speech recognition.
- SpeechBrain Toolkit - PyTorch-based Speech Toolkit. (Web)
- Vosk API - Offline open source speech recognition toolkit. (Rust API)
- Lyra - Very Low-Bitrate Codec for Speech Compression.
- lasr - PyTorch Lightning implementation of Automatic Speech Recognition.
- Speech Recognition from Scratch
- Common Voice - Mozilla's initiative to help teach machines how real people speak.
- FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement (2021) (Code)
- DeepSpeech2 in PyTorch using PyTorch Lightning
- Speech and Language Processing Book (2021) - Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (2020 Version)
- voice2json - Command-line tools for speech and intent recognition on Linux. (Web)
- wav2vec Unsupervised: Speech recognition without supervision (2021)
- Online Speech recognition using RNN-Transducer
- Openspeech - Open-Source Toolkit for End-to-End Speech Recognition.
- Unsupervised Speech Decomposition via Triple Information Bottleneck (2020) (Code)
- AudioCLIP: Extending CLIP to Image, Text and Audio (2021) (Code)
- Wav2vec: Semi and Unsupervised Speech Recognition (HN)
- WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit. (Docs)
- Why Hasn’t the iPhone Moment Happened Yet for Voice UIs (2021)
- LeBenchmark: a reproducible framework for assessing SSL from speech
- INTERSPEECH 2021
- WER are we? - Tracking states of the art(s) and recent results on speech recognition.
- GigaSpeech - Large, modern dataset for speech recognition.
- Coqui STT - Deep learning toolkit for Speech-to-Text, battle-tested in research and production. (Docs) (Rust lib)
- Coqui - Startup providing open speech tech for everyone. (GitHub)
- Open Speech Corpora - List of accessible speech corpora for ASR, TTS, and other Speech Technologies.
- An Overview of Multi-Task Learning in Speech Recognition (2020)
- Coqui Inference Engine - Library for efficiently deploying speech models.
- PDF to Speech - Deep-learning powered accessibility application which turns PDFs into audio files.
- ASV-Subtools - Open Source Tools for Speaker Recognition.
- VoiceFixer - General Speech Restoration.
- speechmetrics - Wrapper around speech quality metrics MOSNet, BSSEval, STOI, PESQ, SRMR, SISDR.
- Silero VAD - Pre-trained enterprise-grade Voice Activity Detector, Language Classifier and Spoken Number Detector.
- A New AI Lexicon: Voice (2021) - The Legacies and Limits of Automated Voice Analysis.
- Octopus - On-device speech-to-index engine powered by deep learning.
- Open Audio Search - Full text search engine with automatic speech recognition for podcasts.
- HuBERT: How to Apply BERT to Speech, Visually Explained (2021)
- Happy Scribe - Audio Transcription & Video Subtitles.
- Speech Recognition Papers
- Steerable discovery of neural audio effects (2021) (Code)
- audapolis - Editor for spoken-word media with transcription.
- Shennong - Python toolbox for speech features extraction.
- Paderbox - Collection of utilities for audio / speech processing.
- Icefall - Speech recognition recipes using k2. (Docs)
- k2 - FSA/FST algorithms, differentiable, with PyTorch compatibility.
- ViSQOL (Virtual Speech Quality Objective Listener) - Objective, full-reference metric for perceived audio quality.
- Espresso - Fast End-to-End Neural Speech Recognition Toolkit.
- UniSpeech - Large Scale Self-Supervised Learning for Speech.
- NISQA: Speech Quality and Naturalness Assessment
- Optimization techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition
- Conformer: Convolution-augmented Transformer for Speech Recognition (2020) (Code)
- CAT: CRF-based ASR Toolkit - Complete workflow for CRF-based data-efficient end-to-end speech recognition.
- Neural HMMs are all you need (for high-quality attention-free TTS) (2022) (Code)
- End-to-End Speech Translation Progress - Tracking the progress in end-to-end speech translation.
- EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020) (Code)
- S3PRL - Self-Supervised Speech Pre-training and Representation Learning Toolkit.
- pyannote-audio - Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding.
- DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (2021) (Code)
- Speech recognition polyfill - Polyfill for the SpeechRecognition standard on web, using Speechly as the underlying API.
- Speech-to-Text Benchmark
- Hyperion - Speaker Recognition Toolkit based on PyTorch and numpy.
- textlesslib - Library for Textless Spoken Language Processing.
- FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech (2021) (Code)
- HuggingSound - Toolkit for speech-related tasks based on HuggingFace's tools.
- hear - macOS speech recognition via the command line.
- PaddleSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.
- BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation (2021) (Code)
- Edinburgh Speech Tools
- rVADfast - Python library for an unsupervised, fast method for robust voice activity detection.
- NeuralSpeech - Research project at Microsoft Research Asia focusing on neural-network-based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc.
- Speech Super-resolution Evaluation and Benchmarking
- Real Time Speech Recognition with Gradio (HN)
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques (2021) (Code)
- CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus
- Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (2022) (Code)
- Real Time Speech Enhancement in the Waveform Domain (2020) (Code)
- Vosk-Browser - Opinionated speech recognition library for the browser using a WebAssembly build of Vosk.
- VocalSound: A Dataset for Improving Human Vocal Sounds Recognition
- PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022) (HN) (Code) (Code)
- George Hotz | Programming | speech recognition (2022)
- CoquiSTT + Signal = Love (death to voice messages) (2022)
- ocotillo - PyTorch-based ML model that does state-of-the-art English speech transcription.
- SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing (2021) (Code)
- pyctcdecode - Fast and lightweight Python-based CTC beam search decoder for speech recognition.
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022) (Code)
- Squeezeformer - PyTorch implementation of "Squeezeformer: An Efficient Transformer for Automatic Speech Recognition".
- Masked Autoencoders that Listen (2022) (Code)
- SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech (2022) (Code)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models
- SSAST: Self-Supervised Audio Spectrogram Transformer
- OpenAI Whisper - General-purpose speech recognition. Approaches human-level robustness and accuracy on English speech recognition. (Web) (HN) (Notes) (Paper + Code walkthroughs) (Demo) (Demo Code) (Failure case)
- Whisper ASR Web service
- React-Speech-Recognition - Speech recognition for your React app.
- Stage-Whisper - Easy-to-use AI transcriber, powered by OpenAI's Whisper.
- Whisper.cpp - High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model. (HN)
- Real-time speech recognition using next-gen Kaldi with ncnn
- Gecko - Tool for Effective Annotation of Human Conversations. (Web)
- OpenAI Whisper - CPU - Improving transcription performance of OpenAI Whisper for CPU-based deployment.
- Whispering - Streaming transcriber with whisper.
- Buzz - Transcribe and translate audio offline on your personal computer. Powered by OpenAI's Whisper.
- FastWhisper - Optimized implementation of OpenAI's Whisper for multilingual transcription.
- I record myself on audio 24x7 and use an AI to process the information (2022) (HN)
- Transcribe File - Free transcription service powered by Whisper AI. (HN)
- SpeechRecognition - Speech recognition module for Python, supporting several engines and APIs, online and offline.
- TransFusion: Transcribing Speech with Multinomial Diffusion
- Convmelspec: Melspectrograms for On-Device Audio Machine Learning (2022)
- Lightning Echo - Production-ready audio and video transcription app that can run on your laptop or in the cloud.
- Best way to transcribe audio snippets (2022)
- Whisper's transcription plus Pyannote's Diarization
- WhisperX - Timestamp-Accurate Automatic Speech Recognition.
- Speech-to-text with Whisper: How I Use It & Why (2022)
- SPTK - Speech Signal Processing Toolkit.
- Speechbox - Speech processing tools, such as punctuation restoration.
- Zero-shot Punctuation Insertion using Whisper
- Zero-shot Audio Classification using Whisper
- Whisper CLI, built with Rust
- Whisper-rs - Rust bindings to whisper.cpp.
- OpenAI's Whisper ported to CoreML
- Real Time Whisper Transcription
- Multilingual Automatic Speech Recognition with Word-level Timestamps
- Quick Caption - Transcribe and generate caption files (SRT and FCPXML) without manually entering time codes.
- ArchiSound - Collection of pre-trained audio models, in PyTorch.
- MacWhisper - Quickly and easily transcribe audio files into text with OpenAI's state-of-the-art transcription technology Whisper. (HN)
- Whisper.cpp example running fully in the browser (HN)
- Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis (2023) (Code)
- CLAP (Contrastive Language-Audio Pretraining) - Neural network model that learns acoustic concepts from natural language supervision.
- Speaker Diarization Using OpenAI Whisper
- WaaS - Self-host Whisper As a Service with GUI and queueing. (HN)
- Faster Whisper transcription with CTranslate2
- FunASR: Fundamental End-to-End Speech Recognition Toolkit
- Streamlit UI for OpenAI's Whisper
- Whisperer - Go from raw audio files to speaker-separated text-audio datasets automatically.
- Whisper Node - Node.js bindings for OpenAI's Whisper.
- Stabilizing Timestamps for Whisper
- TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion (2023)
- Universal Speech Model (2023) (HN)
- Ermine.ai - Local Audio Transcription. (Code) (Reddit) (HN)
- Whisper Playground - Build real time speech2text web apps using OpenAI's Whisper.
- fan_transcribe - Fan out audio transcription tasks via OpenAI Whisper and Modal Labs.
- Whisper JAX - Optimized JAX code for OpenAI's Whisper Model.
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers (2023) (Code) (Web)
- Bark Web UI - Web UI for the Bark Text-to-Speech.
- generate-subtitles - Generate transcripts for audio and video content with a user-friendly UI, powered by OpenAI's Whisper.
- Awesome Whisper
- Willow - Open-source privacy-focused voice assistant hardware. (HN)
- Listen, Think, and Understand (2023) (Code)
- Must-read paper and tutorial list for speech separation based on neural networks
- Pengi - Audio Language Model for Audio Tasks.
- Ecoute - Live transcription tool that provides real-time transcripts.
- Whisper Web - ML-powered speech recognition directly in your browser! Built with 🤗 Transformers.js.
- Diffusion-based Generative Speech Source Separation (2023) (Code)
- EasyMMS - Simple Python package to easily use Meta's Massively Multilingual Speech (MMS) project.
- SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks (2023) (Code)
- Speech Prompts Adapters
- AudioPaLM: A Large Language Model That Can Speak and Listen (2023) (HN)
- OpenAI Whisper API - OpenAI Whisper API based on Node.js / Bun.sh in a Docker Container + Google Cloud Run Example.
- bigWav.app - Free, private audio transcription and annotation.
- Coalesce - Audio editor which makes slicing dialogue as easy as editing text.
- Whisper Burn - Rust implementation of OpenAI's Whisper model using the burn framework.
- MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation (2023) (Code)
- Subs AI - Subtitles generation tool (Web-UI + CLI + Python package) powered by OpenAI's Whisper and its variants.
- Whisper.api - Open-source, self-hosted speech-to-text with fast transcription. (HN)
- SeamlessM4T, a Multimodal AI Model for Speech and Text Translation (2023) (HN)
- SwiftWhisper - Easiest way to transcribe audio in Swift.
- whisper_streaming - Whisper real time streaming for long speech-to-text transcription and translation.
- SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
- RepCodec: A Speech Representation Codec for Speech Tokenization