Distributed systems
Encore & Service Weaver seem nice.
Notes
- Getting a million users is far harder than scaling a system to handle a million users. Most systems could run comfortably on a Raspberry Pi.
- Fault-tolerant designs treat failures as routine. In large-scale systems, the working assumption is that every component will fail sooner or later, so any individual failure must be presumed imminent and failures across the system must be expected continuously.
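
One common way to treat failures as routine is to wrap every remote call in a retry loop with capped exponential backoff and jitter. The sketch below is a minimal illustration of that idea, not taken from any of the linked projects. The name `call_with_retries` and all of its parameters are hypothetical, chosen only for this example.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying on failure with capped exponential backoff.

    `operation` is any zero-argument callable (hypothetical stand-in for a
    remote call). `sleep` and `rng` are injectable so the loop can be tested
    without real waiting or real randomness.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:  # assumption: every exception is retryable
            last_error = error
            if attempt == max_attempts - 1:
                break  # out of attempts, give up
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": wait a random fraction of the ceiling so that
            # many clients retrying at once do not synchronize.
            sleep(rng() * ceiling)
    raise last_error
```

A caller would pass the remote call as a closure, for example `call_with_retries(lambda: fetch(url))`, where `fetch` is whatever client function the system already has. In a real service the `except` clause would be narrowed to errors that are actually safe to retry, and retries would usually be combined with timeouts and idempotent operations.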
Links
- Setting up containers, load balancing, and service discovery on light hardware
- Ask HN: Any recommended resources to develop system thinking? (2018)
- Distributed Systems in One Lesson by Tim Berglund (2017)
- Traefik - Modern HTTP reverse proxy and load balancer that makes deploying microservices easy. (Hello World with Traefik) (Awesome) (Helm Chart)
- Traefik Training course resources (Web)
- Kit - Standard library for microservices written in Go. (kit-auth)
- Fear and Loathing in Lock-Free Programming (2017)
- Reliable Systems Series: Model-Based Testing (2018)
- Awesome Distributed Systems
- Awesome Distributed Systems 2
- Kong - Cloud-Native API Gateway & Service Mesh.
- Disque - Distributed message broker.
- Mesh - Tool for building distributed applications.
- Raft - Raft distributed consensus algorithm implemented in Rust.
- hraftd - Hashicorp's Raft implementation.
- In Search of an Understandable Consensus Algorithm (HN)
- libp2p specification - Technical specifications for the libp2p networking stack.
- Class materials for a distributed systems lecture series
- Raft Consensus Algorithm (Code)
- Qri - Global dataset version control system (GDVCS) built on the distributed web.
- Project Oak - Meaningful control of data in distributed systems.
- mudb - Collection of modules for building realtime client-server networked applications.
- Verdi - Framework for formally verifying distributed systems implementations in Coq.
- PingCAP Talent Plan - Series of training courses about writing distributed systems in Go and Rust.
- Protocol Labs - Build protocols, systems, and tools to improve the internet.
- Dark Crystal - Open source R&D affinity. Exploring the potential of new and existing technologies in crypto-space to encourage horizontal group collaboration.
- Protozoa - Web developers, facilitators, crypto-engineers. Experts in Node.js & distributed systems.
- Akka - Build highly concurrent, distributed, and resilient message-driven applications on the JVM. (Web) (Reddit) (Reddit)
- Distributed Components - Provides reusable infrastructure for formally verifying distributed systems using the Coq proof assistant.
- Practical Networked Applications in Rust, Part 1: Non-Networked Key-Value Store (HN)
- LF - Fully Decentralized Fully Replicated Key/Value Store.
- Awesome Consensus - Curated selection of artisanal consensus algorithms and hand-crafted distributed lock services.
- Rezolus - Tool for collecting detailed systems performance telemetry and exposing burst patterns through high-resolution telemetry.
- Cadence - Distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
- Pilosa - Open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
- Finagle - Fault tolerant, protocol-agnostic RPC system. (Scaling out a Rails app with Finagle) (Twitter) (Tweet)
- How To Build A Modern Distributed Compute Platform (2018)
- Chaos Monkey - Resiliency tool that helps applications tolerate random instance failures.
- Faust - Python Stream Processing.
- "Consistency without consensus in production systems" by Peter Bourgon (2014)
- Distributed consensus reading list
- Titanoboa - Community version of a fully distributed, highly scalable and fault tolerant workflow orchestration platform for the JVM.
- Buoyant - Helps you deploy and run Linkerd, the fully open source, ultralight service mesh.
- Grappa - Runtime system for scaling irregular applications on commodity clusters.
- MIT Distributed Systems course (2020) (Videos) (Notes) (HN) (Discord) (Code)
- Correctness proofs of distributed systems with Isabelle/HOL (2019)
- Apache Mesos - Cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks.
- Gleam - Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.
- Learning Distributed Systems - Cloud Native Podcast
- etcd - Distributed reliable key-value store for the most critical data of a distributed system.
- etcdadm - Command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to, or remove a member from an existing cluster.
- Learning to build distributed systems (2019) (Lobsters)
- SwarmKit - Toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
- How to get started with infrastructure and distributed systems (2016)
- Advanced Napkin Math: Estimating System Performance from First Principles (2019) (Code)
- Golimit - Uber ringpop based distributed and decentralized rate limiter.
- System Design lectures (2020)
- Awesome Scalability - Patterns of Scalable, Reliable, and Performant Large-Scale Systems.
- LeetCode System Design Questions
- Grokking the System Design Interview (Code)
- Amazon Builders' Library - How Amazon builds and operates software.
- Distributed Systems Wiki (Code)
- Jepsen - Distributed Systems Safety Research.
- ION - Distributed RTC system written in pure Go and Flutter.
- Challenges with distributed systems (HN)
- Systems design for Advanced Beginners (2020)
- Performance Under Load (2018)
- Veneur - Distributed, fault-tolerant pipeline for runtime data.
- Going multi-region
- List of distributed systems reading lists
- Complexities of Capacity Management for Distributed Services (2020)
- Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol (2020)
- WormSpace: A Modular Foundation for Simple, Verifiable Distributed Systems
- Paxos vs Raft: Have we reached consensus on distributed consensus? (2020) (HN)
- Debugging Distributed Systems (HN)
- Distributed systems for fun and profit
- Temporal - Open source microservices orchestration engine for running mission critical code at any scale. (Code) (Docs) (Why I joined Temporal) (Go SDK) (Talk)
- Temporalite - Distribution of Temporal that runs as a single process with zero runtime dependencies.
- Stateright - Model checker for implementing distributed systems. (HN)
- Arvind Krishnamurthy's research
- Distributed Services with Go
- Fully asynchronous C implementation of the Raft consensus protocol
- Notes on Distributed Systems for Young Bloods (2013) (HN)
- Paxakos - Rust implementation of a distributed consensus algorithm based on Leslie Lamport's Paxos.
- Riemann - Network event stream processing system, in Clojure.
- Collection of papers, conference talks, articles, blog posts, interesting Twitter threads, and HN/Reddit comments on systems engineering
- Tess Rinearson - All Together Now: An Introduction to Distributed Consensus (2019)
- Slurm - Open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. (Code) (Docs) (Set up Slurm across Multiple Machines)
- Submitit - Lightweight tool for submitting Python functions for computation within a Slurm cluster.
- CAP FAQ
- Readings in Distributed Systems
- Control theory for fun and profit (2020) (HN)
- Understanding Replication in Databases and Distributed Systems (2018)
- A plain English introduction to CAP theorem
- Debugging Incidents in Google's Distributed Systems (2020) (HN)
- Odin - Programmable, observable and distributed job orchestration system which allows for the scheduling, management and unattended background execution of user-created tasks on Linux-based systems. (HN)
- Verifying Strong Eventual Consistency in Distributed Systems (2017)
- Patterns of Distributed Systems (2020) (HN)
- Keeping CALM: When Distributed Consistency Is Easy (2020)
- Distributed Systems Notes
- Avoiding fallback in distributed systems
- The Reactive Principles - Design Principles for Distributed Applications.
- Paxi - Framework that implements WPaxos and other Paxos protocol variants.
- Rafting Trip - Learn about network programming, concurrency, distributed systems, and more as you tackle the challenge of implementing the Raft distributed consensus algorithm.
- Resources for learning distributed systems (2020)
- Workload isolation using shuffle-sharding (2020)
- Consensus is Harder Than It Looks (2020)
- The Little Strangler (Lobsters)
- A Review of Consensus Protocols (2020) (HN)
- Disel: Distributed Separation Logic - Separation-style logic for compositional verification of distributed systems.
- raft-zero - Implementation of the Raft consensus algorithm on top of the act-zero actor framework.
- raft-playground - Application to simulate and test a Raft cluster, using raft-zero.
- Building Netflix’s Distributed Tracing Infrastructure (2020)
- Wikipedia's self-hosted CDN (2020)
- Infinite Parallel Universes: State at the Edge (2020) (Summary)
- Awesome Chaos Engineering
- How you could have come up with Paxos yourself (2020) (HN)
- Grafana Tempo - Open source, easy-to-use and high-scale distributed tracing backend. (Web) (Announcement) (HN)
- Principles of chaos engineering (Code) (HN)
- Chaos Experimentation, an open-source framework built on top of Envoy Proxy (2021)
- Testing Distributed Systems - Curated list of resources on testing distributed systems. (Code) (HN)
- Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories (2020) (Summary)
- Notes on Paxos (2020) (HN)
- This is why distributed systems are useful (and I am building one) (2020)
- Distributed Systems lecture series by Martin Kleppmann (2020) (Lecture Notes)
- Dkron - Distributed, fault tolerant job scheduling system for cloud native environments. (Web)
- Braft - Industrial-grade C++ implementation of the RAFT consensus algorithm.
- Distributed Systems course (2020) (Notes)
- MirBFT Library - Consensus library implementing the Mir consensus protocol.
- Fairness in multi-tenant systems (2020)
- Advanced Distributed Systems Design course
- Raft implementation in Go
- Load Shedding Strategies - Demonstration of load shedding and how it can make your services more resilient in outages and come back online quicker.
- A Byzantine failure in the real world (2020)
- Byzantine Eventual Consistency
- Interval Tree Clocks (2020)
- Distributed Systems Reading List (HN)
- Raft Visualization (HN) (Code)
- Meld - Decentralized shared state.
- Understanding Connections & Pools (2021) (HN)
- Fission Whitepaper (Code)
- Awesome distributed transactions
- Rystsov's Blog on distributed systems
- Compartmentalized Paxos - Scaling Replicated State Machines with Compartmentalization. (Tweet)
- DistSys Reading Group
- CASPaxos: Replicated State Machines without logs (2018) (Code)
- Consensus: Bridging Theory and Practice - PhD dissertation on the Raft consensus algorithm.
- The Fundamental Mechanism of Scaling (2021)
- Ray - Simple, universal API for building distributed applications. Accelerating machine learning workloads. (Code) (Docs)
- Jepsen - Framework for distributed systems verification, with fault injection. Clojure library.
- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (2019)
- Distributed Systems in Rust - Training course about the distributed systems in Rust.
- rsraft - Raft implementation in Rust.
- Implementing Raft's Leader Election in Rust (2021)
- Effective Fallbacks (2020)
- Ask HN: Recommended books and papers on distributed systems? (2021)
- Raft implementation in Rust
- Porcupine - Fast linearizability checker for testing the correctness of distributed systems.
- Testing Distributed Systems for Linearizability (2017)
- Namazu - Programmable Fuzzy Scheduler for Testing Distributed Systems.
- Engineering Dependability and Fault Tolerance in a Distributed System (2021)
- Autopilot: workload autoscaling at Google (2020)
- Byztime - Byzantine-fault-tolerant protocol for synchronizing time among a group of peers, without reliance on any external time authority.
- Foundational Distributed Systems Papers (2021) (HN)
- Making reliable distributed systems in presence of software errors by Joe Armstrong (2003)
- unitalk - Distributed chat system which can be used as chat rooms or state synchronization.
- Maelstrom - Workbench for learning distributed systems by writing your own.
- An introduction to lockless algorithms (2021) (HN) (HN)
- Clio - Functional, distributed programming language that compiles to JavaScript. (Code)
- Distributed Systems Course (HN)
- Sundial: Fault-tolerant Clock Synchronization for Data Centers (2021)
- Achieving reliable dual writes in distributed systems (2021)
- Paxos Made Simple (2016)
- Fiber - Distributed Computing for AI Made Simple. (Web)
- Raft Implementation & CLI Visualization in Rust
- Ask HN: Learning Distributed Systems as a Junior Engineer (2021)
- The Distributed Reading List
- Launchpad - Library that simplifies writing distributed programs by seamlessly launching them on a variety of different platforms.
- The Problem of Distributed Consensus (2021)
- A robust distributed locking algorithm based on Google Cloud Storage (2021)
- Sealer - Build, share, and run your distributed applications.
- Scalability - Guides, Articles, Podcasts, Videos and Notes to Build Reliable Large-Scale Distributed Systems.
- Building a Raft (2021)
- Time, clocks, and order (2020) - A look at the notion of time in a distributed system and its effects on ordering.
- The Generals (2020) - A look at the Two Generals' and Byzantine Generals' problems, two popular consensus problems.
- Impossibility of Distributed Consensus with One Faulty Process (2020)
- The CAP Theorem (2020)
- Metastability and Distributed Systems (2021)
- Distributed Systems Course (2021) (Tweet)
- Metastable Failures in Distributed Systems (2021)
- Distributed Systems Engineering Course Notes (2015)
- Emitter - High performance, distributed and low latency publish-subscribe platform. (Web)
- Patterns of Distributed Systems: Lamport Clock (2021)
- Make your cluster SWIM (2020)
- Systemizer - Tool for designing complex distributed systems, allowing you to simulate data flow with customizable components. (Web)
- Patterns of Distributed Systems: Follower Reads (2021)
- Getting To Know Logical Clocks By Implementing Them (2021)
- Paxos vs Raft: Have we reached consensus on distributed consensus? (2021) (HN)
- Consistency and Consensus – How Do Paxos and Raft Work? (2021)
- Summer Blog Backlog: Distributed Systems (2021)
- Fanouts and Percentiles (2020)
- Distributed Tracing — we’ve been doing it wrong (2019)
- How To Design A Reliable Distributed Timer (2021)
- raft-engine - WAL-is-data engine used to store multi-raft logs.
- Three Clocks are Better than One
- RAMP up your distributed transactions (2021)
- Errors found in distributed protocols
- Python for Distributed Systems (2021)
- FastPay - High-Performance Byzantine Fault Tolerant Settlement.
- Distributed consensus made simple (for real this time!) (2021)
- Hints and Principles for Computer System Design (2021) (HN)
- Guide To Prepare for the Gremlin Certified Chaos Engineering Practitioner Exam
- Balsam - High throughput workflows and automation for HPC.
- Hypercore - Secure, distributed append-only log.
- Hypercore Next - Append only log with multi-writer primitives built in.
- "Waterpark: Distributed Actors vs the Pandemic" by Bryan Hunter (2021) - Building reliable, actor-based systems.
- P language - Modular and Safe Programming for Distributed Systems. (Docs) (Tweet)
- Raft Consensus Protocol (HN)
- Paper review: Scaling Large Production Clusters with Partitioned Synchronization (2021)
- MadSim - Magical Deterministic Simulator for distributed systems in Rust.
- Deep dive into Yrs architecture (2021)
- fantoch - Framework for evaluating (planet-scale) consensus protocols.
- MultiPaxos made Simple (2021)
- Paxos made Abstract (2021)
- Unbase - Distributed database/application framework that is fundamentally reactive, fault tolerant, and decentralized.
- Beating the CAP Theorem Checklist
- Paper review: Paxos vs Raft
- Shardz (2021) (HN)
- microcosm - Prototype of distributed task scheduler.
- Canary - Distributed systems library for making communications through the network easier, while keeping minimalism and flexibility. (Code)
- Components Contrib - Community driven, reusable components for distributed apps in Go.
- Paxos explained
- Consistency Models Explained (2021)
- Fault - Modeling language for building system dynamic models and checking them using a combination of first order logic and probability.
- Events, Event Sourcing, and the Path Forward (2022)
- How to make distributed system available (2022)
- Best resources to learn about data and distributed systems (2022)
- Lock-Free Locks Revisited (2022) (Lobsters) (HN)
- ljepsen - Framework for distributed systems verification, with fault injection.
- NATS.io - Cloud Native, Open Source, High-performance Messaging. (Code) (NATS 2.0 and Connectivity)
- RustDDS - Rust implementation of Data Distribution Service.
- Evolving clock sync for distributed databases (2022) (HN)
- Ask HN: Do you find working on large distributed systems exhausting? (2022)
- Life Beyond Distributed Transactions / Space-efficient Static Trees and Graphs (Video Overview)
- Delicate - Lightweight and distributed task scheduling platform written in Rust.
- Practical Byzantine Fault Tolerance (PBFT) algorithm in Go
- chaosd - Chaos Engineering toolkit.
- dcache - CoreDNS Plugin: Asynchronous Distributed Cache for Distributed System.
- Consensus that unifies Paxos, Raft, 2PC, etc.
- Your computer is a distributed system (HN)
- Consul at Fly.io (2022) (Lobsters) (HN)
- MatrixCube - Fundamental Building Block for Elastic Storage With Strong Consistency and Reliability.
- A Brief History of High Availability (2021) (HN)
- Artillery - Fire-forged cluster management & Distributed data protocol.
- Principles of Distributed Computing (lecture collection)
- Distributed Systems Shibboleths (2022) (HN)
- minicache - Distributed cache with client-side consistent hashing, distributed leader-elections, and dynamic node discovery. Supports both HTTP/gRPC interfaces secured with mTLS.
- Sprinkle - Run jobs on distributed machines easily.
- Fallacies of distributed systems (2022) (HN)
- Distributed systems for fun and profit (Code)
- Bistro - Fast, flexible toolkit for scheduling and running distributed tasks.
- A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers (2019)
- Federation vs. Clustering: Self-determination vs. distributed computing? (2022)
- Ask HN: Why are distributed systems so polarizing? (2022)
- Surviving Continuous Deployment in Distributed Systems (2021)
- Raft Consensus Animated (HN)
- Notes on Theory of Distributed Systems (HN)
- Distributed Algorithms 2020
- Shadow Simulator - Discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems.
- Aneris - Program logic for developing and verifying distributed systems.
- How Tencent Maintains Apache Pulsar Clusters with over 100 Billion Messages Daily (2022)
- CID (Content IDentifier) Specification - Self-describing content-addressed identifiers for distributed systems.
- Aurae - Simplified distributed systems runtime for application teams. Written in Rust. (Docs)
- Raft Is So Fetch: The Raft Consensus Algorithm Explained Through Mean Girls (2022) (HN)
- HyperQueue - Scheduler for sub-node tasks for HPC systems with batch scheduling.
- Building Distributed Systems With Stateright
- Papers on Distributed Systems
- "Workflows, a new abstraction for distributed systems" by Dominik Tornow (2022)
- Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (2022)
- Paxos, Raft, EPaxos: How Has Distributed Consensus Technology Evolved? (2021)
- The Distributed Computing Manifesto (2022)
- Planet-Scale Leaderless Consensus (2022)
- Moving Away from UUIDs (2018) (HN)
- How Paxos and Two-Phase Commit Differ (2021)
- Raft - Raft library for maintaining a replicated state machine.
- Distributed Transactional Systems Cannot Be Fast (2019) (Tweet)
- Paper: VR Revisited - View changes - Questions (2022)
- Transactions in distributed systems (2022)
- Ask HN: What are best patterns for events in distributed transactions? (2022)
- Thinking in Distributed Systems Book
- Datacake - Tooling for creating your own distributed systems. (Reddit)
- Past Conferences | USENIX (Tweet)
- Systems Distributed Event
- Myths and Legends in High-Performance Computing (2023)
- HotShot - BFT consensus protocol based off of HotStuff, with the addition of proof-of-stake and VRF committee elections.
- Armstrong distributed systems
- What is consensus
- Raft algorithm proof-of-concept application in Go
- Gossip Glomers: Fly.io Distributed Systems Challenges (HN)
- Eventually Consistent (2008)
- Service Weaver - Programming framework for writing, deploying, and managing distributed applications. (Docs) (Intro) (Tweet) (Reddit)
- CASM - Universal middleware for decentralized computing.
- Wetware - Alternative to Kubernetes, Mesos and OpenShift that turns any group of networked computers -- including cloud-based instances -- into a programmable IaaS/PaaS cluster.
- Canadensis - Open technology for real-time intravehicular distributed computing and communication based on modern networking standards.
- Husky: Exactly-Once Ingestion and Multi-Tenancy at Scale (2023)
- NOLA - Distributed virtual actor system that is heavily inspired by Cloudflare Durable Objects and other virtual actor systems like Orleans.
- Finding and fixing eventual consistency with Stripe events (2023)
- Ordering Events in Distributed Systems (2022) (HN)
- Fly.io distributed systems challenges solved in Rust
- The end of a myth: Distributed transactions can scale (2023) (HN)
- Load Balancing (2023) (HN) (Lobsters)
- Basic Raft implementation in Go
- From scratch implementation of Raft consensus algorithm in Go
- Awesome Load Management
- Implementing a distributed key-value store on top of implementing Raft in Go (2023) (HN)
- Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks (2022)
- Breaking Changes in Distributed Systems (2023)
- ctlstore - Distributed data store that provides very low latency, always-available, "infinitely" scalable reads.