
Data Processing

Ibis, Orchest & Mage are nice. Greptime is an interesting OSS DB for time series data processing.

Polars is a nice DataFrame library.

Nuclio & CloudQuery are interesting too.

Notes

Storing tabular data as a plain JSON array of objects, one object per row, seems nice: [{row1}, {row2}, {row3}].
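A minimal sketch of the idea in Python, using only the standard library. The sample rows and the helper names `columns` and `to_table` are made up for illustration; the point is that each row is a self-describing object, so rows may omit fields and the column set can be recovered by scanning the keys.

```python
import json

# Hypothetical sample data: one JSON object per row.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 28, "city": "Helsinki"},
]


def columns(rows):
    """Collect column names in first-seen order across all rows."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)


def to_table(rows):
    """Return (header, body); fields missing from a row become None."""
    cols = columns(rows)
    return cols, [[row.get(col) for col in cols] for row in rows]


# Round-trip through JSON text, as if the data had been stored in a file.
text = json.dumps(rows)
restored = json.loads(text)
header, body = to_table(restored)
```

The trade-off is that keys are repeated in every row, so the format is verbose compared with CSV or a columnar format like Parquet, but it needs no schema and any JSON tool can read it.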

Links

Bigslice - System for fast, large-scale, serverless data processing using Go.
Reflow - Language and runtime for distributed, incremental data processing in the cloud.
Self-managing serverless computing with Bigmachine (2019)
Bigslice: a cluster computing system for Go (2019)
When your data doesn’t fit in memory: the basic techniques (2019) (HN)
Differential Dataflow - Implementation of differential dataflow using timely dataflow on Rust. (Book) (HN)
The Log: What every software engineer should know about real-time data's unifying abstraction (2013)
Luna - Data processing and visualization environment built on a principle that people need an immediate connection to what they are building.
Guide To The Data Lake — Modern Batch Data Warehousing (2020)
Plumbing At Scale (2020) - Event Sourcing and Stream Processing Pipelines at Grab.
Differential Dataflow! But at what COST? (2017) (HN)
Timely Dataflow and Total Order (2020)
Nuclio - High-Performance Serverless event and data processing platform.
Apache Spark - Unified analytics engine for large-scale data processing. (PySpark) (PySpark Style Guide) (Article) (Web) (Spark Learning Guide)
Spark: The Definitive Guide Book (2018) (Code)
Batch - Event replay platform. Version control for data passing through your messaging systems. (HN)
A log/event processing pipeline you can't have (2019) (HN)
mm-ADT - Multi-Model Abstract Data Type. Distributed virtual machine capable of integrating a diverse collection of data processing technologies. (Code)
Data Preprocessing in Machine Learning (2020)
lakeFS - Open source layer that delivers resilience and manageability to object-storage based data lakes. (Web)
Baker - High performance, composable and extendable data-processing pipeline for the big data era.
Cylon - Fast, scalable distributed memory data parallel library for processing structured data. (Web)
cuGraph - GPU Graph Analytics.
Opaque - Secure Apache Spark SQL.
Apache Beam - Unified programming model for Batch and Streaming. (Web)
Stitch - Simple, extensible ETL built for data teams.
Databricks - Unified Data Analytics. (GitHub) (CLI) (Reflecting on Four Years at Databricks (2021))
AugMix - Simple Data Processing Method to Improve Robustness and Uncertainty.
Snapflow - Framework for building end-to-end functional data pipelines from modular components.
Workflow Description Language (WDL) - Way to specify data processing workflows with a human-readable and writeable syntax.
Cloudfuse - Open source serverless data solutions. Future of data pipelines. (GitHub)
Create your own data stream for Kafka with Python and Faker (2021)
Hindsight - C based data processing infrastructure based on the lua sandbox project.
Reverse ETL — A Primer (2021)
I wrote one of the fastest DataFrame libraries (2021)
Build your own “data lake” for reporting purposes in a multi-services environment (2021)
Feature Stores: The Data Side of ML Pipelines (2021)
Flowgger - Fast, simple and lightweight data collector written in Rust.
Popsink - Real-time data platform you don't have to build.
Flyte - Structured programming and distributed processing platform that enables highly concurrent, scalable and maintainable workflows for Machine Learning and Data Processing. (Web) (GitHub) (Python SDK) (CLI)
Winterfell - Distributed STARK prover.
Python to Distributed Python to Airflow task in ~5 lines of code
DataFusion - Extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
Delta Lake - Reliable Data Lakes at Scale. (GitHub)
Delta Sharing - Open Protocol for Secure Data Sharing. (Article) (Tweet)
Dataform - Manage data pipelines in BigQuery.
Legate Pandas - Aspiring Drop-In Replacement for Pandas at Scale.
datablocks - Flow based data processing editor. (HN)
Reproducible data processing pipelines (2021)
datasketch - Probabilistic data structures that can process and search very large amounts of data super fast, with little loss of accuracy.
Tuplex - Parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. (Web)
file.d - Blazing fast tool for building data pipelines: read, process and output events.
Datafuse - Modern Real-Time Data Processing in Rust. (Code) (HN)
MapReduce is making a comeback (2021) (HN)
SciPipe - Robust, flexible and resource-efficient pipelines using Go and the command line. (Docs)
The Future Is Big Graphs: A Community View on Graph Processing Systems (2021) (HN)
What Is the Data Lakehouse Pattern? (HN)
Apache Hadoop - Open-source software for reliable, scalable, distributed computing. (Is Hadoop Dead?) (Code)
go-stash - High performance, free and open source server-side data processing pipeline that ingests data from Kafka, processes it, and then sends it to ElasticSearch.
pypely - Make your data processing easy - build pipelines in a functional manner.
An opinionated map of incremental and streaming systems (2021)
Crossjoin - Joins together your data from anywhere.
Ceramic Network - Decentralized, open source platform for creating, hosting, and sharing streams of data. (TS Code) (GitHub) (Doc)
Graphite-Web - Highly scalable real-time graphing system. (Docs)
vega - Faster implementation of Apache Spark from scratch in Rust.
Memgraph - Build modern, graph-based applications on top of your streaming data in minutes. (Web)
Apache Parquet - Columnar storage format that supports nested data. (Code)
Data Pipelines Pocket Reference Book (2021) (Code)
miniwdl - Workflow Description Language developer tools & local runner.
Rain - Framework for large distributed pipelines.
Apache SeaTunnel - Distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time). (Code)
Databend - Open Source Serverless Data Warehouse for Everyone. (Web) (GitHub) (CLI)
Pydra - Simple dataflow engine with scalable semantics.
Bytewax - Open source Python framework for building highly scalable dataflows.
Atomic Data - Modular specification for sharing, modifying and modeling graph data. (Code) (Rust Code)
Apache Arrow Flight SQL: Accelerating Database Access (2022) (HN)
Grist - Modern relational spreadsheet. Open core alternative to Airtable and Google Sheets. (HN)
Data Engineering Practice Problems
Dagster: Rebundling the Data Platform (2022)
cq - Clojure Command-line Data Processor for JSON, YAML, EDN, XML and more.
utt - Universal text transformer.
Loggie - Lightweight, high-performance, cloud-native agent and aggregator based on Go.
ter - CLI to run text expressions and perform basic text operations such as filtering, ignoring and replacing on the command line.
csv-diff - Python CLI tool and library for diffing CSV and JSON files.
pqrs - Command line tool for inspecting Parquet files.
Kestra - Infinitely scalable open source orchestration & scheduling platform. (Code) (HN)
TiFlash - Analytical engine for TiDB.
Streamify - Data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more.
DTL - Language and JavaScript lib to transform and manipulate data. (HN)
Hawk - Haskell text processor for the command-line.
Alternatives to pandas library
Zed - Tooling for super-structured data: a new and easier way to manipulate data. (Web)
Fast Analysis with DuckDB + PyArrow (2022) - Trying out some new speedy tools for data analysis.
Why isn’t there a decent file format for tabular data? (2022) (HN)
Data Engineering Wiki (Code)
csv-clean - Command line tool to clean up malformed CSV files.
rq - Tool for doing record analysis and transformation.
Data Integration Guide: Techniques, Technologies, and Tools (2022)
Mito - Edit a spreadsheet, generate Python. (HN) (HN) (Code)
Tornado - Complex Event Processor that receives reports of events from data sources such as monitoring, email, and telegram, and matches them against pre-configured rules.
Meet Dash-AB — The Statistics Engine of Experimentation at DoorDash (2022)
dataPipe - Data processing and data analytics library for JavaScript.
gosquito - Pluggable tool for data gathering, data processing and data transmitting to various destinations.
DLT - Enables simple python-native data pipelining for data professionals.
PipeRider - Toolkit for detecting data issues across pipelines that works with CI systems for continuous data quality assessment.
airflint - Enforce Best Practices for all your Airflow DAGs.
Scaling our Spreadsheet Engine from Thousands to Billions of Cells (2022) (HN) (Lobsters)
qv - Simple CLI to quickly view your data. Powered by DataFusion.
Airflow's Problem (2022) (HN)
Quokka - Fast data processing engine whose core consists of ~1000 lines of Python code.
Modal - On-demand compute that just works. (Twitter)
Building open source downscaling pipelines for the cloud (2022)
Dabbling with Dagster vs. Airflow (2022) (HN)
Mage - Data pipelines for data scientists. (Code)
Merriam-Webster and Unstructured Data Processing (2022)
Best tools to analyze CSV with 100,000 rows in it (2022)
Akvorado - Flow collector, hydrater and visualizer.
Columnq - Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheet. (HN)
Grai - Data lineage made simple. Grai makes it easy to understand how your data relates together across databases, warehouses, APIs and dashboards. (Web)
Kuma-san's toolbox for data analysis
Byzer - Low-code, open source, distributed programming language for data pipelines, analytics and AI in a cloud-native way. (Web)
Data Analysis at the Command Line (2022)
Data Pipeline in Rust - Data pipeline example written in Rust with Polars and DataFusion DataFrame package.
Memphis - Real-Time Data Processing Platform. (Web)
dedup - Command-line tool for deduplicating entries in a file or stream.
parquet-tools - Easy install parquet-tools.
BitSail - ByteDance's open source data integration engine which is based on distributed architecture and provides high performance.
thisthat - Data format conversion utility.
Boring Data Tool (bdt) - Command-line tool for viewing, querying, and converting between various file formats. Powered by DataFusion.
Dabbling with Dagster (2022)
Report: Databricks vs Snowflake (2022)
DataFusion-tui - Terminal based, extensible, interactive data analysis tool using SQL.
Querying Parquet with Millisecond Latency (2022)
Using Commandline To Process CSV files (2022)
Apache Arrow Cookbooks
parquet2json - Command-line tool for streaming Parquet as line-delimited JSON.
What I Want from DataFusion in 2023
xvc - Fast and robust MLOps tool for managing data and pipelines.
Recap - Dead simple data catalog for engineers.
Modern Polars: a comparison of the Polars and Pandas dataframe libraries (HN) (Code)
Text Processing in Linux: Understanding Grep, sed, and AWK (2023)
Demystifying Apache Arrow (2020) (HN)
Awk: Power and Promise of a 40 yr old language (2021) (HN)
Replacing Pandas with Polars (2023) (HN)
dbt-prql - Allows writing PRQL in dbt models.
Using Rust to write a Data Pipeline (2023) (Code)
Pandas Illustrated: Visual Guide to Pandas (2023) (HN)
Arrow CLI Tools - Collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet.
unstructured - Open-Source Pre-Processing Tools for Unstructured Data.
pipelime - Swiss army knife for data processing.
Explore Data with Data Painter (HN)
Pandas 2.0 and the Arrow revolution (2023) (HN)
Demystifying the Parquet File Format (2022)
How to Get Started with Dbt (2023) (HN)
Miller - Like Awk, sed, cut, join, and sort for CSV, TSV, and tabular JSON. (HN)
Fascination of AWK (HN)
Kaskada - Modern, open-source event-processing.
Sidekick - Open-source ETL framework to sync data from SaaS tools to vector stores. (HN)
Parquet: more than just "Turbo CSV" (2023)
Pandas 2.0 (HN)
Polars for initial data analysis, Polars for production (2023)
dbt-osmosis - Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools.
Database Stream Processor (DBSP) - Framework for computing over data streams that aims to be more expressive and performant than existing streaming engines.
Malloy: An Experimental Language for Data (2023)
I decided not to commercialize nbdev (2023) (HN)
stlite - Serverless Streamlit.
Daft: A High-Performance Distributed Dataframe Library for Multimodal Data (HN)
Use pygwalker to build visual analysis app in streamlit (HN)
Datadex - Collaborate on Open Data using Open Source Tools.
Data analysis with SQLite and Python (2023)
VulcanSQL - Create and share Data APIs fast! Data API framework for DuckDB, Snowflake, BigQuery, PostgreSQL. (Web) (HN)
Quilt - Data mesh for connecting people with actionable data. (Web)
attranslate - Command line tool for translating JSON, YAML, CSV, ARB and XML files.
Snakemake - Framework for reproducible data analysis. (HN)
Retake - Open Source Infrastructure for Vector Data Streams. (HN)
Bluesky - Python package for experiment specification and orchestration.
Polars: Company Formation Announcement (2023) (HN)
RuleGo - Lightweight, high-performance, embedded rule engine based on Go language.
TableFlow - Open source CSV importer. (Web)
Zed: Leveraging Data Types to Process Eclectic Data (2023)
Pandata - Scalable open-source analysis stack.
Most companies do not need Snowflake or Databricks (2023) (HN)
Polars CLI - CLI for running SQL queries with Polars as the backend.
ETL Helper - Python ETL library to simplify data transfer into and out of databases.
Our journey at F5 with Apache Arrow (2023) (Lobsters)
