Data Processing
Ibis, Orchest & Mage are nice. Greptime is an interesting OSS DB for time series data processing.
Polars is a nice DataFrame library.
Nuclio & CloudQuery are interesting too.
Notes
Links
- Bigslice - System for fast, large-scale, serverless data processing using Go.
- Reflow - Language and runtime for distributed, incremental data processing in the cloud.
- Self-managing serverless computing with Bigmachine (2019)
- Bigslice: a cluster computing system for Go (2019)
- When your data doesn’t fit in memory: the basic techniques (2019) (HN)
- Differential Dataflow - Implementation of differential dataflow using timely dataflow in Rust. (Book) (HN)
- The Log: What every software engineer should know about real-time data's unifying abstraction (2013)
- Luna - Data processing and visualization environment built on a principle that people need an immediate connection to what they are building.
- Guide To The Data Lake — Modern Batch Data Warehousing (2020)
- Plumbing At Scale (2020) - Event Sourcing and Stream Processing Pipelines at Grab.
- Differential Dataflow! But at what COST? (2017) (HN)
- Timely Dataflow and Total Order (2020)
- Nuclio - High-Performance Serverless event and data processing platform.
- Apache Spark - Unified analytics engine for large-scale data processing. (PySpark) (PySpark Style Guide) (Article) (Web) (Spark Learning Guide)
- Spark: The Definitive Guide Book (2018) (Code)
- Batch - Event replay platform. Version control for data passing through your messaging systems. (HN)
- A log/event processing pipeline you can't have (2019) (HN)
- mm-ADT - Multi-Model Abstract Data Type. Distributed virtual machine capable of integrating a diverse collection of data processing technologies. (Code)
- Data Preprocessing in Machine Learning (2020)
- lakeFS - Open source layer that delivers resilience and manageability to object-storage based data lakes. (Web)
- Baker - High performance, composable and extendable data-processing pipeline for the big data era.
- Cylon - Fast, scalable distributed memory data parallel library for processing structured data. (Web)
- cuGraph - GPU Graph Analytics.
- Opaque - Secure Apache Spark SQL.
- Apache Beam - Unified programming model for Batch and Streaming. (Web)
- Stitch - Simple, extensible ETL built for data teams.
- Databricks - Unified Data Analytics. (GitHub) (CLI) (Reflecting on Four Years at Databricks (2021))
- AugMix - Simple Data Processing Method to Improve Robustness and Uncertainty.
- Snapflow - Framework for building end-to-end functional data pipelines from modular components.
- Workflow Description Language (WDL) - Way to specify data processing workflows with a human-readable and writeable syntax.
- Cloudfuse - Open source serverless data solutions. Future of data pipelines. (GitHub)
- Create your own data stream for Kafka with Python and Faker (2021)
- Hindsight - C-based data processing infrastructure built on the Lua sandbox project.
- Reverse ETL — A Primer (2021)
- I wrote one of the fastest DataFrame libraries (2021)
- Build your own “data lake” for reporting purposes in a multi-services environment (2021)
- Feature Stores: The Data Side of ML Pipelines (2021)
- Flowgger - Fast, simple and lightweight data collector written in Rust.
- Popsink - Real-time data platform you don't have to build.
- Flyte - Structured programming and distributed processing platform that enables highly concurrent, scalable and maintainable workflows for Machine Learning and Data Processing. (Web) (GitHub) (Python SDK) (CLI)
- Winterfell - Distributed STARK prover.
- Python to Distributed Python to Airflow task in ~5 lines of code
- DataFusion - Extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
- Delta Lake - Reliable Data Lakes at Scale. (GitHub)
- Delta Sharing - Open Protocol for Secure Data Sharing. (Article) (Tweet)
- Dataform - Manage data pipelines in BigQuery.
- Legate Pandas - Aspiring Drop-In Replacement for Pandas at Scale.
- datablocks - Flow based data processing editor. (HN)
- Reproducible data processing pipelines (2021)
- datasketch - Probabilistic data structures that can process and search very large amounts of data super fast, with little loss of accuracy.
- Tuplex - Parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. (Web)
- file.d - Blazing fast tool for building data pipelines: read, process and output events.
- Datafuse - Modern Real-Time Data Processing in Rust. (Code) (HN)
- MapReduce is making a comeback (2021) (HN)
- SciPipe - Robust, flexible and resource-efficient pipelines using Go and the command line. (Docs)
- The Future Is Big Graphs: A Community View on Graph Processing Systems (2021) (HN)
- What Is the Data Lakehouse Pattern? (HN)
- Apache Hadoop - Open-source software for reliable, scalable, distributed computing. (Is Hadoop Dead?) (Code)
- go-stash - High performance, free and open source server-side data processing pipeline that ingests data from Kafka, processes it, and then sends it to Elasticsearch.
- pypely - Make your data processing easy - build pipelines in a functional manner.
- An opinionated map of incremental and streaming systems (2021)
- Crossjoin - Joins together your data from anywhere.
- Ceramic Network - Decentralized, open source platform for creating, hosting, and sharing streams of data. (TS Code) (GitHub) (Doc)
- Graphite-Web - Highly scalable real-time graphing system. (Docs)
- vega - Faster implementation of Apache Spark from scratch in Rust.
- Memgraph - Build modern, graph-based applications on top of your streaming data in minutes. (Web)
- Apache Parquet - Columnar storage format that supports nested data. (Code)
- Data Pipelines Pocket Reference Book (2021) (Code)
- miniwdl - Workflow Description Language developer tools & local runner.
- Rain - Framework for large distributed pipelines.
- Apache SeaTunnel - Distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time). (Code)
- Databend - Open Source Serverless Data Warehouse for Everyone. (Web) (GitHub) (CLI)
- Pydra - Simple dataflow engine with scalable semantics.
- Bytewax - Open source Python framework for building highly scalable dataflows.
- Atomic Data - Modular specification for sharing, modifying and modeling graph data. (Code) (Rust Code)
- Apache Arrow Flight SQL: Accelerating Database Access (2022) (HN)
- Grist - Modern relational spreadsheet. Open core alternative to Airtable and Google Sheets. (HN)
- Data Engineering Practice Problems
- Dagster: Rebundling the Data Platform (2022)
- cq - Clojure Command-line Data Processor for JSON, YAML, EDN, XML and more.
- utt - Universal text transformer.
- Loggie - Lightweight, high-performance, cloud-native agent and aggregator based on Go.
- ter - CLI to run text expressions and perform basic text operations such as filtering, ignoring and replacing on the command line.
- csv-diff - Python CLI tool and library for diffing CSV and JSON files.
- pqrs - Command line tool for inspecting Parquet files.
- Kestra - Infinitely scalable open source orchestration & scheduling platform. (Code) (HN)
- TiFlash - Analytical engine for TiDB.
- Streamify - Data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more.
- DTL - Language and JavaScript lib to transform and manipulate data. (HN)
- Hawk - Haskell text processor for the command-line.
- Alternatives to pandas library
- Zed - Tooling for super-structured data: a new and easier way to manipulate data. (Web)
- Fast Analysis with DuckDB + PyArrow (2022) - Trying out some new speedy tools for data analysis.
- Why isn’t there a decent file format for tabular data? (2022) (HN)
- Data Engineering Wiki (Code)
- csv-clean - Command line tool to clean up malformed CSV files.
- rq - Tool for doing record analysis and transformation.
- Data Integration Guide: Techniques, Technologies, and Tools (2022)
- Mito - Edit a spreadsheet, generate Python. (HN) (HN) (Code)
- Tornado - Complex Event Processor that receives reports of events from data sources such as monitoring, email, and telegram, and matches them against pre-configured rules.
- Meet Dash-AB — The Statistics Engine of Experimentation at DoorDash (2022)
- dataPipe - Data processing and data analytics library for JavaScript.
- gosquito - Pluggable tool for data gathering, data processing and data transmitting to various destinations.
- DLT - Enables simple Python-native data pipelining for data professionals.
- PipeRider - Toolkit for detecting data issues across pipelines that works with CI systems for continuous data quality assessment.
- airflint - Enforce Best Practices for all your Airflow DAGs.
- Scaling our Spreadsheet Engine from Thousands to Billions of Cells (2022) (HN) (Lobsters)
- qv - Simple CLI to quickly view your data. Powered by DataFusion.
- Airflow's Problem (2022) (HN)
- Quokka - Fast data processing engine whose core consists of ~1000 lines of Python code.
- Modal - On-demand compute that just works. (Twitter)
- Building open source downscaling pipelines for the cloud (2022)
- Dabbling with Dagster vs. Airflow (2022) (HN)
- Mage - Data pipelines for data scientists. (Code)
- Merriam-Webster and Unstructured Data Processing (2022)
- Best tools to analyze CSV with 100,000 rows in it (2022)
- Akvorado - Flow collector, hydrater and visualizer.
- Columnq - Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheets. (HN)
- Grai - Data lineage made simple. Grai makes it easy to understand how your data is related across databases, warehouses, APIs and dashboards. (Web)
- Kuma-san's toolbox for data analysis
- Byzer - Low-code, open source, distributed programming language for data pipelines, analytics and AI in a cloud-native way. (Web)
- Data Analysis at the Command Line (2022)
- Data Pipeline in Rust - Data pipeline example written in Rust with Polars and DataFusion DataFrame package.
- Memphis - Real-Time Data Processing Platform. (Web)
- dedup - Command-line tool for deduplicating entries in a file or stream.
- parquet-tools - Easy-to-install parquet-tools.
- BitSail - ByteDance's open source data integration engine, which is based on a distributed architecture and provides high performance.
- thisthat - Data format conversion utility.
- Boring Data Tool (bdt) - Command-line tool for viewing, querying, and converting between various file formats. Powered by DataFusion.
- Dabbling with Dagster (2022)
- Report: Databricks vs Snowflake (2022)
- DataFusion-tui - Terminal based, extensible, interactive data analysis tool using SQL.
- Querying Parquet with Millisecond Latency (2022)
- Using Commandline To Process CSV files (2022)
- Apache Arrow Cookbooks
- parquet2json - Command-line tool for streaming Parquet as line-delimited JSON.
- What I Want from DataFusion in 2023
- xvc - Fast and robust MLOps tool for managing data and pipelines.
- Recap - Dead simple data catalog for engineers.
- Modern Polars: a comparison of the Polars and Pandas dataframe libraries (HN) (Code)
- Text Processing in Linux: Understanding Grep, sed, and AWK (2023)
- Demystifying Apache Arrow (2020) (HN)
- Awk: Power and Promise of a 40 yr old language (2021) (HN)
- Replacing Pandas with Polars (2023) (HN)
- dbt-prql - Allows writing PRQL in dbt models.
- Using Rust to write a Data Pipeline (2023) (Code)
- Pandas Illustrated: Visual Guide to Pandas (2023) (HN)
- Arrow CLI Tools - Collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet.
- unstructured - Open-Source Pre-Processing Tools for Unstructured Data.
- pipelime - Swiss army knife for data processing.
- Explore Data with Data Painter (HN)
- Pandas 2.0 and the Arrow revolution (2023) (HN)
- Demystifying the Parquet File Format (2022)
- How to Get Started with Dbt (2023) (HN)
- Miller - Like Awk, sed, cut, join, and sort for CSV, TSV, and tabular JSON. (HN)
- Fascination of AWK (HN)
- Kaskada - Modern, open-source event-processing.
- Sidekick - Open-source ETL framework to sync data from SaaS tools to vector stores. (HN)
- Parquet: more than just "Turbo CSV" (2023)
- Pandas 2.0 (HN)
- Polars for initial data analysis, Polars for production (2023)
- dbt-osmosis - Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools.
- Database Stream Processor (DBSP) - Framework for computing over data streams that aims to be more expressive and performant than existing streaming engines.
- Malloy: An Experimental Language for Data (2023)
- I decided not to commercialize nbdev (2023) (HN)
- stlite - Serverless Streamlit.
- Daft: A High-Performance Distributed Dataframe Library for Multimodal Data (HN)
- Use pygwalker to build visual analysis app in streamlit (HN)
- Datadex - Collaborate on Open Data using Open Source Tools.
- Data analysis with SQLite and Python (2023)
- VulcanSQL - Create and share Data APIs fast! Data API framework for DuckDB, Snowflake, BigQuery, PostgreSQL. (Web) (HN)
- Quilt - Data mesh for connecting people with actionable data. (Web)
- attranslate - Command line tool for translating JSON, YAML, CSV, ARB and XML files.
- Snakemake - Framework for reproducible data analysis. (HN)
- Retake - Open Source Infrastructure for Vector Data Streams. (HN)
- Bluesky - Python package for experiment specification and orchestration.
- Polars: Company Formation Announcement (2023) (HN)
- RuleGo - Lightweight, high-performance, embedded rule engine based on Go language.
- TableFlow - Open source CSV importer. (Web)
- Zed: Leveraging Data Types to Process Eclectic Data (2023)
- Pandata - Scalable open-source analysis stack.
- Most companies do not need Snowflake or Databricks (2023) (HN)
- Polars CLI - CLI for running SQL queries with Polars as the backend.
- ETL Helper - Python ETL library to simplify data transfer into and out of databases.
- Our journey at F5 with Apache Arrow (2023) (Lobsters)