Data Science
Equals, Tably, DataStation, Datasette & Malloy are neat. I want to learn more about the NumPy and scikit-learn libraries.
Python for Data Analysis is a good read.
Notes
- If you can solve a problem with a simple heuristic, do that. Sometimes you don't need machine learning.
- Data is not always useful, no matter how much of it you have. There’s no mathematical tool to tell you if your hypothesis is true; you can only see whether it is consistent with the data, and if the data is sparse or unclear, your conclusions remain uncertain.
- By writing the data generating process first, so that you know the true parameters, you can check whether your model recovers them and gain confidence that it is well constructed.
- Building a data pipeline in 2020 is like building a bridge in the 14th century: You do a lot of work that gets thrown away. Half the job is getting rid of the stuff you don't want. The folks who started it are dead by the time it's done.
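The note about writing the data generating process first can be sketched roughly as follows. This is only an illustration under assumptions: the linear process, the parameter values, and the helper names `simulate` and `fit_line` are all hypothetical and not taken from any of the linked resources. The idea is to simulate data from parameters you chose yourself, fit the model, and see whether the estimates land near the known values.

```python
import random


def simulate(n, slope, intercept, noise_sd, seed=0):
    """Hypothetical data generating process: y = intercept + slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Parameters chosen up front (assumed values for illustration only).
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, -1.0

xs, ys = simulate(n=5000, slope=TRUE_SLOPE, intercept=TRUE_INTERCEPT, noise_sd=1.0)
est_slope, est_intercept = fit_line(xs, ys)

# If the model is well constructed, the estimates should sit close to the truth.
print(f"slope: true={TRUE_SLOPE}, estimated={est_slope:.3f}")
print(f"intercept: true={TRUE_INTERCEPT}, estimated={est_intercept:.3f}")
```

If the estimates drift far from the true values even with plenty of simulated data, the model or the fitting code is the likely suspect rather than the data, which is harder to tell apart when working with real data alone.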
Links
- Some advice for young and aspiring Data Scientists
- A Beginner’s Guide to Data Engineering — Part I
- The Rise of the Data Engineer
- Cookiecutter Data Science - Logical, reasonably standardized, but flexible project structure for doing and sharing data science work. (Code)
- Best way to organize research code? (2018)
- Data Science Cheat Sheet
- Our World in Data
- Free data science books
- Pachyderm - Reproducible Data Science at Scale. (Web) (Pachyderm Hub)
- Data Science Cheat Sheets
- Data Science in Visual Studio Code using Neuron, a new VS Code extension (2018)
- Virgilio - Mentor for Data Science E-Learning.
- Awesome Data Science with Python - Curated list of Python resources for data science.
- Data Science (Cheat Sheets)
- nteract - Interactive computing suite for you.
- 120 Data Science Interview Questions
- How To Become a Data Engineer
- Ingestion Data Mapping Language
- The reference implementation of IDML for the JVM
- Pandas - Powerful Python data analysis toolkit. (Ongoing list of pandas quirks) (Manual) (HN) (Pandas Cookbook) (Modern Pandas) (HN)
- Programming Language Support for Data-intensive Applications meeting (2019)
- Datasette - Open source multi-tool for exploring and publishing data. (Web) (datasette-graphql) (Running Datasette on DigitalOcean App Platform) (Interesting ideas in Datasette) (HN) (Twitter) (HN) (Web Code) (Datasette Ecosystem) (HN)
- Weld - High-performance runtime for data analytics applications.
- Vaex - Out-of-Core DataFrames for Python, visualize and explore big tabular data at a billion rows per second.
- PyParis 2018 - Vaex: Out of Core Dataframes for Python
- Maarten Breddels & Jovan Veljanoski - A new approach to DataFrames and pipelines - PyData London 2019 (GitHub)
- The Data Engineering Cookbook
- Data science without borders - Wes McKinney (2017)
- Curated list of awesome ETL frameworks, libraries, and software
- Ibis - Python data analysis framework for Hadoop and SQL engines.
- Kyso - Data analytics knowledge hub.
- Feather - Fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow.
- ROOT system - Provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way.
- Redash - Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data. (Web)
- Prefect - New workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine.
- Turn Python Scripts into Beautiful ML Tools (2019) (HN)
- Foundations of Data Science (2019) (HN) (HN)
- Gyana - No code desktop data science tool. (Article)
- Monument - High-productivity toolkit for predictions. AutoML for time series on any desktop, laptop or server.
- Numba - NumPy aware dynamic Python compiler using LLVM. (5 minute guide) (Web) (Make Python code 1000x Faster with Numba) (HN)
- What's your typical data pipeline in a small company? (2019)
- dbt - Data build tool. Analytics engineering workflow. (Code) (Docs) (fal - Run python scripts directly from dbt) (dbt-expectations) (External sources in dbt) (Awesome)
- Introducing dbt + Materialize (2021) (HN)
- Apache Airflow - Platform to programmatically author, schedule, and monitor workflows. (Tutorial) (Kedro-Airflow - Makes it easy to deploy Kedro projects to Airflow.) (Airflow 2.0) (HN) (Introduction to Apache Airflow (2021)) (Customizing Airflow: Beyond Boilerplate Settings) (Web) (Telemetry-Airflow) (Lessons Learned From Running Apache Airflow at Scale) (HN)
- The Unbundling of Airflow (2022) (HN)
- Overview of Popular Open Source Big Data Technologies (2018)
- Introducing Apache Arrow Flight: A Framework for Fast Data Transport (2019)
- Understanding Apache Arrow Flight (2019)
- New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas (2019)
- Materials and IPython notebooks for Python for Data Analysis, 2nd Edition book
- Technical Notes On Using Data Science & Artificial Intelligence
- Amazon Data Science Interview (2018)
- Data Analysis and Prediction Algorithms with R (2019)
- Elegant SciPy book
- Data science blogs
- Awesome Data Science
- Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces (2018)
- Apache Zeppelin - Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
- Apache Nifi - Easy to use, powerful, and reliable system to process and distribute data.
- Data Science: Principles and Practice Course materials (2018/19)
- Modern Data Practice and the SQL Tradition (2019) (HN)
- Koalas - Pandas API on Apache Spark.
- Ask HN: What does your BI stack look like? (2019)
- Easy Data Transform - Transform Your Data Without Programming. (HN)
- Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
- Dagster - Data orchestrator for machine learning, analytics, and ETL. (Dagster: The Data Orchestrator) (Lobsters) (Code) (Tweet) (Web Code)
- CuPy - NumPy-like API accelerated with CUDA.
- An Introduction To Data Science On The Linux Command Line (2019) (HN)
- How to analyse 100 GB of data on your laptop with Python (2019)
- Metaflow - Framework for real-life data science. (HN) (Code) (Tools)
- SaturnCloud - Manage Data Science applications so Data Scientists don't have to do DevOps.
- Falcon - Interactive Visual Analysis for Big Data.
- 160+ Data Science Interview Questions
- Data Science Interview Questions (HN)
- Google Cloud DataLab - Interactive tools and developer experiences for Big Data on Google Cloud Platform.
- Great Expectations - Leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. (Code)
- Path to a free self-taught education in Data Science
- Common Workflow Language - Open standard for describing analysis workflows and tools. (HN)
- Apache Kudu - Completes Hadoop's storage layer to enable fast analytics on fast data.
- Time Series Forecasting Best Practices & Examples
- Turing Way - Lightly opinionated guide to reproducible data science. (Code)
- What to do when you didn’t get any medal in a Kaggle competition? (2020)
- Build a Career in Data Science book (2020)
- Data Science Tutorials in Julia (Code)
- Data Science Resources
- Master Data Analysis with Python
- Introduction to Data Science book (2020) (Code)
- VisiData - Terminal spreadsheet multitool for discovering and arranging data. (Code) (HN)
- Sisu - Fastest Diagnostic Platform for Structured Data. (Introducing Sisu)
- Deepnote - Data science notebook for teams. (Docs) (Awesome Deepnote) (HN)
- Towards Data Science blog
- Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS (2020) (HN)
- A graphical analysis of women's tops sold on Goodwill's website (HN)
- 1.1B Taxi Rides Using OmniSciDB and a MacBook Pro (2020) (HN)
- Data Science Interview Resources
- Data Science Meets Devops: MLOps with Jupyter, Git, & Kubernetes (2020)
- dplyr - Grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. (Web)
- dstack.ai - Build data and ML applications using pure Python. (HN) (Code)
- Learn Python for Data Science - Collection of Jupyter Notebooks designed to learn Python for Data Science. (HN)
- Computational Causal Inference at Netflix (2020)
- Practical Data Ethics (HN)
- New Google Data Science project
- Python for Data Analytics course (Reddit)
- Modern Data Engineer Roadmap
- How to share data with a statistician
- Becoming 1% better at data science everyday
- Jigsaw Labs - Learn Data Science part-time.
- Data Science Ontology - Knowledge base about data science.
- Narrator - Data modeling platform built on a single table. (HN)
- Data Engineering Project - Implementation of a data pipeline that consumes the latest news from RSS feeds and makes it available to users via a handy API.
- Hex Technologies - Collaborative data workspace that makes it easy to go from idea to analysis to sharing. Work in SQL and Python notebooks, collaborate live, and publish interactive data apps anyone can use. (Twitter) (Hex 2.0) (Tweet)
- News Aggregator from Scratch in 2 Weeks (2020)
- Awesome Scholarly Data Analysis
- Nemo: Data discovery at Facebook (2020) (Tweet)
- Amundsen by Lyft - Open source data discovery and metadata engine.
- Google BigQuery: Node.js Client
- The Modern Data Science Stack (2020)
- Streamlit Sharing - Platform for deploying, managing, and sharing your apps. (HN)
- Awesome Data Engineering Learning Path (Code) (HN)
- Emerging Architectures for Modern Data Infrastructure (2020) (HN)
- PandasGUI - GUI for analyzing Pandas DataFrames.
- Holistics - Data Modeling & Self-Service BI Platform.
- Numerai data science tournament
- Neptune.ai - Experiment tracking tool for you and your team. (GitHub)
- Neptune Python Client - Integrate your Python scripts with Neptune.
- AimStack - Version Control and Development Environment for AI. (Code) (GitHub)
- Synerise - Powerful ecosystem driven by Artificial Intelligence with real-time data orchestration created to drive business growth.
- Good Data Analysis
- Data Science Learning Resources
- Dataquest - Learn R, Python and SQL for Data Science.
- Awesome OSINT
- Carpentries - Teach foundational coding and data science skills to researchers worldwide.
- Orchest - Web based tool for creating data science pipelines. (Code)
- Data Engineering Book - Accumulated knowledge and experience in the field of Data Engineering.
- Data Science Lifecycle Process - Set of prescriptive steps and best practices to enable data science teams to consistently deliver value.
- Data Science Lifecycle Base Repo - Template repository for data science projects using the Data Science Life Cycle Process.
- 5th International Summer School on Data Science (2020) (Code)
- Scalable Data Science - Course sets in big data using Apache Spark on Databricks, with their mathematical, statistical and computational foundations using SageMath. (Code)
- Datatap - Free user-friendly platform for visual data management. (Code)
- Data Carpentry - Develops and teaches workshops on the fundamental data skills needed to conduct research. (GitHub)
- Data Carpentry Lessons
- Doing Symbolic Math with SymPy (2020) (HN)
- How To Become a Data Engineer (2021) (HN)
- We don't need data scientists, we need data engineers (2021) (HN)
- Airbyte - Open-Source Data Integration Pipelines To Your Warehouses. (Code) (HN)
- Data Together - Exploring Community-Driven Data Stewardship. (GitHub)
- Data Together Research - Research for tackling the general problem of data resilience & interactivity in all its forms.
- Apache Superset - Modern data exploration and visualization platform. (Code)
- Open-Source Data Science Masters (Web)
- Streamlit Data Science and ML Apps in Python
- Storing and retrieving millions of ad impressions per second at Twitter (2021)
- Elements of Data Science - Introduction to data science in Python, for people with no programming experience. (Code)
- Data Science on AWS - AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker. (Code)
- Tips for Shipping Data Products Fast (2021) (HN)
- Data Science Topics Notes
- sq - Command line tool that provides jq-style access to structured data sources such as SQL databases, or document formats like CSV or Excel. (Web)
- Ask HN: What are the best data science bootcamps? (2021)
- Data Science Learning Resources
- Loading SQL data into Pandas without running out of memory (2021)
- What is the best structured ds project you have seen? (2021)
- Data Science Learning Resources (2021)
- ZeroCostDL4Mic - Google Colab based no-cost toolbox to explore Deep-Learning in Microscopy.
- How do you manage Data Science experiments? (2021)
- Querybook - Big Data Querying UI, combining collocated table metadata and a simple notebook interface. (Code)
- Principal Component Analysis explained
- Dataproofer - Proofreader for your data.
- Automated Data Wrangling for Open Data (2021)
- Data science interview questions with answers
- Lightdash - Open Source BI for your whole team. (Code) (HN)
- Wildland - Open data management protocol.
- Data Science Fails
- Building a data team at a mid-stage startup (2021) (HN)
- DataStation - Data IDE for Developers. (Code) (HN)
- Awesome data annotation
- Data Movement in Netflix Studio via Data Mesh (2021)
- Small Summaries for Big Data (2020)
- What even is data mesh (2021)
- Build 12 Data Science Apps with Python and Streamlit - Full Course (2021)
- JetBrains DataSpell - IDE for Data Scientists.
- node-rapids - GPU-accelerated data science and visualization in node.
- R or Python for data analysis? (2021)
- Moving beyond “algorithmic bias is a data problem” (2021) (HN)
- Clustering Algorithms with Python (2020) (HN)
- Data Science Cheatsheet 2.0
- Lessons Learned from two years as a Data Scientist (2021) (HN)
- Free data science resources - Thematic list of high-quality data science resources.
- Slight - Bridging the interaction between data teams and domain experts.
- Computational and Inferential Thinking: The Foundations of Data Science (Code)
- Berkeley library for introductory data science
- Is BI dead? – On dismantling data's ship of Theseus (2021) (HN)
- Kedro Community - Examples of data science projects created with Kedro.
- Data Science for Beginners - A Curriculum
- atoti - Free Python BI analytics platform. (GitHub) (Notebooks)
- Tably - Lightning fast exploration, search and analytics, with an intuitive UI and supercharged formula.
- OpenRefine - Free, open source power tool for working with messy data and improving it. (Code)
- Authenticated Full-Stack Streamlit
- Kaggle Solutions - Comprehensive List of Kaggle Solutions and Ideas. (Code)
- Knowledge Repo - Curated knowledge sharing platform for data scientists and other technical professions.
- Book: The Science of Science — Dashun Wang (2021)
- An introduction to Monzo’s data stack (2021)
- Evidence - Enables analysts to deliver a polished business intelligence system using SQL and markdown. (Code) (HN)
- Future of Data Work: Collaboration and No Limits (2021)
- Rows - Spreadsheet where teams work faster. (Twitter) (HN)
- RDMkit - Online guide containing good data management practices applicable to research projects from the beginning to the end. (Code)
- Select Star - Data discovery made easy.
- Data Engineering Principles - Cogent
- Data Science at the Command Line Book (2021) - Obtain, Scrub, Explore, and Model Data with Unix Power Tools. (Code)
- DrivenData - Data science competitions for social good. (Winners Code)
- Tabby Data - No-fuss data warehouse for startups. (SQL Assistant Demo) (HN)
- Data Engineering Zoomcamp (2022)
- Julia-Python-R / Plots.jl / Data 101 Cheat Sheets
- If I had to start learning Data Science again, how would I do it? (2020)
- Python Data Science Tutorials
- Data Science Stack - Cookiecutter - Cookiecutter template to launch an awesome dockerized Data Science toolstack (incl. Jupyter, Superset, Postgres, Minio, Airflow & API Star).
- Human-first AI - Power Tools for AI Engineers With Deadlines. (Code)
- Data Science from Scratch Book
- Corteza - Open-source Low-Code Platform and Salesforce alternative. (GitHub) (Server Code)
- Learning From Data - Online Course (Code) (Solutions)
- Python for Data Analysis (HN)
- Guess the daily Wordle in one try using the tweet distribution (2022) (HN)
- Kedro-Viz - Visualise your Kedro data pipelines.
- Datasette Desktop for macOS (Code)
- Phases of Netflix’s real-time data infrastructure (2022) (HN)
- Shopify's Data Science and Engineering Foundations (2020) (HN)
- Introduction to K-Means Clustering (HN)
- Modern data analysis stack (2022)
- Stop aggregating away the signal in your data (2022) (Lobsters)
- OpenMetadata - Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right. (Code)
- Rill Data - Fully Managed Apache Druid as a Service. (Rill Developer) (Twitter)
- Glean - Fast data insights for your team.
- Dash Enterprise App Gallery (Code)
- Comprehensive, original, insightful, and otherwise interesting data science blogs (2022)
- datasette-publish-vercel - Datasette plugin for publishing data using Vercel.
- Tad - Desktop application for viewing and analyzing tabular data. (Code)
- Ask HN: Which new skills for a data science career? (2022)
- 5 Minimalist Tips for Data Scientists to reduce frustration while working with Pandas
- D-Tale - Visualizer for pandas data structures.
- data-describe - Pythonic EDA Accelerator for Data Science.
- How Well Can You Kaggle with Just One Hour a Day? (2022) (HN)
- Data Science in Context: Foundations, Challenges, Opportunities - Peter Norvig's New Book.
- LineaPy - Python package for capturing, analyzing, and automating data science workflows. (Web)
- Amundsen - Data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data.
- Equals - Next generation spreadsheet with built in connections to any data warehouse, modern versioning, and collaboration. (Twitter) (Tweet)
- Entity Resolution: The most common data science challenge (HN)
- What is the 'Bible' of Data Science? (2022)
- Aqueduct - Prediction Infrastructure for Data Scientists.
- DagsHub - Open Source Data Science Collaboration.
- Kaggle Past Competitions - Sortable and searchable compilation of solutions to past Kaggle competitions. (Code)
- Hedgehog Lab - Run, compile and execute JavaScript for Scientific Computing and Data Visualization. (Code)
- What is THE Data Science book? (2022)
- Everyday Data Science Interactive Course (HN)
- nbdev+Quarto: A new secret weapon for productivity (2022)
- CS109a: Introduction to Data Science – Resources (HN)
- CNext Instructions - Data-centric workspace for DS and AI.
- In what situations would a Bayesian model work better than frequentist model? (2022)
- Ask HN: Data Scientists, what libraries do you use for timeseries forecasting? (2022)
- Goodbye, Data Science (2022) (HN)
- Ask HN: Upskilling as a Data Engineer (2022)
- Graphic Walker - Open source alternative to Tableau. (Code) (HN)
- Take the tools out of 'Data', but don't take the data out of the tools (2023)
- Kaggle Python docker image
- Data Science Cookie Cutter for Prefect
- Time series resources
- Quadratic - Data Science Spreadsheet. (Code)
- posit::conf(2023) workshops (Code)
- Equals Dashboards - Build auto-updating dashboards in a spreadsheet. (HN)
- BastionLab - Simple framework for privacy-friendly data science collaboration.
- Most data work seems fundamentally worthless (2023) (HN)
- scikit-learn-cheat-sheet
- Xorbits - Scalable Python data science, in an API compatible & lightning fast way.
- DataDM - Your private data assistant.
- streamlit-extras - Python library with useful Streamlit extras.
- Hunch - AI answers with data teams in the loop.
- DataPen - Curation of Free Data Science Resources.
- SciDataFlow - Facilitating the Flow of Data in Science.