Data Engineering

B-trees and database indexes — PlanetScale
B-trees are used by many modern DBMSs. Learn how they work, how databases use them, and how your choice of primary key can affect index performance.
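Not from the article, but a minimal sketch of the idea: SQLite stores tables and indexes as B-trees, so EXPLAIN QUERY PLAN makes it easy to see when a lookup walks an index instead of scanning the whole table (the table and column names below are made up).

```python
import sqlite3

# Hypothetical table for illustration; SQLite tables and indexes are B-trees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (email, country) VALUES (?, ?)",
    [(f"user{i}@example.com", "AU" if i % 2 else "NZ") for i in range(10_000)],
)

# Without a secondary index, a lookup by email scans the whole table B-tree.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@example.com'"
).fetchall())

# With a B-tree index on email, the same query becomes an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@example.com'"
).fetchall())
```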
·planetscale.com·
Building Cost Efficient Data Pipelines with Python & DuckDB
Imagine working for a company that processes a few GBs of data every day but spends hours configuring and debugging large-scale data processing systems! Whoever set up the data infrastructure copied it from some blog or talk by big tech. Now the responsibility of managing the data team's expenses has fallen on your shoulders. You're under pressure to scrutinize every system expense, no matter how small, in an effort to save some money for the organization. It can be frustrating when data vendors charge you a lot and will gladly charge you more if you are not careful with usage. Imagine if your data processing costs were dirt cheap! Imagine being able to replicate and debug issues quickly on your laptop! In this post, we will discuss how to use the latest advancements in data processing systems and cheap hardware to keep data processing inexpensive. We will use DuckDB and Python to demonstrate how to process data quickly while improving developer ergonomics.
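As a rough illustration of the approach (the file paths and columns below are made up, not taken from the post), a single pipeline step can run entirely in-process with DuckDB:

```python
import duckdb

# Hypothetical pipeline step: read a raw CSV, aggregate it in-process with
# DuckDB, and write a Parquet output -- no cluster required for a few GBs.
con = duckdb.connect()  # in-memory DuckDB database

con.execute("""
    COPY (
        SELECT order_date,
               customer_id,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM read_csv_auto('raw/orders.csv')
        GROUP BY order_date, customer_id
    ) TO 'clean/daily_orders.parquet' (FORMAT PARQUET)
""")
```

The same script runs unchanged on a laptop, which is what makes replicating and debugging issues locally so cheap.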
·startdataengineering.com·
Netflix Data Tech Stack
Learn about the Data Tech Stack used by Netflix to process trillions of events every day.
·junaideffendi.com·
Data Pipeline Design Patterns - #2. Coding patterns in Python
As a data engineer, you might have heard the terms functional data pipeline, factory pattern, singleton pattern, etc. One can quickly look up the implementation, but it can be tricky to understand what they are precisely and when to (& when not to) use them. Blindly following a pattern can help in some cases, but not knowing the caveats of a design will lead to hard-to-maintain and brittle code! While writing clean and easy-to-read code takes years of experience, you can accelerate that by understanding the nuances and reasoning behind each pattern. Imagine being able to design an implementation that provides the best extensibility and maintainability! Your colleagues (& future self) will be extremely grateful, your feature delivery speed will increase, and your boss will highly value your opinion. In this post, we will go over the specific code design patterns used for data pipelines, when and why to use them (and when not to), along with a few Python-specific techniques to help you write better pipelines. By the end of this post, you will be able to identify patterns in your data pipelines and apply the appropriate code design patterns. You will also be able to take advantage of Pythonic features to write bug-free, maintainable code that is a joy to work on!
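As a small taste, here is a sketch of one pattern the post covers, the factory pattern, applied to pipeline sources (all class and function names below are invented for illustration, not taken from the post):

```python
import csv
import json
import urllib.request
from dataclasses import dataclass
from typing import Protocol


class Source(Protocol):
    """Anything the pipeline can read rows from."""
    def read(self) -> list[dict]: ...


@dataclass
class CsvSource:
    path: str

    def read(self) -> list[dict]:
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))


@dataclass
class ApiSource:
    url: str

    def read(self) -> list[dict]:
        with urllib.request.urlopen(self.url) as resp:
            return json.load(resp)


def source_factory(kind: str, location: str) -> Source:
    # The one place that knows how to construct each concrete source.
    if kind == "csv":
        return CsvSource(path=location)
    if kind == "api":
        return ApiSource(url=location)
    raise ValueError(f"Unknown source kind: {kind}")


# The rest of the pipeline depends only on the Source protocol:
# rows = source_factory("csv", "data/events.csv").read()
```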
·startdataengineering.com·
Delta Lake - State of the Project - Part 1
Delta Lake, a project hosted under The Linux Foundation, has been growing by leaps and bounds. To celebrate the achievements of the project, we’re publishing a 2-part series on Delta Lake.
·delta.io·
What is best practice for local setup?
That was the solution, thank you! With this I was able to dockerise the setup and gain access.
·discuss.ray.io·
Structured Streaming Programming Guide - Spark 3.5.1 Documentation
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.
The output can be defined in one of three modes: Complete Mode, Append Mode, or Update Mode.
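A minimal sketch of where the mode is chosen, loosely following the guide's socket word-count example (host, port, sink, and checkpoint path are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("OutputModes").getOrCreate()

# Stream lines from a socket (placeholder source) and count words.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

word_counts = (lines
               .select(explode(split(lines.value, " ")).alias("word"))
               .groupBy("word")
               .count())

# The output mode is picked on writeStream: "complete" rewrites the whole
# result table every trigger, "update" writes only the rows that changed,
# and "append" would be rejected for this aggregation because no watermark
# bounds its state.
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/word_counts")
         .start())

query.awaitTermination()
```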
The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
Aggregations can be computed over a sliding event-time window.
We can easily define watermarking on a windowed aggregation using withWatermark(), as shown below.
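A sketch in the spirit of the guide's windowed word count (the socket source, host, and port are placeholders for whatever stream you actually read):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.appName("WatermarkedCounts").getOrCreate()

# includeTimestamp attaches an arrival timestamp to each line so the stream
# has an event-time column to watermark on.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", "true")
         .load())

words = lines.select(
    explode(split(lines.value, " ")).alias("word"),
    lines.timestamp,
)

# Rows arriving more than 10 minutes behind the max event time seen so far are
# treated as too late, so state for windows older than that can be dropped.
windowed_counts = (
    words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),  # 10-min windows, sliding every 5 min
        words.word,
    )
    .count()
)
```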
In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see the original section for the exact guarantees).
This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more.
We have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.
Note that after every trigger, the updated counts are written to the sink as the trigger output, as dictated by the Update mode.
·spark.apache.org·
Faster PySpark Unit Tests
TL;DR: A PySpark unit test setup for pytest that uses efficient default settings and utilizes all CPU cores via pytest-xdist is available…
spark.sql.shuffle.partitions
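A conftest.py along the lines the article describes might look like this (the exact settings in the article may differ):

```python
# conftest.py
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[1]")  # one core per test worker; pytest-xdist supplies the parallelism
        .config("spark.sql.shuffle.partitions", "1")  # the default of 200 is overkill for tiny test data
        .config("spark.ui.enabled", "false")          # no need for the web UI in tests
        .appName("pyspark-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```

With pytest-xdist installed, running pytest with -n auto spreads test files across all CPU cores, each worker getting its own session-scoped SparkSession.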
·medium.com·
Data Engineering on People Data
The application of analytics to people data empowers organizations to harness the full potential of their most important asset: their people.
·engineering.gusto.com·
Building an End-To-End Analytic solution in Power BI: Part 3 – Level Up with Data Modeling! | LinkedIn
When I talk to people who are not deep into the Power BI world, I often get the impression that they think of Power BI as a visualization tool exclusively. While that is true to a certain extent, it seems to me that they are not seeing the bigger picture – or maybe it's better to say – they see just…
·linkedin.com·
Navigating the data lake using Rust Part Two | Cuusoo
Most data engineers associate the Delta format with Spark and Databricks. That's not true: Delta can be used by many other tools, and most cloud providers have added Delta support to their analytics tools. In this post we will see how to use Delta from a Rust client, but this time the focus will be on S3 storage.
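The post itself works in Rust; as a rough Python analogue, the deltalake package (which wraps the same delta-rs crate) can read a Delta table straight from S3 without any Spark cluster (the bucket, path, and credentials below are placeholders):

```python
from deltalake import DeltaTable

# Placeholder credentials and table location -- substitute your own.
storage_options = {
    "AWS_ACCESS_KEY_ID": "YOUR_ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET_KEY",
    "AWS_REGION": "ap-southeast-2",
}

table = DeltaTable("s3://my-bucket/path/to/delta-table", storage_options=storage_options)
print(table.version())   # current version of the Delta table
df = table.to_pandas()   # materialise the data with no Spark involved
print(df.head())
```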
·cuusoo.com.au·