Data Engineering

B-trees and database indexes — PlanetScale
B-trees are used by many modern DBMSs. Learn how they work, how databases use them, and how your choice of primary key can affect index performance.
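Not from the article, but a minimal sketch of the idea: SQLite stores tables and indexes as B-trees, so EXPLAIN QUERY PLAN makes it easy to see when a lookup walks an index instead of scanning the whole table (the table and column names below are made up).

```python
import sqlite3

# Hypothetical table for illustration; SQLite tables and indexes are B-trees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (email, country) VALUES (?, ?)",
    [(f"user{i}@example.com", "AU" if i % 2 else "NZ") for i in range(10_000)],
)

# Without a secondary index, a lookup by email scans the whole table B-tree.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@example.com'"
).fetchall())

# With a B-tree index on email, the same query becomes an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@example.com'"
).fetchall())
```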
·planetscale.com·
Building Cost Efficient Data Pipelines with Python & DuckDB
Imagine working for a company that processes a few GBs of data every day but spends hours configuring and debugging large-scale data processing systems! Whoever set up the data infrastructure copied it from some blog or talk by big tech. Now the responsibility of managing the data team's expenses has fallen on your shoulders. You're under pressure to scrutinize every system expense, no matter how small, in an effort to save some money for the organization. It can be frustrating when data vendors charge you a lot and will gladly charge you more if you are not careful with usage. Imagine if your data processing costs were dirt cheap! Imagine being able to replicate and debug issues quickly on your laptop! In this post, we will discuss how to use the latest advancements in data processing systems and cheap hardware to keep data processing inexpensive. We will use DuckDB and Python to demonstrate how to process data quickly while improving developer ergonomics.
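As a rough illustration of the approach (the file paths and columns below are made up, not taken from the post), a single pipeline step can run entirely in-process with DuckDB:

```python
import duckdb

# Hypothetical pipeline step: read a raw CSV, aggregate it in-process with
# DuckDB, and write a Parquet output -- no cluster required for a few GBs.
con = duckdb.connect()  # in-memory DuckDB database

con.execute("""
    COPY (
        SELECT order_date,
               customer_id,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM read_csv_auto('raw/orders.csv')
        GROUP BY order_date, customer_id
    ) TO 'clean/daily_orders.parquet' (FORMAT PARQUET)
""")
```

The same script runs unchanged on a laptop, which is what makes replicating and debugging issues locally so cheap.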
·startdataengineering.com·
Netflix Data Tech Stack
Learn about the Data Tech Stack used by Netflix to process trillions of events every day.
·junaideffendi.com·
Data Pipeline Design Patterns - #2. Coding patterns in Python
As a data engineer, you might have heard the terms functional data pipeline, factory pattern, singleton pattern, etc. One can quickly look up the implementation, but it can be tricky to understand what they are precisely and when to (& when not to) use them. Blindly following a pattern can help in some cases, but not knowing the caveats of a design will lead to hard-to-maintain and brittle code! While writing clean and easy-to-read code takes years of experience, you can accelerate that by understanding the nuances and reasoning behind each pattern. Imagine being able to design an implementation that provides the best extensibility and maintainability! Your colleagues (& future self) will be extremely grateful, your feature delivery speed will increase, and your boss will highly value your opinion. In this post, we will go over the specific code design patterns used for data pipelines, when and why to use them (and when not to), along with a few Python-specific techniques to help you write better pipelines. By the end of this post, you will be able to identify patterns in your data pipelines and apply the appropriate code design patterns. You will also be able to take advantage of Pythonic features to write bug-free, maintainable code that is a joy to work on!
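As a small taste, here is a sketch of one pattern the post covers, the factory pattern, applied to pipeline sources (all class and function names below are invented for illustration, not taken from the post):

```python
import csv
import json
import urllib.request
from dataclasses import dataclass
from typing import Protocol


class Source(Protocol):
    """Anything the pipeline can read rows from."""
    def read(self) -> list[dict]: ...


@dataclass
class CsvSource:
    path: str

    def read(self) -> list[dict]:
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))


@dataclass
class ApiSource:
    url: str

    def read(self) -> list[dict]:
        with urllib.request.urlopen(self.url) as resp:
            return json.load(resp)


def source_factory(kind: str, location: str) -> Source:
    # The one place that knows how to construct each concrete source.
    if kind == "csv":
        return CsvSource(path=location)
    if kind == "api":
        return ApiSource(url=location)
    raise ValueError(f"Unknown source kind: {kind}")


# The rest of the pipeline depends only on the Source protocol:
# rows = source_factory("csv", "data/events.csv").read()
```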
·startdataengineering.com·
Delta Lake - State of the Project - Part 1
Delta Lake, a project hosted under The Linux Foundation, has been growing by leaps and bounds. To celebrate the achievements of the project, we’re publishing a 2-part series on Delta Lake.
·delta.io·
What is best practice for local setup?
That was the solution, thank you! With this I was able to dockerise the setup and gain access.
·discuss.ray.io·
Structured Streaming Programming Guide - Spark 3.5.1 Documentation
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.
The output can be defined in one of three modes: Complete Mode, Append Mode, or Update Mode.
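A minimal sketch of where the mode is chosen, loosely following the guide's socket word-count example (host, port, sink, and checkpoint path are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("OutputModes").getOrCreate()

# Stream lines from a socket (placeholder source) and count words.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

word_counts = (lines
               .select(explode(split(lines.value, " ")).alias("word"))
               .groupBy("word")
               .count())

# The output mode is picked on writeStream: "complete" rewrites the whole
# result table every trigger, "update" writes only the rows that changed,
# and "append" would be rejected for this aggregation because no watermark
# bounds its state.
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/word_counts")
         .start())

query.awaitTermination()
```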
The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
Aggregations can be computed over a sliding event-time window.
We can easily define watermarking on a windowed aggregation using withWatermark(), as shown below.
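A sketch in the spirit of the guide's windowed word count (the socket source, host, and port are placeholders for whatever stream you actually read):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.appName("WatermarkedCounts").getOrCreate()

# includeTimestamp attaches an arrival timestamp to each line so the stream
# has an event-time column to watermark on.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", "true")
         .load())

words = lines.select(
    explode(split(lines.value, " ")).alias("word"),
    lines.timestamp,
)

# Rows arriving more than 10 minutes behind the max event time seen so far are
# treated as too late, so state for windows older than that can be dropped.
windowed_counts = (
    words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),  # 10-min windows, sliding every 5 min
        words.word,
    )
    .count()
)
```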
In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see the original section for the exact guarantees).
This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more.
We have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.
Note that after every trigger, the updated counts are written to the sink as the trigger output, as dictated by the Update mode.
·spark.apache.org·
Faster PySpark Unit Tests
TL;DR: A PySpark unit test setup for pytest that uses efficient default settings and utilizes all CPU cores via pytest-xdist is available…
spark.sql.shuffle.partitions
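A conftest.py along the lines the article describes might look like this (the exact settings in the article may differ):

```python
# conftest.py
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[1]")  # one core per test worker; pytest-xdist supplies the parallelism
        .config("spark.sql.shuffle.partitions", "1")  # the default of 200 is overkill for tiny test data
        .config("spark.ui.enabled", "false")          # no need for the web UI in tests
        .appName("pyspark-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```

With pytest-xdist installed, running pytest with -n auto spreads test files across all CPU cores, each worker getting its own session-scoped SparkSession.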
·medium.com·
Data Engineering on People Data
The application of analytics to people data empowers organizations to harness the full potential of their most important asset: their people.
·engineering.gusto.com·
Building an End-To-End Analytic solution in Power BI: Part 3 – Level Up with Data Modeling! | LinkedIn
When I talk to people who are not deep into the Power BI world, I often get the impression that they think of Power BI as a visualization tool exclusively. While that is true to a certain extent, it seems to me that they are not seeing the bigger picture – or maybe it's better to say – they see just…
·linkedin.com·
Navigating the data lake using Rust Part Two | Cuusoo
Most data engineers associate the Delta format with Spark and Databricks. That's not true: Delta can be used by many other tools, and most cloud providers have added Delta support to their analytics tools. In this post we will see how to use Delta from a Rust client, but this time the focus will be on S3 storage.
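The post itself works in Rust; as a rough Python analogue, the deltalake package (which wraps the same delta-rs crate) can read a Delta table straight from S3 without any Spark cluster (the bucket, path, and credentials below are placeholders):

```python
from deltalake import DeltaTable

# Placeholder credentials and table location -- substitute your own.
storage_options = {
    "AWS_ACCESS_KEY_ID": "YOUR_ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET_KEY",
    "AWS_REGION": "ap-southeast-2",
}

table = DeltaTable("s3://my-bucket/path/to/delta-table", storage_options=storage_options)
print(table.version())   # current version of the Delta table
df = table.to_pandas()   # materialise the data with no Spark involved
print(df.head())
```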
·cuusoo.com.au·