Data Engineering

157 bookmarks

Create a Delta Table in S3 using Rust

See how to write Rust code to create a Delta Table in S3 and add data to it, using your local Windows development environment.

Delta

·blog.det.life·Oct 23, 2023

Error handling with Rust using delta-rs as an example - Qxf2 BLOG

Learn various ways to perform error handling in Rust explained using an example in this easy-to-follow-along blog.

Delta

·qxf2.com·Oct 23, 2023

The Hitchhiker's Guide to Delta Lake Streaming

This session will provide answers for some of the biggest questions in the universe: namely, how to take full advantage of Delta Lake streaming. You will be ...

Delta

·m.youtube.com·Oct 12, 2023

Schedule - CMU 15-721 :: Advanced Database Systems (Spring 2023)

Course schedule with slides, lecture notes, and videos.

Databases

·15721.courses.cs.cmu.edu·Oct 10, 2023

Schedule - CMU 15-445/645 :: Intro to Database Systems (Fall 2023)

Course schedule with slides, lecture notes, and videos.

Databases

·15445.courses.cs.cmu.edu·Oct 10, 2023

Data Movement: Most Used Messaging Patterns and Why They are so Important

Discover the most used messaging patterns and how data movement affects operational complexity, costs, and sustainability.

Architecture

·pandio.com·Oct 6, 2023

Data Architecture – Data Engineer Things

Read writing about Data Architecture in Data Engineer Things. Things learned in our data engineering journey and ideas on data and engineering.

Architecture

·blog.det.life·Oct 6, 2023

Architecture - Data Developer Platform

Architecture

·datadeveloperplatform.org·Oct 6, 2023

Mastering Delta Lake Optimizations: OPTIMIZE, Z-ORDER and VACUUM

Find out how to use the optimizations of your Delta Lake to increase the performance of the operations and save costs.

Delta

·medium.com·Oct 5, 2023

Pros and cons of Hive-style partitioning

This post discusses the pros and cons of Hive-style partitioning.

Delta

·delta.io·Oct 5, 2023

Homepage – bytewax

bytewax - Stream Processing Python framework

Orchestration

·bytewax.io·Oct 4, 2023

Introducing pgroll: zero-downtime, reversible, schema migrations for Postgres

We are excited to ship the first version of pgroll, a command line tool that offers safe and reversible schema migrations for PostgreSQL

Tools

·xata.io·Oct 4, 2023

Design patterns every data engineer should know

Patterns

·rspacesamuel.medium.com·Sep 30, 2023

Navigating the data lake using Rust - Part One | Cuusoo

Most data engineers correlate delta format with Spark and Databricks. That's not true. Delta can be used by so many other tools and most cloud providers have added delta support to their analytics tools. In this post we will see how to use delta from a Rust client.

Delta

·cuusoo.com.au·Sep 22, 2023

Deploy a Delta Sharing Server on Azure

If you’ve been following along in this series, we’ve previously deployed a Delta Sharing server on AWS. Providing a similar tutorial for…

Delta

·medium.com·Sep 21, 2023

The SwirlAI Data Engineering Project Master Template: The Collector (Part 1).

And how to run it on Kubernetes.

Tools

·newsletter.swirlai.com·Sep 21, 2023

Data Modeling for Mere Mortals – Part 1: What is Data Modeling?! | LinkedIn

In recent years, I’ve done dozens of training on various data platform topics, for all kinds of audiences. When teaching various data platform concepts and techniques, I find one of the concepts particularly intimidating for many business analysts, especially those who are just starting their journe

Architecture

·linkedin.com·Aug 1, 2023

spark_hive_test/src/main/scala/tests/SparkHiveTest.scala at master · arempter/spark_hive_test · GitHub

Example for article Running Spark 3 with standalone Hive Metastore 3.0

Data-Lab

·github.com·Jul 5, 2023

Running Spark 3 with standalone Hive Metastore 3.0

Intro

Data-Lab

·medium.com·Jul 5, 2023

A Guide to Optimising your Spark Application Performance (Part 1).

A cheat sheet to refer to when you run into performance issues with your Spark application.

Deep Dive

·newsletter.swirlai.com·Jul 2, 2023

pyspark connect to aws s3a filesystem

jar dependencies are very finicky

Data-Lab

·codelovingyogi.medium.com·Jun 29, 2023

Reading and Writing Data from/to MinIO using Spark

MinIO is a high-performance, S3-compatible cloud object store. Native to Kubernetes, MinIO is the only object storage suite…

Data-Lab

·medium.com·Jun 29, 2023

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

I'm trying to run a simple spark to s3 app from a server but I keep getting the below error because the server has hadoop 2.7.3 installed and it looks like it doesn't include the GlobalStorageStati...

Data-Lab

·stackoverflow.com·Jun 29, 2023

cookbook/docs/apache-spark-with-minio.md at master · nitisht/cookbook · GitHub

Collection of Minio recipes. Contribute to nitisht/cookbook development by creating an account on GitHub.

Data-Lab

·github.com·Jun 29, 2023

Add Jar to standalone pyspark

I'm launching a pyspark program: $ export SPARK_HOME= $ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip $ python And the py code: from pyspark import SparkContext,

.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')

Data-Lab

·stackoverflow.com·Jun 29, 2023

Adding some MinIO to your standalone Apache Spark cluster

Disaggregated compute and storage for the apprentice Data Engineer

Data-Lab

·fithis2001.medium.com·Jun 29, 2023

DataOps 02: Spawn up Apache Spark infrastructure by using Docker

When working on real data products, we will register an account on cloud providers such as Amazon, Azure, or Google so that we are able to…

Data-Lab

·medium.com·Jun 29, 2023

The Beginner's Guide to Databases

There are 300+ databases; what do they all do?

Databases

·technically.substack.com·Mar 28, 2023

Optimizing Apache Spark™ on Databricks - Databricks

In this course, we will explore the vast majority of performance problems in an Apache Spark application: skew, spill, shuffle, storage, and serialization.

Spark

·databricks.com·Mar 26, 2023

ytsaurus/ytsaurus: YTsaurus is a scalable and fault-tolerant open-source big data platform.

YTsaurus is a scalable and fault-tolerant open-source big data platform.

Tools

·github.com·Mar 23, 2023
