Data Engineering

157 bookmarks

Hive - Installation

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system, so you need to install a Linux-flavored OS. The following simple […]

Hadoop

·tutorialspoint.com·Sep 20, 2022

panovvv/hadoop-hive-spark-docker: Base Docker image with just essentials: Hadoop, Hive and Spark.

Base Docker image with just essentials: Hadoop, Hive and Spark.

Data-Lab

·github.com·Sep 19, 2022

Data Contracts — From Zero To Hero

A pragmatic approach to data contracts

Patterns

·towardsdatascience.com·Sep 12, 2022

spark createOrReplaceTempView vs createGlobalTempView

Spark Dataset 2.0 provides two functions, createOrReplaceTempView and createGlobalTempView. I cannot understand the basic difference between the two. According to the API documentation:

createOrReplaceTempView() creates or replaces a local temporary view with this DataFrame df. The lifetime of this view is tied to the SparkSession that created it.

createGlobalTempView() creates a global temporary view with this DataFrame df. The lifetime of this view is tied to the Spark application itself.

Deep Dive

·stackoverflow.com·Sep 10, 2022
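
To make the two lifetimes in the question concrete, here is a toy model in plain Python. It does not use Spark: `ToyApplication` and `ToySession` are hypothetical stand-ins invented for this sketch. The only Spark behavior it tries to mirror is that a temp view is visible in one session, while a global temp view is shared by all sessions of one application and addressed through the `global_temp` database.

```python
class ToyApplication:
    """Stand-in for one Spark application (hypothetical, not a Spark class)."""

    def __init__(self):
        # Application-wide registry: shared by every session created from this app.
        self.global_views = {}

    def new_session(self):
        return ToySession(self)


class ToySession:
    """Stand-in for one SparkSession (hypothetical, not a Spark class)."""

    def __init__(self, app):
        self.app = app
        # Session-local registry: invisible to sibling sessions.
        self.local_views = {}

    def create_or_replace_temp_view(self, name, data):
        # Silently replaces an existing view of the same name.
        self.local_views[name] = data

    def create_global_temp_view(self, name, data):
        # Refuses to overwrite an existing global view.
        if name in self.app.global_views:
            raise ValueError(f"global view {name!r} already exists")
        self.app.global_views[name] = data

    def table(self, name):
        # Global views are addressed with the "global_temp." prefix.
        prefix = "global_temp."
        if name.startswith(prefix):
            return self.app.global_views[name[len(prefix):]]
        return self.local_views[name]


app = ToyApplication()
first, second = app.new_session(), app.new_session()

first.create_or_replace_temp_view("people", [("Ann", 31)])
first.create_global_temp_view("people_shared", [("Ann", 31)])

visible_locally = "people" in second.local_views
shared_rows = second.table("global_temp.people_shared")
```

In this sketch the second session cannot see `people` but can read `global_temp.people_shared`, which is the distinction the quoted API text describes.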

Create Your Very Own Apache Spark/Hadoop Cluster....then do something with it? - Confessions of a Data Guy

I’ve never seen so many posts about Apache Spark before, not sure if it’s 3.0, or because the world is burning down. I’ve written about Spark a few times, even 2 years ago, but it still seems to be steadily increasing in popularity, albeit still missing from many companies’ tech stacks. With the continued rise […]

Data-Lab

·confessionsofadataguy.com·Sep 8, 2022

Kafka, for your data pipeline? Why not?

Create a streaming pipeline using Docker, Kafka, and Kafka Connect

Orchestration

·towardsdatascience.com·Sep 5, 2022

GitHub - public-apis/public-apis: A collective list of free APIs

A collective list of free APIs. Contribute to public-apis/public-apis development by creating an account on GitHub.

Tools

·github.com·Aug 27, 2022

Onehouse

Architecture

·onehouse.ai·Aug 22, 2022

Spark Repartition & Coalesce - Explained

DataNoon - Making Big Data and Analytics simple!

Deep Dive

·datanoon.com·Aug 4, 2022

Spark Architecture: Shuffle

This is my second article about Apache Spark architecture and today I will be more specific and tell you about the shuffle, one of the most interesting topics in the overall Spark design. The previous part was mostly about general Spark architecture and its memory management. It can be accessed here. The next one is [...]

Deep Dive

·0x0fff.com·Aug 4, 2022

Spark Broadcast Variables - Spark by {Examples}

In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access them. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs. Use […]

Deep Dive

·sparkbyexamples.com·Aug 3, 2022

Tuning - Spark 3.3.0 Documentation

Tuning and performance optimization guide for Spark 3.3.0

The main point to remember here is that the cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects (e.g. an array of Ints instead of a LinkedList) greatly lowers this cost.

Deep Dive

·spark.apache.org·Aug 3, 2022

Reading Spark DAGs - DZone Java

See how to effectively read Directed Acyclic Graphs (DAGs) in Spark to better understand the steps a program takes to complete a computation.

Deep Dive

·dzone.com·Aug 3, 2022

Dynamic Partition Pruning in Spark 3.0 - DZone Big Data

This post gives a deeper insight into Dynamic Partition Pruning in Apache Spark and how it works in the newer Spark release.

Therefore, we don’t need to actually scan the full fact table as we are only interested in two filtering partitions that result from the dimension table.

To avoid this, a simple approach is to incorporate the filter from the dimension table into a subquery, then run that subquery below the scan on the fact table.

Deep Dive

·dzone.com·Aug 3, 2022
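
The idea in the excerpt, filtering the dimension table first and using the result to skip fact-table partitions, can be sketched in plain Python. This is only an illustration of the principle, not Spark's optimizer: the tables, the `date_id` partition key, and the `join_with_pruning` helper are made up for the example.

```python
# Fact table stored as partitions keyed by date_id (toy data, made up for illustration).
fact_partitions = {
    1: [("order-a", 10.0), ("order-b", 5.0)],
    2: [("order-c", 7.5)],
    3: [("order-d", 2.0)],
    4: [("order-e", 9.0), ("order-f", 1.0)],
}

# Dimension table: date_id -> attributes.
dim_dates = {
    1: {"day_of_week": "Sat"},
    2: {"day_of_week": "Mon"},
    3: {"day_of_week": "Tue"},
    4: {"day_of_week": "Sun"},
}


def join_with_pruning(fact, dim, dim_predicate):
    """Filter the dimension first, then scan only the matching fact partitions."""
    # Step 1: the "subquery" on the dimension table yields the partition keys we need.
    wanted_keys = {key for key, attrs in dim.items() if dim_predicate(attrs)}

    # Step 2: scan the fact table, skipping every partition whose key is not wanted.
    scanned, rows = [], []
    for key in sorted(fact):
        if key not in wanted_keys:
            continue  # pruned: this partition is never read
        scanned.append(key)
        rows.extend((key, order_id, amount) for order_id, amount in fact[key])
    return scanned, rows


scanned, rows = join_with_pruning(
    fact_partitions, dim_dates, lambda attrs: attrs["day_of_week"] in {"Sat", "Sun"}
)
```

Here only two of the four fact partitions are read, matching the "two filtering partitions" case the excerpt mentions.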

Data Architecture Revisited: The Platform Hypothesis

Software systems are increasingly based on data, rather than code. A new class of tools and technologies has emerged to process data for both analytics and ML.

Architecture

·future.com·Jul 18, 2022

Configuration - Spark 3.2.1 Documentation

Memory Management

Deep Dive

·spark.apache.org·Jun 7, 2022

pyspark.SparkConf — PySpark 3.2.1 documentation

Most of the time, you would create a SparkConf object with SparkConf(), which will load values from spark.* Java system properties as well. In this case, any parameters you set directly on the SparkConf object take priority over system properties.

Deep Dive

·spark.apache.org·Jun 7, 2022

PySpark partitionBy() - Write to Disk Example - Spark by {Examples}

PySpark partitionBy() is a function of the pyspark.sql.DataFrameWriter class that is used to partition a large dataset (DataFrame) into smaller files based on one or more columns while writing to disk. Let’s see how to use it with Python examples. Partitioning the data on the file system is a way to improve the performance of the […]

Spark

·sparkbyexamples.com·Jun 6, 2022
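
As a rough picture of what "partitioning on the file system" means, the sketch below groups rows by one column and writes each group under a `column=value/` directory, the layout Spark uses for partitioned output. It is plain Python with the standard library, not the `DataFrameWriter` API. The `write_partitioned` helper, the file name `part-0000.csv`, and the sample rows (including the population figures) are assumptions made for illustration.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, partition_col, base_dir):
    """Write one CSV per distinct value of partition_col under base_dir/<col>=<value>/."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    written = []
    for value, group in groups.items():
        part_dir = Path(base_dir) / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition value lives in the directory name, so it is left out of the file.
        fields = [name for name in group[0] if name != partition_col]
        path = part_dir / "part-0000.csv"
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return written


# Sample rows; the values are placeholders, not real statistics.
rows = [
    {"city": "Austin", "state": "TX", "population": 964000},
    {"city": "Dallas", "state": "TX", "population": 1300000},
    {"city": "Miami", "state": "FL", "population": 442000},
]

with tempfile.TemporaryDirectory() as base_dir:
    paths = write_partitioned(rows, "state", base_dir)
    layout = sorted(p.relative_to(base_dir).as_posix() for p in paths)
    tx_lines = (Path(base_dir) / "state=TX" / "part-0000.csv").read_text().splitlines()
```

A reader that filters on `state` only needs to open the matching directory, which is where the performance benefit mentioned in the excerpt comes from.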

How To Change The Column Names Of PySpark DataFrames

Discussing 5 ways for changing column names in PySpark DataFrames

Spark

·towardsdatascience.com·Jun 6, 2022

Spark Window Functions with Examples - Spark by {Examples}

Spark Window functions are used to calculate results such as the rank, row number, etc. over a range of input rows, and they are available to you by importing org.apache.spark.sql.functions._. This article explains the concept of window functions, their usage and syntax, and finally how to use them with Spark SQL and Spark’s DataFrame API. These […]

Deep Dive

·sparkbyexamples.com·May 24, 2022
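
For readers new to the two results named in the excerpt, the following plain-Python sketch computes a row number and a rank within each partition, ordered by a column in descending order. It is not Spark code: `add_row_number_and_rank` and the sample rows are hypothetical, and ties get the same rank with a gap afterwards, as in SQL `RANK()`.

```python
from itertools import groupby
from operator import itemgetter


def add_row_number_and_rank(rows, partition_key, order_key):
    """Return rows extended with row_number and rank, computed per partition.

    Rows are ordered by order_key descending within each partition. rank leaves
    gaps after ties, matching the usual SQL RANK() semantics.
    """
    result = []
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        window = sorted(group, key=itemgetter(order_key), reverse=True)
        rank, previous = 0, object()  # sentinel that equals no real value
        for row_number, row in enumerate(window, start=1):
            if row[order_key] != previous:
                rank = row_number  # new value: rank jumps to the current position
                previous = row[order_key]
            result.append({**row, "row_number": row_number, "rank": rank})
    return result


# Made-up sample data for illustration.
salaries = [
    {"name": "James", "dept": "Sales", "salary": 3000},
    {"name": "Maria", "dept": "Finance", "salary": 3000},
    {"name": "Robert", "dept": "Sales", "salary": 4100},
    {"name": "Saif", "dept": "Sales", "salary": 4100},
    {"name": "Scott", "dept": "Finance", "salary": 3300},
]

ranked = add_row_number_and_rank(salaries, "dept", "salary")
sales = [(r["name"], r["row_number"], r["rank"]) for r in ranked if r["dept"] == "Sales"]
```

The two tied Sales salaries share rank 1 but get distinct row numbers, which is the practical difference between the two functions.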

Star and Snowflake Schema in Data Warehouse with Model Examples

What are multidimensional schemas? A multidimensional schema is designed specifically to model data warehouse systems. These schemas address the unique needs of very large databases designed […]

Architecture

·guru99.com·Apr 28, 2022

What is the difference between a data lake and a data warehouse?

Confused by all the "data lake vs data warehouse" articles? Struggling to understand what the differences between data lakes and warehouses are? Then this post is for you. We go over what data lakes and warehouses are. We also cover the key points to consider when choosing your lake and warehouse tools.

Architecture

·startdataengineering.com·Apr 20, 2022

How to Put a Database in Kubernetes - DZone Cloud

Learn the key steps of deploying databases and stateful workloads in Kubernetes and meet cloud-native technologies that can streamline Apache Cassandra for K8s.

Architecture

·dzone.com·Mar 25, 2022

Data Lake vs Data Warehouse

The data lake and the data warehouse seem similar, but there are differences.

Architecture

·luminousmen.com·Mar 25, 2022

Ultimate CI Pipeline for All of Your Python Projects

Everything you ever wanted for your Python project's continuous integration pipeline, up and running in a matter of minutes

Tools

·towardsdatascience.com·Mar 23, 2022

15+ Data Engineering Projects for Beginners with Source Code

Explore the top 15 real-world data engineering project ideas for beginners, with source code, to gain hands-on experience with diverse data engineering skills.

Resources

·projectpro.io·Mar 2, 2022

Is there a function in pyspark dataframe that is similar to pandas.io.json.json_normalize

I would like to perform an operation similar to pandas.io.json.json_normalize on a PySpark DataFrame. Is there an equivalent function in Spark? https://pandas.pydata.org/pandas-docs/stable/reference/api/

Spark

·stackoverflow.com·Jan 28, 2022
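
For context on what the question is asking for, here is a minimal plain-Python sketch of the flattening that json_normalize performs on nested records: nested dicts become top-level keys joined with a dot. It is not a PySpark answer and not the pandas implementation. The `flatten` helper and the sample record are assumptions for illustration, and lists are deliberately left untouched.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys; lists are left as-is."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up sample record.
record = {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}, "tags": ["math"]}
flat = flatten(record)
```

In Spark the analogous step is usually done on the schema rather than on each record, by selecting nested struct fields with dotted paths such as `name.first`.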

Starting your journey with Microsoft Azure Data Factory

In this article, we will go through the Microsoft Azure Data Factory service, which can be used to ingest, copy, and transform data generated by various data sources.

Tools

·sqlshack.com·Jan 20, 2022
