Spark

51 bookmarks
Structured Streaming Programming Guide - Spark 3.5.1 Documentation
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.
The output can be defined in one of three modes:
Complete Mode
Append Mode
Update Mode
The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
sliding event-time window aggregation
We can easily define watermarking on the previous example using withWatermark() as shown below.
In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees).
This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more.
We have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.
Note that after every trigger, the updated counts are written to the sink as the trigger output, as dictated by the Update mode.
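
A minimal runnable sketch of that pattern, using Spark's built-in rate source as a stand-in for a real event stream (the source, window sizes, and watermark delay here are illustrative, not taken from the guide):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

# The rate source emits rows with a `timestamp` and an increasing `value`,
# which is enough to exercise event-time windowing.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Tolerate events up to 10 minutes late, then count per 10-minute window
# sliding every 5 minutes; state older than the watermark can be dropped.
windowed_counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
    .count()
)

# Update mode writes only the rows whose counts changed in each trigger.
query = (
    windowed_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)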
·spark.apache.org·
Faster PySpark Unit Tests
TL;DR: A PySpark unit test setup for pytest that uses efficient default settings and utilizes all CPU cores via pytest-xdist is available…
spark.sql.shuffle.partitions
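
One plausible shape for such a setup (the fixture name and exact settings below are assumptions, not lifted from the article):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # The default of 200 shuffle partitions is wasteful on tiny test data;
    # a single partition and a local master keep each test fast.
    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .config("spark.sql.shuffle.partitions", "1")
        .config("spark.ui.enabled", "false")
        .getOrCreate()
    )
    yield session
    session.stop()

With pytest-xdist installed, running pytest -n auto then spreads the test files across all CPU cores.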
·medium.com·
Add Jar to standalone pyspark
I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the py code: from pyspark import SparkContext,
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')
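A sketch of how that option plugs into a SparkSession builder (the app name is arbitrary; the Kafka package coordinates are the ones highlighted above):

from pyspark.sql import SparkSession

# spark.jars.packages resolves the connector and its transitive
# dependencies from Maven Central when the session starts.
spark = (
    SparkSession.builder
    .appName("kafka-jars-sketch")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")
    .getOrCreate()
)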
·stackoverflow.com·
Optimizing Apache Spark™ on Databricks - Databricks
In this course, we will explore the vast majority of performance problems in an Apache Spark application: skew, spill, shuffle, storage, and serialization.
·databricks.com·
One-hot encoding in PySpark
To perform one-hot encoding in PySpark, we must convert the categorical column into a numeric column (0, 1, ...) using StringIndexer, and then convert the numeric column into one-hot encoded columns using OneHotEncoder.
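
A self-contained sketch of those two steps (the column and app names are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.appName("ohe-sketch").getOrCreate()

df = spark.createDataFrame(
    [("red",), ("green",), ("blue",), ("green",)], ["color"]
)

# Step 1: map each category string to a numeric index (0, 1, ...).
indexed = (
    StringIndexer(inputCol="color", outputCol="color_idx")
    .fit(df)
    .transform(df)
)

# Step 2: expand the index into a sparse one-hot vector. Note that by
# default the last category is dropped (encoded as the all-zeros vector).
encoded = (
    OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
    .fit(indexed)
    .transform(indexed)
)
encoded.show()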
·skytowner.com·
Shuffle join in Spark SQL
Shuffle consists of moving data with the same key to a single executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations, but that's not necessarily true.
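
One way to see this for a plain join, with broadcasting disabled so Spark cannot avoid the shuffle (the data and the threshold setting are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-join-sketch")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
customers = spark.range(100_000).withColumnRenamed("id", "customer_id")

# No *ByKey operation in sight, yet the physical plan shows Exchange
# (shuffle) nodes feeding a SortMergeJoin: both sides are repartitioned
# so that matching keys land on the same executor.
orders.join(customers, "customer_id").explain()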
·waitingforcode.com·
Getting started with MongoDB, PySpark, and Jupyter Notebook | MongoDB Blog
Learn how to leverage MongoDB data in your Jupyter notebooks via the MongoDB Spark Connector and PySpark. We will load financial security data from MongoDB, calculate a moving average, and then update the data in MongoDB with the new data.
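
A rough sketch of the read side, assuming the 10.x MongoDB Spark Connector (older 3.x releases use format "mongo" and spark.mongodb.input.uri instead; the URI, database, collection, and connector version below are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-sketch")
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Load one collection as a DataFrame; the schema is inferred by sampling.
df = (
    spark.read.format("mongodb")
    .option("database", "finance")
    .option("collection", "securities")
    .load()
)
df.printSchema()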
·mongodb.com·
How to install Apache Spark on Ubuntu using Apache Bigtop
Want to install Apache Spark using Apache Bigtop? Step by step tutorial. Bigtop is a package manager for Spark, HBase, Hadoop and other Apache projects related to big data. This tutorial is for Machine Learning engineers and Data Scientists looking for a convenient way to manage big data components of their ecosystem.
·blog.miz.space·
How to connect to remote hive server from spark
I'm running Spark locally and want to access Hive tables, which are located in a remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spa...
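
A common approach is to point the local session at the cluster's Hive metastore and enable Hive support (the thrift URI below is a placeholder for the cluster's real hive.metastore.uris value):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("remote-hive-sketch")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()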
·stackoverflow.com·
Spark Step-by-Step Setup on Hadoop Yarn Cluster
This post explains how to set up Apache Spark and run Spark applications on Hadoop with the YARN cluster manager, running the examples in client deploy mode with yarn as the master. You can also try running the Spark application in cluster mode. Prerequisites: if you don't have Hadoop and YARN installed, install and set up a Hadoop cluster with YARN before proceeding with this article. To install and set up Apache Spark on the Hadoop cluster, access the Apache Spark download site and go to the Download Apache Spark section.
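
For the client-mode case, the same thing can be expressed from PySpark directly, assuming HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster's configuration files so Spark can locate the ResourceManager:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-client-sketch")
    .master("yarn")
    .config("spark.submit.deployMode", "client")
    .getOrCreate()
)

# The driver runs locally; executors are allocated as YARN containers.
print(spark.range(1000).count())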
·sparkbyexamples.com·