Spark

51 bookmarks
Structured Streaming Programming Guide - Spark 3.5.1 Documentation
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.
The output can be defined in one of three modes:
Complete Mode
Append Mode
Update Mode
The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
sliding event-time window aggregation
We can easily define watermarking on the previous example using withWatermark() as shown below.
In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees).
This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more.
We have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.
Note that after every trigger, the updated counts are written to the sink as the trigger output, as dictated by the Update mode.
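
A minimal runnable sketch of that pattern, using Spark's built-in rate source as a stand-in for a real event stream (the source, window sizes, and watermark delay here are illustrative, not taken from the guide):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

# The rate source emits rows with a `timestamp` and an increasing `value`,
# which is enough to exercise event-time windowing.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Tolerate events up to 10 minutes late, then count per 10-minute window
# sliding every 5 minutes; state older than the watermark can be dropped.
windowed_counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
    .count()
)

# Update mode writes only the rows whose counts changed in each trigger.
query = (
    windowed_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)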
·spark.apache.org·
Faster PySpark Unit Tests
TL;DR: A PySpark unit test setup for pytest that uses efficient default settings and utilizes all CPU cores via pytest-xdist is available…
spark.sql.shuffle.partitions
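
One plausible shape for such a setup (the fixture name and exact settings below are assumptions, not lifted from the article):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # The default of 200 shuffle partitions is wasteful on tiny test data;
    # a single partition and a local master keep each test fast.
    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .config("spark.sql.shuffle.partitions", "1")
        .config("spark.ui.enabled", "false")
        .getOrCreate()
    )
    yield session
    session.stop()

With pytest-xdist installed, running pytest -n auto then spreads the test files across all CPU cores.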
·medium.com·
Add Jar to standalone pyspark
I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the py code: from pyspark import SparkContext,
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')
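A sketch of how that option plugs into a SparkSession builder (the app name is arbitrary; the Kafka package coordinates are the ones highlighted above):

from pyspark.sql import SparkSession

# spark.jars.packages resolves the connector and its transitive
# dependencies from Maven Central when the session starts.
spark = (
    SparkSession.builder
    .appName("kafka-jars-sketch")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")
    .getOrCreate()
)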
·stackoverflow.com·
Optimizing Apache Spark™ on Databricks - Databricks
In this course, we will explore the vast majority of performance problems in an Apache Spark application: skew, spill, shuffle, storage, and serialization.
·databricks.com·
One-hot encoding in PySpark
To perform one-hot encoding in PySpark, we must convert the categorical column into a numeric column (0, 1, ...) using StringIndexer, and then convert the numeric column into one-hot encoded columns using OneHotEncoder.
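
A self-contained sketch of those two steps (the column and app names are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.appName("ohe-sketch").getOrCreate()

df = spark.createDataFrame(
    [("red",), ("green",), ("blue",), ("green",)], ["color"]
)

# Step 1: map each category string to a numeric index (0, 1, ...).
indexed = (
    StringIndexer(inputCol="color", outputCol="color_idx")
    .fit(df)
    .transform(df)
)

# Step 2: expand the index into a sparse one-hot vector. Note that by
# default the last category is dropped (encoded as the all-zeros vector).
encoded = (
    OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
    .fit(indexed)
    .transform(indexed)
)
encoded.show()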
·skytowner.com·
Shuffle join in Spark SQL
Shuffle consists of moving data with the same key to a single executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations, but that's not necessarily true.
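
One way to see this for a plain join, with broadcasting disabled so Spark cannot avoid the shuffle (the data and the threshold setting are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-join-sketch")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
customers = spark.range(100_000).withColumnRenamed("id", "customer_id")

# No *ByKey operation in sight, yet the physical plan shows Exchange
# (shuffle) nodes feeding a SortMergeJoin: both sides are repartitioned
# so that matching keys land on the same executor.
orders.join(customers, "customer_id").explain()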
·waitingforcode.com·
Getting started with MongoDB, PySpark, and Jupyter Notebook | MongoDB Blog
Learn how to leverage MongoDB data in your Jupyter notebooks via the MongoDB Spark Connector and PySpark. We will load financial security data from MongoDB, calculate a moving average, and then update the data in MongoDB with the new data.
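
A rough sketch of the read side, assuming the 10.x MongoDB Spark Connector (older 3.x releases use format "mongo" and spark.mongodb.input.uri instead; the URI, database, collection, and connector version below are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-sketch")
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Load one collection as a DataFrame; the schema is inferred by sampling.
df = (
    spark.read.format("mongodb")
    .option("database", "finance")
    .option("collection", "securities")
    .load()
)
df.printSchema()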
·mongodb.com·
How to install Apache Spark on Ubuntu using Apache Bigtop
Want to install Apache Spark using Apache Bigtop? Step by step tutorial. Bigtop is a package manager for Spark, HBase, Hadoop and other Apache projects related to big data. This tutorial is for Machine Learning engineers and Data Scientists looking for a convenient way to manage big data components of their ecosystem.
·blog.miz.space·
How to connect to remote hive server from spark
I'm running Spark locally and want to access Hive tables, which are located in a remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spa...
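
A common approach is to point the local session at the cluster's Hive metastore and enable Hive support (the thrift URI below is a placeholder for the cluster's real hive.metastore.uris value):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("remote-hive-sketch")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()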
·stackoverflow.com·
Spark Step-by-Step Setup on Hadoop Yarn Cluster
This post explains how to set up Apache Spark and run Spark applications on Hadoop with the YARN cluster manager, running the examples in client deploy mode with yarn as the master. You can also try running the Spark application in cluster mode. Prerequisites: if you don't have Hadoop and YARN installed, install and set up a Hadoop cluster with YARN before proceeding with this article. To install and set up Apache Spark on the Hadoop cluster, access the Apache Spark download site and go to the Download Apache Spark section.
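
For the client-mode case, the same thing can be expressed from PySpark directly, assuming HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster's configuration files so Spark can locate the ResourceManager:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-client-sketch")
    .master("yarn")
    .config("spark.submit.deployMode", "client")
    .getOrCreate()
)

# The driver runs locally; executors are allocated as YARN containers.
print(spark.range(1000).count())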
·sparkbyexamples.com·