Search Data Engineering

Found 157 bookmarks

One-hot encoding in PySpark

To perform one-hot encoding in PySpark, we must convert the categorical column into a numeric column (0, 1, ...) using StringIndexer, and then convert the numeric column into one-hot encoded columns using OneHotEncoder.

Tutorials

·skytowner.com·Feb 16, 2023
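
The two stages described in this entry can be sketched without a Spark session. This is a plain-Python illustration of the idea, not the PySpark API: the helper names `string_index` and `one_hot` are hypothetical, and the sample data is made up. In PySpark itself the corresponding classes are `StringIndexer` and `OneHotEncoder` from `pyspark.ml.feature`; the real encoder emits sparse vectors and by default drops the last category, whereas this sketch keeps every slot for readability.

```python
from collections import Counter


def string_index(values):
    """Stage 1: map each category to an integer index.

    Categories are ordered by descending frequency (ties broken
    alphabetically), so the most common category gets index 0.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}


def one_hot(index, size):
    """Stage 2: turn an integer index into a dense 0/1 vector."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical sample column
colors = ["red", "blue", "red", "green", "red", "blue"]

mapping = string_index(colors)
encoded = [one_hot(mapping[c], len(mapping)) for c in colors]
```

Here `mapping` holds the category-to-index lookup and `encoded` holds one vector per input row, in the original row order.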

Parquet Best Practices: Discover your Data without loading them

Metadata, Statistics on Row Groups, Partitions discovery, and Repartitioning

Tools

·towardsdatascience.com·Jan 16, 2023

Functional Data Engineering - A Blueprint

How to build a Recoverable & Reproducible data pipeline

Patterns

·dataengineeringweekly.com·Jan 13, 2023

Data Pipeline Design Patterns - #1. Data flow patterns

Data pipelines built (and added on to) without a solid foundation will suffer from poor efficiency, slow development speed, long times to triage production issues, and hard testability. What if your data pipelines are elegant and enable you to deliver features quickly? An easy-to-maintain and extendable data pipeline significantly increases developer morale, stakeholder trust, and the business bottom line! Using the correct design pattern will increase feature delivery speed and developer value (allowing devs to do more in less time), decrease toil during pipeline failures, and build trust with stakeholders. This post goes over the most commonly used data flow design patterns, what they do, when to use them, and, more importantly, when not to use them. By the end of this post, you will have an overview of the typical data flow patterns and be able to choose the right one for your use case.

Patterns

·startdataengineering.com·Jan 7, 2023

Functional Data Engineering — a modern paradigm for batch data processing

Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only…

Patterns

·maximebeauchemin.medium.com·Dec 12, 2022

Shuffle join in Spark SQL

A shuffle consists of moving data with the same key to the same executor in order to execute some specific processing on it. We could think that it concerns only *ByKey operations, but that's not necessarily true.

Deep Dive

·waitingforcode.com·Oct 19, 2022
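
The idea in this entry, routing rows with the same key to the same place before joining them, can be sketched in plain Python. This is a toy model only: each partition stands in for an executor, the function names `shuffle` and `shuffle_join` are hypothetical, and the sample rows are made up. It is not how Spark implements its join strategies internally.

```python
from collections import defaultdict


def shuffle(rows, num_partitions):
    """Route each (key, value) row to a partition chosen by the key's hash."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def shuffle_join(left, right, num_partitions=4):
    """Inner-join two lists of (key, value) rows after shuffling both by key."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    joined = []
    for pid in range(num_partitions):
        # Rows sharing a key always land in the same partition id,
        # so each partition can be joined independently of the others.
        lookup = defaultdict(list)
        for key, value in right_parts.get(pid, []):
            lookup[key].append(value)
        for key, left_value in left_parts.get(pid, []):
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical sample data: (customer_id, payload)
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
customers = [(1, "alice"), (2, "bob"), (4, "dave")]

result = sorted(shuffle_join(orders, customers))
```

Because both inputs are partitioned with the same function, no partition ever needs to look at another partition's rows, which is the property the shuffle exists to provide.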

Spark SQL Query Engine Deep Dive (1) – Catalyst QueryExecution Overview

From this blog post on, I am going to start writing about Spark SQL Catalyst. Catalyst is the core of Spark SQL and there are many topics to cover. I don’t have a formal writing plan on this,…

Deep Dive

·dataninjago.com·Oct 12, 2022

NoSQL databases sample models: MongoDB, Neo4j, Swagger, Cassandra

Get the sample models for MongoDB, Neo4j, Cassandra, Swagger, Avro, Parquet, Glue, and more! After download, open the models using Hackolade, and learn through the examples how to leverage the modeling power of the software.

Tools

·hackolade.com·Oct 7, 2022

Data Engineers Aren't Plumbers – DataTalks.Club

But almost identical to a less known profession

Resources

·datatalks.club·Oct 3, 2022

pyspark.SparkContext.setLogLevel — PySpark 3.3.0 documentation

Tutorials

·spark.apache.org·Sep 29, 2022

Getting started with MongoDB, PySpark, and Jupyter Notebook | MongoDB Blog

Learn how to leverage MongoDB data in your Jupyter notebooks via the MongoDB Spark Connector and PySpark. We will load financial security data from MongoDB, calculate a moving average, and then update the data in MongoDB with the new data.

Tutorials

·mongodb.com·Sep 29, 2022
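
The moving-average step mentioned in this entry can be illustrated on its own, independent of MongoDB or Spark. The sketch below is a plain-Python trailing moving average. The function name `moving_average`, the window size, and the sample prices are all assumptions for illustration, and the linked tutorial may compute it differently.

```python
from collections import deque


def moving_average(prices, window):
    """Return the trailing moving average for each position in `prices`.

    Early positions, where fewer than `window` values are available,
    average over the values seen so far.
    """
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for price in prices:
        if len(recent) == window:
            # The oldest value is about to be evicted from the window.
            total -= recent[0]
        recent.append(price)
        total += price
        averages.append(total / len(recent))
    return averages


# Hypothetical closing prices
closing = [10, 20, 30, 40]
smoothed = moving_average(closing, window=2)
```

Keeping a running total avoids re-summing the whole window at every step, which matters once the series is long.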

Apache Spark Cluster on Docker (ft. a JupyterLab Interface)

Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface

Data-Lab

·towardsdatascience.com·Sep 28, 2022

How to Build a Spark Cluster with Docker, JupyterLab, and Apache Livy—a REST API for Apache Spark

Read our step-by-step guide to building an Apache Spark cluster based on the Docker virtual environment with JupyterLab and the Apache Livy REST interface.

Data-Lab

·stxnext.com·Sep 28, 2022

DIY: Apache Spark & Docker

Set up a Spark cluster in Docker from scratch

Data-Lab

·towardsdatascience.com·Sep 28, 2022

How to build and run Bigtop Sandbox (Experimental) - Apache Bigtop - Apache Software Foundation

Hadoop

·cwiki.apache.org·Sep 28, 2022

How to install Apache Spark on Ubuntu using Apache Bigtop

Want to install Apache Spark using Apache Bigtop? Step by step tutorial. Bigtop is a package manager for Spark, HBase, Hadoop and other Apache projects related to big data. This tutorial is for Machine Learning engineers and Data Scientists looking for a convenient way to manage big data components of their ecosystem.

Data-Lab

·blog.miz.space·Sep 28, 2022

Hadoop cluster

Hadoop

·docs.deistercloud.com·Sep 25, 2022

Hadoop ecosystem with docker-compose

Construct a Hadoop ecosystem cluster composed of 1 master, 1 DB, and n slaves using docker-compose. Get hands-on experience with the Hadoop MapReduce routine and with Hive, Sqoop, and HBase from the Hadoop ecosystem.

Hadoop

·hjben.github.io·Sep 24, 2022

Building an Apache Airflow configured with Local Executor and Spark Standalone Cluster with Docker

A guide on how to set up an environment to work with Airflow and Spark

Data-Lab

·mbvyn.medium.com·Sep 23, 2022

How to setup Simple Hadoop Cluster on Docker

How to set up a Hadoop cluster on Docker in the simplest way

Hadoop

·selectfrom.dev·Sep 23, 2022

How to connect to remote hive server from spark

I'm running Spark locally and want to access Hive tables, which are located in the remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spa...

Tutorials

·stackoverflow.com·Sep 23, 2022

Hadoop Yarn Configuration on Cluster

This post explains how to set up Yarn master on a Hadoop 3.1 cluster and run a MapReduce program. Before you proceed with this document, please make sure you have a Hadoop 3.1 cluster up and running. If you do not have a setup, please follow the link below to set up your cluster and come back to this page.

Hadoop

·sparkbyexamples.com·Sep 23, 2022

Spark Step-by-Step Setup on Hadoop Yarn Cluster

This post explains how to set up Apache Spark and run Spark applications on Hadoop with the Yarn cluster manager, running the Spark examples with deployment mode client and master yarn. You can also try running the Spark application in cluster mode. Prerequisites: If you don't have Hadoop and Yarn installed, please install and set up a Hadoop cluster and set up Yarn on the cluster before proceeding with this article. Spark install and setup: To install and set up Apache Spark on a Hadoop cluster, access the Apache Spark download site and go to the Download Apache Spark section

Data-Lab

·sparkbyexamples.com·Sep 23, 2022

How to read data from HDFS in Pyspark

This recipe helps you read data from HDFS in Pyspark

Tutorials

·projectpro.io·Sep 23, 2022

Apache Hadoop Installation on Ubuntu (multi-node cluster).

Below are the steps for installing Apache Hadoop on a Linux Ubuntu server. If you have a Windows laptop with enough memory, you can create 4 virtual machines using Oracle VirtualBox and install Ubuntu on these VMs. This article assumes you have Ubuntu running and doesn't explain how to create VMs and install Ubuntu. Apache Hadoop is an open-source distributed storage and processing framework used to process large data sets on commodity hardware. Hadoop runs natively on Linux; in this article I will explain the Apache Hadoop installation step by step (Hadoop 3.1.1).

Hadoop

·sparkbyexamples.com·Sep 23, 2022

How To Set Up a Hadoop 3.2.1 Multi-Node Cluster on Ubuntu 18.04 (2 Nodes)

To start: What is Hadoop?

Hadoop

·medium.com·Sep 21, 2022

apache hadoop-3.2.0 multi-node cluster with alternative backup recovery

Deploy the latest version of Apache Hadoop (stable release: 3.2.0) on a multi-node cluster to store unstructured data in a distributed manner.

Hadoop

·dataview.in·Sep 21, 2022

Hive server2

Hadoop

·dataview.in·Sep 21, 2022

Install and Configuration of Apache Hive on multi-node Hadoop cluster

Apache Hive is a data warehouse system. This guide covers how to install and configure the latest version of Apache Hive on top of an existing multi-node Hadoop cluster.

Hadoop

·dataview.in·Sep 20, 2022

Multinode Hadoop installation steps - DBACLASS

Multi Node Cluster in Hadoop 2.x. Here, we are taking two machines: master and slave. On both machines, a datanode will be running. Let us start with the setup of a Multi Node Cluster in Hadoop. PREREQUISITES: CentOS 6.5, Hadoop-2.7.3, Java 8, SSH. We have two machines (master and slave) with IP: Master […]

Hadoop

·dbaclass.com·Sep 20, 2022
