One-hot encoding in PySpark
To perform one-hot encoding in PySpark, we must convert the categorical column into a numeric column (0, 1, ...) using StringIndexer, and then convert the numeric column into one-hot encoded columns using OneHotEncoder.
·skytowner.com·
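The two steps described above (index the categories, then expand each index into a vector) can be sketched in plain Python. This is an illustration of the idea only, not PySpark code: `string_index` and `one_hot` are hypothetical helpers, and the frequency-descending ordering and the dropped last category are assumptions meant to mirror what I understand to be the defaults of `StringIndexer` and `OneHotEncoder`.

```python
from collections import Counter


def string_index(values):
    """Map each category to an integer index.

    Assumption: most frequent category gets index 0, ties broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size, drop_last=True):
    """Expand an index into a one-hot vector.

    Assumption: with drop_last=True the final category is encoded as all zeros,
    so the vector has size - 1 entries.
    """
    width = size - 1 if drop_last else size
    vec = [0.0] * width
    if index < width:
        vec[index] = 1.0
    return vec


colors = ["red", "blue", "red", "green", "blue", "red"]
mapping = string_index(colors)  # step 1: categorical -> numeric
encoded = [one_hot(mapping[c], len(mapping)) for c in colors]  # step 2: numeric -> one-hot

for color, vec in zip(colors, encoded):
    print(color, mapping[color], vec)
```

Dropping the last category keeps the encoded columns linearly independent, which matters for some linear models; pass `drop_last=False` to get one column per category.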
Data Pipeline Design Patterns - #1. Data flow patterns
Data pipelines built (and added on to) without a solid foundation will suffer from poor efficiency, slow development speed, long times to triage production issues, and poor testability. What if your data pipelines are elegant and enable you to deliver features quickly? An easy-to-maintain and extendable data pipeline significantly increases developer morale, stakeholder trust, and the business bottom line! Using the correct design pattern will increase feature delivery speed and developer value (allowing devs to do more in less time), decrease toil during pipeline failures, and build trust with stakeholders. This post goes over the most commonly used data flow design patterns, what they do, when to use them, and, more importantly, when not to use them. By the end of this post, you will have an overview of the typical data flow patterns and be able to choose the right one for your use case.
·startdataengineering.com·
Shuffle join in Spark SQL
Shuffle consists of moving data with the same key to a single executor so that some specific processing can be executed on it. We might think that this concerns only *ByKey operations, but that is not necessarily true.
·waitingforcode.com·
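The idea of moving rows with the same key to one place before joining them can be sketched in plain Python. This is a single-process simulation, not Spark code: `shuffle`, `shuffle_join`, the partition count, and the sample rows are all hypothetical, and each "partition" stands in for an executor.

```python
from collections import defaultdict


def shuffle(rows, num_partitions):
    """Route each (key, value) row to a partition chosen by hashing its key."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def shuffle_join(left, right, num_partitions=3):
    """Inner join: shuffle both sides by key, then join partition by partition."""
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Rows with equal keys from both sides are guaranteed to share partition p.
        lookup = defaultdict(list)
        for key, value in right_parts.get(p, []):
            lookup[key].append(value)
        for key, value in left_parts.get(p, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ana"), (2, "Raj"), (4, "Li")]
result = sorted(shuffle_join(orders, customers))
print(result)
```

Because both sides are partitioned with the same hash function, each partition can be joined independently of the others, which is what makes the work parallelizable in a real cluster.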
NoSQL databases sample models: MongoDB, Neo4j, Swagger, Cassandra
Get the sample models for MongoDB, Neo4j, Cassandra, Swagger, Avro, Parquet, Glue, and more! After download, open the models using Hackolade, and learn through the examples how to leverage the modeling power of the software.
·hackolade.com·
Getting started with MongoDB, PySpark, and Jupyter Notebook | MongoDB Blog
Learn how to leverage MongoDB data in your Jupyter notebooks via the MongoDB Spark Connector and PySpark. We will load financial security data from MongoDB, calculate a moving average, and then update the data in MongoDB with the new data.
·mongodb.com·
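The moving-average step mentioned above can be sketched in plain Python, independent of MongoDB and Spark. The `moving_average` helper, the window size, and the sample prices are hypothetical; the linked post may define its window differently (this sketch averages over fewer values until the window fills).

```python
from collections import deque


def moving_average(prices, window):
    """Trailing moving average over at most `window` most recent prices."""
    buf = deque(maxlen=window)
    total = 0.0
    averages = []
    for price in prices:
        if len(buf) == window:
            total -= buf[0]  # the oldest price is about to be evicted
        buf.append(price)
        total += price
        averages.append(total / len(buf))
    return averages


prices = [10, 20, 30, 40]
averages = moving_average(prices, window=2)
print(averages)
```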
How to install Apache Spark on Ubuntu using Apache Bigtop
Want to install Apache Spark using Apache Bigtop? Step by step tutorial. Bigtop is a package manager for Spark, HBase, Hadoop and other Apache projects related to big data. This tutorial is for Machine Learning engineers and Data Scientists looking for a convenient way to manage big data components of their ecosystem.
·blog.miz.space·
Hadoop ecosystem with docker-compose
Construct a Hadoop ecosystem cluster composed of 1 master, 1 DB, and n slaves using docker-compose. Get hands-on experience with the Hadoop MapReduce routine and with Hive, Sqoop, and HBase from the Hadoop ecosystem.
·hjben.github.io·
How to connect to remote hive server from spark
I'm running Spark locally and want to access Hive tables, which are located in the remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spa...
·stackoverflow.com·
Hadoop Yarn Configuration on Cluster
This post explains how to set up the Yarn master on a Hadoop 3.1 cluster and run a MapReduce program. Before you proceed with this document, please make sure you have a Hadoop 3.1 cluster up and running. If you do not have one set up, please follow the link below to set up your cluster and come back to this page.
·sparkbyexamples.com·
Spark Step-by-Step Setup on Hadoop Yarn Cluster
This post explains how to set up Apache Spark and run Spark applications on Hadoop with the Yarn cluster manager, running the Spark examples with deploy mode client and master yarn. You can also try running the Spark application in cluster mode. Prerequisites: If you don't have Hadoop and Yarn installed, please install and set up a Hadoop cluster and set up Yarn on the cluster before proceeding with this article. Spark Install and Setup: To install and set up Apache Spark on a Hadoop cluster, access the Apache Spark download site and go to the Download Apache Spark section
·sparkbyexamples.com·
Apache Hadoop Installation on Ubuntu (multi-node cluster).
Below are the steps for installing Apache Hadoop on a Linux Ubuntu server. If you have a Windows laptop with enough memory, you can create 4 virtual machines using Oracle VirtualBox and install Ubuntu on these VMs. This article assumes you have Ubuntu running and doesn't explain how to create VMs and install Ubuntu. Apache Hadoop is an open-source distributed storage and processing framework used to process large data sets on commodity hardware. Hadoop runs natively on Linux. In this article I will explain, step by step, the Apache Hadoop installation (Hadoop 3.1.1)
·sparkbyexamples.com·
Multinode Hadoop installation steps - DBACLASS
Multi Node Cluster in Hadoop 2.x. Here, we use two machines: master and slave. A datanode runs on both machines. Let us start with the setup of a multi-node cluster in Hadoop. PREREQUISITES: CentOS 6.5, Hadoop 2.7.3, Java 8, SSH. We have two machines (master and slave) with IP: Master […]
·dbaclass.com·