Week 5: Batch Processing

5.1 Introduction

🎥 5.1.1 Introduction to Batch Processing
🎥 5.1.2 Introduction to Spark

5.2 Installation

Follow these instructions to install Spark.

Then follow this guide to run PySpark in Jupyter.

🎥 5.2.1 (Optional) Installing Spark (Linux)

5.3 Spark SQL and DataFrames

🎥 5.3.1 First Look at Spark/PySpark
🎥 5.3.2 Spark Dataframes
🎥 5.3.3 (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: download_data.sh

Note: Besides using pandas, another way to infer the schema of the CSV files is to set the inferSchema option to true when reading the files in Spark.
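
As a minimal sketch of the note above, the PySpark snippet below reads a CSV file with the inferSchema option enabled. The helper name read_csv_inferring_schema, the app name, and the file path are hypothetical and only for illustration; point the path at wherever download_data.sh placed your files.

```python
def read_csv_inferring_schema(spark, path):
    """Read a CSV file with a header row and let Spark infer the column types."""
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # Spark scans the data to guess types
        .csv(path)
    )


def main():
    # Imported here so the helper above can be reused without starting Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("infer-schema-example")  # hypothetical app name
        .getOrCreate()
    )

    # Hypothetical path: adjust it to match your local data layout.
    df = read_csv_inferring_schema(spark, "data/raw/green/2021/01/")
    df.printSchema()
    spark.stop()
```

Calling main() prints the inferred schema. Schema inference makes Spark take an extra pass over the data, so for large files an explicit schema is usually the faster choice.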

🎥 5.3.4 SQL with Spark

5.4 Spark Internals

5.5 (Optional) Resilient Distributed Datasets

🎥 5.5.1 Operations on Spark RDDs
🎥 5.5.2 Spark RDD mapPartition

5.6 Running Spark in the Cloud

🎥 5.6.1 Connecting to Google Cloud Storage
🎥 5.6.2 Creating a Local Spark Cluster
🎥 5.6.3 Setting up a Dataproc Cluster
🎥 5.6.4 Connecting Spark to BigQuery

Homework

Homework

Community notes

Did you take notes? You can share them here.

Notes by Alvaro Navas
Sandy's DE Learning Blog
Notes by Alain Boisvert
Alternative: Using docker-compose to launch Spark, by rafik
Marcos Torregrosa's blog (Spanish)
Notes by Victor Padilha
Add your notes here (above this line)