data-engineering-zoomcamp/week_5_batch_processing
Victor Alexandre Padilha 7cb9d9a60b Padilha notes week5 (#361)
2023-03-18 15:56:58 +01:00

Week 5: Batch Processing

5.1 Introduction

5.2 Installation

Follow these instructions to install Spark:

Then follow this guide to run PySpark in Jupyter.

5.3 Spark SQL and DataFrames

Script to prepare the dataset: download_data.sh

Note: Besides using pandas, another way to infer the schema of the CSV files is to set the inferSchema option to true when reading the files in Spark.
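
The note above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the course's own code: the helper name read_csv_inferring_schema, the app name, and the path data/example.csv are hypothetical, and it assumes the CSV files have a header row. Keep in mind that inferSchema makes Spark read through the data an extra time to guess column types, so an explicit schema is usually the cheaper choice for large files.

```python
def read_csv_inferring_schema(spark, path):
    """Read CSV file(s) at `path` and let Spark infer the column types.

    `spark` is an existing SparkSession. Without inferSchema, every
    column of a CSV file is read as a string.
    """
    return (
        spark.read
        .option("header", "true")       # assumption: first row holds column names
        .option("inferSchema", "true")  # ask Spark to guess the column types
        .csv(path)
    )


def main():
    # Imported here so the helper above can be defined without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("infer-schema-sketch")  # hypothetical app name
        .getOrCreate()
    )

    # Hypothetical path: point this at one of the downloaded CSV files.
    df = read_csv_inferring_schema(spark, "data/example.csv")
    df.printSchema()  # shows the column types Spark inferred

    spark.stop()
```

To try it, replace the path with a real file and call main(). The printed schema is a reasonable starting point for writing an explicit schema by hand.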

5.4 Spark Internals

5.5 (Optional) Resilient Distributed Datasets

5.6 Running Spark in the Cloud

Homework

Community notes

Did you take notes? You can share them here.