
Running Spark in the Cloud

Connecting to Google Cloud Storage

Uploading data to GCS:

gsutil -m cp -r pq/ gs://dtc_data_lake_de-zoomcamp-nytaxi/pq

Download the GCS connector jar to any location (e.g. the lib folder):

gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar lib/gcs-connector-hadoop3-2.2.5.jar

See the notebook 09_spark_gcs.ipynb for the Spark session configuration.

(Thanks Alvin Do for the instructions!)
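
Under the hood, that notebook points the Spark session at the connector jar and a service-account key. A minimal sketch of the configuration, assuming the jar was downloaded to lib/ as above and using a placeholder credentials path:

import pyspark
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Placeholder: path to your service-account key file
credentials_location = '/home/user/.google/credentials/google_credentials.json'

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set("spark.jars", "./lib/gcs-connector-hadoop3-2.2.5.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)

sc = SparkContext(conf=conf)

# Register the GCS filesystem implementation so gs:// paths resolve
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)

spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

df_green = spark.read.parquet('gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/*/*')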

Local Cluster and Spark-Submit

Creating a standalone cluster (docs):

./sbin/start-master.sh

Creating a worker:

URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077"
./sbin/start-slave.sh ${URL}

# For newer versions of Spark, use this instead:
# ./sbin/start-worker.sh ${URL}
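
To use this cluster from a notebook or script, build the SparkSession against the master URL (a sketch, using the URL from above):

from pyspark.sql import SparkSession

# Connect to the standalone master started above
spark = SparkSession.builder \
    .master("spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077") \
    .appName('test') \
    .getOrCreate()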

Turn the notebook into a script:

jupyter nbconvert --to=script 06_spark_sql.ipynb

Edit the script and then run it:

python 06_spark_sql.py \
    --input_green=data/pq/green/2020/*/ \
    --input_yellow=data/pq/yellow/2020/*/ \
    --output=data/report-2020
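
The flags above imply the script parses its arguments with something like argparse; a minimal sketch of that boilerplate (the actual script also builds the SparkSession and produces the report):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output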

Use spark-submit to run the script on the cluster:

URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077"

spark-submit \
    --master="${URL}" \
    06_spark_sql.py \
        --input_green=data/pq/green/2021/*/ \
        --input_yellow=data/pq/yellow/2021/*/ \
        --output=data/report-2021
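
Since spark-submit now supplies the master, the script should create its session without a hardcoded master URL, roughly like this:

from pyspark.sql import SparkSession

# No .master(...) here: spark-submit's --master flag decides where this runs
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()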

Dataproc

Upload the script to GCS:

TODO

Params for the job:

  • --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/
  • --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/
  • --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021

Using the Google Cloud SDK to submit the job to Dataproc (link):

gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=europe-west6 \
    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \
    -- \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020

BigQuery

Upload the script to GCS:

TODO

Write the results to BigQuery (docs):

gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=europe-west6 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py \
    -- \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
        --output=trips_data_all.reports-2020
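
Inside 06_spark_sql_big_query.py, the output is written with the connector's bigquery format. A minimal sketch, assuming df_result holds the report and output is the table name parsed from the arguments; the temporary bucket name is a placeholder (the connector stages data in GCS before loading it into BigQuery):

# Placeholder: a GCS bucket the connector can use for staging
spark.conf.set('temporaryGcsBucket', 'dataproc-temp-bucket')

df_result.write.format('bigquery') \
    .option('table', output) \
    .save()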