109 Commits

Author SHA1 Message Date
89ea5e8bac update 05-batch README (#517) 2024-02-27 18:41:28 +01:00
0ac417886c Update RisingWave Homework (#519)
* Update rising-wave.md

* Update rising-wave.md
2024-02-27 18:33:51 +01:00
35d50cec77 Adding link to my week 4 notes (#520)
Adding my notes to the week 4 readme
2024-02-27 18:33:16 +01:00
be40774fdd Merge pull request #509 from inner-outer-space/patch-6
Update README.md
2024-02-23 09:33:20 +01:00
1b516814d8 Merge pull request #510 from inner-outer-space/patch-5
Update README.md
2024-02-23 09:33:02 +01:00
eee41d9457 Update homework.md 2024-02-23 08:46:03 +01:00
eea2214132 link to pyspark installation guide (#514) 2024-02-22 13:21:41 +01:00
e9b3a17b9c q3 important tip added (#512) 2024-02-21 14:07:22 +01:00
b94ab37921 Update homework.md 2024-02-20 14:57:00 +01:00
ae09f9b79d fix to spark link (#511) 2024-02-20 14:53:36 +01:00
f940e69e52 Update README.md
Restructured my repo and broke the link to my notes... fixing. thx.
2024-02-20 14:47:11 +01:00
a7caea6294 Update README.md
I restructured my repo, so am fixing my broken links to notes. Thx.
2024-02-20 14:42:50 +01:00
889b748f27 Update homework wording 2024-02-19 21:57:32 +01:00
22134a14f1 Added code to work with .parquet files (#405)
* Added code to work with .parquet files

* updated README.md
2024-02-19 18:28:30 +01:00
ee48f1d3f8 add Fedor Faizov to public leaderboard 2023 (#382)
* add Fedor Faizov to public leaderboard 2023

* upd by Alexey comment
2024-02-19 18:27:19 +01:00
884f9f0350 Update README.md - Add notes (#502) 2024-02-19 18:25:39 +01:00
fe849fdf5c update linux install instruction to download spark from archive instead (#504) 2024-02-19 18:25:13 +01:00
e719405956 Added Homework for Week 5 2024 (#503)
* Added Homework for Week 5 2024

* Update homework.md

---------

Co-authored-by: Alexey Grigorev <alexeygrigorev@users.noreply.github.com>
2024-02-19 18:24:47 +01:00
1ca12378ff Update homework.md 2024-02-16 16:54:44 +01:00
624efa10ab Update homework.md 2024-02-15 18:55:12 +01:00
da36243d1c Update homework.md 2024-02-15 18:54:31 +01:00
ddc22c29ab Merge pull request #500 from shayansm2/main
fixed a typo
2024-02-15 15:47:56 +01:00
19be2ed8f4 fix a typo 2024-02-15 18:06:03 +03:30
db7f42d882 Update links in README.md 2024-02-14 11:36:27 +01:00
98f6a4df08 Merge pull request #498 from jessicadesilva/fix-typos
fixed typos in column names
2024-02-14 11:30:08 +01:00
762b0ce4b9 fixed typos in column names 2024-02-13 19:22:34 -08:00
c9fae602b4 Update hack-load-data.sql 2024-02-13 16:51:11 +01:00
51d4241650 Create hack-load-data.sql 2024-02-13 16:50:14 +01:00
1dd47ba96c changed week to module 2024-02-13 15:04:27 +01:00
a7393a4063 Update README.md (#486)
Added link to my notes
2024-02-13 11:38:08 +01:00
45991f4254 Update README.md (#488)
Added week 4 notes

Co-authored-by: Alexey Grigorev <alexeygrigorev@users.noreply.github.com>
2024-02-13 11:37:53 +01:00
b7a5d61406 Update README.md (#489)
Adding my week 3 notes/blog post

Co-authored-by: Alexey Grigorev <alexeygrigorev@users.noreply.github.com>
2024-02-13 11:37:18 +01:00
afdf9508e6 Update README.md (#490)
* Update README.md

Included mage script file to load parquet file from remote URL and push to google bucket for homework data loading.

* Update README.md

dataloader script file for Mage to load parquet to a Google bucket. Also adds a keyword argument to retain timestamp formatting when the parquet-to-BigQuery conversion happens

---------

Co-authored-by: Alexey Grigorev <alexeygrigorev@users.noreply.github.com>
2024-02-13 11:36:18 +01:00
b44834ff60 add hw study note to 03-data-warehouse readme (#491) 2024-02-13 11:34:53 +01:00
c5a06cf150 Update README.md (#495)
videos transcript week4
2024-02-13 11:34:39 +01:00
770197cbe3 Update homework.md 2024-02-13 11:33:21 +01:00
cb874911ba Update homework.md 2024-02-11 23:37:28 +01:00
782acf26ce Update dlt.md 2024-02-09 18:56:01 +01:00
1c7926a713 Update README.md 2024-02-08 17:28:25 +01:00
68f0e6cb53 fix homework options (#485) 2024-02-08 16:43:30 +01:00
b17729fa9a README.md (#481) 2024-02-07 21:15:22 +01:00
7de55821ee Update README.md (#482)
steps to send data from Mage to GCS + creating external table with data from bucket
2024-02-07 21:14:54 +01:00
8a56888246 Update homework.md 2024-02-06 21:57:29 +01:00
c3e5ef4518 Update dlt.md 2024-02-06 21:23:26 +01:00
f31e2fe93a Update README.md with embedded YT URLs (#480) 2024-02-06 17:43:10 +01:00
36c29eaf1b Update README.md with embedded YT URLs (#479) 2024-02-06 08:01:19 +01:00
2ab335505c change week 2 homework for transformer block (#473) 2024-02-06 08:00:59 +01:00
3fabb1cfda modify link to gcp overview (#475) 2024-02-06 08:00:20 +01:00
baa2ea4cf7 Update README.md with embedded YT URLs (#476) 2024-02-06 07:59:42 +01:00
4553062578 Update README.md with embedded YT URLs (#477) 2024-02-06 07:59:23 +01:00
d3dabf2b81 Update README.md with embedded YT URLs (#478) 2024-02-06 07:58:52 +01:00
46e15f69e7 Create homework.md 2024-02-05 23:41:06 +01:00
d2e59f2350 Update URL in homework 3 (#448)
* Update URL in homework 3

URL was incorrect leading to errors in downloading

* Update homework.md
2024-02-05 18:12:54 +01:00
da6a842ee7 Update dlt.md 2024-02-05 17:43:58 +01:00
d763f07395 Update dlt.md 2024-02-05 16:49:12 +01:00
427d17d012 rearranged notebooks #461 2024-02-05 12:54:11 +01:00
51a9c95b7d Update homework.md (week 2 & 3) (#456)
* Update homework.md (week 2)

Update homework.md to explain beforehand what should be included in the homework repository

* Update homework.md (week 3)

Update homework.md to explain beforehand what should be included in the homework repository
2024-02-05 12:34:02 +01:00
6a2b86d8af Update README.md (#460)
week 3 notes
2024-02-05 12:33:37 +01:00
e659ff26b8 fix location join (#470) 2024-02-05 12:32:17 +01:00
6bc22c63cf Use embedded links in youtube URLs (#471)
Update README.md with markdown formatting from 

- https://markdown-videos-api.jorgenkh.no/docs#/
- https://github.com/orgs/community/discussions/16925
2024-02-05 12:29:51 +01:00
0f9b564bce Merge pull request #468 from DataTalksClub/de-zoomcamp-videos
De zoomcamp creating the whole project
2024-02-04 22:35:24 +01:00
fe4419866d Merge branch 'main' of https://github.com/DataTalksClub/data-engineering-zoomcamp into de-zoomcamp-videos 2024-02-04 21:34:26 +00:00
53b2676115 complete my whole project 2024-02-04 21:34:12 +00:00
c0c772b8ce Merge pull request #459 from inner-outer-space/patch-1
Update README.md
2024-02-04 22:16:06 +01:00
4117ce9f5d Merge pull request #458 from inner-outer-space/patch-2
Update README.md
2024-02-04 22:15:43 +01:00
b1ad88253c Merge pull request #466 from maria-fisher/patch-3
Update README.md
2024-02-04 22:15:17 +01:00
049dd34c6c fix conflicts 2024-02-04 21:06:30 +00:00
1efd2a236c build a whole dbt project 2024-02-04 21:04:29 +00:00
72c4c821dc remove unused files 2024-02-04 20:48:14 +01:00
68e8e1a9cb make dm_monthly_zone_revenue cross-db 2024-02-04 20:47:15 +01:00
261b50d042 Update schema.yml tests 2024-02-04 20:34:52 +01:00
b269844ea3 Update dbt_project.yml variables 2024-02-04 20:32:52 +01:00
35b99817dc Update stg_yellow_tripdata to latest dbt syntax 2024-02-04 19:15:35 +01:00
78a5940578 Update to latest dbt functions naming 2024-02-04 19:11:46 +01:00
13a7752e5e Merge branch 'main' of https://github.com/DataTalksClub/data-engineering-zoomcamp into de-zoomcamp-videos 2024-02-04 17:28:29 +00:00
3af1021228 Update README.md
videos transcript week 3
2024-02-03 17:27:35 +00:00
f641f94a25 Update README.md
week 1 notes
2024-02-01 11:24:28 +01:00
0563fb5ff7 Update README.md
notes for week 2
2024-02-01 11:21:37 +01:00
a64e90ac36 Include logos for RisingWave Workshop (#455)
As per title.
2024-02-01 07:45:08 +01:00
e69c289b40 Update homework.md to explain beforehand what should be included in the homework repository (#447) 2024-01-31 18:57:25 +01:00
69bc9aec1b Update README.md batch process (#449)
Update README.md batch process
2024-01-31 18:55:05 +01:00
fe176c1679 Update README.md data streaming notes (#450)
Update README.md data streaming notes
2024-01-31 18:54:53 +01:00
d9cb16e282 Corrected errors in the instructions (#452) 2024-01-31 15:54:13 +01:00
6d2f1aa7e8 Delete Frame 124.jpg 2024-01-31 13:40:52 +03:00
390b2f6994 Add files via upload 2024-01-31 13:15:21 +03:00
ef6791e1cf Update README.md 2024-01-31 10:55:10 +01:00
865849b0ef Update README.md 2024-01-31 10:54:22 +01:00
9249bfba29 Add files via upload 2024-01-31 10:53:20 +01:00
bb43aa52e4 Delete images/architecture/untitled_diagram.drawio__10_.png 2024-01-31 10:48:33 +01:00
9a6d7878fd Delete images/architecture/arch_2.png 2024-01-31 10:48:22 +01:00
fe0b744ffe Update README.md 2024-01-31 10:43:28 +01:00
dbe68cd993 Add files via upload 2024-01-31 10:42:21 +01:00
a00f31fb85 formatting dlthub workshop (#451)
* adding dlt course

* adding dlt course

* improve formatting

* add cta

* add cta

* add links to slack

* visual improvements

* visual improvements

* visual improvements

---------

Co-authored-by: Adrian <Adrian>
2024-01-31 08:46:18 +01:00
9882dd7411 Update homework.md 2024-01-30 10:29:47 +01:00
f46e0044b9 Update homework.md 2024-01-30 10:29:16 +01:00
38087a646d Update homework.md (#429)
I believe the wording for question 2 is misleading or the correct answer isn't listed. When filtering the dataset to only contain records with more than zero passengers or trips longer than zero:

 ```
df = data[(data['passenger_count'] > 0) & (data['trip_distance'] > 0)]
```
the shape of the resulting dataframe is (139370, 20).

When filtering the dataframe based on the actual question:

```
df_2 = data[(data['passenger_count'] == 0) | (data['trip_distance'] == 0)]
```

the resulting shape is (9455, 20).
2024-01-29 23:31:41 +01:00
4617e63ddd Change the 1st homework of cohort 2024 to reduce ambiguity (#409) 2024-01-29 19:31:53 +01:00
738c22f91b Fix typo in JDK install instructions (#430)
Due to the missing extra dash the line yields the following error:
xcode-select: error: invalid argument '-install'
2024-01-29 19:28:48 +01:00
d576cfb1c9 Update README.md (#439)
Added youtube link to 2nd video on module-01 environment setup demo.
2024-01-29 19:27:43 +01:00
af248385c0 Update README.md (#443)
videos transcripts week 2

Co-authored-by: Alexey Grigorev <alexeygrigorev@users.noreply.github.com>
2024-01-29 19:27:31 +01:00
7abbbde00e Update README.md (#444) 2024-01-29 19:26:41 +01:00
dd84d736bc Fix typo in README.md (#446)
seperated -> separated
2024-01-29 19:26:16 +01:00
6ae0b18eea Update homework.md 2024-01-29 19:12:35 +01:00
e9c8748e29 add dlt course content (#445)
* adding dlt course

* adding dlt course

* improve formatting

* add cta

* add cta

* add links to slack

---------

Co-authored-by: Adrian <Adrian>
2024-01-29 18:45:11 +01:00
a6fda6d5ca Update rising-wave.md (#441) 2024-01-29 15:25:03 +01:00
ee88d7f230 Merge branch 'main' of https://github.com/DataTalksClub/data-engineering-zoomcamp into de-zoomcamp-videos 2024-01-28 21:57:02 +00:00
7a251b614b Update homework.md 2024-01-28 22:40:58 +01:00
b6901c05bf init my dbt project! 2024-01-28 00:16:23 +00:00
9e89d9849e delete 2024-01-28 00:14:21 +00:00
50 changed files with 18286 additions and 254 deletions

View File

@ -113,6 +113,10 @@ $ aws s3 ls s3://nyc-tlc
PRE trip data/
```
You can refer to `data-loading-parquet.ipynb` and `data-loading-parquet.py` for code that handles both CSV and parquet files. (The lookup zones table, which is needed later in this course, is a CSV file.)
> Note: You will need to install the `pyarrow` library (add it to your Dockerfile).
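
For a quick sense of what those files do, here is a minimal sketch of the batched parquet ingestion, assuming the local Postgres used in this module (`root:root@localhost:5432/ny_taxi`) and an already-downloaded `yellow_tripdata_2023-09.parquet`, i.e. the same connection string, table name, and file as the notebook below:

```
# Minimal sketch: stream a parquet file into Postgres in 100k-row batches
import pyarrow.parquet as pq
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

pf = pq.ParquetFile('yellow_tripdata_2023-09.parquet')
for batch in pf.iter_batches(batch_size=100_000):
    # each batch is a pyarrow RecordBatch; convert to pandas before appending
    batch.to_pandas().to_sql(name='ny_taxi_data', con=engine, if_exists='append')
```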
### pgAdmin
Running pgAdmin

View File

@ -0,0 +1,938 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "52bad16a",
"metadata": {},
"source": [
"# Data loading \n",
"\n",
"Here we will be using the ```.paraquet``` file we downloaded and do the following:\n",
" - Check metadata and table datatypes of the paraquet file/table\n",
" - Convert the paraquet file to pandas dataframe and check the datatypes. Additionally check the data dictionary to make sure you have the right datatypes in pandas, as pandas will automatically create the table in our database.\n",
" - Generate the DDL CREATE statement from pandas for a sanity check.\n",
" - Create a connection to our database using SQLAlchemy\n",
" - Convert our huge paraquet file into a iterable that has batches of 100,000 rows and load it into our database."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afef2456",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T23:55:14.141738Z",
"start_time": "2023-12-03T23:55:14.124217Z"
}
},
"outputs": [],
"source": [
"import pandas as pd \n",
"import pyarrow.parquet as pq\n",
"from time import time"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c750d1d4",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T02:54:01.925350Z",
"start_time": "2023-12-03T02:54:01.661119Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow._parquet.FileMetaData object at 0x7fed89ffa540>\n",
" created_by: parquet-cpp-arrow version 13.0.0\n",
" num_columns: 19\n",
" num_rows: 2846722\n",
" num_row_groups: 3\n",
" format_version: 2.6\n",
" serialized_size: 6357"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read metadata \n",
"pq.read_metadata('yellow_tripdata_2023-09.parquet')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a970fcf0",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T23:28:08.411945Z",
"start_time": "2023-12-03T23:28:08.177693Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"VendorID: int32\n",
"tpep_pickup_datetime: timestamp[us]\n",
"tpep_dropoff_datetime: timestamp[us]\n",
"passenger_count: int64\n",
"trip_distance: double\n",
"RatecodeID: int64\n",
"store_and_fwd_flag: large_string\n",
"PULocationID: int32\n",
"DOLocationID: int32\n",
"payment_type: int64\n",
"fare_amount: double\n",
"extra: double\n",
"mta_tax: double\n",
"tip_amount: double\n",
"tolls_amount: double\n",
"improvement_surcharge: double\n",
"total_amount: double\n",
"congestion_surcharge: double\n",
"Airport_fee: double"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read file, read the table from file and check schema\n",
"file = pq.ParquetFile('yellow_tripdata_2023-09.parquet')\n",
"table = file.read()\n",
"table.schema"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43f6ea7e",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T23:28:22.870376Z",
"start_time": "2023-12-03T23:28:22.563414Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 2846722 entries, 0 to 2846721\n",
"Data columns (total 19 columns):\n",
" # Column Dtype \n",
"--- ------ ----- \n",
" 0 VendorID int32 \n",
" 1 tpep_pickup_datetime datetime64[ns]\n",
" 2 tpep_dropoff_datetime datetime64[ns]\n",
" 3 passenger_count float64 \n",
" 4 trip_distance float64 \n",
" 5 RatecodeID float64 \n",
" 6 store_and_fwd_flag object \n",
" 7 PULocationID int32 \n",
" 8 DOLocationID int32 \n",
" 9 payment_type int64 \n",
" 10 fare_amount float64 \n",
" 11 extra float64 \n",
" 12 mta_tax float64 \n",
" 13 tip_amount float64 \n",
" 14 tolls_amount float64 \n",
" 15 improvement_surcharge float64 \n",
" 16 total_amount float64 \n",
" 17 congestion_surcharge float64 \n",
" 18 Airport_fee float64 \n",
"dtypes: datetime64[ns](2), float64(12), int32(3), int64(1), object(1)\n",
"memory usage: 380.1+ MB\n"
]
}
],
"source": [
"# Convert to pandas and check data \n",
"df = table.to_pandas()\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "ccf039a0",
"metadata": {},
"source": [
"We need to first create the connection to our postgres database. We can feed the connection information to generate the CREATE SQL query for the specific server. SQLAlchemy supports a variety of servers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44e701ae",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T22:50:25.811951Z",
"start_time": "2023-12-03T22:50:25.393987Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<sqlalchemy.engine.base.Connection at 0x7fed98ea3190>"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an open SQL database connection object or a SQLAlchemy connectable\n",
"from sqlalchemy import create_engine\n",
"\n",
"engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\n",
"engine.connect()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c96a1075",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T22:50:43.628727Z",
"start_time": "2023-12-03T22:50:43.442337Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"CREATE TABLE yellow_taxi_data (\n",
"\t\"VendorID\" INTEGER, \n",
"\ttpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE, \n",
"\ttpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE, \n",
"\tpassenger_count FLOAT(53), \n",
"\ttrip_distance FLOAT(53), \n",
"\t\"RatecodeID\" FLOAT(53), \n",
"\tstore_and_fwd_flag TEXT, \n",
"\t\"PULocationID\" INTEGER, \n",
"\t\"DOLocationID\" INTEGER, \n",
"\tpayment_type BIGINT, \n",
"\tfare_amount FLOAT(53), \n",
"\textra FLOAT(53), \n",
"\tmta_tax FLOAT(53), \n",
"\ttip_amount FLOAT(53), \n",
"\ttolls_amount FLOAT(53), \n",
"\timprovement_surcharge FLOAT(53), \n",
"\ttotal_amount FLOAT(53), \n",
"\tcongestion_surcharge FLOAT(53), \n",
"\t\"Airport_fee\" FLOAT(53)\n",
")\n",
"\n",
"\n"
]
}
],
"source": [
"# Generate CREATE SQL statement from schema for validation\n",
"print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))"
]
},
{
"cell_type": "markdown",
"id": "eca7f32d",
"metadata": {},
"source": [
"Datatypes for the table looks good! Since we used paraquet file the datasets seem to have been preserved. You may have to convert some datatypes so it is always good to do this check."
]
},
{
"cell_type": "markdown",
"id": "51a751ed",
"metadata": {},
"source": [
"## Finally inserting data\n",
"\n",
"There are 2,846,722 rows in our dataset. We are going to use the ```parquet_file.iter_batches()``` function to create batches of 100,000, convert them into pandas and then load it into the postgres database."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e20cec73",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-03T23:49:28.768786Z",
"start_time": "2023-12-03T23:49:28.689732Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VendorID</th>\n",
" <th>tpep_pickup_datetime</th>\n",
" <th>tpep_dropoff_datetime</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>RatecodeID</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>PULocationID</th>\n",
" <th>DOLocationID</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" <th>Airport_fee</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2023-09-01 00:15:37</td>\n",
" <td>2023-09-01 00:20:21</td>\n",
" <td>1</td>\n",
" <td>0.80</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>163</td>\n",
" <td>230</td>\n",
" <td>2</td>\n",
" <td>6.5</td>\n",
" <td>3.5</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>11.50</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2023-09-01 00:18:40</td>\n",
" <td>2023-09-01 00:30:28</td>\n",
" <td>2</td>\n",
" <td>2.34</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>236</td>\n",
" <td>233</td>\n",
" <td>1</td>\n",
" <td>14.2</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>2.00</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>21.20</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2023-09-01 00:35:01</td>\n",
" <td>2023-09-01 00:39:04</td>\n",
" <td>1</td>\n",
" <td>1.62</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>162</td>\n",
" <td>236</td>\n",
" <td>1</td>\n",
" <td>8.6</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>2.00</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>15.60</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>2023-09-01 00:45:45</td>\n",
" <td>2023-09-01 00:47:37</td>\n",
" <td>1</td>\n",
" <td>0.74</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>141</td>\n",
" <td>229</td>\n",
" <td>1</td>\n",
" <td>5.1</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>1.00</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>11.10</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>2023-09-01 00:01:23</td>\n",
" <td>2023-09-01 00:38:05</td>\n",
" <td>1</td>\n",
" <td>9.85</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>138</td>\n",
" <td>230</td>\n",
" <td>1</td>\n",
" <td>45.0</td>\n",
" <td>6.0</td>\n",
" <td>0.5</td>\n",
" <td>17.02</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>73.77</td>\n",
" <td>2.5</td>\n",
" <td>1.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99995</th>\n",
" <td>2</td>\n",
" <td>2023-09-02 09:55:17</td>\n",
" <td>2023-09-02 10:01:45</td>\n",
" <td>2</td>\n",
" <td>1.48</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>163</td>\n",
" <td>164</td>\n",
" <td>1</td>\n",
" <td>9.3</td>\n",
" <td>0.0</td>\n",
" <td>0.5</td>\n",
" <td>2.66</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>15.96</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99996</th>\n",
" <td>2</td>\n",
" <td>2023-09-02 09:25:34</td>\n",
" <td>2023-09-02 09:55:20</td>\n",
" <td>3</td>\n",
" <td>17.49</td>\n",
" <td>2</td>\n",
" <td>N</td>\n",
" <td>132</td>\n",
" <td>164</td>\n",
" <td>1</td>\n",
" <td>70.0</td>\n",
" <td>0.0</td>\n",
" <td>0.5</td>\n",
" <td>24.28</td>\n",
" <td>6.94</td>\n",
" <td>1.0</td>\n",
" <td>106.97</td>\n",
" <td>2.5</td>\n",
" <td>1.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99997</th>\n",
" <td>2</td>\n",
" <td>2023-09-02 09:57:55</td>\n",
" <td>2023-09-02 10:04:52</td>\n",
" <td>1</td>\n",
" <td>1.73</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>164</td>\n",
" <td>249</td>\n",
" <td>1</td>\n",
" <td>10.0</td>\n",
" <td>0.0</td>\n",
" <td>0.5</td>\n",
" <td>2.80</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>16.80</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99998</th>\n",
" <td>2</td>\n",
" <td>2023-09-02 09:35:02</td>\n",
" <td>2023-09-02 09:43:28</td>\n",
" <td>1</td>\n",
" <td>1.32</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>113</td>\n",
" <td>170</td>\n",
" <td>1</td>\n",
" <td>10.0</td>\n",
" <td>0.0</td>\n",
" <td>0.5</td>\n",
" <td>4.20</td>\n",
" <td>0.00</td>\n",
" <td>1.0</td>\n",
" <td>18.20</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99999</th>\n",
" <td>2</td>\n",
" <td>2023-09-02 09:46:09</td>\n",
" <td>2023-09-02 10:03:58</td>\n",
" <td>1</td>\n",
" <td>8.79</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>138</td>\n",
" <td>170</td>\n",
" <td>1</td>\n",
" <td>35.9</td>\n",
" <td>5.0</td>\n",
" <td>0.5</td>\n",
" <td>10.37</td>\n",
" <td>6.94</td>\n",
" <td>1.0</td>\n",
" <td>63.96</td>\n",
" <td>2.5</td>\n",
" <td>1.75</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>100000 rows × 19 columns</p>\n",
"</div>"
],
"text/plain": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"0 1 2023-09-01 00:15:37 2023-09-01 00:20:21 1 \n",
"1 2 2023-09-01 00:18:40 2023-09-01 00:30:28 2 \n",
"2 2 2023-09-01 00:35:01 2023-09-01 00:39:04 1 \n",
"3 2 2023-09-01 00:45:45 2023-09-01 00:47:37 1 \n",
"4 2 2023-09-01 00:01:23 2023-09-01 00:38:05 1 \n",
"... ... ... ... ... \n",
"99995 2 2023-09-02 09:55:17 2023-09-02 10:01:45 2 \n",
"99996 2 2023-09-02 09:25:34 2023-09-02 09:55:20 3 \n",
"99997 2 2023-09-02 09:57:55 2023-09-02 10:04:52 1 \n",
"99998 2 2023-09-02 09:35:02 2023-09-02 09:43:28 1 \n",
"99999 2 2023-09-02 09:46:09 2023-09-02 10:03:58 1 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID \\\n",
"0 0.80 1 N 163 \n",
"1 2.34 1 N 236 \n",
"2 1.62 1 N 162 \n",
"3 0.74 1 N 141 \n",
"4 9.85 1 N 138 \n",
"... ... ... ... ... \n",
"99995 1.48 1 N 163 \n",
"99996 17.49 2 N 132 \n",
"99997 1.73 1 N 164 \n",
"99998 1.32 1 N 113 \n",
"99999 8.79 1 N 138 \n",
"\n",
" DOLocationID payment_type fare_amount extra mta_tax tip_amount \\\n",
"0 230 2 6.5 3.5 0.5 0.00 \n",
"1 233 1 14.2 1.0 0.5 2.00 \n",
"2 236 1 8.6 1.0 0.5 2.00 \n",
"3 229 1 5.1 1.0 0.5 1.00 \n",
"4 230 1 45.0 6.0 0.5 17.02 \n",
"... ... ... ... ... ... ... \n",
"99995 164 1 9.3 0.0 0.5 2.66 \n",
"99996 164 1 70.0 0.0 0.5 24.28 \n",
"99997 249 1 10.0 0.0 0.5 2.80 \n",
"99998 170 1 10.0 0.0 0.5 4.20 \n",
"99999 170 1 35.9 5.0 0.5 10.37 \n",
"\n",
" tolls_amount improvement_surcharge total_amount \\\n",
"0 0.00 1.0 11.50 \n",
"1 0.00 1.0 21.20 \n",
"2 0.00 1.0 15.60 \n",
"3 0.00 1.0 11.10 \n",
"4 0.00 1.0 73.77 \n",
"... ... ... ... \n",
"99995 0.00 1.0 15.96 \n",
"99996 6.94 1.0 106.97 \n",
"99997 0.00 1.0 16.80 \n",
"99998 0.00 1.0 18.20 \n",
"99999 6.94 1.0 63.96 \n",
"\n",
" congestion_surcharge Airport_fee \n",
"0 2.5 0.00 \n",
"1 2.5 0.00 \n",
"2 2.5 0.00 \n",
"3 2.5 0.00 \n",
"4 2.5 1.75 \n",
"... ... ... \n",
"99995 2.5 0.00 \n",
"99996 2.5 1.75 \n",
"99997 2.5 0.00 \n",
"99998 2.5 0.00 \n",
"99999 2.5 1.75 \n",
"\n",
"[100000 rows x 19 columns]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#This part is for testing\n",
"\n",
"\n",
"# Creating batches of 100,000 for the paraquet file\n",
"batches_iter = file.iter_batches(batch_size=100000)\n",
"batches_iter\n",
"\n",
"# Take the first batch for testing\n",
"df = next(batches_iter).to_pandas()\n",
"df\n",
"\n",
"# Creating just the table in postgres\n",
"#df.head(0).to_sql(name='ny_taxi_data',con=engine, if_exists='replace')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fdda025",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-04T00:08:07.651559Z",
"start_time": "2023-12-04T00:02:35.940526Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"inserting batch 1...\n",
"inserted! time taken 12.916 seconds.\n",
"\n",
"inserting batch 2...\n",
"inserted! time taken 11.782 seconds.\n",
"\n",
"inserting batch 3...\n",
"inserted! time taken 11.854 seconds.\n",
"\n",
"inserting batch 4...\n",
"inserted! time taken 11.753 seconds.\n",
"\n",
"inserting batch 5...\n",
"inserted! time taken 12.034 seconds.\n",
"\n",
"inserting batch 6...\n",
"inserted! time taken 11.742 seconds.\n",
"\n",
"inserting batch 7...\n",
"inserted! time taken 12.351 seconds.\n",
"\n",
"inserting batch 8...\n",
"inserted! time taken 11.052 seconds.\n",
"\n",
"inserting batch 9...\n",
"inserted! time taken 12.167 seconds.\n",
"\n",
"inserting batch 10...\n",
"inserted! time taken 12.335 seconds.\n",
"\n",
"inserting batch 11...\n",
"inserted! time taken 11.375 seconds.\n",
"\n",
"inserting batch 12...\n",
"inserted! time taken 10.937 seconds.\n",
"\n",
"inserting batch 13...\n",
"inserted! time taken 12.208 seconds.\n",
"\n",
"inserting batch 14...\n",
"inserted! time taken 11.542 seconds.\n",
"\n",
"inserting batch 15...\n",
"inserted! time taken 11.460 seconds.\n",
"\n",
"inserting batch 16...\n",
"inserted! time taken 11.868 seconds.\n",
"\n",
"inserting batch 17...\n",
"inserted! time taken 11.162 seconds.\n",
"\n",
"inserting batch 18...\n",
"inserted! time taken 11.774 seconds.\n",
"\n",
"inserting batch 19...\n",
"inserted! time taken 11.772 seconds.\n",
"\n",
"inserting batch 20...\n",
"inserted! time taken 10.971 seconds.\n",
"\n",
"inserting batch 21...\n",
"inserted! time taken 11.483 seconds.\n",
"\n",
"inserting batch 22...\n",
"inserted! time taken 11.718 seconds.\n",
"\n",
"inserting batch 23...\n",
"inserted! time taken 11.628 seconds.\n",
"\n",
"inserting batch 24...\n",
"inserted! time taken 11.622 seconds.\n",
"\n",
"inserting batch 25...\n",
"inserted! time taken 11.236 seconds.\n",
"\n",
"inserting batch 26...\n",
"inserted! time taken 11.258 seconds.\n",
"\n",
"inserting batch 27...\n",
"inserted! time taken 11.746 seconds.\n",
"\n",
"inserting batch 28...\n",
"inserted! time taken 10.031 seconds.\n",
"\n",
"inserting batch 29...\n",
"inserted! time taken 5.077 seconds.\n",
"\n",
"Completed! Total time taken was 331.674 seconds for 29 batches.\n"
]
}
],
"source": [
"# Insert values into the table \n",
"t_start = time()\n",
"count = 0\n",
"for batch in file.iter_batches(batch_size=100000):\n",
" count+=1\n",
" batch_df = batch.to_pandas()\n",
" print(f'inserting batch {count}...')\n",
" b_start = time()\n",
" \n",
" batch_df.to_sql(name='ny_taxi_data',con=engine, if_exists='append')\n",
" b_end = time()\n",
" print(f'inserted! time taken {b_end-b_start:10.3f} seconds.\\n')\n",
" \n",
"t_end = time() \n",
"print(f'Completed! Total time taken was {t_end-t_start:10.3f} seconds for {count} batches.') "
]
},
{
"cell_type": "markdown",
"id": "a7c102be",
"metadata": {},
"source": [
"## Extra bit\n",
"\n",
"While trying to do the SQL Refresher, there was a need to add a lookup zones table but the file is in ```.csv``` format. \n",
"\n",
"Let's code to handle both ```.csv``` and ```.paraquet``` files!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a643d171",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-05T20:59:29.236458Z",
"start_time": "2023-12-05T20:59:28.551221Z"
}
},
"outputs": [],
"source": [
"from time import time\n",
"import pandas as pd \n",
"import pyarrow.parquet as pq\n",
"from sqlalchemy import create_engine"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62c9040a",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-05T21:18:11.346552Z",
"start_time": "2023-12-05T21:18:11.337475Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'yellow_tripdata_2023-09.parquet'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = 'https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv'\n",
"url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet'\n",
"\n",
"file_name = url.rsplit('/', 1)[-1].strip()\n",
"file_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e495fa96",
"metadata": {
"ExecuteTime": {
"end_time": "2023-12-05T21:18:33.001561Z",
"start_time": "2023-12-05T21:18:32.844872Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"oh yea\n"
]
}
],
"source": [
"if '.csv' in file_name:\n",
" print('yay') \n",
" df = pd.read_csv(file_name, nrows=10)\n",
" df_iter = pd.read_csv(file_name, iterator=True, chunksize=100000)\n",
"elif '.parquet' in file_name:\n",
" print('oh yea')\n",
" file = pq.ParquetFile(file_name)\n",
" df = next(file.iter_batches(batch_size=10)).to_pandas()\n",
" df_iter = file.iter_batches(batch_size=100000)\n",
"else: \n",
" print('Error. Only .csv or .parquet files allowed.')\n",
" sys.exit() "
]
},
{
"cell_type": "markdown",
"id": "7556748f",
"metadata": {},
"source": [
"This code is a rough code and seems to be working. The cleaned up version will be in `data-loading-parquet.py` file."
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,86 @@
# Cleaned-up version of data-loading.ipynb
import argparse, os, sys
from time import time

import pandas as pd
import pyarrow.parquet as pq
from sqlalchemy import create_engine


def main(params):
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    tb = params.tb
    url = params.url

    # Get the name of the file from the url
    file_name = url.rsplit('/', 1)[-1].strip()
    print(f'Downloading {file_name} ...')
    # Download the file from the url
    os.system(f'curl {url.strip()} -o {file_name}')
    print('\n')

    # Create SQL engine
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    # Read the file, depending on whether it is csv or parquet
    if '.csv' in file_name:
        df = pd.read_csv(file_name, nrows=10)
        df_iter = pd.read_csv(file_name, iterator=True, chunksize=100000)
    elif '.parquet' in file_name:
        file = pq.ParquetFile(file_name)
        df = next(file.iter_batches(batch_size=10)).to_pandas()
        df_iter = file.iter_batches(batch_size=100000)
    else:
        print('Error. Only .csv or .parquet files allowed.')
        sys.exit()

    # Create the table
    df.head(0).to_sql(name=tb, con=engine, if_exists='replace')

    # Insert values
    t_start = time()
    count = 0
    for batch in df_iter:
        count += 1

        if '.parquet' in file_name:
            batch_df = batch.to_pandas()
        else:
            batch_df = batch

        print(f'inserting batch {count}...')
        b_start = time()
        batch_df.to_sql(name=tb, con=engine, if_exists='append')
        b_end = time()
        print(f'inserted! time taken {b_end-b_start:10.3f} seconds.\n')

    t_end = time()
    print(f'Completed! Total time taken was {t_end-t_start:10.3f} seconds for {count} batches.')


if __name__ == '__main__':
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description='Load data from a .csv or .parquet file URL into a Postgres database.')
    parser.add_argument('--user', help='Username for Postgres.')
    parser.add_argument('--password', help='Password to the username for Postgres.')
    parser.add_argument('--host', help='Hostname for Postgres.')
    parser.add_argument('--port', help='Port for Postgres connection.')
    parser.add_argument('--db', help='Database name for Postgres.')
    parser.add_argument('--tb', help='Destination table name for Postgres.')
    parser.add_argument('--url', help='URL of the .csv or .parquet file.')
    args = parser.parse_args()
    main(args)
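
For reference, with the module's Postgres running, the script (presumably saved as the `data-loading-parquet.py` referenced above) can be invoked along the lines of `python data-loading-parquet.py --user=root --password=root --host=localhost --port=5432 --db=ny_taxi --tb=ny_taxi_data --url=https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet`; the credentials, database, and table name here mirror the values used in the notebook, so adjust them to your own setup.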

View File

@ -1,6 +1,6 @@
# Introduction
* [Video](https://www.youtube.com/watch?v=-zpVha7bw5A)
* [![](https://markdown-videos-api.jorgenkh.no/youtube/AtRhA-NfS24)](https://www.youtube.com/watch?v=AtRhA-NfS24&list=PL3MmuxUbc_hKihpnNQ9qtTmWYy26bPrSb&index=3)
* [Slides](https://www.slideshare.net/AlexeyGrigorev/data-engineering-zoomcamp-introduction)
* Overview of [Architecture](https://github.com/DataTalksClub/data-engineering-zoomcamp#overview), [Technologies](https://github.com/DataTalksClub/data-engineering-zoomcamp#technologies) & [Pre-Requisites](https://github.com/DataTalksClub/data-engineering-zoomcamp#prerequisites)
@ -15,46 +15,65 @@ if you have troubles setting up the environment and following along with the vid
[Code](2_docker_sql)
## :movie_camera: [Introduction to Docker](https://www.youtube.com/watch?v=EYNwNlOrpr0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
## :movie_camera: Introduction to Docker
[![](https://markdown-videos-api.jorgenkh.no/youtube/EYNwNlOrpr0)](https://youtu.be/EYNwNlOrpr0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=4)
* Why do we need Docker
* Creating a simple "data pipeline" in Docker
## :movie_camera: [Ingesting NY Taxi Data to Postgres](https://www.youtube.com/watch?v=2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
## :movie_camera: Ingesting NY Taxi Data to Postgres
[![](https://markdown-videos-api.jorgenkh.no/youtube/2JM-ziJt0WI)](https://youtu.be/2JM-ziJt0WI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5)
* Running Postgres locally with Docker
* Using `pgcli` for connecting to the database
* Exploring the NY Taxi dataset
* Ingesting the data into the database
* **Note** if you have problems with `pgcli`, check [this video](https://www.youtube.com/watch?v=3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) for an alternative way to connect to your database
## :movie_camera: [Connecting pgAdmin and Postgres](https://www.youtube.com/watch?v=hCAIVe9N0ow&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
> [!TIP]
>If you have problems with `pgcli`, check this video for an alternative way to connect to your database from a Jupyter notebook with pandas.
>
> [![](https://markdown-videos-api.jorgenkh.no/youtube/3IkfkTwqHx4)](https://youtu.be/3IkfkTwqHx4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=6)
## :movie_camera: Connecting pgAdmin and Postgres
[![](https://markdown-videos-api.jorgenkh.no/youtube/hCAIVe9N0ow)](https://youtu.be/hCAIVe9N0ow&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=7)
* The pgAdmin tool
* Docker networks
Note: The UI for PgAdmin 4 has changed, please follow the below steps for creating a server:
* After login to PgAdmin, right click Servers in the left sidebar.
* Click on Register.
* Click on Server.
* The remaining steps to create a server are the same as in the videos.
> [!IMPORTANT]
>The UI for pgAdmin 4 has changed; please follow the steps below to create a server:
>
>* After logging in to pgAdmin, right-click Servers in the left sidebar.
>* Click on Register.
>* Click on Server.
>* The remaining steps to create a server are the same as in the videos.
## :movie_camera: [Putting the ingestion script into Docker](https://www.youtube.com/watch?v=B1WwATwf-vY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
## :movie_camera: Putting the ingestion script into Docker
[![](https://markdown-videos-api.jorgenkh.no/youtube/B1WwATwf-vY)](https://youtu.be/B1WwATwf-vY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=8)
* Converting the Jupyter notebook to a Python script
* Parametrizing the script with argparse
* Dockerizing the ingestion script
## :movie_camera: [Running Postgres and pgAdmin with Docker-Compose](https://www.youtube.com/watch?v=hKI6PkPhpa0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
## :movie_camera: Running Postgres and pgAdmin with Docker-Compose
[![](https://markdown-videos-api.jorgenkh.no/youtube/hKI6PkPhpa0)](https://youtu.be/hKI6PkPhpa0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=9)
* Why do we need Docker-compose
* Docker-compose YAML file
* Running multiple containers with `docker-compose up`
## :movie_camera: [SQL refresher](https://www.youtube.com/watch?v=QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
## :movie_camera: SQL refresher
[![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)
* Adding the Zones table
* Inner joins
@ -62,9 +81,12 @@ Note: The UI for PgAdmin 4 has changed, please follow the below steps for creati
* Left, Right and Outer joins
* Group by
## :movie_camera: Optional: Docker Networing and Port Mapping
## :movie_camera: Optional: Docker Networking and Port Mapping
Optional: If you have some problems with docker networking, check [Port Mapping and Networks in Docker](https://www.youtube.com/watch?v=tOr4hTsHOzU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
> [!TIP]
> Optional: If you have some problems with docker networking, check **Port Mapping and Networks in Docker video**.
[![](https://markdown-videos-api.jorgenkh.no/youtube/tOr4hTsHOzU)](https://youtu.be/tOr4hTsHOzU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5)
* Docker networks
* Port forwarding to the host environment
@ -73,33 +95,38 @@ Optional: If you have some problems with docker networking, check [Port Mapping
## :movie_camera: Optional: Walk-Through on WSL
Optional: If you are willing to do the steps from "Ingesting NY Taxi Data to Postgres" till "Running Postgres and pgAdmin with Docker-Compose" with Windows Subsystem Linux please check [Docker Module Walk-Through on WSL](https://www.youtube.com/watch?v=Mv4zFm2AwzQ)
> [!TIP]
> Optional: If you want to do the steps from "Ingesting NY Taxi Data to Postgres" through "Running Postgres and pgAdmin with Docker-Compose" on Windows Subsystem for Linux, please check **Docker Module Walk-Through on WSL**.
[![](https://markdown-videos-api.jorgenkh.no/youtube/Mv4zFm2AwzQ)](https://youtu.be/Mv4zFm2AwzQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=33)
# GCP
## :movie_camera: Introduction to GCP (Google Cloud Platform)
[Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
[![](https://markdown-videos-api.jorgenkh.no/youtube/18jIzE41fJ4)](https://youtu.be/18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3)
# Terraform
[Code](1_terraform_gcp)
## :movie_camera: Introduction Terraform: Concepts and Overview
## :movie_camera: Introduction Terraform: Concepts and Overview, a primer
[![](https://markdown-videos-api.jorgenkh.no/youtube/s2bOYDCKl_M)](https://youtu.be/s2bOYDCKl_M&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=11)
* [Video](https://youtu.be/s2bOYDCKl_M)
* [Companion Notes](1_terraform_gcp)
## :movie_camera: Terraform Basics: Simple one file Terraform Deployment
* [Video](https://youtu.be/Y2ux7gq3Z0o)
[![](https://markdown-videos-api.jorgenkh.no/youtube/Y2ux7gq3Z0o)](https://youtu.be/Y2ux7gq3Z0o&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=12)
* [Companion Notes](1_terraform_gcp)
## :movie_camera: Deployment with a Variables File
* [Video](https://youtu.be/PBi0hHjLftk)
[![](https://markdown-videos-api.jorgenkh.no/youtube/PBi0hHjLftk)](https://youtu.be/PBi0hHjLftk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=13)
* [Companion Notes](1_terraform_gcp)
## Configuring terraform and GCP SDK on Windows
@ -115,17 +142,18 @@ For the course you'll need:
* Google Cloud SDK
* Docker with docker-compose
* Terraform
* Git account
If you have problems setting up the env, you can check these videos
## :movie_camera: GitHub Codespaces
[Preparing the environment with GitHub Codespaces](https://www.youtube.com/watch?v=XOSUt8Ih3zA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
> [!NOTE]
>If you have problems setting up the environment, you can check these videos.
>
>If you already have a working coding environment on your local machine, these are optional, and you only need to pick one of the methods. But if you have time to learn one now, it will be helpful if your local environment suddenly stops working one day.
## :movie_camera: GCP Cloud VM
[Setting up the environment on cloud VM](https://www.youtube.com/watch?v=ae-CV2KfoN0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
### Setting up the environment on cloud VM
[![](https://markdown-videos-api.jorgenkh.no/youtube/ae-CV2KfoN0)](https://youtu.be/ae-CV2KfoN0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=14)
* Generating SSH keys
* Creating a virtual machine on GCP
* Connecting to the VM with SSH
@ -140,6 +168,12 @@ If you have problems setting up the env, you can check these videos
* Using `sftp` for putting the credentials to the remote machine
* Shutting down and removing the instance
## :movie_camera: GitHub Codespaces
### Preparing the environment with GitHub Codespaces
[![](https://markdown-videos-api.jorgenkh.no/youtube/XOSUt8Ih3zA)](https://youtu.be/XOSUt8Ih3zA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=15)
# Homework
* [Homework](../cohorts/2024/01-docker-terraform/homework.md)
@ -169,6 +203,9 @@ Did you take notes? You can share them here
* Notes on [Docker, Docker Compose, and setting up a proper Python environment](https://medium.com/@verazabeida/zoomcamp-2023-week-1-f4f94cb360ae), by Vera
* [Setting up the development environment on Google Virtual Machine](https://itsadityagupta.hashnode.dev/setting-up-the-development-environment-on-google-virtual-machine), blog post by Aditya Gupta
* [Notes from Zharko Cekovski](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-1-postgres-docker-and-ingestion-scripts/)
* [2024 Module Walkthough video by ellacharmed on youtube](https://youtu.be/VUZshlVAnk4)
* [2024 Module-01 Walkthrough video by ellacharmed on youtube](https://youtu.be/VUZshlVAnk4)
* [2024 Companion Module Walkthrough slides by ellacharmed](https://github.com/ellacharmed/data-engineering-zoomcamp/blob/ella2024/cohorts/2024/01-docker-terraform/walkthrough-01.pdf)
* Add your notes here
* [2024 Module-01 Environment setup video by ellacharmed on youtube](https://youtu.be/Zce_Hd37NGs)
* [Docker Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1a-docker_sql/readme.md) • [Terraform Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1b-terraform_gcp/readme.md)
* [Notes from Hammad Tariq](https://github.com/hamad-tariq/HammadTariq-ZoomCamp2024/blob/9c8b4908416eb8cade3d7ec220e7664c003e9b11/week_1_basics_n_setup/README.md)
* Add your notes above this line

View File

@ -1,7 +1,7 @@
> If you're looking for Airflow videos from the 2022 edition,
> check the [2022 cohort folder](../cohorts/2022/week_2_data_ingestion/). <br>
> If you're looking for Prefect videos from the 2023 edition,
> check the [2023 cohort folder](../cohorts/2023/week_2_data_ingestion/).
> [!NOTE]
>If you're looking for Airflow videos from the 2022 edition, check the [2022 cohort folder](../cohorts/2022/week_2_data_ingestion/).
>
>If you're looking for Prefect videos from the 2023 edition, check the [2023 cohort folder](../cohorts/2023/week_2_data_ingestion/).
# Week 2: Workflow Orchestration
@ -18,9 +18,8 @@ This week, you'll learn how to use the Mage platform to author and share _magica
* [2.2.5 - 🔍 ETL: GCS to BigQuery](#225----etl-gcs-to-bigquery)
* [2.2.6 - 👨‍💻 Parameterized Execution](#226----parameterized-execution)
* [2.2.7 - 🤖 Deployment (Optional)](#227----deployment-optional)
* [2.2.8 - 🧱 Advanced Blocks (Optional)](#228----advanced-blocks-optional)
* [2.2.9 - 🗒️ Homework](#229---%EF%B8%8F-homework)
* [2.2.10 - 👣 Next Steps](#2210----next-steps)
* [2.2.8 - 🗒️ Homework](#228----homework)
* [2.2.9 - 👣 Next Steps](#229----next-steps)
## 📕 Course Resources
@ -29,7 +28,9 @@ This week, you'll learn how to use the Mage platform to author and share _magica
In this section, we'll cover the basics of workflow orchestration. We'll discuss what it is, why it's important, and how it can be used to build data pipelines.
Videos
- 2.2.1a - [What is Orchestration?](https://www.youtube.com/watch?v=Li8-MWHhTbo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.1a - What is Orchestration?
[![](https://markdown-videos-api.jorgenkh.no/youtube/Li8-MWHhTbo)](https://youtu.be/Li8-MWHhTbo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=17)
Resources
- [Slides](https://docs.google.com/presentation/d/17zSxG5Z-tidmgY-9l7Al1cPmz4Slh4VPK6o2sryFYvw/)
@ -39,10 +40,17 @@ Resources
In this section, we'll introduce the Mage platform. We'll cover what makes Mage different from other orchestrators, the fundamental concepts behind Mage, and how to get started. To cap it off, we'll spin Mage up via Docker 🐳 and run a simple pipeline.
Videos
- 2.2.2a - [What is Mage?](https://www.youtube.com/watch?v=AicKRcK3pa4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
-
- 2.2.2b - [Configuring Mage](https://www.youtube.com/watch?v=tNiV7Wp08XE?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.2c - [A Simple Pipeline](https://www.youtube.com/watch?v=stI-gg4QBnI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.2a - What is Mage?
[![](https://markdown-videos-api.jorgenkh.no/youtube/AicKRcK3pa4)](https://youtu.be/AicKRcK3pa4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=18)
- 2.2.2b - Configuring Mage
[![](https://markdown-videos-api.jorgenkh.no/youtube/tNiV7Wp08XE)](https://youtu.be/tNiV7Wp08XE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19)
- 2.2.2c - A Simple Pipeline
[![](https://markdown-videos-api.jorgenkh.no/youtube/stI-gg4QBnI)](https://youtu.be/stI-gg4QBnI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=20)
Resources
- [Getting Started Repo](https://github.com/mage-ai/mage-zoomcamp)
@ -53,12 +61,13 @@ Resources
Hooray! Mage is up and running. Now, let's build a _real_ pipeline. In this section, we'll build a simple ETL pipeline that loads data from an API into a Postgres database. Our database will be built using Docker— it will be running locally, but it's the same as if it were running in the cloud.
Videos
- 2.2.3a - [Configuring Postgres](https://www.youtube.com/watch?v=pmhI-ezd3BE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.3b - [Writing an ETL Pipeline](https://www.youtube.com/watch?v=Maidfe7oKLs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.3a - Configuring Postgres
Resources
- [Taxi Dataset](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz)
- [Sample loading block](https://github.com/mage-ai/mage-zoomcamp/blob/solutions/magic-zoomcamp/data_loaders/load_nyc_taxi_data.py)
[![](https://markdown-videos-api.jorgenkh.no/youtube/pmhI-ezd3BE)](https://youtu.be/pmhI-ezd3BE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=21)
- 2.2.3b - Writing an ETL Pipeline : API to postgres
[![](https://markdown-videos-api.jorgenkh.no/youtube/Maidfe7oKLs)](https://youtu.be/Maidfe7oKLs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=22)
### 2.2.4 - 🤓 ETL: API to GCS
@ -68,26 +77,39 @@ Ok, so we've written data _locally_ to a database, but what about the cloud? In
We'll cover both writing _partitioned_ and _unpartitioned_ data to GCS and discuss _why_ you might want to do one over the other. Many data teams start with extracting data from a source and writing it to a data lake _before_ loading it to a structured data source, like a database.
Videos
- 2.2.4a - [Configuring GCP](https://www.youtube.com/watch?v=00LP360iYvE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.4b - [Writing an ETL Pipeline](https://www.youtube.com/watch?v=w0XmcASRUnc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.4a - Configuring GCP
[![](https://markdown-videos-api.jorgenkh.no/youtube/00LP360iYvE)](https://youtu.be/00LP360iYvE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=23)
- 2.2.4b - Writing an ETL Pipeline : API to GCS
[![](https://markdown-videos-api.jorgenkh.no/youtube/w0XmcASRUnc)](https://youtu.be/w0XmcASRUnc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=24)
Resources
- [DTC Zoomcamp GCP Setup](../week_1_basics_n_setup/1_terraform_gcp/2_gcp_overview.md)
- [DTC Zoomcamp GCP Setup](../01-docker-terraform/1_terraform_gcp/2_gcp_overview.md)
### 2.2.5 - 🔍 ETL: GCS to BigQuery
Now that we've written data to GCS, let's load it into BigQuery. In this section, we'll walk through the process of using Mage to load our data from GCS to BigQuery. This closely mirrors a very common data engineering workflow: loading data from a data lake into a data warehouse.
Videos
- 2.2.5a - [Writing an ETL Pipeline](https://www.youtube.com/watch?v=JKp_uzM-XsM)
- 2.2.5a - Writing an ETL Pipeline : GCS to BigQuery
[![](https://markdown-videos-api.jorgenkh.no/youtube/JKp_uzM-XsM)](https://youtu.be/JKp_uzM-XsM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=25)
### 2.2.6 - 👨‍💻 Parameterized Execution
By now you're familiar with building pipelines, but what about adding parameters? In this video, we'll discuss some built-in runtime variables that exist in Mage and show you how to define your own! We'll also cover how to use these variables to parameterize your pipelines. Finally, we'll talk about what it means to *backfill* a pipeline and how to do it in Mage.
Videos
- 2.2.6a - [Parameterized Execution](https://www.youtube.com/watch?v=H0hWjWxB-rg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.6b - [Backfills](https://www.youtube.com/watch?v=ZoeC6Ag5gQc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.6a - Parameterized Execution
[![](https://markdown-videos-api.jorgenkh.no/youtube/H0hWjWxB-rg)](https://youtu.be/H0hWjWxB-rg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=26)
- 2.2.6b - Backfills
[![](https://markdown-videos-api.jorgenkh.no/youtube/ZoeC6Ag5gQc)](https://youtu.be/ZoeC6Ag5gQc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=27)
Resources
- [Mage Variables Overview](https://docs.mage.ai/development/variables/overview)
@ -98,10 +120,21 @@ Resources
In this section, we'll cover deploying Mage using Terraform and Google Cloud. This section is optional— it's not *necessary* to learn Mage, but it might be helpful if you're interested in creating a fully deployed project. If you're using Mage in your final project, you'll need to deploy it to the cloud.
Videos
- 2.2.7a - [Deployment Prerequisites](https://www.youtube.com/watch?v=zAwAX5sxqsg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.7b - [Google Cloud Permissions](https://www.youtube.com/watch?v=O_H7DCmq2rA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.7c - [Deploying to Google Cloud - Part 1](https://www.youtube.com/watch?v=9A872B5hb_0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.7d - [Deploying to Google Cloud - Part 2](https://www.youtube.com/watch?v=0YExsb2HgLI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- 2.2.7a - Deployment Prerequisites
[![](https://markdown-videos-api.jorgenkh.no/youtube/zAwAX5sxqsg)](https://youtu.be/zAwAX5sxqsg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=28)
- 2.2.7b - Google Cloud Permissions
[![](https://markdown-videos-api.jorgenkh.no/youtube/O_H7DCmq2rA)](https://youtu.be/O_H7DCmq2rA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=29)
- 2.2.7c - Deploying to Google Cloud - Part 1
[![](https://markdown-videos-api.jorgenkh.no/youtube/9A872B5hb_0)](https://youtu.be/9A872B5hb_0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=30)
- 2.2.7d - Deploying to Google Cloud - Part 2
[![](https://markdown-videos-api.jorgenkh.no/youtube/0YExsb2HgLI)](https://youtu.be/0YExsb2HgLI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=31)
Resources
- [Installing Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli)
@ -121,7 +154,9 @@ We've prepared a short exercise to test you on what you've learned this week. Yo
Congratulations! You've completed Week 2 of the Data Engineering Zoomcamp. We hope you've enjoyed learning about Mage and that you're excited to use it in your final project. If you have any questions, feel free to reach out to us on Slack. Be sure to check out our "Next Steps" video for some inspiration for the rest of your journey 😄.
Videos
- 2.2.9a - [Next Steps](https://www.youtube.com/watch?v=uUtj7N0TleQ)
- 2.2.9 - Next Steps
[![](https://markdown-videos-api.jorgenkh.no/youtube/uUtj7N0TleQ)](https://youtu.be/uUtj7N0TleQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=32)
Resources
- [Slides](https://docs.google.com/presentation/d/1yN-e22VNwezmPfKrZkgXQVrX5owDb285I2HxHWgmAEQ/edit#slide=id.g262fb0d2905_0_12)
@ -139,6 +174,11 @@ Did you take notes? You can share them here:
## 2024 notes
* [2024 Videos transcripts week 2](https://drive.google.com/drive/folders/1yxT0uMMYKa6YOxanh91wGqmQUMS7yYW7?usp=sharing) by Maria Fisher
* [Notes from Jonah Oliver](https://www.jonahboliver.com/blog/de-zc-w2)
* [Notes from Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/2-workflow-orchestration/readme.md)
* [Notes from Kirill](https://github.com/kirill505/data-engineering-zoomcamp/blob/main/02-workflow-orchestration/README.md)
* [Notes from Zharko](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-2-ingesting-data-with-mage/)
* Add your notes above this line
## 2023 notes

View File

@ -7,26 +7,34 @@
## Data Warehouse
- [Data Warehouse and BigQuery](https://www.youtube.com/watch?v=jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- Data Warehouse and BigQuery
[![](https://markdown-videos-api.jorgenkh.no/youtube/jrHljAoD6nM)](https://youtu.be/jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)
## :movie_camera: Partitioning and clustering
- [Partioning and Clustering](https://www.youtube.com/watch?v=jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- [Partioning vs Clustering](https://www.youtube.com/watch?v=-CqXf7vhhDs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- Partitioning and Clustering
[![](https://markdown-videos-api.jorgenkh.no/youtube/-CqXf7vhhDs)](https://youtu.be/-CqXf7vhhDs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35)
- Partitioning vs Clustering
[![](https://markdown-videos-api.jorgenkh.no/youtube/-CqXf7vhhDs)](https://youtu.be/-CqXf7vhhDs?si=p1sYQCAs8dAa7jIm&t=193&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35)
## :movie_camera: Best practices
- [BigQuery Best Practices](https://www.youtube.com/watch?v=k81mLJVX08w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
[![](https://markdown-videos-api.jorgenkh.no/youtube/k81mLJVX08w)](https://youtu.be/k81mLJVX08w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=36)
## :movie_camera: Internals of BigQuery
- [Internals of Big Query](https://www.youtube.com/watch?v=eduHi1inM4s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
[![](https://markdown-videos-api.jorgenkh.no/youtube/eduHi1inM4s)](https://youtu.be/eduHi1inM4s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=37)
## Advanced topics
### :movie_camera: Machine Learning in Big Query
* [BigQuery Machine Learning](https://www.youtube.com/watch?v=B-WtpB0PuG4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
[![](https://markdown-videos-api.jorgenkh.no/youtube/B-WtpB0PuG4)](https://youtu.be/B-WtpB0PuG4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)
* [SQL for ML in BigQuery](big_query_ml.sql)
**Important links**
@ -36,9 +44,10 @@
- [Hyper Parameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm)
- [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview)
### :movie_camera: Deploying ML model
### :movie_camera: Deploying a Machine Learning model from BigQuery
[![](https://markdown-videos-api.jorgenkh.no/youtube/BjARzEWaznU)](https://youtu.be/BjARzEWaznU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=39)
- [BigQuery Machine Learning Deployment](https://www.youtube.com/watch?v=BjARzEWaznU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
- [Steps to extract and deploy model with docker](extract_model.md)
@ -61,4 +70,11 @@ Did you take notes? You can share them here.
* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_3_data_warehouse/notes/notes_week_03.md)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week3.md)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [2024 videos transcript week3](https://drive.google.com/drive/folders/1quIiwWO-tJCruqvtlqe_Olw8nvYSmmDJ?usp=sharing) by Maria Fisher
* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/3a-data-warehouse/readme.md)
* [Jonah Oliver's blog post](https://www.jonahboliver.com/blog/de-zc-w3)
* [2024 - steps to send data from Mage to GCS + creating external table](https://drive.google.com/file/d/1GIi6xnS4070a8MUlIg-ozITt485_-ePB/view?usp=drive_link) by Maria Fisher
* [2024 - mage dataloader script to load the parquet files from a remote URL and push it to Google bucket as parquet file](https://github.com/amohan601/dataengineering-zoomcamp2024/blob/main/week_3_data_warehouse/mage_scripts/green_taxi_2022_v2.py) by Anju Mohan
* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/03-data-warehouse/README.md)
* Add your notes here (above this line)

View File

@ -11,27 +11,25 @@ By this stage of the course you should have already:
* Green taxi data - Years 2019 and 2020
* fhv data - Year 2019.
Note:
* A quick hack has been shared to load that data quicker, check instructions in [week3/extras](../03-data-warehouse/extras)
* If you receive an error stating "Permission denied while globbing file pattern." when attempting to run fact_trips.sql this [Video](https://www.youtube.com/watch?v=kL3ZVNL9Y4A&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) may be helpful in resolving the issue
> [!NOTE]
> * We have two quick hacks to load that data quicker: follow [this video](https://www.youtube.com/watch?v=Mork172sK_c&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs) for option 1, or check the instructions in [week3/extras](../03-data-warehouse/extras) for option 2
## Setting up your environment
> [!NOTE]
> The *cloud* setup is the preferred option.
>
> The *local* setup does not require a cloud database.
### Setting up dbt for using BigQuery (Alternative A - preferred)
| Alternative A | Alternative B |
---|---|
| Setting up dbt for using BigQuery (cloud) | Setting up dbt for using Postgres locally |
|- Open a free developer dbt cloud account following [this link](https://www.getdbt.com/signup/)|- Open a free developer dbt cloud account following [this link](https://www.getdbt.com/signup/)<br><br> |
| - [Follow these instructions to connect to your BigQuery instance](https://docs.getdbt.com/guides/bigquery?step=4) | - follow the [official dbt documentation](https://docs.getdbt.com/docs/core/installation-overview) or <br>- follow the [dbt core with BigQuery on Docker](docker_setup/README.md) guide to set up dbt locally on Docker or <br>- use a Docker image from the official [Install with Docker](https://docs.getdbt.com/docs/core/docker-install) guide. |
|- More detailed instructions in [dbt_cloud_setup.md](dbt_cloud_setup.md) | - You will need to install the latest version with the BigQuery adapter (dbt-bigquery).|
| | - You will need to install the latest version with the postgres adapter (dbt-postgres).|
| | After local installation you will have to set up the connection to PG in the `profiles.yml`, you can find the templates [here](https://docs.getdbt.com/docs/core/connect-data-platform/postgres-setup) |
1. Open a free developer dbt cloud account following [this link](https://www.getdbt.com/signup/)
2. [Follow these instructions to connect to your BigQuery instance](https://docs.getdbt.com/guides/bigquery?step=4). More detailed instructions are in [dbt_cloud_setup.md](dbt_cloud_setup.md)
_Optional_: If you feel more comfortable developing locally, you could use a local installation of dbt core. You can follow the [official dbt documentation](https://docs.getdbt.com/docs/core/installation-overview) or follow the [dbt core with BigQuery on Docker](docker_setup/README.md) guide to set up dbt locally on Docker. You will need to install the latest version with the BigQuery adapter (dbt-bigquery).
### Setting up dbt for using Postgres locally (Alternative B)
As an alternative to the cloud setup, which requires a cloud database, you can run the project by installing dbt locally.
You can follow the [official dbt documentation](https://docs.getdbt.com/dbt-cli/installation) or use a Docker image from the official [dbt repo](https://github.com/dbt-labs/dbt/). You will need to install the latest version with the Postgres adapter (dbt-postgres).
After local installation, you will have to set up the connection to Postgres in `profiles.yml`; you can find the templates [here](https://docs.getdbt.com/docs/core/connect-data-platform/postgres-setup)
</details>
## Content
@ -41,30 +39,21 @@ After local installation you will have to set up the connection to PG in the `pr
* ETL vs ELT
* Data modeling concepts (fact and dim tables)
:movie_camera: [Video](https://www.youtube.com/watch?v=uF76d5EmdtU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=32)
[![](https://markdown-videos-api.jorgenkh.no/youtube/uF76d5EmdtU)](https://youtu.be/uF76d5EmdtU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=40)
### What is dbt?
* Intro to dbt
* Introduction to dbt
:movie_camera: [Video](https://www.youtube.com/watch?v=4eCouvVOJUw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=33)
[![](https://markdown-videos-api.jorgenkh.no/youtube/4eCouvVOJUw)](https://www.youtube.com/watch?v=gsKuETFJr54&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=5)
## Starting a dbt project
### Alternative A: Using BigQuery + dbt cloud
* Starting a new project with dbt init (dbt cloud and core)
* dbt cloud setup
* project.yml
:movie_camera: [Video](https://www.youtube.com/watch?v=iMxh6s_wL4Q&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)
### Alternative B: Using Postgres + dbt core (locally)
* Starting a new project with dbt init (dbt cloud and core)
* dbt core local setup
* profiles.yml
* project.yml
:movie_camera: [Video](https://www.youtube.com/watch?v=1HmL63e-vRs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35)
| Alternative A | Alternative B |
|-----------------------------|--------------------------------|
| Using BigQuery + dbt cloud | Using Postgres + dbt core (locally) |
| - Starting a new project with dbt init (dbt cloud and core)<br>- dbt cloud setup<br>- project.yml<br><br> | - Starting a new project with dbt init (dbt cloud and core)<br>- dbt core local setup<br>- profiles.yml<br>- project.yml |
| [![](https://markdown-videos-api.jorgenkh.no/youtube/iMxh6s_wL4Q)](https://www.youtube.com/watch?v=J0XCDyKiU64&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=4) | [![](https://markdown-videos-api.jorgenkh.no/youtube/1HmL63e-vRs)](https://youtu.be/1HmL63e-vRs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=43) |
### dbt models
@ -75,35 +64,42 @@ After local installation you will have to set up the connection to PG in the `pr
* Packages
* Variables
:movie_camera: [Video](https://www.youtube.com/watch?v=UVI30Vxzd6c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=36)
[![](https://markdown-videos-api.jorgenkh.no/youtube/UVI30Vxzd6c)](https://www.youtube.com/watch?v=ueVy2N54lyc&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=3)
_Note: This video is shown entirely on dbt cloud IDE but the same steps can be followed locally on the IDE of your choice_
> [!NOTE]
> *This video is shown entirely on dbt cloud IDE but the same steps can be followed locally on the IDE of your choice*
> [!TIP]
> * If you receive an error stating "Permission denied while globbing file pattern." when attempting to run `fact_trips.sql`, this video may be helpful in resolving the issue
>
>[![](https://markdown-videos-api.jorgenkh.no/youtube/kL3ZVNL9Y4A)](https://youtu.be/kL3ZVNL9Y4A&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)
### Testing and documenting dbt models
* Tests
* Documentation
:movie_camera: [Video](https://www.youtube.com/watch?v=UishFmq1hLM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=37)
[![](https://markdown-videos-api.jorgenkh.no/youtube/UishFmq1hLM)](https://www.youtube.com/watch?v=2dNJXHFCHaY&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=2)
_Note: This video is shown entirely on dbt cloud IDE but the same steps can be followed locally on the IDE of your choice_
>[!NOTE]
> *This video is shown entirely on dbt cloud IDE but the same steps can be followed locally on the IDE of your choice*
## Deployment
### Alternative A: Using BigQuery + dbt cloud
* Deployment: development environment vs production
* dbt cloud: scheduler, sources and hosted documentation
:movie_camera: [Video](https://www.youtube.com/watch?v=rjf6yZNGX8I&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=38)
### Alternative B: Using Postgres + dbt core (locally)
* Deployment: development environment vs production
* dbt cloud: scheduler, sources and hosted documentation
:movie_camera: [Video](https://www.youtube.com/watch?v=Cs9Od1pcrzM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=39)
| Alternative A | Alternative B |
|-----------------------------|--------------------------------|
| Using BigQuery + dbt cloud | Using Postgres + dbt core (locally) |
| - Deployment: development environment vs production<br>- dbt cloud: scheduler, sources and hosted documentation | - Deployment: development environment vs production<br>- dbt cloud: scheduler, sources and hosted documentation |
| [![](https://markdown-videos-api.jorgenkh.no/youtube/rjf6yZNGX8I)](https://www.youtube.com/watch?v=V2m5C0n8Gro&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=6) | [![](https://markdown-videos-api.jorgenkh.no/youtube/Cs9Od1pcrzM)](https://youtu.be/Cs9Od1pcrzM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=47) |
## Visualising the transformed data
:movie_camera: [Google data studio Video](https://www.youtube.com/watch?v=39nLTs74A3E&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=42)
:movie_camera: [Metabase Video](https://www.youtube.com/watch?v=BnLkrA7a6gM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=43)
:movie_camera: Google Data Studio video (now renamed to Looker Studio)
[![](https://markdown-videos-api.jorgenkh.no/youtube/39nLTs74A3E)](https://youtu.be/39nLTs74A3E&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=48)
:movie_camera: Metabase Video
[![](https://markdown-videos-api.jorgenkh.no/youtube/BnLkrA7a6gM)](https://youtu.be/BnLkrA7a6gM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=49)
## Advanced concepts
@ -133,7 +129,10 @@ Did you take notes? You can share them here.
* [Blog post by Dewi Oktaviani](https://medium.com/@oktavianidewi/de-zoomcamp-2023-learning-week-4-analytics-engineering-with-dbt-53f781803d3e)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%204/Data%20Engineering%20Zoomcamp%20Week%204.ipynb)
*Add your notes here (above this line)*
* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/4-analytics-engineering/readme.md)
* [2024 - Videos transcript week4](https://drive.google.com/drive/folders/1V2sHWOotPEMQTdMT4IMki1fbMPTn3jOP?usp=drive)
* [Blog Post](https://www.jonahboliver.com/blog/de-zc-w4) by Jonah Oliver
* Add your notes here (above this line)
## Useful links
- [Slides used in the videos](https://docs.google.com/presentation/d/1xSll_jv0T8JF4rYZvLHfkJXYqUjPtThA/edit?usp=sharing&ouid=114544032874539580154&rtpof=true&sd=true)

View File

@ -0,0 +1,5 @@
# you shouldn't commit these into source control
# these are the default directory names, adjust/add to fit your needs
target/
dbt_packages/
logs/

View File

@ -0,0 +1,38 @@
Welcome to your new dbt project!
### How to run this project
### About the project
This project is based on the [dbt starter project](https://github.com/dbt-labs/dbt-starter-project) (generated by running `dbt init`)
Try running the following commands:
- dbt run
- dbt test
A project includes the following files:
- dbt_project.yml: file used to configure the dbt project. If you are using dbt locally, make sure the profile here matches the one set up during installation in ~/.dbt/profiles.yml
- *.yml files under the models, data and macros folders: documentation files
- csv files in the data folder: these will be our sources, the files described above
- Files inside the models folder: the SQL files contain the scripts that build our models; they cover staging, core and data mart models. In the end, the models will follow the structure shown below (a minimal illustrative sketch of the layering follows the diagram):
![image](https://user-images.githubusercontent.com/4315804/152691312-e71b56a4-53ff-4884-859c-c9090dbd0db8.png)
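As a rough, hypothetical illustration of that layering (the file and column names below are placeholders, not this project's actual models), a staging model is materialized as a view over a declared source, and a core model builds a table on top of it via `ref`:

```sql
-- models/staging/stg_example_tripdata.sql  (placeholder name)
-- Staging layer: a lightweight, renamed/typed view over the raw source table
{{ config(materialized='view') }}

select
    vendorid,
    lpep_pickup_datetime as pickup_datetime
from {{ source('staging', 'green_tripdata') }}
```

```sql
-- models/core/dim_example.sql  (placeholder name)
-- Core layer: a table that depends on the staging view through ref()
{{ config(materialized='table') }}

select
    vendorid,
    pickup_datetime
from {{ ref('stg_example_tripdata') }}
```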
#### Workflow
![image](https://user-images.githubusercontent.com/4315804/148699280-964c4e0b-e685-4c0f-a266-4f3e097156c9.png)
#### Execution
After having installed the required tools and cloning this repo, execute the following commands:
1. Change into the project's directory from the command line: `$ cd [..]/taxi_rides_ny`
2. Load the CSVs into the database. This materializes the CSVs as tables in your target schema: `$ dbt seed`
3. Run the models: `$ dbt run`
4. Test your data: `$ dbt test`
_Alternative: use `$ dbt build` to execute the 3 steps above with a single command_
5. Generate documentation for the project: `$ dbt docs generate`
6. View the documentation for the project; this step should open the documentation page in a local web server, and it can also be accessed at http://localhost:8080: `$ dbt docs serve`
### dbt resources:
- Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
- Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
- Join the [chat](http://slack.getdbt.com/) on Slack for live discussions and support
- Find [dbt events](https://events.getdbt.com) near you
- Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices

View File

@ -0,0 +1,49 @@
-- MAKE SURE YOU REPLACE taxi-rides-ny-339813-412521 WITH YOUR OWN GCP PROJECT ID!
-- When you run the query, only run 5 of the ALTER TABLE statements at one time (by highlighting only 5).
-- Otherwise BigQuery will say too many alterations to the table are being made.
CREATE TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata` as
SELECT * FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2019`;
CREATE TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata` as
SELECT * FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2019`;
insert into `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
SELECT * FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2020` ;
insert into `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
SELECT * FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2020`;
-- Fixes yellow table schema
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN vendor_id TO VendorID;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN pickup_datetime TO tpep_pickup_datetime;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN dropoff_datetime TO tpep_dropoff_datetime;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN rate_code TO RatecodeID;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN imp_surcharge TO improvement_surcharge;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN pickup_location_id TO PULocationID;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.yellow_tripdata`
RENAME COLUMN dropoff_location_id TO DOLocationID;
-- Fixes green table schema
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN vendor_id TO VendorID;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN pickup_datetime TO lpep_pickup_datetime;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN dropoff_datetime TO lpep_dropoff_datetime;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN rate_code TO RatecodeID;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN imp_surcharge TO improvement_surcharge;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN pickup_location_id TO PULocationID;
ALTER TABLE `taxi-rides-ny-339813-412521.trips_data_all.green_tripdata`
RENAME COLUMN dropoff_location_id TO DOLocationID;

View File

@ -0,0 +1,52 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'taxi_rides_ny'
version: '1.0.0'
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: 'default'
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In dbt, the default materialization for a model is a view. This means, when you run
# dbt run or dbt build, all of your models will be built as a view in your data platform.
# The configuration below will override this setting for models in the example folder to
# instead be materialized as tables. Any models you add to the root of the models folder will
# continue to be built as views. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
taxi_rides_ny:
# Applies to all files under models/.../
staging:
materialized: view
core:
materialized: table
vars:
payment_type_values: [1, 2, 3, 4, 5, 6]
seeds:
taxi_rides_ny:
taxi_zone_lookup:
+column_types:
locationid: numeric

View File

@ -0,0 +1,17 @@
{#
This macro returns the description of the payment_type
#}
{% macro get_payment_type_description(payment_type) -%}
case {{ dbt.safe_cast("payment_type", api.Column.translate_type("integer")) }}
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
else 'EMPTY'
end
{%- endmacro %}

View File

@ -0,0 +1,12 @@
version: 2
macros:
- name: get_payment_type_description
description: >
This macro receives a payment_type and returns the corresponding description.
arguments:
- name: payment_type
type: int
description: >
payment_type value.
Must be one of the accepted values, otherwise the macro will return 'EMPTY'
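For illustration, the macro is called inside a model's select list; given the macro body above, a call like the one below roughly compiles to a plain `case` expression on BigQuery (sketch only; the `payment_type` column is assumed to exist in the selected source):

```sql
-- In a model:
select
    payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ source('staging', 'green_tripdata') }}

-- Roughly what dbt renders it to on BigQuery:
-- case safe_cast(payment_type as INT64)
--     when 1 then 'Credit card'
--     when 2 then 'Cash'
--     when 3 then 'No charge'
--     when 4 then 'Dispute'
--     when 5 then 'Unknown'
--     when 6 then 'Voided trip'
--     else 'EMPTY'
-- end as payment_type_description
```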

View File

@ -0,0 +1,8 @@
{{ config(materialized='table') }}
select
locationid,
borough,
zone,
replace(service_zone,'Boro','Green') as service_zone
from {{ ref('taxi_zone_lookup') }}

View File

@ -0,0 +1,29 @@
{{ config(materialized='table') }}
with trips_data as (
select * from {{ ref('fact_trips') }}
)
select
-- Revenue grouping
pickup_zone as revenue_zone,
{{ dbt.date_trunc("month", "pickup_datetime") }} as revenue_month,
service_type,
-- Revenue calculation
sum(fare_amount) as revenue_monthly_fare,
sum(extra) as revenue_monthly_extra,
sum(mta_tax) as revenue_monthly_mta_tax,
sum(tip_amount) as revenue_monthly_tip_amount,
sum(tolls_amount) as revenue_monthly_tolls_amount,
sum(ehail_fee) as revenue_monthly_ehail_fee,
sum(improvement_surcharge) as revenue_monthly_improvement_surcharge,
sum(total_amount) as revenue_monthly_total_amount,
-- Additional calculations
count(tripid) as total_monthly_trips,
avg(passenger_count) as avg_monthly_passenger_count,
avg(trip_distance) as avg_monthly_trip_distance
from trips_data
group by 1,2,3
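Once this mart is built, it can be queried directly; for example, an illustrative ad-hoc query (the fully qualified table name is a placeholder for wherever dbt materializes the model in your project):

```sql
-- Top 10 pickup zones by total Yellow-taxi revenue in January 2020 (illustrative)
select
    revenue_zone,
    revenue_month,
    service_type,
    revenue_monthly_total_amount,
    total_monthly_trips
from `<your-project>.<your-dataset>.dm_monthly_zone_revenue`   -- placeholder location
where service_type = 'Yellow'
  and extract(year from revenue_month) = 2020
  and extract(month from revenue_month) = 1
order by revenue_monthly_total_amount desc
limit 10;
```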

View File

@ -0,0 +1,56 @@
{{
config(
materialized='table'
)
}}
with green_tripdata as (
select *,
'Green' as service_type
from {{ ref('stg_green_tripdata') }}
),
yellow_tripdata as (
select *,
'Yellow' as service_type
from {{ ref('stg_yellow_tripdata') }}
),
trips_unioned as (
select * from green_tripdata
union all
select * from yellow_tripdata
),
dim_zones as (
select * from {{ ref('dim_zones') }}
where borough != 'Unknown'
)
select trips_unioned.tripid,
trips_unioned.vendorid,
trips_unioned.service_type,
trips_unioned.ratecodeid,
trips_unioned.pickup_locationid,
pickup_zone.borough as pickup_borough,
pickup_zone.zone as pickup_zone,
trips_unioned.dropoff_locationid,
dropoff_zone.borough as dropoff_borough,
dropoff_zone.zone as dropoff_zone,
trips_unioned.pickup_datetime,
trips_unioned.dropoff_datetime,
trips_unioned.store_and_fwd_flag,
trips_unioned.passenger_count,
trips_unioned.trip_distance,
trips_unioned.trip_type,
trips_unioned.fare_amount,
trips_unioned.extra,
trips_unioned.mta_tax,
trips_unioned.tip_amount,
trips_unioned.tolls_amount,
trips_unioned.ehail_fee,
trips_unioned.improvement_surcharge,
trips_unioned.total_amount,
trips_unioned.payment_type,
trips_unioned.payment_type_description
from trips_unioned
inner join dim_zones as pickup_zone
on trips_unioned.pickup_locationid = pickup_zone.locationid
inner join dim_zones as dropoff_zone
on trips_unioned.dropoff_locationid = dropoff_zone.locationid

View File

@ -0,0 +1,129 @@
version: 2
models:
- name: dim_zones
description: >
List of unique zones identified by locationid.
Includes the service zone they correspond to (Green or yellow).
- name: dm_monthly_zone_revenue
description: >
Aggregated table of all taxi trips corresponding to both service zones (Green and yellow) per pickup zone, month and service.
The table contains monthly sums of the fare elements used to calculate the monthly revenue.
The table also contains monthly indicators like number of trips and average trip distance.
columns:
- name: revenue_monthly_total_amount
description: Monthly sum of the total_amount of the fare charged for the trip per pickup zone, month and service.
tests:
- not_null:
severity: error
- name: fact_trips
description: >
Taxi trips corresponding to both service zones (Green and yellow).
The table contains records where both pickup and dropoff locations are valid and known zones.
Each record corresponds to a trip uniquely identified by tripid.
columns:
- name: tripid
data_type: string
description: "unique identifier conformed by the combination of vendorid and pickyp time"
- name: vendorid
data_type: int64
description: ""
- name: service_type
data_type: string
description: ""
- name: ratecodeid
data_type: int64
description: ""
- name: pickup_locationid
data_type: int64
description: ""
- name: pickup_borough
data_type: string
description: ""
- name: pickup_zone
data_type: string
description: ""
- name: dropoff_locationid
data_type: int64
description: ""
- name: dropoff_borough
data_type: string
description: ""
- name: dropoff_zone
data_type: string
description: ""
- name: pickup_datetime
data_type: timestamp
description: ""
- name: dropoff_datetime
data_type: timestamp
description: ""
- name: store_and_fwd_flag
data_type: string
description: ""
- name: passenger_count
data_type: int64
description: ""
- name: trip_distance
data_type: numeric
description: ""
- name: trip_type
data_type: int64
description: ""
- name: fare_amount
data_type: numeric
description: ""
- name: extra
data_type: numeric
description: ""
- name: mta_tax
data_type: numeric
description: ""
- name: tip_amount
data_type: numeric
description: ""
- name: tolls_amount
data_type: numeric
description: ""
- name: ehail_fee
data_type: numeric
description: ""
- name: improvement_surcharge
data_type: numeric
description: ""
- name: total_amount
data_type: numeric
description: ""
- name: payment_type
data_type: int64
description: ""
- name: payment_type_description
data_type: string
description: ""

View File

@ -0,0 +1,199 @@
version: 2
sources:
- name: staging
database: taxi-rides-ny-339813-412521
# For postgres:
#database: production
schema: trips_data_all
# loaded_at_field: record_loaded_at
tables:
- name: green_tripdata
- name: yellow_tripdata
# freshness:
# error_after: {count: 6, period: hour}
models:
- name: stg_green_tripdata
description: >
Trip made by green taxis, also known as boro taxis and street-hail liveries.
Green taxis may respond to street hails, but only in the areas indicated in green on the
map (i.e. above W 110 St/E 96th St in Manhattan and in the boroughs).
The records were collected and provided to the NYC Taxi and Limousine Commission (TLC) by
technology service providers.
columns:
- name: tripid
description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
tests:
- unique:
severity: warn
- not_null:
severity: warn
- name: VendorID
description: >
A code indicating the TPEP provider that provided the record.
1= Creative Mobile Technologies, LLC;
2= VeriFone Inc.
- name: pickup_datetime
description: The date and time when the meter was engaged.
- name: dropoff_datetime
description: The date and time when the meter was disengaged.
- name: Passenger_count
description: The number of passengers in the vehicle. This is a driver-entered value.
- name: Trip_distance
description: The elapsed trip distance in miles reported by the taximeter.
- name: Pickup_locationid
description: locationid where the meter was engaged.
tests:
- relationships:
to: ref('taxi_zone_lookup')
field: locationid
severity: warn
- name: dropoff_locationid
description: locationid where the meter was disengaged.
tests:
- relationships:
to: ref('taxi_zone_lookup')
field: locationid
- name: RateCodeID
description: >
The final rate code in effect at the end of the trip.
1= Standard rate
2=JFK
3=Newark
4=Nassau or Westchester
5=Negotiated fare
6=Group ride
- name: Store_and_fwd_flag
description: >
This flag indicates whether the trip record was held in vehicle
memory before sending to the vendor, aka “store and forward,”
because the vehicle did not have a connection to the server.
Y= store and forward trip
N = not a store and forward trip
- name: Dropoff_longitude
description: Longitude where the meter was disengaged.
- name: Dropoff_latitude
description: Latitude where the meter was disengaged.
- name: Payment_type
description: >
A numeric code signifying how the passenger paid for the trip.
tests:
- accepted_values:
values: "{{ var('payment_type_values') }}"
severity: warn
quote: false
- name: payment_type_description
description: Description of the payment_type code
- name: Fare_amount
description: >
The time-and-distance fare calculated by the meter.
Extra Miscellaneous extras and surcharges. Currently, this only includes
the $0.50 and $1 rush hour and overnight charges.
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered
rate in use.
- name: Improvement_surcharge
description: >
$0.30 improvement surcharge assessed trips at the flag drop. The
improvement surcharge began being levied in 2015.
- name: Tip_amount
description: >
Tip amount. This field is automatically populated for credit card
tips. Cash tips are not included.
- name: Tolls_amount
description: Total amount of all tolls paid in trip.
- name: Total_amount
description: The total amount charged to passengers. Does not include cash tips.
- name: stg_yellow_tripdata
description: >
Trips made by New York City's iconic yellow taxis.
Yellow taxis are the only vehicles permitted to respond to a street hail from a passenger in all five
boroughs. They may also be hailed using an e-hail app like Curb or Arro.
The records were collected and provided to the NYC Taxi and Limousine Commission (TLC) by
technology service providers.
columns:
- name: tripid
description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
tests:
- unique:
severity: warn
- not_null:
severity: warn
- name: VendorID
description: >
A code indicating the TPEP provider that provided the record.
1= Creative Mobile Technologies, LLC;
2= VeriFone Inc.
- name: pickup_datetime
description: The date and time when the meter was engaged.
- name: dropoff_datetime
description: The date and time when the meter was disengaged.
- name: Passenger_count
description: The number of passengers in the vehicle. This is a driver-entered value.
- name: Trip_distance
description: The elapsed trip distance in miles reported by the taximeter.
- name: Pickup_locationid
description: locationid where the meter was engaged.
tests:
- relationships:
to: ref('taxi_zone_lookup')
field: locationid
severity: warn
- name: dropoff_locationid
description: locationid where the meter was disengaged.
tests:
- relationships:
to: ref('taxi_zone_lookup')
field: locationid
severity: warn
- name: RateCodeID
description: >
The final rate code in effect at the end of the trip.
1= Standard rate
2=JFK
3=Newark
4=Nassau or Westchester
5=Negotiated fare
6=Group ride
- name: Store_and_fwd_flag
description: >
This flag indicates whether the trip record was held in vehicle
memory before sending to the vendor, aka “store and forward,”
because the vehicle did not have a connection to the server.
Y= store and forward trip
N= not a store and forward trip
- name: Dropoff_longitude
description: Longitude where the meter was disengaged.
- name: Dropoff_latitude
description: Latitude where the meter was disengaged.
- name: Payment_type
description: >
A numeric code signifying how the passenger paid for the trip.
tests:
- accepted_values:
values: "{{ var('payment_type_values') }}"
severity: warn
quote: false
- name: payment_type_description
description: Description of the payment_type code
- name: Fare_amount
description: >
The time-and-distance fare calculated by the meter.
Extra Miscellaneous extras and surcharges. Currently, this only includes
the $0.50 and $1 rush hour and overnight charges.
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered
rate in use.
- name: Improvement_surcharge
description: >
$0.30 improvement surcharge assessed trips at the flag drop. The
improvement surcharge began being levied in 2015.
- name: Tip_amount
description: >
Tip amount. This field is automatically populated for credit card
tips. Cash tips are not included.
- name: Tolls_amount
description: Total amount of all tolls paid in trip.
- name: Total_amount
description: The total amount charged to passengers. Does not include cash tips.

View File

@ -0,0 +1,52 @@
{{
config(
materialized='view'
)
}}
with tripdata as
(
select *,
row_number() over(partition by vendorid, lpep_pickup_datetime) as rn
from {{ source('staging','green_tripdata') }}
where vendorid is not null
)
select
-- identifiers
{{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
{{ dbt.safe_cast("vendorid", api.Column.translate_type("integer")) }} as vendorid,
{{ dbt.safe_cast("ratecodeid", api.Column.translate_type("integer")) }} as ratecodeid,
{{ dbt.safe_cast("pulocationid", api.Column.translate_type("integer")) }} as pickup_locationid,
{{ dbt.safe_cast("dolocationid", api.Column.translate_type("integer")) }} as dropoff_locationid,
-- timestamps
cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
store_and_fwd_flag,
{{ dbt.safe_cast("passenger_count", api.Column.translate_type("integer")) }} as passenger_count,
cast(trip_distance as numeric) as trip_distance,
{{ dbt.safe_cast("trip_type", api.Column.translate_type("integer")) }} as trip_type,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(ehail_fee as numeric) as ehail_fee,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
coalesce({{ dbt.safe_cast("payment_type", api.Column.translate_type("integer")) }},0) as payment_type,
{{ get_payment_type_description("payment_type") }} as payment_type_description
from tripdata
where rn = 1
-- dbt build --select <model_name> --vars '{"is_test_run": false}'
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
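For orientation, with the default `is_test_run=true` the skeleton of this model roughly compiles to the query below (abridged to two columns; the `source()` call resolves to the database and schema declared in the staging `schema.yml` above, and the `limit 100` is only rendered while the variable is true):

```sql
-- Abridged sketch of the compiled query on BigQuery
with tripdata as (
    select
        *,
        row_number() over (partition by vendorid, lpep_pickup_datetime) as rn
    from `taxi-rides-ny-339813-412521`.`trips_data_all`.`green_tripdata`
    where vendorid is not null
)
select
    safe_cast(vendorid as INT64) as vendorid,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime
from tripdata
where rn = 1
limit 100   -- dropped when run with --vars '{"is_test_run": false}'
```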

View File

@ -0,0 +1,48 @@
{{ config(materialized='view') }}
with tripdata as
(
select *,
row_number() over(partition by vendorid, tpep_pickup_datetime) as rn
from {{ source('staging','yellow_tripdata') }}
where vendorid is not null
)
select
-- identifiers
{{ dbt_utils.generate_surrogate_key(['vendorid', 'tpep_pickup_datetime']) }} as tripid,
{{ dbt.safe_cast("vendorid", api.Column.translate_type("integer")) }} as vendorid,
{{ dbt.safe_cast("ratecodeid", api.Column.translate_type("integer")) }} as ratecodeid,
{{ dbt.safe_cast("pulocationid", api.Column.translate_type("integer")) }} as pickup_locationid,
{{ dbt.safe_cast("dolocationid", api.Column.translate_type("integer")) }} as dropoff_locationid,
-- timestamps
cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
store_and_fwd_flag,
{{ dbt.safe_cast("passenger_count", api.Column.translate_type("integer")) }} as passenger_count,
cast(trip_distance as numeric) as trip_distance,
-- yellow cabs are always street-hail
1 as trip_type,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(0 as numeric) as ehail_fee,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
coalesce({{ dbt.safe_cast("payment_type", api.Column.translate_type("integer")) }},0) as payment_type,
{{ get_payment_type_description('payment_type') }} as payment_type_description
from tripdata
where rn = 1
-- dbt build --select <model_name> --vars '{"is_test_run": false}'
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}

View File

@ -0,0 +1,6 @@
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: dbt-labs/codegen
version: 0.12.1
sha1_hash: d974113b0f072cce35300077208f38581075ab40

View File

@ -0,0 +1,5 @@
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: dbt-labs/codegen
version: 0.12.1

View File

@ -0,0 +1,9 @@
version: 2
seeds:
- name: taxi_zone_lookup
description: >
Taxi Zones are roughly based on NYC Department of City Planning's Neighborhood
Tabulation Areas (NTAs) and are meant to approximate neighborhoods, so you can see which
neighborhood a passenger was picked up in, and which neighborhood they were dropped off in.
Includes associated service_zone (EWR, Boro Zone, Yellow Zone)

View File

@ -0,0 +1,266 @@
"locationid","borough","zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
10,"Queens","Baisley Park","Boro Zone"
11,"Brooklyn","Bath Beach","Boro Zone"
12,"Manhattan","Battery Park","Yellow Zone"
13,"Manhattan","Battery Park City","Yellow Zone"
14,"Brooklyn","Bay Ridge","Boro Zone"
15,"Queens","Bay Terrace/Fort Totten","Boro Zone"
16,"Queens","Bayside","Boro Zone"
17,"Brooklyn","Bedford","Boro Zone"
18,"Bronx","Bedford Park","Boro Zone"
19,"Queens","Bellerose","Boro Zone"
20,"Bronx","Belmont","Boro Zone"
21,"Brooklyn","Bensonhurst East","Boro Zone"
22,"Brooklyn","Bensonhurst West","Boro Zone"
23,"Staten Island","Bloomfield/Emerson Hill","Boro Zone"
24,"Manhattan","Bloomingdale","Yellow Zone"
25,"Brooklyn","Boerum Hill","Boro Zone"
26,"Brooklyn","Borough Park","Boro Zone"
27,"Queens","Breezy Point/Fort Tilden/Riis Beach","Boro Zone"
28,"Queens","Briarwood/Jamaica Hills","Boro Zone"
29,"Brooklyn","Brighton Beach","Boro Zone"
30,"Queens","Broad Channel","Boro Zone"
31,"Bronx","Bronx Park","Boro Zone"
32,"Bronx","Bronxdale","Boro Zone"
33,"Brooklyn","Brooklyn Heights","Boro Zone"
34,"Brooklyn","Brooklyn Navy Yard","Boro Zone"
35,"Brooklyn","Brownsville","Boro Zone"
36,"Brooklyn","Bushwick North","Boro Zone"
37,"Brooklyn","Bushwick South","Boro Zone"
38,"Queens","Cambria Heights","Boro Zone"
39,"Brooklyn","Canarsie","Boro Zone"
40,"Brooklyn","Carroll Gardens","Boro Zone"
41,"Manhattan","Central Harlem","Boro Zone"
42,"Manhattan","Central Harlem North","Boro Zone"
43,"Manhattan","Central Park","Yellow Zone"
44,"Staten Island","Charleston/Tottenville","Boro Zone"
45,"Manhattan","Chinatown","Yellow Zone"
46,"Bronx","City Island","Boro Zone"
47,"Bronx","Claremont/Bathgate","Boro Zone"
48,"Manhattan","Clinton East","Yellow Zone"
49,"Brooklyn","Clinton Hill","Boro Zone"
50,"Manhattan","Clinton West","Yellow Zone"
51,"Bronx","Co-Op City","Boro Zone"
52,"Brooklyn","Cobble Hill","Boro Zone"
53,"Queens","College Point","Boro Zone"
54,"Brooklyn","Columbia Street","Boro Zone"
55,"Brooklyn","Coney Island","Boro Zone"
56,"Queens","Corona","Boro Zone"
57,"Queens","Corona","Boro Zone"
58,"Bronx","Country Club","Boro Zone"
59,"Bronx","Crotona Park","Boro Zone"
60,"Bronx","Crotona Park East","Boro Zone"
61,"Brooklyn","Crown Heights North","Boro Zone"
62,"Brooklyn","Crown Heights South","Boro Zone"
63,"Brooklyn","Cypress Hills","Boro Zone"
64,"Queens","Douglaston","Boro Zone"
65,"Brooklyn","Downtown Brooklyn/MetroTech","Boro Zone"
66,"Brooklyn","DUMBO/Vinegar Hill","Boro Zone"
67,"Brooklyn","Dyker Heights","Boro Zone"
68,"Manhattan","East Chelsea","Yellow Zone"
69,"Bronx","East Concourse/Concourse Village","Boro Zone"
70,"Queens","East Elmhurst","Boro Zone"
71,"Brooklyn","East Flatbush/Farragut","Boro Zone"
72,"Brooklyn","East Flatbush/Remsen Village","Boro Zone"
73,"Queens","East Flushing","Boro Zone"
74,"Manhattan","East Harlem North","Boro Zone"
75,"Manhattan","East Harlem South","Boro Zone"
76,"Brooklyn","East New York","Boro Zone"
77,"Brooklyn","East New York/Pennsylvania Avenue","Boro Zone"
78,"Bronx","East Tremont","Boro Zone"
79,"Manhattan","East Village","Yellow Zone"
80,"Brooklyn","East Williamsburg","Boro Zone"
81,"Bronx","Eastchester","Boro Zone"
82,"Queens","Elmhurst","Boro Zone"
83,"Queens","Elmhurst/Maspeth","Boro Zone"
84,"Staten Island","Eltingville/Annadale/Prince's Bay","Boro Zone"
85,"Brooklyn","Erasmus","Boro Zone"
86,"Queens","Far Rockaway","Boro Zone"
87,"Manhattan","Financial District North","Yellow Zone"
88,"Manhattan","Financial District South","Yellow Zone"
89,"Brooklyn","Flatbush/Ditmas Park","Boro Zone"
90,"Manhattan","Flatiron","Yellow Zone"
91,"Brooklyn","Flatlands","Boro Zone"
92,"Queens","Flushing","Boro Zone"
93,"Queens","Flushing Meadows-Corona Park","Boro Zone"
94,"Bronx","Fordham South","Boro Zone"
95,"Queens","Forest Hills","Boro Zone"
96,"Queens","Forest Park/Highland Park","Boro Zone"
97,"Brooklyn","Fort Greene","Boro Zone"
98,"Queens","Fresh Meadows","Boro Zone"
99,"Staten Island","Freshkills Park","Boro Zone"
100,"Manhattan","Garment District","Yellow Zone"
101,"Queens","Glen Oaks","Boro Zone"
102,"Queens","Glendale","Boro Zone"
103,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
104,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
105,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
106,"Brooklyn","Gowanus","Boro Zone"
107,"Manhattan","Gramercy","Yellow Zone"
108,"Brooklyn","Gravesend","Boro Zone"
109,"Staten Island","Great Kills","Boro Zone"
110,"Staten Island","Great Kills Park","Boro Zone"
111,"Brooklyn","Green-Wood Cemetery","Boro Zone"
112,"Brooklyn","Greenpoint","Boro Zone"
113,"Manhattan","Greenwich Village North","Yellow Zone"
114,"Manhattan","Greenwich Village South","Yellow Zone"
115,"Staten Island","Grymes Hill/Clifton","Boro Zone"
116,"Manhattan","Hamilton Heights","Boro Zone"
117,"Queens","Hammels/Arverne","Boro Zone"
118,"Staten Island","Heartland Village/Todt Hill","Boro Zone"
119,"Bronx","Highbridge","Boro Zone"
120,"Manhattan","Highbridge Park","Boro Zone"
121,"Queens","Hillcrest/Pomonok","Boro Zone"
122,"Queens","Hollis","Boro Zone"
123,"Brooklyn","Homecrest","Boro Zone"
124,"Queens","Howard Beach","Boro Zone"
125,"Manhattan","Hudson Sq","Yellow Zone"
126,"Bronx","Hunts Point","Boro Zone"
127,"Manhattan","Inwood","Boro Zone"
128,"Manhattan","Inwood Hill Park","Boro Zone"
129,"Queens","Jackson Heights","Boro Zone"
130,"Queens","Jamaica","Boro Zone"
131,"Queens","Jamaica Estates","Boro Zone"
132,"Queens","JFK Airport","Airports"
133,"Brooklyn","Kensington","Boro Zone"
134,"Queens","Kew Gardens","Boro Zone"
135,"Queens","Kew Gardens Hills","Boro Zone"
136,"Bronx","Kingsbridge Heights","Boro Zone"
137,"Manhattan","Kips Bay","Yellow Zone"
138,"Queens","LaGuardia Airport","Airports"
139,"Queens","Laurelton","Boro Zone"
140,"Manhattan","Lenox Hill East","Yellow Zone"
141,"Manhattan","Lenox Hill West","Yellow Zone"
142,"Manhattan","Lincoln Square East","Yellow Zone"
143,"Manhattan","Lincoln Square West","Yellow Zone"
144,"Manhattan","Little Italy/NoLiTa","Yellow Zone"
145,"Queens","Long Island City/Hunters Point","Boro Zone"
146,"Queens","Long Island City/Queens Plaza","Boro Zone"
147,"Bronx","Longwood","Boro Zone"
148,"Manhattan","Lower East Side","Yellow Zone"
149,"Brooklyn","Madison","Boro Zone"
150,"Brooklyn","Manhattan Beach","Boro Zone"
151,"Manhattan","Manhattan Valley","Yellow Zone"
152,"Manhattan","Manhattanville","Boro Zone"
153,"Manhattan","Marble Hill","Boro Zone"
154,"Brooklyn","Marine Park/Floyd Bennett Field","Boro Zone"
155,"Brooklyn","Marine Park/Mill Basin","Boro Zone"
156,"Staten Island","Mariners Harbor","Boro Zone"
157,"Queens","Maspeth","Boro Zone"
158,"Manhattan","Meatpacking/West Village West","Yellow Zone"
159,"Bronx","Melrose South","Boro Zone"
160,"Queens","Middle Village","Boro Zone"
161,"Manhattan","Midtown Center","Yellow Zone"
162,"Manhattan","Midtown East","Yellow Zone"
163,"Manhattan","Midtown North","Yellow Zone"
164,"Manhattan","Midtown South","Yellow Zone"
165,"Brooklyn","Midwood","Boro Zone"
166,"Manhattan","Morningside Heights","Boro Zone"
167,"Bronx","Morrisania/Melrose","Boro Zone"
168,"Bronx","Mott Haven/Port Morris","Boro Zone"
169,"Bronx","Mount Hope","Boro Zone"
170,"Manhattan","Murray Hill","Yellow Zone"
171,"Queens","Murray Hill-Queens","Boro Zone"
172,"Staten Island","New Dorp/Midland Beach","Boro Zone"
173,"Queens","North Corona","Boro Zone"
174,"Bronx","Norwood","Boro Zone"
175,"Queens","Oakland Gardens","Boro Zone"
176,"Staten Island","Oakwood","Boro Zone"
177,"Brooklyn","Ocean Hill","Boro Zone"
178,"Brooklyn","Ocean Parkway South","Boro Zone"
179,"Queens","Old Astoria","Boro Zone"
180,"Queens","Ozone Park","Boro Zone"
181,"Brooklyn","Park Slope","Boro Zone"
182,"Bronx","Parkchester","Boro Zone"
183,"Bronx","Pelham Bay","Boro Zone"
184,"Bronx","Pelham Bay Park","Boro Zone"
185,"Bronx","Pelham Parkway","Boro Zone"
186,"Manhattan","Penn Station/Madison Sq West","Yellow Zone"
187,"Staten Island","Port Richmond","Boro Zone"
188,"Brooklyn","Prospect-Lefferts Gardens","Boro Zone"
189,"Brooklyn","Prospect Heights","Boro Zone"
190,"Brooklyn","Prospect Park","Boro Zone"
191,"Queens","Queens Village","Boro Zone"
192,"Queens","Queensboro Hill","Boro Zone"
193,"Queens","Queensbridge/Ravenswood","Boro Zone"
194,"Manhattan","Randalls Island","Yellow Zone"
195,"Brooklyn","Red Hook","Boro Zone"
196,"Queens","Rego Park","Boro Zone"
197,"Queens","Richmond Hill","Boro Zone"
198,"Queens","Ridgewood","Boro Zone"
199,"Bronx","Rikers Island","Boro Zone"
200,"Bronx","Riverdale/North Riverdale/Fieldston","Boro Zone"
201,"Queens","Rockaway Park","Boro Zone"
202,"Manhattan","Roosevelt Island","Boro Zone"
203,"Queens","Rosedale","Boro Zone"
204,"Staten Island","Rossville/Woodrow","Boro Zone"
205,"Queens","Saint Albans","Boro Zone"
206,"Staten Island","Saint George/New Brighton","Boro Zone"
207,"Queens","Saint Michaels Cemetery/Woodside","Boro Zone"
208,"Bronx","Schuylerville/Edgewater Park","Boro Zone"
209,"Manhattan","Seaport","Yellow Zone"
210,"Brooklyn","Sheepshead Bay","Boro Zone"
211,"Manhattan","SoHo","Yellow Zone"
212,"Bronx","Soundview/Bruckner","Boro Zone"
213,"Bronx","Soundview/Castle Hill","Boro Zone"
214,"Staten Island","South Beach/Dongan Hills","Boro Zone"
215,"Queens","South Jamaica","Boro Zone"
216,"Queens","South Ozone Park","Boro Zone"
217,"Brooklyn","South Williamsburg","Boro Zone"
218,"Queens","Springfield Gardens North","Boro Zone"
219,"Queens","Springfield Gardens South","Boro Zone"
220,"Bronx","Spuyten Duyvil/Kingsbridge","Boro Zone"
221,"Staten Island","Stapleton","Boro Zone"
222,"Brooklyn","Starrett City","Boro Zone"
223,"Queens","Steinway","Boro Zone"
224,"Manhattan","Stuy Town/Peter Cooper Village","Yellow Zone"
225,"Brooklyn","Stuyvesant Heights","Boro Zone"
226,"Queens","Sunnyside","Boro Zone"
227,"Brooklyn","Sunset Park East","Boro Zone"
228,"Brooklyn","Sunset Park West","Boro Zone"
229,"Manhattan","Sutton Place/Turtle Bay North","Yellow Zone"
230,"Manhattan","Times Sq/Theatre District","Yellow Zone"
231,"Manhattan","TriBeCa/Civic Center","Yellow Zone"
232,"Manhattan","Two Bridges/Seward Park","Yellow Zone"
233,"Manhattan","UN/Turtle Bay South","Yellow Zone"
234,"Manhattan","Union Sq","Yellow Zone"
235,"Bronx","University Heights/Morris Heights","Boro Zone"
236,"Manhattan","Upper East Side North","Yellow Zone"
237,"Manhattan","Upper East Side South","Yellow Zone"
238,"Manhattan","Upper West Side North","Yellow Zone"
239,"Manhattan","Upper West Side South","Yellow Zone"
240,"Bronx","Van Cortlandt Park","Boro Zone"
241,"Bronx","Van Cortlandt Village","Boro Zone"
242,"Bronx","Van Nest/Morris Park","Boro Zone"
243,"Manhattan","Washington Heights North","Boro Zone"
244,"Manhattan","Washington Heights South","Boro Zone"
245,"Staten Island","West Brighton","Boro Zone"
246,"Manhattan","West Chelsea/Hudson Yards","Yellow Zone"
247,"Bronx","West Concourse","Boro Zone"
248,"Bronx","West Farms/Bronx River","Boro Zone"
249,"Manhattan","West Village","Yellow Zone"
250,"Bronx","Westchester Village/Unionport","Boro Zone"
251,"Staten Island","Westerleigh","Boro Zone"
252,"Queens","Whitestone","Boro Zone"
253,"Queens","Willets Point","Boro Zone"
254,"Bronx","Williamsbridge/Olinville","Boro Zone"
255,"Brooklyn","Williamsburg (North Side)","Boro Zone"
256,"Brooklyn","Williamsburg (South Side)","Boro Zone"
257,"Brooklyn","Windsor Terrace","Boro Zone"
258,"Queens","Woodhaven","Boro Zone"
259,"Bronx","Woodlawn/Wakefield","Boro Zone"
260,"Queens","Woodside","Boro Zone"
261,"Manhattan","World Trade Center","Yellow Zone"
262,"Manhattan","Yorkville East","Yellow Zone"
263,"Manhattan","Yorkville West","Yellow Zone"
264,"Unknown","NV","N/A"
265,"Unknown","NA","N/A"
259 258 Queens Woodhaven Boro Zone
260 259 Bronx Woodlawn/Wakefield Boro Zone
261 260 Queens Woodside Boro Zone
262 261 Manhattan World Trade Center Yellow Zone
263 262 Manhattan Yorkville East Yellow Zone
264 263 Manhattan Yorkville West Yellow Zone
265 264 Unknown NV N/A
266 265 Unknown NA N/A

View File

@ -2,8 +2,13 @@
## 5.1 Introduction
* :movie_camera: 5.1.1 [Introduction to Batch Processing](https://youtu.be/dcHe5Fl3MF8?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.1.2 [Introduction to Spark](https://youtu.be/FhaqbEOuQ8U?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.1.1 Introduction to Batch Processing
[![](https://markdown-videos-api.jorgenkh.no/youtube/dcHe5Fl3MF8)](https://youtu.be/dcHe5Fl3MF8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=51)
* :movie_camera: 5.1.2 Introduction to Spark
[![](https://markdown-videos-api.jorgenkh.no/youtube/FhaqbEOuQ8U)](https://youtu.be/FhaqbEOuQ8U&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=52)
## 5.2 Installation
@ -16,46 +21,82 @@ Follow [these instructions](setup/) to install Spark:
And follow [this](setup/pyspark.md) to run PySpark in Jupyter
* :movie_camera: 5.2.1 [(Optional) Installing Spark (Linux)](https://youtu.be/hqUbB9c8sKg?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.2.1 (Optional) Installing Spark (Linux)
[![](https://markdown-videos-api.jorgenkh.no/youtube/hqUbB9c8sKg)](https://youtu.be/hqUbB9c8sKg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=53)
## 5.3 Spark SQL and DataFrames
* :movie_camera: 5.3.1 [First Look at Spark/PySpark](https://youtu.be/r_Sf6fCB40c?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.3.2 [Spark Dataframes](https://youtu.be/ti3aC1m3rE8?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.3.3 [(Optional) Preparing Yellow and Green Taxi Data](https://youtu.be/CI3P4tAtru4?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.3.1 First Look at Spark/PySpark
[![](https://markdown-videos-api.jorgenkh.no/youtube/r_Sf6fCB40c)](https://youtu.be/r_Sf6fCB40c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=54)
* :movie_camera: 5.3.2 Spark Dataframes
[![](https://markdown-videos-api.jorgenkh.no/youtube/ti3aC1m3rE8)](https://youtu.be/ti3aC1m3rE8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=55)
* :movie_camera: 5.3.3 (Optional) Preparing Yellow and Green Taxi Data
[![](https://markdown-videos-api.jorgenkh.no/youtube/CI3P4tAtru4)](https://youtu.be/CI3P4tAtru4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=56)
Script to prepare the Dataset [download_data.sh](code/download_data.sh)
**Note**: The other way to infer the schema (apart from pandas) for the csv files, is to set the `inferSchema` option to `true` while reading the files in Spark.
> [!NOTE]
> Another way to infer the schema for the csv files (apart from using pandas) is to set the `inferSchema` option to `true` while reading the files in Spark.
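For example, a minimal PySpark sketch of this option (the file path is only illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-inference").getOrCreate()

# ask Spark to scan the CSV and infer column types instead of defining a schema by hand
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("data/raw/green_tripdata_2021-01.csv.gz")

df.printSchema()
```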
* :movie_camera: 5.3.4 [SQL with Spark](https://www.youtube.com/watch?v=uAlp2VuZZPY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.3.4 SQL with Spark
[![](https://markdown-videos-api.jorgenkh.no/youtube/uAlp2VuZZPY)](https://youtu.be/uAlp2VuZZPY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=57)
## 5.4 Spark Internals
* :movie_camera: 5.4.1 [Anatomy of a Spark Cluster](https://youtu.be/68CipcZt7ZA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.4.2 [GroupBy in Spark](https://youtu.be/9qrDsY_2COo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.4.3 [Joins in Spark](https://youtu.be/lu7TrqAWuH4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.4.1 Anatomy of a Spark Cluster
[![](https://markdown-videos-api.jorgenkh.no/youtube/68CipcZt7ZA)](https://youtu.be/68CipcZt7ZA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=58)
* :movie_camera: 5.4.2 GroupBy in Spark
[![](https://markdown-videos-api.jorgenkh.no/youtube/9qrDsY_2COo)](https://youtu.be/9qrDsY_2COo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=59)
* :movie_camera: 5.4.3 Joins in Spark
[![](https://markdown-videos-api.jorgenkh.no/youtube/lu7TrqAWuH4)](https://youtu.be/lu7TrqAWuH4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=60)
## 5.5 (Optional) Resilient Distributed Datasets
* :movie_camera: 5.5.1 [Operations on Spark RDDs](https://youtu.be/Bdu-xIrF3OM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.5.2 [Spark RDD mapPartition](https://youtu.be/k3uB2K99roI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.5.1 Operations on Spark RDDs
[![](https://markdown-videos-api.jorgenkh.no/youtube/Bdu-xIrF3OM)](https://youtu.be/Bdu-xIrF3OM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=61)
* :movie_camera: 5.5.2 Spark RDD mapPartition
[![](https://markdown-videos-api.jorgenkh.no/youtube/k3uB2K99roI)](https://youtu.be/k3uB2K99roI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=62)
## 5.6 Running Spark in the Cloud
* :movie_camera: 5.6.1 [Connecting to Google Cloud Storage ](https://youtu.be/Yyz293hBVcQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.6.2 [Creating a Local Spark Cluster](https://youtu.be/HXBwSlXo5IA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.6.3 [Setting up a Dataproc Cluster](https://youtu.be/osAiAYahvh8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.6.4 [Connecting Spark to Big Query](https://youtu.be/HIm2BOj8C0Q&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* :movie_camera: 5.6.1 Connecting to Google Cloud Storage
[![](https://markdown-videos-api.jorgenkh.no/youtube/Yyz293hBVcQ)](https://youtu.be/Yyz293hBVcQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=63)
* :movie_camera: 5.6.2 Creating a Local Spark Cluster
[![](https://markdown-videos-api.jorgenkh.no/youtube/HXBwSlXo5IA)](https://youtu.be/HXBwSlXo5IA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=64)
* :movie_camera: 5.6.3 Setting up a Dataproc Cluster
[![](https://markdown-videos-api.jorgenkh.no/youtube/osAiAYahvh8)](https://youtu.be/osAiAYahvh8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=65)
* :movie_camera: 5.6.4 Connecting Spark to Big Query
[![](https://markdown-videos-api.jorgenkh.no/youtube/HIm2BOj8C0Q)](https://youtu.be/HIm2BOj8C0Q&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=66)
# Homework
* [2024 Homework](../cohorts/2024)
* [2024 Homework](../cohorts/2024/05-batch/homework.md)
# Community notes
@ -68,4 +109,6 @@ Did you take notes? You can share them here.
* [Alternative : Using docker-compose to launch spark by rafik](https://gist.github.com/rafik-rahoui/f98df941c4ccced9c46e9ccbdef63a03)
* [Marcos Torregrosa's blog (spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-5-batch-spark)
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week5)
* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step5-Batch-Processing)
* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/05-batch-processing/README.md)
* Add your notes here (above this line)

View File

@ -57,8 +57,7 @@ rm openjdk-11.0.2_linux-x64_bin.tar.gz
Download Spark. Use 3.3.2 version:
```bash
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
```
Unpack:

View File

@ -10,7 +10,7 @@ for other MacOS versions as well
Ensure Brew and Java are installed on your system:
```bash
xcode-select install
xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
brew install java
```

View File

@ -68,7 +68,7 @@ export PATH="${HADOOP_HOME}/bin:${PATH}"
Now download Spark. Select version 3.3.2
```bash
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
```

View File

@ -11,30 +11,66 @@ Confluent cloud provides a free 30 days trial for, you can signup [here](https:/
## Introduction to Stream Processing
- [Slides](https://docs.google.com/presentation/d/1bCtdCba8v1HxJ_uMm9pwjRUC-NAMeB-6nOG2ng3KujA/edit?usp=sharing)
- :movie_camera: 6.0.1 [DE Zoomcamp 6.0.1 - Introduction](https://www.youtube.com/watch?v=hfvju3iOIP0)
- :movie_camera: 6.0.2 [DE Zoomcamp 6.0.2 - What is stream processing](https://www.youtube.com/watch?v=WxTxKGcfA-k)
- :movie_camera: 6.0.1 Introduction
[![](https://markdown-videos-api.jorgenkh.no/youtube/hfvju3iOIP0)](https://youtu.be/hfvju3iOIP0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=67)
- :movie_camera: 6.0.2 What is stream processing
[![](https://markdown-videos-api.jorgenkh.no/youtube/WxTxKGcfA-k)](https://youtu.be/WxTxKGcfA-k&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=68)
## Introduction to Kafka
- :movie_camera: 6.3 [DE Zoomcamp 6.3 - What is kafka?](https://www.youtube.com/watch?v=zPLZUDPi4AY)
- :movie_camera: 6.4 [DE Zoomcamp 6.4 - Confluent cloud](https://www.youtube.com/watch?v=ZnEZFEYKppw)
- :movie_camera: 6.5 [DE Zoomcamp 6.5 - Kafka producer consumer](https://www.youtube.com/watch?v=aegTuyxX7Yg)
- :movie_camera: 6.3 What is kafka?
[![](https://markdown-videos-api.jorgenkh.no/youtube/zPLZUDPi4AY)](https://youtu.be/zPLZUDPi4AY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=69)
- :movie_camera: 6.4 Confluent cloud
[![](https://markdown-videos-api.jorgenkh.no/youtube/ZnEZFEYKppw)](https://youtu.be/ZnEZFEYKppw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=70)
- :movie_camera: 6.5 Kafka producer consumer
[![](https://markdown-videos-api.jorgenkh.no/youtube/aegTuyxX7Yg)](https://youtu.be/aegTuyxX7Yg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=71)
## Kafka Configuration
- :movie_camera: 6.6 [DE Zoomcamp 6.6 - Kafka configuration](https://www.youtube.com/watch?v=SXQtWyRpMKs)
- :movie_camera: 6.6 Kafka configuration
[![](https://markdown-videos-api.jorgenkh.no/youtube/SXQtWyRpMKs)](https://youtu.be/SXQtWyRpMKs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=72)
- [Kafka Configuration Reference](https://docs.confluent.io/platform/current/installation/configuration/)
## Kafka Streams
- [Slides](https://docs.google.com/presentation/d/1fVi9sFa7fL2ZW3ynS5MAZm0bRSZ4jO10fymPmrfTUjE/edit?usp=sharing)
- [Streams Concepts](https://docs.confluent.io/platform/current/streams/concepts.html)
- :movie_camera: 6.7 [DE Zoomcamp 6.7 - Kafka streams basics](https://www.youtube.com/watch?v=dUyA_63eRb0)
- :movie_camera: 6.8 [DE Zoomcamp 6.8 - Kafka stream join](https://www.youtube.com/watch?v=NcpKlujh34Y)
- :movie_camera: 6.9 [DE Zoomcamp 6.9 - Kafka stream testing](https://www.youtube.com/watch?v=TNx5rmLY8Pk)
- :movie_camera: 6.10 [DE Zoomcamp 6.10 - Kafka stream windowing](https://www.youtube.com/watch?v=r1OuLdwxbRc)
- :movie_camera: 6.11 [DE Zoomcamp 6.11 - Kafka ksqldb & Connect](https://www.youtube.com/watch?v=DziQ4a4tn9Y)
- :movie_camera: 6.12 [DE Zoomcamp 6.12 - Kafka Schema registry](https://www.youtube.com/watch?v=tBY_hBuyzwI)
- :movie_camera: 6.7 Kafka streams basics
[![](https://markdown-videos-api.jorgenkh.no/youtube/dUyA_63eRb0)](https://youtu.be/dUyA_63eRb0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=73)
- :movie_camera: 6.8 Kafka stream join
[![](https://markdown-videos-api.jorgenkh.no/youtube/NcpKlujh34Y)](https://youtu.be/NcpKlujh34Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=74)
- :movie_camera: 6.9 Kafka stream testing
[![](https://markdown-videos-api.jorgenkh.no/youtube/TNx5rmLY8Pk)](https://youtu.be/TNx5rmLY8Pk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=75)
- :movie_camera: 6.10 Kafka stream windowing
[![](https://markdown-videos-api.jorgenkh.no/youtube/r1OuLdwxbRc)](https://youtu.be/r1OuLdwxbRc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=76)
- :movie_camera: 6.11 Kafka ksqldb & Connect
[![](https://markdown-videos-api.jorgenkh.no/youtube/DziQ4a4tn9Y)](https://youtu.be/DziQ4a4tn9Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=77)
- :movie_camera: 6.12 Kafka Schema registry
[![](https://markdown-videos-api.jorgenkh.no/youtube/tBY_hBuyzwI)](https://youtu.be/tBY_hBuyzwI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=78)
## Faust - Python Stream Processing
@ -43,8 +79,14 @@ Confluent cloud provides a free 30 days trial for, you can signup [here](https:/
## Pyspark - Structured Streaming
Please follow the steps described under [pyspark-streaming](python/streams-example/pyspark/README.md)
- :movie_camera: 6.13 [DE Zoomcamp 6.13 - Kafka Streaming with Python](https://www.youtube.com/watch?v=Y76Ez_fIvtk)
- :movie_camera: 6.14 [DE Zoomcamp 6.14 - Pyspark Structured Streaming](https://www.youtube.com/watch?v=5hRJ8-6Fpyk)
- :movie_camera: 6.13 Kafka Streaming with Python
[![](https://markdown-videos-api.jorgenkh.no/youtube/Y76Ez_fIvtk)](https://youtu.be/Y76Ez_fIvtk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=79)
- :movie_camera: 6.14 Pyspark Structured Streaming
[![](https://markdown-videos-api.jorgenkh.no/youtube/5hRJ8-6Fpyk)](https://youtu.be/5hRJ8-6Fpyk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=80)
## Kafka Streams with JVM library
@ -82,5 +124,6 @@ Did you take notes? You can share them here.
* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/6_streaming.md )
* [Marcos Torregrosa's blog (spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-6-stream-processing/)
* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step6-Streaming)
* Add your notes here (above this line)

View File

@ -3,7 +3,7 @@
In this document, you will be finding information about stream processing
using different Python libraries (`kafka-python`,`confluent-kafka`,`pyspark`, `faust`).
This Python module can be seperated in following modules.
This Python module can be separated in following modules.
#### 1. Docker
Docker module includes Dockerfiles and docker-compose definitions

View File

@ -75,6 +75,12 @@ can take the course at your own pace
### [Workshop 1: Data Ingestion](cohorts/2024/workshops/dlt.md)
* Reading from apis
* Building scalable pipelines
* Normalising data
* Incremental loading
* Homework
[More details](cohorts/2024/workshops/dlt.md)
@ -143,9 +149,7 @@ Putting everything we learned to practice
## Overview
<img src="images/architecture/photo1700757552.jpeg" />
<img src="images/architecture/arch_v3_workshops.jpg" />
### Prerequisites
@ -171,14 +175,6 @@ Past instructors:
- [Sejal Vaidya](https://www.linkedin.com/in/vaidyasejal/)
- [Irem Erturk](https://www.linkedin.com/in/iremerturk/)
## Course UI
Alternatively, you can access this course using the provided UI app, the app provides a user-friendly interface for navigating through the course material.
* Visit the following link: [DE Zoomcamp UI](https://dezoomcamp.streamlit.app/)
![dezoomcamp-ui](https://github.com/DataTalksClub/data-engineering-zoomcamp/assets/66017329/4466d2bc-3728-4fca-8e9e-b1c6be30a430)
## Asking for help in Slack

View File

@ -634,5 +634,17 @@ Links:
<td><a href="https://github.com/ChungWasawat/dtc_de_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/wasawat-boonyarittikit-b1698b179/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/ChungWasawat"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Fedor Faizov</td>
<td><a href="https://github.com/Fedrpi/de-zoomcamp-bandcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/fedor-faizov-a75b32245/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Fedrpi"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>
> Absolutly amazing course <3 </details></td>
</tr>
</table>

View File

@ -1,5 +1,7 @@
## Module 1 Homework
ATTENTION: At the very end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.
## Docker & SQL
In this homework we'll prepare the environment
@ -66,11 +68,13 @@ Remember that `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are in
- 15859
- 89009
## Question 4. Largest trip for each day
## Question 4. Longest trip for each day
Which was the pick up day with the largest trip distance
Which was the pick up day with the longest trip distance?
Use the pick up time for your calculations.
Tip: For every trip on a single day, we only care about the trip with the longest distance.
- 2019-09-18
- 2019-09-16
- 2019-09-26

View File

@ -1,9 +1,17 @@
## Week 2 Homework
## Module 2 Homework
ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository.
> In case you don't get one option exactly, select the closest one
For the homework, we'll be working with the _green_ taxi dataset located here:
`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`
To get a `wget`-able link, use this prefix (note that the link itself gives 404):
`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`
### Assignment
The goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!).
@ -13,7 +21,7 @@ The goal will be to construct an ETL pipeline that loads the data, performs some
- You can use the same datatypes and date parsing methods shown in the course.
- `BONUS`: load the final three months using a for loop and `pd.concat`
- Add a transformer block and perform the following:
- Remove rows where the passenger count is equal to 0 _or_ the trip distance is equal to zero.
- Remove rows where the passenger count is equal to 0 _and_ the trip distance is equal to zero.
- Create a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date.
- Rename columns in Camel Case to Snake Case, e.g. `VendorID` to `vendor_id`.
- Add three assertions:
@ -37,7 +45,7 @@ Once the dataset is loaded, what's the shape of the data?
## Question 2. Data Transformation
Upon filtering the dataset where the passenger count is equal to 0 _or_ the trip distance is equal to zero, how many rows are left?
Upon filtering the dataset where the passenger count is greater than 0 _and_ the trip distance is greater than zero, how many rows are left?
* 544,897 rows
* 266,855 rows
@ -48,10 +56,10 @@ Upon filtering the dataset where the passenger count is equal to 0 _or_ the trip
Which of the following creates a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date?
* data = data['lpep_pickup_datetime'].date
* data('lpep_pickup_date') = data['lpep_pickup_datetime'].date
* data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date
* data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()
* `data = data['lpep_pickup_datetime'].date`
* `data('lpep_pickup_date') = data['lpep_pickup_datetime'].date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()`
## Question 4. Data Transformation
@ -82,10 +90,9 @@ Once exported, how many partitions (folders) are present in Google Cloud?
## Submitting the solutions
* Form for submitting: TBA
Deadline: TBA
* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2
* Check the link above to see the due date
## Solution
Will be added after the due date

View File

@ -1,12 +1,17 @@
## Week 3 Homework
<b><u>Important Note:</b></u> <p> For this homework we will be using the Green Taxi Trip Record Parquet files from the New York
## Module 3 Homework
Solution: https://www.youtube.com/watch?v=8g_lRKaC9ro
ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.
<b><u>Important Note:</b></u> <p> For this homework we will be using the 2022 Green Taxi Trip Record Parquet Files from the New York
City Taxi Data found here: </br> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page </br>
If you are using orchestration such as Mage, Airflow or Prefect do not load the data into Big Query using the orchestrator.</br>
Stop with loading the files into a bucket. </br></br>
<u>NOTE:</u> You will need to use the PARQUET option files when creating an External Table</br>
<b>SETUP:</b></br>
Create an external table using the Green Taxi Trip Records Data for 2022 data. </br>
Create an external table using the Green Taxi Trip Records Data for 2022. </br>
Create a table in BQ using the Green Taxi Trip Records for 2022 (do not partition or cluster this table). </br>
</p>
@ -35,7 +40,7 @@ How many records have a fare_amount of 0?
- 1,622
## Question 4:
What is the best strategy to make an optimized table in Big Query if your query will always order the results by PUlocationID and filter based on lpep_pickup_datetime?
What is the best strategy to make an optimized table in Big Query if your query will always order the results by PUlocationID and filter based on lpep_pickup_datetime? (Create a new table with this strategy)
- Cluster on lpep_pickup_datetime Partition by PUlocationID
- Partition by lpep_pickup_datetime Cluster on PUlocationID
- Partition by lpep_pickup_datetime and Partition by PUlocationID
@ -48,7 +53,7 @@ Write a query to retrieve the distinct PULocationID between lpep_pickup_datetime
Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? </br>
Choose the answer which most closely matches.</br>
Use the BQ table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? Choose the answer which most closely matches.
- 22.82 MB for non-partitioned table and 647.87 MB for the partitioned table
- 12.82 MB for non-partitioned table and 1.12 MB for the partitioned table
- 5.63 MB for non-partitioned table and 0 MB for the partitioned table
@ -71,16 +76,11 @@ It is best practice in Big Query to always cluster your data:
## (Bonus: Not worth points) Question 8:
No Points: Write a SELECT count(*) query FROM the materialized table you created. How many bytes does it estimate will be read? Why?
No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?
Note: Column types for all files used in an External Table must have the same datatype. While an External Table may be created and shown in the side panel in Big Query, this will need to be validated by running a count query on the External Table to check if any errors occur.
## Submitting the solutions
* Form for submitting: TBD
* You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: TBD
* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw3

View File

@ -0,0 +1,81 @@
## Module 4 Homework
In this homework, we'll use the models developed during the week 4 videos and enhance the dbt project presented there, using the Taxi data for FHV vehicles for year 2019 that is already loaded in our DWH.
This means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/)
* Yellow taxi data - Years 2019 and 2020
* Green taxi data - Years 2019 and 2020
* fhv data - Year 2019.
We will use the data loaded for:
* Building a source table: `stg_fhv_tripdata`
* Building a fact table: `fact_fhv_trips`
* Create a dashboard
If you don't have access to GCP, you can do this locally using the ingested data from your Postgres database
instead. If you have access to GCP, you don't need to do it for local Postgres - only if you want to.
> **Note**: if your answer doesn't match exactly, select the closest option
### Question 1:
**What happens when we execute dbt build --vars '{'is_test_run':'true'}'**
You'll need to have completed the ["Build the first dbt models"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video.
- It's the same as running *dbt build*
- It applies a _limit 100_ to all of our models
- It applies a _limit 100_ only to our staging models
- Nothing
### Question 2:
**What is the code that our CI job will run? Where is this code coming from?**
- The code that has been merged into the main branch
- The code that is behind the creation object on the dbt_cloud_pr_ schema
- The code from any development branch that has been opened based on main
- The code from the development branch we are requesting to merge to main
### Question 3 (2 points)
**What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?**
Create a staging model for the fhv data, similar to the ones made for yellow and green data. Add an additional filter for keeping only records with pickup time in year 2019.
Do not add a deduplication step. Run these models without limits (is_test_run: false).
Create a core model similar to fact trips, but selecting from stg_fhv_tripdata and joining with dim_zones.
Similar to what we've done in fact_trips, keep only records with known pickup and dropoff locations.
Run the dbt model without limits (is_test_run: false).
- 12998722
- 22998722
- 32998722
- 42998722
### Question 4 (2 points)
**Which service had the most rides during the month of July 2019, after building a tile for the fact_fhv_trips table?**
Create a dashboard with some tiles that you find interesting to explore the data. One tile should show the amount of trips per month, as done in the videos for fact_trips, including the fact_fhv_trips data.
- FHV
- Green
- Yellow
- FHV and Green
## Submitting the solutions
* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw4
Deadline: 22 February (Thursday), 22:00 CET
## Solution (To be published after deadline)
* Video:
* Answers:
* Question 1:
* Question 2:
* Question 3:
* Question 4:

View File

@ -0,0 +1,98 @@
## Week 5 Homework
In this homework we'll put what we learned about Spark in practice.
For this homework we will be using the FHV 2019-10 data found here. [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz)
### Question 1:
**Install Spark and PySpark**
- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.
What's the output?
> [!NOTE]
> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md)
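A minimal sketch of creating a local session and checking the version (the `appName` is arbitrary):

```python
from pyspark.sql import SparkSession

# create a local Spark session using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("homework") \
    .getOrCreate()

# the answer is whatever this prints on your setup
print(spark.version)
```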
### Question 2:
**FHV October 2019**
Read the October 2019 FHV into a Spark Dataframe with a schema as we did in the lessons.
Repartition the Dataframe to 6 partitions and save it to parquet.
What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.
- 1MB
- 6MB
- 25MB
- 87MB
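A sketch of the steps described above; the explicit schema is omitted here and the paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("homework").getOrCreate()

# read the FHV October 2019 CSV (define an explicit schema here, as in the lessons)
df = spark.read.option("header", "true").csv("fhv_tripdata_2019-10.csv.gz")

# repartition into 6 partitions, write to parquet, then inspect the .parquet file sizes on disk
df.repartition(6).write.parquet("data/pq/fhv/2019/10/", mode="overwrite")
```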
### Question 3:
**Count records**
How many taxi trips were there on the 15th of October?
Consider only trips that started on the 15th of October.
- 108,164
- 12,856
- 452,470
- 62,610
> [!IMPORTANT]
> Be aware of columns order when defining schema
### Question 4:
**Longest trip for each day**
What is the length of the longest trip in the dataset in hours?
- 631,152.50 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours
### Question 5:
**User Interface**
Spark's User Interface, which shows the application's dashboard, runs on which local port?
- 80
- 443
- 4040
- 8080
### Question 6:
**Least frequent pickup location zone**
Load the zone lookup data into a temp view in Spark</br>
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)
Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?</br>
- East Chelsea
- Jamaica Bay
- Union Sq
- Crown Heights North
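A sketch of registering the lookup data as a temp view (paths are illustrative; the trips dataframe can be registered the same way and joined on the pickup location ID):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("homework").getOrCreate()

# read the zone lookup CSV and expose it to Spark SQL as a temp view
zones = spark.read.option("header", "true").csv("taxi_zone_lookup.csv")
zones.createOrReplaceTempView("zones")

spark.sql("SELECT COUNT(*) FROM zones").show()
```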
## Submitting the solutions
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw5
- Deadline: See the website

View File

@ -1,60 +1,133 @@
## Data ingestion with dlt
# Data ingestion with dlt
In this hands-on workshop, we'll learn how to build data ingestion pipelines.
We'll cover the following steps:
* Extracting data from APIs, or files.
* Normalizing and loading data
* Normalizing and loading data
* Incremental loading
By the end of this workshop, you'll be able to write data pipelines like a senior data engineer: quickly, concisely, and in a scalable, self-maintaining way.
If you don't follow the course and only want to attend the workshop, sign up here: https://lu.ma/wupfy6dd
Video: https://www.youtube.com/live/oLXhBM7nf2Q
---
# Navigation
* [Workshop content](dlt_resources/data_ingestion_workshop.md)
* [Workshop notebook](dlt_resources/workshop.ipynb)
* [Homework starter notebook](dlt_resources/homework_starter.ipynb)
# Resources
- Website and community: Visit our [docs](https://dlthub.com/docs/intro), discuss on our slack (Link at top of docs).
- Course colab: [Notebook](https://colab.research.google.com/drive/1kLyD3AL-tYf_HqCXYnA3ZLwHGpzbLmoj#scrollTo=5aPjk0O3S_Ag&forceEdit=true&sandboxMode=true).
- dlthub [community Slack](https://dlthub.com/community).
---
# Teacher
Welcome to the DataTalksClub Data Engineering Zoomcamp data ingestion workshop.
- My name is [Adrian](https://www.linkedin.com/in/data-team/), and I have worked in the data field since 2012.
- I have built many data warehouses, some lakes, and a few data teams.
- 10 years into my career, I started working on dlt, “data load tool”, which is an open source library that enables data engineers to build faster and better.
- I started working on dlt because data engineering is one of the few areas of software engineering where we do not have developer tools to do our work.
- Building better pipelines would require more code re-use - we cannot all just build perfect pipelines from scratch every time.
- And so dlt was born, a library that automates the tedious part of data ingestion: Loading, schema management, data type detection, scalability, self healing, scalable extraction… you get the idea - essentially a data engineer's “one stop shop” for best practice data pipelining.
- Due to its **simplicity** of use, dlt enables **laymen** to
- Build pipelines 5-10x faster than without it
- Build self healing, self maintaining pipelines with all the best practices of data engineers. Automating schema changes removes the bulk of maintenance efforts.
- Govern your pipelines with schema evolution alerts and data contracts.
- and generally develop pipelines like a senior, commercial data engineer.
---
# Course
You can find the course file [here](./dlt_resources/data_ingestion_workshop.md)
The course has 3 parts
- [Extraction Section](./dlt_resources/data_ingestion_workshop.md#extracting-data): In this section we will learn about scalable extraction
- [Normalisation Section](./dlt_resources/data_ingestion_workshop.md#normalisation): In this section we will learn to prepare data for loading
- [Loading Section](./dlt_resources/data_ingestion_workshop.md#incremental-loading): Here we will learn about incremental loading modes
---
# Homework
The [linked colab notebook](https://colab.research.google.com/drive/1Te-AT0lfh0GpChg1Rbd0ByEKOHYtWXfm#scrollTo=wLF4iXf-NR7t&forceEdit=true&sandboxMode=true) offers a few exercises to practice what you learned today.
## Homework
#### Question 1: What is the sum of the outputs of the generator for limit = 5?
- **A**: 10.23433234744176
- **B**: 7.892332347441762
- **C**: 8.382332347441762
- **D**: 9.123332347441762
TBA
#### Question 2: What is the 13th number yielded by the generator?
- **A**: 4.236551275463989
- **B**: 3.605551275463989
- **C**: 2.345551275463989
- **D**: 5.678551275463989
### Question 1
#### Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.
- **A**: 353
- **B**: 365
- **C**: 378
- **D**: 390
TBA
#### Question 4: Merge the 2 generators using the ID column. Calculate the sum of ages of all the people loaded as described above.
- **A**: 215
- **B**: 266
- **C**: 241
- **D**: 258
* Option 1
* Option 2
* Option 3
* Option 4
Submit the solution here: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop1
---
# Next steps
As you are learning the various concepts of data engineering,
consider creating a portfolio project that will further your own knowledge.
By demonstrating the ability to deliver end to end, you will have an easier time finding your first role.
This will help regardless of whether your hiring manager reviews your project, largely because you will have a better
understanding and will be able to talk the talk.
Here are some example projects that others did with dlt:
- Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack)
- Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii)
- Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp)
- Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog)
- Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo),
[GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo),
[an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo),
[google sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline),
[Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo),
[MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics),
[Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends),
[Prefect](https://dlthub.com/docs/blog/dlt-prefect),
[PowerBI vs GoodData vs Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison),
[Dagster](https://dlthub.com/docs/blog/dlt-dagster),
[Ingesting events via gcp webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture),
[SAP to snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog),
[Read emails and send summary to slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog),
[Mode +dlt capabilities](https://dlthub.com/docs/blog/dlt-mode-blog),
[dbt on cloud functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions)
- If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources)
### Question 2:
TBA
* Option 1
* Option 2
* Option 3
* Option 4
If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt slack.
### Question 3:
TBA
* Option 1
* Option 2
* Option 3
* Option 4
**And don't forget, if you like dlt**
- **Give us a [GitHub Star!](https://github.com/dlt-hub/dlt)**
- **Join our [Slack community](https://dlthub.com/community)**
## Submitting the solutions
# Notes
* Form for submitting: TBA
* You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: TBA
## Solution
Video: TBA
* Add your notes here

View File

@ -0,0 +1,582 @@
# Intro
What is data loading, or data ingestion?
Data ingestion is the process of extracting data from a producer, transporting it to a convenient environment, and preparing it for usage by normalising it, sometimes cleaning, and adding metadata.
### “A wild dataset magically appears!”
In many data science teams, data magically appears - because the engineer loads it.
- Sometimes the format in which it appears is structured, and with explicit schema
- In that case, they can go straight to using it; Examples: Parquet, Avro, or table in a db,
- Sometimes the format is weakly typed and without explicit schema, such as csv, json
- in which case some extra normalisation or cleaning might be needed before usage
> 💡 **What is a schema?** The schema specifies the expected format and structure of data within a document or data store, defining the allowed keys, their data types, and any constraints or relationships.
### Be the magician! 😎
Since you are here to learn about data engineering, you will be the one making datasets magically appear.
Here's what you need to learn to build pipelines:
- Extracting data
- Normalising, cleaning, adding metadata such as schema and types
- and Incremental loading, which is vital for fast, cost effective data refreshes.
### What else does a data engineer do? What are we not learning, and what are we learning?
- It might seem simplistic, but in fact a data engineer's main goal is to ensure data flows from source systems to analytical destinations.
- So besides building pipelines, running pipelines and fixing pipelines, a data engineer may also focus on optimising data storage, ensuring data quality and integrity, implementing effective data governance practices, and continuously refining data architecture to meet the evolving needs of the organisation.
- Ultimately, a data engineer's role extends beyond the mechanical aspects of pipeline development, encompassing the strategic management and enhancement of the entire data lifecycle.
- This workshop focuses on building robust, scalable, self maintaining pipelines, with built in governance - in other words, best practices applied.
# Extracting data
### The considerations of extracting data
In this section we will learn about extracting data from source systems, and what to care about when doing so.
Most data is stored behind an API
- Sometimes that's a RESTful api for some business application, returning records of data.
- Sometimes the API returns a secure file path to something like a json or parquet file in a bucket that enables you to grab the data in bulk,
- Sometimes the API is something else (mongo, sql, other databases or applications) and will generally return records as JSON - the most common interchange format.
As an engineer, you will need to build pipelines that “just work”.
So here's what you need to consider on extraction, to prevent the pipelines from breaking, and to keep them running smoothly.
- Hardware limits: During this course we will cover how to navigate the challenges of managing memory.
- Network limits: Sometimes networks can fail. We can't fix what could go wrong, but we can retry network jobs until they succeed. For example, the dlt library offers a requests “replacement” that has built-in retries. [Docs](https://dlthub.com/docs/reference/performance#using-the-built-in-requests-client). We won't focus on this during the course, but you can read the docs on your own.
- Source api limits: Each source might have some limits, such as how many requests you can do per second. We would call these “rate limits”. Read each source's docs carefully to understand how to navigate these obstacles. You can find some examples of how to wait for rate limits in our verified sources repositories
- examples: [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits)
### Extracting data without hitting hardware limits
What kind of limits could you hit on your machine? In the case of data extraction, the only limits are memory and storage. This refers to the RAM or virtual memory, and the disk, or physical storage.
### **Managing memory.**
- Many data pipelines run on serverless functions or on orchestrators that delegate the workloads to clusters of small workers.
- These systems have a small memory or share it between multiple workers - so filling the memory is BAAAD: It might lead to not only your pipeline crashing, but crashing the entire container or machine that might be shared with other worker processes, taking them down too.
- The same can be said about disk - in most cases your disk is sufficient, but in some cases it's not. For those cases, mounting an external drive mapped to a storage bucket is the way to go. Airflow for example supports a “data” folder that is used just like a local folder but can be mapped to a bucket for unlimited capacity.
### So how do we avoid filling the memory?
- We often do not know the volume of data upfront
- And we cannot scale dynamically or infinitely on hardware during runtime
- So the answer is: Control the max memory you use
### Control the max memory used by streaming the data
Streaming here refers to processing the data event by event or chunk by chunk instead of doing bulk operations.
Let's look at some classic examples of streaming where data is transferred chunk by chunk or event by event:
- Between an audio broadcaster and an in-browser audio player
- Between a server and a local video player
- Between a smart home device or IoT device and your phone
- Between Google Maps and your navigation app
- Between Instagram Live and your followers
What do data engineers do? We usually stream the data between buffers, such as
- from API to local file
- from webhooks to event queues
- from event queue (Kafka, SQS) to Bucket
### Streaming in python via generators
Let's focus on how we build most data pipelines:
- To process data in a stream in Python, we use generators, which are functions that can return multiple times - by allowing multiple returns, the data can be released as it's produced, as a stream, instead of returning it all at once as a batch.
Take the following theoretical example:
- We search Twitter for “cat pictures”. We do not know how many pictures will be returned - maybe 10, maybe 10,000,000. Will they fit in memory? Who knows.
- So to grab this data without running out of memory, we would use a Python generator.
- What's a generator? In simple words, it's a function that can return multiple times. Here's an example of a regular function, and how that function looks if written as a generator.
### Generator examples:
Let's look at a regular returning function, and how we can re-write it as a generator.
**Regular function collects data in memory.** Here you can see how data is collected row by row in a list called `data` before it is returned. This will break if we have more data than memory.
```python
def search_twitter(query):
    data = []
    for row in paginated_get(query):
        data.append(row)
    return data


# Collect all the cat picture data
for row in search_twitter("cat pictures"):
    # Once collected,
    # print row by row
    print(row)
```
When calling `for row in search_twitter("cat pictures"):` all the data must first be downloaded before the first record is returned
Let's see how we could rewrite this as a generator.
**Generator for streaming the data.** The memory usage here is minimal.
As you can see, in the modified function, we yield each row as we get the data, without collecting it into memory. We can then run this generator and handle the data item by item.
```python
def search_twitter(query):
    for row in paginated_get(query):
        yield row


# Get one row at a time
for row in search_twitter("cat pictures"):
    # print the row
    print(row)
    # do something with the row such as cleaning it and writing it to a buffer
    # continue requesting and printing data
```
When calling `for row in search_twitter("cat pictures"):` the function only runs until the first data item is yielded, before printing - so we do not need to wait long for the first value. It will then continue until there is no more data to get.
If we wanted to get all the values at once from a generator instead of one by one, we would need to first “run” the generator and collect the data. For example, if we wanted to get all the data in memory we could do `data = list(search_twitter("cat pictures"))` which would run the generator and collect all the data in a list before continuing.
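As a quick self-contained illustration of lazy versus eager consumption (this toy generator is ours, not part of the workshop code):

```python
def count_up_to(n):
    # a generator: yields one value at a time instead of building a list
    for i in range(1, n + 1):
        yield i

# lazy: values are produced one by one, as the loop asks for them
for value in count_up_to(3):
    print(value)            # 1, then 2, then 3

# eager: list() runs the generator to exhaustion and keeps everything in memory
all_values = list(count_up_to(3))
print(all_values)           # [1, 2, 3]
```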
## 3 Extraction examples:
### Example 1: Grabbing data from an api
> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab or in your local setup.
For these purposes we created an api that can serve the data you are already familiar with, the NYC taxi dataset.
The api documentation is as follows:
- There is a limited number of records behind the api
- The data can be requested page by page, each page containing 1000 records
- If we request a page with no data, we will get a successful response with no data
- so this means that when we get an empty page, we know there is no more data and we can stop requesting pages - this is a common way to paginate but not the only one - each api may be different.
- details:
- method: get
- url: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api`
- parameters: `page` integer. Represents the page number you are requesting. Defaults to 1.
So how do we design our requester?
- We need to request page by page until we get no more data. At this point, we do not know how much data is behind the api.
- It could be 1000 records or it could be 10GB of records. So let's grab the data with a generator to avoid having to fit an undetermined amount of data into RAM.
In this approach to grabbing data from apis, we have pros and cons:
- Pros: **Easy memory management** thanks to api returning events/pages
- Cons: **Low throughput**, due to the data transfer being constrained via an API.
```python
import requests

BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api"

# I call this a paginated getter
# as it's a function that gets data
# and also paginates until there is no more data
# by yielding pages, we "microbatch", which speeds up downstream processing
def paginated_getter():
    page_number = 1

    while True:
        # Set the query parameters
        params = {'page': page_number}

        # Make the GET request to the API
        response = requests.get(BASE_API_URL, params=params)
        response.raise_for_status()  # Raise an HTTPError for bad responses
        page_json = response.json()
        print(f'got page number {page_number} with {len(page_json)} records')

        # if the page has no records, stop iterating
        if page_json:
            yield page_json
            page_number += 1
        else:
            # No more data, break the loop
            break


if __name__ == '__main__':
    # Use the generator to iterate over pages
    for page_data in paginated_getter():
        # Process each page as needed
        print(page_data)
```
### Example 2: Grabbing the same data from file - simple download
> 💡 This part is demonstrative, so you do not need to follow along; just pay attention.
- Why am I showing you this? So that when you do this in the future, you will remember there is a best practice you can apply for scalability.
Some apis respond with files instead of pages of data. The reason for this is simple: throughput and cost. A restful api that returns data has to read the data from storage, process it, and return it to you by some logic - if this data is large, this costs time and money and creates a bottleneck.
A better way is to offer the data as files that someone can download from storage directly, without going through the restful api layer. This is common for apis that offer large volumes of data, such as ad impressions data.
In this example, we grab exactly the same data as we did in the API example above, but now we get it from the underlying file instead of going through the API.
- Pros: **High throughput**
- Cons: **Memory** is used to hold all the data
This is how the code could look. As you can see in this case our `data` and `parsed_data` variables hold the entire file data in memory before returning it. Not great.
```python
import requests
import json

url = "https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl"

def download_and_read_jsonl(url):
    response = requests.get(url)
    response.raise_for_status()  # Raise an HTTPError for bad responses
    data = response.text.splitlines()
    parsed_data = [json.loads(line) for line in data]
    return parsed_data


downloaded_data = download_and_read_jsonl(url)

if downloaded_data:
    # Process or print the downloaded data as needed
    print(downloaded_data[:5])  # Print the first 5 entries as an example
```
### Example 3: Same file, streaming download
> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab
Ok, downloading files is simple, but what if we want to do a stream download?
That's possible too - in effect giving us the best of both worlds. In this case we prepared a jsonl file which is already split into lines, making our code simple. But json (not jsonl) files could also be downloaded in this fashion, for example using the `ijson` library.
What are the pros and cons of this method of grabbing data?
Pros: **High throughput, easy memory management,** because we are downloading a file
Cons: **Difficult to do for columnar file formats**, as entire blocks need to be downloaded before they can be deserialised to rows. Sometimes, the code is complex too.
Here's what the code looks like - in a jsonl file each line is a json document, or a “row” of data, so we yield them as they get downloaded. This allows us to download one row and process it before getting the next row.
```python
import requests
import json

def download_and_yield_rows(url):
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Raise an HTTPError for bad responses

    for line in response.iter_lines():
        if line:
            yield json.loads(line)

# Replace the URL with your actual URL
url = "https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl"

# Use the generator to iterate over rows with minimal memory usage
for row in download_and_yield_rows(url):
    # Process each row as needed
    print(row)
```
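For the plain-json case mentioned above (one large JSON array rather than line-delimited records), a minimal sketch with the `ijson` library could look like this; the URL is hypothetical and assumes the endpoint returns a top-level array:

```python
import ijson
import requests

def stream_json_array(url):
    # stream the HTTP response instead of loading the whole body into memory
    response = requests.get(url, stream=True)
    response.raise_for_status()
    # 'item' selects each element of the top-level JSON array;
    # ijson parses the stream incrementally, yielding one record at a time
    yield from ijson.items(response.raw, "item")

# hypothetical endpoint returning [{...}, {...}, ...]
for row in stream_json_array("https://example.com/rides.json"):
    print(row)
```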
In the colab notebook you can also find a code snippet to load the data - but we will load some data later in the course and you can explore the colab on your own after the course.
What is worth keeping in mind at this point is that the loader library we will use later, `dlt` or data load tool, will respect the streaming concept of the generator and will process it in an efficient way, keeping memory usage low and using parallelism where possible.
Let's move over to the Colab notebook and run examples 2 and 3, compare them, and finally load examples 1 and 3 to DuckDB.
# Normalising data
You often hear that data people spend most of their time “cleaning” data. What does this mean?
Let's look granularly into what people consider data cleaning.
Usually we have 2 parts:
- Normalising data without changing its meaning,
- and filtering data for a use case, which changes its meaning.
### Part of what we often call data cleaning is just metadata work:
- Add types (string to number, string to timestamp, etc)
- Rename columns: Ensure column names follow a supported standard downstream - such as no strange characters in the names.
- Flatten nested dictionaries: Bring nested dictionary values into the top dictionary row
- Unnest lists or arrays into child tables: Arrays or lists cannot be flattened into their parent record, so if we want flat data we need to break them out into separate tables.
- We will look at a practical example next, as these concepts can be difficult to visualise from text.
### **Why prepare data? why not use json as is?**
- We do not easily know what is inside a json document due to lack of schema
- Types are not enforced between rows of json - we could have one record where age is `25`, another where age is `twenty five`, and another where it's `25.00`. Or in some systems, you might have a dictionary for a single record, but a list of dicts for multiple records. This could easily lead to applications downstream breaking.
- We cannot just use json data easily, for example we would need to convert strings to time if we want to do a daily aggregation.
- Reading json loads more data into memory, as the whole document is scanned - while in parquet or databases we can scan a single column of a document. This causes costs and slowness.
- Json is not fast to aggregate - columnar formats are.
- Json is not fast to search.
- Basically json is designed as a "lowest common denominator format" for "interchange" / data transfer and is unsuitable for direct analytical usage.
### Practical example
> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab notebook.
In the case of the NY taxi rides data, the dataset is quite clean - so let's instead use a small example of more complex data. Let's assume we know some information about passengers and stops.
For this example we modified the dataset as follows
- We added nested dictionaries
```json
"coordinates": {
"start": {
"lon": -73.787442,
"lat": 40.641525
},
```
- We added nested lists
```json
"passengers": [
{"name": "John", "rating": 4.9},
{"name": "Jack", "rating": 3.9}
],
```
- We added a record hash that gives us a unique id for the record, for easy identification
```json
"record_hash": "b00361a396177a9cb410ff61f20015ad",
```
We want to load this data to a database. How do we want to clean the data?
- We want to flatten dictionaries into the base row
- We want to flatten lists into a separate table
- We want to convert time strings into time type
```python
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "Trip_Distance": 17.52,
        "coordinates": {
            "start": {
                "lon": -73.787442,
                "lat": 40.641525
            },
            "end": {
                "lon": -73.980072,
                "lat": 40.742963
            }
        },
        "Rate_Code": None,
        "store_and_forward": None,
        "Payment": {
            "type": "Credit",
            "amt": 20.5,
            "surcharge": 0,
            "mta_tax": None,
            "tip": 9,
            "tolls": 4.15,
            "status": "booked"
        },
        "Passenger_Count": 2,
        "passengers": [
            {"name": "John", "rating": 4.9},
            {"name": "Jack", "rating": 3.9}
        ],
        "Stops": [
            {"lon": -73.6, "lat": 40.6},
            {"lon": -73.5, "lat": 40.5}
        ]
    },
]
```
Now let's normalise this data.
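Before handing this over to a tool, here is a rough manual sketch of those three steps (flatten the dictionaries into the row, break the lists out into child tables keyed by `record_hash`, parse the time strings). The flat column names are invented for illustration, and this is not how dlt does it internally:
```python
from datetime import datetime

def normalise(record):
    # flatten nested dictionaries into the base row and type the timestamps
    row = {
        "vendor_name": record["vendor_name"],
        "record_hash": record["record_hash"],
        "pickup": datetime.fromisoformat(record["time"]["pickup"]),
        "dropoff": datetime.fromisoformat(record["time"]["dropoff"]),
        "start_lon": record["coordinates"]["start"]["lon"],
        "start_lat": record["coordinates"]["start"]["lat"],
        "end_lon": record["coordinates"]["end"]["lon"],
        "end_lat": record["coordinates"]["end"]["lat"],
        "payment_type": record["Payment"]["type"],
        "payment_amt": record["Payment"]["amt"],
    }
    # unnest the lists into child tables, linked to the parent by record_hash
    passengers = [{"record_hash": record["record_hash"], **p} for p in record["passengers"]]
    stops = [{"record_hash": record["record_hash"], **s} for s in record["Stops"]]
    return row, passengers, stops

row, passengers, stops = normalise(data[0])
print(row, passengers, stops, sep="\n")
```
Doing this by hand for every field of every source quickly becomes tedious and brittle, which is exactly the gap the next tool fills.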
## Introducing dlt
dlt is a Python library created to help data engineers build simpler, faster and more robust pipelines with minimal effort.
You can think of dlt as a loading tool that implements the best practices of data pipelines, enabling you to just “use” those best practices in your own pipelines, in a declarative way.
This enables you to stop reinventing the flat tyre, and leverage dlt to build pipelines much faster than if you did everything from scratch.
dlt automates much of the tedious work a data engineer would do, and does it in a way that is robust. dlt can handle things like:
- Schema: Inferring and evolving schema, alerting changes, using schemas as data contracts.
- Typing data, flattening structures, renaming columns to fit database standards. In our example we will pass the “data” you can see above and see it normalised.
- Processing a stream of events/rows without filling memory. This includes extraction from generators.
- Loading to a variety of dbs or file formats.
Let's use it to load our nested json to DuckDB.
Here's how you would do it on your local machine; I will walk you through it here before showing the same thing in Colab.
First, install dlt:
```bash
# Make sure you are using Python 3.8-3.11 and have pip installed

# spin up a venv
python -m venv ./env
source ./env/bin/activate

# install dlt with the duckdb extra (quoted so it also works in zsh)
pip install "dlt[duckdb]"
```
Next, grab the data from above and run the snippet below:
- here we define a pipeline, which is a connection to a destination
- and we run the pipeline, printing the outcome
```python
import dlt

# define the connection to load to.
# We now use duckdb, but you can switch to BigQuery later
pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# run the pipeline with default settings, and capture the outcome
info = pipeline.run(data,
                    table_name="users",
                    write_disposition="replace")

# show the outcome
print(info)
```
If you are running dlt locally, you can use the built-in Streamlit app by running the CLI command below with the pipeline name we chose above.
```bash
dlt pipeline taxi_data show
```
Or explore the data in the linked Colab notebook; I'll switch to it now to show you the data.
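If you prefer plain SQL over the Streamlit app, you can also open the DuckDB database directly. A small sketch, assuming dlt's defaults: the database file is named after the pipeline (`taxi_data.duckdb`), the dataset becomes a schema (`taxi_rides`), and list fields end up in child tables such as `users__passengers`:
```python
import duckdb

# dlt's duckdb destination writes to <pipeline_name>.duckdb by default
conn = duckdb.connect("taxi_data.duckdb")
conn.sql("SET search_path = 'taxi_rides'")

print(conn.sql("SHOW TABLES"))
print(conn.sql("SELECT * FROM users"))
# child table produced by unnesting the passengers list
print(conn.sql("SELECT * FROM users__passengers"))
```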
# Incremental loading
Incremental loading means that as we update our datasets with new data, we only load the new data, as opposed to making a full copy of a source's data all over again and replacing the old version.
By loading incrementally, our pipelines run faster and cheaper.
- Incremental loading goes hand in hand with incremental extraction and state, two concepts we will not delve into in detail during this workshop.
- `State` is information that keeps track of what was loaded, to know what else remains to be loaded. dlt stores the state at the destination in a separate table.
- Incremental extraction refers to only requesting the increment of data that we need, and not more. This is tightly connected to the state to determine the exact chunk that needs to be extracted and loaded.
- You can learn more about incremental extraction and state by reading the dlt docs; a rough sketch of what it looks like follows below.
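Purely as a hedged sketch of the idea (we will not use this in the workshop, and the resource, cursor field and helper below are invented for illustration), a dlt resource can declare an incremental cursor so that each run only asks the source for rows newer than what was already loaded:
```python
import dlt

def fetch_rides_since(ts):
    # stand-in for a real API call that would return only rows newer than ts
    return [{"record_hash": "demo", "pickup_time": "2009-06-15T10:00:00"}]

@dlt.resource(write_disposition="append")
def rides(cursor=dlt.sources.incremental("pickup_time", initial_value="2009-06-01T00:00:00")):
    # dlt persists cursor.last_value in the state at the destination,
    # so the next run continues from where the previous one stopped
    yield from fetch_rides_since(cursor.last_value)

pipeline = dlt.pipeline(pipeline_name="incremental_demo", destination="duckdb", dataset_name="demo")
print(pipeline.run(rides()))
```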
### dlt currently supports 2 ways of loading incrementally:
1. Append:
- We can use this for immutable or stateless events (data that doesn't change), such as taxi rides. For example, every day there are new rides, and we could load only the new ones instead of the entire history (a minimal sketch follows after the diagram).
- We could also use this to load different versions of stateful data, for example for creating a “slowly changing dimension” table for auditing changes. For example, if we load a list of cars and their colors every day, and one day one car changes color, we need both sets of data to be able to discern that a change happened.
2. Merge:
- We can use this to update data that changes.
- For example, a taxi ride could have a payment status, which is originally “booked” but could later be changed into “paid”, “rejected” or “cancelled”.
Here is how you can think about which method to use:
![Incremental Loading](./incremental_loading.png)
* If you want to keep track of when changes occur in stateful data (slowly changing dimension) then you will need to append the data
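For the append case mentioned in point 1, the only thing that changes compared to the earlier load is the write disposition. A minimal sketch with an invented extra ride: run it twice and the table simply keeps growing instead of being replaced.
```python
import dlt

# an invented new ride, just to illustrate appending
new_rides = [{"record_hash": "new_ride_1", "vendor_name": "VTS", "Trip_Distance": 3.1}]

pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# append: existing rows stay untouched, the new rows are added on top
info = pipeline.run(new_rides,
                    table_name="rides",
                    write_disposition="append")
print(info)
```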
### Let's do a merge example together
> 💡 This is the bread and butter of data engineers pulling data, so follow along.
- Compared to our previous load, the payment status changed from “booked” to “cancelled”. Perhaps Jack likes to defraud taxi drivers, which would explain his low rating. Besides the ride status change, he also got his rating lowered further.
- The merge operation replaces an old record with a new one based on a key. The key could consist of multiple fields or a single unique id. We will use the `record_hash` we created, for simplicity. If you do not have a unique key, you could create one deterministically out of several fields, for example by concatenating the data and hashing it.
- A merge operation replaces rows; it does not update them in place. If you want to update only parts of a row, you would have to load the new data by appending it and do a custom transformation to combine the old and new data.
In this example, the ratings of the two passengers got lowered and we need to update the values. We do it by using the merge write disposition, replacing the records identified by the `record_hash` present in the new data.
```python
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "Trip_Distance": 17.52,
        "coordinates": {
            "start": {
                "lon": -73.787442,
                "lat": 40.641525
            },
            "end": {
                "lon": -73.980072,
                "lat": 40.742963
            }
        },
        "Rate_Code": None,
        "store_and_forward": None,
        "Payment": {
            "type": "Credit",
            "amt": 20.5,
            "surcharge": 0,
            "mta_tax": None,
            "tip": 9,
            "tolls": 4.15,
            "status": "cancelled"
        },
        "Passenger_Count": 2,
        "passengers": [
            {"name": "John", "rating": 4.4},
            {"name": "Jack", "rating": 3.6}
        ],
        "Stops": [
            {"lon": -73.6, "lat": 40.6},
            {"lon": -73.5, "lat": 40.5}
        ]
    },
]

# define the connection to load to.
# We reuse the same pipeline as before, so the merge targets the table we already loaded
pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# run the pipeline with default settings, and capture the outcome
info = pipeline.run(data,
                    table_name="users",
                    write_disposition="merge",
                    merge_key="record_hash")

# show the outcome
print(info)
```
As you can see in your notebook, the payment status and Jack's rating were updated after running the code.
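If you loaded to a local DuckDB file rather than Colab, a quick query can confirm it, under the same naming assumptions as the earlier inspection snippet (dlt flattens nested fields such as `Payment.status` into columns like `payment__status`):
```python
import duckdb

conn = duckdb.connect("taxi_data.duckdb")
conn.sql("SET search_path = 'taxi_rides'")

# the payment status should now read 'cancelled' ...
print(conn.sql("SELECT record_hash, payment__status FROM users"))
# ... and Jack's rating should be 3.6
print(conn.sql("SELECT name, rating FROM users__passengers"))
```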
### What's next?
- You could change the destination to parquet + local file system or storage bucket. See the colab bonus section.
- You could change the destination to BigQuery (see the Colab bonus section). Destination and credential setup docs: https://dlthub.com/docs/dlt-ecosystem/destinations/, https://dlthub.com/docs/walkthroughs/add_credentials
- You could use a decorator to convert the generator into a customised dlt resource: https://dlthub.com/docs/general-usage/resource
- You can deep dive into building more complex pipelines by following the guides:
- https://dlthub.com/docs/walkthroughs
- https://dlthub.com/docs/build-a-pipeline-tutorial
- You can join our [Slack community](https://dlthub.com/community) and engage with us there.

View File

{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# **Homework**: Data talks club data engineering zoomcamp Data loading workshop\n",
"\n",
"Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\n",
"\n",
"Here are the exercises we will do\n",
"\n",
"\n"
],
"metadata": {
"id": "mrTFv5nPClXh"
}
},
{
"cell_type": "markdown",
"source": [
"# 1. Use a generator\n",
"\n",
"Remember the concept of generator? Let's practice using them to futher our understanding of how they work.\n",
"\n",
"Let's define a generator and then run it as practice.\n",
"\n",
"**Answer the following questions:**\n",
"\n",
"- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\n",
"- **Question 2: What is the 13th number yielded**\n",
"\n",
"I suggest practicing these questions without GPT as the purpose is to further your learning."
],
"metadata": {
"id": "wLF4iXf-NR7t"
}
},
{
"cell_type": "code",
"source": [
"def square_root_generator(limit):\n",
" n = 1\n",
" while n <= limit:\n",
" yield n ** 0.5\n",
" n += 1\n",
"\n",
"# Example usage:\n",
"limit = 5\n",
"generator = square_root_generator(limit)\n",
"\n",
"for sqrt_value in generator:\n",
" print(sqrt_value)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wLng-bDJN4jf",
"outputId": "547683cb-5f56-4815-a903-d0d9578eb1f9"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"1.0\n",
"1.4142135623730951\n",
"1.7320508075688772\n",
"2.0\n",
"2.23606797749979\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "xbe3q55zN43j"
}
},
{
"cell_type": "markdown",
"source": [
"# 2. Append a generator to a table with existing data\n",
"\n",
"\n",
"Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\n",
"\n",
"1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\n",
"2. Append the second generator to the same table as the first.\n",
"3. **After correctly appending the data, calculate the sum of all ages of people.**\n",
"\n",
"\n"
],
"metadata": {
"id": "vjWhILzGJMpK"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2MoaQcdLBEk6",
"outputId": "d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\n",
"{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\n",
"{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\n",
"{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\n",
"{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\n",
"{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\n",
"{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\n",
"{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\n",
"{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\n",
"{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\n",
"{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\n"
]
}
],
"source": [
"def people_1():\n",
" for i in range(1, 6):\n",
" yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 25 + i, \"City\": \"City_A\"}\n",
"\n",
"for person in people_1():\n",
" print(person)\n",
"\n",
"\n",
"def people_2():\n",
" for i in range(3, 9):\n",
" yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 30 + i, \"City\": \"City_B\", \"Occupation\": f\"Job_{i}\"}\n",
"\n",
"\n",
"for person in people_2():\n",
" print(person)\n"
]
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"id": "vtdTIm4fvQCN"
}
},
{
"cell_type": "markdown",
"source": [
"# 3. Merge a generator\n",
"\n",
"Re-use the generators from Exercise 2.\n",
"\n",
"A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n",
"\n",
"Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n",
"\n",
"After loading, you should have a total of 8 records, and ID 3 should have age 33.\n",
"\n",
"Question: **Calculate the sum of ages of all the people loaded as described above.**\n"
],
"metadata": {
"id": "pY4cFAWOSwN1"
}
},
{
"cell_type": "markdown",
"source": [
"# Solution: First make sure that the following modules are installed:"
],
"metadata": {
"id": "kKB2GTB9oVjr"
}
},
{
"cell_type": "code",
"source": [
"#Install the dependencies\n",
"%%capture\n",
"!pip install dlt[duckdb]"
],
"metadata": {
"id": "xTVvtyqrfVNq"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# to do: homework :)"
],
"metadata": {
"id": "a2-PRBAkGC2K"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Questions? difficulties? We are here to help.\n",
"- DTC data engineering course channel: https://datatalks-club.slack.com/archives/C01FABYF2RG\n",
"- dlt's DTC cohort channel: https://dlthub-community.slack.com/archives/C06GAEX2VNX"
],
"metadata": {
"id": "PoTJu4kbGG0z"
}
}
]
}


View File

<p align="center">
<picture>
<source srcset="https://github.com/risingwavelabs/risingwave/blob/main/.github/RisingWave-logo-dark.svg" width="500px" media="(prefers-color-scheme: dark)">
<img src="https://github.com/risingwavelabs/risingwave/blob/main/.github/RisingWave-logo-light.svg" width="500px">
</picture>
</p>
<p align="center">
<a
href="https://docs.risingwave.com/"
target="_blank"
><b>Documentation</b></a>&nbsp;&nbsp;&nbsp;📑&nbsp;&nbsp;&nbsp;
<a
href="https://tutorials.risingwave.com/"
target="_blank"
><b>Hands-on Tutorials</b></a>&nbsp;&nbsp;&nbsp;🎯&nbsp;&nbsp;&nbsp;
<a
href="https://cloud.risingwave.com/"
target="_blank"
><b>RisingWave Cloud</b></a>&nbsp;&nbsp;&nbsp;🚀&nbsp;&nbsp;&nbsp;
<a
href="https://risingwave.com/slack"
target="_blank"
>
<b>Get Instant Help</b>
</a>
</p>
<div align="center">
<a
href="https://risingwave.com/slack"
target="_blank"
>
<img alt="Slack" src="https://badgen.net/badge/Slack/Join%20RisingWave/0abd59?icon=slack" />
</a>
<a
href="https://twitter.com/risingwavelabs"
target="_blank"
>
<img alt="X" src="https://img.shields.io/twitter/follow/risingwavelabs" />
</a>
<a
href="https://www.youtube.com/@risingwave-labs"
target="_blank"
>
<img alt="YouTube" src="https://img.shields.io/youtube/channel/views/UCsHwdyBRxBpmkA5RRd0YNEA" />
</a>
</div>
## Stream processing with RisingWave
In this hands-on workshop, we'll learn how to process real-time streaming data using SQL in RisingWave. The system we'll use is [RisingWave](https://github.com/risingwavelabs/risingwave), an open-source SQL database for processing and managing streaming data. Its user experience should feel familiar, as it's fully wire-compatible with PostgreSQL.
![RisingWave](https://raw.githubusercontent.com/risingwavelabs/risingwave-docs/main/docs/images/new_archi_grey.png)
We'll cover the following topics in this workshop:
- Why Stream Processing?
- Stateless computation (Filters, Projections)
- Stateful Computation (Aggregations, Joins)
- Data Ingestion and Delivery
RisingWave in 10 Minutes:
https://tutorials.risingwave.com/docs/intro
[Project Repository](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04)
## Homework
**Please set up the environment in [Getting Started](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04?tab=readme-ov-file#getting-started) and for the [Homework](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/homework.md#setting-up) first.**
## Question 0
_This question is just a warm-up to introduce dynamic filter, please attempt it before viewing its solution._
What are the pick up taxi zones at the latest dropoff times?
For this part, we will use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/).
<details>
<summary>Solution</summary>
```sql
CREATE MATERIALIZED VIEW latest_dropoff_time AS
    WITH t AS (
        SELECT MAX(tpep_dropoff_datetime) AS latest_dropoff_time
        FROM trip_data
    )
    SELECT taxi_zone.Zone AS taxi_zone, latest_dropoff_time
    FROM t,
         trip_data
    JOIN taxi_zone
        ON trip_data.DOLocationID = taxi_zone.location_id
    WHERE trip_data.tpep_dropoff_datetime = t.latest_dropoff_time;

--    taxi_zone    | latest_dropoff_time
-- ----------------+---------------------
--  Midtown Center | 2022-01-03 17:24:54
-- (1 row)
```
</details>
### Question 1
Create a materialized view to compute the average, min and max trip time between each pair of taxi zones.
From this MV, find the pair of taxi zones with the highest average trip time.
You may need to use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/) for this.
Bonus (no marks): Create an MV which can identify anomalies in the data. For example, if the average trip time between two zones is 1 minute, but the max trip time is 10 minutes and 20 minutes respectively.
Options:
1. Yorkville East, Steinway
2. Murray Hill, Midwood
3. East Flatbush/Farragut, East Harlem North
4. Midtown Center, University Heights/Morris Heights
### Question 2
Recreate the MV(s) in question 1, to also find the number of trips for the pair of taxi zones with the highest average trip time.
Options:
1. 5
2. 3
3. 10
4. 1
### Question 3
From the latest pickup time to 17 hours before, what are the top 10 busiest zones in terms of number of pickups?
For example, if the latest pickup time is 2020-01-01 17:00:00, then the query should return the top 10 busiest zones from 2020-01-01 00:00:00 to 2020-01-01 17:00:00.
HINT: You can use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/) to create a filter condition based on the latest pickup time.
NOTE: For this question `17 hours` was picked to ensure we have enough data to work with.
Fill in the top 10:
1. `__________`
2. `__________`
3. `__________`
4. `__________`
5. `__________`
6. `__________`
7. `__________`
8. `__________`
9. `__________`
10. `__________`
## Submitting the solutions
- Form for submitting: TBA
- You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: TBA
## Solution
Video: TBA

Binary file not shown.

Before

Width:  |  Height:  |  Size: 163 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 309 KiB