
End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API


Patrick Nguyen

June 5, 2023 (Updated: August 19, 2023)

This series comprises 4 articles showcasing a complete data engineering practice on the Azure platform, using Azure Data Lake, Databricks, Azure Data Factory, Python, Power BI, and Spark. In this first part, we get familiar with the dataset, walk through the solution architecture for Azure and Databricks, set up the initial environment, and load data from the source system into Azure Data Lake via the API approach using ADF.

Please review the related articles in the series here:

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization (Power BI)

1. Dataset

In this project, we will use the Formula One dataset from this page: http://ergast.com/mrd/ (you can find the same dataset on Kaggle as well). The data source is available through an API in JSON/XML format or as downloadable CSV files. Because we want to practice using Azure Data Factory to read data from a REST API (JSON) and write it to Azure SQL/Data Lake, we will read the data directly from the Ergast API. You can find the ER diagram of the source data on the Ergast documentation page.


2. Solution Architecture

The end-to-end architecture for this project is below: raw data is read from Ergast via the API using Azure Data Factory and landed in the ADLS raw container in JSON format. Databricks then ingests it into the ADLS ingestion layer and transforms it into the ADLS presentation layer. Finally, the data is analyzed and visualized through Databricks and Power BI, with Azure Data Factory acting as the orchestration tool to schedule and monitor the pipeline. Databricks describes the three layers processed through notebooks as Bronze (where raw data is loaded), Silver (where data is filtered and cleaned), and Gold (where data is transformed through business logic).


Our solution architecture is based on Microsoft’s recommended architecture for Azure Databricks Modern Analytics Architecture:

https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture

A similar Databricks workflow can be found on the Databricks website:

https://www.databricks.com/blog/2020/03/06/connect-90-data-sources-to-your-data-lake-with-azure-databricks-and-azure-data-factory.html

3. Set up Azure Data Lake and Databricks

Creating Azure storage and Databricks is not the focus of this article; I assume you have basic knowledge of Python and Azure setup. Go ahead and create an Azure storage account and a Databricks workspace. Ensure you enable ADLS Gen2 and create 3 containers (raw, processed and presentation), corresponding to the 3 layers of the Databricks data pipeline model: Bronze, Silver and Gold.


Next, create a Databricks service on Azure, a standard cluster, and a notebook to set up credentials to the Azure data lake. There are several methods to grant Databricks access to Azure Data Lake: Access Keys, SAS Token, Credential Passthrough, Service Principal, and Unity Catalog. In this demo, I will use a Service Principal and then mount the data lake to Databricks DBFS. If you aren't familiar with this step, you can go through the documentation here: https://docs.databricks.com/storage/azure-storage.html and https://learn.microsoft.com/en-us/azure/databricks/getting-started/connect-to-azure-storage. I may write another article about the different access methods (pros and cons) later.
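For reference, here is a minimal sketch of the Service Principal mount. The secret scope and secret names are placeholders you would replace with your own; the storage account name matches the one used later in this series.

# a sketch of mounting ADLS Gen2 via a Service Principal (OAuth)
# "formula1-scope" and the secret names are placeholders - store your own values in a secret scope
client_id = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}

# mount the raw container; repeat for the processed and presentation containers
dbutils.fs.mount(
    source="abfss://raw@formula1datalake133.dfs.core.windows.net/",
    mount_point="/mnt/formula1datalake133/raw",
    extra_configs=configs
)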

If you're not able to set up a Service Principal and the Databricks mount because your account doesn't have those features (student accounts and some company accounts cannot access Azure Active Directory to register a Service Principal), feel free to use a simpler method such as an Access Key or SAS Token. In that case, use the abfss file path to access ADLS files: abfss://raw@{your-storage-account-name}.dfs.core.windows.net. Comment or message me if you want to know more about how to set up a Service Principal and the Databricks mount, and I will write step-by-step instructions in another article.
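For example, a minimal sketch of the Access Key approach (no mount), again with a placeholder secret scope and key name:

# a sketch of the Access Key approach: set the account key in the Spark config and read via abfss
# "formula1-scope" / "storage-account-key" are placeholder secret scope and key names
spark.conf.set(
    "fs.azure.account.key.formula1datalake133.dfs.core.windows.net",
    dbutils.secrets.get(scope="formula1-scope", key="storage-account-key"))

# list the raw container to verify access
display(dbutils.fs.ls("abfss://raw@formula1datalake133.dfs.core.windows.net/"))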

4. Source Data Integration

Next, referring back to our solution architecture, we will design an Azure Data Factory activity to extract the data from the source via the API. Ergast provides the API documentation here: http://ergast.com/mrd/

First, create a new pipeline called Formula1 Data Copy and add Copy Data activities; each activity copies one table from the source in JSON format.


Then create a linked service for the source using the HTTP connector with a URL from the API documentation, such as http://ergast.com/api/f1/drivers.json. You can replace drivers.json with other table names such as results.json. Also, append the query parameter limit=1000 (e.g. http://ergast.com/api/f1/drivers.json?limit=1000); otherwise the API returns only 30 records by default.
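If you want to sanity-check the endpoint and the limit parameter outside ADF first, a quick Python sketch (assuming the requests library is available) could look like this; Ergast wraps its responses in an MRData envelope:

# quick check of the Ergast drivers endpoint with the limit parameter
import requests

resp = requests.get("http://ergast.com/api/f1/drivers.json", params={"limit": 1000})
resp.raise_for_status()

# Ergast wraps results in an MRData envelope
drivers = resp.json()["MRData"]["DriverTable"]["Drivers"]
print(len(drivers))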


For the sink linked service, choose Azure Blob Storage, load the result into the raw container, and specify a file name such as drivers.json.


Run the pipeline, then check the raw container in the storage account; you should see drivers.json with the driver information in JSON format.


We're all set for Part 2, in which we will use Databricks for data ingestion and transformation. Please follow me to get notified about the next parts of this project.

Patrick N


#databricks #python #spark #data-analysis #azure-data-engineering



End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

Azure DataBricks, ADF, Python

Patrick Nguyen

June 3, 2023 (Updated: August 18, 2023)

This is a series of 4 articles demonstrating the end-to-end data engineering process on the Azure platform, using Azure Data Lake, Databricks, Azure Data Factory, Python, Power BI and Spark technology. In this part 2, we will discuss how to use Databricks to ingest and transform the JSON files that we imported in Part 1.

Please review the related articles in the series here:

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization (Power BI)

5. Data Ingestion

After mounting ADLS in Databricks and loading the source data into the raw container, you can run this code to confirm that Databricks can read the JSON files from ADLS.

display(spark.read.json("/mnt/formula1datalake133/raw/circuits.json"))

You can also browse DBFS to review the folder structure, which mirrors the Azure Data Lake storage created earlier.


In this exercise, we will load the JSON files from the raw container into the processed container and store them in a columnar format (Parquet).

We will ingest circuits.json, defining the schema and data type for each column, and apply some basic transformations such as renaming columns to make them more meaningful for data consumers. We also drop the url column because we won't use it, and add an ingestion_date column with the current timestamp to track when the data was loaded. Only light transformations are performed here to make the raw data more meaningful; the real transformations based on business logic and requirements happen later for the presentation layer.

Here is the Python code in a Databricks notebook to perform data ingestion and some light transformation on the raw data file circuits.json.

First, load the source file into a dataframe:

#load circuits.json into a Spark dataframe and infer the schema
#(the "header" option applies only to CSV sources, so it is not needed for JSON)
circuits_df = spark.read \
.option("inferSchema", True) \
.json("/mnt/formula1datalake133/raw/circuits.json")

display(circuits_df)

circuits_df.printSchema()
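If you prefer an explicit schema over inference (generally safer for production pipelines), a sketch might look like the following; the column names and types are assumptions based on the flat circuits layout used in the rename step below:

# optional: define an explicit schema instead of inferring it (column names/types assumed)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True)
])

circuits_df = spark.read.schema(circuits_schema).json("/mnt/formula1datalake133/raw/circuits.json")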

Then we can do some basic transformations and save the final dataframe to the processed container:

#extract only required columns, drop the url column
from pyspark.sql.functions import col
circuits_selected_df = circuits_df.drop(col("url"))

#rename required columns
circuits_renamed_df = circuits_selected_df.withColumnRenamed("circuitId", "circuit_id") \
.withColumnRenamed("circuitRef", "circuit_ref") \
.withColumnRenamed("lat", "latitude") \
.withColumnRenamed("lng", "longitude") \
.withColumnRenamed("alt", "altitude") 

#add the ingestion_date column to dataframe for audit purpose
from pyspark.sql.functions import current_timestamp
circuits_final_df = circuits_renamed_df.withColumn("ingestion_date", current_timestamp()) 

The final step is to write the transformed dataframe to the processed folder in Parquet format:

#finally export the dataframe to a Parquet file in the processed folder
circuits_final_df.write.mode("overwrite").parquet("/mnt/formula1datalake133/processed/circuits")
display(spark.read.parquet("/mnt/formula1datalake133/processed/circuits"))

You can see the transformed table is much more descriptive and understandable for end users. You can apply the same code to the other files in the raw folder.

After the ingestion and transformation process, you should have Parquet folders containing processed files in the Processed container in ADLS.


6. Data Transformation for Presentation Layer

Now we have all processed files in Parquet format with proper schema. The next step is to build a presentation layer to extract data based on business requirements.

Let's say a business partner wants information about round 8 of the 2023 season in Canada, including the results, drivers, and teams. What we're going to do next is extract the relevant tables and columns and join them into a race_results table. Below is the sample code.

First, read the Parquet files from the processed folder and rename some columns to avoid ambiguity after the joins:

raw_folder_path = "/mnt/formula1datalake133/raw"
processed_folder_path = "/mnt/formula1datalake133/processed"
presentation_folder_path = "/mnt/formula1datalake133/presentation"

drivers_df = spark.read.parquet(f"{processed_folder_path}/drivers") \
.withColumnRenamed("number", "driver_number") \
.withColumnRenamed("name", "driver_name") \
.withColumnRenamed("nationality", "driver_nationality") 

constructors_df = spark.read.parquet(f"{processed_folder_path}/constructors") \
.withColumnRenamed("name", "team_name") 

circuits_df = spark.read.parquet(f"{processed_folder_path}/circuits")

races_df = spark.read.parquet(f"{processed_folder_path}/races") \
.withColumnRenamed("name", "race_name") \
.withColumnRenamed("race_timestamp", "race_date") 

results_df = spark.read.parquet(f"{processed_folder_path}/results") \
.withColumnRenamed("time", "race_time")

Then we can join the tables together and export the result to the presentation layer:

#join the tables (results -> races -> circuits -> drivers -> constructors)
race_results_df = results_df.join(races_df, results_df.race_id == races_df.race_id) \
                            .join(circuits_df, races_df.circuit_id == circuits_df.circuit_id) \
                            .join(drivers_df, results_df.driver_id == drivers_df.driver_id) \
                            .join(constructors_df, results_df.constructor_id == constructors_df.constructor_id)

#extract required columns
final_df = race_results_df.select("race_year", "race_name", "race_date", "location", "driver_name", "driver_number", "driver_nationality",
                                 "team_name", "grid", "fastest_lap", "race_time", "points", "position")


#write the result file to presentation layer
final_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/race_results")

And here we display the race results for the 2023 Canadian Grand Prix to show the business partner:

display(final_df.filter("race_year == 2023 and race_name == 'Canadian Grand Prix'").orderBy(final_df.points.desc()))

Alright, let's do another transformation required by the business partners: extracting the driver and constructor standings/rankings.

raw_folder_path = "/mnt/formula1datalake133/raw"
processed_folder_path = "/mnt/formula1datalake133/processed"
presentation_folder_path = "/mnt/formula1datalake133/presentation"

#first, we can reuse the race_results that we created earlier
race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

#aggregate the points and wins
from pyspark.sql.functions import sum, when, count, col

driver_standings_df = race_results_df \
.groupBy("race_year", "driver_name", "driver_nationality", "team_name") \
.agg(sum("points").alias("total_points"),
     count(when(col("position") == 1, True)).alias("wins"))

display(driver_standings_df)

#save the result to the presentation layer
driver_standings_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/driver_standings")

This is the sample Python code for driver_standings; constructor_standings is similar.
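For completeness, here is a minimal sketch of constructor_standings, mirroring the driver_standings logic above but grouping by team:

#a sketch of constructor_standings, mirroring the driver_standings logic
from pyspark.sql.functions import sum, when, count, col

race_results_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

constructor_standings_df = race_results_df \
.groupBy("race_year", "team_name") \
.agg(sum("points").alias("total_points"),
     count(when(col("position") == 1, True)).alias("wins"))

#save the result to the presentation layer
constructor_standings_df.write.mode("overwrite").parquet(f"{presentation_folder_path}/constructor_standings")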

To sum up Part 2, we ingested and transformed the JSON files based on business requirements using Databricks. In the next part, we will create pipelines in Azure Data Factory to orchestrate and schedule these Databricks notebooks, before moving on to Power BI for visualization.

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

Patrick Nguyen

Jun 4, 2023

This is a series of 4 articles demonstrating the end-to-end data engineering process on the Azure platform, using Azure Data Lake, DataBricks, Azure Data Factory, Python, Power BI and Spark technology. In this part 3, we will use Azure Data Factory to build data pipelines and schedules.

Please review the whole series here:

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization (Power BI)

7. Create Data Pipelines in Azure Data Factory to orchestrate Databricks Notebooks

First, create a new Azure Data Factory service on Azure, create a new pipeline in ADF, and drag the Databricks Notebook activity onto the canvas.

Set up a new linked service for Databricks. If you authenticate with a Managed Service Identity or a User Assigned Managed Identity, grant the identity the Contributor role in the Azure Databricks resource's Access control (IAM) menu so that you can select the existing interactive cluster. Test the connection and, if it succeeds, click Create.

Select the Databricks notebook you want to run in the Notebook path. In this case, I choose the Ingest Circuits notebook.

After completion, you can debug and publish this activity. Next, create the same activities for the rest of the notebooks in the Ingestion folder by cloning the existing one and changing the notebook path. All of these notebooks can run in parallel.

You may want to check whether the raw container contains any files and only trigger the Databricks notebooks if it does. To do that, add two more activities: Get Metadata and If Condition. In the Get Metadata activity, create a linked service to Azure Data Lake Storage, set the container to raw, leave the Directory and File Name blank, and choose Child Items as the argument.

Set up a dependency by dragging the Success arrow from Get Metadata to the If Condition activity, and move all of the Databricks notebook activities into the True branch. To run the pipeline only if the raw container contains files, go to the Activities tab of the If Condition and enter an expression that checks that Get Metadata returned at least one child item, for example @greater(length(activity('Get Metadata1').output.childItems), 0), replacing Get Metadata1 with the name of your Get Metadata activity.

Great, now the notebooks only run if the raw container contains files. Next, we will create another pipeline to execute the Databricks transformation notebooks.

In ADF, create another pipeline (you can clone the first one) and rename it. In order to execute the transformation, we have to run the Ingestion pipeline first. Search for Execute pipeline, drag it to the screen and choose the Ingestion pipeline.

Next, create 3 Databricks activities to execute the 3 transformation notebooks that we created in Part 2. If you remember, driver_standings and constructor_standings use race_results as their input, so I create a dependency here: race_results runs first, followed by the two standings notebooks.

Publish the changes. The ADF pipelines are now ready, and we can create a trigger to run them at a certain date/time. You can choose a Schedule or Tumbling Window trigger.

After that, add this trigger to the transformation pipeline; this pipeline also executes the ingestion pipeline as its first step.

You can test the trigger and monitor the pipeline process in the Monitor menu.

Congratulations on completing the first three parts of the project. We went through solution design, reading data from the source, data ingestion and transformation, and data pipeline orchestration. At this point, you have completed most of the data engineering tasks needed to have transformed data ready for data analysts and scientists. In the next part, I will show how data consumers use the transformed data for analysis and visualization; understanding that helps data engineers coordinate better with data analysts to build good data pipelines.

Patrick N


End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization

Patrick Nguyen

Jun 4, 2023

This is a series of 4 articles demonstrating the end-to-end data engineering process on the Azure platform, using Azure Data Lake, Databricks, Azure Data Factory, Python, Power BI and Spark technology. In this article, the final part of the series, we will discuss how data analysts/BI developers analyze the dataset ingested earlier by the data engineering team. We will also learn how to load data from the Databricks Hive metastore into Power BI for data visualization.

Please review the related articles in the series here:

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization (Power BI)

8. Data Analysis

Now we have transformed data ready for data consumers to analyze and visualize. Hold on: if you remember, the format we use in the presentation layer is Parquet. Most end users are familiar with tables and SQL, which is how they visualize data in Power BI/Tableau/Looker. Even though those tools can read Parquet, working with raw files is not straightforward for end users. So, as data engineers, we may expose the presentation layer in formats that are easier for data consumers to use.

In this step, I will create a database and tables in Databricks, registered in the Hive metastore. A Hive metastore is a database that holds metadata about our data, such as the paths to the files in the data lake and their format (Parquet, Delta, CSV, etc.).

First, create a database named formula1_db:

create database if not exists formula1_db;

Then create 3 tables from the 3 Parquet outputs in the presentation folder: race_results, driver_standings and constructor_standings. The sample code is below.

presentation_folder_path = "/mnt/formula1datalake133/presentation"

raceresults_df = spark.read.parquet(f"{presentation_folder_path}/race_results")

raceresults_df.write.mode("overwrite").format("parquet").saveAsTable("formula1_db.race_results")
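The other two presentation outputs follow the same pattern; a short sketch:

# register the remaining presentation outputs as tables in formula1_db (same pattern as race_results)
for table_name in ["driver_standings", "constructor_standings"]:
    df = spark.read.parquet(f"{presentation_folder_path}/{table_name}")
    df.write.mode("overwrite").format("parquet").saveAsTable(f"formula1_db.{table_name}")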

Make sure all 3 tables are present in the formula1_db database.

Alright, the data engineering team has completed its tasks. Let's walk through an example in which the data analysis team comes in and wants to do some analysis and visualization: they want to identify the dominant drivers and teams. Let's look at their steps.

First, they review the data from the race_results table and find that the position points differ across periods/years. Therefore, they may want to standardize points across years, or simply look at dominant drivers over a 10-year window in which the points system is consistent.

To simplify the problem for this article, I only use the period from 2010 to 2020, over which the points system is consistent.

Then they extract the top 10 drivers with the most total points over this period:
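A rough sketch of that query, using Spark SQL against the driver_standings table created above:

# top 10 drivers by total points between 2010 and 2020 (a sketch)
dominant_drivers_df = spark.sql("""
    SELECT driver_name,
           SUM(total_points) AS total_points,
           SUM(wins)         AS wins
    FROM formula1_db.driver_standings
    WHERE race_year BETWEEN 2010 AND 2020
    GROUP BY driver_name
    ORDER BY total_points DESC
    LIMIT 10
""")

display(dominant_drivers_df)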

9. Data Visualization

From the table result in Databricks, you can simply click the plus (+) sign to create different visualizations.

You can even create a quick dashboard in Databricks. That’s pretty cool and convenient, right?

Having said that, the data visualization features in Databricks are limited. For advanced features and self-service dashboards for data consumers, you may want to use Power BI. Let's connect Power BI to the Databricks data source.

After setting up the connection, you can browse the tables in the Databricks Hive metastore. Select and load driver_standings into Power BI.

Then data analysts can make nice reports and dashboards and present them to management for decision-making. You can review my other post about Strategies to turn data into insights

That completes the end-to-end Formula 1 data engineering project on Azure. To recap, here are the steps we've completed:

  1. Gathered requirements and collected source data
  2. Designed the solution architecture for our requirements
  3. Set up environments on Azure and initiated a Databricks cluster
  4. Granted Databricks access to Azure Data Lake
  5. Loaded source data into Azure Data Lake Storage
  6. Ingested and transformed data into Parquet format on Databricks
  7. Created data pipelines in Azure Data Factory
  8. Scheduled and monitored data pipelines in Azure Data Factory
  9. Created a database and tables from Parquet files in Databricks
  10. Analyzed data in Databricks and visualized data in Power BI

That's a long process, and congratulations to all of you for getting to this point. In future articles I will cover data engineering topics that we have not touched on in this series, such as incremental loads, integrating data sources using ADF, dealing with streaming data, and data governance. Please make sure you subscribe to get notified about my new articles.

Patrick N

