Big Data - Data Warehouse - Azure Data Factory (Part 1)

Hi guys! It’s me again :laughing:. As scheduled, today we continue our series on data warehouses. In the previous section, we learned about the modern data warehouse architecture; in this section we will learn about Azure Data Factory and how to create a pipeline that extracts data from a customer’s database into Azure Blob Storage and updates a watermark with a stored procedure.

Note: Those who have not read the previous section can follow this link:
Big Data - Data Warehouse - Data Warehouse Modern Architecture

Yooo :love_you_gesture: :love_you_gesture: :love_you_gesture:

Before we go into pipeline creation, let’s find out what Azure Data Factory (ADF) is and what its basic components are.

As introduced in the previous section,

Azure Data Factory is Azure’s cloud ETL service for data integration and data transformation.

and in this section, we learn a bit more:

To do this job, ADF provides users with an interface to build pipelines.

Basically, the interface for working with ADF looks like this,

and a pipeline is where we define the tasks that ETL our data.

Core concepts

  • An activity represents a single task in a pipeline (for example, copying data or running a stored procedure).
  • A dataset is a named view of data that simply points to or references the data we want to use in our activities as inputs and outputs.
  • Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources.

OK, now you have the most basic and important concepts :innocent:

Approach

Our goal: from the customer’s database, we want to extract data from a number of different tables and then save that data into files.

How to do that ??? :thinking: :thinking: :thinking:

The first thing we can think of is making a separate pipeline for each table we need =)). Okay, that’s a solution, but obviously with so many tables, managing all those pipelines and datasets quickly becomes a headache.

We will take a wiser approach :star:

Remember the basics of any programming language: when we want to print the numbers 1 to 100, instead of writing 100 print statements, we can write a single for loop running from 1 to 100 and print each element i :100:

We will apply this idea when extracting data into Azure Blob Storage.
So we have a workflow like this

(image: workflow diagram)

Let’s start !

Perform

Now we will create a JSON file to hold the list of elements, where each element contains the information for one table that we need.

It looks like this
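A control file along these lines might look like the following sketch (the table names, watermark column, and destination folders are hypothetical):

```json
[
  {
    "tableName": "dbo.Customers",
    "watermarkColumn": "LastModifiedDate",
    "destinationFolder": "customers"
  },
  {
    "tableName": "dbo.Orders",
    "watermarkColumn": "LastModifiedDate",
    "destinationFolder": "orders"
  }
]
```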

Thank you @anon19898721 for this wise approach :innocent: :innocent: :innocent:

Now we will upload this JSON file to Azure Blob Storage.

The first thing to do when creating a pipeline is to create the linked service and the dataset.

First, a linked service to the Azure Blob Storage account where we have uploaded the JSON file:
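The UI builds this for us, but behind the scenes such a linked service is defined in JSON, roughly like this (the name and the connection-string placeholders are ours):

```json
{
  "name": "AzureBlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```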

Next, the dataset:
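As a sketch, a JSON dataset pointing at the uploaded control file could be defined like this (the dataset name, container, and file name are hypothetical):

```json
{
  "name": "TableListDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "Json",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "config",
        "fileName": "tables.json"
      }
    }
  }
}
```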

Note: Look closely when creating the dataset: we need to reference the linked service that we just created.

Now that everything is ready, let’s create our first pipeline.

We want to create a Lookup activity that returns the list of items in the JSON file.
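In pipeline JSON, such a Lookup activity might look roughly like this; firstRowOnly is set to false so the whole array comes back, not just the first element (the names follow the sketches above):

```json
{
  "name": "LookupTableList",
  "type": "Lookup",
  "typeProperties": {
    "source": { "type": "JsonSource" },
    "dataset": {
      "referenceName": "TableListDataset",
      "type": "DatasetReference"
    },
    "firstRowOnly": false
  }
}
```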

In the next step, we need a ForEach activity that iterates over each item of the JSON array returned by the Lookup activity.
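A sketch of that ForEach activity: its items expression reads the Lookup output, and the inner Copy activity is only a placeholder for the real copy logic:

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "dependsOn": [
    {
      "activity": "LookupTableList",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('LookupTableList').output.value",
      "type": "Expression"
    },
    "activities": [
      { "name": "CopyTableToBlob", "type": "Copy" }
    ]
  }
}
```

Inside the loop, the current element is available as @item(), so the copy logic can use fields like @item().tableName and @item().watermarkColumn.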

After this step, we will have a result like this

(image: pipeline with the Lookup and ForEach activities)

Keep in mind: Each item will look like this
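Following the hypothetical control file above, each item is simply one object from that array:

```json
{
  "tableName": "dbo.Customers",
  "watermarkColumn": "LastModifiedDate",
  "destinationFolder": "customers"
}
```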

So that’s all for this section ! :smiley:
See you soon in the next section Big Data - Data Warehouse - Azure Data Factory (Part 2) ! :love_you_gesture: :love_you_gesture: :love_you_gesture:
