Big Data - Data Warehouse - Azure Data Factory(Part 2)

Hi guys! It’s me again :laughing: . As scheduled, today we continue our series on data warehouses. In the previous section, we looked at Azure Data Factory as well as pipelines and activities like Lookup and ForEach. In this part, we will complete that pipeline :v:

Note: If you have not read the previous section, you can review it by following this link:
Big Data - Data Warehouse - Azure Data Factory(Part 1)

Yooo :love_you_gesture: :love_you_gesture: :love_you_gesture:

After we have retrieved the items, we need to retrieve the oldWatermark and newWatermark of each table. To do this, we use the Lookup activity, just as we did in the previous section.

image
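The oldWatermark Lookup reads the last recorded watermark for the current table from a control table in our own database. As a sketch (the control table and column names here are assumptions for illustration, not the article’s actual schema):

```sql
-- Hypothetical query for the oldWatermark Lookup activity: it fetches the
-- watermark recorded by the previous run for the table in the current item.
SELECT TableName, WatermarkValue
FROM   watermarktable
WHERE  TableName = '@{item().TABLE_NAME}'
```

The `@{item().TABLE_NAME}` expression is resolved by Data Factory inside the ForEach loop, so the same Lookup works for every table in the items list.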

Why do we need both an old watermark and a new watermark?

OK, that’s a good question. We could do without the old watermark and the new watermark. But then, every time we extract data from the customer’s source and run the ETL, we would have to delete and reload a lot of records across all tables, from the raw table and the staging table to the product table. That’s terrible :)).

So what is the benefit of using an old watermark and a new watermark?

When using an old watermark and a new watermark, we only need to fetch the data that is not yet in our tables, instead of deleting everything and reloading all the data.

Now we just need to execute the query contained in each item.

The newWatermark Lookup activity works in a similar way.

Note:

  • The new watermark Lookup’s source dataset points at the tables in the customer database, not at our own control table.
  • Setting First row only to true or false changes the structure of the output (a single `output.firstRow` object versus an `output.value` array) and therefore how the result is referenced in later activities.
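Since the new watermark is simply the most recent change in the source table, the Lookup’s query can be as small as this (the `LastModifyTime` column name is an assumption; use whatever change-tracking column the source table actually has):

```sql
-- Hypothetical query for the newWatermark Lookup activity: it asks the
-- customer's source table for its latest modification time, which becomes
-- the upper bound of this run's extraction window.
SELECT MAX(LastModifyTime) AS NewWatermarkValue
FROM   @{item().TABLE_NAME}
```

With First row only set to true, a later activity would read this value as `@activity('LookupNewWaterMark').output.firstRow.NewWatermarkValue` (the activity name here is also an assumption).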

OK, so far we have both the old watermark and the new watermark. It is now possible to copy the new data from the customer’s source.

We need a Copy data activity, for which we must specify a source dataset (where we get the data from) as well as a sink dataset (where we store the data we get).

Retrieving the necessary data from the customer’s source is quite simple, because the query is already stored in each item.
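A sketch of what such a delta query might look like, with Data Factory expressions substituting the two watermark values at run time (the activity names, output property names, and the `LastModifyTime` column are assumptions, not taken from the article):

```sql
-- Hypothetical source query for the Copy data activity: it selects only the
-- rows that changed between the old watermark (exclusive) and the new
-- watermark (inclusive), so unchanged data is never copied again.
SELECT *
FROM   @{item().TABLE_NAME}
WHERE  LastModifyTime >  '@{activity('LookupOldWaterMark').output.firstRow.WatermarkValue}'
  AND  LastModifyTime <= '@{activity('LookupNewWaterMark').output.firstRow.NewWatermarkValue}'
```

Using an exclusive lower bound and an inclusive upper bound means each row is picked up by exactly one run.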

With the sink dataset we can set the file format as well as the file name. If no file name is set, one is generated randomly; a dynamic expression such as `@{item().TABLE_NAME}.csv` can be used to name each file after its table.

At this point we have achieved the goal of bringing new data into Azure Blob Storage. But we still have one thing left to do: update the old watermark.

To update the old watermark we can use a stored procedure. Azure Data Factory provides a Stored procedure activity for exactly this purpose.

This stored procedure takes as input parameters the table name, the new value for the old watermark, and probably a merchantID …
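A minimal sketch of such a procedure, assuming the same hypothetical control table as above (procedure, parameter, and table names are illustrative, not the article’s actual ones):

```sql
-- Hypothetical watermark-update procedure: after a successful copy, the
-- Stored procedure activity calls it with the new watermark value so that
-- the next pipeline run starts its window from there.
CREATE PROCEDURE usp_update_watermark
    @TableName    VARCHAR(100),
    @NewWatermark DATETIME
AS
BEGIN
    UPDATE watermarktable
    SET    WatermarkValue = @NewWatermark
    WHERE  TableName = @TableName;
END
```

Because the update only runs after the Copy data activity succeeds, a failed run leaves the old watermark untouched and the next run simply retries the same window.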

OK, we’ve now completed a basic pipeline, and it is a very important one for data ETL. :100:

So that’s all for this section! :smiley:

See you soon ! :love_you_gesture: :love_you_gesture: :love_you_gesture:


Good explanation! You should write this up as a volume in the handbook. If I have anything to contribute, I’ll edit it in.
