Welcome to the SkillTech DP-900: Microsoft Azure Data Fundamentals Series!
In this module, we dive into large-scale data analytics and how Azure empowers enterprises to manage and analyze massive datasets using modern cloud-native tools.

What You’ll Learn in This Session:
What is large-scale data analytics?
Introduction to Azure Synapse Analytics, Azure Data Lake, and HDInsight
Key use cases of big data analytics in the cloud
How Azure supports data ingestion, transformation, and querying at scale

Who should watch this?
Beginners preparing for the DP-900 certification
Professionals new to big data and analytics
Anyone interested in how Azure handles real-time and batch data processing

Explore our other Courses and Additional Resources on: https://skilltech.club/

Category: Tech
Transcript
00:00In this video we are going to focus on how we can explore large-scale data analytics in Azure.
00:14The main focus of this data analytics will be on Azure Data Factory and Azure Synapse.
00:19We will provision Azure Synapse and pipelines inside Azure Synapse, and we will see
00:25how data analytics works with them. Obviously we first need to
00:29understand data ingestion in Azure, and we need to understand components
00:34like Azure Data Factory, where we are actually going to ingest data. We will also describe the
00:38data processing options for performing analytics in Azure, so we will see
00:42what kind of options are there and what kind of tools and utilities we need to
00:46understand for that, and then we will explore Azure Synapse Analytics, where
00:50the pipelines of Azure Synapse Analytics will be created and
00:54configured for ETL and ELT kinds of transformations. Now, before we proceed
01:00with some practical demos, let's try to understand data ingestion in Azure and the
01:06first thing which you need to understand in this is Azure Data Factory. ADF is a data
01:12ingestion and transformation service that allows you to load data, most of the time
01:16raw data, from many different sources, and it even allows you to take data from
01:21on-premises computers. As it ingests the data, Data Factory can clean, transform, and
01:28restructure it before loading it into a repository such as a data
01:33warehouse, where it can be processed further. Once the data is in the data
01:39warehouse, you can analyze it using the data warehouse's own features. The second
01:44option you have is PolyBase. PolyBase is a feature of SQL Server and Azure
01:49Synapse Analytics that enables you to run Transact-SQL queries that read data from
01:56external data sources. PolyBase makes these external data sources appear like
02:01tables in the SQL database itself, so it feels as if you are writing a
02:07query against local tables in the database. Using PolyBase you can read data managed by
02:13Hadoop, Spark, and Azure Blob Storage, and other database management
02:18systems like Cosmos DB, Oracle, Teradata, and even MongoDB can be integrated with PolyBase as well.
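
To make the PolyBase idea concrete, here is a minimal sketch (not from the video) of querying such an external table from Python. Everything here is an assumption for illustration: the server, database, credentials, and the dbo.People table are hypothetical placeholders, and it assumes an ODBC driver is installed and that the external data source, file format, and external table were created beforehand with the usual CREATE EXTERNAL DATA SOURCE / FILE FORMAT / TABLE statements.

```python
# Hedged sketch: query a PolyBase-style external table from Python via pyodbc.
# The server, database, credentials, and table names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"   # hypothetical SQL pool endpoint
    "DATABASE=mydw;UID=sqladminuser;PWD=<password>"
)
cursor = conn.cursor()

# dbo.People is assumed to be an external table created earlier, pointing at
# CSV files kept in Azure Data Lake Storage / Blob Storage.
cursor.execute("SELECT FirstName, LastName FROM dbo.People;")
for first_name, last_name in cursor.fetchall():
    print(first_name, last_name)

conn.close()
```
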
02:24The third and the important feature is we have SSIS. I'm sure if you have used SQL
02:30Server then you are familiar with SQL Server Integration Services which is
02:35nothing but a platform for building enterprise level data integration and data
02:39transformation solutions. If you used SSIS in the older days then you
02:44know what it is; these days people use SSIS integrated with PolyBase
02:50and Azure Data Factory. You can use SSIS to solve complex business
02:55problems by copying or downloading files, loading the data warehouse, cleaning
03:00and mining data, and managing SQL database objects and the data which is
03:06associated with Microsoft SQL Server. SSIS also ships as part of the Microsoft SQL Server
03:11product. Now let's try to understand the components of Azure Data Factory and how it
03:17is actually going to ingest and process data. Azure Data Factory has a
03:23concept called a linked service. A linked service provides a
03:27connection to various data sources, which can be a data lake store, Azure
03:32Databricks, or even data that you simply want to take from a SQL Server
03:37located on-premises. The linked service allows you to maintain a
03:41connection between these data sources and your data factory. After that you have
03:47a construct called a dataset, which defines
03:53the schema of the data you are going to process. This is the
03:57data you have prepared from your raw data for further
04:02processing. The data coming from these linked services is
04:06ingested into a pipeline of your Azure Data Factory, and inside the pipeline you
04:12have activities which get executed to process
04:16the data. Often multiple activities are required in this
04:21process, and that's why a set of activities like this is
04:25grouped into one thing called a pipeline. These four terms, pipeline, dataset,
04:31activities, and linked services, are the most important terms if you want
04:36to understand the basics of Azure Data Factory. A pipeline can be
04:41triggered by different kinds of events, or you can trigger its
04:45execution manually. The moment the pipeline executes, the data
04:49transformation happens. Each activity inside the pipeline gets executed
04:53one by one based on the flow you have defined, following the
04:57guidelines of the dataset that is configured for it, and it can
05:01also have customized settings inside that. Chances are your
05:06pipeline will use some parameters, an integration runtime, and some
05:12control flows associated with it, and the pipeline will
05:16execute according to those parameters, the integration runtime, and the
05:20control flow configuration. You can monitor the
05:24execution with the help of the monitoring utility which is also provided inside Azure Data Factory.
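
To tie these four terms together, here is a rough sketch (not from the video) of how a pipeline with a single Copy activity is described in the JSON that Data Factory and Synapse pipelines store, written as a Python dictionary. "InputDataset" and "OutputDataset" are hypothetical dataset names that would be defined separately and bound to a linked service (the storage connection).

```python
# Hedged sketch: the shape of a minimal pipeline definition with one Copy activity,
# expressed as a Python dict that mirrors the ADF/Synapse pipeline JSON.
import json

pipeline_definition = {
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "CopyInputToOutput",
                "type": "Copy",
                "inputs": [{"referenceName": "InputDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OutputDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline_definition, indent=2))  # inspect the JSON you would publish
```
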
05:29Azure Data Factory is going to do the data transformation from the raw data to
05:34some kind of a meaningful data. Just for your kind information guys, Azure Data
05:38Factory is available in Azure as a separate service as well as if you are
05:42using Azure Synapse Analytics then inside Azure Synapse Analytics we have a tool
05:48called Synapse Pipeline which is nothing but 100% Azure Data Factory. Whatever you can
05:54use inside Azure Data Factory you can do the same kind of thing in Synapse Pipelines
05:58also. Other than Synapse pipelines, Azure Synapse Analytics is actually a bundle of
06:03multiple products. It additionally has the Synapse SQL
06:08pool, which works like a dedicated data warehouse for organized,
06:13structured data. You have Synapse Link, which allows you to connect with
06:18Cosmos DB for analytics over its transactional data. You also
06:23have a tool called Synapse Studio, which gives you an
06:29integrated development environment where you can configure your databases and your
06:34pipelines and execute queries, and you also have something known as the
06:39Apache Spark pool, formerly known as Synapse Spark, where you can execute your
06:44Spark queries and take care of your analytics with the help of Python, C#, Scala, and
06:52other supported languages. Ultimately, much like Azure Databricks, here
06:59also you can create notebooks which have multiple cells inside them, and the
07:04compute for executing those notebooks is provided by Azure Synapse Analytics.
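
As a small illustration of such a notebook cell (not part of the video's demo), this PySpark sketch reads a comma-separated file from a Data Lake Storage Gen2 container. The abfss:// path and the column names are hypothetical, and in a Synapse notebook the spark session would already be provided by the attached Spark pool.

```python
# Hedged sketch of a Synapse notebook cell in PySpark.
# In Synapse Studio the `spark` session is pre-created by the attached Spark pool;
# the getOrCreate() call is only here so the snippet is self-contained.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 path: container "myfilesystem" on account "mygen2storage".
path = "abfss://myfilesystem@mygen2storage.dfs.core.windows.net/input.txt"

df = spark.read.csv(path, header=False).toDF("FirstName", "LastName")
df.show()
```
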
07:09We will see all these things in our next demo, because right after this I want you to
07:14see Data Factory and Synapse pipeline features inside Azure Synapse Analytics, so
07:19let's have a look at this practically in our Azure portal. Okay, so now it's
07:25time to provision our Azure Synapse Analytics workspace. Let me go to my Azure portal and I'm
07:32going to click on create resource where if I go inside analytics I have Azure
07:37Synapse Analytics in that. Remember, in the same way you can provision Azure Databricks
07:41or even Azure Data Factory. I am going to select Azure Synapse Analytics, which is
07:47a big service because it has all the Synapse resources inside it. It's asking me
07:53what kind of resource group and managed resource group I want to create here. Now
07:58we need to create two resource groups in this case. I'm going to create a new resource
08:02group which is going to be DP900RG, like we always created, and then I'm creating one more resource
08:10group which is going to be managed DP900RG. This managed resource group is useful for
08:17those resources which will be associated with the Synapse workspace, in case you want to create some other
08:23managed resources later on. As of now it is going to be empty once the provisioning and deployment of
08:28Synapse workspace is done. What will be the name of this workspace? I'm giving a name like this is my
08:34Maruti Synapse workspace and then some number at the end I'm giving just for the uniqueness.
08:41The region for this I'm selecting East US, and then it is a must that every Synapse workspace is
08:48associated with Azure Data Lake Storage Gen2. You are already familiar with the Azure storage account.
08:55This is going to be very similar to that, but with additional security and a
09:01hierarchical namespace associated with the storage account, where you can store blobs in containers.
09:06So we are going to have a Gen2 storage account. I do not have any Gen2 storage account as of now, so I'm
09:12going to create a new one. I'm giving a meaningful name like this is my Gen2 storage with some number
09:20at the end just for the uniqueness and then inside this Gen2 storage they are going to create a new
09:25container to store the blob data and that is something which they call file system. So I'm going to give
09:31a name for that also which is going to be my file system that's it and then I'm going to click on okay.
09:40Remember, because additional security is involved, they will automatically assign a Storage Blob Data Contributor
09:47role on my storage account for the user who is associated with this particular
09:53account right now, and if I want to do this kind of role-based access control manually for Synapse,
09:59I can go into these links and see more details about that. I really don't want to change anything here.
10:04I'm happy with the default configuration. I'm going to click on next.
10:09Every Synapse workspace is going to have one serverless SQL pool associated with that. This serverless SQL
10:15is going to be like a SQL server where you can create some temporary databases and some
10:20external data sources, and then you can run analytical queries on them. So I have to
10:27provide a username and password for this SQL server; let me just provide something meaningful.
10:33This will be useful if I want to connect to that SQL server later on. Do I need to enable some system-
10:39assigned managed identity permissions for this? Well, I don't want this right now. So I'm going with next, next.
10:46No need to change anything. We'll just click on review plus create which will take me to the last and final
10:52page where hopefully my validation will pass and then if it is passed then I will click on create.
10:59Before clicking create, I always check what the estimated cost is for this. As I told you, there is a serverless
11:04SQL pool which is going to be created here, and that's going to cost me around 360 rupees,
11:11not per month; sorry, this is the estimated cost per terabyte
11:17of data with that. So I think this is quite a cheap cost right now. So I'm okay with this
11:23and I'm going to click on create.
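
The same provisioning can also be scripted. The sketch below is my assumption of how to do it with the azure-mgmt-synapse package; the model and field names reflect my understanding of its Workspace model and should be verified against the installed SDK version, and every resource name, URL, and password here is a placeholder.

```python
# Hedged sketch: provisioning a Synapse workspace programmatically.
# Assumes azure-identity and azure-mgmt-synapse; all names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import Workspace, DataLakeStorageAccountDetails

subscription_id = "<subscription-id>"
client = SynapseManagementClient(DefaultAzureCredential(), subscription_id)

workspace = Workspace(
    location="eastus",
    managed_resource_group_name="managedDP900RG",
    default_data_lake_storage=DataLakeStorageAccountDetails(
        account_url="https://mygen2storage.dfs.core.windows.net",  # hypothetical Gen2 account
        filesystem="myfilesystem",                                  # the default container
    ),
    sql_administrator_login="sqladminuser",
    sql_administrator_login_password="<strong-password>",
)

# Long-running operation, equivalent to clicking "Create" in the portal.
poller = client.workspaces.begin_create_or_update("DP900RG", "marutisynapse01", workspace)
print(poller.result().provisioning_state)
```
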
11:27The moment you submit this deployment, it will take some time to provision the Synapse workspace because
11:32internally there are many things which are going to be deployed. So this is a point where I will just
11:38wait for the deployment to be done, and once it is done we will see how we can do
11:43analytics with the Azure Synapse pipelines. We will also focus on Azure Synapse Studio, which is
11:50some kind of a ready-made development environment where you can do analytics,
11:55querying, and all the customizations which are required for Synapse.
12:01Okay now once the deployment is complete you can see it's showing me go to resource group
12:05and if I click on go to resource group I can see that we have our storage account which is going
12:11to be our gen 2 storage and then we have a Synapse workspace. I'm going to open the Synapse workspace
12:17in a new tab, and in this tab I'm going inside my storage account. Inside the storage account we have
12:25the same data storage options on the left side, where containers, file shares, queues, and tables are there, and inside
12:31containers I will have one container which is created by default with the name Maruti file system
12:38or something like that. This is the name which we gave while provisioning this Synapse workspace.
12:44Inside the container we don't have anything, but yes, this is a different kind of container compared to
12:49a normal storage account, because you will have role-based access control on this plus you will have
12:55an access control list associated with some ready-made service principals associated with this Synapse
13:01workspace. So logically this Gen2 storage is connected with the Synapse workspace, and you can
13:08keep data into this gen 2 storage account and then you can load and transform the data into Synapse.
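
For reference, keeping data in that Gen2 account can also be done from code. This is a minimal sketch (not shown in the video) using the azure-storage-file-datalake package; the account URL, container name, and the sample records are hypothetical placeholders, and it assumes the signed-in identity has data access on the account.

```python
# Hedged sketch: upload a small CSV file into the Gen2 filesystem (container)
# that backs the Synapse workspace. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mygen2storage.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

file_system = service.get_file_system_client("myfilesystem")   # the default container
file_client = file_system.get_file_client("input.txt")

data = b"John,Smith\nJane,Doe\nSam,Patel\n"                     # sample comma-separated records
file_client.upload_data(data, overwrite=True)
print("uploaded", len(data), "bytes")
```
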
13:15If I go to my Azure Synapse workspace, there are many things I can see here in the details,
13:21like the SQL admin details; the serverless SQL pool is already available here because it's something
13:29which is created as part of the Synapse workspace during deployment. If I want to create a dedicated
13:34SQL pool, some Apache Spark pool or maybe some data explorer pool I can create all of this by just
13:40clicking on these buttons and then going forward with the step-by-step configuration. What I'm going to do
13:46is I'm going to click on open Synapse Studio link here which will take me to a new tab altogether which
13:53will give me Synapse Studio. This is the workspace where you're actually going to work on Synapse,
13:59where you're going to deal with the different parts of the Synapse workspace. You can see I have
14:05options like Data, which allows me to create new databases and lake databases which connect with
14:14the data lake. We can also connect to external data sources or integrate datasets into
14:21this. We have a Develop section which allows me to create new notebooks which will run on Apache Spark.
14:28I can also generate new SQL scripts or KQL scripts for this particular Synapse workspace. I can click on
14:37Integrate, and here I can create my new pipelines where I can do ETL and ELT kinds of transformations, and
14:44I also have a couple of ready-made pipelines and notebooks available in the gallery if I click on
14:50browse gallery and find something from that. I can monitor all these analytics processes in
14:56the Monitor section, and whenever I run any pipeline, query, or other operation it will be visible
15:03here inside the monitoring utility. And last but not least we have a Manage section where
15:10I can manage my linked services, my pools, and my security, and if I want to associate this thing with
15:19Git integration for source control, or maybe some DevOps integration,
15:24all those things I can do here inside Manage. What I'm going to do is I am going into my storage
15:31account and inside this storage account where I have one Maruti file system container. I'm going
15:38to upload one very simple file; from my desktop I just have one file called input.txt. I have just
15:44added this file to this particular container and I'm going to click on upload. You know
15:50that when we upload a file like this, it's going to be treated like a blob. So this is a file which is
15:56just added here; it's a text file which has some 38 bytes of content, and it's
16:03created as a block blob. If I select this file and show you its properties, I can show
16:10you that this file just has some simple comma-separated values inside: I have
16:15first-name, last-name kind of comma-separated values, and three records are there inside this
16:21very simple text file. What I'm going to do is I'm going to
16:26associate this text file and this container with my synapse workspace and I'm going to do this thing
16:33inside synapse pipeline. So let me go to my integrate section. I'm going to click on plus pipeline. So I'm
16:40going to create a new pipeline. They're giving me a default name for the pipeline which is pipeline one. I'm okay
16:45with that, and now this is the place where, as we discussed earlier, a pipeline can
16:53have multiple activities inside it. So I have a section of activities here from which I can select
16:58one or more activities, and then I can build my transformational logic
17:04based on that. I have some Synapse activities, some move and transform activities, some activities for
17:11Databricks, some activities for HDInsight, and some activities for Azure Functions are also available
17:18here. Let me take a very simple activity right now, which is Copy data. Obviously this allows me to copy
17:24data from one place to another, and when I want to do this I need to understand the other
17:31features of this Synapse pipeline, like linked services and datasets, because if I want to
17:38connect this particular pipeline with my data I need to associate some kind of linked service and
17:43a schema in a dataset; all these configurations I'm going to do right now.
17:49Pipeline one is there I'm going to collapse this section so I have enough space on the screen
17:53and then this Copy data activity, you can see, has two sections here: source and sink.
18:00Source is where you want to get the data from, and sink is the target where you want to put the data
18:05in this copy activity. I'm going to click on source, and then it's saying: where is your source data
18:10set you do not have any data set right now why don't you just click on plus new and create it.
18:16I'm going to click on plus new and it's asking me okay fine you're going to create a new integration
18:20data set but tell me what kind of data set you're looking for. From the list of these data sets I'm
18:26going to select Azure Data Lake Storage Gen2, because this is what my data
18:31source is all about. I'm selecting this and clicking continue, and then they're also asking
18:37me what kind of format I have for my data. I have data which is comma separated
18:42values, CSV; I know comma-separated values are there in this text file, so I'm selecting this
18:48one it can be any other format if you want to try that you can try it by yourself. I'm selecting CSV
18:53and I'm going to click on continue. Now the moment I do this this is going to be my data set I'm giving
18:59a name to it, delimited text underscore input data set; this is my input dataset. And then,
19:07where is this data actually located? I need to configure that with a linked service.
19:11I do not have any linked service right now, but I can click on plus new and then I can say this is going
19:17to be my Azure data lake storage one underscore linked service; this is going to be my linked service,
19:24and then I need to choose my subscription and my storage account, which is my Gen2 storage
19:30account, and then if I click on test connection it is going to show me connection successful, which means
19:37I have valid access on this, and then all I'm going to do is just click on create.
19:45When we click on create, this is going to create a new linked service; it's showing me
19:52successfully created a new linked service. And now, because I'm connected with the storage account,
19:57if I try to browse I am able to get my file system, which is nothing but the name of the container I've
20:04created, and inside that I'm also getting the file which I uploaded into that container. I'm selecting
20:10my input.txt and I'm going to click on okay. With this I have created one dataset, the input
20:17dataset, and I'm saying I'm going to get data from this particular location and this particular
20:23file. I can click on advanced and say open this dataset; this is going to save the
20:29dataset, and it's going to show me that yes, your dataset configuration is loaded, and this is the input
20:34dataset which you have configured.
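
For readers who prefer to see the underlying definitions, here is a rough sketch (not from the video) of how such an ADLS Gen2 linked service and a DelimitedText dataset look in the JSON that Data Factory and Synapse store, written as Python dictionaries. All names and the storage URL are hypothetical placeholders.

```python
# Hedged sketch: the JSON shape of an ADLS Gen2 linked service and a DelimitedText
# dataset that points at input.txt, expressed as Python dicts. Names are placeholders.
linked_service = {
    "name": "AzureDataLakeStorage1_LinkedService",
    "properties": {
        "type": "AzureBlobFS",                     # ADLS Gen2 connector type
        "typeProperties": {
            "url": "https://mygen2storage.dfs.core.windows.net"
        },
    },
}

input_dataset = {
    "name": "DelimitedText_InputDataSet",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorage1_LinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "myfilesystem",      # the container
                "fileName": "input.txt",
            },
            "columnDelimiter": ",",
        },
    },
}
```
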
20:40Fine, going back to the previous tab, which is pipeline1: the source configuration is done, so now I need to specify the sink configuration,
20:46which means where exactly I want to put this data; that is what I can configure here.
20:51You can see that the dataset for input is already configured; obviously there is no dataset for
20:57output so I need to configure that now and I'm going to click on plus new only so I'm going to click on
21:02plus new. This dataset can be something other than a storage account; maybe I want to take data
21:09from the storage account but put it into a SQL database, or maybe into some on-premises
21:14SQL database. All of these things are possible. I am going to select the same one right now,
21:19which is Azure Data Lake Storage Gen2, and then I'm going to click on
21:24continue. I will select delimited text, CSV; everything is the same as before. The name of the
21:32dataset is going to be delimited text output data set; this is going to be my output dataset, which
21:38will hold the final structure. I could create a new linked service if it were a different data source, but
21:44this is the same location where I want to put it, so I'm going to click on the same linked service.
21:51And then, where exactly do I want to put this thing? I am saying that I want to put it in the same
21:56file path, so I'm going to select the same container, but inside this container I want
22:03to specify some dynamic content for the directory and file name: I want that inside the
22:09container there should be a folder called output, inside which the output data should be,
22:14and for the name of the file I do not want input.txt, I want something like output.txt.
22:22So again, this is a hard-coded name right now; if I want to generate it dynamically,
22:28that is also something I can do with parameters and so on, but I'm selecting a simple name
22:33here: an output folder with output.txt will be there. I'll click on advanced and I'll click on open this
22:38dataset, which will save this. Now they are showing me some error here; they are saying that
22:45this operation is returning an invalid status code, because there is no directory called output and there is
22:49no file called output.txt. This kind of error is very common, because when you're configuring a dataset
22:55they actually check for those values. I'll keep these two things empty right now, and then I click on open this dataset.
23:03Now, once this is open for the output, this is the time where I want
23:09to specify some dynamic content. You can see here I'm going to click on this directory name, which is output,
23:19and then, instead of giving some hard-coded value like output.txt,
23:24I want to add dynamic content, and this dynamic content is going to be
23:29generated automatically when you actually run your pipeline. I am going to search for a function
23:35called concat, and using the concat function I'm saying that I want to concatenate
23:43a property called @pipeline().RunId. This run id is going to be a property
23:59which is a unique id associated with the pipeline run, so I'm specifying that pipeline().RunId is going
24:05to be fetched from the executing pipeline, which is going to be a unique id, and then I want to concatenate this
24:11with .txt, so my full file name is going to be some unique id plus .txt. This also gives me
24:17assurance that if I run this pipeline more than once, every time a new id will be generated and
24:23a new file will be created. I'm going to click on ok; my concat dynamic content is in place.
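
The expression I end up with here is @concat(pipeline().RunId, '.txt'). Conceptually it behaves like the small Python sketch below, where a fresh run id produces a fresh file name on every run; the uuid-based id is only an illustration, since in the pipeline the real value comes from the run itself.

```python
# Illustration only: what concat(pipeline().RunId, '.txt') does conceptually.
# In Synapse/Data Factory the run id is supplied by the pipeline run, not generated here.
import uuid

def output_file_name(run_id: str) -> str:
    # concat(pipeline().RunId, '.txt')
    return run_id + ".txt"

run_id = str(uuid.uuid4())          # stand-in for the pipeline's unique run id
print(output_file_name(run_id))     # a different file name on every run
```
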
24:31In order to check this, I'm going to click on validate all; if everything is fine it's going to
24:37show me that no errors were found. Cool, I'm going to click on publish all, and then it's saying
24:43that what I have created, one pipeline and two datasets, will be published right now. I'm okay with that,
24:49I clicked on publish, and I think it's going to take a few seconds to publish this; let's wait for
24:55this to be done. Yes, publishing is completed successfully, and if that is true let me check
25:01whether this pipeline is running or not. I'll go to my pipeline1, and there is an option here
25:07called add trigger; you can run your pipelines with triggers, and that's where I'm going to click on
25:13trigger now. This will take me to a page where it's showing me: okay, you want to run the pipeline,
25:20you can just trigger it. I can click on ok, and that's going to start the execution of my pipeline; it's
25:26actually going to run the pipeline right now. While this is running we could also have a look at the
25:31monitoring part, but let me check if this completes successfully first, and then I'll show you the monitoring
25:37part after that. It's going to take a few seconds to run. Yes, it is done; it's showing me that the pipeline
25:44execution is successfully done, which means the configuration I've done is working fine.
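
Triggering and checking a run can also be done from code. The sketch below assumes the azure-synapse-artifacts package and its ArtifactsClient; the operation names reflect my understanding of that client and should be verified against the installed version, and the workspace endpoint is a placeholder.

```python
# Hedged sketch: trigger pipeline1 and poll its status programmatically.
# Assumes azure-identity and azure-synapse-artifacts; endpoint and names are placeholders.
import time
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://marutisynapse01.dev.azuresynapse.net",  # hypothetical workspace endpoint
)

run = client.pipeline.create_pipeline_run("pipeline1")        # like clicking "Trigger now"
print("run id:", run.run_id)

# Poll until the run finishes, similar to watching the Monitor hub.
while True:
    status = client.pipeline_run.get_pipeline_run(run.run_id).status
    print("status:", status)
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(15)
```
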
25:49Let me go back to my container, which is here in the storage account, and refresh it. If I
25:55do that, I am getting a folder created inside the container called output, and inside output
26:02I have one file whose name is pipeline.runid.txt. I think the concatenation with the
26:09run id is not happening properly; let me see what is wrong in that. But the copy itself is fine: if I click on this file, it
26:16should have the content of that same file, and you can see the same content is coming from input.txt.
26:24I need to check this pipeline.runid, because I want it to be fetched from the unique id of
26:29that particular pipeline run, so let me quickly correct it; maybe there are some quotes I have added which are
26:34wrong. So let me go to my Synapse Studio, which is where we do all the configuration.
26:40Let me go to my output dataset, and I think the mistake I have made is that this pipeline.RunId
26:47should not be inside single quotes, because it is a property, not a literal string; it
26:55should not be in quotes, and if there are no quotes then that part is also not
27:02required. So this is going to be pipeline(), which is treated like a function, then .RunId, and then
27:09.txt will be concatenated with it. So this is going to reference the current pipeline and
27:14take the run id from it. Make sure when you type this, the R and I are capital,
27:19because that's a predefined property. I'll click on ok, and then I'm going to publish this;
27:27the new changes will be published, and then I'm going to run it once again the same way
27:33I did the first time. Let me run this once again. Yes, publishing is completed; I'll go to pipeline1 and
27:41I'll click on trigger now. Okay,
27:48it's running again, a second time. Obviously this is the second execution, so after this if I go to
27:54monitoring then I will be able to see all the successful runs, and no failed runs, in there. So
28:02let's see that. Yep, succeeded. I'm going to my output folder once again and I'm going to refresh it.
28:11If I refresh, for the second execution I got a proper file name; this is a unique id dot txt. And where can
28:18I see this id? Well, if I go to my monitoring panel, which is in the left-side column, I can see that my two
28:26successful executions are there, and I think this one is the newer one at the top of the sorted list.
28:33If I go inside this pipeline, this pipeline has a run id starting with 67c99, some
28:40unique id, and I think this is the same id which is my file name. Now if I run it once again,
28:45a new id will be generated and that will be dynamically associated with this particular file,
28:50and this is how we can use parameters, dynamic content, and some other configurations
28:56while the pipeline is executing. If I move my mouse over this monitoring panel, I have a glasses-like
29:03icon there at the bottom, and that is going to show me what exactly happened. It's showing me that
29:09we have taken the data from the Gen2 storage and added it into another location in the same account,
29:14and 38 bytes of data were transferred from here to there. The total time taken for that and the
29:20throughput used for this, all those things are visible in monitoring, which gives me crystal-
29:26clear clarity about what exactly happened when I executed the pipeline run.
29:29So in this video we have seen how large-scale data analytics can be done in the Azure cloud. We have
29:37focused on Azure Synapse Analytics, and we have created a Synapse pipeline in the demo of this particular
29:45lecture.
