Welcome to the SkillTech DP-900: Microsoft Azure Data Fundamentals Series!
In this module, we dive into large-scale data analytics and how Azure empowers enterprises to manage and analyze massive datasets using modern cloud-native tools.

What You’ll Learn in This Session:
What is large-scale data analytics?
Introduction to Azure Synapse Analytics, Azure Data Lake, and HDInsight
Key use cases of big data analytics in the cloud
How Azure supports data ingestion, transformation, and querying at scale

Who should watch this?
Beginners preparing for the DP-900 certification
Professionals new to big data and analytics
Anyone interested in how Azure handles real-time and batch data processing

Explore our other Courses and Additional Resources on: https://skilltech.club/

Category: Tech
Transcript
00:00In this video we are going to focus on how we can explore large-scale data analytics in Azure.
00:14The main focus of this data analytics will be on Azure Data Factory and Azure Synapse.
00:19We will provision Azure Synapse and pipelines inside Azure Synapse, and we will see
00:25how data analytics works with them. Obviously we first need to
00:29understand data ingestion in Azure, and we need to understand components
00:34like Azure Data Factory, where we are actually going to ingest data. We will also describe the
00:38data processing options for performing analytics in Azure, so we will see
00:42what kind of options are there and what kind of tools and utilities we need to
00:46understand for that, and then we will explore Azure Synapse Analytics, where
00:50the pipelines of Azure Synapse Analytics will be created and
00:54configured for ETL and ELT kinds of transformations. Now, before we proceed
01:00with some practical demos, let's try to understand data ingestion in Azure and the
01:06first thing which you need to understand in this is Azure Data Factory. ADF is a data
01:12ingestion and transformation service that allows you to load data, most of the time
01:16raw data, from many different sources, and it even allows you to take data from
01:21on-premises computers. As it ingests the data, Data Factory can clean, transform, and
01:28restructure it before loading it into a repository such as a data
01:33warehouse, where it can be processed further. Once the data is in the data
01:39warehouse, you can analyze it using the data warehouse's own features. The second
01:44option you have is PolyBase. PolyBase is a feature of SQL Server and Azure
01:49Synapse Analytics that enables you to run Transact-SQL queries that read data from
01:56external data sources. PolyBase makes these external data sources appear like
02:01tables in the SQL database itself, so it feels as if you are writing a
02:07query against local tables in the database. Using PolyBase you can read data managed by
02:13Hadoop, Spark, and Azure Blob Storage, and other database management
02:18systems like Cosmos DB, Oracle, Teradata, and even MongoDB can be integrated with PolyBase as well.
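
To make the PolyBase idea concrete, here is a minimal sketch (not from the video) of querying such an external table from Python. Everything here is an assumption for illustration: the server, database, credentials, and the dbo.People table are hypothetical placeholders, and it assumes an ODBC driver is installed and that the external data source, file format, and external table were created beforehand with the usual CREATE EXTERNAL DATA SOURCE / FILE FORMAT / TABLE statements.

```python
# Hedged sketch: query a PolyBase-style external table from Python via pyodbc.
# The server, database, credentials, and table names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"   # hypothetical SQL pool endpoint
    "DATABASE=mydw;UID=sqladminuser;PWD=<password>"
)
cursor = conn.cursor()

# dbo.People is assumed to be an external table created earlier, pointing at
# CSV files kept in Azure Data Lake Storage / Blob Storage.
cursor.execute("SELECT FirstName, LastName FROM dbo.People;")
for first_name, last_name in cursor.fetchall():
    print(first_name, last_name)

conn.close()
```
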
02:24The third and the important feature is we have SSIS. I'm sure if you have used SQL
02:30Server then you are familiar with SQL Server Integration Services which is
02:35nothing but a platform for building enterprise level data integration and data
02:39transformation solutions. If you used SSIS in the older days then you
02:44know what it is; these days people use SSIS integrated with PolyBase
02:50and Azure Data Factory. You can use SSIS to solve complex business
02:55problems by copying or downloading files, loading the data warehouse, cleaning
03:00and mining data, and managing SQL database objects and the data which is
03:06associated with Microsoft SQL Server. SSIS also ships as part of the Microsoft SQL Server
03:11product. Now let's try to understand the components of Azure Data Factory and how it
03:17is actually going to ingest and process data. Azure Data Factory has a
03:23concept called a linked service. A linked service provides a
03:27connection to various data sources, which can be a data lake store, Azure
03:32Databricks, or even data that you simply want to take from a SQL Server
03:37located on-premises. The linked service allows you to maintain a
03:41connection between these data sources and your data factory. After that you have
03:47a construct called a dataset, which defines
03:53the schema of the data you are going to process. This is the
03:57data you have prepared from your raw data for further
04:02processing. The data coming from these linked services is
04:06ingested into a pipeline of your Azure Data Factory, and inside the pipeline you
04:12have activities which get executed to process
04:16the data. Often multiple activities are required in this
04:21process, and that's why a set of activities like this is
04:25grouped into one thing called a pipeline. These four terms, pipeline, dataset,
04:31activities, and linked services, are the most important terms if you want
04:36to understand the basics of Azure Data Factory. A pipeline can be
04:41triggered by different kinds of events, or you can trigger its
04:45execution manually. The moment the pipeline executes, the data
04:49transformation happens. Each activity inside the pipeline gets executed
04:53one by one based on the flow you have defined, following the
04:57guidelines of the dataset that is configured for it, and it can
05:01also have customized settings inside that. Chances are your
05:06pipeline will use some parameters, an integration runtime, and some
05:12control flows associated with it, and the pipeline will
05:16execute according to those parameters, the integration runtime, and the
05:20control flow configuration. You can monitor the
05:24execution with the help of the monitoring utility which is also provided inside Azure Data Factory.
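
To tie these four terms together, here is a rough sketch (not from the video) of how a pipeline with a single Copy activity is described in the JSON that Data Factory and Synapse pipelines store, written as a Python dictionary. "InputDataset" and "OutputDataset" are hypothetical dataset names that would be defined separately and bound to a linked service (the storage connection).

```python
# Hedged sketch: the shape of a minimal pipeline definition with one Copy activity,
# expressed as a Python dict that mirrors the ADF/Synapse pipeline JSON.
import json

pipeline_definition = {
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "CopyInputToOutput",
                "type": "Copy",
                "inputs": [{"referenceName": "InputDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OutputDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline_definition, indent=2))  # inspect the JSON you would publish
```
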
05:29Azure Data Factory is going to do the data transformation from the raw data to
05:34some kind of a meaningful data. Just for your kind information guys, Azure Data
05:38Factory is available in Azure as a separate service as well as if you are
05:42using Azure Synapse Analytics then inside Azure Synapse Analytics we have a tool
05:48called Synapse Pipeline which is nothing but 100% Azure Data Factory. Whatever you can
05:54use inside Azure Data Factory you can do the same kind of thing in Synapse Pipelines
05:58also. Other than Synapse pipelines, Azure Synapse Analytics is actually a bundle of
06:03multiple products. It additionally has the Synapse SQL
06:08pool, which works like a dedicated data warehouse for organized,
06:13structured data. You have Synapse Link, which allows you to connect with
06:18Cosmos DB for analytics over its transactional data. You also
06:23have a tool called Synapse Studio, which gives you an
06:29integrated development environment where you can configure your databases and your
06:34pipelines and execute queries, and you also have something known as the
06:39Apache Spark pool, formerly known as Synapse Spark, where you can execute your
06:44Spark queries and take care of your analytics with the help of Python, C#, Scala, and
06:52other supported languages. Ultimately, much like Azure Databricks, here
06:59also you can create notebooks which have multiple cells inside them, and the
07:04compute for executing those notebooks is provided by Azure Synapse Analytics.
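
As a small illustration of such a notebook cell (not part of the video's demo), this PySpark sketch reads a comma-separated file from a Data Lake Storage Gen2 container. The abfss:// path and the column names are hypothetical, and in a Synapse notebook the spark session would already be provided by the attached Spark pool.

```python
# Hedged sketch of a Synapse notebook cell in PySpark.
# In Synapse Studio the `spark` session is pre-created by the attached Spark pool;
# the getOrCreate() call is only here so the snippet is self-contained.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 path: container "myfilesystem" on account "mygen2storage".
path = "abfss://myfilesystem@mygen2storage.dfs.core.windows.net/input.txt"

df = spark.read.csv(path, header=False).toDF("FirstName", "LastName")
df.show()
```
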
07:09We will see all these things in our next demo, because right after this I want you to
07:14see Data Factory and Synapse pipeline features inside Azure Synapse Analytics, so
07:19let's have a look at this practically in our Azure portal. Okay, so now it's
07:25time to provision our Azure Synapse Analytics workspace. Let me go to my Azure portal and I'm
07:32going to click on create resource where if I go inside analytics I have Azure
07:37Synapse Analytics in that. Remember, in the same way you can provision Azure Databricks
07:41or even Azure Data Factory. I am going to select Azure Synapse Analytics, which is
07:47a big service because it has all the Synapse resources inside it. It's asking me
07:53what kind of resource group and managed resource group I want to create here. Now
07:58we need to create two resource groups in this case. I'm going to create a new resource
08:02group which is going to be DP900RG, like we always created, and then I'm creating one more resource
08:10group which is going to be managed DP900RG. This managed resource group is useful for
08:17those resources which will be associated with the Synapse workspace, in case you want to create some other
08:23managed resources later on. As of now it is going to be empty once the provisioning and deployment of
08:28Synapse workspace is done. What will be the name of this workspace? I'm giving a name like this is my
08:34Maruti Synapse workspace and then some number at the end I'm giving just for the uniqueness.
08:41The region for this I'm selecting East US, and then it is a must that every Synapse workspace is
08:48associated with Azure Data Lake Storage Gen2. You are already familiar with the Azure storage account.
08:55This is going to be very similar to that, but with additional security and a
09:01hierarchical namespace associated with the storage account, where you can store blobs in containers.
09:06So we are going to have a Gen2 storage account. I do not have any Gen2 storage account as of now, so I'm
09:12going to create a new one. I'm giving a meaningful name like this is my Gen2 storage with some number
09:20at the end just for the uniqueness and then inside this Gen2 storage they are going to create a new
09:25container to store the blob data and that is something which they call file system. So I'm going to give
09:31a name for that also which is going to be my file system that's it and then I'm going to click on okay.
09:40Remember, because additional security is involved, they will automatically assign a Storage Blob Data Contributor
09:47role on my storage account for the user who is associated with this particular
09:53account right now, and if I want to do this kind of role-based access control manually for Synapse,
09:59I can go into these links and see more details about that. I really don't want to change anything here.
10:04I'm happy with the default configuration. I'm going to click on next.
10:09Every Synapse workspace is going to have one serverless SQL pool associated with that. This serverless SQL
10:15is going to be like a SQL server where you can create some temporary databases and some
10:20external data sources, and then you can run analytical queries on them. So I have to
10:27provide a username and password for this SQL server; let me just provide something meaningful.
10:33This will be useful if I want to connect to that SQL server later on. Do I need to enable some system-
10:39assigned managed identity permissions for this? Well, I don't want this right now. So I'm going with next, next.
10:46No need to change anything. We'll just click on review plus create which will take me to the last and final
10:52page where hopefully my validation will pass and then if it is passed then I will click on create.
10:59Before clicking create, I always check what the estimated cost is for this. As I told you, there is a serverless
11:04SQL pool which is going to be created here, and that's going to cost me around 360 rupees,
11:11not per month; sorry, this is the estimated cost per terabyte
11:17of data with that. So I think this is quite a cheap cost right now. So I'm okay with this
11:23and I'm going to click on create.
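
The same provisioning can also be scripted. The sketch below is my assumption of how to do it with the azure-mgmt-synapse package; the model and field names reflect my understanding of its Workspace model and should be verified against the installed SDK version, and every resource name, URL, and password here is a placeholder.

```python
# Hedged sketch: provisioning a Synapse workspace programmatically.
# Assumes azure-identity and azure-mgmt-synapse; all names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import Workspace, DataLakeStorageAccountDetails

subscription_id = "<subscription-id>"
client = SynapseManagementClient(DefaultAzureCredential(), subscription_id)

workspace = Workspace(
    location="eastus",
    managed_resource_group_name="managedDP900RG",
    default_data_lake_storage=DataLakeStorageAccountDetails(
        account_url="https://mygen2storage.dfs.core.windows.net",  # hypothetical Gen2 account
        filesystem="myfilesystem",                                  # the default container
    ),
    sql_administrator_login="sqladminuser",
    sql_administrator_login_password="<strong-password>",
)

# Long-running operation, equivalent to clicking "Create" in the portal.
poller = client.workspaces.begin_create_or_update("DP900RG", "marutisynapse01", workspace)
print(poller.result().provisioning_state)
```
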
11:27The moment you submit this deployment, it will take some time to provision the Synapse workspace because
11:32internally there are many things which are going to be deployed. So this is a point where I will just
11:38wait for the deployment to be done, and once it is done we will see how we can do
11:43analytics with the Azure Synapse pipelines. We will also focus on Azure Synapse Studio, which is
11:50some kind of a ready-made development environment where you can do analytics,
11:55querying, and all the customizations which are required for Synapse.
12:01Okay now once the deployment is complete you can see it's showing me go to resource group
12:05and if I click on go to resource group I can see that we have our storage account which is going
12:11to be our gen 2 storage and then we have a Synapse workspace. I'm going to open the Synapse workspace
12:17in a new tab, and in this tab I'm going inside my storage account. Inside the storage account we have
12:25the same data storage options on the left side, where containers, file shares, queues, and tables are there, and inside
12:31containers I will have one container which is created by default with the name Maruti file system
12:38or something like that. This is the name which we gave while provisioning this Synapse workspace.
12:44Inside the container we don't have anything, but yes, this is a different kind of container compared to
12:49a normal storage account, because you will have role-based access control on this plus you will have
12:55an access control list associated with some ready-made service principals associated with this Synapse
13:01workspace. So logically this Gen2 storage is connected with the Synapse workspace, and you can
13:08keep data into this gen 2 storage account and then you can load and transform the data into Synapse.
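
For reference, keeping data in that Gen2 account can also be done from code. This is a minimal sketch (not shown in the video) using the azure-storage-file-datalake package; the account URL, container name, and the sample records are hypothetical placeholders, and it assumes the signed-in identity has data access on the account.

```python
# Hedged sketch: upload a small CSV file into the Gen2 filesystem (container)
# that backs the Synapse workspace. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mygen2storage.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

file_system = service.get_file_system_client("myfilesystem")   # the default container
file_client = file_system.get_file_client("input.txt")

data = b"John,Smith\nJane,Doe\nSam,Patel\n"                     # sample comma-separated records
file_client.upload_data(data, overwrite=True)
print("uploaded", len(data), "bytes")
```
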
13:15If I go to my Azure Synapse workspace, there are many things I can see here in the details,
13:21like the SQL admin details; the serverless SQL pool is already available here because it's something
13:29which is created as part of the Synapse workspace during deployment. If I want to create a dedicated
13:34SQL pool, some Apache Spark pool or maybe some data explorer pool I can create all of this by just
13:40clicking on these buttons and then going forward with the step-by-step configuration. What I'm going to do
13:46is I'm going to click on open Synapse Studio link here which will take me to a new tab altogether which
13:53will give me Synapse Studio. This is the workspace where you're actually going to work on Synapse,
13:59where you're going to deal with the different parts of the Synapse workspace. You can see I have
14:05options like Data, which allows me to create new databases and lake databases which connect with
14:14the data lake. We can also connect to external data sources or integrate datasets into
14:21this. We have a Develop section which allows me to create new notebooks which will run on Apache Spark.
14:28I can also generate new SQL scripts or KQL scripts for this particular Synapse workspace. I can click on
14:37Integrate, and here I can create my new pipelines where I can do ETL and ELT kinds of transformations, and
14:44I also have a couple of ready-made pipelines and notebooks available in the gallery if I click on
14:50browse gallery and find something from that. I can monitor all these analytics processes in
14:56the Monitor section, and whenever I run any pipeline, query, or other operation it will be visible
15:03here inside the monitoring utility. And last but not least we have a Manage section where
15:10I can manage my linked services, my pools, and my security, and if I want to associate this thing with
15:19Git integration for source control, or maybe some DevOps integration,
15:24all those things I can do here inside Manage. What I'm going to do is I am going into my storage
15:31account and inside this storage account where I have one Maruti file system container. I'm going
15:38to upload one very simple file; from my desktop I just have one file called input.txt. I have just
15:44added this file to this particular container and I'm going to click on upload. You know
15:50that when we upload a file like this, it's going to be treated like a blob. So this is a file which is
15:56just added here; it's a text file which has some 38 bytes of content, and it's
16:03created as a block blob. If I select this file and show you its properties, I can show
16:10you that this file just has some simple comma-separated values inside: I have
16:15first-name, last-name kind of comma-separated values, and three records are there inside this
16:21very simple text file. What I'm going to do is I'm going to
16:26associate this text file and this container with my synapse workspace and I'm going to do this thing
16:33inside synapse pipeline. So let me go to my integrate section. I'm going to click on plus pipeline. So I'm
16:40going to create a new pipeline. They're giving me a default name for the pipeline which is pipeline one. I'm okay
16:45with that, and now this is the place where, as we discussed earlier, a pipeline can
16:53have multiple activities inside it. So I have a section of activities here from which I can select
16:58one or more activities, and then I can build my transformational logic
17:04based on that. I have some Synapse activities, some move and transform activities, some activities for
17:11Databricks, some activities for HDInsight, and some activities for Azure Functions are also available
17:18here. Let me take a very simple activity right now, which is Copy data. Obviously this allows me to copy
17:24data from one place to another, and when I want to do this I need to understand the other
17:31features of this Synapse pipeline, like linked services and datasets, because if I want to
17:38connect this particular pipeline with my data I need to associate some kind of linked service and
17:43a schema in a dataset; all these configurations I'm going to do right now.
17:49Pipeline one is there I'm going to collapse this section so I have enough space on the screen
17:53and then this Copy data activity, you can see, has two sections here: source and sink.
18:00Source is where you want to get the data from, and sink is the target where you want to put the data
18:05in this copy activity. I'm going to click on source, and then it's saying: where is your source data
18:10set you do not have any data set right now why don't you just click on plus new and create it.
18:16I'm going to click on plus new and it's asking me okay fine you're going to create a new integration
18:20data set but tell me what kind of data set you're looking for. From the list of these data sets I'm
18:26going to select Azure Data Lake Storage Gen2, because this is what my data
18:31source is all about. I'm selecting this and clicking continue, and then they're also asking
18:37me what kind of format I have for my data. I have data which is comma separated
18:42values, CSV; I know comma-separated values are there in this text file, so I'm selecting this
18:48one it can be any other format if you want to try that you can try it by yourself. I'm selecting CSV
18:53and I'm going to click on continue. Now the moment I do this this is going to be my data set I'm giving
18:59a name to it, delimited text underscore input data set; this is my input dataset. And then,
19:07where is this data actually located? I need to configure that with a linked service.
19:11I do not have any linked service right now, but I can click on plus new and then I can say this is going
19:17to be my Azure data lake storage one underscore linked service; this is going to be my linked service,
19:24and then I need to choose my subscription and my storage account, which is my Gen2 storage
19:30account, and then if I click on test connection it is going to show me connection successful, which means
19:37I have valid access on this, and then all I'm going to do is just click on create.
19:45When we click on create, this is going to create a new linked service; it's showing me
19:52successfully created a new linked service. And now, because I'm connected with the storage account,
19:57if I try to browse I am able to get my file system, which is nothing but the name of the container I've
20:04created, and inside that I'm also getting the file which I uploaded into that container. I'm selecting
20:10my input.txt and I'm going to click on okay. With this I have created one dataset, the input
20:17dataset, and I'm saying I'm going to get data from this particular location and this particular
20:23file. I can click on advanced and say open this dataset; this is going to save the
20:29dataset, and it's going to show me that yes, your dataset configuration is loaded, and this is the input
20:34dataset which you have configured.
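
For readers who prefer to see the underlying definitions, here is a rough sketch (not from the video) of how such an ADLS Gen2 linked service and a DelimitedText dataset look in the JSON that Data Factory and Synapse store, written as Python dictionaries. All names and the storage URL are hypothetical placeholders.

```python
# Hedged sketch: the JSON shape of an ADLS Gen2 linked service and a DelimitedText
# dataset that points at input.txt, expressed as Python dicts. Names are placeholders.
linked_service = {
    "name": "AzureDataLakeStorage1_LinkedService",
    "properties": {
        "type": "AzureBlobFS",                     # ADLS Gen2 connector type
        "typeProperties": {
            "url": "https://mygen2storage.dfs.core.windows.net"
        },
    },
}

input_dataset = {
    "name": "DelimitedText_InputDataSet",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorage1_LinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "myfilesystem",      # the container
                "fileName": "input.txt",
            },
            "columnDelimiter": ",",
        },
    },
}
```
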
20:40Fine, going back to the previous tab, which is pipeline1: the source configuration is done, so now I need to specify the sink configuration,
20:46which means where exactly I want to put this data; that is what I can configure here.
20:51You can see that the dataset for input is already configured; obviously there is no dataset for
20:57output so I need to configure that now and I'm going to click on plus new only so I'm going to click on
21:02plus new. This dataset can be something other than a storage account; maybe I want to take data
21:09from the storage account but put it into a SQL database, or maybe into some on-premises
21:14SQL database. All of these things are possible. I am going to select the same one right now,
21:19which is Azure Data Lake Storage Gen2, and then I'm going to click on
21:24continue. I will select delimited text, CSV; everything is the same as before. The name of the
21:32dataset is going to be delimited text output data set; this is going to be my output dataset, which
21:38will hold the final structure. I could create a new linked service if it were a different data source, but
21:44this is the same location where I want to put it, so I'm going to click on the same linked service.
21:51And then, where exactly do I want to put this thing? I am saying that I want to put it in the same
21:56file path, so I'm going to select the same container, but inside this container I want
22:03to specify some dynamic content for the directory and file name: I want that inside the
22:09container there should be a folder called output, inside which the output data should be,
22:14and for the name of the file I do not want input.txt, I want something like output.txt.
22:22So again, this is a hard-coded name right now; if I want to generate it dynamically,
22:28that is also something I can do with parameters and so on, but I'm selecting a simple name
22:33here: an output folder with output.txt will be there. I'll click on advanced and I'll click on open this
22:38dataset, which will save this. Now they are showing me some error here; they are saying that
22:45this operation is returning an invalid status code, because there is no directory called output and there is
22:49no file called output.txt. This kind of error is very common, because when you're configuring a dataset
22:55they actually check for those values. I'll keep these two things empty right now, and then I click on open this dataset.
23:03Now, once this is open for the output, this is the time where I want
23:09to specify some dynamic content. You can see here I'm going to click on this directory name, which is output,
23:19and then, instead of giving some hard-coded value like output.txt,
23:24I want to add dynamic content, and this dynamic content is going to be
23:29generated automatically when you actually run your pipeline. I am going to search for a function
23:35called concat, and using the concat function I'm saying that I want to concatenate
23:43a property called @pipeline().RunId. This run id is going to be a property
23:59which is a unique id associated with the pipeline run, so I'm specifying that pipeline().RunId is going
24:05to be fetched from the executing pipeline, which is going to be a unique id, and then I want to concatenate this
24:11with .txt, so my full file name is going to be some unique id plus .txt. This also gives me
24:17assurance that if I run this pipeline more than once, every time a new id will be generated and
24:23a new file will be created. I'm going to click on ok; my concat dynamic content is in place.
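
The expression I end up with here is @concat(pipeline().RunId, '.txt'). Conceptually it behaves like the small Python sketch below, where a fresh run id produces a fresh file name on every run; the uuid-based id is only an illustration, since in the pipeline the real value comes from the run itself.

```python
# Illustration only: what concat(pipeline().RunId, '.txt') does conceptually.
# In Synapse/Data Factory the run id is supplied by the pipeline run, not generated here.
import uuid

def output_file_name(run_id: str) -> str:
    # concat(pipeline().RunId, '.txt')
    return run_id + ".txt"

run_id = str(uuid.uuid4())          # stand-in for the pipeline's unique run id
print(output_file_name(run_id))     # a different file name on every run
```
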
24:31In order to check this, I'm going to click on validate all; if everything is fine it's going to
24:37show me that no errors were found. Cool, I'm going to click on publish all, and then it's saying
24:43that what I have created, one pipeline and two datasets, will be published right now. I'm okay with that,
24:49I clicked on publish, and I think it's going to take a few seconds to publish this; let's wait for
24:55this to be done. Yes, publishing is completed successfully, and if that is true let me check
25:01whether this pipeline is running or not. I'll go to my pipeline1, and there is an option here
25:07called add trigger; you can run your pipelines with triggers, and that's where I'm going to click on
25:13trigger now. This will take me to a page where it's showing me: okay, you want to run the pipeline,
25:20you can just trigger it. I can click on ok, and that's going to start the execution of my pipeline; it's
25:26actually going to run the pipeline right now. While this is running we could also have a look at the
25:31monitoring part, but let me check if this completes successfully first, and then I'll show you the monitoring
25:37part after that. It's going to take a few seconds to run. Yes, it is done; it's showing me that the pipeline
25:44execution is successfully done, which means the configuration I've done is working fine.
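
Triggering and checking a run can also be done from code. The sketch below assumes the azure-synapse-artifacts package and its ArtifactsClient; the operation names reflect my understanding of that client and should be verified against the installed version, and the workspace endpoint is a placeholder.

```python
# Hedged sketch: trigger pipeline1 and poll its status programmatically.
# Assumes azure-identity and azure-synapse-artifacts; endpoint and names are placeholders.
import time
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://marutisynapse01.dev.azuresynapse.net",  # hypothetical workspace endpoint
)

run = client.pipeline.create_pipeline_run("pipeline1")        # like clicking "Trigger now"
print("run id:", run.run_id)

# Poll until the run finishes, similar to watching the Monitor hub.
while True:
    status = client.pipeline_run.get_pipeline_run(run.run_id).status
    print("status:", status)
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(15)
```
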
25:49Let me go back to my container, which is here in the storage account, and refresh it. If I
25:55do that, I am getting a folder created inside the container called output, and inside output
26:02I have one file whose name is pipeline.runid.txt. I think the concatenation with the
26:09run id is not happening properly; let me see what is wrong in that. But the copy itself is fine: if I click on this file, it
26:16should have the content of that same file, and you can see the same content is coming from input.txt.
26:24I need to check this pipeline.runid, because I want it to be fetched from the unique id of
26:29that particular pipeline run, so let me quickly correct it; maybe there are some quotes I have added which are
26:34wrong. So let me go to my Synapse Studio, which is where we do all the configuration.
26:40Let me go to my output dataset, and I think the mistake I have made is that this pipeline.RunId
26:47should not be inside single quotes, because it is a property, not a literal string; it
26:55should not be in quotes, and if there are no quotes then that part is also not
27:02required. So this is going to be pipeline(), which is treated like a function, then .RunId, and then
27:09.txt will be concatenated with it. So this is going to reference the current pipeline and
27:14take the run id from it. Make sure when you type this, the R and I are capital,
27:19because that's a predefined property. I'll click on ok, and then I'm going to publish this;
27:27the new changes will be published, and then I'm going to run it once again the same way
27:33I did the first time. Let me run this once again. Yes, publishing is completed; I'll go to pipeline1 and
27:41I'll click on trigger now. Okay,
27:48it's running again, a second time. Obviously this is the second execution, so after this if I go to
27:54monitoring then I will be able to see all the successful runs, and no failed runs, in there. So
28:02let's see that. Yep, succeeded. I'm going to my output folder once again and I'm going to refresh it.
28:11If I refresh, for the second execution I got a proper file name; this is a unique id dot txt. And where can
28:18I see this id? Well, if I go to my monitoring panel, which is in the left-side column, I can see that my two
28:26successful executions are there, and I think this one is the newer one at the top of the sorted list.
28:33If I go inside this pipeline, this pipeline has a run id starting with 67c99, some
28:40unique id, and I think this is the same id which is my file name. Now if I run it once again,
28:45a new id will be generated and that will be dynamically associated with this particular file,
28:50and this is how we can use parameters, dynamic content, and some other configurations
28:56while the pipeline is executing. If I move my mouse over this monitoring panel, I have a glasses-like
29:03icon there at the bottom, and that is going to show me what exactly happened. It's showing me that
29:09we have taken the data from the Gen2 storage and added it into another location in the same account,
29:14and 38 bytes of data were transferred from here to there. The total time taken for that and the
29:20throughput used for this, all those things are visible in monitoring, which gives me crystal-
29:26clear clarity about what exactly happened when I executed the pipeline run.
29:29So in this video we have seen how large-scale data analytics can be done in the Azure cloud. We have
29:37focused on Azure Synapse Analytics, and we have created a Synapse pipeline in the demo of this particular
29:45lecture.
