lectio-uninapoli-2021

Azure Synapse

General information
Demo
Agenda

General information

Synapse Studio is an integrated Azure tool that allows managing all the steps of an Analytics pipeline, within a single interface. In particular:

It integrates a tool for designing ETL pipelines, very similar to Azure Data Factory, with a graphical interface to create pipelines
It allows from its interface to query and navigate the data in an Azure Data Lake Storage
It is possible to create:
- a dedicated Relational Database (Dedicated SQL Pool)
- a Serverless SQL Pool, i.e. an SQL Engine containing only views (or External Tables) on the Data Lake
It allows from its interface to connect data to a PowerBI instance and see a preview of the Dashboards
It is possible to activate a Spark Engine with Synapse, and define Spark transformations using Jupyer Notebooks.

All the configuration and objects created within Synapse can be versioned with a Git repository.

Demo

Create resources on Azure

Create a resource group for your project
Open resource group, select Add –> MarketPlace –> Azure Synapse Analytics (not private hubs) –> Create
On the current screen, click on account name –> create new and select a name, same for file system name

Add the remaining configuration for Synapse workspace as in the screenshot

Click on review and create, then on Create

Create connection to external Database and a Serverless SQL Pool

Open the resource group of the project
Open Synapse Workspace
Click on Open Synapse Studio

NB: Synapse allows you to version all the configurations in a GIT Repository. We won’t go into details here, but you can switch between “Synapse Live” (no versioning) to a versioning in GIT in the upper left corner of the screen.

Click on Manage –> Linked Services –> New

Select Azure SQL Database and click on continue
Provide the required information.

Click on test connection on the bottom of the page to make sure connection is successful, then click on create.
Click on the “Data” icon on the left, then on the “+” and SQL Database.

Select “Serverless” and provide a name

Create a Power BI workspace and connect Synapse to it

Open powerbi.com
Login with your own company account
Click on workspaces –> Create a Workspace

Provide name and description and click on save
Go back to Synapse, click on Manage –> Linked services –> New

Select Connect to Power BI or search for Power BI
Provide a name for the linked service and the workspace name you have just created in Power BI, then click on Create

Click on Publish All to validate all modifications

Create ETL Pipelines to import data

The subsequent step must be executed for each business process you want to analyze, with facts and dimensions

Click on the Storage sign on the left (“Data”), then on the linked tab and on the primary data lake object as in the figure. Then, click on “New folder” to create a new folder with the name of your business process.

Click on “integrate” button on the left
Click on the plus “+” button on the right of “integrate”, then select “Pipeline”

Provide a name for the pipeline

The subsequent step must be repeated for each table you want to create in the Data Lake

On the left bar, expand “Move and Transform” and select “Copy data” tool
Provide a name for the step, like copy_[table_name]
Click on the source tab and apply the following steps:
- click on “+ New” button on the right of “Source Dataset”
- Choose Azure SQL Database as input (or your source DB)
- Fulfill the properties form with a name and the linked service we created for the external database, select the table (or “None” if you want to import via a query and not the full table), then click on OK.

The source dataset you just create should appear on the dropdown list, otherwise select it. Fulfill the rest of the source with the query (or table) of your choice.

Go the the sink tab, click on “+ New” on the right of “Sink dataset”
Select “Azure Data lake storage Gen2”, then “Parquet”
In the subsequent form, provide a name and select the workspace default storage as linked service.

Click on the folder on the right of the “file path” section, then select the path in the data lake where you want to drop your data (double click on a folder to open it). You may choose the folder you created for the business process in previous steps.
If you wrote a query in the source section, also add the file name on the last text form of “File path” section, with “.parquet” as an extension. If you selected a table, leave it blank.
Click on the “Debug” button above to test the pipeline. (NB: this will trigger the import and impute costs) and verify if the pipeline runs successfully, in the output tab below

If the pipeline terminates correctly, verify if data retrieved are correct. Click on “Data” on the left, then on “Linked” tab, click on the primary data lake as in the image below. Then, navigate to the path you selected on the steps below

Now we will create a view on the Data Lake. You may want to create external tables instead of of views. In this case, in next step select “Create an External Table” instead of SELECT TOP rows. Here an analysis of the differences.

Right click on the file you created, and select “New SQL Script” –> Select TOP 100 Rows
Click on Run and verify if results are expected.

Now, modify the script removing the TOP 100 text and adding at the beginning “CREATE VIEW [name] AS”, as in the figure below. Also, select the database you created in the Preliminary steps at the right of use database, then click on “Run”
You may want to provide a name to the Script for creating the view (on the right bar), while this is not strictly necessary

Click on “Validate all” tab above, and if the systems shows no errors, then click on “Commit All”

Add trigger for the pipeline

You should now define triggers to schedule the activation of the pipeline.

On the created pipeline, click on Add Trigger –> “New/Edit”

On the dropdown list, click on “New” TODO: I think that if you already created a trigger for another pipeline, you can reuse it and avoid creating a new one. To verify.
Create a new trigger providing all requested information. You may select “No” on activated to suspend the trigger and activate it later.

Generate PowerBI Dashboards

Click on the paper icon on the left (develop), then click on Power BI datasets as in the figure, then click on New Power BI datasets

Download the PowerBI file, double click on it to open PowerBI

From PowerBI, connect with your Microsoft account

Select the views you created in the Import data from external source step and click on Load

Select Direct Query or Import and validate
Verify the relationships in the correspondent view
Prepare your analysis
When you finish, click on file –> publish and select the Workspace of Power BI created on previous step
You can now see the report in Synapse, in Develop tab

Agenda

Presentation (00:00)
Introduction
Azure and microsoft resources (00:30)
Azure Synapse (01:00)
Q&A (02:00)