Azure Synapse
Contents
Synapse Studio is an integrated Azure tool that allows managing all the steps of an Analytics pipeline, within a single interface.
In particular:
- It integrates a tool for designing ETL pipelines, very similar to Azure Data Factory, with a graphical interface to create pipelines
- It allows from its interface to query and navigate the data in an Azure Data Lake Storage
- It is possible to create:
- a dedicated Relational Database (Dedicated SQL Pool)
- a Serverless SQL Pool, i.e. an SQL Engine containing only views (or External Tables) on the Data Lake
- It allows from its interface to connect data to a PowerBI instance and see a preview of the Dashboards
- It is possible to activate a Spark Engine with Synapse, and define Spark transformations using Jupyer Notebooks.
All the configuration and objects created within Synapse can be versioned with a Git repository.
Demo
Create resources on Azure
- Create a resource group for your project
- Open resource group, select Add –> MarketPlace –> Azure Synapse Analytics (not private hubs) –> Create
- On the current screen, click on account name –> create new and select a name, same for file system name
- Add the remaining configuration for Synapse workspace as in the screenshot
- Click on review and create, then on Create
Create connection to external Database and a Serverless SQL Pool
- Open the resource group of the project
- Open Synapse Workspace
- Click on Open Synapse Studio
NB: Synapse allows you to version all the configurations in a GIT Repository. We won’t go into details here, but you can switch between “Synapse Live” (no versioning) to a versioning in GIT in the upper left corner of the screen.
- Click on Manage –> Linked Services –> New
- Select Azure SQL Database and click on continue
- Provide the required information.
-
Click on test connection on the bottom of the page to make sure connection is successful, then click on create.
-
Click on the “Data” icon on the left, then on the “+” and SQL Database.
- Select “Serverless” and provide a name
Create a Power BI workspace and connect Synapse to it
- Open powerbi.com
- Login with your own company account
- Click on workspaces –> Create a Workspace
-
Provide name and description and click on save
-
Go back to Synapse, click on Manage –> Linked services –> New
- Select Connect to Power BI or search for Power BI
- Provide a name for the linked service and the workspace name you have just created in Power BI, then click on Create
- Click on Publish All to validate all modifications
Create ETL Pipelines to import data
The subsequent step must be executed for each business process you want to analyze, with facts and dimensions
- Click on the Storage sign on the left (“Data”), then on the linked tab and on the primary data lake object as in the figure. Then, click on “New folder” to create a new folder with the name of your business process.
-
Click on “integrate” button on the left
-
Click on the plus “+” button on the right of “integrate”, then select “Pipeline”
- Provide a name for the pipeline
The subsequent step must be repeated for each table you want to create in the Data Lake
-
On the left bar, expand “Move and Transform” and select “Copy data” tool
- Provide a name for the step, like copy_[table_name]
- Click on the source tab and apply the following steps:
- click on “+ New” button on the right of “Source Dataset”
- Choose Azure SQL Database as input (or your source DB)
- Fulfill the properties form with a name and the linked service we created for the external database, select the table (or “None” if you want to import via a query and not the full table), then click on OK.
- The source dataset you just create should appear on the dropdown list, otherwise select it. Fulfill the rest of the source with the query (or table) of your choice.
- Go the the sink tab, click on “+ New” on the right of “Sink dataset”
- Select “Azure Data lake storage Gen2”, then “Parquet”
- In the subsequent form, provide a name and select the workspace default storage as linked service.
- Click on the folder on the right of the “file path” section, then select the path in the data lake where you want to drop your data (double click on a folder to open it). You may choose the folder you created for the business process in previous steps.
-
If you wrote a query in the source section, also add the file name on the last text form of “File path” section, with “.parquet” as an extension. If you selected a table, leave it blank.
- Click on the “Debug” button above to test the pipeline. (NB: this will trigger the import and impute costs) and verify if the pipeline runs successfully, in the output tab below
- If the pipeline terminates correctly, verify if data retrieved are correct. Click on “Data” on the left, then on “Linked” tab, click on the primary data lake as in the image below. Then, navigate to the path you selected on the steps below
Now we will create a view on the Data Lake.
You may want to create external tables instead of of views. In this case, in next step select “Create an External Table” instead of SELECT TOP rows. Here an analysis of the differences.
-
Right click on the file you created, and select “New SQL Script” –> Select TOP 100 Rows
-
Click on Run and verify if results are expected.
-
Now, modify the script removing the TOP 100 text and adding at the beginning “CREATE VIEW [name] AS”, as in the figure below. Also, select the database you created in the Preliminary steps at the right of use database, then click on “Run”
-
You may want to provide a name to the Script for creating the view (on the right bar), while this is not strictly necessary
- Click on “Validate all” tab above, and if the systems shows no errors, then click on “Commit All”
Add trigger for the pipeline
You should now define triggers to schedule the activation of the pipeline.
- On the created pipeline, click on Add Trigger –> “New/Edit”
- On the dropdown list, click on “New” TODO: I think that if you already created a trigger for another pipeline, you can reuse it and avoid creating a new one. To verify.
- Create a new trigger providing all requested information. You may select “No” on activated to suspend the trigger and activate it later.
Generate PowerBI Dashboards
- Click on the paper icon on the left (develop), then click on Power BI datasets as in the figure, then click on New Power BI datasets
- Download the PowerBI file, double click on it to open PowerBI
- From PowerBI, connect with your Microsoft account
- Select the views you created in the Import data from external source step and click on Load
- Select Direct Query or Import and validate
- Verify the relationships in the correspondent view
- Prepare your analysis
- When you finish, click on file –> publish and select the Workspace of Power BI created on previous step
- You can now see the report in Synapse, in Develop tab
Agenda
-
Presentation (00:00)
- Introduction
-
Azure and microsoft resources (00:30)
- Azure Synapse (01:00)
-
Q&A (02:00)