Azure provides several useful resources for creating a Data Ingestion pipeline.
A traditional pipeline in Azure/PowerBI is composed of the following components:
Azure Analysis Services is a component that is often added to this pipeline, although it is not strictly necessary. It allows pre-calculating some metrics from the data in the Data Warehouse and keeping the results in-memory, so they can be quickly and efficiently served to PowerBI dashboards. We won’t go into the details of Analysis Services in this course.
A traditional relational database, as used for operational systems, can also serve as a Data Warehouse.
Data Factory is a tool for developing ETL pipelines, allowing users to integrate and transform data from several sources through more than 90 connectors. It is fully managed and serverless, so users don’t have to worry about creating, configuring and maintaining containers or virtual machines.
It allows defining complex pipelines and triggers for launching them. The “Data Flow” option also makes it possible to build complex no-code transformations (join, filtering, projection…) and runs on a Spark engine.
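Under the hood, a Data Factory pipeline is described by a JSON document listing its activities. Here is a hedged sketch of such a definition with a single Copy activity, built as a Python dict; the pipeline and dataset names ("CopyPipeline", "InputDataset", "OutputDataset") are hypothetical and would reference datasets defined separately in the factory:

```python
import json

# Sketch of a Data Factory pipeline definition with one Copy activity.
# The names used here are illustrative, not part of any real factory.
pipeline = {
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromSourceToSink",
                "type": "Copy",
                "inputs": [{"referenceName": "InputDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OutputDataset", "type": "DatasetReference"}],
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

In practice such definitions are usually authored visually in the Data Factory studio and versioned with Git rather than written by hand.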
Data Factory can connect to cloud or external data sources, and even to REST APIs and other services.
With the Self-hosted Integration Runtime, an instance of Data Factory can run on-premises; this is particularly useful when a large quantity of data resides on-premises.
Azure Data Factory configurations can be easily versioned with Git from the Data Factory studio.
From https://docs.microsoft.com/it-it/azure/data-factory/author-visually
Azure provides a resource for creating a Data Lake, called Data Lake Storage Gen2.
It stores data in the form of files, organized in nested directories like a classical file system.
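As a toy illustration of this directory structure, the sketch below lands a raw JSON payload under a date-partitioned path. The raw/<source>/<yyyy>/<mm>/<dd> layout is a common convention for organizing a Data Lake, not something imposed by Data Lake Storage:

```python
import datetime
import json
import pathlib

def landing_path(source: str, ts: datetime.datetime) -> pathlib.Path:
    # Partition raw files by source and ingestion date, e.g.
    # raw/sales/2024/01/15/data.json (a common convention, not a requirement).
    return pathlib.Path("raw") / source / ts.strftime("%Y/%m/%d") / "data.json"

def land(source: str, payload: dict, ts: datetime.datetime) -> pathlib.Path:
    # Write the payload as-is; transformations happen later in the pipeline.
    path = landing_path(source, ts)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path

print(landing_path("sales", datetime.datetime(2024, 1, 15)))
```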
A Data Lake typically contains heterogeneous data, from audio and images to structured data. The most common formats used in a Data Lake for structured data are simple CSV files and Parquet files, a compressed column-oriented format that is more efficient and less space-consuming than CSV. Data can also be provided in the form of JSON.
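To see why a compressed column-oriented format such as Parquet tends to be smaller than CSV, here is a standard-library-only toy comparison; it mimics the idea of columnar storage plus compression, not the actual Parquet format:

```python
import csv
import io
import zlib

# Toy table with a low-cardinality "category" column, common in analytics.
rows = [{"id": i, "category": f"cat_{i % 3}", "value": i * 0.5} for i in range(1000)]

# Row-oriented storage: plain CSV text, one record per line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "category", "value"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue().encode())

# Column-oriented storage: each column stored contiguously, then compressed.
# Grouping similar values together lets the compressor exploit repetition,
# which is (loosely) what Parquet does per column.
columns = ["id", "category", "value"]
columnar = "\n".join(",".join(str(r[c]) for r in rows) for c in columns)
columnar_size = len(zlib.compress(columnar.encode()))

print(csv_size, columnar_size)  # the columnar, compressed form is much smaller
```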
PowerBI is a tool for creating dashboards from external data. It can potentially retrieve data from many different sources (Excel files, databases…).
Indeed, for small projects it can be attached directly to operational databases. For a complex project, however, it is better to create an architecture with a Data Lake / Data Warehouse and attach PowerBI to the Data Warehouse.
Once data are retrieved from a source, they can be shown in a tabular model, and relations between tables can be added by the user.
When the tabular model is available, dashboard pages can be created by adding plots, tables and text, specifying for each component which column(s) to use.
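A relation between two tables in the tabular model behaves essentially like a key-based join. A minimal plain-Python sketch, with table and column names made up for illustration:

```python
# Two tables linked by a key, as when a relation in the tabular model
# connects orders.customer_id to customers.id (hypothetical example tables).
customers = {1: "Alice", 2: "Bob"}  # id -> name
orders = [  # (order_id, customer_id, amount)
    (101, 1, 30.0),
    (102, 2, 12.5),
    (103, 1, 7.0),
]

# Following the relation resolves each order's customer, like a join:
joined = [(order_id, customers[customer_id], amount)
          for order_id, customer_id, amount in orders]
print(joined)  # [(101, 'Alice', 30.0), (102, 'Bob', 12.5), (103, 'Alice', 7.0)]
```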
Synapse is a tool for Data Ingestion and Analytics that was recently added to Azure. It allows configuring, in a single studio, both the ETL pipelines, replacing Data Factory (the pipelines and interface are very similar), and the Data Warehouse, and it also allows navigating the Data Lake from its own interface.
More details will be provided later in this course.
While Azure Synapse is more expensive than a traditional architecture, it offers a unified interface and high performance on big databases.
There are several other Azure resources that are often used to enrich the Data Ingestion pipeline, according to the user’s needs.
We provide in the references some pointers for other interesting resources not covered in this course.
Data Ingestion pipeline:
PowerBI:
Other tools:
Real time processing:
Internet of Things: