Course Outline
PREREQUISITES FOR AZURE DATA ENGINEER:
- Basics of any cloud platform
- Basic networking knowledge
DP-900 MICROSOFT AZURE DATA FUNDAMENTALS
DP-203 AZURE DATA ENGINEER
The objectives covered in this course are:
• Design and implement data storage (40-45%)
• Design and develop data processing (25-30%)
• Design and implement data security (10-15%)
• Monitor and optimize data storage and data processing (10-15%)
• What is the purpose of an Azure Data Lake Gen 2 storage account?
• Basics of Transact-SQL commands
• How to work with Azure Synapse. This includes building a data warehouse in a dedicated
SQL Pool.
• How to build an ETL pipeline with the help of Azure Data Factory. There will be various scenarios
on how to create mapping data flows.
• How to stream data using Azure Stream Analytics. You will see how SQL commands
can be used on your streaming data.
• Basics of the Scala programming language and Spark
• How to work with Spark and Scala in Azure Databricks. We will see how to work with Notebooks.
We will also see how to stream data into Azure Databricks.
• The different security measures and monitoring aspects to consider when working with Azure
services.
1. DATA ENGINEERING FUNDAMENTALS
The topics covered in this section are:
- 1.1. Course and Exam objectives
- 1.2. Azure Overview (Concepts, Azure AD, Azure Free Account Creation, Azure Portal tour)
- 1.3. Introduction to Different File Formats
- 1.3.1. CSV, TXT
- 1.3.2. JSON & JSON Data types
- 1.3.3. Parquet Format
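The file formats above differ mainly in structure and typing. A minimal sketch using only Python's standard library, round-tripping the same record through CSV and JSON (Parquet is a binary columnar format and needs a third-party library such as pyarrow, so it is omitted here; the record itself is illustrative):

```python
import csv
import io
import json

# The same record expressed in two of the formats covered above.
record = {"id": 1, "name": "Anita", "city": "Pune"}

# CSV: flat rows of text; data types are not preserved (everything is a string).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "city"])
writer.writeheader()
writer.writerow(record)
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(rows[0]["id"], type(rows[0]["id"]).__name__)  # "id" comes back as a string

# JSON: nested structures with native data types (numbers, booleans, null).
parsed = json.loads(json.dumps(record))
print(parsed["id"], type(parsed["id"]).__name__)    # "id" comes back as an int
```

This type-preservation difference is one reason the labs later prefer Parquet or JSON over CSV when loading typed data into Synapse.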
- 1.4. Legacy Engineering & Analytical Revision
- 1.4.1. Relational Databases & Structured Query Language (SQL)
- 1.4.2. Power BI
- 1.4.3. ETL & ELT Batch Processing, Data Analytics
- 1.5. Creating Azure Resources
- 1.5.1. About Azure Active Directory
- 1.5.2. Deploying Microsoft SQL Server on Azure
- 1.5.3. Lab – Creating a virtual machine, Installing Microsoft SQL Server, Connecting to
Microsoft SQL Server, SQL Server template, Creating an Azure SQL Database,
Deleting Azure resources
- 1.5.4. Different ways to deploy Azure resources – Overview
- 1.5.5. Lab – Creating an Azure SQL Database – PowerShell, Azure CLI, ARM Templates
- 1.5.6. Azure SQL Managed Instance (Creating and Connecting)
- 1.5.7. IaaS vs PaaS
- 1.5.8. Types of SQL statements
- 1.5.9. Lab – Azure SQL Database – Working with data, Primary Keys, Foreign keys,
Additional SQL statements
- 1.5.10. Lab – Creating another Azure SQL database
- 1.5.11. Lab – Table Views, Indexes in tables
- 1.5.12. Roles and tools
- 1.5.13. Lab – Azure Data Studio
- 1.5.14. Using Visual Studio Code
- 1.5.15. Lab – Using sqlcmd
- 1.5.16. Lab – Azure Database for MySQL, MariaDB, PostgreSQL
- 1.5.17. Commands for other databases
- 1.5.18. Azure SQL Encryption, Read Replica, Adding SQL users
- 1.5.19. Lab – Azure SQL Database – Firewall
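The SQL labs above (primary keys, foreign keys, views, indexes) run against Azure SQL Database, but the same statement types can be tried locally with Python's built-in sqlite3 module. The table and column names here are illustrative, not taken from the course labs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# DDL: tables with a primary key and a foreign key (hypothetical names).
conn.execute("""CREATE TABLE Course (
    CourseId INTEGER PRIMARY KEY,
    Title    TEXT NOT NULL)""")
conn.execute("""CREATE TABLE Enrollment (
    EnrollmentId INTEGER PRIMARY KEY,
    CourseId     INTEGER REFERENCES Course(CourseId),
    Student      TEXT NOT NULL)""")

# DML: insert and query.
conn.execute("INSERT INTO Course VALUES (1, 'DP-203')")
conn.execute("INSERT INTO Enrollment VALUES (10, 1, 'Ravi')")

# A view over a join, plus an index to speed up lookups by student.
conn.execute("""CREATE VIEW StudentCourses AS
    SELECT e.Student, c.Title
    FROM Enrollment e JOIN Course c ON e.CourseId = c.CourseId""")
conn.execute("CREATE INDEX idx_student ON Enrollment(Student)")

result = conn.execute("SELECT * FROM StudentCourses").fetchall()
print(result)
```

T-SQL adds its own syntax on top of this (IDENTITY columns, TOP, stored procedures), but the DDL/DML/view/index distinction carried through the labs is the same.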
2. DESIGN AND IMPLEMENT DATA STORAGE
- 2.1. Storage Introduction
- 2.2. Understanding “Data” and relevant relational / non-relational storage systems
- 2.3. Azure Storage Accounts
- 2.3.1. Storage Account Access Tiers
- 2.3.2. Storage Account Redundancy
- 2.3.3. Storage Account Lifecycle
- 2.3.4. Azure Storage Accounts – Introduction
- 2.3.5. Lab – Creating a storage account, Blob service, Table service, File service
- 2.3.6. Azure Storage Explorer
- 2.3.7. Storage Accounts – Data Redundancy
- 2.3.8. Lab – Azure Blob storage – Service tiers, Firewall
- 2.4. Azure SQL Databases
- 2.5. Connect Applications with Storage and SQL Databases (Lab)
- 2.6. File Formats Overview
- 2.7. Lab – Azure Data Lake Storage
- 2.8. Azure Data Lake Gen-2
- 2.8.1. Gen-2 Storage Account
- 2.8.2. Gen-2 Storage creation (Lab)
- 2.8.3. Authorize Gen-2 Storage through Access keys and shared access signatures (Lab)
- 2.9. Azure Storage Costing and Budgeting
- 2.10. Power BI to Visualize Data
- 2.11. SQL Query / T-SQL Theory
- 2.11.1. SQL / T-SQL Practice (Lab)
- 2.12. Azure Synapse Analytics
- 2.12.1. OLTP and OLAP
- 2.12.2. Azure Synapse (Create, Compute, External Tables)
- 2.12.3. Azure Synapse Architecture
- 2.12.4. Azure Synapse External Tables (Lab)
- 2.12.5. SQL Pool & Data Cleansing, External Tables (Lab >> CSV, Parquet)
- 2.12.6. Azure Synapse Data Loading and Bulk Insert
- 2.12.7. Azure Synapse Data Loading (Lab >> CSV, Parquet)
- 2.12.8. Azure Synapse PolyBase and loading into a dedicated SQL Pool
- 2.13. Data Warehouse Design and Build to set up the database (Lab)
- 2.13.1. Fact and Dimension Tables
- 2.13.2. Slowly Changing Dimensions (SCD)
- 2.13.3. Schema Types (Star Schema and Snowflake Schema)
- 2.13.4. Using Power BI and Copy Commands with Azure Synapse
- 2.13.5. Synapse Table Types
- 2.13.5.1. Round Robin Tables and Hash Distributed Tables, Heap Tables
- 2.13.5.2. Table replication
- 2.13.5.3. Azure Synapse Tables (Lab >> Create, Partitions, Indexes, Window
Functions, JSON reading, Surrogate Keys, Case Statements)
- 2.13.5.4. Modern Data Warehouse
- 2.13.5.5. Spark Pool
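The fact/dimension design covered above can be sketched locally with sqlite3: a dimension table keyed by a surrogate key, a fact table holding measures and foreign keys, and a star-schema join query. All table and column names here are hypothetical, standing in for the course's DimProduct/fact examples:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: descriptive attributes, keyed by a surrogate key.
conn.execute("""CREATE TABLE DimProduct (
    ProductKey INTEGER PRIMARY KEY,  -- surrogate key
    ProductId  TEXT,                 -- business (natural) key
    Name       TEXT)""")

# Fact table: numeric measures plus foreign keys into the dimensions.
conn.execute("""CREATE TABLE FactSales (
    ProductKey INTEGER,
    Quantity   INTEGER,
    Amount     REAL)""")

conn.executemany("INSERT INTO DimProduct VALUES (?, ?, ?)",
                 [(1, "P-100", "Keyboard"), (2, "P-200", "Mouse")])
conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?)",
                 [(1, 2, 40.0), (2, 5, 50.0), (1, 1, 20.0)])

# A typical star-schema query: join the fact to a dimension and aggregate.
rows = conn.execute("""
    SELECT d.Name, SUM(f.Amount) AS Total
    FROM FactSales f JOIN DimProduct d ON f.ProductKey = d.ProductKey
    GROUP BY d.Name ORDER BY Total DESC""").fetchall()
print(rows)
```

In a dedicated SQL Pool the same fact table would additionally carry a distribution choice (round robin, hash on a key, or replicated), which this local sketch cannot show.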
3. DESIGN AND DEVELOP DATA PROCESSING
The topics covered in this section are:
- 3.1. Azure Data Factory (ADF)
- 3.1.1. Introduction
- 3.1.2. Extract, Transform, Load
- 3.1.3. Starting with Azure Data Factory
- 3.1.4. Lab 1 – ADF Azure Data Lake to Azure Synapse (Generating Parquet files,
Log.csv file, Parquet files)
- 3.1.5. Lab 2 – ADF Using a query for data transfer
- 3.1.6. Deleting artifacts in Azure Data Factory & Mapping Data Flow
- 3.1.7. Lab – Mapping Data Flow
- 3.1.7.1. Fact Table
- 3.1.7.2. Dimension Table (DimCustomer, DimProduct)
- 3.1.7.3. Surrogate Keys – Dimension tables
- 3.1.8. Lab – Using Cache sink
- 3.1.9. Lab – Handling Duplicate rows
- 3.1.9.1. Note – What happens if we do not have any data in our DimProduct table
- 3.1.9.2. Changing connection details
- 3.1.9.3. Lab – Changing the Time column data in our Log.csv file
- 3.1.9.4. Lab – Convert Parquet to JSON
- 3.1.9.5. Lab – Loading JSON into SQL Pool
- 3.1.9.5.1. Self-Hosted Integration Runtime (Setting up nginx, Setting up the
runtime, Copy Activity, Mapping Data Flow, Using triggers)
- 3.1.9.6. Lab – Processing JSON (Arrays, Objects)
- 3.1.9.7. Lab – Conditional Split, Schema Drift, Metadata activity
- 3.1.9.8. Lab – Azure DevOps (Git configuration, Release configuration)
- 3.1.10. Lab Material Folder
- 3.2. Azure Event Hubs and Stream Analytics
- 3.2.1. What are Azure Event Hubs
- 3.2.2. Batch and Real-Time Processing
- 3.2.3. Lab – (Creating an instance of Event hub, Sending and Receiving Events)
- 3.2.4. What is Azure Stream Analytics
- 3.2.5. Lab – (Creating a Stream Analytics job, Azure Stream Analytics – Defining the job)
- 3.2.6. Lab – Reading database diagnostic data – Setup
- 3.2.7. Lab – Reading data from a JSON file (Setup, Implementation)
- 3.2.8. Lab – Reading data from the Event Hub – (Setup, Implementation)
- 3.2.9. Lab – (Timing windows, Adding multiple outputs, Reference data, OVER clause)
- 3.2.10. Lab – Power BI Output
- 3.2.11. Lab – Reading Network Security Group Logs – (Server Setup, Enabling NSG Flow
Logs, Processing the data)
- 3.2.12. Lab – User Defined Functions
- 3.2.13. Custom Serialization Formats
- 3.2.14. Lab – Azure Event Hubs – Capture Feature
- 3.2.15. Lab – Azure Data Factory – Incremental Data Copy
- 3.2.16. Lab Material Folder
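The timing windows covered above group streaming events by time. A tumbling window (fixed-size, non-overlapping intervals, where each event lands in exactly one window) can be sketched in plain Python; the event timestamps, values, and 10-second window size here are illustrative:

```python
from collections import defaultdict

# (timestamp_in_seconds, value) pairs, as events might arrive from an Event Hub.
events = [(1, 10), (4, 20), (11, 5), (13, 5), (25, 7)]

WINDOW = 10  # tumbling window size in seconds

# Each event belongs to exactly one window: [0,10), [10,20), [20,30), ...
totals = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    totals[window_start] += value

result = sorted(totals.items())
print(result)
```

In Stream Analytics the same grouping is written in its SQL dialect with `GROUP BY TumblingWindow(second, 10)`; hopping and sliding windows differ in that one event can fall into several overlapping windows.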
3.3. SCALA, NOTEBOOKS AND SPARK
- 3.3.1. Introduction to Scala
- 3.3.2. Installing Scala
- 3.3.2.1. Playing with values, Installing IntelliJ IDE
- 3.3.2.2. If construct, for construct, while construct, case construct
- 3.3.2.3. Functions, List collection
- 3.3.3. Starting with Python
- 3.3.3.1. A simple program
- 3.3.3.2. If construct, while construct, List collection, Functions
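The Python constructs listed above (if, while, list collections, functions) combined into one minimal sketch; the function and its sample input are illustrative:

```python
def classify(numbers):
    """Split `numbers` into even and odd values."""
    evens, odds = [], []                 # list collections
    i = 0
    while i < len(numbers):              # while construct
        n = numbers[i]
        if n % 2 == 0:                   # if construct
            evens.append(n)
        else:
            odds.append(n)
        i += 1
    return evens, odds

evens, odds = classify([1, 2, 3, 4, 5])  # function call
print(evens, odds)
```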
- 3.3.4. Quick look at Jupyter Notebook
- 3.3.5. Lab – Azure Synapse – Creating a Spark pool
- 3.3.5.1. Starting out with Notebooks
- 3.3.5.2. Spark DataFrames, Sorting data, Load data, Removing NULL values, Using
SQL statements, Write data to Azure Synapse
- 3.3.5.3. Combined Power, Sharing tables, Creating tables
- 3.3.5.4. JSON files
- 3.3.6. Lab Materials
3.4. AZURE DATABRICKS
- 3.4.1. Introduction to Databricks / Azure Databricks
- 3.4.2. Clusters in Azure Databricks
- 3.4.3. Lab – Creating a workspace, Creating a cluster
- 3.4.4. Lab – Simple notebook, Using DataFrames, Lab – Reading a CSV file
- 3.4.5. Databricks File System
- 3.4.6. Lab – The SQL Data Frame, Visualizations
- 3.4.7. Lab – Few functions on dates, Filtering on NULL values
- 3.4.8. Lab – Parquet-based files, JSON-based files
- 3.4.9. Lab – Structured Streaming
- 3.4.9.1. Let’s first understand our data
- 3.4.9.2. Streaming from Azure Event Hubs – a) Setup b) Implementation
- 3.4.10. Lab – Getting data from Azure Data Lake – a) Setup b) Implementation
- 3.4.11. Lab – Azure Synapse SQL Dedicated Pool
- 3.4.11.1. Lab – Stream and write Data to Azure Synapse SQL Dedicated Pool
- 3.4.12. Lab – Azure Data Lake Storage Credential Passthrough
- 3.4.13. Lab – Running an automated job
- 3.4.14. Autoscaling a cluster
- 3.4.15. Lab – Removing duplicate rows, Using the PIVOT command, Azure Databricks
Table, Azure Data Factory – Running a notebook
- 3.4.16. Delta Lake Introduction
- 3.4.16.1. Lab – Creating a Delta Table, Streaming data into the table, Time Travel
- 3.4.17. Decision making to select Azure Synapse or Azure Databricks
- 3.4.18. Lab Materials Folder
4. DESIGN AND IMPLEMENT DATA SECURITY
- 4.1. Introduction
- 4.2. What is the Azure Key Vault Service
- 4.3. Azure Data Factory – Encryption
- 4.4. Azure Synapse – Customer Managed Keys
- 4.5. Azure Dedicated SQL Pool – Transparent Data Encryption
- 4.6. Lab – Azure Synapse – Data Masking, Auditing
- 4.7. Azure Synapse – Data Discovery and Classification
- 4.8. Azure Synapse – Azure AD Authentication
- 4.9. Lab – Azure Synapse – Azure AD Authentication
- 4.9.1. Setting the admin, creating a user, Row-Level Security, Column-Level Security,
Role Based Access Control (RBAC), Access Control Lists (ACL)
- 4.10. Lab – Azure Synapse – External Tables Authorization
- 4.10.1. via Managed Identity, via Azure AD Authentication
- 4.11. Lab – Azure Synapse – Firewall
- 4.12. Lab – Azure Data Lake – Virtual Network Service Endpoint
- 4.13. Lab – Azure Data Lake – Managed Identity – Data Factory
- 4.14. Lab Materials Folder
5. MONITOR AND OPTIMIZE DATA STORAGE AND DATA PROCESSING
- 5.1.1. Best practices for structuring files in your data lake
- 5.1.2. Azure Storage accounts – Query acceleration
- 5.1.3. View on Azure Monitor
- 5.1.4. Azure Monitor – Alerts
- 5.1.5. Azure Synapse – System Views
- 5.1.6. Azure Synapse – Result set caching
- 5.1.7. Azure Synapse – Workload Management
- 5.1.8. Azure Synapse – Retention points
- 5.1.9. Lab – Azure Data Factory – Monitoring
- 5.1.10. Azure Data Factory – Monitoring – Alerts and Metrics
- 5.1.11. Lab – Azure Data Factory – Annotations
- 5.1.12. Azure Data Factory – Integration Runtime – Note
- 5.1.13. Azure Data Factory – Pipeline Failures
- 5.1.14. Azure Key Vault – High Availability
- 5.1.15. Azure Stream Analytics – Metrics, Streaming Units, The importance of time,
More on the time aspect, Monitoring the Stream Analytics job – Example
- 5.1.16. Azure Event Hubs and Stream Analytics – Partitions
- 5.1.17. Azure Stream Analytics – An example on multiple partitions
- 5.1.18. Azure Stream Analytics – More on partitions
- 5.1.19. Azure Stream Analytics – An example on diagnosing errors
- 5.1.20. Azure Stream Analytics – Diagnostics setting
- 5.1.21. Azure Databricks – Monitoring
- 5.1.22. Azure Databricks – Sending logs to Azure Monitor
- 5.1.23. Azure Event Hubs – High Availability
6. WORKING WITH COSMOS DB
- 6.1.1. Introduction, Non-relational data
- 6.1.2. Different types of non-relational data stores
- 6.1.3. Lab – Azure Cosmos DB
- 6.1.4. Lab – Azure Cosmos DB – Working with data
- 6.1.5. Lab – Azure Cosmos DB – Working with data – Resource
- 6.1.6. NEW – Azure Cosmos DB – More on Partitions
- 6.1.7. The different APIs in Azure Cosmos DB
- 6.1.8. Lab – Azure Cosmos DB – Table API
- 6.1.9. NEW – Lab – Azure Cosmos DB – MongoDB API
- 6.1.10. NEW – Azure Cosmos DB – Gremlin API
- 6.1.11. NEW – Lab – Azure Cosmos DB – Gremlin API
- 6.1.12. NEW – Lab – Azure Cosmos DB – Gremlin API – Resources
- 6.1.13. Azure Cosmos DB – High Availability
- 6.1.14. Azure Storage Account – Azure CLI
- 6.1.15. Azure Cosmos DB – Azure CLI
- 6.1.16. Azure CLI – Commands
- 6.2. Practice Questions / Tests
- 6.2.1. Project 1
- 6.2.2. Project 2
- 6.3. Revision or repeat-section requests, if any.