Course Outline

PREREQUISITES FOR THE AZURE DATA ENGINEER COURSE:

  • Basics of any cloud platform
  • Basic networking knowledge

DP-900 MICROSOFT AZURE DATA FUNDAMENTALS

DP-203 AZURE DATA ENGINEER

The exam objectives covered in this course are:

• Design and implement data storage (40-45%)
• Design and develop data processing (25-30%)
• Design and implement data security (10-15%)
• Monitor and optimize data storage and data processing (10-15%)

In this course you will learn:

• The purpose of an Azure Data Lake Gen 2 storage account
• The basics of Transact-SQL commands
• How to work with Azure Synapse, including building a data warehouse in a dedicated
SQL pool
• How to build an ETL pipeline with the help of Azure Data Factory, with various scenarios
on creating mapping data flows
• How to stream data with Azure Stream Analytics, and how SQL commands can be used
on your streaming data
• The basics of the Scala programming language and Apache Spark
• How to work with Spark and Scala in Azure Databricks, including working with notebooks
and streaming data into Azure Databricks
• The different security measures and monitoring aspects to consider when working with Azure
services

1. DATA ENGINEERING FUNDAMENTALS

The topics covered in this section are:

  • 1.1. Course and Exam objectives
  • 1.2. Azure Overview (Concepts, Azure AD, Azure Free Account Creation, Azure Portal tour)
  • 1.3. Introduction to Different File Formats
  • 1.3.1. CSV, TXT
  • 1.3.2. JSON & JSON Data types
  • 1.3.3. Parquet Format
  • 1.4. Legacy Engineering & Analytical Revision
  • 1.4.1. Relational Databases & Structured Query Language (SQL)
  • 1.4.2. Power BI
  • 1.4.3. ETL & ELT Batch Processing, Data Analytics
  • 1.5. Creating Azure Resources
  • 1.5.1. About Azure Active Directory
  • 1.5.2. Deploying Microsoft SQL Server on Azure
  • 1.5.3. Lab – Creating a virtual machine, Installing Microsoft SQL Server, Connecting to
    Microsoft SQL Server, SQL Server template, Creating an Azure SQL Database,
    Deleting Azure resources
  • 1.5.4. Different ways to deploy Azure resources – Overview
  • 1.5.5. Lab – Creating an Azure SQL Database – PowerShell, Azure CLI, ARM Templates
  • 1.5.6. Azure SQL Managed Instance (Creating and Connecting)
  • 1.5.7. IaaS vs PaaS
  • 1.5.8. Types of SQL statements
  • 1.5.9. Lab – Azure SQL Database – Working with data, Primary Keys, Foreign keys,
    Additional SQL statements
  • 1.5.10. Lab – Creating another Azure SQL database
  • 1.5.11. Lab – Table Views, Indexes in tables
  • 1.5.12. Roles and tools
  • 1.5.13. Lab – Azure Data Studio
  • 1.5.14. Using Visual Studio Code
  • 1.5.15. Lab – Using sqlcmd
  • 1.5.16. Lab – Azure Database for MySQL, MariaDB, PostgreSQL
  • 1.5.17. Commands for other databases
  • 1.5.18. Azure SQL Encryption, Read Replica, Adding SQL users
  • 1.5.19. Lab – Azure SQL Database – Firewall
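The labs above work through core SQL concepts (statements, primary keys, foreign keys, views, indexes) against Azure SQL Database. As a minimal sketch of those same concepts using Python's built-in sqlite3 module rather than T-SQL (syntax details differ in T-SQL, and the table and column names here are invented for illustration):

```python
import sqlite3

# In-memory database standing in for an Azure SQL Database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# DDL: a parent table with a primary key and a child table with a foreign key.
conn.execute("""CREATE TABLE Customer (
    CustomerId INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL)""")
conn.execute("""CREATE TABLE Orders (
    OrderId    INTEGER PRIMARY KEY,
    CustomerId INTEGER NOT NULL REFERENCES Customer(CustomerId),
    Amount     REAL NOT NULL)""")

# DML: INSERT, then query the data back.
conn.execute("INSERT INTO Customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 25.0)")

# A view and an index, as covered in the table views / indexes lab.
conn.execute("""CREATE VIEW CustomerOrders AS
    SELECT c.Name, o.Amount FROM Customer c
    JOIN Orders o ON o.CustomerId = c.CustomerId""")
conn.execute("CREATE INDEX ix_orders_customer ON Orders(CustomerId)")

rows = conn.execute("SELECT Name, Amount FROM CustomerOrders").fetchall()
print(rows)  # [('Alice', 25.0)]
```

The same DDL and DML statements are written in T-SQL in the labs, using tools such as Azure Data Studio and sqlcmd.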

2. DESIGN AND IMPLEMENT DATA STORAGE

  • 2.1. Storage Introduction
  • 2.2. Understanding “Data” and relevant relational / non-relational storage systems
  • 2.3. Azure Storage Accounts
    • 2.3.1. Storage Account Access Tiers
    • 2.3.2. Storage Account Redundancy
    • 2.3.3. Storage Account Lifecycle
    • 2.3.4. Azure Storage Accounts – Introduction
    • 2.3.5. Lab – Creating a storage account, Blob service, Table service, File service
    • 2.3.6. Azure Storage Explorer
    • 2.3.7. Storage Accounts – Data Redundancy
    • 2.3.8. Lab – Azure Blob storage – Service tiers, Firewall
  • 2.4. Azure SQL Databases
  • 2.5. Connect Applications with Storage and SQL Databases (Lab)
  • 2.6. File Formats Overview
  • 2.7. Lab – Azure Data Lake Storage
  • 2.8. Azure Data Lake Gen-2
    • 2.8.1. Gen-2 Storage Account
    • 2.8.2. Gen-2 Storage creation (Lab)
    • 2.8.3. Authorize Gen-2 Storage through Access keys and shared access signature (Lab)
  • 2.9. Azure Storage Costing and Budgeting
  • 2.10. Power BI to Visualize Data
  • 2.11. SQL Query / T-SQL Theory
  • 2.11.1. SQL / T-SQL Practice (Lab)
  • 2.12. Azure Synapse Analytics
    • 2.12.1. OLTP and OLAP
    • 2.12.2. Azure Synapse (Create, Compute, External Tables)
    • 2.12.3. Azure Synapse Architecture
    • 2.12.4. Azure Synapse External Tables (Lab)
    • 2.12.5. SQL Pool & Data Cleansing, External Tables (Lab >> CSV, Parquet)
    • 2.12.6. Azure Synapse Data Loading and Bulk Insert
    • 2.12.7. Azure Synapse Data Loading (Lab >> CSV, Parquet)
    • 2.12.8. Azure Synapse PolyBase and pushing data into the dedicated SQL Pool
  • 2.13. Data Warehouse Design and Build – Setting up the database (Lab)
    • 2.13.1. Fact and Dimension Tables
    • 2.13.2. Slowly Changing Dimensions
    • 2.13.3. Schema Types (Star Schema and Snowflake Schema)
    • 2.13.4. Using Power BI and Copy Commands with Azure Synapse
    • 2.13.5. Synapse Table Types
    • 2.13.5.1. Round Robin Tables and Hash Distributed Tables, Heap Tables
    • 2.13.5.2. Table replication
    • 2.13.5.3. Azure Synapse Tables (Lab >> Create, Partitions, Indexes, Window
      Functions, JSON reading, Surrogate Keys, Case Statements)
    • 2.13.5.4. Modern Data Warehouse
    • 2.13.5.5. Spark Pool
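The Synapse table types above differ in how rows are placed: a dedicated SQL pool spreads every table across 60 distributions, with hash-distributed tables placing each row by hashing its distribution column and round-robin tables spreading rows evenly regardless of value. A conceptual Python sketch of that placement logic (Synapse uses its own deterministic hash function internally; this only illustrates the idea):

```python
NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool always uses 60 distributions

def hash_distribution(rows, key):
    """Hash-distributed table: each row lands in the distribution given by
    hashing its distribution column, so equal keys are co-located (good for joins)."""
    buckets = {}
    for row in rows:
        d = hash(row[key]) % NUM_DISTRIBUTIONS
        buckets.setdefault(d, []).append(row)
    return buckets

def round_robin_distribution(rows):
    """Round-robin table: rows are spread evenly regardless of value
    (fast to load, but joins may require data movement)."""
    buckets = {}
    for i, row in enumerate(rows):
        buckets.setdefault(i % NUM_DISTRIBUTIONS, []).append(row)
    return buckets

sales = [{"ProductId": p, "Qty": q} for p, q in [(1, 5), (2, 3), (1, 7)]]
hashed = hash_distribution(sales, "ProductId")
# Both rows for ProductId 1 land in the same distribution:
print(len(hashed[hash(1) % NUM_DISTRIBUTIONS]))  # 2
```

Replicated tables (2.13.5.2) take the opposite approach: a full copy of a small table is kept on every compute node so joins never need data movement.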

3. DESIGN AND DEVELOP DATA PROCESSING

The topics covered in this section are:

  • 3.1. Azure Data Factory (ADF)
    • 3.1.1. Introduction
    • 3.1.2. Extract, Transform, Load
    • 3.1.3. Starting with Azure Data Factory
    • 3.1.4. Lab 1 – ADF: Azure Data Lake to Azure Synapse (Generating Parquet files,
      Log.csv file, Parquet files)
    • 3.1.5. Lab 2 – ADF: Using a query for data transfer
    • 3.1.6. Deleting artifacts in Azure Data Factory & Mapping Data Flow
    • 3.1.7. Lab – Mapping Data Flow
    • 3.1.7.1. Fact Table
    • 3.1.7.2. Dimension Table (DimCustomer, DimProduct)
    • 3.1.7.3. Surrogate Keys – Dimension tables
    • 3.1.8. Lab – Using Cache sink
    • 3.1.9. Lab – Handling Duplicate rows
    • 3.1.9.1. Note – What happens if we do not have any data in our DimProduct table
    • 3.1.9.2. Changing connection details
    • 3.1.9.3. Lab – Changing the Time column data in our Log.csv file
    • 3.1.9.4. Lab – Convert Parquet to JSON
    • 3.1.9.5. Lab – Loading JSON into SQL Pool
    • 3.1.9.5.1. Self-Hosted Integration Runtime (Setting up nginx, Setting up the
      runtime, Copy Activity, Mapping Data Flow, Using triggers)
    • 3.1.9.6. Lab – Processing JSON (Arrays, Objects)
    • 3.1.9.7. Lab – Conditional Split, Schema Drift, Metadata activity
    • 3.1.9.8. Lab – Azure DevOps (Git configuration, Release configuration)
    • 3.1.10. Lab Material Folder
  • 3.2. Azure Event Hubs and Stream Analytics
    • 3.2.1. What are Azure Event Hubs
    • 3.2.2. Batch and Real-Time Processing
    • 3.2.3. Lab – (Creating an instance of Event hub, Sending and Receiving Events)
    • 3.2.4. What is Azure Stream Analytics
    • 3.2.5. Lab – (Creating a Stream Analytics job, Azure Stream Analytics – Defining the job)
    • 3.2.6. Lab – Reading database diagnostic data – Setup
    • 3.2.7. Lab – Reading data from a JSON file (Setup, Implementation)
    • 3.2.8. Lab – Reading data from the Event Hub – (Setup, Implementation)
    • 3.2.9. Lab – (Timing windows, Adding multiple outputs, Reference data, OVER clause)
    • 3.2.10. Lab – Power BI Output
    • 3.2.11. Lab – Reading Network Security Group Logs – (Server Setup, Enabling NSG Flow
      Logs, Processing the data)
    • 3.2.12. Lab – User Defined Functions
    • 3.2.13. Custom Serialization Formats
    • 3.2.14. Lab – Azure Event Hubs – Capture Feature
    • 3.2.15. Lab – Azure Data Factory – Incremental Data Copy
    • 3.2.16. Lab Material Folder
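The timing-windows lab above (3.2.9) covers how Stream Analytics groups events by time; a tumbling window, for example, slices the stream into fixed, non-overlapping intervals and aggregates within each one. A conceptual Python sketch of that grouping (Stream Analytics expresses this in SQL with the TUMBLINGWINDOW function; the event data here is invented for illustration):

```python
def tumbling_window(events, window_seconds):
    """Group (timestamp_seconds, payload) events into fixed, non-overlapping
    windows and return {window_start: event_count} -- the same grouping that
    a GROUP BY over a tumbling window performs in a Stream Analytics query."""
    windows = {}
    for ts, _payload in events:
        start = (ts // window_seconds) * window_seconds
        windows[start] = windows.get(start, 0) + 1
    return windows

# Events arriving at t = 1, 4, 6, and 11 seconds, with a 5-second window:
events = [(1, "a"), (4, "b"), (6, "c"), (11, "d")]
print(tumbling_window(events, 5))  # {0: 2, 5: 1, 10: 1}
```

Hopping and sliding windows relax the "non-overlapping" property, so a single event can contribute to more than one window.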

3.3. SCALA, NOTEBOOKS AND SPARK

  • 3.3.1. Introduction to Scala
    • 3.3.2. Installing Scala
    • 3.3.2.1. Playing with values, Installing IntelliJ IDE
    • 3.3.2.2. If construct, for construct, while construct, case construct
    • 3.3.2.3. Functions, List collection
    • 3.3.3. Starting with Python
    • 3.3.3.1. A simple program
    • 3.3.3.2. If construct, while construct, List collection, Functions
    • 3.3.4. Quick look at Jupyter Notebook
  • 3.3.5. Lab – Azure Synapse – Creating a Spark pool
    • 3.3.5.1. Starting out with Notebooks
    • 3.3.5.2. Spark DataFrames, Sorting data, Load data, Removing NULL values, Using
      SQL statements, Write data to Azure Synapse
    • 3.3.5.3. Combined Power, Sharing tables, Creating tables
    • 3.3.5.4. JSON files
    • 3.3.6. Lab Materials
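The Python constructs listed in 3.3.3 (a simple program, the if and while constructs, the List collection, and functions) can be sketched in a few lines; the names and values here are illustrative only:

```python
# A simple function using the if construct.
def grade(score):
    if score >= 50:
        return "pass"
    else:
        return "fail"

# The while construct iterating over a List collection.
scores = [35, 72, 90]
results = []
i = 0
while i < len(scores):
    results.append(grade(scores[i]))
    i += 1

print(results)  # ['fail', 'pass', 'pass']
```

The Scala section (3.3.2) covers the same constructs in Scala syntax, which carries over directly to writing Spark code in notebooks.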

3.4. AZURE DATABRICKS

  • 3.4.1. Introduction to Databricks / Azure Databricks
    • 3.4.2. Clusters in Azure Databricks
    • 3.4.3. Lab – Creating a workspace, Creating a cluster
    • 3.4.4. Lab – Simple notebook, Using DataFrames, Lab – Reading a CSV file
    • 3.4.5. Databricks File System
    • 3.4.6. Lab – The SQL Data Frame, Visualizations
    • 3.4.7. Lab – Few functions on dates, Filtering on NULL values
    • 3.4.8. Lab – Parquet-based files, JSON-based files
    • 3.4.9. Lab – Structured Streaming
    • 3.4.9.1. Let’s first understand our data
    • 3.4.9.2. Streaming from Azure Event Hubs – a) Setup b) Implementation
    • 3.4.10. Lab – Getting data from Azure Data Lake – a) Setup b) Implementation
    • 3.4.11. Lab – Azure Synapse SQL Dedicated Pool
    • 3.4.11.1. Lab – Stream and write Data to Azure Synapse SQL Dedicated Pool
    • 3.4.12. Lab – Azure Data Lake Storage Credential Passthrough
    • 3.4.13. Lab – Running an automated job
    • 3.4.14. Autoscaling a cluster
    • 3.4.15. Lab – Removing duplicate rows, Using the PIVOT command, Azure Databricks
      Table, Azure Data Factory – Running a notebook
    • 3.4.16. Delta Lake Introduction
    • 3.4.16.1. Lab – Creating a Delta Table, Streaming data into the table, Time Travel
    • 3.4.17. Decision making to select Azure Synapse or Azure Databricks
    • 3.4.18. Lab Materials Folder
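The duplicate-removal and PIVOT labs above (3.4.15) correspond to DataFrame operations such as dropDuplicates and pivot in Databricks. A plain-Python sketch of what those two operations do to a set of rows (the column names and data are invented for illustration):

```python
def drop_duplicates(rows, keys):
    """Keep the first row seen for each combination of key columns,
    mirroring what dropDuplicates([...]) does on a Spark DataFrame."""
    seen, out = set(), []
    for row in rows:
        k = tuple(row[c] for c in keys)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

def pivot(rows, index, columns, values):
    """Turn distinct values of one column into output columns, summing the
    value column -- like GroupedData.pivot() followed by an aggregation."""
    table = {}
    for row in rows:
        cell = table.setdefault(row[index], {})
        cell[row[columns]] = cell.get(row[columns], 0) + row[values]
    return table

rows = [
    {"store": "A", "month": "Jan", "sales": 10},
    {"store": "A", "month": "Feb", "sales": 20},
    {"store": "A", "month": "Jan", "sales": 10},  # duplicate row
    {"store": "B", "month": "Jan", "sales": 5},
]
unique = drop_duplicates(rows, ["store", "month"])
print(pivot(unique, "store", "month", "sales"))
# {'A': {'Jan': 10, 'Feb': 20}, 'B': {'Jan': 5}}
```

In the lab these transformations run on Spark DataFrames inside a notebook, where they scale across the cluster instead of a single Python list.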

4. DESIGN AND IMPLEMENT DATA SECURITY

  • 4.1. Introduction
  • 4.2. What is the Azure Key Vault Service
  • 4.3. Azure Data Factory – Encryption
  • 4.4. Azure Synapse – Customer Managed Keys
  • 4.5. Azure Dedicated SQL Pool – Transparent Data Encryption
  • 4.6. Lab – Azure Synapse – Data Masking, Auditing
  • 4.7. Azure Synapse – Data Discovery and Classification
  • 4.8. Azure Synapse – Azure AD Authentication
  • 4.9. Lab – Azure Synapse – Azure AD Authentication
    • 4.9.1. Setting the admin, creating a user, Row-Level Security, Column-Level Security,
      Role Based Access Control (RBAC), Access Control Lists (ACL)
  • 4.10. Lab – Azure Synapse – External Tables Authorization
    • 4.10.1. via Managed Identity, via Azure AD Authentication
  • 4.11. Lab – Azure Synapse – Firewall
  • 4.12. Lab – Azure Data Lake – Virtual Network Service Endpoint
  • 4.13. Lab – Azure Data Lake – Managed Identity – Data Factory
  • 4.14. Lab Materials Folder
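Row-level and column-level security (lab 4.9.1) both narrow what a query returns based on who is asking. A conceptual Python sketch of the two ideas (the user names and predicate are invented for illustration; in Azure Synapse this is configured with a T-SQL security policy and GRANT statements, not application code):

```python
def row_filter(rows, user):
    """Row-level security: a predicate decides which rows a user may see,
    like a FILTER PREDICATE in a T-SQL security policy."""
    return [r for r in rows if r["rep"] == user or user == "manager"]

def mask_columns(row, granted):
    """Column-level security: only columns the user was granted are returned."""
    return {col: val for col, val in row.items() if col in granted}

sales = [
    {"rep": "alice", "amount": 100},
    {"rep": "bob", "amount": 200},
]
print(row_filter(sales, "alice"))       # [{'rep': 'alice', 'amount': 100}]
print(mask_columns(sales[0], {"rep"}))  # {'rep': 'alice'}
```

Dynamic data masking (lab 4.6) is a related but weaker control: the row and column are still returned, but the value is obfuscated for non-privileged users.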

5. MONITOR AND OPTIMIZE DATA STORAGE AND DATA PROCESSING

  • 5.1.1. Best practices for structuring files in your data lake
  • 5.1.2. Azure Storage accounts – Query acceleration
  • 5.1.3. View on Azure Monitor
  • 5.1.4. Azure Monitor – Alerts
  • 5.1.5. Azure Synapse – System Views
  • 5.1.6. Azure Synapse – Result set caching
  • 5.1.7. Azure Synapse – Workload Management
  • 5.1.8. Azure Synapse – Retention points
  • 5.1.9. Lab – Azure Data Factory – Monitoring
  • 5.1.10. Azure Data Factory – Monitoring – Alerts and Metrics
  • 5.1.11. Lab – Azure Data Factory – Annotations
  • 5.1.12. Azure Data Factory – Integration Runtime – Note
  • 5.1.13. Azure Data Factory – Pipeline Failures
  • 5.1.14. Azure Key Vault – High Availability
  • 5.1.15. Azure Stream Analytics – Metrics, Streaming Units, The importance of time,
    More on the time aspect, Monitoring the Stream Analytics job – Example
  • 5.1.16. Azure Event Hubs and Stream Analytics – Partitions
  • 5.1.17. Azure Stream Analytics – An example on multiple partitions
  • 5.1.18. Azure Stream Analytics – More on partitions
  • 5.1.19. Azure Stream Analytics – An example on diagnosing errors
  • 5.1.20. Azure Stream Analytics – Diagnostics setting
  • 5.1.21. Azure Databricks – Monitoring
  • 5.1.22. Azure Databricks – Sending logs to Azure Monitor
  • 5.1.23. Azure Event Hubs – High Availability

6. WORKING WITH COSMOS DB

  • 6.1.1. Introduction, Non-relational data
  • 6.1.2. Different types of non-relational data stores
  • 6.1.3. Lab – Azure Cosmos DB
  • 6.1.4. Lab – Azure Cosmos DB – Working with data
  • 6.1.5. Lab – Azure Cosmos DB – Working with data – Resource
  • 6.1.6. NEW – Azure Cosmos DB – More on Partitions
  • 6.1.7. The different APIs in Azure Cosmos DB
  • 6.1.8. Lab – Azure Cosmos DB – Table API
  • 6.1.9. NEW – Lab – Azure Cosmos DB – MongoDB API
  • 6.1.10. NEW – Azure Cosmos DB – Gremlin API
  • 6.1.11. NEW – Lab – Azure Cosmos DB – Gremlin API
  • 6.1.12. NEW – Lab – Azure Cosmos DB – Gremlin API – Resources
  • 6.1.13. Azure Cosmos DB – High Availability
  • 6.1.14. Azure Storage Account – Azure CLI
  • 6.1.15. Azure Cosmos DB – Azure CLI
  • 6.1.16. Azure CLI – Commands
  • 6.2. Practice Questions / Tests
    • 6.2.1. Project 1
    • 6.2.2. Project 2
  • 6.3. Revision or repeat-section requests, if any