Contents
 Microsoft Azure Well-Architected Framework
 Overview
 Reliability
 About
 Overview
 Principles
 Design
 Checklist
 Requirements
 Application design
 Resiliency & dependencies
 Best practices
 Testing
 Checklist
 Resiliency testing
 Backup & recovery
 Backup and recovery
 Automatic retry of failed backup jobs
 Error handling
 Chaos engineering
 Best practices
 Monitoring
 Checklist
 Application health
 Health modeling
 Best practices
 Reliability patterns
 Security
 About

 Overview
 Principles
 Design
 Governance
 Checklist
 Compliance requirements
 Landing zone
 Segmentation strategy
 Management groups
 Administration
 Identity and access management
 Checklist
 Roles and responsibilities
 Control plane
 Authentication
 Authorization
 Best practices
 Networking
 Checklist
 Network segmentation
 Connectivity
 Application endpoints
 Data flow
 Best practices
 Data protection
 Checklist
 Encryption
 Key and secret management
 Best practices
 Applications and services
 Application security considerations
 Application classification

 Threat analysis
 Secure PaaS deployments
 Configuration and dependencies
 Build-deploy
 Checklist
 Governance considerations
 Infrastructure provisioning
 Code deployments
 Monitor-remediate
 Checklist
 Tools
 Azure resources
 Logs and alerts
 Review and remediate
 Compliance review
 Validate and test
 Security operations
 Tradeoffs
 Cost Optimization
 About
 Overview
 Principles
 Design
 Checklist
 Cost model
 Capture requirements
 Azure regions
 Azure resources
 Governance
 Initial estimate
 Managed services
 Performance and price options

 Provision
 Checklist
 AI + Machine Learning
 Big data
 Compute
 Data stores
 Messaging
 Networking
 Cost for networking services
 Web apps
 Monitor
 Checklist
 Budgets and alerts
 Reports
 Reviews
 Optimize
 Checklist
 Autoscale
 VM instances
 Caching
 Tradeoffs
 Operational Excellence
 About
 Overview
 Principles
 Automation
 Automation overview
 Repeatable infrastructure
 Configure infrastructure
 Automate operational tasks
 Release engineering
 Application development

 Continuous integration
 Testing
 Performance
 Deployment
 Rollback
 Monitor
 Checklist
 Monitor cloud applications
 Monitoring stages
 Data sources
 Instrumentation
 Collection and storage
 Analysis
 Visualization
 Alerting
 Common use cases
 Health monitoring
 Usage monitoring
 Issue tracking
 Tracing and debugging
 Auditing
 Operational Excellence patterns
 Performance Efficiency
 About
 Overview
 Principles
 Design
 Checklist
 Distributed architecture challenges
 Application design
 Application efficiency
 Scalability

 Capacity planning
 Test
 Checklist
 Performance testing
 Testing tools
 Monitoring
 Checklist
 Application profiling
 Analyze infrastructure
 Performance data
 Performance Efficiency patterns
 Checklist
 Tradeoffs
 Workloads
 Mission-critical
 Quick links
 Get started
 Design methodology
 Design principles
 Architecture pattern
 Cross-cutting concerns
 Design areas
 Application design
 Application platform
 Data platform
 Networking and connectivity
 Health modeling
 Deployment and testing
 Security
 Operational procedures
 Assessment tool
 Carrier-grade

 Quick links
 Get started
 Design principles
 Design areas
 Fault tolerance
 Data model
 Health modeling
 Testing and validation
 Hybrid
 Overview
 Cost Optimization
 Operational Excellence
 Performance Efficiency
 Reliability
 Security
 IoT
 Overview
 Reliability
 Security
 Cost Optimization
 Operational Excellence
 Performance Efficiency
 SAP
 Overview
 Reliability
 Security
 Cost Optimization
 Operational Excellence
 Performance Efficiency
 Sustainability
 Quick links
 Get started

 Design methodology
 Design principles
 Design areas
 Application design
 Application platform
 Testing
 Operational procedures
 Networking and connectivity
 Storage
 Security
 Services
 Compute
 Azure Service Fabric
 Azure App Service
 Reliability
 Cost optimization
 Operational excellence
 Azure Batch
 Reliability
 Operational excellence
 Performance efficiency
 Azure Kubernetes Service
 Functions
 Security
 Virtual Machines
 Data
 Azure Cache for Redis
 Reliability
 Operational excellence
 Azure Databricks
 Security
 Azure Database for MySQL

 Cost optimization
 Azure Database for PostgreSQL
 Cost optimization
 Azure SQL Database
 Azure SQL Managed Instance
 Reliability
 Operational excellence
 Azure Cosmos DB
 Reliability
 Operational excellence
 Hybrid
 Azure Stack Hub
 Reliability
 Operational excellence
 Storage
 Storage Accounts
 Reliability
 Security
 Cost optimization
 Operational excellence
 Disks
 Cost optimization
 Messaging
 Event Grid
 Reliability
 Operational excellence
 Event Hubs
 Reliability
 Operational excellence
 Service Bus
 Reliability
 Operational excellence

 Queue Storage
 Reliability
 Operational excellence
 IoT Hub
 Reliability
 Operational excellence
 IoT Hub Device Provisioning Service
 Reliability
 Operational excellence
 Networking
 Application Delivery (General)
 Reliability
 Operational excellence
 Application Gateway v2
 Azure Firewall
 ExpressRoute
 API Management
 Reliability
 Cost optimization
 Operational excellence
 Azure Front Door
 Reliability
 Security
 Operational excellence
 Network Virtual Appliances (NVA)
 Reliability
 Cost optimization
 Operational excellence
 Network Connectivity
 Reliability
 Cost optimization
 Operational excellence

 Azure Virtual Network
 Reliability
 Operational excellence
 Azure Load Balancer
 Reliability
 Operational excellence
 Traffic Manager
 Reliability
 Operational excellence
 IP addresses
 Cost optimization
 Monitoring
 Log Analytics
 Cost optimization
 Application Insights
 Security
 Cost optimization
 Operational excellence
 Implementing Recommendations

Microsoft Azure Well-Architected Framework

12/16/2022 • 4 minutes to read

The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a
workload. The framework consists of five pillars of architectural excellence:

Reliability
Security
Cost Optimization
Operational Excellence
Performance Efficiency

Incorporating these pillars helps produce a high-quality, stable, and efficient cloud architecture:

PILLAR | DESCRIPTION
Reliability | The ability of a system to recover from failures and continue to function.
Security | Protecting applications and data from threats.
Cost Optimization | Managing costs to maximize the value delivered.
Operational Excellence | Operations processes that keep a system running in production.
Performance Efficiency | The ability of a system to adapt to changes in load.

Reference the following video about how to architect successful workloads on Azure with the Well-Architected
Framework:

Overview

The following diagram gives a high-level overview of the Azure Well-Architected Framework:

In the center is the Well-Architected Framework, which includes the five pillars of architectural excellence.
Surrounding the Well-Architected Framework are six supporting elements:

Azure Well-Architected Review
Azure Advisor
Documentation
Partners, Support, and Services Offers
Reference Architectures
Design Principles

Assess your workload

To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.

We also recommend you use Azure Advisor and Advisor Score to identify and prioritize opportunities to
improve the posture of your workloads. Both services are free to all Azure users and align to the five pillars of
the Well-Architected Framework:

Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your
Azure deployments. It analyzes your resource configuration and usage telemetry. It recommends
solutions that can help you improve the reliability, security, cost effectiveness, performance, and
operational excellence of your Azure resources. Learn more about Azure Advisor.

Advisor Score is a core feature of Azure Advisor that aggregates Advisor recommendations into a
simple, actionable score. This score enables you to tell at a glance if you're taking the necessary steps to
build reliable, secure, and cost-efficient solutions, and to prioritize the actions that will yield the biggest
improvement to the posture of your workloads. The Advisor score consists of an overall score, which can
be further broken down into five category scores corresponding to each of the Well-Architected pillars.
Learn more about Advisor Score.

Reliability

A reliable workload is one that is both resilient and available. Resiliency is the ability of the system to recover
from failures and continue to function. The goal of resiliency is to return the application to a fully functioning
state after a failure occurs. Availability is whether your users can access your workload when they need to.
For more information about resiliency, reference the following video that will show you how to start improving
the reliability of your Azure workloads:

Reliability guidance

The following topics offer guidance on designing and improving reliable Azure applications:
Designing reliable Azure applications
Design patterns for resiliency
Best practices:
Transient fault handling
Retry guidance for specific services
For an overview of reliability principles, reference Principles of the reliability pillar.

Security

Think about security throughout the entire lifecycle of an application, from design and implementation to
deployment and operations. The Azure platform provides protections against various threats, such as network
intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps
processes.
Ask the right questions about secure application development on Azure by referencing the following video:

Security guidance

Consider the following broad security areas:
Identity management
Protect your infrastructure
Application security
Data sovereignty and encryption
Security resources
For more information, reference Overview of the security pillar.

Cost optimization

When you're designing a cloud solution, focus on generating incremental value early. Apply the principles of
Build-Measure-Learn to accelerate your time to market while avoiding capital-intensive solutions.
For more information, reference Cost optimization and the following video on how to start optimizing your
Azure costs:

Cost guidance

The following topics offer cost optimization guidance as you develop the Well-Architected Framework for your
workload:
Review cost principles
Develop a cost model
Create budgets and alerts
Review the cost optimization checklist
For a high-level overview, reference Overview of the cost optimization pillar.

Operational excellence

Operational excellence covers the operations and processes that keep an application running in production.
Deployments must be reliable and predictable. Automate deployments to reduce the chance of human error.
Fast and routine deployment processes won't slow down the release of new features or bug fixes. Equally
important, you must quickly roll back or roll forward if an update has problems.
For more information, reference the following video about bringing security into your DevOps practice on
Azure:

Operational excellence guidance

The following topics provide guidance on designing and implementing DevOps practices for your Azure
workload:
Design patterns for operational excellence
Best practices: Monitoring and diagnostics
For a high-level summary, reference Overview of the operational excellence pillar.

Performance efficiency

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. The main ways to achieve performance efficiency include using scaling appropriately and
implementing PaaS offerings that have scaling built in.
For more information, watch Performance Efficiency: Fast & Furious: Optimizing for Quick and Reliable VM
Deployments.

Performance efficiency guidance

The following topics offer guidance on how to design and improve the performance efficiency posture of your
Azure workload:
Design patterns for performance efficiency
Best practices:
Autoscaling
Background jobs
Caching
CDN
Data partitioning
For a high-level synopsis, reference Overview of the performance efficiency pillar.

Next steps

Learn more about:
Azure Well-Architected Review
Well-Architected Series
Introduction to the Microsoft Azure Well-Architected Framework
Microsoft Defender for Cloud
Cloud Adoption Framework


Overview of the reliability pillar

12/16/2022 • 4 minutes to read

Reliability ensures your application can meet the commitments you make to your customers. Architecting
resiliency into your application framework ensures your workloads are available and can recover from failures
at any scale.

Building for reliability includes:

Ensuring a highly available architecture
Recovering from failures such as data loss, major downtime, or ransomware incidents

To assess the reliability of your workload using the tenets found in the Microsoft Azure Well-Architected
Framework, reference the Microsoft Azure Well-Architected Review.

For more information, explore the following video on diving deeper into Azure workload reliability:

In traditional application development, there has been a focus on increasing the mean time between failures
(MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is
required, because of several factors:

Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
Costs for cloud environments are kept low through commodity hardware, so occasional hardware failures must be expected.
Applications often depend on external services, which may become temporarily unavailable or throttle high-volume users.
Today's users expect an application to be available 24/7 without ever going offline.

All of these factors mean that cloud applications must be designed to expect occasional failures and recover
from them. Azure has many resiliency features already built into the platform. For example:

Azure Storage, Azure SQL Database, and Azure Cosmos DB all provide built-in data replication across availability zones and regions.

Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware failures.

Virtual machines (VMs) in an availability set are spread across several fault domains. A fault domain is a group of VMs that share a common power source and network switch. Spreading VMs across fault domains limits the impact of physical hardware failures, network outages, or power interruptions.

Availability Zones are physically separate locations within each Azure region. Each zone is composed of one or more datacenters equipped with independent power, cooling, and networking infrastructure. With availability zones, you can design and operate applications and databases that automatically transition between zones without interruption, which ensures resiliency if one zone is affected. For more information, reference Regions and Availability Zones in Azure.
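
To make the fault-domain idea concrete, the following minimal sketch models a round-robin spread of VMs across fault domains and reports the worst-case share of VMs lost when a single domain fails. It is only a toy model: Azure performs the actual placement for an availability set, and the function names `assign_fault_domains` and `worst_case_loss`, the VM names, and the domain counts are assumptions made for illustration.

```python
from collections import Counter


def assign_fault_domains(vm_names, fault_domain_count):
    """Round-robin each VM onto a 0-based fault domain index (toy model)."""
    return {name: i % fault_domain_count for i, name in enumerate(vm_names)}


def worst_case_loss(placement):
    """Largest fraction of VMs lost if any single fault domain fails."""
    per_domain = Counter(placement.values())
    return max(per_domain.values()) / len(placement)


# Hypothetical example: six VMs spread over three fault domains.
vms = [f"vm-{i}" for i in range(6)]
spread = assign_fault_domains(vms, fault_domain_count=3)
print(worst_case_loss(spread))  # one of three domains failing removes a third of the VMs
```

With a single fault domain the same function returns 1.0, which is the situation that spreading across domains is meant to avoid.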

That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels
of the architecture. Some mitigations are more tactical in nature, for example, retrying a remote call after a
transient network failure. Other mitigations are more strategic, such as failing over the entire application to a
secondary region. Tactical mitigations can make a large difference. While it's rare for an entire region to
experience a disruption, transient problems such as network congestion are more common, so target these
issues first. Having the right monitoring and diagnostics is also important, both to detect failures when they
happen, and to find the root causes.

When designing an application to be resilient, you must understand your availability requirements. How much
downtime is acceptable? The amount of downtime is partly a function of cost. How much will potential
downtime cost your business? How much should you invest in making the application highly available?

Topics and best practices

The reliability pillar covers the following topics and best practices to help you build a resilient workload:

RELIABILITY TOPIC | DESCRIPTION
Reliability principles | These critical principles are used as lenses to assess the reliability of an application deployed on Azure.
Design for reliability | Consider how systems use Availability Zones, perform scalability, respond to failure, and other strategies that optimize reliability in application design.
Resiliency checklist for specific Azure services | Every technology has its own particular failure modes, which you must consider when designing and implementing your application. Use this checklist to review the resiliency considerations for specific Azure services.
Target and non-functional requirements | Target and non-functional requirements such as availability targets and recovery targets allow you to measure the uptime and downtime of your workloads. Having clearly defined targets is crucial to have a goal to work and measure against.
Resiliency and dependencies | Building failure recovery into the system should be part of the architecture and design phases from the beginning to avoid the risk of failure. Dependencies are required for the application to fully operate.
Availability Zones | Availability Zones can be used to spread a solution across multiple zones within a region, allowing for an application to continue functioning when one zone fails.
Availability of services | Availability of services across Azure regions depends on a region's type. Azure's general policy on deploying services into any given region is primarily driven by region type, service categories, and customer demand.
Availability zone terminology | To better understand regions and availability zones in Azure, it helps to understand key terms or concepts.
Best practices | During the architectural phase, focus on implementing practices that meet your business requirements, identify failure points, and minimize the scope of failures.
Testing for reliability | Regular testing should be performed as part of each major change to validate existing thresholds, targets, and assumptions.
Monitoring for reliability | Get an overall picture of application health. If something fails, you need to know that it failed, when it failed, and why.
Reliability patterns | Applications must be designed and implemented to maximize availability.
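
As an illustration of the tactical mitigation mentioned in this overview (retrying a remote call after a transient failure), here is a minimal sketch of a retry helper with capped exponential backoff and jitter. It is not taken from any Azure SDK: the `TransientError` class, the `call_with_retry` function, and the default attempt count and delays are assumptions chosen for the example, and real services may document their own recommended retry settings.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation(), retrying on TransientError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A caller would wrap only operations that are safe to repeat, for example `call_with_retry(lambda: fetch_order(order_id))`, where `fetch_order` stands in for whatever remote call the application makes. Errors that are not transient are deliberately left to propagate immediately.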

Next step

Principles

Reliability design principles

12/16/2022 • 2 minutes to read

Building a reliable application in the cloud is different from traditional application development. Historically, you
may have purchased levels of redundant higher-end hardware to minimize the chance of an entire application
platform failing.

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component.

To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the
Microsoft Azure Well-Architected Review.

The following design principles provide:

Context for questions
Why a certain aspect is important
How an aspect is applicable to Reliability

These critical design principles are used as lenses to assess the Reliability of an application deployed on Azure.
These lenses provide a framework for the application assessment questions.

Design for business requirements

Reliability is a subjective concept. For an application to be appropriately reliable, it must reflect the business
requirements surrounding it.

For example, a mission-critical application with a 99.999% service level agreement (SLA) requires a higher level
of reliability than another application with an SLA of 95%.

Cost implications are inevitable when introducing greater reliability and high availability. This trade-off should
be carefully considered.

Design for failure

Failure is impossible to avoid in a highly distributed and multi-tenant environment like Azure.

By anticipating failures, from individual components to entire Azure regions, you can develop a solution in a
resilient way to increase reliability.

Observe application health

Before mitigating issues that impact application reliability, you must first detect these issues.

By monitoring the operation of an application relative to a healthy state, you can detect and predict reliability
issues.

Monitoring allows you to take swift and remedial action.

Drive automation

One of the leading causes of application downtime is human error due to the deployment of insufficiently tested
software or through misconfiguration.

To minimize the possibility and consequence of human errors, it's vital to strive for automation in all aspects of a
cloud solution.

Automation improves:

Reliability
Automated testing
Deployment
Management

Design for self-healing

Self-healing describes the ability of a system to deal with failures automatically. Handling failures happens
through pre-defined remediation protocols. These protocols connect to failure modes within the solution.

It's an advanced concept that requires a high level of system maturity with monitoring and automation.

From inception, self-healing should be an aspiration to maximize reliability.

Design for scale-out

Scale-out is a concept that focuses on the ability of a system to respond to demand through horizontal growth.

As traffic grows, resource units are added in parallel, instead of increasing the size of the existing resources.

Through scale units, a system can handle expected and unexpected traffic increases, essential to overall
reliability.

Scale units further reduce the effects of a single resource failure.

Next step

Design

Design for reliability

12/16/2022 • 2 minutes to read

Reliable applications should maintain a pre-defined percentage of uptime (availability). They should also balance
between high resiliency, low latency, and cost (High Availability). Just as important, applications should be able
to recover from failures (resiliency).

Checklist

How have you designed your applications with reliability in mind?

Define availability and recovery targets to meet business requirements.
Build resiliency and availability into your apps by gathering requirements.
Ensure that application and data platforms meet your reliability requirements.
Configure connection paths to promote availability.
Use Availability Zones where applicable to improve reliability and optimize costs.
Ensure that your application architecture is resilient to failures.
Know what happens if the requirements of Service Level Agreements are not met.
Identify possible failure points in the system to build resiliency.
Ensure that applications can operate in the absence of their dependencies.

Azure services

Azure Front Door
Azure Traffic Manager
Azure Load Balancer
Azure Virtual Network NAT
Service Fabric
Kubernetes Service (AKS)
Azure Site Recovery

Reference architecture

Deploy highly available network virtual appliances
Failure Mode Analysis for Azure applications
Minimize coordination

Next step

Target & non-functional requirements

Related links

Use platform as a service (PaaS) options
Design to scale out
Workload availability targets
Building solutions for high availability using Availability Zones
Make all things redundant

Target and non-functional requirements

12/16/2022 • 12 minutes to read

Key points

Availability targets

Considerations for availability targets

Target and non-functional requirements such as availability targets and recovery targets allow you to measure
the uptime and downtime of your workloads. Having clearly defined targets is crucial in order to have a goal to
work and measure against. In addition to these targets, there are many other requirements you should consider
to improve reliability and meet business expectations.

Building resiliency (recovering from failures) and availability (running in a healthy state without significant
downtime) into your apps begins with gathering requirements. For example, how much downtime is
acceptable? How much does potential downtime cost your business? What are your customers' availability
requirements? How much do you invest in making your application highly available? What is the risk versus the
cost?

Determine the acceptable level of uptime for your workloads.

Determine how long workloads can be unavailable and how much data is acceptable to lose during a

disaster.

Consider application and data platform requirements to improve resiliency and availability.

Ensure connection availability and improve reliability with Azure services.

Assess overall application health of workloads.

A Service Level Agreement (SLA) is an availability target that represents a commitment around performance

and availability of the application. Understanding the SLA of individual components within the system is

essential in order to define reliability targets. Knowing the SLA of dependencies will also provide a justification

for additional spend when making the dependencies highly available and with proper support contracts.

Availability targets for any dependencies leveraged by the application should be understood and should ideally align with application targets.

Understanding your availability expectations is vital to reviewing overall operations for the application. For

example, if you are striving to achieve an application Service Level Objective (SLO) of 99.999%, the level of

inherent operational action required by the application is going to be far greater than if an SLO of 99.9% was the

goal.
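
To make the gap between these objectives concrete, an SLO can be converted into a downtime budget. The following is a minimal sketch rather than part of the framework; the function name `allowed_downtime_minutes` and the 30-day measurement period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, implied by an SLO over a period."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The budget is the fraction of the period that may be spent unavailable.
    return period_minutes * (1.0 - slo_percent / 100.0)


for slo in (99.9, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes per 30 days")
```

Under these assumptions, 99.9% allows roughly 43 minutes of downtime in 30 days, while 99.999% allows less than half a minute, which is why the operational effort differs so much between the two.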

Monitoring and measuring application availability is vital to qualifying overall application health and progress

towards defined targets. Make sure you measure and monitor key targets such as:

Mean Time Between Failures (MTBF) — The average time between failures of a particular component.

Mean Time To Recover (MTTR) — The average time it takes to restore a component after a failure.
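
As an illustration of how these two measures relate to availability, the sketch below estimates both from a list of outage durations and derives availability as MTBF / (MTBF + MTTR). The `summarize` function, the `ReliabilityStats` type, and the sample numbers are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ReliabilityStats:
    mtbf_hours: float
    mttr_hours: float
    availability: float


def summarize(observed_hours: float, outage_hours: Sequence[float]) -> ReliabilityStats:
    """Estimate MTBF, MTTR, and availability from outages in an observation window."""
    failures = len(outage_hours)
    if failures == 0:
        raise ValueError("need at least one outage to estimate MTBF and MTTR")
    downtime = sum(outage_hours)
    uptime = observed_hours - downtime
    mtbf = uptime / failures      # average operating time between failures
    mttr = downtime / failures    # average time to restore after a failure
    return ReliabilityStats(mtbf, mttr, mtbf / (mtbf + mttr))


# Hypothetical component: two outages (0.5 h and 1.5 h) in a 720-hour window.
stats = summarize(720.0, [0.5, 1.5])
print(stats)
```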

Are SLAs/SLOs/SLIs for all leveraged dependencies understood?

Availability targets for any dependencies leveraged by the application should be understood and ideally align

with application targets. Make sure SLAs/SLOs/SLIs for all leveraged dependencies are understood.

Has a composite SLA been calculated for the application and/or key scenarios using Azure SLAs?

NOTE

Recovery targets

Meet application platform requirements

Multiple and paired regions

A composite SLA captures the end-to-end SLA across all application components and dependencies. It is

calculated using the individual SLAs of Azure services housing application components and provides an

important indicator of designed availability in relation to customer expectations and targets. Make sure the

composite SLA of all components and dependencies on the critical paths are understood. To learn more, see

Composite SLAs.
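
A composite SLA can be approximated with simple arithmetic. The sketch below assumes that component failures are independent: components that must all be available multiply together, while redundant components fail only when every one of them fails. The function names and the SLA figures are illustrative assumptions, not published Azure SLAs.

```python
from math import prod
from typing import Iterable


def serial_sla(slas: Iterable[float]) -> float:
    """Composite SLA when every component on the path must be available."""
    return prod(slas)


def redundant_sla(slas: Iterable[float]) -> float:
    """Composite SLA when any one of several independent alternatives suffices."""
    return 1.0 - prod(1.0 - s for s in slas)


# Hypothetical critical path: a web tier (99.95%) calling a database (99.99%).
single_region = serial_sla([0.9995, 0.9999])
# The same stack deployed to two independent regions behind a global load balancer.
two_regions = redundant_sla([single_region, single_region])
print(f"single region: {single_region:.6f}, two regions: {two_regions:.8f}")
```

The serial result is lower than either individual SLA, which is why every dependency on the critical path needs to be included in the calculation.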

If you have contractual commitments to an SLA for your Azure solution, additional allowances on top of the Azure

composite SLA must be made to accommodate outages caused by code-level issues and deployments. This is often

overlooked and customers directly put the composite SLA forward to their customers.

Are availability targets considered while the system is running in disaster recovery mode?

Availability targets might or might not be applied when running in disaster recovery mode. This varies from

application to application. If targets must also apply in a failure state then an N+1 model should be used to

achieve greater availability and resiliency. In this scenario, N is the capacity needed to deliver required

availability. There's also a cost implication, because more resilient infrastructure is usually more expensive. This

has to be accepted by the business.

What are the consequences if availability targets are not satisfied?

Are there any penalties, such as financial charges, associated with failing to meet SLA commitments? Additional

measures can be used to prevent penalties, but that also brings additional cost to operate the infrastructure. This

has to be factored in and evaluated. The consequences of not satisfying availability targets should be fully

understood. This will also inform when to initiate a failover.

Recovery targets identify how long the workload can be unavailable and how much data is acceptable to lose

during a disaster. Define target reports for the application and key scenarios. Target reports needed are

Recovery Time Objective (RTO) — the maximum acceptable time an application is unavailable after an incident,

and Recovery Point Objective (RPO) — the maximum duration of data loss that is acceptable during a disaster.

Recovery targets are nonfunctional requirements of a system and should be dictated by business requirements.

Recovery targets should be defined in accordance with the required RTO and RPO targets for the workloads.
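
One way to reason about these targets is to compare them with what the current setup can deliver. The sketch below assumes periodic backups, so the worst-case data loss is roughly the interval between backups, and it assumes a measured or estimated restore time. The function name and the sample values are hypothetical.

```python
from datetime import timedelta
from typing import Dict


def check_recovery_targets(
    backup_interval: timedelta,
    estimated_restore_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> Dict[str, bool]:
    """Compare what the setup can deliver against the RPO and RTO."""
    return {
        # Data written since the last backup is lost, so the interval bounds the loss.
        "rpo_met": backup_interval <= rpo,
        # The workload is unavailable at least until the restore completes.
        "rto_met": estimated_restore_time <= rto,
    }


result = check_recovery_targets(
    backup_interval=timedelta(hours=4),
    estimated_restore_time=timedelta(minutes=45),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(result)
```

In this example the restore time fits within the RTO, but a four-hour backup interval cannot satisfy a one-hour RPO, so either the backup frequency or the target would need to change.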

Azure application platform services offer resiliency features to support application reliability, though they may

only be applicable at a certain SKU and configuration/deployment. For example, an SLA is dependent on the

number of instances deployed or a certain feature enabled. It is recommended that you review the SLA for

services used. For example, Service Bus Premium SKU provides predictable latency and throughput to mitigate

noisy neighbor scenarios. It also provides the ability to automatically scale and replicate metadata to another

Service Bus instance for failover purposes.

To learn more, see Azure Service Bus Premium SKU.

An application platform should be deployed across multiple regions if the requirements dictate. Covering the

requirements using zones is cheaper and less complex. Regional isolation should be an extra measure if the

SLAs given by the single region cross-zone setup are insufficient or if required by a geographical spread of

users.

Availability Zones and sets

Considerations for availability

NOTE

Meet data platform requirements

Data consistency

The ability to respond to disaster scenarios for overall compute platform availability and application resiliency

depends on the use of multiple regions or other deployment locations.

Use paired regions that exist within the same geography and provide native replication features for recovery

purposes, such as Geo-Redundant Storage (GRS) asynchronous replication. During planned

maintenance, updates to paired regions are rolled out sequentially, one region at a time. To learn more, see Business continuity

with Azure Paired Regions.

Platform services that can leverage Availability Zones are deployed in either a zonal manner within a particular

zone, or in a zone-redundant configuration across multiple zones. To learn more, see Building solutions for high

availability using Availability Zones.

An Availability Set (AS) is a logical construct to inform Azure that it should distribute contained virtual machine

instances across multiple fault and update domains within an Azure region. Availability Zones (AZ) elevate the

fault level for virtual machines to a physical datacenter by allowing replica instances to be deployed across

multiple datacenters within an Azure region. While zones provide greater resiliency than sets, there are

performance and cost considerations where applications are extremely 'chatty' across zones given the implied

physical separation and inter-zone bandwidth charges. Ultimately, Azure Virtual Machines and Azure PaaS

services, such as Service Fabric and Azure Kubernetes Service (AKS) which use virtual machines underneath, can

leverage either AZs or an AS to provide application resiliency within a region. To learn more, see Business

continuity with data resiliency.

Is the application hosted across 2 or more application platform nodes?

To ensure application platform reliability, it is vital that the application be hosted across at least two nodes to

ensure there are no single points of failure. Ideally, an n+1 model should be applied for compute availability

where n is the number of instances required to support application availability and performance requirements.

Higher SLAs provided for virtual machines and related platform services require at least two replica nodes

deployed to either an Availability Set or across two or more Availability Zones. To learn more, see SLA for Virtual

Machines.
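
The n+1 model can be expressed as a small sizing calculation. This is a sketch under stated assumptions: the per-instance capacity is known from load testing, and the function name, parameters, and figures are hypothetical.

```python
import math


def instances_for_n_plus_1(peak_load: float, capacity_per_instance: float, spare: int = 1) -> int:
    """Return the instance count for an n+1 model, with a floor of two nodes."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # n is the number of instances needed to carry the peak load on their own.
    n = math.ceil(peak_load / capacity_per_instance)
    # Add the spare, and never go below two nodes to avoid a single point of failure.
    return max(n + spare, 2)


# Hypothetical workload: 900 requests/s at peak, 200 requests/s per instance.
print(instances_for_n_plus_1(900, 200))
```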

How is the client traffic routed to the application in the case of region, zone, or network outage?

In the event of a major outage, client traffic should be routable to application deployments which remain

available across other regions or zones. This is ultimately where cross-premises connectivity and global load

balancing should be used, depending on whether the application is internal and/or external facing. Services such

as Azure Front Door, Azure Traffic Manager, or third-party CDNs can route traffic across regions based on

application health solicited via health probes. To learn more, see Traffic Manager endpoint monitoring.

Data and storage services should be running in a highly available configuration/SKU. Azure data platform

services offer resiliency features to support application reliability, though they may only be applicable at a

certain SKU. Examples are Azure SQL Database Business Critical SKUs, or Azure Storage Zone Redundant

Storage (ZRS) with three synchronous replicas spread across availability zones.

Data types should be categorized by data consistency requirements. Data consistency requirements, such as strong or eventual consistency, should be understood for all data types and used to inform data grouping and categorization, as well as what data replication/synchronization strategies can be considered to meet application reliability targets.

Replication and Redundancy

NOTE

Networking and connectivity requirements

Connectivity

The CAP theorem states that a distributed data store cannot simultaneously provide more than two of the following three guarantees:

Consistency: Every read receives the most recent write or an error.

Availability: Every request receives a non-error response, without the guarantee that it contains the most recent write.

Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.

Determining which of these guarantees are most important in the context of application requirements is critical.

Replicating data across zones or paired regions supports application availability objectives to limit the impact of

failure scenarios. The ability to restore data from a backup is essential when recovering from data corruption

situations as well as failure scenarios. To ensure sufficient redundancy and availability for zonal and regional

failure scenarios, backups should be stored across zones and/or regions.

Define and test a data restore process to ensure a consistent application state. Regular testing of the data restore

process promotes operational excellence and confidence in the ability to recover data in alignment with defined

recovery objectives for the application.

Consider how your application traffic is routed to data sources in the case of region, zone, or network outage.

Understanding the method used to route application traffic to data sources in the event of a major failure event

is critical to identify whether failover processes will meet recovery objectives. Many Azure data platform

services offer native reliability capabilities to handle major failures, such as Azure Cosmos DB Automatic

Failover or Azure SQL Database Active Geo-Replication.

Some capabilities such as Azure Storage RA-GRS and Azure SQL DB Active Geo-Replication require application-side

failover to alternate endpoints in some failure scenarios, so application logic should be developed to handle these

scenarios.
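
Application-side failover of this kind can be sketched as a read that tries an ordered list of endpoints. This is a generic illustration and not the API of any Azure SDK: the `read_with_failover` function, the `fetch` callable, and the endpoint names are hypothetical, and a real client would also need retry limits, timeouts, and handling for stale reads from a secondary.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def read_with_failover(endpoints: Sequence[str], fetch: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next endpoint.
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error


def demo_fetch(endpoint: str) -> str:
    """Stand-in for a real data access call; the primary is simulated as down."""
    if endpoint.startswith("primary"):
        raise ConnectionError(f"{endpoint} unreachable")
    return f"data from {endpoint}"


print(read_with_failover(["primary.example", "secondary.example"], demo_fetch))
```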

Consider these guidelines to ensure connection availability and improve reliability with Azure services.

Use a global load balancer to distribute traffic and/or fail over across regions. Azure Front Door, Azure Traffic Manager, or third-party CDN services can be used to direct inbound requests to

external-facing application endpoints deployed across multiple regions. It is important to note that Traffic

Manager is a DNS-based load balancer, so failover must wait for DNS propagation to occur. A sufficiently

low TTL (Time To Live) value should be used for DNS records, though not all ISPs may honor this. For

application scenarios requiring transparent failover, Azure Front Door should be used. To learn more, see

Disaster Recovery using Azure Traffic Manager and Azure Front Door routing architecture.

For cross-premises connectivity (ExpressRoute or VPN), ensure there are redundant connections from different locations. At least two redundant connections should be established

across two or more Azure regions and peering locations to ensure there are no single points of failure. An

active/active load-shared configuration provides path diversity and promotes availability of network

connection paths. To learn more, see Cross-network connectivity.

  
Zone-aware services
Next step
Related links
Simulate a failure path to ensure connectivity is available over alternative paths. Failover from one
connection path to other connection paths should be tested to validate connectivity and operational
effectiveness. Using Site-to-Site VPN connectivity as a backup path for ExpressRoute provides an
additional layer of network resiliency for cross-premises connectivity. To learn more, see Using site-to-site
VPN as a backup for ExpressRoute private peering.
Eliminate all single points of failure from the data path (on-premises and Azure). Single-instance
Network Virtual Appliances (NVAs), whether deployed in Azure or within an on-premises
datacenter, introduce significant connectivity risk. To learn more, see Deploy highly available network
virtual appliances.
Use ExpressRoute/VPN zone-redundant Virtual Network Gateways. Zone-redundant virtual
network gateways distribute gateway instances across Availability Zones to improve reliability and
ensure availability during failure scenarios impacting a datacenter within a region. To learn more, see
Zone-redundant Virtual Network Gateways.
If used, deploy Azure Application Gateway v2 in a zone-redundant configuration.
Azure Application Gateway v2 can be deployed in a zone-redundant configuration to deploy gateway
instances across zones for improved reliability and availability during failure scenarios impacting a
datacenter within a region. To learn more, see Zone-redundant Application Gateway v2.
Use Azure Load Balancer Standard to load-balance traffic across Availability Zones. Azure Load Balancer
Standard is zone-aware to distribute traffic across Availability Zones. It can also be
configured in a zone-redundant configuration to improve reliability and ensure availability during failure
scenarios impacting a datacenter within a region. To learn more, see Standard Load Balancer and
Availability Zones.
Configure health probes for Azure Load Balancer(s)/Azure Application Gateways. Health
probes allow Azure Load Balancers to assess the health of backend endpoints to prevent traffic from
being sent to unhealthy instances. To learn more, see Load Balancer health probes.
Assess critical application dependencies with health probes. Custom health probes should be
used to assess overall application health including downstream components and dependent services,
such as APIs and datastores, so that traffic is not sent to backend instances that cannot successfully
process requests due to dependency failures. To learn more, see Health Endpoint Monitoring Pattern.
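A custom health probe of this kind can be sketched as a function that runs one check per dependency and maps the combined result to an HTTP status code for the load balancer to act on. The `evaluate_health` function and the check names below are hypothetical; a real endpoint would expose the result over HTTP and would usually apply timeouts to each check.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check and return an HTTP-style status with details."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception:
            # A check that raises is treated the same as a failed dependency.
            results[name] = "unhealthy"
    # 200 tells the probe to keep sending traffic; 503 takes the instance out of rotation.
    status = 200 if all(state == "healthy" for state in results.values()) else 503
    return status, results


def failing_check() -> bool:
    raise TimeoutError("datastore did not respond")


print(evaluate_health({"api": lambda: True, "datastore": failing_check}))
```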
Application design
To understand business metrics to design resilient Azure applications, see Workload availability targets.
For information on Availability Zones, see Building solutions for high availability using Availability Zones.
For information on health probes, see Load Balancer health probes and Health Endpoint Monitoring Pattern.
To learn about connectivity risk, see Deploy highly available network virtual appliances.
Go back to the main article: Design

Design reliable Azure applications

12/16/2022 • 4 minutes to read

Key Points

Use Availability Zones within a region

NOTE

Respond to failure

Building a reliable application in the cloud is different from traditional application development. While

historically you may have purchased levels of redundant higher-end hardware to minimize the chance of an

entire application platform failing, in the cloud, we acknowledge up front that failures will happen. Instead of

trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Failures

you can expect here are inherent to highly distributed systems, not a feature of Azure.

Use Availability Zones where applicable to improve reliability and optimize costs.

Design applications to operate when impacted by failures.

Use the native resiliency capabilities of PaaS to support overall application reliability.

Design to scale out.

Validate that required capacity is within Azure service scale limits and quotas.

If your requirements demand an even greater failure isolation than Availability Zones alone can offer, consider

deploying to multiple regions. Multiple regions should be used for failover purposes in a disaster state.

Take the additional cost into consideration, for example the cost of data, networking, and services such as Azure Site Recovery.

Design your application architecture to use Availability Zones within a region. Availability Zones can be used to optimize application availability within a region by providing datacenter level fault tolerance. However, the application architecture must not share dependencies between zones to use them effectively.

Availability Zones may introduce performance and cost considerations for applications which are extremely "chatty" across zones, given the implied physical separation between each zone and inter-zone bandwidth charges. Even so, Availability Zones can be considered as a way to get a higher SLA for lower cost.

Consider if component proximity is required for application performance reasons. If all or part of the application

is highly sensitive to latency, it may mandate component co-locality which can limit the applicability of multi-

region and multi-zone strategies.

Avoiding failure is impossible in the public cloud, and as a result applications require resilience to respond to

outages and deliver reliability. The application should therefore be designed to operate even when impacted by

regional, zonal, service or component failures across critical application scenarios and functionality. Application

operations may experience reduced functionality or degraded performance during an outage.

Define an availability strategy to capture how the application remains available when in a failure state. It should apply across all application components and the application deployment stamp as a whole, for example through a multi-geo scale-unit deployment approach. There are cost implications as well: more resources need to be provisioned in advance to provide high availability. An active-active setup, while more expensive than a single deployment, can balance cost by lowering load on one stamp and reducing the total amount of resources needed.

Considerations for improving reliability

Next step

Related links

In addition to an availability strategy, define a Business Continuity Disaster Recovery (BCDR) strategy for the

application and/or its key scenarios. A disaster recovery strategy should capture how the application responds

to a disaster situation such as a regional outage or the loss of a critical platform service, using either a re-

deployment, warm-spare active-passive, or hot-spare active-active approach.

To drive cost down consider splitting application components and data into groups. For example:

Must protect

Nice to protect

Ephemeral/can be rebuilt/lost, instead of protecting all data with the same policy

Is the application designed to use managed services?

Azure-managed services provide native resiliency capabilities to support overall application reliability. Platform

as a service (PaaS) offerings should be used to leverage these capabilities. PaaS options are easier to configure

and administer. You don't need to provision VMs, set up VNets, manage patches and updates, and all of the other

overhead associated with running software on a VM. To learn more, see Use managed services.

Has the application been designed to scale out?

Azure provides elastic scalability and you should design to scale out. However, applications must leverage a

scale-unit approach to navigate service and subscription limits to ensure that individual components and the

application as a whole can scale horizontally. Don't forget about scale in, which is important to drive cost down.

For example, scale in and out for App Service is done via rules. Customers often write scale-out rules but never write scale-in rules, which leaves the App Service running more instances, and costing more, than necessary.
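
The pairing of scale-out and scale-in rules can be sketched as a single decision function. This does not reproduce the App Service autoscale rule syntax; the function name, the CPU thresholds, and the instance limits are assumptions chosen for illustration.

```python
def desired_instances(
    current: int,
    cpu_percent: float,
    minimum: int = 2,
    maximum: int = 10,
    scale_out_above: float = 70.0,
    scale_in_below: float = 30.0,
) -> int:
    """Return the next instance count from paired scale-out and scale-in rules."""
    if cpu_percent > scale_out_above:
        target = current + 1   # scale-out rule: add capacity under load
    elif cpu_percent < scale_in_below:
        target = current - 1   # scale-in rule: release capacity when idle
    else:
        target = current       # inside the band: no change
    # Clamp to the configured limits so the rules never remove all redundancy.
    return max(minimum, min(maximum, target))


print(desired_instances(current=3, cpu_percent=85.0))
print(desired_instances(current=3, cpu_percent=10.0))
```

Keeping a gap between the two thresholds avoids repeatedly adding and removing instances when the load sits near a single cutoff.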

Is the application deployed across multiple Azure subscriptions?

Understanding the subscription landscape of the application and how components are organized within or

across subscriptions is important when analyzing if relevant subscription limits or quotas can be navigated.

Review Azure subscription and service limits to validate that required capacity is within Azure service scale

limits and quotas. To learn more, see Azure subscription and service limits.

Resiliency and dependencies

For information on minimizing dependencies, see Minimize coordination.

For more information on fault-points and fault-modes, see Failure Mode Analysis for Azure applications.

For information on managed services, see Use platform as a service (PaaS) options.

Go back to the main article: Design

Resiliency and dependencies

12/16/2022 • 4 minutes to read

Key points

Build resiliency with failure mode analysis

NOTE

Understand the impact of dependencies

Building failure recovery into the system should be part of the architecture and design phases from the

beginning to avoid the risk of failure. Dependencies are required for the application to fully operate.

Identify possible failure points in the system with failure mode analysis.

Eliminate all single points of failure.

Maintain a complete list of application dependencies.

Ensure that applications can operate in the absence of their dependencies.

Understand the SLA of individual components within the system to define reliability targets.

Failure mode analysis (FMA) is a process for building resiliency into a system, by identifying possible failure

points in the system. The FMA should be part of the architecture and design phases, so that you can build failure

recovery into the system from the beginning.

Identify all fault-points and fault-modes. Fault-points describe the elements within an application architecture

which are capable of failing, while fault-modes capture the various ways by which a fault-point may fail. To

ensure an application is resilient to end-to-end failures, it is essential that all fault-points and fault-modes are

understood and operationalized. To learn more, see Failure mode analysis for Azure applications.

Eliminate all single points of failure. A single point of failure describes a specific fault-point which, if it were to

fail, would bring down the entire application. Single points of failure introduce significant risk since any failure

of this component will cause an application outage. To learn more, see Make all things redundant.

Eliminate all singletons. A singleton describes a logical component within an application for which there can only be a single instance. It can apply to stateful architectural components or application code constructs. Ultimately, singletons introduce a significant risk by creating single points of failure within the application design.

Internal dependencies describe components within the application scope which are required for the application to fully operate. External dependencies capture required components outside the scope of the application, such as another application or third-party service. Dependencies may be categorized as either strong or weak based on whether or not the application is able to continue operating in a degraded fashion in their absence. To learn more, see Twelve-Factor App: Dependencies.

You should maintain a complete list of application dependencies. Examples of typical dependencies include

platform dependencies outside the remit of the application, such as Azure Active Directory, ExpressRoute, or a

central NVA (Network Virtual Appliance), as well as application dependencies such as APIs. For cost purposes,

it's important to understand the price for these services and how they are being charged. For more details, see

Cost models.

You can map application dependencies either as a simple list or a document. Usually this is part of a design document or reference architecture.

Next step

Related links

Understand the impact of an outage with each dependency. Strong dependencies play a critical role in

application function and availability. Their absence will have a significant impact, while the absence of weak

dependencies may only impact specific features and not affect overall availability. This reflects the cost that is

needed to maintain the High Availability relationship between the service and its dependencies. Classifying

dependencies as either strong or weak will help you identify which components are essential to the application.

Maintain SLAs and support agreements for critical dependencies. A Service Level Agreement (SLA)

represents a commitment around performance and availability of the application. Understanding the SLA of

individual components within the system is essential in order to define reliability targets. Knowing the SLA of

dependencies will also provide a justification for additional spend when making the dependencies highly

available and with proper support contracts. The operational commitments of all external and internal

dependencies should be understood to inform the broader application operations and health model.

The usage of platform level dependencies such as Azure Active Directory must also be understood to ensure

that their availability and recovery targets align with that of the application.

Ensure that applications can operate in the absence of their dependencies. If the application has

strong dependencies without which it cannot operate, then the availability and recovery targets of

these dependencies should align with that of the application itself. Make an effort to minimize dependencies to

achieve control over application reliability. To learn more, see Minimize dependencies.

Ensure that the lifecycle of the application is decoupled from its dependencies. If the application

lifecycle is closely coupled with that of its dependencies, it can limit the operational agility of the application. This

is true particularly where new releases are concerned.
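
The strong and weak classification described above can be captured in a small model that reports how an outage affects the application. This is a sketch only: the `Dependency` type, the `service_state` function, and the dependency names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Dependency:
    name: str
    strong: bool  # True if the application cannot operate without it


def service_state(dependencies: Iterable[Dependency], unavailable: Set[str]) -> str:
    """Classify the application state given the dependencies that are down."""
    down = [d for d in dependencies if d.name in unavailable]
    if any(d.strong for d in down):
        return "unavailable"   # a strong dependency is missing
    if down:
        return "degraded"      # only weak dependencies are missing
    return "healthy"


deps = [
    Dependency("identity-provider", strong=True),
    Dependency("orders-database", strong=True),
    Dependency("recommendations-api", strong=False),
]
print(service_state(deps, {"recommendations-api"}))
print(service_state(deps, {"orders-database"}))
```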

Best practices

For information on failure mode analysis, see Failure mode analysis for Azure applications.

For information on single points of failure, see Make all things redundant.

For information on fault-points and fault-modes, see Failure Mode Analysis for Azure applications.

For information on minimizing dependencies, see Minimize coordination.

Go back to the main article: Design

Best practices for designing reliability in Azure applications

12/16/2022 • 2 minutes to read

Build availability targets and recovery targets into your design

Ensure the application & data platforms meet your reliability requirements

Ensure connectivity

Use zone-aware services

Design resilience to respond to outages

Perform a failure mode analysis (FMA)

This article lists best practices for designing reliable Azure applications. These best practices

are derived from our experience with Azure reliability and the experiences of customers like yourself.

During the architectural phase, focus on implementing practices that meet your business requirements, identify

failure points, and minimize the scope of failures.

A Service Level Agreement (SLA) is an availability target that represents a commitment around performance

and availability of the application. Understanding the SLA of individual components within the system is

essential in order to define reliability targets. Recovery targets identify how long the workload can be

unavailable and how much data is acceptable to lose during a disaster. Define target reports for the application

and key scenarios. There may be penalties, such as financial charges, associated with failing to meet SLA

commitments. The consequences of not satisfying availability targets should be fully understood.

Designing application platform and data platform resiliency and availability are critical to ensuring overall

application reliability.

To ensure connection availability and improve reliability with Azure services:

Use a global load balancer to distribute traffic and/or fail over across regions.

For cross-premises connectivity (ExpressRoute or VPN), ensure there are redundant connections from different locations.

Simulate a failure path to ensure connectivity is available over alternative paths.

Eliminate all single points of failure from the data path (on-premises and Azure).

Zone-aware services can improve reliability and ensure availability during failure scenarios affecting a

datacenter within a region. They can also be used to deploy gateway instances across zones for improved

reliability and availability during failure scenarios affecting a datacenter within a region.

Applications should be designed to operate even when affected by regional, zonal, service or component

failures across critical application scenarios and functionality. Application operations may experience reduced

functionality or degraded performance during an outage.

Failure mode analysis (FMA) builds resiliency into an application early in the design stage. It helps you identify the types of failures your application might experience, the potential effects of each, and possible recovery strategies.

Have all single points of failure been eliminated? A single point of failure is a fault-point that, if it were to fail, would bring down the entire application. Single points of failure introduce significant risk because any failure of such a component causes an application outage.

Have all fault-points and fault-modes been identified? Fault-points are the elements within an application architecture that are capable of failing, while fault-modes capture the various ways in which a fault-point may fail. To ensure an application is resilient to end-to-end failures, it is essential that all fault-points and fault-modes are understood and operationalized.

Understand the impact of an outage with each dependency

Dependencies may be categorized as either strong or weak based on whether the application is able to continue operating in a degraded fashion in their absence. Strong dependencies play a critical role in application function and availability, and their absence will have a significant impact. The absence of weak dependencies may only impact specific features and not affect overall availability.

Design for scalability

A cloud application must be able to scale to accommodate changes in usage. Begin with discrete components, and design the application to respond automatically to load changes whenever possible. Keep scaling limits in mind during design so you can expand easily in the future.

Next step

Testing

Go back to the main article: Design

Testing for reliability

12/16/2022 • 2 minutes to read

Regular testing should be performed as part of each major change and, if possible, on a regular basis to validate existing thresholds, targets, and assumptions. Testing should also ensure the validity of the health model, capacity model, and operational procedures.

Checklist

Have you tested your applications with reliability in mind?

Test regularly to validate existing thresholds, targets, and assumptions.
Automate testing as much as possible.
Perform testing in both key test environments and the production environment.
Perform chaos testing by injecting faults.
Create and test a disaster recovery plan on a regular basis using key failure scenarios.
Design a disaster recovery strategy to run most applications with reduced functionality.
Design a backup strategy that is tailored to business requirements and circumstances of the application.
Test and validate the failover and failback approach successfully at least once.
Configure request timeouts to manage inter-component calls.
Implement retry logic to handle transient application failures and transient failures with internal or external dependencies.
Configure and test health probes for your load balancers and traffic managers.
Apply chaos principles continuously.
Create and organize a central chaos engineering team.

Azure services

Azure Site Recovery
Azure Pipelines
Azure Traffic Manager
Azure Load Balancer

Reference architecture

Failure Mode Analysis for Azure applications
High availability and disaster recovery scenarios for IaaS apps
Back up files and applications on Azure Stack Hub

Next step

Resiliency testing

Related links

For information on performance testing, see Performance testing.
For information on chaos engineering, see Chaos engineering.
For information on failure and disaster recovery, see Failure and disaster recovery for Azure applications.
For information on testing applications, see Testing your application and Azure environment.

Testing applications for availability and resiliency

12/16/2022 • 3 minutes to read

Applications should be tested to ensure availability and resiliency. Availability describes the amount of time when an application runs in a healthy state without significant downtime. Resiliency describes how quickly an application recovers from failure.

Being able to measure availability and resiliency can answer questions like: How much downtime is acceptable? How much does potential downtime cost your business? What are your availability requirements? How much do you invest in making your application highly available? What is the risk versus the cost? Testing plays a critical role in making sure your applications can meet these requirements.

Key points

Test regularly to validate existing thresholds, targets, and assumptions.
Automate testing as much as possible.
Perform testing in both key test environments and the production environment.
Verify how the end-to-end workload performs under intermittent failure conditions.
Test the application against critical non-functional requirements for performance.
Conduct load testing with expected peak volumes to test scalability and performance under load.
Perform chaos testing by injecting faults.

When to test

Regular testing should be performed as part of each major change and, if possible, on a regular basis to validate existing thresholds, targets, and assumptions. While the majority of testing should be performed within the testing and staging environments, it is often beneficial to also run a subset of tests against the production system. Plan a 1:1 parity of key test environments with the production environment.

NOTE

Automate testing where possible to ensure consistent test coverage and reproducibility. Automate common testing tasks and integrate them into your build processes. Manually testing software is tedious and susceptible to error, although manual explorative testing may also be conducted.

Testing for resiliency

To test resiliency, you should verify how the end-to-end workload performs under intermittent failure

conditions.

Run tests in production using both synthetic and real user data. Test and production are rarely identical, so it's

important to validate your application in production using a blue-green or canary deployment. This way, you're

testing the application under real conditions, so you can be sure that it will function as expected when fully

deployed.

As part of your test plan, include:

Chaos engineering
Automated pre-deployment testing
Fault injection testing
Peak load testing
Disaster recovery testing

Performance testing

The primary goal of performance testing is to validate benchmark behavior for the application. Performance testing is the superset of both load testing and stress testing.

Load testing validates application scalability by rapidly and/or gradually increasing the load on the application until it reaches a threshold or limit. Stress testing involves various activities to overload existing resources and remove components to understand overall resiliency and how the application responds to issues.

Simulation testing

Simulation testing involves creating small, real-life situations. Simulations demonstrate the effectiveness of the solutions in the recovery plan and highlight any issues that weren't adequately addressed.

As you perform simulation testing, follow best practices:

Conduct simulations in a manner that doesn't disrupt actual business but feels like a real situation.
Make sure that simulated scenarios are completely controllable. If the recovery plan seems to be failing, you can restore the situation back to normal without causing damage.
Inform management about when and how the simulation exercises will be conducted. Your plan should detail the time frame and the resources affected during the simulation.

Fault injection testing

For fault injection testing, check the resiliency of the system during failures, either by triggering actual failures or by simulating them. Here are some strategies to induce failures:

Shut down virtual machine (VM) instances.
Crash processes.
Expire certificates.
Change access keys.
Shut down the DNS service on domain controllers.
Limit available system resources, such as RAM or number of threads.
Unmount disks.
Redeploy a VM.

Your test plan should incorporate possible failure points identified during the design phase, in addition to common failure scenarios:

Test your application in an environment as close to production as possible.
Test failures in combination.
Measure the recovery times, and be sure that your business requirements are met.
Verify that failures don't cascade and are handled in an isolated way.

Test under peak loads

Load testing is crucial for identifying failures that only happen under load, such as the back-end database being

overwhelmed or service throttling. Test for peak load and anticipated increase in peak load, using production

data or synthetic data that is as close to production data as possible. Your goal is to see how the application

behaves under real-world conditions.
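To make the idea of load-only failures concrete, here is a minimal sketch in which a stand-in service rejects calls beyond a fixed concurrency limit, so failures appear only once the number of concurrent callers exceeds that limit. ThrottledService and run_load are hypothetical names used for illustration; a real load test would target a deployed endpoint with a dedicated load-testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock, Semaphore


class ThrottledService:
    """Stand-in dependency that rejects calls beyond a fixed concurrency limit."""

    def __init__(self, max_concurrent):
        self._slots = Semaphore(max_concurrent)

    def handle(self):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("throttled")
        try:
            time.sleep(0.01)  # Simulated work.
            return "ok"
        finally:
            self._slots.release()


def run_load(service, concurrency, requests):
    """Send `requests` calls using `concurrency` workers and return the failure count."""
    failures = 0
    lock = Lock()

    def one_call(_):
        nonlocal failures
        try:
            service.handle()
        except RuntimeError:
            with lock:
                failures += 1

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_call, range(requests)))
    return failures


light = run_load(ThrottledService(max_concurrent=4), concurrency=2, requests=40)
heavy = run_load(ThrottledService(max_concurrent=4), concurrency=32, requests=400)
print(f"failures under light load: {light}, under heavy load: {heavy}")
```

Under light load the concurrency never exceeds the limit, so no call is throttled. Under heavy load the count depends on thread scheduling, which is why such behavior is typically found only by testing at or above the expected peak.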

Next step

Backup and recovery

Related links

For more test types, see Test types.

To learn about load and stress tests, see Performance testing.

To learn about chaos testing, see Chaos engineering.

Go back to the main article: Testing

Backup and disaster recovery for Azure applications

12/16/2022 • 6 minutes to read

Disaster recovery is the process of restoring application functionality in the wake of a catastrophic loss.

In the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Testing is one way to minimize these effects. Automate testing of your applications where possible, but be prepared for when they fail. When a failure happens, having backup and recovery strategies becomes important.

Your tolerance for reduced functionality during a disaster is a business decision that varies from one application to the next. It might be acceptable for some applications to be unavailable or to be partially available with reduced functionality or delayed processing for a while. For other applications, any reduced functionality is unacceptable.

Key points

Create and test a disaster recovery plan regularly using key failure scenarios.
Design a disaster recovery strategy to run most applications with reduced functionality.
Design a backup strategy that is tailored to business requirements and circumstances of the application.
Automate failover and failback steps and processes.
Test and validate the failover and failback approach successfully at least once.

Disaster recovery plan

Start by creating a recovery plan. The plan is considered complete after it has been fully tested. Include the people, processes, and applications needed to restore functionality within the service-level agreement (SLA) you've defined for your customers.

Consider the following suggestions when creating and testing your disaster recovery plan:

Include the process for contacting support and for escalating issues. This information will help to avoid prolonged downtime as you work out the recovery process for the first time.
Evaluate the business impact of application failures.
Choose a cross-region recovery architecture for mission-critical applications.
Identify a specific owner of the disaster recovery plan, including automation and testing.
Document the process, especially any manual steps.
Automate the process as much as possible.
Establish a backup strategy for all reference and transactional data, and test backup restoration regularly.
Set up alerts for the stack of the Azure services consumed by your application.
Train operations staff to execute the plan.
Perform regular disaster simulations to validate and improve the plan.

If you're using Azure Site Recovery to replicate virtual machines (VMs), create a fully automated recovery plan to fail over the entire application.

Operational readiness testing

Perform an operational readiness test for failover to the secondary region and for failback to the primary region.

Many Azure services support manual failover or test failover for disaster recovery drills. Alternatively, you can simulate an outage by shutting down or removing Azure services.

Automated operational responses should be tested frequently as part of the normal application lifecycle to ensure operational effectiveness.

Failover and failback testing

Test failover and failback to verify that your application's dependent services come back up in a synchronized manner during disaster recovery. Changes to systems and operations may affect failover and failback functions, but the impact may not be detected until the main system fails or becomes overloaded. Test failover capabilities before they're required to compensate for a live problem. Also, be sure dependent services fail over and fail back in the correct order.

If you're using Azure Site Recovery to replicate VMs, run disaster recovery drills periodically by testing failovers to validate your replication strategy. A test failover doesn't affect the ongoing VM replication or your production environment. For more information, see Run a disaster recovery drill to Azure.

Dependent service outage

For each dependent service, you should understand the implications of service disruption and the way that the application will respond. Many services include features that support resiliency and availability, so evaluating each service independently is likely to improve your disaster recovery plan. For example, Azure Event Hubs supports failing over to the secondary namespace.

Network outage

When parts of the Azure network are inaccessible, you may not be able to access your application or data. In this situation, we recommend designing the disaster recovery strategy to run most applications with reduced functionality.

If reducing functionality isn't an option, the remaining options are application downtime or failover to an alternate region.

In a reduced functionality scenario:

If your application can't access its data because of an Azure network outage, you can run locally with reduced application functionality by using cached data.
You can store data in an alternate location until connectivity is restored.

Recovery automation

The steps required to recover or fail over the application to a secondary Azure region in failure situations should be codified, preferably in an automated manner, to ensure capabilities exist to respond effectively to an outage in a way that limits impact. Similar codified steps should also exist to capture the process required to fail back the application to the primary region once the issue that triggered the failover has been addressed.

When automating failover procedures, ensure that the tooling used for orchestrating the failover is also considered in the failover strategy. For example, if you run your failover from Jenkins running on a VM, you'll be in trouble if that virtual machine is part of the outage. Azure DevOps Projects are scoped to a region too.

Backup strategy

Many alternative strategies are available for implementing distributed compute across regions. These strategies

must be tailored to the specific business requirements and circumstances of the application. At a high level, the

approaches can be divided into the following categories:

Redeploy on disaster: In this approach, the application is redeployed from scratch at the time of disaster. Redeploying from scratch is appropriate for non-critical applications that don't require a guaranteed recovery time.
Warm Spare (Active/Passive): Create a secondary hosted service in an alternate region, and deploy roles to guarantee minimal capacity. However, the roles don't receive production traffic. This approach is useful for applications that have not been designed to distribute traffic across regions.
Hot Spare (Active/Active): The application is designed to receive production load in multiple regions. The cloud services in each region might be configured for higher capacity than required for disaster recovery purposes. Alternatively, the cloud services might scale out as necessary at the time of a disaster and failover. This approach requires a large investment in application design, but it has significant benefits. These include low and guaranteed recovery time, continuous testing of all recovery locations, and efficient usage of capacity.

Plan for regional failures

Azure is divided physically and logically into units called regions. A region consists of one or more datacenters in close proximity. Many regions and services also support availability zones, which can be used to provide more resiliency against outages in a single datacenter. Consider using regions with availability zones to improve the availability of your solution.

Under rare circumstances, it's possible that facilities in an entire availability zone or region can become inaccessible, for example, because of network failures. Or, facilities can be lost entirely, for example, because of a natural disaster. Azure has capabilities for creating applications that are distributed across zones and regions. Such distribution helps to minimize the possibility that a failure in one zone or region could affect other zones or regions.

Next step

Automatic retry of failed backup jobs

Related links

For information on testing failovers, see Run a disaster recovery drill to Azure.

For information on Event Hubs, see Azure Event Hubs.

Go back to the main article: Testing

Automatic retry of failed backup jobs

12/16/2022 • 5 minutes to read

Azure Backup comprehensively protects your data assets in Azure through a simple, secure, and cost-effective solution that requires zero infrastructure. Azure's built-in data protection offers a solution for a wide range of workloads and helps protect your mission-critical workloads running in the cloud. Azure's built-in data protection ensures your backups are always available and managed at scale across your entire backup estate.

As a backup user or administrator, you can monitor all backup solutions and configure alerts to notify you about important events.

Many failures or outages are transient. You can resolve them just by retrying the backup or restore job. However, waiting for an engineer to retry the job manually or assign the relevant permission wastes valuable time. Automation is the smarter way to retry failed jobs, and it ensures you continue to meet your target RPOs with one successful backup a day.

Retrieve relevant backup data through Azure Resource Graph (ARG) and combine the data with corrective PowerShell and CLI steps. This article guides you through retrying the backup for all failed jobs using ARG and PowerShell.

Prerequisites

You'll need an Azure Automation account. You can use an existing account or create a new account with one user-assigned managed identity, at minimum.

Assign permissions to managed identities

To assign permissions to managed identities, complete the following steps:

1. Sign in to Azure interactively using the Connect-AzAccount cmdlet and follow the instructions:

    # Sign in to your Azure subscription
    $sub = Get-AzSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Connect-AzAccount
    }

    # If you have multiple subscriptions, set the one to use
    # Select-AzSubscription -SubscriptionId <SUBSCRIPTIONID>

2. Provide an appropriate value for the following variables and then run the script:

    $resourceGroup = "resourceGroupName"

    # These values are used in this tutorial
    $automationAccount = "xAutomationAccount"
    $userAssignedManagedIdentity = "xUAMI"

3. Use the PowerShell cmdlet New-AzRoleAssignment to assign a role to the system-assigned managed identity:

    $role1 = "DevTest Labs User"

    # Principal ID of the Automation account's system-assigned managed identity
    $SAMI = (Get-AzAutomationAccount -ResourceGroupName $resourceGroup -Name $automationAccount).Identity.PrincipalId
    New-AzRoleAssignment `
        -ObjectId $SAMI `
        -ResourceGroupName $resourceGroup `
        -RoleDefinitionName $role1

4. You need the same role assignment for the user-assigned managed identity:

    # Principal ID of the user-assigned managed identity
    $UAMI = (Get-AzUserAssignedIdentity -ResourceGroupName $resourceGroup -Name $userAssignedManagedIdentity).PrincipalId
    New-AzRoleAssignment `
        -ObjectId $UAMI `
        -ResourceGroupName $resourceGroup `
        -RoleDefinitionName $role1

5. You'll need extra permissions for the system-assigned managed identity to run the cmdlets Get-AzUserAssignedIdentity and Get-AzAutomationAccount:

    $role2 = "Reader"
    New-AzRoleAssignment `
        -ObjectId $SAMI `
        -ResourceGroupName $resourceGroup `
        -RoleDefinitionName $role2

Add modules

For the scripts in this article to work, install the following modules by navigating to the individual module gallery after you've created the automation account:

Az.Accounts
Az.RecoveryServices
Az.ResourceGraph

Create a PowerShell runbook

To create a runbook that managed identities can run, complete the following steps:

1. Sign in to the Azure portal and navigate to your Automation account.

2. Under Process Automation, select Runbooks.

3. Select Create a runbook:

   a. Name the runbook miTesting.
   b. From the Runbook type dropdown menu, select PowerShell.
   c. Select Create.

4. In the runbook editor, paste the following code:

    # Authenticate by using the Automation connection named AzureRunAsConnection
    $connection = Get-AutomationConnection -Name AzureRunAsConnection
    $connectionResult = Connect-AzAccount `
        -ServicePrincipal `
        -Tenant $connection.TenantID `
        -ApplicationId $connection.ApplicationID `
        -CertificateThumbprint $connection.CertificateThumbprint
    "Login successful.."

    # Resource Graph query: failed Azure VM backup jobs started in the last 24 hours
    $query = "RecoveryServicesResources
    | where type in~ ('microsoft.recoveryservices/vaults/backupjobs')
    | extend vaultName = case(type =~ 'microsoft.dataprotection/backupVaults/backupJobs',properties.vaultName,type =~ 'Microsoft.RecoveryServices/vaults/backupJobs',split(split(id, '/Microsoft.RecoveryServices/vaults/')[1],'/')[0],'--')
    | extend friendlyName = case(type =~ 'microsoft.dataprotection/backupVaults/backupJobs',strcat(properties.dataSourceSetName , '/', properties.dataSourceName),type =~ 'Microsoft.RecoveryServices/vaults/backupJobs', properties.entityFriendlyName, '--')
    | extend dataSourceType = case(type =~ 'Microsoft.RecoveryServices/vaults/backupJobs',properties.backupManagementType,type =~ 'microsoft.dataprotection/backupVaults/backupJobs',properties.dataSourceType,'--')
    | extend protectedItemName = split(split(properties.backupInstanceId, 'protectedItems')[1],'/')[1]
    | extend vaultId = tostring(split(id, '/backupJobs')[0])
    | extend vaultSub = tostring( split(id, '/')[2])
    | extend jobStatus = case (properties.status == 'Completed' or properties.status == 'CompletedWithWarnings','Succeeded',properties.status == 'Failed','Failed',properties.status == 'InProgress', 'Started', properties.status), operation = case(type =~ 'microsoft.dataprotection/backupVaults/backupJobs' and tolower(properties.operationCategory) =~ 'backup' and properties.isUserTriggered == 'true',strcat('adhoc',properties.operationCategory),type =~ 'microsoft.dataprotection/backupVaults/backupJobs', tolower(properties.operationCategory), type =~ 'Microsoft.RecoveryServices/vaults/backupJobs' and tolower(properties.operation) =~ 'backup' and properties.isUserTriggered == 'true',strcat('adhoc',properties.operation),type =~ 'Microsoft.RecoveryServices/vaults/backupJobs',tolower(properties.operation), '--'),startTime = todatetime(properties.startTime),endTime = properties.endTime, duration = properties.duration
    | where startTime >= ago(24h)
    | where (dataSourceType in~ ('AzureIaasVM'))
    | where jobStatus=='Failed'
    | where operation == 'backup' or operation == 'adhocBackup'
    | project vaultSub, vaultId, protectedItemName, startTime, endTime, jobStatus, operation
    | sort by vaultSub"

    $subscriptions = Get-AzSubscription | foreach {$_.SubscriptionId}
    # -First 5 limits each run to the first five matching jobs
    $result = Search-AzGraph -Subscription $subscriptions -Query $query -First 5
    $result = $result.data
    $prevsub = ""
    foreach($jobresponse in $result)
    {
        # Switch context only when the vault's subscription changes
        if($jobresponse.vaultSub -ne $prevsub)
        {
            Set-AzContext -SubscriptionId $jobresponse.vaultSub
            $prevsub = $jobresponse.vaultSub
        }

        $item = Get-AzRecoveryServicesBackupItem -VaultId $jobresponse.vaultId -BackupManagementType AzureVM -WorkloadType AzureVM -Name $jobresponse.protectedItemName

        # Retry the backup and keep the recovery point for 10 days
        Backup-AzRecoveryServicesBackupItem -ExpiryDateTimeUTC (get-date).AddDays(10) -Item $item -VaultId $jobresponse.vaultId
    }

5. Select Save and then Test pane.

You've now successfully created a PowerShell runbook.

Create a new schedule with PowerShell

To create a new schedule with PowerShell, you must:

Use the New-AzAutomationSchedule cmdlet to create schedules.
Specify the start time for the schedule and the frequency it should run.

The following code example shows how to create a recurring schedule that runs every day at 1:00 PM for one year:

    $StartTime = Get-Date "13:00:00"
    $EndTime = $StartTime.AddYears(1)
    New-AzAutomationSchedule -AutomationAccountName "MyAutomationAccount" -Name "Schedule02" `
        -StartTime $StartTime -ExpiryTime $EndTime -DayInterval 1 -ResourceGroupName "ResourceGroup01"

The first command creates a date object using the Get-Date cmdlet and then stores the object in the $StartTime variable. Specify a time that is at least five minutes in the future.

The second command adds one year to the start time and stores the resulting date object in the $EndTime variable.

The final command creates a daily schedule named Schedule02 that begins at the time stored in $StartTime and expires at the time stored in $EndTime.

Link a schedule to a runbook

Consider the following concepts when you link a schedule to a runbook:

You can link a runbook to multiple schedules, and a schedule can have multiple runbooks linked to it.
If a runbook has parameters, you can provide values for them.
Provide values for any mandatory parameters and any optional parameters. These values are used each time the runbook is started by this schedule.
You can attach the same runbook to another schedule and specify different parameter values.

Link a schedule to a runbook with PowerShell

To link a schedule to a runbook with PowerShell, you must:

Use the Register-AzAutomationScheduledRunbook cmdlet to link a schedule.
You can specify parameter values for the runbook with the Parameters parameter.
For more information about how to specify parameter values, see Starting a Runbook in Azure Automation.

The following code example shows how to link a schedule to a runbook by using an Azure Resource Manager cmdlet with parameters:

    $automationAccountName = "MyAutomationAccount"
    $runbookName = "Test-Runbook"
    $scheduleName = "Sample-DailySchedule"
    $params = @{"FirstName"="Joe";"LastName"="Smith";"RepeatCount"=2;"Show"=$true}
    Register-AzAutomationScheduledRunbook -AutomationAccountName $automationAccountName `
        -Name $runbookName -ScheduleName $scheduleName -Parameters $params `
        -ResourceGroupName "ResourceGroup01"

Next steps

Error handling

Error handling for resilient applications in Azure

12/16/2022 • 3 minutes to read

Key points

Transient fault handling

Request timeouts

Cascading Failures

Ensuring your application can recover from errors is critical when working in a distributed system. You test your

applications to prevent errors and failure, but you need to be prepared for when applications encounter issues

or fail. Understanding how to handle errors and prevent potential failure becomes important, as testing doesn't

always catch everything.

Many things in a distributed system are outside your span of control and your means to test. This can be the

underlying cloud infrastructure, third party runtime dependencies, etc. You can be sure something will fail

eventually, so you need to prepare for that.

Uncover issues or failures in your application's retry logic.

Configure request timeouts to manage inter-component calls.

Implement retry logic to handle transient application failures and transient failures with internal or external

dependencies.

Configure and test health probes for your load balancers and traffic managers.

Segregate read operations from update operations across application data stores.

Track the number of transient exceptions and retries over time to uncover issues or failures in your application's

retry logic. A trend of increasing exceptions over time may indicate that the service is having an issue and may

fail. To learn more, reference Retry service specific guidance.
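A minimal sketch of how transient exceptions and retries could be counted while retrying with exponential backoff. The names TransientError, RetryStats, call_with_retry, and make_flaky_operation are hypothetical and not part of any Azure SDK. In a real service, the counters would typically be emitted to your monitoring system so that trends become visible over time.

```python
import random
import time
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


@dataclass
class RetryStats:
    """In-memory counters; a real service would emit these as metrics."""
    attempts: int = 0
    transient_failures: int = 0


def call_with_retry(operation, stats, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        stats.attempts += 1
        try:
            return operation()
        except TransientError:
            stats.transient_failures += 1
            if attempt == max_attempts:
                raise  # Retry budget exhausted: surface the failure to the caller.
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


def make_flaky_operation(failures_before_success):
    """Build a stand-in dependency that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("simulated transient fault")
        return "ok"

    return operation


stats = RetryStats()
result = call_with_retry(make_flaky_operation(2), stats, sleep=lambda _: None)
print(result, stats)
```

Only errors classified as transient are retried, and the attempt limit keeps a persistent failure from being retried indefinitely. Passing the sleep function as a parameter lets tests run without real delays.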

Use the Retry pattern, paying particular attention to issues and considerations. Avoid overwhelming dependent

services by implementing the Circuit Breaker pattern. Review and incorporate additional best practices guidance

for Transient fault handling. When calling systems that implement the Throttling pattern, ensure that your

retries are not counterproductive.

A reference implementation is available here. It uses Polly and IHttpClientBuilder to implement the Circuit

Breaker pattern.
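The reference implementation uses Polly in .NET. As a language-neutral illustration of the same idea, the following Python sketch shows a very small circuit breaker. It is a simplified assumption of how such a breaker can work, not a port of Polly: the class and parameter names (CircuitBreaker, failure_threshold, reset_timeout) are hypothetical, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Closed until repeated failures open it; allows one trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed.

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.opened_at is not None or self.failure_count >= self.failure_threshold:
                self.opened_at = self.clock()  # Open, or re-open after a failed trial.
            raise
        # Success closes the circuit and clears the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result


def failing_dependency():
    raise ConnectionError("simulated outage")


fake_now = [0.0]  # Controllable clock so the example runs instantly.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: fake_now[0])

for _ in range(2):
    try:
        breaker.call(failing_dependency)
    except ConnectionError:
        pass

try:
    breaker.call(failing_dependency)
    outcome = "called"
except CircuitOpenError:
    outcome = "rejected"
print(outcome)

fake_now[0] = 11.0
recovered = breaker.call(lambda: "healthy")
print(recovered)
```

While the circuit is open, callers fail immediately instead of adding load to a dependency that is already struggling, which is the behavior that protects dependent services from being overwhelmed by retries.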

When making a service call or a database call, ensure that appropriate request timeouts are set. Database

Connection timeouts are typically set to 30 seconds. For guidance on how to troubleshoot, diagnose, and

prevent SQL connection errors, see transient errors for SQL Database.

Leverage design patterns that encapsulate robust timeout strategies like Choreography pattern or

Compensating Transaction pattern.

A reference implementation is available on GitHub.
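A minimal sketch of bounding an inter-component call with a timeout. It assumes the call is a plain Python callable. Most HTTP and database clients expose their own timeout settings, which are preferable when available. The names call_with_timeout, slow_dependency, and fast_dependency are hypothetical and used only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(operation, timeout_seconds):
    """Run operation() in a worker thread and stop waiting after timeout_seconds.

    The worker thread is not cancelled; this only bounds how long the caller waits.
    """
    future = _executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(f"no response within {timeout_seconds} seconds") from None


def slow_dependency():
    time.sleep(0.5)  # Simulates a dependency that responds too slowly.
    return "late response"


def fast_dependency():
    return "quick response"


quick = call_with_timeout(fast_dependency, timeout_seconds=1.0)
print(quick)

try:
    call_with_timeout(slow_dependency, timeout_seconds=0.1)
    timed_out = False
except TimeoutError:
    timed_out = True
print("timed out:", timed_out)
```

Failing fast with a clear timeout error lets the caller decide whether to retry, fall back, or report the problem, instead of holding a request open indefinitely.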

The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact on performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation that's likely to fail, rather than waiting for the operation to time out or never return.

A circuit breaker might be able to test the health of a service by sending a request to an endpoint exposed by the service. The service should return information indicating its status.

Retry pattern. Describes how an application can handle anticipated temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that has previously failed.

Samples related to this pattern are here.

Application Health Probes

Configure and test health probes for your load balancers and traffic managers. Ensure that your health endpoint checks the critical parts of the system and responds appropriately.

For Azure Front Door and Azure Traffic Manager, the health probe determines whether to fail over to another region. Your health endpoint should check any critical dependencies that are deployed within the same region.
For Azure Load Balancer, the health probe determines whether to remove a VM from rotation. The health endpoint should report the health of the VM. Don't include other tiers or external services. Otherwise, a failure that occurs outside the VM will cause the load balancer to remove the VM from rotation.

Samples related to health probes are here.

ARM template that deploys an Azure Load Balancer and health probes that detect the health of the sample service endpoint.
An ASP.NET Core Web API that shows configuration of health checks at startup.

Command and Query Responsibility Segregation (CQRS)

Achieve the levels of scale and performance needed for your solution by segregating read and write interfaces through the CQRS pattern.

Next step

Chaos engineering

Related links

For information on transient faults, see Troubleshoot transient connection errors.
For guidance on implementing health monitoring in your application, see Health Endpoint Monitoring pattern.

Go back to the main article: Testing

Chaos engineering

12/16/2022 • 5 minutes to read

Key points

Increase resiliency

When to apply chaos

Chaos engineering is a methodology that helps developers attain consistent reliability by hardening services

against failures in production. Another way to think about chaos engineering is that it's about embracing the

inherent chaos in complex systems and, through experimentation, growing confidence in your solution's ability

to handle it.

A common way to introduce chaos is to deliberately inject faults that cause system components to fail. The goal

is to observe, monitor, respond to, and improve your system's reliability under adverse circumstances. For

example, taking dependencies offline (stopping API apps, shutting down VMs, etc.), restricting access (enabling

firewall rules, changing connection strings, etc.), or forcing failover (database level, Front Door, etc.), is a good

way to validate that the application is able to handle faults gracefully.

It's difficult to simulate the characteristics of a service's behavior at scale outside a production environment. The

transient nature of cloud platforms can exacerbate this difficulty. Architecting your service to expect failure is a

core approach to creating a modern service. Chaos engineering embraces the uncertainty of the production

environment and strives to anticipate rare, unpredictable, and disruptive outcomes, so that you can minimize

any potential impact on your customers.

Increase service resiliency and ability to react to failures.

Apply chaos principles continuously.

Create and organize a central chaos engineering team.

Follow best practices for chaos testing.

Chaos engineering is aimed at increasing your service's resiliency and its ability to react to failures. By

conducting experiments in a controlled environment, you can identify issues that are likely to arise during

development and deployment. During this process, be vigilant in adopting the following guidelines:

Be proactive.

Embrace failure.

Break the system.

Identify and address single points of failure early.

Install guardrails and graceful mitigation.

Minimize the blast radius.

Build immunity.

Chaos engineering should be an integral part of development team culture and an ongoing practice, not a

short-term tactical effort in response to a single outage.

Development team members are partners in the process. They must be equipped with the resources to triage

issues, implement the testability that's required for fault injection, and drive the necessary product changes.

Ideally, you should apply chaos principles continuously. There's constant change in the environments in which software and hardware run, so monitoring the changes is key. By constantly applying stress or faults on components, you can help expose issues early, before small problems are compounded by a number of other factors.

Apply chaos engineering principles when you're:

Deploying new code.

Adding dependencies.

Observing changes in usage patterns.

Mitigating problems.

Process

Chaos engineering requires specialized expertise, technology, and practices. As with security and performance teams, the model of a central team supporting the service teams is a common, effective approach.

If you plan to practice the simulated handling of potentially catastrophic scenarios under controlled conditions, here's a simplified way to organize your teams:

ATTACKER | DEFENDER FOR CLOUD
Inject faults | Assess
Provide hints | Analyze
 | Mitigate

Goals

Familiarize team members with monitoring tools.

Recognize outage patterns.

Learn how to assess the impact.

Determine the root cause and mitigate accordingly.

Practice log analysis.

Overall method

1. Start with a hypothesis.

2. Measure baseline behavior.

3. Inject a fault or faults.

4. Monitor the resulting behavior.

5. Document the process and observations.

6. Identify and act on the result.
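
The six steps above can be sketched as a small experiment harness. This is only an illustration under stated assumptions, not a tool described in this article: `run_experiment`, `measure`, `inject_fault`, and `remove_fault` are hypothetical names, the 10% tolerance is an arbitrary choice, and the "service" is simulated with a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExperimentReport:
    hypothesis: str
    baseline: float
    observed: float
    hypothesis_held: bool
    notes: List[str] = field(default_factory=list)


def run_experiment(
    hypothesis: str,                       # 1. start with a hypothesis
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    tolerance: float = 0.10,
) -> ExperimentReport:
    """Run one fault-injection experiment and report whether the hypothesis held.

    The hypothesis holds when the metric under fault stays within `tolerance`
    (relative) of the baseline. Higher metric values are assumed to be better,
    for example a success rate.
    """
    baseline = measure()                   # 2. measure baseline behavior
    inject_fault()                         # 3. inject a fault
    try:
        observed = measure()               # 4. monitor the resulting behavior
    finally:
        remove_fault()                     # always restore the system
    held = observed >= baseline * (1 - tolerance)
    report = ExperimentReport(hypothesis, baseline, observed, held)
    report.notes.append(                   # 5. document the observations
        f"baseline={baseline:.3f}, under fault={observed:.3f}"
    )
    return report                          # 6. the caller acts on the result


# Simulated service: all values are made up for illustration.
state = {"dependency_up": True}
report = run_experiment(
    hypothesis="Success rate stays within 10% when the cache is offline",
    measure=lambda: 0.99 if state["dependency_up"] else 0.95,
    inject_fault=lambda: state.update(dependency_up=False),
    remove_fault=lambda: state.update(dependency_up=True),
)
print(report.hypothesis_held, report.notes[0])
```

Removing the fault in a `finally` block keeps the experiment from leaving the system degraded if the measurement itself fails, which is one simple way to limit the blast radius.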

Periodically validate your process, architecture choices, and code. By conducting fault-injection experiments, you can confirm that monitoring is in place and alerts are set up, the directly responsible individual (DRI) process is effective, and your documentation and investigation processes are up to date. Keep in mind a few key considerations:

Challenge system assumptions.

Validate change (topology, platform, resources).

Use service-level agreement (SLA) buffers.

Use live-site outages as opportunities.

Best practices

Shift left

Shift right

Blast radius

Error budget testing

Considerations for chaos testing

Shift-left testing means experiment early, experiment often. Incorporate fault-injection configurations and create

resiliency-validation gates during the development stages and in the deployment pipeline.

Shift-right testing means that you verify that the service is resilient where it counts in a pre-production or

production environment with actual customer load. Adopt a proactive approach as opposed to reacting to

failures. Be a part of determining and controlling requirements for the blast radius.

Stop the experiment when it goes beyond scope. Unknown results are an expected outcome of chaos

experiments. Strive to achieve balance between collecting substantial result data and affecting as few production

users as possible. For an example of this principle in practice, see the Bulkhead pattern article.

Establish an error budget as an investment in chaos and fault injection. Your error budget is the difference between achieving 100% of the service-level objective (SLO) and achieving the agreed-upon SLO.
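
As a worked example of that definition, the following sketch converts an SLO into minutes of allowed downtime. The function names, the 30-day window, and the 99.9% figure are assumptions chosen for illustration, not values from this article.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime allowed by an SLO over a window, in minutes."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    budget_fraction = (100.0 - slo_percent) / 100.0   # 100% minus the agreed SLO
    return window_minutes * budget_fraction


def remaining_budget(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)          # 30-day window
print(f"budget: {budget:.1f} min")
print(f"left after a 12-minute incident: {remaining_budget(99.9, 12):.1f} min")
```

A team could spend part of the remaining budget on fault-injection experiments and pause them when the remainder approaches zero.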

The following questions and answers discuss considerations about chaos engineering, based on its application

inside Azure.

Have you identified faults that are relevant to the development team?

Work closely with the development teams to ensure the relevance of the injected failures. Use past incidents or

issues as a guide. Examine dependencies and evaluate the results when those dependencies are removed.

An external team can't hypothesize faults for your team. A study of failures from an artificial source might be

relevant to your team's purposes, but the effort must be justified.

Have you injected faults in a way that accurately reflects production failures?

Simulate production failures. Treat injected faults in the same way that you would treat production-level faults.

Enforcing a tighter limit on the blast radius will enable you to simulate a production environment. Each fault-

injection effort must be accompanied by tooling that's designed to inject the types of faults that are relevant to

your team's scenarios. Here are two basic ways:

Inject faults in a non-production environment, such as Canary or Test In Production (TIP).

Partition the production service or environment.

Halt all faults and roll back the state to its last-known good configuration if the state seems severe.

Have you built confidence incrementally?

Start by hardening the core, and then expand out in layers. At each point, lock in progress with automated

regression tests. Each team should have a long-term strategy based on a progression that makes sense for the

team's circumstances.

By applying the shift left strategy, you can help ensure that any obstacles to developer usage are removed early

and the testing results are actionable.

The process must be very low tax. That is, the process must make it easy for developers to understand what happened and to fix the issues. The effort must fit easily into their normal workflow, not burden them with one-off special activities.

Next step

Best practices

Related links

For information on release testing, see Testing your application and Azure environment.

For more information, see Bulkhead pattern.

Go back to the main article: Testing

Testing best practices for reliability in Azure applications

12/16/2022 • 2 minutes to read

Test regularly

Test for resiliency

Design a backup strategy

Design a disaster recovery strategy

Codify steps to fail over and fail back

Plan for regional failures

This article lists Azure best practices to enhance testing Azure applications for reliability. These best practices are

derived from our experience with Azure reliability and the experiences of customers like yourself.

During the architectural phase, focus on implementing practices that meet your business requirements, and

ensure that applications will run in a healthy state without significant downtime.

Test regularly to validate existing thresholds, targets, and assumptions. Regular testing should be performed as

part of each major change and if possible, on a regular basis. While most testing should be performed within

the testing and staging environments, it is often beneficial to also run a subset of tests against the production

system.

To test resiliency, you should verify how the end-to-end workload performs under intermittent failure

conditions. Consider performing the following tests:

Performance testing

Simulation testing

Fault injection testing

Load testing

Operational readiness testing

Failover and failback testing

Design a backup strategy that is tailored to the specific business requirements and circumstances of the

application. At a high level, the approaches can be divided into these categories: 1) Redeploy on disaster, 2)

Warm Spare (Active/Passive), and 3) Hot Spare (Active/Active).

When parts of the Azure network are inaccessible, you might not be able to access your application or data. In

this situation, design a disaster recovery strategy to run most applications with reduced functionality.

Codify the steps, preferably in automation, to fail over the application and to fail back to the primary region once the issue that triggered the failover has been addressed. Doing so helps ensure that you can respond to an outage in a way that limits its impact.

Use Azure for creating applications that are distributed across regions. Such distribution helps to minimize the possibility that a failure in one region could affect other regions.

Implement retry logic

Configure and test health probes

Segregate read and write interfaces

Next step

Track the number of transient exceptions and retries over time to uncover issues or failures in your application's

retry logic. A trend of increasing exceptions over time may indicate that the service is having an issue and may

fail.
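
One way to make retries trackable is to count them where they happen. The sketch below is a minimal illustration, assuming a hypothetical `TransientError` type and an in-memory `Counter`; a real application would more likely emit these counts to its metrics system.

```python
import random
import time
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


retry_stats: Counter = Counter()   # stand-in for a real metrics sink


def call_with_retry(operation: Callable[[], T], name: str,
                    max_attempts: int = 4, base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation` on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
            retry_stats[f"{name}.success"] += 1
            return result
        except TransientError:
            retry_stats[f"{name}.transient_error"] += 1
            if attempt == max_attempts:
                retry_stats[f"{name}.exhausted"] += 1
                raise
            retry_stats[f"{name}.retry"] += 1
            # Exponential backoff: base, 2x, 4x, ... plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable when max_attempts >= 1")


# Simulated dependency that fails twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("simulated timeout")
    return "ok"


result = call_with_retry(flaky, "orders-api", sleep=lambda _: None)
print(result, dict(retry_stats))
```

Plotting the `transient_error` and `retry` counts over time is what reveals the upward trend that the paragraph above warns about.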

Configure and test health probes for your load balancers and traffic managers. Ensure that your health endpoint

checks the critical parts of the system and responds appropriately.

Achieve levels of scale and performance needed for your solution by segregating read and write interfaces by

implementing the Command and Query Responsibility Segregation (CQRS) pattern.

Monitoring

Go back to the main article: Testing

Monitoring for reliability

12/16/2022 • 2 minutes to read

Checklist

Azure services for monitoring

Reference architecture

Next step

Related links

Monitoring and diagnostics are crucial for resiliency. If something fails, you need to know that it failed, when it failed, and why.

How do you monitor and measure application health?

The application is instrumented with semantic logs and metrics.

Application logs are correlated across components.

All components are monitored and correlated with application telemetry.

Key metrics, thresholds, and indicators are defined and captured.

A health model has been defined based on performance, availability, and recovery targets.

Azure Service Health events are used to alert on applicable service level events.

Azure Resource Health events are used to alert on resource health events.

Monitor long-running workflows for failures.

Azure Monitor

Application Insights

Azure Service Health

Azure Resource Health

Azure Resource Manager

Azure Policy

Hybrid availability and performance monitoring

Unified logging for microservices applications

Application health

Azure Monitor

Continuous monitoring

Monitoring application health for reliability

12/16/2022 • 6 minutes to read

Key points

Alerting

Service level alerts

Resource level alerts

Monitoring and diagnostics are crucial for availability and resiliency. If something fails, you need to know that it failed, when it failed, and why.

Monitoring isn't the same as failure detection. For example, your application might detect a transient error and retry, avoiding downtime. But it should also log the retry operation so that you can monitor the error rate to get an overall picture of application health.

Define alerts that are actionable and effectively prioritized.

Create alerts that poll for services nearing their limits and quotas.

Use application instrumentation to detect and resolve performance anomalies.

Track the progress of long-running processes.

Troubleshoot issues to gain an overall view of application health.

Alerts are notifications of system health issues that are found during monitoring. Alerts only deliver value if they

are actionable and effectively prioritized by on-call engineers through defined operational procedures. Present

telemetry data in a dashboard or email alert format that makes it easy for an operator to notice problems or

trends quickly.

Use Azure Service Health to respond to service level events. Azure Service Health provides a view into the health of Azure services and regions. It issues communications about the following events that affect services:

Outages

Planned maintenance activities

Other health advisories

Azure Service Health alerts should be configured to operationalize Service Health events. However, Service Health alerts shouldn't be used to detect issues because of the associated latencies. There is a 5-minute service level objective (SLO) for automated issues, but many issues require manual interpretation to define a root cause analysis (RCA). Instead, use these alerts to provide information that helps interpret issues that have already been detected and surfaced through the health model, and to inform an operational response.

To learn more, reference Azure Service Health.

Use Azure Resource Health to respond to resource level events. Azure Resource Health provides information about the health of individual resources such as a specific virtual machine, and is highly useful when diagnosing unavailable resources.

Dashboards

Samples

Azure subscription and service limits

Individual services

Azure storage scalability and performance targets

Scalability targets for virtual machine disks

Virtual machine size

Azure Resource Health alerts should be configured for specific resource groups and resource types. These alerts should be adjusted to maximize the signal-to-noise ratio. For example, only distribute a notification when a resource becomes unhealthy according to the application health model or due to an Azure platform-initiated event. It's important to consider transient issues when setting an appropriate threshold for resource unavailability. For example, configure an alert for a virtual machine with a threshold of 1 minute for unavailability before an alert is triggered.
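
The effect of such a threshold can be illustrated with a small state machine. This is a sketch of the idea only, not how Azure Resource Health alerts are implemented: the `UnavailabilityAlert` class and the sample timings are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


class UnavailabilityAlert:
    """Fire only after a resource has been continuously unavailable for `threshold`."""

    def __init__(self, threshold: timedelta = timedelta(minutes=1)) -> None:
        self.threshold = threshold
        self._down_since: Optional[datetime] = None
        self._fired = False

    def observe(self, available: bool, at: datetime) -> bool:
        """Record one health sample; return True when an alert should be sent."""
        if available:
            # Recovery resets the timer, so short transient blips never alert.
            self._down_since = None
            self._fired = False
            return False
        if self._down_since is None:
            self._down_since = at
        if not self._fired and at - self._down_since >= self.threshold:
            self._fired = True          # alert once per outage, not once per sample
            return True
        return False


start = datetime(2022, 12, 16, 9, 0, 0)
alert = UnavailabilityAlert()
# (seconds since start, resource available?)
samples = [(0, True), (15, False), (30, True), (45, False),
           (75, False), (105, False), (120, False)]
fired = [alert.observe(ok, start + timedelta(seconds=s)) for s, ok in samples]
print(fired)
```

The 15-second blip never alerts because the resource recovers before the threshold elapses; the second outage alerts once, at the sample taken 60 seconds after it began.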

To learn more, reference Azure Resource Health.

You can also get a full-stack view of application state by using Azure dashboards to create a combined view of

monitoring graphs from the following:

Application Insights

Log Analytics

Azure Monitor metrics

Service Health

Here are some samples about creating and querying alerts:

HealthAlert: A sample about creating resource-level health activity log alerts. The sample uses Azure

Resource Manager to create alerts.

GraphAlertsPsSample: A set of PowerShell commands that queries for alerts generated against your

subscription.

Azure subscriptions have limits on certain resource types, such as number of resource groups, cores, and

storage accounts. To ensure your application doesn't run up against Azure subscription limits, create alerts that

poll for services nearing their limits and quotas.
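
The polling check itself can be very small. The sketch below assumes that current usage and limits have already been retrieved by some other means; `QuotaUsage`, the 80% threshold, and every number in the snapshot are hypothetical and are not real Azure limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuotaUsage:
    name: str
    current: int
    limit: int

    @property
    def fraction_used(self) -> float:
        # A zero limit is treated as fully consumed so that it is always flagged.
        return self.current / self.limit if self.limit else 1.0


def quotas_near_limit(usages: List[QuotaUsage],
                      threshold: float = 0.8) -> List[QuotaUsage]:
    """Return quotas at or above `threshold` of their limit, most-used first."""
    flagged = [u for u in usages if u.fraction_used >= threshold]
    return sorted(flagged, key=lambda u: u.fraction_used, reverse=True)


# Hypothetical snapshot; the numbers are illustrative only.
snapshot = [
    QuotaUsage("storage_accounts", current=230, limit=250),
    QuotaUsage("regional_vcpus", current=40, limit=100),
    QuotaUsage("resource_groups", current=800, limit=980),
]
for quota in quotas_near_limit(snapshot):
    print(f"ALERT {quota.name}: {quota.fraction_used:.0%} of limit used")
```

Alerting on a fraction of the limit, rather than on the limit itself, leaves time to request an increase or to scale out before throttling begins.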

Address the following subscription limits with alerts.

Individual Azure services have consumption limits on:

Storage

Throughput

Number of connections

Requests per second

Your application will fail if it attempts to use resources beyond these limits, resulting in service throttling and

possible downtime.

Depending on the specific service and your application requirements, you can often stay under these limits by

scaling up (choosing another pricing tier, for example) or scaling out (adding new instances).

Azure allows a maximum number of storage accounts per subscription. If your application requires more

storage accounts than are currently available in your subscription, create a new subscription with extra storage

accounts. For more information, reference Azure subscription and service limits, quotas, and constraints.

An Azure infrastructure as a service (IaaS) virtual machine supports attaching many data disks, depending on

several factors, including the virtual machine size and the type of storage account. If your application exceeds

the scalability targets for virtual machine disks, provision additional storage accounts and create the virtual

machine disks there. To learn more, reference Scalability and performance targets for VM disks.

Azure SQL Database

Instrumentation

Long-running workflow failures

TIP

Analysis and diagnosis

If the actual CPU, memory, disk, and I/O of your virtual machines approach the limits of the virtual machine size,

your application may experience capacity issues. To correct the issues, increase the virtual machine size.

If your workload fluctuates over time, consider using virtual machine scale sets to automatically scale the

number of virtual instances. Otherwise, you need to manually increase or decrease the number of virtual

machines.

If your Azure SQL Database tier isn't adequate to handle your application's Database Transaction Unit (DTU)

requirements, your data use will be throttled. For more information on selecting the correct service plan,

reference Azure SQL Database purchasing models.

Instrument applications to measure the customer experience. Effective instrumentation is vital for detecting and resolving performance anomalies that can impact customer experience and application availability. To build a robust application health model, it's vital that you achieve visibility into the operational state of critical internal dependencies, such as a shared network virtual appliance (NVA) or ExpressRoute connection.

Automated failover and failback systems depend on the correct functioning of monitoring and instrumentation.

Dashboards that visualize system health and operator alerts also depend on having accurate monitoring and

instrumentation. If these elements fail, miss critical information, or report inaccurate data, an operator might not

realize that the system is unhealthy or failing. Make sure you include monitoring systems in your test plan.

Instrument applications to track calls to dependent services. Dependency tracking and measuring the duration

or status of dependency calls is also vital to measuring overall application health. It should be used to inform a

health model for the application.

Microsoft recommends collecting and storing logs and key metrics of critical components.

Provide rich instrumentation:

For failures that are likely, but have not yet occurred: provide enough data to determine the cause, mitigate

the situation, and ensure that the system remains available.

For failures that have already occurred: the application should return an appropriate error message to the

user, but should attempt to continue running despite reduced functionality.

Monitoring systems should capture comprehensive details so that applications can be restored efficiently and, if

necessary, designers and developers can modify the system to prevent the situation from recurring.

Long-running workflows often include multiple steps, each of which should be independent.

Track the progress of long-running processes to minimize the likelihood that the entire workflow will need to be

rolled back or that multiple compensating transactions will need to be executed.

Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent

Supervisor.
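
Progress tracking can be illustrated with a simple checkpoint file. Note that this is a much simpler technique than the Scheduler Agent Supervisor pattern named above, and it assumes each step is independent and safe to run once; `run_workflow` and the step names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: List[Step], checkpoint: Path) -> List[str]:
    """Run steps in order, skipping those that a previous run already completed.

    Returns the names of the steps executed in this invocation.
    """
    done: List[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed: List[str] = []
    for name, action in steps:
        if name in done:
            continue                               # finished in an earlier run
        action()                                   # may raise; earlier progress is kept
        done.append(name)
        checkpoint.write_text(json.dumps(done))    # persist after every step
        executed.append(name)
    return executed


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = Path(tmp) / "progress.json"
    calls = {"charge": 0}

    def charge() -> None:
        calls["charge"] += 1
        if calls["charge"] == 1:
            raise RuntimeError("simulated failure")

    workflow = [("reserve", lambda: None), ("charge", charge), ("ship", lambda: None)]
    try:
        run_workflow(workflow, checkpoint_file)
    except RuntimeError:
        pass                                           # first run fails at "charge"
    resumed = run_workflow(workflow, checkpoint_file)  # second run skips "reserve"
    print(resumed)
```

Because progress is persisted after each step, the second run resumes at the failed step instead of rolling back or repeating the completed one.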

Analyze data combined in these data stores to troubleshoot issues and gain an overall view of application health. Generally, you can search for and analyze the data in Application Insights and Log Analytics by using Kusto queries, or view preconfigured graphs by using management solutions. Use Azure Advisor to view recommendations with a focus on resiliency and performance.

Next step

Health modeling

Related links

For information on dashboards, reference Azure dashboards.

For information on virtual machine sizes, reference Sizes for virtual machines in Azure.

For information on scale sets, reference virtual machine scale sets overview.

Go back to the main article: Monitoring

Health modeling for reliability

12/16/2022 • 5 minutes to read

NOTE

Key points

Healthy and unhealthy states

Quantify application states

The health model should be able to surface the health of critical system flows or key subsystems to ensure

appropriate operational prioritization is applied. For example, the health model should be able to represent the

current state of the user login transaction flow.

The health model should not treat all failures the same. For example, the health model should distinguish

between transient and non-transient faults. It should clearly distinguish between expected-transient but

recoverable failures and a true disaster state.


Know how to tell if an application is healthy or unhealthy.

Understand the impact of logs in diagnostic data.

Ensure the consistent use of diagnostic settings across the application.

Use critical system flows in your health model.

A health model qualifies what healthy and unhealthy states represent for the application. A holistic application health model should be used to quantify what healthy and unhealthy states represent across all application components. It's highly recommended that a "traffic light" model be used to indicate a green/healthy state when key non-functional requirements and targets are fully satisfied and resources are optimally utilized. For example, 95 percent of requests are processed in <= 500 ms with AKS node utilization at x%. Once established, this health model should inform critical monitoring metrics across system components and operational sub-system composition.
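
A traffic-light classification of this kind might look like the following sketch. Only the 500 ms latency target comes from the example above; the utilization target, the red thresholds, and the meaning given to "amber" and "red" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    p95_latency_ms: float      # 95th-percentile request latency
    node_utilization: float    # fraction between 0.0 and 1.0


def health_state(sample: HealthSample,
                 latency_target_ms: float = 500.0,
                 utilization_target: float = 0.75,   # assumed stand-in for "x%"
                 latency_red_ms: float = 1000.0,     # assumed
                 utilization_red: float = 0.90) -> str:
    """Map a sample to "green", "amber", or "red"."""
    if (sample.p95_latency_ms > latency_red_ms
            or sample.node_utilization > utilization_red):
        return "red"                      # far outside targets
    if (sample.p95_latency_ms <= latency_target_ms
            and sample.node_utilization <= utilization_target):
        return "green"                    # all targets fully satisfied
    return "amber"                        # degraded, but not yet critical


for sample in (HealthSample(420, 0.60), HealthSample(650, 0.60), HealthSample(420, 0.95)):
    print(sample, "->", health_state(sample))
```

Keeping the thresholds as parameters makes it easy to revise them as performance, availability, and recovery targets change.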

The overall health state can be impacted by both application level issues and resource level failures. Telemetry

correlation should be used to ensure transactions can be mapped through the end-to-end application and

critical system flows, as this is vital to root cause analysis for failures. Platform level metrics and logs such as

CPU percentage, network in/out, and disk operations/sec should be collected from the application to inform a

health model and detect/predict issues. This can also help to distinguish between transient and non-transient

faults.

Application level events should be automatically correlated with resource level metrics to quantify the current application state.

Application logs

White-box and black-box monitoring

Use critical system flows in the health model

Create good health probes

Application logs are an important source of diagnostics data. To gain insight when you need it most, follow

these best practices for application logging:

Use semantic (structured) logging. With structured logs, it's easier to automate the consumption and analysis of the log data, which is especially important at cloud scale. Generally, we recommend storing Azure resource metrics and diagnostics data in a Log Analytics workspace rather than in a storage account. This way, you can use Kusto queries to obtain the data you want quickly and in a structured format. You can also use Azure Monitor APIs and Azure Log Analytics APIs.

Log data in the production environment. Capture robust telemetry data while the application is running in the production environment, so you have sufficient information to diagnose the cause of issues in the production state.

Log events at service boundaries. Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and one of them fails, the correlation ID helps you track requests across your application and pinpoint why the transaction failed.

Use asynchronous logging. Synchronous logging operations sometimes block your application code, causing requests to back up as logs are written. Use asynchronous logging to preserve availability during application logging.

Separate application logging from auditing. Audit records are commonly maintained for compliance or regulatory requirements and must be complete. To avoid dropped transactions, maintain audit logs separately from diagnostic logs.
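
Three of these practices (structured logs, a correlation ID, and asynchronous writes) can be combined using only the Python standard library. The sketch below is one possible arrangement, not a recommended configuration: the field names, the `orders` logger, and the in-memory `ListHandler` sink are assumptions for illustration.

```python
import json
import logging
import logging.handlers
import queue
import uuid
from typing import Any, Dict, List


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log queries can filter on fields."""

    def format(self, record: logging.LogRecord) -> str:
        event: Dict[str, Any] = {
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "service": getattr(record, "service", None),
        }
        return json.dumps(event)


class ListHandler(logging.Handler):
    """Stand-in sink that keeps formatted lines in memory (replace with a real sink)."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


sink = ListHandler()
sink.setFormatter(JsonFormatter())

# Asynchronous logging: callers only enqueue; a background thread writes to the sink.
log_queue: queue.Queue = queue.Queue()
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.propagate = False

# Created once at the edge and passed along to every downstream service.
correlation_id = str(uuid.uuid4())
logger.info("order received", extra={"correlation_id": correlation_id, "service": "api"})
logger.info("payment authorized", extra={"correlation_id": correlation_id, "service": "payments"})

listener.stop()          # processes the records still in the queue, then stops the thread
print("\n".join(sink.lines))
```

Both log lines carry the same `correlation_id`, so a single query on that field reconstructs the path of the transaction across services.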

All application resources should be configured to route diagnostic logs and metrics to the chosen log

aggregation technology. Azure Policy should also be used as a device to ensure the consistent use of diagnostic

settings across the application, to enforce the desired configuration for each Azure service.

Application level events should be automatically correlated with resource level metrics to quantify the current

application state. The overall health state can be impacted by both application level issues as well as resource

level failures. Telemetry correlation should be used to ensure transactions can be mapped through the end-to-

end application and critical system flows, as this is vital to root cause analysis (RCA) for failures. Platform level

metrics and logs such as CPU percentage, network in/out, and disk operations/sec should be collected from the

application to inform a health model and detect/predict issues. This can also help to distinguish between

transient and non-transient faults.

Use white-box monitoring to instrument the application with semantic logs and metrics. Application level

metrics and logs, such as current memory consumption or request latency, should be collected from the

application to inform a health model and detect/predict issues.

Use black-box monitoring to measure platform services and the resulting customer experience. Black-box

monitoring tests externally visible application behavior without knowledge of the internals of the system. This is

a common approach to measuring customer-centric service level indicators (SLIs), service level objectives

(SLOs), and service level agreements (SLAs).

The health model should be able to surface the respective health of critical system flows or key subsystems to

ensure appropriate operational prioritization is applied. For example, the health model should be able to

represent the current state of the user login transaction flow.

Next step

Related links

The health and performance of an application can degrade over time, and degradation might not be noticeable

until the application fails.

Implement probes or check functions, and run them regularly from outside the application. These checks can be

as simple as measuring response time for the application as a whole, for individual parts of the application, for

specific services that the application uses, or for separate components.

Check functions can run processes to ensure that they produce valid results, measure latency and check

availability, and extract information from the system.

The HealthProbesSample sample shows how to set up health probes. It provides an Azure Resource Manager (ARM) template to set up the infrastructure. A load balancer accepts public requests and load balances them across a set of virtual machines. The health probe is set up to check the service's /Health path.
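
On the application side, an endpoint for such a probe could be sketched as follows. This is not the code from HealthProbesSample: only the `/Health` path is taken from the description above, while the check names, the port, and the response body are assumptions. Which checks belong in `CHECKS` depends on what the prober does with the answer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical checks; each returns True when the component is usable.
CHECKS: Dict[str, Callable[[], bool]] = {
    "local_service": lambda: True,
    "disk_space": lambda: True,
}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:              # a crashing check counts as unhealthy
            results[name] = False
    healthy = all(results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/Health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)     # probes treat a non-200 response as unhealthy
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the endpoint; call this from the application's entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()


print(evaluate_health(CHECKS))
```

Returning 503 when any check fails lets the prober take the instance out of rotation without parsing the response body.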

Best practices

For information on monitoring metrics, see Azure Monitor Metrics overview.

For information on using Application Insights, see What is Application Insights?

Go back to the main article: Monitoring

Monitoring best practices for reliability in Azure applications

12/16/2022 • 2 minutes to read

Implement health probes and check functions

Check long-running workflows

Maintain application logs

Measure remote call statistics

Track transient exceptions and retries

Set up an early warning system

Operate within Azure subscription limits

Monitor third-party services

This article lists Azure best practices to enhance monitoring Azure applications for reliability. These best

practices are derived from our experience with Azure reliability and the experiences of customers like yourself.

Implement these best practices for monitoring and alerts in your application so you can detect failures and alert

an operator to fix them.

Run health probes and check functions regularly from outside the application to identify degradation of application health and performance.

Catching issues in long-running workflows early can minimize the need to roll back the entire workflow or to execute multiple compensating transactions.

Log applications in production and at service boundaries.

Use semantic and asynchronous logging.

Separate application logs from audit logs.

Measure remote call statistics and share the data with the application team. To give your operations team an instantaneous view into application health, summarize remote call metrics, such as latency, throughput, and errors, at the 99th and 95th percentiles. Perform statistical analysis on the metrics to uncover errors that occur within each percentile.
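
A percentile summary of this kind can be computed with the Python standard library. The sketch below uses made-up latency samples; `latency_summary` is a hypothetical helper, and the choice of the inclusive quantile method is an assumption.

```python
import statistics
from typing import Dict, List


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the median, 95th, and 99th percentile of remote-call latencies."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Illustrative data: mostly fast calls with a slow tail.
samples = [20.0] * 90 + [200.0] * 8 + [1500.0] * 2
summary = latency_summary(samples)
print(summary)
print("mean:", statistics.fmean(samples))
```

With this data the mean sits well below the 95th and 99th percentiles, which is why the slow tail is reported separately rather than folded into an average.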

A trend of increasing exceptions over time indicates that the service is having an issue and may fail. Track

transient exceptions and retries over an appropriate time frame to prevent failure.

Identify the key performance indicators (KPIs) of an application's health, such as transient exceptions and remote

call latency, and set appropriate threshold values for each of them. Send an alert to operations when the

threshold value is reached.

Azure subscriptions have limits on certain resource types, such as the number of resource groups, cores, and

storage accounts. Watch your use of resource types.

Log your invocations of third-party services and correlate them with your application's health and diagnostic logging using a unique identifier.

Train multiple operators

Train multiple operators to monitor the application and to perform manual recovery steps. Make sure there is

always at least one trained operator active.

Reliability patterns

12/16/2022 • 3 minutes to read

Availability

Availability is measured as a percentage of uptime, and defines the proportion of time that a system is functional and working. Availability is affected by system errors, infrastructure problems, malicious attacks, and system load. Cloud applications typically provide users with a service level agreement (SLA), which means that applications must be designed and implemented to maximize availability.

To mitigate against availability risks from malicious Distributed Denial of Service (DDoS) attacks, implement the native Azure DDoS protection service or a third party capability.

Pattern | Summary
Deployment Stamps | Deploy multiple independent copies of application components, including data stores.
Geodes | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes, to smooth intermittent heavy loads.
Throttling | Control the consumption of resources by an instance of an application, an individual tenant, or an entire service.

High availability

Azure infrastructure is composed of geographies, regions, and Availability Zones, which limit the blast radius of a failure and therefore limit potential impact to customer applications and data. The Azure Availability Zones construct was developed to provide a software and networking solution to protect against datacenter failures and to provide increased high availability (HA) to our customers. With HA architecture there is a balance between high resilience, low latency, and cost.

Pattern | Summary
Deployment Stamps | Deploy multiple independent copies of application components, including data stores.
Geodes | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Bulkhead | Isolate elements of an application into pools so that if one fails, the others will continue to function.
Circuit Breaker | Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.
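Because availability is expressed as a percentage of uptime, it can help to turn an SLA figure into a downtime budget. The sketch below is simple arithmetic, not an Azure calculation: a 99.9 percent target over a 30-day period allows roughly 43.2 minutes of downtime. The second helper assumes components fail independently and must all be up for the application to work, which is a simplification.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime that a given uptime percentage permits over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def composite_availability(*sla_percents: float) -> float:
    """Availability, in percent, of components that must all be up at once.

    Assumption: failures are independent, so the availabilities multiply.
    """
    result = 1.0
    for sla in sla_percents:
        result *= sla / 100
    return result * 100
```

For example, chaining a 99.95 percent service with a 99.99 percent service gives a composite figure slightly below either one, which is one reason dependencies matter when setting availability targets.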
 
Resiliency
Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and malicious.

The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware, means there is an increased likelihood that both transient and more permanent faults will arise. The connected nature of the internet and the rise in sophistication and volume of attacks increase the likelihood of a security disruption. Detecting failures and recovering quickly and efficiently is necessary to maintain resiliency.

Pattern | Summary
Bulkhead | Isolate elements of an application into pools so that if one fails, the others will continue to function.
Circuit Breaker | Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.
Compensating Transaction | Undo the work performed by a series of steps, which together define an eventually consistent operation.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Leader Election | Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances.
Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.
Retry | Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.
Scheduler Agent Supervisor | Coordinate a set of actions across a distributed set of services and other remote resources.
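To make the Retry pattern listed above more concrete, here is a minimal sketch with exponential backoff and jitter. `TransientError` is a hypothetical marker exception, and the attempt count and delays are illustrative defaults; real services usually document which errors are safe to retry.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            # The delay doubles on each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function easy to test. Retries only suit transient faults; a fault that persists is better handled by a pattern such as Circuit Breaker so callers stop hammering an unhealthy dependency.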

Overview of the security pillar

Information security has always been a complex subject, and it evolves quickly with the creative ideas and implementations of attackers and security researchers. Security vulnerabilities originated with the identification and exploitation of common programming errors and unexpected edge cases. Over time, however, the attack surface that an attacker may explore and exploit has expanded well beyond these common errors and edge cases. Attackers now freely exploit vulnerabilities in system configurations, operational practices, and the social habits of the systems' users. As system complexity, connectedness, and the variety of users increase, attackers have more opportunities to identify unprotected edge cases. Attackers can hack systems into doing things they weren't designed to do.

Security is one of the most important aspects of any architecture. It provides the following assurances against

deliberate attacks and abuse of your valuable data and systems:

Confidentiality

Integrity

Availability

Losing these assurances can negatively affect your business operations and revenue, and your organization's

reputation. For the security pillar, we'll discuss key architectural considerations and principles for security and

how they apply to Azure.

The security of complex systems depends on understanding the business context, social context, and technical

context. As you design your system, cover these areas:

Understanding an IT solution as it interacts with its surrounding environment holds the key to preventing

unauthorized activity and to identifying anomalous behavior that may represent a security risk.

Another key factor in success: Adopt a mindset of assuming failure of security controls. Assuming failure allows

you to design compensating controls that limit risk and damage if a primary control fails.

Assuming failure can be referred to as assume breach or assume compromise. Assume breach is closely related to the Zero Trust approach of continuously validating security assurances. The Zero Trust approach is described in the Security Design Principles section in more detail.

Cloud architectures can help simplify the complex task of securing an enterprise estate through specialization

and shared responsibilities:

Specialization: Specialist teams at cloud providers can develop advanced capabilities to operate and secure

systems on behalf of organizations. This approach is preferable to numerous organizations individually

developing deep expertise on managing and securing common elements, such as:

Datacenter physical security

Firmware patching

Hypervisor configuration

The economies of scale allow cloud provider specialist teams to invest in optimization of management and

security that far exceeds the ability of most organizations.

Cloud providers must be compliant with the same IT regulatory requirements as the aggregate of all their

customers. Providers must develop expertise to defend against the aggregate set of adversaries who attack their

customers. As a consequence, the default security posture of applications deployed to the cloud is frequently

much better than that of applications hosted on-premises.

Shared Responsibility Model: As computing environments move from customer-controlled datacenters to

the cloud, the responsibility of security also shifts. Security of the operational environment is now a concern

shared by both cloud providers and customers. Organizations can reduce focus on activities that aren't core

business competencies by shifting these responsibilities to a cloud service like Azure. Depending on the specific

technology choices, some security protections will be built into the particular service, while addressing others

will remain the customer's responsibility. To ensure that proper security controls are provided, organizations

must carefully evaluate the services and technology choices.

Shared Responsibility and Key Strategies:

After reading this document, you'll be equipped with key insights about how to improve the security posture of

your architecture.

As part of your architecture design, you should consider all relevant areas that affect the success of your

application. While this article is concerned primarily with security principles, you should also prioritize other

requirements of a well-designed system, such as:

Availability

Scalability

Costs

Operational characteristics (trading off one over the other as necessary)

Consistently sacrificing security for gains in other areas isn't advisable because security risks tend to increase

dynamically over time.

Increasing security risks result in three key strategies:

Security topic | Description
Security design principles | These principles describe a securely architected system hosted on cloud or on-premises datacenters, or a combination of both.
Governance, risk, and compliance | How is the organization's security going to be monitored, audited, and reported? What types of risks does the organization face while trying to protect identifiable information, Intellectual Property (IP), and financial information? Are there specific industry, government, or regulatory requirements that dictate or provide recommendations on criteria that your organization's security controls must meet?
Regulatory compliance | Governments and other organizations frequently publish standards to help define good security practices (due diligence) so that organizations can avoid being negligent in security.

Establish a modern perimeter: For the elements that your organization controls, ensure that you have a consistent set of controls (a perimeter) between those assets and the threats to them. Design perimeters around intercepting authentication requests for the resources (identity controls) rather than intercepting network traffic on enterprise networks. The traditional network-based approach isn't feasible for enterprise assets outside the network.

More information on perimeters and how they relate to Zero Trust and Enterprise Segmentation is in the Governance, Risk, and Compliance and Network Security & Containment sections.

Modernize infrastructure security: For operating systems and middleware elements that legacy applications require, take advantage of cloud technology to reduce security risk to the organization. For example, knowing whether all servers in a physical datacenter are updated with security patches has always been challenging because of discoverability. Software-defined datacenters allow easy and rapid discovery of all resources. This rapid discovery enables technology like Microsoft Defender for Cloud to measure quickly and accurately the patch state of all servers and remediate them.

"Trust but verify" each cloud provider: For the elements that are under the control of the cloud provider, ensure that the security practices and regulatory compliance of each cloud provider (large and small) meet your requirements.

To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the

Microsoft Azure Well-Architected Review.

We cover the following areas in the security pillar of the Microsoft Azure Well-Architected Framework:

Security topic | Description
Administration | Administration is the practice of monitoring, maintaining, and operating Information Technology (IT) systems to meet service levels that the business requires. Administration introduces some of the highest impact security risks because performing these tasks requires privileged access to a broad set of these systems and applications.
Applications and services | Applications and the data associated with them ultimately act as the primary store of business value on a cloud platform.
Identity and access management | Identity provides the basis of a large percentage of security assurances.
Information protection and storage | Protecting data at rest is required to maintain confidentiality, integrity, and availability assurances across all workloads.
Network security and containment | Network security has been the traditional linchpin of enterprise security efforts. However, cloud computing has increased the requirement for network perimeters to be more porous and many attackers have mastered the art of attacks on identity system elements (which nearly always bypass network controls).
Security Operations | Security operations maintain and restore the security assurances of the system as live adversaries attack it. The tasks of security operations are described well by the NIST Cybersecurity Framework functions of Detect, Respond, and Recover.

Identity management

Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully

managed identity and access management service. You can use it to create domains that exist purely on Azure,

or integrate with your on-premises Active Directory identities.

Azure AD is also used by:

Microsoft 365

Dynamics 365

Many third-party applications

For consumer-facing applications, Azure Active Directory B2C lets users authenticate with their existing social

accounts, such as:

Facebook

Google

Users can also create a new user account managed by Azure AD.

If you want to integrate an on-premises Active Directory environment with an Azure network, several

approaches are possible, depending on your requirements. For more information, reference Identity

Management reference architectures.

 
Protect your infrastructure

Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an Azure AD tenant.

Use Azure role-based access control (Azure RBAC) to grant users within your organization the correct permissions to Azure resources. Grant access by assigning Azure roles to users or groups at a certain scope. The scope can be a:

Subscription
Resource group
Single resource

Audit all changes to infrastructure.

Application security

In general, the security best practices for application development still apply in the cloud. Best practices include:

Encrypt data in-transit with the latest supported TLS versions
Protect against CSRF and XSS attacks
Prevent SQL injection attacks

Cloud applications often use managed services that have access keys. Never check these keys into source control. Consider storing application secrets in Azure Key Vault.

Data sovereignty and encryption

Make sure that your data remains in the correct geopolitical zone when using Azure data services. Azure's geo-replicated storage uses the concept of a paired region in the same geopolitical region.

Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services support data encryption at rest, including:

Azure Storage
Azure SQL Database
Azure Synapse Analytics
Azure Cosmos DB

Security resources

Microsoft Defender for Cloud provides integrated security monitoring and policy management for your workload.
Azure Security Documentation
Microsoft Trust Center
The security pillar is part of a comprehensive set of security guidance that also includes:
Security in the Microsoft Cloud Adoption Framework for Azure: A high-level overview of a cloud security end
state.
Security architecture design: Implementation-level journey of our security architectures.
Azure security benchmarks: Prescriptive best practices and controls for Azure security.
End-to-end security in Azure: Documentation that introduces you to the security services in Azure.
Browse our security architectures

Top 10 security best practices for Azure: Top Azure security best practices that Microsoft recommends based on lessons learned across customers and our own environments.

Microsoft Cybersecurity Architectures: The diagrams describe how Microsoft security capabilities integrate with Microsoft platforms and 3rd-party platforms.

Next step

Principles

Security design principles

Security design principles describe a securely architected system hosted on cloud or on-premises datacenters (or a combination of both). Application of these principles dramatically increases the likelihood your security architecture assures confidentiality, integrity, and availability.

To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the Microsoft Azure Well-Architected Review.

The following design principles provide:

Context for questions
Why a certain aspect is important
How an aspect is applicable to Security

These critical design principles are used as lenses to assess the Security of an application deployed on Azure. These lenses provide a framework for the application assessment questions.

Plan resources and how to harden them

Recommendations:

Consider security when planning workload resources.
Understand how individual cloud services are protected.
Use a service enablement framework to evaluate.

Automate and use least privilege

Recommendations:

Implement least privilege throughout the application and control plane to protect against data exfiltration and malicious actor scenarios.
Drive automation through DevSecOps to minimize the need for human interaction.

Classify and encrypt data

Recommendations:

Classify data according to risk.
Apply industry-standard encryption at rest and in transit, which ensures keys and certificates are stored securely and managed properly.

Monitor system security, plan incident response

Recommendations:

Correlate security and audit events to model application health.
Correlate security and audit events to identify active threats.
Establish automated and manual procedures to respond to incidents.
Use security information and event management (SIEM) tooling for tracking.
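As a rough sketch of what correlating security and audit events can look like, the example below flags identities that show a burst of failed sign-ins shortly before a privileged change. The `Event` record, the event kinds `signin_failed` and `role_assigned`, and the thresholds are all hypothetical; a SIEM tool would normally express this as a query over its own schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized record from a security or audit log."""
    timestamp: float  # seconds since epoch
    principal: str    # user or service identity
    kind: str         # for example "signin_failed" or "role_assigned"


def suspicious_principals(
    events: Iterable[Event],
    failed_threshold: int = 5,
    window_seconds: float = 600,
) -> List[str]:
    """Flag principals with many failed sign-ins shortly before a privileged change."""
    by_principal: Dict[str, List[Event]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        by_principal[event.principal].append(event)

    flagged = []
    for principal, history in by_principal.items():
        for event in history:
            if event.kind != "role_assigned":
                continue
            # Count failures that fall inside the window before this audit event.
            recent_failures = sum(
                1 for e in history
                if e.kind == "signin_failed"
                and 0 <= event.timestamp - e.timestamp <= window_seconds
            )
            if recent_failures >= failed_threshold:
                flagged.append(principal)
                break
    return sorted(flagged)
```

The value comes from joining two log sources on a shared identity and time window; neither the sign-in log nor the audit log shows the pattern on its own.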

Identify and protect endpoints

Recommendations:

Monitor and protect the network integrity of internal and external endpoints through security appliances or Azure services, such as:
Firewalls
Web application firewalls
Use industry standard approaches to protect against common attack vectors, such as distributed denial of service (DDoS) attacks like SlowLoris.

Protect against code-level vulnerabilities

Recommendations:

Identify and mitigate code-level vulnerabilities, such as cross-site scripting and structured query language (SQL) injection.
In the operational lifecycle, regularly incorporate:
Security fixes
Codebase and dependency patching

Model and test against potential threats

Recommendations:

Establish procedures to identify and mitigate known threats.
Use penetration testing to verify threat mitigation.
Use static code analysis to detect and prevent future vulnerabilities.
Use code scanning to detect and prevent future vulnerabilities.

Next step

Design governance

Governance, risk, and compliance

As part of overall design, prioritize where to invest the available resources: financial, people, and time. Constraints on those resources also affect the security implementation across the organization. To achieve an appropriate ROI on security, the organization needs to first understand and define its security priorities.

Governance: How is the organization's security going to be monitored, audited, and reported? Design and implementation of security controls within an organization is only the beginning of the story. How does the organization know that things are actually working? Are they improving? Are there new requirements? Is there mandatory reporting? Similar to compliance, there may be external industry, government, or regulatory standards that need to be considered.

Risk: What types of risks does the organization face while trying to protect identifiable information, Intellectual Property (IP), and financial information? Who may be interested in or could use this information if stolen, including external and internal threats, whether unintentional or malicious? A commonly forgotten but extremely important consideration within risk is addressing Disaster Recovery and Business Continuity.

Compliance: Are there specific industry, government, or regulatory requirements that dictate or provide recommendations on criteria that your organization's security controls must meet? Examples of such standards, organizations, controls, and legislation are ISO27001, NIST, and PCI-DSS.

The collective role of organization(s) is to manage the security standards of the organization through their lifecycle:

Define: Set organizational policies for operations, technologies, and configurations based on internal factors (business requirements, risks, asset evaluation) and external factors (benchmarks, regulatory standards, threat environment).

Improve: Continually push these standards incrementally forward towards the ideal state to ensure continual risk reduction.

Sustain: Ensure the security posture doesn't degrade naturally over time by instituting auditing and monitoring compliance with organizational standards.

Prioritize security best practices investments

Security best practices are ideally applied proactively and completely to all systems as you build your cloud program, but this isn't reality for most enterprise organizations. Business goals, project constraints, and other factors often cause organizations to balance security risk against other risks and apply a subset of best practices at any given point.

We recommend applying as many of the best practices as early as possible, and then working to retrofit any gaps over time as you mature your security program to include review, prioritization, and proactive application of best practices to cloud resources. We recommend evaluating the following considerations when prioritizing which to follow first:

High business impact and highly exposed systems: These include systems with direct intrinsic value as well as the systems that provide attackers a path to them. For more information, see Identify and classify business critical applications.

Easiest to implement mitigations: Identify quick wins by prioritizing the best practices that your organization can execute quickly because you already have the required skills, tools, and knowledge to do it (for example, implementing a Web App Firewall (WAF) to protect a legacy application). Be careful not to exclusively use (or overuse) this short-term prioritization method. Doing so can increase your risk by preventing your program from growing and leaving critical risks exposed for extended periods.

Microsoft has provided some prioritized lists of security initiatives to help organizations start with these decisions based on our experience with threats and mitigation initiatives in our own environments and across our customers. See Module 4a of the Microsoft CISO Workshop.

Checklist

What considerations for compliance and governance did you make?

Create a landing zone for the workload. The infrastructure must have appropriate controls and be repeatable with every deployment.
Enforce creation and deletion of services and their configuration through Azure Policies.
Ensure consistency across the enterprise by applying policies, permissions, and tags across all subscriptions through careful implementation of the root management group.
Understand regulatory requirements and operational data that may be used for audits.
Continuously monitor and assess the compliance of your workload. Perform regular attestations to avoid fines.
Review and apply recommendations from Azure.
Remediate basic vulnerabilities to keep the attacker costs high.

In this section

Follow these questions to assess the workload at a deeper level.

Assessment | Description
Are there any regulatory requirements for this workload? | Understand all regulatory requirements. Check the Microsoft Trust Center for the latest information, news, and best practices in security, privacy, and compliance.
Is the organization using a landing zone for this workload? | Consider the security controls placed on the infrastructure into which the workload will get deployed.
Do you have a segmentation strategy? | Reference model and strategies of how the functions and teams can be segmented.
Are you using management groups as part of your segmentation strategy? | Strategies using management groups to manage resources across multiple subscriptions consistently and efficiently.
What security controls do you have in place for access to Azure infrastructure? | Guidance on reducing risk exposure in scope and time when configuring critical impact accounts such as Administrators.

Azure security benchmark

The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to help secure the services you use in Azure:

The questions in this section are aligned to these controls:

Governance and Strategy
Posture and vulnerability management

Reference architecture

Here are some reference architectures related to governance:

Cloud Adoption Framework enterprise-scale landing zone architecture

Next steps

Provide security assurance through identity management to authenticate and grant permission to users, partners, customers, applications, services, and other entities.

Identity and access management

Related links

Go back to the main article: Security

Regulatory compliance

A workload can have regulatory requirements, which may mandate that operational data, such as application logs and metrics, remain within a certain geo-political region.

These requirements may need strict security measures that affect the overall architecture and the selection and configuration of specific PaaS and SaaS services. The requirements also have implications for how the workload should be operationalized.

Key points

Make sure that all regulatory and governance requirements are known and well understood.
Periodically perform external or internal workload security audits, or both.
Have compliance checks as part of the workload operations.
Use Microsoft Trust Center.

Review the requirements

Regulatory organizations frequently publish standards and updates to help define good security practices so that organizations can avoid negligence. The purpose and scope of these standards and regulations vary. The security requirements, however, can influence the design for data protection and retention, network access, and system security.

Knowing whether your cloud resources are in compliance with standards mandated by governments or industry organizations is essential in today's globalized world.

For example, a workload that handles credit card transactions is subject to the Payment Card Industry (PCI) standard. One of the requirements prohibits access between the internet and any system component in the cardholder data environment.

To provide a restrictive environment, you can choose to do the following:

Host the workload in an Azure compute option that supports bringing your own virtual network (VNet).
Remove any internet-facing endpoints by using Private Endpoints.
Use network security group (NSG) rules that define authorized inbound and outbound access.

Noncompliance can lead to fines or other business impact. Work with your regulators and carefully review the standard to understand both the intent and the literal wording of each requirement. Here are some questions that may help you understand each requirement.

How is compliance measured?
Who approves that the workload meets the requirements?
Are there processes for obtaining attestations?
What are the documentation requirements?

Suggested action

Use Microsoft Defender for Cloud to assess your current compliance score and to identify the gaps.

Learn more

Tutorial: Improve your regulatory compliance

Use the Microsoft Trust Center

Elevated security capabilities

Suggested actionsSuggested actions

Operational considerations

Keep checking the Microsoft Trust Center for the latest information, news, and best practices in security, privacy,

and compliance.

Data governanceData governance. Focus on protecting information in cloud services, mobile devices, workstations, or

collaboration platforms. Build the security strategy by classifying and labeling information. Use strong

access control and encryption technology.

Compliance offeringsCompliance offerings. Microsoft offers a comprehensive set of compliance offerings to help your

organization follow national, regional, and industry-specific requirements governing the collection and

use of data. For information, see Compliance offerings.

Compliance scoreCompliance score. Use Microsoft Compliance Score to assess your data protection controls on an

ongoing basis. Act on the recommendations to make progress toward compliance.

Audit repor tsAudit repor ts. Use audit reports to stay current on the latest privacy, security, and compliance-related

information for Microsoft's cloud services. See Audit Reports.

Shared responsibilityShared responsibility. The workload can be hosted on Software as a Service (SaaS), Platform as a

Service (PaaS), Infrastructure as a Service (IaaS), or in an on-premises datacenter. Have a clear

understanding about the portions of the architecture you're responsible for versus Azure. Whatever the

hosting model, the following responsibilities are always retained by you:

Data

Endpoints

Account

Access management

For more information, reference Shared responsibility in the cloud.

Consider whether to use specialized security capabilities in your enterprise architecture.

Dedicated HSMs and Confidential Computing have the potential to enhance security and meet regulatory

requirements, but can introduce complexity that may negatively impact your operations and efficiency.

We recommend careful consideration and judicious use of these security measures as required:

Dedicated Hardware Security Modules (HSMs)Dedicated Hardware Security Modules (HSMs)

Dedicated Hardware Security Modules (HSMs) may help meet regulatory or security requirements.

Confidential ComputingConfidential Computing

Confidential Computing may help meet regulatory or security requirements.

Learn more about elevated security capabilities for Azure workloads.

Regulatory requirements may influence the workload operations. For example, there might be a requirement

that operational data, such as application logs and metrics, remain within a certain geo-political region.

Consider automation of deployment and maintenance tasks. Automation reduces security and compliance risk

by limiting opportunity to introduce human errors during manual tasks.
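
The data residency requirement mentioned above lends itself to an automated check. The following is a minimal sketch, assuming you keep an inventory of where operational data (logs, metrics) is stored. The `DataSink` type, the `ALLOWED_REGIONS` set, and the sample inventory are hypothetical names introduced for illustration; they are not part of any Azure SDK.

```python
from dataclasses import dataclass

# Hypothetical residency rule: operational data must stay in these regions.
ALLOWED_REGIONS = {"westeurope", "northeurope"}


@dataclass(frozen=True)
class DataSink:
    name: str
    kind: str    # for example "logs" or "metrics"
    region: str  # region name as recorded in your own inventory


def residency_violations(sinks, allowed_regions):
    """Return the sinks whose region falls outside the allowed set."""
    allowed = {region.lower() for region in allowed_regions}
    return [sink for sink in sinks if sink.region.lower() not in allowed]


# Illustrative inventory; in practice this would come from your tooling.
sinks = [
    DataSink("app-logs", "logs", "westeurope"),
    DataSink("app-metrics", "metrics", "eastus"),
]

for sink in residency_violations(sinks, ALLOWED_REGIONS):
    print(f"{sink.name} ({sink.kind}) is stored in {sink.region}")
```

Running a check like this as part of an automated deployment pipeline, rather than by hand, is in line with the automation guidance above.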

Related links

Azure maintains a compliance portfolio that covers US government, industry-specific, and region/country

standards. For more information, reference Azure compliance offerings.

Monitor the compliance of the workload to check if the security controls are aligned to the regulatory

requirements. For more information, reference Security audits.

Go back to the main article: Governance

Azure landing zone

Azure landing zone integration

12/16/2022 • 3 minutes to read

From a workload perspective, a landing zone refers to a prepared platform into which the application gets deployed. A landing zone implementation can have compute, data sources, access controls, and networking components already provisioned. With the required plumbing already in place, the workload only needs to plug into it.

When considering the overall security, a landing zone offers centralized security capabilities that add a threat mitigation layer for the workload. Implementations can vary, but here are some common strategies that enhance the security posture.

Isolation through segmentation. You can isolate assets at several layers, from the Azure enrollment down to a subscription that has the resources for the workload. This strategy of keeping resources within a boundary that is separate from other parts of the organization is an effective way of detecting and containing adversary movements.

Consistent adoption of organizational policies. Policies govern which resources can be used and their usage limits. Policies also provide identity controls. Only authenticated and authorized entities are allowed access. This approach decouples the governance requirements from the workload requirements. It's crucial that a landing zone is handed over to the workload owner with the security guardrails deployed.

Configurations that align with principles of Zero Trust. For instance, an implementation might have network connectivity to on-premises datacenters. When designing networking controls, the landing zone may apply the least-privilege principle by opening communication paths only when necessary and only to trusted entities.

The preceding examples are conceptually simple, but the implementation can get complicated for an enterprise-scale deployment. Azure landing zone, as part of the Cloud Adoption Framework (CAF), provides architecture guidance about identity and access management, networking, and other design areas necessary to achieve an optimal implementation.

Learn more

What is an Azure landing zone?

Landing zone implementation options

Increase automation with Azure Blueprints

Use Azure's native automation capabilities to increase consistency, compliance, and deployment speed for workloads. A recommended way to implement a landing zone is with Azure Blueprints and Azure Policies.
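
To make the idea of policy guardrails more concrete, here is a minimal sketch of how a proposed deployment could be evaluated against a small set of rules. It is a simplified illustration, not the Azure Policy engine or its definition format. The `Resource` and `Guardrail` types and the two sample rules are hypothetical; only the `audit` and `deny` effect names are borrowed from Azure Policy.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Resource:
    name: str
    type: str
    location: str
    tags: dict = field(default_factory=dict)


@dataclass
class Guardrail:
    name: str
    effect: str                           # "audit" records a finding, "deny" blocks
    violates: Callable[[Resource], bool]  # True when the resource breaks the rule


# Hypothetical guardrails handed over together with a landing zone.
GUARDRAILS = [
    Guardrail("allowed-locations", "deny",
              lambda r: r.location not in {"westeurope", "northeurope"}),
    Guardrail("require-owner-tag", "audit",
              lambda r: "owner" not in r.tags),
]


def evaluate(resource, guardrails):
    """Return (allowed, findings) for a proposed deployment."""
    findings = [(g.name, g.effect) for g in guardrails if g.violates(resource)]
    allowed = all(effect != "deny" for _, effect in findings)
    return allowed, findings


vm = Resource("vm-01", "Microsoft.Compute/virtualMachines", "eastus")
allowed, findings = evaluate(vm, GUARDRAILS)
print(allowed, findings)
```

Separating the effect from the rule mirrors the guidance in this article: a low-impact rule can start as an audit-only finding and be tightened to a blocking rule once its effect is understood.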

Automation of deployment and maintenance tasks reduces security and compliance risk by limiting opportunity

to introduce human errors during manual tasks. This will also allow both IT Operations teams and security

teams to shift their focus from repeated manual tasks to higher value tasks like enabling developers and

business initiatives, protecting information, and so on.

Use the Azure Blueprints service to rapidly and consistently deploy application environments that are compliant with your organization's policies and external regulations. Azure Blueprints automates deployment of environments, including Azure roles, policies, and resources such as virtual machines, networking, storage, and more. Azure Blueprints builds on Microsoft's significant investment in Azure Resource Manager to standardize resource deployment in Azure and enable resource deployment and governance based on a desired-state approach. You can use built-in configurations in Azure Blueprints, make your own, or just use Resource Manager scripts for smaller scope.

Several Security and Compliance Blueprints samples are available to use as a starting template.

Enforce policy compliance

Organizations of all sizes will have security compliance requirements. Industry, government, and internal corporate security policies all need to be audited and enforced. Policy monitoring is critical to check that initial configurations are correct and that they continue to be compliant over time.

In Azure, you can take advantage of Azure Policy to create and manage policies that enforce compliance. Like Azure Blueprints, Azure Policy is built on the underlying Azure Resource Manager capabilities in the Azure platform (and Azure Policy can also be assigned via Azure Blueprints).

For more information on how to do this in Azure, review Tutorial: Create and manage policies to enforce compliance.

How do you consistently deploy landing zones that follow organizational policies?

Key Azure services that can help in creating a landing zone:

Azure Blueprints sketches a solution's design parameters based on an organization's standards, patterns, and requirements.
Azure Resource Manager template specs store an Azure Resource Manager template (ARM template) in Azure for later deployment.
Azure Policy enforces organizational standards and assesses compliance at scale.
Azure AD and Azure role-based access control (Azure RBAC) work in conjunction to provide identity and access controls.
Microsoft Defender for Cloud

Architecture

For information about an enterprise-scale reference architecture, see Cloud Adoption Framework enterprise-scale landing zone architecture. The architecture provides considerations in these critical design areas:

Enterprise Agreement (EA) enrollment and Azure Active Directory tenants
Identity and access management
Management group and subscription organization
Network topology and connectivity
Management and monitoring
Business continuity and disaster recovery
Security, governance, and compliance
Platform automation and DevOps

Next

Use management groups to manage resources across multiple subscriptions consistently and efficiently.

Management groups

Back to the main article: Governance

Segmentation strategies

12/16/2022 • 4 minutes to read

Segmentation refers to the isolation of resources from other parts of the organization. It's an effective way of detecting and containing adversary movements.

One approach to segmentation is network isolation. This approach is not recommended because different technical teams may not be aligned with the business use cases and application workloads. One outcome of such a mismatch is complexity, as seen especially with on-premises networking, which can lead to reduced velocity or, in worse cases, broad network firewall exceptions. Although network control should be considered as a segmentation strategy, it should be part of a unified segmentation strategy.

Network security has been the traditional linchpin of enterprise security efforts. However, cloud computing has increased the requirement for network perimeters to be more porous, and many attackers have mastered the art of attacks on identity system elements (which nearly always bypass network controls). These factors have increased the need to focus primarily on identity-based access controls to protect resources rather than network-based access controls.

An effective segmentation strategy will guide all technical teams (IT, security, applications) to consistently isolate access using networking, applications, identity, and any other access controls. The strategy should aim to:

Minimize operational friction by aligning to business practices and applications.

Contain risk by adding cost to attackers. This is done by:

    Isolating sensitive workloads from compromise by other assets.

    Isolating high-exposure systems from being used as a pivot to other systems.

Monitor operations that might lead to potential violation of the integrity of the segments (account usage, unexpected traffic).

Here are some recommendations for creating a unified strategy:

Ensure alignment of technical teams to a single strategy based on assessing business risks.

Establish a modern perimeter based on zero-trust principles, focused on identity, devices, applications, and other signals. This helps overcome limitations of network controls in protecting from new resources and attack types.

Reinforce network controls for legacy applications by exploring microsegmentation strategies.

Centralize the organizational responsibility for management and security of core networking functions such as cross-premises links, virtual networking, subnetting, and IP address schemes, as well as network security elements such as virtual network appliances, encryption of cloud virtual network activity and cross-premises traffic, network-based access controls, and other traditional network security components.

Reference model

Start with this reference model and adapt it to your organization's needs. This model shows how functions, resources, and teams can be segmented.

Example segments

Consider isolating shared and individual resources as shown in the preceding image.

Core Services segment

This segment hosts shared services utilized across the organization. These shared services typically include Active Directory Domain Services, DNS/DHCP, and system management tools hosted on Azure Infrastructure as a Service (IaaS) virtual machines.

Additional segments

Other segments can contain grouped resources based on certain criteria. For instance, resources that are used by one specific workload or application might be contained in a separate segment. You may also segment or sub-segment by lifecycle stage, like development, test, and production. Some resources might intersect, such as applications, and can use virtual networks for lifecycle stages.

Clear lines of responsibility

These are the main functions for this reference model. Permissions for these functions are described in Team roles and responsibilities.

FUNCTION | SCOPE | RESPONSIBILITY
Policy management (Core and individual segments) | Some or all resources. | Monitor and enforce compliance with external (or internal) regulations, standards, and security policy. Assign appropriate permissions to those roles.
Central IT operations (Core) | Across all resources. | Grant permissions to the central IT department (often the infrastructure team) to create, modify, and delete resources like virtual machines and storage.
Central networking group (Core and individual segments) | All network resources. | Centralize network management and security to reduce the potential for inconsistent strategies that create security risks an attacker could exploit. Because not all divisions of the IT and development organizations have the same level of network management and security knowledge and sophistication, organizations benefit from using a centralized network team's expertise and tooling. To ensure consistency and avoid technical conflicts, assign network resource responsibilities to a single central networking organization. These resources should include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual network appliances.
Resource role permissions (Core) | - | For most core services, the administrative privileges required are granted through the application (Active Directory, DNS/DHCP, System Management Tools). No additional Azure resource permissions are required. If your organizational model requires these teams to manage their own VMs, storage, or other Azure resources, you can assign these permissions to those roles.
Security operations (Core and individual segments) | All resources. | Assess risk factors, identify potential mitigations, and advise organizational stakeholders who accept the risk.
IT operations (individual segments) | All resources. | Grant permission to create, modify, and delete resources. The purpose of the segment (and resulting permissions) will depend on your organization structure. Segments with resources managed by a centralized IT organization can grant the central IT department (often the infrastructure team) permission to modify these resources. Segments managed by independent business units or functions (such as a Human Resources IT Team) can grant those teams permission to all resources in the segment. Segments with autonomous DevOps teams don't need to grant permissions across all resources because the resource role (below) grants permissions to application teams. For emergencies, use the service admin account (break-glass account).
Service admin (Core and individual segments) | | Use the service admin role only for emergencies (and initial setup if required). Do not use this role for daily tasks.

Next steps

Start with this reference model and manage resources across multiple subscriptions consistently and efficiently with management groups.

Management groups

Establish segmentation with management groups

12/16/2022 • 3 minutes to read

Management groups can manage resources across multiple subscriptions consistently and efficiently. However, due to their flexibility, your design can become complex and compromise security and operations.

Support your segmentation strategy with management groups

Structure management groups into a simple design that guides the enterprise segmentation model.

Management groups offer the ability to consistently and efficiently manage resources (including multiple subscriptions as needed). However, because of their flexibility, it's possible to create an overly complex design. Complexity creates confusion and negatively impacts both operations and security (as illustrated by overly complex Organizational Unit (OU) and Group Policy Object (GPO) designs for Active Directory).

Microsoft recommends aligning the top level of management groups (MGs) into a simple enterprise segmentation strategy and limiting the levels to no more than two.

In the example reference, there are enterprise-wide resources used by all segments, a set of core services that share services, and additional segments for each workload.

Root management group for enterprise-wide resources.

Use the root management group to include identities that have the requirement to apply policies across every resource. For example, regulatory requirements, such as restrictions related to data sovereignty. This group is effective in applying policies, permissions, and tags across all subscriptions.

Caution

Be careful when using the root management group because the policies can affect all resources on Azure and potentially cause downtime or other negative impacts. For considerations, see Use root management group with caution later in this article.

For complete guidance about using management groups for an enterprise, see CAF: Management group and subscription organization.

Management group for each workload segment.

Use a separate management group for teams with limited scope of responsibility. This group is typically required because of organizational boundaries or regulatory requirements.

Root or segment management group for the core set of services.

Use root management group with caution

Use the Root Management Group (MG) for enterprise consistency, but test changes carefully to minimize risk of operational disruption.

The root management group enables you to ensure consistency across the enterprise by applying policies,

permissions, and tags across all subscriptions. Care must be taken when planning and implementing

assignments to the root management group because this can affect every resource on Azure and potentially

cause downtime or other negative impacts on productivity in the event of errors or unanticipated effects.
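
Some of the structural recommendations in this article can be checked mechanically before a change is rolled out. The sketch below models a management group hierarchy as plain Python objects and reports two conditions: more than two levels of management groups, and subscriptions placed directly under the root. The `ManagementGroup` type, the sample tenant, and the choice to count levels beneath the root are assumptions made for illustration; the code does not call any Azure API.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementGroup:
    name: str
    children: list = field(default_factory=list)       # child ManagementGroup objects
    subscriptions: list = field(default_factory=list)  # subscription names


def depth_below(group):
    """Number of management group levels beneath `group`."""
    if not group.children:
        return 0
    return 1 + max(depth_below(child) for child in group.children)


def review_hierarchy(root, max_levels=2):
    """Return a list of human-readable findings for the hierarchy."""
    findings = []
    levels = depth_below(root)
    if levels > max_levels:
        findings.append(f"{levels} levels below root exceeds the limit of {max_levels}")
    if root.subscriptions:
        findings.append(f"subscriptions placed directly under root: {root.subscriptions}")
    return findings


# Hypothetical tenant: core services plus one workload segment.
root = ManagementGroup("root", children=[
    ManagementGroup("core-services", subscriptions=["sub-identity"]),
    ManagementGroup("workloads", children=[
        ManagementGroup("payments", subscriptions=["sub-pay-prod", "sub-pay-dev"]),
    ]),
], subscriptions=["sub-sandbox"])

for finding in review_hierarchy(root):
    print(finding)
```

A check like this fits naturally into the test-first approach described in this article: run it against the planned hierarchy in a lab before applying the change to the production tenant.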

Plan carefully: Select enterprise-wide elements for the root management group that have a clear requirement to be applied across every resource, low impact, or both.

Select enterprise-wide identities that have a clear requirement to be applied across all resources. Good candidates include:

Regulatory requirements with clear business risk/impact. For example, restrictions related to data sovereignty.

Near-zero potential negative impact. For example, policy with audit effect, tag assignment, and Azure RBAC permissions assignments that have been carefully reviewed.

Use a dedicated service principal name (SPN) to execute management group management operations, subscription management operations, and role assignment. An SPN reduces the number of users who have elevated rights and follows least-privilege guidelines. Assign the User Access Administrator role at the root management group scope (/) to grant the SPN just mentioned access at the root level. After the SPN is granted permissions, the User Access Administrator role can be safely removed. In this way, only the SPN is part of the User Access Administrator role. Assign Contributor permission to the SPN, which allows tenant-level operations. This permission level ensures that the SPN can be used to deploy and manage resources to any subscription within your organization.

Limit the number of Azure Policy assignments made at the root management group scope (/). This limitation minimizes debugging inherited policies in lower-level management groups.

Don't create any subscriptions under the root management group. This hierarchy ensures that subscriptions don't only inherit the small set of Azure policies assigned at the root-level management group, which don't represent a full set necessary for a workload.

Test first: Plan, test, and validate all enterprise-wide changes on the root management group before applying (policy, tags, Azure RBAC model, and so on).

Test lab: Representative lab tenant or lab segment in production tenant.

Production pilot: This can be a segment management group or designated subset in subscription(s) management group.

Validate changes: Validate changes to ensure they have the desired effect.

Next steps

Administrative accounts

Administrative account security

12/16/2022 • 9 minutes to read

Administration is the practice of monitoring, maintaining, and operating Information Technology (IT) systems to

meet service levels that the business requires. Administration introduces some of the highest impact security

risks because performing these tasks requires privileged access to a very broad set of these systems and

applications. Attackers know that gaining access to an account with administrative privileges can get them

access to most or all of the data they would target, making the security of administration one of the most critical

security areas.

As an example, Microsoft makes significant investments in protection and training of administrators for our

cloud systems and IT systems:

Microsoft's recommended core strategy for administrative privileges is to use the available controls to reduce

risk.

Reduce risk exposure (scope and time): The principle of least privilege is best accomplished with modern controls that provide privileges on demand. This helps to limit risk by limiting administrative privileges exposure by:

Scope: Just enough access (JEA) provides only the required privileges for the administrative operation required (vs. having direct and immediate privileges to many or all systems at a time, which is almost never required).

Time: Just in time (JIT) approaches provide the required privileges as they are needed.

Mitigate the remaining risks: Use a combination of preventive and detective controls to reduce risks, such as isolating administrator accounts from the most common risks (phishing and general web browsing), simplifying and optimizing their workflow, increasing assurance of authentication decisions, and identifying anomalies from normal baseline behavior that can be blocked or investigated.
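
The two dimensions of exposure, scope and time, can be illustrated with a small model of on-demand grants. This is a conceptual sketch only; it does not represent how Azure AD Privileged Identity Management or any other product is implemented. The `Grant` type, the `activate` and `is_allowed` helpers, and the principal and scope names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    principal: str
    role: str             # just enough access: one role...
    scope: str            # ...on one scope
    expires_at: datetime  # just in time: the grant lapses on its own


def activate(principal, role, scope, minutes, now):
    """Issue a time-boxed grant for a single role on a single scope."""
    return Grant(principal, role, scope, now + timedelta(minutes=minutes))


def is_allowed(grants, principal, role, scope, now):
    """True only while a matching, unexpired grant exists."""
    return any(
        g.principal == principal and g.role == role
        and g.scope == scope and now < g.expires_at
        for g in grants
    )


start = datetime(2022, 12, 16, 9, 0, tzinfo=timezone.utc)
grants = [activate("admin-alice", "Contributor", "rg-payments", 60, start)]

# Within the window and on the granted scope.
print(is_allowed(grants, "admin-alice", "Contributor", "rg-payments", start + timedelta(minutes=30)))
# After the window has closed.
print(is_allowed(grants, "admin-alice", "Contributor", "rg-payments", start + timedelta(minutes=90)))
# Within the window but on a different scope.
print(is_allowed(grants, "admin-alice", "Contributor", "rg-billing", start + timedelta(minutes=30)))
```

Because the grant carries its own expiry, no separate cleanup step is needed to remove the privilege, which is the property that distinguishes on-demand access from standing access.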

Microsoft has captured and documented best practices for protecting administrative accounts and published prioritized roadmaps for protecting privileged access that can be used as references for prioritizing mitigations for accounts with privileged access.

Securing Privileged Access (SPA) roadmap for administrators of on-premises Active Directory

Guidance for securing administrators of Azure Active Directory

Minimize number of critical impact admins

Grant privileges that can have a critical business impact to the fewest possible accounts.

Each admin account represents potential attack surface that an attacker can target, so minimizing the number of accounts with that privilege helps limit the overall organizational risk. Experience has taught us that membership of these privileged groups grows naturally over time as people change roles if membership is not actively limited and managed.

We recommend an approach that reduces this attack surface risk while ensuring business continuity in case something happens to an administrator:

Assign at least two accounts to the privileged group for business continuity.

When two or more accounts are required, provide justification for each member, including the original two.

Regularly review membership and justification for each group member.

Managed accounts for admins

Ensure all critical impact admins are managed by the enterprise directory to follow organizational policy enforcement.

Consumer accounts such as Microsoft accounts like @Hotmail.com, @live.com, @outlook.com, don't offer sufficient security visibility and control to ensure the organization's policies and any regulatory requirements are being followed. Because Azure deployments often start small and informally before growing into enterprise-managed tenants, some consumer accounts remain as administrative accounts long afterward (for example, original Azure project managers), creating blind spots and potential risks.

Separate accounts for admins

Ensure all critical impact admins have a separate account for administrative tasks (vs. the account they use for email, web browsing, and other productivity tasks).

Phishing and web browser attacks represent the most common attack vectors to compromise accounts, including administrative accounts.

Create a separate administrative account for all users that have a role requiring critical privileges. For these administrative accounts, block productivity tools like Office 365 email (remove license). If possible, block arbitrary web browsing (with proxy and/or application controls) while allowing exceptions for browsing to the Azure portal and other sites required for administrative tasks.

No standing access / just in time privileges

Avoid providing permanent "standing" access for any critical impact accounts.

Permanent privileges increase business risk by increasing the time an attacker can use the account to do

damage. Temporary privileges force attackers targeting an account to either work within the limited times the

admin is already using the account or to initiate privilege elevation (which increases their chance of being

detected and removed from the environment).

Grant privileges only as required using one of these methods:

Just in time: Enable Azure AD Privileged Identity Management (PIM) or a third-party solution to require following an approval workflow to obtain privileges for critical impact accounts.

Break glass: For rarely used accounts, follow an emergency access process to gain access to the accounts. This is preferred for privileges that have little need for regular operational usage, like members of global admin accounts.

Emergency access or 'Break Glass' accounts

Ensure you have a mechanism for obtaining administrative access in case of an emergency.

While rare, sometimes extreme circumstances arise where all normal means of administrative access are unavailable.

We recommend following the instructions at Managing emergency access administrative accounts in Azure AD and ensuring that security operations monitor these accounts carefully.

Admin workstation security

Ensure critical impact admins use a workstation with elevated security protections and monitoring.

Attack vectors that use browsing and email, like phishing, are cheap and common. Isolating critical impact admins from these risks will significantly lower your risk of a major incident where one of these accounts is compromised and used to materially damage your business or mission.

Choose level of admin workstation security based on the options available at https://aka.ms/securedworkstation

Highly Secure Productivity Device (Enhanced Security Workstation or Specialized Workstation)

You can start this security journey for critical impact admins by providing them with a higher security workstation that still allows for general browsing and productivity tasks. Using this as an interim step helps ease the transition to fully isolated workstations for both the critical impact admins and the IT staff supporting these users and their workstations.

Privileged Access Workstation (Specialized Workstation or Secured Workstation)

These configurations represent the ideal security state for critical impact admins as they heavily restrict access to phishing, browser, and productivity application attack vectors. These workstations don't allow general internet browsing; they only allow browser access to the Azure portal and other administrative sites.

Critical impact admin dependencies – Account/Workstation

Carefully choose the on-premises security dependencies for critical impact accounts and their workstations.

To contain the risk from a major incident on-premises spilling over to become a major compromise of cloud assets, you must eliminate or minimize the means of control that on-premises resources have over critical impact accounts in the cloud. As an example, attackers who compromise the on-premises Active Directory can access and compromise cloud-based assets that rely on those accounts, like resources in Azure, Amazon Web Services (AWS), ServiceNow, and so on. Attackers can also use workstations joined to those on-premises domains to gain access to accounts and services managed from them.

Choose the level of isolation from on-premises means of control, also known as security dependencies, for critical impact accounts.

User Accounts: Choose where to host the critical impact accounts:

Native Azure AD Accounts - Create native Azure AD accounts that are not synchronized with on-premises Active Directory.

Synchronize from on-premises Active Directory - Use existing accounts hosted in the on-premises Active Directory.

Workstations: Choose how you will manage and secure the workstations used by critical admin accounts:

Native Cloud Management and Security (Recommended): Join workstations to Azure AD and manage/patch them with Intune or other cloud services. Protect and monitor with Microsoft Defender for Endpoint or another cloud service not managed by on-premises based accounts.

Manage with Existing Systems: Join existing AD domain and use existing management/security.

Passwordless or multifactor authentication for admins

Require all critical impact admins to use passwordless authentication or multifactor authentication (MFA).

Attack methods have evolved to the point where passwords alone cannot reliably protect an account. This is well documented in a Microsoft Ignite Session.

Administrative accounts and all critical accounts should use one of the following methods of authentication. These capabilities are listed in preference order, from highest cost/difficulty to attack (strongest/preferred options) to lowest cost/difficulty to attack:

Passwordless (such as Windows Hello)

Passwordless (Authenticator App)

Multifactor Authentication

Note that SMS text message based MFA has become very inexpensive for attackers to bypass, so we recommend you avoid relying on it. This option is still stronger than passwords alone, but is much weaker than other MFA options.

Enforce conditional access for admins - Zero Trust

Authentication for all admins and other critical impact accounts should include measurement and enforcement of key security attributes to support a Zero Trust strategy.

Attackers compromising Azure Admin accounts can cause significant harm. Conditional Access can significantly reduce that risk by enforcing security hygiene before allowing access to Azure management.

Configure Conditional Access policy for Azure management that meets your organization's risk appetite and operational needs.

Require Multifactor Authentication and/or connection from designated work network.

Require Device integrity with Microsoft Defender for Endpoint (Strong Assurance).

Avoid granular and custom permissions

Avoid permissions that specifically reference individual resources or users.

Specific permissions create unneeded complexity and confusion as they don't carry the intention to new similar resources. This then accumulates into a complex legacy configuration that is difficult to maintain or change without fear of "breaking something", negatively impacting both security and solution agility.
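
As a rough illustration of how such permissions could be spotted, the sketch below reviews a list of role assignments and flags those granted to an individual user or scoped to a single resource. The `RoleAssignment` type and the sample data are hypothetical, and the scope test is a simplification that assumes the usual Azure resource ID layout (subscription, then resource group, then provider and resource); it is not an official classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    principal_type: str  # "User", "Group", or "ServicePrincipal"
    principal: str
    role: str
    scope: str           # scope string in Azure resource ID style


def is_resource_level(scope):
    """Simplified test: the scope goes deeper than a resource group."""
    lowered = scope.lower()
    return "/resourcegroups/" in lowered and "/providers/" in lowered


def lint(assignments):
    """Return (principal, message) pairs for assignments worth reviewing."""
    findings = []
    for a in assignments:
        if a.principal_type == "User":
            findings.append((a.principal, "assigned to an individual user; prefer a group"))
        if is_resource_level(a.scope):
            findings.append((a.principal, "scoped to a single resource; prefer a resource group or management group"))
    return findings


# Placeholder subscription ID used only for this example.
SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"

assignments = [
    RoleAssignment("User", "alice@contoso.com", "Contributor",
                   f"{SUB}/resourceGroups/rg-app/providers/Microsoft.Storage/storageAccounts/appdata"),
    RoleAssignment("Group", "payments-admins", "Contributor", f"{SUB}/resourceGroups/rg-app"),
    RoleAssignment("Group", "platform-readers", "Reader",
                   "/providers/Microsoft.Management/managementGroups/workloads"),
]

findings = lint(assignments)
for principal, message in findings:
    print(f"{principal}: {message}")
```

In this sample, only the first assignment is flagged, once for the principal type and once for the scope. The two group assignments at resource group and management group scope pass without findings.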

Instead of assigning resource-specific permissions, use either:

Management groups for enterprise-wide permissions.

Resource groups for permissions within subscriptions.

Instead of granting permissions to specific users, assign access to groups in Azure AD. If there isn't an appropriate group, work with the identity team to create one. This allows you to add and remove group members externally to Azure and ensure permissions are current, while also allowing the group to be used for other purposes such as mailing lists.

Use built-in roles

Use built-in roles for assigning permissions where possible.

Customization leads to complexity that increases confusion and makes automation more complex, challenging, and fragile. These factors all negatively impact security.

We recommend that you evaluate the built-in roles designed to cover most normal scenarios. Custom roles are a powerful and sometimes useful capability, but they should be reserved for cases when built-in roles won't work.

Establish lifecycle management for critical impact accounts

Ensure you have a process for disabling or deleting administrative accounts when admin personnel leave the organization (or leave administrative positions).

See Regularly review critical access for more details.

Attack simulation for critical impact accounts

Regularly simulate attacks against administrative users with current attack techniques to educate and empower them.

People are a critical part of your defense, especially your personnel with access to critical impact accounts.

Ensuring these users (and ideally all users) have the knowledge and skills to avoid and resist attacks will reduce

your overall organizational risk.

You can use Office 365 Attack Simulation capabilities or any number of third-party offerings.

Azure identity and access management

considerations

12/16/2022 • 2 minutes to read

Checklist

Azure security benchmark

Azure services for identity

Reference architecture

Most architectures have shared services that are hosted and accessed across networks. Those services share

common infrastructure and users need to access resources and data from anywhere. For such architectures, a

common way to secure resources is to use network controls. However, that isn't enough.

Provide security assurance through identity management: the process of authenticating and authorizing security

principals. Use identity management services to authenticate and grant permission to users, partners,

customers, applications, services, and other entities.

How are you managing the identity for your workload?

Define clear lines of responsibility and separation of duties for each function. Restrict access based on a

need-to-know basis and least privilege security principles.

Assign permissions to users, groups, and applications at a certain scope through Azure RBAC. Use built-in

roles when possible.

Prevent deletion or modification of a resource, resource group, or subscription through management locks.

Use Managed Identities to access resources in Azure.

Support a single enterprise directory. Keep the cloud and on-premises directories synchronized, except for

critical-impact accounts.

Set up Azure AD Conditional Access. Enforce and measure key security attributes when authenticating all

users, especially for critical-impact accounts.

Have a separate identity source for non-employees.

Preferably use passwordless methods or opt for modern password methods.

Block legacy protocols and authentication methods.

The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to

help secure the services you use in Azure:

The questions in this section are aligned to the Azure Security Benchmarks Identity and Access

Control.

The considerations and best practices in this section are based on these Azure services:

Azure AD

Azure AD B2B

Azure AD B2C

Next steps

Related links

Here are some reference architectures related to identity and access management:

Integrate on-premises AD domains with Azure AD

Integrate on-premises AD with Azure

Monitor the communication between segments. Use data to identify anomalies, set alerts, or block traffic to

mitigate the risk of attackers crossing segmentation boundaries.

Network-related risks

Five steps to securing your identity infrastructure

Go back to the main article: Security

Roles, responsibilities, and permissions

12/16/2022 • 9 minutes to read

Clear lines of responsibility

| Group or individual role | Responsibility |
| --- | --- |
| Network Security | Typically existing network security team. Configuration and maintenance of Azure Firewall, Network Virtual Appliances (and associated routing), Web Application Firewall (WAF), Network Security Groups, Application Security Groups (ASG), and other cross-network traffic. |
| Network Management | Typically existing network operations team. Enterprise-wide virtual network and subnet allocation. |
| Server Endpoint Security | Typically IT operations, security, or jointly. Monitor and remediate server security (patching, configuration, endpoint security). |
| Incident Monitoring and Response | Typically security operations team. Incident monitoring and response to investigate and remediate security incidents in Security Information and Event Management (SIEM) or source console such as Microsoft Defender for Cloud or Azure AD Identity Protection. |
| Policy Management | Typically GRC team + Architecture. Apply governance based on risk analysis and compliance requirements. Set direction for use of Azure role-based access control (Azure RBAC), Microsoft Defender for Cloud, Administrator protection strategy, and Azure Policy to govern Azure resources. |

In an organization, several teams work together to make sure that the workload and the supporting

infrastructure are secure. To avoid confusion that can create security risks, define clear lines of responsibility and

separation of duties.

Based on Microsoft's experience with many cloud adoption projects, establishing clearly defined roles and

responsibilities for specific functions in Azure will avoid confusion that can lead to human and automation

errors creating security risk.

Do the teams have a clear view on responsibilities and individual/group access levels?

Designate the parties responsible for specific functions in Azure.

Clearly documenting and sharing the contacts responsible for each of these functions will create consistency

and facilitate communication. Based on our experience with many cloud adoption projects, this will avoid

confusion that can lead to human and automation errors that create security risk.

Designate groups (or individual roles) that will be responsible for key functions.

| Group or individual role | Responsibility |
| --- | --- |
| Identity Security and Standards | Typically Security Team + Identity Team jointly. Set direction for Azure AD directories, PIM/PAM usage, MFA, password/synchronization configuration, Application Identity Standards. |

NOTE

Assign permissions

Reference model example

Core services reference permissions

Application roles and responsibilities should cover different access level of each operational function. For example, publish

production release, access customer data, manipulate database records, and so on. Application teams should include

central functions listed in the preceding table.

Grant roles the appropriate permissions that start with least privilege and add more based on your operational

needs. Provide clear guidance to your technical teams that implement permissions. This clarity makes mistakes

easier to detect and correct, which reduces human errors such as overpermissioning.

Assign permissions at management group for the segment rather than the individual subscriptions. This

will drive consistency and ensure application to future subscriptions. In general, avoid granular and

custom permissions.

Consider the built-in roles in Azure before creating custom roles to grant the appropriate permissions to

VMs and other objects.

Security managers group membership may be appropriate for smaller teams/organizations where

security teams have extensive operational responsibilities.

When assigning permissions for a segment, consider consistency while allowing flexibility to accommodate

several organizational models. These models can range from a single centralized IT group to mostly

independent IT and DevOps teams.

This section uses this Reference model to demonstrate the considerations for assigning permissions for different

segments. Microsoft recommends starting from these models and adapting to your organization.

This segment hosts shared services utilized across the organization. These shared services typically include

Active Directory Domain Services, DNS/DHCP, System Management Tools hosted on Azure Infrastructure as a

Service (IaaS) virtual machines.

Security Visibility across all resources: For security teams, grant read-only access to security attributes for

all technical environments. This access level is needed to assess risk factors, identify potential mitigations, and

advise organizational stakeholders who accept the risk. See Security Team Visibility for more details.

Policy management across some or all resources: To monitor and enforce compliance with external (or

internal) regulations, standards, and security policy, assign appropriate permission to those roles. The roles and

permissions you choose will depend on the organizational culture and expectations of the policy program. See

Microsoft Cloud Adoption Framework for Azure.

Before defining the policies, consider:

How is the organization's security audited and reported? Is there mandatory reporting?

Are the existing security practices working?

Are there any requirements specific to your industry, government, or regulators?

Designate group(s) (or individual roles) for central functions that affect shared services and applications.

After the policies are set, continuously improve those standards incrementally. Make sure that the security

posture doesn't degrade over time by auditing and monitoring compliance. For information about

managing security standards of an organization, see governance, risk, and compliance (GRC).

Central IT operations across all resources: Grant permissions to the central IT department (often the

infrastructure team) to create, modify, and delete resources like virtual machines and storage. Contributor or

Owner roles are appropriate for this function.

Central networking group across network resources: To ensure consistency and avoid technical conflicts,

assign network resource responsibilities to a single central networking organization. These resources should

include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual

network appliances. The Network Contributor role is appropriate for this group. See Centralize Network

Management And Security for more details.

Resource Role Permissions: For most core services, administrative privileges required to manage them are

granted through the application (Active Directory, DNS/DHCP, System Management Tools), so no additional

Azure resource permissions are required. If your organizational model requires these teams to manage their

own VMs, storage, or other Azure resources, you can assign these permissions to those roles.

Segment reference permissions

Workload segments with autonomous DevOps teams will manage the resources associated with each

application. The actual roles and their permissions depend on the application size and complexity, the

application team size and complexity, and the culture of the organization and application team.

Service admin (Break Glass Account): Use the Service Administrator role only for emergencies and

initial setup. Do not use this role for daily tasks. See Emergency Access ('Break Glass' Accounts) for more details.

This segment permission design provides consistency while allowing flexibility to accommodate the range of

organizational models from a single centralized IT group to mostly independent IT and DevOps teams.

Security visibility across all resources: For security teams, grant read-only access to security attributes for

all technical environments. This access level is needed to assess risk factors, identify potential mitigations, and

advise organizational stakeholders who accept the risk. See Security Team Visibility.

Policy management across some or all resources: To monitor and enforce compliance with external (or

internal) regulations, standards, and security policy, assign appropriate permission to those roles. The roles and

permissions you choose will depend on the organizational culture and expectations of the policy program. See

Microsoft Cloud Adoption Framework for Azure.

IT Operations across all resources: Grant permission to create, modify, and delete resources. The purpose of

the segment (and resulting permissions) will depend on your organization structure.

Segments with resources managed by a centralized IT organization can grant the central IT department

(often the infrastructure team) permission to modify these resources.

Segments managed by independent business units or functions (such as a Human Resources IT Team)

can grant those teams permission to all resources in the segment.

Segments with autonomous DevOps teams don't need to grant permissions across all resources because

the resource role (below) grants permissions to application teams. For emergencies, use the service

admin account (break-glass account).

Central networking group across network resources: To ensure consistency and avoid technical conflicts,

assign network resource responsibilities to a single central networking organization. These resources should

include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual

network appliances. See Centralize Network Management And Security.

Security team visibility

IMPORTANT

Manage connected tenants

Resource Role Permissions: Segments with autonomous DevOps teams will manage the resources

associated with each application. The actual roles and their permissions depend on the application size and

complexity, the application team size and complexity, and the culture of the organization and application team.

Service Admin (Break Glass Account): Use the service admin role only for emergencies (and initial setup if

required). Do not use this role for daily tasks. See Emergency Access ('Break Glass' Accounts) for more details.

An application team needs to be aware of security initiatives to align their security improvement plans with the

outcome of those activities. Provide security teams read-only access to the security aspects of all technical

resources in their purview.

Security organizations require visibility into the technical environment to perform their duties of assessing and

reporting on organizational risk. Without this visibility, security will have to rely on information provided by

groups operating the environment, which have a potential conflict of interest (and other priorities).

Note that security teams may separately be granted additional privileges if they have operational

responsibilities or a requirement to enforce compliance on Azure resources.

For example, in Azure, assign security teams to the Security Readers permission that provides access to

measure security risk (without providing access to the data itself).

For enterprise security groups with broad responsibility for security of Azure, you can assign this permission

using:

Root management group

– for teams responsible for assessing and reporting risk on all resources

Segment management group(s)

– for teams with limited scope of responsibility (typically required

because of organizational boundaries or regulatory requirements)

Because security will have broad access to the environment (and visibility into potentially exploitable vulnerabilities), treat

security teams as critical impact accounts and apply the same protections as administrators. The Administration section

details these controls for Azure.

Suggested actions

Define a process for aligning communication, investigation, and hunting activities with the application team.

Following the principle of least privilege, establish access control to all cloud environment resources for

security teams with sufficient access to gain required visibility into the technical environment and to perform

their duties of assessing, and reporting on organizational risk.

Learn more

Engage your organization's security team

Does your security team have visibility into all existing subscriptions and cloud environments? How do they

discover new ones?

Ensure your security organization is aware of all enrollments and associated subscriptions connected to your

existing environment (via ExpressRoute or Site-to-Site VPN) and monitors them as part of the overall enterprise.

These Azure resources are part of your enterprise environment and security organizations require visibility into

them. Security organizations need this access to assess risk and to identify whether organizational policies and

applicable regulatory requirements are being followed.

Suggested actions

Next steps

Related links

The organization's cloud infrastructure should be well documented, with security team access to all resources

required for monitoring and insight. Frequent scans of the cloud-connected assets should be performed to

ensure no additional subscriptions or tenants have been added outside of organizational controls. Regularly

review Microsoft guidance to ensure security team access best practices are consulted and followed.

Ensure all Azure environments that connect to your production environment and network apply your

organization's policy, and IT governance controls for security.

You can discover existing connected tenants using a tool provided by Microsoft for guidance on permissions.

Restrict access to Azure resources on a need-to-know basis, starting with the principle of least privilege,

and add more access based on your operational needs.

Azure control plane security

For considerations about using management groups to reflect the organization's structure within an Azure

Active Directory (Azure AD) tenant, see CAF: Management group and subscription organization.

Back to the main article: Azure identity and access management considerations

Azure control plane security

12/16/2022 • 3 minutes to read

Key points

Roles and permission assignment

The term control plane refers to the management of resources in your subscription. These activities include

creating, updating, and deleting Azure resources as required by the technical team.

Azure Resource Manager handles all control plane requests and applies restrictions that you specify through

Azure role-based access control (Azure RBAC), Azure Policy, and locks. Apply those restrictions based on the

requirement of the organization.

We recommend implementing Infrastructure as Code and deploying application infrastructure through

automation and CI/CD for consistency and auditing purposes.

Restrict access based on a need-to-know basis and least privilege security principles.

Assign permissions to users, groups, and applications at a certain scope through Azure RBAC.

Use built-in roles when possible.

Prevent deletion or modification of a resource, resource group, or subscription through management locks.

Use less critical control in your CI/CD pipeline for development and test environments.

Is the workload infrastructure protected with Azure role-based access control (Azure RBAC)?

Azure role-based access control (Azure RBAC) provides the necessary tools to maintain separation of concerns

for administration and access to application infrastructure. Decide who has access to resources at the granular

level and what they can do with those resources. For example:

Developers can't access production infrastructure.

Only the SecOps team can read and manage Key Vault secrets.

If there are multiple teams, Project A team can access and manage Resource Group A and all resources

within.

Grant roles the appropriate permissions that start with least privilege and add more based on your

operational needs. Provide clear guidance to your technical teams that implement permissions. This clarity

makes mistakes easier to detect and correct, which reduces human errors such as overpermissioning.

Azure RBAC helps you manage that separation. You can assign permissions to users, groups, and applications at

a certain scope. The scope of a role assignment can be a subscription, a resource group, or a single resource. For

details, see Azure role-based access control (Azure RBAC).
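The scoping behavior described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not how Azure Resource Manager evaluates access: the `RoleAssignment` record, the principal names, and the prefix-matching rule are assumptions, and real evaluation also considers role definitions, deny assignments, and group membership.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # user, group, or application name (hypothetical)
    role: str       # for example "Reader" or "Contributor"
    scope: str      # subscription, resource group, or single resource


def scope_covers(assigned_scope: str, target_scope: str) -> bool:
    """An assignment applies to its own scope and to everything beneath it."""
    assigned = assigned_scope.rstrip("/").lower()
    target = target_scope.rstrip("/").lower()
    return target == assigned or target.startswith(assigned + "/")


def roles_for(principal: str, target_scope: str,
              assignments: Iterable[RoleAssignment]) -> Set[str]:
    """Collect every role the principal holds at the target scope."""
    return {
        a.role
        for a in assignments
        if a.principal == principal and scope_covers(a.scope, target_scope)
    }
```

In this model, a team that holds a role on Resource Group A holds it on every resource inside that group and on nothing in any other resource group, which matches the Project A example earlier in this section.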

Assign permissions at management group instead of individual subscriptions to drive consistency and

ensure application to future subscriptions.

Consider the built-in roles before creating custom roles to grant the appropriate permissions to resources

and other objects.

For example, assign security teams the Security Readers permission, which provides the access needed to

assess risk factors and identify potential mitigations without providing access to the data.

IMPORTANT

Management locks

Suggested actions

Learn more

Next steps

Treat security teams as critical accounts and apply the same protections as administrators.

Learn more

Azure RBAC documentation

Are there resource locks applied on critical parts of the infrastructure?

Unlike Azure role-based access control, management locks are used to apply a restriction across all users and

roles.

Critical infrastructure typically doesn't change often. Use management locks to prevent deletion or modification

of a resource, resource group, or subscription. Apply locks in use cases where only specific roles and users with

permissions should be able to delete or modify resources.

As an administrator, you may need to lock a subscription, resource group, or resource to prevent other users in

your organization from accidentally deleting or modifying critical resources. You can set the lock level to

CanNotDelete or ReadOnly. In the portal, the locks are called Delete and Read-only, respectively:

CanNotDelete means authorized users can still read and modify a resource, but they can't delete the resource.

ReadOnly means authorized users can read a resource, but they can't delete or update the resource. Applying

this lock is similar to restricting all authorized users to the permissions granted by the Reader role.

When you apply a lock at a parent scope, all resources within that scope inherit the same lock. Even resources

you add later inherit the lock from the parent. The most restrictive lock in the inheritance takes precedence.
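The inheritance rule can be illustrated with a small Python model. This is a sketch under stated assumptions rather than the platform's implementation: the `locks` dictionary, the three action names, and the exact-match scope lookup are all hypothetical, and the check covers locks only (Azure RBAC still decides whether the caller is authorized in the first place).

```python
from typing import Dict, Optional

# Higher rank means more restrictive: ReadOnly blocks updates and deletes,
# CanNotDelete blocks deletes only.
_RANK = {"CanNotDelete": 1, "ReadOnly": 2}


def effective_lock(scope: str, locks: Dict[str, str]) -> Optional[str]:
    """Return the most restrictive lock set on the scope or any parent scope."""
    parts = scope.strip("/").split("/")
    best: Optional[str] = None
    for i in range(1, len(parts) + 1):
        ancestor = "/" + "/".join(parts[:i])
        level = locks.get(ancestor)
        if level is not None and (best is None or _RANK[level] > _RANK[best]):
            best = level
    return best


def lock_allows(action: str, scope: str, locks: Dict[str, str]) -> bool:
    """Check whether a 'read', 'update', or 'delete' action passes the locks."""
    level = effective_lock(scope, locks)
    if level == "ReadOnly":
        return action == "read"
    if level == "CanNotDelete":
        return action != "delete"
    return True
```

Because the lookup walks every parent scope, a resource added to a locked resource group later is covered without any extra configuration, as the paragraph above describes.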

Unlike role-based access control, you use management locks to apply a restriction across all users and roles. To

learn about setting permissions for users and roles, see Azure role-based access control (Azure RBAC).

Identify critical infrastructure and evaluate resource lock suitability.

Set locks in the DevOps process carefully because modification locks can sometimes block automation. For

examples of those blocks and considerations, see Considerations before applying locks.

Restrict application infrastructure access to CI/CD only.

Use conditional access policies to restrict access to Microsoft Azure Management.

Configure role-based and resource-based authorization within Azure AD.

Manage access to Azure management with Conditional Access

Role-based and resource-based authorization

Grant or deny access to a system by verifying whether the accessor has the permissions to perform the

requested action.

Authentication

Related links

Back to the main article: Azure identity and access management considerations

Authentication with Azure AD

12/16/2022 • 11 minutes to read

Key points

Use identity-based authentication

Authentication is a process that grants or denies access to a system by verifying the accessor's identity. Use a

managed identity service for all resources to simplify overall management (such as password policies) and

minimize the risk of oversights or human errors. Azure Active Directory (Azure AD) is the one-stop shop for

identity and access management in Azure.

Use Managed Identities to access resources in Azure.

Keep the cloud and on-premises directories synchronized, except for high-privilege accounts.

Preferably use passwordless methods or opt for modern password methods.

Enable Azure AD conditional access based on key security attributes when authenticating all users, especially

for high-privilege accounts.

How is the application authenticated when communicating with Azure platform services?

Managed identities enable Azure Services to authenticate to each other without presenting explicit credentials

via code.

Managed identities for Azure resources is a feature of Azure Active Directory. Each of the Azure services that

supports managed identities for Azure resources is subject to its own timeline. Make sure you review the

availability status of managed identities for your resource and known issues before you begin. The feature

provides Azure services with an automatically managed identity in Azure AD. You can use the identity to

authenticate to any service that supports Azure AD authentication, including Key Vault, without any credentials

in your code. The managed identities for Azure resources feature is free with Azure AD for Azure subscriptions;

there's no additional cost.

There are two types of managed identities:

A system-assigned managed identity is enabled directly on an Azure service instance. When the identity is

enabled, Azure creates an identity for the instance in the Azure AD tenant that's trusted by the subscription of

the instance. After the identity is created, the credentials are provisioned onto the instance. The life cycle of a

system-assigned identity is directly tied to the Azure service instance that it's enabled on. If the instance is

deleted, Azure automatically cleans up the credentials and the identity in Azure AD.

A user-assigned managed identity is created as a standalone Azure resource. Through a create process, Azure

creates an identity in the Azure AD tenant that's trusted by the subscription in use. After the identity is

created, the identity can be assigned to one or more Azure service instances. The life cycle of a user-assigned

identity is managed separately from the life cycle of the Azure service instances to which it's assigned.

Authenticate with identity services instead of cryptographic keys. On Azure, Managed Identities eliminate the

need to store credentials that might be leaked inadvertently. When Managed Identity is enabled for an Azure

resource, it's assigned an identity that you can use to obtain Azure AD tokens. For more information, see Azure

AD-managed identities for Azure resources.
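To make the token step concrete, here is a minimal Python sketch of how code running on an Azure virtual machine with a managed identity might request a token from the Azure Instance Metadata Service. It's an illustration under assumptions, not the recommended integration: the function names are hypothetical, the `2018-02-01` API version and the Key Vault resource URI are example values, other hosting services expose a different local endpoint, and in practice an Azure identity client library would normally handle this call for you.

```python
import json
import urllib.parse
import urllib.request

# Link-local metadata endpoint that is reachable only from inside the VM.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource: str,
                        api_version: str = "2018-02-01") -> urllib.request.Request:
    """Build the GET request for a token; no credential is included."""
    query = urllib.parse.urlencode({"api-version": api_version, "resource": resource})
    # The Metadata header is required by the endpoint.
    return urllib.request.Request(f"{IMDS_TOKEN_URL}?{query}",
                                  headers={"Metadata": "true"})


def fetch_access_token(resource: str, timeout: float = 5.0) -> str:
    """Request a token. Works only on a VM that has a managed identity enabled."""
    with urllib.request.urlopen(build_token_request(resource), timeout=timeout) as resp:
        return json.load(resp)["access_token"]
```

Calling `fetch_access_token("https://vault.azure.net")` from such a VM would return a bearer token for Key Vault. The point to notice is that the code contains no secret, key, or connection string.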

For example, an Azure Kubernetes Service (AKS) cluster needs to pull images from Azure Container Registry

(ACR). To access the image, the cluster needs to know the ACR credentials. The recommended way is to enable

Managed Identities during cluster configuration. That configuration assigns an identity to the cluster and allows

it to obtain Azure AD tokens.

TIP

Choose a system with cross-platform support

This approach is secure because Azure handles the management of the underlying credentials for you.

The identity is tied to the lifecycle of the resource, in the AKS cluster example. When the resource is deleted,

Azure automatically deletes the identity.

Azure AD manages the timely rotation of secrets for you.

Here are the resources for the preceding example:

GitHub: Azure Kubernetes Service (AKS) Secure Baseline Reference Implementation.

The design considerations are described in Azure Kubernetes Service (AKS) production baseline.

Suggested actions

Review workload authentication and identify opportunities to convert explicit credentials (for example,

connection string and API key) to use managed identities.

For all new Azure workloads, standardize on using managed identities where applicable.

Learn more

What are managed identities for Azure resources?

What kind of authentication is required by application APIs?

Don't assume that API URLs used by a workload are hidden and can't get exposed to attackers. For example,

JavaScript code on a website can be viewed. A mobile application can be decompiled and inspected. Even for

internal APIs used only on the backend, a requirement of authentication can increase the difficulty of lateral

movement if an attacker gets network access. Typical mechanisms include API keys, authorization tokens and IP

restrictions.

Managed Identity can help an API be more secure because it replaces the use of human-managed service

principals and can request authorization tokens.

How is user authentication handled in the application?

Don't use custom implementations to manage user credentials. Instead, use Azure AD or other managed identity

providers such as Microsoft account or Azure AD B2C. Managed identity providers provide additional security features

such as modern password protections, multifactor authentication (MFA), and resets. In general, passwordless

protections are preferred. Also, modern protocols like OAuth 2.0 use token-based authentication with limited

timespan.

Are authentication tokens cached securely and encrypted when sharing across web servers?

Application code should first try to get OAuth access tokens silently from a cache before attempting to acquire a

token from the identity provider, to optimize performance and maximize availability. Tokens should be stored

securely and handled as any other credentials. When there's a need to share tokens across application servers

(instead of each server acquiring and caching its own), encryption should be used.

For information, see Acquire and cache tokens.
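The cache-first pattern described above can be sketched as follows. This is a simplified, in-memory illustration and not the behavior of any particular authentication library: the `TokenCache` class, the `acquire` callback (assumed to return a token and its absolute expiry time in epoch seconds), and the 60-second safety margin are all hypothetical. An authentication library would normally provide this for you, and a cache shared between servers would also need the encryption mentioned above.

```python
import time
from typing import Callable, Dict, Tuple


class TokenCache:
    """Return a cached token when it's still valid; otherwise acquire a new one."""

    def __init__(self,
                 acquire: Callable[[str], Tuple[str, float]],
                 skew_seconds: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._acquire = acquire      # calls the identity provider
        self._skew = skew_seconds    # refresh slightly before the real expiry
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get_token(self, scope: str) -> str:
        cached = self._entries.get(scope)
        if cached is not None:
            token, expires_at = cached
            if expires_at - self._skew > self._clock():
                return token         # silent path: no network call
        # Cache miss or token close to expiry: go to the identity provider.
        token, expires_at = self._acquire(scope)
        self._entries[scope] = (token, expires_at)
        return token
```

Keeping the identity provider call behind a single callback means repeated requests for the same scope cost one network round trip per token lifetime, which is the performance and availability benefit the paragraph refers to.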

Centralize all identity systems

IMPORTANT

TIP

Use passwordless authentication

Use a single identity provider for authentication on all platforms (operating systems, cloud providers, and

third-party services).

Azure AD can be used to authenticate Windows, Linux, Azure, Office 365, other cloud providers, and third-party

services as service providers.

For example, improve the security of Linux virtual machines (VMs) in Azure with Azure AD integration. For

details, see Log in to a Linux virtual machine in Azure using Azure Active Directory authentication.

Keep your cloud identity synchronized with the existing identity systems to ensure consistency and reduce

human errors.

Consistency of identities across cloud and on-premises will reduce human error and resulting security risk.

Teams managing resources in both environments need a consistent authoritative source to achieve security

assurances. For monitoring, if identity can be determined without an intermediate mapping process, security

efficiency is improved.

Synchronization is all about providing users an identity in the cloud based on their on-premises identity.

Whether they use a synchronized account for authentication or federated authentication, the users will

still need to have an identity in the cloud. This identity will need to be maintained and updated periodically. The

updates can take many forms, from title changes to password changes.

Start by evaluating the organization's on-premises identity solution and user requirements. This evaluation is

important, as it defines the technical requirements for how user identities will be created and maintained in the

cloud. For the majority of organizations, Active Directory is established on-premises and will be the on-premises

directory from which users will be synchronized, but this is not always the case.

Consider using Azure AD Connect for synchronizing Azure AD with your existing on-premises directory. For

migration projects, require that this task is completed before Azure migration and development

projects begin.

Don't synchronize high-privilege accounts to an on-premises directory. If an attacker gets full control of on-premises

assets, they can compromise a cloud account. This strategy will limit the scope of an incident. For more information, see

Critical impact account dependencies.

Synchronization is blocked by default in the default Azure AD Connect configuration. Make sure that you haven't

customized this configuration. For information about filtering in Azure AD, see Azure AD Connect sync: Configure filtering.

For more information, see hybrid identity providers.

Here are the resources for the preceding example:

The design considerations are described in Integrate on-premises Active Directory domains with Azure AD.

Learn more

Synchronize the hybrid identity systems

Attackers constantly scan public cloud IP ranges for open management ports. They attempt to exploit weak

credentials (password spray) and unpatched vulnerabilities in management protocols like SSH and RDP.

Preventing direct internet access to virtual machines stops a misconfiguration or oversight from becoming more

serious.

Attack methods have evolved to the point where passwords alone cannot reliably protect an account. Modern

authentication solutions including passwordless and multifactor authentication increase security posture

through strong authentication.

Remove the use of passwords, when possible. Also, require the same set of credentials to sign in and access the

resources on-premises or in the cloud. This requirement is crucial for accounts that require passwords, such as

admin accounts.

With modern authentication and security features in Azure AD, that basic password should be supplemented or

replaced with more secure authentication methods. Each organization has different needs when it comes to

authentication. Microsoft offers the following three passwordless authentication options that integrate with

Azure Active Directory (Azure AD):

Windows Hello for Business

Microsoft Authenticator app

FIDO2 security keys

It's recommended to follow a four-stage plan to become passwordless:

Develop password replacement offering

Reduce user-visible password surface area

Transition into password-less deployment

Eliminate passwords from the identity directory

The following methods of authentication are ordered by highest cost/difficulty to attack (strongest/preferred

options) to lowest cost/difficulty to attack:

Passwordless authentication. Some examples of this method include Windows Hello or Authenticator App.

MFA. Although this method is more effective than passwords, we recommend that you avoid relying on SMS

text message-based MFA. For more information, see Enable per-user Azure Active Directory MFA to secure

sign-in events.

Managed Identities. See Use identity-based authentication.

Those methods apply to all users, but should be applied first and strongest to accounts with administrative

privileges.

An implementation of this strategy is enabling single sign-on (SSO) to devices, apps, and services. By signing in

once using a single user account, you can grant access to all the applications and resources per business needs.

Users don't have to manage multiple sets of usernames and passwords. You can provision or de-provision

application access automatically. For more information, see Single sign-on.

Suggested actions

Develop a passwordless strategy that requires MFA for all users without significantly impacting operations.

Ensure policy and processes require restricting and monitoring direct internet connectivity by virtual

machines.

Learn more

Passwordless Strategy

Remove Virtual Machine (VM) direct internet connectivity

Use modern password protection

Enable conditional access

Require modern protections through methods that reduce the use of passwords. Modern authentication

protocols support strong controls such as MFA and should be used instead of legacy authentication methods.

Use of legacy methods increases risk of credential exposure.

Modern authentication is a method of identity management that offers more secure user authentication and

authorization. It's available for Office 365 hybrid deployments of Skype for Business server on-premises and

Exchange server on-premises, and split-domain Skype for Business hybrids.

Modern authentication is an umbrella term for a combination of authentication and authorization methods

between a client (for example, your laptop or your phone) and a server, as well as some security measures that

rely on access policies that you may already be familiar with. It includes:

Authentication methods: MFA; smart card authentication; client certificate-based authentication

Authorization methods: Microsoft's implementation of Open Authorization (OAuth)

Conditional access policies: Mobile Application Management (MAM) and Azure Active Directory (Azure AD)

Conditional Access

Review workloads that do not leverage modern authentication protocols and convert where possible. In

addition, standardize using modern authentication protocols for all future workloads.

For Azure, enable protections in Azure AD:

1. Configure Azure AD Connect to synchronize password hashes. For information, see Implement password

hash synchronization with Azure AD Connect sync.

2. Choose whether to automatically or manually remediate issues found in a report. For more information,

see Monitor identity risks.

For more information about supporting modern passwords in Azure AD, see the following articles:

What is Identity Protection?

Enforce on-premises Azure AD Password Protection for Active Directory Domain Services

Users at risk security report

Risky sign-ins security report

For more information about supporting modern passwords in Office 365, see the following article:

What is modern authentication?

Modern cloud-based applications are typically accessible over the internet, making network location-based

access inflexible and single-factor passwords a liability. Conditional access describes your authentication policy

for an access decision. For example, if a user is connecting from an Intune-managed corporate PC, they might

not be challenged for MFA every time, but if the user suddenly connects from a different device in a different

geography, MFA is required.

Grant access requests based on the requestors' trust level and the target resources' sensitivity.
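The access decision described above can be sketched as a function of a few sign-in signals. This is a simplified illustration under stated assumptions: the signal names, the `Decision` values, and the ordering of the checks are hypothetical and do not reproduce how Azure AD Conditional Access evaluates its policies.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_MFA = "require_mfa"
    BLOCK = "block"


@dataclass(frozen=True)
class SignIn:
    device_managed: bool      # for example, an Intune-managed corporate PC
    known_location: bool      # a location previously seen for this user
    legacy_protocol: bool     # the client cannot perform modern authentication
    sensitive_resource: bool  # the target resource is classified as sensitive


def evaluate(sign_in: SignIn) -> Decision:
    """Combine requestor trust and resource sensitivity into one decision."""
    if sign_in.legacy_protocol:
        return Decision.BLOCK          # legacy clients cannot satisfy an MFA challenge
    if sign_in.sensitive_resource:
        return Decision.REQUIRE_MFA    # sensitive targets always get a strong check
    if sign_in.device_managed and sign_in.known_location:
        return Decision.ALLOW          # trusted device in a familiar place
    return Decision.REQUIRE_MFA        # anything unusual triggers a challenge
```

A user on a managed device in a familiar location is allowed through, while the same user on a different device in a different geography is challenged for MFA.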

Are there any conditional access requirements for the application?

Workloads can be exposed over public internet and location-based network controls are not applicable. To

enable conditional access, understand what restrictions are required for the use case. For example, MFA is a

necessity for remote access; IP-based filtering can be used to enable ad hoc debugging (VPNs are preferred).

Configure Azure AD Conditional Access by setting up an access policy for Azure management based on your

operational needs. For information, see Manage access to Azure management with Conditional Access.

Suggested actions

Related links

Conditional access can be an effective way to phase out legacy authentication and associated protocols. The

policies must be enforced for all admins and other critical impact accounts. Start by using metrics and logs to

determine users who still authenticate with old clients. Next, disable any down-level protocols that aren't used,

and set up conditional access for all users who aren't using legacy protocols. Finally, give notice and guidance to

users about upgrading before blocking legacy authentication completely. For more information, see Azure AD

Conditional Access support for blocking legacy auth.
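The first step above, finding users who still authenticate with old clients, can be sketched as a scan over exported sign-in records. The record shape (`user` and `client_app` keys) and the set of legacy client labels are assumptions for illustration; adjust both to match what your sign-in logs actually contain.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed labels for legacy clients; replace with the values found in your logs.
LEGACY_CLIENTS = {"IMAP4", "POP3", "Authenticated SMTP", "Exchange ActiveSync"}


def legacy_auth_users(records: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Return (user, sign-in count) pairs for legacy clients, busiest first."""
    counts = Counter(
        record["user"]
        for record in records
        if record.get("client_app") in LEGACY_CLIENTS
    )
    return counts.most_common()
```

The resulting list identifies who needs notice and upgrade guidance before legacy authentication is blocked.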

Implement conditional access policies for this workload.

Learn more about Azure AD Conditional Access.

Grant or deny access to a system by verifying the accessor's identity.

Authorization

Back to the main article: Azure identity and access management considerations

Authorization with Azure AD

12/16/2022 • 5 minutes to read

Key points

Role-based authorization

Authorization is a process that grants or denies access to a system by verifying whether the accessor has the

permissions to perform the requested action. The accessor in this context is the workload (cloud application) or

the user of the workload. The action might be operational or related to resource management. There are two

main approaches to authorization: role-based and resource-based. Both can be configured with Azure AD.

Use a mix of role-based and resource-based authorization. Start with the principle of least privilege and add

more actions based on your needs.

Define clear lines of responsibility and separation of duties for application roles and the resources they can

manage. Consider the access levels of each operational function, such as permissions needed to publish a

production release, access customer data, or manipulate database records.

Do not provide permanent access for any critical accounts. Elevate access permissions based on

approval and for a limited time by using Azure AD Privileged Identity Management (Azure AD PIM).

This approach authorizes an action based on the role assigned to a user. For example, some actions require an

administrator role.

A role is a set of permissions. For example, the administrator role has permissions to perform all read, write, and

delete operations. Also, the role has a scope. The scope specifies the management groups, subscriptions, or

resource groups within which the role is allowed to operate.
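The relationship between a role, its permissions, and its scope can be sketched as follows. This is a conceptual model only: the slash-separated scope strings, the action names, and the class names are hypothetical and do not mirror the Azure RBAC data model or its resource ID format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Role:
    name: str
    actions: FrozenSet[str]  # the set of permissions the role carries


@dataclass(frozen=True)
class Assignment:
    principal: str
    role: Role
    scope: str  # hypothetical path, for example "/mg/contoso/sub/prod"


def in_scope(scope: str, target: str) -> bool:
    """A scope covers itself and everything nested beneath it."""
    scope = scope.rstrip("/")
    return target == scope or target.startswith(scope + "/")


def is_allowed(assignments: Iterable[Assignment], principal: str,
               action: str, target: str) -> bool:
    """Allow only if some assignment grants the action at or above the target."""
    return any(
        a.principal == principal
        and action in a.role.actions
        and in_scope(a.scope, target)
        for a in assignments
    )
```

Assigning at a higher scope, such as a management group, applies the same permissions to every subscription and resource group beneath it, including ones created later.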

Applying consistent permissions to resources via management groups or resource groups reduces proliferation

of custom, specific, per-resource permissions. Custom resource-based permissions are often unnecessary, and

can cause confusion because they do not carry their intent to new similar resources. This process can

accumulate into a complex legacy configuration that is difficult to maintain or change without fear of

breaking something, and negatively impacting both security and solution agility.

When assigning a role to a user consider what actions the role can perform and what is the scope of those

operations. Here are some considerations for role assignment:

Use built-in roles before creating custom roles to grant the appropriate permissions to VMs and other

objects. You can assign built-in roles to users, groups, service principals, and managed identities. For

more information, see Azure built-in roles.

If you need to create custom roles, grant roles with the appropriate action. Actions are categorized into

operational and data actions. Start with actions that have least privilege and add more based on your

operational or data access needs. Provide clear guidance to your technical teams that implement

permissions. For more information, see Azure custom roles.

If you have a segmentation strategy, assign permissions with a scope. For example, if you use

management group to support your strategy, set the scope to the group rather than the individual

subscriptions. This will drive consistency and ensure application to future subscriptions. When assigning

permissions for a segment, consider consistency while allowing flexibility to accommodate several

organizational models. These models can range from a single centralized IT group to mostly independent

IT and DevOps teams. For information about assigning scope, see AssignableScopes.

Resource-based authorization

NOTE

Authorization for critical accounts

You can use security groups to assign permissions. However, there are disadvantages. It can get complex

because the workload needs to keep track of which security groups correspond to which application

roles, for each tenant. Also, access tokens can grow significantly and Azure AD includes an "overage"

claim to limit the token size. See Microsoft identity platform access tokens.

Instead of granting permissions to specific users, assign access to Azure AD groups. In addition, build a

comprehensive delegation model that includes management group, subscription, or resource group

RBAC. For more information, see Azure role-based access control (Azure RBAC).

For information about implementing role-based authorization in an ASP.NET application, see Role-based

authorization.

Learn more

Avoid granular and custom permissions

Delegate administration in Azure AD

With role-based authorization, a user gets the same level of control on a resource based on the user's role.

However, there might be situations where you need to define access rights per resource. For example, in a

resource group, you want to allow some users to delete the resource; other users cannot. In such situations, use

resource-based authorization that authorizes an action based on a particular resource. Every resource has an

Owner. The Owner can delete the resource. Contributors can read and update it but can't delete it.

The owner and contributor roles for a resource are not the same as application roles.

You'll need to implement custom logic for resource-based authorization. That logic might be a mapping of

resources, Azure AD object (like role, group, user), and permissions.

For information and code sample about implementing resource-based authorization in an ASP.NET application,

see Resource-based authorization.
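The custom logic for resource-based authorization can be sketched as a lookup keyed by resource and Azure AD object. This is one possible shape under stated assumptions: the `Grants` mapping, the identifiers, and the permission names are hypothetical, and a real workload would persist the mapping and resolve group membership from the directory.

```python
from typing import Dict, Iterable, Set, Tuple

# (resource ID, Azure AD object ID) -> permissions granted on that one resource.
Grants = Dict[Tuple[str, str], Set[str]]


def can(grants: Grants, resource: str, principal_ids: Iterable[str],
        permission: str) -> bool:
    """Check a permission on one resource for a user and the groups they belong to."""
    return any(
        permission in grants.get((resource, principal_id), set())
        for principal_id in principal_ids
    )
```

Passing the user's own object ID together with their group IDs lets a single grant to a group cover all of its members.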

There might be cases when you need to do activities that require access to important resources. Those resources

might already be accessible to critical accounts such as an administrator account. Or, you might need to elevate

the access permissions until the activities are complete. Both approaches can pose significant risks.

Critical accounts are those which can produce a business-critical outcome, whether cloud administrators or

workload-specific privileged users. Compromise or misuse of such an account can have a detrimental-to-

material effect on the business and its information systems. It's important to identify those accounts and adopt

processes including close monitoring, and lifecycle management, including retirement.

Securing privileged access is a critical first step to establishing security assurances for business assets in a

modern organization. The security of most or all business assets in an IT organization depends on the integrity

of the privileged accounts used to administer, manage, and develop. Cyberattackers often target these accounts

and other elements of privileged access to gain access to data and systems using credential theft attacks like

Pass-the-Hash and Pass-the-Ticket.

Protecting privileged access against determined adversaries requires you to take a complete and thoughtful

approach to isolate these systems from risks.

Are there any processes and tools leveraged to manage privileged activities?

Learn more

Related links

Do not provide permanent access for any critical accounts and lower permissions when access is no longer

required. Some strategies include:

Just-in-time privileged access to Azure AD and Azure resources.

Time-bound access.

Approval-based access.

A break-glass process for emergency access.

Limit write access to production systems to service principals. No user accounts should have regular write-

access.

Ensure there's a process for disabling or deleting administrative accounts that are unused.

You can use native and third-party options to elevate access permissions for at least highly privileged if not all

activities. Azure AD Privileged Identity Management (Azure AD PIM) is the recommended native solution on

Azure.

For more information about PIM, see What is Azure AD Privileged Identity Management?
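The time-bound and approval-based strategies above can be sketched as a small state model. This illustrates the concept only and is not how Azure AD PIM is implemented: the class, its fields, and the rule that requesters cannot approve their own elevation are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ElevationRequest:
    principal: str
    role: str
    duration: timedelta
    approved_by: Optional[str] = None
    activated_at: Optional[datetime] = None

    def approve(self, approver: str) -> None:
        # Assumed separation-of-duties rule: no self-approval.
        if approver == self.principal:
            raise ValueError("requester cannot approve their own elevation")
        self.approved_by = approver

    def activate(self, now: datetime) -> None:
        if self.approved_by is None:
            raise PermissionError("elevation has not been approved")
        self.activated_at = now

    def is_active(self, now: datetime) -> bool:
        """Access exists only inside the approved, time-bound window."""
        if self.activated_at is None:
            return False
        return self.activated_at <= now < self.activated_at + self.duration
```

Because access is derived from the activation time and duration, permissions lapse on their own and nobody has to remember to revoke them.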

Establish lifecycle management for critical impact accounts

Back to the main article: Azure identity and access management considerations

Network security

12/16/2022 • 2 minutes to read

Checklist

Azure security benchmark

Azure services

Reference architecture

Protect assets by placing controls on network traffic originating in Azure, between on-premises and Azure

hosted resources, and traffic to and from Azure. If security measures aren't in place attackers can gain access, for

instance, by scanning across public IP ranges. Proper network security controls can provide defense-in-depth

elements that help detect, contain, and stop attackers who gain entry into your cloud deployments.

How have you secured the network of your workload?

Segment your network footprint and create secure communication paths between segments. Align the

network segmentation with overall enterprise segmentation strategy.

Design security controls that identify and allow or deny traffic, access requests, and application

communication between segments.

Protect all public endpoints with Azure Front Door, Application Gateway, Azure Firewall, and Azure DDoS

Protection.

Mitigate DDoS attacks with Azure DDoS Protection for critical workloads.

Keep virtual machines private and secure when connecting to the internet with Azure Virtual Network NAT

(NAT gateway).

Control network traffic between subnets (east-west) and application tiers (north-south).

Protect from data exfiltration attacks through a defense-in-depth approach with controls at each layer.

The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to

help secure the services you use in Azure:

The questions in this section are aligned to the Azure Security Benchmarks Network Security.

Azure Virtual Network

Azure Firewall

Azure Virtual Network NAT

Azure ExpressRoute

Azure Private Link

Azure DDoS Protection

Here are some reference architectures related to network security:

Hub-spoke network topology in Azure

Deploy highly available NVAs

Azure Kubernetes Service (AKS) production baseline

Next steps

Related links

We recommend applying as many of the best practices as early as possible, and then working to retrofit any

gaps over time as you mature your security program.

Data protection

Combine network controls with application, identity, and other technical control types. This approach is effective

in preventing, detecting, and responding to threats outside the networks you control. For more information, see

these articles:

Applications and services security

Identity and access management considerations

Data protection

Ensure that resource grouping and administrative privileges align to the segmentation model. For more

information, see Administrative account security.

Go back to the main article: Security

Implement network segmentation patterns on Azure

12/16/2022 • 8 minutes to read

IMPORTANT

Key points

A unified enterprise segmentation strategy guides technical teams to consistently segment access using

networking, applications, identity, and any other access controls. Create segmentation in your network footprint

by defining perimeters. The main reasons for segmentation are:

The ability to group related assets that are a part of (or support) workload operations.

Isolation of resources.

Governance policies set by the organization.

Assume compromise is the recommended cybersecurity mindset and the ability to contain an attacker is vital in

protecting information systems. Model an attacker able to achieve a foothold at various points within the

workload and establish controls to mitigate further expansion.

Network controls can secure interactions between perimeters. This approach can strengthen the security

posture and contain risks in a breach because the controls can detect, contain, and stop attackers from gaining

access to an entire workload.

Containment of attack vectors within an environment is critical. However, to be effective in cloud environments,

traditional approaches may prove inadequate and security organizations may need to evolve their methods.

Traditional segmentation approaches typically fail to achieve their goals because they were not developed to

align with business use cases and application workloads. Often this results in overwhelming

complexity requiring broad firewall exceptions.

An emerging best practice is to adopt a Zero Trust strategy based on user, device, and

application identities. In contrast to network access controls that are based on elements such as source and

destination IP address, protocols, and port numbers, Zero Trust enforces and validates access control at access

time. This avoids the need to play a prediction game for an entire deployment, network, or subnet; only the

destination resource needs to provide the necessary access controls.

Azure Network Security Groups can be used for basic layer 3 and 4 access controls between Azure Virtual

Networks, their subnets, and the internet.

Azure Web Application Firewall and the Azure Firewall can be used for more advanced network access

controls that require application layer support.

Local Admin Password Solution (LAPS) or a third-party Privileged Access Management can set strong local

admin passwords and just-in-time access to them.

How does the organization implement network segmentation?

This article highlights some Azure networking features that create segments and restrict access to individual

services.

Align your network segmentation strategy with the enterprise segmentation model. This will reduce confusion and

challenges with different technical teams (networking, identity, applications, and so on). Each team should not develop

their own segmentation and delegation models that don't align with each other.

What is segmentation?

Suggested actions

Learn more

Azure features for segmentation

Create software-defined perimeters in your networking footprint and secure communication paths between

them.

Establish a complete zero trust segmentation strategy.

Align technical teams in the enterprise on micro segmentation strategies for legacy applications.

Azure Virtual Networks (VNets) are created in private address spaces. By default, no traffic is allowed

between any two VNets. Open paths only when it's really needed.

Use Network Security Groups (NSG) to secure communication between resources within a VNet.

Use Application Security Groups (ASGs) to define traffic rules for the underlying VMs that run the workload.

Use Azure Firewall to filter traffic flowing between cloud resources, the internet, and on-premises.

Place resources in a single VNet, if you don't need to operate in multiple regions.

If you need to be in multiple regions, have multiple VNets that are connected through peering.

For advanced configurations, use a hub-spoke topology. A VNet is designated as a hub in a given region for

all the other VNets as spokes in that region.

You can create software-defined perimeters in your networking footprint by using the various Azure services

and features. When a workload (or parts of a given workload) is placed into separate segments, you can control

traffic from/to those segments to secure communication paths. If a segment is compromised, you will be able to

better contain the impact and prevent it from laterally spreading through the rest of your network. This strategy

aligns with the key principle of Zero Trust model published by Microsoft that aims to bring world class security

thinking to your organization.

Create a risk containment strategy that blends proven approaches including:

Existing network security controls and practices

Native security controls available in Azure

Zero trust approaches

For information about creating a segmentation strategy, see Enterprise segmentation strategy.

When you operate on Azure, you have many segmentation options.

Segmentation patterns

Pattern 1: Single VNet

1. Subscription: A high-level construct, which provides platform powered separation between entities. It's

intended to carve out boundaries between large organizations within a company and communication

between resources in different subscriptions needs to be explicitly provisioned.

2. Virtual Network (VNets): Created within a subscription in private address spaces. They provide network

level containment of resources with no traffic allowed by default between any two virtual networks. Like

subscriptions, any communication between virtual networks needs to be explicitly provisioned.

3. Network Security Groups (NSG): An access control mechanism for controlling traffic between resources

within a virtual network and also with external networks, such as the internet, other virtual networks.

NSGs can take your segmentation strategy to a granular level by creating perimeters for a subnet, a VM,

or a group of VMs. For information about possible operations with subnets in Azure, see Subnets (Azure

Virtual Networks).

4. Application Security Groups (ASGs): Similar to NSGs but are referenced with an application context. It

allows you to group a set of VMs under an application tag and define traffic rules that are then applied to

each of the underlying VMs.

5. Azure Firewall: A cloud native stateful Firewall as a service, which can be deployed in your VNet or in

Azure Virtual WAN hub deployments for filtering traffic flowing between cloud resources, the internet,

and on-premises. You create rules or policies (using Azure Firewall or Azure Firewall Manager) specifying

allow/deny traffic using layer 3 to layer 7 controls. You can also filter traffic going to the internet using

both Azure Firewall and third parties by directing some or all traffic through third-party security

providers for advanced filtering & user protection.

Here are some common patterns for segmenting a workload in Azure from a networking perspective. Each

pattern provides a different type of isolation and connectivity. Choose a pattern based on your organization's

needs.

All the components of the workload reside in a single VNet. This pattern is appropriate when you are operating in a

single region because a VNet cannot span multiple regions.

Common ways for securing segments, such as subnets or application groups, are by using NSGs and ASGs. You

can also use Network Virtual Appliances (NVAs) from Azure Marketplace or Azure Firewall to enforce and

secure this segmentation.

In this image, Subnet1 has the database workload. Subnet2 has the web workloads. You can configure NSGs that

allow Subnet1 to only communicate with Subnet2 and Subnet2 can only communicate with the internet.
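The effect of such NSG rules can be sketched as a priority-ordered rule list in which the first matching rule decides the outcome. NSG rules are processed in priority order with lower numbers first, but everything else here is an assumption for illustration: the address ranges for Subnet1 and Subnet2, the rule set, and the omission of ports, protocols, direction, and the platform's default rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rule:
    priority: int      # lower number is evaluated first
    source: str        # CIDR prefix of the traffic source
    destination: str   # CIDR prefix of the traffic destination
    allow: bool


def evaluate(rules: Iterable[Rule], src_ip: str, dst_ip: str,
             default_allow: bool = False) -> bool:
    """Return the action of the first rule, by priority, that matches the flow."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if (src in ipaddress.ip_network(rule.source)
                and dst in ipaddress.ip_network(rule.destination)):
            return rule.allow
    return default_allow


# Hypothetical inbound rules for Subnet1 (database, 10.0.1.0/24):
# only Subnet2 (web, 10.0.2.0/24) may reach it.
SUBNET1_INBOUND = [
    Rule(100, "10.0.2.0/24", "10.0.1.0/24", True),
    Rule(200, "0.0.0.0/0", "10.0.1.0/24", False),
]
```

With these rules, `evaluate(SUBNET1_INBOUND, "10.0.2.4", "10.0.1.5")` is allowed, while a flow from an internet address to the same database host is denied by the lower-priority catch-all rule.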

Pattern 2: Multiple VNets that communicate through peering

Consider a use case where you have multiple workloads that are placed in separate subnets. You can place

controls that will allow one workload to communicate to the backend of another workload.

The resources are spread or replicated in multiple VNets. The VNets can communicate through peering. This

pattern is appropriate when you need to group applications into separate VNets. Or, you need multiple Azure

regions. One benefit is the built-in segmentation because you have to explicitly peer one VNet to another. Virtual

network peering is not transitive. You can further segment within a VNet by using NSGs and ASGs as shown in

pattern 1.
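Because peering is not transitive, two VNets can communicate only when they are peered with each other directly. The following sketch models that rule; the VNet names and helper functions are hypothetical, and the model ignores routing through a gateway or network appliance.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def build_peerings(pairs: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Store each peering as an unordered pair of VNet names."""
    return {frozenset(pair) for pair in pairs}


def can_communicate(peerings: Set[FrozenSet[str]], a: str, b: str) -> bool:
    """Only a direct peering counts; a path through a third VNet does not."""
    return a == b or frozenset((a, b)) in peerings
```

With VNet A peered to B and B peered to C, A and C still cannot communicate until they are explicitly peered, which is the built-in segmentation this pattern provides.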

Pattern 3: Multiple VNets in a hub and spoke model

A VNet is designated as a hub in a given region for all the other VNets as spokes in that region. The hub and its

spokes are connected through peering. All traffic passes through the hub that can act as a gateway to other hubs

in different regions. In this pattern, the security controls are set up at the hubs so that they get to segment and

govern the traffic in between other VNets in a scalable way. One benefit of this pattern is, as your network

topology grows, the security posture overhead does not grow (except when you expand to new regions).

The recommended native option is Azure Firewall. This option works across both VNets and subscriptions to

govern traffic flows using layer 3 to layer 7 controls. You can define your communication rules and apply them

consistently. Here are some examples:

VNet 1 cannot communicate with VNet 2, but it can communicate with VNet 3.

VNet 1 cannot access public internet except for *.github.com.

With Azure Firewall Manager preview, you can centrally manage policies across multiple Azure Firewalls and

enable DevOps teams to further customize local policies.

TIP

Here are some resources that illustrate provisioning of resources in a hub and spoke topology:

GitHub: Hub and Spoke Topology Sandbox.

The design considerations are described in Hub-spoke network topology in Azure.

Pattern comparison

| Considerations | Pattern 1 | Pattern 2 | Pattern 3 |
| --- | --- | --- | --- |
| Connectivity/routing: how each segment communicates to each other | System routing provides default connectivity to any workload in any subnet. | Same as pattern 1. | No default connectivity between spoke networks. A layer 3 router, such as the Azure Firewall, in the hub is required to enable connectivity. |
| Network level traffic filtering | Traffic is allowed by default. Use NSG, ASG to filter traffic. | Same as pattern 1. | Traffic between spoke virtual networks is denied by default. Open selected paths to allow traffic through Azure Firewall configuration. |
| Centralized logging | NSG, ASG logs for the virtual network. | Aggregate NSG, ASG logs across all virtual networks. | Azure Firewall logs all accepted/denied traffic sent through the hub. View the logs in Azure Monitor. |
| Unintended open public endpoints | DevOps can accidentally open a public endpoint through incorrect NSG, ASG rules. | Same as pattern 1. | Accidentally opened public endpoint in a spoke will not enable access because the return packet will get dropped through stateful firewall (asymmetric routing). |
| Application level protection | NSG or ASG provides network layer support only. | Same as pattern 1. | Azure Firewall supports FQDN filtering for HTTP/S and MSSQL for outbound traffic and across virtual networks. |

Next step

Secure network connectivity

Related links

For information about setting up peering, see Virtual network peering.

For best practices about using Azure Firewall in various configurations, see Azure Firewall

Architecture Guide.

For information about different access policies and control flow within a VNet, see Azure Virtual

Network Subnet.

Back to the main article: Network security

Azure services for securing network connectivity

12/16/2022 • 12 minutes to read

Key points

Connectivity between network segments

It's often the case that the workload and the supporting components of a cloud architecture will need to access

external assets. These assets can be on-premises, devices outside the main virtual network, or other Azure

resources. Those connections can be over the internet or networks within the organization.

Protect non-publicly accessible services with network restrictions and IP firewall.

Use Network Security Groups (NSGs) or Azure Firewall to protect and control traffic within the VNet.

Use Service Endpoints or Private Link for accessing Azure PaaS services.

Use Azure Firewall to protect against data exfiltration attacks.

Restrict access to backend services to a minimal set of public IP addresses.

Use Azure controls over third-party solutions for basic security needs. These controls are native to the

platform and are easy to configure and scale.

Define access policies based on the type of workload and control flow between the different application tiers.

When designing a workload, you'll typically start by provisioning an Azure Virtual Network (VNet) in a private

address space which has the workload. No traffic is allowed by default between any two virtual networks. If

there's a need, define the communication paths explicitly. One way of connecting VNets is through Virtual

network peering.

A key aspect of protecting VMs in a VNet is to control the flow of network traffic. The network interfaces on the

VMs allow them to communicate with other VMs, the internet, and on-premises networks. To control traffic on

VMs within a VNet (and subnet), use Application Security Groups (ASGs). ASGs allow you to group a set of VMs

under an application tag and define traffic rules. Those rules are then applied to each of the underlying VMs.

A VNet is segmented into subnets based on business requirements. Ensure that proper network security controls are configured to allow or deny inbound network traffic to, or outbound network traffic from, subnets within the larger network space.

By default, VMs are provisioned with private IP addresses. You can use Azure IP addressing to determine how and where incoming traffic is translated onto the virtual network.

A good Azure IP addressing schema provides flexibility, room for growth, and integration with on-premises

networks. The schema ensures that communication works for deployed resources, minimizes public exposure of

systems, and gives the organization flexibility in its network. If not properly designed, systems might not be able

to communicate, and additional work will be required to remediate.

How do you isolate and protect traffic within the workload VNet?

To secure communication within a VNet, set rules that inspect traffic. Then, allow or deny traffic to or from specific sources, and route it to the specified destinations.

TIP

Review the rule set and confirm that the required services are not unintentionally blocked.

For traffic between subnets (also referred to as east-west traffic), it's recommended to use Network Security Groups (NSG). NSGs allow you to define rules that check the source and destination address, protocol, and port of inbound and outbound traffic. The address can be a single IP address, multiple IP addresses, an Azure service tag, or an entire subnet.

If NSGs are being used to isolate and protect the application, review the rule set to confirm that required services are not unintentionally blocked and that access is not more permissive than expected.
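A rule-set review is easier when you can replay sample flows against the rules. The following is a minimal sketch that assumes rules are evaluated in priority order (lowest number first) and that the first matching rule decides the outcome. The `Rule` type and `evaluate` function are hypothetical, service tags are not modeled, and the final default deny is a simplification for this sketch.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    source: str            # CIDR prefix, for example "10.0.1.0/24"
    destination: str       # CIDR prefix
    protocol: str          # "Tcp", "Udp", or "*" for any
    port: Optional[int]    # None matches any destination port
    allow: bool


def evaluate(rules, src, dst, protocol, port) -> bool:
    """Return the decision of the first matching rule; deny if none match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (ip_address(src) in ip_network(rule.source)
                and ip_address(dst) in ip_network(rule.destination)
                and rule.protocol in ("*", protocol)
                and rule.port in (None, port)):
            return rule.allow
    return False


# Hypothetical rule set: the web subnet may reach the data subnet on 1433 only.
RULES = [
    Rule(100, "10.0.1.0/24", "10.0.2.0/24", "Tcp", 1433, True),
    Rule(200, "10.0.0.0/16", "10.0.2.0/24", "*", None, False),
]

print(evaluate(RULES, "10.0.1.5", "10.0.2.4", "Tcp", 1433))  # allowed by rule 100
print(evaluate(RULES, "10.0.1.5", "10.0.2.4", "Tcp", 22))    # denied by rule 200
```

Replaying the flows that required services depend on shows whether a rule blocks them, and replaying flows that should be denied shows whether access is too permissive.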

For advanced networking controls, use Azure Firewall. It can be used to perform deep packet inspection on both east-west and north-south traffic. Firewall rules can be defined as policies and centrally managed. An alternative solution is to use network virtual appliances (NVAs) that check inbound (ingress) and outbound (egress) traffic and filter it based on rules.

How do you route network traffic through NVAs for security boundary policy enforcement, auditing, and inspection?

Use User Defined Routes (UDR) to control the next hop for traffic between Azure, on-premises, and internet

resources. The routes can be applied to virtual appliance, virtual network gateway, virtual network, or internet.

For example, you need to inspect all ingress traffic from a public load balancer. One way is to host an NVA in a subnet that allows traffic only if certain criteria are met. That traffic is sent to the subnet that hosts an internal load balancer, which routes the traffic to the backend services.

You can also use NVAs for egress traffic. For instance, all workload traffic is routed by using UDR to another

subnet. That subnet has an internal load balancer that distributes requests to the NVA (or a set of NVAs). These

NVAs direct traffic to the internet using their individual public IP addresses.
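To reason about which route applies to a destination, it helps to remember that the most specific matching prefix is selected. The following sketch shows longest-prefix selection over a hypothetical route table in which a default route sends internet-bound traffic to an NVA. The addresses and the `next_hop` function are assumptions for illustration.

```python
from ipaddress import ip_address, ip_network

# Hypothetical route table: (address prefix, next hop type, next hop address).
ROUTES = [
    ("0.0.0.0/0", "VirtualAppliance", "10.0.100.4"),
    ("10.0.0.0/16", "VnetLocal", None),
    ("192.168.0.0/16", "VirtualNetworkGateway", None),
]


def next_hop(destination: str):
    """Return the route whose prefix is the longest match for the destination."""
    address = ip_address(destination)
    matches = [route for route in ROUTES if address in ip_network(route[0])]
    return max(matches, key=lambda r: ip_network(r[0]).prefixlen, default=None)


print(next_hop("10.0.1.4"))     # stays inside the virtual network
print(next_hop("192.168.5.9"))  # goes to on-premises through the gateway
print(next_hop("52.1.2.3"))     # internet-bound traffic goes to the NVA
```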

Here are the resources for the preceding example:

GitHub: Automated failover for network virtual appliances.

The design considerations are described in Deploy highly available NVAs.

Azure Firewall can serve as an NVA. Azure supports third-party network device providers. They're available in

Azure Marketplace.

How do you get insights about ingoing and outgoing traffic of this workload?

As a general rule, configure and collect network traffic logs. If you use NSGs, capture and analyze NSG flow logs

to monitor performance and security. The NSG flow logs enable Traffic Analytics to gain insights into internal

and external traffic flows of the application.
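As a rough illustration of the kind of analysis that flow logs make possible, the following sketch counts denied inbound flows per source address. It assumes comma-separated flow tuples in the order timestamp, source IP, destination IP, source port, destination port, protocol, direction (I or O), and decision (A or D). Verify the tuple layout against the flow log version you collect. The sample records are made up and use documentation address ranges.

```python
from collections import Counter

# Assumed field order of a flow tuple (check your flow log version).
FIELDS = ("timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
          "protocol", "direction", "decision")


def parse_flow_tuple(line: str) -> dict:
    """Split one comma-separated flow tuple into named fields."""
    values = line.strip().split(",")
    return dict(zip(FIELDS, values[:len(FIELDS)]))


def denied_inbound_by_source(lines) -> Counter:
    """Count denied inbound flows for each source IP address."""
    counts = Counter()
    for line in lines:
        flow = parse_flow_tuple(line)
        if flow["direction"] == "I" and flow["decision"] == "D":
            counts[flow["src_ip"]] += 1
    return counts


# Made-up sample records for illustration only.
SAMPLE = [
    "1542110377,203.0.113.7,10.0.0.4,44931,3389,T,I,D",
    "1542110379,203.0.113.7,10.0.0.4,44932,22,T,I,D",
    "1542110402,10.0.0.4,198.51.100.9,50112,443,T,O,A",
    "1542110437,198.51.100.23,10.0.0.5,35370,443,T,I,A",
]

print(denied_inbound_by_source(SAMPLE).most_common())
```

A source that is repeatedly denied on management ports is a candidate for closer investigation.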

For information about defining network perimeters, see Network segmentation.

Can the VNet and subnet handle growth?

Typically, you'll add more network resources as the design matures. Most organizations end up adding more resources to networks than initially planned. Refactoring to accommodate the extra resources is a labor-intensive process. There is limited security value in creating a very large number of small subnets and then trying to map network access controls (such as security groups) to each of them.

Plan your subnets based on roles and functions that use the same protocols. That way, you can add resources to

the subnet without making changes to security groups that enforce network level access controls.
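One way to check that an address plan leaves room for growth is to carve the VNet address space into one subnet per role and count what remains unallocated. The following sketch uses a hypothetical address space and hypothetical role names. The /24 subnet size is an assumption, not a recommendation.

```python
from ipaddress import ip_network

VNET = ip_network("10.10.0.0/16")                   # hypothetical address space
ROLES = ["web", "business", "data", "management"]   # hypothetical roles


def plan_subnets(vnet, roles, new_prefix=24):
    """Assign one subnet per role, in order, from the VNet address space."""
    return dict(zip(roles, vnet.subnets(new_prefix=new_prefix)))


plan = plan_subnets(VNET, ROLES)
spare = 2 ** (24 - VNET.prefixlen) - len(plan)  # unallocated /24 subnets

for role, subnet in plan.items():
    print(f"{role}: {subnet} ({subnet.num_addresses} addresses)")
print(f"unallocated /24 subnets: {spare}")
```

Because each subnet maps to a role rather than to an individual resource, new resources of the same role can join the subnet without changes to the security groups.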

Don't use all open rules that allow inbound and outbound traffic to and from 0.0.0.0-255.255.255.255. Use a least-privilege approach and only allow relevant protocols. It will reduce your overall network attack surface on the subnet. All open rules provide a false sense of security because such a rule enforces no security.

The exception is when you want to use security groups only for network logging purposes.

Suggested actions

Design virtual networks and subnets for growth. We recommend planning subnets based on common roles and functions that use common protocols for those roles and functions. This allows you to add resources to the subnet without making changes to security groups that enforce network level access controls.

Use NSG or consider using Azure Firewall to protect and control traffic within the VNet.

Learn more

Azure Firewall documentation

Design virtual network subnet security

Design an IP addressing schema for your Azure deployment

Network security groups

Communication with backend services

Internet edge traffic

As you design the workload, consider security for internet traffic. Does the workload or parts of it need to be

accessible from public IP addresses? What level of access should be given to prevent unauthorized access?

Internet edge traffic (also called North-South traffic) represents network connectivity between resources used by the workload and the internet. An internet edge strategy should be designed to mitigate as many attacks from the internet as possible, and to detect or block threats. There are two primary choices that provide security controls and monitoring:

Azure solutions such as Azure Firewall and Web Application Firewall (WAF).

Azure provides networking solutions to restrict access to individual services. Use multiple levels of security, such as a combination of IP filtering and firewall rules, to prevent application services from being accessed by unauthorized actors.

Network virtual appliances (NVAs). You can use Azure Firewall or third-party solutions available in Azure

Marketplace.

Azure security features are sufficient for common attacks, easy to configure, and easy to scale. Third-party solutions often have advanced features, but they can be hard to configure if they don't integrate well with fabric controllers. From a cost perspective, Azure options tend to be cheaper than partner solutions.

Information revealing the application platform, such as HTTP banners containing framework information (X-Powered-By, X-ASPNET-VERSION), is commonly used by malicious actors when mapping attack vectors of the application.

HTTP headers, error messages, and website footers should not contain information about the application

platform. Azure CDN can be used to separate the hosting platform from end users. Azure API Management

offers transformation policies that allow you to modify HTTP headers and remove sensitive information.
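The effect of such a header transformation can be sketched in a few lines. This is not an API Management policy. It only illustrates the filtering logic, and the `REVEALING_HEADERS` set and `strip_platform_headers` function are hypothetical names. Extend the set with any other headers your platform emits.

```python
# Header names are compared in lowercase because HTTP header names
# are case-insensitive.
REVEALING_HEADERS = {"x-powered-by", "x-aspnet-version"}


def strip_platform_headers(headers: dict) -> dict:
    """Return a copy of the response headers without platform details."""
    return {name: value for name, value in headers.items()
            if name.lower() not in REVEALING_HEADERS}


response_headers = {
    "Content-Type": "text/html",
    "X-Powered-By": "ASP.NET",
    "X-ASPNET-VERSION": "4.0.30319",
}
print(strip_platform_headers(response_headers))
```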

Suggested action

Consider using CDN for the workload to limit platform detail exposure to attackers.

Learn more

Azure CDN documentation

Connection with Azure PaaS services

Most workloads are composed of multiple tiers where several services can serve each tier. Common examples

of tiers are web front ends, business processes, reporting and analysis, backend infrastructure, and so on.

Application resources allowing multiple methods to publish app content (such as FTP, Web Deploy) should have

the unused endpoints disabled. For Azure Web Apps, SCM is the recommended endpoint and it can be protected

separately with network restrictions for sensitive scenarios.

Public access to any workload should be judiciously approved and planned, as public entry points represent a

key possible vector of compromise. When allowing access from public IPs to any back-end service, limiting the

range of allowed IPs can significantly reduce the attack surface of that service. For example, if using Azure Front

Door, you can limit backend tiers to allow Front Door IPs only; or if a partner uses your API, limit access to only

their nominated public IP(s).

How do you configure traffic flow between multiple application tiers?

Use Azure Virtual Network Subnet to allocate separate address spaces for different elements or tiers within the

workload. Then, define different access policies to control traffic flows between those tiers and restrict access.

You can implement those restrictions through IP filtering or firewall rules.

Do you need to restrict access to the backend infrastructure?

Restrict access to backend services to a minimal set of public IP addresses with App Services IP restrictions or

Azure Front Door.

Web applications typically have one public entry point and don't expose subsequent APIs and database servers over the internet. Expose only a minimal set of public IP addresses based on need, and only to those who really need it. For example, when using gateway services, such as Azure Front Door, it's possible to restrict access only to a set of Front Door IP addresses and lock down the infrastructure completely.
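The idea of limiting a backend to a minimal set of source addresses can be sketched as an allow-list check. The ranges below are documentation addresses that stand in for real gateway or partner addresses, and `ALLOWED_SOURCES` and `is_allowed` are hypothetical names. In practice, you would configure this with access restrictions on the service rather than in application code.

```python
from ipaddress import ip_address, ip_network

# Hypothetical allow list; documentation ranges stand in for real addresses.
ALLOWED_SOURCES = [
    ip_network("203.0.113.0/28"),    # for example, a gateway address range
    ip_network("198.51.100.17/32"),  # for example, a partner's public IP
]


def is_allowed(client_ip: str) -> bool:
    """Return True only if the client address falls in an allowed range."""
    address = ip_address(client_ip)
    return any(address in network for network in ALLOWED_SOURCES)


print(is_allowed("203.0.113.5"))    # inside the allowed /28
print(is_allowed("203.0.113.20"))   # outside the allowed /28
print(is_allowed("198.51.100.17"))  # the single allowed partner address
```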

Suggested action

Restrict and protect application publishing methods.

Learn more

Set up Azure App Service access restrictions

Azure Front Door documentation

Deploy your app to Azure App Service using FTP/S

The workload will often need to communicate with other Azure services. For example, it might need to get

secrets from Azure Key Vault. Avoid making connections over the public internet.

Does the workload use secure ways to access Azure PaaS services?

Common approaches for accessing PaaS services are Service Endpoints or Private Links. Both approaches

restrict access to PaaS endpoints only from authorized virtual networks, effectively mitigating data intrusion

risks and associated impact to application availability.

With Service Endpoints, the communication path is secure because you can reach the PaaS endpoint without

needing a public IP address on the VNet. Most PaaS services support communication through service endpoints.

For a list of generally available services, see Virtual Network service endpoints.

Another mechanism is through Azure Private Link. Private Endpoint uses a private IP address from your VNet,

effectively bringing the service into your VNet. For details, see What is Azure Private Link?.

Service Endpoints provide service level access to a PaaS service, whereas Private Link provides direct access to a specific PaaS resource to mitigate data exfiltration risks, such as malicious admin access. Private Link is a paid service and has meters for inbound and outbound data processed. Private Endpoints are also charged.

On-premises to cloud connectivity

Next step

How do you control outgoing traffic of Azure PaaS services where Private Link isn't available?

Use NVAs and Azure Firewall (for supported protocols) as a reverse proxy to restrict access to only authorized

PaaS services for services where Private Link isn't supported. Use Azure Firewall to protect against data

exfiltration concerns.

In a hybrid architecture, the workload runs partly on-premises and partly in Azure. Have security controls that check traffic entering the Azure virtual network from the on-premises datacenter.

How do you establish cross premises connectivity?

Use Azure ExpressRoute to set up cross premises connectivity to on-premises networks. This service uses a private, dedicated connection through a third-party connectivity provider. The private connection extends your on-premises network into Azure. This way, you can reduce the risk of unauthorized access to the company's on-premises information assets.

How do you access VMs?

Use Azure Bastion to log into your VMs and avoid public internet exposure using SSH and RDP with private IP addresses only. You can also disable RDP/SSH access to VMs and use VPN or ExpressRoute to access these virtual machines for remote management.

Do the cloud or on-premises VMs have direct internet connectivity for users that may perform interactive logins?

Attackers constantly scan public cloud IP ranges for open management ports and attempt low-cost attacks such

as common passwords and known unpatched vulnerabilities. Develop processes and procedures to prevent

direct internet access of VMs with logging and monitoring to enforce policies.

How is internet traffic routed?

Decide how to route internet traffic. You can use on-premises security devices (also called forced tunneling) or allow connectivity through cloud-based network security devices.

For production enterprise, allow cloud resources to start and respond to internet requests directly through cloud network security devices defined by your internet edge strategy. This approach fits the Nth datacenter paradigm, that is, Azure datacenters are a part of your enterprise. It scales better for an enterprise deployment because it removes hops that add load, latency, and cost.

Another option is to force tunnel all outbound internet traffic from on-premises through site-to-site VPN. Or, use

a cross-premise WAN link. Network security teams have greater security and visibility to internet traffic. Even

when your resources in the cloud try to respond to incoming requests from the internet, the responses are force

tunneled. This option fits a datacenter expansion use case and can work well for a quick proof of concept, but

scales poorly because of the increased traffic load, latency, and cost. For those reasons, we recommend that you

avoid forced tunneling.

Secure endpoints

Related links

For information about controlling next hop for traffic, see Azure Virtual Network User Defined Routes (UDR).

For information about web application firewalls, see Application Gateway WAF.

For information about Network Appliances from Azure Marketplace, see Network Appliances.

For information about cross premises connectivity, see Azure site-to-site VPN or ExpressRoute.

For information about using VPN/ExpressRoute to access these virtual machines for remote management, see

Disable RDP/SSH access to Azure Virtual Machines.

Go back to the main article: Network security

Best practices for endpoint security on Azure

12/16/2022 • 6 minutes to read

Key points

Public endpoints

Web application firewalls (WAFs)

An endpoint is an address exposed by a web application so that external entities can communicate with it. A malicious or an inadvertent interaction with the endpoint can compromise the security of the application and even the entire system. One way to protect the endpoint is by placing filter controls on the network traffic that it receives, such as defining rule sets. A defense-in-depth approach can further mitigate risks. Include supplemental controls that protect the endpoint if the primary traffic controls fail.

This article describes ways in which you can protect web applications with Azure services and features. For product documentation, see Related links.

Protect all public endpoints with Azure Front Door, Application Gateway, Azure Firewall, Azure DDoS

Protection.

Use web application firewall (WAF) to protect web workloads.

Protect workload publishing methods and restrict ways that are not in use.

Mitigate DDoS attacks. Use Standard protection for critical workloads where outage would have business

impact. Also consider CDN as another layer of protection.

Develop processes and procedures to prevent direct internet access of virtual machines (such as proxy or

firewall) with logging and monitoring to enforce policies.

Implement an automated and gated CI/CD deployment process.

A public endpoint receives traffic over the internet. The endpoints make the service easily accessible to attackers.

Service Endpoints and Private Link can be leveraged to restrict access to PaaS endpoints only from authorized

virtual networks, effectively mitigating data intrusion risks and associated impact to application availability.

Service Endpoints provide service level access to a PaaS service, while Private Link provides direct access to a

specific PaaS resource to mitigate data exfiltration risks such as malicious admin scenarios.

Configure service endpoints and private links where appropriate.

Are all public endpoints of this workload protected?

An initial design decision is to assess whether you need a public endpoint at all. If you do, protect it by using

these mechanisms.

For more information, see Virtual Network service endpoints and What is Azure Private Endpoint?

WAFs provide a basic level of security for web applications. WAFs are appropriate for organizations that have invested in application security, because WAFs provide additional defense-in-depth mitigation.

WAFs mitigate the risk of an attacker exploiting commonly seen security vulnerabilities in applications. This mechanism is an important mitigation because attackers target web applications as an ingress point into an organization (similar to a client endpoint).

External application endpoints should be protected against common attack vectors, from Denial of Service (DoS)

attacks like Slowloris to app-level exploits, to prevent potential application downtime due to malicious intent.

TIP

Suggested actions

Azure Firewall

TIP

Combination approach

Azure-native technologies such as Azure Firewall, Application Gateway/Azure Front Door, WAF, and DDoS

Network Protection can be used to achieve requisite protection (Azure DDoS Protection).

Azure Application Gateway has WAF capabilities to inspect web traffic and detect attacks at the HTTP layer. It's a

load balancer and HTTP(S) full reverse proxy that can do secure socket layer (SSL) encryption and decryption.

For example, your workload is hosted in App Service Environment (ILB ASE). The APIs are consolidated internally and exposed to external users. This external exposure could be achieved using an Application Gateway. This service is a load balancer. It forwards requests to the internal API Management service, which in turn consumes the APIs deployed in the ASE. Application Gateway is also configured over port 443 for secured and reliable outbound calls.

The design considerations for the preceding example are described in Publishing internal APIs to external users.

Azure Front Door and Azure Content Delivery Network (CDN) also have WAF capabilities.

Protect all public endpoints with appropriate solutions such as Azure Front Door, Application Gateway, Azure Firewall, Azure DDoS Protection, or any third-party solution.

Learn more

What is Azure Firewall?

Azure DDoS Protection overview

Azure Front Door documentation

What is Azure Application Gateway?

Use Azure Firewall to protect the entire virtual network against potentially malicious traffic from the internet and other external locations. It inspects incoming traffic and allows only permitted requests to pass through.

A common design is to implement a DMZ or a perimeter network in front of the application. The DMZ is a

separate subnet with the firewall.

The design considerations are described in Deploy highly available NVAs.

When you want higher security and there's a mix of web and non-web workloads in the virtual network, use both Azure Firewall and Application Gateway. There are several ways in which those two services can work together.

For example, you want to filter egress traffic. You want to allow connectivity to a specific Azure Storage Account but not others. You'll need fully qualified domain name (FQDN)-based filters. In this case, run Firewall and Application Gateway in parallel.

Another popular design is when you want Azure Firewall to inspect all traffic and WAF to protect web traffic, and

the application needs to know the client's source IP address. In this case, place Application Gateway in front of

Firewall. Conversely, you can place Firewall in front of WAF if you want to inspect and filter traffic before it

reaches the Application Gateway.

For more information, see Firewall and Application Gateway for virtual networks.

Authentication

Mitigate DDoS attacks

Suggested action

Learn more

Adopt DevOps

It's challenging to write concise firewall rules for networks where different cloud resources dynamically spin up

and down. Use Microsoft Defender for Cloud to detect misconfiguration risks.

Disable insecure legacy protocols for internet-facing services. Legacy authentication methods are among the top

attack vectors for cloud-hosted services. Those methods don't support other factors beyond passwords and are

prime targets for password spraying, dictionary, or brute force attacks.

In a distributed denial-of-service (DDoS) attack, the server is overloaded with fake traffic. DDoS attacks are common and can be debilitating. An attack can completely block access or take down services. Make sure all business-critical web applications and services have DDoS mitigation beyond the default defenses, so that the application doesn't experience downtime that can negatively impact the business.

Microsoft recommends adopting advanced protection for any services where downtime will have negative

impact on the business.

How do you implement DDoS protection?

Here are some considerations:

DDoS protection at the infrastructure level in which your workload runs. Azure infrastructure has built-in

defenses for DDoS attacks.

DDoS protection at the network layer (layer 3). Azure provides additional protection for services provisioned in a virtual network.

DDoS protection with caching. Content delivery network (CDN) can add another layer of protection. In a DDoS attack, a CDN intercepts the traffic and stops it from reaching the backend server. Azure CDN is natively protected. Azure also supports popular CDNs that are protected with a proprietary DDoS mitigation platform.

Advanced DDoS protection. In your security baseline, consider features with monitoring techniques that use

machine learning to detect anomalous traffic and proactively protect your application before service

degradation occurs.

For information about Azure DDoS Protection services, see Azure DDoS Protection documentation.

Identify critical workloads that are susceptible to DDoS attacks and enable Distributed Denial of Service (DDoS)

mitigations for all business-critical web applications and services.

For a list of reference architectures that demonstrate the use of DDoS protection, see Azure DDoS Protection

reference architectures.

Developers shouldn't publish their code directly to app servers.

Does the organization have a CI/CD process for publishing code in this workload?

Implement a lifecycle of continuous integration and continuous delivery (CI/CD) for applications. Have processes and tools in place that aid in an automated and gated CI/CD deployment process.

How are the publishing methods secured?

Next step

Related links

Here are the resources for the preceding AKS example:

GitHub: Azure Kubernetes Service (AKS) Secure Baseline Reference Implementation.

The design considerations are described in Azure Kubernetes Service (AKS) production baseline.

Data exfiltration is a common attack where an internal or external malicious actor does an unauthorized data

transfer. Most often access is gained because of lack of network controls.

Network virtual appliance (NVA) solutions and Azure Firewall (for supported protocols) can be leveraged as a

reverse proxy to restrict access to only authorized PaaS services for services where Private Link is not yet

supported (Azure Firewall).

Configure Azure Firewall or a third-party next generation firewall to protect against data exfiltration concerns.

Are there controls in the workload design to detect and protect from data exfiltration?

Choose a defense-in-depth design that can protect network communications at various layers, such as a hub-spoke topology. Azure provides several controls to support the layered design:

Use Azure Firewall to allow or deny traffic using layer 3 to layer 7 controls.

Use Azure Virtual Network User Defined Routes (UDR) to control next hop for traffic.

Control traffic with Network Security Groups (NSGs) between resources within a virtual network, internet,

and other virtual networks.

Secure the endpoints through Azure Private Link and Private Endpoints.

Detect and protect at deep levels through packet inspection.

Detect attacks and respond to alerts through Microsoft Sentinel and Microsoft Defender for Cloud.

Network controls are not sufficient in blocking data exfiltration attempts. Harden the protection with proper identity

controls, key protection, and encryption. For more information, see these sections:

Data protection considerations

Identity and access management considerations

Have you considered a cloud application security broker (CASB) for this workload?

CASBs provide a central point of control for enforcing policies. They provide rich visibility, control over data

travel, and sophisticated analytics to identify and combat cyberthreats across all Microsoft and third-party cloud

services.

Azure firewall documentation

Azure Marketplace networking apps

Azure Firewall

Network Security Groups (NSG)

What is Azure Web Application Firewall on Azure Application Gateway?

What is Azure Private Link?

Go back to the main article: Network security

Data protection considerations

12/16/2022 • 2 minutes to read

Checklist

Azure security benchmark

Reference architecture

Next steps

Related links

Classify, protect, and monitor sensitive data assets using access control, encryption, and logging in Azure.

Provide controls on data at rest and in transit.

How are you managing encryption for this workload?

Use identity-based storage access controls.

Use built-in features for data encryption for Azure services.

Classify all stored data and encrypt it.

Protect data moving over a network through encryption at all points so that it's not accessed by unauthorized users.

Store keys in a managed key vault service with identity-based access control and audit policies.

Rotate keys and other secrets frequently.

The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to

help secure the services you use in Azure:

The questions in this section are aligned to the Azure Security Benchmarks Data Protection.

Here are some reference architectures related to secure storage:

Using Azure file shares in a hybrid environment

DevSecOps in Azure

We recommend that you review the practices and tools implemented as part of the development cycle.

Data protection

Back to the main article: Security

Data encryption in Azure

12/16/2022 • 9 minutes to read

Key points

Azure encryption features

Data can be categorized by its state:

Data at rest. All information storage objects, containers, and types that exist statically on physical media, whether magnetic or optical disk.

Data in transit. Data that is being transferred between components, locations, or programs.

In a cloud solution, a single business transaction can lead to multiple data operations where data moves from

one storage medium to another. To provide complete data protection, it must be encrypted on storage volumes

and while it's transferred from one point to another.

Use identity-based storage access controls.

Use standard and recommended encryption algorithms.

Use only secure hash algorithms (SHA-2 family).

Classify your data at rest and use encryption.

Encrypt virtual disks.

Use an additional key encryption key (KEK) to protect your data encryption key (DEK).

Protect data in transit through encrypted network channels (TLS/HTTPS) for all client/server communication.

Use TLS 1.2 on Azure.
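As a small illustration of the hash algorithm guidance above, the following sketch computes a SHA-256 digest (a member of the SHA-2 family) with the Python standard library and compares digests in constant time. The function names `sha256_hex` and `digests_match` are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def digests_match(data: bytes, expected_hex: str) -> bool:
    """Compare digests in constant time to avoid leaking timing information."""
    return hmac.compare_digest(sha256_hex(data), expected_hex)


digest = sha256_hex(b"abc")
print(digest)
print(digests_match(b"abc", digest))   # same data, same digest
print(digests_match(b"abd", digest))   # different data, different digest
```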

Azure provides built-in features for data encryption in many layers that participate in data processing. We recommend that you enable the encryption capability for each service. The encryption is handled automatically using Azure-managed keys. This requires almost no user interaction.

We recommend implementing identity-based storage access controls. Authentication with a shared key (like a

Shared Access Signature) doesn't permit the same flexibility and control as identity-based access control. The

leak of a shared key might allow indefinite access to a resource, whereas a role-based access control can be

identified and authenticated more strongly.

Storage in a cloud service like Azure is architected and implemented quite differently than on-premises

solutions to enable massive scaling, modern access through REST APIs, and isolation between tenants. Cloud

service providers make multiple methods of access control over storage resources available. Examples include

shared keys, shared signatures, anonymous access, and identity provider-based methods.

Consider some built-in features of Azure Storage:

Identity-based access. Supports access through Azure Active Directory (Azure AD) and key-based authentication mechanisms, such as Symmetric Shared Key Authentication, or Shared Access Signature (SAS).

Built-in encryption. All stored data is encrypted by Azure Storage. Data cannot be read by a tenant if it has not been written by that tenant. This feature provides control over cross-tenant data leakage.

Region-based controls. Data remains only in the selected region, and three synchronous copies of data are maintained within that region. Azure Storage also provides detailed activity logging on an opt-in basis.

Firewall features. The firewall provides an additional layer of access control and storage threat protection to detect anomalous access and activities.

For the complete set of features, see Azure Storage Service encryption.

Suggested action

Identify provider methods of authentication and authorization that are the least likely to be compromised and that enable more fine-grained role-based access controls over storage resources.

Learn more

For more information, see Authorize access to blobs using Azure Active Directory.

Standard encryption algorithms

Does the organization use industry standard encryption algorithms instead of creating their own?

Organizations should not develop and maintain their own encryption algorithms. Avoid using custom encryption algorithms or direct cryptography in your workload. These methods rarely stand up to real-world attacks.

Secure standards already exist on the market and should be preferred. If custom implementation is required, developers should use well-established cryptographic algorithms and secure standards. Use Advanced Encryption Standard (AES) as a symmetric block cipher; AES-128, AES-192, and AES-256 are acceptable.

Developers should use cryptography APIs built into operating systems instead of non-platform cryptography libraries. For .NET, follow the .NET Cryptography Model.

We advise using standard and recommended encryption algorithms.

For more information, refer to Choose an algorithm.

Are modern hashing functions used?

Applications should use the SHA-2 family of hash algorithms (SHA-256, SHA-384, SHA-512).

Data at rest

All important data should be classified and encrypted with an encryption standard. Classify and protect all information storage objects. Use encryption to make sure the contents of files cannot be accessed by unauthorized users.

Data at rest is encrypted by default in Azure, but is your critical data classified and tagged, or labeled so that it can be audited?

Your most sensitive data might include business, financial, healthcare, or personal information. Discovering and classifying this data can play a pivotal role in your organization's information protection approach. It can serve as infrastructure for:

Helping to meet standards for data privacy and requirements for regulatory compliance.
Various security scenarios, such as monitoring (auditing) and alerting on anomalous access to sensitive data.
Controlling access to and hardening the security of databases that contain highly sensitive data.

Suggested action

Classify your data. Consider using Data Discovery & Classification in Azure SQL Database.

Data classification

A crucial initial exercise for protecting data is to organize it into categories based on certain criteria. The classification criteria can be your business needs, compliance requirements, and the type of data.

Depending on the category, you can protect it by:

Using standard encryption mechanisms.
Enforcing security governance through policies.
Conducting audits to make sure the security measures are compliant.

One way of classifying data is through the use of tags.
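To make the tag-based approach concrete, here is a minimal sketch of checking resources against the controls their classification requires. The tag key `data-classification`, the three labels, and the control names are all assumptions made for this example; they are not Azure-defined values.

```python
from dataclasses import dataclass, field

# Hypothetical policy: which controls each classification label requires.
REQUIRED_CONTROLS = {
    "public": set(),
    "internal": {"encryption_at_rest"},
    "confidential": {"encryption_at_rest", "access_audit"},
}


@dataclass
class Resource:
    name: str
    tags: dict = field(default_factory=dict)
    controls: set = field(default_factory=set)


def missing_controls(resource: Resource) -> set:
    """Return the controls a resource still needs for its classification tag.

    Untagged or unknown labels fall back to the strictest class, so
    unclassified data is never assumed to be public.
    """
    label = resource.tags.get("data-classification")
    required = REQUIRED_CONTROLS.get(label, REQUIRED_CONTROLS["confidential"])
    return required - resource.controls
```

Treating untagged resources as the strictest class is a design choice for this sketch: it makes gaps in classification show up in an audit instead of passing silently.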

Does the organization encrypt virtual disk files for virtual machines that are associated with this workload?

There are many options to store files in the cloud. Cloud-native apps typically use Azure Storage. Apps that run on VMs use the VM's disks to store files. VMs use virtual disk files as virtual storage volumes, and those files exist in blob storage.

Consider a hybrid solution. Files can move from on-premises to the cloud, from the cloud to on-premises, or

between services hosted in the cloud. One strategy is to make sure that the files and their contents aren't

accessible to unauthorized users. You can use authentication-based access controls to prevent unauthorized

downloading of files. However, that is not enough. Have a backup mechanism to secure the virtual disk files in

case authentication and authorization or its configuration is compromised. There are several approaches. You

can encrypt the virtual disk files. If an attempt is made to mount disk files, the contents of the files cannot be

accessed because of the encryption.

We recommend that you enable virtual disk encryption. For information about how to encrypt Windows VM

disks, see Quickstart: Create and encrypt a Windows VM with the Azure CLI.

Azure-based virtual disks are stored as files in a Storage account. If no encryption is applied to a virtual disk,

and an attacker manages to download a virtual disk image file, it can be mounted and inspected at the attacker's

leisure as if they had physical access to the source computer. Encrypting virtual disk files helps prevent attackers

from gaining access to the contents of those disk files in the event they are able to download them. Depending

on the sensitivity of the information stored on the disk, unencrypted access could represent a critical risk to

confidential business data (such as a SQL database) or identity (such as an AD Domain Controller).

An example of virtual disk encryption is Azure Disk Encryption.

Azure Disk Encryption helps protect and safeguard your data to meet your organizational security and

compliance commitments. It uses the BitLocker feature of Windows (or DM-Crypt on Linux) to provide volume encryption for the OS and data disks of Azure virtual machines (VMs). It is integrated with Azure Key Vault to help you control and manage the disk encryption keys and secrets.

Virtual machines use virtual disk files as storage volumes, and those files exist in a cloud service provider's blob storage

system. These files can be moved from on-premises to cloud systems, from cloud systems to on-premises, or

between cloud systems. Due to the mobility of these files, it's recommended that the files and the contents are

not accessible to unauthorized users.

Does the organization use identity-based storage access controls for this workload?

There are many ways to control access to data: shared keys, shared signatures, anonymous access, and identity provider-based methods. Use Azure Active Directory (Azure AD) and role-based access control (RBAC) to grant access. For

more information, see Identity and access management considerations.

Does the organization protect keys in this workload with an additional key encryption key (KEK)?

Use more than one encryption key in an encryption at rest implementation. Storing an encryption key in Azure Key Vault ensures secure key access and central management of keys. Use an additional key encryption key (KEK) to protect your data encryption key (DEK).

Suggested actions

Identify unencrypted virtual machines via Microsoft Defender for Cloud or script, and encrypt via Azure Disk Encryption. Ensure all new virtual machines are encrypted by default and regularly monitor for unprotected disks.

Learn more

Azure Disk Encryption for virtual machines and virtual machine scale sets

Data in transit

Data in transit should be encrypted at all points to ensure data integrity.

Protecting data in transit should be an essential part of your data protection strategy. Because data is moving

back and forth from many locations, we generally recommend that you always use SSL/TLS protocols to

exchange data across different locations.

For data moving between your on-premises infrastructure and Azure, consider appropriate safeguards such as

HTTPS or VPN. When sending encrypted traffic between an Azure virtual network and an on-premises location

over the public internet, use Azure VPN Gateway.

Does the workload communicate over encrypted network traffic only?

Any network communication between client and server where man-in-the-middle attacks can occur must be encrypted. All website communication should use HTTPS, no matter the perceived sensitivity of transferred data. Man-in-the-middle attacks can occur anywhere on the site, not just on login forms.

This mechanism can be applied to use cases such as:

Web applications and APIs for all communication with clients.

Data moving across a service bus from on-premises to the cloud and the other way around, or during an input/output process.

In certain architecture styles such as microservices, data must be encrypted during communication between the

services.
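One way to enforce "encrypted traffic only" inside application code is to reject plain-HTTP endpoints before any request is made. The sketch below assumes service endpoints arrive as URL strings from configuration; `require_https` is a hypothetical helper name, and the example URL is a placeholder.

```python
from urllib.parse import urlsplit


def require_https(url: str) -> str:
    """Return `url` unchanged if it uses HTTPS; raise ValueError otherwise."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "https" or not parts.netloc:
        raise ValueError(f"refusing non-HTTPS endpoint: {url!r}")
    return url


# Usage: validate every configured endpoint once, at startup.
orders_api = require_https("https://api.example.com/orders")
```

Failing at startup, instead of at the first call, keeps a misconfigured service from ever sending data in the clear.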

What TLS version is used across workloads?

Using the latest version of TLS is preferred. All Azure services support TLS 1.2 on public HTTPS endpoints.

Migrate solutions to support TLS 1.2 and use this version by default.

When traffic from clients using older versions of TLS is minimal, or it's acceptable to fail requests made with an

older version of TLS, consider enforcing a minimum TLS version. For information about TLS support in Azure

Storage, see Remediate security risks with a minimum version of TLS.
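On the client side, a minimum TLS version can also be enforced in code. The following is a minimal sketch with Python's standard `ssl` module; `make_client_context` is a name invented for this example, and it covers only the calling application, not the service-side minimum-version setting described above.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate and host name verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=make_client_context())`.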

Sometimes you need to isolate your entire communication channel between your on-premises and the cloud

infrastructure by using either a virtual private network (VPN) or ExpressRoute. For more information, see these

articles:

Extending on-premises data solutions to the cloud

Configure a Point-to-Site VPN connection to a VNet using native Azure certificate authentication: Azure

portal

For more information, see Protect data in transit.

Is there any portion of the application that doesn't secure data in transit?

All data should be encrypted in transit using a common encryption standard. Determine if all components in the solution are using a consistent standard. There are times when encryption is not possible because of technical limitations; make sure the reason is clear and valid.

Suggested actions

Identify workloads using unencrypted sessions and configure the service to require encryption.

Learn more

Encrypt data in transit
Azure encryption overview

Next steps

While it's important to protect data through encryption, it's equally important to protect the keys that provide access to the data.

Key and secret management

Related links

Identity and access management services authenticate and grant permission to users, partners, customers, applications, services, and other entities. For security considerations, see Azure identity and access management considerations.

Back to the main article: Data protection

Key and secret management considerations in Azure

12/16/2022 • 5 minutes to read

Encryption is an essential tool for security because it restricts access. However, it's equally important to protect the secrets (keys, certificates) that provide access to the data.

Key points

Use identity-based access control instead of cryptographic keys.
Use standard and recommended encryption algorithms.
Store keys and secrets in a managed key vault service. Control permissions with an access model.
Rotate keys and other secrets frequently. Replace expired or compromised secrets.

Identity-based access control

Organizations shouldn't develop and maintain their own encryption algorithms. There are many ways to provide access control over storage resources, such as:

Shared keys
Shared signatures
Anonymous access
Identity provider-based methods

Secure standards already exist on the market and should be preferred. AES should be used as a symmetric block cipher; AES-128, AES-192, and AES-256 are acceptable. Crypto APIs built into operating systems should be used where possible, instead of non-platform crypto libraries. For .NET, make sure you follow the .NET Cryptography Model.

Do you prioritize authentication through identity services for a workload over cryptographic keys?

Protection of cryptographic keys can often get overlooked or implemented poorly. Managing keys securely with application code is especially difficult and can lead to mistakes such as accidentally publishing sensitive access keys to public code repositories.

Use of identity-based options for storage access control is recommended. This option uses role-based access control (RBAC) over storage resources. Use RBAC to assign permissions to users, groups, and applications at a certain scope. Identity systems such as Azure Active Directory (Azure AD) offer a secure and usable experience for access control with built-in mechanisms for handling key rotation, monitoring for anomalies, and others.

NOTE

Grant access based on the principle of least privilege. Giving more privileges than necessary can lead to data compromise.

Suppose you need to store sensitive data in Azure Blob Storage. You can use Azure AD and RBAC to authenticate

a service principal that has the required permissions to access the storage. For more information about the

feature, reference Authorize access to blobs and queues using Azure Active Directory.

TIP

Using SAS tokens is a common way to control access. SAS tokens are created by using the service owner's Azure AD credentials. The tokens are created per resource and you can use Azure RBAC to restrict access. SAS tokens have a time limit, which controls the window of exposure. Here are the resources for the preceding example:

GitHub: Azure Cognitive Services Reference Implementation.
The design considerations are described in Speech transcription with Azure Cognitive Services.

Key storage

To prevent security leaks, store the following keys and secrets in a secure store:

API keys
Database connection strings
Data encryption keys
Passwords

Sensitive information shouldn't be stored within the application code or configuration. An attacker gaining read access to source code shouldn't gain knowledge of application and environment-specific secrets.

Store all application keys and secrets in a managed key vault service such as Azure Key Vault or HashiCorp Vault. Storing encryption keys in a managed store further limits access. The workload can access the secrets by authenticating against Key Vault by using managed identities. That access can be restricted with Azure RBAC.

Make sure no keys and secrets for any environment types (Dev, Test, or Production) are stored in application configuration files or CI/CD pipelines. Developers can use Visual Studio Connected Services or local-only files to access credentials.

Have processes that periodically detect exposed keys in your application code. An option is Credential Scanner. For information about configuring the task, see Credential Scanner task.

Do you have an access model for key vaults to grant access to keys and secrets?

To secure access to your key vaults, control permissions to keys and secrets through an access model. For more information, see Access model overview.

Suggested actions

Consider using Azure Key Vault for secrets and keys.

Operational considerations

Who is responsible for managing keys and secrets in the application context?

Key and certificate rotation is often the cause of application outages. Even Azure has experienced expired

certificates. It's critical that the rotation of keys and certificates be scheduled and fully operationalized. The

rotation process should be automated and tested to ensure effectiveness. Azure Key Vault supports key rotation

and auditing.
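A scheduled rotation check can be as simple as comparing each secret's age and expiry against a policy. The sketch below is illustrative only: the `SecretRecord` shape, the 90-day maximum age, and the 14-day expiry warning are assumptions for this example, and in practice the records would come from the key store's own metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SecretRecord:
    name: str
    created: datetime  # when the current version was issued
    expires: datetime  # hard expiry of the current version


def due_for_rotation(records, now, max_age=timedelta(days=90), warn=timedelta(days=14)):
    """Return the sorted names of secrets that are too old or close to expiry."""
    due = []
    for record in records:
        too_old = now - record.created >= max_age
        expiring = record.expires - now <= warn
        if too_old or expiring:
            due.append(record.name)
    return sorted(due)


# Usage with fixed dates so the result is reproducible.
utc = timezone.utc
now = datetime(2023, 1, 1, tzinfo=utc)
records = [
    SecretRecord("api-key", datetime(2022, 9, 1, tzinfo=utc), datetime(2023, 9, 1, tzinfo=utc)),
    SecretRecord("tls-cert", datetime(2022, 12, 1, tzinfo=utc), datetime(2023, 1, 10, tzinfo=utc)),
    SecretRecord("db-password", datetime(2022, 12, 15, tzinfo=utc), datetime(2023, 12, 15, tzinfo=utc)),
]
print(due_for_rotation(records, now))  # api-key is too old, tls-cert is near expiry
```

Running a check like this on a schedule, and alerting on a non-empty result, turns rotation into an operational task with an owner instead of something discovered through an outage.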

The central SecOps team provides guidance on how keys and secrets are managed (governance). The application DevOps team is responsible for managing the application-related keys and secrets.

What types of keys and secrets are used and how are those generated?

Suggested actions

Learn more

Related links

Go back to the main article: Secure deployment and testing in Azure

Infrastructure provisioning considerations in Azure

12/16/2022 • 2 minutes to read

Key points

Infrastructure as code (IaC)

Pipeline secret management

Build environments

Azure resources can be provisioned through code or through user tools such as the Azure portal or Azure CLI. Provisioning or configuring resources manually isn't recommended, because manual methods are error prone and can lead to security gaps. Even the smallest changes should be made through code. The recommended approach is infrastructure as code (IaC). Changes are easy to track because the provisioned infrastructure can be fully reproduced and reversed.

No infrastructure changes should be done manually outside of IaC.

Store keys and secrets outside of the deployment pipeline in Azure Key Vault or in a secure store for the pipeline.

Incorporate security fixes and patching to the operating system and all parts of the codebase, including

dependencies (preinstalled tools, frameworks, and libraries).

Make all operational changes and modifications through IaC. This is a key DevOps practice, and it's often used

with continuous delivery. IaC manages the infrastructure - such as networks, virtual machines, and others - with

a descriptive model, using a versioning system that is similar to what is used for source code. An IaC model generates the same environment every time it's applied. Common examples of IaC are Azure Resource Manager, Bicep, and Terraform.

IaC reduces configuration effort and automates full environment deployment (production and pre-production).

Also, IaC allows you to develop and release changes faster. All those factors enhance the security of the

workload.
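The property that an IaC model "generates the same environment every time it's applied" can be shown with a toy reconciliation loop. This is a conceptual sketch only, not how Azure Resource Manager, Bicep, or Terraform are implemented; the resource names and the `plan` and `apply_plan` functions are invented for the example.

```python
def plan(desired: dict, actual: dict) -> list:
    """Compute the actions that move `actual` to `desired`."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, config))
        elif actual[name] != config:
            actions.append(("update", name, config))
    # Anything that exists but is not declared is drift and gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name, None))
    return actions


def apply_plan(actions: list, actual: dict) -> dict:
    """Return a new state with the planned actions applied."""
    state = dict(actual)
    for verb, name, config in actions:
        if verb == "delete":
            state.pop(name)
        else:
            state[name] = config
    return state


desired = {"vnet": {"cidr": "10.0.0.0/16"}, "vm": {"size": "small"}}
actual = {"vm": {"size": "large"}, "legacy-ip": {}}

state = apply_plan(plan(desired, actual), actual)
```

After one apply, `state` equals `desired`, and planning again yields no actions. That idempotence is what makes the declared model, rather than a sequence of manual steps, the source of truth.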

For detailed information about IaC, see What is Infrastructure as code (IaC).

How are credentials, certificates, and other secrets used in the operations for the workload managed during deployment?

Store keys and secrets outside of the deployment pipeline in a managed key store, such as Azure Key Vault, or in a secure store for the pipeline. When deploying application infrastructure with Azure Resource Manager, Bicep, or Terraform, the process might generate credentials and keys. Store them in a managed key store and make sure the deployed resources reference the store. Do not hard-code credentials.

Secret scanning tools, such as GitHub secret scanning, can be used to scan for existing hard-coded credentials. Add

the scanning process in your continuous integration (CI) pipeline to prevent new hard-coded credentials from

being added.
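To show what a scanning step in CI does in principle, here is a deliberately small sketch that flags lines matching a few credential-like patterns. The three regular expressions are illustrative assumptions and will miss many real secrets; a production pipeline should rely on a maintained scanner rather than a hand-written rule list like this one.

```python
import re

# Illustrative patterns only; real scanners ship far larger, tuned rule sets.
SECRET_PATTERNS = {
    "storage-account-key": re.compile(r"AccountKey=[A-Za-z0-9+/=]{20,}"),
    "private-key-block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "hard-coded-password": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]{4,}['\"]"),
}


def scan_text(text: str) -> list:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings
```

In a pipeline, a wrapper would run `scan_text` over the changed files and exit with a non-zero status when the findings list is not empty, which fails the build before the credential reaches the main branch.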

Does the organization apply security controls (IP firewall restrictions, update management) to self-hosted build agents for this workload?

Custom build agents add management complexity and can become an attack vector. Build machine credentials must be stored securely and the file system needs to be cleaned of any temporary build artifacts regularly. Network isolation can be achieved by only allowing outgoing traffic from the build agent, because it's using the pull model of communication with Azure DevOps.

As part of the operational lifecycle, incorporate security fixes and patching to the operating system and all parts of the codebase, including dependencies (preinstalled tools, frameworks, and libraries).

Apply security controls to self-hosted build agents in the same manner as with other Azure IaaS VMs. These should be minimalistic environments as a way to reduce the attack surface.

Learn more

Azure Pipelines agents
I'm running a firewall and my code is in Azure Repos. What URLs does the agent need to communicate with?

Next step

Secure code deployments

Related links

Go back to the main article: Secure deployment and testing in Azure

Code deployments

12/16/2022 • 4 minutes to read

The automated build and release pipelines should update a workload to a new version seamlessly without breaking dependencies. Augment the automation with processes that allow high priority fixes to get deployed quickly.

Organizations should leverage existing guidance and automation when securing applications in the cloud, rather than starting from zero. Using resources and lessons learned by external organizations that are early adopters of these models can accelerate the improvement of an organization's security posture with less expenditure of effort and resources.

Key points

Involve the security team in the planning and design of the DevOps process to integrate preventive and detective controls for security risks.
Design automated deployment pipelines that allow for quick roll-forward and rollback deployments to address critical bugs and code updates outside of the normal deployment lifecycle.
Integrate code scanning tools within the CI/CD pipeline.

Rollback and roll-forward

If something goes wrong, the pipeline should roll back to a previous working version. N-1 and N+1 refer to

rollback and roll-forward versions. Automated deployment pipelines should allow for quick roll-forward and

rollback deployments to address critical bugs and code updates outside of the normal deployment lifecycle.
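The N-1 and N+1 idea can be sketched as a small version history that a pipeline consults when deciding what to deploy. The `ReleaseHistory` class and the version strings are hypothetical; a real pipeline would track release artifacts or deployment slots instead of strings.

```python
class ReleaseHistory:
    """Tracks deployed versions so a pipeline can move to N-1 or N+1."""

    def __init__(self, versions):
        if not versions:
            raise ValueError("at least one deployed version is required")
        self._versions = list(versions)  # oldest first
        self._current = len(self._versions) - 1

    @property
    def current(self):
        return self._versions[self._current]

    def rollback(self):
        """Redeploy the previous known version (N-1)."""
        if self._current == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._current -= 1
        return self.current

    def roll_forward(self, version):
        """Deploy a new fix (N+1) on top of whatever is currently live."""
        # Versions that were rolled back are dropped from the history.
        del self._versions[self._current + 1:]
        self._versions.append(version)
        self._current += 1
        return self.current
```

Dropping rolled-back versions is a design choice in this sketch: after rolling back from a faulty release and then rolling forward with a fix, a later rollback returns to the last good version, not to the faulty one.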

Can N-1 or N+1 versions be deployed via automated pipelines where N is current deployment version in production?

Because security updates are a high priority, design a pipeline that supports regular updates and critical security

fixes.

A release is typically associated with approval processes with multiple sign-offs, quality gates, and so on. If the

workload deployment is small with minimal approvals, you can usually use the same process and pipeline to

release a security fix.

An approval process that is complex and takes a significant amount of time can delay a fix. Consider building an emergency process to accelerate high priority fixes. The process might be a business process, a communication process between teams, or both. Another way is to build a pipeline that might not include all the gated approvals but can push out the fix quickly. The pipeline should allow for quick roll-forward and rollback deployments that address security fixes, critical bugs, and code updates outside of the regular deployment lifecycle.

IMPORTANT

Deploying a security fix is a priority, but it shouldn't be at the cost of introducing a regression or bug. When designing an emergency pipeline, carefully consider which automated tests can be bypassed. Evaluate the value of each test against the execution time. For example, unit tests usually complete quickly. Integration or end-to-end tests can run for a long time.

Involve the security team in the planning and design of the DevOps process. Your automated pipeline design should have the flexibility to support both regular and emergency deployments. This is important to support the rapid and responsible application of both security fixes and other urgent, important fixes.

Suggested action

Implement an automated deployment process with support for rollback scenarios via Azure App Service deployment slots.

Learn more

Set up staging environments in Azure App Service

Credential scanning

Credentials, keys, and certificates grant access to the data or service used by the workload. Storing credentials in code poses a significant security risk. Ensure that static code scanning tools are an integrated part of the continuous integration (CI) process.

Are code scanning tools an integrated part of the continuous integration (CI) process for this workload?

To prevent credentials from being stored in the source code or configuration files, integrate code scanning tools within the CI/CD pipeline:

During design time, use code analyzers to prevent credentials from getting pushed to the source code repository. For example, .NET Compiler Platform (Roslyn) Analyzers inspect your C# or Visual Basic code.
During the build process, use pipeline add-ons to catch credentials in the source code. Some options include GitHub Advanced Security and OWASP source code analysis tools.
Scan all dependencies, such as third-party libraries and framework components, as part of the CI process. Investigate vulnerable components that are flagged by the tool. Combine this task with other code scanning tasks that inspect code churn, test results, and coverage.
Use a combination of dynamic application security testing (DAST) and static application security testing (SAST). DAST tests the application while it's in use. SAST scans the source code and detects vulnerabilities based on its design or implementation. Some technology options are provided by OWASP. For more information, see SAST Tools and Vulnerability Scanning Tools.
Use scanning tools that are specialized in technologies used by the workload. For example, if the workload is containerized, run container-aware scanning tools to detect risks in the container registry, before use, and during use.

Suggested actions

Incorporate the Secure DevOps on Azure toolkit and the guidance published by the Open Web Application Security Project (OWASP), or an equivalent guiding organization.

Learn more

Follow DevOps security guidance
Getting started with Credential Scanner (CredScan)

Community links

OWASP source code analysis tools
GitHub Advanced Security
Vulnerability Scanning Tools

Go back to the main article: Secure deployment and testing in Azure

Security monitoring and remediation in Azure

12/16/2022 • 2 minutes to read

Regularly monitor resources to maintain the security posture and detect vulnerabilities. Detection can take the form of reacting to an alert of suspicious activity or proactively hunting for anomalous events in the enterprise activity logs. Respond vigilantly to anomalies and alerts to prevent security assurance decay, and design for defense in depth and least privilege.

Checklist

How are you monitoring security-related events in this workload?

Use native tools in Azure to monitor the workload resources and the infrastructure in which it runs.
Consider investing in a Security Operations Center (SOC), or SecOps team and incident response plan.
Monitor traffic, access requests, and application communication between segments.
Discover and remediate common risks to improve secure score in Microsoft Defender for Cloud.
Use an industry standard benchmark to evaluate the security posture by learning from external organizations.
Send logs and alerts to a central security log management solution for analysis.
Perform regular internal and external compliance audits, including regulatory compliance attestations.
Regularly test your security design and implementation using test cases based on real-world attacks.

Azure security benchmark

The Azure Security Benchmark includes a collection of high-impact security recommendations. Use them to secure the services and processes you use to run the workload in Azure.

The questions in this section are aligned to these controls:

Azure Security Benchmarks Logging and threat detection.
Azure Security Benchmarks Incident response.
Posture and Vulnerability Management

Reference architecture

Hybrid Security Monitoring using Microsoft Defender for Cloud and Microsoft Sentinel
This reference architecture illustrates how to use Microsoft Defender for Cloud and Microsoft Sentinel to monitor the security configuration and telemetry of on-premises and Azure operating system workloads.

Azure security solutions for AWS
This article provides AWS identity architects, administrators, and security analysts with immediate insights and detailed guidance for deploying several Microsoft security solutions.

Next step

We recommend applying as many best practices as early as possible, and then working to retrofit any gaps over

Related links

applications

Azure Information Protection (AIP) is part of the Microsoft Purview Information Protection solution, and extends the

labeling and classification functionality provided by Microsoft 365. For more information, see this article about

classification.

Azure Governance Visualizer is a PowerShell script that iterates through an Azure tenant's management group

hierarchy down to the subscription level. You can run the script either for your Tenant Root Group or any other

Management Group. It captures data from the most relevant Azure governance capabilities such as Azure Policy,

Azure role-based access control (Azure RBAC), and Azure Blueprints. From the collected data, the visualizer

shows your hierarchy map, creates a tenant summary, and builds granular scope insights about your

management groups and subscriptions.

The visualizer provides a holistic overview of your technical Azure Governance implementation by connecting

the dots.

PSRule for Azure is a set of tests and documentation to help you configure Azure solutions. These tests allow

you to check your Azure Template or Bicep Infrastructure as Code (IaC) before deployment to Azure. PSRule for

Azure includes tests that check how IaC is written and how Azure resources are configured.

Monitor workload resources in Microsoft Defender for Cloud

For information on the Microsoft Defender for Cloud tools, see Strengthen security posture.

For frequently asked questions on Microsoft Defender for Cloud, see FAQ - General Questions.

For information on the Microsoft Sentinel tools that will help to meet these requirements, see What is Microsoft

Sentinel?

For types of DDoS attacks that DDoS Protection mitigates as well as more features, see Azure DDoS Protection

overview.

Monitor Azure resources in Microsoft Defender for Cloud

12/16/2022 • 10 minutes to read

Key points

General best practices

IaaS and PaaS security

Virtual machines

Most cloud architectures have compute, networking, data, and identity components, and each requires different monitoring mechanisms. Even individual Azure services have their own monitoring needs. For instance, to monitor Azure Functions, enable Azure Application Insights.

Microsoft Defender for Cloud has many plans that monitor the security posture of machines, networks, storage and data services, and applications to discover potential security issues. Common issues include internet-connected VMs, missing security updates, missing endpoint protection or encryption, deviations from baseline security configurations, a missing Web Application Firewall (WAF), and more.

Enable Microsoft Defender for Cloud as a defense-in-depth measure. Use resource-specific Defender for

Cloud features such as Microsoft Defender for servers, Microsoft Defender for Endpoint, Microsoft Defender

for Storage.

Observe container hygiene through container aware tools and regular scanning.

Review all network flow logs through network watcher. See diagnostic logs in Microsoft Defender for Cloud.

Integrate all logs in a central SIEM solution to analyze and detect suspicious behavior.

Monitor identity-related risk events in Azure AD reporting and Azure Active Directory Identity Protection.

Identifying common security activities will significantly reduce the overall risk.

Monitor suspicious activities from administrative accounts.

Monitor the location from where Azure resources are being managed.

Monitor attempts to access deactivated credentials.

Use automated tools to monitor network resource configurations and detect changes.

For more information, see Azure security baseline for Azure Monitor.

In an IaaS model, you can host the workload on Azure infrastructure. Azure provides security assurances that

maintain isolation and timely security updates to the infrastructure. For greater control, you host the entire IaaS

solution on-premises or in a hosted data center and are responsible for security. You must implement security

on the host, virtual machine, network, and storage. For instance, if you have your own VNet, consider enabling

Azure Private Link over Azure Monitor so you can access this over a private endpoint.

In PaaS, you have shared responsibility with Azure in protecting the data.

If you're running your own Windows and Linux virtual machines, use Microsoft Defender for Cloud. Take

advantage of the free services to check for missing OS patches, security misconfiguration, and basic network

security. Enabling Microsoft Defender for Cloud is highly recommended because you get features that provide

adaptive application controls, file integrity monitoring (FIM), and others.

NOTE

Remove direct internet connectivity

Containers

For example, a common risk is that virtual machines don't have vulnerability scanning solutions that check for

threats. Microsoft Defender for Cloud reports those machines. You can remediate in Microsoft Defender for

Cloud by deploying a scanning solution. You can use the built-in vulnerability scanner for virtual machines. You

don't need a license. Alternatively, you can bring your own license for supported partner solutions.

Vulnerability assessments are also available for container images and SQL servers.

Attackers constantly scan public cloud IP ranges for open management ports and attempt attacks that exploit

common passwords and known unpatched vulnerabilities. Just-in-time (JIT) access allows you to lock down the

inbound traffic to the virtual machines while providing easy access to connect to machines when needed.

Defender for Cloud identifies which machines should have JIT applied.

With Microsoft Defender for Cloud, you also get Microsoft Defender for Endpoint. It provides Endpoint Detection

and Response (EDR) investigative tools that help with threat detection and analysis.

Microsoft Defender for servers also watches the network to and from virtual machines. If you are using network

security groups to control access to the virtual machines and the rules are overpermissive, Defender for Cloud

will flag them. Adaptive network hardening provides recommendations to further harden the NSG rules.
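
The kind of check described above can be approximated locally before rules are deployed. The following is a minimal sketch, not how Defender for Cloud evaluates rules: the rule dictionaries and their keys (`name`, `direction`, `access`, `source`, `port`) are a hypothetical simplification of NSG rules, and the function name is illustrative.

```python
from typing import Dict, List

MANAGEMENT_PORTS = {"22", "3389"}  # SSH and RDP
ANY_SOURCE = {"*", "0.0.0.0/0", "Internet"}


def overly_permissive(rules: List[Dict[str, str]]) -> List[str]:
    """Return names of inbound Allow rules that expose management ports to any source."""
    flagged = []
    for rule in rules:
        if (
            rule["direction"] == "Inbound"
            and rule["access"] == "Allow"
            and rule["source"] in ANY_SOURCE
            and rule["port"] in MANAGEMENT_PORTS | {"*"}
        ):
            flagged.append(rule["name"])
    return flagged


# Hypothetical rules, for illustration only.
rules = [
    {"name": "allow-ssh-any", "direction": "Inbound", "access": "Allow", "source": "*", "port": "22"},
    {"name": "allow-https", "direction": "Inbound", "access": "Allow", "source": "*", "port": "443"},
    {"name": "allow-rdp-corp", "direction": "Inbound", "access": "Allow", "source": "10.0.0.0/8", "port": "3389"},
]
print(overly_permissive(rules))
```

Only the rule that opens SSH to any source is flagged. The RDP rule is restricted to a private range, and the HTTPS rule doesn't target a management port.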

For a full list of features, see Feature coverage for machines.

Make sure policies and processes require restricting and monitoring direct internet connectivity by virtual

machines.

For Azure, you can enforce policies by:

Enterprise-wide prevention: Prevent inadvertent exposure by following the permissions and roles

described in the reference model.

Ensure that network traffic is routed through approved egress points by default.

Exceptions (such as adding a public IP address to a resource) must go through a centralized group

that evaluates exception requests and makes sure appropriate controls are applied.

Identify and remediate exposed virtual machines by using the Microsoft Defender for Cloud network

visualization to quickly identify internet-exposed resources.

Restrict management ports (RDP, SSH) using just-in-time access in Microsoft Defender for Cloud.

One way of managing VMs in the virtual network is by using Azure Bastion. This service allows you to log into

VMs in the virtual network through SSH or remote desktop protocol (RDP) without exposing the VMs directly to

the internet. To see a reference architecture that uses Bastion, see Network DMZ between Azure and an on-

premises datacenter.

Containerized workloads have an extra layer of abstraction and orchestration. That complexity requires specific

security measures that protect against common container attacks such as supply chain attacks.

Use container registries that are validated for security. Images in public registries might contain malware

or unwanted applications that activate when the container is running. Build a process for developers to

request and rapidly get security validation of new containers and images. The process should validate

against your security standards. This includes applying security updates, scanning for unwanted code

such as backdoors and illicit crypto coin miners, scanning for security vulnerabilities, and application of

secure development practices.

Network

TIP

A popular process pattern is the quarantine pattern. This pattern allows you to get your images on a

dedicated container registry and subject them to security or compliance scrutiny applicable for your

organization. After they're validated, they can then be released from quarantine and promoted to being

available.

Microsoft Defender for Cloud identifies unmanaged containers hosted on IaaS Linux VMs, or other Linux

machines running Docker containers.

Make sure you use images from authorized registries. You can enforce this restriction through Azure

Policy. For example, for an Azure Kubernetes Service (AKS) cluster, have policies that restrict the cluster to

only pull images from Azure Container Registry (ACR) that is deployed as part of the architecture.

Here are the resources for the preceding example:

GitHub: Azure Kubernetes Service (AKS) Secure Baseline Reference Implementation.

The design considerations are described in Baseline architecture for an AKS cluster.
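
The allow-list idea behind such a policy can be illustrated outside the cluster, for example as a pipeline check. The following is a minimal sketch under stated assumptions: it is not the Azure Policy implementation, the registry name `contoso.azurecr.io` is hypothetical, and the parsing follows the common convention that an image reference without a hostname-like first component resolves to Docker Hub.

```python
def image_registry(image: str) -> str:
    """Return the registry host of an image reference, defaulting to Docker Hub."""
    first, _, rest = image.partition("/")
    # The first component is a registry host only if it looks like a hostname.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_authorized(image: str, allowed: set) -> bool:
    """Return True when the image comes from an allowed registry."""
    return image_registry(image) in allowed


ALLOWED = {"contoso.azurecr.io"}  # hypothetical ACR login server

print(is_authorized("contoso.azurecr.io/app/web:1.0", ALLOWED))
print(is_authorized("nginx:latest", ALLOWED))
```

The first reference is accepted and the second is rejected, because an unqualified image name resolves to a public registry.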

Regularly scan containers for known risks in the container registry, before use, and during use.

Use security monitoring tools that are container aware to monitor for anomalous behavior and enable

investigation of incidents.

Microsoft Defender for container registries is designed to protect AKS clusters, container hosts (virtual

machines running Docker), and ACR registries. When enabled, the images that are pulled or pushed to

registries are subject to vulnerability scans.

For more information, see these articles:

Container security in Defender for Cloud

How do you monitor and diagnose conditions of the network?

As an initial step, enable and review all logs (including raw traffic) from your network devices.

Security group logs – flow logs and diagnostic logs

Azure Network Watcher

Take advantage of the packet capture feature to set alerts and gain access to real-time performance information

at the packet level.

Packet capture tracks traffic in and out of virtual machines. It gives you the capability to run proactive captures

based on defined network anomalies including information about network intrusions.

For an example, see Scenario: Get alerts when VM is sending you more TCP segments than usual.

Then, focus on observability of specific services by reviewing the diagnostic logs. For example, for Azure

Application Gateway with integrated WAF, see Web application firewall logs. Microsoft Defender for Cloud

analyzes diagnostic logs on virtual networks, gateways, network security groups and determines if the controls

are secure enough. For example:

Is your virtual machine exposed to the public internet? If so, do you have tight rules on network security groups
to protect the machine?
Are the network security groups (NSG) and rules that control access to the virtual machines overly
permissive?
Are the storage accounts receiving traffic over secure connections?
Follow the recommendations provided by Defender for Cloud. For more information, see Networking
recommendations. Use Azure Firewall logs and metrics for observability into operational and audit logs.
Integrate all logs into a security information and event management (SIEM) service, such as Microsoft Sentinel.
The SIEM solutions support ingestion of large amounts of information and can analyze large datasets quickly.
Based on those insights, you can:
Set alerts or block traffic crossing segmentation boundaries.
Identify anomalies.
Tune the intake to significantly reduce false positive alerts.

Identity

Review identity risks

Regularly review critical access

Discover & replace insecure protocols

Monitor identity-related risk events by using adaptive machine learning algorithms and heuristics so that you can
respond quickly, before the attacker can gain deeper access into the system.
Most security incidents take place after an attacker initially gains access using a stolen identity. Even if the
identity has low privileges, the attacker can use it to traverse laterally and gain access to more privileged
identities. This way the attacker can control access to the target data or systems.
Does the organization actively monitor identity-related risk events related to potentially
compromised identities?
Monitor identity-related risk events on potentially compromised identities and remediate those risks. Review the
reported risk events in these ways:
Azure AD reporting. For information, see users at risk security report and the risky sign-ins security report.
Use the reporting capabilities of Azure Active Directory Identity Protection.
Use the Identity Protection risk events API to get programmatic access to security detections by using
Microsoft Graph. See riskDetection and riskyUser APIs.
Azure AD uses adaptive machine learning algorithms, heuristics, and known compromised credentials
(username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring public and dark web and by working with security
researchers, law enforcement, security teams at Microsoft, and others.
Remediate risks by manually addressing each reported account or by setting up a user risk policy to require a
password change for high risk events.
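
As an illustration of triaging reported detections, the following minimal sketch selects the accounts that would need a password change. It assumes each detection has already been retrieved and is represented as a dictionary with `userPrincipalName` and `riskLevel` keys, modeled on the riskDetection resource; the sample data and the function name are hypothetical.

```python
from typing import Dict, List


def users_needing_password_change(detections: List[Dict[str, str]]) -> List[str]:
    """Return the distinct users that have at least one high-risk detection, sorted."""
    return sorted({d["userPrincipalName"] for d in detections if d["riskLevel"] == "high"})


# Hypothetical detections, for illustration only.
detections = [
    {"userPrincipalName": "alice@contoso.com", "riskLevel": "high"},
    {"userPrincipalName": "bob@contoso.com", "riskLevel": "low"},
    {"userPrincipalName": "alice@contoso.com", "riskLevel": "medium"},
    {"userPrincipalName": "carol@contoso.com", "riskLevel": "high"},
]
print(users_needing_password_change(detections))
```

In practice a user risk policy can apply this rule automatically; the sketch only shows the selection logic.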
Regularly review roles that are assigned privileges with a business-critical impact.
Set up a recurring review pattern to ensure that accounts are removed from permissions as roles change. You
can conduct the review manually or through an automated process by using tools such as Azure AD access
reviews.
Discover and disable the use of legacy insecure protocols SMBv1, LM/NTLMv1, wDigest, Unsigned LDAP Binds,
and Weak ciphers in Kerberos.

Learn more

Connected tenants

Suggested actions

CI/CD pipelines

Next steps

Applications should use the SHA-2 family of hash algorithms (SHA-256, SHA-384, SHA-512). Use of weaker

algorithms, like SHA-1 and MD5, should be avoided.

Authentication protocols are a critical foundation of nearly all security assurances. These older versions can be

exploited by attackers with access to your network and are often used extensively on legacy systems on

Infrastructure as a Service (IaaS).

Here are ways to reduce your risk:

Discover protocol usage by reviewing logs with Microsoft Sentinel's Insecure Protocol Dashboard or

third-party tools.

Restrict or Disable use of these protocols by following guidance for SMB, NTLM, WDigest.

Use only secure hash algorithms (SHA-2 family).

We recommend implementing changes using pilot or other testing methods to mitigate risk of operational

interruption.

For more information about hash algorithms, see Hash and Signature Algorithms.
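
The following minimal sketch shows one way an application might restrict itself to the SHA-2 family by using Python's standard `hashlib` module. The `APPROVED` set and the `digest` helper are illustrative names, not part of any Azure guidance.

```python
import hashlib

APPROVED = {"sha256", "sha384", "sha512"}  # SHA-2 family


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Hash data with an approved SHA-2 algorithm; reject weaker ones such as MD5 or SHA-1."""
    if algorithm not in APPROVED:
        raise ValueError(f"hash algorithm not approved: {algorithm}")
    return hashlib.new(algorithm, data).hexdigest()


print(digest(b"example payload"))
print(len(digest(b"example payload", "sha512")))
```

Centralizing the choice in one helper makes it easier to find and replace weak algorithms during a review.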

Does your security team have visibility into all existing subscriptions and cloud environments?

How do they discover new ones?

Make sure the security team is aware of all enrollments and associated subscriptions connected to your existing

environment through ExpressRoute or Site-to-Site VPN. Monitor them as part of the overall enterprise.

Assess if organizational policies and applicable regulatory requirements are followed for the connected tenants.

This applies to all Azure environments that connect to your production environment network.

The organization's cloud infrastructure should be well documented, with security team access to all resources

required for monitoring and insight. Conduct frequent scans of the cloud-connected assets to ensure no

additional subscriptions or tenants have been added outside of organizational controls. Regularly review

Microsoft guidance to ensure security team access best practices are consulted and followed.
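
A recurring scan of this kind reduces to comparing what was discovered against the documented inventory. The following is a minimal sketch, assuming both lists have already been collected by other means; the subscription names are hypothetical.

```python
from typing import Iterable, List


def unknown_subscriptions(discovered: Iterable[str], inventory: Iterable[str]) -> List[str]:
    """Return discovered subscriptions that are missing from the documented inventory."""
    return sorted(set(discovered) - set(inventory))


# Hypothetical identifiers, for illustration only.
inventory = ["sub-prod", "sub-dev"]
discovered = ["sub-prod", "sub-dev", "sub-shadow"]
print(unknown_subscriptions(discovered, inventory))
```

Anything the function returns was added outside of organizational controls and should be reviewed by the security team.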

Ensure all Azure environments that connect to your production environment and network apply your

organization's policy, and IT governance controls for security.

You can discover existing connected tenants using a tool provided by Microsoft. Guidance on permissions you

may assign to security is in the Assign privileges for managing the environment section.

DevOps practices manage workload changes through continuous integration and continuous

delivery (CI/CD). Make sure you add security validation in the pipelines. Follow the guidance described in Learn

how to add continuous security validation to your CI/CD pipeline.

View logs and alerts

Security logs and alerts using Azure services

12/16/2022

Key points

Use native services

Audit logging

Logs provide insight into the operations of a workload, the infrastructure, network communications, and so on.

When suspicious activity is detected, use alerts as a way of detecting potential threats. As part of your defense-

in-depth strategy and continuous monitoring, respond to the alerts to prevent security assurance from decaying

over time.

Configure central security log management.

Enable audit logging for Azure resources.

Collect security logs from operating systems.

Configure security log storage retention.

Enable alerts for anomalous activities.

Azure Monitor provides observability across your entire environment. You automatically get platform

metrics, activity logs, and diagnostics logs from most of your Azure resources with no configuration. The

activity logs provide detailed diagnostic and auditing information.

Microsoft Defender for Cloud generates notifications as security alerts by collecting, analyzing, and

integrating log data from your Azure resources and the network. Alerts are available when you enable

the Microsoft Defender plans. This will add to the overall cost.

Microsoft Sentinel is a security information and event management (SIEM) and security orchestration

automated response (SOAR) solution. It's a single solution for alert detection, threat visibility, proactive

hunting, and threat response.

Ideally use a combination of the preceding services to get a full view. For example, use Azure Monitor to collect

information about the operating system running on Azure compute. If you're running your own compute, use

Microsoft Defender for Cloud.

An important aspect of monitoring is tracking operations. For example, you want to know who created, updated,

or deleted a resource. Or, get resource-specific information such as when an image was pulled from Azure

Container Registry. That information is crucial for a Security Operations (SecOps) team in detecting the presence

of adversaries, reacting to an alert of suspicious activity, or proactively hunting for anomalous events. It is

also useful for security auditing, compliance, and offline analysis.

On Azure, that information is emitted as platform logs by the resources and the platform on which they run.

They are tracked by Azure Resource Manager as and when subscription-level events occur. Each resource emits

logs specific to the service.

Consider storing your data for audit purposes or statistical analysis. You can retain data in your Log Analytics

workspace and specify the data type. This example sets the retention for the SecurityEvent data type to 730 days:

PUT /subscriptions/00000000-0000-0000-0000-00000000000/resourceGroups/MyResourceGroupName/providers/Microsoft.OperationalInsights/workspaces/MyWorkspaceName/Tables/SecurityEvent?api-version=2017-04-26-preview

{"properties": {"retentionInDays": 730}}
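
The same request can be assembled programmatically. The following is a minimal sketch that only builds the request and doesn't send it. The host `https://management.azure.com` is an assumption (the example above shows only the path), authentication is omitted, and the function name and argument values are illustrative.

```python
import json
import urllib.request


def retention_request(subscription_id: str, resource_group: str, workspace: str,
                      table: str, days: int) -> urllib.request.Request:
    """Build (but do not send) the PUT request that sets retention for one table."""
    url = (
        "https://management.azure.com"  # assumed Azure Resource Manager endpoint
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.OperationalInsights/workspaces/{workspace}"
        f"/Tables/{table}?api-version=2017-04-26-preview"
    )
    body = json.dumps({"properties": {"retentionInDays": days}}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT", headers={"Content-Type": "application/json"}
    )


req = retention_request("<subscription-id>", "MyResourceGroupName", "MyWorkspaceName",
                        "SecurityEvent", 730)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A real call would also need an authorization header with a valid access token before the request is sent.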

Alerts

Centralize logs and alerts

NOTE

Retaining data in this manner can reduce your costs for data retention over time. For information about the type

of data you can retain, see security data types.

Another way is to send the logs to a storage account.

Security alerts are notifications that are generated when anomalous activity is detected on the resources used

by the workload or the platform.

With the Microsoft Defender plans, Microsoft Defender for Cloud analyzes log data and shows a list of alerts

that's based on logs collected from resources within a scope. Alerts include context information such as severity,

status, and activity time. Defender for Cloud also provides a correlated view called incidents. Use this data to

analyze what actions the attacker took, and what resources were affected. Have strategies to react to alerts as

soon as they are generated. An option is to handle alerts in Azure Functions.

Use the data to support these activities:

Remediation of threats.

Investigation of an incident.

Proactive hunting activities.

For more information, see Security alerts and incidents.

Organizations typically follow one of three models when deploying logs: centralized, decentralized, or hybrid.

The choice depends on organizational structures. For example, if each team owns their resource group, log data

is segregated per resource. While access control to that data might be easy to set up, it's difficult to correlate

logs. This might be challenging for the SecOps team who need a holistic view to analyze the data.

Consider a central view of log and data, when applicable. Some advantages include:

The resources in the workload can share a common log workspace reducing duplication.

Single point of observability with all log data makes it easier to consume data for hunting activities, querying,

and statistical evaluation.

The integrated data can be fed into modern machine learning analytics platforms that support ingestion of large

amounts of information and can analyze large datasets quickly. In addition, these solutions can be tuned to

significantly reduce the false positive alerts.

You can collect logs and alerts from various sources centrally in a Log Analytics Workspace, storage account,

and Event Hubs. You can then review and query log data efficiently. In Azure Monitor, use the diagnostic

setting on resources to route specific logs that are important for the organization. Logs vary by resource type.

In Microsoft Defender for Cloud, take advantage of the continuous export feature to route alerts.

Platform logs are not available indefinitely. You'll need to keep them so that you can review them later for auditing

purposes or offline analysis. Use Azure Storage Accounts for long-term/archival storage. In Azure Monitor, specify a

retention period when you enable diagnostic setting for your resources.

Related links

Another way to see all data in a single view is to integrate logs and alerts into Security Information and Event

Management (SIEM) solutions, such as Microsoft Sentinel. Other popular third-party choices are Splunk,

QRadar, and ArcSight. Microsoft Defender for Cloud and Azure Monitor support all of those solutions.

Integrating more data can enrich alerts with additional context. However, collection is not detection. Make sure a

high volume of low value data doesn't flow into those solutions.

If you don't have a reasonable expectation that the data will provide value, deprioritize integration of these

events. For example, a high volume of firewall deny events may create noise without leading to any action.

That choice helps rapid response and remediation by filtering out false positives and elevating true positives.

It also lowers SIEM cost and improves performance.
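
Filtering before ingestion can be as simple as dropping event categories that the team has agreed are low value. The following is a minimal sketch, assuming events are dictionaries with hypothetical `source` and `action` keys; it isn't tied to any particular SIEM connector.

```python
from typing import Dict, List

# Categories agreed to be noise; firewall deny events are the example from the text.
LOW_VALUE = {("firewall", "deny")}


def worth_forwarding(event: Dict[str, str]) -> bool:
    """Return False for event categories that should not be sent to the SIEM."""
    return (event["source"], event["action"]) not in LOW_VALUE


# Hypothetical events, for illustration only.
events: List[Dict[str, str]] = [
    {"source": "firewall", "action": "deny"},
    {"source": "firewall", "action": "allow"},
    {"source": "identity", "action": "risky-sign-in"},
]
forwarded = [e for e in events if worth_forwarding(e)]
print(len(forwarded), "of", len(events), "events forwarded")
```

Keeping the exclusions in one reviewed set makes it clear which data was deliberately deprioritized.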

Other ways of log integration may involve a hybrid model that mixes centralized and decentralized (distributed

among teams) approaches. For details, see Important considerations for an access control strategy.

Responding to alerts is essential to preventing security assurance decay, and it supports defense-in-depth

and least-privilege strategies.

Remediate security risks

For more information, see these articles:

How to get started with Azure Monitor and third-party SIEM integration

How to collect platform logs and metrics with Azure Monitor

Export alerts

Understand Microsoft Defender for Cloud data collection

Go back to the main article: Monitor

Remediate security risks in Microsoft Defender for Cloud

12/16/2022

Key points

Track Secure Score

Security controls must remain effective against attackers who continuously improve their ways to attack the

digital assets of an enterprise. Use the principle of drive continuous improvement to make sure systems are

regularly evaluated and improved.

Start by remediating common security risks. These risks are usually from well-established attack vectors. This

forces attackers to acquire and use more advanced and more expensive attack methods.

Processes for handling incidents and post-incident activities, such as lessons learned and evidence retention.

Remediate the common risks identified by Microsoft Defender for Cloud.

Track remediation progress with secure score and compare against historical results.

Address alerts and take action with remediation steps.

Do you review and remediate common risks in the workload boundary?

Monitor the security posture of VMs, networks, storage, data services, and various other contributing factors.

Secure Score in Microsoft Defender for Cloud shows a composite score that represents the security posture at

the subscription level.

Do you have a process for formally reviewing Secure Score on Microsoft Defender for Cloud?

As you review the results and apply recommendations, track the progress and prioritize ongoing investments.

A higher score indicates a better security posture.

Set up a regular cadence (typically monthly) to review the secure score and plan initiatives with specific

improvement goals.

| Category | Resources | Responsible team |
|---|---|---|
| Compute and applications | App Services | Application Development/Security Team(s) |
| | Containers | Application Development and/or Infrastructure/IT Operations |
| | Virtual machines, scale sets, compute | IT/Infrastructure Operations |
| Data and Storage | SQL/Redis/Data Lake Analytics/Data Lake Store | Database Team |
| | Storage Accounts | Storage/Infrastructure Team |
| Identity and access management | Subscriptions | Identity Team(s) |
| | Key Vault | Information/Data Security Team |
| Networking Resources | | Networking Team and Network Security Team |
| IoT Security | IoT Resources | IoT Operations Team |

Review and remediate recommendations

Assign stakeholders for monitoring and improving the score. Gamify the activity if possible to increase

engagement and focus from the responsible teams.

As a technical workload owner, work with your organization's dedicated team that monitors Secure Score. In the

DevOps model, workload teams may be responsible for their own resources. Typically, these teams are

responsible:

Security posture management team

Vulnerability management or governance, risk, and compliance team

Architecture team

Resource-specific technical teams responsible for improving secure score, as shown in this table.

The Azure Secure Score sample shows how to get your Azure Secure Score for a subscription by calling the

Microsoft Defender for Cloud REST API. The API methods provide the flexibility to query the data and build your

own reporting mechanism of your secure scores over time.
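
Once score snapshots are collected, tracking progress over time is simple arithmetic. The following is a minimal sketch of such a reporting step. It assumes each snapshot has already been reduced to a label, a current score, and a maximum score; the values are made up for illustration and aren't real results.

```python
from typing import List, Tuple

Snapshot = Tuple[str, float, float]  # (label, current score, maximum score)


def score_percentage(current: float, maximum: float) -> float:
    """Express a secure score as a percentage of the maximum achievable score."""
    if maximum <= 0:
        raise ValueError("maximum score must be positive")
    return round(100 * current / maximum, 1)


def change_since_previous(history: List[Snapshot]) -> float:
    """Return the percentage-point change between the last two snapshots (oldest first)."""
    if len(history) < 2:
        raise ValueError("need at least two snapshots")
    (_, prev_cur, prev_max), (_, last_cur, last_max) = history[-2], history[-1]
    return round(score_percentage(last_cur, last_max) - score_percentage(prev_cur, prev_max), 1)


# Hypothetical monthly snapshots, for illustration only.
history = [("2022-10", 30.0, 60.0), ("2022-11", 36.0, 60.0), ("2022-12", 45.0, 60.0)]
print(change_since_previous(history))
```

A positive result means the posture improved since the previous review; a negative one is a signal to investigate what regressed.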

Microsoft Defender for Cloud monitors the security status of machines, networks, storage and data services, and

applications to discover potential security issues. Enable this capability at no additional cost to detect vulnerable

virtual machines connected to the internet, missing security updates, missing endpoint protection or encryption,

deviations from baseline security configurations, missing Web Application Firewall (WAF), and more.

View the recommendations to see the potential security issues and apply the Microsoft Defender for Cloud

recommendations to execute technical remediations.

Policy remediation

The recommendations are grouped by controls. Each recommendation has detailed information such as severity,

affected resources, and quick fixes where applicable. Start with high severity items.
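
Working through recommendations in severity order can be sketched as a simple sort. The example below assumes recommendations were exported as dictionaries with hypothetical `name` and `severity` keys and that severity takes the values High, Medium, or Low; the recommendation names are illustrative.

```python
from typing import Dict, List

SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def prioritize(recommendations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order recommendations so that high-severity items come first (stable within a level)."""
    return sorted(recommendations, key=lambda r: SEVERITY_ORDER[r["severity"]])


# Hypothetical recommendations, for illustration only.
recommendations = [
    {"name": "Enable diagnostic logs", "severity": "Low"},
    {"name": "Restrict management ports", "severity": "High"},
    {"name": "Apply system updates", "severity": "Medium"},
]
for rec in prioritize(recommendations):
    print(rec["severity"], "-", rec["name"])
```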

Defender for Cloud has the capability of exporting results at configured intervals. Compare the results with

previous sets to verify that issues have been remediated.

For more information, see Continuous export.

A common approach for maintaining the security posture is through Azure Policy.

Along with organizational policies, a workload owner can use scoped policies for governance purposes, such as

checking for misconfiguration, prohibiting certain resource types, and others. The resources are evaluated against rules to

identify unhealthy resources that are risky. Post evaluation, certain actions are required as remediation. The

actions can be enforced through Azure Policy effects.

For example, a workload runs in an Azure Kubernetes Service (AKS) cluster. The business goals require the

workload to run in a highly restrictive environment. As a workload owner, you want the resource group to

contain AKS clusters that are private. You can enforce that requirement with the Deny effect. It will prevent a

cluster from being created if that rule isn't satisfied.

That sort of isolation can be maintained through policies at a higher level such as the subscription level or even

management groups.
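
To make the Deny example concrete, the following sketch builds a policy rule as a Python dictionary and prints it as JSON. Treat it as an illustration under stated assumptions rather than a definition to deploy as is: the alias `Microsoft.ContainerService/managedClusters/apiServerAccessProfile.enablePrivateCluster` and the `Indexed` mode are assumptions that should be verified against the Azure Policy documentation.

```python
import json

# Assumed alias for the private-cluster setting; verify before use.
PRIVATE_CLUSTER_ALIAS = (
    "Microsoft.ContainerService/managedClusters/apiServerAccessProfile.enablePrivateCluster"
)

private_aks_policy = {
    "mode": "Indexed",
    "policyRule": {
        "if": {
            "allOf": [
                {"field": "type", "equals": "Microsoft.ContainerService/managedClusters"},
                {"field": PRIVATE_CLUSTER_ALIAS, "notEquals": "true"},
            ]
        },
        # Deny blocks creation of any cluster that matches the condition above.
        "then": {"effect": "deny"},
    },
}

print(json.dumps(private_aks_policy, indent=2))
```

Assigning such a rule at the resource group scope covers one workload; assigning it at a subscription or management group scope applies the same restriction more broadly.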

Another use case is automatic remediation of noncompliant resources by deploying related resources. For example, the

organization wants all storage resources in a subscription to send logs to a common Log Analytics workspace. If

a storage account doesn't pass the policy, a deployment is automatically started as remediation. That

remediation can be enforced through DeployIfNotExists. There are some considerations:

There's a significant wait before the resource is updated and the deployment starts. In the preceding example,

there won't be logs captured during that wait time. Avoid using this effect for resources that cannot tolerate a

delay.

The resources deployed because of DeployIfNotExists are created by a different identity than the

identity that did the original deployment. That identity must have high enough privileges to make the

required changes.

Manage alerts

Related links

Microsoft Defender for Cloud shows a list of alerts that's based on logs collected from resources within a scope.

Alerts include context information such as severity, status, and activity time. Most alerts have MITRE ATT&CK®

tactics that can help you understand the kill chain intent. Select the alert and investigate the problem with

detailed information.

Finally, take action. That action can be to fix the resources that are out of compliance with actionable remediation

steps. You can also suppress alerts that are false positives.

Make sure that you are integrating critical security alerts into Security Information and Event Management

(SIEM), Security Orchestration Automated Response (SOAR) without introducing a high volume of low value

data. Microsoft Defender for Cloud can stream alerts to Microsoft Sentinel. You can also use a third-party

solution by using Microsoft Graph Security API.

Azure security operations

Go back to the main article: Monitor

Security audits

12/16/2022

Key points

Evaluate using standard benchmarks

Suggested action

Audit regulatory compliance

To make sure that the security posture doesn't degrade over time, have regular auditing that checks compliance

with organizational standards. Enable, acquire, and store audit logs for Azure services.

Improve secure score in Microsoft Defender for Cloud.

Use an industry standard benchmark to evaluate your organization's current security posture.

Perform regular internal and external compliance audits, including regulatory compliance attestations.

Review the policy requirements.

Use Azure Governance Visualizer for a holistic overview of your technical Azure Governance implementation.

Do you evaluate the security posture of this workload using standard benchmarks?

Use an industry standard benchmark to evaluate your organization's current security posture.

Benchmarking allows you to improve your security program by learning from external organizations. It lets you

know how your current security state compares to that of other organizations, providing both external

validation for successful elements of your current system and identifying gaps that serve as opportunities to

enrich your team's overall security strategy. Even if your security program isn't tied to a specific benchmark or

regulatory standard, you will benefit from understanding the documented ideal states by those outside and

inside of your industry.

As an example, the Center for Internet Security (CIS) has created security benchmarks for Azure that map to the

CIS Control Framework. Another reference example is the MITRE ATT&CK™ framework that defines the various

adversary tactics and techniques based on real-world observations. These external references and control mappings

and help you to understand any gaps between your current strategy, what you have, and what other experts

have in the industry.

Develop an Azure security benchmarking strategy aligned to industry standards.

As people in the organization and on the project change, it is crucial to make sure that only the right people

have access to the application infrastructure. Auditing and reviewing access control reduces the attack vector to

the application. Azure control plane depends on Azure AD and access reviews are often centrally performed as

part of internal, or external audit activities.

Make sure that the security team is auditing the environment to report on compliance with the security policy of

the organization. Security teams may also enforce compliance with these policies.

Compliance is important for several reasons. Aside from signifying levels of standards, like ISO 27001 and

others, noncompliance with regulatory guidelines may bring sanctions and penalties. Regularly review roles that

have high privileges. Set up a recurring review pattern to ensure that accounts are removed from permissions

as roles change. Consider auditing at least twice a year.

Suggested action

Learn more

Use Microsoft Defender for Cloud to continuously assess and monitor your compliance score.

Assess your regulatory compliance

Have you established a monitoring and assessment solution for compliance?

Continuously assess and monitor the compliance status of your workload. Microsoft Defender for Cloud

provides a regulatory compliance dashboard that shows the current security state of the workload against controls

mandated by government or industry standards and by the Azure Security Benchmark. Keep your

resources in compliance with those standards. Defender for Cloud tracks many standards. You can set the

standards by management groups in a subscription.

Consider using Azure Access Reviews or Entitlement Management to periodically review access to the workload.

For Azure, use Azure Policy to create and manage policies that enforce compliance. Azure Policies are built on

the Azure Resource Manager capabilities. Azure Policy can also be assigned through Azure Blueprints.

For more information, see Tutorial: Create and manage policies to enforce compliance.

Here's an example management group that is tracking compliance to the Payment Card Industry (PCI) standard.

Review critical access

Check policy compliance

Do you have internal and external audits for this workload?

A workload should be audited internally, externally, or both, with the goal of discovering security gaps. Make sure

that the gaps are addressed through updates.

Auditing is important for workloads that follow a standard. Aside from signifying levels of standards,

noncompliance with regulatory guidelines may bring sanctions and penalties.

Perform regulatory compliance attestation. Attestations are done by an independent party that examines if the

workload is in compliance with a standard.

Is access to the control plane and data plane of the application periodically reviewed?

Regularly review roles that have high privileges. Set up a recurring review pattern to ensure that accounts are

removed from permissions as roles change. Consider auditing at least twice a year.

As people in the organization and on the project change, make sure that only the right people have access to

the application infrastructure and just enough privileges to complete the task. Auditing and reviewing the access

control reduces the attack vector to the application.

The Azure control plane depends on Azure AD. You can conduct the review manually or through an automated

process by using tools such as Azure AD access reviews. These reviews are often centrally performed as

part of internal or external audit activities.

Make sure that the security team is auditing the environment to report on compliance with the security policy of

the organization. Security teams may also enforce compliance with these policies.

Capture critical data

Next steps

Related links

Enforce and audit industry, government, and internal corporate security policies. Policy monitoring checks that

initial configurations are correct and that they continue to be compliant over time.

For Azure, use Azure Policy to create and manage policies that enforce compliance. Azure Policies are built on

the Azure Resource Manager capabilities. Azure Policy can also be assigned through Azure Blueprints. For more

information, see Tutorial: Create and manage policies to enforce compliance.

Azure Governance Visualizer captures data from the most relevant Azure governance capabilities such as Azure

Policy, Azure role-based access control (Azure RBAC), and Azure Blueprints. The visualizer PowerShell script

iterates through an Azure tenant's management group hierarchy down to the subscription level. From the

collected data, the visualizer shows your hierarchy map, creates a tenant summary, and builds granular scope

insights about your management groups and subscriptions.

Remediate security risks in Microsoft Defender for Cloud

Secure score in Microsoft Defender for Cloud lets you view all the security vulnerabilities rolled up into a single

score.

Tutorial: Improve your regulatory compliance describes a step-by-step process to evaluate regulatory

requirements in Microsoft Defender for Cloud.

Azure security test practices

12/16/2022 • 3 minutes to read

Key points

Penetration testing (pentesting)

Regularly test your security design and implementation, as part of the organization's operations. That integration

will make sure the security assurances are effective and maintained as per the security standards set by the

organization.

A well-architected workload should be resilient to attacks. It should recover rapidly from disruption and yet

provide the security assurances of confidentiality, integrity, and availability. Invest in simulated attacks as tests

that can indicate gaps. Based on the results, you can harden the defense and limit a real attacker's

lateral movement within your environment.

Simulated tests can also give you data to plan risk mitigation. Applications that are already in production should

use data from real-world attacks. New or updated applications with new features should rely on structured

models for detecting risks early, such as threat modeling.

Define test cases that are realistic and based on real-world attacks.

Identify and catalog lowest cost methods for preventing and detecting attacks.

Use penetration testing as a one-time attack to validate security defenses.

Simulate attacks through red teams for long-term persistent attacks.

Measure and reduce the potential attack surface that attackers target for exploitation for resources within the

environment.

Ensure proper follow-up to educate users about the various means that an attacker may use.

Do you perform penetration testing on the workload?

It's recommended that you simulate a one-time attack to detect vulnerabilities. Pentesting is a popular

methodology to validate the security defense of a system. The practitioners are security experts who are not

part of the organization's IT or application teams. So, they look at the system in a way that malicious actors

scope an attack surface. The goal is to find security gaps by gathering information, analyzing vulnerabilities, and

reporting.

Penetration tests provide a point-in-time validation of security defenses. Red teams can help provide ongoing

visibility and assurance that your defenses work as designed, potentially testing across different levels within

your workload(s). Red team programs can be used to simulate either one-time or persistent threats against an

organization to validate defenses that have been put in place to protect organizational resources.

Microsoft recommends penetration testing and red team exercises to validate security defenses for your

workload.

Penetration Testing Execution Standard (PTES) provides guidelines about common scenarios and the activities

required to establish a baseline.

Azure uses shared infrastructure to host your assets and assets belonging to other customers. In a pentesting

exercise, the practitioners may need access to sensitive data of the entire organization. Follow the rules of

engagement to make sure that access and intent are not misused. For guidance about planning and executing

simulated attacks, see Penetration Testing Rules of Engagement.

Learn more

Simulate attacks

Related links

Azure Penetration Testing

Penetration Testing

The way users interact with a system is critical in planning your defense. The risks are even higher for critical

impact accounts because they have elevated permissions and can cause more damage.

Do you carry out simulated attacks on users of this workload?

Simulate a persistent threat actor targeting your environment through a red team. Here are some advantages:

Periodic checks. The workload will get checked through a realistic attack to make sure the defense is up to

date and effective.

Educational purposes. Based on the learnings, upgrade the knowledge and skill level. This will help the users

understand the various means that an attacker may use to compromise accounts.

A popular choice to simulate realistic attack scenarios is Office 365 Attack Simulator.

Is personal information detected and removed/obfuscated automatically?

Be cautious about using sensitive application information. Don't store personal information such as contact

information, payment information, and so on, in any application logs. Apply protective measures, such as

obfuscation. Machine learning tools can help with this measure. For more information, see PII Detection

cognitive skill.
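
As a rough illustration of obfuscating personal information before it reaches application logs, the following Python sketch masks email addresses and card-like digit sequences with a logging filter. The regular expressions and the names `redact` and `RedactingFilter` are assumptions made for this example; they are not part of any Azure service, and a production system would need far broader detection (for instance, a dedicated PII detection service).

```python
import logging
import re

# Hypothetical patterns for illustration only; real PII detection needs
# much broader coverage than two regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that obfuscates personal data before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so that values passed as arguments
        # are redacted as well.
        record.msg = redact(record.getMessage())
        record.args = None
        return True
```

Attaching the filter to a handler (`handler.addFilter(RedactingFilter())`) applies the obfuscation to every record that handler writes.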

Threat modeling is a structured process to identify the possible attack vectors. Based on the results, prioritize the

risk mitigation efforts. For more information, see Application threat analysis.

For more information on current attacks, see the Microsoft Security Intelligence Report (SIR).

Microsoft Cloud Red Teaming

Go back to the main article: Monitor

Security operations in Azure

12/16/2022 • 4 minutes to read

Tools

The responsibility of the security operation team (also known as Security Operations Center (SOC), or SecOps) is

to rapidly detect, prioritize, and triage potential attacks. These operations help eliminate false positives and focus

on real attacks, reducing the mean time to remediate real incidents. A central SecOps team monitors

security-related telemetry data and investigates security breaches. It's important that any communication, investigation,

and hunting activities are aligned with the application team.

Here are some general best practices for conducting security operations:

Follow the NIST Cybersecurity Framework functions as part of operations.

Detect the presence of adversaries in the system.

Respond by quickly investigating whether it's an actual attack or a false alarm.

Recover and restore the confidentiality, integrity, and availability of the workload during and after

an attack.

For information about the framework, see NIST Cybersecurity Framework.

Acknowledge an alert quickly. A detected adversary must not be ignored while defenders are triaging

false positives.

Reduce the time to remediate a detected adversary. Reduce their opportunity time to conduct an attack

and reach sensitive systems.

Prioritize security investments into systems that have high intrinsic value. For example, administrator

accounts.

Proactively hunt for adversaries as your system matures. This effort will reduce the time that a more highly

skilled adversary, for example one skilled enough to evade reactive alerts, can operate in the environment.

For information about the metrics that Microsoft's SOC team uses, see Microsoft SOC.
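
Two of the practices above, acknowledging alerts quickly and reducing the time to remediate, can be tracked as simple averages. The following Python sketch shows one way to compute them. The `Incident` record and its field names are hypothetical, chosen only for this example; they do not reflect how any particular SOC tool stores incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; the field names are assumptions for this sketch.
    detected: datetime
    acknowledged: datetime
    remediated: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between detection and acknowledgement."""
    seconds = mean((i.acknowledged - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_remediate(incidents: list[Incident]) -> timedelta:
    """Average delay between detection and remediation."""
    seconds = mean((i.remediated - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)
```

Tracking both values over time gives a rough signal of whether triage and remediation are getting faster.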

Here are some Azure tools that a SOC team can use to investigate and remediate incidents.

TOOL | PURPOSE
Microsoft Sentinel | Centralized Security Information and Event Management (SIEM) to get enterprise-wide visibility into logs.
Microsoft Defender for Cloud | Alert generation. Use security playbook in response to an alert.
Azure Monitor | Event logs from application and Azure services.
Azure Network Security Group (NSG) | Visibility into network activities.
Azure Information Protection | Secure email, documents, and sensitive data that you share outside your company.

Assign incident notification contact

Incident response

Investigation practices should use native tools with deep knowledge of the asset type such as an Endpoint

detection and response (EDR) solution, Identity tools, and Microsoft Sentinel.

For more information about monitoring tools, see Security monitoring tools in Azure.

Security alerts need to reach the right people in your organization. Establish a designated point of contact to

receive Azure incident notifications from Microsoft, Microsoft Defender for Cloud, or both. In most cases, such

notifications indicate that your resource is compromised or attacking another customer. This enables your

security operations team to rapidly respond to potential security risks and remediate them.

Ensure administrator contact information in the Azure enrollment portal includes contact information that will

notify security operations directly or rapidly through an internal process.

Learn more

To learn more about establishing a designated point of contact to receive Azure incident notifications from

Microsoft, reference the following articles:

Update notification settings

Configure email notifications for security alerts

Is the organization effectively monitoring security posture across workloads, with a central SecOps team

monitoring security-related telemetry data and investigating possible security breaches? Communication,

investigation, and hunting activities need to be aligned with the application team(s).

Are operational processes for incident response defined and tested?

Actions executed during an incident and response investigation could impact application availability or

performance. Define these processes and align them with the responsible (and in most cases central) SecOps

team. The impact of such an investigation on the application has to be analyzed.

Are there tools to help incident responders quickly understand the application and components to do an investigation?

Incident responders are part of a central SecOps team and need to understand security insights of an

application. Security playbook in Microsoft Sentinel can help to understand the security concepts and cover the

typical investigation activities.

Suggested action

Hybrid enterprise view

Leverage native detections and controls

Suggested actions

Learn more

Next steps

Consider using Microsoft Defender for Cloud to monitor security-related events and get alerted automatically.

Learn more

Security alerts and incidents in Microsoft Defender for Cloud

Security operations tooling and processes should be designed for attacks on cloud and on-premises assets.

Attackers don't restrict their actions to a particular environment when targeting an organization. They attack

resources on any platform using any method available. They can pivot between cloud and on-premises

resources using identity or other means. This enterprise-wide view will enable SecOps to rapidly detect,

respond, and recover from attacks, reducing organizational risk.

Use Azure security detections and controls instead of creating custom features for viewing and analyzing event

logs. Azure services are updated with new features and can detect false positives with a higher

accuracy rate.

Integrating logs from the network devices, and even raw network traffic itself, will provide greater visibility into

potential security threats flowing over the wire.

To get a unified view across the enterprise, feed the logs collected through native detections (such as Azure

Monitor) into a centralized security information and event management (SIEM) solution like Microsoft Sentinel.

Avoid using generalized log analysis tools and queries. Within Azure Monitor, create a Log Analytics workspace to

store logs. You can also review logs and perform queries on log data. These tools can offer high-quality alerts.

The modern machine learning-based analytics platforms support ingestion of extremely large amounts of

information and can analyze large datasets very quickly. In addition, these solutions can be tuned to significantly

reduce false positive alerts.

Examples of network logs that provide visibility include:

Security group logs - flow logs and diagnostic logs

Web application firewall logs

Virtual network taps and their equivalents

Azure Network Watcher

Integrate network device log information in advanced SIEM solutions or other analytics platforms.

Enable enhanced network visibility

Security health modeling

Security tools

Security logs and audits

Check for identity, network, data risks

Tradeoffs for security

12/16/2022 • 4 minutes to read

Security vs Reliability

Security vs Cost Optimization

Security provides confidentiality, integrity, and availability assurances of an organization's data and systems.

When designing a system you can almost never compromise on security controls. When you enhance security

of an architecture there might be impact on reliability, performance efficiency, cost, and operational excellence.

This article describes some of those considerations.

Reliable applications are resilient and highly available. Every architectural component factors in achieving your

requirements for reliability. Workload security is often woven into many layers of the workload's architecture,

operations, and runtime requirements, and may come with its own implications for resiliency or availability.

For example, identity providers and authorization services are critical dependencies to consider. This includes the

identity service (Microsoft Identity Platform) and any libraries that help facilitate the use of those services. At

some points in the architecture, a failure at an identity layer is terminal. At other points, reliability can still be

achieved through strategies such as caching, taking advantage of TTLs on access tokens, and others. OAuth2

claims validation can happen mostly disconnected from the claims provider. However, not all authorization can

be achieved that way. In those situations reliability may be traded in favor of complete security.
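
As a minimal sketch of the caching strategy mentioned above, the following Python class reuses an access token for as long as its TTL allows, so a brief outage of the identity provider doesn't immediately break the caller. The `TokenCache` class, the `fetch` callable, and the 60-second skew are assumptions made for this example, not the API of any identity library.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse an access token until shortly before it expires."""

    def __init__(self, fetch: Callable[[], Tuple[str, float]],
                 skew_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # `fetch` is a hypothetical callable returning (token, ttl_seconds).
        self._fetch = fetch
        self._skew = skew_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is not None and now < self._expires_at - self._skew:
            return self._token  # still valid: no call to the identity provider
        try:
            token, ttl = self._fetch()
        except Exception:
            # Provider unavailable: fall back to the cached token, but only
            # while it has not actually expired.
            if self._token is not None and now < self._expires_at:
                return self._token
            raise
        self._token, self._expires_at = token, now + ttl
        return token
```

Once the token has truly expired and the provider is still unreachable, the call fails. That is the point at which reliability is traded in favor of security.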

Many workloads may quickly degrade in functionality with the loss of critical security controls. Consider

evaluating each component of your architecture to detect that condition.

Other security considerations that might impact reliability are:

Poor or manual certificate or key rotation practices. Failure to do those tasks can lead to reliability issues.

Expired service principals. For example, a deployment pipeline that used a service principal might fail at a

later date, if that principal's access key has expired. Using managed identities helps keep reliability high while

also maintaining least privileges on that identity.

High availability is often achieved by redundancy (actively or passively), and security controls also need to

align with the failover mechanism. For example, failing over from one storage account to another for

reliability may impact how the client's active authorization session is handled. Using managed identity with

Azure AD integration for storage access can result in a higher reliability because the client doesn't have to

manage SAS tokens when switching to the new storage account.

Increasing security of the workload will almost always lead to higher cost. There are some ways to optimize cost.

Maximum security may not always be practical for all environments. Evaluate the security requirements

in pre-production and production environments. Are services such as Azure DDoS Protection, Microsoft

Sentinel, Dedicated HSMs, Microsoft Defender for Cloud needed in pre-production? Is inner loop

mocking of security controls sufficient? If resources are not publicly accessible, can you dial down some

controls for cost savings? Always make those choices if and only if the lowered environment still meets

the business requirements.

Premium security features can increase the cost. There are areas where you can reduce cost by using native

security features. For example, avoid implementing custom roles if you can use built-in roles.

Every security control has an opportunity to impact workflows, and workflows that involve people can be

expensive. A security control that stops work from being done should be evaluated as necessary or

Security vs Operational Excellence

Related links

an alert is raised. Many partner integrations are ready to use out of the box.

For information on Azure Monitor and ITSM integration, reference IT Service Management Connector Overview.

Treating all alerts the same reduces the efficacy of notifications.

When defining alerts, analyze the potential business impact and prioritize accordingly. Prioritizing alerts helps

operational teams in cases where multiple events require intervention at the same time. For example, alerts

concerning critical system flows might require special attention. When creating an alert, ensure you establish

and set the correct priority.

One way of specifying the priority is by using a severity level that indicates how critical a situation is. This image

shows this case in Azure Monitor.
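
To illustrate prioritizing by severity, the following Python sketch keeps pending alerts in a queue that always hands out the most critical one first. It assumes a numeric scale of 0 to 4 in which 0 is the most critical. The `Alert` and `TriageQueue` names are hypothetical and exist only for this example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Alert:
    severity: int                      # 0 = most critical ... 4 = least critical
    sequence: int                      # tie-breaker: arrival order
    name: str = field(compare=False)


class TriageQueue:
    """Hand out pending alerts in order of severity, then arrival."""

    def __init__(self) -> None:
        self._heap: list[Alert] = []
        self._counter = count()

    def raise_alert(self, name: str, severity: int) -> None:
        if not 0 <= severity <= 4:
            raise ValueError("severity must be between 0 and 4")
        heapq.heappush(self._heap, Alert(severity, next(self._counter), name))

    def next_alert(self) -> Alert:
        return heapq.heappop(self._heap)
```

When several events need intervention at once, operators pull from the queue and handle critical system flows before lower-severity alerts.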

Return to the operational excellence overview.

Use case: Health monitoring

Overview of alerts in Microsoft Azure

Create and manage action groups

Health monitoring

12/16/2022 • 7 minutes to read

Requirements for health monitoring

Best practices

Distributed tracing

A system is healthy if it's running and can process requests. Health monitoring generates a snapshot of the

current health of the system so that you can verify all components are functioning as expected.

If any part of the system is unhealthy, you're alerted within a matter of seconds. You'll determine which parts of

the system are functioning normally and which parts are experiencing problems. A traffic light system indicates

system health:

Red for unhealthy. The system has stopped.

Yellow for partially healthy. The system is running with reduced functionality.

Green for healthy.

A comprehensive health monitoring system enables you to drill down to view the health status of subsystems

and components. For example, if the overall system is partially healthy, you can zoom in and determine which

functionality is currently unavailable.
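
The traffic light model and the drill-down described above can be sketched in a few lines of Python. The roll-up rule used here (a red critical component stops the whole system, and any other degraded component makes it partially healthy) is an assumption made for illustration; a real health model would define its own rules.

```python
from enum import IntEnum


class Health(IntEnum):
    GREEN = 0   # healthy
    YELLOW = 1  # partially healthy: running with reduced functionality
    RED = 2     # unhealthy: stopped


def roll_up(components: dict[str, Health], critical: set[str]) -> Health:
    """Derive overall health from component states.

    Assumption for this sketch: a RED critical component stops the system;
    any other degraded component makes it partially healthy.
    """
    if any(components[name] is Health.RED for name in critical):
        return Health.RED
    if any(state is not Health.GREEN for state in components.values()):
        return Health.YELLOW
    return Health.GREEN


def unhealthy(components: dict[str, Health]) -> list[str]:
    """Drill down: names of the components that are not GREEN."""
    return sorted(n for n, s in components.items() if s is not Health.GREEN)
```

If the overall status is yellow, `unhealthy` shows which functionality is currently affected.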

Correlate events across all application components.

Use an Application Performance Management (APM) tool to collect application-level logs.

Use Application Insights to gather key metrics.

Collect application logs from different application environments.

Consider using log levels to capture different types of application events.

Capture log messages in a structured format.

Set out critical application performance targets and non-functional requirements with clarity.

Identify known gaps in application observability that led to missed incidents or false positives in the past.

Consider different log aggregation technologies to collect logs and metrics from Azure resources.

Make logs and metrics available for critical internal dependencies.

Implement black-box monitoring to measure platform services and the resulting customer experience.

Implement detailed instrumentation in the application code to better understand the customer experience.

Apply white-box monitoring to instrument the application with semantic logs and metrics.

Trace the execution of user requests to generate raw data to determine which requests have:

Succeeded

Failed

Taken too long

Distributed tracing allows you to build and visualize end-to-end transaction flows for the application. Events

coming from different application components or different component tiers of the application should be

correlated to build these flows.

For instance, using consistent correlation IDs transferred between components within a transaction achieves

end-to-end transaction flows.

Application Performance Management (APM) tools

Logs and metrics

Application logs

Log levels

Log messages

Event correlation between application layers allows you to connect tracing data of the complete application

stack. You can create a complete picture of where time is spent at each layer through tools that can query the

tracing data repositories in correlation to a unique identifier. This unique identifier represents a given transaction

that flowed through the system.
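
One possible way to carry a correlation ID through a component is sketched below in Python. The component reuses the ID supplied by its caller, stamps it on every log record, and returns the header to forward on downstream calls. The header name `X-Correlation-ID` and the function names are assumptions for this example; APM tools such as Application Insights provide their own correlation mechanisms.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID of the transaction currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def handle_request(headers: dict[str, str], log: logging.Logger) -> dict[str, str]:
    """Reuse the caller's ID if present, otherwise start a new transaction.

    The header name is a hypothetical choice for this sketch.
    """
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        log.info("processing request")
        # Headers to send on any downstream call in the same transaction.
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)
```

With the filter attached to a handler, a format string such as `"%(correlation_id)s %(message)s"` puts the identifier on every line, so tracing data from different layers can be joined on it.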

To successfully maintain an application, it's important to "turn the lights on" to have clear visibility into important

metrics, both in real time and historically.

Application Performance Management (APM) tools

An APM technology, such as Application Insights, should be used to manage the performance and availability of

the application, aggregating application level logs, and events for later interpretation. With cost in mind,

consider the appropriate level of logging required.

Application Insights is an extensible Application Performance Management (APM) service for developers and

DevOps professionals to monitor live applications. It automatically detects performance anomalies and includes

analytics tools to help users diagnose issues, and to understand what customers do with your application.

Application Insights monitors diagnostic trace logs from your application so that you can correlate trace events

with requests.

Application Insights allows you to:

Verify that your application is running correctly.

Make application troubleshooting easier.

Provide custom business telemetry to indicate whether your application is being used as intended.

For more information about how Application Insights helps you monitor applications, reference Monitoring

workloads.

Logging is essential to understand how an application operates in various environments and what events occur

and under which conditions. There are two types of logs: application-generated logs and platform logs.

Application logs support the end-to-end application lifecycle. You should collect logs and events across all major

application environments. A sufficient degree of separation and filtering should be in place to ensure non-critical

environments don't complicate production log interpretation. Corresponding log entries across the application

should capture a correlation ID for their respective transactions.

Use the following log levels to capture different types of application events across environments:

Info

Warning

Error

Debug

Pre-configure the preceding log levels and apply these levels within relevant environments. Changing log levels

should require only simple configuration changes, to support operational scenarios where it's necessary to raise the log

level within an environment.
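
A minimal sketch of configuration-driven log levels in Python follows. The variable names `APP_ENV` and `LOG_LEVEL` and the per-environment defaults are assumptions made for this example; the point is only that raising the level in one environment becomes a configuration change rather than a code change.

```python
import logging
import os

# Hypothetical defaults per environment; tune these to your own needs.
DEFAULT_LEVELS = {
    "development": "DEBUG",
    "test": "INFO",
    "production": "WARNING",
}


def configure_logging(environ=os.environ) -> int:
    """Set the root log level from configuration and return the numeric level.

    APP_ENV and LOG_LEVEL are assumed variable names for this sketch.
    LOG_LEVEL, when set, overrides the environment default.
    """
    env = environ.get("APP_ENV", "production")
    name = environ.get("LOG_LEVEL", DEFAULT_LEVELS.get(env, "WARNING")).upper()
    level = logging.getLevelName(name)  # returns an int for known level names
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {name}")
    logging.getLogger().setLevel(level)
    return level
```

Setting `LOG_LEVEL=DEBUG` on a production instance during an investigation turns on more detailed logging without a redeployment.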

Metrics

Analyze health data

Performance targets and non-functional requirements (NFRs)

Gaps in application monitoring

TIP

Platform logs

Activity logs and diagnostic settings

TIP

Capture log messages and application events in a structured format, following a well-known schema. The

structured data type includes machine-readable data points rather than unstructured string types.

Structured data can help you:

Parse and analyze logs.

Index and search logs.

Simplify reporting.
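
As an illustration of structured logging, the following Python formatter emits each log record as one JSON object instead of a free-form string. The field layout and the convention of passing extra data through a `fields` attribute are assumptions for this sketch, not a well-known schema in their own right.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass machine-readable fields
        # through `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)
```

A call such as `log.warning("payment retry", extra={"fields": {"order_id": "A-17"}})` then produces a line that can be parsed, indexed, and searched by `order_id` without string matching.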

Metrics are numerical values that are collected at regular intervals and describe some aspect of a system at a

particular time. They reflect the health and usage statistics of your resources.

Analyzing health data involves using metrics to quickly indicate whether the system is running. Hot analysis of

the immediate data triggers an alert if a critical component is detected as unhealthy.

For example, an alert fires if the component fails to respond to a consecutive series of pings. You can then take the appropriate

corrective action. For more information about analyzing data, reference Analyze monitoring data for cloud

applications.
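
The ping example can be sketched as a small Python class that counts consecutive failures and signals an alert when a threshold is reached. The class name and the default threshold of three are arbitrary choices for illustration.

```python
class PingMonitor:
    """Flag a component as unhealthy after N consecutive failed pings."""

    def __init__(self, threshold: int = 3) -> None:
        # The default of three failures is an arbitrary value for this sketch.
        self.threshold = threshold
        self._failures = 0

    def record(self, ping_ok: bool) -> bool:
        """Record one ping result; return True when an alert should fire."""
        if ping_ok:
            self._failures = 0
            return False
        self._failures += 1
        # Fire once, at the moment the threshold is reached.
        return self._failures == self.threshold
```

Requiring consecutive failures, rather than alerting on a single missed ping, is one simple way to reduce false positives from transient network errors.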

Application-level metrics should include end-to-end transaction times of key technical functions, such as:

Database queries.

Response times for external API calls.

Failure rates of processing steps, and so on.

To fully assess the health of key scenarios in the context of targets and NFRs, correlate application log events,

such as user login, across critical system flows.

Known gaps in application observability lead to missed incidents and false positives.

What you can't see, you can't measure. What you can't measure, you can't improve.

Platform logs provide detailed diagnostic and auditing information for the infrastructure and resources they

depend on. Monitoring your platform includes collecting rich metrics and logs to verify the state of your

complete infrastructure, and to react promptly if there are any issues.

Application Insights or a full-stack monitoring service like Azure Monitor can help you keep tabs on your entire

landscape.

For more information, reference Monitoring workloads.

Activity logs provide audit information about when a resource is modified, such as when a virtual machine is

started or stopped. Such information is useful for the interpretation and troubleshooting of operational and

performance issues because it provides transparency around configuration changes.

We recommend collecting and aggregating activity logs.

  
Log aggregation technologies
Logs for internal dependencies
Black-box monitoring
Instrumentation
White-box monitoring
Next steps
Log aggregation technologies, such as Azure Log Analytics or Splunk, should be used to collate logs and metrics
across all application components for later evaluation. Resources may include Azure IaaS and PaaS services, and
third-party appliances such as firewalls or anti-malware solutions used in the application. For instance, if Azure
Event Hubs is used, the Diagnostic Settings should be configured to push logs and metrics to the data sink.
Understanding usage helps with right-sizing of the workload, but extra costs for logging should be accepted and
included in the cost model.
All application resources should be configured to route diagnostic logs and metrics to the chosen log
aggregation technology. Use Azure Policy to ensure the consistent use of diagnostic settings across the
application and to enforce the configuration you want for each Azure service.
To build a robust application health model, it's vital to have visibility into the operational state of critical internal
dependencies, such as a shared Network Virtual Appliance (NVA) or ExpressRoute connection.
Black-box monitoring tests externally visible application behavior without knowledge of the internals of the
system. This type of monitoring is a common approach to measuring customer-centric SLIs, SLOs, and SLAs.
For more information, reference Azure Monitor.
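A black-box check can be as simple as probing a public endpoint from outside and recording the fraction of successful responses, which approximates an availability SLI. The following Python sketch uses only the standard library. The function names are hypothetical, and the URL passed to `http_probe` is up to the caller.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False


def availability(probe: Callable[[], bool], attempts: int, interval: float = 0.0) -> float:
    """Run the probe repeatedly and return the fraction of successes."""
    successes = 0
    for _ in range(attempts):
        if probe():
            successes += 1
        time.sleep(interval)
    return successes / attempts
```

For example, `availability(lambda: http_probe(url), attempts=10, interval=30)` samples the endpoint over five minutes. Because the probe sees only what a customer would see, it needs no knowledge of the system's internals.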
Instrumentation of your code allows precise detection of underperforming pieces when you apply load or stress
tests. It's critical to have this data available to improve and identify performance opportunities in the application
code. Use Application Performance Monitoring (APM) tools, such as Application Insights, to manage the
performance and availability of the application, along with aggregating application-level logs, and events for
later interpretation.
For more resources about instrumentation, reference Monitor performance.
Application-level metrics and logs, such as current memory consumption or request latency, should be collected
from the application to inform a health model, detect, and predict issues.
For more information, reference Instrumenting an application with Application Insights and Instrument an
application.
Usage monitoring

Usage monitoring

12/16/2022 • 2 minutes to read

Usage monitoring tracks how the features and components of an application are used.

This article describes how you can use the data gathered from usage monitoring to gain insight into operational
events that affect your application and workloads.

Benefits of usage monitoring

The following list explores the use cases for data gathered from usage monitoring:

Determine which features are used heavily and identify any potential hotspots in the system. High-traffic
elements may benefit from functional partitioning or even replication to spread the load more evenly. You
can also use this information to find features that are used infrequently and are possible candidates for
retirement or replacement in a future version of the system.

Collect information about the operational events of the system under normal use. For example, in an
e-commerce site, you can record the statistical information about the number of transactions and the volume
of customers that are responsible for them. You can use this information for capacity planning as the number
of customers grows.

Detect user satisfaction with the performance or functionality of the system. For example, if a large number
of customers in an e-commerce system regularly abandon their shopping carts, this behavior may mean
there's a problem with the checkout functionality.

Generate billing information. An application or service may charge customers for the resources they use.

Enforce quotas. If a user exceeds their paid quota of processing time or resource usage during a specific
period, the system can limit their access or throttle processing.

Requirements for usage monitoring

To analyze system usage, you'll need monitoring information that includes:

The number of requests that each subsystem processes and directs to each resource.

The work that each user does.

The volume of data storage that each user occupies.

The resources that each user accesses.

Requirements for data collection

You can track usage at a relatively high level. Usage tracking can note the start and end times of each request
and the nature of the request, such as read, write, and so on, depending on the resource in question. Retrieve
this information in the following ways:

Trace user activity.

Capture performance counters that measure the usage for each resource.

Monitor each user's resource consumption.

For accounting purposes, you'll want to identify which users are responsible for doing which operations, and the
resources that these operations use. The gathered information should be detailed enough for accurate billing.
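The data collection requirements above can be illustrated with a small in-memory sketch. It records the start and end time, the user, the resource, and the nature of each request, then derives per-resource request counts and a processing-time quota check. The `UsageRecord` and `UsageTracker` names, the quota value, and the sample data are all hypothetical; a real system would more likely emit these records to a telemetry or log aggregation service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    user: str        # who made the request
    resource: str    # which resource was accessed
    kind: str        # nature of the request, for example "read" or "write"
    start: float     # request start time, in seconds
    end: float       # request end time, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


class UsageTracker:
    """Collects per-request usage records and checks a processing-time quota."""

    def __init__(self, quota_seconds: float):
        self.quota_seconds = quota_seconds
        self.records = []

    def record(self, user, resource, kind, start, end):
        self.records.append(UsageRecord(user, resource, kind, start, end))

    def seconds_used(self, user):
        # Total processing time attributed to one user (useful for billing).
        return sum(r.duration for r in self.records if r.user == user)

    def requests_per_resource(self):
        # Number of requests directed to each resource (useful for hotspots).
        counts = defaultdict(int)
        for r in self.records:
            counts[r.resource] += 1
        return dict(counts)

    def over_quota(self, user):
        # True when the user has exceeded the (hypothetical) paid quota.
        return self.seconds_used(user) > self.quota_seconds


# Hypothetical sample data.
tracker = UsageTracker(quota_seconds=5.0)
tracker.record("alice", "orders-db", "read", start=0.0, end=2.0)
tracker.record("alice", "orders-db", "write", start=3.0, end=7.5)
tracker.record("bob", "blob-store", "read", start=1.0, end=1.5)
```

A caller could use `over_quota` to decide whether to throttle a user, and `seconds_used` as one input to billing.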

Next steps

Issue tracking

12/16/2022 • 2 minutes to read

If unexpected events occur in the system, customers and other users may report these issues.

Issue tracking involves:

Managing issues.

Associating issues with efforts to resolve underlying problems in the system.

Informing customers of possible resolutions.

Requirements for issue tracking

You can often track issues using a separate system that lets you record and report the details of problems that
users report. These details can include:

Tasks the user was doing.

Symptoms of the problem.

Sequence of events.

Error or warning messages.

Requirements for data collection

The user who initially reported the issue is considered the primary data source. This user can provide more
information, such as:

A crash dump, if the application includes a component that runs on the user's desktop.

A screenshot.

The date and time the error occurred.

The user's location.

You can use this information to help debug and create a backlog for future software releases.

Analyze data

Consider the following scenarios when you analyze issue-tracking data:

Different users may report the same problem. The issue-tracking system should associate common reports.

Record the debugging progress against each issue report.

Inform customers of the solution when you've resolved the issue.

If a user reports an issue that has a known solution in the issue-tracking system, inform the user of the
solution immediately.

Next steps

Tracing and debugging

12/16/2022 • 2 minutes to read

When a user reports an issue, the user is often aware only of the immediate effect that the issue has on their
operations. The user can only report the results of their own experience back to the person who is responsible
for maintaining the system. These experiences are a visible symptom of one or more fundamental problems.

Root cause analysis

In many cases, an analyst must dig through the chronology of the underlying operations to establish the root
cause of the problem. This process is called a root cause analysis.

A root cause analysis may uncover inefficiencies in application design. In these situations, you can try to rework
the affected elements and deploy them as part of a later release. This process requires careful control, and you
should monitor the updated components closely.

Requirements for tracing and debugging

For tracing unexpected events and other problems, consider the following requirements:

Monitoring data must provide enough information to enable an analyst to trace the origin of an issue and
reconstruct the sequence of events that led up to the issue.

Data must be sufficient for the analyst to identify a root cause.

A root cause enables the developer to make the necessary changes to prevent the issue from recurring.

Requirements for data collection

Troubleshooting involves the following data collection requirements:

Trace all methods and their parameters used in an operation to create a model that shows the logical flow
through the system when a customer makes a specific request.

Capture and log exceptions and warnings that the system generates because of this flow.

To support debugging, the system should provide the following data:

Hooks that enable you to capture state information at critical points in the system.

Step-by-step information as selected operations continue.

NOTE

Capturing detailed data can cause extra load on the system and should be a temporary process. Only capture detailed
data when an unusual series of events occurs that is difficult to replicate. Alternatively, only capture detailed data when
you're monitoring a new release to ensure that new elements in the system function as expected.
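One way to trace methods and their parameters, and to log the exceptions raised along the way, is a small decorator. The sketch below uses Python's standard `logging` module; the `traced` decorator and the `apply_discount` function are hypothetical names used only for illustration. Because call details are logged at DEBUG level, detailed capture stays off unless the logger level is lowered, which fits the guidance that detailed tracing should be temporary.

```python
import functools
import logging

logger = logging.getLogger("trace")


def traced(func):
    """Log each call with its arguments, its result, and any exception raised."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Entry: record the method name and its parameters.
        logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Log the exception with its stack trace, then re-raise it unchanged.
            logger.exception("error in %s", func.__name__)
            raise
        # Exit: record the result so the logical flow can be reconstructed.
        logger.debug("exit %s result=%r", func.__name__, result)
        return result

    return wrapper


@traced
def apply_discount(price, percent):
    # Hypothetical business operation used to demonstrate the decorator.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100
```

To switch on detailed capture for a limited period, an operator could call `logging.basicConfig(level=logging.DEBUG)` or raise the level of the `trace` logger only.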

Next steps

Auditing

12/16/2022 • 2 minutes to read

Depending on the application, there may be legal requirements for auditing users' operations and recording all
data access. Auditing can provide evidence that links customers to specific requests. Affirming validity is an
important factor in many online business systems to help maintain trust between the customer and the
business responsible for the application or service.

Requirements for auditing

An analyst can trace the sequence of business operations that users perform so that you can reconstruct users'
actions. Tracing the sequence of operations may be necessary as a matter of record, or as part of a forensic
investigation.

Audit information is highly sensitive. This information includes data that identifies the users of the system and
the tasks that they're doing. Reports that contain sensitive audit information should be available only to trusted
analysts. An analyst can generate a range of reports. For example, reports may list the following activities:

All users' activities occurring during a specified time frame.

The chronology of a single user's activity.

The sequence of operations performed against one or more resources.

Requirements for data collection

The primary sources of auditing information can include:

The security system that manages user authentication.

Trace logs that record user activity.

Security logs that track all network requests.

Regulatory requirements may dictate the format of the audit data and the way it's stored. For example, it may
not be possible to clean the data in any way. It must be recorded in its original format. Access to the data
repository must be protected to prevent tampering.

Analyze audit data

An analyst must access all the raw data in its original form. Aside from the common audit report requirement,
the tools for analyzing this data are specialized and external to the system.
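As a rough illustration of the three report types listed under the auditing requirements, the sketch below filters a list of audit events without modifying the raw records. The `AuditEvent` fields, the function names, and the sample events are hypothetical; real audit data usually lives in a protected store with its own query tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    timestamp: int   # for example, seconds since the epoch
    user: str
    operation: str
    resource: str


def activities_in_window(events, start, end):
    """All users' activities with start <= timestamp < end, in time order."""
    return sorted((e for e in events if start <= e.timestamp < end),
                  key=lambda e: e.timestamp)


def user_chronology(events, user):
    """The chronology of a single user's activity."""
    return sorted((e for e in events if e.user == user),
                  key=lambda e: e.timestamp)


def operations_on(events, resources):
    """The sequence of operations performed against one or more resources."""
    wanted = set(resources)
    return sorted((e for e in events if e.resource in wanted),
                  key=lambda e: e.timestamp)


# Hypothetical raw audit data. The report functions return new lists and
# leave this list, and the frozen records in it, untouched.
events = [
    AuditEvent(100, "alice", "read", "invoice-17"),
    AuditEvent(160, "bob", "update", "invoice-17"),
    AuditEvent(130, "alice", "delete", "invoice-9"),
]
```

Each function returns a sorted copy, so the stored events stay in their original form and order.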

Next steps

DevOps Checklist

Operational Excellence patterns

12/16/2022 • 2 minutes to read

Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in
some cases, the operating system. This can make management and monitoring more difficult than an
on-premises deployment. Applications must expose runtime information that administrators and operators can use
to manage and monitor the system, as well as supporting changing business requirements and customization
without requiring the application to be stopped or redeployed.

| Pattern | Summary |
|---|---|
| Ambassador | Create helper services that send network requests on behalf of a consumer service or application. |
| Anti-Corruption Layer | Implement a façade or adapter layer between a modern application and a legacy system. |
| External Configuration Store | Move configuration information out of the application deployment package to a centralized location. |
| Gateway Aggregation | Use a gateway to aggregate multiple individual requests into a single request. |
| Gateway Offloading | Offload shared or specialized service functionality to a gateway proxy. |
| Gateway Routing | Route requests to multiple services using a single endpoint. |
| Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals. |
| Sidecar | Deploy components of an application into a separate process or container to provide isolation and encapsulation. |
| Strangler | Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services. |

Overview of the performance efficiency pillar

12/16/2022 • 2 minutes to read

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. Before the cloud became popular, when it came to planning how a system would handle
increases in load, many organizations intentionally provisioned oversized workloads to meet business
requirements. This decision made sense in on-premises environments because it ensured capacity during peak
usage. Capacity reflects resource availability (CPU and memory). Capacity was a major consideration for
processes that would be in place for many years.

Just as you need to anticipate increases in load in on-premises environments, you need to expect increases in
cloud environments to meet business requirements. One difference is that you may no longer need to make
long-term predictions for expected changes to ensure you'll have enough capacity in the future. Another
difference is in the approach used to manage performance.

To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, reference
the Microsoft Azure Well-Architected Review.

To boost performance efficiency, we recommend watching Performance Efficiency: Fast & Furious: Optimizing
for Quick and Reliable VM Deployments.

Topics

The performance efficiency pillar covers the following topics to help you effectively scale your workload:

| Performance efficiency topic | Description |
|---|---|
| Performance efficiency checklist | Review your application architecture to ensure your workload scales to meet the demands placed on it by users in an efficient manner. |
| Performance principles | Principles to guide you in your overall strategy for improving performance efficiency. |
| Design for performance | Review your application architecture from a performance design standpoint. |
| Consider scalability | Plan for growth by understanding your current workloads. |
| Plan for capacity | Plan to scale your application tier by adding extra infrastructure to meet demand. |
| Performance monitoring checklist | Monitor services and check the health state of current workloads to maintain overall workload performance. |
| Performance patterns | Implement design patterns to build more performant workloads. |
| Tradeoffs | Consider tradeoffs between performance optimization and other aspects of the design, such as reliability, security, cost efficiency, and operability. |

Next steps

Reference the performance efficiency principles intended to guide you in your overall strategy.

Principles

Performance efficiency principles

12/16/2022 • 2 minutes to read

Performance efficiency is the ability of your workload to adjust to changes in demands placed on it by users in
an efficient manner. These principles are intended to guide you in your overall strategy for improving
performance efficiency.

Design for horizontal scaling

Horizontal scaling allows for elasticity. Instances are added (scale-out) or removed (scale-in) in response to
changes in load. Scaling out can improve resiliency by building redundancy. Scaling in can help reduce costs by
shutting down excess capacity.

An alternate approach is vertical scaling (scale up). However, you eventually may reach a limit where there isn't a
larger system, and you can't scale up anymore. At that point, any further scaling must be horizontal. So it's good
practice to employ a scale-out architecture early on.

| Approach | Benefit |
|---|---|
| Define a capacity model according to the business requirements | Test the limits for predicted and random spikes and fluctuations in load to make sure the application can scale. Factor in the SKU service limits and regional limits so that the application scales as expected if there's a regional failure. |
| Use PaaS offerings | Take advantage of the built-in capabilities that automatically trigger scaling operations instead of investing in manual scaling efforts that often require custom implementations and can be error prone. |
| Choose the right resources and right-size | Determine if the resources can support the anticipated load. Also, justify the cost implications of the choices. |
| Apply strategies in your design early | Accelerate adoption without significant changes. For example, strive for stateless application and store state externally in a database or distributed cache. Use caching where possible, to minimize the processing load. |

Shift-left on performance testing

Test early and test often to catch issues early.

| Approach | Benefit |
|---|---|
| Run load and stress tests | Measure the application's performance under predetermined amounts of load and also the maximum load your application and its infrastructure can withstand. |
| Establish performance baselines | Determine the current efficiency of the application and its supporting infrastructure. You'll be able to identify bottlenecks early, before they worsen with load. The baseline can also lead to strategies for improvements and help determine if the application is meeting the business goals. |
| Run the test in the continuous integration (CI) build pipeline | Detect issues early. Any development effort must go through continuous performance testing to make sure changes to the codebase don't negatively affect performance. |

Continuously monitor for performance in production

Observe the system holistically to evaluate the overall health of the solution. Capture the test results not only in
dev/test environment but also in production. Monitoring and logging in production can help identify bottlenecks
and opportunities for improvement.

| Approach | Benefit |
|---|---|
| Monitor the health of the entire solution | Know about the scalability and resiliency of the infrastructure, application, and dependent services. Gather and review key performance counters regularly. |
| Capture data from repeatable processes | Evaluate the metrics over time that would allow for autoscaling with demand. For reliability, look for early warning signs that might require proactive intervention. |
| Reevaluate the needs of the workload continuously | Identify improvement opportunities with resolution planning. This effort may require updated configurations and deprecations in favor of more-appropriate solutions. |

Next section

Use this checklist to review your application architecture from a performance design standpoint.

Design checklist

Related links

Performance efficiency impacts the entire architecture spectrum and is interrelated with other pillars of
the Microsoft Azure Well-Architected Framework.

Assess your workload using the Microsoft Azure Well-Architected Review tool.

Checklist

Design for performance efficiency

12/16/2022 • 14 minutes to read

Application design is critical to handling scale as load increases. Design is part of the Performance Efficiency
pillar in the Microsoft Azure Well-Architected Framework. Use this checklist to review your application
architecture from a performance design standpoint.

Application design

Design for scaling. Scaling allows applications to react to variable load by increasing and decreasing
the number of instances of roles, queues, and other services they use. However, the application must be
designed with this in mind. During scale operations, application and service instances come and go, and
because of this they must be stateless. This prevents the addition or removal of specific instances from
adversely affecting current users.

You should also implement configuration, autodetection, or load balancing so that as services are
added or removed, the application can perform the necessary routing. For example, a web application
might use a set of queues in a round-robin approach to route requests to background services
running in worker roles. The web application must be able to detect changes in the number of queues,
to successfully route requests and balance the load on the application.

You should scale outbound connectivity to the internet with Azure Virtual Network NAT. Virtual
Network NAT (NAT gateway) provides a scalable, reliable, and secure way to connect
outbound to the internet, and helps prevent connection failures caused by SNAT exhaustion.

Scale as a unit. Plan for additional resources to accommodate growth. For each resource, know the
upper scaling limits, and use sharding or decomposition to go beyond these limits. Determine the scale
units for the system in terms of well-defined sets of resources. This makes applying scale-out operations
easier. It also makes operations less prone to negative impact on the application through limitations
imposed by lack of resources in some part of the overall system. For example, adding x number of web
and worker roles might require y number of additional queues and z number of storage accounts to
handle the additional workload generated by the roles. So a scale unit could consist of x web and worker
roles, y queues, and z storage accounts. Design the application so that it's easily scaled by adding one or
more scale units.
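The scale-unit arithmetic can be sketched as follows. The `ScaleUnit` type, the per-unit request capacity, and the sample numbers are assumptions made for illustration; real limits would come from load tests and from the documented service limits of each resource.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    web_and_worker_roles: int   # x in the text above
    queues: int                 # y in the text above
    storage_accounts: int       # z in the text above
    requests_per_second: int    # assumed capacity of one complete unit


def units_needed(unit, target_rps):
    """Smallest whole number of scale units that covers the target load."""
    return max(1, math.ceil(target_rps / unit.requests_per_second))


def resources_for(unit, target_rps):
    """Resources to provision when scaling by whole units only."""
    n = units_needed(unit, target_rps)
    return {
        "units": n,
        "web_and_worker_roles": n * unit.web_and_worker_roles,
        "queues": n * unit.queues,
        "storage_accounts": n * unit.storage_accounts,
    }


# Hypothetical unit: 4 roles, 2 queues, 1 storage account per 500 requests/second.
unit = ScaleUnit(web_and_worker_roles=4, queues=2,
                 storage_accounts=1, requests_per_second=500)
plan = resources_for(unit, target_rps=1800)
```

Scaling in whole units keeps the ratio between roles, queues, and storage accounts fixed, so no single resource type becomes the limiting factor as the system grows.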

Take advantage of platform autoscaling features. Where the hosting platform supports an
autoscaling capability, such as Azure Autoscale, prefer it to custom or third-party mechanisms unless the
built-in mechanism can't fulfill your requirements. Use scheduled scaling rules where possible to ensure
resources are available without a start-up delay, but add reactive autoscaling to the rules where
appropriate to cope with unexpected changes in demand. You can use the autoscaling operations in the
classic deployment model (the older model) to adjust autoscaling, and to add custom counters to rules.
For more information, see Auto-scaling guidance.

Partition the workload. Design parts of the process to be discrete and decomposable. Minimize the
size of each part, while following the usual rules for separation of concerns and the single responsibility
principle. This allows the component parts to be distributed in a way that maximizes use of each compute
unit (such as a role or database server). It also makes it easier to scale the application by adding instances
of specific resources. For complex domains, consider adopting a microservices architecture.

Avoid client affinity. Where possible, ensure that the application does not require affinity. When you do
this, requests can be routed to any instance, and the number of instances is irrelevant. This also avoids
the overhead of storing, retrieving, and maintaining state information for each user.

Offload CPU-intensive and I/O-intensive tasks as background tasks. If a request to a service is
expected to take a long time to run or absorb considerable resources, offload the processing for this
request to a separate task. Use worker roles or background jobs (depending on the hosting platform) to
execute these tasks. This strategy enables the service to continue receiving requests and remain
responsive. For more information, see Background jobs guidance.

Distribute the workload for background tasks. Where there are many background tasks, or the
tasks require considerable time or resources, spread the work across multiple compute units (such as
worker roles or background jobs). For one possible solution, see the Competing Consumers pattern.

Consider moving toward a shared-nothing architecture. A shared-nothing architecture uses
independent, self-sufficient nodes that have no single point of contention (such as shared services or
storage). In theory, such a system can scale almost indefinitely. While a fully shared-nothing approach is
generally not practical for most applications, it may provide opportunities to design for better scalability.
For example, avoiding server-side session state and client affinity, and using data partitioning, are
ways of moving toward a shared-nothing architecture.

Data management

Use data partitioning. Divide the data across multiple databases and database servers, or design the
application to use data storage services that can provide this partitioning transparently (examples include
Azure SQL Database Elastic Database, and Azure Table storage). This approach can help to maximize
performance and allow easier scaling. There are different partitioning techniques, such as horizontal,
vertical, and functional. You can use a combination of these to achieve maximum benefit from increased
query performance, simpler scalability, more flexible management, better availability, and matching the
type of store to the data it will hold.

NOTE

Consider using different types of data store for different types of data. Choose the types based on how well they
are optimized for the specific type of data. This may include using table storage, a document database, or a
column-family data store, instead of, or as well as, a relational database. For more information, see Data
partitioning guidance.

Design for eventual consistency. Eventual consistency improves scalability by reducing or removing
the time needed to synchronize related data partitioned across multiple stores. The cost is that data is not
always consistent when it is read, and some write operations may cause conflicts. Eventual consistency is
ideal for situations where the same data is read frequently but written infrequently.

Reduce chatty interactions between components and services. Avoid designing interactions in
which an application is required to make multiple calls to a service (each of which returns a small amount
of data), rather than a single call that can return all of the data. Where possible, combine several related
operations into a single request when the call is to a service or component that has noticeable latency.
This makes it easier to monitor performance and optimize complex operations. For example, use stored
procedures in databases to encapsulate complex logic, and reduce the number of round trips and
resource locking.

Use queues to level the load for high velocity data writes. Surges in demand for a service can
overwhelm that service and cause escalating failures. To prevent this, consider implementing the
Queue-Based Load Leveling pattern. Use a queue that acts as a buffer between a task and a service that it
invokes. This can smooth intermittent heavy loads that may otherwise cause the service to fail or the task
to time out.
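The buffering idea behind queue-based load leveling can be shown with a minimal in-process sketch. It uses a bounded `queue.Queue` and a single worker thread from the Python standard library as stand-ins for a durable message queue and a downstream service; the `run_load_leveled` function and its parameters are hypothetical. A production system would typically use a managed queue service so that messages survive process restarts.

```python
import queue
import threading


def run_load_leveled(requests, handle, max_pending=100):
    """Buffer a burst of requests in a bounded queue and drain it with one worker."""
    buffer = queue.Queue(maxsize=max_pending)
    results = []

    def worker():
        # The worker processes items at its own pace, regardless of burst size.
        while True:
            item = buffer.get()
            if item is None:        # sentinel: the producer has finished
                break
            results.append(handle(item))

    consumer = threading.Thread(target=worker)
    consumer.start()

    for request in requests:
        # put() blocks while the buffer is full, which applies back-pressure
        # to the producer instead of overwhelming the worker.
        buffer.put(request)

    buffer.put(None)                # tell the worker to stop
    consumer.join()
    return results
```

With a single worker the queue preserves arrival order. Adding more workers would raise throughput at the cost of ordering guarantees.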

Minimize the load on the data store. The data store is commonly a processing bottleneck, a costly
resource, and often not easy to scale out. Where possible, remove logic (such as processing XML
documents or JSON objects) from the data store, and perform processing within the application. For
example, instead of passing XML to the database (other than as an opaque string for storage), serialize or
deserialize the XML within the application layer and pass it in a form that is native to the data store.

NOTE

Typically, it's much easier to scale out the application than the data store, so you should attempt to do as much of
the compute-intensive processing as possible within the application.

Minimize the volume of data retrieved. Retrieve only the data you require by specifying columns
and using criteria to select rows. Make use of table-valued parameters and the appropriate isolation level.
Use mechanisms such as entity tags to avoid retrieving data unnecessarily.

Aggressively use caching. Use caching wherever possible to reduce the load on resources and
services that generate or deliver data. Caching is typically suited to data that is relatively static, or that
requires considerable processing to obtain. Caching should occur at all levels where appropriate in each
layer of the application, including data access and user interface generation. For more information, see
Caching Guidance.
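As one illustration of caching data that is relatively static or expensive to obtain, the sketch below implements a small time-to-live cache that loads a value on a miss and serves it from memory until it expires. The `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names for this example. A distributed cache would be the more usual choice when several application instances need to share cached data.

```python
import time


class TTLCache:
    """Load values on demand and reuse them until their time-to-live expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so that expiry can be tested
        self._entries = {}          # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit: skip the expensive load
        value = loader(key)         # cache miss or expired: go to the source
        self._entries[key] = (now + self.ttl, value)
        return value
```

A caller passes the expensive lookup as `loader`, for example `cache.get_or_load("product:42", fetch_product)`, where `fetch_product` is whatever function reads from the data store.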

Handle data growth and retention. The amount of data stored by an application grows over time.
This growth increases storage costs as well as latency when accessing the data, affecting application
throughput and performance. It may be possible to periodically archive some of the old data that is no
longer accessed, or move data that is rarely accessed into long-term storage that is more cost efficient,
even if the access latency is higher.

Optimize Data Transfer Objects (DTOs) using an efficient binary format. DTOs are passed
between the layers of an application many times. Minimizing the size reduces the load on resources and
the network. However, balance the savings with the overhead of converting the data to the required
format in each location where it is used. Adopt a format that has the maximum interoperability to enable
easy reuse of a component.

Set cache control. Design and configure the application to use output caching or fragment caching
where possible, to minimize processing load.

Enable client side caching. Web applications should enable cache settings on the content that can be
cached. This is commonly disabled by default. Configure the server to deliver the appropriate cache
control headers to enable caching of content on proxy servers and clients.

Use Azure blob storage and the Azure Content Delivery Network to reduce the load on the
application. Consider storing static or relatively static public content, such as images, resources, scripts,
and style sheets, in blob storage. This approach relieves the application of the load caused by dynamically
generating this content for each request. Additionally, consider using the Content Delivery Network to
cache this content and deliver it to clients. Using the Content Delivery Network can improve performance
at the client because the content is delivered from the geographically closest datacenter that contains a
Content Delivery Network cache. For more information, see Best practices for using content delivery
networks.

Optimize and tune SQL queries and indexes. Some T-SQL statements or constructs may have an
adverse effect on performance that can be reduced by optimizing the code in a stored procedure. For
example, avoid converting datetime types to a varchar before comparing with a datetime literal value.
Use date/time comparison functions instead. Lack of appropriate indexes can also slow query execution.
If you use an object/relational mapping framework, understand how it works and how it may affect
performance of the data access layer. For more information, see Query Tuning.

Implementation

Consider denormalizing dataConsider denormalizing data. Data normalization helps to avoid duplication and inconsistency.

However, maintaining multiple indexes, checking for referential integrity, performing multiple accesses to

small chunks of data, and joining tables to reassemble the data imposes an overhead that can affect

performance. Consider if some additional storage volume and duplication is acceptable in order to

reduce the load on the data store. Also consider if the application itself (which is typically easier to scale)

can be relied on to take over tasks such as managing referential integrity in order to reduce the load on

the data store. For more information, see Horizontal, vertical, and functional data partitioning.

NOTENOTE

Review the performance antipatternsReview the performance antipatterns. See Performance antipatterns for cloud applications for

common practices that are likely to cause scalability problems when an application is under pressure.

Use asynchronous callsUse asynchronous calls. Use asynchronous code wherever possible when accessing resources or

services that may be limited by I/O or network bandwidth, or that have a noticeable latency, in order to

avoid locking the calling thread.

Avoid locking resources, and use an optimistic approach insteadAvoid locking resources, and use an optimistic approach instead. Never lock access to resources

such as storage or other services that have noticeable latency, because this is a primary cause of poor

performance. Always use optimistic approaches to managing concurrent operations, such as writing to

storage. Use features of the storage layer to manage conflicts. In distributed applications, data may be

only eventually consistent.
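The optimistic approach described above can be sketched with a version check on write. This is a minimal illustration only: `VersionedStore`, `ConflictError`, and `increment` are hypothetical names, and the in-memory dictionary stands in for a storage service that exposes a version or ETag for conditional writes.

```python
from dataclasses import dataclass


class ConflictError(Exception):
    """Raised when the stored version no longer matches the caller's version."""


@dataclass
class Record:
    value: int
    version: int = 0


class VersionedStore:
    """Hypothetical in-memory store that rejects stale writes instead of locking."""

    def __init__(self):
        self._records = {}

    def read(self, key):
        record = self._records.setdefault(key, Record(value=0))
        return record.value, record.version

    def write(self, key, value, expected_version):
        record = self._records.setdefault(key, Record(value=0))
        if record.version != expected_version:
            raise ConflictError(
                f"{key}: expected v{expected_version}, found v{record.version}"
            )
        record.value = value
        record.version += 1


def increment(store, key, max_attempts=3):
    """Read-modify-write that retries on conflict rather than holding a lock."""
    for _ in range(max_attempts):
        value, version = store.read(key)
        try:
            store.write(key, value + 1, expected_version=version)
            return value + 1
        except ConflictError:
            continue  # another writer won; re-read and try again
    raise ConflictError(f"{key}: gave up after {max_attempts} attempts")
```

No caller ever blocks another here. A writer that loses the race pays for one extra read and retry, which is usually cheaper than holding a lock across a high-latency call.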

Compress highly compressible data over high latency, low bandwidth networks. In the

majority of cases in a web application, the largest volume of data generated by the application and

passed over the network is HTTP responses to client requests. HTTP compression can reduce this

considerably, especially for static content. This can reduce cost as well as reduce the load on the network,

though compressing dynamic content does apply a fractionally higher load on the server. In more

generalized environments, data compression can reduce the volume of data transmitted and minimize

transfer time and costs, but the compression and decompression processes incur overhead.

Compression should only be used when there is a demonstrable gain in performance. Other serialization methods,

such as JSON or binary encodings, may reduce the payload size while having less impact on performance, whereas

XML is likely to increase it.
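One way to apply the "demonstrable gain" rule is to compress a payload and keep the result only if it is meaningfully smaller. The sketch below uses Python's standard `gzip` module. The `maybe_compress` helper and the 20 percent `min_saving` threshold are assumptions made for illustration, not recommended values.

```python
import gzip
import json


def maybe_compress(payload: bytes, min_saving: float = 0.2):
    """Return (body, encoding); compress only if it saves at least min_saving of the size."""
    compressed = gzip.compress(payload)
    if len(compressed) <= len(payload) * (1 - min_saving):
        return compressed, "gzip"
    return payload, "identity"


# Highly repetitive JSON compresses well, so the gzip form is kept.
repetitive = json.dumps([{"status": "ok", "region": "westus"}] * 500).encode("utf-8")
body, encoding = maybe_compress(repetitive)

# A tiny payload grows under gzip because of header overhead, so it is sent as-is.
small_body, small_encoding = maybe_compress(b"hi")
```

In a real service the threshold would come from measuring CPU cost against transfer savings for your own traffic.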

Minimize the time that connections and resources are in use. Maintain connections and

resources only for as long as you need to use them. For example, open connections as late as possible,

and allow them to be returned to the connection pool as soon as possible. Acquire resources as late as

possible, and dispose of them as soon as possible.

Minimize the number of connections required. Service connections absorb resources. Limit the

number that are required and ensure that existing connections are reused whenever possible. For

example, after performing authentication, use impersonation where appropriate to run code as a specific

identity. This can help to make best use of the connection pool by reusing connections.

APIs for some services automatically reuse connections, provided service-specific guidelines are followed. It's

important that you understand the conditions that enable connection reuse for each service that your application

uses.

Send requests in batches to optimize network use. For example, send and read messages in


batches when accessing a queue, and perform multiple reads or writes as a batch when accessing storage

or a cache. This can help to maximize efficiency of the services and data stores by reducing the number of

calls across the network.
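The effect of batching on call count can be sketched as follows. `CountingQueueClient` is a hypothetical stand-in that only counts calls, and `batched` is an illustrative helper. Real queue and storage services impose their own limits on batch size.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


class CountingQueueClient:
    """Hypothetical client that records how many network calls it would make."""

    def __init__(self):
        self.calls = 0
        self.sent = []

    def send_batch(self, messages):
        self.calls += 1
        self.sent.extend(messages)


client = CountingQueueClient()
for batch in batched(range(10), batch_size=4):
    client.send_batch(batch)  # 10 messages travel in 3 calls instead of 10
```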

Avoid a requirement to store server-side session state where possible. Server-side session state

management typically requires client affinity (that is, routing each request to the same server instance),

which affects the ability of the system to scale. Ideally, you should design clients to be stateless with

respect to the servers that they use. However, if the application must maintain session state, store

sensitive data or large volumes of per-client data in a distributed server-side cache that all instances of

the application can access.

Optimize table storage schemas. When using table stores that require the table and column names to

be passed and processed with every query, such as Azure Table storage, consider using shorter names to

reduce this overhead. However, do not sacrifice readability or manageability by using overly compact

names.

Create resource dependencies during deployment or at application startup. Avoid repeated

calls to methods that test the existence of a resource and then create the resource if it does not exist.

Methods such as CloudTable.CreateIfNotExists and CloudQueue.CreateIfNotExists in the Azure Storage

Client Library follow this pattern. These methods can impose considerable overhead if they are invoked

before each access to a storage table or storage queue. Instead:

Create the required resources when the application is deployed, or when it first starts. (A single call to

CreateIfNotExists for each resource in the startup code for a web or worker role is acceptable.)

However, be sure to handle exceptions that may arise if your code attempts to access a resource

that doesn't exist. In these situations, you should log the exception, and possibly alert an operator

that a resource is missing.

Under some circumstances, it may be appropriate to create the missing resource as part of the

exception handling code. Adopt this approach with caution, as the non-existence of the resource

might indicate a programming error (for example, a misspelled resource name), or some other

infrastructure-level issue.
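The guidance above can be sketched as one existence check at startup, with the missing-resource case handled as an exception on the hot path. `FakeTableService`, `ResourceNotFound`, `startup`, and `save_order` are hypothetical names used for illustration. They are not part of the Azure Storage Client Library.

```python
import logging

logger = logging.getLogger("storage")


class ResourceNotFound(Exception):
    """Raised when code touches a table that was never created."""


class FakeTableService:
    """Hypothetical stand-in for a storage client; counts existence checks."""

    def __init__(self):
        self.tables = {}
        self.exists_checks = 0

    def create_if_not_exists(self, name):
        self.exists_checks += 1
        self.tables.setdefault(name, [])

    def insert(self, name, row):
        if name not in self.tables:
            raise ResourceNotFound(name)
        self.tables[name].append(row)


def startup(service, required_tables):
    """Run once at deployment or process start."""
    for name in required_tables:
        service.create_if_not_exists(name)


def save_order(service, order):
    """Hot path: no existence check; a missing table is logged and re-raised."""
    try:
        service.insert("orders", order)
    except ResourceNotFound:
        logger.exception("Table 'orders' is missing; check deployment or resource name")
        raise
```

The sketch re-raises rather than creating the table inside the handler, reflecting the caution above that a missing resource may point to a misspelled name or an infrastructure problem.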

Use lightweight frameworks. Carefully choose the APIs and frameworks you use to minimize resource

usage, execution time, and overall load on the application. For example, using Web API to handle service

requests can reduce the application footprint and increase execution speed, but it may not be suitable for

advanced scenarios where the additional capabilities of Windows Communication Foundation are

required.

Consider minimizing the number of service accounts. For example, use a specific account to

access resources or services that impose a limit on connections, or perform better where fewer

connections are maintained. This approach is common for services such as databases, but it can affect the

ability to accurately audit operations due to the impersonation of the original user.

Challenges of monitoring distributed architectures

12/16/2022 • 2 minutes to read

Key points

Team expertise

Scaling issues

Antipatterns in design

Fault tracking

Most cloud deployments are based on distributed architectures where components are distributed across

various services. Troubleshooting monolithic applications often requires only one or two lenses—the application

and the database. With distributed architectures, troubleshooting is complex and challenging because of various

factors. This article describes some of those challenges.

The team may not have the expertise across all the services used in an architecture.

Uncovering and resolving bottlenecks by monitoring all of your services and their infrastructure is complex.

Antipatterns in design and code cause issues when the application is under pressure.

The resilience of any service may impact your application's ability to meet current load.

Distributed architectures require many areas of expertise. To adequately monitor performance, it's critical that

rich telemetry is captured throughout the application and across all services. Also, your team should have the

necessary skills to troubleshoot all services used in the architecture. When making technology choices, it can be

advantageous and simpler to choose one service over another because of the team's existing expertise. As the collective

skillset grows, you can incorporate other technologies.

For monolithic applications, scale is two-dimensional. An application usually consists of a group of application

servers, some web front ends (WFEs), and database servers. Uncovering bottlenecks is simpler, but resolving

them can require considerable effort. For distributed applications, complexity increases exponentially in both

aspects for performance issues. You must consider each application, its supporting service, and the latency

between all the application layers.

Performance efficiency is a complex mixture of applications and infrastructure (IaaS and PaaS). First, you must

ensure that all services can scale to support the expected load and that one service will not cause a bottleneck.

Second, while performance testing, you may realize that certain services scale under different load conditions as

opposed to scaling all services uniformly. Monitoring all of your services and their infrastructure can help fine-

tune your application for optimal performance.

For more information about monitoring for scalability, see Monitor performance for scalability and reliability.

Antipatterns in design and code are a common cause of performance problems when an application is under

pressure. For example, an application behaves as expected during performance testing. However, when it's

released to production and starts to handle live workloads, performance decreases. Problems such as rejecting

user requests, stalling, or throwing exceptions may arise. To learn how to identify and fix these antipatterns, see

Performance antipatterns for cloud applications.

If a service in the architecture fails, how will it affect your application's overall performance? Is the error
transient, allowing your application to continue to function, or will the application experience a critical failure? If
the error is transient, does your application experience a decrease in performance? Resiliency plays a significant
role in performance efficiency because the failure of any service may impact your application's ability to meet
your business goals and scale to meet current load. Chaos testing (the introduction of random failures within
your infrastructure) against your application can help determine how well your application continues to
perform under varying levels of load.

For more information about reliability impacts on performance, see Monitor performance for scalability and
reliability.

Design scalable Azure applications

Community links

To learn more about chaos testing, see Advancing resilience through chaos engineering and fault injection.

Related links

Performance antipatterns for cloud applications

Monitor performance for scalability and reliability

Back to the main article

Design scalable Azure applications

12/16/2022 • 8 minutes to read

Choose the right data storage

Database considerations

Why use a relational database?

Application design is critical to handling scale as load increases. This article will give you insight on the most

important topics. For more topics related to handling scale, see the Design Azure applications for efficiency

article in the Performance efficiency pillar.

The overall design of the data tier can greatly affect an application's performance and scalability. The Azure

storage platform is designed to be massively scalable to meet the data storage and performance needs of

modern applications.

Data services in the Azure storage platform are:

Azure Blob - A massively scalable object store for text and binary data. Includes support for big data analytics

through Data Lake Storage Gen2.

Azure Files - Managed file shares for cloud or on-premises deployments.

Azure Queue - A messaging store for reliable messaging between application components.

Azure Tables - A NoSQL store for schemaless storage of structured data.

Azure Disks - Block-level storage volumes for Azure VMs.

Most cloud workloads adopt the polyglot persistence approach. Instead of using one data store service, a mix of

technologies is used. Your application will most likely require more than one type of data store depending on

your requirements. For examples of when to use these data storage types, see Example scenarios.

Each service is accessed through a storage account. To get started, see Create a storage account.

The choice of database can affect an application's performance and scalability. Database reads and writes involve

a network call and storage I/O, both of which are expensive operations. Choosing the right database service to

store and retrieve data is therefore a critical decision and must be considered to ensure application scalability.

Azure has many database services that will fit most needs. In addition, there are third-party options that can be

considered from Azure Marketplace.

To help you choose a database type, determine if the application's storage requirements fit a relational design

(SQL) versus a key-value/document/graph design (NoSQL). Some applications may have both a SQL and a

NoSQL database for different storage needs. Use the Azure data store decision tree to help you find the

appropriate managed data storage solution.

Use a relational database when strong consistency guarantees are important — where all changes are atomic,

and transactions always leave the data in a consistent state. However, a relational database generally can't scale

out horizontally without sharding the data in some way. Implementing manual sharding can be a time-consuming

task. Also, the data in a relational database must be normalized, which isn't appropriate for every data

set.

If a relational database is considered optimal, Azure offers several PaaS options that fully manage hosting and

operations of the database. Azure SQL Database can host single databases or multiple databases (Azure SQL

Database Managed Instance). The suite of offerings spans requirements that cross performance, scale, size,
resiliency, disaster recovery, and migration compatibility. Azure offers the following PaaS relational database
services:
Azure SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB

Why use a NoSQL database?

Use a NoSQL database when application performance and availability are more important than strong
consistency. NoSQL databases are ideal for handling large, unrelated, indeterminate, or rapidly changing data.
NoSQL databases have trade-offs. For specifics, see Some challenges with NoSQL databases.
Azure provides two managed services that optimize for NoSQL solutions: Azure Cosmos DB and Azure Cache
for Redis. For document and graph databases, Azure Cosmos DB offers extreme scale and performance.
For a detailed description of NoSQL and relational databases, see Understanding the differences.

Choose the right VM size

Choosing the wrong VM size can result in capacity issues as VMs approach their limits. It can also lead to
unnecessary cost. To choose the right VM size, consider your workloads, number of CPUs, RAM capacity, disk
size, and speed according to business requirements. For a snapshot of Azure VM sizes and their purpose, see
Sizes for virtual machines in Azure.
Azure offers the following categories of VM sizes, each designed to run different workloads:
General-purpose - Provide a balanced CPU-to-memory ratio. Ideal for testing and development, small to
medium databases, and low to medium traffic web servers.
Memory optimized - Offer a high memory-to-CPU ratio that is great for relational database servers, medium
to large caches, and in-memory analytics.
Compute optimized - Have a high CPU-to-memory ratio. These sizes are good for medium traffic web
servers, network appliances, batch processes, and application servers.
GPU optimized - Available with single, multiple, or fractional GPUs. These sizes are designed for
compute-intensive, graphics-intensive, and visualization workloads.
High performance compute - Designed to deliver leadership-class performance, scalability, and cost
efficiency for a variety of real-world HPC workloads.
Storage optimized - Offer high disk throughput and IO, and are ideal for Big Data, SQL, NoSQL databases,
data warehousing, and large transactional databases. Examples include Cassandra, MongoDB, Cloudera, and
Redis.
You can resize VMs as your needs and requirements change.

Build with microservices

Microservices are a popular architectural style for building applications that are resilient, highly scalable,
independently deployable, and able to evolve quickly. A microservices architecture consists of a collection of
small, autonomous services. Each service is self-contained and should implement a single business capability.
Breaking up larger entities into small discrete pieces alone doesn't ensure sizing and scaling capabilities.
Application logic needs to be written to control this.
One of the many benefits of microservices is that they can be scaled independently. This lets you scale out
subsystems that require more resources, without scaling out the entire application. Another benefit is fault
isolation. If an individual microservice becomes unavailable, it won't disrupt the entire application, as long as
any upstream microservices are designed to handle faults correctly (for example, by implementing circuit
breaking).

To learn more about the benefits of microservices, see Benefits.

Building with microservices comes with challenges such as development and testing. Writing a small service
that relies on other dependent services requires a different approach than writing a traditional monolithic or
layered application. Existing tools are not always designed to work with service dependencies. Refactoring
across service boundaries can be difficult. It is also challenging to test service dependencies, especially when the
application is evolving quickly.

See Challenges for a list of possible drawbacks of a microservice architecture.

Use dynamic service discovery for microservices applications

When there are many separate services or instances of services in play, they need to receive instructions on
which services to contact and other configuration information. Hard coding this information is brittle, and this is
where service discovery steps in. A service instance can spin up and dynamically discover the configuration
information it needs to become functional without having that information hard coded.

When combined with an orchestration platform designed to execute and manage microservices, such as
Kubernetes or Service Fabric, individual services can be right sized, scaled up, scaled down, and dynamically
configured to match user demand. Using an orchestrator such as Kubernetes or Service Fabric, you can pack a
higher density of services onto a single host, which allows for more efficient utilization of resources. Both of
these platforms provide built-in services for executing, scaling, and operating a microservices architecture, and
one of those key services is discovery: finding where a particular service is running.

Kubernetes supports pod autoscaling and cluster autoscaling. To learn more, see Autoscaling. A Service Fabric
architecture takes a different approach to scaling for stateless and stateful services. To learn more, see
Scaling considerations.

TIP

When appropriate, decomposing an application into microservices is a level of decoupling that is an architectural best
practice. A microservices architecture can also bring some challenges. The design patterns in Design patterns for
microservices can help mitigate these challenges.

Establish connection pooling

Establishing connections to databases is typically an expensive operation that involves establishing an
authenticated network connection to the remote database server. This is especially true for applications that
open new connections frequently. Use connection pooling to reduce connection latency by reusing existing
connections and enable higher database throughput (transactions per second) on the server. By doing this, you
avoid the expense of opening a new connection for each request.

Pool size limits

Azure limits the number of network connections a virtual machine or App Service instance can make. Exceeding

this limit causes connections to be slowed down or terminated. With connection pooling, a fixed set of

connections is established at startup and maintained. In many cases, a default pool size might only

consist of a small handful of connections that perform quickly in basic test scenarios but become a bottleneck

under scale when the pool is exhausted. Establishing a pool size that maps to the number of concurrent

transactions supported on each application instance is a best practice.
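The behavior of a fixed-size pool, including what happens when it is exhausted, can be sketched with Python's standard `queue` module. `ConnectionPool` and `fake_connect` are hypothetical names for illustration. Real database drivers usually ship their own pooling, which should be preferred.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Minimal fixed-size pool; `connect` is any zero-argument factory."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # open all connections once, at startup

    @contextmanager
    def connection(self, timeout=1.0):
        # Blocks up to `timeout` seconds when the pool is exhausted,
        # then raises queue.Empty so the caller can fail fast.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection for reuse


opened = []


def fake_connect():
    """Hypothetical factory standing in for a real database driver call."""
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=2)
for _ in range(50):
    with pool.connection() as conn:
        pass  # run a query with `conn` here
```

Fifty requests reuse the same two connections. If more than two callers need a connection at once, the extra callers wait, which is the bottleneck that an undersized default pool produces under load.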

Each database and application platform will have slightly different requirements for the right way to set up and
leverage the pool. Reference SQL Server Connection Pooling for a .NET code example using SQL Server and
Azure Database. In all cases, testing is paramount to ensure a connection pool is properly established and
working as designed under load.

TIP

Use a pool size that matches the number of concurrent connections. Choose a size that can handle more than the
existing connections so you can quickly handle a new request coming in.

Integrated security

Integrated security is a singular unified solution to protect every service that a business runs through a set of
common policies and configuration settings. In addition to reducing issues associated with scaling, provisioning,
and managing (including higher costs and complexity), integrated security also increases control and overall
security. However, there may be times when you may not want to use connection pooling for security reasons.
For example, although connection pooling improves the performance of subsequent database requests for a
single user, that user cannot take advantage of connections made by other users. It also results in at least one
connection per user to the database server.

Measure your business' security requirements against the advantages and disadvantages of connection pooling.
To learn more, see Pool fragmentation.

Next steps

Application efficiency

Design Azure applications for efficiency

12/16/2022 • 4 minutes to read

Reduce response time with asynchronous programming

Process faster by queuing and batching requests

Optimize with data compression

Making choices that affect performance efficiency is critical to application design. For additional related topics,

see the Design scalable Azure applications article in the Performance efficiency pillar.

The time for the caller to receive a response could range from milliseconds to minutes. During that time, the

thread is held by the process until the response comes back or an exception occurs. This is inefficient

because no other requests can be processed on that thread while it waits for a response. An example

where multiple requests in flight are inefficient is a bank account: only one resource can operate

on the request at a time. Another example is when connection pools can't be shared, so all of the

requests need separate connections to complete.

Asynchronous programming is an alternative approach. It enables a remote service to be executed without

waiting and blocking resources on the client. This is a critical pattern for enabling cloud scalable software and is

available in most modern programming languages and platforms.

There are many ways to inject asynchronous programming into an application design. For APIs and services that

work across the internet, consider using the Asynchronous Request-Reply pattern. When writing code, remote

calls can be asynchronously executed using built-in language constructs like async / await in .NET C#. Review a

language construct example. .NET has other built-in platform support for asynchronous programming with task

and event based asynchronous patterns.
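The text above refers to async / await in C#. The same idea is sketched here in Python with `asyncio`, purely as an illustration. `call_remote_service` is a hypothetical function, and `asyncio.sleep` stands in for network latency.

```python
import asyncio
import time


async def call_remote_service(name: str, latency: float) -> str:
    """Hypothetical remote call; asyncio.sleep stands in for network I/O."""
    await asyncio.sleep(latency)
    return f"{name}: done"


async def handle_request() -> list[str]:
    # The three calls wait concurrently, so total time is close to the
    # slowest call rather than the sum of all three.
    return await asyncio.gather(
        call_remote_service("profile", 0.1),
        call_remote_service("orders", 0.1),
        call_remote_service("recommendations", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(handle_request())
elapsed = time.perf_counter() - start
```

While each call is awaiting, the thread is free to make progress on the others. A blocking version of the same three calls would take roughly three times as long.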

Similar to asynchronous programming, queuing services have long been used as a scalable mechanism to hand

off processing work to a service. Highly scalable queuing services are natively supported in Azure. The queue is

a storage buffer located between the caller and the processing service. It takes requests, stores them in a buffer,

and queues the requests to provide services around the reliable delivery and management of the queued data.

Using a queue is often the best way to hand off work to a processor service. The processor service receives work

by listening on a queue and dequeuing messages. If items to be processed enter too quickly, the queuing service

will keep them in the queue until the processing service has available resources and asks for a new work item

(message). By leveraging the dynamic nature of Azure Functions, the processor service can easily autoscale on

demand as the queue builds up to meet the intake pressure. Developing processor logic with Azure Functions to

run task logic from a queue is a common, scalable, and cost-effective way to use queuing between a client and

a processor.
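The hand-off between a caller and a processor can be sketched in-process with Python's standard `queue` and `threading` modules. This is an illustration of the pattern only. In Azure the queue would be a managed service such as Storage Queues or Service Bus, and the names below are hypothetical.

```python
import queue
import threading

work_queue = queue.Queue()
processed = []


def processor():
    """Dequeue and handle messages one at a time, at the processor's own pace."""
    while True:
        message = work_queue.get()
        if message is None:  # sentinel: stop the worker
            work_queue.task_done()
            break
        processed.append(message.upper())  # placeholder for real task logic
        work_queue.task_done()


worker = threading.Thread(target=processor, daemon=True)
worker.start()

# The caller enqueues a burst and returns immediately; the queue absorbs the spike.
for message in ["order-1", "order-2", "order-3"]:
    work_queue.put(message)

work_queue.put(None)
work_queue.join()  # wait until every queued item has been handled
```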

Azure provides some native first-party queueing services with Azure Storage Queues (simple queuing service

based on Azure Storage) and Azure Service Bus (message broker service supporting transactions and reduced

latency). Many other third-party options are also available through Azure Marketplace.

To learn more about queue-based Load Leveling, see Queue-based Load Leveling pattern. To compare and

contrast queues, see Storage queues and Service Bus queues - compared and contrasted.

A well-known optimization best practice for scaling is to use a compression strategy to compress and bundle
web pages or API responses. The idea is to shrink the data returned from a page or API back to the browser or
client app. Compressing the data returned to clients optimizes network traffic and accelerates the application.
Azure Front Door can perform file compression, and .NET has built-in framework support for this technique with
GZip compression. For more information, see Response compression in ASP.NET Core.

Improve scalability with session affinity

If an application is stateful, meaning that data or state will be stored locally in the instance of the application,
enabling session affinity may increase its performance. When session affinity is enabled, subsequent requests to
the application will be directed to the same server that processed the first request. If session affinity is not
enabled, subsequent requests would be directed to the next available server depending on the load balancing
rules. Session affinity allows the instance to have some persistent or cached data/context, which can speed
subsequent requests. However, if your application does not store large amounts of state or cached data in
memory, session affinity might decrease your throughput because one host could get overloaded with requests,
while others are dormant.

TIP

Migrate an Azure Cloud Services application to Azure Service Fabric describes best practices about stateless services for
an application that is migrated from old Azure Cloud Services to Azure Service Fabric.

Run background jobs to meet integration needs

Many types of applications require background tasks that run independently of the user interface (UI). Examples

include batch jobs, intensive processing tasks, and long-running processes such as workflows. Background jobs

can be executed without requiring user interaction. The application can start the job and then continue to

process interactive requests from users. To learn more, see Background jobs.

Background tasks must offer sufficient performance to ensure they do not block the application, or cause

inconsistencies due to delayed operation when the system is under load. Typically, performance is improved by

scaling the compute instances that host the background tasks. For a list of considerations, see Scaling and

performance considerations.

Logic Apps is a serverless consumption (pay-per-use) service that provides a vast set of out-of-the-box,

ready-to-use connectors and a long-running workflow engine to meet cloud-native integration needs quickly. Logic

Apps is flexible enough for scenarios like running tasks/jobs, advanced scheduling, and triggering. Logic Apps

also has advanced hosting options to allow it to run within enterprise restricted cloud environments. Logic Apps

can be combined with all other Azure services to complement one another, or it can be used independently.

Like all serverless services, Logic Apps doesn't require VM instances to be purchased, enabled, and scaled up

and down. Instead, Logic Apps scales automatically on serverless PaaS-provided instances, and a consumer only

pays based on usage.

Design for scaling

12/16/2022 • 7 minutes to read

Scalability is the ability of a system to handle increased load. Services covered by Azure Autoscale can scale

automatically to match demand to accommodate workload. These services scale out to ensure capacity during

workload peaks and return to normal automatically when the peak drops.

To achieve performance efficiency, consider how your application design scales and implement PaaS offerings

that have built-in scaling operations.

Two main ways an application can scale are vertical scaling and horizontal scaling. Vertical scaling (scaling

up) increases the capacity of a resource, for example, by using a larger virtual machine (VM) size. Horizontal

scaling (scaling out) adds new instances of a resource, such as VMs or database replicas.

Horizontal scaling has significant advantages over vertical scaling, such as:

True cloud scale: Applications are designed to run on hundreds or even thousands of nodes, reaching scales

that aren't possible on a single node.

Horizontal scale is elastic: You can add more instances if load increases, or remove instances during quieter

periods.

Scaling out can be triggered automatically, either on a schedule or in response to changes in load.

Scaling out may be cheaper than scaling up. Running several small VMs can cost less than a single large VM.

Horizontal scaling can also improve resiliency, by adding redundancy. If an instance goes down, the

application keeps running.

An advantage of vertical scaling is that you can do it without making any changes to the application. But at some

point, you'll hit a limit, where you can't scale up anymore. At that point, any further scaling must be horizontal.

Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind

a load balancer. But each VM in the pool must handle any client request, so the application must be stateless or

store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling and

autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.

Just adding more instances doesn't mean an application will scale, however. It might push the bottleneck

somewhere else. For example, if you scale a web front end to handle more client requests, that might trigger

lock contentions in the database. You'd want to consider other measures, such as optimistic concurrency or data

partitioning, to enable more throughput to the database.

Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system,

such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally.

Resolving one bottleneck may reveal other bottlenecks elsewhere.

Use the Performance efficiency checklist to review your design from a scalability standpoint.

In the cloud, the ability to take advantage of scalability depends on your infrastructure and services. Platforms,

such as Kubernetes, were built with scaling in mind. Virtual machines may not scale as easily although scale

operations are possible. With virtual machines, you may want to plan ahead to avoid scaling infrastructure in

the future to meet demand. Another option is to select a different platform, such as Azure Virtual Machine Scale

Sets.

When planning for scalability, you can predict the current average and peak times for your workload. Payment plan

options allow you to manage this prediction. You pay either per minute or per hour, depending on the service,

for a chosen time period.

Plan for growth

Add scale units

Use Autoscaling to manage load increases and decreases

Planning for growth starts with understanding your current workloads, which can help you anticipate scale

needs based on predictive usage scenarios. An example of a predictive usage scenario is an e-commerce site

that recognizes that its infrastructure should scale appropriately for an expected high volume of holiday traffic.

Perform load tests and stress tests to determine the necessary infrastructure to support the predicted spikes in

workloads. A good plan includes incorporating a buffer to accommodate for random spikes.

For more information on how to determine the upper and maximum limits of an application's capacity, reference

Performance testing in the performance efficiency pillar.

Another critical component of planning for scale is to make sure the region that hosts your application supports

the necessary capacity required to accommodate load increase. If you're using a multiregion architecture, make

sure the secondary regions can also support the increase. A region can offer the product, but may not support

the predicted load increase without the necessary SKUs (Stock Keeping Units) so you need to verify capacity.

To verify your region and available SKUs, first select the product and regions in Products available by region.

Then, check the SKUs available in the Azure portal.

For each resource, know the upper scaling limits, and use sharding or decomposition to go beyond those limits.

Design the application so that it's easily scaled by adding one or more scale units, such as by using the

Deployment Stamps pattern. Determine the scale units for the system for well-defined sets of resources.

The next step might be to use built-in scaling features or tools to understand which resources need to scale

concurrently with other resources. For example, adding X number of front-end VMs might require Y number of

extra queues and Z number of storage accounts to handle the extra workload. So a scale unit could consist of X

VM instances, Y queues, and Z storage accounts.
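The scale unit described above can be sketched as a small data structure. The resource counts and the load handled per unit are illustrative assumptions, not values from Azure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """Resources that are always deployed together (values are illustrative)."""

    vm_instances: int
    queues: int
    storage_accounts: int

    def total_for(self, units: int) -> dict:
        """Return the resource counts needed for the given number of units."""
        if units < 0:
            raise ValueError("units must be non-negative")
        return {
            "vm_instances": self.vm_instances * units,
            "queues": self.queues * units,
            "storage_accounts": self.storage_accounts * units,
        }


def units_needed(expected_load: float, load_per_unit: float) -> int:
    """Round up so that the deployed units cover the expected load."""
    if load_per_unit <= 0:
        raise ValueError("load_per_unit must be positive")
    return math.ceil(expected_load / load_per_unit)
```

For example, with a hypothetical unit of 4 VMs, 2 queues, and 1 storage account that handles 1,000 requests per second, an expected load of 2,500 requests per second calls for three units.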

Autoscaling enables you to run the right amount of resources to handle the load of your app. It adds resources

(called scaling out) to handle an increase in load such as seasonal workloads. Autoscaling saves money by

removing idle resources (called scaling in) during a decrease in load such as nights and weekends for some

corporate apps.

Autoscaling keeps the number of running instances between a minimum and a maximum that you configure, and adds or removes VMs automatically based on a set of rules.

  
Understand scale targets
 
Take advantage of platform autoscaling features
NOTE
 
Autoscale CPU or memory-intensive applications
 
Autoscale with Azure compute services
For more information, see Autoscaling.
Scale operations (horizontal - changing the number of identical instances, vertical - switching to more/less
powerful instances) can be fast, but usually take time to complete. It's important to understand how this delay
affects the application under load and if degraded performance is acceptable.
For more information, reference Best practices for Autoscale.
Here's how you can benefit from autoscaling features:
Use built-in autoscaling features when possible rather than custom or third-party mechanisms.
Use scheduled scaling rules where possible to ensure that resources are available.
Add reactive autoscaling to the rules where appropriate to cope with unexpected changes in demand.
If your application is explicitly designed to handle the termination of some of its instances, ensure you use autoscaling to
scale down and scale in resources no longer necessary for the given load to reduce operational costs.
For more information, reference Autoscaling.
CPU-intensive or memory-intensive applications may require scaling up to a larger machine SKU with more CPU or memory. Once the demand for CPU or memory has dropped, instances can revert to the original SKU.
For example, you may have an application that processes images, videos, or music. Given the process and requirements, it may make sense to scale up a server by adding CPU or memory to quickly process a large media file. While scaling out allows the system to process more files simultaneously, it doesn't improve the processing speed of each individual file.
Autoscaling works by collecting metrics for the resource (CPU and memory usage), and the application
(requests queued and requests per second). Rules can then be created to add and remove instances depending
on how the rule evaluates. An App Service plan allows autoscale rules to be set for scale-out/scale-in and
scale-up/scale-down. Scaling also applies to Azure Automation.
The Application Service autoscaling sample shows how to create an Azure App Service plan, which
includes an Azure App Service.
Azure Kubernetes Service (AKS) offers two levels of autoscale:
Horizontal autoscale: Can be enabled on service containers to add more or fewer pod instances within the cluster.
Cluster autoscale: Can be enabled on the agent VM instances running an agent node pool to add or remove VM instances dynamically.
Other Azure services include the following services:
Azure Service Fabric: Virtual machine scale sets offer autoscale capabilities for true IaaS scenarios.
Azure App Gateway and Azure API Management: PaaS offerings for ingress services that enable autoscale.
Azure Functions, Azure Logic Apps, and App Services: Serverless pay-per-use consumption modeling that inherently provides autoscaling capabilities.
Azure SQL Database: PaaS platform to change performance characteristics of a database on the fly and assign more resources when needed or release the resources when they aren't needed. Allows scaling up/down, read scale-out, and global scale-out/sharding capabilities.
Each service documents its autoscale capabilities. Review Autoscale overview for a general discussion on Azure platform autoscale.
NOTE
If your application doesn't have built-in ability to autoscale, or isn't configured to scale out automatically as load increases, it's possible that your application's services will fail if they become saturated with user requests. Reference Azure Automation for possible solutions.
Next steps
Plan for capacity

Plan for capacity

12/16/2022 • 4 minutes to read

Scale out rather than scaling up

Prepare infrastructure for large-scale events

NOTE

Choose the right resources

Azure offers many options to meet capacity requirements as your business grows. These options can also

minimize cost.

When using cloud technologies, it's generally easier, cheaper, and more effective to scale out than to scale up.

Plan to scale your application tier by adding extra infrastructure to meet demand. Be sure to remove the

resources when they are not needed. If you plan to scale up by increasing the resources allocated to your hosts,

you will reach a limit where it becomes cost-prohibitive to scale any further. Scaling up also often requires

downtime for your servers to reboot.

Large-scale application design takes careful planning and can involve complex implementation. Work with your business and marketing teams to prepare for large-scale events. Knowing in advance about sudden spikes in traffic, such as the Super Bowl, Black Friday, or marketing pushes, lets you prepare your infrastructure ahead of time.

A fundamental design principle in Azure is to scale out by adding machines or service instances based on

increased demand. Scaling out can be a better alternative to purchasing additional hardware, which may not be

in your budget. Depending on your payment plan, you don't pay for idle VMs or need to reserve capacity in

advance. A pay-as-you-go plan is usually ideal for applications that need to meet planned spikes in traffic.

Don't plan for capacity to meet the highest level of expected demand. An inappropriate or misconfigured service can impact cost. For example, building a multiregion service when the service levels don't require high availability or geo-redundancy will increase cost without reasonable business justification.

Right sizing your infrastructure to meet the needs of your applications can save you considerably as opposed to

a "one size fits all" solution often employed with on-premises hardware. You can choose various options when

you deploy Azure VMs to support workloads.

Each VM type has specific features and different combinations of CPU, memory, and disks. For example, the B-series VMs are ideal for workloads that don't need the full performance of the CPU continuously, like web servers, proofs of concept, small databases, and development build environments. The B-series offers a cost-effective way to deploy workloads that usually run at low CPU utilization but occasionally burst in their performance.

For a list of sizes and a description of the recommended use, see sizes for virtual machines in Azure.

Continually monitor workloads after migration to find out if your VMs aren't optimized or have frequent periods when they aren't used. If you discover such VMs, it makes sense to either shut them down or downscale them by using virtual machine scale sets. You can optimize a VM with Azure Automation, virtual machine scale sets, auto-shutdown, and scripted or third-party solutions. To learn more, see Automate VM optimization.

Use metrics to fine-tune scaling

Preemptively scaling based on trends

Next steps

Along with choosing the right VMs, selecting the right storage type can save your organization significant cost

every month. For a list of storage data types, access tiers, storage account types, and storage redundancy

options, see Select the right storage.

It's often difficult to understand the relationship between metrics and capacity requirements, especially when an application is initially deployed. Provision a little extra capacity at the beginning, and then monitor and tune the autoscaling rules to bring the capacity closer to the actual load.

Autoscaling enables you to run the right amount of resources to handle the load of your app. It adds resources (called scaling out) to handle an increase in load such as seasonal workloads and customer-facing applications.

After configuring the autoscaling rules, monitor the performance of your application over time. Use the results

of this monitoring to adjust the way in which the system scales if necessary.

Azure Monitor autoscale provides a common set of autoscaling functionality for virtual machine scale sets, Azure App Service, and Azure Cloud Services. Scaling can be performed on a schedule, or based on a runtime metric, such as CPU or memory usage. For example, you can scale out by one instance if average CPU usage is above 70 percent, and scale in by one instance if average CPU usage falls below 50 percent.

By default, autoscaling rules evaluate a metric that is aggregated over a period of time, rather than instantaneous values, before they execute an autoscaling action. This prevents the system from reacting too quickly. To learn more, see Autoscaling.
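The metric-based rule in the example above (scale out by one instance above 70 percent CPU, scale in by one below 50 percent) can be sketched as follows. This is an illustration of the decision logic only, not the Azure Monitor implementation; the function name, the instance limits, and the choice to average a window of samples are assumptions.

```python
from statistics import mean


def desired_instances(current, cpu_samples, minimum=2, maximum=10,
                      scale_out_above=70.0, scale_in_below=50.0):
    """Return the new instance count for one evaluation of the rule."""
    if not cpu_samples:
        return current  # no data: leave capacity unchanged
    # Average the window so that a single spike does not trigger scaling.
    average_cpu = mean(cpu_samples)
    if average_cpu > scale_out_above:
        target = current + 1
    elif average_cpu < scale_in_below:
        target = current - 1
    else:
        target = current
    # Never go below the minimum or above the maximum instance count.
    return max(minimum, min(maximum, target))
```

The gap between the two thresholds matters: if scale-out and scale-in used the same value, the instance count could flip back and forth on every evaluation.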

For a list of built-in metrics, see Azure Monitor autoscaling common metrics. You can also implement custom

metrics by using Application Insights to monitor the performance of your live applications. Some Azure services

use different scaling methods.

Preemptively scaling based on historical data can ensure your application has consistent performance, even

though your metrics haven't yet indicated the need to scale. Schedule-based rules allow you to scale when you

see time patterns in your load and want to scale before a possible load increase or decrease occurs. For example,

you can set a trigger attribute to scale out to 10 instances on weekdays, and scale in to four (4) instances on

Saturday and Sunday. If you can predict the load on the application, consider using scheduled autoscaling, which

adds and removes instances to meet anticipated peaks in demand.
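The weekday and weekend schedule in the example above can be expressed as a simple lookup. This sketch only shows the scheduling decision; the function name is hypothetical, and a real deployment would configure the schedule in the platform's autoscale settings.

```python
from datetime import date

WEEKDAY_INSTANCES = 10  # Monday through Friday
WEEKEND_INSTANCES = 4   # Saturday and Sunday


def scheduled_instances(day: date) -> int:
    """Return the instance count the schedule calls for on the given day."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKEND_INSTANCES if day.weekday() >= 5 else WEEKDAY_INSTANCES
```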

To learn more, see Use Azure Monitor autoscale.

Performance testing

Capacity planning: When performance testing, the business must communicate any fluctuation in expected

load. Load can be impacted by world events, such as political, economic, or weather changes; by marketing

initiatives, such as sales or promotions; or, by seasonal events, such as holidays. You should test variations of

load prior to events, including unexpected ones, to ensure that your application can scale. Additionally, you

should ensure that all regions can adequately scale to support total load, should one region fail.

Checklist

Testing for performance efficiency

12/16/2022 • 5 minutes to read

Performance testing

Is the application tested for performance, scalability, and resiliency?

Performance testing helps to maintain systems properly and fix defects before problems reach system users. It's

part of the Performance Efficiency pillar in the Microsoft Azure Well-Architected Framework.

Performance testing is the superset of both load and stress testing. The primary goal of performance testing is to validate benchmark behavior for the application.

Load testing validates application scalability by rapidly or gradually increasing the load on the application until it reaches a threshold.

Stress testing is a type of negative testing, which involves various activities to overload existing resources and remove components. This testing enables you to understand overall resiliency and how the application responds to issues.

Use the following checklist to review your application architecture from a performance testing standpoint.

Ensure solid performance testing with shared team responsibility. Successfully implementing meaningful performance tests requires many resources. It's not just a single developer or QA analyst running some tests on their local machine. Instead, performance tests need a test environment (also known as a test bed) that tests can be executed against without interfering with production environments and data. Performance testing requires input and commitment from developers, architects, database administrators, and network administrators.

Capacity planning. When performance testing, the business must communicate any fluctuation in

expected load. Load can be impacted by world events, such as political, economic, or weather changes; by

marketing initiatives, such as sales or promotions; or, by seasonal events, such as holidays. Test variations

of load prior to events, including unexpected ones, to ensure that your application can scale. Additionally,

you should ensure that all regions can adequately scale to support total load, should one region fail.

Identify a path forward to leveraging existing tests or the creation of new tests. There are

different types of performance testing: load testing, stress testing, API testing, client-side/browser testing,

and so on. It's important that you understand and articulate the different types of tests, along with their

advantages and disadvantages, to the customer.

Perform testing in all stages in the development and deployment life cycle. Application code,

infrastructure automation, and fault tolerance should all be tested. This step ensures that the application

will perform as expected in every situation. You'll want to test early enough in the application life cycle to

catch and fix errors. Errors are cheaper to repair when caught early and can be expensive or impossible to

fix later. To learn more, reference Testing your application and Azure environment.

Avoid experiencing poor performance with testing. Two subsets of performance testing, load

testing and stress testing, can determine the upper limit, and maximum point of failure, respectively, of

the application's capacity. By performing these tests, you can determine the necessary infrastructure to

support the anticipated workloads.

Plan for a load buffer to accommodate random spikes without overloading the infrastructure. For example, if a normal system load is 100,000 requests per second, the infrastructure should support 100,000 requests at 80% of total capacity (for example, 125,000 requests per second). If you expect that the application will continue to sustain 100,000 requests per second, and the current SKU (Stock Keeping Unit) introduces latency at 65,000 requests per second, you'll most likely need to upgrade your product to the next higher SKU. If there's a secondary region, you'll need to ensure that it also supports the higher SKU.

Test failover in multiregions. Test the amount of time it would take for users to be rerouted to the paired region, so that the paired region doesn't fail under the redirected load. Typically, a planned test failover can help determine how much time would be required to fully scale to support the redirected load.

Configure the environment based on testing results to sustain performance efficiency. Scale out or scale in to handle increases and decreases in load. For example, you may know that you'll encounter high levels of traffic during the day and low levels on weekends. You may configure the environment to scale out for increases in load or scale in for decreases before the load actually changes.

Testing tools

Recommendation

Choose testing tools based on the type of performance testing you are attempting to execute. There are various performance testing tools available for DevOps. Some tools, like JMeter, only perform testing against endpoints and test HTTP statuses. Other tools, such as K6 and Selenium, can perform tests that also check data quality and variations. Azure Load Testing allows you to create a load test by using an existing JMeter script, and monitor client-side and server-side metrics to identify performance bottlenecks. Application Insights, while not necessarily designed to test server load, can test the performance of an application within the user's browser.

Carry out performance profiling and load testing during development, as part of test routines, and before final release to ensure the application performs and scales as required. This testing should occur on the same type of hardware as the production platform, and with the same types and quantities of data and user load that it will encounter in production.

Determine if it is better to use automated or manual testing. Testing can be automated or manual. Automating tests is the best way to make sure that they're executed. Depending on how frequently tests are performed, they're typically limited in duration and scope. Manual testing is run much less frequently.

Cache data to improve performance, scalability, and availability. The more data that you have, the greater the benefits of caching become. Caching typically works well with data that is immutable or that changes infrequently.

Decide how you'll handle local development and testing when some static content is expected to be served from a content delivery network (CDN). For example, you could pre-deploy the content to the CDN as part of your build script. Alternatively, use compile directives or flags to control how the application loads the resources. For example, in debug mode, the application could load static resources from a local folder. In release mode, the application would use the CDN.

Simulate different workloads on your application and measure application performance for each workload. This technique is the best way to figure out what resources you'll need to host your application. Use performance indicators to assess whether your application is performing as expected or not.

Next steps

Define a testing strategy. For more information, reference Testing.

Performance testing

12/16/2022 • 5 minutes to read

Establish baselines

Load testing

Performance testing helps to maintain systems properly and fix defects before problems reach system users. It

helps maintain the efficiency, responsiveness, scalability, and speed of applications when compared with

business requirements. When done effectively, performance testing should give you the diagnostic information

necessary to eliminate bottlenecks, which lead to poor performance. A bottleneck occurs when data flow is

either interrupted or stops due to insufficient capacity to handle the workload.

To avoid experiencing poor performance, commit time and resources to testing system performance. Two

subsets of performance testing, load testing and stress testing, can determine the upper (close to capacity limit)

and maximum (point of failure) limit, respectively, of the application's capacity. By performing these tests, you

can determine the necessary infrastructure to support the anticipated workloads.

A best practice is to plan for a load buffer to accommodate random spikes without overloading the

infrastructure. For example, if a normal system load is 100,000 requests per second, the infrastructure should

support 100,000 requests at 80% of total capacity (i.e., 125,000 requests per second). If you anticipate that the

application will continue to sustain 100,000 requests per second, and the current SKU (Stock Keeping Unit)

introduces latency at 65,000 requests per second, you'll most likely need to upgrade your product to the next

higher SKU. If there is a secondary region, you'll need to ensure that it also supports the higher SKU.
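The load buffer arithmetic above can be checked with a short calculation. This is a minimal sketch; the function names are hypothetical, and the 80 percent target is the figure used in the example.

```python
import math


def required_capacity(normal_load_rps: int, target_utilization_percent: int = 80) -> int:
    """Total capacity needed so that normal load uses only the target share."""
    if not 0 < target_utilization_percent <= 100:
        raise ValueError("target_utilization_percent must be in (0, 100]")
    return math.ceil(normal_load_rps * 100 / target_utilization_percent)


def sku_is_sufficient(sku_latency_onset_rps: int, normal_load_rps: int,
                      target_utilization_percent: int = 80) -> bool:
    """True if the SKU stays healthy up to the required capacity."""
    needed = required_capacity(normal_load_rps, target_utilization_percent)
    return sku_latency_onset_rps >= needed
```

With a normal load of 100,000 requests per second, the required capacity is 125,000 requests per second, so a SKU that introduces latency at 65,000 requests per second falls short.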

Depending on the scale of your performance test, you need to plan for and maintain a testing infrastructure. You

can use a cloud-based tool, such as Azure Load Testing Preview, to abstract the infrastructure needed to run

your performance tests.

First, establish performance baselines for your application. Then, establish a regular cadence for running the

tests. Run the test as part of a scheduled event or part of a continuous integration (CI) build pipeline.

Baselines help to determine the current efficiency state of your application and its supporting infrastructure.

Baselines can provide good insights for improvements and determine if the application is meeting business

goals. Baselines can be created for any application regardless of its maturity. No matter when you establish the

baseline, measure performance against that baseline during continued development. When code or infrastructure changes, the effect on performance can be actively measured.
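One way to measure a change against a baseline is to compute the relative difference and compare it with an agreed tolerance. The sketch below assumes a response-time metric in milliseconds and a 10 percent tolerance; both are illustrative choices, not recommendations from this guidance.

```python
def regression_percent(baseline_ms: float, measured_ms: float) -> float:
    """Percentage change relative to the baseline (positive means slower)."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (measured_ms - baseline_ms) / baseline_ms * 100.0


def within_baseline(baseline_ms: float, measured_ms: float,
                    tolerance_percent: float = 10.0) -> bool:
    """True if the measurement is no more than the tolerance slower."""
    return regression_percent(baseline_ms, measured_ms) <= tolerance_percent
```

A check like this can run in a CI build pipeline so that a change that slows a key operation beyond the tolerance is flagged before release.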

Load testing measures system performance as the workload increases. It identifies where and when your

application breaks, so you can fix the issue before shipping to production. It does this by testing system behavior

under typical and heavy loads.

Load testing takes place in stages of load. These stages are usually measured by virtual users (VUs) or

simulated requests, and the stages happen over given intervals. Load testing provides insights into how and

when your application needs to scale in order to continue to meet your SLA to your customers (whether internal

or external). Load testing can also be useful for determining latency across distributed applications and

microservices.
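The staged approach described above can be represented as a list of stages, each with a number of virtual users and a duration. The sketch below builds an evenly stepped ramp; the function name and the dictionary keys are assumptions and don't correspond to the configuration format of any particular load testing tool.

```python
def ramp_stages(start_users: int, peak_users: int, steps: int, stage_seconds: int) -> list:
    """Build a stepwise ramp from start_users up to peak_users."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    increment = (peak_users - start_users) / (steps - 1)
    return [
        {
            "virtual_users": round(start_users + increment * i),
            "duration_s": stage_seconds,
        }
        for i in range(steps)
    ]
```

Holding each stage for a fixed interval gives the system time to settle, which makes it easier to see at which load level response times start to degrade.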

The following are key points to consider for load testing:

Know the Azure service limits: Different Azure services have soft and hard limits associated with them. The terms soft limit and hard limit describe the current, adjustable service limit (soft limit) and the maximum limit (hard limit). Understand the limits for the services you consume so that you are not blocked if you need to exceed them. For a list of the most common Azure limits, see Azure subscription and service limits, quotas, and constraints.

The ResourceLimits sample shows how to query the limits and quotas for commonly used resources.

Measure typical loads: Knowing the typical and maximum loads on your system helps you understand when something is operating outside of its designed limits. Monitor traffic to understand application behavior.

Understand application behavior under various scales: Load test your application to understand how it performs at various scales. First, test to see how the application performs under a typical load. Then, test to see how it performs under load using different scaling operations. To get additional insight into how to evaluate your application as the amount of traffic sent to it increases, see Autoscale best practices.

Stress testing

Unlike load testing, which ensures that a system can handle what it's designed to handle, stress testing focuses on overloading the system until it breaks. A stress test determines how stable a system is and its ability to withstand extreme increases in load. It does this by testing the maximum number of requests (from another service, for example) that a system can handle at a given time before performance is compromised and the system fails. Find this maximum to understand what kind of load the current environment can adequately support.

Determine the maximum demand you want to place on memory, CPU, and disk IOPS. Once a stress test has been performed, you will know the maximum supported load and an operational margin. It is best to choose an operational threshold so that scaling can be performed before the threshold has been reached.

Once you determine an acceptable operational margin and response time under typical loads, verify that the environment is configured adequately. To do this, make sure the SKUs that you selected are based on the desired margins. Be careful to stay as close as possible to your margins. Allocating too much can increase costs and maintenance unnecessarily; allocating too little can result in poor user experience.

In addition to stress testing through increased load, you can stress test by reducing resources to identify what happens when the machine runs out of memory. You can also stress test by increasing latency (for example, the database takes 10 times longer to reply, or writes to storage take 10 times longer).

Multiregion testing

A multiregion architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use Front Door to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.

Test the amount of time it would take for users to be rerouted to the paired region, so that the paired region doesn't fail under the redirected load. To learn more about routing, see Front Door routing methods. Typically, a planned test failover can help determine how much time would be required to fully scale to support the redirected load.

Configure the environment based on testing results

Once you have performed testing and found an acceptable operational margin and response under increased levels of load, configure the environment to sustain performance efficiency. Scale out or scale in to handle increases and decreases in load. For example, you may know that you will encounter high levels of traffic during the day and low levels on weekends. You may configure the environment to scale out for increases in load or scale in for decreases before the load actually changes.

For more information on autoscaling, see Design for scaling in the Performance Efficiency pillar.

NOTE

Ensure that a rule has been configured to scale the environment back down once load falls below the set thresholds. This will save you money.

Next steps

Testing tools

12/16/2022 • 5 minutes to read

Identify baselines and goals for performance

Caching data

Use Azure Redis to cache data

There are multiple stages in the development and deployment life cycle in which tests can be performed.

Application code, infrastructure automation, and fault tolerance should all be tested. Testing in various stages

can ensure that the application will perform as expected in every situation. You'll want to test early enough in the

application life cycle to catch and fix errors. Errors are cheaper to repair when caught early and can be expensive

or impossible to fix later.

Testing can be automated or manual. Automating tests is the best way to make sure that they're executed.

Depending on how frequently tests are performed, they're typically limited in duration and scope. Manual

testing is run much less frequently. For a list of tests that you should consider while developing and deploying

applications, see Testing your application and Azure environment.

Knowing where you are (baseline) and where you want to be (goal) makes it easier to plan how to get there.

Established baselines and goals will help you to stay on track and measure progress. Testing may also uncover a

need to perform more testing on areas that you may not have planned.

Baselines can vary based on connections or platforms used for accessing the application. It may be important to

establish baselines that address the different connections, platforms, and elements such as time of day, or

weekday versus weekend.

There are many types of goals when determining baselines for application performance. Some examples are the

time it takes to render a page, or a desired number of transactions if your site conducts e-commerce. The

following list shows some examples of questions that may help you to determine goals.

What are your baselines and goals for:

Establishing an initial connection to a service?

An API endpoint complete response?

Server response times?

Latency between systems/microservices?

Database queries?

Caching can dramatically improve performance, scalability, and availability. The more data that you have, the

greater the benefits of caching become. Caching typically works well with data that is immutable or that changes

infrequently. Examples include reference information such as product and pricing information in an e-commerce

application, or shared static resources that are costly to construct. Some or all of this data can be loaded into the

cache at application startup to minimize demand on resources and to improve performance.

Use performance testing and usage analysis to determine whether pre-populating or on-demand loading of the

cache, or a combination of both, is appropriate. The decision should be based on the volatility and usage pattern

of the data. Cache utilization and performance analysis are important in applications that encounter heavy loads

and must be highly scalable.
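The two loading strategies discussed above, pre-populating at startup and loading on demand, can be combined in one small cache wrapper. This is an in-memory sketch for illustration; the class name and the `loader` callback are hypothetical, and a production system would typically use a shared cache service instead of a local dictionary.

```python
class ReferenceDataCache:
    """Cache-aside wrapper around a slow loader function."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical function that reads the data store
        self._items = {}
        self.misses = 0  # counts on-demand loads, useful for usage analysis

    def preload(self, keys):
        """Populate the cache at startup for data known to be in demand."""
        for key in keys:
            self._items[key] = self._loader(key)

    def get(self, key):
        """Return the cached value, loading it on demand on a miss."""
        if key not in self._items:
            self.misses += 1
            self._items[key] = self._loader(key)
        return self._items[key]
```

Tracking the miss count during a performance test gives a rough indication of whether more keys should be pre-populated or whether on-demand loading is sufficient.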

To learn more about how to use caching as a solution in testing, see Caching.

 
Content delivery network
 
Benchmark testing
 
Metrics
Azure Cache for Redis is a caching service that can be accessed from any Azure application, whether the
application is implemented as a cloud service, a website, or inside an Azure virtual machine. Caches can be
shared by client applications that have the appropriate access key. It's a high-performance caching solution that
provides availability, scalability, and security.
To learn more about using Azure Cache for Redis, see Considerations for implementing caching in Azure.
Content delivery networks (CDNs) are typically used to deliver static content such as images, style sheets,
documents, client-side scripts, and HTML pages. The major advantages of using a CDN are lower latency and
faster delivery of content to users, regardless of their geographical location in relation to the datacenter where
the application is hosted. CDNs can help to reduce load on a web application because the application doesn't
have to service requests for the content that is hosted in the CDN. Using a CDN is a good way to minimize the
load on your application, and maximize availability and performance. Consider adopting this strategy for all of
the appropriate content and resources your application uses.
Decide how you'll handle local development and testing when some static content is served from a CDN. For
example, you could pre-deploy the content to the CDN as part of your build script. Alternatively, use compile
directives or flags to control how the application loads the resources. For example, in debug mode, the
application could load static resources from a local folder. In release mode, the application would use the CDN.
To learn more about CDNs, see Best practices for using content delivery networks (CDNs).
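
The debug and release switch described above might look like the following sketch. The CDN host, the local path, and the `APP_DEBUG` environment variable are assumptions made for illustration.

```python
import os

# Hypothetical values: the CDN host and local path are placeholders.
CDN_BASE_URL = "https://cdn.example.com/assets"
LOCAL_BASE_URL = "/static"


def asset_url(path, debug=None):
    """Return the URL for a static asset: local in debug mode, CDN otherwise."""
    if debug is None:
        # APP_DEBUG is an assumed environment variable name for this sketch.
        debug = os.environ.get("APP_DEBUG", "0") == "1"
    base = LOCAL_BASE_URL if debug else CDN_BASE_URL
    return f"{base}/{path.lstrip('/')}"


print(asset_url("css/site.css", debug=True))
print(asset_url("css/site.css", debug=False))
```
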
Benchmarking is the process of simulating different workloads on your application and measuring application
performance for each workload. It's the best way to figure out what resources you'll need to host your
application.
Use performance indicators to assess whether your application is performing as expected or not.
For workloads running on virtual machines, take into consideration VM sizes and disk sizes when
benchmarking, as you may hit a particular bottleneck. The Optimize IOPS, throughput, and latency table offers
further guidance.
Tools such as Azure Load Testing can help you simulate load and different usage patterns. The simulation can
help you prepare for particular scenarios that are relevant to your organization or industry, for example, how a
promotion or flash sale might affect an online store.
Metrics measure trends over time. They're available for interactive analysis in the Azure portal with Azure
Metrics Explorer. Metrics also can be added to an Azure dashboard for visualization in combination with other
data and used for near-real time alerting.
Performance testing gives you the ability to see specific details on the processing capabilities of applications.
You'll most likely want a monitoring tool that allows you to discover proactively if the issues you find through
testing are appearing in both your infrastructure and applications. Azure Monitor Metrics is a feature of Azure
Monitor that collects metrics from monitored resources into a time series database.
With Azure Monitor, you can collect, analyze, and act on telemetry from your cloud and on-premises
environments. It helps you understand how applications are performing and identifies issues affecting them and
the resources they depend on.
For a list of Azure metrics, see Supported metrics with Azure Monitor.

Next steps

Performance monitoring

Monitor the performance of a cloud application

12/16/2022 • 2 minutes to read

Checklist

In this section

Assessment | Description

Are application logs and events correlated across all application components? | Correlate logs and events for subsequent interpretation. This correlation will give you visibility into end-to-end transaction flows.

Are you collecting Azure Activity Logs within the log aggregation tool? | Collect platform metrics and logs to get visibility into the health and performance of services that are part of the architecture.

Are application and resource level logs aggregated in a single data sink, or is it possible to cross-query events at both levels? | Implement a unified solution to aggregate and query application and resource level logs, such as Azure Log Analytics.

Troubleshooting an application's performance requires monitoring and reliable investigation. Issues in

performance can arise from database queries, connectivity between services, under-provisioned resources, or

memory leaks in code.

Continuously monitoring services and checking the health state of current workloads is key in maintaining the

overall performance of the workload. An overall monitoring strategy considers these factors:

Scalability

Resiliency of the infrastructure, application, and dependent services

Application and infrastructure performance

How are you monitoring to ensure the workload is scaling appropriately?

Enable and capture telemetry throughout your application to build and visualize end-to-end transaction

flows for the application.

See metrics from Azure services such as CPU and memory utilization, bandwidth information, current

storage utilization, and more.

Use resource and platform logs to get information about what events occur and under which conditions.

For scalability, look at the metrics to determine how to provision resources dynamically and scale with

demand.

In the collected logs and metrics look for signs that might make a system or its components suddenly

become unavailable.

Use log aggregation technology to gather information across all application components.

Store logs and key metrics of critical components for statistical evaluation and predicting trends.

Identify antipatterns in the code.

Follow these questions to assess the workload at a deeper level.

Azure services

Next section

Related links

The monitoring operations should utilize Azure Monitor. You can analyze data, set up alerts, get end-to-end

views of your applications, and use machine learning–driven insights to identify and resolve problems quickly.

Export logs and metrics to services such as Azure Log Analytics or an external service like Splunk. Furthermore,

application technologies such as Application Insights can enhance the telemetry coming out of applications.

Based on insights gained through monitoring, optimize your code. One option might be to consider other Azure

services that may be more appropriate for your objectives.

Optimize

Back to the main article

Application profiling considerations for performance

monitoring

12/16/2022 • 2 minutes to read

Key points

Application logs

Application instrumentation

Distributed tracing

Continuously monitor the application with Application Performance Monitoring (APM) technology, such as

Azure Application Insights. This technology can help you manage the performance and availability of the

application by aggregating application-level logs and events for subsequent interpretation.

Enable instrumentation and collect data using Azure Application Insights.

Use distributed tracing to build and visualize end-to-end transaction flows for the application.

Separate logs and events of a non-critical environment from a production environment.

Include end-to-end transaction times for key technical functions.

Correlate application log events across critical system flows.

Are application logs collected from different application environments?

Application logs support the end-to-end application lifecycle. Logging is essential in understanding how the

application operates in various environments and what events occur and under which conditions.

Collect application logs and events across all application environments. Use a sufficient degree of separation

and filtering to ensure non-critical environments aren't mixed with production log interpretation. Furthermore,

corresponding log entries across the application should capture a correlation ID for their respective transactions.

Are log messages captured in a structured format?

Structured format in a well-known schema can help expedite parsing and analyzing logs. Structured data can

be indexed, queried, and reported without complexity.

Also, application events should be captured as a structured data type with machine-readable data points rather

than unstructured string types.
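
As an illustration of structured log entries that carry a correlation ID, the sketch below uses only the Python standard library. The field names, the `data` attribute passed through `extra`, and the `orders` logger name are assumptions of this example and not an Application Insights schema.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the transaction currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with machine-readable fields."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Data points passed via extra={"data": {...}} stay structured.
        payload.update(getattr(record, "data", {}))
        return json.dumps(payload)


def build_logger(name="orders"):
    """Create a logger that writes JSON lines to the default stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
correlation_id.set(str(uuid.uuid4()))
logger.info("order accepted", extra={"data": {"order_id": 1234, "duration_ms": 87}})
```

Because every entry is a JSON object with the same keys, a log aggregation tool can index and filter on `correlation_id` or `duration_ms` without parsing free-form strings.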

Do you have detailed instrumentation in the application code?

Instrumentation of your code allows precise detection of underperforming pieces when load or stress tests are

applied. It is critical to have this data available to improve and identify performance opportunities in the

application code.

Use APM such as Application Insights to continuously improve performance and usability. You need to enable

Application Insights by installing an instrumentation package. The service provides extensive telemetry out of

the box. You can customize what is captured for greater visibility. After it's enabled, metrics and logs related to

the performance and operations are collected. View and analyze the captured data in Azure Monitor.

Critical targets

Cost considerations

Related links

Events coming from different application components or different component tiers of the application should be

correlated to build end-to-end transaction flows. Use distributed tracing to build and visualize flows for the

application. For instance, this is often achieved by using consistent correlation IDs transferred between

components within a transaction.

Are application events correlated across all application components?

Event correlation between the layers of the application provides the ability to connect tracing data of the

complete application stack. Once this connection is made, you can see a complete picture of where time is spent

at each layer. You can then query the repositories of tracing data in correlation to a unique identifier that

represents a completed transaction that has flowed through the system.

For more information, see Distributed tracing.
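
To show what querying trace data by a unique identifier can yield, the following sketch groups events by correlation ID and reports the time spent per layer. The event tuples and component names are hypothetical sample data, and each duration includes the time spent waiting on downstream calls.

```python
from collections import defaultdict

# Hypothetical trace events: (correlation_id, component, start_ms, end_ms).
EVENTS = [
    ("txn-1", "web", 0, 120),
    ("txn-1", "api", 10, 110),
    ("txn-1", "database", 30, 90),
    ("txn-2", "web", 5, 45),
    ("txn-2", "api", 10, 40),
]


def time_per_component(events, txn_id):
    """Sum the time each component spent on one transaction."""
    totals = defaultdict(int)
    for correlation_id, component, start_ms, end_ms in events:
        if correlation_id == txn_id:
            totals[component] += end_ms - start_ms
    return dict(totals)


def end_to_end_ms(events, txn_id):
    """Elapsed time from the first start to the last end of a transaction."""
    spans = [(start, end) for cid, _, start, end in events if cid == txn_id]
    return max(end for _, end in spans) - min(start for start, _ in spans)


print(time_per_component(EVENTS, "txn-1"))
print(end_to_end_ms(EVENTS, "txn-1"))
```
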

Is it possible to evaluate critical application performance targets and non-functional requirements (NFRs)?

Application level metrics should include end-to-end transaction times of key technical functions, such as

database queries, response times for external API calls, failure rates of processing steps, and more.

Is the end-to-end performance of critical system flows monitored?

It should be possible to correlate application log events across critical system flows, such as user login, to fully

assess the health of key scenarios in the context of targets and NFRs.

If you are using Application Insights to collect instrumentation data, there are cost considerations. For more

information, see Manage usage and costs for Application Insights.

Monitor infrastructure

Distributed tracing

Application Insights

Azure Monitor

Back to the main article

Analyze infrastructure metrics and logs

12/16/2022 • 4 minutes to read

Key points

Platform metrics

Platform logs

Performance issues can occur because of interaction between the application and other services in the

architecture. For example, issues in database queries, connectivity between services, and under-provisioned

resources are all common causes for inefficiencies.

The practice of continuous monitoring must include analysis of platform metrics and logs to get visibility into

the health and performance of services that are part of the architecture.

View platform metrics to get visibility into the health and performance of Azure services.

Use log data to get visibility into the operations and events of the management plane.

Track events from internal dependencies.

Check the health of external dependencies such as an API service.

Metrics are numerical values that are collected at regular intervals and describe some aspect of a system at a

particular time. View the platform metrics that are generated by the services used in the architecture. Each Azure

service has a set of metrics that's unique to the functionality of the resource. These metrics give you visibility into

their health and performance. There's no added configuration for Azure resources. You can also define custom

metrics for an Azure service using the custom metrics API.

Azure Monitor Metrics is a feature of Azure Monitor that collects numeric data from monitored resources into a

time series database. To learn more about Azure Monitor Metrics, see What can you do with Azure Monitor

Metrics?

If your application is running in Azure Virtual Machines, configure Azure Diagnostics extension to send guest OS

performance metrics to Azure Monitor. Guest OS metrics include performance counters that track guest CPU

percentage or memory usage, both of which are frequently used for autoscaling or alerting.

For more information, see Supported metrics with Azure Monitor.

Also, use technology-specific tools for the services used in the architecture. For example, use network traffic

capturing tools, such as Azure Network Watcher.

One of the challenges to metric data is that it often has limited information to provide context for collected

values. Azure Monitor addresses this challenge with multi-dimensional metrics. These metrics are name-value

pairs that carry more data to describe the metric value. To learn about multi-dimensional metrics and an

example for network throughput, see multi-dimensional metrics.
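
The idea of dimensions as name-value pairs can be sketched as follows. The sample records, metric name, and dimension names are hypothetical and do not reproduce the Azure Monitor data model.

```python
from collections import defaultdict

# Hypothetical samples: each value carries name-value dimensions for context.
SAMPLES = [
    {"metric": "network_throughput_kbps", "value": 1000,
     "dimensions": {"ip": "10.0.0.4", "direction": "send"}},
    {"metric": "network_throughput_kbps", "value": 500,
     "dimensions": {"ip": "10.0.0.4", "direction": "receive"}},
    {"metric": "network_throughput_kbps", "value": 250,
     "dimensions": {"ip": "10.0.0.5", "direction": "send"}},
    {"metric": "network_throughput_kbps", "value": 750,
     "dimensions": {"ip": "10.0.0.5", "direction": "receive"}},
]


def total_by_dimension(samples, metric, dimension):
    """Sum a metric's values, split by the values of one dimension."""
    totals = defaultdict(int)
    for sample in samples:
        if sample["metric"] != metric:
            continue
        key = sample["dimensions"].get(dimension, "unknown")
        totals[key] += sample["value"]
    return dict(totals)


print(total_by_dimension(SAMPLES, "network_throughput_kbps", "direction"))
print(total_by_dimension(SAMPLES, "network_throughput_kbps", "ip"))
```

Without the dimensions, only the overall total of 2500 would be available. With them, the same data answers questions per address or per direction.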

Azure provides various operational logs from the platform and the resources. These logs provide insight into

what events occurred, what changes were made to the resource, and more. These logs are useful in tracking

operations. For example, you can track scaling events to check if autoscaling is working as expected.

Azure Monitor Logs can store various different data types each with their own structure. You can also perform

complex analysis on logs data using log queries, which cannot be used for analysis of metrics data. Azure

Monitor Logs is capable of supporting near real-time scenarios, making it useful for alerting and fast
detection of issues. To learn more about Azure Monitor Logs, see What can you do with Azure Monitor Logs?
 
Cost considerations of monitoring
 
Next
 
Related links
Are you collecting Azure Activity Logs within the log aggregation tool?
Azure Activity Logs provide audit information about when an Azure resource is modified, such as when a virtual
machine is started or stopped. This information is useful for the interpretation and troubleshooting of issues. It
provides transparency around configuration changes that can be mapped to adverse performance events.
Are logs available for critical internal dependencies?
To build a robust application health model, ensure there is visibility into the operational state of critical internal
dependencies.
In Azure Monitor, enable Azure resource logs so that you have visibility into operations that were done within an
Azure resource. Similar to platform metrics, resource logs vary by the Azure service and resource type.
For example, for services such as a shared NVA (network virtual appliance) or ExpressRoute connection,
monitor the network performance. Azure Monitor can also help diagnose networking related issues. You can
trigger a packet capture, diagnose routing issues, analyze network security group flow logs, and gain visibility
and control over your Azure network.
Here are some tools:
Network performance monitor
Service connectivity monitor
ExpressRoute monitor
Additionally, data from network traffic capturing tools, such as Azure Network Watcher, can be helpful.
Are critical external dependencies monitored?
Monitor critical external dependencies, such as an API service, to ensure operational visibility of performance.
For example, a probe could be used to measure the latency of an external API.
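
A latency probe of this kind could be sketched as follows, using only the Python standard library. The function names, the result fields, and the example URL are assumptions. The probe action is injectable so that the example runs without network access.

```python
import time
import urllib.request


def http_get(url, timeout_s=5.0):
    """Default probe action: issue a GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(url, action=http_get, clock=time.perf_counter):
    """Call an external dependency once and report latency and outcome."""
    started = clock()
    try:
        status = action(url)
        healthy = 200 <= status < 300
    except OSError:
        # urllib errors and socket timeouts derive from OSError.
        status, healthy = None, False
    latency_ms = (clock() - started) * 1000
    return {"url": url, "status": status, "healthy": healthy, "latency_ms": latency_ms}


# Exercise the probe with a stand-in action so the example runs offline.
result = probe("https://api.example.com/health", action=lambda url: 200)
print(result["healthy"], result["latency_ms"] >= 0)
```

Running such a probe on a schedule and sending the results to the log aggregation tool gives the dependency the same operational visibility as internal components.
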
The Azure Monitor billing model is based on consumption. Azure creates metered instances that track usage to
calculate your bill. Pricing depends on the metrics, alerting, notifications, Log Analytics, and Application
Insights.
For information about usage and estimated costs, see Monitoring usage and estimated costs in Azure Monitor.
You can also use the Pricing Calculator to determine your pricing. The Pricing Calculator will help you estimate
your likely costs based on your expected utilization.
Data analysis considerations
Supported metrics with Azure Monitor
Network performance monitor
Azure Network Watcher
Back to the main article

Performance data integration

12/16/2022 • 3 minutes to read

Key points

Data interpretation

Aggregated view

Performance testing and investigation should be based on data captured from repeatable processes. To

understand how an application's performance is affected by code and infrastructure changes, retain data for

analysis. Additionally, it's important to measure how performance has changed over time, not just compared to

the last measurement taken.

This article describes some considerations and tools you can use to aggregate data for troubleshooting and

analyzing performance trends.

Analyze performance data holistically to detect fault types, bottleneck regressions, and health states.

Use log aggregation technologies to consolidate data into a single workspace and analyze using a

sophisticated query language.

Retain data in a time-series database to predict performance issues before they occur.

Balance the retention policy and service pricing plans with the cost expectation of the organization.

The overall performance can be impacted by both application-level issues and resource-level failures. It's vital

that all data is correlated and evaluated together. This will optimize the detection of issues and troubleshooting

of detected issues. This approach will help to distinguish between transient and non-transient faults.

Use a holistic approach to quantify what healthy and unhealthy states represent across all application

components. It's highly recommended that a traffic light model is used to indicate a healthy state. For example, a

green light shows that key non-functional requirements and targets are fully satisfied and resources are optimally

used. For example, a healthy state can be that 95% of requests are processed in <= 500 ms with AKS node

utilization at x%, and so on. Also, an Application Map can help spot performance bottlenecks or failure hotspots

across components of a distributed application.
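
A traffic light model of this kind can be expressed as a small function. In the sketch below, the 500 ms target for the 95th percentile comes from the example above. The 80% utilization limit stands in for the unspecified x%, and the amber and red rules are assumptions made for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for an integer pct."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]


def health_state(latencies_ms, node_utilization_pct,
                 p95_target_ms=500, utilization_limit_pct=80):
    """Classify health as green, amber, or red from latency and utilization."""
    latency_ok = percentile(latencies_ms, 95) <= p95_target_ms
    utilization_ok = node_utilization_pct <= utilization_limit_pct
    if latency_ok and utilization_ok:
        return "green"   # all targets met
    if latency_ok or utilization_ok:
        return "amber"   # one target missed
    return "red"         # both targets missed


mostly_fast = [100] * 19 + [900]
half_slow = [100] * 10 + [900] * 10
print(health_state(mostly_fast, 60))
print(health_state(half_slow, 60))
print(health_state(half_slow, 95))
```
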

Also, analyze long-term operational data to get historical context and detect if there have been any regressions.

For example, check the average response times to see if they have been slowly increasing over time and getting

closer to the maximum target.
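
A slow upward drift of this kind can be detected with a simple least-squares trend. The sketch below is one possible approach and not a prescribed method: the weekly averages and the 500 ms maximum target are hypothetical, and the projection extends the fitted slope from the last observed value.

```python
def slope_per_period(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def periods_until_target(values, max_target):
    """Estimate how many periods remain before the trend reaches max_target."""
    slope = slope_per_period(values)
    if slope <= 0:
        return None  # flat or improving: no projected breach
    return max(0.0, (max_target - values[-1]) / slope)


weekly_avg_ms = [310, 320, 330, 340, 350]
print(slope_per_period(weekly_avg_ms))
print(periods_until_target(weekly_avg_ms, 500))
```
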

Log aggregation technologies should be used to collate logs and metrics across all application components,

including infrastructural components for later evaluation.

Resources may include Azure IaaS and PaaS services and third-party appliances such as firewalls or anti-malware

solutions used in the application. For example, if Azure Event Hubs is used, the Diagnostic Settings

should be configured to push logs and metrics to the data sink.

Azure Monitor has the capability of collecting and organizing log and performance data from monitored

resources. Data is consolidated into an Azure Log Analytics workspace so that it can be analyzed together using a

sophisticated query language that can quickly analyze millions of records. Splunk is another popular choice.

How is aggregated monitoring enforced?

 
Long-term data
 
Cost considerations
 
Next
 
Related links
All application resources should be configured to route diagnostic logs and metrics to the chosen log
aggregation technology. Use Azure Policy to ensure the consistent use of diagnostic settings across the
application and to enforce the desired configuration for each Azure service.
Store long-term operational data to understand the history of application performance. This data store is
important for analyzing performance trends and regressions.
Are long-term trends analyzed to predict performance issues before they occur?
It's often helpful to store such data in a time-series database (TSDB) and then view the data from an operational
dashboard. An Azure Data Explorer cluster is a powerful TSDB that can store any schema of data, including
performance test metrics. Grafana, an open-source platform for observability dashboards, can then be used to
query your Azure Data Explorer cluster to view performance trends in your application.
Have retention times been defined for logs and metrics, with housekeeping mechanisms configured?
Clear retention times should be defined to allow for suitable historic analysis but also control storage costs.
Suitable housekeeping tasks should also be used to archive data to cheaper storage or aggregate data for
long-term trend analysis.
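
A housekeeping task of this kind might keep recent data at full resolution, aggregate older data, and drop the rest. The sketch below assumes a 30-day raw window, a 365-day total window, and hourly averaging. All three are illustrative choices, not recommended values.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def apply_retention(points, now, raw_days=30, total_days=365):
    """Split (timestamp, value) points into raw data and hourly averages.

    Points newer than raw_days stay as-is, older points up to total_days
    are averaged per hour, and anything older is dropped.
    """
    raw_cutoff = now - timedelta(days=raw_days)
    drop_cutoff = now - timedelta(days=total_days)
    raw, buckets = [], defaultdict(list)
    for timestamp, value in points:
        if timestamp >= raw_cutoff:
            raw.append((timestamp, value))
        elif timestamp >= drop_cutoff:
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            buckets[hour].append(value)
    hourly = sorted((hour, sum(v) / len(v)) for hour, v in buckets.items())
    return raw, hourly


now = datetime(2023, 1, 1, 12, 0)
points = [
    (datetime(2022, 12, 31, 12, 0), 100),   # recent: kept raw
    (datetime(2022, 10, 1, 8, 5), 200),     # older: aggregated
    (datetime(2022, 10, 1, 8, 45), 300),    # older: aggregated
    (datetime(2021, 1, 1, 0, 0), 400),      # past retention: dropped
]
raw, hourly = apply_retention(points, now)
print(raw)
print(hourly)
```
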
While correlating data is recommended, there are cost implications to storing long-term data. For example,
Azure Monitor is capable of collecting, indexing, and storing massive amounts of data in a Log Analytics
workspace. The cost of a Log Analytics workspace is based on the amount of storage, retention period, and the plan.
For more information about how you can balance cost, see Manage usage and costs with Azure Monitor Logs.
If you are using Application Insights to collect instrumentation data, there are cost considerations. For more
information, see Manage usage and costs for Application Insights.
Scalability and reliability considerations
Application Map
Azure Policy
Azure Data Explorer cluster
Manage usage and costs with Azure Monitor Logs
Manage usage and costs for Application Insights
Back to the main article

Performance Efficiency patterns

12/16/2022 • 2 minutes to read

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an

efficient manner. You need to anticipate these increases to meet business requirements. An important

consideration in achieving performance efficiency is to consider how your application scales and to implement

PaaS offerings that have built-in scaling operations. Scalability is the ability of a system either to handle increases in

load without impact on performance or for the available resources to be readily increased. It concerns not just

compute instances, but other elements such as data storage, messaging infrastructure, and more.

Pattern | Summary

Cache-Aside | Load data on demand into a cache from a data store.

Choreography | Have each component of the system participate in the decision-making process about the workflow of a business transaction, instead of relying on a central point of control.

CQRS | Segregate operations that read data from operations that update data by using separate interfaces.

Event Sourcing | Use an append-only store to record the full series of events that describe actions taken on data in a domain.

Deployment Stamps | Deploy multiple independent copies of application components, including data stores.

Geodes | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region.

Index Table | Create indexes over the fields in data stores that are frequently referenced by queries.

Materialized View | Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.

Priority Queue | Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.

Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.

Sharding | Divide a data store into a set of horizontal partitions or shards.

Static Content Hosting | Deploy static content to a cloud-based storage service that can deliver them directly to the client.

Throttling | Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service.

Performance efficiency checklist

12/16/2022 • 14 minutes to read

Application design

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an

efficient manner, and is one of the pillars of the Microsoft Azure Well-Architected Framework. Use this checklist

to review your application architecture from a performance efficiency standpoint.

Partition the workload. Design parts of the process to be discrete and decomposable. Minimize the size of

each part, while following the usual rules for separation of concerns and the single responsibility principle. This

allows the component parts to be distributed in a way that maximizes use of each compute unit (such as a role

or database server). It also makes it easier to scale the application by adding instances of specific resources. For

complex domains, consider adopting a microservices architecture.

Design for scaling. Scaling allows applications to react to variable load by increasing and decreasing the

number of instances of roles, queues, and other services they use. However, the application must be designed

with this in mind. For example, the application and the services it uses must be stateless, to allow requests to be

routed to any instance. This also prevents the addition or removal of specific instances from adversely affecting

current users. You should also implement configuration or autodetection of instances as they are added and

removed, so that code in the application can perform the necessary routing. For example, a web application

might use a set of queues in a round-robin approach to route requests to background services running in

worker roles. The web application must be able to detect changes in the number of queues, to successfully route

requests and balance the load on the application.
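
The round-robin routing with detection of queue changes could be sketched as follows. `RoundRobinRouter` and the `discover_queues` callable are hypothetical names. A real implementation would likely cache the discovery result instead of calling it on every request.

```python
import itertools


class RoundRobinRouter:
    """Route requests across queues and pick up changes in the queue list."""

    def __init__(self, discover_queues):
        self._discover = discover_queues  # callable returning current queue names
        self._queues = []
        self._cycle = iter(())

    def _refresh(self):
        current = list(self._discover())
        if current != self._queues:
            # The set of queues changed: restart the rotation over the new list.
            self._queues = current
            self._cycle = itertools.cycle(current)

    def next_queue(self):
        """Return the queue that should receive the next request."""
        self._refresh()
        if not self._queues:
            raise RuntimeError("no queues available")
        return next(self._cycle)


queues = ["q1", "q2"]
router = RoundRobinRouter(lambda: queues)
before = [router.next_queue() for _ in range(3)]
queues.append("q3")  # a queue is added while the application is running
after = [router.next_queue() for _ in range(3)]
print(before, after)
```
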

Scale as a unit. Plan for additional resources to accommodate growth. For each resource, know the upper

scaling limits, and use sharding or decomposition to go beyond these limits. Determine the scale units for the

system in terms of well-defined sets of resources. This makes applying scale-out operations easier, and less

prone to negative impact on the application through limitations imposed by lack of resources in some part of

the overall system. For example, adding x number of web and worker roles might require y number of

additional queues and z number of storage accounts to handle the additional workload generated by the roles.

So a scale unit could consist of x web and worker roles, y queues, and z storage accounts. Design the application

so that it's easily scaled by adding one or more scale units. Consider using the Deployment Stamps pattern to

deploy scale units.
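
The arithmetic behind scale units can be made explicit. In this sketch, the per-unit resource counts correspond to x, y, and z in the text. The requests-per-second capacity of a unit and all numeric values are assumptions that would come from your own benchmarks.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    web_and_worker_roles: int   # x in the text
    queues: int                 # y in the text
    storage_accounts: int       # z in the text
    requests_per_second: int    # assumed measured capacity of one unit


def plan_capacity(unit, target_rps):
    """Return the number of scale units and total resources for a target load."""
    units = math.ceil(target_rps / unit.requests_per_second)
    return {
        "scale_units": units,
        "web_and_worker_roles": units * unit.web_and_worker_roles,
        "queues": units * unit.queues,
        "storage_accounts": units * unit.storage_accounts,
    }


unit = ScaleUnit(web_and_worker_roles=4, queues=2, storage_accounts=1,
                 requests_per_second=1000)
print(plan_capacity(unit, target_rps=2500))
```
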

Avoid client affinity. Where possible, ensure that the application does not require affinity. Requests can thus

be routed to any instance, and the number of instances is irrelevant. This also avoids the overhead of storing,

retrieving, and maintaining state information for each user.

Take advantage of platform autoscaling features. Where the hosting platform supports an autoscaling

capability, such as Azure Autoscale, prefer it to custom or third-party mechanisms unless the built-in mechanism

can't fulfill your requirements. Use scheduled scaling rules where possible to ensure resources are available

without a start-up delay, but add reactive autoscaling to the rules where appropriate to cope with unexpected

changes in demand. You can use the autoscaling operations in the classic deployment model to adjust

autoscaling, and to add custom counters to rules. For more information, see Auto-scaling guidance.

Offload CPU-intensive and I/O-intensive tasks as background tasks. If a request to a service is expected

to take a long time to run or absorb considerable resources, offload the processing for this request to a separate

task. Use worker roles or background jobs (depending on the hosting platform) to execute these tasks. This

strategy enables the service to continue receiving further requests and remain responsive. For more

information, see Background jobs guidance.

Data management

Distribute the workload for background tasks. Where there are many background tasks, or the tasks

require considerable time or resources, spread the work across multiple compute units (such as worker roles or

background jobs). For one possible solution, see the Competing Consumers pattern.
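
Spreading background work across several workers that read from one queue can be illustrated in-process with threads. This is a simplified sketch: in a cloud deployment the queue would be a messaging service and the workers separate compute units, and the handler is assumed not to raise.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=3):
    """Process tasks with several workers that compete for items on one queue."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                work.task_done()
                return
            outcome = handler(item)
            with results_lock:
                results.append(outcome)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:
        work.put(task)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    work.join()
    for thread in threads:
        thread.join()
    return results


print(sorted(run_workers(range(10), lambda n: n * n)))
```

The order in which results arrive depends on scheduling, which is why the example sorts them before printing.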

Consider moving toward a shared-nothing architecture. A shared-nothing architecture uses

independent, self-sufficient nodes that have no single point of contention (such as shared services or storage). In

theory, such a system can scale almost indefinitely. While a fully shared-nothing approach is generally not

practical for most applications, it may provide opportunities to design for better scalability. For example,

avoiding server-side session state and client affinity, and partitioning data, are good examples of moving

toward a shared-nothing architecture.

Use data partitioning. Divide the data across multiple databases and database servers, or design the

application to use data storage services that can provide this partitioning transparently (examples include Azure

SQL Database Elastic Database, and Azure Table storage). This approach can help to maximize performance and

allow easier scaling. There are different partitioning techniques, such as horizontal, vertical, and functional. You

can use a combination of these to achieve maximum benefit from increased query performance, simpler

scalability, more flexible management, better availability, and to match the type of store to the data it will hold.

Also, consider using different types of data store for different types of data, choosing the types based on how

well they are optimized for the specific type of data. This may include using table storage, a document database,

or a column-family data store, instead of, or as well as, a relational database. For more information, see Data

partitioning guidance.
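
Horizontal partitioning can be sketched with a stable hash that maps each key to a shard. The class and function names are hypothetical, and in-memory dictionaries stand in for separate databases. The sketch uses SHA-256 because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Key-value store that spreads entries across a fixed number of shards."""

    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self._shards[shard_for(key, len(self._shards))][key] = value

    def get(self, key):
        return self._shards[shard_for(key, len(self._shards))].get(key)

    def sizes(self):
        """Number of entries held by each shard."""
        return [len(shard) for shard in self._shards]


store = ShardedStore(shard_count=4)
for customer_id in range(100):
    store.put(f"customer-{customer_id}", {"id": customer_id})
print(store.get("customer-42"), store.sizes())
```

A fixed modulo makes adding shards expensive because most keys move. Range-based or lookup-based strategies trade that cost for other constraints.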

Design for eventual consistency. Eventual consistency improves scalability by reducing or removing the

time needed to synchronize related data partitioned across multiple stores. The cost is that data is not always

consistent when it is read, and some write operations may cause conflicts. Eventual consistency is ideal for

situations where the same data is read frequently but written infrequently. For more information, see the Data

Consistency Primer.

Reduce chatty interactions between components and services. Avoid designing interactions in which an

application is required to make multiple calls to a service (each of which returns a small amount of data), rather

than a single call that can return all of the data. Where possible, combine several related operations into a single

request when the call is to a service or component that has noticeable latency. This makes it easier to monitor

performance and optimize complex operations. For example, use stored procedures in databases to encapsulate

complex logic, and reduce the number of round trips and resource locking.

Use queues to level the load for high velocity data writes. Surges in demand for a service can

overwhelm that service and cause escalating failures. To prevent this, consider implementing the Queue-Based

Load Leveling pattern. Use a queue that acts as a buffer between a task and a service that it invokes. This can

smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out.
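The buffering idea behind queue-based load leveling can be sketched in-process with Python's standard library. This is only an illustration under simplifying assumptions: `run_load_leveled` is a hypothetical helper, the queue lives in memory rather than in a durable messaging service, and a handler that raises is not dealt with.

```python
import queue
import threading


def run_load_leveled(requests, handle, workers=2, max_pending=100):
    """Buffer requests in a bounded queue and drain them with a fixed worker pool."""
    pending = queue.Queue(maxsize=max_pending)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:              # sentinel: no more work for this worker
                pending.task_done()
                return
            outcome = handle(item)
            with lock:
                results.append(outcome)
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for request in requests:              # put() blocks while the buffer is full
        pending.put(request)
    for _ in threads:
        pending.put(None)                 # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

The service side works at a steady pace set by `workers`, while a burst of producers is absorbed (or slowed down) by the bounded queue instead of overwhelming the handler.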

Minimize the load on the data store. The data store is commonly a processing bottleneck, a costly resource,

and often not easy to scale out. Where possible, remove logic (such as processing XML documents or JSON

objects) from the data store, and perform processing within the application. For example, instead of passing

XML to the database (other than as an opaque string for storage), serialize or deserialize the XML within the

application layer and pass it in a form that is native to the data store. It's typically much easier to scale out the

application than the data store, so you should attempt to do as much of the compute-intensive processing as

possible within the application.

Minimize the volume of data retrieved. Retrieve only the data you require by specifying columns and using

criteria to select rows. Make use of table-valued parameters and the appropriate isolation level. Use mechanisms

like entity tags to avoid retrieving data unnecessarily.

Aggressively use caching. Use caching wherever possible to reduce the load on resources and services that

generate or deliver data. Caching is typically suited to data that is relatively static, or that requires considerable

processing to obtain. Caching should occur at all levels where appropriate in each layer of the application,

including data access and user interface generation. For more information, see the Caching Guidance.
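A minimal cache-aside sketch in Python shows the basic shape: look in the cache first, and call the slower source only on a miss or after expiry. `TTLCache` and its parameters are hypothetical names for illustration; a production system would more likely use a shared cache service than a per-process dictionary.

```python
import time


class TTLCache:
    """Tiny in-process cache-aside helper with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}                # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]               # cache hit: the source is not touched
        value = loader(key)               # cache miss or expired: go to the source
        self._entries[key] = (now + self.ttl, value)
        return value
```

The time-to-live bounds how stale an entry can become, which is the main knob when the cached data is only relatively static.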

Handle data growth and retention. The amount of data stored by an application grows over time. This

growth increases storage costs as well as latency when accessing the data, affecting application throughput and

performance. It may be possible to periodically archive some of the old data that is no longer accessed, or move

data that is rarely accessed into long-term storage that is more cost efficient, even if the access latency is higher.

Optimize Data Transfer Objects (DTOs) using an efficient binary format. DTOs are passed between the

layers of an application many times. Minimizing the size reduces the load on resources and the network.

However, balance the savings with the overhead of converting the data to the required format in each location

where it is used. Adopt a format that has the maximum interoperability to enable easy reuse of a component.

Set cache control. Design and configure the application to use output caching or fragment caching where

possible, to minimize processing load.

Enable client-side caching. Web applications should enable cache settings on the content that can be cached.

This is commonly disabled by default. Configure the server to deliver the appropriate cache control headers to

enable caching of content on proxy servers and clients.

Use Azure blob storage and the Azure Content Delivery Network to reduce the load on the

application. Consider storing static or relatively static public content, such as images, resources, scripts, and

style sheets, in blob storage. This approach relieves the application of the load caused by dynamically generating

this content for each request. Additionally, consider using the Content Delivery Network to cache this content

and deliver it to clients. Using the Content Delivery Network can improve performance at the client because the

content is delivered from the geographically closest datacenter that contains a Content Delivery Network cache.

For more information, see the Content Delivery Network Guidance.

Optimize and tune SQL queries and indexes. Some T-SQL statements or constructs may have an adverse

effect on performance that can be reduced by optimizing the code in a stored procedure. For example, avoid

converting datetime types to a varchar before comparing with a datetime literal value. Use date/time

comparison functions instead. Lack of appropriate indexes can also slow query execution. If you use an

object/relational mapping framework, understand how it works and how it may affect performance of the data

access layer. For more information, see Query Tuning.

Consider denormalizing data. Data normalization helps to avoid duplication and inconsistency. However,

maintaining multiple indexes, checking for referential integrity, performing multiple accesses to small chunks of

data, and joining tables to reassemble the data impose an overhead that can affect performance. Consider if

some additional storage volume and duplication is acceptable in order to reduce the load on the data store. Also

consider if the application itself (which is typically easier to scale) can be relied on to take over tasks such as

managing referential integrity in order to reduce the load on the data store. For more information, see Data

partitioning guidance.

Implementation

Review the performance antipatterns. See Performance antipatterns for cloud applications for common

practices that are likely to cause scalability problems when an application is under pressure.

Use asynchronous calls. Use asynchronous code wherever possible when accessing resources or services

that may be limited by I/O or network bandwidth, or that have a noticeable latency, in order to avoid blocking the

calling thread.
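As a small illustration, the Python sketch below issues several I/O-bound calls concurrently with `asyncio` instead of waiting for each in turn. The `fetch` coroutine is a stand-in that only sleeps; the names and delays are made up for the example.

```python
import asyncio


async def fetch(name, delay):
    """Stand-in for an I/O-bound call; asyncio.sleep simulates latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def fetch_all(calls):
    """Start every call at once and wait for all of them together."""
    return await asyncio.gather(*(fetch(name, delay) for name, delay in calls))


def main():
    calls = [("profile", 0.05), ("orders", 0.05), ("prices", 0.05)]
    return asyncio.run(fetch_all(calls))
```

Because the three waits overlap, the total elapsed time is close to the slowest single call rather than the sum of all three, and the calling thread stays free while the calls are in flight.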

Avoid locking resources, and use an optimistic approach instead. Never lock access to resources such

as storage or other services that have noticeable latency, because this is a primary cause of poor performance.

Always use optimistic approaches to managing concurrent operations, such as writing to storage. Use features


of the storage layer to manage conflicts. In distributed applications, data may be only eventually consistent.
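One common form of optimistic concurrency is a conditional write that succeeds only if the record's version is unchanged since it was read. The sketch below uses a hypothetical in-memory `VersionedStore` to show the read, modify, conditional-write, retry loop; real stores usually expose the same idea through entity tags or row versions.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self):
        self._data = {}                   # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)    # someone else wrote first
        self._data[key] = (current_version + 1, value)


def update_with_retry(store, key, change, attempts=3):
    """Re-read and retry instead of holding a lock across the update."""
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            store.write(key, change(value), expected_version=version)
            return True
        except VersionConflict:
            continue
    return False
```

No lock is held while `change` runs; a conflicting writer simply causes one more read and retry.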

Compress highly compressible data over high latency, low bandwidth networks. In the majority of

cases in a web application, the largest volume of data generated by the application and passed over the network

is HTTP responses to client requests. HTTP compression can reduce this considerably, especially for static

content. This can reduce cost as well as reducing the load on the network, though compressing dynamic content

does apply a fractionally higher load on the server. In other, more generalized environments, data compression

can reduce the volume of data transmitted and minimize transfer time and costs, but the compression and

decompression processes incur overhead. As such, compression should only be used when there is a

demonstrable gain in performance. Other serialization methods, such as JSON or binary encodings, may reduce

the payload size while having less impact on performance, whereas XML is likely to increase it.
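The "only when there is a demonstrable gain" rule can be expressed directly in code. In this Python sketch, `compress_if_smaller` is a hypothetical helper and the 10% saving threshold is an arbitrary assumption; it compresses a payload with gzip and keeps the result only if it is meaningfully smaller.

```python
import gzip
import json


def compress_if_smaller(payload: bytes, min_saving=0.1):
    """Return (body, encoding); compress only when it saves at least min_saving."""
    compressed = gzip.compress(payload)
    if len(compressed) <= len(payload) * (1 - min_saving):
        return compressed, "gzip"
    return payload, "identity"            # not worth the CPU on either side
```

Repetitive JSON typically shrinks a great deal, while tiny or already-compressed payloads can grow because of the gzip header, which is why the check is made per payload.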

Minimize the time that connections and resources are in use. Maintain connections and resources only

for as long as you need to use them. For example, open connections as late as possible, and allow them to be

returned to the connection pool as soon as possible. Acquire resources as late as possible, and dispose of them

as soon as possible.

Minimize the number of connections required. Service connections absorb resources. Limit the number

that are required and ensure that existing connections are reused whenever possible. For example, after

performing authentication, use impersonation where appropriate to run code as a specific identity. This can help

to make best use of the connection pool by reusing connections.

NOTE

APIs for some services automatically reuse connections, provided service-specific guidelines are followed. It's important

that you understand the conditions that enable connection reuse for each service that your application uses.

Send requests in batches to optimize network use. For example, send and read messages in batches

when accessing a queue, and perform multiple reads or writes as a batch when accessing storage or a cache.

This can help to maximize efficiency of the services and data stores by reducing the number of calls across the

network.
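A generic batching helper makes the saving concrete. In this Python sketch, `batched` and `send_in_batches` are hypothetical names, and `send_batch` stands for whatever bulk operation the target queue, cache, or store actually offers; the batch size of 32 is an arbitrary assumption.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def send_in_batches(messages, send_batch, size=32):
    """Call send_batch once per batch; returns the number of calls made."""
    calls = 0
    for batch in batched(messages, size):
        send_batch(batch)
        calls += 1
    return calls
```

With these settings, 100 messages need 4 network calls instead of 100. Services usually cap the batch size, so check the limit of the service you call.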

Avoid a requirement to store server-side session state where possible. Server-side session state

management typically requires client affinity (that is, routing each request to the same server instance), which

affects the ability of the system to scale. Ideally, you should design clients to be stateless with respect to the

servers that they use. However, if the application must maintain session state, store sensitive data or large

volumes of per-client data in a distributed server-side cache that all instances of the application can access.

Optimize table storage schemas. When using table stores that require the table and column names to be

passed and processed with every query, such as Azure table storage, consider using shorter names to reduce

this overhead. However, do not sacrifice readability or manageability by using overly compact names.

Create resource dependencies during deployment or at application startup. Avoid repeated calls to
methods that test the existence of a resource and then create the resource if it does not exist. Methods such as
CloudTable.CreateIfNotExists and CloudQueue.CreateIfNotExists in the Azure Storage Client Library follow this
pattern. These methods can impose considerable overhead if they are invoked before each access to a storage
table or storage queue. Instead:

Create the required resources when the application is deployed, or when it first starts (a single call to
CreateIfNotExists for each resource in the startup code for a web or worker role is acceptable). However, be
sure to handle exceptions that may arise if your code attempts to access a resource that doesn't exist. In these
situations, you should log the exception, and possibly alert an operator that a resource is missing.

Under some circumstances, it may be appropriate to create the missing resource as part of the exception

handling code. But you should adopt this approach with caution as the non-existence of the resource might

be indicative of a programming error (a misspelled resource name for example), or some other

infrastructure-level issue.
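The pattern can be sketched in Python with a hypothetical in-memory service. `FakeQueueService`, `on_startup`, and `submit` are illustrative names only and do not mirror the Azure Storage Client Library API; the point is where the existence check happens, not how a particular SDK spells it.

```python
import logging

log = logging.getLogger("startup-resources")


class ResourceMissing(Exception):
    """Raised when a queue is used before it exists."""


class FakeQueueService:
    """Hypothetical in-memory stand-in for a storage service client."""

    def __init__(self):
        self.queues = {}
        self.existence_checks = 0

    def create_if_not_exists(self, name):
        self.existence_checks += 1        # each call would be a round trip
        self.queues.setdefault(name, [])

    def enqueue(self, name, message):
        if name not in self.queues:
            raise ResourceMissing(name)
        self.queues[name].append(message)


def on_startup(service, required=("orders",)):
    for name in required:
        service.create_if_not_exists(name)   # once per resource, at startup


def submit(service, name, message):
    try:
        service.enqueue(name, message)       # hot path: no existence check
        return True
    except ResourceMissing:
        log.error("queue %r is missing; check deployment and naming", name)
        return False
```

The hot path pays for no existence checks, and a missing resource (for example, a misspelled name) surfaces as a logged error instead of being silently created.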

Use lightweight frameworks. Carefully choose the APIs and frameworks you use to minimize resource

usage, execution time, and overall load on the application. For example, using Web API to handle service

requests can reduce the application footprint and increase execution speed, but it may not be suitable for

advanced scenarios where the additional capabilities of Windows Communication Foundation are required.

Consider minimizing the number of service accounts. For example, use a specific account to access

resources or services that impose a limit on connections, or perform better where fewer connections are

maintained. This approach is common for services such as databases, but it can affect the ability to accurately

audit operations due to the impersonation of the original user.

Carry out performance profiling and load testing during development, as part of test routines, and before

final release to ensure the application performs and scales as required. This testing should occur on the same

type of hardware as the production platform, and with the same types and quantities of data and user load as it

will encounter in production. For more information, see Testing the performance of a cloud service.

Tradeoffs for performance efficiency

12/16/2022 • 6 minutes to read

As you design the workload, consider tradeoffs between performance optimization and other aspects of the
design, such as cost efficiency, operability, reliability, and security.

Performance efficiency vs. cost efficiency

Cost can increase as a result of boosting performance. Here are a few factors to consider when optimizing for
performance and how they impact cost:

Avoid cost estimation of a workload at consistently high utilization. At that level of use, consumption-based
pricing will be more expensive than the equivalent provisioned pricing. Smooth out the peaks to get a
consistent flow of compute and data. Ideally, use manual scaling and autoscaling to find the right balance.
Scaling up is generally more expensive than scaling out.

Cost scales directly with the number of regions. The savings from locating resources in cheaper regions
shouldn't be negated by the cost of network ingress and egress or by degraded application performance
because of increased latency.

Every render cycle of a payload consumes both compute and memory. You can use caching to reduce
load on servers and save with pre-canned storage and bandwidth costs. The savings can be dramatic,
especially for static content services.

While caching can reduce cost, there are some performance tradeoffs. For example, Azure Traffic
Manager pricing is based on the number of DNS (Domain Name System) queries that reach the
service. You can reduce that number through caching and configure how often the cache is refreshed.
Relying on a cache that isn't frequently updated will cause longer user failover times if an endpoint
is unavailable.

Using dedicated resources for batch processing of long-running jobs will increase the cost. You can lower
cost by provisioning Spot VMs, but be prepared for the job to be interrupted every time Azure evicts the
VM.

For cost considerations, see the Cost Optimization pillar.

Performance efficiency vs. operational excellence

As you determine how to scale your workload to meet the demands placed on it by users in an efficient manner,
consider the operations processes that are keeping an application running in production. To achieve operational
excellence with these processes, make sure the deployments remain reliable and predictable. They should be
automated to reduce the chance of human error. They should be a fast and routine process, so they don't slow
down the release of new features or bug fixes. Equally important, you must be able to quickly roll back or roll
forward if an update has problems.

Automated performance testing

One operational process that can help to identify performance issues early is automated performance testing.
The impact of a serious performance issue can be as severe as a bug in the code. While automated functional
tests can prevent application bugs, they might not detect performance problems. Define acceptable performance
goals for metrics such as latency, load times, and resource usage. Include automated performance tests in your
release pipeline, to make sure the application meets those goals.

Fast builds

Another operational efficiency process is making sure that your product is in a deployable state through a fast
build process. Builds provide crucial information about the status of your product.

The following can help make builds faster:

Select the right size of VMs.
Locate the build server near the sources and the target location. This can reduce the duration of your
build considerably.
Scale out build servers.
Optimize the build.

For an explanation of these items, see Builds.

Monitoring performance optimization

As you consider making performance improvements, monitoring should be done to verify that your application
is running correctly. Monitoring should include the application, platform, and networking. To learn more, see
Monitoring.

For operational considerations, see the Operational Excellence pillar.

Performance efficiency vs. reliability

We acknowledge up front that failures will happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component.
Reliable applications are resilient and highly available (HA). Resiliency allows systems to recover gracefully from
failures and continue to function with minimal downtime and data loss before full recovery. HA systems
run as designed in a healthy state with no significant downtime. Maintaining reliability enables you to maintain
performance efficiency.
Some reliability considerations are:
Use the Circuit Breaker pattern to provide stability while the system recovers from a failure and
to minimize the impact on performance.
Achieve levels of scale and performance needed for your solution by segregating read and write
interfaces by implementing the CQRS pattern.
Often, you can achieve higher availability by adopting an eventual consistency model. To learn about
selecting the correct data store, see Use the best data store for the job.
If your application requires more storage accounts than are currently available in your subscription,
create a new subscription with additional storage accounts. For more information, see Scalability and
performance targets.
Avoid scaling up or down. Instead, select a tier and instance size that meet your performance
requirements under typical load, and then scale out the instances to handle changes in traffic volume.
Scaling up and down may trigger an application restart.
Create a separate storage account for logs. Don't use the same storage account for logs and application
data. This helps to prevent logging from reducing application performance.
Monitor performance. Use a performance monitoring service such as Application Insights or New Relic to
monitor application performance and behavior under load. Performance monitoring gives you real-time
insight into the application. It enables you to diagnose issues and perform root-cause analysis of failures.
For resiliency, availability, and reliability considerations, see the Reliability pillar.
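To make the Circuit Breaker consideration above more concrete, here is a minimal Python sketch. The class name, thresholds, and timeout are illustrative assumptions, not values from the guidance, and production code would normally rely on an established resilience library rather than a hand-rolled breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast while the service recovers")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self.failures = 0                 # success closes the circuit again
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of tying up threads and connections on a dependency that is known to be unhealthy, which is how the pattern protects performance during recovery.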

Performance efficiency vs. security

If performance is so poor that the data is unusable, you can consider the data to be inaccessible. From a security

perspective, you need to do whatever you can to make sure that your services have optimal uptime and

performance.

A popular and effective method for enhancing availability and performance is load balancing. Load balancing is

a method of distributing network traffic across servers that are part of a service. It helps performance because

the processor, network, and memory overhead for serving requests are distributed across all the load-balanced

servers. We recommend that you employ load balancing whenever you can, and as appropriate for your

services. For information on load balancing scenarios, see Optimize uptime and performance.

Consider these security measures, which impact performance:

To optimize performance and maximize availability, application code should first try to get OAuth access

tokens silently from a cache before attempting to acquire a token from the identity provider. OAuth is a

technological standard that allows you to securely share information between services without exposing

your password.

Ensure that you are integrating critical security alerts and logs into SIEMs (security information and event

management) without introducing a high volume of low-value data. A high volume of low-value data can increase

SIEM cost and false positives, and lower performance. For more information, see Prioritize alert and log integration.

Use Azure AD Connect to synchronize your on-premises directory with your cloud directory. There are

factors that affect the performance of Azure AD Connect. Ensure Azure AD Connect has enough capacity

to keep underperforming systems from impeding security and productivity. Large or complex

organizations (organizations provisioning more than 100,000 objects) should follow the

recommendations to optimize their Azure AD Connect implementation.

If you want to gain access to real time performance information at the packet level, use packet capture to

set alerts.
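The token-caching measure in the list above can be sketched as follows. `TokenCache` and the `acquire` callable are hypothetical and do not represent any specific identity library's API; most authentication libraries ship their own token cache, which is preferable to writing one.

```python
import time


class TokenCache:
    """Try the cache first; call the identity provider only when needed."""

    def __init__(self, acquire, clock=time.time, skew_seconds=60):
        self._acquire = acquire           # hypothetical: scope -> (token, expires_at)
        self._clock = clock
        self._skew = skew_seconds         # refresh slightly before expiry
        self._tokens = {}                 # scope -> (token, expires_at)

    def get_token(self, scope):
        cached = self._tokens.get(scope)
        if cached and cached[1] - self._skew > self._clock():
            return cached[0]              # silent path: no network call
        token, expires_at = self._acquire(scope)   # fall back to the provider
        self._tokens[scope] = (token, expires_at)
        return token
```

Most requests are served from memory, which removes a network round trip from the hot path and keeps the application working through brief identity-provider slowdowns for as long as a cached token is still valid.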

For other security considerations, see the Security pillar.

Mission-critical workloads

12/16/2022 • 5 minutes to read

This section strives to address the challenges of designing mission-critical workloads on Azure. The guidance is
based on lessons learned from reviewing numerous customer applications and first-party solutions. This section
provides actionable and authoritative guidance that applies Well-Architected best practices as the technical
foundation for building and operating a highly reliable solution on Azure at scale.

What is a mission-critical workload?

The term workload refers to a collection of application resources that support a common business goal or the
execution of a common business process, with multiple services, such as APIs and data stores, working together
to deliver specific end-to-end functionality.

The term mission-critical refers to a criticality scale that covers significant financial cost (business-critical) or
human cost (safety-critical) associated with unavailability or underperformance.

A mission-critical workload therefore describes a collection of application resources, which must be highly
reliable on the platform. The workload must always be available, resilient to failures, and operational.

Video: Mission-critical workloads on Azure

What are the common challenges?

Microsoft Azure makes it easy to deploy and manage cloud solutions. However, building mission-critical
workloads that are highly reliable on the platform remains a challenge for these main reasons:

Designing a reliable application at scale is complex. It requires extensive platform knowledge to select the
right technologies and optimally configure them to deliver end-to-end functionality.

Failure is inevitable in any complex distributed system, and the solution must therefore be architected to
handle failures with correlated or cascading impact. This is a change in mindset for many developers and
architects entering the cloud from an on-premises environment; reliability engineering is no longer an
infrastructure subject, but should be a first-class concern within the application development process.

Operationalizing mission-critical workloads requires a high degree of engineering rigor and maturity
throughout the end-to-end engineering lifecycle as well as the ability to learn from failure.

Is mission-critical only about reliability?

While the primary focus of mission-critical workloads is Reliability, other pillars of the Well-Architected
Framework are equally important when building and operating a mission-critical workload on Azure.

Security: how a workload mitigates security threats, such as Distributed Denial of Service (DDoS) attacks,
will have a significant bearing on overall reliability.

Operational Excellence: how a workload is able to effectively respond to operational issues will have a
direct impact on application availability.

Performance Efficiency: availability is more than simple uptime, but rather a consistent level of application
service and performance relative to a known healthy state.

Achieving high reliability imposes significant cost tradeoffs, which may not be justifiable for every workload
scenario. It is therefore recommended that design decisions be driven by business requirements.

What are the key design areas?

Mission-critical guidance within this series is composed of architectural considerations and recommendations
orientated around these key design areas.

The design areas are interrelated and decisions made within one area can impact or influence decisions across
the entire design. We recommend that readers familiarize themselves with these design areas, reviewing
provided considerations and recommendations to better understand the consequences of encompassed
decisions. For example, to define a target architecture it's critical to determine how best to monitor application
health across key components. In this instance, the reader should review the health modeling design area,
using the outlined recommendations to help drive decisions.

DESIGN AREA                         SUMMARY

Application design                  The use of a scale-unit architecture in the context of building
                                    a highly reliable application. Also explores the cloud
                                    application design patterns that allow for scaling, and error
                                    handling.

Application platform                Decision factors and recommendations related to the
                                    selection, design, and configuration of an appropriate
                                    application hosting platform, application dependencies,
                                    frameworks, and libraries.

Data platform                       Choices in data store technologies, informed by evaluating
                                    the required volume, velocity, variety, and veracity.

Networking and connectivity         Network topology concepts at an application level,
                                    considering requisite connectivity and redundant traffic
                                    management. Critical recommendations intended to inform
                                    the design of a secure and scalable global network topology.

Health modeling and observability   Processes to define a robust health model, mapping
                                    quantified application health states through observability
                                    and operational constructs to achieve operational maturity.

Deployment and testing              Eradicate downtime and maintain application health for
                                    deployment operations, providing key considerations and
                                    recommendations intended to inform the design of optimal
                                    CI/CD pipelines for a mission-critical application.

Security                            Protect the application against threats intended to directly
                                    or indirectly compromise its reliability.

Operational procedures              Adoption of DevOps and related deployment methods is
                                    used to drive effective and consistent operational
                                    procedures.
 
Illustrative examples

The guidance provided within this series is based on a solution-orientated approach to illustrate key design
considerations and recommendations. There are several reference implementations available as part of the
Mission-Critical open source project on GitHub. These implementations can be used as a basis for further
solution development.

Baseline architecture of an internet-facing application: Provides a foundation for building a cloud-native,
highly scalable, internet-facing application on Microsoft Azure. The workload is accessed over a public
endpoint and doesn't require private network connectivity to a surrounding organizational technical
estate.
Refer to the implementation: Mission-Critical Online

Baseline architecture of an internet-facing application with network controls: Extends the baseline
architecture with strict network controls in place to prevent unauthorized public access from the internet
to any of the workload resources.

Mission-Critical Connected: Provides a foundation for building a corporate-connected cloud-native application
on Microsoft Azure using existing network infrastructure and private endpoints. The workload requires private
connectivity to other organizational resources and takes a dependency on pre-provided Virtual Networks for
that connectivity. This use case is intended for scenarios that require integration with a broader
organizational technical estate for either public-facing or internal-facing workloads.

Industry scenarios

The mission-critical guidance within this series forms an industry-agnostic design methodology, which can be
applied across a multitude of different industry contexts. The following list provides specific examples where the
mission-critical design methodology has been applied and tailored to a particular industry scenario.

Carrier-grade within the telecommunications industry

A carrier-grade workload pivots on both business-critical and safety-critical aspects, where there's a
fundamental requirement to be operational with only minutes or even seconds of downtime per calendar year.
Failure to achieve this uptime requirement can result in extensive loss of life, significant fines, or
contractual penalties.

Next step

Start by reviewing the design methodology for mission-critical application scenarios.

Design methodology

Design methodology for mission

critical workloads

on Azure

12/16/2022 • 4 minutes to read • Edit Online

Building a mission-critical application on any cloud platform requires significant technical expertise and engineering investment, particularly because of the complexity associated with:

Understanding the cloud platform,
Choosing the right services and composition,
Applying the correct service configuration,
Operationalizing utilized services, and
Constantly aligning with the latest best practices and service roadmaps.

This design methodology strives to provide an easy-to-follow design path to help navigate this complexity and inform the design decisions required to produce an optimal target architecture.

1 - Design for business requirements

Not all mission-critical workloads have the same requirements. Expect the review considerations and design recommendations provided by this design methodology to yield different design decisions and trade-offs for different application scenarios.

Select a reliability tier

Reliability is a relative concept: for any workload to be appropriately reliable, it should reflect the business requirements surrounding it. For example, a mission-critical workload with a 99.999% availability Service Level Objective (SLO) requires a much higher level of reliability than a less critical workload with an SLO of 99.9%.

This design methodology applies the concept of reliability tiers, expressed as availability SLOs, to inform required reliability characteristics. The table below captures the permitted error budgets associated with common reliability tiers.

Reliability tier      Permitted downtime       Permitted downtime        Permitted downtime
(availability SLO)    (week)                   (month)                   (year)
99.9%                 10 minutes, 4 seconds    43 minutes, 49 seconds    8 hours, 45 minutes, 56 seconds
99.95%                5 minutes, 2 seconds     21 minutes, 54 seconds    4 hours, 22 minutes, 58 seconds
99.99%                1 minute                 4 minutes, 22 seconds     52 minutes, 35 seconds
99.999%               6 seconds                26 seconds                5 minutes, 15 seconds
99.9999%              <1 second                2 seconds                 31 seconds
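The error budgets for each reliability tier follow directly from the SLO percentage. The sketch below is one way to reproduce them; it is illustrative only. The function names are hypothetical, and the period lengths (a 365.2425-day year and a month of one twelfth of a year) are assumptions chosen because they appear to match the published figures, with results truncated to whole seconds.

```python
from fractions import Fraction

SECONDS_PER_DAY = 86_400

# Assumption: a year is 365.2425 days and a month is one twelfth of a year.
PERIOD_SECONDS = {
    "week": Fraction(7 * SECONDS_PER_DAY),
    "month": Fraction("365.2425") * SECONDS_PER_DAY / 12,
    "year": Fraction("365.2425") * SECONDS_PER_DAY,
}


def permitted_downtime_seconds(slo_percent: str, period: str) -> Fraction:
    """Return the error budget, in seconds, for an availability SLO."""
    # Exact rational arithmetic avoids floating-point noise for SLOs such as 99.9999.
    unavailable_fraction = (Fraction(100) - Fraction(slo_percent)) / 100
    return unavailable_fraction * PERIOD_SECONDS[period]


def format_duration(seconds: Fraction) -> str:
    """Format a duration as hours, minutes, and whole seconds (truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return " ".join(parts)


if __name__ == "__main__":
    for slo in ("99.9", "99.95", "99.99", "99.999", "99.9999"):
        row = [format_duration(permitted_downtime_seconds(slo, period))
               for period in ("week", "month", "year")]
        print(f"{slo}%: " + ", ".join(row))
```

Passing the SLO as a string keeps the calculation exact, which matters at the higher tiers where the weekly budget is less than a second.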

IMPORTANT

Availability SLO is considered by this design methodology to be more than simple uptime: it is a consistent level of application service relative to a known healthy application state.

As an initial exercise, select a target reliability tier by determining how much downtime is acceptable. The pursuit of a particular reliability tier will ultimately have a significant bearing on the design path and encompassed design decisions, which will result in a different target architecture.

This image shows how the different reliability tiers and underlying business requirements influence the target architecture for a conceptual reference implementation, particularly concerning the number of regional deployments and utilized global technologies.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are further critical aspects when determining required reliability. For instance, if you're striving to achieve an application RTO of less than a minute, then backup-based recovery strategies or an active-passive deployment strategy are likely to be insufficient.

2 - Evaluate the design areas using the design principles

At the core of this methodology lies a critical design path comprised of:

Foundational design principles
Fundamental design areas with heavily interrelated and dependent design decisions.

The impact of decisions made within each design area will reverberate across other design areas and design decisions. Review the provided considerations and recommendations to better understand the consequences of encompassed decisions, which may produce trade-offs within related design areas.

For example, to define a target architecture it's critical to determine how best to monitor application health across key components. We highly recommend that you review the health modeling design area, using the outlined recommendations to help drive decisions.

3 - Deploy your first mission-critical application

Refer to these reference architectures that describe the design decisions based on this methodology.

Baseline architecture of an internet-facing application
Baseline architecture of an internet-facing application with network controls

TIP

The architecture is backed by the Mission-Critical Online implementation, which illustrates the design recommendations.

Production-grade artifacts - Every technical artifact is ready for use in production environments with all end-to-end operational aspects considered.
Rooted in real-world experiences - All technical decisions are guided by experiences of Azure customers and lessons learned from deploying those solutions.
Azure roadmap alignment - The mission-critical reference architectures have their own roadmap that is aligned with Azure product roadmaps.

4 - Integrate your workload in Azure landing zones

Azure landing zone subscriptions provide shared infrastructure for enterprise deployments that need centralized governance.

It's crucial to evaluate which connectivity use case is required by your mission-critical application. Azure landing zones support two main archetypes separated into different Management Group scopes: Online or Corp., as shown in this image.

Online subscription

A mission-critical workload operates as an independent solution, without any direct corporate network connectivity to the rest of the Azure landing zone architecture. The application will be further safeguarded through policy-driven governance and will automatically integrate with centralized platform logging through policy.

The baseline architecture and Mission-Critical Online implementation align with the Online approach.

Corp. subscription

When deployed in a Corp. subscription, a mission-critical workload depends on the Azure landing zone to provide connectivity resources. This approach allows integration with other applications and shared services. You'll need to design around some foundational resources, which will exist up front as part of the shared-service platform. For example, the regional deployment stamp should no longer encompass an ephemeral Virtual Network or Azure Private DNS Zone because these will exist in the Corp. subscription.

To get started with this use case, we recommend the baseline architecture in an Azure landing zone reference architecture.

TIP

The preceding architecture is backed by the Mission-Critical Connected implementation.

5 - Deploy a sandbox application environment

In parallel to design activities, it's highly recommended that a sandbox application environment is established using the Mission-Critical reference implementations.

This provides hands-on opportunities to validate design decisions by replicating the target architecture, allowing for design uncertainty to be quickly assessed. If applied correctly with representative requirement coverage, most problematic issues likely to hinder progress can be uncovered and subsequently addressed.

6 - Continuously evolve with Azure roadmaps

Application architectures established using this design methodology must continue to evolve in alignment with Azure platform roadmaps to support optimized sustainability.

Next step

Review the design principles for mission-critical application scenarios.

Design principles

Design principles of a mission-critical workload

12/16/2022 • 7 minutes to read

The mission-critical design methodology is underpinned by five key design principles, which serve as a compass for subsequent design decisions across the critical design areas. We highly recommend that you familiarize yourself with these principles to better understand their impact and the trade-offs associated with non-adherence.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The provided code assets illustrate implementations associated with the design principles highlighted in this article.

These mission-critical design principles resonate with and extend the quality pillars of the Azure Well-Architected Framework: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.

Reliability

Maximum reliability - Fundamental pursuit of the most reliable solution, ensuring trade-offs are properly understood.

Design principle - Considerations

Active/Active design - To maximize availability and achieve regional fault tolerance, solution components should be distributed across multiple Availability Zones and Azure regions using an active/active deployment model where possible.

Blast radius reduction and fault isolation - Failure is impossible to avoid in a highly distributed multi-tenant cloud environment like Azure. By anticipating failures and correlated impact, from individual components to entire Azure regions, a solution can be designed and developed in a resilient manner.

Observe application health - Before issues impacting application reliability can be mitigated, they must first be detected and understood. By monitoring the operation of an application relative to a known healthy state, it becomes possible to detect or even predict reliability issues, allowing for swift remedial action to be taken.

Drive automation - One of the leading causes of application downtime is human error, whether that is due to the deployment of insufficiently tested software or misconfiguration. To minimize the possibility and impact of human errors, it's vital to strive for automation in all aspects of a cloud solution to improve reliability: automated testing, deployment, and management.

Design for self-healing - Self-healing describes a system's ability to deal with failures automatically through pre-defined remediation protocols connected to failure modes within the solution. It's an advanced concept that requires a high level of system maturity with monitoring and automation, but should be an aspiration from inception to maximize reliability.

Complexity avoidance - Avoid unnecessary complexity when designing the solution and all operational processes to drive reliability and management efficiencies, minimizing the likelihood of failures.

Performance Efficiency

Sustainable performance and scalability - Design for scalability across the end-to-end solution without performance bottlenecks.

Design principle - Considerations

Design for scale-out - Scale-out is a concept that focuses on a system's ability to respond to demand through horizontal growth. This means that as traffic grows, more resource units are added in parallel instead of increasing the size of the existing resources. A system's ability to handle expected and unexpected traffic increases through scale-units is essential to overall performance and reliability because it further reduces the impact of a single resource failure.

Automation for hyperscale - Scale operations throughout the solution should be fully automated to minimize the performance and availability impact from unexpected or expected increases in traffic, ensuring the time it takes to conduct scale operations is understood and aligned with a model for application health.

Continuous validation and testing - Automated testing should be performed within CI/CD processes to drive continuous validation for each application change. Load testing against a performance baseline with synchronized chaos experimentation should be included to validate existing thresholds, targets, and assumptions, and to help quickly identify risks to resiliency and availability. Such testing should be conducted within staging and testing environments, and optionally within development environments. It can also be beneficial to run a subset of tests against the production environment, particularly in conjunction with a blue/green deployment model to validate new deployment stamps before they receive production traffic.

Reduce overhead with managed compute services - Using managed compute services and containerized architectures significantly reduces the ongoing administrative and operational overhead of designing, operating, and scaling applications by shifting infrastructure deployment and maintenance to the managed service provider.

Baseline performance and identify bottlenecks - Performance testing with detailed telemetry from every system component allows for the identification of bottlenecks within the system, including components that need to be scaled in relation to other components. This information should be incorporated into a capacity model.

Model capacity - A capacity model enables planning of resource scale levels for a given load profile, and additionally exposes how system components perform in relation to each other, therefore enabling system-wide capacity allocation planning.

Operational Excellence

Operations by design - Engineered to last with robust and assertive operational management.

Design principle - Considerations

Loosely coupled components - Loose coupling enables independent and on-demand testing, deployments, and updates to components of the application while minimizing inter-team dependencies for support, services, resources, or approvals.

Automate build and release processes - Fully automated build and release processes reduce the friction and increase the velocity of deploying updates, bringing repeatability and consistency across environments. Automation shortens the feedback loop from developers pushing changes to getting insights on code quality, test coverage, resiliency, security, and performance, which increases developer productivity.

Developer agility - Continuous Integration and Continuous Deployment (CI/CD) automation enables the use of short-lived development environments with lifecycles tied to that of an associated feature branch, which promotes developer agility and drives validation as early as possible within the engineering cycle to minimize the engineering cost of bugs.

Quantify operational health - Full diagnostic instrumentation of all components and resources enables ongoing observability of logs, metrics, and traces, and also facilitates health modeling to quantify application health in the context of availability and performance requirements.

Rehearse recovery and practice failure - Business Continuity (BC) and Disaster Recovery (DR) planning and practice drills are essential and should be conducted frequently, since learnings can iteratively improve plans and procedures to maximize resiliency in the event of unplanned downtime.

Embrace continuous operational improvement - Prioritize routine improvement of the system and user experience, using a health model to understand and measure operational efficiency, with feedback mechanisms that enable application teams to understand and address gaps in an iterative manner.
 
Security

Always secure - Design for end-to-end security to maintain application stability and ensure availability.

Design principle - Considerations

Monitor the security of the entire solution and plan incident responses - Correlate security and audit events to model application health and identify active threats. Establish automated and manual procedures to respond to incidents using Security Information and Event Management (SIEM) tooling for tracking.

Model and test against potential threats - Ensure appropriate resource hardening and establish procedures to identify and mitigate known threats, using penetration testing to verify threat mitigation, as well as static code analysis and code scanning.

Identify and protect endpoints - Monitor and protect the network integrity of internal and external endpoints through security capabilities and appliances, such as firewalls or web application firewalls. Use industry standard approaches to protect against common attack vectors like Distributed Denial-of-Service (DDoS) attacks, such as SlowLoris.

Protect against code level vulnerabilities - Identify and mitigate code-level vulnerabilities, such as cross-site scripting or SQL injection, and incorporate security patching into operational lifecycles for all parts of the codebase, including dependencies.

Automate and use least privilege - Drive automation to minimize the need for human interaction and implement least privilege across both the application and control plane to protect against data exfiltration and malicious actor scenarios.

Classify and encrypt data - Classify data according to risk and apply industry standard encryption at rest and in transit, ensuring keys and certificates are stored securely and managed properly.

Cost Optimization

There are obvious cost tradeoffs associated with introducing greater reliability, which should be carefully considered in the context of workload requirements.

Maximizing reliability can impact the overall financial cost of the solution. For example, the duplication of resources and the distribution of resources across regions to achieve high availability has clear cost implications. To avoid excess costs, don't over-engineer or over-provision beyond the relevant business requirements.

Also, there is added cost associated with engineering investment in fundamental reliability concepts, such as embracing infrastructure as code, deployment automation, and chaos engineering. This comes at a cost in terms of both time and effort, which could be invested elsewhere to deliver new application functionality and features.

Cloud native design

Azure-native managed services - Azure-native managed services are prioritized due to their lower administrative and operational overhead as well as tight integration with consistent configuration and instrumentation across the application stack.

Roadmap alignment - Incorporate upcoming new and improved Azure service capabilities as they become Generally Available (GA) to stay close to the leading edge of Azure.

Embrace preview capabilities and mitigate known gaps - While Generally Available (GA) services are prioritized for supportability, Azure service previews are actively explored for rapid incorporation, providing technical and actionable feedback to Azure product groups to address gaps.

Azure landing zone alignment - Deployable within an Azure landing zone and aligned to the Azure landing zone design methodology, but also fully functional and deployable in a bare environment outside of a landing zone.

Next step

Review cross-cutting concerns associated with mission-critical workloads.

Cross-cutting concerns

Architecture pattern for mission-critical workloads on Azure

12/16/2022 • 5 minutes to read

This article presents a key pattern for mission-critical architectures on Azure. Apply this pattern when you start your design process, and then select components that are best suited for your business requirements. The article recommends a north star design approach and includes other examples with common technology components.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we recommend you start with what is a mission-critical workload?

Core architecture pattern

We recommend that you evaluate the key design areas, define the critical user and system flows that use the underlying components, and develop a matrix of Azure resources and their configuration while keeping in mind the following characteristics.

Characteristic - Considerations

Lifetime - What's the expected lifetime of the resource, relative to other resources in the solution? Should the resource outlive or share the lifetime with the entire system or region, or should it be temporary?

State - What impact will the persisted state at this layer have on reliability or manageability?

Reach - Is the resource required to be globally distributed? Can the resource communicate with other resources, located globally or within that region?

Dependencies - What are the dependencies on other resources?

Scale limits - What is the expected throughput for that resource? How much scale is provided by the resource to fit that demand?

Availability/disaster recovery - What is the impact on availability from a disaster at this layer? Would it cause a systemic outage or only a localized capacity or availability issue?

Global resources

Certain resources are globally shared by resources deployed within each region. Common examples are resources that are used to distribute traffic across multiple regions, store permanent state for the whole application, and monitor resources for them.

Characteristic - Considerations

Lifetime - These resources are expected to be long living (non-ephemeral). Their lifetime spans the life of the system or longer. Often the resources are managed with in-place data and control plane updates, assuming they support zero-downtime update operations.

State - Because these resources exist for at least the lifetime of the system, this layer is often responsible for storing global, geo-replicated state.

Reach - The resources should be globally distributed and replicated to the regions that host those resources. It's recommended that these resources communicate with regional or other resources with low latency and the desired consistency.

Dependencies - The resources should avoid dependencies on regional resources because their unavailability can be a cause for global failure. For example, certificates or secrets kept in a single vault could have global impact if there's a regional failure where the vault is located.

Scale limits - Often these resources are singleton instances in the system, and they should be able to scale such that they can handle throughput of the system as a whole.

Availability/disaster recovery - Regional and stamp resources can use global resources. It's critical that global resources are configured with high availability and disaster recovery for the health of the whole system.

Regional stamp resources

Characteristic - Considerations

Lifetime - The resources are expected to have a short life span (ephemeral) with the intent that they can get added and removed dynamically while regional resources outside the stamp continue to persist. The ephemeral nature is needed to provide more resiliency, scale, and proximity to users.

State - Because stamps are ephemeral and will be destroyed with each deployment, a stamp should be stateless as much as possible.

Reach - Can communicate with regional and global resources. However, communication with other regions or other stamps should be avoided.

Dependencies - The stamp resources must be independent. They're expected to have regional and global dependencies but shouldn't rely on components in other stamps in the same or other regions.

Scale limits - Throughput is established through testing. The throughput of the overall stamp is limited to the least performant resource. Stamp throughput estimates need to account for the high level of demand that a failover from another stamp can cause.

Availability/disaster recovery - Because of the temporary nature of stamps, disaster recovery is done by redeploying the stamp. If resources are in an unhealthy state, the stamp, as a whole, can be destroyed and redeployed.
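The scale-limit considerations for stamps can be expressed as two small checks: a stamp sustains only what its least performant resource sustains, and the remaining stamps need headroom to absorb traffic when one stamp fails. The following is a minimal sketch under stated assumptions. The function names, component names, and throughput figures are hypothetical and would come from your own load tests.

```python
from typing import Mapping


def stamp_throughput(component_rps: Mapping[str, float]) -> float:
    """Return the sustainable throughput of a stamp.

    The stamp is limited by its least performant resource, so the result
    is the minimum of the measured per-component throughputs.
    """
    return min(component_rps.values())


def survives_single_stamp_failure(peak_rps: float, stamp_rps: float, stamps: int) -> bool:
    """Check whether the remaining stamps can absorb peak load if one stamp fails."""
    if stamps < 2:
        # With a single stamp there is nothing to fail over to.
        return False
    return stamp_rps * (stamps - 1) >= peak_rps


if __name__ == "__main__":
    # Hypothetical load-test results, in requests per second.
    measured = {"ingress": 12_000, "compute": 8_000, "database": 5_000}
    per_stamp = stamp_throughput(measured)
    print(per_stamp)
    print(survives_single_stamp_failure(peak_rps=9_000, stamp_rps=per_stamp, stamps=3))
```

In this example the database caps the stamp, so scaling ingress or compute alone would not raise the stamp's throughput.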

The stamp contains the application and resources that participate in completing business transactions. A stamp typically corresponds to a deployment to an Azure region, although a region can have more than one stamp.

Regional resources

Characteristic - Consideration

Lifetime - The resources share the lifetime of the region and outlive the stamp resources.

State - State stored in a region can't live beyond the lifetime of the region. If state needs to be shared across regions, consider using a global data store.

Reach - The resources don't need to be globally distributed. Direct communication with other regions should be avoided at all cost.

Dependencies - The resources can have dependencies on global resources, but not on stamp resources because stamps are meant to be short lived.

Scale limits - Determine the scale limit of regional resources by combining all stamps within the region.

A system can have resources that are deployed in a region but outlive the stamp resources. Examples include observability resources that monitor resources at the regional level, including the stamps.

Baseline architectures for mission-critical workloads

These baseline examples serve as the recommended north star architecture for mission-critical applications. The

baseline strongly recommends containerization and using a container orchestrator for the application platform.

The baseline uses Azure Kubernetes Service (AKS).

Refer to Well-Architected mission-critical workloads: Containerization.

Baseline architecture

If you're just starting your mission-critical journey, use this architecture as a reference. The workload is

accessed over a public endpoint and doesn't require private network connectivity to other company

resources.

Baseline with network controls

This architecture builds on the baseline architecture. The design is extended to provide strict network

controls to prevent unauthorized public access from the internet to the workload resources.

Baseline in Azure landing zones

This architecture is appropriate if you're deploying the workload in an enterprise setup where integration within a broader organization is required. The workload uses centralized shared services, needs on-premises connectivity, and integrates with other workloads within the enterprise. It's deployed in an Azure landing zone subscription that inherits from the Corp. management group.

Design areas

We recommend that you use the provided design guidance to navigate the key design decisions to reach an optimal solution. For information, see What are the key design areas?

Next step

Review the best practices for designing mission-critical application scenarios.

Application design

Cross-cutting concerns of mission-critical workloads on Azure

12/16/2022 • 2 minutes to read

There are several cross-cutting concerns that traverse the key design areas. This article contextualizes these cross-cutting concerns for subsequent consideration within each design area.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we recommend you start with what is a mission-critical workload?

Mission-Critical open source project

There are reference implementations available as part of an open source project on GitHub. The code assets provided by these implementations illustrate the recommendations highlighted in this article.

Scale limits

Azure applies various limits and quotas to ensure a consistent level of service for all customers. Examples of these limits include restrictions on the number of deployable resources within a single subscription, and restrictions to network and query throughput.

Service limits may have a significant bearing on a large mission-critical workload. Consider the limits of the services used in the target architecture carefully to ensure sustainable scale. Otherwise, you may hit one or more of these limits as the workload grows.

IMPORTANT

Limits and quotas may change as the platform evolves. Be sure to check the current limits at Azure subscription and service limits, quotas, and constraints.

Recommendations

Employ a scale unit approach for resource composition, deployment, and management.
Use subscriptions as scale units, scaling out resources and subscriptions as required.
Ensure scale limits are considered as part of capacity planning.
If available, use data about existing application environments to explore which limits might be encountered.

Automation

A holistic approach to automation of deployment and management activities can maximize the reliability and operability of the workload.

Recommendations

Automate continuous integration and continuous delivery (CI/CD) pipelines for all application components.
Automate application management activities, such as patching and monitoring.
Use declarative management semantics, such as Infrastructure as code (IaC), instead of imperative approaches.

Prioritize templating over scripting. Defer to scripting only when using templates isn't possible.

Azure roadmap alignment

Azure is constantly evolving through frequent updates to services, features, and regional availability. It's important to align the target architecture with Azure platform roadmaps to inform an optimal application trajectory. For example, make sure that the required services and features are available within the chosen deployment regions.

Refer to Azure updates for the latest information about new services and features.

Recommendations

Align with Azure engineering roadmaps and regional rollout plans.
Unblock with preview services or by taking dependencies on the Azure platform roadmap.
Only take a dependency on committed services and features; validate roadmap dependencies with Microsoft engineering product groups.

Next step

Explore the design areas that provide critical considerations and recommendations for building a mission-critical workload.

Architecture pattern

Application design of mission-critical workloads on Azure

12/16/2022 • 19 minutes to read

Both functional application requirements and non-functional requirements are critical to inform key design decisions for a mission-critical application design. However, these requirements should be examined alongside key cloud application design patterns to ensure aspirations are fully achieved.

This design area explores the important application design patterns for building a highly reliable application on Azure.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The implementations provide solution-oriented showcases for how these foundational design concepts can be used, alongside Azure-native capabilities, to maximize reliability.

Scale-unit architecture

Architecturally, it is critical to optimize end-to-end scalability through the logical compartmentalization of operational functions at all levels of the application stack. To achieve a highly available application design, all functional aspects of the solution must be capable of scaling to meet changes in demand.

TIP

A scale-unit is a logical unit or function that can be scaled independently. A unit can be code components, application hosting platforms, or even deployment stamps that encompass related components.

For more information, see the Deployment Stamps pattern.

For example, the Mission-Critical Online reference implementation considers a user flow for processing product comments and ratings in a sample web catalog application that uses APIs for retrieving and posting comments and ratings, and supporting components such as an OAuth endpoint, datastore, and message queues. These stateless API endpoints represent granular functional units that must be able to adapt to changes in demand. However, for these to be truly scalable, the underlying application platform must also be able to scale in kind. Similarly, to avoid performance bottlenecks in the end-to-end user flow and to achieve sustainable scale, the downstream components and dependencies must also be able to scale to an appropriate degree, either independently, as a separate scale-unit, or together, as part of a single logical unit.

This image shows the multiple scale-unit scopes that are considered by this reference implementation user flow. These scopes range from microservice pods to cluster nodes and regional deployment stamps.

Design considerations

Design recommendations

Using a scale-unit architecture is recommended to optimize the end-to-end scalability of a mission-critical

application so that all levels of the solution can appropriately scale. The relationship between related scale-units,

and the components inside a single scale-unit, should be defined according to a capacity model, taking into

consideration non-functional requirements around performance.

Here are some benefits of using a scale-unit architecture:

The scale-unit architecture pattern goes to great lengths to address scale limits of individual resources and

the application as a whole.

A scale-unit architecture helps with complex deployment and update scenarios, since an entire regional

stamp can be deployed as one unit. This architecture allows you to test and validate specific versions of

components together, prior to directing traffic to it.

The scale-unit architectural pattern can also be applied to support multi-tenant requirements for

customer segregation.

Azure subscription scale limits and quotas might have a bearing on application design, technology choices, and

the definition of scale-units.

When designing a highly available architecture, start by asking these questions.

How many requests is the solution required to support for each user flow? Are the usage patterns predictable?

The expected peak request rate (requests per second) and daily/weekly/seasonal traffic patterns are critical to

inform core scale requirements.

Is traffic expected to grow? At what rate will it grow?

The expected growth patterns for both traffic and data volume inform the design for sustainable scale.

Is a degraded service with high response times acceptable under load?

The required performance of the solution under load is a critical decision factor when modeling required

capacity.

NOTE

IMPORTANT

Define a scale-unit when the scale-limits of a single deployment are likely to be exceeded.

Ensure all application components are able to scale, either as independent scale-units or as part of a

logical scale-unit that encompasses multiple related components.

Define the relationship between scale-units, according to a capacity model and non-functional

requirements.

Define a regional deployment stamp to unify the provisioning, management, and operation of regional

application resources, into a heterogeneous but inter-dependent scale-unit.

As the load increases, extra stamps can be deployed within the same or different Azure regions, in order

to horizontally scale the solution.

When deploying within an Azure landing zone, ensure the landing zone subscription is dedicated to the application, in

order to provide a clear management boundary and to avoid the potential Noisy Neighbor antipattern.

For high-scale application scenarios with significant volumes of traffic, design the solution to scale across

multiple Azure subscriptions, to ensure the inherent scale-limits within a single subscription don't constrain

the scalability.

Define a subscription-scoped deployment as a scale-unit to avoid a 'spill-and-fill' subscription model.

Deploy each regional deployment stamp within a dedicated subscription, in order to ensure the

subscription limits only apply within the context of a single deployment stamp and not across the

application as a whole. Where appropriate, multiple deployment stamps can be considered within a

single region, but you should deploy them across independent subscriptions.

Separate the 'global' shared resources within a dedicated subscription to allow for consistent regional

subscription deployment. Avoid using a specialized deployment for a primary region.

The use of multiple subscriptions necessitates additional CI/CD complexity, which must be appropriately managed.

Therefore, it's only recommended in extreme scale scenarios, where the limits of a single subscription are likely to become

a hindrance.

Where multiple production subscriptions are needed to ensure requisite scale, consider using a dedicated

application management group, to simplify policy assignment through a policy aggregation boundary.

Deploy any considered environments, such as production, development, or test environments, into

separate subscriptions. This practice ensures that lower environments don't contribute towards scale

limits, and it reduces the risk of lower environment updates polluting production, by providing a clear

management and identity boundary.

Define and analyze non-functional requirements, such as the availability SLO, within the context of key

end-to-end user-flows. Technical and business scenarios will likely have distinct considerations for

resilience, availability, latency, capacity, and observability. This practice will allow for relative flexibility in

the design approach, tailoring design decisions and technology choices at a user-flow level, since one size

may not fit all.

Model the required capacity around identified traffic patterns, in order to ensure sufficient capacity is

provisioned at peak times and to prevent service degradation. Use traffic patterns to optimize capacity

and resource utilization, during periods of reduced traffic.
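
As a rough illustration of capacity modeling, the sketch below estimates how many deployment stamps a given peak request rate might need. The function name, the per-stamp throughput, the 20% headroom, and the minimum of two stamps are hypothetical values chosen for illustration; real figures would come from load testing the workload.

```python
import math


def stamps_required(peak_rps: float, rps_per_stamp: float,
                    headroom: float = 0.2, min_stamps: int = 2) -> int:
    """Estimate the number of stamps needed to serve peak_rps.

    rps_per_stamp is the measured throughput of one stamp (assumed value).
    headroom reserves a fraction of each stamp's capacity for traffic spikes.
    min_stamps keeps a floor so that a single stamp failure is survivable.
    """
    if rps_per_stamp <= 0 or not 0 <= headroom < 1:
        raise ValueError("rps_per_stamp must be > 0 and headroom in [0, 1)")
    usable_rps = rps_per_stamp * (1 - headroom)
    return max(min_stamps, math.ceil(peak_rps / usable_rps))


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests/s at peak, 3,000 requests/s per stamp.
    print(stamps_required(peak_rps=10_000, rps_per_stamp=3_000))
```

Re-running the calculation with off-peak traffic figures gives a sense of how many stamps could be scaled in during periods of reduced traffic.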

Example - Subscription scale-unit approach

Video: Subscription structure

Global distribution

Measure the time it takes to perform scale-out and scale-in operations, in order to ensure that the natural

variations in traffic don't create an unacceptable level of service degradation. To drive continuous

improvement, track the scale operation durations as an operational metric.

This image demonstrates how the single subscription reference deployment model can be expanded across

multiple subscriptions, in an extreme scale scenario, to navigate subscription scale-limits.

Failure is impossible to avoid in any highly distributed environment. So, always plan for failure.

Here are some strategies to mitigate many fault scenarios.

Availability Zones (AZ) allow highly available regional deployments across different data centers within

a region. Nearly all Azure services are available in either a zonal configuration (where service is pinned to

a specific zone) or zone-redundant configuration (where the platform automatically ensures the service

spans across zones and can withstand a zone outage). These configurations allow for fault-tolerance up to

a datacenter level.

To maximize reliability, consider using multiple Azure regions to ensure regional fault tolerance, so that

application availability remains even when an entire region goes down. When designing a multi-region

application, consider different deployment strategies, such as active-active and active-passive, alongside

application requirements, because there are significant trade-offs between each approach.

An active-active deployment strategy represents the gold standard because it maximizes availability and allows for a higher composite Service Level Agreement (SLA). While active-active is the recommended approach, it can introduce challenges around data synchronization and consistency for many application scenarios. These challenges must be fully addressed at the data platform level, alongside other trade-offs such as increased cost exposure and increased engineering effort.

Design considerations

Not every workload supports or requires multiple regions running simultaneously, and hence the precise

application requirements should be weighed against these trade-offs to inform an optimal design decision. For

certain application scenarios with lower reliability targets, different deployment models, such as active-passive

or sharding, can be suitable alternatives.

It's important to note that some Azure services are deployable or configurable as global resources, which aren't

constrained to a particular Azure region. So, when accommodating both 'Scale-Unit Architecture' and 'Global

Distribution', carefully consider how resources are optimally distributed across Azure regions.

This image shows the high-level active-active design. A user accesses the application through a central global

entry point that then redirects requests to a suitable regional deployment stamp.

Not all services or capabilities are available in every Azure region, and so there can be service availability

implications depending on the selected deployment regions. For example, Availability Zones aren't

available in every region.

Azure regions are grouped into regional pairs consisting of two regions within the same geography.

Some Azure services use paired regions to ensure business continuity and to protect against data loss.

For example, Azure Geo-redundant Storage (GRS) replicates data to a secondary paired region

automatically, ensuring that data is durable if the primary region isn't recoverable. If an outage affects

multiple Azure regions, at least one region in each pair will be prioritized for recovery.

The Azure Safe Deploy Practice (SDP) ensures all code and configuration changes (planned maintenance)

to the Azure platform undergo a phased roll-out, with health analyzed in case any degradation is detected

during the release. After the Canary and Pilot phases have successfully completed, platform updates are

serialized across regional pairs, ensuring that only one region in each pair is updated at a time.

Like any cloud provider, Azure ultimately has a finite amount of resources and as a result there are

situations that can lead to the unavailability of capacity in individual regions. In the event of a regional

outage there will be a significant increase in demand for resources within the paired region as impacted

customer workloads seek to recover within the paired region. In certain scenarios this may create a

capacity challenge where supply temporarily does not satisfy demand.

When designing a globally distributed architecture, start with these questions.

Are there specific regions where data must reside or where resources have to be deployed?

Compliance requirements around geographical data residency, data protection, and data retention can have a significant bearing on appropriate geographical distribution.

Where are the requests physically originating from?

The geographic proximity and density of users or dependent systems should inform design decisions around the global distribution.

Are users going to connect from home and/or organizational networks? Can all users be expected to have fast internet connections?

Consider the connectivity method by which users or systems access the application, whether over the public Internet or private networks using either VPN or ExpressRoute connectivity.

Different Azure regions have slightly different cost profiles for some services. There may be further cost implications depending on the precise deployment regions chosen.

Availability Zones have a latency perimeter of less than 2 milliseconds between availability zones. For workloads that are particularly 'chatty' across zones, this latency can accumulate to form a non-trivial performance penalty, as well as incurring bandwidth charges for inter-zone data transfer.

An active-active deployment across Azure and other cloud providers can be considered to further mitigate reliance on global dependencies within a single cloud provider. A multi-cloud active-active deployment strategy introduces a significant amount of complexity around CI/CD, given the significant difference in resource specifications and capabilities between cloud providers. This necessitates specialized deployment stamps for each cloud.

Design recommendations

IMPORTANT

Deploy the solution within a minimum of two Azure regions to protect against regional outages. Prioritize

the use of paired regions to benefit from SDP risk mitigations and platform recovery capabilities.

For scenarios targeting a >= 99.99% SLO, a minimum of three deployment regions is recommended to maximize

the composite SLA and overall reliability.

Use an active-active deployment strategy where possible to maximize reliability.

Where data/state consistency challenges exist, explore the use of a globally distributed data store, a stamped regional architecture, or a partially active-active deployment, where some components are active across all regions while others are located centrally within a primary region.

Calculate the composite SLA for all user flows. Ensure the composite SLA is in line with business targets.

Deploy additional regional deployment stamps to achieve a greater composite SLA. The use of global

resources will constrain the increase in composite SLA from adding further regions.
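
The effect described above can be sketched numerically. The example below assumes that services within a stamp are serial dependencies (availabilities multiply), that regional stamps fail independently of each other, and that global resources sit in series with the regional layer. The availability figures and function names are hypothetical and aren't actual Azure SLA values.

```python
from math import prod


def serial(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def parallel(availability, copies):
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - availability) ** copies


def composite_availability(global_services, regional_services, regions):
    """Global resources in series with N active regional stamps."""
    stamp = serial(regional_services)
    return serial(global_services) * parallel(stamp, regions)


if __name__ == "__main__":
    global_services = [0.9999]           # for example, a global entry point
    regional_services = [0.9995, 0.999]  # for example, compute and a data store
    for regions in (1, 2, 3):
        value = composite_availability(global_services, regional_services, regions)
        print(regions, f"{value:.6%}")
```

Under these assumptions, each added region raises the composite figure, but the result can never exceed the availability of the global layer, which is why global resources constrain the gain from further regions.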

Define and validate the recovery point objectives (RPO) and recovery time objectives (RTO).

Geographically co-locate Azure resources with users to minimize network latency and maximize end-to-

end performance.

Technical solutions such as a Content Delivery Network (CDN) or edge caching can also be used to

drive optimal network latency for distributed user bases.

For high-scale application scenarios with significant volumes of traffic, design the solution to scale across

multiple regions to navigate potential capacity constraints within a single region.

Select deployment regions that offer requisite capabilities and characteristics to achieve performance and

availability targets, while fulfilling data residency and retention requirements.

Example - Global distribution approach

Video - Global distribution

Loosely coupled event-driven architecture

Within a single geography, prioritize the use of regional pairs to benefit from SDP serialized rollouts for

planned maintenance, and regional prioritization in the event of unplanned maintenance.

It's not uncommon that data compliance requirements will constrain the number of available regions and

potentially force design compromises. In such cases, additional investment in operational wrappers is

highly recommended to predict, detect, and respond to failures.

If only a single Azure region is suitable, multiple deployment stamps ('regional scale-units') should be

deployed within the selected region to mitigate some risk, using Availability Zones to provide

datacenter-level fault tolerance. However, such a significant compromise in geographical distribution

will drastically constrain the attainable composite SLA and overall reliability.

If suitable Azure regions do not all offer requisite capabilities, be prepared to compromise on the

consistency of regional deployment stamps to prioritize geographical distribution and maximize

reliability.

For example, when constrained to a geography with two regions where only one region

supports Availability Zones (3 + 1 datacenter model), create a secondary deployment pattern

using fault domain isolation to allow for both regions to be deployed in an active configuration,

ensuring the primary region houses multiple deployment stamps.

Align current service availability with product roadmaps when selecting deployment regions; not all

services may be available in every region on day 1.

Use Availability Zones where possible to maximize availability within a single Azure region.

The Mission-Critical reference implementations consist of both global and regional resources, with regional

resources deployed across multiple regions to provide geo-availability, in the case of regional outages and to

bring services closer to end-users. These regional deployments also serve as scale-unit "stamps" to provide

additional capacity and availability when required.

TIP

Design considerations

Design recommendations

Loose coupling provides the cornerstone of a microservice architecture by allowing services to be designed in a way that each service has little or no knowledge of surrounding services. The loose aspect allows a service to operate independently. The coupling aspect allows for inter-service communication through well-defined interfaces. In the context of a mission-critical application, it further facilitates high availability by preventing downstream failures from cascading to frontends or different deployment stamps.

When implementing loose coupling, event-driven architecture and asynchronous message processing are key design patterns for interactions which don't require an immediate response. Events indicate a change in state within a microservice and are generated by event producers. Producers don't know anything about how events should be processed or handled. That is the responsibility of consumers. When using asynchronous event-driven communication, a producer publishes an event when something happens within its domain, which another component needs to be aware of. An example would be a price change in a product catalog, which consumers will subscribe to receive so they can process the event asynchronously.

Refer to the event-driven architecture and asynchronous processing patterns for further details.
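
The producer and consumer roles described above can be sketched with an in-process event bus. This is a minimal illustration only: `EventBus`, `PriceChanged`, and the consumer names are hypothetical, and a production system would use a durable message broker rather than in-memory queues.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceChanged:
    """Event emitted by the catalog when a product price changes."""
    product_id: str
    new_price: float


class EventBus:
    """Tiny in-memory stand-in for a message broker (illustrative only)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        queue = asyncio.Queue()
        self._subscribers.append(queue)
        return queue

    async def publish(self, event):
        # The producer doesn't know who consumes the event or how.
        for queue in self._subscribers:
            await queue.put(event)


async def consumer(name, queue, handled):
    """Process events at the consumer's own pace until a None sentinel arrives."""
    while True:
        event = await queue.get()
        if event is None:
            break
        handled.append((name, event.product_id, event.new_price))


async def main():
    bus = EventBus()
    handled = []
    workers = [
        asyncio.create_task(consumer(name, bus.subscribe(), handled))
        for name in ("search-index", "cart")
    ]
    await bus.publish(PriceChanged("sku-42", 19.99))
    await bus.publish(None)  # tell consumers to stop
    await asyncio.gather(*workers)
    return handled


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each consumer has its own queue, so a slow or failed consumer doesn't block the producer or the other consumers.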

In reality, applications can combine loose and tight-coupling, depending on business objectives.

Loosely coupled services aren't constrained to use the same compute platform, programming language,

runtime, or operating system.

Services can scale independently, optimizing the use of infrastructure and platform resources.

Failures can be handled separately and don't affect client transactions.

Transactional integrity can be harder to maintain because data creation and persistence happens within

separate services.

End-to-end tracing requires more complex orchestration.

Key functionality should be deployed and managed as independent loosely coupled microservices with

event-driven interaction through well-defined interfaces (synchronous and asynchronous).

The definition of microservice boundaries should consider and align with critical user-flows.

Use event-driven asynchronous communication where possible to support sustainable scale and optimal

performance.

Example - Event-driven approach

Application-level resiliency patterns and error handling

Design considerations

Use patterns like the Outbox and Transactional Session to guarantee consistency so that every message is

processed correctly.

The Mission-Critical Online reference implementation uses microservices to process a single business

transaction. It applies write operations asynchronously with a message broker and worker, while read operations

are synchronous with the result directly returned to the caller.

A mission-critical application must be developed with resiliency in mind. It is therefore critical that application

code be designed and developed to be resilient, ensuring that the application can respond to failure, which is

ultimately an unavoidable characteristic of highly distributed multi-tenant cloud environments like Azure.

More specifically, all application components should be designed from the ground-up to apply key resiliency

patterns for self-healing, such as retries with back-off and circuit breaker. Such patterns go to great lengths to

transparently handle transient faults such as network packet loss, or the temporary loss of a downstream

dependency. So, the application code should address as many failure scenarios as possible in order to maximize

service availability and reliability.
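
To make the circuit breaker idea concrete, here is a minimal sketch. The class, its thresholds, and the injectable `clock` parameter are hypothetical and meant only to illustrate the state transitions (closed, open, half-open); in practice a proven library would normally be used instead of a hand-rolled implementation.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on an unavailable dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, product_id)`, where `fetch_ratings` is a hypothetical downstream call, and treat `CircuitOpenError` as a signal to degrade gracefully.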

When issues are not transient in nature and cannot be fully mitigated within application logic, it becomes the

role of the health model and operational wrappers to take corrective action. However, for this to happen

effectively, it is essential that the application code incorporates proper instrumentation and logging to inform

the health model and facilitate subsequent troubleshooting or root cause analysis when required. More

specifically, application code should be implemented to facilitate distributed tracing, by providing the caller with

a comprehensive error message that includes a correlation ID when a failure occurs.

Tools like Azure Application Insights can help significantly to query, correlate, and visualize application traces.
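
As a small illustration of correlation IDs combined with structured logging, the sketch below tags every log line for a request with the same ID and returns that ID to the caller. The handler, logger name, and field names are hypothetical; a real service would typically forward these records to its chosen telemetry sink.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID of the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object that includes the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream):
    logger = logging.getLogger("catalog-api")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(payload, logger, incoming_id=None):
    """Reuse the caller's correlation ID if present, otherwise create one."""
    cid = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        if "product_id" not in payload:
            logger.error("validation failed: product_id missing")
            return {"status": 400, "error": "product_id is required",
                    "correlation_id": cid}
        logger.info("request processed")
        return {"status": 200, "correlation_id": cid}
    finally:
        correlation_id.reset(token)
```

Because the ID appears in both the response and every log record, an operator can take the ID reported by a caller and query all related trace events.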

Vendor-provided SDKs, such as the Azure service SDKs, will typically provide built-in resiliency

capabilities like retry mechanisms.

It's not uncommon for application responses to transient issues to cause cascading failures.

For example, retrying without an appropriate back-off when a service is being throttled will likely exacerbate the issue.

Retry delays can be linearly spaced, or increase exponentially to 'backoff' via growing delays.

Here are some other resiliency-related patterns:

PATTERN | SUMMARY
Queue-Based Load Leveling | Introduces a buffer between consumers and requested resources to ensure consistent load levels. As consumer requests are enqueued, a worker process dequeues the requests and processes them against the requested resource at a pace set by the worker and the requested resource's ability to process the requests. If consumers expect replies to their requests, a separate response mechanism will also need to be implemented.
Circuit Breaker | Provides stability by either waiting for recovery, or quickly rejecting requests rather than blocking while waiting for an unavailable remote service or resource. Also, handles faults that might take a variable amount of time to recover from when connecting to a remote service or resource.
Bulkhead | Strives to partition service instances into groups based on load and availability requirements, isolating failures to sustain service functionality.
Saga | Manage data consistency across microservices with independent datastores by ensuring services update each other through defined event or message channels. Each service performs local transactions to update its own state and publishes an event to trigger the next local transaction in the saga. If a service update fails, the saga executes compensating transactions to counteract preceding service update steps. Individual service update steps can themselves implement resiliency patterns, such as retry.
Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Retry | Handles transient failures elegantly and transparently.
Throttling | Controls the consumption of resources used by application components, protecting them from becoming overencumbered. When a resource reaches a load threshold, it should safeguard its availability by deferring lower-importance operations and degrading non-essential functionality so that essential functionality can continue until sufficient resources are available to return to normal operation.

Design recommendations

Design and develop application code to anticipate and handle failures.

Use vendor provided SDKs, such as the Azure SDKs, to connect to dependent services.

Use the resiliency capabilities provided by utilized SDKs instead of reimplementing resiliency

functionality.

Ensure a suitable back-off strategy is applied when retrying failed dependency calls to avoid a self-

inflicted DDoS scenario.
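
A back-off strategy of this kind can be sketched as follows. This is an illustration only, using exponential growth with a cap and random jitter so that many clients don't retry in lockstep. `TransientError`, the default delays, and the injectable `sleep` and `rng` parameters are hypothetical; where an SDK already provides retry behavior, that built-in capability should be preferred.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a fault that is expected to clear on its own (assumption)."""


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


def retry(operation, max_attempts=5, base=0.5, cap=30.0,
          sleep=time.sleep, rng=random.random):
    """Call operation, retrying only on TransientError with growing delays.

    Any other exception is treated as non-transient and propagates immediately.
    """
    delays = backoff_delays(max_attempts - 1, base, cap, rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller or health model react
            sleep(next(delays))
```

Limiting the number of attempts and capping the delay keeps a throttled dependency from being hit harder while it recovers.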

 
Define common engineering criteria for all application microservice teams to drive consistency and acceleration regarding the use of application-level resiliency patterns.

Developers should familiarize themselves with common software engineering patterns for resilient applications.

Implement resiliency patterns using proven standardized packages, such as NServiceBus or Polly for C# or Sentinel for Java.

Implement Health Endpoint Monitoring by exposing functional checks within application code through health endpoints which external monitoring solutions can poll to retrieve application component health statuses. Responses should be interpreted alongside key operational metrics to inform application health and trigger operational responses, such as raising an alert or performing a compensating roll-back deployment.

Implement Queue-Based Load Leveling by applying a prioritized ordering so that the most important activities are performed first.

Implement the Retry pattern to enable application code to handle transient failures elegantly and transparently.

Cancel if the fault is unlikely to be transient and the operation is unlikely to succeed if reattempted.

Retry if the fault is unusual or rare and the operation is likely to succeed if attempted again immediately.

Retry after a delay if the fault is caused by a condition that may need a short time to recover, such as network connectivity or high load failures.

Apply a suitable 'backoff' strategy with growing retry delays.

Use correlation IDs for all trace events and log messages to tie them to a given request.

Return correlation IDs to the caller for all calls, not just failed requests.

Use structured logging for all log messages.

Select a unified operational data sink for application traces, metrics, and logs to enable operators to seamlessly debug issues.

Ensure operational data is used in conjunction with business requirements to inform an application health model.

Next step
Review the considerations for the application platform.
Application platform

Application platform considerations for mission-critical workloads on Azure

12/16/2022 • 26 minutes to read

IMPORTANT

Programming language selection

Design considerations

Azure provides many compute services for hosting highly available applications. The services differ in capability

and complexity. We recommend that you fully understand the capability matrix with focus on:

Non-functional requirements surrounding reliability, availability, performance, and security.

Decision factors such as scalability, cost, operability, and complexity.

The selection of an appropriate application hosting platform is a critical decision that has impact on all other

design areas. For example, the scale-limits of a particular service will have a key bearing on suitability and the

overall application design in terms of scale-unit definitions. A mission-critical application can use more than one

compute service in parallel to support multiple composite workloads and microservices with distinct platform

requirements.

This design area explores the important decision factors and provides recommendations related to the selection,

design, and configuration of an appropriate application hosting platform.

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we

recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The code assets illustrate the

considerations and implementations for a selected compute service.

Selecting the right programming languages and frameworks is a critical design decision. Typically this decision

is driven by the availability of development skills or by the use of standardized technologies within an

organization. However, given the reliability aspirations, it's essential to also evaluate the performance and

resilience aspects of different languages/frameworks as well as the capability differences within required Azure

SDKs.

There are sometimes significant differences in the capabilities offered by Azure service SDKs in different

languages, and this may influence the selection of an Azure service or programming language.

For instance, if Azure Cosmos DB is a significant dependency, 'Go' may not be an appropriate

development language since there's no first-party SDK.

New features are typically added to the .NET and Java libraries first, and there can be a delay in feature

availability for other supported languages.

The application can use multiple programming languages or frameworks in parallel to support multiple

composite workloads with distinct requirements.

However, significant technology sprawl should be avoided since it introduces management complexity and operational challenges.

Design recommendations

Evaluate all relevant Azure SDKs to confirm requisite capabilities are available for selected programming languages, ensuring alignment with non-functional requirements.

Optimize the selection of programming languages and frameworks at the microservice level; embrace multiple technologies where appropriate.

Avoid extensive technology sprawl to prevent unnecessary operational complexity.

Prioritize the .NET SDK to optimize reliability and performance since .NET Azure SDKs typically provide additional capabilities and receive new features first.

Containerization

Containerization allows developers to create and deploy applications faster and more reliably by bundling application code together with related configuration files, libraries, and dependencies required for it to run. This single software package, the container, runs on a shared kernel abstracted from the host operating system, and as a result is highly portable, capable of running consistently across different infrastructure platforms or cloud providers.

Design considerations

Containerization has become a major trend in software development since it provides measurable

benefits for developers and operations teams as well as optimizing infrastructure utilization. More

specifically, the benefits of containerizing application components include:

Improved infrastructure utilization: Containers don't include operating system images, so they require fewer system resources. Multiple containers can therefore be hosted on the same virtualized infrastructure, and this helps to optimize resource utilization by consolidating on fewer resources with higher container density.

Portability: Including all software dependencies within the container ensures that it will work across

different operating systems regardless of runtimes or library versions. Containerized applications are

therefore easier to move between application platforms due to the standardized container format.

Faster scaling operations: Containers are lightweight and don't suffer from the slow start-up and

shutdown times afflicting virtual machines, and since container images are pre-built, the start-up

activity can be minimized to focus only on bootstrapping the application.

Simplified management: The consistent portability and ephemeral nature of containers provide a

simplified infrastructure management experience compared to traditional virtualized hosting.

Agile development: Containers support accelerated development, test, and production cycles

through consistent operation and less overhead.

The drawbacks of containerizing application components include:

Complex monitoring: Monitoring services can find it harder to access applications running inside a

container. Third-party software is typically required to collect and store container state indicators, such

as CPU usage or RAM usage.

Security: The hosting platform OS kernel is shared across multiple containers, creating a single point

of attack. However, the risk of host VM access is limited since containers are isolated from the

underlying operating system.

Containerization has proven to be an excellent option for packaging applications across different

development languages, providing an abstraction layer for application code and its dependencies to

achieve separation from the underlying hosting platform.

Containerization enables workloads to run on Azure without application code needing to be re-written.

It provides a good opportunity to modernize legacy applications without significant code change, and can therefore be suitable for 'lift and shift' scenarios depending on the considered application frameworks and external dependencies.

While it's possible to store data within a running container's filesystem, the data will not persist when the container is recreated, so instead persistence is typically achieved by 'mounting' external storage.

Design recommendations

Containerize all application components, using container images as the primary model for application deployment packages.

Prioritize Linux-based container runtimes when possible.

Avoid persisting state/data within a container since containers should be immutable and replaceable with short lifecycles.

Ensure that all relevant logs and metrics are gathered from the container, container host, and underlying cluster. Gathered logs and metrics should be sent to a unified data sink for further processing.

Container Orchestration and Kubernetes

There are several Azure application platforms capable of effectively hosting containers:

Azure Kubernetes Service (AKS)

Azure Container Instances (ACI)

Azure App Service

Azure Service Fabric

Azure Red Hat OpenShift

There are advantages and disadvantages associated with each of these Azure container platforms, which should be analyzed in the context of business requirements to inform an optimal technical choice; each platform is an optimal choice for certain scenarios. However, given that the principles underpinning the design methodology strive to optimize reliability, scalability, and performance, it's strongly recommended to prioritize the use of Azure Kubernetes Service.

Azure Kubernetes Service (AKS) is Microsoft Azure's native managed Kubernetes service, which allows for rapid Kubernetes cluster provisioning without complex cluster administration activities, and enhances standard Kubernetes with a rich feature set that includes advanced networking and identity capabilities.

For web and API based workload scenarios, Azure App Service offers a feasible alternative to AKS, providing a low-friction container platform without the complexity of Kubernetes.

The considerations and recommendations within this section will therefore focus on optimal AKS usage as well as App Service for low-scale scenarios.

Design considerations

Azure Kubernetes Service

There are many different container orchestrators, but Kubernetes has become the clear market leader and

is best supported across the open source community and major cloud providers.

Kubernetes expertise is readily available within the employment market.

Kubernetes has a steep learning curve, so if development teams are new to it, setting up and

maintaining a Kubernetes environment in a secure and reliable way will require non-trivial

engineering investment.

Kubernetes as well as managed Kubernetes offerings like AKS are widely available and can address

concerns regarding vendor lock-in.

Default 'vanilla' Kubernetes requires significant configuration to ensure a suitable security posture for

business-critical application scenarios. AKS addresses various security risks out of the box, such as
support for private clusters, auditing and logging into Log Analytics, and hardened node images.
AKS provides a control plane that is managed by Microsoft. By default the control plane of AKS is
provided free of charge, but without any guaranteed SLA. Customers only manage and pay for the
worker nodes which form the cluster.
The optional AKS Uptime SLA provides availability guarantees for the Kubernetes control plane: 99.95%
availability of the Kubernetes API server endpoint for AKS clusters that use Azure Availability Zones, and
99.9% availability for AKS clusters that don't use Azure Availability Zones.
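To make the two Uptime SLA figures easier to compare, the following minimal sketch converts an availability percentage into the downtime it permits. The function name and the 30-day period are assumptions chosen for illustration; the actual SLA terms define the measurement window.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the maximum downtime, in minutes, that an SLA permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# The two Uptime SLA levels quoted above, evaluated over an assumed 30-day period.
for label, sla in (("with Availability Zones", 99.95),
                   ("without Availability Zones", 99.9)):
    print(f"{label}: {allowed_downtime_minutes(sla):.1f} minutes per 30 days")
```

Under these assumptions the zonal configuration permits roughly 21.6 minutes of API server downtime per 30 days, compared with roughly 43.2 minutes without Availability Zones.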
Some foundational configuration decisions have to be made upfront and can't be changed without
redeploying an AKS cluster:
Selection between public and private AKS clusters.
Enabling Azure Network Policy.
Azure AD integration and the use of Managed Identities for AKS instead of Service Principals.
AKS supports Kubernetes versions aligned with the release cycle of the Kubernetes project. Clusters and
node pools need to be upgraded on a regular basis.
AKS supports different ways to update nodes and/or clusters in a manual or automated way. The AKS
team releases new images on a weekly basis for Windows and Linux nodes.
AKS supports different auto-upgrade channels to automatically upgrade AKS clusters to newer versions
of Kubernetes and/or newer node images once available. Planned Maintenance can be used to define
maintenance windows for these operations.
AKS supports different network plugins. The Azure CNI plugin is required to enable certain capabilities
within AKS, such as Windows-based node pools or Kubernetes Network Policies. AKS also supports
bring-your-own-CNI for specific use cases.
AKS differentiates between system node pools and user node pools to separate system and workload
services. User node pools can be scaled down to 0 nodes if needed.
The AKS Stop/Start cluster feature allows an AKS cluster in dev/test scenarios to be paused while
maintaining cluster configuration, saving time and cost compared to re-provisioning.
Azure Monitor for containers (Container Insights) provides a seamless onboarding experience, enables
various monitoring capabilities out of the box as well as more advanced scenarios via its built-in
Prometheus scraping support.
AKS integrates with Azure AD to enable Managed Identities for the cluster, node pools, and
individual pods; role-based access control (RBAC) using Azure AD credentials; and
authentication with Azure Container Registry (ACR).
Azure Policy can help to apply at-scale enforcements and safeguards to AKS clusters in a consistent
centralized manner.
Azure Policy can control which functions pods are granted and detect whether anything running violates
policy. This access is defined through built-in policies provided by the Azure Policy Add-on for AKS.
AKS has certain scale limits, such as the number of nodes and number of node pools per cluster, as well
as the number of clusters per subscription.
Azure App Service
SNAT port exhaustion is a common failure scenario with Azure App Service, which can be predicted
through load testing while monitoring ports using Azure Diagnostics. SNAT ports are used when making
outbound connections to public IP addresses.
TCP port exhaustion is a further common failure scenario, which occurs when the sum of outbound
connections from a given worker exceeds the capacity. The number of available TCP ports depends on the
size of the worker.
Azure App Service has a default, soft limit of instances per App Service Plan. This limit can be increased
by opening a support ticket, if the App Service routinely uses 15 or more instances.
Per-app scaling can be enabled at the App Service Plan level to allow an application to scale
independently from the App Service plan that hosts it. For example, an App Service Plan can be scaled to
10 instances, but an app can be set to use only 5.
Apps are allocated to available nodes using a best effort approach for an even distribution. While an
even distribution isn't guaranteed, the platform will make sure that two instances of the same app will
not be hosted on the same instance.
There are a number of events that can lead App Service workers to restart, such as content deployment,
App Settings changes, and Virtual Network integration configuration changes.
App Service plan autoscale will scale out if any rule within the profile is met, but will only scale in if
all rules within the profile are met.
Diagnostic logging provides the ability to ingest application and platform level logs into either Log
Analytics, Azure Storage, or a third party tool via Event Hubs.
Application performance monitoring with Application Insights provides deep insights into application
performance.
For Linux Plans a code-based enablement (SDK) is required.
For Windows Plans a 'codeless deployment' approach is possible to quickly get insights without
changing any code.
For cost management, see plan and manage costs for Azure App Service.
Design recommendations
Azure Kubernetes Service
Use Azure Kubernetes Service (AKS) as the primary application hosting platform where requirements allow.
Availability
Deploy AKS clusters across different Azure regions as a scale-unit to maximize reliability and
availability.
Use Availability Zones to maximize resilience within an Azure region by distributing AKS control plane
and agent nodes across physically separate datacenters.
Where co-locality latency requirements exist, either AKS deployment within a single zone or proximity
placement groups should be used to minimize inter-node latency.
Use the AKS Uptime SLA for production clusters to maximize Kubernetes API endpoint availability
guarantees.
Ensure AKS subscription scale limits are appropriately considered when designing the AKS
deployment model to ensure requisite scalability.
Scalability
Enable cluster autoscaler to automatically adjust the number of agent nodes in response to resource
constraints.
Utilize the Horizontal pod autoscaler to adjust the number of pods in a deployment depending on CPU
utilization or other selected metrics.
For high scale and burst scenarios, consider the use of Virtual Nodes for extensive and rapid scale.
Define pod resource requests and limits in application deployment manifests.
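As a rough illustration of the scalability recommendations above, the sketch below builds a Deployment that declares resource requests and limits, together with a Horizontal Pod Autoscaler that targets it. The workload name, image reference, resource values, and replica bounds are hypothetical; the field names follow the Kubernetes `apps/v1` and `autoscaling/v2` APIs.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Return an apps/v1 Deployment whose container declares requests and limits."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                # Requests inform scheduling; limits cap consumption.
                                "requests": {"cpu": "250m", "memory": "256Mi"},
                                "limits": {"cpu": "500m", "memory": "512Mi"},
                            },
                        }
                    ]
                },
            },
        },
    }


def build_hpa(name: str, min_replicas: int = 3, max_replicas: int = 10,
              cpu_target_percent: int = 70) -> dict:
    """Return an autoscaling/v2 HorizontalPodAutoscaler targeting the Deployment."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization",
                                   "averageUtilization": cpu_target_percent},
                    },
                }
            ],
        },
    }


# Hypothetical workload name and image, for illustration only.
manifests = [build_deployment("orders-api", "example.azurecr.io/orders-api:1.0.0"),
             build_hpa("orders-api")]
print(json.dumps(manifests, indent=2))
```

CPU utilization for the autoscaler is measured relative to the container's CPU request, which is one reason to define requests explicitly in the deployment manifest.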

Isolation
Ensure the System node pool is isolated from application workloads.
Use dedicated node pools for infrastructure components and tools that require high resource
utilization, to avoid noisy neighbor scenarios.
Separate distinct application workloads to dedicated node pools based on workload requirements,
considering requirements for specialized infrastructure resources such as GPU or high-memory VMs.
Avoid deploying large numbers of node pools to reduce additional management overhead.
Use taints and tolerations to provide dedicated nodes and limit resource intensive applications.
Evaluate application affinity and anti-affinity requirements and configure the appropriate colocation of
containers on nodes.
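The isolation recommendations above can be sketched as a pod specification that tolerates the taint of a dedicated node pool and uses anti-affinity to keep replicas on separate nodes. The `workload` taint key and node label, as well as the names passed in, are assumptions for illustration; the corresponding taint and label would have to be applied to the node pool separately.

```python
def dedicated_pod_spec(app: str, image: str, pool_label: str) -> dict:
    """Return a pod spec pinned to a tainted, dedicated node pool."""
    return {
        # Schedule only onto nodes labelled for this workload (hypothetical label).
        "nodeSelector": {"workload": pool_label},
        # Tolerate the matching NoSchedule taint placed on the dedicated nodes.
        "tolerations": [
            {"key": "workload", "operator": "Equal",
             "value": pool_label, "effect": "NoSchedule"}
        ],
        # Never co-locate two replicas of the same app on one node.
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {"matchLabels": {"app": app}},
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
        "containers": [{"name": app, "image": image}],
    }


spec = dedicated_pod_spec("gpu-inference", "example.azurecr.io/inference:1.0.0", "gpu")
print(spec["tolerations"])
```

A toleration only permits scheduling onto tainted nodes; the node selector is what steers the pod there, so the two are typically used together.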
Networking
Ensure proper selection of network plugin based on network requirements and cluster sizing.
Prioritize the use of Azure CNI.
Use Azure or Calico Network Policies to control traffic within the cluster (requires Azure CNI).
Security
Apply configuration guidance provided within the AKS security baseline.
Harden the AKS cluster to remove critical security risks associated with Kubernetes deployments.
Evaluate Azure AD workload identity to assign managed identities at pod-level.
Use Secrets Store CSI Driver with Azure Key Vault to protect secrets, certificates, and connection
strings.
Use Managed Identities to avoid having to manage and rotate service principal credentials.
Utilize Azure Active Directory integration to take advantage of centralized account
management and passwords, application access management, and identity protection.
Use Kubernetes RBAC with Azure Active Directory for least privilege, and minimize granting
administrator privileges to protect configuration and secrets access.
Limit access to the Kubernetes cluster configuration file with Azure role-based access control.
Limit access to actions that containers can perform, provide the least number of permissions,
and avoid the use of root access or privilege escalation.
Establish a consistent reliability and security baseline for AKS cluster and pod configurations
using Azure Policy.
Use the Azure Policy Add-on for AKS to control pod functions, such as root privileges,
and disallow pods which don't conform to policy.
Policy assignments should be enforced at a subscription scope or higher to drive
consistency across development teams.
Stay current
Subscribe to the public AKS Roadmap and Release Notes on GitHub to stay up-to-date on upcoming
changes, improvements, and most importantly Kubernetes version releases or the deprecation of old
releases.
Consider and apply the guidance provided within the AKS checklist to ensure alignment with
Well-Architected best practice guidance.
Regularly upgrade to a supported version of Kubernetes.
Establish a governance process to check and upgrade as needed to not fall out of support.
Leverage the AKS Cluster auto-upgrade with Planned Maintenance.
Regularly process node image updates to remain current with new AKS images.
Observability
Utilize a solution like Azure Monitor and Application Insights to centrally collect metrics, logs, and
diagnostics from AKS resources for troubleshooting purposes.
Enable and review Kubernetes master node logs.
Configure the scraping of Prometheus metrics with Azure Monitor for containers.
Container Images
Store container images in Azure Container Registry.
Enable geo-replication to replicate container images across all leveraged AKS regions.
Enable Azure Defender for container registries to provide vulnerability scanning for container images.
Use Azure AD to access Azure Container Registry.
NOTE
When deploying into an Azure landing zone, be aware that the required Azure Policy to ensure consistent
reliability and security should be provided by the landing zone implementation. The mission-critical reference
implementations provide a suite of baseline policies to drive recommended reliability and security configurations.
Azure App Service
For lower scale workload scenarios, Azure App Service can provide a feasible alternative to AKS, without
the complexities associated with administering Kubernetes.
Consider and plan for future scalability requirements and application growth so that a strategic
technology decision can be applied from the start, avoiding future technical migration debt as the
solution grows.
If the lack of requisite Kubernetes expertise presents an unacceptable delivery risk, consider Azure
App Service as an alternative container platform.
Leverage the container based deployment model for App Service Plans.
Use Premium V3 plans with 2 or more worker instances for high availability.
Use Linux App Service Plans to optimize performance and costs.
Deploy App Service Plans in an Availability Zone configuration to ensure worker nodes are distributed
across zones within a region.
Deploy Azure App Service Plans across multiple regions as a scale unit, using multiple scale-units
deployed within a single region to navigate the default limit of 30 instances per App Service Plan.
Consider opening a support ticket to increase the maximum number of workers to twice the instance
count required to serve normal peak load.
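The scale-unit arithmetic implied above can be sketched as follows. The helper name is hypothetical, and treating "twice the normal peak instance count" as the capacity target for the whole region is an assumption made for illustration; the default limit of 30 instances per plan is taken from the text.

```python
import math


def plan_scale_units(peak_instances: int, plan_limit: int = 30,
                     headroom: float = 2.0) -> tuple[int, int]:
    """Return (number of App Service Plans, max instances per plan).

    The target capacity is the peak instance count multiplied by the headroom
    factor, spread evenly over as few plans as the per-plan limit allows.
    """
    target = math.ceil(peak_instances * headroom)
    units = max(1, math.ceil(target / plan_limit))
    per_unit = math.ceil(target / units)
    return units, per_unit


# A workload that needs 40 instances at normal peak load.
print(plan_scale_units(40))  # (3, 27)
```

With a peak of 40 instances, the target of 80 instances does not fit into one or two plans capped at 30, so three scale units of up to 27 instances each are needed unless the limit is raised through a support ticket.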
Evaluate the use of TCP and SNAT ports to avoid outbound connection errors.
Predictively detect SNAT port exhaustion through load testing while monitoring ports using Azure
Monitor, and if SNAT errors occur, it's necessary to either scale across more/larger workers, or
implement coding practices to help preserve and re-use SNAT ports, such as connection pooling
or the lazy loading of resources.
It's recommended not to exceed 100 simultaneous outbound connections to a public IP Address
per worker, and to avoid communicating with downstream services via public IP addresses when a
Private Endpoint or Service Endpoint could be used.
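Connection pooling, mentioned above as a way to preserve SNAT ports, can be sketched as a small pool that reuses idle connections and caps the number open at once. The `ConnectionPool` class is a hypothetical illustration rather than a library API; in practice the factory could return something like an `http.client.HTTPSConnection`, and many HTTP client libraries already provide pooling.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Reuse idle connections and cap how many are in use at the same time."""

    def __init__(self, factory, max_size=10):
        self._factory = factory                      # callable that opens a connection
        self._idle = queue.LifoQueue()               # most recently used first
        self._slots = threading.BoundedSemaphore(max_size)
        self._lock = threading.Lock()
        self.created = 0                             # connections opened so far

    @contextmanager
    def connection(self):
        self._slots.acquire()  # blocks once max_size connections are in use
        try:
            try:
                conn = self._idle.get_nowait()       # reuse before opening a new one
            except queue.Empty:
                conn = self._factory()
                with self._lock:
                    self.created += 1
            try:
                yield conn
            finally:
                # A fuller version would discard connections that failed.
                self._idle.put(conn)
        finally:
            self._slots.release()


# object() stands in for a real outbound connection in this sketch.
pool = ConnectionPool(factory=object, max_size=2)
for _ in range(5):
    with pool.connection() as conn:
        pass  # issue the outbound request here
print(pool.created)  # 1
```

Five sequential requests share a single connection here, so only one outbound port is consumed instead of five.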
Avoid unnecessary worker restarts.
Make changes within a deployment slot other than the slot currently configured to accept production
traffic. After workers are recycled and warmed up, a 'swap' can be performed without unnecessary
downtime.
Enable AutoHeal to automatically recycle unhealthy workers.
Enable Health Check to identify non-responsive workers.
While any health check is better than none at all, the logic behind endpoint tests should assess all
critical downstream dependencies to ensure overall health.
Enable AutoScale to ensure adequate resources are available to service requests.
Use a scale-out and scale-in rule combination to ensure auto-scale can take action to both scale-out
and scale-in.
Understand the behavior of multiple scaling rules in a single profile.
Enable Diagnostic Logging to provide insight into application and platform behavior.
Enable Application Insights Alerts to be made aware of fault conditions.
Review Azure App Service diagnostics to ensure common problems are addressed.
It's a good practice to regularly review service-related diagnostics and recommendations, taking
action as appropriate.
Evaluate per-app scaling for high-density hosting on Azure App Service Plans.

Serverless compute
Serverless computing provides compute resources on-demand and eliminates the need to manage
infrastructure altogether; the cloud provider automatically provisions, scales, and manages resources required
to run deployed application code.
Microsoft Azure provides several serverless compute platforms:
Azure Functions: Application logic is implemented as distinct blocks of code ("functions") which run in
response to events, such as an HTTP request or queue message, with each function scaling as necessary to
meet demand.
Azure Logic Apps: Platform for creating and running automated workflows that integrate various apps, data
sources, services and systems. Similar to Azure Functions, there are built-in triggers for event-driven
processing but instead of deploying application code, Logic Apps can be composed using a graphical user
interface which supports code blocks like conditionals and loops.
Azure API Management: Publish, secure, transform, maintain, and monitor APIs using the Consumption tier.
Power Apps & Power Automate: Provides a 'low-code/no-code' development experience, with simple
workflow logic and integrations configurable through connections in a user interface. Developed Power Apps
can subsequently be deployed to a Microsoft 365 tenant and consumed from either a web browser or the
Power Apps mobile client.
In the context of a reliable application platform, serverless technologies provide a near-zero friction
development and operational experience, which can be highly valuable for simple business process scenarios.
However, this relative simplicity comes at the cost of flexibility in terms of scalability, reliability, and performance,
which is likely unacceptable for most business-critical application scenarios.
The design methodology positions serverless technologies as an alternative platform for simple business
process scenarios which don't share the same stringent business requirements as critical system flows. The
design considerations and recommendations within this section focus on optimal Azure Functions and Azure
Logic Apps usage as alternative platforms for non-critical workflow scenarios.
Design considerations
Azure Functions
In most cases Azure Functions don't require additional code to call external services or to enable external events to trigger function execution, since these can be achieved with Azure Functions bindings.

Azure Functions supports multiple triggers, such as the HTTP trigger, and bindings for Azure services, such as Azure Cosmos DB, Azure Service Bus, and Azure Blob Storage.

There are three hosting plans available for Azure Functions:

Consumption is the fully serverless pay-per-use option, with instances dynamically added and removed based on the number of incoming events; underlying compute resources are charged only when running.

Premium uses a Premium SKU App Service plan to host functions and allows the configuration of compute instance size.

Dedicated is the least serverless option as it's tied to a provisioned App Service plan or App Service Environment. Autoscale can be enabled, but scale operations are slower than with the Consumption and Premium plans.

Fully serverless hosting options, which help optimize costs by de-provisioning allocated resources when workloads are not running, may incur "cold start" delays, especially for applications composed of many files, such as Node.js or PHP applications.

See Azure Functions hosting options for more details about the service limits.

Azure Logic Apps

There are three deployment modes available for Azure Logic Apps:

Consumption is the fully serverless pay-per-use model, with Azure managing the infrastructure, which is shared across multiple tenants.

Consumption (ISE) uses the dedicated Integration Service Environment (ISE) to privately host logic apps.

Standard uses the containerized single-tenant Azure Logic Apps runtime based on Azure Functions.

For Standard, each logic app can have multiple stateful and stateless workflows. For Consumption and Consumption (ISE), a single logic app can have only one workflow.

Similar to Azure Functions, there are built-in triggers for event-driven processing; however, instead of deploying application code, Logic Apps can be composed using a graphical user interface which supports blocks like conditionals and loops.

With the Standard deployment model, default limits can be modified; however, some limits have upper maximums.

For Consumption plans, Azure manages the default configuration limits, but some values can be changed through the creation of a support ticket.

Design recommendations

Azure Functions

Consider Azure Functions for simple business process scenarios which don't have the same stringent

business requirements as business-critical system flows.

Low-criticality scenarios can also be hosted as separate containers within AKS to drive consistency,

provided affinity and anti-affinity requirements are fully considered when collocating containers on

nodes.

Azure Functions should perform distinct operations that run as fast as possible.

This makes them very flexible and highly scalable, ensuring they will work well in event-driven

workloads with short-lived processes.

Use the Premium Function Linux hosting plan to maximize reliability and performance while maintaining the serverless promise.

Take a scale-unit approach to navigate the resource limit of 20 Linux nodes.

When using the HTTP trigger to expose an external endpoint, protect the HTTP endpoint from common external attack vectors using a Web Application Firewall (WAF).

For internal workloads, consider the use of Service Endpoints or Private Endpoints to restrict access to private Virtual Networks.

If required, use Private Endpoints to mitigate data exfiltration risks, such as malicious admin scenarios.

Treat Azure Functions code just like any other code; subject it to code scanning tools that are integrated with CI/CD pipelines.

Azure Logic Apps

Use the standard deployment mode to ensure a single tenant deployment and mitigate 'noisy neighbor' scenarios.

Asynchronous messaging

The recommendation for a loosely coupled microservice architecture relies heavily on asynchronous messaging to provide requisite integration between application components through a well-defined message bus. Azure provides several native messaging services which can be used to facilitate asynchronous message interfaces, including Azure Event Hubs, Azure Service Bus, Azure Storage Queues, and Azure Event Grid.

This section will therefore explore key decision factors when selecting an appropriate Azure message service, and how to optimally configure each service in the context of a mission-critical workload scenario to maximize reliability.

Design considerations

More than one type of messaging service can be used by an application for different workload scenarios.

There are many factors to consider when selecting an optimal messaging service for a specific scenario.

Asynchronous Messaging on Azure

Azure Storage Queues vs Azure Service Bus Queues

Comparison between Event Grid, Event Hubs and Service Bus

Azure Messaging Services - How to Choose the Right Messaging Technology in Azure

Azure Event Hubs

Azure Event Hubs is designed as an event streaming service that scales to multiple millions of messages per

second.

Event Hubs supports Availability Zones (AZs) for zonal redundancy within supported regions in all

pricing tiers.

Azure Event Hubs Namespaces support configurable failover replication between regions; however, only

configuration metadata is replicated, not the messages themselves. The replicated configuration

includes Event Hubs (inside a Namespace), Consumer Groups, Namespace Authorization

Rules, and Event Hubs Authorization Rules.

Azure Event Hubs can be configured to capture raw data into Azure Blob Storage or Azure Data Lake

Storage.

Data can be processed from Event Hubs by a client using the Event Processor SDK, or by Azure Stream

Analytics.

Event Hubs can be deployed in Basic, Standard, or Dedicated tiers, which provide different performance characteristics and SLAs. See SLA for Event Hubs for more details.

Event Hubs provides an Apache Kafka-compatible messaging interface.

See Azure Event Hubs quotas and limits for more details.

Azure Service Bus

Azure Service Bus provides reliable asynchronous message delivery that requires polling.

Service Bus supports Availability Zones (AZs) for zonal redundancy within supported regions in the Premium tier.

Service Bus supports message sizes of 256 KB in the Basic and Standard tiers and 1 MB in the Premium tier.

Service Bus offers an SLA of 99.9% on send and receive operations.

Azure Storage Queues

Azure Storage Queues provide a simple messaging solution, which can be communicated with using a REST API.

Storage Queues don't guarantee message ordering.

Throttling will occur if the maximum throughput limit is reached.

Through the geo-replication feature of Azure Storage Accounts, Storage Queues can be configured to asynchronously replicate to another region.

Storage Queues provide the same SLA as their underlying Storage Accounts: a 99.99% SLA on read requests for RA-GRS and 99.9% for write requests.

See Scalability and performance targets for Queue Storage for more details about service limitations.

Azure Event Grid

Azure Event Grid is designed as an event distribution service for reactive programming.

Event Grid integrates with many Azure services as event sources. For example, Event Grid can be configured to react to status changes on Azure resources.

Event Grid provides an SLA of 99.99% for message publishing.

See Azure Event Grid quotas and limits for more details.

Design recommendations

Prioritize the use of Event Hubs for scenarios which require high throughput and which can work with

message ordering on a per-partition basis.

Use Service Bus for scenarios requiring a higher QoS and message guarantee by implementing

two-phase commits.

Use Service Bus Premium in a zonal redundant configuration to provide high-availability within a

region.

Consider Storage Queues when message geo-replication is required provided the message size is less

than 64 KB.

Use an AZ-redundant tier for the underlying Storage Account (ZRS or GZRS).

Use Event Grid for scenarios where services need to react to changes in another service/component.
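
The recommendations above can be read as a simple decision procedure. The following sketch encodes them as a helper function; the `MessagingNeeds` fields, the function name, and the order in which the rules are checked are assumptions for illustration, and a real selection would weigh additional factors such as cost and service limits.

```python
from dataclasses import dataclass


@dataclass
class MessagingNeeds:
    """Hypothetical summary of a scenario's messaging requirements."""
    high_throughput_streaming: bool = False   # ordering per partition is acceptable
    transactional_delivery: bool = False      # higher QoS, two-phase commits
    geo_replicated_messages: bool = False     # messages must be copied to another region
    max_message_kb: int = 1
    reacts_to_state_changes: bool = False     # react to changes in another component


def suggest_service(needs: MessagingNeeds) -> str:
    """Map requirements to the service the recommendations point to."""
    if needs.high_throughput_streaming:
        return "Event Hubs"
    if needs.transactional_delivery:
        return "Service Bus (Premium, zone-redundant)"
    if needs.geo_replicated_messages and needs.max_message_kb < 64:
        return "Storage Queues (ZRS or GZRS storage account)"
    if needs.reacts_to_state_changes:
        return "Event Grid"
    return "No single recommendation; compare the services in more detail"


print(suggest_service(MessagingNeeds(geo_replicated_messages=True, max_message_kb=16)))
```

Because more than one messaging service can be used by the same application, a helper like this would typically be applied per workload scenario rather than once per application.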

Constrained migrations using IaaS

Many applications with existing on-premises deployments use virtualization technologies and redundant hardware to provide mission-critical levels of reliability. Modernization is often hindered by business constraints which prevent full alignment with the cloud-native baseline (north-star) architecture pattern recommended for mission-critical workloads. So, many applications adopt a phased approach, with initial cloud deployments using virtualization and Azure Virtual Machines as the primary application hosting model.

This section will therefore focus on the optimal usage of Azure Virtual Machines and associated services in order to maximize the reliability of the application platform, highlighting key aspects of the mission-critical design methodology which transpose cloud-native and IaaS migration scenarios.

Design considerations

The use of IaaS Virtual Machines can be required for certain scenarios:

Available PaaS services do not provide the required performance or level of control.

The workload requires operating system access, specific drivers, or network and system configurations.

The workload does not support running in containers.

Lack of support for third-party workloads.

The use of IaaS Virtual Machines significantly increases operational costs compared to PaaS services, because of the management responsibility for the virtual machine and the operating system.

Managing virtual machines necessitates the frequent roll-out of software packages and updates.

Azure provides certain capabilities to increase the availability of Virtual Machines:

Availability Sets can be used to protect against network, disk and power failures by distributing virtual machines across fault domains and update domains.

Availability Zones can be used to achieve even higher levels of reliability by distributing VMs across physically separated data centers within a region.

Virtual Machine Scale Sets provide functionality to automatically scale the number of virtual machines along with capabilities to monitor instance health and automatically repair unhealthy instances.

Design recommendations

IMPORTANT

Prioritize the use of PaaS services and Containers where possible to reduce operational complexity and cost. Only use IaaS Virtual Machines when required.

Right-size VM SKUs to ensure effective resource utilization.

Deploy three or more Virtual Machines across Availability zones to achieve data center level fault

tolerance.

If you are deploying commercial off-the-shelf software, consult with the software vendor and test

adequately before deploying into production.

For workloads which cannot be deployed across Availability Zones, use Availability Sets with three or

more VMs.

Availability Sets should only be considered if Availability Zones do not comply with workload

requirements, such as for 'chatty' workloads with low latency requirements.
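
To illustrate why three or more virtual machines are recommended, the sketch below spreads VMs round-robin across three Availability Zones and counts how many remain if one zone fails. The VM names and helper functions are hypothetical, and the placement is computed locally rather than through any Azure API.

```python
def spread_across_zones(vm_count: int, zones=("1", "2", "3")) -> dict:
    """Assign VMs to zones round-robin so that each zone holds a similar share."""
    return {f"vm-{i}": zones[i % len(zones)] for i in range(vm_count)}


def surviving_after_zone_loss(placement: dict, failed_zone: str) -> int:
    """Count the VMs that remain available when one zone is lost."""
    return sum(1 for zone in placement.values() if zone != failed_zone)


placement = spread_across_zones(3)
print(placement)
print(surviving_after_zone_loss(placement, "1"))  # 2
```

With three VMs in three zones, losing any single zone leaves two instances serving traffic, whereas two VMs would leave only one and no remaining redundancy.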

Prioritize the use of Virtual Machine Scale Sets for scalability and zone-redundancy. This is particularly important for workloads with varying load (e.g. number of active users or requests per second).

Do not access individual virtual machines directly; use load balancers in front when possible.

To protect against regional outages, deploy application virtual machines across multiple Azure regions.

Please refer to the networking and connectivity design area for further details about how to optimally route traffic between active deployment regions.

For workloads which do not support multi-region active-active deployments, consider active-passive by using hot/warm standby virtual machines for regional failover.

Prioritize the use of standard images from the Azure Marketplace over custom images that need to be maintained.

Implement automated processes to deploy and roll out changes to virtual machines, avoiding any manual intervention. See IaaS considerations in the Operational procedures design area for more.

Implement chaos experiments to inject application faults into virtual machine components while observing the mitigation of faults. See Continuous validation and testing in the Deployment and testing design area for more details.

Monitor virtual machines and ensure diagnostic logs and metrics are ingested into a unified data sink.

Follow and apply security practices for mission-critical application scenarios as described above, when applicable, as well as the Security best practices for IaaS workloads in Azure.

Next step

Review the considerations for the data platform.

Data platform

Data platform considerations for mission-critical workloads on Azure

12/16/2022 • 49 minutes to read

The selection of an effective application data platform is a further crucial decision area, which has far-reaching implications across other design areas. Azure ultimately offers a multitude of relational, non-relational, and analytical data platforms, which differ greatly in capability. It's therefore essential that key non-functional requirements be fully considered alongside other decision factors such as consistency, operability, cost, and complexity. For example, the ability to operate in a multi-region write configuration will have a critical bearing on suitability for a globally available platform.

This design area expands on application design, providing key considerations and recommendations to inform the selection of an optimal data platform.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The code assets illustrate the considerations and implementations for a selected data platform.

The Four Vs of Big Data

The 'Four Vs of Big Data' provide a framework to better understand requisite characteristics for a highly available data platform, and how data can be used to maximize business value. This section will therefore explore how the Volume, Velocity, Variety, and Veracity characteristics can be applied at a conceptual level to help design a data platform using appropriate data technologies.

Volume: how much data is coming in to inform storage capacity and tiering requirements, that is, the size of the dataset.

Velocity: the speed at which data is processed, either as batches or continuous streams, that is, the rate of flow.

Variety: the organization and format of data, capturing structured, semi-structured, and unstructured formats, that is, data across multiple stores or types.

Veracity: includes the provenance and curation of considered data sets for governance and data quality assurance, that is, accuracy of the data.

Design Considerations

Volume

Existing (if any) and expected future data volumes based on forecasted data growth rates aligned with

business objectives and plans.

Data volume should encompass the data itself and indexes, logs, telemetry, and other applicable

datasets.

Large business-critical and mission-critical applications typically generate and store high volumes (GB

and TB) on a daily basis.

There can be significant cost implications associated with data expansion.

Data volume may fluctuate due to changing business circumstances or housekeeping procedures.

Data volume can have a profound impact on data platform query performance.

There can be a profound impact associated with reaching data platform volume limits.

Will it result in downtime, and if so, for how long?

What are the mitigation procedures, and will mitigation require application changes?

Will there be a risk of data loss?

Features such as Time to Live (TTL) can be used to manage data growth by automatically deleting records after an elapsed time, measured from either record creation or last modification.

For example, Azure Cosmos DB provides an in-built TTL capability.
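The TTL behavior described above can be illustrated with a minimal, platform-neutral sketch. It assumes each record carries a hypothetical `modified` timestamp and that expiry is measured from that value; the function names are illustrative and are not part of any Azure SDK.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_modified: datetime, ttl_seconds: int, now: datetime) -> bool:
    """Return True once ttl_seconds have elapsed since last_modified."""
    return now >= last_modified + timedelta(seconds=ttl_seconds)


def purge_expired(records: dict, ttl_seconds: int, now: datetime) -> dict:
    """Return only the records whose TTL has not yet elapsed.

    'records' maps a key to a dict holding a 'modified' datetime
    (a hypothetical shape chosen for this sketch).
    """
    return {
        key: record
        for key, record in records.items()
        if not is_expired(record["modified"], ttl_seconds, now)
    }
```

A managed TTL feature performs this deletion inside the data platform, so the application doesn't need to run its own purge job.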

Velocity

The speed with which data is emitted from various application components, and the throughput

requirements for how fast data needs to be committed and retrieved are critical to determining an

optimal data technology for key workload scenarios.

The nature of throughput requirements will differ by workload scenario, such as those that are read-heavy or write-heavy.

What is the required throughput? And how is throughput expected to grow?

What are the data latency requirements at P50/P99 under reference load levels?

For example, analytical workloads will typically need to cater to a large read throughput.

Capabilities such as supporting a lock-free design, index tuning, and consistency policies are critical to achieving high throughput.

Optimizing configuration for high throughput incurs trade-offs, which should be fully understood.

Load-levelling persistence and messaging patterns, such as CQRS and Event Sourcing, can be used to

further optimize throughput.

Load levels will naturally fluctuate for many application scenarios, with natural peaks requiring a

sufficient degree of elasticity to handle variable demand while maintaining throughput and latency.

Agile scalability is key to effectively support variable throughput and load levels without

overprovisioning capacity levels.

Both read and write throughput must scale according to application requirements and load.

Both vertical and horizontal scale operations can be applied to respond to changing load levels.

The impact of throughput dips can vary based on workload scenario.

Will there be connectivity disruption?

Will individual operations return failure codes while the control plane continues to operate?

Will the data platform activate throttling, and if so for how long?

The fundamental application design recommendation to use an active-active geographical distribution

introduces challenges around data consistency.

There's a trade-off between consistency and performance with regard to full ACID transactional

semantics and traditional locking behavior.

Minimizing write latency will come at the cost of data consistency.

In a multi-region write configuration, changes will need to be synchronized and merged between all

replicas, with conflict resolution where required, and this may impact performance levels and scalability.

Read-only replicas (intra-region and inter-region) can be used to minimize roundtrip latency and distribute traffic to boost performance, throughput, availability, and scalability.

A caching layer can be used to increase read throughput to improve user experience and end-to-end

client response times.

Cache expiration times and policies need to be considered to optimize data recentness.

Variety

The data model, data types, data relationships, and intended query model will strongly affect data

platform decisions.

Does the application require a relational data model or can it support a variable-schema or non-relational data model?

How will the application query data? And will queries depend on database-layer concepts such as

relational joins? Or does the application provide such semantics?

The nature of datasets considered by the application can vary, from unstructured content such as images and videos to more structured files such as CSV and Parquet.

Composite application workloads will typically have distinct datasets and associated requirements.

In addition to relational or non-relational data platforms, graph or key-value data platforms may also be

suitable for certain data workloads.

Some technologies cater to variable-schema data models, where data items are semantically similar

and/or stored and queried together but differ structurally.

In a microservice architecture, individual application services can be built with distinct scenario-optimized

datastores rather than depending on a single monolithic datastore.

Design patterns such as SAGA can be applied to manage consistency and dependencies between

different datastores.

The use of multiple data technologies will add a degree of management overhead to maintain

encompassed technologies.

Inter-database direct queries can impose co-location constraints.

The feature-sets for each Azure service differ across languages, SDKs, and APIs, which can greatly impact

the level of configuration tuning that can be applied.

Capabilities for optimized alignment with the data model and encompassed data types will strongly

influence data platform decisions.

Query layers such as stored procedures and object-relational mappers.

Language-neutral query capability, such as a secured REST API layer.

Business continuity capabilities, such as backup and restore.

Analytical datastores typically support polyglot storage for various types of data structures.

Analytical runtime environments, such as Apache Spark, may have integration restrictions to analyze

polyglot data structures.

In an enterprise context, the use of existing processes and tooling, and the continuity of skills, can have a

significant bearing on the data platform design and selection of data technologies.

Veracity

Several factors must be considered to validate the accuracy of data within an application, and the management of these factors can have a significant bearing on the design of the data platform.

Data consistency.

Platform security features.

Data governance.

Change management and schema evolution.

Dependencies between datasets.

In any distributed application with multiple data replicas there's a trade-off between consistency and latency, as expressed in the CAP and PACELC theorems.

When readers and writers are distinctly distributed, an application must choose to return either the fastest-available version of a data item, even if it is out of date compared to a just-completed write (update) of that data item in another replica, or the most up-to-date version of the data item, which may incur additional latency to determine and obtain the latest state.

Consistency and availability can be configured at platform level or at individual data request level.

What is the user experience if data were to be served from a replica closest to the user which doesn't reflect the most recent state of a different replica? That is, can the application support possibly serving out-of-date data?

In a multi-region write context, when the same data item is changed in two separate write-replicas before either change can be replicated, a conflict is created which must be resolved.

Standardized conflict resolution policies, such as "Last Write Wins", or a custom strategy with custom logic can be applied.

The implementation of security requirements may adversely impact throughput or performance.

Encryption at-rest can be implemented in the application layer using client-side encryption and/or the data layer using server-side encryption if necessary.

Azure supports various encryption models, including server-side encryption that uses service-managed keys, customer-managed keys in Key Vault, or customer-managed keys on customer-controlled hardware.

With client-side encryption, keys can be managed in Key Vault or another secure location.

MACsec (IEEE 802.1AE MAC) data-link encryption is used to secure all traffic moving between Azure datacenters on the Microsoft backbone network.

Packets are encrypted and decrypted on the devices before being sent, preventing physical 'man-in-the-middle' or snooping/wiretapping attacks.

Authentication and authorization to the data plane and control plane.

How will the data platform authenticate and authorize application access and operational access?

Observability through monitoring platform health and data access.

How will alerting be applied for conditions outside acceptable operational boundaries?

Design Recommendations

Volume

Ensure future data volumes associated with organic growth aren't expected to exceed data platform

capabilities.

Forecast data growth rates aligned to business plans and use established rates to inform ongoing

capacity requirements.

Compare aggregate and per-data record volumes against data platform limits.

If there's a risk of limits being reached in exceptional circumstances, ensure operational mitigations

are in place to prevent downtime and data loss.

Monitor data volume and validate it against a capacity model, considering scale limits and expected data

growth rates.
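As a rough illustration of validating volume against a capacity model, the sketch below projects compound monthly growth and estimates when a platform limit would be reached. The growth rate, limit, and planning horizon are hypothetical inputs that would come from business forecasts and the documented limits of the chosen data platform.

```python
def projected_volume_gb(current_gb: float, monthly_growth_rate: float, months: int) -> float:
    """Project data volume after 'months' of compound growth."""
    return current_gb * (1 + monthly_growth_rate) ** months


def months_until_limit(current_gb: float, monthly_growth_rate: float,
                       limit_gb: float, horizon_months: int = 120):
    """Return the first month in which projected volume reaches limit_gb.

    Returns None if the limit is not reached within the planning horizon.
    """
    volume = current_gb
    for month in range(horizon_months + 1):
        if volume >= limit_gb:
            return month
        volume *= 1 + monthly_growth_rate
    return None
```

For example, 100 GB growing at 50% per month reaches a 300 GB limit in month 3 under this model; comparing such projections with monitored volume highlights when the forecast has drifted.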

Ensure scale operations align with storage, performance, and consistency requirements.

When a new scale unit is introduced, underlying data may need to be replicated, which takes time and will likely introduce a performance penalty while replication occurs. Ensure these operations are performed outside of critical business hours if possible.

Define application data tiers to classify datasets based on usage and criticality to facilitate the removal or

offloading of older data.

Consider classifying datasets into 'hot', 'warm', and 'cold' ('archive') tiers.

For example, the foundational reference implementations use Azure Cosmos DB to store 'hot'

data that is actively used by the application, while Azure Storage is used for 'cold' operations

data for analytical purposes.
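One simple way to express a tiering rule is by age since last access. The following sketch is an assumption-laden illustration: the 30-day and 365-day thresholds are hypothetical and should be replaced by values derived from actual usage and criticality requirements.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; tune them to the workload's access patterns.
HOT_WINDOW = timedelta(days=30)
WARM_WINDOW = timedelta(days=365)


def classify_tier(last_accessed: datetime, now: datetime) -> str:
    """Classify a dataset as 'hot', 'warm', or 'cold' by time since last access."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"
```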

Configure housekeeping procedures to optimize data growth and drive data efficiencies, such as query

performance, and managing data expansion.

Configure Time-To-Live (TTL) expiration for data that is no longer required and has no long-term analytical value.

Offload non-critical data to secondary cold storage, but maintain it for analytical value and to satisfy

audit requirements.

Collect data platform telemetry and usage statistics to enable DevOps teams to continually evaluate

housekeeping requirements and 'right-size' datastores.

Validate that old data can be safely tiered to secondary storage, or deleted outright, without an

adverse impact to the application.

In line with a microservice application design, consider the use of multiple different data technologies in parallel, with optimized data solutions for specific workload scenarios and volume requirements.

Avoid creating a single monolithic datastore where data volume from expansion can be hard to

manage.

Velocity

The data platform must inherently be designed and configured to support high throughput, with workloads separated into different contexts to maximize performance using scenario-optimized data solutions.

Ensure read and write throughput for each data scenario can scale according to expected load

patterns, with sufficient tolerance for unexpected variance.

Separate different data workloads, such as transactional and analytical operations, into distinct

performance contexts.

Load-level through the use of asynchronous non-blocking messaging, for example using the CQRS or

Event Sourcing patterns.

There might be latency between write requests and when new data becomes available to read, which

may have an impact on the user experience.

This impact must be understood and acceptable in the context of key business requirements.

Ensure agile scalability to support variable throughput and load levels.

If load levels are highly volatile, consider overprovisioning capacity levels to ensure throughput and performance are maintained.

Test and validate the impact to composite application workloads when throughput can't be

maintained.

Prioritize Azure-native data services with automated scale-operations to facilitate a swift response to

load-level volatility.

Configure autoscaling based on service-internal and application-set thresholds.

Scaling should initiate and complete in timeframes consistent with business requirements.

For scenarios where manual interaction is necessary, establish automated operational 'playbooks' that

can be triggered rather than conducting manual operational actions.

Consider whether automated triggers can be applied as part of subsequent engineering

investments.

Monitor application data read and write throughput against P50/P99 latency requirements and align to

an application capacity model.

Excess throughput should be gracefully handled by the data platform or application layer and captured

by the health model for operational representation.
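The P50/P99 latency check mentioned above can be sketched as follows. This is a minimal example using the nearest-rank percentile method over a list of observed latencies; the target values are hypothetical, and a production system would normally rely on the percentile aggregations of its monitoring platform.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_targets(samples_ms, p50_target_ms, p99_target_ms):
    """Check observed latencies against P50 and P99 targets (in milliseconds)."""
    return (percentile(samples_ms, 50) <= p50_target_ms
            and percentile(samples_ms, 99) <= p99_target_ms)
```

The boolean result could feed a health model signal, so that a breach of either target is represented operationally instead of being discovered through user reports.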

Implement caching for 'hot' data scenarios to minimize response times.

Apply appropriate policies for cache expiration and housekeeping to avoid runaway data growth.

Expire cache items when the backing data changes.

If cache expiration is strictly Time-To-Live (TTL) based, the impact and customer experience of

serving outdated data needs to be understood.
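The two expiration approaches above, TTL-based expiry and expiry when the backing data changes, can be combined. The sketch below is a minimal in-memory illustration and is not tied to any specific cache product; the `clock` parameter is a hypothetical injection point that makes the behavior testable.

```python
import time


class TtlCache:
    """In-memory cache with TTL expiry and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        """Return the cached value, or None if missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # housekeeping: drop the expired entry
            return None
        return value

    def put(self, key, value):
        """Store a value that expires ttl_seconds from now."""
        self._items[key] = (value, self._clock() + self._ttl)

    def invalidate(self, key):
        """Expire an entry immediately, for example when the backing data changes."""
        self._items.pop(key, None)
```

Calling `invalidate` from the write path bounds staleness to the window between the write and the invalidation, while the TTL acts as a safety net for changes the application doesn't observe.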

Variety

In alignment with the principle of a cloud- and Azure-native design, it's highly recommended to prioritize managed Azure services to reduce operational and management complexity and to take advantage of Microsoft's future platform investments.

In alignment with the application design principle of loosely coupled microservice architectures, allow

individual services to use distinct data stores and scenario-optimized data technologies.

Identify the types of data structure the application will handle for specific workload scenarios.

Avoid creating a dependency on a single monolithic datastore.

Consider the SAGA design pattern where dependencies between datastores exist.

Validate that required capabilities are available for selected data technologies.

Ensure support for required languages and SDK capabilities. Not every capability is available for every

language/SDK in the same fashion.

Veracity

Adopt a multi-region data platform design and distribute replicas across regions for maximum reliability,

availability, and performance by moving data closer to application endpoints.

Distribute data replicas across Availability Zones (AZs) within a region (or use zone-redundant service

tiers) to maximize intra-region availability.

Where consistency requirements allow for it, use a multi-region write data platform design to maximize

overall global availability and reliability.

Consider business requirements for conflict resolution when the same data item is changed in two separate write replicas before either change can be replicated, thus creating a conflict.

Use standardized conflict resolution policies such as "Last Write Wins" where possible.

If a custom strategy with custom logic is required, ensure CI/CD DevOps practices are

applied to manage custom logic.

Test and validate backup and restore capabilities, and failover operations through chaos testing within

continuous delivery processes.

Run performance benchmarks to ensure throughput and performance requirements aren't impacted by

the inclusion of required security capabilities, such as encryption.

Ensure continuous delivery processes consider load testing against known performance benchmarks.

When applying encryption, it's strongly recommended to use service-managed encryption keys as a way of reducing management complexity.

If there's a specific security requirement for customer-managed keys, ensure appropriate key management procedures are applied to ensure availability, backup, and rotation of all considered keys.

NOTE

When integrating with a broader organizational implementation, it's critical that an application-centric approach be applied for the provisioning and operation of data platform components in an application design.

More specifically, to maximize reliability it's critical that individual data platform components appropriately respond to application health through operational actions which may include other application components. For example, in a scenario where additional data platform resources are needed, scaling the data platform along with other application components according to a capacity model will likely be required, potentially through the provision of additional scale units. This approach will ultimately be constrained if there's a hard dependency on a centralized operations team to address issues related to the data platform in isolation.

Ultimately, the use of centralized data services (that is, Central IT DBaaS) introduces operational bottlenecks that significantly hinder agility through a largely uncontextualized management experience, and should be avoided in a mission-critical or business-critical context.

Additional references

Additional data-platform guidance is available within the Azure Application Architecture Guide.

Azure Data Store Decision Tree

Criteria for choosing a Data Store

Non-Relational Data Stores

Relational OLTP Data Stores

Globally distributed multi-region write datastore

To fully accommodate the globally distributed active-active aspirations of an application design, it's strongly recommended to consider a distributed multi-region write data platform, where changes to separate writeable replicas are synchronized and merged between all replicas, with conflict resolution where required.

IMPORTANT

The microservices may not all require a distributed multi-region write datastore, so consideration should be given to the architectural context and business requirements of each workload scenario.

Azure Cosmos DB provides a globally distributed and highly available NoSQL datastore, offering multi-region writes and tunable consistency out-of-the-box. The design considerations and recommendations within this section will therefore focus on optimal Azure Cosmos DB usage.

Azure Cosmos DB

Design considerations

Azure Cosmos DB stores data within Containers, which are indexed, row-based transactional stores

designed to allow fast transactional reads and writes with response times on the order of milliseconds.

Azure Cosmos DB supports multiple different APIs with differing feature sets, such as SQL, Cassandra,

and MongoDB.

The first-party Azure Cosmos DB for NoSQL provides the richest feature set and is typically the API

where new capabilities will become available first.

Azure Cosmos DB supports Gateway and Direct connectivity modes, where Direct facilitates connectivity

over TCP to backend Azure Cosmos DB replica nodes for improved performance with fewer network

hops, while Gateway provides HTTPS connectivity to frontend gateway nodes.

Direct mode is only available when using the Azure Cosmos DB for NoSQL and is currently only

supported on .NET and Java SDK platforms.

Within Availability Zone enabled regions, Azure Cosmos DB offers Availability Zone (AZ) redundancy

support for high availability and resiliency to zonal failures within a region.

Azure Cosmos DB maintains four replicas of data within a single region, and when Availability Zone (AZ)

redundancy is enabled, Azure Cosmos DB ensures data replicas are placed across multiple AZs to protect

against zonal failures.

The Paxos consensus protocol is applied to achieve quorum across replicas within a region.

An Azure Cosmos DB account can easily be configured to replicate data across multiple regions to

mitigate the risk of a single region becoming unavailable.

Replication can be configured with either single-region writes or multi-region writes.

With single region writes, a primary 'hub' region is used to serve all writes and if this 'hub'

region becomes unavailable, a failover operation must occur to promote another region as

writable.

With multi-region writes, applications can write to any configured deployment region, which

will replicate changes between all other regions. If a region is unavailable then the remaining

regions will be used to serve write traffic.

In a multi-region write configuration, update (insert, replace, delete) conflicts can occur where writers

concurrently update the same item in multiple regions.

Azure Cosmos DB provides two conflict resolution policies, which can be applied to automatically address

conflicts.

Last Write Wins (LWW) applies a time-synchronization clock protocol using a system-defined timestamp _ts property as the conflict resolution path. In the event of a conflict, the item with the highest value for the conflict resolution path becomes the winner, and if multiple items have the same numeric value then the system selects a winner so that all regions can converge to the same version of the committed item.

Custom resolution policies allow for application-defined semantics to reconcile conflicts using a

registered merge stored procedure that is automatically invoked when conflicts are detected.

With delete conflicts, the deleted version always wins over either insert or replace conflicts

regardless of conflict resolution path value.

Last Write Wins is the default conflict resolution policy.

When using Azure Cosmos DB for NoSQL, a custom numerical property such as a custom

timestamp definition can be used for conflict resolution.

The system provides an exactly-once guarantee for the execution of a merge procedure as part of the commitment protocol.

A custom conflict resolution policy is only available with Azure Cosmos DB for NoSQL and can

only be set at container creation time.
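The Last Write Wins rules described above can be illustrated with a small conceptual sketch. It isn't the Azure Cosmos DB implementation: the `Version` type and its `region` field are hypothetical, and the tie-break on region name is an assumption standing in for whatever deterministic choice the service makes so that all regions converge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One conflicting version of an item (hypothetical shape for this sketch)."""
    item_id: str
    ts: int                # value of the conflict resolution path, such as _ts
    region: str            # assumed field, used only for the tie-break below
    deleted: bool = False  # True if this version represents a delete


def resolve_lww(a: Version, b: Version) -> Version:
    """Pick the winning version under Last Write Wins semantics."""
    # A delete wins over an insert or replace regardless of timestamp.
    if a.deleted != b.deleted:
        return a if a.deleted else b
    # Otherwise the higher conflict resolution path value wins.
    if a.ts != b.ts:
        return a if a.ts > b.ts else b
    # Equal values: any deterministic rule lets every region converge.
    return a if a.region <= b.region else b
```

Because the function returns the same winner regardless of argument order, each region can resolve the conflict independently and still arrive at the same committed version.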

In a multi-region write configuration, there's a dependency on a single Azure Cosmos DB 'hub' region to

perform all conflict resolutions, with the Paxos consensus protocol applied to achieve quorum across

replicas within the hub region.

The platform provides a message buffer for write conflicts within the hub region to load level and

provide redundancy for transient faults.

The buffer is capable of storing a few minutes worth of write updates requiring consensus.

The strategic direction of the Azure Cosmos DB platform is to remove this single region dependency for

conflict resolution in a multi-region write configuration, utilizing a 2-phase Paxos approach to attain quorum

at a global level and within a region.

The primary 'hub' region is determined by the first region that Azure Cosmos DB is configured within.

A priority ordering is configured for additional satellite deployment regions for failover purposes.

The data model and partitioning across logical and physical partitions play an important role in achieving optimal performance and availability.

When deployed with a single write region, Azure Cosmos DB can be configured for automatic failover

based on a defined failover priority considering all read region replicas.

The RTO provided by the Azure Cosmos DB platform is ~10-15 minutes, capturing the elapsed time to perform a regional failover of the Azure Cosmos DB service if a catastrophic disaster impacts the hub region.

This RTO is also relevant in a multi-region write context given the dependency on a single 'hub' region

for conflict resolution.

If the 'hub' region becomes unavailable, writes made to other regions will fail after the message

buffer fills since conflict resolution won't be able to occur until the service fails over and a new

hub region is established.

The strategic direction of the Azure Cosmos DB platform is to reduce the RTO to ~5 minutes by allowing

partition level failovers.

Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) are configurable via consistency

levels, with a trade-off between data durability and throughput.

Azure Cosmos DB provides a minimum RTO of 0 for a relaxed consistency level with multi-region writes, or an RPO of 0 for strong consistency with a single write region.

Azure Cosmos DB offers a 99.999% SLA for both read and write availability for Database Accounts

configured with multiple Azure regions as writable.

The SLA is represented by the Monthly Uptime Percentage, which is calculated as 100% - Average

Error Rate.

The Average Error Rate is defined as the sum of Error Rates for each hour in the billing month divided

by the total number of hours in the billing month, where the Error Rate is the total number of Failed

Requests divided by Total Requests during a given one-hour interval.
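The Monthly Uptime Percentage calculation defined above can be written out directly. In this sketch the input is a list of (failed requests, total requests) pairs, one per hour of the billing month; treating an hour with no requests as having an error rate of zero is an assumption made for the example.

```python
def hourly_error_rate(failed_requests: int, total_requests: int) -> float:
    """Error Rate for a one-hour interval: Failed Requests / Total Requests."""
    if total_requests == 0:
        return 0.0  # assumption: an idle hour counts as error-free
    return failed_requests / total_requests


def monthly_uptime_percentage(hourly_counts) -> float:
    """Monthly Uptime Percentage = 100% - Average Error Rate.

    hourly_counts holds one (failed, total) pair per hour of the billing month.
    """
    rates = [hourly_error_rate(failed, total) for failed, total in hourly_counts]
    average_error_rate = sum(rates) / len(rates)
    return 100.0 * (1.0 - average_error_rate)
```

Because the average is taken over hourly rates and not over all requests, a quiet hour with a high failure ratio weighs as much as a busy hour.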

Azure Cosmos DB offers a 99.99% SLA for throughput, consistency, availability, and latency for Database

Accounts scoped to a single Azure region when configured with any of the five Consistency Levels.

A 99.99% SLA also applies to Database Accounts spanning multiple Azure regions configured with

any of the four relaxed Consistency Levels.

There are two types of throughput that can be provisioned in Azure Cosmos DB, standard and autoscale,

which are measured using Request Units per second (RU/s).

Standard throughput allocates resources required to guarantee a specified RU/s value.

Autoscale defines a maximum throughput value, and Azure Cosmos DB will automatically scale up or

down depending on application load, between the maximum throughput value and a minimum of

10% of the maximum throughput value.

Standard is billed hourly for provisioned throughput.

Autoscale is billed hourly for the maximum throughput consumed.
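To make the difference between the two billing models concrete, the sketch below computes the RU/s billed for a single hour under each model, based only on the rules stated above. The function names are illustrative, and actual charges depend on current Azure pricing.

```python
def standard_billed_rus(provisioned_rus: float) -> float:
    """Standard throughput is billed for the provisioned RU/s, regardless of usage."""
    return provisioned_rus


def autoscale_billed_rus(max_rus: float, peak_rus_in_hour: float) -> float:
    """Autoscale is billed for the highest RU/s used in the hour.

    The billed value is bounded below by 10% of the configured maximum
    and above by the maximum itself.
    """
    floor = max_rus / 10
    return min(max(peak_rus_in_hour, floor), max_rus)
```

For a container with a 4,000 RU/s autoscale maximum, an hour peaking at 100 RU/s is billed at 400 RU/s, while an hour peaking at 2,500 RU/s is billed at 2,500 RU/s.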

Static provisioned throughput with a variable workload may result in throttling errors, which will impact perceived application availability.

Autoscale protects against throttling errors by enabling Azure Cosmos DB to scale up as needed, while maintaining cost protection by scaling back down when load decreases.

When Azure Cosmos DB is replicated across multiple regions, the provisioned Request Units (RUs) are billed per region.

There's a significant cost delta between a multi-region-write and single-region-write configuration which in many cases may make a multi-master Azure Cosmos DB data platform cost prohibitive.

SINGLE REGION READ/WRITE    SINGLE REGION WRITE - DUAL REGION READ    DUAL REGION READ/WRITE
1 RU                        2 RU                                      4 RU

The delta between single-region-write and multi-region-write is actually less than the 1:2 ratio reflected in the table above. More specifically, there's a cross-region data transfer charge associated with write updates in a single-write configuration, which isn't captured within the RU costs as with the multi-region write configuration.
Consumed storage is billed as a flat rate for the total amount of storage (GB) consumed to host data and
indexes for a given hour.
Session is the default and most widely used consistency level since data is received in the same order as
writes.
Azure Cosmos DB supports authentication via either an Azure Active Directory identity or Azure Cosmos
DB keys and resource tokens, which provide overlapping capabilities.
It's possible to disable resource management operations using keys or resource tokens to limit keys and
resource tokens to data operations only, allowing for fine-grained resource access control using Azure
Active Directory Role-Based Access Control (RBAC).
Restricting control plane access via keys or resource tokens will disable control plane operations for
clients using Azure Cosmos DB SDKs and should therefore be thoroughly evaluated and tested.
The disableKeyBasedMetadataWriteAccess setting can be configured via ARM Template IaC definitions,
or via a Built-In Azure Policy.
Azure Cosmos DB Azure Active Directory RBAC support applies to account and resource control plane
management operations.
Application administrators can create role assignments for users, groups, service principals or
managed identities to grant or deny access to resources and operations on Azure Cosmos DB
resources.
There are several Built-in RBAC Roles available for role assignment, and custom RBAC roles can also
be used to form specific privilege combinations.
Cosmos DB Account Reader enables read-only access to the Azure Cosmos DB resource.
DocumentDB Account Contributor enables management of Azure Cosmos DB accounts
including keys and role assignments, but doesn't enable data-plane access.
Cosmos DB Operator is similar to DocumentDB Account Contributor, but doesn't provide
the ability to manage keys or role assignments.
Azure Cosmos DB resources (accounts, databases, and containers) can be protected against incorrect
modification or deletion using Resource Locks.

Resource Locks can be set at the account, database, or container level.

A Resource Lock set on a resource will be inherited by all child resources. For example, a Resource
Lock set on the Azure Cosmos DB account will be inherited by all databases and containers within the
account.

Resource Locks only apply to control plane operations and do not prevent data plane operations,
such as creating, changing, or deleting data.

If control plane access isn't restricted with disableKeyBasedMetadataWriteAccess, then clients will be
able to perform control plane operations using account keys.

The Azure Cosmos DB change feed provides a time-ordered feed of changes to data in an Azure Cosmos

DB container.

The change feed only includes insert and update operations to the source Azure Cosmos DB

container; it doesn't include deletes.

The change feed can be used to maintain a separate data store from the primary Container used by the

application, with ongoing updates to the target data store fed by the change feed from the source

Container.

The change feed can be used to populate a secondary store for additional data platform redundancy

or for subsequent analytical scenarios.

If delete operations routinely affect the data within the source Container, then the store fed by the change

feed will be inaccurate and unreflective of deleted data.

A Soft Delete pattern can be implemented so that deleted data records are included in the change feed.

Instead of explicitly deleting data records, data records are updated by setting a flag (e.g.
IsDeleted) to indicate that the item is considered deleted.

Any target data store fed by the change feed will need to detect and process items with a
deleted flag set to True; instead of storing soft-deleted data records, the existing version of the
data record in the target store will need to be deleted.

A short Time-To-Live (TTL) is typically used with the soft-delete pattern so that Azure Cosmos DB
automatically deletes expired data, but only after it's reflected within the change feed with the deleted
flag set to True.

This accomplishes the original delete intent whilst also propagating the delete through the change
feed.
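The soft-delete pattern above can be sketched as follows. This is a minimal in-memory illustration, not an SDK sample: the target store is a plain dictionary, `apply_change_feed` and `soft_delete` are hypothetical helper names, and the 60-second TTL is an arbitrary assumption. The `IsDeleted` flag follows the example in the text, and `ttl` is the per-item time-to-live property.

```python
from typing import Dict, Iterable


def soft_delete(item: dict, ttl_seconds: int = 60) -> dict:
    """Return the update that replaces an explicit delete.

    The item is flagged as deleted and given a short TTL (assumed value) so
    the source container expires it after the change feed has carried the flag.
    """
    return {**item, "IsDeleted": True, "ttl": ttl_seconds}


def apply_change_feed(changes: Iterable[dict], target: Dict[str, dict]) -> None:
    """Apply a batch of change feed items to a secondary store (a dict here)."""
    for item in changes:
        if item.get("IsDeleted"):
            # Soft-deleted: remove the existing copy rather than storing the tombstone.
            target.pop(item["id"], None)
        else:
            # Insert or update: keep the latest version of the item.
            target[item["id"]] = item


target_store: Dict[str, dict] = {}
order = {"id": "order-1", "total": 42}

apply_change_feed([order], target_store)               # insert reaches the target
apply_change_feed([soft_delete(order)], target_store)  # soft delete removes it
print(target_store)  # {}
```

In a real deployment the loop body would sit inside whatever process reads the change feed, for example an Azure Function, and the dictionary would be the secondary data store.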

Azure Cosmos DB can be configured as an analytical store, which applies a column format for optimized
analytical queries to address the complexity and latency challenges that occur with traditional ETL
pipelines.

Azure Cosmos DB automatically backs up data at regular intervals without affecting the performance or

availability, and without consuming RU/s.

Azure Cosmos DB can be configured according to two distinct backup modes.

Periodic is the default backup mode for all accounts, where backups are taken at a periodic interval

and the data is restored by creating a request with the support team.

The default periodic backup retention period is 8 hours and the default backup interval is
four hours, which means only the latest two backups are stored by default.

The backup interval and retention period are configurable within the account.

Two backup copies are included at no extra cost, but additional backups incur additional costs.

The maximum retention period extends to a month with a minimum backup interval of

one hour.

A role assignment to the Azure "Cosmos DB Account Reader Role" is required to

configure backup storage redundancy.

Continuous backup mode allows for a restore to any point of time within the last 30 days.
By default, periodic backups are stored within separate Geo-Redundant Storage (GRS) that isn't
directly accessible.
Performing a restore operation requires a Support Request since customers can't directly
perform a restore.
A restore operation creates a new Azure Cosmos DB account where data is recovered.
If throughput is provisioned at the database level, backup and restore will happen at the
database level.
Backup storage exists within the primary 'hub' region and is replicated to the paired
region through underlying storage replication.
The redundancy configuration of the underlying backup storage account is configurable
to Zone-Redundant Storage or Locally-Redundant Storage.
Before opening a support ticket, the backup retention period should be increased to at
least seven days within eight hours of the data loss event.
An existing Azure Cosmos DB account can't be used for Restore.
By default, a new Azure Cosmos DB account named 
<Azure_Cosmos_account_original_name>-restored<n>  will be used.
This name can be adjusted, such as by reusing the existing name if the original
account was deleted.
It's not possible to select a subset of containers to restore.
Restore operations can be performed to return to a specific point in time (PITR) with a one-
second granularity.
The available window for restore operations is up to 30 days.
Continuous backups are taken within every Azure region where the Azure Cosmos DB account
exists.
A self-service restore can be performed using the Azure portal or IaC artifacts such as ARM
templates.
There are several limitations with Continuous Backup.
A restore operation creates a new Azure Cosmos DB account for the point-in-time restore.
There's an additional storage cost for Continuous backups and restore operations.
It's also possible to restore to the resource instantiation state.
Continuous backups are stored within the same Azure region as each Azure Cosmos DB
replica, using Locally-Redundant Storage (LRS) or Zone Redundant Storage (ZRS) within
regions that support Availability Zones.
The continuous backup mode isn't currently available in a multi-region-write
configuration.
Only Azure Cosmos DB for NoSQL and Azure Cosmos DB for MongoDB can be
configured for Continuous backup at this time.
If a container has TTL configured, restored data that has exceeded its TTL may be
immediately deleted.
Existing Azure Cosmos DB accounts can be migrated from Periodic to Continuous, but not from
Continuous to Periodic; migration is one-way and not reversible.
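The relationship between the periodic backup interval and retention period can be worked through with a short calculation. This sketch assumes that the number of stored copies is simply the retention period divided by the interval, which matches the default quoted above (8 hours / 4 hours = 2 copies); the function names are hypothetical and no prices are modeled.

```python
import math

FREE_COPIES = 2  # per the text: two backup copies are included at no extra cost


def retained_copies(interval_hours: float, retention_hours: float) -> int:
    """Number of periodic backups kept, assuming retention / interval."""
    if interval_hours <= 0 or retention_hours < interval_hours:
        raise ValueError("retention must cover at least one backup interval")
    return math.floor(retention_hours / interval_hours)


def billable_copies(interval_hours: float, retention_hours: float) -> int:
    """Copies beyond the free allowance, which incur additional cost."""
    return max(0, retained_copies(interval_hours, retention_hours) - FREE_COPIES)


print(retained_copies(4, 8), billable_copies(4, 8))    # defaults: 2 copies, 0 billable
print(retained_copies(1, 24), billable_copies(1, 24))  # hourly for a day: 24 copies, 22 billable
```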
Each Azure Cosmos DB backup is composed of the data itself and configuration details for provisioned
throughput, indexing policies, deployment region(s), and container TTL settings.
Backups don't contain firewall settings, virtual network access control lists, private endpoint settings,
consistency settings (an account is restored with session consistency), stored procedures, triggers,
UDFs, or multi-region settings.
Customers are responsible for redeploying capabilities and configuration settings. These aren't
restored via Azure Cosmos DB backup.
Azure Synapse Link analytical store data is also not included in Azure Cosmos DB backups.
It's possible to implement a custom backup and restore capability for scenarios where Periodic and
Continuous approaches aren't a good fit.
A custom approach introduces significant costs and additional administrative overhead, which should
be understood and carefully assessed.
Common restore scenarios should be modeled, such as the corruption or deletion of an
account, database, container, or data item.
Housekeeping procedures should be implemented to prevent backup sprawl.
Azure Storage or an alternative data technology can be used, such as an alternative Azure Cosmos DB
container.
Azure Storage and Azure Cosmos DB provide native integrations with Azure services such as
Azure Functions and Azure Data Factory.
The Azure Cosmos DB documentation denotes two potential options for implementing custom backups.
Azure Cosmos DB change feed to write data to a separate storage facility.
An Azure function or equivalent application process uses the change feed processor to bind to
the change feed and process items into storage.
Both continuous and periodic (batched) custom backups can be implemented using the change feed.
The Azure Cosmos DB change feed doesn't yet reflect deletes, so a soft-delete pattern must be applied
using a boolean property and TTL.
This pattern won't be required when the change feed provides full-fidelity updates.
Azure Data Factory Connector for Azure Cosmos DB (Azure Cosmos DB for NoSQL or MongoDB API
connectors) to copy data.
Azure Data Factory (ADF) supports manual execution and Schedule, Tumbling window, and
Event-based triggers.
Event-based triggers provide support for both Storage and Event Grid.
ADF is primarily suitable for periodic custom backup implementations due to its batch-oriented
orchestration.
It's less suitable for continuous backup implementations with frequent events due to the
orchestration execution overhead.
ADF supports Azure Private Link for high network security scenarios.
Azure Cosmos DB is used within the design of many Azure services, so a significant regional outage for
Azure Cosmos DB will have a cascading effect across various Azure services within that region. The precise
impact to a particular service will heavily depend on how the underlying service design uses Azure Cosmos
DB.
Design Recommendations
Azure Cosmos DB
Use Azure Cosmos DB as the primary data platform where requirements allow.
For mission-critical workload scenarios, configure Azure Cosmos DB with a write replica inside each
deployment region to reduce latency and provide maximum redundancy.
Configure the application to prioritize the use of the local Azure Cosmos DB replica for writes and
reads to optimize application load, performance, and regional RU/s consumption.
The multi-region-write configuration comes at a significant cost and should be prioritized only for
workload scenarios requiring maximum reliability.
For less-critical workload scenarios, prioritize the use of a single-region-write configuration (when using
Availability Zones) with globally distributed read replicas, since this offers a high level of data platform
reliability (99.999% SLA for read operations, 99.995% SLA for write operations) at a more compelling price-point.

Configure the application to use the local Azure Cosmos DB read replica to optimize read

performance.

Select an optimal 'hub' deployment region where conflict resolution will occur in a multi-region-write

configuration, and all writes will be performed in a single-region-write configuration.

Consider distance relative to other deployment regions and associated latency in selecting a primary

region, and requisite capabilities such as Availability Zones support.

Configure Azure Cosmos DB with Availability Zone (AZ) redundancy in all deployment regions with AZ

support, to ensure resiliency to zonal failures within a region.

Use Azure Cosmos DB for NoSQL since it offers the most comprehensive feature set, particularly where

performance tuning is concerned.

Alternative APIs should primarily be considered for migration or compatibility scenarios.

When using alternative APIs, validate that required capabilities are available with the selected

language and SDK to ensure optimal configuration and performance.

Use the Direct connection mode to optimize network performance through direct TCP connectivity to

backend Azure Cosmos DB nodes, with a reduced number of network 'hops'.

The Azure Cosmos DB SLA is calculated by averaging failed requests, which may not directly align with a
99.999% reliability tier error budget. When designing for a 99.999% SLO, it's therefore vital to plan for
regional and multi-region Azure Cosmos DB write unavailability, ensuring a fallback storage technology is
in place in the event of a failure, such as a persisted message queue for subsequent replay.
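A fallback of this kind might look like the sketch below. It is a simplified illustration under several assumptions: the class name `WriteWithFallback` is hypothetical, the in-memory `deque` stands in for a durable, persisted message queue, and a `ConnectionError` stands in for whatever exception the real data client raises when writes are unavailable.

```python
from collections import deque
from typing import Callable, Deque


class WriteWithFallback:
    """Write to the primary store, queueing items for replay if it is unavailable."""

    def __init__(self, primary_write: Callable[[dict], None]) -> None:
        self._primary_write = primary_write
        # Stand-in for a persisted message queue; a real one must survive restarts.
        self.pending: Deque[dict] = deque()

    def write(self, item: dict) -> bool:
        """Return True if the item reached the primary store, False if queued."""
        if not self.pending:  # keep ordering: never overtake queued items
            try:
                self._primary_write(item)
                return True
            except ConnectionError:
                pass
        self.pending.append(item)
        return False

    def replay(self) -> int:
        """Drain queued items in order; stop at the first failure."""
        replayed = 0
        while self.pending:
            try:
                self._primary_write(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            replayed += 1
        return replayed
```

Replay assumes writes are idempotent (for example, upserts keyed by item id), since an item may be written again if a failure occurs between the write and its removal from the queue.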

Define a partitioning strategy across both logical and physical partitions to optimize data distribution

according to the data model.

Minimize cross-partition queries.

Iteratively test and validate the partitioning strategy to ensure optimal performance.

Select an optimal partition key.

The partition key can't be changed after it has been created within the collection.

The partition key should be a property value that doesn't change.

Select a partition key that has a high cardinality, with a wide range of possible values.

The partition key should spread RU consumption and data storage evenly across all logical partitions

to ensure even RU consumption and storage distribution across physical partitions.

Run read queries against the partitioned column to reduce RU consumption and latency.
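One way to sanity-check a candidate partition key against these criteria is to measure its cardinality and skew on a sample of representative items. The sketch below is an offline illustration only: `partition_key_report` is a hypothetical helper, and the sample documents and property names (`customerId`, `country`) are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def partition_key_report(items: Iterable[dict], key: str) -> Dict[str, float]:
    """Summarize how evenly a candidate partition key spreads a sample of items."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {
        "distinct_values": len(counts),                     # cardinality in the sample
        "largest_share": max(counts.values()) / total,      # 1.0 means a single hot partition
    }


# Invented sample: six orders, five of them from the same country.
sample = [
    {"customerId": f"c{i}", "country": "US" if i < 5 else "DE"} for i in range(6)
]

print(partition_key_report(sample, "country"))     # low cardinality, heavily skewed
print(partition_key_report(sample, "customerId"))  # high cardinality, evenly spread
```

A low `distinct_values` count or a `largest_share` close to 1.0 suggests the key would concentrate storage and RU consumption on few logical partitions.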

Indexing is also crucial for performance, so ensure index exclusions are used to reduce RU/s and storage

requirements.

Only index those fields that are needed for filtering within queries; design indexes for the most-used

predicates.
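As an illustration of index exclusions, the sketch below builds an opt-in indexing policy document that excludes the root path and includes only the properties used in query filters. The helper name `opt_in_indexing_policy` and the example property paths are assumptions; the resulting policy should be validated against the queries the workload actually runs before it's applied to a container.

```python
import json
from typing import Dict, List


def opt_in_indexing_policy(filter_paths: List[str]) -> Dict[str, object]:
    """Build an indexing policy that indexes only the listed property paths."""
    return {
        "indexingMode": "consistent",
        "automatic": True,
        # Index only the properties that appear in query predicates.
        "includedPaths": [{"path": f"{path}/?"} for path in filter_paths],
        # Exclude everything else to save RU/s on writes and index storage.
        "excludedPaths": [{"path": "/*"}],
    }


# Hypothetical properties used in the most common predicates.
policy = opt_in_indexing_policy(["/customerId", "/status"])
print(json.dumps(policy, indent=2))
```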

Leverage the built-in error handling, retry, and broader reliability capabilities of the Azure Cosmos DB

SDK.

Implement retry logic within the SDK on clients.

Use service-managed encryption keys to reduce management complexity.

If there's a specific security requirement for customer-managed keys, ensure appropriate key

management procedures are applied, such as backup and rotation.

Disable Azure Cosmos DB key-based metadata write access by applying the built-in Azure Policy.

 
Enable Azure Monitor to gather key metrics and diagnostic logs, such as provisioned throughput (RU/s).
Route Azure Monitor operational data into a Log Analytics workspace dedicated to Azure Cosmos DB
and other global resources within the application design.
Use Azure Monitor metrics to determine if application traffic patterns are suitable for autoscale.
Evaluate application traffic patterns to select an optimal option for provisioned throughput types.
Consider auto-scale provisioned throughput to automatically level-out workload demand.
Evaluate Microsoft performance tips for Azure Cosmos DB to optimize client-side and server-side
configuration for improved latency and throughput.
When using AKS as the compute platform: For query-intensive workloads, select an AKS node SKU that
has accelerated networking enabled to reduce latency and CPU jitter.
For single write region deployments, it's strongly recommended to configure Azure Cosmos DB for
automatic failover.
Apply load leveling through the use of asynchronous, non-blocking messaging within system flows that write
updates to Azure Cosmos DB.
Consider patterns such as Command and Query Responsibility Segregation and Event Sourcing.
Configure the Azure Cosmos DB account for continuous backups to obtain a fine granularity of recovery
points across the last 30 days.
Consider the use of Azure Cosmos DB backups in scenarios where contained data or the Azure
Cosmos DB account is deleted or corrupted.
Avoid the use of a custom backup approach unless absolutely necessary.
It's strongly recommended to practice recovery procedures on non-production resources and data, as
part of standard business continuity operation preparation.
Define IaC artifacts to re-establish configuration settings and capabilities following an Azure Cosmos DB backup
restore.
Evaluate and apply the Azure Security Baseline control guidance for Azure Cosmos DB Backup and
Recovery.
BR-1: Ensure regular automated backups
BR-3: Validate all backups including customer-managed keys
BR-4: Mitigate risk of lost keys
For analytical workloads requiring multi-region availability, use the Azure Cosmos DB Analytical Store,
which applies a column format for optimized analytical queries.
Relational data technologies
For scenarios with a highly relational data model or dependencies on existing relational technologies, the use of
Azure Cosmos DB in a multi-region write configuration might not be directly applicable. In such cases, it's vital
that the relational technologies used are designed and configured to uphold the multi-region active-active
aspirations of an application design.
Azure provides many managed relational data platforms, including Azure SQL Database and Azure Database for
common OSS relational solutions, including MySQL, PostgreSQL, and MariaDB. The design considerations and
recommendations within this section will therefore focus on the optimal usage of Azure SQL Database and
Azure Database OSS flavors to maximize reliability and global availability.
Design considerations
Whilst relational data technologies can be configured to easily scale read operations, writes are typically

constrained to go through a single primary instance, which places a significant constraint on scalability

and performance.

Sharding can be applied to distribute data and processing across multiple identical structured databases,

partitioning databases horizontally to navigate platform constraints.

For example, sharding is often applied in multi-tenant SaaS platforms to isolate groups of tenants into

distinct data platform constructs.
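Tenant-based sharding of this kind needs a deterministic way to route each tenant to a database. The sketch below uses a stable hash for that purpose; it is one possible approach under stated assumptions, not a prescribed design. The function `shard_for_tenant` and the shard database names are hypothetical.

```python
import hashlib
from typing import List

# Hypothetical shard databases, all sharing an identical schema.
SHARDS: List[str] = ["tenants-shard-0", "tenants-shard-1", "tenants-shard-2"]


def shard_for_tenant(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard index using a hash that is stable across processes."""
    if shard_count < 1:
        raise ValueError("at least one shard is required")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for_tenant(tenant_id: str) -> str:
    """Return the name of the shard database that holds the tenant's data."""
    return SHARDS[shard_for_tenant(tenant_id, len(SHARDS))]


print(database_for_tenant("contoso"))
```

Hash-modulo routing moves most tenants when the shard count changes, so designs that expect to add shards often keep an explicit tenant-to-shard lookup table instead.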

Azure SQL Database

Azure SQL Database provides a fully managed database engine that is always running on the latest stable

version of the SQL Server database engine and underlying Operating System.

Provides intelligent features such as performance tuning, threat monitoring, and vulnerability

assessments.

Azure SQL Database provides built-in regional high availability and turnkey geo-replication to distribute

read-replicas across Azure regions.

With geo-replication, secondary database replicas remain read-only until a failover is initiated.

Up to four secondaries are supported in the same or different regions.

Secondary replicas can also be used for read-only query access to optimize read performance.

Failover must be initiated manually but can be wrapped in automated operational procedures.

Azure SQL Database provides Auto Failover Groups, which replicate databases to a secondary server
and allow for transparent failover in the event of a failure.

Auto-failover groups support geo-replication of all databases in the group to only one secondary

server or instance in a different region.

Auto-failover groups aren't currently supported in the Hyperscale service tier.

Secondary databases can be used to offload read traffic.

Premium or Business Critical service tier database replicas can be distributed across Availability Zones at

no extra cost.

The control ring is also duplicated across multiple zones as three gateway rings (GW).

When using the Business Critical tier, zone redundant configuration is only available when the Gen5

compute hardware is selected.

The routing to a specific gateway ring is controlled by Azure Traffic Manager.

Azure SQL Database offers a baseline 99.99% availability SLA across all of its service tiers, but provides a

higher 99.995% SLA for the Business Critical or Premium tiers in regions that support availability zones.

Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant

Deployments have an availability SLA of 99.99%.

When configured with geo-replication, the Azure SQL Database Business Critical tier provides a Recovery

Time Objective (RTO) of 30 seconds for 100% of deployed hours.

When configured with geo-replication, the Azure SQL Database Business Critical tier has a Recovery Point
Objective (RPO) of 5 seconds for 100% of deployed hours.

Azure SQL Database Hyperscale tier, when configured with at least two replicas, has an availability SLA of

99.99%.

Compute costs associated with Azure SQL Database can be reduced using a Reservation Discount.

It's not possible to apply reserved capacity for DTU-based databases.

Point-in-time restore can be used to return a database and contained data to an earlier point in time.

Geo-restore can be used to recover a database from a geo-redundant backup.

Azure Database For PostgreSQL

Azure Database For PostgreSQL is offered in three different deployment options:

Single Server, SLA 99.99%

Flexible Server, which offers Availability Zone redundancy, SLA 99.99%

Hyperscale (Citus), SLA 99.95% when High Availability mode is enabled.

Hyperscale (Citus) provides dynamic scalability through sharding without application changes.

Distributing table rows across multiple PostgreSQL servers is key to ensuring scalable queries in
Hyperscale (Citus).

Multiple nodes can collectively hold more data than a traditional database, and in many cases can use
worker CPUs in parallel to optimize costs.

Autoscale can be configured through runbook automation to ensure elasticity in response to changing
traffic patterns.

Flexible server provides cost efficiencies for non-production workloads through the ability to stop/start
the server, and a burstable compute tier that is suitable for workloads that don't require continuous
compute capacity.

There's no additional charge for backup storage for up to 100% of total provisioned server storage.

Additional consumption of backup storage is charged according to consumed GB/month.

Compute costs associated with Azure Database for PostgreSQL can be reduced using either a Single
Server Reservation Discount or Hyperscale (Citus) Reservation Discount.

Design Recommendations

Consider sharding to partition relational databases based on different application and data contexts,
helping to navigate platform constraints, maximize scalability and availability, and improve fault isolation.

This recommendation is particularly relevant when the application design considers three or more
Azure regions since relational technology constraints can significantly hinder globally distributed data
platforms.

Sharding isn't appropriate for all application scenarios, so a contextualized evaluation is required.

Prioritize the use of Azure SQL Database where relational requirements exist due to its maturity on the

Azure platform and wide array of reliability capabilities.

Azure SQL Database

Use the Business-Critical service tier to maximize reliability and availability, including access to critical

resiliency capabilities.

Use the vCore based consumption model to facilitate the independent selection of compute and storage

resources, tailored to workload volume and throughput requirements.

Ensure a defined capacity model is applied to inform compute and storage resource requirements.

Consider Reserved Capacity to provide potential cost optimizations.

Configure the Zone-Redundant deployment model to spread Business Critical database replicas within

the same region across Availability Zones.

Use Active Geo-Replication to deploy readable replicas within all deployment regions (up to four).

Use Auto Failover Groups to provide transparent failover to a secondary region, with geo-replication

applied to provide replication to additional deployment regions for read optimization and database

redundancy.

IMPORTANT

For application scenarios limited to only two deployment regions, the use of Auto Failover Groups

should be prioritized.

Consider automated operational triggers, based on alerting aligned to the application health model, to
conduct failovers to geo-replicated instances in the event of a failure impacting the primary and secondary
within the Auto Failover Group.

For applications considering more than four deployment regions, serious consideration should be given to application
scoped sharding or refactoring the application to support multi-region write technologies, such as Azure Cosmos DB.
However, if this isn't feasible within the application workload scenario, it's advised to elevate a region within a single
geography to a primary status encompassing a geo-replicated instance to more evenly distribute read access.

Configure the application to query replica instances for read queries to optimize read performance.

Use Azure Monitor and Azure SQL Analytics for near real-time operational insights in Azure SQL DB for

the detection of reliability incidents.

Use Azure Monitor to evaluate usage for all databases to determine if they have been sized appropriately.

Ensure CD pipelines consider load testing under representative load levels to validate appropriate data

platform behavior.

Calculate a health metric for database components to observe health relative to business requirements

and resource utilization, using monitoring and alerts to drive automated operational action where

appropriate.

Ensure key query performance metrics are incorporated so swift action can be taken when service

degradation occurs.
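A health metric of this kind can be derived by comparing each observed metric against degraded and unhealthy thresholds and reporting the worst state. The sketch below is illustrative only: the metric names (`cpu_percent`, `query_p95_ms`) and threshold values are assumptions and should be replaced with values derived from the workload's business requirements.

```python
from dataclasses import dataclass
from typing import Dict

STATES = ["healthy", "degraded", "unhealthy"]  # ordered from best to worst


@dataclass(frozen=True)
class Thresholds:
    degraded: float   # at or above this value the metric is degraded
    unhealthy: float  # at or above this value the metric is unhealthy


def classify(value: float, limits: Thresholds) -> str:
    """Classify a single metric value against its thresholds."""
    if value >= limits.unhealthy:
        return "unhealthy"
    if value >= limits.degraded:
        return "degraded"
    return "healthy"


def database_health(metrics: Dict[str, float], limits: Dict[str, Thresholds]) -> str:
    """Overall state is the worst state of any monitored metric."""
    states = [classify(metrics[name], limits[name]) for name in limits]
    return max(states, key=STATES.index)


# Assumed thresholds for illustration only.
limits = {
    "cpu_percent": Thresholds(degraded=70, unhealthy=90),
    "query_p95_ms": Thresholds(degraded=200, unhealthy=1000),
}
print(database_health({"cpu_percent": 45, "query_p95_ms": 350}, limits))  # degraded
```

The resulting state can feed alerts or automated operational actions, such as scaling or failover, where appropriate.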

Optimize queries, tables, and databases using Query Performance Insights and common performance

recommendations provided by Microsoft.

Implement retry logic using the SDK to mitigate transient errors impacting Azure SQL Database

connectivity.
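Where the client library doesn't already provide it, retry logic for transient errors typically follows an exponential backoff pattern with jitter. The sketch below shows the general technique rather than any specific SDK's behavior: `TransientError` is a placeholder for the transient exception types a real database client raises, and the attempt count and delays are assumed values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Placeholder for a driver's transient failure (timeouts, throttling, failover)."""


def with_retries(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an operation, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the error to the caller
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients don't retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

Only errors known to be transient should be retried; permanent failures such as authentication or constraint errors should fail immediately.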

Prioritize the use of service-managed keys when applying server-side Transparent Data Encryption (TDE)

for at-rest encryption.

If customer-managed keys or client-side (AlwaysEncrypted) encryption is required, ensure keys are

appropriately resilient with backups and automated rotation facilities.

Consider the use of point-in-time restore as an operational playbook to recover from severe

configuration errors.

Azure Database For PostgreSQL

Flexible Server is recommended for business critical workloads due to its Availability Zone
support.

When using Hyperscale (Citus) for business critical workloads, enable High Availability mode to receive

the 99.95% SLA guarantee.

Use the Hyperscale (Citus) server configuration to maximize availability across multiple nodes.

Define a capacity model for the application to inform compute and storage resource requirements within

the data platform.

Consider the Hyperscale (Citus) Reservation Discount to provide potential cost optimizations.

Caching for Hot Tier Data

An in-memory caching layer can be applied to enhance a data platform by significantly increasing read
throughput and improving end-to-end client response times for hot tier data scenarios.

Azure provides several services with applicable capabilities for caching key data structures, with Azure Cache for
Redis positioned to abstract and optimize data platform read access. This section will therefore focus on the
optimal usage of Azure Cache for Redis in scenarios where additional read performance and data access
durability is required.

Design Considerations

A caching layer provides additional data access durability since, even if an outage impacts the
underlying data technologies, an application data snapshot can still be accessed through the caching
layer.

In certain workload scenarios, in-memory caching can be implemented within the application platform
itself.

Azure Cache for Redis

Redis cache is an open source NoSQL key-value in-memory storage system.

The Enterprise and Enterprise Flash tiers can be deployed in an active-active configuration across
Availability Zones within a region and different Azure regions through geo-replication.

When deployed across at least three Azure regions and three or more Availability Zones in each
region, with active geo-replication enabled for all Cache instances, Azure Cache for Redis provides an
SLA of 99.999% for connectivity to one regional cache endpoint.

When deployed across three Availability Zones within a single Azure region, a 99.99% connectivity
SLA is provided.

The Enterprise Flash tier runs on a combination of RAM and flash non-volatile memory storage, and
while this introduces a small performance penalty it also enables very large cache sizes, up to 13TB with
clustering.

With geo-replication, charges for data transfer between regions will also be applicable in addition to the
direct costs associated with cache instances.

The Scheduled Updates feature doesn't include Azure updates or updates applied to the underlying VM
operating system.

There will be an increase in CPU utilization during a scale-out operation while data is migrated to new
instances.

Design Recommendations

Consider an optimized caching layer for 'hot' data scenarios to increase read throughput and improve
response times.

Apply appropriate policies for cache expiration and housekeeping to avoid runaway data growth.

Consider expiring cache items when the backing data changes.
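The expiration guidance above is commonly implemented with a cache-aside approach: read through the cache, expire entries after a time-to-live, and invalidate an entry when the backing data changes. The sketch below uses an in-process dictionary as a stand-in for a cache service; the class name `CacheAside`, the loader callback, and the TTL value are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Minimal cache-aside helper with per-entry expiry and explicit invalidation."""

    def __init__(
        self,
        load: Callable[[str], object],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load          # reads the value from the backing data store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[object, float]] = {}

    def get(self, key: str) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit that hasn't expired yet
        value = self._load(key)    # miss or expired: go to the backing store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: str) -> None:
        """Call after the backing data changes so stale values aren't served."""
        self._entries.pop(key, None)


cache = CacheAside(load=lambda key: f"value-for-{key}", ttl_seconds=30)
print(cache.get("product-1"))  # loaded from the backing store, then cached
cache.invalidate("product-1")  # e.g. after the product is updated
```

With a shared cache service, the same flow applies, with the dictionary replaced by cache client calls and the expiry set on each cached key.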

Azure Cache for Redis

Use the Premium or Enterprise SKU to maximize reliability and performance.

For scenarios with extremely large data volumes, the Enterprise Flash tier should be considered.

For scenarios where only passive geo-replication is required, the Premium tier can also be considered.

Deploy replica instances using geo-replication in an active configuration across all considered
deployment regions.

Ensure replica instances are deployed across Availability Zones within each considered Azure region.

Use Azure Monitor to evaluate Azure Cache for Redis.

Calculate a health score for regional cache components to observe health relative to business
requirements and resource utilization.

Observe and alert on key metrics such as high CPU, high memory usage, high server load, and
evicted keys for insights into when to scale the cache.

Optimize connection resilience by implementing retry logic, timeouts, and using a singleton
implementation of the Redis connection multiplexer.

Configure scheduled updates to prescribe the days and times that Redis Server updates are applied to the
cache.

Analytical Scenarios

It's increasingly common for mission-critical applications to consider analytical scenarios as a means to drive
additional value from encompassed data flows. Application and operational (AIOps) analytical scenarios
therefore form a crucial aspect of a highly reliable data platform.

Analytical and transactional workloads require different data platform capabilities and optimizations for
acceptable performance within their respective contexts.

Description | Analytical | Transactional
Use Case | Analyze very large volumes of data ("big data") | Process very large volumes of individual transactions
Optimized for | Read queries and aggregations over many records | Near real-time Create/Read/Update/Delete (CRUD) queries over few records
Key Characteristics | Consolidate from data sources of record; Column-based storage; Distributed storage; Parallel processing; Denormalized; Low concurrency reads and writes; Optimize for storage volume with compression | Data source of record for application; Row-based storage; Contiguous storage; Symmetrical processing; Normalized; High concurrency reads and writes, index updates; Optimize for fast data access with in-memory storage

Azure Synapse provides an enterprise analytical platform that brings together relational and non-relational data
with Spark technologies, using built-in integration with Azure services such as Azure Cosmos DB to facilitate big
data analytics. The design considerations and recommendations within this section will therefore focus on
optimal Azure Synapse and Azure Cosmos DB usage for analytical scenarios.

Design Considerations

Traditionally, large-scale analytical scenarios are facilitated by extracting data into a separate data platform
optimized for subsequent analytical queries.

The Extract, Transform, and Load (ETL) pipelines used to extract data will consume throughput and
impact transactional workload performance.

Running ETL pipelines infrequently to reduce throughput and performance impacts will result in

analytical data that is less up-to-date.

ETL pipeline development and maintenance overhead increases as data transformations become more

complex.

For example, if source data is frequently changed or deleted, ETL pipelines must account for

those changes in the target data for analytical queries through an additive/versioned approach,

dump and reload, or in-place changes on the analytical data. Each of these approaches will have

derivative impact, such as index re-creation or update.

Azure Cosmos DB

Analytical queries run on Azure Cosmos DB transactional data will typically aggregate across partitions

over large volumes of data, consuming significant Request Unit (RU) throughput, which can impact the

performance of surrounding transactional workloads.

The Azure Cosmos DB Analytical Store provides a schematized, fully isolated column-oriented data store

that enables large-scale analytics on Azure Cosmos DB data from Azure Synapse without impact to Azure

Cosmos DB transactional workloads.

When an Azure Cosmos DB Container is enabled as an Analytical Store, a new column store is

internally created from the operational data in the Container. This column store is persisted separately

from the row-oriented transaction store for the container.

Create, Update and Delete operations on the operational data are automatically synced to the

analytical store, so no change feed or ETL processing is required.

Data sync from the operational to the analytical store doesn't consume throughput Request Units

(RUs) provisioned on the Container or Database. There's no performance impact on transactional

workloads. Analytical Store doesn't require allocation of additional RUs on an Azure Cosmos DB

Database or Container.

Auto-Sync is the process where operational data changes are automatically synced to the Analytical

Store. Auto-Sync latency is usually less than two (2) minutes.

Auto-Sync latency can be up to five (5) minutes for a Database with shared throughput and a

large number of Containers.

As soon as Auto-Sync completes, the latest data can be queried from Azure Synapse.

Analytical Store storage uses a consumption-based pricing model that charges for volume of data and

number of read and write operations. Analytical store pricing is separate from transactional store

pricing.

Using Azure Synapse Link, the Azure Cosmos DB Analytical Store can be queried directly from Azure

Synapse. This enables no-ETL, Hybrid Transactional-Analytical Processing (HTAP) from Synapse, so that

Azure Cosmos DB data can be queried together with other analytical workloads from Synapse in near

real-time.

The Azure Cosmos DB Analytical Store isn't partitioned by default.

For certain query scenarios, performance will improve by partitioning Analytical Store data using keys

that are frequently used in query predicates.

Partitioning is triggered by a job in Azure Synapse that runs a Spark notebook using Synapse Link,

which loads the data from the Azure Cosmos DB Analytical Store and writes it into the Synapse

partitioned store in the primary storage account of the Synapse workspace.

Azure Synapse Analytics SQL Serverless pools can query the Analytical Store through automatically

updated views or via SELECT / OPENROWSET commands.

Azure Synapse Analytics Spark pools can query the Analytical Store through automatically updated Spark

tables or the spark.read command.

Data can also be copied from the Azure Cosmos DB Analytical Store into a dedicated Synapse SQL pool

using Spark, so that provisioned Azure Synapse SQL pool resources can be used.
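
As an illustration of the `spark.read` path mentioned above, the following sketch loads an Analytical Store container into a Spark dataframe from a Synapse Spark notebook. It assumes the `cosmos.olap` format and the `spark.synapse.linkedService` and `spark.cosmos.container` options of the Synapse Link Spark connector; the helper names and the linked service and container names are hypothetical. The Spark session is passed in as a parameter so the sketch carries no import-time dependency on PySpark.

```python
from typing import Any, Dict


def analytical_store_options(linked_service: str, container: str) -> Dict[str, str]:
    """Build the reader options that identify one Analytical Store container."""
    return {
        # Name of the Synapse linked service that points at the Cosmos DB account.
        "spark.synapse.linkedService": linked_service,
        # Name of the container that has Analytical Store enabled.
        "spark.cosmos.container": container,
    }


def load_analytical_store(spark: Any, linked_service: str, container: str) -> Any:
    """Return a dataframe over the column store, leaving the transactional store untouched."""
    reader = spark.read.format("cosmos.olap")
    for key, value in analytical_store_options(linked_service, container).items():
        reader = reader.option(key, value)
    return reader.load()
```

In a Synapse notebook, where a `spark` session is predefined, a call such as `load_analytical_store(spark, "CosmosDbLinkedService", "orders")` would return a dataframe that can be joined with other data sets; both names are placeholders.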

  
Azure Cosmos DB Analytical Store data can be queried with Azure Synapse Spark.
Spark notebooks allow for Spark dataframe combinations to aggregate and transform Azure Cosmos
DB analytical data with other data sets, and use other advanced Synapse Spark functionality including
writing transformed data to other stores or training AIOps Machine Learning models.
The Azure Cosmos DB change feed can also be used to maintain a separate secondary data store for
analytical scenarios.
Azure Synapse
Azure Synapse brings together analytics capabilities including SQL data warehousing, Spark big data, and
Data Explorer for log and time series analytics.
Azure Synapse uses linked services to define connections to other services, such as Azure Storage.
Data can be ingested into Synapse Analytics via Copy activity from supported sources. This permits
data analytics in Synapse without impacting the source data store, but adds time, cost and latency
overhead due to data transfer.
Data can also be queried in-place in supported external stores, avoiding the overhead of data
ingestion and movement. Azure Storage with Data Lake Gen2 is a supported store for Synapse and
Log Analytics exported data can be queried via Synapse Spark.
Azure Synapse Studio unites ingestion and querying tasks.
Source data, including Azure Cosmos DB Analytical Store data and Log Analytics Export data, are
queried and processed in order to support business intelligence and other aggregated analytical use
cases.
Design Recommendations
Ensure analytical workloads don't impact transactional application workloads to maintain transactional
performance.
Application Analytics
Use Azure Synapse Link with Azure Cosmos DB Analytical Store to perform analytics on Azure Cosmos
DB operational data by creating an optimized data store, which won't impact transactional performance.
Enable Azure Synapse Link on Azure Cosmos DB accounts.
Create a container enabled for Analytical Store, or enable an existing Container for Analytical Store.
Connect the Azure Synapse workspace to the Azure Cosmos DB Analytical Store to enable analytical
workloads in Azure Synapse to query Azure Cosmos DB data. Use a connection string with a read-only
Azure Cosmos DB key.
Prioritize Azure Cosmos DB Analytical Store with Azure Synapse Link instead of using the Azure Cosmos
DB change feed to maintain an analytical data store.
The Azure Cosmos DB change feed may be suitable for very simple analytical scenarios.
AIOps and Operational Analytics
Create a single Azure Synapse workspace with linked services and data sets for each source Azure
Storage account to which operational data from resources is sent.
Create a dedicated Azure Storage account and use it as the workspace primary storage account to store
the Synapse workspace catalog data and metadata. Configure it with hierarchical namespace to enable

Azure Data Lake Gen2.

Maintain separation between the source analytical data and Synapse workspace data and metadata.

Do not use one of the regional or global Azure Storage accounts where operational data is sent.

Next step

Review the networking considerations.

Network and connectivity

Networking and connectivity for mission-critical workloads on Azure

12/16/2022 • 38 minutes to read

Networking is a fundamental area for a mission-critical application, given the recommended globally distributed active-active design approach.

This design area explores various network topology topics at an application level, considering requisite connectivity and redundant traffic management. More specifically, it highlights critical considerations and recommendations intended to inform the design of a secure and scalable global network topology for a mission-critical application.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The code assets illustrate considerations and recommendations for networking and connectivity for a globally distributed active-active design approach.

Global traffic routing

The use of multiple active regional deployment stamps requires a global routing service to distribute traffic to each active stamp.

Azure Front Door, Azure Traffic Manager, and Azure Standard Load Balancer provide the needed routing capabilities to manage global traffic across a multi-region application.

Alternatively, third-party global routing technologies can be considered. These options can almost seamlessly be swapped in to replace or extend the use of Azure-native global routing services. Popular choices include routing technologies by CDN providers.

This section explores the key differences between Azure routing services to define how each can be used to optimize different scenarios.

Design considerations

A routing service bound to a single region represents a single-point-of-failure and a significant risk with

regard to regional outages.

If the application workload encompasses client control, such as with mobile or desktop client applications,

it's possible to provide service redundancy within client routing logic.

Multiple global routing technologies, such as Azure Front Door and Azure Traffic Manager, can be

considered in parallel for redundancy, with clients configured to fail over to an alternative technology

when certain failure conditions are met. The introduction of multiple global routing services

introduces significant complexities around edge caching and Web Application Firewall capabilities,

and certificate management for SSL offload and application validation for ingress paths.

Third-party technologies can also be considered, providing global routing resiliency to all levels of

Azure platform failures.

Capability disparity between Azure Front Door and Traffic Manager means that if the two technologies

are positioned alongside one another for redundancy, a different ingress path or design changes would

be required to ensure a consistent and acceptable level of service is maintained.

Azure Front Door and Azure Traffic Manager are globally distributed services with built-in multi-region

redundancy and availability.

Hypothetical failure scenarios of a scale large enough to threaten the global availability of these

resilient routing services present a broader risk to the application in terms of cascading and

correlated failures.

Failure scenarios of this scale are only feasibly caused by shared foundational services, such as

Azure DNS or Azure AD, which serve as global platform dependencies for almost all Azure

services. If a redundant Azure technology is applied, it's likely that the secondary service will

also be experiencing unavailability or a degraded service.

Global routing service failure scenarios are highly likely to significantly impact many other

services used for key application components through interservice dependencies. Even if a

third-party technology is used, the application will likely be in an unhealthy state due to the

broader impact of the underlying issue, meaning that routing to application endpoints on Azure

will provide little value anyway.

Global routing service redundancy provides mitigation for an extremely small number of hypothetical

failure scenarios, where the impact of a global outage is constrained to the routing service itself.

To provide broader redundancy to global outage scenarios, a multi-cloud active-active deployment

approach can be considered. A multi-cloud active-active deployment approach introduces significant

operational complexities, which pose significant resiliency risks, likely far outweighing the hypothetical

risks of a global outage.

For scenarios where client control isn't possible, a dependency must be taken on a single global routing

service to provide a unified entry point for all active deployment regions.

When used in isolation they represent a single-point-of-failure at a service level due to global

dependencies, even though built-in multi-region redundancy and availability are provided.

The SLA provided by the selected global routing service represents the maximum attainable

composite SLA, regardless of how many deployment regions are considered.

When client control isn't possible, operational mitigations can be considered to define a process for

migrating to a secondary global routing service if a global outage disables the primary service. Migrating

from one global routing service to another is typically a lengthy process lasting several hours,

particularly where DNS propagation is considered.

Some third-party global routing services provide a 100% SLA. However, the historic and attainable SLA

provided by these services is typically lower than 100%.

While these services provide financial reparations for unavailability, this is of little significance when

the impact of unavailability is significant, such as with safety-critical scenarios where human life is

ultimately at stake. Technology redundancy or sufficient operational mitigations should therefore still

be considered even when the advertised legal SLA is 100%.
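
The point above that the global routing service caps the attainable composite SLA can be checked with a short calculation. This is a sketch under simplifying assumptions: regional stamps are treated as failing independently, the application is considered available while at least one stamp is up, and all figures are illustrative rather than published SLAs. `composite_availability` is a hypothetical helper.

```python
import math
from typing import Sequence


def composite_availability(routing_sla: float, regional_availabilities: Sequence[float]) -> float:
    """Availability of a single routing service in front of redundant regional stamps."""
    # Probability that every regional stamp is down at once (independence assumed).
    all_regions_down = math.prod(1.0 - a for a in regional_availabilities)
    # The routing service sits in series with the stamps, so its SLA multiplies in.
    return routing_sla * (1.0 - all_regions_down)


# Adding regions pushes the second factor toward 1, but never above it,
# so the result can approach the routing SLA without exceeding it.
two_regions = composite_availability(0.9999, [0.999, 0.999])
four_regions = composite_availability(0.9999, [0.999] * 4)
```

Under these assumptions, going from two regions to four changes the result only marginally, which is why redundancy of the routing layer itself has to be addressed separately from regional redundancy.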

Azure Front Door

Azure Front Door provides global HTTP/S load balancing and optimized connectivity using the Anycast

protocol with split TCP to take advantage of the Microsoft global backbone network.

A number of connections are maintained for each of the backend endpoints.

Incoming client requests are first terminated at the edge node closest to the originating client.

After any required traffic inspection, requests are either forwarded over the Microsoft backbone to the

appropriate backend using existing connections, or served from the internal cache of an edge node.

This approach is very efficient in spreading high traffic volumes over the backend connections.

Provides a built-in cache that serves static content from edge nodes. In many use cases, this can also

eliminate the need for a dedicated Content Delivery Network (CDN).

Azure Web Application Firewall (WAF) can be used on Azure Front Door, and since it's deployed to Azure

network edge locations around the globe, every incoming request delivered by Front Door is inspected at

the network edge.

Azure Front Door protects application endpoints against DDoS attacks using Azure DDoS protection

Basic. Azure DDoS Standard provides additional and more advanced protection and detection capabilities

and can be added as an additional layer to Azure Front Door.

Azure Front Door offers a fully managed certificate service, which enables TLS connection security for endpoints

without having to manage the certificate lifecycle.

Azure Front Door Premium supports private endpoints, enabling traffic to flow from the internet directly

onto Azure virtual networks. This eliminates the need to use public IPs on the VNet for making

the backends accessible via Azure Front Door Premium.

Azure Front Door relies on health probes and backend health endpoints (URLs) that are called on an

interval basis to return an HTTP status code reflecting if the backend is operating normally, with an HTTP

200 (OK) response reflecting a healthy status. As soon as a backend reflects an unhealthy status, from the

perspective of a certain edge node, that edge node will stop sending requests there. Unhealthy backends

are therefore transparently removed from traffic circulation without any delay.

Supports HTTP/S protocols only.

The Azure Front Door WAF and Application Gateway WAF provide a slightly different feature set, though

both support built-in and custom rules and can be set to operate in either detection mode or prevention

mode.

The Front Door backend IP space may change, but Microsoft will ensure integration with Azure IP Ranges

and Service Tags. It's possible to subscribe to Azure IP Ranges and Service Tags to receive notifications

about any changes or updates.

Azure Front Door supports various load distribution configurations:

Latency-based: the default setting that routes traffic to the "closest" backend from the client; based on

request latency.

Priority-based: useful for active-passive setups, where traffic must always be sent to a primary

backend unless it's not available.

Weighted: applicable for canary deployments in which a certain percentage of traffic is sent to a

specific backend. If multiple backends have the same weights assigned, latency-based routing is used.

By default Azure Front Door uses latency-based routing that can lead to situations where some backends

get a lot more incoming traffic than others, depending on where clients originate from.

If a series of client requests must be handled by the same backend, Session Affinity can be configured on

the frontend. It uses a client-side cookie to send subsequent requests to the same backend as the first

request, provided the backend is still available.
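
The interaction between health, priority, latency, and weight described above can be modelled in a few lines. This is a simplified sketch of the decision order (healthy backends, then highest priority, then a latency band, then weights), not the actual Azure Front Door implementation; the `Backend` type, the function names, and the `latency_sensitivity_ms` parameter are assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str
    healthy: bool
    priority: int      # lower value means more preferred
    latency_ms: float  # latency as measured from the edge node
    weight: int        # assumed to be positive


def candidate_backends(backends: Sequence[Backend], latency_sensitivity_ms: float) -> List[Backend]:
    """Filter by health, then priority, then keep backends close to the fastest one."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        return []
    top_priority = min(b.priority for b in healthy)
    preferred = [b for b in healthy if b.priority == top_priority]
    fastest = min(b.latency_ms for b in preferred)
    return [b for b in preferred if b.latency_ms <= fastest + latency_sensitivity_ms]


def select_backend(
    backends: Sequence[Backend],
    latency_sensitivity_ms: float,
    rng: Optional[random.Random] = None,
) -> Optional[Backend]:
    """Pick one candidate at random, in proportion to its weight."""
    candidates = candidate_backends(backends, latency_sensitivity_ms)
    if not candidates:
        return None
    rng = rng or random.Random()
    return rng.choices(candidates, weights=[b.weight for b in candidates], k=1)[0]
```

In this model, a latency band of zero sends all traffic from an edge node to its single fastest backend, while a wider band lets the weights spread traffic across several regions.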

Azure Traffic Manager

Azure Traffic Manager is a DNS redirection service.

The actual request payload isn't processed, but instead Traffic Manager returns the DNS name of one of the backends in the pool, based on configured rules for the selected traffic routing method.

The backend DNS name is then resolved to its final IP address that is subsequently directly called by the client.

The DNS response is cached and reused by the client for a specified Time-To-Live (TTL) period, and requests made during this period will go directly to the backend endpoint without Traffic Manager interaction. This eliminates the extra connectivity step, which provides cost benefits compared to Front Door.

Since the request is made directly from the client to the backend service, any protocol supported by the backend can be leveraged.

Similar to Azure Front Door, Azure Traffic Manager also relies on health probes to understand if a backend is healthy and operating normally. If an unexpected value is returned or nothing is returned, the routing service recognizes ongoing issues and will stop routing requests to that specific backend.

However, unlike with Azure Front Door, this removal of unhealthy backends isn't instantaneous since clients will continue to create connections to the unhealthy backend until the DNS TTL expires and a new backend endpoint is requested from the Traffic Manager service.

In addition, even when the TTL expires, there's no guarantee that public DNS servers will honor this value, so DNS propagation can actually take much longer to occur. This means that traffic may continue to be sent to the unhealthy endpoint for a sustained period of time.

Azure Standard Load Balancer

IMPORTANT

Cross-Region Standard Load Balancer is available in preview with technical limitations. This option isn't recommended for mission-critical workloads.

Design recommendations

Use Azure Front Door as the primary global traffic routing service for HTTP/S scenarios. Azure Front

Door is strongly advocated for HTTP/S workloads as it provides optimized traffic routing, transparent fail

over, private backend endpoints (with the Premium SKU), edge caching and integration with Web

Application Firewall (WAF).

For application scenarios where client control is possible, apply client side routing logic to consider

failover scenarios where the primary global routing technology fails. Two or more global routing

technologies should be positioned in parallel for added redundancy, if single service SLA isn't sufficient.

Client logic is required to route to the redundant technology in case of a global service failure.

Two distinct URLs should be used, with one applied to each of the different global routing services to

simplify the overall certificate management experience and routing logic for a failover.

Prioritize the use of third-party routing technologies as the secondary failover service, since this will

mitigate the largest number of global failure scenarios and the capabilities offered by industry leading

CDN providers will allow for a consistent design approach.

Consideration should also be given to directly routing to a single regional stamp rather than a

separate routing service. While this will result in a degraded level of service, it represents a far simpler

design approach.
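
The client-side failover logic recommended above can be sketched as an ordered walk over the available entry points. This is a minimal illustration, assuming two distinct URLs (one per global routing service) and a caller-supplied `fetch` function that raises `ConnectionError` or `TimeoutError` on failure; `fetch_with_failover`, `AllRoutesFailed`, and the example URLs are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllRoutesFailed(Exception):
    """Raised when every configured entry point has been tried without success."""


def fetch_with_failover(
    entry_points: Sequence[str],
    fetch: Callable[[str], T],
    attempts_per_entry_point: int = 2,
) -> T:
    """Try each entry point in order, moving to the next one only after repeated failures."""
    errors: List[str] = []
    for base_url in entry_points:
        for _ in range(attempts_per_entry_point):
            try:
                return fetch(base_url)
            except (ConnectionError, TimeoutError) as exc:
                # Record the failure and retry before abandoning this entry point.
                errors.append(f"{base_url}: {exc}")
    raise AllRoutesFailed("; ".join(errors))


# Hypothetical ordering: primary routing service first, then the secondary
# service, then a direct regional stamp as a last resort.
ENTRY_POINTS = [
    "https://primary.example.com",
    "https://secondary.example.com",
    "https://stamp-region1.example.com",
]
```

Keeping the ordered list in client configuration, rather than in code, makes it possible to change the failover order without shipping a new client build.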

This image shows a redundant global load balancer configuration with client failover using Azure Front Door as

primary global load balancer.

IMPORTANT

To truly mitigate the risk of global failures within the Azure platform, a multi-cloud active-active deployment approach

should be considered, with active deployment stamps hosted across two or more cloud providers and redundant third-

party routing technologies used for global routing.

Azure can effectively be integrated with other cloud platforms. However, it's strongly recommended not to apply a multi-

cloud approach because it introduces significant operational complexity, with different deployment stamp definitions and

representations of operational health across the different cloud platforms. This complexity in turn introduces numerous

resiliency risks within the normal operation of the application, which far outweigh the hypothetical risks of a global

platform outage.

Although not recommended, for HTTP(s) workloads using Azure Traffic Manager for global routing

redundancy to Azure Front Door, consider offloading Web Application Firewall (WAF) to Application Gateway

for acceptable traffic flowing through Azure Front Door.

This will introduce an additional failure point to the standard ingress path, an additional critical-path

component to manage and scale, and will also incur additional costs to ensure global high-availability.

It will, however, greatly simplify the failure scenario by providing consistency between the acceptable

and not acceptable ingress paths through Azure Front Door and Azure Traffic Manager, both in terms

of WAF execution but also private application endpoints.

The loss of edge caching in a failure scenario will impact overall performance, and this must be

aligned with an acceptable level of service or mitigating design approach. To ensure a consistent level

of service, consider offloading edge caching to a third-party CDN provider for both paths.

It's recommended to consider a third-party global routing service in place of two Azure global routing services.

This provides the maximum level of fault mitigation and a simpler design approach since most industry

leading CDN providers offer edge capabilities largely consistent with those offered by Azure Front Door.

Azure Front Door

Use the Azure Front Door managed certificate service to enable TLS connections, and remove the need to

manage certificate lifecycles.

Use the Azure Front Door Web Application Firewall (WAF) to provide protection at the edge from

common web exploits and vulnerabilities, such as SQL injection.

Use the Azure Front Door built-in cache to serve static content from edge nodes.

In most cases this will also eliminate the need for a dedicated Content Delivery Network (CDN).

Application delivery services

Configure the application platform ingress points to validate incoming requests through header based

filtering using the X-Azure-FDID header to ensure all traffic is flowing through the configured Front Door

instance. Consider also configuring IP ACLing using Front Door Service Tags to validate traffic originates

from the Azure Front Door backend IP address space and Azure infrastructure services. This will ensure

traffic flows through Azure Front Door at a service level, but header based filtering will still be required to

ensure the use of a configured Front Door instance.

Define a custom TCP health endpoint to validate critical downstream dependencies within a regional

deployment stamp, including data platform replicas, such as Azure Cosmos DB in the example provided

by the foundational reference implementation. If one or more dependencies becomes unhealthy, the

health probe should reflect this in the response returned so that the entire regional stamp can be taken

out of circulation.

Ensure health probe responses are logged and ingest all operational data exposed by Azure Front Door

into the global Log Analytics workspace to facilitate a unified data sink and single operational view across

the entire application.

Unless the workload is extremely latency sensitive, spread traffic evenly across all considered regional

stamps to most effectively use deployed resources.

To achieve this, set the "Latency Sensitivity (Additional Latency)" parameter to a value that is high

enough to cater for latency differences between the different regions of the backends. Ensure a

tolerance that is acceptable to the application workload regarding overall client request latency.

Don't enable Session Affinity unless it's required by the application, since it can have a negative impact on

the balance of traffic distribution. With a fully stateless application, if the recommended mission-critical

application design approach is followed, any request could be handled by any of the regional

deployments.
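
The header-based filtering recommended above can be illustrated with a small check at the application ingress. This is a sketch only: `is_from_expected_front_door` is a hypothetical helper, the Front Door ID shown is a placeholder, and in practice the check would usually live in ingress controller or gateway configuration rather than application code.

```python
from typing import Mapping

# Placeholder value: the real ID comes from the configured Front Door instance.
EXPECTED_FRONT_DOOR_ID = "00000000-0000-0000-0000-000000000000"


def is_from_expected_front_door(headers: Mapping[str, str], expected_fdid: str) -> bool:
    """Return True only if the request carries the X-Azure-FDID of the expected instance."""
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    presented = normalized.get("x-azure-fdid", "").strip().lower()
    # A missing or empty header never matches, even if the expected ID is misconfigured.
    return bool(presented) and presented == expected_fdid.strip().lower()
```

Requests that fail the check would typically be rejected with an HTTP 403 response. The Service Tag based IP filtering remains useful alongside it, because the header check alone doesn't establish that a request arrived through Front Door infrastructure.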

Azure Traffic Manager

Use Traffic Manager for non-HTTP/S scenarios as a replacement for Azure Front Door. Capability

differences will drive different design decisions for cache and WAF capabilities, and TLS certificate

management.

WAF capabilities should be considered within each region for the Traffic Manager ingress path, using

Azure Application Gateway.

Configure a suitably low TTL value to optimize the time required to remove an unhealthy backend

endpoint from circulation in the event that backend becomes unhealthy.

As with Azure Front Door, a custom TCP health endpoint should be defined to validate critical

downstream dependencies within a regional deployment stamp, which should be reflected in the

response provided by health endpoints.

However, for Traffic Manager additional consideration should be given to service level regional failover,

such as 'dog legging', to mitigate the potential delay associated with the removal of an unhealthy

backend due to dependency failures, particularly if it's not possible to set a low TTL for DNS records.

Consideration should be given to third-party CDN providers in order to achieve edge caching when using

Azure Traffic Manager as a primary global routing service. Where edge WAF capabilities are also offered

by the third-party service, consideration should be given to simplify the ingress path and potentially

remove the need for Application Gateway.
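
The custom health endpoint recommended for both routing services aggregates the state of a stamp's critical dependencies into a single response. The following sketch shows that aggregation in isolation, assuming each dependency exposes a cheap check function; `evaluate_stamp_health`, the dependency names, and the choice of HTTP 503 for an unhealthy stamp are illustrative assumptions.

```python
from typing import Callable, Dict, Mapping, Tuple


def evaluate_stamp_health(checks: Mapping[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status code, per-dependency detail)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception as exc:
            # A check that raises counts as a failed dependency rather than crashing the probe.
            results[name] = f"unhealthy: {type(exc).__name__}"
    # One failed critical dependency marks the whole stamp unhealthy,
    # so the routing service takes it out of circulation.
    status = 200 if all(value == "healthy" for value in results.values()) else 503
    return status, results
```

A web handler for the probe URL would return the status code from this function and log the detail dictionary, so that the reason for a stamp being removed from rotation is visible in the operational data. Caching the result for a few seconds keeps frequent probes from overloading the dependencies themselves.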

The network ingress path for a mission-critical application must also consider application delivery services to

ensure secure, reliable, and scalable ingress traffic.

This section builds on global routing recommendations by exploring key application delivery capabilities,

considering relevant services such as Azure Standard Load Balancer, Azure Application Gateway, and Azure API

Management.

Design considerations

TLS encryption is critical to ensure the integrity of inbound user traffic to a mission-critical application,

with TLS Offloading applied only at the point of a stamp's ingress to decrypt incoming traffic. TLS

Offloading requires the private key of the TLS certificate to decrypt traffic.

A Web Application Firewall provides protection against common web exploits and vulnerabilities,

such as SQL injection or cross site scripting, and is essential to achieve the maximum reliability

aspirations of a mission-critical application.

Azure WAF provides out-of-the-box protection against the top 10 OWASP vulnerabilities using managed

rule sets.

Custom rules can also be added to extend the managed rule set.

Azure WAF can be enabled within either Azure Front Door, Azure Application Gateway, or Azure CDN

(currently in public preview).

The features offered on each of the services differ slightly. For example, the Azure Front Door

WAF provides rate limiting, geo-filtering and bot protection, which aren't yet offered within the

Application Gateway WAF. However, they all support both built-in and custom rules and can be

set to operate in detection mode or prevention mode.

The roadmap for Azure WAF will ensure a consistent WAF feature set is provided across all

service integrations.

Third-party WAF technologies such as NVAs and advanced ingress controllers within Kubernetes can also

be considered to provide requisite vulnerability protection.

Optimal WAF configuration typically requires fine tuning, regardless of the technology used.

Azure Front Door

Azure Front Door only accepts HTTP and HTTPS traffic, and will only process requests with a known Host

header. This protocol blocking helps to mitigate volumetric attacks spread across protocols and ports, and

DNS amplification and TCP poisoning attacks.

Azure Front Door is a global Azure resource so configuration is deployed globally to all edge locations.

Resource configuration can be distributed at a massive scale to handle hundreds of thousands of

requests per second.

Updates to configuration, including routes and backend pools, are seamless and won't cause any

downtime during deployment.

Azure Front Door provides both a fully managed certificate service and a bring-your-own-certificate

method for the client-facing SSL certificates. The fully managed certificate service provides a simplified

operational approach and helps to reduce complexity in the overall design by performing certificate

management within a single area of the solution.

Azure Front Door auto-rotates "Managed" certificates at least 60 days ahead of certificate expiration to

protect against expired certificate risks. If self-managed certificates are used, updated certificates should

be deployed no later than 24 hours prior to expiration of the existing certificate, otherwise clients may

receive expired certificate errors.

Certificate updates will only result in downtime if Azure Front Door is switched between "Managed" and

"Use Your Own Certificate".

Azure Front Door is protected by Azure DDoS Protection Basic, which is integrated into Front Door by

Design recommendationsDesign recommendations

Caching and static content delivery

Design considerationsDesign considerations

default. This provides always-on traffic monitoring, real-time mitigation, and also defends against

common Layer 7 DNS query floods or Layer 3/4 volumetric attacks.

These protections help to maintain Azure Front Door availability even when faced with a DDoS attack.

Distributed Denial of Service (DDoS) attacks can render a targeted resource unavailable by

overwhelming it with illegitimate traffic.

Azure Front Door also provides WAF capabilities at a global traffic level, while Application Gateway WAF

must be provided within each regional deployment stamp. Capabilities include firewall rulesets to protect

against common attacks, geo-filtering, address blocking, rate limiting, and signature matching.

Azure Load BalancerAzure Load Balancer

The Azure Basic Load Balancer SKU isn't backed by an SLA and has several capability constraints

compared to the Standard SKU.

Perform TLS Offloading in as few places as possible in order to maintain security whilst simplifying the

certificate management lifecycle.

Use encrypted connections (e.g. HTTPS) from the point where TLS offloading occurs to the actual

application backends. Application endpoints won't be visible to end users, so Azure-managed domains,

such as azurewebsites.net or cloudapp.net , can be used with managed certificates.

For HTTP(S) traffic, ensure WAF capabilities are applied within the ingress path for all publicly exposed

endpoints.

Enable WAF capabilities at a single service location, either globally with Azure Front Door or regionally

with Azure Application Gateway, since this simplifies configuration fine tuning and optimizes

performance and cost.

Configure WAF in Prevention mode to directly block attacks. Only use WAF in Detection mode (i.e. only

logging but not blocking suspicious requests) when the performance penalty of Prevention mode is too

high. The implied additional risk must be fully understood and aligned to the specific requirements of the

workload scenario.

Prioritize the use of Azure Front Door WAF since it provides the richest Azure-native feature set and

applies protections at the global edge, which simplifies the overall design and drives further efficiencies.

Use Azure API Management only when exposing a large number of APIs to external clients or different

application teams.

Use the Azure Standard Load Balancer SKU for any internal traffic distribution scenario within

microservice workloads.

Provides an SLA of 99.99% when deployed across Availability Zones.

Provides critical capabilities such as diagnostics or outbound rules.

Use Azure DDoS Network Protection to help protect public endpoints hosted within each application

virtual network.

Special treatment of static content like images, JavaScript, CSS and other files can have a significant impact on

the overall user experience as well as on the overall cost of the solution. Caching static content at the edge can

speed up the client load times which results in a better user experience and can also reduce the cost for traffic,

read operations and computing power on backend services involved.

Design recommendations

Virtual network integration

Caution

Not all content that a solution makes available over the Internet is generated dynamically. Applications serve

both static assets (images, JavaScript, CSS, localization files, etc.) and dynamic content.

Workloads with frequently accessed static content benefit greatly from caching since it reduces the load on

backend services and reduces content access latency for end users.

Caching can be implemented natively within Azure using either Azure Front Door or Azure Content Delivery

Network (CDN).

Azure Front Door provides Azure-native edge caching capabilities and routing features to divide static

and dynamic content.

By creating the appropriate routing rules in Azure Front Door, /static/* traffic can be

transparently redirected to static content.

More complex caching scenarios can be implemented using the Azure CDN service to establish a

full-fledged content delivery network for significant static content volumes.

The Azure CDN service will likely be more cost effective, but does not provide the same

advanced routing and Web Application Firewall (WAF) capabilities which are recommended for

other areas of an application design. It does, however, offer further flexibility to integrate with

similar services from third-party solutions, such as Akamai and Verizon.

When comparing the Azure Front Door and Azure CDN services, the following decision factors should

be explored:

Whether the required caching rules can be accomplished using the rules engine.

Size of the stored content and the associated cost.

Price per month for the execution of the rules engine (charged per request on Azure Front

Door).

Outbound traffic requirements (price differs by destination).

Generated, static content like sized copies of image files that never or only rarely change can benefit from

caching as well. Caching can be configured based on URL parameters and with varying caching duration.

Separate the delivery of static and dynamic content to users and deliver relevant content from a cache to

reduce load on backend services and optimize performance for end users.

Given the strong recommendation (Network and connectivity design area) to use Azure Front Door for

global routing and Web Application Firewall (WAF) purposes, it's recommended to prioritize the use of Azure

Front Door caching capabilities unless gaps exist.

A mission-critical application will typically encompass requirements for integration with other applications or

dependent systems, which could be hosted on Azure, another public cloud, or on-premises data centers. This

application integration can be accomplished using public-facing endpoints and the internet, or private networks

through network-level integration. Ultimately, the method by which application integration is achieved will have

a significant impact on the security, performance, and reliability of the solution, and will strongly influence design

decisions within other design areas.

A mission-critical application can be deployed within one of three overarching network configurations, which

determines how application integration can occur at a network level.

1. Public application without corporate network connectivity.

2. Public application with corporate network connectivity.

3. Private application with corporate network connectivity.

When deploying within an Azure landing zone, configuration 1 should be deployed within an Online Landing

Zone, while configurations 2 and 3 should be deployed within a Corp. Connected Landing Zone to facilitate

network-level integration.

This section explores these network integration scenarios, layering in the appropriate use of Azure Virtual

Networks and surrounding Azure networking services to ensure integration requirements are optimally

satisfied.

Design Considerations

No virtual networks

Isolated virtual networks

Connected virtual networks

The simplest design approach is to not deploy the application within a virtual network.

Connectivity between all considered Azure services will be provided entirely through public endpoints

and the Microsoft Azure backbone. Connectivity between public endpoints hosted on Azure will only

traverse the Microsoft backbone and won't go over the public internet.

Connectivity to any external systems outside Azure will be provided by the public internet.

This design approach adopts "identity as a security perimeter" to provide access control between the

various service components and dependent solutions. This may be an acceptable solution for scenarios

that are less sensitive to security. Making all application services and dependencies accessible through a

public endpoint leaves them vulnerable to additional attack vectors oriented around gaining

unauthorized access.

This design approach is also not applicable for all Azure services, since many services, such as AKS, have

a hard requirement for an underlying virtual network.

To mitigate the risks associated with unnecessary public endpoints, the application can be deployed

within a standalone network that isn't connected to other networks.

Incoming client requests will still require a public endpoint to be exposed to the internet; however, all

subsequent communication can be within the virtual network using private endpoints. When using Azure

Front Door Premium, it's possible to route directly from edge nodes to private application endpoints.

While private connectivity between application components will occur over virtual networks, all

connectivity with external dependencies will still rely on public endpoints.

Connectivity to Azure platform services can be established with Private Endpoints if supported. If

other external dependencies exist on Azure, such as another downstream application, connectivity will

be provided through public endpoints and the Microsoft Azure backbone.

Connectivity to any external systems outside Azure would be provided by the public internet.

For scenarios where there are no network integration requirements for external dependencies, deploying

the solution within an isolated network environment provides maximum design flexibility, because there are

no addressing and routing constraints associated with broader network integration.

Azure Bastion is a fully platform-managed PaaS service that can be deployed on a virtual network and

provides secure RDP/SSH connectivity to Azure VMs. When you connect via Azure Bastion, virtual

machines don't need a public IP address.

The use of application virtual networks introduces significant deployment complexities within CI/CD

pipelines, since both data plane and control plane access to resources hosted on private networks is

required to facilitate application deployments.

A secure private network path must be established to allow CI/CD tooling to perform requisite actions.

Private build agents can be deployed within application virtual networks to proxy access to resources

secured by the virtual network.

For scenarios with external network integration requirements, application virtual networks can be

connected to other virtual networks within Azure, another cloud provider, or on-premises networks using

a variety of connectivity options. For example, some application scenarios might consider

application-level integration with other line-of-business applications hosted privately within an on-premises

corporate network.

NOTE

Design recommendations

The application network design must align with the broader network architecture, particularly concerning

topics such as addressing and routing.

Overlapping IP address spaces across Azure regions or on-premises networks will create major

contention when network integration is considered.

A virtual network resource can be updated to include additional address space; however, when the

address space of a peered virtual network changes, a sync on the peering link is required,

which will temporarily disable peering.

Azure reserves five IP addresses within each subnet, which should be considered when determining

appropriate sizes for application virtual networks and encompassed subnets.

Some Azure services require dedicated subnets, such as Azure Bastion, Azure Firewall, or Azure Virtual

Network Gateway. The size of these service subnets is very important, since they should be large

enough to support all current instances of the service considering future scale requirements, but not

so large as to unnecessarily waste addresses.
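
The subnet sizing arithmetic described above can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not an official sizing tool: the function names are hypothetical, and the /29 lower bound on subnet size is an assumption that should be checked against current Azure limits.

```python
import ipaddress

# Azure reserves five IP addresses within each subnet (stated in the text above).
AZURE_RESERVED_PER_SUBNET = 5
# Assumption: /29 is treated as the smallest subnet worth considering.
SMALLEST_PREFIX = 29


def usable_addresses(cidr: str) -> int:
    """Return how many addresses remain for resources in the given subnet."""
    subnet = ipaddress.ip_network(cidr, strict=True)
    return max(subnet.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def smallest_prefix_for(required_hosts: int) -> int:
    """Return the longest IPv4 prefix length that still fits the required hosts."""
    for prefix in range(SMALLEST_PREFIX, 0, -1):
        if 2 ** (32 - prefix) - AZURE_RESERVED_PER_SUBNET >= required_hosts:
            return prefix
    raise ValueError("requirement does not fit in an IPv4 network")


if __name__ == "__main__":
    for cidr in ("10.0.0.0/24", "10.0.1.0/27", "10.0.2.0/29"):
        print(cidr, "usable:", usable_addresses(cidr))
    # Include future scale in required_hosts, not only current instances.
    print("prefix for 250 hosts:", smallest_prefix_for(250))
```

Passing the expected future instance count, rather than the current one, keeps the chosen prefix aligned with the guidance to size for future scale without wasting addresses.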

When on-premises or cross-cloud network integration is required, Azure offers two different solutions to

establish a secure connection.

An ExpressRoute circuit can be sized to provide bandwidths up to 100 Gbps.

A Virtual Private Network (VPN) can be sized to provide aggregated bandwidth up to 10 Gbps in hub

and spoke networks, and up to 20 Gbps in Azure Virtual WAN.

When deploying within an Azure landing zone, be aware that any required connectivity to on-premises networks should

be provided by the landing zone implementation. The design can use ExpressRoute and other virtual networks in Azure

using either Virtual WAN or a hub-and-spoke network design.

The inclusion of additional network paths and resources introduces additional reliability and operational

considerations for the application to ensure health is maintained.

It's recommended that mission-critical solutions are deployed within Azure virtual networks where

possible to remove unnecessary public endpoints, limiting the application attack surface to maximize

security and reliability.

Use Private Endpoints for connectivity to Azure platform services. Service Endpoints can be

considered for services that don't support Private Link, provided data exfiltration risks are acceptable or

mitigated through alternative controls.

For application scenarios that don't require corporate network connectivity, treat all virtual networks as

ephemeral resources that are replaced when a new regional deployment is conducted.

When connecting to other Azure or on-premises networks, application virtual networks shouldn't be

treated as ephemeral since it creates significant complications where virtual network peering and virtual

network gateway resources are concerned. All relevant application resources within the virtual network

should continue to be ephemeral, with parallel subnets used to facilitate blue-green deployments of

updated regional deployment stamps.

In scenarios where corporate network connectivity is required to facilitate application integration over

private networks, ensure that the IPv4 address space used for regional application virtual networks

doesn't overlap with other connected networks and is properly sized to facilitate required scale without

needing to update the virtual network resource and incur downtime.

Internet egress

Design Considerations

NOTE

It's strongly recommended to only use IP addresses from the address allocation for private internet

(RFC 1918).

Align with organization plans for IP addressing in Azure to ensure that application network IP address

space doesn't overlap with other networks across on-premises locations or Azure regions.

Don't create unnecessarily large application virtual networks to ensure that IP address space isn't

wasted.

For environments with a limited availability of private IP addresses (RFC 1918), consider using

IPv6.

If the use of public IP addresses is required, ensure that only owned address blocks are used.
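
The addressing recommendations above (stay within RFC 1918 space and avoid overlaps with other connected networks) can be checked before deployment. The sketch below uses only Python's standard `ipaddress` module; the network names and CIDR ranges are hypothetical examples, and the helper functions are illustrative rather than part of any Azure tooling. It handles IPv4 ranges only.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple

# Private address blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(cidr: str) -> bool:
    """Return True when the IPv4 range sits entirely inside an RFC 1918 block."""
    network = ipaddress.ip_network(cidr)
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


def find_overlaps(networks: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical address plan for two regional stamps and an on-premises site.
    plan = {
        "stamp-region-a": "10.1.0.0/16",
        "stamp-region-b": "10.2.0.0/16",
        "on-premises": "10.1.128.0/20",
    }
    print("overlaps:", find_overlaps(plan))
    print("all private:", all(is_rfc1918(cidr) for cidr in plan.values()))
```

Running a check like this in a deployment pipeline surfaces address conflicts before peering or gateway connections are attempted.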

Prioritize the use of the Azure CNI network plugin for AKS network integration, since it supports a

richer feature set. Consider Kubenet for scenarios with a limited range of available IP addresses to fit

the application within a constrained address space. See Micro-segmentation and Kubernetes network

policies for more details.

For scenarios requiring on-premises network integration, prioritize the use of ExpressRoute to ensure

secure and scalable connectivity.

Ensure the reliability level applied to the ExpressRoute or VPN fully satisfies application requirements.

Multiple network paths should be considered to provide additional redundancy when required, such

as cross-connected ExpressRoute circuits or the use of VPN as a failover connectivity mechanism.

Ensure all components on critical network paths are in line with the reliability and availability

requirements of associated user flows, regardless of whether the management of these paths and

associated components is delivered by the application team or central IT teams.

When deploying within an Azure landing zone and integrating with a broader organizational network topology,

consider the network guidance to ensure the foundational network is aligned with Microsoft best-practices.

Use Azure Bastion or proxied private connections to access the data plane of Azure resources or perform

management operations.

Internet egress is a foundational network requirement for a mission-critical application to facilitate external

communication in the context of:

1. Direct application user interaction.

2. Application integration with external dependencies outside Azure.

3. Access to external dependencies required by the Azure services used by the application.

This section explores how internet egress can be achieved while ensuring security, reliability, and sustainable

performance are maintained, highlighting key egress requirements for services recommended in a mission-

critical context.

Many Azure services require access to public endpoints for various management and control plane

functions to operate as intended.

Design recommendations

NOTE

Azure provides different direct internet outbound connectivity methods, such as Azure NAT gateway or

Azure Load Balancer, for virtual machines or compute instances on a virtual network.

When traffic from inside a virtual network travels out to the Internet, Network Address Translation (NAT)

must take place. This is a compute operation that occurs within the networking stack and that can

therefore impact system performance.

When NAT takes place at a small scale, the performance impact should be negligible. However, if there are

a large number of outbound requests, network issues may occur. These issues typically come in the form

of 'Source NAT (or SNAT) port exhaustion'.

In a multi-tenant environment, such as Azure App Service, there's a limited number of outbound ports

available to each instance. If these ports run out, no new outbound connections can be initiated. This

issue can be mitigated by reducing the number of private/public edge traversals or by using a more

scalable NAT solution such as the Azure NAT Gateway.

In addition to NAT limitations, outbound traffic may also be subject to requisite security inspections.

Azure Firewall provides appropriate security capabilities to secure network egress.

Azure Firewall (or an equivalent NVA) can be used to secure Kubernetes egress requirements by

providing granular control over outbound traffic flows.

Large volumes of internet egress will incur data transfer charges.

Azure NAT Gateway

Azure NAT Gateway supports 64,000 connections for TCP and UDP per assigned outbound IP address.

Up to 16 IP addresses can be assigned to a single NAT gateway.

The default TCP idle timeout is 4 minutes. If the idle timeout is altered to a higher value, flows will be held

for longer, which will increase the pressure on the SNAT port inventory.

NAT gateway can't provide zonal isolation out-of-the-box. To get zonal redundancy, a subnet containing

zonal resources must be aligned with corresponding zonal NAT gateways.
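
The NAT Gateway figures quoted above lend themselves to a quick capacity estimate. The following sketch is a rough planning aid only: it takes the 64,000 connections per IP address and the 16 address limit directly from the text, while the function name and the 20 percent headroom default are assumptions made for illustration.

```python
import math

# Figures quoted in the text above.
CONNECTIONS_PER_IP = 64_000
MAX_IPS_PER_NAT_GATEWAY = 16


def public_ips_needed(peak_concurrent_connections: int, headroom: float = 0.2) -> int:
    """Estimate how many outbound IP addresses a single NAT gateway would need.

    headroom is an assumed safety margin (0.2 means 20 percent spare capacity).
    """
    target = peak_concurrent_connections * (1 + headroom)
    ips = max(math.ceil(target / CONNECTIONS_PER_IP), 1)
    if ips > MAX_IPS_PER_NAT_GATEWAY:
        raise ValueError(
            f"{ips} addresses required, but a single NAT gateway supports "
            f"at most {MAX_IPS_PER_NAT_GATEWAY}"
        )
    return ips


if __name__ == "__main__":
    for peak in (10_000, 100_000, 500_000):
        print(f"{peak} peak connections -> {public_ips_needed(peak)} IP address(es)")
```

Because a longer idle timeout holds flows open for longer, the peak concurrent connection figure used as input should reflect the configured timeout rather than request rate alone.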

Minimize the number of outgoing Internet connections as this will impact NAT performance.

If large numbers of internet-bound connections are required, consider using Azure NAT Gateway to

abstract outbound traffic flows.

Use Azure Firewall where requirements to control and inspect outbound internet traffic exist.

Ensure Azure Firewall isn't used to inspect traffic between Azure services.

When deploying within an Azure landing zone, consider using the foundational platform Azure Firewall resource (or

equivalent NVA). If a dependency is taken on a central platform resource for internet egress, then the reliability level of

that resource and associated network path should be closely aligned with application requirements. Operational data from

the resource should also be made available to the application in order to inform potential operational action in failure

scenarios.

If there are high-scale requirements associated with outbound traffic, consideration should be given to a

dedicated Azure Firewall resource for a mission-critical application, to mitigate risks associated with using a

centrally shared resource, such as noisy neighbor scenarios.

Inter-zone and inter-region Connectivity

Design Considerations

When deployed within a Virtual WAN environment, consideration should be given to Firewall Manager to

provide centralized management of dedicated application Azure Firewall instances to ensure organizational

security postures are observed through global firewall policies.

Ensure incremental firewall policies are delegated to application security teams via role-based access control

to allow for application policy autonomy.

While the application design strongly advocates independent regional deployment stamps, many application

scenarios may still require network integration between application components deployed within different

zones or Azure regions, even if only under degraded service circumstances. The method by which inter-zone

and inter-region communication is achieved has a significant bearing on overall performance and reliability,

which will be explored through the considerations and recommendations within this section.

The application design approach for a mission-critical application endorses the use of independent

regional deployments with zonal redundancy applied at all component levels within a single region.

An Availability Zone (AZ) is a physically separate data center location within an Azure region, providing

physical and logical fault isolation up to the level of a single data center.

A round-trip latency of less than 2 ms is guaranteed for inter-zone communication. Zones will have a

small latency variance given varied distances and fiber paths between zones.

Availability Zone connectivity depends on regional characteristics, and therefore traffic entering a region

via an edge location may need to be routed between zones to reach its destination. This will add roughly

1-2 ms of latency given inter-zone routing and 'speed of light' constraints, but this should only have a bearing

for hyper-sensitive workloads.

Availability Zones are treated as logical entities within the context of a single subscription, so different

subscriptions might have a different zonal mapping for the same region. For example, zone 1 in

Subscription A could correspond to the same physical data center as zone 2 in subscription B.

Communication between zones within a region incurs a data transfer charge per GB of bandwidth.

With application scenarios that are extremely chatty between application components, spreading

application tiers across zones can introduce significant latency and increased costs. It's possible to

mitigate this within the design by constraining a deployment stamp to a single zone and deploying

multiple stamps across the different zones.

Communication between different Azure regions incurs a larger data transfer charge per GB of

bandwidth.

The applicable data transfer rate depends on the continent of the considered Azure regions.

Data traversing continents is charged at a considerably higher rate.

ExpressRoute and VPN connectivity methods can also be used to directly connect different Azure regions

together for certain scenarios, or even different cloud platforms.

For service-to-service communication, Private Link can be used for direct communication using private

endpoints.

Traffic can be hair-pinned through ExpressRoute circuits used for on-premises connectivity in order to

facilitate routing between virtual networks within an Azure region and across different Azure regions

within the same geography.

Hair-pinning traffic through ExpressRoute will bypass data transfer costs associated with virtual

network peering, so can be used as a way to optimize costs.

This approach necessitates additional network hops for application integration within Azure, which

introduces latency and reliability risks. It also expands the role of ExpressRoute and associated gateway

components from Azure/on-premises to also encompass Azure/Azure connectivity.

Design recommendations

Micro-segmentation and Kubernetes network policies

Design Considerations

When sub-millisecond latency is required between services, Proximity Placement Groups can be used

when supported by the services used.

Use virtual network peering to connect networks within a region and across different regions. It's

strongly recommended to avoid hair-pinning within ExpressRoute.

Use Private Link to establish communication directly between services in the same region or across

regions (service in Region A communicating with service in Region B).

For application workloads that are extremely chatty between components, consider constraining a

deployment stamp to a single zone and deploying multiple stamps across the different zones. This

ensures zonal redundancy is maintained at the level of an encapsulated deployment stamp rather than a

single application component.

Where possible, treat each deployment stamp as independent and disconnected from other stamps.

Use data platform technologies to synchronize state across regions rather than achieving consistency

at an application level with direct network paths.

Avoid 'dog legging' traffic between different regions unless necessary, even in a failure scenario. Use

global routing services and end-to-end health probes to take an entire stamp out of circulation in the

event that a single critical component tier fails, rather than routing traffic at that faulty component

level to another region.

For hyper latency sensitive application scenarios, prioritize the use of zones with regional network

gateways to optimize network latency for ingress paths.
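
The recommendation to take an entire stamp out of circulation when a single critical tier fails can be sketched as an end-to-end health probe. This is a simplified illustration under stated assumptions: the tier names and check functions are hypothetical, and the mapping to HTTP 200 and 503 assumes the global routing service probes an HTTP endpoint and treats any non-200 response as unhealthy.

```python
from typing import Callable, Dict


def stamp_is_healthy(tier_checks: Dict[str, Callable[[], bool]]) -> bool:
    """Report the stamp as healthy only if every critical tier check passes.

    A check that raises is treated the same as a failed check.
    """
    for check in tier_checks.values():
        try:
            if not check():
                return False
        except Exception:
            return False
    return True


def probe_status_code(tier_checks: Dict[str, Callable[[], bool]]) -> int:
    """Map stamp health to the status code a probe endpoint would return."""
    return 200 if stamp_is_healthy(tier_checks) else 503


if __name__ == "__main__":
    # Hypothetical tiers; real checks would query each dependency.
    checks = {
        "frontend": lambda: True,
        "api": lambda: True,
        "database": lambda: False,  # a single failing tier fails the whole stamp
    }
    print("probe response:", probe_status_code(checks))
```

Failing the whole stamp keeps traffic inside healthy stamps instead of routing individual component calls to another region.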

Micro-segmentation is a network security design pattern used to isolate and secure individual application

workloads, with policies applied to limit network traffic between workloads based on a Zero Trust model. It's

typically applied to reduce network attack surface, improve breach containment, and strengthen security

through policy-driven application-level network controls.

A mission-critical application can enforce application-level network security using Network Security Groups

(NSG) at either a subnet or network interface level, service Access Control Lists (ACL), and network policies

when using Azure Kubernetes Service (AKS).

This section explores the optimal use of these capabilities, providing key considerations and recommendations

to achieve application-level micro-segmentation.

AKS can be deployed in two different networking models:

Kubenet networking: AKS nodes are integrated within an existing virtual network, but pods exist

within a virtual overlay network on each node. Traffic between pods on different nodes is routed

through kube-proxy.

Azure Container Networking Interface (CNI) networking: The AKS cluster is integrated within

an existing virtual network and its nodes, pods and services receive IP addresses from the same

virtual network the cluster nodes are attached to. This is relevant for various networking scenarios

requiring direct connectivity from and to pods. Different node pools can be deployed into different

subnets.

Design recommendations

NOTE

Azure CNI requires more IP address space compared to Kubenet. Proper upfront planning and sizing of the

network is required. For more information, refer to the Azure CNI documentation.

By default, pods are non-isolated and accept traffic from any source and can send traffic to any

destination; a pod can communicate with every other pod in a given Kubernetes cluster; Kubernetes

doesn't ensure any network level isolation, and doesn't isolate namespaces at the cluster level.

Communication between Pods and Namespaces can be isolated using Network Policies. Network Policy is

a Kubernetes specification that defines access policies for communication between Pods. Using Network

Policies, an ordered set of rules can be defined to control how traffic is sent/received, and applied to a

collection of pods that match one or more label selectors.

AKS supports two plugins that implement Network Policy, Azure and Calico. Both plugins use Linux

IPTables to enforce the specified policies. See Differences between Azure and Calico policies and their

capabilities for more details.

Network policies don't conflict since they're additive.

For a network flow between two pods to be allowed, both the egress policy on the source pod and the

ingress policy on the destination pod need to allow the traffic.

The network policy feature can only be enabled at cluster instantiation time. It is not possible to enable

network policy on an existing AKS cluster.

The delivery of network policies is consistent regardless of whether Azure or Calico is used.

Calico provides a richer feature set, including support for Windows nodes, and supports Azure CNI as

well as Kubenet.

AKS supports the creation of different node pools to separate different workloads using nodes with

different hardware and software characteristics, such as nodes with and without GPU capabilities.

Using node pools doesn't provide any network-level isolation.

Node pools can use different subnets within the same virtual network. NSGs can be applied at the

subnet-level to implement micro-segmentation between node pools.
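
The Network Policy behavior described above (pods are non-isolated by default, policies are additive, and a flow needs both the source's egress and the destination's ingress to allow it) can be modeled in a few lines. The sketch below is a simplified in-memory model for reasoning about policy sets. It is not the Kubernetes API, and the `Policy` class, label values, and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """A selector matches when all of its key/value pairs appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


@dataclass
class Policy:
    pod_selector: Labels                # pods the policy applies to
    direction: str                      # "ingress" or "egress"
    allowed_peers: List[Labels] = field(default_factory=list)


def direction_allows(policies: List[Policy], pod: Labels, peer: Labels, direction: str) -> bool:
    """Evaluate one direction of traffic for a pod."""
    selecting = [p for p in policies if p.direction == direction and matches(p.pod_selector, pod)]
    if not selecting:
        return True  # no policy selects the pod, so it is non-isolated
    # Policies are additive: any selecting policy may allow the peer.
    return any(matches(sel, peer) for p in selecting for sel in p.allowed_peers)


def flow_allowed(policies: List[Policy], source: Labels, destination: Labels) -> bool:
    """A flow needs egress allowed at the source and ingress allowed at the destination."""
    return (direction_allows(policies, source, destination, "egress")
            and direction_allows(policies, destination, source, "ingress"))


if __name__ == "__main__":
    policies = [Policy({"app": "db"}, "ingress", [{"app": "api"}])]
    print("api -> db:", flow_allowed(policies, {"app": "api"}, {"app": "db"}))
    print("web -> db:", flow_allowed(policies, {"app": "web"}, {"app": "db"}))
```

In this model, adding a second ingress policy for the same pods can only widen what is allowed, which is why additive policies never conflict.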

Configure an NSG on all considered subnets to provide an IP ACL to secure ingress paths and isolate

application components based on a Zero Trust model.

Use Front Door Service Tags within NSGs on all subnets containing application backends defined

within Azure Front Door, since this will validate traffic originates from a legitimate Azure Front

Door backend IP address space. This will ensure traffic flows through Azure Front Door at a service

level, but header based filtering will still be required to ensure the use of a particular Front Door

instance and to also mitigate 'IP spoofing' security risks.

Public internet traffic should be disabled on RDP and SSH ports across all applicable NSGs.
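
The header-based filtering mentioned above can be illustrated with a small check in the application backend. This sketch assumes the Front Door instance identifier arrives in the `X-Azure-FDID` request header, which the text itself does not name. The expected identifier shown is a placeholder, and the function name is hypothetical.

```python
import hmac
from typing import Mapping

# Assumption: Azure Front Door adds its instance ID in this request header.
FRONT_DOOR_ID_HEADER = "X-Azure-FDID"


def came_from_expected_front_door(headers: Mapping[str, str], expected_id: str) -> bool:
    """Return True only when the request carries the expected Front Door ID."""
    normalized = {name.lower(): value for name, value in headers.items()}
    received = normalized.get(FRONT_DOOR_ID_HEADER.lower(), "").strip().lower()
    expected = expected_id.strip().lower()
    if not received or not expected:
        return False  # reject requests with no ID rather than comparing empty values
    # Constant-time comparison avoids leaking how much of the value matched.
    return hmac.compare_digest(received.encode(), expected.encode())


if __name__ == "__main__":
    # Placeholder ID for illustration; use the ID of your own Front Door instance.
    expected = "11111111-2222-3333-4444-555555555555"
    request_headers = {"x-azure-fdid": "11111111-2222-3333-4444-555555555555"}
    print(came_from_expected_front_door(request_headers, expected))
```

The service tag in the NSG restricts traffic to Front Door address space in general, while a check like this ties the request to one specific Front Door instance.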

Prioritize the use of the Azure CNI network plugin and consider Kubenet for scenarios with a

limited range of available IP addresses to fit the application within a constrained address space.

AKS supports the use of both Azure CNI and Kubenet. It is selected at deployment time.

The Azure CNI network plugin is a more robust and scalable network plugin, and is

recommended for most scenarios.

Kubenet is a more lightweight network plugin, and is recommended for scenarios with a

limited range of available IP addresses.

See Azure CNI for more details.

The Network Policy feature in Kubernetes should be used to define rules for ingress and egress traffic

between pods in a cluster. Define granular Network Policies to restrict and limit cross-pod

communication.

Enable Network Policy for Azure Kubernetes Service at deployment time.

Prioritize the use of Calico because it provides a richer feature set with broader community adoption

and support.

Next step

Review the considerations for quantifying and observing application health.

Health modeling and observability

Health modeling and observability of mission-critical workloads on Azure

12/16/2022 • 27 minutes to read

IMPORTANT

Video: Define a health model for your mission-critical workload

Layered application health

Health modeling and observability are essential concepts to maximize reliability, which focuses on robust and

contextualized instrumentation and monitoring. These concepts provide critical insight into application health,

promoting the swift identification and resolution of issues.

Most mission-critical applications are significant in terms of both scale and complexity and therefore generate

high volumes of operational data, which makes it challenging to evaluate and determine optimal operational

action. Health modeling ultimately strives to maximize observability by augmenting raw monitoring logs and

metrics with key business requirements to quantify application health, driving automated evaluation of health

states to achieve consistent and expedited operations.

This design area focuses on the process to define a robust health model, mapping quantified application health

states through observability and operational constructs to achieve operational maturity.

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we

recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The code assets illustrate

considerations and recommendations for observing application health.

There are three main levels of operational maturity when striving to maximize reliability.

Detect and respond to issues as they happen.

Diagnose issues that are occurring or have already occurred.

Predict and prevent issues before they take place.

To build a health model, first define application health in the context of key business requirements by

quantifying ‘healthy’ and ‘unhealthy’ states in a layered and measurable format. Then, for each application

component, refine the definition in the context of a steady running state and aggregate it according to the

application user flows. Superimpose with key non-functional business requirements for performance and

availability. Finally, aggregate the health states for each individual user flow to form an acceptable

representation of the overall application health. Once established, these layered health definitions should be

used to inform critical monitoring metrics across all system components and validate operational subsystem

composition.

IMPORTANT

When defining 'unhealthy' states, represent them for all levels of the application. It's important to distinguish between transient and non-transient failure states to qualify service degradation relative to unavailability.

Design considerations

The process of modeling health is a top-down design activity that starts with an architectural exercise to define all user flows and map dependencies between functional and logical components, thereby implicitly mapping dependencies between Azure resources.

A health model is entirely dependent on the context of the solution it represents, and therefore can't be solved 'out-of-the-box' because 'one size doesn't fit all'.

Applications will differ in composition and dependencies.

Metrics and metric thresholds for resources must also be finely tuned in terms of what values represent healthy and unhealthy states, which are heavily influenced by encompassed application functionality and target non-functional requirements.

A layered health model enables application health to be traced back to lower level dependencies, which helps to quickly identify the root cause of service degradation.

To capture health states for an individual component, that component's distinct operational characteristics must be understood under a steady state that is reflective of production load. Performance testing is therefore a key capability to define and continually evaluate application health.

Failures within a cloud solution may not happen in isolation. An outage in a single component may lead to several capabilities or additional components becoming unavailable.

Such errors may not be immediately observable.

Design recommendations

Define a measurable health model as a priority to ensure a clear operational understanding of the entire application.

The health model should be layered and reflective of the application structure.

The foundational layer should consider individual application components, such as Azure resources.

Foundational components should be aggregated alongside key non-functional requirements to build

a business-contextualized lens into the health of system flows.

System flows should be aggregated with appropriate weights based on business criticality to build a

meaningful definition of overall application health. Financially significant or customer-facing user

flows should be prioritized.

Each layer of the health model should capture what ‘healthy’ and ‘unhealthy’ states represent.

Ensure the health model can distinguish between transient and non-transient unhealthy states to

isolate service degradation from unavailability.

Represent health states using a granular health score for every distinct application component and every

user flow by aggregating health scores for mapped dependent components, considering key non-

functional requirements as coefficients.

The health score for a user flow should be represented by the lowest score across all mapped

components, factoring in relative attainment against non-functional requirements for the user flow.

The model used to calculate health scores must consistently reflect operating health, and if not, the

model should be adjusted and redeployed to reflect new learnings.

Define health score thresholds to reflect health status.

The health score must be calculated automatically based on underlying metrics, which can be visualized through observability patterns and acted on through automated operational procedures.

The health score should become core to the monitoring solution, so that operating teams no longer have to interpret and map operational data to application health.

Use the health model to calculate availability Service Level Objective (SLO) attainment instead of raw availability, ensuring the demarcation between service degradation and unavailability is reflected as separate SLOs.

Use the health model within CI/CD pipelines and test cycles to validate application health is maintained after code and configuration updates.

The health model should be used to observe and validate health during load testing and chaos testing as part of CI/CD processes.

Building and maintaining a health model is an iterative process and engineering investment should be aligned to drive continuous improvements.

Define a process to continually evaluate and fine-tune the accuracy of the model, and consider investing in machine learning models to further train the model.

Example - Layered health model

This is a simplified representation of a layered application health model for illustrative purposes. A

comprehensive and contextualized health model is provided in the Mission-Critical reference implementations:

Mission-Critical Online

Mission-Critical Connected

When implementing a health model it's important to define the health of individual components through the

aggregation and interpretation of key resource-level metrics. An example of how resource metrics can be used

is the image below:

This definition of health can subsequently be represented by a KQL query, as demonstrated by the example AKS

query below that aggregates InsightsMetrics (AKS Container insights) and AzureMetrics (Azure diagnostics) and

compares (inner join) against modeled health thresholds.

// ClusterHealthStatus
// Compares the latest AKS metric values against the modeled health thresholds.
let Thresholds=datatable(MetricName: string, YellowThreshold: double, RedThreshold: double) [
    // Disk Usage:
    "used_percent", 50, 80,
    // Average node cpu usage %:
    "node_cpu_usage_percentage", 60, 90,
    // Average node disk usage %:
    "node_disk_usage_percentage", 60, 80,
    // Average node memory usage %:
    "node_memory_rss_percentage", 60, 80
];
InsightsMetrics
| summarize arg_max(TimeGenerated, *) by Computer, Name
| project TimeGenerated, Computer, Namespace, MetricName = Name, Value = Val
| extend NodeName = extract("([a-z0-9-]*)(-)([a-z0-9]*)$", 3, Computer)
| union (
    AzureMetrics
    | extend ResourceType = extract("(PROVIDERS/MICROSOFT.)([A-Z]*/[A-Z]*)", 2, ResourceId)
    | where ResourceType == "CONTAINERSERVICE/MANAGEDCLUSTERS"
    | summarize arg_max(TimeGenerated, *) by MetricName
    | project TimeGenerated, MetricName, Namespace = "AzureMetrics", Value = Average
    )
| lookup kind=inner Thresholds on MetricName
| extend IsYellow = iff(Value > YellowThreshold and Value < RedThreshold, 1, 0)
// A value exactly at RedThreshold counts as red, so it is never left unclassified.
| extend IsRed = iff(Value >= RedThreshold, 1, 0)
| project NodeName, MetricName, Value, YellowThreshold, IsYellow, RedThreshold, IsRed

The resulting table output can subsequently be transformed into a health score for easier aggregation at higher levels of the health model.

// ClusterHealthScore
// Assumes the query above is saved as a function named ClusterHealthStatus.
ClusterHealthStatus
| summarize YellowScore = max(IsYellow), RedScore = max(IsRed)
| extend HealthScore = 1-(YellowScore*0.25)-(RedScore*0.5)

Demo video: Monitoring and health modeling demo

These aggregated scores can subsequently be represented as a dependency chart using visualization tools such

as Grafana to illustrate the health model.

This image shows an example layered health model from the Azure Mission-Critical online reference

implementation, and demonstrates how a change in health state for a foundational component can have a

cascading impact to user flows and overall application health (the example values correspond to the table in the

previous image).
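The scoring and aggregation steps described in this example can also be sketched outside KQL. The following Python is a minimal illustration only, not the reference implementation: the names `Threshold`, `metric_state`, `component_score`, and `flow_score` are hypothetical, and it assumes a reading at or above the red threshold counts as red. It reuses the 0.25 and 0.5 penalties from the ClusterHealthScore query and the "lowest score across all mapped components" rule for user flows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    yellow: float
    red: float


def metric_state(value: float, t: Threshold) -> str:
    """Classify one metric reading as green, yellow, or red."""
    if value >= t.red:  # assumption: the boundary value itself is red
        return "red"
    if value > t.yellow:
        return "yellow"
    return "green"


def component_score(readings: dict, thresholds: dict) -> float:
    """Apply the ClusterHealthScore arithmetic: 1 - 0.25*yellow - 0.5*red."""
    states = {
        metric_state(value, thresholds[name])
        for name, value in readings.items()
        if name in thresholds  # like the inner lookup: unmodeled metrics are ignored
    }
    any_yellow = 1 if "yellow" in states else 0
    any_red = 1 if "red" in states else 0
    return 1 - 0.25 * any_yellow - 0.5 * any_red


def flow_score(component_scores: dict, depends_on: list) -> float:
    """A user flow is only as healthy as its least healthy dependency."""
    return min(component_scores[name] for name in depends_on)


thresholds = {
    "node_cpu_usage_percentage": Threshold(60, 90),
    "node_memory_rss_percentage": Threshold(60, 80),
}
cluster = component_score(
    {"node_cpu_usage_percentage": 72, "node_memory_rss_percentage": 41},
    thresholds,
)
scores = {"cluster": cluster, "database": 1.0}  # "database" is a made-up component
print(cluster, flow_score(scores, ["cluster", "database"]))
```

A single yellow metric on the cluster lowers the cluster score, and the flow inherits that lower score because it depends on the cluster. Weighting flows by business criticality would sit one layer above this.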

Unified data sink for correlated analysis

Many operational datasets must be gathered from all system components to accurately represent a defined health model, considering logs and metrics from both application components and underlying Azure resources. This vast amount of data ultimately needs to be stored in a format that allows for near-real time interpretation to facilitate swift operational action. Moreover, correlation across all encompassed data sets is required to ensure effective analysis is unbounded, allowing for the layered representation of health.

A unified data sink is required to ensure all operational data is swiftly stored and made available for correlated analysis to build a 'single pane' representation of application health. Azure provides several different operational technologies under the umbrella of Azure Monitor, and Azure Monitor Log Analytics serves as the core Azure-native data sink to store and analyze operational data.

Design considerations

Azure Monitor

Azure Monitor is enabled by default for all Azure subscriptions, but Azure Monitor for Logs (Log Analytics) and Azure Application Insights resources must be deployed and configured to incorporate data collection and querying capabilities.

Azure Monitor supports three types of observability data: logs, metrics, and distributed traces.

Logs are stored in Azure Monitor Logs workspaces based on Azure Data Explorer. Log queries are stored in query packs that can be shared across subscriptions, and are used to drive observability components such as dashboards, workbooks, or other reporting and visualization tools.

Metrics are stored in an internal time-series database. For most Azure resources, metrics are retained for 93 days. Metric collection is configured through resource Diagnostic settings.

All Azure resources expose logs and metrics, but resources must be appropriately configured to route diagnostic data to your desired data sink.

TIP

Azure provides various Built-In Policies that can be applied to ensure deployed resources are configured to send logs and metrics to an Azure Monitor instance.

It's not uncommon for regulatory controls to require that operational data remain within originating

geographies or countries. Regulatory requirements may stipulate the retention of critical data types for

an extended period of time. For example, in regulated banking, audit data must be retained for at least

seven years.

Different operational data types may require different retention periods. For example, security logs may

need to be retained for a long period, while performance data is unlikely to require long-term retention

outside the context of AIOps.

Data can be exported from Log Analytics Workspaces for long term retention and/or auditing purposes.

Azure Monitor Logs Dedicated Clusters provides a deployment option that enables Availability Zones for

protection from zonal failures in supported Azure regions. Dedicated Clusters require a minimum daily

data ingest commitment.

Azure Monitor for Logs resources, including underlying log and metrics storage, are deployed into a

specified Azure region.

To protect against loss of data from unavailability of an Azure Monitor for Logs workspace, resources can

be configured with multiple Diagnostics configurations. Each Diagnostic configuration can target metrics

and logs at a separate Azure Monitor for Logs workspace.

Each additional Azure Monitor for Logs workspace will incur extra costs.

The redundant Azure Monitor for Logs workspaces can be deployed into the same Azure region, or

into separate Azure regions for additional regional redundancy.

Sending logs and metrics from an Azure resource to an Azure Monitor for Logs workspace in a

different region will incur inter-region data egress costs.

Some Azure resources require an Azure Monitor for Logs workspace within the same region as the

resource itself.

Azure Monitor Logs workspace data can be exported to Azure Storage or Azure Event Hubs on a

continuous, scheduled, or one-time basis.

Data export allows for long-term data archiving and protects against possible operational data loss

due to unavailability.

Available export destinations are Azure Storage or Azure Event Hubs. Azure Storage can be configured

for different redundancy levels including zonal or regional. Data export to Azure Storage stores the

data within .json files.

Data export destinations must be within the same Azure region as the Azure Monitor Logs workspace.

An event hub data export destination must also be within the same region as the Azure Monitor Logs

workspace. Azure Event Hubs geo-disaster recovery isn't applicable for this scenario.

There are several data export limitations. Only specific Azure Monitor Logs tables are supported for

data export.

Azure Monitor Logs has user query throttling limits, which may appear as reduced availability to clients,

such as observability dashboards.

Five concurrent queries per user: if five queries are already running, additional queries are placed in a

per-user concurrency queue until a running query ends.

Time in concurrency queue: if a query sits in the concurrency queue for over three minutes, it will be

terminated and a 429 error code returned.

Concurrency queue depth limit: the concurrency queue is limited to 200 queries, and additional

queries will be rejected with a 429 error code.

Query rate limit: there's a per-user limit of 200 queries per 30 seconds across all workspaces.
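Clients that query the workspace programmatically, such as dashboard backends, can absorb these throttling responses instead of surfacing them as errors. The sketch below is one hedged way to do that with exponential backoff and jitter. `query_with_backoff` and its `send_query` parameter are hypothetical names, not part of any Azure SDK; `send_query` stands in for whatever call the client already uses and is assumed to return an HTTP status code and a body. Only the 429 status code comes from the limits listed above.

```python
import random
import time
from typing import Callable, Tuple


def query_with_backoff(
    send_query: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send_query, retrying with exponential backoff while it reports 429.

    Any other status is handed back to the caller unchanged.
    """
    for attempt in range(max_attempts):
        status, body = send_query()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Double the wait on each retry and add jitter so that many
            # clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"query still throttled after {max_attempts} attempts")


# Usage with a stand-in that is throttled twice before succeeding.
replies = iter([(429, ""), (429, ""), (200, "rows")])
print(query_with_backoff(lambda: next(replies), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper testable without real delays. Retrying does not raise the limits, so it complements rather than replaces reducing the number of queries issued.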

Query Packs are Azure Resource Manager resources, which can be used to protect and recover Azure

Monitor Logs queries if Azure Monitor Logs workspace is unavailable.

Query Packs contain queries as JSON and can be stored external to Azure similar to other

infrastructure-as-code assets.

Deployable through the Microsoft.Insights REST API.

If an Azure Monitor for Logs workspace must be re-created, the Query Pack can be redeployed from an externally stored definition.

Application Insights can be deployed in a workspace-based deployment model, underpinned by a Log Analytics Workspace where all the data is stored.

Sampling can be enabled within Application Insights to reduce the amount of telemetry sent and optimize data ingest costs.

Log Analytics and Application Insights charge based on the volume of data ingested and the duration that data is retained for.

Data ingested into a Log Analytics Workspace can be retained at no additional charge for up to the first 31 days (90 days if Sentinel is enabled).

Data ingested into a Workspace-based Application Insights is retained for the first 90 days at no extra charge.

The Log Analytics Commitment Tier pricing model provides a predictable approach to data ingest charges.

Any usage above the reservation level is billed at the same price as the current tier.

Azure Monitor Log Analytics, Application Insights, and Azure Data Explorer use the Kusto Query Language (KQL).

Log Analytics queries are saved as functions within Log Analytics (savedSearches).

Design recommendations

Use Azure Monitor for Logs (Log Analytics) as a unified data sink to provide a 'single pane' across all

operational data sets.

Decentralize Log Analytics Workspaces across all used deployment regions. Each Azure region with an

application deployment should consider a Log Analytics Workspace to gather all operational data

originating from that region. All global resources should use a separate dedicated Log Analytics

Workspace, which should be deployed within a primary deployment region.

All deployment stamps within the same region can use the same regional Log Analytics Workspace.

Sending all operational data to a single Log Analytics Workspace would create a single point of

failure.

Requirements for data residency might prohibit data leaving the originating region, and

federated workspaces address this requirement by default.

There's a substantial egress cost associated with transferring logs and metrics across regions.

Consider configuring resources with multiple diagnostic configurations pointing to different Azure

Monitor for Logs workspaces to protect against Azure Monitor unavailability for applications with fewer

regional deployment stamps.

Use Application Insights as a consistent Application Performance Monitoring (APM) tool across all

application components to collect application logs, metrics, and traces.

Deploy Application Insights in a workspace-based configuration to ensure each regional Log Analytics

Workspace contains logs and metrics from both application components and underlying Azure

resources.

Use Cross-Workspace queries to maintain a unified 'single pane' across the different workspaces.

Use Query Packs to protect Azure Monitor Logs queries in the event of workspace unavailability.

Store query packs within the application git repository as infrastructure-as-code assets.

All Log Analytics Workspaces should be treated as long-running resources with a different life-cycle to application resources within a regional deployment stamp.

Export critical operational data from Log Analytics for long-term retention and analytics to facilitate AIOps and advanced analytics to refine the underlying health model and inform predictive action.

Carefully evaluate which data store should be used for long-term retention; not all data has to be stored in a hot and queryable data store.

It's strongly recommended to use Azure Storage in a GRS configuration for long-term operational data storage.

Use the Log Analytics Export capability to export all available data sources to Azure Storage.

Select appropriate retention periods for operational data types within Log Analytics, configuring longer retention periods within the workspace where 'hot' observability requirements exist.

Use Azure Policy to ensure all regional resources route operational data to the correct Log Analytics Workspace.

NOTE

When deploying into an Azure landing zone, if there's a requirement for centralized storage of operational data, you can

fork data at instantiation so it's ingested into both centralized tooling and Log Analytics Workspaces dedicated to the

application. Alternatively, expose access to application Log Analytics workspaces so that central teams can query

application data. It's ultimately critical that operational data originating from the solution is available within Log Analytics

Workspaces dedicated to the application.

If SIEM integration is required, do not send raw log entries, but instead send critical alerts.

Only configure sampling within Application Insights if it's required to optimize performance, or if not

sampling becomes cost prohibitive.

Excessive sampling can lead to missed or inaccurate operational signals.

Use correlation IDs for all trace events and log messages to tie them to a given request.

Return correlation IDs to the caller for all calls, not just failed requests.

Ensure application code incorporates proper instrumentation and logging to inform the health model and

facilitate subsequent troubleshooting or root cause analysis when required.

Application code should use Application Insights to facilitate Distributed Tracing, by providing the

caller with a comprehensive error message that includes a correlation ID when a failure occurs.

Use structured logging for all log messages.

Add meaningful health probes to all application components.

When using AKS, configure the health endpoints for each deployment (pod) so that Kubernetes can

correctly determine when a pod is healthy or unhealthy.

When using Azure App Service, configure the Health Checks so that scale out operations will not cause errors by sending traffic to instances that are not yet ready, and so that unhealthy instances are recycled quickly.

If the application is subscribed to Microsoft Mission-Critical Support, consider exposing key health probes to

Microsoft Support, so application health can be modelled more accurately by Microsoft Support.

Log successful health check requests, unless increased data volumes can't be tolerated in the context of

application performance, since they provide additional insights for analytical modeling.
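The instrumentation recommendations above (correlation IDs on every log message, structured logging, and returning the ID on every call) can be sketched with the Python standard library alone. This is an illustrative sketch, not Application Insights code: `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical names, and a real service would take the incoming ID from a request header and export the records through its telemetry SDK.

```python
import json
import logging
import sys
import uuid
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so its fields stay queryable."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes structured records to the given stream."""
    logger = logging.getLogger("health-demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> dict:
    """Hypothetical handler: reuse the caller's correlation ID or mint a new one."""
    correlation_id = incoming_id or str(uuid.uuid4())
    logger.info("request received", extra={"correlation_id": correlation_id})
    # Returned on every call, not only on failure, so the caller can quote it.
    return {"status": 200, "correlation_id": correlation_id}


log = build_logger(sys.stdout)
print(handle_request(log, incoming_id="abc-123"))
```

Because every record carries the same `correlation_id` field, a log query can filter on it to reconstruct a single request across components.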

Do not configure production Log Analytics Workspaces to apply a daily cap, which limits the daily ingestion of operational data, since this can lead to the loss of critical operational data.

In lower environments, such as Development and Test, it can be considered as an optional cost saving mechanism.

Provided operational data ingest volumes meet the minimum tier threshold, configure Log Analytics Workspaces to use Commitment Tier based pricing to drive cost efficiencies relative to the 'pay-as-you-go' pricing model.

It's strongly recommended to store Log Analytics queries using source control and use CI/CD automation to deploy them to relevant Log Analytics instances.

Visualization

Visually representing the health model with critical operational data is essential to achieve effective operations and maximize reliability. Dashboards should ultimately be utilized to provide near-real time insights into application health for DevOps teams, facilitating the swift diagnosis of deviations from steady state.

Microsoft provides several data visualization technologies, including Azure Dashboards, Power BI, and Azure Managed Grafana (currently in preview). Azure Dashboards is positioned to provide a tightly integrated out-of-the-box visualization solution for operational data within Azure Monitor. It therefore has a fundamental role to play in the visual representation of operational data and application health for a mission-critical workload. However, there are several limitations in terms of the positioning of Azure Dashboards as a holistic observability platform, and as a result consideration should be given to the supplemental use of market-leading observability solutions, such as Grafana, which is also provided as a managed solution within Azure.

This section focuses on the use of Azure Dashboards and Grafana to build a robust dashboarding experience capable of providing technical and business lenses into application health, enabling DevOps teams to operate effectively. Robust dashboarding is essential to diagnose issues that have already occurred, and support operational teams in detecting and responding to issues as they happen.

Design considerations

When visualizing the health model using Log Analytics queries, note that there are Log Analytics limits on concurrent and queued queries, as well as the overall query rate, with subsequent queries queued and throttled.

Queries to retrieve operational data used to calculate and represent health scores can be written and executed in either Azure Monitor Log Analytics or Azure Data Explorer.

Sample queries are available here.

Log Analytics imposes several query limits, which must be designed for when designing operational dashboards.

The visualization of raw resource metrics, such as CPU utilization or network throughput, requires manual evaluation by operations teams to determine health status impacts, and this can be challenging during an active incident.

If multiple users use dashboards within a tool like Grafana, the number of queries sent to Log Analytics multiplies quickly.

Reaching the concurrent query limit on Log Analytics will queue subsequent queries, making the dashboard experience feel 'slow'.

Design recommendations

Collect and present queried outputs from all regional Log Analytics Workspaces and the global Log Analytics

Workspace to build a unified view of application health.

NOTE

If deploying into an Azure landing zone, consider querying the central platform Log Analytics Workspace if key

dependencies on platform resources exist, such as ExpressRoute for on-premises communication.

A ‘traffic light’ model should be used to visually represent 'healthy' and 'unhealthy' states, with green

used to illustrate when key non-functional requirements are fully satisfied and resources are optimally

utilized. Use "Green", "Amber", and "Red" to represent "Healthy", "Degraded", and "Unavailable" states.

Use Azure Dashboards to create operational lenses for global resources and regional deployment stamps,

representing key metrics such as request count for Azure Front Door, server side latency for Azure

Cosmos DB, incoming/outgoing messages for Event Hubs, and CPU utilization or deployment statuses for

AKS. Dashboards should be tailored to drive operational effectiveness, infusing learnings from failure

scenarios to ensure DevOps teams have direct visibility into key metrics.

If Azure Dashboards can't be used to accurately represent the health model and requisite business

requirements, then it's strongly recommended to consider Grafana as an alternative visualization

solution, providing market-leading capabilities and an extensive open-source plugin ecosystem. Evaluate

the managed Grafana preview offering to avoid the operational complexities of managing Grafana

infrastructure.

When deploying self-hosted Grafana, employ a highly available and geo-distributed design to ensure

critical operational dashboards can be resilient to regional platform failures and cascading error

scenarios.

Separate configuration state into an external datastore, such as Azure Database for Postgres or

MySQL, to ensure Grafana application nodes remain stateless.

Configure database replication across deployment regions.

Deploy Grafana nodes to App Services in a highly available configuration across zones within a

region, using container based deployments.

Deploy App Service instances across considered deployment regions.

App Services provides a low-friction container platform, which is ideal for low-scale scenarios

such as operational dashboards, and isolating Grafana from AKS provides a clear separation of

concern between the primary application platform and operational representations for that

platform. Please refer to the Application Platform design area for further configuration

recommendations.

Use Azure Storage in a GRS configuration to host and manage custom visuals and plugins.

Deploy app service and database read-replica Grafana components to a minimum of two

deployment regions, and consider employing a model where Grafana is deployed to all considered

deployment regions.

For scenarios targeting a >= 99.99% SLO, Grafana should be deployed within a minimum of 3 deployment

regions to maximize overall reliability for key operational dashboards.

Mitigate Log Analytics query limits by aggregating queries into a single or small number of queries, such

as by using the KQL 'union' operator, and set an appropriate refresh rate on the dashboard.

An appropriate maximum refresh rate will depend on the number and complexity of dashboard

queries; analysis of implemented queries is required.
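The effect of a refresh rate on query volume can be illustrated with a small time-to-live cache: however many viewers open a dashboard, the backing query runs at most once per refresh interval. This is a sketch under stated assumptions rather than a feature of Grafana or Log Analytics. `CachedQuery` and `run_query` are hypothetical names, the class is not thread-safe, and `run_query` stands in for whatever function executes the aggregated KQL query.

```python
import time
from typing import Any, Callable


class CachedQuery:
    """Serve one query result to many dashboard viewers for ttl_seconds."""

    def __init__(
        self,
        run_query: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = float("-inf")  # forces a query on first use
        self._value: Any = None

    def get(self) -> Any:
        now = self._clock()
        if now >= self._expires_at:
            # Only the first caller after expiry reaches the backend.
            self._value = self._run_query()
            self._expires_at = now + self._ttl
        return self._value


backend_calls = []


def run_health_query() -> str:
    """Stand-in for the real query; records how often the backend is hit."""
    backend_calls.append(1)
    return "health-score-rows"


panel = CachedQuery(run_health_query, ttl_seconds=30)
for _viewer in range(50):
    panel.get()
print(len(backend_calls))
```

Fifty viewers inside one 30-second window cost a single backend query. The trade-off is staleness: viewers may see data that is up to one interval old, which is why the appropriate interval depends on the queries involved.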

If the concurrent query limit of Log Analytics is being reached, consider optimizing the retrieval pattern by (temporarily) storing the data required for the dashboard in a high performance datastore such as Azure SQL.

Automated incident response

While the visual representations of application health provide invaluable operational and business insights to support issue detection and diagnosis, they rely on the readiness and interpretations of operational teams, as well as the effectiveness of subsequent human-triggered responses. Therefore, to maximize reliability it's necessary to implement extensive alerting to proactively detect and respond to issues in near real-time.

Azure Monitor provides an extensive alerting framework to detect, categorize, and respond to operational signals through Action Groups. This section will therefore focus on the use of Azure Monitor alerts to drive automated actions in response to current or potential deviations from a healthy application state.

IMPORTANT

Alerting and automated action is critical to effectively detect and swiftly respond to issues as they happen, before greater negative impact can occur. Alerting also provides a mechanism to interpret incoming signals and respond to prevent issues before they occur.

Design considerations

Alert rules are defined to fire when a conditional criterion is satisfied for incoming signals, which can include various data sources, such as metrics, log search queries, or availability tests.

Alerts can be defined within Log Analytics or Azure Monitor on the specific resource.

Some metrics are only interrogatable within Azure Monitor, since not all diagnostic data points are made available within Log Analytics.

The Azure Monitor Alerts API can be used to retrieve active and historic alerts.

There are subscription limits related to alerting and action groups, which must be designed for:

Limits exist for the number of configurable alert rules.

The Alerts API has throttling limits, which should be considered for extreme usage scenarios.

Action Groups have several hard limits for the number of configurable responses, which must be designed for.

Each response type has a limit of 10 actions, apart from email, which has a limit of 1,000 actions.

Design recommendations

Alerts can be integrated within a layered health model by creating an Alert Rule for a saved log search

query from the model's 'root' scoring function. For example, using 'WebsiteHealthScore' and alerting on

a numeric value that represents an 'Unhealthy' state.

For resource-centric alerting, create alert rules within Azure Monitor to ensure all diagnostic data is

available for the alert rule criteria.

Consolidate automated actions within a minimal number of Action Groups, aligned with service teams to

support a DevOps approach.

Respond to excessive resource utilization signals through automated scale operations, using Azure-native auto-scale capabilities where possible. Where built-in auto-scale functionality isn't applicable, use the component health score to model signals and determine when to respond with automated scale operations. Ensure automated scale operations are defined according to a capacity model, which quantifies scale relationships between components, so that scale responses encompass components that need to be scaled in relation to other components.

Model actions to accommodate a prioritized ordering, which should be determined by business impact.

Use the Azure Monitor Alerts API to gather historic alerts to incorporate within 'cold' operational storage for advanced analytics.

For critical failure scenarios, which can't be met with an automated response, ensure operational 'runbook automation' is in place to drive swift and consistent action once manual interpretation and sign-off is provided. Use alert notifications to drive swift identification of issues requiring manual interpretation.

Create allowances within engineering sprints to drive incremental improvements in alerting to ensure new failure scenarios that haven't previously been considered can be fully accommodated within new automated actions.

Conduct operational readiness tests as part of CI/CD processes to validate key alert rules for deployment updates.

Predictive action and AI operations (AIOps)

Machine learning models can be applied to correlate and prioritize operational data, helping to gather critical

insights related to filtering excessive alert 'noise' and predicting issues before they cause impact, as well as

accelerating incident response when they do.

More specifically, an AIOps methodology can be applied to derive critical insights about the behavior of the system, users, and DevOps processes. These insights can include identifying a problem happening now (detect), quantifying why the problem is happening (diagnose), or signaling what will happen in the future (predict). Such insights can be used to drive actions that adjust and optimize the application to mitigate active or potential issues, using key business metrics, system quality metrics, and DevOps productivity metrics, to prioritize according to business impact. Conducted actions can themselves be infused into the system through a feedback loop that further trains the underlying model to drive additional efficiencies.

There are multiple analytical technologies within Azure, such as Azure Synapse and Azure Databricks, which can

be used to build and train analytical models for AIOps. This section will therefore focus on how these

technologies can be positioned within an application design to accommodate AIOps and drive predictive action,

focusing on Azure Synapse, which reduces friction by bringing together the best of Azure's data services
along with powerful new features.

Design considerations

AIOps is used to drive predictive action, interpreting and correlating complex operational signals observed over
a sustained period in order to better respond to and prevent issues before they occur.

Azure Synapse Analytics offers multiple Machine Learning (ML) capabilities.

ML models can be trained and run on Synapse Spark Pools with libraries including MLlib, SparkML,
and MMLSpark, as well as popular open-source libraries, such as scikit-learn.

ML models can be trained with common data science tools like PySpark/Python, Scala, or .NET.

Synapse Analytics is integrated with Azure ML through Azure Synapse Notebooks, which enables ML

models to be trained in an Azure ML Workspace using Automated ML.

Synapse Analytics also enables ML capabilities using Azure Cognitive Services to solve general problems

in various domains, such as Anomaly Detection. Cognitive Services can be used in Azure Synapse, Azure

Databricks, and via SDKs and REST APIs in client applications.

Azure Synapse natively integrates with Azure Data Factory tools to extract, transform, and load (ETL) or

ingest data within orchestration pipelines.

Azure Synapse enables external dataset registration to data stored in Azure Blob storage or Azure Data

Lake Storage.

Registered datasets can be used in Synapse Spark pool data analytics tasks.

Azure Databricks can be integrated into Azure Synapse Analytics pipelines for additional Spark

capabilities.

Synapse orchestrates reading data and sending it to a Databricks cluster, where it can be transformed

and prepared for ML model training.

Source data typically needs to be prepared for analytics and ML.

Synapse offers various tools to assist with data preparation, including Apache Spark, Synapse

Notebooks, and serverless SQL pools with T-SQL and built-in visualizations.

ML models that have been trained, operationalized, and deployed can be used for batch scoring in
Synapse.

AIOps scenarios, such as running regression or degradation predictions in CI/CD pipelines, may
require real-time scoring.

There are subscription limits for Azure Synapse, which should be fully understood in the context of an

AIOps methodology.

To fully incorporate AIOps it's necessary to feed near real-time observability data into real-time ML

inference models on an ongoing basis.

Capabilities such as anomaly detection should be evaluated within the observability data stream.
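As an illustration of evaluating anomaly detection within an observability data stream, the following minimal sketch flags points that deviate sharply from a rolling baseline. It uses a simple rolling z-score as a stand-in for a trained model or the Anomaly Detection service, and isn't an implementation of either. The function name, window size, threshold, and sample latency values are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indexes of points that deviate from the rolling baseline.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` accepted points.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            # A perfectly flat baseline (spread of 0) can't be scored.
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append(index)
                continue  # Keep the outlier out of the baseline.
        history.append(value)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99]
print(detect_anomalies(latencies_ms))
```

In a real AIOps design, the scoring step would typically be replaced by a real-time inference endpoint, with the flagged points feeding the health model.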

Design recommendations

Ensure all Azure resources and application components are fully instrumented so that a complete

operational dataset is available for AIOps model training.

Ingest Log Analytics operational data from the global and regional Azure Storage Accounts into Azure

Synapse for analysis.

Use the Azure Monitor Alerts API to retrieve historic alerts and store them within cold storage for operational
data, for subsequent use within ML models. If Log Analytics data export is used, store historic alert data
in the same Azure Storage accounts as the exported Log Analytics data.

After ingested data is prepared for ML training, write it back out to Azure Storage so that it's available for

ML model training without requiring Synapse data preparation compute resources to be running.

Ensure ML model operationalization supports both batch and real-time scoring.

As AIOps models are created, implement MLOps and apply DevOps practices to automate the ML
lifecycle for training, operationalization, scoring, and continuous improvement. Create an iterative CI/CD
process for AIOps ML models.

Evaluate Azure Cognitive Services for specific predictive scenarios due to their low administrative and
integration overhead. Consider Anomaly Detection to quickly flag unexpected variances in observability
data streams.

Next step

Review the deployment and testing considerations.

Deployment and testing

Deployment and testing for mission-critical workloads on Azure

12/16/2022 • 29 minutes to read

Application outages are often caused by failed deployments or erroneous releases. This is why the design of
Continuous Integration and Continuous Deployment (CI/CD) pipelines, including their deployment and testing
methodologies, plays a critical role in the overall reliability of a mission-critical application.

Deployment and testing shouldn't be constrained to the delivery of planned application updates, but instead
should form the basis for how all application and infrastructure operations are conducted to ensure consistent
outcomes for mission-critical workloads. The variety of deployment contexts results in a frequent, and often
daily, deployment cadence. Because CI/CD pipelines perform a critical operational function for a
mission-critical application, designing them for maximum reliability is highly recommended. The
strategy should aspire to:

Rigorous pre-release testing. Updates shouldn't introduce defects, vulnerabilities, or other factors that
might jeopardize application health.

Transparent deployments. All clients and users should be able to continue application interaction without
interruption using a zero-downtime deployment approach.

Highly available operations. Deployment and testing processes must themselves be highly available to
support overall application reliability.

End-to-end automation. Manual intervention in the technical execution of deployment and testing
operations represents a significant reliability risk.

Consistent deployment process. The same application artifacts and processes are used to deploy the
infrastructure and application code across different environments.

This design area focuses on how to eradicate downtime and maintain application health for deployment
operations, providing key considerations and recommendations intended to inform the design of optimal CI/CD
pipelines for a mission-critical application.

IMPORTANT

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The code assets illustrate
considerations and recommendations for achieving optimal CI/CD pipelines for a mission-critical application.

Application environments

Before considering deployment processes and associated tooling, it's important to evaluate the application

environments required to appropriately validate and stage deployment operations. These environment types

will most likely differ in terms of requisite capabilities and longevity. Some environments might reflect

production on a permanent basis, others may be short lived with a reduced level of complexity. These
environments should be staged during the engineering and release cycle in order to ensure deployment
operations are fully tested before being released into the production environment.

This section explores the key considerations and recommendations for application environments in a
mission-critical context, covering key design objectives such as developer agility and separation of concerns.

Design considerations

Development environments

Development environments will typically not share the same reliability, capacity, and security

requirements as the production environment.

Given the reduced scale, reliability, and security requirements of a development environment, they can

more easily coexist within a single subscription.

It's likely that engineering teams will require multiple development environments to support the

completion of parallel feature development.

Development environments must be available when required, but need not exist permanently and

typically only exist for short periods of time. Keeping environments short lived saves costs, and prevents

configuration drift from the code base. Also, development environments often share the lifecycle of a

feature branch.

Development environments can also encompass the development of Infrastructure-as-Code (IaC)

artifacts such as Terraform or Azure Resource Manager (ARM) templates.

Staging environments

Staging environments can vary depending on their intended function within the release cycle. They
typically resemble the requirements of the production environment for reliability, capacity, and security.

Staging environments can be used for various purposes, but the focus is testing and validation, with a
multitude of test cycles considered, such as:

Load and performance testing.

Chaos testing.

Integration and build verification testing.

User acceptance testing.

Security and penetration testing.

Different test functions can be performed within the same environment, and in some cases this will be

required. For example, for chaos testing to provide meaningful results, the application must first be

placed under load to be able to understand how the application responds to injected faults. Chaos testing

and load testing are therefore typically performed in parallel.

During the early development cycles and in the absence of a production load, a constant synthetic user load
against an environment should be used to provide realistic metrics and valuable health modeling
input.

Production environments

Some applications may consider multiple different production environments to cater to different clients,

users, or business functionality.

Design recommendations

Ensure all environments reflect the production environment as much as possible, with simplifications
applied for lower environments as necessary.

Keep production environments separate from lower environments in a dedicated subscription. This
helps to ensure resource utilization in lower environments doesn't impact production quotas, and
provides a clear governance boundary and separation of concerns. Depending on the scale requirements
of the application, multiple production subscriptions might be needed to serve as scale-units.

Separate development environments within a distinct subscription context, with all development
environments sharing the same subscription.

Ensure that there's an automated process to deploy code from a feature branch to a development
environment.

Treat development environments as ephemeral, sharing the lifecycle of the associated feature branch.

Define the number of staging environments and their purpose within the development and release cycle.

Avoid sharing components between environments. Possible exceptions are downstream security
appliances like firewalls, or source locations for synthetic test data.

Ensure at least one staging environment is fully reflective of production to enable production-like testing
and validation.

Capacity within this pre-production environment can flex based on the execution of test activities.

Constant synthetic user load generation is required to provide a realistic test case for changes
on one of the pre-production environments.

The Mission Critical Online reference implementation provides an example user load generator.

Demo video: Ephemeral dev environments and automated feature validation

Ephemeral blue/green deployments

A blue/green deployment approach requires a minimum of two identical deployment contexts, where an

existing deployment (blue) is actively serving user traffic, and a new secondary deployment (green) is

established and made ready to receive traffic.

After the new deployment is completed and tested, traffic is gradually switched from the blue deployment to

the green.

If the load transfer is successful, that new deployment becomes the new 'active' production environment and

the old, now 'inactive' deployment can be decommissioned.

If there are issues within the new deployment environment, the deployment can be aborted and traffic can
either remain in the old 'active' deployment, or be directed back to it. This provides a clear fallback plan and
minimizes potential for reliability issues, such as having to cut production traffic to rectify faulty
deployment issues.

To achieve zero interruptions while performing deployments to a mission-critical application, it's strongly
recommended to adopt a blue/green deployment approach for production environments with ephemeral
resources.

This allows new application code or resources to be deployed and tested in a new parallel environment, with
traffic only transitioned once ready in a phased process before subsequently decommissioning the old
environment.

Design considerations

A blue/green deployment can be implemented at either an application level or at the infrastructure level.

Application level: New code is deployed to a staging location within the existing infrastructure.

For example, Azure App Service provides this capability through secondary deployment slots that can
be swapped after the deployment, while in AKS this can be achieved using a separate pod deployment
on each node and updating the service definition.

This approach incurs lower cost and is faster than a full infrastructure-level blue/green deployment.

Infrastructure level: A deployment containing all infrastructure and application components within a
deployment scope.

New Azure resources are established before subsequently deploying application code to the new
infrastructure. When the new deployment has been fully tested and validated, traffic can be
transitioned through a phased process, and the old infrastructure can then be decommissioned when
appropriate.

The advantage of this approach is that all changes within the deployment scope are fully deployed and
tested in production before traffic is transitioned between the environments. This approach is also
much safer for any infrastructure-level changes within a release.

Individual deployments may take longer to complete using this methodology since it takes longer to
deploy the infrastructure and application than deploying the application in isolation.

There's an additional cost associated with an infrastructure-based approach, since two deployment
contexts must exist side by side until the deployment is fully complete.

IMPORTANT

This infrastructure blue/green approach allows all changes within a deployment scope, both to the infrastructure
and application, to be achieved with zero downtime and maximum confidence. In addition, all compatibilities with
downstream dependencies such as the Azure platform, resource providers, or IaC modules can be validated.

The blue and green environments can be long-lived and reused for each deployment, or treated as
ephemeral, with new infrastructure deployed for each new deployment.

At an infrastructure level, the orchestration of user traffic between the blue and green environments can

be controlled using a global load balancer, such as Azure Front Door.

Design recommendations

Utilize a blue/green deployment approach to release all production changes.

Prioritize an infrastructure-level approach in order to achieve zero-downtime deployments and
provide one consistent deployment strategy for any kind of change (application-level and/or
infrastructure-level).

Use a global load balancer to orchestrate the automated transition of user traffic
between the blue and green environments.

Add a green backend endpoint using a low traffic volume/weight, such as 10%.

After verifying that the low traffic volume on green is being handled as expected and application
health is maintained, gradually increase the traffic in increments until it reaches 100%.

When increasing traffic, apply a short ramp-up period to catch faults which may not come
to light immediately.

After all traffic has been migrated to the new green environment, remove the blue backend from
the global load balancer service.

Decommission the old and inactive blue environment.

Repeat the process for the next deployment with blue and green reversed.

While blue and green environments can be reused, it's strongly recommended to deploy new

infrastructure for each new deployment. Treat each regional deployment stamp as ephemeral with a

lifecycle tied to that of a single release.

Decommission the old and inactive 'blue' environment, ensuring that any connections established while

this environment was active are also closed and any queues are drained before removing associated

resources. This will save costs relative to maintaining secondary production infrastructure and will ensure

new environments are free of configuration drift.

To prevent downtime, the process to control the transition of traffic between environments should be

fully automated.

Phase the transition of traffic between the blue/green environments to minimize client and user exposure

whilst confidence is established in the new environment.

Allow for a short ramp-up period when transitioning traffic between blue/green environments in order to

catch faults which may not come to light immediately.
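The phased traffic transition recommended above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the reference implementation: `shift_traffic`, the `is_healthy` callback, the weight steps, and the soak time are all hypothetical. A real pipeline would update backend weights on the global load balancer and query the application health model instead.

```python
import time


def shift_traffic(is_healthy, steps=(10, 25, 50, 100), soak_seconds=0.0):
    """Move traffic from blue to green in phases and return the history.

    `is_healthy(green_weight)` is a hypothetical callback standing in for
    a health-model query. On the first failed check, all traffic is sent
    back to blue and the rollout stops.
    """
    history = []
    for green_weight in steps:
        history.append({"blue": 100 - green_weight, "green": green_weight})
        time.sleep(soak_seconds)  # Ramp-up period so slow faults can surface.
        if not is_healthy(green_weight):
            history.append({"blue": 100, "green": 0})  # Automated fallback.
            return history
    return history


# Green stays healthy: traffic reaches 100% and blue can be decommissioned.
print(shift_traffic(lambda weight: True)[-1])

# Green degrades once it takes half the load: traffic returns to blue.
print(shift_traffic(lambda weight: weight < 50)[-1])
```

Keeping the whole transition in one automated routine reflects the recommendation that no manual step sits between a failed health check and the fallback to blue.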

Example - Zero-downtime deployment

Achieving zero-downtime deployments is a fundamental goal of a mission-critical application. However, it's a

complex issue that requires significant engineering investment and greatly influences the overall design. It's

important to invest effort up-front to define and plan deployment processes, to drive key design decisions such

as whether to treat resources as ephemeral.

The Mission-Critical Online and Azure Mission-Critical Connected reference implementations serve as practical

examples for these concepts and recommendations, to establish an optimized zero-downtime deployment

approach as represented in the illustration below.

Infrastructure-as-Code deployments

Infrastructure-as-Code (IaC) treats infrastructure definitions as source code that is version controlled alongside
other application artifacts. Using IaC ensures code consistency across environments, eliminates the risk of
human error during automated deployments, and provides traceability and rollback.

Design considerations

The recommended approach for infrastructure-level blue/green deployment is to use IaC, with fully automated
and consistent infrastructure deployments.

Typically a mission-critical IaC repository has two resource definitions:

Global resources: those that are deployed once within the solution, such as Azure Front Door and
Azure Cosmos DB.

Regional (stamp) resources: those that are deployed multiple times and into different regions, such as
AKS, Event Hubs, and Storage.

Design recommendations

Apply the concept of IaC and ensure all Azure resources are defined in declarative templates and

maintained in a source control repository from where they can be deployed automatically using CI/CD

pipelines.

Define infrastructure artifacts as declarative templates, and not as imperative scripts.

Ensure the deployment of both infrastructure and application components are fully automated.

Prohibit manual operations against production and lower environments. The only exception should be fully
independent developer environments.
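To make the split between global and regional (stamp) definitions concrete, the following minimal sketch expands the two definitions into the set of resources that a single release would need. It only models the idea in data and isn't an IaC template. The resource keys, the naming scheme, and the `plan_deployment` helper are assumptions for illustration.

```python
# Deployed once for the whole solution (names are illustrative).
GLOBAL_RESOURCES = ["frontdoor", "cosmosdb"]

# Deployed once per region, as part of an ephemeral stamp.
STAMP_RESOURCES = ["aks", "eventhubs", "storage"]


def plan_deployment(release, regions):
    """Expand the two definitions into the resources one release needs."""
    plan = [f"global-{name}" for name in GLOBAL_RESOURCES]
    for region in regions:
        # Stamp names include the release so each release gets new stamps.
        plan += [f"{release}-{region}-{name}" for name in STAMP_RESOURCES]
    return plan


for resource in plan_deployment("r42", ["westeurope", "northeurope"]):
    print(resource)
```

Tying stamp names to the release identifier is one way to treat each stamp as ephemeral, while the global resources keep a stable name across releases.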

DevOps tooling

There are a myriad of different products and services that can provide the necessary DevOps capabilities to
effectively deploy and manage a mission-critical application; Microsoft provides two Azure-native toolsets
through GitHub Actions and Azure DevOps (ADO) Pipelines.

The appropriate and effective use of deployment tooling is critical to ensure overall reliability for an

application, particularly because DevOps processes provide such a significant function within the overall

application design. For example, failover and scale operations may depend on automation provided by DevOps

tooling. Deployment tooling must therefore be implemented in a reliable and highly available manner, with

engineering teams understanding the application impact if the deployment service, or parts of it, become

unavailable.

This section focuses on the optimal use of GitHub Actions and Azure Pipelines and decision factors influencing

the optimal selection of DevOps tooling.

Design considerations

The capabilities of GitHub Actions and Azure DevOps (ADO) Pipelines are largely overlapping.

Different technologies can be used in parallel to take advantage of the best features of each.

A common approach is to hold code repositories in GitHub.com or GitHub AE whilst using the
deployment pipelines in ADO.

The use of multiple technologies adds an element of complexity and impacts
the risk landscape.

Azure Pipelines

Azure Pipelines provides highly mature deployment pipelines, including features like gates and

approvals.

ADO instances are hosted in a single Azure region that is chosen at organization-level.

Data is replicated across regions but only for Disaster Recovery purposes.

Hosted build agents are utilized from the same region as the ADO instance.

In the context of mission-critical aspiration for maximum reliability, the dependency on a single Azure

region represents an operational risk. For example, consider a scenario where traffic is spread over West

Europe and North Europe, with West Europe hosting the ADO instance. If West Europe experiences an

outage, the ADO instance would also be affected. While North Europe would automatically now handle

all application traffic, the ability to deploy additional scale-units to North Europe, in order to provide a

consistent failover experience, would be blocked, which may result in a severely degraded application
experience until the issue is resolved.

GitHub Actions

GitHub.com is well known and adopted by developers and used for many open source projects.

GitHub.com instances are also hosted in a single Azure region.

Data is replicated across regions but only for Disaster Recovery purposes.

A private and dedicated GitHub AE offering is available in a limited public preview.

GitHub Actions is well-suited for build-related tasks (Continuous Integration).

GitHub Actions is less mature when it comes to deployment tasks (Continuous Deployment).

Templating and reuse of pipeline steps is limited.

Gates and approval options are limited.

Options to control pipeline execution, such as the exclusion of specific stages, are missing.

Design recommendations

Define an availability SLA for deployment tooling and ensure alignment with broader application
reliability requirements.

In a multi-region scenario with an active-passive or active-active application deployment configuration,
ensure that failover orchestration and scaling operations can continue to function even if the primary
region hosting deployment toolsets becomes unavailable.

Branching strategy

Branching strategies are a fundamental aspect of application source control, and while there are many valid
approaches to branching, there are several key aspects that should be considered to ensure maximum
reliability for mission-critical workloads.

Design considerations

Developers will carry out their work in feature/* and fix/* branches, and these are the entry points for
changes.

Restrictions can be applied to branches as part of the branching strategy, such as only allowing

administrators to create release branches, or enforcing naming conventions for branches.

There might be rare occasions where a hotfix is urgently required and applied directly into an existing
production environment. Examples of potential hotfixes include critical security updates or remediation of
issues breaking the user experience. Typically, these hotfixes are created on a fix/* branch and merged
into the release branch. It's essential that the change is brought into main as soon as practical so that it is
part of future releases and the issue doesn't reoccur. This process must only be used with restraint, for
small changes addressing urgent issues.
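The branch restrictions described above, such as naming conventions and limiting who can create release branches, can be sketched as a simple policy check. This is a hypothetical illustration of the rule logic only. The pattern, function names, and the `is_admin` flag are assumptions, and in practice such rules would be enforced through the branch policies of the source control platform.

```python
import re

# Hypothetical naming convention built on the prefixes used in this section.
BRANCH_PATTERN = re.compile(r"(feature|fix|release)/[a-z0-9][a-z0-9._-]*")


def is_valid_branch(name):
    """Return True when the branch name follows the naming convention."""
    return BRANCH_PATTERN.fullmatch(name) is not None


def can_create(name, is_admin):
    """Allow release branches for administrators only."""
    if not is_valid_branch(name):
        return False
    return is_admin or not name.startswith("release/")


print(can_create("feature/login-page", is_admin=False))
print(can_create("release/1.2.0", is_admin=False))
print(can_create("release/1.2.0", is_admin=True))
```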

Design recommendations

Prioritize the use of GitHub for source control.

IMPORTANT

Create a branching strategy that details feature work and releases as a minimum, using branch policies and
permissions to ensure the strategy is appropriately enforced.

When feature branch changes are pushed to origin, trigger an automated testing process to validate the
legitimacy of code contributions before any Pull Request (PR) can be completed.

Ensure any PR requires the review of at least one other team member before merging.

It's recommended to treat the main branch as a continuously forward moving and stable branch,
primarily used for integration testing.

Ensure changes are only made to main via PRs, using a branch policy to prohibit direct commits.

Every time a PR is merged into main, it should automatically kick off a deployment against an
integration environment.

main should be considered stable and safe to create a release from at any given time.

It's recommended to consider the use of dedicated release/* branches, created from the main branch and
used to deploy to production environments.

release/* branches should remain in the repository and can be used to patch a release.

Define and document a hotfix process and apply it only when needed.

Create hotfixes in a fix/* branch for subsequent merging into the release branch and deployment to
production.

Ensure any changes are brought into main as soon as practical so that they're reflected in all future
releases, to avoid reoccurrence of the issue.

A hotfix process should only be used with restraint for small changes addressing urgent issues; almost
all operational issues should follow the standard operating procedure and CI/CD DevOps processes.

Container registry

Container registries are a key aspect for any containerized application, providing hosting for container images

deployed to container runtime environments, such as AKS. There's a wide variety of container registry

technologies available that predominantly rely on the Docker-provided format and standards for both push and

pull operations.

This section will therefore examine the optimal configuration of container registries, focusing on the native

Azure Container Registry service, while also exploring the trade-offs associated with centralized and federated

deployment models.

Design considerations

Because most container registry solutions rely on the Docker-provided format and standards for both
push and pull operations, they're broadly compatible and mostly interchangeable.

Container registries can be deployed either as a centralized service that is shared and
consumed by numerous applications within an organization, or as a separate application component
dedicated to a specific application workload.

Some application scenarios will require public container images be replicated within a private container

registry to limit egress traffic, increase availability, or avoid potential throttling.

Public registries - Docker Hub

Container images stored on Docker Hub, or other public registries, exist outside of Azure and a given virtual

network. This isn't necessarily a problem, but in certain scenarios can lead to various potential issues where

service unavailability, throttling and data exfiltration are concerned.

Azure Container Registry (ACR)

Azure Container Registry (ACR) provides an Azure-native service with a range of features including
geo-replication, Azure AD authentication, automated container building, and patching using ACR tasks.

ACR supports high availability through Geo-replication to multiple configured regions, providing

resiliency against regional outage. If a region becomes unavailable, the other regions will continue to

serve image requests, and when the region returns to health the ACR will recover and replicate changes

to it.

This capability also provides registry colocation within each configured region, reducing network

latency and cross-region data transfer costs.

Within Azure regions that provide Availability Zone support, the Premium ACR tier supports Zone

Redundancy to protect against zonal failure.

Tagged ACR images are mutable by default, meaning that the same tag can be used on multiple images

pushed to the registry.

In production scenarios, this may lead to unpredictable behavior that could impact application uptime.

ACR supports locking an image version or a repository to prevent changes or deletes.

Image locking mitigates multiple failure scenarios and also protects against a previously deployed
image version being changed in place, which would introduce the risk that same-version deployments
have different results before and after such a change.

Locking container images doesn't protect against the ACR instance being deleted, but Azure Resource
Locks can be used to achieve this.

ACR in the Premium tier also offers support to restrict a container registry to a given set of virtual networks
and subnets through Private Endpoints.
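The risk of mutable tags, and the effect of locking, can be illustrated with a toy in-memory registry. This sketch is a conceptual model only and doesn't use or mirror the ACR API. The class, method names, and digests are hypothetical.

```python
class TagLockedError(Exception):
    """Raised when a push targets a locked tag."""


class ToyRegistry:
    """In-memory stand-in for a container registry (not the ACR API)."""

    def __init__(self):
        self._tags = {}       # tag -> image digest
        self._locked = set()  # tags that can no longer be overwritten

    def push(self, tag, digest):
        if tag in self._locked:
            raise TagLockedError(f"{tag} is locked")
        self._tags[tag] = digest  # Mutable by default: silently overwrites.

    def lock(self, tag):
        self._locked.add(tag)

    def pull(self, tag):
        return self._tags[tag]


registry = ToyRegistry()
registry.push("app:1.0", "sha256:aaa")
registry.push("app:1.0", "sha256:bbb")  # Same tag now points to another image.
print(registry.pull("app:1.0"))

registry.lock("app:1.0")
try:
    registry.push("app:1.0", "sha256:ccc")
except TagLockedError as error:
    print(error)
```

The second push shows how two deployments of the "same" version could pull different images. After the lock, the overwrite is rejected instead.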

Design recommendations

For a mission-critical application, employ container registry instances that are dedicated to the

application workload. Avoid taking a dependency on a centralized service unless availability and

reliability requirements are in full alignment with the application.

When using container registries outside Azure, ensure that the provided SLA is aligned with the reliability

and security targets. Take special note of throttling limits in cases such as when relying on Docker Hub.

Use Azure Container Registry to host container images.

Azure Container Registry (ACR)

Treat container registries as 'global resources' with a sustained lifecycle ('long-living'). Consider a single

global container registry per environment, such as the use of a global production registry.

Configure geo-replication to all considered deployment regions in order to remove regional

dependencies and optimize latency.

Images should be hosted geographically as close as possible to the consuming compute resources,

within the same Azure regions.

Prioritize regions with Availability Zone support to take advantage of zonal redundancy capabilities.

Use Azure AD integrated authentication to push and pull images instead of relying on access keys.

For optimal security, fully disable the use of the admin access key.

Secret management

Secret management is a key technical domain in the context of both security and reliability, since the secret
management solution for a mission-critical application must provide requisite security and also offer an
appropriate level of availability to align with maximum reliability aspirations.

Design considerations

There's a multitude of key and secret management solutions available that can be used on Azure.

Azure Key Vault provides a fully managed Azure-native PaaS solution.

Provides native integration with Azure services out-of-the-box.

Supports Availability Zone deployments and multi-region redundancy.

Offers direct integration with Azure AD for authentication and authorization.

Many Azure services already support Azure AD authentication instead of relying on connection strings /

keys. Doing so greatly reduces the need to manage secrets in the first place.

There are three common approaches applied to define at what point secrets must be read from the selected

secret store and injected into the application:

Deployment-Time Retrieval

Retrieving secrets at deployment time provides the advantage that the secret management solution only

needs to be available at deployment time, since there are no direct dependencies after this point. For

example, injecting secrets as environment variables into a Kubernetes deployment or into a Kubernetes

secret.

Design recommendations

Only the deployment service principal needs to be able to access secrets, which simplifies RBAC

permissions within the secret management system. It does, however, introduce additional RBAC

considerations within DevOps tooling around controlling service principal access and the application in

terms of protecting retrieved secrets.

This method introduces a trade-off: the security benefits of the secret management solution aren't

being utilized, because this design approach relies solely on the access control within the application

platform to keep secrets safe.

Secret updates or rotation will require a full redeployment in order to take effect.
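
As a minimal sketch of deployment-time injection, the following builds a Kubernetes Secret manifest from values that a pipeline has already read from the secret store. The function name `build_k8s_secret`, the secret name, the namespace, and the sample value are hypothetical. Kubernetes expects values under `data` to be base64-encoded, which is an encoding and not encryption, so protection of the result relies on access control within the platform, as noted above.

```python
import base64
import json


def build_k8s_secret(name: str, namespace: str, secrets: dict) -> dict:
    """Build a Kubernetes Secret manifest from plain-text values.

    Each value under `data` must be base64-encoded.
    """
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in secrets.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": encoded,
    }


# Hypothetical value that the deployment pipeline read from the secret store.
retrieved = {"DB_CONNECTION": "Server=example;Password=not-a-real-secret"}
manifest = build_k8s_secret("app-secrets", "workload", retrieved)
print(json.dumps(manifest, indent=2))
```

Because the manifest is generated once per deployment, a rotated secret only reaches the application after the pipeline runs again.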

Application start-up retrieval

Retrieving and injecting secrets at application startup provides the benefit that secrets can more easily be

updated or rotated. A restart of the application is required to fetch the latest value.

This method ensures that secrets do not need to be stored on the application platform but can be held in

memory only.

For AKS, available implementations for this approach include the Secrets Store CSI Driver for Azure Key Vault

and akv2k8s.

A native Azure solution, Azure Key Vault referenced App Settings, is also available.

The disadvantage of this approach is that it creates a runtime dependency on the secret management

solution. If the secret management solution experiences an outage, application components already

running may be able to continue serving requests; however, any restart or scale-out operations will likely

result in failure.

Runtime retrieval

Retrieving secrets at runtime from within the application itself serves as the most secure approach since

even the application platform never has access to secrets.

Application components require a direct dependency and a connection to the secret management system.

This makes it harder to test components individually and usually requires the use of an SDK.

The application itself needs to be able to authenticate to the secret management system. For AKS the

latter can be achieved using Pod-managed Identities, but that is currently (as of August 2021) still in

preview.

Where possible, use Azure AD authentication to connect to other services instead of using connection

strings or keys. Use this in conjunction with Azure Managed Identities to remove the need for any secrets

to be stored on the application platform.

Use Azure Key Vault to store all application secrets.

Azure Key Vault instances should be deployed as part of a regional stamp to mitigate the potential impact

of a failure to a single deployment stamp. Global resources, such as Front Door (for certificate storage, if

needed), should use a separate Azure Key Vault instance dedicated to global resources, rather than using

one of the regional Key Vault instances.

Use Managed identities instead of service principals to access Key Vault whenever possible.

Secrets should be retrieved at application startup, not during deployment time or at runtime.

Implement coding patterns so that when an authorization failure occurs at runtime, secrets are re-retrieved.

Apply a fully automated key-rotation process that runs periodically within the solution.
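
The recommendation to re-retrieve secrets after an authorization failure can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: `AuthorizationError`, `SecretCache`, and `call_with_secret` are hypothetical names, and `fetch_secret` stands in for whatever call reads the current value from the secret store.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class AuthorizationError(Exception):
    """Hypothetical error raised when a downstream service rejects a credential."""


class SecretCache:
    """Holds a secret in memory and re-reads it from the store on demand."""

    def __init__(self, fetch_secret: Callable[[], str]) -> None:
        self._fetch_secret = fetch_secret
        self._value: Optional[str] = None

    def get(self) -> str:
        # Read from the store only on first use; afterwards serve from memory.
        if self._value is None:
            self._value = self._fetch_secret()
        return self._value

    def refresh(self) -> str:
        # Force a new read, for example after the secret was rotated.
        self._value = self._fetch_secret()
        return self._value


def call_with_secret(cache: SecretCache, operation: Callable[[str], T]) -> T:
    """Run `operation` with the cached secret. On an authorization failure,
    re-retrieve the secret once and retry; a second failure propagates."""
    try:
        return operation(cache.get())
    except AuthorizationError:
        return operation(cache.refresh())
```

Retrying only once keeps a genuinely invalid credential from causing a retry loop against the secret store.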

Continuous validation and testing

Design considerations

Use key near-expiry notifications to get alerted on upcoming expiration.

As previously stated, testing is a fundamental activity for any mission-critical solution, to fully validate the health

of both the application code and infrastructure. More specifically, to satisfy desired standards for reliability,

performance, availability, security, quality, and scale, testing must be well defined and applied as a core

component of the application design and DevOps methodologies.

Testing is a key concern for both the local developer experience ("Inner Loop") and the complete DevOps

lifecycle ("Outer Loop"), which captures when developed code begins release pipeline processes on its journey

to a production environment.

The scope of this section focuses on testing conducted within the outer loop for a product release, considering

various test scenarios, such as unit, build, static, security, integration, regression, UX, performance, capacity and

failure injection (chaos). The order of conducted tests is also a critical consideration due to various

dependencies, such as the need to have a running application environment.

With high degrees of deployment automation, automated testing is essential to validate application or

infrastructure changes in a timely and repeatable manner.

The purpose of testing is ultimately to detect errors and issues before they reach production

environments, and there are various methods that are required to holistically achieve this goal.

Unit testing

Unit testing is intended to confirm that application business logic works as expected and to improve confidence

in the overall effect of code changes.

Unit testing is typically considered as part of the Inner Loop and as such isn't a primary focus for this

section.

Smoke testing

Smoke testing is used to identify whether infrastructure and application components are available and

act as expected. A smoke test focuses on functionality rather than performance under load. Typically only

a single virtual user session is tested.

Common smoke testing scenarios include: interrogating the HTTPS endpoint of a web application,

querying a database, and simulating a user flow in the application.

The outcome of a smoke test should be that the system responds with expected values and behavior.
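
A smoke test that interrogates an HTTPS endpoint might look like the following sketch. The function names and any URL or expected text passed in are assumptions for illustration; a real test would target the application's own health or user-flow endpoints.

```python
import urllib.request


def response_is_healthy(status: int, body: str, expected_text: str) -> bool:
    """A response passes when it returns HTTP 200 and contains the expected value."""
    return status == 200 and expected_text in body


def smoke_test(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Issue a single request (one virtual user) and check the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response_is_healthy(response.status, body, expected_text)
    except OSError:
        # Covers connection failures, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False
```

A pipeline step can fail the deployment when `smoke_test` returns `False` for any of the checked endpoints.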

UI testing

UI Testing validates that application user interfaces are deployed and functioning as expected.

UI testing is similar to smoke testing but it's focused on user interface interactions.

UI automation tools can and should be used to drive automation.

During a UI test, a script will mimic a realistic user scenario and follow a series of steps to execute

actions and achieve an intended outcome.

Load testing

Load testing is designed to validate scalability and application operation under load through a rapid and/or

gradual increase in application test load, until a threshold or limit is reached. Load tests are typically

designed around a particular user flow or scenario, in order to verify that application requirements are

satisfied under a defined load.

Azure services have different soft and hard limits associated with scalability, and load testing can reveal if

a system faces a risk of exceeding them during the expected production load.

Load testing can be used to fine-tune auto-scaling capabilities for services that provide automated

scalability (that is, to set appropriate measured thresholds). For services that do not provide native

auto-scaling, established automated operational procedures can also be fine-tuned through load testing.
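
The gradual increase of test load until a threshold is reached can be sketched as a simple stepped ramp. Everything here is illustrative: `find_load_limit` and its parameters are hypothetical, and `simulated_latency_ms` is a stand-in for a real measurement taken by a load testing tool.

```python
from typing import Callable, Optional


def find_load_limit(
    measure_latency_ms: Callable[[int], float],
    latency_threshold_ms: float,
    start_users: int = 10,
    step_users: int = 10,
    max_users: int = 1000,
) -> Optional[int]:
    """Increase the simulated user count step by step and return the highest
    load that stayed within the latency threshold, or None if even the
    starting load exceeds it."""
    last_ok: Optional[int] = None
    users = start_users
    while users <= max_users:
        if measure_latency_ms(users) > latency_threshold_ms:
            break
        last_ok = users
        users += step_users
    return last_ok


def simulated_latency_ms(users: int) -> float:
    """Hypothetical stand-in for a real measurement: latency grows with load."""
    return 50.0 + 0.002 * users * users


limit = find_load_limit(simulated_latency_ms, latency_threshold_ms=200.0)
print(f"Highest load within threshold: {limit} users")
```

The measured limit can then inform auto-scaling thresholds, which should trigger well before the load at which the latency requirement is breached.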

Stress testing

Stress testing is a type of negative testing, which applies activities aimed at overloading existing

resources in order to understand where solution limits exist, and to ensure the system's ability to recover

gracefully.

During a stress test, it's essential to monitor all components of the system in order to identify potential

bottlenecks.

Every component of the system unable to appropriately scale can turn into a limitation, such as

active/passive network components or databases. It's important to understand their limits so that effort

can be applied to mitigate potential impact.

Unlike load testing, stress tests do not adhere to a realistic usage pattern, but aim to identify performance

and scale limits.

An alternative approach is to limit (or scale down) the computing resources of the system and monitor

how it behaves under load and whether it's able to recover.

Performance testing

Performance testing combines aspects of load and stress testing to validate performance under load, and

establish benchmark behaviors for application operation.

Failure injection (chaos) testing

Chaos testing introduces artificial failures to the system to validate how the system reacts and the

effectiveness of resiliency measures, operational procedures and mitigations.

A mission-critical application should be resilient to infrastructure and application failures, so introducing

faults in the application and underlying infrastructure and observing how the application behaves is

essential to achieve confidence in the solution's redundancy mechanisms and validate that it can indeed

operate as an 'always on' application.

Shutting down infrastructure components, purposely degrading performance, or introducing application

faults are examples of test scenarios, which can be used to verify that the application is going to react as

expected in situations when they occur for real.

Azure Chaos Studio provides an Azure-native chaos experimentation suite of tools to easily conduct

chaos experiments and inject faults within Azure services and application components.

Provides built-in chaos experiments for common fault scenarios, providing a growing set of 'behind

the curtain' experiments for underlying and abstracted components of Azure services.

Supports custom experiments targeting infrastructure and application components.

Security (penetration) testing

Penetration testing is used to ensure that an application and its environment satisfy an expected security

Design recommendations

Demo video: Continuous validation with Azure Load Test and Azure Chaos Studio

AI for DevOps

Design considerations

posture.

Penetration tests will probe the application and environment for security vulnerabilities.

Security testing can encompass the end-to-end software supply chain and package dependencies, with

scanning and monitoring for known Common Vulnerabilities and Exposures (CVE).

All testing of both infrastructure and application components should be fully automated to ensure

consistency.

All test artifacts should be treated as code and maintained within the source control system and version

controlled along with other application code artifacts.

The results of the tests should be captured and analyzed as both individual test results and aggregated

for assessing trends over time. Test results should be continually evaluated for accuracy and coverage.

The availability of test infrastructure should be aligned with the SLA for deployment and testing cycles.

Use PaaS CI/CD orchestration platforms, such as Azure DevOps or GitHub Actions, to orchestrate and

execute tests where possible.

Execute smoke tests as part of every deployment.

Run extensive load tests, along with stress and chaos testing, as part of every deployment to validate

that application performance and operability are maintained.

Use load profiles that are reflective of real peak usage patterns.

Run chaos experiments and failure injection tests at the same time as load tests.

If database interactions are required for load or smoke tests (that is, to create new records), use test

accounts with reduced privileges and make test data separable from real user content.

Tests with shorter execution times should generally run earlier in the cycle where possible to increase

testing efficiency.

Scan and monitor the end-to-end software supply chain and package dependencies for known CVEs.

Use Dependabot for GitHub repositories to ensure the repository automatically keeps up-to-date with

the latest releases of packages and applications it depends on.

AIOps methodologies can be applied within CI/CD pipelines to supplement traditional testing approaches,

providing capabilities to detect likely regressions or degradations, and allowing deployments to be preemptively

stopped to prevent potential negative impact.

CI/CD pipelines and DevOps processes will expose a wide variety of telemetry for machine learning

models, from test results and deployment outcomes, to operational data of test components from

composite deployment stages.

CI/CD pipelines will include various types of automated testing, such as unit, smoke, performance,

load, and chaos tests.

Changes in a deployment will need to be stored in a manner suitable for automated analysis and

correlation to deployment outcomes.

Design recommendations

Next step

Traditional data processing approaches such as Extract, Transform, and Load (ETL) may not be able to

scale throughput to keep up with growth of deployment telemetry and application observability data.

Modern analytics approaches that do not require ETL and data movement, such as data virtualization,

can be used to enable ongoing analysis by AIOps models.

Define what DevOps process data will be collected and how it will be analyzed.

Expose application observability data from staged test environments and the production environment

for analysis and correlation within AIOps models.

Gather deployment telemetry from DevOps processes, such as test execution metrics and time series

data of changes within each deployment.

Adopt the MLOps Workflow.

Develop analytical models that are context-aware and dependency-aware to provide predictions along

with automated feature engineering to address schema and behavior changes.

Operationalize models by registering and deploying the best trained models within deployment

pipelines.
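
As a deliberately simple stand-in for an AIOps model, the following sketch shows the shape of a pipeline gate that stops a deployment when a likely regression is detected. It uses a plain statistical threshold rather than a trained model, and the function name, the latency metric, and the `max_sigma` default are all assumptions made for illustration.

```python
import statistics
from typing import Sequence


def should_stop_deployment(
    baseline_ms: Sequence[float],
    candidate_ms: Sequence[float],
    max_sigma: float = 3.0,
) -> bool:
    """Flag a likely regression when the candidate's mean latency exceeds the
    baseline mean by more than `max_sigma` baseline standard deviations.

    `baseline_ms` needs at least two samples so that a standard deviation
    can be computed.
    """
    baseline_mean = statistics.fmean(baseline_ms)
    baseline_stdev = statistics.stdev(baseline_ms)
    candidate_mean = statistics.fmean(candidate_ms)
    return candidate_mean > baseline_mean + max_sigma * baseline_stdev


# Hypothetical telemetry gathered from earlier deployments and the current one.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
print(should_stop_deployment(baseline, [120.0, 118.0, 125.0]))
print(should_stop_deployment(baseline, [101.0, 100.0, 102.0]))
```

A context-aware model would replace the fixed threshold, but the gate's contract stays the same: telemetry in, a stop or continue decision out.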

Review the security considerations.

Security

Security considerations for mission-critical workloads on Azure

12/16/2022 • 19 minutes to read

IMPORTANT

Alignment with the Zero Trust model

Design considerations

Security is one of the foundational design principles and also a key design area that must be treated as a first-

class concern within the mission-critical architectural process.

Given that the primary focus of a mission-critical design is to maximize reliability so that the application remains

performant and available, the security considerations and recommendations applied within this design area will

focus on mitigating threats with the capacity to impact availability and hinder overall reliability. For example,

successful Distributed Denial-of-Service (DDoS) attacks are known to have a catastrophic impact on availability and

performance. How an application mitigates those attack vectors, such as Slowloris, will impact the overall

reliability. So, the application must be fully protected against threats intended to directly or indirectly

compromise application reliability to be truly mission critical in nature.

It's also important to note that there are often significant trade-offs associated with a hardened security posture,

particularly with respect to performance, operational agility, and in some cases reliability. For example, the

inclusion of inline Network Virtual Appliances (NVA) for Next-Generation Firewall (NGFW) capabilities, such as

deep packet inspection, will introduce a significant performance penalty, additional operational complexity, and a

reliability risk if scalability and recovery operations are not closely aligned with that of the application. It's

therefore essential that additional security components and practices intended to mitigate key threat vectors are

also designed to support the reliability target of an application, which will form a key aspect of the

recommendations and considerations presented within this section.

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we

recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub. The code assets adopt a Zero Trust

model to structure and guide the security design and implementation approach.

The Microsoft Zero Trust model provides a proactive and integrated approach to applying security across all

layers of an application. The guiding principles of Zero Trust strive to explicitly and continuously verify every

transaction, assert least privilege, and use intelligence and advanced detection to respond to threats in near real-time.

It's ultimately centered on eliminating trust inside and outside of application perimeters, enforcing

verification for anything attempting to connect to the system.

As you assess the security posture of the application, start with these questions as the basis for each

consideration.

Continuous security testing to validate mitigations for key security vulnerabilities.

Is security testing performed as a part of automated CI/CD processes?

If not, how often is specific security testing performed?

Design recommendations

Are test outcomes measured against a desired security posture and threat model?

Security level across all lower-environments.

Do all environments within the development lifecycle have the same security posture as the

production environment?

Authentication and Authorization continuity in the event of a failure.

If authentication or authorization services are temporarily unavailable, will the application be able to

continue to operate?

Automated security compliance and remediation.

Can changes to key security settings be detected?

Are responses to remediate non-compliant changes automated?

Secret scanning to detect secrets before code is committed to prevent any secret leaks through source

code repositories.

Is authentication to services possible without having credentials as a part of code?

Secure the software supply chain.

Is it possible to track Common Vulnerabilities and Exposures (CVEs) within utilized package

dependencies?

Is there an automated process for updating package dependencies?

Data protection key lifecycles.

Can service-managed keys be used for data integrity protection?

If customer-managed keys are required, how is a secure and reliable key lifecycle ensured?

CI/CD tooling should require Azure AD service principals with sufficient subscription level access to

facilitate control plane access for Azure resource deployments to all considered environment

subscriptions.

When application resources are locked down within private networks, is there a private data-plane

connectivity path so that CI/CD tooling can perform application level deployments and maintenance?

This introduces additional complexity and requires a sequence within the deployment process

through requisite private build agents.

Use Azure Policy to enforce security and reliability configurations for all services, ensuring that any

deviation is either remediated or prohibited by the control plane at configuration-time, helping to

mitigate threats associated with 'malicious admin' scenarios.

Use Azure AD Privileged Identity Management (PIM) within production subscriptions to revoke sustained

control plane access to production environments. This will significantly reduce the risk posed from

'malicious admin' scenarios through additional 'checks and balances'.

Use Azure Managed Identities for all services that support the capability, since it facilitates the removal of

credentials from application code and removes the operational burden of identity management for

service to service communication.

Use Azure AD Role Based Access Control (RBAC) for data plane authorization with all services that

support the capability.

Use first-party Microsoft identity platform authentication libraries within application code to integrate

with Azure AD.

Consider secure token caching to allow for a degraded but available experience if the chosen identity

platform isn't available or is only partially available for application authorization.

Threat modeling

If the provider is unable to issue new access tokens, but still validates existing ones, the application

and dependent services can operate without issues until their tokens expire.

Token caching is typically handled automatically by authentication libraries (such as MSAL).

Use Infrastructure-as-Code (IaC) and automated CI/CD pipelines to drive updates to all application

components, including under failure circumstances.

Ensure CI/CD tooling service connections are safeguarded as critical sensitive information and aren't

directly available to any service team.

Apply granular RBAC to production CD pipelines to mitigate 'malicious admin' risks.

Consider the use of manual approval gates within production deployment pipelines to further

mitigate 'malicious admin' risks and provide additional technical assurance for all production changes.

Additional security gates may come at a trade-off in terms of agility and should be carefully

evaluated, with consideration given to how agility can be maintained even with manual gates.

Define an appropriate security posture for all lower environments to ensure key vulnerabilities are

mitigated.

Do not apply the same security posture as production, particularly with regard to data exfiltration,

unless regulatory requirements stipulate the need to do so, since this will significantly compromise

developer agility.

Enable Microsoft Defender for Cloud (formerly known as Azure Security Center) for all subscriptions that

contain the resources for a mission-critical workload.

Use Azure Policy to enforce compliance.

Enable Azure Defender for all services that support the capability.

Embrace DevSecOps and implement security testing within CI/CD pipelines.

Test results should be measured against a compliant security posture to inform release approvals, be

they automated or manual.

Apply security testing as part of the CD production process for each release.

If security testing each release jeopardizes operational agility, ensure a suitable security testing

cadence is applied.

Enable secret scanning and dependency scanning within the source code repository.
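
The token caching recommendation above can be illustrated with a small sketch. Authentication libraries such as MSAL typically handle caching automatically, so this is only meant to show the degraded-mode behavior: when the identity provider cannot issue a new token, a cached token is reused until it expires. All names here (`Token`, `CachingTokenSource`, `IdentityProviderUnavailable`, `issue_token`) are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class IdentityProviderUnavailable(Exception):
    """Hypothetical error raised when no new token can be issued."""


@dataclass
class Token:
    value: str
    expires_at: float  # seconds since the epoch


class CachingTokenSource:
    def __init__(
        self,
        issue_token: Callable[[], Token],
        refresh_margin_s: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._issue_token = issue_token
        self._refresh_margin_s = refresh_margin_s
        self._clock = clock
        self._cached: Optional[Token] = None

    def get_token(self) -> Token:
        now = self._clock()
        cached = self._cached
        # Serve from the cache while the token is comfortably valid.
        if cached is not None and cached.expires_at - now > self._refresh_margin_s:
            return cached
        try:
            self._cached = self._issue_token()
            return self._cached
        except IdentityProviderUnavailable:
            # Degraded mode: keep using the cached token until it expires.
            if cached is not None and cached.expires_at > now:
                return cached
            raise
```

Refreshing ahead of expiry (the `refresh_margin_s` window) gives the application the longest possible runway if the provider becomes unavailable.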

Threat modeling provides a risk-based approach to security design, using identified potential threats to develop

appropriate security mitigations. There are many possible threats with varying probabilities of occurrence, and

in many cases threats can chain in unexpected, unpredictable, and even chaotic ways. This complexity and

uncertainty is why traditional technology requirement based security approaches are largely unsuitable for

mission-critical cloud applications. Expect the process of threat modeling for a mission-critical application to be

complex and unyielding.

To help navigate these challenges, a layered defense-in-depth approach should be applied to define and

implement compensating mitigations for modeled threats, considering the following defensive layers.

1. The Azure platform with foundational security capabilities and controls.

2. The application architecture and security design.

3. Built-in, enabled, and deployable security features applied to secure Azure resources.

4. Application code and security logic.

5. Operational processes and DevSecOps.

NOTE

Design considerations

Design recommendations

Network intrusion protection

When deploying within an Azure landing zone, be aware that an additional threat mitigation layer through the

provisioning of centralized security capabilities is provided by the landing zone implementation.

STRIDE provides a lightweight risk framework for evaluating security threats across key threat vectors.

Spoofed Identity: Impersonation of individuals with authority. For example, an attacker impersonating

another user by using their -

Tampering Input: Modification of input sent to the application, or the breach of trust boundaries to modify

application code. For example, an attacker using SQL Injection to delete data in a database table.

Repudiation of Action: Ability to refute actions already taken, and the ability of the application to gather

evidence and drive accountability. For example, the deletion of critical data without the ability to trace to a

malicious admin.

Information Disclosure: Gaining access to restricted information. An example would be an attacker gaining

access to a restricted file.

Denial of Service: Malicious application disruption to degrade user experience. For example, a DDoS botnet

attack such as Slowloris.

Elevation of Privilege: Gaining privileged application access through authorization exploits. For example, an

attacker manipulating a URL string to gain access to sensitive information.

Identity

Authentication

Data integrity

Validation

Blocklisting/allowlisting

Audit/logging

Signing

Encryption

Data exfiltration

Man-in-the-middle attacks

DDoS

Botnets

CDN and WAF capabilities

Remote code execution

Authorization

Isolation

Allocate engineering budget within each sprint to evaluate potential new threats and implement

mitigations.

Conscious effort should be applied to ensure security mitigations are captured within a common

engineering criteria to drive consistency across all application service teams.

Start with service-by-service threat modeling, then unify the model by consolidating the threat

model at the application level.

Preventing unauthorized access to a mission-critical application and encompassed data is vital to maintain

Design considerations

Design recommendations

availability and safeguard data integrity.

Zero Trust assumes a breached state and verifies each request as though it originates from an

uncontrolled network.

An advanced zero-trust network implementation employs micro-segmentation and distributed

ingress/egress micro-perimeters.

Azure PaaS services are typically accessed over public endpoints. Azure provides capabilities to secure

public endpoints or even make them entirely private.

Azure Private Link/Private Endpoints provide dedicated access to an Azure PaaS resource using private

IP addresses and private network connectivity.

Virtual Network Service Endpoints provide service-level access from selected subnets to selected PaaS

services.

Virtual Network Injection provides dedicated private deployments for supported services, such as App

Service through an App Service Environment.

Management plane traffic still flows through public IP addresses.

For supported services, Azure Private Link using Azure Private Endpoints addresses data exfiltration risks

associated with Service Endpoints, such as a malicious admin writing data to an external resource.

When restricting network access to Azure PaaS services using Private Endpoints or Service Endpoints, a

secure network channel will be required for deployment pipelines to access both the Azure control plane

and data plane of Azure resources in order to deploy and manage the application.

Private self-hosted build agents deployed onto a private network as the Azure resource can be used as

a proxy to execute CI/CD functions over a private connection. A separate virtual network should be

used for build agents.

An alternative approach is to modify the firewall rules for the resource on-the-fly within the pipeline

to allow a connection from an Azure DevOps agent public IP address, with the firewall rule subsequently

removed after the task is completed.

To perform developer and administrative tasks on the application service, jump boxes can be used.

Connectivity to the private build agents from CI/CD tooling is required.

However, this approach is only applicable for a subset of Azure services. For example, this isn't

feasible for private AKS clusters.

The completion of administration and maintenance tasks is a further scenario requiring connectivity to

the data plane of Azure resources.

Service Connections with a corresponding Azure AD service principal can be used within Azure DevOps

to apply RBAC through Azure AD.

Service Tags can be applied to Network Security Groups to facilitate connectivity with Azure PaaS

services.

Application Security Groups don't span across multiple virtual networks.

Packet capture in Azure Network Watcher is limited to a maximum period of five hours.
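
The alternative of opening a firewall rule on-the-fly within the pipeline, described above, depends on the rule always being removed afterwards, including when the deployment task fails. A minimal sketch of that guarantee follows; `add_rule` and `remove_rule` are hypothetical callables standing in for whatever tooling manages the resource firewall, and the IP address is a documentation-range placeholder.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def temporary_firewall_rule(
    add_rule: Callable[[str], None],
    remove_rule: Callable[[str], None],
    agent_ip: str,
) -> Iterator[None]:
    """Allow `agent_ip` for the duration of the block, then remove the rule
    even if the enclosed deployment step raises."""
    add_rule(agent_ip)
    try:
        yield
    finally:
        remove_rule(agent_ip)


# Usage with an in-memory stand-in for the resource firewall.
allowed_ips = set()
with temporary_firewall_rule(allowed_ips.add, allowed_ips.discard, "203.0.113.10"):
    print("rule present during task:", "203.0.113.10" in allowed_ips)
print("rule present after task:", "203.0.113.10" in allowed_ips)
```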

Limit public network access to the absolute minimum required for the application to fulfill its business

purpose to reduce the external attack surface.

Use Azure Private Link to establish private endpoints for Azure resources that require secure network

integration.

Use hosted private build agents for CI/CD tooling to deploy and configure Azure resources protected

NOTE

Data integrity protection

Design considerations

by Azure Private Link.

Microsoft-hosted agents won't be able to directly connect to network integrated resources.

When dealing with private build agents, never open an RDP or SSH port directly to the internet.

Use Azure Bastion to provide secure access to Azure Virtual Machines and to perform administrative

tasks on Azure PaaS over the Internet.

Use a DDoS standard protection plan to secure all public IP addresses within the application.

Use Azure Front Door with web application firewall policies to deliver and help protect global HTTP/S

applications that span multiple Azure regions.

Use Header ID validation to lock down public application endpoints so they only accept traffic

originating from the Azure Front Door instance.

If additional in-line network security requirements, such as deep packet inspection or TLS inspection,

mandate the use of Azure Firewall Premium or Network Virtual Appliance (NVA), ensure it's configured

for maximum high availability and redundancy.

If packet capture requirements exist, use Network Watcher packet capture despite the limited capture

window.

Use Network Security Groups and Application Security Groups to micro-segment application traffic.

Avoid using a security appliance to filter intra-application traffic flows.

Consider the use of Azure Policy to enforce that specific NSG rules are always associated with application

subnets.

Enable NSG flow logs and feed them into Traffic Analytics to gain insights into internal and external traffic

flows.

Use Azure Private Link/Private Endpoints, where available, to secure access to Azure PaaS services within

the application design. For information on Azure services that support Private Link, see Azure Private

Link availability.

If Private Endpoint isn't available and data exfiltration risks are acceptable, use Virtual Network Service

Endpoints to secure access to Azure PaaS services from within a virtual network.

Don't enable virtual network service endpoints by default on all subnets as this will introduce

significant data exfiltration channels.

For hybrid application scenarios, access Azure PaaS services from on-premises via ExpressRoute with

private peering.

When deploying within an Azure landing zone, be aware that network connectivity to on-premises data centers is

provided by the landing zone implementation. One approach is by using ExpressRoute configured with private peering.

Encryption is a vital step toward ensuring data integrity and is ultimately one of the most important security

capabilities that can be applied to mitigate a wide array of threats. This section will therefore provide key

considerations and recommendations related to encryption and key management in order to safeguard data

without compromising application reliability.

Azure Key Vault has transaction limits for keys and secrets, with throttling applied per vault within a

certain period.
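
Because throttling is applied per vault, callers generally need to tolerate throttled responses. The sketch below shows one common way to do that, retrying with exponential backoff and jitter. It is an illustration only: `ThrottledError` is a hypothetical stand-in for an HTTP 429 response, and `call_with_backoff` and its parameters are assumed names, not part of any Azure SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 (too many requests) response."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying throttled calls with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the throttling error.
            # The delay doubles on each attempt (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

Caching retrieved secrets in memory for a short period would further reduce the number of transactions issued against a vault.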

Design recommendations

Azure Key Vault provides a security boundary since access permissions for keys, secrets, and certificates

are applied at a vault level.

Key Vault access policy assignments grant permissions separately to keys, secrets, or certificates.

Granular object-level permissions to a specific key, secret, or certificate are now possible.

After a role assignment is changed, there's a latency of up to 10 minutes (600 seconds) for the role to be

applied.

There's an Azure AD limit of 2,000 Azure role assignments per subscription.

Azure Key Vault underlying hardware security modules (HSMs) are FIPS 140-2 Level 2 compliant.

A dedicated Azure Key Vault managed HSM is available for scenarios requiring FIPS 140-2 Level 3

compliance.

Azure Key Vault provides high availability and redundancy to help maintain availability and prevent data

loss.

During a region failover, it may take a few minutes for the Key Vault service to fail over.

During a failover Key Vault will be in a read-only mode, so it won't be possible to change key vault

properties such as firewall configurations and settings.

If private link is used to connect to Azure Key Vault, it may take up to 20 minutes for the connection to be

re-established during a regional failover.

A backup creates a point-in-time snapshot of a secret, key, or certificate, as an encrypted blob that can't

be decrypted outside of Azure. To get usable data from the blob, it must be restored into a Key Vault

within the same Azure subscription and Azure geography.

Secrets may renew during a backup, causing a mismatch.

With service-managed keys, Azure will perform key management functions, such as rotation, thereby

reducing the scope of application operations.

Regulatory controls may stipulate the use of customer-managed keys for service encryption functionality.

When traffic moves between Azure data centers, MACsec data-link layer encryption is used on the

underlying network hardware to secure data in-transit outside of the physical boundaries not controlled

by Microsoft or on behalf of Microsoft.

Use service-managed keys for data protection where possible, removing the need to manage encryption

keys and handle operational tasks such as key rotation.

Only use customer-managed keys when there's a clear regulatory requirement to do so.

Use Azure Key Vault as a secure repository for all secrets, certificates, and keys if additional encryption

mechanisms or customer-managed keys need to be considered.

Provision Azure Key Vault with the soft delete and purge policies enabled to allow retention protection

for deleted objects.

Use HSM backed Azure Key Vault SKU for application production environments.

Deploy a separate Azure Key Vault instance within each regional deployment stamp, providing fault

isolation and performance benefits through localization, as well as navigating the scale limits imposed by

a single Key Vault instance.

Use a dedicated Azure Key Vault instance for application global resources.

Follow a least privilege model by limiting authorization to permanently delete secrets, keys, and

certificates to specialized custom Azure AD roles.

Policy-driven governance

Design considerations

NOTE

Design recommendations

Ensure encryption keys and certificates stored within Key Vault are backed up, providing an offline copy

in the unlikely event Key Vault becomes unavailable.

Use Key Vault certificates to manage certificate procurement and signing.

Establish an automated process for key and certificate rotation.

Automate the certificate management and renewal process with public certificate authorities to ease

administration.

Set up alerting and notifications to supplement automated certificate renewals.

Monitor key, certificate, and secret usage.

Define alerts for unexpected usage within Azure Monitor.

Security conventions are ultimately only effective if consistently and holistically enforced across all application

services and teams. Azure Policy provides a framework to enforce security and reliability baselines, ensuring

continued compliance with a common engineering criteria for a mission-critical application. More specifically,

Azure Policy forms a key part of the Azure Resource Manager (ARM) control plane, supplementing RBAC by

restricting what actions authorized users can perform, and can be used to enforce vital security and reliability

conventions across utilized platform services.

This section will therefore explore key considerations and recommendations surrounding the use of Azure Policy

driven governance for a mission-critical application, ensuring security and reliability conventions are

continuously enforced.

Azure Policy provides a mechanism to drive compliance by enforcing security and reliability conventions,

such as the use of Private Endpoints or the use of Availability Zones.

When deploying within an Azure landing zone, be aware that the enforcement of centralized baseline policy assignments

will likely be applied in the implementation for landing zone management groups and subscriptions.

Azure Policy can be used to drive automated management activities, such as provisioning and

configuration.

Resource Provider registration.

Validation and approval of individual Azure resource configurations.

Azure Policy assignment scope dictates coverage and the location of Azure Policy definitions informs the

reusability of custom policies.

Azure Policy has several limits, such as the number of definitions at any particular scope.

It can take several minutes for the execution of Deploy If Not Exist (DINE) policies to occur.

Azure Policy provides a critical input for compliance reporting and security auditing.

Map regulatory and compliance requirements to Azure Policy definitions.

For example, if there are data residency requirements, a policy should be applied to restrict available

deployment regions.
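
As an illustration of the data residency example, the sketch below builds a simplified deny rule in the general shape of an Azure Policy rule (`if`/`then` with `field` conditions). It is an assumption-laden sketch rather than a complete policy definition: the helper name is hypothetical, the region names are example values, and built-in policies should be preferred where they exist.

```python
import json


def build_allowed_locations_rule(allowed_locations: list) -> dict:
    """Build a simplified deny rule for resources outside the allowed regions."""
    return {
        "if": {
            "allOf": [
                {"field": "location", "notIn": list(allowed_locations)},
                # Global resources have no region and are left unrestricted here.
                {"field": "location", "notEquals": "global"},
            ]
        },
        "then": {"effect": "deny"},
    }


# Example regions only; substitute the regions permitted by the residency requirement.
rule = build_allowed_locations_rule(["westeurope", "northeurope"])
print(json.dumps(rule, indent=2))
```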

Define a common engineering criteria to capture secure and reliable configuration definitions for all

NOTE

IaaS specific considerations when using Virtual Machines

utilized Azure services, ensuring this criteria is mapped to Azure Policy assignments to enforce

compliance.

For example, apply an Azure Policy to enforce the use of Availability Zones for all relevant services,

ensuring reliable intra-region deployment configurations.

The Mission Critical reference implementation contains a wide array of security and reliability centric policies

to define and enforce a sample common engineering criteria.

Monitor service configuration drift, relative to the common engineering criteria, using Azure Policy.

For mission-critical scenarios with multiple production subscriptions under a dedicated management group,

prioritize assignments at the management group scope.

Use built-in policies where possible to minimize operational overhead of maintaining custom policy

definitions.

Where custom policy definitions are required, ensure definitions are deployed at a suitable management

group scope to allow for reuse across encompassed environment subscriptions, enabling policy

reuse across production and lower environments.

When aligning the application roadmap with Azure roadmaps, use available Microsoft resources to

explore if critical custom definitions could be incorporated as built-in definitions.

When deploying within an Azure landing zone, consider deploying custom Azure Policy Definitions within the intermediate

company root management group scope to enable reuse across all applications within the broader Azure estate. In a

landing zone environment, certain centralized security policies will be applied by default within higher management group

scopes to enforce security compliance across the entire Azure estate. For example, Azure policies should be applied to

automatically deploy software configurations through VM extensions and enforce a compliant baseline VM configuration.

Use Azure Policy to enforce a consistent tagging schema across the application.

Identify required Azure tags and use the append policy mode to enforce usage.

If the application is subscribed to Microsoft Mission-Critical Support, ensure that the applied tagging

schema provides meaningful context to enrich the support experience with deep application understanding.

Export Azure AD activity logs to the global Log Analytics Workspace used by the application.

Ensure Azure activity logs are archived within the global Storage Account along with operational data

for long-term retention.

In an Azure landing zone, Azure AD activity logs will also be ingested into the centralized platform Log

Analytics workspace. In this case, evaluate whether Azure AD activity logs are still required in the global Log

Analytics workspace.

Integrate security information and event management with Microsoft Defender for Cloud (formerly known

as Azure Security Center).

In scenarios where the use of IaaS Virtual Machines is required, some specifics have to be taken into

consideration.

Design considerations

Design recommendations

Next step

Images are not updated automatically once deployed.

Updates are not installed automatically to running VMs.

Images and individual VMs are typically not hardened out-of-the-box.

Do not allow direct access via the public Internet to Virtual Machines by providing access to SSH, RDP or

other protocols. Always use Azure Bastion and jumpboxes with limited access to a small group of users.

Restrict direct internet connectivity by using Network Security Groups, Azure Firewall, or Application

Gateways (Layer 7) to filter and restrict egress traffic.

For multi-tier applications consider using different subnets and use Network Security Groups to restrict

access in between.

Prioritize the use of Public Key authentication, when possible. Store secrets in a secure place like Azure Key

Vault.

Protect VMs by using authentication and access control.

Apply the same security practices as described for mission-critical application scenarios.

Follow and apply security practices for mission-critical application scenarios as described above, when

applicable, as well as the Security best practices for IaaS workloads in Azure.

Review the best practices for operational procedures for mission-critical application scenarios.

Operational procedures

Operational procedures for mission-critical workloads on Azure

12/16/2022 • 10 minutes to read

IMPORTANT

DevOps processes

Design considerations

The mission-critical design methodology leans heavily on the principles of automation wherever possible and

configuration as code to drive reliable and effective operations through DevOps processes, with automated

deployment pipelines used to execute versioned application and infrastructure code artifacts within a source

repository. While this level of DevOps adoption requires substantial engineering investment to instantiate and

discipline to maintain, it yields significant operational dividends, enabling consistent and accurate operational

outcomes with minimal manual operational procedures.

This design area explores how the adoption of DevOps and related deployment methods is used to drive

effective and consistent operational procedures.

This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we

recommend you start with what is a mission-critical workload?

Mission-Critical open source project

The reference implementations are part of an open source project available on GitHub.

DevOps provides the engineering mindset, processes, and tooling to deliver application services in a fast,

efficient, and reliable manner. More specifically, DevOps brings together development and operational processes

as well as teams into a single engineering function that encompasses the entire application lifecycle, using

automation and DevOps tooling to conduct deployment operations swiftly and reliably.

DevOps processes support and sustain the concepts of continuous integration and continuous

deployment (CI/CD), while fostering a culture of continuous improvement.

DevOps can be difficult to apply when there are hard dependencies on central IT functions since it

prevents end-to-end operational action.

Key responsibilities of the DevOps team for a mission-critical application include:

Create and manage application and infrastructure resources through CI/CD automation.

Application monitoring and observability.

Azure RBAC and identity for application components.

Security monitoring and audit of application resources.

Network management for application components.

Cost management for application resources.

DevSecOps expands the DevOps model by integrating security and quality assurance teams with

development and operations throughout the application lifecycle.

A DevOps engineering team can consider various granular Azure RBAC roles for different technical

Design recommendations

Application Operations

personas, such as AppDataOps for database management.

A zero trust model can and should be applied across different application DevOps personas.

Define configuration settings and updates for application components or underlying infrastructure as

code.

Manage any changes to code through consistent release and update process, including tasks such as

key or secret rotation and permission management.

Prioritize pipeline-managed update processes, such as with scheduled pipeline runs, over built-in

auto-update mechanisms.

Do not use central processes or provisioning pipelines for the instantiation or management of application

resources, since this introduces external application dependencies and additional risk vectors, such as

those associated with 'noisy neighbor' scenarios.

If centralized provisioning processes are mandated, ensure the availability requirements of used

dependencies are fully in-line with application requirements, and ensure operational transparency is

provided to allow for holistic operationalization of the end-to-end application.

Embrace continuous improvement and allocate a proportion of engineering capacity within each sprint to

optimize platform fundamentals.

Consider allocating 20-40% of capacity within each sprint to drive fundamental platform

improvements and bolster reliability.

To accelerate the development of new services, consider the creation of a common engineering criteria

and reference architectures/libraries for service teams to use, ensuring consistent alignment with core

design principles.

Enforce a consistent baseline configuration for reliability, security, and operations through a policy-

driven approach using Azure Policy.

This common engineering criteria and associated artifacts, such as Azure Policies and Terraform for common

design patterns, can also be used across other workloads within the broader application ecosystem for an

organization.

Consider a DevSecOps approach for security-sensitive and highly regulated scenarios, to ensure security is baked

into the DNA of the engineering team throughout the development lifecycle rather than at a specific release

stage/gate.

Apply a zero trust model within critical application environments, using capabilities such as Azure AD

Privileged Identity Management (PIM) to ensure consistent operations only occur through CI/CD

processes or automated operational procedures.

No physical users should have standing write access to any environment, with the possible exception of

development environments for easier testing and debugging.

Define emergency processes for Just-in-time access to production environments.

Ensure 'break glass' accounts exist for serious issues with the authentication provider.

Consider AIOps as a method to continually improve operational procedures and triggers.

The application design and platform recommendations have a significant bearing on effective operations, and

the extent to which these recommendations are adhered to will therefore greatly influence optimal operational

procedures.

Design considerations

Furthermore, there's a varied set of operational capabilities provided by different Azure services, particularly

when it comes to high availability and recovery. It's therefore also important to understand and use the

operational capabilities of used services.

This section will therefore highlight key operational aspects associated with application design and

recommended platform services.

Azure services provide a combination of built-in (enabled by default) and configurable platform

capabilities, such as zonal redundancy or geo-replication. The service-level configuration of each service

within an application must therefore be considered by operational procedures.

Many configurable capabilities incur an additional cost, such as the multi-write deployment

configuration for Azure Cosmos DB.

Mission critical design strongly endorses the principle of ephemeral stateless application resources,

meaning that updates can typically be performed through a new deployment and standard delivery

pipelines.

Most of the required operations are exposed and accessible via the Azure ARM management APIs or

through the Azure portal.

Some more intensive operations, such as a restore from a periodic backup of an Azure Cosmos DB

database or the recovery of a deleted resource, can only be performed by Azure Support engineers via a

support case.

For stateless resources and resources that can be entirely configured from deployment, such as Azure

Front Door and associated backends/origins, redeployment will generally result in an operational

resource faster than a Support process to attempt recovery of the deleted resource.

Azure Policy provides a framework to enforce and audit security and reliability baselines, ensuring

continued compliance with a common engineering criteria for a mission-critical application. More

specifically, Azure Policy forms a key part of the Azure Resource Manager (ARM) control plane,

supplementing RBAC by restricting what actions authorized users can perform, and can be used to

enforce vital security and reliability conventions across utilized platform services.

Azure Policy can be extended within Azure Kubernetes Service (AKS) via Azure Policy for Kubernetes

that provides visibility of components deployed within clusters.

Azure resources can be locked to prevent them from being modified or deleted.

Locks introduce management overhead within deployment pipelines that must remove locks, perform

deployment steps, before subsequently re-enabling.

Generally, for most resources, a robust RBAC process, with tight restrictions on who can perform write

operations, should be preferred over locking resources.

Update Management

Key, secret, and certificate expirations are common causes of application outage.

The Kubernetes version within AKS needs to be updated regularly, especially given support for older

versions isn't sustained.

Components running on Kubernetes also need to be updated, such as cert-manager and the Key Vault CSI driver, and

aligned with the Kubernetes version within AKS.

Terraform providers need to be updated regularly. Newer provider versions frequently contain breaking

changes, which must be properly tested before being applied in production.

Other application dependencies, such as the runtime environment (.NET, Java, Python), should be

Design recommendations

monitored and kept up-to-date.

New versions of packages, components, and dependencies need to be properly tested before

consideration in a production context.

Container registries will likely need regular housekeeping to delete old image versions that aren't used

anymore.

Azure Policy provides native support for a wide variety of Azure resources.

Use an active-active deployment model, using a health model and automated scale-operations to ensure

no failover intervention is required.

If using an active-passive or active-standby model, ensure failover procedures are automated or at

least codified within pipelines so that no manual steps besides triggering are required during operational

crises.

Prioritize the use of Azure-native auto-scale functionality for all available services.

Establish automated operational processes to scale services that do not offer auto-scale.

Leverage scale-units composed of multiple services to provide requisite scalability under relevant

circumstances.

Identify operational procedures and tasks required by global (long-lived) application resources.

For example, if an Azure Cosmos DB resource or encompassed data is incorrectly modified or deleted,

the possible ways of recovery should be well understood and a recovery process should exist.

Similarly, establish procedures to manage decommissioned container images in the registry.

It's strongly recommended to practice recovery operations in advance, on non-production resources and

data, as part of standard business continuity preparations.

Use platform-native capabilities for backup and restore, ensuring they're aligned with RTO/RPO and data

retention requirements.

Avoid building custom solutions unless absolutely necessary.

Define a strategy for long-term backup retention if required.

Use built-in capabilities for SSL certificate management and renewal, such as those offered by Azure

Front Door.

Wherever possible, use Managed Identities in order to avoid dealing with Service Principal credentials or

API keys.

When storing secrets, keys or certificates in Azure Key Vault, make use of the expiry setting and have

alerting configured for upcoming expirations.
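
The expiry check behind such alerting could look like the sketch below. It assumes the expiry dates have already been collected into a plain mapping of object name to expiry time; the function name, the example object names, and the 30-day window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(expirations: dict, now: datetime, window_days: int = 30) -> list:
    """Return names whose expiry falls within window_days of now, soonest first.

    Objects that have already expired are included, since they need action too.
    """
    cutoff = now + timedelta(days=window_days)
    due = [(expires_on, name) for name, expires_on in expirations.items() if expires_on <= cutoff]
    return [name for _, name in sorted(due)]


# Example data only; real values would come from the expiry setting on each object.
now = datetime(2023, 1, 1, tzinfo=timezone.utc)
inventory = {
    "tls-cert": now + timedelta(days=10),
    "api-key": now + timedelta(days=90),
    "db-password": now - timedelta(days=1),
}
to_alert = expiring_soon(inventory, now)
```

Running a check like this on a schedule, and routing the result to an actionable alert, gives teams time to rotate through the standard release process before an outage occurs.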

All key, secret, and certificate updates should be performed using the standard release process.

Identify critical operational alerts and define target audiences and systems, with clear channels to reach

them.

Avoid 'white-noise' by only sending actionable alerts, to prevent operational stakeholders from ignoring

alerts and missing important information.

Leverage continuous improvement to optimize alerting and remove observed 'white-noise'.

Apply the principle of policy-driven governance and Azure Policy to ensure the appropriate use of

operational capabilities and a reliable configuration baseline across all application services.

Apply a resource lock to prevent the deletion of long-lived global resources, such as Azure Cosmos DB.

Avoid the use of resource locks on ephemeral regional resources, and instead rely on the appropriate

IaaS specific considerations when using Virtual Machines

Design considerations

Design recommendations

use of Role Based Access Control (RBAC) and CI/CD pipelines to control operational updates.

Update management

Update external libraries, SDKs, and runtimes frequently, treating it as any other change to the

application. This will ensure the latest security vulnerabilities and performance optimizations are applied.

Ensure all updates are validated prior to production release.

Set up processes to monitor and automatically detect updates, such as GitHub's Dependabot.

All operational tasks, such as key and secret rotation, should be handled using either Azure-native

platform capabilities or via a standard release process applied for code and configuration changes.

Ensure key, secret, and certificate rotation is performed on a regular basis.

Manual operational changes to update components should be avoided and only considered by

emergency exception.

Ensure a process exists to reconcile any manual changes back into the source repository, avoiding

drift and issue recurrence.

Establish an automated housekeeping procedure to remove old image versions from Azure Container

Registry.
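
The selection logic for such a housekeeping procedure might be sketched as follows. It assumes the image tags and their push times have already been listed from the registry; the function name, the retention defaults, and the tag names are hypothetical, and the actual deletion step is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def select_images_to_delete(
    images: dict,
    in_use: set,
    now: datetime,
    keep_latest: int = 5,
    max_age_days: int = 90,
) -> list:
    """Pick tags that are unused, outside the newest keep_latest, and older than max_age_days."""
    newest_first = sorted(images, key=lambda tag: images[tag], reverse=True)
    # Never delete the most recent tags or anything still deployed.
    protected = set(newest_first[:keep_latest]) | set(in_use)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(tag for tag in images if tag not in protected and images[tag] < cutoff)


# Example data only: tag name mapped to the time the image was pushed.
now = datetime(2023, 1, 1, tzinfo=timezone.utc)
pushed = {
    "v1": now - timedelta(days=400),
    "v2": now - timedelta(days=200),
    "v3": now - timedelta(days=100),
    "v4": now - timedelta(days=10),
    "v5": now - timedelta(days=1),
}
stale = select_images_to_delete(pushed, in_use={"v2"}, now=now, keep_latest=2)
```

Keeping the selection separate from the deletion makes it straightforward to review the candidate list in a pipeline run before anything is removed.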

In scenarios where the use of IaaS Virtual Machines is required, some of the procedures and practices described

above might differ. The use of Virtual Machines provides more flexibility with regard to configuration options,

selection of operating systems, access to drivers and low-level operating system access, as well as the kind of

software that can be installed. The price for that is increased operational cost and responsibility for tasks that

are usually done by the cloud provider when using PaaS services.

Individual VMs do not provide high availability, zone or geo-redundancy.

Individual VMs are not automatically updated once deployed.

Services running inside a VM need special treatment and additional tooling to be deployed and configured

via Infra-as-Code.

Azure periodically updates its platform and might require a reboot. These reboots are usually announced in

advance. See Maintenance for virtual machines in Azure and Handling planned maintenance notifications.

Avoid any manual operations on virtual machines and implement proper processes to deploy and roll out

changes.

Automate the provisioning of Azure resources using Infrastructure-as-Code solutions, such as Azure

Resource Manager templates, Bicep, Terraform, or other third-party solutions.

Make sure that operational processes for deployment of virtual machines, updates, backup and recovery

are in place and properly tested. To test for resiliency, inject faults into the application, note the resulting

failures, and mitigate them.

Ensure that strategies are in place to roll back changes to the last known healthy state in case a newer version

is not functioning correctly.

Take frequent backups for stateful workloads and ensure backup tasks are working effectively and alert

on failed backup processes.

Monitor virtual machines and detect failures. The raw data for monitoring can come from a variety of

sources. Ensure that monitoring is configured and analyze the cause of problems.

Next step

Verify that scheduled backups are running as expected and that periodic backups are taken (if needed).

Azure Backup center can be used to gain more insights.

Prioritize the use of Virtual Machine Scale Sets over VMs to enable capabilities like scale, autoscale and

provide zone-redundancy.

Prioritize the use of standard images from the Azure Marketplace over custom images that need to be

maintained.

Use Azure VM Image Builder or other tools to automate build and maintenance processes for customized

images if required.

Besides these specifics, apply best practices for operational procedures for mission-critical application scenarios

when applicable.

Review the cross-cutting concerns for mission-critical application scenarios.

Architecture pattern

Well-Architected assessment for mission-critical workloads

12/16/2022 • 2 minutes to read

The Assessment is a review tool for self-assessing the readiness of your mission-critical workload in production.

Working towards building a resilient mission critical architecture can be a complex process. The assessment is

organized in a way that you'll be able to methodically check the alignment to the best practices for resiliency,

reliability, and availability.

We recommend that the team doing the assessment is well versed in the architecture of the specific workload

and has a strong understanding of cloud principles and patterns. These roles include, but aren't limited to, cloud

architect, operators, DevOps engineer.

The assessment is a set of questions based on the mission-critical design methodology as a way of checking the

foundational design choices of a workload’s architecture and the end-to-end operational approach.

These questions are designed to help benchmark a workload’s maturity and alignment to the recommended

approach for operating a mission-critical workload. The outcome of the assessment is the delivery of technical

recommendations and documentation that provide guidance on how to implement a highly reliable solution on

Azure.

Next steps

Mission-critical review tool

See these reference architectures that describe design choices for production-ready implementations.

Baseline architecture of an internet-facing application

Baseline architecture of an internet-facing application with network controls

Carrier-grade workloads on Azure

12/16/2022 • 5 minutes to read

NOTE

What is a carrier-grade workload?

What are the common challenges?

Lift and shift approach

Mission-critical systems primarily focus on maximizing uptime and they exist in many industries. Within the

telecommunications industry, they're referred to as carrier-grade systems. These systems are developed due to

one or more of the following drivers:

Minimizing loss of life or injury.

Meeting regulatory requirements on reliability to avoid paying fines.

Optimizing service to customers to minimize churn to competitors.

Meeting contractual Service Level Agreements (SLAs) with business or government customers.

This series of articles applies the design methodology for mission-critical workloads to inform prescriptive

guidance for building and operating a highly reliable, resilient, and available telecommunication workload on

Azure.

The articles within this series focus on providing additional mission-critical considerations when designing for 99.999% ('5

9s') levels of reliability within the telecommunications industry.

The term workload refers to a collection of application resources that support a common business goal or the

execution of a common business process, with multiple services, such as APIs and data stores, working together

to deliver specific end-to-end functionality.

The term mission-critical refers to a criticality classification where a significant financial cost (business-critical) or

human cost (safety-critical) is associated with unavailability or underperformance.

A carrier-grade workload pivots on both business-critical and safety-critical, where there's a fundamental

requirement to be operational with only minutes or even seconds of downtime per calendar year. Failure to

achieve this uptime requirement can result in extensive loss of life, incur significant fines, or contractual

penalties.

The operational aspect of the workload includes how reliability is measured and the targets that it must meet or

exceed. Highly reliable systems typically target 99.999% uptime (commonly referred to as '5 9s') or 0.001%

downtime in a year (approximately 5 minutes). Some systems target 99.9999% uptime, or 30 seconds

downtime per year, or even higher levels of reliability. This covers all forms and causes of outage – scheduled

maintenance, infrastructure failure, human error, software issues and even natural disaster.
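
The downtime figures above follow directly from the availability target. The short sketch below shows the arithmetic, assuming a 365-day year; the function name is hypothetical.

```python
def downtime_budget_seconds(availability_percent: float, days_per_year: float = 365.0) -> float:
    """Return the downtime allowed per year, in seconds, for an availability target."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    # The unavailable fraction of the year is (100 - availability) percent.
    return seconds_per_year * (100.0 - availability_percent) / 100.0


for target in (99.999, 99.9999):
    budget = downtime_budget_seconds(target)
    print(f"{target}% uptime allows about {budget:.1f} seconds ({budget / 60:.2f} minutes) per year")
```

Under this assumption, 99.999% uptime allows roughly 315 seconds (a little over 5 minutes) of downtime per year, and 99.9999% allows roughly 31.5 seconds.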

Although the platform used has evolved from dedicated, proprietary hardware through commercial, off-the-shelf

hardware to OpenStack or VMware clouds, telecommunication companies consistently deliver services

achieving ≤ 5 minutes of downtime per year, and in many cases, achieve ≤ 30 seconds of downtime due to

unscheduled outages.

Building carrier-grade workloads is a challenge for these main reasons:

Monolithic solutions

Only building zonal redundancy

What are the key design areas?

DESIGN AREA | DESCRIPTION

Fault tolerance | Application design must allow for unavoidable failures so that the application as a whole can continue to operate at some level. The design must minimize points of failure and take a federated approach.

Telecommunication companies have well-architected applications that deliver the expected behavior on their existing infrastructure. However, care should be taken before assuming that porting these applications as is to a public cloud infrastructure won't impact their resiliency.

The existing applications make a set of assumptions about their underlying infrastructure, which are unlikely to

remain true when moving from on-premises to public cloud. The architect must check that they still hold and

adjust infrastructure and application design to accommodate the new reality. The architect should also look for

opportunities where the new infrastructure removes limitations that existed on-premises.

For example, upgrade of on-premises systems must happen in place because it's not viable to maintain

sufficient hardware to create a new deployment alongside and slowly transition in a safe manner. This upgrade

path generates a host of requirements for how upgrades and rollbacks are managed. These requirements lead

to complexity and mean that upgrades are infrequent and only permitted in carefully managed maintenance

windows.

However, in public cloud, it's reasonable to create a new deployment in parallel with the existing deployment. This process creates the opportunity for major simplifications in the application's operational design and improvements in the user's experience and expectations.

Experience across a range of mission-critical industries shows that it isn't realistic to attempt to construct a

monolithic solution which will achieve the desired levels of availability. There are just too many potential sources

of failure in a complex system. Instead, successful solutions are composed of individual independent and

redundant elements. Each unit has relatively basic availability (typically ~99.9%), but when combined together

the total solution achieves the necessary availability goals. The art of carrier-grade engineering then becomes

ensuring that the only dependencies common to the redundant elements are those which are absolutely

necessary since they exert the most significant influence on overall availability, often orders of magnitude

greater than anything else in the design.
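The arithmetic behind this claim can be sketched as follows. This minimal illustration assumes that the elements fail independently and that the service is up whenever at least one element is up. The function names are hypothetical.

```python
def combined_availability(element_availability: float, count: int) -> float:
    """Availability of `count` redundant elements, assuming independent failures."""
    unavailability = (1 - element_availability) ** count
    return 1 - unavailability


def elements_needed(element_availability: float, target: float) -> int:
    """Smallest number of independent elements whose combination meets `target`."""
    if not 0 < element_availability < 1 or not 0 < target < 1:
        raise ValueError("availabilities must be strictly between 0 and 1")
    count = 1
    # Compare unavailabilities rather than availabilities to keep precision near 1.0.
    while (1 - element_availability) ** count > 1 - target:
        count += 1
    return count


if __name__ == "__main__":
    print(elements_needed(0.999, 0.99999))
```

Under these assumptions, two 99.9% elements are enough to meet a 99.999% target. Any dependency that the elements share breaks the independence assumption, which is why common dependencies dominate the result.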

Using Microsoft Azure Availability Zones is the basic choice for reducing the risk of outage due to hardware

failure or localized environmental issues. However, it isn't enough to achieve carrier-grade availability, mainly for

these reasons:

Availability Zones (AZs) are designed so that the network latency between any two zones in a single region is ≤ 2 ms. AZs can't be widely and geographically dispersed. So, the AZs share a correlated risk of failure due to natural disasters, such as flooding or massive power outages, which could disable multiple AZs within a region.

Many Azure services are explicitly designed to be zone-redundant so that the applications using them don't need explicit logic to benefit from the availability gain. This redundancy function within the service requires collaboration between the elements in each zone. There's often an unavoidable risk of software failure in one zone that causes correlated failures in other zones. For example, any issue with a secret or certificate used with the zone-redundant service could impact all AZs simultaneously.

When designing a carrier-grade workload, consider the following areas:

DESIGN AREA | DESCRIPTION

Data model | The design must address the consistency, availability, and partition tolerance needs of the database. Carrier-grade availability requires that the application is deployed across multiple regions. This deployment requires careful thought about how the application's data will be managed across those regions.

Health modeling | An effective health model provides observability of all components, platform and application, so that issues can be quickly detected and a response is ready, either through self-healing or other remediations.

Testing and validation | The design and implementation of the application must be thoroughly tested. In addition, the integration and deployment of the application as a whole solution must be tested.

Next step

Start by reviewing the design principles for carrier-grade application scenarios.

Design principles

Design principles of a carrier-grade workload on Azure

12/16/2022 • 2 minutes to read

IMPORTANT

Carrier-grade workloads must be designed as per the guiding principles of the Well-Architected Framework quality pillars:

Reliability

Performance Efficiency

Operational Excellence

Security

Cost Optimization

This article describes the carrier-grade design principles that resonate with and extend the mission-critical design principles. These collective principles serve as a road map for subsequent design decisions across the critical design areas. We highly recommend that you get to know these principles to better understand their effects and the trade-offs associated with non-adherence.

There are obvious cost tradeoffs associated with introducing greater reliability, which should be carefully

considered in the context of workload requirements.

This article is part of the Azure Well-Architected carrier-grade workload series. If you aren't familiar with this series, we

recommend you start with What is a carrier-grade workload?

Keep this high-level architecture model in mind when considering these points.

Assume failure

Start from the assumption that everything can, and will, fail. Application design must allow for these failures with fault tolerance so that an application can continue to operate at some level.

Minimize single points of failure and implement a federated approach.

Deploy the application across multiple regions with proper data management across those regions, allowing for the impacts of CAP theorem.

Detect issues automatically and respond within seconds. For more information, see health monitoring.

Test the full solution including the application implementation, platform integration, and deployment. This testing should include chaos testing on production systems to avoid testing bias.

Share nothing

Share nothing is a common and straightforward approach to achieve high availability. Use this approach when an application can be serviced by multiple, distinct elements, which are interchangeable. The individual elements must have a well-understood availability metric, but it doesn't need to be high. However, the elements must be combined in a way to remain independent, with no shared infrastructure or dependencies.

Sharing nothing at all is often impossible. Starting from the position that nothing should be shared, and adding only the smallest possible set of shared dependencies, should result in an optimal solution.
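The cost of a single shared dependency can be estimated with a short calculation. The sketch below is illustrative only. It assumes four interchangeable elements with six hours of downtime per year each, failures that are independent of one another, a 365-day year, and a hypothetical shared dependency with 99.99% availability. None of the names come from Azure.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600


def downtime_seconds(element_downtime_hours: float, count: int,
                     shared_availability: float = 1.0) -> float:
    """Expected yearly downtime of `count` redundant elements behind one shared dependency."""
    element_unavailability = element_downtime_hours / HOURS_PER_YEAR
    all_elements_down = element_unavailability ** count
    # The service is up only when the shared dependency is up
    # and at least one of the redundant elements is up.
    service_availability = shared_availability * (1 - all_elements_down)
    return (1 - service_availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    print(downtime_seconds(6, 4))                              # nothing shared
    print(downtime_seconds(6, 4, shared_availability=0.9999))  # one shared dependency
```

With these hypothetical figures, the four independent elements stay far below 30 seconds of expected downtime per year, while one shared dependency at 99.99% raises the figure to roughly 53 minutes.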

Example

Given a single system that has six hours of downtime per year (around 3.5 9s), a solution that combines four systems where the periods of downtime are uncorrelated will experience less than 30 seconds of downtime per year.

As soon as those four systems rely on a common service, such as global DNS, their downtime is no longer uncorrelated. The resulting downtime will be higher.

Next step

Review the fault tolerance design area for carrier-grade workloads.

Design area: Fault tolerance

Fault tolerance for carrier-grade workloads

12/16/2022 • 4 minutes to read

High availability through combination

Traffic management failure response

Repair and recovery failure response

Telecommunication companies have shown that it's possible to implement applications that meet and exceed carrier-grade availability requirements, whilst running on top of unreliable elements. Companies exceed these requirements through fault tolerance, which includes the following aspects:

Combination of independent, non-highly available elements.

Traffic management failure response mechanisms.

Repair and capacity-recovery failure response mechanisms.

Use of highly available cross-element databases.

When designing for carrier-grade reliability and resiliency, assume that every component can and will fail. The design will require a layered approach to failure resolution. Part of validating the design is creating a quantitative probabilistic model of the various failure modes, which clearly identifies the key dependencies that the application has and shows that the application can achieve the necessary availability, provided those dependencies meet their own Service Level Objectives (SLOs). This model should be retained and continuously validated after development and in production to ensure that the test and live data match what the model predicts.
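One simple way to start such a model is a two-state (up/down) Markov chain per failure mode, combined across the dependencies that must all be available. The sketch below is a minimal illustration under stated assumptions: failure modes are independent, rates are constant, and a year has 365 days. The FailureMode structure and function names are hypothetical.

```python
from dataclasses import dataclass
from math import prod

HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year


@dataclass(frozen=True)
class FailureMode:
    """One failure mode of a dependency (illustrative structure)."""
    name: str
    failures_per_year: float  # how often the failure occurs while the element is up
    hours_to_resolve: float   # mean time to detect and resolve the failure


def steady_state_availability(mode: FailureMode) -> float:
    """Two-state Markov chain (up/down): availability = mu / (lambda + mu)."""
    failure_rate = mode.failures_per_year / HOURS_PER_YEAR  # lambda, per hour
    repair_rate = 1 / mode.hours_to_resolve                 # mu, per hour
    return repair_rate / (failure_rate + repair_rate)


def serial_availability(modes: list[FailureMode]) -> float:
    """Availability when every listed failure mode must be absent, assuming independence."""
    return prod(steady_state_availability(mode) for mode in modes)
```

A model like this makes the dependency on each SLO explicit: changing one failure rate or resolution time shows directly how far the overall availability moves. Correlated failures need extra states or terms that this sketch leaves out.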

To reach high availability, take independent elements, which aren't highly available, and combine them so that

they remain independent entities. The probability of multiple elements failing together is much lower than the

probability of failure of any single element. We define this process as the Share Nothing architectural style.

When individual elements fail, or there are connectivity issues, the system's traffic management layer has

various ways it can respond to maintain overall service. The solution should implement as many of the

following mechanisms as possible or necessary to achieve availability. The first two mechanisms rely on specific

behavior in the client, which may not always be an option:

1. Instantaneous client retry to alternate target: On failure, or after no response from a retry, the client can instantly retry the request with an alternative element that's considered likely to succeed, as opposed to submitting a new DNS query to identify an alternative.

2. Client-based health lists: Maintains lists of elements, which are known-healthy and known-unhealthy. Requests are always sent to a known-healthy element, unless that list is empty. If empty, the known-unhealthy elements are tried.

3. Server-based polling: System-based distribution mechanisms, such as DNS or Azure Traffic Manager (ATM), implement their own liveness detection and monitoring to enable routing around unhealthy elements.
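The two client-side mechanisms can be combined in a small amount of client logic. The sketch below is one possible shape, not a prescribed implementation. The class name, the transport callable, and the use of ConnectionError to signal a failed element are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


class HealthListClient:
    """Client that keeps known-healthy and known-unhealthy element lists."""

    def __init__(self, elements: Iterable[str]) -> None:
        self.healthy: List[str] = list(elements)
        self.unhealthy: List[str] = []

    def _mark(self, element: str, ok: bool) -> None:
        # Move the element to the list that matches the latest observation.
        source, target = (self.unhealthy, self.healthy) if ok else (self.healthy, self.unhealthy)
        if element in source:
            source.remove(element)
            target.append(element)

    def send(self, request: str, transport: Callable[[str, str], str]) -> str:
        # Known-healthy elements are tried first; known-unhealthy ones are a last resort.
        candidates = self.healthy + self.unhealthy
        last_error = None
        for element in candidates:
            try:
                response = transport(element, request)
            except ConnectionError as exc:
                self._mark(element, ok=False)
                last_error = exc
                continue  # retry immediately on the next candidate, without a new DNS lookup
            self._mark(element, ok=True)
            return response
        raise ConnectionError("no element could serve the request") from last_error
```

A real client would also need timeouts and a way to re-probe unhealthy elements in the background. Both are omitted here to keep the sketch short.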

Traffic management responses can work around failures, but where the failure is long-lived or permanent, the

defective element must be repaired or replaced. There are two main choices here:

1. Restoring redundancy: Failure of an element reduces overall system capacity. System design should allow for this capacity reduction through provisioning redundant capacity; however, multiple overlapping failures will lead to true outages. There must be an automated process that detects the failure and adds capacity to restore redundancy to its normal levels. The impact can be determined from the failure rate analysis. In many cases, automatically restoring the redundancy level can increase the acceptable downtime of any individual element.

2. (Optional) Repairing the failed element: The solution may need to include a mechanism that detects the failure and attempts to repair the failed element in place. This solution may not apply for cases where the process of restoring redundancy instantiates a new element to replace the failed one, and terminates and decommissions the failed element.

The following diagram shows how the two modes of failure response cooperate to provide service-level fault tolerance:

Failure rate analysis

No amount of effort leads to a perfect system, so start with the assumption that everything can and will fail. Consider how the solution, as a whole, will behave. Failure rate analysis starts with the lower-level individual systems and combines the results together to model the full solution.

The analysis is typically done using Markov modeling, which considers all possible failure modes. For each failure mode, consider the following factors:

Rate

Duration and resolution time

Probability of correlated failure (what is the chance that a failure in service X causes a failure in service Y)

The outcome is an estimate of the system availability and informs consideration of the blast radius. The blast radius is the set of things that won't work given a particular failure. Good design aims to limit the scope and severity of that set, since failure is going to happen. For example, a failure that blocks creation of new users, but doesn't impact existing ones, is less concerning than a failure that stops service for all users.

Failure rate analysis provides an estimate of the overall system-level availability. The analysis identifies the critical dependencies on which that availability relies. Where failure rate analysis indicates that system availability falls short, it also provides specific, quantitative insights on where improvements are needed and to what extent.

For more details on failure mode analysis in Azure, reference Failure mode analysis for Azure applications.

Cascading failures and overload

Carrier-grade application design must pay careful attention to the risks of cascading failures, where failure of one element leads to failure of other elements, often due to overload. Cascading failures aren't unique to carrier-grade applications, but the reliability and the response time demand graceful degradation and automated recovery.

Next step

Review the data model design area for carrier-grade workloads.

Design area: Data model

Data modeling for carrier-grade workloads

12/16/2022 • 2 minutes to read

CAP theorem

NOTE

IMPORTANT

Enterprise applications deployed in a single region can typically ignore this model of application design and

safely delegate responsibility to the database layer to make data reliably available. This behavior isn't the case

for carrier-grade applications, which must be deployed across multiple regions. Multi-region deployment forces

the application architect to consider the compromises they're willing to accept for their data.

Databases can provide three key properties for the data they manage for an application, known as CAP:

Consistency: A data read returns the most recent write.

Availability: Every request receives a valid response.

Partition tolerance: The system continues to operate despite delay or total loss of communication between

elements.

The CAP theorem states that a database layer can't provide all three of these properties for the same data at the

same time in the presence of network partitions. The architect needs to make explicit design decisions about

which of the CAP properties to favor under which conditions, and how to deal with the edge cases.

According to the CAP theorem, any database can only guarantee two out of three possible properties for the

same data at the same time in the presence of network partition.

Multi-region deployment means partition tolerance becomes significant. In most cases, carrier-grade architects

prioritize partition tolerance and availability over consistency. For each type of data, the architect must consider

what tradeoffs they're willing to make, considering the edge cases.

For example, consider the database of system users. Is it acceptable for the user database to drop to read-only if there's a network partition? This behavior prioritizes consistency and read availability over write availability. This prioritization may not be suitable if it's unacceptable for a user to access an isolated site after their permissions are revoked elsewhere. That scenario would require all database access to be blocked if there's a partition, which prioritizes consistency over both read and write availability.

The compromises made can be different for different databases within the same application, since the databases are likely

to have different usage profiles.

Where consistency is the compromise, the application must cope with data inconsistency artifacts by using

conflict-free replicated data types (CRDTs). The use of CRDTs requires discipline in the application design and

implementation. Their use means data merges following partition events can be handled automatically by the

data layer without human intervention or complex application-level logic.
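As an illustration of the idea, the sketch below shows a grow-only counter (G-Counter), one of the simplest CRDTs. Each replica increments only its own slot, and a merge takes the per-replica maximum, so replicas converge regardless of the order in which they exchange state. The class and the replica names are hypothetical.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Taking the maximum per replica makes the merge commutative,
        # associative, and idempotent, so repeated merges are harmless.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)


if __name__ == "__main__":
    east, west = GCounter("east"), GCounter("west")
    east.increment(3)   # writes accepted in one region during a partition
    west.increment(2)   # writes accepted in the other region
    east.merge(west)    # partition heals: exchange state in both directions
    west.merge(east)
    print(east.value(), west.value())
```

Richer data, such as sets or registers, needs other CRDT designs, and choosing them is part of the discipline the application design has to accept.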

More details on data platform choices for your mission-critical workload are available here.

The architect must also understand the tradeoffs in the data model that were made within the dependent services. Those tradeoffs impact the service delivered to their application because they may not align with the application-level requirements.

Next step

Review the health modeling design area for carrier-grade workloads.

Design area: Health modeling

Health modeling for carrier-grade workloads

12/16/2022 • 3 minutes to read

IMPORTANT

Management and monitoring

Federated model

High availability requires careful health monitoring to automatically detect and respond to issues within

seconds. This monitoring requires built-in telemetry of key dependencies to reliably detect the failure. The

application itself requires additional telemetry (Service Level Indicators) which accurately report the health of

the application in a way that is perceived by users of the application. Evaluation against SLOs may be necessary.

The failure rate analysis and general health modeling of the application should generate clear metrics indicative

of the service and health of its constituent elements. These metrics must be included in the design so that the

true service availability can be monitored. By including metrics, you can track the most useful leading indicators

to trigger the automated failure responses and to generate the necessary alerts for human intervention.

More details on how to build a health model for your mission-critical workload are available here.

Monitoring and management require the following thought processes:

How will the application handle bugs in the framework?

How is the application upgraded?

What actions should be taken during an incident?

For example, a solution may rely on Azure DevOps (ADO) to host its Git repository for all configuration. If the Azure region hosting that ADO repo fails, the recovery time is two hours. If the solution is deployed in the same region, it's not possible to modify configuration to add capacity elsewhere for that entire two-hour period. As a result, the application architect must consider correlated failure modes for key services, such as:

Azure Traffic Manager

Azure Key Vault

Azure Kubernetes Service

These key services may be a necessary part of the application-level response to failure, so their correlated failure modes matter. It's vital to create control planes that aren't impacted by the same failure as the application.

The management tooling required for issue diagnosis and troubleshooting must be the same as the tooling used for normal day-to-day operations tasks. Using the same tooling ensures that it's familiar and proven to work, and maximizes the users' familiarity with the user interface and process steps. Requiring operators to switch to a different tool set to resolve a high-pressure outage isn't conducive to identifying and resolving the issue effectively.

A highly available application or service must have a highly available management and monitoring

infrastructure built using the same well-architected principles of federation and fault tolerance. Infrastructures

built on these well-architected principles ensure individual regions can be self sufficient if disconnected.

If there's a disconnect event, the system degenerates into individually functioning islands, instead of using a

primary/backup system. The federated model is flexible and resilient, and automatically adapts to partition and

reconnection events.

Health and unhealth metrics

Monitoring and tracing

Next step

For example, logs and metrics are stored in the Availability Zone (AZ) where they're generated. A query of

metrics uses an opaque process of federated search to query the metric stores in all reachable AZs. It comes

down to the requirements of the specific application about what level of logs, metrics, and alarms data should

be replicated to other regions. Typically, alarms should be replicated, but there may be insufficient justification to

replicate logs and metrics.

Internal metrics are useful as unhealth metrics. These metrics reliably indicate the presence of an issue, but the reverse isn't true. No evidence of poor health isn't evidence of good health, as the customer perceives health.

For example, a DNS issue means requests aren't arriving at the database service. The DNS error doesn't affect the database read-success metric because this metric isn't seeing any errors. However, the end user perceives a total outage because they aren't able to access the database. At least a portion of health metrics must be measured externally, so that these metrics include everything the end user will experience.
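The asymmetry between internal and external signals can be made explicit in the health model. The sketch below is a hypothetical classification rule, with an assumed 1% error threshold. Internal metrics can only prove that the service is unhealthy, and only an external probe can confirm that it's healthy.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


def assess_health(internal_error_rate: float,
                  external_probe_ok: Optional[bool],
                  error_threshold: float = 0.01) -> Health:
    """Combine an internal unhealth metric with an external, user-like probe result."""
    # Either signal is enough to declare the service unhealthy.
    if internal_error_rate > error_threshold or external_probe_ok is False:
        return Health.UNHEALTHY
    # A clean internal metric alone is not evidence of good health.
    if external_probe_ok is None:
        return Health.UNKNOWN
    return Health.HEALTHY
```

In the DNS scenario above, the internal error rate stays at zero while the external probe fails, so this rule reports the service as unhealthy.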

The support team's ability to detect, diagnose, and resolve issues is an important part of delivering a highly available application. To ensure success, the monitoring and tracing element must deliver high levels of visibility, so that one-in-1,000 events can be found and resolved.

A tracing solution that only logs 0.1% of requests has a one-in-1,000,000 chance of recording such an event on any given request, which means diagnosis and resolution are highly unlikely. Yet, failure to resolve such issues will have a meaningful impact on availability.
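The effect of trace sampling on rare events can be checked with a short calculation. The sketch below assumes that requests are sampled uniformly at random, independently of whether they hit the rare issue. The function name is hypothetical.

```python
def capture_probability(event_rate: float, sample_rate: float, requests: int = 1) -> float:
    """Probability that at least one affected request is traced among `requests`."""
    # A request is useful for diagnosis only if it is both affected and sampled.
    per_request = event_rate * sample_rate
    return 1 - (1 - per_request) ** requests


if __name__ == "__main__":
    print(capture_probability(0.001, 0.001))             # a single request
    print(capture_probability(0.001, 0.001, 1_000_000))  # one million requests
```

With a one-in-1,000 event and 0.1% sampling, even one million requests give only about a 63% chance of capturing a single affected trace.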

Review the testing and validation design area for carrier-grade workloads.

Design area: Testing and validation

Testing and validation for carrier-grade workloads

12/16/2022 • 2 minutes to read

IMPORTANT

Human error

Clients

Continuous testing and validation can detect and help resolve issues before they become potentially life

threatening. Consider well-known testing methodologies such as chaos testing. Testing should be conducted for

the lifetime of the application because the deployment environment is complex and multi-layered.

More details on how to implement continuous validation for your mission-critical workload are available here.

Also, supportability must be strong throughout the application lifetime. Highly available systems rely on high

quality support teams able to rapidly respond to and resolve issues in the field, conduct root cause analysis and

look for systematic design flaws.

Proving that an application is well architected requires testing, ideally using a chaos testing framework to avoid testing bias. This methodology simulates failures of all dependent elements. Robust and regular testing should both prove the design and validate the original failure mode analysis.

A warning flag should be raised for any application or service for which the redundancy or resiliency measures can't be tested because it's considered too risky.

If redundancy and resiliency measures aren't tested, then the only valid assumption, from a safety-critical point of view, is that these measures aren't going to work when needed. Using common paths for software upgrades, configuration updates, and fault recovery, for example, provides a good mechanism for validating that measures will work.

Experience from Telcos is that as much as 60% of all outages are actually the result of human error. A well-

architected application recognizes this and seeks to compensate. Here are some suggested approaches, but the

list is not exhaustive, and what is applicable to a given workload needs to be considered on a case-by-case basis.

Maximizing use of automation avoids human operators having to enter long and complex commands or

conduct repetitive operations across multiple elements. However, care must be taken to consider the blast

radius, as there is a risk of automation actually magnifying the effect of a configuration error, allowing it to

roll out across a global network in seconds. Strong checks and balances such as decision gates requiring

human approval before proceeding to the next step are advised.

Leveraging syntax checkers and simulation tools minimizes the chance of errors or unforeseen side effects from changes making their way into widespread production.

Use of carefully controlled canary deployments ensures that the effect of changes in full production can be observed and validated at limited scope.

Ensuring that the management interfaces and processes needed for fault recovery are the same as those used in day-to-day operation avoids operators being confronted with unfamiliar screens and barely used methods of procedure (MOPs) at times of peak stress.
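A canary deployment combined with a human decision gate can be reduced to a small, explicit rule. The sketch below is one hypothetical shape for such a gate. The thresholds, the parameter names, and the three outcomes are assumptions chosen for illustration, not values recommended by this guidance.

```python
def canary_decision(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    max_ratio: float = 1.5, min_requests: int = 1000) -> str:
    """Decide what to do with a canary before widening its blast radius."""
    if baseline_requests <= 0:
        raise ValueError("a baseline with traffic is required for comparison")
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge the change yet
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    if canary_rate > baseline_rate * max_ratio:
        return "rollback"  # automation may stop a rollout on its own
    # Automation never widens the rollout by itself: a human approves the next step.
    return "await_approval"
```

The asymmetry in this sketch is deliberate. Automation is allowed to stop or reverse a change, but expanding the scope of a change requires a person to approve it.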

Common client libraries are also part of the end-to-end system and need equivalent analysis and testing. Software issues in common client code that simultaneously impact a proportion of the system clients will impact overall availability in the same way as application server-side issues.

Next step

Revisit the five pillars of architectural excellence to form a solid foundation for your carrier-grade workloads.

Azure Well-Architected Framework

Overview of a hybrid workload

12/16/2022 • 4 minutes to read

Extend Azure management to any infrastructure

TIP

Run Azure services anywhere

Run Azure data services anywhere

Customer workloads are becoming increasingly complex, with many applications often running on different

hardware across on-premises, multicloud, and the edge. Managing these disparate workload architectures,

ensuring uncompromised security, and enabling developer agility are critical to success.

Azure uniquely helps you meet these challenges, giving you the flexibility to innovate anywhere in your hybrid

environment while operating seamlessly and securely. The Well-Architected Framework includes a hybrid

description for each of the five pillars: cost optimization, operational excellence, performance efficiency,

reliability, and security. These descriptions create clarity on the considerations needed for your workloads to

operate effectively across hybrid environments.

Adopting a hybrid model offers multiple solutions that enable you to confidently deliver hybrid workloads: run

Azure data services anywhere, modernize applications anywhere, and manage your workloads anywhere.

Applying the principles in this article series to each of your workloads will better prepare you for hybrid adoption. For

larger or centrally managed organizations, hybrid and multicloud are commonly part of a broader strategic objective. If

you need to scale these principles across a portfolio of workloads using hybrid and multicloud environments, you may

want to start with the Cloud Adoption Framework's hybrid and multicloud scenario and best practices. Then return to this

series to refine each of your workload architectures.

Use Azure Arc enabled infrastructure to extend Azure management to any infrastructure in a hybrid environment. Key features of Azure Arc enabled infrastructure are:

Unified Operations

Organize resources such as virtual machines, Kubernetes clusters and Azure services deployed across

your entire IT environment.

Manage and govern resources with a single pane of glass from Azure.

Integrate with Azure Lighthouse for managed service provider support.

Adopt cloud practices

Easily adopt DevOps techniques such as infrastructure as code.

Empower developers with self-service and choice of tools.

Standardize change control with configuration management systems, such as GitOps and DSC.

Azure Arc allows you to run Azure services anywhere. This allows you to build consistent hybrid and multicloud application architectures by using Azure services that can run in Azure, on-premises, at the edge, or at other cloud providers.

Use Azure Arc enabled data services to run Azure data services anywhere to support your hybrid workloads. Key features of Azure Arc enabled data services are:

Run Azure Application services anywhere

Modernize applications anywhere

Manage workloads anywhere

Next steps

Run Azure data services on any Kubernetes cluster deployed on any hardware.

Gain cloud automation benefits, always up-to-date innovation in Azure data services, and unified management of your on-premises and cloud data assets, with a cloud billing model across both environments.

Azure SQL Database and Azure PostgreSQL Hyperscale are the first set of Azure data services that are Azure

Arc enabled.

Use Azure Arc enabled Application services to run Azure App Service, Functions, Logic Apps, Event Grid, and API Management anywhere to support your hybrid workloads. Key features of Azure Arc enabled application services are as follows:

Web Apps - Azure App Service makes building and managing web applications and APIs easy, with a fully

managed platform and features like autoscaling, deployment slots, and integrated web authentication.

Functions - Azure Functions makes event-driven programming simple, with state-of-the-art autoscaling, and

with triggers and bindings to integrate with other Azure services.

Logic Apps - Azure Logic Apps produces automated workflows for integrating apps, data, services, and

backend systems, with a library of more than 400 connectors.

Event Grid - Azure Event Grid simplifies event-based applications, with a single service for managing the

routing of events from any source to any destination.

Azure API Management gateway - Azure API Management provides a unified management experience and

full observability across all internal and external APIs.

Use the Azure Stack family to modernize applications without ever leaving the datacenter. Key features of the Azure Stack family are:

Extend Azure to your on-premises workloads with Azure Stack Hub. Build and run cloud apps on premises, in

connected or disconnected scenarios, to meet regulatory or technical requirements.

Use Azure Stack HCI to run virtualized workloads on premises and easily connect to Azure to access cloud

management and security services.

Build and run your intelligent edge solutions on Azure Stack Edge, an Azure managed appliance to run

machine learning models and compute at the edge to get results quickly—and close to where data is being

generated. Easily transfer the full data set to Azure for further analysis or archive.

Use Azure Arc management to extend Azure management to all assets in your workloads, regardless of where they are hosted. Key features of Azure Arc management are:

Adopt cloud practices

Easily adopt DevOps techniques such as infrastructure as code.

Empower developers with self-service and choice of tools.

Standardize change control with configuration management systems, such as GitOps and DSC.

Scale across workloads with Unified Operations

Organize resources such as virtual machines, Kubernetes clusters and Azure services deployed across

your entire IT environment.

Manage and govern resources with a single pane of glass from Azure.

Integrate with Azure Lighthouse for managed service provider support.

Cost optimization

Cost optimization in a hybrid workload

12/16/2022 • 4 minutes to read

Workload definitions

Functionality

A key benefit of hybrid cloud environments is the ability to scale dynamically and back up resources in the

cloud, avoiding the capital expenditures of a secondary datacenter. However, when workloads sit in both on-

premises and cloud environments, it can be challenging to have visibility into the cost. With Azure's hybrid

technologies, you can define policies and constraints for both on-premises and cloud workloads with Azure Arc.

By utilizing Azure Policy, you're able to enforce organizational standards for your workload and the entire IT

estate.

Azure Arc helps minimize or even eliminate the need for on-premises management and monitoring systems,

which reduces operational complexity and cost, especially in large, diverse, and distributed environments. This

helps offset additional costs associated with Azure Arc-related services. For example, advanced data security for

an Azure Arc enabled SQL Server instance requires Microsoft Defender for Cloud, which has pricing

implications.

Other considerations are described in the Principles of cost optimization section in the Microsoft Azure Well-

Architected Framework.

Define the following for your workloads:

Monitor cloud spend with hybrid workloads. Track cost trends and forecast future spend with

dashboards in Azure for your on-prem data estates with Azure Arc.

Keep within cost constraints.

Choose a flexible billing model. With Azure Arc enabled data services, you can use existing hardware

with the addition of an operating expense (OPEX) model.

Create, apply, and enforce standardized and custom tags and policies.

Enforce run-time conformance and audit resources with Azure Policy.
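The tagging and conformance items above can be illustrated with a small, self-contained sketch. It does not call Azure Policy or any Azure API. The inventory, the resource names, and the required tag names (`costCenter`, `environment`) are hypothetical, chosen only to show the kind of audit a tag policy performs.

```python
# Hypothetical tag policy: every resource must carry these tag keys.
REQUIRED_TAGS = {"costCenter", "environment"}


def missing_tags(resource: dict) -> set:
    """Return the required tag keys that the resource does not have."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))


def audit(resources: list) -> dict:
    """Map each non-compliant resource name to its sorted missing tags."""
    return {
        r["name"]: sorted(missing_tags(r))
        for r in resources
        if missing_tags(r)
    }


# Made-up inventory mixing on-premises and edge resources.
inventory = [
    {"name": "onprem-sql-01", "tags": {"costCenter": "1001", "environment": "prod"}},
    {"name": "edge-k8s-02", "tags": {"environment": "dev"}},
    {"name": "branch-vm-03", "tags": {}},
]

non_compliant = audit(inventory)
for name, tags in non_compliant.items():
    print(f"{name}: missing {', '.join(tags)}")
```

In a real environment the inventory would come from Azure rather than a literal list, and enforcement would be handled by policy assignments instead of a script.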

For budget concerns, you get a considerable amount of functionality at no cost that you can use across all of

your servers and clusters with Azure Arc enabled servers. You can turn on additional Azure services to each

workload as you need them, or not at all.

Free Core Azure Arc capabilities

Update, management

Search index

Group, tags

Portal

Templates, extensions

RBAC, subscriptions

Paid-for Azure Arc enabled attached services

Azure Policy

Azure Monitor

Defender for Cloud – Standard

Tips

Azure Architecture Center (AAC) resources related to hybrid cost

Infrastructure Decisions

Capacity planning

Provision

Microsoft Sentinel

Backup

Config and change management

Start slow. Light up new capabilities as needed. Most of Azure Arc's resources are free to start.

Save time with unified management for your on-premises and cloud workloads by projecting them all

into Azure.

Automate and delegate remediation of incidents and problems to service teams without IT intervention.

Manage configurations for Azure Arc enabled servers

Azure Arc hybrid management and deployment for Kubernetes clusters

Optimize administration of SQL Server instances in on-premises and multi-cloud environments by

leveraging Azure Arc

Disaster Recovery for Azure Stack Hub virtual machines

Build high availability into your BCDR strategy

Use Azure Stack HCI switchless interconnect and lightweight quorum for Remote Office/Branch Office

Archive on-premises data to cloud

Azure Stack HCI can help reduce costs by letting you use your existing Hyper-V and Windows Server skills to

consolidate aging servers and storage. Azure Stack HCI pricing follows the monthly subscription billing model,

with a flat rate per physical processor core in an Azure Stack HCI cluster.

Use Azure Stack HCI to modernize on-prem workloads with hyperconverged infrastructure. Azure Stack HCI billing is

based on a monthly subscription fee per physical processor core, not a perpetual license. When customers

connect to Azure, the number of cores used is automatically uploaded and assessed for billing purposes. Cost

doesn't vary with consumption beyond the physical processor cores. This means that more VMs don't cost

more, and customers who are able to run denser virtual environments are rewarded.
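The per-core billing model described above can be made concrete with a short arithmetic sketch. The rate used here is a placeholder, not a published price, and the `HciCluster` type is a hypothetical helper for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HciCluster:
    """Hypothetical cluster description used only for this sketch."""
    nodes: int
    cores_per_node: int
    vm_count: int


def monthly_subscription_cost(cluster: HciCluster, rate_per_core: float) -> float:
    """Flat monthly fee: physical cores multiplied by the per-core rate.

    vm_count is deliberately ignored, mirroring the point that running
    more VMs on the same physical cores does not raise the fee.
    """
    physical_cores = cluster.nodes * cluster.cores_per_node
    return physical_cores * rate_per_core


# Assumed placeholder rate; check the Azure Stack HCI pricing page for real figures.
PLACEHOLDER_RATE = 10.0

sparse = HciCluster(nodes=2, cores_per_node=16, vm_count=10)
dense = HciCluster(nodes=2, cores_per_node=16, vm_count=40)

sparse_cost = monthly_subscription_cost(sparse, PLACEHOLDER_RATE)
dense_cost = monthly_subscription_cost(dense, PLACEHOLDER_RATE)

# Same fee for both clusters, but a lower effective cost per VM on the denser one.
print(sparse_cost, dense_cost)
print(sparse_cost / sparse.vm_count, dense_cost / dense.vm_count)
```

Both clusters have the same 32 physical cores, so the fee is identical. The denser cluster simply spreads that fee across more VMs.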

If you are currently using VMware, you can take advantage of cost savings only available with Azure VMware

Solution. Easily move VMware workloads to Azure and increase your productivity with elasticity, scale, and fast

provisioning cycles. This will help enhance your workloads with the full range of Azure compute, monitor,

backup, database, IoT, and AI services.

Lastly, you can slowly begin migrating out of your datacenter and use Azure Arc while you're migrating to

project everything into Azure.

Check out our checklist under the Cost Optimization pillar in the Well-Architected Framework to learn more about capacity

planning, and build a checklist to design cost-effective workloads.

Define SLAs

Determine regulatory needs

One advantage of cloud computing is the ability to use the PaaS model. In some cases, PaaS services can be

cheaper than managing VMs on your own. However, some workloads can't be moved to the cloud for

regulatory or latency reasons. Azure Arc enabled services let you use cloud innovation flexibly, where you

need it, by deploying Azure services anywhere.

Monitor and optimize

Next steps

Click the following links for guidance in provisioning:

Azure Arc pricing

Azure Arc Jumpstart for templates (in GitHub)

Azure Stack HCI pricing

Azure VMware Solution pricing - Run your VMware workloads natively on Azure

Azure Stack Hub pricing

Azure Stack HCI can reduce costs by saving in server, storage, and network infrastructure.

Run your VMware workloads natively on Azure.

Treat cost monitoring and optimization as a process, rather than a point-in-time activity. You can conduct regular

cost reviews and forecast the capacity needs so that you can provision resources dynamically and scale with

demand.

Managing the Azure Arc enabled servers agent

With Azure Stack HCI

Bring all your resources into a single system so you can organize and inventory through a variety of

Azure scopes, such as Management groups, Subscriptions, and Resource Groups.

Create, apply, and enforce standardized and custom tags to keep track of resources.

Build powerful queries and search your global portfolio with Azure Resource Graph.

Costs for datacenter real estate, electricity, personnel, and servers can be reduced or eliminated.

Costs are now part of OPEX, which can be scaled as needed.

Operational excellence

Operational excellence in a hybrid workload

12/16/2022 • 5 minutes to read

Build cloud native apps anywhere, at scale

Connect Kubernetes clusters to Azure and start deploying using a GitOps model

Operational excellence consists of the operations processes that keep a system running in production.

Applications must be designed with DevOps principles in mind, and deployments must be reliable and

predictable. Use monitoring tools to verify that your application is running correctly and to gather custom

business telemetry that will tell you whether your application is being used as intended.

Use Azure Arc enabled infrastructure to add support for cloud Operational Excellence practices and tools to any

environment. Be sure to utilize reference architectures and other resources from this section that illustrate

applying these principles in hybrid and multicloud scenarios. The architectures referenced here can also be

found in the Azure Architecture Center, Hybrid and Multicloud category.

To keep your systems running, many workload teams have architected and designed applications where

components are distributed across public cloud services, private clouds, data centers, and edge locations. With

Azure Arc enabled Kubernetes, you can accelerate development by using best in class application services with

standardized deployment, configuration, security, and observability. One of the primary benefits of Azure Arc is

facilitating implementation of DevOps principles that apply established development practices to operations.

This results in improved agility, without jeopardizing the stability of the IT environment.

Centrally code and deploy applications confidently to any Kubernetes distribution in any location.

Centrally manage and delegate access for DevOps roles and responsibilities.

Reduce errors with consistent configuration and policy-driven deployment and operations for applications

and Kubernetes clusters.

Delegate access for DevOps roles and responsibilities through Azure RBAC.

Reduce errors with consistent policy-driven deployment and operations through GitHub and Azure Policy.

GitOps relies on a Git repository to host files that contain the configuration representing the expected state of a

resource. An agent running on the cluster monitors the state of the repository and, when there is a change on

the repository, the agent pulls the changed files to the cluster and applies the new configuration.

In the context of Azure Arc enabled Kubernetes clusters, a Git repository hosts a configuration of a Kubernetes

cluster, including its resources such as pods and deployments. A pod or a set of pods running on the cluster

polls the status of the repository and, once it detects a change, it pulls and applies the new configuration to the

cluster.

Azure Arc enabled Kubernetes clusters rely on Flux, an open-source GitOps deployment tool, to implement the

pods responsible for tracking changes to the Git repository you designate and applying them to the local cluster.

In addition, the containerized Flux operator also periodically reviews the existing cluster configuration to ensure

that it matches the one residing in the Git repository. If there is a configuration drift, the Flux agent remediates it

by reapplying the desired configuration.
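The pull-and-reconcile behavior described above can be sketched as a toy simulation. This is not Flux and uses no Kubernetes API. The dictionaries standing in for the Git repository and the cluster, and the workload names, are hypothetical. The sketch only shows the idea of comparing desired state with actual state and correcting drift.

```python
import copy


def reconcile(desired: dict, actual: dict) -> list:
    """Make `actual` match `desired` in place and return the actions taken."""
    actions = []
    # Create or update every resource declared in the desired state.
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = copy.deepcopy(spec)
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = copy.deepcopy(spec)
            actions.append(f"update {name}")
    # Remove resources that are no longer declared.
    for name in sorted(set(actual) - set(desired)):
        del actual[name]
        actions.append(f"delete {name}")
    return actions


# Stand-in for the configuration stored in the Git repository.
repo_state = {
    "web": {"image": "web:1.2", "replicas": 3},
    "api": {"image": "api:2.0", "replicas": 2},
}
# Stand-in for what is currently running on the cluster.
cluster_state = {
    "web": {"image": "web:1.1", "replicas": 3},
    "legacy-job": {"image": "job:0.9", "replicas": 1},
}

first_pass = reconcile(repo_state, cluster_state)

# Simulate configuration drift: someone edits the cluster directly.
cluster_state["api"]["replicas"] = 5
second_pass = reconcile(repo_state, cluster_state)

# With no further changes, reconciliation is a no-op.
third_pass = reconcile(repo_state, cluster_state)

print(first_pass, second_pass, third_pass)
```

The second pass is the drift case: the repository did not change, but the cluster did, so the loop reapplies the declared configuration.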

Each association between an Azure Arc enabled Kubernetes cluster configuration and the corresponding GitOps

repository resides in Azure, as part of the Azure Resource Manager resource representing the Azure Arc enabled

Kubernetes clusters. You can configure that association via traditional Azure management interfaces, such as the

Azure portal or Azure CLI. Alternatively, you can use Azure Policy to automate this process, allowing you to apply

it consistently to all resources in an entire subscription or individual resource groups you designate.

Modernize applications anywhere with Azure Kubernetes Service on Azure Stack HCI

Azure Stack HCI use cases

Resources and architectures related to Operational Excellence

Application design

If you are looking for a fully managed Kubernetes solution on-premises in your datacenters and/or edge

locations, AKS on Azure Stack HCI is a great option. Azure Kubernetes Service on Azure Stack HCI is an on-

premises implementation of Azure Kubernetes Service (AKS), which automates running containerized

applications at scale. Azure Kubernetes Service is now in preview on Azure Stack HCI and Windows Server 2019

Datacenter, making it quicker to get started hosting Linux and Windows containers in your datacenter.

AKS clusters on Azure Stack HCI can be connected to Azure Arc for centralized management. Once connected,

you can deploy your applications and Azure data services to these clusters and extend Azure services such as

Azure Monitor, Azure Policy and Microsoft Defender for Cloud.

Modernize your high-performance workloads and containerized applications

Use Azure Stack HCI to enable automated deployment, scaling and management of containerized

applications by running a Kubernetes cluster on your hyperconverged infrastructure.

Deploy AKS on Azure Stack HCI using Windows Admin Center or PowerShell.

Deploy and manage workloads in remote and branch sites

Use Azure Stack HCI to deploy your container-built edge workloads, and essential business

applications in highly available virtual machines (VMs).

Bring efficient application development and deployment to remote locations at the right price by

leveraging switchless deployment and two-node clusters.

Get a global view of your system's health using Azure Monitor.

Upgrade your infrastructure for remote work using VDI

Bring desktops on-premises for low latency and data sovereignty enabling remote work using a

brokerage service like Microsoft Remote Desktop Services. With Azure Stack HCI you can scale your

resources in a simple predictable way. Provide a secure way to deliver desktop services to a wide

range of devices without allowing users to store data locally or upload data from those local devices.

The introduction of cloud computing had a significant impact on how software is developed, delivered, and run.

With Azure Arc enabled infrastructure and Azure Arc components like Azure Arc enabled Kubernetes and Azure

Arc enabled data services it becomes possible to design cloud native applications with a consistent set of

principles and tooling across public cloud, private cloud, and the edge.

Click the following links for architecture details and diagrams that enable application design and DevOps

practices consistent with Operational excellence principles.

Azure Arc hybrid management and deployment for Kubernetes clusters

Run containers in a hybrid environment

Managing K8 clusters outside of Azure with Azure Arc

Optimize administration of SQL Server instances in on-premises and multi-cloud environments by

leveraging Azure Arc

Azure Data Studio dashboards

Monitoring

Application performance management

Manage data anywhere

Next steps

microsoft/azure_arc: Azure Arc environments bootstrapping for everyone (in github.com)

All Azure Architecture Center Hybrid and Multicloud Architectures

Enable monitoring of Azure Arc enabled Kubernetes cluster

Azure Monitor for containers overview

Hybrid availability and performance monitoring

Performance Efficiency

Performance efficiency in a hybrid workload

12/16/2022 • 5 minutes to read

Azure Arc design

Azure Arc enabled Kubernetes

Azure Arc enabled SQL Managed Instance

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an

efficient manner. In a hybrid environment, it is important to consider how you manage your on-premises or

multicloud workloads to ensure they can meet the demands for scale. You have options to scale up into the

cloud when your on-premises resources reach capacity. Scale up, down, and scale out your databases without

application downtime.

Using a tool like Azure Arc, you can build cloud native apps anywhere, at scale. Architect and design hybrid

applications where components are distributed across public cloud services, private clouds, data centers and

edge locations without sacrificing central visibility and control. Deploy and configure applications and

Kubernetes clusters consistently and at scale from source control and templates. You can also bring PaaS

services on premises. This allows you to use cloud innovation flexibly, where you need it by deploying Azure

services anywhere. Implement cloud practices and automation to deploy faster, consistently, and at scale with

always up-to-date Azure Arc enabled services. You can scale elastically based on capacity, with the ability to

deploy in seconds.

The first step with Azure Arc is to connect your machines to Azure. To connect a machine to

Azure, you need to install the Azure Connected Machine agent on each machine that you plan to connect using

Azure Arc. You can connect any other physical or virtual machine running Windows or Linux to Azure Arc.

There are four ways to connect machines:

1. Manual installation

2. Script-based installation

3. Connect machines at scale using service principal

4. Installation using Windows PowerShell DSC

After connecting the machines, you can then manage the VM extensions all from Azure, which provides

consistent extension management between Azure and non-Azure VMs. In Azure you can use Azure Automation

State Configuration to centrally store configurations and maintain the desired state of Arc enabled servers

through the DSC VM extension. You can also collect log data for analysis with Azure Monitor Logs enabled

through the Log Analytics agent VM extension. With Azure Monitor, you can analyze the performance of your

Windows and Linux VMs and monitor their processes and dependencies on other resources and external

processes.

With Azure Arc enabled Kubernetes, you need to register the cluster first. You can register any CNCF Kubernetes

cluster that is running. You'll need a kubeconfig file to access the cluster and cluster-admin role on the cluster for

deploying Arc-enabled Kubernetes agents. You'll use Azure Command-Line Interface (Azure CLI) to perform

cluster registration tasks.

When planning for deployment of Azure Arc enabled SQL Managed Instance, you should identify the correct

amount of compute, memory, and storage that will be required to run the Azure Arc data controller and the

intended SQL managed instance server groups.

NOTE

Azure Stack HCI

Monitoring in a hybrid environment

Monitoring containers

You have the flexibility to extend the capacity of the underlying Kubernetes or AKS cluster over time by adding

additional compute nodes or storage. Kubernetes or AKS offers an abstraction layer over the underlying

virtualization stack and hardware. Storage classes implement such abstraction for storage.

When provisioning a pod, you need to decide which storage class to use for its volumes. Your decision is important from a

performance standpoint because an incorrect choice could result in suboptimal performance.

When planning for deployment of Azure Arc enabled SQL Managed Instance, you should consider a range of

factors affecting storage configuration for both data controller and database

instances.

With the scope of Azure Arc extended to Azure Stack HCI VMs, you'll be able to automate their configuration by

using Azure VM extensions and evaluate their compliance with industry regulations and corporate standards by

using Azure Policy.

In remote office/branch office scenarios, you must weigh storage resiliency against usage efficiency and

performance. Planning for Azure Stack HCI volumes involves identifying the optimal balance between resiliency,

usage efficiency, and performance. The challenge results from the fact that maximizing one of these

characteristics typically has a negative impact on at least one of the other two.
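The tradeoff between resiliency and usage efficiency can be illustrated with plain mirroring arithmetic. This is a simplified sketch, not a sizing tool. It assumes simple n-way mirroring, where every piece of data is stored `copies` times, and it ignores reserve capacity, parity-based options, and performance effects. The 60 TB raw capacity is a made-up figure.

```python
def mirror_profile(copies: int, raw_tb: float) -> dict:
    """Describe an n-way mirror: more copies means more resiliency, less usable space."""
    if copies < 2:
        raise ValueError("a mirror needs at least two copies")
    return {
        "copies": copies,
        "usable_tb": raw_tb / copies,      # each byte is stored `copies` times
        "efficiency": 1 / copies,          # fraction of raw capacity that is usable
        "failures_tolerated": copies - 1,  # copies that can be lost without data loss
    }


RAW_CAPACITY_TB = 60.0  # hypothetical raw capacity

two_way = mirror_profile(2, RAW_CAPACITY_TB)
three_way = mirror_profile(3, RAW_CAPACITY_TB)

for profile in (two_way, three_way):
    print(profile)
```

Moving from a two-way to a three-way mirror tolerates one more failure but drops usable capacity from half to one third of raw capacity, which is the kind of balance the paragraph above refers to.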

To learn more, see Use Azure Stack HCI switchless interconnect and lightweight quorum for Remote

Office/Branch Office.

Monitoring in a hybrid environment can be a challenge. However, as you bring Azure services on-premises with

tools like Azure Arc, you can enroll in additional Azure services such as monitoring, security, and update

management by turning them on.

Across products: Integrate with Microsoft Sentinel, Microsoft Defender for Cloud

Bring Microsoft Defender for Cloud to your on-prem data and servers with Arc

Set security policies, resource boundaries, and RBAC for workloads across the hybrid infrastructure

Assign proper admin roles to read, modify, re-onboard, and delete a machine

Monitoring your containers is critical. Azure Monitor for containers provides a rich monitoring experience for

the AKS and AKS engine clusters.

Configure Azure Monitor for containers to monitor Azure Arc enabled Kubernetes clusters hosted outside of

Azure. This helps achieve comprehensive monitoring of your Kubernetes clusters across Azure, on-premises,

and third-party cloud environments.

Azure Monitor for containers can provide you with performance visibility by collecting memory and processor

metrics from controllers, nodes, and containers available in Kubernetes through the Metrics application

programming interface (API). Container logs are also collected. After you enable monitoring from Kubernetes

clusters, metrics and logs are automatically collected for you through a containerized version of the Log

Analytics agent. Metrics are written to the metrics store and log data is written to the logs store associated with

your Log Analytics workspace. For more information about Azure Monitor for containers, refer to Azure Monitor

for containers overview.

Deploy and manage containerized applications

Next steps

Enable Azure Monitor for containers for one or more existing deployments of Kubernetes by using either a

PowerShell or a Bash script. To enable monitoring for Arc enabled Kubernetes clusters, refer to Enable

monitoring of Azure Arc enabled Kubernetes cluster.

Automatically enroll in additional Azure Arc enabled resources and services. Simply turn them on when needed:

Strengthen your security posture and protect against threats by turning on Microsoft Defender for Cloud.

Get actionable alerts from Azure Monitor.

Detect, investigate, and mitigate security incidents with the power of a cloud-native SIEM, by turning on

Microsoft Sentinel.

Deploy and manage containerized applications with GitHub and Azure Policy. Ensure that applications and

clusters are consistently deployed and configured at scale from source control. Write with your familiar tool set

to the same application service APIs that can run consistently on-premises, across multicloud, and in edge

environments. Easily instrument Azure monitoring, telemetry, and security services into your hybrid apps

wherever they run.

Reliability

Reliability in a hybrid workload

12/16/2022 • 5 minutes to read

NOTE

Backup and Recovery

In the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent failures altogether,

the goal is to minimize the effects of a single failing component. Historically, you may have purchased

levels of redundant higher-end hardware to minimize the chance of an entire application platform failing. In the

cloud, the emphasis shifts from preventing failure to limiting its impact.

For hybrid scenarios, Azure offers an end-to-end backup and disaster recovery solution that's simple, secure,

scalable, and cost-effective, and can be integrated with on-premises data protection solutions. In the case of

service disruption or accidental deletion or corruption of data, recover your business services in a timely and

orchestrated manner.

Many customers operate a second datacenter. However, Azure can help reduce the costs of deploying,

monitoring, patching, and scaling on-premises disaster recovery infrastructure, without the need to manage

backup resources or build a secondary datacenter.

Extend your current backup solution to Azure, or easily configure our application-aware replication and

application-consistent backup that scales based on your business needs. The centralized management interface

for Azure Backup and Azure Site Recovery makes it simple to define policies to natively protect, monitor, and

manage enterprise workloads across hybrid and cloud. These include:

Azure Virtual Machines

SQL and SAP databases

On-premises Windows servers

VMware machines

By not having to build on-premises solutions or maintain a costly secondary datacenter, customers can reduce the cost of

deploying, monitoring, patching, and scaling disaster recovery infrastructure by backing up their hybrid data and

applications with Azure.

Together, Azure Backup and Azure Site Recovery use the underlying power and unlimited scale of the cloud to

deliver high availability with minimal maintenance or monitoring overhead. These native capabilities are

available through a pay-as-you-use model that bills only for the storage that is consumed.

Using Azure Site Recovery, users can set up and manage replication, failover, and failback from a single location

in the Azure portal. The Azure hybrid services tool in Windows Admin Center can also be used as a centralized

hub to easily discover all the available Azure services that bring value to on-premises or hybrid environments.

Windows Admin Center streamlines setup and the process of replicating virtual machines on Hyper-V servers or

clusters, making it easier to bolster the resiliency of environments with Azure Site Recovery's disaster recovery

service.

Azure is committed to providing the best-in-class data protection to keep your applications running. Azure

Backup protects backups of on-premises and cloud resources from ransomware attacks by isolating backup

data from source data, combined with multi-factor authentication and the ability to recover maliciously or

accidentally deleted backup data. With Azure Site Recovery you can fail over VMs to the cloud or between cloud

data centers and secure them with network security groups.

Availability Considerations

For Azure Arc

Azure Arc enabled data services

Azure Stack HCI

Site-level fault domains

Site awareness

Resiliency limits

In the case of a disruption, accidental deletion, or corruption of data, customers can rest assured that they will be

able to recover their business services and data in a timely and orchestrated manner. These native capabilities

support low recovery-point objective (RPO) and recovery-time objective (RTO) targets for any mission-critical

workload in your organization. Azure is here to help customers pivot towards a strengthened BCDR strategy.
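RPO and RTO targets can be checked with simple time arithmetic. The sketch below is a simplification under stated assumptions: worst-case data loss is taken to equal the interval between recovery points, and the restore estimate is a number you supply. All durations shown are hypothetical, and the code does not query Azure Backup or Azure Site Recovery.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assume a failure just before the next recovery point is taken."""
    return backup_interval


def check_targets(
    backup_interval: timedelta,
    estimated_restore: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict:
    """Compare a protection schedule against RPO and RTO targets."""
    return {
        "meets_rpo": worst_case_data_loss(backup_interval) <= rpo,
        "meets_rto": estimated_restore <= rto,
    }


# Hypothetical workload: recovery points every 4 hours, restores take about 45 minutes.
result = check_targets(
    backup_interval=timedelta(hours=4),
    estimated_restore=timedelta(minutes=45),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(result)
```

Here the restore time fits the RTO, but a four-hour interval can't satisfy a one-hour RPO, so the workload would need more frequent recovery points or continuous replication.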

In most cases, the location you select when you create the installation script should be the Azure region

geographically closest to your machine's location. The rest of the data will be stored within the Azure geography

containing the region you specify, which might also affect your choice of region if you have data residency

requirements. If an outage affects the Azure region to which your machine is connected, the outage will not

affect the Arc enabled server, but management operations using Azure might not be able to complete. For

resilience in the event of a regional outage, if you have multiple locations which provide a geographically-

redundant service, it's best to connect the machines in each location to a different Azure region.

Ensure that Azure Arc is supported in your regions by checking supported regions. Also, ensure that services

referenced in the Architecture section are supported in the region to which Azure Arc is deployed.

With Azure Arc enabled SQL Managed Instance, you can deploy individual databases in either a single or

multiple pod pattern. For example, the developer or general-purpose pricing tier implements a single pod

pattern, while a highly available business critical pricing tier implements a multiple pod pattern. A highly

available Azure SQL managed instance uses Always On Availability Groups to replicate the data from one

instance to another either synchronously or asynchronously.

With Azure Arc enabled SQL Managed Instance, planning for storage is also critical from the data resiliency

standpoint. If there's a hardware failure, an incorrect choice might introduce the risk of total data loss. To avoid

such risk, you should consider a range of factors affecting storage configuration for both data controller and

database instances.

Azure Arc enabled SQL Managed Instance provides automatic local backups, regardless of the connectivity

mode. In the Directly Connected mode, you also have the option of leveraging Azure Backup for off-site, long-

term backup retention.

Each physical site of an Azure Stack HCI stretched cluster represents a distinct fault domain that provides

additional resiliency. A fault domain is a set of hardware components that share a single point of failure. To be

fault tolerant to a particular level, you need multiple fault domains at that level.
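The fault domain rule above can be expressed as a small check. This is a generic sketch, not an Azure Stack HCI API. The replica names, site names, and the `min_surviving` threshold are hypothetical. The check asks whether enough replicas remain if any single fault domain is lost.

```python
from collections import Counter


def survives_single_domain_failure(placement: dict, min_surviving: int = 1) -> bool:
    """Return True if losing any one fault domain leaves at least min_surviving replicas.

    `placement` maps a replica name to the fault domain (for example, a site) hosting it.
    """
    if not placement:
        return False
    per_domain = Counter(placement.values())
    total = len(placement)
    return all(total - count >= min_surviving for count in per_domain.values())


# Both replicas in one site: the site is a single point of failure.
single_site = {"replica-a": "site-1", "replica-b": "site-1"}

# Replicas spread across two sites: either site can fail.
stretched = {"replica-a": "site-1", "replica-b": "site-2"}

print(survives_single_domain_failure(single_site))
print(survives_single_domain_failure(stretched))
```

The same function works at any level of the hierarchy (drive, node, site) as long as the placement map names the fault domain at that level.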

Site awareness allows you to control placement of virtualized workloads by designating their preferred sites.

Specifying the preferred site for a stretched cluster offers many benefits, including the ability to group

workloads at the site level and to customize quorum voting options. By default, during a cold start, all virtual

machines use the preferred site, although it is also possible to configure the preferred site at the cluster role or

group level.

Azure Stack HCI provides multiple levels of resiliency, but because of its hyper-converged architecture, that

resiliency is subject to limits imposed not only by the cluster quorum, but also by the pool quorum. You can

eliminate this limit by implementing cluster sets in which you combine multiple Azure Stack HCI clusters to

create an HCI platform consisting of hundreds of nodes.

Next steps

Security

Security in a hybrid workload

12/16/2022 • 3 minutes to read

Azure Architecture Center (AAC) resources

Principles

Azure Arc management security capabilities

Azure Arc enabled data services security capabilities

Azure Stack HCI

Security is one of the most important aspects of any architecture. Particularly in hybrid and multicloud

environments, an architecture built on good security practices should be resilient to attacks and provide

confidentiality, integrity, and availability. To assess your workload using the tenets found in the Microsoft Azure

Well-Architected Framework, see the Microsoft Azure Well-Architected Review.

Microsoft Defender for Cloud can monitor on-premises systems, Azure VMs, Azure Monitor resources, and even

VMs hosted by other cloud providers. To support that functionality, the standard fee-based tier of Microsoft

Defender for Cloud is needed. We recommend that you use the 30-day free trial to validate your requirements.

Defender for Cloud's operational process won't interfere with your normal operational procedures. Instead, it

passively monitors your deployments and provides recommendations based on the security policies you enable.

Microsoft Sentinel can help simplify data collection across different sources, including Azure, on-premises

solutions, and across clouds using built-in connectors. Microsoft Sentinel works to collect data at cloud scale—

across all users, devices, applications, and infrastructure, both on-premises and in multiple clouds.

Hybrid Security Monitoring using Microsoft Defender for Cloud and Microsoft Sentinel

DevSecOps in Azure

Optimize administration of SQL Server instances in on-premises and multi-cloud environments by

leveraging Azure Arc

Implement a secure hybrid network

Securely managed web applications

Access unique Azure security capabilities such as Microsoft Defender for Cloud.

Centrally manage access for resources with Role-Based Access Control.

Centrally manage and enforce compliance and simplify audit reporting with Azure Policy.

Protect your data workloads with Microsoft Defender for Cloud in your environment, using the advanced

threat protection and vulnerability assessment features for unmatched security.

Set security policies, resource boundaries, and role-based access control for various data workloads

seamlessly across your hybrid infrastructure.

Protection in transit. Storage Replica offers built-in security for its replication traffic. This includes packet

signing, AES-128-GCM full data encryption, support for Intel AES-NI encryption acceleration, and

pre-authentication integrity to prevent man-in-the-middle attacks.

Encryption at rest. Azure Stack HCI supports BitLocker Drive Encryption for its data volumes, thus

facilitating compliance with standards such as FIPS 140-2 and HIPAA.

Integration with a range of Azure services that provide more security advantages. You can

integrate virtualized workloads that run on Azure Stack HCI clusters with Azure services such as Microsoft

Defender for Cloud.

Storage Replica also utilizes Kerberos AES256 for authentication between the replicating nodes.

Design

Azure Arc enabled servers

Azure Stack HCI

Monitor

Firewall-friendly configuration. Storage Replica traffic requires a limited number of open ports between

the replicating nodes.

Implement Azure Monitor

Use Azure Monitor to monitor your VMs, virtual machine scale sets, and Azure Arc machines at scale. Azure

Monitor analyzes the performance and health of your Windows and Linux VMs. It also monitors their processes

and dependencies on other resources and external processes. It includes support for monitoring performance

and application dependencies for VMs that are hosted on-premises or in another cloud provider.

Implement Microsoft Sentinel

Use Microsoft Sentinel to deliver intelligent security analytics and threat intelligence across the enterprise. This

provides a single solution for alert detection, threat visibility, proactive hunting, and threat response. Microsoft

Sentinel is a scalable, cloud-native, security information event management (SIEM) and security orchestration

automated response (SOAR) solution that enables several scenarios including:

Collect data at cloud scale across all users, devices, applications, and infrastructure, both on-premises and in

multiple clouds.

Detect previously undetected threats and minimize false positives.

Investigate threats with artificial intelligence and hunt for suspicious activities at scale.

Respond to incidents rapidly with built-in orchestration and automation of common tasks.

A stretched Azure Stack HCI cluster relies on Storage Replica to perform synchronous storage replication

between storage volumes hosted by the two groups of nodes in their respective physical sites. If a failure affects

the availability of the primary site, the cluster automatically transitions its workloads to nodes in the surviving

site to minimize potential downtime.

Across products: Integrate with Microsoft Sentinel and Microsoft Defender for Cloud.

Bring Microsoft Defender for Cloud to your on-premises data and servers with Arc.

Set security policies, resource boundaries, and RBAC for workloads across the hybrid infrastructure.

Set the correct admin roles to read, modify, re-onboard, and delete a machine.

Overview of well-architected IoT workloads

12/16/2022 • 12 minutes to read

Azure Well-Architected Framework for IoT workloads

IoT architectural patterns

Internet of Things (IoT) is a collection of managed and platform services across edge and cloud environments

that connect, monitor, and control physical assets. The IoT workload for the Microsoft Azure Well-Architected

Framework helps you meet architectural challenges to design, build, and operate IoT solutions according to your

requirements and constraints.

The IoT Well-Architected Framework addresses the three components of IoT systems:

Things, or the physical objects, industrial equipment, devices, and sensors that connect to the cloud

persistently or intermittently.

Insights, information that the things collect that humans or AI analyze and turn into actionable knowledge.

Actions, the responses of people or systems to insights, which connect to business outcomes, systems, and

tools.

The IoT Well-Architected Framework uses a set of IoT guiding principles based on the Azure Well-Architected

Framework to drive planning and decision making. The principles guide a layered architecture approach that

identifies the logical elements of an IoT solution in either a connected products or connected operations

architectural pattern.

This article describes the IoT guiding principles, architectural patterns, and architecture layers in the IoT

Well-Architected Framework. The remaining articles in this series delve into how to apply the Azure

Well-Architected Framework pillars of excellence to IoT solutions.

The Azure Well-Architected Framework consists of five pillars of architectural excellence, which you can use to

improve the quality of IoT workloads. The following articles highlight how each pillar relates to IoT workloads

and guiding principles:

Reliability ensures that applications meet availability commitments. Resiliency ensures that workloads are

available and can recover from failures at any scale. Reliability in your IoT workload discusses how the IoT

principles of heterogeneity, scale, connectivity, and hybridity affect IoT reliability.

Security provides confidentiality, integrity, and availability assurances against deliberate attacks and

abuse of data and systems. Security in your IoT workload describes how heterogeneity and hybridity

affect IoT security.

Cost optimization balances business goals with budget justification to create cost-effective workloads

while avoiding capital-intensive solutions. Cost optimization in your IoT workload looks at ways to reduce

expenses and improve operational efficiency across IoT workload layers.

Operational excellence covers the processes that build and run applications in production. Operational

excellence in your IoT workload discusses how heterogeneity, scale, connectivity, and hybridity affect IoT

operations.

Performance efficiency is a workload's ability to scale efficiently to meet demands. Performance efficiency

in your IoT workload describes how heterogeneity, scale, connectivity, and hybridity affect IoT

performance.

IoT guiding principles

Heterogeneity

Security

Most IoT systems use either a connected products or connected operations architectural pattern. Each pattern

has specific requirements and constraints in the IoT Well-Architected Framework.

Connected products architectures focus on the hot path. End users manage and interact with products by

using real-time applications. This pattern applies to manufacturers of smart devices for consumers and

businesses in a wide range of locations and settings. Examples include smart coffee machines, smart TVs,

and smart production machines. In these IoT solutions, the product builders provide connected services

to the product users.

Connected operations architectures focus on the warm or cold path with edge devices, alerts, and cloud

processing. These solutions analyze data from multiple sources, gather operational insights, build

machine learning models, and initiate further device and cloud actions.

The connected operations pattern applies to enterprises and smart service providers that connect pre-

existing machines and devices. Examples include smart factories and smart buildings. In these IoT

solutions, service builders deliver smart services that provide insights and support the effectiveness and

efficiency of connected environments.

The IoT Well-Architected Framework adds IoT-specific guiding principles to the Azure Well-Architected

Framework pillars. These principles help clarify considerations to ensure your IoT workloads meet requirements

across architectural layers.

The high-level guiding principles that facilitate good IoT solution design are:

Heterogeneity

Security

Scale

Flexibility

Serviceability

Connectivity

Hybridity

The following sections describe the IoT guiding principles, and how they apply to the IoT connected products

and connected operations architectural patterns.

IoT solutions must accommodate various devices, hardware, software, scenarios, environments, processing

patterns, and standards. It's important to identify the necessary level of heterogeneity for each architectural

layer at design time.

In connected products architectures, heterogeneity describes the varieties of machines and devices that need to

be supported. Heterogeneity also describes the variety of environments where you can deploy smart products,

such as networks and types of users. In connected operations architectures, heterogeneity focuses on support

for different operational technology (OT) protocols and connectivity.

IoT solutions must consider security and privacy measures across all layers. Security measures include device

and user identity, authentication and authorization, data protection for data at rest and in transit, and strategies

for data attestation.

In connected products architectures, limited control over product use in heterogeneous and widely distributed

environments affects security. According to the Microsoft Threat Modeling Tool STRIDE model, the highest risk

to devices is from tampering, and the threat to services is from denial of service from hijacked devices.

Scalability

Flexibility

Serviceability

Connectivity

Hybridity

In connected operations architectures, the security requirements for the deployment environment are

important. Security focuses on specific OT environment requirements and deployment models, such as ISA95

and Purdue, and integration with the cloud-based IoT platform. Based on STRIDE, the highest security risks for

connected operations are spoofing, tampering, information disclosure, and elevation of privilege.

IoT solutions must be able to support hyper-scalability, with millions of connected devices and events ingesting

large amounts of data at high frequency. IoT solutions must enable proof of concept and pilot projects that start

with a few devices and events, and then scale out to hyper-scale dimensions. Considering the scalability of each

architectural layer is essential to IoT solution success.

In connected products architectures, scale describes the number of devices. In most cases, each device has a

limited set of data and interactions, controlled by the device builder, and scalability comes only from the number

of devices deployed.

In connected operations architectures, scalability depends on the number of messages and events to process. In

general, the number of machines and devices is limited, but OT machines and devices send large numbers of

messages and events.

IoT solutions build on the principle of composability, which enables combining various first-party or third-party

components as building blocks. A well-architected IoT solution has extension points that enable integration with

existing devices, systems, and applications. A high-scale, event-driven architecture with brokered

communication is part of the backbone, with loosely coupled composition of services and processing modules.

In connected products architectures, changing end-user requirements define flexibility. Solutions should allow

you to easily change device behavior and end-user services in the cloud, and provide new services. In connected

operations architectures, the support for different types of devices defines flexibility. Solutions should be able to

easily connect legacy and proprietary protocols.

IoT solutions must consider ease of maintaining and repairing components, devices, and other system elements.

Early detection of potential problems is critical. Ideally, a well-architected IoT solution should correct problems

automatically before serious trouble occurs. Maintenance and repair operations should cause as little downtime

or disruption as possible.

In connected products architectures, the wide distribution of devices affects serviceability. The ability to monitor,

manage, and update devices within end user context and control, without direct access to that environment, is

limited. In connected operations architectures, serviceability depends on the given context, controls, and

procedures of the OT environment, which may include systems and protocols already available or in use.

IoT solutions must be able to handle extended periods of offline, low-bandwidth, or intermittent connectivity. To

support connectivity, you can create metrics to track devices that don't communicate regularly.
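
As a minimal sketch of such a metric, assume you already record a last-seen timestamp for each device. The `find_stale_devices` helper, the device IDs, and the 15-minute threshold below are hypothetical illustrations and aren't part of any Azure SDK:

```python
from datetime import datetime, timedelta, timezone


def find_stale_devices(last_seen, now, max_silence):
    """Return IDs of devices whose last message is older than max_silence.

    last_seen maps a device ID to the timestamp of its most recent message.
    """
    return sorted(
        device_id
        for device_id, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Illustrative data only; real timestamps would come from your telemetry store.
now = datetime(2022, 12, 16, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "thermostat-01": now - timedelta(minutes=2),
    "thermostat-02": now - timedelta(hours=3),
    "gateway-07": now - timedelta(minutes=20),
}
stale = find_stale_devices(last_seen, now, max_silence=timedelta(minutes=15))
print(stale)
```

The count of stale devices, tracked over time, is one possible signal that distinguishes expected intermittent connectivity from a wider outage.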

Connected products run in uncontrolled consumer environments, so connectivity is unknown and hard to

sustain. Connected products architectures must be able to support unexpected extended periods of offline and

low-bandwidth connectivity.

In connected operations architectures, the deployment model of the OT environment affects connectivity.

Typically, the degree of connectivity, including intermittent connectivity, is known and managed in OT scenarios.

IoT solutions must address hybrid complexity, running on different hardware and platforms across on-premises,

edge, and multi-cloud environments. It's critical to manage disparate IoT workload architectures, ensure

uncompromised security, and enable developer agility.

IoT architecture layers

Core layers and services

In connected products architectures, the wide distribution of devices defines hybridity. The IoT solution builder

controls the hardware and runtime platform, and hybridity focuses on the diversity of the deployment

environments.

In connected operations architectures, hybridity describes the data distribution and processing logic. Scale and

latency requirements determine where to process data and how fast feedback must be.

An IoT architecture consists of a set of foundational layers. Specific technologies support the different layers,

and the IoT Well-Architected Framework highlights options for designing and creating each layer.

Core layers identify IoT-specific solutions.

Common layers aren't specific to IoT workloads.

Cross-cutting layers support all layers in designing, building, and running solutions.

The IoT Well-Architected Framework addresses different layer-specific requirements and implementations. The

framework focuses on the core layers, and identifies the specific impact of the IoT workload on the common layers.

The following sections describe the IoT architecture layers and the Microsoft technologies that support them.

The IoT core layers and services identify whether a solution is an IoT solution. The core layers of an IoT workload are:
Device and gateway
Device management and modeling
Ingestion and communication
The IoT Well-Architected Framework focuses primarily on these layers. To realize these layers, Microsoft
provides IoT-specific services such as Azure IoT Hub, Azure IoT Edge, IoT Hub Device Provisioning Service (DPS),
and IoT Central.

Device and gateway layer
This layer represents the physical or virtual device and gateway hardware deployed at the edge or on premises.
Elements in this layer include the operating systems that manage the processes on the devices and gateways,
and the device and gateway firmware, which is the software and instructions programmed onto devices and
gateways. This layer is responsible for:
Sensing and acting on other peripheral devices and sensors.
Processing and transferring IoT data.
Communicating with the IoT cloud platform.
Base level device security, encryption, and trust root.
Device level software and processing management.
Common use cases include reading sensor values from a device, processing and transferring data to the cloud,
and enabling local communication.
Relevant Microsoft technologies include:
Azure IoT Edge
Azure IoT device SDKs
Azure RTOS
Microsoft Defender for IoT
Azure Sphere
Windows for IoT

Ingestion and communication layer
This layer aggregates and brokers communications between the device and gateway layer and the IoT cloud
solution. This layer enables:
Support for bi-directional communication with devices and gateways.
Aggregating and combining communications from different devices and gateways.
Routing communications to a specific device, gateway, or service.
Bridging and transforming between different protocols. For example, mediate cloud or edge services into an
MQTT message going to a device or gateway.
Relevant Microsoft technologies include:
Azure IoT Hub
Azure IoT Central

Device management and modeling layer
This layer maintains the list of devices and gateway identities, their state, and their capabilities. This layer also
enables the creation of device type models and relationships between devices.
Relevant Microsoft technologies include:
IoT Hub device twins
IoT Central device templates
IoT Hub DPS
Azure Digital Twins
Azure IoT Plug and Play

Common layers and services
Transport layer
Event processing and analytics layer
Storage layer
Interaction and reporting layer
Workloads other than IoT, such as Data & AI and modern applications, also use the common layers. The top-
level Microsoft Azure Well-Architected Framework addresses the generic elements of these common layers, and
other workload frameworks address other requirements. The following sections touch on the IoT-related impact
on requirements, and include links to other guidance.
This layer represents the way devices, gateways, and services connect and communicate, the protocols they use,
and how they move or route events, both on premises and in the cloud.
Relevant Microsoft technologies include:
OT and IoT protocols, such as MQTT(S), AMQP(S), HTTPS, OPC-UA, and Modbus
Azure IoT Hub routing
Azure IoT Edge routing
This layer processes and acts on the IoT events from the ingestion and communication layer.
Hot path stream processing and analytics happen in near real-time to identify immediate insights and
actions. For example, stream processing generates alerts when temperatures rise.
Warm path processing and analytics identify short-term insights and actions. For example, analytics predict a
trend of rising temperatures.
Cold path processing and analytics create intelligent data models for the hot or warm paths to use.
Relevant Microsoft technologies include:
Azure Stream Analytics
Azure Functions
Azure Databricks
Azure Machine Learning
Azure Synapse Analytics
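
The hot-path and warm-path examples described for this layer (alerting when temperatures rise, and detecting a trend of rising temperatures) can be illustrated with a small sketch. This is plain Python rather than any of the services listed above, and the 80-degree threshold, the window size, and all names are assumptions chosen for illustration:

```python
from collections import deque


def hot_path_alert(reading_c, threshold_c=80.0):
    """Hot path: react to a single event as soon as it arrives."""
    return reading_c > threshold_c


class WarmPathTrend:
    """Warm path: keep a short window of readings and report whether they rise."""

    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)

    def add(self, reading_c):
        self.window.append(reading_c)

    def is_rising(self):
        # Rising means the window is full and every reading exceeds the previous one.
        if len(self.window) < self.window.maxlen:
            return False
        values = list(self.window)
        return all(later > earlier for earlier, later in zip(values, values[1:]))


trend = WarmPathTrend(window_size=3)
for reading in [70.0, 72.5, 75.0]:
    trend.add(reading)

print(hot_path_alert(75.0))  # below the assumed threshold, so no immediate alert
print(trend.is_rising())     # the short window shows an upward trend
```

In this sketch no single reading triggers the hot-path alert, yet the warm path still reports a rising trend, which is the kind of short-term insight the two paths are meant to separate.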
This layer persists IoT device event and state data for some period of time. The type of storage depends on the
required use for the data.
Streaming storage, such as message queues, decouples IoT services and communication availability.
Time series-based storage enables warm-path analysis.
Long-term storage supports machine learning and AI model creation.
Relevant Microsoft technologies include:
Azure Event Hubs
Azure Data Explorer
Azure Cosmos DB
Azure SQL
Azure Data Lake Storage

  
Integration layer
  
Cross-cutting activities
This layer lets end users interact with the IoT platform and have a role-based view into device state, analytics,
and event processing.
Relevant Microsoft technologies include:
Azure App Service
Power Apps
Power BI
Dynamics 365 Connected Field Service
This layer enables interaction with systems outside the IoT solution by using machine-to-machine or service-to-
service communications APIs.
Relevant Microsoft technologies include:
Azure Logic Apps
Azure Functions
Azure API Management
Azure Event Grid
Power Automate
Cross-cutting activities like DevOps help you design, build, deploy, and monitor IoT solutions. DevOps lets
formerly siloed roles, like development, operations, quality engineering, and security, coordinate and collaborate
to produce better, more reliable, and agile products.
DevOps is well-known in software development, but can apply to any product or process development and
operations. Teams who adopt a DevOps culture, practices, and tools can better respond to customer needs,
increase confidence in the applications and products they build, and achieve business goals faster.
The following diagram shows the DevOps continuous planning, development, delivery, and operations cycle:
Development and deployment activities include the design, build, test, and deployment of the IoT solution
and its components. The activity covers all layers and includes hardware, firmware, services, and reports.

 
Management and operations activities identify the current health state of the IoT system across all layers.
Correctly executing DevOps and other cross-cutting activities can determine your success in creating and
running a well-architected IoT solution. Cross-cutting activities help you meet the requirements set at design
time and adjust for changing requirements over time. It's important to clearly assess your expertise in these
activities and take measures to ensure execution at the required quality level.
Relevant Microsoft technologies include:
Visual Studio
Azure DevOps
Microsoft Security Development Lifecycle (SDL)
Azure Monitor
Azure Arc
Microsoft Defender for IoT
Microsoft Sentinel

Next steps
Reliability in your IoT workload
Security in your IoT workload
Cost optimization in your IoT workload
Operational excellence in your IoT workload
Performance efficiency in your IoT workload

Related resources
Azure IoT reference architecture
Azure IoT documentation

Reliability in your IoT workload

12/16/2022 • 16 minutes to read

Everything has the potential to break and IoT workloads are no exception. Because of this, you should design

your architecture with availability and resiliency in mind. The key considerations are how quickly you can detect

change and how quickly you can resume operations. IoT applications are unique as they're distributed at

massive scale, and operate over unreliable networks with no persistent access or visibility into the end-to-end

data flows.

Your environment should consider resilient architectures, cross-region redundancies, service-level indicators

(SLIs), service-level objectives (SLOs), service-level agreements (SLAs), and critical support. The existing

environment should include auditing, monitoring, and alerting by using integrated monitoring and a notification

framework.

On top of these environmental controls, the workload team should consider:

Defining SLIs and SLOs around observability for your specific use case.

Making architecture modifications to improve SLAs.

Building redundancy into the workload-specific architecture.

Adding processes for monitoring and notification beyond what's provided by the cloud operations teams.
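
To make the first consideration concrete, the following sketch computes an availability SLI for device-to-cloud messages and compares it with an SLO by reporting the remaining error budget. The function names, the message counts, and the 99.9% target are assumptions for illustration, not values from this guidance:

```python
def availability_sli(successful, total):
    """SLI: fraction of device-to-cloud messages delivered successfully."""
    if total == 0:
        return 1.0  # no traffic means nothing failed
    return successful / total


def error_budget_remaining(successful, total, slo):
    """Fraction of the allowed failures (1 - slo) that is still unspent.

    Negative values mean the SLO has been violated.
    """
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - successful
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Illustrative numbers: 1,000,000 messages sent, 500 lost, SLO of 99.9%.
sli = availability_sli(999_500, 1_000_000)
budget = error_budget_remaining(999_500, 1_000_000, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

With these sample numbers the SLO allows about 1,000 lost messages, so losing 500 leaves roughly half of the budget unspent.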

Designing for reliability in an IoT solution should take into consideration the foundational layers in the IoT

architecture. To achieve overall solution reliability, each layer should have acceptable levels of reliability.

Device and gateway layer: Devices and gateways come in many forms and shapes. Typically, devices and

gateways perform the following functions:

Data collection

Supervisory control

Edge analytics

Data collection means the device is connected to sensors or subscribes to telemetry from downstream systems

and then pushes this data to the cloud. The solution design should ensure reliable device management and

reliable communications from the device to the cloud.

Devices that provide supervisory control not only collect data to send to the cloud, but also take actions based on

that data. Actions send data back to the machines or environment that the device is authorized to perform

supervisory actions on. The reliability of the application running on a device that has supervisory control is

crucial.

Transport layer: To connect to the cloud service for data, control, and management, devices need access to a

network. Depending on the type of IoT solution, connectivity reliability is in your hands or the hands of the

network service provider. Networks may have intermittent connectivity issues and in this case the device needs

to manage its behavior accordingly.

Device management and modeling layer: The cloud service provides each device with an identity and

manages your devices at scale. The cloud is often the final data ingress point for all messages flowing from the

devices. IoT solutions must provide reliability for the cloud services required for the IoT devices to integrate and

transmit data. As described in the Azure WAF – Reliability pillar, the cloud services in your IoT solution must

implement reliability principles to provide high availability for your overall IoT solution.

Prerequisites

Principles

Device and gateway layer

Design devices for resiliency

Design

Implement

Building a reliable IoT solution requires careful consideration for devices, cloud, and how they interact. The

choices you make for device hardware, connectivity and protocols, and cloud services affect the reliability needs

of your solution.

As described in the Azure WAF – Reliability pillar, desired reliability is subjective. Ensure you understand the

business requirements for your solution.

To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected

Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:

Azure Well-Architected Review

After you complete the assessment, this guide helps you address the key reliability recommendations identified

for your solution.

Follow these principles as you design your IoT solution for reliability:

Design devices for resiliency.

Establish safe practices for updates.

Establish observability across your IoT solution.

Implement HA/DR in critical components.

Plan for capacity.

As part of your overall IoT solution, design your devices to satisfy the uptime and availability requirements of

your end-to-end solution. The following practices focus on connectivity to the cloud service, error handling, and

monitoring:

Ensure that your IoT device can operate efficiently with intermittent connectivity to the cloud. An IoT solution

should enable the flow of information between intermittently connected devices and the cloud-based solutions.

Best practices include:

Implement retry and backoff logic in device software.

Synchronize device state with the cloud.

Ensure that devices can store data on the device if your solution can't tolerate data loss.

Use data sampling and simulations to baseline the network capacity and storage requirements.
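
The practice of storing data on the device can be sketched as a store-and-forward buffer. The example below is a simplified illustration: `StoreAndForwardBuffer` and its `send` callback are hypothetical names, and the buffer is held in memory, whereas a device that can't tolerate data loss would typically persist it to local storage:

```python
from collections import deque


class StoreAndForwardBuffer:
    """Bounded in-memory buffer that keeps telemetry while the device is offline."""

    def __init__(self, capacity):
        # When full, deque drops the oldest item; choose a policy that fits your data.
        self.pending = deque(maxlen=capacity)

    def enqueue(self, message):
        self.pending.append(message)

    def flush(self, send):
        """Send buffered messages in order; stop at the first failure.

        `send` is any callable that returns True on success. Returns the
        number of messages sent.
        """
        sent = 0
        while self.pending:
            if not send(self.pending[0]):
                break  # still offline: keep the message for the next attempt
            self.pending.popleft()
            sent += 1
        return sent
```

The capacity is where the baseline from data sampling and simulation comes in: message size, send rate, and the longest outage you expect determine how much storage the buffer needs.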

The Azure IoT device SDKs provide a set of client libraries that you can use on devices or gateways to simplify

connectivity with the Azure IoT services. You can use the SDKs to implement an IoT device client that:

Connects to the cloud.

Provides a consistent client development experience across different platforms.

Simplifies common connectivity tasks by abstracting details of the underlying protocols and message

processing patterns such as exponential backoff with jitter and retry logic.
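
To show what exponential backoff with jitter means in practice, here is a generic sketch of the pattern. It is not the implementation used by the Azure IoT device SDKs. The function names, the `send` callable, and the default base and cap values are all assumptions for illustration:

```python
import random
import time


def backoff_delays(max_retries, base_s=1.0, cap_s=60.0, rng=random.random):
    """Yield 'full jitter' delays: a random value in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


def send_with_retry(send, max_retries=5, sleep=time.sleep, **backoff_kwargs):
    """Call send() until it succeeds or retries are exhausted.

    `send` is a hypothetical callable that raises ConnectionError on a
    transient failure. The last error is re-raised if all retries fail.
    """
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return send()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise
            sleep(delay)
```

The random jitter spreads out reconnection attempts, so a large fleet that loses connectivity at the same moment doesn't retry in lockstep against the cloud service.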

Monitor

Resources

Establish safe practices for updates

Design

Implement

Monitor

Considerations for IoT devices

Design

Device lifecycle

Connectivity issues for IoT devices can be difficult to troubleshoot because of the many possible points of

failure. Application logic, physical networks, protocols, hardware, IoT Hub, and other cloud services can all cause

problems. The ability to detect and pinpoint the source of an issue is critical. However, an IoT solution at scale

could have thousands of devices, so it's not practical to manually check individual devices. Azure Monitor and

Event Grid are tools that can help you to diagnose connectivity issues in IoT Hub.

Manage connectivity and reliable messaging by using Azure IoT Hub device SDKs

Monitor, diagnose, and troubleshoot Azure IoT Hub disconnects

Retry general guidance - Best practices for cloud applications

Error handling for resilient apps - Azure Architecture Center

Circuit Breaker pattern - Cloud Design Patterns

Compensating Transaction pattern - Cloud Design Patterns

Throttling pattern - Cloud Design Patterns

An enterprise IoT solution should provide a strategy for how operators manage your devices. IoT operators

require simple and reliable tools and applications that enable reliable device updates.

Due to the distributed nature of IoT solutions, it's important to adopt safe and secure policies for updating and

deploying applications. Unlike cloud-based solutions, the device side of IoT solutions brings new challenges. For

example, devices need to be continually updated for vulnerabilities and application changes.

A device update solution must support:

Gradual update rollout through device grouping and update scheduling controls.

Resilient (A/B) device updates that deliver seamless rollback.

Detailed update management and reporting tools.

Optimization for network based on available bandwidth.
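
One way to picture gradual rollout through device grouping is to assign each device to a rollout ring and update one ring at a time. The following sketch is an assumption-laden illustration, not the Device Update for IoT Hub API. The `rollout_ring` function and the ring percentages are hypothetical:

```python
import hashlib


def rollout_ring(device_id, ring_percentages):
    """Assign a device to a rollout ring using a stable hash of its ID.

    ring_percentages lists the share of the fleet in each ring, in rollout
    order, and must sum to 100.
    """
    if sum(ring_percentages) != 100:
        raise ValueError("ring percentages must sum to 100")
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    upper = 0
    for ring, share in enumerate(ring_percentages):
        upper += share
        if bucket < upper:
            return ring
    raise AssertionError("unreachable when percentages sum to 100")


# Hypothetical plan: 5% canary ring, then 25%, then the remaining 70%.
ring = rollout_ring("pump-controller-0042", [5, 25, 70])
print(ring)
```

Because the assignment depends only on the device ID, a device stays in the same ring across update campaigns, which keeps results from the early rings comparable over time.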

Device Update for IoT Hub is a service that enables safe, secure, and reliable over-the-air IoT device updates.

Device Update for IoT Hub can group devices and specify which devices should receive an update. Operators can

view the status of update deployments and make sure each device successfully applies required updates.

When an update fails, Device Update for IoT Hub lets operators identify the devices that failed to apply the

update and see failure details. The ability to identify which devices failed to update results in fewer manual

hours trying to pinpoint the source of a failure.

Operators need to monitor the state of device deployments and updates. In Device Update for IoT Hub,

compliance measures how many devices have installed the highest version compatible update. A device is

compliant if it has installed the highest version available update that is compatible for it.
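
The compliance definition above can be expressed as a simple calculation. This sketch only illustrates the definition and doesn't reflect how Device Update for IoT Hub computes or reports compliance. The function name, the dictionaries, and the tuple version format are assumptions:

```python
def compliance_percent(installed, highest_compatible):
    """Percentage of devices running the highest update version compatible with them.

    installed maps device ID -> installed version; highest_compatible maps
    device ID -> highest version available for that device. Versions are
    tuples such as (1, 2, 0).
    """
    if not highest_compatible:
        return 100.0
    compliant = sum(
        1
        for device_id, target in highest_compatible.items()
        if installed.get(device_id) == target
    )
    return 100.0 * compliant / len(highest_compatible)


# Illustrative fleet: device "b" is behind and device "d" has reported nothing.
installed = {"a": (1, 2, 0), "b": (1, 1, 0), "c": (2, 0, 0)}
targets = {"a": (1, 2, 0), "b": (1, 2, 0), "c": (2, 0, 0), "d": (1, 0, 0)}
print(compliance_percent(installed, targets))
```

Two of the four sample devices run their highest compatible version, so half of this fleet is compliant.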

Design and select IoT devices to function reliably in the expected operating conditions and over their expected

lifetime.

A reliable device should perform according to its hardware and software specifications, and any failure should

be detected and managed through mitigation, repair, or replacement.

"Design for reliability, but also plan for failures."

Environmental requirements

Operational profile

Regulations and standards

Connectivity

Management and operations layer (DevOps)

Establish observability across your IoT solution

Design

Device lifetimes are limited and affect the solution reliability. Assess the consequences of device failure on the

solution, and define a device lifecycle strategy in accordance with the solution requirements.

Device failure impact assessment includes severity (for example, single points of failure), probability (for

example, mean time between failures, or MTBF), detectability (see failure mode and effects analysis, or FMEA), and an acceptable downtime period.

The acceptable operational downtime determines the speed and extent to which devices should be maintained.
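
As a worked illustration of these factors, a common FMEA convention rates severity, occurrence, and detection on a 1 to 10 scale and multiplies them into a risk priority number (RPN). Steady-state availability can be estimated from MTBF and a mean time to repair (MTTR). The MTTR term, the function names, and the sample ratings are assumptions that aren't part of this guidance:

```python
def risk_priority_number(severity, occurrence, detection):
    """FMEA risk priority number: product of three 1-10 ratings."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("FMEA ratings are expected on a 1-10 scale")
    return severity * occurrence * detection


def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical gateway: severe impact (8), rare failure (3), moderately hard to detect (4).
rpn = risk_priority_number(severity=8, occurrence=3, detection=4)
availability = steady_state_availability(mtbf_hours=50_000, mttr_hours=24)
print(rpn, f"{availability:.4%}")
```

Comparing the RPN across device types is one way to decide where spares, redundancy, or closer monitoring matter most, and the availability estimate can be checked against the acceptable downtime period.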

For a solution lifecycle, the availability or longevity of the supply of devices and parts is an important aspect. The

more modular the design, the easier it is to swap out parts of the system, especially if some parts become

obsolete earlier than others. Alternative sourcing or multi-sourcing of component and module supply chains is critical

for highly reliable solutions.

The conditions in which a device operates affect its reliability. Define your environmental requirements, and use

devices with appropriate feature specifications. These specifications include parameters such as operating

temperature range, humidity, ingress protection rating (IPxx), electromagnetic interference (EMI) immunity, and shock and

vibration immunity.

The operational behavior of devices affects the performance stress applied to devices, and therefore their

reliability. Define operational profiles (estimating device behavior throughout its lifetime), and assess device

reliability accordingly. Such profiles include operation modes (for example wireless transmission on, low-power

modes) and environmental conditions (such as temperature) over the device lifetime.

In normal operating conditions, the device and software should run safely inside the specified performance

profiles and avoid running at the limit of the device capabilities. A device should be able to service and process

all external sensors and data processing required by the solution.

For specific industries, devices are subject to applicable regulations and standards. Regulations and standards

should be defined, and devices should meet any compliance and conformity requirements. Regulations include

certification and marking (for example FCC, CE). Standards include industry/application (for example ATEX and

MILspec) and safety (for example IEC 61508) conformance.

Device connectivity conditions, including upstream to the cloud and downstream local network, should be part

of the solution design and reliability. Assess the potential effect of connectivity interruption or interference, and

define a connectivity strategy accordingly. The connectivity strategy should include robustness, for example

fallback capability and disconnection management, and include buffering backup to mitigate cloud dependency

for critical/safety functions.

Monitor every component of your IoT solution and define alerts and procedures to manage the overall

reliability. All Azure IoT services publish metrics that describe the health and availability of the service. In

addition to the cloud service metrics, consider the metrics that you need on the device side to establish end-to-

end observability for reliability. Consider using these metrics as part of your overall solution reliability

monitoring.

By monitoring the operation of an IoT application and devices relative to a healthy state, you can detect and fix

reliability issues. Monitoring and diagnostics of an IoT application are crucial for availability and resiliency. If

something fails, you need to know that it failed, when it failed, and why it failed.

Before you mitigate issues that affect the reliability of an IoT application, you must be able to capture logs and

signals that help you detect issues in the end-to-end operation of the solution.

Implement

Use IoT solution logging and monitoring systems to determine whether the solution is functioning as expected

and to help troubleshoot what's wrong with its components.

Complete the following actions to establish observability on an IoT solution:

Establish a mechanism to collect and analyze performance metrics and alerts related to your IoT solution.

Configure devices, cloud services, and applications to collect and connect with Azure Monitor.

Enable a real-time dashboard and alerts to monitor the Azure backend services.

Define roles and responsibilities to monitor and act on events and alerts.

Implement continuous monitoring.

Establish a mechanism to collect and analyze performance metrics and alerts related to the IoT solution.

Azure Monitor is the recommended monitoring and visualization solution for Azure IoT solutions. Devices,

services, and applications, regardless of deployment location, can push log messages directly or through built-in

connectors into Azure Monitor. Azure Monitor provides custom log parsing to facilitate the decomposition of

events and records into individual fields for indexing and search.

Configure devices, cloud services, and applications to collect and connect to Azure Monitor.

Enable remote monitoring of IoT Edge devices using Azure Monitor and built-in metrics integration. To enable

this capability on your devices, add the metrics-collector module to your deployment and configure it to collect

and transport module metrics to Azure Monitor.

If your solution uses IoT Central, you can use the set of metrics provided by IoT Central to assess the health of

devices connected to your IoT Central application and the health of your active data exports. IoT Central

applications enable metrics by default, and you access them from the Azure portal. In addition, the Azure

Monitor data platform exposes these metrics and provides several ways for you to interact with them.

IoT Hub provides usage metrics, such as the number of messages used, and the number of devices connected.

You should monitor data generated by Azure IoT Hub and relay it to Azure Monitor for analysis and to alert

other services.

If the IoT solution requires a custom application hosted on a service such as App Service, Azure Kubernetes Service, or Azure Functions,

use Application Insights for monitoring and analysis. Application Insights is a feature of Azure Monitor that

provides extensible application performance management and monitoring for live web apps. Developers and

DevOps professionals can use Application Insights to:

Automatically detect performance anomalies.

Help diagnose issues by using powerful analytics tools.

See what users actually do with apps.

Help continuously improve app performance and usability.

Enable a real-time dashboard and alerts to monitor the Azure backend services

Azure Monitor alerts proactively notify you when Azure Monitor finds specific conditions in your monitoring data. Alerts let

you identify and address issues in your system before your customers notice them. You can set alerts on

metrics, logs, and the activity log.

Define roles and responsibilities to monitor and act on events and alerts

To learn more, see Roles, responsibilities, and permissions - Microsoft Azure Well-Architected Framework.

Continuous monitoring

Continuous Integration and Continuous Deployment (CI/CD) is a DevOps concept that helps you deliver
software more quickly and reliably and provide continuous value to your users. Continuous Monitoring (CM) is
a follow-up concept where you incorporate monitoring across each phase of your DevOps and IT Ops
cycles. CM continuously ensures the health, performance, and reliability of your apps and infrastructure as they
move from development through production to customers.

Resources

Seven best practices for Continuous Monitoring with Azure Monitor
Continuous integration and continuous deployment to Azure IoT Edge devices
Monitoring Azure IoT Hub
Monitoring Azure IoT Hub data reference
Trace Azure IoT device-to-cloud messages with distributed tracing
Check Azure IoT Hub service and resource health
Collect and transport metrics - Azure IoT Edge
Monitoring data reference
Enable message tracking

Implement HA/DR in critical components

Design

Implement

Monitor

As you build out your IoT solution, your SLA guides what you implement in your critical components for
HA/DR. You must design the solution to meet the SLA and to recover from failures across the IoT solution stack. There are
multiple approaches, starting with redundancy across the IoT solution stack through to enabling redundancy for
specific layers of the solution stack. Cost is also a major consideration that you must weigh against the benefits
of meeting the SLAs.
Define uptime goals for your IoT solution: Azure IoT services have defined uptime and availability targets.
Review the SLAs for Azure IoT services that are part of your solution and define your uptime goals. For example,
Azure IoT Hub has an SLA of 99.9%, which translates into about 1 minute and 26 seconds of potential downtime per
day to plan for. The Azure IoT SDK provides built-in, configurable logic to handle retries and backoff.
Consider breaking the uptime goals of your solution into two categories: operational (data ingestion) and
device management. For example, if a device is sending data to an IoT hub, but the services to manage the
device are unavailable, how does that affect your solution?
To learn more, see Azure IoT Hub SDK reliability features.
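
The downtime implied by an SLA percentage can be checked with simple arithmetic. The following minimal sketch converts an SLA into allowed downtime for a period; the function name and default period are assumptions for illustration. For a 99.9% SLA, the allowance is 0.1% of the 86,400 seconds in a day, or roughly 86 seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(sla_percent: float,
                             period_seconds: float = SECONDS_PER_DAY) -> float:
    """Downtime budget, in seconds, that an SLA leaves within a period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    return (100.0 - sla_percent) / 100.0 * period_seconds


if __name__ == "__main__":
    per_day = allowed_downtime_seconds(99.9)
    minutes, seconds = divmod(per_day, 60)
    print(f"99.9% SLA: {int(minutes)} min {seconds:.1f} s per day")
```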
Consider using redundant hardware options for sensors, power, and storage. Redundant hardware enables the
device to function completely if one of the critical features isn't available. Your hardware also needs to account
for issues with connectivity. For example, use a store and forward approach for data when connectivity isn't
available. Azure IoT Edge has this feature built in.
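
To make the store and forward idea concrete, here is a minimal sketch of a bounded buffer that keeps messages while the connection is down and drains them in order once sending succeeds again. It is illustrative only and does not show how Azure IoT Edge implements the feature; the class name, the `send` callback, and the buffer size are assumptions.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForwardBuffer:
    """Buffers messages locally and forwards them when the link is available."""

    def __init__(self, send: Callable[[str], bool], max_messages: int = 1000):
        # `send` is a hypothetical transport callback: True means delivered.
        self._send = send
        # A bounded deque drops the oldest message when the buffer is full.
        self._pending: Deque[str] = deque(maxlen=max_messages)

    def publish(self, message: str) -> None:
        """Store the message first, then try to forward everything pending."""
        self._pending.append(message)
        self.flush()

    def flush(self) -> int:
        """Send pending messages in order; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # still offline, keep the message for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

Dropping the oldest message when the buffer is full is one possible policy; a solution with critical data might persist messages to disk or drop by priority instead.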
Your device must also be able to handle outages in the cloud. Microsoft Azure region pairing provides an HA/DR
strategy for IoT Hub. For many, Microsoft Azure region pairing meets their SLA requirements. If region pairing
isn't enough, consider an implementation with a secondary IoT Hub enabled. You should also use Azure Device
Provisioning Service (DPS) to avoid hardcoded IoT Hub configurations on your devices. Should your primary IoT
hub go down, DPS can assign your device to a different hub.
For devices that are expected to be online most of the time, consider implementing a heartbeat message pattern
using a custom route in IoT Hub that's processed by Azure Stream Analytics, a Logic App, or an Azure Function
to determine if a heartbeat has failed. You can then use the heartbeat to define Azure Monitor alerts that take
actions as needed.
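
The heartbeat check itself is simple to express. The following sketch shows only the detection logic that a stream processor or function might run; it doesn't use any Azure SDK, and the function name, the data shape, and the three-missed-heartbeats tolerance are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_silent_devices(last_seen: Dict[str, datetime],
                        now: datetime,
                        heartbeat_interval: timedelta,
                        missed_allowed: int = 3) -> List[str]:
    """Return IDs of devices whose last heartbeat is older than the tolerance."""
    tolerance = heartbeat_interval * missed_allowed
    return sorted(
        device_id
        for device_id, seen_at in last_seen.items()
        if now - seen_at > tolerance
    )
```

The returned device IDs could then feed an alert rule or a remediation workflow.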
Use Azure Monitor to monitor the state of your IoT Hub environment, ensure it's running properly,
and confirm that your devices aren't being throttled or experiencing connection issues.

Resources

IoT Hub SDK reliability features
IoT Hub HA/DR
Apply IoT Hub Intra-region HA/DR
Apply IoT Hub Cross-region HA/DR
How to clone an Azure IoT hub
Test apps for availability and resiliency - Azure Architecture Center

Ingestion and communication layer

Plan for capacity

Design

Implement

Monitor

Resources

Plan for service quotas: As with all platform services, IoT Hub and DPS have quotas and throttles enforced on
certain operations, enabling Azure to deliver predictable service levels and cost for its services. Quotas and
throttles are tied to the service tier and number of units you deploy so that you can design your solution with
the right number of resources. Review quotas and throttles in advance and design your IoT Hub and DPS
resources accordingly.
Plan redundant capacity: When you're planning your thresholds and alerts, consider the latency between the
detection and the action taken so that the system and operators have enough time to respond to change
requests. Otherwise, you may detect a need to increase the number of units, but the system fails - for example
by losing messages - before the increase takes effect.
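
One way to account for that latency is to alert early enough that the capacity increase completes before the quota is reached. The sketch below assumes roughly linear usage growth; the function name and all numbers are hypothetical and aren't actual IoT Hub quotas.

```python
def alert_threshold(quota: float,
                    growth_per_hour: float,
                    reaction_hours: float) -> float:
    """Usage level at which to alert so scaling finishes before the quota.

    reaction_hours covers detection, approval, and the time the scale
    operation needs to take effect.
    """
    if quota <= 0 or growth_per_hour < 0 or reaction_hours < 0:
        raise ValueError("quota must be positive; other inputs non-negative")
    headroom = growth_per_hour * reaction_hours
    return max(quota - headroom, 0.0)


if __name__ == "__main__":
    # Hypothetical: 400,000 messages/day quota, usage growing by 10,000 per
    # hour, and 4 hours needed to detect, approve, and apply a scale-up.
    print(alert_threshold(400_000, 10_000, 4))
```

A threshold of zero means the growth rate outpaces the reaction time, which suggests keeping redundant capacity provisioned in advance.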
Establish benchmarks at production scale: Due to the distributed nature of IoT solutions, the number of
devices, and the volume of data, it's important to establish scale benchmarks for the overall solution. As the
number of devices or volumes of data increase, the cloud gateway must scale to support uninterrupted data
flow. These benchmarks help to plan for capacity risks. Use the IoT device telemetry simulator to simulate
production scale volumes.
Auto scale to adjust to quotas dynamically: To provide the lowest cost and operational effort, consider
implementing an automated system to scale your resources up and down with the varying needs of your
solution. To learn more, see the following Resources section.
Monitor quotas and throttling: A benefit of using PaaS services is the ability to scale up and down with
little effort according to your needs. To ensure solution reliability, you need to continuously monitor resource
usage against quotas and throttles to detect increases in resource usage that indicate the need to scale.
Depending on your business requirements, you may implement something as simple as continuously
monitoring resource usage and alerting the operator when thresholds are met, or an automated system to auto
scale.
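
Both options can share the same decision logic. The following minimal sketch classifies current usage against a quota; the enum, the function name, and the 75% and 90% ratios are assumptions chosen for illustration, not recommended values.

```python
from enum import Enum


class QuotaAction(Enum):
    OK = "ok"
    ALERT_OPERATOR = "alert_operator"
    SCALE_UP = "scale_up"


def evaluate_quota_usage(used: float,
                         quota: float,
                         alert_ratio: float = 0.75,
                         scale_ratio: float = 0.90) -> QuotaAction:
    """Map current usage to an action based on configurable thresholds."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    ratio = used / quota
    if ratio >= scale_ratio:
        return QuotaAction.SCALE_UP
    if ratio >= alert_ratio:
        return QuotaAction.ALERT_OPERATOR
    return QuotaAction.OK
```

A simple deployment might only act on `ALERT_OPERATOR`, while an automated system would also trigger a scale operation on `SCALE_UP`.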
Understand Azure IoT Hub quotas and throttling
Overview of the Microsoft Azure IoT Hub Device Provisioning Service
Auto-scale your Azure IoT Hub - Code Samples
Azure IoT Device Telemetry Simulator - Code Samples
Understanding the IP address of your IoT hub
Device configuration best practices for Azure IoT Hub
Azure IoT Hub scaling
Azure IoT Central quotas and limits

Next steps

Azure IoT Hub scaling

Operational Throttles | Understand Azure IoT Hub quotas and throttling

Other Limits | Understand Azure IoT Hub quotas and throttling

Security in your IoT workload

12/16/2022 • 23 minutes to read

Prerequisites

The Internet of Things (IoT) provides vast opportunities to gain insights into the complex relationships of people,

systems, and data. Those opportunities also come with the challenge of securing diverse and heterogeneous

device-based solutions that may have little or no direct interaction.

The security guidance of Microsoft Azure Well-Architected Framework for IoT provides recommendations to

help improve the security of your IoT solution. It will help you identify key considerations for secure solution

design and provides guidance on how to implement capabilities that can reduce risk to your IoT solution.

Multiple personas, such as IoT device builders, IoT application developers, and IoT solution operators, are

involved in an IoT solution. All of them are required to ensure security across the full lifecycle of an IoT solution.

IoT solutions that involve operational technology (OT) differ from traditional IT solutions in key ways, such as

having on-premises devices that monitor and control physical equipment. These OT devices add security

challenges such as tampering, packet sniffing, the need for out-of-band management, and over-the-air (OTA)

updates.

When you design an IoT solution, it's important to understand the potential threats to that solution, and add

defense in depth as you design and architect it. It's important to design the solution from the start with security

in mind because understanding how an attacker might be able to compromise a system helps make sure

appropriate mitigations are in place from the start.

Security starts with a threat model. Threat modeling offers the greatest value when you incorporate it into the

design phase. When you're designing, you have the greatest flexibility to make changes to eliminate threats. To

learn more, see Internet of Things (IoT) security architecture.

Securing IoT solutions with a Zero Trust security model starts with non-IoT specific requirements, specifically

ensuring you've implemented the basics of securing identities and their devices and limiting their access. These include

explicitly verifying users, having visibility into the devices they're bringing on to the network, and being able to

make dynamic access decisions using real-time risk detections. This helps limit the potential impact of users

gaining unauthorized access to IoT services and data in the cloud or on-premises, which can lead to both mass

information disclosure (such as leaked production data of a factory) and potential elevation of privilege for

command and control of cyber-physical systems (such as stopping a factory production line). Factories and OT

environments are often easy targets for malware and security breaches due to equipment that can be more than

40 years old, physically vulnerable, and isolated from server level security. For an end-to-end perspective,

review the Azure Well-Architected Framework's security pillar, which includes guiding principles that are critical

to other aspects of your solution.

Guidance for creating zero trust architecture (ZTA) has also been provided by NIST in its Zero Trust Architecture

(nist.gov) document. This document provides general deployment models and use cases where zero trust could

improve an enterprise's overall information technology security posture.

To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected

Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:

Azure Well-Architected Review

After you complete the assessment, this guide helps you address the key security recommendations identified

for your solution.

Principles

The Azure Well-Architected Framework for IoT overview refers to the foundational layers in an IoT solution.

These layers are shown in the diagram below. Security functions cut across these layers and should follow

specific principles and recommendations.

When you extend the overall Well-Architected Framework security pillar, there are principles that relate

specifically to IoT. The Zero Trust Cybersecurity for the Internet of Things whitepaper describes how to apply a

Zero Trust approach to your IoT solutions based on customer experiences and Microsoft's own environment.

The key security principles that inform a strong security posture that apply to an IoT solution are:

Strong identity to authenticate devices. Register devices, issue renewable credentials, employ strong

authentication for personnel using techniques such as multi-factor authentication or password-less

authentication, and use a hardware root of trust to ensure you can trust its identity before making

decisions.

Least privileged access to mitigate the impact of a device being compromised. Implement device and

workload access control to limit any impact from authenticated identities that may have been

compromised or running unapproved workloads.

Device health to gate access or flag devices for remediation. Check security configuration, assess

vulnerabilities and insecure passwords, and monitor for active threats and anomalous behavioral alerts

to build ongoing risk profiles.

Continual updates to keep devices healthy. Utilize a centralized configuration and compliance

management solution and a robust update mechanism to ensure devices are up to date and in a healthy

state.

Security monitoring and response to detect and respond to emerging threats. Employ proactive

monitoring to rapidly identify unauthorized or compromised devices.

An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,

and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also

cross-cutting layers that enable the design, building, and running of IoT solutions:

Design

The following sections address the layer specifics for the security pillar.

As part of the threat modeling exercise, you should divide a typical IoT architecture into several

components/zones. This article focuses on the IoT aspects of security, but you should take a holistic approach to

security. The IoT security guidance focuses on these two zones:

Device and field gateway zones: The immediate physical space around the device and gateway where

physical access and/or peer-to-peer digital access is feasible. Many industrial companies use the Purdue

model included in the ISA 95 standard to ensure their process control networks protect both the limited

bandwidth of the network and the ability to offer real-time deterministic behavior. In recent years, as

cybersecurity incidents caused by internal and external parties have increased, security teams have come to view

the Purdue model as an extra layer of the defense-in-depth methodology.

Cloud gateway and services zone: Any software component or module running in the cloud that is

interfacing with devices and/or gateways for data collection and analysis, as well as for command and

control.

Zones are a broad way to segment a solution. Each zone often has its own data and authentication and

authorization requirements. Zones can also be used to isolate damage and restrict the impact of low trust zones

on higher trust zones.

All elements in the architecture are subject to various threats that can be classified according to one of the six

STRIDE categories: spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of

privilege.

Follow the SDL process when you design and build these services.

Adopt DevOps methodologies that also focus on security. To learn more, see Enable DevSecOps with Azure and

GitHub.

Strong identity

Strong device identity is delivered through tightly integrated capabilities of IoT devices and services, including:

A hardware root of trust

Strong authentication using techniques such as certificates, multi-factor authentication, or password-less

authentication

Renewable credentials

Organizational IoT device registry

When possible, use a hardware root of trust (RoT) with the following attributes:

Secure storage of the credentials proving the identity in dedicated, tamper-resistant hardware.

Immutable onboarding identity tied to the physical device.

Unique per-device renewable operational credentials for regular device access.

The onboarding identity represents the physical device, and is inseparable from it. It can’t be changed for the

device's lifetime, and is typically created and installed during manufacturing.

Given its immutability and lifetime, only use the device onboarding identity to onboard devices into IoT

solutions. After onboarding, provision and use a renewable operational identity and credentials for

authentication and authorization to the IoT application. Making this identity renewable enables you to manage

access and revocation of the devices for operational access. You can apply policy driven gates - such as

attestation of device integrity and health - at renewal time.

Protect the supply chain of the hardware RoT, or any other hardware components of an IoT device, to ensure

that attacks on supply chain don't compromise the integrity of such devices. This also ensures devices are built

to security specifications and design that conform to required compliance regimes.

Password-less authentication, usually using standard X.509 certificates to prove a device's identity, offers

greater protection than secrets such as passwords and symmetric tokens shared between both parties.

Certificates are a strong, standardized mechanism that provides renewable, password-less authentication.

Provision operational certificates from a trusted PKI and use a renewal lifetime appropriate for the security

posture of your business use, management overhead, and cost. Make renewal automatic to minimize any

potential access disruption due to manual rotation, and use standard, up-to-date cryptography techniques. For

example, renew through CSR instead of transmitting a private key. Any access granted to a device should be

granted based on its operational identity. Support credential revocation (such as CRL when using X.509

certificates) to enable immediate removal of device access. For example, to respond to compromise or theft.
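
The renewal and revocation decisions described above reduce to two small checks. The sketch below shows only that decision logic with the standard library; it doesn't parse certificates, create a CSR, or fetch a CRL. The function names, the 30-day renewal window, and the use of serial numbers as revocation keys are assumptions.

```python
from datetime import datetime, timedelta
from typing import Set


def needs_renewal(not_after: datetime,
                  now: datetime,
                  renew_before: timedelta = timedelta(days=30)) -> bool:
    """True once the certificate enters its renewal window before expiry."""
    return now >= not_after - renew_before


def is_access_allowed(serial: str,
                      not_after: datetime,
                      now: datetime,
                      revoked_serials: Set[str]) -> bool:
    """Allow access only for unexpired credentials that aren't revoked."""
    return serial not in revoked_serials and now < not_after
```

Renewing inside a window rather than at expiry leaves time to retry if a device is offline when renewal is first attempted.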

Some legacy or resource constrained IoT devices aren't manufactured with or capable of utilizing a strong

identity, password-less authentication, or renewable credentials. For these devices, use IoT gateways as

guardians to locally interface with these less-capable devices, bridging them to access IoT services with strong

identity patterns. This enables Zero Trust adoption today, while transitioning to use more capable devices over

time.

If you can't use a hardware root of trust - for example, virtual machines, containers, or any service that embeds

an IoT client - use whatever capabilities are available. Common implementations use virtual machines or

containers, which don't have hardware root of trust support, but can use password-less authentication and

renewable credentials. Defense in depth lets the solution provide redundancies where possible and fills in gaps

where necessary. Virtual machines and containers might run from an area with more physical security, such as a

data center, as compared to an IoT device in the field.

Use a centralized organizational IoT device registry for your organization's IoT devices to manage their

lifecycle and audit device access. This approach is similar to the way you secure the user identities of an

organization's workforce to achieve Zero Trust security. Use a cloud-based identity registry to handle the scale,

management, and security of an IoT solution. IoT device registry information is used to onboard devices into an

IoT solution by verifying that the device identity and credentials are known and authorized. After a device is

onboarded, the device registry contains the core properties of the device, including its operational identity and

the renewable credentials used to authenticate for everyday use.

You can use IoT device registry data to view the inventory of an organization's IoT devices (including health,

patch, and security state), and to query and group devices for scaled operation, management, workload

deployment, and access control.

In cases where devices don't connect to Azure services for IoT, use network sensors to detect and inventory

unmanaged IoT devices in an organization's network for awareness and monitoring.

Least privileged access

Device health

Least-privileged access control helps to limit any impact from authenticated identities that may have been

compromised or that are running unapproved workloads. For IoT scenarios, this means granting access to

solution operators, devices, and their workloads using:

Device and workload access control, to provide access control to the scoped workloads running on the

device.

Access only from secure locations and systems, granting just-in-time access, and implementing strong

authentication mechanisms such as multi-factor authentication or password-less authentication.

Conditional access, to conditionally grant access based on the device's context. Examples include:

Location (IP address, GPS location)

Uniqueness (access from one location at a time)

Time of the day

Network traffic patterns

Services can also use the device's context to conditionally deploy workloads.
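
As an illustration of conditional access, the following sketch evaluates a device's context against a policy that covers location (source IP range), uniqueness, and time of day. The data classes, field names, and policy values are hypothetical; a real solution would rely on the access control features of its IoT and identity services.

```python
from dataclasses import dataclass
from datetime import time
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class DeviceContext:
    source_ip: str
    other_active_locations: int  # sessions already open from elsewhere
    local_time: time


@dataclass(frozen=True)
class AccessPolicy:
    allowed_network: str  # CIDR range the device is expected to connect from
    allowed_from: time
    allowed_until: time


def is_access_granted(ctx: DeviceContext, policy: AccessPolicy) -> bool:
    """Grant access only when every contextual condition is satisfied."""
    in_network = ip_address(ctx.source_ip) in ip_network(policy.allowed_network)
    in_hours = policy.allowed_from <= ctx.local_time <= policy.allowed_until
    is_unique = ctx.other_active_locations == 0
    return in_network and in_hours and is_unique
```

Network traffic patterns are left out of this sketch because they usually require a baseline built from monitoring data.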

Configure the access management of the IoT cloud gateway to only grant appropriate access permissions based

on the functionality required by the backend.

Limit access points to IoT devices and cloud applications by ensuring the ports are kept to minimum access.

Build mechanisms to prevent and detect physical tampering of devices.

People also use an IoT solution, so ensure that user access is managed through an

appropriate access control model such as role-based or attribute-based access control. To layer least-privileged

access for IoT devices and design for defense in depth, use network segmentation to group IoT devices in

addition to end point protection, mitigating the impact of a potential compromise. Network micro-segmentation

enables isolation of less-capable devices at the network layer, either behind a gateway or on a discrete network

segment. Put in place a holistic firewall rule strategy that lets devices access the network when required for the

solution to operate and blocks access when not allowed. More mature organizations can also implement

micro-segmentation policies at multiple layers of the Purdue model.

Following the Zero Trust principle, use device health as a key factor to determine the risk profile (such as trust

level) of a device. Use this risk profile as an access gate to ensure only healthy devices can access IoT

applications and services, or to identify devices in questionable health for remediation.
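
A health-based access gate can be sketched as a function that maps a device health report to a decision. The report fields, the decision names, and the rule that any active threat alert blocks access are assumptions for illustration, not a prescribed policy.

```python
from dataclasses import dataclass
from enum import Enum


class AccessDecision(Enum):
    ALLOW = "allow"
    REMEDIATE = "remediate"
    BLOCK = "block"


@dataclass(frozen=True)
class HealthReport:
    securely_configured: bool
    software_up_to_date: bool
    uses_secure_credentials: bool
    active_threat_alerts: int


def decide_access(report: HealthReport) -> AccessDecision:
    """Gate access on device health; flag questionable devices for remediation."""
    if report.active_threat_alerts > 0:
        return AccessDecision.BLOCK
    if (report.securely_configured
            and report.software_up_to_date
            and report.uses_secure_credentials):
        return AccessDecision.ALLOW
    return AccessDecision.REMEDIATE
```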

According to industry standards, the evaluation of device health should include:

Security configuration assessment and attestation - is the device configured securely?

Vulnerability assessment - is the device running software that is out of date, or software that has known

vulnerabilities?

Insecure credential assessment - is the device using secure credentials (such as certificates) and protocols

(such as TLS 1.2+)?

Active threats and threat alerts.

Anomalous behavioral alerts such as network patterns and usage deviation.

Continual updates

The other side of enabling your organization to control device access based on health is the need to proactively

maintain production devices in a working, healthy target state.

The capabilities to proactively support healthy devices include:

Centralized configuration and compliance management, to apply policies and to securely distribute and

update certificates.

Deployable updates to update the full set of software on devices: firmware, drivers, base OS, host

applications, and cloud-deployed workloads. The update mechanism should have remote deployment

capabilities and should be resilient to changes in environment, operating conditions, and authentication

mechanism (for example, a certificate change because of expiry or revocation). The update mechanism

should support update rollout verification, and ideally be integrated with pervasive security monitoring to

enable scheduled updates for security. You should be able to defer deployments so they don't interfere with

business continuity, but they should eventually complete within a well-defined time interval after a vulnerability

is detected. Devices that haven't been updated should be flagged as unhealthy.

Security monitoring and response

As a defense in depth strategy, monitoring adds an extra layer of protection for managed greenfield devices

while also providing a compensating control for legacy unmanaged brownfield devices that don't support

agents and can't easily be patched or configured remotely. You need to decide the levels of logging, types of

activities that you need to monitor, and the responses required for alerts. Logs should be stored securely and

not contain any security details.

As recommended by CISA, security monitoring includes:

Generating an as-is asset inventory and network map of all IoT/OT devices.

Identifying all communication protocols used across IoT/OT networks.

Cataloging all external connections to and from IoT/OT networks.

Identifying vulnerabilities in IoT/OT devices and using a risk-based approach to mitigate them.

Implementing a vigilant monitoring program with anomaly detection to detect malicious cyber tactics such

as living-off-the-land techniques within IoT/OT systems. The program should monitor for unauthorized

changes to controllers and unusual behavior from devices, and audit access and authorization attempts.

Most IoT attacks follow a familiar kill chain pattern in which adversaries establish an initial foothold, elevate their

privileges, and move laterally across the network. Often, they use privileged credentials to bypass barriers such

as next generation firewalls established to enforce network segmentation across subnets. Rapidly detecting and

responding to these multistage attacks requires a bird's-eye view across IT, IoT, and OT networks, combined with

automation, machine learning, and threat intelligence. Collect signals from the entire environment: all users,

devices, applications, and infrastructure, both on-premises, and in multiple clouds. Analyze the signals in

centralized security information and event management (SIEM) and extended detection and response (XDR)

platforms, where security operations center (SOC) analysts can hunt for and uncover previously unknown

threats. Finally, security orchestration and automated response (SOAR) platforms are essential for responding to

incidents rapidly and mitigating attacks before they have material impact on your organization. This is

accomplished by defining playbooks that are automatically executed when specific incidents are detected. For

example, you can automatically block or quarantine compromised devices so they're unable to infect other

systems.

Implement

Evaluate and deploy a Zero Trust security model

Apply Zero Trust to your existing IoT infrastructure

Use Zero Trust as criteria to select IoT services

The IoT solution needs to be able to perform monitoring and remediation at scale for all its connected devices.
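
The playbook idea described earlier in this section can be sketched as a lookup from incident type to an ordered list of response actions. This is a minimal sketch under stated assumptions: the incident name, the two action functions, and the `PLAYBOOKS` registry are hypothetical, and the actions only return strings where a real SOAR platform would call its own connectors.

```python
from typing import Callable, Dict, List

# Hypothetical type: a playbook is an ordered list of response actions.
Playbook = List[Callable[[str], str]]


def block_network_access(device_id: str) -> str:
    # Stand-in for a real call that blocks the device at the network layer.
    return f"blocked:{device_id}"


def quarantine_device(device_id: str) -> str:
    # Stand-in for a real call that isolates the device from other systems.
    return f"quarantined:{device_id}"


# Hypothetical registry: incident type -> playbook.
PLAYBOOKS: Dict[str, Playbook] = {
    "compromised_device": [block_network_access, quarantine_device],
}


def run_playbook(incident_type: str, device_id: str) -> List[str]:
    """Run every action registered for the incident type, in order."""
    return [action(device_id) for action in PLAYBOOKS.get(incident_type, [])]
```

Incident types without a registered playbook return an empty result, which leaves those incidents for an analyst to handle manually.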

Data protection: Data that's ingested into the IoT solution should be protected by following the guidance in the overall

Azure Well-Architected Framework. For IoT solutions, these principles extend to the devices too. It's critical to ensure that

communication from the device to the cloud is secure and encrypted using the latest Transport Layer Security

(TLS) standards. If data is stored on devices (data at rest), use standard encryption algorithms to encrypt the

data.

Make sure devices are well protected physically. Turn off or disable any features that aren't needed on the

device, such as physical ports (USB, UART) and connectivity (Wi-Fi, Bluetooth). Physically remove, cover, or

block them when necessary.

When possible, use a firewall on the device to restrict the network access.

To learn more about Microsoft's approach, see Zero Trust.

Use the Zero Trust Assessment to analyze the gaps in your current protection for identity, endpoints, apps,

network, infrastructure and data.

Use the recommended solutions from the assessment to prioritize your Zero Trust implementation, and

move forward with guidance from the Microsoft Zero Trust Deployment Center.

To help prioritize IoT Zero Trust investments, you can use the IIC's IoT Security Maturity Model to assess the

security risks for your business.

Microsoft Defender for IoT is an agentless solution that continuously monitors network traffic using IoT-aware

behavioral analytics to immediately identify unauthorized or compromised IoT devices. Deploy Microsoft

Defender for IoT to:

Inventory all IoT devices.

Assess all IoT devices for vulnerabilities and provide risk-based mitigation recommendations.

Continuously monitor devices for anomalous or unauthorized behavior.

Additionally, Microsoft Defender for IoT is tightly integrated with Azure Sentinel, a cloud-based SIEM/SOAR

platform that supports third-party SOC solutions such as Splunk, IBM QRadar, and ServiceNow.

Explore adding Azure Sphere guardian modules to your critical brownfield devices to enable them to connect to

IoT services with Zero Trust capabilities including strong identity, end-to-end encryption, and regular security

updates.

Network design and configuration provide other opportunities to build defense in depth by segmenting IoT

devices based on their traffic patterns and risk exposure. This minimizes the potential impact of compromised

devices and ability of adversaries to pivot to higher-value assets. Network segmentation is typically

accomplished using next-generation firewalls.

Use IoT services that offer the following key Zero Trust capabilities:

Enable full support for Zero Trust user access control, for example, strong user identities,

multi-factor authentication, and conditional user access.

Include integration with user access control systems for least-privileged access and conditional controls.

Provide a central device registry for full device inventory and device management.

Perform mutual authentication, offering renewable device credentials with strong identity verification.

Enforce least-privileged device access control with conditional access to ensure only devices fulfilling criteria

can connect, such as only healthy devices or devices from known locations.

Support OTA updates to keep devices healthy.

Enable security monitoring of both the IoT services themselves, and of the range of connected IoT devices.

Monitor and control access to all public endpoints and implement authentication and authorization for any

calls made to these endpoints.

Microsoft Azure offers several IoT services that provide Zero Trust capabilities:

Azure IoT Hub supports an operational registry for organizational IoT devices. It accepts device operational

certificates to enable strong identity. Devices can be disabled centrally from Azure IoT Hub to prevent

unauthorized connection. Azure IoT Hub supports provisioning of module identities in support of IoT

Edge workloads.

Azure IoT Hub Device Provisioning Service (DPS) provides a central device registry for organizational

devices to register for onboarding at scale. DPS accepts device certificates to enable onboarding with

strong device identity and renewable credentials, registering devices in IoT Hub for their daily operation.

Azure Device Update (ADU) for IoT Hub enables you to deploy OTA updates for your IoT devices. It

provides a cloud-hosted solution to connect virtually any device. ADU supports a broad range of IoT

operating systems, including Linux and Azure RTOS.

Microsoft Defender for IoT is an agentless, network-layer security platform delivering continuous asset

discovery, vulnerability management, and threat detection for IoT and OT devices, including proprietary,

embedded OT devices as well as legacy Windows systems commonly found in OT environments.

Microsoft Defender for IoT interoperates with Azure Sentinel to provide a bird's-eye view of security

across your entire enterprise. By collecting data at cloud scale across all users, devices, applications, and

infrastructure, including firewall, NAC, and network switch devices, Sentinel can quickly spot anomalous

behaviors indicating potential compromise of IoT/OT devices.

Use Zero Trust as criteria to select IoT devices

IoT devices supporting Zero Trust should:

Contain a hardware root of trust that it can use to provide a strong device identity.

Use renewable credentials for regular operation and access.

Enforce least-privileged access control to local device resources such as cameras, storage, and sensors.

Emit proper device health signals to services to enable enforcement of conditional access.

Provide update agents and corresponding software updates for the usable lifetime of the device to ensure

security updates can be applied, along with device management capabilities to enable cloud-driven device

configuration and automated security response.

Run security agents that integrate with security monitoring, detection, and response systems.

Minimize physical attack footprint. For example, no USB ports.
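
To illustrate how device health signals can feed conditional access, the following minimal sketch decides whether a device may connect. The `DeviceSignal` fields and the `may_connect` helper are hypothetical, and requiring both a healthy state and a known location is an assumed policy chosen for illustration; your own criteria may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSignal:
    device_id: str
    healthy: bool   # for example, derived from update and attestation status
    location: str   # hypothetical location label reported for the device


def may_connect(signal: DeviceSignal, known_locations: frozenset) -> bool:
    """Grant access only to healthy devices that report a known location."""
    return signal.healthy and signal.location in known_locations
```

A service would evaluate a check like this on every connection attempt, so a device that stops emitting healthy signals loses access without any manual step.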

Microsoft Azure offers several products that can be used to support the implementation of IoT devices:

Azure IoT Edge provides an edge runtime connection to IoT Hub and other services:

Supports use of certificates as strong device identities.

Supports the PKCS#11 standard that enables support for device manufacturing identities - and other

secrets for operational identities - stored on a TPM or HSM.

The Azure IoT platform SDKs are a set of device client libraries, developer guides, samples, and

documentation. Device SDKs implement various security features, such as encryption and authentication, to

assist you in developing a robust and secure device application.

Azure IoT Hub support for virtual networks enables you to restrict connectivity to IoT Hub through a VNet

that you operate. This network isolation prevents connectivity exposure to the public internet. It can help

to prevent exfiltration attacks from sensitive on-premises networks.

Azure RTOS provides a real-time operating system as a collection of C-language libraries that you can

deploy on a wide range of embedded IoT platforms. Azure RTOS includes a complete TCP/IP stack with

TLS 1.2 and 1.3 and basic X.509 capabilities. Azure RTOS, along with the Azure IoT Embedded

SDK, is also designed to integrate with Azure IoT Hub, DPS, and Microsoft Defender, and can use some of the

same security mechanisms as larger, more expensive IoT devices. With features such as X.509 mutual

authentication and support for modern TLS cipher suites such as ECDHE and AES-GCM, customers can

build Azure RTOS applications knowing that the basics of secure network communication are covered. In

addition to the integration with Azure IoT services, Azure RTOS provides support for Zero Trust design on

microcontroller platforms that support hardware security features such as ARM TrustZone (a memory

protection/partitioning architecture), secure element devices such as the STSAFE-A110 from

STMicroelectronics, and industry standards such as the ARM Platform Security Architecture (PSA), which

combines hardware and firmware to provide a standardized set of security features including secure

boot, cryptography, attestation, and more.

Windows 10 IoT Enterprise has significant security features that can be used to help ensure security

across three key pillars of the IoT security spectrum:

Protect data. Securing data means protecting it at all times: at rest, during code execution, and

in motion. This is done by using BitLocker Drive Encryption, Secure Boot, Windows Defender

Application Control, Windows Defender Exploit Guard, secure Universal Windows Platform (UWP)

applications, Unified Write Filter, a secure communication stack, and security credential

management.

Monitor and detect. Device Health Attestation (DHA) lets you start with a trusted device and

maintain trust over time.

Update and manage. You can use Device Update Center and Windows Server Update Services to

apply the latest security patches. If you determine that a device might be exposed to a threat, you

can remediate that threat by using Azure IoT Hub device management features, Microsoft Intune

or third-party mobile device management solutions, and Microsoft System Center Configuration

Manager.

Along with the edge platforms above, Microsoft provides programs to certify different device capabilities:

The Azure Certified Device program empowers device partners to easily differentiate and promote

devices, and enables solution builders and end customers to find IoT devices built with features that

enable a zero-trust solution.

The Edge Secured-core program (preview) validates whether devices meet additional security requirements

around device identity, secure boot, operating system hardening, device updates, data protection, and

vulnerability disclosures. The Edge Secured-core program requirements have been distilled from various

industry requirements and security engineering points of view. The Edge Secured-core program enables

Azure services such as the Azure Attestation service to make conditional decisions that are based on the

posture of the device, thus enabling the Zero Trust model. Some of the highlights consist of requiring the

device to include a hardware root of trust and provide secure boot and firmware protection. Attributes

such as these can be measured by the attestation service and used by downstream services to

conditionally grant access to sensitive resources.

Microsoft has several solutions available that provide hardware built specifically for IoT scenarios fully

integrated with Azure services:

Azure Sphere is a fully managed integrated hardware, OS, and cloud platform solution for medium and

low-power IoT devices that meets all seven properties of highly secured devices. Azure Sphere has

several features that can help an organization implement Zero Trust. Devices are designed for explicit

verification and implement certificate-based Device Attestation and Authentication (DAA), which will

automatically renew trust. In addition to supporting the Zero Trust principle of explicit verification, Azure

Sphere implements least-privileged access, where applications are denied access by default to all

peripheral and connectivity options. This even extends to network connectivity, where permitted web

domains must be included in the software manifest or the application isn't able to connect outside of the

device. Azure Sphere is built around assumed breach. Protections are layered with defense in depth

throughout the OS design, and a secure-world partition, running in Arm TrustZone on Azure Sphere

devices, helps segment breaches to the OS from access to Pluton or hardware resources. Azure Sphere

can be applied as a guardian module to secure other devices, including existing brownfield systems that

weren't designed for trusted connectivity. In this scenario, an Azure Sphere guardian module is deployed

with an application designed to interface with an existing product through an interface such as Ethernet,

serial, or BLE, that doesn't necessarily have direct internet connectivity.

Azure Percept is an end-to-end edge AI platform that includes hardware accelerators integrated with

Azure AI and IoT services, pre-built AI models, and solution management to help you start your proof of

concept in minutes. Azure Percept devices are designed with a hardware root of trust to help protect

inference data (in transit and at rest), AI models, and privacy-sensitive sensors such as cameras and

microphones. It enables device authentication and authorization for Azure Percept Studio services. To

learn more, see Azure Percept security.

Resources

Next steps

Azure Architecture Center

How to apply a Zero Trust approach to your IoT solutions - Microsoft Security Blog

Zero trust whitepaper (microsoft.com)

IoT Security Architecture

Overview of an IoT workload

Cost optimization in your IoT workload

12/16/2022 • 38 minutes to read

According to the IoT Signals Edition 3 report, the top reason for proof of concept (PoC) failure is the high

cost of scaling. The report also says that the high cost of scaling IoT projects comes from the complexities of

integrating across layers, for example devices, edge connectivity, and compatibility across applications. 40% of

adopters experience this issue.

Reasons for PoC failure (IoT Signals Edition 3)

High cost of scaling 32%

Lack of necessary technology 26%

Pilots demonstrate unclear business value/ROI 25%

Too many platforms to test 24%

Pilot takes too long to deploy 23%

IoT solution implementations are complex projects that require understanding of broad technologies and

consideration of various layers. The IoT Well-Architected Framework describes these IoT architecture layers. The

layers help you simplify and understand Azure IoT architecture. You need to establish a different cost strategy

for each layer because each one has a different technology and sometimes a different ecosystem such as

devices, telecommunications, or edge. Cost optimization for IoT Well-Architected Framework provides guidance

on how to optimize costs across the IoT architecture layers.

Cost optimization is a closed-loop process of cost control that needs to be continuously monitored, analyzed,

and improved throughout a lifecycle. Understanding the IoT architecture layers helps you design an architecture

for your IoT solution that enables you to define a baseline of the costs and to consider multiple architectures for

cost comparison.

IoT cost results from a tradeoff between various technology options. Sometimes, it's not a simple comparison

because IoT is an end-to-end solution. You should also consider synergy and cost benefits when combining

multiple services and technologies, such as IoT Central with Plug and Play, or using device twins to handle events

in Azure Digital Twins. In some areas, a one-time cost is more effective than recurring costs. For example, in

security, where hacking techniques are always changing, it's better to adopt a reliable commercial operating

system and module such as Azure Sphere that, for a one-off payment, provides monthly security patches to

devices for a long time.

The key criteria for architecture decisions are the requirements for your IoT solution. You can separate the

requirements into functional and non-functional requirements. You also need to separate the cost

considerations for each type of requirement, because functional requirements affect system design and

non-functional requirements affect system architecture.

You need to develop multiple use cases based on the requirements and compare them before finalizing your

architecture choices.

A typical IoT project includes the following processes:

Design

Hardware sourcing

Development

Integration

Onboarding

Deployment

Operation

The complexity of these processes leads to many opportunities to cut costs. Cost optimization in the IoT

Well-Architected Framework includes guidance to simplify cost drivers from complex IoT architecture and processes

and help you make better decisions to achieve cost effectiveness.

Prerequisites

Principles

Understand total cost of ownership

This guide considers various combinations of Azure IoT services and related technologies. However, it doesn't

consider cost optimization for specific industry and IoT use cases such as connected factory, predictive

maintenance, and remote monitoring.

Because IoT projects include cloud technologies, you should review Microsoft Azure Well-Architected

Framework - Cost Optimization and Cloud Cost Optimization | Microsoft Azure for overall guidance about cloud

costs.

To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected

Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:

Azure Well-Architected Review

After you complete the assessment, this guide helps you address the key cost optimization recommendations

identified for your solution.

The Microsoft Well-Architected Framework cost optimization documentation is a set of generic principles. This

workload view focuses on specifics for an IoT solution and considers unique factors that add to the generic

guidance.

Cost effectiveness is one of the key success factors in IoT projects. In a typical IoT solution, devices generate

large quantities of telemetry that is then stored in a cloud repository. How you develop devices and applications,

handle large volumes of data, and design your architecture have an effect on overall costs. Because an IoT

solution is a multilayered technology stack, there are many cost saving factors to consider.

Apply the following design principles to build cost effective IoT solutions:

Understand total cost of ownership (TCO) – Take into account both direct and indirect costs when you

decide on the architecture for your IoT solution.

Establish strategies for IoT technology ecosystems – Use the best approach for each IoT technology

area that has its own ecosystem.

Define implementation plans for each IoT architecture layer – Identify action items for each layer in

the IoT Well-Architected Framework architecture.

Monitor and optimize costs – Establish ongoing activities for cost optimization after you implement your

IoT solution.

Many IoT solutions fail because of the difficulty of estimating long-term aggregated costs of various cloud

services. It's critical to understand how much you're spending to run and operate all services involved. Without

this understanding, it's common for budgets to run into the red, leaving you with ballooning costs.

When you evaluate infrastructure costs, account for the basics first: storage, compute, and network. Then, you

must account for all services your solution needs to ingest, egress, and prepare data for use in business

decisions. Make sure to estimate costs based on the architecture of the solution when it's running at scale in

production, not your PoC architecture. Architecture and costs evolve rapidly after the PoC.

Furthermore, don't overlook the long-term operational costs that increase in parallel with infrastructure costs.

Examples include employing technicians to operate the solution, vendors to manage non-public clouds, and

customer support teams.

Establish strategies for IoT technology ecosystems

Device management strategy

Edge, cloud, or hybrid implementation strategy

Connectivity and IoT protocol management

Platform for IoT solutions

Your costs will depend greatly on the chattiness of your devices and the size of the messages your devices send.

Chatty devices send many messages to the cloud every minute, while others are relatively quiet, only sending

data every hour or more. Clarity about device chattiness and message size helps reduce the likelihood of

over-provisioning, which leads to unused cloud capacity, or under-provisioning, which leads to elastic scale

challenges. As you plan your IoT solution, consider carefully the size and frequency of the message payloads to

ensure your infrastructure is the correct size for where you are today and ready to scale with you. Be sure to

consider the cost of developing new features as compared to using a Platform as a Service (PaaS) or Application

Platform as a Service (aPaaS) based solution.
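
The effect of chattiness and message size on capacity can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 4 KB metering chunk and the per-unit daily capacity are assumed placeholder values that you should replace with the figures for your service tier from the current pricing documentation.

```python
import math


def billable_messages_per_day(devices, messages_per_hour, message_bytes, chunk_bytes=4096):
    """Estimate billed messages per day when each message is metered in chunks.

    chunk_bytes is an assumed metering size; a message larger than one chunk
    counts as several billed messages.
    """
    chunks_per_message = max(1, math.ceil(message_bytes / chunk_bytes))
    return devices * messages_per_hour * 24 * chunks_per_message


def units_needed(messages_per_day, unit_capacity_per_day):
    """Number of capacity units needed to cover the daily message volume."""
    return max(1, math.ceil(messages_per_day / unit_capacity_per_day))
```

For example, 10,000 devices that each send a 500-byte message every minute produce 14,400,000 billed messages per day, which needs 36 units at an assumed capacity of 400,000 messages per unit per day. The same fleet sending once per hour produces 240,000 messages and fits in a single unit.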

Device management requires a strategic approach because various options are available during the Plan –

Provision – Configure – Monitor – Retire device lifecycle. Understanding the roles for IoT solution development

and operation and IoT device types provides a foundation for establishing a strategy for device cost

management.

Evaluate TCO for deploying edge assets. Edge solutions can be realized with different edge architectures:

multiple endpoints, IoT devices, directly connected to the cloud, or connected through an edge and/or cloud

gateway. Different options for sourcing the edge solution devices can affect total cost and lead time. Ongoing

maintenance and support of the device fleet also affects the overall solution cost. Cost optimization is an

important step in maximizing the return on investment (ROI) of any IoT solution.

Where data should be stored and processed in a given IoT solution affects many factors such as latency, security,

and cost. Analyze each use case and examine where it makes most sense to use edge processing and data

storage and how it affects cost. Storing and processing data on the edge can often save storage, transportation,

and processing costs. However, when you take scale into account, cloud services often become better options

because of cost and development overheads. Assess each case for costs and benefits.

The Azure pricing calculator is a useful tool to compare these options.

There are many factors to take into account for device to IoT gateway connectivity, for example device

connectivity, network, and protocol. An understanding of IoT communication protocols, network types, and

messaging patterns helps you to design and optimize a cost effective architecture.

For device connectivity, it's important to specify the type of network to use. If you select a private LAN or WAN

solution, such as Wi-Fi or LoRaWAN, consider the TCO of the network as part of the overall costs. If you use

carrier networks (such as 4G, 5G, or LPWAN), include the recurring connectivity costs.

Understand the difference between Application Platform as a Service (aPaaS) and Platform as a Service (PaaS)

and choose the right platform for your IoT solution. Depending on your business requirements, development,

and operation capacities, you can choose an aPaaS or PaaS IoT solution platform:

aPaaS – IoT Central is an Azure aPaaS. It enables you to focus on your business and product innovation

instead of maintenance, infrastructure, development, and operations. It simplifies the creation of IoT

solutions.

PaaS – IoT Hub is an Azure PaaS. It acts as a central message hub for bi-directional communications

between your IoT app and the devices it manages. You can tune services to control the overall cost.

Build an architecture for scale by implementing deployment stamps

Deployment stamping is a common design pattern for designing applications for flexible deployment strategies,

predictable scale, and cost. Using this pattern for IoT solutions provides several advantages such as

geo-distributing groups of devices, deploying new features to specific stamps, and observability of cost-per-device. To

learn more, see Scale IoT solutions with deployment stamps.

Define implementation plans for each IoT architecture layer

Monitor and optimize costs

Build a cost model and budget

Continuously monitor and review

Design

Overview of an IoT workload - IoT Well-Architected Framework describes the elements of an IoT architecture. It

enables you to easily review cost options for each layer in a complex IoT architecture. Cost optimization for IoT

Well-Architected Framework provides actionable guides for the IoT architecture layer implementations.

After you review the IoT technology stacks, estimate the initial cost through a cost model:

Pricing Calculator helps estimate the initial and operational costs.

Develop a cost model - Microsoft Azure Well-Architected Framework.

For your solution, factor in the growth in the number of devices deployed and their messaging patterns over

time. Both can have a linear or non-linear growth, which affects your total solution cost over time. You might

start with a limited number of connected devices that grows each year.
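
A cost model can make the difference between linear and non-linear growth concrete. The following is a minimal sketch, assuming a flat yearly cost per device; the function names, parameters, and sample figures are hypothetical and only illustrate the shape of the calculation.

```python
def project_devices(initial, years, linear_increase=0, growth_rate=0.0):
    """Return the device count at the start of each year, including year 0.

    linear_increase adds a fixed number of devices per year;
    growth_rate compounds the fleet by a fraction per year.
    """
    counts = [initial]
    for _ in range(years):
        counts.append(round(counts[-1] * (1 + growth_rate)) + linear_increase)
    return counts


def yearly_cost(device_counts, cost_per_device_per_year):
    """Total cost per year for each projected device count."""
    return [n * cost_per_device_per_year for n in device_counts]
```

Starting from 1,000 devices, adding 500 devices per year gives 2,500 devices after three years, while compounding at 50% per year gives 3,375. Feeding both projections into `yearly_cost` shows how quickly the two budgets diverge.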

Continuously monitor and review costs to identify gaps between planned and actual costs. A regular cost review

meeting is a good way to achieve cost optimization.

An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,

and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also

cross-cutting layers that enable the design, building, and running of IoT solutions.

The following sections address the layer specifics for the cost optimization pillar:

Device and gateway layer

Understand design considerations

This layer is responsible for generating the data in your solution. Cost is a key consideration for the design of

this layer. Because devices and gateways can optimize data before transferring it to the cloud, a device and

gateway strategy affects the costs in your solution.

IoT device development requires internal or outsourced engineering resources with a specific skill set. Include

these resource costs in the overall device costs. Required skills include hardware design, embedded application

development, cloud and local connectivity, security and privacy, and IoT solution architecture. Industry-specific

expertise may also be required.

Choosing off-the-shelf hardware or a custom design affects the cost and time-to-market for an IoT device:

An off-the-shelf device may cost more per unit, but has predictable costs and lead times. Off-the-shelf

devices also remove the need for complex supply chain management.

A custom device can reduce unit costs, but development time and non-recurring engineering costs such

as design, test, certification submissions, and manufacture are incurred. Pre-certified system components

or modules can reduce time-to-market to create a semi-custom device, but these components are also

more expensive than discrete chips. Supply-chain and inventory management must be properly resourced.

Azure Certified Device catalog

LPWAN devices

Azure RTOS

For a device connected to the cloud as part of an IoT solution, you need to optimize upstream data

transmissions to maintain cost boundaries. Factors to consider include minimizing payload sizes, batching

messages, and transmitting during off-peak periods. These optimizations also incur costs to

implement.
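
Batching can be sketched as grouping small readings into payloads that stay under a size limit, so that many readings travel in one message. This is an illustrative helper only: `batch_readings` is a hypothetical name, JSON encoding is an assumption, and the 4,096-byte default is a placeholder rather than a documented limit.

```python
import json


def batch_readings(readings, max_bytes=4096):
    """Group readings into batches whose JSON encoding stays within max_bytes.

    A single reading that is larger than max_bytes still gets its own batch.
    """
    batches, current = [], []
    for reading in readings:
        candidate = current + [reading]
        if current and len(json.dumps(candidate).encode("utf-8")) > max_bytes:
            batches.append(current)   # close the full batch
            current = [reading]       # start a new batch with this reading
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

Each returned batch would be sent as one message. The device pays for this saving with added latency and with the extra logic needed to buffer readings safely.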

A make-or-buy decision takes into account qualitative (for example, Wi-Fi certification) and quantitative (for

example, BOM cost or time to market) variables.
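
The quantitative side of a make-or-buy decision often reduces to a break-even volume. The sketch below assumes a simple model in which a custom device has a one-time non-recurring engineering (NRE) cost plus a lower unit cost, and an off-the-shelf device has only a unit cost; the function name and the figures in the usage note are hypothetical.

```python
import math


def break_even_units(nre_cost, custom_unit_cost, off_the_shelf_unit_cost):
    """Smallest volume at which the custom device is cheaper overall.

    Returns None when the custom unit cost isn't lower, because the
    NRE cost can then never be recovered.
    """
    saving_per_unit = off_the_shelf_unit_cost - custom_unit_cost
    if saving_per_unit <= 0:
        return None
    # Custom wins when units * saving_per_unit > nre_cost.
    return math.floor(nre_cost / saving_per_unit) + 1
```

With an NRE cost of 200,000, a custom unit cost of 20, and an off-the-shelf unit cost of 45, the two options cost the same at 8,000 units and the custom design becomes cheaper from 8,001 units. Qualitative variables such as certification effort and time to market still need separate consideration.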

Azure Certified Device catalog helps to reduce costs and time to build for devices that work well with Azure IoT.

Most of the device development process is reduced to hardware selection. IoT Plug and Play devices can reduce

both device and cloud development costs.

If LPWAN devices, such as LoRaWAN, NB-IoT, or LTE-M, are already connected to another IoT cloud, the Azure

IoT Central Device Bridge can help to bridge to Azure IoT Central without incurring costs to change your existing

devices.

Azure RTOS is an embedded development suite. Azure RTOS includes a small but powerful operating system

that provides reliable, ultra-fast performance for resource-constrained devices. It's easy to use and

market-proven, having been deployed on more than 10 billion devices worldwide. Azure RTOS supports the most

popular 32-bit microcontrollers and embedded development tools, so you can make the most of existing device

builder skills.

Azure RTOS is free for commercial deployment using pre-licensed processors. Azure RTOS comes with Azure IoT

cloud capabilities and features such as device update and security that work with cloud services. These features

help to reduce both device and cloud development costs.

Azure RTOS is certified for safety and security. This helps to reduce the time and cost of building compliant

devices for specific verticals such as medical, automotive, and industrial.

Azure Sphere

Azure Stack solutions

Azure public or private multi-access edge compute (MEC)

Edge management and operation

Azure Sphere is a highly secure, end-to-end IoT solution platform with built-in communication and security

features for internet-connected devices. It comprises a secured, connected, crossover microcontroller unit

(MCU), a custom high-level Linux-based operating system (OS), and a cloud-based security service that provides

continuous, renewable security.

Azure Sphere reduces the effort to build and maintain a secure environment from device to the cloud.

Microsoft provides OS updates and zero-day renewable security for 10 years. X.509 based PKI, user app

updates, error reporting, and device management will be supported beyond 10 years without any extra

subscription cost. Azure Sphere reduces the operational cost of keeping millions of devices up to date with the

latest security.

Azure Stack solutions extend Azure services and capabilities to environments beyond Azure datacenters, such as

to a customer datacenter or edge location.

Azure Stack solutions consist of Azure Stack Edge, Azure Stack HCI, and Azure Stack Hub:

Azure Stack Edge is a Microsoft-managed appliance, ideal for hardware-accelerated machine learning

workloads at edge locations. An Azure Stack Edge appliance deployed at an edge location can serve multiple workloads because

the core of Azure Stack Edge runs on modern technology stacks such as containers. By sharing

computational power with other workloads, the cost of ownership can be reduced. To learn more, see

Azure Stack Edge.

Azure Stack HCI is a purpose-built hyperconverged solution with native integration with Azure. Azure

Stack HCI offers a scalable virtualization solution to host industry solutions. Virtualization brings extra

benefits to overall IoT solutions, such as improved security, scalability, and flexibility, and

can reduce total cost of ownership by sharing the hardware with

other workloads. Azure Stack HCI offers more compute power compared to Azure Stack Edge and is ideal

for industry process transformation. To learn more, see Azure Stack HCI.

Whether you choose Azure Stack Edge or Azure Stack HCI, you should identify the use cases and estimated

compute power. While Azure Stack solutions bring Azure capability to the edge, the hardware sizing constrains

the total compute power. Therefore, factor hardware sizing into your plan to balance cost against performance needs.

Many IoT devices are small, inexpensive, and are designed for one or a few tasks such as collecting sensor or

location data and then offloading that data for further processing. Azure public or private MEC and 5G help to

optimize the costs associated with offloading data from devices.

IoT devices can generate large quantities of data. Some IoT devices also have strong requirements for low power

consumption and low costs. MEC based IoT solutions enable low-latency processing of this data at the edge

instead of on the devices or in the cloud – 1-5 ms instead of the typical 100-150 ms for the cloud. MEC based

IoT solutions remain flexible, and the devices themselves can remain inexpensive, operate with minimal

maintenance, and use smaller, cheaper, and long-lasting batteries. With MEC, data analytics, AI, and optimization

functions remain at the edge, which keeps your IoT solution simple and inexpensive.

The MEC device serves IoT workloads as an edge processing, compute, and 5G communication device. Other

workloads can use the same communication device to establish high-speed connections to the public cloud or

remote sites.

Azure IoT Edge managed devices with gateway capabilities reduce network costs and minimize the number of

messages through local processing and edge scenarios.

Secure device provisioning and authentication

Deploying Azure services such as Azure Stream Analytics and Azure Functions to IoT Edge lets you aggregate

and filter large volumes of data at the edge and send only important data to the cloud IoT gateway. IoT Edge

reduces the costs of exchanging data between the edge and an IoT cloud gateway.

Azure Blob Storage on IoT Edge reduces the need to transfer large quantities of data over the network. Edge

storage is useful for transforming and optimizing large quantities of data before sending it to the cloud.

There are free-of-charge Azure IoT Edge modules for open protocols such as OPC Publisher and Modbus. These

modules help you to connect various devices with minimal development. You can search for and download the

modules in the Azure Marketplace. Depending on your requirements, you need to decide between build and

buy. For example, if upload performance is critical, choosing a proven IoT Edge module from vendors can be

more cost effective than building a custom module.

Every edge solution requires IoT devices to be deployed in the field. The deployment may need networking and

power supply infrastructures that affect the cost. Using pre-existing infrastructures minimizes the installation

costs. However, you may need to guarantee that the installation doesn't affect existing systems. The installation

of your IoT devices requires dedicated internal or external resources that may need to be trained. Logistics, such

as storage, inventory management, and transport should be organized and considered in the solution cost.

Include the cost of any decommissioning activities that take place when the devices reach the end of their

operational lifecycle.

The lambda architecture pattern - hot/warm/cold - is common in IoT solutions. This pattern is often fully

implemented in the cloud, and there are also relevant implementations on the edge such as when you use more

performant edge devices or the Azure IoT Edge runtime. Optimizing this pattern on the edge can influence

overall solution costs as it lets you choose the most cost-effective service for cloud data ingestion and

processing:

Hot path processing includes near real-time processing, such as processing alerts or notifications at the edge. Use

Azure IoT Hub event streams to process the alerts in the cloud.

Warm path processing includes using storage solutions on the edge. For example, use open-source time

series databases or Azure SQL Edge. Azure SQL Edge includes edge stream processing features and time

series optimized storage.

Cold path processing includes batching up events of lower importance at the edge and using a file

transfer option through the Azure Blob Storage module. This approach uses a lower cost data transfer

mechanism as compared to streaming everything directly through Azure IoT Hub or IoT Central. After

your cold data arrives in Azure blob storage, there are many options to process the data in the cloud.

Azure IoT Edge has built-in capabilities to use when you have high message volumes. These built-in features to

batch messages can help to reduce the costs of using Azure IoT Hub or IoT Central. This feature can help reduce

both the number of daily messages in IoT Hub, and the number of device-to-cloud operations per second,

enabling you to choose a lower SKU or scale in IoT Hub. To learn more, see Stretching the IoT Edge performance

limits.

Azure IoT Hub with the Device Provisioning Service (DPS) and Azure IoT Central both support device authentication

with symmetric keys, trusted platform module (TPM) attestation, or X.509 certificates. There's a cost factor

associated with each of these options:

X.509 certificates are considered the most secure option for authenticating to Azure IoT but certificate

management can be costly. The lack of upfront certificate lifecycle management planning can make the

use of certificates even more costly in the long run. Typically, there's a requirement to work with third-party

vendors who offer CA and certificate management solutions. This option requires using a public

key infrastructure (PKI). Options include a self-managed PKI, a third-party PKI, or the Azure Sphere security service.

TIP

The Azure Sphere security service is only available when you use Azure Sphere devices.

Support and maintenance

Ingestion layer

Platform for building IoT solutions

TPM when used with X.509 certificates stored in the TPM offers an added layer of security. DPS also

supports authentication through TPM endorsement keys. The main effect on costs is from hardware,

potential board redesign, and its complexity.

Symmetric key authentication is the simplest and lowest cost option. However, you must evaluate the

effect on security. You need to protect keys on the device and in the cloud. Options to securely store the

key on the device often move you towards a more secure option anyway.

Review costs associated with each of these options, and evaluate how they affect the overall security of your

solution, balancing potentially higher hardware or services costs with increased security. Integration with your

manufacturing process may also influence overall costs.

To learn more, see Security practices for Azure IoT device manufacturers.

After you deploy your IoT devices, you need to support and maintain them for the lifetime of the solution. Tasks

include hardware repairs, software upgrades, OS maintenance, and security patching. Consider ongoing

licensing costs for commercial software and proprietary drivers and protocols. If you can't do remote

maintenance, then you need to budget for onsite repairs and updates. For hardware repairs or replacements,

you should keep suitable spares in stock.

For solutions that use cellular or paid-for connectivity media, select a suitable data plan based on the number of

devices, the size and frequency of data transmissions, and where the devices are deployed.

If a service level agreement (SLA) is in place, you need a cost-effective combination of hardware, infrastructure,

and trained staff to be able to meet it.

The long-term support and maintenance of field devices can escalate to become the largest cost burden for a

deployed solution. Careful consideration of the TCO of a system is crucial to realizing return on investments.

An IoT gateway is a bridge between devices and cloud services. As a front-end service to the cloud platform, a

gateway can aggregate all data with protocol translation and provide bi-directional communication with devices.

The following sections describe technology and capability considerations for a cost effective implementation of

this layer:

Azure IoT offers two types of platform for building IoT solutions: IoT Hub is a platform-as-a-service (PaaS) and

IoT Central is an application platform as a service (aPaaS). To reduce your costs and effort, you should

understand the difference between PaaS and aPaaS. Choose your platform type by considering factors such as

requirements, budget, timelines, operation resources, and outsourcing. An aPaaS solution typically has more

predictable operational costs in the device and gateway layer.

For total cost reduction, consider extended use cases as part of platform selection. Azure IoT Central with IoT

Plug and Play is one of the scenarios to review for device onboarding.

TIP

Reduce chatty interactions to avoid overhead:

Devices: Avoid designing device-to-cloud interactions that use many small messages. For example, group

multiple device or module twin updates into a single update. Twin updates have their own throttling. Use

message batching to send multiple telemetry messages to the cloud. Be aware of the message size used for

the daily quota (4 KB for non-free IoT Hub tiers). Sending smaller messages leaves some capacity unused, and

sending larger messages counts as multiple messages against the quota. For device-to-device or module-to-module

communication at the edge, avoid designing interactions that use many small messages.

Cloud: Avoid designing cloud-to-device interactions that use many small messages. Use a single direct

method when direct feedback is required. Use a single device or module twin status update to exchange

configuration and status information asynchronously.

You can monitor chatty interactions by using the Defender for IoT micro agent and Azure IoT Hub. You can create custom

alerts on IoT Hub if any device-to-cloud or cloud-to-device interaction exceeds a certain threshold.

IoT Hub SKU selection for message ingestion: Most IoT solutions require bi-directional communication between

devices and the cloud to be fully functional and secure. The basic IoT Hub SKU provides core functionality, but

excludes bi-directional control. For some early implementations of a solution, you may be able to reduce costs

by using the basic SKU. As your solution progresses, you can switch to a standard SKU, so that you can optimize

a secure communication channel for lower cloud-to-device messaging costs. To learn more, see Compare IoT

Hub and Event Hubs.

Understand the Azure IoT Hub limits and quotas to optimize costs for device-to-cloud and cloud-to-device

messaging:

Messages quota: Azure IoT Hub SKUs each have a set of quotas and throttling limits that apply to operations. For

example, the Standard S1 SKU has a daily quota of 400,000 messages. This quota is reached when devices send

400,000 messages each day. The messages metric increases as a combination of several factors:

One device-to-cloud (D2C) message up to 4 KB.

D2C messages that exceed 4 KB are charged in chunks of 4 KB.

For messages smaller than 4 KB, use the Azure IoT SDK SendEventBatchAsync method to optimize batching

on the device side. For example, bundling up to four 1-KB messages at the edge increases the daily meter

by just one message. Batching is only applicable for AMQP or HTTPS.

Event processing and analytics layer

Select data process path based on analytics requirements

Most operations such as cloud-to-device messages or device twin operations also charge messages in 4-KB

chunks. All these operations also add to the daily throughput, and maximum quota of messages.

Review the Azure IoT Hub pricing information documentation, which includes detailed pricing examples. To learn

more, see IoT Hub quotas and throttling.
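The 4-KB metering rule described above can be sketched as a small calculation. This is an illustrative estimate only: it assumes 4 KB means 4,096 bytes, ignores any per-message overhead such as properties, and uses hypothetical function names. Check the IoT Hub pricing documentation for the exact metering rules.

```python
import math

# Assumption: one quota "message" covers up to 4,096 bytes of payload.
CHUNK_BYTES = 4 * 1024


def billed_messages(payload_bytes: int) -> int:
    """Quota messages consumed by one send, metered in 4-KB chunks."""
    return max(1, math.ceil(payload_bytes / CHUNK_BYTES))


def daily_billed(message_bytes: int, messages_per_day: int, batch_size: int = 1) -> int:
    """Estimated daily quota use when batch_size messages are bundled per send.

    A final partial batch is counted as a full one, so this slightly overestimates.
    """
    sends = math.ceil(messages_per_day / batch_size)
    return sends * billed_messages(message_bytes * batch_size)
```

Under these assumptions, 400,000 messages of 1 KB sent individually consume 400,000 quota messages, while the same telemetry bundled four at a time consumes 100,000.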

Different throttling limits apply to different operations in Azure IoT Hub:

In addition to message daily quotas, operations on the service have throttling limits. A key part of cost

optimization with Azure IoT Hub is to optimize both message quotas and operations throttling limits.

Study the differences between the different limits in the form of operations per second, or bytes per

second. To learn more, see IoT Hub quotas and throttling.

Device-to-cloud operations have an operations per second throttle that depends on the SKU. In addition

to the message size, which is metered in 4-KB chunks, you should also consider the number of

operations. This is an example where the batching on the edge is useful as it lets you send more

messages in a single operation.

One single message of 2 KB, a batched message of 10 KB, or a batched message of 256 KB only counts as

one single operation, allowing you to send more data to the endpoint without reaching throttling limits.

Message size:

If payload size is critical to cost management, reducing the fixed length overhead is important when you have a

long device life-cycle or a large deployment. Options to reduce this overhead include:

Use a shorter device ID, module ID, twin name, and message topic because an MQTT packet includes this

information in every message. The MQTT topic looks like:

devices/{device_id}/modules/{module_id}/messages/events/

Abbreviating the fixed length overhead and the message reduces bandwidth costs.

Compressing the payload, for example by using Gzip, can help reduce bandwidth costs.
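As a rough illustration of the two options above, the following sketch gzips a JSON telemetry batch and measures the fixed topic string for a given device and module ID. The function names and the sample readings are hypothetical, and the receiving side is assumed to decompress the payload itself.

```python
import gzip
import json


def encode_telemetry(readings: list[dict]) -> bytes:
    """Serialize readings as compact JSON, then gzip the result."""
    raw = json.dumps(readings, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decode_telemetry(blob: bytes) -> list[dict]:
    """Reverse of encode_telemetry, as the consuming service would run it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def topic_bytes(device_id: str, module_id: str) -> int:
    """Size in bytes of the topic string that accompanies every module message."""
    topic = f"devices/{device_id}/modules/{module_id}/messages/events/"
    return len(topic.encode("utf-8"))
```

Compression pays off on larger, repetitive payloads such as batched readings. On very small payloads the gzip header can make the result larger than the input, so measure with representative data before adopting it.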

The purpose of the event processing and analytics layer is to enable data-driven decisions. The event timing and

purpose of analytics are the key factors to consider when you choose the service. The right service choice

increases the efficiency of the architecture and reduces cost of processing data and events.

Based on your requirements, implement hot, warm, or cold path processing for IoT data analytics. The Azure IoT

reference architecture helps you understand the difference between those three

analytics paths and reviews the available analytics services on each path. The Microsoft Well-Architected

Framework includes cost considerations for these services. To get started, determine which types of data go

through the hot, warm, or cold path:

Hot path: Data is kept in-memory and analyzed in near-real-time, typically using a stream processing

engine. The output may trigger an alert or be written to a structured format that can be queried

immediately using analytical tools.

Warm path: Recent data (such as from the last day, week, or month) is kept in a storage service that can

be queried immediately.

Cold path: Historical data is kept in lower-cost storage to be queried in large batches.

Storage layer

Define an end-to-end IoT scenario for choosing storage types

Storing data affects IoT solution costs because providing data to end-users is one of the goals in an IoT solution.

There are opportunities for cost optimization in this layer. It's important to understand storage types, capacity,

and pricing to create a strategy for optimizing storage costs.

The choice of a repository for telemetry depends on the use case for your IoT data. If the purpose is just to

monitor IoT data and volumes are low, you should consider using a database. However, if your scenario

includes data analysis, you need to save telemetry data to storage.

For time series optimized, append-only storage and querying, consider purpose-designed solutions such as

Azure Data Explorer.

Storage and database aren't mutually exclusive. Both can work together, especially if clear hot, warm, and cold

analytics paths are defined. Azure Data Explorer or databases are commonly used for hot and warm path

scenarios.

From the Azure Storage perspective, it's also important to consider the data lifecycle throughout the solution,

for example, access frequency, retention requirements, and backup.

Device management and modeling layer

Azure IoT Hub Device Provisioning Service (DPS)

Azure Storage enables you to define the lifecycle of the data and to automate the process of moving data
from the hot tier to other tiers, which reduces storage costs in the long run. To learn more, see Configure a
lifecycle management policy.
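A lifecycle management policy is a JSON document of rules. The sketch below builds one rule that moves block blobs under a prefix to cooler tiers as they age and eventually deletes them. The rule name, prefix, and day thresholds are hypothetical, and the field names should be checked against the lifecycle management article before use.

```python
import json


def lifecycle_policy(prefix: str, cool_after_days: int,
                     archive_after_days: int, delete_after_days: int) -> dict:
    """Build a one-rule lifecycle policy for block blobs under prefix."""
    if not cool_after_days < archive_after_days < delete_after_days:
        raise ValueError("expected cool < archive < delete thresholds")
    return {
        "rules": [{
            "enabled": True,
            "name": "age-out-telemetry",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                "actions": {"baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                    "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                    "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                }},
            },
        }]
    }


if __name__ == "__main__":
    # Example thresholds only; choose values from your retention requirements.
    print(json.dumps(lifecycle_policy("telemetry/cold/", 30, 90, 365), indent=2))
```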
For database capabilities, it's common to choose between SQL and NoSQL solutions. SQL databases are best
suited for fixed schema telemetry with simple data transformation or data aggregation requirements. To learn
more, see Types of databases on Azure.
Azure SQL Database and TimescaleDB for PostgreSQL are common choices for SQL databases. Adopt best
practices to optimize costs:
Plan and manage Azure SQL Database
Azure SQL Database cost optimization
Azure SQL Database for PostgreSQL Extension
Performance tuning for Azure SQL Databases
If the schema of the message isn't fixed and is best represented as an object or document, NoSQL is a better
option. Azure Cosmos DB provides multiple APIs such as SQL or MongoDB.
For any database, the choice of partition and index strategies is important for performance optimization and
reducing unnecessary costs. To learn more, see:
Azure Cosmos DB partitioning and scaling.
Azure Cosmos DB cost optimization
Azure Synapse Analytics is a modern data warehouse offering from Azure. Depending on the use case, you can
pause compute when no job is running to reduce operational costs. Azure Synapse Analytics scales by Data
Warehouse Units (DWU) and you should choose the right capacity to handle your solution requirements.
Managing devices is a task that orchestrates complex processes such as supply chain management, device
inventory, deployment, installation, operational readiness, device update, bi-directional communication with
devices, and provisioning. There are many benefits of using the device management and modeling standards of
Azure IoT. The Device Provisioning Service enables low/no touch device provisioning, so you don't have to train
and send people on site. Using DPS reduces the cost for truck rolls and the time spent on training and
configuration. DPS also reduces the risk of mistakes due to manual provisioning.
DPS is a helper service for IoT Hub that enables low-cost, zero-touch, just-in-time provisioning to the right IoT
hub without requiring human intervention, which reduces errors and cost. DPS enables the provisioning of millions of
devices in a secure and scalable manner.
General device management stages are common in most enterprise IoT projects. In Azure IoT, there are five
stages within the device lifecycle. DPS helps device lifecycle management with IoT Hub through enrollment
allocation policies, zero-touch provisioning, initial configuration setting, reprovisioning, and de-provisioning.

  
IoT Plug and Play

Modeling assets and device state in the cloud

Transport layer

Choose a proper messaging protocol and connectivity
  
To learn more, see:
DPS pricing
Provisioning guide
Modeling devices reduces management costs and messaging traffic volumes. Use IoT Plug and Play and Azure
IoT Central for easy deployment and management. To learn more, see Modeling devices with DTDL.
Compare cost differences between device topology and entity stores such as Azure Cosmos DB, Azure
Digital Twins, and Azure SQL Database. Each of these services has a different cost structure and delivers
different capabilities to your IoT solution. Depending on the required usage of the service, choose the most
cost-efficient service, in line with the lower development and operations effort from adopting PaaS services.
Azure Digital Twins can implement a graph-based model of the IoT environment for asset management,
device state, and telemetry data. You can use it as a tool to model entire environments, with real time IoT data
streaming, and merging business data from non-IoT sources. You can build custom ontologies, or use standards
based ontologies such as RealEstateCore, CIM, or NGSI-LD to simplify data exchange with third parties. Azure
Digital Twins has a pay-per-use pricing model without a fixed fee.
Azure Cosmos DB is a globally distributed, multi-model database. Cost is affected by storage and throughput,
with regional or globally distributed and replicated data options.
Azure SQL Database can be an efficient solution for device and asset modeling. SQL Database has several
pricing models to help you optimize costs.
The transport layer conveys data between other layers. As data travels between layers and services, the choice
of protocol affects costs. There are many options to transfer and route data at a low cost.
You can choose the protocol that your IoT devices use to send telemetry. Choosing the right protocol for your
scenario enables devices to reduce transmission sizes and costs. Use cases such as field gateways, industry open
protocol, and IoT network selection also affect costs in the transport layer.

Optimize network traffic

Interaction and reporting layer

Utilize data visualization platform for visualizing time series data

Integration layer

Integrate business systems and Azure services with Azure Digital Twins

Managing connectivity and reliable messaging with the SDK helps to reduce costs of handling unexpected

behavior between device and Azure IoT services.

Device clients regularly send a keep-alive message to IoT Hub. For flexibility, some Azure IoT Device SDKs

provide the option to set a timespan for the message if you're using the AMQP or MQTT protocols. You don't

need to add a keep-alive property in the telemetry if there's no specific requirement for it. According to

Charges per operation, keep-alive messages aren't charged.

Reconnect or keep-alive. Battery powered IoT devices must choose between keeping connections alive or

reconnecting whenever they wake up. This choice affects power consumption and network costs.

The tradeoff relates to the business scenario and costs:

Reconnect: Consumes about 6 KB of packets for the TLS connection, device authentication, and retrieving a

device twin, but saves battery capacity if the device wakes up once or twice per day.

Keep-alive: Consumes hundreds of bytes, so keeping the connection alive saves network costs if the

device wakes up every few hours or less.

Bundle messages together to decrease TLS overhead.
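The network side of the reconnect versus keep-alive tradeoff above can be estimated with simple arithmetic. In this sketch the 6 KB reconnect cost comes from the text, while the keep-alive ping size (200 bytes) and ping interval (20 minutes) are assumed example values. It compares network bytes only and says nothing about battery use, so measure your own device before deciding.

```python
RECONNECT_BYTES = 6 * 1024  # approximate cost of one reconnect, from the text


def reconnect_bytes_per_day(wakeups_per_day: int) -> int:
    """Daily traffic if the device reconnects on every wake-up."""
    return wakeups_per_day * RECONNECT_BYTES


def keepalive_bytes_per_day(ping_bytes: int, ping_interval_s: int) -> int:
    """Daily traffic if the device holds the connection open and pings periodically."""
    pings_per_day = (24 * 60 * 60) // ping_interval_s
    return pings_per_day * ping_bytes


def cheaper_on_network(wakeups_per_day: int, ping_bytes: int = 200,
                       ping_interval_s: int = 1200) -> str:
    """Return which strategy moves fewer bytes per day under the assumed numbers."""
    reconnect = reconnect_bytes_per_day(wakeups_per_day)
    keepalive = keepalive_bytes_per_day(ping_bytes, ping_interval_s)
    return "reconnect" if reconnect < keepalive else "keep-alive"
```

With these assumed values, a device that wakes twice a day uses less traffic by reconnecting, and a device that wakes every three hours uses less by keeping the connection alive.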

Device provisioning and reprovisioning for optimizing network traffic

Azure IoT Hub Device Provisioning Service (DPS) reduces the cost of device lifecycle management from zero-

touch provisioning to retirement.

Connecting to DPS on every connection incurs network costs for TLS and authentication. To reduce network traffic, the

device should cache the IoT Hub information when it's provisioned by DPS, and then connect to IoT Hub directly

until it needs to reprovision. To learn more, see Send a provisioning request from the device.
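The caching approach described above can be sketched as follows. The cache file format, the `resolve_hub` name, and the `provision` callable are all hypothetical. In a real device, `provision` would wrap the DPS registration call from your device SDK and return the assigned IoT hub hostname.

```python
import json
from pathlib import Path
from typing import Callable


def resolve_hub(cache_file: Path, provision: Callable[[], str], force: bool = False) -> str:
    """Return the assigned IoT hub hostname, contacting DPS only when needed.

    Pass force=True when the device must reprovision, for example after the
    cached hub rejects the connection.
    """
    if not force and cache_file.exists():
        try:
            return json.loads(cache_file.read_text())["assigned_hub"]
        except (ValueError, KeyError, TypeError):
            pass  # unreadable cache: fall through and provision again
    hub = provision()
    cache_file.write_text(json.dumps({"assigned_hub": hub}))
    return hub
```

A device would call `resolve_hub` at startup and connect to the returned hub directly, so the TLS and authentication cost of DPS is paid only on first boot or when reprovisioning is required.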

As IoT handles time series data, there are many interactions from a large number of devices. Reporting and

visualizing this IoT data realizes the value of the data. Building intuitive and simplified user experiences and well-

designed data interactions are costly.

Grafana is an open-source data visualization tool that provides optimized dashboards for time series data. Its

communities provide various examples that you can reuse and customize in your environment. You can

implement metrics and dashboards from time series data with less effort. Azure provides a Grafana plug-in for

Azure Monitor.

Reporting and dashboard tools such as Power BI enable a quick start from unstructured IoT data. Power BI

provides an intuitive user interface and capabilities. You can easily develop dashboards and reports using time

series data and get the benefits of security and deployment at less cost.

Integration with other systems and services is often complex. Maximizing efficiency is the way to optimize costs

within the integration layer. There are many services that you should consider to optimize integration costs. Review

the guides that optimize Azure IoT for integration.

Azure Digital Twins provides capabilities to integrate various systems and services with IoT data. As Azure

Digital Twins transforms all data into its own digital entity, it's important to understand its service limits and

tuning points for cost reductions.

DevOps layer

Deployment

Cloud Governance

Development Environment

Review Azure Digital Twins service limits when designing your architecture. There are many factors to consider

for implementation. Understanding functional limitations helps effectively integrate with business systems.

Reduce the complexity of queries and the number of query results to optimize Query Unit (QU) costs. When you

use the query API, Azure Digital Twins charges per QU. You can trace the number of QUs the query

consumed in the response header. To learn more, see Find the QU consumption in Azure Digital Twins.

A cloud platform transforms capital expenditure (CAPEX) into operational expenditure (OPEX). While this provides

flexibility and agility, you should still have a well-defined deployment and operational model to take full

advantage of the cloud platform. A well-planned deployment creates repeatable assets to shorten time to

market. A cloud platform provides agility for developers to deploy resources in seconds, but there's a risk that

resources are provisioned unintentionally, or that you over-provision cloud resources. A proper cloud

governance model can minimize such risks, and help to avoid incurring unwanted costs.

It's common to deploy workloads in multiple environments, such as development and production. Through

infrastructure-as-code, you can accelerate the deployment by reusing the code and reduce time to market.

Infrastructure-as-code can help avoid unintentional deployments such as deploying the incorrect SKU. First-

party services such as Azure Resource Manager and Azure Bicep, or third-party services such as Terraform and

Pulumi are common infrastructure-as-code options.

DevOps practices for deploying applications can be applied in IoT solutions, using the same concepts of build

and release pipelines to different environments. To learn more, see Use a DevOps pipeline to deploy a

predictive maintenance solution.

Coupled with Azure Policy, you can be sure that deployments follow Well-Architected Framework guidance.

Cloud governance isn't just about compliance and security; it's also essential for preventing unnecessary costs.

Resource tagging is helpful to provide labeling on deployed resources. Labeling can provide insights on ongoing

costs based on the label, together with Azure Cost Management. To learn more, see Common cost analysis uses.

Azure Policy comes with built-in policies to label resources automatically, or flag resources without tagging. To

learn more, see Assign policy definitions for tag compliance.

Another use case for Azure Policy is to prevent provisioning of certain SKUs, which helps to prevent over-

provisioning in the development or production environments.

Developers can take advantage of the flexibility that Azure provides to optimize development cost. Developers

can use the Azure IoT Hub free tier, limited to one instance per subscription. This tier offers standard SKU

capabilities but is limited to 8,000 messages a day. This SKU is sufficient for early-stage development with a

limited number of devices and messages.

If you prefer agility over control, Azure IoT Central is a good starting point for IoT solutions development. Each

tier comes with two free devices, and some messages are included, depending on the SKU. Azure IoT Central

bills by devices. It's a cost effective and simple option to calculate total cost of ownership for end-to-end IoT

solutions.

For compute environments, serverless architecture is often adopted for cloud native IoT solutions. Some

popular Azure services for IoT workload include Azure Functions and Azure Stream Analytics. The billing

mechanism depends on respective Azure services, and some allow developers to pause services without

incurring extra compute cost. An example is Azure Stream Analytics for real time processing needs that
developers can pause. Some Azure services are billed by usage. For example, Azure Functions is billed based on
the number of transactions.
Developers can take advantage of these cloud native capabilities to optimize both development cost and
operational cost, with the right understanding of the billing mechanism for each Azure service.

Testing IoT solutions for cost estimation

Monitor

Azure Advisor

Azure cost management APIs

Azure Monitor
An Integrated Development Environment accelerates development and deployment. Some open-source IDEs
such as Visual Studio Code provide Azure IoT extensions that enable developers to develop and deploy code to
Azure IoT services with no cost.
Azure IoT provides various free code samples on GitHub with guidance. These samples help developers extend
device, IoT Edge, IoT gateway, and Azure Digital Twins applications. GitHub also has features to implement
seamless CI/CD environments, with less cost and effort. If the project is intended to be open-sourced, GitHub
Actions are free. To learn more, see GitHub plans and features.
When you build end-to-end IoT solutions, you need to estimate overall costs including various cloud services,
through load testing. As IoT solutions use large amounts of data, a simulator can help with load testing. Some
simulation code samples such as Azure IoT Device Telemetry Simulator help you test and estimate costs with
various parameters at scale.
There are many tools included in your Azure subscription to get more value out of your IoT services and
implement financial governance in your organization. Such tools enable organizations to track resource usage
and manage costs across all your clouds with a single, unified view, and access rich operational and financial
insights to make informed decisions:
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. It analyzes your resource configuration and usage telemetry and then recommends solutions that
can help you improve the cost effectiveness, performance, reliability, and security of your Azure resources.
Azure Advisor helps you optimize and reduce your overall Azure spend by identifying idle and underutilized
resources. You can get cost recommendations from the cost tab on the Advisor dashboard.
Although Azure Advisor doesn't have specific out-of-the-box recommendations for IoT services, it can be useful to
gather recommendations for Azure infrastructure, storage, and analytics services. To learn more, see Reduce
service costs by using Azure Advisor.
These APIs provide the ability to explore cost and usage data through multidimensional analysis. They enable
you to create customized filters and expressions that let you answer consumption-related questions about your
Azure resources.
Azure Cost Management has an alert feature. Alerts are generated when consumption reaches a threshold.
Azure Cost Management APIs are available for:
IoT Central
IoT Hub
Device Provisioning Service
Azure Monitor and Log Analytics workspaces are commonly used for telemetry logging. Log Analytics includes 5
GB of storage, and the first 30 days of retention are free. Depending on business needs, you may need a longer
retention period.
Review and decide the right retention period to avoid unintentional costs. Azure Log Analytics provides a
workspace environment to query logs interactively. You can export logs periodically to external locations such as
Azure Data Explorer, or archive logs in a storage account for a cheaper storage option.
To learn more, see Monitor usage and estimated costs in Azure Monitor.

Auto-scale for IoT Hub unit adjustment

Each IoT Hub unit has a threshold for the number of messages. Adjusting the number of IoT Hub units
dynamically helps to optimize the cost when the message volume fluctuates. You can implement an auto-scale
service that automatically monitors and scales your IoT Hub service. IoT Hub provides a customizable Azure
Function sample to implement auto-scale capability. You can use your own custom logic to optimize the IoT Hub
tier and number of units.

Next steps

Operational excellence in your IoT workload

Resources
Overview of an IoT workload: Includes guidance and recommendations that apply to each of the five pillars
in an IoT workload.
Cost optimization design principles - Microsoft Azure Well-Architected Framework: Understand cost
optimization principles. These principles are a series of important considerations that can help achieve
business objectives and cost justification.
Checklist - Design for cost - Microsoft Azure Well-Architected Framework: View checklists for the cost model
and architecture to use when you design a cost-effective workload in Azure
Capture cost requirements for an Azure - Microsoft Azure Well-Architected Framework: Learn to enumerate
cost requirements and considerations, and how to align costs with business goals.
Checklist - Optimize cost - Microsoft Azure Well-Architected Framework: Use these checklist considerations
to help monitor and optimize workloads by using the right resources and sizes.
Develop a cost model - Microsoft Azure Well-Architected Framework: Do cost modeling to map logical
groups of cloud resources to an organization's hierarchy, and then estimate costs for those groups.

Operational excellence in your IoT workload

12/16/2022 • 22 minutes to read

Given the complexity of non-functional or operational needs such as security, reliability, performance, and cost of IoT devices, perhaps the single most important factor in scaling an IoT solution and driving sustainable business value is an organization's ability to operate its IoT devices well. Operational excellence in an IoT workload is about giving an organization full visibility and control over the hardware and software components of an IoT solution, ensuring the best experience for users. It includes making the IoT solution design, development, provisioning, monitoring, support, and maintenance practices more agile and valuable to the business without increasing operational risk. The principles of operational excellence in an IoT workload are a set of considerations that help achieve superior operational practices. These principles are described in a following section.

As you consider operational excellence in an IoT workload, the shared responsibility model in cloud and hybrid

scenarios also applies. This model is best described in the context of security considerations. An organization's

IoT operation comprises three different stacks of technologies: cloud, communication network, and IoT devices.

Given the scale and diversity of devices in IoT solutions and the use of different types of communication

networks at geographically distributed locations where devices are installed, there's a significant shift in IoT

solutions in the cloud shared responsibility model. It leaves operational ownership of key elements of an IoT

solution with the organization. While Microsoft has made it easy in many ways for organizations to operate

these elements on their own or using third parties, the organization owns the operational responsibility for

these elements.

The IoT Well-Architected Framework operational excellence guide focuses on the operational excellence aspects

of the IoT devices and Azure services that uniquely address the core requirements of an IoT solution. The key

factor in IoT operational excellence is the ability of an organization to plan, provision, configure, monitor, and

retire IoT devices. The core Azure IoT services discussed in this guide provide an array of capabilities for an

organization to accomplish these tasks effectively and efficiently and enable an organization to have excellent

IoT device operational capabilities.

The communication network technology stack of an organization's IoT operation is typically handled by the

organization's network operations team partnering with the organization's telecommunication operator. It's

recommended that the organization coordinate with their telecommunication operator to set up and operate the

wired and wireless communication network components of their IoT solutions and operations.

An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,

and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also

cross-cutting layers that enable the design, building, and running of IoT solutions.

Prerequisites

To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:

Azure Well-Architected Review

After you complete the assessment, this guide helps you address the key operational excellence recommendations identified for your solution.

Principles

The following principles apply to the operational excellence pillar in an IoT workload:

Solution operation and scaling: Follow the best practices around this principle to ensure that the IoT solution can successfully manage the automated provisioning of devices, can be integrated with other backend systems, is designed for use by different personas including solution developers, solution administrators, and operators, and adapts and scales efficiently to changes in demand such as new IoT devices being deployed or higher ingestion throughput.

Device management: Any successful enterprise IoT solution requires a strategy to establish and update the configuration of a device or fleet of devices. A device's configuration includes device properties, connection settings, relationships, and firmware. IoT operators require simple and reliable tools that enable them to update the configuration of a device or fleet of devices at any point during the device's lifetime.

Monitoring and alerting: Use IoT solution logging, monitoring, and alerting systems to determine whether the solution is functioning as expected and to help troubleshoot what is wrong throughout the lifecycle of the solution.

Automation and DevOps: An IoT device is fundamentally a small computer with specialized hardware and software. IoT devices are often constrained in hardware, for example having limited memory or compute capacity. Automation and DevOps are essential to ensure that the OS and software for IoT devices and gateways are properly uploaded and deployed to minimize operational downtime. Automation and DevOps are also essential for monitoring and managing the lifecycle of IoT devices.

The following sections address the layer specifics of these principles for the operational excellence pillar.

Solution operation and scaling

Design

Some key design decisions to consider include:

Select IoT hardware that meets the business and technical requirements of the solution and define appropriate testing procedures to ensure the operational reliability of your IoT devices and gateways.

Implement an automated way to identify device capabilities at scale as they connect to your IoT solution, so that you can easily integrate them with your backend services.

Define a remote device provisioning strategy to enable zero-touch, just-in-time provisioning of IoT devices in the field without requiring human intervention.

Determine the roles that are responsible for developing, managing, and operating the IoT solution at scale. Determine the users in the organization to be assigned to those roles.

Implement a centralized device management solution to administer, monitor, and operate the lifecycle of IoT devices and to manage the overall configuration of the IoT solution. Consider an integrated UI to help operations teams manage the device fleet.

Configure and test reliable integration with other Azure and third-party services that support the backend and frontend services of the IoT application. A typical IoT solution is composed of multiple components such as ingestion, routing, data storage, and data processing. It's important to verify that all these components in the data flow are functional.

Configure the ingestion and other backend layers of the IoT solution in the cloud to scale so that they can handle expected and unexpected capacity needs as demand for the service grows.

Implement

Select and test IoT hardware

The Azure Device Catalog site lets IoT solution developers find hardware from a list of certified partners and

select the IoT devices that meet the specific requirements of the IoT solution being implemented. For example,

during product selection or development, you may need to identify devices that comply with certain regional

certification requirements and regulations such as CE, FCC, UL, PCI, and FDA.

For a greenfield project, there's more flexibility regarding the types of devices that can be used, the required firmware and connectivity features, and the technical specifications for the IoT devices. For brownfield projects, there are typically more restrictions when you select hardware because hardware has already been deployed in the field, but you might still need to look for other types of hardware such as protocol or identity translation devices or connectivity gateways, for example a Bluetooth to MQTT gateway, depending on the specific scenario. In either case, the Azure Device Catalog has search and filter capabilities that you can use to find IoT hardware that's been certified by the IoT device certification program and is compatible with Azure IoT services.

Provision and manage the lifecycle of IoT devices

Determine the list of roles and users for the IoT solution

Integration with backend systems

An important feature to look for when selecting hardware is compatibility with IoT Plug and Play and the Digital Twins Definition Language (DTDL). This compatibility ensures the device integrates seamlessly with services such as Azure IoT Central and Azure Digital Twins. For Azure IoT Edge scenarios, it's also important to find devices in the catalog that have the IoT Edge Managed certification, which guarantees that the device can run the Azure IoT Edge runtime and enables the deployment and management of IoT Edge modules to support local processing and analytical workloads.

When you select devices or components for your solution, take care to ensure a timely and secure supply of

equipment. Consider dual or multi supply sources and a trusted vendor chain. The availability of spares to cover

the solution lifetime and maintenance or support contracts should also be included, as this step is sometimes

forgotten at the start of a project and can be expensive to introduce later.

For remote provisioning of IoT devices, Azure Device Provisioning Service (DPS) enables connecting and

configuring remote devices to an IoT Hub. DPS enables zero-touch provisioning without hardcoding information

at the factory and it enables load-balancing of devices across multiple IoT Hubs. Although DPS supports

symmetric key attestation, in a production environment you should use either the X.509 certificate or TPM

attestation mechanisms. If you use X.509 certificates, the root certificate (or an intermediate certificate that has been signed by the root certificate) should be deployed to DPS so that devices in the field can properly authenticate to the service and be assigned to their corresponding IoT hub.

Part of an IoT solution lifecycle includes reprovisioning devices in the field and/or moving them between IoT

hubs. DPS enables the configuration of reprovisioning policies that determine the expected behavior when an

IoT device submits a new provisioning request. Devices should be programmed to send a provisioning request on reboot and should implement a method to manually trigger provisioning on demand. This ensures that every time the device boots, it contacts DPS before being redirected to the appropriate IoT hub.
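
The boot-time and on-demand provisioning behavior described above can be sketched as a small device-side agent. This is an illustrative outline under stated assumptions, not the Azure IoT device SDK: the `Assignment` shape, the `DeviceAgent` class, and the injected `provision` and `connect` callables are hypothetical stand-ins for whatever SDK calls the device actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Assignment:
    """Result of a provisioning request (hypothetical shape)."""
    assigned_hub: str
    device_id: str


class DeviceAgent:
    """Always provisions before connecting, on boot or on demand."""

    def __init__(self, provision: Callable[[], Assignment],
                 connect: Callable[[Assignment], None]):
        self._provision = provision  # hypothetical: sends a request to the provisioning service
        self._connect = connect      # hypothetical: opens the connection to the assigned hub
        self.assignment: Optional[Assignment] = None

    def on_boot(self) -> None:
        # Contact the provisioning service on every boot instead of
        # reusing a cached hub assignment.
        self.reprovision()

    def reprovision(self) -> None:
        # Manual trigger: could be wired to a button, a local command,
        # or a remote method, depending on the device.
        self.assignment = self._provision()
        self._connect(self.assignment)
```

Keeping the provisioning call injectable makes the flow easy to test without hardware, and it lets the same agent follow the device when a reprovisioning policy moves it to a different hub.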

A key decision to make early in the IoT solution design phase is to determine the list of roles that implement and

manage the solution along with the corresponding responsibilities. Ideally, the solution should trust a

centralized identity provider, such as Azure Active Directory, and only let the appropriate users in those roles

perform any management and/or operation activities such as creating and provisioning new devices, sending

commands to hardware in the field, deploying updates, and modifying user permissions. If the solution uses IoT

Central, there are built-in roles that you can assign to existing users in Azure Active Directory eliminating the

need to develop a customized identity service. In a solution backed by IoT Hub, you can use Azure Active

Directory to authenticate requests to Azure IoT Hub service APIs, such as creating device identities or invoking

direct methods. You can develop a custom management UI for solution operators and administrators that authenticates users against Azure Active Directory and executes API requests to the IoT solution backend on behalf of those users.

A successful IoT implementation requires the proper integration of IoT services such as IoT Hub and DPS with

other backend and frontend services, whether Azure or third party. Although the configuration of those other

services is out of the scope of this document, it's important to document and have a good understanding of the

entire data flow of the IoT solution and to have testing procedures in place to ensure the different parts of the

solution are working as expected and meeting the technical and operational requirements of the organization.

For example, DPS supports custom allocation policies by using custom code and Azure Functions, so it's

important to confirm that the Azure Function allows traffic coming from DPS and IoT Hub. Another example is

the integration between IoT Hub and backend services to enable features such as message routing and file

upload. IoT Hub needs to properly authenticate to those Azure services and you should use managed identities

to eliminate the need for developers to manage those credentials manually. Integration of Azure IoT services

with other services should be tested periodically to ensure that the IoT solution is performing as expected and

meeting the operational SLAs defined by the business. The Azure Architecture Center contains many IoT reference architectures that outline and describe the proper and secure integration of the Azure IoT platform with other services.

Design a management interface

Capacity scaling

Resources

Device management

Design

Solution operators and administrators need an interface to interact with the IoT solution. If an IoT solution is

based on IoT Central, an easy-to-use management interface is already built into the application enabling

operators and administrators to complete tasks such as provisioning new devices, adding/removing users,

sending commands to IoT devices, and managing device updates. If the management interface provided by IoT Central doesn't meet the requirements of the solution, or the solution is based on PaaS services such as IoT Hub, then you can build a custom management UI that uses the service REST APIs exposed by IoT Hub and IoT Central.

Azure offers many options to meet capacity requirements as your business grows. Capacity planning and

scaling for your IoT solution varies depending on whether you choose to build an IoT Central or IoT Hub based

solution.

When you implement an IoT solution by using the application-platform-as-a-service (aPaaS) IoT Central, little

needs to be done. IoT Central automatically manages multiple instances of its underlying services to scale your

IoT Central applications and make them highly available. However, since IoT Central only stores 30 days of data,

you're likely to export data to other services. These other services are where you should focus your time, making

sure the solution can handle expected and unexpected capacity needs.

With a solution based on IoT Hub, it's your responsibility to scale up to handle growth in the number of

messages being ingested and to scale out to handle regional demands for the solution. Understanding the

number of messages that will be sent to the IoT hub and the sustained throughput is critical to selecting the correct

IoT Hub tier to support the predicted demand. If you're approaching the message limit for your IoT Hub, you can

follow these steps to automatically scale the IoT Hub up to the next unit of capacity. In addition to properly

configuring IoT Hub to scale as demand changes, it's important to design any backend services in the IoT

solution (such as Azure Stream Analytics, Azure Cosmos DB, and Azure Data Explorer) with scalability in mind to

ensure there are no bottlenecks anywhere in the solution's data flow.
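
The scaling decision itself can be reduced to a small calculation. The sketch below is one possible way to pick a unit count from the day's message usage; it is not the logic of the Azure Function sample. The per-unit daily quotas are assumed from the published standard tier limits and should be verified, and `target_units`, `max_units`, and `threshold_percent` are illustrative names.

```python
# Daily message quota per IoT Hub unit for each standard tier.
# Assumption: these values reflect the published tier limits; verify before use.
DAILY_QUOTA_PER_UNIT = {"S1": 400_000, "S2": 6_000_000, "S3": 300_000_000}


def target_units(tier: str, messages_today: int, max_units: int,
                 threshold_percent: int = 90) -> int:
    """Smallest unit count that keeps usage at or below threshold_percent of quota."""
    # Work in integers (scaled by 100) to avoid floating-point edge cases.
    usable_per_unit = DAILY_QUOTA_PER_UNIT[tier] * threshold_percent
    needed = -(-(messages_today * 100) // usable_per_unit)  # ceiling division
    # Never go below one unit or above the caller-supplied cap.
    return max(1, min(needed, max_units))


if __name__ == "__main__":
    for used in (100_000, 380_000, 1_500_000):
        print(f"S1 with {used:,} messages today -> {target_units('S1', used, max_units=10)} unit(s)")
```

A scheduled job could compare this target with the hub's current unit count and apply the change through whichever management API or tooling the team already uses.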

If your solution is tied to a connected product, the business must communicate any fluctuation in expected load.

Load can be impacted by marketing initiatives, such as sales or promotions; or, by seasonal events, such as

holidays. You should test variations of load prior to events, including unexpected ones, to ensure that your IoT

solution can scale.

Also, you should plan for the capacity needs/requirements of the IoT Edge devices. Whether you're managing

RTOS based devices or larger compute devices with IoT Edge, there are many factors that need to be taken into

account to make sure the sizing of compute and memory are adequate for the specific use-cases.

Azure IoT Hub scaling

Azure IoT Edge Requirements

DPS X.509 Attestation

Manage Users and Roles in IoT Central

Azure IoT Edge Managed Certification Requirements

Controlling access to Azure IoT Hub by using Azure Active Directory

Plan for capacity - Microsoft Azure Well-Architected Framework

An IoT solution's scale and specific use of a device's configuration influence the design of a configuration management strategy. It's important to automate this strategy as much as possible and ensure that the configuration of a device or fleet of devices is set and updated efficiently.

Implement

Resources

Monitoring and alerting

Design

A configuration management strategy should support:

Inventory of IoT devices and IoT Edge devices deployed in the field

Gradual update rollout through device grouping

Resilient updates to support testing and rollbacks

Automatic updates for existing or new devices

Update status reports and alerts

Example services that support these requirements for configuration management are Azure IoT Hub Automatic

Device Management, Azure IoT Edge Automatic Deployment, Azure IoT Hub Scheduled Jobs, and Device Update

for IoT Hub.

For continuous updates to existing or new device and IoT Edge device configurations, such as properties,

application specific settings, and/or relationships, use either Azure IoT Hub Automatic Device Management or

Azure IoT Edge Automatic Deployment. Both services enable an efficient, secure, and reliable way to automate

configuration deployments for either a fleet or specific group of devices. Both services continuously monitor all

new and existing targeted devices, based on tags, and their configuration to ensure these devices always have a

specified configuration. The key difference between these services is that Automatic Device Management only applies to non-IoT Edge devices, and Azure IoT Edge Automatic Deployment only applies to Azure IoT Edge devices.
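
To make the tag-based targeting concrete, the following sketch builds two configuration documents for a gradual rollout: one for a pilot group and one for the broader fleet. The field names (`id`, `content.deviceContent`, `targetCondition`, `priority`) follow the shape of an IoT Hub automatic device configuration as I understand it, so treat them as an assumption and confirm them against the service documentation. The tag name `ring`, the property section `telemetryConfig`, and the helper function are hypothetical.

```python
import json


def build_device_configuration(config_id, target_condition, desired_section,
                               desired_values, priority):
    """Build a configuration document that sets desired properties on targeted devices."""
    return {
        "id": config_id,
        "content": {
            "deviceContent": {
                # Desired twin properties applied to every device that matches.
                f"properties.desired.{desired_section}": desired_values,
            }
        },
        # Twin query condition; devices are selected by their tags.
        "targetCondition": target_condition,
        # Assumption: when several configurations target a device, the higher priority wins.
        "priority": priority,
    }


# Stage 1: a small pilot group gets the new setting first.
pilot = build_device_configuration(
    "telemetry-interval-pilot", "tags.ring='pilot'",
    "telemetryConfig", {"sendIntervalSeconds": 30}, priority=20)

# Stage 2: the rest of the fleet keeps the current setting until the pilot is validated.
broad = build_device_configuration(
    "telemetry-interval-broad", "tags.ring='broad'",
    "telemetryConfig", {"sendIntervalSeconds": 60}, priority=10)

if __name__ == "__main__":
    print(json.dumps(pilot, indent=2))
```

Moving a device from one group to the other then only requires changing its tag; the service re-evaluates which configuration applies.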

To update an existing device or IoT Edge device configuration, such as properties, application specific settings,

and/or relationships, based on a schedule, whether this update is one-time or recurring, use Azure IoT Hub

scheduled jobs. This service enables an efficient, secure, and reliable way to provide a configuration update for

either a fleet or specific group of devices at a scheduled time.

To update an existing device or IoT Edge device configuration, such as firmware, application, or package updates,

use Device Update for IoT Hub. This service enables a safe, secure, and reliable way to provide firmware,

application, or package updates over-the-air (OTA) to either a fleet or specific group of devices. It's a good idea to

include a manual update method when developing IoT devices. Because of root certificate changes to IoT Hub or

DPS, or connectivity issues, you may need to update devices using a manual method such as physically

connecting to a local computer or using a local connectivity protocol such as Bluetooth.

Azure IoT Hub Automatic Device Management

Azure IoT Edge Automatic Deployment

Azure IoT Hub Scheduled Jobs

Device Update for IoT Hub

Monitoring and logging systems help to answer the following operational questions:

Are devices or systems in an error condition?

Are devices or systems correctly configured?

Are devices or systems generating accurate data?

Are systems meeting the defined service level objectives?

If the answer to any of these questions is no, these systems surface relevant information that helps operations teams mitigate problems. IoT solution logging and monitoring systems are often more complicated than those of standard line-of-business applications. The complexity arises from the fact that IoT solutions span:
Physical sensors interacting with an environment.
Applications on the intelligent edge performing activities such as data shaping and protocol translation.
Infrastructure components such as on-premises gateways, firewalls, and switches.
Ingestion and messaging services.
Persistence mechanisms.
Insight and reporting applications.
Subsystems that operate and scale independently in the cloud.
The following simplified logging and monitoring architecture shows examples of typical IoT solution
components and how they use some recommended technologies.

Implement

When you have critical applications and business processes that rely on Azure resources, you should monitor
those resources for their availability and performance.
The Azure Monitor service supports these requirements. This service can:
Detect and diagnose issues across applications and dependencies with Application Insights.
Correlate infrastructure issues with VM insights and Container insights.
Drill into your monitoring data with Log Analytics for troubleshooting and deep diagnostics.
Support operations at scale with smart alerts and automated actions.
Create visualizations with Azure dashboards and workbooks.
Collect data from monitored resources using Azure Monitor Metrics.

Monitor Azure IoT Hub
The Overview page in the Azure portal for each IoT hub includes charts that provide some usage metrics, such
as the number of messages used and the number of devices connected to the IoT hub. The information on the
Overview page is useful but represents only a small amount of the monitoring data available for an IoT hub.
Some monitoring data is collected automatically and is available for analysis as soon as you create your IoT hub.
You can configure other types of data collection.
Azure IoT Hub collects the same types of monitoring data as other Azure resources, as described in Monitoring data from Azure resources.

To learn more about the metrics and logs that IoT Hub creates, see Monitoring Azure IoT Hub data reference.

IoT Edge metrics collector

Microsoft provides a ready-to-use module called Metrics Collector. Add this first-party module to an IoT Edge deployment to collect module metrics and send them to Azure Monitor. The module code is open source and is provided as a multi-arch Docker container image that supports Linux x64, ARM32, ARM64 (version 1809). It's available in the IoT Edge Module Marketplace.

Module features include:

Collect logs from all the modules that can emit metrics using the Prometheus data model. While built-in metrics enable broad workload visibility by default, you can also use custom modules to emit scenario-specific metrics that enhance the monitoring solution.

There are two options to send metrics from the metrics collector module to the cloud:

Send the metrics to Log Analytics. The collected metrics are ingested into the specified Log Analytics workspace using a fixed, native table called InsightsMetrics.

Send the metrics to IoT Hub. The collector module can be configured to send the collected metrics as UTF-8 encoded JSON device-to-cloud messages through the edgeHub module. This option unlocks monitoring of locked-down IoT Edge devices that are only allowed external access to the IoT Hub endpoint.

The AllowedMetrics and BlockedMetrics configuration options take space-separated or comma-separated lists of metric selectors. A metric is matched to the list and included or excluded if it matches one or more metrics in either list.
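
The include and exclude behavior can be illustrated with a simplified filter. This sketch is not the collector's implementation. It assumes that selectors are plain metric names (real selectors may also match on labels), that an empty allow list means every metric is collected, and that the block list takes precedence when a metric appears in both lists. The metric names are examples only.

```python
import re


def parse_selectors(value: str) -> set:
    """Split a space-separated or comma-separated option value into selector names."""
    return {item for item in re.split(r"[,\s]+", value.strip()) if item}


def should_collect(metric_name: str, allowed: set, blocked: set) -> bool:
    """Decide whether a metric is collected under the assumptions stated above."""
    if metric_name in blocked:
        return False  # assumption: the block list overrides the allow list
    # assumption: an empty allow list disables allow-list filtering
    return not allowed or metric_name in allowed


if __name__ == "__main__":
    allowed = parse_selectors("edgehub_messages_received_total, edgeAgent_used_cpu_percent")
    blocked = parse_selectors("edgeAgent_used_cpu_percent")
    for name in ("edgehub_messages_received_total",
                 "edgeAgent_used_cpu_percent",
                 "edgehub_queue_length"):
        print(name, "->", "collect" if should_collect(name, allowed, blocked) else "skip")
```

Restricting collection to the metrics a workbook or alert actually uses is one way to keep ingestion volume, and therefore Log Analytics cost, under control.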

You can visually explore metrics collected from IoT Edge devices using Azure Monitor workbooks. Curated

monitoring workbooks for IoT Edge devices are provided as public templates:

For devices connected to IoT Hub, from the IoT Hub page in the Azure portal, navigate to the Workbooks page in the Monitoring section.

For devices connected to IoT Central, from the IoT Central page in the Azure portal, navigate to the Workbooks page in the Monitoring section.

Curated workbooks use built-in metrics from the IoT Edge runtime. They first need metrics to be ingested into a

Log Analytics workspace. These views don't need any metrics instrumentation from the workload modules:

Monitor updates

Monitor configuration management

Monitor automation and DevOps

Resources

Automation and DevOps

As with any deployment or update, you should monitor the state of both the deployment and the device. In the Device Update for IoT Hub service, compliance measures how many devices have installed the highest-version compatible update. A device is compliant if it has installed the highest-version compatible update that's available.
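
As a rough illustration of the compliance measure, the sketch below computes the share of compliant devices from a list of version pairs. It is a simplified model, not the service's calculation: it assumes dotted numeric version strings, and the input shape and function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '1.0.10' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def compliance_percent(devices) -> float:
    """Percentage of devices whose installed version is the highest compatible one.

    devices: iterable of (installed_version, highest_compatible_version) strings.
    """
    devices = list(devices)
    if not devices:
        return 100.0  # nothing to update, so nothing is out of compliance
    compliant = sum(
        1 for installed, highest in devices
        if parse_version(installed) >= parse_version(highest)
    )
    return 100.0 * compliant / len(devices)


if __name__ == "__main__":
    fleet = [("1.2.0", "1.2.0"), ("1.1.0", "1.2.0"), ("2.0.0", "2.0.0"), ("1.0.9", "1.0.10")]
    print(f"Compliance: {compliance_percent(fleet):.1f}%")
```

Comparing versions as tuples rather than strings matters here: as text, "1.0.9" sorts after "1.0.10", which would wrongly mark that device as up to date.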

As with any deployment and/or update, you should have monitoring and alerting on the status of a device

configuration deployment and/or update in place. Each Azure IoT configuration service, as mentioned

previously, collects and stores logs and metrics in Azure Monitor. You can use this data to create alerts in Azure

Monitor for sending notifications when a configuration deployment and/or update has been created, completed,

or failed.

If the monitoring data provided by each of the Azure IoT configuration services isn't enough, the Azure IoT Hub

service APIs offer a more granular view. See the following Resources section.

DPS, Azure IoT hub configurations and IoT Edge provide continuous metrics and status updates that are the key

inputs to monitor the state of CI/CD or automation script output. You can collect and analyze these metrics in a

Log Analytics workspace and then define alerts.

Monitor device connectivity using the Azure IoT Central Explorer

Monitor, diagnose, and troubleshoot Azure IoT Hub disconnects

Monitoring Azure IoT Hub

Azure IoT Hub Automatic Device Management Monitoring

Check Azure IoT Hub service and resource health

Tutorial - Set up and use metrics and logs with an Azure IoT hub

Collect and transport metrics - Azure IoT Edge

The key benefit for organizations that are mature in using DevOps is agility: the ability to quickly sense and respond to changes in business needs and seamlessly drive through the full lifecycle of developing, deploying, and operating a solution. DevOps in the world of application software development is enabled by automation in testing, integration, and deployment. Finally, when application software changes are deployed in an infrastructure-as-code environment, the ongoing operation of deployed software is automated and managed.

In the world of IoT solutions, there are three areas where agility, DevOps, and automation come into play in the development, deployment, and operation of an IoT solution:

The automation of the IoT application software lifecycle from development through testing to deployment to IT operations.

The IoT application software deployed in devices using Azure IoT Edge. Here, once again, you can use a combination of services available in Azure IoT Hub and Azure IoT Edge to use Azure DevOps tools and processes in automating software lifecycle aspects of an IoT solution at the edge.

Addressing the challenge to operationally manage the IoT devices by providing operators with tools that let them collaborate and get the necessary visibility, insights, and control to properly maintain and operate a reliable IoT solution.

Design

Consider the following aspects when you design your automation and DevOps processes for IoT:

Spreading approach: Thousands of devices are connected to a single cloud endpoint such as Azure IoT Hub. Any change in configuration or software must be spread across all the devices.

Cross-functional aspect: Device vendors and cross-functional solution developers work together to develop and deploy the IoT solution. Embrace cross-functional teams to deliver continuously for the solution.

Software or configuration defined solutions: System functionality is defined by the program or software installed on the hardware. Therefore, to change system functionality, update software instead of making hardware changes or local interventions.

Productivity and rapid development cycle: Use a process that speeds up delivery across the whole solution development cycle. Apply CI/CD DevOps principles and processes.

Evolving business and deployment model: The core purposes of IoT solutions are to create possibilities for different business models and pilot validation, deployment, and enhancements. DevOps provides a way to consistently deliver fresh software updates to meet the solution and business requirements.

Intelligence at the edge: Connected IoT Edge devices have a lifecycle that extends beyond deploy, break and fix, and retire. Connectivity creates the opportunity to continuously add incremental innovation across the system lifecycle. DevOps puts organizations in the best position to capitalize on that opportunity.

Implement

When you implement automation and DevOps in IoT systems, follow the device lifecycle and the specific automation and DevOps requirements in each phase. The following tables describe three phases of the device lifecycle:

Beginning of life:

| Expectations | Platform feature available with code snippets |
| --- | --- |
| Device registration (non-DPS) | Bulk Device updates |
| Device provision | DPS Service: Prepare the configuration required to provide Zero touch device provisioning |
| Device certificate and token management | Generate SAS token or public key at device for authentication and channel security |
| Device certificate lifecycle management | CA certificate lifecycle management (Preview feature with DPS and DigiCert) |
| Device initial configurations | Device Twins and device modules |
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM  FEAT URE AVA IL A B L E W IT H  C O DE SN IPP E T SP L AT F ORM  FE AT UR E AVA IL A B L E WIT H  C O DE SNIP P E T S
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM  FEAT URE AVA IL A B L E W IT H  C O DE SN IPP E T SP L AT F ORM  FE AT UR E AVA IL A B L E WIT H  C O DE SNIP P E T S
Continuous device configuration management at scale Device Twins and device modules
Create a CI/CD pipeline for IoT Edge modules IoT Edge CI/CD
Reprovision device DPS device reprovision
SAS key generation if expired or require changes Generate SAS token or public key at device for
authentication and channel security
Log and device diagnostics Analyze the device logs, Azure IoT hub supported device
heartbeats and diagnostics, Azure IoT Hub workbook blade
with pre-configured workbooks
Azure IoT Edge monitoring diagnostics Collect and automate Azure IoT Edge device logs and
metrics
OTA device updates Device updates
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM  FEAT URE AVA IL A B L E W IT H  C O DE SN IPP E T SP L AT F ORM  FE AT UR E AVA IL A B L E WIT H  C O DE SNIP P E T S
Unenroll devices Revoke device access
Remove any configuration specific to a device Device Twins and device modules
Device replacement Follow beginning of life steps
 
Next steps
Middle of life:Middle of life:
End of Life:End of Life:
Performance efficiency in your IoT workload

Performance efficiency in your IoT workload

12/16/2022 • 22 minutes to read

Prerequisites

Principles

Run performance testing in the scope of development

Continuously monitor the application and the supporting infrastructure

Performance efficiency is the ability of your workload to scale to meet the demands in an efficient manner.

IoT workloads span from millions of small devices connected to the cloud to industrial solutions where the

devices are a few powerful servers acting as gateways for cloud connectivity. A device, or more specifically an

IoT Device, may generally be defined as a computing device that collects and transmits information collected

from sensors. The number of devices, the number of messages sent/received and the physical geographical

placement of the devices are a few of the factors that define the performance efficiency of an IoT workload.

A benefit of the cloud is its geographical availability and the capability to scale services to meet demand.
Scale your IoT services with no, or minimal, application downtime.

To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected

Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:

Azure Well-Architected Review

After you complete the assessment, this guide helps you address the key performance efficiency

recommendations identified for your solution.

This pillar represents the performance relative to the resources used under stated conditions. This characteristic

is composed of the following subcharacteristics:

Time behaviorTime behavior - Degree to which the response times, processing times and throughput rates of a

product or system, when performing its functions, meet requirements.

Resource utilizationResource utilization - Degree to which the amounts and types of resources used by a product or

system, when performing its functions, meet requirements.

CapacityCapacity - Degree to which the maximum limits of a product or system parameter meet requirements.

Overall performance principles can be found at Principles of the performance efficiency

An IoT solution has several specific challenges to address. An IoT project contains both IoT devices, edge

software and cloud software parts and, when deployed, may contain millions of devices connected from

multiple regions of the world and sending millions of messages per minute.

Be aware of the complexity of having sensors, devices, and gateways in geographically different locations with
different characteristics, speed, and reliability of communication. Plan for this in your testing, and test
failure scenarios such as network disconnects. In addition, plan stress and load tests for the device, edge, and
cloud components of your full IoT solution.

Monitoring an IoT solution with several types of devices in multiple geographical regions requires a distributed
monitoring solution that balances the amount of information monitored and sent to the cloud against the cost in
memory and performance of the monitoring (tune the transmission for diagnostics scenarios). Monitoring must
occur at multiple levels or layers, for example by exposing metrics on a gateway for industrial or
gateway-enabled solutions.

Invest in capacity planning

Design

Device and gateway layer

An IoT solution can easily start with a few hundred devices or messages and grow to millions of devices and

messages per minute. One of the benefits of the cloud is that it can often be scaled to an increase in load. For IoT

devices and gateways, the situation is much more complex as these devices are often designed long before the

solution is finalized and the need to update the capacity is costly, as devices may need to be replaced. In

industrial IoT or similar industries, the lifespan of a device is measured in decades so it's more important than

ever in these scenarios to think ahead.

An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,

and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also

cross-cutting layers that enable the design, building, and running of IoT solutions:

The following sections address the layer specifics for the performance efficiency pillar:

Devices are computing devices that connect to an IoT solution and have the ability to transmit or receive data.
Gateways are devices that serve as the connection point between an IoT solution and other devices. Design the
solution accordingly, and factor in the load and the limits and quotas.

Optimize the hardware specifications of devices and gateways

Categorize individual workloads

Optimize connectivity

Optimize the capabilities of your existing hardware by using more efficient languages and frameworks

like C and Rust.

Upgrading or replacing hardware is costly and takes time, therefore devices that are connected to the IoT

solution should be sized for the required capacity and functionality in advance.

Use gateways as units of scale. For example, if your solution adds OPC UA servers over time, other edge

gateways can be used to ingest data from these other servers.

When scaling the Device and gateway layer conduct a scale assessment for all upstream layers including

edge gateways, cloud gateways, and upstream cloud services.

Use multiple IoT hubs as scale units for the IoT solution, see Tutorial: Provision devices across load-

balanced IoT hubs. Use DPS to distribute devices over multiple IoT Hub instances.

Depending on the requirements of the solution, run CPU and I/O-intensive tasks on specific hardware

(for example, run ML algorithms on local GPUs) or run their workloads in the cloud (such as blurring

faces in a video stream).

Use the Azure IoT device SDKs to connect to the cloud gateway, because they manage the required message
translation, error handling, and retry mechanisms needed for a resilient connection.

Consider using the Azure IoT Embedded C SDK when developing for constrained devices or when most

of the security and communication stack is already available on the device. Consider using the Azure IoT

device SDK for C to include all that is needed to connect to the cloud gateway.

Consider running workloads on the device/edge to use local compute for low latency and not always

connected scenarios. Consider running workloads in the cloud to use scale.

Consider separating workloads by time constraint and by required latency and response times, for example
a response within seconds versus batch processing per hour. Depending on system constraints such as
network throughput or latency, the solution may demand running your workload at the edge.

On the edge, use priority queues for sending different streams of data in the order that is required.

Although messages will be sent in order of priority to the IoT Hub, this doesn't impact receiving the

messages from the IoT Hub as they're still journaled on the IoT Hub based on receipt order.

In the cloud, use different consumer groups on Event Hubs to separate out different streams of data, so that
you can handle and scale alarms differently from telemetry.

In the cloud, use IoT Hub routes to separate out different streams of data, with filtering and separate

endpoints. IoT Hub message routing will add some latency.

Consider using open stateful connections during the operational time of the devices and gateways to
minimize the overhead of setting up the connection.

Use an open stateful connection to the IoT solution for bi-directional communication between the devices
and the IoT solution.

Consider using IoT Hubs with the lowest latency to your devices. This could mean that IoT Hubs in

multiple regions are needed when devices need to connect from different geographical locations.

Use DPS to set up a connection to the IoT hub during provisioning, when the connection isn't available

anymore or during reboot of the device.

Optimize offline scenario

Ingestion and communication layer

Limits and throttling

Use the evenly weighted distribution policy of DPS to adjust the weight for provisioning. For details, see
How the allocation policy assigns devices to IoT Hubs.

Avoid having all devices connect at once, for example after a regional power outage. Use truncated
exponential backoff with added jitter when retrying.
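
The retry guidance above can be sketched as follows. This is a minimal illustration of truncated exponential backoff with full jitter, not the retry logic built into the Azure IoT device SDKs. The function name `backoff_delays` and the `base`, `cap`, and `attempts` values are assumptions chosen for the example.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=8, rng=None):
    """Yield reconnect delays in seconds (hypothetical helper).

    The ceiling doubles on every attempt and is truncated at `cap`.
    Each delay is drawn uniformly from [0, ceiling] ("full jitter"),
    so devices that lost power together do not reconnect together.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


# Example: the delays a single device would wait between reconnect attempts.
delays = list(backoff_delays(base=1.0, cap=10.0, attempts=6, rng=random.Random(0)))
```

Because every device draws its own random delay, a fleet that restarts at the same moment spreads its reconnects over the whole interval instead of arriving in synchronized waves.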

Consider using Device and Module twins to asynchronously sync state information between devices and

the cloud, even when the device isn't currently connected to the cloud gateway. Device and Module twins

contain only the current state at a point in time, not any history or removed information.

Consider giving a device enough information and context to work without a connection to the cloud, and
store this information locally so that the device can recover from a reboot.

Ensure a device is capable of storing the data locally, including logs and cached telemetry per priority

(when the device isn't connected).

Consider setting a time-to-live on the data, so that expired data will be removed automatically.

Consider discarding less important data when the device isn't connected to reduce the local storage

needed and reduce the synchronization time when the device reconnects.

Consider using a cache eviction strategy such as FIFO, LIFO, or priority based for when edge device

storage capacities are reached.

Consider using a separate disk (or disk controller) to store data, so that the device runtime/application

can continue to work if running low on storage.
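
The offline storage recommendations above (time-to-live, discarding less important data, and an eviction strategy) can be combined in one small buffer. The following is a sketch under assumptions: the class `OfflineBuffer`, its parameters, and the convention that a lower `priority` number means more important data are all hypothetical and are not part of IoT Edge or any Azure SDK.

```python
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Buffered:
    priority: int                        # lower number = more important (assumed convention)
    created: float                       # timestamp used for TTL and FIFO ordering
    payload: bytes = field(compare=False)


class OfflineBuffer:
    """Bounded in-memory buffer with TTL expiry and priority-based eviction."""

    def __init__(self, max_items, ttl_seconds):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self.items = []

    def _expire(self, now):
        # Drop entries whose time-to-live has passed.
        self.items = [m for m in self.items if now - m.created <= self.ttl]

    def add(self, payload, priority, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        self.items.append(Buffered(priority, now, payload))
        if len(self.items) > self.max_items:
            # Evict the least important entry; among equals, the oldest (FIFO).
            victim = max(
                range(len(self.items)),
                key=lambda i: (self.items[i].priority, -self.items[i].created),
            )
            del self.items[victim]

    def drain(self, now=None):
        """Return payloads to send on reconnect: by priority, then by age."""
        now = time.time() if now is None else now
        self._expire(now)
        ordered = sorted(self.items)
        self.items = []
        return [m.payload for m in ordered]
```

A real device would persist the buffer to disk rather than keep it in memory; the eviction and expiry rules stay the same.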

Data ingestion is the process used to send data from the devices into the IoT solution. Besides data ingestion,
there can be other patterns of communication between devices and the IoT solution. These include:

Device-to-cloud messages.

Cloud-to-device messages.

Initiate file uploads.

Retrieve and update device twin properties.

Receive direct method requests.

For more information about IoT Hub endpoints, see IoT Hub Dev Guide Endpoints

The cloud IoT gateway has well-defined limits per unit, dependent on the tier of the IoT Hub. These limits are
defined as the quota for sustained throughput and sustained send rates for the selected IoT Hub tier. Although
these quotas are defined as the sustained throughput and send rate, IoT Hub is capable of handling loads above
these quotas for short periods of time, to resiliently handle short bursts or load overshoots. This is where
another limit becomes important: the throttle limit, which is an hourly or daily upper limit for the load of the
service. The throttle limits protect an IoT Hub from too much load for a longer period of time.

The diagrams below show the relation between the load, the quota and the throttle limits. The left diagram

shows that an IoT Hub can handle sustained or constant high load up to the level of the quota for the selected

IoT Hub tier. The right diagram shows that an IoT Hub can handle load that is changing over time as long as it

isn't hitting the throttle limit and on average not above the quota for the selected IoT Hub tier.

Optimize interactions between device and cloud, components, and services

Optimize communication and cloud processing of messages

The number of messages sent from device to cloud is an important parameter for IoT solution performance
efficiency. Azure IoT services such as IoT Hub and IoT Central also have a defined limit of messages for a
selected tier, which affects both the performance and the cost of the solution.

Consider the following recommendations for sending messages:

IoT Hub and IoT Central calculate daily quota message count based on a 4-KB max message size. Sending

smaller messages will leave some capacity unused. In general, use message sizes close to the boundary

of 4 KB.

Consider message sizing. Group smaller device-to-cloud messages into larger messages to reduce the

total number of messages. Consider the introduced latency when combining messages.

Consider message frequency. Avoid chatty communication where the use case allows.

Use Azure IoT SDK message batching for AMQP to send multiple telemetry messages to the cloud.

Consider using AMQP connection multiplexing to reduce the dependency on TCP connections limits per

SDK client. With AMQP connection multiplexing, a single TCP connection to IoT Hub can be used by

multiple devices.

Use built-in message batching capabilities for AMQP of IoT Edge to send multiple telemetry messages to

the cloud.

Consider using application-level batching by combining multiple smaller messages at the downstream

device and send larger messages to the gateway/edge device. This will limit the message overhead and

lower the writes to local disk storage at the edge.

For device-to-device or module-to-module communication at the edge, also avoid designing interactions

in which many small messages are used.

Consider using Direct Methods for request/reply interaction with a device that can succeed or fail

immediately (after a user-specified timeout). This approach is useful for scenarios where the course of

immediate action is different depending on whether the device was able to respond.

Consider using Device Twins for device state information including metadata and configurations. Azure

IoT Hub maintains a device twin for each device that you connect to IoT Hub.
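
The message sizing recommendations above can be illustrated with application-level batching. This sketch assumes that a payload is counted in whole 4-KB units, rounded up, which is one reading of the quota rule stated above. The helper names `metered_messages` and `batch_readings` are hypothetical, and the batching shown here is plain JSON packing, not the batching built into the Azure IoT SDKs.

```python
import json
import math

CHUNK_BYTES = 4 * 1024  # the 4-KB metering unit mentioned in the text


def metered_messages(payload_size):
    """Number of 4-KB units a payload counts as (assumption: rounded up)."""
    return max(1, math.ceil(payload_size / CHUNK_BYTES))


def batch_readings(readings, limit=CHUNK_BYTES):
    """Greedily pack readings into JSON arrays of at most `limit` bytes.

    A single reading larger than `limit` is emitted alone and oversized.
    """
    batches, current, size = [], [], 2          # 2 bytes for the "[]" brackets
    for reading in readings:
        encoded = json.dumps(reading, separators=(",", ":")).encode("utf-8")
        extra = len(encoded) + (1 if current else 0)   # comma between items
        if current and size + extra > limit:
            batches.append(b"[" + b",".join(current) + b"]")
            current, size = [], 2
            extra = len(encoded)
        current.append(encoded)
        size += extra
    if current:
        batches.append(b"[" + b",".join(current) + b"]")
    return batches


# 1,000 small readings become a much smaller number of near-4-KB messages.
readings = [{"t": i, "temp": 21.5} for i in range(1000)]
batches = batch_readings(readings)
```

Batching trades latency for message count: a reading waits on the device until its batch fills or a flush timer fires, so time-critical data such as alarms is usually sent outside the batch.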

It's common that messages from a device or gateway need to be translated, processed or enriched with

additional information before being stored. This could potentially be a time-consuming step, so it's important to

evaluate the effect on performance in your solution. Some of the recommendations are also conflicting, such as

using compression for optimizing data transfer versus avoiding cloud processing decrypting messages. These

recommendations need to be balanced and evaluated against other pillars.

Optimize the data format used to send data to the cloud. Consider both the performance (and cost) of the
bandwidth used and the performance effect of any cloud processing of the data that is needed.

Categorize and prioritize your data

Transport layer

Optimize resource usage

Consider using message enrichment in IoT Hub to add context to device messages.

Perform time-critical event processing on the ingested data as it arrives, instead of storing unprocessed data
and requiring complex queries to acquire data. For time-critical event processing, consider late arrival,
windowing, and their impacts. This depends on the use case, for example critical alarm handling versus
message enrichment.

Select the right IoT Hub tier (Basic or Standard) based on solution feature requirements. Be aware of the

features that aren't supported in the Basic tier.

Select the right IoT Hub tier size (1, 2 or 3) and the number of instances based on data throughput,

quotas and operation throttles. Select the right IoT Central tier (Standard 0, Standard 1, Standard 2) based

on the number of messages that are sent from a device to the cloud.

Consider using Event Grid for publish-subscribe event routing. See Azure IoT Hub and Event Grid and
Compare Event Grid, routing for IoT Hub (/azure/iot-hub/iot-hub-event-grid-routing-comparison).

Often some of the data sent from devices to the cloud is more important than other data. Classifying the data
based on priority and handling the categories in different ways is often a good practice for performance
efficiency.

One good example is a thermometer sensor sending temperature, humidity, and other telemetry, but also

having a feature of sending an alarm when temperature is outside of a defined range. The alarm message would

be classified as more important than the temperature values.

Consider the following recommendations for data categories:

Use IoT Edge priority queues to make sure important data is prioritized while sending to IoT Hub. This

doesn’t impact the priority on the receiving side of IoT Hub.

Use IoT Hub message routing to separate routes for different data priorities dependent on the use case.

IoT Hub message routing will add some latency.

Consider saving and sending low priority data at longer intervals, or even using batch/file uploads.

Malware detection on uploaded files will increase latency.

When you use IoT Edge, messages are buffered when there's no connection. After the connection
is restored, all buffered messages are sent first, in order and by priority, followed by new messages.

Consider separating messages based on timing constraints. For example, send messages to IoT Hub

directly when there's a time constraint, and utilize file upload via IoT Hub or batch data transfer (for

example, Data Factory) if there's no time constraint. When you use IoT Edge, the blob module can be used

for the file upload.

The transport layer handles connections between a device and the IoT solution, transforming IoT messages to
network packets, and sending them over the physical network.

Common protocols used in IoT solutions are:

AMQP – Advanced Message Queuing Protocol

MQTT – Message Queue Telemetry Transport

Optimize data communication

Device management and modeling layer

Device provisioning

The connection between a device and the cloud needs to be secure, reliable and scalable to handle the targeted

number of devices and messages.

Use an open stateful connection from a device to the cloud gateway (for example using MQTT or AMQP).

IoT Hub is optimized for managing millions of open stateful connections (using MQTT, AMQP or

WebSocket protocols). If you keep open connections to the devices, the overhead of security handshakes,

authentication and authorization is minimized, which will improve performance and reduce the required

bandwidth considerably.

Consider using a protocol to connect to the cloud gateway (AMQP) that supports multiplexing of multiple

channels on a single connection to minimize the number of open connections required by the cloud

gateway. By using multiplexing, a transparent gateway can connect multiple leaf devices using their own

channel over a single connection.

Use the cloud gateway patterns of Device and Module twins to asynchronously exchange state

information between the device and the cloud.

Consider configuring DPS to move the device state when a device is connecting to another cloud

gateway.

The number and size of messages that are sent from device to cloud have an effect on performance and cost.

Evaluating data communication is key to performance efficiency in your IoT Workload.

Use an efficient data format and encoding to send data to the cloud, so that device-to-cloud bandwidth
isn't used excessively.

Optimize the data format used to send data to the cloud especially for low bandwidth networks. Consider

using a compressed or binary format to send data to the cloud using low bandwidth networks. Consider

the overhead of uncompressing or converting in the cloud after the cloud gateway.

Consider storing high volume data sent from device to cloud locally and uploading hourly/daily.

Group many smaller device-to-cloud messages into fewer larger messages to reduce the total number of
messages. However, don't send only large messages; balance average message size against
throughput.
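
To make the trade-off between data formats concrete, the following sketch compares a JSON encoding, a fixed binary layout, and a compressed batch of the same reading. The field names and the binary layout are assumptions made for this example; they are not a format defined by IoT Hub.

```python
import json
import struct
import zlib

reading = {"device": 17, "ts": 1700000000, "temp": 21.5, "humidity": 48.0}

# Text encoding: self-describing, but every message repeats the field names.
as_json = json.dumps(reading).encode("utf-8")

# Assumed binary layout: little-endian uint16 device, uint32 ts,
# float32 temp, float32 humidity. Both sides must agree on it in advance.
LAYOUT = struct.Struct("<HIff")
as_binary = LAYOUT.pack(
    reading["device"], reading["ts"], reading["temp"], reading["humidity"]
)

# Compression pays off on larger, repetitive payloads such as batches.
batch = json.dumps([reading] * 100).encode("utf-8")
compressed = zlib.compress(batch)


def decode(blob):
    """Cloud-side conversion back to a dictionary (the overhead noted above)."""
    device, ts, temp, humidity = LAYOUT.unpack(blob)
    return {"device": device, "ts": ts, "temp": temp, "humidity": humidity}
```

The binary form is much smaller than the JSON form, but the cloud side then has to unpack or decompress each message before it can filter, route, or enrich it, which is the processing cost the recommendations ask you to weigh.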

Different types of devices, using different protocols, connectivity, data ingestion frequencies and communication

patterns can be connected to an IoT solution and an IoT solution can connect to many devices and gateways at

the same time. The IoT solution must be capable of managing which devices and gateways are connected and

how they're configured.

Besides the physical connection and configuration of the devices and gateways, the data captured and ingested

by the devices and gateways must be understood by the IoT solution and therefore needs to be transferred and

contextualized.

Use DPS to set up a connection to the IoT hub during provisioning, when the connection isn't available

anymore or during reboot of the device.

Consider using DPS to allocate devices to IoT Hubs in different regions based on the latency.

Use the evenly weighted distribution policy of DPS to adjust the weight for provisioning, based on the

use cases of the IoT solution.

Consider provisioning devices to the IoT solution over a period of time (distributed or in smaller batches)
rather than all at once, to balance the load and quota of DPS. Account for maximum connections and device
registrations, including latency and retries.

Downstream devices

Optimize sizing based on the device and message load

Storage layer

Time series data

Storage on devices and gateways

When onboarding in batches, plan for the batches and overall migration timeline.

Consider using a caching strategy for connection string returned by DPS to reduce Device Provisioning

Service operations for reconnects.

Account for the limits of DPS in number of operations and registrations per minute.
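
To see how allocation weights shift load between IoT hubs, the following simulation assigns devices to hubs with probability proportional to a weight. It is only an illustration of weighted distribution in general and is not the DPS implementation. The hub names, weights, and the functions `assign_hub` and `simulate` are hypothetical.

```python
import random
from collections import Counter


def assign_hub(weights, rng):
    """Pick one hub name with probability proportional to its weight."""
    hubs = list(weights)
    return rng.choices(hubs, weights=[weights[h] for h in hubs], k=1)[0]


def simulate(weights, device_count, seed=0):
    """Count how many simulated devices land on each hub."""
    rng = random.Random(seed)
    return Counter(assign_hub(weights, rng) for _ in range(device_count))


# hub-c has twice the weight of the others, so it receives roughly half the devices.
counts = simulate({"hub-a": 1, "hub-b": 1, "hub-c": 2}, device_count=10_000)
```

Running a simulation like this against the planned device count gives a rough per-hub load estimate to compare with the quota of each hub tier.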

Use multiple gateway devices in Translation mode when the number of downstream devices, their

messages and message size will change/increase over time, and their protocol or message must be

translated. With the ability to have multiple gateways/IoT Edge devices per site/location and downstream

devices that can connect to any of these gateways/IoT Edge devices, the solution will be horizontally

scalable. Gateways/IoT Edge devices in Translation mode will be able to translate protocols or messages

to and from downstream devices, however a mapping is needed to find the gateway a downstream

device is connected to.

Use multiple gateways/IoT Edge devices in Transparent mode when connecting downstream

MQTT/AMQP devices and their number can change/increase over time per site/location. With the ability

to have multiple gateways/IoT Edge devices per site/location and downstream devices that can connect to

any of these gateways/IoT Edge devices, the solution will be horizontally scalable. Gateways/IoT Edge

devices in Transparent mode will be able to connect MQTT/AMQP devices for bidirectional

communication.

Account for additional message translation and buffering overhead at the gateway/IoT Edge device when

using Translation mode.

Account for additional message buffering, storage and configuration overhead at the gateway/IoT Edge

device when using Transparent mode.

Account for the number of messages the cloud gateway can handle, dependent on the tier and number of

units used. Take into account anomalies in sustained throughput due to data distribution, seasonality and

bursting.

Consider using multiple cloud gateways (IoT Hubs) when millions of devices need to be managed using

the IoT solution. Use DPS to get devices assigned to one of the IoT Hubs in the IoT solution.

An IoT solution often requires multiple storage types, residing on devices and gateways and in the cloud.
The different types of data collected and referenced in the IoT solution often require different storage types
that are specialized and optimized for different scenarios.

An IoT solution that is available in multiple geographical regions might also include complexity based on data

that needs to be available globally and/or locally and in some cases replicated for optimizing latency.

Use a time series database for storing time series data (a timestamp and one or more values).

Consider enriching telemetry time series data with columns used for filtering when storing in a Time

Series Database, for example CustomerID, RoomID or any use-case specific column.
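
As a small illustration of the enrichment recommendation above, the sketch below attaches filter columns to a telemetry point before it is stored. The lookup table `DEVICE_CONTEXT`, the device ID, and the function `enrich` are hypothetical; in a real solution the context might come from a device registry or device twin tags.

```python
# Hypothetical lookup of filter columns per device.
DEVICE_CONTEXT = {
    "sensor-01": {"CustomerID": "contoso", "RoomID": "r-101"},
}


def enrich(point, context=DEVICE_CONTEXT):
    """Return a copy of the telemetry point with filter columns attached."""
    extra = context.get(point["device_id"], {})
    return {**point, **extra}


row = enrich(
    {"device_id": "sensor-01", "timestamp": "2022-12-16T10:00:00Z", "value": 21.5}
)
```

Storing `CustomerID` and `RoomID` alongside each point lets queries filter on those columns directly instead of joining against device metadata at read time.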

Use storage on devices and gateways for caching data.

Event processing and analytics layer

Optimize on edge versus cloud processing

Categorize individual workloads

Optimize high volume cloud data handling

Use storage on devices and gateways to make sure data is kept when disconnected. Account for the required
storage space. Consider not keeping all data locally: use downsampling, store only aggregates, or keep data
for a limited period of time.
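
Downsampling to aggregates, as suggested above, can be as simple as averaging per time window. This sketch assumes samples are `(timestamp_seconds, value)` pairs; the function name `downsample` and the 60-second window are choices made for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, window_seconds=60):
    """Reduce raw samples to one (window_start, mean, count) row per window."""
    windows = defaultdict(list)
    for ts, value in samples:
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start].append(value)
    return [
        (start, mean(values), len(values))
        for start, values in sorted(windows.items())
    ]


raw = [(0, 1.0), (30, 3.0), (60, 10.0)]
aggregates = downsample(raw)
```

Keeping the count next to the mean lets the cloud side merge windows correctly later, at the cost of a few extra bytes per row.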

Data generated by devices can be processed before sending it to the IoT solution or within the IoT solution. The

data processing can contain translation, contextualization, filtering and routing, or more advanced analytics like

trend analysis or anomaly detections.

Consider running local workloads on the device/edge using local compute. For example, running a

machine learning algorithm to count people in a video stream can run on the edge and an event

containing the count is sent to the cloud.

Consider running large workloads, or workloads that have additional or external data or compute

dependencies, in the cloud to use scale. For example, comparing actual trends between different factories.

Consider running analytics workloads on the edge using the Stream Analytics Edge module. For example,

anomaly detection can run on the edge and label the events sent to the cloud with the detected anomaly.

When running analytics on the edge, account for extra latency, late arrival, and windowing impact.

Consider running real-time and near real-time workloads on the device or at the edge.

Consider running small, optimized, low-latency processing with time constraints on a device.

Be aware of the overhead of an edge with many connected downstream devices. The edge node has to

forward or process all messages and handle caching all the data in case of lost and restored connectivity

to the cloud. Validate the performance impact on your solution by testing with the planned maximum of

downstream devices and messages per edge node.

Be aware of the performance impact that message translation or enrichment can have for the edge, IoT

Hub or the cloud event processing.

Consider separating workloads by time constraint and by required latency and response times, for example
a response within seconds versus batch processing per hour. Hybrid hardware SoCs can support this at the
device level.

On the edge use priority queues to separate different streams of data with different priorities and time to

live. For example, alarms should always be sent first but have a lower time to live compared to telemetry.

In the cloud, use different consumer groups on Event Hubs to separate out different streams of data, so that
you can handle and scale alarms differently from telemetry.

In the cloud use IoT Hub routes to separate out different streams of data, with filtering and separate

endpoints. IoT Hub message routing will add some latency.

Use the right messaging construct (Event Hubs, Event Grids or Service Bus) to distribute workloads while

protecting against back pressure in the cloud.

Overly complex IoT Hub routing rules can have an impact on the throughput. This is especially true for

routing rules with message body JSON filters where every message needs to be deserialized and

scanned.

Consider using out-of-the-box service integration between IoT Hub and data destinations (like Azure Data
Lake Storage and Azure Data Explorer), as these have already been optimized for high performance
throughput.

Integration layer

Loose coupling

Reporting

DevOps

Optimize deployments and configuration

Monitoring

Development and deployment

Consider using the Event Hubs SDK to develop custom ingestion from an IoT Hub and the included event

processor. The event processor can rebalance devices and hosts.

Consider separating the storage needed for data ingestion and event processing from the storage needed

for reporting and integration.

Account for the right number of IoT Hub partitions and consumer groups regarding the number of

simultaneous data readers and the required throughput.

Consider using the data storage that fits the need based on the required throughput, size, retention

period, data volume, CRUD requirements, and regional replication (ADLS, ADX, SQL, or Azure Cosmos DB;

see also Select an Azure data store for your application).
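
The edge recommendation above (alarms sent first, with a shorter time to live than telemetry) can be sketched as a small in-memory priority queue. This is a minimal illustration, not an IoT Edge API: the `EdgeQueue` class, the `ALARM` and `TELEMETRY` priority values, and the TTL numbers in the usage note are all hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical priority scheme: a lower number is sent first.
ALARM = 0
TELEMETRY = 1


@dataclass(order=True)
class QueuedMessage:
    priority: int
    sequence: int  # tie-breaker that keeps FIFO order within one priority
    expires_at: float = field(compare=False)
    body: str = field(compare=False)


class EdgeQueue:
    """Priority queue that drops messages whose time to live has passed."""

    def __init__(self) -> None:
        self._heap = []
        self._sequence = itertools.count()

    def put(self, body: str, priority: int, ttl_seconds: float, now: float) -> None:
        message = QueuedMessage(priority, next(self._sequence), now + ttl_seconds, body)
        heapq.heappush(self._heap, message)

    def get(self, now: float) -> Optional[str]:
        # Pop in priority order and skip anything that has already expired.
        while self._heap:
            message = heapq.heappop(self._heap)
            if message.expires_at > now:
                return message.body
        return None
```

For example, if a telemetry message with a TTL of 3600 seconds and an alarm with a TTL of 60 seconds are both queued at time 0, a `get` at time 10 returns the alarm first, while a `get` at time 120 skips the expired alarm and returns the telemetry.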

The integration layer is the connection between your IoT solution and the business applications utilizing the IoT

data.

Separate the IoT solution/ingestion pipeline from processing/integration.

Use well defined and versioned APIs for access to IoT data and commands.

Make sure complex queries or load from the integration layer don't affect the performance of data

ingestion.

Consider a separate data store for integration.

Avoid giving end users tools that run user-defined queries against IoT data storage.

Consider a separate data store for reporting.

Consider the use of a connected registry for local caching and deployment of container images.

Use the deployment mechanisms provided by IoT Hub to update deployments to multiple devices at

once, including devices and gateways.

Use device twins and module twins to update configurations of devices using IoT Hub in a scalable and

efficient way.

Collect IoT Hub metrics using Azure Monitor, with alerts for critical metrics. See Monitoring Azure IoT

Hub data reference.

Set Azure Service Health alerts to trigger notifications when the status of IoT Hub changes.

Consider setting up Azure Monitor alerts based on current scale limits such as device-to-cloud messages

sent per second. It's also good practice to set the alert to a percentage of the limit, such as 75%, to allow

for advance notification of upcoming scalability limits.

Consider setting up Azure Monitor alerts for metrics and logs such as "Number of throttling errors".
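
The percentage-of-limit guidance above comes down to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the limit must come from the documented quota for your IoT Hub tier and unit count (the value 1000 in the usage note is a made-up number, not a real quota).

```python
def alert_threshold(limit_per_second: float, fraction: float = 0.75) -> float:
    """Return the metric value at which an early-warning alert should fire."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    return limit_per_second * fraction


def should_alert(observed_per_second: float, limit_per_second: float,
                 fraction: float = 0.75) -> bool:
    """True when the observed rate has reached the early-warning threshold."""
    return observed_per_second >= alert_threshold(limit_per_second, fraction)
```

With a hypothetical limit of 1000 messages per second, `alert_threshold(1000)` gives 750.0, so an observed rate of 800 would trigger the alert and a rate of 700 would not.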

Plan for and execute performance testing, including stress and load tests to replicate the production

environment, such as location, heterogeneous devices, and so on.

Next steps

Reliability in your IoT workload

SAP workload best practices

12/16/2022 • 2 minutes to read

SAP landing zone prerequisite

Guidance overview

| Well-Architected Framework pillar | Summary |
| --- | --- |
| Reliability | An SAP workload requires resiliency at the architecture layer. You’ll learn how to create an SAP application with high availability to process critical business data. |

SAP is one of the world’s leading producers of software solutions for business management and customer

operations. SAP provides a suite of powerful applications that you can configure to meet specific environment

and organizational needs. These applications facilitate data processing and information flow across

organizations and provide critical capabilities that drive key organizational functions.

SAP applications thrive on infrastructure that is tailored to maximize their capabilities and with cloud services

designed to optimize SAP workload functionality. Microsoft has hosted SAP instances in its data centers for

decades and has custom built infrastructure and services to run SAP and its applications as they were designed.

We’ve used this SAP experience to create a concise set of best practices for evaluating, designing, and

optimizing an SAP workload from premigration to operations. The goal of this guidance is to help you get the

most benefits from each SAP workload.

Figure 1: Using SAP workload guidance throughout the lifecycle of an SAP workload.

The guidance is designed for one SAP workload at a time. Anytime you want to optimize a specific SAP

application (S/4HANA, NetWeaver, etc.) for Azure or in Azure, you should use this guidance. Work through the

guidance as many times as you need to derive the benefits you expect. In operations, this guidance should be

paired with the Well-Architected Review assessments and health checks. We address these tools in our

guidance.

Before using this content, you should have an SAP platform landing zone in Azure. The platform landing zone

provides shared services to one or more of your SAP workloads. If you don’t have a platform landing zone, you

should use the SAP cloud adoption framework and deploy the SAP landing zone accelerator. The landing zone

accelerator establishes the required foundation for your SAP workload. For more information, see SAP cloud

adoption framework.

We built this guidance around the Azure Well-Architected Framework and its five pillars of architectural

excellence. The table below lists each pillar and provides a general summary of the articles in this set.

| Well-Architected Framework pillar | Summary |
| --- | --- |
| Security | An SAP workload contains critical business data. You’ll learn to secure your SAP applications with multiple security layers, including identity, access, input validation, data sovereignty, and encryption. |
| Cost Optimization | An SAP workload has several architecture layers and numerous resources supporting it. You’ll learn how to make sure your SAP application deployment meets performance expectations while reducing the total cost of ownership. |
| Performance Efficiency | An SAP workload needs high-performing resources to meet productivity requirements. You’ll learn how to make sure that your SAP workload meets user demands while managing costs. |
| Operational Excellence | An SAP workload spends most of its lifecycle in operations. You’ll learn how to manage an SAP workload and to keep it running. |

Next Steps

We invite you to explore SAP workload design best practices and return to this content regularly throughout the

lifecycle of your SAP workload. The content highlights critical areas of focus but also refers you to other

documentation for deeper technical insight.

For more information, see:

Azure Center for SAP Solutions

SAP workload in Azure

SAP workload architectures

Reliability

Security

Cost Optimization

Operational Excellence

Performance Efficiency

SAP workload reliability

12/16/2022 • 9 minutes to read

Conduct a reliability assessment

Create architecture reliability

Create SAP central services reliability

Create SAPMNT share reliability

A reliable SAP workload is both resilient and available. Resiliency is the ability to recover from failures and

continue to function. Availability is uptime. High availability reduces SAP application downtime during critical

maintenance and improves recovery from failures such as VM crashes, backend updates, major downtime, or

ransomware incidents. Failures happen on-premises and in the cloud, so it’s important to design your SAP

workload for resiliency and availability. Below we’ve outlined key reliability recommendations.

Before you can standardize the reliability of an SAP workload and improve areas of weakness, you need to

assess its reliability. It’s critical to know how reliable an SAP workload is so steps can be taken to fix issues or

solidify those configurations. We recommend conducting a reliability assessment on your SAP workload. The

assessment asks you questions about your workload and provides specific recommendations to focus on. The

assessment builds on itself, so you can track your progress without restarting every time. For the assessment,

start an Azure Well-Architected Review and select “SAP on Azure” when prompted.

Creating a multi-tier architecture to support an SAP workload is essential for reliability. The number of tiers and

architecture varies for each SAP application. Make sure to isolate application components from each other and

create redundancy to achieve high availability. Where applicable, you should isolate the SAP Web Dispatcher,

SAP Central Services, SAP App Server, SAPMNT Share and database. We have sample architectures for several

different SAP applications you can use to inform your design. For more information, see:

SAP S/4HANA in Linux

SAP BW/4HANA

SAP NetWeaver

SAP central services (SCS) or ABAP SAP central services (ASCS) is the basis of SAP application communication. It

consists of the message server and enqueue server. The central services layer is often a single point of failure

and must be set up for high availability to achieve SAP application resiliency. To add redundancy, create a cluster

of SAP central services with compatible shared storage technology supporting the cluster. Depending on the

operating system and available shared storage technology in general availability or private/public preview,

various options are available. Availability zones provide an opportunity to create a highly available ASCS

architecture. For more information, see:

SAP workload configurations with Azure Availability Zones

High-availability architecture for an SAP ASCS/SCS instance on Windows

High-availability architecture for an SAP ASCS/SCS instance on Linux

SAPMNT hosts the physical kernel files for the SAP application and can be a single point of failure. Several options

are available on Azure to create redundancy and architect a highly available SAPMNT share. For Linux, we recommend

using Azure Premium Files or Azure NetApp Files. For Windows-based

deployments, you should use Azure Premium Files, Azure NetApp Files, or Azure Shared Disk.

There are also a few application-specific configurations you should address for SAPMNT reliability. You need
shared directories in the environment (/sapmnt/SID and /usr/sap/trans) to deploy the SAP NetWeaver
application layer. We recommend creating highly available file systems and ensuring they're resilient. The
/sapmnt/SID and /usr/sap/SID/ASCS directories are important. You should place these file systems on NFS on
Azure Files to achieve the maximum reliability. For more general file share information, see NFS on Azure Files.
The table below provides SAPMNT guidance specific to each operating system.
| OS | Guidance |
| --- | --- |
| Windows | Cluster an SAP ASCS/SCS instance on a Windows failover cluster by using a cluster shared disk in Azure |
| Windows | Cluster an SAP ASCS/SCS instance on a Windows failover cluster by using a file share in Azure |
| Windows | High availability for SAP NetWeaver on Azure VMs on Windows with Azure Files Premium SMB for SAP applications |
| Windows | High availability for SAP NetWeaver on Azure VMs on Windows with Azure NetApp Files (SMB) for SAP applications |
| Red Hat Enterprise Linux (RHEL) | High availability for SAP NetWeaver on Azure VMs on Red Hat Enterprise Linux with NFS on Azure Files |
| Red Hat Enterprise Linux (RHEL) | Azure Virtual Machines high availability for SAP NetWeaver on Red Hat Enterprise Linux with Azure NetApp Files for SAP applications |
| SUSE Linux Enterprise Server (SLES) | High-availability SAP NetWeaver with simple mount and NFS on SLES for SAP Applications VMs |
| SUSE Linux Enterprise Server (SLES) | High availability for SAP NetWeaver on Azure VMs on SUSE Linux Enterprise Server with NFS on Azure Files |
Create database resiliency
An SAP application feeds data to multiple enterprise systems, making database resiliency a key workload
consideration. We recommend replicating production data for the highest resiliency. Cross-region replication is
the preferred disaster recovery solution. But for a more affordable option, you should configure zone
redundancy at a minimum. The method you choose depends on the database management system (DBMS)
and required business service-level agreement (SLA). Below are recommendations for the database layer.
(1) Define RPO and RTO - Creating database resiliency requires a plan to recover from data loss. A logical error on
the SAP database, a large-scale disaster, or a system outage can cause data loss in an SAP workload. Your
recovery plan should identify how much data you’re willing to lose and how fast you need to recover. The
amount of data you’re willing to lose is your recovery point objective (RPO). How fast you need to recover
is your recovery time objective (RTO). When you design for recoverability, you need to understand the desired
and actual RPO and RTO of your SAP application.
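Comparing the desired and actual objectives can be as simple as the sketch below. It assumes you already measure the actual values yourself (for example, from replication lag and from timed recovery drills); the `RecoveryObjectives` type and `meets_objectives` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rpo_seconds: float  # maximum tolerable data loss, expressed as time
    rto_seconds: float  # maximum tolerable time to restore service


def meets_objectives(desired: RecoveryObjectives, actual: RecoveryObjectives) -> dict:
    """Report, per objective, whether the measured value is within the target."""
    return {
        "rpo": actual.rpo_seconds <= desired.rpo_seconds,
        "rto": actual.rto_seconds <= desired.rto_seconds,
    }
```

For instance, a desired RPO of 0 is only met when the measured data loss window is also 0, which is the case that calls for synchronous replication.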
(2) Use synchronous replication for no data loss - In some scenarios, there’s no tolerance for data loss.
The recovery point objective is 0. To achieve this RPO, you need to use synchronous replication on the database
layer. Synchronous replication commits database transactions to database instances in two separate zones or
regions. You should measure the latency between the two instances to ensure it meets workload needs, and you
can do it with the SAP niping measuring tool. Higher network latency reduces the scalability of your
workload, and physical distance between the instances adds network latency. As a result, replication across
regions will have higher latency than across availability zones. Database replication between different regions
should be asynchronous and replication between availability zones should be synchronous. It’s important to
balance resiliency and latency in SAP workload design. For more information, see:
General Azure Virtual Machines DBMS deployment for SAP workload
High-availability architecture and scenarios for SAP NetWeaver
Network latency between and within zones
Create application server resiliency
Use backups
The goal of application server resiliency is to have multiple application servers load balance traffic and fail over
when needed. Resiliency for the SAP application server layer can be achieved through redundancy. You can
configure multiple dialog instances on different instances of Azure virtual machines with a minimum of two
application servers. Here are application server resiliency recommendations.
(1) Use Availability Sets / Availability Zones - An SAP application server can be deployed in an availability
set or across availability zones. The decision you make needs to be based on workload requirements. We
recommend you choose one method to improve resiliency, but we don’t recommend scale sets. For more
information, see availability zones for SAP.
(2) Use multiple application servers - Using multiple smaller application servers instead of one larger
application server is recommended. This setup avoids a single point of failure. It’s a best practice to configure
SAP Logon Group (SMLG) and Batch Server Group (RZ12) for better load balancing between end-user and batch
processing. For more information, see:
Azure Virtual Machines high availability for SAP NetWeaver
SAP HANA high availability for Azure virtual machines
SAP workload configurations with Azure Availability Zones
The SAP workload should implement a regular backup solution. Backups are the backbone of disaster recovery
and help ensure continuity of operations. We have a few recommendations for backup reliability.
(1) Start with Azure Backup - We recommend you use Azure Backup as the foundational backup strategy for
an SAP workload. Azure Backup is the native backup solution in Azure, and it provides multiple capabilities to
help streamline your SAP backups. With Azure Backup, we want to point out a few features.
Native database backup compatibility - Azure Backup provides native backups through the Backint connector for
SAP HANA, SQL Server, and Oracle databases used by SAP applications. Azure Backup for SAP offers an API
called Backint. Backint allows backup solutions to create backups directly on the database layer. Azure Backup
also supports the database backup capability for HANA and SQL Server databases today.
Storage backup - The storage backup feature can help optimize the backup strategy by using disk snapshots of
Azure Premium storage for selective disks. For more information on application-consistent backups, see
snapshot consistency.
Virtual Machine backup - Back up and restore Azure VM data through the Azure portal. Cross-region restoration
lets you restore Azure VMs to a paired secondary region.
Long-term retention - Azure Backup allows you to retain SAP backups for years for compliance and audit needs.
Backup Management - Azure Backup enables you to manage backups from the Azure portal with an easy user
interface.
For more information, see:
Azure Backup documentation

 
Next step
SAP HANA backup overview
Backup guide for SAP HANA on Azure Virtual Machines
Backup guide for SQL Server on Azure Virtual Machines
(2) Find marketplace solutions - Several certified third-party backup solutions exist in the Azure
Marketplace. These solutions offer vendor backup capabilities and SAP-certified backup capabilities. You should
consider layering these solutions on top of Azure Backup to generate custom solutions with foundational
support.
Microsoft partners provide solutions that are integrated with Azure Storage for archive, backup, and for
business continuity and disaster recovery (BCDR) workloads. The partner solutions take advantage of the scale
and cost benefits of Azure Storage. You can use the solutions to help solve backup challenges, create a disaster
recovery site, or archive unused content for long-term retention. They can replace tape-based backups and offer
an on-demand economic recovery site with all the compliance standards and storage features such as
immutable storage and lifecycle management.
(3) Use snapshots - A snapshot is a point-in-time copy of your data. The speed and reliability of snapshots
can help manage large databases and protect the primary database against corruption or failure. These features
make snapshots critical for disaster recovery. We have a few options to create and store backups for your SAP
workload.
Azure Backup can take database backups for HANA and SQL Server, for example. The Backup vault feature of
Azure Shared Disk can serve as your database storage solution. Azure NetApp Files (ANF) can also back up
critical data by using snapshots, such as ANF volumes Snapshot. ANF Cross Region Replication uses ANF
snapshots to replicate data from one region to another.
The right solution depends on your desired cost and availability levels. In some scenarios, you might want to
replicate your SAP on Azure data to other Azure regions for disaster recovery. However, you can achieve the
same capabilities with Azure Storage replication, such as Geo-redundant storage (GRS) or Azure Site Recovery.
For more information, see:
SAP workload configurations with Azure Availability Zones
SAP NetWeaver disaster recovery
Azure Site Recovery for SAP workloads
Azure Storage redundancy
Back up SAP HANA databases' instance snapshots in Azure VMs
(4) Implement Disaster Recovery - We recommend you invest in Disaster Recovery (DR) to improve the
reliability of the SAP workload. Disaster recovery is achieved by replicating primary data to a secondary
location. Several tools and methodologies can be used to achieve this goal. Disaster Recovery is required when the
primary location isn't accessible due to a technical or natural disaster. Disaster Recovery solutions can be across
zones within a region or across regions based on your business requirements, but we recommend DR across
regions for better resiliency. For more information, see:
Azure Site Recovery
Cross-region replication of Azure NetApp Files volumes
Cross-region snapshot copy for Azure Disk Storage
Backup and Disaster Recovery
Overview
Security

Cost Optimization

Operational Excellence

Performance Efficiency

SAP workload security

12/16/2022 • 7 minutes to read

Use identity management

Azure provides all the tools needed to secure your SAP workload. SAP applications can contain sensitive data

about your organization. You must protect your SAP architecture with secure authentication methods, hardened

networking, and encryption.

SAP on Azure is delivered in the infrastructure as a service (IaaS) cloud model. Microsoft builds security

protections into the service at the levels of the physical data center, physical network, physical host, and

hypervisor. But you're responsible for areas above the hypervisor, such as the guest operating system for SAP.

We recommend you regularly evaluate the services and technologies used to ensure your security posture

evolves with the threat landscape. Below are security recommendations for consideration.

Identity management is a framework to enforce the policies that control access to critical resources. Identity

management controls access to your SAP workload within or outside its virtual network. There are three identity

management use cases to consider for your SAP workload, and the identity management solution differs for

each.

(1) Operating system - Organizations can improve the security of Windows and Linux virtual machines in

Azure by integrating with Azure Active Directory (Azure AD). Azure AD is a fully managed identity and access

management service. Azure AD can authenticate and authorize end users' access to the SAP operating system.

You can use Azure AD to create domains that exist on Azure, or use it to integrate with your on-premises Active

Directory identities. Azure AD also integrates with Microsoft 365, Dynamics CRM Online, and many

Software-as-a-Service (SaaS) applications from partners. We recommend using System for Cross-Domain Identity

Management (SCIM) for identity propagation. This pattern enables optimal user lifecycle management. For more information,

see:

SCIM synchronization with Azure Active Directory

Configure SAP Cloud Platform Identity Authentication for automatic user provisioning

Azure Active Directory Single sign-on (SSO) integration with SAP NetWeaver

(2) SAP application – You can access the SAP application with the SAP frontend software (SAP GUI) or a

browser with HTTP/S. We recommend configuring single sign-on (SSO) using Azure Active Directory or Active

Directory Federation Services (AD FS). SSO allows end users to connect to SAP applications via browser where

possible. For more information, see:

SAP HANA SSO

SAP NetWeaver SSO

SAP Fiori SSO

SAP Cloud Platform SSO

SuccessFactors SSO

Azure Active Directory overview

(3) SAP PaaS and SaaS application - We recommend consulting the SAP Identity Authentication Service for
SAP Analytics Cloud, SuccessFactors, and SAP Business Technology Platform. You can also integrate services
from the SAP Business Technology Platform with Microsoft Graph using Azure AD and the SAP Identity
Authentication Service. For more information, see:
Using Azure Active Directory to secure access to SAP platforms and applications
SAP Identity Authentication Service
SAP Identity Provisioning Service
A common customer scenario is deploying an SAP application into Microsoft Teams. This solution requires SSO
with Azure AD. We recommend browsing the Microsoft commercial marketplace to see which SAP apps are
available in Microsoft Teams. For more information, see:
Microsoft AppSource marketplace
The table below provides a summary of the recommended SSO method for the given SAP solution.
| SAP solution | SSO method |
| --- | --- |
| SAP NetWeaver-based web applications such as Fiori and WebGUI | Security Assertion Markup Language (SAML) |
| SAP GUI | Kerberos with Windows Active Directory, Azure AD Domain Services, or a third-party solution |
| SAP PaaS and SaaS applications such as SAP Business Technology Platform (BTP), Analytics Cloud, Cloud Identity Services, SuccessFactors, Cloud for Customer, Ariba | SAML / OAuth / JSON Web Tokens (JWT) and pre-configured authentication flows with Azure AD directly or by proxy with the SAP Identity Authentication Service |
Use role-based access control (RBAC)
Enforce network and application security
It’s important to control access to the SAP workload resources you deploy. Every Azure subscription has a trust
relationship with an Azure AD tenant. We recommend you use Azure role-based access control (Azure RBAC) to
grant users within your organization access to the SAP application. Grant access by assigning Azure roles to users
or groups at a certain scope. The scope can be a subscription, a resource group, or a single resource. The scope
depends on the user and how you’ve grouped your SAP workload resources. For more information, see Azure
AD trust relationship and Azure RBAC.
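Because a role assigned at a scope is inherited by everything beneath that scope, you can reason about scopes as path prefixes of a resource ID. The sketch below only illustrates that idea: `scope_covers` is a hypothetical helper (not an Azure SDK function), the subscription ID and resource names in the usage note are placeholders, and it assumes resource IDs are compared case-insensitively.

```python
def scope_covers(assignment_scope: str, resource_id: str) -> bool:
    """Return True if a role assignment at assignment_scope applies to resource_id.

    Scopes are compared segment by segment so that a resource group named
    'sap-prod' does not accidentally match one named 'sap-prod2'.
    """
    parent = [part.lower() for part in assignment_scope.strip("/").split("/")]
    child = [part.lower() for part in resource_id.strip("/").split("/")]
    return child[: len(parent)] == parent
```

For example, an assignment at `/subscriptions/<id>/resourceGroups/sap-prod` covers a VM inside that resource group but not resources in a sibling group such as `sap-dev`, which is one reason to group SAP workload resources deliberately.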
Network and application security controls are baseline security measures for every SAP workload. Their
importance bears repeating to reinforce the idea that the SAP network and application require rigorous security
review and baseline controls.
(1) Azure Network Design - It’s critical to differentiate between shared services and SAP application services.
A hub-spoke architecture is a good approach to security. You should keep workload-specific resources in their own
virtual network, separate from the shared services in the hub, such as management services and DNS.
For SAP-native setups, you should use SAP Cloud Connector and SAP Private Link for Azure as part of the hub-spoke
setup. These technologies support the SAP extension and innovation architecture for the SAP Business
Technology Platform (BTP). Azure-native integrations are fully integrated with Azure virtual networks and APIs and
don’t require these components.
(2) Virtual network security - Network security groups (NSGs) allow you to filter network traffic to and from
your SAP workload. You can define NSG rules to allow or deny access to your SAP application. For example, you can
allow access to the SAP application ports from on-premises IP address ranges and deny public internet access.
For more information, see network security groups.
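NSG rules are evaluated in priority order (the lowest number first), and the first matching rule decides the outcome. The sketch below models that behavior to show how an "allow on-premises, deny internet" pair of rules interacts. It is a simplified illustration rather than the NSG API: the `Rule` type and `is_allowed` function are hypothetical, the address range and port in the usage note are example values, and real NSGs also include default rules that are not modeled here.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    priority: int        # lower number is evaluated first
    source_cidr: str     # source address range in CIDR notation
    port: Optional[int]  # destination port, or None for any port
    allow: bool


def is_allowed(rules: list, source_ip: str, port: int) -> bool:
    """Apply the first matching rule by priority; deny when nothing matches."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = address in ipaddress.ip_network(rule.source_cidr)
        port_matches = rule.port is None or rule.port == port
        if in_range and port_matches:
            return rule.allow
    return False
```

With a rule at priority 100 that allows a hypothetical on-premises range `10.0.0.0/8` on port 3200, and a rule at priority 4096 that denies `0.0.0.0/0` on any port, traffic from `10.1.2.3` to port 3200 is allowed, while traffic from a public address, or from on-premises to any other port, is denied.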
(3) Application security - In general, the security best practices for application development also apply in the

cloud. These include things like protecting against cross-site request forgery, thwarting cross-site scripting (XSS)

attacks, and preventing SQL injection attacks.

Application security groups (ASGs) make it easier to configure the network security of a workload. The ASG can

be used in security rules instead of explicit IPs for VMs. The VMs are then assigned to the ASG. This configuration

supports the reuse of the same policy over different application landscapes, because of this abstraction layer.

Cloud applications often use managed services that have access keys. Never check access keys into source

control. Instead, store application secrets in Azure Key Vault. For more information, see application security

groups.

(4) Internet facing SAP workload – An internet-facing workload must be protected using services like Azure

Firewall, Web Application Firewall, and Application Gateway to create separation between endpoints. For more

information, see inbound and outbound internet connections for SAP on Azure.

Encrypt data

Azure includes tools to safeguard data according to your organization's security and compliance needs. It's

essential that you encrypt SAP workload data at rest and in transit.

(1) Encrypt data at rest – Encrypting data at rest is a common security requirement. Azure Storage service-side

encryption is enabled by default for all managed disks, snapshots, and images. Service-side encryption

uses service-managed keys by default, and these keys are transparent to the application.

We recommend you review and understand service/server-side encryption (SSE) with customer-managed keys

(CMKs). The combination of server-side encryption and a customer-managed key allows you to encrypt data at

rest in the operating system (OS) and data disks for available SAP OS combinations. Azure Disk Encryption

doesn’t support all SAP operating systems. The customer-managed key should be stored in Key Vault to help

ensure the integrity of the operating system. We also recommend encrypting your SAP databases. Azure Key

Vault supports database encryption for SQL Server from the database management system (DBMS) and other

storage needs. Below is an encryption workflow to help you visualize the encryption process.

When you use client-side encryption, you encrypt the data and upload the data as an encrypted blob. Key

management is done by the customer.

For more information, see:
Server-side encryption for managed disks
Azure Storage service-side encryption
Service-side encryption using customer-managed key in Azure Key Vault
Client-side encryption
(2) Encrypt data in transit – Encryption in transit applies to the state of data moving from one location to
another. Data in transit can be encrypted in several ways, depending on the nature of the connection. For more
information, see encryption of data in transit.
Collect and analyze SAP application logs
Application log monitoring is essential for detecting security threats at the application level. We recommend
using the Microsoft Sentinel Solution for SAP. It’s a cloud-native security information and event management
(SIEM) solution built for your SAP workload running on a VM. For more information, see Microsoft Sentinel
Solution for SAP.
For general security information, see:
Azure security documentation
Trusted Cloud eBook
Next Step
Overview
Reliability
Cost Optimization
Operational Excellence
Performance Efficiency

SAP workload cost optimization

12/16/2022 • 6 minutes to read

Optimize workload compute costs

Microsoft makes significant investments in the fast evolution of its hardware to provide more value for less. The

frequent increase in Azure hardware capability provides regular opportunity for an SAP workload to optimize

costs, eliminate waste, and improve technology. To align Azure and your SAP workload, we recommend creating

a plan for each SAP workload. The plan should contain the objectives and motivations for the workload.

Organizational objectives and investment priorities should drive cost optimization initiatives. Below are cost-

optimization recommendations for your SAP workload.

Compute cost optimization is achieved through planning, monitoring, and resizing VMs throughout the SAP

workload lifecycle. VMs provide the compute power for the SAP application and have a direct effect on cost and

performance. We recommend monitoring the compute costs of an SAP workload to ensure the dollars spent are

helping you meet organizational goals. Here are cost-optimization recommendations for SAP workload

compute.

(1) Choose the right VM type - Azure has SAP-certified VMs for your workload. The wrong VM type will

require larger sizes to get the performance you need, increasing cost without benefit. A smaller VM of the correct

type can give you equal or better performance than a large instance of the wrong type. Azure offers a list of

SAP-certified configurations to help you understand what VMs work well with your business needs.

For more information, see SAP certified infrastructure.

Memory-optimized VMs can meet the requirements of most SAP applications. An SAP online analytical

processing (OLAP) workload and an online transactional processing (OLTP) workload should use M-series VMs.

Examples of an OLTP workload include SAP HANA, SAP Business Suite on HANA, and SAP S/4HANA. Examples

of OLAP workloads include SAP BW on HANA and BW/4HANA. SAP Business One on HANA pairs with M-series

VMs.

SAP NetWeaver, Business All-in-One, Business Suite Software, and BusinessObjects BI have broader alignment

with different VM types. They can run on VMs in the D, G, E, and M-series.

(2) Optimize compute cost throughout migration - Many SAP journeys start on-premises, so it’s

important to plan for compute optimization throughout the migration of a workload. Make sure to follow best

practices across every SAP migration. You can find process guidance in our CAF SAP documentation

(/azure/cloud-adoption-framework/scenarios/sap/). With the broader process guidance in place, you’ll still need

to customize your compute optimization for each SAP workload. We want you to consider pre-migration and

post-migration milestones.

Optimize pre-migration – Pre-migration optimization ensures you have sufficient cloud resources provisioned

to support the expected migration runtime of the SAP workload. You need to verify that the Azure VM meets the

technical requirements of the on-premises workload. Planning shortens the migration time of a workload,

and a shorter migration keeps costs lower.

Optimize post-migration – Post-migration optimization focuses on the end-user experience. This step coincides

with the hypercare period, a time of elevated customer service to make sure that the workload is performing.

You should monitor the workload as users begin to interact with it. The performance metrics might indicate that

you need to downsize the VM or switch to a different VM type.
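
The post-migration check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 40% and 85% thresholds, and the use of a nearest-rank 95th percentile are assumptions, not values from SAP or Azure guidance. Feed it utilization samples (in percent) exported from whatever monitoring tool you use.

```python
def p95(samples):
    """Return the 95th percentile (nearest-rank method) of utilization samples."""
    ordered = sorted(samples)
    # ceil(0.95 * n) computed with integer arithmetic
    rank = max(1, -(-95 * len(ordered) // 100))
    return ordered[rank - 1]


def rightsizing_hint(cpu_samples, mem_samples, low=40.0, high=85.0):
    """Suggest a sizing action from CPU and memory utilization (percent).

    The low/high thresholds are hypothetical; tune them per workload.
    """
    cpu, mem = p95(cpu_samples), p95(mem_samples)
    if cpu > high or mem > high:
        return "upsize or change VM type"
    if cpu < low and mem < low:
        return "candidate for downsizing"
    return "keep current size"
```

A percentile is used instead of the mean so that short, legitimate demand spikes during hypercare still count against downsizing.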

(3) Optimize in operations - It’s important to optimize VMs in operations for the most cost savings. By VM
operations, we're referring to the daily management of an SAP workload. This phase of a workload brings the
ability to predict the compute needs. It’s important to see how user demand affects compute needs over time.
The VM choice should change along with the SAP workload’s requirements. Here are cost-saving
recommendations for operations.
Use Azure Advisor – We recommend you use Azure Advisor to identify VM usage that needs optimization. For
more information, see Azure Advisor cost optimization.
Enforce VM governance – You should enforce VM governance at VM creation as a cost and security best
practice. Every VM deployment increases the operational cost and attack surface of the SAP workload. VMs
created outside of the governance process tend to create unneeded expense and have more vulnerabilities. We
recommend using Azure Policy to enforce VM governance for your SAP workload. For more information, see
Azure Policy for SAP.
Stop a non-critical SAP workload – SAP workloads have different levels of criticality. Some workloads aren’t
needed on nights and weekends. A sandbox environment is a good example of low criticality. We recommend
stopping VMs that support non-critical workload environments to reduce costs. You can automate this process
for the SAP workload in Azure. For more information, see automate VM start and stop.
Use Reserved Instances – Any SAP workload that needs to run continuously should use reserved instances to
optimize cost. For budget predictability, you can make an advance purchase for one or three years in a
specified region. For more information, see Azure Reservations.
Use the Azure Hybrid Benefit – Azure lets you use on-premises Software Assurance-enabled Windows Server
and SQL Server licenses. The benefit also applies to Red Hat and SUSE Linux subscriptions. This benefit can generate
significant savings for a hybrid SAP workload. For more information, see hybrid licensing benefit.
For more information, see:
SAP on Azure
SAP NetWeaver
SAP HANA install
SAP HANA configuration

Optimize workload storage

We recommend optimizing the storage cost for your SAP workload. Storage is an essential component of an
SAP workload. Storage contains active data and backup data that is critical to your organization. Storage affects
the performance, availability, and recoverability of an SAP workload. It's important to have the right
performance at the right cost. Here are recommendations to help you reach this goal.
(1) Use reserved capacity storage type - There are several storage options available to choose from based
on the workload requirement. Managed disks, blob storage, and backup storage can support an SAP workload
in various combinations. Each of these options comes with storage reservation options that lower overall costs
for persistent data. For more information, see:
Azure disk reserved capacity
Blob storage reserved capacity
Azure Backup Storage reserved capacity
(2) Use lifecycle management policies - Other than reserved capacity, you need to ensure the data-retention
period is right for the SAP workload. An SAP database backup can be large and add to the storage cost
if not optimized. We recommend that you create a lifecycle policy that meets the recovery time objective (RTO)
and recovery point objective (RPO) of your SAP workload. The policy should move data into Premium, Standard,
Cold, Archive storage based on its age and business requirements.
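
As a rough illustration of an age-based lifecycle rule, the sketch below maps the age of a backup to one of the tiers named above. The day thresholds (7, 30, 180) and the function name are hypothetical; derive real values from your RTO, RPO, and retention requirements, and confirm that retrieval time from the coldest tier still fits your RTO.

```python
def storage_tier(age_days, thresholds=((7, "Premium"), (30, "Standard"), (180, "Cold"))):
    """Return the storage tier for data of the given age in days.

    thresholds is an ordered sequence of (maximum age in days, tier name);
    anything older than the last threshold goes to Archive.
    The default thresholds are illustrative assumptions only.
    """
    for max_age, tier in thresholds:
        if age_days <= max_age:
            return tier
    return "Archive"
```

For example, with the default thresholds a 90-day-old database backup would be placed in the Cold tier, and a backup older than 180 days in Archive.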

Optimize the SAP application

Next Step

Optimizing your SAP application can lower the total cost of ownership without reducing capabilities. The goal is

to generate the maximum return on investment (ROI). Here are a few ways to optimize an SAP application.

(1) Identify application responsibility - Optimizing an SAP application should be the responsibility of the

customer business application team. Having someone or a group responsible for costs will help drive decisions

that optimize costs over the lifecycle of the SAP workload.

(2) Rationalize and rearchitect - You should consider rationalizing or rearchitecting the SAP application,

especially during migrations. SAP S/4HANA often replaces older SAP applications, which can then be kept as legacy

systems. The SAP WAF assessment can help validate rearchitecting efforts and should be conducted on a periodic

basis. For more information, see Azure Well-Architected Review.

(3) Minimize investment in legacy systems - You should host a legacy SAP application on the minimum

supported architecture to help reduce cost. A legacy application is slower and less performant. Any legacy

systems that remain after rationalizing and rearchitecting should receive the minimum spend possible and be

retired when appropriate. For more information, see Azure Cost Management.

Overview

Reliability

Security

Operational Excellence

Performance Efficiency

SAP workload operational excellence

12/16/2022 • 4 minutes to read

Use health checks and assessments

Operational excellence is about creating efficient processes to support your SAP workload. Operations will be

the longest phase of the SAP workload lifecycle, and teams must be equipped with operational best practices to

manage the day-to-day tasks. Failure in operations will affect the other design areas and the overall success of

the SAP workload. It’s critical to tailor your operational processes to the SAP workload. Below are

recommendations to drive operational excellence.

Standard operating procedures (SOPs) are documented processes for managing a workload. Each SAP workload

should have SOPs to govern operations. Without SOPs, teams drift from management best practices, so we

recommend a continuous cycle of assessment and health checks for your SAP workload.

(1) Health checks - We have four Azure SAP (AzSAP) health checks: (1) deployment checklist, (2) inventory

checklist, (3) quality checks, and (4) Linux VM OS analyzer. The image below shows how they share a cycle with

our Azure SAP assessments. For more information on the health checks, see SAP quality checks.

Figure 1: The cycle of SAP health checks and assessments throughout the journey.

(2) Assessments - We have three SAP assessments: (1) landing zone accelerator (LZA), (2) Azure SAP (AzSAP)

deployment management assessment, and (3) the AzSAP Well-architected framework assessment. These

assessments are designed for different stages in the SAP workload lifecycle.

The AzSAP Well-architected framework assessment is for operations. It compares your SAP operations against

SAP workload best practices. The assessment encourages continuous improvement by building on each

previous assessment.

Figure 2: How the Well-architected assessment creates milestones and builds on these milestones over time.

The initial assessment creates a baseline, and the next iteration of assessment uses the previous assessment as

the starting point. It will maintain the selections from the last assessment to track and review the design

principle. Because the assessment builds on itself, you can track improvements over time. The assessment is

designed for an existing SAP workload in Azure and can assess one or more of the WAF pillars.

Monitor the workload

Automate workload infrastructure

Benefit domain | Automated deployment benefits | Manual deployment disadvantages
Knowledge | Works immediately after some initial preparation time. Requires little domain knowledge. | Requires specialized knowledge in many domains outside of SAP.
Time | Consumes predictable time from 30 minutes to a couple of hours. | Can take much more time, depending on the size of the SAP landscape.
Cost | Makes automated deployments cheap due to less time spent. | Expensive due to more time spent.

We recommend using this SAP assessment to develop and realign the SOPs for your SAP workload. The

assessment identifies areas of strength and weakness that allow you to build better SOPs. For more information,

see Azure Well-Architected Review.

Monitoring is the process of collecting, analyzing, and acting on data gathered from an SAP workload.

Monitoring provides insight into the health of the workload to compare with an expected baseline. It allows you

to know when, where, and why failures occur.

A monitoring best practice is to use a common and consistent logging schema that lets you correlate events

across systems. The monitoring and diagnostics process has several distinct phases:

Instrumentation - Generating the raw data from application logs, web server logs, the diagnostics built into

the Azure platform, and other sources.

Collection and storage - Consolidating the data into one place.

Analysis and diagnosis - Troubleshooting issues and seeing the overall health.

Visualization and alerts - Using data to spot trends or alert your operations team.

We recommend using Azure Monitor for SAP solutions to drive these processes. Azure Monitor for SAP solutions is an

Azure-native monitoring product for SAP landscapes that run on Azure. It uses

specific parts of the Azure Monitor infrastructure to provide insights into the monitoring of SAP NetWeaver, SAP

HANA, SQL Server, and Pacemaker high-availability deployments on Azure. For more information, see Azure

Monitor for SAP Solutions.
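
To make the idea of a common, consistent logging schema concrete, here is a minimal Python sketch. It is not the schema used by Azure Monitor for SAP solutions: the field names, the `make_event` and `correlate` helpers, and the sample system IDs in the usage note are all assumptions chosen for illustration. The point is that every component emits the same fields, including a shared correlation ID, so events can be joined across systems.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(system, component, level, message, correlation_id=None):
    """Build a log event using one schema shared by all components (hypothetical fields)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
        "system": system,          # for example, an SAP system ID
        "component": component,    # for example, "app-server" or "hana-db"
        "level": level,
        "message": message,
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }


def to_json_line(event):
    """Serialize an event as one JSON line, a common format for log collection."""
    return json.dumps(event, sort_keys=True)


def correlate(events, correlation_id):
    """Return all events that share a correlation ID, ordered by timestamp."""
    matching = [e for e in events if e["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda e: e["timestamp"])
```

With a schema like this, an application-server timeout and a slow database log write that carry the same correlation ID can be pulled up together during analysis and diagnosis.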

You should use infrastructure as code (IaC) to automate SAP workload deployments with minimal human

intervention and build a scalable and consistent SAP workload on Azure. The manual process of creating the

required SAP workload resources is slow and allows for errors. Microsoft has a repository of SAP deployment

templates that you should use. It’s called the SAP on Azure Deployment Automation Framework. The templates

support SAP HANA and NetWeaver with any database on any SAP-supported operating system. For more

information, see:

SAP deployment automation framework

SAP automate repository

Azure Monitor for SAP solutions

The table below outlines benefits of automated deployments with IaC.

Benefit domain | Automated deployment benefits | Manual deployment disadvantages
Testing | Provides templates that include test instrumentation during deployment and migration. | Allows for limited testing. Requires more work to inject tests in the process.
Scaling | Allows you to easily scale up, down, and out. Provides new deployment templates. | Takes more time to scale and customize the environment.
Standardization | Applies your defined standards with each deployment. | Sometimes leads to unwanted variations in design.

Next Step

Overview

Reliability

Security

Cost Optimization

Performance Efficiency

SAP workload performance efficiency

12/16/2022 • 5 minutes to read

Optimize workload compute

Optimize workload storage

Performance efficiency is about accelerating digital transformation with less. The goal is to get the most out of

your SAP workload and meet user demand without over-provisioning or under-provisioning resources. Inefficient

performance can degrade user experience or inflate costs. Performance affects productivity for internal

applications. It determines growth for public facing applications. Designing an SAP workload that can't meet

user demand will slow the application. Overcompensating with too much compute power will drive up cost

needlessly. These scenarios are avoidable with the right guidance. Below are recommendations to drive

performance efficiency for your SAP workload.

Compute is the core that powers an SAP application. Compute includes the hardware, number of cores, and

memory. These features are foundational to organizations. If you don’t optimize your compute configuration, an

SAP workload will be unable to meet spikes in user demand or stay within predefined budgets. It’s important

to know the demands on your workload and match those demands with the compute you use for your SAP

workload. Here are some compute performance considerations.

(1) Conduct reference sizing for on-premises workload - Reference sizing is the process of checking the

configurations and resource utilization data of an SAP workload on-premises. Reference sizing data shows the

current compute needs of the workload, and these needs should be matched in Azure. To find this information,

use the SAP OS Collector. SAP OS Collector retrieves system utilization information that can be reported via SAP

transaction OS07N and the EarlyWatch Alert. Any system performance and statistics gathering tools can collect

similar information.

(2) Use SAP Quick Sizer for a new workload - SAP Quick Sizer is a free web-based tool developed by SAP

that translates business requirements into technical requirements. Use this tool when you build a new SAP

workload to find the Azure VM with the correct network and storage throughput. For more information, see SAP

quick sizer.

It’s important to choose the appropriate storage solutions to support the data needs of the SAP workload. The

correct solution can improve the performance of existing capabilities and allow you to add new features. In

general, storage needs to meet the input/output operations per second (IOPS) requirements and throughput

needs of the SAP database. For more information, see storage types for an SAP workload.

(1) Use storage that supports performance requirements - Microsoft supports different storage

technologies to meet your performance requirements. For an SAP workload, you can use Azure managed disks (for

example, Premium SSD, Premium SSD v2, Standard SSD) and Azure NetApp Files.

(2) Configure storage for performance - We've published a storage configuration guideline for SAP HANA

databases. It covers production scenarios and a cost-conscious non-production variant. Following the

recommended storage configurations will ensure the storage passes all SAP hardware and cloud measurement

tool (HCMT) KPIs. For more information, see SAP HANA Azure virtual machine storage configurations.

(3) Enable write accelerator - Write accelerator is a capability for M-Series VMs on Premium Storage with

Azure Managed Disks exclusively. It’s imperative to enable write accelerator on the disks associated with the

/hana/log volume. This configuration facilitates sub-millisecond write latency for 4-KB and 16-KB block sizes.

For more information, see Azure Write Accelerator.

Optimize workload networking

(4) Choose the right VM - Choosing the right VM has cost and performance implications. The goal is to pick a

VM that supports the IOPS and throughput requirements of the SAP workload. There are three critical

areas to focus on when selecting a VM.

Number of vCPUs - The number of vCPUs has a direct effect on the licenses in the database node. Most

databases follow a core-based licensing model. Use the amount that meets your needs and adjust licensing

agreements as necessary.

Memory - Memory is critical to application performance, and your SAP application can have high memory

demands. In general, more memory provides more in-memory reads, less paging, and higher VM cost.

Throughput - Throughput determines how well an application hosted on the VM can communicate with systems

outside the VM by using its network interface cards (NICs).
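
The three selection criteria above can be combined into a simple filter. The sketch below is an assumption-laden illustration: `VmSize`, `pick_vm`, and any catalog entries you pass in are hypothetical and do not come from the SAP-certified list. Populate the catalog from the certified configurations and current pricing for your region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    """A candidate VM size. All field values are supplied by the caller."""
    name: str
    vcpus: int
    memory_gib: int
    network_mbps: int
    hourly_cost: float


def pick_vm(candidates, min_vcpus, min_memory_gib, min_network_mbps):
    """Return the cheapest candidate that meets all three requirements, or None.

    Ties on cost are broken in favor of fewer vCPUs, because core-based
    database licensing makes extra cores expensive.
    """
    fitting = [
        vm for vm in candidates
        if vm.vcpus >= min_vcpus
        and vm.memory_gib >= min_memory_gib
        and vm.network_mbps >= min_network_mbps
    ]
    return min(fitting, key=lambda vm: (vm.hourly_cost, vm.vcpus), default=None)
```

Returning None when nothing fits is deliberate: it signals that the requirements need a different VM family rather than silently picking an undersized VM.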

An SAP workload needs to communicate with other workloads. Common communication paths are to local

storage, external storage, NICs, VMs in the network, VMs in other networks, and third-party applications.

Optimize workload networking to improve these communication channels to meet workload and application

demand. If SAP network performance isn't considered, it will cause application performance issues.

(1) Understand proximity placement groups - Proximity placement groups reduce the distance between

SAP workloads. They can group different VM types under a single network spine. As the Azure footprint grows, a

single availability zone may span multiple physical data centers. The distribution across data centers can create

network latency impacting your SAP application performance. The proximity of the VMs provides lower network

latency.

The capabilities are ideal, but there are drawbacks to be aware of. Proximity placement groups often limit your

VM choices and make resizing VMs more difficult. Proximity placement groups bind VMs to a specific network

spine. This binding limits the possible combinations of different VM types. The host hardware that is needed to

run a certain VM type might not be present in the data center or under the network spine to which the proximity

placement group was assigned. The availability of VM types can be severely restricted.

We recommend using proximity placement groups in two scenarios. (1) Use proximity placement groups in

Azure regions where latency across zones is higher than recommended for the SAP workload. (2) Use proximity

placement groups with application volume groups. The Application Volume Group feature of Azure NetApp Files

(ANF) uses proximity placement groups (PPGs) to deploy ANF volumes close to the VM/compute cluster. We recommend using this feature as

designed. For more information, see:

Overview of proximity placement groups

SAP and proximity placement groups

(2) Use accelerated networking - Accelerated networking is the default for most VM deployments and is

recommended for every VM hosting an SAP workload. Accelerated networking improves network performance

by bypassing the virtual switch on the host. We recommend you enable accelerated networking on the Azure VMs

running your SAP application and database. Accelerated networking provides improved latency, jitter, and CPU

utilization. You should test the latency between the SAP application server and database with the SAP ABAP

report /SSA/CAT. It's an Inventory Check for the SAP Azure Workbook. For more information, see accelerated

networking overview.

(3) Use ExpressRoute Global Reach - ExpressRoute is a private and resilient way to connect your

on-premises networks to different Azure regions. ExpressRoute Global Reach allows you to link ExpressRoute circuits to make a

private network between your on-premises networks. Global Reach should be used for SAP HANA Large

Instance deployments to enable direct access from on-premises to your HANA Large Instance units deployed in

different regions. For more information, see ExpressRoute Global Reach.

Next Step

Overview

Reliability

Security

Cost Optimization

Operational Excellence

Sustainable workloads

12/16/2022 • 3 minutes to read

What is a sustainable workload?

Cloud efficiency overview

What are the common challenges?

This section of the Microsoft Azure Well-Architected Framework aims to address the challenges of building

sustainable workloads on Azure. Review the provided guidance that applies Well-Architected best practices as a

technical foundation for building and operating sustainable solutions on Azure.

We encourage you to also read more about the Microsoft Cloud for Sustainability for opportunities to leverage

the capabilities of that platform as part of your solution architecture. However, guidance found in this article

series is focused on any solutions you're building or operating on Azure.

Additionally, read about The Carbon Benefits of Cloud Computing: a Study of the Microsoft Cloud to learn more

about how Azure is more energy efficient and carbon efficient than on-premises solutions.

The term workload refers to a collection of application resources that support a common business goal or the

execution of a common business process, with multiple services, such as APIs and data stores, working together

to deliver specific end-to-end functionality.

With sustainability, we refer to the environmental impact of our workloads. The term sustainable workload

therefore describes the practice of designing solutions that maximize utilization while

minimizing waste, ultimately reducing the footprint on the environment.

Making workloads more cloud efficient requires combining efforts around cost optimization, reducing carbon

emissions, and optimizing energy consumption. Optimizing the application's cost is the initial step in making

workloads more sustainable.

Here's a conceptual overview of cloud efficiency in this context:

Scoring and measuring the cloud efficiency is essential to understand whether changes tracked over time have

any impact.

Learn about building more sustainable and efficient workloads by starting with the design area for sustainable

Application Design.

Building and designing sustainable workloads on Microsoft Azure can be challenging for these main reasons:

 
Is sustainability only about performance and cost?
 
What are the key design areas?
Design area | Description
Application design | Cloud application patterns that allow for designing sustainable workloads.
Application platform | Choices around hosting environment, dependencies, frameworks, and libraries.
Testing | Strategies for CI/CD pipelines and automation, and how to deliver more sustainable software testing.
Operational procedures | Processes related to sustainable operations.
Storage | Design choices for making the data storage options more sustainable.
Network and connectivity | Networking considerations that can help reduce traffic and amount of data transmitted to and from the application.
Security | Relevant recommendations to design more efficient security solutions on Azure.
 
Next step
Understanding if your workloads are in alignment with sustainability targets. It requires assessments of
current workloads to determine if they're designed in a sustainable way.
Designing workloads that are natively environmentally friendly and optimized.
Measuring and tracking the emissions of your workloads.
While performance efficiency and cost optimization are areas of strong focus for designing sustainable
workloads, the other pillars of the Well-Architected Framework are equally important when building long-term
sustainable workloads on Azure.
Security: how the security appliances in a workload are optimized and designed to auto-scale will have an
impact on the environment.
Reliability: designing reliable workloads that meet sustainability guidelines from the Green Software
Foundation can greatly reduce the workloads' carbon and electricity footprint.
Operational Excellence: how a workload is able to effectively respond to operational issues can ultimately
reduce carbon emissions.
Sustainable guidance within this series is composed of architectural considerations and recommendations
oriented around these key design areas.
Decisions made in one design area can impact or influence decisions across the entire design. The focus is
ultimately on building a sustainable solution to minimize the footprint and impact on the environment.
We recommend that readers familiarize themselves with these design areas, reviewing provided considerations
and recommendations to better understand the consequences of encompassed decisions.
Review the sustainability design methodology.

Design methodology

Design methodology for sustainable workloads on

Azure

12/16/2022 • 5 minutes to read

1—Design for business requirements

2—Evaluate the design areas using the design principles

3—Understanding your emissions

Briefly about emission scopes

Building a sustainable application on any cloud platform requires technical expertise and an understanding of

sustainability guidelines in general and for your specific cloud platform.

This design methodology aims to help establish an understanding about producing more carbon efficient

solutions, measuring your carbon impact, and ultimately reducing unnecessary energy usage and emissions.

Businesses globally have different requirements. Expect that the review considerations and design

recommendations provided by this design methodology will yield different design decisions and trade-offs for

different scenarios and organizations.

Establish your business requirements and priorities, then review the design methodologies in alignment with

those requirements.

Refer to the sustainability design principles and the design areas below for your sustainability workloads.

Decisions made within each design area will echo across other design areas. Review the considerations and

recommendations in each design area to understand the consequences and impact and any known trade-offs.

Design areas:

Application design

Application platform

Deployment and testing

Operational procedures

Storage

Network and connectivity

Security

To lower your emissions, you need to understand how to measure your sustainability efforts.

At Microsoft, we segment our greenhouse gas (GHG) emissions into three categories, consistent with the

Greenhouse Gas Protocol.

Scope 1 emissions: direct emissions that your activities create.

Scope 2 emissions: indirect emissions that result from the production of the electricity or heat you use.

Scope 3 emissions: indirect emissions from all other activities you're engaged in. For a business, these

Scope 3 emissions can be extensive. They must be accounted for across its supply chain, materials in its

buildings, employee business travel, and the life cycle of its products (including the electricity customers

consume when using the products). A company's Scope 3 emissions are often far more significant than its
Scope 1 and 2 emissions combined.
As a customer, the context of Scope 3 emissions can be network configuration and delivery, power consumption,
and devices outside the data center. If an application uses excess bandwidth or packet size, the impact extends from
when the traffic leaves the data center, through the various hops on the internet, down to the end-user device.
Reducing network bandwidth, therefore, can have a significant impact throughout the delivery chain. The same
considerations apply to compute resources, data storage, application platform decisions, application design, and
more.
Find more in-depth details and definitions in Azure's Scope 3 Methodology White Paper, published in 2021.

Measure and track carbon impact
Carbon tracking and reporting with the Emissions Impact Dashboard
Leverage the Microsoft Sustainability Manager
Use a proxy solution to measure emissions

Microsoft aligns with the Green Software Foundation, responsible for creating the Software Carbon Intensity
(SCI) specification.
To measure the carbon impact of an application, the GSF provided a scoring methodology called SCI, calculated
as follows:
SCI = ((E*I)+M) per R
Where:
E  = Energy consumed by a software system. Measured in kWh.
I  = Location-based marginal carbon emissions. Carbon emitted per kWh of energy, gCO2/kWh.
M  = Embodied emissions of a software system. Carbon that is emitted through the hardware on which the
software is running.
R  = Functional unit, which is how the application scales; per extra user, per API call, per service, etc.
With this knowledge, it's essential to consider not only the application infrastructure and hardware but also the
user devices and application scalability, as it can alter the environmental footprint considerably.
Read the full SCI specification on GitHub.
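
As a worked example of the SCI formula above, the following Python sketch computes the score for one reporting period. The function name and all input numbers are illustrative assumptions, not measurements. Embodied emissions are expressed in grams so that the units match E multiplied by I.

```python
def sci(energy_kwh, carbon_intensity_g_per_kwh, embodied_g, functional_units):
    """Compute SCI = ((E * I) + M) per R.

    energy_kwh: E, energy consumed by the software system (kWh)
    carbon_intensity_g_per_kwh: I, carbon emitted per kWh (gCO2/kWh)
    embodied_g: M, embodied emissions attributed to the software (gCO2)
    functional_units: R, for example the number of API calls or users
    """
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    return (energy_kwh * carbon_intensity_g_per_kwh + embodied_g) / functional_units
```

For instance, with hypothetical inputs of 12 kWh, 400 gCO2/kWh, 1,200 gCO2 of embodied emissions, and 10,000 API calls, the calculation is ((12 * 400) + 1200) / 10000, which gives 0.6 gCO2 per API call. Because R is the divisor, the score only compares meaningfully between periods that use the same functional unit.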
Microsoft offers the Emissions Impact Dashboard for Azure and Microsoft 365, which helps you measure your
cloud-based emissions and carbon savings potential.
We recommend you use this tool to get the insights and transparency you need to understand your carbon
footprint and to measure and track emissions over time.
Download the Emissions Impact Dashboard Power BI app for Azure to get started.
Customers using Microsoft Cloud for Sustainability can leverage Microsoft Sustainability Manager. This
extensible solution unifies data intelligence and provides comprehensive, integrated, and automated
sustainability management for organizations at any stage of their sustainability journey. It automates manual
processes, enabling organizations to record, report, and reduce their emissions more efficiently.
One way of estimating the carbon emissions from workloads is to design a proxy solution architecture based on
the SCI model as described above.
Defining the proxies for applications can be done in different ways. For example, using these variables:
Any known carbon emission of the infrastructure
The cost of the infrastructure
Edge services and infrastructure carbon emissions
The number of users that are concurrently using the application

Application performance | Application cost | Likely outcome
High | Unchanged | Optimized app
High | Lower | Optimized app
Unchanged/Lower | Higher | According to the green principles, a higher energy cost can cause higher carbon emissions. Therefore, you can assume that the app produces unnecessary carbon emissions.
High | High | The app may be producing unnecessary carbon

4—The shared responsibility model for sustainability

Ways to reduce emissions

A shared responsibility

Metrics of the application to inform us about the performance over time

By designing an equation using the above variables, you can estimate the carbon score (an approximation),

helping you understand if you're building sustainable solutions.

There's also the aspect of application performance. You can link performance to cost and carbon and assume this

relationship yields a value. With this relation, you can simplify the view like this:

Therefore, building a carbon score dashboard can make use of the following proxies:

Cost

Performance

Carbon emissions of the infrastructure (if known/available)

Usage over time (requests, users, API calls, etc.)

Any extra measurement that is relevant to the application
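
One possible shape for such a proxy equation is sketched below in Python. It is an assumption for illustration, not a formula from the SCI specification or from Microsoft: cost is converted to estimated emissions with a caller-supplied factor (`grams_per_cost_unit`, hypothetical), added to any known infrastructure and edge emissions, and normalized by usage.

```python
def proxy_carbon_score(cost, requests, known_emissions_g=0.0,
                       edge_emissions_g=0.0, grams_per_cost_unit=0.0):
    """Estimate grams of carbon per request from proxy variables.

    cost: infrastructure cost for the period
    requests: usage over the same period (requests, users, API calls, ...)
    known_emissions_g / edge_emissions_g: reported emissions, if available
    grams_per_cost_unit: assumed conversion from cost to emissions
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    estimated_g = known_emissions_g + edge_emissions_g + cost * grams_per_cost_unit
    return estimated_g / requests


def trend(previous_score, current_score):
    """Describe how the proxy score changed between two periods."""
    if current_score < previous_score:
        return "improving"
    if current_score > previous_score:
        return "worsening"
    return "unchanged"
```

The absolute value of a proxy score is only an approximation. Its main use on a dashboard is the trend over time, provided the same inputs and conversion factor are used in every period.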

Reducing emissions is a shared responsibility between the cloud provider and the customer designing and

deploying applications on the platform.

Carbon emissions can be reduced in three ways:

Carbon neutralization: compensating for carbon emissions

Carbon avoidance: not emitting carbon in the first place

Carbon removal: subtracting carbon from the atmosphere

The goal of green software is to avoid unnecessary emissions in the first place, hence actively working toward a more sustainable future. Further, carbon removal is the preferred goal for removing emissions from our atmosphere.

Microsoft is committed to being carbon negative by 2030, and by 2050 to have removed all the carbon the

company has emitted since it was founded in 1975.

As a cloud provider, Microsoft is responsible for the data centers hosting your applications.

However, deploying an application in the Microsoft cloud doesn't automatically make it sustainable, even if the

data centers are optimized for sustainability. Applications that aren't optimized may still emit more carbon than

necessary.

Let's take an example.

You deploy an app to an Azure service, but you only utilize 10% of the allocated resources. The provisioned resources are underutilized, ultimately leading to unnecessary emissions.

Consider scaling to an appropriate tier of the resource (rightsizing) or deploying more apps to the same provisioned resources.

We recommend making applications more efficient to utilize the data center capacity in the best way possible.

Sustainability is a shared responsibility goal that must combine the efforts of the cloud provider and the customers in designing and implementing applications.

Next steps

Review the design principles for sustainability.

Design principles

Design principles of a sustainable workload

12/16/2022 • 4 minutes to read

Principles of green software

Carbon efficiency

Energy efficiency

The sustainability design methodology provides a framework to record, report, and reduce or optimize the

environmental impact of your workloads.

To achieve an increase in carbon efficiency, consider how your workload, directly and indirectly, can reduce

carbon emissions through:

Using fewer physical and virtual resources

Using less energy

Using energy and resources more intelligently

Supporting older devices

It's important to effectively record, report, and reduce carbon emissions through actionable insights.

Gain transparency into your current carbon impact

Estimate savings

Take action to accelerate progress

These critical design principles for sustainability resonate with and extend the quality pillars of the Azure Well-Architected Framework: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.

Microsoft is actively working toward sustainability targets, and empowers every organization to help reduce

emissions and improve our environmental health. The Azure Well-Architected Framework workload for

sustainability aligns with the Green Software Principles from the Green Software Foundation.

The principles of green software are the starting point to understand the SCI model and how this will be

included in our framework.

Principle: Emit the least amount of carbon possible.

The application or software must emit the least amount of carbon possible. A carbon-efficient cloud application is one that is optimized, and the starting point is cost: streamlining the application infrastructure and cost helps ensure that no resources are wasted in the cloud to run the software. But this isn't enough, as you might have cost-optimized your application and still waste resources that emit carbon for no reason.

Read more about the Carbon Efficiency principle from the Green Software Foundation.

Principle: Use the least amount of energy possible.

The goal of this principle is to build applications that are energy efficient. This is a common pattern for mobile applications, since they rely on battery-powered devices and must optimize battery consumption. It's less common, however, for desktop or web applications, since until now, developers have never been asked to optimize the electricity consumption of their software.

Carbon awareness

| Technique | Description |
|---|---|
| Demand shifting | Demand shifting means moving the workloads and resources to regions or data centers, or a time in the data center where the energy supply is high and the demand is lower and can be met by renewable energy. Delaying running apps to a time when there's less demand should result in lower carbon intensity. |
| Demand shaping | Demand shaping means changing the application's behavior and appearance to match the energy supply in real-time. A good practice is to build an eco-version of the app and keep it as a benchmark for demand shaping and carbon optimization. |

Hardware efficiency

Measuring sustainability

Climate commitments

Read more about the Energy Efficiency principle from the Green Software Foundation.

Principle: Do more when the electricity is cleaner and do less when the electricity is dirtier.

We need to make the application aware of how much carbon it's emitting. This way, we can react to specific

conditions of energy supply using demand shifting and demand shaping techniques:

Read more about the Carbon Awareness principle from the Green Software Foundation.
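Demand shifting in time can be sketched as choosing the start hour whose window has the lowest forecast carbon intensity. The example below is a minimal illustration, assuming you already have an hourly forecast from some data source. The `best_start_hour` helper and the forecast values are hypothetical and aren't part of any Azure API.

```python
from typing import Sequence


def best_start_hour(forecast: Sequence[float], duration_hours: int) -> int:
    """Return the start index of the contiguous window with the lowest
    total forecast carbon intensity (for example, gCO2e/kWh per hour)."""
    if duration_hours <= 0 or duration_hours > len(forecast):
        raise ValueError("duration must fit inside the forecast")
    window = sum(forecast[:duration_hours])
    best_total, best_start = window, 0
    for start in range(1, len(forecast) - duration_hours + 1):
        # Slide the window one hour forward.
        window += forecast[start + duration_hours - 1] - forecast[start - 1]
        if window < best_total:
            best_total, best_start = window, start
    return best_start


# Hypothetical hourly forecast values, not real grid data.
forecast = [420, 390, 310, 180, 150, 210, 380, 450]
print(best_start_hour(forecast, duration_hours=2))  # 3
```

A scheduler could delay a deferrable job until the returned hour. Ties resolve to the earliest window, so the job isn't delayed longer than needed.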

Principle: Use the least amount of embodied carbon possible.

Embodied carbon is the carbon that was emitted to build a device. Therefore, a sustainable application will make

sure older devices are supported and will maximize the efficiency of each device. The goal is to build hardware-

efficient applications.

Consider the tradeoff that older devices can have power inefficiencies, and may not always be suitable.

Read more about the Hardware Efficiency principle from the Green Software Foundation.

Principle: What you can't measure, you can't improve.

Measuring carbon emissions of a cloud application is a complex task, as it involves the whole ecosystem of the

software: from the cloud infrastructure (where we have the emissions dashboards to help us out), to the

network path that is crossed, to the edge technology and user devices. With the SCI, we aren't targeting a

discrete measurement of carbon emissions, but a score that will change over time and with our optimization

techniques.

Read more about the Measurement principle from the Green Software Foundation.

Principle: Understand the exact mechanism of reduction.

Many corporations and groups have made commitments to the climate. They actively work toward new

sustainability goals with a primary objective to remove, reduce and prevent carbon emissions.

There are several options for reducing the carbon footprint of any organization or entity. However, in line with the goal of the Green Software Foundation, our main direction should always be to avoid emitting carbon in the first place. This is what we call Abatement, or Carbon Elimination.

Once we've pursued this goal, there will still be emissions that can't be avoided. The remaining carbon reduction methodologies help address them through offsetting (either compensating or neutralizing carbon).

Your company's strategy can be a mix of all the possible methodologies and, depending on the final result, can reach a Net Zero target when carbon emissions are eliminated where possible and the residual emissions are compensated.

The SCI equation aims to eliminate emissions, which should always be the primary goal of a sustainable workload, and the score can only be reduced with abatement.

Read more about the Climate Commitments principle from the Green Software Foundation.

Next steps

Review the considerations for application design.

Application design

Application design of sustainable workloads on Azure

12/16/2022 • 6 minutes to read

IMPORTANT

Code efficiency

Evaluate moving monoliths to a microservice architecture

Improve API efficiency

When building new or updating existing applications, it's crucial to consider how the solution will impact the

climate and if there are ways to improve and optimize. Learn about considerations and recommendations to

optimize your code and applications for a more sustainable application design.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

Demands on applications can vary, and it's essential to consider ways to stabilize the utilization to prevent over-

or underutilization of resources, which can lead to unnecessary energy spills.

Monolithic applications usually scale as a unit, leaving little room to scale only the individual components that

may need it.

Green Software Foundation alignment: Energy efficiency, Hardware efficiency

Recommendation:

Evaluate the microservice architecture guidance.

A microservice architecture allows for scaling of only the necessary components during peak load, ensuring idle components are scaled down or in. Additionally, it may reduce the overhead and resources required for deploying monolithic applications.

Consider this tradeoff: While reducing the compute resources required, you may increase the amount of

traffic on the network, and the complexity of the application may increase significantly.

Consider this other tradeoff: Moving to microservices can result in extra deployment overhead with

numerous similarities in deployment pipelines. Carefully consider the required deployment resources for

monolithic versus microservice architectures.

Additionally, read about containerizing monolithic applications.

Many modern cloud applications are designed to transact many messages between services and components

asynchronously. Consider the format used to encode the payload data. How much information does your

application need to communicate, and is there room to reduce the chattiness?

Green Software Foundation alignment: Energy efficiency

Recommendation:

Learn about the chatty I/O antipattern to better understand how a large number of requests can impact

performance and responsiveness.

Improve the reliability and reduce unnecessary load to your systems. Implement advanced request throttling

with API Management.

  
Ensure backward software compatibility to ensure it works on legacy hardware
Leverage cloud native design patterns
Consider using circuit breaker patterns
Optimize code for efficient resource usage
Optimize for async access patterns
Minimize the amount of data the application returns from requests by being selective and encoding the
messages. See message encoding considerations.
Cache responses to avoid reprocessing the same type of information from the backend system unless
necessary. See caching in Azure API Management.
Consider how applications render information. Does the application need to critically serve everything in the
highest quality, resulting in higher bandwidth and processing? Is there room for reducing the quality of
components in the UI to serve sustainability goals better?
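To illustrate being selective and encoding the messages, the sketch below returns only the fields a client asked for and compresses the serialized result. The helper names (`select_fields`, `encode_payload`, `decode_payload`) and the record layout are hypothetical. JSON with gzip is just one possible encoding, chosen here because it needs only the standard library.

```python
import gzip
import json
from typing import Any, Dict, Iterable, List


def select_fields(record: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested fields that exist in the record."""
    return {name: record[name] for name in fields if name in record}


def encode_payload(records: List[Dict[str, Any]], fields: List[str]) -> bytes:
    """Project each record to the requested fields, then serialize
    compactly and compress before sending it over the network."""
    slim = [select_fields(record, fields) for record in records]
    raw = json.dumps(slim, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decode_payload(blob: bytes) -> List[Dict[str, Any]]:
    """Reverse of encode_payload, as a client would do on receipt."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical records with fields that most callers don't need.
records = [
    {"id": i, "name": f"item-{i}", "description": "x" * 200, "internal_notes": "y" * 200}
    for i in range(200)
]
blob = encode_payload(records, ["id", "name"])
full = json.dumps(records).encode("utf-8")
print(len(blob) < len(full))  # True
```

Compression adds CPU work on both ends, so for very small payloads it may not pay off. Measure before applying it everywhere.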
Green Software Foundation alignment: Hardware efficiency
Recommendation:
Support more end-user consumer devices, like older browsers and operating systems. This backward
compatibility improves hardware efficiency by reusing existing hardware instead of requiring a hardware
upgrade for the solution to work.
Consider this tradeoff: If the most recent software updates have significant performance improvements,
using older software versions may not be more efficient.
Learning about cloud-native design patterns is helpful for building applications, whether they're hosted on
Azure or running elsewhere. Optimizing the performance and cost of your cloud application will also reduce its
resource utilization, hence its carbon emissions.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:
Leverage cloud-native design patterns when writing or updating applications.
Consider evaluating and preventing applications from performing operations that are likely to fail. Repeated
failures can lead to overhead and unnecessary processing that you can avoid with proper design patterns.
Green Software Foundation alignment: Energy efficiency
Recommendation:
A circuit breaker can act as a proxy for operations that might fail. It should monitor the number of recent failures and use that information to decide whether to proceed.
Study the Circuit Breaker pattern, and then consider how you can implement it in your applications.
Consider using Azure Monitor to monitor failures and set up alerts.
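The following is a minimal, single-threaded sketch of the idea, not the reference implementation of the Circuit Breaker pattern. The class name, the default threshold and timeout, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without spending resources on the operation.
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed (half-open): allow one trial call through.
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of repeating work that is likely to fail, which is where the energy saving comes from. A production version would also need thread safety and failure counting over a time window.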
Applications deployed using inefficient code may result in an inherent impact on sustainability.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:
Reduce CPU cycles and the number of resources you need for your application.
Use optimized and efficient algorithms and design patterns.
Consider the Don't repeat yourself (DRY) principle.
Evaluate server-side vs. client-side rendering

Be aware of UX design for sustainability

Demands on applications can vary, and it's essential to consider ways to stabilize the utilization to prevent over- or underutilization of resources, which can lead to unnecessary energy spills.

Green Software Foundation alignment: Energy efficiency

Recommendation:

Queue and buffer requests that don't require immediate processing, then process in batch. Designing your

applications in this way helps achieve a stable utilization and helps flatten consumption to avoid spiky

requests.

Read about optimizing for async access patterns.
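Queueing and batching can be sketched with a small in-memory buffer that hands work to a handler only once a full batch has accumulated. The `BatchBuffer` class below is a hypothetical helper for illustration. A real system would more likely use a durable queue and also flush on a timer.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class BatchBuffer(Generic[T]):
    """Collects items and passes them to a handler in batches."""

    def __init__(self, handler: Callable[[List[T]], None], batch_size: int = 100) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._handler = handler
        self._batch_size = batch_size
        self._pending: List[T] = []

    def submit(self, item: T) -> None:
        """Queue an item; process the batch once it's full."""
        self._pending.append(item)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Process whatever is pending, for example on shutdown or on a schedule."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._handler(batch)
```

For example, `BatchBuffer(write_rows, batch_size=500)` would turn 500 individual writes into a single call to a `write_rows` function that you supply.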

Determine whether to render on the server-side or client-side when building applications with a UI.

Green Software Foundation alignment: Energy efficiency, Hardware efficiency

Recommendation:

Consider these benefits of server-side rendering:

When the server's power comes from less-polluting alternatives than the client's locale.

When the hardware on the server has better processing-energy ratios.

Can use centralized caching to reduce multiple unnecessary renders.

Reducing the number of browser-to-server round-trips can be particularly important when the client's

device has a lossy link.

When the client devices are older and have slower CPUs. Users don't need to upgrade their devices to

support a modern browser.

Consider these benefits of client-side rendering:

When the end-user devices are more suitable, pushing the responsibility of rendering to the clients.

It's more efficient only to render what's needed and as requested, as opposed to rendering everything

at least once.

There's no need for a server, as you can rely on static storage.

Browser caching is used on the clients.

Consider how the UX design of a workload impacts sustainability and determine what options exist for

improving energy efficiency and reducing unnecessary network load, data processing, and compute resources.

Green Software Foundation alignment: Energy efficiency

Recommendation:

Consider reducing the number of components to load and render on pages.

Determine whether the application can render lower-resolution images and videos.

Ensuring there are no unused pages will help minimize the UX design.

Consider search and findability. Making it easier for users to find what they're looking for helps lower the

amount of data stored and retrieved.

Consider providing a lighter UI, using fewer resources and with a lower impact on sustainability, and provide

users with an informed choice.

Save energy by offering your apps and websites in dark mode, with dark backgrounds.

Opt for using system fonts when possible to avoid forcing clients to download additional fonts, which causes more network load.

Don't render full-size images as thumbnails where the browser is doing the resizing.

Using full-size images as thumbnails or resized images transfers more data, creates unnecessary network traffic, and adds client-side CPU usage due to image resizing and pre-rendering.

Update legacy code

Consider upgrading or deprecating legacy code if it's not running on modern cloud infrastructure or with the latest updates.

Green Software Foundation alignment: Hardware efficiency

Recommendation:

Identify inefficient legacy code suited for modernization.

Review if there are options to move to serverless or any of the optimized PaaS options.

Consider this tradeoff: Updating old code that might end up being deprecated can consume valuable time.

Next step

Review the design considerations for the application platform.

Application platform

Application platform considerations for sustainable workloads on Azure

12/16/2022 • 6 minutes to read

IMPORTANT

Platform and service updates

Review platform and service updates regularly

Regional differences

Deploy to low-carbon regions

Designing and building sustainable workloads requires understanding the platform where you're deploying the

applications. Review the considerations and recommendations in this section to know how to make better

informed platform-related decisions around sustainability.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

Keep platform and services up to date to leverage the latest performance improvements and energy

optimizations.

Platform updates enable you to use the latest functionality and features to help increase efficiency. Running on

outdated software can result in running a suboptimal workload with unnecessary performance issues. New

software tends to be more efficient in general.

Green Software Foundation alignment: Energy efficiency

Recommendation:

Upgrade to newer and more efficient services as they become available.

Consider backward compatibility and hardware reusability. An upgrade may not be the most efficient

solution if the hardware or the OS isn't supported.

Make use of Azure Automation Update Management to ensure software updates are deployed to Azure VMs.

The Microsoft Azure data centers are geographically spread across the planet and powered using different

energy sources. Making decisions around where to deploy your workloads can significantly impact the

emissions your solutions produce.

Learn more about sustainability from the data center to the cloud with Azure.

Learn which Azure regions have a lower carbon footprint than others to make better-informed decisions about where and how our workloads process data.

Green Software Foundation alignment: Carbon efficiency

Recommendation:

Deploy to low-carbon regions where possible. The workload uses less carbon because the data centers are more likely to be powered by renewable and low-carbon energy sources.

Consider these potential tradeoffs:

The effort and time it takes to move to a low-carbon region.

Migrating data between data centers may not be carbon efficient.

Consider the cost for new regions, including low-carbon regions, which may be more expensive.

If the workloads are latency sensitive, moving to a lower carbon region may not be an option.

Process when the carbon intensity is low

Choose data centers close to the customer

Run batch workloads during low-carbon intensity periods

Modernization

Containerize workloads where applicable

Some regions on the planet are more carbon intense than others. Therefore it's essential to consider where we

deploy our workloads and combine this with other business requirements.

Green Software Foundation alignment: Carbon efficiency, Carbon awareness

Recommendation:

Where you have the data available, consider optimizing workloads when knowing that the energy mix comes

mostly from renewable energy sources.

If your application(s) allow it, consider moving workloads dynamically when the energy conditions change.

For example, running specific workloads at night may be more beneficial when renewable sources are

at their peak.
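One way to combine carbon intensity with other business requirements is to filter candidate regions by a latency budget and then pick the one with the lowest carbon intensity. The sketch below assumes you have current intensity and latency figures for each candidate. The `RegionOption` type, the region names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RegionOption:
    name: str
    carbon_intensity: float  # for example, gCO2e/kWh (hypothetical values below)
    latency_ms: float        # measured latency to your users


def pick_region(options: Iterable[RegionOption], max_latency_ms: float) -> Optional[RegionOption]:
    """Return the lowest-carbon region that still meets the latency budget,
    or None when no candidate qualifies."""
    eligible = [option for option in options if option.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    # Break ties on carbon intensity by preferring the lower latency.
    return min(eligible, key=lambda option: (option.carbon_intensity, option.latency_ms))


candidates = [
    RegionOption("region-a", carbon_intensity=450, latency_ms=20),
    RegionOption("region-b", carbon_intensity=120, latency_ms=70),
    RegionOption("region-c", carbon_intensity=40, latency_ms=180),
]
choice = pick_region(candidates, max_latency_ms=80)
print(choice.name if choice else "no region fits")  # region-b
```

Re-evaluating the choice as conditions change is what makes the placement dynamic. The cost of moving data between regions should be weighed against the expected saving.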

Deploying cloud workloads to data centers is easy. However, consider the distance from a data center to the

customer. Network traversal increases if the data center is a greater distance from the consumer.

Green Software Foundation alignment: Energy efficiency

Recommendation:

Consider deploying to data centers close to the consumer.

Proactively designing batch processing of workloads can help with scheduling intensive work during low-

carbon periods.

Green Software Foundation alignment: Carbon awareness

Recommendation:

Where you have the data available to you, plan your deployments to maximize compute utilization for

running batch workloads during low-carbon intensity periods.

Potential tradeoffs may include the effort and time it takes to move to a low-carbon region. Additionally, migrating data between data centers may not be carbon efficient, and new regions, including low-carbon regions, may be more expensive.

Consider these platform design decisions when choosing how to operate workloads. Leveraging managed

services and highly optimized platforms in Azure helps build cloud-native applications that inherently contribute

to a better sustainability posture.

Consider options for containerizing workloads to reduce unnecessary resource allocation and to utilize the

deployed resources better.

Green Software Foundation alignment: Hardware efficiency

Recommendation:

Deploying apps as containers allows for bin packing and getting more out of a VM, ultimately reducing the need for duplication of libraries on the host OS.

Containers remove the overhead of managing an entire VM and allow deploying more apps per physical machine. Containerization also optimizes server utilization rates and improves service reliability, lowering operational costs. Fewer servers are needed, and the existing servers can be better utilized.

Consider these tradeoffs: The benefit of containerization is only realized if the utilization is high. Additionally, provisioning an orchestrator such as Azure Kubernetes Service (AKS) or Azure Red Hat OpenShift (ARO) for only a few containers would likely lead to higher emissions overall.

Evaluate moving to PaaS and serverless workloads
Use SPOT VMs where possible
Right sizing
Turn off workloads outside of business hours
Utilize auto-scaling and bursting capabilities
Managed services are highly optimized and operate on more efficient hardware than other options, contributing
to a lower carbon impact.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:
Build a cloud-native app without managing the infrastructure, using a fully managed and inherently
optimized platform. The platform handles scaling, availability, and performance, ultimately optimizing the
hardware efficiency.
Review design principles for Platform as a Service (PaaS) workloads.
Think about the unused capacity in Azure data centers. Utilizing the otherwise wasted capacity—at significantly
reduced prices—the workload contributes to a more sustainable platform design.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
By utilizing SPOT VMs, you take advantage of unused capacity in Azure data centers while getting a
significant discount on the VM.
Consider the tradeoff: When Azure needs the capacity back, the VMs get evicted. Learn more about the SPOT
VM eviction policy.
Ensuring workloads use all the allocated resources helps deliver a more sustainable workload. Oversized
services are a common cause of more carbon emissions.
Operating idle workloads wastes energy and contributes to added carbon emissions.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:
Dev and testing workloads should be turned off or downsized when not used. Instead of leaving them
running, consider shutting them off outside regular business hours.
Learn more about starting/stopping VMs during off-hours.
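A scheduled automation job could use a check like the following to decide whether dev and test workloads should be running. This is a minimal sketch: the `should_be_running` helper and the 08:00 to 18:00 weekday window are assumptions, and it deliberately ignores time zones and public holidays.

```python
from datetime import datetime, time


def should_be_running(now: datetime,
                      start: time = time(8, 0),
                      end: time = time(18, 0)) -> bool:
    """True on weekdays between start (inclusive) and end (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start <= now.time() < end


print(should_be_running(datetime(2022, 12, 16, 10, 0)))  # True (a Friday morning)
print(should_be_running(datetime(2022, 12, 17, 10, 0)))  # False (a Saturday)
```

The job would stop or downsize the workload whenever the function returns `False` and start it again when it returns `True`.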
It's not uncommon for compute workloads to be oversized, with much of the capacity never utilized, which ultimately wastes energy.
Green Software Foundation alignment: Hardware efficiency
Recommendation:

  
Match the scalability needs
Evaluate Ampere Altra Arm-based processors for Virtual Machines
Delete zombie workloads
Next step
Review auto-scaling guidance for Azure workloads.
Review the B-series burstable virtual machine sizes.
Consider that it may require tuning to prevent unnecessary scaling during short bursts of high demand, as
opposed to a static increase in demand.
Consider the application architecture as part of scaling considerations. For example, logical components
should scale independently to match the demand of that component, as opposed to scaling the entire
application if only a portion of the components needs scaling.
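The tuning mentioned above can be illustrated with a trigger that signals a scale-out only when utilization stays above a threshold for several consecutive samples, so that a short burst doesn't add capacity. The `SustainedLoadTrigger` class and its default values are hypothetical. Managed autoscale features typically expose comparable settings, such as evaluation windows and cooldown periods.

```python
from collections import deque
from typing import Deque


class SustainedLoadTrigger:
    """Signals a scale-out only after utilization stays above a threshold
    for a number of consecutive samples."""

    def __init__(self, threshold: float = 0.75, required_samples: int = 5) -> None:
        if required_samples < 1:
            raise ValueError("required_samples must be at least 1")
        self._threshold = threshold
        # Keeps only the most recent `required_samples` observations.
        self._recent: Deque[bool] = deque(maxlen=required_samples)

    def observe(self, utilization: float) -> bool:
        """Record one sample and return True if scaling out is warranted."""
        self._recent.append(utilization > self._threshold)
        return len(self._recent) == self._recent.maxlen and all(self._recent)
```

With `required_samples=3`, the sequence 0.9, 0.5, 0.8, 0.85, 0.9 triggers only on the last sample, because that's the first time three consecutive readings exceed the threshold.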
Consider the platform and whether it meets the scalability needs of the solution. For example, having
provisioned resources with a dedicated allocation may lead to unused or underutilized compute resources.
Examples:
Provisioning an Azure App Service Environment (ASE) over an App Service plan may lead to having
provisioned compute, whether utilized or not.
Choosing the Azure API Management Premium tier instead of the consumption tier leads to unused
resources if you aren't utilizing it fully.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
Review the platform design decisions regarding scalability, and ensure the workload utilizes as much of the
provisioned resources as possible.
Consider this tradeoff: Some services require a higher tier to access certain features and capabilities
regardless of resource utilization.
Consider and prefer services that allow dynamic tier scaling where possible.
The Arm-based VMs represent a cost-effective and power-efficient option that doesn't compromise on the
required performance.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Evaluate whether the Ampere Altra Arm-based VMs are a good option for your workloads.
Read more about Azure Virtual Machines with Ampere Altra Arm–based processors on Azure.
Discover unutilized workloads and resources, and check whether there are any orphaned resources in your subscriptions.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:
Delete any orphaned workloads or resources if they're no longer necessary.
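As a starting point for finding candidates, the sketch below flags resources that have shown no activity for a given number of days. The `last_activity` mapping is hypothetical input that you would build from your own monitoring or inventory data. The result is a list to review, not a list to delete automatically.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_idle_resources(last_activity: Dict[str, datetime],
                        now: datetime,
                        max_idle_days: int = 30) -> List[str]:
    """Return the names of resources whose last recorded activity is
    older than max_idle_days, sorted for stable reporting."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(name for name, seen in last_activity.items() if seen < cutoff)


# Hypothetical inventory data.
inventory = {
    "vm-build-agent": datetime(2022, 12, 10),
    "vm-old-demo": datetime(2022, 6, 1),
    "disk-unattached": datetime(2022, 9, 15),
}
print(find_idle_resources(inventory, now=datetime(2022, 12, 16)))
```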
Review the design considerations for deployment and testing.
Deployment and testing

Testing considerations for sustainable workloads on Azure

12/16/2022 • 4 minutes to read

IMPORTANT

Testing efficiency

Run integration, performance, load, or any other intense testing during low-carbon periods

Automate CI/CD to scale worker agents as needed

Consider caching when using CI/CD agents

Organizations developing and deploying solutions to the cloud also need reliable testing. Learn about the

considerations and recommendations for running workload tests and how to optimize for a more sustainable

testing model.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

Running integration, performance, load, or any other intense testing may require a lot of processing. A well-crafted design for testing the deployed workloads can help ensure full utilization of the available resources, reducing carbon emissions.

Green Software Foundation alignment: Carbon awareness

Recommendation:

Where you have the data available to you, plan for running testing when the data center's energy mix

primarily uses renewable energy. It may, for example, be more beneficial to run testing during the night in

some regions.

Running underutilized or inactive CI/CD agents results in more emissions.

Green Software Foundation alignment: Hardware efficiency

Recommendation:

Keep the compute utilization high, based on the current demand, avoiding unnecessary capacity allocation.

Only scale out when necessary, and when not testing, scale in. Ultimately this ensures there are no idle compute resources in test environments.

Consider optimized platform services like containers over testing in a VM, utilizing the platform to reduce

maintenance.

Using caching mechanisms during CI/CD can reduce compute time and, thus, carbon emissions.

Green Software Foundation alignment: Energy Efficiency

Recommendation:

Store results from steps in a cache and reuse them between different CI/CD runs when possible. When a step takes CPU time to produce an artifact that doesn't often change between runs, save the artifact for future use so that CPU time isn't wasted producing the same artifact on every run.

If the CI/CD agent is self-hosted, use a cache local to the agent to further reduce data transfers and emissions. This ensures that the cache isn't transferred over the network, which can be a significant source of emissions.

Split large code repositories

Profiling and measuring

Assess where parallelization is possible

Assess with chaos engineering

Splitting large repositories can help the CI/CD phases, where only the parts of the code that have changed are

compiled. This reduces compute time, which ultimately lowers carbon emissions.

Green Software Foundation alignment: Energy Efficiency

Recommendation:

Split large code repositories, separating main code from libraries and dependencies.

Publish and re-use artifacts and libraries of code that are common across multiple repositories.


Measuring, profiling, and testing workloads are imperative to understanding how to best use allocated

resources.

Without properly profiling and testing workloads, it's difficult to know if it's making the best use of the

underlying platform and deployed resources.

Green Software Foundation alignment: Measuring sustainability

Recommendation:

Test your applications to understand concurrent requests, simultaneous processing, and more.

If you're running Machine Learning (ML) for tests, consider machines with a GPU for better efficiency gains.

Identify if the workload is performance intensive and work toward optimization.

Consider this tradeoff:

Running GPU-based machines for ML tests may increase the cost.

Running integration, performance, and load tests increases the reliability of a workload. Introducing chaos

engineering can further improve reliability and resilience by revealing how the applications react to

failures. In doing so, the workload can be optimized to handle failures gracefully and with less wasted resources.

Green Software Foundation alignment: Measuring sustainability

Recommendation:

Use load testing or chaos engineering to assess how the workload handles platform outages and traffic

spikes or dips. This helps increase service resilience and the ability to react to failures, allowing for a more

optimized fault handling.

Consider this tradeoff:

Injecting faults during chaos engineering and increasing the load on any system also

increase the emissions from the testing resources. Evaluate how and when you can utilize chaos

engineering to increase the workload reliability while considering the climate impact of running unnecessary

testing sessions.

Another angle to this is using chaos engineering to test energy faults or moments with higher carbon

emissions: consider setting up tests that will challenge your application to consume the minimum possible

energy. Define how the application will react to such conditions with a specific "eco" version informing users

that they're emitting the minimum possible carbon by sacrificing some features and possibly some

performance. This can also be your benchmark application for scoring its sustainability.

Establish CPU and Memory thresholds in testing

Next step

Build tests that check the sustainability of your application. Consider having a baseline CPU utilization

measurement, and detect abnormal changes to the CPU utilization baseline when tests run. With a baseline,

suboptimal decisions made in recent code changes can be discovered earlier.

Adding tests and quality gates into the deployment and testing pipeline helps avoid deploying non-sustainable

solutions, contributing to lowered emissions.

Green Software Foundation alignment: Energy efficiency

Recommendation:

Monitor CPU and memory allocations when running integration tests or unit tests.

Find abnormally high resource consumption areas in the application code and focus on mitigating those first.

Configure alerts or test failures that trigger when the established baseline values are exceeded, helping avoid

deploying non-sustainable workloads.

Consider this tradeoff: As applications grow, the baseline may need to shift accordingly to avoid failing the

tests when introducing new features.
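
A baseline check of this kind could look like the sketch below. It assumes the workload under test is a plain Python callable, that CPU time and peak Python heap allocation are acceptable stand-ins for "CPU and memory", and that a 20% tolerance is appropriate; the `Baseline` type, the function names, and the tolerance are all hypothetical. Tracing allocations adds CPU overhead of its own, so the baseline should be recorded with the same harness.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable


@dataclass
class Baseline:
    cpu_seconds: float       # CPU time recorded for a known-good build
    peak_bytes: int          # peak Python heap allocation for the same run
    tolerance: float = 0.20  # allowed relative growth before the gate fails


def measure(workload: Callable[[], object]) -> tuple[float, int]:
    """Run the workload once and return (CPU seconds, peak allocated bytes)."""
    tracemalloc.start()
    start = time.process_time()
    workload()
    cpu = time.process_time() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return cpu, peak


def within_baseline(cpu: float, peak: int, baseline: Baseline) -> bool:
    """True when both measurements stay within the tolerated growth."""
    limit = 1.0 + baseline.tolerance
    return cpu <= baseline.cpu_seconds * limit and peak <= baseline.peak_bytes * limit
```

A pipeline step could call `measure` on a representative test scenario and fail the run when `within_baseline` returns `False`. Raising the stored baseline then becomes an explicit, reviewable change when new features legitimately need more resources.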

Review the design considerations for operational procedures.

Operational procedures

Operational procedure considerations for sustainable workloads on Azure

12/16/2022 • 7 minutes to read

IMPORTANT

Measure and track carbon impact

The Emissions Impact Dashboard

Define emissions target

The discipline of green software and its implementation within cloud efficiency patterns is relatively recent, and

no specific and universal standards have been agreed upon yet.

The Green Software Foundation works on creating and standardizing ways of making green software. However,

it's vital that everyone considers this aspect in their daily work and that when designing, planning, and

deploying Azure workloads, we consider the best practices that are already available and prepare our

environment to incorporate new standards when ready.

This document will guide you through setting up an environment for measuring and continuously improving

your Azure workloads' cost and carbon efficiency.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

To optimize or improve something, we first must decide what we want to change and how to measure it. In this

section, you'll learn about best practices and guidelines to measure and track the sustainability impact of your

workloads.

An essential aspect of working toward any sustainability goal is tracking and quantifying progress. If you can't

track and measure the impact, you'll never be sure if the efforts are worthwhile. The Emissions Impact

Dashboard is a Power BI dashboard that will give you a measure of the carbon impact of all your services and

resource groups in your Azure subscription(s).

The Emissions Impact Dashboard produces insights in various forms and allows for a wide range of reporting

capabilities:

Series of visual representations in the dashboard itself.

Snapshot export to Excel, PowerPoint, and PDF.

Continuous export to Microsoft Sustainability Manager and Dataverse.

Green Software Foundation alignment: Measuring sustainability

Recommendation:

Use the Emissions Impact Dashboard to record current and future environmental impact.

Identify and track metrics to quantify the achievement of technical, business, and sustainability outcomes.

Rely on tooling to help measure the impact, and record any changes made to your workload.

Learn more about the Sustainability and Dataverse API access in the Microsoft Learn module Access

Microsoft Sustainability Manager data.

Identify the metrics and set improvement goals

Cost optimization as a proxy

The Software Carbon Intensity (SCI) is the score you're looking for to measure the carbon impact of your

application(s) by adding the scalability and cost metrics to any carbon emissions measurement.
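
For reference, the Green Software Foundation's SCI specification expresses the score as a rate, SCI = ((E × I) + M) per R, where E is the energy consumed, I is the carbon intensity of that energy, M is the embodied emissions, and R is the functional unit (for example, per request or per user). The sketch below computes that rate; the function name, parameter names, and sample numbers are illustrative assumptions, not values from this article.

```python
def software_carbon_intensity(
    energy_kwh: float,
    intensity_g_per_kwh: float,
    embodied_g: float,
    functional_units: float,
) -> float:
    """Return gCO2eq per functional unit: ((E * I) + M) / R."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units


# Hypothetical month: 12 kWh at 400 gCO2eq/kWh, 1,200 g embodied, 100,000 requests.
score = software_carbon_intensity(12.0, 400.0, 1200.0, 100_000)
```

Because the score is a rate, it can improve either by lowering emissions or by serving more functional units with the same emissions.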

If you aren't using the Emissions Impact Dashboard, there are still ways of building carbon proxies that allow

you to measure your application's impact on emissions.

It can be a challenge to build carbon proxies for existing applications. Therefore, we recommend planning for

efficiency targets during the design phase of every workload. When adding new workloads to Azure, you should

consider planning for costs and emissions that will add to your existing footprint. The main goal should always

be not to emit carbon, so ideally, you should immediately find an optimization pattern to make up for the new

emissions.

The next step is to define your target emissions, either for a single application or for your entire set of cloud

workloads. The target can also include cost constraints, making it even easier to build upon since shrinking costs

will give you some budget to optimize emissions. Once you know your target, the cloud efficiency continuous

optimization process can start.

Green Software Foundation alignment: Measuring sustainability

Recommendations:

Calculate your new workload's minimum cost and carbon emissions (where applicable).

Track progress with Service Level Objectives (SLO), Service Level Agreements (SLA), or other performance

metrics.

Provide optimization patterns to accommodate the new application to your overall cloud efficiency score.

Once you've defined your target, you'll need to identify a few metrics that you can measure to prove your

changes had a positive effect on efficiency.

The metrics can, as an example, be derived from these categories:

Application performance metrics.

Cost optimization metrics.

Carbon emissions metrics (or proxies).

Green Software Foundation alignment: Measuring sustainability

Recommendation:

Discuss the plan with every application owner, since the impact of optimizing can vary and might affect many users.

Make sure that any plan that impacts performance is agreed upon and communicated clearly to the app

users so that they know that a lower performance may be necessary for the greater good of fewer carbon

emissions.

If you've connected the Microsoft Emissions Impact Dashboard (EID) to your Microsoft Sustainability

Manager (MSM) instance, you can use the Goal Tracking feature in MSM to define and track your goals by

linking them to live data from EID.

Sometimes the ease of deploying cloud resources makes us forget what is useful and what is simply a waste of

resources, money, and carbon. The message here is that experiments in the cloud can sometimes be costly in

terms of overall cloud efficiency, not purely cost, while bringing no innovation.

Use cloud resources wisely, considering any extra workload's carbon footprint.

When defining your SCI, you can use carbon proxies to compensate for the lack of specific standards and

measurements. One of the safest and most potent proxies for carbon emissions is the cost of your application(s).

  
Defining policies
Community and knowledge sharing
Create a sustainability community
Plan for learning
Reducing unnecessary spending lowers excess emissions from deployed workloads because you're
using fewer cloud resources.
Linking cost performance metrics to carbon efficiency can be a sound strategy because you won't necessarily
need to compromise on your defined workload Key Performance Indicators (KPI) by optimizing cost and
reducing carbon emissions. However, you might decide that you're prepared to sacrifice a KPI towards your
carbon goal, which can also be part of your strategy.
Green Software Foundation alignment: Measuring sustainability
Recommendation:
Review the concept of using a proxy solution to measure emissions.
Leverage the guidance in the Azure Well-Architected Framework Cost Optimization pillar.
Azure Policy is a powerful tool that can make some decisions for your cloud efficiency easier to implement.
Consider defining one or more policies to keep your Azure virtual data center continuously optimized.
Green Software Foundation alignment: Climate commitments
Recommendation:
Incorporate and use the cost policies available in the Cloud Adoption Framework.
Leverage built-in policies relevant to cost in Azure Policy, as they're technically closely tied to sustainability.
Customize Azure Policy policies according to green software principles. For example, create a new Azure
Policy initiative for "Sustainability".
Consider this tradeoff: Enforcement of new policies must not have an unplanned impact on any operational
performance metric.
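
As an illustration of a custom policy for a "Sustainability" initiative, the sketch below builds a policy definition that flags virtual machine sizes outside an allow list. The display name, the `allowedSkus` parameter name, and the choice of rule are assumptions made for this example; the overall `if`/`then` shape and the SKU field alias follow the pattern of Azure's built-in allowed-SKU policy, but verify both against the Azure Policy documentation before use. The default `audit` effect only reports non-compliant resources, which fits the tradeoff above.

```python
import json


def allowed_vm_sku_policy(effect: str = "audit") -> dict:
    """Build a custom policy definition that flags VM sizes outside an allow list."""
    return {
        "properties": {
            "displayName": "Sustainability: allowed virtual machine sizes",  # hypothetical
            "policyType": "Custom",
            "mode": "Indexed",
            "parameters": {
                "allowedSkus": {  # hypothetical parameter name
                    "type": "Array",
                    "metadata": {"displayName": "Allowed VM sizes"},
                }
            },
            "policyRule": {
                "if": {
                    "allOf": [
                        {"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
                        {
                            "not": {
                                "field": "Microsoft.Compute/virtualMachines/sku.name",
                                "in": "[parameters('allowedSkus')]",
                            }
                        },
                    ]
                },
                "then": {"effect": effect},
            },
        }
    }


# Serialize for review or for submission with your deployment tooling of choice.
policy_json = json.dumps(allowed_vm_sku_policy(), indent=2)
```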
Teams need to be constantly aware of new advancements in sustainability, so that they can leverage these learnings
when implementing workloads.
Building a community around cloud efficiency and green software is a good starting point to foster cloud
efficiency awareness and culture across your organization.
Creating a sustainability community doesn't have to be a tedious task. Start with a small team that will invest
some time in learning the sustainability status and the relevant information on green software. This team can
also join the Green Software Foundation and be part of the teams that create rules, standards, and more.
The core cloud efficiency team will have to be up to date with all the innovative tools and principles that drive
your Azure workload's cost and carbon footprint.
Green Software Foundation alignment: Climate commitments
Recommendation:
Define policies and targets, and communicate the team's efforts and goals to the rest of the organization.
Learn more by reading how do I start a sustainability community in my organization?
Make time for the core team to learn about advancements in sustainable operations. Meanwhile, ensure that
your entire organization starts thinking about green software and how to contribute to the sustainability picture
with their daily choices.

Share best practices across teams

Plan for incentives

Next step

Green Software Foundation alignment: Climate commitments

Recommendation:

Review these popular training and learning resources:

Use the self-paced learning module to Learn about The Principles of Sustainable Software Engineering.

Use the self-paced learning path to Get started with Microsoft Cloud for Sustainability.

Find more resources in the Microsoft Sustainability Learning Center.

Driving adoption of sustainability efforts requires input and work from across the organization.

Green Software Foundation alignment: Climate commitments

Recommendation:

Let team members share their workload and company-specific best practices for sustainable operations.

Set up a shared repository of best practices and guidance that have been tested in your environment with

tangible results.

Consider frequent knowledge-sharing sessions or internal webinars for getting everyone up to speed.

The quickest way of enforcing policies and creating the right culture is to set incentives for improving the

environmental sustainability of a workload, either by making sustainability a core KPI or by adding it to the

overall efficiency of the applications.

Many software partners already include green software in their best practices. Therefore, ensure that your

efficiency targets are defined and accepted when discussing the workload.

Green Software Foundation alignment: Climate commitments

Recommendations:

Promote carbon-aware applications. Reward application owners if the measured carbon footprint meets the

KPI.

Introduce gamification by creating a friendly culture of sustainability competition: track records to promote

green workloads, SCI scoring, and any optimization or improvement on the score.

Consider introducing loyalty programs, where participants get incentives when they can prove the cloud

efficiency of their applications.

Explore the opportunity to introduce badges like "Carbon Aware" and "Carbon Optimized".

Review the design considerations for networking and connectivity.

Networking and connectivity

Networking considerations for sustainable workloads on Azure

12/16/2022 • 3 minutes to read

IMPORTANT

Network efficiency

Make use of a CDN

Follow caching best practices

Select Azure regions based on where the customer resides

Most workloads in the cloud rely heavily on networking to operate. Whether for internal networking or

public-facing workloads, the design of the components and services used in provisioned solutions must consider

their impact on carbon emissions. Network equipment consumes electricity, including for traffic between the data

centers and end consumers. Learn about considerations and recommendations to enhance and optimize

network efficiency to reduce unnecessary carbon emissions.

Internet traversal between data centers and end consumers is a significant Scope 3 emission. Therefore,

recommendations in this section are aligned with the Principles of Green Software Networking area to improve

networking efficiency.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

Reduce unnecessary network traffic and lower bandwidth requirements where possible, allowing for a more

optimized network efficiency with less carbon emission.

Unnecessary traffic on the network should be avoided, as it's a cause for extra carbon emissions.

Green Software Foundation alignment: Energy efficiency

Recommendation:

A CDN helps minimize latency through storing frequently read static data closer to consumers, and helps

reduce the network traversal and server load.

Make sure to follow best practices for CDN.

Minimizing the amount of data transferred is crucial.

Green Software Foundation alignment: Energy efficiency, Hardware efficiency

Recommendation:

Caching is a well-understood design technique to improve performance and efficiency.

A caching solution helps reduce network traversal and reduces the server load.

Consider that it may require tuning of parameters to maximize the benefit and minimize the carbon

drawbacks. For example, setting a Time to Live (TTL).

Adding in-memory caching can help use idle compute resources, increasing the compute density of

resources that are already allocated.

Read caching best practices.
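
The Time to Live (TTL) tuning mentioned above can be sketched with a small in-memory cache. This is a minimal illustration rather than a production cache: the `TTLCache` class, its `get` signature, and the injectable `clock` are hypothetical, and there is no eviction or size limit. A longer TTL means fewer trips to the origin but staler data.

```python
import time
from typing import Callable, Generic, Hashable, TypeVar

V = TypeVar("V")


class TTLCache(Generic[V]):
    """Keep values for ttl_seconds, then fetch them again on the next read."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[Hashable, tuple[float, V]] = {}

    def get(self, key: Hashable, fetch: Callable[[], V]) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no network round trip
        value = fetch()      # expired or missing: go back to the origin
        self._entries[key] = (now + self._ttl, value)
        return value
```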

  
Use managed audio and video streaming services with built-in compression
Enable network file compression
Maximize network utilization within the same cloud and region
Next step
The location of an application's consumers can be disparate, and it can be challenging to serve requests with
good performance and energy efficiency if the distance is too great.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Deploy or move Azure resources across regions to better serve the applications from where most consumers
reside.
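One simple way to find where most consumers reside is to count requests by the region nearest to their origin. The sketch below assumes you already have a country code per request and a country-to-region mapping; both the mapping and the function name are hypothetical example data, and a real decision would also weigh latency, data residency, and service availability.

```python
from collections import Counter
from typing import Iterable, Mapping


def busiest_region(
    request_countries: Iterable[str],
    nearest_region: Mapping[str, str],
) -> str:
    """Return the region that is nearest to the largest share of requests."""
    counts = Counter(
        nearest_region[country]
        for country in request_countries
        if country in nearest_region  # ignore countries without a mapping
    )
    if not counts:
        raise ValueError("no requests could be mapped to a region")
    return counts.most_common(1)[0][0]
```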
Applications making use of a media streaming service may have high requirements for bandwidth and
compression, and can have a substantial carbon footprint if not designed carefully.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
By making use of a managed service for audio and video, applications can leverage built-in optimizations like
encoding, compression, and more.
Read about managed audio and video streaming services.
Networks sending uncompressed data can have a higher requirement on bandwidth, the allocated resources,
and the solution in general. Consider compressing data to optimize the workload and design for a more
network efficient solution.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Reduce the network payload by improving CDN performance.
Operating solutions in multiple regions has a networking impact. Network traversals between components in
Azure are optimized to stay within the Azure infrastructure. However, any network traffic destined for the
internet or a component in another cloud involves the public internet's router resources, which you have no
control over regarding resource impact measurement or utilization.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Keeping resources in a single cloud gives you maximum control and allows the cloud provider to optimize
the network routing.
Maximize network utilization within the same cloud and, if possible, within the same region.
Since the cost can be a proxy for sustainability, review the Azure regions documentation in the Cost
Optimization pillar of the Azure Well-Architected Framework.
Review the design considerations for storage.
Storage

Data and storage design considerations for sustainable workloads on Azure

12/16/2022 • 3 minutes to read

IMPORTANT

Storage efficiency

Enable storage compression

Optimize database query performance

Use the best suited storage access tier

Data storage in Azure is a crucial component of most provisioned workloads. Learn how to design for a more

sustainable data storage architecture and optimize existing deployments.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

Build solutions with efficient storage to increase performance, lower the required bandwidth, and minimize

the climate impact of unnecessary storage.

Storing large amounts of uncompressed data can result in unnecessary bandwidth waste and increase the storage capacity

requirements.

Green Software Foundation alignment: Hardware efficiency

Recommendation:

Enable compression to reduce the storage requirements, including both the capacity and the bandwidth required

to write or retrieve data. For example, see compressing files in Azure Front Door and compressing files in Azure CDN.

Compression is a well-known design technique to improve network performance.

Consider the tradeoff of compression: Does the benefit of compression outweigh the increased carbon cost

in the resources (CPU, RAM) needed to perform the compression/decompression?
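
One way to inform this tradeoff is to measure both sides on representative data. The sketch below uses gzip from the Python standard library as a stand-in for whatever compression your storage or delivery service applies; the function name and report fields are hypothetical, and CPU time is only a rough proxy for the energy spent compressing.

```python
import gzip
import time


def compression_report(payload: bytes, level: int = 6) -> dict:
    """Compress payload with gzip and report bytes saved and CPU time spent."""
    start = time.process_time()
    compressed = gzip.compress(payload, compresslevel=level)
    cpu_seconds = time.process_time() - start
    return {
        "original_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "bytes_saved": len(payload) - len(compressed),
        "cpu_seconds": cpu_seconds,
    }
```

Highly repetitive data such as logs or JSON usually shows large savings, while already-compressed formats (images, video, archives) may save nothing and only cost CPU time.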

Querying extensive databases or retrieving large amounts of information at once can have a performance penalty.

Ideally, apps should optimize for query performance.

Green Software Foundation alignment: Energy efficiency

Recommendation:

Optimizing queries reduces the latency of data retrieval while also reducing the load on the database.

Understand the query performance for Azure SQL Databases.

There are many well-known ways to optimize data query performance, for example tuning apps and

databases for performance in an Azure SQL database.

Consider that it may require fine-tuning to achieve optimal results.

The carbon impact of data retrieved from hot storage can be higher than that of data from cold or archive storage.

Designing solutions with the correct data access pattern can enhance the application's carbon efficiency.

Green Software Foundation alignment: Energy efficiency

  
Only store what is relevant
Determine the most suitable access tier for blob data
Reduce the number of recovery points for VM backups
Revise backup and retention policies
Optimize the collection of logs
Recommendation:
Use storage best suited for the application's data access patterns.
Make sure your most frequently accessed data is stored in hot storage, so that it's easy to retrieve and doesn't require
more processing to access.
Infrequently used data should be stored in cold or offline archive storage, using less energy.
Backup is a crucial part of reliability. However, storing backups indefinitely can quickly consume large amounts of
disk space unnecessarily. Consider how you plan backup storage retention.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
Implement policies to streamline the process of storing and keeping relevant information. Microsoft Purview
can help label data and add time-based purging to automatically delete it after a retention period.
Additionally, this lets you stay in control of your data and reduces the amount of data to process and transfer.
Workloads integrated with Azure Monitor can rely on Data Collection Rules (DCR) to specify what data
should be collected, how to transform that data, and where to send the data.
Consider whether to store data in an online tier or an offline tier. Online tiers are optimized for storing data that
is accessed or modified frequently. Offline tiers are optimized for storing data that is rarely accessed.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Read Hot, Cool, and Archive access tiers for blob data.
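The tier decision can be expressed as a simple rule on how long data has been idle. The sketch below is only an illustration: the 30-day and 180-day thresholds and the function name are hypothetical, and a real choice should also account for retrieval latency, early-deletion charges, and how often the data is actually read.

```python
from datetime import datetime, timedelta


def suggest_access_tier(
    last_accessed: datetime,
    now: datetime,
    cool_after_days: int = 30,      # hypothetical threshold
    archive_after_days: int = 180,  # hypothetical threshold
) -> str:
    """Suggest 'Hot', 'Cool', or 'Archive' from the time since last access."""
    idle = now - last_accessed
    if idle >= timedelta(days=archive_after_days):
        return "Archive"
    if idle >= timedelta(days=cool_after_days):
        return "Cool"
    return "Hot"
```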
Expired recovery points aren't cleaned up automatically for items in a soft-deleted state. Therefore, consider where
soft delete is enabled for Azure Backup.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
Read more about the impact of expired recovery points for items in soft deleted state.
Consider reviewing backup policies and retention periods for backups to avoid storing unnecessary data.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
Review and revise backup and retention policies to minimize storage overhead.
Actively review and delete backups that are no longer needed.
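A periodic review could start from a list of backups that have outlived the retention period. The sketch below assumes you can export backup names with their creation times from your backup tooling; the function name and input shape are hypothetical, and it only reports candidates rather than deleting anything.

```python
from datetime import datetime, timedelta


def expired_backups(
    backups: dict[str, datetime],
    now: datetime,
    retention_days: int,
) -> list[str]:
    """Return the names of backups created before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)
```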
Continuously collecting logs across workloads can quickly aggregate and store lots of unused data.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Make sure you are logging and retaining only data that is relevant to your needs.
Read more about cost optimization and Log Analytics.

Next step

Review the design considerations for security.

Security

Security considerations for sustainable workloads on Azure

12/16/2022 • 7 minutes to read

IMPORTANT

Security monitoring

Use cloud native log collection methods where applicable

Avoid transferring large unfiltered data sets from one cloud service provider to another

Designing sustainable workloads on Azure must encompass security, which is a foundational principle through

all phases of a project. Learn about considerations and recommendations leading to a more sustainable security

posture.

This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we

recommend you start with what is a sustainable workload?

Use cloud native security monitoring solutions to optimize for sustainability.

Traditionally, log collection methods for ingestion to a Security Information and Event Management (SIEM)

solution required the use of an intermediary resource to collect, parse, filter and transmit logs onward to the

central collection system. Using this design can carry an overhead with more infrastructure and associated

financial and carbon-related costs.

Green Software Foundation alignment: Hardware efficiency, Energy efficiency

Recommendation:

Using cloud native service-to-service connectors simplifies the integration between the services and the SIEM,

and removes the overhead of extra infrastructure.

It's possible to ingest log data from existing compute resources using previously deployed agents such as the

Azure Monitor Analytics agent. Review how to migrate to Azure Monitor agent from Log Analytics agent.

Consider this tradeoff: Deploying more monitoring agents will increase the overhead in processing as it

needs more compute resources. Carefully design and plan for how much information is needed to cover the

security requirements of the solution and find a suitable level of information to store and keep.

A possible solution to reduce unnecessary data collection is to rely on the Azure Monitor Data

Collection Rules (DCR).

Conventional SIEM solutions required all log data to be ingested and stored in a centralized location. In a

multicloud environment, this solution can lead to a large amount of data being transferred out of a cloud service

provider and into another, causing increased burden on the network and storage infrastructure.

Green Software Foundation alignment: Carbon efficiency, Energy efficiency

Recommendation:

Cloud native security services can perform localized analysis on the relevant security data sources. This analysis

allows the bulk of log data to remain within the source cloud service provider environment. Cloud native

SIEM solutions can be connected via an API or connector to these security services to transmit only the

relevant security incident or event data. This solution can greatly reduce the amount of data transferred while
maintaining a high level of security information to respond to an incident.
In time, using the described approach helps reduce data egress and storage costs, which inherently helps reduce
emissions.
Filter or exclude log sources before transmission or ingestion into a SIEM
Archive log data to long-term storage
Network architecture
Use cloud native network security controls to eliminate unnecessary network traffic
Minimize routing from endpoints to the destination
Consider the complexity and cost of storing all logs from all possible sources, for instance, applications, servers,
diagnostics, and platform activity.
Green Software Foundation alignment: Carbon efficiency, Energy efficiency
Recommendation:
When designing a log collection strategy for cloud native SIEM solutions, consider the use cases based on the
Microsoft Sentinel analytics rules required for your environment and match up the required log sources to
support those rules.
This option can help remove the unnecessary transmission and storage of log data, reducing carbon
emissions.
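Matching log sources to rules can be sketched as a two-step filter: collect the sources that the enabled rules actually read, then drop records from every other source before they are transmitted. The rule names, source names, and record shape below are hypothetical; in practice the mapping would come from your own analytics rule inventory.

```python
from typing import Iterable, Iterator, Mapping


def required_sources(rules: Mapping[str, Iterable[str]]) -> set[str]:
    """Union of the log sources referenced by the enabled analytics rules."""
    return {source for sources in rules.values() for source in sources}


def filter_records(
    records: Iterable[Mapping[str, str]],
    keep: set[str],
) -> Iterator[Mapping[str, str]]:
    """Yield only records whose 'source' is needed by at least one rule."""
    for record in records:
        if record.get("source") in keep:
            yield record
```

Filtering at the source, before transmission, is what saves both network traffic and storage; filtering after ingestion saves only storage.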
Many customers have a requirement to store log data for an extended period due to regulatory compliance
reasons. In these cases, storing log data in the primary storage location of the SIEM system is a costly solution.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Log data can be moved out to a cheaper long-term storage option that respects the retention policies of
the customer, but lowers the cost by utilizing separate storage locations.
Increase the efficiency and avoid unnecessary traffic by following good practices for network security
architectures.
When you use a centralized routing and firewall design, all network traffic is sent to the hub for inspection,
filtering, and onward routing. While this approach centralizes policy enforcement, it can create an overhead on
the network of unnecessary traffic from the source resources.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:
Use Network security groups and Application security groups to help filter traffic at the source, and to
remove the unnecessary data transmission. Using these capabilities can help reduce the burden on the cloud
infrastructure, with lower bandwidth requirements and less infrastructure to own and manage.
In many customer environments, especially in hybrid deployments, all end user device network traffic is routed
through on-premises systems before being allowed to reach the internet. Usually, this happens due to the
requirement to inspect all internet traffic. Often, this requires higher capacity network security appliances within
the on-premises environment, or more appliances within the cloud environment.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Minimize routing from endpoints to the destination.

  
Use network security tools with auto-scaling capabilities
Evaluate whether to use TLS termination
Use DDoS protection
Endpoint security
Integrate Microsoft Defender for Endpoint
Where possible, end user devices should be optimized to split out known traffic directly to cloud
services while continuing to route and inspect traffic for all other destinations. Bringing these
capabilities and policies closer to the end user device prevents unnecessary network traffic and its
associated overhead.
Based on network traffic, there will be times when demand on the security appliance will be high, and other
times when it will be lower. Many network security appliances are deployed at a scale that copes with the highest
expected demand, leading to inefficiencies. Additionally, reconfiguration of these tools often requires a reboot,
leading to unacceptable downtime and management overhead.
Green Software Foundation alignment: Hardware efficiency
Recommendation:
Making use of auto-scaling allows the rightsizing of the backend resources to meet demand without manual
intervention.
This approach will vastly reduce the time to react to network traffic changes, resulting in a reduced waste of
unnecessary resources, and increases your sustainability effect.
Learn more about relevant services by reading how to enable a Web Application Firewall (WAF) on an
Application Gateway, and deploy and configure Azure Firewall Premium.
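The rightsizing idea behind auto-scaling can be illustrated with a minimal sketch. The function name, the per-instance capacity, and the demand samples below are hypothetical and are not taken from any Azure API; real auto-scaling rules in Azure Firewall or Application Gateway are configured on the service itself.

```python
import math


def instances_needed(demand_rps: float, capacity_per_instance_rps: float,
                     min_instances: int = 1, max_instances: int = 10) -> int:
    """Return the instance count that covers demand, clamped to the allowed range."""
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity_per_instance_rps must be positive")
    required = math.ceil(demand_rps / capacity_per_instance_rps)
    return max(min_instances, min(max_instances, required))


# Hypothetical hourly demand samples, in requests per second.
hourly_demand = [120, 80, 950, 2400, 600]

# Assumed capacity of one appliance instance (illustrative only).
plan = [instances_needed(d, capacity_per_instance_rps=500) for d in hourly_demand]

# A fixed deployment sized for peak demand runs the peak count every hour.
fixed_instance_hours = max(plan) * len(plan)
scaled_instance_hours = sum(plan)

print(plan, fixed_instance_hours, scaled_instance_hours)
```

In this sketch the scaled plan uses fewer instance-hours than a deployment permanently sized for the peak, which is the inefficiency the recommendation targets.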
Terminating and re-establishing TLS consumes CPU cycles that might be unnecessary in certain architectures.
Green Software Foundation alignment: Energy efficiency
Recommendation:
Consider if you can terminate TLS at your border gateway and continue with non-TLS to your workload load
balancer and onwards to your workload.
Review the information on TLS termination to better understand the performance and utilization impact it
offers.
Consider the tradeoff: A balanced level of security can offer a more sustainable and energy efficient workload
while a higher level of security may increase the requirements on compute resources.
Distributed Denial of Service (DDoS) attacks aim to disrupt operational systems by overwhelming them,
creating a significant impact on the resources in the cloud. Successful attacks flood network and compute
resources, leading to an unnecessary spike in usage and cost.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:
DDoS protection seeks to mitigate attacks at an abstracted layer, so the attack is mitigated before reaching
any customer operated services.
Mitigating any malicious usage of compute and network services will ultimately help reduce
unnecessary carbon emissions.
It's imperative that we secure our workloads and solutions in the cloud. Understanding how we can optimize our
mitigation tactics all the way down to the client devices can have a positive outcome for reducing emissions.

Reporting

Tag security resources

Next step

Many attacks on cloud infrastructure seek to misuse deployed resources for the attacker's direct gain. Two such

misuse cases are botnets and crypto mining.

Both of these cases involve taking control of customer-operated compute resources and using them either to
create new cryptocurrency coins or as a network of resources from which to launch a secondary action, such as
a DDoS attack or a mass e-mail spam campaign.

Green Software Foundation alignment: Hardware efficiency

Recommendations:

Integrate Microsoft Defender for Endpoint with Defender for Cloud to identify and shut down crypto mining

and botnets.

The EDR capabilities provide advanced attack detections and are able to take response actions to

remediate those threats. The unnecessary resource usage created by these common attacks can

quickly be discovered and remediated, often without the intervention of a security analyst.

Getting the right information and insights at the right time is important for producing reports around emissions

from your security appliances.

It can be a challenge to quickly find and report on all security appliances in your tenant. Identifying the security

resources can help when designing a strategy for a more sustainable operating model for your business.

Green Software Foundation alignment: Measuring sustainability

Recommendation:

Tag security resources to record emissions impact of security resources.
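As a rough illustration of why tagging helps reporting, the sketch below groups an inventory by a tag value and totals an emissions figure per group. The inventory, the tag key `workload-type`, and the `kg_co2e` numbers are all hypothetical; in practice the resource list and emissions data would come from your own tooling.

```python
from collections import defaultdict

# Hypothetical inventory: names, tags, and emissions figures are illustrative only.
resources = [
    {"name": "fw-hub", "tags": {"workload-type": "security"}, "kg_co2e": 12.5},
    {"name": "waf-edge", "tags": {"workload-type": "security"}, "kg_co2e": 7.0},
    {"name": "web-vmss", "tags": {"workload-type": "application"}, "kg_co2e": 40.0},
    {"name": "legacy-proxy", "tags": {}, "kg_co2e": 3.0},
]


def emissions_by_tag(items, tag_key):
    """Sum emissions per tag value; resources without the tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get(tag_key, "untagged")] += item["kg_co2e"]
    return dict(totals)


report = emissions_by_tag(resources, "workload-type")
print(report)
```

The `untagged` bucket makes gaps visible: any security appliance that ends up there is one the report cannot attribute.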

Review the design principles for sustainability.

Design principles

Azure Well-Architected Framework review - Azure Service Fabric

12/16/2022 • 14 minutes to read

Prerequisites

Reliability

Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage
scalable and reliable microservices and containers. These resources are deployed onto a network-connected set
of virtual or physical machines, which is called a cluster.

There are two cluster models in Azure Service Fabric: standard clusters and managed clusters.

Standard clusters require you to define a cluster resource alongside a number of supporting resources. These
resources must be set up correctly upon deployment and maintained correctly throughout the lifecycle of the
cluster. Otherwise, the cluster and your services will not function properly.

Managed clusters simplify your deployment and management operations. The managed cluster model
consists of a single Service Fabric managed cluster resource that encapsulates and abstracts away the
underlying resources.

This article primarily discusses the managed cluster model for simplicity. However, call-outs are made for any
special considerations that apply to the standard cluster model.

In this article, you learn architectural best practices for Azure Service Fabric. The guidance is based on the five

pillars of architectural excellence:

Reliability

Security

Cost optimization

Operational excellence

Performance efficiency

Understanding the Well-Architected Framework pillars can help produce a high quality, stable, and

efficient cloud architecture. Check out the Azure Well-Architected Framework overview page to review

the five pillars of architectural excellence.

Reviewing the core concepts of Azure Service Fabric and microservice architecture can help you

understand the context of the best practices provided in this article.

The following sections cover design considerations and configuration recommendations, specific to Azure

Service Fabric and reliability.

When discussing reliability with Azure Service Fabric, it's important to distinguish between cluster
reliability and workload reliability. Cluster reliability is a shared responsibility between the Service Fabric
cluster admin and their resource provider, while workload reliability is the domain of a developer. Azure
Service Fabric has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.

  
Design checklist
Recommendations

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Cluster architecture: Use Standard SKU for production scenarios. | This level ensures the resource provider maintains cluster reliability. Standard cluster: A Standard SKU managed cluster provides the equivalent of durability level Silver. To achieve this using the standard cluster model, you will need to use 5 VMs (or more). |
| Cluster architecture: Consider using Availability Zones for your Service Fabric clusters. | Service Fabric managed cluster supports deployments that span across multiple Availability Zones to provide zone resiliency. This configuration will ensure high availability of the critical system services and your applications to protect from single points of failure. |
| Cluster architecture: Consider using Azure API Management to expose and offload cross-cutting functionality for APIs hosted on the cluster. | API Management can integrate with Service Fabric directly. |
| Workload architecture: For stateful workload scenarios, consider using Reliable Services. | The Reliable Services model allows your services to stay up even in unreliable environments where your machines fail or hit network issues, or in cases where the services themselves encounter errors and crash or fail. For stateful services, your state is preserved even in the presence of network or other failures. |
 
Security
For more information about Azure Service Fabric cluster reliability, check out the capacity planning
documentation.
For more information about Azure Service Fabric workload reliability, reference the Reliability subsystem
included in the Service Fabric architecture.
As you make design choices for Azure Service Fabric, review the design principles for adding reliability to the
architecture.
Cluster architecture: Use Standard SKU for production scenarios. Standard cluster: Use durability level
Silver (5 VMs) or greater for production scenarios.
Cluster architecture: For critical workloads, consider using Availability Zones for your Service Fabric
clusters.
Cluster architecture: For production scenarios, use the Standard tier load balancer. Managed clusters
create an Azure public Standard Load Balancer and fully qualified domain name with a static public IP for
both the primary and secondary node types. You can also bring your own load balancer, which supports
both Basic and Standard SKU load balancers.
Cluster architecture: Create additional, secondary node types for your workloads.
Explore the following table of recommendations to optimize your Azure Service Fabric configuration for service
reliability:
For more suggestions, see Principles of the reliability pillar.
The following sections cover design considerations and configuration recommendations, specific to Azure
Service Fabric and security.
When discussing security with Azure Service Fabric, it's important to distinguish between cluster security and
workload security. Cluster security is a shared responsibility between the Service Fabric cluster admin and
their resource provider, while workload security is the domain of a developer. Azure Service Fabric has
considerations and recommendations for both of these roles.

Design checklist

Recommendations

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Cluster architecture: Ensure Network Security Groups (NSG) are configured to restrict traffic flow between subnets and node types. | For example, you may have an API Management instance (one subnet), a frontend subnet (exposing a website directly), and a backend subnet (accessible only to frontend). |
| Cluster architecture: Deploy Key Vault certificates to Service Fabric cluster virtual machine scale sets. | Centralizing storage of application secrets in Azure Key Vault allows you to control their distribution. Key Vault greatly reduces the chances that secrets may be accidentally leaked. |
| Cluster architecture: Apply an Access Control List (ACL) to your client certificate for your Service Fabric cluster. | Using an ACL provides an additional level of authentication. |
| Cluster architecture: Use resource requests and limits to govern resource usage across the nodes in your cluster. | Enforcing resource limits helps ensure that one service doesn't consume too many resources and starve other services. |
| Workload architecture: Encrypt Service Fabric package secret values. | Encryption on your secret values provides an additional level of security. |

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

For more information about Azure Service Fabric cluster security, check out Service Fabric cluster security

scenarios.

For more information about Azure Service Fabric workload security, reference Service Fabric application and

service security.

As you make design choices for Azure Service Fabric, review the design principles for adding security to the

architecture.

Cluster architecture: Ensure Network Security Groups (NSG) are configured to restrict traffic flow between
subnets and node types. Ensure that the correct ports are opened for application deployment and workloads.

Cluster architecture: When using the Service Fabric Secret Store to distribute secrets, use a separate data
encipherment certificate to encrypt the values.

Cluster architecture: Deploy client certificates by adding them to Azure Key Vault and referencing the URI
in your deployment.

Cluster architecture: Enable Azure Active Directory integration for your cluster to ensure users can access
Service Fabric Explorer using their Azure Active Directory (AAD) credentials. Don't distribute the cluster client
certificates among users to access Explorer.

Cluster architecture: For client authentication, use admin and read-only client certificates and/or AAD
authentication.

Cluster and workload architectures: Create a process for monitoring the expiration date of client
certificates.

Cluster and workload architectures: Maintain separate clusters for development, staging, and
production.

Consider the following recommendations to optimize your Azure Service Fabric configuration for security:

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Workload architecture: Include client certificates in Service Fabric applications. | Having your applications use client certificates for authentication provides opportunities for security at both the cluster and workload level. |
| Workload architecture: Authenticate Service Fabric applications to Azure Resources using Managed Identity. | Using Managed Identity allows you to securely manage the credentials in your code for authenticating to various services without saving them locally on a developer workstation or in source control. |
| Cluster and workload architectures: Follow Service Fabric best practices when hosting untrusted applications. | Following the best practices provides a security standard to follow. |

Policy definitions

Cost optimization

Design checklist

For more suggestions, see Principles of the security pillar.

Azure Advisor helps you ensure and improve the security of Azure Service Fabric. You can review the

recommendations in the Azure Advisor section of this article.

Azure Policy helps maintain organizational standards and assess compliance across your resources. Keep the

following built-in policies in mind as you configure Azure Service Fabric:

Service Fabric clusters should have the ClusterProtectionLevel property set to EncryptAndSign. This is the
default value for managed clusters and isn't changeable. Standard cluster: Ensure you set
ClusterProtectionLevel to EncryptAndSign.

Service Fabric clusters should only use Azure Active Directory for client authentication.

All built-in policy definitions related to Azure Service Fabric are listed in Built-in policies - Service Fabric.
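For standard clusters, the ClusterProtectionLevel check can be approximated offline against an exported cluster definition. The sketch below is illustrative only: it assumes the cluster properties carry a `fabricSettings` list with a `Security` section, as is typical in standard cluster ARM templates, and the helper names are hypothetical. It is not a replacement for the built-in Azure Policy.

```python
def cluster_protection_level(cluster_properties: dict):
    """Return the configured ClusterProtectionLevel, or None if it is not set.

    Assumes a layout of fabricSettings -> Security -> parameters.
    """
    for section in cluster_properties.get("fabricSettings", []):
        if section.get("name") != "Security":
            continue
        for param in section.get("parameters", []):
            if param.get("name") == "ClusterProtectionLevel":
                return param.get("value")
    return None


def is_compliant(cluster_properties: dict) -> bool:
    """True only when the level is explicitly set to EncryptAndSign."""
    return cluster_protection_level(cluster_properties) == "EncryptAndSign"


# Hypothetical cluster definitions for illustration.
compliant_cluster = {
    "fabricSettings": [
        {"name": "Security",
         "parameters": [{"name": "ClusterProtectionLevel", "value": "EncryptAndSign"}]}
    ]
}
unset_cluster = {"fabricSettings": []}

print(is_compliant(compliant_cluster), is_compliant(unset_cluster))
```

Treating a missing setting as non-compliant is a deliberate choice in this sketch, so that an omitted value is flagged for review rather than silently accepted.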

The following sections cover design considerations and configuration recommendations, specific to Azure

Service Fabric and cost optimization.

When discussing cost optimization with Azure Service Fabric, it's important to distinguish between cost of
cluster resources and cost of workload resources. Cluster resources are a shared responsibility between the
Service Fabric cluster admin and their resource provider, while workload resources are the domain of a
developer. Azure Service Fabric has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

For cluster cost optimization, go to the Azure pricing calculator and select Azure Service Fabric from the
available products. You can test different configurations and payment plans in the calculator.

For more information about Azure Service Fabric workload pricing, check out the example cost calculation

process for application planning.

As you make design choices for Azure Service Fabric, review the design principles for optimizing the cost of

your architecture.

Cluster architecture: Select appropriate VM SKU.

Cluster architecture: Use appropriate node type and size.

Cluster and workload architectures: Use appropriate managed disk tier and size.

Recommendations

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Cluster architecture: Avoid VM SKUs with temp disk offerings. | Service Fabric uses managed disks by default, so avoiding temp disk offerings ensures you don't pay for unneeded resources. |
| Cluster architecture: If you need to select a certain VM SKU for capacity reasons and it happens to offer temp disk, consider using temporary disk support for your stateless workloads. | Make the most of the resources you're paying for. Using a temporary disk instead of a managed disk can reduce costs for stateless workloads. |
| Cluster and workload architectures: Align SKU selection and managed disk size with workload requirements. | Matching your selection to your workload demands ensures you don't pay for unneeded resources. |

Operational excellence

Design checklist

Recommendations

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Workload architecture: Use Application Insights to monitor your workloads. | Application Insights integrates with the Azure platform, including Service Fabric. |
| Cluster and workload architectures: Create a process for monitoring the expiration date of client certificates. | For example, Key Vault offers a feature that sends an email when x% of the certificate's lifespan has elapsed. |
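The certificate monitoring recommendation above describes a threshold on elapsed lifespan. A minimal sketch of that calculation follows; the function names, the 80% default threshold, and the dates are assumptions chosen for illustration, and Key Vault performs its own evaluation when you configure its notification feature.

```python
from datetime import datetime, timezone


def lifetime_elapsed_percent(not_before: datetime, not_after: datetime,
                             now: datetime) -> float:
    """Percentage of the certificate validity period that has elapsed, clamped to 0..100."""
    total = (not_after - not_before).total_seconds()
    if total <= 0:
        raise ValueError("not_after must be later than not_before")
    elapsed = (now - not_before).total_seconds()
    return max(0.0, min(100.0, 100.0 * elapsed / total))


def needs_renewal(not_before: datetime, not_after: datetime, now: datetime,
                  threshold_percent: float = 80.0) -> bool:
    """True once the elapsed share of the lifespan reaches the threshold."""
    return lifetime_elapsed_percent(not_before, not_after, now) >= threshold_percent


# Hypothetical one-year certificate.
issued = datetime(2023, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 1, 1, tzinfo=timezone.utc)

early = needs_renewal(issued, expires, datetime(2023, 6, 1, tzinfo=timezone.utc))
late = needs_renewal(issued, expires, datetime(2023, 11, 1, tzinfo=timezone.utc))
print(early, late)
```

A percentage threshold scales with the certificate's validity period, so the same rule works for short-lived and long-lived certificates.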
Explore the following table of recommendations to optimize your Azure Service Fabric configuration for cost:
For more suggestions, see Principles of the cost optimization pillar.
The following sections cover design considerations and configuration recommendations, specific to Azure
Service Fabric and operational excellence.
When discussing operational excellence with Azure Service Fabric, it's important to distinguish between
cluster operation and workload operation. Cluster operation is a shared responsibility between the Service
Fabric cluster admin and their resource provider, while workload operation is the domain of a developer. Azure
Service Fabric has considerations and recommendations for both of these roles.
In the design checklist and list of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
As you make design choices for Azure Service Fabric, review the design principles for operational excellence.
Cluster architecture: Prepare a cluster monitoring solution.
Cluster architecture: Review the cluster health policies in the Service Fabric health model.
Workload architecture: Prepare an application monitoring solution.
Workload architecture: Review the application and service type health policies in the Service Fabric health
model.
Cluster and workload architectures: Prepare an infrastructure monitoring solution.
Cluster and workload architectures: Design your cluster with build and release pipelines for continuous
integration and deployment.
Explore the following table of recommendations to optimize your Azure Service Fabric configuration for
operational excellence:

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Cluster and workload architectures: For pre-production clusters, use Azure Chaos Studio to drill service disruption on a Virtual Machine Scale Set instance failure. | Practicing service disruption scenarios will help you understand what is at risk in your infrastructure and how to best mitigate the issues if they arise. |
| Cluster and workload architectures: Use Azure Monitor to monitor cluster and container infrastructure events. | Azure Monitor integrates well with the Azure platform, including Service Fabric. |
| Cluster and workload architectures: Use Azure Pipelines for your continuous integration and deployment solution. | Azure Pipelines integrates well with the Azure platform, including Service Fabric. |

Performance efficiency

Design checklist

Recommendations

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Cluster architecture: Exclude the Service Fabric processes from Windows Defender to improve performance. | By default, Windows Defender antivirus is installed on Windows Server 2016 and 2019. To reduce any performance impact and resource consumption overhead incurred by Windows Defender, and if your security policies allow you to exclude processes and paths for open-source software, you can exclude. |

For more suggestions, see Principles of the operational excellence pillar.

The following section covers configuration recommendations, specific to Azure Service Fabric and performance

efficiency.

When discussing performance efficiency with Azure Service Fabric, it's important to distinguish between
cluster performance and workload performance. Cluster performance is a shared responsibility between the
Service Fabric cluster admin and their resource provider, while workload performance is the domain of a
developer. Azure Service Fabric has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

For more information about how Azure Service Fabric can reduce performance issues for your workload with

Service Fabric performance counters, reference Monitoring and diagnostic best practices for Azure Service

Fabric.

Cluster architecture: Exclude the Service Fabric processes from Windows Defender to improve
performance.

Cluster architecture: Select appropriate VM SKU.

Workload architecture: Decide what programming model you will use for your services.

Cluster and workload architectures: Use appropriate managed disk tier and size.

Consider the following recommendations to optimize your Azure Service Fabric configuration for performance

efficiency:

| Azure Service Fabric recommendation | Benefit |
|---|---|
| Cluster architecture: Consider using Autoscaling for your cluster. | Autoscaling gives great elasticity and enables addition or reduction of nodes on demand on a secondary node type. This automated and elastic behavior reduces the management overhead and potential business impact by monitoring and optimizing the number of nodes servicing your workload. |
| Cluster architecture: Consider using Accelerated Networking. | Accelerated networking enables a high-performance path that bypasses the host from the data path, which reduces latency, jitter, and CPU utilization for the most demanding network workloads. |
| Cluster architecture: Consider using encryption at host instead of Azure Disk Encryption (ADE). | This encryption method improves on ADE by supporting all OS types and images, including custom images, for your VMs by encrypting data in the Azure Storage service. |
| Workload architecture: Review the Service Fabric programming models to decide what model would best suit your services. | Service Fabric supports several programming models. Each comes with its own advantages and disadvantages. Knowing about the available programming models can help you make the best choices for designing your services. |
| Workload architecture: Leverage loosely-coupled microservices for your workloads where appropriate. | Using microservices allows you to get the most out of Service Fabric's features. |
| Workload architecture: Leverage event-driven architecture for your workloads where appropriate. | Using event-driven architecture allows you to get the most out of Service Fabric's features. |
| Workload architecture: Leverage background processing for your workloads where appropriate. | Using background processing allows you to get the most out of Service Fabric's features. |
| Cluster and workload architectures: Review the different ways you can scale your solution in Service Fabric. | You can use scaling to enable maximum resource utilization for your solution. |
 
Azure Advisor recommendations
  
Security
 
Additional resources
For more suggestions, see Principles of the performance efficiency pillar.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost
effectiveness, performance, and operational excellence when using Azure Service Fabric.
Service Fabric clusters should have the ClusterProtectionLevel property set to EncryptAndSign. This is the
default value for managed clusters and isn't changeable. Standard cluster: Ensure you set
ClusterProtectionLevel to EncryptAndSign.
Service Fabric clusters should only use Azure Active Directory for client authentication.
Check out the Azure Service Fabric managed cluster configuration options article for a list of all the options you
have while creating and maintaining your cluster.
Review the Azure application architecture fundamentals for guidance on how to develop your workloads. While
Service Fabric can be used solely as a container hosting platform, using well-architected workloads leverages
Service Fabric's full functionality.

Next steps

Use these recommendations as you create your Service Fabric managed cluster using an ARM template or

through the Azure portal:

Quickstart: Deploy a Service Fabric managed cluster with an Azure Resource Manager template

Quickstart: Deploy a Service Fabric managed cluster using the Azure portal

Azure App Service and reliability

12/16/2022 • 8 minutes to read

Design considerations

Checklist

Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. This

service adds the power of Microsoft Azure to your application, such as:

Security

Load balancing

Autoscaling

Automated management

To explore how Azure App Service can bolster the resiliency of your application workload, reference key features

in Why use App Service?

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure App Service.

Microsoft guarantees that Azure App Service will be available 99.95% of the time. However, no SLA is provided

using either the Free or Shared tiers.

For more information, reference the SLA for App Service.
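To make the 99.95% figure concrete, the arithmetic below converts an availability percentage into a downtime budget. This is a back-of-the-envelope sketch: the 30-day period and the function name are assumptions, and the SLA document itself defines how availability is actually measured.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted within the period at the given availability."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# 30 days contain 43,200 minutes; 0.05% of that is the monthly budget.
monthly_budget = allowed_downtime_minutes(99.95)
print(f"{monthly_budget:.1f} minutes per 30 days")
```

Under these assumptions, the budget comes to roughly 21.6 minutes in a 30-day period.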

Have you configured Azure App Service with resiliency in mind?

Consider disabling ARR Affinity for your App Service.

Use a different store for session state.

Use Web Jobs.

Enable Always On to ensure Web Jobs run reliably.

Access the on-prem database using private connections like Azure VPN or Express Route.

Set up backup and restore.

Understand IP Address deprecation impact.

Ensure App Service Environments (ASE) are deployed in highly available configurations across Availability

Zones.

Ensure the ASE Network is configured correctly.

Consider configuring Upgrade preference if multiple environments are used.

Plan for scaling out the ASE cluster.

Use Deployment slots for resilient code deployments.

Avoid unnecessary worker restarts.

Use Run From Package to avoid deployment conflicts.

Use Basic or higher plans with two or more worker instances for high availability.

Evaluate the use of TCP and SNAT ports to avoid outbound connection errors.

Enable Health check to identify non-responsive workers.

Enable Autoscale to ensure adequate resources are available to service requests.

Enable Local Cache to reduce dependencies on cluster file servers.

Configuration recommendations

| ASE recommendation | Description |
|---|---|
| Consider disabling ARR Affinity for your App Service. | ARR Affinity (Application Request Routing) sets an affinity cookie, which is used to redirect users to the same node that handled their previous requests. |
| Use a different store for session state. | Storing session state in memory can result in losing session state when there's a problem with the application or App Service. It also limits the possibility of spreading the load over other instances. |
| Enable Always On to ensure Web Jobs run reliably. | A web app can time out after 20 minutes of inactivity. Only requests to the actual web app reset the timer. If your app runs continuously or if you schedule Timer trigger Web Jobs, enable Always On. Only available in the Basic, Standard, and Premium pricing tiers. |
| Access the on-prem database using private connections like Azure VPN or Express Route. | Access the on-premises database through a private connection so that the connection is secure and predictable. |
| Set up backup and restore. | Backup and restore lets you manually create or schedule app backups. You can retain backups for an indefinite amount of time. |
| Understand IP Address deprecation impact. | Floating addresses aren't guaranteed to remain on the resource. It's essential to check for deprecated IP addresses. |
| Deploy in highly available configuration across Availability Zones. | Ensures applications can continue to operate even if there's a datacenter-level failure. This provides excellent redundancy without requiring multiple deployments in different Azure regions. |
| Configure ASE Network correctly. | A common ASE pitfall occurs when ASE is deployed into a subnet with an IP address space that is too small to support future expansion. In such cases, ASE can be left unable to scale without redeploying the entire environment into a larger subnet. We highly recommend that adequate IP addresses be used to support either the maximum number of workers or the largest number that workloads are expected to need. A single ASE cluster can scale to 201 instances, which would require a /24 subnet. |
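The subnet sizing guidance above can be checked with a short calculation. The sketch assumes that Azure reserves five addresses in every subnet and that /29 is the smallest subnet Azure allows. It counts one address per instance only, so treat the result as a lower bound; an ASE may need further addresses for scaling and upgrade operations.

```python
import ipaddress

# Azure reserves five addresses in each subnet (network, gateway, two DNS, broadcast).
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Addresses left for resources in an Azure subnet of the given CIDR range."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(0, network.num_addresses - AZURE_RESERVED_PER_SUBNET)


def smallest_prefix_for(instances: int) -> int:
    """Longest IPv4 prefix (smallest subnet) whose usable addresses cover the instance count."""
    for prefix in range(29, 7, -1):  # walk from /29 toward larger subnets
        if 2 ** (32 - prefix) - AZURE_RESERVED_PER_SUBNET >= instances:
            return prefix
    raise ValueError("instance count too large for a single subnet")


# Hypothetical address range; 201 is the instance limit quoted in the table.
print(usable_addresses("10.0.0.0/24"), smallest_prefix_for(201))
```

Under these assumptions a /25 leaves 123 usable addresses, which is too few for 201 instances, while a /24 leaves 251. That is consistent with the /24 figure in the table.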

Enable Diagnostic Logging to provide insight into application behavior.

Enable Application Insights Alerts to signal fault conditions.

Review Azure App Service diagnostics to ensure common problems are addressed.

Evaluate per-app scaling for high density hosting on Azure App Service.

Explore the following table of recommendations to optimize your App Service configuration for service

reliability:

Configure Upgrade preferenceUpgrade preference if multiple environments

are used.

If lower environments are used for staging or testing,

consider configuring these environments to receive updates

sooner than the production environment. This will help to

identify any conflicts or problems with an update, and

provides a window to mitigate issues before they reach the

production environment. If multiple load balanced (zonal)

production deployments are used, Upgrade preference can

be used to protect the broader environment against issues

from platform upgrades.

Scale out the ASE cluster. Scaling ASE instances vertically or horizontally takes 30 to

60 minutes as new private instances need to be

provisioned. We highly recommend investing in up-front

planning for scaling during spikes in load or transient failure

scenarios.

Use Deployment slots for resilient code deployments.

Deployment slots allow for code to be deployed to instances that are warmed up before serving production traffic. For more information, reference Testing in production with Azure App Service.

Avoid unnecessary worker restarts. Many events can lead App Service workers to restart, such

as content deployment, App Settings changes, and VNet

integration configuration changes. A best practice is to make

changes in a deployment slot other than the slot currently

configured to accept production traffic. After workers are

recycled and warmed up, a swap can be performed without

unnecessary downtime.

Use Run From Package to avoid deployment conflicts. Run From Package provides several advantages:

- Eliminates file lock conflicts between deployment and runtime.

- Ensures only fully deployed apps are running at any time.

- May reduce cold-start times, particularly for JavaScript functions with large npm package trees.

Use Basic or higher plans with two or more worker instances

for high availability.

Azure App Service provides many configuration options that

aren't enabled by default.

Evaluate the use of TCP and SNAT ports. TCP connections are used for all outbound connections, but

SNAT ports are used when making outbound connections to

public IP addresses. SNAT port exhaustion is a common

failure scenario that can be predicted by load testing while

monitoring ports using Azure Diagnostics. For more

information, reference TCP and SNAT ports.

Enable Health check to identify non-responsive workers. Any health check is better than none at all. The logic behind

endpoint tests should assess all critical downstream

dependencies to ensure overall health. As a best practice, we

highly recommend tracking application health and cache

status in real time as this removes unnecessary delays

before action can be taken.
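The Health check guidance above, where the endpoint assesses all critical downstream dependencies, can be sketched as a small aggregation function. This is a minimal illustration, not the platform's implementation: the function name, the dependency names, and the choice of 200 and 503 as status codes are assumptions.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and derive one overall HTTP status code.

    Returns 200 only when all dependencies report healthy; otherwise 503,
    so a probe that treats non-2xx responses as unhealthy can react.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as an unhealthy dependency
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


if __name__ == "__main__":
    # Hypothetical checks; real ones would ping the database, cache, and so on.
    print(evaluate_health({"database": lambda: True, "cache": lambda: False}))
```

A web framework handler for the health check path would return the status code and, optionally, the per-dependency details as the response body.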

ASE RECOMMENDATION    DESCRIPTION

Enable Autoscale to ensure adequate resources are

available to service requests.

The default limit of App Service workers is 30. If the App

Service routinely uses 15 or more instances, consider

opening a support ticket to increase the maximum number

of workers to 2x the instance count required to serve

normal peak load.

Enable Local Cache to reduce dependencies on cluster file

servers.

Enabling local cache isn't always appropriate because it can lead

to slower worker startup times. When coupled with

Deployment slots, it can improve resiliency by removing

dependencies on file servers and also reduces storage-

related recycle events. Don't use local cache with a single

worker instance or when shared storage is required.

Enable Diagnostic Logging to provide insight into

application behavior.

Diagnostic logging provides the ability to ingest rich

application and platform-level logs through Log Analytics,

Azure Storage, or a third-party tool using Event Hub.

Enable Application Insights alerts to make you aware of

fault conditions.

Application performance monitoring with Application

Insights provides deep analyses into application

performance. For Windows Plans, a codeless deployment

approach is possible to quickly get a performance analysis

without changing any code.

Review Azure App Service diagnostics to ensure

common problems are addressed.

It's a good practice to regularly review service-related

diagnostics and recommendations, and take action as

appropriate.

Evaluate per-app scaling for high density hosting on Azure

App Service.

Per-app scaling can be enabled at the App Service plan level

to allow for scaling an app independently from the App

Service plan that hosts it. This way, an App Service plan can

be scaled to 10 instances, but an app can be set to use

only five. Apps are allocated within the available App Service

plan using a best effort approach for even distribution

across instances. While an even distribution isn't guaranteed,

the platform will make sure that two instances of the same

app won't be hosted on the same App Service plan instance.

ASE RECOMMENDATION    DESCRIPTION

TCP and SNAT ports

If a load test results in SNAT errors, it's necessary to either scale across more or larger workers, or implement coding practices to help preserve and reuse SNAT ports, such as connection pooling and the lazy loading of resources. We don't recommend exceeding 100 simultaneous outbound connections to a public IP address per worker. We also recommend avoiding communication with downstream services through public IP addresses when a private address (Private Endpoint) or Service Endpoint through VNet Integration could be used.

TCP port exhaustion happens when the sum of connections from a given worker exceeds the capacity. The number of available TCP ports depends on the size of the worker. The following table lists the current limits:

TCP PORTS    SMALL (B1, S1, P1, I1)    MEDIUM (B2, S2, P2, I2)    LARGE (B3, S3, P3, I3)

TCP ports    1920                      3968                       8064

Applications with many longstanding connections require ports to be left open for long periods of time, which can lead to TCP connection exhaustion. TCP connection limits are fixed based on instance size, so it's necessary to scale up to a larger worker size to increase the allotment of TCP connections, or implement code-level mitigations to govern connection usage. Similar to SNAT port exhaustion, you can use Azure Diagnostics to identify whether a problem exists with TCP port limits.

Source artifacts

To identify App Service plans with only one instance, use the following query:

Resources
| where type == "microsoft.web/serverfarms" and properties.computeMode == 'Dedicated'
| where sku.capacity == 1

Learn more

The Ultimate Guide to Running Healthy Apps in the Cloud
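As an illustration of the code-level mitigations mentioned in the TCP and SNAT ports guidance, the following minimal sketch caps the number of simultaneous outbound calls a worker process makes. It's a concurrency limiter, not a connection pool, and it complements rather than replaces connection reuse in your HTTP or database client. The class name and the timeout value are hypothetical; the default limit of 100 comes from the recommendation above.

```python
import threading
from contextlib import contextmanager

MAX_OUTBOUND = 100  # ceiling taken from the guidance on outbound connections per worker


class OutboundLimiter:
    """Caps concurrent outbound calls so one worker can't exhaust its ports."""

    def __init__(self, limit: int = MAX_OUTBOUND) -> None:
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_use = 0  # calls currently holding a slot
        self.peak = 0    # highest concurrency observed so far

    @contextmanager
    def slot(self, timeout: float = 5.0):
        """Hold one outbound slot for the duration of the with-block."""
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no outbound connection slot available")
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        try:
            yield
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()
```

Wrapping each outbound request in `with limiter.slot():` makes callers wait, or fail fast after the timeout, instead of opening more connections than the worker should hold.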

Azure App Service and cost optimization

12/16/2022

Checklist

Configuration recommendations

ASE RECOMMENDATION    DESCRIPTION

Ensure the ASE subnet is appropriately sized. The size of the subnet used to host an ASE directly affects

maximum scale. An ASE with no App Service plans will use

12 to 13 addresses before you create an app. It's

recommended that you deploy ASEs into a /24 subnet.

The maximum number of nodes in an ASE is 100. During a

scale-up event, the new machines are provisioned and

placed into the subnet before the applications are migrated

to the new machines, and the old machines are removed.

The subnet must allow for at least 200 machines to handle

the maximum deployment size, which requires a /24

subnet. If you plan for insufficient capacity, scale-out

operations will be limited.

Use App Service Premium v3 plan over the Premium v2 plan. The App Service Premium (v3) Plan has a 20% discount

versus comparable Pv2 configurations. Reserved Instance

commitment (1Y, 3Y, Dev/Test) discounts are available for

App Services running in the Premium v3 plan.

Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. This

service adds the power of Microsoft Azure to your application, such as:

Security

Load balancing

Autoscaling

Automated management

To explore how to optimize costs for Azure App Service in your workload, reference key features in Why use App

Service?

The following sections include a checklist and recommended configuration options specific to Azure App

Service.

Have you configured Azure App Service while considering cost optimization?

Ensure the ASE subnet is appropriately sized.

Consider cost savings by using the App Service Premium v3 plan over the Premium v2 plan.

Always use a scale-out and scale-in rule combination.

Understand the behavior of multiple scaling rules in a profile.

Consider Basic or Free tier for non-production usage.

Explore the following table of recommendations to optimize your App Service configuration for service cost:

Use a scale-out and scale-in rule combination. If you use only one part of the combination, autoscale will

only take action in a single direction (scale out, or in) until it

reaches the maximum, or minimum instance counts defined

in the profile. This scaling behavior isn't optimal. Ideally, you

want your resource to scale up at times of high usage to

ensure availability. Similarly, at times of low usage, you want

your resource to scale down, so you can realize cost savings.

Understand the behavior of multiple scaling rules in a profile. There are cases where you may have to set multiple rules in

a profile. On scale-out, autoscale runs if any rule is met. On

scale-in, autoscale requires all rules to be met.

Consider Basic or Free tier for non-production usage. For non-production App Service plans, consider scaling to Basic or

Free Tier and scale up, as needed, and scale down when not

in use – for example, during a Load Test exercise or based on

the capabilities provided (such as custom domain, SSL, and

more).
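The behavior of multiple scaling rules described in the table above (scale-out runs if any rule is met, scale-in requires all rules to be met) can be expressed as a short decision function. This is a sketch of the stated logic only. The function name is hypothetical, and checking scale-out before scale-in is an assumption made for the example.

```python
from typing import Iterable


def autoscale_decision(scale_out_rules: Iterable[bool], scale_in_rules: Iterable[bool]) -> str:
    """Return the action implied by the rule results of one autoscale profile.

    Each boolean says whether the corresponding rule's condition is currently met.
    """
    out_results = list(scale_out_rules)
    in_results = list(scale_in_rules)
    # Scale-out: a single met rule is enough.
    if any(out_results):
        return "scale-out"
    # Scale-in: every rule must be met (and at least one rule must exist).
    if in_results and all(in_results):
        return "scale-in"
    return "none"


if __name__ == "__main__":
    # CPU rule is met, memory rule isn't: scale-out still triggers.
    print(autoscale_decision([True, False], [False, False]))
    # Only one of two scale-in rules is met: no action is taken.
    print(autoscale_decision([False, False], [True, False]))
```

A profile that defines only scale-out rules never reaches the scale-in branch, which is why the rules should be used in combination.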

ASE RECOMMENDATION    DESCRIPTION

Next step

Azure App Service and operational excellence

12/16/2022

Design considerations

Checklist

Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. This

service adds the power of Microsoft Azure to your application, such as:

Security

Load balancing

Autoscaling

Automated management

To explore how Azure App Service can benefit the operational excellence of your application workload, reference

key features in Why use App Service?

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure App Service.

Microsoft guarantees that Azure App Service will be available 99.95% of the time. However, no SLA is provided

using either the Free or Shared tiers.

For more information, reference the SLA for App Service.

Have you configured Azure App Service while considering operational excellence?

Create a deployment plan because redeploying the app service can reset the scaled units.

Review the App Service Advisor recommendations.

Ensure you configure the App Service Environments (ASE) Network correctly.

Consider configuring Upgrade Preference if you're using multiple environments.

Plan for scaling out the ASE cluster.

Use Deployment Slots for resilient code deployments.

Avoid unnecessary worker restarts when deploying application code or configuration.

Use Run From Package to avoid deployment conflicts.

Use Basic or higher plans with two or more worker instances for high availability.

Evaluate the use of TCP and SNAT ports to avoid outbound connection errors.

Enable Health check to identify non-responsive workers.

Enable Autoscale to ensure adequate resources are available to service requests.

Enable Local Cache to reduce dependencies on cluster file servers.

Enable Diagnostic Logging to provide insight into application behavior.

Enable Application Insights Alerts to signal fault conditions.

Review Azure App Service diagnostics to ensure common problems are addressed.

Evaluate per-app scaling for high density hosting on Azure App Service.

ASE RECOMMENDATION    DESCRIPTION

Create a deployment plan because redeploying the app

service can reset the scaled units.

Automatic scaling rules apply during operation of the

environment, but redeploying the app service may cause the

plan to reset to the default number of units. Customers

should be aware of this behavior and plan for it during

deployments. Deploy only during off-peak times or deploy

maximum units with automatic scaling enabled to scale in

and out to prevent website performance implications.

Review the App Service Advisor recommendations. App Service Advisor gives you real-time recommendations in

the portal on resource exhaustion and conditions related to

CPU, memory, and connections.

Ensure you configure the App Service Environments

(ASE) Network correctly.

One common ASE pitfall occurs when ASE is deployed into a

subnet with an IP address space that is too small to support

future expansion. In such cases, ASE can be left unable to

scale without redeploying the entire environment into a

larger subnet. We highly recommend allocating enough IP

addresses to support either the maximum number of workers

or the largest number that your workloads are expected to

need. A single ASE cluster can scale to 201 instances, which

would require a /24 subnet.

Configure Upgrade preference if you're using multiple

environments.

If lower environments are used for staging or testing,

consider configuring these environments to receive updates

sooner than the production environment. This will help to

identify any conflicts or problems with an update, and

provides a window to mitigate issues before they reach the

production environment. If multiple load balanced (zonal)

production deployments are used, Upgrade preference can

be used to protect the broader environment against issues

from platform upgrades.

Scale out the ASE cluster. Scaling ASE instances vertically or horizontally takes 30 to

60 minutes as new private instances need to be

provisioned. We highly recommend investing in up-front

planning for scaling during spikes in load or transient failure

scenarios.

Use Deployment slots for resilient code deployments.

Deployment slots allow for code to be deployed to instances that are warmed up before serving production traffic. For more information, reference Testing in production with Azure App Service.

Avoid unnecessary worker restarts. Many events can lead App Service workers to restart, such

as content deployment, App Settings changes, and VNet

integration configuration changes. A best practice is to make

changes in a deployment slot other than the slot currently

configured to accept production traffic. After workers are

recycled and warmed up, a swap can be performed without

unnecessary downtime.

Use Run From Package to avoid deployment conflicts. Run From Package provides several advantages:

- Eliminates file lock conflicts between deployment and runtime.

- Ensures only fully deployed apps are running at any time.

- May reduce cold-start times, particularly for JavaScript functions with large npm package trees.

Use Basic or higher plans with two or more worker instances

for high availability.

Azure App Service provides many configuration options that

aren't enabled by default.

Evaluate the use of TCP and SNAT ports. TCP connections are used for all outbound connections, but

SNAT ports are used when making outbound connections to

public IP addresses. SNAT port exhaustion is a common

failure scenario that can be predicted by load testing while

monitoring ports using Azure Diagnostics. For more

information, reference TCP and SNAT ports.

Enable Health check to identify non-responsive workers. Any health check is better than none at all. The logic behind

endpoint tests should assess all critical downstream

dependencies to ensure overall health. As a best practice, we

highly recommend tracking application health and cache

status in real time as this removes unnecessary delays

before action can be taken.

Enable Autoscale to ensure adequate resources are

available to service requests.

The default limit of App Service workers is 30. If the App

Service routinely uses 15 or more instances, consider

opening a support ticket to increase the maximum number

of workers to 2x the instance count required to serve

normal peak load.

Enable Local Cache to reduce dependencies on cluster file

servers.

Enabling local cache isn't always appropriate because it can lead

to slower worker startup times. When coupled with

Deployment slots, it can improve resiliency by removing

dependencies on file servers and also reduces storage-

related recycle events. Don't use local cache with a single

worker instance or when shared storage is required.

Enable Diagnostic Logging to provide insight into

application behavior.

Diagnostic logging provides the ability to ingest rich

application and platform-level logs through Log Analytics,

Azure Storage, or a third-party tool using Event Hub.

Enable Application Insights alerts to make you aware of

fault conditions.

Application performance monitoring with Application

Insights provides deep analyses into application

performance. For Windows Plans, a codeless deployment

approach is possible to quickly get a performance analysis

without changing any code.

Review Azure App Service diagnostics to ensure

common problems are addressed.

It's a good practice to regularly review service-related

diagnostics and recommendations, and take action as

appropriate.

Evaluate per-app scaling for high density hosting on Azure

App Service.

Per-app scaling can be enabled at the App Service plan level

to allow for scaling an app independently from the App

Service plan that hosts it. This way, an App Service plan can

be scaled to 10 instances, but an app can be set to use

only five. Apps are allocated to available App Service plan

using a best effort approach for an even distribution across

instances. While an even distribution isn't guaranteed, the

platform will make sure that two instances of the same app

won't be hosted on the same App Service plan instance.

ASE RECOMMENDATION    DESCRIPTION

TCP and SNAT ports

If a load test results in SNAT errors, it's necessary to either scale across more or larger workers, or implement coding practices to help preserve and reuse SNAT ports, such as connection pooling and the lazy loading of resources. We don't recommend exceeding 100 simultaneous outbound connections to a public IP address per worker. We also recommend avoiding communication with downstream services through public IP addresses when a private address (Private Endpoint) or Service Endpoint through VNet Integration could be used.

TCP port exhaustion happens when the sum of connections from a given worker exceeds the capacity. The number of available TCP ports depends on the size of the worker. The following table lists the current limits:

TCP PORTS    SMALL (B1, S1, P1, I1)    MEDIUM (B2, S2, P2, I2)    LARGE (B3, S3, P3, I3)

TCP ports    1920                      3968                       8064

Applications with many longstanding connections require ports to be left open for long periods of time, which can lead to TCP connection exhaustion. TCP connection limits are fixed based on instance size, so it's necessary to scale up to a larger worker size to increase the allotment of TCP connections, or implement code-level mitigations to govern connection usage. Similar to SNAT port exhaustion, you can use Azure Diagnostics to identify whether a problem exists with TCP port limits.

Source artifacts

To identify App Service plans with only one instance, use the following query:

Resources
| where type == "microsoft.web/serverfarms" and properties.computeMode == 'Dedicated'
| where sku.capacity == 1

Learn more

The Ultimate Guide to Running Healthy Apps in the Cloud

Azure Batch and reliability

12/16/2022

Design and configuration checklist

Design and configuration recommendations

RECOMMENDATION    DESCRIPTION

Keep application binaries and reference data up to date in all

regions.

Staying up to date will ensure the region can be brought

online quickly without waiting for file upload and

deployment.

Use fewer jobs and more tasks. Using a job to run a single task is inefficient. For example, it's

more efficient to use a single job containing 1000 tasks

rather than creating 100 jobs that contain 10 tasks each.

Running 1000 jobs, each with a single task, would be the

least efficient, slowest, and most expensive approach.

Use multiple Batch accounts in various regions to allow your

application to continue running, if an Azure Batch account in

one region becomes unavailable.

It's crucial to have multiple accounts for a highly available

application.

Azure Batch allows you to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently

in Azure.

Use Azure Batch to:

Create and manage a pool of compute nodes (virtual machines).

Install applications you want to run.

Schedule jobs to run on the compute nodes.

The following sections include a design and configuration checklist, recommended design, and configuration

options specific to Azure Batch.

Have you designed your workload and configured Azure Batch with resiliency in mind?

Keep application binaries and reference data up to date in all regions.

Use fewer jobs and more tasks.

Use multiple Batch accounts in various regions to allow your application to continue running, if an Azure

Batch account in one region becomes unavailable.

Build durable tasks.

Pre-create all required services in each region, such as the Batch account and storage account.

Make sure the appropriate quotas are set on all subscriptions ahead of time, so you can allocate the required

number of cores using the Batch account.

Explore the following table of recommendations to optimize your workload design and Azure Batch

configuration for service reliability:

Build durable tasks. Tasks should be designed to withstand failure and

accommodate retry, especially for long running tasks. Ensure

tasks generate the same, single result even if they're run

more than once. One way to achieve the same result is to

make your tasks goal seeking. Another way is to make sure

your tasks are idempotent (tasks will have the same

outcome no matter how many times they're run).

Pre-create all required services in each region, such as the

Batch account and storage account.

There's often no charge for creating accounts and charges

accrue only when you use the account, or when you store

data.
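To make the durable-task recommendation in the table above more concrete, the following minimal sketch shows a task body that is both goal seeking (it first checks whether its result already exists) and idempotent (a retry produces the same single output). The function name, the JSON output format, and the digest computation are hypothetical stand-ins for real work.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def run_task(task_id: str, payload: dict, out_dir: Path) -> Path:
    """Produce exactly one result file per task, no matter how often it's retried."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{task_id}.json"
    if target.exists():
        # Goal seeking: the desired end state is already reached, so do nothing.
        return target
    # Stand-in for the real computation; deterministic for a given payload.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    result = {"task_id": task_id, "digest": hashlib.sha256(encoded).hexdigest()}
    # Write to a temporary file, then rename, so a crash never leaves a partial result.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    os.replace(tmp_name, target)
    return target
```

Because the output name is derived from the task ID and the final rename is atomic on the same file system, a retried or duplicated run converges on the same file instead of producing a second result.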

RECOMMENDATION    DESCRIPTION

Next step

Azure Batch and operational excellence

12/16/2022

Design and configuration checklist

Design and configuration recommendations

RECOMMENDATION    DESCRIPTION

Keep application binaries and reference data up to date in all

regions.

Staying up to date will ensure the region can be brought

online quickly without waiting for file upload and

deployment.

Use fewer jobs and more tasks. Using a job to run a single task is inefficient. For example, it's

more efficient to use a single job containing 1000 tasks

rather than creating 100 jobs that contain 10 tasks each.

Running 1000 jobs, each with a single task, would be the

least efficient, slowest, and most expensive approach.

Pre-create all required services in each region, such as the

Batch account and storage account.

There's often no charge for creating accounts and charges

accrue only when you use the account, or when you store

data.

Next step

Azure Batch allows you to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently

in Azure.

Use Azure Batch to:

Create and manage a pool of compute nodes (virtual machines).

Install applications you want to run.

Schedule jobs to run on the compute nodes.

The following sections include a design and configuration checklist, recommended design, and configuration

options specific to Azure Batch.

Have you designed your workload and configured Azure Batch with operational excellence in mind?

Keep application binaries and reference data up to date in all regions.

Use fewer jobs and more tasks.

Pre-create all required services in each region, such as the Batch account and storage account.

Make sure the appropriate quotas are set on all subscriptions ahead of time, so you can allocate the required

number of cores using the Batch account.

Explore the following table of recommendations to optimize your workload design and Azure Batch

configuration for operational excellence:

Azure Batch and performance efficiency

12/16/2022

Design checklist

Design and configuration recommendations

RECOMMENDATION    DESCRIPTION

Use fewer jobs and more tasks. Using a job to run a single task is inefficient. For example, it's

more efficient to use a single job containing 1000 tasks

rather than creating 100 jobs that contain 10 tasks each.

Running 1000 jobs, each with a single task, would be the

least efficient, slowest, and most expensive approach.
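The recommendation above, one job with many tasks rather than many jobs with few tasks, can be sketched as a planning step that runs before anything is submitted. This example deliberately avoids any Batch SDK calls. The dictionary layout, the task ID format, and the chunk size of 100 tasks per submission are assumptions made for illustration.

```python
from typing import Dict, Iterator, List


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_single_job(job_id: str, commands: List[str], chunk_size: int = 100) -> Dict:
    """Group all commands as tasks under one job, batched for submission."""
    tasks = [
        {"id": f"task-{index:04d}", "command_line": command}
        for index, command in enumerate(commands)
    ]
    return {"job_id": job_id, "task_batches": list(chunked(tasks, chunk_size))}


if __name__ == "__main__":
    plan = plan_single_job("render-frames", [f"render --frame {n}" for n in range(1000)])
    # One job, 1000 tasks, submitted in 10 batches under the assumed chunk size.
    print(plan["job_id"], len(plan["task_batches"]))
```

Each batch would then be handed to whichever submission call your tooling provides, all against the same job.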

Next step

Azure Batch allows you to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently

in Azure.

Use Azure Batch to:

Create and manage a pool of compute nodes (virtual machines).

Install applications you want to run.

Schedule jobs to run on the compute nodes.

The following sections include a design checklist and recommended design options specific to Azure Batch.

Have you designed your workload and configured Azure Batch with performance efficiency in mind?

Use fewer jobs and more tasks.

Consider the following recommendation to optimize your workload design and Azure Batch configuration for

performance efficiency:

AKS and reliability

Azure Well-Architected Framework review - Azure Kubernetes Service (AKS)

12/16/2022

Prerequisites

Reliability

Design checklist

This article provides architectural best practices for Azure Kubernetes Service (AKS). The guidance is based on

the five pillars of architecture excellence:

Reliability

Security

Cost optimization

Operational excellence

Performance efficiency

We assume that you understand system design principles, have working knowledge of Azure Kubernetes

Service, and are well versed with its features. For more information, see Azure Kubernetes Service.

Understanding the Well-Architected Framework pillars can help produce a high-quality, stable, and efficient

cloud architecture. We recommend that you review your workload by using the Azure Well-Architected

Framework Review assessment.

For context, consider reviewing a reference architecture that reflects these considerations in its design. We

recommend that you start with the baseline architecture for an Azure Kubernetes Service (AKS) cluster and

Microservices architecture on Azure Kubernetes Service. Also review the AKS landing zone accelerator, which

provides an architectural approach and reference implementation to prepare landing zone subscriptions for a

scalable Azure Kubernetes Service (AKS) cluster.

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to

minimize the effects of a single failing component. Use the following information to minimize failed instances.

When discussing reliability with Azure Kubernetes Service, it's important to distinguish between cluster

reliability and workload reliability. Cluster reliability is a shared responsibility between the cluster admin and

their resource provider, while workload reliability is the domain of a developer. Azure Kubernetes Service has

considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

Cluster architecture: For critical workloads, use availability zones for your AKS clusters.

Cluster architecture: Plan the IP address space to ensure your cluster can reliably scale, including handling

of failover traffic in multi-cluster topologies.

Cluster architecture: Enable Container insights to monitor your cluster and configure alerts for reliability-

impacting events.

Workload architecture: Ensure workloads are built to support horizontal scaling and report application

readiness and health.

  
AKS configuration recommendations
RECOMMENDATION    BENEFIT
Cluster and workload architectures: Control pod
scheduling using node selectors and affinity.
Allows the Kubernetes scheduler to logically isolate
workloads by hardware in the node. Unlike tolerations, pods
without a matching node selector can be scheduled on
labeled nodes, which allows unused resources on the nodes
to be consumed, but gives priority to pods that define the
matching node selector. Use node affinity for more flexibility,
which allows you to define what happens if the pod can't be
matched with a node.
Cluster architecture: Ensure proper selection of network
plugin based on network requirements and cluster sizing.
Azure CNI is required for specific scenarios, for example,
Windows-based node pools, specific networking
requirements and Kubernetes Network Policies. Reference
Kubenet versus Azure CNI for more information.
Cluster and workload architectures: Use the AKS
Uptime SLA for production grade clusters.
The AKS Uptime SLA guarantees:
- 99.95% availability of the Kubernetes API server endpoint
for AKS Clusters that use Azure Availability Zones, or
- 99.9% availability for AKS Clusters that don't use Azure
Availability Zones.
Cluster and workload architectures: Configure
monitoring of cluster with Container insights.
Container insights help monitor the health and performance
of controllers, nodes, and containers that are available in
Kubernetes through the Metrics API. Integration with
Prometheus enables collection of application and workload
metrics.
Cluster architecture: Use availability zones to maximize
resilience within an Azure region by distributing AKS agent
nodes across physically separate data centers.
By spreading node pools across multiple zones, nodes in one
node pool will continue running even if another zone has
gone down. If colocality requirements exist, either a regular
VMSS-based AKS deployment into a single zone or proximity
placement groups can be used to minimize internode
latency.
Cluster architecture: Adopt a multiregion strategy by
deploying AKS clusters across different Azure
regions to maximize availability and provide business
continuity.
Internet facing workloads should leverage Azure Front Door
or Azure Traffic Manager to route traffic globally across AKS
clusters.
Cluster and workload architectures: Define Pod
resource requests and limits in application deployment
manifests, and enforce with Azure Policy.
Container CPU and memory resource limits are necessary to
prevent resource exhaustion in your Kubernetes cluster.
Cluster and workload architectures: Keep the System
node pool isolated from application workloads.
System node pools require a VM SKU of at least 2 vCPUs
and 4 GB memory, but 4 vCPU or more is recommended.
Reference System and user node pools for detailed
requirements.
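The node selector and node affinity recommendation in the table above can be sketched as a small manifest builder. This is a minimal illustration, not an AKS-provided helper: the function name, the `hardware=gpu` label, and the image name are hypothetical. The `affinity.nodeAffinity` fields follow the standard Kubernetes pod specification.

```python
import json


def pod_with_node_preference(name, image, label_key, label_value, required=False):
    """Build a pod manifest (as a dict) that targets nodes carrying a given label.

    required=True  -> hard rule: the pod is only scheduled on matching nodes.
    required=False -> soft rule: the scheduler prefers matching nodes but may
                      place the pod elsewhere if none fit.
    """
    match = {
        "matchExpressions": [
            {"key": label_key, "operator": "In", "values": [label_value]}
        ]
    }
    if required:
        node_affinity = {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [match]
            }
        }
    else:
        node_affinity = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "preference": match}
            ]
        }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            "affinity": {"nodeAffinity": node_affinity},
        },
    }


if __name__ == "__main__":
    # Hypothetical image and label; JSON manifests can be applied like YAML ones.
    manifest = pod_with_node_preference(
        "gpu-job", "example.azurecr.io/gpu-job:1.0", "hardware", "gpu"
    )
    print(json.dumps(manifest, indent=2))
```

The soft rule corresponds to the "define what happens if the pod can't be matched" flexibility described above, while the hard rule behaves like a plain node selector.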
Cluster and workload architectures: Ensure your workload is running on user node pools and choose the
right size SKU. At a minimum, include two nodes for user node pools and three nodes for the system node
pool.
Cluster architecture: Use the AKS Uptime SLA to meet availability targets for production workloads.
Explore the following table of recommendations to optimize your AKS configuration for Reliability.

Cluster and workload architectures: Separate
applications to dedicated node pools based on specific
requirements.
Applications may share the same configuration and need
GPU-enabled VMs, CPU or memory optimized VMs, or the
ability to scale-to-zero. Avoid a large number of node pools to
reduce extra management overhead.
Cluster architecture: Use a NAT gateway for clusters that
run workloads that make many concurrent outbound
connections.
To avoid reliability issues caused by Azure Load Balancer
limitations with high concurrent outbound traffic, use a NAT
gateway instead to support reliable egress traffic at scale.
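To put the AKS Uptime SLA percentages quoted in the table above into perspective, the following sketch converts an availability target into a downtime budget. It is plain arithmetic rather than an SLA calculation defined by Azure. The 30-day period and the function name are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime that an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    targets = [
        ("clusters using Availability Zones", 99.95),
        ("clusters not using Availability Zones", 99.9),
    ]
    for label, target in targets:
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% for {label}: about {minutes:.1f} minutes per 30 days")
```

Under these assumptions, 99.95% leaves roughly 21.6 minutes of API server unavailability per 30 days, and 99.9% leaves roughly 43.2 minutes.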
RECOMMENDATION BENEFIT

Azure Policy

Cluster and workload architecture

Security

Design checklist

Recommendations
For more suggestions, see Principles of the reliability pillar.
Azure Kubernetes Service offers a wide variety of built-in Azure Policies that apply to the Azure resource itself,
like typical Azure Policies, and, through the Azure Policy add-on for Kubernetes, also within the cluster. There are
numerous policies, and key policies related to this pillar are summarized here. For a more detailed
view, see built-in policy definitions for Kubernetes.
Clusters have readiness or liveness health probes configured for your pod spec.
In addition to the built-in Azure Policy definitions, custom policies can be created for both the AKS resource and
for the Azure Policy add-on for Kubernetes. This allows you to add additional reliability constraints you'd like to
enforce in your cluster and workload architecture.
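The health probe policy listed above checks that pod specs declare readiness or liveness probes. A rough local equivalent is sketched below. It is not the policy's actual implementation: the function name and the sample pod spec are hypothetical, and only the `livenessProbe` and `readinessProbe` field names come from the Kubernetes container specification.

```python
def containers_without_probes(pod_spec):
    """Return names of containers that define neither a liveness nor a readiness probe."""
    return [
        container.get("name", "<unnamed>")
        for container in pod_spec.get("containers", [])
        if "livenessProbe" not in container and "readinessProbe" not in container
    ]


if __name__ == "__main__":
    # Hypothetical pod spec: one container with probes, one without.
    pod_spec = {
        "containers": [
            {
                "name": "api",
                "image": "example.azurecr.io/api:1.0",
                "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
                "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8080}},
            },
            {"name": "sidecar", "image": "example.azurecr.io/sidecar:1.0"},
        ]
    }
    print(containers_without_probes(pod_spec))
```

A check like this could run in a CI pipeline before deployment, with Azure Policy remaining the enforcement point inside the cluster.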
Security is one of the most important aspects of any architecture. To explore how AKS can bolster the security of
your application workload, we recommend you review the Security design principles. If your Azure Kubernetes
Service cluster needs to be designed to run a sensitive workload that meets the regulatory requirements of the
Payment Card Industry Data Security Standard (PCI-DSS 3.2.1), review AKS regulated cluster for PCI-DSS 3.2.1.
To learn about DoD Impact Level 5 (IL5) support and requirements with AKS, review Azure Government IL5
isolation requirements.
When discussing security with Azure Kubernetes Service, it's important to distinguish between cluster security
and workload security. Cluster security is a shared responsibility between the cluster admin and their resource
provider, while workload security is the domain of a developer. Azure Kubernetes Service has considerations and
recommendations for both of these roles.
In the design checklist and list of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
Cluster architecture: Use managed identities to avoid managing and rotating service principals.
Cluster architecture: Use Kubernetes role-based access control (RBAC) with Azure AD for least-privilege
access, and minimize granting administrator privileges to protect configuration and secrets access.
Cluster architecture: Use Microsoft Defender for Containers with Azure Sentinel to detect and quickly
respond to threats across your cluster and the workloads running on it.
Cluster architecture: Deploy a private AKS cluster to ensure cluster management traffic to your API server
remains on your private network. Or use the API server allow list for non-private clusters.
Workload architecture: Use a Web Application Firewall to secure HTTP(S) traffic.
Workload architecture: Ensure your CI/CD pipeline is hardened with container-aware scanning.

RECOMMENDATION BENEFIT
Cluster architecture: Use Azure Active Directory
integration.
Using Azure AD centralizes the identity management
component. Any change in user account or group status is
automatically updated in access to the AKS cluster. The
developers and application owners of your Kubernetes
cluster need access to different resources.
Cluster architecture: Authenticate with Azure Active
Directory (Azure AD) to Azure Container Registry.
AKS and Azure AD enable authentication with Azure
Container Registry without the use of imagePullSecrets.
Review Authenticate with Azure Container Registry
from Azure Kubernetes Service for more information.
Cluster architecture: Secure network traffic to your API
server with private AKS cluster.
By default, network traffic between your node pools and the
API server travels the Microsoft backbone network; by using
a private cluster, you can ensure network traffic to your API
server remains on the private network only.
Cluster architecture: For non-private AKS clusters, use
API server authorized IP ranges.
When using public clusters, you can still limit the traffic that
can reach your cluster's API server by using the authorized IP
range feature. Include sources like the public IPs of your
deployment build agents, operations management, and
node pools' egress point (such as Azure Firewall).
Cluster architecture: Protect the API server with Azure
Active Directory RBAC.
Securing access to the Kubernetes API Server is one of the
most important things you can do to secure your cluster.
Integrate Kubernetes role-based access control (RBAC) with
Azure AD to control access to the API server. Disable local
accounts to enforce all cluster access using Azure AD-based
identities.
Cluster architecture: Use Azure network policies or Calico. Secure and control network traffic between pods in a cluster.
Cluster architecture: Secure clusters and pods with Azure
Policy.
Azure Policy can help to apply at-scale enforcement and
safeguards on your clusters in a centralized, consistent
manner. It can also control what functions pods are granted
and if anything is running against company policy.
Cluster architecture: Secure container access to resources. Limit access to actions that containers can perform. Provide
the least number of permissions, and avoid the use of root
or privilege escalation.
Workload architecture: Use a Web Application Firewall to
secure HTTP(S) traffic.
To scan incoming traffic for potential attacks, use a web
application firewall such as Azure Web Application Firewall
(WAF) on Azure Application Gateway or Azure Front Door.
Cluster architecture: Control cluster egress traffic. Ensure your cluster's outbound traffic is passing through a
network security point such as Azure Firewall or an HTTP
proxy.
Cluster architecture: Use the open-source Azure AD
Workload Identity and Secrets Store CSI Driver with Azure
Key Vault.
Protect and rotate secrets, certificates, and connection
strings in Azure Key Vault with strong encryption. Provides
an access audit log, and keeps core secrets out of the
deployment pipeline.
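The authorized IP range recommendation in the table above amounts to an allow list of CIDR blocks for the API server. The sketch below uses Python's standard `ipaddress` module to show how such a list can be checked before it is applied, for example to confirm that a build agent's public IP is covered. The function name and the example ranges (taken from reserved documentation address blocks) are hypothetical. The code doesn't call any Azure API.

```python
import ipaddress


def is_authorized(source_ip, authorized_ranges):
    """Return True if source_ip falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(cidr) for cidr in authorized_ranges
    )


if __name__ == "__main__":
    # Hypothetical allow list: a build agent subnet and a firewall egress IP.
    authorized_ranges = ["203.0.113.0/24", "198.51.100.10/32"]
    for candidate in ["203.0.113.45", "198.51.100.10", "192.0.2.7"]:
        status = "allowed" if is_authorized(candidate, authorized_ranges) else "blocked"
        print(f"{candidate}: {status}")
```

Keeping the list in source control next to a check like this makes it easier to review which build agents, operations hosts, and egress points can reach the API server.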
Explore the following table of recommendations to optimize your AKS configuration for security.

Cluster architecture: Use Microsoft Defender for Containers.

Monitor and maintain the security of your clusters, containers, and their applications.

RECOMMENDATION BENEFIT

Policy definitions

Cluster architecture

Cluster and workload architecture

Cost optimization

For more suggestions, see Principles of the security pillar.

Azure Advisor helps ensure and improve Azure Kubernetes Service. It makes recommendations on a subset of

the items listed in the policy section below, such as clusters without RBAC configured, missing Microsoft

Defender configuration, and unrestricted network access to the API server. Likewise, it makes workload

recommendations for some of the pod security initiative items. Review the recommendations.

Azure Policy offers various built-in policy definitions that apply to the Azure resource, like standard

policy definitions, and, through the Azure Policy add-on for Kubernetes, also within the cluster. Many of the Azure

resource policies come in both Audit/Deny and Deploy If Not Exists variants.

There are numerous policies, and key policies related to this pillar are summarized here. For a more

detailed view, see built-in policy definitions for Kubernetes.

Microsoft Defender for Cloud-based policies

Authentication mode and configuration policies (Azure AD, RBAC, disable local authentication)

API Server network access policies, including private cluster

Kubernetes cluster pod security initiatives for Linux-based workloads

Include pod and container capability policies such as AppArmor, sysctl, security caps, SELinux, seccomp,

privileged containers, and automount cluster API credentials

Mount, volume drivers, and filesystem policies

Pod/Container networking policies, such as host network, port, allowed external IPs, HTTPS, and internal load

balancers

Azure Kubernetes Service deployments often also use Azure Container Registry for Helm charts and container

images. Azure Container Registry also supports a wide variety of Azure policies that span network restrictions,

access control, and Microsoft Defender for Cloud, which complement a secure AKS architecture.

In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure

Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your

cluster and workload architecture.

For more suggestions, see AKS security concepts and evaluate our security hardening recommendations based

on the CIS Kubernetes benchmark.

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational

efficiencies. We recommend you review the Cost optimization design principles. Tightly associated with (but not

limited to) cost optimization is sustainability. Cost optimization goals typically support workload sustainability

goals as well.

When discussing cost optimization with Azure Kubernetes Service, it's important to distinguish between cost of

cluster resources and cost of workload resources. Cluster resources are a shared responsibility between the

cluster admin and their resource provider, while workload resources are the domain of a developer. Azure

Kubernetes Service has considerations and recommendations for both of these roles.

Design checklist

Recommendations

RECOMMENDATION BENEFIT

Cluster and workload architectures: Align SKU selection

and managed disk size with workload requirements.

Matching your selection to your workload demands ensures

you don't pay for unneeded resources.

Cluster and workload architectures: Use the Start and

Stop feature in Azure Kubernetes Service (AKS).

The AKS Stop and Start cluster feature allows AKS customers

to pause an AKS cluster, saving time and cost. The stop and

start feature keeps cluster configurations in place and

customers can pick up where they left off without

reconfiguring the clusters.

Cluster architecture: Enable cluster autoscaler to

automatically reduce the number of agent nodes in response

to excess resource capacity.

Automatically scaling down the number of nodes in your AKS

cluster lets you run an efficient cluster when demand is low,

and scale up again when demand returns.

Workload architecture: Consider using Azure Spot VMs

for workloads that can handle interruptions, early

terminations, and evictions.

For example, workloads such as batch processing jobs,

development and testing environments, and large compute

workloads may be good candidates for you to schedule on a

spot node pool. Using spot VMs for nodes with your AKS

cluster allows you to take advantage of unused capacity in

Azure at significant cost savings.

Cluster architecture: Enforce resource quotas at the

namespace level.

Resource quotas provide a way to reserve and limit

resources across a development team or project. These

quotas are defined on a namespace and can be used to set

quotas on compute resources, storage resources, and object

counts. When you define resource quotas, all pods created in

the namespace must provide limits or requests in their pod

specifications.

Workload architecture: Use the Horizontal pod

autoscaler.

Adjust the number of pods in a deployment depending on

CPU utilization or other select metrics, which support cluster

scale-in operations.

Cluster architecture: Configure monitoring of cluster with

Container insights.

Container insights help provide actionable insights into your

cluster's idle and unallocated resources.
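The Horizontal pod autoscaler recommendation in the table above adjusts pod counts from observed metrics. The core calculation documented by Kubernetes is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below reproduces that formula only. The function name, the default bounds, and the sample numbers are illustrative assumptions, and real autoscaler behavior also includes tolerances and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply the basic HPA ratio formula, clamped to the configured bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical readings: average CPU utilization against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))  # scale out
    print(desired_replicas(4, current_metric=20, target_metric=60))  # scale in
```

When utilization falls well below the target, the pod count shrinks, which frees node capacity and lets the cluster autoscaler remove nodes.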

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

For cluster cost optimization, go to the Azure pricing calculator and select Azure Kubernetes Service from the

available products. You can test different configuration and payment plans in the calculator.

Cluster architecture: Use appropriate VM SKU per node pool and reserved instances where long-term

capacity is expected.

Cluster and workload architectures: Use appropriate managed disk tier and size.

Cluster architecture: Review performance metrics, starting with CPU, memory, storage, and network, to

identify cost optimization opportunities by cluster, nodes, and namespace.

Cluster architecture: Use cluster autoscaler to scale in when workloads are less active.

Explore the following table of recommendations to optimize your AKS configuration for cost.

Cluster architecture: Select the appropriate region. Due to many factors, cost of resources varies per region in

Azure. Evaluate the cost, latency, and compliance

requirements to ensure you're running your workload cost-

effectively and it doesn't affect your end-users or create

additional networking charges.

Cluster architecture: Sign up for Azure Reservations. If you've properly planned for capacity, and your workload is

predictable and will exist for an extended period of time, sign

up for Azure Reserved Instances to further reduce your

resource costs.

Workload architecture: Maintain small and optimized

images.

Streamlining your images helps reduce costs since new

nodes need to download these images. Build images in a

way that allows the container to start as soon as possible to

help avoid user request failures or timeouts while the

application is starting up, which can lead to

overprovisioning.

Cluster architecture: Use Kubernetes Resource Quotas. Resource quotas can be used to limit resource consumption

for each namespace in your cluster, and by extension

resource utilization for the Azure service.
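The resource quota behavior described above can be illustrated with a simplified admission check: a namespace has hard limits, and a pod must both declare the quota-governed resources and fit within what remains. This is a sketch of the idea, not the Kubernetes admission controller. The function name is hypothetical, and quantities are reduced to plain integers (CPU in millicores, memory in MiB) as a simplifying assumption.

```python
def admit_pod(hard, used, pod_requests):
    """Decide whether a pod fits a namespace quota.

    hard, used, and pod_requests map resource names such as 'requests.cpu'
    to integer quantities. Returns a tuple (allowed, reason).
    """
    for resource, limit in hard.items():
        if resource not in pod_requests:
            # With a quota in place, pods must declare the governed resource.
            return False, f"pod must specify {resource}"
        if used.get(resource, 0) + pod_requests[resource] > limit:
            return False, f"{resource} quota exceeded"
    return True, "ok"


if __name__ == "__main__":
    # Hypothetical namespace quota and current usage.
    hard = {"requests.cpu": 2000, "requests.memory": 4096}
    used = {"requests.cpu": 1500, "requests.memory": 1024}
    print(admit_pod(hard, used, {"requests.cpu": 250, "requests.memory": 512}))
    print(admit_pod(hard, used, {"requests.cpu": 600, "requests.memory": 512}))
    print(admit_pod(hard, used, {"requests.cpu": 250}))
```

Because each namespace is capped, total consumption across teams stays bounded, which in turn bounds the node capacity the cluster needs.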

RECOMMENDATION BENEFIT

Policy definitions

Cloud efficiency

Operational excellence

Design checklist

For more suggestions, see Principles of the cost optimization pillar.

While there are no built-in policies that are related to cost optimization, custom policies can be created for both

the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional cost

optimization constraints you'd like to enforce in your cluster and workload architecture.

Making workloads more sustainable and cloud efficient requires combining efforts around cost optimization,

reducing carbon emissions, and optimizing energy consumption. Optimizing the application's cost is the

initial step in making workloads more sustainable.

Learn how to build sustainable and efficient AKS workloads in Sustainable software engineering principles in

Azure Kubernetes Service (AKS).

Monitoring and diagnostics are crucial. Not only can you measure performance statistics, but you can also use metrics

to troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design

principles and the Day-2 operations guide.

When discussing operational excellence with Azure Kubernetes Service, it's important to distinguish between

cluster operational excellence and workload operational excellence. Cluster operations are a shared

responsibility between the cluster admin and their resource provider, while workload operations are the domain

of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

Cluster architecture: Use a template-based deployment using Bicep, Terraform, or others. Make sure that

all deployments are repeatable, traceable, and stored in a source code repo.

Cluster architecture: Build an automated process to ensure your clusters are bootstrapped with the

  
Recommendations
RECOMMENDATION BENEFIT
Cluster and workload architectures: Review AKS best
practices documentation.
To build and run applications successfully in AKS, there are
key considerations to understand and implement. These
areas include multi-tenancy and scheduler features, cluster
and pod security, and business continuity and disaster
recovery.
Cluster and workload architectures: Review Azure
Chaos Studio.
Azure Chaos Studio can help simulate faults and trigger
disaster recovery situations.
Cluster and workload architectures: Configure
monitoring of cluster with Container insights.
Container insights help monitor the performance of
containers by collecting memory and processor metrics from
controllers, nodes, and containers that are available in
Kubernetes through the Metrics API and container logs.
Workload architecture: Monitor application performance
with Azure Monitor.
Configure Application Insights for code-based monitoring of
applications running in an AKS cluster.
Workload architecture: Configure scraping of
Prometheus metrics with Container insights.
Container insights, which is part of Azure Monitor, provides
a seamless onboarding experience to collect Prometheus
metrics. Reference Configure scraping of Prometheus metrics
for more information.
Cluster architecture: Adopt a multiregion strategy by
deploying AKS clusters across different Azure
regions to maximize availability and provide business
continuity.
Internet facing workloads should leverage Azure Front Door
or Azure Traffic Manager to route traffic globally across AKS
clusters.
Cluster architecture: Operationalize clusters and pods
configuration standards with Azure Policy.
Azure Policy can help to apply at-scale enforcement and
safeguards on your clusters in a centralized, consistent
manner. It can also control what functions pods are granted
and if anything is running against company policy.
Workload architecture: Use platform capabilities in your
release engineering process.
Kubernetes and ingress controllers support many advanced
deployment patterns for inclusion in your release
engineering process. Consider patterns like blue-green
deployments or canary releases.
necessary cluster-wide configurations and deployments. This is often performed using GitOps.
Workload architecture: Use repeatable and automated deployment processes for your workload within
your software development lifecycle.
Cluster architecture: Enable diagnostics settings to ensure control plane or core API server interactions are
logged.
Cluster and workload architectures: Enable Container insights to collect metrics, logs, and diagnostics to
monitor the availability and performance of the cluster and workloads running on it.
Workload architecture: The workload should be designed to emit telemetry that can be collected, which
should also include liveness and readiness statuses.
Cluster and workload architectures: Use chaos engineering practices that target Kubernetes to identify
application or platform reliability issues.
Workload architecture: Optimize your workload to operate and deploy efficiently in a container.
Cluster and workload architectures: Enforce cluster and workload governance using Azure Policy.
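The checklist item about emitting liveness and readiness statuses can be sketched with Python's standard library alone. This is a minimal illustration under stated assumptions: the `/healthz` and `/readyz` paths, the `READY` flag, and the helper names are hypothetical conventions, not something AKS prescribes. A real workload would normally expose these routes through its own web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag; a real service would flip it once its
# dependencies (database connections, caches, and so on) are available.
READY = {"value": False}


def probe_status(path, ready):
    """Map a probe path to an HTTP status code and a response body."""
    if path == "/healthz":  # liveness: the process is up and responding
        return 200, "alive"
    if path == "/readyz":   # readiness: the service can accept traffic
        return (200, "ready") if ready else (503, "not ready")
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_status(self.path, READY["value"])
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_probe_server(port=8080):
    """Create (but do not start) an HTTP server that answers probe requests."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)

# Usage: READY["value"] = True; make_probe_server().serve_forever()
```

Pointing the pod's liveness and readiness probes at these two paths lets Kubernetes restart a hung container and hold traffic back until the workload reports that it's ready.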
Explore the following table of recommendations to optimize your AKS configuration for operations.

Cluster and workload architectures: For mission-critical

workloads, use stamp-level blue/green deployments.

Automate your mission-critical design areas, including

deployment and testing.

RECOMMENDATION BENEFIT

Policy definitions

Cluster architecture

Cluster and workload architecture

Performance efficiency

Design checklist

For more suggestions, see Principles of the operational excellence pillar.

Azure Advisor also makes recommendations on a subset of the items listed in the policy section below, such as

unsupported AKS versions and unconfigured diagnostic settings. Likewise, it makes workload recommendations

around the use of the default namespace.

Azure Policy offers various built-in policy definitions that apply to the Azure resource, like standard

policy definitions, and, through the Azure Policy add-on for Kubernetes, also within the cluster. Many of the Azure

resource policies come in both Audit/Deny and Deploy If Not Exists variants.

There are numerous policies, and key policies related to this pillar are summarized here. For a more

detailed view, see built-in policy definitions for Kubernetes.

Azure Policy add-on for Kubernetes

GitOps configuration policies

Diagnostics settings policies

AKS version restrictions

Prevent command invoke

Namespace deployment restrictions

In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure

Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your

cluster and workload architecture.

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an

efficient manner. We recommend you review the Performance efficiency principles.

When discussing performance with Azure Kubernetes Service, it's important to distinguish between cluster

performance and workload performance. Cluster performance is a shared responsibility between the cluster

admin and their resource provider, while workload performance is the domain of a developer. Azure Kubernetes

Service has considerations and recommendations for both of these roles.

In the design checklist and list of recommendations below, call-outs are made to indicate whether each

choice is applicable to cluster architecture, workload architecture, or both.

As you make design choices for Azure Kubernetes Service, review the Performance efficiency principles.

Cluster and workload architectures: Perform and iterate on a detailed capacity plan exercise that

includes SKU, autoscale settings, IP addressing, and failover considerations.

Cluster architecture: Enable cluster autoscaler to automatically adjust the number of agent nodes in

response to workload demands.

Cluster architecture: Use the Horizontal pod autoscaler to adjust the number of pods in a deployment

depending on CPU utilization or other select metrics.

  
Recommendations
RECOMMENDATION BENEFIT
Cluster and workload architectures: Develop a detailed
capacity plan and continually review and revise.
After formalizing your capacity plan, it should be frequently
updated by continuously observing the resource utilization
of the cluster.
Cluster architecture: Enable cluster autoscaler to
automatically adjust the number of agent nodes in response
to resource constraints.
The ability to automatically scale up or down the number of
nodes in your AKS cluster lets you run an efficient,
cost-effective cluster.
Cluster and workload architectures: Separate
workloads into different node pools and consider scaling
user node pools.
Unlike System node pools that always require running
nodes, user node pools allow you to scale up or down.
Workload architecture: Use AKS advanced scheduler
features.
Helps control balancing of resources for workloads that
require them.
Workload architecture: Use meaningful workload scaling
metrics.
Not all scale decisions can be derived from CPU or memory
metrics. Often scale considerations will come from more
complex or even external data points. Use KEDA to build a
meaningful auto scale ruleset based on signals that are
specific to your workload.
  
Policy definitions

Cluster and workload architecture

Additional resources

Azure Architecture Center guidance
Cluster and workload architectures: Perform ongoing load testing activities that exercise both the pod
and cluster autoscaler.
Cluster and workload architectures: Separate workloads into different node pools, allowing independent
scaling.
Explore the following table of recommendations to optimize your Azure Kubernetes Service configuration for
performance.
For more suggestions, see Principles of the performance efficiency pillar.
Azure Policy offers various built-in policy definitions that apply to the Azure resource, like standard
policy definitions, and, through the Azure Policy add-on for Kubernetes, also within the cluster. Many of the Azure
resource policies come in both Audit/Deny and Deploy If Not Exists variants.
There are numerous policies, and key policies related to this pillar are summarized here. For a more
detailed view, see built-in policy definitions for Kubernetes.
CPU and memory resource limits
In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure
Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your
cluster and workload architecture.
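The CPU and memory resource limits policy listed above targets containers that omit resource settings. Before relying on in-cluster enforcement, a similar check can be run against manifests locally. The sketch below is an assumption-laden illustration, not the policy's actual logic: the function name and the choice to require both `requests` and `limits` for `cpu` and `memory` are hypothetical.

```python
REQUIRED_RESOURCES = ("cpu", "memory")


def missing_resource_settings(pod_spec):
    """Map each container name to the resource settings it fails to declare."""
    findings = {}
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        missing = [
            f"{section}.{resource}"
            for section in ("requests", "limits")
            for resource in REQUIRED_RESOURCES
            if resource not in resources.get(section, {})
        ]
        if missing:
            findings[container.get("name", "<unnamed>")] = missing
    return findings


if __name__ == "__main__":
    # Hypothetical pod spec: 'web' is complete, 'worker' lacks limits.
    pod_spec = {
        "containers": [
            {
                "name": "web",
                "resources": {
                    "requests": {"cpu": "250m", "memory": "256Mi"},
                    "limits": {"cpu": "500m", "memory": "512Mi"},
                },
            },
            {
                "name": "worker",
                "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
            },
        ]
    }
    print(missing_resource_settings(pod_spec))
```

Declared requests give the scheduler and autoscalers accurate numbers to work from, so a check like this supports the capacity planning described in this pillar.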
AKS baseline architecture
Advanced AKS microservices architecture
AKS cluster for a PCI-DSS workload
AKS baseline for multiregion clusters

Cloud Adoption Framework guidance

Next steps

AKS Landing Zone Accelerator

Quickstart: Deploy an Azure Kubernetes Service (AKS) cluster using the Azure CLI

Azure Functions and security

12/16/2022 • 2 minutes to read

Design consideration checklist

Design consideration recommendations

AZURE FUNCTIONS DESIGN RECOMMENDATIONS DESCRIPTION

Evaluate if Azure Functions requires an HTTP trigger. Azure Functions supports multiple specific triggers and

bindings. These include Azure Blob storage, Azure Cosmos

DB, Azure Service Bus, and many more. If an HTTP trigger is

needed, then consider protecting that HTTP endpoint like

any other web application. Common protection measures

include keeping the HTTP endpoint internal to specific Azure

virtual networks by using private endpoint connections or

service endpoints. See Azure Functions networking options

for more information. If the Functions HTTP endpoint will be

exposed to the internet, then it's recommended to secure

the endpoint behind a web application firewall (WAF).

Treat Azure Functions code just like any other code. Subject Azure Functions code to code scanning tools that

are integrated with your CI/CD pipeline.

Use guidance available on Securing Azure Functions. This guidance addresses key security concerns such as

operations, deployment, and network security.

Consider using Azure Functions Proxy to act as a facade. Functions Proxy can inspect and modify incoming requests

and responses.

Next step

Azure Functions is a cloud service available on-demand that provides all the continually updated infrastructure

and resources needed to run your applications. Functions allow you to write less code, maintain less

infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud

infrastructure provides all the up-to-date resources needed to keep your applications securely running.

For more information related to network security, reference Securing Azure Functions.

The following sections include a design consideration checklist and recommendations specific to Azure

Functions and security.

Have you designed your workload and configured Azure Functions with security in mind?

Evaluate if Azure Functions requires an HTTP trigger.

Treat Azure Functions code just like any other code.

Use guidance available on Securing Azure Functions.

Consider using Azure Functions Proxy to act as a facade.

The following table reflects design consideration recommendations and descriptions related to Azure Functions:

Azure Service Fabric and reliability

Azure Well-Architected Framework review - Virtual Machines

12/16/2022 • 9 minutes to read

Virtual Machines is an on-demand, scalable computing resource that gives you the flexibility of virtualization
without having to buy and maintain physical hardware to run it.

In this article, you learn architectural best practices for Azure Virtual Machines. The guidance is based on the five
pillars of architectural excellence:

- Reliability
- Security
- Cost optimization
- Operational excellence
- Performance efficiency

Prerequisites

- Understanding the Well-Architected Framework pillars can help produce a high-quality, stable, and
  efficient cloud architecture. We recommend that you review your workload using the Microsoft Azure
  Well-Architected Review assessment.
- Use a reference architecture to review the considerations based on the guidance provided in this article.
  We recommend that you start with Run a Linux VM on Azure.

Reliability

As you make design choices for virtual machines, review the design principles for adding reliability to the
architecture.

Design checklist

- Review the SLAs for virtual machines.
- Deploy VMs in a scale set using the Flexible orchestration mode.
- Deploy VMs across availability zones.
- Install applications on data disks.
- Use maintenance control.

Recommendations

Explore the following table of recommendations to optimize your Virtual Machine configuration for service
reliability:

Recommendation: Review SLAs for virtual machines.
Benefit: When defining test availability and recovery targets, make sure you have a good understanding of the
SLAs offered for VMs.

Recommendation: Deploy using Flexible scale sets.
Benefit: Even single-instance VMs should be deployed into a scale set using the Flexible orchestration mode to
future-proof your application for scaling and availability. Flexible orchestration offers high availability
guarantees (up to 1000 VMs) by spreading VMs across fault domains in a region or within an Availability Zone.

Recommendation: Deploy across availability zones.
Benefit: Azure availability zones are physically separate locations within each Azure region that are tolerant to
local failures.

Recommendation: Install applications on data disks.
Benefit: Having your data on a separate disk from your OS disk makes it easier to recover from failures and to
migrate workloads.

Recommendation: Use maintenance control.
Benefit: Control when VM maintenance occurs to manage the timing of system restarts.

Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.

Security

This article provides an overview of the core Azure security features that can be used with virtual machines.
As you make design choices for virtual machines, review the security principles and Security best practices for
adding security to the architecture.

Design checklist

As you make design choices for your virtual machine deployment, review the design principles for security.

- Review the Linux security baseline.
- Review the Windows security baseline.
- Manage authentication and access control.
- Protect against malware.
- Manage updates.
- Use encryption.

Recommendations

Explore the following table of recommendations to optimize your virtual machine configuration for security.

Recommendation: Consider using Azure Bastion.
Benefit: Authentication and access control using Azure Bastion provides secure and seamless RDP/SSH
connectivity to your virtual machines directly from the Azure portal over TLS.

Recommendation: Protect against malware.
Benefit: Install antimalware protection to help identify and remove viruses.

Recommendation: Manage updates.
Benefit: Use a solution like Azure Automation to manage operating system updates.

Recommendation: Monitor for security.
Benefit: To monitor the security posture of your Windows and Linux VMs, use Microsoft Defender for Cloud.

Recommendation: Use encryption.
Benefit: Use Azure Disk Encryption to protect your data.

For more suggestions, see Principles of the security pillar.
Azure Advisor helps you ensure and improve security. Review the recommendations.

Policy definitions
- Deploy default Microsoft IaaSAntimalware extension for Windows Server: This policy deploys a Microsoft
  IaaSAntimalware extension with a default configuration when a VM is not configured with the antimalware
  extension.
- Microsoft IaaSAntimalware extension should be deployed on Windows servers: This policy audits any
  Windows server VM without the Microsoft IaaSAntimalware extension deployed.
- Only approved VM extensions should be installed: This policy governs the virtual machine extensions that
  are not approved.
- Managed disks should be double encrypted with both platform-managed and customer-managed keys:
  High-security-sensitive customers who are concerned about the risk associated with any particular encryption
  algorithm, implementation, or key being compromised can opt for an additional layer of encryption using a
  different encryption algorithm/mode at the infrastructure layer using platform-managed encryption keys.
  The disk encryption sets are required to use double encryption. Learn more at
  https://aka.ms/disks-doubleEncryption.
- Managed disks should use a specific set of disk encryption sets for the customer-managed key encryption:
  Requiring a specific set of disk encryption sets to be used with managed disks gives you control over the keys
  used for encryption at rest. You are able to select the allowed encryption sets, and all others are rejected when
  attached to a disk. Learn more at https://aka.ms/disks-cmk.
- Microsoft Antimalware for Azure should be configured to automatically update protection signatures: This
  policy audits any Windows virtual machine not configured with automatic update of Microsoft Antimalware
  protection signatures.
- OS and data disks should be encrypted with a customer-managed key: Use customer-managed keys to
  manage the encryption at rest of the contents of your managed disks. By default, the data is encrypted at rest
  with platform-managed keys, but customer-managed keys are commonly required to meet regulatory
  compliance standards. Customer-managed keys enable the data to be encrypted with an Azure Key Vault key
  created and owned by you. You have full control and responsibility for the key lifecycle, including rotation
  and management. Learn more at https://aka.ms/disks-cmk.
- Virtual machines and virtual machine scale sets should have encryption at host enabled: Use encryption at
  host to get end-to-end encryption for your virtual machine and virtual machine scale set data. Encryption at
  host enables encryption at rest for your temporary disk and OS/data disk caches. Temporary and ephemeral
  OS disks are encrypted with platform-managed keys when encryption at host is enabled. OS/data disk
  caches are encrypted at rest with either a customer-managed or platform-managed key, depending on the
  encryption type selected on the disk. Learn more at https://aka.ms/vm-hbe.
- Require automatic OS image patching on Virtual Machine Scale Sets: This policy enforces enabling
  automatic OS image patching on Virtual Machine Scale Sets to always keep virtual machines secure by safely
  applying the latest security patches every month.

All built-in policy definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for
Azure Virtual Machines.

Cost optimization

To optimize costs, review the design principles.
To estimate costs related to virtual machines, use these tools.

- Identify the best VM for your workloads with the virtual machines selector. For more information, see
  Linux and Windows pricing.
- Use this pricing calculator to configure and estimate the costs of your Azure VMs.

Design considerations

- Shut down VM instances that aren't in use.
- Use Spot VMs when appropriate.
- Choose the right VM size for your workload.
- Use Zone to Zone disaster recovery for virtual machines.
- Prepay for reserved instances for one year, three years, or more.
- Use hybrid benefit licensing.

Recommendations

Explore the following table of recommendations to optimize your Virtual Machine configuration for service cost:

Recommendation: Stop VMs during off-hours.
Benefit: Configuring start and stop times will shut down instances that aren't in use. The feature is suitable as a
low-cost automation option.

Recommendation: Use Spot VMs when appropriate.
Benefit: Spot VMs are ideal for workloads that can be interrupted, such as highly parallel batch processing jobs.
These VMs take advantage of the surplus capacity in Azure at a lower cost. They're also well suited for
experimenting, developing, and testing large-scale solutions. Check out our Azure Virtual Machine Spot Eviction
guide to learn how to create a reliable interruptible workload in Azure.

Recommendation: Right-size your VMs.
Benefit: Identify the best VM for your workloads with the virtual machines selector. See Windows and Linux
pricing.

Recommendation: Prepay for added cost savings.
Benefit: Purchasing reserved instances is a way to reduce Azure costs for workloads with stable usage. Make sure
you manage usage. If usage is too low, then you're paying for resources that aren't used. Keep reserved instances
simple and keep management overhead low to prevent increasing cost.

Recommendation: Use existing licensing through the hybrid benefit licensing program.
Benefit: Hybrid benefit licensing is available for both Linux and Windows.

Azure Advisor helps you ensure and improve cost optimization. Review the recommendations.

Policy definitions

- Consider setting an Allowed virtual machine SKU policy to limit the sizes that can be used.

All built-in policy definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for
Azure Virtual Machines.

Operational excellence

  
To ensure operational excellence, review the design principles.

Design checklist

- Monitor and measure health.
- Automate tasks like provisioning and updating.
- Build a robust testing environment.
- Right-size your VMs.
- Manage your quota.

Recommendations

Recommendation: Monitor and measure health.
Benefit: In a production environment, it's important to monitor the health and performance of your VMs.

Recommendation: Automate tasks.
Benefit: Building automation reduces deviations from your plans and reduces the time it takes to manage your
workload.

Recommendation: Build a robust testing environment.
Benefit: Ideally, an organization will have multiple environments in which to test deployments. These test
environments should be similar enough to production that deployment and runtime issues are detected before
deployment to production.

Recommendation: Right-size your VMs.
Benefit: Choose the right VM family for your workload.

Recommendation: Manage your quota.
Benefit: Plan what level of quota will be required, review that level regularly as the workload evolves and grows,
and request changes early.

For more suggestions, see Principles of the operational excellence pillar.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
recommendations.

Policy definitions

- Consider setting an Allowed virtual machine SKU policy.

All built-in policy definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for
Azure Virtual Machines.

Performance efficiency

Performance efficiency is matching the resources that are available to an application with the demand that it's
receiving. Performance efficiency includes scaling resources, identifying and optimizing potential bottlenecks,
and optimizing your application code for peak performance.
As you make design choices for your virtual machine deployment, review Microsoft Azure Well-Architected
Framework - Performance efficiency for performance and efficiency.

Design checklist

- Reduce latency by deploying VMs closer together in proximity placement groups.
- Convert disks from standard HDD to premium SSD.
- Enable Accelerated Networking to improve network performance and latency.
- Autoscale your Flexible scale sets.

  
Recommendations

Explore the following table of recommendations to optimize your virtual machine deployment configuration for
performance and efficiency.

Recommendation: Reduce latency.
Benefit: Consider deploying VMs in proximity placement groups. See Creating and using proximity placement
groups using PowerShell.

Recommendation: Convert disks from standard HDD to premium SSD.
Benefit: Azure premium SSDs deliver high-performance and low-latency disk support for virtual machines (VMs)
with input/output (IO)-intensive workloads.

Recommendation: Consider accelerated networking.
Benefit: Accelerated networking enables single root I/O virtualization (SR-IOV) to a VM, greatly improving its
networking performance.

Recommendation: Use autoscaling.
Benefit: Automatically increase or decrease the number of VM instances that run your application with
autoscaling.

For more suggestions, see Principles of the performance efficiency pillar.
Azure Advisor helps you ensure and improve performance. Review the recommendations.

Azure Advisor recommendations

Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost
effectiveness, performance, and operational excellence of your Virtual Machines.

- Reliability
- Cost Optimization
- Performance
- Operational excellence

Additional resources

Here are other resources to help you query for unhealthy instances.

Cost analysis

Planned versus actual spending can be managed through Azure Cost Management + Billing. There are several
options for grouping resources by billing unit.

Next steps

Use the recommendations as you provision virtual machines for your solution.

- Learn module: Introduction to Azure virtual machines
- Review the Virtual Machine recommendations provided by Azure Advisor.
- Review the built-in definitions provided by Azure Policy that apply to Virtual Machines. All built-in policy
  definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for Azure
  Virtual Machines.

Azure Cache for Redis and reliability

12/16/2022 • 4 minutes to read

Azure Cache for Redis provides an in-memory data store based on the Redis (Remote Dictionary Server)
software. It's a secure data cache and messaging broker that provides high throughput and low-latency access
to data for applications.

Key concepts and best practices that support reliability include:

- High availability
- Failover and patching
- Connection resilience

The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Cache for Redis.

Design considerations

The Azure Cache for Redis Service Level Agreement (SLA) covers only Standard and Premium tier caches. Basic
tier isn't covered.

Redis is an in-memory cache for key-value pairs and has High Availability (HA) by default, except for Basic tier.
There are three tiers for Azure Cache for Redis:

- Basic: Not recommended for production workloads. Basic tier is ideal for:
  - Single node
  - Multiple sizes
  - Development
  - Test
  - Non-critical workloads
- Standard: A replicated cache in a two-node primary and secondary configuration managed by Microsoft,
  with a high availability SLA.
- Premium: Includes all standard-tier features and includes the following other features:
  - Faster hardware and performance compared to Basic or Standard tier.
  - Larger cache size, up to 120 GB.
  - Data persistence, which includes Redis Database File (RDB) and Append Only File (AOF).
  - VNET support.
  - Clustering
  - Geo-Replication: A secondary cache is in another region and replicates data from the primary for
    disaster recovery. To fail over to the secondary, the caches need to be unlinked manually and then the
    secondary is available for writes. The application writing to Redis needs to be updated with the
    secondary's cache connection string.
  - Availability Zones: Deploy the cache and replicas across availability zones.

    NOTE: By default, each deployment will have one replica per shard. Persistence, clustering, and
    geo-replication are all disabled at this time with deployments that have more than one replica. Your nodes
    will be distributed evenly across all zones. You should have a replica count >= number of zones.

  - Import and export.

Microsoft guarantees at least 99.9% of the time that customers will have connectivity between the Cache
Endpoints and Microsoft's Internet gateway.

Checklist

Have you configured Azure Cache for Redis with resiliency in mind?

- Schedule updates.
- Monitor the cache and set alerts.
- Deploy the cache within a VNET.
- Evaluate a partitioning strategy within Redis cache.
- Configure Data Persistence to save a copy of the cache to Azure Storage or use Geo-Replication, depending
  on the business requirement.
- Implement retry policies in the context of your Azure Redis Cache.
- Use one static or singleton implementation of the connection multiplexer to Redis and follow the best
  practices guide.
- Review How to administer Azure Cache for Redis.

Configuration recommendations

Explore the following table of recommendations to optimize your Azure Cache for Redis configuration for
service reliability:

Recommendation: Schedule updates.
Description: Schedule the days and times that Redis Server updates will be applied to the cache, which doesn't
include Azure updates, or updates to the VM operating system.

Recommendation: Monitor the cache and set alerts.
Description: Set alerts for exceptions, high CPU, high memory usage, server load, and evicted keys for insights
about when to scale the cache. If the cache needs to be scaled, understanding when to scale is important because
it will increase CPU during the scaling event to migrate data.

Recommendation: Deploy the cache within a VNET.
Description: Gives the customer more control over the traffic that can connect to the cache. Make sure that the
subnet has sufficient address space available to deploy the cache nodes and shards (cluster).

Recommendation: Evaluate a partitioning strategy within Redis cache.
Description: Partitioning a Redis data store involves splitting the data across instances of the Redis server. Each
instance makes up a single partition. Azure Redis Cache abstracts the Redis services behind a facade and doesn't
expose them directly. The simplest way to implement partitioning is to create multiple Azure Redis Cache
instances and spread the data across them. You can associate each data item with an identifier (a partition key)
that specifies which cache stores the data item. The client application logic can then use this identifier to route
requests to the appropriate partition. This scheme is simple, but if the partitioning scheme changes (for
example, if extra Azure Redis Cache instances are created), client applications may need to be reconfigured.
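
The client-side routing described in this recommendation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an Azure SDK API: the `PartitionRouter` class, the placeholder host names, and the choice of a SHA-256 hash with modulo placement are all hypothetical and chosen only to show how a partition key can select a cache instance.

```python
import hashlib


class PartitionRouter:
    """Maps a partition key to one of several cache endpoints (hypothetical helper)."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one cache endpoint is required")
        self.endpoints = list(endpoints)

    def endpoint_for(self, partition_key: str) -> str:
        # A stable hash makes every client instance route a given key the same way.
        digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.endpoints)
        return self.endpoints[index]


# Placeholder host names, not real caches.
router = PartitionRouter([
    "cache-0.example.invalid",
    "cache-1.example.invalid",
    "cache-2.example.invalid",
])
target = router.endpoint_for("customer:42")
```

Because the selected index depends on the number of endpoints, adding or removing a cache instance changes where many keys map. That is the reconfiguration cost the recommendation warns about.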

Recommendation: Configure Data Persistence to save a copy of the cache to Azure Storage or use Geo-Replication,
depending on the business requirement.
Description:
- Data Persistence: If the master and replica reboot, the data will be loaded automatically from the storage
  account.
- Geo-Replication: The secondary cache needs to be unlinked from the primary. The secondary will now become
  the primary and can receive writes.

Recommendation: Implement retry policies in the context of your Azure Redis Cache.
Description: Most Azure services and client SDKs include a retry mechanism. These mechanisms differ because
each service has different characteristics and requirements. Each retry mechanism is tuned to a specific service.
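
Where a client library already provides retry settings, those are usually the better choice. To show the pattern itself, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. The `TransientCacheError` type, the `call_with_retry` helper, and the default delays are assumptions made for illustration and are not part of any Redis client.

```python
import random
import time


class TransientCacheError(Exception):
    """Stand-in for a timeout or connection error raised by a cache client."""


def call_with_retry(operation, max_attempts=4, base_delay=0.2, max_delay=2.0,
                    sleep=time.sleep):
    """Runs operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientCacheError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add jitter so many clients don't retry at the same instant.
            sleep(delay + random.uniform(0, delay / 2))


# Usage with a hypothetical cache client:
# value = call_with_retry(lambda: cache_client.get("customer:42"))
```

Only errors that are known to be transient should be retried. Retrying every exception can hide configuration problems and add load during an outage.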

Recommendation: Review How to administer Azure Cache for Redis.
Description: Understand how data loss can occur with cache reboots and how to test the application for
resiliency.

Source artifacts

To identify Redis instances that aren't on the Premium tier, use the following query:

Resources
| where type == 'microsoft.cache/redis'
| where properties.sku.name != 'Premium'

Next step

Azure Cache for Redis and operational excellence

12/16/2022 • 4 minutes to read

Azure Cache for Redis provides an in-memory data store based on the Redis (Remote Dictionary Server)
software. It's a secure data cache and messaging broker that provides high throughput and low-latency access
to data for applications.

Best practices that support operational excellence include:

- Server load management
- Memory management
- Performance testing

The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Cache for Redis.

Design considerations

The Azure Cache for Redis Service Level Agreement (SLA) covers only Standard and Premium tier caches. Basic
tier isn't covered.

Redis is an in-memory cache for key-value pairs and has High Availability (HA) by default, except for Basic tier.
There are three tiers for Azure Cache for Redis:

- Basic: Not recommended for production workloads. Basic tier is ideal for:
  - Single node
  - Multiple sizes
  - Development
  - Test
  - Non-critical workloads
- Standard: A replicated cache in a two-node primary and secondary configuration managed by Microsoft,
  with a high availability SLA.
- Premium: Includes all standard-tier features and includes the following other features:
  - Faster hardware and performance compared to Basic or Standard tier.
  - Larger cache size, up to 120 GB.
  - Data persistence, which includes Redis Database File (RDB) and Append Only File (AOF).
  - VNET support.
  - Clustering
  - Geo-Replication: A secondary cache is in another region and replicates data from the primary for
    disaster recovery. To fail over to the secondary, the caches need to be unlinked manually and then the
    secondary is available for writes. The application writing to Redis will need to be updated with the
    secondary's cache connection string.
  - Availability Zones: Deploy the cache and replicas across availability zones.

    NOTE: By default, each deployment will have one replica per shard. Persistence, clustering, and
    geo-replication are all disabled at this time with deployments that have more than one replica. Your nodes
    will be distributed evenly across all zones. You should have a replica count >= number of zones.

  - Import and export.

Microsoft guarantees at least 99.9% of the time that customers will have connectivity between the Cache
Endpoints and Microsoft's Internet gateway.

Checklist

Have you configured Azure Cache for Redis with operational excellence in mind?

- Schedule updates.
- Monitor the cache and set alerts.
- Deploy the cache within a VNET.
- Use the correct caching type (local, in role, managed, Redis) within your solution.
- Configure Data Persistence to save a copy of the cache to Azure Storage or use Geo-Replication, depending
  on the business requirement.
- Use one static or singleton implementation of the connection multiplexer to Redis and follow the best
  practices guide.
- Review How to administer Azure Cache for Redis.

Configuration recommendations

Explore the following table of recommendations to optimize your Azure Cache for Redis configuration for
operational excellence:

Recommendation: Schedule updates.
Description: Schedule the days and times that Redis Server updates will be applied to the cache, which doesn't
include Azure updates, or updates to the VM operating system.

Recommendation: Monitor the cache and set alerts.
Description: Set alerts for exceptions, high CPU, high memory usage, server load, and evicted keys for insights
about when to scale the cache. If the cache needs to be scaled, understanding when to scale is important because
it will increase CPU during the scaling event to migrate data.

Recommendation: Deploy the cache within a VNET.
Description: Gives the customer more control over the traffic that can connect to the cache. Make sure that the
subnet has sufficient address space available to deploy the cache nodes and shards (cluster).

Recommendation: Use the correct caching type (local, in role, managed, Redis) within your solution.
Description: Distributed applications typically implement either or both of the following strategies when caching
data:
- Using a private cache, where data is held locally on the machine that's running an instance of an application
  or service.
- Using a shared cache, serving as a common source that can be accessed by multiple processes and machines.
In both cases, caching can be performed client-side and server-side. Client-side caching is done by the process
that provides the user interface for a system, such as a web browser or desktop application. Server-side caching
is done by the process that provides the business services that are running remotely.
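
The two strategies can also be combined. The sketch below places a short-lived private (in-process) cache in front of a shared cache. It is a minimal illustration under stated assumptions: `TwoLevelCache` and `DictSharedCache` are hypothetical names, the shared cache is represented by an in-memory stand-in exposing `get` and `set`, and the 30-second local lifetime is an arbitrary default.

```python
import time


class DictSharedCache:
    """In-memory stand-in for a shared cache such as Redis, for illustration only."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class TwoLevelCache:
    """Private in-process cache in front of a shared cache (hypothetical helper)."""

    def __init__(self, shared, local_ttl_seconds=30.0, clock=time.monotonic):
        self.shared = shared
        self.local_ttl = local_ttl_seconds
        self.clock = clock
        self._local = {}  # key -> (expires_at, value)

    def get(self, key, load):
        entry = self._local.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]  # Private cache hit: no network call needed.
        value = self.shared.get(key)  # Shared cache lookup.
        if value is None:
            value = load(key)  # Miss in both caches: read the data store.
            self.shared.set(key, value)
        self._local[key] = (self.clock() + self.local_ttl, value)
        return value
```

A short local lifetime limits how long one application instance can serve a value that another instance has already changed in the shared cache. That staleness is the main tradeoff of a private cache.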

Recommendation: Configure Data Persistence to save a copy of the cache to Azure Storage or use Geo-Replication,
depending on the business requirement.
Description:
- Data Persistence: If the master and replica reboot, the data will be loaded automatically from the storage
  account.
- Geo-Replication: The secondary cache needs to be unlinked from the primary. The secondary will now become
  the primary and can receive writes.

Recommendation: Review How to administer Azure Cache for Redis.
Description: Understand how data loss can occur with cache reboots and how to test the application for
resiliency.

Source artifacts

To identify Redis instances that aren't on the Premium tier, use the following query:

Resources
| where type == 'microsoft.cache/redis'
| where properties.sku.name != 'Premium'

Next step

Azure Databricks and security

12/16/2022 • 2 minutes to read

Azure Databricks is a data analytics platform optimized for Azure cloud services. It offers three environments for
developing data-intensive applications:

- Databricks SQL
- Databricks Data Science and Engineering
- Databricks Machine Learning

To learn more about how Azure Databricks improves the security of big data analytics, reference Azure
Databricks concepts.

The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Databricks.

Design considerations

All users' notebooks and notebook results are encrypted at rest by default. If other requirements are in place,
consider using customer-managed keys for notebooks.

Checklist

Have you configured Azure Databricks with security in mind?

- Use Azure Active Directory credential passthrough to avoid the need for service principals when
  communicating with Azure Data Lake Storage.
- Isolate your workspaces, compute, and data from public access. Make sure that only the right people have
  access and only through secure channels.
- Ensure that the cloud workspaces for your analytics are only accessible by properly managed users.
- Implement Azure Private Link.
- Restrict and monitor your virtual machines.
- Use Dynamic IP access lists to allow admins to access workspaces only from their corporate networks.
- Use the VNet injection functionality to enable more secure scenarios.
- Use diagnostic logs to audit workspace access and permissions.
- Consider using the Secure cluster connectivity feature and hub/spoke architecture to prevent opening ports
  and assigning public IP addresses on cluster nodes.

Configuration recommendations

Explore the following table of recommendations to optimize your Azure Databricks configuration for security:

Recommendation: Ensure that the cloud workspaces for your analytics are only accessible by properly managed
users.
Description: Azure Active Directory can handle single sign-on for remote access. For extra security, reference
Conditional Access.

Recommendation: Implement Azure Private Link.
Description: Ensure all traffic between users of your platform, the notebooks, and the compute clusters that
process queries is encrypted and transmitted over the cloud provider's network backbone, inaccessible to the
outside world.

Recommendation: Restrict and monitor your virtual machines.
Description: Clusters, which execute queries, should have SSH and network access restricted to prevent
installation of arbitrary packages. Clusters should use only images that are periodically scanned for
vulnerabilities.

Recommendation: Use the VNet injection functionality to enable more secure scenarios.
Description: Such as:
- Connecting to other Azure services using service endpoints.
- Connecting to on-premises data sources, taking advantage of user-defined routes.
- Connecting to a network virtual appliance to inspect all outbound traffic and take actions according to allow
  and deny rules.
- Using custom DNS.
- Deploying Azure Databricks clusters in existing virtual networks.

Recommendation: Use diagnostic logs to audit workspace access and permissions.
Description: Use audit logs to see privileged activity in a workspace, cluster resizing, files, and folders shared on
the cluster.

Source artifacts

Azure Databricks source artifacts include the Databricks blog: Best practices to secure an enterprise-scale data
platform.

Next step

Azure Database for MySQL and cost optimization

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Configuration recommendations

REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N

Choose the appropriate server size for your workload. Configuration options: Single Server and Flexible Server.

Azure Database for MySQL is a relational database service in the Microsoft cloud based on the MySQL

Community Edition. You can use either Single Server or Flexible Server to host a MySQL database in Azure. It's a

fully managed database as a service offering that can handle mission-critical workloads with predictable

performance and dynamic scalability.

For more information about how Azure Database for MySQL supports cost optimization for your workload,

reference Server concepts, specifically, Stop/Start an Azure Database for MySQL.

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure Database for MySQL.

Azure Database for MySQL includes the following design considerations:

Take advantage of the scaling capabilities of Azure Database for MySQL to lower consumption cost whenever

possible. To scale your database up and down, as needed, reference the following Microsoft Support article,

which covers the automation process using runbooks: How to autoscale an Azure Database for

MySQL/PostgreSQL instance with Azure run books and Python.

Plan your Recovery Point Objective (RPO) according to your operation level requirement. There's no extra

charge for backup storage for up to 100% of your total provisioned server storage. Extra consumption of

backup storage will be charged in GB/month .

The cloud native design of the Single-Server service allows it to support 99.99% of availability, eliminating

the cost of passive

hot

standby.

Consider using Flexible Server SKU for non-production workloads. Flexible servers provide better cost

optimization controls with ability to stop and start your server. They provide a burstable compute tier that is

ideal for workloads that don't need continuous full compute capacity.

Have you configured Azure Database for MySQL with cost optimization in mind?

Choose the appropriate server size for your workload.

Consider Reserved Capacity for Azure Database for MySQL Single Server.

Explore the following table of recommendations to optimize your Azure Database for MySQL configuration for

cost optimization:

Consider Reserved Capacity for Azure Database for MySQL Single Server.

You can reduce compute costs with the Azure Database for MySQL Single Server Reservation Discount. Once you've determined the total compute capacity and performance tier for Azure Database for MySQL in a region, you can use this information to reserve the capacity. The reservation can span one or three years. You can realize significant cost optimization with this commitment.
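To judge whether a one-year or three-year reservation is worthwhile, it can help to compare it with pay-as-you-go pricing. The sketch below is a rough illustration under stated assumptions: both hourly rates are hypothetical inputs, the reservation is assumed to be billed for every hour of the term, and pay-as-you-go is assumed to be billed only while the server runs.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def reservation_savings(
    pay_as_you_go_hourly: float,  # hypothetical rate
    reserved_hourly: float,       # hypothetical effective rate under reservation
    term_years: int,
) -> float:
    """Savings over the term if the server runs continuously."""
    if term_years not in (1, 3):
        raise ValueError("Reservations span one or three years")
    total_hours = term_years * HOURS_PER_YEAR
    return (pay_as_you_go_hourly - reserved_hourly) * total_hours


def break_even_utilization(pay_as_you_go_hourly: float, reserved_hourly: float) -> float:
    """Fraction of the term the server must run for the reservation to pay off.

    Below this fraction, stopping the server and paying as you go is cheaper.
    """
    return reserved_hourly / pay_as_you_go_hourly
```

With these assumptions, a workload that runs only a small part of the day may be better served by stop/start on pay-as-you-go pricing than by a reservation.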

RECOMMENDATION    DESCRIPTION

Azure Database for PostgreSQL and cost optimization


12/16/2022 • 2 minutes to read

Design considerations

Azure Database for PostgreSQL is a relational database service in the Microsoft cloud based on the PostgreSQL

Community Edition.

You can choose between three deployment modes: Single Server, Flexible Server, and Hyperscale (Citus).

Single Server: A fully managed database service with minimal requirements for database customizations.

This platform is designed to handle most database management functions, such as:

Patching

Backups

High Availability

Security with minimal user configuration and control

This service offers three pricing tiers and allows you to pay only for the resources you need, and

only when you need them.

Flexible Server: A fully managed database service designed to provide more granular control and flexibility over database management functions and configuration settings based on user requirements.

This service provides better cost optimization controls because you can stop and start the server and use the burstable compute tier.

Hyperscale: This option uses sharding to horizontally scale across multiple machines. Hyperscale is best for applications that require greater scale, where workloads approach 100 GB of data.

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure Database for PostgreSQL.

Azure Database for PostgreSQL includes the following design considerations:

Hyperscale (Citus) provides dynamic scalability without the cost of manual sharding and requires little application rearchitecture.

Distributing table rows across multiple PostgreSQL servers is a key technique for scalable queries in Hyperscale (Citus). Together, multiple nodes can hold more data than a traditional database, and in many cases can use worker CPUs in parallel to execute queries, potentially lowering database costs. Follow the Shard data on worker nodes tutorial to practice this cost-saving architecture pattern.

Consider using the Flexible Server SKU for non-production workloads.

Flexible servers provide better cost optimization controls with the ability to stop and start your server, and a burstable compute tier that is ideal for workloads that don't need continuous full compute capacity.

Plan your Recovery Point Objective (RPO) according to your operation level requirement.

There's no extra charge for backup storage for up to 100% of your total provisioned server storage. Extra consumption of backup storage will be charged in GB/month.

Take advantage of the scaling capabilities of Azure Database for PostgreSQL to lower consumption cost whenever possible.

This Microsoft Support article about How to autoscale an Azure Database for MySQL/PostgreSQL instance with Azure runbooks and Python covers the automation process using runbooks to scale your database up and down, as needed.

The cloud-native design of the Single Server service allows it to support 99.99% availability, eliminating the cost of a passive hot standby.

Checklist

Have you configured Azure Database for PostgreSQL with cost optimization in mind?

Choose the appropriate server size for your workload.

Consider Reserved Capacity for Azure Database for PostgreSQL Single Server and Hyperscale (Citus).

Configuration recommendations

Explore the following table of recommendations to optimize your Azure Database for PostgreSQL configuration for cost optimization:

RECOMMENDATION    DESCRIPTION

Choose the appropriate server size for your workload. Configuration options: Single Server, Flexible Server, Hyperscale (Citus).

Consider Reserved Capacity for Azure Database for PostgreSQL Single Server and Hyperscale (Citus).

You can reduce compute costs with the Azure Database for PostgreSQL Single Server Reservation Discount and the Hyperscale (Citus) Reservation Discount. Once the total compute capacity and performance tier for Azure Database for PostgreSQL in a region is determined, this information can be used to reserve the capacity. The reservation can span one or three years. You can realize significant cost optimization with this commitment.

Next step

Azure SQL Database and reliability

Azure Well-Architected Framework review - Azure SQL Database

12/16/2022 • 19 minutes to read

Prerequisites

Azure SQL Database and reliability

Design considerations

Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the

database management functions without user involvement. Management functions include upgrades, patches,

backups, and monitoring.

The single database resource type creates a database in Azure SQL Database with its own set of resources and is

managed via a logical server. You can choose between the DTU-based purchasing model or vCore-based

purchasing model. You can create multiple databases in a single resource pool, with elastic pools.

The following sections include a design checklist and recommended design options specific to Azure SQL

Database security. The guidance is based on the five pillars of architectural excellence:

Reliability

Security

Cost optimization

Operational excellence

Performance efficiency

Understanding the Well-Architected Framework pillars can help produce a high quality, stable, and

efficient cloud architecture. Check out the Azure Well-Architected Framework overview page to review

the five pillars of architectural excellence.

Review the core concepts of Azure SQL Database and What's new in Azure SQL Database?.

Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the

database management functions without user involvement. Management functions include:

Upgrades

Patches

Backups

Monitoring

This service allows you to create a highly available and high-performance data storage layer for your Azure

applications and workloads. Azure SQL Database is always running on the latest stable version of the SQL

Server database engine and patched OS with 99.99% availability.
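To put the 99.99% figure in perspective, an availability percentage can be converted into the downtime it still permits. The sketch below is simple arithmetic only; the 30-day period is an assumption made for illustration, and the binding definitions of availability and measurement period are in the SLA itself.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given availability level.

    Assumption: a 30-day period by default, chosen for illustration.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)
```

At 99.99% availability over a 30-day period, this works out to roughly 4.3 minutes of permitted downtime, which is a useful reference point when you define your own application-level targets.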

For more information about how Azure SQL Database promotes reliability and enables your business to

continue operating during disruptions, reference Availability capabilities.

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure SQL Database and reliability.

Azure SQL Database includes the following design considerations:

Checklist

Azure SQL Database Business Critical tier configured with geo-replication has a guaranteed Recovery

time objective (RTO) of 30 seconds for 100% of deployed hours.

Use sharding to distribute data and processes across many identically structured databases. Sharding provides an alternative to traditional scale-up approaches for cost and elasticity. Consider using sharding to partition the database horizontally. Sharding can provide fault isolation. For more information, reference Scaling out with Azure SQL Database.

Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant Deployments,

General Purpose, Standard, or Basic tiers, or Hyperscale tier with two or more replicas have an availability

guarantee. For more information about the availability guarantee, reference SLA for Azure SQL Database.

Azure SQL Database provides built-in regional high availability and turnkey geo-replication to any Azure region. It includes

intelligence to support self-driving features, such as:

Performance tuning

Threat monitoring

Vulnerability assessments

Fully automated patching and updating of the code base

Define an application performance SLA and monitor it with alerts. Quickly detect when your application

performance inadvertently degrades below an acceptable level, which is important to maintain high

resiliency. Use the monitoring solution previously defined to set alerts on key query performance metrics

so you can take action when the performance breaks the SLA. Go to Monitor Your Database and alerting

tools for more information.

Use geo-restore to recover from a service outage. You can restore a database on any SQL Database

server or an instance database on any managed instance in any Azure region from the most recent geo-

replicated backups. Geo-restore uses a geo-replicated backup as its source. You can request geo-restore

even if the database or datacenter is inaccessible because of an outage. Geo-restore restores a database

from a geo-redundant backup. For more information, reference Recover an Azure SQL database using

automated database backups.

Use the Business Critical tier configured with geo-replication, which has a guaranteed Recovery point

objective (RPO) of 5 seconds for 100% of deployed hours.

PaaS capabilities built into Azure SQL Database enable you to focus on the domain-specific database

administration and optimization activities that are critical for your business.

Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an

earlier point in time to recover data from changes done inadvertently. For more information, read the

Point-in-time restore (PITR) documentation.

Business Critical or Premium tiers are configured as Zone Redundant Deployments which have an

availability guarantee. For more information about the availability guarantee, reference SLA for Azure

SQL Database.
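The sharding consideration above distributes rows across identically structured databases according to a sharding key. One simple way to route a key to a shard is stable hashing, sketched below. This is an illustrative technique, not the mechanism Azure SQL Database tooling uses; the function name and the shard names are hypothetical.

```python
import hashlib
from typing import Sequence


def shard_for_key(sharding_key: str, shard_names: Sequence[str]) -> str:
    """Pick a shard for a key using a stable hash.

    A cryptographic digest is used instead of Python's built-in hash() so the
    mapping stays the same across processes and restarts.
    """
    if not shard_names:
        raise ValueError("At least one shard is required")
    digest = hashlib.sha256(sharding_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shard_names)
    return shard_names[index]


# Hypothetical shard database names, for illustration only.
SHARDS = ["orders-shard-0", "orders-shard-1", "orders-shard-2", "orders-shard-3"]
target = shard_for_key("customer-1042", SHARDS)
```

Because each key maps to exactly one database, a failure in one shard affects only the keys stored there, which is the fault isolation benefit noted above. Modulo hashing remaps most keys when the number of shards changes, so designs that expect to add shards often use a lookup table or range map instead.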

Have you configured Azure SQL Database with reliability in mind?

Use Active Geo-Replication to create a readable secondary in a different region.

Use Auto Failover Groups that can include one or multiple databases, typically used by the same application.

Use a Zone-Redundant database.

Monitor your Azure SQL Database in near-real time to detect reliability incidents.

Implement Retry Logic.

Back up your keys.

  
Configuration recommendations
RECOMMENDATION    DESCRIPTION
Use Active Geo-Replication to create a readable secondary in
a different region.
If your primary database fails, perform a manual failover to
the secondary database. Until you fail over, the secondary
database remains read-only. Active geo-replication enables
you to create readable replicas and manually fail over to any
replica if there is a datacenter outage or application upgrade.
Up to four secondaries are supported in the same or
different regions, and the secondaries can also be used for
read-only access queries. The failover must be initiated
manually by the application or the user. After failover, the
new primary has a different connection endpoint.
Use Auto Failover Groups that can include one or multiple
databases, typically used by the same application.
You can use the readable secondary databases to offload
read-only query workloads. Because autofailover groups
involve multiple databases, these databases must be
configured on the primary server. Autofailover groups
support replication of all databases in the group to only one
secondary server or instance in a different region. Learn
more about AutoFailover Groups and DR design.
Use a Zone-Redundant database. By default, the cluster of nodes for the premium availability
model is created in the same datacenter. With the
introduction of Azure Availability Zones, SQL Database can
place different replicas of the Business Critical database to
different availability zones in the same region. To eliminate a
single point of failure, the control ring is also duplicated
across multiple zones as three gateway rings (GW). The
routing to a specific gateway ring is controlled by Azure
Traffic Manager (ATM). Because the zone redundant
configuration in the Premium or Business Critical service
tiers doesn't create extra database redundancy, you can
enable it at no extra cost. Learn more about Zone-
redundant databases.
Monitor your Azure SQL Database in near-real time to
detect reliability incidents.
Use one of the available solutions to monitor SQL DB to
detect potential reliability incidents early and make your
databases more reliable. Choose a near real-time monitoring
solution to quickly react to incidents. Reference Azure SQL
Analytics for more information.
Implement Retry Logic. Although Azure SQL Database is resilient to
transient infrastructure failures, these failures might affect
your connectivity. When a transient error occurs while
working with SQL Database, make sure your code can retry
the call. For more information, reference how to implement
retry logic.
Back up your keys. If you're not using encryption keys in Azure Key Vault to
protect your data, back up your keys.
 
Azure SQL Database and security
Explore the following table of recommendations to optimize your Azure SQL Database configuration for
reliability:
SQL Database provides a range of built-in security and compliance features to help your application meet
various security and compliance requirements.

  
Design checklist

Recommendations
RECOMMENDATION    BENEFIT
Review the minimum TLS version. Determine whether you have legacy applications that require
older TLS or unencrypted connections. After you enforce a
version of TLS, it's not possible to revert to the default.
Review and configure the minimum TLS version for SQL
Database connections via the Azure portal. If you have no
such legacy applications, set the minimum to the latest TLS version.
Ledger Consider designing database tables based on the Ledger to
provide auditing, tamper-evidence, and trust of all data
changes.
Always Encrypted Consider designing application access based around Always
Encrypted to protect sensitive data inside applications by
delegating data access to encryption keys.
Private endpoints and private link Private endpoint connections enforce secure communication
by enabling private connectivity to Azure SQL Database. You
can use a private endpoint to secure connections and deny
public network access by default. Azure Private Link for
Azure SQL Database is a type of private endpoint
recommended for Azure SQL Database.
Automated vulnerability assessments Monitor for vulnerability assessment scan results and
recommendations for how to remediate database
vulnerabilities.
Advanced Threat Protection Detect anomalous activities indicating unusual and
potentially harmful attempts to access or exploit databases
with Advanced Threat Protection for Azure SQL Database.
Advanced Threat Protection integrates its alerts with
Microsoft Defender for Cloud.
Auditing Track database events with Auditing for Azure SQL Database.
Have you designed your workload and configured Azure SQL Database with security in mind?
Understand logical servers and how you can administer logins for multiple databases when appropriate.
Enable Azure AD authentication with Azure SQL. Azure AD authentication enables simplified permission
management and centralized identity management.
Azure SQL logical servers should have an Azure Active Directory administrator provisioned.
Verify that the contact email addresses for the service administrator and co-administrators in your Azure
subscription reach the correct parties inside your enterprise. You don't want to miss or ignore
important security notifications from Azure.
Review the Azure SQL Database connectivity architecture. Choose the Redirect or Proxy connection policy
as appropriate.
Review Azure SQL Database firewall rules.
Use virtual network rules to control communication from particular subnets in virtual networks.
If using the Azure Firewall, configure Azure Firewall application rules with SQL FQDNs.

Managed identities Consider configuring a user-assigned managed identity
(UMI). Managed identities for Azure resources eliminate the
need to manage credentials in code.
Azure AD-only authentication Consider disabling SQL-based authentication and allowing
only Azure AD authentication.
RECOMMENDATION    BENEFIT

Policy definitions

Azure SQL Database and cost optimization

Checklist

Configuration recommendations
RECOMMENDATION    DESCRIPTION
Review the Azure security baseline for Azure SQL Database and Azure Policy built-in definitions.
All built-in policy definitions related to Azure SQL are listed in Built-in policies.
Review Tutorial: Secure a database in Azure SQL Database.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades
Patches
Backups
Monitoring
This service allows you to create a highly available and high-performance data storage layer for your Azure
applications and workloads. SQL Database includes built-in intelligence that helps you dramatically reduce the
costs of running and managing databases through automatic performance monitoring and tuning.
For more information about how Azure SQL Database provides cost-saving features, reference Plan and manage
costs for Azure SQL Database.
The following sections include a configuration checklist and recommended configuration options specific to
Azure SQL Database and cost optimization.
Have you configured Azure SQL Database with cost optimization in mind?
Optimize queries.
Evaluate resource usage.
Fine-tune backup storage consumption.
Evaluate Azure SQL Database serverless.
Consider reserved capacity for Azure SQL Database.
Consider elastic pools for managing and scaling multiple databases.
Explore the following table of recommendations to optimize your Azure SQL Database configuration for cost
savings:

Optimize queries. Optimize the queries, tables, and databases using Query
Performance Insights and Performance Recommendations to
help reduce resource consumption, and arrive at appropriate
configuration.
Evaluate resource usage. Evaluate the resource usage for all databases and determine
if they've been sized and provisioned correctly. For non-
production databases, consider scaling resources down as
applicable. The DTUs or vCores for a database can be scaled
on demand, for example, when running a load test or user
acceptance test.
Fine-tune backup storage consumption. For vCore databases in Azure SQL Database, the storage
consumed by each type of backup (full, differential, and log)
is reported on the database monitoring pane as a separate
metric. Backup storage consumption up to the maximum
data size for a database is not charged. Excess backup
storage consumption will depend on the workload and
maximum size of the individual databases. For more
information, see Backup storage consumption.
Evaluate Azure SQL Database serverless. Consider using Azure SQL Database serverless over the
provisioned compute tier. Serverless is a compute tier for
single databases that automatically scales compute based on
workload demand and bills for the amount of compute used
per second. The serverless compute tier also automatically
pauses databases during inactive periods when only storage
is billed. It automatically resumes databases when activity
returns. Azure SQL Database serverless isn't suited for all
scenarios. If you have a database with unpredictable or
bursty usage patterns interspersed with periods of low or
idle usage, serverless is a solution that can help you optimize
price-performance.
Consider reserved capacity for Azure SQL Database. You can reduce compute costs associated with Azure SQL
Database by using Reservation Discount. Once you've
determined the total compute capacity and performance tier
for Azure SQL databases in a region, you can use this
information to reserve the capacity. The reservation can
span one or three years. For more information, reference
Save costs for resources with reserved capacity.
Elastic pools help you manage and scale multiple databases
in Azure SQL Database
Azure SQL Database elastic pools are a simple, cost-effective
solution for managing and scaling multiple databases that
have varying and unpredictable usage demands. The
databases in an elastic pool are on a single server and share
a set number of resources at a set price. For more
information, see Elastic pools for managing and scaling
multiple databases.
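The serverless description in the table above (compute billed per second of use, with only storage billed while paused) can be compared with an always-on provisioned database using a rough estimate. The sketch below is a simplification under stated assumptions: both prices are hypothetical inputs, storage cost is excluded because it applies to either tier, and any minimum compute billed while the database is online is ignored.

```python
def serverless_compute_cost(
    active_seconds: float,
    avg_vcores_used: float,
    price_per_vcore_second: float,  # hypothetical price
) -> float:
    """Compute cost when billing stops while the database is paused."""
    return active_seconds * avg_vcores_used * price_per_vcore_second


def provisioned_compute_cost(
    vcores: float,
    price_per_vcore_hour: float,  # hypothetical price
    hours: float,
) -> float:
    """Compute cost when a fixed number of vCores is billed for every hour."""
    return vcores * price_per_vcore_hour * hours


def cheaper_tier(serverless_cost: float, provisioned_cost: float) -> str:
    """Name the lower-cost option for the compared period."""
    return "serverless" if serverless_cost < provisioned_cost else "provisioned"
```

Running the comparison with your own usage profile shows why serverless tends to suit bursty workloads with long idle periods, while steadily busy databases usually favor provisioned compute.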
RECOMMENDATION    DESCRIPTION
 
For more information, see Plan and manage costs for Azure SQL Database.

Azure SQL Database and operational excellence
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades

Patches

Backups

Monitoring

This service allows you to create a highly available and high-performance data storage layer for your Azure applications and workloads. Azure SQL Database provides advanced monitoring and tuning capabilities backed by artificial intelligence to help you troubleshoot and maximize the performance of your databases and solutions.

For more information about how Azure SQL Database promotes operational excellence and enables your business to continue operating during disruptions, reference Monitoring and performance tuning in Azure SQL Database.

The following sections include design considerations, a configuration checklist, and recommended configuration options specific to Azure SQL Database and operational excellence.

Design considerations

Azure SQL Database includes the following design considerations:

Azure SQL Database Business Critical tier configured with geo-replication has a guaranteed Recovery

time objective (RTO) of 30 seconds for 100% of deployed hours.

Use sharding to distribute data and processes across many identically structured databases. Sharding provides an alternative to traditional scale-up approaches for cost and elasticity. Consider using sharding to partition the database horizontally. Sharding can provide fault isolation. For more information, reference Scaling out with Azure SQL Database.

Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant Deployments,

General Purpose, Standard, or Basic tiers, or Hyperscale tier with two or more replicas have an availability

guarantee. For more information, reference SLA for Azure SQL Database.

Azure SQL Database provides built-in regional high availability and turnkey geo-replication to any Azure region. It includes

intelligence to support self-driving features, such as:

Performance tuning

Threat monitoring

Vulnerability assessments

Fully automated patching and updating of the code base

Define an application performance SLA and monitor it with alerts. Quickly detect when your application

performance inadvertently degrades below an acceptable level, which is important to maintain high

resiliency. Use the monitoring solution previously defined to set alerts on key query performance metrics

so you can take action when the performance breaks the SLA. Go to Monitor Your Database for more

information.

Use geo-restore to recover from a service outage. You can restore a database on any SQL Database

server or an instance database on any managed instance in any Azure region from the most recent geo-

replicated backups. Geo-restore uses a geo-replicated backup as its source. You can request geo-restore

even if the database or datacenter is inaccessible because of an outage. Geo-restore restores a database

from a geo-redundant backup. For more information, reference Recover an Azure SQL database using

automated database backups.

Use the Business Critical tier configured with geo-replication, which has a guaranteed Recovery point

objective (RPO) of 5 seconds for 100% of deployed hours.

PaaS capabilities built into Azure SQL Database enable you to focus on the domain-specific database

administration and optimization activities that are critical for your business.

Checklist

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Use Active Geo-Replication to create a readable secondary in

a different region.

If your primary database fails, perform a manual failover to the secondary database. Until you fail over, the secondary database remains read-only. Active geo-replication enables you to create readable replicas and manually fail over to any replica if there is a datacenter outage or application upgrade. Up to four secondaries are supported in the same or different regions, and the secondaries can also be used for read-only access queries. The failover must be initiated manually by the application or the user. After failover, the new primary has a different connection endpoint.

Use Auto Failover Groups that can include one or multiple

databases, typically used by the same application.

You can use the readable secondary databases to offload

read-only query workloads. Because autofailover groups

involve multiple databases, these databases must be

configured on the primary server. Autofailover groups

support replication of all databases in the group to only one

secondary server or instance in a different region. Learn

more about Auto-Failover Groups and DR design.

Use a Zone-Redundant database. By default, the cluster of nodes for the premium availability

model is created in the same datacenter. With the

introduction of Azure Availability Zones, SQL Database can

place different replicas of the Business Critical database to

different availability zones in the same region. To eliminate a

single point of failure, the control ring is also duplicated

across multiple zones as three gateway rings (GW). The

routing to a specific gateway ring is controlled by Azure

Traffic Manager (ATM). Because the zone redundant

configuration in the Premium or Business Critical service

tiers doesn't create extra database redundancy, you can

enable it at no extra cost. Learn more about Zone-

redundant databases.

Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an

earlier point in time to recover data from changes done inadvertently. For more information, read the

Point-in-time restore (PITR) documentation.

Business Critical or Premium tiers are configured as Zone Redundant Deployments. For more information about the availability guarantee, reference SLA for Azure SQL Database.

Have you configured Azure SQL Database with operational excellence in mind?

Use Active Geo-Replication to create a readable secondary in a different region.

Use Auto Failover Groups that can include one or multiple databases, typically used by the same application.

Use a Zone-Redundant database.

Monitor your Azure SQL Database in near-real time to detect reliability incidents.

Implement retry logic.

Back up your keys.

Explore the following table of recommendations to optimize your Azure SQL Database configuration for

operational excellence:

Monitor your Azure SQL Database in near-real time to
detect reliability incidents.
Use one of the available solutions to monitor SQL DB to
detect potential reliability incidents early and make your
databases more reliable. Choose a near real-time monitoring
solution to quickly react to incidents. Reference Azure SQL
Analytics for more information.
Implement Retry Logic. Although Azure SQL Database is resilient to
transient infrastructure failures, these failures might affect
your connectivity. When a transient error occurs while
working with SQL Database, make sure your code can retry
the call. For more information, reference how to implement
retry logic and Configurable retry logic in SqlClient
introduction.
Back up your keys. If you're not using encryption keys in Azure Key Vault to
protect your data, back up your keys.
RECOMMENDATION    DESCRIPTION
 
Azure SQL Database and performance efficiency
  
Design checklist

Recommendations
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades
Patches
Backups
Monitoring
The following sections include a design checklist and recommended design options specific to Azure SQL
Database performance efficiency.
Have you designed your workload and configured Azure SQL Database with performance efficiency in mind?
Review resource limits. For specific resource limits per pricing tier (also known as service objective) for single
databases, refer to either DTU-based single database resource limits or vCore-based single database
resource limits. For elastic pool resource limits, refer to either DTU-based elastic pool resource limits or
vCore-based elastic pool resource limits.
Choose the right deployment model for your workload, vCore or DTU. Compare the vCore and DTU-based
purchasing models.
Microsoft recommends the latest vCore database standard-series or premium-series hardware. Older Gen4
hardware has been retired.
When using elastic pools, familiarize yourself with resource governance.
Review the default max degree of parallelism (MAXDOP) and configure as needed based on a migrated or
expected workload.
Consider using read-only replicas of critical databases to offload read-only query workloads.
Review the Performance Center for SQL Server Database Engine and Azure SQL Database.
Applications connecting to Azure SQL Database should use the latest connection providers, for example the
latest OLE DB Driver or ODBC Driver.

RECOMMENDATION    BENEFIT
Diagnose and troubleshoot high CPU utilization. Azure SQL Database provides built-in tools to identify the
causes of high CPU usage and to optimize workload
performance.
Understand blocking and deadlocking issues. Blocking due to concurrency and terminated sessions due to
deadlocks have different causes and outcomes.
Tune applications and databases for performance. Tune your application and database to improve performance.
Review best practices.
Review Azure portal utilization reporting and scale as
appropriate.
After deployment, use built-in reporting in the Azure portal
to regularly review peak and average database utilization
and right-size up or down. You can easily scale single
databases or elastic pools with no data loss and minimal
downtime.
Review Performance Recommendations. In the Intelligent Performance menu of the database page in
the Azure portal, review the Performance Recommendations
and address any index, schema, and parameterization issues
they identify.
Review Query Performance Insight. Review Query Performance Insight for Azure SQL Database
reports to identify top resource-consuming queries, long
running queries, and more.
Configure Automatic tuning. Provide peak performance and stable workloads through
continuous performance tuning based on AI and machine
learning. Consider using Azure Automation to configure
email notifications for automatic tuning.
Evaluate potential use of in-memory database objects. In-memory technologies enable you to improve
performance of your application, and potentially reduce cost
of your database. Consider designing some database objects
in high-volume OLTP applications to use in-memory technologies.
Leverage the Query Store. Enabled by default in Azure SQL Database, the Query Store
contains a wealth of query performance and resource
consumption data, as well as advanced tuning features like
Query Store hints and automatic plan correction. Review
Query Store defaults in Azure SQL Database.
Implement retry logic for transient errors. Applications should include automatic transaction retry logic
for transient errors including common connection errors.
Leverage exponential retry interval logic.
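The retry recommendation above can be sketched in Python. This is a minimal illustration under stated assumptions, not an Azure SQL Database API: `TransientError`, `backoff_delays`, and `run_with_retry` are hypothetical names, and the base delay, growth factor, cap, and jitter are assumed values that you would tune for your workload. In a real application, you would map your driver's transient error codes onto the retryable exception.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delays(max_attempts, base_delay=1.0, factor=2.0, max_delay=30.0):
    """Return the wait in seconds before each retry: base, base*factor, ... (capped)."""
    return [min(base_delay * factor ** i, max_delay) for i in range(max_attempts - 1)]


def run_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait with exponential backoff and retry."""
    delays = backoff_delays(max_attempts, base_delay)
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Add up to 10% random jitter so clients don't retry in lockstep.
            sleep(delays[attempt] * (1 + random.uniform(0, 0.1)))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Only retry operations that are safe to repeat, for example a whole transaction rather than a single statement inside one.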
 
Additional resources
 
Next steps
For information about supported features, see Features and Resolving Transact-SQL differences during
migration to SQL Database.
Migrating to Azure SQL Database? Review our Azure Database Migration Guides.
Watch episodes of Data Exposed covering Azure SQL topics and more.

Try Azure SQL Database free with Azure free account, then get started with single databases in Azure SQL

Database.

Azure SQL Managed Instance and reliability

12/16/2022 • 3 minutes to read

Design considerations

Checklist

Azure SQL Managed Instance is the intelligent, scalable cloud database service that combines the broadest SQL

Server database engine compatibility with all the benefits of a fully managed and evergreen platform as a

service.

The goal of the high availability architecture in SQL Managed Instance is to guarantee that your database is up

and running without worrying about the impact of maintenance operations and outages. This solution is

designed to:

Ensure that committed data is never lost because of failures.

Ensure that maintenance failures don't affect your workload.

Ensure that the database won't be a single point of failure in your software architecture.

For more information about how Azure SQL Managed Instance supports application and workload resilience,

reference the following articles:

High availability for Azure SQL Managed Instance

Use autofailover groups to enable transparent and coordinated geo-failover of multiple databases

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure SQL Managed Instance and reliability.

Azure SQL Managed Instance includes the following design considerations:

Define an application performance SLA and monitor it with alerts. Detecting quickly when your application

performance inadvertently degrades below an acceptable level is important to maintain high resiliency. Use a

monitoring solution to set alerts on key query performance metrics so you can take action when the

performance breaks the SLA.

Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an

earlier point in time to recover data from changes done inadvertently. For more information, read the Point-

in-time-restore (PITR) documentation for managed instance.

Use geo-restore to recover from a service outage. Geo-restore restores a database from a geo-redundant

backup into a managed instance in a different region. For more information, reference Recover a database

using Geo-restore documentation.

Consider the time required for certain operations. Make sure you set aside time to thoroughly test the

amount of time required to scale up and down your existing managed instance, and to create a new

managed instance. This timing practice ensures that you understand completely how time consuming

operations will affect your RTO and RPO.
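As one way to picture the first consideration above (defining a performance SLA and alerting on it), the following Python sketch checks a window of query latencies against an SLA threshold. It isn't an Azure Monitor API. The function names, the choice of the 95th percentile, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sla_breached(latencies_ms, sla_ms, pct=95):
    """Return True when the chosen latency percentile exceeds the SLA threshold."""
    return percentile(latencies_ms, pct) > sla_ms
```

A single slow query out of 20 doesn't move the 95th percentile here, but two do, which is the kind of sustained degradation an alert is meant to catch.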

Have you configured Azure SQL Managed Instance with reliability in mind?

Use the Business Critical Tier.

Configure a secondary instance and an Autofailover group to enable failover to another region.

Implement Retry Logic.

Monitor your SQL MI instance in near-real time to detect reliability incidents.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Use the Business Critical Tier. This tier provides higher resiliency to failures and faster

failover times because of the underlying HA architecture,

among other benefits. For more information, reference SQL

Managed Instance High availability.

Configure a secondary instance and an Autofailover group

to enable failover to another region.

If an outage impacts one or more of the databases in the

managed instance, you can manually or automatically

fail over all the databases inside the instance to a secondary

region. For more information, read the Autofailover groups

documentation for managed instance.

Implement Retry Logic. Although Azure SQL MI is resilient to transient

infrastructure failures, these failures might affect your

connectivity. When a transient error occurs while working

with SQL MI, make sure your code can retry the call. For

more information, reference how to implement retry logic.

Monitor your SQL MI instance in near-real time to detect

reliability incidents.

Use one of the available solutions to monitor your SQL MI

to detect potential reliability incidents early and make your

databases more reliable. Choose a near real-time monitoring

solution to quickly react to incidents. For more information,

check out the Azure SQL Managed Instance monitoring

options.

Next step

Explore the following table of recommendations to optimize your Azure SQL Managed Instance configuration

for reliability:

Azure SQL Managed Instance and operational excellence

Azure SQL Managed Instance and operational

excellence

12/16/2022 • 3 minutes to read

Design considerations

Checklist

Azure SQL Managed Instance is the intelligent, scalable cloud database service that combines the broadest SQL

Server database engine compatibility with all the benefits of a fully managed and evergreen platform as a

service.

The goal of the high availability architecture in SQL Managed Instance is to guarantee that your database is up

and running without worrying about the impact of maintenance operations and outages. This solution is

designed to:

Ensure that committed data is never lost because of failures.

Ensure that maintenance failures don't affect your workload.

Ensure that the database won't be a single point of failure in your software architecture.

For more information about how Azure SQL Managed Instance supports operational excellence for your

application workloads, reference the following articles:

Overview of Azure SQL Managed Instance management operations

Monitoring Azure SQL Managed Instance management operations

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure SQL Managed Instance and operational excellence.

Azure SQL Managed Instance includes the following design considerations:

Define an application performance SLA and monitor it with alerts. Detecting quickly when your application

performance inadvertently degrades below an acceptable level is important to maintain high resiliency. Use a

monitoring solution to set alerts on key query performance metrics so you can take action when the

performance breaks the SLA.

Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an

earlier point in time to recover data from changes done inadvertently. For more information, read the Point-

in-time-restore (PITR) documentation for managed instance.

Use geo-restore to recover from a service outage. Geo-restore restores a database from a geo-redundant

backup into a managed instance in a different region. For more information, reference Recover a database

using Geo-restore documentation.

Consider the time required for certain operations. Make sure you set aside time to thoroughly test the

amount of time required to scale up and down your existing managed instance, and to create a new

managed instance. This timing practice ensures that you understand completely how time consuming

operations will affect your RTO and RPO.

Have you configured Azure SQL Managed Instance with operational excellence in mind?

Use the Business Critical Tier.

Configure a secondary instance and an Autofailover group to enable failover to another region.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Use the Business Critical Tier. This tier provides higher resiliency to failures and faster

failover times because of the underlying HA architecture,

among other benefits. For more information, reference SQL

Managed Instance High availability.

Configure a secondary instance and an Autofailover group

to enable failover to another region.

If an outage impacts one or more of the databases in the

managed instance, you can manually or automatically

fail over all the databases inside the instance to a secondary

region. For more information, read the Autofailover groups

documentation for managed instance.

Implement Retry Logic. Although Azure SQL MI is resilient to transient

infrastructure failures, these failures might affect your

connectivity. When a transient error occurs while working

with SQL MI, make sure your code can retry the call. For

more information, reference how to implement retry logic.

Monitor your SQL MI instance in near-real time to detect

reliability incidents.

Use one of the available solutions to monitor your SQL MI

to detect potential reliability incidents early and make your

databases more reliable. Choose a near real-time monitoring

solution to quickly react to incidents. For more information,

check out the Azure SQL Managed Instance monitoring

options.

Next step

Implement Retry Logic.

Monitor your SQL MI instance in near-real time to detect reliability incidents.

Explore the following table of recommendations to optimize your Azure SQL Managed Instance configuration

for operational excellence:

Azure Cosmos DB and reliability

12/16/2022 • 8 minutes to read

Design considerations

Checklist

Azure Cosmos DB is a fully managed NoSQL database for modern app development.

Key features include:

Guaranteed speed at any scale

Simplified application development

Mission-critical ready

Fully managed and cost effective

To understand how Azure Cosmos DB bolsters resiliency for your application workload, reference the following

articles:

Distribute your data globally with Azure Cosmos DB

How does Azure Cosmos DB provide high availability

Consistency levels in Azure Cosmos DB

Configure Azure Cosmos DB account with periodic backup

The following sections include design considerations, a configuration checklist, recommended configuration

options, and source artifacts specific to Azure Cosmos DB and reliability.

Azure Cosmos DB includes the following design considerations:

SLA for read availability for Database Accounts spanning two or more Azure regions.

SLAs for throughput, consistency, availability, and latency.

SLA for both read and write availability with the configuration of multiple Azure regions as writable

endpoints.

For more detailed information about SLAs specific to this product, see Azure Cosmos DB Service Level

Agreements.

Have you configured Azure Cosmos DB with reliability in mind?

Deploy Azure Cosmos DB and the application in the region that corresponds to end users.

If the multi-master option is enabled on Azure Cosmos DB, it's important to understand Conflict Types and

Resolution Policies.

Start with Session, the default consistency level.

Change the consistency level, depending on the data operation and usage.

Evaluate connectivity modes and connection protocols in Azure Cosmos DB.

Configure preferred locations.

Specify index precisions.

Use Azure Monitor to see the provisioned autoscale max RU/s (Autoscale Max Throughput) and the RU/s

the system is currently scaled to (Provisioned Throughput).

Understand your traffic pattern to pick the right option for provisioned throughput types.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Deploy Azure Cosmos DB and the application in the region

that corresponds to end users.

There are two common scenarios for configuring two or

more regions: Delivering low-latency access to data to end

users no matter where they're located around the globe.

Adding regional resiliency for business continuity and

disaster recovery (BCDR). For delivering low-latency to end

users, it's recommended to deploy both the application and

add Azure Cosmos DB in the regions that correspond to

where the application's users are located.

Start with Session, the default consistency level. It's the recommended consistency level to start with as it

receives data later, but in the same order as the writes.

Change the consistency level, depending on the data

operation and usage.

Depending on the type of data stored, the consistency
level can be changed on a per-request basis. For example, if
log data is written to Azure Cosmos DB, Eventual
consistency may be sufficient, but if you're writing e-commerce
transactions, then Strong may be more appropriate.

New applications: If you don't know your traffic pattern yet, start at the entry point RU/s to avoid
over-provisioning in the beginning.

Existing applications: Use Azure Monitor metrics to determine if your traffic pattern is suitable for autoscale.

Existing applications: Find the normalized request unit consumption metric of your database or container.

Existing applications: The closer the number is to 100%, the more you're fully using your provisioned RU/s.

Set provisioned RU/s to T for all hours in a month.

Enable automatic failover when you configure Azure Cosmos DB accounts used for production workloads.

Implement retry logic in your client.

For query-intensive workloads, use Windows 64-bit instead of Linux or Windows 32-bit host processing.

To reduce latency and CPU jitter, enable accelerated networking on client virtual machines in both Windows

and Linux.

Increase the number of threads and tasks.

To avoid network latency, colocate the client in the same region as Azure Cosmos DB.

Call OpenAsync to avoid startup latency on first request.

Scale out client applications across multiple servers if a client consumes more than 50,000 RU/s.

Select a partition key.

Ensure the partition key is a property that has a value that doesn't change.

You can't change a partition key after it's created with the collection.

Ensure the partition key has a high cardinality.

Ensure the partition key spreads RU consumption and data storage evenly across all logical partitions.

Ensure you're running read queries with the partitioned column to reduce RU consumption and latency.

Evaluate ways to improve data performance.

Configure data replication to ensure Azure Cosmos DB meets the SLAs.
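The partition key items in the checklist above (high cardinality, even spread across logical partitions) can be checked against a sample of your data before you commit to a key. The following Python sketch is an illustration only: `partition_skew` is a hypothetical helper, and it counts items per key value as a rough proxy for storage distribution. It doesn't measure RU consumption.

```python
from collections import Counter


def partition_skew(items, key):
    """Return (distinct key values, largest partition size / mean partition size)."""
    counts = Counter(item[key] for item in items)
    largest = max(counts.values())
    mean = len(items) / len(counts)
    return len(counts), largest / mean
```

A ratio near 1.0 suggests an even spread for the sampled items. A ratio well above 1.0, or a small number of distinct values, suggests that the candidate key might concentrate data in a few logical partitions.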

Explore the following table of recommendations to optimize your Azure Cosmos DB configuration for reliability:

Evaluate connectivity modes and connection protocols in

Azure Cosmos DB.

Azure Cosmos DB supports two connectivity modes. In
Gateway mode, requests are always made to the Azure
Cosmos DB gateway, which forwards them to the corresponding
data partitions. In Direct connectivity mode, the client
fetches the mapping of tables to partitions, and requests are
made directly against data partitions. We recommend Direct,
the default mode. Azure Cosmos DB supports two
connection protocols: HTTPS and TCP, which is the
default. TCP is recommended because it's more lightweight.

Configure preferred locations. Setting preferred locations can improve query performance.

To take advantage of global distribution, client applications

can specify the ordered preference list of regions to be used

to perform document operations, which can be done by

setting the connection policy. Based on the Azure Cosmos

DB account configuration, current regional availability and

the preference list specified, the optimal endpoint will

be chosen by the SQL SDK to perform write and read

operations. This preference list is specified when initializing a

connection using the SQL SDKs. The SDKs accept an optional

parameter, PreferredLocations , that is an ordered list of

Azure regions. The SDK will automatically send all writes to

the current write region. All reads will be sent to the first

available region in the PreferredLocations list. If the

request fails, the client will fail down the list to the next

region, and so on. The SDKs will only attempt to read from

the regions specified in PreferredLocations .
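The read-routing behavior described for preferred locations can be pictured with a small Python simulation. This isn't SDK code: `pick_read_region` is a hypothetical function, and the set of available regions stands in for whatever availability information the SDK tracks internally.

```python
def pick_read_region(preferred_locations, available_regions):
    """Return the first preferred region that is currently available, else None."""
    for region in preferred_locations:
        if region in available_regions:
            return region
    # Reads are only attempted against regions in the preference list.
    return None
```

For example, with the preference list `["West Europe", "North Europe", "East US"]` and West Europe unavailable, the sketch selects North Europe, which mirrors the "fail down the list" behavior described above.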

Specify index precisions. Setting these values appropriately can improve query

performance and reduce throughput requests. You can use

index precision to make trade-offs between index storage

overhead and query performance. For numbers, we

recommend using the default precision configuration of -1

(maximum). Because numbers are 8 bytes in JSON, this is

equivalent to a configuration of 8 bytes. Choosing a lower

value for precision, such as 1 through 7 , means that

values within some ranges map to the same index entry. You

reduce index storage space, but query execution might have

to process more documents. It consumes more throughput

in request units. Index precision configuration has more

practical application with string ranges. Because strings can

be any arbitrary length, the choice of the index precision

might affect the performance of string range queries. It also

may affect the amount of index storage space that's

required. String Range indexes can be configured with 1

through 100 or -1 (maximum).

Existing applications: Find the normalized request unit
consumption metric of your database or container.
Normalized usage is a measure of how much you're currently
using your standard (manual) provisioned throughput.

Set provisioned RU/s to T for all hours in a month. If you set provisioned RU/s to T and use the full amount

for 66% of the hours or more, it's estimated you'll save with

standard (manual) provisioned RU/s . If you set autoscale

max RU/s to Tmax and use the full amount Tmax for

66% of the hours or less, it's estimated you'll save with

autoscale.
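The 66% rule of thumb above can be written as a short Python helper. This is a sketch under assumptions: `recommend_throughput_mode` is a hypothetical name, 730 is an assumed number of hours in a month, and because the text lists exactly 66% under both options, the sketch arbitrarily treats that boundary as standard (manual).

```python
def recommend_throughput_mode(hours_at_full_throughput, hours_in_month=730):
    """Apply the 66% rule of thumb to pick a provisioned throughput type."""
    fraction = hours_at_full_throughput / hours_in_month
    # At or above 66% of hours at full throughput, standard (manual) is estimated to cost less.
    return "standard" if fraction >= 0.66 else "autoscale"
```

A workload that runs at its full provisioned RU/s for 600 of 730 hours would land on standard (manual) throughput, while one that peaks for only 200 hours would land on autoscale.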

RECOMMENDATION    DESCRIPTION

Scale out client applications across multiple servers if a client
consumes more than 50,000 RU/s.

There could be a bottleneck because of the machine capping

out on CPU or network usage.

Ensure the partition key spreads RU consumption and data

storage evenly across all logical partitions.

This spread ensures even RU consumption and storage

distribution across your physical partitions.

Evaluate ways to improve data performance. Best practices for query performance:

- Connection policy: Use direct connection mode.

- Connection policy: Use the TCP protocol.

- Call OpenAsync to avoid startup latency on first request.

- Collocate clients in the same Azure region for performance.

- Increase number of threads and tasks.

- Install the most recent SDK.

- Use a singleton Azure Cosmos DB client for the lifetime of

your application.

- Increase System.Net MaxConnections per host when

using Gateway mode.

- Tune parallel queries for partitioned collections.

- Turn on server-side GC.

- Implement backoff at RetryAfter intervals.

- Scale out your client-workload.

- Cache document URIs for lower read latency.

- Tune the page size for queries and read feeds for better

performance.

- Use 64-bit host processing.

- Exclude unused paths from indexing for faster writes.

- Measure and tune for lower request units per second usage.

- Handle rate limiting and request rates that are too large.

- Design for smaller documents for higher throughput.

Configure data replication to ensure Azure Cosmos DB

meets the SLAs

If you've replicated your data in more than one data center,

Azure Cosmos DB automatically rolls over your operations

should a regional data center go offline. You can create a

prioritized list of failover regions using the regions in which

your data is replicated. Even within a single data center,

Azure Cosmos DB automatically replicates data for high

availability giving you the choice of consistency levels.
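Two of the query performance practices listed earlier in this table, implementing backoff at RetryAfter intervals and handling rate limiting, can be sketched as follows in Python. The names are hypothetical: `RateLimited` stands in for whatever exception your client raises when a request is throttled, and `retry_after_ms` is assumed to carry the wait time suggested by the service. The default of nine retries is also an assumption.

```python
import time


class RateLimited(Exception):
    """Hypothetical throttling error carrying the server-suggested wait time."""

    def __init__(self, retry_after_ms):
        super().__init__(f"rate limited, retry after {retry_after_ms} ms")
        self.retry_after_ms = retry_after_ms


def call_with_rate_limit_backoff(operation, max_retries=9, sleep=time.sleep):
    """Retry a throttled operation, waiting for the suggested interval each time."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except RateLimited as err:
            if attempt == max_retries:
                raise  # still throttled after all retries: let the caller decide
            sleep(err.retry_after_ms / 1000)
```

Waiting for the suggested interval, rather than retrying immediately, avoids adding load while the container is already over its provisioned throughput. If throttling is frequent, that's usually a signal to revisit provisioned RU/s rather than to add more retries.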


Source artifacts

To check for Azure Cosmos DB instances where automatic failover isn't enabled, use the following query:

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where properties.enableAutomaticFailover != True

Use the following query to see the list of multiregion writes:

resources
| where type == "microsoft.documentdb/databaseaccounts"
    and properties.enableMultipleWriteLocations == "true"

To view consistency levels for your Azure Cosmos DB accounts, use the following query:

Resources
| project name, type, location, consistencyLevel = properties.consistencyPolicy.defaultConsistencyLevel
| where type == "microsoft.documentdb/databaseaccounts"
| order by name asc

To check if multilocation isn't selected, use the following query:

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where array_length(properties.locations) <= 1

Learn more

High availability in Azure Cosmos DB

Autoscale FAQ

Performance tips for Azure Cosmos DB

Next step

Azure Cosmos DB and operational excellence

12/16/2022 • 4 minutes to read

Design considerations

Checklist

Azure Cosmos DB is a fully managed NoSQL database for modern app development.

Key features include:

Guaranteed speed at any scale

Simplified application development

Mission-critical ready

Fully managed and cost effective

To understand how Azure Cosmos DB promotes operational excellence for your application workload, reference

the following articles:

Monitor Azure Cosmos DB

Monitor and debug with insights in Azure Cosmos DB

Visualize Azure Cosmos DB data by using the Power BI connector

The following sections include design considerations, a configuration checklist, recommended configuration

options, and source artifacts specific to Azure Cosmos DB and operational excellence.

Azure Cosmos DB includes the following design considerations:

SLA for read availability for Database Accounts spanning two or more Azure regions.

SLAs for throughput, consistency, availability, and latency.

SLA for both read and write availability with the configuration of multiple Azure regions as writable

endpoints.

For more granular information specific to this product, reference Azure Cosmos DB Service Level Agreements.

Have you configured Azure Cosmos DB with operational excellence in mind?

Monitor for normal and abnormal activity.

If the multi-master option is enabled on Azure Cosmos DB, it's important to understand Conflict Types and

Resolution Policies.

Start with Session, the default consistency level.

Use Azure Monitor to see the provisioned autoscale max RU/s (Autoscale Max Throughput) and the RU/s

the system is currently scaled to (Provisioned Throughput).

Understand your traffic pattern to pick the right option for provisioned throughput types.

New applications: If you don't know your traffic pattern yet, start at the entry point RU/s to avoid
over-provisioning in the beginning.

Existing applications: Use Azure Monitor metrics to determine if your traffic pattern is suitable for autoscale.

Existing applications: Find the normalized request unit consumption metric of your database or container.

Existing applications: The closer the number is to 100%, the more you're fully using your provisioned RU/s.

Set provisioned RU/s to T for all hours in a month.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Monitor for normal and abnormal activity. The Azure Activity Log is a subscription log that provides
insight into subscription-level events that have occurred in
Azure. The Activity Log reports control plane events for your
subscriptions under the Administrative category. Using the
Activity Log, you can determine the what, who, and when for
any write operations (PUT, POST, DELETE) taken on the
resources in your subscription. You can also understand the
status of the operation and other relevant properties. The
Activity Log differs from Diagnostic Logs. Activity Logs
provide data about the operations on a resource from the
outside (the control plane). In the Azure Cosmos DB context,
some of the control plane operations include create
collection, list keys, delete keys, list database, and more.
Diagnostic Logs are emitted by a resource and provide
information about the operation of that resource (the data
plane). Some of the data plane diagnostic log examples
include delete, insert, ReadFeed operation, and more.

Start with Session, the default consistency level. It's the recommended consistency level to start with as it

receives data later, but in the same order as the writes.

Existing applications: Find the normalized request unit
consumption metric of your database or container.
Normalized usage is a measure of how much you're currently
using your standard (manual) provisioned throughput.

Set provisioned RU/s to T for all hours in a month. If you set provisioned RU/s to T and use the full amount

for 66% of the hours or more, it's estimated you'll save with

standard (manual) provisioned RU/s . If you set autoscale

max RU/s to Tmax and use the full amount Tmax for

66% of the hours or less, it's estimated you'll save with

autoscale.

Enable automatic failover when you configure Azure Cosmos DB accounts used for production workloads.

Implement retry logic in your client.

For query-intensive workloads, use Windows 64-bit instead of Linux or Windows 32-bit host processing.

To reduce latency and CPU jitter, enable accelerated networking on client virtual machines in both Windows

and Linux.

Increase the number of threads and tasks.

To avoid network latency, colocate the client in the same region as the Azure Cosmos DB instance.

Call OpenAsync to avoid startup latency on first request.

Scale out client applications across multiple servers if a client consumes more than 50,000 RU/s.

Select a partition key.

Ensure the partition key is a property that has a value that doesn't change.

You can't change a partition key after it's created with the collection.

Ensure the partition key has a high cardinality.

Ensure the partition key spreads RU consumption and data storage evenly across all logical partitions.

Ensure you're running read queries with the partitioned column to reduce RU consumption and latency.

Explore the following table of recommendations to optimize operational excellence for your Azure Cosmos DB

configuration:

Scale out client applications across multiple servers if a client
consumes more than 50,000 RU/s.

There could be a bottleneck because of the machine capping

out on CPU or network usage.

Ensure the partition key spreads RU consumption and data

storage evenly across all logical partitions.

This spread ensures even RU consumption and storage

distribution across your physical partitions.


Source artifacts

To check for Azure Cosmos DB instances where automatic failover isn't enabled, use the following query:

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where properties.enableAutomaticFailover != True

Use the following query to see the list of multiregion writes:

resources
| where type == "microsoft.documentdb/databaseaccounts"
    and properties.enableMultipleWriteLocations == "true"

To view consistency levels for your Azure Cosmos DB accounts, use the following query:

Resources
| project name, type, location, consistencyLevel = properties.consistencyPolicy.defaultConsistencyLevel
| where type == "microsoft.documentdb/databaseaccounts"
| order by name asc

To check if multilocation isn't selected, use the following query:

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where array_length(properties.locations) <= 1

Learn more

Autoscale FAQ

Performance tips for Azure Cosmos DB

Next step

Azure Stack Hub and reliability

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Azure Stack Hub is a hybrid cloud platform that lets you provide Azure services from your datacenter. It provides

a way to run apps in an on-premises environment.

This service unlocks the following hybrid cloud use cases for customer-facing and internal line-of-business apps:

Edge and disconnected solutions: Addresses latency and connectivity requirements by processing data locally.

Cloud apps that meet varied regulations: Allows you to develop and deploy apps with full flexibility to meet
regulatory or policy requirements.

Cloud app model on-premises: Provides Azure services, containers, serverless, and microservice
architectures to update and extend existing apps or build new ones.

For more information, reference Azure Stack Hub overview.

To understand how Azure Stack Hub supports resiliency for your application workload, reference the following

articles:

Capacity planning for Azure Stack Hub overview

Storage Spaces Direct cache and capacity tiers

Datacenter integration planning considerations for Azure Stack Hub integrated systems

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure Stack Hub and reliability.

Azure Stack Hub includes the following design considerations:

Microsoft doesn't provide an SLA for Azure Stack Hub because Microsoft doesn't have control over customer

datacenter reliability, people, and processes.

Azure Stack Hub only supports a single Scale Unit (SU) within a single region, which consists of between four
and 16 servers that use Hyper-V failover clustering. Each region serves as an independent Azure Stack Hub
stamp with separate portal and API endpoints.

Azure Stack Hub doesn't support Availability Zones because it consists of a single region or a single physical
location. High availability to cope with outages of a single location should be implemented by using two
Azure Stack Hub instances deployed in different physical locations.

Azure Stack Hub supports premium storage to ensure compatibility. However, provisioning premium storage

accounts or disks doesn't guarantee that storage objects will be allocated onto SSD or NVMe drives.

Azure Stack Hub supports only a subset of VPN Gateway SKUs available in Azure with a limited bandwidth of
100 or 200 Mbps.

Only one site-to-site (S2S) VPN connection can be created between two Azure Stack Hub deployments. This

connection limit is because of a platform limitation that allows only a single VPN connection to the same IP

address. Multiple S2S VPN connections with higher throughput can be established using third-party NVAs.

Apply general Azure configuration recommendations for all Azure Stack Hub services.

Configuration recommendations

Recommendation | Description

Treat Azure Stack Hub as a scale unit and deploy multiple instances to remove Azure Stack Hub as a single point of failure for encompassed workloads. | Deploy workloads in either an active-active or active-passive configuration across Azure Stack Hub stamps or Azure.
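
The active-passive and active-active options in the recommendation above can be sketched as a small selection routine. This is a minimal illustration, not an Azure Stack Hub API: the `Stamp` class, the stamp names, and the example endpoints are hypothetical, and the `healthy` flag is assumed to come from an external health probe.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Stamp:
    """One deployment target: an Azure Stack Hub stamp or an Azure region (hypothetical model)."""
    name: str
    endpoint: str
    healthy: bool


def pick_active_passive(stamps: Sequence[Stamp]) -> Optional[Stamp]:
    """Active-passive: return the first healthy stamp in priority order, or None."""
    for stamp in stamps:
        if stamp.healthy:
            return stamp
    return None


def pick_active_active(stamps: Sequence[Stamp]) -> List[Stamp]:
    """Active-active: return every healthy stamp so traffic can be spread across them."""
    return [stamp for stamp in stamps if stamp.healthy]


if __name__ == "__main__":
    deployment = [
        Stamp("hub-site-a", "https://portal.site-a.example", healthy=False),
        Stamp("hub-site-b", "https://portal.site-b.example", healthy=True),
    ]
    print(pick_active_passive(deployment))
```

In the active-passive case the list order encodes priority, so the second stamp only receives traffic when the first is reported unhealthy.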

Next step

Have you configured Azure Stack Hub with reliability in mind?

Treat Azure Stack Hub as a scale unit and deploy multiple instances to remove Azure Stack Hub as a single

point of failure for encompassed workloads.

Consider the following recommendation table to optimize your Azure Stack Hub configuration for reliability:

Azure Stack Hub and operational excellence

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Configuration recommendations

Azure Stack Hub is a hybrid cloud platform that lets you provide Azure services from your datacenter. It provides

a way to run apps in an on-premises environment.

This service unlocks the following hybrid cloud use cases for customer-facing and internal line-of-business apps:

Edge and disconnected solutions

: Addresses latency and connectivity requirements by processing data

locally.

Cloud apps that meet varied regulations

: Allows you to develop and deploy apps with full flexibility to meet

regulatory or policy requirements.

Cloud app model on-premises

: Provides Azure services, containers, serverless, and microservice

architectures to update and extend existing apps or build new ones.

For more information, reference Azure Stack Hub overview.

To understand how Azure Stack Hub supports operational excellence for your application workload, reference

the following articles:

Monitor health and alerts in Azure Stack Hub

Monitor Azure Stack Hub hardware components

Manage network resources in Azure Stack Hub

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure Stack Hub and operational excellence.

Azure Stack Hub includes the following design considerations:

Microsoft doesn't provide an SLA for Azure Stack Hub because Microsoft doesn't have control over customer

datacenter reliability, people, and processes.

Azure Stack Hub only supports a single Scale Unit (SU) within a single region, which consists of between four and 16 servers that use Hyper-V failover clustering. Each region serves as an independent Azure Stack Hub stamp with separate portal and API endpoints.

Azure Stack Hub doesn't support Availability Zones because it consists of a single region or a single physical location. High availability to cope with outages of a single location should be implemented by using two Azure Stack Hub instances deployed in different physical locations.

Apply general Azure configuration recommendations for all Azure Stack Hub services.

Have you configured Azure Stack Hub with operational excellence in mind?

Treat Azure Stack Hub as a scale unit and deploy multiple instances to remove Azure Stack Hub as a single

point of failure for encompassed workloads.

Consider the following recommendation table to optimize your Azure Stack Hub configuration for operational

excellence:

Recommendation | Description

Treat Azure Stack Hub as a scale unit and deploy multiple instances to remove Azure Stack Hub as a single point of failure for encompassed workloads. | Deploy workloads in either an active-active or active-passive configuration across Azure Stack Hub stamps or Azure.

Next step

Storage Accounts and reliability

12/16/2022 • 7 minutes to read

Design considerations

Checklist

Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a

high number of input/output operations per second (IOPS). Storage accounts contain all your Azure Storage data

objects, which include:

Blobs

File shares

Queues

Tables

Disks

Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS.

For more information about the different types of storage accounts that support different features, reference

Types of storage accounts.

To understand how an Azure storage account supports resiliency for your application workload, reference the

following articles:

Azure storage redundancy

Disaster recovery and storage account failover

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure storage accounts and reliability.

Azure storage accounts include the following design considerations:

General purpose v1 storage accounts provide access to all Azure Storage services, but may not have the

latest features or the lower per-gigabyte pricing. It's recommended to use general purpose v2 storage

accounts, in most cases. Reasons to use v1 include:

Applications require the classic deployment model.

Applications are transaction intensive or use significant geo-replication bandwidth, but don't require

large capacity.

Applications require a Storage Service REST API version earlier than February 14, 2014, or a client library version earlier than 4.x, and an application upgrade isn't possible.

For more information, reference the Storage account overview.

Storage account names must be between 3 and 24 characters long and may contain only numbers and lowercase letters.

For current SLA specifications, reference SLA for Storage Accounts.

Go to Azure Storage redundancy to determine which redundancy option is best for a specific scenario.

Storage account names must be unique within Azure. No two storage accounts can have the same name.
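
The naming rules in the considerations above (3 to 24 characters, numbers and lowercase letters only) can be checked locally before a deployment is attempted. The following is a minimal sketch; the function name is hypothetical, and it can't verify that a name is unique within Azure, because that requires a call to the service.

```python
import re

# 3 to 24 characters, digits and lowercase ASCII letters only.
_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the length and character rules.

    Global uniqueness is not checked here; only the service can confirm it.
    """
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("contosodata01", "Contoso-Data", "ab"):
        print(candidate, is_valid_storage_account_name(candidate))
```

Running the check in a build or provisioning script surfaces naming mistakes early, before a template deployment fails.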

Have you configured your Azure Storage Account with reliability in mind?

Configuration recommendations

Recommendation | Description

Turn on soft delete for blob data. Soft delete for Azure Storage blobs enables you to recover

blob data after it has been deleted.

Use Azure AD to authorize access to blob data. Azure AD provides superior security and ease of use over

Shared Key for authorizing requests to blob storage. It's

recommended to use Azure AD authorization with your blob

and queue applications when possible to minimize potential

security vulnerabilities inherent in Shared Key. For more

information, reference Authorize access to Azure blobs and

queues using Azure Active Directory.

Consider the principle of least privilege when you assign

permissions to an Azure AD security principal through Azure

RBAC.

When assigning a role to a user, group, or application, grant

that security principal only those permissions necessary for

them to perform their tasks. Limiting access to resources

helps prevent both unintentional and malicious misuse of

your data.

Use managed identities to access blob and queue data. Azure Blob and Queue storage support Azure AD

authentication with managed identities for Azure resources.

Managed identities for Azure resources can authorize access

to blob and queue data using Azure AD credentials from

applications running in Azure virtual machines (VMs),

function apps, virtual machine scale sets, and other services.

By using managed identities for Azure resources together

with Azure AD authentication, you can avoid storing

credentials with your applications that run in the cloud and

issues with expiring service principals. Reference Authorize

access to blob and queue data with managed identities for

Azure resources for more information.

Turn on soft delete for blob data.

Use Azure AD to authorize access to blob data.

Consider the principle of least privilege when you assign permissions to an Azure AD security principal

through Azure RBAC.

Use managed identities to access blob and queue data.

Use blob versioning or immutable blobs to store business-critical data.

Restrict default internet access for storage accounts.

Enable firewall rules.

Limit network access to specific networks.

Allow trusted Microsoft services to access the storage account.

Enable the Secure transfer required option on all your storage accounts.

Limit shared access signature (SAS) tokens to HTTPS connections only.

Avoid and prevent using Shared Key authorization to access storage accounts.

Regenerate your account keys periodically.

Create a revocation plan and have it in place for any SAS that you issue to clients.

Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.

Consider the following recommendations to optimize reliability when configuring your Azure Storage Account:

Use blob versioning or immutable blobs to store business-
critical data.
Consider using Blob versioning to maintain previous
versions of an object or the use of legal holds and time-
based retention policies to store blob data in a WORM
(Write Once, Read Many) state. Immutable blobs can be
read, but can't be modified or deleted during the retention
interval. For more information, reference Store business-
critical blob data with immutable storage.
Restrict default internet access for storage accounts. By default, network access to storage accounts isn't
restricted and is open to all traffic coming from the internet.
Whenever possible, grant access only to specific Azure
virtual networks, or use private endpoints to allow clients
on a virtual network (VNet) to access data securely over
Private Link. Reference Use private endpoints for Azure
Storage for more information. Exceptions can be made for
storage accounts that need to be accessible over the
internet.
Enable firewall rules. Configure firewall rules to limit access to your storage
account to requests that originate from specified IP
addresses or ranges, or from a list of subnets in an Azure
Virtual Network (VNet). For more information about
configuring firewall rules, reference Configure Azure Storage
firewalls and virtual networks.
Limit network access to specific networks. Limiting network access to networks hosting clients
requiring access reduces the exposure of your resources to
network attacks either by using the built-in Firewall and
virtual networks functionality or by using private endpoints.
Allow trusted Microsoft services to access the storage
account.
Turning on firewall rules for storage accounts blocks
incoming requests for data by default, unless the requests
originate from a service operating within an Azure Virtual
Network (VNet) or from allowed public IP addresses. Blocked
requests include those requests from other Azure services,
from the Azure portal, from logging and metrics services,
and so on. You can permit requests from other Azure
services by adding an exception to allow trusted Microsoft
services to access the storage account. For more information
about adding an exception for trusted Microsoft services,
reference Configure Azure Storage firewalls and virtual
networks.
Enable the Secure transfer required option on all your
storage accounts.
When you enable the Secure transfer required option, all
requests made against the storage account must take place
over secure connections. Any requests made over HTTP will
fail. For more information, reference Require secure transfer
in Azure Storage.
Limit shared access signature (SAS) tokens to HTTPS
connections only.
Requiring HTTPS when a client uses a SAS token to access
blob data helps to minimize the risk of eavesdropping. For
more information, reference Grant limited access to Azure
Storage resources using shared access signatures (SAS).
Avoid and prevent using Shared Key authorization to access
storage accounts.
It's recommended to use Azure AD to authorize requests to
Azure Storage and to prevent Shared Key Authorization. For
scenarios that require Shared Key authorization, always
prefer SAS tokens over distributing the Shared Key.

Regenerate your account keys periodically. Rotating the account keys periodically reduces the risk of

exposing your data to malicious actors.

Create a revocation plan and have it in place for any SAS

that you issue to clients.

If a SAS is compromised, you'll want to revoke that SAS

immediately. To revoke a user delegation SAS, revoke the

user delegation key to quickly invalidate all signatures

associated with that key. To revoke a service SAS that's

associated with a stored access policy, you can delete the

stored access policy, rename the policy, or change its expiry

time to a time that is in the past.

Use near-term expiration times on an impromptu SAS,

service SAS, or account SAS.

If a SAS is compromised, it's valid only for a short time. This

practice is especially important if you can't reference a stored

access policy. Near-term expiration times also limit the

amount of data that can be written to a blob by limiting the

time available to upload to it. Clients should renew the SAS

well before the expiration to allow time for retries if the

service providing the SAS is unavailable.
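
The guidance on near-term expiration and early renewal can be sketched with plain date arithmetic. This is an illustration only: the 15-minute lifetime and 5-minute renewal margin are assumed values, the function names are hypothetical, and issuing the actual SAS token is left to whatever service or SDK you use.

```python
from datetime import datetime, timedelta, timezone


def sas_expiry(now: datetime, lifetime: timedelta = timedelta(minutes=15)) -> datetime:
    """Compute a near-term expiry time for a newly issued SAS (assumed lifetime)."""
    return now + lifetime


def should_renew(now: datetime, expiry: datetime,
                 margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True once the remaining validity is within the renewal margin.

    Renewing early leaves time for retries if the SAS-issuing service is unavailable.
    """
    return expiry - now <= margin


if __name__ == "__main__":
    issued_at = datetime.now(timezone.utc)
    expires_at = sas_expiry(issued_at)
    print("expires at:", expires_at.isoformat())
    print("renew now?", should_renew(issued_at, expires_at))
```

A client would call `should_renew` before each request, or on a timer, and request a fresh SAS as soon as it returns `True`.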


Next step

Storage Accounts and security

12/16/2022 • 7 minutes to read

Design considerations

Checklist

Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a

high number of input/output operations per second (IOPS). Storage accounts contain all your Azure Storage data

objects, which include:

Blobs

File shares

Queues

Tables

Disks

Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS.

For more information about the different types of storage accounts that support different features, reference

Types of storage accounts.

To understand how an Azure storage account boosts security for your application workload, reference the

following articles:

Azure security baseline for Azure Storage

Azure Storage encryption for data at rest

Use private endpoints for Azure Storage

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure storage accounts and security.

Azure storage accounts include the following design considerations:

Storage account names must be between 3 and 24 characters long and may contain only numbers and lowercase letters.

For current SLA specifications, reference SLA for Storage Accounts.

Go to Azure Storage redundancy to determine which redundancy option is best for a specific scenario.

Storage account names must be unique within Azure. No two storage accounts can have the same name.

Have you configured your Azure Storage Account with security in mind?

Enable Azure Defender for all your storage accounts.

Turn on soft delete for blob data.

Use Azure AD to authorize access to blob data.

Consider the principle of least privilege when you assign permissions to an Azure AD security principal

through Azure RBAC.

Use managed identities to access blob and queue data.

Use blob versioning or immutable blobs to store business-critical data.

Restrict default internet access for storage accounts.

Configuration recommendations

Recommendation | Description

Enable Azure Defender for all your storage accounts. Azure Defender for Azure Storage provides an extra layer of

security intelligence that detects unusual and potentially

harmful attempts to access or exploit storage accounts.

Security alerts are triggered in Azure Security Center when

anomalies in activity occur. Alerts are also sent through email

to subscription administrators, with details of suspicious

activity and recommendations on how to investigate and

remediate threats. For more information, reference

Configure Azure Defender for Azure Storage.

Turn on soft delete for blob data. Soft delete for Azure Storage blobs enables you to recover

blob data after it has been deleted.

Use Azure AD to authorize access to blob data. Azure AD provides superior security and ease of use over

Shared Key for authorizing requests to blob storage. It's

recommended to use Azure AD authorization with your blob

and queue applications when possible to minimize potential

security vulnerabilities inherent in Shared Key. For more

information, reference Authorize access to Azure blobs and

queues using Azure Active Directory.

Consider the principle of least privilege when you assign

permissions to an Azure AD security principal through Azure

RBAC.

When assigning a role to a user, group, or application, grant

that security principal only those permissions necessary for

them to complete their tasks. Limiting access to resources

helps prevent both unintentional and malicious misuse of

your data.

Use managed identities to access blob and queue data. Azure Blob and Queue storage support Azure AD

authentication with managed identities for Azure resources.

Managed identities for Azure resources can authorize access

to blob and queue data using Azure AD credentials from

applications running in Azure virtual machines (VMs),

function apps, virtual machine scale sets, and other services.

By using managed identities for Azure resources together

with Azure AD authentication, you can avoid storing

credentials with your applications that run in the cloud and

issues with expiring service principals. Reference Authorize

access to blob and queue data with managed identities for

Azure resources for more information.

Enable firewall rules.

Limit network access to specific networks.

Allow trusted Microsoft services to access the storage account.

Enable the Secure transfer required option on all your storage accounts.

Limit shared access signature (SAS) tokens to HTTPS connections only.

Avoid and prevent using Shared Key authorization to access storage accounts.

Regenerate your account keys periodically.

Create a revocation plan and have it in place for any SAS that you issue to clients.

Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.

Consider the following recommendations to optimize security when configuring your Azure Storage Account:

Use blob versioning or immutable blobs to store business-
critical data.
Consider using Blob versioning to maintain previous
versions of an object or the use of legal holds and time-
based retention policies to store blob data in a WORM
(Write Once, Read Many) state. Immutable blobs can be
read, but can't be modified or deleted during the retention
interval. For more information, reference Store business-
critical blob data with immutable storage.
Restrict default internet access for storage accounts. By default, network access to storage accounts isn't
restricted and is open to all traffic coming from the internet.
Whenever possible, grant access only to specific Azure
virtual networks, or use private endpoints to allow clients
on a virtual network (VNet) to access data securely over
Private Link. Reference Use private endpoints for Azure
Storage for more information. Exceptions can be made for
storage accounts that need to be accessible over the
internet.
Enable firewall rules. Configure firewall rules to limit access to your storage
account to requests that originate from specified IP
addresses or ranges, or from a list of subnets in an Azure
Virtual Network (VNet). For more information about
configuring firewall rules, reference Configure Azure Storage
firewalls and virtual networks.
Limit network access to specific networks. Limiting network access to networks hosting clients
requiring access reduces the exposure of your resources to
network attacks either by using the built-in Firewall and
virtual networks functionality or by using private endpoints.
Allow trusted Microsoft services to access the storage
account.
Turning on firewall rules for storage accounts blocks
incoming requests for data by default, unless the requests
originate from a service operating within an Azure Virtual
Network (VNet) or from allowed public IP addresses. Blocked
requests include those requests from other Azure services,
from the Azure portal, from logging and metrics services,
and so on. You can permit requests from other Azure
services by adding an exception to allow trusted Microsoft
services to access the storage account. For more information
about adding an exception for trusted Microsoft services,
reference Configure Azure Storage firewalls and virtual
networks.
Enable the Secure transfer required option on all your
storage accounts.
When you enable the Secure transfer required option, all
requests made against the storage account must take place
over secure connections. Any requests made over HTTP will
fail. For more information, reference Require secure transfer
in Azure Storage.
Limit shared access signature (SAS) tokens to HTTPS
connections only.
Requiring HTTPS when a client uses a SAS token to access
blob data helps to minimize the risk of eavesdropping. For
more information, reference Grant limited access to Azure
Storage resources using shared access signatures (SAS).
Avoid and prevent using Shared Key authorization to access
storage accounts.
It's recommended to use Azure AD to authorize requests to
Azure Storage and to prevent Shared Key Authorization. For
scenarios that require Shared Key authorization, always
prefer SAS tokens over distributing the Shared Key.

Regenerate your account keys periodically. Rotating the account keys periodically reduces the risk of

exposing your data to malicious actors.

Create a revocation plan and have it in place for any SAS

that you issue to clients.

If a SAS is compromised, you'll want to revoke that SAS

immediately. To revoke a user delegation SAS, revoke the

user delegation key to quickly invalidate all signatures

associated with that key. To revoke a service SAS that's

associated with a stored access policy, you can delete the

stored access policy, rename the policy, or change its expiry

time to a time that is in the past.

Use near-term expiration times on an impromptu SAS,

service SAS, or account SAS.

If a SAS is compromised, it's valid only for a short time. This

practice is especially important if you can't reference a stored

access policy. Near-term expiration times also limit the

amount of data that can be written to a blob by limiting the

time available to upload to it. Clients should renew the SAS

well before the expiration to allow time for retries if the

service providing the SAS is unavailable.
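
The firewall-rule recommendation in the table above limits access to requests from specified IP addresses or ranges. The allow-list logic can be sketched as follows. This is a local model for reasoning about rules, not the storage firewall itself: the function name and the example ranges are hypothetical, and the sketch assumes a default-deny stance when no rule matches.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_ranges: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any allowed address or CIDR range.

    An empty allow list denies everything (default deny).
    """
    address = ipaddress.ip_address(client_ip)
    for entry in allowed_ranges:
        # A bare address such as "192.0.2.10" is treated as a single-host network.
        network = ipaddress.ip_network(entry)
        if address in network:
            return True
    return False


if __name__ == "__main__":
    rules = ["203.0.113.0/24", "192.0.2.10"]
    for ip in ("203.0.113.7", "198.51.100.1", "192.0.2.10"):
        print(ip, is_allowed(ip, rules))
```

Modeling the rules this way makes it easy to test a planned allow list against known client addresses before applying it to a storage account.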


Next step

Storage Accounts and cost optimization

12/16/2022 • 2 minutes to read

Design considerations

Considerations | Description

Periodically dispose of and clean up unused storage resources, such as unattached disks and old snapshots. | Unused storage resources can incur cost, so perform cleanup regularly to reduce cost.

Consider Azure Blob access time tracking and access time-based lifecycle management. | Minimize your storage cost automatically by setting up a policy, based on last access time, that moves data to more cost-effective storage options.

Transition your data from a hotter access tier to a cooler access tier if there's no access for a period. | For example: hot to cool, cool to archive, or hot to archive.
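
The tier transitions listed above (hot to cool, cool to archive, hot to archive) can be sketched as a decision based on days since last access. This is an illustration of the reasoning only: the function name and the 30-day and 180-day thresholds are assumptions, and in practice a lifecycle management policy applies such rules for you.

```python
# Tiers ordered from hottest to coolest.
_TIER_ORDER = ["hot", "cool", "archive"]


def target_tier(current: str, days_since_access: int,
                cool_after: int = 30, archive_after: int = 180) -> str:
    """Pick the tier a blob should move to, given how long it has gone unaccessed.

    Only moves toward a cooler tier; a blob is never moved back to a hotter one here.
    Thresholds are hypothetical defaults.
    """
    if days_since_access >= archive_after:
        wanted = "archive"
    elif days_since_access >= cool_after:
        wanted = "cool"
    else:
        wanted = "hot"
    if _TIER_ORDER.index(wanted) > _TIER_ORDER.index(current):
        return wanted
    return current


if __name__ == "__main__":
    for tier, days in (("hot", 45), ("cool", 200), ("hot", 200), ("cool", 5)):
        print(tier, days, "->", target_tier(tier, days))
```

Keeping the transition one-directional mirrors the examples in the table, where data only ever moves from a hotter tier to a cooler one.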

Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a

high number of input/output operations per second (IOPS). Storage accounts contain all your Azure Storage data

objects, which include:

Blobs

File shares

Queues

Tables

Disks

Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS.

For more information about the different types of storage accounts that support different features, reference

Types of storage accounts.

To understand how an Azure storage account can optimize costs for your workload, reference the following

articles:

Plan and manage costs for Azure Blob Storage

Optimize costs for Blob storage with reserved capacity

Understand how reservation discounts are applied to Azure storage services

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure storage accounts and cost optimization.

Azure storage accounts include the following design considerations:

Periodically dispose of and clean up unused storage resources, such as unattached disks and old snapshots.

Consider Azure Blob access time tracking and access time-based lifecycle management.

Transition your data from a hotter access tier to a cooler access tier if there's no access for a period.

Delete your data if there's no access for an extended period.

Checklist

Configuration recommendations

Recommendation | Description

Consider cost savings by reserving data capacity for block blob storage. | Save money by reserving capacity for block blob and Azure Data Lake Storage Gen2 data in standard storage accounts when you commit to a one-year or three-year reservation.

Organize data into access tiers. | You can reduce cost by placing blob data into the most cost-effective access tier. Place frequently accessed data in the hot tier and less frequently accessed data in the cool or archive tier. Use premium storage for workloads with high transaction volumes or workloads where latency is critical.

Use lifecycle policy to move data between access tiers. | Lifecycle management policy periodically moves data between tiers. Policies can move data based on rules specified by the user. For example, you can create rules that move blobs to the archive tier if they haven't been modified in 90 days. Unused data can be removed completely using a policy. By creating policies that adjust the access tier of your data, you can design the least expensive storage options for your requirements.
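
A lifecycle rule of the kind described above, which archives block blobs after 90 days without modification, can be expressed as a JSON policy document. The sketch below builds one in Python. The property names follow the lifecycle management policy schema as understood at the time of writing, so verify them against the current Azure Storage documentation; the rule name, the function name, and the 30-day and 365-day thresholds are hypothetical.

```python
import json
from typing import Any, Dict


def build_lifecycle_policy(cool_after_days: int, archive_after_days: int,
                           delete_after_days: int) -> Dict[str, Any]:
    """Build a lifecycle policy that tiers and then deletes block blobs by age since modification."""
    if not 0 < cool_after_days < archive_after_days < delete_after_days:
        raise ValueError("thresholds must be positive and strictly increasing")
    return {
        "rules": [
            {
                "name": "tier-and-expire-block-blobs",  # hypothetical rule name
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {"blobTypes": ["blockBlob"]},
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                            "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                            "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_policy(30, 90, 365), indent=2))
```

Generating the policy from code keeps the thresholds in one place and lets you validate them before the document is applied to a storage account.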

Next step

Have you configured your Azure Storage Account with cost optimization in mind?

Consider cost savings by reserving data capacity for block blob storage.

Organize data into access tiers.

Use lifecycle policy to move data between access tiers.

Consider the following recommendations to optimize costs when configuring your Azure Storage Account:

Storage Accounts and operational excellence

12/16/2022 • 7 minutes to read

Design considerations

Checklist

Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a

high number of input/output operations per second (IOPS). Storage accounts contain all your Azure Storage data

objects, which include:

Blobs

File shares

Queues

Tables

Disks

Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS.

For more information about the different types of storage accounts that support different features, reference

Types of storage accounts.

To understand how an Azure storage account can promote operational excellence for your workload, reference

the following articles:

Best practices for monitoring Azure Blob Storage

Use Azure Storage analytics to collect logs and metrics data

Azure Storage analytics logging

The following sections include design considerations, a configuration checklist, and recommended configuration

options specific to Azure storage accounts and operational excellence.

Azure storage accounts include the following design considerations:

General purpose v1 storage accounts provide access to all Azure Storage services, but may not have the

latest features or the lower per-gigabyte pricing. It's recommended to use general purpose v2 storage

accounts, in most cases. Reasons to use v1 include:

Applications require the classic deployment model.

Applications are transaction intensive or use significant geo-replication bandwidth, but don't require

large capacity.

Applications require a Storage Service REST API version earlier than February 14, 2014, or a client library version earlier than 4.x, and an application upgrade isn't possible.

For more information, reference the Storage account overview.

Storage account names must be between 3 and 24 characters long and may contain only numbers and lowercase letters.

For current SLA specifications, reference SLA for Storage Accounts.

Go to Azure Storage redundancy to determine which redundancy option is best for a specific scenario.

Storage account names must be unique within Azure. No two storage accounts can have the same name.

Configuration recommendations

Recommendation | Description

Enable Azure Defender for all your storage accounts. Azure Defender for Azure Storage provides an extra layer of

security intelligence that detects unusual and potentially

harmful attempts to access or exploit storage accounts.

Security alerts are triggered in Azure Security Center when

anomalies in activity occur. Alerts are also sent through email

to subscription administrators, with details of suspicious

activity and recommendations on how to investigate and

remediate threats. For more information, reference

Configure Azure Defender for Azure Storage.

Turn on soft delete for blob data. Soft delete for Azure Storage blobs enables you to recover

blob data after it has been deleted.

Use Azure AD to authorize access to blob data. Azure AD provides superior security and ease of use over

Shared Key for authorizing requests to blob storage. It's

recommended to use Azure AD authorization with your blob

and queue applications when possible to minimize potential

security vulnerabilities inherent in Shared Key. For more

information, reference Authorize access to Azure blobs and

queues using Azure Active Directory.

Consider the principle of least privilege when you assign

permissions to an Azure AD security principal through Azure

RBAC.

When assigning a role to a user, group, or application, grant

that security principal only those permissions necessary for

them to complete their tasks. Limiting access to resources

helps prevent both unintentional and malicious misuse of

your data.

Have you configured your Azure Storage Account with operational excellence in mind?

Enable Azure Defender for all your storage accounts.

Turn on soft delete for blob data.

Use Azure AD to authorize access to blob data.

Consider the principle of least privilege when you assign permissions to an Azure AD security principal

through Azure RBAC.

Use managed identities to access blob and queue data.

Use blob versioning or immutable blobs to store business-critical data.

Restrict default internet access for storage accounts.

Enable firewall rules.

Limit network access to specific networks.

Allow trusted Microsoft services to access the storage account.

Enable the Secure transfer requiredSecure transfer required option on all your storage accounts.

Limit shared access signature (SAS) tokens to HTTPS connections only.

Avoid and prevent using Shared Key authorization to access storage accounts.

Regenerate your account keys periodically.

Create a revocation plan and have it in place for any SAS that you issue to clients.

Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.

Consider the following recommendations to optimize operational excellence when configuring your Azure

Storage Account:

Recommendation: Use managed identities to access blob and queue data.
Description: Azure Blob and Queue storage support Azure AD authentication with managed identities for Azure resources. Managed identities for Azure resources can authorize access to blob and queue data using Azure AD credentials from applications running in Azure virtual machines (VMs), function apps, virtual machine scale sets, and other services. By using managed identities for Azure resources together with Azure AD authentication, you can avoid storing credentials with your applications that run in the cloud and avoid issues with expiring service principals. Reference Authorize access to blob and queue data with managed identities for Azure resources for more information.

Recommendation: Use blob versioning or immutable blobs to store business-critical data.
Description: Consider using blob versioning to maintain previous versions of an object, or using legal holds and time-based retention policies to store blob data in a WORM (Write Once, Read Many) state. Immutable blobs can be read, but can't be modified or deleted during the retention interval. For more information, reference Store business-critical blob data with immutable storage.

Recommendation: Restrict default internet access for storage accounts.
Description: By default, network access to storage accounts isn't restricted and is open to all traffic coming from the internet. Whenever possible, grant access to storage accounts only to specific Azure Virtual Networks, or use private endpoints to allow clients on a virtual network (VNet) to access data securely over a Private Link. Reference Use private endpoints for Azure Storage for more information. Exceptions can be made for storage accounts that need to be accessible over the internet.

Recommendation: Enable firewall rules.
Description: Configure firewall rules to limit access to your storage account to requests that originate from specified IP addresses or ranges, or from a list of subnets in an Azure Virtual Network (VNet). For more information about configuring firewall rules, reference Configure Azure Storage firewalls and virtual networks.

Recommendation: Limit network access to specific networks.
Description: Limiting network access to the networks that host clients requiring access reduces the exposure of your resources to network attacks. Use either the built-in firewall and virtual networks functionality or private endpoints.

Recommendation: Allow trusted Microsoft services to access the storage account.
Description: Turning on firewall rules for storage accounts blocks incoming requests for data by default, unless the requests originate from a service operating within an Azure Virtual Network (VNet) or from allowed public IP addresses. Blocked requests include those from other Azure services, from the Azure portal, from logging and metrics services, and so on. You can permit requests from other Azure services by adding an exception to allow trusted Microsoft services to access the storage account. For more information about adding an exception for trusted Microsoft services, reference Configure Azure Storage firewalls and virtual networks.

Recommendation: Enable the Secure transfer required option on all your storage accounts.
Description: When you enable the Secure transfer required option, all requests made against the storage account must take place over secure connections. Any requests made over HTTP will fail. For more information, reference Require secure transfer in Azure Storage.

Recommendation: Limit shared access signature (SAS) tokens to HTTPS connections only.
Description: Requiring HTTPS when a client uses a SAS token to access blob data helps to minimize the risk of eavesdropping. For more information, reference Grant limited access to Azure Storage resources using shared access signatures (SAS).

Recommendation: Avoid and prevent using Shared Key authorization to access storage accounts.
Description: It's recommended to use Azure AD to authorize requests to Azure Storage and to prevent Shared Key authorization. For scenarios that require Shared Key authorization, always prefer SAS tokens over distributing the Shared Key.

Recommendation: Regenerate your account keys periodically.
Description: Rotating the account keys periodically reduces the risk of exposing your data to malicious actors.

Recommendation: Create a revocation plan and have it in place for any SAS that you issue to clients.
Description: If a SAS is compromised, you'll want to revoke that SAS immediately. To revoke a user delegation SAS, revoke the user delegation key to quickly invalidate all signatures associated with that key. To revoke a service SAS that's associated with a stored access policy, you can delete the stored access policy, rename the policy, or change its expiry time to a time that is in the past.

Recommendation: Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.
Description: If a SAS is compromised, it's valid only for a short time. This practice is especially important if you can't reference a stored access policy. Near-term expiration times also limit the amount of data that can be written to a blob by limiting the time available to upload to it. Clients should renew the SAS well before the expiration to allow time for retries if the service providing the SAS is unavailable.
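The advice on near-term expiration and early renewal can be made concrete with a little date arithmetic. The following is a minimal Python sketch, not an Azure SDK call: `sas_schedule`, `needs_renewal`, and the `renew_fraction` parameter are hypothetical names introduced for illustration, and the 30-minute lifetime in the usage note is an arbitrary example value.

```python
from datetime import datetime, timedelta, timezone


def sas_schedule(issued_at: datetime, lifetime: timedelta, renew_fraction: float = 0.5):
    """Return (expiry, renew_at) for a short-lived SAS.

    renew_fraction is a hypothetical tuning knob: 0.5 means the client
    requests a replacement SAS once half of the lifetime has elapsed,
    which leaves the other half as a retry window.
    """
    if lifetime <= timedelta(0):
        raise ValueError("lifetime must be positive")
    if not 0 < renew_fraction < 1:
        raise ValueError("renew_fraction must be between 0 and 1")
    expiry = issued_at + lifetime
    renew_at = issued_at + lifetime * renew_fraction
    return expiry, renew_at


def needs_renewal(now: datetime, renew_at: datetime) -> bool:
    """True once the client should ask the issuing service for a new SAS."""
    return now >= renew_at
```

For example, a SAS issued at 12:00 UTC with `timedelta(minutes=30)` and the default fraction expires at 12:30 and is due for renewal from 12:15, so the client has 15 minutes to retry if the issuing service is temporarily unavailable.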


Next step

Disks and cost optimization

12/16/2022 • 2 minutes to read

Azure managed disks are block-level storage volumes that are managed by Azure and used with Azure Virtual Machines. Managed disks are like a physical disk in an on-premises server, but these disks are virtualized. Available disk types include:

- Ultra disks
- Premium solid-state drives (SSD)
- Standard SSDs
- Standard hard disk drives (HDD)

For more information about the different types of disks, reference Azure managed disk types.

To understand how Azure managed disks are cost-effective solutions for your workload, reference the following articles:

- Overview of Azure Disk Backup
- Understand how your reservation discount is applied to Azure disk storage
- Reduce costs with Azure Disks Reservation

The following sections include design considerations, a configuration checklist, and recommended configuration options specific to Azure managed disks and cost optimization.

Design considerations

Azure Disks include the following design considerations:

- Use a shared disk for workloads such as SQL Server failover cluster instances (FCI), file servers for general use (IW workload), and SAP ASCS/SCS.
- Consider selective disk backup and restore for Azure VMs.

Consideration: Use a shared disk for workloads such as SQL Server failover cluster instances (FCI), file servers for general use (IW workload), and SAP ASCS/SCS.
Description: You can use shared disks to enable cost-effective clustering instead of setting up your own shared disks through S2D (Storage Spaces Direct). Sample workloads that would benefit from shared disks include:
- SQL Server Failover Cluster Instances (FCI)
- Scale-out File Server (SoFS)
- File Server for General Use (IW workload)
- SAP ASCS/SCS

Checklist

Have you configured your Azure managed disk with cost optimization in mind?

- Configure data and log files on different disks for database workloads.
- Use bursting for P20 and lower disks for workloads such as batch jobs and workloads that handle traffic spikes, and to improve OS boot time.
- Consider using Premium disks (P30 and greater).

Configuration recommendations

Consider the following recommendations to optimize costs when configuring your Azure managed disk:

Recommendation: Configure data and log files on different disks for database workloads.
Description: You can optimize IaaS database workload performance by configuring system, data, and log files to be on different disk SKUs (using Premium Disks for data and Ultra Disks for logs satisfies most production scenarios). Ultra Disk cost and performance can be optimized by configuring capacity, IOPS, and throughput independently. You can also configure these attributes dynamically. Example workloads include:
- SQL on IaaS
- Cassandra
- MariaDB
- MySQL
- MongoDB on IaaS

Recommendation: Use bursting for P20 and lower disks for workloads such as batch jobs and workloads that handle traffic spikes, and to improve OS boot time.
Description: Azure Disks offer various SKUs and sizes to satisfy different workload requirements. Some of the more recent features could help further optimize the cost performance of existing disk use cases. You can use disk bursting for Premium disks (P20 and lower). Example scenarios that could benefit from this feature include:
- Improving OS boot time
- Handling batch jobs
- Handling traffic spikes

Recommendation: Consider using Premium disks (P30 and greater).
Description: Premium Disks (P30 and greater) can be reserved (one or three years) at a discounted price.

Next step

Event Grid and reliability

12/16/2022 • 3 minutes to read

Azure Event Grid lets you easily build applications with event-based architectures. This solution has built-in support for events coming from Azure services, like storage blobs and resource groups. Event Grid also has support for your own events, using custom topics.

For more information about using Event Grid, reference Create and route custom events with Azure Event Grid.

To understand how using Event Grid creates a more reliable workload, reference Server-side geo disaster recovery in Azure Event Grid.

The following sections are specific to Azure Event Grid and reliability:

- Design considerations
- Configuration checklist
- Recommended configuration options
- Source artifacts

Design considerations

Azure Event Grid provides an uptime SLA. For more information, reference SLA for Event Grid.

Checklist

Have you configured Azure Event Grid with reliability in mind?

- Deploy an Event Grid instance per region, in case of a multi-region Azure solution.
- Monitor Event Grid for failed event delivery.
- Use batched events.
- Event batches can't exceed 1 MB in size.
- Configure and optimize batch-size selection during load testing.
- Ensure Event Grid messages are accepted with HTTP 200-204 responses only if delivering to an endpoint that holds custom code.
- Monitor Event Grid for failed event publishing.

Configuration recommendations

Consider the following recommendations to optimize reliability when configuring Azure Event Grid:

Recommendation: Monitor Event Grid for failed event delivery.
Description: The Delivery Failed metric will increase every time a message can't be delivered to an event handler (a timeout or an HTTP status code outside the 200-204 range). If an event can't be lost, set up a Dead-Letter-Queue (DLQ) storage account. Events that can't be delivered after the maximum retry count are placed in the DLQ account. Optionally, implement a notification system on the DLQ storage account, for example, by handling a new file event through Event Grid.

Recommendation: Use batched events in high-throughput scenarios.
Description: The service will deliver a JSON array with multiple events to the subscribers, instead of an array with one event. The consuming application must be able to process these arrays.

Recommendation: Event batches can't exceed 1 MB in size.
Description: If the message payload is large, only one or a few messages will fit in the batch. The consuming service will need to process more event batches. If your event has a large payload, consider storing it elsewhere, such as in blob storage, and passing a reference in the event. When integrating with third-party services through the CloudEvents schema, events larger than 64 KB aren't recommended.

Recommendation: Configure and optimize batch-size selection during load testing.
Description: Batch size selection depends on the payload size and the message volume.

Recommendation: Monitor Event Grid for failed event publishing.
Description: The Unmatched metric will show messages that are published, but not matched to any subscription. Depending on your application architecture, unmatched messages may be intentional.
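The 1 MB batch limit and the advice about large payloads can be illustrated with a small batching helper. The following is a minimal Python sketch under stated assumptions, not Event Grid's own size accounting: `split_into_batches` and `MAX_BATCH_BYTES` are hypothetical names, and a batch's size is assumed to be the UTF-8 length of its compact JSON array encoding.

```python
import json

MAX_BATCH_BYTES = 1024 * 1024  # assumed interpretation of the 1 MB batch limit


def split_into_batches(events, max_bytes=MAX_BATCH_BYTES):
    """Group events into lists whose compact JSON array encoding fits in max_bytes."""
    batches, current = [], []
    current_size = 2  # the enclosing "[" and "]"
    for event in events:
        encoded = len(json.dumps(event, separators=(",", ":")).encode("utf-8"))
        if encoded + 2 > max_bytes:
            # A single oversized event can never fit; the text suggests
            # storing the payload elsewhere and passing a reference.
            raise ValueError("single event exceeds the batch limit")
        extra = encoded + (1 if current else 0)  # comma between array items
        if current and current_size + extra > max_bytes:
            batches.append(current)
            current, current_size = [], 2
            extra = encoded
        current.append(event)
        current_size += extra
    if current:
        batches.append(current)
    return batches
```

Running a helper like this against representative payloads during load testing shows how many events fit per batch, which is the input needed to choose a batch size.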


Source artifacts

To determine the Input Schema type for all available Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']

To retrieve the Resource ID of existing private endpoints for Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']['privateEndpoint']['id']

To identify Public Network Access status for all available Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']

To identify Firewall Rules for all public Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']

To identify Firewall Rules for all public Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']

To retrieve the Resource ID of existing private endpoints for Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']['privateEndpoint']['id']

To determine the Input Schema type for all available Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']

To identify Public Network Access status for all available Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']

Next step

Event Grid and operational excellence

12/16/2022 • 3 minutes to read

Azure Event Grid lets you easily build applications with event-based architectures. This solution has built-in support for events coming from Azure services, like storage blobs and resource groups. Event Grid also has support for your own events, using custom topics.

For more information about using Event Grid, reference Create and route custom events with Azure Event Grid.

To understand how using Event Grid promotes operational excellence for your workload, reference Diagnostic logs for Event Grid topics and Event Grid domains.

The following sections are specific to Azure Event Grid and operational excellence:

- Design considerations
- Configuration checklist
- Recommended configuration options
- Source artifacts

Design considerations

Azure Event Grid provides an uptime SLA. For more information, reference SLA for Event Grid.

Checklist

Have you configured Azure Event Grid with operational excellence in mind?

- Monitor Event Grid for failed event delivery.
- Use batched events.
- Event batches can't exceed 1 MB in size.
- Configure and optimize batch-size selection during load testing.
- Ensure Event Grid messages are accepted with HTTP 200-204 responses only if delivering to an endpoint that holds custom code.
- Monitor Event Grid for failed event publishing.

Configuration recommendations

Consider the following recommendations to optimize operational excellence when configuring Azure Event Grid:

Recommendation: Monitor Event Grid for failed event delivery.
Description: The Delivery Failed metric will increase every time a message can't be delivered to an event handler (a timeout or an HTTP status code outside the 200-204 range). If an event can't be lost, set up a Dead-Letter-Queue (DLQ) storage account. Events that can't be delivered after the maximum retry count are placed in the DLQ account. Optionally, implement a notification system on the DLQ storage account, for example, by handling a new file event through Event Grid.

Recommendation: Use batched events in high-throughput scenarios.
Description: The service will deliver a JSON array with multiple events to the subscribers, instead of an array with one event. The consuming application must be able to process these arrays.

Recommendation: Event batches can't exceed 1 MB in size.
Description: If the message payload is large, only one or a few messages will fit in the batch. The consuming service will need to process more event batches. If your event has a large payload, consider storing it elsewhere, such as in blob storage, and passing a reference in the event. When integrating with third-party services through the CloudEvents schema, events larger than 64 KB aren't recommended.

Recommendation: Configure and optimize batch-size selection during load testing.
Description: Batch size selection depends on the payload size and the message volume.

Recommendation: Monitor Event Grid for failed event publishing.
Description: The Unmatched metric will show messages that are published, but not matched to any subscription. Depending on your application architecture, unmatched messages may be intentional.
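On the consuming side, the recommendations above imply two things: the handler must accept a JSON array of events, and it should answer with a success status only when the whole delivery was processed. The following is a minimal Python sketch of that logic, independent of any web framework. `handle_delivery` and `process_event` are hypothetical names, and the specific failure codes (400 and 500) are assumptions chosen for illustration; the text only states that a status outside 200-204 counts as a failed delivery.

```python
import json


def handle_delivery(body: bytes, process_event) -> int:
    """Return the HTTP status code to send back for one Event Grid delivery."""
    try:
        events = json.loads(body)
    except ValueError:  # malformed JSON or undecodable bytes
        return 400
    if isinstance(events, dict):
        events = [events]  # tolerate a single, unbatched event
    if not isinstance(events, list):
        return 400
    try:
        for event in events:
            process_event(event)
    except Exception:
        # Any failure is reported for the delivery as a whole, so it is
        # counted as failed rather than silently dropped.
        return 500
    return 200
```

Because a failure is reported for the whole array, `process_event` would need to tolerate seeing an event again if the same batch is redelivered.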


Source artifacts

To determine the Input Schema type for all available Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']

To retrieve the Resource ID of existing private endpoints for Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']['privateEndpoint']['id']

To identify Public Network Access status for all available Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']

To identify Firewall Rules for all public Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']

To identify Firewall Rules for all public Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']

To retrieve the Resource ID of existing private endpoints for Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']['privateEndpoint']['id']

To determine the Input Schema type for all available Event Grid domains, use the following query:

Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']

To identify Public Network Access status for all available Event Grid topics, use the following query:

Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']

Next step

Event Hubs and reliability

12/16/2022 • 4 minutes to read

Azure Event Hubs is a scalable event processing service that ingests and processes large volumes of events and data, with low latency and high reliability. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching and storage adapters.

For more information about using Event Hubs, reference the Azure Event Hubs documentation to learn how to use Event Hubs to ingest millions of events per second from connected devices and applications.

To understand how using Event Hubs creates a more reliable workload, reference Azure Event Hubs - Geo-disaster recovery.

The following sections are specific to Azure Event Hubs and reliability:

- Design considerations
- Configuration checklist
- Recommended configuration options
- Source artifacts

Design considerations

Azure Event Hubs provides an uptime SLA. For more information, reference SLA for Event Hubs.

Checklist

Have you configured Azure Event Hubs with reliability in mind?

- Create SendOnly and ListenOnly policies for the event publisher and consumer, respectively.
- When using the SDK to send events to Event Hubs, ensure the exceptions thrown by the retry policy (EventHubsException or OperationCanceledException) are properly caught.
- In high-throughput scenarios, use batched events.
- Every consumer can read events from one to 32 partitions.
- When developing new applications, use EventProcessorClient (.NET and Java) or EventHubConsumerClient (Python and JavaScript) as the client SDK.
- As part of your solution-wide availability and disaster recovery strategy, consider enabling the Event Hubs geo disaster-recovery option.
- When a solution has a large number of independent event publishers, consider using Event Publishers for fine-grained access control.
- Don't publish events to a specific partition.
- When publishing events frequently, use the AMQP protocol when possible.
- The number of partitions reflects the degree of downstream parallelism you can achieve.
- Ensure each consuming application uses a separate consumer group and only one active receiver per consumer group is in place.
- When using the Capture feature, carefully consider the configuration of the time window and file size, especially with low event volumes.

Configuration recommendations

Consider the following recommendations to optimize reliability when configuring Azure Event Hubs:

Recommendation: When using the SDK to send events to Event Hubs, ensure the exceptions thrown by the retry policy (EventHubsException or OperationCanceledException) are properly caught.
Description: When using HTTPS, ensure a proper retry pattern is implemented.

Recommendation: In high-throughput scenarios, use batched events.
Description: The service will deliver a JSON array with multiple events to the subscribers, instead of an array with one event. The consuming application must process these arrays.

Recommendation: Every consumer can read events from one to 32 partitions.
Description: To achieve maximum scale on the side of the consuming application, every consumer should read from a single partition.

Recommendation: When developing new applications, use EventProcessorClient (.NET and Java) or EventHubConsumerClient (Python and JavaScript) as the client SDK.
Description: EventProcessorHost has been deprecated.

Recommendation: As part of your solution-wide availability and disaster recovery strategy, consider enabling the Event Hubs geo disaster-recovery option.
Description: This option allows the creation of a secondary namespace in a different region. Only the active namespace receives messages at any time. Messages and events aren't replicated to the secondary region. The RTO for the regional failover is up to 30 minutes. Confirm this RTO aligns with the requirements of the customer and fits in the broader availability strategy. If a shorter RTO is required, consider implementing a client-side failover pattern.

Recommendation: When a solution has a large number of independent event publishers, consider using Event Publishers for fine-grained access control.
Description: Event Publishers automatically set the partition key to the publisher name, so this feature should only be used if the events originate from all publishers evenly.

Recommendation: Don't publish events to a specific partition.
Description: If ordering events is essential, implement ordering downstream or use a different messaging service instead.

Recommendation: When publishing events frequently, use the AMQP protocol when possible.
Description: AMQP has higher network costs when initializing the session, but HTTPS requires TLS overhead for every request. AMQP has higher performance for frequent publishers.

Recommendation: The number of partitions reflects the degree of downstream parallelism you can achieve.
Description: For maximum throughput, use the maximum number of partitions (32) when creating the Event Hub. The maximum number of partitions will allow you to scale up to 32 concurrent processing entities and will offer the highest send and receive availability.

Recommendation: When using the Capture feature, carefully consider the configuration of the time window and file size, especially with low event volumes.
Description: Data Lake will charge for a minimum file size for storage (Gen1) or a minimum transaction size (Gen2). If you set the time window so low that the file hasn't reached the minimum size, you'll incur extra cost.
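The recommendation to implement a proper retry pattern when sending over HTTPS does not prescribe a specific algorithm. One common choice, shown here as a minimal Python sketch, is exponential backoff with jitter. This is an assumption about what a "proper" pattern could look like, not the Event Hubs SDK's retry policy: `send_with_retry`, `TransientSendError`, and all default values are hypothetical, and `send` stands for whatever callable performs the actual HTTPS request.

```python
import random
import time


class TransientSendError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def send_with_retry(send, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    The wait before each retry is drawn uniformly from
    [0, min(max_delay, base_delay * 2**(attempt - 1))], so the upper
    bound doubles with every failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TransientSendError:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors the caller classifies as transient are retried; anything else propagates immediately, which keeps permanent failures such as authorization errors from being retried pointlessly.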

Source artifacts

To find Event Hubs namespaces with Basic SKU, use the following query:

Resources
| where type == 'microsoft.eventhub/namespaces'
| where sku.name == 'Basic'
| project resourceGroup, name, sku.name

Next step

Event Hubs and operational excellence

12/16/2022 • 4 minutes to read • Edit Online

Design considerations

Checklist

Azure Event Hubs is a scalable event processing service that ingests and processes large volumes of events and

data, with low latency and high reliability. It can receive and process millions of events per second. Data sent to

an event hub can be transformed and stored by using any real-time analytics provider or batching and storage

adapters.

For more information about using Event Hubs, reference the Azure Event Hubs documentation to learn how to

use Event Hubs to ingest millions of events per second from connected devices and applications.

To understand ways using Event Hubs helps you achieve operational excellence for your workload, reference the

following articles:

Monitor Azure Event Hubs

Stream Azure Diagnostics data using Event Hubs

Scaling with Event Hubs

The following sections are specific to Azure Event Hubs and operational excellence:

Design considerations

Configuration checklist

Recommended configuration options

Source artifacts

Azure Event Hubs provides an uptime SLA. For more information, reference SLA for Event Hubs.

Have you configured Azure Event Hubs with operational excellence in mind?Have you configured Azure Event Hubs with operational excellence in mind?

Create SendOnly and ListenOnly policies for the event publisher and consumer, respectively.

When using the SDK to send events to Event Hubs, ensure the exceptions thrown by the retry policy (

EventHubsException or OperationCancelledException ) are properly caught.

In high-throughput scenarios, use batched events.

Every consumer can read events from one to 32 partitions.

When developing new applications, use EventProcessorClient (.NET and Java) or EventHubConsumerClient

(Python and JavaScript) as the client SDK.

As part of your solution-wide availability and disaster recovery strategy, consider enabling the Event Hubs

geo disaster-recovery option.

When a solution has a large number of independent event publishers, consider using Event Publishers for

fine-grained access control.

Don't publish events to a specific partition.

When publishing events frequently, use the AMQP protocol when possible.

The number of partitions reflect the degree of downstream parallelism you can achieve.

Ensure each consuming application uses a separate consumer group and only one active receiver per

consumer group is in place.

Configuration recommendations

REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N

When using the SDK to send events to Event Hubs, ensure

the exceptions thrown by the retry policy (

EventHubsException or OperationCancelledException )

are properly caught.

When using HTTPS , ensure a proper retry pattern is

implemented.

In high-throughput scenarios, use batched events. The service will deliver a json array with multiple events to

the subscribers, instead of an array with one event. The

consuming application must process these arrays.

Every consumer can read events from one to 32 partitions. To achieve maximum scale on the side of the consuming

application, every consumer should read from a single

partition.

When developing new applications, use

EventProcessorClient (.NET and Java) or

EventHubConsumerClient (Python and JavaScript) as the

client SDK.

EventProcessorHost has been deprecated.

As part of your solution-wide availability and disaster

recovery strategy, consider enabling the Event Hubs geo

disaster-recovery option.

This option allows the creation of a secondary namespace in

a different region. Only the active namespace receives

messages at any time. Messages and events aren't replicated

to the secondary region. The RTO for the regional failover is

up to 30 minutes

. Confirm this RTO aligns with the

requirements of the customer and fits in the broader

availability strategy. If a higher RTO is required, consider

implementing a client-side failover pattern.

When a solution has a large number of independent event

publishers, consider using Event Publishers for fine-grained

access control.

Event Publishers automatically set the partition key to the

publisher name, so this feature should only be used if the

events originate from all publishers evenly.

Don't publish events to a specific partition. If ordering events is essential, implement ordering

downstream or use a different messaging service instead.

When publishing events frequently, use the AMQP protocol

when possible.

AMQP has higher network costs when initializing the

session, but HTTPS requires TLS overhead for every

request. AMQP has higher performance for frequent

publishers.

The number of partitions reflects the degree of downstream parallelism you can achieve.

For maximum throughput, use the maximum number of

partitions ( 32 ) when creating the Event Hub. The maximum

number of partitions will allow you to scale up to 32

concurrent processing entities and will offer the highest send

and receive availability.

When using the Capture feature, carefully consider the

configuration of the time window and file size, especially with

low event volumes.

Data Lake will charge for minimal file size for storage (gen1)

or minimal transaction size (gen2). If you set the time

window so low that the file hasn't reached minimum size,

you'll incur extra cost.
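The Capture trade-off can be estimated with simple arithmetic before choosing a time window. This sketch assumes a steady event rate and ignores file-format overhead. The parameter `min_billed_bytes` is a placeholder for whatever minimum applies to your storage tier and is not a value taken from this guidance.

```python
def expected_capture_file_bytes(events_per_second: float,
                                avg_event_bytes: int,
                                window_seconds: int) -> float:
    """Approximate payload volume written per capture window."""
    return events_per_second * avg_event_bytes * window_seconds


def window_is_too_short(events_per_second: float,
                        avg_event_bytes: int,
                        window_seconds: int,
                        min_billed_bytes: int) -> bool:
    """True when a window closes before the file reaches the billed minimum."""
    size = expected_capture_file_bytes(
        events_per_second, avg_event_bytes, window_seconds)
    return size < min_billed_bytes
```

For example, 2 events per second at 500 bytes each produce about 60,000 bytes in a 60-second window, so any billed minimum above that figure would mean paying for unused space.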

When using the Capture feature, carefully consider the configuration of the time window and file size,

especially with low event volumes.

Consider the following recommendations to optimize reliability when configuring Azure Event Hubs:

Source artifacts

To find Event Hubs namespaces with Basic SKU, use the following query:

Resources
| where type == 'microsoft.eventhub/namespaces'
| where sku.name == 'Basic'
| project resourceGroup, name, sku.name

Next step

Service Bus and reliability

12/16/2022 • 5 minutes to read

Design considerations

Fully manage enterprise message brokering with message queues and publish-subscribe topics using Azure Service Bus. This service stores messages in a broker (for example, a queue) until the consuming party is ready to receive the messages.

Benefits include:

Load-balancing across competing workers.

Safely routing and transferring data and control across service and application boundaries.

Coordinating transactional work that requires a high degree of reliability.

For more information about using Service Bus, reference Azure Service Bus Messaging. Learn how to set up

messaging that connects applications and services across on-premises and cloud environments.

To understand how Service Bus contributes to a reliable workload, reference the following topics:

Asynchronous messaging patterns and high availability

Azure Service Bus Geo-disaster recovery

Handling outages and disasters

The following sections are specific to Azure Service Bus and reliability:

Design considerations

Configuration checklist

Recommended configuration options

Source artifacts

Maximize reliability with an Azure Service Bus uptime SLA. Properly configured applications can send or receive

messages, or do other operations on a deployed Queue or Topic. For more information, reference the Service

Bus SLA.

Other design considerations include:

Express Entities

Partitioned queues and topics

Besides the documentation on Service Bus Premium and Standard messaging tiers, the following features are

only available on the Premium Stock Keeping Unit (SKU):

Dedicated resources.

Virtual network integration: Limits the networks that can connect to the Service Bus instance. Requires Service Endpoints to be enabled on the subnet. There are Trusted Microsoft services that are not supported when implementing Virtual Networks (for example, integration with Event Grid). For more information, reference Allow access to Azure Service Bus namespace from specific virtual networks.

Private endpoints.

IP Filtering/Firewall: Restrict connections to only defined IPv4 addresses or IPv4 address ranges.

Availability zones: Provides enhanced availability by spreading replicas across availability zones within one

region at no extra cost.

Checklist

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Evaluate Premium-tier benefits of Azure Service Bus. Consider migrating to the Premium tier of Service Bus to take advantage of platform-supported outage and disaster protection.

Connect to Service Bus with the AMQP protocol and use

Service Endpoints or Private Endpoints when possible.

This recommendation keeps traffic on the Azure Backbone. Note: The default connection protocol for Microsoft.Azure.ServiceBus and Windows.Azure.ServiceBus namespaces is AMQP.

Implement geo-replication on the sender and receiver side

to protect against outages and disasters.

Standard tier supports only the implementation of sender

and receiver-side geo-redundancy. An outage or disaster in

an Azure Region could cause downtime for your solution.

Configure Geo-Disaster.
- Active/Active
- Active/Passive
- Paired Namespace (Active/Passive)
Note: The secondary region should preferably be an Azure paired region.

Event Grid integration: Available event types.

Scale messaging units.

Geo-Disaster Recovery (paired namespace).

BYOK (Bring Your Own Key): Azure Service Bus encrypts data at rest and automatically decrypts it when

accessed, but customers can also bring their own customer-managed key.

When deploying Service Bus with Geo-disaster recovery and in availability zones, the Service Level Objective (SLO) increases dramatically, but does not change the uptime SLA.

Have you configured Azure Service Bus with reliability in mind?

Evaluate Premium-tier benefits of Azure Service Bus.

Ensure that Service Bus Messaging Exceptions are handled properly.

Connect to Service Bus with the Advanced Message Queuing Protocol (AMQP) and use Service Endpoints or Private Endpoints when possible.

Review the Best Practices for performance improvements using Service Bus Messaging.

Implement geo-replication on the sender and receiver side to protect against outages and disasters.

Configure Geo-Disaster.

If you need mission-critical messaging with queues and topics, Service Bus Premium is recommended with

Geo-Disaster Recovery.

Configure Zone Redundancy in the Service Bus namespace (only available with Premium tier).

Implement high availability for the Service Bus namespace.

Ensure related messages are delivered in guaranteed order.

Evaluate different Java Messaging Service (JMS) features through the JMS API.

Use .NET Nuget packages to communicate with Service Bus messaging entities.

Implement resilience for transient fault handling when sending or receiving messages.

Consider the following recommendations to optimize reliability when configuring Azure Service Bus:

If you need mission-critical messaging with queues and

topics, Service Bus Premium is recommended with Geo-

Disaster Recovery.

Choosing the pattern is dependent on the business

requirements and the recovery time objective (RTO).

Configure Zone Redundancy in the Service Bus namespace (only available with Premium tier).

Zone Redundancy includes three copies of the messaging store. One zone is allocated as the primary messaging store and the other zones are allocated as secondaries. If the primary zone becomes unavailable, a secondary is promoted to primary with no perceivable downtime. Availability Zones are available in a subset of Azure Regions with new regions added regularly.

Implement high availability for the Service Bus namespace. Premium tier supports Geo-disaster recovery and replication

at the namespace level. At this level, Premium tier provides

high availability for metadata disaster recovery using

primary and secondary disaster recovery namespaces.

Ensure related messages are delivered in guaranteed order. Be aware of the requirement to set a Partition Key, Session ID, or Message ID on each message to ensure related messages are sent to the same partition in the messaging entity.
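The reason a shared key keeps related messages together can be illustrated with a stable hash. This is a sketch only: SHA-256 is a stand-in chosen for illustration and is not the algorithm Service Bus uses internally, and `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition index in a stable, repeatable way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % partition_count
```

Because the mapping depends only on the key, every message that carries the same key lands on the same partition, which is what makes ordered processing of related messages possible.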

Evaluate different JMS features through the JMS API. Features available through the JMS 2.0 API (and its Software

Development Kit (SDK)) are not the same as the features

available through the native SDK. For example, Service Bus

Sessions are not available in JMS.

Implement resilience for transient fault handling when

sending or receiving messages.

It is essential to implement suitable transient fault handling

and error handling for send and receive operations to

maintain throughput and to prevent message loss.
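One common way to handle transient faults is to retry with exponential backoff and jitter. The sketch below shows the pattern in isolation. `TransientError` is a placeholder exception, the default delays are arbitrary, and SDK clients typically provide their own configurable retry options that should be preferred where available.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for an error that is safe to retry."""


def with_retries(operation, max_attempts=5, base_delay=0.5,
                 max_delay=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Give up and surface the last error to the caller.
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors known to be transient should be retried. Permanent failures, such as an authorization error, should fail immediately so that messages are not silently delayed.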


Source artifacts

To identify premium Service Bus Instances that are not using private endpoints, use the following query:

Resources
| where type == 'microsoft.servicebus/namespaces'
| where sku.tier == 'Premium' and isempty(properties.privateEndpointConnections)

To identify Service Bus Instances that are not on the premium tier, use the following query:

Resources
| where type == 'microsoft.servicebus/namespaces'
| where sku.tier != 'Premium'

To identify premium Service Bus Instances that are not zone redundant, use the following query:

Resources
| where type == 'microsoft.servicebus/namespaces'
| where sku.tier == 'Premium' and properties.zoneRedundant == 'false'

Next step

Service Bus and operational excellence

12/16/2022 • 3 minutes to read

Design considerations

Fully manage enterprise message brokering with message queues and publish-subscribe topics using Azure Service Bus. This service stores messages in a broker (for example, a queue) until the consuming party is ready to receive the messages.

Benefits include:

Load-balancing work across competing workers.

Safely routing and transferring data and control across service and application boundaries.

Coordinating transactional work that requires a high degree of reliability.

For more information about using Service Bus, reference Azure Service Bus Messaging. Learn how to set up

messaging that connects applications and services across on-premises and cloud environments.

To understand how Service Bus promotes operational excellence, reference the following topics:

Handling outages and disasters

Throttling operations on Azure Service Bus

The following sections are specific to Azure Service Bus and operational excellence:

Design considerations

Configuration checklist

Recommended configuration options

Source artifacts

Maximize reliability with an Azure Service Bus uptime Service Level Agreement (SLA). Properly configured

applications can send or receive messages, or do other operations on a deployed Queue or Topic. For more

information, reference the Service Bus SLA.

Other design considerations include:

Express Entities

Partitioned queues and topics

Besides the documentation on Service Bus Premium and Standard messaging tiers, the following features are

only available on the Premium Stock Keeping Unit (SKU):

Dedicated resources.

Virtual network integration: Limits the networks that can connect to the Service Bus instance. Requires

Service Endpoints to be enabled on the subnet. There are Trusted Microsoft services that are not supported

when implementing Virtual Networks (for example, integration with Event Grid). For more information,

reference Allow access to Azure Service Bus namespace from specific virtual networks.

Private endpoints.

IP Filtering/Firewall: Restrict connections to only defined IPv4 addresses or IPv4 address ranges.

Availability zones: Provides enhanced availability by spreading replicas across availability zones within one

region at no extra cost.

Event Grid integration: Available event types.

Checklist

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Connect to Service Bus with the AMQP protocol and use

Service Endpoints or Private Endpoints when possible.

This recommendation keeps traffic on the Azure Backbone. Note: The default connection protocol for Microsoft.Azure.ServiceBus and Windows.Azure.ServiceBus namespaces is AMQP.

Establish a process to actively monitor the dead-letter queue

(dlq) messages.

The dead-letter queue holds messages that cannot be

processed or cannot be delivered to any receiver. It is

important to monitor this queue to examine the issue cause,

apply required corrections, and to resubmit messages.

Analyze the differences between Azure Storage Queues and

Azure Service Bus Queues.

You will find that Azure Service Bus Messaging Entities are

more advanced, reliable, and feature-rich than Azure Storage

Queues. If your requirement is for simple queue messaging

without requirements for reliable messaging, then Azure

Storage Queues may be a more suitable option.

Source artifacts

Scale messaging units.

Geo-Disaster Recovery (paired namespace).

BYOK (Bring Your Own Key): Azure Service Bus encrypts data at rest and automatically decrypts it when

accessed, but customers can also bring their own customer-managed key.

When deploying Service Bus with Geo-disaster recovery and in availability zones, the Service Level Objective

(SLO) increases dramatically, but does not change the uptime SLA.

Have you configured Azure Service Bus with operational excellence in mind?

Ensure that Service Bus Messaging Exceptions are handled properly.

Connect to Service Bus with the Advanced Message Queuing Protocol (AMQP) and use Service Endpoints or

Private Endpoints when possible.

Establish a process to actively monitor the dead-letter queue (dlq) messages.

Review the Best Practices for performance improvements using Service Bus Messaging.

Analyze the differences between Azure Storage Queues and Azure Service Bus Queues.

Consider the following recommendation to optimize reliability when configuring Azure Service Bus:

To identify premium Service Bus Instances that aren't using private endpoints, use the following query:

Resources
| where type == 'microsoft.servicebus/namespaces'
| where sku.tier == 'Premium' and isempty(properties.privateEndpointConnections)

To identify Service Bus Instances that are not on the premium tier, use the following query:

Resources
| where type == 'microsoft.servicebus/namespaces'
| where sku.tier != 'Premium'

Next step

Queue Storage and reliability

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Configuration recommendations

RECOMMENDATION    DESCRIPTION

For an SLA increase, use geo-redundant storage. Use geo-redundant storage with read access and configure

the client application to fail over to secondary read

endpoints if the primary endpoints fail to respond. This

consideration should be part of the overall reliability strategy

of your solution.
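A minimal sketch of the read fallback described above, assuming read-access geo-redundant storage is enabled on the account. The secondary endpoint follows the `<account>-secondary` host naming convention. The `fetch` callable is a hypothetical stand-in for whatever client performs the read, and treating `OSError` as the failure signal is an assumption.

```python
def queue_endpoints(account: str) -> list[str]:
    """Return the primary and secondary queue endpoints, in try order."""
    primary = f"https://{account}.queue.core.windows.net"
    secondary = f"https://{account}-secondary.queue.core.windows.net"
    return [primary, secondary]


def read_with_fallback(account: str, fetch):
    """Read from the primary endpoint, falling back to the secondary."""
    for endpoint in queue_endpoints(account):
        try:
            return fetch(endpoint)
        except OSError:
            # Covers connection failures and timeouts; try the next endpoint.
            continue
    raise RuntimeError("primary and secondary endpoints both failed")
```

The secondary endpoint is read-only and may lag behind the primary, so this fallback suits peek-style reads rather than operations that modify the queue.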

Azure Queue Storage is a service for storing large numbers of messages that you can access from anywhere in

the world through authenticated calls using HTTP or HTTPS . Queues are commonly used to create a backlog of

work to process asynchronously.

For more information about Queue Storage, reference What is Azure Queue Storage?

To understand how Azure Queue Storage helps maintain a reliable workload, reference the following topics:

Azure Storage redundancy

Disaster recovery and storage account failover

The following sections are specific to Azure Queue Storage and reliability:

Design considerations

Configuration checklist

Recommended configuration options

Source artifacts

Azure Queue Storage follows the SLA statements of the general Storage Account service.

Have you configured Azure Queue Storage with reliability in mind?

Since Storage Queues are a part of the Azure Storage service, refer to the Storage Accounts configuration

checklist and recommendations for reliability.

Ensure that all clients accessing the storage account implement a proper retry policy.

Refer to the Storage guidance for specifics on data recovery for storage accounts.

For an SLA increase, use geo-redundant storage.

Use geo-zone-redundant storage (GZRS) or read-access geo-zone-redundant storage (RA-GZRS) for

durability and protection against failover if an entire data center becomes unavailable.

Consider the following recommendations to optimize reliability when configuring your Azure Queue Storage:

Use geo-zone-redundant storage (GZRS) or read-access

geo-zone-redundant storage (RA-GZRS) for durability and

protection against failover if an entire data center becomes

unavailable.

For more information, reference Azure Storage redundancy.


Source artifacts

To identify storage accounts using locally redundant storage (LRS), use the following query:

Resources
| where type == 'microsoft.storage/storageaccounts' and sku.name =~ 'Standard_LRS'

To identify storage accounts using V1 storage accounts, use the following query:

Resources
| where type == 'microsoft.storage/storageaccounts' and kind == 'Storage'

Next step

Queue Storage and operational excellence

12/16/2022 • 2 minutes to read

Azure Queue Storage is a service for storing large numbers of messages that you can access from anywhere in the world through authenticated calls using HTTP or HTTPS. Queues are commonly used to create a backlog of work to process asynchronously.

For more information about Queue Storage, reference What is Azure Queue Storage?

To understand how Azure Queue Storage promotes operational excellence, reference the following topics:

Monitoring Azure Queue Storage

Best practices for monitoring Azure Queue Storage

The following sections are specific to Azure Queue Storage and operational excellence:

Design considerations

Configuration checklist

Source artifacts

Design considerations

Azure Queue Storage follows the SLA statements of the general Storage Account service.

Checklist

Have you configured Azure Queue Storage with operational excellence in mind?

Since Storage Queues are a part of the Azure Storage service, refer to the Storage Accounts configuration checklist and recommendations for operational excellence.

Ensure that all clients accessing the storage account implement a proper retry policy.

Refer to the Storage guidance for specifics on data recovery for storage accounts.

Source artifacts

To identify storage accounts using V1 storage accounts, use the following query:

Resources
| where type == 'microsoft.storage/storageaccounts' and kind == 'Storage'

Next step

IoT Hub and reliability

12/16/2022 • 4 minutes to read

Design considerations

Checklist

Azure IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication

between an IoT application and its attached devices. You can connect millions of devices and their backend

solutions reliably and securely. Almost any device can be connected to an IoT Hub.

IoT Hub supports monitoring to help you track device creation, device connections, and device failures.

IoT Hub also supports the following messaging patterns:

Device-to-cloud telemetry

Uploading files from devices

Request-reply methods to control your devices from the cloud

For more information about IoT Hub, reference IoT Concepts and Azure IoT Hub.

To understand how IoT Hub supports a reliable workload, reference the following topics:

IoT Hub high availability and disaster recovery

How to achieve cross-region High Availability with IoT Hub

How to clone an Azure IoT Hub to another region

The following sections are specific to Azure IoT Hub and reliability:

Design considerations

Configuration checklist

Recommended configuration options

For more information about the Azure IoT Hub Service Level Agreement, reference SLA for Azure IoT Hub.

Have you configured Azure IoT Hub with reliability in mind?

Provision a second IoT Hub in another region and have routing logic on the device.

Use the AMQP or MQTT protocol when sending events frequently.

Use only certificates validated by a root CA in the production environment if you're using X.509 certificates

for the device connection.

For maximum throughput, use the maximum number of partitions ( 32 ) when creating the IoT Hub, if you're

planning to use the built-in endpoint.

For scaling, increase the tier and allocated IoT Hub units instead of adding more than one IoT Hub per region.

In high-throughput scenarios, use batched events.

If you require the minimum possible latency, don't use routing and read the events from the built-in

endpoint.

As part of your solution-wide availability and disaster recovery strategy, consider using the IoT Hub cross-

region Disaster Recovery option.

When reading device telemetry from the built-in Event Hub-compatible endpoint, refer to the Event Hub

consumers recommendation.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Provision a second IoT Hub in another region and have

routing logic on the device.

These configurations can be further enhanced with a

Concierge Service.

Use the AMQP or MQTT protocol when sending events

frequently.

AMQP and MQTT have higher network costs when

initializing the session, however HTTPS requires extra TLS

overhead for every request. AMQP and MQTT have higher

performance for frequent publishers.

Use only certificates validated by a root CA in the

production environment if you're using X.509 certificates for

the device connection.

Make sure you have processes in place to update certificates before they expire.

For maximum throughput, use the maximum number of

partitions ( 32 ) when creating the IoT Hub, if you're

planning to use the built-in endpoint.

The number of device-to-cloud partitions for the Event Hub-compatible endpoint reflects the degree of downstream

parallelism you can achieve. This will allow you to scale up to

32 concurrent processing entities and will offer the highest

send and receive availability. This number can't be changed

after creation.

For scaling, increase the tier and allocated IoT Hub units

instead of adding more than one IoT Hub per region.

Adding more than one IoT Hub per region doesn't offer

extra resiliency because all hubs can run on the same

underlying cluster.

In high-throughput scenarios, use batched events. The service will deliver an array with multiple events to the

consumers, instead of an array with one event. The

consuming application must process these arrays.

If you require the minimum possible latency, don't use

routing and read the events from the built-in endpoint.

When using message routing in IoT Hub, latency of the

message delivery increases. On average, latency shouldn't

exceed 500 ms , but there's no guarantee for the delivery

latency.

As part of your solution-wide availability and disaster

recovery strategy, consider using the IoT Hub cross-region

Disaster Recovery option.

This option will move the IoT Hub endpoint to the paired

Azure region. Only the device registry gets replicated. Events

aren't replicated to the secondary region.

The RTO for the customer-initiated failover is between 10 minutes and a couple of hours. For a Microsoft-initiated failover, the RTO is 2-26 hours. Confirm this RTO aligns with the requirements of the customer and fits in the broader availability strategy. If a lower RTO is required, consider implementing a client-side failover pattern.

When using an SDK to send events to IoT Hub, ensure the

exceptions thrown by the retry policy (

EventHubsException or OperationCancelledException )

are properly caught.

When using HTTPS , implement a proper retry pattern.

When using an SDK to send events to IoT Hubs, ensure the exceptions thrown by the retry policy (

EventHubsException or OperationCancelledException ) are properly caught.

To avoid telemetry interruption due to throttling and a fully used quota, consider adding a custom auto-

scaling solution.
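The decision logic at the core of a custom auto-scaling solution can be quite small. The sketch below assumes the solution already knows the current daily message count and the per-unit daily quota for the chosen tier. Both are passed in as parameters, the `headroom` factor is an arbitrary safety margin, and how the unit count is actually changed is out of scope here.

```python
import math


def units_needed(messages_today: int, quota_per_unit: int,
                 headroom: float = 0.8) -> int:
    """Smallest unit count that keeps usage within `headroom` of the quota."""
    usable_per_unit = quota_per_unit * headroom
    return max(1, math.ceil(messages_today / usable_per_unit))


def should_scale_up(messages_today: int, current_units: int,
                    quota_per_unit: int, headroom: float = 0.8) -> bool:
    """True when the current allocation no longer leaves the safety margin."""
    return units_needed(messages_today, quota_per_unit, headroom) > current_units
```

With an example quota of 400,000 messages per unit per day, 700,000 messages would call for three units at 80% headroom, so a hub running two units should scale up before throttling begins.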

Consider the following recommendations to optimize reliability when configuring Azure IoT Hub:

Next step

IoT Hub and operational excellence

12/16/2022 • 4 minutes to read

Design considerations

Checklist

Azure IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication

between an IoT application and its attached devices. You can connect millions of devices and their backend

solutions reliably and securely. Almost any device can be connected to an IoT Hub.

IoT Hub supports monitoring to help you track device creation, device connections, and device failures.

IoT Hub also supports the following messaging patterns:

Device-to-cloud telemetry

Uploading files from devices

Request-reply methods to control your devices from the cloud

For more information about IoT Hub, reference IoT Concepts and Azure IoT Hub.

To understand how IoT Hub promotes operational excellence, reference the following topics:

Tutorial: Set up and use metrics and logs with an IoT Hub

Monitoring Azure IoT Hub

Trace Azure IoT device-to-cloud messages with distributed tracing (preview)

Check IoT Hub service and resource health

The following sections are specific to Azure IoT Hub and operational excellence:

Design considerations

Configuration checklist

Recommended configuration options

For more information about the Azure IoT Hub Service Level Agreement, reference SLA for Azure IoT Hub.

Have you configured Azure IoT Hub with operational excellence in mind?

Provision a second IoT Hub in another region and have routing logic on the device.

Use the AMQP or MQTT protocol when sending events frequently.

Use only certificates validated by a root CA in the production environment if you're using X.509 certificates

for the device connection.

For maximum throughput, use the maximum number of partitions ( 32 ) when creating the IoT Hub, if you're

planning to use the built-in endpoint.

For scaling, increase the tier and allocated IoT Hub units instead of adding more than one IoT Hub per region.

In high-throughput scenarios, use batched events.

If you require the minimum possible latency, don't use routing and read the events from the built-in

endpoint.

As part of your solution-wide availability and disaster recovery strategy, consider using the IoT Hub cross-

region Disaster Recovery option.

When reading device telemetry from the built-in Event Hub-compatible endpoint, refer to the Event Hub

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Provision a second IoT Hub in another region and have

routing logic on the device.

These configurations can be further enhanced with a

Concierge Service.

Use the AMQP or MQTT protocol when sending events

frequently.

AMQP and MQTT have higher network costs when

initializing the session, however HTTPS requires extra TLS

overhead for every request. AMQP and MQTT have higher

performance for frequent publishers.

Use only certificates validated by a root CA in the

production environment if you're using X.509 certificates for

the device connection.

Make sure you have processes in place to update certificates before they expire.

For maximum throughput, use the maximum number of

partitions ( 32 ) when creating the IoT Hub, if you're

planning to use the built-in endpoint.

The number of device-to-cloud partitions for the Event Hub-compatible endpoint reflects the degree of downstream

parallelism you can achieve. This will allow you to scale up to

32 concurrent processing entities and will offer the highest

send and receive availability. This number can't be changed

after creation.

For scaling, increase the tier and allocated IoT Hub units

instead of adding more than one IoT Hub per region.

Adding more than one IoT Hub per region doesn't offer

extra resiliency because all hubs can run on the same

underlying cluster.

In high-throughput scenarios, use batched events. The service will deliver an array with multiple events to the

consumers, instead of an array with one event. The

consuming application must process these arrays.

If you require the minimum possible latency, don't use

routing and read the events from the built-in endpoint.

When using message routing in IoT Hub, latency of the

message delivery increases. On average, latency shouldn't

exceed 500 ms , but there's no guarantee for the delivery

latency.

As part of your solution-wide availability and disaster

recovery strategy, consider using the IoT Hub cross-region

Disaster Recovery option.

This option will move the IoT Hub endpoint to the paired

Azure region. Only the device registry gets replicated. Events

aren't replicated to the secondary region.

The RTO for the customer-initiated failover is between 10 minutes and a couple of hours. For a Microsoft-initiated failover, the RTO is 2-26 hours. Confirm this RTO aligns with the requirements of the customer and fits in the broader availability strategy. If a lower RTO is required, consider implementing a client-side failover pattern.

consumers recommendation.

When using an SDK to send events to IoT Hubs, ensure the exceptions thrown by the retry policy (

EventHubsException or OperationCancelledException ) are properly caught.

To avoid telemetry interruption due to throttling and a fully used quota, consider adding a custom auto-

scaling solution.

Consider the following recommendations for increasing operational excellence when configuring Azure IoT Hub:

When using an SDK to send events to IoT Hub, ensure the

exceptions thrown by the retry policy (

EventHubsException or OperationCancelledException )

are properly caught.

When using HTTPS , implement a proper retry pattern.


Next step

IoT Hub Device Provisioning Service and reliability

12/16/2022 • 2 minutes to read

Design considerations

Next step

The IoT Hub Device Provisioning Service (DPS) is a helper service for IoT Hub. DPS enables zero-touch, just-in-

time provisioning to the right IoT Hub without requiring human intervention. IoT Hub DPS allows customers to

provision millions of devices in a secure and scalable manner.

For more information, reference What is Azure IoT Hub Device Provisioning Service?

To understand how IoT Hub DPS can increase workload reliability, reference IoT Hub DPS supports Availability

Zones.

For more information about the Service Level Agreement for Azure IoT Hub DPS, reference SLA for Azure IoT

Hub.

IoT Hub Device Provisioning Service and operational excellence


12/16/2022 • 2 minutes to read

Design considerations

Next step

The IoT Hub Device Provisioning Service (DPS) is a helper service for IoT Hub. DPS enables zero-touch, just-in-

time provisioning to the right IoT Hub without requiring human intervention. IoT Hub DPS allows customers to

provision millions of devices in a secure and scalable manner.

For more information, reference What is Azure IoT Hub Device Provisioning Service?

To understand how IoT Hub DPS promotes operational excellence, reference How to manage device enrollments

with Azure portal.

For more information about the Service Level Agreement for Azure IoT Hub DPS, reference SLA for Azure IoT

Hub.

Application Delivery (General) and reliability


12/16/2022 • 3 minutes to read

Design considerations

Checklist

Application Delivery (General) explores key recommendations to deliver internal-facing and external-facing

applications in the Azure network through a secure, highly scalable, and highly available way. General

Application Delivery can be handled using a combination of the following networking services in Azure:

Content Delivery Network (CDN)

Azure Front Door Service

Traffic Manager

Application Gateway

Internet Analyzer

Load Balancer

For more information, reference Azure networking services overview.

The following sections are specific to general Application Delivery and reliability:

Design considerations

Configuration checklist

Recommended configuration options

Application Delivery in Azure includes the following design considerations:

Azure Load Balancer (internal and public) provides high availability for application delivery at a regional

level (Standard tier only).

Azure Traffic Manager allows the delivery of applications through DNS redirection, including traffic using

protocols other than HTTP/S.

Azure Front Door allows the secure delivery of highly available HTTP/S applications across Azure regions.

Azure Application Gateway allows the secure delivery of HTTP/S applications at a regional level.

Have you configured your Application Delivery networking services with reliability in mind?

Use Azure Traffic Manager to deliver global applications that span protocols other than HTTP/S.

When using Azure Front Door and Application Gateway to protect HTTP/S applications, use Web Application

Firewall (WAF) policies in Front Door and lock down Application Gateway to receive traffic only from Azure

Front Door.

Create a separate health endpoint on the backend that the health probe can use. The health endpoint can

aggregate the state of the critical services and dependencies needed to serve requests.

Enable health probes for backends.

Deploy Application Gateway v2 or third-party NVAs used for inbound HTTP/S connections together with the

applications that they're securing.

When doing global load balancing for HTTP/S applications, use Front Door over Traffic Manager.

Use a third-party Network Virtual Appliance (NVA) if Application Gateway v2 can't be used for the security of

HTTP/S applications.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Use Azure Traffic Manager to deliver global applications that

span protocols other than HTTP/S.

Traffic Manager doesn't forward traffic, but only completes

the DNS redirection. The connection from the client is

established directly to the target using any protocol.

When using Azure Front Door and Application Gateway to

protect HTTP/S applications, use WAF policies in Front

Door and lock down Application Gateway to receive traffic

only from Azure Front Door.

Certain scenarios might force a customer to implement rules

specifically on AppGateway. For example, if ModSec CRS

2.2.9, CRS 3.0, or CRS 3.1 rules are required, the rules can

be implemented only on AppGateway. Conversely, rate-limiting

and geo-filtering are available only on Azure Front

Door, not on AppGateway. Instructions on how to lock down

traffic can be found at FAQ for Azure Front Door.

Create a separate health endpoint on the backend that the

health probe can use. The health endpoint can aggregate

the state of the critical services and dependencies needed to

serve requests.

For more information, reference Health Endpoint Monitoring

pattern.

Enable health probes for backends. Health probes are http(s) endpoints that are queried by

the load balancer (Azure Front Door, Traffic Manager,

AppGateway) service to determine if the backend is healthy

enough to handle requests.

Deploy Application Gateway v2 or third-party NVAs used for

inbound HTTP/S connections together with the

applications that they're securing.

Don't centrally manage Application Gateway v2 or third-

party NVAs within the organization and share with other

workloads.

When doing global load balancing for HTTP/S applications,

use Front Door over Traffic Manager.

- Azure Front Door optimizes the number of TCP

connections to the backend when forwarding traffic.

- Changes to the routing configuration, based on backend

health, are instantaneous. With Traffic Manager, traffic will

point to the original backend until a new DNS lookup occurs,

plus potential time for DNS propagation.

- Front Door supports caching on global edge nodes,

negating the need for a separate CDN service.

- Front Door supports Web Application Firewall rules,

negating the need for a separate WAF service.

For secure delivery of HTTP/S applications, ensure you

enable Web Application Firewall (WAF) protection and

policies.

Enable WAF protection and policies in either Application

Gateway or Front Door.

Application delivery for both internal and external facing

applications should be part of the application.

Don't centrally manage Application Delivery within an

organization.

For secure delivery of HTTP/S applications, ensure you enable Web Application Firewall (WAF) protection

and policies.

Application delivery for both internal and external facing applications should be part of the application.

Global HTTP/S applications that span Azure regions should be delivered and protected using Azure Front

Door with Web Application Firewall (WAF) policies.

All public IP addresses in the solution should be protected with a DDoS Standard protection plan.
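
The checklist above recommends a separate health endpoint that aggregates the state of critical services and dependencies. A minimal sketch of that aggregation follows. The function name, the check names, and the choice of HTTP 200 and 503 as the reported statuses are assumptions made for illustration and are not taken from this article; the check functions stand in for real dependency calls.

```python
from typing import Callable, Dict, Tuple


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency report)."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "healthy" if check() else "unhealthy"
        except Exception:
            # A check that raises is treated as a failed dependency.
            report[name] = "unhealthy"
    # Assumption: report 200 only when every dependency is healthy, else 503.
    status = 200 if all(state == "healthy" for state in report.values()) else 503
    return status, report


# Hypothetical dependency checks, used only to exercise the sketch.
def database_ok() -> bool:
    return True


def queue_ok() -> bool:
    raise TimeoutError("queue unreachable")


status, report = aggregate_health({"database": database_ok, "queue": queue_ok})
print(status, report)
```

A health probe configured against an endpoint built this way sees one status that reflects all dependencies, instead of a page that only proves the web server is up.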

Consider the following recommendations to optimize reliability when configuring your Application Delivery

networking services:

Next step

Application Delivery (General) and operational excellence


12/16/2022 • 4 minutes to read

Design considerations

Checklist

Application Delivery (General) explores key recommendations to deliver internal-facing and external-facing

applications in the Azure network through a secure, highly scalable, and highly available way. General

Application Delivery can be handled using a combination of the following networking services in Azure:

Content Delivery Network (CDN)

Azure Front Door Service

Traffic Manager

Application Gateway

Internet Analyzer

Load Balancer

For more information, reference Azure networking services overview.

The following sections are specific to general Application Delivery and operational excellence:

Design considerations

Configuration checklist

Recommended configuration options

Application Delivery in Azure includes the following design considerations:

Azure Load Balancer (internal and public) provides high availability for application delivery at a regional

level (Standard tier only).

Azure Traffic Manager allows the delivery of applications through DNS redirection, including traffic using

protocols other than HTTP/S.

Azure Front Door allows the secure delivery of highly available HTTP/S applications across Azure regions.

Azure Application Gateway allows the secure delivery of HTTP/S applications at a regional level.

Have you configured your Application Delivery networking services with operational excellence in mind?

Use Azure Traffic Manager to deliver global applications that span protocols other than HTTP/S.

When using Azure Front Door and Application Gateway to protect HTTP/S applications, use Web Application

Firewall (WAF) policies in Front Door and lock down Application Gateway to receive traffic only from Azure

Front Door.

Create a separate health endpoint on the backend that the health probe can use. The health endpoint can

aggregate the state of the critical services and dependencies needed to serve requests.

Enable health probes for backends.

Deploy Application Gateway v2 or third-party NVAs used for inbound HTTP/S connections together with the

applications that they're securing.

Configuration recommendations

RECOMMENDATION    DESCRIPTION

Use Azure Traffic Manager to deliver global applications that

span protocols other than HTTP/S.

Traffic Manager doesn't forward traffic, but only completes

the DNS redirection. The connection from the client is

established directly to the target using any protocol.

When using Azure Front Door and Application Gateway to

protect HTTP/S applications, use WAF policies in Front

Door and lock down Application Gateway to receive traffic

only from Azure Front Door.

Certain scenarios might force a customer to implement rules

specifically on AppGateway. For example, if ModSec CRS

2.2.9, CRS 3.0, or CRS 3.1 rules are required, the rules can

be implemented only on AppGateway. Conversely, rate-limiting

and geo-filtering are available only on Azure Front

Door, not on AppGateway. Instructions on how to lock down

traffic can be found at FAQ for Azure Front Door.

Create a separate health endpoint on the backend that the

health probe can use. The health endpoint can aggregate

the state of the critical services and dependencies needed to

serve requests.

For more information, reference Health Endpoint Monitoring

pattern.

Enable health probes for backends. Health probes are http(s) endpoints that are queried by

the load balancer (Azure Front Door, Traffic Manager,

AppGateway) service to determine if the backend is healthy

enough to handle requests.

Deploy Application Gateway v2 or third-party NVAs used for

inbound HTTP/S connections together with the

applications that they're securing.

Don't centrally manage Application Gateway v2 or third-

party NVAs within the organization and share with other

workloads.

When doing global load balancing for HTTP/S applications,

use Front Door over Traffic Manager.

- Azure Front Door optimizes the number of TCP

connections to the backend when forwarding traffic.

- Changes to the routing configuration, based on backend

health, are instantaneous. With Traffic Manager, traffic will

point to the original backend until a new DNS lookup occurs,

plus potential time for DNS propagation.

- Front Door supports caching on global edge nodes,

negating the need for a separate CDN service.

- Front Door supports Web Application Firewall rules,

negating the need for a separate WAF service.

When doing global load balancing for HTTP/S applications, use Front Door over Traffic Manager.

Use a third-party Network Virtual Appliance (NVA) if Application Gateway v2 can't be used for the security of

HTTP/S applications.

For secure delivery of HTTP/S applications, ensure you enable Web Application Firewall (WAF) protection

and policies.

Application delivery for both internal and external facing applications should be part of the application.

Global HTTP/S applications that span Azure regions should be delivered and protected using Azure Front

Door with Web Application Firewall (WAF) policies.

All public IP addresses in the solution should be protected with a DDoS Standard protection plan.

Consider the following recommendations for operational excellence when configuring your Application Delivery

networking services:

For secure delivery of HTTP/S applications, ensure you

enable Web Application Firewall (WAF) protection and

policies.

Enable WAF protection and policies in either Application

Gateway or Front Door.

Application delivery for both internal and external facing

applications should be part of the application.

Don't centrally manage Application Delivery within an

organization.


Next step

Azure Well-Architected Framework review - Azure Application Gateway v2


12/16/2022 • 16 minutes to read

Prerequisites

Reliability

Design checklist

Recommendations

This article provides architectural best practices for the Azure Application Gateway v2 family of SKUs. The

guidance is based on the five pillars of architecture excellence:

Reliability

Security

Cost optimization

Operational excellence

Performance efficiency

We assume that you have working knowledge of Azure Application Gateway and are well versed with v2 SKU

features. For more information, see Azure Application Gateway features.

Understanding the Well-Architected Framework pillars can help produce a high-quality, stable, and efficient

cloud architecture. We recommend that you review your workload by using the Azure Well-Architected

Framework Review assessment.

Use a reference architecture to review the considerations based on the guidance provided in this article. We

recommend that you start with Protect APIs with Application Gateway and API Management and IaaS: Web

application with relational database.

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to

minimize the effects of a single failing component. Use the following information to minimize failed instances.

As you make design choices for Application Gateway, review the Reliability design principles.

Deploy the instances in a zone-aware configuration, where available.

Use Application Gateway with Web Application Firewall (WAF) within an application virtual network to

protect inbound HTTP/S traffic from the Internet.

In new deployments, use Application Gateway v2 unless there is a compelling reason to use v1.

Plan for rule updates

Use health probes to detect backend unavailability

Review the impact of the interval and threshold settings on health probes

Verify downstream dependencies through health endpoints

Explore the following table of recommendations to optimize your Application Gateway configuration for

Reliability.

RECOMMENDATION    BENEFIT

Plan for rule updates Plan enough time for updates before accessing Application

Gateway or making further changes. For example, removing

servers from backend pool might take some time because

they have to drain existing connections.

Use health probes to detect backend unavailability If Application Gateway is used to load balance incoming

traffic over multiple backend instances, we recommend the

use of health probes. These will ensure that traffic is not

routed to backends that are unable to handle the traffic.

Review the impact of the interval and threshold settings on

health probes

The health probe sends requests to the configured endpoint

at a set interval. Also, there's a threshold of failed requests

that will be tolerated before the backend is marked

unhealthy. These numbers present a trade-off.

- Setting a shorter interval (more frequent probes) puts a

higher load on your service. Each Application Gateway

instance sends its own health probes, so 100 instances

probing every 30 seconds means 100 requests per 30 seconds.

- Setting a longer interval leaves more time before an outage

is detected.

- Setting a low unhealthy threshold may mean that short,

transient failures take down a backend.

- Setting a high threshold means it can take longer to take a

backend out of rotation.

Verify downstream dependencies through health endpoints Suppose each backend has its own dependencies to ensure

failures are isolated. For example, an application hosted

behind Application Gateway may have multiple backends,

each connected to a different database (replica). When such

a dependency fails, the application may be working but

won't return valid results. For that reason, the health

endpoint should ideally validate all dependencies. Keep in

mind that if each call to the health endpoint has a direct

dependency call, that database would receive 100 queries

every 30 seconds instead of 1. To avoid this, the health

endpoint should cache the state of the dependencies for a

short period of time.

When using Azure Front Door and Application Gateway to

protect HTTP/S applications, use WAF policies in Front

Door and lock down Application Gateway to receive traffic

only from Azure Front Door.

Certain scenarios can force you to implement rules

specifically on Application Gateway. For example, if ModSec

CRS 2.2.9, CRS 3.0 or CRS 3.1 rules are required, these rules

can be only implemented on Application Gateway.

Conversely, rate-limiting and geo-filtering are available only

on Azure Front Door, not on AppGateway.
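
The table above describes two related ideas: the trade-off between probe interval and unhealthy threshold, and caching dependency state so that each probe doesn't trigger its own dependency call. The sketch below illustrates both under stated assumptions. The class and function names are hypothetical, the detection-time formula (interval multiplied by threshold) is a rough approximation and not a documented Application Gateway guarantee, and the injected clock exists only to make the example deterministic.

```python
import time
from typing import Callable, List, Optional


class CachedHealthCheck:
    """Caches the result of a dependency check for a short time-to-live."""

    def __init__(self, check: Callable[[], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[bool] = None
        self._expires_at = 0.0

    def is_healthy(self) -> bool:
        now = self._clock()
        if self._cached is None or now >= self._expires_at:
            try:
                self._cached = bool(self._check())
            except Exception:
                # A failing check counts as an unhealthy dependency.
                self._cached = False
            self._expires_at = now + self._ttl
        return self._cached


def probes_per_second(gateway_instances: int, interval_seconds: float) -> float:
    """Probe load on one backend: every gateway instance probes once per interval."""
    return gateway_instances / interval_seconds


def approx_detection_seconds(interval_seconds: float, unhealthy_threshold: int) -> float:
    """Rough time to mark a backend unhealthy (assumption: interval x threshold)."""
    return interval_seconds * unhealthy_threshold


# Hypothetical dependency call that records how often it is invoked.
call_log: List[int] = []


def query_database() -> bool:
    call_log.append(1)
    return True


fake_now = [0.0]  # Injected clock so the example does not sleep.
cached = CachedHealthCheck(query_database, ttl_seconds=30, clock=lambda: fake_now[0])

for _ in range(100):          # 100 probes arrive within the same 30-second window.
    cached.is_healthy()
calls_within_ttl = len(call_log)

fake_now[0] = 31.0            # The cache entry has expired.
cached.is_healthy()
calls_after_expiry = len(call_log)

print(calls_within_ttl, calls_after_expiry)
```

With the cache in place, the 100 probes in one window reach the database once, and the next window triggers one more query.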

Security

Design checklist

Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure

Advisor recommendations.

Security is one of the most important aspects of any architecture. Application Gateway provides features to

employ both the principle of least privilege and defense-in-depth. We recommend you review the Security

design principles.

Set up a TLS policy for enhanced security

Recommendations

RECOMMENDATION    BENEFIT

Set up a TLS policy for enhanced security Set up a TLS policy for extra security. Ensure you're using the

latest TLS policy version (AppGwSslPolicy20170401S). This

enforces TLS 1.2 and stronger ciphers.

Use AppGateway for TLS termination There are advantages of using Application Gateway for TLS

termination:

- Performance improves because requests going to different

backends don't have to re-authenticate to each backend.

- Better utilization of backend servers because they don't

have to perform TLS processing

- Intelligent routing by accessing the request content.

- Easier certificate management because the certificate only

needs to be installed on Application Gateway.

Use Azure Key Vault to store TLS certificates Application Gateway is integrated with Key Vault. This

provides stronger security, easier separation of roles and

responsibilities, support for managed certificates, and an

easier certificate renewal and rotation process.

When re-encrypting backend traffic, ensure the backend

server certificate contains both the root and intermediate

Certificate Authorities (CAs)

A TLS certificate of the backend server must be issued by a

well-known CA. If the certificate was not issued by a trusted

CA, the Application Gateway checks if the certificate of the

issuing CA was issued by a trusted CA, and so on until

a trusted CA is found. Only then is a secure connection

established. Otherwise, Application Gateway marks the

backend as unhealthy.

Use an appropriate DNS server for backend pool resources When the backend pool contains a resolvable FQDN, the

DNS resolution is based on a private DNS zone or custom

DNS server (if configured on the VNet), or it uses the default

Azure-provided DNS.

Comply with all NSG restrictions for Application Gateway NSGs are supported on Application Gateway, but there are

some restrictions. For instance, some communication with

certain port ranges is prohibited. Make sure you understand

the implications of those restrictions. For details, see

Network security groups.

Use AppGateway for TLS termination

Use Azure Key Vault to store TLS certificates

When re-encrypting backend traffic, ensure the backend server certificate contains both the root and

intermediate Certificate Authorities (CAs)

Use an appropriate DNS server for backend pool resources

Comply with all NSG restrictions for Application Gateway

Refrain from using UDRs to the backend subnet

Be aware of Application Gateway capacity changes when enabling WAF

Explore the following table of recommendations to optimize your Application Gateway configuration for

Security.

Refrain from using UDRs on the backend subnet Using User Defined Routes (UDRs) on the Application
Gateway subnet can cause some issues. Health status in the
back-end might be unknown. Application Gateway logs and
metrics might not get generated. We recommend that you
don't use UDRs on the Application Gateway subnet so that
you can view the back-end health, logs, and metrics. If your
organization requires UDRs in the Application Gateway
subnet, ensure you review the supported scenarios.
For more information, see Supported user-defined routes.
Be aware of Application Gateway capacity changes when
enabling WAF
When WAF is enabled, every request must be buffered by
the Application Gateway until it fully arrives. The gateway
then checks whether the request violates any rule in its core
rule set before forwarding the packet to the backend instances. For
large file uploads (30 MB+ in size), this can result in
significant latency. Because Application Gateway capacity
requirements are different with WAF, we do not recommend
enabling WAF on Application Gateway without proper
testing and validation.
  
Policy definitions
 
Cost optimization
  
Design checklist
For more suggestions, see Principles of the security pillar.
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Gateway subnets should not be configured with a network security group. This policy denies if a gateway
subnet is configured with a network security group. Assigning a network security group to a gateway subnet
will cause the gateway to stop functioning.
Web Application Firewall (WAF) should be enabled for Application Gateway. Deploy Azure Web Application
Firewall (WAF) in front of public facing web applications for additional inspection of incoming traffic. Web
Application Firewall (WAF) provides centralized protection of your web applications from common exploits
and vulnerabilities such as SQL injections, Cross-Site Scripting, local and remote file executions. You can also
restrict access to your web applications by countries, IP address ranges, and other http(s) parameters via
custom rules.
Web Application Firewall (WAF) should use the specified mode for Application Gateway. Mandates the use of
'Detection' or 'Prevention' mode to be active on all Web Application Firewall policies for Application Gateway.
Azure DDoS Protection should be enabled. DDoS protection should be enabled for all virtual networks with a
subnet that is part of an application gateway with a public IP.
All built-in policy definitions related to Azure Networking are listed in Built-in policies - Network.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies. We recommend you review the Cost optimization design principles.
Familiarize yourself with Application Gateway pricing
Review underutilized resources
Stop Application Gateway instances that are not in use
Have a scale-in and scale-out policy
Review consumption metrics across different parameters

  
Recommendations
RECOMMENDATION    BENEFIT
Familiarize yourself with Application Gateway pricing For information about Application Gateway pricing, see
Understanding Pricing for Azure Application Gateway and
Web Application Firewall. You can also leverage the Pricing
calculator.
Ensure that the options are adequately sized to meet the
capacity demand and deliver expected performance without
wasting resources.
Review underutilized resources Identify and delete Application Gateway instances with
empty backend pools to avoid unnecessary costs.
Stop Application Gateway instances when not in use You aren't billed when Application Gateway is in the stopped
state. Continuously running Application Gateway instances
can incur extraneous costs. Evaluate usage patterns and stop
instances when you don't need them. For example, usage
after business hours in Dev/Test environments is expected to
be low.
See these articles for information about how to stop and
start instances.
- Stop-AzApplicationGateway
- Start-AzApplicationGateway
Have a scale-in and scale-out policy A scale-out policy ensures that there will be enough
instances to handle incoming traffic and spikes. Also, have a
scale-in policy that makes sure the number of instances is
reduced when demand drops. Consider the choice of
instance size. The size can significantly impact the cost. Some
considerations are described in the Estimate the Application
Gateway instance count.
For more information, see What is Azure Application
Gateway v2?
Review consumption metrics across different parameters You're billed based on metered instances of Application
Gateway based on the metrics tracked by Azure. Evaluate
the various metrics and capacity units and determine the
cost drivers. For more information, see Azure Cost
Management and Billing.
The following metrics are key for Application Gateway. This
information can be used to validate that the provisioned
instance count matches the amount of incoming traffic.
- Estimated Billed Capacity Units
- Fixed Billable Capacity Units
- Current Capacity Units
For more information, see Application Gateway metrics.
Make sure you account for bandwidth costs. For details, see
Traffic across billing zones and regions.
Explore the following table of recommendations to optimize your Application Gateway configuration for Cost
optimization.
For more suggestions, see Principles of the cost optimization pillar.

Operational excellence

Design checklist

Recommendations

RECOMMENDATION    BENEFIT

Monitor capacity metrics Use these metrics as indicators of utilization of the

provisioned Application Gateway capacity. We strongly

recommend setting up alerts on capacity. For details, see

Application Gateway high traffic support.

Troubleshoot using metrics There are other metrics that can indicate issues either at

Application Gateway or the backend. We recommend

evaluating the following alerts:

- Unhealthy Host Count

- Response Status (dimension 4xx and 5xx)

- Backend Response Status (dimension 4xx and 5xx)

- Backend Last Byte Response Time

- Application Gateway Total Time

For more information, see Metrics for Application Gateway.

Enable diagnostics on Application Gateway and Web

Application Firewall (WAF)

Diagnostic logs allow you to view firewall logs, performance

logs, and access logs. Use these logs to manage and

troubleshoot issues with Application Gateway instances. For

more information, see Back-end health and diagnostic logs

for Application Gateway.

Use Azure Monitor Network Insights Azure Monitor Network Insights provides a comprehensive

view of health and metrics for network resources, including

Application Gateway. For additional details and supported

capabilities for Application Gateway, see Azure Monitor

Network insights.

Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure

Advisor recommendations.

Monitoring and diagnostics are crucial. Not only can you measure performance statistics, but you can also use

metrics to troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design

principles.

Monitor capacity metrics

Enable diagnostics on Application Gateway and Web Application Firewall (WAF)

Use Azure Monitor Network Insights

Match timeout settings with the backend application

Monitor Key Vault configuration issues using Azure Advisor

Configure and monitor SNAT port limitations

Consider SNAT port limitations in your design

Explore the following table of recommendations to optimize your Application Gateway configuration for

Operational excellence.

Match timeout settings with the backend application Ensure you have configured the IdleTimeout settings to

match the listener and traffic characteristics of the backend

application. The default value is set to four minutes and can

be configured to a maximum of 30. For more information,

see Load Balancer TCP Reset and Idle Timeout.

For workload considerations, see Application Monitoring.

Monitor Key Vault configuration issues using Azure Advisor Application Gateway checks for the renewed certificate

version in the linked Key Vault at every 4-hour interval. If it

is inaccessible due to any incorrectly modified Key Vault

configurations, it logs that error and pushes a corresponding

Advisor recommendation. You must configure the Advisor

alert to stay updated and fix such issues immediately to

avoid any Control or Data plane related problems. To set an

alert for this specific case, use the Recommendation Type as

Resolve Azure Key Vault issue for your Application

Gateway.

Consider SNAT port limitations in your design SNAT port limitations are important for backend connections

on the Application Gateway. There are separate factors that

affect how Application Gateway reaches the SNAT port limit.

For example, if the backend is a public IP address, it will

require its own SNAT port. In order to avoid SNAT port

limitations, you can increase the number of instances per

Application Gateway, scale out the backends to have more IP

addresses, or move your backends into the same virtual

network and use private IP addresses for the backends.

Requests per second (RPS) on the Application Gateway will

be affected if the SNAT port limit is reached. For example, if

an Application Gateway reaches the SNAT port limit, then it

won't be able to open a new connection to the backend, and

the request will fail.


Performance efficiency

Design checklist

Recommendations

For more suggestions, see Principles of the operational excellence pillar.

Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure

Advisor recommendations.

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an

efficient manner. We recommend you review the Performance efficiency principles.

Estimate the Application Gateway instance count

Define the maximum instance count

Define the minimum instance count

Define Application Gateway subnet size

Take advantage of features for autoscaling and performance benefits

Explore the following table of recommendations to optimize your Application Gateway configuration for

Performance efficiency.

RECOMMENDATION    BENEFIT

Estimate the Application Gateway instance count Application Gateway v2 scales out based on many aspects,

such as CPU, memory, network utilization, and more. To

determine the approximate instance count, factor in these

metrics:

Current compute units — Indicates CPU utilization. 1

Application Gateway instance is approximately 10 compute

units.

Throughput — Application Gateway instance can serve

~500 Mbps of throughput. This data depends on the type

of payload.

Consider this equation when calculating instance counts.

After you've estimated the instance count, compare that

value to the maximum instance count. This will indicate how

close you are to the maximum available capacity.

Define the minimum instance count For Application Gateway v2 SKU, autoscaling takes some

time (approximately six to seven minutes) before the

additional set of instances is ready to serve traffic. During

that time, if there are short spikes in traffic, expect transient

latency or loss of traffic.

We recommend that you set your minimum instance count

to an optimal level. After you estimate the average instance

count and determine your Application Gateway autoscaling

trends, define the minimum instance count based on your

application patterns. For information, see Application

Gateway high traffic support.

Check the Current Compute Units for the past one month.

This metric represents the gateway's CPU utilization. To

define the minimum instance count, divide the peak usage

by 10. For example, if your average Current Compute Units

in the past month is 50, set the minimum instance count to

five.

Define the maximum instance count We recommend 125 as the maximum autoscale instance

count. Make sure the subnet that has the Application

Gateway has sufficient available IP addresses to support the

scale-up set of instances.

Setting the maximum instance count to 125 has no cost

implications because you're billed only for the consumed

capacity.

Define Application Gateway subnet size Application Gateway needs a dedicated subnet within a
virtual network. The subnet can have multiple instances of
the deployed Application Gateway resource. You can also
deploy other Application Gateway resources in that subnet,
v1 or v2 SKU.
Here are some considerations for defining the subnet size:
- Application Gateway uses one private IP address per
instance and another private IP address if a private front-end
IP is configured.
- Azure reserves five IP addresses in each subnet for internal
use.
- Application Gateway (Standard or WAF SKU) can support
up to 32 instances. Taking 32 instance IP addresses + 1
private front-end IP + 5 Azure reserved, a minimum subnet
size of /26 is recommended. Because the Standard_v2 or
WAF_v2 SKU can support up to 125 instances, using the
same calculation, a subnet size of /24 is recommended.
- If you want to deploy additional Application Gateway
resources in the same subnet, consider the additional IP
addresses that will be required for their maximum instance
count for both Standard and Standard v2.
Take advantage of features for autoscaling and performance
benefits
The v2 SKU offers autoscaling to ensure that your
Application Gateway can scale up as traffic increases. When
compared to v1 SKU, v2 has capabilities that enhance the
performance of the workload. For example, better TLS
offload performance, quicker deployment and update times,
zone redundancy, and more. For more information about
autoscaling features, see Scaling Application Gateway v2 and
WAF v2.
If you are running v1 SKU gateways, consider migrating to
the v2 SKU. For more information, see Migrate Azure
Application Gateway and Web Application Firewall from v1
to v2.
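
The subnet sizing arithmetic described earlier in this table (instance addresses, plus one private front-end IP, plus five Azure-reserved addresses) can be checked with a short calculation. This is a sketch under the assumptions stated in the text. The function names are hypothetical and the code does not call any Azure API.

```python
AZURE_RESERVED_IPS = 5  # from the text: reserved in each subnet for internal use


def required_addresses(max_instances: int, private_frontend: bool = True) -> int:
    """One private IP per instance, plus an optional private front-end IP,
    plus the addresses Azure reserves in the subnet."""
    return max_instances + (1 if private_frontend else 0) + AZURE_RESERVED_IPS


def smallest_prefix(address_count: int) -> int:
    """Return the longest IPv4 prefix whose subnet holds address_count addresses."""
    host_bits = (address_count - 1).bit_length()
    return 32 - host_bits


if __name__ == "__main__":
    for sku, instances in (("Standard / WAF (v1)", 32), ("Standard_v2 / WAF_v2", 125)):
        needed = required_addresses(instances)
        print(f"{sku}: {needed} addresses -> /{smallest_prefix(needed)}")
```

For 32 instances the sketch yields 38 addresses and a /26 subnet. For 125 instances it yields 131 addresses and a /24 subnet, which matches the recommendations in the table. If several gateways share the subnet, sum their address requirements before choosing the prefix.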
 
Azure Advisor recommendations
  
Reliability
 
Additional resources
  
Azure Architecture Center guidance
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost
effectiveness, performance, and operational excellence of your Application Gateway.
Ensure application gateway fault tolerance
Do not override hostname to ensure website integrity
Using API gateways in microservices
Firewall and Application Gateway for virtual networks
Protect APIs with Application Gateway and API Management

Next steps

IaaS: Web application with relational database

Securely managed web applications

Zero-trust network for web applications with Azure Firewall and Application Gateway

Deploy an Application Gateway to see how it works: Quickstart: Direct web traffic with Azure Application

Gateway - Azure portal

Azure Well-Architected Framework review - Azure Firewall

12/16/2022 • 13 minutes to read

Prerequisites

Reliability

Design checklist

Recommendations

RECOMMENDATION    BENEFIT

Use Azure Firewall Manager with Azure Virtual WAN to

deploy and manage instances of Azure Firewall across Virtual

WAN hubs or in hub virtual networks.

Easily create hub-and-spoke and transitive architectures with

native security services for traffic governance and protection.

This article provides architectural best practices for Azure Firewall. The guidance is based on the five pillars of

architecture excellence:

Reliability

Security

Cost optimization

Operational excellence

Performance efficiency

We assume that you have working knowledge of Azure Firewall and are well versed with its features. For more

information, see Azure Firewall Standard features.

Understanding the Azure Well-Architected Framework pillars can help produce a high-quality, stable, and

efficient cloud architecture. Review your workload by using the Well-Architected Framework review

assessment.

Use a reference architecture to review the considerations based on the guidance provided in this article. Start

with Network-hardened web application with private connectivity to PaaS datastores and Implement a secure

hybrid network.

To learn how Azure Firewall supports a reliable workload, see the following articles:

Introduction to Azure Firewall

Quickstart: Deploy Azure Firewall with availability zones

As you make design choices for Azure Firewall, review the design principles for reliability.

Deploy by using a secured virtual hub.

Use a global Azure Firewall policy.

Determine if you want to use third-party security as a service (SECaaS) providers.

Explore the following table of recommendations to optimize your Azure Firewall configuration for reliability.

Create a global Azure Firewall policy to govern the security

posture across global network environments. Assign the

policy to all instances of Azure Firewall.

Allow for granular policies to meet the requirements of

specific regions. Delegate incremental firewall policies to local

security teams through role-based access control (RBAC).

Configure supported third-party software as a service (SaaS)

security providers within Firewall Manager if you want to use

these solutions to protect outbound connections.

You can use your familiar, best-in-breed, third-party SECaaS

offerings to protect internet access for your users.

Deploy Azure Firewall across multiple availability zones for a

higher service-level agreement (SLA).

Azure Firewall provides different SLAs when it's deployed in a

single availability zone and when it's deployed in multiple zones.

For more information, see SLA for Azure Firewall. For

information about all Azure SLAs, see SLA summary for

Azure services.

In multi-region environments, deploy an instance of Azure

Firewall per region.

For workloads designed to be resistant to failures and fault

tolerant, remember to consider that instances of Azure

Firewall and Azure Virtual Network are regional resources.

Closely monitor Azure Firewall metrics to ensure this

component of your solution is healthy.

Closely monitor metrics, especially SNAT port utilization,

firewall health state, and throughput.


Security

Design checklist

Recommendations

RECOMMENDATION    BENEFIT

Create a global Azure Firewall policy to govern the security

posture across global network environments. Assign the

policy to all instances of Azure Firewall.

Allow for granular policies to meet the requirements of

specific regions. Delegate incremental firewall policies to local

security teams through RBAC.

Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the

Azure Advisor recommendations.

Security is one of the most important aspects of any architecture. Azure Firewall is an intelligent firewall security

service that provides threat protection for your cloud workloads running in Azure.

As you make design choices for Azure Firewall, review the design principles for security.

Use a global Azure Firewall policy.

Use threat intelligence.

Use a DNS proxy.

Direct network traffic through Azure Firewall.

Validate spoke networks.

Determine if you want to use third-party SECaaS providers.

Use just-in-time (JIT) systems.

Explore the following table of recommendations to optimize your Azure Firewall configuration for security.

Enable threat intelligence on Azure Firewall. You can enable threat intelligence-based filtering for your

firewall to alert and deny traffic from or to known malicious IP

addresses and domains. The IP addresses and domains are

sourced from the Microsoft Threat Intelligence Feed.

Intelligent Security Graph powers Microsoft threat

intelligence and is used by multiple services, including

Microsoft Defender for Cloud.

Enable Domain Name System (DNS) proxy and point the

infrastructure DNS to Azure Firewall.

By default, Azure Firewall uses Azure DNS. Custom DNS

allows you to configure Azure Firewall to use corporate DNS

to resolve external and internal names.

Configure the user-defined routes (UDR) to force traffic to

Azure Firewall.

Configure UDRs to force traffic to Azure Firewall for

SpoketoSpoke, SpoketoInternet, and SpoketoHybrid

connectivity.

Validate if unnecessary peering exists between the hub

virtual network where Azure Firewall is deployed, and other

spoke virtual networks.

Helps ensure that undesired traffic isn't sent to

Azure Firewall or to the hub network where Azure

Firewall is deployed.

Use security partner providers for third-party SECaaS

offerings.

Security partner providers help filter internet traffic through

a virtual private network or a branch to the internet.

Use JIT systems to control access to virtual machines (VMs)

from the internet.

You can use Microsoft Defender for Cloud JIT to control

access for clients that connect from the internet by using

Azure Firewall.

Configure Azure Firewall in the forced tunneling mode to

route all internet-bound traffic to a designated next hop

instead of going directly to the internet.

Azure Firewall must have direct internet connectivity. If your

AzureFirewallSubnet learns a default route to your on-

premises network via the Border Gateway Protocol, you

must configure Azure Firewall in the forced tunneling mode.

Using the forced tunneling feature, you'll need another /26

address space for the Azure Firewall Management subnet.

You're required to name it

AzureFirewallManagementSubnet.

If this is an existing Azure Firewall instance that can't be

reconfigured in the forced tunneling mode, create a UDR

with a 0.0.0.0/0 route. Set the NextHopType value as

Internet. Associate it with AzureFirewallSubnet to

maintain internet connectivity.

Set the public IP address to None to deploy a fully private

data plane when you configure Azure Firewall in the forced

tunneling mode.

When you deploy a new Azure Firewall instance, if you

enable the forced tunneling mode, you can set the public IP

address to None to deploy a fully private data plane.

However, the management plane still requires a public IP for

management purposes only. The internal traffic from virtual

and on-premises networks won't use that public IP. For more

about forced tunneling, see Azure Firewall forced tunneling.

Use fully qualified domain name (FQDN) filtering in network

rules.

You can use FQDNs based on DNS resolution in Azure

Firewall and firewall policies. This capability allows you to

filter outbound traffic with any TCP/UDP protocol (including

NTP, SSH, RDP, and more). You must enable the DNS Proxy

option to use FQDNs in your network rules. To learn how it

works, see Azure Firewall FQDN filtering in network rules.


Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the

Policy definitions

Cost optimization

Design checklist

Recommendations

RECOMMENDATION    BENEFIT

Deploy the Standard and Premium SKUs where appropriate. The Standard option is usually enough for east-west traffic.

Premium has the necessary extra features for north-south

traffic, the forced tunneling feature, and many other

features. For more information, see Azure Firewall Premium

Preview features. Deploy mixed scenarios using the Standard

and Premium options according to your needs.

Stop Azure Firewall deployments that don't need to run

24 hours a day.

You might have development environments that are used

only during business hours. For more information, see

Deallocate and allocate Azure Firewall.

Share the same instance of Azure Firewall across multiple

workloads and Azure Virtual Network.

You can use a central instance of Azure Firewall in the hub

virtual network and share the same firewall across many

spoke virtual networks that are connected to the same hub

from the same region.

Ensure there's no unexpected cross-region traffic as part of

the hub-spoke topology.

Azure Advisor recommendations.

All internet traffic should be routed via your Azure Firewall. Protect your subnets from potential threats by

restricting access to them with Azure Firewall or a supported next-generation firewall.

All built-in policy definitions related to Azure networking are listed in Built-in policies - Network.

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational

efficiencies.

As you make design choices for Azure Firewall, review the design principles for cost optimization.

Determine which firewall SKUs to deploy.

Determine if some resources don't need 100% allocation.

Determine where you can optimize firewall use across workloads.

Monitor firewall usage to determine cost-effectiveness.

Review Firewall Manager capabilities to determine potential operational efficiency.

Determine the number of public IP addresses required.

Explore the following table of recommendations to optimize your Azure Firewall configuration for cost

optimization.

Review underutilized Azure Firewall instances. Identify and

delete unused Azure Firewall deployments.

To identify unused Azure Firewall deployments, start by

analyzing the monitoring metrics and UDRs associated with

subnets pointing to the firewall's private IP. Combine that

information with other validations, such as if your instance of

Azure Firewall has any rules (classic) for NAT, Network and

Application, or even if the DNS Proxy setting is configured to

Disabled, and with internal documentation about your

environment and deployments.

You can detect deployments that are cost-effective over

time.

For more information about monitoring logs and metrics,

see Monitor Azure Firewall logs and metrics and SNAT port

utilization.

Use Azure Firewall Manager and its policies to reduce

operational costs, increase efficiency, and reduce

management overhead.

Review your Firewall Manager policies, associations, and

inheritance carefully. Policies are billed based on firewall

associations. A policy with zero or one firewall association is

free of charge. A policy with multiple firewall associations is

billed at a fixed rate.

For more information, see Pricing - Azure Firewall Manager.

Delete unused public IP addresses and use IP Groups to

reduce your management overhead.

Validate whether all the associated public IP addresses are in

use. If they aren't in use, disassociate and delete them. Use

IP Groups to reduce your management overhead. Evaluate

SNAT port utilization before removing any IP addresses.

You'll only use the number of public IPs your firewall needs.

For more information, see Monitor Azure Firewall logs and

metrics and SNAT port utilization.


Operational excellence

Design checklist

Recommendations

For more suggestions, see Principles of the Cost optimization pillar.

Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the

Azure Advisor recommendations.

Monitoring and diagnostics are crucial. You can measure performance statistics and metrics to troubleshoot and

remediate issues quickly.

As you make design choices for Azure Firewall, review the design principles for operational excellence.

Use logs for monitoring.

Use tags when possible to allow traffic through the firewall.

Use workbooks.

Use the Azure Firewall connector in Microsoft Sentinel.

Explore the following table of recommendations to optimize your Azure Firewall configuration for operational

excellence.

RECOMMENDATION    BENEFIT

Turn on logs for Azure Firewall. You can monitor Azure Firewall by using firewall logs or

workbooks. You can also use activity logs for auditing

operations on Azure Firewall resources.

Use FQDN tags on Azure Firewall. FQDN tags make it easy to allow known Azure service

network traffic through your firewall. For example, say you

want to allow Windows Update network traffic through your

firewall. You create an application rule and use the Windows

Update tag. Now network traffic from Windows Update can

flow through your firewall.

Use workbooks in Azure Log Analytics. Workbooks help visualize firewall logs.

Enable Azure Firewall connector in Microsoft Sentinel. You can use Microsoft Sentinel to create detections and logic

apps for Azure Firewall.

Migrate Azure Firewall rules to Azure Firewall Manager

policies for existing deployments.

For existing deployments, migrate Azure Firewall rules to

Azure Firewall Manager policies. Use Azure Firewall Manager

to centrally manage your firewalls and policies.

Monitoring capacity metrics are indicators of the utilization

of provisioned Azure Firewall capacity.

Set alerts as needed to get notifications after reaching a

threshold for any metric.

For information about monitoring logs and metrics, see

Monitor Azure Firewall logs and metrics.

Monitor other Azure Firewall logs and metrics for

troubleshooting and set alerts.

Azure Firewall exposes a few other logs and metrics for

troubleshooting that are suitable indicators of issues.

Evaluate alerts based on the following list.

Use diagnostics logs and policy analytics. Diagnostic logs allow you to view Azure Firewall logs,

performance logs, and access logs. You can use these logs in

Azure to manage and troubleshoot your Azure Firewall

instance.

Policy analytics for Azure Firewall Manager allows you to

start seeing rules and flows that match the rules and hit

count for those rules. You can have full traffic visibility by

watching what rule is in use and the traffic being matched.

Performance efficiency

Design checklist

Application Rule log: Each new connection that

matches one of your configured application rules

results in a log for the accepted/denied connection.

Network Rule log: Each new connection that matches

one of your configured network rules results in a log

for the accepted/denied connection.

DNS Proxy log: This log tracks DNS messages to a

DNS server configured using a DNS proxy.

Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the

Azure Advisor recommendations.

Performance efficiency is the ability of your workload to scale to efficiently meet the demands placed on it by

users.

  
Recommendations
RECOMMENDATION    BENEFIT
If you'll need more than 512,000 SNAT ports, deploy a NAT
gateway with Azure Firewall.
With a NAT gateway, you can scale up to more than 1 million
ports. For more information, see Scale SNAT ports with
Azure NAT gateway.
Create initial traffic that isn't part of your load tests 20
minutes before the test. Use diagnostics settings to capture
scale-up and scale-down events. You can use the Azure Load
Testing service to generate the initial traffic.
Allows the Azure Firewall instance to scale up its instances to
the maximum.
Use IP Groups to summarize IP address ranges. You can use IP Groups to summarize IP ranges, so you don't
exceed 10,000 network rules. For each rule, Azure multiplies
ports by IP addresses. So, if you have 1 rule with 4 IP
address ranges and 5 ports, you'll consume 20 network
rules.
Configure an Azure Firewall subnet (AzureFirewallSubnet)
with a /26 address space.
Azure Firewall is a dedicated deployment in your virtual
network. Within your virtual network, a dedicated subnet is
required for the instance of Azure Firewall. Azure Firewall
provisions more capacity as it scales.
A /26 address space for its subnets ensures that the firewall
has enough IP addresses available to accommodate the
scaling. Azure Firewall doesn't need a subnet bigger than
/26. The Azure Firewall subnet name must be
AzureFirewallSubnet.
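
The network rule accounting described in the IP Groups recommendation above (ports multiplied by IP address ranges) can be illustrated with a short calculation. This is a sketch only. The `NetworkRule` type and the function name are hypothetical, and treating an IP Group as a single address entry is an assumption made for the comparison, not a statement of how Azure counts it.

```python
from dataclasses import dataclass
from typing import List

NETWORK_RULE_LIMIT = 10_000  # limit cited in the text


@dataclass
class NetworkRule:
    """Hypothetical, simplified model of one configured network rule."""
    name: str
    address_entries: List[str]  # IP ranges, or IP Group names (assumed to count once)
    ports: List[int]


def effective_rule_count(rules: List[NetworkRule]) -> int:
    """Each rule consumes (address entries x ports) toward the limit."""
    return sum(len(r.address_entries) * len(r.ports) for r in rules)


if __name__ == "__main__":
    ports = [80, 443, 1433, 3389, 8080]
    expanded = NetworkRule(
        "allow-apps",
        ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
        ports,
    )
    grouped = NetworkRule("allow-apps", ["ipgroup-apps"], ports)
    print(effective_rule_count([expanded]))  # 4 ranges x 5 ports
    print(effective_rule_count([grouped]))   # 1 entry x 5 ports (assumption)
```

With four address ranges and five ports, the first rule consumes 20 of the 10,000 available network rules, as in the example from the text.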
 
Azure Advisor recommendations
 
Additional resources
As you make design choices for Azure Firewall, review the design principles for performance efficiency.
Determine your SNAT port requirements and if you should deploy a NAT gateway.
Plan load tests to test auto-scale performance in your environment.
Plan network rule requirements and opportunities to summarize IP ranges.
Explore the following table of recommendations to optimize your Azure Firewall configuration for performance
efficiency.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost-
effectiveness, performance, and operational excellence of your instance of Azure Firewall.
Create Azure Service Health alerts to be notified when Azure problems affect you
Ensure you have access to Azure cloud experts when you need it
Enable Traffic Analytics to view insights into traffic patterns across Azure resources
Update your outbound connectivity protocol to Service Tags for Azure Site Recovery
Follow just enough administration (least privilege principle)
Protect your network resources with Microsoft Defender for Cloud
Azure Firewall documentation

Azure Architecture Center guidance

Next step

Azure Firewall service limits, quotas, and constraints

Azure security baseline for Azure Firewall

Azure Firewall architecture overview

Use Azure Firewall to help protect an Azure Kubernetes Service (AKS) cluster

Hub-spoke network topology in Azure

Implement a secure hybrid network

Network-hardened web application with private connectivity to PaaS datastores

Deploy an instance of Azure Firewall to see how it works:

Tutorial: Deploy and configure Azure Firewall and policy by using the Azure portal

Azure Well-Architected Framework review - Azure ExpressRoute

12/16/2022 • 10 minutes to read

Prerequisites

Reliability

Design checklist

This article provides architectural best practices for Azure ExpressRoute. The guidance is based on the five pillars

of architecture excellence:

Reliability

Security

Cost optimization

Operational excellence

Performance efficiency

We assume that you have working knowledge of Azure ExpressRoute and are well versed with all of its features.

For more information, see Azure ExpressRoute.

For context, consider reviewing a reference architecture that reflects these considerations in its design. We

recommend that you start with Cloud Adoption Framework Ready methodology's guidance Connect to Azure

and Architect for hybrid connectivity with Azure ExpressRoute. For low-code application architectures, we

recommend reviewing Enabling ExpressRoute for Power Platform when planning and configuring ExpressRoute

for use with Microsoft Power Platform.

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to

minimize the effects of a single failing component. Use the following information to minimize down time to and

from Azure when establishing connectivity using Azure ExpressRoute.

When you plan for reliability with Azure ExpressRoute, it's important to take into consideration bandwidth

usage, the physical layout of the network, and disaster recovery if there are failures. Azure ExpressRoute can

address these design considerations, and there are recommendations for each item in the checklist.

The design checklist and list of recommendations below present information to help you

design a highly available network between your Azure environment and on-premises network.

As you make design choices for Azure ExpressRoute, review the design principles for adding reliability to the

architecture.

Select between ExpressRoute circuit or ExpressRoute Direct for business requirements.

Configure a diverse physical layer network to the service provider.

Configure ExpressRoute circuits with different service providers to have diverse routing paths.

Configure Active-Active ExpressRoute connections between on-premises and Azure.

Set up availability zone aware ExpressRoute Virtual Network Gateways.

Configure ExpressRoute circuits in a different location than the on-premises network.

Configure ExpressRoute Virtual Network Gateways in different regions.

Configure site-to-site VPN as a backup to ExpressRoute private peering.

Recommendations

RECOMMENDATION    BENEFIT

Plan for ExpressRoute circuit or ExpressRoute Direct During the initial planning phase, you want to decide

whether you want to configure an ExpressRoute circuit or an

ExpressRoute Direct connection. An ExpressRoute circuit

allows a private dedicated connection into Azure with the

help of a connectivity provider. ExpressRoute Direct allows

you to extend your on-premises network directly into the

Microsoft network at a peering location. You also need to

identify the bandwidth requirement and the SKU type

requirement for your business needs.

Physical layer diversity For better resiliency, plan to have multiple paths between

the on-premises edge and the peering locations

(provider/Microsoft edge locations). This configuration can

be achieved by going through a different service provider or

through a different location from the on-premises network.

Plan for geo-redundant circuits To plan for disaster recovery, set up ExpressRoute circuits in

more than one peering location. You can create circuits in

peering locations in the same metro or a different metro and

choose to work with different service providers for diverse

paths through each circuit. For more information, see

Designing for disaster recovery and Designing for high

availability.

Plan for Active-Active connectivity ExpressRoute dedicated circuits guarantee 99.95%

availability when an active-active connectivity is configured

between on-premises and Azure. This mode provides higher

availability of your ExpressRoute connection. It's also

recommended to configure BFD for faster failover if there's a

link failure on a connection.

Planning for Virtual Network Gateways Create availability-zone-aware Virtual Network Gateways for

higher resiliency, and plan for Virtual Network Gateways in

different regions for disaster recovery and high availability.

Monitor circuits and gateway health Set up monitoring and alerts for ExpressRoute circuits and

Virtual Network Gateway health based on various metrics

available.

Enable service health ExpressRoute uses service health to notify about planned

and unplanned maintenance. Configuring service health will

notify you about changes made to your ExpressRoute

circuits.
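
To put the 99.95% availability figure from the active-active recommendation above into perspective, the following sketch converts an availability percentage into the downtime it permits over a period. This is plain arithmetic, not a restatement of the SLA terms. The function name and the 30-day and 365-day periods are assumptions chosen for illustration, so consult the official SLA for how availability is actually measured.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    monthly = allowed_downtime_minutes(99.95, period_days=30)
    yearly = allowed_downtime_minutes(99.95, period_days=365)
    print(f"30 days:  about {monthly:.1f} minutes")
    print(f"365 days: about {yearly / 60:.2f} hours")
```

At 99.95%, the permitted downtime is roughly 21.6 minutes over 30 days, or about 4.4 hours over a year. That budget is small enough that a single-path design is hard to justify.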

Set up monitoring for ExpressRoute circuit and ExpressRoute Virtual Network Gateway health.

Configure service health to receive ExpressRoute circuit maintenance notification.

Explore the following table of recommendations to optimize your ExpressRoute configuration for Reliability.

For more suggestions, see Principles of the reliability pillar.

Azure Advisor provides many recommendations for ExpressRoute circuits as they relate to reliability. For

example, Azure Advisor can detect:

ExpressRoute gateways in which only a single ExpressRoute circuit is deployed, instead of multiple. Multiple

ExpressRoute circuits are recommended to add resiliency for the peering location.

Security

Design checklist

Recommendations

RECOMMENDATION    BENEFIT

Configure Activity log to send logs to archive Activity logs provide insights into operations that were

performed at the subscription level for ExpressRoute

resources. With Activity logs, you can determine who performed

an operation at the control plane and when. Activity logs are retained

for only 90 days, so send them to Log

Analytics, Event Hubs, or a storage account for archiving.

Maintain inventory of administrative accounts Use Azure RBAC to configure roles to limit user accounts

that can add, update, or delete peering configuration on an

ExpressRoute circuit.

Configure MD5 hash on ExpressRoute circuit During configuration of private peering or Microsoft peering,

apply an MD5 hash to secure messages between the on-

premises router and the MSEE routers.

Configure MACSec for ExpressRoute Direct resources Media Access Control security (MACSec) is point-to-point security at

the data link layer. ExpressRoute Direct supports configuring

MACSec to prevent security threats to protocols such as

ARP, DHCP, and LACP that aren't normally secured on the Ethernet link.

For more information on how to configure MACSec, see

MACSec for ExpressRoute Direct ports.

Encrypt traffic using IPsec Configure a site-to-site VPN tunnel over your ExpressRoute

circuit to encrypt data transferred between your on-

premises network and Azure virtual network. You can

configure a tunnel using private peering or using Microsoft

peering.

Cost optimization

ExpressRoute circuits that aren't being observed by Connection Monitor, as end-to-end monitoring of your

ExpressRoute circuit is critical for reliability insights.

Network topologies involving multiple peering locations that would benefit from ExpressRoute Global Reach

to improve disaster recovery designs for on-premises connectivity to account for unplanned connectivity

loss.

Security is one of the most important aspects of any architecture. ExpressRoute provides features to employ

both the principle of least privilege and defense in depth. We recommend you review the Security design

principles.

Configure Activity log to send logs to archive.

Maintain an inventory of administrative accounts with access to ExpressRoute resources.

Configure MD5 hash on ExpressRoute circuit.

Configure MACSec for ExpressRoute Direct resources.

Encrypt traffic over private peering and Microsoft peering for virtual network traffic.

Explore the following table of recommendations to optimize your ExpressRoute configuration for security.

For more suggestions, see Principles of the security pillar.

  
Design checklist
  
Recommendations
RECOMMENDATION    BENEFIT
Familiarize yourself with ExpressRoute pricing For information about ExpressRoute pricing, see Understand
pricing for Azure ExpressRoute. You can also use the Pricing
calculator.
Ensure that the options are adequately sized to meet the
capacity demand and deliver expected performance without
wasting resources.
Determine SKU and bandwidth required The way you're charged for your ExpressRoute usage varies
between the three different SKU types. With Local SKU,
you're automatically charged with an Unlimited data plan.
With Standard and Premium SKU, you can select between a
Metered or an Unlimited data plan. All ingress data is free
of charge except when using the Global Reach add-on. It's
important to understand which SKU type and data plan
work best for your workload to optimize cost and
budget. For more information about resizing an ExpressRoute circuit,
see upgrading ExpressRoute circuit bandwidth.
Determine the ExpressRoute virtual network gateway size ExpressRoute virtual network gateways are used to pass
traffic into a virtual network over private peering. Review the
performance and scale needs of your preferred Virtual
Network Gateway SKU. Select the appropriate gateway SKU
based on your on-premises to Azure workload.
Monitor cost and create budget alerts Monitor the cost of your ExpressRoute circuit and create
alerts for spending anomalies and overspending risks. For
more information, see Monitoring ExpressRoute costs.
Deprovision and delete ExpressRoute circuits no longer in
use.
ExpressRoute circuits are charged from the moment they're
created. To reduce unnecessary cost, deprovision the circuit
with the service provider and delete the ExpressRoute circuit
from your subscription. For steps on how to remove an
ExpressRoute circuit, see Deprovisioning an ExpressRoute
circuit.
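The choice between a Metered and an Unlimited data plan described above is essentially a break-even calculation. The following Python sketch is illustrative only: the function names are made up for this example, and the fee and per-GB values used with it are hypothetical placeholders, not published ExpressRoute prices. It assumes that only outbound (egress) data is billed on a Metered plan, consistent with ingress being free of charge.

```python
def monthly_cost_metered(egress_gb: float, port_fee: float, price_per_gb: float) -> float:
    """Metered plan: fixed monthly port fee plus a charge per GB of outbound data."""
    return port_fee + egress_gb * price_per_gb


def break_even_egress_gb(metered_port_fee: float, price_per_gb: float,
                         unlimited_port_fee: float) -> float:
    """Monthly egress volume (GB) at which both plans cost the same."""
    if price_per_gb <= 0:
        raise ValueError("price_per_gb must be positive")
    return (unlimited_port_fee - metered_port_fee) / price_per_gb


def cheaper_plan(egress_gb: float, metered_port_fee: float,
                 price_per_gb: float, unlimited_port_fee: float) -> str:
    """Return the name of the plan with the lower monthly cost."""
    metered = monthly_cost_metered(egress_gb, metered_port_fee, price_per_gb)
    return "Metered" if metered < unlimited_port_fee else "Unlimited"
```

Substitute the current figures from the ExpressRoute pricing page for your SKU, bandwidth, and region before drawing conclusions from a calculation like this.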
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies. We recommend you review the Cost optimization design principle and Plan and manage costs for
Azure ExpressRoute.
Familiarize yourself with ExpressRoute pricing.
Determine the ExpressRoute circuit SKU and bandwidth required.
Determine the ExpressRoute virtual network gateway size required.
Monitor cost and create budget alerts.
Deprovision ExpressRoute circuits no longer in use.
Explore the following table of recommendations to optimize your ExpressRoute configuration for Cost
optimization.
For more suggestions, see Principles of the cost optimization pillar.
Azure Advisor can detect ExpressRoute circuits that have been deployed for a significant time but have a
provider status of Not Provisioned. Circuits in this state aren't operational, and removing the unused resource
will reduce unnecessary costs.

Operational excellence

Design checklist

Recommendations

RECOMMENDATION BENEFIT

Configure connection monitoring Connection monitoring allows you to monitor connectivity

between your on-premises resources and Azure over the

ExpressRoute private peering and Microsoft peering

connection. Connection monitor can detect networking

issues by identifying where along the network path the

problem is and help you quickly resolve configuration or

hardware failures.

Configure Service Health Set up Service Health notifications to alert when planned

and upcoming maintenance is happening to all ExpressRoute

circuits in your subscription. Service Health also displays past

maintenance along with a root cause analysis (RCA) if unplanned

maintenance occurs.

Review metrics with Network Insights ExpressRoute Insights with Network Insights allows you to

review and analyze ExpressRoute circuit, gateway, and

connection metrics and health dashboards. ExpressRoute

Insights also provides a topology view of your ExpressRoute

connections where you can view details of your peering

components all in a single place.

Metrics available:

- Availability

- Throughput

- Gateway metrics

Review ExpressRoute resource metrics ExpressRoute uses Azure Monitor to collect metrics and

create alerts based on your configuration. Metrics are

collected for ExpressRoute circuits, ExpressRoute gateways,

ExpressRoute gateway connections, and ExpressRoute Direct.

These metrics are useful for diagnosing connectivity

problems and understanding the performance of your

ExpressRoute connection.

Performance efficiency

Monitoring and diagnostics are crucial. Not only can you measure performance statistics, but you can also use metrics

to troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design

principles.

Configure connection monitoring between your on-premises and Azure network.

Configure Service Health for receiving notification.

Review metrics and dashboards available through ExpressRoute Insights using Network Insights.

Review ExpressRoute resource metrics.

Explore the following table of recommendations to optimize your ExpressRoute configuration for Operational

excellence.

For more suggestions, see Principles of the operational excellence pillar.

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an

efficient manner. We recommend you review the Performance efficiency principles.

Design checklist

Recommendations

RECOMMENDATION BENEFIT

Test ExpressRoute gateway performance to meet workload

requirements.

Use Azure Connectivity Toolkit to test performance across

your ExpressRoute circuit to understand bandwidth capacity

and latency of your network connection.

Increase the size of the ExpressRoute gateway. Upgrade to a higher gateway SKU for improved throughput

performance between on-premises and Azure environment.

Upgrade ExpressRoute circuit bandwidth Upgrade your circuit bandwidth to meet your workload

requirements. Circuit bandwidth is shared between all virtual

networks connected to the ExpressRoute circuit. Depending

on your workload, one or more virtual networks can use up

all the bandwidth on the circuit.

Enable ExpressRoute FastPath for higher throughput If you're using an Ultra performance or an ErGW3AZ virtual

network gateway, you can enable FastPath to improve the

data path performance between your on-premises network

and Azure virtual network.

Monitor ExpressRoute circuit and gateway metrics Set up alerts based on ExpressRoute metrics to proactively

notify you when a certain threshold is met. These metrics are

useful to understand anomalies that can happen with your

ExpressRoute connection such as outages and maintenance

happening to your ExpressRoute circuits.

Azure Policy

Additional resources

Cloud Adoption Framework guidance

Test ExpressRoute gateway performance to meet workload requirements.

Increase the size of the ExpressRoute gateway.

Upgrade the ExpressRoute circuit bandwidth.

Enable ExpressRoute FastPath for higher throughput.

Monitor the ExpressRoute circuit and gateway metrics.

Explore the following table of recommendations to optimize your ExpressRoute configuration for performance

efficiency.

For more suggestions, see Principles of the performance efficiency pillar.

Azure Advisor will offer a recommendation to upgrade your ExpressRoute circuit bandwidth to accommodate

usage when your circuit has recently been consuming over 90% of your procured bandwidth. If your traffic

exceeds your allocated bandwidth, you’ll experience dropped packets, which can lead to significant performance

or reliability impact.
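
The 90% figure above can also be checked against your own metrics. The following Python sketch is an illustration, not a description of how Azure Advisor evaluates circuits: the function name and the idea of passing in a list of throughput samples are assumptions made for this example.

```python
from typing import Sequence


def should_consider_upgrade(samples_mbps: Sequence[float],
                            circuit_mbps: float,
                            threshold: float = 0.9) -> bool:
    """Return True when peak observed throughput exceeds the given
    fraction of the procured circuit bandwidth."""
    if circuit_mbps <= 0:
        raise ValueError("circuit_mbps must be positive")
    if not samples_mbps:
        return False  # no data, nothing to conclude
    peak_utilization = max(samples_mbps) / circuit_mbps
    return peak_utilization > threshold
```

For a 1,000 Mbps circuit, a peak sample of 950 Mbps is 95% utilization and would be flagged, while a peak of 850 Mbps would not.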

Azure Policy doesn't provide any built-in policies for ExpressRoute, but custom policies can be created to help

govern how ExpressRoute circuits should match your desired end state, such as SKU choice, peering type,

peering configurations and so on.

Traditional Azure network topology

Virtual WAN network topology (Microsoft-managed)

Next steps

Configure an ExpressRoute circuit or ExpressRoute Direct port to establish communication between your on-

premises network and Azure.

API Management and reliability

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Ensure you set quotas and rate limits when exposing APIs to

third parties.

Protect backend services and reduce the load placed on an

API Management scale unit. Rate limiting policies can be

applied at Global, Product, API, and Operation levels to

provide rate limit customization applied to API consumers.

Evaluate the need for response caching. Response caching can reduce API latency and bandwidth

consumption. Response caching reduces the load placed on

the backend APIs leading to improved performance, user

experience, and reduced solution cost.
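
To make the rate-limiting recommendation above concrete, the following Python sketch shows the general idea of a fixed-window limit applied per caller. It is a conceptual model only and is not how API Management is implemented. The parameter names `calls` and `renewal_period_s` are chosen to echo the `calls` and `renewal-period` attributes of the API Management `rate-limit` policy; the class itself is hypothetical.

```python
from typing import Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `calls` requests per caller in each window of
    `renewal_period_s` seconds."""

    def __init__(self, calls: int, renewal_period_s: int) -> None:
        if calls < 1 or renewal_period_s < 1:
            raise ValueError("calls and renewal_period_s must be >= 1")
        self.calls = calls
        self.renewal_period_s = renewal_period_s
        # caller key -> (window index, requests seen in that window)
        self._state: Dict[str, Tuple[int, int]] = {}

    def allow(self, key: str, now_s: float) -> bool:
        window = int(now_s // self.renewal_period_s)
        seen_window, count = self._state.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.calls:
            self._state[key] = (window, count)
            return False  # over the limit: a gateway would answer 429
        self._state[key] = (window, count + 1)
        return True
```

Keying the counter by subscription, product, or API corresponds to the different scopes at which a limit can be applied.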

Learn how to use API Management to publish APIs to external, partner, and employee developers securely and

at scale. This networking service is a hybrid, multicloud management platform for APIs across all environments.

Components include:

API gateway

Management plane

Developer portal

For more information, reference About API Management.

To understand how API Management can increase reliability for your workload, reference the following topics:

Availability zone support for Azure API Management

How to deploy an Azure API Management service instance to multiple Azure regions

How to implement disaster recovery using service backup and restore in Azure API Management

Have you configured API Management with reliability in mind?

Secure the communication between API Management and your backend.

Ensure that each party has its own credential when exposing APIs to third parties.

Ensure you set quotas and rate limits when exposing APIs to third parties.

Evaluate the need for response caching.

Plan a backup and restore process for your API Management instance.

Configure multiple Azure regions in your API Management service.

Implement a strategy to ensure availability during an outage or disaster affecting an Azure region.

Consider the following recommendations to optimize reliability when configuring your API Management

service:

Plan a backup and restore process for your API

Management instance.

Consider taking regular backups of your API Management

service so that you can easily restore it in another region.

Your recovery time objective may require that a standby is

deployed in a secondary region. It is a good practice to take

regular backups so you can recreate the service after unforeseen

loss or misconfiguration of the service. Regular backups

allow you to replicate changes between your primary and

standby instances.

Configure multiple Azure regions in your API Management

service.

Configure your API Management service with multiple

regions to provide high-availability support in case an Azure

region experiences downtime or a disaster scenario.

Configuring multiple regions also reduces API call latency

because calls can be routed to the nearest region.

Implement a strategy to ensure availability during an outage

or disaster affecting an Azure region.

Consider using Azure Traffic Manager, Azure Front Door, or

Azure DNS to enable access to multiple regional

deployments of API Management. Using these services

ensures you can still serve requests during an outage or

disaster. Requirements include syncing configurations

between these individual Standard instances.
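
The regional failover strategy described above can be pictured as priority-based selection across regional deployments. The following Python sketch is a simplified model under stated assumptions: the region names and the health map are hypothetical, and in practice a service such as Azure Traffic Manager or Azure Front Door performs the health probing and routing for you.

```python
from typing import Mapping, Optional, Sequence


def pick_region(regions_by_priority: Sequence[str],
                healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None
    when every regional deployment is unavailable."""
    for region in regions_by_priority:
        if healthy.get(region, False):
            return region
    return None
```

With the primary region marked unhealthy, the function falls through to the next region in the list, which mirrors the behavior you would expect from a priority routing configuration.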

RECOMMENDATION DESCRIPTION

Next step

API Management and cost optimization

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Configure autoscaling where appropriate. Consider scaling your API Management instance up or down

to control costs. You can configure API Management with

Autoscale based on a metric or a specific count. Costs

depend upon the number of units, which determines

throughput in requests per second (RPS). An autoscaled API

Management instance switches between scale units

appropriate for RPS numbers during a specific time window.

Autoscaling helps to achieve balance between cost

optimization and performance.
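
As a rough illustration of the relationship between units and throughput described above, the following Python sketch estimates how many scale units a given request rate would need. The per-unit capacity and the unit bounds are hypothetical inputs that you would take from your own load tests and tier limits; the function is not part of any Azure SDK.

```python
import math


def units_needed(expected_rps: float, rps_per_unit: float,
                 min_units: int = 1, max_units: int = 10) -> int:
    """Estimate the number of scale units for a request rate, clamped
    to the allowed range."""
    if rps_per_unit <= 0:
        raise ValueError("rps_per_unit must be positive")
    raw = math.ceil(expected_rps / rps_per_unit)
    return max(min_units, min(max_units, raw))
```

An autoscale rule effectively re-evaluates a calculation like this over a time window, adding units when load rises and removing them when it falls.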

Learn how to use API Management to publish APIs to external, partner, and employee developers securely and

at scale. This networking service is a hybrid, multicloud management platform for APIs across all environments.

Components include:

API gateway

Management plane

Developer portal

For more information, reference About API Management.

To understand how API Management supports cost optimization for your workload, reference the following

topics:

Automatically scale an Azure API Management instance

Use a virtual network with Azure API Management

Have you configured API Management with cost optimization in mind?

Configure autoscaling where appropriate.

Consider which features you need all the time.

Consider the following recommendations to optimize cost when configuring your API Management

service:

Consider which features you need all the time. Consider switching between Basic, Standard, and Premium

tiers. If a workload does not need features available in higher

tiers, then consider switching to a lower tier. As an example,

a workload may need just 1GB of cache during off-peak

periods compared to 5GB of cache during peak periods.

Costs associated with such a workload can be reduced by

switching from a Premium to Standard tier during off-peak

periods and back to a Premium tier during peak periods. This

process can be automated as a job using the Set-AzApiManagement

cmdlet. Refer to API Management pricing for information

about features available in different API Management tiers.
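
The peak and off-peak tier switch described above needs some rule for deciding which tier applies at a given time. The following Python sketch shows one such rule under stated assumptions: the peak window of 08:00 to 18:00 and the function name are hypothetical, and the sketch only returns the target tier name. Applying the change would be done separately, for example by a scheduled job that calls the Set-AzApiManagement cmdlet.

```python
from datetime import time


def tier_for(now: time,
             peak_start: time = time(8, 0),
             peak_end: time = time(18, 0)) -> str:
    """Return the tier to run at the given local time: Premium during
    the peak window, Standard otherwise."""
    return "Premium" if peak_start <= now < peak_end else "Standard"
```

Before automating a schedule like this, confirm that the workload doesn't rely on features of the higher tier during off-peak periods.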

RECOMMENDATION DESCRIPTION

Next step

API Management and operational excellence

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Ensure you set quotas and rate limits when exposing APIs to

third parties.

Protect backend services and reduce the load placed on an

API Management scale unit. Rate limiting policies can be

applied at Global, Product, API, and Operation levels to

provide rate limit customization applied to API consumers.

Learn how to use API Management to publish APIs to external, partner, and employee developers securely and

at scale. This networking service is a hybrid, multicloud management platform for APIs across all environments.

Components include:

API gateway

Management plane

Developer portal

For more information, reference About API Management.

To understand how API Management supports operational excellence, reference the following topics:

Managing Azure API Management using Azure Automation

Observability in Azure API Management

Have you configured API Management with operational excellence in mind?

Secure the communication between API Management and your backend.

Ensure that each party has its own credential when exposing APIs to third parties.

Ensure you set quotas and rate limits when exposing APIs to third parties.

Understand the Microsoft REST API design and architecture guidance.

Enable versioning of APIs to maintain backwards compatibility while adding other features.

Use the API Management Versioning and Revisions features to implement API versioning.

Understand the API import restrictions in API Management.

Understand the Event logging feature.

Trace calls in Azure API Management to help with debugging and testing.

Configure logging using Azure Monitor for the API Management service.

Choose the right modes to access private site connections.

Evaluate firewall rules and IP allowlists based on the API Management public IP address.

Consider the following recommendations for operational excellence when configuring your API Management

service:

Understand the Microsoft REST API design and architecture

guidance.

Follow standards and best practices when using the REST

API. Following best practices enables maximum compatibility

across platforms and implementations. Review the REST API

Guidelines and API Design guidance.

Understand the API import restrictions in API Management. Every effort is made to ensure the API import process runs

smoothly, which includes requiring no customizations. Some

scenarios impose restrictions that will require modification to

the import source. Applies to both REST and SOAP services.

Reference Policy Restrictions for the current API Import

restrictions.

Understand the Event logging feature. API Management supports event logging to an Azure event hub for

near real-time analysis. This feature enables integration with external

logging and security information and event management (SIEM)

solutions, and analysis of API usage in near real time.

Trace calls in Azure API Management to help with debugging

and testing.

Tracing must be enabled on the subscription used to make

the request. Tracing is enabled on a request-by-request basis

using the Ocp-Apim-Trace header value. API Tracing is also

built into the admin portal and is enabled by default when

testing APIs from the portal.

Configure logging using Azure Monitor for the API

Management service.

Logs can be sent to a Log Analytics workspace to enable

complex querying and analysis. Metrics can be ingested for

longer term analysis. All data is then surfaced using Azure

Monitor. It is possible to integrate Application Insights for

Application Performance Management.

Choose the right modes to access private site connections. API Management supports Virtual Network integration in internal and external

mode.

Evaluate firewall rules and IP allowlists based on the API

Management public IP address.

A fixed public IP address is available for the lifetime of the

service with the Basic, Developer, Standard, and Premium

plans for API Management.
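
The tracing recommendation in the table above mentions the Ocp-Apim-Trace header. The following Python sketch shows how a test client might assemble the request headers for a traced call. It is a minimal sketch: the function name is made up for this example, the subscription key value is a placeholder you supply, and it assumes the API uses the default Ocp-Apim-Subscription-Key header for the subscription key.

```python
from typing import Dict


def trace_headers(subscription_key: str) -> Dict[str, str]:
    """Build headers for a single traced request. Tracing must also be
    allowed on the subscription that the key belongs to."""
    if not subscription_key:
        raise ValueError("subscription_key is required")
    return {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Ocp-Apim-Trace": "true",  # request a trace for this call only
    }
```

Pass the resulting dictionary to whichever HTTP client you use for testing, and leave the trace header out of production traffic, because traces can expose request details.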

RECOMMENDATION DESCRIPTION

Next step

Reliability and Azure Firewall

Reliability and Azure Front Door

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Use WAF policies in Front Door. Lock down Application

Gateway to receive traffic only from Azure Front Door when

using Azure Front Door and Application Gateway to protect

HTTP/S applications.

Certain scenarios can force a customer to implement rules

specifically on AppGateway. For example, if ModSec Core

Rule Set (CRS) 2.2.9, CRS 3.0, or CRS 3.1 rules are

required, rules can be implemented only on AppGateway.

Rate-limiting and geo-filtering are available only on Azure

Front Door, not on AppGateway. Instructions on how to lock

down traffic can be found at Frequently asked questions for

Azure Front Door

Ensure that the connection to the back-end is re-encrypted. Front Door doesn't support SSL passthrough. Front Door

must hold the certificate to terminate the encrypted

inbound connection.
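
One part of locking down a back end so that it accepts traffic only from your Front Door is checking the X-Azure-FDID request header that Front Door adds, in addition to restricting source addresses at the network level. The following Python sketch illustrates that header check only. The function name and the plain dictionary of headers are assumptions for this example; the expected ID is the Front Door ID of your own profile.

```python
import hmac
from typing import Mapping


def is_from_expected_front_door(headers: Mapping[str, str],
                                expected_fdid: str) -> bool:
    """Return True when the request carries the expected Front Door ID.
    Header names are matched case-insensitively."""
    received = ""
    for name, value in headers.items():
        if name.lower() == "x-azure-fdid":
            received = value
            break
    # Constant-time comparison avoids leaking how much of the ID matched.
    return hmac.compare_digest(received.strip().lower().encode("utf-8"),
                               expected_fdid.strip().lower().encode("utf-8"))
```

A header check on its own isn't sufficient, because headers can be forged by any client that reaches the back end directly; combine it with network restrictions as described in the Front Door FAQ.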

A scalable and secure entry point, Azure Front Door provides fast delivery of your global web applications. Front

Door uses the Microsoft global edge network to create fast, secure, and widely scalable web applications.

Key features include:

Accelerated application performance by using split TCP-based anycast protocol.

Intelligent health probe monitoring for backend resources.

URL-path based routing for requests.

Enables hosting of multiple websites for efficient application infrastructure.

Cookie-based session affinity.

For more key features and information, reference Why use Azure Front Door?

To understand how Azure Front Door creates a more reliable workload, reference the following topics:

Selecting the Front Door environment for traffic routing (Anycast)

Front Door routing methods

Have you configured Azure Front Door with reliability in mind?

Use WAF policies in Front Door. Lock down Application Gateway to receive traffic only from Azure Front Door

when using Azure Front Door and Application Gateway to protect HTTP/S applications.

Use Azure Front Door Web Application Firewall (WAF) policies to provide global protection across Azure

regions for inbound HTTP/S connections to a Landing Zone.

Create a rule to block access to the health endpoint from the internet.

Ensure that the connection to the back-end is re-encrypted.

Evaluate the four traffic routing configurations in Azure Front Door.

Consider the following recommendation to optimize reliability when configuring Azure Front Door:

Evaluate the four traffic routing configurations in Azure

Front Door.

The Front Door service supports various traffic-routing

methods to determine how to route your HTTP/HTTPS

traffic to the various service endpoints.
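
The Front Door documentation describes four routing methods: latency, priority, weighted, and session affinity. The following Python sketch illustrates only the weighted idea, as a simplified model rather than Front Door's actual algorithm. The backend names, weights, and the explicit `draw` parameter (a number in the range 0 to 1, normally taken from a random source) are assumptions made so that the example is deterministic.

```python
from typing import Sequence, Tuple


def pick_weighted(backends: Sequence[Tuple[str, int]], draw: float) -> str:
    """Pick a backend in proportion to its weight. `draw` must be in
    [0, 1); pass random.random() for real traffic splitting."""
    total = sum(weight for _, weight in backends)
    if total <= 0:
        raise ValueError("at least one backend needs a positive weight")
    if not 0.0 <= draw < 1.0:
        raise ValueError("draw must be in [0, 1)")
    target = draw * total
    cumulative = 0
    for name, weight in backends:
        cumulative += weight
        if target < cumulative:
            return name
    return backends[-1][0]  # guard against floating-point rounding
```

With weights of 3 and 1, roughly three out of four requests go to the first backend, which is the kind of split you might use for a gradual rollout.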

RECOMMENDATION DESCRIPTION

Next step

Security and Azure Front Door

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Consider using geo-filtering in Azure Front Door. The Front Door service responds to user requests regardless

of the location of the user making the request.
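
Because Front Door answers requests from any location by default, geo-filtering amounts to an explicit allow or block decision per country or region. The following Python sketch models that decision under stated assumptions: the function name is hypothetical, the country code is assumed to be already resolved from the client address, and in Front Door itself the rule would be expressed as a WAF custom rule rather than application code.

```python
from typing import Iterable


def is_request_allowed(country_code: str,
                       allowed_countries: Iterable[str]) -> bool:
    """Return True when the two-letter country code of the caller is
    on the allow list. Matching is case-insensitive."""
    allowed = {code.upper() for code in allowed_countries}
    return country_code.upper() in allowed
```

An allow list is usually easier to reason about than a block list when an application serves only a few known markets.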

Next step

A scalable and secure entry point, Azure Front Door provides fast delivery of your global web applications. Front

Door uses the Microsoft global edge network to create fast, secure, and widely scalable web applications.

Key features include:

Accelerated application performance by using split TCP-based anycast protocol.

Intelligent health probe monitoring for backend resources.

URL-path based routing for requests.

Enables hosting of multiple websites for efficient application infrastructure.

Cookie-based session affinity.

For more key features and information, reference Why use Azure Front Door?

To understand how Azure Front Door creates a more secure workload, reference the following topics:

Security baseline for Azure Front Door

DDoS protection on Front Door

End-to-end Transport Layer Security (TLS) with Azure Front Door

Have you configured Azure Front Door with security in mind?

Consider using geo-filtering in Azure Front Door.

Consider the following recommendation to optimize security when configuring Azure Front Door:

Operational excellence and Azure Front Door

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Use Web Application Firewall (WAF) policies in Front Door.

Lock down Application Gateway to receive traffic only from

Azure Front Door when using Azure Front Door and

Application Gateway to protect HTTP/S applications.

Certain scenarios can force a customer to implement rules

specifically on AppGateway. For example, if ModSec Core

Rule Set (CRS) 2.2.9, CRS 3.0, or CRS 3.1 rules are

required, rules can be implemented only on AppGateway.

Rate-limiting and geo-filtering are available only on Azure

Front Door, not on AppGateway. Instructions on how to lock

down traffic can be found at Frequently asked questions for

Azure Front Door

Ensure that the connection to the back-end is re-encrypted. Front Door doesn't support SSL passthrough. Front Door

must hold the certificate to terminate the encrypted

inbound connection.

A scalable and secure entry point, Azure Front Door provides fast delivery of your global web applications. Front

Door uses the Microsoft global edge network to create fast, secure, and widely scalable web applications.

Key features include:

Accelerated application performance by using split TCP-based anycast protocol.

Intelligent health probe monitoring for backend resources.

URL-path based routing for requests.

Enables hosting of multiple websites for efficient application infrastructure.

Cookie-based session affinity.

For more key features and information, reference Why use Azure Front Door?

To understand how Azure Front Door supports operational excellence, reference the following topics:

How Front Door determines backend health

Monitoring metrics and logs in Azure Front Door

Front Door Standard/Premium (Preview) Logging

Have you configured Azure Front Door with operational excellence in mind?

Use Web Application Firewall (WAF) policies in Front Door. Lock down Application Gateway to receive traffic

only from Azure Front Door when using Azure Front Door and Application Gateway to protect HTTP/S

applications.

Use Azure Front Door Web Application Firewall (WAF) policies to provide global protection across Azure

regions for inbound HTTP/S connections to a Landing Zone.

Create a rule to block access to the health endpoint from the internet.

Ensure that the connection to the back-end is re-encrypted.

Consider the following recommendation to optimize operational excellence when configuring Azure Front Door:

Next step

Reliability and Network Virtual Appliances (NVA)

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

NVAs should be deployed within a Landing Zone solution-level Virtual Network.

If third-party NVAs are required for inbound HTTP/S

connections, deploy NVAs together with the applications

that they're protecting and exposing to the internet.

For Virtual Wide Area Network (VWAN) topologies, deploy

the NVAs to a separate Virtual Network (such as an NVA VNet).

Connect the NVA to the regional Virtual WAN Hub and to

the Landing Zones that require access to NVAs.

If third-party NVAs are required for east-west or south-

north traffic protection and filtering, reference Scenario:

Route traffic through an NVA.

Network Virtual Appliances (NVA) are typically used to control the flow of traffic between network segments

classified with different security levels, for example between a perimeter network (also known as DMZ,

demilitarized zone, and screened subnet) and the public internet.

Examples of NVAs include:

Network firewalls

Layer-4 reverse-proxies

Internet Protocol Security (IPsec) Virtual Private Network (VPN) endpoints

Web-based reverse-proxies

Internet proxies

Layer-7 load balancers

For more information about Network Virtual Appliances, reference Deploy highly available NVAs.

To understand how NVAs support a reliable workload, reference the following topics:

Scenario: Route traffic through an NVA

Scenario: Route traffic through NVAs by using custom settings

Use L7 load balancers

Have you configured your Network Virtual Appliances (NVA) with reliability in mind?

NVAs should be deployed within a Landing Zone solution-level Virtual Network.

For Virtual Wide Area Network (VWAN) topologies, deploy the NVAs to a separate Virtual Network (such as an

NVA VNet). Connect the NVA to the regional Virtual WAN Hub and to the Landing Zones that require access

to NVAs.

For non-Virtual Wide Area Network (WAN) topologies, deploy the third-party NVAs in the central Hub Virtual

Network (VNet).

Consider the following recommendations to optimize reliability when configuring your Network Virtual

Appliances (NVA):

For non-Virtual Wide Area Network (WAN) topologies,

deploy the third-party NVAs in the central Hub Virtual

Network (VNet).

If third-party NVAs are required for east-west or south-

north traffic protection and filtering, deploy the third-party

NVAs in the central Hub Virtual Network.

RECOMMENDATION DESCRIPTION

Next step

Cost optimization and Network Virtual Appliances (NVA)

Cost optimization and Network Virtual Appliances (NVA)

12/16/2022 • 2 minutes to read

Design considerations

Next step

Network Virtual Appliances (NVA) are typically used to control the flow of traffic between network segments

classified with different security levels, for example between a perimeter network (also known as DMZ,

demilitarized zone, and screened subnet) and the public internet.

Examples of NVAs include:

Network firewalls

Layer-4 reverse-proxies

Internet Protocol Security (IPsec) Virtual Private Network (VPN) endpoints

Web-based reverse-proxies

Internet proxies

Layer-7 load balancers

For more information about Network Virtual Appliances, reference Deploy highly available NVAs.

When deploying a Network Virtual Appliance (NVA), keep in mind the following design considerations:

There's a difference between using a third-party app (NVA) and using an Azure native service (Firewall or

Application Gateway).

With managed Platform as a Service (PaaS) services such as Azure Firewall or Application Gateway, Microsoft

handles the management of the service and the underlying infrastructure. With NVAs, which usually have to

be deployed on Virtual Machines or Infrastructure as a Service (IaaS), the customer has to handle the

management operations (such as patching and updating) of that Virtual Machine and the appliance on top.

Managing third-party services also involves using vendor-specific tools, which can make integration difficult.

Operational excellence and Network Virtual Appliances (NVA)

Operational excellence and Network Virtual Appliances (NVA)

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

NVAs should be deployed within a Landing Zone solution-level Virtual Network.

If third-party NVAs are required for inbound HTTP/S

connections, deploy NVAs together with the applications

that they're protecting and exposing to the internet.

Network Virtual Appliances (NVA) are typically used to control the flow of traffic between network segments

classified with different security levels, for example between a perimeter network (also known as DMZ,

demilitarized zone, and screened subnet) and the public internet.

Examples of NVAs include:

Network firewalls

Layer-4 reverse-proxies

Internet Protocol Security (IPsec) Virtual Private Network (VPN) endpoints

Web-based reverse-proxies

Internet proxies

Layer-7 load balancers

For more information about Network Virtual Appliances, reference Deploy highly available NVAs.

To understand how NVAs promote operational excellence, reference the following topics:

Scenario: Route traffic through an NVA

Scenario: Route traffic through NVAs by using custom settings

Gateway Load Balancer

Have you configured your Network Virtual Appliances (NVA) with operational excellence in mind?

NVAs should be deployed within a Landing Zone solution-level Virtual Network.

For Virtual Wide Area Network (VWAN) topologies, deploy the NVAs to a separate Virtual Network (such as an

NVA VNet). Connect the NVA to the regional Virtual WAN Hub and to the Landing Zones that require access

to NVAs.

For non-Virtual Wide Area Network (WAN) topologies, deploy the third-party NVAs in the central Hub Virtual

Network (VNet).

Consider the following recommendations for operational excellence when configuring your Network Virtual

Appliances (NVA):

For Virtual Wide Area Network (VWAN) topologies, deploy

the NVAs to a separate Virtual Network (such as an NVA VNet).

Connect the NVA to the regional Virtual WAN Hub and to

the Landing Zones that require access to NVAs.

If third-party NVAs are required for east-west or south-

north traffic protection and filtering, reference Scenario:

Route traffic through an NVA.

For non-Virtual Wide Area Network (WAN) topologies,

deploy the third-party NVAs in the central Hub Virtual

Network (VNet).

If third-party NVAs are required for east-west or south-

north traffic protection and filtering, deploy the third-party

NVAs in the central Hub Virtual Network.

RECOMMENDATION DESCRIPTION

Next step

Reliability and Network connectivity

12/16/2022 • 2 minutes to read

Design considerations

Network connectivity includes three Azure models for private network connectivity:

VNet injection

VNet service endpoints

Private Link

VNet injection applies to services that are deployed specifically for you, such as:

Azure Kubernetes Service (AKS) nodes

SQL Managed Instance

Virtual Machines

These resources connect directly to your virtual network.

Virtual Network (VNet) service endpoints provide secure and direct connectivity to Azure services. These service

endpoints use an optimized route over the Azure network. Service endpoints enable private IP addresses in the

VNet to reach the endpoint of an Azure service without needing a public IP address on the VNet.

Private Link provides dedicated access using private IP addresses to Azure PaaS instances, or custom services

behind an Azure Load Balancer Standard.

Network connectivity includes the following design considerations related to a reliable workload:

Use Private Link, where available, for shared Azure PaaS services. Private Link is generally available for

several services and is in public preview for numerous ones.

Access Azure PaaS services from on-premises through ExpressRoute private peering.

Use either virtual network injection for dedicated Azure services or Azure Private Link for available

shared Azure services. To access Azure PaaS services from on-premises when virtual network injection or

Private Link isn't available, use ExpressRoute with Microsoft peering. This method avoids transiting over

the public internet.

Use virtual network service endpoints to secure access to Azure PaaS services from within your virtual

network. Use virtual network service endpoints only when Private Link isn't available and there are no

concerns with unauthorized movement of data.

Service Endpoints don't allow a PaaS service to be accessed from on-premises networks. Private

Endpoints do.

To address concerns about unauthorized movement of data with service endpoints, use network-virtual

appliance (NVA) filtering. You can also use virtual network service endpoint policies for Azure Storage.

The following native network security services are fully managed services. Customers don't incur the

operational and management costs associated with infrastructure deployments, which can become

complex at scale:

Azure Firewall

Application Gateway

Azure Front Door

PaaS services are typically accessed over public endpoints. The Azure platform provides capabilities to secure these endpoints or make them entirely private.

You can also use third-party network-virtual appliances (NVAs) if the customer prefers them for situations where native services don't satisfy specific requirements.

Checklist

Have you configured Network connectivity with reliability in mind?

Don't implement forced tunneling to enable communication from Azure to Azure resources.

Unless you use network virtual appliance (NVA) filtering, don't use virtual network service endpoints when

there are concerns about unauthorized movement of data.

Don't enable virtual network service endpoints by default on all subnets.
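The design considerations and checklist above amount to a small decision procedure. The following Python sketch is one way to read them, not an official algorithm. The `Workload` fields and the `choose_connectivity` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of how a client reaches an Azure service."""
    dedicated_service: bool          # deployed specifically for you (AKS nodes, SQL Managed Instance, VMs)
    private_link_available: bool     # Private Link is offered for the service
    accessed_from_on_premises: bool  # the caller sits in an on-premises network
    exfiltration_concerns: bool      # concerns about unauthorized movement of data
    nva_filtering: bool = False      # NVA filtering is in place


def choose_connectivity(w: Workload) -> str:
    """Return the connectivity model suggested by the considerations above."""
    if w.dedicated_service:
        return "VNet injection"
    if w.private_link_available:
        return "Private Link"
    if w.accessed_from_on_premises:
        # Service endpoints can't be reached from on-premises networks.
        return "ExpressRoute with Microsoft peering"
    if not w.exfiltration_concerns or w.nva_filtering:
        return "VNet service endpoints"
    return "Review: no listed option fits"


if __name__ == "__main__":
    shared_paas = Workload(False, True, False, True)
    print(choose_connectivity(shared_paas))
```

The ordering of the checks mirrors the text: service endpoints are the fallback, used only when Private Link isn't available and data movement is either not a concern or is filtered by an NVA.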

Cost optimization and Network connectivity

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Configuration recommendations

Network connectivity includes three Azure models for private network connectivity:

VNet injection

VNet service endpoints

Private Link

VNet injection applies to services that are deployed specifically for you, such as:

Azure Kubernetes Service (AKS) nodes

SQL Managed Instance

Virtual Machines

These resources connect directly to your virtual network.

Virtual Network (VNet) service endpoints provide secure and direct connectivity to Azure services. These service

endpoints use an optimized route over the Azure network. Service endpoints enable private IP addresses in the

VNet to reach the endpoint of an Azure service without needing a public IP address on the VNet.

Private Link provides dedicated access using private IP addresses to Azure PaaS instances, or custom services

behind an Azure Load Balancer Standard.

Network connectivity includes the following design considerations related to cost optimization:

Running cost of services: The services are metered. You pay for the service itself and for consumption on the service.

VNet Peering cost: Putting all resources in a single VNet saves peering costs, but consider the consequences. A single VNet also prevents the infrastructure from growing, and it can eventually reach a point where new resources don't fit anymore.

For two peered VNets using a private endpoint: Only the private endpoint access is billed and not the VNet

peering cost.

Azure Firewall is also metered: Pay for the instance and for usage. The same applies to load balancers.

Have you configured Network connectivity with cost optimization in mind?

Select a SKU for each service that does the job required and allows the customer to grow as the workload evolves.

For Load Balancer, choose between two SKUs: Basic (free) and Standard (paid).

For App Gateway, select Basic or V2.

For Gateways, limit throughput and performance.

Select DDoS Standard.

Consider the following recommendations for cost optimization when configuring Network connectivity:

RECOMMENDATION: For Load Balancer, choose between two SKUs: Basic (free) and Standard (paid).

DESCRIPTION: Microsoft recommends Standard because it has richer capabilities, such as:

- Outbound rules

- Granular network security configuration

- Monitoring

Standard provides a Service Level Agreement (SLA) and can be deployed in Availability Zones. Capabilities in Basic are limited.

RECOMMENDATION: Select DDoS Standard.

DESCRIPTION: Depending on the workload and usage patterns, Standard can provide useful protection. Otherwise, you can use Basic for small customers.

Next step

Operational excellence and Network connectivity

12/16/2022 • 2 minutes to read

Design considerations

Network connectivity includes three Azure models for private network connectivity:

VNet injection

VNet service endpoints

Private Link

VNet injection applies to services that are deployed specifically for you, such as:

Azure Kubernetes Service (AKS) nodes

SQL Managed Instance

Virtual Machines

These resources connect directly to your virtual network.

Virtual Network (VNet) service endpoints provide secure and direct connectivity to Azure services. These service

endpoints use an optimized route over the Azure network. Service endpoints enable private IP addresses in the

VNet to reach the endpoint of an Azure service without needing a public IP address on the VNet.

Private Link provides dedicated access using private IP addresses to Azure PaaS instances, or custom services

behind an Azure Load Balancer Standard.

Network connectivity includes the following design considerations related to operational excellence:

Use Private Link, where available, for shared Azure PaaS services. Private Link is generally available for

several services and is in public preview for numerous ones.

Access Azure PaaS services from on-premises through ExpressRoute private peering.

Use either virtual network injection for dedicated Azure services or Azure Private Link for available

shared Azure services. To access Azure PaaS services from on-premises when virtual network injection or

Private Link isn't available, use ExpressRoute with Microsoft peering. This method avoids transiting over

the public internet.

Use virtual network service endpoints to secure access to Azure PaaS services from within your virtual

network. Use virtual network service endpoints only when Private Link isn't available and there are no

concerns with unauthorized movement of data.

Service Endpoints don't allow a PaaS service to be accessed from on-premises networks. Private

Endpoints do.

To address concerns about unauthorized movement of data with service endpoints, use network-virtual

appliance (NVA) filtering. You can also use virtual network service endpoint policies for Azure Storage.

The following native network security services are fully managed services. Customers don't incur the

operational and management costs associated with infrastructure deployments, which can become

complex at scale:

Azure Firewall

Application Gateway

Azure Front Door

PaaS services are typically accessed over public endpoints. The Azure platform provides capabilities to secure these endpoints or make them entirely private.

You can also use third-party network-virtual appliances (NVAs) if the customer prefers them for situations where native services don't satisfy specific requirements.

Checklist

Have you configured Network connectivity with operational excellence in mind?

Don't implement forced tunneling to enable communication from Azure to Azure resources.

Unless you use network virtual appliance (NVA) filtering, don't use virtual network service endpoints when

there are concerns about unauthorized movement of data.

Don't enable virtual network service endpoints by default on all subnets.

Reliability and Azure Virtual Network

12/16/2022 • 3 minutes to read

Design considerations

Checklist

A fundamental building block for your private network, Azure Virtual Network enables Azure resources to

securely communicate with each other, the internet, and on-premises networks.

Key features of Azure Virtual Network include:

Communication with Azure resources

Communication with the internet

Communication with on-premises resources

Network traffic filtering

For more information, reference What is Azure Virtual Network?

To understand how Azure Virtual Network supports a reliable workload, reference the following topics:

Tutorial: Move Azure VMs across regions

Quickstart: Create a virtual network using the Azure portal

Virtual Network – Business Continuity

The Virtual Network (VNet) includes the following design considerations for a reliable Azure workload:

Overlapping IP address spaces across on-premises and Azure regions creates major contention challenges.

While a Virtual Network address space can be added after creation, this process requires an outage if the

Virtual Network is already connected to another Virtual Network through peering. An outage is necessary

because the Virtual Network peering is deleted and re-created.

Resizing of peered Virtual Networks is in public preview (August 20, 2021).

Some Azure services do require dedicated subnets, such as:

Azure Firewall

Azure Bastion

Virtual Network Gateway

Subnets can be delegated to certain services to create instances of that service within the subnet.

Azure reserves five IP addresses within each subnet, which should be factored in when sizing Virtual Networks and encompassed subnets.
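To see how the five reserved addresses affect subnet sizing, the following minimal Python sketch uses the standard `ipaddress` module. The function names are illustrative and the example covers IPv4 only.

```python
import ipaddress

# Per the text above, Azure reserves five IP addresses in each subnet.
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses left for resources in an IPv4 subnet."""
    subnet = ipaddress.ip_network(cidr, strict=True)
    return max(subnet.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def longest_prefix_for(hosts: int) -> int:
    """Longest IPv4 prefix (smallest subnet) that still fits `hosts` resources."""
    for prefix in range(32, -1, -1):
        if 2 ** (32 - prefix) - AZURE_RESERVED_PER_SUBNET >= hosts:
            return prefix
    raise ValueError("no IPv4 subnet is large enough")


if __name__ == "__main__":
    print(usable_addresses("10.0.0.0/24"))  # 256 total minus 5 reserved
    print(longest_prefix_for(252))          # a /24 only leaves 251, so this needs a /23
```

A subnet that looks large enough on paper can fall short by a few addresses, which is why the reservation matters most for small subnets.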

Have you configured Azure Virtual Network with reliability in mind?

Use Azure DDoS Standard Protection Plans to protect all public endpoints hosted within customer Virtual

Networks.

Enterprise customers must plan for IP addressing in Azure to ensure there's no overlapping IP address space

across considered on-premises locations and Azure regions.

Use IP addresses from the address allocation for private internets (Request for Comment (RFC) 1918).

For environments with limited private IP addresses (RFC 1918) availability, consider using IPv6.

Don't create unnecessarily large Virtual Networks (for example, /16) to ensure there's no unnecessary waste of IP address space.

Don't create Virtual Networks without planning the required address space in advance.

Don't use public IP addresses for Virtual Networks, especially if the public IP addresses don't belong to the customer.

Use VNet Service Endpoints to secure access to Azure Platform as a Service (PaaS) services from within a customer VNet.

To address data exfiltration concerns with Service Endpoints, use Network Virtual Appliance (NVA) filtering and VNet Service Endpoint Policies for Azure Storage.

Don't implement forced tunneling to enable communication from Azure to Azure resources.

Access Azure PaaS services from on-premises through ExpressRoute Private Peering.

To access Azure PaaS services from on-premises networks when VNet injection or Private Link aren't available, use ExpressRoute with Microsoft Peering when there are no data exfiltration concerns.

Don't replicate on-premises perimeter network (also known as DMZ, demilitarized zone, and screened subnet) concepts and architectures into Azure.

Ensure the communication between Azure PaaS services that have been injected into a Virtual Network is locked down within the Virtual Network using user-defined routes (UDRs) and network security groups (NSGs).

Don't use VNet Service Endpoints when there are data exfiltration concerns, unless NVA filtering is used.

Don't enable VNet Service Endpoints by default on all subnets.

Configuration recommendations

Consider the following recommendations to optimize reliability when configuring an Azure Virtual Network:

RECOMMENDATION: Don't create Virtual Networks without planning the required address space in advance.

DESCRIPTION: Adding address space will cause an outage once a Virtual Network is connected through Virtual Network peering.

RECOMMENDATION: Use VNet Service Endpoints to secure access to Azure Platform as a Service (PaaS) services from within a customer VNet.

DESCRIPTION: Only when Private Link isn't available and when there are no data exfiltration concerns.

RECOMMENDATION: Access Azure PaaS services from on-premises through ExpressRoute Private Peering.

DESCRIPTION: Use either VNet injection for dedicated Azure services or Azure Private Link for available shared Azure services.

RECOMMENDATION: To access Azure PaaS services from on-premises networks when VNet injection or Private Link aren't available, use ExpressRoute with Microsoft Peering when there are no data exfiltration concerns.

DESCRIPTION: Avoids transit over the public internet.

RECOMMENDATION: Don't replicate on-premises perimeter network (also known as DMZ, demilitarized zone, and screened subnet) concepts and architectures into Azure.

DESCRIPTION: Customers can get similar security capabilities in Azure as on-premises, but the implementation and architecture will need to be adapted to the cloud.

RECOMMENDATION: Ensure the communication between Azure PaaS services that have been injected into a Virtual Network is locked down within the Virtual Network using user-defined routes (UDRs) and network security groups (NSGs).

DESCRIPTION: Azure PaaS services that have been injected into a Virtual Network still perform management plane operations using public IP addresses.

Next step

Operational excellence and Azure Virtual Network

12/16/2022 • 3 minutes to read

Design considerations

Checklist

A fundamental building block for your private network, Azure Virtual Network enables Azure resources to

securely communicate with each other, the internet, and on-premises networks.

Key features of Azure Virtual Network include:

Communication with Azure resources

Communication with the internet

Communication with on-premises resources

Network traffic filtering

For more information, reference What is Azure Virtual Network?

To understand how Azure Virtual Network supports operational excellence, reference the following topics:

Monitoring Azure Virtual Network

Monitoring Azure Virtual Network data reference

Azure Virtual Network concepts and best practices

The Virtual Network (VNet) includes the following design considerations for operational excellence:

Overlapping IP address spaces across on-premises and Azure regions creates major contention challenges.

While a Virtual Network address space can be added after creation, this process requires an outage if the

Virtual Network is already connected to another Virtual Network through peering. An outage is necessary

because the Virtual Network peering is deleted and re-created.

Resizing of peered Virtual Networks is in public preview (August 20, 2021).

Some Azure services do require dedicated subnets, such as:

Azure Firewall

Azure Bastion

Virtual Network Gateway

Subnets can be delegated to certain services to create instances of that service within the subnet.

Azure reserves five IP addresses within each subnet, which should be factored in when sizing Virtual Networks and encompassed subnets.
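Overlapping address spaces are easier to catch before deployment than after peering. The following Python sketch checks a planned set of IPv4 address spaces for overlaps and for ranges outside RFC 1918. The plan dictionary, its names, and both function names are hypothetical and exist only for illustration.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple

# Private address blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def find_overlaps(address_spaces: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named IPv4 address spaces that overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def outside_rfc1918(address_spaces: Dict[str, str]) -> List[str]:
    """Return the names of address spaces not contained in an RFC 1918 block."""
    return [
        name
        for name, cidr in address_spaces.items()
        if not any(ipaddress.ip_network(cidr).subnet_of(block) for block in RFC1918_BLOCKS)
    ]


if __name__ == "__main__":
    # Hypothetical plan covering on-premises and Azure address spaces.
    plan = {
        "on-premises": "10.0.0.0/16",
        "hub": "10.1.0.0/20",
        "spoke-app": "10.1.8.0/22",
        "legacy": "172.32.0.0/16",
    }
    print(find_overlaps(plan))
    print(outside_rfc1918(plan))
```

Running a check like this against every on-premises location and Azure region in the plan supports the address-planning items in the checklist that follows.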

Have you configured Azure Virtual Network with operational excellence in mind?

Use Azure DDoS Standard Protection Plans to protect all public endpoints hosted within customer Virtual

Networks.

Enterprise customers must plan for IP addressing in Azure to ensure there's no overlapping IP address space

across considered on-premises locations and Azure regions.

Use IP addresses from the address allocation for private internets (Request for Comment (RFC) 1918).

For environments with limited private IP addresses (RFC 1918) availability, consider using IPv6.

Don't create unnecessarily large Virtual Networks (for example, /16) to ensure there's no unnecessary waste of IP address space.

Don't create Virtual Networks without planning the required address space in advance.

Don't use public IP addresses for Virtual Networks, especially if the public IP addresses don't belong to the customer.

Use VNet Service Endpoints to secure access to Azure Platform as a Service (PaaS) services from within a customer VNet.

To address data exfiltration concerns with Service Endpoints, use Network Virtual Appliance (NVA) filtering and VNet Service Endpoint Policies for Azure Storage.

Don't implement forced tunneling to enable communication from Azure to Azure resources.

Access Azure PaaS services from on-premises through ExpressRoute Private Peering.

To access Azure PaaS services from on-premises networks when VNet injection or Private Link aren't available, use ExpressRoute with Microsoft Peering when there are no data exfiltration concerns.

Don't replicate on-premises perimeter network (also known as DMZ, demilitarized zone, and screened subnet) concepts and architectures into Azure.

Ensure the communication between Azure PaaS services that have been injected into a Virtual Network is locked down within the Virtual Network using user-defined routes (UDRs) and network security groups (NSGs).

Don't use VNet Service Endpoints when there are data exfiltration concerns, unless NVA filtering is used.

Don't enable VNet Service Endpoints by default on all subnets.

Configuration recommendations

Consider the following recommendations for operational excellence when configuring an Azure Virtual Network:

RECOMMENDATION: Don't create Virtual Networks without planning the required address space in advance.

DESCRIPTION: Adding address space will cause an outage once a Virtual Network is connected through Virtual Network peering.

RECOMMENDATION: Use VNet Service Endpoints to secure access to Azure Platform as a Service (PaaS) services from within a customer VNet.

DESCRIPTION: Only when Private Link isn't available and when there are no data exfiltration concerns.

RECOMMENDATION: Access Azure PaaS services from on-premises through ExpressRoute Private Peering.

DESCRIPTION: Use either VNet injection for dedicated Azure services or Azure Private Link for available shared Azure services.

RECOMMENDATION: To access Azure PaaS services from on-premises networks when VNet injection or Private Link aren't available, use ExpressRoute with Microsoft Peering when there are no data exfiltration concerns.

DESCRIPTION: Avoids transit over the public internet.

RECOMMENDATION: Don't replicate on-premises perimeter network (also known as DMZ, demilitarized zone, and screened subnet) concepts and architectures into Azure.

DESCRIPTION: Customers can get similar security capabilities in Azure as on-premises, but the implementation and architecture will need to be adapted to the cloud.

RECOMMENDATION: Ensure the communication between Azure PaaS services that have been injected into a Virtual Network is locked down within the Virtual Network using user-defined routes (UDRs) and network security groups (NSGs).

DESCRIPTION: Azure PaaS services that have been injected into a Virtual Network still perform management plane operations using public IP addresses.

Next step

Reliability and ExpressRoute

Reliability and Azure Load Balancer

12/16/2022 • 2 minutes to read

Load balancing refers to evenly distributing load (incoming network traffic) across a group of backend resources or servers. With Azure Load Balancer, you can load-balance traffic to and from virtual machines and cloud resources, and in cross-premises virtual networks.

You can scale your applications and create highly available services with Azure Load Balancer. It supports both inbound and outbound scenarios. Load balancer provides low latency and high throughput.

Key benefits include:

Load balance internal and external traffic to Azure virtual machines.

Increase availability by distributing resources within and across zones.

Configure outbound connectivity for Azure virtual machines.

Use health probes to monitor load-balanced resources.

For more information, reference Why use Azure Load Balancer?

To understand how Azure Load Balancer supports a reliable workload, reference the following topics:

Improve application scalability and resiliency by using Azure Load Balancer

Load Balancer and Availability Zones

High availability ports overview

Checklist

Have you configured Azure Load Balancer with reliability in mind?

For production workloads, use the Standard Stock Keeping Unit (SKU).

Configuration recommendations

Consider the following recommendation to optimize reliability when configuring an Azure Load Balancer:

RECOMMENDATION: For production workloads, use the Standard Stock Keeping Unit (SKU).

DESCRIPTION: Basic load balancers don't have a Service Level Agreement (SLA). The Standard SKU supports Availability Zones.

Next step

Operational excellence and Azure Load Balancer

12/16/2022 • 2 minutes to read

Load balancing refers to evenly distributing load (incoming network traffic) across a group of backend resources or servers. With Azure Load Balancer, you can load-balance traffic to and from virtual machines and cloud resources, and in cross-premises virtual networks.

You can scale your applications and create highly available services with Azure Load Balancer. It supports both inbound and outbound scenarios. Load balancer provides low latency and high throughput.

Key benefits include:

Load balance internal and external traffic to Azure virtual machines.

Increase availability by distributing resources within and across zones.

Configure outbound connectivity for Azure virtual machines.

Use health probes to monitor load-balanced resources.

For more information, reference Why use Azure Load Balancer?

To understand how Azure Load Balancer supports operational excellence, reference the following topics:

Load Balancer health probes

Standard load balancer diagnostics with metrics, alerts, and resource health

Using Insights to monitor and configure your Azure Load Balancer

Checklist

Have you configured Azure Load Balancer with operational excellence in mind?

For production workloads, use the Standard Stock Keeping Unit (SKU).

Configuration recommendations

Consider the following recommendation for operational excellence when configuring an Azure Load Balancer:

RECOMMENDATION: For production workloads, use the Standard Stock Keeping Unit (SKU).

DESCRIPTION: Basic load balancers don't have a Service Level Agreement (SLA). The Standard SKU supports Availability Zones.

Next step

Reliability and Traffic Manager

12/16/2022 • 2 minutes to read

Traffic Manager is a Domain Name System (DNS)-based traffic load balancer. This service allows you to distribute traffic to your public-facing applications across the global Azure regions. Traffic Manager also provides your public endpoints with high availability and quick responsiveness.

Features include:

Increase application availability

Improve application performance

Service maintenance without downtime

Combine hybrid applications

Distribute traffic for complex deployments

For more information, reference What is Traffic Manager?

To learn how Traffic Manager supports a reliable workload, reference the following articles:

Enhance your service availability and data locality by using Azure Traffic Manager

Using load-balancing services in Azure

Disaster recovery using Azure DNS and Traffic Manager

Checklist

Have you configured Traffic Manager with reliability in mind?

If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or DNS record TTL.

Implement a custom page to use as a health check for your Traffic Manager.

Evaluate the three different traffic routing methods.

Consider nested Traffic Manager profiles.

Configuration recommendations

Consider the following recommendations to optimize reliability when configuring Traffic Manager:

RECOMMENDATION: If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or DNS record TTL.

DESCRIPTION: When a backend becomes unavailable, Traffic Manager won't fail over to another region immediately. There will be a time interval where clients can't be served. The length of this interval depends on the time settings of the health probe (probe interval and the number of unhealthy responses allowed). If the resulting interval is still too large for the scenario, consider switching to Azure Front Door for global load balancing.

RECOMMENDATION: Implement a custom page to use as a health check for your Traffic Manager.

DESCRIPTION: A common practice is to implement a custom page within your application (for example: /health.aspx). Using this path for monitoring, you can do application-specific checks, such as checking performance counters or verifying database availability. Based on these custom checks, the page returns an appropriate HTTP status code.

RECOMMENDATION: Evaluate the three different traffic routing methods.

DESCRIPTION: Traffic Manager supports three traffic-routing methods to determine how to route network traffic to the various service endpoints. Traffic Manager applies the traffic-routing method to each DNS query it receives. The traffic-routing method determines which endpoint is returned in the DNS response. The customer should be aware of these endpoints and the differences in routing between endpoints.

RECOMMENDATION: Consider nested Traffic Manager profiles.

DESCRIPTION: Each Traffic Manager profile specifies a single traffic-routing method. There are scenarios that require more sophisticated traffic routing than the routing provided by a single Traffic Manager profile. You can nest Traffic Manager profiles to combine the benefits of more than one traffic-routing method. Nested profiles allow you to override the default Traffic Manager behavior to support larger, more complex application deployments.
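A custom health page can be small. The following Python sketch uses only the standard library and is not tied to any particular framework. The `/health` path, the `database_available` placeholder, and port 8080 are assumptions made for illustration. It also assumes the monitoring probe treats a 200 response as healthy and anything else as unhealthy.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run application-specific checks and map the result to an HTTP status."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ", ".join(failed)
    return 200, "healthy"


def database_available() -> bool:
    """Placeholder check; replace with a real dependency probe."""
    return True


CHECKS = {"database": database_available}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Call `serve()` to start the endpoint, then point the monitoring path at `/health`. Keeping the checks in a plain dictionary makes it easy to add a check for each dependency the application can't run without.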

Next step

Operational excellence and Traffic Manager

12/16/2022 • 2 minutes to read

Traffic Manager is a Domain Name System (DNS)-based traffic load balancer. This service allows you to distribute traffic to your public-facing applications across the global Azure regions. Traffic Manager also provides your public endpoints with high availability and quick responsiveness.

Features include:

Increase application availability

Improve application performance

Service maintenance without downtime

Combine hybrid applications

Distribute traffic for complex deployments

For more information, reference What is Traffic Manager?

To learn how Traffic Manager supports operational excellence, reference the following articles:

Troubleshooting degraded state on Azure Traffic Manager

Traffic Manager endpoint monitoring

Traffic Manager metrics and alerts

Checklist

Have you configured Traffic Manager with operational excellence in mind?

If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or DNS record TTL.

Configuration recommendations

Consider the following recommendation for operational excellence when configuring Traffic Manager:

RECOMMENDATION: If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or DNS record TTL.

DESCRIPTION: When a backend becomes unavailable, Traffic Manager won't fail over to another region immediately. There will be a time interval where clients can't be served. The length of this interval depends on the time settings of the health probe (probe interval and the number of unhealthy responses allowed). If the resulting interval is still too large for the scenario, consider switching to Azure Front Door for global load balancing.
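The failover interval in this article can be estimated with a simplified model before you tune the probe settings or the TTL. The following Python sketch is a rough upper bound under stated assumptions, not an official formula. It assumes the endpoint fails just after a successful probe, that the final failing probe runs to its timeout, and that clients cache the DNS answer for the full TTL. The function name, parameter names, and values are illustrative.

```python
def worst_case_failover_seconds(
    dns_ttl: int,
    probe_interval: int,
    tolerated_failures: int,
    probe_timeout: int,
) -> int:
    """Rough upper bound on the time clients may be sent to a failed endpoint."""
    # The endpoint is marked unhealthy after tolerated_failures + 1 failed probes.
    detection = (tolerated_failures + 1) * probe_interval + probe_timeout
    # Clients may keep using a cached DNS answer until the TTL expires.
    return detection + dns_ttl


if __name__ == "__main__":
    relaxed = worst_case_failover_seconds(
        dns_ttl=60, probe_interval=30, tolerated_failures=3, probe_timeout=10
    )
    aggressive = worst_case_failover_seconds(
        dns_ttl=10, probe_interval=10, tolerated_failures=0, probe_timeout=5
    )
    print(relaxed, aggressive)
```

Comparing a relaxed and an aggressive configuration this way shows which setting dominates the interval. Shorter intervals and TTLs mean more probe and DNS query traffic, so weigh the gain against that overhead.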

Cost optimization and IP addresses

12/16/2022 • 2 minutes to read

IP services are a collection of IP address-related services that enable communication in an Azure Virtual Network. Public and private IP addresses are used in Azure for communication between resources. The communication with resources can occur in a private Azure Virtual Network and the public internet.

Key features include:

Public IP addresses

Public IP address prefixes

Private IP addresses

Routing preference

Routing preference unmetered

For more information, reference What is Azure Virtual Network IP Services?

To understand how IP services support a cost-optimized workload, reference the following articles:

IP addresses pricing

Create, change, or delete an Azure public IP address

Routing over public Internet (ISP network)

Checklist

Have you configured IP addresses with cost optimization in mind?

Public IP addresses (PIPs) are free until used. Static PIPs are billed even when not assigned to resources.

Configuration recommendations

Consider the following recommendation for cost optimization when configuring IP addresses:

RECOMMENDATION: Public IP addresses (PIPs) are free until used. Static PIPs are billed even when not assigned to resources.

DESCRIPTION: There's a difference in billing for regular and static public IP addresses. Develop a process to look for orphan network interface cards (NICs) and PIPs that aren't being used in production and non-production.
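A process for finding orphaned resources can start from an exported inventory. The following Python sketch works on plain records rather than calling any Azure API. The `PublicIp` and `Nic` record shapes, their field names, and the sample data are hypothetical, so map them to whatever your inventory export provides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PublicIp:
    name: str
    allocation: str             # "Static" or "Dynamic"
    attached_to: Optional[str]  # resource using the address, None if unassigned


@dataclass
class Nic:
    name: str
    attached_vm: Optional[str]  # VM using the NIC, None if orphaned


def orphaned_public_ips(ips: List[PublicIp]) -> List[str]:
    """Names of public IPs that aren't assigned to any resource."""
    return [ip.name for ip in ips if ip.attached_to is None]


def billed_while_idle(ips: List[PublicIp]) -> List[str]:
    """Unassigned static public IPs, which the text above says are still billed."""
    return [ip.name for ip in ips if ip.attached_to is None and ip.allocation == "Static"]


def orphaned_nics(nics: List[Nic]) -> List[str]:
    """Names of NICs that aren't attached to a VM."""
    return [nic.name for nic in nics if nic.attached_vm is None]


if __name__ == "__main__":
    # Hypothetical inventory export.
    ips = [
        PublicIp("pip-web", "Static", "lb-frontend"),
        PublicIp("pip-old-test", "Static", None),
        PublicIp("pip-scratch", "Dynamic", None),
    ]
    nics = [Nic("nic-web-01", "vm-web-01"), Nic("nic-retired", None)]
    print(orphaned_public_ips(ips), billed_while_idle(ips), orphaned_nics(nics))
```

Running a report like this on a schedule for both production and non-production subscriptions turns the recommendation into a repeatable review step.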

Cost optimization and Log Analytics

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Configuration recommendations

RECOMMENDATION DESCRIPTION

Log Analytics is a tool in the Azure portal used to edit and run log queries with data in Azure Monitor Logs. You

can write a query that returns a set of records. Then, use features of Log Analytics to sort, filter, and analyze

them. Alternately, write a more advanced query to do statistical analysis. Visualize the results in a chart to

identify a particular trend.

For more information, reference Overview of Log Analytics in Azure Monitor.

To understand how Log Analytics supports cost optimization, reference Manage usage and costs with Azure

Monitor Logs. This article includes:

Estimating the costs to manage your environment

Viewing Log Analytics usage on your Azure bill

Understanding your usage and optimizing your pricing tier

Log Analytics workspace includes the following design considerations:

Consider how long to retain data on Log Analytics:

Data ingested into a Log Analytics workspace can be retained at no additional charge for up to the first 31 days. Consider your general needs when you configure the workspace-level default retention. Consider specific needs when you configure data retention by data type, which can be as low as four days. For example, performance data usually doesn't need to be retained for long, whereas security logs may need to be retained longer.

Consider exporting data for long-term retention or auditing purposes:

Data retained for audit purposes may be exported to a cheaper storage type. Refer to Log Analytics

workspace data export in Azure Monitor (preview).
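As a rough illustration of the retention guidance above, the following Python sketch computes how many retention days fall outside the 31-day allowance for each data type. The table names and retention values in `retention_plan` are hypothetical, and the function name is an illustrative choice rather than part of any Azure SDK.

```python
# Sketch only: values in retention_plan are hypothetical examples.
FREE_RETENTION_DAYS = 31  # from the text: the first 31 days carry no additional charge


def billable_retention_days(retention_days: int,
                            free_days: int = FREE_RETENTION_DAYS) -> int:
    """Return the number of retention days beyond the free allowance."""
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    return max(0, retention_days - free_days)


# Hypothetical per-data-type plan: short retention for performance data,
# longer retention for security logs.
retention_plan = {"Perf": 4, "AppTraces": 31, "SecurityEvent": 180}

billable = {table: billable_retention_days(days)
            for table, days in retention_plan.items()}

for table, days in billable.items():
    print(f"{table}: {days} billable retention days")
```

Only the data types configured beyond the allowance contribute to retention charges in this model, which is why setting retention per data type can matter more than the workspace default.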

Have you configured Log Analytics with cost optimization in mind?

Consider adoption of the Commitment Tiers pricing model to the Log Analytics workspace.

Evaluate usage of daily cap to limit the daily ingestion for your workspace.

Understand Log Analytics workspace usage.

Evaluate ways to reduce data ingestion volume.

Consider the following recommendations for cost optimization when configuring Log Analytics:

Recommendation: Consider adopting the Commitment Tiers pricing model for the Log Analytics workspace.
Description: Commitment Tiers can save as much as 30% compared to Pay-As-You-Go pricing. Commitment Tiers start at 100 GB/day, and any usage above the reservation level is billed at the Pay-As-You-Go rate. Refer to Changing pricing tier for how to change the pricing tier to Capacity Reservations. Use the Log Analytics usage and estimated cost page to analyze data usage and calculate possible Commitment Tiers. Note: Azure Defender (Security Center) billing includes a 500 MB/node/day allocation against the security data types. Take this allocation into consideration when calculating Commitment Tiers.

Recommendation: Evaluate daily cap usage to limit the daily ingestion for your workspace.
Description: The daily cap is used to manage an unexpected increase in data volume. Use the daily cap when you want to limit unplanned charges for your workspace. Use care with this configuration, because once the daily cap is reached, some data might not be written to the Log Analytics workspace. This configuration can impact services whose functionality depends on the availability of up-to-date data in the workspace. Refer to Set the Daily Cap for how to set the daily cap in Application Insights. Note: If you have a workspace-based Application Insights resource, use the daily cap in the workspace to limit ingestion and costs instead of using the cap in Application Insights.

Recommendation: Understand Log Analytics workspace usage.
Description: When Log Analytics workspace usage is higher than expected, consult the troubleshooting guide and the Understanding ingested data volume guide to understand the unexpected behavior.

Recommendation: Evaluate ways to reduce data ingestion volume.
Description: Refer to the Tips for reducing data volume documentation to help configure data ingestion properly.
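The Commitment Tiers guidance above can be sketched as simple arithmetic. The Python example below is a minimal model, not a pricing calculator: the Pay-As-You-Go rate and the tier discount are hypothetical placeholders, and it assumes the tier is charged in full even when ingestion falls below the tier size. Only the overage rule, the 100 GB/day starting tier, and the 500 MB/node/day allocation come from the text.

```python
# Sketch only: PAYG_RATE and TIER_DISCOUNT are hypothetical, not published prices.
PAYG_RATE = 2.0        # hypothetical price per GB
TIER_GB = 100.0        # from the text: Commitment Tiers start at 100 GB/day
TIER_DISCOUNT = 0.15   # hypothetical discount for this tier


def payg_daily_cost(gb_per_day: float, payg_rate: float = PAYG_RATE) -> float:
    """Daily cost when every GB is billed at the Pay-As-You-Go rate."""
    return gb_per_day * payg_rate


def commitment_daily_cost(gb_per_day: float,
                          tier_gb: float = TIER_GB,
                          tier_discount: float = TIER_DISCOUNT,
                          payg_rate: float = PAYG_RATE) -> float:
    """Daily cost under a commitment tier.

    Assumption: the tier is paid in full regardless of usage.
    From the text: usage above the tier is billed at the Pay-As-You-Go rate.
    """
    tier_price = tier_gb * payg_rate * (1 - tier_discount)
    overage_gb = max(0.0, gb_per_day - tier_gb)
    return tier_price + overage_gb * payg_rate


def billable_gb(total_gb: float, security_gb: float, defender_nodes: int,
                allowance_gb_per_node: float = 0.5) -> float:
    """Subtract the per-node security allocation (500 MB/node/day in the text).

    The allocation applies only to security data types, so it is capped
    at the security volume actually ingested.
    """
    allowance = min(security_gb, defender_nodes * allowance_gb_per_node)
    return total_gb - allowance


for gb in (70.0, 120.0):
    print(gb, payg_daily_cost(gb), commitment_daily_cost(gb))
```

In this model the tier only pays off once billable ingestion is consistently above the point where the discounted tier price equals the Pay-As-You-Go cost, which is why the text recommends analyzing actual usage first.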

Next step

Security and Application Insights

12/16/2022 • 2 minutes to read

Checklist

Configuration recommendations

Recommendation: Review instances where customer data is captured in your application.
Description: We don't recommend collecting customer data in Application Insights, although it can be unavoidable. It's up to you and your company to determine the strategy you'll use to handle your private data.

Next step

Application Insights is a feature of Azure Monitor. This feature provides extensible application performance

management (APM) and monitoring for live web apps.

Key features include:

Supports a wide variety of platforms, including .NET, Node.js, Java, and Python.

Works for apps hosted on-premises, hybrid, or on any public cloud.

Integrates with DevOps processes.

Has connection points to many development tools.

Can monitor and analyze customer data from mobile apps by integrating with Visual Studio App Center.

For more information, reference Application Insights overview.

Have you configured Application Insights with security in mind?

Review instances where customer data is captured in your application.

Consider the following security recommendation when configuring Application Insights:

Cost optimization and Application Insights

12/16/2022 • 2 minutes to read

Design considerations

Checklist

Configuration recommendations


Application Insights is a feature of Azure Monitor. This feature provides extensible application performance

management (APM) and monitoring for live web apps.

Key features include:

Supports a wide variety of platforms, including .NET, Node.js, Java, and Python.

Works for apps hosted on-premises, hybrid, or on any public cloud.

Integrates with DevOps processes.

Has connection points to many development tools.

Can monitor and analyze customer data from mobile apps by integrating with Visual Studio App Center.

For more information, reference Application Insights overview.

Application Insights includes the following design considerations for cost optimization:

Consider using sampling to reduce the amount of data that's sent:

Sampling is a feature in Application Insights. It's a recommended way to reduce data traffic, data costs, and storage costs. Refer to Sampling in Application Insights.

Consider turning off collection for unneeded modules:

In configuration files, you can enable or disable data modules and initializers for tracking data from your applications. Refer to Application Insights for web pages.

Consider limiting Asynchronous JavaScript and XML (AJAX) call tracing:

AJAX calls can be limited to reduce costs. Refer to Application Insights for web pages, which explains the fields and their configuration.
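To show why sampling can cut data volume while keeping aggregate counts usable, the following Python sketch implements a generic fixed-rate sampler. It is not the Application Insights SDK implementation: the hashing scheme, the `operation_id` and `item_count` field names, and the function names are assumptions made for illustration.

```python
# Sketch only: a generic fixed-rate sampler, not the Application Insights SDK.
import hashlib


def keep(operation_id: str, sampling_percentage: float) -> bool:
    """Decide deterministically whether telemetry for an operation is kept.

    Hashing the operation ID means every item that shares the ID gets
    the same decision, so related items are kept or dropped together.
    """
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sampling_percentage * 100


def sample(items: list[dict], sampling_percentage: float) -> list[dict]:
    """Keep a fraction of items and record how many originals each represents."""
    weight = 100 / sampling_percentage
    return [{**item, "item_count": weight}
            for item in items
            if keep(item["operation_id"], sampling_percentage)]


def estimated_total(kept: list[dict]) -> float:
    """Estimate the pre-sampling item count from the retained items."""
    return sum(item["item_count"] for item in kept)


telemetry = [{"operation_id": f"op-{n}", "name": "GET /basket"}
             for n in range(10_000)]
kept = sample(telemetry, sampling_percentage=25)
print(len(kept), "items kept; estimated original volume:", estimated_total(kept))
```

Because each retained item carries a weight, counts computed from the sample approximate the original volume even though only a fraction of the items is sent and stored.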

Have you configured Application Insights with cost optimization in mind?

Evaluate usage of daily cap to limit the daily ingestion for your workspace.

Use sampling in Azure Application Insights to reduce data traffic, data costs, and storage costs, while

preserving a statistically correct analysis of application data.

Consider the following recommendations for cost optimization when configuring Application Insights:

Recommendation: Evaluate daily cap usage to limit the daily ingestion for your workspace.
Description: The daily cap is used to manage an unexpected increase in data volume. Use the daily cap when you want to limit unplanned charges for your workspace. Use care with this configuration, because once the daily cap is reached, some data might not be written to the Log Analytics workspace. This configuration can impact services whose functionality depends on the availability of up-to-date data in the workspace. Refer to Set the Daily Cap for how to set the daily cap in Application Insights. Note: If you have a workspace-based Application Insights resource, use the daily cap in the workspace to limit ingestion and costs instead of using the cap in Application Insights.
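The tradeoff behind the daily cap can be illustrated with a simplified model. The Python sketch below treats a day's ingestion as a sequence of batches and stops accepting data once the cap is reached. The batch sizes and cap value are hypothetical, and the real service may behave differently at the boundary, so treat this as an aid for reasoning about data loss rather than a description of the service.

```python
# Sketch only: a simplified model of a daily ingestion cap.
def apply_daily_cap(batches_gb: list[float],
                    daily_cap_gb: float) -> tuple[float, float]:
    """Return (ingested_gb, dropped_gb) for one day of ingestion batches.

    Data arriving after the cap is reached is counted as dropped, which
    is the risk the recommendation warns about.
    """
    ingested = 0.0
    dropped = 0.0
    for size in batches_gb:
        room = max(0.0, daily_cap_gb - ingested)
        accepted = min(size, room)
        ingested += accepted
        dropped += size - accepted
    return ingested, dropped


# Hypothetical day: an unexpected spike arrives in the third batch.
ingested, dropped = apply_daily_cap([2.0, 3.0, 4.0, 1.0], daily_cap_gb=6.0)
print(f"ingested={ingested} GB, dropped={dropped} GB")
```

The cap bounds the charge for the day, but anything that depends on the dropped data, such as alerts or dashboards, sees a gap until the next day.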

Next step

Operational excellence and Application Insights

12/16/2022 • 3 minutes to read

Checklist

Configuration recommendations

Recommendation: Configure Application Insights to monitor the availability and responsiveness of your web application.
Description: After you've deployed your application, you can set up recurring tests to monitor availability and responsiveness. Application Insights sends web requests to your application at regular intervals from points around the world. It can alert you if your application isn't responding or if it responds too slowly.
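The idea behind a recurring availability test can be sketched in a few lines of Python. This is a conceptual stand-in, not how Application Insights runs its tests: the `probe` function, the `ProbeResult` type, and the two-second threshold are hypothetical, and the fetch and clock functions are injectable so the check can be exercised without network access.

```python
# Sketch only: a single availability probe, not the Application Insights service.
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool
    status: Optional[int]
    elapsed_s: float
    reason: str


def http_get_status(url: str, timeout_s: float) -> int:
    """Fetch the URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(url: str,
          max_elapsed_s: float = 2.0,   # hypothetical responsiveness threshold
          timeout_s: float = 10.0,
          fetch: Callable[[str, float], int] = http_get_status,
          clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Check that the endpoint responds with 200 and responds quickly enough."""
    start = clock()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:  # unreachable, timed out, or an HTTP error status
        return ProbeResult(False, None, clock() - start, f"request failed: {exc}")
    elapsed = clock() - start
    if status != 200:
        return ProbeResult(False, status, elapsed, "unexpected status")
    if elapsed > max_elapsed_s:
        return ProbeResult(False, status, elapsed, "too slow")
    return ProbeResult(True, status, elapsed, "ok")
```

A scheduler would call `probe` at regular intervals from several locations and raise an alert when results stop being `ok`, which mirrors the two failure modes named above: not responding and responding too slowly.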

Application Insights is a feature of Azure Monitor. This feature provides extensible application performance

management (APM) and monitoring for live web apps.

Key features include:

Supports a wide variety of platforms, including .NET, Node.js, Java, and Python.

Works for apps hosted on-premises, hybrid, or on any public cloud.

Integrates with DevOps processes.

Has connection points to many development tools.

Can monitor and analyze customer data from mobile apps by integrating with Visual Studio App Center.

For more information, reference Application Insights overview.

Have you configured Application Insights with operational excellence in mind?

Configure Application Insights to monitor the availability and responsiveness of your web application.

Be aware that Application Insights can be used to monitor sites and services deployed on-premises or on an Azure Virtual Machine (VM).

Evaluate Java codeless application monitoring for your Java-based application development stack.

Configure sampling in Application Insights.

Record custom events and metrics from sites and services in Application Insights.

Use Application Insights to ingest existing log traces from common libraries, such as ILogger, Nlog, and log4Net.

Become familiar with the Application Insights quotas and limits.

Review the need for custom analysis. Use Application Insights data with tools such as Azure Dashboards or

Power BI.

Separate data across Application Insights resources.

Consider the following recommendations for operational excellence when configuring Application Insights:

Recommendation: Evaluate Java codeless application monitoring for your Java-based application development stack.
Description: Java codeless application monitoring is all about simplicity. There are no code changes. You can enable the Java agent through a couple of configuration changes. The Java agent works in any environment and allows you to monitor all your Java applications. Whether you're running your Java apps on Virtual Machines, on-premises, in Azure Kubernetes Service (AKS), on Windows, or on Linux, the Java 3.0 agent will monitor your app.

Recommendation: Configure sampling in Application Insights.
Description: Ingestion sampling operates at the point where the data from your web servers, browsers, and devices reaches the Application Insights service endpoints. Although it doesn't reduce the data sent from your app, it does reduce the amount processed, retained, and charged by Application Insights. Use this type of sampling if your app often goes above its monthly quota. Use ingestion sampling if you don't have access to the Software Development Kit (SDK)-based types of sampling.

Recommendation: Record custom events and metrics from sites and services in Application Insights.
Description: Use Application Insights to record domain-specific custom events and metrics from your site or service. For example:
number-of-active-baskets
product-lines-out-of-stock

Recommendation: Use Application Insights to ingest existing log traces from common libraries, such as ILogger, Nlog, and log4Net.
Description: If you're already using a logging framework such as ILogger, Nlog, log4Net, or System.Diagnostics.Trace, we recommend sending your diagnostic tracing logs to Application Insights. For Python applications, send diagnostic tracing logs using AzureLogHandler in OpenCensus Python for Azure Monitor. You can explore and search these logs, which are merged with the other log files from your application. Merging the log files allows you to identify traces associated with each user request and correlate them with other events and exception reports.

Recommendation: Become familiar with the Application Insights quotas and limits.
Description: This information can influence your sampling model and your strategy for separating Application Insights resources.

Recommendation: Review the need for custom analysis. Use Application Insights data with tools such as Azure Dashboards or Power BI.
Description: There are several available options to analyze your Application Insights data. For example, you can create a dashboard in the Azure portal that includes tiles visualizing data from multiple Azure resources across different resource groups and subscriptions. Alternatively, you can use Power BI to analyze data combined with data from other sources and share insights.

Recommendation: Separate data across Application Insights resources.
Description: It's important to consider when to share a single Application Insights resource and when to create a new one. For example, you should use a single resource for application components that are deployed together, developed by a single team, or managed by the same set of DevOps or ITOps users. You should use a separate resource for different environments.
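The value of forwarding existing log traces lies in correlating them per user request. The Python sketch below shows that idea with the standard `logging` module only. `InMemoryTelemetryHandler` is a hypothetical stand-in for an exporter such as AzureLogHandler, used so the example runs without extra packages, and the `operation_id` field and the logger name are assumptions for illustration.

```python
# Sketch only: InMemoryTelemetryHandler stands in for a real exporter.
import logging


class InMemoryTelemetryHandler(logging.Handler):
    """Collect trace records in memory instead of sending them to a service."""

    def __init__(self) -> None:
        super().__init__()
        self.traces: list[dict] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.traces.append({
            "message": record.getMessage(),
            "severity": record.levelname,
            # Correlation key attached by the caller through `extra`.
            "operation_id": getattr(record, "operation_id", None),
        })


def traces_for(handler: InMemoryTelemetryHandler, operation_id: str) -> list[dict]:
    """Return the traces that belong to one user request."""
    return [t for t in handler.traces if t["operation_id"] == operation_id]


logger = logging.getLogger("checkout")  # hypothetical application logger
logger.setLevel(logging.INFO)
logger.propagate = False
handler = InMemoryTelemetryHandler()
logger.addHandler(handler)

# Existing logging calls stay unchanged apart from carrying a correlation ID.
logger.info("basket loaded", extra={"operation_id": "req-1"})
logger.warning("payment retry", extra={"operation_id": "req-1"})
logger.info("basket loaded", extra={"operation_id": "req-2"})

for trace in traces_for(handler, "req-1"):
    print(trace["severity"], trace["message"])
```

Attaching a handler leaves the application's logging calls in place, which is the main appeal of ingesting traces from a framework you already use.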

Next step

Operational excellence and Application Insights

Implementing recommendations

12/16/2022 • 2 minutes to read

Overview

The purpose of this document is to provide a guide for incorporating the recommendations generated by the Microsoft Azure Well-Architected Review and Azure Advisor into a new or established operational process for continuous workload improvement.

Assess workload

The first step in the Well-Architected Recommendation Process is to conduct an assessment of your workload. The Microsoft Azure Well-Architected Review tool generates a set of recommendations through a guided assessment based on the Microsoft Well-Architected Framework. The tool can also pull in Azure Advisor recommendations based on an Azure subscription or resource group. At the end of the assessment, you can export these recommendations to a CSV file and use that file to incorporate them into the operational process for the workload.

NOTE

When using the Microsoft Azure Well-Architected Review tool, it's important to sign in and select the Azure subscription or resource group that contains your workload. Doing so ensures that only the relevant Azure Advisor recommendations are included when exporting the CSV.

Integrate recommendations

The goal of this stage is to establish a backlog of work items, through either a manual or an automated process, based on the recommendations generated by the assessment. The process and tooling for managing the backlog for your workload should be well-defined based on your cloud operating model.

NOTE

The Microsoft Cloud Adoption Framework for Azure is proven guidance that's designed to help you create and implement the business and technology strategies necessary for your organization to succeed in the cloud. This guidance can help you define the cloud operating model that governs the operational process for your workload.

If you're using a DevOps approach for your operational process, Microsoft has provided example scripts that automate the import of the recommendations from the Well-Architected Review exported CSV into a new Azure DevOps or GitHub project within your existing organization.

Download example import scripts at DevOps Tooling for Well-Architected Recommendation Process.

Triage backlog

Now that the recommendations have been added to the backlog, the workload owners and key stakeholders should prioritize and then triage the recommendations in a weekly standup meeting with the workload team. For each recommendation, they either assign it to a specific owner, postpone it, or dismiss it. When a recommendation is assigned to an owner, track it until it's resolved and send weekly or monthly reminders to the assignee.

When going through this process, it's recommended to align responsibilities across teams by developing a cross-team matrix that identifies responsible, accountable, consulted, and informed (RACI) parties. Some of the key benefits of this exercise include:

Assisting teams in charting roles and responsibilities in a consistent manner
Assisting teams with development of implementation tool kits
Clarifying individual and departmental roles and responsibilities
Identifying accountabilities
Eliminating misunderstandings and encouraging teamwork
Reducing duplication of effort
Establishing consults and informs, resulting in better communication

Implement work items

This stage of the process is focused on working through the backlog of recommendations. How you organize this work will depend on your cloud operating model. For example, an Agile-based process would emphasize sprint planning paired with frequent stand-ups so that progress can be tracked and reported.

Monitor progress

This stage of the process is to monitor improvements to a workload relative to the Microsoft Well-Architected Framework through each iteration. It's important to establish a baseline based on the Microsoft Azure Well-Architected Review and Azure Advisor so that you can track improvements back to key recommendations and metrics.

The Azure Advisor Score is a key metric that is available for tracking and socializing improvements during this continuous process.
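The integrate and triage stages described in this article can be sketched as a small Python script. This is not one of the Microsoft example import scripts: the CSV column names (`Recommendation`, `Pillar`, `Priority`), the sample rows, the owner name, and the status values are all hypothetical, because the layout of the exported CSV isn't described here.

```python
# Sketch only: column names, sample rows, and statuses are hypothetical.
import csv
import io
from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str
    pillar: str
    priority: int          # assumption: lower number means higher priority
    owner: str = ""
    status: str = "new"    # new | assigned | postponed | dismissed


def load_backlog(csv_text: str) -> list[WorkItem]:
    """Turn exported recommendations into backlog work items."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [WorkItem(title=row["Recommendation"].strip(),
                     pillar=row["Pillar"].strip(),
                     priority=int(row["Priority"]))
            for row in reader]


def triage(backlog: list[WorkItem],
           decisions: dict[str, tuple[str, str]]) -> list[WorkItem]:
    """Prioritize, then assign, postpone, or dismiss each item.

    Items without an explicit decision are postponed to the next review.
    """
    ordered = sorted(backlog, key=lambda item: item.priority)
    for item in ordered:
        action, owner = decisions.get(item.title, ("postpone", ""))
        if action == "assign":
            item.owner, item.status = owner, "assigned"
        elif action == "dismiss":
            item.status = "dismissed"
        else:
            item.status = "postponed"
    return ordered


SAMPLE_CSV = """Recommendation,Pillar,Priority
Configure sampling in Application Insights,Operational Excellence,2
Evaluate daily cap usage,Cost Optimization,1
Review captured customer data,Security,3
"""

decisions = {
    "Evaluate daily cap usage": ("assign", "alice"),
    "Review captured customer data": ("dismiss", ""),
}

for item in triage(load_backlog(SAMPLE_CSV), decisions):
    print(item.priority, item.status, item.owner or "-", item.title)
```

In practice the backlog would live in a tracking tool such as Azure DevOps or GitHub rather than in memory. The sketch only shows the shape of the data flow from the exported CSV to triaged work items.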