ContentsContents
Microsoft Azure Well
-
Architected Framework
Overview
Reliability
About
Overview
Principles
Design
Checklist
Requirements
Application design
Resiliency & dependencies
Best practices
Testing
Checklist
Resiliency testing
Backup & recovery
Backup and recovery
Automatic retry of failed backup jobs
Error handling
Chaos engineering
Best practices
Monitoring
Checklist
Application health
Health modeling
Best practices
Reliability patterns
Security
About
Overview
Principles
Design
Governance
Checklist
Compliance requirements
Landing zone
Segmentation strategy
Management groups
Administration
Identity and access management
Checklist
Roles and responsibilities
Control plane
Authentication
Authorization
Best practices
Networking
Checklist
Network segmentation
Connectivity
Application endpoints
Data flow
Best practices
Data protection
Checklist
Encryption
Key and secret management
Best practices
Applications and services
Application security considerations
Application classification
Threat analysis
Secure PaaS deployments
Configuration and dependencies
Build
-
deploy
Checklist
Governance considerations
Infrastructure provisioning
Code deployments
Monitor
-
remediate
Checklist
Tools
Azure resources
Logs and alerts
Review and remediate
Compliance review
Validate and test
Security operations
Tradeoffs
Cost Optimization
About
Overview
Principles
Design
Checklist
Cost model
Capture requirements
Azure regions
Azure resources
Governance
Initial estimate
Managed services
Performance and price options
Provision
Checklist
AI + Machine Learning
Big data
Compute
Data stores
Messaging
Networking
Cost for networking services
Web apps
Monitor
Checklist
Budgets and alerts
Reports
Reviews
Optimize
Checklist
Autoscale
VM instances
Caching
Tradeoffs
Operational Excellence
About
Overview
Principles
Automation
Automation overview
Repeatable infrastructure
Configure infrastructure
Automate operational tasks
Release engineering
Application development
Continuous integration
Testing
Performance
Deployment
Rollback
Monitor
Checklist
Monitor cloud applications
Monitoring stages
Data sources
Instrumentation
Collection and storage
Analysis
Visualization
Alerting
Common use cases
Health monitoring
Usage monitoring
Issue tracking
Tracing and debugging
Auditing
Operational Excellence patterns
Performance Efficiency
About
Overview
Principles
Design
Checklist
Distributed architecture challenges
Application design
Application efficiency
Scalability
Capacity planning
Test
Checklist
Performance testing
Testing tools
Monitoring
Checklist
Application profiling
Analyze infrastructure
Performance data
Performance Efficiency patterns
Checklist
Tradeoffs
Workloads
Mission
-
critical
Quick links
Get started
Design methodology
Design principles
Architecture pattern
Cross
-
cutting concerns
Design areas
Application design
Application platform
Data platform
Networking and connectivity
Health modeling
Deployment and testing
Security
Operational procedures
Assessment tool
Carrier
-
grade
Quick links
Get started
Design principles
Design areas
Fault tolerance
Data model
Health modeling
Testing and validation
Hybrid
Overview
Cost Optimization
Operational Excellence
Performance Efficiency
Reliability
Security
IoT
Overview
Reliability
Security
Cost Optimization
Operational Excellence
Performance Efficiency
SAP
Overview
Reliability
Security
Cost Optimization
Operational Excellence
Performance Efficiency
Sustainability
Quick links
Get started
Design methodology
Design principles
Design areas
Application design
Application platform
Testing
Operational procedures
Networking and connectivity
Storage
Security
Services
Compute
Azure Service Fabric
Azure App Service
Reliability
Cost optimization
Operational excellence
Azure Batch
Reliability
Operational excellence
Performance efficiency
Azure Kubernetes Service
Functions
Security
Virtual Machines
Data
Azure Cache for Redis
Reliability
Operational excellence
Azure Databricks
Security
Azure Database for MySQL
Cost optimization
Azure Database for PostgreSQL
Cost optimization
Azure SQL Database
Azure SQL Managed Instance
Reliability
Operational excellence
Azure Cosmos DB
Reliability
Operational excellence
Hybrid
Azure Stack Hub
Reliability
Operational excellence
Storage
Storage Accounts
Reliability
Security
Cost optimization
Operational excellence
Disks
Cost optimization
Messaging
Event Grid
Reliability
Operational excellence
Event Hubs
Reliability
Operational excellence
Service Bus
Reliability
Operational excellence
Queue Storage
Reliability
Operational excellence
IoT Hub
Reliability
Operational excellence
IoT Hub Device Provisioning Service
Reliability
Operational excellence
Networking
Application Delivery
(
General
)
Reliability
Operational excellence
Application Gateway v2
Azure Firewall
ExpressRoute
API Management
Reliability
Cost optimization
Operational excellence
Azure Front Door
Reliability
Security
Operational excellence
Network Virtual Appliances
(
NVA
)
Reliability
Cost optimization
Operational excellence
Network Connectivity
Reliability
Cost optimization
Operational excellence
Azure Virtual Network
Reliability
Operational excellence
Azure Load Balancer
Reliability
Operational excellence
Traffic Manager
Reliability
Operational excellence
IP addresses
Cost optimization
Monitoring
Log Analytics
Cost optimization
Application Insights
Security
Cost optimization
Operational excellence
Implementing Recommendations
Microsoft Azure Well
-
Architected Framework
12/16/2022 4 minutes to read • Edit Online
P ILLA RP ILLA R DESC RIP T IONDESC RIP T ION
Reliability The ability of a system to recover from failures and continue
to function.
Security Protecting applications and data from threats.
Cost Optimization Managing costs to maximize the value delivered.
Operational Excellence Operations processes that keep a system running in
production.
Performance Efficiency The ability of a system to adapt to changes in load.
Overview
The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a
workload. The framework consists of five pillars of architectural excellence:
Reliability
Security
Cost Optimization
Operational Excellence
Performance Efficiency
Incorporating these pillars helps produce a high quality, stable, and efficient cloud architecture:
Reference the following video about how to architect successful workloads on Azure with the Well-Architected
Framework:
The following diagram gives a high-level overview of the Azure Well-Architected Framework:
Assess your workloadAssess your workload
In the center, is the Well-Architected Framework, which includes the five pillars of architectural excellence.
Surrounding the Well-Architected Framework are six supporting elements:
Azure Well-Architected Review
Azure Advisor
Documentation
Partners, Support, and Services Offers
Reference Architectures
Design Principles
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
We also recommend you use Azure Advisor and Advisor Score to identify and prioritize opportunities to
improve the posture of your workloads. Both services are free to all Azure users and align to the five pillars of
Reliability
Reliability guidanceReliability guidance
Security
Security guidanceSecurity guidance
the Well-Architected Framework:
Azure AdvisorAzure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your
Azure deployments. It analyzes your resource configuration and usage telemetry. It recommends
solutions that can help you improve the reliability, security, cost effectiveness, performance, and
operational excellence of your Azure resources. Learn more about Azure Advisor.
Advisor ScoreAdvisor Score is a core feature of Azure Advisor that aggregates Advisor recommendations into a
simple, actionable score. This score enables you to tell at a glance if you're taking the necessary steps to
build reliable, secure, and cost-efficient solutions, and to prioritize the actions that will yield the biggest
improvement to the posture of your workloads. The Advisor score consists of an overall score, which can
be further broken down into five category scores corresponding to each of the Well-Architected pillars.
Learn more about Advisor Score.
A reliable workload is one that is both resilient and available. Resiliency is the ability of the system to recover
from failures and continue to function. The goal of resiliency is to return the application to a fully functioning
state after a failure occurs. Availability is whether your users can access your workload when they need to.
For more information about resiliency, reference the following video that will show you how to start improving
the reliability of your Azure workloads:
The following topics offer guidance on designing and improving reliable Azure applications:
Designing reliable Azure applications
Design patterns for resiliency
Best practices:
Transient fault handling
Retry guidance for specific services
For an overview of reliability principles, reference Principles of the reliability pillar.
Think about security throughout the entire lifecycle of an application, from design and implementation to
deployment and operations. The Azure platform provides protections against various threats, such as network
intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps
processes.
Ask the right questions about secure application development on Azure by referencing the following video:
Consider the following broad security areas:
Identity management
Protect your infrastructure
Cost optimization
Cost guidanceCost guidance
Operational excellence
Operational excellence guidanceOperational excellence guidance
Performance efficiency
Application security
Data sovereignty and encryption
Security resources
For more information, reference Overview of the security pillar.
When you're designing a cloud solution, focus on generating incremental value early. Apply the principles of
Build-Measure-LearnBuild-Measure-Learn, to accelerate your time to market while avoiding capital-intensive solutions.
For more information, reference Cost optimization and the following video on how to start optimizing your
Azure costs:
The following topics offer cost optimization guidance as you develop the Well-Architected Framework for your
workload:
Review cost principles
Develop a cost model
Create budgets and alerts
Review the cost optimization checklist
For a high-level overview, reference Overview of the cost optimization pillar.
Operational excellence covers the operations and processes that keep an application running in production.
Deployments must be reliable and predictable. Automate deployments to reduce the chance of human error.
Fast and routine deployment processes won't slow down the release of new features or bug fixes. Equally
important, you must quickly roll back or roll forward if an update has problems.
For more information, reference the following video about bringing security into your DevOps practice on
Azure:
The following topics provide guidance on designing and implementing DevOps practices for your Azure
workload:
Design patterns for operational excellence
Best practices: Monitoring and diagnostics
For a high-level summary, reference Overview of the operational excellence pillar.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. The main ways to achieve performance efficiency include using scaling appropriately and
Performance efficiency guidancePerformance efficiency guidance
Next steps
implementing PaaS offerings that have scaling built in.
For more information, watch Performance Efficiency: Fast & Furious: Optimizing for Quick and Reliable VM
Deployments.
The following topics offer guidance on how to design and improve the performance efficiency posture of your
Azure workload:
Design patterns for performance efficiency
Best practices:
Autoscaling
Background jobs
Caching
CDN
Data partitioning
For a high-level synopsis, reference Overview of the performance efficiency pillar.
Learn more about:
Azure Well-Architected Review
Well-Architected Series
Introduction to the Microsoft Azure Well-Architected Framework
Microsoft Defender for Cloud
Cloud Adoption Framework
Microsoft Azure Well
-
Architected Framework
12/16/2022 4 minutes to read • Edit Online
P ILLA RP ILLA R DESC RIP T IONDESC RIP T ION
Reliability The ability of a system to recover from failures and continue
to function.
Security Protecting applications and data from threats.
Cost Optimization Managing costs to maximize the value delivered.
Operational Excellence Operations processes that keep a system running in
production.
Performance Efficiency The ability of a system to adapt to changes in load.
Overview
The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a
workload. The framework consists of five pillars of architectural excellence:
Reliability
Security
Cost Optimization
Operational Excellence
Performance Efficiency
Incorporating these pillars helps produce a high quality, stable, and efficient cloud architecture:
Reference the following video about how to architect successful workloads on Azure with the Well-Architected
Framework:
The following diagram gives a high-level overview of the Azure Well-Architected Framework:
Assess your workloadAssess your workload
In the center, is the Well-Architected Framework, which includes the five pillars of architectural excellence.
Surrounding the Well-Architected Framework are six supporting elements:
Azure Well-Architected Review
Azure Advisor
Documentation
Partners, Support, and Services Offers
Reference Architectures
Design Principles
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
We also recommend you use Azure Advisor and Advisor Score to identify and prioritize opportunities to
improve the posture of your workloads. Both services are free to all Azure users and align to the five pillars of
Reliability
Reliability guidanceReliability guidance
Security
Security guidanceSecurity guidance
the Well-Architected Framework:
Azure AdvisorAzure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your
Azure deployments. It analyzes your resource configuration and usage telemetry. It recommends
solutions that can help you improve the reliability, security, cost effectiveness, performance, and
operational excellence of your Azure resources. Learn more about Azure Advisor.
Advisor ScoreAdvisor Score is a core feature of Azure Advisor that aggregates Advisor recommendations into a
simple, actionable score. This score enables you to tell at a glance if you're taking the necessary steps to
build reliable, secure, and cost-efficient solutions, and to prioritize the actions that will yield the biggest
improvement to the posture of your workloads. The Advisor score consists of an overall score, which can
be further broken down into five category scores corresponding to each of the Well-Architected pillars.
Learn more about Advisor Score.
A reliable workload is one that is both resilient and available. Resiliency is the ability of the system to recover
from failures and continue to function. The goal of resiliency is to return the application to a fully functioning
state after a failure occurs. Availability is whether your users can access your workload when they need to.
For more information about resiliency, reference the following video that will show you how to start improving
the reliability of your Azure workloads:
The following topics offer guidance on designing and improving reliable Azure applications:
Designing reliable Azure applications
Design patterns for resiliency
Best practices:
Transient fault handling
Retry guidance for specific services
For an overview of reliability principles, reference Principles of the reliability pillar.
Think about security throughout the entire lifecycle of an application, from design and implementation to
deployment and operations. The Azure platform provides protections against various threats, such as network
intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps
processes.
Ask the right questions about secure application development on Azure by referencing the following video:
Consider the following broad security areas:
Identity management
Protect your infrastructure
Cost optimization
Cost guidanceCost guidance
Operational excellence
Operational excellence guidanceOperational excellence guidance
Performance efficiency
Application security
Data sovereignty and encryption
Security resources
For more information, reference Overview of the security pillar.
When you're designing a cloud solution, focus on generating incremental value early. Apply the principles of
Build-Measure-LearnBuild-Measure-Learn, to accelerate your time to market while avoiding capital-intensive solutions.
For more information, reference Cost optimization and the following video on how to start optimizing your
Azure costs:
The following topics offer cost optimization guidance as you develop the Well-Architected Framework for your
workload:
Review cost principles
Develop a cost model
Create budgets and alerts
Review the cost optimization checklist
For a high-level overview, reference Overview of the cost optimization pillar.
Operational excellence covers the operations and processes that keep an application running in production.
Deployments must be reliable and predictable. Automate deployments to reduce the chance of human error.
Fast and routine deployment processes won't slow down the release of new features or bug fixes. Equally
important, you must quickly roll back or roll forward if an update has problems.
For more information, reference the following video about bringing security into your DevOps practice on
Azure:
The following topics provide guidance on designing and implementing DevOps practices for your Azure
workload:
Design patterns for operational excellence
Best practices: Monitoring and diagnostics
For a high-level summary, reference Overview of the operational excellence pillar.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. The main ways to achieve performance efficiency include using scaling appropriately and
Performance efficiency guidancePerformance efficiency guidance
Next steps
implementing PaaS offerings that have scaling built in.
For more information, watch Performance Efficiency: Fast & Furious: Optimizing for Quick and Reliable VM
Deployments.
The following topics offer guidance on how to design and improve the performance efficiency posture of your
Azure workload:
Design patterns for performance efficiency
Best practices:
Autoscaling
Background jobs
Caching
CDN
Data partitioning
For a high-level synopsis, reference Overview of the performance efficiency pillar.
Learn more about:
Azure Well-Architected Review
Well-Architected Series
Introduction to the Microsoft Azure Well-Architected Framework
Microsoft Defender for Cloud
Cloud Adoption Framework
Overview of the reliability pillar
12/16/2022 4 minutes to read • Edit Online
Reliability ensures your application can meet the commitments you make to your customers. Architecting
resiliency into your application framework ensures your workloads are available and can recover from failures
at any scale.
Building for reliability includes:
Ensuring a highly available architecture
Recovering from failures such as data loss, major downtime, or ransomware incidents
To assess the reliability of your workload using the tenets found in the Microsoft Azure Well-Architected
Framework, reference the Microsoft Azure Well-Architected Review.
For more information, explore the following video on diving deeper into Azure workload reliability:
In traditional application development, there has been a focus on increasing the mean time between failures
(MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is
required, because of several factors:
Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
Costs for cloud environments are kept low through commodity hardware, so occasional hardware failures
must be expected.
Applications often depend on external services, which may become temporarily unavailable or throttle high-
volume users.
Today's users expect an application to be available 24/7 without ever going offline.
All of these factors mean that cloud applications must be designed to expect occasional failures and recover
from them. Azure has many resiliency features already built into the platform. For example:
Azure Storage, Azure SQL Database, and Azure Cosmos DB all provide built-in data replication across
availability zones and regions.
Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware
failures.
Virtual machines (VMs) in an availability set are spread across several fault domains. A
fault domain
is a
group of VMs that share a common power source and network switch. Spreading VMs across fault domains
limits the impact of physical hardware failures, network outages, or power interruptions.
Availability Zones
are physically separate locations within each Azure region. Each zone is composed of one
or more datacenters equipped with independent power, cooling, and networking infrastructure. With
availability zones, you can design and operate applications, and databases that automatically transition
between zones without interruption, which ensures resiliency if one zone is affected. For more information,
reference Regions and Availability Zones in Azure.
That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels
of the architecture. Some mitigations are more tactical in nature—for example, retrying a remote call after a
transient network failure. Other mitigations are more strategic, such as failing over the entire application to a
secondary region. Tactical mitigations can make a large difference. While it's rare for an entire region to
Topics and best practices
REL IA B IL IT Y TO P ICREL IA B IL IT Y TO P IC DESC RIP T IONDESC RIP T ION
Reliability principles These critical principles are used as lenses to assess the
reliability of an application deployed on Azure.
Design for reliability Consider how systems use Availability Zones, perform
scalability, respond to failure, and other strategies that
optimize reliability in application design.
Resiliency checklist for specific Azure services Every technology has its own particular failure modes, which
you must consider when designing and implementing your
application. Use this checklist to review the resiliency
considerations for specific Azure services.
Target and non-functional requirements Target and non-functional requirements such as availability
targets and recovery targets allow you to measure the
uptime and downtime of your workloads. Having clearly
defined targets is crucial to have a goal to work and measure
against.
Resiliency and dependencies Building failure recovery into the system should be part of
the architecture and design phases from the beginning to
avoid the risk of failure. Dependencies are required for the
application to fully operate.
Availability Zones Availability Zones can be used to spread a solution across
multiple zones within a region, allowing for an application to
continue functioning when one zone fails.
Availability of services Availability of services across Azure regions depends on a
region's type. Azure's general policy on deploying services
into any given region is primarily driven by region type,
service categories, and customer demand.
Availability zone terminology To better understand regions and availability zones in Azure,
it helps to understand key terms or concepts.
Best practices During the architectural phase, focus on implementing
practices that meet your business requirements, identify
failure points, and minimize the scope of failures.
Testing for reliability Regular testing should be performed as part of each major
change to validate existing thresholds, targets, and
assumptions.
experience a disruption, transient problems such as network congestion are more common—so target these
issues first. Having the right monitoring and diagnostics is also important, both to detect failures when they
happen, and to find the root causes.
When designing an application to be resilient, you must understand your availability requirements. How much
downtime is acceptable? The amount of downtime is partly a function of cost. How much will potential
downtime cost your business? How much should you invest in making the application highly available?
The reliability pillar covers the following topics and best practices to help you build a resilient workload:
Monitoring for reliability Get an overall picture of application health. If something
fails, you need to know
that
it failed,
when
it failed, and
why
.
Reliability patterns Applications must be designed and implemented to
maximize availability.
REL IA B IL IT Y TO P ICREL IA B IL IT Y TO P IC DESC RIP T IONDESC RIP T ION
Next step
Principles
Reliability design principles
12/16/2022 2 minutes to read • Edit Online
Design for business requirements
Design for failure
Observe application health
Drive automation
Building a reliable application in the cloud is different from traditional application development. Historically, you
may have purchased levels of redundant higher-end hardware to minimize the chance of an entire application
platform failing.
In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component.
To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the
Microsoft Azure Well-Architected Review.
The following design principles provide:
Context for questions
Why a certain aspect is important
How an aspect is applicable to Reliability
These critical design principles are used as lenses to assess the Reliability of an application deployed on Azure.
These lenses provide a framework for the application assessment questions.
Reliability is a subjective concept. For an application to be appropriately reliable, it must reflect the business
requirements surrounding it.
For example, a mission-critical application with a 99.999% service level agreement (SLA) requires a higher level
of reliability than another application with an SLA of 95% .
Cost implications are inevitable when introducing greater reliability and high availability. This trade-off should
be carefully considered.
Failure is impossible to avoid in a highly distributed and multi-tenant environment like Azure.
By anticipating failures, from individual components to entire Azure regions, you can develop a solution in a
resilient way to increase reliability.
Before mitigating issues that impact application reliability, you must first detect these issues.
By monitoring the operation of an application relative to a healthy state, you can detect and predict reliability
issues.
Monitoring allows you to take swift and remedial action.
One of the leading causes of application downtime is human error due to the deployment of insufficiently tested
software or through misconfiguration.
Design for self-healing
Design for scale-out
Next stepNext step
To minimize the possibility and consequence of human errors, it's vital to strive for automation in all aspects of a
cloud solution.
Automation improves:
Reliability
Automated testing
Deployment
Management
Self-healing
describes the ability of a system to deal with failures automatically. Handling failures happens
through pre-defined remediation protocols. These protocols connect to failure modes within the solution.
It's an advanced concept that requires a high level of system maturity with monitoring and automation.
From inception, self-healing should be an aspiration to maximize reliability.
Scale-out
is a concept that focuses on the ability of a system to respond to demand through horizontal growth.
As traffic grows,
more
resource units are added in parallel, instead of increasing the size of the existing
resources.
Through scale units, a system can handle expected and unexpected traffic increases, essential to overall
reliability.
Scale units further reduce the effects of a single resource failure.
Design
Design for reliability
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure services
Reference architecture
Next step
Related links
Reliable applications should maintain a pre-defined percentage of uptime (
availability
). They should also balance
between high resiliency, low latency, and cost (
High Availability
). Just as important, applications should be able
to recover from failures (
resiliency
).
How have you designed your applications with reliability in mind?How have you designed your applications with reliability in mind?
Define availability and recovery targets to meet business requirements.
Build resiliency and availability into your apps by gathering requirements.
Ensure that application and data platforms meet your reliability requirements.
Configure connection paths to promote availability.
Use Availability Zones where applicable to improve reliability and optimize costs.
Ensure that your application architecture is resilient to failures.
Know what happens if the requirements of Service Level Agreements are not met.
Identify possible failure points in the system to build resiliency.
Ensure that applications can operate in the absence of their dependencies.
Azure Front Door
Azure Traffic Manager
Azure Load Balancer
Azure Virtual Network NAT
Service Fabric
Kubernetes Service (AKS)
Azure Site Recovery
Deploy highly available network virtual appliances
Failure Mode Analysis for Azure applications
Minimize coordination
Target & non-functional requirements
Use platform as a service (PaaS) options
Design to scale out
Workload availability targets.
Building solutions for high availability using Availability Zones
Make all things redundant
Target and non
-
functional requirements
12/16/2022 12 minutes to read • Edit Online
Key points
Availability targets
Considerations for availability targetsConsiderations for availability targets
Target and non-functional requirements such as
availability targets
and
recovery targets
allow you to measure
the uptime and downtime of your workloads. Having clearly defined targets is crucial in order to have a goal to
work and measure against. In addition to these targets, there are many other requirements you should consider
to improve reliability requirements and meet business expectations.
Building resiliency (recovering from failures) and availability (running in a healthy state without significant
downtime) into your apps begins with gathering requirements. For example, how much downtime is
acceptable? How much does potential downtime cost your business? What are your customer's availability
requirements? How much do you invest in making your application highly available? What is the risk versus the
cost?
Determine the acceptable level of uptime for your workloads.
Determine how long workloads can be unavailable and how much data is acceptable to lose during a
disaster.
Consider application and data platform requirements to improve resiliency and availability.
Ensure connection availability and improve reliability with Azure services.
Assess overall application health of workloads.
A Service Level Agreement (SLA), is an availability target that represents a commitment around performance
and availability of the application. Understanding the SLA of individual components within the system is
essential in order to define reliability targets. Knowing the SLA of dependencies will also provide a justification
for additional spend when making the dependencies highly available and with proper support contracts.
Availability targets for any dependencies leveraged by the application should be understood and ideally align
with application targets should also be considered.
Understanding your availability expectations is vital to reviewing overall operations for the application. For
example, if you are striving to achieve an application Service Level Objective (SLO) of 99.999%, the level of
inherent operational action required by the application is going to be far greater than if an SLO of 99.9% was the
goal.
Monitoring and measuring application availability is vital to qualifying overall application health and progress
towards defined targets. Make sure you measure and monitor key targets such as:
Mean Time Between Failures (MTBF) — The average time between failures of a particular component.
Mean Time To Recover (MTTR) — The average time it takes to restore a component after a failure.
Are SL As/SLOs/SLIs for all leveraged dependencies understood?Are SL As/SLOs/SLIs for all leveraged dependencies understood?
Availability targets for any dependencies leveraged by the application should be understood and ideally align
with application targets. Make sure SLAs/SLOs/SLIs for all leveraged dependencies are understood.
Has a composite SL A been calculated for the application and/or key scenarios using Azure SL As?Has a composite SL A been calculated for the application and/or key scenarios using Azure SL As?
NOTENOTE
Recovery targets
Meet application platform requirements
Multiple and paired regionsMultiple and paired regions
A composite SLA captures the end-to-end SLA across all application components and dependencies. It is
calculated using the individual SLAs of Azure services housing application components and provides an
important indicator of designed availability in relation to customer expectations and targets. Make sure the
composite SLA of all components and dependencies on the critical paths are understood. To learn more, see
Composite SLAs.
If you have contractual commitments to an SLA for your Azure solution, additional allowances on top of the Azure
composite SLA must be made to accommodate outages caused by code-level issues and deployments. This is often
overlooked and customers directly put the composite SLA forward to their customers.
Are availability targets considered while the system is running in disaster recover y mode?Are availability targets considered while the system is running in disaster recover y mode?
Availability targets might or might not be applied when running in disaster recovery mode. This depends from
application to application. If targets must also apply in a failure state then an N+1 model should be used to
achieve greater availability and resiliency. In this scenario, N is the capacity needed to deliver required
availability. There's also a cost implication, because more resilient infrastructure usually is more expensive. This
has to be accepted by business.
What are the consequences if availability targets are not satisfied?What are the consequences if availability targets are not satisfied?
Are there any penalties, such as financial charges, associated with failing to meet SLA commitments? Additional
measures can be used to prevent penalties, but that also brings additional cost to operate the infrastructure. This
has to be factored in and evaluated. It should be fully understood what are the consequences if availability
targets are not satisfied. This will also inform when to initiate a failover case.
Recovery targets identify how long the workload can be unavailable and how much data is acceptable to lose
during a disaster. Define target reports for the application and key scenarios. Target reports needed are
Recovery Time Objective (RTO) — the maximum acceptable time an application is unavailable after an incident,
and Recovery Point Objective (RPO) — the maximum duration of data loss that is acceptable during a disaster.
Recovery targets are nonfunctional requirements of a system and should be dictated by business requirements.
Recovery targets should be defined in accordance to the required RTO and RPO targets for the workloads.
Azure application platform services offer resiliency features to support application reliability, though they may
only be applicable at a certain SKU and configuration/deployment. For example, an SLA is dependent on the
number of instances deployed or a certain feature enabled. It is recommended that you review the SLA for
services used. For example, Service Bus Premium SKU provides predictable latency and throughput to mitigate
noisy neighbor scenarios. It also provides the ability to automatically scale and replicate metadata to another
Service Bus instance for failover purposes.
To learn more, see Azure Service Bus Premium SKU.
An application platform should be deployed across multiple regions if the requirements dictate. Covering the
requirements using zones is cheaper and less complex. Regional isolation should be an extra measure if the
SLAs given by the single region cross-zone setup are insufficient or if required by a geographical spread of
users.
Availability Zones and setsAvailability Zones and sets
Considerations for availabilityConsiderations for availability
NOTENOTE
Meet data platform requirements
Data consistencyData consistency
The ability to respond to disaster scenarios for overall compute platform availability and application resiliency
depends on the use of multiple regions or other deployment locations.
Use paired regions that exist within the same geography and provide native replication features for recovery
purposes, such as Geo-Redundant Storage (GRS) asynchronous replication. In the event of planned
maintenance, updates to a region will be performed sequentially only. To learn more, see Business continuity
with Azure Paired Regions.
Platform services that can leverage Availability Zones are deployed in either a zonal manner within a particular
zone, or in a zone-redundant configuration across multiple zones. To learn more, see Building solutions for high
availability using Availability Zones.
An Availability Set (AS) is a logical construct to inform Azure that it should distribute contained virtual machine
instances across multiple fault and update domains within an Azure region. Availability Zones (AZ) elevate the
fault level for virtual machines to a physical datacenter by allowing replica instances to be deployed across
multiple datacenters within an Azure region. While zones provide greater resiliency than sets, there are
performance and cost considerations where applications are extremely 'chatty' across zones given the implied
physical separation and inter-zone bandwidth charges. Ultimately, Azure Virtual Machines and Azure PaaS
services, such as Service Fabric and Azure Kubernetes Service (AKS) which use virtual machines underneath, can
leverage either AZs or an AS to provide application resiliency within a region. To learn more, see Business
continuity with data resiliency.
Is the application hosted across 2 or more application platform nodes?Is the application hosted across 2 or more application platform nodes?
To ensure application platform reliability, it is vital that the application be hosted across at least two nodes to
ensure there are no single points of failure. Ideally An n+1 model should be applied for compute availability
where n is the number of instances required to support application availability and performance requirements.
Higher SLAs provided for virtual machines and associated related platform services, require at least two replica nodes
deployed to either an Availability Set or across two or more Availability Zones. To learn more, see SLA for Virtual
Machines.
How is the client traffic routed to the application in the case of region, zone or network outage?How is the client traffic routed to the application in the case of region, zone or network outage?
In the event of a major outage, client traffic should be routable to application deployments which remain
available across other regions or zones. This is ultimately where cross-premises connectivity and global load
balancing should be used, depending on whether the application is internal and/or external facing. Services such
as Azure Front Door, Azure Traffic Manager, or third-party CDNs can route traffic across regions based on
application health solicited via health probes. To learn more, see Traffic Manager endpoint monitoring.
Data and storage services should be running in a highly available configuration/SKU. Azure data platform
services offer resiliency features to support application reliability, though they may only be applicable at a
certain SKU. Examples are Azure SQL Database Business Critical SKUs, or Azure Storage Zone Redundant
Storage (ZRS) with three synchronous replicas spread across availability zones.
Data types should be categorized by data consistency requirements. Data consistency requirements, such as
strong or eventual consistency, should be understood for all data types and used to inform data grouping and
Replication and RedundancyReplication and Redundancy
NOTENOTE
Networking and connectivity requirements
ConnectivityConnectivity
categorization, as well as what data replication/synchronization strategies can be considered to meet application
reliability targets.
CAP theorem proves that it is impossible for a distributed data store to simultaneously provide more than two
guarantees across:
Consistency:Consistency: Every read receives the most recent write or an error.
Availability:Availability: Every request receives a non-error response, without the guarantee that it contains the most
recent write.
Par tition tolerance:Par tition tolerance: A system continues to operate despite an arbitrary number of transactions being
dropped or delayed by the network between nodes.
Determining which of these guarantees are most important in the context of application requirements is critical.
Replicating data across zones or paired regions supports application availability objectives to limit the impact of
failure scenarios. The ability to restore data from a backup is essential when recovering from data corruption
situations as well as failure scenarios. To ensure sufficient redundancy and availability for zonal and regional
failure scenarios, backups should be stored across zones and/or regions.
Define and test a data restore process to ensure a consistent application state. Regular testing of the data restore
process promotes operational excellence and confidence in the ability to recover data in alignment with defined
recovery objectives for the application.
Consider how your application traffic is routed to data sources in the case of region, zone, or network outage.
Understanding the method used to route application traffic to data sources in the event of a major failure event
is critical to identify whether failover processes will meet recovery objectives. Many Azure data platform
services offer native reliability capabilities to handle major failures, such as Azure Cosmos DB Automatic
Failover or Azure SQL Database Active Geo-Replication.
Some capabilities such as Azure Storage RA-GRS and Azure SQL DB Active Geo-Replication require application-side
failover to alternate endpoints in some failure scenarios, so application logic should be developed to handle these
scenarios.
Consider these guidelines to ensure connection availability and improve reliability with Azure services.
Use a global load balancer used to distribute traffic and/or failover across regions.Use a global load balancer used to distribute traffic and/or failover across regions. Azure
Front Door, Azure Traffic Manager, or third-party CDN services can be used to direct inbound requests to
external-facing application endpoints deployed across multiple regions. It is important to note that Traffic
Manager is a DNS-based load balancer, so failover must wait for DNS propagation to occur. A sufficiently
low TTL (Time To Live) value should be used for DNS records, though not all ISPs may honor this. For
application scenarios requiring transparent failover, Azure Front Door should be used. To learn more, see
Disaster Recovery using Azure Traffic Manager and Azure Front Door routing architecture.
For cross-premises connectivity (ExpressRoute or VPN) ensure there are redundantFor cross-premises connectivity (ExpressRoute or VPN) ensure there are redundant
connections from different locations.connections from different locations. At least two redundant connections should be established
across two or more Azure regions and peering locations to ensure there are no single points of failure. An
active/active load-shared configuration provides path diversity and promotes availability of network
connection paths. To learn more, see Cross-network connectivity.
Zone
-
aware servicesZone
-
aware services
Next step
Related links
Simulate a failure path to ensure connectivity is available over alternative paths.Simulate a failure path to ensure connectivity is available over alternative paths. The failure of
a connection path onto other connection paths should be tested to validate connectivity and operational
effectiveness. Using Site-to-Site VPN connectivity as a backup path for ExpressRoute provides an
additional layer of network resiliency for cross-premises connectivity. To learn more, see Using site-to-site
VPN as a backup for ExpressRoute private peering.
Eliminate all single points of failure from the data path (on-premises and Azure.Eliminate all single points of failure from the data path (on-premises and Azure. Single-
instance Network Virtual Appliances (NVAs), whether deployed in Azure or within an on-premises
datacenter, introduce significant connectivity risk. To learn more, see Deploy highly available network
virtual appliances.
Use ExpressRoute/VPN zone-redundant Vir tual Network Gateways.Use ExpressRoute/VPN zone-redundant Vir tual Network Gateways. Zone-redundant virtual
network gateways distribute gateway instances across Availability Zones to improve reliability and
ensure availability during failure scenarios impacting a datacenter within a region. To learn more, see
Zone-redundant Virtual Network Gateways.
If used, deploy Azure Application Gateway v2 deployed in a zone-redundant configuration.If used, deploy Azure Application Gateway v2 deployed in a zone-redundant configuration.
Azure Application Gateway v2 can be deployed in a zone-redundant configuration to deploy gateway
instances across zones for improved reliability and availability during failure scenarios impacting a
datacenter within a region. To learn more, see Zone-redundant Application Gateway v2.
Use Azure Load Balancer Standard to load-balance traffic across Availability Zones.Use Azure Load Balancer Standard to load-balance traffic across Availability Zones. Azure
Load Balancer Standard is zone-aware to distribute traffic across Availability Zones. It can also be
configured in a zone-redundant configuration to improve reliability and ensure availability during failure
scenarios impacting a datacenter within a region. To learn more, see Standard Load Balancer and
Availability Zones.
Configure health probes for Azure Load Balancer(s)/Azure Application Gateways.Configure health probes for Azure Load Balancer(s)/Azure Application Gateways. Health
probes allow Azure Load Balancers to assess the health of backend endpoints to prevent traffic from
being sent to unhealthy instances. To learn more, see Load Balancer health probes.
Assess critical application dependencies with health probes.Assess critical application dependencies with health probes. Custom health probes should be
used to assess overall application health including downstream components and dependent services,
such as APIs and datastores, so that traffic is not sent to backend instances that cannot successfully
process requests due to dependency failures. To learn more, see Health Endpoint Monitoring Pattern.
Application design
To understand business metrics to design resilient Azure applications, see Workload availability targets.
For information on Availability Zones, see Building solutions for high availability using Availability Zones.
For information on health probes, see Load Balancer health probes and Health Endpoint Monitoring Pattern.
To learn about connectivity risk, see Deploy highly available network virtual appliances.
Go back to the main article: Design
Design reliable Azure applications
12/16/2022 4 minutes to read • Edit Online
Key Points
Use Availability Zones within a region
NOTENOTE
Respond to failure
Building a reliable application in the cloud is different from traditional application development. While
historically you may have purchased levels of redundant higher-end hardware to minimize the chance of an
entire application platform failing, in the cloud, we acknowledge up front that failures will happen. Instead of
trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Failures
you can expect here are inherent to highly distributed systems, not a feature of Azure.
Use Availability Zones where applicable to improve reliability and optimize costs.
Design applications to operate when impacted by failures.
Use the native resiliency capabilities of PaaS to support overall application reliability.
Design to scale out.
Validate that required capacity is within Azure service scale limits and quotas.
If your requirements demand an even greater failure isolation than Availability Zones alone can offer, consider
deploying to multiple regions. Multiple regions should be used for failover purposes in a disaster state.
Additional cost needs to be taken into consideration. Examples of cost needs are data and networking, and
services such as Azure Site Recovery.
Design your application architecture to use
Availability Zones
within a region. Availability Zones can be used to
optimize application availability within a region by providing datacenter level fault tolerance. However, the
application architecture must not share dependencies between zones to use them effectively.
Availability Zones may introduce performance and cost considerations for applications which are extremely "chatty" across
zones given the implied physical separation between each zone and inter-zone bandwidth charges. This also means that
Availability Zones can be considered to get higher SLA for lower cost.
Consider if component proximity is required for application performance reasons. If all or part of the application
is highly sensitive to latency, it may mandate component co-locality which can limit the applicability of multi-
region and multi-zone strategies.
Avoiding failure is impossible in the public cloud, and as a result applications require resilience to respond to
outages and deliver reliability. The application should therefore be designed to operate even when impacted by
regional, zonal, service or component failures across critical application scenarios and functionality. Application
operations may experience reduced functionality or degraded performance during an outage.
Define an availability strategy to capture how the application remains available when in a failure state. It should
apply across all application components and the application deployment stamp as a whole such as via multi-geo
scale-unit deployment approach. There are cost implications as well: More resources need to be provisioned in
advance to provide high availability. Active-active setup, while more expensive than single deployment, can
Considerations for improving reliability
Next step
Related links
balance cost by lowering load on one stamp and reducing the total amount of resources needed.
In addition to an availability strategy, define a Business Continuity Disaster Recovery (BCDR) strategy for the
application and/or its key scenarios. A disaster recovery strategy should capture how the application responds
to a disaster situation such as a regional outage or the loss of a critical platform service, using either a re-
deployment, warm-spare active-passive, or hot-spare active-active approach.
To drive cost down consider splitting application components and data into groups. For example:
Must protect
Nice to protect
Ephemeral/can be rebuilt/lost, instead of protecting all data with the same policy
Is the application designed to use managed ser vices?Is the application designed to use managed ser vices?
Azure-managed services provide native resiliency capabilities to support overall application reliability. Platform
as a service (PaaS) offerings should be used to leverage these capabilities. PaaS options are easier to configure
and administer. You don't need to provision VMs, set up VNets, manage patches and updates, and all of the other
overhead associated with running software on a VM. To learn more, see Use managed services.
Has the application been designed to scale out?Has the application been designed to scale out?
Azure provides elastic scalability and you should design to scale out. However, applications must leverage a
scale-unit approach to navigate service and subscription limits to ensure that individual components and the
application as a whole can scale horizontally. Don't forget about scale in, which is important to drive cost down.
For example, scale in and out for App Service is done via rules. Often customers write scale out rules and never
write scale in rules. This leaves the App Service more expensive.
Is the application deployed across multiple Azure subscriptions?Is the application deployed across multiple Azure subscriptions?
Understanding the subscription landscape of the application and how components are organized within or
across subscriptions is important when analyzing if relevant subscription limits or quotas can be navigated.
Review Azure subscription and service limits to validate that required capacity is within Azure service scale
limits and quotas. To learn more, see Azure subscription and service limits.
Resiliency and dependencies
For information on minimizing dependencies, see Minimize coordination.
For more information on fault-points and fault-modes, see Failure Mode Analysis for Azure applications.
For information on managed services, see Use platform as a service (PaaS) options.
Go back to the main article: Design
Resiliency and dependencies
12/16/2022 4 minutes to read • Edit Online
Key points
Build resiliency with failure mode analysis
NOTENOTE
Understand the impact of dependencies
Building failure recovery into the system should be part of the architecture and design phases from the
beginning to avoid the risk of failure. Dependencies are required for the application to fully operate.
Identify possible failure points in the system with failure mode analysis.
Eliminate all single point of failure.
Maintain a complete list of application dependencies.
Ensure that applications can operate in the absence of their dependencies.
Understand the SLA of individual components within the system to define reliability targets.
Failure mode analysis (FMA) is a process for building resiliency into a system, by identifying possible failure
points in the system. The FMA should be part of the architecture and design phases, so that you can build failure
recovery into the system from the beginning.
Identify all fault-points and fault-modes. Fault-points describe the elements within an application architecture
which are capable of failing, while fault-modes capture the various ways by which a fault-point may fail. To
ensure an application is resilient to end-to-end failures, it is essential that all fault-points and fault-modes are
understood and operationalized. To learn more, see Failure mode analysis for Azure applications.
Eliminate all single point of failure. A single point of failure describes a specific fault-point which if it where to
fail, would bring down the entire application. Single points of failure introduce significant risk since any failure
of this component will cause an application outage. To learn more, see Make all things redundant.
Eliminate all
singletons
. A singleton describes a logical component within an application for which there can only be a
single instance. It can apply to stateful architectural components or application code constructs. Ultimately, singletons
introduce a significant risk by creating single points of failure within the application design.
Internal
dependencies describe components within the application scope which are required for the application
to fully operate.
External
dependencies capture required components outside the scope of the application, such
as another application or third-party service. Dependencies may be categorized as either strong or weak based
on whether or not the application is able to continue operating in a degraded fashion in their absence. To learn
more, see Twelve-Factor App: Dependencies.
You should maintain a complete list of application dependencies. Examples of typical dependencies include
platform dependencies outside the remit of the application, such as Azure Active Directory, Express Route, or a
central NVA (Network Virtual Appliance), as well as application dependencies such as APIs. For cost purposes,
it's important to understand the price for these services and how they are being charged. For more details see
Cost models.
You can map application dependencies either as a simple list or a document. Usually this is part of a design
Next step
Related links
document or reference architecture.
Understand the impact of an outage with each dependency.Understand the impact of an outage with each dependency. Strong dependencies play a critical role in
application function and availability. Their absence will have a significant impact, while the absence of weak
dependencies may only impact specific features and not affect overall availability. This reflects the cost that is
needed to maintain the High Availability relationship between the service and its dependencies. Classifying
dependencies as either strong or weak will help you identify which components are essential to the application.
Maintain SL As and suppor t agreements for critical dependencies.Maintain SL As and suppor t agreements for critical dependencies. A Service Level Agreement (SLA)
represents a commitment around performance and availability of the application. Understanding the SLA of
individual components within the system is essential in order to define reliability targets. Knowing the SLA of
dependencies will also provide a justification for additional spend when making the dependencies highly
available and with proper support contracts. The operational commitments of all external and internal
dependencies should be understood to inform the broader application operations and health model.
The usage of platform level dependencies such as Azure Active Directory must also be understood to ensure
that their availability and recovery targets align with that of the application
Ensure that applications can operate in the absence of their dependencies.Ensure that applications can operate in the absence of their dependencies. If the application has
strong dependencies which it cannot operate in the absence of, then the availability and recovery targets of
these dependencies should align with that of the application itself. Make an effort to minimize dependencies to
achieve control over application reliability. To learn more, see Minimize dependencies.
Ensure that the lifecycle of the application decoupled from its dependencies.Ensure that the lifecycle of the application decoupled from its dependencies. If the application
lifecycle is closely coupled with that of its dependencies, it can limit the operational agility of the application. This
is true particularly where new releases are concerned.
Best practices
For information on failure mode analysis, see Failure mode analysis for Azure applications.
For information on single point of failure, see Make all things redundant.
For information on fault-points and fault-modes, see Failure Mode Analysis for Azure applications.
For information on minimizing dependencies, see Minimize coordination.
Go back to the main article: Design
Best practices for designing reliability in Azure
applications
12/16/2022 2 minutes to read • Edit Online
Build availability targets and recovery targets into your design
Ensure the application & data platforms meet your reliability
requirements
Ensure connectivity
Use zone-aware services
Design resilience to respond to outages
Perform a failure mode analysis (FMA)
This article lists Azure best practices to enhance designing Azure applications for reliability. These best practices
are derived from our experience with Azure reliability and the experiences of customers like yourself.
During the architectural phase, focus on implementing practices that meet your business requirements, identify
failure points, and minimize the scope of failures.
A Service Level Agreement (SLA), is an availability target that represents a commitment around performance
and availability of the application. Understanding the SLA of individual components within the system is
essential in order to define reliability targets. Recovery targets identify how long the workload can be
unavailable and how much data is acceptable to lose during a disaster. Define target reports for the application
and key scenarios. There may be penalties, such as financial charges, associated with failing to meet SLA
commitments. The consequences of not satisfying availability targets should be fully understood.
Designing application platform and data platform resiliency and availability are critical to ensuring overall
application reliability.
To ensure connection availability and improve reliability with Azure services:
Use a global load balancer used to distribute traffic and/or failover across regions.
For cross-premises connectivity (ExpressRoute or VPN) ensure there are redundant connections from
different locations.
Simulate a failure path to ensure connectivity is available over alternative paths.
Eliminate all single points of failure from the data path (on-premises and Azure).
Zone-aware services can improve reliability and ensure availability during failure scenarios affecting a
datacenter within a region. They can also be used to deploy gateway instances across zones for improved
reliability and availability during failure scenarios affecting a datacenter within a region.
Applications should be designed to operate even when affected by regional, zonal, service or component
failures across critical application scenarios and functionality. Application operations may experience reduced
functionality or degraded performance during an outage.
Understand the impact of an outage with each dependency
Design for scalability
Next step
FMA builds resiliency into an application early in the design stage. It helps you identify the types of failures your
application might experience, the potential effects of each, and possible recovery strategies.
Have all single points of failure been eliminated? A single point of failure describes a specific fault-point which if
it where to fail would bring down the entire application. Single points of failure introduce significant risk since
any failure of this component will cause an application outage.
Have all fault-points and fault-modes been identified? Fault-points describe the elements within an application
architecture which are capable of failing, while fault-modes capture the various ways by which a fault-point may
fail. To ensure an application is resilient to end-to-end failures, it is essential that all fault-points and fault-modes
are understood and operationalized.
Strong dependencies play a critical role in application function and availability. Their absence will have a
significant impact, while the absence of weak dependencies may only impact specific features and not affect
overall availability. Dependencies may be categorized as either strong or weak based on whether or not the
application is able to continue operating in a degraded fashion in their absence.
A cloud application must be able to scale to accommodate changes in usage. Begin with discrete components,
and design the application to respond automatically to load changes whenever possible. Keep scaling limits in
mind during design so you can expand easily in the future.
Testing
Go back to the main article: Design
Testing for reliability
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure services
Reference architecture
Next step
Related links
Regular testing should be performed as part of each major change and if possible, on a regular basis to validate
existing thresholds, targets and assumptions. Testing should also ensure the validity of the health model,
capacity model, and operational procedures.
Have you tested your applications with reliability in mind?Have you tested your applications with reliability in mind?
Test regularly to validate existing thresholds, targets and assumptions.
Automate testing as much as possible.
Perform testing on both key test environments with the production environment.
Perform chaos testing by injecting faults.
Create and test a disaster recovery plan on a regular basis using key failure scenarios.
Design disaster recovery strategy to run most applications with reduced functionality.
Design a backup strategy that is tailored to business requirements and circumstances of the application.
Test and validate the failover and failback approach successfully at least once.
Configure request timeouts to manage inter-component calls.
Implement retry logic to handle transient application failures and transient failures with internal or external
dependencies.
Configure and test health probes for your load balancers and traffic managers.
Apply chaos principles continuously.
Create and organize a central chaos engineering team.
Azure Site Recovery
Azure Pipelines
Azure Traffic Manager
Azure Load Balancer
Failure Mode Analysis for Azure applications
High availability and disaster recovery scenarios for IaaS apps
Back up files and applications on Azure Stack Hub
Resiliency testing
For information on performance testing, see Performance testing.
For information on chaos engineering, see Chaos engineering.
For information on failure and disaster recovery, see Failure and disaster recovery for Azure applications.
For information on testing applications, see Testing your application and Azure environment.
Testing applications for availability and resiliency
12/16/2022 3 minutes to read • Edit Online
Key points
When to test
NOTENOTE
Testing for resiliency
Applications should be tested to ensure
availability
and
resiliency
. Availability describes the amount of time
when an application runs in a healthy state without significant downtime. Resiliency describes how quickly an
application recovers from failure.
Being able to measure availability and resiliency can answer questions like, How much downtime is acceptable?
How much does potential downtime cost your business? What are your availability requirements? How much
do you invest in making your application highly available? What is the risk versus the cost? Testing plays a
critical role in making sure your applications can meet these requirements.
Test regularly to validate existing thresholds, targets and assumptions.
Automate testing as much as possible.
Perform testing on both key test environments with the production environment.
Verify how the end-to-end workload performs under intermittent failure conditions.
Test the application against critical non-functional requirements for performance.
Conduct load testing with expected peak volumes to test scalability and performance under load.
Perform chaos testing by injecting faults.
Regular testing should be performed as part of each major change and if possible, on a regular basis to validate
existing thresholds, targets and assumptions. While the majority of testing should be performed within the
testing and staging environments, it is often beneficial to also run a subset of tests against the production
system. Plan a 1:1 parity of key test environments with the production environment.
Automate testing where possible to ensure consistent test coverage and reproducibility. Automate common testing tasks
and integrate them into your build processes. Manually testing software is tedious and susceptible to error, although
manual explorative testing may also be conducted.
To test resiliency, you should verify how the end-to-end workload performs under intermittent failure
conditions.
Run tests in production using both synthetic and real user data. Test and production are rarely identical, so it's
important to validate your application in production using a blue-green or canary deployment. This way, you're
testing the application under real conditions, so you can be sure that it will function as expected when fully
deployed.
As part of your test plan, include:
Chaos engineering
Automated pre-deployment testing
Performance testing
Simulation testing
Fault injection testing
Test under peak loads
Fault injection testing
Peak load testing
Disaster recovery testing
The primary goal of performance testing is to validate benchmark behavior for the application. Performance
testing is the superset of both
load testing
and
stress testing
.
Load testing validates application scalability by rapidly and/or gradually increasing the load on the application
until it reaches a threshold/limit. Stress testing involves various activities to overload existing resources and
remove components to understand overall resiliency and how the application responds to issues.
Simulation testing involves creating small, real-life situations. Simulations demonstrate the effectiveness of the
solutions in the recovery plan and highlight any issues that weren't adequately addressed.
As you perform simulation testing, follow best practices:
Conduct simulations in a manner that doesn't disrupt actual business but feels like a real situation.
Make sure that simulated scenarios are completely controllable. If the recovery plan seems to be failing, you
can restore the situation back to normal without causing damage.
Inform management about when and how the simulation exercises will be conducted. Your plan should detail
the time frame and the resources affected during the simulation.
For fault injection testing, check the resiliency of the system during failures, either by triggering actual failures or
by simulating them. Here are some strategies to induce failures:
Shut down virtual machine (VM) instances.
Crash processes.
Expire certificates.
Change access keys.
Shut down the DNS service on domain controllers.
Limit available system resources, such as RAM or number of threads.
Unmount disks.
Redeploy a VM.
Your test plan should incorporate possible failure points identified during the design phase, in addition to
common failure scenarios:
Test your application in an environment as close to production as possible.
Test failures in combination.
Measure the recovery times, and be sure that your business requirements are met.
Verify that failures don't cascade and are handled in an isolated way.
Load testing is crucial for identifying failures that only happen under load, such as the back-end database being
overwhelmed or service throttling. Test for peak load and anticipated increase in peak load, using production
data or synthetic data that is as close to production data as possible. Your goal is to see how the application
behaves under real-world conditions.
Next step
Related links
Backup and recovery
For more test types, see Test types.
To learn about load and stress tests, see Performance testing.
To learn about chaos testing, see Chaos engineering.
Go back to the main article: Testing
Backup and disaster recovery for Azure applications
12/16/2022 6 minutes to read • Edit Online
Key points
Disaster recovery plan
Operational readiness testing
Disaster recovery
is the process of restoring application functionality in the wake of a catastrophic loss.
In the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent failures altogether,
the goal is to minimize the effects of a single failing component. Testing is one way to minimize these effects.
Automate testing your applications where possible, but you need to be prepared for when they fail. When a
failure happens, having backup and recovery strategies becomes important.
Your tolerance for reduced functionality during a disaster is a business decision that varies from one application
to the next. It might be acceptable for some applications to be unavailable or to be partially available with
reduced functionality or delayed processing for a while. For other applications, any reduced functionality is
unacceptable.
Create and test a disaster recovery plan regularly using key failure scenarios.
Design disaster recovery strategy to run most applications with reduced functionality.
Design a backup strategy that is tailored to business requirements and circumstances of the application.
Automate failover and failback steps and processes.
Test and validate the failover and failback approach successfully at least once.
Start by creating a recovery plan. The plan is considered complete after it has been fully tested. Include the
people, processes, and applications needed to restore functionality within the service-level agreement (SLA)
you've defined for your customers.
Consider the following suggestions when creating and testing your disaster recovery plan:
Include the process for contacting support and for escalating issues. This information will help to avoid
prolonged downtime as you work out the recovery process for the first time.
Evaluate the business impact of application failures.
Choose a cross-region recovery architecture for mission-critical applications.
Identify a specific owner of the disaster recovery plan, including automation and testing.
Document the process, especially any manual steps.
Automate the process as much as possible.
Establish a backup strategy for all reference and transactional data, and test backup restoration regularly.
Set up alerts for the stack of the Azure services consumed by your application.
Train operations staff to execute the plan.
Perform regular disaster simulations to validate and improve the plan.
If you're using Azure Site Recovery to replicate virtual machines (VMs), create a fully automated recovery plan to
fail over the entire application.
Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
Many Azure services support manual failover or test failover for disaster recovery drills. Instead, you can
Failover and failback testing
Dependent service outage
Network outage
Recovery automation
Backup strategy
simulate an outage by shutting down or removing Azure services.
Automated operational responses should be tested frequently as part of the normal application lifecycle to
ensure operational effectiveness.
Test failover and failback to verify that your application's dependent services come back up in a synchronized
manner during disaster recovery. Changes to systems and operations may affect failover and failback functions,
but the impact may not be detected until the main system fails or becomes overloaded. Test failover capabilities
before
they're required to compensate for a live problem. Also, be sure dependent services failover and failback
in the correct order.
If you're using Azure Site Recovery to replicate VMs, run disaster recovery drills periodically by testing failovers
to validate your replication strategy. A test failover doesn't affect the ongoing VM replication or your production
environment. For more information, see Run a disaster recovery drill to Azure.
For each dependent service, you should understand the implications of service disruption and the way that the
application will respond. Many services include features that support resiliency and availability, so evaluating
each service independently is likely to improve your disaster recovery plan. For example, Azure Event Hubs
supports failing over to the secondary namespace.
When parts of the Azure network are inaccessible, you may not be able to access your application or data. In this
situation, we recommend designing the disaster recovery strategy to run most applications with reduced
functionality.
If reducing functionality isn't an option, the remaining options are application downtime or failover to an
alternate region.
In a reduced functionality scenario:
If your application can't access its data because of an Azure network outage, you can run locally with reduced
application functionality by using cached data.
You can store data in an alternate location until connectivity is restored.
The steps required to recover or failover the application to a secondary Azure region in failure situations should
be codified, preferably in an automated manner, to ensure capabilities exist to respond effectively to an outage
in a way that limits impact. Similar codified steps should also exist to capture the process required to failback the
application to the primary region once a failover triggering issue has been addressed.
When automating failover procedures, ensure that the tooling used for orchestrating the failover is also
considered in the failover strategy. For example, if you run your failover from Jenkins running on a VM, you'll be
in trouble if that virtual machine is part of the outage. Azure DevOps Projects are scoped to a region too.
Many alternative strategies are available for implementing distributed compute across regions. These strategies
must be tailored to the specific business requirements and circumstances of the application. At a high level, the
approaches can be divided into the following categories:
Plan for regional failures
Next step
Related links
Redeploy on disasterRedeploy on disaster: In this approach, the application is redeployed from scratch at the time of
disaster. Redeploying from scratch is appropriate for non-critical applications that don't require a
guaranteed recovery time.
Warm Spare (Active/Passive)Warm Spare (Active/Passive): Create a secondary hosted service in an alternate region, and deploy
roles to guarantee minimal capacity. However, the roles don't receive production traffic. This approach is
useful for applications that have not been designed to distribute traffic across regions.
Hot Spare (Active/Active)Hot Spare (Active/Active): The application is designed to receive production load in multiple regions.
The cloud services in each region might be configured for higher capacity than required for disaster
recovery purposes. Instead, the cloud services might scale out as necessary at the time of a disaster and
failover. This approach requires a large investment in application design, but it has significant benefits.
These include low and guaranteed recovery time, continuous testing of all recovery locations, and
efficient usage of capacity.
Azure is divided physically and logically into units called regions. A region consists of one or more data centers
in close proximity. Many regions and services also support availability zones, which can be used to provide
more resiliency against outages in a single data center. Consider using regions with availability zones to improve
the availability of your solution.
Under rare circumstances, it's possible that facilities in an entire availability zone or region can become
inaccessible, for example, because of network failures. Or, facilities can be lost entirely, for example, because of a
natural disaster. Azure has capabilities for creating applications that are distributed across zones and regions.
Such distribution helps to minimize the possibility that a failure in one zone or region could affect other zones
or regions.
Automatic retry of failed backup jobs
For information on testing failovers, see Run a disaster recovery drill to Azure.
For information on Event Hubs, see Azure Event Hubs.
Go back to the main article: Testing
Automatic retry of failed backup jobs
12/16/2022 5 minutes to read • Edit Online
Prerequisites
Assign permissions to managed identities
Azure Backup comprehensively protects your data assets in Azure through a simple, secure, and cost-effective
solution that requires zero infrastructure. Azure's built-in data protection offers a solution for a wide range of
workloads, and helps protect your mission critical workloads running in the cloud. Azure's built-in data ensures
your backups are always available and managed at scale across your entire backup estate.
As a backup user or administrator, you can monitor all backup solutions and configure alerts to notify you about
important events.
Many of the failures or outages are transient. You can solve them just by retrying the backup, or the restore job.
However, waiting for an engineer to retry the job manually or assign the relevant permission wastes valuable
time. Automation is the smarter way to retry the failed jobs, and ensures you continue to meet your target RPOs
with one successful backup a day.
Retrieve relevant backup data through Azure Resource Graph (ARG) and combine the data with corrective
PowerShell and CLI steps. This article guides you through retrying the backup for all failed jobs using ARG and
PowerShell.
You'll need an Azure Automation account. You can use an existing account or create a new account with one
user-assigned managed identity, at minimum.
To assign permissions to managed identities, complete the following steps:
# Sign in to your Azure subscription
$sub = Get-AzSubscription -ErrorAction SilentlyContinue
if(-not($sub))
{
Connect-AzAccount
}
# If you have multiple subscriptions, set the one to use
# Select-AzSubscription -SubscriptionId <SUBSCRIPTIONID>
$resourceGroup = "resourceGroupName"
# These values are used in this tutorial
$automationAccount = "xAutomationAccount"
$userAssignedManagedIdentity = "xUAMI"
1. Sign in to Azure interactively using the Connect-AzAccount cmdlet and follow the instructions:
2. Provide an appropriate value for the following variables and then run the script:
3. Use the PowerShell cmdlet New-AzRoleAssignment to assign a role to the system-assigned managed
identity:
Add modules
Create a PowerShell runbook
$role1 = "DevTest Labs User"
$SAMI = (Get-AzAutomationAccount -ResourceGroupName $resourceGroup
-Name $automationAccount).Identity.PrincipalId
New-AzRoleAssignment
-ObjectId $SAMI
-ResourceGroupName $resourceGroup
-RoleDefinitionName $role1
$UAMI = (Get-AzUserAssignedIdentity -ResourceGroupName
$resourceGroup -Name $userAssignedManagedIdentity).PrincipalId
New-AzRoleAssignment
-ObjectId $UAMI
-ResourceGroupName $resourceGroup
-RoleDefinitionName $role1
$role2 = "Reader"
New-AzRoleAssignment
-ObjectId $SAMI
-ResourceGroupName $resourceGroup
-RoleDefinitionName $role2
4. You need the same role assignment for the user-assigned managed identity:
5. You'll need extra permissions for the system-assigned managed identity to run the cmdlets :
Get-AzUserAssignedIdentity and Get-AzAutomationAccount :
For the preceding scripts to work, install the following modules by navigating to the individual module gallery
after you've created the automation account:
Az.Accounts
Az.RecoveryServices
Az.Graph
To create a runbook that managed identities can run, complete the following steps:
$connection = Get-AutomationConnection -Name AzureRunAsConnection
$connectionResult = Connect-AzAccount
-ServicePrincipal
-Tenant $connection.TenantID
-ApplicationId $connection.ApplicationID
1. Sign in to the Azure portal and navigate to your Automation account.
2. Under Process AutomationProcess Automation, select RunbooksRunbooks.
3. Select Create a runbookCreate a runbook:
a. Name the runbook
miTesting
.
b. From the Runbook typeRunbook type dropdown menu, select PowerShellPowerShell.
c. Select CreateCreate.
4. In the runbook editor, paste the following code:
-CertificateThumbprint $connection.CertificateThumbprint
"Login successful.."
$query = "RecoveryServicesResources
| where type in~ ('microsoft.recoveryservices/vaults/backupjobs')
| extend vaultName = case(type =~
'microsoft.dataprotection/backupVaults/backupJobs',properties.vaultName,type =~
'Microsoft.RecoveryServices/vaults/backupJobs',split(split(id, '/Microsoft.RecoveryServices/vaults/')
[1],'/')[0],'--')
| extend friendlyName = case(type =~
'microsoft.dataprotection/backupVaults/backupJobs',strcat(properties.dataSourceSetName , '/',
properties.dataSourceName),type =~ 'Microsoft.RecoveryServices/vaults/backupJobs',
properties.entityFriendlyName, '--')
| extend dataSourceType = case(type =~
'Microsoft.RecoveryServices/vaults/backupJobs',properties.backupManagementType,type =~
'microsoft.dataprotection/backupVaults/backupJobs',properties.dataSourceType,'--')
| extend protectedItemName = split(split(properties.backupInstanceId, 'protectedItems')[1],'/')[1]
| extend vaultId = tostring(split(id, '/backupJobs')[0])
| extend vaultSub = tostring( split(id, '/')[2])
| extend jobStatus = case (properties.status == 'Completed' or properties.status ==
'CompletedWithWarnings','Succeeded',properties.status == 'Failed','Failed',properties.status ==
'InProgress', 'Started', properties.status), operation = case(type =~
'microsoft.dataprotection/backupVaults/backupJobs' and tolower(properties.operationCategory) =~
'backup' and properties.isUserTriggered == 'true',strcat('adhoc',properties.operationCategory),type
=~ 'microsoft.dataprotection/backupVaults/backupJobs', tolower(properties.operationCategory), type =~
'Microsoft.RecoveryServices/vaults/backupJobs' and tolower(properties.operation) =~ 'backup' and
properties.isUserTriggered == 'true',strcat('adhoc',properties.operation),type =~
'Microsoft.RecoveryServices/vaults/backupJobs',tolower(properties.operation), '--'),startTime =
todatetime(properties.startTime),endTime = properties.endTime, duration = properties.duration
| where startTime >= ago(24h)
| where (dataSourceType in~ ('AzureIaasVM'))
| where jobStatus=='Failed'
| where operation == 'backup' or operation == 'adhocBackup'
| project vaultSub, vaultId, protectedItemName, startTime, endTime, jobStatus, operation
| sort by vaultSub"
$subscriptions = Get-AzSubscription | foreach {$_.SubscriptionId}
$result = Search-AzGraph -Subscription $subscriptions -Query $query -First 5
$result = $result.data
$prevsub = ""
foreach($jobresponse in $result)
{
if($jobresponse.vaultSub -ne $prevsub)
{
Set-AzContext -SubscriptionId
$jobresponse.vaultSub
$prevsub = $jobresponse.vaultSub
}
$item = Get-AzRecoveryServicesBackupItem -VaultId
$jobresponse.vaultId -BackupManagementType AzureVM -WorkloadType AzureVM -Name
$jobresponse.protectedItemName
Backup-AzRecoveryServicesBackupItem -ExpiryDateTimeUTC
(get-date).AddDays(10) -Item $item -VaultId $jobresponse.vaultId
}
Create a new schedule with PowerShell
PS C:\> $StartTime = Get-Date "13:00:00"
PS C:\> $EndTime = $StartTime.AddYears(1)
PS C:\> New-AzAutomationSchedule -AutomationAccountName " MyAutomationAccount" -Name "Schedule02" -StartTime
$StartTime -ExpiryTime $EndTime -DayInterval 1 -ResourceGroupName "ResourceGroup01"
Link a schedule to a runbook
Link a schedule to a runbook with PowerShell
5. Select SaveSave and then Test paneTest pane.
You've now successfully created a PowerShell runbook.
To create a new schedule with PowerShell, you must:
Use the New-AzAutomationSchedule cmdlet to create schedules.
Specify the start time for the schedule and the frequency it should run.
The following code example shows how to create a recurring schedule that runs every day at 1:00 PM for one
year:
The first command creates a date object using the Get-Date cmdlet and then stores the object in the
$StartDate variable. Specify a time that is, at least, five minutes in the future.
The second command creates a date object using the Get-Date cmdlet and then stores the object in the
$EndDate variable. The command specifies a future time.
The final command creates a daily schedule named
Schedule02
to begin at the time stored in $StartDate
and expires at the time stored in $EndDate .
Consider the following concepts when you link a schedule to a runbook:
You can link a runbook to multiple schedules and a schedule can have multiple runbooks linked to it.
If a runbook has parameters, you can provide values for them.
Provide values for any mandatory parameters and any optional parameters. These values are used each time
the runbook is started by this schedule.
You can attach the same runbook to another schedule and specify different parameter values.
To link a schedule to a runbook with PowerShell, you must:
Use the Register-AzAutomationScheduledRunbook cmdlet to link a schedule.
You can specify parameter values for the runbook with the Parameters $parameter.
For more information about how to specify parameter values, reference Starting a Runbook in Azure
Automation.
The following code example shows how to link a schedule to a runbook by using an Azure Resource Manager
cmdlet with parameters:
$automationAccountName = "MyAutomationAccount"
$runbookName = "Test-Runbook"
$scheduleName = "Sample-DailySchedule"
$params = @{"FirstName"="Joe";"LastName"="Smith";"RepeatCount"=2;"Show"=$true}
Register-AzAutomationScheduledRunbook -AutomationAccountName $automationAccountName
-Name $runbookName -ScheduleName $scheduleName -Parameters $params `
-ResourceGroupName "ResourceGroup01"
Next steps
Error handling
Error handling for resilient applications in Azure
12/16/2022 3 minutes to read • Edit Online
Key points
Transient fault handling
Request timeouts
Cascading Failures
Ensuring your application can recover from errors is critical when working in a distributed system. You test your
applications to prevent errors and failure, but you need to be prepared for when applications encounter issues
or fail. Understanding how to handle errors and prevent potential failure becomes important, as testing doesn't
always catch everything.
Many things in a distributed system are outside your span of control and your means to test. This can be the
underlying cloud infrastructure, third party runtime dependencies, etc. You can be sure something will fail
eventually, so you need to prepare for that.
Uncover issues or failures in your application's retry logic.
Configure request timeouts to manage inter-component calls.
Implement retry logic to handle transient application failures and transient failures with internal or external
dependencies.
Configure and test health probes for your load balancers and traffic managers.
Segregate read operations from update operations across application data stores.
Track the number of transient exceptions and retries over time to uncover issues or failures in your application's
retry logic. A trend of increasing exceptions over time may indicate that the service is having an issue and may
fail. To learn more, reference Retry service specific guidance.
Use the Retry pattern, paying particular attention to issues and considerations. Avoid overwhelming dependent
services by implementing the Circuit Breaker pattern. Review and incorporate additional best practices guidance
for Transient fault handling. While calling systems that have Throttling pattern implemented, ensure that your
retries are not counter productive.
A reference implementation is available here.It uses Polly and IHttpClientBuilder to implement the Circuit
Breaker pattern.
When making a service call or a database call, ensure that appropriate request timeouts are set. Database
Connection timeouts are typically set to 30 seconds. For guidance on how to troubleshoot, diagnose, and
prevent SQL connection errors, see transient errors for SQL Database.
Leverage design patterns that encapsulate robust timeout strategies like Choreography pattern or
Compensating Transaction pattern.
A reference implementation is available on GitHub.
The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact
on performance. It can help to maintain the response time of the system by quickly rejecting a request for an
Application Health Probes
Command and Query Responsibility Segregation (CQRS)
Next step
Related links
operation that's likely to fail, rather than waiting for the operation to time out, or never return.
A circuit breaker might be able to test the health of a service by sending a request to an endpoint exposed by
the service. The service should return information indicating its status.
Retry pattern. Describes how an application can handle anticipated temporary failures when it tries to connect to
a service or network resource by transparently retrying an operation that has previously failed.
Samples related to this pattern are here.
Configure and test health probes for your load balancers and traffic managers. Ensure that your health endpoint
checks the critical parts of the system and responds appropriately.
For Azure Front Door and Azure Traffic Manager, the health probe determines whether to fail over to another
region. Your health endpoint should check any critical dependencies that are deployed within the same
region.
For Azure Load Balancer, the health probe determines whether to remove a VM from rotation. The health
endpoint should report the health of the VM. Don't include other tiers or external services. Otherwise, a
failure that occurs outside the VM will cause the load balancer to remove the VM from rotation.
Samples related to heath probes are here.
ARM template that deploys an Azure Load Balancer and health probes that detect the health of the
sample service endpoint.
An ASP.NET Core Web API that shows configuration of health checks at startup.
Achieve levels of scale and performance needed for your solution by segregating read and write interfaces by
implementing the CQRS pattern.
Chaos engineering
For information on transient faults, see Troubleshoot transient connection errors.
For guidance on implementing health monitoring in your application, see Health Endpoint Monitoring
pattern.
Go back to the main article: Testing
Chaos engineering
12/16/2022 5 minutes to read • Edit Online
Key points
Increase resiliency
When to apply chaos
Chaos engineering is a methodology that helps developers attain consistent reliability by hardening services
against failures in production. Another way to think about chaos engineering is that it's about embracing the
inherent chaos in complex systems and, through experimentation, growing confidence in your solution's ability
to handle it.
A common way to introduce chaos is to deliberately inject faults that cause system components to fail. The goal
is to observe, monitor, respond to, and improve your system's reliability under adverse circumstances. For
example, taking dependencies offline (stopping API apps, shutting down VMs, etc.), restricting access (enabling
firewall rules, changing connection strings, etc.), or forcing failover (database level, Front Door, etc.), is a good
way to validate that the application is able to handle faults gracefully.
It's difficult to simulate the characteristics of a service's behavior at scale outside a production environment. The
transient nature of cloud platforms can exacerbate this difficulty. Architecting your service to expect failure is a
core approach to creating a modern service. Chaos engineering embraces the uncertainty of the production
environment and strives to anticipate rare, unpredictable, and disruptive outcomes, so that you can minimize
any potential impact on your customers.
Increase service resiliency and ability to react to failures.
Apply chaos principles continuously.
Create and organize a central chaos engineering team.
Follow best practices for chaos testing.
Chaos engineering is aimed at increasing your service's resiliency and its ability to react to failures. By
conducting experiments in a controlled environment, you can identify issues that are likely to arise during
development and deployment. During this process, be vigilant in adopting the following guidelines:
Be proactive.
Embrace failure.
Break the system.
Identify and address single points of failure early.
Install guardrails and graceful mitigation.
Minimize the blast radius.
Build immunity.
Chaos engineering should be an integral part of development team culture and an ongoing practice, not a
short-term tactical effort in response to a single outage.
Development team members are partners in the process. They must be equipped with the resources to triage
issues, implement the testability that's required for fault injection, and drive the necessary product changes.
Ideally, you should apply chaos principles continuously. There's constant change in the environments in which
Process
AT TA C K ERAT TA C K ER DEF EN DER F O R C LO UDDEF EN DER F O R C LO UD
Inject faults Assess
Provide hints Analyze
Mitigate
GoalsGoals
Overall methodOverall method
software and hardware run, so monitoring the changes is key. By constantly applying stress or faults on
components, you can help expose issues early, before small problems are compounded by a number of other
factors.
Apply chaos engineering principles when you're:
Deploying new code.
Adding dependencies.
Observing changes in usage patterns.
Mitigating problems.
Chaos engineering requires specialized expertise, technology, and practices. As with security and performance
teams, the model of a central team supporting the service teams is a common, effective approach.
If you plan to practice the simulated handling of potentially catastrophic scenarios under controlled conditions,
here's a simplified way to organize your teams:
Familiarize team members with monitoring tools.
Recognize outage patterns.
Learn how to assess the impact.
Determine the root cause and mitigate accordingly.
Practice log analysis.
1. Start with a hypothesis.
2. Measure baseline behavior.
3. Inject a fault or faults.
4. Monitor the resulting behavior.
5. Document the process and observations.
6. Identify and act on the result.
Periodically validate your process, architecture choices, and code. By conducting fault-injection experiments, you
can confirm that monitoring is in place and alerts are set up, the
directly responsible individual
(DRI) process is
effective, and your documentation and investigation processes are up to date. Keep in mind a few key
considerations:
Challenge system assumptions.
Validate change (topology, platform, resources).
Use service-level agreement (SLA) buffers.
Use live-site outages as opportunities.
Best practices
Shift leftShift left
Shift rightShift right
Blast radiusBlast radius
Error budget testingError budget testing
Considerations for chaos testing
Shift-left testing means experiment early, experiment often. Incorporate fault-injection configurations and create
resiliency-validation gates during the development stages and in the deployment pipeline.
Shift-right testing means that you verify that the service is resilient where it counts in a pre-production or
production environment with actual customer load. Adopt a proactive approach as opposed to reacting to
failures. Be a part of determining and controlling requirements for the blast radius.
Stop the experiment when it goes beyond scope. Unknown results are an expected outcome of chaos
experiments. Strive to achieve balance between collecting substantial result data and affecting as few production
users as possible. For an example of this principle in practice, see the Bulkhead pattern article.
Establish an error budget as an investment in chaos and fault injection. Your error budget is the difference
between achieving 100% of the service-level objective (SLO) and achieving the
agreed-upon
SLO.
The following questions and answers discuss considerations about chaos engineering, based on its application
inside Azure.
Have you identified faults that are relevant to the development team?Have you identified faults that are relevant to the development team?
Work closely with the development teams to ensure the relevance of the injected failures. Use past incidents or
issues as a guide. Examine dependencies and evaluate the results when those dependencies are removed.
An external team can't hypothesize faults for your team. A study of failures from an artificial source might be
relevant to your team's purposes, but the effort must be justified.
Have you injected faults in a way that accurately reflects production failures?Have you injected faults in a way that accurately reflects production failures?
Simulate production failures. Treat injected faults in the same way that you would treat production-level faults.
Enforcing a tighter limit on the blast radius will enable you to simulate a production environment. Each fault-
injection effort must be accompanied by tooling that's designed to inject the types of faults that are relevant to
your team's scenarios. Here are two basic ways:
Inject faults in a non-production environment, such as Canary or Test In Production (TIP).
Partition the production service or environment.
Halt all faults and roll back the state to its last-known good configuration if the state seems severe.
Have you built confidence incrementally?Have you built confidence incrementally?
Start by hardening the core, and then expand out in layers. At each point, lock in progress with automated
regression tests. Each team should have a long-term strategy based on a progression that makes sense for the
team's circumstances.
By applying the shift left strategy, you can help ensure that any obstacles to developer usage are removed early
and the testing results are actionable.
The process must be very
low tax
. That is, the process must make it easy for developers to understand what
happened and to fix the issues. The effort must fit easily into their normal workflow, not burden them with one-
Next step
Related links
off special activities.
Best practices
For information on release testing, see Testing your application and Azure environment.
For more information, see Bulkhead pattern.
Go back to the main article: Testing
Testing best practices for reliability in Azure
applications
12/16/2022 2 minutes to read • Edit Online
Test regularly
Test for resiliency
Design a backup strategy
Design a disaster recovery strategy
Codify steps to failover and fallback
Plan for regional failures
This article lists Azure best practices to enhance testing Azure applications for reliability. These best practices are
derived from our experience with Azure reliability and the experiences of customers like yourself.
During the architectural phase, focus on implementing practices that meet your business requirements, and
ensure that applications will run in a healthy state without significant downtime.
Test regularly to validate existing thresholds, targets, and assumptions. Regular testing should be performed as
part of each major change and if possible, on a regular basis. While most testing should be performed within
the testing and staging environments, it is often beneficial to also run a subset of tests against the production
system.
To test resiliency, you should verify how the end-to-end workload performs under intermittent failure
conditions. Consider performing the following tests:
Performance testing
Simulation testing
Fault injection testing
Load testing
Operational readiness testing
Failover and failback testing
Design a backup strategy that is tailored to the specific business requirements and circumstances of the
application. At a high level, the approaches can be divided into these categories: 1) Redeploy on disaster, 2)
Warm Spare (Active/Passive), and 3) Hot Spare (Active/Active).
When parts of the Azure network are inaccessible, you might not be able to access your application or data. In
this situation, design a disaster recovery strategy to run most applications with reduced functionality.
Codify steps, preferably automatically, to failover and fallback the application to the primary region once a
failover triggering issue has been addressed. Doing this should ensure capabilities exist to effectively respond to
an outage in a way that limits impact.
Use Azure for creating applications that are distributed across regions. Such distribution helps to minimize the
Implement retry logic
Configure and test health probes
Segregate read and write interfaces
Next step
possibility that a failure in one region could affect other regions.
Track the number of transient exceptions and retries over time to uncover issues or failures in your application's
retry logic. A trend of increasing exceptions over time may indicate that the service is having an issue and may
fail.
Configure and test health probes for your load balancers and traffic managers. Ensure that your health endpoint
checks the critical parts of the system and responds appropriately.
Achieve levels of scale and performance needed for your solution by segregating read and write interfaces by
implementing the Command and Query Responsibility Segregation (CQRS) pattern.
Monitoring
Go back to the main article: Testing
Monitoring for reliability
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure services for monitoring
Reference architecture
Next step
Related links
Monitoring and diagnostics are crucial for resiliency. If something fails, you need to know
that
it failed,
when
it
failed — and
why
.
How do you monitor and measure application health?How do you monitor and measure application health?
The application is instrumented with semantic logs and metrics.
Application logs are correlated across components.
All components are monitored and correlated with application telemetry.
Key metrics, thresholds, and indicators are defined and captured.
A health model has been defined based on performance, availability, and recovery targets.
Azure Service Health events are used to alert on applicable service level events.
Azure Resource Health events are used to alert on resource health events.
Monitor long-running workflows for failures.
Azure Monitor
Application Insights
Azure Service Health
Azure Resource Health
Azure Resource Manager
Azure Policy
Hybrid availability and performance monitoring
Unified logging for microservices applications
Application health
Azure Monitor
Continuous monitoring
Monitoring application health for reliability
12/16/2022 6 minutes to read • Edit Online
Key points
Alerting
Service level alertsService level alerts
Resource level alertsResource level alerts
Monitoring and diagnostics are crucial for availability and resiliency. If something fails, you need to know
that
it
failed,
when
it failed, and
why
.
Monitoring
isn't the same as
failure detection
. For example, your application might detect a transient error and
retry, avoiding downtime. But it should also log the retry operation so that you can monitor the error rate to get
an overall picture of application health.
Define alerts that are actionable and effectively prioritized.
Create alerts that poll for services nearing their limits and quotas.
Use application instrumentation to detect and resolve performance anomalies.
Track the progress of long-running processes.
Troubleshoot issues to gain an overall view of application health.
Alerts are notifications of system health issues that are found during monitoring. Alerts only deliver value if they
are actionable and effectively prioritized by on-call engineers through defined operational procedures. Present
telemetry data in a dashboard or email alert format that makes it easy for an operator to notice problems or
trends quickly.
Use Azure Service Health to respond to
service level
events. Azure Service Health provides a view into the
health of Azure services and regions. It issues communications that impact the following services:
Outages
Planned maintenance activities
Other health advisories
Azure Service Health alerts should be configured to operationalize Service Health events. However, Service
Health alerts shouldn't be used to detect issues because of associated latencies. There is a 5 minute service
level objective (SLO) for automated issues, but many issues require manual interpretation to define a root cause
analysis (RCA). Instead, alerts should be used to provide useful information to help interpret issues that have
been detected and surfaced through the health model, to inform an operational response.
To learn more, reference Azure Service Health.
Use Azure Resource Health to respond to
resource level
events. Azure Resource Health provides information
about the health of individual resources such as a specific virtual machine, and is highly useful when diagnosing
unavailable resources.
Azure Resource Health alerts should be configured for specific resource groups and resource types. These alerts
should be adjusted to maximize signal to noise ratios. For example, only distribute a notification when a
resource becomes unhealthy according to the application health model or due to an Azure platform initiated
event. It's important to consider transient issues when setting an appropriate threshold for resource
unavailability. For example, configure an alert for a virtual machine with a threshold of 1 minute for
DashboardsDashboards
SamplesSamples
Azure subscription and service limits
Individual servicesIndividual services
Azure storage scalability and performance targetsAzure storage scalability and performance targets
Scalability targets for virtual machine disksScalability targets for virtual machine disks
Virtual machine sizeVirtual machine size
unavailability before an alert is triggered.
To learn more, reference Azure Resource Health.
You can also get a full-stack view of application state by using Azure dashboards to create a combined view of
monitoring graphs from the following:
Application Insights
Log Analytics
Azure Monitor metrics
Service Health
Here are some samples about creating and querying alerts:
HealthAlert: A sample about creating resource-level health activity log alerts. The sample uses Azure
Resource Manager to create alerts.
GraphAlertsPsSample: A set of PowerShell commands that queries for alerts generated against your
subscription.
Azure subscriptions have limits on certain resource types, such as number of resource groups, cores, and
storage accounts. To ensure your application doesn't run up against Azure subscription limits, create alerts that
poll for services nearing their limits and quotas.
Address the following subscription limits with alerts.
Individual Azure services have consumption limits on:
Storage
Throughput
Number of connections
Requests per second
Your application will fail if it attempts to use resources beyond these limits, resulting in service throttling and
possible downtime.
Depending on the specific service and your application requirements, you can often stay under these limits by
scaling up (choosing another pricing tier, for example) or scaling out (adding new instances).
Azure allows a maximum number of storage accounts per subscription. If your application requires more
storage accounts than are currently available in your subscription, create a new subscription with extra storage
accounts. For more information, reference Azure subscription and service limits, quotas, and constraints.
An Azure infrastructure as a service (IaaS) virtual machine supports attaching many data disks, depending on
several factors, including the virtual machine size and the type of storage account. If your application exceeds
the scalability targets for virtual machine disks, provision additional storage accounts and create the virtual
machine disks there. To learn more, reference Scalability and performance targets for VM disks.
Azure SQL DatabaseAzure SQL Database
Instrumentation
Long-running workflow failures
TIPTIP
Analysis and diagnosis
If the actual CPU, memory, disk, and I/O of your virtual machines approach the limits of the virtual machine size,
your application may experience capacity issues. To correct the issues, increase the virtual machine size.
If your workload fluctuates over time, consider using virtual machine scale sets to automatically scale the
number of virtual instances. Otherwise, you need to manually increase or decrease the number of virtual
machines.
If your Azure SQL Database tier isn't adequate to handle your application's Database Transaction Unit (DTU)
requirements, your data use will be throttled. For more information on selecting the correct service plan,
reference Azure SQL Database purchasing models.
Instrument applications to measure the customer experience. Effective instrumentation is vital for detecting and
resolving performance anomalies that can impact customer experience, and application availability. To build a
robust application health model, it's vital that you achieve visibility into the operational state of critical internal
dependencies, such as a shared NVA or Express Route connection.
Automated failover and failback systems depend on the correct functioning of monitoring and instrumentation.
Dashboards that visualize system health and operator alerts also depend on having accurate monitoring and
instrumentation. If these elements fail, miss critical information, or report inaccurate data, an operator might not
realize that the system is unhealthy or failing. Make sure you include monitoring systems in your test plan.
Instrument applications to track calls to dependent services. Dependency tracking and measuring the duration
or status of dependency calls is also vital to measuring overall application health. It should be used to inform a
health model for the application.
Microsoft recommends collecting and storing logs, and key metrics of critical components.
Provide rich instrumentation:
For failures that are likely, but have not yet occurred: provide enough data to determine the cause, mitigate
the situation, and ensure that the system remains available.
For failures that have already occurred: the application should return an appropriate error message to the
user, but should attempt to continue running despite reduced functionality.
Monitoring systems should capture comprehensive details so that applications can be restored efficiently and, if
necessary, designers and developers can modify the system to prevent the situation from recurring.
Long-running workflows often include multiple steps, each of which should be independent.
Track the progress of long-running processes to minimize the likelihood that the entire workflow will need to be
rolled back or that multiple compensating transactions will need to be executed.
Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent
Supervisor.
Analyze data combined in these data stores to troubleshoot issues and gain an overall view of application
health. Generally, you can search for and analyze the data in Application Insights, and Log Analytics using Kusto
Related links
Next step
queries, or view preconfigured graphs using management solutions. Use Azure Advisor to view
recommendations with a focus on resiliency and performance.
For information on dashboards, reference Azure dashboards.
For information on virtual machine sizes, reference Sizes for virtual machines in Azure.
For information on scale sets, reference virtual machine scale sets overview.
Go back to the main article: Monitoring
Health modeling
Health modeling for reliability
12/16/2022 5 minutes to read • Edit Online
NOTENOTE
Key points
Healthy and unhealthy states
Quantify application states
The health model should be able to surface the health of critical system flows or key subsystems to ensure
appropriate operational prioritization is applied. For example, the health model should be able to represent the
current state of the user login transaction flow.
The health model should not treat all failures the same. For example, the health model should distinguish
between transient and non-transient faults. It should clearly distinguish between expected-transient but
recoverable failures and a true disaster state.
The health model should clearly distinguish between expected-transient but recoverable failures and a true disaster state.
Know how to tell if an application is healthy or unhealthy.
Understand the impact of logs in diagnostic data.
Ensure the consistent use of diagnostic settings across the application.
Use critical system flows in your health model.
A health model qualifies what
healthy
and
unhealthy
states represent for the application. A holistic application
health model should be used to quantify what healthy and unhealthy states represent across all application
components. It's highly recommended that a "traffic light" model be used to indicate a green/healthy state when
key non-functional requirements and targets are fully satisfied and resources are optimally utilized. For example,
95 percent of requests are processed in <= 500ms with AKS node utilization at x% etc. Once established, this
health model should inform critical monitoring metrics across system components and operational sub-system
composition.
The overall health state can be impacted by both application level issues and resource level failures. Telemetry
correlation should be used to ensure transactions can be mapped through the end-to-end application and
critical system flows, as this is vital to root cause analysis for failures. Platform level metrics and logs such as
CPU percentage, network in/out, and disk operations/sec should be collected from the application to inform a
health model and detect/predict issues. This can also help to distinguish between transient and non-transient
faults.
Application level events should be automatically correlated with resource level metrics to quantify the current
application state. The overall health state can be impacted by both application level issues and resource level
failures.
Telemetry correlation should be used to ensure transactions can be mapped through the end-to-end application
and critical system flows, as this is vital to root cause analysis for failures. Platform level metrics and logs such
as CPU percentage, network in/out, and disk operations/sec should be collected from the application to inform a
health model and detect/predict issues(Telemetry correlation). This can also help to distinguish between
Application logs
White
-
box and black
-
box monitoringWhite
-
box and black
-
box monitoring
Use critical system flows in the health modelUse critical system flows in the health model
Create good health probesCreate good health probes
transient and non-transient faults.
Application logs are an important source of diagnostics data. To gain insight when you need it most, follow
these best practices for application logging:
Use semantic (structured) logging.Use semantic (structured) logging. With structured logs, it's easier to automate the consumption and
analysis of the log data, which is especially important at cloud scale. Generally, we recommend storing
Azure resources metrics and diagnostics data in a Log Analytics workspace rather than in a storage
account. This way, you can use Kusto queries to obtain the data you want quickly and in a structured
format. You can also use Azure Monitor APIs and Azure Log Analytics APIs.
Log data in the production environment.Log data in the production environment. Capture robust telemetry data while the application is
running in the production environment, so you have sufficient information to diagnose the cause of
issues in the production state.
Log events at ser vice boundaries.Log events at ser vice boundaries. Include a correlation ID that flows across service boundaries. If a
transaction flows through multiple services and one of them fails, the correlation ID helps you track
requests across your application and pinpoints why the transaction failed.
Use asynchronous logging.Use asynchronous logging. Synchronous logging operations sometimes block your application code,
causing requests to back up as logs are written. Use asynchronous logging to preserve availability during
application logging.
Separate application logging from auditing.Separate application logging from auditing. Audit records are commonly maintained for
compliance or regulatory requirements and must be complete. To avoid dropped transactions, maintain
audit logs separately from diagnostic logs.
All application resources should be configured to route diagnostic logs and metrics to the chosen log
aggregation technology. Azure Policy should also be used as a device to ensure the consistent use of diagnostic
settings across the application, to enforce the desired configuration for each Azure service.
Application level events should be automatically correlated with resource level metrics to quantify the current
application state. The overall health state can be impacted by both application level issues as well as resource
level failures. Telemetry correlation should be used to ensure transactions can be mapped through the end-to-
end application and critical system flows, as this is vital to root cause analysis (RCA) for failures. Platform level
metrics and logs such as CPU percentage, network in/out, and disk operations/sec should be collected from the
application to inform a health model and detect/predict issues. This can also help to distinguish between
transient and non-transient faults.
Use white box monitoring to instrument the application with semantic logs and metrics. Application level
metrics and logs, such as current memory consumption or request latency, should be collected from the
application to inform a health model and detect/predict issues.
Use black-box monitoring to measure platform services and the resulting customer experience. Black box
monitoring tests externally visible application behavior without knowledge of the internals of the system. This is
a common approach to measuring customer-centric service level indicators (SLIs), service level objectives
(SLOs), and service level agreements (SLAs).
The health model should be able to surface the respective health of critical system flows or key subsystems to
ensure appropriate operational prioritization is applied. For example, the health model should be able to
represent the current state of the user login transaction flow.
Next step
Related links
The health and performance of an application can degrade over time, and degradation might not be noticeable
until the application fails.
Implement probes or check functions, and run them regularly from outside the application. These checks can be
as simple as measuring response time for the application as a whole, for individual parts of the application, for
specific services that the application uses, or for separate components.
Check functions can run processes to ensure that they produce valid results, measure latency and check
availability, and extract information from the system.
The HealthProbesSample sample shows how to set up health probes. It provides an Azure Resource
Manager (ARM) template to set up the infrastructure. A load balancer accepts public requests and load balance
to a set of virtual machines. The health probe is set up so that it can check for service's path /Health.
Best practices
For information on monitoring metrics, see Azure Monitor Metrics overview.
For information on using Application Insights, see What is Application Insights?
Go back to the main article: Monitoring
Monitoring best practices for reliability in Azure
applications
12/16/2022 2 minutes to read • Edit Online
Implement health probes and check functions
Check long-running workflows
Maintain application logs
Measure remote call statistics
Track transient exceptions and retries
Set up an early warning system
Operate within Azure subscription limits
Monitor third-party services
This article lists Azure best practices to enhance monitoring Azure applications for reliability. These best
practices are derived from our experience with Azure reliability and the experiences of customers like yourself.
Implement these best practices for monitoring and alerts in your application so you can detect failures and alert
an operator to fix them.
Run them regularly from outside the application to identify degradation of application health and performance.
Catching issues early can minimize the need to roll back the entire workflow or to execute multiple
compensating transactions.
Log applications in production and at service boundaries.
Use semantic and asynchronous logging.
Separate application logs from audit logs.
Measure remote call statistics, and share the data with the application team to give your operations team an
instantaneous view into application health, summarize remote call metrics, such as latency, throughput, and
errors in the 99 and 95 percentiles. Perform statistical analysis on the metrics to uncover errors that occur within
each percentile.
A trend of increasing exceptions over time indicates that the service is having an issue and may fail. Track
transient exceptions and retries over an appropriate time frame to prevent failure.
Identify the key performance indicators (KPIs) of an application's health, such as transient exceptions and remote
call latency, and set appropriate threshold values for each of them. Send an alert to operations when the
threshold value is reached.
Azure subscriptions have limits on certain resource types, such as the number of resource groups, cores, and
storage accounts. Watch your use of resource types.
Train multiple operators
Log your invocations and correlate them with your application's health and diagnostic logging using a unique
identifier.
Train multiple operators to monitor the application and to perform manual recovery steps. Make sure there is
always at least one trained operator active.
Reliability patterns
12/16/2022 3 minutes to read • Edit Online
Availability
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Deployment Stamps Deploy multiple independent copies of application
components, including data stores.
Geodes Deploy backend services into a set of geographical nodes,
each of which can service any client request in any region.
Health Endpoint Monitoring Implement functional checks in an application that external
tools can access through exposed endpoints at regular
intervals.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes, to smooth intermittent heavy loads.
Throttling Control the consumption of resources by an instance of an
application, an individual tenant, or an entire service.
High availability
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Deployment Stamps Deploy multiple independent copies of application
components, including data stores.
Geodes Deploy backend services into a set of geographical nodes,
each of which can service any client request in any region.
Health Endpoint Monitoring Implement functional checks in an application that external
tools can access through exposed endpoints at regular
intervals.
Availability is measured as a percentage of uptime, and defines the proportion of time that a system is
functional and working. Availability is affected by system errors, infrastructure problems, malicious attacks, and
system load. Cloud applications typically provide users with a service level agreement (SLA), which means that
applications must be designed and implemented to maximize availability.
To mitigate against availability risks from malicious Distributed Denial of Service (DDoS) attacks, implement the
native Azure DDoS protection service or a third party capability.
Azure infrastructure is composed of geographies, regions, and Availability Zones, which limit the blast radius of
a failure and therefore limit potential impact to customer applications and data. The Azure Availability Zones
construct was developed to provide a software and networking solution to protect against datacenter failures
and to provide increased high availability (HA) to our customers. With HA architecture there is a balance
between high resilience, low latency, and cost.
Bulkhead Isolate elements of an application into pools so that if one
fails, the others will continue to function.
Circuit Breaker Handle faults that might take a variable amount of time to
fix when connecting to a remote service or resource.
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Resiliency
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Bulkhead Isolate elements of an application into pools so that if one
fails, the others will continue to function.
Circuit Breaker Handle faults that might take a variable amount of time to
fix when connecting to a remote service or resource.
Compensating Transaction Undo the work performed by a series of steps, which
together define an eventually consistent operation.
Health Endpoint Monitoring Implement functional checks in an application that external
tools can access through exposed endpoints at regular
intervals.
Leader Election Coordinate the actions performed by a collection of
collaborating task instances in a distributed application by
electing one instance as the leader that assumes
responsibility for managing the other instances.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes in order to smooth intermittent heavy
loads.
Retry Enable an application to handle anticipated, temporary
failures when it tries to connect to a service or network
resource by transparently retrying an operation that's
previously failed.
Scheduler Agent Supervisor Coordinate a set of actions across a distributed set of
services and other remote resources.
Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and
malicious.
The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete
for resources and bandwidth, communicate over the Internet, and run on commodity hardware means there is
an increased likelihood that both transient and more permanent faults will arise. The connected nature of the
internet and the rise in sophistication and volume of attacks increase the likelihood of a security disruption.
Detecting failures and recovering quickly and efficiently, is necessary to maintain resiliency.
Overview of the security pillar
12/16/2022 8 minutes to read • Edit Online
Information security has always been a complex subject, and it evolves quickly with the creative ideas and
implementations of attackers and security researchers. The origin of security vulnerabilities started with
identifying and exploiting common programming errors and unexpected edge cases. However over time, the
attack surface that an attacker may explore and exploit has expanded well beyond these common errors and
edge cases. Attackers now freely exploit vulnerabilities in system configurations, operational practices, and the
social habits of the systems' users. As system complexity, connectedness, and the variety of users increase,
attackers have more opportunities to identify unprotected edge cases. Attackers can
hack
systems into doing
things they weren't designed to do.
Security is one of the most important aspects of any architecture. It provides the following assurances against
deliberate attacks and abuse of your valuable data and systems:
Confidentiality
Integrity
Availability
Losing these assurances can negatively affect your business operations and revenue, and your organization's
reputation. For the security pillar, we'll discuss key architectural considerations and principles for security and
how they apply to Azure.
The security of complex systems depends on understanding the business context, social context, and technical
context. As you design your system, cover these areas:
Understanding an IT solution as it interacts with its surrounding environment holds the key to preventing
unauthorized activity and to identifying anomalous behavior that may represent a security risk.
Another key factor in success: Adopt a mindset of assuming failure of security controls. Assuming failure allows
you to design compensating controls that limit risk and damage if a primary control fails.
Assuming failures can be referred to as
assume breach
or
assume compromise
. Assume breach is closely
related to the
Zero Trust
approach of continuously validating security assurances. The Zero Trust approach is
described in the Security Design Principles section in more detail.
Cloud architectures can help simplify the complex task of securing an enterprise estate through specialization
and shared responsibilities:
Specialization:Specialization: Specialist teams at cloud providers can develop advanced capabilities to operate and secure
systems on behalf of organizations. This approach is preferable to numerous organizations individually
developing deep expertise on managing and securing common elements, such as:
Datacenter physical security
Firmware patching
Hypervisor configuration
The economies of scale allow cloud provider specialist teams to invest in optimization of management and
security that far exceeds the ability of most organizations.
Cloud providers must be compliant with the same IT regulatory requirements as the aggregate of all their
customers. Providers must develop expertise to defend against the aggregate set of adversaries who attack their
customers. As a consequence, the default security posture of applications deployed to the cloud is frequently
much better than that of applications hosted on-premises.
Shared Responsibility Model:Shared Responsibility Model: As computing environments move from customer-controlled datacenters to
the cloud, the responsibility of security also shifts. Security of the operational environment is now a concern
shared by both cloud providers and customers. Organizations can reduce focus on activities that aren't core
business competencies by shifting these responsibilities to a cloud service like Azure. Depending on the specific
technology choices, some security protections will be built into the particular service, while addressing others
will remain the customer's responsibility. To ensure that proper security controls are provided, organizations
must carefully evaluate the services and technology choices.
Shared Responsibility and Key Strategies:Shared Responsibility and Key Strategies:
After reading this document, you'll be equipped with key insights about how to improve the security posture of
your architecture.
As part of your architecture design, you should consider all relevant areas that affect the success of your
application. While this article is concerned primarily with security principles, you should also prioritize other
requirements of a well-designed system, such as:
Availability
Scalability
Costs
Operational characteristics (trading off one over the other as necessary)
Consistently sacrificing security for gains in other areas isn't advisable because security risks tend to increase
dynamically over time.
Increasing security risks result in three key strategies:
SEC URIT Y TOPICSEC URIT Y TOPIC DESC RIP T IONDESC RIP T ION
Security design principles These principles describe a securely architected system
hosted on cloud or on-premises datacenters, or a
combination of both.
Governance, risk, and compliance How is the organization's security going to be monitored,
audited, and reported? What types of risks does the
organization face while trying to protect identifiable
information, Intellectual Property (IP), financial information?
Is there specific industry, government, or regulatory
requirements that dictate or provide recommendations on
criteria that your organization's security controls must meet?
Regulatory compliance Governments and other organizations frequently publish
standards to help define good security practices (due
diligence) so that organizations can avoid being negligent in
security.
Establish a modern perimeter :Establish a modern perimeter : For the elements that your organization controls to ensure you have a
consistent set of controls (a perimeter) between those assets and the threats to them. Perimeters should be
designed based on intercepting authentication requests for the resources (identity controls) versus
intercepting network traffic on enterprise networks. This traditional approach isn't feasible for enterprise
assets outside the network.
More on perimeters and how they relate to Zero Trust and Enterprise Segmentation are in the Governance, Risk,
and Compliance and Network Security & Containment sections.
Modernize infrastructure security:Modernize infrastructure security: For operating systems and middleware elements that legacy
applications require, take advantage of cloud technology to reduce security risk to the organization. For
example, knowing whether all servers in a physical datacenter are updated with security patches has
always been challenging because of discoverability. Software-defined datacenters allow easy and rapid
discovery of all resources. This rapid discovery enables technology like Microsoft Defender for Cloud to
measure quickly and accurately the patch state of all servers and remediate them.
"Trust but verify" each cloud provider :"Trust but verify" each cloud provider : For the elements, which are under the control of the cloud
provider. You should ensure the security practices and regulatory compliance of each cloud provider
(large and small) meet your requirements.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the
Microsoft Azure Well-Architected Review.
We cover the following areas in the security pillar of the Microsoft Azure Well-Architected Framework:
Administration Administration is the practice of monitoring, maintaining,
and operating Information Technology (IT) systems to meet
service levels that the business requires. Administration
introduces some of the highest impact security risks because
performing these tasks requires privileged access to a broad
set of these systems and applications.
Applications and services Applications and the data associated with them ultimately
act as the primary store of business value on a cloud
platform.
Identity and access management Identity provides the basis of a large percentage of security
assurances.
Information protection and storage Protecting data at rest is required to maintain confidentiality,
integrity, and availability assurances across all workloads.
Network security and containment Network security has been the traditional linchpin of
enterprise security efforts. However, cloud computing has
increased the requirement for network perimeters to be
more porous and many attackers have mastered the art of
attacks on identity system elements (which nearly always
bypass network controls).
Security Operations Security operations maintain and restores the security
assurances of the system as live adversaries attack it. The
tasks of security operations are described well by the NIST
Cybersecurity Framework functions of Detect, Respond, and
Recover.
SEC URIT Y TOPICSEC URIT Y TOPIC DESC RIP T IONDESC RIP T ION
Identity management
Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully
managed identity and access management service. You can use it to create domains that exist purely on Azure,
or integrate with your on-premises Active Directory identities.
Azure AD is also used by:
Microsoft 365
Dynamics 365
Many third-party applications
For consumer-facing applications, Azure Active Directory B2C lets users authenticate with their existing social
accounts, such as:
Facebook
Google
LinkedIn
Users can also create a new user account managed by Azure AD.
If you want to integrate an on-premises Active Directory environment with an Azure network, several
approaches are possible, depending on your requirements. For more information, reference Identity
Management reference architectures.
Protect your infrastructure
Application security
Data sovereignty and encryption
Security resources
Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an
Azure AD tenant.
Use Azure role-based access control (Azure RBAC role) to grant users within your organization the correct
permissions to Azure resources. Grant access by assigning Azure roles to users or groups at a certain scope. The
scope can be a:
Subscription
Resource group
Single resource
Audit all changes to infrastructure.
In general, the security best practices for application development still apply in the cloud. Best practices include:
Encrypt data in-transit with the latest supported TLS versions
Protect against CSRF and XSS attacks
Prevent SQL injection attacks
Cloud applications often use managed services that have access keys. Never check these keys into source
control. Consider storing application secrets in Azure Key Vault.
Make sure that your data remains in the correct geopolitical zone when using Azure data services. Azure's geo-
replicated storage uses the concept of a paired region in the same geopolitical region.
Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets
by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services
support data encryption at rest, including:
Azure Storage
Azure SQL Database
Azure Synapse Analytics
Azure Cosmos DB
Microsoft Defender for Cloud provides integrated security monitoring and policy management for your
workload.
Azure Security Documentation
Microsoft Trust Center
The security pillar is part of a comprehensive set of security guidance that also includes:
Security in the Microsoft Cloud Adoption Framework for Azure: A high-level overview of a cloud security end
state.
Security architecture design: Implementation-level journey of our security architectures.
Azure security benchmarks: Prescriptive best practices and controls for Azure security.
End-to-end security in Azure: Documentation that introduces you to the security services in Azure.
Browse our security architectures
Next stepNext step
Top 10 security best practices for Azure: Top Azure security best practices that Microsoft recommends based
on lessons learned across customers and our own environments.
Microsoft Cybersecurity Architectures: The diagrams describe how Microsoft security capabilities integrate
with Microsoft platforms and 3rd-party platforms.
Principles
Security design principles
12/16/2022 2 minutes to read • Edit Online
Plan resources and how to harden them
Automate and use least privilege
Classify and encrypt data
Monitor system security, plan incident response
Security design principles describe a securely architected system hosted on cloud or on-premises datacenters
(or a combination of both). Application of these principles dramatically increases the likelihood your security
architecture assures confidentiality, integrity, and availability.
To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the
Microsoft Azure Well-Architected Review.
The following design principles provide:
Context for questions
Why a certain aspect is important
How an aspect is applicable to Security
These critical design principles are used as lenses to assess the Security of an application deployed on Azure.
These lenses provide a framework for the application assessment questions.
Recommendations:
Consider security when planning workload resources.
Understand how individual cloud services are protected.
Use a service enablement framework to evaluate.
Recommendations:
Implement least privilege throughout the application and control plane to protect against data exfiltration
and malicious actor scenarios.
Drive automation through DevSecOps to minimize the need for human interaction.
Recommendations:
Classify data according to risk.
Apply industry-standard encryption at rest and in transit, which ensures keys and certificates are stored
securely and managed properly.
Recommendations:
Correlate security and audit events to model application health.
Correlate security and audit events to identify active threats.
Establish automated and manual procedures to respond to incidents.
Use security information and event management (SIEM) tooling for tracking.
Identify and protect endpoints
Protect against code-level vulnerabilities
Model and test against potential threats
Next stepNext step
Recommendations:
Monitor and protect the network integrity of internal and external endpoints through security appliances or
Azure services, such as:
Use industry standard approaches to protect against common attack vectors, such as distributed denial of
service (DDoS) attacks like SlowLoris.
Firewalls
Web application firewalls
Recommendations:
Identify and mitigate code-level vulnerabilities, such as cross-site scripting and structured query language
(SQL) injection.
In the operational lifecycle, regularly incorporate:
Security fixes
Codebase and dependency patching
Recommendations:
Establish procedures to identify and mitigate known threats.
Use penetration testing to verify threat mitigation.
Use static code analysis to detect and prevent future vulnerabilities.
Use code scanning to detect and prevent future vulnerabilities
Design governance
Governance, risk, and compliance
12/16/2022 4 minutes to read • Edit Online
Prioritize security best practices investments
As part of overall design, prioritize where to invest the available resources; financial, people, and time.
Constraints on those resources also affect the security implementation across the organization. To achieve an
appropriate ROI on security the organization needs to first understand and define its security priorities.
Governance:Governance: How is the organization's security going to be monitored, audited, and reported? Design
and implementation of security controls within an organization is only the beginning of the story. How
does the organization know that things are actually working? Are they improving? Are there new
requirements? Is there mandatory reporting? Similar to compliance there may be external industry,
government or regulatory standards that need to be considered.
Risk :Risk : What types of risks does the organization face while trying to protect identifiable information,
Intellectual Property (IP), financial information? Who may be interested or could use this information if
stolen, including external and internal threats as well as unintentional or malicious? A commonly
forgotten but extremely important consideration within risk is addressing Disaster Recovery and
Business Continuity.
Compliance:Compliance: Is there a specific industry, government, or regulatory requirements that dictate or provide
recommendation on criteria that your organization's security controls must meet? Examples of such
standards, organizations, controls, and legislation are ISO27001, NIST, PCI-DSS.
The collective role of organization(s) is to manage the security standards of the organization through their
lifecycle:
Define:Define: Set organizational policies for operations, technologies, and configurations based on internal
factors (business requirements, risks, asset evaluation) and external factors (benchmarks, regulatory
standards, threat environment).
Improve:Improve: Continually push these standards incrementally forward towards the ideal state to ensure
continual risk reduction.
Sustain:Sustain: Ensure the security posture doesn't degrade naturally over time by instituting auditing and
monitoring compliance with organizational standards.
Security best practices are ideally applied proactively and completely to all systems as you build your cloud
program, but this isn't reality for most enterprise organizations. Business goals, project constraints, and other
factors often cause organizations to balance security risk against other risks and apply a subset of best practices
at any given point.
We recommend applying as many of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program to include review, prioritization, and proactive application
of best practices to cloud resources. We recommend evaluating the following considerations when prioritizing
which to follow first:
High business impact and highly exposed systems:High business impact and highly exposed systems: These include systems with direct intrinsic
value as well as the systems that provide attackers a path to them. For more information, see Identify and
classify business critical applications.
Easiest to implement mitigations:Easiest to implement mitigations: Identify quick wins by prioritizing the best practices, which your
Checklist
In this section
A SSESSM EN TA SSESSM EN T DESC RIP T IONDESC RIP T ION
Are there any regulator y requirements for thisAre there any regulator y requirements for this
workload?workload?
Understand all regulatory requirements. Check the Microsoft
Trust Center for the latest information, news, and best
practices in security, privacy, and compliance.
Is the organization using a landing zone for thisIs the organization using a landing zone for this
workload?workload?
Consider the security controls placed on the infrastructure
into which the workload will get deployed.
Do you have a segmentation strategyDo you have a segmentation strategy Reference model and strategies of how the functions and
teams can be segmented.
Are you using management groups as par t of yourAre you using management groups as par t of your
segmentation strategy?segmentation strategy?
Strategies using management groups to manage resources
across multiple subscriptions consistently and efficiently.
What security controls do you have in place forWhat security controls do you have in place for
access to Azure infrastructure?access to Azure infrastructure?
Guidance on reducing risk exposure in scope and time when
configuring critical impact accounts such as Administrators.
Azure security benchmark
organization can execute quickly because you already have the required skills, tools, and knowledge to do
it (for example, implementing a Web App Firewall (WAF) to protect a legacy application). Be careful not to
exclusively use (or overuse) this short-term prioritization method. Doing so can increase your risk by
preventing your program from growing and leaving critical risks exposed for extended periods.
Microsoft has provided some prioritized lists of security initiatives to help organizations start with these
decisions based on our experience with threats and mitigation initiatives in our own environments and across
our customers. See Module 4a of the Microsoft CISO Workshop.
What considerations for compliance and governance did you make?What considerations for compliance and governance did you make?
Create a landing zone for the workload. The infrastructure must have appropriate controls and be repeatable
with every deployment.
Enforce creation and deletion of services and their configuration through Azure Policies.
Ensure consistency across the enterprise by applying policies, permissions, and tags across all subscriptions
through careful implementation of root management group.
Understand regulatory requirements and operational data that may be used for audits.
Continuously monitor and assess the compliance of your workload. Perform regular attestations to avoid
fines.
Review and apply recommendations from Azure.
Remediate basic vulnerabilities to keep the attacker costs high.
Follow these questions to assess the workload at a deeper level.
The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to
help secure the services you use in Azure:
The questions in this section are aligned to these controls:
Reference architecture
Next steps
Related links
Governance and Strategy
Posture and vulnerability management
Here are some reference architectures related to governance:
Cloud Adoption Framework enterprise-scale landing zone architecture
Provide security assurance through identity management to authenticate and grant permission to users,
partners, customers, applications, services, and other entities.
Identity and access management
Go back to the main article: Security
Regulatory compliance
12/16/2022 3 minutes to read • Edit Online
Key points
Review the requirements
Suggested actionSuggested action
A workload can have regulatory requirements, which may mandate that operational data, such as application
logs and metrics, remain within a certain geo-political region.
These requirements may need strict security measures that affect the overall architecture, the selection, and
configuration of specific PaaS, and SaaS services. The requirements also have implications for how the workload
should be operationalized.
Make sure that all regulatory and governance requirements are known, and well understood.
Periodically perform external and, or internal workload security audits.
Have compliance checks as part of the workload operations.
Use Microsoft Trust Center.
Regulatory organizations frequently publish standards and updates to help define good security practices so
that organizations can avoid negligence. The purpose and scope of these standards, and regulations vary. The
security requirements, however, can influence the design for data protection and retention, network access, and
system security.
Knowing whether your cloud resources are in compliance with standards mandated by governments or industry
organizations is essential in today's globalized world.
For example, a workload that handles credit card transactions is subject to the Payment Card Industry (PCI)
standard. One of the requirements prohibits access between the internet and any system component in the
cardholder data environment.
To provide a restrictive environment, you can choose to do the following:
Host the workload in different Azure compute options that supports bring your own VNet.
Remove any internet-facing endpoints by using Private Endpoints.
Use network security groups (NSGs) rules that define authorized inbound and outbound access.
Noncompliance can lead to fines or other business impact. Work with your regulators and carefully review the
standard to understand both the intent and the literal wording of each requirement. Here are some questions
that may help you understand each requirement.
How is compliance measured?
Who approves that the workload meets the requirements?
Are there processes for obtaining attestations?
What are the documentation requirements?
Use Microsoft Defender for Cloud to assess your current compliance score and to identify the gaps.
Learn moreLearn more
Tutorial: Improve your regulatory compliance
Use the Microsoft Trust Center
Elevated security capabilities
Suggested actionsSuggested actions
Operational considerations
Keep checking the Microsoft Trust Center for the latest information, news, and best practices in security, privacy,
and compliance.
Data governanceData governance. Focus on protecting information in cloud services, mobile devices, workstations, or
collaboration platforms. Build the security strategy by classifying and labeling information. Use strong
access control and encryption technology.
Compliance offeringsCompliance offerings. Microsoft offers a comprehensive set of compliance offerings to help your
organization follow national, regional, and industry-specific requirements governing the collection and
use of data. For information, see Compliance offerings.
Compliance scoreCompliance score. Use Microsoft Compliance Score to assess your data protection controls on an
ongoing basis. Act on the recommendations to make progress toward compliance.
Audit repor tsAudit repor ts. Use audit reports to stay current on the latest privacy, security, and compliance-related
information for Microsoft's cloud services. See Audit Reports.
Shared responsibilityShared responsibility. The workload can be hosted on Software as a Service (SaaS), Platform as a
Service (PaaS), Infrastructure as a Service (IaaS), or in an on-premises datacenter. Have a clear
understanding about the portions of the architecture you're responsible for versus Azure. Whatever the
hosting model, the following responsibilities are always retained by you:
Data
Endpoints
Account
Access management
For more information, reference Shared responsibility in the cloud.
Consider whether to use specialized security capabilities in your enterprise architecture.
Dedicated HSMs and Confidential Computing have the potential to enhance security and meet regulatory
requirements, but can introduce complexity that may negatively impact your operations and efficiency.
We recommend careful consideration and judicious use of these security measures as required:
Dedicated Hardware Security Modules (HSMs)Dedicated Hardware Security Modules (HSMs)
Dedicated Hardware Security Modules (HSMs) may help meet regulatory or security requirements.
Confidential ComputingConfidential Computing
Confidential Computing may help meet regulatory or security requirements.
Learn more about elevated security capabilities for Azure workloads.
Regulatory requirements may influence the workload operations. For example, there might be a requirement
that operational data, such as application logs and metrics, remain within a certain geo-political region.
Consider automation of deployment and maintenance tasks. Automation reduces security and compliance risk
by limiting opportunity to introduce human errors during manual tasks.
Related links
Next
Azure maintains a compliance portfolio that covers US government, industry specific, and region/country
standards. For more information, reference Azure compliance offerings.
Monitor the compliance of the workload to check if the security controls are aligned to the regulatory
requirements. For more information, reference Security audits.
Go back to the main article: Governance
Azure landing zone
Azure landing zone integration
12/16/2022 3 minutes to read • Edit Online
Increase automation with Azure Blueprints
From a workload perspective, a
landing zone
refers to a prepared platform into which the application gets
deployed. A landing zone implementation can have compute, data sources, access controls, and networking
components already provisioned. With the required plumbing ready in place; the workload needs to plug into it.
When considering the overall security, a landing zone offers centralized security capabilities that adds a threat
mitigation layer for the workload. Implementations can vary but here are some common strategies that enhance
the security posture.
Isolation through segmentation. You can isolate assets at several layers from Azure enrollment down to a
subscription that has the resources for the workload. This strategy of having resources within a boundary
that is separate from other parts of the organization is an effective way of detecting and containing
adversary movements.
Consistent adoption of organizational policies. Policies govern which resources can be used and their
usage limits. Policies also provide identity controls. Only authenticated and authorized entities are
allowed access. This approach decouples the governance requirements from the workload requirements.
It's crucial that a landing zone is handed over to the workload owner with the security guardrails
deployed.
Configurations that align with principles of Zero Trust . For instance an implementation might have
network connectivity to on-premises data centers. When designing networking controls, the landing zone
may apply the least-privilege principle by opening communication paths only when necessary and only
to trusted entities.
The preceding examples are conceptually simple but the implementation can get complicated for an enterprise-
scale deployment. Azure landing zone as part of the Cloud Adoption Framework (CAF) provides architecture
guidance about identity and access management, networking, and other design areas necessary to achieve an
optimal implementation.
Learn moreLearn more
What is an Azure landing zone?
Landing zone implementation options
Use Azure's native automation capabilities to increase consistency, compliance, and deployment speed for
workloads. A recommended way to implement a landing zone is with Azure Blueprints and Azure Policies.
Automation of deployment and maintenance tasks reduces security and compliance risk by limiting opportunity
to introduce human errors during manual tasks. This will also allow both IT Operations teams and security
teams to shift their focus from repeated manual tasks to higher value tasks like enabling developers and
business initiatives, protecting information, and so on.
Utilize the Azure Blueprint service to rapidly and consistently deploy application environments that are
compliant with your organization's policies and external regulations. Azure Blueprint Service automates
deployment of environments including Azure roles, policies, resources, such as virtual machines, networking,
storage, and more. Azure Blueprints builds on Microsoft's significant investment into the Azure Resource
Manager to standardize resource deployment in Azure and enable resource deployment and governance based
Enforce policy compliance
Architecture
Next
on a desired-state approach. You can use built in configurations in Azure Blueprint, make your own, or just use
Resource Manager scripts for smaller scope.
Several Security and Compliance Blueprints samples are available to use as a starting template.
Organizations of all sizes will have security compliance requirements. Industry, government, and internal
corporate security policies all need to be audited and enforced. Policy monitoring is critical to check that initial
configurations are correct and that it continues to be compliant over time.
In Azure, you can take advantage of Azure Policy to create and manage policies that enforce compliance. Like
Azure Blueprints, Azure Policies are built on the underlying Azure Resource Manager capabilities in the Azure
platform (and Azure Policy can also be assigned via Azure Blueprints).
For more information on how to do this in Azure, please review Tutorial: Create and manage policies to enforce
compliance.
How do you consistently deploy landing zones that follow organizational policies?How do you consistently deploy landing zones that follow organizational policies?
Key Azure services that can help in creating a landing zone:
Azure Blueprints sketches a solution's design parameters based on an organization's standards, patterns, and
requirements.
Azure Resource Manager template specs stores an Azure Resource Manager template (ARM template) in
Azure for later deployment.
Azure Policy enforces organizational standards and to assess compliance at-scale.
Azure AD and Azure role-based access control (Azure RBAC) work in conjunction to provide identity and
access controls.
Microsoft Defender for Cloud
Microsoft Defender for Cloud
For information about an enterprise-scale reference architecture, see Cloud Adoption Framework enterprise-
scale landing zone architecture. The architecture provides considerations in these critical design areas:
Enterprise Agreement (EA) enrollment and Azure Active Directory tenants
Identity and access management
Management group and subscription organization
Network topology and connectivity
Management and monitoring
Business continuity and disaster recovery
Security, governance, and compliance
Platform automation and DevOps
Use management groups to manage resources across multiple subscriptions consistently and efficiently.
Management groups
Back to the main article: Governance
Segmentation strategies
12/16/2022 4 minutes to read • Edit Online
Reference model
Segmentation refers to the isolation of resources from other parts of the organization. It's an effective way of
detecting and containing adversary movements.
One approach to segmentation is network isolation. This approach is not recommended because different
technical teams may not be aligned with the business use cases and application workloads. One outcome of
such a mismatch is complexity, as especially seen with on-premises networking, and can lead to reduced
velocity, or in worse cases, broad network firewall exceptions. Although network control should be considered as
a segmentation strategy, it should be part of a unified segmentation strategy.
Network security has been the traditional linchpin of enterprise security efforts. However, cloud computing has
increased the requirement for network perimeters to be more porous and many attackers have mastered the art
of attacks on identity system elements (which nearly always bypass network controls). These factors have
increased the need to focus primarily on identity-based access controls to protect resources rather than
network-based access controls.
An effective segmentation strategy will guide
all
technical teams (IT, security, applications) to consistently isolate
access using networking, applications, identity, and any other access controls. The strategy should aim to:
Minimize operational friction by aligning to business practices and applications
Contain risk by adding cost to attackers. This is done by:
Monitor operations that might lead to potential violation of the integrity of the segments (account usage,
unexpected traffic).
Isolating sensitive workloads from compromise by other assets.
Isolating high-exposure systems from being used as a pivot to other systems.
Here are some recommendations for creating a unified strategy:
Ensure alignment of technical teams to a single strategy based on assessing business risks.
Establish a modern perimeter based on zero-trust principles, focused on identity, devices, applications, and
other signals. This helps overcome limitations of network controls in protecting from new resources and
attack types.
Reinforce network controls for legacy applications by exploring microsegmentation strategies.
Centralize the organizational responsibility for management and security of core networking functions such
as cross-premises links, virtual networking, subnetting, and IP address schemes as well as network security
elements such as virtual network appliances, encryption of cloud virtual network activity and cross-premises
traffic, network-based access controls, and other traditional network security components.
Start with this reference model and adapt it to your organization's needs. This model shows how functions,
resources, and teams can be segmented.
Example segmentsExample segments
Core Services segmentCore Services segment
Additional segmentsAdditional segments
Clear lines of responsibilityClear lines of responsibility
FU NC T IONFU NC T ION SC O P ESC O P E RE SP ON SIBIL IT YRESP O N SIB IL IT Y
Policy management (Core and
individual segments)
Some or all resources. Monitor and enforce compliance with
external (or internal) regulations,
standards, and security policy assign
appropriate permission to those roles.
Central IT operations (Core) Across all resources. Grant permissions to the central IT
department (often the infrastructure
team) to create, modify, and delete
resources like virtual machines and
storage.
Consider isolating shared and individual resources as shown in the preceding image.
This segment hosts shared services utilized across the organization. These shared services typically include
Active Directory Domain Services, DNS/DHCP, and system management tools hosted on Azure Infrastructure as
a Service (IaaS) virtual machines.
Other segments can contain grouped resources based on certain criteria. For instance, resources that are used
by one specific workload or application might be contained in a separate segment. You may also segment or
sub-segment by lifecycle stage, like
development
,
test
, and
production
. Some resources might intersect, such as
applications, and can use virtual networks for lifecycle stages.
These are the main functions for this reference model. Permissions for these functions are described in Team
roles and responsibilities.
Central networking group (Core and
individual segments)
All network resources. Centralize network management and
security to reduce the potential for
inconsistent strategies that create
potential attacker exploitable security
risks. Because all divisions of the IT and
development organizations do not
have the same level of network
management and security knowledge
and sophistication, organizations
benefit from leveraging a centralized
network team's expertise and tooling.
Ensure consistency and avoid technical
conflicts, assign network resource
responsibilities to a single central
networking organization. These
resources should include virtual
networks, subnets, Network Security
Groups (NSG), and the virtual
machines hosting virtual network
appliances.
Resource role permissions (Core) - For most core services, administrative
privileges required are granted
through the application (Active
Directory, DNS/DHCP, System
Management Tools). No additional
Azure resource permissions are
required. If your organizational model
requires these teams to manage their
own VMs, storage, or other Azure
resources, you can assign these
permissions to those roles.
Security operations (Core and
individual segments)
All resources. Assess risk factors, identify potential
mitigations, and advise organizational
stakeholders who accept the risk.
FU NC T IONFU NC T ION SC O P ESC O P E RE SP ON SIBIL IT YRESP O N SIB IL IT Y
IT operations (individual segments) All resources. Grant permission to create, modify,
and delete resources. The purpose of
the segment (and resulting
permissions) will depend on your
organization structure.
Service admin (Core and individual
segments)
Use the service admin role only for
emergencies (and initial setup if
required). Do not use this role for daily
tasks.
FU NC T IONFU NC T ION SC O P ESC O P E RE SP ON SIBIL IT YRESP O N SIB IL IT Y
Next steps
Segments with resources
managed by a centralized IT
organization can grant the
central IT department (often
the infrastructure team)
permission to modify these
resources.
Segments managed by
independent business units or
functions (such as a Human
Resources IT Team) can grant
those teams permission to all
resources in the segment.
Segments with autonomous
DevOps teams don't need to
grant permissions across all
resources because the resource
role (below) grants permissions
to application teams. For
emergencies, use the service
admin account (break-glass
account).
Start with this reference model and manage resources across multiple subscriptions consistently and efficiently
with management groups.
Management groups
Establish segmentation with management groups
12/16/2022 3 minutes to read • Edit Online
Support your segmentation strategy with management groups
Use root management group with caution
Management groups can manage resources across multiple subscriptions consistently and efficiently. However,
due to its flexibility, your design can become complex and compromise security and operations.
Structure management groups into a simple design that guides the enterprise segmentation model.
Management groups offer the ability to consistently and efficiently manage resources (including multiple
subscriptions as needed). However, because of their flexibility, it's possible to create an overly complex design.
Complexity creates confusion and negatively impacts both operations and security (as illustrated by overly
complex Organizational Unit (OU) and Group Policy Object (GPO) designs for Active Directory).
Microsoft recommends aligning the top level of management groups (MGs) into a simple enterprise
segmentation strategy and limit the levels to no more than two.
In the example reference, there are enterprise-wide resources used by all segments, a set of core services that
share services, additional segments for each workload.
C a u t i o nC a u t i o n
Root management group for enterprise-wide resources.
Use the root management group to include identities that have the requirement to apply policies across
every resource. For example, regulatory requirements, such as restrictions related to data sovereignty.
This group is effective in by applying policies, permissions, tags, across all subscriptions.
Be careful when using the root management group because the policies can affect all resources on Azure
and potentially cause downtime or other negative impacts. For considerations, see Use root management
group with caution later in this article.
For complete guidance about using management groups for an enterprise, see CAF: Management group
and subscription organization.
Management group for each workload segment.
Use a separate management group for teams with limited scope of responsibility. This group is typically
required because of organizational boundaries or regulatory requirements.
Root or segment management group for the core set of services.
Use the Root Management Group (MG) for enterprise consistency, but test changes carefully to minimize risk of
operational disruption.
The root management group enables you to ensure consistency across the enterprise by applying policies,
permissions, and tags across all subscriptions. Care must be taken when planning and implementing
assignments to the root management group because this can affect every resource on Azure and potentially
cause downtime or other negative impacts on productivity in the event of errors or unanticipated effects.
Plan carefully:Plan carefully: Select enterprise-wide elements to the root management group that have a clear
requirement to be applied across every resource and/or low impact.
Next steps
Select enterprise-wide identities that have a clear requirement to be applied across all resources. Good
candidates include:
Regulator y requirementsRegulator y requirements with clear business risk/impact. For example, restrictions related to
data sovereignty.
Near-zero potential negative impact.Near-zero potential negative impact. For example, policy with audit effect, tag assignment,
Azure RBAC permissions assignments that have been carefully reviewed.
Use a dedicated service principal name (SPN) to execute management group management operations,
subscription management operations, and role assignment. SPN reduces the number of users who have
elevated rights and follows least-privilege guidelines. Assign the User Access AdministratorUser Access Administrator at the root
management group scope (/) to grant the SPN just mentioned access at the root level. After the SPN is
granted permissions, the User Access AdministratorUser Access Administrator role can be safely removed. In this way, only the
SPN is part of the User Access AdministratorUser Access Administrator role. Assign ContributorContributor permission to the SPN, which
allows tenant-level operations. This permission level ensures that the SPN can be used to deploy and
manage resources to any subscription within your organization.
Limit the number of Azure Policy assignments made at the root management group scope ( / ). This
limitation minimizes debugging inherited policies in lower-level management groups.
Don't create any subscriptions under the root management group. This hierarchy ensures that
subscriptions don't only inherit the small set of Azure policies assigned at the root-level management
group, which don't represent a full set necessary for a workload.
Test first:Test first: Plan, test, and validate all enterprise-wide changes on the root management group before
applying (policy, tags, Azure RBAC model, and so on).
Test lab:Test lab: Representative lab tenant or lab segment in production tenant.
Production pilot:Production pilot: This can be a segment management group or designated subset in
subscription(s) management group.
Validate changes:Validate changes: to ensure they have the desired effect.
Administrative accounts
Administrative account security
12/16/2022 9 minutes to read • Edit Online
Administration is the practice of monitoring, maintaining, and operating Information Technology (IT) systems to
meet service levels that the business requires. Administration introduces some of the highest impact security
risks because performing these tasks requires privileged access to a very broad set of these systems and
applications. Attackers know that gaining access to an account with administrative privileges can get them
access to most or all of the data they would target, making the security of administration one of the most critical
security areas.
As an example, Microsoft makes significant investments in protection and training of administrators for our
cloud systems and IT systems:
Microsoft's recommended core strategy for administrative privileges is to use the available controls to reduce
risk.
Reduce risk exposure (scope and time):Reduce risk exposure (scope and time): The principle of least privilege is best accomplished with modern
controls that provide privileges on demand. This help to limit risk by limiting administrative privileges exposure
by:
Scope:Scope:
Just enough access (JEA)
provides only the required privileges for the administrative operation
required (vs. having direct and immediate privileges to many or all systems at a time, which is almost
never required).
Time:Time:
Just in time (JIT)
approaches provided the required privileged as they are needed.
Mitigate the remaining risks:Mitigate the remaining risks: Use a combination of preventive and detective controls to reduce risks
such as isolating administrator accounts from the most common risks phishing and general web
browsing, simplifying and optimizing their workflow, increasing assurance of authentication decisions,
and identifying anomalies from normal baseline behavior that can be blocked or investigated.
Microsoft has captured and documented best practices for protecting administrative accounts and published
prioritized roadmaps for protecting privileged access that can be used as references for prioritizing mitigations
Minimize number of critical impact admins
Managed accounts for admins
Separate accounts for admins
No standing access / just in time privileges
for accounts with privileged access.
Securing Privileged Access (SPA) roadmap for administrators of on-premises Active Directory
Guidance for securing administrators of Azure Active Directory
Grant the fewest number of accounts to privileges that can have a critical business impact.
Each admin account represents potential attack surface that an attacker can target, so minimizing the number of
accounts with that privilege helps limit the overall organizational risk. Experience has taught us that
membership of these privileged groups grows naturally over time as people change roles if membership not
actively limited and managed.
We recommend an approach that reduces this attack surface risk while ensuring business continuity in case
something happens to an administrator:
Assign at least two accounts to the privileged group for business continuity.
When two or more accounts are required, provide justification for each member including the original
two.
Regularly review membership & justification for each group member.
Ensure all critical impact admins are managed by enterprise directory to follow organizational policy
enforcement.
Consumer accounts such as Microsoft accounts like @Hotmail.com, @live.com, @outlook.com, don't offer
sufficient security visibility and control to ensure the organization's policies and any regulatory requirements are
being followed. Because Azure deployments often start small and informally before growing into enterprise-
managed tenants, some consumer accounts remain as administrative accounts long afterward for example,
original Azure project managers, creating blind spots, and potential risks.
Ensure all critical impact admins have a separate account for administrative tasks (vs the account they use for
email, web browsing, and other productivity tasks).
Phishing and web browser attacks represent the most common attack vectors to compromise accounts,
including administrative accounts.
Create a separate administrative account for all users that have a role requiring critical privileges. For these
administrative accounts, block productivity tools like Office 365 email (remove license). If possible, block
arbitrary web browsing (with proxy and/or application controls) while allowing exceptions for browsing to the
Azure portal and other sites required for administrative tasks.
Avoid providing permanent "standing" access for any critical impact accounts.
Permanent privileges increase business risk by increasing the time an attacker can use the account to do
damage. Temporary privileges force attackers targeting an account to either work within the limited times the
admin is already using the account or to initiate privilege elevation (which increases their chance of being
detected and removed from the environment).
Emergency access or 'Break Glass' accounts
Admin workstation security
Critical impact admin dependencies Account/Workstation
Grant privileges required only as required using one of these methods:
Just in time:Just in time: Enable Azure AD Privileged Identity Management (PIM) or a third party solution to require
following an approval workflow to obtain privileges for critical impact accounts.
Break glass:Break glass: For rarely used accounts, follow an emergency access process to gain access to the
accounts. This is preferred for privileges that have little need for regular operational usage like members
of global admin accounts.
Ensure you have a mechanism for obtaining administrative access in case of an emergency.
While rare, sometimes extreme circumstances arise where all normal means of administrative access are
unavailable.
We recommend following the instructions at Managing emergency access administrative accounts in Azure AD
and ensure that security operations monitor these accounts carefully.
Ensure critical impact admins use a workstation with elevated security protections and monitoring.
Attack vectors that use browsing and email like phishing are cheap and common. Isolating critical impact
admins from these risks will significantly lower your risk of a major incident where one of these accounts is
compromised and used to materially damage your business or mission.
Choose level of admin workstation security based on the options available at https://aka.ms/securedworkstation
Highly Secure Productivity Device (Enhanced Security Workstation or SpecializedHighly Secure Productivity Device (Enhanced Security Workstation or Specialized
Workstation)Workstation)
You can start this security journey for critical impact admins by providing them with a higher security
workstation that still allows for general browsing and productivity tasks. Using this as an interim step
helps ease the transition to fully isolated workstations for both the critical impact admins as well as the IT
staff supporting these users and their workstations.
Privileged Access Workstation (Specialized Workstation or Secured Workstation)Privileged Access Workstation (Specialized Workstation or Secured Workstation)
These configurations represent the ideal security state for critical impact admins as they heavily restrict
access to phishing, browser, and productivity application attack vectors. These workstations don't allow
general internet browsing, only allow browser access to Azure portal and other administrative sites.
Carefully choose the on-premises security dependencies for critical impact accounts and their workstations.
To contain the risk from a major incident on-premises spilling over to become a major compromise of cloud
assets, you must eliminate or minimize the means of control that on premises resources have to critical impact
accounts in the cloud. As an example, attackers who compromise the on-premises Active Directory can access
and compromise cloud-based assets that rely on those accounts like resources in Azure, Amazon Web Services
(AWS), ServiceNow, and so on. Attackers can also use workstations joined to those on premises domains to gain
access to accounts and services managed from them.
Choose the level of isolation from on premises means of control also known as security dependencies for critical
impact accounts.
User Accounts:User Accounts: Choose where to host the critical impact accounts
Native Azure AD Accounts - Create Native Azure AD Accounts that are not synchronized with on-
Passwordless or multifactor authentication for admins
Enforce conditional access for admins - Zero Trust
Avoid granular and custom permissions
premises active directory.
Synchronize from On Premises Active Directory.
Use existing accounts hosted in the on-premises active directory.
Workstations:Workstations: Choose how you will manage and secure the workstations used by critical admin
accounts:
Native Cloud Management and Security (Recommended): Join workstations to Azure AD &
Manage/Patch them with Intune or other cloud services. Protect and Monitor with Windows
Microsoft Defender for Endpoint or another cloud service not managed by on premises based
accounts.
Manage with Existing Systems: Join existing AD domain and use existing management/security.
Require all critical impact admins to use passwordless authentication or multifactor authentication (MFA).
Attack methods have evolved to the point where passwords alone cannot reliably protect an account. This is well
documented in a Microsoft Ignite Session.
Administrative accounts and all critical accounts should use one of the following methods of authentication.
These capabilities are listed in preference order by highest cost/difficulty to attack (strongest/preferred options)
to lowest cost/difficult to attack:
Passwordless (such as Windows Hello)Passwordless (such as Windows Hello)
Passwordless (Authenticator App)Passwordless (Authenticator App)
Multifactor AuthenticationMultifactor Authentication
Note that SMS Text Message based MFA has become very inexpensive for attackers to bypass, so we
recommend you avoid relying on it. This option is still stronger than passwords alone, but is much weaker than
other MFA options.
Authentication for all admins and other critical impact accounts should include measurement and enforcement
of key security attributes to support a Zero Trust strategy.
Attackers compromising Azure Admin accounts can cause significant harm. Conditional Access can significantly
reduce that risk by enforcing security hygiene before allowing access to Azure management.
Configure Conditional Access policy for Azure management that meets your organization's risk appetite and
operational needs.
Require Multifactor Authentication and/or connection from designated work network.
Require Device integrity with Microsoft Defender for Endpointintegrity with Microsoft Defender for Endpoint (Strong Assurance).
Avoid permissions that specifically reference individual resources or users.
Specific permissions create unneeded complexity and confusion as they don't carry the intention to new similar
resources. This then accumulates into a complex legacy configuration that is difficult to maintain or change
without fear of "breaking something" – negatively impacting both security and solution agility.
Use built-in roles
Establish lifecycle management for critical impact accounts
Attack simulation for critical impact accounts
Instead of assigning specific resource-specific permissions, use either:
Management Groups for enterprise-wide permissions.
Resource groups for permissions within subscriptions.
Instead of granting permissions to specific users, assign access to groups in Azure AD. If there isn't an
appropriate group, work with the identity team to create one. This allows you to add and remove group
members externally to Azure and ensure permissions are current, while also allowing the group to be used for
other purposes such as mailing lists.
Use built-in roles for assigning permissions where possible.
Customization leads to complexity that increases confusion and makes automation more complex, challenging,
and fragile. These factors all negatively impact security.
We recommend that you evaluate the built-in roles designed to cover most normal scenarios. Custom roles are
a powerful and sometimes useful capability, but they should be reserved for cases when built in roles won't
work.
Ensure you have a process for disabling or deleting administrative accounts when admin personnel leave the
organization (or leave administrative positions).
See Regularly review critical access for more details.
Regularly simulate attacks against administrative users with current attack techniques to educate and empower
them.
People are a critical part of your defense, especially your personnel with access to critical impact accounts.
Ensuring these users (and ideally all users) have the knowledge and skills to avoid and resist attacks will reduce
your overall organizational risk.
You can use Office 365 Attack Simulation capabilities or any number of third party offerings.
Azure identity and access management
considerations
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure security benchmark
Azure services for identity
Reference architecture
Most architectures have shared services that are hosted and accessed across networks. Those services share
common infrastructure and users need to access resources and data from anywhere. For such architectures, a
common way to secure resources is to use network controls. However, that isn't enough.
Provide security assurance through
identity management
: the process of authenticating and authorizing security
principals. Use identity management services to authenticate and grant permission to users, partners,
customers, applications, services, and other entities.
How are you managing the identity for your workload?How are you managing the identity for your workload?
Define clear lines of responsibility and separation of duties for each function. Restrict access based on a
need-to-know basis and least privilege security principles.
Assign permissions to users, groups, and applications at a certain scope through Azure RBAC. Use built-in
roles when possible.
Prevent deletion or modification of a resource, resource group, or subscription through management locks.
Use Managed Identities to access resources in Azure.
Support a single enterprise directory. Keep the cloud and on-premises directories synchronized, except for
critical-impact accounts.
Set up Azure AD Conditional Access. Enforce and measure key security attributes when authenticating all
users, especially for critical-impact accounts.
Have a separate identity source for non-employees.
Preferably use passwordless methods or opt for modern password methods.
Block legacy protocols and authentication methods.
The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to
help secure the services you use in Azure:
The questions in this section are aligned to the Azure Security Benchmarks Identity and Access
Control.
The considerations and best practices in this section are based on these Azure services:
Azure AD
Azure AD B2B
Azure AD B2C
Next steps
Related links
Here are some reference architectures related to identity and access management:
Integrate on-premises AD domains with Azure AD
Integrate on-premises AD with Azure
Monitor the communication between segments. Use data to identify anomalies, set alerts, or block traffic to
mitigate the risk of attackers crossing segmentation boundaries.
Network-related risks
Five steps to securing your identity infrastructure
Go back to the main article: Security
Roles, responsibilities, and permissions
12/16/2022 9 minutes to read • Edit Online
Clear lines of responsibility
GRO UP O R IN DIVIDUA L ROLEGRO UP O R IN DIVIDUA L ROLE RESP O N SIB IL IT YRE SP ON SIBIL IT Y
Network SecurityNetwork Security
Typically existing network security team.
Configuration and
maintenance of Azure Firewall, Network Virtual Appliances
(and associated routing), Web Application Firewall (WAF),
Network Security Groups, Application Security Groups (ASG),
and other cross-network traffic.
Network ManagementNetwork Management
Typically existing network operations team.
Enterprise-wide
virtual network and subnet allocation.
Server Endpoint SecurityServer Endpoint Security
Typically IT operations, security, or jointly.
Monitor and
remediate server security (patching, configuration, endpoint
security).
Incident Monitoring and ResponseIncident Monitoring and Response
Typically security operations team.
Incident monitoring and
response to investigate and remediate security incidents in
Security Information and Event Management (SIEM) or
source console such as Microsoft Defender for Cloud Azure
AD Identity Protection.
Policy ManagementPolicy Management
Typically GRC team + Architecture.
Apply governance based
on risk analysis and compliance requirements. Set direction
for use of Azure role-based access control (Azure RBAC),
Microsoft Defender for Cloud, Administrator protection
strategy, and Azure Policy to govern Azure resources.
In an organization, several teams work together to make sure that the workload and the supporting
infrastructure are secure. To avoid confusion that can create security risks, define clear lines of responsibility and
separation of duties.
Based on Microsoft's experience with many cloud adoption projects, establishing clearly defined roles and
responsibilities for specific functions in Azure will avoid confusion that can lead to human and automation
errors creating security risk.
Do the teams have a clear view on responsibilities and individual/group access levels?Do the teams have a clear view on responsibilities and individual/group access levels?
Designate the parties responsible for specific functions in Azure.
Clearly documenting and sharing the contacts responsible for each of these functions will create consistency
and facilitate communication. Based on our experience with many cloud adoption projects, this will avoid
confusion that can lead to human and automation errors that create security risk.
Designate groups (or individual roles) that will be responsible for key functions.
Identity Security and StandardsIdentity Security and Standards
Typically Security Team + Identity Team jointly.
Set direction
for Azure AD directories, PIM/PAM usage, MFA,
password/synchronization configuration, Application Identity
Standards.
GRO UP O R IN DIVIDUA L ROLEGRO UP O R IN DIVIDUA L ROLE RESP O N SIB IL IT YRE SP ON SIBIL IT Y
NOTENOTE
Assign permissions
Reference model exampleReference model example
Core services reference permissionsCore services reference permissions
Application roles and responsibilities should cover different access level of each operational function. For example, publish
production release, access customer data, manipulate database records, and so on. Application teams should include
central functions listed in the preceding table.
Grant roles the appropriate permissions that start with least privilege and add more based on your operational
needs. Provide clear guidance to your technical teams that implement permissions. This clarity makes it easier to
detect and correct that reduces human errors such as overpermissioning.
Assign permissions at management group for the segment rather than the individual subscriptions. This
will drive consistency and ensure application to future subscriptions. In general, avoid granular and
custom permissions.
Consider the built-in roles in Azure before creating custom roles to grant the appropriate permissions to
VMs and other objects.
Security managersSecurity managers group membership may be appropriate for smaller teams/organizations where
security teams have extensive operational responsibilities.
When assigning permissions for a segment, consider consistency while allowing flexibility to accommodate
several organizational models. These models can range from a single centralized IT group to mostly
independent IT and DevOps teams.
This section uses this Reference model to demonstrate the considerations for assigning permissions for different
segments. Microsoft recommends starting from these models and adapting to your organization.
This segment hosts shared services utilized across the organization. These shared services typically include
Active Directory Domain Services, DNS/DHCP, System Management Tools hosted on Azure Infrastructure as a
Service (IaaS) virtual machines.
Security Visibility across all resources:Security Visibility across all resources: For security teams, grant read-only access to security attributes for
all technical environments. This access level is needed to assess risk factors, identify potential mitigations, and
advise organizational stakeholders who accept the risk. See Security Team Visibility for more details.
Policy management across some or all resources:Policy management across some or all resources: To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy, assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
Before defining the policies, consider:
How is the organization's security audited and reported? Is there mandatory reporting?
Are the existing security practices working?
Are there any requirements specific to industry, government, or regulatory requirements?
Designate group(s) (or individual roles) for central functions that affect shared services and applications.
After the policies are set, continuously improve those standards incrementally. Make sure that the security
posture doesn't degrade over time by having auditing and monitoring compliance. For information about
managing security standards of an organization, see governance, risk, and compliance (GRC).
Central IT operations across all resources:Central IT operations across all resources: Grant permissions to the central IT department (often the
infrastructure team) to create, modify, and delete resources like virtual machines and storage. ContributorContributor or
OwnerOwner roles are appropriate for this function.
Central networking group across network resources:Central networking group across network resources: To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. Assign network resource responsibilities to a single central networking organization. The
Network ContributorNetwork Contributor role is appropriate for this group. See Centralize Network Management And Security for
more details
Resource Role Permissions:Resource Role Permissions: For most core services, administrative privileges required to manage them are
granted through the application (Active Directory, DNS/DHCP, System Management Tools), so no additional
Azure resource permissions are required. If your organizational model requires these teams to manage their
own VMs, storage, or other Azure resources, you can assign these permissions to those roles.
Segment reference permissionsSegment reference permissions
Workload segments with autonomous DevOps teams will manage the resources associated with each
application. The actual roles and their permissions depend on the application size and complexity, the
application team size and complexity, and the culture of the organization and application team.
Ser vice admin (Break Glass Account):Ser vice admin (Break Glass Account): Use the Ser vice AdministratorSer vice Administrator role only for emergencies and
initial setup. Do not use this role for daily tasks. See Emergency Access ('Break Glass' Accounts) for more details.
This segment permission design provides consistency while allowing flexibility to accommodate the range of
organizational models from a single centralized IT group to mostly independent IT and DevOps teams.
Security visibility across all resources:Security visibility across all resources: For security teams, grant read-only access to security attributes for
all technical environments. This access level is needed to assess risk factors, identify potential mitigations, and
advise organizational stakeholders who accept the risk. See Security Team Visibility.
Policy management across some or all resources:Policy management across some or all resources: To monitor and enforce compliance with external (or
internal) regulations, standards, and security policy assign appropriate permission to those roles. The roles and
permissions you choose will depend on the organizational culture and expectations of the policy program. See
Microsoft Cloud Adoption Framework for Azure.
IT Operations across all resources:IT Operations across all resources: Grant permission to create, modify, and delete resources. The purpose of
the segment (and resulting permissions) will depend on your organization structure.
Segments with resources managed by a centralized IT organization can grant the central IT department
(often the infrastructure team) permission to modify these resources.
Segments managed by independent business units or functions (such as a Human Resources IT Team)
can grant those teams permission to all resources in the segment.
Segments with autonomous DevOps teams don't need to grant permissions across all resources because
the resource role (below) grants permissions to application teams. For emergencies, use the service
admin account (break-glass account).
Central networking group across network resources:Central networking group across network resources: To ensure consistency and avoid technical conflicts,
assign network resource responsibilities to a single central networking organization. These resources should
include virtual networks, subnets, Network Security Groups (NSG), and the virtual machines hosting virtual
network appliances. See Centralize Network Management And Security.
Security team visibility
IMPORTANTIMPORTANT
Manage connected tenants
Resource Role Permissions:Resource Role Permissions: Segments with autonomous DevOps teams will manage the resources
associated with each application. The actual roles and their permissions depend on the application size and
complexity, the application team size and complexity, and the culture of the organization and application team.
Ser vice Admin (Break Glass Account):Ser vice Admin (Break Glass Account): Use the service admin role only for emergencies (and initial setup if
required). Do not use this role for daily tasks. See Emergency Access ('Break Glass' Accounts) for more details.
An application team needs to be aware of security initiatives to align their security improvement plans with the
outcome of those activities. Provide security teams read-only access to the security aspects of all technical
resources in their purview.
Security organizations require visibility into the technical environment to perform their duties of assessing and
reporting on organizational risk. Without this visibility, security will have to rely on information provided from
groups, operating the environment, which have potential conflict of interest (and other priorities).
Note that security teams may separately be granted additional privileges if they have operational
responsibilities or a requirement to enforce compliance on Azure resources.
For example in Azure, assign security teams to the Security ReadersSecurity Readers permission that provides access to
measure security risk (without providing access to the data itself).
For enterprise security groups with broad responsibility for security of Azure, you can assign this permission
using:
Root management group
– for teams responsible for assessing and reporting risk on all resources
Segment management group(s)
– for teams with limited scope of responsibility (typically required
because of organizational boundaries or regulatory requirements)
Because security will have broad access to the environment (and visibility into potentially exploitable vulnerabilities), treat
security teams as critical impact accounts and apply the same protections as administrators. The Administration section
details these controls for Azure.
Suggested actionsSuggested actions
Define a process for aligning communication, investigation, and hunting activities with the application team.
Following the principle of least privilege, establish access control to all cloud environment resources for
security teams with sufficient access to gain required visibility into the technical environment and to perform
their duties of assessing, and reporting on organizational risk.
Learn moreLearn more
Engage your organization's security team
Does your security team have visibility into all existing subscriptions and cloud environments? How do they
discover new ones?
Ensure your security organization is aware of all enrollments and associated subscriptions connected to your
existing environment (via ExpressRoute or Site-Site VPN) and monitoring as part of the overall enterprise.
These Azure resources are part of your enterprise environment and security organizations require visibility into
Suggested actionsSuggested actions
Next steps
Related links
them. Security organizations need this access to assess risk and to identify whether organizational policies and
applicable regulatory requirements are being followed.
The organizations' cloud infrastructure should be well documented, with security team access to all resources
required for monitoring and insight. Frequent scans of the cloud-connected assets should be performed to
ensure no additional subscriptions or tenants have been added outside of organizational controls. Regularly
review Microsoft guidance to ensure security team access best practices are consulted and followed.
Ensure all Azure environments that connect to your production environment and network apply your
organization's policy, and IT governance controls for security.
You can discover existing connected tenants using a tool provided by Microsoft for guidance on permissions.
Restrict access to Azure resources based on a need-to-know basis starting with the principle of least privilege
security and add more based on your operational needs.
Azure control plane security
For considerations about using management groups to reflect the organization's structure within an Azure
Active Directory (Azure AD) tenant, see CAF: Management group and subscription organization.
Back to the main article: Azure identity and access management considerations
Azure control plane security
12/16/2022 3 minutes to read • Edit Online
Key points
Roles and permission assignment
The term
control plane
refers to the management of resources in your subscription. These activities include
creating, updating, and deleting Azure resources as required by the technical team.
Azure Resource Manager handles all control plane requests and applies restrictions that you specify through
Azure role-based access control (Azure RBAC), Azure Policy, locks. Apply those restrictions based on the
requirement of the organization.
It's recommended to implement Infrastructure as Code, and to deploy application infrastructure through
automation, and CI/CD for consistency and auditing purposes.
Restrict access based on a need-to-know basis and least privilege security principles.
Assign permissions to users, groups, and applications at a certain scope through Azure RBAC.
Use built-in roles when possible.
Prevent deletion or modification of a resource, resource group, or subscription through management locks.
Use less critical control in your CI/CD pipeline for development and test environments.
Is the workload infrastructure protected with Azure role-based access control (Azure RBAC)?Is the workload infrastructure protected with Azure role-based access control (Azure RBAC)?
Azure role-based access control (Azure RBAC) provides the necessary tools to maintain separation of concerns
for administration and access to application infrastructure. Decide who has access to resources at the granular
level and what they can do with those resources. For example:
Developers can't access production infrastructure.
Only the SecOps team can read and manage Key Vault secrets.
If there are multiple teams, Project A team can access and manage Resource Group A and all resources
within.
Grant roles the appropriate permissions that start with least privilege and add more based on your
operational needs. Provide clear guidance to your technical teams that implement permissions. This clarity
makes it easier to detect and correct which reduces human errors such as overpermissioning.
Azure RBAC helps you manage that separation. You can assign permissions to users, groups, and applications at
a certain scope. The scope of a role assignment can be a subscription, a resource group, or a single resource. For
details, see Azure role-based access control (Azure RBAC).
Assign permissions at management group instead of individual subscriptions to drive consistency and
ensure application to future subscriptions.
Consider the built-in roles before creating custom roles to grant the appropriate permissions to resources
and other objects.
For example, assign security teams with the Security ReadersSecurity Readers permission that provides access needed to
assess risk factors, identify potential mitigations, without providing access to the data.
IMPORTANTIMPORTANT
Management locks
Suggested actions
Learn more
Next steps
Treat security teams as critical accounts and apply the same protections as administrators.
Learn moreLearn more
Azure RBAC documentation
Are there resource locks applied on critical par ts of the infrastructure?Are there resource locks applied on critical par ts of the infrastructure?
Unlike Azure role-based access control, management locks are used to apply a restriction across all users and
roles.
Critical infrastructure typically doesn't change often. Use management locks to prevent deletion or modification
of a resource, resource group, or subscription. Lock in use cases where only specific roles and users with
permissions can delete, or modify resources.
As an administrator, you may need to lock a subscription, resource group, or resource to prevent other users in
your organization from accidentally deleting or modifying critical resources. You can set the lock level to
CanNotDelete or ReadOnly . In the portal, the locks are called DeleteDelete and Read-onlyRead-only, respectively:
CanNotDelete
means authorized users can still read and modify a resource, but they can't delete the
resource.
ReadOnly
means authorized users can read a resource, but they can't delete or update the resource. Applying
this lock is similar to restricting all authorized users to the permissions granted by the
Reader
role.
When you apply a lock at a parent scope, all resources within that scope inherit the same lock. Even resources
you add later inherit the lock from the parent. The most restrictive lock in the inheritance takes precedence.
Unlike role-based access control, you use management locks to apply a restriction across all users and roles. To
learn about setting permissions for users and roles, see Azure role-based access control (Azure RBAC).
Identify critical infrastructure and evaluate resource lock suitability.
Set locks in the DevOps process carefully because modification locks can sometimes block automation. For
examples of those blocks and considerations, see Considerations before applying locks.
Restrict application infrastructure access to CI/CD only.
Use conditional access policies to restrict access to Microsoft Azure Management.
Configure role-based and resource-based authorization within Azure AD.
Manage access to Azure management with Conditional Access
Role-based and resource-based authorization
Grant or deny access to a system by verifying whether the accessor has the permissions to perform the
requested action.
Authentication
Related links
Back to the main article: Azure identity and access management considerations
Authentication with Azure AD
12/16/2022 11 minutes to read • Edit Online
Key points
Use identity-based authentication
Authentication is a process that grants or denies access to a system by verifying the accessor's identity. Use a
managed identity service for all resources to simplify overall management (such as password policies) and
minimize the risk of oversights or human errors. Azure Active Directory (Azure AD) is the one-stop-shop for
identity and access management service for Azure.
Use Managed Identities to access resources in Azure.
Keep the cloud and on-premises directories synchronized, except for high-privilege accounts.
Preferably use passwordless methods or opt for modern password methods.
Enable Azure AD conditional access based on key security attributes when authenticating all users, especially
for high-privilege accounts.
How is the application authenticated when communicating with Azure platform ser vices?How is the application authenticated when communicating with Azure platform ser vices?
Managed identities enable Azure Services to authenticate to each other without presenting explicit credentials
via code.
Managed identities for Azure resources is a feature of Azure Active Directory. Each of the Azure services that
support managed identities for Azure resources are subject to their own timeline. Make sure you review the
availability status of managed identities for your resource and known issues before you begin. The feature
provides Azure services with an automatically managed identity in Azure AD. You can use the identity to
authenticate to any service that supports Azure AD authentication, including Key Vault, without any credentials
in your code. The managed identities for Azure resources feature is free with Azure AD for Azure subscriptions,
there's no additional cost.
There are two types of managed identities:
A system-assigned managed identity is enabled directly on an Azure service instance. When the identity is
enabled, Azure creates an identity for the instance in the Azure AD tenant that's trusted by the subscription of
the instance. After the identity is created, the credentials are provisioned onto the instance. The life cycle of a
system-assigned identity is directly tied to the Azure service instance that it's enabled on. If the instance is
deleted, Azure automatically cleans up the credentials and the identity in Azure AD.
A user-assigned managed identity is created as a standalone Azure resource. Through a create process, Azure
creates an identity in the Azure AD tenant that's trusted by the subscription in use. After the identity is
created, the identity can be assigned to one or more Azure service instances. The life cycle of a user-assigned
identity is managed separately from the life cycle of the Azure service instances to which it's assigned.
Authenticate with identity services instead of cryptographic keys. On Azure, Managed Identities eliminate the
need to store credentials that might be leaked inadvertently. When Managed Identity is enabled for an Azure
resource, it's assigned an identity that you can use to obtain Azure AD tokens. For more information, see Azure
AD-managed identities for Azure resources.
For example, an Azure Kubernetes Service (AKS) cluster needs to pull images from Azure Container Registry
(ACR). To access the image, the cluster needs to know the ACR credentials. The recommended way is to enable
Managed Identities during cluster configuration. That configuration assigns an identity to the cluster and allows
TIPTIP
Choose a system with cross-platform support
it to obtain Azure AD tokens.
This approach is secure because Azure handles the management of the underlying credentials for you.
The identity is tied to the lifecycle of the resource, in the AKS cluster example. When the resource is deleted,
Azure automatically deletes the identity.
Azure AD manages the timely rotation of secrets for you.
Here are the resources for the preceding example:
GitHub: Azure Kubernetes Service (AKS) Secure Baseline Reference Implementation.
The design considerations are described in Azure Kubernetes Service (AKS) production baseline.
Suggested actionsSuggested actions
Review workload authentication and identify opportunities to convert explicit credentials (for example,
connection string and API key) to use managed identities.
For all new Azure workloads, standardize on using managed identities where applicable.
Learn moreLearn more
What are managed identities for Azure resources?
What kind of authentication is required by application APIs?What kind of authentication is required by application APIs?
Don't assume that API URLs used by a workload are hidden and can't get exposed to attackers. For example,
JavaScript code on a website can be viewed. A mobile application can be decompiled and inspected. Even for
internal APIs used only on the backend, a requirement of authentication can increase the difficulty of lateral
movement if an attacker gets network access. Typical mechanisms include API keys, authorization tokens and IP
restrictions.
Managed Identity can help an API be more secure because it replaces the use of human-managed service
principals and can request authorization tokens.
How is user authentication handled in the application?How is user authentication handled in the application?
Don't use custom implementations to manage user credentials. Instead, use Azure AD or other managed identity
providers such as Microsoft account Azure B2C. Managed identity providers provide additional security features
such as modern password protections, multifactor authentication (MFA), and resets. In general, passwordless
protections are preferred. Also, modern protocols like OAuth 2.0 use token-based authentication with limited
timespan.
Are authentication tokens cached securely and encr ypted when sharing across web ser vers?Are authentication tokens cached securely and encr ypted when sharing across web ser vers?
Application code should first try to get OAuth access tokens silently from a cache before attempting to acquire a
token from the identity provider, to optimize performance and maximize availability. Tokens should be stored
securely and handled as any other credentials. When there's a need to share tokens across application servers
(instead of each server acquiring and caching their own) encryption should be used.
For information, see Acquire and cache tokens.
Centralize all identity systems
IMPORTANTIMPORTANT
TIPTIP
Use passwordless authentication
Use a single identity provider for authentication on all platforms (operating systems, cloud providers, and third-
party services.
Azure AD can be used to authenticate Windows, Linux, Azure, Office 365, other cloud providers, and third-party
services as service providers.
For example, improve the security of Linux virtual machines (VMs) in Azure with Azure AD integration. For
details, see Log in to a Linux virtual machine in Azure using Azure Active Directory authentication.
Keep your cloud identity synchronized with the existing identity systems to ensure consistency and reduce
human errors.
Consistency of identities across cloud and on-premises will reduce human error and resulting security risk.
Teams managing resources in both environments need a consistent authoritative source to achieve security
assurances. For monitoring, if identity can be determined without an intermediate mapping process, security
efficiency is improved.
Synchronization is all about providing users an identity in the cloud based on their on-premises identity.
Whether or not they will use synchronized account for authentication or federated authentication, the users will
still need to have an identity in the cloud. This identity will need to be maintained and updated periodically. The
updates can take many forms, from title changes to password changes.
Start by evaluating the organization's on-premises identity solution and user requirements. This evaluation is
important, as it defines the technical requirements for how user identities will be created and maintained in the
cloud. For the majority of organizations, Active Directory is established on-premises and will be the on-premises
directory from which users will be synchronized, but this is not always the case.
Consider using Azure AD Connect for synchronizing Azure AD with your existing on-premises directory. For
migration projects, have a requirement to complete this task before an Azure migration and development
projects begin.
Don't synchronize high-privilege accounts to an on-premises directory. If an attacker gets full control of on-premises
assets, they can compromise a cloud account. This strategy will limit the scope of an incident. For more information, see
Critical impact account dependencies.
Synchronization is blocked by default in the default Azure AD Connect configuration. Make sure that you haven't
customized this configuration. For information about filtering in Azure AD, see Azure AD Connect sync: Configure filtering.
For more information, see hybrid identity providers.
Here are the resources for the preceding example::
The design considerations are described in Integrate on-premises Active Directory domains with Azure AD.
Learn moreLearn more
Synchronize the hybrid identity systems
Attackers constantly scan public cloud IP ranges for open management ports. They attempt to exploit weak
credentials (
password spray
) and unpatched vulnerabilities in management protocols like SSH, and RDP.
Preventing direct internet access to virtual machines stops a misconfiguration or oversight becoming more
serious.
Attack methods have evolved to the point where passwords alone cannot reliably protect an account. Modern
authentication solutions including passwordless and multifactor authentication increase security posture
through strong authentication.
Remove the use of passwords, when possible. Also, require the same set of credentials to sign in and access the
resources on-premises or in the cloud. This requirement is crucial for accounts that require passwords, such as
admin accounts.
With modern authentication and security features in Azure AD, that basic password should be supplemented or
replaced with more secure authentication methods. Each organization has different needs when it comes to
authentication. Microsoft offers the following three passwordless authentication options that integrate with
Azure Active Directory (Azure AD):
Windows Hello for Business
Microsoft Authenticator app
FIDO2 security keys
It's recommended to follow a four-stage plan to become passwordless:
Develop password replacement offering
Reduce user-visible password surface area
Transition into password-less deployment
Eliminate passwords from the identity directory
The following methods of authentication are ordered by highest cost/difficulty to attack (strongest/preferred
options) to lowest cost/difficult to attack:
Passwordless authentication. Some examples of this method include Windows Hello or Authenticator App.
MFA. Although this method is more effective than passwords, we recommend that you avoid relying on SMS
text message-based MFA. For more information, see Enable per-user Azure Active Directory MFA to secure
sign-in events.
Managed Identities. See Use identity-based authentication.
Those methods apply to all users, but should be applied first and strongest to accounts with administrative
privileges.
An implementation of this strategy is enabling single sign-on (SSO) to devices, apps, and services. By signing in
once using a single user account, you can grant access to all the applications and resources per business needs.
Users don't have to manage multiple sets of usernames and passwords. You can provision or de-provision
application access automatically. For more information, see Single sign-on.
Suggested actionsSuggested actions
Develop a passwordless strategy that requires MFA for all users without significantly impacting operations.
Ensure policy and processes require restricting, and monitoring direct internet connectivity by virtual
machines.
Learn moreLearn more
Passwordless Strategy
Remove Virtual Machine (VM) direct internet connectivity
Use modern password protection
Enable conditional access
Require modern protections through methods that reduce the use of passwords. Modern authentication
protocols support strong controls such as MFA and should be used instead of legacy authentication methods.
Use of legacy methods increases risk of credential exposure.
Modern authentication is a method of identity management that offers more secure user authentication and
authorization. It's available for Office 365 hybrid deployments of Skype for Business server on-premises and
Exchange server on-premises, and split-domain Skype for Business hybrids.
Modern authentication is an umbrella term for a combination of authentication and authorization methods
between a client (for example, your laptop or your phone) and a server, as well as some security measures that
rely on access policies that you may already be familiar with. It includes:
Authentication methods
: MFA; smart card authentication; client certificate-based authentication
Authorization methods
: Microsoft's implementation of Open Authorization (OAuth)
Conditional access policies
: Mobile Application Management (MAM) and Azure Active Directory (Azure AD)
Conditional Access
Review workloads that do not leverage modern authentication protocols and convert where possible. In
addition, standardize using modern authentication protocols for all future workloads.
For Azure, enable protections in Azure AD:
1. Configure Azure AD Connect to synchronize password hashes. For information, see Implement password
hash synchronization with Azure AD Connect sync.
2. Choose whether to automatically or manually remediate issues found in a report. For more information,
see Monitor identity risks.
For more information about supporting modern passwords in Azure AD, see the following articles:
What is Identity Protection?
Enforce on-premises Azure AD Password Protection for Active Directory Domain Services
Users at risk security report
Risky sign-ins security report
For more information about supporting modern passwords in Office 365, see the following article:
What is modern authentication?
Modern cloud-based applications are typically accessible over the internet, making network location-based
access inflexible and single-factor passwords a liability. Conditional access describes your authentication policy
for an access decision. For example, if a user is connecting from an InTune-managed corporate PC, they might
not be challenged for MFA every time, but if the user suddenly connects from a different device in a different
geography, MFA is required.
Grant access requests based on the requestors' trust level and the target resources' sensitivity.
Are there any conditional access requirements for the application?Are there any conditional access requirements for the application?
Workloads can be exposed over public internet and location-based network controls are not applicable. To
enable conditional access, understand what restrictions are required for the use case. For example, MFA is a
necessity for remote access; IP-based filtering can be used to enable adhoc debugging (VPNs are preferred).
Configure Azure AD Conditional Access by setting up Access policy for Azure management based on your
Suggested actionsSuggested actions
Next
Related links
operational needs. For information, see Manage access to Azure management with Conditional Access.
Conditional access can be an effective way to phase out legacy authentication and associated protocols. The
policies must be enforced for all admins and other critical impact accounts. Start by using metrics and logs to
determine users who still authenticate with old clients. Next, disable any down-level protocols that aren't used,
and set up conditional access for all users who aren't using legacy protocols. Finally, give notice and guidance to
users about upgrading before blocking legacy authentication completely. For more information, see Azure AD
Conditional Access support for blocking legacy auth.
Implement conditional access policies for this workload.
Learn more about Azure AD Conditional Access.
Grant or deny access to a system by verifying the accessor's identity.
Authorization
Back to the main article: Azure identity and access management considerations
Authorization with Azure AD
12/16/2022 5 minutes to read • Edit Online
Key points
Role-based authorization
Authorization is a process that grants or denies access to a system by verifying whether the accessor has the
permissions to perform the requested action. The accessor in this context is the workload (cloud application) or
the user of the workload. The action might be operational or related to resource management. There are two
main approaches to authorization: role-based and resource-based. Both can be configured with Azure AD.
Use a mix of role-based and resource-based authorization. Start with the principle of least privilege and add
more actions based on your needs.
Define clear lines of responsibility and separation of duties for application roles and the resources it can
manage. Consider the access levels of each operational function, such as permissions needed to publish
production release, access customer data, manipulate database records.
Do not provide permanent access for any critical accounts. Elevate access permissions that are based on
approval and is time bound using Azure AD Privileged Identity Management (Azure AD PIM).|
This approach authorizes an action based on the role assigned to a user. For example, some actions require an
administrator role.
A role is a set of permissions. For example, the administrator role has permissions to perform all read, write, and
delete operations. Also, the role has a scope. The scope specifies the management groups, subscriptions, or
resource groups within which the role is allowed to operate.
Applying consistent permissions to resources via management groups or resource groups reduces proliferation
of custom, specific, per-resource permissions. Custom resource-based permissions are often unnecessary, and
can cause confusion because they do not carry their intent to new similar resources. This process can
accumulate into a complex legacy configuration that is difficult to maintain or change without fear of
breaking
something
, and negatively impacting both security, and solution agility.
When assigning a role to a user consider what actions the role can perform and what is the scope of those
operations. Here are some considerations for role assignment:
Use built-in roles before creating custom roles to grant the appropriate permissions to VMs and other
objects. You can assign built-in roles to users, groups, service principals, and managed identities. For
more information, see Azure built-in roles.
If you need to create custom roles, grant roles with the appropriate action. Actions are categorized into
operational and data actions. Start with actions that have least privilege and add more based your
operational or data access needs. Provide clear guidance to your technical teams that implement
permissions. For more information, see Azure custom roles.
If you have a segmentation strategy, assign permissions with a scope. For example, if you use
management group to support your strategy, set the scope to the group rather than the individual
subscriptions. This will drive consistency and ensure application to future subscriptions. When assigning
permissions for a segment, consider consistency while allowing flexibility to accommodate several
organizational models. These models can range from a single centralized IT group to mostly independent
IT and DevOps teams. For information about assigning scope, see AssignableScopes.
Resource-based authorization
NOTENOTE
Authorization for critical accounts
You can use security groups to assign permissions. However, there are disadvantages. It can get complex
because the workload needs to keep track of which security groups correspond to which application
roles, for each tenant. Also, access tokens can grow significantly and Azure AD includes an "overage"
claim to limit the token size. See Microsoft identity platform access tokens.
Instead of granting permissions to specific users, assign access to Azure AD groups. In addition, build a
comprehensive delegation model that includes management groups, subscription, or resource groups
RBAC. For more information, see Azure role-based access control (Azure RBAC).
For information about implementing role-based authorization in an ASP.NET application, see Role-based
authorization.
Learn moreLearn more
Avoid granular and custom permissions
Delegate administration in Azure AD
With role-based authorization, a user gets the same level of control on a resource based on the user's role.
However, there might be situations where you need to define access rights per resource. For example, in a
resource group, you want to allow some users to delete the resource; other users cannot. In such situations, use
resource-based authorization that authorizes an action based on a particular resource. Every resource has an
Owner. Owner can delete the resource. Contributors can read and update but can't delete it.
The
owner
and
contributor
roles for a resource are not the same as application roles.
You'll need to implement custom logic for resource-based authorization. That logic might be a mapping of
resources, Azure AD object (like role, group, user), and permissions.
For information and code sample about implementing resource-based authorization in an ASP.NET application,
see Resource-based authorization.
There might be cases when you need to do activities that require access to important resources. Those resources
might already be accessible to critical accounts such as an administrator account. Or, you might need to elevate
the access permissions until the activities are complete. Both approaches can pose significant risks.
Critical accounts are those which can produce a business-critical outcome, whether cloud administrators or
workload-specific privileged users. Compromise or misuse of such an account can have a detrimental-to-
material effect on the business and its information systems. It's important to identify those accounts and adopt
processes including close monitoring, and lifecycle management, including retirement.
Securing privileged access is a critical first step to establishing security assurances for business assets in a
modern organization. The security of most or all business assets in an IT organization depends on the integrity
of the privileged accounts used to administer, manage, and develop. Cyberattackers often target these accounts
and other elements of privileged access to gain access to data, and systems using credential theft attacks like
Pass-the-Hash, and Pass-the-Ticket.
Protecting privileged access against determined adversaries requires you to take a complete and thoughtful
approach to isolate these systems from risks.
Are there any processes and tools leveraged to manage privileged activities?Are there any processes and tools leveraged to manage privileged activities?
Learn more
Related links
Do not provide permanent access for any critical accounts and lower permissions when access is no longer
required. Some strategies include:
Just-in-time privileged access to Azure AD and Azure resources.
Time-bound access.
Approval-based access.
Break glass for emergency access process to gain access.
Limit write access to production systems to service principals. No user accounts should have regular write-
access.
Ensure there's a process for disabling or deleting administrative accounts that are unused.
You can use native and third-party options to elevate access permissions for at least highly privileged if not all
activities. Azure AD Privileged Identity Management (Azure AD PIM) is the recommended native solution on
Azure.
For more information about PIM, see What is Azure AD Privileged Identity Management?
Establish lifecycle management for critical impact accounts
Back to the main article: Azure identity and access management considerations
Network security
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure security benchmark
Azure services
Reference architecture
Protect assets by placing controls on network traffic originating in Azure, between on-premises and Azure
hosted resources, and traffic to and from Azure. If security measures aren't in place attackers can gain access, for
instance, by scanning across public IP ranges. Proper network security controls can provide defense-in-depth
elements that help detect, contain, and stop attackers who gain entry into your cloud deployments.
How have you secured the network of your workload?How have you secured the network of your workload?
Segment your network footprint and create secure communication paths between segments. Align the
network segmentation with overall enterprise segmentation strategy.
Design security controls that identify and allow or deny traffic, access requests, and application
communication between segments.
Protect all public endpoints with Azure Front Door, Application Gateway, Azure Firewall, Azure DDoS
Protection.
Mitigate DDoS attacks with Azure DDoS Protection for critical workloads.
Keep virtual machines private and secure when connecting to the internet with Azure Virtual Network NAT
(NAT gateway).
Control network traffic between subnets (east-west) and application tiers (north-south).
Protect from data exfiltration attacks through a defense-in-depth approach with controls at each layer.
The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to
help secure the services you use in Azure:
The questions in this section are aligned to the Azure Security Benchmarks Network Security.
Azure Virtual Network
Azure Firewall
Azure Virtual Network NAT
Azure ExpressRoute
Azure Private Link
Azure DDoS Protection
Here are some reference architectures related to network security:
Hub-spoke network topology in Azure
Deploy highly available NVAs
Azure Kubernetes Service (AKS) production baseline
Next steps
Related links
We recommend applying as many as of the best practices as early as possible, and then working to retrofit any
gaps over time as you mature your security program.
Data protection
Combine network controls with application, identity, and other technical control types. This approach is effective
in preventing, detecting, and responding to threats outside the networks you control. For more information, see
these articles:
Applications and services security
Identity and access management considerations
Data protection
Ensure that resource grouping and administrative privileges align to the segmentation model. For more
information, see Administrative account security.
Go back to the main article: Security
Implement network segmentation patterns on Azure
12/16/2022 8 minutes to read • Edit Online
IMPORTANTIMPORTANT
Key points
A unified enterprise segmentation strategy guides technical teams to consistently segment access using
networking, applications, identity, and any other access controls. Create segmentation in your network footprint
by defining perimeters. The main reasons for segmentation are:
The ability to group related assets that are a part of (or support) workload operations.
Isolation of resources.
Governance policies set by the organization.
Assume compromise
is the recommended cybersecurity mindset and the ability to contain an attacker is vital in
protecting information systems. Model an attacker able to achieve a foothold at various points within the
workload and establish controls to mitigate further expansion.
Network controls can secure interactions between perimeters. This approach can strengthen the security
posture and contain risks in a breach because the controls can detect, contain, and stop attackers from gaining
access to an entire workload.
Containment of attack vectors within an environment is critical. However, to be effective in cloud environments,
traditional approaches may prove inadequate and security organizations may need to evolve their methods.
Traditional segmentation approaches typically fail to achieve their goals as they have not been developed in a
method to align with business use cases and application workloads. Often this results in overwhelming
complexity requiring broad firewall exceptions.
An evolving emerging best practice recommendation is to adopt a Zero Trust strategy based on user, device, and
application identities. In contrast to network access controls that are based on elements such as source and
destination IP address, protocols, and port numbers, Zero Trust enforces and validates access control at
access
time
. This avoids the need to play a prediction game for an entire deployment, network, or subnet only the
destination resource needs to provide the necessary access controls.
Azure Network Security Groups can be used for basic layer 3 and 4 access controls between Azure Virtual
Networks, their subnets, and the internet.
Azure Web Application Firewall and the Azure Firewall can be used for more advanced network access
controls that require application layer support.
Local Admin Password Solution (LAPS) or a third-party Privileged Access Management can set strong local
admin passwords and just-in-time access to them.
How does the organization implement network segmentation?How does the organization implement network segmentation?
This article highlights some Azure networking features that create segments and restrict access to individual
services.
Align your network segmentation strategy with the enterprise segmentation model. This will reduce confusion and
challenges with different technical teams (networking, identity, applications, and so on). Each team should not develop
their own segmentation and delegation models that don't align with each other.
What is segmentation?
Suggested actions
Learn more
Azure features for segmentation
Create software-defined perimeters in your networking footprint and secure communication paths between
them.
Establish a complete zero trust segmentation strategy.
Align technical teams in the enterprise on micro segmentation strategies for legacy applications.
Azure Virtual Networks (VNets) are created in private address spaces. By default, no traffic is allowed
between any two VNets. Open paths only when it's really needed.
Use Network Security Groups (NSG) to secure communication between resources within a VNet.
Use Application Security Groups (ASGs) to define traffic rules for the underlying VMs that run the workload.
Use Azure Firewall to filter traffic flowing between cloud resources, the internet, and on-premise.
Place resources in a single VNet, if you don't need to operate in multiple regions.
If you need to be in multiple regions, have multiple VNets that are connected through peering.
For advanced configurations, use a hub-spoke topology. A VNet is designated as a hub in a given region for
all the other VNets as spokes in that region.
You can create software-defined perimeters in your networking footprint by using the various Azure services
and features. When a workload (or parts of a given workload) is placed into separate segments, you can control
traffic from/to those segments to secure communication paths. If a segment is compromised, you will be able to
better contain the impact and prevent it from laterally spreading through the rest of your network. This strategy
aligns with the key principle of Zero Trust model published by Microsoft that aims to bring world class security
thinking to your organization.
Create a risk containment strategy that blends proven approaches including:
Existing network security controls and practices
Native security controls available in Azure
Zero trust approaches
For information about creating a segmentation strategy, see Enterprise segmentation strategy.
When you operate on Azure, you have many segmentation options.
Segmentation patterns
Pattern 1: Single VNet
1. Subscription: A high-level construct, which provides platform powered separation between entities. It's
intended to carve out boundaries between large organizations within a company and communication
between resources in different subscriptions needs to be explicitly provisioned.
2. Virtual Network (VNets): Created within a subscription in private address spaces. They provide network
level containment of resources with no traffic allowed by default between any two virtual networks. Like
subscriptions, any communication between virtual networks needs to be explicitly provisioned.
3. Network Security Groups (NSG): An access control mechanisms for controlling traffic between resources
within a virtual network and also with external networks, such as the internet, other virtual networks.
NSGs can take your segmentation strategy to a granular level by creating perimeters for a subnet, a VM,
or a group of VMs. For information about possible operations with subnets in Azure, see Subnets (Azure
Virtual Networks).
4. Application Security Groups (ASGs): Similar to NSGs but are referenced with an application context. It
allows you to group a set of VMs under an application tag and define traffic rules that are then applied to
each of the underlying VMs.
5. Azure Firewall: A cloud native stateful Firewall as a service, which can be deployed in your VNet or in
Azure Virtual WAN hub deployments for filtering traffic flowing between cloud resources, the internet,
and on-premise. You create rules or policies (using Azure Firewall or Azure Firewall Manager) specifying
allow/deny traffic using layer 3 to layer 7 controls. You can also filter traffic going to the internet using
both Azure Firewall and third parties by directing some or all traffic through third-party security
providers for advanced filtering & user protection.
Here are some common patterns for segmenting a workload in Azure from a networking perspective. Each
pattern provides a different type of isolation and connectivity. Choose a pattern based on your organization's
needs.
All the components of the workload reside in a single VNet. This pattern is appropriate you are operating in a
single region because a VNet cannot span multiple regions.
Common ways for securing segments, such as subnets or application groups, are by using NSGs and ASGs. You
can also use a Network Virtualized Appliance (NVAs) from Azure Marketplace or Azure Firewall to enforce and
secure this segmentation.
In this image, Subnet1 has the database workload. Subnet2 has the web workloads. You can configure NSGs that
allow Subnet1 to only communicate with Subnet2 and Subnet2 can only communicate with the internet.
Pattern 2: Multiple VNets that communicate through with peering
Consider a use case where you have multiple workloads that are placed in separate subnets. You can place
controls that will allow one workload to communicate to the backend of another workload.
The resources are spread or replicated in multiple VNets. The VNets can communicate through peering. This
pattern is appropriate when you need to group applications into separate VNets. Or, you need multiple Azure
regions. One benefit is the built-in segmentation because you have to explicitly peer one VNet to another. Virtual
network peering is not transitive. You can further segment within a VNet by using NSGs and ASGs as shown in
pattern 1.
Pattern 3: Multiple VNets in a hub and spoke model
A VNet is designated as a
hub
in a given region for all the other VNets as
spokes
in that region. The hub and its
spokes are connected through peering. All traffic passes through the hub that can act as a gateway to other hubs
in different regions. In this pattern, the security controls are set up at the hubs so that they get to segment and
govern the traffic in between other VNets in a scalable way. One benefit of this pattern is, as your network
topology grows, the security posture overhead does not grow (except when you expand to new regions).
The recommended native option is Azure Firewall. This option works across both VNets and subscriptions to
TIPTIP
Pattern comparison
C O N SIDERAT IO NSC O N SIDERAT IO NS PAT T E RN 1PAT T ER N 1 PAT T E RN 2PAT T ER N 2 PAT T E RN 3PAT T ER N 3
Connectivity/routing:Connectivity/routing:
how each segmenthow each segment
communicates to eachcommunicates to each
otherother
System routing provides
default connectivity to any
workload in any subnet.
Same as a pattern 1. No default connectivity
between spoke networks. A
layer 3 router, such as the
Azure Firewall, in the hub is
required to enable
connectivity.
Network level trafficNetwork level traffic
filteringfiltering
Traffic is allowed by default.
Use NSG, ASG to filter
traffic.
Same as a pattern 1. Traffic between spoke
virtual networks is denied
by default. Open selected
paths to allow traffic
through Azure Firewall
configuration.
Centralized loggingCentralized logging NSG, ASG logs for the
virtual network.
Aggregate NSG, ASG logs
across all virtual networks.
Azure Firewall logs all
accepted/denied traffic sent
through the hub. View the
logs in Azure Monitor.
Unintended open publicUnintended open public
endpointsendpoints
DevOps can accidentally
open a public endpoint
through incorrect NSG, ASG
rules.
Same as a pattern 1. Accidentally opened public
endpoint in a spoke will not
enable access because the
return packet will get
dropped through stateful
firewall (asymmetric
routing).
Application levelApplication level
protectionprotection
NSG or ASG provides
network layer support only.
Same as a pattern 1. Azure Firewall supports
FQDN filtering for HTTP/S
and MSSQL for outbound
traffic and across virtual
networks.
Next step
govern traffic flows using layer 3 to layer 7 controls. You can define your communication rules and apply them
consistently. Here are some examples:
VNet 1 cannot communicate with VNet 2, but it can communicate VNet 3.
VNet 1 cannot access public internet except for *.github.com.
With Azure Firewall Manager preview, you can centrally manage policies across multiple Azure Firewalls and
enable DevOps teams to further customize local policies.
Here are some resources that illustrate provisioning of resources in a hub and spoke topology:
GitHub: Hub and Spoke Topology Sandbox.
The design considerations are described in Hub-spoke network topology in Azure.
Secure network connectivity
Related links
For information about setting up peering, reference Virtual network peering.
For best practices about using Azure Firewall in various configurations, reference Azure Firewall
Architecture Guide.
For information about different access policies and control flow within a VNet, reference Azure Virtual
Network Subnet
Back to the main article: Network security
Azure services for securing network connectivity
12/16/2022 12 minutes to read • Edit Online
Key points
Connectivity between network segments
It's often the case that the workload and the supporting components of a cloud architecture will need to access
external assets. These assets can be on-premises, devices outside the main virtual network, or other Azure
resources. Those connections can be over the internet or networks within the organization.
Protect non-publicly accessible services with network restrictions and IP firewall.
Use Network Security Groups (NSGs) or Azure Firewall to protect and control traffic within the VNet.
Use Service Endpoints or Private Link for accessing Azure PaaS services.
Use Azure Firewall to protect against data exfiltration attacks.
Restrict access to backend services to a minimal set of public IP addresses.
Use Azure controls over third-party solutions for basic security needs. These controls are native to the
platform and are easy to configure and scale.
Define access policies based on the type of workload and control flow between the different application tiers.
When designing a workload, you'll typically start by provisioning an Azure Virtual Network (VNet) in a private
address space which has the workload. No traffic is allowed by default between any two virtual networks. If
there's a need, define the communication paths explicitly. One way of connecting VNets is through Virtual
network peering.
A key aspect of protecting VMs in a VNet is to control the flow of network traffic. The network interfaces on the
VMs allow them to communicate with other VMs, the internet, and on-premises networks. To control traffic on
VMs within a VNet (and subnet), use Application Security Groups (ASGs). ASGs allow you to group a set of VMs
under an application tag and define traffic rules. Those rules are then applied to each of the underlying VMs.
A VNet is segmented into subnets based on business requirements. Ensure that proper network security
controls are configured to allow or deny inbound network traffic to, or outbound network traffic from, within
larger network space.
By default VMs are provisioned with private IP addresses. This allows you to take advantage of the Azure IP
address to determine incoming traffic, how and where it's translated on to the virtual network.
A good Azure IP addressing schema provides flexibility, room for growth, and integration with on-premises
networks. The schema ensures that communication works for deployed resources, minimizes public exposure of
systems, and gives the organization flexibility in its network. If not properly designed, systems might not be able
to communicate, and additional work will be required to remediate.
How do you isolate and protect traffic within the workload VNet?How do you isolate and protect traffic within the workload VNet?
To secure communication within a VNet, set rules that inspect traffic. Then,
allow
or
deny
traffic to, or from
specific sources, and route them to the specified destinations.
Review the rule set and confirm that the required services are not unintentionally blocked.
For traffic between subnets (also referred to as east-west traffic), it's recommended to use Network Security
TIPTIP
Groups (NSG). NSGs allow you to define rules that check the source and destination address, protocol and port
of Inbound and Outbound traffic. The address can be a single IP address, multiple IP addresses, an Azure service
tag or an entire subnet.
If NSGs are being used to isolate and protect the application, the rule set should be reviewed to confirm that
required services are not unintentionally blocked, or more permissive access than expected is allowed.
For advanced networking controls, use Azure Firewall. It can be used to perform deep packet inspection on both
east-west and north-south traffic. Firewalls rules can be defined as policies and centrally managed. An
alternative solution is to use network virtual appliances (NVAs) that check inbound (ingress) and outbound
(egress) traffic and filters based on rules.
How do you route network traffic through NVAs for security boundar y policy enforcement,How do you route network traffic through NVAs for security boundar y policy enforcement,
auditing, and inspection?auditing, and inspection?
Use User Defined Routes (UDR) to control the next hop for traffic between Azure, on-premises, and internet
resources. The routes can be applied to virtual appliance, virtual network gateway, virtual network, or internet.
For example, you need to inspect all ingress traffic from a public load balancer. One way is to host an NVA in a
subnet that allows traffic only if certain criteria is met. That traffic is sent to the subnet that hosts an internal load
balancer that routes that traffic to the backend services.
You can also use NVAs for egress traffic. For instance, all workload traffic is routed by using UDR to another
subnet. That subnet has an internal load balancer that distributes requests to the NVA (or a set of NVAs). These
NVAs direct traffic to the internet using their individual public IP addresses.
Here are the resources for the preceding example:
GitHub: Automated failover for network virtual appliances.
The design considerations are described in Deploy highly available NVAs.
Azure Firewall can serve as an NVA. Azure supports third-party network device providers. They're available in
Azure Marketplace.
How do you get insights about ingoing and outgoing traffic of this workload?How do you get insights about ingoing and outgoing traffic of this workload?
As a general rule, configure and collect network traffic logs. If you use NSGs, capture and analyze NSG flow logs
to monitor performance and security. The NSG flow logs enable Traffic Analytics to gain insights into internal
and external traffic flows of the application.
For information about defining network perimeters, see Network segmentation.
Can the VNet and subnet handle growth?Can the VNet and subnet handle growth?
Typically, you'll add more network resources as the design matures. Most organizations end up adding more
resources to networks than initially planned. Refactoring to accommodate the extra resources is a labor-
intensive process. There is limited security value in creating a very large number of small subnets and then
trying to map network access controls (such as security groups) to each of them.
Plan your subnets based on roles and functions that use the same protocols. That way, you can add resources to
the subnet without making changes to security groups that enforce network level access controls.
Don't use all open rules that allow inbound and outbound traffic to and from 0.0.0.0-255.255.255.255. Use a
least-privilege approach and only allow relevant protocols. It will reduce your overall network attack surface on
Suggested actionsSuggested actions
Learn moreLearn more
Internet edge traffic
Communication with backend services
the subnet. All open rules provide a false sense of security because such a rule enforces no security.
The exception is when you want to use security groups only for network logging purposes.
Design virtual networks and subnets for growth. We recommend planning subnets based on common roles and
functions that use common protocols for those roles and functions. This allows you to add resources to the
subnet without making changes to security groups that enforce network level access controls.
Use NSG or consider using Azure Firewall to protect and control traffic within the VNET.
Azure firewall documentation
Design virtual network subnet security
Design an IP addressing schema for your Azure deployment
Network security groups
As you design the workload, consider security for internet traffic. Does the workload or parts of it need to be
accessible from public IP addresses? What level of access should be given to prevent unauthorized access?
Internet edge traffic (also called
North-South traffic
) represents network connectivity between resources used
by the workload and the internet. An internet edge strategy should be designed to mitigate as many attacks
from the internet to detect or block threats. There are two primary choices that provide security controls and
monitoring:
Azure solutions such as Azure Firewall and Web Application Firewall (WAF).
Azure provides networking solutions to restrict access to individual services. Use multiple levels of security, such
as combination of IP filtering, firewall rules to prevent application services from being accessed by unauthorized
actors.
Network virtual appliances (NVAs). You can use Azure Firewall or third-party solutions available in Azure
Marketplace.
Azure security features are sufficient for common attacks, easy to configure, and scale. Third-party solutions
often have advanced features but they can be hard to configure if they don't integrate well with fabric
controllers. From a cost perspective, Azure options tend to be cheaper than partner solutions.
Information revealing the application platform, such as HTTP banners containing framework information (
X-Powered-By , X-ASPNET-VERSION ), are commonly used by malicious actors when mapping attack vectors of the
application.
HTTP headers, error messages, and website footers should not contain information about the application
platform. Azure CDN can be used to separate the hosting platform from end users. Azure API Management
offers transformation policies that allow you to modify HTTP headers and remove sensitive information.
Suggested actionSuggested action
Consider using CDN for the workload to limit platform detail exposure to attackers.
Learn moreLearn more
Azure CDN documentation
Connection with Azure PaaS services
Most workloads are composed of multiple tiers where several services can serve each tier. Common examples
of tiers are web front ends, business processes, reporting and analysis, backend infrastructure, and so on.
Application resources allowing multiple methods to publish app content (such as FTP, Web Deploy) should have
the unused endpoints disabled. For Azure Web Apps, SCM is the recommended endpoint and it can be protected
separately with network restrictions for sensitive scenarios.
Public access to any workload should be judiciously approved and planned, as public entry points represent a
key possible vector of compromise. When allowing access from public IPs to any back-end service, limiting the
range of allowed IPs can significantly reduce the attack surface of that service. For example, if using Azure Front
Door, you can limit backend tiers to allow Front Door IPs only; or if a partner uses your API, limit access to only
their nominated public IP(s).
How do you configure traffic flow between multiple application tiers?How do you configure traffic flow between multiple application tiers?
Use Azure Virtual Network Subnet to allocate separate address spaces for different elements or tiers within the
workload. Then, define different access policies to control traffic flows between those tiers and restrict access.
You can implement those restrictions through IP filtering or firewall rules.
Do you need to restrict access to the backend infrastructure?Do you need to restrict access to the backend infrastructure?
Restrict access to backend services to a minimal set of public IP addresses with App Services IP restrictions or
Azure Front Door.
Web applications typically have one public entry point and don't expose subsequent APIs and database servers
over the internet. Expose only a minimal set of public IP addresses based on need
and
only those who really
need it. For example, when using gateway services, such as Azure Front Door, it's possible to restrict access only
to a set of Front Door IP addresses and lock down the infrastructure completely.
Suggested actionSuggested action
Restrict and protect application publishing methods.
Learn moreLearn more
Set up Azure App Service access restrictions
Azure Front Door documentation
Deploy your app to Azure App Service using FTP/S
The workload will often need to communicate with other Azure services. For example, it might need to get
secrets from Azure Key Vault. Avoid making connections over the public internet.
Does the workload use secure ways to access Azure PaaS ser vices?Does the workload use secure ways to access Azure PaaS ser vices?
Common approaches for accessing PaaS services are Service Endpoints or Private Links. Both approaches
restrict access to PaaS endpoints only from authorized virtual networks, effectively mitigating data intrusion
risks and associated impact to application availability.
With Service Endpoints, the communication path is secure because you can reach the PaaS endpoint without
needing a public IP address on the VNet. Most PaaS services support communication through service endpoints.
For a list of generally available services, see Virtual Network service endpoints.
Another mechanism is through Azure Private Link. Private Endpoint uses a private IP address from your VNet,
effectively bringing the service into your VNet. For details, see What is Azure Private Link?.
Service Endpoints provide service level access to a PaaS service, whereas Private Link provides direct access to a
On-premises to cloud connectivity
Next step
specific PaaS resource to mitigate data exfiltration risks, such as malicious admin access. Private Link is a paid
service and has meters for inbound and outbound data processed. Private Endpoints are also charged.
How do you control outgoing traffic of Azure PaaS ser vices where Private Link isn't available?How do you control outgoing traffic of Azure PaaS ser vices where Private Link isn't available?
Use NVAs and Azure Firewall (for supported protocols) as a reverse proxy to restrict access to only authorized
PaaS services for services where Private Link isn't supported. Use Azure Firewall to protect against data
exfiltration concerns.
In a hybrid architecture, the workload runs partly on-premises and partly in Azure. Have security controls that
check traffic entering Azure virtual network from on-premises data center.
How do you establish cross premises connectivity?How do you establish cross premises connectivity?
Use Azure ExpressRoute to set up cross premises connectivity to on-premises networks. This service uses a
private, dedicated connection through a third-party connectivity provider. The private connection extends your
on-premises network into Azure. This way, you can reduce the risk of potential of access to company's
information assets on-premises.
How do you access VMs?How do you access VMs?
Use Azure Bastion to log into your VMs and avoid public internet exposure using SSH and RDP with private IP
addresses only. You can also disable RDP/SSH access to VMs and use VPN, ExpressRoute to access these virtual
machines for remote management.
Do the cloud or on-premises VMs have direct internet connectivity for users that may performDo the cloud or on-premises VMs have direct internet connectivity for users that may perform
interactive logins?interactive logins?
Attackers constantly scan public cloud IP ranges for open management ports and attempt low-cost attacks such
as common passwords and known unpatched vulnerabilities. Develop processes and procedures to prevent
direct internet access of VMs with logging and monitoring to enforce policies.
How is internet traffic routed?How is internet traffic routed?
Decide how to route internet traffic. You can use on-premises security devices (also called
forced tunneling
) or
allow connectivity through cloud-based network security devices.
For production enterprise, allow cloud resources to start and respond to internet request directly through cloud
network security devices defined by your internet edge strategy. This approach fits the Nth datacenter paradigm,
that is Azure datacenters are a part of your enterprise. It scales better for an enterprise deployment because it
removes hops that add load, latency, and cost.
Another option is to force tunnel all outbound internet traffic from on-premises through site-to-site VPN. Or, use
a cross-premise WAN link. Network security teams have greater security and visibility to internet traffic. Even
when your resources in the cloud try to respond to incoming requests from the internet, the responses are force
tunneled. This option fits a datacenter expansion use case and can work well for a quick proof of concept, but
scales poorly because of the increased traffic load, latency, and cost. For those reasons, we recommend that you
avoid forced tunneling.
Secure endpoints
Related links
For information about controlling next hop for traffic, see Azure Virtual Network User Defined Routes (UDR).
For information about web application firewalls, see Application Gateway WAF.
For information about Network Appliances from Azure Marketplace, see Network Appliances.
For information about cross premises connectivity, see Azure site-to-site VPN or ExpressRoute.
For information about using VPN/ExpressRoute to access these virtual machines for remote management, see
Disable RDP/SSH access to Azure Virtual Machines.
Go back to the main article: Network security
Best practices for endpoint security on Azure
12/16/2022 6 minutes to read • Edit Online
Key points
Public endpoints
Web application firewalls
(
WAFs
)
Web application firewalls
(
WAFs
)
An
endpoint
is an address exposed by a web application so that external entities can communicate with it. A
malicious or an inadvertent interaction with the endpoint can compromise the security of the application and
even the entire system. One way to protect the endpoint is by placing filter controls on the network traffic that it
receives, such as defining rule sets. A defense-in-depth approach can further mitigate risks. Include
supplemental controls that protect the endpoint if the primary traffic controls fail.
This article describes way in which you can protect web applications with Azure services and features. For
product documentation, see Related links.
Protect all public endpoints with Azure Front Door, Application Gateway, Azure Firewall, Azure DDoS
Protection.
Use web application firewall (WAF) to protect web workloads.
Protect workload publishing methods and restrict ways that are not in use.
Mitigate DDoS attacks. Use Standard protection for critical workloads where outage would have business
impact. Also consider CDN as another layer of protection.
Develop processes and procedures to prevent direct internet access of virtual machines (such as proxy or
firewall) with logging and monitoring to enforce policies.
Implement an automated and gated CI/CD deployment process.
A public endpoint receives traffic over the internet. The endpoints make the service easily accessible to attackers.
Service Endpoints and Private Link can be leveraged to restrict access to PaaS endpoints only from authorized
virtual networks, effectively mitigating data intrusion risks and associated impact to application availability.
Service Endpoints provide service level access to a PaaS service, while Private Link provides direct access to a
specific PaaS resource to mitigate data exfiltration risks such as malicious admin scenarios.
Configure service endpoints and private links where appropriate.
Are all public endpoints of this workload protected?Are all public endpoints of this workload protected?
An initial design decision is to assess whether you need a public endpoint at all. If you do, protect it by using
these mechanisms.
For more information, see Virtual Network service endpoints and What is Azure Private Endpoint?
WAFs provide a basic level of security for web applications. WAFs are appropriate if the organizations that have
invested in application security as WAFs provide additional defense-in-depth mitigation.
WAFs mitigate the risk of an attacker to exploit commonly seen security vulnerabilities for applications. WAFs
provide a basic level of security for web applications. This mechanism is an important mitigation because
attackers target web applications for an ingress point into an organization (similar to a client endpoint).
External application endpoints should be protected against common attack vectors, from Denial of Service (DoS)
attacks like Slowloris to app-level exploits, to prevent potential application downtime due to malicious intent.
TIPTIP
Suggestion actionsSuggestion actions
Azure FirewallAzure Firewall
TIPTIP
Combination approachCombination approach
Azure-native technologies such as Azure Firewall, Application Gateway/Azure Front Door, WAF, and DDoS
Network Protection can be used to achieve requisite protection (Azure DDoS Protection).
Azure Application Gateway has WAF capabilities to inspect web traffic and detect attacks at the HTTP layer. It's a
load balancer and HTTP(S) full reverse proxy that can do secure socket layer (SSL) encryption and decryption.
For example, your workload is hosted in Application Service Environments(ILB ASE). The APIs are consolidated
internally and exposed to external users. This external exposure could be achieved using an Application Gateway.
This service is a load balancer. It forwards request to the internal API Management service, which in turn
consumes the APIs deployed in the ASE. Application Gateway is also configured over port 443 for secured and
reliable outbound calls.
The design considerations for the preceding example are described in Publishing internal APIs to external users.
Azure Front Door and Azure Content Delivery Network (CDN) also have WAF capabilities.
Protect all public endpoints with appropriate solutions such as Azure Front Door, Application Gateway, Azure
Firewall, Azure DDOS Protection, or any third-party solution.
Learn moreLearn more
What is Azure Firewall?
Azure DDoS Protection overview
Azure Front Door documentation
What is Azure Application Gateway?
Protect the entire virtual network against potentially malicious traffic from the internet and other external
locations. It inspects incoming traffic and only passes the allowed requests to pass through.
A common design is to implement a DMZ or a perimeter network in front of the application. The DMZ is a
separate subnet with the firewall.
The design considerations are described in Deploy highly available NVAs.
When you want higher security and there's a mix of web and non-web workloads in the virtual network use
both Azure Firewall and Application Gateway. There are several ways in which those two services can work
together.
For example, you want to filter egress traffic. You want to allow connectivity to a specific Azure Storage Account
but not others. You'll need fully qualified domain name (FQDN)-based filters. In this case run Firewall and
Application Gateway in parallel.
Another popular design is when you want Azure Firewall to inspect all traffic and WAF to protect web traffic, and
the application needs to know the client's source IP address. In this case, place Application Gateway in front of
Firewall. Conversely, you can place Firewall in front of WAF if you want to inspect and filter traffic before it
reaches the Application Gateway.
For more information, see Firewall and Application Gateway for virtual networks.
AuthenticationAuthentication
Mitigate DDoS attacks
Suggested actionSuggested action
Learn moreLearn more
Adopt DevOps
It's challenging to write concise firewall rules for networks where different cloud resources dynamically spin up
and down. Use Microsoft Defender for Cloud to detect misconfiguration risks.
Disable insecure legacy protocols for internet-facing services. Legacy authentication methods are among the top
attack vectors for cloud-hosted services. Those methods don't support other factors beyond passwords and are
prime targets for password spraying, dictionary, or brute force attacks.
In a distributed denial-of-service (DDoS) attack, the server is overloaded with fake traffic. DDoS attacks are
common and can be debilitating. An attack can completely block access or take down services. Make sure all
business-critical web application and services have DDoS mitigation beyond the default defenses so that the
application doesn't experience downtime because that can negatively impact business.
Microsoft recommends adopting advanced protection for any services where downtime will have negative
impact on the business.
How do you implement DDoS protection?How do you implement DDoS protection?
Here are some considerations:
DDoS protection at the infrastructure level in which your workload runs. Azure infrastructure has built-in
defenses for DDoS attacks.
DDoS protection at the network (layer 3) layer. Azure provides additional protection for services provisioned
in a virtual network.
DDoS protection with caching. Content delivery network (CDN) can add another layer of protection. In a
DDoS attack, a CDN intercepts the traffic and stops it from reaching the backend server. Azure CDN is
natively protected. Azure also supports popular CDNs that are protected with proprietary DDoS mitigation
platform.
Advanced DDoS protection. In your security baseline, consider features with monitoring techniques that use
machine learning to detect anomalous traffic and proactively protect your application before service
degradation occurs.
For information about Azure DDoS Protection services, see Azure DDoS Protection documentation.
Identify critical workloads that are susceptible to DDoS attacks and enable Distributed Denial of Service (DDoS)
mitigations for all business-critical web applications and services.
For a list of reference architectures that demonstrate the use of DDoS protection, see Azure DDoS Protection
reference architectures.
Developers shouldn't publish their code directly to app servers.
Does the organization have an CI/CD process for publishing code in this workload?Does the organization have an CI/CD process for publishing code in this workload?
Implement lifecycle of continuous integration, continuous delivery (CI/CD) for applications. Have processes and
tools in place that aid in an automated and gated CI/CD deployment process.
How are the publishing methods secured?How are the publishing methods secured?
Next step
Related linksRelated links
Application resources allowing multiple methods to publish app content, such as FTP, Web Deploy should have
the unused endpoints disabled. For Azure Web Apps, SCM is the recommended endpoint. It can be protected
separately with network restrictions for sensitive use cases.
Data flow
Azure Firewall
What is Azure Web Application Firewall on Azure Application Gateway?
Azure DDoS Protection
Go back to the main article: Network security
Traffic flow security in Azure
12/16/2022 3 minutes to read • Edit Online
Key points
East-west and north-south traffic
Protect data anywhere it goes including cloud services, mobile devices, workstations, or collaboration platforms.
In addition to using access control and encryption mechanisms, apply strong network controls that detect,
monitor, and contain attacks.
Control network traffic between subnets (east-west) and application tiers (north-south).
Apply a layered defense-in-depth approach that starts with Zero-Trust policies.
Use a cloud application security broker (CASB).
When analyzing the network flow of a workload, distinguish between east-west traffic from north-south traffic.
Most cloud architectures use a combination of both types.
Is the traffic between subnets, Azure components and tiers of the workload managed and secured?Is the traffic between subnets, Azure components and tiers of the workload managed and secured?
Nor th-south trafficNor th-south traffic
North-south
refers to the traffic that flows in and out of a datacenter. For example, traffic from an
application to a backend service. This type of traffic is a typical target for attack vectors because it flows
over the public internet. Proper network controls must be in place so that the queries to and from a data
center are secure.
Consider a typical flow in an Azure Kubernetes Service (AKS) cluster. The cluster receives incoming
(ingress) traffic from HTTP requests. The cluster can also send outgoing (egress) traffic to send queries to
other services, such as pulling a container image.
Your design can use Web Application Firewall on Application Gateway to secure ingress traffic, and Azure
Firewall to secure outgoing (egress) traffic.
East-west trafficEast-west traffic
East-west
traffic refers to traffic between or within data centers. For this type of traffic, several resources
of the network infrastructure communicate with each other. Those resources can be virtual networks,
subnets within those virtual networks, and so on. Security of east-west traffic can get overlooked even
though it makes up a large portion of the workload traffic. It's assumed that the infrastructure firewalls
are sufficient to block attacks. Make sure there are proper controls between network resources.
Extending the example of the AKS cluster to this concept, east-west traffic is the traffic within the cluster.
For example, communication between pods, such as the ingress controller and the workload. If your
workload is composed of multiple applications, the communication between those applications would fall
into this category.
By using Kubernetes network policies, you can restrict which pods can communicate, starting from a
Zero-Trust policy and then opening specific communication paths as needed.
TIPTIP
Data exfiltration
IMPORTANTIMPORTANT
Learn more
Related links
Here are the resources for the preceding AKS example:
GitHub: Azure Kubernetes Service (AKS) Secure Baseline Reference Implementation.
The design considerations are described in Azure Kubernetes Service (AKS) production baseline.
Data exfiltration is a common attack where an internal or external malicious actor does an unauthorized data
transfer. Most often access is gained because of lack of network controls.
Network virtual appliance (NVA) solutions and Azure Firewall (for supported protocols) can be leveraged as a
reverse proxy to restrict access to only authorized PaaS services for services where Private Link is not yet
supported (Azure Firewall).
Configure Azure Firewall or a third-party next generation firewall to protect against data exfiltration concerns.
Are there controls in the workload design to detect and protect from data exfiltration?Are there controls in the workload design to detect and protect from data exfiltration?
Choose a defense-in-depth design that can protect network communications at various layers, such as a hub-
spoke topology. Azure provides several controls to support the layered design:
Use Azure Firewall to allow or deny traffic using layer 3 to layer 7 controls.
Use Azure Virtual Network User Defined Routes (UDR) to control next hop for traffic.
Control traffic with Network Security Groups (NSGs) between resources within a virtual network, internet,
and other virtual networks.
Secure the endpoints through Azure PrivateLink and Private Endpoints.
Detect and protect at deep levels through packet inspection.
Detect attacks and respond to alerts through Microsoft Sentinel and Microsoft Defender for Cloud.
Network controls are not sufficient in blocking data exfiltration attempts. Harden the protection with proper identity
controls, key protection, and encryption. For more information, see these sections:
Data protection considerations
Identity and access management considerations
Have you considered a cloud application security broker (CASB) for this workload?Have you considered a cloud application security broker (CASB) for this workload?
CASBs provide a central point of control for enforcing policies. They provide rich visibility, control over data
travel, and sophisticated analytics to identify and combat cyberthreats across all Microsoft and third-party cloud
services.
Azure firewall documentation
Azure Marketplace networking apps
Azure Firewall
Network Security Groups (NSG)
What is Azure Web Application Firewall on Azure Application Gateway?
What is Azure Private Link?
Go back to the main article: Network security
Data protection considerations
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure security benchmark
Reference architecture
Next steps
Related links
Classify, protect, and monitor sensitive data assets using access control, encryption, and logging in Azure.
Provide controls on data at rest and in transit.
How are you managing encr yption for this workload?How are you managing encr yption for this workload?
Use identity based storage access controls.
Use built-in features for data encryption for Azure services.
Classify all stored data and encrypt it.
Protect data moving over a network through encryption at all points so that it's not accessed unauthorized
users.
Store keys in managed key vault service with identity-based access control and audit policies.
Rotate keys and other secrets frequently.
The Azure Security Benchmark includes a collection of high-impact security recommendations you can use to
help secure the services you use in Azure:
The questions in this section are aligned to the Azure Security Benchmarks Data Protection.
Here are some reference architectures related to secure storage:
Using Azure file shares in a hybrid environment
DevSecOps in Azure
We recommend that you review the practices and tools implemented as part of the development cycle.
Data protection
Back to the main article: Security
Data encryption in Azure
12/16/2022 9 minutes to read • Edit Online
Key points
Azure encryption features
Data can be categorized by its state:
Data at restData at rest. All information storage objects, containers, and types that exist statically on physical media,
whether magnetic or optical disk.
Data in transitData in transit. Data that is being transferred between components, locations, or programs.
In a cloud solution, a single business transaction can lead to multiple data operations where data moves from
one storage medium to another. To provide complete data protection, it must be encrypted on storage volumes
and while it's transferred from one point to another.
Use identity-based storage access controls.
Use standard and recommended encryption algorithms.
Use only secure hash algorithms (SHA-2 family).
Classify your data at rest and use encryption.
Encrypt virtual disks.
Use an additional key encryption key (KEK) to protect your data encryption key (DEK).
Protect data in transit through encrypted network channels (TLS/HTTPS) for all client/server communication.
Use TLS 1.2 on Azure.
Azure provides built-in features for data encryption in many layers that participate in data processing. We
recommend that for each service, enable the encryption capability. The encryption is handled automatically
using Azure-managed keys. This almost requires no user interaction.
We recommend implementing identity-based storage access controls. Authentication with a shared key (like a
Shared Access Signature) doesn't permit the same flexibility and control as identity-based access control. The
leak of a shared key might allow indefinite access to a resource, whereas a role-based access control can be
identified and authenticated more strongly.
Storage in a cloud service like Azure is architected and implemented quite differently than on-premises
solutions to enable massive scaling, modern access through REST APIs, and isolation between tenants. Cloud
service providers make multiple methods of access control over storage resources available. Examples include
shared keys, shared signatures, anonymous access, and identity provider-based methods.
Consider some built-in features of Azure Storage:
Identity-based accessIdentity-based access. Supports access through Azure Active Directory (Azure AD) and key-based
authentication mechanisms, such as Symmetric Shared Key Authentication, or Shared Access Signature
(SAS).
Built-in encr yptionBuilt-in encr yption. All stored data is encrypted by Azure storage. Data cannot be read by a tenant if it has
not been written by that tenant. This feature provides control over cross tenant data leakage.
Region-based controlsRegion-based controls. Data remains only in the selected region and three synchronous copies of data are
maintained within that region. Azure storage provides detailed activity logging is available on an opt-in basis.
Firewall featuresFirewall features. The firewall provides an additional layer of access control and storage threat protection
Suggested action
Standard encryption algorithms
Data at rest
Data classification
to detect anomalous access and activities.
For the complete set of features, see Azure Storage Service encryption.
Identify provider methods of authentication and authorization that are the least likely to be compromised, and
enable more fine-grained role-based access controls over storage resources.
Learn moreLearn more
For more information, reference Authorize access to blobs using Azure Active Directory.
Does the organization use industr y standard encr yption algorithms instead of creating their own?Does the organization use industr y standard encr yption algorithms instead of creating their own?
Organizations should not develop and maintain their own encryption algorithms. Avoid using custom
encryption algorithms or direct cryptography in your workload. These methods rarely stand up to real world
attacks.
Secure standards already exist on the market and should be preferred. If custom implementation is required,
developers should use well-established cryptographic algorithms and secure standards. Use Advanced
Encryption Standard (AES) as a symmetric block cipher, AES-128, AES-192, and AES-256 are acceptable.
Developers should use cryptography APIs built into operating systems instead of non-platform cryptography
libraries. For .NET, follow the .NET Cryptography Model.
We advise using standard and recommended encryption algorithms.
For more information, refer to Choose an algorithm.
Are modern hashing functions used?Are modern hashing functions used?
Applications should use the SHA-2 family of hash algorithms (SHA-256, SHA-384, SHA-512).
All important data should be classified and encrypted with an encryption standard. Classify and protect all
information storage objects. Use encryption to make sure the contents of files cannot be accessed by
unauthorized users.
Data at rest is encrypted by default in Azure, but is your critical data classified and tagged, or labeled so that it
can be audited?
Your most sensitive data might include business, financial, healthcare, or personal information. Discovering and
classifying this data can play a pivotal role in your organization's information protection approach. It can serve
as infrastructure for:
Helping to meet standards for data privacy and requirements for regulatory compliance.
Various security scenarios, such as monitoring (auditing) and alerting on anomalous access to sensitive data.
Controlling access to and hardening the security of databases that contain highly sensitive data.
Suggested actionSuggested action
Classify your data. Consider using Data Discovery & Classification in Azure SQL Database.
A crucial initial exercise for protecting data is to organize it into categories based on certain criteria. The
classification criteria can be your business needs, compliance requirements, and the type of data.
Depending on the category, you can protect it through:
Standard encryption mechanisms.
Enforce security governance through policies.
Conduct audits to make sure the security measures are compliant.
One way of classifying data is through the use of tags.
Does the organization encr ypt vir tual disk files for vir tual machines that are associated with thisDoes the organization encr ypt vir tual disk files for vir tual machines that are associated with this
workload?workload?
There are many options to store files in the cloud. Cloud-native apps typically use Azure Storage. Apps that run
on VMs use them to store files. VMs use virtual disk files as virtual storage volumes and exist in a blob storage.
Consider a hybrid solution. Files can move from on-premises to the cloud, from the cloud to on-premises, or
between services hosted in the cloud. One strategy is to make sure that the files and their contents aren't
accessible to unauthorized users. You can use authentication-based access controls to prevent unauthorized
downloading of files. However, that is not enough. Have a backup mechanism to secure the virtual disk files in
case authentication and authorization or its configuration is compromised. There are several approaches. You
can encrypt the virtual disk files. If an attempt is made to mount disk files, the contents of the files cannot be
accessed because of the encryption.
We recommend that you enable virtual disk encryption. For information about how to encrypt Windows VM
disks, see Quickstart: Create and encrypt a Windows VM with the Azure CLI.
Azure-based virtual disks are stored as files in a Storage account. If no encryption is applied to a virtual disk,
and an attacker manages to download a virtual disk image file, it can be mounted and inspected at the attacker's
leisure as if they had physical access to the source computer. Encrypting virtual disk files helps prevent attackers
from gaining access to the contents of those disk files in the event they are able to download them. Depending
on the sensitivity of the information stored on the disk, unencrypted access could represent a critical risk to
confidential business data (such as a SQL database) or identity (such as an AD Domain Controller).
An example of virtual disk encryption is Azure Disk Encryption.
Azure Disk Encryption helps protect and safeguard your data to meet your organizational security and
compliance commitments. It uses the Bitlocker-feature of Windows (or DM-Crypt on Linux) to provide volume
encryption for the OS and data disks of Azure virtual machines (VMs). It is integrated with Azure Key Vault to
help you control and manage the disk encryption keys, and secrets.
Virtual machines use virtual disk files as storage volumes and exist in a cloud service provider's blob storage
system. These files can be moved from on-premises to cloud systems, from cloud systems to on-premises, or
between cloud systems. Due to the mobility of these files, it's recommended that the files and the contents are
not accessible to unauthorized users.
Does the organization use identity-based storage access controls for this workload?Does the organization use identity-based storage access controls for this workload?
There are many ways to control access to data: shared keys, shared signatures, anonymous access, identity
provider-based. Use Azure Active Directory (Azure AD) and role-based access control (RBAC) to grant access. For
more information, see Identity and access management considerations.
Does the organization protect keys in this workload with an additional key encr yption key (KEK)?Does the organization protect keys in this workload with an additional key encr yption key (KEK)?
Use more than one encryption key in an encryption at rest implementation. Storing an encryption key in Azure
Key Vault ensures secure key access and central management of keys.
Data in transit
Use an additional key encryption key (KEK) to protect your data encryption key (DEK).
Suggested actionsSuggested actions
Identify unencrypted virtual machines via Microsoft Defender for Cloud or script, and encrypt via Azure Disk
Encryption. Ensure all new virtual machines are encrypted by default and regularly monitor for unprotected
disks.
Learn moreLearn more
Azure Disk Encryption for virtual machines and virtual machine scale sets
Data in transit should be encrypted at all points to ensure data integrity.
Protecting data in transit should be an essential part of your data protection strategy. Because data is moving
back and forth from many locations, we generally recommend that you always use SSL/TLS protocols to
exchange data across different locations.
For data moving between your on-premises infrastructure and Azure, consider appropriate safeguards such as
HTTPS or VPN. When sending encrypted traffic between an Azure virtual network and an on-premises location
over the public internet, use Azure VPN Gateway.
Does the workload communicate over encr ypted network traffic only?Does the workload communicate over encr ypted network traffic only?
Any network communication between client and server where man-in-the-middle attacks can occur, must be
encrypted. All website communication should use HTTPS, no matter the perceived sensitivity of transferred data.
Man-in-the-middle attacks can occur anywhere on the site, not just login forms.
This mechanism can be applied to use cases such as:
Web applications and APIs for all communication with clients.
Data moving across a service bus from on-premises to the cloud and other way around, or during an
input/output process.
In certain architecture styles such as microservices, data must be encrypted during communication between the
services.
What TLS version is used across workloads?What TLS version is used across workloads?
Using the latest version of TLS is preferred. All Azure services support TLS 1.2 on public HTTPS endpoints.
Migrate solutions to support TLS 1.2 and use this version by default.
When traffic from clients using older versions of TLS is minimal, or it's acceptable to fail requests made with an
older version of TLS, consider enforcing a minimum TLS version. For information about TLS support in Azure
Storage, see Remediate security risks with a minimum version of TLS.
Sometimes you need to isolate your entire communication channel between your on-premises and the cloud
infrastructure by using either a virtual private network (VPN) or ExpressRoute. For more information, see these
articles:
Extending on-premises data solutions to the cloud
Configure a Point-to-Site VPN connection to a VNet using native Azure certificate authentication: Azure
portal
For more information, see Protect data in transit.
Is there any por tion of the application that doesn't secure data in transit?Is there any por tion of the application that doesn't secure data in transit?
Suggested actionsSuggested actions
Learn more
Next steps
Related links
All data should be encrypted in transit using a common encryption standard. Determine if all components in the
solution are using a consistent standard. There are times when encryption is not possible because of technical
limitations, make sure the reason is clear and valid.
Identify workloads using unencrypted sessions and configure the service to require encryption.
Encrypt data in transit
Azure encryption overview
While it's important to protect data through encryption, it's equally important to protect they keys that provide
access to the data.
Key and secret management
Identity and access management services authenticate and grant permission to users, partners, customers,
applications, services, and other entities. For security considerations, see Azure identity and access management
considerations.
Back to the main article: Data protection
Key and secret management considerations in
Azure
12/16/2022 5 minutes to read • Edit Online
Key points
Identity-based access control
NOTENOTE
Encryption is an essential tool for security because it restricts access. However, it's equally important to protect
the secrets (keys, certificates) key that provide access to the data.
Use identity-based access control instead of cryptographic keys.
Use standard and recommended encryption algorithms.
Store keys and secrets in managed key vault service. Control permissions with an access model.
Rotate keys and other secrets frequently. Replace expired or compromised secrets.
Organizations shouldn't develop and maintain their own encryption algorithms. There are many ways to
provide access control over storage resources available, such as:
Shared keys
Shared signatures
Anonymous access
Identity provider-based methods
Secure standards already exist on the market and should be preferred. AES should be used as symmetric block
cipher, AES-128 , AES-192 , and AES-256 are acceptable. Crypto APIs built into operating systems should be used
where possible, instead of non-platform crypto libraries. For .NET, make sure you follow the .NET Cryptography
Model.
Do you prioritize authentication through identity ser vices for a workload over cr yptographicDo you prioritize authentication through identity ser vices for a workload over cr yptographic
keys?keys?
Protection of cryptographic keys can often get overlooked or implemented poorly. Managing keys securely with
application code is especially difficult and can lead to mistakes such as accidentally publishing sensitive access
keys to public code repositories.
Use of identity-based options for storage access control is recommended. This option uses role-based access
controls (RBAC) over storage resources. Use RBAC to assign permissions to users, groups, and applications at a
certain scope. Identity systems such as Azure Active Directory (Azure AD) offer secure and usable experience for
access control with built-in mechanisms for handling key rotation, monitoring for anomalies, and others.
Grant access based on the principle of least privilege. Risk of giving more privileges than necessary can lead to data
compromise.
Suppose you need to store sensitive data in Azure Blob Storage. You can use Azure AD and RBAC to authenticate
a service principal that has the required permissions to access the storage. For more information about the
feature, reference Authorize access to blobs and queues using Azure Active Directory.
TIPTIP
Key storage
Operational considerations
Using SAS tokens is a common way to control access. SAS tokens are created by using the service owner's Azure AD
credentials. The tokens are created per resource and you can use Azure RBAC to restrict access. SAS tokens have a time
limit, which controls the window of exposure. Here are the resources for the preceding example:
GitHub: Azure Cognitive Services Reference Implementation.
The design considerations are described in Speech transcription with Azure Cognitive Services.
To prevent security leaks, store the following keys and secrets in a secure store:
API keys
Database connection strings
Data encryption keys
Passwords
Sensitive information shouldn't be stored within the application code or configuration. An attacker gaining read
access to source code shouldn't gain knowledge of application and environment-specific secrets.
Store all application keys and secrets in a managed key vault service such as Azure Key Vault or HashiCorp Vault.
Storing encryption keys in a managed store further limits access. The workload can access the secrets by
authenticating against Key Vault by using managed identities. That access can be restricted with Azure RBAC.
Make sure no keys and secrets for any environment types (Dev, Test, or Production) are stored in application
configuration files or CI/CD pipelines. Developers can use Visual Studio Connected Services or local-only files to
access credentials.
Have processes that periodically detect exposed keys in your application code. An option is Credential Scanner.
For information about the configuring task, reference Credential Scanner task.
Do you have an access model for key vaults to grant access to keys and secrets?Do you have an access model for key vaults to grant access to keys and secrets?
To secure access to your key vaults, control permissions to keys and secrets through an access model. For more
information, reference Access model overview.
Suggested actionsSuggested actions
Consider using Azure Key Vault for secrets and keys.
Who is responsible for managing keys and secrets in the application context?Who is responsible for managing keys and secrets in the application context?
Key and certificate rotation is often the cause of application outages. Even Azure has experienced expired
certificates. It's critical that the rotation of keys and certificates be scheduled and fully operationalized. The
rotation process should be automated and tested to ensure effectiveness. Azure Key Vault supports key rotation
and auditing.
Central SecOps team provides guidance on how keys and secrets are managed (governance). Application
DevOps team is responsible for managing the application-related keys and secrets.
What types of keys and secrets are used and how are those generated?What types of keys and secrets are used and how are those generated?
Suggested actions
Learn more
Related content
The following approaches include:
Microsoft-managed Keys
Customer-managed Keys
Bring Your Own Key
The decision is often driven by security, compliance, and specific data classification requirements. Develop a
clear understanding of these requirements to determine the most suitable type of keys.
Are keys and secrets rotated frequently?Are keys and secrets rotated frequently?
To reduce the attack vectors, secrets require rotation and are prone to expiration. The process should be
automated and executed without any human interactions. Storing them in a managed store simplifies those
operational tasks by handling key rotation.
Replace secrets after they've reached the end of their active lifetime or if they've been compromised. Renewed
certificates should also use a new key. Have a process for situations where keys get compromised (leaked) and
need to be regenerated on-demand. For example, secrets rotation in SQL Database.
For more information, reference Key Vault Key Rotation.
By using managed identities, you remove the operational overhead for storing the secrets or certificates of
service principals.
Are the expiration dates of SSL/TLS cer tificates monitored and are processes in place to renewAre the expiration dates of SSL/TLS cer tificates monitored and are processes in place to renew
them?them?
A common cause of application outage is expired SSL/TLS certificates.
Avoid outages by tracking the expiration dates of SSL/TLS certificates and renewing them in due time. Ideally,
the process should be automated, although this often depends on used certificate authority (CA). If not
automated, use alerts to make sure expiration dates don't go unnoticed.
Implement a process for SSL certificate management and the automated renewal process with Azure Key Vault.
Tutorial: Configure certificate auto-rotation in Key Vault
Identity and access management services authenticate and grant permission to the following groups:
Users
Partners
Customers
Applications
Services
Other entities
For security considerations, reference Azure identity and access management considerations.
Back to the main article: Data protection
Next steps
Protect data at rest and in transit through encryption. Make sure you use standard encryption algorithms.
Data encryption
Applications and services
12/16/2022 2 minutes to read • Edit Online
Identify and classify business critical applications
Suggested actionsSuggested actions
Next steps
Applications and the data associated with them act as the primary store of business value on a cloud platform.
Applications can play a role in risks to the business because:
Business processesBusiness processes are encapsulated and executed by applications and services need to be available and
provided with high integrity.
Business dataBusiness data is stored and processed by application workloads and requires high assurances of
confidentiality, integrity, and availability.
Enterprise organizations typically have a large application portfolio, but not all applications have equal
importance. Applications can be classified based on a criticality scale. For example, business-critical applications
are designed to prevent financial losses, safety-critical are focused on costs associated with loss of human life.
Mission-critical applicationsMission-critical applications cover both aspects that can be impacted by unavailability or underperformance.
Criticality should be identified and classified, to direct investment of monitoring, time, and resources
appropriately. You should also identify applications or systems with significant access — those which might
grant control over other critical systems or data.
For more information, see Criticality scale in Cloud Adoption Framework.
Identify and classify key organizational applications according to organizational impact.
See these best practices related to PaaS applications.
Securing PaaS deployments
Secure communication paths between applications and the services. Make sure that there's a distinction
between the endpoints exposed to the public internet and private ones. Also, the public endpoints are protected
with web application firewall.
Network security
Application classification for security
12/16/2022 5 minutes to read • Edit Online
Understand your responsibility as an owner
Azure can host both legacy and modern applications through Infrastructure as a Service (IaaS) virtual machines
and Platform as a Service (PaaS). With legacy applications, you have the responsibility of securing all
dependencies including OS, middleware, and other components. For PaaS applications, you don't need to
manage and secure the underlying server OS. You are responsible for the application configuration.
This article describes the considerations for understanding the hosting models and the security responsibility of
each, identifying critical applications.
Securing an application requires security assurances for three aspects:
Application codeApplication code. The logic that defines the custom application that you write. Securing that code
requires identifying and mitigating risks from the design and implementation of the application and
assessing supply chain risk of included components.
Application ser vicesApplication ser vices. The cloud services that the application uses such as databases, identity providers,
event hubs, IoT device management, and so on. Security for cloud services is a shared responsibility. The
cloud provider ensures the security of the underlying service. The application owner is responsible for
security implications of the configuration and operation of the service instance(s) used by the application
including any data stored and processed on the service.
Application hosting platformApplication hosting platform. The computing environment where the application runs. This could take
many forms with significant variations on who is responsible for security:
Legacy applicationsLegacy applications. typically require a full operating system (and any middleware) hosted on
physical or virtualized hardware. This operating system and installed middleware/other
components are operated and secured by the application owner or their infrastructure team(s).
The security responsibility for the physical hardware and OS virtualization components
(virtualization hosts, operating systems, and management services) varies:
Identify and classify applications
Use Azure services for fundamental components
On-premises:On-premises: The application owner is responsible for maintenance and security.
IaaS:IaaS: The cloud provider is responsible for the underlying infrastructure and the application
owner's organization is responsible for the VM configuration, operating system, and any
components installed on it.
Modern applicationsModern applications are hosted on PaaS environments such as an Azure application service.
The underlying operating system is secured by the cloud provider. Application owners are
responsible for the security of the application service configurations.
ContainersContainers are an application packaging mechanism in which applications are abstracted from
the environment in which they run. The containerized applications can run on a container service
by the cloud provider (modern applications) or on a server managed on premises or in IaaS.
Identify applications that have a high potential impact and,or a high potential exposure to risk.
Business critical dataBusiness critical data. Applications that process or store information must have assurance of
confidentiality, integrity, and availability.
Regulated dataRegulated data. Applications that handle monetary instruments and sensitive personal information
regulated by standards such as the payment card industry (PCI), General Data Protection Regulation (GDPR),
and Health Information Portability and Accountability Act (HIPAA).
Business critical availabilityBusiness critical availability. Applications whose functionality is critical to the business mission, such as
production lines generating revenue, devices or services critical to life and safety, and other critical functions.
Significant AccessSignificant Access. Applications that have access to systems with a high impact through technical means
such as:
High exposure to attacksHigh exposure to attacks. Applications that are easily accessible to attackers such as web applications on
the public internet. Legacy applications can also be higher exposure as attackers (and penetration testers)
frequently target them because they know these legacy applications often have vulnerabilities that are
difficult to fix.
Stored Credentials or keys/certificates that grant access to the data/service.
Permissions granted through access control lists or other methods.
Developers should use services available from a cloud provider for well-established functions like databases,
encryption, identity directory, and authentication, instead of building or adopting custom implementations, or
third-party solutions that require integration with the cloud provider. These services provide better security,
reliability, and efficiency because cloud providers operate and secure them with dedicated teams with deep
expertise in those areas.
Using these services also frees your developer resources from reinventing the proverbial wheel so that they can
focus development time on your unique requirements for your business. This practice should be followed to
avoid risk during new application development and to reduce risk in existing applications either during the
planned update cycle, or with a security-focused application update.
We recommend using cloud services from your cloud provider for identity, data protection, key management,
and application configurations:
Identity:Identity: User directories and other authentication functions are complex to develop and critically
important to security assurances. Avoid custom authentication solutions. Instead choose native
capabilities like Azure Active Directory (Azure AD), Azure AD B2B, Azure AD B2C, or third-party solutions
to authenticate and grant permission to users, partners, customers, applications, services, and other
entities. For more information, see Security with identity and access management (IAM) in Azure.
Use native capabilities
Next steps
Data Protection:Data Protection: Use established capabilities from cloud providers such as native encryption in cloud
services to encrypt and protect data. If direct use of cryptography is required, use well-established
cryptographic algorithms and not attempt to invent their own.
Key management:Key management: Always authenticate with identity services rather than handling cryptographic key.
For situations where you need to use keys, use a managed key store such as Azure Key Vault. This will
make sure keys are handled safely in application code. Tools such as, CredScan can discover potentially
exposed keys in your application code.
Application Configurations:Application Configurations: Inconsistent configurations for applications can create security risks.
Application configuration information can be stored with the application itself or preferably using a
dedicated configuration management system like Azure App Configuration or Azure Key Vault. App
Configuration provides a service to centrally manage application settings and feature flags, which helps
mitigate this risk. Don't store keys and secrets in application configuration.
For more information about using cloud services instead of custom implementations, reference Applications
and services.
Use native security capabilities built into cloud services instead of adding external security components, such as
data encryption, network traffic filtering, threat detection, and other functions.
Azure controls are maintained and supported by Microsoft. You don't have to invest in additional security
tooling.
List of Azure Services
Native security capabilities of each service
Applications and services
Application classification
Application threat analysis
Regulatory compliance
Application threat analysis
12/16/2022 4 minutes to read • Edit Online
1- Gather information about the basic security controls
A SK T H IS QU EST ION .. .A SK T H IS QU EST ION .. . TO DET E RM IN E C O N T RO L S T H AT . ..TO DET E RM IN E C O N T RO L S T H AT . ..
Are connections authenticated using Azure AD, TLS (with
mutual authentication), or another modern security protocol
approved by the security team?
Prevent unauthorized access to the application component
and data.
Are you limiting access to only those accounts that have the
need to write or modify data in the application
Prevent unauthorized data tampering or alteration.
Is the application activity logged and fed into a Security
Information and Event Management (SIEM) through Azure
Monitor or a similar solution?
Detect and investigate attacks quickly.
Is critical data protected with encryption that has been
approved by the security team?
Prevent unauthorized copying of data at rest.
Is inbound and outbound network traffic encrypted using
TLS?
Prevent unauthorized copying of data in transit.
Is the application protected against Distributed Denial of
Service (DDoS) attacks using services such as Azure DDoS
protection?
Detect attacks designed to overload the application so it
can't be used.
Does the application store any logon credentials or keys to
access other applications, databases, or services?
Identify whether an attack can use your application to attack
other systems.
Do the application controls allow you to fulfill regulatory
requirements?
Protect user's private data and avoid compliance fines.
Do a comprehensive analysis to identify threats, attacks, vulnerabilities, and counter measures. Having this
information can protect the application and threats it might pose to the system. Start with simple questions to
gain insight into potential risks. Then, progress to advanced techniques using threat modeling.
A threat modeling tool will produce a report of all threats identified. This report is typically uploaded into a
tracking tool, or converted to work items that can be validated and addressed by the developers. As new
features are added to the solution, the threat model should be updated and integrated into the code
management process. If a security issue is found, there should be a process to triage issue severity and
determine when and how to remediate (such as in the next release cycle, or a faster release).
Start by gathering information about each component of the application. The answers to these questions will
identify gaps in basic protection and clarify the attack vectors.
Between users and the application
Between different application components and
services
Suggested actionsSuggested actions
2- Evaluate the application design progressively
3- Mitigate the identified threats
Assign tasks to the individual people who are responsible for a particular risk identified during threat modeling.
Learn moreLearn more
Threat modeling
Analyze application components and connections and their relationships. Threat modeling is a crucial
engineering exercise that includes defining security requirements, identifying and mitigating threats, and
validating those mitigations. This technique can be used at any stage of application development or production,
but it's most effective during the design stages of a new functionality.
Popular methodologies include:
STRIDE:
Spoofing
Tampering
Repudiation
Information Disclosure
Denial of Service
Elevation of Privilege
Microsoft Security Development Lifecycle uses STRIDE and provides a tool to assist with this process. This tool is
available at no additional cost. For more information, see Microsoft Threat Modeling Tool.
Open Web Application Security Project (OWASP) has documented a threat modeling approach for
applications.
Integrate threat modeling through automation using secure operations. Here are some resources:
Toolkit for Secure DevOps on Azure.
Guidance on DevOps pipeline security by OWASP.
The threat modeling tool produces a report of all the threats identified. After a potential threat is identified,
determine how it can be detected and the response to that attack. Define a process and timeline which
minimizes exposure to any identified vulnerabilities in the workload, so that those vulnerabilities cannot be left
unaddressed.
Use the
Defense-in-Depth
approach. This can help identify controls needed in the design to mitigate risk if a
primary security control fails. Evaluate how likely it is for the primary control to fail. If it does, what is the extent
of the potential organizational risk? Also, what is the effectiveness of the additional control (especially in cases
that would cause the primary control to fail). Based on the evaluation apply Defense-in-Depth measures to
address potential failures of security controls.
The principle of
least privilege
is one way of implementing Defense-in-Depth. It limits the damage that can be
done by a single account. Grant least number of privileges to accounts that allows them to accomplish with the
required permissions within a time period. This helps mitigate the damage of an attacker who gains access to
the account to compromise security assurances.
There's often a disconnect between organizational leadership and technical teams regarding business
requirements for critical workloads. This can create undesired outcomes and is especially sensitive when it
pertains to information security. Routinely reviewing business critical workload requirements with executive
Learn moreLearn more
Next steps
sponsors to define requirements provides an opportunity to align expectations and ensure operational resource
allocation to the initiative.
How are threats addressed once found?How are threats addressed once found?
Here are some best practices:
Make sure the results are communicated to the interested teams.
Prioritize the vulnerabilities and fix the most important in a timely manner.
Upload the threat modeling report to a tracking tool. Create work items that can be validated and addressed
by the developers. Cyber security teams can also use the report to determine attack vectors during a
penetration test.
As new features are added to the application, update the threat model report and integrate it into the code
management process. Triage security issues into the next release cycle or a faster release, depending on the
severity.
For information about mitigation strategies, see RapidAttack.
How long does it typically take to deploy a security fix into production?How long does it typically take to deploy a security fix into production?
If a security vulnerability is discovered, update the software with the fix as soon as possible. Have processes,
tools, and approvals in place to roll out the fix quickly.
Threat modeling
Applications and services
Application classification
Regulatory compliance
Secure application configuration and dependencies
12/16/2022 3 minutes to read • Edit Online
Key points
Configuration security
Security of an application that is hosted in Azure is a shared responsibility between you as the application owner
and Azure. For IaaS, you're responsible for configurations related to VM, operating system, and components
installed on it. For PaaS, you're responsible for the security of the application service configurations and making
sure that the dependencies used by the application are also secure.
Don't store secrets in source code or configuration files. Instead, keep them in a secure store, such as Azure
App Configuration or Azure Key Vault.
Don't expose detailed error information when handling application exceptions.
Don't expose platform-specific information.
Store application configuration outside of the application code to update it separately and to have tighter
access control.
Restrict access to Azure resources that don't meet the security requirements.
Validate the security of any open-source code added to your application.
Update frameworks and libraries as part of the application lifecycle.
During the design phase, consider the way you store secrets and handle exceptions. Here are some points.
How is application configuration stored and how does the application access it?How is application configuration stored and how does the application access it?
Application configuration information can be stored with the application. However, that's not a recommended
practice. Consider using a dedicated configuration management system such as Azure App Configuration or
Azure Key Vault. That way, it can be updated independently of the application code.
Applications can include secrets like database connection strings, certificate keys, and so on. Don't store secrets
in source code or configuration files. Instead, keep them in a secure store, such as Azure Key Vault. Identify
secrets in code with static code scanning tools. Add the scanning process in your continuous integration (CI)
pipeline.
For more information about secret management, reference Key and secret management.
Are errors and exceptions handled properly without exposing that information to users?Are errors and exceptions handled properly without exposing that information to users?
When handling application exceptions, make the application fail gracefully and log the error. Don't provide
detailed information related to the failure, such as call stack, SQL queries, or out of range errors. This
information can provide attackers with valuable information about the internals of the application.
Can configuration settings be changed or modified without rebuilding or redeploying theCan configuration settings be changed or modified without rebuilding or redeploying the
application?application?
Application code and configuration shouldn't share the same lifecycle to enable operational activities. These
activities include those that change and update specific configurations without developer involvement or
redeployment.
Is platform-specific information removed from ser ver-client communication?Is platform-specific information removed from ser ver-client communication?
Dependencies, frameworks, and libraries
Don't reveal information about the application platform. Such information (for example, X-Powered-By ,
X-ASPNET-VERSION ) can get exposed through HTTP banners, HTTP headers, error messages, and website footers.
Malicious actors can use this information when mapping attack vectors of the application.
Suggested actionsSuggested actions
Consider using Azure Front Door or API Management to remove platform-specific HTTP headers. Instead, use
Azure CDN to separate the hosting platform from end users. Azure API Management offers transformation
policies that allow you to modify HTTP headers and remove sensitive information.
Learn moreLearn more
Azure Front Door Rules Engine Actions
API Management documentation
Are Azure policies used to control the configuration of the solution resources?Are Azure policies used to control the configuration of the solution resources?
Use Azure Policy to deploy settings where applicable. Block resources that don't meet the proper security
requirements defined during service enablement.
What are the frameworks and libraries used by the application?What are the frameworks and libraries used by the application?
Application frameworks are frequently updated and released by the vendor or communities. It's vital to track the
frameworks and libraries used by the application including any resulting vulnerabilities they introduce. These
frameworks and libraries include custom, OSS, third party, and others. Understand and manage the
technologies the application uses, such as:
.NET Core
Spring
Node.js
Automated solutions can help with this assessment.
Consider the following best practices:
Validate the security of any open-source code added to your application. Free tools to help with this
assessment include:
OWASP Dependency-Check
NPM audit
WhiteSource Bolt
GitHub Dependabot
These tools find outdated components and update them to the latest versions.
Maintain a list of frameworks and libraries as part of the application inventory. Also, keep track of
versions in use. If vulnerabilities are published, this awareness helps to identify affected workloads.
Update frameworks and libraries as part of the application lifecycle. Prioritize critical security patches.
Are the expir y dates of SSL/TLS cer tificates monitored and are processes in place to renew them?Are the expir y dates of SSL/TLS cer tificates monitored and are processes in place to renew them?
Tracking expiry dates of SSL/TLS certificates and renewing them in due time is highly critical. Ideally, the process
should be automated, although this often depends on the CA used for the certificate. If not automated, sufficient
alerting should be applied to ensure expiry dates don't go unnoticed.
Referenced Azure services
Next steps
Community resources
Learn moreLearn more
WhiteSource Bolt
npm-audit
OWASP Dependency-Check
GitHub Dependabot
Azure Key Vault
Azure CDN
Azure Policy
Azure Front Door
Azure API Management
Applications and services
Application classification
Application threat analysis
Regulatory compliance
OWASP Dependency-Check
NPM audit
Secure deployment in Azure
12/16/2022 2 minutes to read • Edit Online
Checklist
Have teams, processes, and tools that can quickly deploy security fixes? A
DevOps
or multidisciplinary approach
is recommended. Multiple teams work together with efficient practices and tools. Essential DevOps practices
include change management of the workload through continuous integration, continuous delivery (CI/CD).
Continuous integration (CI) is an automated process where code changes trigger the building and testing of the
application. Continuous Delivery (CD) is an automated process to build, test, configure, and deploy the
application from a build to production environment.
Those processes allow you to rapidly address the security concerns without waiting for a longer planning and
testing cycle.
Building a DevOps process which includes a security discipline helps incorporate security concepts and
enhancements earlier in the application development process. An organization's ability to rapidly address
security and operational concerns increases through the combination of the Secure Development Lifecycle (SDL)
and Operations Lifecycle related to application creation, maintenance, and updates.
Many traditional IT operating models aren't compatible with the cloud, and organizations must undergo
operational and organizational transformation to deliver against enterprise migration targets. We recommend
using a DevOps approach for both application and central teams.
Have you adopted a secure DevOps approach to ensure security and feature enhancements can beHave you adopted a secure DevOps approach to ensure security and feature enhancements can be
quickly deployed?quickly deployed?
Establish a cross-functional DevOps platform team to build, manage, and maintain your workload.
Involve the security team in the planning and design of the DevOps process to integrate preventive and
detective controls for security risks.
Clearly define CI/CD roles and permissions and minimize the number of people who have access to secure
information or resources.
Configure quality gate approvals in DevOps release process.
In this section
A SSESSM EN TA SSESSM EN T DESC RIP T IONDESC RIP T ION
Do you clearly define CI/CD roles and permissionsDo you clearly define CI/CD roles and permissions
for this workload?for this workload?
Define CI/CD permissions such that only users responsible
for production releases can start the process and that only
developers can access the source code.
Are any resources provisioned or operationallyAre any resources provisioned or operationally
configured with user tools such as the Azure por talconfigured with user tools such as the Azure por tal
or via Azure CLI?or via Azure CLI?
Always use Infrastructure as code (IaC) to make even the
smallest of changes. This approach makes it easy to track
code because the provisioned infrastructure is reproducible
and reversible.
Can you roll back or for ward code quickly throughCan you roll back or for ward code quickly through
automated pipelines?automated pipelines?
Automated deployment pipelines should allow for quick roll-
forward and roll-back deployments to address critical bugs
and code updates outside of the normal deployment
lifecycle.
Azure security benchmark
Reference architecture
Next step
Related link
Learn more
Integrate scanning tools within CI/CD pipeline.
No infrastructure changes, provisioning or configuring, should be done manually outside of IaC.
Follow these questions to assess the workload at a deeper level.
The Azure Security Benchmark includes a collection of high-impact security recommendations. Use them to
secure the services and processes you use to run the workload in Azure:
The questions in this section are aligned to the Azure Security Benchmark controls.
Here are some reference architectures related to building CI/CD pipelines:
CI/CD for microservices architectures
CI/CD for microservices on Kubernetes
We recommend monitoring activities that maintain the security posture. These activities can highlight, if the
current security practices are effective or are there new requirements.
Security monitoring
Go back to the main article: Security
Secure DevOps Kit for Azure
Agile Principles in Practice
Platform automation and DevOps
PsRules for Azure
Governance considerations for secure deployment
in Azure
12/16/2022 4 minutes to read • Edit Online
Key points
Minimize access
The automated continuous integration, continuous delivery (CI/CD) processes must have built-in governance
that authorize and authenticate the identities to do the tasks within a defined scope.
Clearly define CI/CD roles and permissions.
Implement just-in-time privileged access management.
Limit long-standing write access to production environments.
Limit the scope of execution in the pipelines.
Configure quality gate approvals in DevOps release process.
Minimize the number of people who have access to secure information or resources. This strategy will reduce
the chance of a malicious actor gaining access or an authorized user inadvertently impacting a sensitive
resource. Here are some considerations:
Use the principle of least privilege when assigning roles and permissions. Only users responsible for
production releases should start the process and only developers should access the source code.
A pipeline should use one or more service principals. Ideally, they should be managed identity, and
delivered by the platform and never directly defined within a pipeline. The identity should only have the
Azure RBAC permissions necessary to do the task. All service principals should be bound to that pipeline
and not shared across pipelines.
How do you define CI/CD roles and permissions?How do you define CI/CD roles and permissions?
Azure DevOps offers built-in roles that can be assigned to individual users of groups. If built-in roles are
insufficient to define least privilege for a pipeline, consider creating custom Azure RBAC roles. Make sure
those roles align with the action and the organization's teams and responsibilities.
To support security of your pipeline operations, you can add users to a built-in security group, set
individual permissions for a user or group, or add users to pre-defined roles. You manage security for the
following objects from Azure Pipelines in the web portal, either from the user or admin context.
For more information, see Get started with permissions, access, and security groups.
For permissions, you grant or restrict permissions by setting the permission state to AllowAllow or DenyDeny,
either for a security group or an individual user. For a role, you add a user or group to the role.
Use separate pipeline identities between pre-production and production environments. If available, take
advantage of pipeline features such as Environments to encapsulate last-mile authentication external to
the executing pipeline.
If the pipeline runs infrequently and has high privileges, consider removing standing permissions for that
identity. Use just-in-time (JIT) role assignments, time-based, and approval-based role activation. This
strategy will mitigate the risks of excessive, unnecessary, or misused access permissions on crucial
Execution scope
Gated approval process
Suggested actionsSuggested actions
Learn more
Next
resources. Azure AD Privileged Identity Management supports all those modes of activation.
Review the organization's CI/CD pipeline and refine role assignment to create a clear delineation between
development and production responsibilities.
Learn moreLearn more
For more information about pipeline permission and security roles, reference Set different levels of pipeline
permissions.
Where practical, limit the scope of execution in the pipelines.
Consider creating a multi-stage pipeline. Divide the work into discrete units and that can be isolated in a
separate pipeline. Limit the identities only to the scope of the unit so that it has minimal privileges enough to do
the action. For example, you can have two units, one to deploy and another that builds source code. Only allow
the deploy unit to have access to the identity, not the build unit. If the build unit is compromised, it could start
tampering with the infrastructure.
Do you have release gate approvals configured in the DevOps release process?Do you have release gate approvals configured in the DevOps release process?
Pull Requests and code reviews serve as the first line of approvals during development cycle. Before releasing
an update to production, require a process that mandates security review and approval.
Make sure that you involve the security team in the planning, design, and DevOps process. This collaboration
will help them implement security controls, auditing, and response processes.
Are branch policies used in source control management of this workload? How are theyAre branch policies used in source control management of this workload? How are they
configured?configured?
Establish branch policies that provide an extra level of control over the code that is committed to the repository.
Lack of secure branch policy might allow poor, rogue or broken code to be checked-in and deployed. It's a
common practice to deny pushes to the main branch if the change isn't approved. For example, you can require
pull-request (PR) with code review before merging the changes by at least one reviewer, other than the change
author.
Having multiple branches is recommended where each branch has a purpose and access level. For example,
feature branches are created by developers and are open to push. Integration branch requires PR and code-
review. Production branch requires another approval from the team lead before merging.
Configure quality gate approvals in DevOps release process.
Follow the guidance in the linked articles to deploy and adopt branch strategy.
About branches and branch policies
Adopt a Git branching strategy
Release deployment control using gates
Secure infrastructure deployments
Related links
Go back to the main article: Secure deployment and testing in Azure
Infrastructure provisioning considerations in Azure
12/16/2022 2 minutes to read • Edit Online
Key points
Infrastructure as code (IaC)
Pipeline secret management
Build environments
Azure resources can be provisioned by code or user tools such as the Azure portal or via Azure CLI. It's not
recommended that resources are provisioned or configured manually. Those methods are error prone and can
lead to security gaps. Even the smallest of changes should be through code. The recommended approach is
Infrastructure as code (IaC). It's easy to track because the provisioned infrastructure can be fully reproduced and
reversed.
No infrastructure changes should be done manually outside of IaC.
Store keys and secrets outside of deployment pipeline in Azure Key Vault or in secure store for the pipeline.
Incorporate security fixes and patching to the operating system and all parts of the codebase, including
dependencies (preinstalled tools, frameworks, and libraries).
Make all operational changes and modifications through IaC. This is a key DevOps practice, and it's often used
with continuous delivery. IaC manages the infrastructure - such as networks, virtual machines, and others - with
a descriptive model, using a versioning system that is similar to what is used for source code. IaC model
generates the same environment every time it's applied. Common examples of IaC are Azure Resource Manager,
Bicep or Terraform.
IaC reduces configuration effort and automates full environment deployment (production and pre-production).
Also, IaC allows you to develop and release changes faster. All those factors enhance the security of the
workload.
For detailed information about IaC, see What is Infrastructure as code (IaC).
How are credentials, cer tificates, and other secrets used in the operations for the workloadHow are credentials, cer tificates, and other secrets used in the operations for the workload
managed during deployment?managed during deployment?
Store keys and secrets outside of deployment pipeline in a managed key store, such as Azure Key Vault. Or, in a
secure store for the pipeline. When deploying application infrastructure with Azure Resource Manager, Bicep or
Terraform, the process might generate credentials and keys. Store them in a managed key store and make sure
the deployed resources reference the store. Do not hard-code credentials.
Secret scanning tools like GitHub Secret Scanner can be used to scan for existing hard-coded credentials. Add
the scanning process in your continuous integration (CI) pipeline to prevent new hard-coded credentials from
being added.
Does the organization apply security controls (IP firewall restrictions, update management) toDoes the organization apply security controls (IP firewall restrictions, update management) to
self-hosted build agents for this workload?self-hosted build agents for this workload?
Custom build agents add management complexity and can become an attack vector. Build machine credentials
must be stored securely and the file system needs to be cleaned of any temporary build artifacts regularly.
Learn more
Next step
Related links
Network isolation can be achieved by only allowing outgoing traffic from the build agent, because it's using the
pull model of communication with Azure DevOps.
As part of the operational lifecycle, incorporate security fixes and patching to the operating system and all parts
of the codebase, including dependencies (preinstalled tools, frameworks, and libraries).
Apply security controls to self-hosted build agents in the same manner as with other Azure IaaS VMs. These
should be minimalistic environments as a way to reduce the attack surface.
Azure Pipelines agents
I'm running a firewall and my code is in Azure Repos. What URLs does the agent need to communicate with?
Secure code deployments
Go back to the main article: Secure deployment and testing in Azure
Code deployments
12/16/2022 4 minutes to read • Edit Online
Key points
Rollback and roll-forward
IMPORTANTIMPORTANT
The automated build and release pipelines should update a workload to a new version seamlessly without
breaking dependencies. Augment the automation with processes that allow high priority fixes to get deployed
quickly.
Organizations should leverage existing guidance and automation when securing applications in the cloud, rather
than starting from zero. Using resources and lessons learned by external organizations that are early adopters
of these models can accelerate the improvement of an organizations security posture with less expenditure of
effort and resources.
Involve the security team in the planning and design of the DevOps process to integrate preventive and
detective controls for security risks.
Design automated deployment pipelines that allow for quick roll-forward and rollback deployments to
address critical bugs and code updates outside of the normal deployment lifecycle.
Integrate code scanning tools within CI/CD pipeline.
If something goes wrong, the pipeline should roll back to a previous working version. N-1 and N+1 refer to
rollback and roll-forward versions. Automated deployment pipelines should allow for quick roll-forward and
rollback deployments to address critical bugs and code updates outside of the normal deployment lifecycle.
Can N-1 or N+1 versions be deployed via automated pipelines where N is current deploymentCan N-1 or N+1 versions be deployed via automated pipelines where N is current deployment
version in production?version in production?
Because security updates are a high priority, design a pipeline that supports regular updates and critical security
fixes.
A release is typically associated with approval processes with multiple sign-offs, quality gates, and so on. If the
workload deployment is small with minimal approvals, you can usually use the same process and pipeline to
release a security fix.
An approval process that is complex and takes a significant amount of time can delay a fix. Consider building an
emergency process to accelerate high priority fixes. The process might be business and, or communication
process between teams. Another way is to build a pipeline that might not include all the gated approvals, but
should be able to push out the fix quickly. The pipeline should allow for quick roll-forward and rollback
deployments that address security fixes, critical bugs, and code updates outside of the regular deployment life
cycle.
Deploying a security fix is a priority, but it shouldn't be at the cost of introducing a regression or bug. When designing an
emergency pipeline, carefully consider which automated tests can be bypassed. Evaluate the value of each test against the
execution time. For example, unit tests usually complete quickly. Integration or end-to-end tests can run for a long time.
Involve the security team in the planning and design of the DevOps process. Your automated pipeline design
Suggested actionSuggested action
Credential scanning
Suggested actions
Learn more
Community links
should have the flexibility to support both regular and emergency deployments. This is important to support the
rapid and responsible application of both security fixes and other urgent, important fixes.
Implement an automated deployment process with support for rollback scenarios via Azure App Services
deployment slots.
Learn moreLearn more
Set up staging environments in Azure App Service
Credentials, keys, and certificates grant access to the data or service used by the workload. Storing credentials in
code poses a significant security risk. Ensure that static code scanning tools are an integrated part of the
continuous integration (CI) process.
Are code scanning tools an integrated par t of the continuous integration (CI) process for thisAre code scanning tools an integrated par t of the continuous integration (CI) process for this
workload?workload?
To prevent credentials from being stored in the source code or configuration files, integrate code scanning tools
within the CI/CD pipeline:
During design time, use code analyzers to prevent credentials from getting pushed to the source code
repository. For example, .NET Compiler Platform (Roslyn) Analyzers inspect your C# or Visual Basic code.
During the build process, use pipeline add-ons to catch credentials in the source code. Some options include
GitHub Advanced Security and OWASP source code analysis tools.
Scan all dependencies, such as third-party libraries and framework components, as part of the CI process.
Investigate vulnerable components that are flagged by the tool. Combine this task with other code scanning
tasks that inspect code churn, test results, and coverage.
Use a combination of dynamic application security testing (DAST) and static application security testing
(SAST). DAST tests the application while its in use. SAST scans the source code and detects vulnerabilities
based on its design or implementation. Some technology options are provided by OWASP. For more
information, see SAST Tools and Vulnerability Scanning Tools.
Use scanning tools that are specialized in technologies used by the workload. For example, if the workload is
containerized, run container-aware scanning tools to detect risks in the container registry, before use, and
during use.
Incorporate Secure DevOps on Azure toolkit and the guidance published by the Organization for Web App
Security Project (OWASP), or an equivalent guiding organization.
Follow DevOps security guidance
Getting started with Credential Scanner (CredScan)
OWASP source code analysis tools
GitHub Advanced Security
Vulnerability Scanning Tools
Go back to the main article: Secure deployment and testing in Azure
Security monitoring and remediation in Azure
12/16/2022 2 minutes to read • Edit Online
Checklist
Azure security benchmark
Reference architecture
Next step
Regularly monitor resources to maintain the security posture and detect vulnerabilities. Detection can take the
form of reacting to an alert of suspicious activity or proactively hunting for anomalous events in the enterprise
activity logs. vigilantly responding to anomalies and alerts to prevent security assurance decay, and designing
for defense in depth and least privilege strategies.
How are you monitoring security-related events in this workload?How are you monitoring security-related events in this workload?
Use native tools in Azure to monitor the workload resources and the infrastructure in which it runs.
Consider investing in a Security Operations Center (SOC), or SecOps team and incident response plan.
Monitor traffic, access requests, and application communication between segments.
Discover and remediate common risks to improve secure score in Microsoft Defender for Cloud.
Use an industry standard benchmark to evaluate the security posture by learning from external
organizations.
Send logs and alerts to a central security log management for analysis.
Perform regular internal and external compliance audits, including regulatory compliance attestations.
Regularly test your security design and implementation using test cases based on real-world attacks.
The Azure Security Benchmark includes a collection of high-impact security recommendations. Use them to
secure the services and processes you use to run the workload in Azure:
The questions in this section are aligned to these controls:
Azure Security Benchmarks Logging and threat detection.
Azure Security Benchmarks Incident response.
Posture and Vulnerability Management
Hybrid Security Monitoring using Microsoft Defender for Cloud and Microsoft Sentinel
This reference architecture illustrates how to use Microsoft Defender for Cloud and Microsoft Sentinel to
monitor the security configuration and telemetry of on-premises and Azure operating system workloads.
Azure security solutions for AWS
This article provides AWS identity architects, administrators, and security analysts with immediate
insights and detailed guidance for deploying several Microsoft security solutions.
We recommend applying as many best practices as early as possible, and then working to retrofit any gaps over
Related link
time as you mature your security program.
Go back to the main article: Security
Azure security monitoring tools
12/16/2022 5 minutes to read • Edit Online
Tools
SERVIC ESERVIC E USE C A SEUSE C A SE
Microsoft Defender for CloudMicrosoft Defender for Cloud Strengthens the security posture of your data centers, and
provides advanced threat protection across your workloads
in the cloud (whether they're in Azure or not) and on-
premises. Get a unified view into the infrastructure and
resources provisioned for the workload.
Microsoft SentinelMicrosoft Sentinel Use the native security information event management
(SIEM) and security orchestration automated response
(SOAR) solution on Azure. Receive intelligent security
analytics and threat intelligence across the enterprise.
Azure DDoS ProtectionAzure DDoS Protection Defend against distributed denial of service (DDoS) attacks.
Azure Rights Management (RMS)Azure Rights Management (RMS) Protect files and emails across multiple devices.
Microsoft Purview Information ProtectionMicrosoft Purview Information Protection Secure email, documents, and sensitive data that you share
outside your company.
Azure Governance VisualizerAzure Governance Visualizer Gain granular insight into policies, Azure role-based access
control (Azure RBAC), Azure Blueprints, subscriptions, and
more.
PSRule for AzurePSRule for Azure Scans Azure Infrastructure as Code (IaC) artifacts for issues
across Azure Well-Architected pillars.
Microsoft Defender for Cloud
The
leverage native control
security principle tells us to use native controls built over third-party solutions.
Native reduce the effort required to integrate external security tooling and update those integrations over time.
Azure provides several monitoring tools that observe the operations and detect anomalous behavior. These
tools can detect threats at different levels and report issues. Addressing the issues early in the operational
lifecycle will strengthen your overall security posture.
Enable Microsoft Defender for Cloud at the subscription level to monitor all resource provisioned in that scope.
At no additional cost, it provides continuous observability into resources, reports issues, and recommends fixes.
By regularly reviewing and fixing issues, you can improve the security posture, detect threats early, prevent
breaches.
Beyond just observability, Defender for Cloud offers an advanced mode through its integration with MicrosoftMicrosoft
Defender for CloudDefender for Cloud. When these plans are enabled, built-in policies, custom policies, and initiatives protect
resources and block malicious actors. You can also monitor compliance with regulatory standards - such as
NIST, Azure CIS, Azure Security Benchmark. For pricing details, see Defender for Cloud pricing.
Microsoft Sentinel
Azure DDoS Protection
Azure Rights Management (RMS)
Microsoft Purview Information Protection
Your organization might run workloads on multiple cloud platforms, and, or across cloud and on-premises, or
managed by various teams within the organization. Having a centralized view of all data is recommended. To get
that view you need security information event management (SIEM) and security orchestration automated
response (SOAR) solutions. These solutions connect to all security sources, monitor them, and analyze the
correlated data.
Microsoft Sentinel and is a native control that combines SIEM and SOAR capabilities. It analyzes events and logs
from various connected sources. Based on the data sources and their alerts, Sentinel creates incidents, performs
threat analysis for early detection. Through intelligent analytics and queries, you can be proactive with hunting
activities. In case of incidents, you can automate workflows. Also, with workbook templates you can quickly gain
insights through visualization.
A Distributed Denial of Service (DDoS) attack attempts to exhaust an application's resources, making the
application unavailable to legitimate users. DDoS attacks can be targeted at any endpoint that is publicly
reachable through the internet.
Every property in Azure is protected by Azure's infrastructure DDoS (Basic) Protection at no additional cost. The
scale and capacity of the globally deployed Azure network provides defense against common network-layer
attacks through always-on traffic monitoring and real-time mitigation. DDoS Infrastructure Protection requires
no user configuration or application changes. DDoS Infrastructure Protection helps protect all Azure services,
including PaaS services like Azure DNS.
Azure DDoS Protection provides enhanced DDoS mitigation features to defend against DDoS attacks. It's
automatically tuned to help protect your specific Azure resources in a virtual network. Protection is simple to
enable on any new or existing virtual network, and it requires no application or resource changes. It has several
advantages over the basic service, including logging, alerting, telemetry, SLA guarantee, and cost protection.
Azure DDoS Network Protection is designed for services that are deployed in a virtual network. For other
services, the default DDoS Infrastructure Protection service applies. To learn more about supported
architectures, see DDoS Protection reference architectures.
Your business may encounter challenges with protecting documents and emails. For example, file protection,
collaboration, and sharing may be issues. You also might be experiencing problems regarding platform support
or infrastructure.
Azure Rights Management (RMS) is a cloud-based protection service. RMS uses encryption, identity, and
authorization policies to help secure files and emails across devices, including phones, tablets, and PCs.
To learn more about how RMS can address these issues, see Business problems solved by Azure Rights
Management.
The
data classification
process categorizes data by sensitivity and business impact in order to identify risks.
When data is classified, you can manage it in ways that protect sensitive or important data from theft or loss.
With proper
file protection
, you can analyze data flows to gain insight into your business, detect risky behaviors
and take corrective measures, track access to documents, and more. The protection technology in AIP uses
encryption, identity, and authorization policies. Protection stays with the documents and emails, independently
of the location, regardless of whether they're inside or outside your organization, networks, file servers, and
Azure Governance Visualizer
PSRule for Azure
Next
Related links
applications
Azure Information Protection (AIP) is part of Microsoft Purview Information Protection solution, and extends the
labeling and classification functionality provided by Microsoft 365. For more information, see this article about
classification.
Azure Governance Visualizer is a PowerShell script that iterates through an Azure tenant's management group
hierarchy down to the subscription level. You can run the script either for your Tenant Root Group or any other
Management Group. It captures data from the most relevant Azure governance capabilities such as Azure Policy,
Azure role-based access control (Azure RBAC), and Azure Blueprints. From the collected data, the visualizer
shows your hierarchy map, creates a tenant summary, and builds granular scope insights about your
management groups and subscriptions.
The visualizer provides a holistic overview of your technical Azure Governance implementation by connecting
the dots.
PSRule for Azure is a set of tests and documentation to help you configure Azure solutions. These tests allow
you to check your Azure Template or Bicep Infrastructure as Code (IaC) before deployment to Azure. PSRule for
Azure includes tests that check how IaC is written and how Azure resources are configured.
Monitor workload resources in Microsoft Defender for Cloud
For information on the Microsoft Defender for Cloud tools, see Strengthen security posture.
For frequently asked questions on Microsoft Defender for Cloud, see FAQ - General Questions.
For information on the Microsoft Sentinel tools that will help to meet these requirements, see What is Microsoft
Sentinel?
For types of DDoS attacks that DDoS Protection mitigates as well as more features, see Azure DDoS Protection
overview.
Monitor Azure resources in Microsoft Defender for
Cloud
12/16/2022 10 minutes to read • Edit Online
Key points
General best practices
IaaS and PaaS securityIaaS and PaaS security
Virtual machines
Most cloud architecture have compute, networking, data, and identity components and each require different
monitoring mechanisms. Even Azure services have individual monitoring needs. For instance, to monitor Azure
Functions you want to enable Azure Application Insights.
Microsoft Defender for Cloud has many plans that monitor the security posture of machines, networks, storage
and data services, and applications to discover potential security issues. Common issues include internet
connected VMs, or missing security updates, missing endpoint protection or encryption, deviations from
baseline security configurations, missing Web Application Firewall (WAF), and more.
Enable Microsoft Defender for Cloud as a defense-in-depth measure. Use resource-specific Defender for
Cloud features such as Microsoft Defender for servers, Microsoft Defender for Endpoint, Microsoft Defender
for Storage.
Observe container hygiene through container aware tools and regular scanning.
Review all network flow logs through network watcher. See diagnostic logs in Microsoft Defender for Cloud.
Integrate all logs in a central SIEM solution to analyze and detect suspicious behavior.
Monitor identity-related risk events in Azure AD reporting amd Azure Active Directory Identity Protection.
Identifying common security activities will significantly reduce the overall risk.
Monitor suspicious activities from administrative accounts.
Monitor the location from where Azure resources are being managed.
Monitor attempts to access deactivated credentials.
Use automated tools to monitor network resource configurations and detect changes.
For more information, see Azure security baseline for Azure Monitor.
In an IaaS model, you can host the workload on Azure infrastructure. Azure provides security assurances that
maintain isolation and timely security updates to the infrastructure. For greater control, you host the entire IaaS
solution on-premises or in a hosted data center and are responsible for security. You must implement security
on the host, virtual machine, network, and storage. For instance if you have your own VNet, consider enabling
Azure Private Link over Azure Monitor so you can access this over a private endpoint.
In PaaS, you have shared responsibility with Azure in protecting the data.
If you're running your own Windows and Linux virtual machines, use Microsoft Defender for Cloud. Take
advantage of the free services to check for missing OS patches, security misconfiguration, and basic network
security. Enabling Microsoft Defender for Cloud is highly recommended because you get features that provide
adaptive application controls, file integrity monitoring (FIM), and others.
NOTENOTE
Remove direct internet connectivityRemove direct internet connectivity
Containers
For example, a common risk is the virtual machines don't have vulnerability scanning solutions that check for
threats. Microsoft Defender for Cloud reports those machines. You can remediate in Microsoft Defender for
Cloud by deploying a scanning solution. You can use the built-in vulnerability scanner for virtual machines. You
don't need a license. Instead, you can bring your license for supported partner solutions.
Vulnerability assessments are also available for container images, and SQL servers.
Attackers constantly scan public cloud IP ranges for open management ports, which can lead to attacks such as
common passwords and known unpatched vulnerabilities. JIT (Just In Time) access allows you to lock down the
inbound traffic to the virtual machines while providing easy access to connect to machines when needed.
Defender for Cloud identifies which machines should have JIT applied.
With Microsoft Defender for Cloud, you also get Microsoft Defender for Endpoint. This provides investigative
tools Endpoint Detection and Response (EDR) that helps in threat detection and analysis.
Microsoft Defender for servers also watches the network to and from virtual machines. If you are using network
security groups to control access to the virtual machines and the rules are overpermissive, Defender for Cloud
will flag them. Adaptive network hardening provides recommendations to further harden the NSG rules.
For a full list of features, see Feature coverage for machines.
Make sure policies and processes require restricting and monitoring direct internet connectivity by virtual
machines.
For Azure, you can enforce policies by:
Enterprise-wide prevention:Enterprise-wide prevention: Prevent inadvertent exposure by following the permissions and roles
described in the reference model.
Ensures that network traffic is routed through approved egress points by default.
Exceptions (such as adding a public IP address to a resource) must go through a centralized group
that evaluates exception requests and makes sure appropriate controls are applied.
Identify and remediateIdentify and remediate exposed virtual machines by using the Microsoft Defender for Cloud network
visualization to quickly identify internet exposed resources.
Restrict management por tsRestrict management por ts (RDP, SSH) using Just in Time access in Microsoft Defender for Cloud.
One way of managing VMs in the virtual network is by using Azure Bastion. This service allows you to log into
VMs in the virtual network through SSH or remote desktop protocol (RDP) without exposing the VMs directly to
the internet. To see a reference architecture that uses Bastion, see Network DMZ between Azure and an on-
premises datacenter.
Containerized workloads have an extra layer of abstraction and orchestration. That complexity requires specific
security measures that protect against common container attacks such as supply chain attacks.
Use container registries that are validated for security. Images in public registries might contain malware
or unwanted applications that activate when the container is running. Build a process for developers to
request and rapidly get security validation of new containers and images. The process should validate
against your security standards. This includes applying security updates, scanning for unwanted code
such as backdoors and illicit crypto coin miners, scanning for security vulnerabilities, and application of
Network
TIPTIP
secure development practices.
A popular process pattern is the quarantine pattern. This pattern allows you to get your images on a
dedicated container registry and subject them to security or compliance scrutiny applicable for your
organization. After it's validated, they can then be released from quarantine and promoted to being
available.
Microsoft Defender for Cloud identifies unmanaged containers hosted on IaaS Linux VMs, or other Linux
machines running Docker containers.
Make sure you use images from authorized registries. You can enforce this restriction through Azure
Policy. For example, for an Azure Kubernetes Service (AKS) cluster, have policies that restrict the cluster to
only pull images from Azure Container Registry (ACR) that is deployed as part of the architecture.
Here are the resources for the preceding example:
GitHub: Azure Kubernetes Service (AKS) Secure Baseline Reference Implementation.
The design considerations are described in Baseline architecture for an AKS cluster.
Regularly scan containers for known risks in the container registry, before use, and during use.
Use security monitoring tools that are container aware to monitor for anomalous behavior and enable
investigation of incidents.
Microsoft Defender for container registries are designed to protect AKS clusters, container hosts (virtual
machines running Docker), and ACR registries. When enabled, the images that are pulled or pushed to
registries are subject to vulnerability scans.
For more information, see these articles:
Container security in Defender for Cloud
How do you monitor and diagnose conditions of the network?How do you monitor and diagnose conditions of the network?
As an initial step, enable and review all logs (including raw traffic) from your network devices.
Security group logs – flow logs and diagnostic logs
Azure Network Watcher
Take advantage of the packet capture feature to set alerts and gain access to real-time performance information
at the packet level.
Packet capture tracks traffic in and out of virtual machines. It gives you the capability to run proactive captures
based on defined network anomalies including information about network intrusions.
For an example, see Scenario: Get alerts when VM is sending you more TCP segments than usual.
Then, focus on observability of specific services by reviewing the diagnostic logs. For example, for Azure
Application Gateway with integrated WAF, see Web application firewall logs. Microsoft Defender for Cloud
analyzes diagnostic logs on virtual networks, gateways, network security groups and determines if the controls
are secure enough. For example:
Is your virtual machine exposed to public internet. If so, do you have tight rules on network security groups
Identity
Review identity risksReview identity risks
Regularly review critical accessRegularly review critical access
Discover & replace insecure protocolsDiscover & replace insecure protocols
to protect the machine?
Are the network security groups (NSG) and rules that control access to the virtual machines overly
permissive?
Are the storage accounts receiving traffic over secure connections?
Follow the recommendations provided by Defender for Cloud. For more information, see Networking
recommendations. Use Azure Firewall logs and metrics for observability into operational and audit logs.
Integrate all logs into a security information and event management (SIEM) service, such as Microsoft Sentinel.
The SIEM solutions support ingestion of large amounts of information and can analyze large datasets quickly.
Based on those insights, you can:
Set alerts or block traffic crossing segmentation boundaries.
Identify anomalies.
Tune the intake to significantly reduce the false positive alerts.
Monitor identity-related risk events using adaptive machine learning algorithms, heuristics quickly before the
attacker can gain deeper access into the system.
Most security incidents take place after an attacker initially gains access using a stolen identity. Even if the
identity has low privileges, the attacker can use it to traverse laterally and gain access to more privileged
identities. This way the attacker can control access to the target data or systems.
Does the organization actively monitor identity-related risk events related to potentiallyDoes the organization actively monitor identity-related risk events related to potentially
compromised identities?compromised identities?
Monitor identity-related risk events on potentially compromised identities and remediate those risks. Review the
reported risk events in these ways:
Azure AD reporting. For information, see users at risk security report and the risky sign-ins security report.
Use the reporting capabilities of Azure Active Directory Identity Protection.
Use the Identity Protection risk events API to get programmatic access to security detections by using
Microsoft Graph. See riskDetection and riskyUser APIs.
Azure AD uses adaptive machine learning algorithms, heuristics, and known compromised credentials
(username/password pairs) to detect suspicious actions that are related to your user accounts. These
username/password pairs come from monitoring public and dark web and by working with security
researchers, law enforcement, security teams at Microsoft, and others.
Remediate risks by manually addressing each reported account or by setting up a user risk policy to require a
password change for high risk events.
Regularly review roles that are assigned privileges with a business-critical impact.
Set up a recurring review pattern to ensure that accounts are removed from permissions as roles change. You
can conduct the review manually or through an automated process by using tools such as Azure AD access
reviews.
Discover and disable the use of legacy insecure protocols SMBv1, LM/NTLMv1, wDigest, Unsigned LDAP Binds,
and Weak ciphers in Kerberos.
Learn moreLearn more
Connected tenantsConnected tenants
Suggested actionsSuggested actions
CI/CD pipelines
Next steps
Applications should use the SHA-2 family of hash algorithms (SHA-256, SHA-384, SHA-512). Use of weaker
algorithms, like SHA-1 and MD5, should be avoided.
Authentication protocols are a critical foundation of nearly all security assurances. These older versions can be
exploited by attackers with access to your network and are often used extensively on legacy systems on
Infrastructure as a Service (IaaS).
Here are ways to reduce your risk:
Discover protocol usage by reviewing logs with Microsoft Sentinel's Insecure Protocol Dashboard or
third-party tools.
Restrict or Disable use of these protocols by following guidance for SMB, NTLM, WDigest.
Use only secure hash algorithms (SHA-2 family).
We recommend implementing changes using pilot or other testing methods to mitigate risk of operational
interruption.
For more information about hash algorithms, see Hash and Signature Algorithms.
Does your security team have visibility into all existing subscriptions and cloud environments?Does your security team have visibility into all existing subscriptions and cloud environments?
How do they discover new ones?How do they discover new ones?
Make sure the security team is aware of all enrollments and associated subscriptions connected to your existing
environment through ExpressRoute or Site-Site VPN. Monitor them as part of the overall enterprise.
Assess if organizational policies and applicable regulatory requirements are followed for the connected tenants.
This applies to all Azure environments that connect to your production environment network.
The organizations' cloud infrastructure should be well documented, with security team access to all resources
required for monitoring and insight. Conduct frequent scans of the cloud-connected assets to ensure no
additional subscriptions or tenants have been added outside of organizational controls. Regularly review
Microsoft guidance to ensure security team access best practices are consulted and followed.
Ensure all Azure environments that connect to your production environment and network apply your
organization's policy, and IT governance controls for security.
You can discover existing connected tenants using a tool provided by Microsoft. Guidance on permissions you
may assign to security is in the Assign privileges for managing the environment section.
DevOps practices are for change management of the workload through continuous integration, continuous
delivery (CI/CD). Make sure you add security validation in the pipelines. Follow the guidance described in Learn
how to add continuous security validation to your CI/CD pipeline.
View logs and alerts
Security logs and alerts using Azure services
12/16/2022 5 minutes to read • Edit Online
Key points
Use native services
Audit logging
Logs provide insight into the operations of a workload, the infrastructure, network communications, and so on.
When suspicious activity is detected, use alerts as a way of detecting potential threats. As part of your defense-
in-depth strategy and continuous monitoring, respond to the alerts to prevent security assurance from decaying
over time.
Configure central security log management.
Enable audit logging for Azure resources.
Collect security logs from operating systems.
Configure security log storage retention.
Enable alerts for anomalous activities.
Azure MonitorAzure Monitor provides observability across your entire environment. You automatically get platform
metrics, activity logs, and diagnostics logs from most of your Azure resources with no configuration. The
activity logs provide detailed diagnostic and auditing information.
Microsoft Defender for CloudMicrosoft Defender for Cloud generates notifications as security alerts by collecting, analyzings, and
integrating log data from your Azure resources and the network. Alerts are available when you enable
the Microsoft Defender plans. This will add to the overall cost.
Microsoft SentinelMicrosoft Sentinel is a security information event management (SIEM) and security orchestration
automated response (SOAR) solution. It's a single solution for alert detection, threat visibility, proactive
hunting, and threat response.
Ideally use a combination of the preceding services to get a full view. For example, use Azure Monitor to collect
information about the operating system running on Azure compute. If you're running your own compute, use
Microsoft Defender for Cloud.
An important aspect of monitoring is tracking operations. For example, you want to know who created, updated,
deleted a resource. Or, get resource-specific information such as when an image was pulled from Azure
Container Registry. That information is crucial for a Security Operations (SecOps) team in detecting the presence
of adversaries, reacting to an alert of suspicious activity, or proactively hunting for anomalous events. They are
also useful for security auditing and compliance and offline analysis.
On Azure, that information is emitted as platform logs by the resources and the platform on which they run.
They are tracked by Azure Resource Manager as and when subscription-level events occur. Each resource emits
logs specific to the service.
Consider storing your data for audit purposes or statistical analysis. You can retain data in your log analytics
workspace and specify the data type. This example sets the retention for SecurityEvents to 730 days:
PUT /subscriptions/00000000-0000-0000-0000-
00000000000/resourceGroups/MyResourceGroupName/providers/Microsoft.OperationalInsights/workspaces/MyWorkspac
eName/Tables/SecurityEvent?api-version=2017-04-26-preview {"properties": {"retentionInDays": 730 } }
Alerts
Centralize logs and alerts
NOTENOTE
Retaining data in this manner can reduce your costs for data retention over time. For information about the type
of data you can retain, see security data types.
Another way is to send the logs to a storage account.
Security alerts are notifications that are generated when anomalous activity is detected on the resources used
by the workload or the platform.
With the Microsoft Defender plans, Microsoft Defender for Cloud analyzes log data and shows a list of alerts
that's based on logs collected from resources within a scope. Alerts include context information such as severity,
status, activity time. Defender for Cloud also provides a correlated view called incidentsincidents. Use this data to
analyze what actions the attacker took, and what resources were affected. Have strategies to react to alerts as
soon as they are generated. An option is to handle alerts in Azure Functions.
Use the data to support these activities:
Remediation of threats.
Investigation of an incident.
Proactive hunting activities.
For more information, see Security alerts and incidents.
Organizations typically follow one of three models when deploying logs: centralized, decentralized, or hybrid.
The choice depends on organizational structures. For example, if each team owns their resource group, log data
is segregated per resource. While access control to that data might be easy to set up, it's difficult to correlate
logs. This might be challenging for the SecOps team who need a holistic view to analyze the data.
Consider a central view of log and data, when applicable. Some advantages include:
The resources in the workload can share a common log workspace reducing duplication.
Single point of observability with all log data makes it easier consume data for hunting activities, querying,
and statistical evaluation.
The integrated data can be fed into modern machine learning analytics platforms support ingestion of large
amounts of information and can analyze large datasets quickly. In addition, these solutions can be tuned to
significantly reduce the false positive alerts.
You can collect logs and alerts from various sources centrally in a Log Analytics Workspace, storage account,
and Event Hubs. You can then review and query log data efficiently. In Azure Monitor, use the diagnosticdiagnostic
settingsetting on resources to route specific logs that are important for the organization. Logs vary by resource type.
In Microsoft Defender for Cloud, take advantage of the continuous export feature to route alerts.
Platform logs are not available indefinitely. You'll need to keep them so that you can review them later for auditing
purposes or offline analysis. Use Azure Storage Accounts for long-term/archival storage. In Azure Monitor, specify a
retention period when you enable diagnostic setting for your resources.
Next
Related links
Another way to see all data in a single view is to integrate logs and alerts into Security Information and Event
Management (SIEM) solutions, such as Microsoft Sentinel. Other popular third-party choices are Splunk,
QRadar, ArcSight. Microsoft Defender for Cloud and Azure Monitor supports all of those solutions.
Integrating more data can enrich alerts with additional context. However, collection is not detection. Make sure a
high volume of low value data doesn't flow into those solutions.
If you don't have a reasonable expectation that the data will provide value, deprioritize integration of these
events. For example, high volume of firewall denies events may create noise without actual actions.
That choice will help in rapid response and remediation by filtering out false positives, and elevate true positives,
and so on. Also it will lower SIEM cost, false positives, and increase performance.
Other ways of log integration may involve a hybrid model that mixes centralized and decentralized (distributed
among teams) approaches. For details, see Important considerations for an access control strategy.
Responding to alerts is an essential way to prevent security assurance decay, and designing for defense-in depth
and least privilege strategies.
Remediate security risks
For more information, see these articles:
How to get started with Azure Monitor and third-party SIEM integration
How to collect platform logs and metrics with Azure Monitor
Export alerts
Understand Microsoft Defender for Cloud data collection
Go back to the main article: Monitor
Remediate security risks in Microsoft Defender for
Cloud
12/16/2022 4 minutes to read • Edit Online
Key points
Track Secure Score
Security controls must remain effective against attackers who continuously improve their ways to attack the
digital assets of an enterprise. Use the principle of drive continuous improvement to make sure systems are
regularly evaluated and improved.
Start by remediating common security risks. These risks are usually from well-established attack vectors. This
will forces attackers to acquire use advanced and more expensive attack methods.
Processes for handling incidents and post-incident activities, such as lessons learned and evidence retention.
Remediate the common risks identified by Microsoft Defender for Cloud.
Track remediation progress with secure score and comparing against historical results.
Address alerts and take action with remediation steps.
Do you review and remediate common risks in the workload boundar y?Do you review and remediate common risks in the workload boundar y?
Monitor the security posture of VMs, networks, storage, data services, and various other contributing factors.
Secure Score in Microsoft Defender for Cloud shows a composite score that represents the security posture at
the subscription level.
Do you have a process for formally reviewing Secure Score on Microsoft Defender for Cloud?Do you have a process for formally reviewing Secure Score on Microsoft Defender for Cloud?
As you review the results and apply recommendations, track the progress and prioritize ongoing investments.
Higher score indicates a better security posture.
Set up a regular cadence (typically monthly) to review the secure score and plan initiatives with specific
improvement goals.
C AT E GO RYC AT E GO RY RESOU RC ESR ESO URC ES RESP O N SIB L E T EA MRESP O N SIB L E T EA M
Compute and applications App Services Application Development/Security
Team(s)
Containers Application Development and/or
Infrastructure/IT Operations
Virtual machines, scale sets, compute IT/Infrastructure Operations
Data and Storage SQL/Redis/Data Lake Analytics/Data
Lake Store
Database Team
Storage Accounts Storage/Infrastructure Team
Identity and access management Subscriptions Identity Team(s)
Key Vault Information/Data Security Team
Networking Resources Networking Team and Network
Security Team
IoT Security IoT Resources IoT Operations Team
Review and remediate recommendations
Assign stakeholders for monitoring and improving the score. Gamify the activity if possible to increase
engagement and focus from the responsible teams.
As a technical workload owner, work with your organization's dedicated team that monitors Secure Score. In the
DevOps model, workload teams may be responsible for their own resources. Typically, these teams are
responsible.
Security posture management team
Vulnerability management or governance, risk, and compliance team
Architecture team
Resource-specific technical teams responsible for improving secure score, as shown in this table.
The Azure Secure Score sample shows how to get your Azure Secure Score for a subscription by calling the
Microsoft Defender for Cloud REST API. The API methods provide the flexibility to query the data and build your
own reporting mechanism of your secure scores over time.
Microsoft Defender for Cloud monitors the security status of machines, networks, storage and data services, and
applications to discover potential security issues. Enable this capability at no additional cost to detect vulnerable
virtual machines connected to internet, missing security updates, missing endpoint protection or encryption,
deviations from baseline security configurations, missing Web Application Firewall (WAF), and more.
View the recommendations to see the potential security issues and apply the Microsoft Defender for Cloud
recommendations to execute technical remediations.
Policy remediation
The recommendations are grouped by controls. Each recommendation has detailed information such as severity,
affected resources, and quick fixes where applicable. Start with high severity items.
Defender for Cloud has the capability of exporting results at configured intervals. Compare the results with
previous sets to verify that issues have been remediated.
For more information, see Continuous export.
A common approach for maintaining the security posture is through Azure Policy.
Along with organizational policies, a workload owner can use scoped policies for governance purposes, such as
check misconfiguration, prohibit certain resource types, and others. The resources are evaluated against rules to
identify unhealthy resources that are risky. Post evaluation, certain actions are required as remediation. The
actions can be enforced through Azure Policy effects.
For example, a workload runs in an Azure Kubernetes Service (AKS) cluster. The business goals require the
workload to run in a highly restrictive environment. As a workload owner, you want the resource group to
contain AKS clusters that are private. You can enforce that requirement with the DenyDeny effect. It will prevent a
cluster from being created if that rule isn't satisfied.
That sort of isolation can be maintained through policies at a higher level such as the subscription level or even
management groups.
Another use case is that it can be automatically remediated by deploying related resources. For example, the
organization wants all storage resources in a subscription to send logs to a common Log Analytics workspace. If
a storage account doesn't pass the policy, a deployment is automatically started as remediation. That
remediation can be enforced through DeployIfNotExistDeployIfNotExist. There are some considerations.
There's a significant wait before the resource is updated and the deployment starts. In the preceding example,
there won't be logs captured during that wait time. Avoid using this effect for resources that cannot tolerate a
delay.
The resource deployed because of DeployIfNotExistDeployIfNotExist are created by a separate identity than that of the
identity that did the original deployment. That identity must have high enough privileges to make the
required changes.
Manage alerts
Next
Related links
Microsoft Defender for Cloud shows a list of alerts that's based on logs collected from resources within a scope.
Alerts include context information such as severity, status, activity time. Most alerts have MITRE ATT&CK®
tactics that can help you understand the kill chain intent. Select the alert and investigate the problem with
detailed information.
Finally take action. That action can be to fix the resources that are out of compliance with actionable remediation
steps. You can also suppress alerts that are false positives.
Make sure that you are integrating critical security alerts into Security Information and Event Management
(SIEM), Security Orchestration Automated Response (SOAR) without introducing a high volume of low value
data. Microsoft Defender for Cloud can stream alerts to Microsoft Sentinel. You can also use a third-party
solution by using Microsoft Graph Security API.
Azure security operations
Go back to the main article: Monitor
Security audits
12/16/2022 5 minutes to read • Edit Online
Key points
Evaluate using standard benchmarks
Suggested actionSuggested action
Audit regulatory compliance
To make sure that the security posture doesn't degrade over time, have regular auditing that checks compliance
with organizational standards. Enable, acquire, and store audit logs for Azure services.
Improve secure score in Microsoft Defender for Cloud.
Use an industry standard benchmark to evaluate your organizations current security posture.
Perform regular internal and external compliance audits, including regulatory compliance attestations.
Review the policy requirements.
Use Azure Governance Visualizer for a holistic overview of your technical Azure Governance implementation.
Do you evaluate the security posture of this workload using standard benchmarks?Do you evaluate the security posture of this workload using standard benchmarks?
Use an industry standard benchmark to evaluate your organizations current security posture.
Benchmarking allows you to improve your security program by learning from external organizations. It lets you
know how your current security state compares to that of other organizations, providing both external
validation for successful elements of your current system and identifying gaps that serve as opportunities to
enrich your team's overall security strategy. Even if your security program isn't tied to a specific benchmark or
regulatory standard, you will benefit from understanding the documented ideal states by those outside and
inside of your industry.
As an example, the Center for Internet Security (CIS) has created security benchmarks for Azure that map to the
CIS Control Framework. Another reference example is the MITRE ATT&CK™ framework that defines the various
adversary tactics and techniques based on real-world observations. These external references control mappings
and help you to understand any gaps between your current strategy, what you have, and what other experts
have in the industry.
Develop an Azure security benchmarking strategy aligned to industry standards.
As people in the organization and on the project change, it is crucial to make sure that only the right people
have access to the application infrastructure. Auditing and reviewing access control reduces the attack vector to
the application. Azure control plane depends on Azure AD and access reviews are often centrally performed as
part of internal, or external audit activities.
Make sure that the security team is auditing the environment to report on compliance with the security policy of
the organization. Security teams may also enforce compliance with these policies.
Compliance is important for several reasons. Aside from signifying levels of standards, like ISO 27001 and
others, noncompliance with regulatory guidelines may bring sanctions and penalties. Regularly review roles that
have high privileges. Set up a recurring review pattern to ensure that accounts are removed from permissions
as roles change. Consider auditing at least twice a year.
Suggested actionSuggested action
Learn moreLearn more
Use Microsoft Defender for Cloud to continuously assess and monitor your compliance score.
Assess your regulatory compliance
Have you established a monitoring and assessment solution for compliance?Have you established a monitoring and assessment solution for compliance?
Continuously assess and monitor the compliance status of your workload. Microsoft Defender for Cloud
provides a regulatory compliance dashboard that shows the current security state of workload against controls
mandated by the standard governments or industry organizations and Azure Security Benchmark. Keep your
resources in compliance with those standards. Defender for Cloud tracks many standards. You can set the
standards by management groups in a subscription.
Consider using Azure Access Reviews or Entitlement Management to periodically review access to the workload.
Consider using Azure Access Reviews or Entitlement Management to periodically review access to the workload.
For Azure, use Azure Policy to create and manage policies that enforce compliance. Azure Policies are built on
the Azure Resource Manager capabilities. Azure Policy can also be assigned through Azure Blueprints.
For more information, see Tutorial: Create and manage policies to enforce compliance.
Here's an example management group that is tracking compliance to the Payment Card Industry (PCI) standard.
Review critical access
Check policy compliance
Do you have internal and external audits for this workload?Do you have internal and external audits for this workload?
A workload should be audited internally, external, or both with the goal of discovering security gaps. Make sure
that the gaps are addressed through updates.
Auditing is important for workloads that follow a standard. Aside from signifying levels of standards,
noncompliance with regulatory guidelines may bring sanctions and penalties.
Perform regulatory compliance attestation. Attestations are done by an independent party that examines if the
workload is in compliance with a standard.
Is access to the control plane and data plane of the application periodically reviewed?Is access to the control plane and data plane of the application periodically reviewed?
Regularly review roles that have high privileges. Set up a recurring review pattern to ensure that accounts are
removed from permissions as roles change. Consider auditing at least twice a year.
As people in the organization and on the project change, make sure that only the right people, have access to
the application infrastructure and just enough privileges to complete the task. Auditing and reviewing the access
control reduces the attack vector to the application.
Azure control plane depends on Azure AD. You can conduct the review manually or through an automated
process by using tools such as Azure AD access reviews. These reviews are often centrally performed often as
part of internal or external audit activities.
Make sure that the security team is auditing the environment to report on compliance with the security policy of
the organization. Security teams may also enforce compliance with these policies.
Capture critical data
Next steps
Related links
Enforce and audit industry, government, and internal corporate security policies. Policy monitoring checks that
initial configurations are correct and that it continues to be compliant over time.
For Azure, use Azure Policy to create and manage policies that enforce compliance. Azure Policies are built on
the Azure Resource Manager capabilities. Azure Policy can also be assigned through Azure Blueprints. For more
information, see Tutorial: Create and manage policies to enforce compliance.
Azure Governance Visualizer captures data from the most relevant Azure governance capabilities such as Azure
Policy, Azure role-based access control (Azure RBAC), and Azure Blueprints. The visualizer PowerShell script
iterates through an Azure tenant's management group hierarchy down to the subscription level. From the
collected data, the visualizer shows your hierarchy map, creates a tenant summary, and builds granular scope
insights about your management groups and subscriptions.
Remediate security risks in Microsoft Defender for Cloud
Secure score in Microsoft Defender for Cloud allows you view all the security vulnerabilities into a single
score.
Tutorial: Improve your regulatory compliance describes a step-by-step process to evaluate regulatory
requirements in Microsoft Defender for Cloud.
Azure security test practices
12/16/2022 3 minutes to read • Edit Online
Key points
Penetration testing (pentesting)
Regularly test your security design and implementation, as part the organization's operations. That integration
will make sure the security assurances are effective and maintained as per the security standards set by the
organization.
A well-architected workload should be resilient to attacks. It should recover rapidly from disruption and yet
provide the security assurances of confidentiality, integrity, and availability. Invest in simulated attacks as tests
that can indicate gaps. Based on the results of the results you can harden the defense and limit a real attacker's
lateral movement within your environment.
Simulated tests can also give you data to plan risk mitigation. Applications that are already in production should
use data from real-world attacks. New or updated applications with new features, should rely on structured
models for detecting risks early, such as threat modeling.
Define test cases that are realistic and based on real-world attacks.
Identify and catalog lowest cost methods for preventing and detecting attacks.
Use penetration testing as a one-time attack to validate security defenses.
Simulate attacks through red teams for long-term persistent attacks.
Measure and reduce the potential attack surface that attackers target for exploitation for resources within the
environment.
Ensure proper follow-up to educate users about the various means that an attacker may use.
Do you perform penetration testing on the workload?Do you perform penetration testing on the workload?
It's recommended that you simulate a one-time attack to detect vulnerabilities. Pentesting is a popular
methodology to validate the security defense of a system. The practitioners are security experts who are not
part of the organization's IT or application teams. So, they look at the system in a way that malicious actors
scope an attack surface. The goal is to find security gaps by gathering information, analyzing vulnerabilities, and
reporting.
Penetration tests provide a point-in-time validation of security defenses. Red teams can help provide ongoing
visibility and assurance that your defenses work as designed, potentially testing across different levels within
your workload(s). Red team programs can be used to simulate either one time, or persistent threats against an
organization to validate defenses that have been put in place to protect organizational resources.
Microsoft recommends penetration testing and red team exercises to validate security defenses for your
workload.
Penetration Testing Execution Standard (PTES) provides guidelines about common scenarios and the activities
required to establish a baseline.
Azure uses shared infrastructure to host your assets and assets belonging to other customers. In a pentesting
exercise, the practitioners may need access to sensitive data of the entire organization. Follow the rules of
engagement to make sure that access and the intent is not misused. For guidance about planning and executing
simulated attacks, see Penetration Testing Rules of Engagement.
Learn moreLearn more
Simulate attacks
Related links
Azure Penetration Testing
Penetration Testing
The way users interact with a system is critical in planning your defense. The risks are even higher for critical
impact accounts because they have elevated permissions and can cause more damage.
Do you carr y out simulated attacks on users of this workload?Do you carr y out simulated attacks on users of this workload?
Simulate a persistent threat actor targeting your environment through a red team. Here are some advantages:
Periodic checks. The workload will get checked through a realistic attack to make sure the defense is up to
date and effective.
Educational purposes. Based on the learnings, upgrade the knowledge and skill level. This will help the users
understand the various means that an attacker may use to compromise accounts.
A popular choice to simulate realistic attack scenarios is Office 365 Attack Simulator.
Is personal information detected and removed/obfuscated automatically?Is personal information detected and removed/obfuscated automatically?
Be cautious about using sensitive application information. Don't store personal information such as contact
information, payment information, and so on, in any application logs. Apply protective measures, such as
obfuscation. Machine learning tools can help with this measure. For more information, see PII Detection
cognitive skill.
Threat modeling is a structured process to identify the possible attack vectors. Based on the results, prioritize the
risk mitigate efforts. For more information, see Application threat analysis.
For more information on current attacks, see the Microsoft Security Intelligence (SIR) report.
Microsoft Cloud Red Teaming
Go back to the main article: Monitor
Security operations in Azure
12/16/2022 4 minutes to read • Edit Online
Tools
The responsibility of the security operation team (also known as Security Operations Center (SOC), or SecOps) is
to rapidly detect, prioritize, and triage potential attacks. These operations help eliminate false positives and focus
on real attacks, reducing the mean time to remediate real incidents. Central SecOps team monitors security-
related telemetry data and investigates security breaches. It's important that any communication, investigation,
and hunting activities are aligned with the application team.
Here are some general best practices for conducting security operations:
Follow the NIST Cybersecurity Framework functions as part of operations.
DetectDetect the presence of adversaries in the system.
RespondRespond by quickly investigating whether it's an actual attack or a false alarm.
RecoverRecover and restore the confidentiality, integrity, and availability of the workload during and after
an attack.
For information about the framework, see NIST Cybersecurity Framework.
Acknowledge an alert quickly. A detected adversary must not be ignored while defenders are triaging
false positives.
Reduce the time to remediate a detected adversary. Reduce their opportunity time to conduct and attack
and reach sensitive systems.
Prioritize security investments into systems that have high intrinsic value. For example, administrator
accounts.
Proactively hunt for adversaries as your system matures. This effort will reduce the time that a higher
skilled adversary can operate in the environment. For example, skilled enough to evade reactive alerts.
For information about the metrics that the Microsoft's SOC team uses , see Microsoft SOC.
Here are some Azure tools that a SOC team can use investigate and remediate incidents.
TOO LTOO L P U RP O SEP URP O SE
Microsoft SentinelMicrosoft Sentinel Centralized Security Information and Event Management
(SIEM) to get enterprise-wide visibility into logs.
Microsoft Defender for CloudMicrosoft Defender for Cloud Alert generation. Use security playbook in response to an
alert.
Azure MonitorAzure Monitor Event logs from application and Azure services.
Azure Network Security Group (NSG)Azure Network Security Group (NSG) Visibility into network activities.
Azure Information ProtectionAzure Information Protection Secure email, documents, and sensitive data that you share
outside your company.
Assign incident notification contact
Incident response
Investigation practices should use native tools with deep knowledge of the asset type such as an Endpoint
detection and response (EDR) solution, Identity tools, and Microsoft Sentinel.
For more information about monitoring tools, see Security monitoring tools in Azure.
Security alerts need to reach the right people in your organization. Establish a designated point of contact to
receive Azure incident notifications from Microsoft, and, or Azure Defender for Cloud. In most cases, such
notifications indicate that your resource is compromised or attacking another customer. This enables your
security operations team to rapidly respond to potential security risks and remediate them.
This enables your security operations team to rapidly respond to potential security risks and remediate them.
Ensure administrator contact information in the Azure enrollment portal includes contact information that will
notify security operations directly or rapidly through an internal process.
Learn moreLearn more
To learn more about establishing a designated point of contact to receive Azure incident notifications from
Microsoft, reference the following articles:
Update notification settings
Configure email notifications for security alerts
Is the organization effectively monitoring security posture across workloads, with a central SecOps team
monitoring security-related telemetry data and investigating possible security breaches? Communication,
investigation, and hunting activities need to be aligned with the application team(s).
Are operational processes for incident response defined and tested?Are operational processes for incident response defined and tested?
Actions executed during an incident and response investigation could impact application availability or
performance. Define these processes and align them with the responsible (and in most cases central) SecOps
team. The impact of such an investigation on the application has to be analyzed.
Are there tools to help incident responders quickly understand the application and components toAre there tools to help incident responders quickly understand the application and components to
do an investigation?do an investigation?
Incident responders are part of a central SecOps team and need to understand security insights of an
Suggested actionSuggested action
Hybrid enterprise view
Leverage native detections and controls
Suggested actionsSuggested actions
Learn moreLearn more
Next steps
application. Security playbook in Microsoft Sentinel can help to understand the security concepts and cover the
typical investigation activities.
Consider using Microsoft Defender for Cloud to monitor security-related events and get alerted automatically.
Learn moreLearn more
Security alerts and incidents in Microsoft Defender for Cloud
Security operations tooling and processes should be designed for attacks on cloud and on-premises assets.
Attackers don't restrict their actions to a particular environment when targeting an organization. They attack
resources on any platform using any method available. They can pivot between cloud and on-premises
resources using identity or other means. This enterprise-wide view will enable SecOps to rapidly detect,
respond, and recover from attacks, reducing organizational risk.
Use Azure security detections and controls instead of creating custom features for viewing and analyzing event
logs. Azure services are updated with new features and have the ability to detect false positive with a higher
accuracy rate.
Integrating logs from the network devices, and even raw network traffic itself, will provide greater visibility into
potential security threats flowing over the wire.
To get a unified view across the enterprise, feed the logs collected through native detections (such as Azure
Monitor) into a centralized security information and event management (SIEM) solution like Microsoft Sentinel.
Avoid using generalized log analysis tools and queries. Within Azure Monitor, create Log Analytics Workspace to
store logs. You can also review logs and perform queries on log data. These tools can offer high-quality alerts.
The modern machine learning-based analytics platforms support ingestion of extremely large amounts of
information and can analyze large datasets very quickly. In addition, these solutions can be tuned to significantly
reduce false positive alerts.
Examples of network logs that provide visibility include:
Security group logs - flow logs and diagnostic logs
Web application firewall logs
Virtual network taps and their equivalents
Azure Network Watcher
Integrate network device log information in advanced SIEM solutions or other analytics platforms.
Enable enhanced network visibility
Security health modeling
Security tools
Security logs and audits
Check for identity, network, data risks
Tradeoffs for security
12/16/2022 4 minutes to read • Edit Online
Security vs Reliability
Security vs Cost Optimization
Security provides confidentiality, integrity, and availability assurances of an organization's data and systems.
When designing a system you can almost never compromise on security controls. When you enhance security
of an architecture there might be impact on reliability, performance efficiency, cost, and operational excellence.
This article describes some of those considerations.
Reliable applications are resilient and highly available. Every architectural component factors in achieving your
requirements for reliability. Workload security is often woven into many layers of the workload's architecture,
operations, and runtime requirements; and may come with their own implications on resiliency or availability.
For example, identity providers and authorization services are critical dependencies to consider. This includes the
identity service (Microsoft Identity Platform) and any libraries that help facilitate the use of those services. At
some points in the architecture, a failure at an identity layer is terminal. At other points, reliability can be still
achieved through strategies such as caching, taking advantage of TTLs on access tokens, and others. OAuth2
claims validation can happen mostly disconnected from the claims provider. However, not all authorization can
be achieved that way. In those situations reliability may be traded in favor of complete security.
Many workloads may quickly degrade in functionality with the loss of critical security controls. Consider
evaluating at each component of your architecture to detect that condition.
Other security considerations that might impact reliability are:
Poor or manual certifications or key rotation practices. Failure to do those tasks can lead to reliability issues.
Expired service principals. For example, a deployment pipeline that used a service principal might fail at a
later date, if that principal's access key has expired. Using managed identities helps keep reliability high while
also maintaining least privileges on that identity.
High availability is often achieved by redundancy (actively or passively), and security controls also need to
align with the failover mechanism. For example, failing over from one storage account to another for
reliability may impact how the client's active authorization session is handled. Using managed identity with
Azure AD integration for storage access can result in a higher reliability because the client doesn't have to
manage SAS tokens when switching to the new storage account.
Increasing security of the workload will almost always lead to higher cost. There are some ways to optimize cost.
Maximum security may not always be practical for all environments. Evaluate the security requirements
in pre-production and production environments. Are services such as Azure DDoS Protection, Microsoft
Sentinel, Dedicated HSMs, Microsoft Defender for Cloud needed in pre-production? Is inner loop
mocking of security controls sufficient? If resources are not publicly accessible, can you dial down some
controls for cost savings? Always make those choices,
if and only if
, the lowered environment still meets
the business requirements.
Premium security features can increase the cost. There are areas you can reduce cost by using native
security features. For example, avoid implementing custom roles if you can use built-in roles.
Every security control has an opportunity to impact workflows, and workflows that involve people can be
expensive. A security control that stops work from being done should be evaluated as necessary or
Security vs Operational Excellence
Related link
unnecessarily redundant. Total cost of ownership (TCO) includes operational costs for developers,
operators, IT SecOps and onerous security protocols. Reach agreement about where "less" can br
"sufficient" to optimize costs.
TCO includes the time needed do tasks. Optimizing that time will optimize cost. Using platform features
can lower TCO and enhance the security posture. Instead of training an engineer to manually review logs
and correlate access patterns, use intelligence in services, such as Microsoft Defender for Cloud or
Sentinel alerts.
Operational Excellence involves understanding business and workload behavior, and applying appropriate
amounts of automation, and observability into those processes.
Release management
Your organization defines its quality gates (manual or automated) as part of its safe deployment
practices. Evaluate whether each gate is required or optional, taking into consideration whether it's
valuable to perform, at the added cost of complexity (and time) for that release. Adding security checks
into the process makes the checks more valuable at that point in the process than any gains (usually
simplicity and time) by not having them in place.
Organizational policies
Teams can often benefit from collaborating and using cross-functional skills in all stages of workload life
cycle, from development all the way through production. However, organizational policy or regulatory
concerns may prevent such wide-reaching access. Consider, where possible, isolating those systems that
do need heightened access policies from those that do not. Avoid applying "one size fits all" model to all
components in the system. It's easier to optimize operations in systems that allow more unregulated
access. For systems that demand regulated access, complexity is expected to increase leading to higher
cost.
Supportability considerations
The most "serviceable" architectures are the ones that are the most transparent to everyone involved, and
those often have the least number of security controls. Adding security controls to your architecture like
filtered telemetry feeds, redacted logs, runtime system access restrictions, and so on can all impact the
supportability of a solution. Adding security controls often require adding compensating or
compromised solutions for observability into the platform.
Go back to the main article: Security
Overview of the cost optimization pillar
12/16/2022 2 minutes to read • Edit Online
Topics
C O ST TOP ICC O ST TO P IC DESC RIP T IONDESC RIP T ION
Capture cost requirements Start your planning with a careful enumeration of
requirements. Make sure the needs of the stakeholders are
addressed. For strong alignment with business goals, those
areas must be defined by the stakeholders and shouldn't be
collected from a vendor.
Cost of resources in Azure regions Cost of an Azure service can vary between locations based
on demand and local infrastructure costs.
Governance Understand how governance can assist with cost
management. This work will benefit your ongoing cost
review process and will offer a level of protection for new
resources.
Estimate the initial cost It's difficult to attribute costs before deploying a workload to
the cloud. If you use methods for on-premises estimation or
directly map on-premises assets to cloud resources, estimate
will be inaccurate.
The cost optimization pillar provides principles for balancing business goals with budget justification to create a
cost-effective workload while avoiding capital-intensive solutions. Cost optimization is about looking at ways to
reduce unnecessary expenses and improve operational efficiencies.
Use the pay-as-you-go strategy for your architecture, and invest in scaling out, rather than delivering a large
investment-first version. Consider opportunity costs in your architecture, and the balance between first mover
advantage versus
fast follow
. Use the cost calculators to estimate the initial cost and operational costs. Finally,
establish policies, budgets, and controls that set cost limits for your solution.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, reference
the Microsoft Azure Well-Architected Review.
We recommend exploring the following videos to dive deeper into Azure cost optimization:
The Microsoft Azure Well-Architected Framework includes the following topics in the cost optimization pillar:
PaaS Look for areas in the architecture where it may be natural to
incorporate platform-as-a-service (PaaS) options. These
options include caching, queues, and data storage. PaaS
reduces time and cost of managing servers, storage,
networking, and other application infrastructure.
Consumption A common way to estimate cost is by considering workloads
on a peak throughput. Under consistently high usage,
consumption-based pricing can be less efficient for
estimating baseline costs when compared to the equivalent
provisioned pricing.
Provision cloud resources Deployment of workload cloud resources can optimize cost.
Monitor cost Azure Cost Management has an alert feature. Alerts are
generated when consumption reaches a threshold.
Optimize cost Monitor and optimize the workload by using the right
resources and sizes.
Tradeoffs for costs As you design the workload, consider tradeoffs between cost
optimization and other aspects of the design, such as
security, scalability, resilience, and operability.
C O ST TOP ICC O ST TO P IC DESC RIP T IONDESC RIP T ION
Next section
Read the cost optimization principles to guide you in your overall strategy.
Principles
Cost optimization design principles
12/16/2022 2 minutes to read • Edit Online
Choose the correct resources
Set up budgets and maintain cost constraints
Dynamically allocate and de-allocate resources
A cost-effective workload is driven by business goals and the return on investment (ROI) while staying within a
given budget. The principles of cost optimization are a series of important considerations that can help achieve
both business objectives and cost justification.
To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the
Microsoft Azure Well-Architected Review.
The following design principles provide:
Context for questions
Why a certain aspect is important
How an aspect is applicable to Cost optimization
These critical design principles are used as lenses to assess the Cost optimization of an application deployed on
Azure. These lenses provide a framework for the application assessment questions.
Choose the right resources that are aligned with business goals and can handle the performance of the
workload.
When onboarding new workloads, explore the possibility of modernization and cloud native offerings where
possible. It's typically more cost effective to use Platform as a Service (PaaS) or Software as a Service (SaaS), as
opposed to Infrastructure as a Service (IaaS).
An inappropriate or misconfigured service can affect cost. For example, building a multi-region service when the
service levels don't require high-availability or geo-redundancy will increase cost without any reasonable
business justification.
Every design choice has cost implications. Consider the budget constraints set by the company, before choosing:
An architectural pattern
Azure service
A price model for the service
As part of design, identify acceptable boundaries on:
Scale
Redundancy
Performance against cost
After estimating the initial cost, set budgets and alerts at different scopes to measure the cost. One cost driver
can be unrestricted resources. These resources typically need to scale and consume more cost to meet demand.
To match performance needs, dynamically allocate and de-allocate resources.
Optimize workloads, aim for scalable costs
Continuously monitor and optimize cost management
Next stepNext step
Identify idle or underutilized resources through Azure Advisor or other tools, and:
Reconfigure
Consolidate
(or) Shut down
A key benefit of the cloud is the ability to scale dynamically. The workload cost should scale linearly with
demand. You can save cost through automatic scaling.
Recommendations:
Consider the usage metrics and performance to determine the number of instances.
Choose smaller instances for a highly variable workload.
Scale out, rather than up, to get the required level of performance. This choice will enable you to make your
cost calculations and estimates granular.
The cost management process should be:
Rigorous
Iterative
A key principle of responsible cloud optimization
To provision resources dynamically and to scale with demand:
Conduct regular cost reviews
Measure capacity needs
Forecast capacity needs
If you're just starting this process, review enable success during a cloud adoption journey.
Design checklist
Checklist
-
Design for cost
12/16/2022 2 minutes to read • Edit Online
Cost model
Architecture
Use this checklist when designing a cost-effective workload.
Capture clear requirementsCapture clear requirements. Gather detailed information about the business workflow, regulatory,
security, and availability.
Capture requirements
Estimate the initial costEstimate the initial cost. Use tools such as Azure pricing calculator to assess cost of the services you
plan to use in the workload. Use Azure Migrate and Microsoft Azure Total Cost of Ownership (TCO)
Calculator for migration projects. Accurately reflect the cost associated with right storage type. Add
hidden costs, such as networking cost for large data download.
Estimate the initial cost
Define policies for the cost constraints defined by the organizationDefine policies for the cost constraints defined by the organization. Understand the constraints
and define acceptable boundaries for quality pillars of scale, availability, security.
Consider the cost constraints
Identify shared assetsIdentify shared assets. Evaluate the business areas where you can use shared resources. Review the
billing meters build chargeback reports per consumer to identify metered costs for shared cloud services.
Create a structured view of the organization in the cloud
Plan a governance strategyPlan a governance strategy. Plan for cost controls through Azure Policy. Use resource tags so that
custom cost report can be created. Define budgets and alerts to send notifications when certain
thresholds are reached.
Governance
Check the cost of resources in various Azure geographic regionsCheck the cost of resources in various Azure geographic regions. Check your egress and ingress
cost, within regions and across regions. Only deploy to multiple regions if your service levels require it
for either availability or geo-distribution.
Azure regions
Choose a subscription that is appropriate for the workloadChoose a subscription that is appropriate for the workload. Azure Dev/Test subscription types are
suitable for experimental or non-production workloads and have lower prices on some Azure services
such as specific VM sizes. If you can commit to one or three years, consider subscriptions and offer types
that support Azure Reservations or savings plans.
Subscription and offer type
Choose the right resources to handle the performanceChoose the right resources to handle the performance. Understand the usage meters and the
number of meters for each resource in the workload. Consider tradeoffs over time. For example, cheaper
virtual machines may initially indicate a lower cost but can be more expensive over time to maintain a
certain performance level. Be clear about the billing model of third-party services.
Azure resources
Use cost alerts to monitor usage and spending
Compare consumption-based pricing with pre-provisioned costCompare consumption-based pricing with pre-provisioned cost. Establish baseline cost by
considering the peaks and the frequency of peaks when analyzing performance.
Consumption and fixed cost models
Use proof-of-concept deploymentsUse proof-of-concept deployments. The Azure Architecture Center has many reference architectures
and implementations that can serve as a starting point. The Azure Tech Community has architecture and
services forums.
Choose managed ser vices when possibleChoose managed ser vices when possible. With PaaS and SaaS options, the cost of running and
maintaining the infrastructure is included in the service price.
Managed services
Develop a cost model
12/16/2022 6 minutes to read • Edit Online
Cost modeling
is an exercise where you create logical groups of cloud resources that are mapped to the
organization's hierarchy and then estimate costs for those groups. The goal of cost modeling is to estimate the
overall cost of the organization in the cloud.
1. Understand how your responsibilities align with your organizationUnderstand how your responsibilities align with your organization
Map the organization's needs to logical groupings offered by cloud services. This way the business
leaders of the company get a clear view of the cloud services and how they're controlled.
2. Capture clear requirementsCapture clear requirements
Start your planning with a careful enumeration of requirements. From the high-level requirements,
narrow down each requirement before starting on the design of the solution.
3. Consider the cost constraintsConsider the cost constraints
Evaluate the budget constraints on each business unit and determine the governance policies in Azure to
lower cost by reducing wastage, overprovisioning, or expensive provisioning of resources.
4. Consider tradeoffsConsider tradeoffs
Optimal design doesn't equate to a lowest-cost design.
As requirements are prioritized, cost can be adjusted. Expect a series of tradeoffs in the areas that you
want to optimize, such as security, scalability, resilience, and operability. If the cost to address the
challenges in those areas is high, stakeholders will look for alternate options to reduce cost. There might
be risky choices made in favor or a cheaper solution.
5. Derive functional requirements from high-level goalsDerive functional requirements from high-level goals
Break down the high-level goals into functional requirements for the components of the solution. Each
requirement must be based on realistic metrics to estimate the actual cost of the workload.
6. Consider the billing model for Azure resourcesConsider the billing model for Azure resources
Azure services are offered with consumption-based price where you're charged for only what you use.
There's also options for fixed price where you're charged for provisioned resources
Most services are priced based on units of size, amount of data, or operations. Understand the meters
that are used to track usage. For more information, see Azure resources.
At the end of this exercise, you should have identified the lower and upper limits on cost and set budgets for the
workload. Azure allows you to create and manage budgets in Azure Cost Management. For information, see
Quickstart: Create a budget with an Azure Resource Manager template.
H AVE FRE QU EN T A ND C L EA R C O M M UN IC AT IO N W IT H T H E STA K EH O L DERSH AVE FRE QU EN T A ND C L EA R C O M M UN IC AT IO N W IT H T H E STA K EH O L DERS
Organization structure
In the initial stages, communication between stakeholders is vital. The overall team must align on the requirements so that
overall business goals are met. If not, the entire solution might be at risk.
For instance, the development team indicates that the resilience of a monthly batch-processing job is low. They might
request the job to work as a single node without scaling capabilities. This request opposes the architect's recommendation
to automatically scale out, and route requests to worker nodes.
This type of disagreement can introduce a point of failure into the system, risking the Service Level Agreement, and cause
an increase in operational cost.
Map the organization's needs to logical groupings offered by cloud services. This way the business leaders of
the company get a clear view of the cloud services and how they're controlled.
1. Understand how your workload fits into cost optimization across the portfolio of cloud workloads.
If you are working on a workload that fits into a broader portfolio of workloads, see CAF to get started
guide to document foundational decisions. That guide will help your team capture the broader portfolio
view of business units, resources organizations, responsibilities, and a view of the long term portfolio.
If cost optimization is being executed by a central team across the portfolio, see CAF to get started
managing enterprise costs.
2. Encourage a culture of democratized cost optimization decisions.
As a workload owner, you can have a measurable impact on cost optimization. There are other roles in
the organization which can help improve cost management activities. To help embed the pillar of cost
optimization into your organization beyond your workload team, see the CAF article: Build a cost-
conscious organization.
3. Reduce costs through shared cloud services and landing zones.
If your workload has dependencies on shared assets like Active Directory, Network connectivity, security
devices, or other services that are also used by other workloads, encourage your central IT organization
to provide those services through a centrally managed landing zone to reduce duplicate costs. See the
CAF article: get started with centralized design & configuration to get started with the development of
landing zones.
4. Calculate the ROI by understanding what is included in each grouping and what isn't.
Which aspects of the hierarchy are covered by cloud ser vices?Which aspects of the hierarchy are covered by cloud ser vices?
Azure pricing model is based on expenses incurred for the service. Expenses include hardware, software,
development, operations, security, and data center space to name a few. Evaluate the cost benefit of
shifting away from owned technology infrastructure to leased technology solutions.
5. Identify scenarios where you can use shared cloud services to lower cost.
Can some ser vices be shared by other consumers?Can some ser vices be shared by other consumers?
Identify areas where a service or an application environment can be shared with other business units.
Identify resources that can be used as shared services and review their billing meters. Examples include a
virtual network and its hybrid connectivity or a shared app service environment (ASE). If the meter data
isn't able to be split across consumers, decide on custom solutions to allocate proportional costs. Move
Next stepNext step
Cost constraints
Next stepNext step
Functional requirements
Use T
-
shirt sizes for choosing SKUsUse T
-
shirt sizes for choosing SKUs
Next stepsNext steps
shared services to dedicated resources for consumers for cost reporting.
Build chargeback reports per consumer to identify metered costs for shared cloud services. Aim
for granular reports to understand which workload is consuming what amount of the shared cloud
service.
Capture cost requirements
Here are some considerations for determining the governance policies that can assist with cost management.
What are the budget constraints set by the company for each business unit?
What are policies for the budget alert levels and associated actions?
Identify acceptable boundaries for scale, redundancy, and performance against cost.
Assess the limits for security. Don't compromise on security. Premium cloud security features can drive the
cost up. It's not necessary to overinvest. Instead use the cost profile to drive a realistic threat profile.
Identify unrestricted resources. These resources typically need to scale and consume more cost to meet
demand.
Consider tradeoffs
Break down high-level goals into functional requirements. For each of those requirements, define metrics to
calculate cost estimates accurately. Cloud services are priced based on performance, features, and locations.
When defining these metrics, identify acceptable boundaries of performance, scale, resilience, and security. Start
by expressing your goals in number of business transactions over time, breaking them down to fine-grain
requirements.
What resources are needed for a single transaction, and how many transactions are done perWhat resources are needed for a single transaction, and how many transactions are done per
second, day, year?second, day, year?
Start with a fixed cost of operations and a rough estimate of transaction volume to work out a cost-per-
transaction to establish a baseline. Consider the difference between cost models based on fixed, static
provisioning of services, more variable costs based upon autoscaling such as serverless technologies.
When choosing options for services, start with an abstract representation of size. For example, if you choose a T-
shirt size approach, small, medium, large sizes, can represent an on-demand virtual machine instead of picking
specific virtual machines or managed disks SKU sizes.
Abstract sizes give you an idea of the expected performance and cost requirements. It sets the tone for various
consumption units that are used to measure compute resources for performance. Also, it helps in understanding
the on-demand consumption model of the services.
For more information, see Estimate the initial cost.
Estimate the initial cost
Capture cost requirements
12/16/2022 4 minutes to read • Edit Online
Landing zone
Security
Start your planning with a careful enumeration of requirements. Make sure the needs of the stakeholders are
addressed. For strong alignment with business goals, those areas must be defined by the stakeholders and
shouldn't be collected from a vendor.
Capture requirements at these levels:
Business workflow
Compliance and regulatory
Security
Availability
What do you aim to achieve by building your architecture in the cloud?What do you aim to achieve by building your architecture in the cloud?
Here are some common answers.
Take advantage of features only available in the cloud, such as intelligent security systems, regions footprint,
or resiliency features.
Use the on-demand nature of the cloud to meet peak or seasonal requirements, then releasing that cost
investment when it is no longer needed.
Consolidate physical systems.
Retire on-premises infrastructure.
Reduce hardware or data center management costs.
Increase performance or processing capabilities, through services like big data and machine learning.
Meet regulatory considerations, including taking advantage of certified infrastructure.
Narrow down each requirement before you start the design of the workload. Expect the requirements to change
over time as the solution is deployed and optimized.
Consider the cost implications of the geographic region to which the landing zone is deployed.
The landing zone consists of the subscription and resource group, in which your cloud infrastructure
components exist. This zone impacts the overall cost. Consider the tradeoffs. For example, there are additional
costs for network ingress and egress for cross-zonal traffic. For more information, see Azure regions and Azure
resources.
For information about landing zone for the entire organization, see CAF: Implement landing zone best practices.
Security is one of the most important aspects of any architecture. Security measures protect the valuable data of
the organization. It provides confidentiality, integrity, and availability assurances against attacks and misuse of
the systems.
Factor in the cost of security controls, such as authentication, MFA, conditional access, information protection,
JIT/PIM, and premium Azure AD features. Those options will drive up the cost.
For security considerations, see the Security Pillar.
Business continuity
Application lifespan
Automation opportunities
Budget for staff education
Does the application have a Ser vice Level Agreement that it must meet?Does the application have a Ser vice Level Agreement that it must meet?
Factor in the cost when you create high availability and disaster recovery strategies.
Overall Service Level Agreement (SLA), Recovery Time Objective (RTO), and Recovery Point Objective (RPO)
may drive towards expensive design choices in order to support higher availability requirements. For example, a
choice might be to host the application across regions, which is costlier than single region but supports high
availability.
If your service SLAs, RTOs and RPOs allow, then consider cheaper options. For instance, pre-build automation
scripts and packages that would redeploy the disaster recovery components of the solution from the ground-up
in case a disaster occurs. Alternatively, use Azure platform-managed replication. Both options can lower cost
because fewer cloud services are pre-deployed and managed, reducing wastage.
In general, if the cost of high availability exceeds the cost of application downtime, then you could be over
engineering the high availability strategy. Conversely, if the cost of high availability is less than the cost of a
reasonable period of downtime, you may need to invest more.
Suppose the downtime costs are relatively low, you can save by using recovery from your backup and disaster
recovery processes. If the downtime is likely to cost a significant amount per hour, then invest more in the high
availability and disaster recovery of the service. It's a three-way tradeoff between cost of service provision, the
availability requirements, and the organization's response to risk.
Does your ser vice run seasonally or follow long-term patterns?Does your ser vice run seasonally or follow long-term patterns?
For long-running applications, consider using Azure Reservations if you can commit to one-year or three-year
term.
Similarly, Azure savings plans is another option to save money when you have consistent resource usage in
your subscriptions and can commit to a fixed hourly spend on compute services for a one-year or three-year
period
If your application runs intermittently, consider using consumption billing plans where they are offered on
services in your architecture. For example, use Azure Functions in a Consumption plan, so you only pay for
compute resources you use.
Is it a business requirement to have the ser vice be available 24x7?Is it a business requirement to have the ser vice be available 24x7?
You may not have a business goal to leave the service running all the time. Doing so will incur a consistent cost.
Can you save by shutting down the service or scaling it down outside normal business hours? If you can,
Azure has a rich set of APIs, SDKs, and automation technology that utilizes DevOps and traditional
automation principles. Those technologies ensure that the workload is available at an appropriate level of
scale as needed.
Repurpose some compute and data resources for other tasks that run out of regular business hours.
Reference the Compute Resource Consolidation pattern and consider containers or elastic pools for more
compute and data cost flexibility.
Standardization
Next step
Keep the technical staff up to date in cloud management skills so that the invested services are optimally used.
Consider using resources such as FastTrack for Azure and Microsoft Learn training to onboard the staff. Those
resources provide engineering investments at no cost to customers.
Identify training requirements and costs for cloud migration projects, application development, and
architecture refinement.
Invest in key areas, such as identity management, security configuration, systems monitoring, and
automation.
Give the staff access to training and relevant announcements. This way, they can be aware of new cloud
capabilities and updates.
Provide opportunities to get real-world experience of customers across the globe through conferences,
specific cloud training, and passing dedicated Microsoft Exams (AZ, MS, MB, etc.).
Ensure that your cloud environments are integrated into any IT operations processes. Those operations include
user or application access provisioning, incident response, and disaster recovery. That mapping may uncover
areas where additional cloud cost is needed.
Determine the cost constraints
Azure regions
12/16/2022 2 minutes to read • Edit Online
T E RM IN O LO GYT E RM IN O LO GY DE SC RIP T IO NDE SC RIP T ION
Azure region A set of datacenters deployed within a latency-defined
perimeter and connected through a dedicated regional low-
latency network.
Availability zone A unique physical location within a region with independent
power, network, and cooling to be tolerant to datacenter
failures through redundancy and logical isolation of services.
Billing zone A geographic collection of regions that is used for billing.
Location A region or a zone within a region. Azure has datacenters all
over the world and each datacenter is placed in a location.
Landing zone The ultimate location of your cloud solution or the landing
zone, typically consisting of logical containers such as a
subscription and resource group, in which your cloud
infrastructure components exist.
Tradeoff
Compliance and regulatory
Cost of an Azure service can vary between locations based on demand and local infrastructure costs. Consider
all these geographical areas when choosing the location of your resources to estimate costs.
The complete list of Azure geographies, regions, and locations, is shown in Azure global infrastructure.
To see availability of a product by region, see Products available by region.
Locating resources in a cheaper region should not negate the cost of network ingress and egress or by
degraded application performance because of increased latency.
An application hosted in a single region may cost less than an application hosted across regions because of
replication costs or the need for extra nodes.
Azure also offers differentiated cloud regions for specific security and compliance requirements.
Does your solution need specific levels of security and compliance?Does your solution need specific levels of security and compliance?
If your solution needs to follow certain government regulations, the cost will be higher. Otherwise you can meet
less rigid compliance, through Azure Policy, which is free.
Certain Azure regions are built specifically for high compliance and security needs. For example, with Azure
Government (USA) you're given an isolated instance of Azure. Azure Germany has datacenters that meet privacy
certifications. These specialized regions have higher cost.
Regulatory requirements can dictate restrictions on data residency. These requirements may impact your data
replication options for resiliency and redundancy.
Traffic across billing zones and regions
Cross-regional traffic and cross-zonal traffic incur additional costs.
Is the application critical enough to have the footprint of the resources cross zones and,or crossIs the application critical enough to have the footprint of the resources cross zones and,or cross
regions?regions?
Bandwidth refers to data moving in and out of Azure datacenters. Inbound data transfers (data going into Azure
datacenters) are free for most services. For outbound data transfers, (data going out of Azure datacenters) the
data transfer pricing is based on billing zones. For more information, see Bandwidth Pricing Details.
Suppose, you want to build a cost-effective solution by provisioning resources in locations that offer the lowest
prices. The dependent resources and their users are located in different parts of the world. In this case, data
transfer between locations will add cost if there are meters tracking the volume of data moving across locations.
Any savings from choosing the cheapest location could be offset by the additional cost of transferring data.
The cross-regional and cross-zone additional costs do not apply to global services, such as Azure Active
Directory.
Not all Azure services support zones and not all regions in Azure support zones.
Before choosing a location, consider how important is the application to justify the cost of having
resources cross zones and/or cross regions. For non-mission critical applications such as, developer or test,
consider keeping the solution and its dependencies in a single region or single zone to leverage the
advantages of choosing the lower-cost region.
Azure resources
12/16/2022 3 minutes to read • Edit Online
Usage meters for resources
Allocated usage for the resource
Just like on-premises equipment, there are several elements that affect monthly costs when using Azure
services.
Most services are priced based on units of size, amount of data, or operations. When you provision an Azure
resource, Azure creates metered instances for that resource. The
meters
track the resources' usage and generate
a usage record that is used to calculate your bill.
For example, you provision a virtual machine in Azure. Some meters that track its usage include: Compute
Hours, IP Address Hours, Data Transfer In, Data Transfer Out, Standard Managed Disk, Standard Managed Disk
Operations, Standard IO-Disk, Standard IO-Block Blob Read, Standard IO-Block Blob Write, Standard IO-Block
Blob Delete.
How is the usage tracked for each resource in the workload?How is the usage tracked for each resource in the workload?
For each Azure resource, have a clear understanding of the meters that track usage and the number of meters
associated with the resource tier. The meters correlate to several billable units. Those units are charged to the
account for each billing period. The rate per billable unit depends on the resource tier.
A resource tier impacts pricing because each tier offers levels of features such as performance or availability. For
example, a Standard HDD hard disk is cheaper than a Premium SSD hard disk.
Start with a lower resource tier then scale the resource up as needed. Growing a service with little to no
downtime is easier when compared to downscaling a service. Downscaling usually requires deprovisioning
or downtime. In general, choose scaling out instead of scaling up.
As part of the requirements, consider the metrics for each resource and build your alerts on baseline thresholds
for each metric. The alerts can be used to fine-tune the resources. For more information, see Respond to cost
alert.
Another way to look at pricing is the allocated usage.
Suppose, you de-allocate the virtual machine. You'll not be billed for Compute Hours, I/O reads or writes, and
other compute meters because the virtual machine is not running and has no given compute resources.
However, you'll be charged for storage costs for the disks.
Here are some considerations:
The meters and pricing vary per product and often have different pricing tiers based on the location, size, or
capacity of the resource.
Cost for associated with infrastructure is kept low with commodity hardware.
Failures cannot be prevented but the effects of failure can be minimized through design choices. The
resiliency features are factored in the price of a service and its features.
Here are some examples:
Azure DiskAzure Disk
ExpressRouteExpressRoute
Compute infrastructureCompute infrastructure
Subscription and offer type
Billing structure for services in Azure Marketplace
Start with a small size in GB for a managed disk instead of pay-per-GB model. It's cost effective because cost is
incurred on the allocated storage.
Start with a smaller bandwidth circuit and scale up as needed.
Deploy an additional smaller instance of compute alongside a smaller unit in parallel. That approach is more
cost effective in comparison to restarting an instance to scale up.
For details about how billing works, see Azure Cost Management + Billing documentation.
What is the subscription and offer type in which resources are created?What is the subscription and offer type in which resources are created?
Azure usage rates and billing periods can vary depending on the subscription and offer type. Some subscription
types also include usage allowances or lower prices. For example, Azure Dev/Test subscription offers lower
prices on Azure services such as specific VM sizes, PaaS web apps, and VM images with pre-installed software.
Visual Studio subscribers obtain as part of their benefits access to Azure subscriptions with monthly allowances.
For information about the subscription offers, see Microsoft Azure Offer Details.
As you decide on the offer type, consider the types supporting Azure reservations and savings plans.
With Azure Reservations, you prepay for reserved capacity at a discount. Reservations are suitable for
workloads that have a long-term usage pattern. Combining the offer type with reservations can
significantly lower the cost. For information about subscription and offer types that are eligible for
reservations, see Discounted subscription and offer types.
With Azure savings plans, you commit to a fixed hourly spend across all compute services and apply to
all participating compute services globally up to the hourly commitment.
Are you considering third-par ty ser vices through Azure Marketplace?Are you considering third-par ty ser vices through Azure Marketplace?
Azure Marketplace offers both the Azure products and services from third-party vendors. Different billing
structures apply to each of those categories. The billing structures can range from free, pay-as-you-go, one-time
purchase fee, or a managed offering with support and licensing monthly costs.
Governance
12/16/2022 2 minutes to read • Edit Online
Understand how centralized governance functions can support your
team
Follow organizational policies that define cost boundaries
Enforce resource tagging
Governance can assist with cost management. This work will benefit your ongoing cost review process and will
offer a level of protection for new resources.
Centralized governance can relieve some of the burden related to on-going cost monitoring and optimization.
However, that is an augmentation of the workload team's responsibilities, not a replacement. For an
understanding of how centralized cloud governance teams operate, see the Cloud Adoption Framework's
Govern methodology.
For more detailed information on cost optimization, see the section on Cost Management discipline.
For an example of the types of guardrails provided by a governance team, see the narrative for improving
the cost management discipline, which includes examples of suggested tags and policies for improving cost
governance.
If your team isn't supported by centralized governance teams, see Cloud governance function to better
understand the types of activities your team may need to consider including in each sprint.
Use policies to ensure compliance to the identified cost boundaries. Also, it eliminates the need for manual
resource approval and speeds up the total provisioning time.
Azure Policy can set rules on management groups, subscriptions, and resources groups. The policies control
clouds service resource SKU size, replication features, and locations that are allowed. Use policies to prevent
provisioning of expensive resources. Identify the built-in Azure policies that can help lower cost. For additional
control, create custom policies.
For more information, see Create management groups for resource organization and management.
Control the group who can manage resources in the subscription, see Azure built-in roles.
Set limits or quotas to prevent unexpected costs, For more information, see Azure subscription and service
limits, quotas, and constraints.
Use tags on resources and resource groups to track the incurred costs. Identify the service meters that can't be
tagged or viewed in the cost analysis tool in Azure portal.
The advantages of tagging include:
The cost can be reported to an owner, an application, a business department or a project initiative. This
feature is useful because the overall cost can span multiple resources, locations, and subscriptions.
Filter information. This filtering can be used in cost analysis tool in Azure portal allowing you to get granular
reports.
There are some limitations:
Not all Azure resources can be tagged and not all taggable resources in Azure are accounted for in the Azure
cost analysis tool.
Estimate the initial cost
12/16/2022 6 minutes to read • Edit Online
Migration workloads
It's difficult to attribute costs before deploying a workload to the cloud. If you use methods for on-premises
estimation or directly map on-premises assets to cloud resources, estimate will be inaccurate. For example, if
you build your own datacenter your costs may appear comparable to cloud. Most on-premises estimates don't
account for costs like cooling, electricity, IT and facilities labor, security, and disaster recovery.
Here are some best practices:
Use proof-of-concept deployments to help refine cost estimates.
Choose the right resources that can handle the performance of the workload. For example, cheaper virtual
machines may initially indicate a lower cost but can be more expensive eventually to maintain a certain
performance level.
Accurately reflect the cost associated with right storage type.
Add hidden costs, such as networking cost for large data download.
Quantify the cost of running your business in Azure by calculating Total Cost Ownership (TCO) and the Return
on Investment (ROI). Compare those metrics to existing on-premises equivalents.
It's difficult to attribute costs before migrating to the cloud.
Using on-premises calculation may not accurately reflect the cost of cloud resources. Here are some challenges:
On-premises TCO may not accurately account for hidden expenses. These expenses include under-utilization
of purchased hardware or network maintenance costs including labor and equipment failure.
Cloud TCO may not accurately account for a drop in the organization's operational labor hours. Cloud
provider's infrastructure, platform management services, and additional operational efficiencies are included
in the cloud service pricing. Especially true at a smaller scale, the cloud provider's services don't result in
reduction of IT labor head count.
ROI may not accurately account for new organizational benefits because of cloud capabilities. It's hard to
quantify improved collaboration, reduced time to service customers, and fast scaling with minimal or no
downtime.
ROI may not accurately account for business process re-engineering needed to fully adopt cloud benefits. In
some cases, this re-engineering may not occur at all, leaving an organization in a state where new
technology is used inefficiently.
Azure provides these tools to determine cost.
Microsoft Azure Total Cost of Ownership (TCO) Calculator to reflect all costs.
Azure pricing calculator to assess cost of the services you plan to use in your solution.
Azure Migrate to evaluate your organization's current workloads in on-premises datacenters. It suggests
Azure replacement solution, such virtual machine sizes based on your workload. It also provides a cost
estimate.
For migration projects, the TCO Calculator may assist, as it pre-populates some common cost but
allows you to modify the cost assumptions.
Example estimate for a microservices workload
ComputeCompute
Application gatewayApplication gateway
Load balancerLoad balancer
BandwidthBandwidth
External data sourceExternal data source
Let's consider this scenario as an example. We'll use the Azure Pricing calculator to estimate the initial cost
before the workload is deployed. The cost is calculated per month or for 730 hours.
In this example, we've chosen the microservices pattern. As the container orchestrator, one of the options could
be Azure Kubernetes Service (AKS) that manages a cluster of pods. We choose NGINX ingress controller
because it's well-known controller for such workloads.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
For AKS, there's no charge for cluster management.
For AKS agent nodes, there are many instance sizes and SKUs options. Our example workload is expected to
follow a long running pattern and we can commit to three years. So, an instance that is eligible for reserved
instances would be a good choice. We can lower the cost by choosing the 3-year reser ved plan3-year reser ved plan.
The workload needs two virtual machines. One is for the backend services and the other for the utility services.
The B12MS instance with 2 virtual machines is sufficient for this initial estimation. We can lower cost by
choosing reserved instances.
Estimated Total: $327.17 per month with upfront payment of $11,778.17Estimated Total: $327.17 per month with upfront payment of $11,778.17.
For this scenario, we consider the Standard_v2 TierStandard_v2 Tier of Azure Application Gateway because of the autoscaling
capabilities and performance benefits. We also choose consumption-based pricing, which is calculated by
capacity units (CU). Each capacity unit is calculated based on compute, persistent connections, or throughput.
For Standard_v2 SKU - Each compute unit can handle approximately 50 connections per second with RSA 2048-
bit key TLS certificate. For this workload, we estimate 10 capacity units.
Estimated Total: $248.64 per month.Estimated Total: $248.64 per month.
NGINX ingress controller deploys a load balancer that routes internet traffic to the ingress. Approximately 15
load balancer rules are needed. NAT rules are free. The main cost driver is the amount of data processed
inbound and outbound independent of rules. We estimate traffic of 1 TB (inbound and outbound).
Estimated Total: $96.37 per month.Estimated Total: $96.37 per month.
We estimate 2-TB outbound traffic. The first 5 GB/month are free in Zone 1 (Zone 1 includes North America,
Europe, and Australia). Between 5 GB - 10 TB /month is charged $0.087 per GB.
Estimated Total: $177.74 per monthEstimated Total: $177.74 per month
Because the schema-on read nature of the data handled by the workload, we choose Azure Cosmos DB as the
external data store. By using the Azure Cosmos DB capacity calculator, we can calculate the throughput to
reserve.
Cost variablesCost variables
For lower latency, in this scenario we enable geo-replication by using the Multi-regions writesMulti-regions writes feature.
By default Azure Cosmos DB uses one region for writes and the rest for reads.
Default choices in consistency model of SessionSession and indexing policy to AutomaticAutomatic. Automatic indexing
makes Azure Cosmos DB to index all properties in all items for flexible and efficient queries. Custom
indexing policy allows you to include/exclude properties from the index for lower write RU and storage
size. So uploading a custom indexing policy can make you reduce costs.
Total data store isn't a significant cost driver, here it's set to 500 GB.
The throughput is variable indicating peaks. The percentage of time at peak is set to 10%.
The item size is an average of 90k. By using the capacity calculator you can upload sample json files with
your document's data structure, the average document's size, the number of reads/writes per second.
These variables have the largest impact on cost because they're used to calculate the throughput. The
throughput values are shown in the image.
Now, we use those values in the Azure Pricing calculator.
CI/CD pipelinesCI/CD pipelines
Additional servicesAdditional services
The average throughput based on these settings is 20,000 RUs/sec which is the minimum throughput required
for a 3-year reser ved capacity3-year reser ved capacity plan.
Here is the total cost for three years using the reserved plan:
$1,635.20 Average per month ($58,867.20 charged upfront)
You save $700.00 by choosing the 3-year reserved capacity over the pay-as-you-go price.
With Azure DevOps, Basic PlanBasic Plan is included for Visual Studio Enterprise, Professional, Test Professional, and
MSDN Platforms subscribers. And no charge for adding or editing work items and bugs, viewing dashboards,
backlogs, and Kanban boards for stakeholders.
Basic plan license for five users is free.
For Microsoft Hosted Pipelines, the FreeFree tier includes one parallel CI/CD job with 1,800 minutes (30 hours) per
month. However you can select the PaidPaid tier and have one CI/CD parallel job ($40.00), in this tier, each parallel
CI/CD job includes unlimited minutes.
For this stage of cost estimation, the Self Hosted Pipelines is not required because the workload doesn't
have custom software that runs in your build process which isn't included in the Microsoft-hosted option.
Azure Artifacts is a service where you can create package feeds to publish and consume Maven, npm, NuGet,
Python, and universal packages. Azure Artifacts is billed on a consumption basis, and is free up until 2 GB of
storage. For this scenario, we estimate 56 GB in artifacts ($56.00)
Azure DevOps offers a cloud-based solution for load testing your apps. Load tests are measured and billed in
virtual user minutes (VUMs). For this scenario, we estimate a 200,000 VUMs ($72.00).
Estimated Total: $168.00 per monthEstimated Total: $168.00 per month
Managed services
12/16/2022 2 minutes to read • Edit Online
TA SKTA SK USEUSE
Host a web server Azure App Service instead of setting up IIS servers.
Indexing and querying heterogenous data Azure Cognitive Search instead of ElasticSearch.
Host a database server Azure offers many SQL and no-SQL options such as Azure
SQL Database and Azure Cosmos DB.
Secure access to virtual machine Azure Bastion instead of virtual machines as jump boxes.
Network security Azure Firewall instead of virtual network appliances.
Look for areas in the architecture where it may be natural to incorporate platform-as-a-service (PaaS) options.
These include caching, queues, and data storage. PaaS reduces time and cost of managing servers, storage,
networking, and other application infrastructure.
With PaaS, the infrastructure cost is included in the pricing model of the service. For example, you can provision
a lower SKU virtual machine as a jumpbox. There are additional costs for storage and managing a separate
server. You also need to configure a public IP on the virtual machine, which is not recommended. A managed
service such as Azure Bastion takes into consideration all those costs and offers better security.
Azure provides a wide range of PaaS resources. Here are some examples of when you might consider PaaS
options:
For more information, reference Use platform as a service (PaaS) options.
Reference architectureReference architecture
To see an implementation that provides better security and lowers cost through PaaS services, see [Network
DMZ between Azure and an on-premises datacenter]((/azure/architecture/reference-architectures/dmz/secure-
vnet-dmz).
Consumption and fixed cost models
12/16/2022 2 minutes to read • Edit Online
The common pricing options for Azure services are:
Consumption-based price - You are charged for only what you use. This model is also known as the Pay-As-
You-Go rate.
Fixed price - You provision resources and are charged for those instances whether or not they are used.
A common way to estimate cost is by considering workloads on a peak throughput. Under consistently high
utilization, consumption-based pricing can be less efficient for estimating baseline costs when compared to the
equivalent provisioned pricing. PaaS and serverless technologies can help you understand the economy cutoff
point for consumption-based pricing.
Observe the difference between cost models based on fixed, static provisioning of services, more variable costs
based on autoscaling of serverless technologies.
Start with a fixed minimum level of performance and then use architectural patterns (such as Queue Based Load
Leveling) and autoscaling of services. With this approach the peaks can be smoothed out into a more consistent
flow of compute and data. This approach should temporarily extend your burst performance when the service is
under sustained load. If cost is an important factor but you need to maintain service availability under burst
workload use the Throttling pattern to maintain quality of service under load.
Compare and contrast the options and understand how to provision workloads that can potentially switch
between the two models. The model will be a tradeoff between scalability and predictability. Ideally in the
architecture, blend the two aspects.
Provisioning cloud resources to optimize cost
12/16/2022 2 minutes to read • Edit Online
Deployment of cloud resources of a workload is known as
provisioning
.
Use the Azure Pricing calculator to estimate the cost of your SKU choices. This section describes some
considerations. This list is not meant to be an exhaustive list, but a subset of options.
AI + Machine Learning
Big data analytics
Compute
Networking
Data stores
AI + Machine Learning cost estimates
12/16/2022 3 minutes to read • Edit Online
Azure Machine Learning
Reference architectureReference architecture
The main cost driver for machine learning workloads is the compute cost. Those resources are needed to run the
training model and host the deployment. For information about choosing a compute target, see What are
compute targets in Azure Machine Learning?.
The compute cost depends on the cluster size, node type, and number of nodes. Billing starts while the cluster
nodes are starting, running, or shutting down.
With services such as Azure Machine Learning, you have the option of creating fix-sized clusters or use
autoscaling.
If the amount of compute is not known, start with a zero-node cluster. The cluster will scale up when it
detects jobs in the queue. A zero-node cluster is not charged.
Fix-sized clusters are appropriate for jobs that run at a constant rate and the amount of compute is known and
measured beforehand. The time taken to spin up or down a cluster incurs additional cost.
If you don't need retraining frequently, turn off the cluster when not in use.
To lower the cost for experimental or development workloads, choose Spot VMs. They aren't recommended for
production workloads because they might be evicted by Azure at any time. For more information, see Use Spot
VMs in Azure.
For more information about the services that make up a machine learning workload, see What are the machine
learning products at Microsoft?.
This article provides cost considerations for some technology choices. This isn't meant to be an exhaustive list,
but a subset of options.
Training models don't incur the machine learning service surcharge. You're charged for these factors.
The cost is driven by compute choices, such as, the virtual machine sizes and the region in which they're
available. If you can commit to one or three years, choosing reserved instances or savings plans can
lower cost. For more information, see Reserved instances and savings plans.
As part of provisioning Machine Learning resources, additional resources are deployed such as Azure
Container Registry, Azure Block Blob Storage, and Key Vault. You're charged for as per the pricing of those
individual services.
If you deploy models to a Kubernetes Service cluster, Machine Learning adds a surcharge on top of the
Kubernetes Service compute cost. This cost can be lowered through autoscaling.
For more information, see these articles:
Machine Learning pricing calculator
Azure Machine Learning
Azure Cognitive Services
Reference architectureReference architecture
Azure Bot Service
Reference architectureReference architecture
Training of Python scikit-learn and deep learning models on Azure
Distributed training of deep learning models on Azure
Batch scoring of Python machine learning models on Azure
Batch scoring of deep learning models on Azure
Real-time scoring of Python scikit-learn and deep learning models on Azure
Machine learning operationalization (MLOps) for Python models using Azure Machine Learning
Batch scoring of R machine learning models on Azure
Real-time scoring of R machine learning models on Azure
Batch scoring of Spark machine learning models on Azure Databricks
Enterprise-grade conversational bot
Build a real-time recommendation API on Azure
The billing depends on the type of service. The charges are based on the number of transactions for each type of
operation specific to a service. Certain number of transactions are free. If you need additional transactions,
choose from the StandardStandard instances. For more information, see
Cognitive services pricing calculator
Cognitive services pricing
Build an enterprise-grade conversational bot
The Azure Bot Service is a managed service purpose-built for enterprise-grade bot development. Billing is based
on the number of messages. Certain number of messages are free. If you need to create custom channels,
choose Premium channelsPremium channels, which can drive up the cost of the workload.
For a Web App Bot, an Azure App Service is provisioned to host the bot. Also, an instance of Application Insights
is provisioned. You're charged for as per the pricing of those individual services.
Enterprise-grade conversational bot
Big data analytics cost estimates
12/16/2022 6 minutes to read • Edit Online
Azure Synapse Analytics
Most big data workloads are designed to do:
Batch processing of big data sources at rest.
Stream processing of data in motion.
Those workloads have different needs. Batch processing is done with long-running batch jobs. For stream
processing, the data ingestion component should be able to capture and in some cases buffer, store real-time
messages. Both workloads also have the requirement to store large volume of data. Then, filter, aggregate, and
prepare that data for analysis.
For information about choosing technologies for each workload, see these articles:
Batch processing
Technology choices for batch processing
Capability matrix
Stream processing
What are your options when choosing a technology for real-time processing?
Capability matrix
This article provides cost considerations for some of those choices. This is not meant to be an exhaustive list, but
a subset of options.
The analytics resources are measured in
Data Warehouse Units (DWUs)
, which tracks CPU, memory, and IO.
DWU also indicates the required level of performance. If you need higher performance, add more DWU blocks.
You can provision the resources in one of two service levels.
Compute Optimized Gen1Compute Optimized Gen1 tracks usage in DWUs and is offered in a pay-as-you-go model.
Compute Optimized Gen2Compute Optimized Gen2 tracks the compute DWUs (cDWUs) which allows you to scale the compute
nodes. This level is intended for intensive workloads with higher query performance and compute scalability.
You can choose the pay-as-you-go model or save 37% to 65% by using reserved instances if you can commit
to one or three years. For more information, see Reserved instances.
Start with smaller DWUs and measure performance for resource intensive operations, such as heavy
data loading or transformation. This will help you determine the number of units you need to increase or
decrease. Measure usage during the peak business hours so you can assess the number of concurrent
queries and accordingly add units to increase the parallelism. Conversely, measure off-peak usage so that
you can pause compute when needed.
In Azure Synapse Analytics, you can import or export data from an external data store, such as Azure Blob
Storage and Azure Data Lake Store. Storage and analytics resources aren't included in the price. There is
additional bandwidth cost for moving data in and out of the data warehouse.
For more information, see these articles:
Reference architectureReference architecture
Azure Databricks
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
Cost for 10 DS13v2 instances 100 hours x 10 instances x $0.741/hour = $741.00
DBU cost for Data AnalyticsData Analytics workload 100 hours x 10 instances x 2 DBU per node x $0.55/DBU =
$1,100
TotalTotal $1,841
Reference architectureReference architecture
Azure Synapse Pricing
Manage compute in Azure Synapse Analytics data warehouse
Automated enterprise BI with Azure Synapse Analytics and Azure Data Factory
Enterprise BI in Azure with Azure Synapse Analytics
Azure Databricks offers two SKUs StandardStandard and PremiumPremium , each with these options, listed in the order of least
to most expensive.
Data Engineering LightData Engineering Light is for data engineers to build and execute jobs in automated Spark clusters.
Data EngineeringData Engineering includes autoscaling and has features for machine learning flows.
Data AnalyticsData Analytics includes the preceding set of features and is intended for data scientists to explore, visualize,
manipulate, and share data and insights interactively.
Choose a SKU depending on your workload. If you need features like log audit, which is available in PremiumPremium ,
the overall cost can increase. If you need autoscaling of clusters to handle larger workloads or interactive
Databricks dashboards, choose an option higher than Data Engineering LightData Engineering Light.
Here are factors that impact Databricks billing:
Azure location where the resource is provisioned.
The virtual machine instance tier and number of hours the instances were running.
Databricks units (DBU)
, which is a unit of processing capability per hour, billed on per-second usage.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
Suppose you run a PremiumPremium cluster for 100 hours in East US 2 with 10 DS13v2 instances.
For more information, see Azure Databricks Pricing.
If you can commit to one or three years, opt for reserved instances, which can save 38% - 59%. For more
information, see Reserved instances.
Turning off the Spark cluster when not in use to prevent unnecessary charges.
Stream processing with Azure Databricks
Build a Real-time Recommendation API on Azure
Batch scoring of Spark models on Azure Databricks
Azure Stream Analytics
Reference architectureReference architecture
Azure Analysis Services
Reference architectureReference architecture
Azure Data Factory V2
Stream analytics uses
streaming units (SUs)
to measure the amount of compute, memory, and throughput
required to process data. When provisioning a stream processing job, you're expected to specify an initial
number of SUs. Higher streaming units mean higher cost because more resources are used.
Stream processing with low latency requires a significant amount of memory. This resource is tracked by the
SU% utilization metric. Lower utilization indicates that the workload requires more compute resources. You can
set an alert on 80% SU Utilization metric to prevent resource exhaustion.
To evaluate the number of units you need, process an amount of data that is realistic for your
production level workload, observe the SU% Utilization metric, and accordingly adjust the SU value.
You can create stream processing jobs in Azure Stream Analytics and deploy them to devices running Azure IoT
Edge through Azure IoT Hub. The number of devices impacts over all cost. Billing starts when a job is deployed
to devices, regardless of the job status (running, failed, stopped).
SUs are based on the partition configuration for the inputs and the query that's defined within the job. For more
information, see Calculate the max streaming units for a job and Understand and adjust Streaming Units.
For pricing details, see Azure Stream Analytics pricing.
Batch scoring of Python machine learning models on Azure
Azure IoT reference architecture
Big data solutions need to store data that can be used for reporting. Azure Analysis Services supports the
creation of tabular models to meet this need.
Here are the tiers:
DeveloperDeveloper is recommended for evaluation, development, and test scenarios.
BasicBasic is recommended for small production environment.
StandardStandard is recommended for mission critical workloads.
Analysis Services uses Query Processing Units (QPUs) to determine the processing power. A QPU is an
abstracted measure of compute and data processing resources that impact performance. Higher the QPU results
in higher performance.
Each tier offers one or more instances. The main cost drivers are the QPUs and memory allocated for the tier
instance. Start with a smaller instance, monitor the QPU usage, and scale up or down by selecting a higher or
lower instance within the tier. Also, monitor usage during off-peak hours. You can pause the server when not in
use. No charges apply when you pause your instance. For more information, see these articles:
The right tier when you need it
Monitor server metrics
Azure Analysis Services pricing
Enterprise business intelligence - Azure Reference Architectures
Automated enterprise BI - Azure Architecture Center
Reference architectureReference architecture
Azure Data Factory is a big data orchestrator. The service transfers data to and from diverse types of data stores.
It transforms the data by using other compute services. It creates workflows that automate data movement and
transformation. You are only charged for consumption. Consumption is measured by these factors:
Pipeline activities that take the actions on the data. Those actions include copying the data from various
sources, transforming it, and controlling the flow. For more information, see data movement activities,
data transformation activities, and control activities.
You are charged for the total activities in thousands.
Executions measured in data integration units. Each unit tracks CPU, memory, and network resource
allocation. This measurement applies to Azure Integration Runtime.
You are also charged for execution activities, such as copying data, lookups, and external activities. Each activity
is individually priced. You are also charged for pipelines with no associated triggers or runs within the month. All
activities are prorated by the minute and rounded up.
Automated enterprise BI - Azure Architecture Center
Enterprise business intelligence - Azure Reference Architectures
Build an enterprise-grade conversational bot - Azure Architecture Center
Compute cost estimates
12/16/2022 8 minutes to read • Edit Online
Compute
refers to the hosting model for the computing resources that your application runs on. Whether you're
hosting model is Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Function as a service (FaaS),
each resource requires your evaluation to understand the tradeoffs that can be made that impact your cost. To
learn more about hosting models, read Understand the hosting models.
Infrastructure-as-a-Ser viceInfrastructure-as-a-Ser vice (IaaS) lets you provision individual virtual machines (VMs) along with the
associated networking and storage components. Then you deploy whatever software and applications
you want onto those VMs. This model is the closest to a traditional on-premises environment, except that
Microsoft manages the infrastructure. You still manage the individual VMs.
Platform-as-a-Ser vicePlatform-as-a-Ser vice (PaaS) provides a managed hosting environment, where you can deploy your
application without needing to manage VMs or networking resources. Azure App Service is a PaaS
service.
Functions-as-a-Ser viceFunctions-as-a-Ser vice (FaaS) goes even further in removing the need to worry about the hosting
environment. In a FaaS model, you simply deploy your code and the service automatically runs it. Azure
Functions are a FaaS service.
What are the cost implications to consider for choosing a hosting model?What are the cost implications to consider for choosing a hosting model?
If your application can be broken down into short pieces of code, a FaaS hosting model might be the best choice.
You're charged only for the time it takes to execute your code. For example, Azure Functions is a FaaS service
that processes events with serverless code. Azure Functions allows you to run short pieces of code (called
functions) without worrying about application infrastructure. Use one of the three Azure Functions pricing plans
to fit your need. To learn more about the pricing plans, see How much does Functions cost?
If you want to deploy a larger or more complex application, PaaS may be the better choice. With PaaS, your
application is always running, as opposed to FaaS, where your code is executed only when needed. Since more
resources are used with PaaS, the price increases.
If you're migrating your infrastructure from on-premises to Azure, IaaS will greatly reduce and optimize
infrastructure costs and salaries for IT staff who are no longer needed to manage the infrastructure. Since IaaS
uses more resources than PaaS and FaaS, your cost could be highest.
What are the main cost drivers for Azure ser vices?What are the main cost drivers for Azure ser vices?
You'll be charged differently for each service depending on your region, licensing plan (such as, Azure Hybrid
Benefit for Windows Server), number and type of instances you need, operating system, lifespan, and other
parameters required by the service. Assess the need for each compute service by using the flowchart in Choose
a candidate service. Consider the tradeoffs that will impact your cost by creating different estimates using the
Pricing Calculator. If your application consists of multiple workloads, we recommend that you evaluate each
workload separately. Reference Consider limits and costs to perform a more detailed evaluation on service
limits, cost, SLAs, and regional availability.
Are there payment options for Vir tual Machines (VMs) to help meet my budget?Are there payment options for Vir tual Machines (VMs) to help meet my budget?
The best choice is driven by the business requirements of your workload. If you have higher SLA requirements,
it will increase overall costs. You'll likely need more VMs to ensure uptime and connectivity. Other factors that
will impact cost are region, operating system, and the number of instances you choose. Your cost also depends
Use PaaS as an alternative to buying VMs
on the workload life span. See Virtual machines and Use Spot VMs in Azure for more details.
Pay as you goPay as you go lets you pay for compute capacity by the second, with no long-term commitment or
upfront payments. This option allows you to increase or decrease compute capacity on demand. It's
appropriate for applications with short-term, spiky, or unpredictable workloads that can't be interrupted.
You can also start or stop usage at any time, resulting in paying only for what you use.
Reser ved Vir tual Machine InstancesReser ved Vir tual Machine Instances lets you purchase a VM for one or three years in a specified
region in advance. It's appropriate for applications with steady-state usage. You may save more money
compared to pay-as-you-go pricing.
Savings plansSavings plans lets you commit to one or three years, across all regions, for a fixed hourly dollar amount
across compute services.
Spot pricingSpot pricing lets you purchase unused compute capacity at major discounts. If your workload can
tolerate interruptions, and its execution time is flexible, then using spot pricing for VMs can significantly
reduce the cost of running your workload in Azure.
Dev-Test pricingDev-Test pricing offers discounted rates on Azure to support your ongoing development and testing.
Dev-Test allows you to quickly create consistent development and test environments through a scalable,
on-demand infrastructure. This will allow you to spin up what you need, when you need it, and explore
scenarios before going into production. To learn more about Azure Dev-Test reduced rates, see Azure
Dev/Test Pricing.
For details on available sizes and options for the Azure VMs you can use to run your apps and workloads, see
Sizes for virtual machines in Azure. For details on specific Azure VM types, see Virtual Machine series.
Do I pay extra to run large-scale parallel and high-performance computing (HPC) batch jobs?Do I pay extra to run large-scale parallel and high-performance computing (HPC) batch jobs?
Use Azure Batch to run large-scale parallel and HPC batch jobs in Azure. You can save on VM cost because
multiple apps can run on one VM. Configure your workload with either the Low-priority tier (the cheapest
option) or the Standard tier (provides better CPU performance). There's no cost for the Azure Batch service.
Charges accrue for the underlying resources that run your batch workloads.
When you use the IaaS model, you do have final control over the VMs. It may appear to be a cheaper option at
first, but when you add operational and maintenance costs, the cost increases. When you use the PaaS model,
these extra costs are included in the pricing. In some cases, this means that PaaS services can be a cheaper than
managing VMs on your own. For help with finding areas in the architecture where it may be natural to
incorporate PaaS options, see Managed services.
How can I cut costs with hosting my web apps in PaaS?How can I cut costs with hosting my web apps in PaaS?
If you host your web apps in PaaS, you'll need to choose an App Service plan to run your apps. The plan will
define the set of compute resources on which your app will run. If you have more than one app, they'll run using
the same resources. This is where you'll see the most significant cost savings, as you don't incur cost for VMs.
If your apps are event-driven with a short-lived process using a microservices architecture style, we recommend
using Azure Functions. Your cost is determined by execution time and memory for a single function execution.
For pricing details, see Azure Functions pricing.
Is it more cost-effective to deploy development testing (dev-test) on a PaaS or IaaS hostingIs it more cost-effective to deploy development testing (dev-test) on a PaaS or IaaS hosting
model?model?
If your dev-test is built on Azure managed services such as Azure DevOps, Azure SQL Database, Azure Cache for
Redis, and Application Insights, the cheapest solution might be using the PaaS hosting model. You won't incur
A special case of PaaS - Containers
Predict cost estimates using the Pricing Calculator
NOTENOTE
the cost and maintenance of hardware. If your dev-test is built on Azure managed services such as Azure
DevOps, Azure DevTest Labs, VMs, and Application Insights, you need to add the cost of the VMs, which can
greatly increase your cost. For details on evaluating, see Azure Dev/Test Pricing.
How can I get the best cost savings for a containerized workload that requires full orchestration?How can I get the best cost savings for a containerized workload that requires full orchestration?
Your business requirements may necessitate that you store container images so that you've fast, scalable
retrieval, and network-close deployment of container workloads. Although there are choices as to how you'll run
them, we recommend that you use AKS to set up instances with a minimum of three (3) nodes. AKS reduces the
complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to
Azure. There's no charge for AKS Cluster Management. Any additional costs are minimal. The containers
themselves have no impact on cost. You pay only for per-second billing and custom machine sizes.
Can I save money if my containerized workload does not need full orchestration?Can I save money if my containerized workload does not need full orchestration?
Your business requirements may not necessitate full orchestration. If so and you're using App Service containers,
we recommend that you use one of the App Service plans. Choose the appropriate plan based on your
environment and workload.
There's no charge to use SNI-based SSL. Standard and Premium service plans include the right to use one IP SSL
at no additional charge. Free and shared service plans don't support SSL. You can purchase the right to use
additional SSL connections for a fee. In all cases, the SSL certificate itself must be purchased separately. To learn
more about each plan, see App services plans.
Where's the savings if my workload is event driven with a shor t-lived process?Where's the savings if my workload is event driven with a shor t-lived process?
In this example, Service Fabric may be a better choice than AKS. The biggest difference between the two is that
AKS works only with Docker applications using Kubernetes. Service Fabric works with microservices and
supports many different runtime strategies. Service Fabric can deploy Docker and Windows Server containers.
Like AKS, Service Fabric is built for the microservice and event-driven architectures. AKS is strictly a service
orchestrator and handles deployments, whereas Service Fabric also offers a development framework that allows
building stateful/stateless applications.
Use the Pricing Calculator to create your total cost estimate. After you run your initial scenario, you may find
that your plan is beyond the scope of your budget. You can adjust the overall cost and create various cost
scenarios to make sure your needs are met before you commit to purchasing.
The costs in this example are based on the current price and are subject to change. The calculation is for information
purposes only. It shows the Collapsed view of the cost in this estimate.
TIPTIP
Next steps
You can start building your cost estimate at any time and re-visit it later. The changes will be saved until you modify or
delete your estimate.
Provisioning cloud resources to optimize cost
Virtual Machines documentation
VM payment options
Pricing Calculator
Data store cost estimates
12/16/2022 9 minutes to read • Edit Online
Guidelines: Identify the business transactions and their requirements
Most cloud workloads adopt the
polyglot
persistence approach. Instead of using one data store service, a mix of
technologies is used. You can achieve optimal cost benefit from using this approach.
Each Azure data store has a different billing model. To establish a total cost estimate, first, identify the business
transactions and their requirements. Then, break each transaction into operations. Lastly, identify a data store
appropriate for the type of data. Do this for each workload separately.
Let's take an example of an e-commerce application. It needs to store data for transactions such as orders,
payments, and billing. The data structure is predetermined and not expected to change frequently. Data integrity
and consistency are crucial. There's a need to store product catalogs, social media posts, and product reviews. In
some cases, the data is unstructured, and is likely to change over time. Media files must be stored and data must
be stored for auditing purposes.
Learn about data stores in Understand data store models.
The following list of questions address the requirements that can have the greatest impact your cost estimate.
For example, your monthly bill may be within your budget now, but if you scale up or add storage space later,
your cost may increase well over your budget.
Will your data need to be migrated to on-premises, external data centers, or other cloud-hosting
environments?
What type of data are you intending to store?
How large are the entities you need to store?
What is the overall amount of storage capacity you need?
What kind of schemas will you apply to your data (e.g., fixed schema, schema-on-write, or schema-on-read)?
What are your data performance requirements (e.g., acceptable response times for querying and aggregation
of data once ingested)?
What level of fault-tolerance do you need to provide for data consumers?
What kind of data replication capabilities do you require?
Will the limits of a particular data store support your requirements for scale, number of connections, and
throughput?
How many instances will need to run to support your uptime and throughput requirements? (Consider
operations costs in this calculation.)
Can you partition your data to store it more cost effectively (e.g., can you move large objects out of an
expensive relational database into an object store)?
There are other requirements that may not have as great of an impact on your cost. For example, the US East
region is only slightly lower than Canada Central. See Criteria for choosing a data store for additional business
requirements.
Use the Pricing Calculator to determine different cost scenarios.
What are the network requirements that may impact cost?What are the network requirements that may impact cost?
Do you need to restrict or otherwise manage access to your data from other network resources?
Guidelines: Break each transaction into operations
Guidelines: Identify a data store appropriate for the type of data
NOTENOTE
Relational database management systems (RDBMS) cost
Does data need to be accessible only from inside the Azure environment?
Does the data need to be accessible from specific IP addresses or subnets?
Does the data need to be accessible from applications or services hosted on-premises or in other external
data centers?
For example, one business transaction can be processed by these distinct operations:
10-15 database calls
Three append operations on blob, and
Two list operations on two separate file shares
How does data type in each operation effect cost?How does data type in each operation effect cost?
Consider Azure Blob Storage Block Blobs instead of storing binary image data in Azure SQL Database. Blob
storage is cheaper than Azure SQL Database.
If your design requires SQL, store a lookup table in SQL Database and retrieve the document when needed to
serve it to the user in your application middle tier. SQL Database is highly targeted for high-speed data
lookups and set-based operations.
The hot access tier of Azure Block Blob Storage cost is cheaper than the equivalent size of the Premium SSD
volume that has the database.
Read this decision chart to make your choices.
An inappropriate data store or one that is mis-configured can have a huge cost impact on your design.
RDBMS is the recommended choice when you need strong consistency guarantees. An RDBMS typically
supports a schema-on-write model, where the data structure is defined ahead of time, and all read or write
operations must use the schema.
How can I save money if my data is on-premises and already on SQL ser ver?How can I save money if my data is on-premises and already on SQL ser ver?
If the on-premises data is already on a SQL server, it might be a natural choice. The on-premises license with
Software Assurance can be used to bring down the cost if the workload is eligible for Azure Hybrid Benefit. This
option applies for Azure SQL Database (PaaS) and SQL Server on Azure Virtual Machines (IaaS).
For open-source databases such as MySQL, MariaDB, or PostGreSQL, Azure provides managed services that are
easy to provision.
What are some design considerations that will affect cost?What are some design considerations that will affect cost?
If the SLAs don't allow for downtime, can a read-only replica in a different region enable business continuity?
If the database in the other region must be read/write, how will the data be replicated?
Does the data need to be synchronous, or could consistency allow for asynchronous replication?
How important is it for updates made in one node to appear in other nodes, before further changes can be
made?
Key/value and document databases cost
Graphic databases cost
Azure storage has several options to make sure data is copied and available when needed. Locally redundant
storage (LRS) synchronously replicates data in the primary region. If the entire primary center is unavailable the
replicated data is lost. At the expensive end, Geo-zone-redundant storage (preview) (GZRS) replicates data in
availability zones within the primary region and asynchronously copied to another paired region. Databases that
offer geo-redundant storage, such as SQL Database, are more expensive. Most other OSS RDBMS database use
LRS storage, which contributes to the lower price range.
For more information, see automated backups.
Azure Cosmos DB offers five consistency levels:
strong
,
bounded staleness
,
session
,
consistent prefix
, and
eventual
consistency. Each level provides availability and performance tradeoffs and is backed by
comprehensive SLAs. The consistency level itself does not affect cost.
How can I minimize compute cost?How can I minimize compute cost?
Higher throughput and IOPS require higher compute, memory, I/O and storage limits. These limits are
expressed in a vCore model. With higher vCore number, you buy more resources and consequently the cost is
higher. Azure SQL Database has more vCores and allows you to scale in smaller increments. Azure Database for
MySQL, PostgreSQL, and MariaDB have fewer vCores and scaling up to a higher vCore can cost more. MySQL
provides in-memory tables in the Memor y OptimizedMemor y Optimized tier, which can also increase the cost.
All options offer a consumption and provisioned pricing models. With pre-provisioned instances, you save more
if you can commit to one or three years.
How is primar y and backup storage cost calculated?How is primar y and backup storage cost calculated?
With Azure SQL Database the initial 32 GB of storage is included in the price. For the other listed options, you
need to buy storage separately and might increase the cost depending on your storage needs.
For most databases, there is no charge for the price of backup storage that is equal in size to primary storage. If
you need more backup storage, you will incur an additional cost.
Azure Cosmos DB is recommended for key/value stores and document databases. Azure Cache for Redis is also
recommended for key/value stores.
For Azure Cosmos DB, here are some considerations that will affect cost:
Can storage and item size be adjusted?
Can index policies be reduced or customized to bring down the extra cost for writes and eliminate the
requirement for additional throughput capacity?
If data is no longer needed, can it be deleted from the Azure Cosmos DB account? As an alternative, you can
migrate the old data to another data store such as Azure blob storage or Azure data warehouse.
For Azure Cache for Redis, there is no upfront cost, no termination fees, you pay only for what you use, and
billing is per-hour.
Cost considerations for graphic database stores include storage required by data and indexes used across all the
regions. For example, graphs need to scale beyond the capacity of a single server, and make it easy to lookup the
index when processing a query.
The Azure service that supports this is Azure Cosmos DB for Apache Gremlin. Cost is limited to the Azure Azure
Cosmos DB usage. You pay for the operations you perform against the database and for the storage consumed
by your data. Charges are determined by the number of provisioned containers, the number of hours the
Data analytics cost
Column family database cost
Search engine database cost
Time series database cost
Object storage cost
containers were online, and the provisioned throughput for each container.
If the on-premises data is already on an SQL server on Azure Virtual Machines (IaaS), the license with Software
Assurance can be used to bring down the cost if the workload is eligible for Azure Hybrid Benefit.
Cost considerations for data analytics stores include data storage, multiple servers to maximize scalability, and
the ability to access large data as external tables.
Azure has many services that support data analytics stores: Azure Synapse Analytics, Azure Data Lake, Azure
Data Explorer, Azure Analysis Services, HDInsight, and Azure Databricks. As an example of use, historical data is
typically stored in data stores such as blob storage or Azure Data Lake Storage Gen2, which are then accessed
by Azure Synapse, Databricks, or HDInsight as external tables.
When using Azure Synapse, you will only pay for the capabilities you opt in to use. During public preview, there
will not be a cost for provisioning an Azure Synapse workspace. Enabling a managed VNET and Customer
Managed Keys may incur a workspace fee after public preview. Pricing of workspaces with additional capabilities
will be announced at a future date.
The main cost consideration for a column family database is that it needs to be massively scalable.
The Azure services that support column family databases are Azure Cosmos DB for Apache Cassandra and
HBase in HDInsight.
For Azure Cosmos DB for Apache Cassandra, you pay the cost of Azure Cosmos DB, which includes database
operations and consumed storage. Azure Cosmos DB for Apache Cassandra is open-source.
For HBase in HDInsight, you pay the cost of HDInsight, which includes instance size and number. HBase is open-
source.
Cost incurs for a search engine database when applications need to search for information held in external data
stores. It also needs to index massive volumes of data and provide near real-time access to these indexes.
Azure Cognitive Search is a search service that uses AI capabilities to identify and explore relevant content at
scale.
The main cost considerations for a time series database is the need to collect large amounts of data in real time
from a large number of sources. Although the records written to a time series database are generally small,
there are often a large number of records, and total data size can grow rapidly which will drive up cost.
Azure Time Series Insights may be the best option to minimize cost.
Cost considerations include storing large binary objects such as images, text files, video and audio streams,
large application data objects and documents, and virtual machine disk images.
Azure services that support object storage are Azure Blob Storage and Azure Data Lake Storage Gen2.
Shared files cost
The main cost consideration is having the ability to access files across a network. For example, given appropriate
security and concurrent access control mechanisms, sharing data in this way can enable distributed services to
provide highly scalable data access for performing basic, low-level operations such as simple read and write
requests.
The Azure service that supports shared files is Azure Files.
Azure messaging cost estimates
12/16/2022 2 minutes to read • Edit Online
Service Bus cost
Event Grid cost
Event Hubs
The messaging services in this article have no up-front cost or termination fees, and you pay only for what you
use. In some cases, it's advantageous to combine two messaging services to increase the efficiency of your
messaging system. See Crossover scenarios for examples.
Cost is based on the number of operations or throughput units used depending on the message service. Using
the wrong messaging service for the intent can lead to higher costs. Before choosing a service, first, determine
the intent and requirements of the messages. Then, consider the tradeoffs between cost and
operations/throughput units. For tradeoff examples, see Technology choices for a message broker.
Use the Azure Pricing calculator for help creating various cost scenarios.
Connect on-premises and cloud-based applications and services to implement highly secure messaging
workflows using Service Bus. Cost is based on messaging operations and number of connections. The Basic tier
is the cheapest. If you want more operations and features, choose the Standard or Premium tier. For example,
Service Bus Premium runs in dedicated resources to provide higher throughput and more consistent
performance.
For pricing details, see Service Bus pricing.
Manage routing of all events from any source to any destination to simplify event-based app development using
Event Grid. Event Grid can route a massive number of events per second per region. Cost is based on number of
operations performed. Examples of some operations are event ingress, subscription delivery attempts,
management calls, and filtering by subject suffix.
For pricing details, see Event Grid pricing.
Stream millions of events per second from any source to build dynamic data pipelines and immediately respond
to business challenges using Event Hubs. Cost is based on throughput units. A key difference between Event
Grid and Event Hubs is in the way event data is made available to subscribers. For more information, see Pull
model.
For questions and answers on pricing, see Pricing.
For pricing, see Event Hubs pricing.
Networking cost estimates
12/16/2022 5 minutes to read • Edit Online
Traffic routing and load balancing
SERVIC ESERVIC E GLO B A L / REG IO N A LGLO B A L / REG IO N A L REC O M M EN DED T RA F F ICREC O M M EN DED T RA F F IC
Azure Front Door Global HTTP(S)
Traffic Manager Global non-HTTP(S)
Application Gateway Regional HTTP(S)
Azure Load Balancer Regional non-HTTP(S)
Example cost analysisExample cost analysis
Azure Front Door
(
West US
)
, Zone 1.Azure Front Door
(
West US
)
, Zone 1.
Most workloads have a load balancing service to route and distribute traffic in a way that a single resource isn't
overloaded, and the response time is minimum with maximum throughput.
Do you need to distribute traffic within a region or across regions?Do you need to distribute traffic within a region or across regions?
Azure Front Door and Traffic Manager can distribute traffic to backends, clouds, or hybrid on-premises services
that reside in multiple regions. Otherwise, for regional traffic that moves within virtual networks or zonal and
zone-redundant service endpoints within a region, choose Application Gateway or Azure Load Balancer.
What is the type of traffic?What is the type of traffic?
Load balancers such as Azure Application Gateway and Azure Front Door are designed to route and distribute
traffic to web applications or HTTP(S) endpoints. Those services support Layer 7 features such as SSL offload,
web application firewall, path-based load balancing, and session affinity. For non-web traffic, choose Traffic
Manager or Azure Load Balancer. Traffic Manager uses DNS to route traffic. Load Balancer supports Layer 4
features for all UDP and TCP protocols.
Here's the matrix for choosing load balancers by considering both dimensions.
For more information, see Choose a load balancing service.
Consider a web application that receives traffic from users across regions over the internet. To minimize the
request response time, the load balancer can delegate the responsibility from the web server to a different
machine. The application is expected to consume 10 TB of data. Routing rules are required to route incoming
requests to various paths. Also, we want to use Web Application Firewall (WAF) to secure the application
through policies.
By using the Azure Pricing Calculator, we can estimate the cost for these two services.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
Application Gateway
(
West US
)
Application Gateway
(
West US
)
Azure Traffic ManagerAzure Traffic Manager
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
DNS queries 100 million x $0.54/per million = $54.00
Basic health checks 20 endpoints x $0.36/month = $7.20
Fast interval health checks 10 endpoints x $1.00/month = $10.00
External endpoints hosted outside Azure 10 endpoints x $0.54/month = $5.40
Real User Measurements 100 million user measurements x $2.00/month = $200.00
Traffic view 100 million data points processed x $2.00/month = $200.00
TotalTotal $486.60
Azure Load BalancerAzure Load Balancer
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
Rules First 5 rules: $0.032/rule/hour x 730 = $18.25
Additional 5 rules: $0.013/rule/hour x 730 = $36.50
Data processed 10 TB x $0.005/GB = 51.20
Consider a similar example where the type of traffic is changed. Instead the application is a UDP streaming
service that is deployed across regions and traffic goes over the internet. We can use a combination of Traffic
Manager and Azure Load Balancer. Traffic Manager is a simple routing service that uses DNS to direct clients to
specific service. Here are the cost estimates:
TotalTotal $105.95
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
Peering
Example cost analysisExample cost analysis
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
Ingress traffic 50 TB x $0.0100/GB = $512.50
Egress traffic 50 TB x $0.0100/GB = $512.50
TotalTotal $1,025.00
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
Ingress traffic 50 TB x $0.0350/GB = $1,792.00
Egress traffic 50 TB x $0.0350/GB = $1,792.00
TotalTotal $3,584.00
Hybrid connectivity
Do you need to connect vir tual networks?Do you need to connect vir tual networks?
Peering technology is a service used for Azure virtual networks to connect with other virtual networks in the
same or different Azure region. Peering technology is used often in hub and spoke architectures.
An important consideration is the additional costs incurred by peering connections on both egress and ingress
traffic traversing the peering connections.
Keeping the top talking services of a workload within the same virtual network, zone and/or region
unless otherwise required. Use virtual networks as shared resources for multiple workloads against a single
virtual network per workload approach. This approach will localize traffic to a single virtual network and
avoid the additional costs on peering charges.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
Peering within the same region is cheaper than peering between regions or Global regions. For instance, you
consume 50 TB per month by connecting two VNETs in Central US. Using the current price here is the incurred
cost.
Let's compare that cost for cross-region peering between Central US and East US.
There are two main options for connecting an on-premises datacenter to Azure datacenters:
Azure VPN Gateway can be used to connect a virtual network to an on-premises network through a VPN
appliance or to Azure Stack through a site-to-site VPN tunnel.
Example cost analysisExample cost analysis
Azure VPN GatewayAzure VPN Gateway
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
VPN Hours 720 x $0.564 = $406.08
Site to Site (S2S) Tunnels The first 10 tunnels are free. For additional tunnels, the cost
is 5 tunnels x 720 hours x $0.015 per hour per tunnel =
$54.00
Outbound transfer 1024 GB x 0.086 = $88.65
Total cost per month $548.73
ExpressRouteExpressRoute
IT EMIT EM EXA M P L E EST IM AT EEXA M P L E EST IM AT E
Circuit bandwidth 1 GB/s speed has a fixed rate of $436.00 for the standard
price.
Outbound transfer 1 TB x $0.0025 = $25.60
Total cost per month $461.00
Azure ExpressRoute creates private connections between Azure datacenters and infrastructure that's on
premise or in a colocation environment.
What is the required throughput for cross-premises connectivity?What is the required throughput for cross-premises connectivity?
VPN gateway is recommended for development/test cases or small-scale production workloads where
throughput is less than 100 Mbps. Use ExpressRoute for enterprise and mission-critical workloads that access
most Azure services. You can choose bandwidth from 50 Mbps to 10 Gbps.
Another consideration is security. Unlike VPN Gateway traffic, ExpressRoute connections don't go over the public
internet. VPN Gateway traffic is secured by industry standard IPsec.
For both services, inbound transfers are free and outbound transfers are billed per the billing zone.
For more information, see Choose a solution for connecting an on-premises network to Azure.
This blog post provides a comparison of the two services.
This example compares the pricing details for VPN Gateway and ExpressRoute.
The example is based on the current price and is subject to change. The calculation shown is for information
purposes only.
Suppose, you choose the VpnGw2AZVpnGw2AZ tier, which supports availability zones; 1 GB/s bandwidth. The workload
needs 15 site-to-site tunnels and 1 TB of outbound transfer. The gateway is provisioned and available for 720
hours.
Let's choose the Metered DataMetered Data plan in billing zone of Zone 1Zone 1.
The main cost driver is outbound data transfer. ExpressRoute is more cost-effective than VPN Gateway when
consuming large amounts of outbound data. If you consume more than 200 TB per month, consider
ExpressRoute with the Unlimited DataUnlimited Data plan where you're charged a flat rate.
For pricing details, see:
Azure VPN Gateway pricing
Azure ExpressRoute pricing
Networking resources provisioning
12/16/2022 7 minutes to read • Edit Online
Azure Front Door
Reference architectureReference architecture
Azure Application Gateway
Reference architectureReference architecture
Azure ExpressRoute
For design considerations, see Networking resource choices.
Azure Front Door billing is affected by outbound data transfers, inbound data transfers, and routing rules. The
pricing chart doesn't include the cost of accessing data from the backend services and transferring to Front
Door. Those costs are billed based on data transfer charges, described in Bandwidth Pricing Details.
Another consideration is Web Application Firewall (WAF) settings. Adding policies will drive up the cost.
For more information, see Azure Front Door Pricing.
Highly available multi-region web application uses Front Door to route incoming requests to the primary
region. If the application running that region becomes unavailable, Front Door fails over to the secondary
region.
There are two main pricing models:
Fixed price
You're charged for the time that the gateway is provisioned and available and the amount of data
processed by the gateway. For more information, see Application Gateway pricing.
Consumption price
This model applies to v2 SKUs that offer additional features such as autoscaling, Azure Kubernetes
Service Ingress Controller, zone redundancy, and others. You're charged based on the consumed capacity
units. The capacity units measure the compute resources, persistent connections, and throughput.
Consumption price is charged in addition to the fixed price.
For more information, see:
Application Gateway v2 pricing
Application Gateway pricing
Microservices architecture on Azure Kubernetes Service (AKS) uses Application Gateway as the ingress
controller.
Securely managed web applications uses Application Gateway as a web traffic load balancer operating at
Layer 7 that manages traffic to the web application. Web Application Firewall (WAF) is enabled to
enhance security.
There are two main pricing models:
Metered Data planMetered Data plan
Reference architectureReference architecture
Azure Firewall
Reference architectureReference architecture
Azure Load Balancer
Reference architectureReference architecture
There are two pricing tiers: StandardStandard and PremiumPremium, which is priced higher. The tier pricing is based on
the circuit bandwidth.
If you don't need to access the services globally, choose StandardStandard. With this tier, you can connect to
regions within the same zone at no additional cost. Outbound cross-zonal traffic incurs more cost.
Unlimited Data planUnlimited Data plan
All inbound and outbound data transfer is included in the flat rate. There are two pricing tiers: StandardStandard
and PremiumPremium, which is priced higher.
Calculate your utilization and choose a billing plan. The Unlimited Data planUnlimited Data plan is recommended if you exceed
about 68% of utilization.
For more information, see Azure ExpressRoute pricing.
Connect an on-premises network to Azure using ExpressRoute connects an Azure virtual network and an on-
premises network connected using with VPN gateway failover.
Azure Firewall usage can be charged at a fixed rate per deployment hour. There's additional cost for the amount
of data transferred.
There aren't additional cost for a firewall deployed in an availability zone. There are additional costs for inbound
and outbound data transfers associated with availability zones.
When compared to network virtual appliances (NVAs), with Azure Firewall you can save up to 30-50%. For more
information see Azure Firewall vs NVA.
Hub-spoke network topology in Azure
Deploy highly available NVAs
The service distributes inbound traffic according to the configured rules.
There are two tiers: BasicBasic and StandardStandard.
The BasicBasic tier is free.
For the StandardStandard tier, you are charged only for the number of configured load-balancing and outbound rules.
Inbound NAT rules are free. There's no hourly charge for the load balancer when no rules are configured.
See Azure Load Balancer Pricing for more information.
Connect an on-premises network to Azure using ExpressRoute: Multiple subnets are connected through
Azure load balancers.
SAP S/4HANA in Linux on Azure: Distribute traffic to virtual machines in the application-tier subnet.
Extend an on-premises network using VPN Internal load balancer. Network traffic from the VPN gateway
is routed to the cloud application through an internal load balancer. The load balancer is located in the
front-end subnet of the application.
Azure VPN Gateway
Reference architectureReference architecture
Traffic Manager
Reference architectureReference architecture
When provisioning a VPN Gateway resource, choose between two gateway types:
VPN gateway to send encrypted traffic across the public internet. Site-to-Site, Point-to-Site, and VNet-to-
VNet connections all use a VPN gateway.
ExpressRoute gateway - To send network traffic on a private connection. This is used when configuring Azure
ExpressRoute.
For VPN gateway, select Route-basedRoute-based or Policy-basedPolicy-based based on your VPN device and the kind of VPN
connection you want to create. Route-based gateway allows point-to-site, inter-virtual network, or multiple site-
to-site connections. Policy based only allows one site-to-site tunnel. Point-to-site isn't supported. So, route-
based VPN is more expensive.
Next, you need to choose the SKU for Route-basedRoute-based VPN. For developer/test workloads, use BasicBasic. For
production workloads, an appropriate Generation1Generation1 or Generation2Generation2 SKU. Each SKU has a range and pricing
depends on the type of VPN gateway because each type offers different levels of bandwidth, site-to-site, and
point-to-site tunnel options. Some of those types also offer availability zones, which are more expensive. If you
need higher bandwidth, consider Azure ExpressRoute.
VPN gateway can be the cost driver in a workload because charges are based on the amount of time that the
gateway is provisioned and available.
All inbound traffic is free, all outbound traffic is charged as per the bandwidth of the VPN type. Bandwidth also
varies depending on the billing zone.
For more information, see
Hybrid connectivity
VPN Gateway Pricing.
Traffic across zones
Bandwidth Pricing Details.
Extend an on-premises network using VPN connects the virtual network to the on-premises network through
a VPN device.
Traffic manager uses DNS to route and load balance traffic to service endpoints in different Azure regions. So, an
important use case is disaster recovery. In a workload, you can use Traffic Manager to route incoming requests
to the primary region. If that region becomes unavailable, Traffic Manager fails over to the secondary region.
There are other features that can make the application highly responsive and available. Those features cost
money.
Determine the best web app to handle request based on geographic location.
Configure caching to reduce the response time.
Traffic Manager isn't charged for bandwidth consumption. Billing is based on the number of DNS queries
received, with a discount for services receiving more than 1 billion monthly queries. You're also charged for each
monitored endpoint.
Multi-region N-tier application uses Traffic Manager to route incoming requests to the primary region. If that
region becomes unavailable, Traffic Manager fails over to the secondary region. For more information, see the
section Traffic Manager configuration.
DNS query chargesDNS query charges
Health monitoring chargesHealth monitoring charges
Real User Measurement chargesReal User Measurement charges
Traffic view chargesTraffic view charges
Virtual Network
Traffic Manager uses DNS to direct clients to specific service.
Only DNS queries that reach Traffic Manager are charged in million query units. For 100 million DNS queries
month, the charges will be $54.00 a month based on the current Traffic Manager pricing.
Not all DNS queries reach Traffic Manager. Recursive DNS servers run by enterprises and ISPs first attempt to
resolve the query by using cached DNS responses. Those servers query Traffic Manager at a regular interval to
get updated DNS entries. That interval value or TTL is configurable in seconds. TTL can impact cost. Longer TTL
increases the amount of caching and reduces DNS query charges. Conversely, shorter TTL results in more
queries.
However, there is a tradeoff. Increased caching also impacts how often the endpoint status is refreshed. For
example, the user failover times, for an endpoint failure, will become longer.
When Traffic Manager receives a DNS request, it chooses an available endpoint based on configured state and
health of the endpoint. To do this, Traffic Manager continually monitors the health of each service endpoint.
The number of monitored endpoints are charged. You can add endpoints for services hosted in Azure and then
add on endpoints for services hosted on-premises or with a different hosting provider. The external endpoints
are more expensive, but health checks can provide high-availability applications that are resilient to endpoint
failure, including Azure region failures.
Real User Measurements evaluates network latency from the client applications to Azure regions. That
influences Traffic Manager to select the best Azure region in which the application is hosted. The number of
measurements sent to Traffic Manager is billed.
By using Traffic View, you can get insight into the traffic patterns where you have endpoints. The charges are
based on the number of data points used to create the insights presented.
Azure Virtual Network is free. You can create up to 50 virtual networks across all regions within a subscription.
Here are a few considerations:
Inbound and outbound data transfers are charged per the billing zone. Traffic that moves across regions
and billing zones are more expensive. For more information, see:
Traffic across zones
Bandwidth Pricing Details.
VNET Peering has additional cost. Peering within the same region is cheaper than peering between
regions or Global regions. Inbound and outbound traffic is charged at both ends of the peered networks.
For more information, see Peering
Managed services (PaaS) don't always need a virtual network. The cost of networking is included in the
service cost.
Web application cost estimates
12/16/2022 4 minutes to read • Edit Online
App Service plans
App Service cost
API Management cost
All web applications (apps) have no up-front cost or termination fees. Some charge only for what you use and
others charge per-second or per-hour. In addition, all web apps run in App Service plans. Together these costs
can help determine the total cost of a web app.
Use the Azure Pricing calculator to help create various cost scenarios.
The App Service plans include the set of compute resources needed for the web app to run. Except for Free tier,
an App Service plan carries a charge on the compute resources it uses. For a description of App Service plans,
see Azure App Service plan overview.
You can potentially save money by choosing one or more App Service plans. To help find the solution that meets
your business requirements, see Should I put an app in a new plan or an existing plan?
The
pricing tier
of an App Service plan determines what App Service features you get and how much you pay
for the plan. The higher the tier, the higher the cost. For details on the pricing tiers, see App Service Pricing.
Are there exceptions that impact App Ser vices plan cost?Are there exceptions that impact App Ser vices plan cost?
You don't get charged for using the App Service features that are available to you (such as configuring custom
domains, TLS/SSL certificates, deployment slots, backups, and so on). For a list of exceptions, see How much
does my App Service plan cost?
App Service plans are billed on a per second basis. If your solution includes several App Service apps, consider
deploying them to separate App Service plans. This approach enables you to scale them independently because
they run on separate instances, which saves unnecessary cost.
Azure App Service supports two types of SSL connections: Server Name Indication (SNI) SSL Connections and
IP Address SSL Connections. SNI-based SSL works on modern browsers while IP-based SSL works on all
browsers. If your business requirements allow, use SNI-based SSL instead of IP-based SSL. There is no charge
for SNI-based. IP-based SSL incurs a cost per connection.
See the SSL Connections section in App Service pricing for details.
If you want to publish APIs hosted on Azure, on-premises, and in other clouds more securely, reliably, and at
scale, use API Management. The pricing tier you choose depends on the features needed based on the business
requirements. For example, depending on your requirements for scalability, the number of units required to
scale out range from 1-10. As the number of units increases, the cost of the pricing tier increases.
Self-hosted gateways are available in the Developer and Premium pricing tiers. There is an additional cost for
this service using the Premium tier. There is no cost for self-hosted gateways if using the Developer tier.
For pricing details, see API Management Pricing.
Content Delivery Network (CDN) cost
Azure Cognitive Search cost
Azure SignalR cost
Notification Hubs cost
If you want to offer users optimal online content delivery, use CDN in your workload. The data size being used
has the most significant impact on cost, so you'll want to choose based on your business requirements.
Purchasing data in increments of TB is significantly higher than purchasing data in increments of GB.
The provider you choose also can impact cost. Choose Microsoft as the provider to get the lowest cost.
Some data, such as shopping carts, search results, and other dynamic content, is not cacheable. CDN offers
Acceleration Data Transfers, also called Dynamic Site Acceleration (DSA), to accelerate this type of web content.
The price for DSA is the same across Standard and Premium profiles.
For pricing details, see Content Delivery Network pricing.
To set up and scale a search experience quickly and cost-effectively, use Azure Cognitive Search. When you
create an Azure Cognitive Search service, a resource is created at a pricing tier (or SKU) that's fixed for the
lifetime of the service. Billing depends on the type of service. See Search API for a list of services.
The charges are based on the number of transactions for each type of operation specific to a service. A certain
number of transactions are free. If you need additional transactions, choose from the Standard instances.
For pricing details, see Azure Cognitive Search pricing.
Use SignalR for any scenario that requires pushing data from server to client in real time. For example, Azure
SignalR provides secure and simplified communication between client and app one-to-one (such as a chat
window) or one-to-many (instant broadcasting, IoT dashboards, or notification to social network). To learn more
about Azure SignalR, see What is Azure SignalR Service?
The Free tier is not recommended for a production environment. With the Standard tier, you pay only for what
you use. A Standard tier is recommended as an enterprise solution because it offers a large number of
concurrent connections and messages.
For pricing details, see Azure SignalR Service pricing.
To broadcast push notifications to millions of users at once, or tailor notifications to individual users, use
Notification Hubs. Pricing is based on number of pushes. The Free tier is a good starting point for exploring
push capabilities but is not recommended for production apps. If you require more pushes and features such as
scheduled pushes or multi-tenancy, you can buy additional pushes per million for a fee.
See Push notifications with Azure Notification Hubs: Frequently asked questions for more information. For
pricing details, see Notification Hubs pricing.
Checklist
-
Monitor cost
12/16/2022 2 minutes to read • Edit Online
Use this checklist to monitor the cost of the workload.
Gather cost data from diverse sources to create repor tsGather cost data from diverse sources to create repor ts. Start with tools like Azure Advisor,
Advisor Score, and Azure Cost Management. Build custom reports relevant for the business by using
Consumption APIs.
Cost reports
Review costs in cost analysis
Use resource tag policies to build repor tsUse resource tag policies to build repor ts. Tags can be used to identify the owners of systems or
applications and create custom reports.
Follow a consistent tagging standard
Video: How to review tag policies with Azure Cost Management
Use Azure built-in roles for costUse Azure built-in roles for cost. Only give access to users who are intended to view and analyze cost
reports. The roles are defined per scope. For example, use the Cost Management Reader roleCost Management Reader role to
enable users to view costs for their resources in subscriptions or resource groups.
Provide the right level of cost access
Azure RBAC scopes
Respond to aler ts and have a response plan according to the constraints.Respond to aler ts and have a response plan according to the constraints. Respond to alerts
quickly and identify possible causes and any required action.
Budget and alerts
Use cost alerts to monitor usage and spending
Adopt both proactive and reactive approaches for cost reviewsAdopt both proactive and reactive approaches for cost reviews. Conduct cost reviews at a
regular cadence to determine the cost trend. Also review reports that are created because of alerts.
Conduct cost reviews
Participate in central governance cost reviews
Analyze the cost at all scopesAnalyze the cost at all scopes by using Cost analysis. Identify services that are driving the cost
through different dimensions, such as location, usage meters, and so on. Review whether certain
optimizations are bringing results. For example, analyze costs associated with reserved instances, savings
plans and Spot VMs against business goals.
Quickstart: Explore and analyze costs with cost analysis
Detect anomaliesDetect anomalies and identify changes in business or applications that might have contributed changes
in cost. Focus on these factors:
Traffic pattern as the application scales.
Budget for the usage meters on resources.
Performance bottle necks.
CPU utilization and network throughput.
Storage footprint for blobs, backups, archiving.
Use Visualization tools to analyze cost information.Use Visualization tools to analyze cost information.
Create visuals and reports with the Azure Cost Management connector in Power BI Desktop
Cost Management App
Set budgets and alerts
12/16/2022 2 minutes to read • Edit Online
Respond to alerts
Revise budgets
Azure Cost Management has an alert feature. Alerts are generated when consumption reaches a threshold.
Consider the metrics that each resource in the workload. For each metric, build alerts on baseline thresholds.
This way, the admins can be alerted when the workload is using the services at capacity. The admins can then
tune the resources to target SKUs based on current load.
You can also set alerts on allowed budgets at the resource group or management groups scopes. Both cloud
services performance and budget requirements can be balanced through alerts on metrics and budgets.
Over time, the workload can be optimized to autoheal itself when alerts are triggered. For information about
using alerts, see Use cost alerts to monitor usage and spending.
When you receive an alert, check the current consumption data. Budget alerts aren't triggered in real time. There
may be a delay between the alert and the current actual cost. Look for significant difference between cost values
when the alert happened and the current cost. Next, conduct a cost review to discuss the cost trend, possible
causes, and any required action. For information about stakeholders in a cost review, see Cost reviews.
Determine short and long-term actions justified by business value. Can a temporary increase in the alert
threshold be a feasible fix? Does the budget need to be increased longer-term? Any increase in budget must be
approved.
If the alert was caused because of unnecessary or expensive resources, you can implement additional Azure
Policy controls. You can also add budget automation to trigger resource scaling or shutdowns.
After you identify and analyze your spending patterns, you can set budget limits for applications or business
units. You'll want to assign access to view or manage each budget to the appropriate groups. Setting several
alert thresholds for each budget can help track your burn down rate.
Generate cost reports
12/16/2022 3 minutes to read • Edit Online
Azure cost tools
Cost analysisCost analysis
NOTENOTE
Advisor recommendationsAdvisor recommendations
Consumption APIs
To monitor the cost of the workload, use Azure cost tools or custom reports. The reports can be scoped to
business units, applications, IT infrastructure shared services, and so on. Make sure that the information is
consistently shared with the stakeholders.
Azure provides cost tools that can help track cloud spend and make recommendations.
Azure Advisor
Advisor Score
Azure Cost Management
Cost analysisCost analysis is a tool in Azure Cost Management that allows you to view aggregated costs over a period. This
view can help you understand your spending trends.
View costs at different scopes, such as for a resource group or specific resource tags. Cost Analysis provides
built-in charts and custom views. You can also download the cost data in CSV format to analyze with other tools.
For more information, see: Quickstart: Explore and analyze costs with cost analysis.
There are many ways of purchasing Azure services. Not all are supported by Azure Cost Management. For example,
detailed billing information for services purchased through a Cloud Solution Provider (CSP) must be obtained directly
from the CSP. For more information about the supported cost data, see Understand cost management data.
Azure Advisor recommendations for cost can highlight the over-provisioned services and ways to lower cost.
For example, the virtual machines that should be resized to a lower SKU, unprovisioned ExpressRoute circuits,
and idle virtual network gateways.
For more information, see Advisor cost management recommendations.
Granular and custom reports can help track cost over time. Azure provides a set of Consumption APIs to
generate such reports. These APIs allow you to query and create various cost data. Data includes usage data for
Azure services and third-party services through Marketplace, balances, budgets, recommendations on reserved
instances, among others. You can configure Azure role-based access control (Azure RBAC) policies to allow only
a certain set of users or applications access the data.
For example, you want to determine the cost of all resources used in your workload for a given period. One way
of getting this data is by querying usage meters and the rate of those meters. You also need to know the billing
period of the usage. By combining these APIs, you can estimate the consumption cost.
Billing account API: To get your billing account to manage your invoices, payments, and track costs.
Billing Periods API: To get billing periods that have consumption data.
NOTENOTE
Custom scripts
Analyze and visualize
NOTENOTE
Usage Detail API: To get the breakdown of consumed quantities and estimated charges.
Marketplace Store Charge API: To get usage-based marketplace charges for third-party services.
Price Sheet API: To get the applicable rate for each meter.
The result of the APIs can be imported into analysis tools.
Consumption APIs are supported for Enterprise Enrollments and Web Direct Subscriptions (with exceptions). Check
Consumption APIs for updates to determine support for other types of Azure subscriptions.
For more information about common cost scenarios, see Billing automation scenarios.
Use Azure APIs to schedule custom scripts that identify orphaned or empty resources. For example, unattached
managed disks, load balancers, application gateways, or Azure SQL Servers with no databases. These resources
incur a flat monthly charge while unused. Other resources may be stale, for example VM diagnostics data in
blob or table storage. To determine if the item should be deleted, check its last use and modification timestamps.
Start with the usage details in the invoice. Review that information against relevant business data and events. If
there are anomalies, evaluate the significant changes in business or applications that might have contributed
those changes.
Power BI Desktop has a connector that allows you to connect billing data from Azure Cost Management. You can
create custom dashboards and reports, ask questions of your data and publish, and share your work.
Sharing requires Power BI Premium licenses.
For more information, see Create visuals and reports with the Azure Cost Management connector in Power BI
Desktop.
You can also use the Cost Management App. The app uses Azure Cost Management Template app for Power BI.
You can import and analyze usage data and incurred cost within Power BI.
Conduct cost reviews
12/16/2022 2 minutes to read • Edit Online
The goal of cost monitoring is to review cloud spend with the intent of establishing cost controls and preventing
any misuse. The organization should adopt both proactive and reactive review approaches for monitoring cost.
Effective cost reviews must be conducted by accountable stakeholders at a regular cadence. The reviews must be
complemented with reactive cost reviews, for example when a budget limit causes an alert.
Who should be included in a cost review?Who should be included in a cost review?
The financial stakeholders must understand cloud billing to derive business benefits using financial metrics to
make effective decisions.
Also include members of the technical team. Application owners, systems administrators who monitor and back-
up cloud systems, and business unit representatives must be aware of the decisions. They can provide insights
because they understand cloud cost metering and cloud architecture. One way of identifying owners of systems
or applications is through resource tags.
What should be the cadence of cost reviews?What should be the cadence of cost reviews?
Cost reviews can be conducted as part of the regular business reviews. It's recommended that such reviews are
scheduled,
During the billing period. This review is to create an awareness of the estimated pending billing. These
reports can be based on Azure Advisor, Advisor Score, and Azure Cost Management Cost analysis.
After the billing period. This review shows the actual cost with activity for that month. Use Balance APIs to
generate monthly reports. The APIs can query data that gets information on balances, new purchases, Azure
Marketplace service charges, adjustments, and overage charges.
Because of a budget alert or Azure Advisor recommendations.
Web Direct (pay-as-you-go) and Cloud Solution Provider (CSP) billing occurs monthly. While Enterprise
Agreement (EA) billing occurs annually, costs should still be reviewed monthly.
Checklist
-
Optimize cost
12/16/2022 2 minutes to read • Edit Online
Continue to monitor and optimize the workload by using the right resources and sizes. Use this checklist to
optimize a workload.
Review the underutilized resourcesReview the underutilized resources. Evaluate CPU utilization and network throughput over time to
check if the resources are used adequately. Azure Advisor identifies underutilized virtual machines. You
can choose to decommission, resize, or shut down the machine to meet the cost requirements.
Resize virtual machines
Shutdown the under utilized instances
Continuously take action on the cost reviewsContinuously take action on the cost reviews . Treat cost optimization as a process, rather than a
point-in-time activity. Use tooling in Azure that provides recommendations on usage or cost optimization.
Review the cost management recommendations and take action. Make sure that all stakeholders are in
agreement about the implementation and timing of the change.
RecommendedRecommended tab in the Azure portal
Recommendations in the Cost Management Power BI app
Recommendations in Azure Advisor
Recommendations using Reservation REST APIs
Use savings plansUse savings plans. By using savings plans, you get the most flexible savings for dynamic workloads,
while accommodating for planned or unplanned changes. Consider that with savings plan you commit to
a fixed hourly dollar amount collectively on compute services, globally.
Savings plans
Use Azure reser vationsUse Azure reser vations. You get the greatest cost savings for stable, predictable workloads with no
planned changes by using reserved instances or capacity. Consider that you are often committing to a
specific SKUs, throughput, or utilization targets, often in a particular Azure region.
Azure Reservations
Use discount pricesUse discount prices. These methods of buying Azure resources can lower costs.
Azure Hybrid Use Benefit
Azure Reservations
There are also payment plans offered at a lower cost:
Microsoft Azure Enterprise Agreement
Enterprise Dev Test Subscription
Cloud Service Provider (Partner Program)
Have a scale-in and scale-out policyHave a scale-in and scale-out policy. In a cost-optimized architecture, costs scale linearly with
demand. Increasing customer base shouldn't require more investment in infrastructure. Conversely, if
demand drops, scale-down of unused resources. Autoscale Azure resources when possible.
Autoscale instances
Reevaluate design choicesReevaluate design choices. Analyze the cost reports and forecast the capacity needs. You might need
to change some design choices.
Choose the right storage tierChoose the right storage tier. Consider using hot, cold, archive tier for storage account data.
Storage accounts can provide automated tiering and lifecycle management. For more information,
see Review your storage options
Choose the right data storeChoose the right data store. Instead of using one data store service, use a mix of data store
depending on the type of data you need to store for each workload. For more information,
reference Choose the right data store.
Choose Spot VMs for low priority workloadsChoose Spot VMs for low priority workloads. Spot VMs are ideal for workloads that can be
interrupted, such as highly parallel batch processing jobs.
Spot VMs
Optimize data transferOptimize data transfer. Only deploy to multiple regions if your service levels require it for
either availability or geo-distribution. Data going out of Azure datacenters can add cost because
pricing is based on Billing Zones.
Traffic across billing zones and regions
Reduce load on ser versReduce load on ser vers. Use Azure Content Delivery Network (CDN) and caching service to
reduce load on front-end servers. Caching is suitable for servers that are continually rendering
dynamic content that doesn't change frequently.
Use managed ser vicesUse managed ser vices. Measure the cost of maintaining infrastructure and replace it with Azure
PaaS or SaaS services.
Managed services
Autoscale instances
12/16/2022 2 minutes to read • Edit Online
Virtual machine instances
Stateless applications
In Azure, it's easier to grow a service with little to no downtime against downscaling a service, which usually
requires deprovisioning or downtime. In general, opt for scale-out instead of scale up.
For certain application, capacity requirements may swing over time. Autoscaling policies allow for less error-
prone operations and cost savings through robust automation.
For auto scaling, consider the choice of instance size. The size can significantly change the cost of your workload.
Choose smaller instances where workload is highly variable and scale out to get the desired level of
performance, rather than up. This choice will enable you to make your cost calculations and estimates
granular.
Many Azure services can be used to improve the application's ability to scale dynamically. Even where they may
not have been originally designed to do so.
For example, many ASP.NET stateful web applications can be made stateless. Then they can be autoscaled, which
results in cost benefit. You store state using Azure Redis Cache, or Azure Cosmos DB as a back-end session state
store through a Session State Provider.
Virtual machines
12/16/2022 3 minutes to read • Edit Online
Resize virtual machines
Virtual machines can be deployed in fix-sized blocks. These VMs must be adequately sized to meet capacity
demand and reduce waste.
For example, look at a VM running the SAP on Azure project can show you how initially the VM was sized based
on the size of existing hardware server (with cost around 1 K per month), but the real utilization of VM wasn't
more than 25% - but simple choosing the right VM size in the cloud we can achieve 75% saving (resize saving).
And by applying the snoozing you can get additional 14% of economy:
It's easy to handle cost comparison when you're well equipped and for this Microsoft provides the set of specific
services and tools that help you to understand and plan costs. These include the TCO Calculator, Azure Pricing
Calculator, Azure Cost Management (Cloudyn), Azure Migrate, Azure Cosmos DB Sizing Calculator, and the Azure
Site Recovery Deployment Planner.
Here are some strategies that you can use to lower cost for virtual machines.
You can lower cost by managing the size and the number of VMs.
Determine the load by analyzing the CPU utilization to make sure that the instance is adequately
utilized.
Ideally, with the right size, the current load should fit in a lower SKU of the same tier. Another way is to lower the
number instances and still keep the load below a reasonable utilization. Azure Advisor recommends load less
than 80% utilization for non-user facing workloads and 40% when user-facing workload. It also provides current
and target SKU information.
You can identify underutilized machines by adjusting the CPU utilization rule on each subscription.
Shut down underutilized instances
Spot VMs
Reserved VMs
Savings plans
Resizing a virtual machine does require the machine to be shut down and restarted. There might be a period of
time when it will be unavailable. Make sure you carefully time this action for minimal business impact.
Use the Star t/stop VMs during off-hoursStar t/stop VMs during off-hours feature of virtual machines to minimize waste. There are many
configuration options to schedule the start and stop times. The feature is suitable as a low-cost automation
option. For information, see Start/stop VMs during off-hours solution in Azure Automation.
Azure Advisor evaluates virtual machines based on CPU and network utilization over a time period. Then, the
recommended actions are shut down or resize instances and cost saving with both actions.
Some workloads don't have a business requirement to complete a job within a period.
Can the workload be interrupted?Can the workload be interrupted?
Spot VMs are ideal for workloads that can be interrupted, such as highly parallel batch processing jobs. These
VMs take advantage of the surplus capacity in Azure at a lower cost. They're also well suited for experimental,
development, and testing of large-scale solutions.
For more information, see Use Spot VMs in Azure or check out our Azure Virtual Machine Spot Eviction guide to
learn how to create a reliable interruptible workload in Azure.
Virtual machines are eligible for Azure Reservations. You can prepay for VM instances if you can commit to one
or three years. Reserved instances are appropriate for workloads that have a long-term usage pattern.
The discount only applies to compute and not the other meters used to measure usage for VMs. The discount
can be extended to other services that emit VM usage, such as Virtual machine scale sets and Container
services, to name a few. For more information, see Software costs not included with Azure Reserved VM
Instances and Services that get VM reservation discounts.
With reserved instances, you need to determine the VM size to buy. Analyze usage data using Reser vationsReser vations
Consumption APIsConsumption APIs and follow the recommendations of Azure portal and Azure Advisor to determine the size.
Reservations also apply to dedicated hosts. The discount is applied to all running hosts that match the
reservation scope and attributes. An important consideration is the SKU for the host. When selecting a SKU,
choose the VM series and type eligible to be a dedicated host. For more information, see Azure Dedicated Hosts
pricing.
For information about discounts on virtual machines, see How the Azure reservation discount is applied to
virtual machines.
Virtual machines are eligible for savings plans. By committing to a one-year or three-year savings plan, you
spend a fixed hourly dollar amount collectively on all compute services. Consider that the commitment spans
across all participating compute services globally.
Using savings plans can save you up to 66% on cost.
Caching data
12/16/2022 2 minutes to read • Edit Online
Lower costs associated with reliability and latency
Caching can be expensive
Caching is a strategy where you store a copy of the data in front of the main data store. The cache store is
typically located closer to the consuming client than the main store. Advantages of caching include faster
response times and the ability to serve data quickly. In doing so, you can save on the overall cost. Be sure to
assess the built-in caching features of Azure services used in your architecture. Azure also offers caching
services, such as Azure Cache for Redis or Azure CDN.
For information about what type of data is suitable for caching, reference Caching.
Caching can be a cheaper way of storing data, providing reliability, and reducing network latency.
Depending on the type of the data, determine if you need the complex capabilities of the backend data store,
such as data consistency. If the data is fully static, you can choose to store it only in a caching store. If the data
doesn't change frequently, consider placing a copy of the data in a caching store and refresh it from time to
time. For example, an application stores images in blob storage. Every time it requests an image, the business
logic generates a thumbnail from the main image and returns it to the caller. If the main image doesn't
change too often, then return the previously generated thumbnails stored in a cache. This way you can save
on resources required to process the image and lower the request response rate.
If the backend is unavailable, the cache continues to handle requests by using the copy until the backend fails
over to the backup data store.
Caching can also be an effective way of reducing network latency. Instead of reaching the central server, the
response is sent by using the cache. That way, the client can receive the response almost instantaneously. If
you need higher network performance and the ability to support more client connections, choose a higher
tier of the caching service. However, higher tiers will incur more costs.
Incorrect use of caching can result in severe business outcomes and higher costs.
Adding a cache will lead to multiple data sources in your architecture. There are added costs to keeping
them in sync. You may need to fill the cache before putting it in production. Filling the cache on the
application's first access can introduce latency or seeding the cache can have impact on application's start
time. If you don't refresh the cache, your customers can get stale data.
Invalidate the cache at the right time when there's latest information in the source system. Use strategies
to age out the cache when appropriate.
To make sure the caching layer is working optimally, add instrumentation. That feature will add
complexity and implementation cost.
Caching services such as Azure Cache for Redis are billed on the tier you choose. Pricing is primarily determined
on the cache size and network performance and they're dependent. A smaller cache will increase latency. Before
choosing a tier, estimate a baseline. An approach can be by load testing the number of users and cache size.
Tradeoffs for cost
12/16/2022 3 minutes to read • Edit Online
Cost vs reliability
Cost vs performance efficiency
As you design the workload, consider tradeoffs between cost optimization and other aspects of the design, such
as security, scalability, resilience, and operability.
What is most impor tant for the business: lowest cost, no downtime, high throughput?What is most impor tant for the business: lowest cost, no downtime, high throughput?
An optimal design doesn't equate to a low-cost design. There might be risky choices made in favor of a cheaper
solution.
Cost has a direct correlation with reliability.
Does the cost of high availability components exceed the acceptable downtime?Does the cost of high availability components exceed the acceptable downtime?
Overall Service Level Agreement (SLA), Recovery Time Objective (RTO), and Recovery Point Objective (RPO)
may lead to expensive design choices. If your service SLAs, RTOs, and RPOs times are short, then higher
investment is inevitable for high availability and disaster recovery options.
For example, to support high availability, you choose to host the application across regions. This choice is
costlier than single region because of the replication costs or the need provisioning extra nodes. Data transfer
between regions will also add cost.
If the cost of high availability exceeds the cost of downtime, you can save by using Azure platform-managed
replication and recover data from the backup storage.
For resiliency, availability, and reliability considerations, see the Reliability pillar.
Boosting performance will lead to higher cost.
Many factors impact performance.
Fixed or consumption-based provisioningFixed or consumption-based provisioning. Avoid cost estimation of a workload at consistently high
utilization. Consumption-based pricing will be more expensive that the equivalent provisioned pricing. Smooth
out the peaks to get a consistent flow of compute and data. Ideally, use manual and autoscaling to find the right
balance. Scaling up is more expensive than scaling out.
Azure regionsAzure regions. Cost scales directly with number of regions. Locating resources in cheaper regions shouldn't
negate the cost of network ingress and egress or by degraded application performance because of increased
latency.
CachingCaching. Every render cycle of a payload consumes both compute and memory. You can use caching to reduce
load on servers and save with pre-canned storage and bandwidth costs, and the savings can be dramatic,
especially for static content services.
While caching can reduce cost, there are some performance tradeoffs. For example, Azure Traffic Manager
pricing is based on the number of DNS queries that reach the service. You can reduce that number through
caching and configure how often the cache is refreshed. Relying on the cache that isn't frequently updated will
cause longer user failover times if an endpoint is unavailable.
Cost vs security
Cost vs operational excellence
Batch or real-time processingBatch or real-time processing. Using dedicated resources for batch processing long running jobs will
increase the cost. You can lower cost by provisioning Spot VMs but be prepared for the job to be interrupted
every time Azure evicts the VM.
For performance considerations, see the Performance Efficiency pillar.
Increasing security of the workload will increase cost.
As a rule, don't compromise on security. For certain workloads, you can't avoid security costs. For example, for
specific security and compliance requirements, deploying to differentiated regions will be more expensive.
Premium security features can also increase the cost. There are areas you can reduce cost by using native
security features. For example, avoid implementing custom roles if you can use built-in roles.
For security considerations, see the Security Pillar.
Investing in systems monitoring and automation might increase the cost initially but over time will reduce cost.
IT operations processes like user or application access provisioning, incident response, and disaster recovery
should be integrated with the workload.
Cost of maintaining infrastructure is more expensive. With PaaS or SaaS services, infrastructure, platform
management services, and additional operational efficiencies are included in the service pricing.
For operational considerations, see the Operational Excellence pillar.
Overview of the operational excellence pillar
12/16/2022 2 minutes to read • Edit Online
Topics
OP E RAT IO N A L EXC EL L E NC E TO P IC SO P ERAT IO N A L EXC EL L E N C E TO P IC S DESC RIP T IONDESC RIP T ION
Application design Provides guidance on how to design, build, and orchestrate
workloads with DevOps principles in mind.
Monitoring Monitoring and diagnostics are essential to any workload.
Specifically for cloud applications that run in a remote
datacenter often this discipline becomes even more crucial.
Application performance management The monitoring and management of performance, and
availability of software applications through DevOps.
Code deployment How you deploy your application code is one of the key
factors that determines your application stability.
Infrastructure provisioning Frequently known as
Deployment Automation
or
Infrastructure as code
, this discipline refers to best practices
for deploying the platform where your application will run.
Testing Testing is fundamental to prepare for the unexpected and to
catch mistakes before they impact users.
The operational excellence pillar covers the operations processes that keep an application running in production.
Deployments must be reliable and predictable. Automated deployments reduce the chance of human error. Fast
and routine deployment processes won't slow down the release of new features or bug fixes. Equally important,
you must be able to quickly roll back or roll forward if an update has problems.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, reference
the Microsoft Azure Well-Architected Review.
We recommend the following video to help you achieve operational excellence with the Azure Well-Architected
Framework:
The Microsoft Azure Well-Architected Framework includes the following topics in the operational excellence
pillar:
Enforcing resource-level rules through Azure Policy helps ensure adoption of operational excellence best
practices for all the assets, which support your workload. For example, Azure Policy can help ensure all the VMs
supporting your workload adhere to a pre-approved list of VM SKUs.
Azure Advisor provides a set of Azure Policy recommendations to help you quickly identify opportunities to
implement Azure Policy best practices for your workload.
Use the DevOps checklist to review your design from a management and DevOps standpoint.
Next steps
Reference the operational excellence principles to guide you in your overall strategy.
Principles
Operational excellence design principles
12/16/2022 2 minutes to read • Edit Online
Optimize build and release processes
Understand operational health
The principles of operational excellence are a series of considerations that can help achieve superior operational
practices.
To achieve a higher competency in operations, consider and improve how software is:
Developed
Deployed
Operated
Maintained
Equally important, provide a team culture, which includes:
Experimentation and growth
Solutions for rationalizing the current state of operations
Incident response plans
To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the
Microsoft Azure Well-Architected Review.
The following design principles provide:
Context for questions
Why a certain aspect is important
How an aspect is applicable to Operational excellence
These critical design principles are used as lenses to assess the Operational excellence of an application
deployed on Azure. These lenses provide a framework for the application assessment questions.
Embrace software engineering disciplines across your entire environment, which include the following
disciplines:
Provision with Infrastructure as Code
Build and release with continuous integration and continuous delivery (CI/CD) pipelines
Use automated testing methods
Avoid configuration drift through configuration as code
This approach ensures the creation and management of environments throughout the software development
lifecycle. It enables:
Consistency
Repetition
Early detection of issues
Implement systems and processes to monitor all aspects of your workload. Including:
IMPORTANTIMPORTANT
Rehearse recovery and practice failure
Embrace continuous operational improvement
Use loosely coupled architecture
Build and release processes
Infrastructure health
Application health
Robust monitoring ensures the observability of a workload and allows you to correlate events and take
proactive mitigating issues.
In addition, customer data is critical to understanding the health of a workload and whether the service is
meeting the business goals.
The Health Modeling section of the Azure Mission-Critical framework contains further in-depth guidance and examples
on how to build a health model for a given workload.
Rehearse recovery and practice failure using the following methods:
Run disaster recovery (DR) drills on a regular cadence.
Use chaos engineering practices to identify and remediate weak points in application reliability.
Rehearse failure to validate the effectiveness of recovery processes and ensure teams are familiar with their
responsibilities.
Document past failures and automate their remediation where possible.
Teams that embrace continuous operational improvement continuously evaluate and refine operational
procedures and tasks. They strive to reduce complexity and ambiguity whenever possible.
Adopting a continuous improvement culture helps organizations:
Evolve processes over time.
Optimize inefficiencies and associated processes.
Learn from failures.
Continuously evaluate new opportunities.
Use modern architecture patterns such as:
microservices
loosely coupled
serverless
and pair this with cloud design patterns such as:
Circuit breakers
Load-Leveling
Throttling
and advanced deployment strategies like:
Canary
Blue-green
Next stepNext step
Staggered
to enable teams to build and deploy services independently and minimize the impact if there is a service failure.
This principle also extends to procedural decoupling. Teams will be able to take full advantage of their loosely
coupled architecture if they do not have to depend on partner teams to support, approve, or operate their
workloads.
Automation overview
Automation overview: Goals, best practices, and
types
12/16/2022 4 minutes to read • Edit Online
Goals of automation
Automate toil to improve efficiency
Automation best practices
Automation has revolutionized how businesses operate and this trend continues to evolve. Businesses have
moved to automating manual processes so that engineers can focus attention on tasks that add business value.
Automating business solutions allow you to:
Activate resources on demand.
Deploy solutions rapidly.
Minimize human error in configuring repetitive tasks.
Produce consistent and repeatable results.
To learn more, see Deployment considerations for automation.
When automating technical processes, a common approach for some organizations is to automate what they
can and leave the more difficult processes for humans to perform manually.
A goal of automation is to make tools that do what humans can do, only better. For example, a human can
perform any given task once. But when the task requires repetitive runs, especially over long time periods, an
automated system is better equipped to do this with more predictable results that are error-free. Increasing
speed is another goal in automation. When you practice these automation goals, you can build systems that are
faster, repetitive, and can run on a daily basis.
Most automation involves a percentage of
toil
. Toil is the operational work that is related to a process that is
manual, repetitive, can be automated, and has minimal value. It is counterproductive to automation but in many
organizations, a small amount of toil is unavoidable. It becomes an issue when too much toil slows progress. A
project's production velocity will decrease if engineers are continuously interrupted by manual tasks attributed
to toil, either planned or unplanned. Too much toil can impact job satisfaction. Engineers become dissatisfied
when they find themselves spending too much time on operational toil rather than on other projects.
Automation should be developed, and increased, so engineers can eliminate future toil. By reducing toil,
engineers can concentrate on innovating business solutions.
For more information, see Toil automation.
Ensure consistency:Ensure consistency: The more manual processes are involved, the more prone to human error. This can
lead to mistakes, oversights, reduction of data quality, and reliability problems.
Centralize mistakes:Centralize mistakes: Choose a platform to allow you to fix bugs in one place in order to fix them
everywhere. This reduces the chance of error and the possibility of the bug being re-introduced.
Identify issues quickly:Identify issues quickly: Complex issue may not always be able to be identified in a timely manner.
However, with good automation, detection of these issues should occur quickly.
Maximize employee productivity:Maximize employee productivity: Automation leads to more innovative solutions, and in general
provides more value to the business. This in turn raises morale and job satisfaction.Once a process is
Types of automation
Infrastructure deploymentInfrastructure deployment
Infrastructure configurationInfrastructure configuration
Operational tasksOperational tasks
automated, training and maintenance can be greatly reduced or eliminated. This frees engineers to spend
less time on manual processes and more time on automating business solutions.
Three types of automation are described in this article:
infrastructure deployment
infrastructure configuration
operational tasks.
These categories share the same goals and best practices mentioned previously. They differ in areas where
Azure provides solutions that help achieve optimal automation. Other types of automation such as continuous
deployment and continuous delivery are described further in the Operational Excellence pillar.
As businesses move to the cloud, they need to
repeatedly
deploy their solutions and know that their
infrastructure is in a
reliable
state. To meet these challenges, you can automate deployments using a practice
referred to as infrastructure as code. In code, you define the infrastructure that needs to be deployed.
There are many deployment technologies you can use with Azure. Here are three examples:
Azure Resource Manager (ARM) templates
Azure Bicep
Terraform
These technologies use a
declarative
approach. This lets you state what you intend to deploy without having to
write the sequence of programming commands to create it. You can deploy not only virtual machines, but also
the network infrastructure, storage systems, and any other resources you may need.
If you don't manage configuration carefully, your business could encounter disruptions such as systems outages
and security issues. Optimal configuration can enable you to quickly detect and correct configurations that could
interrupt or slow performance.
When creating new resources on Azure, you may take advantage of configuration as code to bootstrap the
deployment.
Configuration tools can also be used to configure and manage the ongoing state of deployed resources.
As the demand for speed in performing operational tasks increases over time, you are expected to deliver things
faster and faster. Manually performing operational tasks will fail to scale as demand increases. This is where
automation can help. To meet on-demand delivery using an automation platform, you need to develop
automation components (such as runbooks and configurations), create integrations to systems that are already
in place efficiently, and operate and troubleshoot.
Advantages of automating operational tasks include:
Optimize and extend existing processes.
Deliver flexible and reliable services.
Lower costs.
Improve predictability.
Two popular options for automating operational tasks are:
Azure functionsAzure functions - Run code without managing the underlying infrastructure on where the code is run.
Next steps
Azure automationAzure automation- Uses a programming language to automate operational tasks in code and executed on
demand.
For more information, see Automation.
Automate Repeatable Infrastructure
Repeatable Infrastructure
12/16/2022 13 minutes to read • Edit Online
Azure Landing Zones (Repeatable environment configuration)
WARNINGWARNING
Historically, deploying a new service or application involves manual work such as procuring and preparing
hardware, configuring operating environments, and enabling monitoring solutions. Ideally, an organization
would have multiple environments in which to test deployments. These test environments should be similar
enough to production that deployment and run time issues are detected before deployment to production. This
manual work takes time, is error-prone, and can produce inconsistencies between the environments if not done
well.
Cloud computing changes the way we procure infrastructure. No longer are we unboxing, racking, and cabling
physical infrastructure. We have internet accessible management portals and REST interfaces to help us. We can
now provision virtual machines, databases, and other cloud services on demand and globally. When we no
longer need cloud services, they can be easily deleted.
However, cloud computing alone does not remove the effort and risk in provisioning infrastructure. When using
a cloud portal to build systems, many of the same manual configuration tasks remain. Application servers
require configuration, databases need networking, and firewalls need firewalling.
Organizations that centrally manage, govern, or support multiple workloads in the cloud will require repeatable
and consistentand consistent environments. Azure Landing Zones provide central operations teams with a repeatable
approach to environmental configuration. To deliver consistent environments, all Azure Landing Zones provide a
set of common design areas, reference architecture, reference implementation, and a process to modify that
deployment to fit the organization design requirements.
Some organizations are following a growing industry trend towards decentralized operations (or workload operations).
When operations is decentralized, the organization chooses to accept duplication of resources and potential
inconsistencies in environmental configuration, in favor of reduced dependencies and full control of the environment
through Azure Pipelines. Organizations who are following a decentralized operating model are less likely to leverage Azure
Landing Zones to create repeatable environment configurations, but will still find value in the subsequent sections of this
article.
The following is a series of links from the Cloud Adoption Framework to help deploy Azure Landing Zones:
All Azure Landing Zones adhere to a common set of design areas to guide configuration of required
environment considerations including: Identity, Network topology and connectivity, Resource organization,
Governance disciplines, Operations baseline, and Business continuity and disaster recovery (BCDR)
All Azure Landing Zones can be deployed through the Azure portal, but are designed to leverage
infrastructure as code to create, test, and refactor repeatable deployments of the environmental
configuration.
The Cloud Adoption Framework provides a number of Azure Landing Zone implementation options,
including:
Start small & expand implementation using Azure Blueprints and ARM Templates
Enterprise-Scale implementation using Azure Policy and ARM Templates
CAF Terraform modules and various landing zone options
Deploy infrastructure with code
Gain higher confidenceGain higher confidence
To get started with Azure Landing Zones to create consistent, repeatable environment configuration, see the
article series on Azure Landing Zones.
If Azure Landing Zones are not a fit for your organization, you should consider the following sections of this
article to manual integrate environment configuration into your Azure Pipelines.
To fully realize deployment optimization, reduce configuration effort, and automate full environments'
deployment, something more is required. One option is referred to as infrastructure as code.
Infrastructure as code (IaC) is the management of infrastructure - such as virtual machines, load balancers, and
connection topology - in a descriptive model, using a versioning system that is similar to what is used for source
code. When you're creating an application, the same source code will generate the same binary every time it is
compiled. In a similar manner, an IaC model generates the same environment every time it is applied. IaC is a
key DevOps practice, and it is often used with continuous delivery.
Ultimately, IaC allows you and your team to develop and release changes faster, but with much higher
confidence in your deployments.
One of the biggest benefits of IaC is the level of confidence you can have in your deployments, and in your
understanding of the infrastructure and its configuration.
Integrate with your process.Integrate with your process. If you have a process by which code changes are peer reviewed, you can use
that same process for reviewing infrastructure changes. This review process can help proactively detect
problematic configurations that are difficult to detect when making manual infrastructure changes.
Consistency.Consistency. Following an IaC process ensures that the whole team follows a standard, well-established
process. Historically, some organizations designate a single or small set of individuals responsible for deploying
and configuring infrastructure. By following a fully automated process, responsibility for infrastructure
deployments moves from individuals into the automation process and tooling. This move broadens the number
of team members who can initiate infrastructure deployments while maintaining consistency and quality.
Automated scanning.Automated scanning. Many types of IaC configurations can be scanned by automated tooling. One such type
of tooling is linting to check for errors in the code. Another type will scan the proposed changes to your Azure
infrastructure to ensure they follow security and performance best practices. This can be an important part of a
Continuous Security approach.
Secret management.Secret management. Most solutions require secrets to be maintained and managed. These include connection
strings, API keys, client secrets, and certificates. Following an IaC approach means that you need to adopt best
practices for managing secrets. For example, Azure Key Vault is used to store secrets securely. It integrates with
many IaC tools and configurations to ensure that the person conducting the deployment doesn't need access to
production secrets. This in turn helps you adhere to the security principle of least privilege.
Access control.Access control. A fully automated IaC deployment pipeline means that an automated process should perform
all changes to your infrastructure resources. This approach has many security benefits. By automating your
deployments, you can be confident that changes deployed to your environment have followed the correct
procedure. You can even consider expanding the number of people who can initiate a deployment since the
deployment itself is done in a fully automated way. Ideally, you would remove the ability for humans to
manually modify your cloud resources and instead rely completely on automated processes. In emergencies,
you may allow for this to be overridden, by using a 'break glass' account or Privileged Identity Management.
Avoid configuration drift.Avoid configuration drift. When using IaC, you can redeploy all of your environment templates on every
release of your solution. IaC tooling is generally built to be idempotent, which means that it can run several
times and will produce the same result each time.
Manage multiple environmentsManage multiple environments
Better understand your cloud resourcesBetter understand your cloud resources
Running infrastructure as code operations frequently has several benefits.
Avoids deployment staleness, which prevents unforeseen issues during a redeployment. (for example as part
of a disaster recovery plan)
Reduces complexity overall as there is one process that is rehearsed often.
Helps to avoid configuration drift. Accidental changes outside of the regular pipeline are corrected quickly
and the source of truth for your environment's configuration remains in code.
Many organizations maintain multiple environments, for example, test, staging, and production. In some cases,
multiple production environments are maintained for multi-tenanted solutions or geographically distributed
applications. Ensuring consistency across these can be difficult; using infrastructure as code solutions can help.
Manage non-production environments.Manage non-production environments. A common pain point for organizations is when non-production
environments are dissimilar to production environments. Often, when building production and non-production
environments by hand, the configuration of each won't match. This mismatch slows down the testing of changes
and reduces confidence that changes won't harm a production system. When following an IaC approach, this
problem is minimized. Using IaC automation, the same infrastructure configuration files can be used for all
environments, producing almost identical environments. When needed, differentiation can be achieved using
input parameters for each environment.
Dynamically provision environments.Dynamically provision environments. Once you have your IaC configurations defined, you can use them to
provision new environments more efficiently. This agility can be enormously helpful when you're testing your
solution. For example, you could quickly provision a duplicate of your production environment that can then be
used for security penetration tests, load testing or help a developer track down a bug.
Scale production environments.Scale production environments. IaC configurations can be used to deploy more instances of your solution,
ensuring consistency between all environments. This can be useful if you are following certain deployment
patterns or if you need to extend your services to a new geographic region. An example is the Deployment
Stamps pattern.
Disaster recover y.Disaster recover y. In some situations, where recovery time may not be time-sensitive, IaC configurations can
be used as part of a disaster recovery plan. For example, if infrastructure needs to be recreated in a second
region, your IaC configurations can be used to do so. You need to consider deployment time and restoring the
state of your infrastructure as well as the infrastructure itself.
When you plan for disaster and recovery, ensure that your disaster recovery plans are fully tested and that they
meet your Recovery Time Objective.
IaC can also help you better understand the state of your cloud resources.
Audit changes.Audit changes. Changes to your IaC configurations will be version-controlled in the same way as your code,
such as through Git's version history. This means you can review each change that has happened, and
understand who made it and when. This can be helpful if you're trying to understand why a resource is
configured a specific way.
Metadata.Metadata. Many types of IaC configurations let you add metadata, like code comments, to help explain why
something is done a particular way. If your organization has a culture of documenting your code, apply the
same principles to your infrastructure code.
Keep ever ything together.Keep ever ything together. It's common for a developer to work on features that require both code and
infrastructure changes. By keeping infrastructure defined as code, you can group application and infrastructure
code to understand the relationship between them better. For example, if you see a change to an IaC
configuration on a feature branch or in a pull request, you'll have a clearer understanding of what that change
relates to.
Categories of IaC tooling
Automate deployments with ARM Templates
Better understand Azure itself.Better understand Azure itself. The Azure portal is a great way to provision and configure resources;
however, it often simplifies the underlying resource model used. Using IaC will mean that you gain a much
deeper understanding of what is happening in Azure and how to troubleshoot it if something isn't working
correctly. For example, when creating a set of virtual machines using the Azure portal, some of the underlying
resource creation is abstracted for the deployment process. When using IaC not only do you have explicit
control over resource creation, very little is abstracted from the process, which provides a much richer
understanding of what is deployed and how it is configured.
You can use many declarative infrastructure deployment technologies with Azure. Deployment technologies fall
into two main categories.
Imperative IaCImperative IaC involves writing scripts in a language like Bash, PowerShell, C# script files, or Python. These
programmatically execute a series of steps to create or modify your resources. When using imperative
deployments, it is up to you to manage things like dependency sequencing, error control, and resource
updates.
Declarative IaCDeclarative IaC involves writing a definition of how you want your environment to look; the tooling then
figures out how to make this happen by inspecting your current state, comparing it to the target state you've
requested, and applying the differences. There's a good discussion of imperative and declarative IaC here.
There are great Azure tooling options for both models. Here we describe three of the commonly used
declarative IaC technologies for Azure - ARM templates, Bicep, and Terraform.
Azure Resource Manager (ARM) Templates provide an Azure native infrastructure as code solution. ARM
Templates are written in a language derived from JavaScript Object Notation (JSON), and they define the
infrastructure and configurations for Azure deployments. An ARM template is declarative, you state what you
want to deploy, provide configuration values, and the Azure engine takes care of making the necessary Azure
REST API put requests. Other benefits of using ARM templates for infrastructure deployments include:
Parallel resource deployment:Parallel resource deployment: the Azure deployment engine sequences resource deployments based on
defined dependencies. If dependencies do not exist between two resources, they are deployed at the same
time.
Modular deployments:Modular deployments: ARM templates can be broken up into multiple template files for reusability and
modularization.
Day one resource suppor t:Day one resource suppor t: ARM templates support all Azure resources and resource properties as they
are released.
Extensibility:Extensibility: Azure deployments can be extended using deployment scripts and other automation
solutions.
Validation:Validation: Azure deployments are evaluated against a validation API to catch configuration mistakes.
Testing:Testing: the ARM template test toolkit provides a static code analysis framework for testing ARM templates.
Change preview:Change preview: ARM template what-if allows you to see what will be changed before deploying an ARM
template.
Tooling:Tooling: Language service extensions are available for both Visual Studio Code and Visual Studio to help
you author ARM templates.
The following example demonstrates a simple ARM template that deploys a single Azure Storage account. In this
example, a single parameter is defined to take in a name for the storage account. Under the resources section, a
storage account is defined, the
storageName
parameter is used to provide a name, and the storage account
details are defined. See the included documentation for an in-depth explanation of the different sections and
configurations for ARM templates.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"storageName": {
"type": "string",
"defaultValue": "newStorageAccount"
}
},
"resources": [
{
"name": "[parameters('storageName')]",
"type": "Microsoft.Storage/storageAccounts",
"apiVersion": "2019-06-01",
"location": "[resourceGroup().location]",
"kind": "StorageV2",
"sku": {
"name": "Premium_LRS",
"tier": "Premium"
}
}
]
}
Automate deployments with Bicep
param location string = resourceGroup().location
param storageAccountName string = 'toylaunch${uniqueString(resourceGroup().id)}'
resource storageAccount 'Microsoft.Storage/storageAccounts@2021-06-01' = {
name: storageAccountName
location: location
sku: {
name: 'Standard_LRS'
}
kind: 'StorageV2'
properties: {
accessTier: 'Hot'
}
}
Learn moreLearn more
Documentation: What are ARM templates
Learn module: Deploy consistent infrastructure with ARM Templates
Code Samples: ARM templates
Bicep is a domain-specific language (DSL) that uses declarative syntax to deploy Azure resources. Bicep provides
concise syntax, reliable type safety, and support for code reuse.
You can think of Bicep as a revision to the Azure Resource Manager template (ARM template) language rather
than a new language. The syntax is different, but the core functionality and runtime remain the same.
The following example shows a simple Bicep file that defines a storage account.
Learn moreLearn more
Documentation: What is Bicep
Documentation: Frequently Asked Questions
Documentation: Key Bicep Concepts
Automate deployments with Terraform
resource "azurerm_resource_group" "example" {
name = "newStorageAccount"
location = "eastus"
}
resource "azurerm_storage_account" "example" {
name = "storageaccountname"
resource_group_name = azurerm_resource_group.example.name
location = azurerm_resource_group.example.location
account_tier = "Standard"
account_replication_type = "GRS"
}
Manual deployment
Hotfix process
Next steps
Terraform is a declarative framework for deploying and configuring infrastructure that supports many private
and public clouds, Azure being one of them. It has the main advantage of offering a cloud-agnostic framework.
While Terraform configurations are specific to each cloud, the framework itself is the same for all of them.
Terraform configurations are written in a domain-specific language (DSL) called Hashicorp Configuration
Language.
The following example demonstrates a simple Terraform configuration that deploys an Azure resource group
and a single Azure Storage account.
Take note, the Terraform provider for Azure is an abstraction on top of Azure APIs. This abstraction is beneficial
because the API complexities are obfuscated. This abstraction comes at a cost; the Terraform provider for Azure
doesn't always provide parity with the Azure APIs' capabilities. To learn more about using Terraform on Azure,
see Using Terraform on Azure
Manual deployment steps introduce significant risks where human error is concerned and also increases overall
deployment times. However, in some cases, manual steps may be required. For these cases, ensure that any
manual steps are documented, including roles and responsibilities.
In some cases, you may have an unplanned deployment need. For instance, to deploy critical hotfixes or security
remediation patches. A defined process for unplanned deployments can help prevent service availability and
other deployment issues during these critical events.
Automate infrastructure configuration
Configure infrastructure
12/16/2022 6 minutes to read • Edit Online
Bootstrap automation
Azure VM extensionsAzure VM extensions
az vm extension set \
--resource-group myResourceGroup \
--vm-name myVM --name customScript \
--publisher Microsoft.Azure.Extensions \
--settings '{"commandToExecute": "apt-get install -y nginx"}'
When working with Azure, many services can be created and configured programmatically using automation or
infrastructure as code tooling. These tools access Azure through the exposed REST APIs or what we refer to as
the Azure control plane. For example, an Azure Network Security Group can be deployed, and security group
rules created using an Azure Resource Manager template. The Network Security Group and its configuration are
exposed through the Azure control plane and natively accessible.
Other configurations, such as installing software on a virtual machine, adding data to a database, or starting
pods in an Azure Kubernetes Service cluster can't be accessed through the Azure control plane. These actions
require a different set of configuration tools. We consider these configurations as being on the Azure data plane
side, or not exposed through Azure REST APIs. These data plane enables tools to use agents, networking, or
other access methods to provide resource-specific configuration options.
For example, when deploying a set of virtual machines to Azure, you may also want to install and configure a
web server, stage content, and then make the content available on the internet. Furthermore, if the virtual
machine configuration changes and no longer aligns with the configuration definition, you may want a
configuration management system to remediate the configuration. Many options are available for these data
plane configurations. This document details several and provides links for in-depth information.
When deploying to Azure, you may need to run post-deployment virtual machine configuration or run other
arbitrary code to bootstrap the deployed Azure resources. Several options are available for these bootstrapping
tasks and are detailed in the following sections of this document.
Azure virtual machine extensions are small packages that run post-deployment configuration and automation
on Azure virtual machines. Several extensions are available for many different configuration tasks, such as
running scripts, configuring antimalware solutions, and configuring logging solutions. These extensions can be
installed and run on virtual machines using an ARM template, the Azure CLI, Azure PowerShell module, or the
Azure portal. Each Azure VM has a VM Agent installed, and this agent manages the lifecycle of the extension.
A typical VM extension use case would be to use a custom script extension to install software, run commands,
and perform configurations on a virtual machine or virtual machine scale set. The custom script extension uses
the Azure virtual machine agent to download and execute a script. The custom script extensions can be
configured to run as part of infrastructure as code deployments such that the VM is created, and then the script
extension is run on the VM. Extensions can also be run outside of an Azure deployment using the Azure CLI,
PowerShell module, or the Azure portal.
In the following example, the Azure CLI is used to deploy a custom script extension to an existing virtual
machine, which installs a Nginx webserver.
Learn moreLearn more
cloud
-
initcloud
-
init
#cloud-config
package_upgrade: true
packages:
- nginx
az group create --name myResourceGroupAutomate --location eastus
az vm create \
--resource-group myResourceGroupAutomate \
--name myAutomatedVM \
--image UbuntuLTS \
--admin-username azureuser \
--generate-ssh-keys \
--custom-data cloud-init.txt
Azure deployment script resourceAzure deployment script resource
Use the included code sample to deploy a virtual machine and configure a web server on that machine with the
custom script extension.
Documentation: Azure virtual machine extensions
Code Samples: Configure VM with script extension during Azure deployment
cloud-init is a known industry tool for configuring Linux virtual machines on first boot. Much like the Azure
custom script extension, cloud-init allows you to install packages and run commands on Linux virtual machines.
cloud-init can be used for things like software installation, system configurations, and content staging. Azure
includes many cloud-init enable Marketplace virtual machine images across many of the most well-known Linux
distributions. For a full list, see cloud-init support for virtual machines in Azure.
To use cloud-init, create a text file named
cloud-init.txt
and enter your cloud-init configuration. In this example,
the Nginx package is added to the cloud-init configuration.
Create a resource group for the virtual machine.
Create the virtual machine, specifying the
--custom-data
property with the cloud-inti configuration name.
On boot, cloud-init will use the systems native package management tool to install Nginx.
Learn moreLearn more
Documentation: cloud-init support for virtual machines in Azure
When performing Azure deployments, you may need to run arbitrary code for bootstrapping things like
managing user accounts, Kubernetes pods, or querying data from a non-Azure system. Because none of these
operations are accessible through the Azure control plane, some other mechanism is required for performing
this automation. To run arbitrary code with an Azure deployment, check out the
Microsoft.Resources/deploymentScripts Azure resource.
The deployment script resource behaves similar to any other Azure resource:
Can be used in an ARM template.
Contain ARM template dependencies on other resources.
Consume input, produce output.
Use a user-assigned managed identity for authentication.
When deployed, the deployment script runs PowerShell or Azure CLI commands and scripts. Script execution
{
"type": "Microsoft.Resources/deploymentScripts",
"apiVersion": "2019-10-01-preview",
"name": "runPowerShellScript",
"location": "[resourceGroup().location]",
"kind": "AzurePowerShell",
"identity": {
"type": "UserAssigned",
"userAssignedIdentities": {"[parameters('identity')]": {}}
},
"properties": {
"forceUpdateTag": "1",
"azPowerShellVersion": "3.0",
"arguments": "[concat('-sqlServer ', parameters('sqlServer'))]",
"primaryScriptUri": "[variables('script')]",
"timeout": "PT30M",
"cleanupPreference": "OnSuccess",
"retentionInterval": "P1D"
}
}
Configuration Management
Azure Automation State ConfigurationAzure Automation State Configuration
configuration linuxpackage {
Import-DSCResource -Module nx
Node "localhost" {
nxPackage nginx {
Name = "nginx"
Ensure = "Present"
}
}
}
and logging can be observed in the Azure portal or with the Azure CLI and PowerShell module. Many options
can be configured like environment variables for the execution environment, timeout options, and what to do
with the resource after a script failure.
The following example shows an ARM template snippet with the deployment script resource configured to run a
PowerShell script.
Learn moreLearn more
Documentation: Use deployment scripts in templates
Configuration management tools can be used to configure and manage the ongoing configuration and state of
Azure virtual machines. Three popular options are Azure Automation State Configuration, Chef, and Puppet.
Azure Automation State Configuration is a configuration management solution built on top of PowerShell
Desired State Configuration (DSC). State configuration works with Azure virtual machines, on-premises
machines, and machines in a cloud other than Azure. Using state configuration, you can import PowerShell DSC
resources and assign them to many virtual from a central location. Once each endpoint has evaluated and / or
applied the desired state, state compliance is reported to Azure and can be seen on a built-in dashboard.
The following example uses PowerShell DSC to ensure the NGINX has been installed on Linux systems.
Once imported into Azure State Configuration and assigned to nodes, the state configuration dashboard
provides compliance results.
ChefChef
PuppetPuppet
Next steps
Learn moreLearn more
Use the included code sample to deploy Azure Automation State Configuration and several Azure virtual
machines. The virtual machines are also onboarded to state configuration, and a configuration applied.
Documentation: Get started with Azure Automation State Configuration
Example Scenario: Azure Automation State Configuration
Chef is an automation platform that helps define how your infrastructure is configured, deployed, and managed.
Additional components included Chef Habitat for application lifecycle automation rather than the infrastructure
and Chef InSpec that helps automate compliance with security and policy requirements. Chef Clients are
installed on target machines, with one or more central Chef Servers that store and manage the configurations.
Learn moreLearn more
Documentation: An Overview of Chef
Puppet is an enterprise-ready automation platform that handles the application delivery and deployment
process. Agents are installed on target machines to allow Puppet Master to run manifests that define the desired
configuration of the Azure infrastructure and virtual machines. Puppet can integrate with other solutions such as
Jenkins and GitHub for an improved DevOps workflow.
Learn moreLearn more
Documentation: How Puppet works
Automate operational tasks
Automate operational tasks
12/16/2022 7 minutes to read • Edit Online
Azure Functions
Operational tasks can include any action or activity you may perform while managing systems, system access,
and processes. Some examples include rebooting servers, creating accounts, and shipping logs to a data store.
These tasks may occur on a schedule, as a response to an event or monitoring alert, or ad-hoc based on external
factors. Like many other activities related to managing computer systems, these activities are often performed
manually, which takes time, and are error-prone.
Many of these operational tasks can and should be automated. Using scripting technologies and related
solutions, you can shift effort from manually performing operational tasks towards building automation for
these tasks. In doing so, you achieve so much, including:
Reduce time to perform an action.
Reduce risk in performing the action.
Automated response to events and alerts.
Increased human capacity for further innovation.
When working in Azure, you have many options for automating operational tasks. This document details some
of the more popular.
Azure Functions allows you to run code without managing the underlying infrastructure on where the code is
run. Functions provide a cost-effective, scalable, and event-driven platform for building applications and running
operational tasks. Functions support running code written in a range of programming languages including C#,
Java, JavaScript, Python, and PowerShell.
When creating a Function, a hosting plan is selected. Hosting plans controls how a function app is scaled,
resource availability, and availability of advanced features such as virtual network connectivity and startup time.
The hosting plan also influences the cost.
Functions hosting plans:
Consumption:Consumption: Default hosting plan, pay only for Function execution time, configurable timeout period,
automatic scale.
Premium:Premium: Faster start, VNet connectivity, unlimited execution duration, premium instance sizes, more
predictable pricing.
Dedicated:Dedicated: Functions run on dedicated virtual machines and can use custom images.
For full details on consumption plans, see Azure Functions scale and hosting.
Functions provide event-driven automation; each function has a trigger associated with it. These triggers are
what run the functions. Common triggers include:
HTTP / Webhook :HTTP / Webhook : Function is run when an HTTP request is received.
Queue:Queue: Function is run when a new message arrives in a message queue.
Blob storage:Blob storage: Function is run when a new blob is created in a storage container.
Timer :Timer : Function is run on a schedule.
Below are example triggers seen in the Azure portal when creating a new function
Azure Automation
Once an event has occurred that initiates a Function, you may also want to consume data from this event or
from another source. Once a Function has been completed, you may want to publish or push data to an Azure
service such as a Blob Storage. Input and output are achieved using input and output bindings. For more
information about triggers and bindings, see Azure Functions triggers and binding concepts.
Both PowerShell and Python are common languages for automating everyday operational tasks. Because Azure
Functions supports both of these languages, it is an excellent platform for hosting, running, and auditing
automated operational tasks. For example, let's assume that you would like to build a solution to facilitate self-
service account creation. An Azure PowerShell Function could be used to create the account in Azure Active
Directory. An HTTP trigger can be used to initiate the Function, and an input binding configured to pull the
account details from the HTTP request body. The only remaining piece would be a solution that consumes the
account details and creates the HTTP requests against the Function.
Learn moreLearn more
Documentation: Azure Functions PowerShell developer guide
Documentation: Azure Functions Python developer guide
PowerShell and Python are popular programming languages for automating operational tasks. Using these
languages, performing operations like restarting services, moving logs between data stores, and scaling
infrastructure to meet demand can be expressed in code and executed on demand. Alone, these languages do
not offer a platform for centralized management, version control, or execution history. The languages also lack a
native mechanism for responding to events like monitoring driven alerts. To provide these capabilities, an
automation platform is needed.
Azure Automation provides an Azure-hosted platform for hosting and running PowerShell and Python code
across Azure, non-Azure cloud, and on-premises environments. PowerShell and Python code is stored in an
Azure Automation Runbook, which has the following attributes:
Execute Runbooks on demand, on a schedule, or through a webhook.
Execution history and logging.
Integrated secrets store.
Scale operations
Azure Monitor autoscaleAzure Monitor autoscale
Source Control integration.
As seen in the following image, Azure Automation provides a portal experience for managing Azure Automation
Runbooks. Use the included code sample (ARM template) to deploy an Azure Automation account, automation
runbook, and explore Azure Automation for yourself.
Learn moreLearn more
Documentation: What is Azure Automation?
So far, this document has detailed options for scripting operational tasks; however, many Azure services come
with built-in automation, particularly in scale operations. As application demand increases (transactions,
memory consumption, and compute availability), you may need to scale the application hosting platform so that
requests continue to be served. As demand decreases, scaling back not only appropriately resizes your
application but also reduces operational cost.
In cloud computing, scale activities are classified into two buckets:
Scale-up:Scale-up: Adding extra resources to an existing system to meet demand.
Scale-out:Scale-out: Adding more infrastructure to meet demand.
Many Azure services can be scaled up by changing the pricing tier of that service. Generally, this operation
would need to be performed manually or using detection logic and custom automation.
Some Azure services support automatic scale-out, which is the focus of this section.
Azure Monitor autoscale can be used to autoscale Virtual Machine Scale Sets, Cloud Services, App Service Web
Apps, and API Management service. To configure scale-out operations for these services, while in the Azure
portal, select the service, and then scale-outscale-out under the resource settings. Select CustomCustom To configure
autoscaling rules. Automatic scale operations can also be configured using an Azure Resource Manager
Template, the Azure PowerShell module, and the Azure CLI.
Azure Kubernetes ServiceAzure Kubernetes Service
Horizontal pod autoscalerHorizontal pod autoscaler
When creating the autoscale rules, configure minimum and maximum instance counts. These settings prevent
inadvertent costly scale operations. Next, configure autoscale rules, at minimum one to add more instances, and
one to remove instances when no longer needed. Azure Monitor autoscale rules give you fine-grain control over
when a scale operation is initiated. See the Learn more section below for more information on configuring these
rules.
Learn moreLearn more
Documentation: Azure Monitor autoscale overview
Azure Kubernetes Service (AKS) offers an Azure managed Kubernetes cluster. When considering scale
operations in Kubernetes, there are two components:
Pod scaling:Pod scaling: Increase or decrease the number of load balanced pods to meet application demand.
Node scaling:Node scaling: Increase or decrease the number of cluster nodes to meet cluster demand.
Azure Kubernetes Service includes automation to facilitate both of these scale operation types.
Horizontal pod autoscaler (HPA) monitors resource demand and automatically scales pod replicas. When
configuring horizontal pod autoscaling, you provide the minimum and maximum pod replicas that a cluster can
run and the metrics and thresholds that initiate the scale operation. To use horizontal pod autoscaling, each pod
apiVersion: apps/v1
kind: Deployment
metadata:
name: azure-vote-back
spec:
replicas: 1
selector:
matchLabels:
app: azure-vote-back
template:
metadata:
labels:
app: azure-vote-back
spec:
nodeSelector:
"beta.kubernetes.io/os": linux
containers:
- name: azure-vote-back
image: redis
# Resource requests and limits
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 250m
memory: 256Mi
ports:
- containerPort: 6379
name: redis
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: azure-vote-back-hpa
spec:
maxReplicas: 10 # define max replica count
minReplicas: 3 # define min replica count
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: azure-vote-back
targetCPUUtilizationPercentage: 50 # target CPU utilization
Cluster autoscalerCluster autoscaler
must be configured with resource requests and limits, and a HorizontalPodAutoscalerHorizontalPodAutoscaler Kubernetes object
must be created.
The following Kubernetes manifest demonstrates resource requests on a Kubernetes pod and also the definition
of a horizontal pod autoscaler object.
Where horizontal pod autoscaling is a response to demand on a specific application of service running in a
Kubernetes cluster, cluster autoscaling responds to demand on the entire Kubernetes cluster itself. If a
Kubernetes cluster does not have enough compute resources or nodes to facilitate all requested pods' resource
requests, some of these pods will enter a non-scheduled or pending state. In response to this situation, more
nodes can be automatically added to the cluster. Conversely, once compute resources have been freed up, the
cluster nodes can automatically be removed to match steady-state demand.
Cluster autoscaler can be configured when creating an AKS cluster. The following example demonstrates this
operation with the Azure CLI. This operation can also be completed with an Azure Resource Manager template.
az aks create \
--resource-group myResourceGroup \
--name myAKSCluster \
--node-count 1 \
--vm-set-type VirtualMachineScaleSets \
--load-balancer-sku standard \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
Cluster autoscaler can also be configured on an existing cluster using the following Azure CLI command.
See the included documentation for information on fine-grain control for cluster autoscale operations.
Learn moreLearn more
Documentation: Horizontal pod autoscaling
Documentation: AKS cluster autoscaler
Release Engineering: Application Development
12/16/2022 5 minutes to read • Edit Online
Development environments
Examples of "inner
-
loop" development toolsExamples of "inner
-
loop" development tools
One of the primary goals of adopting modern release management strategies is to build systems that allow
your teams to turn ideas into production delivered software with as little friction as possible. Throughout this
section of the Well-Architected Framework, methods and tools for quickly and reliably delivering software are
examined. You will learn about things like continuous deployment, integration strategies, and deployment
environments. Samples are provided to help you quickly get hands-on with this technology.
However, release engineering does not start with fancy deployment software, multiple deployment
environments, or Kubernetes clusters. Before examining how we can quickly and reliably release software, we
need to first look at how software is developed.
Not only has the introduction of cloud computing had a significant impact on how software is delivered and run,
but it's also had a huge downstream impact on how software is developed. For example, the introduction of
container technology has changed how we can host, scale, and deprecate software. That said, containers have
also impacted things like dependency management, host environment, and tooling as we develop software.
This article details many practices that you may want to consider when building strategies for developing for the
cloud. Topics include:
Development environments, or where you write your code.
Source control and branching strategies, how you manage, collaborate on, and eventually deploy your code.
When developing software for the cloud, or any environment, care needs to be taken to ensure that the
development environment is set up for success. When setting up a development environment, you may
consider questions like the following:
How do I ensure that all dependencies are in place?
How can I best configure my development environment to emulate a production environment?
How do I develop code where service dependencies may exist with code already in production?
Based on the chosen deployment method and target application the answers to these questions, and the tools
used for what is often referred to the "inner-loop" development process may differ.
Package management solutionsfor example – can help you manage dependencies in your development
environment. When using containers, you may want to use a solution like Docker Desktop for your development
environment instead. When it comes to effectively emulating a production environment, you may choose to set
up a range of dummy services locally.
Some Linux tools are a handy solution to this, and they are now available on Windows machines too. You can
use them by configuring the Windows Subsystem for Linux.
Finally, tools like Bridge for Kubernetes allow you to debug code in your development environment while being
connected to a Kubernetes cluster. This configuration can be helpful when working on containerized
microservices. You can work on one service locally while services that you take a dependency on are spun up
remotely.
A development team might choose some of the pieces of software below when developing a containerized
Source control
Version Control and code changesVersion Control and code changes
Branches, Forks and Pull RequestsBranches, Forks and Pull Requests
application with a Kubernetes cluster as its deployment target.
Azure Artifacts is a package management solution that lets you host private packages and provide upstream
connectivity to public package repositories for a variety of package management systems such as NuGet.
Docker Desktop is an application that provides a Docker environment on your development system. Docker
Desktop includes not only the Docker runtime but application development tools and local Kubernetes
environment.
The Windows Subsystem for Linux provides a Linux environment on your Windows machines, including
many command-line tools, utilities, and Linux applications.
Bridge to Kubernetes allows you to run and debug code on your development system while connected to a
Kubernetes cluster. This configuration can be helpful when working on microservice type architectures.
Podman is an open-source tool for working with containers.
Source control management (SCM) systems provide a way to control, collaborate, and peer review software
changes. As software is merged into source control, the system helps manage code conflicts. Ultimately, source
control provides a running history of the software, modification, and contributors. Whether a piece of software
is open-sourced or private, using source control software has become a standardized method of managing
software development.
As cloud practices are adopted and because so much of the cloud infrastructure is managed through code,
version control systems are an integral part of infrastructure management.
Many source control systems are powered by Git. Git is a distributed version control system with related tools
that allow you and your team to track source code changes during the software development lifecycle. Using Git,
you can create a copy of the software, make changes, propose the changes, and receive peer review on your
proposal. Modern source control software simplifies this peer review process and helps you see the exact
changes that a contributor wants to make.
Once the proposed changes have been approved, Git helps merge the changes into the source, including conflict
resolution. If, at any point, the changes need to be reverted, Git can also manage rollback.
Beyond providing us with a place to store code, source control systems allow us to understand what version of
the software is current and identify changes between the present and past versions. Version control solutions
should also provide a method for reverting to the previous version when needed.
The following image demonstrates how Git and GitHub are used to see the proposed code changes.
Using source control systems, you can create both branches in an existing repo and your own copies of the
software. A copy of an existing repo is called a "fork".
While branches are usually short lived and will eventually be merged back into a repo via a pull request, forks
are usually around for longer. While they can also be merged back via a pull request, some forks develop
Peer ReviewPeer Review
GitHubGitHub
Azure ReposAzure Repos
Next steps
independently from their origin repo.
You can review our documentation on Forks and Branches to find out more about the differences and similarites
between them.
As updates are made to software and infrastructure configurations, version control software allows us to
propose these changes before merging them into the source. During the proposal, peers can review the
changes, recommend updates, and approve the changes. Source control solutions provide an excellent platform
for collaboration on changes to the software.
To learn more about using Git, visit the DevOps Resource Center.
GitHub is a popular source control system that uses Git. In addition to core Git functionality, GitHub includes
several other features such as access control, collaboration features such as code issues, project boards, wikis,
and an automation platform called GitHub actions.
Azure DevOps is a collection of services for building, collaborating on, testing, and delivering software to any
environment. Azure DevOps Services includes Azure Repos, which is its source control system. Using Azure
Repos includes unlimited free private Git repositories. Standard Git powers Azure Repos, and you can use clients
and tools of your choice for working with them.
Release Engineering: Continuous integration
Release Engineering: Continuous Integration
12/16/2022 4 minutes to read • Edit Online
Continuous integration
Continuous integration pipelines
As code is developed, updated, or even removed, having a friction-free and safe method to integrate these
changes into the main code branch is paramount to enabling developers to provide value fast. As a developer,
making small code changes, pushing these to a code repository, and getting almost instantaneous feedback on
the quality, test coverage, and introduced bugs allows you to work faster, with more confidence, and less risk.
Continuous integration (CI) is a practice where source control systems and software deployment pipelines are
integrated to provide automated build, test, and feedback mechanisms for software development teams.
Continuous integration is about ensuring that software is ready for deployment but does not include the
deployment itself. This article covers the basics of continuous integration and offers links and examples for more
in-depth content.
Continuous integration is a software development practice under which developers integrate software updates
into a source control system on a regular cadence. The continuous integration process starts when an engineer
creates a pull request signaling to the CI system that code changes are ready to be integrated. Ideally, integration
validates the code against several baselines and tests and provides quick feedback to the requesting engineer on
the status of these tests. Assuming baseline checks and testing have gone well, the integration process produces
and stages assets such as compiled code and container images that will eventually deploy the updated software.
As a software engineer, continuous integration can help you deliver quality software more quickly by
performing the following:
Run automated tests against the code, providing early detection of breaking changes.
Run code analysis to ensure code standards, quality, and configuration.
Run compliance and security checks ensuring no known vulnerabilities.
Run acceptance or functional tested to ensure that the software operates as expected.
To provide quick feedback on detected issues.
Where applicable, produce deployable assets or packages that include the updated code.
To achieve continuous integration, use software solutions to manage, integrate, and automate the process. A
common practice is to use a continuous integration pipeline, detailed in this article's next section.
A continuous integration pipeline involves a piece of software, in many cases, cloud-hosted, that provides a
platform for running automated tests, compliance scans, reporting, and all additional components that make up
the continuous integration process. In most cases, the pipeline software is attached to source control such that
when pull requests are created or software is merged into a specific branch, and the continuous integration
pipeline is run. Source control integration also provides the opportunity for providing CI feedback directly on
the pull request.
Many solutions provide continuous integration pipeline capabilities.
Learn moreLearn more
To learn how to create a continuous integration pipeline with either GitHub or Azure DevOps, see these articles:
Azure DevOps: Create your first pipeline
Source Control IntegrationSource Control Integration
Test IntegrationTest Integration
- script: |
pip3 install pytest
pytest azure-vote/azure-vote/tests/ --junitxml=junit/test-results.xml
continueOnError: true
- task: PublishTestResults@2
displayName: 'Publish Test Results'
inputs:
testResultsFormat: 'JUnit'
testResultsFiles: '**/test-results.xml'
failTaskOnFailedTests: true
testRunTitle: 'Python $(python.version)'
GitHub Actions: Setting up continuous integration using workflow templates
The integration of your continuous integration pipeline with your source control system is key to enabling fast,
self-service, and friction-free code contributions.
As pull requests are created, the CI pipeline is run, including all tests, security assessments, and other checks.
CI test results are provided to the pull request initiator directly in the pull request, allowing for almost real-
time feedback on quality.
Another popular practice is building small reports or badges that can be presented in source control to make
visible the current builds states
The following image shows the integration between GitHub and an Azure DevOps pipeline. In this example, a
pull request has been created, which in turn has triggered an Azure DevOps pipeline. The pipeline status can be
seen directly in the pull request.
A key element of continuous integration is the continual building and testing of code as code contributions are
made. Testing pull requests as they are created gives quick feedback that the commit has not introduced
breaking changes. The advantage is that the tests that are run by the continuous integration pipeline can be the
same tests run during test-driven development.
The following code snippet shows a test step from an Azure DevOps pipeline. There are two actions occurring:
The first task is using a popular Python testing framework to run CI tests. These tests reside in source control
alongside the Python code. The test results are output to a file named
test-results.xml
.
The second task consumes the test results and publishing them to the Azure DevOps pipeline as an
integrated report.
The following image shows the test results as seen in the Azure DevOps portal.
Failed testsFailed tests
CI Result BadgesCI Result Badges
Next steps
Failed tests should temporarily block a deployment and lead to a deeper analysis of what has happened and to
either a refinement of the test or an improvement of the change that caused the test to fail.
Many developers like to show that they're keeping their code quality high by displaying a status badge in their
repo. The following image shows an Azure Pipelines badge as displayed on the Readme file for an open-source
project on GitHub.
Learn moreLearn more
To learn how to display badges in your repositories, see these articles:
Add an Azure Pipeline status badge to your repository.
Add a GitHub workflow status badge to your repository.
Release Engineering: Testing
Testing your application and Azure environment
12/16/2022 7 minutes to read • Edit Online
Automated Testing
Unit TestingUnit Testing
Smoke TestingSmoke Testing
Integration TestingIntegration Testing
Testing is one of the fundamental components and DevOps and agile development in general. If automation
gives DevOps the required speed and agility to deploy software quickly, only through extensive testing those
deployments will achieve the required reliability that customers demand.
A main tenet of a DevOps practice to achieve system reliability is the "Shift Left" principle. If developing and
deploying an application is a process depicted as a series of steps going from left to right, testing should not be
limited to being performed at the very end of the process (at the right). It should be shifted as much to the
beginning (to the left) as possible. Errors are cheaper to repair when caught early. They can be expensive or
impossible to fix later in the application life cycle.
Another aspect to consider is that testing should occur on all code. (not just application code but also
infrastructure templates and configuration scripts) As described in Infrastructure as Code, the environment
where applications are running should be version-controlled and deployed through the same mechanisms as
application code, and hence can be tested and validated using the same testing paradigms that teams already
use for application code.
A range of testing tools are available to automate and streamline different kinds of testing, including Azure Load
Testing, Azure Pipelines for automated testing and Azure Test Plans for manual testing.
There are multiple stages at which tests can be performed in the life cycle of code, and each of them has some
particularities that are important to understand. In this guide, you can find a summary of the different tests that
you should consider while developing and deploying applications.
Automating tests is the best way to make sure that they're executed consistently. Depending on how frequently
tests are performed, they're typically limited in duration and scope, as the different types of automated tests will
show:
Unit tests are tests typically run as part of the continuous integration routine. Unit Tests should be extensive
(ideally cover 100% of the code) and quick (run in under 30 seconds, although this number is a guideline).
Unit testing can verify that the syntax and functionality of individual modules of code are working the way they
should. (for example: produce a defined output for a known input) Unit tests can also be used to verify that
infrastructure as code assets are valid.
Unit tests should be applied both to all code assets. (including templates and scripts)
Smoke tests verify that a workload can be stood up in a test environment and performs as expected. They don't
go to the extent of integration tests as they don't verify the interoperability of different components.
Instead they verify that the deployment methodology for both infrastructure and the application works and that
the system responds as intended once the process is complete.
After making sure that the different application components operate correctly individually, integration testing
has as goal determine whether they can interact with each other as they should.
Manual Testing
Acceptance Testing
Testing and Experimentation in Production
Stress Testing
Running a large integration test suite can take a considerable amount of time, which is why tests should be
shifted left as much as possible, with integration tests being reserved to scenarios that cannot be tested with a
smoke or unit test.
Long running test processes can be run on a regular interval if needed. This offers a good compromise,
detecting interoperability issues between application components no later than one day after they were
introduced.
Manual testing is much more expensive than automated testing, and as a consequence it's usually run much less
frequently. However, manual testing is fundamental for the correct functioning of the DevOps feedback loop, to
correct errors before they become too expensive to repair, or cause customer dissatisfaction.
Depending on the context acceptance testing is sometimes performed manually. In some cases it's partially or
fully automated. Its purpose is to determine whether or not the software system has met the requirement
specifications.
The main purpose of this test is to evaluate the system's compliance with the business requirements and verify if
it has met the required criteria for delivery to end users.
Tools like Application Insights User Behavior Analytic, can help you work out how people are using your
application. This way you can decide whether a new feature has improved your applications without bringing
unintended adverse effects.
The next section of this article will cover methods for "Testing in Production", which can be used to verify if
features have met business targets with real world user behavior.
Depending on your chosen deployment strategy Azure platform features might help you conduct experiments
in production.
Azure AppService - for example - comes with the slot functionality that allows you to have two different
versions of the same application running at the same time. Slots allow for A/B testing by directing part of the
user traffic to one version of the application and the rest of the traffic to another.
There are multiple popular approaches to experimentation in production:
Blue/Green deploymentsBlue/Green deployments When deploying a new application version, you can deploy it in parallel to the
existing one. This allows you to start redirecting clients to the new version, and if everything goes well you
can then decommission the old version. If there is any problem with the new deployment, you can always
redirect the users back to the deployment with the previous version.
Canar y releasesCanar y releases You can expose new functionality of your application in a staggered way to select groups
of users. If users are satisfied with the new functionality, or if the new feature performs as expected with the
control group, you can extend the feature to a larger group of users until it's fully rolled out.
A/B testingA/B testing: A/B testing is similar to canary release-testing, but while canary releases focus on mitigate risk,
A/B testing focus on evaluating two versions of an application or feature side by side. For example, if you
have two versions of the layout of a certain area of your application, you could send half of your users to
one, the other half to the other, and use some metrics to see which layout is more successful in the context of
your business goals.
Business Continuity Testing
Disaster Recovery DrillsDisaster Recovery Drills
Exploratory testingExploratory testing
Fault injectionFault injection
Summary
Next steps
As other sections of this framework have explained, designing your application code and infrastructure for
scalability is of paramount importance. Stress tests help you verify regularly that both your application and your
environment will adapt to changing load conditions.
During these stress tests, it's critical to monitor all the components of the system to identify whether you hit any
bottleneck but also to establish a clear baseline to test against.
Every component of the system that isn't able to scale out can turn into a scale limitation, such as active/passive
network components or databases. It's hence important to know their limits so that you can mitigate their
impact when it comes to application scale.
Another important part of stress tests is to verify that the infrastructure scales back down to its normal
condition as expected in order to keep costs under control.
Disaster Recovery Drills are important to verify that the existing backup strategy is complimented by a working
recovery procedure. These should be conducted regularly.
Tools such as Azure Site Recovery make it possible to start an isolated copy of the primary workload location in
a secondary environment, so that it can be verified that the workload has recovered as it should in an
emergency.
In case a problem occurs during the drill, the Disaster Recovery procedure can be optimized, and the
infrastructure in the secondary environment can be deleted safely.
During exploratory testing, experts explore the application in its entirety trying to find faults or suboptimal
implementations of functionality. These experts could be developers, UX specialists, product owners, actual
users, and other profiles.
Structured test plans are typically not used, since testing is left to the ability of the individual tester.
Fault injection tests if the application is resilient to infrastructure failures. The technique introduces faults in the
underlying infrastructure and observes how the application behaves.
This kind of testing is fundamental to build trust in your redundancy mechanisms. Shutting down infrastructure
components, purposely degrading performance, or introducing faults are ways of verifying that the application
is going to react as expected when these situations occur.
Products like Azure Chaos Studio aim to simplify the process of fault injection testing. Many larger organizations
have built their own chaos engines that leverage automated testing tools to achieve fault injection.
In order to deploy software quickly and reliably, testing is a fundamental component of the development and
deployment life cycle. It's important to make sure that the application is going to perform as expected in every
situation. We therefore need to not only test application code but all aspects of our workload.
Release Engineering: Performance
Performance considerations for your deployment
infrastructure
12/16/2022 3 minutes to read • Edit Online
Build times
Human intervention
Build status shows if your product is in a deployable state, so builds are the heartbeat of your continuous
delivery system. It's important to have a build process up and running from the first day of your product
development. Since builds provide such crucial information about the status of your product, you should always
strive for fast builds.
It will be hard to fix build problem if it takes longer to build, and the team will suffer from a broken window
disorder. Eventually, nobody cares if the build is broken since it's always broken and it takes lot of effort to fix it.
Here are few ways you can achieve faster builds.
Choosing agents that meet your performance requirements:Choosing agents that meet your performance requirements: Speeding up your builds starts with
selecting the right build machines. Fast machines can make the difference between hours and minutes. If
your pipelines are in Azure Pipelines, then you've got a convenient option to run your jobs using a
Microsoft-hosted agent. With Microsoft-hosted agents, maintenance and upgrades are taken care of for
you. For more info see, Microsoft-hosted agents.
Build ser ver location:Build ser ver location: When you're building your code, a lot of data is sent across the wire. Inputs to
the builds are fetched from a source control repository and the artifact repository. At the end, the output
from the build process needs to be copied, not only the compiled artifacts, but also test reports, code
coverage results, and debug symbols. It is important that these copy actions are fast. If you are using your
own build server, ensure that the build server is located near the sources and a target location. Fast up-
and downloads can reduce the overall build time significantly.
Scaling out build ser vers:Scaling out build ser vers: A single build server may be sufficient for a small product, but as the size
and the scope of the product and the number of teams working on the product increases, the single
server may not be enough. Scale your infrastructure horizontally over multiple machines when you reach
the limit. For more info see, how you can leverage Azure DevOps Agent Pools.
Optimizing the build:Optimizing the build:
Add parallel build execution so we can speed up the build process. For more info see, Azure
DevOps parallel jobs.
Enable parallel execution of test suites, which is often a huge time saver, especially when executing
integration and UI tests. For more info see, Run tests in parallel using Azure Pipeline.
Use the notion of a multiplier, where you can scale out your builds over multiple build agents. For
more info see, Organizing Azure Pipeline into Jobs.
Consider moving integration, UI, and smoke tests to a release pipeline. This improves the build
speed, and hence the speed of the build feedback loop.
Publish the build artifacts to a package management solution. (such as NuGet or Maven) This
allows you to reuse your build artifact more easily.
Next steps
Your organization might choose to create several different kinds of builds to optimize build times.
CI builds:CI builds: The purpose of this build is to ensure code compiles and unit tests are run. This build gets
triggered at each commit. It serves as the heartbeat of the project, and provides quality feedback to the
team immediately. For more info see, CI triggers or Batching CI builds.
Nightly build:Nightly build: The purpose of a nightly build is not only to compile the code, but also ensure any larger
test suites that are inefficient to run for each build are run on a regular cadence. Usually these are
integration, UI, or smoke tests. For more information review how to schedule builds using cron syntax.
Release build:Release build: In addition to compiling and running tests, this build also compiles the API
documentation, compliance reports, code signing, and other steps which are not required every time the
code is built. This build provides the golden copy that will be pushed to the release pipeline to finally
deploy in the production environment.
What types of builds your organization needs depends on many factors including your team's and
organizational maturity, the kind of product you are working on, and your deployment strategy.
Release Engineering: Deployment
Release Engineering: Deployment
12/16/2022 4 minutes to read • Edit Online
Automation
Automate as many processes as possibleAutomate as many processes as possible
Automate and test deployment and maintenance tasksAutomate and test deployment and maintenance tasks
Implement deployment security measuresImplement deployment security measures
Release process
Document release process
Stage your workloads
As you provision and update Azure resources, application code, and configuration settings, a repeatable and
predictable process will help you avoid errors and downtime. We recommend automated processes for
deployment that you can run on demand and rerun if something fails. After your deployment processes are
running smoothly, process documentation can keep them that way.
To activate resources on demand, deploy solutions rapidly, minimize human error, and produce consistent and
repeatable results, be sure to automate deployments and updates.
The most reliable deployment processes are automated and
idempotent
— that is, repeatable to produce the
same results.
To automate provisioning of Azure resources, you can use Terraform, Ansible, Chef, Puppet, Azure
PowerShell, Azure CLI, or Azure Resource Manager templates.
To configure VMs, you can use cloud-init (for Linux VMs) or Azure Automation State Configuration (DSC).
To automate application deployment, you can use Azure DevOps Services, Jenkins, or other CI/CD solutions.
As a best practice, create a repository of categorized automation scripts for quick access, documented with
explanations of parameters and examples of script use. Keep this documentation in sync with your Azure
deployments, and designate a primary person to manage the repository.
Automation scripts can also activate resources on demand for disaster recovery.
Distributed applications consist of multiple parts that must work together. Deployment should take advantage of
proven mechanisms, such as scripts, that can update and validate configuration and automate the deployment
process. Test all processes fully to ensure that errors don't cause additional downtime.
All deployment tools must incorporate security restrictions to protect the deployed application. Define and
enforce deployment policies carefully, and minimize the need for human intervention.
One of the challenges with automating deployment is the cut-over itself, taking software from the final stage of
testing to live production. You usually need to do this quickly in order to minimize downtime. The blue-green
deployment approach does this by ensuring you have two production environments, as identical as possible.
Without detailed release process documentation, an operator might deploy a bad update or might improperly
configure settings for your application. Clearly define and document your release process, and ensure that it's
available to the entire operations team.
Test environments
Logging and auditing
High availability considerations
Consider deploying across multiple regionsConsider deploying across multiple regions
Redeploy to a secondary regionRedeploy to a secondary region
Deployment to various stages and running tests/validations at each stage before moving on to the next ensures
friction free production deployment.
With good use of staging and production environments, you can push updates to the production environment
in a highly controlled way and minimize disruption from unanticipated deployment issues.
Blue-green deployment
involves deploying an update into a production environment that's separate from
the live application. After you validate the deployment, switch the traffic routing to the updated version. One
way to do this is to use the staging slots available in Azure App Service to stage a deployment before moving
it to production.
Canary releases
are similar to blue-green deployments. Instead of switching all traffic to the updated
application, you route only a small portion of the traffic to the new deployment. If there's a problem, revert to
the old deployment. If not, gradually route more traffic to the new version. If you're using Azure App Service,
you can use the Testing in production feature to manage a canary release.
If development and test environments don't match the production environment, it is hard to test and diagnose
problems. Therefore, keep development and test environments as close to the production environment as
possible. Make sure that test data is consistent with the data used in production, even if it's sample data and not
real production data (for privacy or compliance reasons). Plan to generate and anonymize sample test data.
To capture as much version-specific information as possible, implement a robust logging strategy. If you use
staged deployment techniques, more than one version of your application will be running in production. If a
problem occurs, determine which version is causing it.
An application that depends on a single instance of a service creates a single point of failure. To improve
resiliency and scalability, provision multiple instances.
Below are a couple of examples of how this can be achieved with Azure resources. We recommend that you
review the resiliency guidance for the resources that comprise your workload to achieve high availability across
it.
For Azure App Service, select an App Service plan that offers multiple instances.
For Azure Virtual Machines, ensure that your architecture includes more than one VM and that each VM is
included in an availability set.
We recommend deploying all but the least critical applications and application services across multiple regions.
If your application is deployed to a single region, in the rare event that the entire region becomes unavailable,
the application will also be unavailable. If you choose to deploy to a single region, consider preparing to
redeploy to a secondary region as a response to an unexpected failure.
If you run applications and databases in a single, primary region with no replication, your recovery strategy
might be to redeploy to another region. This solution is affordable but most appropriate for non-critical
applications that can tolerate longer recovery times. If you choose this strategy, automate the redeployment
process as much as possible and include redeployment scenarios in your disaster response testing.
To automate your redeployment process, consider using Azure Site Recovery.
Next steps
Release Engineering: Rollback
Release Engineering: Rollback
12/16/2022 3 minutes to read • Edit Online
Make use of Staged Deployments
Apply desired state configuration
In some cases, a new software deployment can harm or degrade the functionality of a software system. When
building your solutions, it is essential to anticipate deployment issues and to architect solutions that provide
mechanisms for fixing problematic deployments. Rolling back a deployment involves reverting the deployment
to a known good state. Rollback can be accomplished in many different ways. Several Azure services support
native mechanisms for rolling back to a previous state. We recommend leveraging them where available and
where it makes sense for your chosen deployment model. The following examples will help you design your
rollback and/or recovery plan.
If you use a staged deployment strategy, you deploy several versions of your application and use networking
techniques to switch between them. This allows you to quickly revert to a previous version of the application if
you detect a problem along the way.
When using Azure App Service, you can utilize deployment slots to implement this. Deployment slots are
running instances of the application, each with a separate host name. Slots can be used to stage and test
applications before promoting to a production slot. A deployment slot can be created to hold the last known
good instance of your application. In the event of an issue or problematic deployment, the production slot can
be swapped with the known good slot to bring the application back to a known good state quickly. The image
below shows how this can be accomplished via an Azure DevOps pipeline.
Learn moreLearn more
For more information on using Azure App Service deployment slots, see Set up staging environments in Azure
App Service
Depending on the resources in your workload, you may be able to use desired state configurations to automate
and govern your deployment process.
Kubernetes is among the solutions that allow you to express your deployment instructions as a desired state
configuration. With Kubernetes you can use a deployment instruction to describe the particular workload
running in the cluster. A deployment may declare that a workload consists of three replicas of a specific pod that
Work with deployment records
Restore a known good state
should be running at all times. The deployment object creates a ReplicaSet and the associate Pods. When
updating a workload, the deployment itself can be revised, which will generally roll out a new container image
to the deployment pods. Assuming multiple replicas of the pods exist, this rollout can happen in a controlled and
staged manner and in harmony with your workload's overall deployment strategy.
Learn moreLearn more
For more information, see Kubernetes Deployments
A deployment record is created when deploying Azure infrastructure and solutions with Azure Resource
Manager (ARM) templates or other infrastructure as code solutions. When creating a new deployment, you can
provide a previously known good deployment so that if the current deployment fails, the previous know good
deployment is redeployed. There are several considerations and caveats when using this functionality. See the
documentation linked below for these details.
Learn moreLearn more
For more information, see Rollback on an error to successful deployment
Many managed services have a concept of versioning with a built-in restore function.
Azure Logic Apps - for example - create a new version of the application whenever an update is made to it.
Azure maintains a history of versions and can revert or promote any previous version. To do so, in the Azure
portal, select your Logic App and choose VersionsVersions. Previous versions can be selected on the versions pane, and
the application can be inspected both in the code view and the visual designer view. Select the version you
would like to revert to, and click the PromotePromote option and then SaveSave.
Learn moreLearn more
For more information, see Manage logic apps in the Azure portal
Monitoring operations of cloud applications
12/16/2022 2 minutes to read • Edit Online
Checklist
In this section
A SSESSM EN TA SSESSM EN T DESC RIP T IONDESC RIP T ION
Do you monitor your resources?Do you monitor your resources? Have an overall view of the workload resources. The
information can come from application code, frameworks,
external sources with which the application communicates,
and the underlying infrastructure.
Do you have detailed instrumentation in theDo you have detailed instrumentation in the
application code?application code?
Instrumentation lets you gather performance data, diagnose
problems, and make decisions.
Do you correlate application events across allDo you correlate application events across all
application components?application components?
Collect data from various sources, consolidate and clean
various formats, and store in reliable storage.
Do you interpret the collected data to spot issuesDo you interpret the collected data to spot issues
and trends in application health?and trends in application health?
Analyze the data collected from various data sources to
assess the overall well-being of the workload.
Do you visualize monitoring data?Do you visualize monitoring data? Present the analyzed data in a way that an operator can
quickly spot any trends or problems.
Distributed applications and services running in the cloud are, by nature, complex pieces of software that
include many moving parts. In a production environment, it's important to track the way customers use your
system and monitor the health, and performance of your system. Use the following checklist as a diagnostic aid
to detect, correct, and prevent issues from occurring.
How are you monitoring your workload?How are you monitoring your workload?
Ensure that the system remains healthy.
Track the availability of the system and its component elements.
Maintain performance to ensure that the throughput of the system doesn't degrade unexpectedly as the
volume of work increases.
Guarantee that the system meets any service-level agreements (SLAs) established with customers.
Protect the privacy and security of the system, users, and their data.
Track the operations performed for auditing or regulatory purposes.
Monitor the day-to-day usage of the system and spot trends that might lead to problems if they're not
addressed.
Track issues that occur, from initial report through to analysis of possible causes, rectification, consequent
software updates, and deployment.
Trace operations and debug software releases.
Follow these questions to assess the workload at a deeper level.
Do you have aler ts and response plans ready for theDo you have aler ts and response plans ready for the
relevant teams when issues occur?relevant teams when issues occur?
Present the analyzed data in a way that an operator can
quickly spot any trends or problems.
Do you use the Azure platform notifications andDo you use the Azure platform notifications and
updates?updates?
Consider the underlying infrastructure such as virtual
machines, networks, and storage services to collect
important platform-level diagnostic data.
Do you monitor and measure application health?Do you monitor and measure application health? Health monitoring generates a snapshot of the current
health of the system so that you can verify all components
are functioning as expected.
Do you monitor and track resource usage?Do you monitor and track resource usage? Usage monitoring tracks how the features and components
of an application are used.
Do you collect data from the repor ted issuesDo you collect data from the repor ted issues Analyzing data, for unexpected events, can provide insight
about the application health.
Do you collect audit logs for regulatoryDo you collect audit logs for regulator y
requirements?requirements?
Auditing can provide evidence useful for compliance
attestations.
A SSESSM EN TA SSESSM EN T DESC RIP T IONDESC RIP T ION
Azure offering
Reference architecture
Related links
Next steps
You can use any monitoring platform to manage your Azure resources. Microsoft's first party offering is Azure
Monitor, a comprehensive solution for metrics and logs from the infrastructure to the application code,
including the ability to trigger alerts and automated actions as well as data visualization.
Enterprise monitoring with Azure Monitor illustrates an architecture that has centralized monitoring
management by using Azure Monitor features.
Distributed tracing
Application Insights
Monitoring phases
Monitoring workloads
12/16/2022 3 minutes to read • Edit Online
1. Data sources and instrumentation
Monitoring a workload and the infrastructure in which it runs, should not be processed or analyzed in isolation.
Build a pipeline that gives you holistic observability of the system.
This article describes the stages of a common pipeline design.
Monitoring data from many sources:
Application codeApplication code Developers add trace messages in application code to track the flow of control. For
example, recording the entry and exit times can be useful. An entry to a method in the application can
emit a trace message that specifies the name of the method, the current time, the value of each
parameter, and any other pertinent information.
Application frameworksApplication frameworks Many applications use libraries and frameworks to perform common tasks
such as accessing a data store or communicating over a network. The frameworks emit trace messages
and raw diagnostic information, such as transaction rates and data transmission successes and failures
can be important information for debugging issues.
Dependent ser vicesDependent ser vices The application will access external services such as a web server or database
management system to complete business operations. Even a single business operation can result in
multiple point-to-point calls among all services. The services might publish their own trace information,
logs, and performance counters. Examples include SQL Server Dynamic Management Views for tracking
operations performed against a SQL Server database, and IIS trace logs for recording requests made to a
web server.
Release pipelineRelease pipeline As the components of a system are modified and new versions are deployed, it's
important to be able to attribute issues, events, and metrics to each version. This way problems with a
specific version of a component can be tracked quickly and rectified.
NOTENOTE
InstrumentationInstrumentation
2. Collection and storage
3. Analysis and diagnosis
4. Visualization and alerting
AlertingAlerting
As a workload owner, you may not be monitoring infrastructure metrics actively. However, this information can indicate
systemic issues. Consider the underlying infrastructure and components on which your system runs. Virtual machines,
virtual networks, and storage services can all be sources of important infrastructure-level performance counters and other
diagnostic data.
Another important source is the operating system where the application runs. It can be a source of low-level system-wide
information, such as performance counters that indicate I/O rates, memory utilization, and CPU usage. Operating system
errors (such as the failure to open a file correctly) might also be reported.
More information: Sources of monitoring data
Instrumentation is a process of generating diagnostics data in the data sources. The captured data can help you
assess performance, diagnose problems, and make decisions. In the simplest case of a single application, it can
be a log of trace messages.
In complex scenarios, there might be a need to
correlate instrumentation data
from several sources with
sufficient context so that the data can be mapped it to an action. For example, at the framework level, a task
might be identified by a thread ID. Within an application, the same work might be associated with the user ID for
the user who is performing that task.
More information: Instrumentation best practices
At this stage, the information generated by instrumentation is collected and transformed so that it's easy to
consume and then saved to a reliable storage. The choices for the data collection and storage technologies will
depend on some factors:
Do you need to analyze data quickly (hot path) or it can be stored for later analysis (warm, cold path)?
Do you want the data collection tier to retrieve data actively (pull model) or wait for the data to arrive (push
model)?
Do you want to consolidate data from various sources before writing to storage?
Do you need to archive the data?
More information: Data collection and storage considerations
The analysis and diagnosis stage takes the collected raw data and produces information that can be used to
determine the state of the system. This information is useful in deciding remediation actions. In certain cases,
the results can be fed back into the instrumentation and collection stages.
More information: Analyzing data
The purpose of visualization is to present information in near real time by using a series of dashboards. Also,
you can view historical data through reports, graphs, and charts to identify long-term trends.
More information: Visualizing data
If information indicates that the key indicators are likely to exceed acceptable bounds, you can trigger alerts to
Next steps
raise awareness. It can trigger an automated process that attempts to take corrective actions, such as
autoscaling.
More information: Alerting
More information: Sources of monitoring data
Data sources for diagnostic data
12/16/2022 7 minutes to read • Edit Online
Application code
Application frameworks
NOTENOTE
Dependent services
Operating system
The information that the monitoring process uses can come from several sources: application code, frameworks,
external sources with which the application communicates, and the underlying infrastructure.
This article highlights some best practices of those sources and provides guidance on the types of information
that you should capture.
Developers add trace messages in the application code to track the flow of control. Here are some best practices:
Follow a standard approachFollow a standard approach For example, recording the entry and exit times can be useful. An entry to
a method in the application can emit a trace message that specifies the name of the method, the current
time, the value of each parameter, and any other pertinent information.
Log all exceptions and warningsLog all exceptions and warnings Retain a full trace of any nested exceptions and warnings. Ideally,
you should also capture information that identifies the user who is running the code, together with
activity correlation information (to track requests as they pass through the system).
Log attempts to access all dependent resourcesLog attempts to access all dependent resources An application will communicate with services
such as message queues, databases, files, and other dependent services. This information can be used for
metering and auditing purposes.
Capture information that identifies the user who is running the codeCapture information that identifies the user who is running the code This information with
activity correlation information is useful in tracking requests as they pass through the system.
Many applications use libraries and frameworks to perform common tasks such as accessing a data store or
communicating over a network. Consider configuring the frameworks to emit trace messages. These
frameworks might be configurable to provide their own trace messages and raw diagnostic information, such as
transaction rates and data transmission successes and failures.
Many modern frameworks automatically publish performance and trace events. Capturing this information is simply a
matter of providing a means to retrieve and store it where it can be processed and analyzed.
The application will access external services such as a web server or database management system to complete
business operations. Even a single business operation can result in multiple point-to-point calls among all
services. The services might publish their own trace information, logs, and performance counters. Examples
include SQL Server Dynamic Management Views for tracking operations performed against a SQL Server
database, and IIS trace logs for recording requests made to a web server.
Another important source is the operating system where the application runs. It can be a source of low-level
Infrastructure metrics
NOTENOTE
Release pipeline
system-wide information, such as performance counters that indicate I/O rates, memory utilization, and CPU
usage. Operating system errors such as the failure to open a file correctly might also be reported.
As a workload owner, you may not be monitoring infrastructure metrics actively. However, this information can indicate
systemic issues. Consider the underlying infrastructure and components on which your system runs. Virtual machines,
virtual networks, and storage services can all be sources of important infrastructure-level performance counters and other
diagnostic data.
Compute monitoringCompute monitoring Collect metrics from compute resources on which the application is running. This
might be virtual machines, App Services, or Kubernetes. Knowing the state of your infrastructure will
allow you to react promptly if there are any issues.
Data tier monitoringData tier monitoring Include metrics the databases, storage accounts, and other data sources that
interact with the application. A low performance of the data tier of an application could have serious
consequences.
Container monitoringContainer monitoring If your application run on Azure Kubernetes Service (AKS), you'll need to
monitor the state of your cluster, nodes, and pods. One option is to the container insights feature in Azure
Monitor. This feature delivers quick, visual, and actionable information: from the CPU and memory
pressure of your nodes to the logs of individual Kubernetes pods.
Operators who prefer using the open-source Kubernetes monitoring tool Prometheus can take
advantage of its integration with container insights.
The Sidecar Pattern adds a separate container with responsibilities that are required by the main
container. A common use case is for running logging utilities and monitoring agents.
Network monitoringNetwork monitoring Networking is key to an application running without issues. Consider using
Network Watcher, a collection of network monitoring and troubleshooting tools. Some of these tools are:
Traffic Analytics shows the flows the virtual networks and uses Microsoft Threat Intelligence databases
to give you percentage traffic from malicious IP addresses. You can identify bottlenecks by seeing the
systems in your virtual networks that generate most traffic.
Network Performance Monitor can generate synthetic traffic to measure the performance of network
connections over multiple links, giving you a perspective on the evolution of WAN and Internet
connections over time, as well as offering valuable monitoring information about Microsoft
ExpressRoute circuits.
VPN diagnostics can help troubleshooting site-to-site VPN connections connecting your applications
to users on-premises.
As the components of a system are modified and new versions are deployed, it's important to be able to
attribute issues, events, and metrics to each version.
Relate information about issues, events, and others to the release pipelineRelate information about issues, events, and others to the release pipeline Problems with a
specific version of a component can be tracked quickly and rectified.
Log security-related information for successful and failing requestsLog security-related information for successful and failing requests Security issues might occur
at any point in the system. For example, a user might attempt to sign in with an invalid user ID or
password. An authenticated user might try to obtain unauthorized access to a resource. Or a user might
Sources from application and system monitoring
Real user monitoring
IMPORTANTIMPORTANT
Synthetic user monitoring
NOTENOTE
Application profiling
provide an invalid or outdated key to access encrypted information.
This strategy uses internal sources within the application, application frameworks, operating system, and
infrastructure. The application code can generate its own monitoring data at notable points during the lifecycle
of a client request. The application can include tracing statements that might be selectively enabled or disabled
as circumstances dictate. It might also be possible to inject diagnostics dynamically by using a diagnostics
framework. These frameworks typically provide plug-ins that can attach to various instrumentation points in
your code and capture trace data at these points.
Additionally, your code and/or the underlying infrastructure might raise events at critical points. Monitoring
agents that are configured to listen for these events can record the event information.
This approach records the interactions between a user and the application and observes the flow of each
request and response. This information can have a two-fold purpose: it can be used for metering usage by each
user, and it can be used to determine whether users are receiving a suitable quality of service (for example, fast
response times, low latency, and minimal errors). You can use the captured data to identify areas of concern
where failures occur most often. You can also use the data to identify elements where the system slows down,
possibly due to hotspots in the application or some other form of bottleneck. If you implement this approach
carefully, it might be possible to reconstruct users' flows through the application for debugging and testing
purposes.
You should consider the data that's captured by monitoring real users to be highly sensitive because it might include
confidential material. If you save captured data, store it securely. If you want to use the data for performance monitoring
or debugging purposes, strip out all personally identifiable information first.
In this approach, you write your own test client that simulates a user and performs a configurable but typical
series of operations. You can track the performance of the test client to help determine the state of the system.
You can also use multiple instances of the test client as part of a load-testing operation to establish how the
system responds under stress, and what sort of monitoring output is generated under these conditions.
You can implement real and synthetic user monitoring by including code that traces and times the execution of method
calls and other critical parts of an application.
This approach is primarily targeted at monitoring and improving application performance. Rather than
operating at the functional level of real and synthetic user monitoring, it captures lower-level information as the
application runs. You can implement profiling by using periodic sampling of the execution state of an application
(determining which piece of code that the application is running at a given point in time). You can also use
instrumentation that inserts probes into the code at important junctures (such as the start and end of a method
call) and records which methods were invoked, at what time, and how long each call took. You can then analyze
this data to determine which parts of the application might cause performance problems.
Endpoint monitoring
Next steps
This technique uses one or more diagnostic endpoints that the application exposes specifically to enable
monitoring. An endpoint provides a pathway into the application code and can return information about the
health of the system. Different endpoints can focus on various aspects of the functionality. You can write your
own diagnostics client that sends periodic requests to these endpoints and assimilate the responses. For more
information, see the Health Endpoint Monitoring pattern.
For maximum coverage, you should use a combination of these techniques.
Instrumentation
Instrument an application
12/16/2022 8 minutes to read • Edit Online
Trace logs
NOTENOTE
Metrics
Information for correlating data
Instrumentation is a critical part of the monitoring process. You can make meaningful decisions about the
performance and health of a system only if you first capture the data that enables you to make these decisions.
By using instrumentation, the information that you gather should be sufficient to assess performance, diagnose
problems, and make decisions without requiring you to sign in to a remote production server to perform
tracing, and debugging manually. Instrumentation data typically includes metrics and information that's written
to trace logs.
The contents of a trace log can be the result of textual data that's written by the application or binary data that's
created as the result of a trace event, if the application is using Event Tracing for Windows (ETW). Trace log
contents can also be generated from system logs that record events arising from parts of the infrastructure,
such as a web server. Textual log messages are often designed to be human-readable, but they should also be
written in a format that enables an automated system to parse them easily.
Categorize logs and use separate logs to record the trace output from different operational aspects of the
system, instead of writing all trace data to a single log. You can then quickly filter log messages by reading from
the appropriate log rather than having to process a single lengthy file. Never write information that has different
security requirements (such as audit information and debugging data) to the same log.
A log may be implemented as a file on the file system, or it may be held in some other format, such as a blob in blob
storage. Log information may also be held in more structured storage, such as rows in a table.
Metrics will generally be a measure or count of some aspect or resource in the system at a specific time, with
one or more associated tags or dimensions, sometimes called a
sample
. A single instance of a metric isn't useful
in isolation. Instead, metrics have to be captured over time. The key issue to consider is which metrics you
should record and how frequently. Generating data for metrics too often can impose a significant extra load on
the system; whereas, capturing metrics infrequently may cause you to miss the circumstances that lead to a
significant event. The considerations will vary from metric to metric. For example, CPU utilization on a server
might vary significantly from second to second, but high utilization becomes an issue only if it's long-lived over
many minutes.
You can easily monitor individual system-level performance counters, capture metrics for resources, and obtain
application trace information from various log files. Some forms of monitoring require the analysis and
diagnostics stage in the monitoring pipeline to correlate the data that's retrieved from several sources. This data
may take several forms in the raw data, and the analysis process must be provided with sufficient
instrumentation data to map these different forms. For example, at the application framework level, a task may
be identified by a thread ID. Within an application, the same work may be associated with the user ID for the
user who is completing that task.
NOTENOTE
Information to include in the instrumentation data
Human
-
readable dataHuman
-
readable data
Consider performance tradeoffsConsider performance tradeoffs
Invest in traceability and correlationInvest in traceability and correlation
Capture all relevant dataCapture all relevant data
Also, it's unlikely to be a 1:1 mapping between threads and user requests, because asynchronous operations
may reuse the same threads to do operations for more than one user. To complicate matters further, a single
request may be handled by more than one thread as execution flows through the system. If possible, associate
each request with a unique activity Id that's propagated through the system as part of the request context. The
technique for generating and including activity IDs in trace information depends on the technology that's used
to capture the trace data.
All monitoring data should be timestamped in the same way. For consistency, record all dates and times by
using Coordinated Universal Time, which helps you trace sequences of events with ease.
Computers operating in different time zones and networks may not be synchronized. Don't depend on timestamps alone
for correlating instrumentation data that spans multiple machines.
Consider the following points when you're deciding which instrumentation data you need to collect.
Ensure that information captured by trace events is machine and human readable. Adopt well-defined schemas
for this information to help automated processing of log data across systems, and to provide consistency to
operations and engineering staff reading the logs.
Include the following environmental information, such as:
Deployment environment
Machine on which the process is running
Details of the process
Call stack
Enable profiling only when necessary because it can impose a significant overhead on the system. By using
instrumentation, profiling records an event, such as a method call, every time it occurs; but, sampling records
only selected events.
The selection can be time-based, such as once every n seconds, or frequency-based, such as once every n
requests. If events occur frequently, profiling by instrumentation may cause too much of a burden and effect
overall performance. In this case, the sampling approach may be preferable.
However, if the frequency of events is low, sampling may miss them. In this case, instrumentation may be the
better approach.
Provide sufficient context to enable a developer or administrator to determine the source of each request, which
may include some form of activity Id that identifies a specific instance of a request.
It may also include information that can be used to correlate this activity with the computational work
performed and the resources used.
This work might cross process and machine boundaries.
For metering, the context should also include, either directly or indirectly, through other correlated information,
a reference to the customer who caused the request. This context provides valuable information about the
application state at the time that the monitoring data was captured.
Strive for data consistencyStrive for data consistency
Consider external dependenciesConsider external dependencies
Ensure compatibility with telemetry systems
Record all requests, and the locations, or regions from which these requests are made. This information can help
determine whether there are any location-specific hotspots. This information can also be useful to determine
whether to repartition an application or the data that it uses.
Record and capture the details of exceptions carefully. Often, critical debug information is lost because of poor
exception handling.
Capture the full details of exceptions that the application throws, including any inner exceptions and other
context information. Include the call stack, if possible.
Consistent data can help analyze events and correlate them with user requests. Consider using a comprehensive
and configurable logging package to gather information, rather than depending on developers to adopt the
same approach as they implement different parts of the system.
Gather data from key performance counters, such as the volume of I/O being performed, network utilization,
number of requests, memory use, and CPU utilization.
Some infrastructure services may provide their own specific performance counters, such as:
The number of connections to a database.
The rate at which transactions are performed.
The number of transactions that succeed or fail. Applications might also define their own specific
performance counters.
Log all external service calls made.
For example to:
Database systems
Web services
Other system-level services that are part of the infrastructure
Record information about the time taken to perform each call, and the success or failure of the call. If possible,
capture information about all retry attempts and failures for any transient errors that occur.
In many cases, the instrumentation information is generated as a series of events and passed to a separate
telemetry system for processing and analysis. A telemetry system is typically independent of any specific
application or technology, but it expects information to follow a specific format that's defined by a schema. The
schema effectively specifies a contract that defines the data fields and types that the telemetry system can
ingest. The schema should be generalized to allow for data arriving from a range of platforms and devices.
A common schema should include fields that are common to all instrumentation events, such as:
Event name
Event time
IP address of the sender
Details required for event correlation, such as:
User ID
Device ID
Application ID
Remember that any number of devices may raise events, so the schema shouldn't depend on the device type.
Best practices for instrumenting applications
Next steps
Various devices may raise events for the same application. The application may support roaming or some other
form of cross-device distribution.
The schema may also include domain fields that are relevant to a particular scenario that's common across
different applications, such as:
Information about exceptions
Application start and end events
Success or failure of web service API calls
All applications that use the same set of domain fields should emit the same set of events, enabling a set of
common reports and analytics to be built.
Finally, a schema may contain custom fields for capturing the details of application-specific events.
The following list summarizes best practices for instrumenting a distributed application running in the cloud:
Make logs easy to read and easy to parse. Use structured logging where possible.
Be concise and descriptive in log messages.
Identify the source of the log.
Add timing information as each log record is written.
Use the same time zone and format for all timestamps.
Categorize logs and write messages to the appropriate place.
Don't reveal sensitive information about the system or personal information about users.
Log all critical exceptions, but enable the administrator to turn logging on and off for lower levels of
exceptions and warnings.
Capture and log all retry logic information. This data can be useful in monitoring the transient health of the
system.
Trace out process calls, such as requests to external web services or databases.
Don't mix log messages with different security requirements in the same log file.
Except for auditing events, make sure that all logging calls are
fire-and-forget
operations that don't block the
progress of business operations.
Make sure that logging is extensible (for example in code through the use of an interface) and doesn't have
any direct dependencies on a concrete target.
Make sure that all logging is fail-safe and never triggers any cascading errors.
Treat instrumentation as an ongoing iterative process and review logs regularly, not just when there's a
problem.
Scrub this information before it's logged, but ensure that you keep the relevant details.
Auditing events are exceptional because they're critical to the business.
Collection and storage
Collect, aggregate, and store monitoring data for
cloud applications
12/16/2022 11 minutes to read • Edit Online
Best practices
Architecture
In a distributed system, instrumentation can generate large volumes of data from many components, held in
various locations, and in different formats.
For example, an application code might generate trace log files and application event log data. From the
infrastructure perspective, performance counters are captured through other technologies. Also, any third-party
components and services that the application uses might provide their own instrumentation information in
different formats, by using separate trace files, blob storage, or even a custom data store.
This article provides best practices for collecting data from various sources, consolidating and cleaning various
formats, and storing in reliable storage.
Have technologies to collect and aggregate application logs across environments, diagnostics logs from
application resources, infrastructure, and other critical dependencies.
The collection service should run as an out-of-process service and should be simple to deploy.
All output from the collection service should be a structured and agnostic format.
The monitoring and data-collection process must be fail-safe and must not trigger any cascading error
conditions.
In the event of a transient failure in sending information to a data sink, the collection service should be
prepared to reorder telemetry data so that the newest information is sent first. (The monitoring
agent/data-collection service might elect to drop the older data, or save it locally and transmit it later to
catch up, at its own discretion.)
This image shows a typical instrumentation data-collection process.
Data collection service collects data from various sources is not necessarily a single process and might
Application data
Infrastructure data
comprise many constituent parts running on different machines.
If you need data to be analyzed quickly, local components that operate outside the collection service might
perform the analysis tasks immediately. The results can be sent directly to the visualization and alerting
systems.
If you don't need immediate analysis, data is held in storage while it awaits processing.
The hot, warm, and cold analysis patterns are summarized in Analysis patterns.
For applications, the collecting service can be an Application Performance Management (APM) toolApplication Performance Management (APM) tool that
can run autonomously from the application that generates the instrumentation data. After APM is enabled, you'll
have clear visibility of important metrics both in real time and historically. Consider an appropriate level of
logging. Verbose logging can incur significant costs.
An example of an APM is Application Insights that aggregates application level logs and events for subsequent
analysis.
Application logsApplication logs support the end-to-end application lifecycle. Logging is essential in understanding how the
application operates in various environments and what events occur and under which conditions.
Collecting application logs and events across all major environments is recommended. Separate the data
between environments as much as possible. Have filters to ensure non-critical environments do not convolute
production log interpretation. Furthermore, corresponding log entries across the application should capture a
correlation ID for their respective transactions.
Application events should be captured as a structured data typeApplication events should be captured as a structured data type with machine-readable data points
rather than unstructured string types. A structured format, following well-known schema can help in parsing
and analyzing logs. Also, structured data can easily be indexed and searched, and reporting can be greatly
simplified.
Also the data should be an agnostic format that's independent of the machine, operating system, or network
protocol. For example, emit information in a self-describing format such as JSON, MessagePack, or Protobuf
rather than ETL/ETW. Using a standard format enables the system to construct processing pipelines;
components that read, transform, and send data in the agreed format can be easily integrated.
You will also need to collect platform diagnostics to get a holistic view. For example, Windows event logs,
performance counters, diagnostic infrastructure logs, and logs from the management plane. Azure platform logs
addresses all those needs. Here are some recommendations:
Collect Azure Activity LogsAzure Activity Logs to get audit information. These logs are useful in detecting configuration
changes to Azure resources.
Enable resource- or infrastructure- level monitoringresource- or infrastructure- level monitoring throughout the application. This type of logs
includes information emitted by platform services such as Azure VMs, Express Route or SQL Database, and
also third-party solutions. Configure application resources to route diagnostic logs and metrics to the chosen
log aggregation technology.
Enforce consistency.Enforce consistency. You can use Azure Policy to ensure the consistent use of diagnostic settings across the
application, to enforce the desired configuration for each Azure service.
Collect logs and metrics available for critical internal dependencies.Collect logs and metrics available for critical internal dependencies. This information gives you
visibility into the operational state of critical internal dependencies, such as a shared NVA or Express Route
connections, and others.
There are many options for a collection service for aggregating infrastructure and resource logs. Azure Log
Collection strategies
Data collection modelsData collection models
Monitoring agentsMonitoring agents
NOTENOTE
Analytics or Splunk, are popular choices for collating logs and metrics across all application components for
subsequent evaluation. Resources may include Azure IaaS, PaaS services, and third-party appliances such as
firewalls or anti-malware solutions used in the application. For instance, if Azure Event Hubs is used, the
Diagnostic Settings should be configured to push logs and metrics to the data sink.
Understanding usage helps with right-sizing of the workload, but additional cost for logging needs to be
accepted and included in the cost model.
Avoid retrieving telemetry data manually from every component. Have a way of moving the data a central
location and consolidated. For a multiregion solution, it's recommended that you first collect, consolidate, and
store data on a region-by-region basis, and then aggregate the regional data into a single central system.
To optimize the use of bandwidth, prioritize based on the importance of data. You can transfer less urgent data
in batches. However, the data must not be delayed indefinitely, especially if it contains time-sensitive
information.
The collection service can collect instrumentation data in mainly two models:
Pull modelPull model Actively retrieves data from the various logs and other sources for each instance of the
application.
Push modelPush model Passively waits for the data to be sent from the components that constitute each instance of the
application.
Monitoring agents work in pull model. Agents run locally in a separate process with each instance of the
application and periodically pull data and write this information directly to centralized storage shared by all
instances of the application.
For more information, see Azure Diagnostics extension.
Using a monitoring agent is ideally suited to capturing instrumentation data that's naturally pulled from a data source. It's
appropriate for a small-scale application running on a limited number of nodes in a single location. An example is
information from SQL Server Dynamic Management Views or the length of an Azure Service Bus queue.
Performance considerationsPerformance considerations
Consolidation strategies
Storage strategies
A complex, highly scalable, application might generate huge volumes of data. The amount of data can easily
overwhelm the I/O bandwidth available with a single, central location. The telemetry solution must not act as
bottleneck and must be scalable as the system expands. Ideally, the solution should incorporate a degree of
redundancy to reduce the risks of losing important monitoring information (such as auditing or billing data) if
part of the system fails.
One approach is through queuing.
In this architecture, the data-collection service posts data to a queue. A message queue is suitable because it
provides "at least once" semantics that help ensure that queued data will not be lost after it's posted. You can
implement the storage writing service by using a separate worker role. You can implement this with the Priority
Queue pattern.
For scalability, you can run multiple instances of the storage writing service. If there is a high volume of events,
you can use Event Hubs to dispatch the data to a different compute for processing and storage.
The data collected from a single instance of an application gives a localized view of the health and performance
of that instance. To assess the overall health of the system, it's necessary to consolidate some aspects of the data
in the local views. You can perform this after the data has been stored, but in some cases, you can also achieve it
as the data is collected.
The instrumentation data can pass through a separate data consolidation service that combines data and acts as
a filter and cleanup process. For example, instrumentation data that includes the same correlation information
such as an activity ID can be amalgamated. (It's possible that a user starts performing a business operation on
one node and then gets transferred to another node if a node fails, or depending on how load balancing is
configured.) This process can also detect and remove any duplicated data (always a possibility if the telemetry
service uses message queues to push instrumentation data out to storage).
Storage technologiesStorage technologies
Consolidation serviceConsolidation service
Querying considerationsQuerying considerations
When deciding the storage capabilities, consider the type of data, how it's used, and how urgently the data is
required.
Consider a polyglot persistence approach where different types of information are stored by using technologies
that are most appropriate to the way in which each type is likely to be used.
For example, Azure blob and table storage have some similarities in the way in which they're accessed. But they
have differences in the operations you can perform for each, and the granularity of the data that they hold. If
you need to perform more analytical operations or require full-text search capabilities on the data, it might be
more appropriate to use data storage that provides capabilities that are optimized for specific types of queries
and data access. For example:
Performance counter data can be stored in a SQL database to enable ad hoc analysis.
Trace logs might be better stored in Azure Cosmos DB.
Security information can be written to HDFS.
Information that requires full-text search can be stored through Elasticsearch (which can also speed searches
by using rich indexing).
The same instrumentation data might be required for more than one purpose. For example, performance
counters can be used to provide a historical view of system performance over time. This information might be
combined with other usage data to generate customer billing information. In these situations, the same data
might be sent to more than one destination, such as a document database that can act as a long-term store for
holding billing information, and a multidimensional store for handling complex performance analytics.
You can implement another service that periodically retrieves the data from shared storage, partitions and filters
the data according to its purpose, and then writes it to an appropriate set of data stores.
An alternative approach is to include this functionality in the consolidation and cleanup process and write the
data directly to these stores as it's retrieved rather than saving it in an intermediate shared storage area.
Each approach has its advantages and disadvantages. Implementing a separate partitioning service lessens the
load on the consolidation and cleanup service, and it enables at least some of the partitioned data to be
regenerated if necessary (depending on how much data is retained in shared storage). However, it consumes
additional resources. Also, there might be a delay between the receipt of instrumentation data from each
application instance and the conversion of this data into actionable information.
Consider how urgently the data is required. Data that generates alerts must be accessed quickly, so it should be
held in fast data storage and indexed or structured to optimize the queries that the alerting system performs. In
some cases, it might be necessary for the collection service to format and save data locally so that a local
instance of the alerting system can send notifications quickly. The same data can be dispatched to the storage
writing service shown in the previous images and stored centrally if it's also required for other purposes.
Data retention considerationsData retention considerations
Next step
In some cases, after the data has been processed and transferred, the original raw source data that was stored
locally can be removed. In other cases, it might be necessary or useful to save the raw information. For example,
data that's generated for debugging purposes might be best left available in its raw form but can then be
discarded quickly after any bugs have been rectified.
Performance data often has a longer life so that it can be used for spotting performance trends and for capacity
planning. The consolidated view of this data is usually kept online for a finite period to enable fast access. After
that, it can be archived or discarded.
It's useful to store historical data so you can spot long-term trends. Rather than saving old data in its entirety, it
might be possible to down-sample the data to reduce its resolution and save storage costs. As an example,
rather than saving minute-by-minute performance indicators, you can consolidate data that's more than a
month old to form an hour-by-hour view.
Data gathered for metering and billing customers might need to be saved indefinitely. Additionally, regulatory
requirements might dictate that information collected for auditing and security purposes also needs to be
archived and saved. This data is also sensitive and might need to be encrypted or otherwise protected to
prevent tampering. You should never record users' passwords or other information that might be used to
commit identity fraud. Such details should be scrubbed from the data before it's stored.
Information that's used for more considered analysis, for reporting, and for spotting historical trends is less
urgent and can be stored in a manner that supports data mining and ad hoc queries.
Analyze monitoring data
Analyze monitoring data for cloud applications
12/16/2022 6 minutes to read • Edit Online
Best practices
Correlate data
After you've collected data from various data sources, analyze the data to assess the overall well-being of the
system. For analysis, have a clear understanding of:
How to structure data based on KPIs and performance metrics you've defined.
How to correlate the data captured in different metrics and log files. This is important when tracking a
sequence of events and help diagnose problems.
In most cases, data for each part component of the architecture is captured locally and then accurately
combined with data generated by other components.
For example, a three-tier application has:
Presentation tier that allows a user to connect to a website
Middle tier that hosts a set of microservices that processes business logic
Database tier that stores data associated with the operation
The usage data for a single business operation might span across all three tiers. This information needs to be
correlated to provide an overall view of the resource and processing usage for the operation. The correlation
might involve some preprocessing and filtering of data on the database tier. On the middle tier, common tasks
are aggregation and formatting.
Correlate application and resource level logsCorrelate application and resource level logsEvaluate data at both levels to optimize the
detection of issues and troubleshooting of detected issues. You can aggregate the data in a single data
sink or have ways to query events across both levels. Using a unified solution, such as Azure Log
Analytics, is recommended to aggregate and query application and resource level logs.
Define clear retention times on storage for cold analysisDefine clear retention times on storage for cold analysis This practice is recommended to allow
historic analysis over a specific period. Another benefit is control on control storage costs. Have processes
that make sure data gets archived to cheaper storage and aggregate data for long-term trend analysis.
Analyze long-term trends analyzed to predict operational issuesAnalyze long-term trends analyzed to predict operational issuesEvaluate across long-term
data to form operational strategies and also to predict what operational issues are likely to occur and
when. For instance, if the average response times have been slowly increasing over time and getting
closer to the maximum target.
Data generated by instrumentation and captured by Application Performance Monitoring (APM) tools can
provide a snapshot of the system state. The main purpose of the snapshot to make this data actionable.
For example:
What has caused an intense I/O loading at the system level at a specific time?
Is it the result of a large number of database operations?
Is this reflected in the database response times, the number of transactions per second, and application
response times at the same juncture?
A possible remedial action to reduce the load might be to shard the data over more servers. In addition,
exceptions can arise because a fault in any tier. An exception in one level often triggers another fault in the level
above.
So it's recommended that you correlate different types of monitoring data at each level to produce an overall
view of the state of the system and the applications that are running on it. This is vital for root cause analysis for
failures. You can also use this information to make decisions about whether the system is functioning acceptably
or not, and determine what can be done to improve the quality of the system.
In a distributed application, an operation will lead to multiple transactions. Each transaction generates events in
different services and those events must be correlated and visualized. This way you can spot performance
bottlenecks or failure hotspots across services. Application Map is a popular choice for visualizing flows.
Telemetry correlation ensures that the transactions can be mapped through the end-to-end application and
critical system flows. Platform-level metrics and logs such as CPU percentage, network in/out, and disk
operations/sec should be collected from the application to inform a health model and detect/predict issues. This
can also help to distinguish between transient and non-transient faults.
Here's an Application Map of an application that has several microservices. With this visual representation, you
can see that Workflow service is getting errors from the Delivery service.
When you look at the end-to-end transaction, you can see that an exception is thrown due to memory limits in
Azure Cache for Redis.
Analysis patterns
PAT T E RNPAT T E RN C H A RA C T ERIST IC SC H A RA C T ERIST IC S C O N SIDERAT IO NSC O N SIDERAT IO NS USE C A SEUSE C A SE
Hot analysis Time-critical and requires
immediate analysis.
Warm analysis Less time-critical and might
require aggregation before
analysis.
When correlating data make sure that the raw instrumentation data includes sufficient context and activity ID
information to support the required aggregations for correlating events. Additionally, this data might be held in
different formats, and it might be necessary to parse this information to convert it into a standardized format
for analysis.
Analyzing and reformatting data for visualization, reporting, and alerting purposes can be a complex process
that consumes its own set of resources. Some forms of monitoring are time-critical and require immediate
analysis of data to be effective. While other Other forms of analysis are less time-critical and might require some
computation and aggregation after the raw data has been received. This table shows the patterns for analysis.
Data must be
structured and
available quickly for
efficient processing.
Move the analysis
processing to the
individual VMs that
stores the data.
Alerting.
Detecting a security
attack on the
system.
Perform
computation and
aggregation on the
received raw data.
Aggregate a series
of events instead of
an isolated event.
Performance analysis
on data from a
series of events to
identify reliability
issues.
Diagnosis of health
issues by statistical
evaluation of the
events leading up to
a health event.
Cold analysis Analysis can be performed
at a later date.
PAT T E RNPAT T E RN C H A RA C T ERIST IC SC H A RA C T ERIST IC S C O N SIDERAT IO NSC O N SIDERAT IO NS USE C A SEUSE C A SE
Combined approachCombined approach
Diagnose issues
Next steps
Data is stored safely
after it has been
captured.
Aggregate a series
of events instead of
an isolated event.
Usage monitoring
and auditing needs
a view of the system
at points in time.
Predictive health
analysis with
historical
information.
In common monitoring use cases, you'll use a combination of all three patterns.
For example, a health event is typically processed through hot analysis and can raise an alert immediately. An
operator can then drill into the reasons for the health event by examining the data from the warm path. This
data should contain information about the events leading up to the issue that caused the health event. An
operator can also use cold analysis to provide the data for predictive health analysis. The operator can gather
historical information over a specified period and use it in conjunction with the current health data (retrieved
from the hot path) to spot trends that might soon cause health issues. In these cases, it might be necessary to
raise an alert so that corrective action can be taken.
Diagnosis requires the ability to determine the cause of faults or unexpected behavior, including performing root
cause analysis. The information that's required typically includes:
Detailed information from event logs and traces, either for the entire system or for a specified subsystem
during a specified time window.
Complete stack traces resulting from exceptions and faults of any specified level that occur within the system
or a specified subsystem during a specified period.
Crash dumps for any failed processes either anywhere in the system or for a specified subsystem during a
specified time window.
Activity logs recording the operations that are performed either by all users or for selected users during a
specified period.
Analyzing data for troubleshooting purposes often requires a deep technical understanding of the system
architecture and the various components that compose the solution. As a result, a large degree of manual
intervention is often required to interpret the data, establish the cause of problems, and recommend an
appropriate strategy to correct them. It might be appropriate simply to store a copy of this information in its
original format and make it available for cold analysis by an expert.
Visualization
Visualize data and raise alerts
12/16/2022 3 minutes to read • Edit Online
Use dashboards to visualize data
NOTENOTE
An important aspect of any monitoring system is the ability to present the data in such a way that an operator
can quickly spot any trends or problems. Also important is the ability to quickly inform an operator if a
significant event has occurred that might require attention.
Data presentation can take several forms, including visualization by using dashboards, alerting, and reporting.
The most common way to visualize data is to use dashboards that can display information as a series of charts,
graphs, or some other illustration. These items can be parameterized, and an analyst can select the important
parameters, such as the time period, for any specific situation.
Dashboards can be organized hierarchically. Top-level dashboards can give an overall view of each aspect of the
system but enable an operator to drill down to the details. For example, a dashboard that shows the overall disk
I/O for the system should allow an analyst to view the I/O rates for each individual disk to determine whether
one or more specific devices account for a disproportionate volume of traffic. Ideally, the dashboard should also
display related information, such as the source of each request (the user or activity) that's generating this I/O.
This information can then be used to determine whether (and how) to spread the load more evenly across
devices, and whether the system would perform better if more devices were added.
A dashboard may also use color-coding or some other visual cues to indicate values that appear anomalous or
that are outside an expected range. Using the previous example:
A disk with an I/O rate that's approaching its maximum capacity over an extended period (a hot disk) can be
highlighted in red.
A disk with an I/O rate that periodically runs at its maximum limit over short periods (a warm disk) can be
highlighted in yellow.
A disk that's showing normal usage can be displayed in green.
For a dashboard system to work effectively, it must have the raw data to work with. If you're building your own
dashboard system, or using a dashboard developed by another organization, you must understand the
following concepts:
Which instrumentation data do you need to collect?
At what levels of granularity?
How should you format data for the dashboard to consume?
A good dashboard doesn't only display information, it also enables an analyst to pose improvised questions
about that information. Some systems provide management tools that an operator can use to complete these
tasks and explore the underlying data. Instead, depending on the repository that's used to hold this information,
it may be possible to query this data directly, or import it into tools such as Microsoft Excel for further analysis
and reporting.
You should restrict access to dashboards to authorized personnel, because this information may be commercially sensitive.
You should also protect the underlying data for dashboards to prevent users from changing it.
Reporting
Next steps
Reporting is used to generate an overall view of the system. It may incorporate historical data and current
information. Reporting requirements fall into two broad categories: operational reporting and security
reporting.
Operational reporting
typically includes the following aspects:
Aggregating statistics that you can use to understand resource utilization of the overall system or specified
subsystems during a specified time window.
Identifying trends in resource usage for the overall system or specified subsystems during a specified period.
Monitoring exceptions that have occurred throughout the system or in specified subsystems during a
specified period.
Determining the efficiency of the application for the deployed resources, and understanding whether the
volume of resources, and their associated cost, can be reduced without affecting performance unnecessarily.
Security reporting
tracks customers' use of the system. It can include:
Auditing user operations
: This method requires recording the individual requests that each user completes,
together with dates and times. The data should be structured to enable an administrator to quickly
reconstruct the sequence of operations that a user completes over a specified period.
Tracking resource use by user
: This method requires recording how each request for a user accesses the
various resources that compose the system, and for how long. An administrator can use this data to generate
a utilization report, by user, over a specified period, possibly for billing purposes.
In many cases, batch processes can generate reports according to a defined schedule. Latency isn't normally an
issue. Batch processes should also be available for generation on a spontaneous basis, if needed. As an example,
if you are storing data in a relational database, such as Azure SQL Database, you can use a tool such as SQL
Server Reporting Services to extract and format data, and present it as a set of reports.
More information: Alerting
Alerting for operations
12/16/2022 5 minutes to read • Edit Online
Best practices
Alert use cases
Application alerts
Alerting is the process of generating notifications when a significant event is detected after analyzing
instrumentation data. It's an important part of any system that makes performance, availability, and privacy
guarantees to the users where an action is required immediately.
To ensure that the system remains healthy, responsive, and secure, it's highly recommended that you set alerts
so that operators can respond to them in a timely manner. An alert can contain enough contextual information
to help you quickly get started on diagnostic activities. Alerting can be used to invoke remediation functions
such as autoscaling. Alerts can also enable cost-awareness by watching budgets and limits.
Define a process for alert response that identifies the accountable owners and actions.
Configure alerts for a well-defined scope (resource types and resource groups) and adjust the verbosity to
minimize noise.
Use an automated alerting solution, such as Splunk or Azure Monitor, instead of having people actively look
for issues.
Take advantage of alerts to operationalize remediation processes. For example, automatically create tickets to
track issues and resolutions.
Track the health of Azure services in regions, communication about outages, planned maintenance activities,
and other health advisories.
Here are some common situations for which you might want to set alerts and associated actions:
Security eventsSecurity eventsIf the event logs indicate that repeated authentication or authorization failures are
occurring, the system might be under attack and an operator should be informed.
Performance metricsPerformance metrics—If a particular performance metric exceeds a specified threshold, the system must
quickly respond.
Availability informationAvailability informationIf a fault is detected, it might be necessary to quickly restart one or more
subsystems, or failover to a backup resource. Repeated faults in a subsystem might indicate more serious
concerns.
Cost and usage informationCost and usage information—If usage or cost exceeds the allocated budget, it might be necessary to raise
cost awareness to the department with notification that spending has reached a fixed threshold of the quota.
An alerting system should allow customizing alerts. The appropriate values from the underlying
instrumentation data can be provided to the alerting system as parameters. This approach enables an operator
to filter data and focus on those thresholds or combinations of interest values.
In some cases, the raw instrumentation data can be provided. In other situations, it might be more appropriate
to supply aggregated data. For example, an alert can be triggered if the CPU utilization for a node has exceeded
90% over the past 10 minutes. The details provided to the alerting system should also include any appropriate
summary and context information. This data can help reduce the possibility that false-positive events will trip an
alert.
Resource health alerts
Service health alerts
Process considerations
Subscription notification e
-
mailsSubscription notification e
-
mails
IT Service Management
(
ITSM
)
integrationIT Service Management
(
ITSM
)
integration
Health alerts should be configured for specific resource groups and resource types, and should be adjusted to
maximize signal to noise ratios. For example, only send a notification when a resource becomes unhealthy as
per the defined requirements of the application health model or due to an Azure platform-initiated event.
Consider transient issues when setting an appropriate threshold for resource unavailability, such as configuring
an alert for a virtual machine with a threshold of 1 minute for unavailability before an alert is triggered.
Azure Resource Health provides information about the health of individual resources such as a specific virtual
machine instance, and is useful when diagnosing unavailable resources.
You can also view resource health in Azure Monitor. Alerts can be defined by using many different data streams
such as metric values, log search queries, and activity log events.
Get a view into the health of Azure services and regions, communications about outages, planned maintenance
activities, and other health advisories.
Consider configuring Azure Service Health alerts to operationalize Service Health events. These alerts can
provide useful information in interpreting issues that you might have detected through the resource health
alerts. You can use Service Health events to decide the best way to respond operationally.
However, Service Health alerts aren't effective in detecting issues because of the associated latencies. There's a 5
minute Service Level Objectives (SLOs) for automated issues, but many issues require manual interpretation for
root cause analysis.
Alerts proactively notify and in cases even respond to operational states that deviate from the norm. When such
an event occurs, an alert is triggered to notify the accountable teams.
Having well-defined owners and response playbookswell-defined owners and response playbooks per alert is vital to optimizing operational
effectiveness. Alerts can be set for non-technical notifications. For example, a budget owner should be made
aware of capacity issues so that budgets can be adjusted and discussed.
Instead of having teams actively monitor the systems and dashboard, send reliable aler t notificationssend reliable aler t notifications to the
owners.
The operators might receive alert information by using many delivery channels such as email, text message, a
pager device, or push notifications to a mobile app. Many alerting systems support subscriber groups, and all
operators who are members of the same group can receive the same set of alerts.
In Azure Monitor, alerts use action groups to notify the owners.
Subscription notification e-mails can contain important service notifications or security alerts. Make sure that
subscription notification emails are routed to the relevant technical stakeholders.
ITSM systems can help define workflows. You can use those systems to document issues, notify and assign
responsible parties, and track issues. For example, operational alerts from the application could be integrated to
automatically create new tickets to track resolution.
Azure Monitor supports integrations with third-party ITSM systems. For example, if you have a custom solution
that ingests data through an incoming API, this can be engaged with an Azure Monitor action group each time
Alert prioritization
Next steps
Related links
an alert is raised. Many partner integrations are ready to use out of the box.
For information on Azure Monitor and ITSM integration, reference IT Service Management Connector Overview.
All alerts being treated the same is going to reduce the efficacy of notifications.
When defining alerts, analyze the potential business impact and prioritize accordingly. Prioritizing alerts helps
operational teams in cases where multiple events require intervention at the same time. For example, alerts
concerning critical system flows might require special attention. When creating an alert, ensure you establish
and set the correct priority.
One way of specifying the priority is by using a severity level that indicates how critical a situation is. This image
shows this case in Azure Monitor.
Return to the operational excellence overview.
Use case: Health monitoring
Overview of alerts in Microsoft Azure
Create and manage action groups
Health monitoring
12/16/2022 7 minutes to read • Edit Online
Requirements for health monitoring
Best practices
Distributed tracing
A system is healthy if it's running and can process requests. Health monitoring generates a snapshot of the
current health of the system so that you can verify all components are functioning as expected.
If any part of the system is unhealthy, you're alerted within a matter of seconds. You'll determine which parts of
the system are functioning normally and which parts are experiencing problems. A traffic light system indicates
system health:
Red for unhealthy. The system has stopped.
Yellow for partially healthy. The system is running with reduced functionality.
Green for healthy.
A comprehensive health monitoring system enables you to drill down to view the health status of subsystems
and components. For example, if the overall system is partially healthy, you can zoom in and determine which
functionality is currently unavailable.
Correlate events across all application components.
Use an Application Performance Management (APM) tool used to collect application level logs.
Use Application Insights to gather key metrics.
Collect application logs from different application environments.
Consider using log levels used to capture different types of application events.
Capture log messages in a structured format.
Set out critical application performance targets and non-functional requirements with clarity.
Identify known gaps in application observability that led to missed incidents or false positives in the past.
Consider different log aggregation technologies to collect logs and metrics from Azure resources.
Make logs and metrics available for [critical internal dependencies]#logs-for-internal-dependencies).
Implement black-box monitoring to measure platform services and the resulting customer experience.
Implement detailed instrumentation in the application code to better understand the customer experience.
Apply white-box monitoring to instrument the application with semantic logs and metrics.
Trace the execution of user requests to generate raw data to determine which requests have:
Succeeded
Failed
Taken too long
Distributed tracing allows you to build and visualize end-to-end transaction flows for the application. Events
coming from different application components or different component tiers of the application should be
correlated to build these flows.
For instance, using consistent correlation IDs transferred between components within a transaction achieves
Application Performance Management (APM) tools
Logs and metrics
Application logsApplication logs
Log levelsLog levels
Log messagesLog messages
end-to-end transaction flows.
Event correlation between application layers allows you to connect tracing data of the complete application
stack. You can create a complete picture of where time is spent at each layer through tools that can query the
tracing data repositories in correlation to a unique identifier. This unique identifier represents a given transaction
that flowed through the system.
To successfully maintain an application, it's important to
turn the lights on
to have clear visibility into important
metrics, both in real time and historically.
Application Performance Management (APM) toolsApplication Performance Management (APM) tools
An APM technology, such as Application Insights, should be used to manage the performance and availability of
the application, aggregating application level logs, and events for later interpretation. With cost in mind,
consider the appropriate level of logging required.
Application Insights is an extensible Application Performance Management (APM) service for developers and
DevOps professionals to monitor live applications. It automatically detects performance anomalies and includes
analytics tools to help users diagnose issues, and to understand what customers do with your application.
Application Insights monitors diagnostic trace logs from your application so that you can correlate trace events
with requests.
Application Insights allows you to:
Verify that your application is running correctly.
Makes application troubleshooting easier.
Provides custom business telemetry to indicate whether your application is being used as intended.
For more information about how Application Insights helps you monitor applications, reference Monitoring
workloads.
Logging is essential to understand how an application operates in various environments and what events occur
and under which conditions. There are two types of logs: application-generated logs and platform logs.
Application logs
support the end-to-end application lifecycle. You should collect logs and events across all major
application environments. A sufficient degree of separation and filtering should be in place to ensure non-critical
environments don't convolute production log interpretation. Corresponding log entries across the application
should capture a correlation ID for their respective transactions.
Use the following log levels to capture different types of application events across environments:
InfoInfo
WarningWarning
ErrorError
DebugDebug
Pre-configure the preceding log levels and apply these levels within relevant environments. Changing log levels
includes simple configuration changes to support operational scenarios where it's necessary to raise the log
level within an environment.
MetricsMetrics
Analyze health dataAnalyze health data
Performance targets and non
-
functional requirements
(
NFRs
)
Performance targets and non
-
functional requirements
(
NFRs
)
Gaps in application monitoringGaps in application monitoring
TIPTIP
Platform logsPlatform logs
Activity logs and diagnostic settingsActivity logs and diagnostic settings
TIPTIP
Capture log messages and application events in a structured format, following well-known schema. The
structured data type includes machine-readable data points rather than unstructured string types.
Structured data can help you:
Parse and analyze logs.
Index and search logs.
Simplify reporting.
Metrics
are numerical values that are collected at regular intervals and describe some aspect of a system at a
particular time. They reflect the health and usage statistics of your resources.
Analyzing health data involves quickly indicating whether the system is running through metrics. Hot analysis of
the immediate data triggers an alert if a critical component is detected as unhealthy.
For example, the component fails to respond to a consecutive series of pings. You can then take the appropriate
corrective action. For more information about analyzing data, reference Analyze monitoring data for cloud
applications.
Application-level metrics should include end-to-end transaction times of key technical functions, such as:
Database queries.
Response times for external API calls.
Failure rates of processing steps, and so on.
To assess fully the health of key scenarios in the context of targets and NFRs, correlate application log events,
such as user login, across critical system flows.
Known gaps in application observability lead to missed incidents and false positives.
What you can't see, you can't measure. What you can't measure, you can't improve.
Platform logs
provide detailed diagnostic and auditing information for the infrastructure, and resources they
depend on. Monitoring your platform includes collecting rich metrics and logs to verify the state of your
complete infrastructure, and to react promptly if there are any issues.
Application Insights or a full-stack monitoring service like Azure Monitor can help you keep tabs on your entire
landscape.
For more information, reference Monitoring workloads.
Activity logs provide audit information about when a resource is modified, such as when a virtual machine is
started or stopped. Such information is useful for the interpretation and troubleshooting of operational and
performance issues because it provides transparency around configuration changes.
We recommend collecting and aggregating activity logs.
Log aggregation technologiesLog aggregation technologies
Logs for internal dependenciesLogs for internal dependencies
Black
-
box monitoringBlack
-
box monitoring
Instrumentation
White
-
box monitoringWhite
-
box monitoring
Next steps
Log aggregation technologies, such as Azure Log Analytics or Splunk, should be used to collate logs and metrics
across all application components for later evaluation. Resources may include Azure IaaS and PaaS services, and
third-party appliances such as firewalls or anti-malware solutions used in the application. For instance, if Azure
Event Hub is used, the Diagnostic SettingsDiagnostic Settings should be configured to push logs and metrics to the data sink.
Understanding usage helps with right-sizing of the workload, but extra costs for logging should be accepted and
included in the cost model.
All application resources should be configured to route diagnostic logs and metrics to the chosen log
aggregation technology. Use Azure Policy to ensure the consistent use of diagnostic settings across the
application and to enforce the configuration you want for each Azure service.
To build a robust application health model, it's vital to have visibility into the operational state of critical internal
dependencies, such as a shared Network Virtual Appliance (NVA) or ExpressRoute connection.
Black-box monitoring
tests externally visible application behavior without knowledge of the internals of the
system. This type of monitoring is a common approach to measuring customer-centric SLIs, SLOs, and SLAs.
For more information, reference Azure Monitor.
Instrumentation of your code allows precise detection of underperforming pieces when you apply load or stress
tests. It's critical to have this data available to improve and identify performance opportunities in the application
code. Use Application Performance Monitoring (APM) tools, such as Application Insights, to manage the
performance and availability of the application, along with aggregating application-level logs, and events for
later interpretation.
For more resources about instrumentation, reference Monitor performance.
Application-level metrics and logs, such as current memory consumption or request latency, should be collected
from the application to inform a health model, detect, and predict issues.
For more information, reference Instrumenting an application with Application Insights and Instrument an
application.
Usage monitoring
Usage monitoring
12/16/2022 2 minutes to read • Edit Online
Benefits of usage monitoring
Requirements for usage monitoring
Requirements for data collection
Next steps
Usage monitoring tracks how the features and components of an application are used.
This article describes how you can use the data gathered from usage monitoring to gain insight into operational
events that affect your application and workloads.
The following list explores the use cases for data gathered from usage monitoring:
Determine which features you use heavily and determine any potential hotspots in the system. High-traffic
elements may benefit from functional partitioning or even replication to spread the load more evenly. You
can use this information to figure out which features you don't use often and possible candidates for
retirement, or replacement in a future version of the system.
Collect information about the operational events of the system under normal use. For example, in an e-
commerce site, you can record the statistical information about the number of transactions and the volume
of customers that are responsible for them. You can use this information for capacity planning as the number
of customers grows.
Detect user satisfaction with the performance or functionality of the system. For example, if a large number
of customers in an e-commerce system regularly abandon their shopping carts, this behavior may mean
there's a problem with the checkout functionality.
Generate billing information. An application or service may charge customers for the resources they use.
Enforce quotas. If a user exceeds their paid quota of processing time or resource usage during a specific
period, the system can limit their access or throttle processing.
To analyze system usage, you'll need monitoring information that includes:
The number of requests that each subsystem processes and directs to each resource.
The work that each user does.
The volume of data storage that each user occupies.
The resources that each user accesses.
You can track usage at a relatively high level. Usage tracking can note the start and end times of each request
and the nature of the request, such as read, write, and so on, depending on the resource in question. Retrieve
this information through the following ways:
Trace user activity.
Capture performance counters that measure the usage for each resource.
Monitor each users' resource consumption.
For accounting purposes, you'll want to identify which users are responsible for doing which operations, and the
resources that these operations use. The gathered information should be detailed enough for accurate billing.
Issue tracking
Issue tracking
12/16/2022 2 minutes to read • Edit Online
Requirements for issue tracking
Requirements for data collection
Analyze data
Next steps
If unexpected events occur in the system, customers and other users may report these issues.
Issue tracking involves:
Managing issues.
Associating issues with efforts to resolve underlying problems in the system.
Informing customers of possible resolutions.
You can often track issues using a separate system that lets you record and report the details of problems that
users report. These details can include:
Tasks the user was doing.
Symptoms of the problem.
Sequence of events.
Error or warning messages.
The user who initially reported the issue is considered the primary data source. This user can provide more
information, such as:
A crash dump, if the application includes a component that runs on the user's desktop.
A screenshot.
The date and time the error occurred.
The user's location.
You can use this information to help debug and create a backlog for future software releases.
Consider the following scenarios when you analyze issue-tracking data:
Different users may report the same problem. The issue-tracking system should associate common reports.
Record the debugging progress against each issue report.
Inform customers of the solution when you've resolved the issue.
If a user reports an issue that has a known solution in the issue-tracking system, inform the user of the
solution immediately.
Tracing and debugging
Tracing and debugging
12/16/2022 2 minutes to read • Edit Online
Root cause analysis
Requirements for tracing and debugging
Requirements for data collection
NOTENOTE
Next steps
When a user reports an issue, the user is often aware only of the immediate effect that the issue has on their
operations. The user can only report the results of their own experience back to the person who is responsible
for maintaining the system. These experiences are a visible symptom of one or more fundamental problems.
In many cases, an analyst must dig through the chronology of the underlying operations to establish the root
cause of the problem. This process is called a
root cause analysis
.
A root cause analysis may uncover inefficiencies in application design. In these situations, you can try to rework
the affected elements and deploy them as part of a later release. This process requires careful control, and you
should monitor the updated components closely.
For tracing unexpected events and other problems, consider the following requirements:
Monitoring data must provide enough information to enable an analyst to trace the origin of an issue and
reconstruct the sequence of events that lead up to the issue.
Data must be sufficient for the analyst to identify a root cause.
A root cause enables the developer to make the necessary changes to prevent the issue from recurring.
Troubleshooting involves the following data collection requirements:
Trace all methods and their parameters used in an operation to create a model that shows the logical
flow
through the system when a customer makes a specific request.
Capture and log exceptions, and warnings that the system generates because of this flow.
To support debugging, the system should provide the following data:
Hooks that enable you to capture state information at critical points in the system.
Step-by-step information as selected operations continue.
Capturing detailed data can cause extra load on the system and should be a temporary process. Only capture detailed
data when an unusual series of events occur, which are difficult to replicate. Alternately, only capture detailed data when
you're monitoring a new release to ensure that new elements in the system function as expected.
Auditing
Auditing
12/16/2022 2 minutes to read • Edit Online
Requirements for auditing
Requirements for data collection
Analyze audit data
Next steps
Depending on the application, there may be legal requirements for auditing users' operations and recording all
data access. Auditing can provide evidence that links customers to specific requests. Affirming validity is an
important factor in many online business systems to help maintain trust between the customer and the
business responsible for the application, or service.
An analyst can trace the sequence of business operations that users perform so that you can reconstruct users'
actions. Tracing the sequence of operations may be necessary as a matter of record, or as part of a forensic
investigation.
Audit information is highly sensitive. This information includes data that identifies the users of the system and
the tasks that they're doing. Reports contain sensitive audit information available only to trusted analysts. An
analyst can generate a range of reports. For example, reports may list the following activities:
All users' activities occurring during a specified time frame.
The chronology of a single user's activity.
The sequence of operations performed against one or more resources.
The primary sources of auditing information can include:
The security system that manages user authentication.
Trace logs that record user activity.
Security logs that track all network requests.
Regulatory requirements may dictate the format of the audit data and the way it's stored. For example, it may
not be possible to clean the data in any way. It must be recorded in its original format. Access to the data
repository must be protected to prevent tampering.
An analyst must access all the raw data in its original form. Aside from the common audit report requirement,
the tools for analyzing this data are specialized and external to the system.
DevOps Checklist
Operational Excellence patterns
12/16/2022 2 minutes to read • Edit Online
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Ambassador Create helper services that send network requests on behalf
of a consumer service or application.
Anti-Corruption Layer Implement a fade or adapter layer between a modern
application and a legacy system.
External Configuration Store Move configuration information out of the application
deployment package to a centralized location.
Gateway Aggregation Use a gateway to aggregate multiple individual requests into
a single request.
Gateway Offloading Offload shared or specialized service functionality to a
gateway proxy.
Gateway Routing Route requests to multiple services using a single endpoint.
Health Endpoint Monitoring Implement functional checks in an application that external
tools can access through exposed endpoints at regular
intervals.
Sidecar Deploy components of an application into a separate
process or container to provide isolation and encapsulation.
Strangler Incrementally migrate a legacy system by gradually replacing
specific pieces of functionality with new applications and
services.
Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in
some cases, the operating system. This can make management and monitoring more difficult than an on-
premises deployment. Applications must expose runtime information that administrators and operators can use
to manage and monitor the system, as well as supporting changing business requirements and customization
without requiring the application to be stopped or redeployed.
Overview of the performance efficiency pillar
12/16/2022 2 minutes to read • Edit Online
Topics
P ERF O RM A N C E EFF IC IEN C Y TOP ICP ERF O RM A N C E EFF IC IEN C Y TOP IC DESC RIP T IONDESC RIP T ION
Performance efficiency checklist Review your application architecture to ensure your
workload scales to meet the demands placed on it by users
in an efficient manner.
Performance principles Principles to guide you in your overall strategy for improving
performance efficiency.
Design for performance Review your application architecture from a performance
design standpoint.
Consider scalability Plan for growth by understanding your current workloads.
Plan for capacity Plan to scale your application tier by adding extra
infrastructure to meet demand.
Performance monitoring checklist Monitor services and check the health state of current
workloads to maintain overall workload performance.
Performance patterns Implement design patterns to build more performant
workloads.
Tradeoffs Consider tradeoffs between performance optimization and
other aspects of the design, such as reliability, security, cost
efficiency, and operability.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. Before the cloud became popular, when it came to planning how a system would handle
increases in load, many organizations intentionally provisioned oversized workloads to meet business
requirements. This decision made sense in on-premises environments because it ensured
capacity
during peak
usage. Capacity reflects resource availability (CPU and memory). Capacity was a major consideration for
processes that would be in place for many years.
Just as you need to anticipate increases in load in on-premises environments, you need to expect increases in
cloud environments to meet business requirements. One difference is that you may no longer need to make
long-term predictions for expected changes to ensure you'll have enough capacity in the future. Another
difference is in the approach used to manage performance.
To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, reference
the Microsoft Azure Well-Architected Review.
To boost performance efficiency, we recommend watching Performance Efficiency: Fast & Furious: Optimizing
for Quick and Reliable VM Deployments.
The performance efficiency pillar covers the following topics to help you effectively scale your workload:
Next steps
Reference the performance efficiency principles intended to guide you in your overall strategy.
Principles
Performance efficiency principles
12/16/2022 2 minutes to read • Edit Online
Design for horizontal scaling
A P P RO A C HA P P R OA C H BE NE FITB EN E FIT
Define a capacity model according to the business
requirements
Test the limits for predicted and random spikes andpredicted and random spikes and
fluctuationsfluctuations in load to make sure the application can scale.
Factor in the SKU service limits and regional limits so that
application scales as expected if there's a regional failure.
Use PaaS offerings Take advantage of the built-in capabilities that
automatically trigger scalingautomatically trigger scaling operations instead of
investing in manual scaling efforts that often require custom
implementations and can be error prone.
Choose the right resources and right-size Determine if the resources can support the anticipated load.
Also, justify the cost implications of the choices.
Apply strategies in your design early Accelerate adoption without significant changes. For
example, strive for stateless applicationstrive for stateless application and store state
externally in a database or distributed cache. Use cachingUse caching
where possible, to minimize the processing load.
Shift-left on performance testing
A P P RO A C HA P P R OA C H BE NE FITB EN E FIT
Run load and stress tests Measure the application's performance under
predetermined amounts of loadpredetermined amounts of load and also the
maximum loadmaximum load your application and its infrastructure can
withstand.
Performance efficiency is the ability of your workload to adjust to changes in demands placed on it by users in
an efficient manner. These principles are intended to guide you in your overall strategy for improving
performance efficiency.
Horizontal scaling allows for elasticity
. Instances are added (scale-out) or removed (scale-in) in response to
changes in load. Scaling out can improve resiliency by building redundancy. Scaling in can help reduce costs by
shutting down excess capacity.
An alternate approach is vertical scaling (scale up). However, you eventually may reach a limit where there isn't a
larger system, and you can't scale up anymore. At that point, any further scaling must be horizontal. So it's good
practice to employ a scale-out architecture early on.
Test early and test often to catch issues early
.
Establish performance baselines Determine the current efficiencycurrent efficiency of the application and its
supporting infrastructure. You'll be able to identify
bottlenecks early before it worsens with load. Also, this
strategy can lead to strategies for improvementsstrategies for improvements and
determine if the application is meeting the business goals.
Run the test in the continuous integration (CI) build pipeline. Detect issues early. Any development effort must go
through continuous performance testing to make sure
changes to the codebase doesn't negatively affect
performance.
A P P RO A C HA P P R OA C H BE NE FITB EN E FIT
Continuously monitor for performance in production
A P P RO A C HA P P R OA C H BE NE FITB EN E FIT
Monitor the health of the entire solution Know about the scalability
and
resiliency of the
infrastructure, application, and dependent services. Gather
and review key performance counters regularly.
Capture data from repeatable processes Evaluate the metrics over time that would allow for
autoscaling with demand. For reliability, look for early
warning signs that might require proactive intervention.
Reevaluate the needs of the workload continuously Identify improvement opportunities with resolution
planning. This effort may require updated configurations and
deprecations in favor of more-appropriate solutions.
Next section
Related links
Observe the system holistically to evaluate the overall health of the solution
. Capture the test results not only in
dev/test environment but also in production. Monitoring and logging in production can help identify bottlenecks
and opportunities for improvement.
Use this checklist to review your application architecture from a performance design standpoint.
Design checklist
Performance efficiency impacts the entire architecture spectrum and is interrelated with other pillars of
the Microsoft Azure Well-Architected Framework.
Assess your workload using the Microsoft Azure Well-Architected Review tool.
Checklist
-
Design for performance efficiency
12/16/2022 14 minutes to read • Edit Online
Application design
Application design is critical to handling scale as load increases. Design is part of the Performance Efficiency
pillar in the Microsoft Azure Well-Architected Framework. Use this checklist to review your application
architecture from a performance design standpoint.
Design for scalingDesign for scaling. Scaling allows applications to react to variable load by increasing and decreasing
the number of instances of roles, queues, and other services they use. However, the application must be
designed with this in mind. During scale operations, application and service instances come and go, and
because of this they must be stateless. This prevents the addition or removal of specific instances from
adversely affecting current users.
You should also implement configuration, autodetection, or load balancing so that as services are
added or removed, the application can perform the necessary routing. For example, a web application
might use a set of queues in a round-robin approach to route requests to background services
running in worker roles. The web application must be able to detect changes in the number of queues,
to successfully route requests and balance the load on the application.
You should scale outbound connectivity to the internet with Azure Virtual Network NAT. Virtual
Network NAT (NAT gateway) provides not only a scalable but also reliable and secure way to connect
outbound to the internet and also helps prevent connection failures caused by SNAT exhaustion.
Scale as a unitScale as a unit. Plan for additional resources to accommodate growth. For each resource, know the
upper scaling limits, and use sharding or decomposition to go beyond these limits. Determine the scale
units for the system in terms of well-defined sets of resources. This makes applying scale-out operations
easier. It also makes operations less prone to negative impact on the application through limitations
imposed by lack of resources in some part of the overall system. For example, adding x number of web
and worker roles might require y number of additional queues and z number of storage accounts to
handle the additional workload generated by the roles. So a scale unit could consist of x web and worker
roles,
y
queues, and
z
storage accounts. Design the application so that it's easily scaled by adding one or
more scale units.
Take advantage of platform autoscaling featuresTake advantage of platform autoscaling features. Where the hosting platform supports an
autoscaling capability, such as Azure Autoscale, prefer it to custom or third party mechanisms unless the
built-in mechanism can't fulfill your requirements. Use scheduled scaling rules where possible to ensure
resources are available without a start-up delay, but add reactive autoscaling to the rules where
appropriate to cope with unexpected changes in demand. You can use the autoscaling operations in the
classic deployment model (the older model) to adjust autoscaling, and to add custom counters to rules.
For more information, see Auto-scaling guidance.
Par tition the workloadPar tition the workload. Design parts of the process to be discrete and decomposable. Minimize the
size of each part, while following the usual rules for separation of concerns and the single responsibility
principle. This allows the component parts to be distributed in a way that maximizes use of each compute
unit (such as a role or database server). It also makes it easier to scale the application by adding instances
of specific resources. For complex domains, consider adopting a microservices architecture.
Avoid client affinityAvoid client affinity. Where possible, ensure that the application does not require affinity. When you do
this, requests can be routed to any instance, and the number of instances is irrelevant. This also avoids
the overhead of storing, retrieving, and maintaining state information for each user.
Data management
Offload CPU-intensive and I/O-intensive tasks as background tasksOffload CPU-intensive and I/O-intensive tasks as background tasks. If a request to a service is
expected to take a long time to run or absorb considerable resources, offload the processing for this
request to a separate task. Use worker roles or background jobs (depending on the hosting platform) to
execute these tasks. This strategy enables the service to continue receiving requests and remain
responsive. For more information, see Background jobs guidance.
Distribute the workload for background tasksDistribute the workload for background tasks. Where there are many background tasks, or the
tasks require considerable time or resources, spread the work across multiple compute units (such as
worker roles or background jobs). For one possible solution, see the Competing Consumers pattern.
Consider moving toward a Consider moving toward a
shared-nothingshared-nothing
architecture architecture. A shared-nothing architecture uses
independent, self-sufficient nodes that have no single point of contention (such as shared services or
storage). In theory, such a system can scale almost indefinitely. While a fully shared-nothing approach is
generally not practical for most applications, it may provide opportunities to design for better scalability.
For example, avoiding the use of server-side session state, client affinity, and data partitioning are good
examples of moving toward a shared-nothing architecture.
NOTENOTE
Use data par titioningUse data par titioning. Divide the data across multiple databases and database servers, or design the
application to use data storage services that can provide this partitioning transparently (examples include
Azure SQL Database Elastic Database, and Azure Table storage). This approach can help to maximize
performance and allow easier scaling. There are different partitioning techniques, such as horizontal,
vertical, and functional. You can use a combination of these to achieve maximum benefit from increased
query performance, simpler scalability, more flexible management, better availability, and matching the
type of store to the data it will hold.
Consider using different types of data store for different types of data. Choose the types based on how well they
are optimized for the specific type of data. This may include using table storage, a document database, or a
column-family data store, instead of, or as well as, a relational database. For more information, see Data
partitioning guidance.
Design for eventual consistencyDesign for eventual consistency. Eventual consistency improves scalability by reducing or removing
the time needed to synchronize related data partitioned across multiple stores. The cost is that data is not
always consistent when it is read, and some write operations may cause conflicts. Eventual consistency is
ideal for situations where the same data is read frequently but written infrequently.
Reduce chatty interactions between components and ser vicesReduce chatty interactions between components and ser vices. Avoid designing interactions in
which an application is required to make multiple calls to a service (each of which returns a small amount
of data), rather than a single call that can return all of the data. Where possible, combine several related
operations into a single request when the call is to a service or component that has noticeable latency.
This makes it easier to monitor performance and optimize complex operations. For example, use stored
procedures in databases to encapsulate complex logic, and reduce the number of round trips and
resource locking.
Use queues to level the load for high velocity data writesUse queues to level the load for high velocity data writes. Surges in demand for a service can
overwhelm that service and cause escalating failures. To prevent this, consider implementing the Queue-
Based Load Leveling pattern. Use a queue that acts as a buffer between a task and a service that it
invokes. This can smooth intermittent heavy loads that may otherwise cause the service to fail or the task
to time out.
Minimize the load on the data storeMinimize the load on the data store. The data store is commonly a processing bottleneck, a costly
NOTENOTE
resource, and often not easy to scale out. Where possible, remove logic (such as processing XML
documents or JSON objects) from the data store, and perform processing within the application. For
example, instead of passing XML to the database (other than as an opaque string for storage), serialize or
deserialize the XML within the application layer and pass it in a form that is native to the data store.
Typically, it's much easier to scale out the application than the data store, so you should attempt to do as much of
the compute-intensive processing as possible within the application.
Minimize the volume of data retrievedMinimize the volume of data retrieved. Retrieve only the data you require by specifying columns
and using criteria to select rows. Make use of table value parameters and the appropriate isolation level.
Use mechanisms such as entity tags to avoid retrieving data unnecessarily.
Aggressively use cachingAggressively use caching. Use caching wherever possible to reduce the load on resources and
services that generate or deliver data. Caching is typically suited to data that is relatively static, or that
requires considerable processing to obtain. Caching should occur at all levels where appropriate in each
layer of the application, including data access and user interface generation. For more information, see
Caching Guidance.
Handle data growth and retentionHandle data growth and retention. The amount of data stored by an application grows over time.
This growth increases storage costs as well as latency when accessing the data, affecting application
throughput and performance. It may be possible to periodically archive some of the old data that is no
longer accessed, or move data that is rarely accessed into long-term storage that is more cost efficient,
even if the access latency is higher.
Optimize Data Transfer Objects (DTOs) using an efficient binar y formatOptimize Data Transfer Objects (DTOs) using an efficient binar y format. DTOs are passed
between the layers of an application many times. Minimizing the size reduces the load on resources and
the network. However, balance the savings with the overhead of converting the data to the required
format in each location where it is used. Adopt a format that has the maximum interoperability to enable
easy reuse of a component.
Set cache controlSet cache control. Design and configure the application to use output caching or fragment caching
where possible, to minimize processing load.
Enable client side cachingEnable client side caching. Web applications should enable cache settings on the content that can be
cached. This is commonly disabled by default. Configure the server to deliver the appropriate cache
control headers to enable caching of content on proxy servers and clients.
Use Azure blob storage and the Azure Content Deliver y Network to reduce the load on theUse Azure blob storage and the Azure Content Deliver y Network to reduce the load on the
applicationapplication. Consider storing static or relatively static public content, such as images, resources, scripts,
and style sheets, in blob storage. This approach relieves the application of the load caused by dynamically
generating this content for each request. Additionally, consider using the Content Delivery Network to
cache this content and deliver it to clients. Using the Content Delivery Network can improve performance
at the client because the content is delivered from the geographically closest datacenter that contains a
Content Delivery Network cache. For more information, see Best practices for using content delivery
networks.
Optimize and tune SQL queries and indexesOptimize and tune SQL queries and indexes. Some T-SQL statements or constructs may have an
adverse effect on performance that can be reduced by optimizing the code in a stored procedure. For
example, avoid converting datetimedatetime types to a varcharvarchar before comparing with a datetimedatetime literal value.
Use date/time comparison functions instead. Lack of appropriate indexes can also slow query execution.
If you use an object/relational mapping framework, understand how it works and how it may affect
performance of the data access layer. For more information, see Query Tuning.
Implementation
Consider denormalizing dataConsider denormalizing data. Data normalization helps to avoid duplication and inconsistency.
However, maintaining multiple indexes, checking for referential integrity, performing multiple accesses to
small chunks of data, and joining tables to reassemble the data imposes an overhead that can affect
performance. Consider if some additional storage volume and duplication is acceptable in order to
reduce the load on the data store. Also consider if the application itself (which is typically easier to scale)
can be relied on to take over tasks such as managing referential integrity in order to reduce the load on
the data store. For more information, see Horizontal, vertical, and functional data partitioning.
NOTENOTE
NOTENOTE
Review the performance antipatternsReview the performance antipatterns. See Performance antipatterns for cloud applications for
common practices that are likely to cause scalability problems when an application is under pressure.
Use asynchronous callsUse asynchronous calls. Use asynchronous code wherever possible when accessing resources or
services that may be limited by I/O or network bandwidth, or that have a noticeable latency, in order to
avoid locking the calling thread.
Avoid locking resources, and use an optimistic approach insteadAvoid locking resources, and use an optimistic approach instead. Never lock access to resources
such as storage or other services that have noticeable latency, because this is a primary cause of poor
performance. Always use optimistic approaches to managing concurrent operations, such as writing to
storage. Use features of the storage layer to manage conflicts. In distributed applications, data may be
only eventually consistent.
Compress highly compressible data over high latency, low bandwidth networksCompress highly compressible data over high latency, low bandwidth networks. In the
majority of cases in a web application, the largest volume of data generated by the application and
passed over the network is HTTP responses to client requests. HTTP compression can reduce this
considerably, especially for static content. This can reduce cost as well as reduce the load on the network,
though compressing dynamic content does apply a fractionally higher load on the server. In more
generalized environments, data compression can reduce the volume of data transmitted and minimize
transfer time and costs, but the compression and decompression processes incur overhead.
Compression should only be used when there is a demonstrable gain in performance. Other serialization methods,
such as JSON or binary encodings, may reduce the payload size while having less impact on performance, whereas
XML is likely to increase it.
Minimize the time that connections and resources are in useMinimize the time that connections and resources are in use. Maintain connections and
resources only for as long as you need to use them. For example, open connections as late as possible,
and allow them to be returned to the connection pool as soon as possible. Acquire resources as late as
possible, and dispose of them as soon as possible.
Minimize the number of connections requiredMinimize the number of connections required. Service connections absorb resources. Limit the
number that are required and ensure that existing connections are reused whenever possible. For
example, after performing authentication, use impersonation where appropriate to run code as a specific
identity. This can help to make best use of the connection pool by reusing connections.
APIs for some services automatically reuse connections, provided service-specific guidelines are followed. It's
important that you understand the conditions that enable connection reuse for each service that your application
uses.
Send requests in batches to optimize network useSend requests in batches to optimize network use. For example, send and read messages in
Next steps
batches when accessing a queue, and perform multiple reads or writes as a batch when accessing storage
or a cache. This can help to maximize efficiency of the services and data stores by reducing the number of
calls across the network.
Avoid a requirement to store ser ver-side session stateAvoid a requirement to store ser ver-side session state where possible. Server-side session state
management typically requires client affinity (that is, routing each request to the same server instance),
which affects the ability of the system to scale. Ideally, you should design clients to be stateless with
respect to the servers that they use. However, if the application must maintain session state, store
sensitive data or large volumes of per-client data in a distributed server-side cache that all instances of
the application can access.
Optimize table storage schemasOptimize table storage schemas. When using table stores that require the table and column names to
be passed and processed with every query, such as Azure table storage, consider using shorter names to
reduce this overhead. However, do not sacrifice readability or manageability by using overly compact
names.
Create resource dependencies during deployment or at application star tupCreate resource dependencies during deployment or at application star tup. Avoid repeated
calls to methods that test the existence of a resource and then create the resource if it does not exist.
Methods such as
CloudTable.CreateIfNotExists
and
CloudQueue.CreateIfNotExists
in the Azure Storage
Client Library follow this pattern. These methods can impose considerable overhead if they are invoked
before each access to a storage table or storage queue. Instead:
Create the required resources when the application is deployed, or when it first starts. (A single call
to
CreateIfNotExists
for each resource in the startup code for a web or worker role is acceptable.)
However, be sure to handle exceptions that may arise if your code attempts to access a resource
that doesn't exist. In these situations, you should log the exception, and possibly alert an operator
that a resource is missing.
Under some circumstances, it may be appropriate to create the missing resource as part of the
exception handling code. Adopt this approach with caution, as the non-existence of the resource
might indicate a programming error (for example, a misspelled resource name), or some other
infrastructure-level issue.
Use lightweight frameworksUse lightweight frameworks. Carefully choose the APIs and frameworks you use to minimize resource
usage, execution time, and overall load on the application. For example, using Web API to handle service
requests can reduce the application footprint and increase execution speed, but it may not be suitable for
advanced scenarios where the additional capabilities of Windows Communication Foundation are
required.
Consider minimizing the number of ser vice accountsConsider minimizing the number of ser vice accounts. For example, use a specific account to
access resources or services that impose a limit on connections, or perform better where fewer
connections are maintained. This approach is common for services such as databases, but it can affect the
ability to accurately audit operations due to the impersonation of the original user.
Challenges of monitoring distributed architectures
Challenges of monitoring distributed architectures
12/16/2022 2 minutes to read • Edit Online
Key points
Team expertise
Scaling issues
Antipatterns in design
Fault tracking
Most cloud deployments are based on distributed architectures where components are distributed across
various services. Troubleshooting monolithic applications often requires only one or two lenses—the application
and the database. With distributed architectures, troubleshooting is complex and challenging because of various
factors. This article describes some of those challenges.
The team may not have the expertise across all the services used in an architecture.
Uncovering and resolving bottlenecks by monitoring all of your services and their infrastructure is complex.
Antipatterns in design and code causes issues the application is under pressure.
Resilience any service may impact your application's ability to meet current load.
Distributed architectures require many areas of expertise. To adequately monitor performance, it's critical that
telemetry is captured throughout the application, across all services, and is rich. Also, your team should have the
necessary skills to troubleshoot all services used in the architecture. When making technology choices, it's
advantageous and simple to choose a service over another because of the team's expertise. As the collective
skillset grows, you can incorporate other technologies.
For monolithic applications, scale is two-dimensional. An application usually consist of a group of application
servers, some web front ends (WFEs), and database servers.
Uncovering
bottlenecks is simpler but
resolving
them can require considerable effort. For distributed applications, complexity increases exponentially in both
aspects for performance issues. You must consider each application, its supporting service, and the latency
between all the application layers.
Performance efficiency is a complex mixture of applications and infrastructure (IaaS and PaaS). First, you must
ensure that all services can scale to support the expected load and that one service will not cause a bottleneck.
Second, while performance testing, you may realize that certain services scale under different load conditions as
opposed to scaling all services uniformly. Monitoring all of your services and their infrastructure can help fine-
tune your application for optimal performance.
For more information about monitoring for scalability, see Monitor performance for scalability and reliability.
Antipatterns in design and code are a common cause for performance problems when an application is under
pressure. For example, an application behaves as expected during performance testing. However, when it's
released to production and starts to handle live workloads, performance decreases. Problems such as rejecting
user requests, stalling, or throwing exceptions may arise. To learn how to identify and fix these antipatterns, see
Performance antipatterns for cloud applications.
If a service in the architecture fails, how will it affect your application's overall performance? Is the error
Next
Community links
Related links
transient, allowing your application to continue to function; or, will the application experience a critical failure? If
the error is transient, does your application experience a decrease in performance? Resiliency plays a significant
role in performance efficiency because the failure of any service may impact your application's ability to meet
your business goals and scale to meet current load. Chaos testing—the introduction of random failures within
your infrastructure—against your application can help determine how well your application continues to
perform under varying stages of load.
For more information about reliability impacts on performance, see Monitor performance for scalability and
reliability.
Design scalable Azure applications
To learn more about chaos testing, see Advancing resilience through chaos engineering and fault injection.
Performance antipatterns for cloud applications
Monitor performance for scalability and reliability
Back to the main article
Design scalable Azure applications
12/16/2022 8 minutes to read • Edit Online
Choose the right data storage
Database considerations
Why use a relational database?Why use a relational database?
Application design is critical to handling scale as load increases. This article will give you insight on the most
important topics. For more topics related to handling scale, see the Design Azure applications for efficiency
article in the Performance efficiency pillar.
The overall design of the data tier can greatly affect an application's performance and scalability. The Azure
storage platform is designed to be massively scalable to meet the data storage and performance needs of
modern applications.
Data services in the Azure storage platform are:
Azure Blob - A massively scalable object store for text and binary data. Includes support for big data analytics
through Data Lake Storage Gen2.
Azure Files - Managed file shares for cloud or on-premises deployments.
Azure Queue - A messaging store for reliable messaging between application components.
Azure Tables - A NoSQL store for schemaless storage of structured data.
Azure Disks - Block-level storage volumes for Azure VMs.
Most cloud workloads adopt the
polyglot
persistence approach. Instead of using one data store service, a mix of
technologies is used. Your application will most likely require more than one type of data store depending on
your requirements. For examples of when to use these data storage types, see Example scenarios.
Each service is accessed through a storage account. To get started, see Create a storage account.
The choice of database can affect an application's performance and scalability. Database reads and writes involve
a network call and storage I/O, both of which are expensive operations. Choosing the right database service to
store and retrieve data is therefore a critical decision and must be considered to ensure application scalability.
Azure has many database services that will fit most needs. In addition, there are third-party options that can be
considered from Azure Marketplace.
To help you choose a database type, determine if the application's storage requirements fit a relational design
(SQL) versus a key-value/document/graph design (NoSQL). Some applications may have both a SQL and a
NoSQL database for different storage needs. Use the Azure data store decision tree to help you find the
appropriate managed data storage solution.
Use a relational database when strong consistency guarantees are important — where all changes are atomic,
and transactions always leave the data in a consistent state. However, a relational database generally can't scale
out horizontally without sharding the data in some way. Implementing manual sharding can be a time
consuming task. Also, the data in relational database must be normalized, which isn't appropriate for every data
set.
If a relational database is considered optimal, Azure offers several PaaS options that fully manage hosting and
operations of the database. Azure SQL Database can host single databases or multiple databases (Azure SQL
Database Managed Instance). The suite of offerings spans requirements that cross performance, scale, size,
Why use a NoSQL database?Why use a NoSQL database?
Choose the right VM size
Build with microservices
resiliency, disaster recovery, and migration compatibility. Azure offers the following PaaS relational database
services:
Azure SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB
Use a NoSQL database when application performance and availability are more important than strong
consistency. NoSQL databases are ideal for handling large, unrelated, indeterminate, or rapidly changing data.
NoSQL databases have trade-offs. For specifics, see Some challenges with NoSQL databases.
Azure provides two managed services that optimize for NoSQL solutions: Azure Cosmos DB and Azure Cache
for Redis. For document and graph databases, Azure Cosmos DB offers extreme scale and performance.
For a detailed description of NoSQL and relational databases, see Understanding the differences.
Choosing the wrong VM size can result in capacity issues as VMs approach their limits. It can also lead to
unnecessary cost. To choose the right VM size, consider your workloads, number of CPUs, RAM capacity, disk
size, and speed according to business requirements. For a snapshot of Azure VM sizes and their purpose, see
Sizes for virtual machines in Azure.
Azure offers the following categories of VM sizes, each designed to run different workloads. Click a category for
details.
General-purpose - Provide balanced CPU-to-memory ratio. Ideal for testing and development, small to
medium databases, and low to medium traffic web servers.
Memory optimized - Offer a high memory-to-CPU ratio that is great for relational database servers, medium
to large caches, and in-memory analytics.
Compute optimized - Have a high CPU-to-memory ratio. These sizes are good for medium traffic web
servers, network appliances, batch processes, and application servers.
GPU optimized - Available with single, multiple, or fractional GPUs. These sizes are designed for compute-
intensive, graphics-intensive, and visualization workloads.
High performance compute - Designed to deliver leadership-class performance, scalability, and cost
efficiency for a variety of real-world HPC workloads.
Storage optimized - Offer high disk throughput and IO, and are ideal for Big Data, SQL, NoSQL databases,
data warehousing, and large transactional databases. Examples include Cassandra, MongoDB, Cloudera, and
Redis.
You can change the sizing requirements according to your needs and requirements.
Microservices are a popular architectural style for building applications that are resilient, highly scalable,
independently deployable, and able to evolve quickly. A microservices architecture consists of a collection of
small, autonomous services. Each service is self-contained and should implement a single business capability.
Breaking up larger entities into small discrete pieces alone doesn't ensure sizing and scaling capabilities.
Application logic needs to be written to control this.
One of the many benefits of microservices is that they can be scaled independently. This lets you scale out
subsystems that require more resources, without scaling out the entire application. Another benefit is fault
isolation. If an individual microservice becomes unavailable, it won't disrupt the entire application, as long as
Use dynamic service discovery for microservices applications
TIPTIP
Establish connection pooling
Pool size limitsPool size limits
any upstream microservices are designed to handle faults correctly (for example, by implementing circuit
breaking).
To learn more about the benefits of microservices, see Benefits.
Building with microservices comes with challenges such as development and testing. Writing a small service
that relies on other dependent services requires a different approach than writing a traditional monolithic or
layered application. Existing tools are not always designed to work with service dependencies. Refactoring
across service boundaries can be difficult. It is also challenging to test service dependencies, especially when the
application is evolving quickly.
Reference Challenges for a list of possible drawbacks of a microservice architecture.
When there are many separate services or instances of services in play, they will need to receive instructions on
who to contact and/or other configuration information. Hard coding this information is flawed, and this is where
service discovery steps in. A service instance can spin up and dynamically discover the configuration
information it needs to become functional without having that information hard coded.
When combined with an orchestration platform designed to execute and manage microservices such as
Kubernetes or Service Fabric, individual services can be right sized, scaled up, scaled down, and dynamically
configured to match user demand. Using an orchestrator such as Kubernetes or Service Fabric, you can pack a
higher density of services onto a single host, which allows for more efficient utilization of resources. Both of
these platforms provide built-in services for executing, scaling, and operating a microservices architecture; and
one of those key services is discovery and finding where a particular service is running.
Kubernetes supports pod autoscaling and cluster autoscaling. To learn more, see Autoscaling. A Service Fabric
architecture takes a different approach to scaling for stateless and stateful services. To learn more, reference
Scaling considerations.
When appropriate, decomposing an application into microservices is a level of decoupling that is an architectural best
practice. A microservices architecture can also bring some challenges. The design patterns in Design patterns for
microservices can help mitigate these challenges.
Establishing connections to databases is typically an expensive operation that involves establishing an
authenticated network connection to the remote database server. This is especially true for applications that
open new connections frequently. Use connection pooling to reduce connection latency by reusing existing
connections and enable higher database throughput (transactions per second) on the server. By doing this, you
avoid the expense of opening a new connection for each request.
Azure limits the number of network connections a virtual machine or AppService instance can make. Exceeding
this limit would cause connections to be slowed down or terminated. With connection pooling, a fixed set of
connections are established at the startup time and maintained. In many cases, a default pool size might only
consist of a small handful of connections that performs quickly in basic test scenarios, but become a bottleneck
under scale when the pool is exhausted. Establishing a pool size that maps to the number of concurrent
transactions supported on each application instance is a best practice.
Each database and application platform will have slightly different requirements for the right way to set up and
leverage the pool. Reference SQL Server Connection Pooling for a .NET code example using SQL Server and
TIPTIP
Integrated securityIntegrated security
Next steps
Azure Database. In all cases, testing is paramount to ensure a connection pool is properly established and
working as designed under load.
Use a pool size that uses the same number of concurrent connections. Choose a size that can handle more than the
existing connections so you can quickly handle a new request coming in.
Integrated security is a singular unified solution to protect every service that a business runs through a set of
common policies and configuration settings. In addition to reducing issues associated with scaling, provisioning,
and managing (including higher costs and complexity), integrated security also increases control and overall
security. However, there may be times when you may not want to use connection pooling for security reasons.
For example, although connection pooling improves the performance of subsequent database requests for a
single user, that user cannot take advantage of connections made by other users. It also results in at least one
connection per user to the database server.
Measure your business' security requirements against the advantages and disadvantages of connection pooling.
To learn more, see Pool fragmentation
Application efficiency
Design Azure applications for efficiency
12/16/2022 4 minutes to read • Edit Online
Reduce response time with asynchronous programming
Process faster by queuing and batching requests
Optimize with data compression
Making choices that effect performance efficiency is critical to application design. For additional related topics,
see the Design scalable Azure applications article in the Performance efficiency pillar.
The time for the caller to receive a response could range from milliseconds to minutes. During that time, the
thread is held by the process until the response comes back, or if an exception happens. This is inefficient
because it means that no other requests can be processed during the time waiting for a response. An example
when multiple requests in flight is inefficient is a bank account. In this situation, only one resource can operate
on the request at the same time. Another example is when connection pools can't be shared, and then all of the
requests need separate connections to complete.
Asynchronous programming is an alternative approach. It enables a remote service to be executed without
waiting and blocking resources on the client. This is a critical pattern for enabling cloud scalable software and is
available in most modern programming languages and platforms.
There are many ways to inject asynchronous programming into an application design. For APIs and services that
work across the internet, consider using the Asynchronous Request-Reply pattern. When writing code, remote
calls can be asynchronously executed using built-in language constructs like async / await in .NET C#. Review a
language construct example. .NET has other built-in platform support for asynchronous programming with task
and event based asynchronous patterns.
Similar to asynchronous programming, queuing services has long been used as a scalable mechanism to hand
off processing work to a service. Highly scalable queuing services are natively supported in Azure. The queue is
a storage buffer located between the caller and the processing service. It takes requests, stores them in a buffer,
and queues the requests to provide services around the reliable delivery and management of the queued data.
Using a queue is often the best way to hand off work to a processor service. The processor service receives work
by listening on a queue and dequeuing messages. If items to be processed enter too quickly, the queuing service
will keep them in the queue until the processing service has available resources and asks for a new work item
(message). By leveraging the dynamic nature of Azure Functions, the processor service can easily autoscale on
demand as the queue builds up to meet the intake pressure. Developing processor logic with Azure Functions to
run task logic from a queue is a common, scalable, and cost effective way to using queuing between a client and
a processor.
Azure provides some native first-party queueing services with Azure Storage Queues (simple queuing service
based on Azure Storage) and Azure Service Bus (message broker service supporting transactions and reduced
latency). Many other third-party options are also available through Azure Marketplace.
To learn more about queue-based Load Leveling, see Queue-based Load Leveling pattern. To compare and
contrast queues, see Storage queues and Service Bus queues - compared and contrasted.
A well-known optimization best practice for scaling is to use a compression strategy to compress and bundle
web pages or API responses. The idea is to shrink the data returned from a page or API back to the browser or
Improve scalability with session affinity
TIPTIP
Run background jobs to meet integration needs
Next steps
client app. Compressing the data returned to clients optimizes network traffic and accelerates the application.
Azure Front Door can perform file compression, and .NET has built-in framework support for this technique with
GZip compression. For more information, see Response compression in ASP.NET Core.
If an application is stateful, meaning that data or state will be stored locally in the instance of the application, it
may increase the performance of your application, if you enable session affinity. When session affinity is
enabled, subsequent requests to the application will be directed to the same server that processed the first
request. If session affinity is not enabled, subsequent requests would be directed to the next available server
depending on the load balancing rules. Session affinity allows the instance to have some persistent or cached
data/context, which can speed subsequent requests. However, if your application does not store large amounts
of state or cached data in memory, session affinity might decrease your throughput because one host could get
overloaded with requests, while others are dormant.
Migrate an Azure Cloud Services application to Azure Service Fabric describes best practicesbest practices about stateless services for
an application that is migrated from old Azure Cloud Services to Azure Service Fabric.
Many types of applications require background tasks that run independently of the user interface (UI). Examples
include batch jobs, intensive processing tasks, and long-running processes such as workflows. Background jobs
can be executed without requiring user interaction. The application can start the job and then continue to
process interactive requests from users. To learn more, see Background jobs.
Background tasks must offer sufficient performance to ensure they do not block the application, or cause
inconsistencies due to delayed operation when the system is under load. Typically, performance is improved by
scaling the compute instances that host the background tasks. For a list of considerations, see Scaling and
performance considerations.
Logic Apps is a serverless consumption (pay-per-use) service that enables a vast set of out-of-the-box ready-to-
use connectors and a long-running workflow engine to enable cloud-native integration needs quickly. Logic
Apps is flexible enough for scenarios like running tasks/jobs, advanced scheduling, and triggering. Logic Apps
also has advanced hosting options to allow it to run within enterprise restricted cloud environments. Logic Apps
can be combined with all other Azure services to complement one another, or it can be used independently.
Like all serverless services, Logic Apps doesn't require VM instances to be purchased, enabled, and scaled up
and down. Instead, Logic Apps scale automatically on serverless PaaS provided instances, and a consumer only
pays based on usage.
Design for scaling
Design for scaling
12/16/2022 7 minutes to read • Edit Online
Scalability
is the ability of a system to handle increased load. Services covered by Azure Autoscale can scale
automatically to match demand to accommodate workload. These services scale out to ensure capacity during
workload peaks and return to normal automatically when the peak drops.
To achieve performance efficiency, consider how your application design scales and implement PaaS offerings
that have built-in scaling operations.
Two main ways an application can scale include
vertical scaling
and
horizontal scaling
. Vertical scaling (scaling
up
) increases the capacity of a resource, for example, by using a larger virtual machine (VM) size. Horizontal
scaling (scaling
out
) adds new instances of a resource, such as VMs or database replicas.
Horizontal scaling has significant advantages over vertical scaling, such as:
True cloud scale
: Applications are designed to run on hundreds or even thousands of nodes, reaching scales
that aren't possible on a single node.
Horizontal scale is elastic
: You can add more instances if load increases, or remove instances during quieter
periods.
Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
Scaling out may be cheaper than scaling up. Running several small VMs can cost less than a single large VM.
Horizontal scaling can also improve resiliency, by adding redundancy. If an instance goes down, the
application keeps running.
An advantage of vertical scaling is that you can do it without making any changes to the application. But at some
point, you'll hit a limit, where you can't scale up anymore. At that point, any further scaling must be horizontal.
Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind
a load balancer. But each VM in the pool must handle any client request, so the application must be stateless or
store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling and
autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.
Just adding more instances doesn't mean an application will scale, however. It might push the bottleneck
somewhere else. For example, if you scale a web front end to handle more client requests, that might trigger
lock contentions in the database. You'd want to consider other measures, such as optimistic concurrency or data
partitioning, to enable more throughput to the database.
Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system,
such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally.
Resolving one bottleneck may reveal other bottlenecks elsewhere.
Use the Performance efficiency checklist to review your design from a scalability standpoint.
In the cloud, the ability to take advantage of scalability depends on your infrastructure and services. Platforms,
such as Kubernetes, were built with scaling in mind. Virtual machines may not scale as easily although scale
operations are possible. With virtual machines, you may want to plan ahead to avoid scaling infrastructure in
the future to meet demand. Another option is to select a different platform such as Azure virtual machines scale
sets.
When using scalability, you can predict the current average and peak times for your workload. Payment plan
options allow you to manage this prediction. You pay either per minute or per-hour depending on the service
for a chosen time period.
Plan for growth
Add scale unitsAdd scale units
Use Autoscaling to manage load increases and decreases
Planning for growth starts with understanding your current workloads, which can help you anticipate scale
needs based on predictive usage scenarios. An example of a predictive usage scenario is an e-commerce site
that recognizes that its infrastructure should scale appropriately for an expected high volume of holiday traffic.
Perform load tests and stress tests to determine the necessary infrastructure to support the predicted spikes in
workloads. A good plan includes incorporating a buffer to accommodate for random spikes.
For more information on how to determine the upper and maximum limits of an application's capacity, reference
Performance testing in the performance efficiency pillar.
Another critical component of planning for scale is to make sure the region that hosts your application supports
the necessary capacity required to accommodate load increase. If you're using a multiregion architecture, make
sure the secondary regions can also support the increase. A region can offer the product, but may not support
the predicted load increase without the necessary SKUs (Stock Keeping Units) so you need to verify capacity.
To verify your region and available SKUs, first select the product and regions in Products available by region.
Then, check the SKUs available in the Azure portal.
For each resource, know the upper scaling limits, and use sharding or decomposition to go beyond those limits.
Design the application so that it's easily scaled by adding one or more scale units, such as by using the
Deployment Stamps pattern. Determine the scale units for the system for well-defined sets of resources.
The next step might be to use built-in scaling features or tools to understand which resources need to scale
concurrently with other resources. For example, adding X number of front-end VMs might require Y number of
extra queues and Z number of storage accounts to handle the extra workload. So a scale unit could consist of X
VM instances, Y queues, and Z storage accounts.
Autoscaling enables you to run the right amount of resources to handle the load of your app. It adds resources
(called scaling out) to handle an increase in load such as seasonal workloads. Autoscaling saves money by
removing idle resources (called scaling in) during a decrease in load such as nights and weekends for some
corporate apps.
You automatically scale between the minimum and maximum number of instances to run, and add, or remove
VMs automatically based on a set of rules.
Understand scale targetsUnderstand scale targets
Take advantage of platform autoscaling features
NOTENOTE
Autoscale CPU or memory-intensive applications
Autoscale with Azure compute services
For more information, see Autoscaling.
Scale operations (horizontal - changing the number of identical instances, vertical - switching to more/less
powerful instances) can be fast, but usually take time to complete. It's important to understand how this delay
affects the application under load and if degraded performance is acceptable.
For more information, reference Best practices for Autoscale.
Here's how you can benefit from autoscaling features:
Use built-in autoscaling features when possible rather than custom or third-party mechanisms.
Use scheduled scaling rules where possible to ensure that resources are available.
Add reactive autoscaling to the rules where appropriate to cope with unexpected changes in demand.
If your application is explicitly designed to handle the termination of some of its instances, ensure you use autoscaling to
scale down and scale in resources no longer necessary for the given load to reduce operational costs.
For more information, reference Autoscaling.
CPU or memory-intensive applications require scaling up to a larger machine SKU with more CPU or memory.
Once you've reduced the demand for CPU or memory, instances can revert back to the original instance.
For example, you may have an application that processes images, videos, or music. Given the process and
requirements, it may make sense to scale up a server by adding CPU or memory to quickly process the large
media file. While scaling
out
allows the system to process more files simultaneously, it won't impact processing
speed of each individual file.
Autoscaling works by collecting metrics for the resource (CPU and memory usage), and the application
(requests queued and requests per second). Rules can then be created to add and remove instances depending
on how the rule evaluates. An App Services app plan allows autoscale rules to be set for scale-out/scale-in and
scale-up/scale-down. Scaling also applies to Azure Automation.
The Application Service autoscaling sample shows how to create an Azure App Service plan, which
includes an Azure App Service.
Azure Kubernetes Service (AKS) offers two levels of autoscale:
Horizontal autoscaleHorizontal autoscale: Can be enabled on service containers to add more or fewer pod instances within the
cluster.
Cluster autoscaleCluster autoscale: Can be enabled on the agent VM instances running an agent node-pool to add more or
remove VM instances dynamically.
Other Azure services include the following services:
Azure Ser vice FabricAzure Ser vice Fabric: Virtual machine scale sets offer autoscale capabilities for true IaaS scenarios.
Azure App GatewayAzure App Gateway and Azure API ManagementAzure API Management: PaaS offerings for ingress services that enable
NOTENOTE
Next steps
autoscale.
Azure FunctionsAzure Functions, Azure Logic AppsAzure Logic Apps, and App Ser vicesApp Ser vices: Serverless pay-per-use consumption modeling
that inherently provides autoscaling capabilities.
Azure SQL DatabaseAzure SQL Database: PaaS platform to change performance characteristics of a database on the fly and
assign more resources when needed or release the resources when they aren't needed. Allows scaling
up/down, read scale-out, and global scale-out/sharding capabilities.
Each service documents its autoscale capabilities. Review Autoscale overview for a general discussion on Azure
platform autoscale.
If your application doesn't have built-in ability to autoscale, or isn't configured to scale out automatically as load increases,
it's possible that your application's services will fail if they become saturated with user requests. Reference Azure
Automation for possible solutions.
Plan for capacity
Plan for capacity
12/16/2022 4 minutes to read • Edit Online
Scale out rather than scaling up
Prepare infrastructure for large-scale events
NOTENOTE
Choose the right resources
Azure offers many options to meet capacity requirements as your business grows. These options can also
minimize cost.
When using cloud technologies, it's generally easier, cheaper, and more effective to scale out than scaling up.
Plan to scale your application tier by adding extra infrastructure to meet demand. Be sure to remove the
resources when they are not needed. If you plan to scale up by increasing the resources allocated to your hosts,
you will reach a limit where it becomes cost-prohibitive to scale any further. Scaling up also often requires
downtime for your servers to reboot.
Large-scale application design takes careful planning and possibly involves complex implementation. Work with
your business and marketing teams to prepare for large-scale events. Knowing if there will be sudden spikes in
traffic such as Superbowl, Black Friday, or Marketing pushes, can allow you to prepare your infrastructure ahead
of time.
A fundamental design principle in Azure is to scale out by adding machines or service instances based on
increased demand. Scaling out can be a better alternative to purchasing additional hardware, which may not be
in your budget. Depending on your payment plan, you don't pay for idle VMs or need to reserve capacity in
advance. A pay-as-you-go plan is usually ideal for applications that need to meet planned spikes in traffic.
Don't plan for capacity to meet the highest level of expected demand. An inappropriate or misconfigured service can
impact cost. For example, building a multiregion service when the service levels don't require high-availability or geo-
redundancy will increase cost without reasonable business justification.
Right sizing your infrastructure to meet the needs of your applications can save you considerably as opposed to
a "one size fits all" solution often employed with on-premises hardware. You can choose various options when
you deploy Azure VMs to support workloads.
Each VM type has specific features and different combinations of CPU, memory, and disks. For example, the B-
series VMs are ideal for workloads that don't need the full performance of the CPU continuously, like web
servers, proof of concepts, small databases, and development build environments. The B-Series offers a cost
effective way to deploy these workloads that don't need the full performance of the CPU continuously and burst
in their performance.
For a list of sizes and a description of the recommended use, see sizes for virtual machines in Azure.
Continually monitor workloads after migration to find out if your VMs aren't optimized or have frequent periods
when they aren't used. If you discover this, it makes sense to either shut down the VMs or downscale them by
using virtual machine scale sets. You can optimize a VM with Azure Automation, virtual machine scale sets, auto-
shutdown, and scripted or third-party solutions. To learn more, see Automate VM optimization.
Use metrics to fine-tune scaling
Preemptively scaling based on trends
Next steps
Along with choosing the right VMs, selecting the right storage type can save your organization significant cost
every month. For a list of storage data types, access tiers, storage account types, and storage redundancy
options, see Select the right storage.
It's often difficult to understand the relationship between metrics and capacity requirements, especially when an
application is initially deployed. Provision a little extra capacity at the beginning, and then monitor and tune the
autoscaling
rules to bring the capacity closer to the actual load.
Autoscaling
enables you to run the right amount
of resources to handle the load of your app. It adds resources (called scaling out) to handle an increase in load
such as seasonal workloads and customer facing applications.
After configuring the autoscaling rules, monitor the performance of your application over time. Use the results
of this monitoring to adjust the way in which the system scales if necessary.
Azure Monitor autoscale provides a common set of autoscaling functionality for virtual machine scale sets,
Azure App Service, and Azure Cloud Service. Scaling can be performed on a schedule, or based on a runtime
metric, such as CPU or memory usage. For example, you can scale out by one instance if average CPU usage is
above 70%, and scale in by one instance if CPU usage falls below 50 percent.
The default autoscaling rules are set to know when it's time to execute an autoscaling action in order to prevent
the system from reacting too quickly. To learn more, see Autoscaling.
For a list of built-in metrics, see Azure Monitor autoscaling common metrics. You can also implement custom
metrics by using Application Insights to monitor the performance of your live applications. Some Azure services
use different scaling methods.
Preemptively scaling based on historical data can ensure your application has consistent performance, even
though your metrics haven't yet indicated the need to scale. Schedule-based rules allow you to scale when you
see time patterns in your load and want to scale before a possible load increase or decrease occurs. For example,
you can set a trigger attribute to scale out to 10 instances on weekdays, and scale in to four (4) instances on
Saturday and Sunday. If you can predict the load on the application, consider using scheduled autoscaling, which
adds and removes instances to meet anticipated peaks in demand.
To learn more, see Use Azure Monitor autoscale.
Performance testing
Capacity planning:Capacity planning: When performance testing, the business must communicate any fluctuation in expected
load. Load can be impacted by world events, such as political, economic, or weather changes; by marketing
initiatives, such as sales or promotions; or, by seasonal events, such as holidays. You should test variations of
load prior to events, including unexpected ones, to ensure that your application can scale. Additionally, you
should ensure that all regions can adequately scale to support total load, should one region fail.
Checklist
-
Testing for performance efficiency
12/16/2022 5 minutes to read • Edit Online
Performance testing
Is the application tested for performance, scalability, and resiliency?
Performance testing helps to maintain systems properly and fix defects before problems reach system users. It's
part of the Performance Efficiency pillar in the Microsoft Azure Well-Architected Framework.
Performance testing
is the superset of both load and stress testing. The primary goal of performance testing is
to validate benchmark behavior for the application.
Load Testing
validates application scalability by rapidly or gradually increasing the load on the application until it
reaches a threshold.
Stress Testing
is a type of negative testing, which involves various activities to overload existing resources and
remove components. This testing enables you to understand overall resiliency and how the application responds
to issues.
Use the following checklist to review your application architecture from a performance testing standpoint.
Ensure solid performance testing with shared Ensure solid performance testing with shared
teamteam
responsibility responsibility. Successfully implementing
meaningful performance tests requires many resources. It's not just a single developer or QA Analyst
running some tests on their local machine. Instead, performance tests need a test environment (also
known as a
test bed
) that tests can be executed against without interfering with production environments
and data. Performance testing requires input and commitment from developers, architects, database
administrators, and network administrators.
Capacity planningCapacity planning. When performance testing, the business must communicate any fluctuation in
expected load. Load can be impacted by world events, such as political, economic, or weather changes; by
marketing initiatives, such as sales or promotions; or, by seasonal events, such as holidays. Test variations
of load prior to events, including unexpected ones, to ensure that your application can scale. Additionally,
you should ensure that all regions can adequately scale to support total load, should one region fail.
Identify a path for ward to leveraging existing tests or the creation of new testsIdentify a path for ward to leveraging existing tests or the creation of new tests. There are
different types of performance testing: load testing, stress testing, API testing, client-side/browser testing,
and so on. It's important that you understand and articulate the different types of tests, along with their
advantages and disadvantages, to the customer.
Perform testing in all stages in the development and deployment life cyclePerform testing in all stages in the development and deployment life cycle. Application code,
infrastructure automation, and fault tolerance should all be tested. This step ensures that the application
will perform as expected in every situation. You'll want to test early enough in the application life cycle to
catch and fix errors. Errors are cheaper to repair when caught early and can be expensive or impossible to
fix later. To learn more, reference Testing your application and Azure environment.
Avoid experiencing poor performance with testingAvoid experiencing poor performance with testing. Two subsets of performance testing, load
testing and stress testing, can determine the upper limit, and maximum point of failure, respectively, of
the application's capacity. By performing these tests, you can determine the necessary infrastructure to
support the anticipated workloads.
Plan for a load buffer to accommodate random spikes without overloading thePlan for a load buffer to accommodate random spikes without overloading the
infrastructureinfrastructure. For example, if a normal system load is 100,000 requests per second, the infrastructure
Testing tools
Recommendation
Next steps
should support 100,000 requests at 80% of total capacity (for example, 125,000 requests per second). If
you expect that the application will continue to sustain 100,000 requests per second, and the current SKU
(Stock Keeping Unit) introduces latency at 65,000 requests per second, you'll most likely need to
upgrade your product to the next higher SKU. If there's a secondary region, you'll need to ensure that it
also supports the higher SKU.
Test failover in multiregionsTest failover in multiregions. Test the amount of time it would take for users to be rerouted to the
paired region so that the region doesn't fail. Typically, a planned test failover can help determine how
much time would be required to fully scale to support the redirected load.
Configure the environment based on testing results to sustain performance efficiencyConfigure the environment based on testing results to sustain performance efficiency. Scale
out or scale in to handle increases and decreases in load. For example, you may know that you'll
encounter high levels of traffic during the day and low levels on weekends. You may configure the
environment to scale out for increases in load or scale in for decreases before the load actually changes.
Choose testing tools based on the type of performance testing you are attempting toChoose testing tools based on the type of performance testing you are attempting to
executeexecute. There are various performance testing tools available for DevOps. Some tools like JMeter only
perform testing against endpoints and tests HTTP statuses. Other tools such as K6 and Selenium can
perform tests that also check data quality and variations. Azure Load Testing allows you to create a load
test by using an existing JMeter script, and monitor client-side and server-side metrics to identify
performance bottlenecks. Application Insights, while not necessarily designed to test server load, can test
the performance of an application within the user's browser.
Carr y out performance profiling and load testingCarr y out performance profiling and load testing during development, as part of test routines, and
before final release to ensure the application performs and scales as required. This testing should occur
on the same type of hardware as the production platform, and with the same types, and quantities of
data, and user load as it will encounter in production.
Determine if it is better to use automated or manual testingDetermine if it is better to use automated or manual testing. Testing can be automated or
manual. Automating tests is the best way to make sure that they're executed. Depending on how
frequently tests are performed, they're typically limited in duration and scope. Manual testing is run much
less frequently.
Cache data to improve performance, scalability, and availabilityCache data to improve performance, scalability, and availability . The more data that you have,
the greater the benefits of caching become. Caching typically works well with data that is immutable or
that changes infrequently.
Decide how you'll handle local development and testing when some static content isDecide how you'll handle local development and testing when some static content is
expected to be ser ved from a content deliver y network (CDN)expected to be ser ved from a content deliver y network (CDN). For example, you could pre-
deploy the content to the CDN as part of your build script. Instead, use compile directives or flags to
control how the application loads the resources. For example, in debug mode, the application could load
static resources from a local folder. In release mode, the application would use the CDN.
Simulate different workloads on your application and measure application performance forSimulate different workloads on your application and measure application performance for
each workloadeach workload. This technique is the best way to figure out what resources you'll need to host your
application. Use performance indicators to assess whether your application is performing as expected or
not.
Define a testing strategy. For more information, reference Testing.
Performance testing
Performance testing
12/16/2022 5 minutes to read • Edit Online
Establish baselines
Load testing
Performance testing helps to maintain systems properly and fix defects before problems reach system users. It
helps maintain the efficiency, responsiveness, scalability, and speed of applications when compared with
business requirements. When done effectively, performance testing should give you the diagnostic information
necessary to eliminate bottlenecks, which lead to poor performance. A bottleneck occurs when data flow is
either interrupted or stops due to insufficient capacity to handle the workload.
To avoid experiencing poor performance, commit time and resources to testing system performance. Two
subsets of performance testing, load testing and stress testing, can determine the upper (close to capacity limit)
and maximum (point of failure) limit, respectively, of the application's capacity. By performing these tests, you
can determine the necessary infrastructure to support the anticipated workloads.
A best practice is to plan for a load buffer to accommodate random spikes without overloading the
infrastructure. For example, if a normal system load is 100,000 requests per second, the infrastructure should
support 100,000 requests at 80% of total capacity (i.e., 125,000 requests per second). If you anticipate that the
application will continue to sustain 100,000 requests per second, and the current SKU (Stock Keeping Unit)
introduces latency at 65,000 requests per second, you'll most likely need to upgrade your product to the next
higher SKU. If there is a secondary region, you'll need to ensure that it also supports the higher SKU.
Depending on the scale of your performance test, you need to plan for and maintain a testing infrastructure. You
can use a cloud-based tool, such as Azure Load Testing Preview, to abstract the infrastructure needed to run
your performance tests.
First, establish performance baselines for your application. Then, establish a regular cadence for running the
tests. Run the test as part of a scheduled event or part of a continuous integration (CI) build pipeline.
Baselines help to determine the current efficiency state of your application and its supporting infrastructure.
Baselines can provide good insights for improvements and determine if the application is meeting business
goals. Baselines can be created for any application regardless of its maturity. No matter when you establish the
baseline, measure performance against that baseline during continued development. When code and, or
infrastructure changes, the effect on performance can be actively measured.
Load testing measures system performance as the workload increases. It identifies where and when your
application breaks, so you can fix the issue before shipping to production. It does this by testing system behavior
under typical and heavy loads.
Load testing takes places in stages of load. These stages are usually measured by virtual users (VUs) or
simulated requests, and the stages happen over given intervals. Load testing provides insights into how and
when your application needs to scale in order to continue to meet your SLA to your customers (whether internal
or external). Load testing can also be useful for determining latency across distributed applications and
microservices.
The following are key points to consider for load testing:
Know the Azure ser vice limits:Know the Azure ser vice limits: Different Azure services have
soft
and
hard
limits associated with
them. The terms soft limit and hard limit describe the current, adjustable service limit (soft limit) and the
Stress testing
Multiregion testing
Configure the environment based on testing results
maximum limit (hard limit). Understand the limits for the services you consume so that you are not
blocked if you need to exceed them. For a list of the most common Azure limits, see Azure subscription
and service limits, quotas, and constraints.
The ResourceLimits sample shows how to query the limits and quotas for commonly used
resources.
Measure typical loads:Measure typical loads: Knowing the typical and maximum loads on your system helps you understand
when something is operating outside of its designed limits. Monitor traffic to understand application
behavior.
Understand application behavior under various scales:Understand application behavior under various scales: Load test your application to understand
how it performs at various scales. First, test to see how the application performs under a typical load.
Then, test to see how it performs under load using different scaling operations. To get additional insight
into how to evaluate your application as the amount of traffic sent to it increases, see Autoscale best
practices.
Unlike load testing, which ensures that a system can handle what it's designed to handle, stress testing focuses
on overloading the system until it breaks. A stress test determines how stable a system is and its ability to
withstand extreme increases in load. It does this by testing the maximum number requests from another service
(for example) that a system can handle at a given time before performance is compromised and fails. Find this
maximum to understand what kind of load the current environment can adequately support.
Determine the maximum demand you want to place on memory, CPU, and disk IOPS. Once a stress test has
been performed, you will know the maximum supported load and an operational margin. It is best to choose an
operational threshold so that scaling can be performed before the threshold has been reached.
Once you determine an acceptable operational margin and response time under typical loads, verify that the
environment is configured adequately. To do this, make sure the SKUs that you selected are based on the desired
margins. Be careful to stay as close as possible to your margins. Allocating too much can increase costs and
maintenance unnecessarily; allocating too few can result in poor user experience.
In addition to stress testing through increased load, you can stress test by reducing resources to identify what
happens when the machine runs out of memory. You can also stress test by increasing latency (e.g., the database
takes 10x time to reply, writes to storage takes 10x longer, etc.).
A multiregion architecture can provide higher availability than deploying to a single region. If a regional outage
affects the primary region, you can use Front Door to use the secondary region. This architecture can also help if
an individual subsystem of the application fails.
Test the amount of time it would take for users to be rerouted to the paired region so that the region doesn't fail.
To learn more about routing, see Front Door routing methods. Typically, a planned test failover can help
determine how much time would be required to fully scale to support the redirected load.
Once you have performed testing and found an acceptable operational margin and response under increased
levels of load, configure the environment to sustain performance efficiency. Scale out or scale in to handle
increases and decreases in load. For example, you may know that you will encounter high levels of traffic during
the day and low levels on weekends. You may configure the environment to scale out for increases in load or
scale in for decreases before the load actually changes.
NOTENOTE
Next steps
For more information on autoscaling, see Design for scaling in the Performance Efficiency pillar.
Ensure that a rule has been configured to scale the environment back down once load reaches below the set thresholds.
This will save you money.
Testing tools
Testing tools
12/16/2022 5 minutes to read • Edit Online
Identify baselines and goals for performance
Caching data
Use Azure Redis to cache dataUse Azure Redis to cache data
There are multiple stages in the development and deployment life cycle in which tests can be performed.
Application code, infrastructure automation, and fault tolerance should all be tested. Testing in various stages
can ensure that the application will perform as expected in every situation. You'll want to test early enough in the
application life cycle to catch and fix errors. Errors are cheaper to repair when caught early and can be expensive
or impossible to fix later.
Testing can be automated or manual. Automating tests is the best way to make sure that they're executed.
Depending on how frequently tests are performed, they're typically limited in duration and scope. Manual
testing is run much less frequently. For a list of tests that you should consider while developing and deploying
applications, see Testing your application and Azure environment.
Knowing where you are (baseline) and where you want to be (goal) makes it easier to plan how to get there.
Established baselines and goals will help you to stay on track and measure progress. Testing may also uncover a
need to perform more testing on areas that you may not have planned.
Baselines can vary based on connections or platforms used for accessing the application. It may be important to
establish baselines that address the different connections, platforms, and elements such as time of day, or
weekday versus weekend.
There are many types of goals when determining baselines for application performance. Some examples are, the
time it takes to render a page, or a desired number of transactions if your site conducts e-commerce. The
following list shows some examples of questions that may help you to determine goals.
What are your baselines and goals for:
Establishing an initial connection to a service?
An API endpoint complete response?
Server response times?
Latency between systems/microservices?
Database queries?
Caching can dramatically improve performance, scalability, and availability. The more data that you have, the
greater the benefits of caching become. Caching typically works well with data that is immutable or that changes
infrequently. Examples include reference information such as product and pricing information in an e-commerce
application, or shared static resources that are costly to construct. Some or all of this data can be loaded into the
cache at application startup to minimize demand on resources and to improve performance.
Use performance testing and usage analysis to determine whether pre-populating or on-demand loading of the
cache, or a combination of both, is appropriate. The decision should be based on the volatility and usage pattern
of the data. Cache utilization and performance analysis are important in applications that encounter heavy loads
and must be highly scalable.
To learn more about how to use caching as a solution in testing, see Caching.
Content delivery network
Benchmark testing
Metrics
Azure Cache for Redis is a caching service that can be accessed from any Azure application, whether the
application is implemented as a cloud service, a website, or inside an Azure virtual machine. Caches can be
shared by client applications that have the appropriate access key. It's a high-performance caching solution that
provides availability, scalability, and security.
To learn more about using Azure Cache for Redis, see Considerations for implementing caching in Azure.
Content delivery networks (CDNs) are typically used to deliver static content such as images, style sheets,
documents, client-side scripts, and HTML pages. The major advantages of using a CDN are lower latency and
faster delivery of content to users, regardless of their geographical location in relation to the datacenter where
the application is hosted. CDNs can help to reduce load on a web application because the application doesn't
have to service requests for the content that is hosted in the CDN. Using a CDN is a good way to minimize the
load on your application, and maximize availability and performance. Consider adopting this strategy for all of
the appropriate content and resources your application uses.
Decide how you'll handle local development and testing when some static content is served from a CDN. For
example, you could pre-deploy the content to the CDN as part of your build script. Alternatively, use compile
directives or flags to control how the application loads the resources. For example, in debug mode, the
application could load static resources from a local folder. In release mode, the application would use the CDN.
To learn more about CDNs, see Best practices for using content delivery networks (CDNs).
Benchmarking is the process of simulating different workloads on your application and measuring application
performance for each workload. It's the best way to figure out what resources you'll need to host your
application.
Use performance indicators to assess whether your application is performing as expected or not.
For workloads running on virtual machines, take into consideration VM sizes and disk sizes when
benchmarking, as you may hit a particular bottleneck. The Optimize IOPS, throughput, and latency table offers
further guidance.
Tools such as Azure Load Testing can help you simulate load and different usage patterns. The simulation can
help you prepare for particular scenarios that are relevant to your organization or industry. (for example how a
promotion or flash sale might affect an online store)
Metrics measure trends over time. they're available for interactive analysis in the Azure portal with Azure
Metrics Explorer. Metrics also can be added to an Azure dashboard for visualization in combination with other
data and used for near-real time alerting.
Performance testing gives you the ability to see specific details on the processing capabilities of applications.
You'll most likely want a monitoring tool that allows you to discover proactively if the issues you find through
testing are appearing in both your infrastructure and applications. Azure Monitor Metrics is a feature of Azure
Monitor that collects metrics from monitored resources into a time series database.
With Azure Monitor, you can collect, analyze, and act on telemetry from your cloud and on-premises
environments. It helps you understand how applications are performing and identifies issues affecting them and
the resources they depend on.
For a list of Azure metrics, see Supported metrics with Azure Monitor.
Next steps
Performance monitoring
Monitor the performance of a cloud application
12/16/2022 2 minutes to read • Edit Online
Checklist
In this section
A SSESSM EN TA SSESSM EN T DESC RIP T IONDESC RIP T ION
Are application logs and events correlated across allAre application logs and events correlated across all
application components?application components?
Correlate logs and events for subsequent interpretation. This
correlation will give you visibility into end-to-end transaction
flows.
Are you collecting Azure Activity Logs within the logAre you collecting Azure Activity Logs within the log
aggregation tool?aggregation tool?
Collect platform metrics and logs to get visibility into the
health and performance of services that are part of the
architecture.
Are application and resource level logs aggregatedAre application and resource level logs aggregated
in a single data sink, or is it possible to cross-quer yin a single data sink , or is it possible to cross-quer y
events at both levels?events at both levels?)
Implement a unified solution to aggregate and query
application and resource level logs, such as Azure Log
Analytics.
Troubleshooting an application's performance requires monitoring and reliable investigation. Issues in
performance can arise from database queries, connectivity between services, under-provisioned resources, or
memory leaks in code.
Continuously monitoring services and checking the health state of current workloads is key in maintaining the
overall performance of the workload. An overall monitoring strategy consider these factors:
Scalability
Resiliency of the infrastructure, application, and dependent services
Application and infrastructure performance
How are you monitoring to ensure the workload is scaling appropriately?How are you monitoring to ensure the workload is scaling appropriately?
Enable and capture telemetry throughout your application to build and visualize end-to-end transaction
flows for the application.
See metrics from Azure services such as CPU and memory utilization, bandwidth information, current
storage utilization, and more.
Use resource and platform logs to get information about what events occur and under which conditions.
For scalability, look at the metrics to determine how to provision resources dynamically and scale with
demand.
In the collected logs and metrics look for signs that might make a system or its components suddenly
become unavailable.
Use log aggregation technology to gather information across all application components.
Store logs and key metrics of critical components for statistical evaluation and predicting trends.
Identify antipatterns in the code.
Follow these questions to assess the workload at a deeper level.
Azure services
Next section
Related links
The monitoring operations should utilize Azure Monitor. You can analyze data, set up alerts, get end-to-end
views of your applications, and use machine learning–driven insights to identify and resolve problems quickly.
Export logs and metrics to services such as Azure Log Analytics or an external service like Splunk. Furthermore,
application technologies such as Application Insights can enhance the telemetry coming out of applications.
Based on insights gained through monitoring, optimize your code. One option might be to consider other Azure
services that may be more appropriate for your objectives.
Optimize
Back to the main article
Application profiling considerations for performance
monitoring
12/16/2022 2 minutes to read • Edit Online
Key points
Application logs
Application instrumentation
Distributed tracing
Continuously monitor the application with Application Performance Monitoring (APM) technology, such as
Azure Application Insights. This technology can help you manage the performance and availability of the
application, aggregating application level logs, and events for subsequent interpretation.
Enable instrumentation and collect data using Azure Application Insights.
Use distributed tracing to build and visualize end-to-end transaction flows for the application.
Separate logs and events of a non-critical environment from a production environment.
Include end-to-end transaction times for key technical functions.
Correlate application log events across critical system flows.
Are application logs collected from different application environments?Are application logs collected from different application environments?
Application logs support the end-to-end application lifecycle. Logging is essential in understanding how the
application operates in various environments and what events occur and under which conditions.
Collect application logs and events across all application environments. Use a sufficient degree of separation
and filtering to ensure non-critical environments aren't mixed with production log interpretation. Furthermore,
corresponding log entries across the application should capture a correlation ID for their respective transactions.
Are log messages captured in a structured format?Are log messages captured in a structured format?
Structured format in a well-known schema can help expediate parsing and analyzing logs. Structured data can
be indexed, queried, and reported without complexity.
Also, application events should be captured as a structured data type with machine-readable data points rather
than unstructured string types.
Do you have detailed instrumentation in the application code?Do you have detailed instrumentation in the application code?
Instrumentation of your code allows precise detection of underperforming pieces when load or stress tests are
applied. It is critical to have this data available to improve and identify performance opportunities in the
application code.
Use APM such as Application Insights to continuously improve performance and usability. You need to enable
Application Insights by installing an instrumentation package. The service provides extensive telemetry out of
the box. You can customize what is captured for greater visibility. After it's enabled, metrics and logs related to
the performance and operations are collected. View and analyze the captured data in Azure Monitor.
Critical targets
Cost considerations
Next
Related links
Events coming from different application components or different component tiers of the application should be
correlated to build end-to-end transaction flows. Use distributed tracing to build and visualize flows for the
application. For instance, this is often achieved by using consistent correlation IDs transferred between
components within a transaction.
Are application events correlated across all application components?Are application events correlated across all application components?
Event correlation between the layers of the application provides the ability to connect tracing data of the
complete application stack. Once this connection is made, you can see a complete picture of where time is spent
at each layer. You can then query the repositories of tracing data in correlation to a unique identifier that
represents a completed transaction that has flowed through the system.
For more information, see Distributed tracing.
Is it possible to evaluate critical application performance targets and non-functional requirementsIs it possible to evaluate critical application performance targets and non-functional requirements
(NFRs)?(NFRs)?
Application level metrics should include end-to-end transaction times of key technical functions, such as
database queries, response times for external API calls, failure rates of processing steps, and more.
Is the end-to-end performance of critical system flows monitored?Is the end-to-end performance of critical system flows monitored?
It should be possible to correlate application log events across critical system flows, such as user login, to fully
assess the health of key scenarios in the context of targets and NFRs.
If you are using Application Insights to collect instrumentation data, there are cost considerations. For more
information, see Manage usage and costs for Application Insights.
Monitor infrastructure
Distributed tracing
Application Insights
Azure Monitor
Back to the main article
Analyze infrastructure metrics and logs
12/16/2022 4 minutes to read • Edit Online
Key points
Platform metrics
Platform logs
Performance issues can occur because of interaction between the application and other services in the
architecture. For example, issues in database queries, connectivity between services, and under-provisioned
resources are all common causes for inefficiencies.
The practice of continuous monitoring must include analysis of platform metrics and logs to get visibility into
the health and performance of services that are part of the architecture.
View platform metrics to get visibility into the health and performance of Azure services.
Use log data to get visibility into the operations and events of the the management plane.
Track events from internal dependencies.
Check the health of external dependencies such as an API service.
Metrics are numerical values that are collected at regular intervals and describe some aspect of a system at a
particular time. View the platform metrics that are generated by the services used in the architecture. Each Azure
service has set of metrics that's unique to the functionality of the resource. These metrics give you visibility into
their health and performance. There's no added configuration for Azure resources. You can also define custom
metrics for an Azure service using the custom metrics API.
Azure Monitor Metrics is a feature of Azure Monitor that collects numeric data from monitored resources into a
time series database. To learn more about Azure Monitor Metrics, see What can you do with Azure Monitor
Metrics?
If your application is running in Azure Virtual Machines, configure Azure Diagnostics extension to send guest OS
performance metrics to Azure Monitor. Guest OS metrics include performance counters that track guest CPU
percentage or memory usage, both of which are frequently used for autoscaling or alerting.
For more information, see Supported metrics with Azure Monitor.
Also, use technology-specific tools for the services used in the architecture. For example, use network traffic
capturing tools, such as Azure Network Watcher.
One of the challenges to metric data is that it often has limited information to provide context for collected
values. Azure Monitor addresses this challenge with multi-dimensional metrics. These metrics are name-value
pairs that carry more data to describe the metric value. To learn about multi-dimensional metrics and an
example for network throughput, see multi-dimensional metrics.
Azure provides various operational logs from the platform and the resources. These logs provide insight into
what events occurred, what changes were made to the resource, and more. These logs are useful in tracking
operations. For example, you can track scaling events to check if autoscaling is working as expected.
Azure Monitor Logs can store various different data types each with their own structure. You can also perform
complex analysis on logs data using log queries, which cannot be used for analysis of metrics data. Azure
Monitor Logs is capable of supporting near real-time scenarios, making them useful for alerting and fast
Cost considerations of monitoring
Next
Related links
detection of issues. To learn more about Azure Monitor Logs, see What can you do with Azure Monitor Logs?
Are you collecting Azure Activity Logs within the log aggregation tool?Are you collecting Azure Activity Logs within the log aggregation tool?
Azure Activity Logs provide audit information about when an Azure resource is modified, such as when a virtual
machine is started or stopped. This information is useful for the interpretation and troubleshooting of issues. It
provides transparency around configuration changes that can be mapped to adverse performance events.
Are logs available for critical internal dependencies?Are logs available for critical internal dependencies?
To build a robust application health model, ensure there is visibility into the operational state of critical internal
dependencies.
In Azure Monitor, enable Azure resource logs so that you have visibility into operations that were done within an
Azure resource. Similar to platform metrics, resource logs vary by the Azure service and resource type.
For example, for services such as a shared NVA (network virtual appliance) or Express Route connection,
monitor the network performance. Azure monitor can also help diagnose networking related issues. You can
trigger a packet capture, diagnose routing issues, analyze network security group flow logs, and gain visibility
and control over your Azure network.
Here are some tools:
Network performance monitor
Service connectivity monitor
ExpressRoute monitor
Additionally, data from network traffic capturing tools, such as Azure Network Watchercan be helpful.
Are critical external dependencies monitored?Are critical external dependencies monitored?
Monitor critical external dependencies, such as an API service, to ensure operational visibility of performance.
For example, a probe could be used to measure the latency of an external API.
Azure Monitor billing model is based on consumption. Azure creates metered instances that tracks usage to
calculate your bill. Pricing will depend on the metrics, alerting, notifications, Log Analytics and Application
Insights.
For information about usage and estimated costs, see Monitoring usage and estimated costs in Azure Monitor.
You can also use the Pricing Calculator to determine your pricing. The Pricing Calculator will help you estimate
your likely costs based on your expected utilization.
Data analysis considerations
Supported metrics with Azure Monitor
Network performance monitor
Azure Network Watcher
Back to the main article
Performance data integration
12/16/2022 3 minutes to read • Edit Online
Key points
Data interpretation
Aggregated view
Performance testing and investigation should be based on data captured from repeatable processes. To
understand how an application's performance is affected by code and infrastructure changes, retain data for
analysis. Additionally, it's important to measure how performance has changed
over time
, not just compared to
the last measurement taken.
This article describes some considerations and tools you can use to aggregate data for troubleshooting and
analyzing performance trends.
Analyze performance data holistically to detect fault types, bottleneck regressions, and health states.
Use log aggregation technologies to consolidate data into a single workspace and analyze using a
sophisticated query language.
Retain data in a time-series database to predict performance issues before they occur.
Balance the retention policy and service pricing plans with the cost expectation of the organization.
The overall performance can be impacted by both application-level issues and resource-level failures. It's vital
that all data is correlated and evaluated together. This will optimize the detection of issues and troubleshooting
of detected issues. This approach will help to distinguish between transient and non-transient faults.
Use a holistic approach to quantify what
healthy
and
unhealthy
states represent across all application
components. It's highly recommended that a
traffic light
model is used to indicate a healthy state. For example,
green light to show key non-functional requirements and targets are fully satisfied and resources are optimally
used. For example, a healthy state can be 95% of requests are processed in <= 500 ms with AKS node utilization
at x%, and so on. Also, An Application Map can to help spot performance bottlenecks or failure hotspots across
components of a distributed application.
Also, analyze long-term operational data to get historical context and detect if there have been any regressions.
For example, check the average response times to see if they have been slowly increasing over time and getting
closer to the maximum target.
Log aggregation technologies should be used to collate logs and metrics across all application components,
including infrastructural components for later evaluation.
Resources may include Azure IaaS and PaaS services and third-party appliances such as firewalls or Anti-
Malware solutions used in the application. For example, if Azure Event Hub is used, the Diagnostic Settings
should be configured to push logs and metrics to the data sink.
Azure Monitor has the capability of collecting and organizing log and performance data from monitored
resources. Data is consolidated into an Azure Log Analytics workspace so they can be analyzed together using a
sophisticated query language that can quickly analyzing millions of records. Splunk is another popular choice.
How is aggregated monitoring enforced?How is aggregated monitoring enforced?
Long-term data
Cost considerations
Next
Related links
All application resources should be configured to route diagnostic logs and metrics to the chosen log
aggregation technology. Use Azure Policy to ensure the consistent use of diagnostic settings across the
application and to enforce the desired configuration for each Azure service.
Store long-term operational data to understand the history of application performance. This data store is
important for analyzing performance trends and regressions.
Are long-term trends analyzed to predict performance issues before they occur?Are long-term trends analyzed to predict performance issues before they occur?
It's often helpful to store such data in a time-series database (TSDB) and then view the data from an operational
dashboard. An Azure Data Explorer cluster is a powerful TSDB that can store any schema of data, including
performance test metrics. Grafana, an open-source platform for observability dashboards, can then be used to
query your Azure Data Explorer cluster to view performance trends in your application.
Have retention times been defined for logs and metrics, with housekeeping mechanismsHave retention times been defined for logs and metrics, with housekeeping mechanisms
configured?configured?
Clear retention times should be defined to allow for suitable historic analysis but also control storage costs.
Suitable housekeeping tasks should also be used to archive data to cheaper storage or aggregate data for long-
term trend analysis.
While correlating data is recommended, there are cost implications to storing long-term data. For example,
Azure Monitor is capable of collecting, indexing, and storing massive amounts of data in Log Analytics
workspace. The cost of a Log Analytics workspace is based on amount of storage, retention period, and the plan.
For more information about how you can balance cost, see Manage usage and costs with Azure Monitor Logs.
If you are using Application Insights to collect instrumentation data, there are cost considerations. For more
information, see Manage usage and costs for Application Insights.
Scalability and reliability considerations
Application Map
Azure Policy
Azure Data Explorer cluster
Manage usage and costs with Azure Monitor Logs
Manage usage and costs for Application Insights
Back to the main article
Performance Efficiency patterns
12/16/2022 2 minutes to read • Edit Online
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Cache-Aside Load data on demand into a cache from a data store.
Choreography Have each component of the system participate in the
decision-making process about the workflow of a business
transaction, instead of relying on a central point of control.
CQRS Segregate operations that read data from operations that
update data by using separate interfaces.
Event Sourcing Use an append-only store to record the full series of events
that describe actions taken on data in a domain.
Deployment Stamps Deploy multiple independent copies of application
components, including data stores.
Geodes Deploy backend services into a set of geographical nodes,
each of which can service any client request in any region.
Index Table Create indexes over the fields in data stores that are
frequently referenced by queries.
Materialized View Generate prepopulated views over the data in one or more
data stores when the data isn't ideally formatted for required
query operations.
Priority Queue Prioritize requests sent to services so that requests with a
higher priority are received and processed more quickly than
those with a lower priority.
Queue-Based Load Leveling Use a queue that acts as a buffer between a task and a
service that it invokes in order to smooth intermittent heavy
loads.
Sharding Divide a data store into a set of horizontal partitions or
shards.
Static Content Hosting Deploy static content to a cloud-based storage service that
can deliver them directly to the client.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. You need to anticipate these increases to meet business requirements. An important
consideration in achieving performance efficiency is to consider how your application scales and to implement
PaaS offerings that have built-in scaling operations. Scalability is ability of a system either to handle increases in
load without impact on performance or for the available resources to be readily increased. It concerns not just
compute instances, but other elements such as data storage, messaging infrastructure, and more.
Throttling Control the consumption of resources used by an instance
of an application, an individual tenant, or an entire service.
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Performance efficiency checklist
12/16/2022 14 minutes to read • Edit Online
Application design
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner, and is one of the pillars of the Microsoft Azure Well-Architected Framework. Use this checklist
to review your application architecture from a performance efficiency standpoint.
Par tition the workloadPar tition the workload. Design parts of the process to be discrete and decomposable. Minimize the size of
each part, while following the usual rules for separation of concerns and the single responsibility principle. This
allows the component parts to be distributed in a way that maximizes use of each compute unit (such as a role
or database server). It also makes it easier to scale the application by adding instances of specific resources. For
complex domains, consider adopting a microservices architecture.
Design for scalingDesign for scaling. Scaling allows applications to react to variable load by increasing and decreasing the
number of instances of roles, queues, and other services they use. However, the application must be designed
with this in mind. For example, the application and the services it uses must be stateless, to allow requests to be
routed to any instance. This also prevents the addition or removal of specific instances from adversely affecting
current users. You should also implement configuration or autodetection of instances as they are added and
removed, so that code in the application can perform the necessary routing. For example, a web application
might use a set of queues in a round-robin approach to route requests to background services running in
worker roles. The web application must be able to detect changes in the number of queues, to successfully route
requests and balance the load on the application.
Scale as a unitScale as a unit. Plan for additional resources to accommodate growth. For each resource, know the upper
scaling limits, and use sharding or decomposition to go beyond these limits. Determine the scale units for the
system in terms of well-defined sets of resources. This makes applying scale-out operations easier, and less
prone to negative impact on the application through limitations imposed by lack of resources in some part of
the overall system. For example, adding x number of web and worker roles might require y number of
additional queues and z number of storage accounts to handle the additional workload generated by the roles.
So a scale unit could consist of x web and worker roles,
y
queues, and
z
storage accounts. Design the application
so that it's easily scaled by adding one or more scale units. Consider using the Deployment Stamps pattern to
deploy scale units.
Avoid client affinityAvoid client affinity. Where possible, ensure that the application does not require affinity. Requests can thus
be routed to any instance, and the number of instances is irrelevant. This also avoids the overhead of storing,
retrieving, and maintaining state information for each user.
Take advantage of platform autoscaling featuresTake advantage of platform autoscaling features. Where the hosting platform supports an autoscaling
capability, such as Azure Autoscale, prefer it to custom or third-party mechanisms unless the built-in mechanism
can't fulfill your requirements. Use scheduled scaling rules where possible to ensure resources are available
without a start-up delay, but add reactive autoscaling to the rules where appropriate to cope with unexpected
changes in demand. You can use the autoscaling operations in the classic deployment model to adjust
autoscaling, and to add custom counters to rules. For more information, see Auto-scaling guidance.
Offload CPU-intensive and I/O-intensive tasks as background tasksOffload CPU-intensive and I/O-intensive tasks as background tasks. If a request to a service is expected
to take a long time to run or absorb considerable resources, offload the processing for this request to a separate
task. Use worker roles or background jobs (depending on the hosting platform) to execute these tasks. This
strategy enables the service to continue receiving further requests and remain responsive. For more
information, see Background jobs guidance.
Data management
Distribute the workload for background tasksDistribute the workload for background tasks. Where there are many background tasks, or the tasks
require considerable time or resources, spread the work across multiple compute units (such as worker roles or
background jobs). For one possible solution, see the Competing Consumers pattern.
Consider moving toward a Consider moving toward a
shared-nothingshared-nothing
architecture architecture. A shared-nothing architecture uses
independent, self-sufficient nodes that have no single point of contention (such as shared services or storage). In
theory, such a system can scale almost indefinitely. While a fully shared-nothing approach is generally not
practical for most applications, it may provide opportunities to design for better scalability. For example,
avoiding the use of server-side session state, client affinity, and data partitioning are good examples of moving
toward a shared-nothing architecture.
Use data par titioningUse data par titioning. Divide the data across multiple databases and database servers, or design the
application to use data storage services that can provide this partitioning transparently (examples include Azure
SQL Database Elastic Database, and Azure Table storage). This approach can help to maximize performance and
allow easier scaling. There are different partitioning techniques, such as horizontal, vertical, and functional. You
can use a combination of these to achieve maximum benefit from increased query performance, simpler
scalability, more flexible management, better availability, and to match the type of store to the data it will hold.
Also, consider using different types of data store for different types of data, choosing the types based on how
well they are optimized for the specific type of data. This may include using table storage, a document database,
or a column-family data store, instead of, or as well as, a relational database. For more information, see Data
partitioning guidance.
Design for eventual consistencyDesign for eventual consistency. Eventual consistency improves scalability by reducing or removing the
time needed to synchronize related data partitioned across multiple stores. The cost is that data is not always
consistent when it is read, and some write operations may cause conflicts. Eventual consistency is ideal for
situations where the same data is read frequently but written infrequently. For more information, see the Data
Consistency Primer.
Reduce chatty interactions between components and ser vicesReduce chatty interactions between components and ser vices. Avoid designing interactions in which an
application is required to make multiple calls to a service (each of which returns a small amount of data), rather
than a single call that can return all of the data. Where possible, combine several related operations into a single
request when the call is to a service or component that has noticeable latency. This makes it easier to monitor
performance and optimize complex operations. For example, use stored procedures in databases to encapsulate
complex logic, and reduce the number of round trips and resource locking.
Use queues to level the load for high velocity data writesUse queues to level the load for high velocity data writes. Surges in demand for a service can
overwhelm that service and cause escalating failures. To prevent this, consider implementing the Queue-Based
Load Leveling pattern. Use a queue that acts as a buffer between a task and a service that it invokes. This can
smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out.
Minimize the load on the data storeMinimize the load on the data store. The data store is commonly a processing bottleneck, a costly resource,
and often not easy to scale out. Where possible, remove logic (such as processing XML documents or JSON
objects) from the data store, and perform processing within the application. For example, instead of passing
XML to the database (other than as an opaque string for storage), serialize or deserialize the XML within the
application layer and pass it in a form that is native to the data store. It's typically much easier to scale out the
application than the data store, so you should attempt to do as much of the compute-intensive processing as
possible within the application.
Minimize the volume of data retrievedMinimize the volume of data retrieved. Retrieve only the data you require by specifying columns and using
criteria to select rows. Make use of table value parameters and the appropriate isolation level. Use mechanisms
like entity tags to avoid retrieving data unnecessarily.
Aggressively use cachingAggressively use caching. Use caching wherever possible to reduce the load on resources and services that
Implementation
generate or deliver data. Caching is typically suited to data that is relatively static, or that requires considerable
processing to obtain. Caching should occur at all levels where appropriate in each layer of the application,
including data access and user interface generation. For more information, see the Caching Guidance.
Handle data growth and retentionHandle data growth and retention. The amount of data stored by an application grows over time. This
growth increases storage costs as well as latency when accessing the data, affecting application throughput and
performance. It may be possible to periodically archive some of the old data that is no longer accessed, or move
data that is rarely accessed into long-term storage that is more cost efficient, even if the access latency is higher.
Optimize Data Transfer Objects (DTOs) using an efficient binar y formatOptimize Data Transfer Objects (DTOs) using an efficient binar y format. DTOs are passed between the
layers of an application many times. Minimizing the size reduces the load on resources and the network.
However, balance the savings with the overhead of converting the data to the required format in each location
where it is used. Adopt a format that has the maximum interoperability to enable easy reuse of a component.
Set cache controlSet cache control. Design and configure the application to use output caching or fragment caching where
possible, to minimize processing load.
Enable client side cachingEnable client side caching. Web applications should enable cache settings on the content that can be cached.
This is commonly disabled by default. Configure the server to deliver the appropriate cache control headers to
enable caching of content on proxy servers and clients.
Use Azure blob storage and the Azure Content Deliver y Network to reduce the load on theUse Azure blob storage and the Azure Content Deliver y Network to reduce the load on the
applicationapplication. Consider storing static or relatively static public content, such as images, resources, scripts, and
style sheets, in blob storage. This approach relieves the application of the load caused by dynamically generating
this content for each request. Additionally, consider using the Content Delivery Network to cache this content
and deliver it to clients. Using the Content Delivery Network can improve performance at the client because the
content is delivered from the geographically closest datacenter that contains a Content Delivery Network cache.
For more information, see the Content Delivery Network Guidance.
Optimize and tune SQL queries and indexesOptimize and tune SQL queries and indexes. Some T-SQL statements or constructs may have an adverse
effect on performance that can be reduced by optimizing the code in a stored procedure. For example, avoid
converting datetimedatetime types to a varcharvarchar before comparing with a datetimedatetime literal value. Use date/time
comparison functions instead. Lack of appropriate indexes can also slow query execution. If you use an
object/relational mapping framework, understand how it works and how it may affect performance of the data
access layer. For more information, see Query Tuning.
Consider denormalizing dataConsider denormalizing data. Data normalization helps to avoid duplication and inconsistency. However,
maintaining multiple indexes, checking for referential integrity, performing multiple accesses to small chunks of
data, and joining tables to reassemble the data imposes an overhead that can affect performance. Consider if
some additional storage volume and duplication is acceptable in order to reduce the load on the data store. Also
consider if the application itself (which is typically easier to scale) can be relied on to take over tasks such as
managing referential integrity in order to reduce the load on the data store. For more information, see Data
partitioning guidance.
Review the performance antipatternsReview the performance antipatterns. See Performance antipatterns for cloud applications for common
practices that are likely to cause scalability problems when an application is under pressure.
Use asynchronous callsUse asynchronous calls. Use asynchronous code wherever possible when accessing resources or services
that may be limited by I/O or network bandwidth, or that have a noticeable latency, in order to avoid locking the
calling thread.
Avoid locking resources, and use an optimistic approach insteadAvoid locking resources, and use an optimistic approach instead. Never lock access to resources such
as storage or other services that have noticeable latency, because this is a primary cause of poor performance.
Always use optimistic approaches to managing concurrent operations, such as writing to storage. Use features
NOTENOTE
of the storage layer to manage conflicts. In distributed applications, data may be only eventually consistent.
Compress highly compressible data over high latency, low bandwidth networksCompress highly compressible data over high latency, low bandwidth networks. In the majority of
cases in a web application, the largest volume of data generated by the application and passed over the network
is HTTP responses to client requests. HTTP compression can reduce this considerably, especially for static
content. This can reduce cost as well as reducing the load on the network, though compressing dynamic content
does apply a fractionally higher load on the server. In other, more generalized environments, data compression
can reduce the volume of data transmitted and minimize transfer time and costs, but the compression and
decompression processes incur overhead. As such, compression should only be used when there is a
demonstrable gain in performance. Other serialization methods, such as JSON or binary encodings, may reduce
the payload size while having less impact on performance, whereas XML is likely to increase it.
Minimize the time that connections and resources are in useMinimize the time that connections and resources are in use. Maintain connections and resources only
for as long as you need to use them. For example, open connections as late as possible, and allow them to be
returned to the connection pool as soon as possible. Acquire resources as late as possible, and dispose of them
as soon as possible.
Minimize the number of connections requiredMinimize the number of connections required. Service connections absorb resources. Limit the number
that are required and ensure that existing connections are reused whenever possible. For example, after
performing authentication, use impersonation where appropriate to run code as a specific identity. This can help
to make best use of the connection pool by reusing connections.
APIs for some services automatically reuse connections, provided service-specific guidelines are followed. It's important
that you understand the conditions that enable connection reuse for each service that your application uses.
Send requests in batches to optimize network useSend requests in batches to optimize network use. For example, send and read messages in batches
when accessing a queue, and perform multiple reads or writes as a batch when accessing storage or a cache.
This can help to maximize efficiency of the services and data stores by reducing the number of calls across the
network.
Avoid a requirement to store ser ver-side session stateAvoid a requirement to store ser ver-side session state where possible. Server-side session state
management typically requires client affinity (that is, routing each request to the same server instance), which
affects the ability of the system to scale. Ideally, you should design clients to be stateless with respect to the
servers that they use. However, if the application must maintain session state, store sensitive data or large
volumes of per-client data in a distributed server-side cache that all instances of the application can access.
Optimize table storage schemasOptimize table storage schemas. When using table stores that require the table and column names to be
passed and processed with every query, such as Azure table storage, consider using shorter names to reduce
this overhead. However, do not sacrifice readability or manageability by using overly compact names.
Create resource dependencies during deployment or at application star tupCreate resource dependencies during deployment or at application star tup. Avoid repeated calls to
methods that test the existence of a resource and then create the resource if it does not exist. Methods such as
CloudTable.CreateIfNotExists
and
CloudQueue.CreateIfNotExists
in the Azure Storage Client Library follow this
pattern. These methods can impose considerable overhead if they are invoked before each access to a storage
table or storage queue. Instead:
Create the required resources when the application is deployed, or when it first starts (a single call to
CreateIfNotExists
for each resource in the startup code for a web or worker role is acceptable). However, be
sure to handle exceptions that may arise if your code attempts to access a resource that doesn't exist. In these
situations, you should log the exception, and possibly alert an operator that a resource is missing.
Under some circumstances, it may be appropriate to create the missing resource as part of the exception
handling code. But you should adopt this approach with caution as the non-existence of the resource might
be indicative of a programming error (a misspelled resource name for example), or some other
infrastructure-level issue.
Use lightweight frameworksUse lightweight frameworks. Carefully choose the APIs and frameworks you use to minimize resource
usage, execution time, and overall load on the application. For example, using Web API to handle service
requests can reduce the application footprint and increase execution speed, but it may not be suitable for
advanced scenarios where the additional capabilities of Windows Communication Foundation are required.
Consider minimizing the number of ser vice accountsConsider minimizing the number of ser vice accounts. For example, use a specific account to access
resources or services that impose a limit on connections, or perform better where fewer connections are
maintained. This approach is common for services such as databases, but it can affect the ability to accurately
audit operations due to the impersonation of the original user.
Carr y out performance profiling and load testingCarr y out performance profiling and load testing during development, as part of test routines, and before
final release to ensure the application performs and scales as required. This testing should occur on the same
type of hardware as the production platform, and with the same types and quantities of data and user load as it
will encounter in production. For more information, see Testing the performance of a cloud service.
Tradeoffs for performance efficiency
12/16/2022 6 minutes to read • Edit Online
Performance efficiency vs. cost efficiency
Performance efficiency vs. operational excellence
Automated performance testingAutomated performance testing
Fast buildsFast builds
As you design the workload, consider tradeoffs between performance optimization and other aspects of the
design, such as cost efficiency, operability, reliability, and security.
Cost can increase as a result of boosting performance. Here are a few factors to consider when optimizing for
performance and how they impact cost:
Avoid cost estimation of a workload at consistently high utilization. Consumption-based pricing will be
more expensive than the equivalent provisioned pricing. Smooth out the peaks to get a consistent flow of
compute and data. Ideally, use manual and autoscaling to find the right balance. Scaling up is generally
more expensive than scaling out.
Cost scales directly with number of regions. Locating resources in cheaper regions shouldn't negate the
cost of network ingress and egress or degraded application performance because of increased latency.
Every render cycle of a payload consumes both compute and memory. You can use caching to reduce
load on servers and save with pre-canned storage and bandwidth costs. The savings can be dramatic,
especially for static content services.
While caching can reduce cost, there are some performance tradeoffs. For example, Azure Traffic
Manager pricing is based on the number of DNS (Domain Name Service) queries that reach the
service. You can reduce that number through caching and configure how often the cache is refreshed.
Relying on the cache that isn't frequently updated will cause longer user failover times if an endpoint
is unavailable.
Using dedicated resources for batch processing long running jobs will increase the cost. You can lower
cost by provisioning Spot VMs but be prepared for the job to be interrupted every time Azure evicts the
VM.
For cost considerations, see the Cost Optimization pillar.
As you determine how to scale your workload to meet the demands placed on it by users in an efficient manner,
consider the operations processes that are keeping an application running in production. To achieve operational
excellence with these processes, make sure the deployments remain reliable and predictable. They should be
automated to reduce the chance of human error. They should be a fast and routine process, so they don't slow
down the release of new features or bug fixes. Equally important, you must be able to quickly roll back or roll
forward if an update has problems.
One operational process that can help to identify performance issues early is automated performance testing.
The impact of a serious performance issue can be as severe as a bug in the code. While automated functional
tests can prevent application bugs, they might not detect performance problems. Define acceptable performance
goals for metrics such as latency, load times, and resource usage. Include automated performance tests in your
release pipeline, to make sure the application meets those goals.
Monitoring performance optimizationMonitoring performance optimization
Performance efficiency vs. reliability
Another operational efficiency process is making sure that your product is in a deployable state through a fast
build process. Builds provide crucial information about the status of your product.
The following can help faster builds:
Select the right size of VMs.
Ensure that the build server is located near the sources and a target location, so it can reduce the duration of
your build considerably.
Scale-out build servers.
Optimizing the build.
For an explanation of these items, see Builds.
As you consider making performance improvements, monitoring should be done to verify that your application
is running correctly. Monitoring should include the application, platform, and networking. To learn more, see
Monitoring.
For operational considerations, see the Operational Excellence pillar.
We acknowledge up front that failures will happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component.
Reliable applications are
resilient
and
highly available
(HA). Resiliency allows systems to recover gracefully from
failures, and they continue to function with minimal downtime and data loss before full recovery. HA systems
run as designed in a healthy state with no significant downtime. Maintaining reliability enables you to maintain
performance efficiency.
Some reliability considerations are:
Use the Circuit Breaker pattern to provide stability while the system recovers from a failure and
minimizes the impact on performance.
Achieve levels of scale and performance needed for your solution by segregating read and write
interfaces by implementing the CQRS pattern.
Often, you can achieve higher availability by adopting an
eventual consistency
model. To learn about
selecting the correct data store, see Use the best data store for the job.
If your application requires more storage accounts than are currently available in your subscription,
create a new subscription with additional storage accounts. For more information, see Scalability and
performance targets.
Avoid scaling up or down. Instead, select a tier and instance size that meet your performance
requirements under typical load, and then scale out the instances to handle changes in traffic volume.
Scaling up and down may trigger an application restart.
Create a separate storage account for logs. Don't use the same storage account for logs and application
data. This helps to prevent logging from reducing application performance.
Monitor performance. Use a performance monitoring service such as Application Insights or New Relic to
monitor application performance and behavior under load. Performance monitoring gives you real-time
insight into the application. It enables you to diagnose issues and perform root-cause analysis of failures.
For resiliency, availability, and reliability considerations, see the Reliability pillar.
Performance efficiency vs. security
If performance is so poor that the data is unusable, you can consider the data to be inaccessible. From a security
perspective, you need to do whatever you can to make sure that your services have optimal uptime and
performance.
A popular and effective method for enhancing availability and performance is load balancing. Load balancing is
a method of distributing network traffic across servers that are part of a service. It helps performance because
the processor, network, and memory overhead for serving requests are distributed across all the load-balanced
servers. We recommend that you employ load balancing whenever you can, and as appropriate for your
services. For information on load balancing scenarios, see Optimize uptime and performance.
Consider these security measures, which impact performance:
To optimize performance and maximize availability, application code should first try to get OAuth access
tokens silently from a cache before attempting to acquire a token from the identity provider. OAuth is a
technological standard that allows you to securely share information between services without exposing
your password.
Ensure that you are integrating critical security alerts and logs into SIEMs (security information and event
management) without introducing a high volume of low value data. Doing so can increase SIEM cost,
false positives, and lower performance. For more information, see Prioritize alert and log integration.
Use Azure AD Connect to synchronize your on-premises directory with your cloud directory. There are
factors that affect the performance of Azure AD Connect. Ensure Azure AD Connect has enough capacity
to keep underperforming systems from impeding security and productivity. Large or complex
organizations (organizations provisioning more than 100,000 objects) should follow the
recommendations to optimize their Azure AD Connect implementation.
If you want to gain access to real time performance information at the packet level, use packet capture to
set alerts.
For other security considerations, see the Security pillar.
Mission
-
critical workloads
12/16/2022 5 minutes to read • Edit Online
What is a mission-critical workload?
Video: Mission-critical workloads on Azure
What are the common challenges?
Is mission-critical only about reliability?
This section strives to address the challenges of designing mission-critical workloads on Azure. The guidance is
based on lessons learned from reviewing numerous customer applications and first-party solutions. This section
provides actionable and authoritative guidance that applies Well-Architected best practices as the technical
foundation for building and operating a highly reliable solution on Azure at-scale.
The term
workload
refers to a collection of application resources that support a common business goal or the
execution of a common business process, with multiple services, such as APIs and data stores, working together
to deliver specific end-to-end functionality.
The term
mission-critical
refers to a criticality scale that covers significant financial cost (business-critical) or
human cost (safety-critical) associated with unavailability or underperformance.
A
mission-critical workload
therefore describes a collection of application resources, which must be highly
reliable on the platform. The workload must always be available, resilient to failures, and operational.
Microsoft Azure makes it easy to deploy and manage cloud solutions. However, building mission-critical
workloads that are highly reliable on the platform remains a challenge for these main reasons:
Designing a reliable application at scale is complex. It requires extensive platform knowledge to select the
right technologies
and
optimally configure them to deliver end-to-end functionality.
Failure is inevitable in any complex distributed system, and the solution must therefore be architected to
handle failures with correlated or cascading impact. This is a change in mindset for many developers and
architects entering the cloud from an on-premises environment; reliability engineering is no longer an
infrastructure subject, but should be a first-class concern within the application development process.
Operationalizing mission-critical workloads requires a high degree of engineering rigor and maturity
throughout the end-to-end engineering lifecycle as well as the ability to learn from failure.
While the primary focus of mission-critical workloads is Reliability, other pillars of the Well-Architected
Framework are equally important when building and operating a mission-critical workload on Azure.
Security: how a workload mitigates security threats, such as Distributed Denial of Service (DDoS) attacks,
will have a significant bearing on overall reliability.
Operational Excellence: how a workload is able to effectively respond to operational issues will have a
direct impact on application availability.
Performance Efficiency: availability is more than simple uptime, but rather a consistent level of application
service and performance relative to a known healthy state.
What are the key design areas?
DESIGN A READESIGN A REA SUM M A RYSUM M A RY
Application designApplication design The use of a scale-unit architecture in the context of building
a highly reliable application. Also explores the cloud
application design patterns that allow for scaling, and error
handling.
Application platformApplication platform Decision factors and recommendations related to the
selection, design, and configuration of an appropriate
application hosting platform, application dependencies,
frameworks, and libraries.
Data platformData platform Choices in data store technologies, informed by evaluating
the required—volume, velocity, variety, veracity.
Networking and connectivityNetworking and connectivity Network topology concepts at an application level,
considering requisite connectivity and redundant traffic
management. Critical recommendations intended to inform
the design of a secure and scalable global network topology.
Achieving high reliability imposes significant cost tradeoffs, which may not be justifiable for every workload
scenario. It is therefore recommended that design decisions be driven by business requirements.
Mission-critical guidance within this series is composed of architectural considerations and recommendations
orientated around these key design areas.
The design areas are interrelated and decisions made within one area can impact or influence decisions across
the entire design. We recommend that readers familiarize themselves with these design areas, reviewing
provided considerations and recommendations to better understand the consequences of encompassed
decisions. For example, to define a target architecture it's critical to determine how best to monitor application
health across key components. In this instance, the reader should review the health modelinghealth modeling design area,
using the outlined recommendations to help drive decisions.
Health modeling and observabilityHealth modeling and observability Processes to define a robust health model, mapping
quantified application health states through observability
and operational constructs to achieve operational maturity.
Deployment and testingDeployment and testing Eradicate downtime and maintain application health for
deployment operations, providing key considerations and
recommendations intended to inform the design of optimal
CI/CD pipelines for a mission-critical application.
SecuritySecurity Protect the application against threats intended to directly
or indirectly compromise its reliability.
Operational proceduresOperational procedures Adoption of DevOps and related deployment methods is
used to drive effective and consistent operational
procedures.
DESIGN A READESIGN A REA SUM M A RYSUM M A RY
Illustrative examples
Industry scenarios
The guidance provided within this series is based on a solution-orientated approach to illustrate key design
considerations and recommendations. There are several reference implementations available as part of the
Mission-Critical open source project on GitHub. These implementations can be used as a basis for further
solution development.
Baseline architecture of an internet-facing application—Provides a foundation for building a cloud-native,
highly scalable, internet-facing application on Microsoft Azure. The workload is accessed over a public
endpoint and doesn't require private network connectivity to a surrounding organizational technical
estate.
Refer to the implementation: Mission-Critical Online
Baseline architecture of an internet-facing application with network controls—Extends the baseline
architecture with strict network controls in place to prevent unauthorized public access from the internet
to any of the workload resources.
Mission-Critical Connected
Provides a foundation for building a corporate-connected cloud-native application on Microsoft Azure
using existing network infrastructure and private endpoints. The workload requires private connectivity to
other organizational resources and takes a dependency on pre-provided Virtual Networks for
connectivity to other organizational resources. This use case is intended for scenarios that require
integration with a broader organizational technical estate for either public-facing or internal-facing
workloads.
The mission-critical guidance within this series forms an industry agnostic design methodology which can be
applied across a multitude of different industry contexts. The following list provides specific examples where the
mission-critical design methodology has been applied and tailored to a particular industry scenario.
Carrier-grade within the telecommunications industry
A carrier-grade workload pivots on both business-critical and safety-critical aspects, where there's a
fundamental requirement to be operational with only minutes or even seconds of downtime per calendar year.
Next step
Failure to achieve this uptime requirement can result in extensive loss of life, incur significant fines, or
contractual penalties.
Start by reviewing the design methodology for mission-critical application scenarios.
Design methodology
Design methodology for mission
-
critical workloads
on Azure
12/16/2022 4 minutes to read • Edit Online
1Design for business requirements
Select a reliability tierSelect a reliability tier
REL IA B IL IT Y T IERRE L IA BIL IT Y T IER
( AVA IL A B ILIT Y SLO )( AVA IL A B ILIT Y SLO )
P ERM IT T E D DO W N T IM EP ERM IT T ED DOW N T IM E
( WE EK)( WE EK)
P ERM IT T E D DO W N T IM EP ERM IT T ED DOW N T IM E
( M O NT H )( M O N T H )
P ERM IT T E D DO W N T IM EP ERM IT T ED DOW N T IM E
( Y EA R)( Y EA R)
99.9% 10 minutes, 4 seconds 43 minutes, 49 seconds 8 hours, 45 minutes, 56
seconds
99.95% 5 minutes, 2 seconds 21 minutes, 54 seconds 4 hours, 22 minutes, 58
seconds
99.99% 1 minutes 4 minutes 22 seconds 52 minutes, 35 seconds
99.999% 6 seconds 26 seconds 5 minutes, 15 seconds
99.9999% <1 second 2 seconds 31 seconds
Building a mission-critical application on any cloud platform requires significant technical expertise and
engineering investment, particularly since there's significant complexity associated with:
Understanding the cloud platform,
Choosing the right services and composition,
Applying the correct service configuration,
Operationalizing utilized services, and
Constantly aligning with the latest best practices and service roadmaps.
This design methodology strives to provide an easy to follow design path to help navigate this complexity and
inform design decisions required to produce an optimal target architecture.
Not all mission-critical workloads have the same requirements. Expect that the review considerations and design
recommendations provided by this design methodology will yield different design decisions and trade-offs for
different application scenarios.
Reliability is a relative concept and for any workload to be appropriately reliable it should reflect the business
requirements surrounding it. For example, a mission-critical workload with a 99.999% availability Service Level
Objective (SLO) requires a much higher level of reliability than another less critical workload with an SLO of
99.9%.
This design methodology applies the concept of reliability tiers expressed as availability SLOs to inform required
reliability characteristics. The table below captures permitted error budgets associated with common reliability
tiers.
IMPORTANTIMPORTANT
2Evaluate the design areas using the design principles
3Deploy your first mission-critical application
Availability SLO is considered by this design methodology to be more than simple uptime, but rather a consistent level of
application service relative to a known healthy application state.
As an initial exercise, readers are advised to select a target reliability tier by determining how much downtime is
acceptable? The pursuit of a particular reliability tier will ultimately have a significant bearing on the design path
and encompassed design decisions, which will result in a different target architecture.
This image shows how the different reliability tiers and underlying business requirements influence the target
architecture for a conceptual reference implementation, particularly concerning the number of regional
deployments and utilized global technologies.
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are further critical aspects when
determining required reliability. For instance, if you're striving to achieve an application RTO of less than a
minute then back-up based recovery strategies or an active-passive deployment strategy are likely to be
insufficient.
At the core of this methodology lies a critical design path comprised of:
Foundational design principlesdesign principles
Fundamental design areadesign area with heavily interrelated and dependent design decisions.
The impact of decisions made within each design area will reverberate across other design areas and design
decisions. Review the provided considerations and recommendations to better understand the consequences of
encompassed decisions, which may produce trade-offs within related design areas.
For example, to define a target architecture it's critical to determine how best to monitor application health
across key components. We highly recommend that you review the health modeling design area, using the
outlined recommendations to help drive decisions.
Refer to these reference architectures that describe the design decisions based on this methodology.
TIPTIP
4Integrate your workload in Azure landing zones
Online subscriptionOnline subscription
Baseline architecture of an internet-facing application
Baseline architecture of an internet-facing application with network controls
The architecture is backed by Mission-Critical Online implementation that illustrates the design recommendations.
Production-grade ar tifactsProduction-grade ar tifacts Every technical artifact is ready for use in production environments with all end-
to-end operational aspects considered.
Rooted in real-wold experiencesRooted in real-wold experiences All technical decisions are guided by experiences of Azure customers and
lessons learned from deploying those solutions.
Azure roadmap alignmentAzure roadmap alignment The mission-critical reference architectures have their own roadmap that is
aligned with Azure product roadmaps.
Azure landing zone subscriptions provide shared infrastructure for enterprise deployments that need centralized
governance.
It's crucial to evaluate which connectivity use case is required by your mission-critical application. Azure landing
zones support two main archetypes separated into different Management Group scopes: OnlineOnline or Corp.Corp. as
shown in this image.
A mission-critical workload operates as an independent solution, without any direct corporate network
connectivity to the rest of the Azure landing zone architecture. The application will be further safeguarded
through the policy-driven governance and will automatically integrate with centralized platform logging
through policy.
The baseline architecture and Mission-Critical Online implementation align with the Online approach.
Corp. subscriptionCorp. subscription
TIPTIP
5Deploy a sandbox application environment
6Continuously evolve with Azure roadmaps
Next step
When deployed in a Corp. subscription a mission-critical workload depends on the Azure landing zone to
provide connectivity resources. This approach allows integration with other applications and shared services.
You'll need to design around some foundational resources, which will exist up-front as part of the shared-
service platform. For example, the regional deployment stamp should no longer encompass an ephemeral
Virtual Network or Azure Private DNS Zone because these will exist in the Corp. subscription.
To get started with this use case, we recommend the baseline architecture in an Azure landing zonebaseline architecture in an Azure landing zone
reference architecture.
The preceding architecture is backed by Mission-Critical Connected implementation.
In parallel to design activities, it's highly recommended that a sandbox application environment is established
using the Mission-Critical reference implementations.
This provides hands-on opportunities to validate design decisions by replicating the target architecture, allowing
for design uncertainty to be quickly assessed. If applied correctly with representative requirement coverage,
most problematic issues likely to hinder progress can be uncovered and subsequently addressed.
Application architectures established using this design methodology must continue to evolve in alignment with
Azure platform roadmaps to support optimized sustainability.
Review the design principles for mission-critical application scenarios.
Design principles
Design principles of a mission
-
critical workload
12/16/2022 7 minutes to read • Edit Online
IMPORTANTIMPORTANT
Reliability
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Active/Active designActive/Active design To maximize availability and achieve regional fault tolerance,
solution components should be distributed across multiple
Availability Zones and Azure regions using an active/active
deployment model where possible.
The mission-critical design methodology is underpinned by five key design principles which serve as a compass
for subsequent design decisions across the critical design areas. We highly recommend that you familiarize
yourselves with these principles to better understand their impact and the trade-offs associated with non-
adherence.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The provided code assets illustrate
implementations associated with the design principles highlighted in this article.
These mission-critical design principles resonate and extend the quality pillars of the Azure Well-Architected
Framework—Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.
Maximum reliabilityMaximum reliability
- Fundamental pursuit of the most reliable solution, ensuring trade-offs are properly
understood.
Blast radius reduction and fault isolationBlast radius reduction and fault isolation Failure is impossible to avoid in a highly distributed multi-
tenant cloud environment like Azure. By anticipating failures
and correlated impact, from individual components to entire
Azure regions, a solution can be designed and developed in
a resilient manner.
Obser ve application healthObser ve application health Before issues impacting application reliability can be
mitigated, they must first be detected and understood. By
monitoring the operation of an application relative to a
known healthy state it becomes possible to detect or even
predict reliability issues, allowing for swift remedial action to
be taken.
Drive automationDrive automation One of the leading causes of application downtime is human
error, whether that is due to the deployment of insufficiently
tested software or misconfiguration. To minimize the
possibility and impact of human errors, it's vital to strive for
automation in all aspects of a cloud solution to improve
reliability; automated testing, deployment, and management.
Design for self-healingDesign for self-healing Self healing describes a system's ability to deal with failures
automatically through pre-defined remediation protocols
connected to failure modes within the solution. It's an
advanced concept that requires a high level of system
maturity with monitoring and automation, but should be an
aspiration from inception to maximize reliability.
Complexity avoidanceComplexity avoidance Avoid unnecessary complexity when designing the solution
and all operational processes to drive reliability and
management efficiencies, minimizing the likelihood of
failures.
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Performance Efficiency
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Design for scale-outDesign for scale-out Scale-out is a concept that focuses on a system's ability to
respond to demand through horizontal growth. This means
that as traffic grows,moreresource units are added in parallel
instead of increasing the size of the existing resources. A
systems ability to handle expected and unexpected traffic
increases through scale-units is essential to overall
performance and reliability by further reducing the impact of
a single resource failure.
Automation for hyperscaleAutomation for hyperscale Scale operations throughout the solution should be fully
automated to minimize the performance and availability
impact from unexpected or expected increases in traffic,
ensuring the time it takes to conduct scale operations is
understood and aligned with a model for application health.
Sustainable performance and scalabilitySustainable performance and scalability
- Design for scalability across the end-to-end solution without
performance bottlenecks.
Continuous validation and testingContinuous validation and testing Automated testing should be performed within CI/CD
processes to drive continuous validation for each application
change. Load testing against a performance baseline with
synchronized chaos experimentation should be included to
validate existing thresholds, targets, and assumptions, as
well as helping to quickly identify risks to resiliency and
availability. Such testing should be conducted within staging
and testing environments, but also optionally within
development environments. It can also be beneficial to run a
subset of tests against the production environment,
particularly in conjunction with a blue/green deployment
model to validate new deployment stamps before receiving
production traffic.
Reduce overhead with managed compute servicesReduce overhead with managed compute services Using managed compute services and containerized
architectures significantly reduces the ongoing
administrative and operational overhead of designing,
operating, and scaling applications by shifting infrastructure
deployment and maintenance to the managed service
provider.
Baseline performance and identify bottlenecksBaseline performance and identify bottlenecks Performance testing with detailed telemetry from every
system component allows for the identification of
bottlenecks within the system, including components that
need to be scaled in relation to other components, and this
information should be incorporated into a capacity model.
Model capacityModel capacity A capacity model enables planning of resource scale levels
for a given load profile, and additionally exposes how system
components perform in relation to each other, therefore
enabling system-wide capacity allocation planning.
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Operational Excellence
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Loosely coupled componentsLoosely coupled components Loose coupling enables independent and on-demand
testing, deployments, and updates to components of the
application while minimizing inter-team dependencies for
support, services, resources, or approvals.
Automate build and release processesAutomate build and release processes Fully automated build and release processes reduce the
friction and increase the velocity of deploying updates,
bringing repeatability and consistency across environments.
Automation shortens the feedback loop from developers
pushing changes to getting insights on code quality, test
coverage, resiliency, security, and performance, which
increases developer productivity.
Developer agilityDeveloper agility Continuous Integration and Continuous Deployment
(CI/CD) automation enables the use of short-lived
development environments with lifecycles tied to that of an
associated feature branch, which promotes developer agility
and drives validation as early as possible within the
engineering cycle to minimize the engineering cost of bugs.
Operations by designOperations by design
- Engineered to last with robust and assertive operational management.
Quantify operational healthQuantify operational health Full diagnostic instrumentation of all components and
resources enables ongoing observability of logs, metrics and
traces, but also facilitates health modeling to quantify
application health in the context to availability and
performance requirements.
Rehearse recover y and practice failureRehearse recover y and practice failure Business Continuity (BC) and Disaster Recovery (DR)
planning and practice drills are essential and should be
conducted frequently, since learnings can iteratively improve
plans and procedures to maximize resiliency in the event of
unplanned downtime.
Embrace continuous operational improvementEmbrace continuous operational improvement Prioritize routine improvement of the system and user
experience, using a health model to understand and
measure operational efficiency with feedback mechanisms to
enable application teams to understand and address gaps in
an iterative manner.
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Security
DESIGN P R INC IP L EDESIGN P R IN C IP L E C O N SIDE RAT IO N SC O N SIDERAT IO NS
Monitor the security of the entire solution and planMonitor the security of the entire solution and plan
incident responsesincident responses
Correlate security and audit events to model application
health and identify active threats. Establish automated and
manual procedures to respond to incidents using Security
Information and Event Management (SIEM) tooling for
tracking.
Model and test against potential threatsModel and test against potential threats Ensure appropriate resource hardening and establish
procedures to identify and mitigate known threats, using
penetration testing to verify threat mitigation, as well as
static code analysis and code scanning.
Identify and protect endpointsIdentify and protect endpoints Monitor and protect the network integrity of internal and
external endpoints through security capabilities and
appliances, such as firewalls or web application firewalls. Use
industry standard approaches to protect against common
attack vectors like Distributed Denial-Of-Service (DDoS)
attacks, such as SlowLoris.
Protect against code level vulnerabilitiesProtect against code level vulnerabilities Identify and mitigate code-level vulnerabilities, such as
cross-site scripting or SQL injection, and incorporate security
patching into operational lifecycles for all parts of the
codebase, including dependencies.
Automate and use least privilegeAutomate and use least privilege Drive automation to minimize the need for human
interaction and implement least privilege across both the
application and control plane to protect against data
exfiltration and malicious actor scenarios.
Classify and encr ypt dataClassify and encr ypt data Classify data according to risk and apply industry standard
encryption at rest and in transit, ensuring keys and
certificates are stored securely and managed properly.
Always secureAlways secure
- Design for end-to-end security to maintain application stability and ensure availability.
Cost Optimization
Cloud native design
Next step
There are obvious cost tradeoffs associated with introducing greater reliability, which should be carefully
considered in the context of workload requirements.
Maximizing reliability can impact the overall financial cost of the solution. For example, the duplication of
resources and the distribution of resources across regions to achieve high availability has clear cost implications.
To avoid excess costs, don't over-engineer or over-provision beyond the relevant business requirements.
Also, there is added cost associated with engineering investment in fundamental reliability concepts, such as
embracing infrastructure as code, deployment automation, and chaos engineering. This comes at a cost in terms
of both time and effort, which could be invested elsewhere to deliver new application functionality and features.
Azure-native managed ser vicesAzure-native managed ser vices - Azure-native managed services are prioritized due to their lower
administrative and operational overhead as well as tight integration with consistent configuration and
instrumentation across the application stack.
Roadmap alignmentRoadmap alignment - Incorporate upcoming new and improved Azure service capabilities as they
become Generally Available (GA) to stay close to the leading edge of Azure.
Embrace preview capabilities and mitigate known gapsEmbrace preview capabilities and mitigate known gaps - While Generally Available (GA) services
are prioritized for supportability, Azure service previews are actively explored for rapid incorporation,
providing technical and actionable feedback to Azure product groups to address gaps.
Azure landing zone alignmentAzure landing zone alignment - Deployable within an Azure landing zone and aligned to the Azure
landing zone design methodology, but also fully functional and deployable in a bare environment outside
of a landing zone.
Review cross-cutting concerns associated with mission-critical workloads.
Cross-cutting concerns
Architecture pattern for mission
-
critical workloads
on Azure
12/16/2022 5 minutes to read • Edit Online
C H A RA C T ERIST ICC H A RA C T ERIST IC C O N SIDERAT IO NSC O N SIDERAT IO NS
Lifetime What's the expected lifetime of the resource, relative to other
resources in the solution? Should the resource outlive or
share the lifetime with the entire system or region, or should
it be temporary?
State What impact will the persisted state at this layer have on
reliability or manageability?
Reach Is the resource required to be globally distributed? Can the
resource communicate with other resources, located globally
or within that region?
Dependencies What are the dependencies on other resources?
Scale limits What is the expected throughput for that resource? How
much scale is provided by the resource to fit that demand?
Availability/disaster recovery What is the impact on availability from a disaster at this
layer? Would it cause a systemic outage or only a localized
capacity or availability issue?
IMPORTANTIMPORTANT
Core architecture pattern
This article presents a key pattern for mission-critical architectures on Azure. Apply this pattern when you start
your design process, and then select components that are best suited for your business requirements. The
article recommends a
north star
design approach and includes other examples with common technology
components.
We recommend that you evaluate the key design areasthe key design areas, define the critical user and system flows that use the
underlying components, and develop a matrix of Azure resources and their configuration while keeping in mind
the following characteristics.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Global resourcesGlobal resources
C H A RA C T ERIST ICC H A RA C T ERIST IC C O N SIDERAT IO NSC O N SIDERAT IO NS
Lifetime These resources are expected to be long living (non-
ephemeral). Their lifetime spans the life of the system or
longer. Often the resources are managed with in-place data
and control plane updates, assuming they support zero-
downtime update operations.
State Because these resources exist for at least the lifetime of the
system, this layer is often responsible for storing global, geo-
replicated state.
Certain resources are globally shared by resources deployed within each region. Common examples are
resources that are used to distribute traffic across multiple regions, store permanent state for the whole
application, and monitor resources for them.
Reach The resources should be globally distributed and replicated
to the regions that host those resources. It’s recommended
that these resources communicate with regional or other
resources with low latency and the desired consistency.
Dependencies The resources should avoid dependencies on regional
resources because their unavailability can be a cause for
global failure. For example, certificates or secrets kept in a
single vault could have global impact if there's a regional
failure where the vault is located.
Scale limits Often these resources are singleton instances in the system,
and they should be able to scale such that they can handle
throughput of the system as a whole.
Availability/disaster recovery Regional and stamp resources can use global resources. It's
critical that global resources are configured with high
availability and disaster recovery for the health of the whole
system.
C H A RA C T ERIST ICC H A RA C T ERIST IC C O N SIDERAT IO NSC O N SIDERAT IO NS
Regional stamp resourcesRegional stamp resources
C H A RA C T ERIST ICC H A RA C T ERIST IC C O N SIDERAT IO NSC O N SIDERAT IO NS
Lifetime The resources are expected to have a short life span
(ephemeral) with the intent that they can get added and
removed dynamically while regional resources outside the
stamp continue to persist. The ephemeral nature is needed
to provide more resiliency, scale, and proximity to users.
State Because stamps are ephemeral and will be destroyed with
each deployment, a stamp should be stateless as much as
possible.
Reach Can communicate with regional and global resources.
However, communication with other regions or other stamps
should be avoided.
Dependencies The stamp resources must be independent. They're expected
to have regional and global dependencies but shouldn't rely
on components in other stamps in the same or other
regions.
Scale limits Throughput is established through testing. The throughput
of the overall stamp is limited to the least performant
resource. Stamp throughput needs to estimate the high-
level of demand caused by a failover to another stamp.
Availability/disaster recovery Because of the temporary nature of stamps, disaster
recovery is done by redeploying the stamp. If resources are
in an unhealthy state, the stamp, as a whole, can be
destroyed and redeployed.
Regional resourcesRegional resources
The stamp contains the application and resources that participate in completing business transactions. A stamp
typically corresponds to a deployment to an Azure region. Although a region can have more than one stamp.
C H A RA C T ERIST ICC H A RA C T ERIST IC C O N SIDERAT IO NC O N SIDE RAT IO N
Lifetime The resources share the lifetime of the region and out live
the stamp resources.
State State stored in a region can't live beyond the lifetime of the
region. If state needs to be shared across regions, consider
using a global data store.
Reach The resources don't need to be globally distributed. Direct
communication with other regions should be avoided at all
cost.
Dependencies The resources can have dependencies on global resources,
but not on stamp resources because stamps are meant to
be short lived.
Scale limits Determine the scale limit of regional resources by combining
all stamps within the region.
Baseline architectures for mission-critical workloads
A system can have resources that are deployed in region but outlive the stamp resources. For example,
observability resources that monitor resources at the regional level, including the stamps.
These baseline examples serve as the recommended north star architecture for mission-critical applications. The
baseline strongly recommends containerization and using a container orchestrator for the application platform.
The baseline uses Azure Kubernetes Service (AKS).
Refer to Well-Architected mission-critical workloads: Containerization.
Baseline architecture
If you're just starting your mission-critical journey, use this architecture as a reference. The workload is
accessed over a public endpoint and doesn't require private network connectivity to other company
resources.
Baseline with network controls
This architecture builds on the baseline architecture. The design is extended to provide strict network
controls to prevent unauthorized public access from the internet to the workload resources.
Design areas
Next step
Baseline in Azure landing zones
This architecture is appropriate if you're deploying the workload in an enterprise setup where integration
within a broader organization is required. The workload uses centralized shared services, needs on-
premises connectivity, and integrates with other workloads within the enterprise. It's deployed in an
Azure landing zone subscription that inherits from the Corp. management group.
We recommend that you use the provided design guidance to navigate the key design decisions to reach an
optimal solution. For information, see What are the key design areas?
Review the best practices for designing mission-critical application scenarios.
Application design
Cross
-
cutting concerns of mission
-
critical workloads
on Azure
12/16/2022 2 minutes to read • Edit Online
IMPORTANTIMPORTANT
Scale limits
IMPORTANTIMPORTANT
RecommendationsRecommendations
Automation
RecommendationsRecommendations
There are several cross-cutting concerns that traverse the key design areas. This article contextualizes these
cross-cutting concerns for subsequent consideration within each design area.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
There are reference implementations available as part of an open source project on GitHub. The code assets provided by
these implementations illustrate the recommendations highlighted in this article.
Azure applies various
limits
or
quotas
to ensure a consistent level of service for all customers. Examples of these
limits include restrictions on the number of deployable resources within a single subscription, and restrictions to
network and query throughput.
Service limits may have a significant bearing on a large mission-critical workload. Consider the limits of the
services used in the target architecture carefully to ensure sustainable scale. Otherwise, you may hit one or
more of these limits as the workload grows.
Limits and quotas may change as the platform evolves. Be sure to check the current limits at Azure subscription and
service limits, quotas, and constraints.
Employ a scale unit approach for resource composition, deployment, and management.
Use subscriptions as scale units, scaling out resources and subscriptions as required.
Ensure scale limits are considered as part of capacity planning.
If available, use data about existing application environments to explore which limits might be encountered.
A holistic approach to automation of deployment and management activities can maximize the reliability and
operability of the workload.
Automate continuous integration and continuous delivery (CI/CD) pipelines for all application components.
Automate application management activities, such as patching and monitoring.
Use declarative management semantics, such as Infrastructure as code (IaC), instead of over imperative
approaches.
Azure roadmap alignment
RecommendationsRecommendations
Next step
Prioritize templating over scripting. Defer to scripting only when using templates isn't possible.
Azure is constantly evolving through frequent updates to services, features, and regional availability. It's
important to align the target architecture with Azure platform roadmaps to inform an optimal application
trajectory. For example, making sure that the required services and features are available within the chosen
deployment regions.
Refer to Azure updates for the latest information about new services and features.
Align with Azure engineering roadmaps and regional rollout plans.
Unblock with preview services or by taking dependencies on the Azure platform roadmap.
Only take a dependency on committed services and features; validate roadmap dependencies with Microsoft
engineering product groups.
Explore the design areas that provide critical considerations and recommendations for building a mission-
critical workload.
Architecture pattern
Application design of mission
-
critical workloads on
Azure
12/16/2022 19 minutes to read • Edit Online
IMPORTANTIMPORTANT
Scale-unit architecture
TIPTIP
Both functional application requirements and non-functional requirements are critical to inform key design
decisions for a mission-critical application design. However, these requirements should be examined alongside
key cloud application design patterns to ensure aspirations are fully achieved.
This design area explores the important application design patterns for building a highly reliable application on
Azure.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The implementations provide
solution-orientated showcases for how these foundational design concepts can be leveraged, alongside Azure-native
capabilities, to maximize reliability.
Architecturally, it is critical to optimize end-to-end scalability through the logical compartmentalization of
operational functions at all levels of the application stack. For achieving a highly available application design, all
functional aspects of the solution must be capable of scaling to meet changes in demand.
A
scale-unit
is a logical unit or function that can be scaled independently. A unit can be code components,
application hosting platforms, or even deployment stamps that encompass related components.
For more information, see the Deployment Stamps pattern for further details.
For example, the Mission-Critical online reference implementation considers a user flow for processing product
comments and ratings in a sample web catalog application that use APIs for retrieving and posting comments
and ratings, and supporting components such as an OAuth endpoint, datastore, and message queues. These
stateless API endpoints for retrieving and posting comments and ratings represent granular functional units that
must be able to adapt to changes in demand. However, for these to be truly scalable, the underlying application
platform must also be able to scale in-kind. Similarly, to avoid performance bottlenecks in the end-to-end user
flow and to achieve sustainable scale, the downstream components and dependencies must also be able to scale
to an appropriate degree, either independently, as a separate scale-unit, or together, as part of a single logical
unit.
This image shows the multiple scale-unit scopes that are considered by this reference implementation user flow.
These scopes range from microservice pods to cluster nodes and regional deployment stamps.
Design considerationsDesign considerations
Design recommendationsDesign recommendations
Using a scale-unit architecture is recommended to optimize the end-to-end scalability of a mission-critical
application so that all levels of the solution can appropriately scale. The relationship between related scale-units,
and the components inside a single scale-unit, should be defined according to a capacity model, taking into
consideration non-functional requirements around performance.
Here are some benefits of using a scale-unit architecture:
The scale-unit architecture pattern goes great lengths to address scale limits of individual resources and
the application as a whole.
A scale-unit architecture helps with complex deployment and update scenarios, since an entire regional
stamp can be deployed as one unit. This architecture allows you to test and validate specific versions of
components together, prior to directing traffic to it.
The scale-unit architectural pattern can also be applied to support multi-tenant requirements for
customer segregation.
Azure subscription scale limits and quotas might have a bearing on application design, technology choices, and
the definition of scale-units.
When designing a highly available architecture, start by asking these questions.
How many requests is the solution required to suppor t for each user flow? Are the usage patternsHow many requests is the solution required to suppor t for each user flow? Are the usage patterns
predictable?predictable?
The expected peak request rate (requests per second) and daily/weekly/seasonal traffic patterns are critical to
inform core scale requirements.
Is traffic expected to grow? At what rate will it grow?Is traffic expected to grow? At what rate will it grow?
The expected growth patterns for both traffic and data volume inform the design, about sustainable scale.
Is a degraded ser vice with high response times acceptable under load?Is a degraded ser vice with high response times acceptable under load?
The required performance of the solution under load is a critical decision factor when modeling required
capacity.
NOTENOTE
IMPORTANTIMPORTANT
Define a scale-unit when the scale-limits of a single deployment are likely to be exceeded.
Ensure all application components are able to scale, either as independent scale-units or as part of a
logical scale-unit that encompasses multiple related components.
Define the relationship between scale-units, according to a capacity model and non-functional
requirements.
Define a regional deployment stamp to unify the provisioning, management, and operation of regional
application resources, into a heterogenous but inter-dependent scale-unit.
As the load increases, extra stamps can be deployed within the same or different Azure regions, in order
to horizontally scale the solution.
When deploying within an Azure landing zone, ensure the landing zone subscription is dedicated to the application, in
order to provide a clear management boundary and to avoid potential the Noisy Neighbor antipattern.
For high-scale application scenarios with significant volumes of traffic, design the solution to scale across
multiple Azure subscriptions, to ensure the inherit scale-limits within a single subscription don't constrain
the scalability.
Define a subscription-scoped deployment as a scale-unit to avoid a 'spill-and-fill' subscription model.
Deploy each regional deployment stamp within a dedicated subscription, in order to ensure the
subscription limits only apply within the context of a single deployment stamp and not across the
application as a whole. Where appropriate, multiple deployment stamps can be considered within a
single region, but you should deploy them across independent subscriptions.
Separate the 'global' shared resources within a dedicated subscription to allow for consistent regional
subscription deployment. Avoid using a specialized deployment for a primary region.
The use of multiple subscriptions necessitates additional CI/CD complexity, which must be appropriately managed.
Therefore, it's only recommended in extreme scale scenarios, where the limits of a single subscription are likely to become
a hindrance.
Where multiple production subscriptions are needed to ensure requisite scale, consider using a dedicated
application management group, to simplify policy assignment through a policy aggregation boundary.
Deploy any considered environments, such as production, development, or test environments, into
separate subscriptions. This practice ensures that lower environments don't contribute towards scale
limits, and it reduces the risk of lower environment updates polluting production, by providing a clear
management and identity boundary.
Define and analyze non-functional requirements, such as the availability SLO, within the context of key
end-to-end user-flows. Technical and business scenarios will likely have distinct considerations for
resilience, availability, latency, capacity, and observability. This practice will allow for relative flexibility in
the design approach, tailoring design decisions and technology choices at a user-flow level, since one size
may not fit all.
Model the required capacity around identified traffic patterns, in order to ensure sufficient capacity is
provisioned at peak times and to prevent service degradation. Use traffic patterns to optimize capacity
and resource utilization, during periods of reduced traffic.
Example
-
Subscription scale
-
unit approachExample
-
Subscription scale
-
unit approach
Video: Subscription structure
Global distribution
Measure the time it takes to perform scale-out and scale-in operations, in order to ensure that the natural
variations in traffic don't create an unacceptable level of service degradation. - To drive continuous
improvement, track the scale operation durations as an operational metric.
This image demonstrates how the single subscription reference deployment model can be expanded across
multiple subscriptions, in an extreme scale scenario, to navigate subscription scale-limits.
Failure is impossible to avoid in any highly distributed environment. So, always plan for failure.
Here are some strategies to mitigate many fault scenarios.
Availability Zones (AZ) allows highly available regional deployments across different data centers within
a region. Nearly all Azure services are available in either a zonal configuration (where service is pinned to
a specific zone) or zone-redundant configuration (where the platform automatically ensures the service
spans across zones and can withstand a zone outage). These configurations allow for fault-tolerance up to
a datacenter level.
To maximize reliability, consider using multiple Azure regions to ensure regional fault tolerance, so that
application availability remains even when an entire region goes down. When designing a multi-region
application, consider different deployment strategies, such as active-active and active-passive, alongside
application requirements, because there are significant trade-offs between each approach.
An active-active deployment strategy represents the gold standard because it maximizes availability and allows
for higher composite Service Level Agreement (SLA). While active-active is the recommended approach, it can
introduce challenges around data synchronization and consistency for many application scenarios, and these
challenges must be fully addressed at a data platform level, with other trade-offs, from increased cost exposure
and increased engineering effort.
Design considerationsDesign considerations
Not every workload supports or requires multiple regions running simultaneously, and hence the precise
application requirements should be weighed against these trade-offs to inform an optimal design decision. For
certain application scenarios with lower reliability targets, different deployment models, such as active-passive
or sharding, can be suitable alternatives.
It's important to note that some Azure services are deployable or configurable as global resources, which aren't
constrained to a particular Azure region. So, when accommodating both 'Scale-Unit Architecture' and 'Global
Distribution', carefully consider to how resources are optimally distributed across Azure regions.
This image shows the high-level active-active design. A user accesses the application through a central global
entry point that then redirects requests to a suitable regional deployment stamp.
Not all services or capabilities are available in every Azure region, and so there can be service availability
implications depending on the selected deployment regions. - For example, Availability Zones aren't
available in every region.
Azure regions are grouped into regional pairs consisting of two regions within the same geography.
Some Azure services use paired regions to ensure business continuity and to protect against data loss.
For example, Azure Geo-redundant Storage (GRS) replicates data to a secondary paired region
automatically, ensuring that data is durable if the primary region isn't recoverable. If an outage affects
multiple Azure regions, at least one region in each pair will be prioritized for recovery.
The Azure Safe Deploy Practice (SDP) ensures all code and configuration changes (planned maintenance)
to the Azure platform undergo a phased roll-out, with health analyzed in case any degradation is detected
during the release. After the Canary and Pilot phases have successfully completed, platform updates are
serialized across regional pairs, ensuring that only one region in each pair is updated at a time.
Like any cloud provider, Azure ultimately has a finite amount of resources and as a result there are
situations that can lead to the unavailability of capacity in individual regions. In the event of a regional
outage there will be a significant increase in demand for resources within the paired region as impacted
customer workloads seek to recover within the paired region. In certain scenarios this may create a
capacity challenge where supply temporarily does not satisfy demand.
When designing a globally distributed architecture, start with these questions.
Are there specific regions where data must reside or where resources have to be deployed?Are there specific regions where data must reside or where resources have to be deployed?
Compliance requirements around geographical data residency, data protection, and data retention can have a
Design recommendationsDesign recommendations
significant bearing on appropriate geographical distribution.
Where are the requests physically originating from?Where are the requests physically originating from?
The geographic proximity and density of users or dependent systems should inform design decisions around
the global distribution.
Are users going to connect from home and/or organizational networks? Can all users be expectedAre users going to connect from home and/or organizational networks? Can all users be expected
to have fast internet connections?to have fast internet connections?
The connectivity method by which users or systems access the application, whether over the public Internet or
private networks using either VPN or Express Route connectivity.
Different Azure regions have slightly different cost profiles for some services. There may be further cost
implications depending on the precise deployment regions chosen.
Availability Zones have a latency perimeter of less than 2 milliseconds between availability zones.
For workloads that are particularly 'chatty' across zones this latency can accumulate to form a non-
trivial performance penalty, as well as incurring bandwidth charges for inter-zone data transfer.
An active-active deployment across Azure and other cloud providers can be considered to further
mitigate reliance on global dependencies within a single cloud provider. A multi-cloud active-active
deployment strategy introduces a significant amount of complexity around CI/CD, given the significant
difference in resource specifications and capabilities between cloud providers. This necessitates
specialized deployment stamps for each cloud.
IMPORTANTIMPORTANT
Deploy the solution within a minimum of two Azure regions to protect against regional outages. Prioritize
the use of paired regions to benefit from SDP risk mitigations and platform recovery capabilities.
For scenarios targeting a >= 99.99% SLO, a minimum of three deployment regions is recommended to maximize
the composite SLA and overall reliability.
Use an active-active deployment strategy where possible to maximize reliability.
Where data/state consistency challenges exist explore the use of a globally distributed data store,
stamped regional architecture, a partially active-active deployment, where some components are active
across all regions while others are located centrally within a primary region.
Calculate the composite SLA for all user flows. Ensure the composite SLA is in-line with business targets.
Deploy additional regional deployment stamps to achieve a greater composite SLA. The use of global
resources will constrain the increase in composite SLA from adding further regions.
Define and validate the recovery point objectives (RPO) and recovery time objectives (RTO).
Geographically co-locate Azure resources with users to minimize network latency and maximize end-to-
end performance.
Technical solutions such as a Content Delivery Network (CDN) or edge caching can also be used to
drive optimal network latency for distributed user bases.
For high-scale application scenarios with significant volumes of traffic, design the solution to scale across
multiple regions to navigate potential capacity constraints within a single region.
Select deployment regions that offer requisite capabilities and characteristics to achieve performance and
availability targets, while fulfilling data residency and retention requirements.
Example
-
Global distribution approachExample
-
Global distribution approach
Video - Global distribution
Loosely coupled event-driven architecture
Within a single geography, prioritize the use of regional pairs to benefit from SDP serialized rollouts for
planned maintenance, and regional prioritization in the event of unplanned maintenance.
It's not uncommon that data compliance requirements will constrain the number of available regions and
potentially force design compromises. In such cases, additional investment in operational wrappers is
highly recommended to predict, detect, and respond to failures.
If only a single Azure region is suitable, multiple deployment stamps ('regional scale-units') should be
deployed within the selected region to mitigate some risk, using Availability Zones to provide
datacenter-level fault tolerance. However, such a significant compromise in geographical distribution
will drastically constrain the attainable composite SLA and overall reliability.
If suitable Azure regions do not all offer requisite capabilities, be prepared to compromise on the
consistency of regional deployment stamps to prioritize geographical distribution and maximize
reliability.
For example, when constrained to a geography with two regions where only one region
supports Availability Zones (3 + 1 datacenter model), create a secondary deployment pattern
using fault domain isolation to allow for both regions to be deployed in an active configuration,
ensuring the primary region houses multiple deployment stamps.
Align current service availability with product roadmaps when selecting deployment regions; not all
services may be available in every region on day 1.
Use Availability Zones where possible to maximize availability within a single Azure region.
The Mission-Critical reference implementations consist of both global and regional resources, with regional
resources deployed across multiple regions to provide geo-availability, in the case of regional outages and to
bring services closer to end-users. These regional deployments also serve as scale-unit "stamps" to provide
additional capacity and availability when required.
TIPTIP
Design considerationsDesign considerations
Design recommendationsDesign recommendations
Loose coupling provides the cornerstone of a microservice architecture by allowing services to be designed in a
way that each service has little or no knowledge of surrounding services. The
loose
aspect allows a service to
operate independently. The coupling aspect allows for inter-service communication through well-defined
interfaces. In the context of a mission critical application it further facilitates high-availability by preventing
downstream failures from cascading to frontends or different deployment stamps.
When implementing loose coupling, event-driven architectureevent-driven architecture and asynchronous message processingasynchronous message processing
are key design patterns for interactions which don't require an immediate response. Events indicate a change in
state within a microservice and are generated by event
producers
. Producers don't know anything about how
events should be processed or handled. That is the responsibility of
consumers
. When using asynchronous
event-driven communication, a producer publishes an event when something happens within its domain, which
another component needs to be aware of. An example would be a price change in a product catalog, which
consumers will subscribe to receive so they can process the event asynchronously.
Refer to the event-driven architecture and asynchronous processing patterns for further details.
In reality, applications can combine loose and tight-coupling, depending on business objectives.
Loosely coupled services aren't constrained to use the same compute platform, programming language,
runtime, or operating system.
Services can scale independently, optimizing the use of infrastructure and platform resources.
Failures can be handled separately and don't affect client transactions.
Transactional integrity can be harder to maintain because data creation and persistence happens within
separate services.
End-to-end tracing requires more complex orchestration.
Key functionality should be deployed and managed as independent loosely coupled microservices with
event-driven interaction through well-defined interfaces (synchronous and asynchronous).
The definition of microservice boundaries should consider and align with critical user-flows.
Use event-driven asynchronous communication where possible to support sustainable scale and optimal
performance.
Example
-
Event
-
driven approachExample
-
Event
-
driven approach
Application-level resiliency patterns and error handling
Design considerationsDesign considerations
Use patterns like the Outbox and Transactional Session to guarantee consistency so that every message is
processed correctly.
The Mission-Critical Online reference implementation uses microservices to process a single business
transaction. It applies write operations asynchronously with a message broker and worker, while read operations
are synchronous with the result directly returned to the caller.
A mission-critical application must be developed with resiliency in-mind. It is therefore critical that application
code be designed and developed to be resilient, ensuring that the application can respond to failure, which is
ultimately an unavoidable characteristic of highly distributed multi-tenant cloud environments like Azure.
More specifically, all application components should be designed from the ground-up to apply key resiliency
patterns for self-healing, such as retries with back-off and circuit breaker. Such patterns go great lengths to
transparently handle transient faults such as network packet loss, or the temporary loss of a downstream
dependency. So, the application code should address as many failure scenarios as possible in order to maximize
service availability and reliability.
When issues are not transient in-nature and cannot be fully mitigated within application logic, it becomes the
role of the health model and operational wrappers to take corrective action. However, for this to happen
effectively, it is essential that the application code incorporates proper instrumentation and logging to inform
the health model and facilitate subsequent troubleshooting or root cause analysis when required. More
specifically, application code should be implemented to facilitate distributed tracing, by providing the caller with
a comprehensive error message that includes a correlation ID when a failure occurs.
Tools like Azure Application Insights can help significantly to query, correlate, and visualize application traces.
Vendor-provided SDKs, such as the Azure service SDKs, will typically provide built-in resiliency
capabilities like retry mechanisms.
It's not uncommon for application responses to transient issues to cause cascading failures.
For example, retry without appropriate back-off will exacerbate when a service is being throttled will
likely exacerbate the issue.
PAT T E RNPAT T E RN SUM M A RYSUM M A RY
Queue-Based Load Leveling Introduces a buffer between consumers and requested
resources to ensure consistent load levels. As consumer
requests are enqueued, a worker process dequeues the
requests and processes them against the requested resource
at a pace set by the worker and the requested resource's
ability to process the requests. If consumers expect replies to
their requests, a separate response mechanism will also need
to be implemented.
Circuit Breaker Provides stability by either waiting for recovery, or quickly
rejecting requests rather than blocking while waiting for an
unavailable remote service or resource. Also, handles faults
that might take a variable amount of time to recover from
when connecting to a remote service or resource.
Bulkhead Strives to partition service instances into groups based on
load and availability requirements, isolating failures to
sustain service functionality.
Saga Manage data consistency across microservices with
independent datastores by ensuring services update each
other through defined event or message channels. Each
service performs local transactions to update its own state
and publishes an event to trigger the next local transaction
in the saga. If a service update fails, the saga executes
compensating transactions to counteract preceding service
update steps. Individual service update steps can themselves
implement resiliency patterns, such as retry.
Health Endpoint Monitoring Implement functional checks in an application that external
tools can access through exposed endpoints at regular
intervals.
Retry Handles transient failures elegantly and transparently.
Throttling Controls the consumption of resources used by application
components, protecting them from becoming over
encumbered. When a resource reaches a load threshold, it
should safeguard its availability by deferring lower-
importance operations and degrading non-essential
functionality so that essential functionality can continue until
sufficient resources are available to return to normal
operation.
Design recommendationsDesign recommendations
Retry delays can be linearly spaced, or increase exponentially to 'backoff' via growing delays.
Here are some other resiliency-related patterns:
Design and develop application code to anticipate and handle failures.
Use vendor provided SDKs, such as the Azure SDKs, to connect to dependent services.
Use the resiliency capabilities provided by utilized SDKs instead of reimplementing resiliency
functionality.
Ensure a suitable back-off strategy is applied when retrying failed dependency calls to avoid a self-
inflicted DDoS scenario.
Next step
Define common engineering criteriacommon engineering criteria for all application microservice teams to drive consistency and
acceleration regarding the use application-level resiliency patterns.
Developers should familiarize themselves with common software engineering patterns for resilient
applications.
Implement resiliency patterns using proven standardized packages, such as NServiceBus or Polly for C#
or Sentinel for Java.
Implement Health Endpoint Monitoring by exposing functional checks within application code through
health endpoints which external monitoring solutions can poll to retrieve application component health
statuses. Responses should be interpreted alongside key operational metrics to inform application health
and trigger operational responses, such as raising an alert or performing a compensating roll-back
deployment.
Implement Queue-Based Load Leveling by applying a prioritized ordering so that the most important
activities are performed first.
Implement the Retry pattern to enable application code to handle transient failures elegantly and
transparently.
Cancel if the fault is unlikely to be transient and is unlikely to succeed if the operation is reattempted.
Retry if the fault is unusual or rare and the operation is likely to succeed if attempted again
immediately.
Retry after a delay if the fault is caused by a condition that may need a short time to recover, such as
network connectivity or high load failures.
Apply a suitable 'backoff' strategy with growing retry delays.
Use correlation IDs for all trace events and log messages to tie them to a given request.
Return correlation IDs to the caller for all calls not just failed requests.
Use structured logging for all log messages.
Select a unified operational data sink for application traces, metrics, and logs to enable operators to
seamlessly debug issues.
Ensure operational data is used in conjunction with business requirements to inform an application
health model.
Review the considerations for the application platform.
Application platform
Application platform considerations for mission
-
critical workloads on Azure
12/16/2022 26 minutes to read • Edit Online
IMPORTANTIMPORTANT
Programming language selection
Design considerationsDesign considerations
Azure provides many compute services for hosting highly available applications. The services differ in capability
and complexity. We recommend that you fully understand the capability matrix with focus on:
Non-functional requirements surrounding reliability, availability, performance, and security.
Decision factors such as scalability, cost, operability, and complexity.
The selection of an appropriate application hosting platform is a critical decision that has impact on all other
design areas. For example, the scale-limits of a particular service will have a key bearing on suitability and the
overall application design in terms of scale-unit definitions. A mission-critical application can use more than one
compute service in parallel to support multiple composite workloads and microservices with distinct platform
requirements.
This design area explores the important decision factors and provides recommendations related to the selection,
design, and configuration of an appropriate application hosting platform.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The code assets illustrate the
considerations and implementations for a selected compute service.
Selecting the right programming languages and frameworks is a critical design decision. Typically this decision
is driven by the availability of development skills or by the use of standardized technologies within an
organization. However, given the reliable aspirations, it's essential to also evaluate the performance and
resilience aspects of different languages/frameworks as well as the capability differences within required Azure
SDKs.
There are sometimes significant differences in the capabilities offered by Azure service SDKs in different
languages, and this may influence the selection of an Azure service or programming language.
For instance, if Azure Cosmos DB is a significant dependency, 'Go' may not be an appropriate
development language since there's no first-party SDK.
New features are typically added to the .NET and Java libraries first, and there can be a delay in feature
availability for other supported languages.
The application can use multiple programming languages or frameworks in parallel to support multiple
composite workloads with distinct requirements.
However, significant technology sprawl should be avoided since it introduces management complexity
Design recommendationsDesign recommendations
Containerization
Design considerationsDesign considerations
and operational challenges.
Evaluate all relevant Azure SDKs to confirm requisite capabilities are available for selected programming
languages, ensuring alignment with non-functional requirements.
Optimize the selection of programming languages and frameworks at the microservice level; embrace
multiple technologies where appropriate.
Avoid extensive technology sprawl to prevent unnecessary operational complexity.
Prioritize the .NET SDK to optimize reliability and performance since .NET Azure SDKs typically provide
additional capabilities and receive new features first.
Containerization allows developers to create and deploy applications faster and more reliably by bundling
application code together with related configuration files, libraries, and dependencies required for it to run. This
single software package
container
runs on a shared kernel abstracted from the host operating system, and as a
result is highly portable, capable of running consistently across different infrastructure platforms or cloud
providers.
Containerization has become a major trend in software development since it provides measurable
benefits for developers and operations teams as well as optimizing infrastructure utilization. More
specifically, the benefits of containerizing application components include:
Improved infrastructure utilizationImproved infrastructure utilization: Containers don't include operating system images so require
less system resources. Multiple containers can therefore be hosted on the same virtualized
infrastructure, and this helps to optimize resource utilization by consolidating on fewer resources with
higher container density.
Por tabilityPor tability: Including all software dependencies within the container ensures that it will work across
different operating systems regardless of runtimes or library versions. Containerized applications are
therefore easier to move between application platforms due to the standardized container format.
Faster scaling operationsFaster scaling operations : Containers are lightweight and don't suffer from the slow start-up and
shutdown times afflicting virtual machines, and since container images are pre-built, the start-up
activity can be minimized to focus only on bootstrapping the application.
Simplified managementSimplified management: The consistent portability and ephemeral nature of containers provides a
simplified infrastructure management experience compared to traditional virtualized hosting.
Agile developmentAgile development: Containers support accelerated development, test, and production cycles
through consistent operation and less overhead.
The drawbacks of containerizing application components include:
Complex monitoringComplex monitoring: Monitoring services can find it harder to access applications running inside a
container. Third-party software is typically required to collect and store container state indicators, such
as CPU usage or RAM usage.
SecuritySecurity: The hosting platform OS kernel is shared across multiple containers, creating a single point
of attack. However, the risk of host VM access is limited since containers are isolated from the
underlying operating system.
Containerization has proven to be an excellent option for packaging applications across different
development languages, providing an abstraction layer for application code and its dependencies to
achieve separation from the underlying hosting platform.
Containerization enables workloads to run on Azure without application code needing to be re-written.
Design recommendationsDesign recommendations
Container Orchestration and Kubernetes
Design considerationsDesign considerations
Provides a good opportunity to modernize legacy applications without significant code change, and
can therefore be suitable for 'lift and shift' scenarios depending on the considered application
frameworks and external dependencies.
While it's possible to store data within a running container's filesystem, the data will not persist when the
container is recreated, so instead persistence is typically achieved by 'mounting' external storage.
Containerize all application components, using container images as the primary model for application
deployment packages.
Prioritize Linux-based container runtimes when possible.
Avoid persisting state/data within a container since containers should be immutable and replaceable with
short lifecycles.
Ensure that all relevant logs and metrics are gathered from the container, container host, and underlying
cluster. Gathered logs and metrics should be sent to a unified data sink for further processing.
There are several Azure application platforms capable of effectively hosting containers:
Azure Kubernetes Service (AKS)
Azure Container Instances (ACI)
Azure App Service
Azure Service Fabric
Azure Red Hat OpenShift
There are advantages and disadvantages associated with each of these Azure container platforms which should
be analyzed in the context of business requirements to inform an optimal technical choice; each platform serves
an optimal choice for certain scenarios. However, given the principles underpinning the design methodology
strive to optimize reliability, scalability, and performance, it's strongly recommended to prioritize the use of
Azure Kubernetes Service.
Azure Kubernetes Service (AKS) is Microsoft Azure's native managed Kubernetes service which allows for rapid
Kubernetes cluster provisioning without complex cluster administration activities, and enhances standard
Kubernetes with a rich feature set that includes advanced networking and identity capabilities.
For web and API based workload scenarios Azure App Services offers a feasible alternative to AKS, providing a
low-friction container platform without the complexity of Kubernetes.
The considerations and recommendations within this section will therefore focus on optimal AKS usage as well
as App Services for low-scale scenarios.
Azure Kubernetes Ser viceAzure Kubernetes Ser vice
There are many different container orchestrators, but Kubernetes has become the clear market leader and
is best supported across the open source community and major cloud providers.
Kubernetes expertise is readily available within the employment market.
Kubernetes has a steep learning curve, so if development teams are new, it will require non-trivial
engineering investment to set up and maintain a Kubernetes environment in a secure and reliable
way.
Kubernetes as well as managed Kubernetes offerings like AKS are widely available and can address
concerns reg. vendor lock-in.
Default 'vanilla' Kubernetes requires significant configuration to ensure a suitable security posture for
business-critical application scenarios. AKS addresses various security risks out of the box, such as
support for private clusters, auditing and logging into Log Analytics, and hardened node images.
AKS provides a control plane that is managed by Microsoft. By default the control plane of AKS is
provided free of charge, but without any guaranteed SLA. Customers only manage and pay for the
worker nodes which form the cluster.
The optional AKS Uptime SLA provides availability guarantees for the Kubernetes control plane. 99.95%
availability of the Kubernetes API server endpoint for AKS Clusters that use Azure Availability Zones and
99.9% availability for AKS Clusters that don't use Azure Availability Zones.
Some foundational configuration decisions have to be made upfront and can't be changed without re-
deploying an AKS cluster.
Selection between public and private AKS clusters.
Enabling Azure Network Policy.
Azure AD integration and the use of Managed Identities for AKS instead of Service Principals.
AKS supports Kubernetes versions aligned with the release cycle of the Kubernetes project. Clusters and
node pools need to be upgraded on a regular basis.
AKS supports different ways to update nodes and/or clusters in a manual or automated way. The AKS
team releases new images on a weekly basis for Windows and Linux nodes.
AKS supports different auto-upgrade channels to automatically upgrade AKS clusters to newer versions
of Kubernetes and/or newer node images once available. Planned Maintenance can be used to define
maintenance windows for these operations.
AKS supports different network plugins. The Azure CNI plugin is required to enable certain capabilities
within AKS, such as Windows-based node pools or Kubernetes Network Policies. It also supports bring-
your-own-cni for specific use cases.
AKS differentiates between system node pools and user node pools to separate system and workload
services. User node pools can be scaled down to 0 nodes if needed.
The AKS Stop/Start cluster feature allows an AKS cluster in dev/test scenarios to be paused while
maintaining cluster configuration, saving time and cost compared to re-provisioning.
Azure Monitor for containers (Container Insights) provides a seamless onboarding experience, enables
various monitoring capabilities out of the box as well as more advanced scenarios via its built-in
Prometheus scraping support.
AKS offers integration with Azure AD to enable the use of Managed Identities for AKS as well as for node
pool and individual pods, Role Based Access Control (RBAC) using Azure AD credentials as well as
authentication with Azure Container Registry (ACR).
Azure Policy can help to apply at-scale enforcements and safeguards to AKS clusters in a consistent
centralized manner.
Azure Policy can control what functions pods are granted, and if running contradicts policy. This access
is defined through built-in policies provided by the Azure Policy Add-on for AKS.
AKS has certain scale limits, such as the number of nodes and number of node pools per cluster, as well
as the number of clusters per subscription.
Azure App Ser viceAzure App Ser vice
SNAT port exhaustion is a common failure scenario with Azure App Services, which can be predicted
through load testing while monitoring ports using Azure Diagnostics. SNAT ports are used when making
outbound connections to public IP addresses.
Design recommendationsDesign recommendations
TCP port exhaustion is a further common failure scenario which occurs when the sum of outbound
connections from a given worker exceeds the capacity. The number of available TCP ports depend on the
size of the worker.
Azure App Service has a default, soft limit of instances per App Service Plan. This limit can be increased
by opening a support ticket, if the App Service routinely uses 15 or more instances.
Per-app scaling can be enabled at the App Service Plan level to allow an application to scale
independently from the App Service plan that hosts it. For example, an App Service Plan can be scaled to
10 instances, but an app can be set to use only 5.
Apps are allocated to available nodes using a best effort approach for an even distribution. While an
even distribution isn't guaranteed, the platform will make sure that two instances of the same app will
not be hosted on the same instance.
There are a number of events that can lead App Service workers to restart, such as content deployment,
App Settings changes, and Virtual Network integration configuration changes.
App Service plan autoscale will scale-out run if any rule within the profile is met, but will only scale-in if
all rules within the profile are met.
Diagnostic logging provides the ability to ingest application and platform level logs into either Log
Analytics, Azure Storage, or a third party tool via Event Hubs.
Application performance monitoring with Application Insights provides deep insights into application
performance.
For Linux Plans a code-based enablement (SDK) is required.
For Windows Plans a 'codeless deployment' approach is possible to quickly get insights without
changing any code.
For cost management, see plan and manage costs for Azure App Service.
Azure Kubernetes Ser viceAzure Kubernetes Ser vice
Use Azure Kubernetes Service (AKS) as the primary application hosting platform where requirements allow.
Availability
Deploy AKS clusters across different Azure regions as a scale-unit to maximize reliability and
availability.
Use Availability Zones to maximize resilience within an Azure region by distributing AKS control plane
and agent nodes across physically separate datacenters.
Where co-locality latency requirements exist, either AKS deployment within a single zone or proximity
placement groups should be used to minimize inter-node latency.
Use the AKS Uptime SLA for production clusters to maximize Kubernetes API endpoint availability
guarantees.
Ensure AKS subscription scale limits are appropriately considered when designing the AKS
deployment model to ensure requisite scalability.
Scalability
Enable cluster autoscaler to automatically adjust the number of agent nodes in response to resource
constraints.
Utilize the Horizontal pod autoscaler to adjust the number of pods in a deployment depending on CPU
utilization or other selected metrics.
For high scale and burst scenarios, consider the use of Virtual Nodes for extensive and rapid scale.
Define pod resource requests and limits in application deployment manifests.
Isolation
Ensure the System node pool is isolated from application workloads.
Use dedicated node pools for infrastructure components and tools that require high resource
utilization, to avoid noisy neighbor scenarios.
Separate distinct application workloads to dedicated node pools based on workload requirements,
considering requirements for specialized infrastructure resources such as GPU, high memory VMs.
Avoid deploying large numbers of node pools to reduce additional management overhead.
Use taints and tolerations to provide dedicated nodes and limit resource intensive applications.
Evaluate application affinity and anti-affinity requirements and configure the appropriate colocation of
containers on nodes.
Networking
Ensure proper selection of network plugin based on network requirements and cluster sizing.
Prioritize the use of Azure CNI.
Use Azure or Calico Network Policies to control traffic within the cluster. (requires Azure CNI)
Security
Apply configuration guidance provided within the AKS security baseline.
Harden the AKS cluster to remove critical security risks associated with Kubernetes deployments.
Evaluate Azure AD workload identity to assign managed identities at pod-level.
Use Secrets Store CSI Driver with Azure Key Vault to protect secrets, certificates, and connection
strings.
Use Managed Identities to avoid having to manage and rotate service principal credentials.
Utilize Azure Active Directory integration to take advantage of centralized account
management and passwords, application access management, and identity protection.
Use Kubernetes RBAC with Azure Active Directory for least privilege, and minimize granting
administrator privileges to protect configuration and secrets access.
Limit access to the Kubernetes cluster configuration file with Azure role-based access control.
Limit access to actions that containers can perform, provide the least number of permissions
and avoid the use of root / privileged escalation.
Establish a consistent reliability and security baseline for AKS cluster and pod configurations
using Azure Policy.
Use the Azure Policy Add-on for AKS to control pod functions, such as root privileges,
and disallow pods which don't conform to policy.
Policy assignments should be enforced at a subscription scope or higher to drive
consistency across development teams.
Stay current
Subscribe to the public AKS Roadmap and Release Notes on GitHub to stay up-to-date on upcoming
changes, improvements, and most importantly Kubernetes version releases or the deprecation of old
releases.
Consider and apply the guidance provided within the AKS checklist to ensure alignment with Well-
Architected best practice guidance.
Regularly upgrade to a supported version of Kubernetes.
Establish a governance process to check and upgrade as needed to not fall out of support.
Leverage the AKS Cluster auto-upgrade with Planned Maintenance.
Regularly process node image updates to remain current with new AKS images.
Observability
Utilize a solution like Azure Monitor and Application Insights to centrally collect metrics, logs, and
NOTENOTE
diagnostics from AKS resources for troubleshooting purposes.
Enable and review Kubernetes master node logs.
Configure the scraping of Prometheus metrics with Azure Monitor for containers.
Container Images
Store container images in Azure Container Registry.
Enable geo-replication to replicate container images across all leveraged AKS regions.
Enable Azure Defender for container registries to provide vulnerability scanning for container images.
Use Azure AD to access Azure Container Registry.
When deploying into an Azure landing zone, be aware that the required Azure Policy to ensure consistent
reliability and security should be provided by the landing zone implementation. The mission-critical reference
implementations provide a suite of baseline policies to drive recommended reliability and security configurations.
Azure App Ser viceAzure App Ser vice
For lower scale workload scenarios Azure App Services can provide a feasible alternative to AKS, without
the complexities associated with administering Kubernetes.
Consider and plan for future scalability requirements and application growth so that a strategic
technology decision can be applied from the start, avoiding future technical migration debt as the
solution grows.
If the lack of requisite Kubernetes expertise presents an unacceptable delivery risk, consider Azure
App Service as an alternative container platform.
Leverage the container based deployment model for App Service Plans.
Use Premium V3 plans with 2 or more worker instances for high availability.
Use Linux App Service Plans to optimize performance and costs.
Deploy App Service Plans in an Availability Zone configuration to ensure worker nodes are distributed
across zones within a region.
Deploy Azure App Service Plans across multiple regions as a scale unit, using multiple scale-units
deployed within a single region to navigate the default limit of 30 instances per App Service Plan.
Consider opening a support ticket to increase the maximum number of workers to twice the instance
count required to serve normal peak load.
Evaluate the use of TCP and SNAT ports to avoid outbound connection errors.
Predictively detect SNAT port exhaustion through load testing while monitoring ports using Azure
Monitor, and if SNAT errors occur, it's necessary to either scale across more/larger workers, or
implement coding practices to help preserve and re-use SNAT ports, such as connection pooling
or the lazy loading of resources.
It's recommended not to exceed 100 simultaneous outbound connections to a public IP Address
per worker, and to avoid communicating with downstream services via public IP addresses when a
Private Endpoint or Service Endpoint could be used.
Avoid unnecessary worker restarts.
Make changes within a deployment slot other than the slot currently configured to accept production
traffic. After workers are recycled and warmed up, a 'swap' can be performed without unnecessary
down time.
Serverless compute
Design considerationsDesign considerations
Enable AutoHeal to automatically recycle unhealthy workers.
Enable Health Check to identify non-responsive workers.
While any health check is better than none at all, the logic behind endpoint tests should assess all
critical downstream dependencies to ensure overall health.
Enable AutoScale to ensure adequate resources are available to service requests.
Use a scale-out and scale-in rule combination to ensure auto-scale can take action to both scale-out
and scale-in.
Understand the behavior of multiple scaling rules in a single profile.
Enable Diagnostic Logging to provide insight into application and platform behavior.
Enable Application Insights Alerts to be made aware of fault conditions.
Review Azure App Service diagnostics to ensure common problems are addressed.
It's a good practice to regularly review service-related diagnostics and recommendations, taking
action as appropriate.
Evaluate per-app scaling for high-density hosting on Azure App Service Plans.
Serverless computing provides compute resources on-demand and eliminates the need to manage
infrastructure all together; the cloud provider automatically provisions, scales, and manages resources required
to run deployed application code.
Microsoft Azure provides several serverless compute platforms:
Azure Functions: Application logic is implemented as distinct blocks of code ("functions") which run in
response to events, such as an HTTP request or queue message, with each function scaling as necessary to
meet demand.
Azure Logic Apps: Platform for creating and running automated workflows that integrate various apps, data
sources, services and systems. Similar to Azure Functions, there are built-in triggers for event-driven
processing but instead of deploying application code, Logic Apps can be composed using a graphical user
interface which supports code blocks like conditionals and loops.
Azure API Management: Publish, secure, transform, maintain, and monitor APIs using the Consumption tier.
Power Apps & Power Automate: Provides a 'low-code/no-code' development experience, with simple
workflow logic and integrations configurable through connections in a user interface. Developed Power Apps
can subsequently be deployed to a Microsoft 365 tenant and consumed from either a web browser or the
Power Apps mobile client.
In the context of a reliable application platform, serverless technologies provide a near-zero friction
development and operational experience, which can be highly valuable for simple business process scenarios.
However, this relative simplicity comes at the cost of flexibility in terms of scalability, reliability, and performance,
which is likely unacceptable for most business-critical application scenarios.
The design methodology positions serverless technologies as an alternative platform for simple business
process scenarios which don't share the same stringent business requirements as critical system flows. The
design considerations and recommendations within this section focus on optimal Azure Function and Azure
Logic Apps usage as alternative platforms for non-critical workflow scenarios.
Azure FunctionsAzure Functions
In most cases Azure Functions don't require additional code to call external services or to enable external
Design recommendationsDesign recommendations
events trigger function execution since these can be achieved with Azure Function Bindings.
Azure Functions supports multiple triggers, such as the HTTP trigger, and bindings for Azure Services,
such as Azure Cosmos DB, Azure Service Bus and Azure Blob Storage.
There are three hosting plans available for Azure Functions:
Consumption
is the fully serverless pay-per-use option, with instances dynamically added and
removed based on the number of incoming events; underlying compute resources are charged only
when running.
Premium
uses a Premium SKU App Service plan to host functions and allows the configuration of
compute instance size.
Dedicated
is the least serverless option as it's tied to a provisioned App Service plan or App Service
Environment. Autoscale can be enabled, but scale operations are slower than with the Consumption
and Premium plans.
Fully serverless hosting options, which help optimize costs by de-provisioning allocated resources when
workloads are not running, may incur "cold start" delays, especially for applications comprised of many
files, such as Node.js or PHP applications.
See Azure Functions hosting options for more details about the service limits.
Azure Logic AppsAzure Logic Apps
There are three deployment modes available for Azure Logic Apps:
Consumption
is the fully serverless pay-per-use model, with Azure managing the infrastructure which
is shared across multiple tenants.
Consumption (ISE)
uses the dedicated Integration Service Environment (ISE) to privately host logic
apps. A single logic app can have only one workflow.
Standard
uses the containerized single-tenant Azure Logic Apps runtime based on Azure Functions.
In
Standard
each logic app can have multiple stateful and stateless workflows.For
Consumption
and
Consumption (ISE)
a single logic app can have only one workflow.
Similar to Azure Functions, there are built-in triggers for event-driven processing, however, instead of
deploying application code Logic Apps can be composed using a graphical user interface which supports
blocks like conditionals, loops etc.
With the standard deployment model, default limits can be modified, however, some limits have upper
maximums.
For consumption plans Azure manages the default configuration limits, but some values can be
changed through the creation of a support ticket.
Azure FunctionsAzure Functions
Consider Azure Functions for simple business process scenarios which don't have the same stringent
business requirements as business-critical system flows.
Low-critical scenarios can also be hosted as separate containers within AKS to drive consistency,
provided affinity and anti-affinity requirements are fully considered when collocating containers on
nodes.
Azure Functions should perform distinct operations that run as fast as possible.
This makes them very flexible and highly scalable, ensuring they will work well in event-driven
workloads with short-lived processes.
Use the Premium Function Linux hosting plan to maximize reliability and performance while maintaining
Asynchronous messaging
Design considerationsDesign considerations
the serverless promise.
Take a scale-unit approach to navigate the resource limit of 20 Linux nodes.
When using the HTTP trigger to expose an external endpoint, protect the HTTP endpoint from common
external attack vectors using a Web Application Firewall (WAF).
For internal workloads, consider the use of Service Endpoints or Private Endpoints to restrict access to
private Virtual Networks.
If required, use Private Endpoints to mitigate data exfiltration risks, such as malicious admin scenarios.
Treat Azure Functions code just like any other code; subject it to code scanning tools that integrate it with
CI/CD pipelines.
Azure Logic AppsAzure Logic Apps
Use the standard deployment mode to ensure a single tenant deployment and mitigate 'noisy neighbor'
scenarios.
The recommendation for a loosely coupled microservice architecture relies heavily on asynchronous messaging
to provide requisite integration between application components through a well-defined message bus. Azure
provides several native messaging services which can be used to facilitate asynchronous message interfaces,
including Azure Event HubAzure Event Hub, Azure Ser vice BusAzure Ser vice Bus, Azure Storage QueuesAzure Storage Queues and Azure Event GridAzure Event Grid.
This section will therefore explore key decision factors when selecting an appropriate Azure message service,
and how to optimally configure each service in the context of a mission-critical workload scenario to maximize
reliability.
More than one type of messaging service can be used by an application for different workload scenarios.
There are many factors to consider when selecting an optimal messaging service for a specific scenario.
Asynchronous Messaging on Azure
Azure Storage Queues vs Azure Service Bus Queues
Comparison between Event Grid, Event Hubs and Service Bus
Azure Messaging Services -How to Choose the Right Messaging Technology in Azure
Azure Event HubAzure Event Hub
Azure Event Hubs are designed as a event streamingevent streaming service to scale to multi million messages per
second.
Event Hubs supports Availability Zones (AZs) for zonal redundancy within supported regions in all
pricing tiers.
Azure Event Hubs Namespaces support configurable failover replication between regions, however, only
configuration metadata is replicated, but not messages themselvesnot messages themselves. Configuration that is being
replicated includes: Event Hubs (inside a Namespace), Consumer Groups, Namespace Authorization
Rules, and Event Hubs Authorization Rules.
Azure Event Hubs can be configured to capture raw data into Azure Blob Storage or Azure Data Lake
Storage
Data can be processed from Event Hubs by a client using the Event Processor SDK, or by Azure Stream
Analytics.
Design recommendationsDesign recommendations
Event Hubs can be deployed in Basic, Standard or Dedicated tiers which provide different performance
characteristics and SLA. See SLA for Event Hubs and Azure Event Hubs quotas and limits for more.
Event Hubs provide an Apache Kafka-compatible messaging interface.
See Azure Event Hubs quotas and limits for more details.
Azure Ser vice BusAzure Ser vice Bus
Azure Service Bus provides reliable asynchronous message deliver yreliable asynchronous message deliver y that requires polling.
Service Bus supports Availability Zones (AZs) for zonal redundancy within supported regions in Premium
tier.
Service Bus supports messages sizes of 256 KB in Basic and Standard tier and 1 MB in Premium tier.
Service Bus offers a SLA of 99.9% on send and receive operations.
Azure Storage QueuesAzure Storage Queues
Azure Storage Queues provide a simple messaging solution, which can be communicated with using a
REST API.
Storage queues don't guarantee message ordering.
Throttling will occur if this maximum throughput limit is reached.
Through the geo-replication feature of Azure Storage Accounts, Storage Queues can be configured to
asynchronously replicate to another region.
Storage Queues provide the same SLA as their underlying Storage Accounts. A 99.99% SLA on read
requests for RA-GRS and 99.9% for write requests.
See Scalability and performance targets for Queue Storage for more details about service limitations.
Azure Event GridAzure Event Grid
Azure Event Grid is designed as a event distributionevent distribution service for reactive programming.
Event Grid integrates with many Azure services as event sources. For example, Event Grid can be
configured to react to status changes on Azure resources.
Event Grid provides a SLA of 99.99% for message publishing.
See Azure Event Grid quotas and limits for more details.
Prioritize the use of Event Hubs for scenarios which require high throughput and which can work with
message ordering on a partition-basis.
Use Service Bus for scenarios requiring a higher QoS and message guarantee by implementing two
phase commits.
Use Service Bus Premium in a zonal redundant configuration to provide high-availability within a
region.
Consider Storage Queues when message geo-replication is required provided the message size is less
than 64 KB.
Use an AZ-redundant tier for the underlying Storage Account (ZRS or GZRS).
Use Event Grid for scenarios where services need to react to changes in another service/component.
Constrained migrations using IaaS
Design considerationsDesign considerations
Design recommendationsDesign recommendations
IMPORTANTIMPORTANT
Many applications with existing on-premises deployments use virtualization technologies and redundant
hardware to provide mission-critical levels of reliability. Modernization is often hindered by business constraints
which prevent full alignment with the cloud-native baseline (north-star) architecture pattern recommended for
mission-critical workloads. So, many applications adopt a phased approach, with initial cloud deployments using
virtualization and Azure Virtual Machines as the primary application hosting model.
This section will therefore focus on the optimal usage of Azure Virtual Machines and associated services in
order to maximize the reliability of the application platform, highlighting key aspects of the mission-critical
design methodology which transpose cloud-native and IaaS migration scenarios.
The use of IaaS Virtual Machines can be required for certain scenarios:
Available PaaS services do not provide the required performance or level of control.
The workload requires operating system access, specific drivers, or network and system
configurations.
The workload does not support running in containers.
Lack of support for 3rd-party workloads.
The use of IaaS Virtual Machines significantly increases operational costs compared to PaaS services,
through management responsibility of the virtual machine and the operating system.
Managing virtual machines necessitates the frequent roll-out of software packages and updates.
Azure provides certain capabilities to increase the availability of Virtual Machines, options are:
Availability Sets can be used to protect against network, disk and power failures by distributing
virtual machines across up to fault domains and update domains.
Availability zones can be used to achieve even higher levels of reliability by distributing VMs
across physically separated data center within a region.
Virtual Machine Scale Sets provide functionality to automatically scale the number of virtual
machines along with capabilities to monitor instance health and automatically repair unhealthy
instances.
Prioritize the use of PaaS services and Containers where possible to reduce operational complexity and cost. Only use
IaaS Virtual Machines when required.
Right-size VM sku sizes to ensure effective resource utilization.
Deploy three or more Virtual Machines across Availability zones to achieve data center level fault
tolerance.
If you are deploying commercial off-the-shelf software, consult with the software vendor and test
adequately before deploying into production.
For workloads which cannot be deployed across Availability Zones, use Availability Sets with three or
more VMs.
Availability Sets should only be considered if Availability Zones do not comply with workload
requirements, such as for 'chatty' workloads with low latency requirements.
Prioritize the use of Virtual Machine Scale Sets for scalability and zone-redundancy. This is particularly
Next step
important for workloads with varying load (e.g. number of active users or requests per second).
Do not access individual virtual machines directly, use load balancers in front when possible.
To protect against regional outages, deploy application virtual machines across multiple Azure regions.
Please refer to the networking and connectivity design area for further details about how to optimally
route traffic between active deployment region.
For workloads which do not support multi-region active-active deployments, consider active-passive by
using hot/warm standby virtual machines for regional failover.
Prioritize the use of standard images from the Azure Marketplace over custom images that need to be
maintained.
Implement automated processes to deploy and rollout changes to virtual machines, avoiding any manual
intervention. See IaaS considerations in the Operational procedures design area for more.
Implement chaos experiments to inject application faults into virtual machine components while
observing the mitigation of faults. See Continuous validation and testing in the Deployment and testing
design area for more details.
Monitor virtual machines and ensure diagnostic logs and metrics are ingested into a unified data sink.
Follow and apply security practices for mission-critical application scenarios as described above, when
applicable, as well as the Security best practices for IaaS workloads in Azure.
Review the considerations for the data platform.
Data platform
Data platform considerations for mission
-
critical
workloads on Azure
12/16/2022 49 minutes to read • Edit Online
IMPORTANTIMPORTANT
The Four Vs of Big Data
Design ConsiderationsDesign Considerations
The selection of an effective application data platform is a further crucial decision area, which has far-reaching
implications across other design areas. Azure ultimately offers a multitude of relational, non-relational, and
analytical data platforms, which differ greatly in capability. It's therefore essential that key non-functional
requirements be fully considered alongside other decision factors such as consistency, operability, cost, and
complexity. For example, the ability to operate in a multi-region write configuration will have a critical bearing
on suitability for a globally available platform.
This design area expands on application design, providing key considerations and recommendations to inform
the selection of an optimal data platform.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The code assets illustrate the
considerations and implementations for a selected data platform.
The 'Four Vs of Big Data' provide a framework to better understand requisite characteristics for a highly
available data platform, and how data can be used to maximize business value. This section will therefore
explore how the Volume, Velocity, Variety, and Veracity characteristics can be applied at a conceptual level to
help design a data platform using appropriate data technologies.
VVolume: how much data is coming in to inform storage capacity and tiering requirements -that is the size of
the dataset.
VVelocity: the speed at which data is processed, either as batches or continuous streams -that is the rate of
flow.
VVariety: the organization and format of data, capturing structured, semi-structured, and unstructured formats
-that is data across multiple stores or types.
VVeracity: includes the provenance and curation of considered data sets for governance and data quality
assurance -that is accuracy of the data.
VolumeVolume
Existing (if any) and expected future data volumes based on forecasted data growth rates aligned with
business objectives and plans.
Data volume should encompass the data itself and indexes, logs, telemetry, and other applicable
datasets.
Large business-critical and mission-critical applications typically generate and store high volumes (GB
and TB) on a daily basis.
There can be significant cost implications associated with data expansion.
Data volume may fluctuate due to changing business circumstances or housekeeping procedures.
Data volume can have a profound impact on data platform query performance.
There can be a profound impact associated with reaching data platform volume limits.
Will it result in downtime? and if so, for how long?
What are the mitigation procedures? and will mitigation require application changes?
Will there be a risk of data loss?
Features such as Time to Live (TTL) can be used to manage data growth by automatically deleting records
after an elapsed time, using either record creation or modification.
For example, Azure Cosmos DB provides an in-built TTL capability.
VelocityVelocity
The speed with which data is emitted from various application components, and the throughput
requirements for how fast data needs to be committed and retrieved are critical to determining an
optimal data technology for key workload scenarios.
The nature of throughput requirements will differ by workload scenario, such as those that are read-
heavy or write-heavy.
What is the required throughput? And how is throughput expected to grow?
What are the data latency requirements at P50/P99 under reference load levels?
For example, analytical workloads will typically need to cater to a large read throughput.
Capabilities such as supporting a lock-free design, index tuning, and consistency policies are critical to
achieving high-throughput.
Optimizing configuration for high throughput incurs trade-offs, which should be fully understood.
Load-levelling persistence and messaging patterns, such as CQRS and Event Sourcing, can be used to
further optimize throughput.
Load levels will naturally fluctuate for many application scenarios, with natural peaks requiring a
sufficient degree of elasticity to handle variable demand while maintaining throughput and latency.
Agile scalability is key to effectively support variable throughput and load levels without
overprovisioning capacity levels.
Both read and write throughput must scale according to application requirements and load.
Both vertical and horizontal scale operations can be applied to respond to changing load levels.
The impact of throughput dips can vary based on workload scenario.
Will there be connectivity disruption?
Will individual operations return failure codes while the control plane continues to operate?
Will the data platform activate throttling, and if so for how long?
The fundamental application design recommendation to use an active-active geographical distribution
introduces challenges around data consistency.
There's a trade-off between consistency and performance with regard to full ACID transactional
semantics and traditional locking behavior.
Minimizing write latency will come at the cost of data consistency.
In a multi-region write configuration, changes will need to be synchronized and merged between all
replicas, with conflict resolution where required, and this may impact performance levels and scalability.
Read-only replicas (intra-region and inter-region) can be used to minimize roundtrip latency and
distributing traffic to boost performance, throughput, availability, and scalability.
A caching layer can be used to increase read throughput to improve user experience and end-to-end
client response times.
Cache expiration times and policies need to be considered to optimize data recentness.
VarietyVariety
The data model, data types, data relationships, and intended query model will strongly affect data
platform decisions.
Does the application require a relational data model or can it support a variable-schema or non-
relational data model?
How will the application query data? And will queries depend on database-layer concepts such as
relational joins? Or does the application provide such semantics?
The nature of datasets considered by the application can be varied, from unstructured content such as
images and videos, or more structured files such as CSV and Parquet.
Composite application workloads will typically have distinct datasets and associated requirements.
In addition to relational or non-relational data platforms, graph or key-value data platforms may also be
suitable for certain data workloads.
Some technologies cater to variable-schema data models, where data items are semantically similar
and/or stored and queried together but differ structurally.
In a microservice architecture, individual application services can be built with distinct scenario-optimized
datastores rather than depending on a single monolithic datastore.
Design patterns such as SAGA can be applied to manage consistency and dependencies between
different datastores.
The use of multiple data technologies will add a degree of management overhead to maintain
encompassed technologies.
Inter-database direct queries can impose co-location constraints.
The feature-sets for each Azure service differ across languages, SDKs, and APIs, which can greatly impact
the level of configuration tuning that can be applied.
Capabilities for optimized alignment with the data model and encompassed data types will strongly
influence data platform decisions.
Query layers such as stored procedures and object-relational mappers.
Language-neutral query capability, such as a secured REST API layer.
Business continuity capabilities, such as backup and restore.
Analytical datastores typically support polyglot storage for various types of data structures.
Analytical runtime environments, such as Apache Spark, may have integration restrictions to analyze
polyglot data structures.
In an enterprise context, the use of existing processes and tooling, and the continuity of skills, can have a
significant bearing on the data platform design and selection of data technologies.
VeracityVeracity
Several factors must be considered to validate the accuracy of data within an application, and the
management of these factors can have a significant bearing on the design of the data platform.
Data consistency.
Platform security features.
Design RecommendationsDesign Recommendations
Data governance.
Change management and schema evolution.
Dependencies between datasets.
In any distributed application with multiple data replicas there's a trade-off between consistency and
latency, as expressed in the CAP and PACELC theorems.
When readers and writers are distinctly distributed, an application must choose to return either the
fastest-available version of a data item, even if it out of date compared to a just-completed write
(update) of that data item in another replica, or the most up-to-date version of the data item, which
may incur additional latency to determine and obtain the latest state.
Consistency and availability can be configured at platform level or at individual data request level.
What is the user experience if data was to be served from a replica closest to the user which doesn't
reflect the most recent state of a different replica? i.e. can the application support possibly serving out-
of-date data?
In a multi-region write context, when the same data item is changed in two separate write-replicas before
either change can be replicated, a conflict is created which must be resolved.
Standardized conflict resolution policies, such as "Last Write Wins", or a custom strategy with custom
logic can be applied.
The implementation of security requirements may adversely impact throughput or performance.
Encryption at-rest can be implemented in the application layer using client-side encryption and/or the
data layer using server-side encryption if necessary.
Azure supports various encryption models, including server-side encryption that uses service-managed
keys, customer-managed keys in Key Vault, or customer-managed keys on customer-controlled hardware.
With client-side encryption, keys can be managed in Key Vault or another secure location.
MACsec (IEEE 802.1AE MAC) data-link encryption is used to secure all traffic moving between Azure
datacenters on the Microsoft backbone network.
Packets are encrypted and decrypted on the devices before being sent, preventing physical 'man-in-
the-middle' or snooping/wiretapping attacks.
Authentication and authorization to the data plane and control plane.
How will the data platform authenticate and authorize application access and operational access?
Observability through monitoring platform health and data access.
How will alerting be applied for conditions outside acceptable operational boundaries?
VolumeVolume
Ensure future data volumes associated with organic growth aren't expected to exceed data platform
capabilities.
Forecast data growth rates aligned to business plans and use established rates to inform ongoing
capacity requirements.
Compare aggregate and per-data record volumes against data platform limits.
If there's a risk of limits being reached in exceptional circumstances, ensure operational mitigations
are in place to prevent downtime and data loss.
Monitor data volume and validate it against a capacity model, considering scale limits and expected data
growth rates.
Ensure scale operations align with storage, performance, and consistency requirements.
When a new scale-unit's introduced, underlying data may need to be replicated which will take time
and likely introduce a performance penalty while replication occurs. So ensure these operations are
performed outside of critical business hours if possible.
Define application data tiers to classify datasets based on usage and criticality to facilitate the removal or
offloading of older data.
Consider classifying datasets into 'hot', 'warm', and 'cold' ('archive') tiers.
For example, the foundational reference implementations use Azure Cosmos DB to store 'hot'
data that is actively used by the application, while Azure Storage is used for 'cold' operations
data for analytical purposes.
Configure housekeeping procedures to optimize data growth and drive data efficiencies, such as query
performance, and managing data expansion.
Configure Time-To-Live (TTL) expiration for data that is no-longer required and has no long-term
analytical value.
Offload non-critical data to secondary cold storage, but maintain it for analytical value and to satisfy
audit requirements.
Collect data platform telemetry and usage statistics to enable DevOps teams to continually evaluate
housekeeping requirements and 'right-size' datastores.
Validate that old data can be safely tiered to secondary storage, or deleted outright, without an
adverse impact to the application.
In-line with a microservice application design, consider the use of multiple different data technologies in-
parallel, with optimized data solutions for specific workload scenarios and volume requirements.
Avoid creating a single monolithic datastore where data volume from expansion can be hard to
manage.
VelocityVelocity
The data platform must inherently be designed and configured to support high-throughput, with
workloads separated into different contexts to maximize performance using scenario optimized data
solutions.
Ensure read and write throughput for each data scenario can scale according to expected load
patterns, with sufficient tolerance for unexpected variance.
Separate different data workloads, such as transactional and analytical operations, into distinct
performance contexts.
Load-level through the use of asynchronous non-blocking messaging, for example using the CQRS or
Event Sourcing patterns.
There might be latency between write requests and when new data becomes available to read, which
may have an impact on the user experience.
This impact must be understood and acceptable in the context of key business requirements.
Ensure agile scalability to support variable throughput and load levels.
If load levels are highly volatile, consider overprovisioning capacity levels to ensure throughput and
performance is maintained.
Test and validate the impact to composite application workloads when throughput can't be
maintained.
Prioritize Azure-native data services with automated scale-operations to facilitate a swift response to
load-level volatility.
Configure autoscaling based on service-internal and application-set thresholds.
Scaling should initiate and complete in timeframes consistent with business requirements.
For scenarios where manual interaction is necessary, establish automated operational 'playbooks' that
can be triggered rather than conducting manual operational actions.
Consider whether automated triggers can be applied as part of subsequent engineering
investments.
Monitor application data read and write throughput against P50/P99 latency requirements and align to
an application capacity model.
Excess throughput should be gracefully handled by the data platform or application layer and captured
by the health model for operational representation.
Implement caching for 'hot' data scenarios to minimize response times.
Apply appropriate policies for cache expiration and house-keeping to avoid runaway data growth.
Expire cache items when the backing data changes.
If cache expiration is strictly Time-To-Live (TTL) based, the impact and customer experience of
serving outdated data needs to be understood.
VarietyVariety
In alignment with the principle of a cloud- and Azure-native design, it's highly recommended to prioritize
managed Azure services to reduce operational and management complexity, and taking advantage of
Microsoft's future platform investments.
In alignment with the application design principle of loosely coupled microservice architectures, allow
individual services to use distinct data stores and scenario-optimized data technologies.
Identify the types of data structure the application will handle for specific workload scenarios.
Avoid creating a dependency on a single monolithic datastore.
Consider the SAGA design pattern where dependencies between datastores exist.
Validate that required capabilities are available for selected data technologies.
Ensure support for required languages and SDK capabilities. Not every capability is available for every
language/SDK in the same fashion.
VeracityVeracity
Adopt a multi-region data platform design and distribute replicas across regions for maximum reliability,
availability, and performance by moving data closer to application endpoints.
Distribute data replicas across Availability Zones (AZs) within a region (or use zone-redundant service
tiers) to maximize intra-region availability.
Where consistency requirements allow for it, use a multi-region write data platform design to maximize
overall global availability and reliability.
Consider business requirements for conflict resolution when the same data item is changed in two
separate write replicas before either change can be replicated and thus creating a conflict.
Use standardized conflict resolution policies such as "Last one wins" where possible
If a custom strategy with custom logic is required, ensure CI/CD DevOps practices are
applied to manage custom logic.
Test and validate backup and restore capabilities, and failover operations through chaos testing within
continuous delivery processes.
Run performance benchmarks to ensure throughput and performance requirements aren't impacted by
the inclusion of required security capabilities, such as encryption.
Ensure continuous delivery processes consider load testing against known performance benchmarks.
NOTENOTE
Additional referencesAdditional references
Globally distributed multi-region write datastore
IMPORTANTIMPORTANT
Design considerationsDesign considerations
When applying encryption, it's strongly recommended to use service-managed encryption keys as a way
of reducing management complexity.
If there's a specific security requirement for customer-managed keys, ensure appropriate key
management procedures are applied to ensure availability, backup, and rotation of all considered keys.
When integrating with a broader organizational implementation, it's critical that an application centric approach be applied
for the provisioning and operation of data platform components in an application design.
More specifically, to maximize reliability it's critical that individual data platform components appropriately respond to
application health through operational actions which may include other application components. For example, in a
scenario where additional data platform resources are needed, scaling the data platform along with other application
components according to a capacity model will likely be required, potentially through the provision of additional scale
units. This approach will ultimately be constrained if there's a hard dependency of a centralized operations team to
address issues related to the data platform in isolation.
Ultimately, the use of centralized data services (that is Central IT DBaaS) introduces operational bottlenecks that
significantly hinder agility through a largely uncontextualized management experience, and should be avoided in a
mission-critical or business-critical context.
Additional data-platform guidance is available within the Azure Application Architecture Guide.
Azure Data Store Decision Tree
Criteria for choosing a Data Store
Non-Relational Data Stores
Relational OLTP Data Stores
To fully accommodate the globally distributed active-active aspirations of an application design, it's strongly
recommended to consider a distributed multi-region write data platform, where changes to separate writeable
replicas are synchronized and merged between all replicas, with conflict resolution where required.
The microservices may not all require a distributed multi-region write datastore, so consideration should be given to the
architectural context and business requirements of each workload scenario.
Azure Cosmos DB provides a globally distributed and highly available NoSQL datastore, offering multi-region
writes and tunable consistency out-of-the-box. The design considerations and recommendations within this
section will therefore focus on optimal Azure Cosmos DB usage.
Azure Cosmos DBAzure Cosmos DB
Azure Cosmos DB stores data within Containers, which are indexed, row-based transactional stores
designed to allow fast transactional reads and writes with response times on the order of milliseconds.
Azure Cosmos DB supports multiple different APIs with differing feature sets, such as SQL, Cassandra,
and MongoDB.
The first-party Azure Cosmos DB for NoSQL provides the richest feature set and is typically the API
where new capabilities will become available first.
Azure Cosmos DB supports Gateway and Direct connectivity modes, where Direct facilitates connectivity
over TCP to backend Azure Cosmos DB replica nodes for improved performance with fewer network
hops, while Gateway provides HTTPS connectivity to frontend gateway nodes.
Direct mode is only available when using the Azure Cosmos DB for NoSQL and is currently only
supported on .NET and Java SDK platforms.
Within Availability Zone enabled regions, Azure Cosmos DB offers Availability Zone (AZ) redundancy
support for high availability and resiliency to zonal failures within a region.
Azure Cosmos DB maintains four replicas of data within a single region, and when Availability Zone (AZ)
redundancy is enabled, Azure Cosmos DB ensures data replicas are placed across multiple AZs to protect
against zonal failures.
The Paxos consensus protocol is applied to achieve quorum across replicas within a region.
An Azure Cosmos DB account can easily be configured to replicate data across multiple regions to
mitigate the risk of a single region becoming unavailable.
Replication can be configured with either single-region writes or multi-region writes.
With single region writes, a primary 'hub' region is used to serve all writes and if this 'hub'
region becomes unavailable, a failover operation must occur to promote another region as
writable.
With multi-region writes, applications can write to any configured deployment region, which
will replicate changes between all other regions. If a region is unavailable then the remaining
regions will be used to serve write traffic.
In a multi-region write configuration, update (insert, replace, delete) conflicts can occur where writers
concurrently update the same item in multiple regions.
Azure Cosmos DB provides two conflict resolution policies, which can be applied to automatically address
conflicts.
Last Write Wins (LWW) applies a time-synchronization clock protocol using a system-defined
timestamp _ts property as the conflict resolution path. If of a conflict the item with the highest value
for the conflict resolution path becomes the winner, and if multiple items have the same numeric value
then the system selects a winner so that all regions can converge to the same version of the
committed item.
Custom resolution policies allow for application-defined semantics to reconcile conflicts using a
registered merge stored procedure that is automatically invoked when conflicts are detected.
With delete conflicts, the deleted version always wins over either insert or replace conflicts
regardless of conflict resolution path value.
Last Write Wins is the default conflict resolution policy.
When using Azure Cosmos DB for NoSQL, a custom numerical property such as a custom
timestamp definition can be used for conflict resolution.
The system provides exactly once guarantee for the execution of a merge procedure as part of
the commitment protocol.
A custom conflict resolution policy is only available with Azure Cosmos DB for NoSQL and can
only be set at container creation time.
In a multi-region write configuration, there's a dependency on a single Azure Cosmos DB 'hub' region to
perform all conflict resolutions, with the Paxos consensus protocol applied to achieve quorum across
replicas within the hub region.
The platform provides a message buffer for write conflicts within the hub region to load level and
provide redundancy for transient faults.
The buffer is capable of storing a few minutes worth of write updates requiring consensus.
The strategic direction of the Azure Cosmos DB platform is to remove this single region dependency for
conflict resolution in a multi-region write configuration, utilizing a 2-phase Paxos approach to attain quorum
at a global level and within a region.
The primary 'hub' region is determined by the first region that Azure Cosmos DB is configured within.
A priority ordering is configured for additional satellite deployment regions for failover purposes.
The data model and partitioning across logical and physical partitions plays an important role in
achieving optimal performance and availability.
When deployed with a single write region, Azure Cosmos DB can be configured for automatic failover
based on a defined failover priority considering all read region replicas.
The RTO provided by the Azure Cosmos DB platform is ~10-15 minutes, capturing the elapsed time to
perform a regional failover of the Azure Cosmos DB service if a catastrophic disaster impacting the hub
region.
This RTO is also relevant in a multi-region write context given the dependency on a single 'hub' region
for conflict resolution.
If the 'hub' region becomes unavailable, writes made to other regions will fail after the message
buffer fills since conflict resolution won't be able to occur until the service fails over and a new
hub region is established.
The strategic direction of the Azure Cosmos DB platform is to reduce the RTO to ~5 minutes by allowing
partition level failovers.
Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) are configurable via consistency
levels, with a trade-off between data durability and throughput.
Azure Cosmos DB provides a minimum RTO of 0 for a relaxed consistency level with multi-region
writes or an RPO of 0 for strong consistency with single-write region.
Azure Cosmos DB offers a 99.999% SLA for both read and write availability for Database Accounts
configured with multiple Azure regions as writable.
The SLA is represented by the Monthly Uptime Percentage, which is calculated as 100% - Average
Error Rate.
The Average Error Rate is defined as the sum of Error Rates for each hour in the billing month divided
by the total number of hours in the billing month, where the Error Rate is the total number of Failed
Requests divided by Total Requests during a given one-hour interval.
Azure Cosmos DB offers a 99.99% SLA for throughput, consistency, availability, and latency for Database
Accounts scoped to a single Azure region when configured with any of the five Consistency Levels.
A 99.99% SLA also applies to Database Accounts spanning multiple Azure regions configured with
any of the four relaxed Consistency Levels.
There are two types of throughput that can be provisioned in Azure Cosmos DB, standard and autoscale,
which are measured using Request Units per second (RU/s).
Standard throughput allocates resources required to guarantee a specified RU/s value.
Autoscale defines a maximum throughput value, and Azure Cosmos DB will automatically scale up or
down depending on application load, between the maximum throughput value and a minimum of
10% of the maximum throughput value.
Standard is billed hourly for provisioned throughput.
Autoscale is billed hourly for the maximum throughput consumed.
Static provisioned throughput with a variable workload may result in throttling errors, which will impact
SIN GLE REGIO N R EA D/ WRIT ESIN GLE REGIO N R EA D/ WRIT E
SIN GLE REGIO N W RIT E - DUA L R EGIO NSIN GL E RE GION W RIT E - DUA L REGIO N
REA DREA D DU A L R EGIO N REA D/ WRIT EDUA L REGIO N REA D/ W RIT E
1 RU 2 RU 4 RU
perceived application availability.
Autoscale protects against throttling errors by enabling Azure Cosmos DB to scale up as needed, while
maintaining cost protection by scaling back down when load decreases.
When Azure Cosmos DB is replicated across multiple regions, the provisioned Request Units (RUs) are
billed per region.
There's a significant cost delta between a multi-region-write and single-region-write configuration which
in many cases may make a multi-master Azure Cosmos DB data platform cost prohibitive.
The delta between single-region-write and multi-region-write is actually less than the 1:2 ratio reflected in
the table above. More specifically, there's a cross-region data transfer charge associated with write updates
in a single-write configuration, which isn't captured within the RU costs as with the multi-region write
configuration.
Consumed storage is billed as a flat rate for the total amount of storage (GB) consumed to host data and
indexes for a given hour.
Session is the default and most widely used consistency level since data is received in the same order as
writes.
Azure Cosmos DB supports authentication via either an Azure Active Directory identity or Azure Cosmos
DB keys and resource tokens, which provide overlapping capabilities.
It's possible to disable resource management operations using keys or resource tokens to limit keys and
resource tokens to data operations only, allowing for fine-grained resource access control using Azure
Active Directory Role-Based Access Control (RBAC).
Restricting control plane access via keys or resource tokens will disable control plane operations for
clients using Azure Cosmos DB SDKs and should therefore be thoroughly evaluated and tested.
The disableKeyBasedMetadataWriteAccess setting can be configured via ARM Template IaC definitions,
or via a Built-In Azure Policy.
Azure Cosmos DB Azure Active Directory RBAC support applies to account and resource control plane
management operations.
Application administrators can create role assignments for users, groups, service principals or
managed identities to grant or deny access to resources and operations on Azure Cosmos DB
resources.
There are several Built-in RBAC Roles available for role assignment, and custom RBAC roles can also
be used to form specific privilege combinations.
Cosmos DB Account Reader enables read-only access to the Azure Cosmos DB resource.
DocumentDB Account Contributor enables management of Azure Cosmos DB accounts
including keys and role assignments, but doesn't enable data-plane access.
Cosmos DB Operator, which is similar to DocumentDB Account Contributor, but doesn't provide
the ability to manage keys or role assignments.
Azure Cosmos DB resources (accounts, databases, and containers) can be protected against incorrect
modification or deletion using Resource Locks.
Resource Locks can be set at the account, database, or container level.
A Resource Lock set at on a resource will be inherited by all child resources. For example, a Resource
Lock set on the Azure Cosmos DB account will be inherited by all databases and containers within the
account.
Resource Locks onlyonly apply to control plane operations and do notnot prevent data plane operations,
such as creating, changing, or deleting data.
If control plane access isn't restricted with disableKeyBasedMetadataWriteAccess , then clients will be
able to perform control plane operations using account keys.
The Azure Cosmos DB change feed provides a time-ordered feed of changes to data in an Azure Cosmos
DB container.
The change feed only includes insert and update operations to the source Azure Cosmos DB
container; it doesn't include deletes.
The change feed can be used to maintain a separate data store from the primary Container used by the
application, with ongoing updates to the target data store fed by the change feed from the source
Container.
The change feed can be used to populate a secondary store for additional data platform redundancy
or for subsequent analytical scenarios.
If delete operations routinely affect the data within the source Container, then the store fed by the change
feed will be inaccurate and unreflective of deleted data.
A Soft Delete pattern can be implemented so that data records are included in the change feed.
A short Time-To-Live (TTL) is typically used with the soft-delete pattern so that Azure Cosmos DB
automatically deletes expired data, but only after it's reflected within the change feed with the deleted
flag set to True.
Instead of explicitly deleting data records, data records are
updated
by setting a flag (e.g.
IsDeleted ) to indicate that the item is considered deleted.
Any target data store fed by the change feed will need to detect and process items with a
deleted flag set to True; instead of storing soft-deleted data records, the
existing
version of the
data record in the target store will need to be deleted.
Accomplishes the original delete intent whilst also propagating the delete through the change
feed.
Azure Cosmos DB can be configured as an analytical store, which applies a column format for optimized
analytical queries to address the complexity and latency challenges that occur with the traditional ETL
pipelines.
Azure Cosmos DB automatically backs up data at regular intervals without affecting the performance or
availability, and without consuming RU/s.
Azure Cosmos DB can be configured according to two distinct backup modes.
Periodic is the default backup mode for all accounts, where backups are taken at a periodic interval
and the data is restored by creating a request with the support team.
The default periodic backup retention period is 8 hours and the default backup interval is
fourhours, which means only the latest two backups are stored by default.
The backup interval and retention period are configurable within the account.
Two backup copies are included at no extra cost, but additional backups incur additional costs.
The maximum retention period extends to a month with a minimum backup interval of
one hour.
A role assignment to the Azure "Cosmos DB Account Reader Role" is required to
configure backup storage redundancy.
Continuous backup mode allows for a restore to any point of time within the last 30 days.
By default, periodic backups are stored within separate Geo-Redundant Storage (GRS) that isn't
directly accessible.
Performing a restore operation requires a restore operation requires a Suppor t RequestSuppor t Request since customers can't directly
perform a restore.
A restore operation creates a new Azure Cosmos DB account where data is recovered.
If throughput is provisioned at the database level, backup and restore will happen at the
database level
Backup storage exists within the primary 'hub' region and is replicated to the paired
region through underlying storage replication.
The redundancy configuration of the underlying backup storage account is configurable
to Zone-Redundant Storage or Locally-Redundant Storage.
Before opening a support ticket, the backup retention period should be increased to at
least seven days within eight hours of the data loss event.
An existing Azure Cosmos DB account can't be used for Restore
By default, a new Azure Cosmos DB account named
<Azure_Cosmos_account_original_name>-restored<n> will be used.
This name can be adjusted, such as by reusing the existing name if the original
account was deleted.
It's not possible to select a subset of containers to restore.
Restore operations can be performed to return to a specific point in time (PITR) with a one-
second granularity.
The available window for restore operations is up to 30 days.
Continuous backups are taken within every Azure region where the Azure Cosmos DB account
exists.
A self-service restore can be performed using the Azure portal or IaC artifacts such as ARM
templates.
There are several limitations with Continuous Backup.
A restore operation creates a new Azure Cosmos DB account for the point-in-time restore.
There's an additional storage cost for Continuous backups and restore operations.
It's also possible to restore to the resource instantiation state.
Continuous backups are stored within the same Azure region as each Azure Cosmos DB
replica, using Locally-Redundant Storage (LRS) or Zone Redundant Storage (ZRS) within
regions that support Availability Zones.
The continuous backup mode isn't currently available in a multi-region-write
configuration.
Only Azure Cosmos DB for NoSQL and Azure Cosmos DB for MongoDB can be
configured for Continuous backup at this time.
If a container has TTL configured, restored data that has exceeded its TTL may be
immediately deleted
Existing Azure Cosmos DB accounts can be migrated from Periodic to Continuous, but not from
Continuous to Periodic; migration is one-way and not reversible.
Each Azure Cosmos DB backup is composed of the data itself and configuration details for provisioned
throughput, indexing policies, deployment region(s), and container TTL settings.
Backups don't contain firewall settings, virtual network access control lists, private endpoint settings,
consistency settings (an account is restored with session consistency), stored procedures, triggers,
UDFs, or multi-region settings.
Customers are responsible for redeploying capabilities and configuration settings. These aren't
Design RecommendationsDesign Recommendations
Azure Synapse Link analytical store data is also not included in Azure Cosmos DB backups.
restored via Azure Cosmos DB backup.
It's possible to implement a custom backup and restore capability for scenarios where Periodic and
Continuous approaches aren't a good fit.
A custom approach introduces significant costs and additional administrative overhead, which should
be understood and carefully assessed.
Azure Storage or an alternative data technology can be used, such an alternative Azure Cosmos DB
container.
Common restore scenarios should be modeled, such as the corruption or deletion of an
account, database, container, on data item.
Housekeeping procedures should be implemented to prevent backup sprawl.
Azure Storage and Azure Cosmos DB provide native integrations with Azure services such as
Azure Functions and Azure Data Factory.
The Azure Cosmos DB documentation denotes two potential options for implementing custom backups.
Azure Cosmos DB change feed to write data to a separate storage facility.
Both continuous or periodic (batched) custom backups can be implemented using the change feed.
The Azure Cosmos DB change feed doesn't yet reflect deletes, so a soft-delete pattern must be applied
using a boolean property and TTL.
Azure Data Factory Connector for Azure Cosmos DB (Azure Cosmos DB for NoSQL or MongoDB API
connectors) to copy data.
An Azure function or equivalent application process uses the change feed processor to bind to
the change feed and process items into storage.
This pattern won't be required when the change feed provides full-fidelity updates.
Azure Data Factory (ADF) supports manual execution and Schedule, Tumbling window, and
Event-based triggers.
ADF is primarily suitable for periodic custom backup implementations due to its batch-oriented
orchestration.
ADF supports Azure Private Link for high network security scenarios
Provides support for both Storage and Event Grid.
It's less suitable for continuous backup implementations with frequent events due to the
orchestration execution overhead.
Azure Cosmos DB is used within the design of many Azure services, so a significant regional outage for
Azure Cosmos DB will have a cascading effect across various Azure services within that region. The precise
impact to a particular service will heavily depend on how the underlying service design uses Azure Cosmos
DB.
Azure Cosmos DBAzure Cosmos DB
Use Azure Cosmos DB as the primary data platform where requirements allow.
For mission-critical workload scenarios, configure Azure Cosmos DB with a write replica inside each
deployment region to reduce latency and provide maximum redundancy.
Configure the application to prioritize the use of the local Azure Cosmos DB replica for writes and
reads to optimize application load, performance, and regional RU/s consumption.
The multi-region-write configuration comes at a significant cost and should be prioritized only for
workload scenarios requiring maximum reliability.
For less-critical workload scenarios, prioritize the use of single-region-write configuration (when using
Availability Zones) with globally distributed read replicas, since this offers a high level of data platform
reliability (99.999% SLA for read-, 99.995% SLA for write-operations) at a more compelling price-point.
Configure the application to use the local Azure Cosmos DB read replica to optimize read
performance.
Select an optimal 'hub' deployment region where conflict resolution will occur in a multi-region-write
configuration, and all writes will be performed in a single-region-write configuration.
Consider distance relative to other deployment regions and associated latency in selecting a primary
region, and requisite capabilities such as Availability Zones support.
Configure Azure Cosmos DB with Availability Zone (AZ) redundancy in all deployment regions with AZ
support, to ensure resiliency to zonal failures within a region.
Use Azure Cosmos DB for NoSQL since it offers the most comprehensive feature set, particularly where
performance tuning is concerned.
Alternative APIs should primarily be considered for migration or compatibility scenarios.
When using alternative APIs, validate that required capabilities are available with the selected
language and SDK to ensure optimal configuration and performance.
Use the Direct connection mode to optimize network performance through direct TCP connectivity to
backend Azure Cosmos DB nodes, with a reduced number of network 'hops'.
The Azure Cosmos DB SLA is calculated by averaging failed requests, which may not directly align with a
99.999% reliability tier error budget. When designing for 99.999% SLO, it's therefore vital to plan for
regional and multi-region Azure Cosmos DB write unavailability, ensuring a fallback storage technology is
positioned if a failure, such as a persisted message queue for subsequent replay.
Define a partitioning strategy across both logical and physical partitions to optimize data distribution
according to the data model.
Minimize cross-partition queries.
Iteratively test and validate the partitioning strategy to ensure optimal performance.
Select an optimal partition key.
The partition key can't be changed after it has been created within the collection.
The partition key should be a property value that doesn't change.
Select a partition key that has a high cardinality, with a wide range of possible values.
The partition key should spread RU consumption and data storage evenly across all logical partitions
to ensure even RU consumption and storage distribution across physical partitions.
Run read queries against the partitioned column to reduce RU consumption and latency.
Indexing is also crucial for performance, so ensure index exclusions are used to reduce RU/s and storage
requirements.
Only index those fields that are needed for filtering within queries; design indexes for the most-used
predicates.
Leverage the built-in error handling, retry, and broader reliability capabilities of the Azure Cosmos DB
SDK.
Implement retry logic within the SDK on clients.
Use service-managed encryption keys to reduce management complexity.
If there's a specific security requirement for customer-managed keys, ensure appropriate key
management procedures are applied, such as backup and rotation.
Disable Azure Cosmos DB key-based metadata write access by applying the built-in Azure Policy.
Relational data technologies
Design considerationsDesign considerations
Enable Azure Monitor to gather key metrics and diagnostic logs, such as provisioned throughput (RU/s).
Route Azure Monitor operational data into a Log Analytics workspace dedicated to Azure Cosmos DB
and other global resources within the application design.
Use Azure Monitor metrics to determine if application traffic patterns are suitable for autoscale.
Evaluate application traffic patterns to select an optimal option for provisioned throughput types.
Consider auto-scale provisioned throughput to automatically level-out workload demand.
Evaluate Microsoft performance tips for Azure Cosmos DB to optimize client-side and server-side
configuration for improved latency and throughput.
When using AKS as the compute platform: For query-intensive workloads, select an AKS node SKU that
has accelerated networking enabled to reduce latency and CPU jitters.
For single write region deployments, it's strongly recommended to configure Azure Cosmos DB for
automatic failover.
Load-level through the use of asynchronous non-blocking messaging within system flows, which write
updates to Azure Cosmos DB.
Consider patterns such as Command and Query Responsibility Segregation and Event Sourcing.
Configure the Azure Cosmos DB account for continuous backups to obtain a fine granularity of recovery
points across the last 30 days.
Consider the use of Azure Cosmos DB backups in scenarios where contained data or the Azure
Cosmos DB account is deleted or corrupted.
Avoid the use of a custom backup approach unless absolutely necessary.
It's strongly recommended to practice recovery procedures on non-production resources and data, as
part of standard business continuity operation preparation.
Define IaC artifacts to re-establish configuration settings and capabilities of an Azure Cosmos DB backup
restore.
Evaluate and apply the Azure Security Baseline control guidance for Azure Cosmos DB Backup and
Recovery.
BR-1: Ensure regular automated backups
BR-3: Validate all backups including customer-managed keys
BR-4, Mitigate risk of lost keys
For analytical workloads requiring multi-region availability, use the Azure Cosmos DB Analytical Store,
which applies a column format for optimized analytical queries.
For scenarios with a highly relational data model or dependencies on existing relational technologies, the use of
Azure Cosmos DB in a multi-region write configuration might not be directly applicable. In such cases, it's vital
that used relational technologies are designed and configured to uphold the multi-region active-active
aspirations of an application design.
Azure provides many managed relational data platforms, including Azure SQL Database and Azure Database for
common OSS relational solutions, including MySQL, PostgreSQL, and MariaDB. The design considerations and
recommendations within this section will therefore focus on the optimal usage of Azure SQL Database and
Azure Database OSS flavors to maximize reliability and global availability.
Whilst relational data technologies can be configured to easily scale read operations, writes are typically
constrained to go through a single primary instance, which places a significant constraint on scalability
and performance.
Sharding can be applied to distribute data and processing across multiple identical structured databases,
partitioning databases horizontally to navigate platform constraints.
For example, sharding is often applied in multi-tenant SaaS platforms to isolate groups of tenants into
distinct data platform constructs.
Azure SQL DatabaseAzure SQL Database
Azure SQL Database provides a fully managed database engine that is always running on the latest stable
version of the SQL Server database engine and underlying Operating System.
Provides intelligent features such as performance tuning, threat monitoring, and vulnerability
assessments.
Azure SQL Database provides built-in regional high availability and turnkey geo-replication to distribute
read-replicas across Azure regions.
With geo-replication, secondary database replicas remain read-only until a failover is initiated.
Up to four secondaries are supported in the same or different regions.
Secondary replicas can also be used for read-only query access to optimize read performance.
Failover must be initiated manually but can be wrapped in automated operational procedures.
Azure SQL Database provides Auto Failover Groups, which replicates databases to a secondary server
and allows for transparent failover if a failure.
Auto-failover groups support geo-replication of all databases in the group to only one secondary
server or instance in a different region.
Auto-failover groups aren't currently supported in the Hyperscale service tier.
Secondary databases can be used to offload read traffic.
Premium or Business Critical service tier database replicas can be distributed across Availability Zones at
no extra cost.
The control ring is also duplicated across multiple zones as three gateway rings (GW).
When using the Business Critical tier, zone redundant configuration is only available when the Gen5
compute hardware is selected.
The routing to a specific gateway ring is controlled by Azure Traffic Manager.
Azure SQL Database offers a baseline 99.99% availability SLA across all of its service tiers, but provides a
higher 99.995% SLA for the Business Critical or Premium tiers in regions that support availability zones.
Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant
Deployments have an availability SLA of 99.99%.
When configured with geo-replication, the Azure SQL Database Business Critical tier provides a Recovery
Time Objective (RTO) of 30 seconds for 100% of deployed hours.
When configured with geo-replication, the Azure SQL Database Business Critical tier has a Recovery point
Objective (RPO) of 5 seconds for 100% of deployed hours.
Azure SQL Database Hyperscale tier, when configured with at least two replicas, has an availability SLA of
99.99%.
Compute costs associated with Azure SQL Database can be reduced using a Reservation Discount.
It's not possible to apply reserved capacity for DTU-based databases.
Point-in-time restore can be used to return a database and contained data to an earlier point in time.
Design RecommendationsDesign Recommendations
Geo-restore can be used to recover a database from a geo-redundant backup.
Azure Database For PostgreSQLAzure Database For PostgreSQL
Azure Database For PostgreSQL is offered in three different deployment options:
Single Server, SLA 99.99%
Flexible Server, which offers Availability Zone redundancy, SLA 99.99%
Hyperscale (Citus), SLA 99.95% when High Availability mode is enabled.
Hyperscale (Citus) provides dynamic scalability through sharding without application changes.
Distributing table rows across multiple PostgreSQL servers is key to ensure scalable queries in
Hyperscale (Citus).
Multiple nodes can collectively hold more data than a traditional database, and in many cases can use
worker CPUs in parallel to optimize costs.
Autoscale can be configured through runbook automation to ensure elasticity in response to changing
traffic patterns.
Flexible server provides cost efficiencies for non-production workloads through the ability to stop/start
the server, and a burstable compute tier that is suitable for workloads that don't require continuous
compute capacity.
There's no additional charge for backup storage for up to 100% of total provisioned server storage.
Additional consumption of backup storage is charged according to consumed GB/month.
Compute costs associated with Azure Database for PostgreSQL can be reduced using either a Single
Server Reservation Discount or Hyperscale (Citus) Reservation Discount.
Consider sharding to partition relational databases based on different application and data contexts,
helping to navigate platform constraints, maximize scalability and availability, and fault isolation.
This recommendation is particularly prevalent when the application design considers three or more
Azure regions since relational technology constraints can significantly hinder globally distributed data
platforms.
Sharding isn't appropriate for all application scenarios, so a contextualized evaluation is required.
Prioritize the use of Azure SQL Database where relational requirements exist due to its maturity on the
Azure platform and wide array of reliability capabilities.
Azure SQL DatabaseAzure SQL Database
Use the Business-Critical service tier to maximize reliability and availability, including access to critical
resiliency capabilities.
Use the vCore based consumption model to facilitate the independent selection of compute and storage
resources, tailored to workload volume and throughput requirements.
Ensure a defined capacity model is applied to inform compute and storage resource requirements.
Consider Reserved Capacity to provide potential cost optimizations.
Configure the Zone-Redundant deployment model to spread Business Critical database replicas within
the same region across Availability Zones.
Use Active Geo-Replication to deploy readable replicas within all deployment regions (up to four).
Use Auto Failover Groups to provide transparent failover to a secondary region, with geo-replication
applied to provide replication to additional deployment regions for read optimization and database
redundancy.
IMPORTANTIMPORTANT
For application scenarios limited to only two deployment regions, the use of Auto Failover Groups
should be prioritized.
Consider automated operational triggers, based on alerting aligned to the application health model, to
conduct failovers to geo-replicated instances if a failure impacting the primary and secondary within the
Auto Failover Group.
For applications considering more than four deployment regions, serious consideration should be given to application
scoped sharding or refactoring the application to support multi-region write technologies, such as Azure Cosmos DB.
However, if this isn't feasible within the application workload scenario, it's advised to elevate a region within a single
geography to a primary status encompassing a geo-replicated instance to more evenly distributed read access.
Configure the application to query replica instances for read queries to optimize read performance.
Use Azure Monitor and Azure SQL Analytics for near real-time operational insights in Azure SQL DB for
the detection of reliability incidents.
Use Azure Monitor to evaluate usage for all databases to determine if they have been sized appropriately.
Ensure CD pipelines consider load testing under representative load levels to validate appropriate data
platform behavior.
Calculate a health metric for database components to observe health relative to business requirements
and resource utilization, using monitoring and alerts to drive automated operational action where
appropriate.
Ensure key query performance metrics are incorporated so swift action can be taken when service
degradation occurs.
Optimize queries, tables, and databases using Query Performance Insights and common performance
recommendations provided by Microsoft.
Implement retry logic using the SDK to mitigate transient errors impacting Azure SQL Database
connectivity.
Prioritize the use of service-managed keys when applying server-side Transparent Data Encryption (TDE)
for at-rest encryption.
If customer-managed keys or client-side (AlwaysEncrypted) encryption is required, ensure keys are
appropriately resilient with backups and automated rotation facilities.
Consider the use of point-in-time restore as an operational playbook to recover from severe
configuration errors.
Azure Database For PostgreSQLAzure Database For PostgreSQL
Flexible Server is recommended to use it for business critical workloads due to its Availability Zone
support.
When using Hyperscale (Citus) for business critical workloads, enable High Availability mode to receive
the 99.95% SLA guarantee.
Use the Hyperscale (Citus) server configuration to maximize availability across multiple nodes.
Define a capacity model for the application to inform compute and storage resource requirements within
the data platform.
Consider the Hyperscale (Citus) Reservation Discount to provide potential cost optimizations.
Caching for Hot Tier Data
Design ConsiderationsDesign Considerations
Design RecommendationsDesign Recommendations
An in-memory caching layer can be applied to enhance a data platform by significantly increasing read
throughput and improving end-to-end client response times for hot tier data scenarios.
Azure provides several services with applicable capabilities for caching key data structures, with Azure Cache for
Redis positioned to abstract and optimize data platform read access. This section will therefore focus on the
optimal usage of Azure Cache for Redis in scenarios where additional read performance and data access
durability is required.
A caching layer provides additional data access durability since even if an outage impacting the
underlying data technologies, an application data snapshot can still be accessed through the caching
layer.
In certain workload scenarios, in-memory caching can be implemented within the application platform
itself.
Azure Cache for RedisAzure Cache for Redis
Redis cache is an open source NoSQL key-value in-memory storage system.
The Enterprise and Enterprise Flash tiers can be deployed in an active-active configuration across
Availability Zones within a region and different Azure regions through geo-replication.
When deployed across at least three Azure regions and three or more Availability Zones in each
region, with active geo-replication enabled for all Cache instances, Azure Cache for Redis provides an
SLA of 99.999% for connectivity to one regional cache endpoint.
When deployed across three Availability Zones within a single Azure region a 99.99% connectivity
SLA is provided.
The Enterprise Flash tier runs on a combination of RAM and flash non-volatile memory storage, and
while this introduces a small performance penalty it also enables very large cache sizes, up to 13TB with
clustering.
With geo-replication, charges for data transfer between regions will also be applicable in addition to the
direct costs associated with cache instances.
The Scheduled Updates feature doesn't include Azure updates or updates applied to the underlying VM
operating system.
There will be an increase in CPU utilization during a scale-out operation while data is migrated to new
instances.
Consider an optimized caching layer for 'hot' data scenarios to increase read throughput and improve
response times.
Apply appropriate policies for cache expiration and housekeeping to avoid runaway data growth.
Consider expiring cache items when the backing data changes.
Azure Cache for RedisAzure Cache for Redis
Use the Premium or Enterprise SKU to maximize reliability and performance.
For scenarios with extremely large data volumes, the Enterprise Flash tier should be considered.
For scenarios where only passive geo-replication is required, the Premium tier can also be considered.
Deploy replica instances using geo-replication in an active configuration across all considered
Analytical Scenarios
DESC RIP T IONDESC RIP T ION A N A LY T IC A LA N A LY T IC A L T RA N SA C T IO N A LT RA N SA C T IO N A L
Use Case Analyze very large volumes of data
("big data")
Process very large volumes of
individual transactions
Optimized for Read queries and aggregations over
many records
Near real-time
Create/Read/Update/Delete (CRUD)
queries over few records
Key Characteristics - Consolidate from data sources of
record
- Column-based storage
- Distributed storage
- Parallel processing
- Denormalized
- Low concurrency reads and writes
- Optimize for storage volume with
compression
- Data source of record for application
- Row-based Storage
- Contiguous storage
- Symmetrical processing
- Normalized
- High concurrency reads and writes,
index updates
- Optimize for fast data access with in-
memory storage
Design ConsiderationsDesign Considerations
deployment regions.
Ensure replica instances are deployed across Availability Zones within each considered Azure region.
Use Azure Monitor to evaluate Azure Cache for Redis.
Calculate a health score for regional cache components to observe health relative to business
requirements and resource utilization.
Observe and alert on key metrics such as high CPU, high memory usage, high server load, and
evicted keys for insights when to scale the cache.
Optimize connection resilience by implementing retry logic, timeouts, and using a singleton
implementation of the Redis connection multiplexer.
Configure scheduled updates to prescribe the days and times that Redis Server updates are applied to the
cache.
It's increasingly common for mission-critical applications to consider analytical scenarios as a means to drive
additional value from encompassed data flows. Application and operational (AIOps) analytical scenarios
therefore form a crucial aspect of highly reliable data platform.
Analytical and transactional workloads require different data platform capabilities and optimizations for
acceptable performance within their respective contexts.
Azure Synapse provides an enterprise analytical platform that brings together relational and non-relational data
with Spark technologies, using built-in integration with Azure services such as Azure Cosmos DB to facilitate big
data analytics. The design considerations and recommendations within this section will therefore focus on
optimal Azure Synapse and Azure Cosmos DB usage for analytical scenarios.
Traditionally, large-scale analytical scenarios are facilitated by extracting data into a separate data platform
optimized for subsequent analytical queries.
Extract, Transform, and Load (ETL) pipelines are used to extract data will consume throughput and
impact transactional workload performance.
Running ETL pipelines infrequently to reduce throughput and performance impacts will result in
analytical data that is less up-to-date.
ETL pipeline development and maintenance overhead increases as data transformations become more
complex.
For example, if source data is frequently changed or deleted, ETL pipelines must account for
those changes in the target data for analytical queries through an additive/versioned approach,
dump and reload, or in-place changes on the analytical data. Each of these approaches will have
derivative impact, such as index re-creation or update.
Azure Cosmos DBAzure Cosmos DB
Analytical queries run on Azure Cosmos DB transactional data will typically aggregate across partitions
over large volumes of data, consuming significant Request Unit (RU) throughput, which can impact the
performance of surrounding transactional workloads.
The Azure Cosmos DB Analytical Store provides a schematized, fully isolated column-oriented data store
that enables large-scale analytics on Azure Cosmos DB data from Azure Synapse without impact to Azure
Cosmos DB transactional workloads.
When an Azure Cosmos DB Container is enabled as an Analytical Store, a new column store is
internally created from the operational data in the Container. This column store is persisted separately
from the row-oriented transaction store for the container.
Create, Update and Delete operations on the operational data are automatically synced to the
analytical store, so no change feed or ETL processing is required.
Data sync from the operational to the analytical store doesn't consume throughput Request Units
(RUs) provisioned on the Container or Database. There's no performance impact on transactional
workloads. Analytical Store doesn't require allocation of additional RUs on an Azure Cosmos DB
Database or Container.
Auto-Sync is the process where operational data changes are automatically synced to the Analytical
Store. Auto-Sync latency is usually less than two (2) minutes.
Analytical Store storage uses a consumption-based pricing model that charges for volume of data and
number of read and write operations. Analytical store pricing is separate from transactional store
pricing.
Auto-Sync latency can be up to five (5) minutes for a Database with shared throughput and a
large number of Containers.
As soon as Auto-Sync completes, the latest data can be queried from Azure Synapse.
Using Azure Synapse Link, the Azure Cosmos DB Analytical Store can be queried directly from Azure
Synapse. This enables no-ETL, Hybrid Transactional-Analytical Processing (HTAP) from Synapse, so that
Azure Cosmos DB data can be queried together with other analytical workloads from Synapse in near
real-time.
The Azure Cosmos DB Analytical Store isn't partitioned by default.
For certain query scenarios, performance will improve by partitioning Analytical Store data using keys
that are frequently used in query predicates.
Partitioning is triggered by a job in Azure Synapse that runs a Spark notebook using Synapse Link,
which loads the data from the Azure Cosmos DB Analytical Store and writes it into the Synapse
partitioned store in the primary storage account of the Synapse workspace.
Azure Synapse Analytics SQL Serverless pools can query the Analytical Store through automatically
updated views or via SELECT / OPENROWSET commands.
Azure Synapse Analytics Spark pools can query the Analytical Store through automatically updated Spark
tables or the spark.read command.
Data can also be copied from the Azure Cosmos DB Analytical Store into a dedicated Synapse SQL pool
using Spark, so that provisioned Azure Synapse SQL pool resources can be used.
Design RecommendationsDesign Recommendations
Azure Cosmos DB Analytical Store data can be queried with Azure Synapse Spark.
Spark notebooks allow for Spark dataframe combinations to aggregate and transform Azure Cosmos
DB analytical data with other data sets, and use other advanced Synapse Spark functionality including
writing transformed data to other stores or training AIOps Machine Learning models.
The Azure Cosmos DB change feed can also be used to maintain a separate secondary data store for
analytical scenarios.
Azure SynapseAzure Synapse
Azure Synapse brings together analytics capabilities including SQL data warehousing, Spark big data, and
Data Explorer for log and time series analytics.
Azure Synapse uses
linked services
to define connections to other services, such as Azure Storage.
Data can be ingested into Synapse Analytics via Copy activity from supported sources. This permits
data analytics in Synapse without impacting the source data store, but adds time, cost and latency
overhead due to data transfer.
Data can also be queried in-place in supported external stores, avoiding the overhead of data
ingestion and movement. Azure Storage with Data Lake Gen2 is a supported store for Synapse and
Log Analytics exported data can be queried via Synapse Spark.
Azure Synapse Studio unites ingestion and querying tasks.
Source data, including Azure Cosmos DB Analytical Store data and Log Analytics Export data, are
queried and processed in order to support business intelligence and other aggregated analytical use
cases.
Ensure analytical workloads don't impact transactional application workloads to maintain transactional
performance.
Application AnalyticsApplication Analytics
Use Azure Synapse Link with Azure Cosmos DB Analytical Store to perform analytics on Azure Cosmos
DB operational data by creating an optimized data store, which won't impact transactional performance.
Enable Azure Synapse Link on Azure Cosmos DB accounts.
Create a container enabled for Analytical Store, or enable an existing Container for Analytical Store.
Connect the Azure Synapse workspace to the Azure Cosmos DB Analytical Store to enable analytical
workloads in Azure Synapse to query Azure Cosmos DB data. Use a connection string with a read-only
Azure Cosmos DB key.
Prioritize Azure Cosmos DB Analytical Store with Azure Synapse Link instead of using the Azure Cosmos
DB change feed to maintain an analytical data store.
The Azure Cosmos DB change feed may be suitable for very simple analytical scenarios.
AIOps and Operational AnalyticsAIOps and Operational Analytics
Create a single Azure Synapse workspace with linked services and data sets for each source Azure
Storage account where operational data from resources are sent to.
Create a dedicated Azure Storage account and use it as the workspace primary storage account to store
the Synapse workspace catalog data and metadata. Configure it with hierarchical namespace to enable
Next step
Azure Data Lake Gen2.
Maintain separation between the source analytical data and Synapse workspace data and metadata.
Do not use one of the regional or global Azure Storage accounts where operational data is sent.
Review the considerations for networking considerations.
Network and connectivity
Networking and connectivity for mission
-
critical
workloads on Azure
12/16/2022 38 minutes to read • Edit Online
IMPORTANTIMPORTANT
Global traffic routing
Design considerationsDesign considerations
Networking is a fundamental area for a mission-critical application, given the recommended globally distributed
active-active design approach.
This design area explores various network topology topics at an application level, considering requisite
connectivity and redundant traffic management. More specifically, it highlights critical considerations and
recommendations intended to inform the design of a secure and scalable global network topology for a
mission-critical application.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The code assets illustrate
considerations and recommendations for networking and connectivity for a globally distributed active-active design
approach.
The use of multiple active regional deployment stamps requires a global routing service to distribute traffic to
each active stamp.
Azure Front Door, Azure Traffic Manager, and Azure Standard Load Balancer provide the needed routing
capabilities to manage global traffic across a multi-region application.
Alternatively, third-party globally routing technologies can be considered. These options can almost seamlessly
be swapped in to replace or extend the use of Azure-native global routing services. Popular choices include
routing technologies by CDN providers.
This section explores the key differences Azure routing services to define how each can be used to optimize
different scenarios.
A routing service bound to a single region represents a single-point-of-failure and a significant risk with
regard to regional outages.
If the application workload encompasses client control, such as with mobile or desktop client applications,
it's possible to provide service redundancy within client routing logic.
Multiple global routing technologies, such as Azure Front Door and Azure Traffic Manager, can be
considered in parallel for redundancy, with clients configured to fail over to an alternative technology
when certain failure conditions are met. The introduction of multiple global routing services
introduces significant complexities around edge caching and Web Application Firewall capabilities,
and certificate management for SSL offload and application validation for ingress paths.
Third-party technologies can also be considered, providing global routing resiliency to all levels of
Azure platform failures.
Capability disparity between Azure Front Door and Traffic Manager means that if the two technologies
are positioned alongside one another for redundancy, a different ingress path or design changes would
be required to ensure a consistent and acceptable level of service is maintained.
Azure Front Door and Azure Traffic Manager are globally distributed services with built-in multi-region
redundancy and availability.
Hypothetical failure scenarios of a scale large enough to threaten the global availability of these
resilient routing services presents a broader risk to the application in terms of cascading and
correlated failures.
Failure scenarios of this scale are only feasibly caused by shared foundational services, such as
Azure DNS or Azure AD, which serve as global platform dependencies for almost all Azure
services. If a redundant Azure technology is applied, it's likely that the secondary service will
also be experiencing unavailability or a degraded service.
Global routing service failure scenarios are highly likely to significantly impact many other
services used for key application components through interservice dependencies. Even if a
third-party technology is used, the application will likely be in an unhealthy state due to the
broader impact of the underlying issue, meaning that routing to application endpoints on Azure
will provide little value anyway.
Global routing service redundancy provides mitigation for an extremely small number of hypothetical
failure scenarios, where the impact of a global outage is constrained to the routing service itself.
To provide broader redundancy to global outage scenarios, a multi-cloud active-active deployment
approach can be considered. A multi-cloud active-active deployment approach introduces significant
operational complexities, which pose significant resiliency risks, likely far outweighing the hypothetical
risks of a global outage.
For scenarios where client control isn't possible, a dependency must be taken on a single global routing
service to provide a unified entry point for all active deployment regions.
When used in isolation they represent a single-point-of-failure at a service level due to global
dependencies, even though built-in multi-region redundancy and availability are provided.
The SLA provided by the selected global routing service represents the maximum attainable
composite SLA, regardless of how many deployment regions are considered.
When client control isn't possible, operational mitigations can be considered to define a process for
migrating to a secondary global routing service if a global outage disables the primary service. Migrating
from one global routing service to another is typically a lengthy process lasting several hours,
particularly where DNS propagation is considered.
Some third-party global routing services provide a 100% SLA. However, the historic and attainable SLA
provided by these services is typically lower than 100%.
While these services provide financial reparations for unavailability, it comes of little significance when
the impact of unavailability is significant, such as with safety-critical scenarios where human life is
ultimately at stake. Technology redundancy or sufficient operational mitigations should therefore still
be considered even when the advertised legal SLA is 100%.
Azure Front DoorAzure Front Door
Azure Front Door provides global HTTP/S load balancing and optimized connectivity using the Anycast
protocol with split TCP to take advantage of the Microsoft global backbone network.
A number of connections are maintained for each of the backend endpoints.
Incoming client requests are first terminated at the edge node closest to the originating client.
After any required traffic inspection, requests are either forwarded over the Microsoft backbone to the
appropriate backend using existing connections, or served from the internal cache of an edge node.
This approach is very efficient in spreading high traffic volumes over the backend connections.
Provides a built-in cache that serves static content from edge nodes. In many use cases, this can also
eliminate the need for a dedicated Content Delivery Network (CDN).
Azure Web Application Firewall (WAF) can be used on Azure Front Door, and since it's deployed to Azure
network edge locations around the globe, every incoming request delivered by Front Door in inspected at
the network edge.
Azure Front Door protects application endpoints against DDoS attacks using Azure DDoS protection
Basic. Azure DDoS Standard provides additional and more advanced protection and detection capabilities
and can be added as an additional layer to Azure Front Door.
Azure Front Door offers a fully managed certificate service. Enables TLS connection security for endpoints
without having to manage the certificate lifecycle.
Azure Front Door Premium supports private endpoints, enabling traffic to flow from the internet directly
onto Azure virtual networks. This would eliminate the need of using public IPs on the VNet for making
the backends accessible via Azure Front Door Premium.
Azure Front Door relies on health probes and backend health endpoints (URLs) that are called on an
interval basis to return an HTTP status code reflecting if the backend is operating normally, with an HTTP
200 (OK) response reflecting a healthy status. As soon as a backend reflects an unhealthy status, from the
perspective of a certain edge node, that edge node will stop sending requests there. Unhealthy backends
are therefore transparently removed from traffic circulation without any delay.
Supports HTTP/S protocols only.
The Azure Front Door WAF and Application Gateway WAF provide a slightly different feature set, though
both support built-in and custom rules and can be set to operate in either detection mode or prevention
mode.
The Front Door backend IP space may change, but Microsoft will ensure integration with Azure IP Ranges
and Service Tags. It's possible to subscribe to Azure IP Ranges and Service Tags to receive notifications
about any changes or updates.
Azure Front Door supports various load distribution configurations:
Latency-based: the default setting that routes traffic to the "closest" backend from the client; based on
request latency.
Priority-based: useful for active-passive setups, where traffic must always be sent to a primary
backend unless it's not available.
Weighted: applicable for canary deployments in which a certain percentage of traffic is sent to a
specific backend. If multiple backends have the same weights assigned, latency-based routing is used.
By default Azure Front Door uses latency-based routing that can lead to situations where some backends
get a lot more incoming traffic than others, depending on where clients originate from.
If a series of client requests must be handled by the same backend, Session Affinity can be configured on
the frontend. It uses a client-side cookie to send subsequent requests to the same backend as the first
request, provided the backend is still available.
Azure Traffic ManagerAzure Traffic Manager
Azure Traffic Manager is a DNS redirection service.
The actual request payload isn't processed, but instead Traffic Manager returns the DNS name of one
IMPORTANTIMPORTANT
Design recommendationsDesign recommendations
of the backends it the pool, based on configured rules for the selected traffic routing method.
The backend DNS name is then resolved to its final IP address that is subsequently directly called by
the client.
The DNS response is cached and reused by the client for a specified Time-To-Live (TTL) period, and
requests made during this period will go directly to the backend endpoint without Traffic Manager
interaction. Eliminates the extra connectivity step that provides cost benefits compared to Front Door.
Since the request is made directly from the client to the backend service, any protocol supported by the
backend can be leveraged.
Similar to Azure Front Door, Azure Traffic Manager also relies on health probes to understand if a
backend is healthy and operating normally. If another value is returned or nothing is returned, the routing
service recognizes ongoing issues and will stop routing requests to that specific backend.
However, unlike with Azure Front Door this removal of unhealthy backends isn't instantaneous since
clients will continue to create connections to the unhealthy backend until the DNS TTL expires and a
new backend endpoint is requested from the Traffic Manager service.
In addition, even when the TTL expires, there no guarantee that public DNS servers will honor this
value, so DNS propagation can actually take much longer to occur. This means that traffic may
continue to be sent to the unhealthy endpoint for a sustained period of time.
Azure Standard Load BalancerAzure Standard Load Balancer
Cross-Region Standard Load Balancer is available in preview with technical limitations. This option isn't recommended for
mission-critical workloads.
Use Azure Front Door as the primary global traffic routing service for HTTP/S scenarios. Azure Front
Door is strongly advocated for HTTP/S workloads as it provides optimized traffic routing, transparent fail
over, private backend endpoints (with the Premium SKU), edge caching and integration with Web
Application Firewall (WAF).
For application scenarios where client control is possible, apply client side routing logic to consider
failover scenarios where the primary global routing technology fails. Two or more global routing
technologies should be positioned in parallel for added redundancy, if single service SLA isn't sufficient.
Client logic is required to route to the redundant technology in case of a global service failure.
Two distinct URLs should be used, with one applied to each of the different global routing services to
simplify the overall certificate management experience and routing logic for a failover.
Prioritize the use of third-party routing technologies as the secondary failover service, since this will
mitigate the largest number of global failure scenarios and the capabilities offered by industry leading
CDN providers will allow for a consistent design approach.
Consideration should also be given to directly routing to a single regional stamp rather than a
separate routing service. While this will result in a degraded level of service, it represents a far simpler
design approach.
This image shows a redundant global load balancer configuration with client failover using Azure Front Door as
primary global load balancer.
IMPORTANTIMPORTANT
To truly mitigate the risk of global failures within the Azure platform, a multi-cloud active-active deployment approach
should be considered, with active deployment stamps hosted across two or more cloud providers and redundant third-
party routing technologies used for global routing.
Azure can effectively be integrated with other cloud platforms. However, it's strongly recommended not to apply a multi-
cloud approach because it introduces significant operational complexity, with different deployment stamp definitions and
representations of operational health across the different cloud platforms. This complexity in-turn introduces numerous
resiliency risks within the normal operation of the application, which far outweigh the hypothetical risks of a global
platform outage.
Although not recommended, for HTTP(s) workloads using Azure Traffic Manager for global routing
redundancy to Azure Front Door, consider offloading Web Application Firewall (WAF) to Application Gateway
for acceptable traffic flowing through Azure Front Door.
This will introduce an additional failure point to the standard ingress path, an additional critical-path
component to manage and scale, and will also incur additional costs to ensure global high-availability.
It will, however, greatly simplify the failure scenario by providing consistency between the acceptable
and not acceptable ingress paths through Azure Front Door and Azure Traffic Manager, both in terms
of WAF execution but also private application endpoints.
The loss of edge caching in a failure scenario will impact overall performance, and this must be
aligned with an acceptable level of service or mitigating design approach. To ensure a consistent level
of service, consider offloading edge caching to a third-party CDN provider for both paths.
It's recommended to consider a third-party global routing service in place of two Azure global routing services.
This provides the maximum level of fault mitigation and a more simple design approach since most industry
leading CDN providers offer edge capabilities largely consistent with that offered by Azure Front Door.
Azure Front DoorAzure Front Door
Use the Azure Front Door managed certificate service to enable TLS connections, and remove the need to
manage certificate lifecycles.
Use the Azure Front Door Web Application Firewall (WAF) to provide protection at the edge from
common web exploits and vulnerabilities, such as SQL injection.
Use the Azure Front Door built-in cache to serve static content from edge nodes.
In most cases this will also eliminate the need for a dedicated Content Delivery Network (CDN).
Application delivery services
Configure the application platform ingress points to validate incoming requests through header based
filtering using the
X-Azure-FDID
to ensure all traffic is flowing through the configured Front Door
instance. Consider also configuring IP ACLing using Front Door Service Tags to validate traffic originates
from the Azure Front Door backend IP address space and Azure infrastructure services. This will ensure
traffic flows through Azure Front Door at a service level, but header based filtering will still be required to
ensure the use of a configured Front Door instance.
Define a custom TCP health endpoint to validate critical downstream dependencies within a regional
deployment stamp, including data platform replicas, such as Azure Cosmos DB in the example provided
by the foundational reference implementation. If one or more dependencies becomes unhealthy, the
health probe should reflect this in the response returned so that the entire regional stamp can be taken
out of circulation.
Ensure health probe responses are logged and ingest all operational data exposed by Azure Front Door
into the global Log Analytics workspace to facilitate a unified data sink and single operational view across
the entire application.
Unless the workload is extremely latency sensitive, spread traffic evenly across all considered regional
stamps to most effectively use deployed resources.
To achieve this, set the "Latency Sensitivity (Additional Latency)" parameter to a value that is high
enough to cater for latency differences between the different regions of the backends. Ensure a
tolerance that is acceptable to the application workload regarding overall client request latency.
Don't enable Session Affinity unless it's required by the application, since it can have a negative impact
the balance of traffic distribution. With a fully stateless application, if the recommended mission-critical
application design approach is followed, any request could be handled by any of the regional
deployments.
Azure Traffic ManagerAzure Traffic Manager
Use Traffic Manager for non HTTP/S scenarios as a replacement to Azure Front Door. Capability
differences will drive different design decisions for cache and WAF capabilities, and TLS certificate
management.
WAF capabilities should be considered within each region for the Traffic Manager ingress path, using
Azure Application Gateway.
Configure a suitably low TTL value to optimize the time required to remove an unhealthy backend
endpoint from circulation in the event that backend becomes unhealthy.
Similar to with Azure Front Door, a custom TCP health endpoint should be defined to validate critical
downstream dependencies within a regional deployment stamp, which should be reflected in the
response provided by health endpoints.
However, for Traffic Manager additional consideration should be given to service level regional fail over.
such as 'dog legging', to mitigate the potential delay associated with the removal of an unhealthy
backend due to dependency failures, particularly if it's not possible to set a low TTL for DNS records.
Consideration should be given to third-party CDN providers in order to achieve edge caching when using
Azure Traffic Manager as a primary global routing service. Where edge WAF capabilities are also offered
by the third-party service, consideration should be given to simplify the ingress path and potentially
remove the need for Application Gateway.
The network ingress path for a mission-critical application must also consider application delivery services to
ensure secure, reliable, and scalable ingress traffic.
Design considerationsDesign considerations
This section builds on global routing recommendations by exploring key application delivery capabilities,
considering relevant services such as Azure Standard Load Balancer, Azure Application Gateway, and Azure API
Management.
TLS encryption is critical to ensure the integrity of inbound user traffic to a mission-critical application,
with TLS OffloadingTLS Offloading applied only at the point of a stamp's ingress to decrypt incoming traffic. TLS
Offloading Requires the private key of the TLS certificate to decrypt traffic.
A Web Application FirewallWeb Application Firewall provides protection against common web exploits and vulnerabilities,
such as SQL injection or cross site scripting, and is essential to achieve the maximum reliability
aspirations of a mission-critical application.
Azure WAF provides out-of-the-box protection against the top 10 OWASP vulnerabilities using managed
rule sets.
Custom rules can also be added to extend the managed rule set.
Azure WAF can be enabled within either Azure Front Door, Azure Application Gateway, or Azure CDN
(currently in public preview).
The features offered on each of the services differ slightly. For example, the Azure Front Door
WAF provides rate limiting, geo-filtering and bot protection, which aren't yet offered within the
Application Gateway WAF. However, they all support both built-in and custom rules and can be
set to operate in detection mode or prevention mode.
The roadmap for Azure WAF will ensure a consistent WAF feature set is provided across all
service integrations.
Third-party WAF technologies such as NVAs and advanced ingress controllers within Kubernetes can also
be considered to provide requisite vulnerability protection.
Optimal WAF configuration typically requires fine tuning, regardless of the technology used.
Azure Front DoorAzure Front Door
Azure Front Door only accepts HTTP and HTTPS traffic, and will only process requests with a known Host
header. This protocol blocking helps to mitigate volumetric attacks spread across protocols and ports, and
DNS amplification and TCP poisoning attacks.
Azure Front Door is a global Azure resource so configuration is deployed globally to all edge locations.
Resource configuration can be distributed at a massive scale to handle hundreds of thousands of
requests per second.
Updates to configuration, including routes and backend pools, are seamless and won't cause any
downtime during deployment.
Azure Front Door provides both a fully managed certificate service and a bring-your-own-certificate
method for the client-facing SSL certificates. The fully managed certificate service provides a simplified
operational approach and helps to reduce complexity in the overall design by performing certificate
management within a single area of the solution.
Azure Front Door auto-rotates "Managed" certificates at least 60 days ahead of certificate expiration to
protect against expired certificate risks. If self-managed certificates are used, updated certificates should
be deployed no later than 24 hours prior to expiration of the existing certificate, otherwise clients may
receive expired certificate errors.
Certificate updates will only result in downtime if Azure Front Door is switched between "Managed" and
"Use Your Own Certificate".
Azure Front Door is protected by Azure DDoS Protection Basic, which is integrated into Front Door by
Design recommendationsDesign recommendations
Caching and static content delivery
Design considerationsDesign considerations
default. This provides always-on traffic monitoring, real-time mitigation, and also defends against
common Layer 7 DNS query floods or Layer 3/4 volumetric attacks.
These protections help to maintain Azure Front Door availability even when faced with a DDoS attack.
Distributed Denial of Service (DDoS) attacks can render a targeted resource unavailable by
overwhelming it with illegitimate traffic.
Azure Front Door also provides WAF capabilities at a global traffic level, while Application Gateway WAF
must be provided within each regional deployment stamp. Capabilities include firewall rulesets to protect
against common attacks, geo-filtering, address blocking, rate limiting, and signature matching.
Azure Load BalancerAzure Load Balancer
The Azure Basic Load Balancer SKU isn't backed by an SLA and has several capability constraints
compared to the Standard SKU.
Perform TLS Offloading in as few places as possible in order to maintain security whilst simplifying the
certificate management lifecycle.
Use encrypted connections (e.g. HTTPS) from the point where TLS offloading occurs to the actual
application backends. Application endpoints won't be visible to end users, so Azure-managed domains,
such as azurewebsites.net or cloudapp.net , can be used with managed certificates.
For HTTP(S) traffic, ensure WAF capabilities are applied within the ingress path for all publicly exposed
endpoints.
Enable WAF capabilities at a single service location, either globally with Azure Front Door or regionally
with Azure Application Gateway, since this simplifies configuration fine tuning and optimizes
performance and cost.
Configure WAF in Prevention mode to directly block attacks. Only use WAF in Detection mode (i.e. only
logging but not blocking suspicious requests) when the performance penalty of Prevention mode is too
high. The implied additional risk must be fully understood and aligned to the specific requirements of the
workload scenario.
Prioritize the use of Azure Front Door WAF since it provides the richest Azure-native feature set and
applies protections at the global edge, which simplifies the overall design and drives further efficiencies.
Use Azure API Management only when exposing a large number of APIs to external clients or different
application teams.
Use the Azure Standard Load Balancer SKU for any internal traffic distribution scenario within micros-
service workloads.
Provides an SLA of 99.99% when deployed across Availability Zones.
Provides critical capabilities such as diagnostics or outbound rules.
Use Azure DDoS Network Protection to help protect public endpoints hosted within each application
virtual network.
Special treatment of static content like images, JavaScript, CSS and other files can have a significant impact on
the overall user experience as well as on the overall cost of the solution. Caching static content at the edge can
speed up the client load times which results in a better user experience and can also reduce the cost for traffic,
read operations and computing power on backend services involved.
Design recommendationsDesign recommendations
Virtual network integration
C a u t i o nC a u t i o n
Not all content that a solution makes available over the Internet is generated dynamically. Applications serve
both static assets (images, JavaScript, CSS, localization files, etc.) and dynamic content.
Workloads with frequently accessed static content benefit greatly from caching since it reduces the load on
backend services and reduces content access latency for end users.
Caching can be implemented natively within Azure using either Azure Front Door or Azure Content Delivery
Network (CDN).
Azure Front Door provides Azure-native edge caching capabilities and routing features to divide static
and dynamic content.
More complex caching scenarios can be implemented using the Azure CDN service to establish a full-
fledged content delivery network for significant static content volumes.
When comparing the Azure Front Door and Azure CDN services, the following decision factors should
be explored:
By creating the appropriate routing rules in Azure Front Door, /static/* traffic can be
transparently redirected to static content.
The Azure CDN service will likely be more cost effective, but does not provide the same
advanced routing and Web Application Firewall (WAF) capabilities which are recommended for
other areas of an application design. It does, however, offer further flexibility to integrate with
similar services from third-party solutions, such as Akamai and Verizon.
Can required caching rules be accomplished using the rules engine.
Size of the stored content and the associated cost.
Price per month for the execution of the rules engine (charged per request on Azure Front
Door).
Outbound traffic requirements (price differs by destination).
Generated, static content like sized copies of image files that never or only rarely change can benefit from
caching as well. Caching can be configured based on URL parameters and with varying caching duration.
Separate the delivery of static and dynamic content to users and deliver relevant content from a cache to
reduce load on backend services optimize performance for end-users.
Given the strong recommendation (Network and connectivity design area) to use Azure Front Door for
global routing and Web Application Firewall (WAF) purposes, it's recommended to prioritize the use of Azure
Front Door caching capabilities unless gaps exist.
A mission-critical application will typically encompass requirements for integration with other applications or
dependent systems, which could be hosted on Azure, another public cloud, or on-premises data centers. This
application integration can be accomplished using public-facing endpoints and the internet, or private networks
through network-level integration. Ultimately, the method by which application integration is achieved will have
a significant impact on the security, performance, and reliability of the solution, and strongly impacting design
decisions within other design areas.
A mission-critical application can be deployed within one of three overarching network configurations, which
determines how application integration can occur at a network level.
1. PublicPublic application withoutwithout corporate network connectivity.
2. PublicPublic application withwith corporate network connectivity.
3. PrivatePrivate application withwith corporate network connectivity.
When deploying within an Azure landing zone, configuration 1. should be deployed within an Online Landing
Zone, while both 2) and 3) should be deployed within a Corp. Connected Landing Zone to facilitate network-
Design ConsiderationsDesign Considerations
No virtual networksNo virtual networks
Isolated virtual networksIsolated virtual networks
Connected virtual networksConnected virtual networks
level integration.
This section explores these network integration scenarios, layering in the appropriate use of Azure Virtual
Networks and surrounding Azure networking services to ensure integration requirements are optimally
satisfied.
The simplest design approach is to not deploy the application within a virtual network.
Connectivity between all considered Azure services will be provided entirely through public endpoints
and the Microsoft Azure backbone. Connectivity between public endpoints hosted on Azure will only
traverse the Microsoft backbone and won't go over the public internet.
Connectivity to any external systems outside Azure will be provided by the public internet.
This design approach adopts "identity as a security perimeter" to provide access control between the
various service components and dependent solution. This may be an acceptable solution for scenarios
that are less sensitive to security. All application services and dependencies are accessible through a
public endpoint leaves them vulnerable to additional attack vectors orientated around gaining
unauthorized access.
This design approach is also not applicable for all Azure services, since many services, such as AKS, have
a hard requirement for an underlying virtual network.
To mitigate the risks associated with unnecessary public endpoints, the application can be deployed
within a standalone network that isn't connected to other networks.
Incoming client requests will still require a public endpoint to be exposed to the internet, however, all
subsequent communication can be within the virtual network using private endpoints. When using Azure
Front Door Premium, it's possible to route directly from edge nodes to private application endpoints.
While private connectivity between application components will occur over virtual networks, all
connectivity with external dependencies will still rely on public endpoints.
Connectivity to Azure platform services can be established with Private Endpoints if supported. If
other external dependencies exist on Azure, such as another downstream application, connectivity will
be provided through public endpoints and the Microsoft Azure backbone.
Connectivity to any external systems outside Azure would be provided by the public internet.
For scenarios where there are no network integration requirements for external dependencies, deploying
the solution within an isolated network environment provides maximum design flexibility. No addressing
and routing constraints associated with broader network integration.
Azure Bastion is a fully platform-managed PaaS service that can be deployed on a virtual network and
provides secure RDP/SSH connectivity to Azure VMs. When you connect via Azure Bastion, virtual
machines don't need a public IP address.
The use of application virtual networks introduces significant deployment complexities within CI/CD
pipelines, since both data plane and control plane access to resources hosted on private networks is
required to facilitate application deployments.
Secure private network path must be established to allow CI/CD tooling to perform requisite actions.
Private build agents can be deployed within application virtual networks to proxy access to resources
secured by the virtual network.
For scenarios with external network integration requirements, application virtual networks can be
connected to other virtual networks within Azure, another cloud provider, or on-premises networks using
NOTENOTE
Design recommendationsDesign recommendations
a variety of connectivity options. For example, some application scenarios might consider application-
level integration with other line-of-business applications hosted privately within an on-premises
corporate network.
The application network design must align with the broader network architecture, particularly concerning
topics such as addressing and routing.
Overlapping IP address spaces across Azure regions or on-premises networks will create major
contention when network integration is considered.
A virtual network resource can be updated to consider additional address space, however, when a
virtual network address space of a peered network changes a sync on the peering link is required,
which will temporarily disable peering.
Azure reserves five IP addresses within each subnet, which should be considered when determining
appropriate sizes for application virtual networks and encompassed subnets.
Some Azure services require dedicated subnets, such as Azure Bastion, Azure Firewall, or Azure Virtual
Network Gateway. The size of these service subnets is very important, since they should be large
enough to support all current instances of the service considering future scale requirements, but not
so large as to unnecessarily waste addresses.
When on-premises or cross-cloud network integration is required, Azure offers two different solutions to
establish a secure connection.
An ExpressRoute circuit can be sized to provide bandwidths up to 100 Gbps.
A Virtual Private Network (VPN) can be sized to provide aggregated bandwidth up to 10 Gbps in hub
and spoke networks, and up to 20 Gbps in Azure Virtual WAN.
When deploying within an Azure landing zone, be aware that any required connectivity to on-premises networks should
be provided by the landing zone implementation. The design can use ExpressRoute and other virtual networks in Azure
using either Virtual WAN or a hub-and-spoke network design.
The inclusion of additional network paths and resources introduces additional reliability and operational
considerations for the application to ensure health is maintained.
It's recommended that mission-critical solutions are deployed within Azure virtual networks where
possible to remove unnecessary public endpoints, limiting the application attack surface to maximize
security and reliability.
Use Private Endpoints for connectivity to Azure platform services. Service Endpoints can be
considered for services that don support Private Link, provided data exfiltration risks are acceptable or
mitigated through alternative controls.
For application scenarios that don't require corporate network connectivity, treat all virtual networks as
ephemeral resources that are replaced when a new regional deployment is conducted.
When connecting to other Azure or on-premises networks, application virtual networks shouldn't be
treated as ephemeral since it creates significant complications where virtual network peering and virtual
network gateway resources are concerned. All relevant application resources within the virtual network
should continue to be ephemeral, with parallel subnets used to facilitate blue-green deployments of
updated regional deployment stamps.
In scenarios where corporate network connectivity is required to facilitate application integration over
private networks, ensure that the IPv4 address space used for regional application virtual networks
Internet egress
Design ConsiderationsDesign Considerations
NOTENOTE
doesn't overlap with other connected networks and is properly sized to facilitate required scale without
needing to update the virtual network resource and incur downtime.
It's strongly recommended to only use IP addresses from the address allocation for private internet
(RFC 1918).
Align with organization plans for IP addressing in Azure to ensure that application network IP address
space doesn't overlap with other networks across on-premises locations or Azure regions.
Don't create unnecessarily large application virtual networks to ensure that IP address space isn't
wasted.
For environments with a limited availability of private IP addresses (RFC 1918), consider using
IPv6.
If the use of public IP address is required, ensure that only owned address blocks are used.
Prioritize the use Azure CNI for AKS network integration, since it supports a richer feature set.
Consider Kubenet for scenarios with a limited rage available IP addresses to fit the application
within a constrained address space.
Prioritize the use of the Azure CNI network plugin for AKS network integration and consider
Kubenet for scenarios with a limited range of available IP addresses. See Micro-segmentation and
kubernetes network policies for more details.
For scenarios requiring on-premises network integration, prioritize the use Express Route to ensure
secure and scalable connectivity.
Ensure the reliability level applied to the Express Route or VPN fully satisfies application requirements.
Multiple network paths should be considered to provide additional redundancy when required, such
as cross connected ExpressRoute circuits or the use of VPN as a failover connectivity mechanism.
Ensure all components on critical network paths are in line with the reliability and availability
requirements of associated user flows, regardless of whether the management of these paths and
associated component is delivered by the application team of central IT teams.
When deploying within an Azure landing zone and integrating with a broader organizational network topology,
consider the network guidance to ensure the foundational network is aligned with Microsoft best-practices.
Use Azure Bastion or proxied private connections to access the data plane of Azure resources or perform
management operations.
Internet egress is a foundational network requirement for a mission-critical application to facilitate external
communication in the context of:
1. Direct application user interaction.
2. Application integration with external dependencies outside Azure.
3. Access to external dependencies required by the Azure services used by the application.
This section explores how internet egress can be achieved while ensuring security, reliability, and sustainable
performance are maintained, highlighting key egress requirements for services recommended in a mission-
critical context.
Many Azure services require access to public endpoints for various management and control plane
Design recommendationsDesign recommendations
NOTENOTE
functions to operate as intended.
Azure provides different direct internet outbound connectivity methods, such as Azure NAT gateway or
Azure Load Balancer, for virtual machines or compute instances on a virtual network.
When traffic from inside a virtual network travels out to the Internet, Network Address Translation (NAT)
must take place. This is a compute operation that occurs within the networking stack and that can
therefore impact system performance.
When NAT takes place at a small scale the performance impact should be negligible, however, if there are
a large number of outbound requests network issues may occur. These issues typically come in the form
of 'Source NAT (or SNAT) port exhaustion'.
In a multi-tenant environment, such as Azure App Service, there's a limited number of outbound ports
available to each instance. If these ports run out, no new outbound connections can be initiated. This
issue can be mitigated by reducing the number of private/public edge traversals or by using a more
scalable NAT solution such as the Azure NAT Gateway.
In addition to NAT limitations, outbound traffic may also be subject to requisite security inspections.
Azure Firewall provides appropriate security capabilities to secure network egress.
Azure Firewall (or an equivalent NVA) can be used to secure Kubernetes egress requirements by
providing granular control over outbound traffic flows.
Large volumes of internet egress will incur data transfer charges.
Azure NAT GatewayAzure NAT Gateway
Azure NAT Gateway supports 64,000 connections for TCP and UDP per assigned outbound IP address.
Up to 16 IP addresses can be assigned to a single NAT gateway.
A default TCP idle timeout of 4 minutes. If idle timeout is altered to a higher value, flows will be held
for longer, which will increase the pressure on the SNAT port inventory.
NAT gateway can't provide zonal isolation out-of-the-box. To get zonal redundancy, a subnet containing
zonal resources must be aligned with corresponding zonal NAT gateways.
Minimize the number of outgoing Internet connections as this will impact NAT performance.
If large numbers of internet-bound connections are required, consider using Azure NAT Gateway to
abstract outbound traffic flows.
Use Azure Firewall where requirements to control and inspect outbound internet traffic exist.
Ensure Azure Firewall isn't used to inspect traffic between Azure services.
When deploying within an Azure landing zone, consider using the foundational platform Azure Firewall resource (or
equivalent NVA). If a dependency is taken on a central platform resource for internet egress, then the reliability level of
that resource and associated network path should be closely aligned with application requirements. Operational data from
the resource should also be made available to the application in order to inform potential operational action in failure
scenarios.
If there are high-scale requirements associated with outbound traffic, consideration should be given to a
dedicated Azure Firewall resource for a mission-critical application, to mitigate risks associated with using a
centrally shared resource, such as noisy neighbor scenarios.
Inter-zone and inter-region Connectivity
Design ConsiderationsDesign Considerations
When deployed within a Virtual WAN environment, consideration should be given to Firewall Manager to
provide centralized management of dedicated application Azure Firewall instances to ensure organizational
security postures are observed through global firewall policies.
Ensure incremental firewall policies are delegated to application security teams via role-based access control
to allow for application policy autonomy.
While the application design strongly advocates independent regional deployment stamps, many application
scenarios may still require network integration between application components deployed within different
zones or Azure regions, even if only under degraded service circumstances. The method by which inter-zone
and inter-region communication is achieved has a significant bearing on overall performance and reliability,
which will be explored through the considerations and recommendations within this section.
The application design approach for a mission-critical application endorses the use of independent
regional deployments with zonal redundancy applied at all component levels within a single region.
An Availability Zone (AZ) is a physically separate data center location within an Azure region, providing
physical and logical fault isolation up to the level of a single data center.
A round-trip latency of less than 2 ms is guaranteed for inter-zone communication. Zones will have a
small latency variance given varied distances and fiber paths between zones.
Availability Zone connectivity depends on regional characteristics, and therefore traffic entering a region
via an edge location may need to be routed between zones to reach its destination. This will add a ~1ms-
2ms latency given inter-zone routing and 'speed of light' constraints, but this should only have a bearing
for hyper sensitive workloads.
Availability Zones are treated as logical entities within the context of a single subscription, so different
subscriptions might have a different zonal mapping for the same region. For example, zone 1 in
Subscription A could correspond to the same physical data center as zone 2 in subscription B.
Communication between zones within a region incurs a data transfer charge per GB of bandwidth.
With application scenarios that are extremely chatty between application components, spreading
application tiers across zones can introduce significant latency and increased costs. It's possible to
mitigate this within the design by constraining a deployment stamp to a single zone and deploying
multiple stamps across the different zones.
Communication between different Azure regions incurs a larger data transfer charge per GB of
bandwidth.
The applicable data transfer rate depends on the continent of the considered Azure regions.
Data traversing continents are charged at a considerably higher rate.
Express Route and VPN connectivity methods can also be used to directly connect different Azure regions
together for certain scenarios, or even different cloud platforms.
For services to service communication Private Link can be used for direct communication using private
endpoints.
Traffic can be hair-pinned through Express Route circuits used for on-premise connectivity in order to
facilitate routing between virtual networks within an Azure region and across different Azure regions
within the same geography.
Hair-pinning traffic through Express Route will bypass data transfer costs associated with virtual
network peering, so can be used as a way to optimize costs.
Design recommendationsDesign recommendations
Micro-segmentation and Kubernetes network policies
Design ConsiderationsDesign Considerations
This approach necessitates additional network hops for application integration within Azure, which
introduces latency and reliability risks. Expands the role of Express Route and associated gateway
components from Azure/on-premises to also encompass Azure/Azure connectivity.
When submillisecond latency are required between services, Proximity Placement Groups can be used
when supported by the services used.
Use virtual network peering to connect networks within a region and across different regions. It's
strongly recommended to avoid hair-pinning within Express Route.
Use Private Link to establish communication directly between services in the same region or across
regions (service in Region A communicating with service in Region B.
For application workloads that are extremely chatty between components, consider constraining a
deployment stamp to a single zone and deploying multiple stamps across the different zones. This
ensures zonal redundancy is maintained at the level of an encapsulated deployment stamp rather than a
single application component.
Where possible, treat each deployment stamp as independent and disconnected from other stamps.
Use data platform technologies to synchronize state across regions rather than achieving consistency
at an application level with direct network paths.
Avoid 'dog legging' traffic between different regions unless necessary, even in a failure scenario. Use
global routing services and end-to-end health probes to take an entire stamp out of circulation in the
event that a single critical component tier fails, rather than routing traffic at that faulty component
level to another region.
For hyper latency sensitive application scenarios, prioritize the use of zones with regional network
gateways to optimize network latency for ingress paths.
Micro-segmentation is a network security design pattern used to isolate and secure individual application
workloads, with policies applied to limit network traffic between workloads based on a Zero Trust model. It's
typically applied to reduce network attack surface, improve breach containment, and strengthen security
through policy-driven application-level network controls.
A mission-critical application can enforce application-level network security using Network Security Groups
(NSG) at either a subnet or network interface level, service Access Control Lists (ACL), and network policies
when using Azure Kubernetes Service (AKS).
This section explores the optimal use of these capabilities, providing key considerations and recommendations
to achieve application-level micro-segmentation.
AKS can be deployed in two different networking models:
Kubenet networking:Kubenet networking: AKS nodes are integrated within an existing virtual network, but pods exist
within a virtual overlay network on each node. Traffic between pods on different nodes is routed
through kube-proxy.
Azure Container Networking Interface (CNI) networking:Azure Container Networking Interface (CNI) networking: The AKS cluster is integrated within
an existing virtual network and its nodes, pods and services received IP addresses from the same
virtual network the cluster nodes are attached to. This is relevant for various networking scenarios
requiring direct connectivity from and to pods. Different node pools can be deployed into different
subnets.
Design recommendationsDesign recommendations
NOTENOTE
Azure CNI requires more IP address space compared to Kubenet. Proper upfront planning and sizing of the
network is required. For more information, refer to the Azure CNI documentation.
By default, pods are non-isolated and accept traffic from any source and can send traffic to any
destination; a pod can communicate with every other pod in a given Kubernetes cluster; Kubernetes
doesn't ensure any network level isolation, and doesn't isolate namespaces at the cluster level.
Communication between Pods and Namespaces can be isolated using Network Policies. Network Policy is
a Kubernetes specification that defines access policies for communication between Pods. Using Network
Policies, an ordered set of rules can be defined to control how traffic is sent/received, and applied to a
collection of pods that match one or more label selectors.
AKS supports two plugins that implement Network Policy,
Azure
and
Calico
. Both plugins use Linux
IPTables to enforce the specified policies. See Differences between Azure and Calico policies and their
capabilities for more details.
Network policies don't conflict since they're additive.
For a network flow between two pods to be allowed, both the egress policy on the source pod and the
ingress policy on the destination pod need to allow the traffic.
The network policy feature can only be enabled at cluster instantiation time. It is not possible to enable
network policy on an existing AKS cluster.
The delivery of network policies is consistent regardless of whether Azure or Calico is used.
Calico provides a richer feature set, including support for windows-nodes and supports Azure CNI as
well as Kubenet.
AKS supports the creation of different node pools to separate different workloads using nodes with
different hardware and software characteristics, such as nodes with and without GPU capabilities.
Using node pools doesn't provide any network-level isolation.
Node pools can use different subnets within the same virtual network. NSGs can be applied at the
subnet-level to implement micro-segmentation between node pools.
Configure an NSG on all considered subnets to provide an IP ACL to secure ingress paths and isolate
application components based on a Zero Trust model.
Use Front Door Service Tags within NSGs on all subnets containing application backends defined
within Azure Front Door, since this will validate traffic originates from a legitimate Azure Front
Door backend IP address space. This will ensure traffic flows through Azure Front Door at a service
level, but header based filtering will still be required to ensure the use of a particular Front Door
instance and to also mitigate 'IP spoofing' security risks.
Public internet traffic should be disabled on RDP and SSH ports across all applicable NSGs.
Prioritize the use of the Azure CNI network plugin and consider Kubenet for scenarios with a
limited range of available IP addresses to fit the application within a constrained address space.
AKS supports the use of both Azure CNI and Kubenet. It is selected at deployment time.
The Azure CNI network plugin is a more robust and scalable network plugin, and is
recommended for most scenarios.
Kubenet is a more lightweight network plugin, and is recommended for scenarios with a
limited range of available IP addresses.
See Azure CNI for more details.
Next step
The Network Policy feature in Kubernetes should be used to define rules for ingress and egress traffic
between pods in a cluster. Define granular Network Policies to restrict and limit cross-pod
communication.
Enable Network Policy for Azure Kubernetes Service at deployment time.
Prioritize the use of
Calico
because it provides a richer feature set with broader community adoption
and support.
Review the considerations for quantifying and observing application health.
Health modeling and observability
Health modeling and observability of mission
-
critical workloads on Azure
12/16/2022 27 minutes to read • Edit Online
IMPORTANTIMPORTANT
Video: Define a health model for your mission-critical workload
Layered application health
Health modeling and observability are essential concepts to maximize reliability, which focuses on robust and
contextualized instrumentation and monitoring. These concepts provide critical insight into application health,
promoting the swift identification and resolution of issues.
Most mission-critical applications are significant in terms of both scale and complexity and therefore generate
high volumes of operational data, which makes it challenging to evaluate and determine optimal operational
action. Health modeling ultimately strives to maximize observability by augmenting raw monitoring logs and
metrics with key business requirements to quantify application health, driving automated evaluation of health
states to achieve consistent and expedited operations.
This design area focuses on the process to define a robust health model, mapping quantified application health
states through observability and operational constructs to achieve operational maturity.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The code assets illustrate
considerations and recommendations for observing application health.
There are three main levels of operational maturity when striving to maximize reliability.
1.
Detect
and respond to issues as they happen.
2.
Diagnose
issues that are occurring or have already occurred.
3.
Predict
and prevent issues before they take place.
To build a health model, first define application health in the context of key business requirements by
quantifying ‘healthy and ‘unhealthy’ states in a layered and measurable format. Then, for each application
component, refine the definition in the context of a steady running state and aggregated according to the
application user flows. Superimpose with key non-functional business requirements for performance and
availability. Finally, aggregate the health states for each individual user flow to form an acceptable
representation of the overall application health. Once established, these layered health definitions should be
used to inform critical monitoring metrics across all system components and validate operational subsystem
composition.
IMPORTANTIMPORTANT
Design considerationsDesign considerations
Design recommendationsDesign recommendations
When defining what 'unhealthy' states, represent for all levels of the application. It's important to distinguish between
transient and non-transient failure states to qualify service degradation relative to unavailability.
The process of modeling health is a top-down design activity that starts with an architectural exercise to
define all user flows and map dependencies between functional and logical components, thereby
implicitly mapping dependencies between Azure resources.
A health model is entirely dependent on the context of the solution it represents, and therefore can't be
solved 'out-of-the-box' because 'one size doesn't fit all'.
Applications will differ in composition and dependencies
Metrics and metric thresholds for resources must also be finely tuned in terms of what values
represent healthy and unhealthy states, which are heavily influenced by encompassed application
functionality and target non-functional requirements.
A layered health model enables application health to be traced back to lower level dependencies, which
helps to quickly root cause service degradation.
To capture health states for an individual component, that component's distinct operational characteristics
must be understood under a steady state that is reflective of production load. Performance testing is
therefore a key capability to define and continually evaluate application health.
Failures within a cloud solution may not happen in isolation. An outage in a single component may lead
to several capabilities or additional components becoming unavailable.
Such errors may not be immediately observable.
Define a measurable health model as a priority to ensure a clear operational understanding of the entire
application.
The health model should be layered and reflective of the application structure.
The foundational layer should consider individual application components, such as Azure resources.
Foundational components should be aggregated alongside key non-functional requirements to build
a business-contextualized lens into the health of system flows.
System flows should be aggregated with appropriate weights based on business criticality to build a
meaningful definition of overall application health. Financially significant or customer-facing user
flows should be prioritized.
Each layer of the health model should capture what healthy and ‘unhealthy’ states represent.
Ensure the heath model can distinguish between transient and non-transient unhealthy states to
isolate service degradation from unavailability.
Represent health states using a granular health score for every distinct application component and every
user flow by aggregating health scores for mapped dependent components, considering key non-
functional requirements as coefficients.
The health score for a user flow should be represented by the lowest score across all mapped
components, factoring in relative attainment against non-functional requirements for the user flow.
The model used to calculate health scores must consistently reflect operating health, and if not, the
model should be adjusted and redeployed to reflect new learnings.
Define health score thresholds to reflect health status.
The health score must be calculated automatically based on underlying metrics, which can be visualized
Example
-
Layered health modelExample
-
Layered health model
through observability patterns and acted on through automated operational procedures.
The health score should become core to the monitoring solution, so that operating teams no longer
have to interpret and map operational data to application health.
Use the health model to calculate availability Service Level Objective (SLO) attainment instead of raw
availability, ensuring the demarcation between service degradation and unavailability is reflected as
separate SLOs.
Use the health model within CI/CD pipelines and test cycles to validate application health is maintained
after code and configuration updates.
The health model should be used to observe and validate health during load testing and chaos testing
as part of CI/CD processes.
Building and maintaining a health model is an iterative process and engineering investment should be
aligned to drive continuous improvements.
Define a process to continually evaluate and fine-tune the accuracy of the model, and consider
investing in machine learning models to further train the model.
This is a simplified representation of a layered application health model for illustrative purposes. A
comprehensive and contextualized health model is provided in the Mission-Critical reference implementations:
Mission-Critical Online
Mission-Critical Connected
When implementing a health model it's important to define the health of individual components through the
aggregation and interpretation of key resource-level metrics. An example of how resource metrics can be used
is the image below:
This definition of health can subsequently be represented by a KQL query, as demonstrated by the example AKS
query below that aggregates InsightsMetrics (AKS Container insights) and AzureMetrics (Azure diagnostics) and
compares (inner join) against modeled health thresholds.
// ClusterHealthStatus
let Thresholds=datatable(MetricName: string, YellowThreshold: double, RedThreshold: double) [
// Disk Usage:
"used_percent", 50, 80,
// Average node cpu usage %:
"node_cpu_usage_percentage", 60, 90,
// Average node disk usage %:
"node_disk_usage_percentage", 60, 80,
// Average node memory usage %:
"node_memory_rss_percentage", 60, 80
];
InsightsMetrics
| summarize arg_max(TimeGenerated, *) by Computer, Name
| project TimeGenerated,Computer, Namespace, MetricName = Name, Value=Val
| extend NodeName = extract("([a-z0-9-]*)(-)([a-z0-9]*)$", 3, Computer)
| union (
AzureMetrics
| extend ResourceType = extract("(PROVIDERS/MICROSOFT.)([A-Z]*/[A-Z]*)", 2, ResourceId)
| where ResourceType == "CONTAINERSERVICE/MANAGEDCLUSTERS"
| summarize arg_max(TimeGenerated, *) by MetricName
| project TimeGenerated, MetricName, Namespace = "AzureMetrics", Value=Average
)
| lookup kind=inner Thresholds on MetricName
| extend IsYellow = iff(Value > YellowThreshold and Value < RedThreshold, 1, 0)
| extend IsRed = iff(Value > RedThreshold, 1, 0)
| project NodeName, MetricName, Value, YellowThreshold, IsYellow, RedThreshold, IsRed
// ClusterHealthScore
ClusterHealthStatus
| summarize YellowScore = max(IsYellow), RedScore = max(IsRed)
| extend HealthScore = 1-(YellowScore*0.25)-(RedScore*0.5)
Demo video: Monitoring and health modeling demo
The resulting table output can subsequently be transformed into a health score for easier aggregation at higher
levels of the health model.
These aggregated scores can subsequently be represented as a dependency chart using visualization tools such
as Grafana to illustrate the health model.
This image shows an example layered health model from the Azure Mission-Critical online reference
implementation, and demonstrates how a change in health state for a foundational component can have a
cascading impact to user flows and overall application health (the example values correspond to the table in the
previous image).
Unified data sink for correlated analysis
Design considerationsDesign considerations
TIPTIP
Many operational datasets must be gathered from all system components to accurately represent a defined
heath model, considering logs and metrics from both application components and underlying Azure resources.
This vast amount of data ultimately needs to be stored in a format that allows for near-real time interpretation
to facilitate swift operational action. Moreover, correlation across all encompassed data sets is required to
ensure effective analysis is unbounded, allowing for the layered representation of health.
A unified data sink is required to ensure all operational data is swiftly stored and made available for correlated
analysis to build a 'single pane' representation of application health. Azure provides several different operational
technologies under the umbrella of Azure Monitor, and Azure Monitor Log Analytics serves as the core Azure-
native data sink to store and analyze operational data.
Azure MonitorAzure Monitor
Azure Monitor is enabled by default for all Azure subscriptions, but Azure Monitor for Logs (Log
Analytics) and Azure Application Insights resources must be deployed and configured to incorporate data
collection and querying capabilities.
Azure Monitor supports three types of observability data: logs, metrics, and distributed traces.
Logs are stored in Azure Monitor Logs workspaces based on Azure Data Explorer. Log queries are
stored in query packs that can be shared across subscriptions, and are used to drive observability
components such as dashboards, workbooks, or other reporting and visualization tools.
Metrics are stored in an internal time-series diagnostic service database. For most Azure resources,
the retention period is retained for 93 days. Metric collection is configured through resource
Diagnostic settings.
All Azure resources expose logs and metrics, but resources must be appropriately configured to route
diagnostic data to your desired data sink.
Azure provides various Built-In Policies that can be applied to ensure deployed resources are configured to send logs and
metrics to an Azure Monitor instance.
It's not uncommon for regulatory controls to require operational data remains within originating
geographies or countries. Regulatory requirements may stipulate the retention of critical data types for
an extended period of time. For example, in regulated banking, audit data must be retained for at least
seven years.
Different operational data types may require different retention periods. For example, security logs may
need to be retained for a long period, while performance data is unlikely to require long-term retention
outside the context of AIOps.
Data can be exported from Log Analytics Workspaces for long term retention and/or auditing purposes.
Azure Monitor Logs Dedicated Clusters provides a deployment option that enables Availability Zones for
protection from zonal failures in supported Azure regions. Dedicated Clusters require a minimum daily
data ingest commitment.
Azure Monitor for Logs resources, including underlying log and metrics storage, are deployed into a
specified Azure region.
To protect against loss of data from unavailability of an Azure Monitor for Logs workspace, resources can
be configured with multiple Diagnostics configurations. Each Diagnostic configuration can target metrics
and logs at a separate Azure Monitor for Log workspace.
Each additional Azure Monitor for Logs workspace will incur extra costs.
The redundant Azure Monitor for Logs workspaces can be deployed into the same Azure region, or
into separate Azure regions for additional regional redundancy.
Sending logs and metrics from an Azure resource to an Azure Monitor for Logs workspace in a
different region will incur inter-region data egress costs.
Some Azure resources require an Azure Monitor for Logs workspace within the same region as the
resource itself.
Azure Monitor Logs workspace data can be exported to Azure Storage or Azure Event Hubs on a
continuous, scheduled, or one-time basis.
Data export allows for long-term data archiving and protects against possible operational data loss
due to unavailability.
Available export destinations are Azure Storage or Azure Event Hubs. Azure Storage can be configured
for different redundancy levels including zonal or regional. Data export to Azure Storage stores the
data within .json files.
Data export destinations must be within the same Azure region as the Azure Monitor Logs workspace.
An event hub data export destination to be within the same region as the Azure Monitor Logs
workspace. Azure Event Hubs geo-disaster recovery isn't applicable for this scenario.
There are several data export limitations. Only specific Azure Monitor Logs tables are supported for
data export.
Azure Monitor Logs has user query throttling limits, which may appear as reduced availability to clients,
such as observability dashboards.
Five concurrent queries per user: if five queries are already running, additional queries are placed in a
per-user concurrency queue until a running query ends.
Time in concurrency queue: if a query sits in the concurrency queue for over three minutes, it will be
terminated and a 429 error code returned.
Concurrency queue depth limit: the concurrency queue is limited to 200 queries, and additional
queries will be rejected with a 429 error code.
Query rate limit: there's a per-user limit of 200 queries per 30 seconds across all workspaces.
Query Packs are Azure Resource Manager resources, which can be used to protect and recover Azure
Monitor Logs queries if Azure Monitor Logs workspace is unavailable.
Query Packs contain queries as JSON and can be stored external to Azure similar to other
infrastructure-as-code assets.
Design recommendationsDesign recommendations
Deployable through the Microsoft.Insights REST API.
If an Azure Monitor for Logs workspace must be re-created the Query Pack can be redeployed
from an externally stored definition.
Application Insights can be deployed in a workspace-based deployment model, underpinned by a Log
Analytics Workspace where all the data is stored.
Sampling can be enabled within Application Insights to reduce the amount of telemetry sent and
optimize data ingest costs.
Log Analytics and Application Insights charge based on the volume of data ingested and the duration that
data is retained for.
Data ingested into a Log Analytics Workspace can be retained at no additional charge up to first 31
days (90 days if Sentinel is enabled)
Data ingested into a Workspace-based Application Insights is retained for the first 90 days at no extra
charge.
The Log Analytics Commitment Tier pricing model provides a predictable approach to data ingest
charges.
Any usage above the reservation level is billed at the same price as the current tier.
Azure Monitor Log Analytics, Application Insights, and Azure Data Explorer use the Kusto Query
Language (KQL).
Log Analytics queries are saved as
functions
within Log Analytics ( savedSearches ).
Use Azure Monitor for Logs (Log Analytics) as a unified data sink to provide a 'single pane' across all
operational data sets.
Decentralize Log Analytics Workspaces across all used deployment regions. Each Azure region with an
application deployment should consider a Log Analytics Workspace to gather all operational data
originating from that region. All global resources should use a separate dedicated Log Analytics
Workspace, which should be deployed within a primary deployment region.
All deployment stamps within the same region can use the same regional Log Analytics Workspace.
Sending all operational data to a single Log Analytics Workspace would create a single point of
failure.
Requirements for data residency might prohibit data leaving the originating region, and
federated workspaces solves for this requirement by default.
There's a substantial egress cost associated with transferring logs and metrics across regions.
Consider configuring resources with multiple diagnostic configurations pointing to different Azure
Monitor for Logs workspaces to protect against Azure Monitor unavailability for applications with fewer
regional deployment stamps.
Use Application Insights as a consistent Application Performance Monitoring (APM) tool across all
application components to collect application logs, metrics, and traces.
Deploy Application Insights in a workspace-based configuration to ensure each regional Log Analytics
Workspaces contains logs and metrics from both application components and underlying Azure
resources.
Use Cross-Workspace queries to maintain a unified 'single pane' across the different workspaces.
Use Query Packs to protect Azure Monitor Logs queries in the event of workspace unavailability.
Store query packs within the application git repository as infrastructure-as-code assets.
All Log Analytics Workspaces should be treated as long-running resources with a different life-cycle to
NOTENOTE
application resources within a regional deployment stamp.
Export critical operational data from Log Analytics for long-term retention and analytics to facilitate
AIOps and advanced analytics to refine the underlying health model and inform predictive action.
Carefully evaluate which data store should be used for long-term retention; not all data has to be stored
in a hot and queryable data store.
It's strongly recommended to use Azure Storage in a GRS configuration for long-term operational
data storage.
Use the Log Analytics Export capability to export all available data sources to Azure Storage.
Select appropriate retention periods for operational data types within log analytics, configuring longer
retention periods within the workspace where 'hot' observability requirements exist.
Use Azure Policy to ensure all regional resources route operational data to the correct Log Analytics
Workspace.
When deploying into an Azure landing zone, if there's a requirement for centralized storage of operational data, you can
fork data at instantiation so it's ingested into both centralized tooling and Log Analytics Workspaces dedicated to the
application. Alternatively, expose access to application Log Analytics workspaces so that central teams can query
application data. It's ultimately critical that operational data originating from the solution is available within Log Analytics
Workspaces dedicated to the application.
If SIEM integration is required, do not send raw log entries, but instead send critical alerts.
Only configure sampling within Application Insights if it's required to optimize performance, or if not
sampling becomes cost prohibitive.
Excessive sampling can lead to missed or inaccurate operational signals.
Use correlation IDs for all trace events and log messages to tie them to a given request.
Return correlation IDs to the caller for all calls not just failed requests.
Ensure application code incorporates proper instrumentation and logging to inform the health model and
facilitate subsequent troubleshooting or root cause analysis when required.
Application code should use Application Insights to facilitate Distributed Tracing, by providing the
caller with a comprehensive error message that includes a correlation ID when a failure occurs.
Use structured logging for all log messages.
Add meaningful health probes to all application components.
When using AKS, configure the health endpoints for each deployment (pod) so that Kubernetes can
correctly determine when a pod is healthy or unhealthy.
When using Azure App Service, configure the Health Checks so that scale out operations will not
cause errors by sending traffic to instances that are not-yet ready, and making sure unhealthy
instances are recycled quickly.
If the application is subscribed to Microsoft Mission-Critical Support, consider exposing key health probes to
Microsoft Support, so application health can be modelled more accurately by Microsoft Support.
Log successful health check requests, unless increased data volumes can't be tolerated in the context of
application performance, since they provide additional insights for analytical modeling.
Do not configure production Log Analytics Workspaces to apply a daily cap, which limits the daily
Visualization
Design considerationsDesign considerations
Design RecommendationsDesign Recommendations
ingestion of operational data, since this can lead to the loss of critical operational data.
In lower environments, such as Development and Test, it can be considered as an optional cost saving
mechanism.
Provided operational data ingest volumes meet the minimum tier threshold, configure Log Analytics
Workspaces to use Commitment Tier based pricing to drive cost efficiencies relative to the 'pay-as-you-
go' pricing model.
It's strongly recommended to store Log Analytics queries using source control and use CI/CD automation
to deploy them to relevant Log Analytics instances.
Visually representing the health model with critical operational data is essential to achieve effective operations
and maximize reliability. Dashboards should ultimately be utilized to provide near-real time insights into
application health for DevOps teams, facilitating the swift diagnosis of deviations from steady state.
Microsoft provides several data visualization technologies, including Azure Dashboards, Power BI, and Azure
Managed Grafana (currently in-preview). Azure Dashboards is positioned to provide a tightly integrated out-of-
the-box visualization solution for operational data within Azure Monitor. It therefore has a fundamental role to
play in the visual representation of operational data and application health for a mission-critical workload.
However, there are several limitations in terms of the positioning of Azure Dashboards as a holistic observability
platform, and as a result consideration should be given to the supplemental use of market-leading observability
solutions, such as Grafana, which is also provided as a managed solution within Azure.
This section focuses on the use of Azure Dashboards and Grafana to build a robust dashboarding experience
capable of providing technical and business lenses into application health, enabling DevOps teams and effective
operation. Robust dashboarding is essential to diagnose issues that have already occurred, and support
operational teams in detecting and responding to issues as they happen.
When visualizing the health model using Log Analytics queries, note that there are Log Analytics limits on
concurrent and queued queries, as well as the overall query rate, with subsequent queries queued and
throttled.
Queries to retrieve operational data used to calculate and represent health scores can be written and
executed in either Azure Monitor Log Analytics or Azure Data Explorer.
Sample queries are available here.
Log Analytics imposes several query limits, which must be designed for when designing operational
dashboards.
The visualization of raw resource metrics, such as CPU utilization or network throughput, requires
manual evaluation by operations teams to determine health status impacts, and this can be challenging
during an active incident.
If multiple users use dashboards within a tool like Grafana, the number of queries sent to Log Analytics
multiplies quickly.
Reaching the concurrent query limit on Log Analytics will queue subsequent queries, making the
dashboard experience feel 'slow'.
Collect and present queried outputs from all regional Log Analytics Workspaces and the global Log Analytics
Workspace to build a unified view of application health.
NOTENOTE
If deploying into an Azure landing zone, consider querying the central platform Log Analytics Workspace if key
dependencies on platform resources exist, such as ExpressRoute for on-premises communication.
A ‘traffic light’ model should be used to visually represent 'healthy' and 'unhealthy' states, with green
used to illustrate when key non-functional requirements are fully satisfied and resources are optimally
utilized. Use "Green", "Amber, and "Red" to represent "Healthy", "Degraded", and "Unavailable" states.
Use Azure Dashboards to create operational lenses for global resources and regional deployment stamps,
representing key metrics such as request count for Azure Front Door, server side latency for Azure
Cosmos DB, incoming/outgoing messages for Event Hubs, and CPU utilization or deployment statuses for
AKS. Dashboards should be tailored to drive operational effectiveness, infusing learnings from failure
scenarios to ensure DevOps teams have direct visibility into key metrics.
If Azure Dashboards can't be used to accurately represent the health model and requisite business
requirements, then it's strongly recommended to consider Grafana as an alternative visualization
solution, providing market-leading capabilities and an extensive open-source plugin ecosystem. Evaluate
the managed Grafana preview offering to avoid the operational complexities of managing Grafana
infrastructure.
When deploying self-hosted Grafana, employ a highly available and geo-distributed design to ensure
critical operational dashboards can be resilient to regional platform failures and cascading error
scenarios.
Separate configuration state into an external datastore, such as Azure Database for Postgres or
MySQL, to ensure Grafana application nodes remain stateless.
Configure database replication across deployment regions.
Deploy Grafana nodes to App Services in a highly available configuration across ones within a
region, using container based deployments.
Deploy App Service instances across considered deployment regions.
App Services provides a low-friction container platform, which is ideal for low-scale scenarios
such as operational dashboards, and isolating Grafana from AKS provides a clear separation of
concern between the primary application platform and operational representations for that
platform. Please refer to the Application Platform deign area for further configuration
recommendations.
Use Azure Storage in a GRS configuration to host and manage custom visuals and plugins.
Deploy app service and database read-replica Grafana components to a minimum of two
deployment regions, and consider employing a model where Grafana is deployed to all considered
deployment regions.
For scenarios targeting a >= 99.99% SLO, Grafana should be deployed within a minimum of 3 deployment
regions to maximize overall reliability for key operational dashboards.
Mitigate Log Analytics query limits by aggregating queries into a single or small number of queries, such
as by using the KQL 'union' operator, and set an appropriate refresh rate on the dashboard.
An appropriate maximum refresh rate will depend on the number and complexity of dashboard
queries; analysis of implemented queries is required.
If the concurrent query limit of log analytics is being reached, consider optimizing the retrieval pattern by
Automated incident response
IMPORTANTIMPORTANT
Design considerationsDesign considerations
Design recommendationsDesign recommendations
(temporarily) storing the data required for the dashboard in a high performance datastore such as Azure
SQL.
While the visual representations of application health provide invaluable operational and business insights to
support issue detection and diagnosis, it relies on the readiness and interpretations of operational teams, as well
as the effectiveness of subsequent human-triggered responses. Therefore, to maximize reliability it's necessary
to implement extensive alerting to detect proactively and respond to issues in near real-time.
Azure Monitor provides an extensive alerting framework to detect, categorize, and respond to operational
signals through Action Groups. This section will therefore focus on the use of Azure Monitor alerts to drive
automated actions in response to current or potential deviations from a healthy application state.
Alerting and automated action is critical to effectively detect and swiftly respond to issues as they happen, before greater
negative impact can occur. Alerting also provides a mechanism to interpret incoming signals and respond to prevent
issues before they occur.
Alert rules are defined to fire when a conditional criteria is satisfied for incoming signals, which can
include various data sources, such as metrics, log search queries, or availability tests.
Alerts can be defined within Log Analytics or Azure Monitor on the specific resource.
Some metrics are only interrogatable within Azure Monitor, since not all diagnostic data points are made
available within Log Analytics.
The Azure Monitor Alerts API can be used to retrieve active and historic alerts.
There are subscription limits related to alerting and action groups, which must be designed for:
Limits exist for the number of configurable alert rules.
The Alerts API has throttling limits, which should be considered for extreme usage scenarios.
Action Groups have several hard limits for the number of configurable responses, which must be
designed for.
Each response type has a limit of 10 actions, apart from email, which has a limit of 1,000
actions.
Alerts can be integrated within a layered health model by creating an Alert Rule for a saved log search
query from the model's 'root' scoring function. For example, using 'WebsiteHealthScore' and alerting on
a numeric value that represents an 'Unhealthy' state.
For resource-centric alerting, create alert rules within Azure Monitor to ensure all diagnostic data is
available for the alert rule criteria.
Consolidate automated actions within a minimal number of Action Groups, aligned with service teams to
support a DevOps approach.
Respond to excessive resource utilization signals through automated scale operations, using Azure-native
auto-scale capabilities where possible. Where built-in auto-scale functionality isn't applicable, use the
component health score to model signals and determine when to respond with automated scale
operations. Ensure automated scale operations are defined according to a capacity model, which
quantifies scale relationships between components, so that scale responses encompass components that
Predictive action and AI operations (AIOps)
need to be scaled in relation to other components.
Model actions to accommodate a prioritized ordering, which should be determined by business impact.
Use the Azure Monitor Alerts API to gather historic alerts to incorporate within 'cold' operational storage
for advanced analytics.
For critical failure scenarios, which can't be met with an automated response, ensure operational
'runbook automation' is in-place to drive swift and consistent action once manual interpretation and sign
out is provided. Use alert notifications to drive swift identification of issues requiring manual
interpretation
Create allowances within engineering sprints to drive incremental improvements in alerting to ensure
new failure scenarios that haven't previously been considered can be fully accommodated within new
automated actions.
Conduct operational readiness tests as part of CI/CD processes to validate key alert rules for deployment
updates.
Machine learning models can be applied to correlate and prioritize operational data, helping to gather critical
insights related to filtering excessive alert 'noise' and predicting issues before they cause impact, as well as
accelerating incident response when they do.
More specifically, an AIOps methodology can be applied to critical insights about the behavior of the system,
users, and DevOps processes. These insights can include identifying a problem happening now (
detect
),
quantifying why the problem is happening (
diagnose
), or signaling what will happen in the future (
predict
). Such
insights can be used to drive actions that adjust and optimize the application to mitigate active or potential
issues, using key business metrics, system quality metrics, and DevOps productivity metrics, to prioritize
according to business impact. Conducted actions can themselves be infused into the system through a feedback
loop that further trains the underlying model to drive additional efficiencies.
There are multiple analytical technologies within Azure, such as Azure Synapse and Azure Databricks, which can
be used to build and train analytical models for AIOps. This section will therefore focus on how these
technologies can be positioned within an application design to accommodate AIOps and drive predictive action,
focusing on Azure Synapse that reduces friction by bringing together the best of Azure's data services along
with powerful new features.
AIOps is used to drive predictive action, interpreting and correlating complex operational signals observed over
Design considerationsDesign considerations
Design recommendationsDesign recommendations
a sustained period in order to better respond to and prevent issues before they occur.
Azure Synapse Analytics offers multiple Machine Learning (ML) capabilities.
ML models can be trained and run on Synapse Spark Pools with libraries including MLLib, SparkML
and MMLSpark, as well as popular open-source libraries, such as Scikit Learn.
ML models can be trained with common data science tools like PySpark/Python, Scala, or .NET.
Synapse Analytics is integrated with Azure ML through Azure Synapse Notebooks, which enables ML
models to be trained in an Azure ML Workspace using Automated ML.
Synapse Analytics also enables ML capabilities using Azure Cognitive Services to solve general problems
in various domains, such as Anomaly Detection. Cognitive Services can be used in Azure Synapse, Azure
Databricks, and via SDKs and REST APIs in client applications.
Azure Synapse natively integrates with Azure Data Factory tools to extract, transform, and load (ETL) or
ingest data within orchestration pipelines.
Azure Synapse enables external dataset registration to data stored in Azure Blob storage or Azure Data
Lake Storage.
Registered datasets can be used in Synapse Spark pool data analytics tasks.
Azure Databricks can be integrated into Azure Synapse Analytics pipelines for additional Spark
capabilities.
Synapse orchestrates reading data and sending it to a Databricks cluster, where it can be transformed
and prepared for ML model training.
Source data typically needs to be prepared for analytics and ML.
Synapse offers various tools to assist with data preparation, including Apache Spark, Synapse
Notebooks, and serverless SQL pools with T-SQL and built-in visualizations.
ML models that have been trained, operationalized, and deployed can be used for
batch
scoring in
Synapse.
AIOps scenarios, such as running regression or degradation predictions in CI/CD pipelined, may
require
real-time
scoring.
There are subscription limits for Azure Synapse, which should be fully understood in the context of an
AIOps methodology.
To fully incorporate AIOps it's necessary to feed near real-time observability data into real-time ML
inference models on an ongoing basis.
Capabilities such as anomaly detection should be evaluated within the observability data stream.
Ensure all Azure resources and application components are fully instrumented so that a complete
operational dataset is available for AIOps model training.
Ingest Log Analytics operational data from the global and regional Azure Storage Accounts into Azure
Synapse for analysis.
Use the Azure Monitor Alerts API to retrieve historic alerts and store it within cold storage for operational
data to subsequently use within ML models. If Log Analytics data export is used, store historic alerts data
in the same Azure Storage accounts as the exported Log Analytics data.
After ingested data is prepared for ML training, write it back out to Azure Storage so that it's available for
ML model training without requiring Synapse data preparation compute resources to be running.
Next step
Ensure ML model operationalization supports both batch and real-time scoring.
As AIOps models are created, implement MLOps and apply DevOps practices to automate the ML
lifecycle for training, operationalization, scoring, and continuous improvement. Create an iterative CI/CD
process for AIOps ML models.
Evaluate Azure Cognitive Services for specific predictive scenarios due to their low administrative and
integration overhead. Consider Anomaly Detection to quickly flag unexpected variances in observability
data streams.
Review the deployment and testing considerations.
Deployment and testing
Deployment and testing for mission
-
critical
workloads on Azure
12/16/2022 29 minutes to read • Edit Online
IMPORTANTIMPORTANT
Application environments
Application outages are often caused by failed deployments or erroneous releases. This is why the design of
Continuous Integration and Continuous Deployment (CI/CD) pipelines for deployment and testing
methodologies play a critical role in the overall reliability of a mission-critical application.
Deployment and testing shouldn't be constrained to the delivery of planned application updates, but instead
should form the basis for how all application and infrastructure operations are conducted to ensure consistent
outcomes for mission-critical workloads. The variety of deployment contexts results in a frequent, and often
daily, deployment cadence. Therefore using CI/CD pipelines to exhibit maximum reliability is highly
recommended because they perform a critical operational function for a mission-critical application. The
strategy should aspire for:
Rigorous pre-release testingRigorous pre-release testing. Updates shouldn't introduce defects, vulnerabilities, or other factors that
might jeopardize application health.
Transparent deploymentsTransparent deployments. All clients and users should be able to continue application interaction without
interruption using a zero-downtime deployment approach.
Highly available operationsHighly available operations. Deployment and testing processes must themselves be highly available to
support overall application reliability.
End-to-End automationEnd-to-End automation. Manual intervention in the technical execution of deployment and testing
operations represents a significant reliability risk.
Consistent deployment processConsistent deployment process. Same application artifacts and processes are used to deploy the
infrastructure and application code across different environments.
This design area focuses on how to eradicate downtime and maintain application health for deployment
operations, providing key considerations and recommendations intended to inform the design of optimal CI/CD
pipelines for a mission-critical application.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The code assets illustrate
considerations and recommendations for acheiving optimal CI/CD pipelines for a mission-critical application.
Before considering deployment processes and associated tooling, it's important to evaluate the application
environments required to appropriately validate and stage deployment operations. These environment types
will most likely differ in terms of requisite capabilities and longevity. Some environments might reflect
production on a permanent basis, others may be short lived with a reduced level of complexity. These
Design ConsiderationsDesign Considerations
Design recommendationsDesign recommendations
environments should be staged during the engineering and release cycle in order to ensure deployment
operations are fully tested before released into the production environment.
This section explores the key considerations and recommendations for application environments in a mission-
critical context, covering key design objectives such as developer agility and separation of concerns.
Development environmentsDevelopment environments
Development environments will typically not share the same reliability, capacity, and security
requirements as the production environment.
Given the reduced scale, reliability, and security requirements of a development environment, they can
more easily coexist within a single subscription.
It's likely that engineering teams will require multiple development environments to support the
completion of parallel feature development.
Development environments must be available when required, but need not exist permanently and
typically only exist for short periods of time. Keeping environments short lived saves costs, and prevents
configuration drift from the code base. Also, development environments often share the lifecycle of a
feature branch.
Development environments can also encompass the development of Infrastructure-as-Code (IaC)
artifacts such as Terraform or Azure Resource Manager (ARM) templates.
Staging EnvironmentsStaging Environments
Staging environments can vary depending on their intended function within the release cycle. They
typically resemble the requirements of the production environment for reliability, capacity, and security.
Staging environments can be used for various purposes, but the focus is testing and validation, with a
multitude of test cycles considered. Such as,
Load and performance testing.
Chaos testing.
Integration and build verification testing.
User acceptance testing.
Security and penetration testing.
Different test functions can be performed within the same environment, and in some cases this will be
required. For example, for chaos testing to provide meaningful results, the application must first be
placed under load to be able to understand how the application responds to injected faults. Chaos testing
and load testing are therefore typically performed in parallel.
During the early development cycles and in absence of a production load, a constant synthetic user load
against an environment should be used to provide realistic metrics and with valuable health modeling
input.
Production environmentsProduction environments
Some applications may consider multiple different production environments to cater to different clients,
users, or business functionality.
Ensure all environments reflect the production environment as much as possible, with simplifications
applied for lower environments as necessary.
Keep production environments separate from lower environments into a dedicated subscription. This
Demo video: Ephemeral dev environments and automated feature validationDemo video: Ephemeral dev environments and automated feature validation
Ephemeral blue/green deployments
helps to ensure resource utilization in lower-environments doesn't impact production quotas, and to
provide a clear governance boundary and separation of concerns. Depending on the scale requirements
of the application, multiple production subscriptions might be needed to serve as scale-units.
Separate development environments within a distinct subscription context, with all development
environments sharing the same subscription.
Ensure that there's an automated process to deploy code from a feature branch to a development
environment.
Treat development environments as ephemeral, sharing the lifecycle of the associated feature branch.
Define the number of staging environments and their purpose within the development and release cycle.
Avoid sharing components between environments. Possible exceptions are downstream security
appliances like firewalls, or source locations for synthetic test data.
Ensure at least one staging environment is fully reflective of production to enable production-like testing
and validation.
Capacity within this pre-production environment can flex based on the execution of test activities.
Use of a constant synthetic user load generation is required to provide a realistic test case for changes
on one of the pre-production environments.
The Mission Critical Online reference implementation provides an example user load generator.
A blue/green deployment approach requires a minimum of two identical deployment contexts, where an
existing deployment (blue) is actively serving user traffic, and a new secondary deployment (green) is
established and made ready to receive traffic.
After the new deployment is completed and tested, traffic is gradually switched from the blue deployment to
the green.
If the load transfer is successful, that new deployment becomes the new 'active' production environment and
the old, now 'inactive' deployment can be decommissioned.
If there are issues within the new deployment environment, the deployment can be aborted and traffic can
either remain in the old 'active' deployment, or be directed back to it.
Design ConsiderationsDesign Considerations
Design recommendationsDesign recommendations
This provides a clear fallback plan and minimizes potential for reliability issues, such as having to cut
production traffic to rectify faulty deployment issues.
To achieve zero interruptions while performing deployments to a mission-critical application, it's strongly
recommended to adopt a blue/green deployment approach for production environments with ephemeral
resources.
This allows new application code or resources to be deployed and tested in a new parallel environment, with
traffic only transitioned once ready in a phased process before subsequently decommissioning the old
environment.
A blue/green deployment can be implemented at either an application level or at the infrastructure level.
IMPORTANTIMPORTANT
Application LevelApplication Level: New code is deployed to a staging location within the existing infrastructure.
For example, Azure App Service provides this capability through secondary deployment slots that can
be swapped after the deployment, while in AKS this can be achieved using a separate pod deployment
on each node and updating the service definition.
This approach incurs less costs and is faster than a full infrastructure level blue/green deployment.
Infrastructure LevelInfrastructure Level: A deployment containing all infrastructure
and
application components within a
deployment scope.
New Azure resources are established before subsequently deploying application code to the new
infrastructure. When the new deployment has been fully tested and validated, traffic can be
transitioned through a phased process, and the old infrastructure can then be decommissioned when
appropriate.
The advantage of this approach is that all changes within the deployment scope are fully deployed and
tested in production before traffic is transitioned between the environments. Also, this approach
provides a much safer approach for any infrastructure-level changes within a release.
Individual deployments may take longer to complete using this methodology since it takes longer to
deploy the infrastructure and application than deploying the application in isolation.
There's an additional cost associated with an infrastructure based approach, since two deployment
contexts must exist side by side until the deployment is fully complete.
This infrastructure blue/green approach allows all changes within a deployment scope, both to the infrastructure
and application, to be achieved with zero downtime and maximum confidence. In addition all compatibilities with
downstream dependencies such as Azure platform, resource providers or IaC modules can be validated.
The blue and green environments can be long living and reused for each deployment, or treated as is
recommended to deploy a new infrastructure for each new deployment.
At an infrastructure level, the orchestration of user traffic between the blue and green environments can
be controlled using a global load balancer, such as Azure Front Door.
Utilize a blue/green deployment approach to release all production changes.
Prioritize an infrastructure level approach in order to achieve zero-downtime deployments and
provide one consistent deployment strategy for any kind of changes (application-level and/or
infrastructure-level). Use a global load balancer to orchestrate the automated transition of user traffic
between the blue and green environments.
Add a green backend endpoint and using a low traffic volume/weight, such as 10%.
Example
-
Zero
-
downtime deploymentExample
-
Zero
-
downtime deployment
After verifying that the low traffic volume on green is being managed as expected with a maintained
application health, the traffic can be gradually increased in increments until it reaches 100%.
When increasing traffic, a short ramp-up period should be applied to catch faults which may not come
to light immediately.
After all traffic has been migrated to the new green environment, remove the blue backend from
global load balancer service.
Decommission the old and inactive blue environment.
Repeat the process for the next deployment with blue and green reversed.
While blue and green environments can be reused, it's strongly recommended to deploy new
infrastructure for each new deployment. Treat each regional deployment stamp as ephemeral with a
lifecycle tied to that of a single release.
Decommission the old and inactive 'blue' environment, ensuring that any connections established while
this environment was active are also closed and any queues are drained before removing associated
resources. This will save costs relative to maintaining secondary production infrastructure and will ensure
new environments are free of configuration drift.
To prevent downtime, the process to control the transition of traffic between environments should be
fully automated.
Phase the transition of traffic between the blue/green environments to minimize client and user exposure
whilst confidence is established in the new environment.
Allow for a short ramp-up period when transitioning traffic between blue/green environments in order to
catch faults which may not come to light immediately.
Achieving zero-downtime deployments is a fundamental goal of a mission-critical application. However, it's a
complex issue that requires significant engineering investment and greatly influences the overall design. It's
important to invest effort up-front to define and plan deployment processes, to drive key design decisions such
as whether to treat resources as ephemeral.
The Mission-Critical Online and Azure Mission-Critical Connected reference implementations serve as practical
examples for these concepts and recommendations, to establish an optimized zero-downtime deployment
approach as represented in the illustration below.
Infrastructure-as-Code deployments
Design ConsiderationsDesign Considerations
Design RecommendationsDesign Recommendations
DevOps Tooling
Infrastructure-as-Code (IaC) treats infrastructure definitions as source code that is version controlled alongside
other application artifacts. Using IaC ensures code consistency across environments and eliminates the risk of
human error during automated deployments, and providing traceability and rollback.
The recommended approach for infrastructure-level blue/green deployment is to use IaC, with fully automated
and consistent infrastructure deployments.
Typically a mission-critical IaC repository has two resource definitions:
Global Resources: those that are deployed once within the solution, such as Azure Front Door and
Azure Cosmos DB.
Regional (
Stamp
) Resources: those that are deployed multiple times and into different regions, such as
AKS, Event Hubs and Storage.
Apply the concept of IaC and ensure all Azure resources are defined in declarative templates and
maintained in a source control repository from where they can be deployed automatically using CI/CD
pipelines.
Define infrastructure artifacts as declarative templates, and not as imperative scripts.
Ensure the deployment of both infrastructure and application components are fully automated.
Prohibit manual operations against production and lower environments. Only exception should be fully
independent developer environments.
There are a myriad of different products and services that can provide the necessary DevOps capabilities to
Design considerationsDesign considerations
effectively deploy and manage a mission-critical application; Microsoft provides two Azure-native toolsets
through GitHub
Actions
and Azure DevOps (ADO)
Pipelines
.
The appropriate and effective use of used deployment tooling is critical to ensure overall reliability for an
application, particularly because DevOps processes provide such a significant function within the overall
application design. For example, failover and scale operations may depend on automation provided by DevOps
tooling. Deployment tooling must therefore be implemented in a reliable and highly available manner, with
engineering teams understanding the application impact if the deployment service, or parts of it, become
unavailable.
This section focuses on the optimal use of GitHub Actions and Azure Pipelines and decision factors influencing
the optimal selection of DevOps tooling.
The capabilities of GitHub
Actions
and Azure DevOps (ADO)
Pipelines
are largely overlapping.
Different technologies can be used simultaneously to utilize the best features different technologies in
parallel.
A common approach is to hold code repositories in GitHub.com or GitHub AE whilst using the
deployment pipelines in ADO.
It should be noted that the use of multiple technologies adds an element of complexity and impacts
the risk landscape.
Azure PipelinesAzure Pipelines
Azure Pipelines provides highly mature deployment pipelines, including features like gates and
approvals.
ADO instances are hosted in a single Azure region that is chosen at organization-level.
Data is replicated across regions but only for Disaster Recovery purposes.
Hosted build agents are utilized from the same region as the ADO instance.
In the context of mission-critical aspiration for maximum reliability, the dependency on a single Azure
region represents an operational risk. For example, consider a scenario where traffic is spread over West
Europe and North Europe, with West Europe hosting the ADO instance. If West Europe experiences an
outage, the ADO instance would also be affected. While North Europe would automatically now handle
all application traffic, the ability to deploy additional scale-units to North Europe, in order to provide a
consistent failover experience, would be prohibited which may result in a severely degraded application
experience until the issue is resolved.
GitHub ActionsGitHub Actions
GitHub.com is well known and adopted by developers and used for many open source projects.
GitHub.com instances are also hosted in a single Azure region.
Data is replicated across regions but only for Disaster Recovery purposes.
A private and dedicated GitHub AE offering is available in a limited public preview.
GitHub Actions is well-suited for build-related tasks (Continuous Integration).
GitHub Actions is less mature when it comes to deployment tasks (Continuous Deployment).
Templating and reuse of pipeline steps is limited.
Gates and approval options are limited.
Missing options to control pipeline execution, such as the exclusion of specific stages.
Design recommendationsDesign recommendations
Branching strategy
Design considerationsDesign considerations
Design recommendationsDesign recommendations
IMPORTANTIMPORTANT
Define an availability SLA for deployment tooling and ensure alignment with broader application
reliability requirements.
In a multi-region scenario with an active-passive or active-active application deployment configuration,
ensure that failover orchestration and scaling operations can continue to function even if the primary
region hosting deployment toolsets becomes unavailable.
Branching strategies are a fundamental aspect of application source control, and while there are many valid
approaches to apply branching, there are several key aspects that should be considered in the context of a
mission-critical application scenario to ensure maximum reliability for mission-critical workloads.
Developers will carry out their work in
feature/*
and
fix/*
branches and these are the entry points for
changes.
Restrictions can be applied to branches as part of the branching strategy, such as only allowing
administrators to create release branches, or enforcing naming conventions for branches.
There might be rare occasions where a hotfix is urgently required and applied directly into an existing
production environment. Examples of potential hotfixes include critical security updates or remediation of
issues breaking the user experience. Typically, these hotfixes are created on a
fix/*
branch and merged
into the release branch. It's essential that the change is brought into
main
as soon as practical so that is
part of future releases and also avoids any reoccurrence of the issue. This process must only be used for
small changes addressing urgent issues and with restraint.
Prioritize the use of GitHub for source control.
Create a branching strategy that details
feature
work and
releases
as a minimum, using branch policies and permissions
to ensure the strategy is appropriately enforced.
When feature branch changes are pushed to
origin
, trigger an automated testing process to validate the
legitimacy of code contributions before any Pull Request (PR) can be completed.
Ensure any PR requires the review of at least one other team member before merging.
It's recommended to treat the
main
branch as a continuously forward moving and stable branch,
primarily used for integration testing.
Ensure changes are only made to
main
via PRs, using a branch policy to prohibit direct commits.
Every time a PR is merged into
main
, it should automatically kick off a deployment against an
integration environment.
main
should be considered stable and safe to create a release from at any given time.
It's recommended to consider the use of dedicated
release/*
branches, created from the
main
branch and
used to deploy to Production environments.
release/*
branches should remain in the repository and can be used to patch a release.
Define and document a hotfix process and apply it only when needed.
Create hotfixes in a
fix/*
branch for subsequent merging into the release branch and deployment to
production.
Container registry
Design considerationsDesign considerations
Ensure any changes are brought into
main
as soon as practical so that they reflected in all future
releases to avoid reoccurrence of the issue.
A hotfix process should only be used with restraint for small changes addressing urgent issues; almost
all operational issues should follow the standard operating procedure and CI/CD DevOps processes.
Container registries are a key aspect for any containerized application, providing hosting for container images
deployed to container runtime environments, such as AKS. There's a wide variety of container registry
technologies available that predominantly rely on the Docker-provided format and standards for both push and
pull operations.
This section will therefore examine the optimal configuration of container registries, focusing on the native
Azure Container Registry service, while also exploring the trade-offs associated with centralized and federated
deployment models.
Because most container registry solutions rely on the Docker-provided format and standards for both
push and pull operations, they're broadly compatible and mostly interchangeable.
Container registries can sometimes be deployed either as a centralized service that is shared and
consumed by numerous applications within an organization, or a separate application component
dedicated to a specific application workload.
Some application scenarios will require public container images be replicated within a private container
registry to limit egress traffic, increase availability, or avoid potential throttling.
Public Registries - Docker HubPublic Registries - Docker Hub
Container images stored on Docker Hub, or other public registries, exist outside of Azure and a given virtual
network. This isn't necessarily a problem, but in certain scenarios can lead to various potential issues where
service unavailability, throttling and data exfiltration are concerned.
Azure Container Registr y (ACR)Azure Container Registr y (ACR)
Azure Container Registry (ACR) provides an Azure-native service with a range of features including geo-
replication, Azure AD authentication, automated container building, and patching using ACR tasks.
ACR supports high availability through Geo-replication to multiple configured regions, providing
resiliency against regional outage. If a region becomes unavailable, the other regions will continue to
serve image requests, and when the region returns to health the ACR will recover and replicate changes
to it.
This capability also provides registry colocation within each configured region, reducing network
latency and cross-region data transfer costs.
Within Azure regions that provide Availability Zone support, the Premium ACR tier supports Zone
Redundancy to protect against zonal failure.
Tagged ACR images are mutable by default, meaning that the same tag can be used on multiple images
pushed to the registry.
In production scenarios, this may lead to unpredictable behavior that could impact application uptime.
ACR supports locking an image version or a repository to prevent changes or deletes.
Image Locking mitigates multiple failure scenarios and also protects against a previously deployed
image
version
being changed in-place, which would introduce the risk that same-version deployments
may have different results (before and after such a change).
Design recommendationsDesign recommendations
Secret management
Design considerationsDesign considerations
Locking Container Images doesn't protect against the ACR instance being deleted, but Azure Resource
Locks can be used to achieve this.
ACR in Premium tier also offers support to restrict a container registry to a given set of virtual networks
and subnets through Private Endpoints.
For a mission-critical application, employ container registry instances that are dedicated to the
application workload. Avoid taking a dependency on a centralized service unless availability and
reliability requirements are in full alignment with the application.
When using container registries outside Azure, ensure that the provided SLA is aligned with the reliability
and security targets. Take special note of throttling limits in cases such as when relying on Docker Hub.
Use Azure Container Registry to host container images.
Azure Container Registr y (ACR)Azure Container Registr y (ACR)
Treat container registries as 'global resources' with a sustained lifecycle ('long-living'). Consider a single
global container registry per environment, such as the use of a global production registry.
Configure geo-replication to all considered deployment regions in order to remove regional
dependencies and optimize latency.
Images should be hosted geographically as close as possible to the consuming compute resources,
within the same Azure regions.
Prioritize regions with Availability Zone support to take advantage of zonal redundancy capabilities.
Use Azure AD integrated authentication to push and pull images instead of relying on access keys.
For optimal security, fully disable the use of the admin access key.
Secret management is a key technical domain in the context of both security and reliability, since the secret
management solution for a mission-critical application must provide requisite security and also offer an
appropriate level of availability to align with maximum reliability aspirations.
There's a multitude of key and secret management solutions available that can be used on Azure.
Azure Key Vault provides a fully managed Azure-native PaaS solution.
Provides native integration with Azure services out-of-the-box.
Supports Availability Zone deployments and multi-region redundancy.
Offers direct integration with Azure AD for authentication and authorization.
Many Azure services already support Azure AD authentication instead of relying on connection strings /
keys. Doing so greatly reduces the need to managed secrets in the first place.
There are three common approaches applied to define at what point secrets must be read from the selected
secret store and injected into the application:
Deployment-Time RetrievalDeployment-Time Retrieval
Retrieving secrets at deployment time provides the advantage that the secret management solution only
needs to be available at deployment time, since there are no direct dependencies after this point. For
example, injecting secrets as environment variables into a Kubernetes deployment or into a Kubernetes
secret.
Design recommendationsDesign recommendations
Only the deployment service principal needs to be able to access secrets, which simplifies RBAC
permissions within the secret management system. It does, however, introduce additional RBAC
considerations within DevOps tooling around controlling service principal access and the application in
terms of protecting retrieved secrets.
This method introduces a trade-off since the security benefits of the secret management solution aren't
being utilized since this design approach relies solely on the access control within the application
platform to keep secrets safe.
Secret updates or rotation will require a full redeployment in order to take effect.
Application star t-up retrievalApplication star t-up retrieval
Retrieving and inject secrets at application startup provides the benefit that secrets can more easily be
updated or rotated. A restart of the application is required to fetch the latest value.
This method ensures that secrets do not need to be stored on the application platform but can be held in
memory only.
For AKS, available implementations for this approach include the CSI SecretStore driver for KeyVault
and akv2k8s.
A native Azure solution Azure Key Vault referenced App Settings is also available.
The disadvantage of this approach is that it creates a runtime dependency to the secret management
solution. If the secret management solution experiences an outage, application components already
running maymay be able to continue serving requests, however, any restart or scale-out operations will likely
result in failure.
Runtime retrievalRuntime retrieval
Retrieving secrets at runtime from within the application itself serves as the most secure approach since
even the application platform never has access to secrets.
Application components require a direct dependency and a connection to the secret management system.
This makes it harder to test components individually and usually requires the use of an SDK.
The application itself needs to be able to authenticate to the secret management system. For AKS the
latter can be achieved using Pod-managed Identities, but that is currently (as of August 2021) still in
preview.
Where possible, use Azure AD authentication to connect to other services instead of using connection
strings or keys. Use this in conjunction with Azure Managed Identities to remove the need for any secrets
to be stored on the application platform.
Use Azure Key Vault to store all application secrets.
Azure Key Vault instances should be deployed as part of a regional stamp to mitigate the potential impact
of a failure to a single deployment stamp. Global resources, such as Front Door (for certificate storage, if
needed), should use a separate Azure Key Vault instance dedicated to global resources, rather than using
one of the regional Key Vault instances.
Use Managed identities instead of service principals to access Key Vault whenever possible.
Secrets should be retrieved at application startup, not during deployment time or at runtime.
Implement coding patterns so when an authorization failure occurs at runtime, secrets are re-retrieved.
Apply a fully automated key-rotation process that runs periodically within the solution.
Continuous validation and testing
Design considerationsDesign considerations
Use key near expire notification to get alerted on upcoming expiration.
As previously stated, testing is a fundamental activity for any mission-critical solution, to fully validate the health
of both the application code and infrastructure. More specifically, to satisfy desired standards for reliability,
performance, availability, security, quality, and scale, testing must be well defined and applied as a core
component of the application design and DevOps methodologies.
Testing is a key concern for both the local developer experience ("Inner Loop") and the complete DevOps
lifecycle ("Outer Loop"), which captures when developed code begins release pipeline processes on its journey
to a production environment.
The scope of this section focuses on testing conducted within the outer loop for a product release, considering
various test scenarios, such as unit, build, static, security, integration, regression, UX, performance, capacity and
failure injection (chaos). The order of conducted tests is also a critical consideration due to various
dependencies, such as the need to have a running application environment.
With high degrees of deployment automation, automated testing is essential to validate application or
infrastructure changes in a timely and repeatable manor.
The purpose of testing is ultimately to detect errors and issues before they reach production
environments, and there are various methods that are required to holistically achieve this goal.
Unit testingUnit testing
Unit testing is intended to confirm that application business logic works as expected. Improve confidence
in the overall effect of code changes.
Unit testing is typically considered as part of the Inner Loop and as such isn't a primary focus for this
section.
Smoke testingSmoke testing
Smoke testing is used to identify whether infrastructure and application components are available and
act as expected. A smoke test focuses on functionality rather than performance under load. Typically only
a single virtual user session is tested.
Common smoke testing scenarios include; interrogating the HTTPS endpoint of a web application,
querying a database, and simulating a user flow in the application.
The outcome of a smoke test should be that the system responds with expected values and behavior.
UI testingUI testing
UI Testing validates that application user interfaces are deployed and functioning as expected.
UI testing is similar to smoke testing but it's focused on user interface interactions.
UI automation tools can and should be used to drive automation.
During a UI test, a script will mimic a realistic user scenario and follow a series of steps to execute
actions and achieve an intended outcome.
Load testingLoad testing
Load testing is designed to validate scalability and application operation under load through rapid and/or
gradual increase in application test load, until a threshold/limit's reached. Load tests are typically
designed around a particular user flow or scenario, in order to verify that application requirements are
satisfied under a defined load.
Azure services have different soft and hard limits associated with scalability, and load testing can reveal if
a system faces a risk of exceeding them during the expected production load.
Load testing can be used to fine-tune auto-scaling capabilities for services that provide automated
scalability (that is to set appropriate measured thresholds). For services that do not provide native auto-
scaling, established automated operational procedures can also be fine-tuned through load testing.
Stress testingStress testing
Stress testing is a type of negative testing, which applies activities aimed at overloading existing
resources in order to understand where solution limits exist, and to ensure the systems ability to recover
gracefully.
During a stress tests it's essential to monitor all components of the system in order to identify potential
bottlenecks.
Every component of the system unable to appropriately scale can turn into a limitation, such as
active/passive network components or databases. It's important to understand their limits so that effort
can be applied to mitigate potential impact.
Unlike load testing, stress tests do not adhere to a realistic usage pattern, but aim to identify performance
and scale limits.
An alternative approach is to limit (or scale down) the computing resources of the system and monitor
how it behaves under load and whether it's able to recover.
Performance testingPerformance testing
Performance testing combines aspects of
load
and
stress testing
to validate performance under load, and
establish benchmark behaviors for application operation.
Failure injection (chaos) testingFailure injection (chaos) testing
Chaos testing introduces artificial failures to the system to validate how the system reacts and the
effectiveness of resiliency measures, operational procedures and mitigations.
A mission-critical application should be resilient to infrastructure and application failures, so introducing
faults in the application and underlying infrastructure and observing how the application behaves is
essential to achieve confidence in the solutions redundancy mechanisms and validate that it can indeed
operate as an 'always on' application.
Shutting down infrastructure components, purposely degrading performance, or introducing application
faults are examples of test scenarios, which can be used to verify that the application is going to react as
expected in situations when they occur for real.
Azure Chaos Studio provides an Azure-native chaos experimentation suite of tools to easily conduct
chaos experiments and inject faults within Azure services and application components.
Provides built-in chaos experiments for common fault scenarios, providing a growing set of 'behind
the curtain' experiments for underlying and abstracted components of Azure services.
Supports custom experiments targeting infrastructure and application components.
Security (penetration) testingSecurity (penetration) testing
Penetration testing is used to ensures that an application and its environment satisfy an expected security
Design recommendationsDesign recommendations
Demo video: Continuous validation with Azure Load Test and Azure Chaos StudioDemo video: Continuous validation with Azure Load Test and Azure Chaos Studio
AI for DevOps
Design considerationsDesign considerations
posture.
Penetration tests will probe the application an environment for security vulnerabilities.
Security testing can encompass the end-to-end software supply chain and package dependencies, with
scanning and monitoring for known Common Vulnerabilities and Exposures (CVE).
All testing of both infrastructure and application components should be fully automated to ensure
consistency.
All test artifacts should be treated as code and maintained within the source control system and version
controlled along with other application code artifacts.
The results of the tests should be captured and analyzed as both individual test results and aggregated
for assessing trends over time. Test results should be continually evaluated for accuracy and coverage.
The availability of test infrastructure should be aligned with the SLA for deployment and testing cycles.
Use PaaS CI/CD orchestration platforms, such as Azure DevOps or GitHub Actions, to orchestrate and
execute tests where possible.
Execute smoke tests as part of every deployment.
Run extensive load tests, along with stress and chaos testing, as part of every deployment to validate
application performance and operability is maintained.
Use load profiles that are reflective of real peak usage patterns.
Run chaos experiments and failure injection tests at the same time as load tests.
If database interactions are required for load or smoke tests (that is to create new records), use test
accounts with reduced privileges and make test data separable from real user content.
Tests with shorter execution times should generally run earlier in the cycle where possible to increase
testing efficiency.
Scan and monitor the end to end software supply chain and package dependencies for known CVEs.
Use Dependabot for GitHub repositories to ensure the repository automatically keeps up-to-date with
the latest releases of packages and applications it depends on.
AIOps methodologies can be applied within CI/CD pipelines to supplement traditional testing approaches,
providing capabilities to detect likely regressions or degradations, and allowing deployments to be preemptively
stopped to prevent potential negative impact.
CI/CD pipelines and DevOps processes will expose a wide variety of telemetry for machine learning
models, from test results and deployment outcomes, to operational data of test components from
composite deployment stages.
CI/CD pipelines will include various types of automated testing, such as unit, smoke, performance,
load, and chaos tests.
Changes in a deployment will need to be stored in a manner suitable for automated analysis and
correlation to deployment outcomes.
Design recommendationsDesign recommendations
Next step
Traditional data processing approaches such as Extract, Transform, and Load (ETL) may not be able to
scale throughput to keep up with growth of deployment telemetry and application observability data.
Modern analytics approaches that do not require ETL and data movement, such as data virtualization,
can be used to enable ongoing analysis by AIOps models.
Define what DevOps process data will be collected and how it will be analyzed.
Expose application observability data from staged test environments and the production environment
for analysis and correlation within AIOps models.
Gather deployment telemetry from DevOps processes, such as test execution metrics and time series
data of changes within each deployment.
Adopt the MLOps Workflow.
Develop analytical models that are context-aware and dependency-aware to provide predictions along
with automated feature engineering to address schema and behavior changes.
Operationalize models by registering and deploying the best trained models within deployment
pipelines.
Review the security considerations.
Security
Security considerations for mission
-
critical
workloads on Azure
12/16/2022 19 minutes to read • Edit Online
IMPORTANTIMPORTANT
Alignment with the Zero Trust model
Design considerationsDesign considerations
Security is a one of the foundational design principles and also a key design area that must be treated as a first-
class concern within the mission-critical architectural process.
Given that the primary focus of a mission-critical design is to maximize reliability so that the application remains
performant and available, the security considerations and recommendations applied within this design area will
focus on mitigating threats with the capacity to impact availability and hinder overall reliability. For example,
successful Denial-Of-Service (DDoS) attacks are known to have a catastrophic impact on availability and
performance. How an application mitigates those attack vectors, such as SlowLoris will impact the overall
reliability. So, the application must be fully protected against threats intended to directly or indirectly
compromise application reliability to be truly mission critical in nature.
It's also important to note that there are often significant trade-offs associated with a hardened security posture,
particularly with respect to performance, operational agility, and in some cases reliability. For example, the
inclusion of inline Network Virtual Appliances (NVA) for Next-Generation Firewall (NGFW) capabilities, such as
deep packet inspection, will introduce a significant performance penalty, additional operational complexity, and a
reliability risk if scalability and recovery operations are not closely aligned with that of the application. It's
therefore essential that additional security components and practices intended to mitigate key threat vectors are
also designed to support the reliability target of an application, which will form a key aspect of the
recommendations and considerations presented within this section.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub. The code assets adopt a Zero Trust
model to structure and guide the security design and implementation approach.
The Microsoft Zero Trust model provides a proactive and integrated approach to applying security across all
layers of an application. The guiding principles of Zero Trust strives to explicitly and continuously verify every
transaction, assert least privilege, use intelligence, and advanced detection to respond to threats in near real-
time. It's ultimately centered on eliminating trust inside and outside of application perimeters, enforcing
verification for anything attempting to connect to the system.
As you assess the security posture of the application, start with these questions as the basis for each
consideration.
Continuous security testing to validate mitigations for key security vulnerabilities.
Is security testing performed as a part of automated CI/CD processes?
If not, how often is specific security testing performed?
Design recommendationsDesign recommendations
Are test outcomes measured against a desired security posture and threat model?
Security level across all lower-environments.
Do all environments within the development lifecycle have the same security posture as the
production environment?
Authentication and Authorization continuity in the event of a failure.
If authentication or authorization services are temporarily unavailable, will the application be able to
continue to operate?
Automated security compliance and remediation.
Can changes to key security settings be detected
?
Are responses to remediate non-compliant changes automated?
Secret scanning to detect secrets before code is committed to prevent any secret leaks through source
code repositories.
Is authentication to services possible without having credentials as a part of code?
Secure the software supply chain.
Is it possible to track Common Vulnerabilities and Exposures (CVEs) within utilized package
dependencies?
Is there an automated process for updating package dependencies?
Data protection key lifecycles.
Can service-managed keys be used for data integrity protection?
If customer-managed keys are required, how is the secure and reliable key lifecycle?
CI/CD tooling should require Azure AD service principals with sufficient subscription level access to
facilitate control plane access for Azure resource deployments to all considered environment
subscriptions.
When application resources are locked down within private networks, is there a private data-plane
connectivity path so that CI/CD tooling can perform application level deployments and maintenance.
This introduces additional complexity and requires a sequence within the deployment process
through requisite private build agents.
Use Azure Policy to enforce security and reliability configurations for all service, ensuring that any
deviation is either remediated or prohibited by the control plane at configuration-time, helping to
mitigate threats associated with 'malicious admin' scenarios.
Use Azure AD Privileged Identity Management (PIM) within production subscriptions to revoke sustained
control plane access to production environments. This will significantly reduce the risk posed from
'malicious admin' scenarios through additional 'checks and balances'.
Use Azure Managed Identities for all services that support the capability, since it facilitates the removal of
credentials from application code and removes the operational burden of identity management for
service to service communication.
Use Azure AD Role Based Access Control (RBAC) for data plane authorization with all services that
support the capability.
Use first-party Microsoft identity platform authentication libraries within application code to integrate
with Azure AD.
Consider secure token caching to allow for a degraded but available experience if the chosen identity
platform, isn't available or is only partially available for application authorization.
Threat modeling
If the provider is unable to issue new access tokens, but still validates existing ones, the application
and dependent services can operate without issues until their tokens expire.
Token caching is typically handled automatically by authentication libraries (such as MSAL).
Use Infrastructure-as-Code (IaC) and automated CI/CD pipelines to drive updates to all application
components, including under failure circumstances.
Ensure CI/CD tooling service connections are safeguarded as critical sensitive information, and
shouldn't be directly available to any service team.
Apply granular RBAC to production CD pipelines to mitigate 'malicious admin' risks.
Consider the use of manual approval gates within production deployment pipelines to further
mitigate 'malicious admin' risks and provide additional technical assurance for all production changes.
Additional security gates may come at a trade-off in terms of agility and should be carefully
evaluated, with consideration given to how agility can be maintained even with manual gates.
Define an appropriate security posture for all lower environments to ensure key vulnerabilities are
mitigated.
Do not apply the same security posture as production, particularly with regard to data exfiltration,
unless regulatory requirements stipulate the need to do so, since this will significantly compromise
developer agility.
Enable Microsoft Defender for Cloud (formerly known as Azure Security Center) for all subscriptions that
contain the resources for a mission-critical workload.
Use Azure Policy to enforce compliance.
Enable Azure Defender for all services that support the capability.
Embrace DevSecOps and implement security testing within CI/CD pipelines.
Test results should be measured against a compliant security posture to inform release approvals, be
they automated or manual.
Apply security testing as part of the CD production process for each release.
If security testing each release jeopardizes operational agility, ensure a suitable security testing
cadence is applied.
Enable secret scanning and dependency scanning within the source code repository.
Threat modeling provides a risk based approach to security design, using identified potential threats to develop
appropriate security mitigations. There are many possible threats with varying probabilities of occurrence, and
in many cases threats can chain in unexpected, unpredictable, and even chaotic ways. This complexity and
uncertainty is why traditional technology requirement based security approaches are largely unsuitable for
mission-critical cloud applications. Expect the process of threat modeling for a mission-critical application to be
complex and unyielding.
To help navigate these challenges, a layered defense-in-depth approach should be applied to define and
implement compensating mitigations for modeled threats, considering the following defensive layers.
1. The Azure platform with foundational security capabilities and controls.
2. The application architecture and security design.
3. Security features built-in, enabled, and deployable applied to secure Azure resources.
4. Application code and security logic.
5. Operational processes and DevSecOps.
NOTENOTE
Design considerationsDesign considerations
Design recommendationsDesign recommendations
Network intrusion protection
When deploying within an Azure landing zone, be aware that an additional threat mitigation layer through the
provisioning of centralized security capabilities is provided by the landing zone implementation.
STRIDE provides a lightweight risk framework for evaluating security threats across key threat vectors.
Spoofed Identity: Impersonation of individuals with authority. For example, an attacker impersonating
another user by using their -
Tampering Input: Modification of input sent to the application, or the breach of trust boundaries to modify
application code. For example, an attacker using SQL Injection to delete data in a database table.
Repudiation of Action: Ability to refute actions already taken, and the ability of the application to gather
evidence and drive accountability. For example, the deletion of critical data without the ability to trace to a
malicious admin.
Information Disclosure: Gaining access to restricted information. An example would be an attacker gaining
access to a restricted file.
Denial of Service: Malicious application disruption to degrade user experience. For example, a DDoS botnet
attack such as Slowloris.
Elevation of Privilege: Gaining privileged application access through authorization exploits. For example, an
attacker manipulating a URL string to gain access to sensitive information.
Identity
Authentication
Data integrity
Validation
Blocklisting/allowlisting
Audit/logging
Signing
Encryption
Data exfiltration
Man-in-the-middle attacks
DDoS
Botnets
CDN and WAF capabilities
Remote code execution
Authorization
Isolation
Allocate engineering budget within each sprint to evaluate potential new threats and implement
mitigations.
Conscious effort should be applied to ensure security mitigations are captured within a common
engineering criteria to drive consistency across all application service teams.
Start with a service by service level threat modeling and unify the model by consolidating the thread
model on application level.
Preventing unauthorized access to a mission-critical application and encompassed data is vital to maintain
Design considerationsDesign considerations
Design recommendationsDesign recommendations
availability and safeguard data integrity.
Zero Trust assumes a breached state and verifies each request as though it originates from an
uncontrolled network.
An advanced zero-trust network implementation employs micro-segmentation and distributed
ingress/egress micro-perimeters.
Azure PaaS services are typically accessed over public endpoints. Azure provides capabilities to secure
public endpoints or even make them entirely private.
Azure Private Link/Private Endpoints provide dedicated access to an Azure PaaS resource using private
IP addresses and private network connectivity.
Virtual Network Service Endpoints provide service-level access from selected subnets to selected PaaS
services.
Virtual Network Injection provides dedicated private deployments for supported services, such as App
Service through an App Service Environment.
Management plane traffic still flows through public IP addresses.
For supported services, Azure Private Link using Azure Private Endpoints addresses data exfiltration risks
associated with Service Endpoints, such as a malicious admin writing data to an external resource.
When restricting network access to Azure PaaS services using Private Endpoints or Service Endpoints, a
secure network channel will be required for deployment pipelines to access both the Azure control plane
and data plane of Azure resources in order to deploy and manage the application.
Private self-hosted build agents deployed onto a private network as the Azure resource can be used as
a proxy to execute CI/CD functions over a private connection. A separate virtual network should be
used for build agents.
An alternative approach is to modify the firewall rules for the resource on-the-fly within the pipeline
to allow a connection from an Azure DevOps agent public IP address, with the firewall subsequently
removed after the task is completed.
To perform developer and administrative tasks on the application service jump boxes can be used.
Connectivity to the private build agents from CI/CD tooling is required.
However, this approach is only applicable for a subset of Azure services. For example, this isn't
feasible for private AKS clusters.
The completion of administration and maintenance tasks is a further scenario requiring connectivity to
the data plane of Azure resources.
Service Connections with a corresponding Azure AD service principal can be used within Azure DevOps
to apply RBAC through Azure AD.
Service Tags can be applied to Network Security Groups to facilitate connectivity with Azure PaaS
services.
Application Security Groups don't span across multiple virtual networks.
Packet capture in Azure Network Watcher is limited to a maximum period of five hours.
Limit public network access to the absolute minimum required for the application to fulfill its business
purpose to reduce the external attack surface.
Use Azure Private Link to establish private endpoints for Azure resources that require secure network
integration.
Use hosted private build agents for CI/CD tooling to deploy and configure Azure resources protected
NOTENOTE
Data integrity protection
Design considerationsDesign considerations
by Azure Private Link.
Microsoft-hosted agents won't be able to directly connect to network integrated resources.
When dealing with private build agents, never open an RDP or SSH port directly to the internet.
Use Azure Bastion to provide secure access to Azure Virtual Machines and to perform administrative
tasks on Azure PaaS over the Internet.
Use a DDoS standard protection plan to secure all public IP addresses within the application.
Use Azure Front Door with web application firewall policies to deliver and help protect global HTTP/S
applications that span multiple Azure regions.
Use Header ID validation to lock down public application endpoints so they only accept traffic
originating from the Azure Front Door instance.
If additional in-line network security requirements, such as deep packet inspection or TLS inspection,
mandate the use of Azure Firewall Premium or Network Virtual Appliance (NVA), ensure it's configured
for maximum high availability and redundancy.
If packet capture requirements exist, use Network Watcher packets to capture despite the limited capture
window.
Use Network Security Groups and Application Security Groups to micro-segment application traffic.
Avoid using a security appliance to filter intra-application traffic flows.
Consider the use of Azure Policy to enforce specific NSG rules are always associated with application
subnets.
Enable NSG flow logs and feed them into Traffic Analytics to gain insights into internal and external traffic
flows.
Use Azure Private Link/Private Endpoints, where available, to secure access to Azure PaaS services within
the application design. For information on Azure services that support Private Link, see Azure Private
Link availability.
If Private Endpoint isn't available and data exfiltration risks are acceptable, use Virtual Network Service
Endpoints to secure access to Azure PaaS services from within a virtual network.
Don't enable virtual network service endpoints by default on all subnets as this will introduce
significant data exfiltration channels.
For hybrid application scenarios, access Azure PaaS services from on-premises via ExpressRoute with
private peering.
When deploying within an Azure landing zone, be aware that network connectivity to on-premises data centers is
provided by the landing zone implementation. One approach is by using ExpressRoute configured with private peering.
Encryption is a vital step toward ensuring data integrity and is ultimately one of the most important security
capabilities that can be applied to mitigate a wide array of threats. This section will therefore provide key
considerations and recommendations related to encryption and key management in order to safeguard data
without compromising application reliability.
Azure Key Vault has transaction limits for keys and secrets, with throttling applied per vault within a
certain period.
Design recommendationsDesign recommendations
Azure Key Vault provides a security boundary since access permissions for keys, secrets, and certificates
are applied at a vault level.
Key Vault access policy assignments grant permissions separately to keys, secrets, or certificates.
Granular object-level permissions to a specific key, secret, or certificate are now possible.
After a role assignment is changed, there's a latency of up to 10 minutes (600 seconds) for the role to be
applied.
There's an Azure AD limit of 2,000 Azure role assignments per subscription.
Azure Key Vault underlying hardware security modules (HSMs) are FIPS 140-2 Level 2 compliant.
A dedicated Azure Key Vault managed HSM is available for scenarios requiring FIPS 140-2 Level 3
compliance.
Azure Key Vault provides high availability and redundancy to help maintain availability and prevent data
loss.
During a region failover, it may take a few minutes for the Key Vault service to fail over.
During a failover Key Vault will be in a read-only mode, so it won't be possible to change key vault
properties such as firewall configurations and settings.
If private link is used to connect to Azure Key Vault, it may take up to 20 minutes for the connection to be
re-established during a regional failover.
A backup creates a point-in-time snapshot of a secret, key, or certificate, as an encrypted blob that can't
be decrypted outside of Azure. To get usable data from the blob, it must be restored into a Key Vault
within the same Azure subscription and Azure geography.
Secrets may renew during a backup, causing a mismatch.
With service-managed keys, Azure will perform key management functions, such as rotation, thereby
reducing the scope of application operations.
Regulatory controls may stipulate the use of customer-managed keys for service encryption functionality.
When traffic moves between Azure data centers, MACsec data-link layer encryption is used on the
underlying network hardware to secure data in-transit outside of the physical boundaries not controlled
by Microsoft or on behalf of Microsoft.
Use service-managed keys for data protection where possible, removing the need to manage encryption
keys and handle operational tasks such as key rotation.
Only use customer-managed keys when there's a clear regulatory requirement to do so.
Use Azure Key Vault as a secure repository for all secrets, certificates, and keys if additional encryption
mechanisms or customer-managed keys need considered.
Provision Azure Key Vault with the soft delete and purge policies enabled to allow retention protection
for deleted objects.
Use HSM backed Azure Key Vault SKU for application production environments.
Deploy a separate Azure Key Vault instance within each regional deployment stamp, providing fault
isolation and performance benefits through localization, as well as navigating the scale limits imposed by
a single Key Vault instance.
Use a dedicated Azure Key Vault instance for application global resources.
Follow a least privilege model by limiting authorization to permanently delete secrets, keys, and
certificates to specialized custom Azure AD roles.
Policy-driven governance
Design considerationsDesign considerations
NOTENOTE
Design recommendationsDesign recommendations
Ensure encryption keys, and certificates stored within Key Vault are backed up, providing an offline copy
in the unlikely event Key Vault becomes unavailable.
Use Key Vault certificates to manage certificate procurement and signing.
Establish an automated process for key and certificate rotation.
Automate the certificate management and renewal process with public certificate authorities to ease
administration.
Set alerting and notifications, to supplement automated certificate renewals.
Monitor key, certificate, and secret usage.
Define alerts for unexpected usage within Azure Monitor.
Security conventions are ultimately only effective if consistently and holistically enforced across all application
services and teams. Azure Policy provides a framework to enforce security and reliability baselines, ensuring
continued compliance with a common engineering criteria for a mission-critical application. More specifically,
Azure Policy forms a key part of the Azure Resource Manager (ARM) control plane, supplementing RBAC by
restricting what actions authorized users can perform, and can be used to enforce vital security and reliability
conventions across utilized platform services.
This section will therefore explore key considerations and recommendations surrounding the use of Azure Policy
driven governance for a mission-critical application, ensuring security and reliability conventions are
continuously enforced.
Azure Policy provides a mechanism to drive compliance by enforcing security and reliability conventions,
such as the use of Private Endpoints or the use of Availability Zones.
When deploying within an Azure landing zone, be aware that the enforcement of centralized baseline policy assignments
will likely be applied in the implementation for landing zone management groups and subscriptions.
Azure Policy can be used to drive automated management activities, such as provisioning and
configuration.
Resource Provider registration.
Validation and approval of individual Azure resource configurations.
Azure Policy assignment scope dictates coverage and the location of Azure Policy definitions informs the
reusability of custom policies.
Azure Policy has several limits, such as the number of definitions at any particular scope.
It can take several minutes for the execution of Deploy If Not Exist (DINE) policies to occur.
Azure Policy provides a critical input for compliance reporting and security auditing.
Map regulatory and compliance requirements to Azure Policy definitions.
For example, if there are data residency requirements, a policy should be applied to restrict available
deployment regions.
Define a common engineering criteria to capture secure and reliable configuration definitions for all
NOTENOTE
IaaS specific considerations when using Virtual Machines
utilized Azure services, ensuring this criteria is mapped to Azure Policy assignments to enforce
compliance.
For example, apply an Azure Policy to enforce the use of Availability Zones for all relevant services,
ensuring reliable intra-region deployment configurations.
The Mission Critical reference implementation contain a wide array of security and reliability centric policies
to define and enforce a sample common engineering criteria.
Monitor service configuration drift, relative to the common engineering criteria, using Azure Policy.
For mission-critical scenarios with multiple production subscriptions under a dedicated management group,
prioritize assignments at the management group scope.
Use built-in policies where possible to minimize operational overhead of maintaining custom policy
definitions.
Where custom policy definitions are required, ensure definitions are deployed at suitable management
group scope to allow for reuse across encompassed environment subscriptions to this allow for policy
reuse across production and lower environments.
When aligning the application roadmap with Azure roadmaps, use available Microsoft resources to
explore if critical custom definitions could be incorporated as built-in definitions.
When deploying within an Azure landing zone, consider deploying custom Azure Policy Definitions within the intermediate
company root management group scope to enable reuse across all applications within the broader Azure estate. In a
landing zone environment, certain centralized security policies will be applied by default within higher management group
scopes to enforce security compliance across the entire Azure estate. For example, Azure policies should be applied to
automatically deploy software configurations through VM extensions and enforce a compliant baseline VM configuration.
Use Azure Policy to enforce a consistent tagging schema across the application.
Identify required Azure tags and use the append policy mode to enforce usage.
If the application is subscribed to Microsoft Mission-Critical Support, ensure that the applied tagging
schema provides meaningful context to enrich the support experience with deep application understanding.
Export Azure AD activity logs to the global Log Analytics Workspace used by the application.
Ensure Azure activity logs are archived within the global Storage Account along with operational data
for long-term retention.
In an Azure landing zone, Azure AD activity logs will also be ingested into the centralized platform Log
Analytics workspace. It needs to be evaluated in this case if Azure AD are still required in the global Log
Analytics workspace.
Integrate security information and event management with Microsoft Defender for Cloud (formerly known
as Azure Security Center).
In scenarios where the use of IaaS Virtual Machines is required, some specifics have to be taken into
consideration.
Design considerationsDesign considerations
Design recommendationsDesign recommendations
Next step
Images are not updated automatically once deployed.
Updates are not installed automatically to running VMs.
Images and individual VMs are typically not hardened out-of-the-box.
Do not allow direct access via the public Internet to Virtual Machines by providing access to SSH, RDP or
other protocols. Always use Azure Bastion and jumpboxes with limited access to a small group of users.
Restrict direct internet connectivity by using Network Security Groups, (Azure) Firewall or Application
Gateways (Level 7) to filter and restrict egress traffic.
For multi-tier applications consider using different subnets and use Network Security Groups to restrict
access in between.
Prioritize the use of Public Key authentication, when possible. Store secrets in a secure place like Azure Key
Vault.
Protect VMs by using authentication and access control.
Apply the same security practices as described for mission-critical application scenarios.
Follow and apply security practices for mission-critical application scenarios as described above, when
applicable, as well as the Security best practices for IaaS workloads in Azure.
Review the best practices for operational procedures for mission-critical application scenarios.
Operational procedures
Operational procedures for mission
-
critical
workloads on Azure
12/16/2022 10 minutes to read • Edit Online
IMPORTANTIMPORTANT
DevOps processes
Design considerationsDesign considerations
The mission-critical design methodology leans heavily on the principles
automation wherever possible
and
configuration as code
to drive reliable and effective operations through DevOps processes, with automated
deployment pipelines used to execute versioned application and infrastructure code artifacts within a source
repository. While this level of DevOps adoption requires substantial engineering investment to instantiate and
discipline to maintain, it yields significant operational dividends, enabling consistent and accurate operational
outcomes within minimal manual operational procedures.
This design area explores how the adoption of DevOps and related deployment methods is used to drive
effective and consistent operational procedures.
This article is part of the Azure Well-Architected mission-critical workload series. If you aren't familiar with this series, we
recommend you start with what is a mission-critical workload?
Mission-Critical open source project
The reference implementations are part of an open source project available on GitHub.
DevOps provides the engineering mindset, processes, and tooling to deliver application services in a fast,
efficient, and reliable manner. More specifically, DevOps brings together development and operational processes
as well as teams into a single engineering function that encompasses the entire application lifecycle, using
automation and DevOps tooling to conduct deployment operations swiftly and reliably.
DevOps processes support and sustain the concepts of continuous integration and continuous
deployment (CI/CD), while fostering a culture of continuous improvement.
DevOps can be difficult to apply when there are hard dependencies on central IT functions since it
prevents end-to-end operational action.
Key responsibilities of the DevOps team for a mission-critical application include:
Create and manage application and infrastructure resources through CI/CD automation.
Application monitoring and observability.
Azure RBAC and identity for application components.
Security monitoring and audit of application resources.
Network management for application components.
Cost management for application resources.
DevSecOps expands the DevOps model by integrating security and quality assurance teams with
development and operations throughout the application lifecycle.
A DevOps engineering team can consider various granular Azure RBAC roles for different technical
Design recommendationsDesign recommendations
Application Operations
personas, such as AppDataOps for database management.
A zero trust model can and should be applied across different application DevOps personas.
Define configuration settings and updates for application components or underlying infrastructure as
code.
Manage any changes to code through consistent release and update process, including tasks such as
key or secret rotation and permission management.
Prioritize pipeline-managed update processes, such as with scheduled pipeline runs, over built-in
auto-update mechanisms.
Do not use central processes or provisioning pipelines for the instantiation or management of application
resources, since this introduces external application dependencies and additional risk vectors, such as
those associated with 'noisy neighbor' scenarios.
If centralized provisioning processes are mandated, ensure the availability requirements of used
dependencies are fully in-line with application requirements, and ensure operational transparency is
provided to allow for holistic operationalization of the end-to-end application.
Embrace continuous improvement and allocate a proportion of engineering capacity within each sprint to
optimize platform fundamentals.
Consider allocating 20-40% of capacity within each sprint to drive fundamental platform
improvements and bolster reliability.
To accelerate the development of new services, consider the creation of a common engineering criteria
and reference architectures/libraries for service teams to use, ensuring consistent alignment with core
design principles.
Enforce a consistent baseline configuration for reliability, security, and operations through a policy-
driven approach using Azure Policy.
This common engineering criteria and associated artifacts, such as Azure Policies and Terraform for common
design patterns, can also be used across other workloads within the broader application ecosystem for an
organization.
Consider a DevSecOps for security sensitive and highly regulated scenarios, to ensure security is baked
within the DNA of engineering team throughout the development lifecycle rather than a specific release
stage/gate.
Apply a zero trust model within critical application environments, using capabilities such as Azure AD
Privileged Identity Management (PIM) to ensure consistent operations only occur through CI/CD
processes or automated operational procedures.
No physical users should have standing write-access to any environment, maybe with the exception of
development environments for easier testing and debugging.
Define emergency processes for Just-in-time access to production environments.
Ensure 'break glass' accounts exist for serious issues with the authentication provider.
Consider AIOps as a method to continually improve operational procedures and triggers.
The application design and platform recommendations have a significant bearing on effective operations, and
the extent to which these recommendations are adhered, will therefore greatly influence optimal operational
procedures.
Design considerationsDesign considerations
Furthermore, there's a varied set of operational capabilities provided by different Azure services, particularly
when it comes to high availability and recovery. It's therefore also important to understand and use the
operational capabilities of used services.
This section will therefore highlight key operational aspects associated with application design and
recommended platform services.
Azure services provide a combination of built-in (enabled by default) and configurable platform
capabilities, such as zonal redundancy or geo-replication. The service-level configuration of each service
within an application must therefore be considered by operational procedures.
Many configurable capabilities incur an additional cost, such as the multi-write deployment
configuration for Azure Cosmos DB.
Mission critical design strongly endorses the principle of ephemeral stateless application resources,
meaning that updates can typically be performed through a new deployment and standard delivery
pipelines.
Most of the required operations are exposed and accessible via the Azure ARM management APIs or
through the Azure portal.
Some more intensive operations, such as a restore from a periodic backup of an Azure Cosmos DB
database or the recovery of a deleted resource, can only be performed by Azure Support engineers via a
support case.
For stateless resources and resources that can be entirely configured from deployment, such as Azure
Front Door and associated backends/origins, redeployment will faster generally result in an operational
resource than a Support process to attempt recovery of the deleted resource.
Azure Policy provides a framework to enforce and audit security and reliability baselines, ensuring
continued compliance with a common engineering criteria for a mission-critical application. More
specifically, Azure Policy forms a key part of the Azure Resource Manager (ARM) control plane,
supplementing RBAC by restricting what actions authorized users can perform, and can be used to
enforce vital security and reliability conventions across utilized platform services.
Azure Policy can be extended within Azure Kubernetes Service (AKS) via Azure Policy for Kubernetes
that provides visibility of components deployed within clusters.
Azure resources can be locked to prevent them from being modified or deleted.
Locks introduce management overhead within deployment pipelines that must remove locks, perform
deployment steps, before subsequently re-enabling.
Generally, for most resource a robust RBAC process, with tight restrictions on who can perform write
operations, should be preferred over locking resources.
Update ManagementUpdate Management
Key, secret, and certificate expirations are common causes of application outage.
The Kubernetes version within AKS needs to be updated regularly, especially given support for older
versions isn't sustained.
Components running on K8s also need to be updated, such as cert-manager and Key Vault-csi, and
aligned with the k8s version within AKS.
Terraform providers need to be updated regularly. Newer provider versions frequently contain breaking
changes, which must be properly tested before applied in production.
Other application dependencies, such as the runtime environment (.NET, Java, Python), should be
Design recommendationsDesign recommendations
monitored and kept up-to-date.
New versions of packages, components, and dependencies need to be properly tested before
consideration in a production context.
Container registries will likely need regular housekeeping to delete old image versions that aren't used
anymore.
Azure Policy provides native support for a wide variety of Azure resources.
Use an active-active deployment model, using a health model and automated scale-operations to ensure
no failover intervention is required.
If using an active-passive or active-standby model, ensure failover procedures are automated or at
least codified within pipelines so that manual steps besides triggering is required during operational
crises.
Prioritize the use of Azure-native auto-scale functionality for all available services.
Establish automated operational processes to scale services that do not offer auto-scale.
Leverage scale-units composed of multiple services to provide requisite scalability under relevant
circumstances.
Identify operational procedures and tasks required by global (long-lived) application resources.
For example, if an Azure Cosmos DB resource or encompassed data is incorrectly modified or deleted,
the possible ways of recovery should be well understood and a recovery process should exist.
Similarly, establish procedures to manage decommissioned container images in the registry.
It's strongly recommended to practice recovery operations in advance, on non-production resources and
data, as part of standard business continuity preparations.
Use platform-native capabilities for backup and restore, ensuring they're aligned with RTO/RPO and data
retention requirements.
Avoid building custom solutions unless absolutely necessary.
Define a strategy for long-term backup retention if required.
Use built-in capabilities for SSL certificate management and renewal, such as those offered by Azure
Front Door.
Wherever possible, use Managed Identities in order to avoid dealing with Service Principal credentials or
API keys.
When storing secrets, keys or certificates in Azure Key Vault, make use of the expiry setting and have
alerting configured for upcoming expirations.
All key, secret, and certificate updates should be performed using the standard release process.
Identify critical operational alert and define target audiences and systems, with clear channels to reach
them.
Avoid 'white-noise' by only sending actionable alerts, to prevent operational stakeholders for ignoring
alerts and missing important information.
Leverage continuous improvement to optimize alerting and remove observed 'white-noise'.
Apply the principle of policy-driven governance and Azure Policy to ensure the appropriate use of
operational capabilities and a reliable configuration baseline across all application services.
Apply a resource lock to prevent the deletion of long-lived global resources, such as Azure Cosmos DB.
Avoid the use of resource locks on ephemeral regional resources, and instead rely on the appropriate
IaaS specific considerations when using Virtual Machines
Design considerationsDesign considerations
Design recommendationsDesign recommendations
use of Role Based Access Control (RBAC) and CI/CD pipelines to control operational updates.
Update managementUpdate management
Update external libraries, SDKs, and runtimes frequently, treating it as any other change to the
application. This will ensure the latest security vulnerabilities and performance optimizations are applied.
Ensure all updates are validated prior to production release.
Set up processes to monitor and automatically detect updates, such as GitHub's Dependabot.
All operational tasks, such as key and secret rotation, should be handled using either Azure-native
platform capabilities or via a standard release process applied for code and configuration changes.
Ensure key, secret, and certificate rotation is performed on a regular basis.
Manual operational changes to update components should be avoided and only considered by
emergency exception.
Ensure a process exists to reconciliate any manual changes back into the source repository, avoiding
drift and issue recurrence.
Establish an automated housekeeping procedure to remove old image versions from Azure Container
Registry.
In scenarios where the use of IaaS Virtual Machines is required, some of the procedures and practices described
above might differ. The use of Virtual Machines provides more flexibility in regards to configuration options,
selection of operating systems, access to drivers and low-level operating system access as well as to the kind of
software that can be installed. The price for that are increased operational costs and responsibility for tasks that
are usually done by the cloud provider when using PaaS services.
Individual VMs do not provide high availability, zone or geo-redundancy.
Individual VMs are not automatically updated once deployed.
Services running inside a VM need special treatment and additional tooling to be deployed and configured
via Infra-as-Code.
Azure periodically updates its platform and might require a reboot. These reboots are usually announced in
advance. See Maintenance for virtual machines in Azure and Handling planned maintenance notifications.
Avoid any manual operations on virtual machines and implement proper processes to deploy and rollout
changes.
Automate the provisioning of Azure resources using Infrastructure-as-Code solutions, such as Azure
Resource Manager templates, Bicep, Terraform, or other third-party solutions.
Make sure that operational processes for deployment of virtual machines, updates, backup and recovery
are in place and properly tested. To test for resiliency inject fault in application and take a note of failure
and mitigate those failures.
Ensure that strategies are in place to rollback changes to last known healthy state in case a newer version
is not functioning correctly.
Take frequent backups for stateful workloads and ensure backup tasks are working effectively and alert
on failed backup processes.
Monitor virtual machines and detect for failure. The raw data of for monitoring can come from variety of
sources. Ensure that monitoring is configured and analyze the cause of problems.
Next step
Analyze that scheduled backups are running as expected and that periodic backups are taken (if needed).
Azure Backup center can be used to gain more insights.
Prioritize the use of Virtual Machine Scale Sets over VMs to enable capabilities like scale, autoscale and
provide zone-redundancy.
Prioritize the use of standard images from the Azure Marketplace over custom images that need to be
maintained.
Use Azure VM Image Builder or other tools to automate build and maintenance processes for customized
images if required.
Besides these specifics, apply best practices for operational procedures for mission-critical application scenarios
when applicable.
Review the cross-cutting concerns for mission-critical application scenarios.
Architecture pattern
Well
-
Architected assessment for mission
-
critical
workloads
12/16/2022 2 minutes to read • Edit Online
The Assessment is a review tool for self-assessing the readiness of your mission-critical workload in production.
Working towards building a resilient mission critical architecture can be a complex process. The assessment is
organized in a way that you'll be able to methodically check the alignment to the best practices for resiliency,
reliability, and availability.
We recommend that the team doing the assessment is well versed in the architecture of the specific workload
and has a strong understanding of cloud principles and patterns. These roles include, but aren't limited to, cloud
architect, operators, DevOps engineer.
The assessment is a set of questions based on the mission-critical design methodology as a way of checking the
foundational design choices of a workloads architecture and the end-to-end operational approach.
These questions are designed to help benchmark a workload’s maturity and alignment to the recommended
approach for operating a mission-critical workload. The outcome of the assessment is the delivery of technical
recommendations and documentation that provide guidance on how to implement a highly reliable solution on
Azure.
Next steps
Mission-critical review tool
See these reference architecture that describe design choices for production-ready implementations.
Baseline architecture of an internet-facing application
Baseline architecture of an internet-facing application with network controls
Carrier
-
grade workloads on Azure
12/16/2022 5 minutes to read • Edit Online
NOTENOTE
What is a carrier-grade workload?
What are the common challenges?
Lift and shift approachLift and shift approach
Mission-critical systems primarily focus on maximizing uptime and they exist in many industries. Within the
telecommunications industry, they're referred to as
carrier-grade systems
. These systems are developed due to
one or more of the following drivers:
Minimizing loss of life or injury.
Meeting regulatory requirements on reliability to avoid paying fines.
Optimizing service to customers to minimize churn to competitors.
Meeting contractual Service Level Agreements (SLAs) with business or government customers.
This series of articles applies the design methodology for mission-critical workloads to inform prescriptive
guidance for building and operating a highly reliable, resilient, and available telecommunication workload on
Azure.
The articles within this series focus on providing additional mission-critical considerations when designing for 99.999% ('5
9s') levels of reliability within the telecommunications industry.
The term
workload
refers to a collection of application resources that support a common business goal or the
execution of a common business process, with multiple services, such as APIs and data stores, working together
to deliver specific end-to-end functionality.
The term
mission-critical
refers to a criticality classification where a significant financial cost (business-critical) or
human cost (safety-critical) is associated with unavailability or underperformance.
A
carrier-grade workload
pivots on both business-critical and safety-critical, where there's a fundamental
requirement to be operational with only minutes or even seconds of downtime per calendar year. Failure to
achieve this uptime requirement can result in extensive loss of life, incur significant fines, or contractual
penalties.
The
operational
aspect of the workload includes how reliability is measured and the targets that it must meet or
exceed. Highly reliable systems typically target 99.999% uptime (commonly referred to as '5 9s') or 0.001%
downtime in a year (approximately 5 minutes). Some systems target 99.9999% uptime, or 30 seconds
downtime per year, or even higher levels of reliability. This covers all forms and causes of outage scheduled
maintenance, infrastructure failure, human error, software issues and even natural disaster.
Although the platform used has evolved from dedicated, proprietary hardware through commercial, off-the-
sheld hardware to OpenStack or VMware clouds, Telecommunication companies consistently deliver services
achieving ≤ 5 minutes of downtime per year, and in many cases, achieve ≤ 30 seconds of downtime due to
unscheduled outages.
Building carrier-grade workloads is a challenge for these main reasons:
Monolithic solutionsMonolithic solutions
Only building zonal redundancyOnly building zonal redundancy
What are the key design areas?
DESIGN A READESIGN A REA DESC RIP T IONDESC RIP T ION
Fault toleranceFault tolerance Application design must allow for unavoidable failures so
that the application as a whole can continue to operate at
some level. The design must minimize points of failure and
take a federated approach.
Telecommunication companies have well-architected applications that deliver the expected behavior on their
existing infrastructure. However, care should be taken before assuming that porting these applications
as is
to a
public cloud infrastructure won't impact their resiliency.
The existing applications make a set of assumptions about their underlying infrastructure, which are unlikely to
remain true when moving from on-premises to public cloud. The architect must check that they still hold and
adjust infrastructure and application design to accommodate the new reality. The architect should also look for
opportunities where the new infrastructure removes limitations that existed on-premises.
For example, upgrade of on-premises systems must happen in place because it's not viable to maintain
sufficient hardware to create a new deployment alongside and slowly transition in a safe manner. This upgrade
path generates a host of requirements for how upgrades and rollbacks are managed. These requirements lead
to complexity and mean that upgrades are infrequent and only permitted in carefully managed maintenance
windows.
However, in public cloud, it's reasonable to create a new deployment in parallel with the existing deployment.
This process creates the opportunity for major simplifications in the application's operational design and
improvements in the user's experience, and expectations.
Experience across a range of mission-critical industries shows that it isn't realistic to attempt to construct a
monolithic solution which will achieve the desired levels of availability. There are just too many potential sources
of failure in a complex system. Instead, successful solutions are composed of individual independent and
redundant elements. Each unit has relatively basic availability (typically ~99.9%), but when combined together
the total solution achieves the necessary availability goals. The art of carrier-grade engineering then becomes
ensuring that the only dependencies common to the redundant elements are those which are absolutely
necessary since they exert the most significant influence on overall availability, often orders of magnitude
greater than anything else in the design.
Using Microsoft Azure Availability Zones is the basic choice for reducing the risk of outage due to hardware
failure or localized environmental issues. However, it isn't enough to achieve carrier-grade availability, mainly for
these reasons:
Availability Zones (AZs) are designed so that the network latency between any two zones in a single
region is ≤ 2 ms . AZs can't be widely and geographically dispersed. So, the AZs share a correlated risk of
failure due to natural disasters, such as flooding or massive power outages, which could disable multiple
AZs within a region.
Many Azure services are explicitly designed to be zone-redundant so that the applications using them
don't need explicit logic to benefit from the availability gain. This redundancy function within the service
requires collaboration between the elements in each zone. There's often an unavoidable risk of software
failure in one zone that causes correlated failures in other zones. For example, any issue with a secret or
certificate used with the zone redundant service could impact all AZs simultaneously.
When designing a carrier-grade workload, consider the following areas:
Data modelData model The design must address the consistency, availability, and
partition tolerance needs of the database.Carrier Grade
availability requires that the application is deployed across
multiple regions. This deployment requires careful thought
about how the application's data will be managed across
those regions.
Health modelingHealth modeling An effective health model provides observability of all
components, platform and application, so that issues can be
quickly detected and response is ready either through self-
healing or other remediations.
Testing and validationTesting and validation The design and implementation of the application must be
thoroughly tested. In addition, the integration and
deployment of the application as a whole solution must be
tested.
DESIGN A READESIGN A REA DESC RIP T IONDESC RIP T ION
Next step
Start by reviewing the design principles for carrier-grade application scenarios.
Design principles
Design principles of a carrier
-
grade workload on
Azure
12/16/2022 2 minutes to read • Edit Online
IMPORTANTIMPORTANT
Carrier grade workload must be designed as per the guiding principles of the Well-Architected Framework
quality pillars:
Reliability
Performance Efficiency
Operational Excellence
Security
Cost Optimization
This article describes the carrier-grade design principles that resonate and extend the mission-critical design
principles. These collective principles serve as a road map for subsequent design decisions across the critical
design areas. We highly recommend that you get to know these principles to better understand their effects and
the trade-offs associated with non-adherence.
There are obvious cost tradeoffs associated with introducing greater reliability, which should be carefully
considered in the context of workload requirements.
This article is part of the Azure Well-Architected carrier-grade workload series. If you aren't familiar with this series, we
recommend you start with What is a carrier-grade workload?
Keep this high-level architecture model in mind when considering these points.
Assume failure
Share nothing
Start from the assumption that everything can, and will fail. Application design must allow for these failures with
fault tolerance so that an application can continue to operate at some level.
Minimize single points of failure and implement a federated approach.
Deploy the application across multiple regions with proper data management across those regions,
allowing for the impacts of CAP theorem.
Detect issues automatically and respond within seconds. For more information, see health monitoring.
Test the full solution including the application implementation, platform integration, and deployment. This
testing should include chaos testing on production systems to avoid testing bias.
Share nothing
is a common and straightforward approach to achieve high availability. Use this approach when
an application can be serviced by multiple, distinct elements, which are interchangeable. The individual elements
must have a well-understood availability metric, but it doesn't need to be high. However, the elements must be
combined in a way to remain independent, with no shared infrastructure or dependencies.
To share nothing is often impossible. To start from the position that nothing
should
be shared, and only add in
the smallest possible set of shared dependencies, should result in an optimal solution.
ExampleExample
Next step
Given a single system that has six hours of downtime per year (around 3.5*9s ), a solution that combines four
systems where the periods of downtime are uncorrelated will experience less than 30s of downtime per year.
As soon as those four systems rely on a common service, such as global DNS, their downtime is no longer
uncorrelated. The resulting downtime will be higher.
Review the fault tolerance design area for carrier-grade workloads.
Design area: Fault tolerance
Fault tolerance for carrier
-
grade workloads
12/16/2022 4 minutes to read • Edit Online
High availability through combination
Traffic management failure response
Repair and recovery failure response
Telecommunication companies have shown that it's possible to implement applications that meet and exceed
carrier-grade availability requirements, whilst running on top of unreliable elements. Companies exceed these
requirements through
fault tolerance
, which includes the following aspects:
Combination of independent, non-highly available elements.
Traffic management failure response mechanisms.
Repair and capacity-recovery and failure response mechanisms.
Use of highly available cross-element databases.
When designing for carrier grade reliability and resiliency, assume that every component can
and
will fail. The
design will require a layered approach to failure resolution. Part of validating the design is creating a
quantitative probabilistic model of the various failure modes which clearly identifies the key dependencies that
the application has, and shows that the application can achieve the necessary availability given those
dependencies meet their own Service Level Objectives (SLOs). This model should be retained and continuously
validated after development and in production to assure that the test and live data match what the model
predicts.
To reach high availability, take independent elements, which aren't highly available, and combine them so that
they remain independent entities. The probability of multiple elements failing together is much lower than the
probability of failure of any single element. We define this process as the Share Nothing architectural style.
When individual elements fail, or there are connectivity issues, the system's traffic management layer has
various ways it can respond to maintain overall service. The solution should implement as many of the
following mechanisms as possible or necessary to achieve availability. The first two mechanisms rely on specific
behavior in the client, which may not always be an option:
1. Instantaneous client retr y to alternate targetInstantaneous client retr y to alternate targetOn failure, or after no response from a retry, the
client can instantly retry the request with an alternative element that's considered likely to succeed, as
opposed to submitting a new DNS query to identify an alternative.
2. Client-based health listsClient-based health lists—Maintains lists of elements, which are
known-healthy
and
known-
unhealthy
. Requests are always sent to a known-healthy element, unless that list is empty. If empty, the
known-unhealthy elements are tried.
3. Ser ver-based pollingSer ver-based polling—System-based distribution mechanisms, such as DNS or Azure Traffic Manager
(ATM), implement their own liveness detection and monitoring to enable routing around unhealthy
elements.
Traffic management responses can work around failures, but where the failure is long-lived or permanent, the
defective element must be repaired or replaced. There are two main choices here:
1. Restoring redundancyRestoring redundancyFailure of an element reduces overall system capacity. System design should
Failure rate analysis
Cascading failures and overload
allow for this capacity reduction through provisioning redundant capacity; however, multiple overlapping
failures will lead to true outages. There must be an automated process that detects the failure and adds
capacity to restore redundancy to its normal levels. The impact can be determined from the failure rate
analysis. In many cases, automatically restoring the redundancy level can increase the acceptable
downtime of any individual element.
2. (Optional) Repairing the failed element(Optional) Repairing the failed elementThe solution may need to include a mechanism that
detects the failure and attempts to repair the failed element in place. This solution may not apply for
cases where the process of restoring redundancy instantiates a new element to replace the failed one,
and terminates and decommissions the failed element.
The following diagram shows how the two modes of failure response cooperate to provide service-level fault
tolerance:
No amount of effort leads to a perfect system, so start with the assumption that everything can and will fail.
Consider how the solution, as a whole, will behave. Failure rate analysis starts with the lower-level individual
systems and combines the results together to model the full solution.
The analysis is typically done using Markov modeling, which considers all possible failure modes. For each
failure mode, consider the following factors:
Rate
Duration and resolution time
Probability of correlated failure (what is the chance that a failure in service X causes a failure in service Y)
The outcome is an estimate of the system availability and informs consideration of the
blast radius
. The blast
radius is the set of things that won't work given a particular failure. Good design aims to limit the scope and
severity of that set, since failure is going to happen. For example, a failure that blocks creation of new users, but
doesn't impact existing ones is less concerning than a failure that stops service for all users.
Failure rate analysis provides an estimate of the overall system-level availability. The analysis identifies the
critical dependencies on which that availability relies. Where failure rate analysis indicates that system
availability falls short, it also provides specific, quantitative insights on where improvements are needed and to
what extent.
For more details on failure mode analysis in Azure, reference Failure mode analysis for Azure applications.
Carrier-grade application design must pay careful attention to the risks of cascading failures, where failure of
one element leads to failure of other elements, often due to overload. Cascading failures aren't unique to carrier-
grade applications, but the reliability and the response time demand graceful degradation and automated
recovery.
Next step
Review the considered data model design area for carrier-grade workloads.
Design area: Data model
Data modeling for carrier
-
grade workloads
12/16/2022 2 minutes to read • Edit Online
CAP theorem
NOTENOTE
IMPORTANTIMPORTANT
Enterprise applications deployed in a single region can typically ignore this model of application design and
safely delegate responsibility to the database layer to make data reliably available. This behavior isn't the case
for carrier-grade applications, which must be deployed across multiple regions. Multi-region deployment forces
the application architect to consider the compromises they're willing to accept for their data.
Databases can provide three key properties for the data they manage for an application, known as CAP:
Consistency: A data read returns the most recent write.
Availability: Every request receives a valid response.
Partition tolerance: The system continues to operate despite delay or total loss of communication between
elements.
The CAP theorem states that a database layer can't provide all three of these properties for the same data at the
same time in the presence of network partitions. The architect needs to make explicit design decisions about
which of the CAP properties to favor under which conditions, and how to deal with the edge cases.
According to the CAP theorem, any database can only guarantee two out of three possible properties for the
same data at the same time in the presence of network partition.
Multi-region deployment means partition tolerance becomes significant. In most cases, carrier-grade architects
prioritize partition tolerance and availability over consistency. For each type of data, the architect must consider
what tradeoffs they're willing to make, considering the edge cases.
For example, consider the database of system users. Is it acceptable for the user database to drop to read-only if
there's a network partition? This behavior prioritizes consistency and read availability over write availability. This
prioritization may not be suitable if it's unacceptable for a user to access an isolated site after their permissions
are revoked elsewhere. The described scenario would require all database access to be blocked if there's a
partition, which prioritizes write availability over read availability.
The compromises made can be different for different databases within the same application, since the databases are likely
to have different usage profiles.
Where consistency is the compromise, the application must cope with data inconsistency artifacts by using
conflict-free replicated data types (CRDTs). The use of CRDTs requires discipline in the application design and
implementation. Their use means data merges following partition events can be handled automatically by the
data layer without human intervention or complex application-level logic.
More details on data platform choices for your mission-critical workload is available here
The architect must also understand the tradeoffs in the data model, which were made within the dependent
services. Those tradeoffs impact the service delivered to their application because those tradeoffs may not align
Next step
with the application-level requirements.
Review the health modeling design area for carrier-grade workloads.
Design area: Health modeling
Health modeling for carrier
-
grade workloads
12/16/2022 3 minutes to read • Edit Online
IMPORTANTIMPORTANT
Management and monitoring
Federated modelFederated model
High availability requires careful health monitoring to automatically detect and respond to issues within
seconds. This monitoring requires built-in telemetry of key dependencies to reliably detect the failure. The
application itself requires additional telemetry (Service Level Indicators) which accurately report the health of
the application in a way that is perceived by users of the application. Evaluation against SLOs may be necessary.
The failure rate analysis and general health modeling of the application should generate clear metrics indicative
of the service and health of its constituent elements. These metrics must be included in the design so that the
true service availability can be monitored. By including metrics, you can track the most useful leading indicators
to trigger the automated failure responses and to generate the necessary alerts for human intervention.
More details on how to build a health model for your mission-critical workload are available here.
Monitoring and management require the following thought processes:
How will the application handle bugs in the framework?
How is the application upgraded?
What actions should be taken during an incident?
For example, a solution may rely on Azure DevOps (ADO) to host its Git repository for all configuration. If the
Azure region hosting that ADO repo fails, the recovery time is two hours. If the solution is deployed in the same
region, it's not possible to modify configuration to add capacity elsewhere for that entire two hour period. As a
result, the application architect must consider correlated failure modes for key services, such as:
Azure Traffic Manager
Azure Key Vault
Azure Kubernetes Service
Correlated failure modes for these key services may be a necessary part of the application-level response to
failure. It's vital to create control planes which aren't impacted by the same application failure.
The management tooling required to issue diagnosis and troubleshooting must be the same as tooling used for
normal day-to-day operations tasks. Similar tooling ensures that it's familiar and proven to work. Similar tooling
also maximizes the users' familiarity with the user interface and process steps. Requiring operators to switch to
a different tool set to resolve a high-pressure outage isn't conducive to identifying and resolving the issue
effectively.
A highly available application or service must have a highly available management and monitoring
infrastructure built using the same well-architected principles of federation and fault tolerance. Infrastructures
built on these well-architected principles ensure individual regions can be self sufficient if disconnected.
If there's a disconnect event, the system degenerates into individually functioning islands, instead of using a
primary/backup system. The federated model is flexible and resilient, and automatically adapts to partition and
reconnection events.
Health and unhealth metrics
Monitoring and tracing
Next step
For example, logs and metrics are stored in the Availability Zone (AZ) where they're generated. A query of
metrics uses an opaque process of federated search to query the metric stores in all reachable AZs. It comes
down to the requirements of the specific application about what level of logs, metrics, and alarms data should
be replicated to other regions. Typically, alarms should be replicated, but there may be insufficient justification to
replicate logs and metrics.
Internal metrics are useful as
unhealth
metrics. These metrics reliably indicate the presence of an issue, but the
reverse isn't true. No evidence of poor health isn't evidence of good health, as the customer perceives health.
For example, a DNS issue indicates requests aren't arriving at the database service. The DNS error doesn't affect
the database read-success metric because this metric isn't seeing any errors. However, the end user perceives a
total outage because they aren't able to access the database. At least a portion of
health
metrics must be
measured externally, so that these metrics include everything the end user will experience.
The support team's ability to detect, diagnose, and resolve issues is an important part of delivering a highly
available application. To ensure success, the monitoring and tracing element must deliver high levels of visibility,
so that one in 1,000 events can be found and resolved.
A tracing solution that only logs 0.1% of requests has a 1,000,000 chance of recording such events, which
means the diagnosis and resolution are highly unlikely. Yet, failure to resolve such issues will have a meaningful
impact on availability.
Review the testing and validation design area for carrier-grade workloads.
Design area: Testing and validation
Testing and validation for carrier
-
grade workloads
12/16/2022 2 minutes to read • Edit Online
IMPORTANTIMPORTANT
Human error
Clients
Continuous testing and validation can detect and help resolve issues before they become potentially life
threatening. Consider well-known testing methodologies such as chaos testing. Testing should be conducted for
the lifetime of the application because the deployment environment is complex and multi-layered.
More details on how to implement continuous validation for your mission-critical workload is available here.
Also, supportability must be strong throughout the application lifetime. Highly available systems rely on high
quality support teams able to rapidly respond to and resolve issues in the field, conduct root cause analysis and
look for systematic design flaws.
Proving that an application is well architected requires testing, ideally use a chaos testing framework to avoid
testing bias. This methodology simulates failures of all dependent elements. Robust and regular testing should
both prove the design and validate the original failure mode analysis.
A warning flag should be raised for any application or service for which the redundancy or resiliency measures
can't be tested because it's considered
too risky
.
If redundancy and resiliency measures aren't tested, then the only valid assumption, from a safety-critical point
of view, is that these measures aren't going to work when needed. Using common paths for software upgrades,
configuration updates, and fault recovery, for example, provide a good mechanism for validating that measures
will work.
Experience from Telcos is that as much as 60% of all outages are actually the result of human error. A well-
architected application recognizes this and seeks to compensate. Here are some suggested approaches, but the
list is not exhaustive, and what is applicable to a given workload needs to be considered on a case-by-case basis.
Maximizing use of automation avoids human operators having to enter long and complex commands or
conduct repetitive operations across multiple elements. However, care must be taken to consider the blast
radius, as there is a risk of automation actually magnifying the effect of a configuration error, allowing it to
roll out across a global network in seconds. Strong checks and balances such as decision gates requiring
human approval before proceeding to the next step are advised.
Leveraging syntax checkers and simulation tools minimize the chance of errors or unforeseen side effects
from changes making their way into widespread production.
Use of carefully controlled canary deployments ensure that the effect of changes in full production can be
observed and validated at limited scope.
Ensuring that the management interfaces and processes needed for fault recovery are the same as those
used in day-to-day operation avoid operators being confronted with unfamiliar screens and barely used
method of procedures (MOPs) at times of peak stress.
Common client libraries are also part of the end-to-end system and need equivalent analysis and testing.
Software issues in common client code that simultaneously impacts a proportion of the system clients will
Next step
impact overall availability in the same way as application server-side issues.
Revisit the five pillars of architectural excellence to form a solid foundation for your carrier-grade workloads.
Azure Well-Architected Framework
Overview of a hybrid workload
12/16/2022 4 minutes to read • Edit Online
Extend Azure management to any infrastructure
TIPTIP
Run Azure services anywhere
Run Azure data services anywhereRun Azure data services anywhere
Customer workloads are becoming increasingly complex, with many applications often running on different
hardware across on-premises, multicloud, and the edge. Managing these disparate workload architectures,
ensuring uncompromised security, and enabling developer agility are critical to success.
Azure uniquely helps you meet these challenges, giving you the flexibility to innovate anywhere in your hybrid
environment while operating seamlessly and securely. The Well-Architected Framework includes a hybrid
description for each of the five pillars: cost optimization, operational excellence, performance efficiency,
reliability, and security. These descriptions create clarity on the considerations needed for your workloads to
operate effectively across hybrid environments.
Adopting a hybrid model offers multiple solutions that enable you to confidently deliver hybrid workloads: run
Azure data services anywhere, modernize applications anywhere, and manage your workloads anywhere.
Applying the principles in this article series to each of your workloads will better prepare you for hybrid adoption. For
larger or centrally managed organizations, hybrid and multicloud are commonly part of a broader strategic objective. If
you need to scale these principle across a portfolio of workloads using hybrid and multicloud environments, you may
want to start with the Cloud Adoption Framework's hybrid and multicloud scenario and best practices. Then return to this
series to refine each of your workload architectures.
Use
Azure Arc enabled infrastructure
to extend Azure management to any infrastructure in a hybrid
environment. Key features of Azure Arc enabled infrastructure are:
Unified OperationsUnified Operations
Organize resources such as virtual machines, Kubernetes clusters and Azure services deployed across
your entire IT environment.
Manage and govern resources with a single pane of glass from Azure.
Integrate with Azure Lighthouse for managed service provider support.
Adopt cloud practicesAdopt cloud practices
Easily adopt DevOps techniques such as infrastructure as code.
Empower developers with self-service and choice of tools.
Standardize change control with configuration management systems, such as GitOps and DSC.
Azure Arc allows you to run Azure Services anywhere. This allows you to build consistent hybrid and multicloud
application archtiectures by using Azure services that can run in Azure, on-premises, at the edge, or at other
cloud providers.
Use
Azure Arc enabled data services
to run Azure data services anywhere to support your hybrid workloads.
Key features of Azure Arc enabled data services are:
Run Azure Application services anywhereRun Azure Application services anywhere
Modernize applications anywhere
Manage workloads anywhere
Next steps
Run Azure data services on any Kubernetes cluster deployed on any hardware.
Gain cloud automation benefits, always up-to-date innovation in Azure data services, unified management of
your on-premises and cloud data assets with a cloud billing model across both environments.
Azure SQL Database and Azure PostgreSQL Hyperscale are the first set of Azure data services that are Azure
Arc enabled.
Use
Azure Arc enabled Application services
to run Azure App Service, Functions, Logic Apps, Event Grid, and API
Management anywhere to support your hybrid workloads. Key features of Azure Arc enabled application
services are as follows:
Web Apps - Azure App Service makes building and managing web applications and APIs easy, with a fully
managed platform and features like autoscaling, deployment slots, and integrated web authentication.
Functions - Azure Functions makes event-driven programming simple, with state-of-the-art autoscaling, and
with triggers and bindings to integrate with other Azure services.
Logic Apps - Azure Logic Apps produces automated workflows for integrating apps, data, services, and
backend systems, with a library of more than 400 connectors.
Event Grid - Azure Event Grid simplifies event-based applications, with a single service for managing the
routing of events from any source to any destination.
Azure API Management gateway - Azure API Management provides a unified management experience and
full observability across all internal and external APIs.
Use the
Azure Stack family
to modernize applications without ever leaving the datacenter. Key features of the
Azure Stack family are:
Extend Azure to your on-premises workloads with Azure Stack Hub. Build and run cloud apps on premises, in
connected or disconnected scenarios, to meet regulatory or technical requirements.
Use Azure Stack HCI to run virtualized workloads on premises and easily connect to Azure to access cloud
management and security services.
Build and run your intelligent edge solutions on Azure Stack Edge, an Azure managed appliance to run
machine learning models and compute at the edge to get results quickly—and close to where data is being
generated. Easily transfer the full data set to Azure for further analysis or archive.
Use
Azure Arc management
to extend Azure management to all assets in your workloads, regardless of where
they are hosted. Key features of Azure Arc management are:
Adopt cloud practicesAdopt cloud practices
Easily adopt DevOps techniques such as infrastructure as code.
Empower developers with self-service and choice of tools.
Standardize change control with configuration management systems, such as GitOps and DSC.
Scale across workloads with Scale across workloads with Unified OperationsUnified Operations
Organize resources such as virtual machines, Kubernetes clusters and Azure services deployed across
your entire IT environment.
Manage and govern resources with a single pane of glass from Azure.
Integrate with Azure Lighthouse for managed service provider support.
Cost optimization
Cost optimization in a hybrid workload
12/16/2022 4 minutes to read • Edit Online
Workload definitions
Functionality
A key benefit of hybrid cloud environments is the ability to scale dynamically and back up resources in the
cloud, avoiding the capital expenditures of a secondary datacenter. However, when workloads sit in both on-
premises and cloud environments, it can be challenging to have visibility into the cost. With Azure's hybrid
technologies, you can define policies and constraints for both on-premises and cloud workloads with Azure Arc.
By utilizing Azure Policy, you're able to enforce organizational standards for your workload and the entire IT
estate.
Azure Arc helps minimize or even eliminate the need for on-premises management and monitoring systems,
which reduces operational complexity and cost, especially in large, diverse, and distributed environments. This
helps offset additional costs associated with Azure Arc-related services. For example, advanced data security for
Azure Arc enabled SQL Server instance requires Microsoft Defender for Cloud functionality of Microsoft
Defender for Cloud, which has pricing implications.
Other considerations are described in the Principles of cost optimization section in the Microsoft Azure Well-
Architected Framework.
Define the following for your workloads:
Monitor cloud spend with hybrid workloadsMonitor cloud spend with hybrid workloads. Track cost trends and forecast future spend with
dashboards in Azure for your on-prem data estates with Azure Arc.
Keep within cost constraintsKeep within cost constraints.
Choose a flexible billing modelChoose a flexible billing model. With Azure Arc enabled data services, you can use existing hardware
with the addition of an operating expense (OPEX) model.
Create, apply, and enforce standardized and custom tags and policies.
Enforce run-time conformance and audit resources with Azure Policy.
For budget concerns, you get a considerable amount of functionality at no cost that you can use across all of
your servers and cluster with Azure Arc enabled servers. You can turn on additional Azure services to each
workload as you need them, or not at all.
Free Core Azure Arc capabilitiesFree Core Azure Arc capabilities
Update, management
Search index
Group, tags
Portal
Templates, extensions
RBAC, subscriptions
Paid-for Azure Arc enabled attached ser vicesPaid-for Azure Arc enabled attached ser vices
Azure policy
Azure monitor
Defender for Cloud Standard
TipsTips
Azure Architecture Center (AAC) resources related to hybrid cost
Infrastructure Decisions
Capacity planningCapacity planning
Provision
Microsoft Sentinel
Backup
Config and change management
Star t slowStar t slow. Light up new capabilities as needed. Most of Azure Arc's resources are free to start.
Save time with unified managementSave time with unified management for your on-premises and cloud workloads by projecting them all
into Azure.
Automate and delegateAutomate and delegate remediation of incidents and problems to service teams without IT intervention.
Manage configurations for Azure Arc enabled servers
Azure Arc hybrid management and deployment for Kubernetes clusters
Optimize administration of SQL Server instances in on-premises and multi-cloud environments by
leveraging Azure Arc
Disaster Recovery for Azure Stack Hub virtual machines
Build high availability into your BCDR strategy
Use Azure Stack HCI switchless interconnect and lightweight quorum for Remote Office/Branch Office
Archive on-premises data to cloud
Azure Stack HCI can help in cost-savings by using your existing Hyper-V and Windows Server skills to
consolidate aging servers and storage. Azure Stack HCI pricing follows the monthly subscription billing model,
with a flat rate per physical processor core in an Azure Stack HCI cluster.
Use Azure Stack HCI to modernize on-prem workloads with hyperconverged infra. Azure Stack HCI billing is
based on a monthly subscription fee per physical processor core, not a perpetual license. When customers
connect to Azure, the number of cores used is automatically uploaded and assessed for billing purposes. Cost
doesn't vary with consumption beyond the physical processor cores. This means that more VMs don't cost
more, and customers who are able to run denser virtual environments are rewarded.
If you are currently using VMware, you can take advantage of cost savings only available with Azure VMware
Solution. Easily move VMware workloads to Azure and increase your productivity with elasticity, scale, and fast
provisioning cycles. This will help enhance your workloads with the full range of Azure compute, monitor,
backup, database, IoT, and AI services.
Lastly, you can slowly begin migrating out of your datacenter and use Azure Arc while you're migrating to
project everything into Azure.
Check out our checklist under the Cost Optimization pillar in the Well-Framework to learn more about capacity
planning, and build a checklist to design cost-effective workloads.
Define SLAs
Determine regulatory needs
One advantage of cloud computing is the ability to use the PaaS model. And in some cases, PaaS services can be
cheaper than managing VMs on your own. Some workloads cannot be moved to the cloud though for
regulatory or latency reasons. Therefore, using a service like Azure Arc enabled services allows you to flexibly
Monitor and optimize
Next steps
use cloud innovation where you need it by deploying Azure services anywhere.
Click the following links for guidance in provisioning:
Azure Arc pricing
Azure Arc Jumpstart for templates (in GitHub)
Azure Stack HCI pricing
Azure VMware Solution pricing - Run your VMware workloads natively on Azure
Azure Stack Hub pricing
Azure Stack HCI can reduce costs by saving in server, storage, and network infrastructure.
Run your VMware workloads natively on Azure.
Treat cost monitoring and optimization as a process, rather than a point-in-time activity. You can conduct regular
cost reviews and forecast the capacity needs so that you can provision resources dynamically and scale with
demand.
Managing the Azure Arc enabled servers agent
With Azure Stack HCI
Bring all your resources into a single system so you can organize and inventory through a variety of
Azure scopes, such as Management groups, Subscriptions, and Resource Groups.
Create, apply, and enforce standardized and custom tags to keep track of resources.
Build powerful queries and search your global portfolio with Azure Resource Graph.
Costs for datacenter real estate, electricity, personnel, and servers can be reduced or eliminated.
Costs are now part of OPEX, which can be scaled as needed.
Operational excellence
Operational excellence in a hybrid workload
12/16/2022 5 minutes to read • Edit Online
Build cloud native apps anywhere, at scale
Connect Kubernetes clusters to Azure and start deploying using a
GitOps model
Operational excellence consists of the operations processes that keep a system running in production.
Applications must be designed with DevOps principles in mind, and deployments must be reliable and
predictable. Use monitoring tools to verify that your application is running correctly and to gather custom
business telemetry that will tell you whether your application is being used as intended.
Use
Azure Arc enabled infrastructure
to add support for cloud Operational Excellence practices and tools to any
environment. Be sure to utilize reference architectures and other resources from this section that illustrate
applying these principles in hybrid and multicloud scenarios. The architectures referenced here can also be
found in the Azure Architecture Center, Hybrid and Multicloud category.
To keep your systems running, many workload teams have architected and designed applications where
components are distributed across public cloud services, private clouds, data centers, and edge locations. With
Azure Arc enabled Kubernetes, you can accelerate development by using best in class application services with
standardized deployment, configuration, security, and observability. One of the primary benefits of Azure Arc is
facilitating implementation of DevOps principles that apply established development practices to operations.
This results in improved agility, without jeopardizing the stability of IT environment.
Centrally code and deploy applications confidently to any Kubernetes distribution in any location.
Centrally manage and delegate access for DevOps roles and responsibilities.
Reduce errors with consistent configuration and policy-driven deployment and operations for applications
and Kubernetes clusters.
Delegate access for DevOps roles and responsibilities through Azure RBAC.
Reduce errors with consistent policy driven deployment and operations through GitHub and Azure Policy.
GitOps relies on a Git repository to host files that contain the configuration representing the expected state of a
resource. An agent running on the cluster monitors the state of the repository and, when there is a change on
the repository, the agent pulls the changed files to the cluster and applies the new configuration.
In the context of Azure Arc enabled Kubernetes clusters, a Git repository hosts a configuration of a Kubernetes
cluster, including its resources such as pods and deployments. A pod or a set of pods running on the cluster
polls the status of the repository and, once it detects a change, it pulls and applies the new configuration to the
cluster.
Azure Arc enabled Kubernetes clusters rely on Flux, an open-source GitOps deployment tool to implement the
pods responsible for tracking changes to the Git repository you designate and applying them to the local cluster.
In addition, the containerized Flux operator also periodically reviews the existing cluster configuration to ensure
that it matches the one residing in the Git repository. If there is a configuration drift, the Flux agent remediates it
by reapplying the desired configuration.
Each association between an Azure Arc enabled Kubernetes cluster configuration and the corresponding GitOps
repository resides in Azure, as part of the Azure Resource Manager resource representing the Azure Arc enabled
Kubernetes clusters. You can configure that association via traditional Azure management interfaces, such as the
Modernize applications anywhere with Azure Kubernetes Service on
Azure Stack HCI
Azure Stack HCI use casesAzure Stack HCI use cases
Resources and architectures related to Operational Excellence
Application designApplication design
Azure portal or Azure CLI. Alternatively, you can use Azure Policy to automate this process, allowing you to apply
it consistently to all resources in an entire subscription or individual resource groups you designate.
If you are looking for a fully managed Kubernetes solution on-premises in your datacenters and/or edge
locations, AKS on Azure Stack HCI is a great option. Azure Kubernetes Service on Azure Stack HCI is an on-
premises implementation of Azure Kubernetes Service (AKS), which automates running containerized
applications at scale. Azure Kubernetes Service is now in preview on Azure Stack HCI and Windows Server 2019
Datacenter, making it quicker to get started hosting Linux and Windows containers in your datacenter.
AKS clusters on Azure Stack HCI can be connected to Azure Arc for centralized management. Once connected,
you can deploy your applications and Azure data services to these clusters and extend Azure services such as
Azure Monitor, Azure Policy and Microsoft Defender for Cloud.
Modernize your high-performance workloads and containerized applicationsModernize your high-performance workloads and containerized applications
Use Azure Stack HCI to enable automated deployment, scaling and management of containerized
applications by running a Kubernetes cluster on your hyperconverged infrastructure.
Deploy AKS on Azure Stack HCI using Windows Admin Center or PowerShell.
Deploy and manage workloads in remote and branch sitesDeploy and manage workloads in remote and branch sites
Use Azure Stack HCI to deploy your container-built edge workloads, and essential business
applications in highly available virtual machines (VMs).
Bring efficient application development and deployment to remote locations at the right price by
leveraging switchless deployment and 2 node clusters.
Get a global view of your system's health using Azure Monitor.
Upgrade your infrastructure for remote work using VDIUpgrade your infrastructure for remote work using VDI
Bring desktops on-premises for low latency and data sovereignty enabling remote work using a
brokerage service like Microsoft Remote Desktop Services. With Azure Stack HCI you can scale your
resources in a simple predictable way. Provide a secure way to deliver desktop services to a wide
range of devices without allowing users to store data locally or upload data from those local devices.
The introduction of cloud computing had a significant impact on how software is developed, delivered, and run.
With
Azure Arc enabled infrastructure
and Azure Arc components like Azure Arc enabled Kubernetes and Azure
Arc enabled data services it becomes possible to design cloud native applications with a consistent set of
principles and tooling across public cloud, private cloud, and the edge.
Click the following links for architecture details and diagrams that enable application design and DevOps
practices consistent with Operational excellence principles.
Azure Arc hybrid management and deployment for Kubernetes clusters
Run containers in a hybrid environment
Managing K8 clusters outside of Azure with Azure Arc
Optimize administration of SQL Server instances in on-premises and multi-cloud environments by
leveraging Azure Arc
Azure Data Studio dashboards
MonitoringMonitoring
Application performance managementApplication performance management
Manage data anywhere
Next steps
microsoft/azure_arc: Azure Arc environments bootstrapping for everyone (in github.com)
All Azure Architecture Center Hybrid and Multicloud Architectures
Enable monitoring of Azure Arc enabled Kubernetes cluster
Azure Monitor for containers overview
Hybrid availability and performance monitoring
Performance Efficiency
Performance efficiency in a hybrid workload
12/16/2022 5 minutes to read • Edit Online
Azure Arc design
Azure Arc enabled Kubernetes
Azure Arc enabled SQL Managed Instance
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. In a hybrid environment, it is important to consider how you manage your on-premises or
multicloud workloads to ensure they can meet the demands for scale. You have options to scale up into the
cloud when your on-premises resources reach capacity. Scale up, down, and scale out your databases without
application downtime.
Using a tool like Azure Arc, you can build cloud native apps anywhere, at scale. Architect and design hybrid
applications where components are distributed across public cloud services, private clouds, data centers and
edge locations without sacrificing central visibility and control. Deploy and configure applications and
Kubernetes clusters consistently and at scale from source control and templates. You can also bring PaaS
services on premises. This allows you to use cloud innovation flexibly, where you need it by deploying Azure
services anywhere. Implement cloud practices and automation to deploy faster, consistently, and at scale with
always up-to-date Azure Arc enabled services. You can scale elastically based on capacity, with the ability to
deploy in seconds.
The first steps with Azure Arc are to connect the machines to Azure. To use Azure Arc to connect the machine to
Azure, you need to install the Azure Connected Machine agent on each machine that you plan to connect using
Azure Arc. You can connect any other physical or virtual machine running Windows or Linux to Azure Arc.
There are four ways to connect machines:
1. Manual installation
2. Script-based installation
3. Connect machines at scale using service principal
4. Installation using Windows PowerShell DSC
After connecting the machines, you can then manage the VM extensions all from Azure, which provides
consistent extension management between Azure and non-Azure VMs. In Azure you can use Azure Automation
State Configuration to centrally store configurations and maintain the desired state of Arc enabled servers
through the DSC VM extension. You can also collect log data for analysis with Azure Monitor Logs enabled
through the Log Analytics agent VM extension. With Azure Monitor, you can analyze the performance of your
Windows and Linux VMs and monitor their processes and dependencies on other resources and external
processes.
With Azure Arc enabled Kubernetes, you need to register the cluster first. You can register any CNCF Kubernetes
cluster that is running. You'll need a kubeconfig file to access the cluster and cluster-admin role on the cluster for
deploying Arc-enabled Kubernetes agents. You'll use Azure Command-Line Interface (Azure CLI) to perform
cluster registration tasks.
When planning for deployment of Azure Arc enabled SQL Managed Instance, you should identify the correct
amount of compute, memory, and storage that will be required to run the Azure Arc data controller and the
NOTENOTE
Azure Stack HCI
Monitoring in a hybrid environment
Monitoring containers
intended SQL managed instance server groups.
You have the flexibility to extend the capacity of the underlying Kubernetes or AKS cluster over time by adding
additional compute nodes or storage. Kubernetes or AKS offers an abstraction layer over the underlying
virtualization stack and hardware. Storage classes implement such abstraction for storage.
When provisioning a pod, you need to decide which storage class to use for its volumes. Your decision is important from a
performance standpoint because an incorrect choice could result in suboptimal performance.
When planning for deployment of Azure Arc enabled SQL Managed Instance, you should consider a range of
factors affecting storage configuration kubernetes-storage-class-factors for both data controller and database
instances.
With the scope of Azure Arc extended to Azure Stack HCI VMs, you'll be able to automate their configuration by
using Azure VM extensions and evaluate their compliance with industry regulations and corporate standards by
using Azure Policy.
In remote office/branch office scenarios, you must consider storage resiliency versus usage efficiency, versus
performance. Planning for Azure Stack HCI volumes involves identifying the optimal balance between resiliency,
usage efficiency, and performance. The challenge results from the fact that maximizing one of these
characteristics typically has a negative impact on at least one of the other two.
To learn more, see Use Azure Stack HCI switchless interconnect and lightweight quorum for Remote
Office/Branch Office.
Monitoring in a hybrid environment can be a challenge. However, with tools like Azure Arc as you bring Azure
services on-premises, you can easily enroll in additional Azure services such as monitoring, security, and update
by simply turning them on.
Across products: Integrate with Microsoft Sentinel, Microsoft Defender for Cloud
Bring Microsoft Defender for Cloud to your on-prem data and servers with Arc
Set security policies, resource boundaries, and RBAC for workloads across the hybrid infra
Proper admin roles for read, modify, re-onboard, and delete a machine
Monitoring your containers is critical. Azure Monitor for containers provides a rich monitoring experience for
the AKS and AKS engine clusters.
Configure Azure Monitor for containers to monitor Azure Arc enabled Kubernetes clusters hosted outside of
Azure. This helps achieve comprehensive monitoring of your Kubernetes clusters across Azure, on-premises,
and third-party cloud environments.
Azure Monitor for containers can provide you with performance visibility by collecting memory and processor
metrics from controllers, nodes, and containers available in Kubernetes through the Metrics application
programming interface (API). Container logs are also collected. After you enable monitoring from Kubernetes
clusters, metrics and logs are automatically collected for you through a containerized version of the Log
Analytics agent. Metrics are written to the metrics store and log data is written to the logs store associated with
your Log Analytics workspace. For more information about Azure Monitor for containers, refer to Azure Monitor
Deploy and manage containerized applications
Next steps
for containers overview.
Enable Azure Monitor for containers for one or more existing deployments of Kubernetes by using either a
PowerShell or a Bash script. To enable monitoring for Arc enabled Kubernetes clusters, refer to Enable
monitoring of Azure Arc enabled Kubernetes cluster.
Automatically enroll in additional Azure Arc enabled resources and services. Simply turn them on when needed:
Strengthen your security posture and protect against threats by turning on Microsoft Defender for Cloud.
Get actionable alerts from Azure Monitor.
Detect, investigate, and mitigate security incidents with the power of a cloud-native SIEM, by turning on
Microsoft Sentinel.
Deploy and manage containerized applications with GitHub and Azure Policy. Ensure that applications and
clusters are consistently deployed and configured at scale from source control. Write with your familiar tool set
to the same application service APIs that can run consistently on-premises, across multicloud, and in edge
environments. Easily instrument Azure monitoring, telemetry, and security services into your hybrid apps
wherever they run.
Reliability
Reliability in a hybrid workload
12/16/2022 5 minutes to read • Edit Online
NOTENOTE
Backup and Recovery
In the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent failures altogether,
the goal is to minimize the effects of a single failing component. While historically you may have purchased
levels of redundant higher-end hardware to minimize the chance of an entire application platform failing, in the
cloud, we acknowledge up front that failures will happen.
For hybrid scenarios, Azure offers an end-to-end backup and disaster recovery solution that's simple, secure,
scalable, and cost-effective, and can be integrated with on-premises data protection solutions. In the case of
service disruption or accidental deletion or corruption of data, recover your business services in a timely and
orchestrated manner.
Many customers operate a second datacenter, however, Azure can help reduce the costs of deploying,
monitoring, patching, and scaling on-premises disaster recovery infrastructure, without the need to manage
backup resources or build a secondary datacenter.
Extend your current backup solution to Azure, or easily configure our application-aware replication and
application-consistent backup that scales based on your business needs. The centralized management interface
for Azure Backup and Azure Site Recovery makes it simple to define policies to natively protect, monitor, and
manage enterprise workloads across hybrid and cloud. These include:
Azure Virtual Machines
SQL and SAP databases
On-premises Windows servers
VMware machines
By not having to build on-premises solutions or maintain a costly secondary datacenter, customers can reduce the cost of
deploying, monitoring, patching, and scaling disaster recovery infrastructure by backing up their hybrid data and
applications with Azure.
Together, Azure Backup and Azure Site Recovery use the underlying power and unlimited scale of the cloud to
deliver high availability with minimal maintenance or monitoring overhead. These native capabilities are
available through a pay-as-you-use model that bills only for the storage that is consumed.
Using Azure Site Recovery, users can set up and manage replication, failover, and failback from a single location
in the Azure portal. The Azure hybrid services tool in Windows Admin Center can also be used as a centralized
hub to easily discover all the available Azure services that bring value to on-premises or hybrid environments.
Windows Admin Center streamlines setup and the process of replicating virtual machines on Hyper-V servers or
clusters, making it easier to bolster the resiliency of environments with Azure Site Recovery's disaster recovery
service.
Azure is committed to providing the best-in-class data protection to keep your applications running. Azure
Backup protects backups of on-premises and cloud-resources from ransomware attacks by isolating backup
data from source data, combined with multi-factor authentication and the ability to recover maliciously or
accidentally deleted backup data. With Azure Site Recovery you can fail over VMs to the cloud or between cloud
Availability Considerations
For Azure ArcFor Azure Arc
Azure Arc enabled data servicesAzure Arc enabled data services
Azure Stack HCI
Site
-
level fault domainsSite
-
level fault domains
Site awarenessSite awareness
Resiliency limitsResiliency limits
data centers and secure them with network security groups.
In the case of a disruption, accidental deletion, or corruption of data, customers can rest assured that they will be
able to recover their business services and data in a timely and orchestrated manner. These native capabilities
support low recovery-point objective (RPO) and recovery-time objective (RTO) targets for any mission-critical
workload in your organization. Azure is here to help customers pivot towards a strengthened BCDR strategy.
In most cases, the location you select when you create the installation script should be the Azure region
geographically closest to your machine's location. The rest of the data will be stored within the Azure geography
containing the region you specify, which might also affect your choice of region if you have data residency
requirements. If an outage affects the Azure region to which your machine is connected, the outage will not
affect the Arc enabled server, but management operations using Azure might not be able to complete. For
resilience in the event of a regional outage, if you have multiple locations which provide a geographically-
redundant service, it's best to connect the machines in each location to a different Azure region.
Ensure that Azure Arc is supported in your regions by checking supported regions. Also, ensure that services
referenced in the Architecture section are supported in the region to which Azure Arc is deployed.
With Azure Arc enabled SQL Managed Instance, you can deploy individual databases in either a single or
multiple pod pattern. For example, the developer or general-purpose pricing tier implements a single pod
pattern, while a highly available business critical pricing tier implements a multiple pod pattern. A highly
available Azure SQL managed instance uses Always On Availability Groups to replicate the data from one
instance to another either synchronously or asynchronously.
With Azure Arc enabled SQL Managed Instance, planning for storage is also critical from the data resiliency
standpoint. If there's a hardware failure, an incorrect choice might introduce the risk of total data loss. To avoid
such risk, you should consider a range of factors affecting storage configuration kubernetes-storage-class-
factors for both data controller and database instances.
Azure Arc enabled SQL Managed Instance provides automatic local backups, regardless of the connectivity
mode. In the Directly Connected mode, you also have the option of leveraging Azure Backup for off-site, long-
term backup retention.
Each physical site of an Azure Stack HCI stretched cluster represents distinct fault domains that provide
additional resiliency. A fault domain is a set of hardware components that share a single point of failure. To be
fault tolerant to a particular level, you need multiple fault domains at that level.
Site awareness allows you to control placement of virtualized workloads by designating their preferred sites.
Specifying the preferred site for a stretched cluster offers many benefits, including the ability to group
workloads at the site level and to customize quorum voting options. By default, during a cold start, all virtual
machines use the preferred site, although it is also possible to configure the preferred site at the cluster role or
group level.
Azure Stack HCI provides multiple levels of resiliency, but because of its hyper-converged architecture, that
resiliency is subject to limits imposed not only by the cluster quorum, but also by the pool quorum. You can
Next steps
eliminate this limit by implementing cluster sets in which you combine multiple Azure Stack HCI clusters to
create an HCI platform consisting of hundreds of nodes.
Security
Security in a hybrid workload
12/16/2022 3 minutes to read • Edit Online
Azure Architecture Center (AAC) resources
Principles
Azure Arc management security capabilitiesAzure Arc management security capabilities
Azure Arc enabled data services security capabilitiesAzure Arc enabled data services security capabilities
Azure Stack HCIAzure Stack HCI
Security is one of the most important aspects of any architecture. Particularly in hybrid and multicloud
environments, an architecture built on good security practices should be resilient to attacks and provide
confidentiality, integrity, and availability. To assess your workload using the tenets found in the Microsoft Azure
Well-Architected Framework, see the Microsoft Azure Well-Architected Review.
Microsoft Defender for Cloud can monitor on-premises systems, Azure VMs, Azure Monitor resources, and even
VMs hosted by other cloud providers. To support that functionality, the standard fee-based tier of Microsoft
Defender for Cloud is needed. We recommend that you use the 30-day free trial to validate your requirements.
Defender for Cloud's operational process won't interfere with your normal operational procedures. Instead, it
passively monitors your deployments and provides recommendations based on the security policies you enable.
Microsoft Sentinel can help simplify data collection across different sources, including Azure, on-premises
solutions, and across clouds using built-in connectors. Microsoft Sentinel works to collect data at cloud scale—
across all users, devices, applications, and infrastructure, both on-premises and in multiple clouds.
Hybrid Security Monitoring using Microsoft Defender for Cloud and Microsoft Sentinel
DevSecOps in Azure
Optimize administration of SQL Server instances in on-premises and multi-cloud environments by
leveraging Azure Arc
Implement a secure hybrid network
Securely managed web applications
Access unique Azure security capabilities such as Microsoft Defender for Cloud.
Centrally manage access for resources with Role-Based Access Control.
Centrally manage and enforce compliance and simplify audit reporting with Azure Policy.
Protect your data workloads with Microsoft Defender for Cloud in your environment, using the advanced
threat protection and vulnerability assessment features for unmatched security.
Set security policies, resource boundaries, and role-based access control for various data workloads
seamlessly across your hybrid infrastructure.
Protection in transitProtection in transit. Storage Replica offers built-in security for its replication traffic. This includes packet
signing, AES-128-GCM full data encryption, support for Intel AES-NI encryption acceleration, and pre-
authentication integrity man-in-the-middle attack prevention.
Encr yption at restEncr yption at rest. Azure Stack HCI supports BitLocker Drive Encryption for its data volumes, thus
facilitating compliance with standards such as FIPS 140-2 and HIPAA.
Integration with a range of Azure ser vices that provide more security advantagesIntegration with a range of Azure ser vices that provide more security advantages. You can
integrate virtualized workloads that run on Azure Stack HCI clusters with Azure services such as Microsoft
Storage Replica also utilizes Kerberos AES256 for authentication between the replicating nodes.
Design
Azure Arc enabled serversAzure Arc enabled servers
Azure Stack HCIAzure Stack HCI
Monitor
Defender for Cloud.
Firewall-friendly configurationFirewall-friendly configuration. Storage Replica traffic requires a limited number of open ports between
the replicating nodes.
Implement Azure MonitorImplement Azure Monitor
Use Azure Monitor to monitor your VMs, virtual machine scale sets, and Azure Arc machines at scale. Azure
Monitor analyzes the performance and health of your Windows and Linux VMs. It also monitors their processes
and dependencies on other resources and external processes. It includes support for monitoring performance
and application dependencies for VMs that are hosted on-premises or in another cloud provider.
Implement Microsoft SentinelImplement Microsoft Sentinel
Use Microsoft Sentinel to deliver intelligent security analytics and threat intelligence across the enterprise. This
provides a single solution for alert detection, threat visibility, proactive hunting, and threat response. Microsoft
Sentinel is a scalable, cloud-native, security information event management (SIEM) and security orchestration
automated response (SOAR) solution that enables several scenarios including:
Collect data at cloud scale across all users, devices, applications, and infrastructure, both on-premises and in
multiple clouds.
Detect previously undetected threats and minimize false positives.
Investigate threats with artificial intelligence and hunt for suspicious activities at scale.
Respond to incidents rapidly with built-in orchestration and automation of common tasks.
A stretched Azure Stack HCI cluster relies on Storage Replica to perform synchronous storage replication
between storage volumes hosted by the two groups of nodes in their respective physical sites. If a failure affects
the availability of the primary site, the cluster automatically transitions its workloads to nodes in the surviving
site to minimize potential downtime.
Across products: Integrate with Microsoft Sentinel and Microsoft Defender for Cloud.
Bring Microsoft Defender for Cloud to your on-premises data and servers with Arc.
Set security policies, resource boundaries, and RBAC for workloads across the hybrid infrastructure.
Set the correct admin roles to read, modify, re-onboard, and delete a machine.
Overview of well
-
architected IoT workloads
12/16/2022 12 minutes to read • Edit Online
Azure Well-Architected Framework for IoT workloads
IoT architectural patterns
Internet of Things (IoT) is a collection of managed and platform services across edge and cloud environments
that connect, monitor, and control physical assets. The IoT workload for the Microsoft Azure Well-Architected
Framework helps you meet architectural challenges to design, build, and operate IoT solutions according to your
requirements and constraints.
The IoT Well-Architected Framework addresses the three components of IoT systems:
Things
, or the physical objects, industrial equipment, devices, and sensors that connect to the cloud
persistently or intermittently.
Insights
, information that the things collect that humans or AI analyze and turn into actionable knowledge.
Actions
, the responses of people or systems to insights, which connect to business outcomes, systems, and
tools.
The IoT Well-Architected Framework uses a set of
IoT guiding principles
based on the Azure Well-Architected
Framework to drive planning and decision making. The principles guide a
layered architecture
approach that
identifies the logical elements of an IoT solution in either a
connected components
or
connected operations
architectural pattern.
This article describes the IoT guiding principles, architectural patterns, and architecture layers in the IoT Well-
Architectured Framework. The remaining articles in this series delve into how to apply the Azure Well-
Architected Framework pillars of excellence to IoT solutions.
The Azure Well-Architected Framework consists of five pillars of architectural excellence, which you can use to
improve the quality of IoT workloads. The following articles highlight how each pillar relates to IoT workloads
and guiding principles:
Reliability
ensures that applications meet availability commitments. Resiliency ensures that workloads are
available and can recover from failures at any scale. Reliability in your IoT workload discusses how the IoT
principles of heterogeneity, scale, connectivity, and hybridity affect IoT reliability.
Security
provides confidentiality, integrity, and availability assurances against deliberate attacks and
abuse of data and systems. Security in your IoT workload describes how heterogeneity and hybridity
affect IoT security.
Cost optimization
balances business goals with budget justification to create cost-effective workloads
while avoiding capital-intensive solutions. Cost optimization in your IoT workload looks at ways to reduce
expenses and improve operational efficiency across IoT workload layers.
Operational excellence
covers the processes that build and run applications in production. Operational
excellence in your IoT workload discusses how heterogeneity, scale, connectivity, and hybridity affect IoT
operations.
Performance efficiency
is a workload's ability to scale efficiently to meet demands. Performance efficiency
in your IoT workload describes how heterogeneity, scale, connectivity, and hybridity affect IoT
performance.
IoT guiding principles
HeterogeneityHeterogeneity
SecuritySecurity
Most IoT systems use either a
connected products
or
connected operations
architectural pattern. Each pattern
has specific requirements and constraints in the IoT Well-Architected Framework.
Connected products
architectures focus on the
hot path
. End users manage and interact with products by
using real-time applications. This pattern applies to manufacturers of smart devices for consumers and
businesses in a wide range of locations and settings. Examples include smart coffee machines, smart TVs,
and smart production machines. In these IoT solutions, the product builders provide connected services
to the product users.
Connected operations
architectures focus on the
warm or cold path
with edge devices, alerts, and cloud
processing. These solutions analyze data from multiple sources, gather operational insights, build
machine learning models, and initiate further device and cloud actions.
The connected operations pattern applies to enterprises and smart service providers that connect pre-
existing machines and devices. Examples include smart factories and smart buildings. In these IoT
solutions, service builders deliver smart services that provide insights and support the effectiveness and
efficiency of connected environments.
The IoT Well-Architected Framework adds IoT-specific guiding principles to the Azure Well-Architected
Framework pillars. These principles help clarify considerations to ensure your IoT workloads meet requirements
across architectural layers.
The high-level guiding principles that facilitate good IoT solution design are:
Heterogeneity
Security
Scale
Flexibility
Serviceability
Connectivity
Hybridity
The following sections describe the IoT guiding principles, and how they apply to the IoT connected products
and connected operations architectural patterns.
IoT solutions must accommodate various devices, hardware, software, scenarios, environments, processing
patterns, and standards. It's important to identify the necessary level of heterogeneity for each architectural
layer at design time.
In connected products architectures, heterogeneity describes the varieties of machines and devices that need to
be supported. Heterogeneity also describes the variety of environments where you can deploy smart product,
such as networks and types of users. In connected operations architectures, heterogeneity focuses on support
for different operational technology (OT) protocols and connectivity.
IoT solutions must consider security and privacy measures across all layers. Security measures include device
and user identity, authentication and authorization, data protection for data at rest and in transit, and strategies
for data attestation.
In connected products architectures, limited control over product use in heterogeneous and widely distributed
environments affects security. According to the Microsoft Threat Modeling Tool STRIDE model, the highest risk
to devices is from tampering, and the threat to services is from denial of services from hijacked devices.
ScalabilityScalability
FlexibilityFlexibility
ServiceabilityServiceability
ConnectivityConnectivity
HybridityHybridity
In connected operations architectures, the security requirements for the deployment environment are
important. Security focuses on specific OT environment requirements and deployment models, such as ISA95
and Purdue, and integration with the cloud-based IoT platform. Based on STRIDE, the highest security risks for
connected operations are spoofing, tampering, information disclosure, and elevation of privilege.
IoT solutions must be able to support
hyper-scalability
, with millions of connected devices and events ingesting
large amounts of data at high frequency. IoT solutions must enable proof of concept and pilot projects that start
with a few devices and events, and then scale out to hyper-scale dimensions. Considering the scalability of each
architectural layer is essential to IoT solution success.
In connected products architectures, scale describes the number of devices. In most cases, each device has a
limited set of data and interactions, controlled by the device builder, and scalability comes only from the number
of devices deployed.
In connected operations architectures, scalability depends on the number of messages and events to process. In
general, the number of machines and devices is limited, but OT machines and devices send large numbers of
messages and events.
IoT solutions build on the principle of
composability
, which enables combining various first-party or third-party
components as building blocks. A well-architected IoT solution has extension points that enable integration with
existing devices, systems, and applications. A high-scale, event-driven architecture with brokered
communication is part of the backbone, with loosely coupled composition of services and processing modules.
In connected products architectures, changing end-user requirements define flexibility. Solutions should allow
you to easily change device behavior and end-user services in the cloud, and provide new services. In connected
operations architectures, the support for different types of devices defines flexibility. Solutions should be able to
easily connect legacy and proprietary protocols.
IoT solutions must consider ease of maintaining and repairing components, devices, and other system elements.
Early detection of potential problems is critical. Ideally, a well-architected IoT solution should correct problems
automatically before serious trouble occurs. Maintenance and repair operations should cause as little downtime
or disruption as possible.
In connected products architectures, the wide distribution of devices affects serviceability. The ability to monitor,
manage, and update devices within end user context and control, without direct access to that environment, is
limited. In connected operations architectures, serviceability depends on the given context, controls, and
procedures of the OT environment, which may include systems and protocols already available or in use.
IoT solutions must be able to handle extended periods of offline, low-bandwidth, or intermittent connectivity. To
support connectivity, you can create metrics to track devices that don't communicate regularly.
Connected products run in uncontrolled consumer environments, so connectivity is unknown and hard to
sustain. Connected products architectures must be able to support unexpected extended periods of offline and
low-bandwidth connectivity.
In connected operations architectures, the deployment model of the OT environment affects connectivity.
Typically, the degree of connectivity, including intermittent connectivity, is known and managed in OT scenarios.
IoT solutions must address hybrid complexity, running on different hardware and platforms across on-premises,
edge, and multi-cloud environments. It's critical to manage disparate IoT workload architectures, ensure
uncompromised security, and enable developer agility.
IoT architecture layers
Core layers and servicesCore layers and services
In connected products architectures, the wide distribution of devices defines hybridity. The IoT solution builder
controls the hardware and runtime platform, and hybridity focuses on the diversity of the deployment
environments.
In connected operations architectures, hybridity describes the data distribution and processing logic. Scale and
latency requirements determine where to process data and how fast feedback must be.
An IoT architecture consists of a set of foundational layers. Specific technologies support the different layers,
and the IoT Well-Architected Framework highlights options for designing and creating each layer.
Core layers
identify IoT-specific solutions.
Common layers
aren't specific to IoT workloads.
Cross-cutting layers
support all layers in designing, building, and running solutions.
The IoT Well-Architected Framework addresses different layer-specific requirements and implementations. The
framework focuses on the
core layers
, and identifies the specific impact of the IoT workload on the
common
layers
.
The following sections describe the IoT architecture layers and the Microsoft technologies that support them.
The IoT core layers and services identify whether a solution is an IoT solution. The
core layers
of an IoT workload
Device and gateway layerDevice and gateway layer
Ingestion and communication layerIngestion and communication layer
Device management and modeling layerDevice management and modeling layer
are:
Device and gateway
Device management and modeling
Ingestion and communication
The IoT Well-Architected Framework focuses primarily on these layers. To realize these layers, Microsoft
provides IoT-specific services such as Azure IoT Hub, Azure IoT Edge, IoT Hub Device Provisioning Service (DPS),
and IoT Central.
This layer represents the physical or virtual device and gateway hardware deployed at the edge or on premises.
Elements in this layer include the operating systems that manage the processes on the devices and gateways,
and the device and gateway firmware, which is the software and instructions programmed onto devices and
gateways. This layer is responsible for:
Sensing and acting on other peripheral devices and sensors.
Processing and transferring IoT data.
Communicating with the IoT cloud platform.
Base level device security, encryption, and trust root.
Device level software and processing management.
Common use cases include reading sensor values from a device, processing and transferring data to the cloud,
and enabling local communication.
Relevant Microsoft technologies include:
Azure IoT Edge
Azure IoT device SDKs
Azure RTOS
Microsoft Defender for IoT
Azure Sphere
Windows for IoT
This layer aggregates and brokers communications between the device and gateway layer and the IoT cloud
solution. This layer enables:
Support for bi-directional communication with devices and gateways.
Aggregating and combining communications from different devices and gateways.
Routing communications to a specific device, gateway, or service.
Bridging and transforming between different protocols. For example, mediate cloud or edge services into an
MQTT message going to a device or gateway.
Relevant Microsoft technologies include:
Azure IoT Hub
Azure IoT Central
This layer maintains the list of devices and gateway identities, their state, and their capabilities. This layer also
enables the creation of device type models and relationships between devices.
Relevant Microsoft technologies include:
IoT Hub device twins
Common layers and servicesCommon layers and services
Transport layerTransport layer
Event processing and analytics layerEvent processing and analytics layer
Storage layerStorage layer
Interaction and reporting layerInteraction and reporting layer
IoT Central device templates
IoT Hub DPS
Azure Digital Twins
Azure IoT Plug and Play
Workloads other than IoT, such as Data & AI and modern applications, also use the common layers. The top-
level Microsoft Azure Well-Architected Framework addresses the generic elements of these common layers, and
other workload frameworks address other requirements. The following sections touch on the IoT-related impact
on requirements, and include links to other guidance.
This layer represents the way devices, gateways, and services connect and communicate, the protocols they use,
and how they move or route events, both on premises and in the cloud.
Relevant Microsoft technologies include:
OT and IoT protocols, such as MQTT(S), AMQP(S), HTTPS, OPC-UA, and Modbus
Azure IoT Hub routing
Azure IoT Edge routing
This layer processes and acts on the IoT events from the ingestion and communication layer.
Hot path
stream processing and analytics happen in near real-time to identify immediate insights and
actions. For example, stream processing generates alerts when temperatures rise.
Warm path
processing and analytics identify short-term insights and actions. For example, analytics predict a
trend of rising temperatures.
Cold path
processing and analytics create intelligent data models for the hot or warm paths to use.
Relevant Microsoft technologies include:
Azure Stream Analytics
Azure Functions
Azure Databricks
Azure Machine Learning
Azure Synapse Analytics
This layer persists IoT device event and state data for some period of time. The type of storage depends on the
required use for the data.
Streaming storage
, such as message queues, decouple IoT services and communication availability.
Time series-based storage
enables warm-path analysis.
Long-term storage
supports machine learning and AI model creation.
Relevant Microsoft technologies include:
Azure Event Hubs
Azure Data Explorer
Azure Cosmos DB
Azure SQL
Azure Data Lake Storage
Integration layerIntegration layer
Cross
-
cutting activitiesCross
-
cutting activities
This layer lets end users interact with the IoT platform and have a role-based view into device state, analytics,
and event processing.
Relevant Microsoft technologies include:
Azure App Service
Power Apps
Power BI
Dynamics 365 Connected Field Service
This layer enables interaction with systems outside the IoT solution by using machine-to-machine or service-to-
service communications APIs.
Relevant Microsoft technologies include:
Azure Logic Apps
Azure Functions
Azure API Management
Azure Event Grid
Power Automate
Cross-cutting activities like DevOps help you design, build, deploy, and monitor IoT solutions. DevOps lets
formerly siloed roles, like development, operations, quality engineering, and security, coordinate and collaborate
to produce better, more reliable, and agile products.
DevOps is well-known in software development, but can apply to any product or process development and
operations. Teams who adopt a DevOps culture, practices, and tools can better respond to customer needs,
increase confidence in the applications and products they build, and achieve business goals faster.
The following diagram shows the DevOps continuous planning, development, delivery, and operations cycle:
Development and deployment activities include the design, build, test, and deployment of the IoT solution
and its components. The activity covers all layers and includes hardware, firmware, services, and reports.
Next steps
Related resources
Management and operations activities identify the current health state of the IoT system across all layers.
Correctly executing DevOps and other cross-cutting activities can determine your success in creating and
running a well-architected IoT solution. Cross-cutting activities help you meet the requirements set at design
time and adjust for changing requirements over time. It's important to clearly assess your expertise in these
activities and take measures to ensure execution at the required quality level.
Relevant Microsoft technologies include:
Visual Studio
Azure DevOps
Microsoft Security Development Lifecycle (SDL)
Azure Monitor
Azure Arc
Microsoft Defender for IoT
Microsoft Sentinel
Reliability in your IoT workload
Security in your IoT workload
Cost optimization in your IoT workload
Operational excellence in your IoT workload
Performance efficiency in your IoT workload
Azure IoT reference architecture
Azure IoT documentation
Reliability in your IoT workload
12/16/2022 16 minutes to read • Edit Online
Everything has the potential to break and IoT workloads are no exception. Because of this, you should design
your architecture with availability and resiliency in mind. The key considerations are how quickly you can detect
change and how quickly you can resume operations. IoT applications are unique as they're distributed at
massive scale, and operate over unreliable networks with no persistent access or visibility into the end-to-end
data flows.
Your environment should consider resilient architectures, cross-region redundancies, service levels indicators
(SLIs), service-level objectives (SLOs), service-level agreements (SLAs), and critical support. The existing
environment should include auditing, monitoring, and alerting by using integrated monitoring and a notification
framework.
On top of these environmental controls, the workload team should consider:
Defining SLIs and SLOs around observability for your specific use case.
Making architecture modifications to improve service level SLAs.
Building redundancy into the workload-specific architecture.
Adding processes for monitoring and notification beyond what's provided by the cloud operations teams.
Designing for reliability in an IoT solution should take into consideration the foundational layers in the IoT
architecture. To achieve overall solution reliability, each layer should have acceptable levels of reliability.
Device and gateway layerDevice and gateway layer: Devices and gateways come in many forms and shape. Typically, devices and
gateways perform the following functions:
Data collection
Supervisory control
Edge analytics
Data collection means the device is connected to sensors or subscribes to telemetry from downstream systems
and then pushes this data to the cloud. The solution design should ensure reliable device management and
reliable communications from the device to the cloud.
Devices that provide supervisory control not only collect data to send to the cloud, but also take actions based in
that data. Actions send data back to the machines or environment that the device is authorized to perform
supervisory actions on. The reliability of the application running on a device that has supervisory control is
crucial.
Transpor t layerTranspor t layer: To connect to the cloud service for data, control, and management, devices need access to a
network. Depending on the type of IoT solution, connectivity reliability is in your hands or the hands of the
network service provider. Networks may have intermittent connectivity issues and in this case the device needs
to manage its behavior accordingly.
Device management and modeling layerDevice management and modeling layer: The cloud service provides each device with an identity and
manages your devices at scale. The cloud is often the final data ingress point for all messages flowing from the
devices. IoT solutions must provide reliability for the cloud services required for the IoT devices to integrate and
Prerequisites
Principles
Device and gateway layer
Design devices for resiliencyDesign devices for resiliency
DesignDesign
ImplementImplement
transmit data. As described in the Azure WAF – Reliability pillar, the cloud services in your IoT solution must
implement reliability principles to provide high availability for your overall IoT solution.
Building a reliable IoT solution requires careful consideration for devices, cloud, and how they interact. The
choices you make for device hardware, connectivity and protocols, and cloud services affect the reliability needs
of your solution.
As described in the Azure WAF – Reliability pillar, desired reliability is subjective. Ensure you understand the
business requirements for your solution.
To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected
Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:
Azure Well-Architected Review
After you complete the assessment, this guide helps you address the key reliability recommendations identified
for your solution.
Follow these principles as you design your IoT solution for reliability:
Design devices for resiliency.
Establish safe practices for updates.
Establish observability across your IoT solution.
Implement HA/DR in critical components.
Plan for capacity.
As part of your overall IoT solution, design your devices to satisfy the uptime and availability requirements of
your end-to-end solution. The following practices focus on connectivity to the cloud service, error handling, and
monitoring:
Ensure that your IoT device can operate efficiently with intermittent connectivity to the cloud. An IoT solution
should enable the flow of information between intermittently connected devices and the cloud-based solutions.
Best practices include:
Implement retry and backoff logic in device software.
Synchronize device state with the cloud.
Ensure that devices can store data on the device if your solution can't tolerate data loss.
Use data sampling and simulations to baseline the network capacity and storage requirements.
The Azure IoT device SDKs provide a set of client libraries that you can use on devices or gateways to simplify
connectivity with the Azure IoT services. You can use the SDKs to implement an IoT device client that:
Connects to the cloud.
Provides a consistent client development experience across different platforms.
Simplifies common connectivity tasks by abstracting details of the underlying protocols and message
processing patterns such as exponential backoff with jitter and retry logic.
MonitorMonitor
ResourcesResources
Establish safe practices for updatesEstablish safe practices for updates
DesignDesign
ImplementImplement
MonitorMonitor
Considerations for IoT devicesConsiderations for IoT devices
DesignDesign
Device lifecycleDevice lifecycle
Connectivity issues for IoT devices can be difficult to troubleshoot because of the many possible points of
failure. Application logic, physical networks, protocols, hardware, IoT Hub, and other cloud services can all cause
problems. The ability to detect and pinpoint the source of an issue is critical. However, an IoT solution at scale
could have thousands of devices, so it's not practical to manually check individual devices. Azure Monitor and
Event Grid are tools that can help you to diagnose connectivity issues in IoT Hub.
Manage connectivity and reliable messaging by using Azure IoT Hub device SDKs
Monitor, diagnose, and troubleshoot Azure IoT Hub disconnects
Retry general guidance - Best practices for cloud applications
Error handling for resilient apps - Azure Architecture Center
Circuit Breaker pattern - Cloud Design Patterns
Compensating Transaction pattern - Cloud Design Patterns
Throttling pattern - Cloud Design Patterns
An enterprise IoT solution should provide a strategy for how operators manage your devices. IoT operators
require simple and reliable tools and applications that enable reliable device updates.
Due to the distributed nature of IoT solutions it's important to adopt safe and secure policies for updating and
deploying applications. Unlike cloud-based solutions, the device side of IoT solutions brings new challenges. For
example, devices need to be continually updated for vulnerabilities and application changes.
A device update solution must support:
Gradual update rollout through device grouping and update scheduling controls.
Support for resilient device updates (A/B) to deliver seamless rollback.
Detailed update management and reporting tools.
Optimization for network based on available bandwidth.
Device Update for IoT Hub is a service that enables safe, secure, and reliable over-the-air IoT device updates.
Device Update for IoT Hub can group devices and specify which devices should receive an update. Operators can
view the status of update deployments and make sure each device successfully applies required updates.
When an update fails, Device Update for IoT Hub lets operators identify the devices that failed to apply the
update and see failure details. The ability to identify which devices failed to update results in fewer manual
hours trying to pinpoint the source of a failure.
Operators need to monitor the state of device deployments and updates. In Device Update for IoT Hub,
compliance measures how many devices have installed the highest version compatible update. A device is
compliant if it has installed the highest version available update that is compatible for it.
Design and select IoT devices to function reliably in the expected operating conditions and over their expected
lifetime.
A reliable device should perform according to its hardware and software specifications, and any failure should
be detected and managed through mitigation, repair, or replacement.
"Design for reliability, but also plan for failures."
Environmental requirementsEnvironmental requirements
Operational profileOperational profile
Regulations and standardsRegulations and standards
ConnectivityConnectivity
Management and operations layer (DevOps)
Establish observability across your IoT solutionEstablish observability across your IoT solution
DesignDesign
Device lifetimes are limited and affect the solution reliability. Assess the consequences of device failure on the
solution, and define a device lifecycle strategy in accordance with the solution requirements.
Device failure impact assessment includes severity (for example single points of failures), probability (for
example MTBF), detectability (see FMEA practice), and an acceptable downtime period.
The acceptable operational downtime determines the speed and extent to which devices should be maintained.
For a solution lifecycle, the availability or longevity of the supply of devices and parts is an important aspect. The
more modular the design, the easier it is to swap out parts of the system, especially if some parts become
obsolete earlier than others. Alternative or multi-sourcing of component and module supply chains are critical
for highly reliable solutions.
The conditions in which a device operates affect its reliability. Define your environmental requirements, and use
devices with appropriate feature specifications. These specifications include parameters such as operating
temperature range, humidity, ingress protection rating (IPxx), relevant and EMI immunity, and shock and
vibration immunity.
The operational behavior of devices affects the performance stress applied to devices, and therefore their
reliability. Define operational profiles (estimating device behavior throughout its lifetime), and assess device
reliability accordingly. Such profiles include operation modes (for example wireless transmission on, low-power
modes) and environmental condition (such as temperature) over device lifetime.
In normal operating conditions, the device and software should run safely inside the specified performance
profiles and avoid running at the limit of the device capabilities. A device should be able to service and process
all external sensors and data processing required by the solution.
For specific industries, devices are subject to applicable regulations and standards. Regulations and standards
should be defined, and devices should meet any compliance and conformity requirements. Regulations include
certification and marking (for example FCC, CE). Standards include industry/application (for example ATEX and
MILspec) and safety (for example IEC 61508) conformance.
Device connectivity conditions, including upstream to the cloud and downstream local network, should be part
of the solution design and reliability. Assess the potential effect of connectivity interruption or interference, and
define a connectivity strategy accordingly. The connectivity strategy should include robustness, for example
fallback capability and disconnection management, and include buffering backup to mitigate cloud dependency
for critical/safety functions.
Monitor every component of your IoT Solution and define alerts and procedures to manage the overall
reliability. All Azure IoT services publish metrics that describe the health and availability of the service. In
addition to the cloud service metrics, consider the metrics that you need on the device side to establish end-to-
end observability for reliability. Consider using these metrics as part of your overall solution reliability
monitoring.
By monitoring the operation of an IoT application and devices relative to a healthy state, you can detect and fix
reliability issues. Monitoring and diagnostics of an IoT application are crucial for availability and resiliency. If
something fails, you need to know that it failed, when it failed, and why it failed.
Before you mitigate issues that affect the reliability of an IoT application, you must be able to capture logs and
ImplementImplement
signals that help you detect issues in the end-to-end operation of the solution.
Use IoT solution logging and monitoring systems to determine whether the solution is functioning as expected
and to help troubleshoot what's wrong with its components.
Complete the following actions to establish observability on an IoT solution:
Establish a mechanism to collect and analyze performance metrics and alerts related to your IoT solution.
Configure devices, cloud services, and applications to collect and connect with Azure Monitor.
Enable a real-time dashboard and alerts to monitor the Azure backend services.
Define roles and responsibilities to monitor and act on events and alerts.
Implement continuous monitoring.
Establish a mechanism to collect and analyze performance metrics and alerts related to the IoT solution.
Azure Monitor is the recommended monitoring and visualization solution for Azure IoT solutions. Devices,
services, and applications, regardless of deployment location, can push log messages directly or through built-in
connectors into Azure Monitor. Azure Monitor provides custom log parsing to facilitate the decomposition of
events and records into individual fields for indexing and search.
Configure devices, cloud ser vices, and applications to collect and connect to Azure Monitor.Configure devices, cloud ser vices, and applications to collect and connect to Azure Monitor.
Enable remote monitoring of IoT Edge devices using Azure Monitor and built-in metrics integration. To enable
this capability on your devices, add the metrics-collector module to your deployment and configure it to collect
and transport module metrics to Azure Monitor.
If your solution uses IoT Central, you can use the set of metrics provided by IoT Central to assess the health of
devices connected to your IoT Central application and the health of your active data exports. IoT Central
applications enable metrics by default, and you access them from the Azure portal. In addition, the Azure
Monitor data platform exposes these metrics and provides several ways for you to interact with them.
IoT Hub provides usage metrics, such as the number of messages used, and the number of devices connected.
You should monitor data generated by Azure IoT Hub and relay it to Azure Monitor for analysis and to alert
other services.
If the IoT solution requires a custom application such as App Service, Azure Kubernetes Service, Azure Functions,
use Application Insights for monitoring and analysis. Application Insights is a feature of Azure Monitor that
provides extensible application performance management and monitoring for live web apps. Developers and
DevOps professionals can use Application Insights to:
Automatically detect performance anomalies.
Help diagnose issues by using powerful analytics tools.
See what users actually do with apps.
Help continuously improve app performance and usability.
Enable a real-time dashboard and aler ts to monitor the Azure backend ser vicesEnable a real-time dashboard and aler ts to monitor the Azure backend ser vices
Azure Monitor alerts proactively notify you when it finds specific conditions in your monitoring data. Alerts let
you identify and address issues in your system before your customers notice them. You can set alerts on
metrics, logs, and the activity log.
Define roles and responsibilities to monitor and act on events and aler tsDefine roles and responsibilities to monitor and act on events and aler ts
Roles, responsibilities, and permissions - Microsoft Azure Well-Architected Framework.
Continuous monitoringContinuous monitoring
Continuous Integration and Continuous Deployment (CI/CD) is a DevOps concept that helps you deliver
ResourcesResources
Implement HA/DR in critical componentsImplement HA/DR in critical components
DesignDesign
ImplementImplement
MonitorMonitor
software more quickly and reliably and provide continuous value to your users. Continuous Monitoring (CM) is
a new follow-up concept where you can incorporate monitoring across each phase of your DevOps and IT Ops
cycles. CM ensures the health, performance, and reliability of your apps and infrastructure continuously as it
flows through the developer, production, and customers:
Seven best practices for Continuous Monitoring with Azure Monitor
Continuous integration and continuous deployment to Azure IoT Edge devices
Monitoring Azure IoT Hub
Monitoring Azure IoT Hub data reference
Trace Azure IoT device-to-cloud messages with distributed tracing
Check Azure IoT Hub service and resource health
Collect and transport metrics - Azure IoT Edge
Monitoring data reference
Enable message tracking
Check Azure IoT Hub service and resource health
As you build out your IoT solution, your SLA guides you into what to implement in your critical components for
HA/DR. You must design to meet the SLA to recover from failures across the IoT solution stack. There are
multiple approaches, starting with redundancy across the IoT solution stack through to enabling redundancy for
specific layers of the solution stack. Cost is also a major consideration that you must weigh against the benefits
of meeting the SLAs.
Define uptime goals for your IoT solutionDefine uptime goals for your IoT solution: Azure IoT services have defined uptime and availability targets.
Review the SLAs for Azure IoT services that are part of your solution and define your uptime goals. For example,
Azure IoT Hub has an SLA of 99.9%, which translates into 1 minute and 36 seconds of potential downtime per
day to plan for. The Azure IoT SDK provides built-in, configurable logic to handle retries and backoff.
Consider breaking the uptime goals of your solution into two categories. Operational (data ingestion) and
device management. For example, if a device is sending data to an IoT hub, but the services to manage the
device are unavailable, how does that affect your solution?
To learn more, see Azure IoT Hub SDK reliability features.
Consider using redundant hardware options for sensors, power, and storage. Redundant hardware enables the
device to function completely if one of the critical features isn't available. Your hardware also needs to account
for issues with connectivity. For example, use a store and forward approach for data when connectivity isn't
available. Azure IoT Edge has this feature built in.
Your device must also be able to handle outages in the cloud. Microsoft Azure region pairing provides an HA/DR
strategy for IoT Hub. For many, Microsoft Azure region pairing meets their SLA requirements. If region pairing
isn't enough, consider an implementation with a secondary IoT Hub enabled. You should also use Azure Device
Provisioning Service (DPS) to avoid hardcoded IoT Hub configurations on your devices. Should your primary IoT
hub go down, DPS can assign your device to a different hub.
For devices that are expected to be online most of the time, consider implementing a heartbeat message pattern
using a custom route in IoT Hub that's processed by Azure Stream Analytics, a Logic App, or an Azure Function
to determine if a heartbeat has failed. You can then use the heartbeat to define Azure Monitor alerts that take
actions as needed.
Use Azure Monitor as it lets you monitor the state of your IoT Hub environment, ensure it's running properly,
ResourcesResources
Ingestion and communication layer
Plan for capacityPlan for capacity
DesignDesign
ImplementImplement
MonitorMonitor
ResourcesResources
and that your devices aren't being throttled or experiencing connection issues.
IoT Hub SDK reliability features
IoT Hub HA/DR
Apply IoT Hub Intra-region HA/DR
Apply IoT Hub Cross-region HA/DR
How to clone an Azure IoT hub
Test apps for availability and resiliency - Azure Architecture Center
Plan for ser vice quotas:Plan for ser vice quotas: As with all platform services, IoT Hub and DPS have quotas and throttles enforced on
certain operations, enabling Azure to deliver predictable service levels and cost for its services. Quotas and
throttles are tied to the service tier and number of units you deploy so that you can design your solution with
the right number of resources. Review quotas and throttles in advance and design your IoT Hub and DPS
resources accordingly.
Plan redundant capacity:Plan redundant capacity: When you're planning your thresholds and alerts, consider the latency between the
detection and the action taken so that the system and operators have enough time to respond to change
requests. Otherwise, you may detect a need to increase the number of units, but the system fails - for example
by losing messages - before the increase takes effect.
Establish benchmarks at production scaleEstablish benchmarks at production scale: Due to the distributed nature of IoT solutions, the number of
devices, and the volume of data, it's important to establish scale benchmarks for the overall solution. As the
number of devices or volumes of data increase, the cloud gateway must scale to support uninterrupted data
flow. These benchmarks help to plan for capacity risks. Use the IoT device telemetry simulator to simulate
production scale volumes.
Auto scale to adjust to quotas dynamically:Auto scale to adjust to quotas dynamically: To provide the lowest cost and operational effort, consider
implementing an automated system to scale your resources up and down with the varying needs of your
solution. To learn more, see the following Resources section.
Monitor quotas and throttling:Monitor quotas and throttling: A benefit of using PaaS services is the ability to scale up and down with a
little effort according to your needs. To ensure solution reliability, you need to continuously monitor resource
usage against quotas and throttles to detect increases in resource usage that indicate the need to scale.
Depending on your business requirements, you may implement something as simple as continuously
monitoring resource usage and alerting the operator when thresholds are met, or an automated system to auto
scale.
Understand Azure IoT Hub quotas and throttling
Overview of the Microsoft Azure IoT Hub Device Provisioning Service
Auto-scale your Azure IoT Hub - Code Samples
Azure IoT Device Telemetry Simulator - Code Samples
Understanding the IP address of your IoT hub
Device configuration best practices for Azure IoT Hub
Azure IoT Hub scaling
Azure IoT Central quotas and limits
Next steps
Azure IoT Hub scaling
Operational Throttles | Understand Azure IoT Hub quotas and throttling
Other Limits | Understand Azure IoT Hub quotas and throttling
Security in your IoT workload
Security in your IoT workload
12/16/2022 23 minutes to read • Edit Online
Prerequisites
The Internet of Things (IoT) provides vast opportunities to gain insights into the complex relationships of people,
systems, and data. Those opportunities also come with the challenge of securing diverse and heterogeneous
device-based solutions that may have little or no direct interaction.
The security guidance of Microsoft Azure Well-Architected Framework for IoT provides recommendations to
help improve the security of your IoT solution. It will help you identify key considerations for secure solution
design and provides guidance on how to implement capabilities that can reduce risk to your IoT solution.
There are multiple personas such as
IoT device builders
,
IoT application developers
, and
IoT solution operators
involved in an IoT solution. They all are required to ensure security in the full lifecycle of an IoT solution.
There are key differences between IoT solutions in traditional IT solutions and operational technology (OT)
solutions, such as having devices that are on-premises being used to monitor and control physical devices.
These OT devices add different security challenges such as, tampering, packing sniffing, the need for out of band
management, and over-the-air (OTA) updates.
When you design an IoT solution, it's important to understand the potential threats to that solution, and add
defense in depth as you design and architect it. It's important to design the solution from the start with security
in mind because understanding how an attacker might be able to compromise a system helps make sure
appropriate mitigations are in place from the start.
Security starts with a threat model. Threat modeling offers the greatest value when you incorporate it into the
design phase. When you're designing, you have the greatest flexibility to make changes to eliminate threats. To
learn more, see Internet of Things (IoT) security architecture.
Securing IoT solutions with a Zero Trust security model starts with non-IoT specific requirements—specifically
ensuring you've implemented the basics to securing identities, their devices, and limit their access. These include
explicitly verifying users, having visibility into the devices they're bringing on to the network, and being able to
make dynamic access decisions using real-time risk detections. This helps limit the potential impact of users
gaining unauthorized access to IoT services and data in the cloud or on-premises, which can lead to both mass
information disclosure (such as leaked production data of a factory) and potential elevation of privilege for
command and control of cyber-physical systems (such as stopping a factory production line). Factories and OT
environments are often easy targets for malware and security breaches due to equipment that can be more than
40 years old, physically vulnerable, and isolated from server level security. For an end-to-end perspective,
review the Azure Well-Architected Framework's security pillar, which includes guiding principles that are critical
to other aspects of your solution.
Guidance for creating zero trust architecture (ZTA) has also been provided by NIST in its Zero Trust Architecture
(nist.gov) document. This document provides general deployment models and uses cases where zero trust could
improve an enterprise's overall information technology security posture.
To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected
Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:
Azure Well-Architected Review
After you complete the assessment, this guide helps you address the key security recommendations identified
for your solution.
Principles
The Azure Well-Architected Framework for IoT overview refers to the foundational layers in an IoT solution.
These layers are shown in the diagram below. Security functions cut across these layers and should follow
specific principles and recommendations.
When you extend the overall Well-Architected Framework security pillar, there are principles that relate
specifically to IoT. The Zero Trust Cybersecurity for the Internet of Things whitepaper describes how to apply a
Zero Trust approach to your IoT solutions based on customer experiences and Microsoft's own environment.
The key security principles that inform a strong security posture that apply to an IoT solution are:
Strong identityStrong identity to authenticate devices. Register devices, issue renewable credentials, employ strong
authentication for personnel using techniques such as multi-factor authentication or password-less
authentication, and use a hardware root of trust to ensure you can trust its identity before making
decisions.
Least privileged accessLeast privileged access to mitigate the impact of a device being compromised. Implement device and
workload access control to limit any impact from authenticated identities that may have been
compromised or running unapproved workloads.
Device healthDevice health to gate access or flag devices for remediation. Check security configuration, assess
vulnerabilities and insecure passwords, and monitor for active threats and anomalous behavioral alerts
to build ongoing risk profiles.
Continual updatesContinual updates to keep devices healthy. Utilize a centralized configuration and compliance
management solution and a robust update mechanism to ensure devices are up to date and in a healthy
state.
Security monitoring and responseSecurity monitoring and response to detect and respond to emerging threats. Employ proactive
monitoring to rapidly identify unauthorized or compromised devices.
An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,
and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also
cross-cutting layers that enable the design, building, and running of IoT solutions:
Design
The following sections address the layer specifics for the security pillar.
As part of the threat modeling exercise, you should divide a typical IoT architecture into several
components/zones. This article focuses on the IoT aspects of security, but you should take a holistic approach to
security. The IoT security guidance focuses on these two zones:
Device and field gateway zonesDevice and field gateway zones: The immediate physical space around the device and gateway where
physical access and/or peer-to-peer digital access is feasible. Many industrial companies use the Purdue
model included in the ISA 95 standard to ensure their process control networks protect both the limited
bandwidth of the network and the ability to offer real time deterministic behavior. In more recent years,
with cyber security events being on the climb from internal and external parties security teams look at
the Purdue model as an extra layer of the defense in depth methodology.
Cloud gateway and ser vices zoneCloud gateway and ser vices zone: Any software component or module running in the cloud that is
interfacing with devices and/or gateways for data collection and analysis, as well as for command and
control.
Zones are broad way to segment a solution. Each zone often has its own data and authentication and
authorization requirements. Zones can also be used to isolate damage and restrict the impact of low trust zones
on higher trust zones.
All elements in the architecture are subject to various threats that can be classified according to one of the six
Strong IdentityStrong Identity
STRIDE categories:
spoofing
,
tampering
,
repudiation
,
information disclosure
,
denial of service
, and
elevation of
privilege
.
Follow the SDL process when you design and build these services.
Adopt DevOps methodologies also focus on security. To learn more, see Enable DevSecOps with Azure and
GitHub.
Strong device identity is delivered through tightly integrated capabilities of IoT devices and services, including:
A hardware root of trust
Strong authentication using techniques such as certificates, multi-factor auth, or password-less
authentication
Renewable credentials
Organizational IoT device registry
When possible, use a hardware
root of trust
(RoT) with the following attributes:
Secure storage of the credentials proving the identity in a dedicated, tamper-resistant hardware.
Immutable
onboarding identity
tied to the physical device.
Unique per-device renewable
operational credentials
for regular device access.
The onboarding identity represents the physical device, and is inseparable from it. It can’t be changed for the
device's lifetime, and is typically created and installed during manufacturing.
Given its immutability and lifetime, only use the device onboarding identity to onboard devices into IoT
solutions. After onboarding, provision and use a renewable operational identity and credentials for
authentication and authorization to the IoT application. Making this identity renewable enables you to manage
access and revocation of the devices for operational access. You can apply policy driven gates - such as
attestation of device integrity and health - at renewal time.
Protect the supply chain of the hardware RoT, or any other hardware components of an IoT device, to ensure
that attacks on supply chain don't compromise the integrity of such devices. This also ensures devices are built
to security specifications and design that conform to required compliance regimes.
Password-less authenticationPassword-less authentication, usually using standard x509 certificates to prove a device's identity, offers
greater protection than secrets such as passwords and symmetric tokens shared between both parties.
Certificates are a strong, standardized mechanism that provides renewable, password-less authentication.
Provision operational certificates from a trusted PKI and use a renewal lifetime appropriate for the security
posture of their business use, management overhead, and cost. Make renewal automatic to minimize any
potential access disruption due to manual rotation, and use standard up to date cryptography techniques. For
example, renew through CSR instead of transmitting a private key. Any access granted to a device should be
granted based on its operational identity. Support credential revocation (such as CRL when using x509
certificates) to enable immediate removal of device access. For example, to respond to compromise or theft.
Some legacy or resource constrained IoT devices aren't manufactured with or capable of utilizing a strong
identity, password-less authentication, or renewable credentials. For these devices, use IoT gateways as
guardians to locally interface with these less-capable devices, bridging them to access IoT services with strong
identity patterns. This enables Zero Trust adoption today, while transitioning to use more capable devices over
time.
If you can't use a hardware root of trust - for example, virtual machines, containers, or any service that embeds
an IoT client - use whatever capabilities are available. Common implementations use virtual machines or
containers, which don't have hardware root of trust support, but can use password-less authentication and
renewable credentials. Defense in depth lets the solution provide redundancies where possible and fills in gaps
Least
-
privileged accessLeast
-
privileged access
Device healthDevice health
where necessary. Virtual machines and containers might run from an area with more physical security, such as a
data center, as compared to an IoT device in the field.
Use a centralized Organizational IoT device registr yOrganizational IoT device registr y for your organization's IoT devices to manage their
lifecycle and audit device access. This approach is similar to the way you secure the user identities of an
organization's workforce to achieve Zero Trust security. Use a cloud-based identity registry to handle the scale,
management, and security of an IoT solution. IoT device registry information is used to onboard devices into an
IoT solution by verifying that the device identity and credentials are known and authorized. After a device is
onboarded, the device registry contains the core properties of the device, including its operational identity and
the renewable credentials used to authenticate for everyday use.
You can use IoT device registry data to view the inventory of an organization's IoT devices (including health,
patch, and security state), and to query and group devices for scaled operation, management, workload
deployment, and access control.
In cases where devices don't connect to Azure services for IoT, use network sensors to detect and inventory
unmanaged IoT devices in an organization's network for awareness and monitoring.
Least-privileged access control helps to limit any impact from authenticated identities that may have been
compromised or that are running unapproved workloads. For IoT scenarios, this means granting access to
solution operators, devices, and their workloads using:
Device and workload access control, to provide access control to the scoped workloads running on the
device.
Access only from secure locations and systems, granting just-in-time access, and implementing strong
authentication mechanisms such as multi-factor auth, password-less authentication.
Conditional access, to conditionally grant access based on the device's context. Examples include:
Location (IP address, GPS location)
Uniqueness (access from one location at a time)
Time of the day
Network traffic patterns
Services can also use the device's context to conditionally deploy workloads.
Configure the access management of the IoT cloud gateway to only grant appropriate access permissions based
on the functionality required by the backend.
Limit access points to IoT devices and cloud applications by ensuring the ports are kept to minimum access.
Build mechanisms to prevent and detect physical tempering of devices.
An IoT solution is used by a person. It's prudent to ensure that the user access is managed through an
appropriate access control model such as role based or attribute based access control. To layer least-privileged
access for IoT devices and design for defense in depth, use network segmentation to group IoT devices in
addition to end point protection, mitigating the impact of a potential compromise. Network micro-segmentation
enables isolation of less-capable devices at the network layer, either behind a gateway or on a discrete network
segment. Put in place a holistic firewall rule strategy to ensure devices access the network when required for the
solution to operate, and blocking access when not allowed. More mature organizations can also implement
micro-segmentation policies at multiple layers of the Purdue model.
Following the Zero Trust principle, use device health as a key factor to determine the risk profile (such as trust
level) of a device. Use this risk profile as an access gate to ensure only healthy devices can access IoT
applications and services, or to identify devices in questionable health for remediation.
According to industry standards, the evaluation of the device health should include:
Continual updatesContinual updates
Security monitoring and responseSecurity monitoring and response
Security configuration assessment and attestation - is the device configured securely?
Vulnerability assessment - is the device running software that is out of date, or software that has known
vulnerabilities?
Insecure credential assessment - is the device using secure credentials (such as certificates) and protocols
(such as TLS 1.2+)?
Active threats and threat alerts.
Anomalous behavioral alerts such as network patterns and usage deviation.
The other side of enabling your organization to control device access based on health is the need to proactively
maintain production devices in a working, healthy target state.
The capabilities to proactively support healthy devices include:
Centralized configuration and compliance management, to apply policies and to securely distribute and
update certificates.
Deployable updates to update the full set of software on devices, firmware, drivers, base OS and host
applications, and cloud-deployed workloads. The update mechanism should have remote deployment
capabilities and should be resilient to changes in environment, operating conditions, and authentication
mechanism (for example, a certificate change because of expiry or revocation). The update mechanism
should support update rollout verification, and ideally be integrated with pervasive security monitoring to
enable scheduled updates for security. You should be able to defer deployments so as to not interfere with
the business continuity, but be eventually completed within a well-defined time interval after a vulnerability
is detected. Devices that haven't been updated, should be flagged as unhealthy.
As a defense in depth strategy, monitoring adds an extra layer of protection for managed greenfield devices
while also providing a compensating control for legacy unmanaged brownfield devices that don't support
agents and can't easily be patched or configured remotely. You need to decide the levels of logging, types of
activities that you need to monitor, and the responses required for alerts. Logs should be stored securely and
not contain any security details.
As recommended by CISA, security monitoring includes:
Generating an as-is asset inventory and network map of all IoT/OT devices.
Identifying all communication protocols used across IoT/OT networks.
Cataloging all external connections to and from IoT/OT networks.
Identifying vulnerabilities in IoT/OT devices and using a risk-based approach to mitigate them.
Implementing a vigilant monitoring program with anomaly detection to detect malicious cyber tactics such
as
living off the land
techniques within IoT/OT systems. The program should monitor for unauthorized
changes to controllers, unusual behavior from devices, and audit access and authorization attempts.
Most IoT attacks follow a familiar
kill chain
pattern in which adversaries establish an initial foothold, elevate their
privileges, and move laterally across the network. Often, they use privileged credentials to bypass barriers such
as next generation firewalls established to enforce network segmentation across subnets. Rapidly detecting and
responding to these multistage attacks requires a bird's-eye view across IT, IoT, and OT networks, combined with
automation, machine learning, and threat intelligence. Collect signals from the entire environment: all users,
devices, applications, and infrastructure, both on-premises, and in multiple clouds. Analyze the signals in
centralized security information and event management (SIEM) and extended detection and response (XDR)
platforms, where security operations center (SOC) analysts can hunt for and uncover previously unknown
threats. Finally, security orchestration and automated response (SOAR) platforms are essential for responding to
incidents rapidly and mitigating attacks before they have material impact on your organization. This is
accomplished by defining playbooks that are automatically executed when specific incidents are detected. For
Implement
Evaluate and deploy a Zero Trust security modelEvaluate and deploy a Zero Trust security model
Apply Zero Trust to your existing IoT infrastructureApply Zero Trust to your existing IoT infrastructure
Use Zero Trust as criteria to select IoT servicesUse Zero Trust as criteria to select IoT services
example, you can automatically block or quarantine compromised devices so they're unable to infect other
systems.
The IoT solution needs to be able to perform monitoring and remediation at scale for all its connected devices.
Data protectionData protection: Data that's ingested into the IoT solution should be protected with the guidance in the overall
Azure Well-Architecture. For IoT solutions, these principles extend to the devices too. It's critical to ensure that
communication from the device to the cloud is secure and encrypted using the latest Transport Layer Security
(TLS) standards. If data is stored on devices (data at rest), use standard encryption algorithms to encrypt the
data.
Make sure devices are well protected physically. Turn-off or disable any features that aren't needed on the
device, such as physical ports (USB, UART) and connectivity (WIFI, BT). Perform physical removal or
covering/blocking when necessary.
When possible, use a firewall on the device to restrict the network access.
To learn more about Microsoft's approach, see Zero Trust.
Use the Zero Trust Assessment to analyze the gaps in your current protection for identity, endpoints, apps,
network, infrastructure and data.
Use the recommended solutions from the assessment to prioritize your Zero Trust implementation, and
move forward with guidance from the Microsoft Zero Trust Deployment Center.
To help prioritize IoT Zero Trust investments, you can use IIC's IoT Security Maturity Model to help assess the
security risks for your business.
Microsoft Defender for IoT is an agentless solution that continuously monitors network traffic using IoT-aware
behavioral analytics to immediately identify unauthorized or compromised IoT devices. Deploy Microsoft
Defender for IoT to:
Inventory all IoT devices
Assess all IoT devices for vulnerabilities and provide risk based mitigation recommendations
Continuously monitor devices for anomalous or unauthorized behavior.
Additionally, Microsoft Defender for IoT is tightly integrated with Azure Sentinel, a cloud-based SIEM/SOAR
platform that supports third-party SOC solutions such as Splunk, IBM QRadar, and ServiceNow.
Explore adding Azure Sphere guardian modules to your critical brownfield devices to enable them to connect to
IoT services with Zero Trust capabilities including strong identity, end-to-end encryption, and regular security
updates.
Network design and configuration provide other opportunities to build defense in depth by segmenting IoT
devices based on their traffic patterns and risk exposure. This minimizes the potential impact of compromised
devices and ability of adversaries to pivot to higher-value assets. Network segmentation is typically
accomplished using next-generation firewalls.
Use IoT ser vicesIoT ser vices that offer the following key zero-trust capabilities:
Enable full support for Zero Trust user access control via Zero Trust. For example, strong user identities,
multi-factor authentication, and conditional user access.
Include integration with user access control systems for least-privileged access and conditional controls.
Use Zero Trust as criteria to select IoT devicesUse Zero Trust as criteria to select IoT devices
Provide a central device registry for full device inventory and device management.
Perform mutual authentication, offering renewable device credentials with strong identity verification.
Enforce least-privileged device access control with conditional access to ensure only devices fulfilling criteria
can connect, such as only healthy devices or devices from known locations.
Support OTA updates to keep devices healthy.
Enable security monitoring of both the IoT services themselves, and of the range of connected IoT devices.
Monitor and control access to all public end points and implement authentication and authorization to any
calls made to these end points.
Microsoft Azure offers several IoT services that provide Zero Trust capabilities:
Azure IoT Hub, supports an operational registry for organizational IoT devices. Accepts device operational
certificates to enable strong identity. Devices can be disabled centrally from Azure IoT Hub to prevent
unauthorized connection. Azure IoT Hub supports provisioning of module identities in support of IoT
Edge workloads.
Azure IoT Hub Device Provisioning Service (DPS), provides a central device registry for organizational
devices to register for onboarding at scale. DPS accepts device certificates to enable onboarding with
strong device identity and renewable credentials, registering devices in IoT Hub for their daily operation.
Azure Device Update (ADU) for IoT Hub, enables you to deploy OTA updates for your IoT devices. It
provides a cloud-hosted solution to connect virtually any device. ADU supports a broad range of IoT
operating systems, including Linux and Azure RTOS.
Microsoft Defender for IoT is an agentless, network layer security platform delivering continuous asset
discovery, vulnerability management, and threat detection for IoT and OT devices, including proprietary,
embedded OT devices as well as legacy Windows systems commonly found in OT environments.
Microsoft Defender for IoT interoperates with Azure Sentinel to provide a bird's-eye view of security
across your entire enterprise. By collecting data at cloud scale—across all users, devices, applications, and
infrastructure, including firewall, NAC, and network switch devices, Sentinel can quickly spot anomalous
behaviors indicating potential compromise of IoT/OT devices.
IoT devicesIoT devices supporting Zero Trust should:
Contain a hardware root of trust that it can use to provide a strong device identity.
Use renewable credentials for regular operation and access.
Enforce least-privileged access control to local device resources such as cameras, storage, and sensors.
Emit proper device health signals to services to enable enforcement of conditional access.
Provide update agents and corresponding software updates for the usable lifetime of the device to ensure
security updates can be applied, along with device management capabilities to enable cloud-driven device
configuration and automated security response.
Run security agents that integrate with security monitoring, detection, and response systems.
Minimize physical attack footprint. For example, no USB ports.
Microsoft Azure offers several products that can be used to support the implementation of IoT devices:
Azure IoT Edge provides an edge runtime connection to IoT Hub and other services:
Supports use of certificates as strong device identities.
Supports the PKCS#11 standard that enables support for device manufacturing identities - and other
secrets for operational identities - stored on a TPM or HSM.
The Azure IoT platform SDKS are a set of device client libraries, developer guides, samples, and
documentation. Device SDKs implement various security features, such as encryption, authentication, to
assist you in developing a robust and secure device application.
Azure IoT Hub support for virtual networks enable you to restrict connectivity to IoT Hub through a VNet
that you operate. This network isolation prevents connectivity exposure to the public internet. It can help
to prevent exfiltration attacks from sensitive on-premises networks.
Azure RTOS provides a real-time operating system as collection of C-language libraries that you can
deploy on a wide range of embedded IoT platforms. Azure RTOS includes a complete TCP/IP stack with
TLS 1.2 and 1.3 and basic X.509 capabilities. However, Azure RTOS along with the Azure IoT Embedded
SDK is designed to integrate with Azure IoT Hub, DPS, and Microsoft Defender and can utilize some of the
same security mechanisms as larger, more expensive IoT devices. With features such as X.509 mutual
authentication and support for modern TLS cipher suites such as ECDHE and AES-GCM, customers can
build Azure RTOS applications knowing that the basics of secure network communication are covered. In
addition to the integration with Azure IoT services, Azure RTOS provides support for Zero Trust design on
microcontroller platforms that support hardware security features such as ARM TrustZone (a memory
protection/partitioning architecture), secure element devices such as the STSAFE-A110 from ST
Microelectronics, and industry standards such as the ARM Platform Security Architecture (PSA), which
combines hardware and firmware to provide a standardized set of security features including secure
boot, cryptography, attestation, and more.
Windows 10 IoT Enterprise has significant security features that can be used to help ensure security
across three key pillars of the IoT security spectrum:
Protect data. Securing data means always protecting, including at rest, during code execution, and
in motion. This is done by using BitLocker Drive Encryption, Secure Boot, Windows Defender
Application Control, Windows Defender Exploit Guard, secure Universal Windows Platform (UWP)
applications, Unified Write Filter, a secure communication stack, and security credential
management.
Monitor and detect. Device Health Attestation (DHA) lets you start with a trusted device and
maintain trust over time.
Update and manage. You can use Device Update Center and Windows Server Update Services to
apply the latest security patches. If you determine that a device might be exposed to a threat, you
can remediate that threat by using Azure IoT Hub device management features, Microsoft Intune
or third-party mobile device management solutions, and Microsoft System Center Configuration
Manager.
Along with the edge platforms above, Microsoft provides a program to certify different device capabilities:
The Azure Certified Device program empowers device partners to easily differentiate and promote
devices, and enables solution builders and end customers to find IoT devices built with features that
enable a zero-trust solution.
Edge Secured-core program (preview), which validates whether devices meet more security requirements
around device identity, secure boot, operating system hardening, device updates, data protection, and
vulnerability disclosures. The Edge Secured-core program requirements have been distilled from various
industry requirements and security engineering points of view. The Edge Secured-core program enables
Azure services such as the Azure Attestation service to make conditional decisions that are based on the
posture of the device, thus enabling the Zero Trust model. Some of the highlights consist of requiring the
device to include a hardware root of trust and provide secure boot and firmware protection. Attributes
such as these can be measured by the attestation service and used by downstream services to
conditionally grant access to sensitive resources.
Microsoft has several solutions available that provide hardware built specifically for IoT scenarios fully
integrated with Azure services:
Resources
Next steps
Azure Sphere is a fully managed integrated hardware, OS, and cloud platform solution for medium and
low-power IoT devices that meets all seven properties of highly secured devices. Azure Sphere has
several features that can help an organization implement Zero Trust. Devices are designed for explicit
verification and implement certificate-based Device Attestation and Authentication (DAA), which will
automatically renew trust. In addition to supporting the Zero Trust principle of explicit verification, Azure
Sphere implements least-privileged access, where applications are denied access by default to all
peripheral and connectivity options. This even extends to network connectivity, where permitted web
domains must be included in the software manifest or the application isn't able to connect outside of the
device. Azure Sphere is built around assumed breach. Protections are layered with defense in depth
throughout the OS design, and a secure-world partition—running in Arm TrustZone on Azure Sphere
devices—helps segment breaches to the OS from access to Pluton or hardware resources. Azure Sphere
can be applied as a guardian module to secure other devices, including existing brownfield systems that
weren't designed for trusted connectivity. In this scenario, an Azure Sphere guardian module is deployed
with an application designed to interface with an existing product through an interface such as ethernet,
serial, or BLE, that doesn't necessarily have direct internet connectivity.
Azure Percept is an end-to-end edge AI platform that includes hardware accelerators integrated with
Azure AI and IoT services, pre-built AI models, and solution management to help you start your proof of
concept in minutes. Azure Percept devices are designed with a hardware root of trust to help protect
inference data (in transit and at rest), AI models and privacy-sensitive sensors such as cameras and
microphones. It enables device authentication and authorization for Azure Percept Studio services. To
learn more, see Azure Percept security.
Azure Architecture Center - Azure Architecture Center
How to apply a Zero Trust approach to your IoT solutions - Microsoft Security Blog
Zero trust whitepaper (microsoft.com)
IoT Security Architecture
Overview of an IoT workload
Cost optimization in your IoT workload
12/16/2022 38 minutes to read • Edit Online
H IGH C O ST OF SC A L IN GH IGH C O ST O F SC A L IN G 32%32%
Lack of necessary technology 26%
Pilots demonstrate unclear business value/ROI 25%
Too many platforms to test 24%
Pilot takes too long to deploy 23%
According to the survey of IoT Signals EDITION 3 report, the top reason of proof of concept (PoC) failure is high
cost of scaling. The report also says that the high cost of scaling IoT projects comes from the complexities of
integrating across layers, for example devices, edge connectivity, compatibility across applications. 40% of
adopters experiencing this issue.
Reasons for PoC failureReasons for PoC failure (IoT Signals EDITION 3)
IoT solution implementations are complex projects that require understanding of broad technologies and
consideration of various layers. The IoT Well-Architected Framework describes these IoT architecture layers. The
layers help you simplify and understand Azure IoT architecture. You need to establish a different cost strategy
for each layer because each one has a different technology and sometimes a different ecosystem such as
devices, telecommunications, or edge. Cost optimization for IoT Well-Architected Framework provides guidance
on how to optimize costs across the IoT architecture layers.
Cost optimization is the process of closed-loop cost control that needs to be continuously monitored, analyzed,
and improved throughout a lifecycle. Understanding the IoT architecture layers helps you design an architecture
for your IoT solution that enables you to define a baseline of the costs and to consider multiple architectures for
cost comparison.
IoT cost results from a tradeoff between various technology options. Sometimes, it's not a simple comparison
because IoT is an end-to-end solution. You should also consider synergy and cost benefits when reconciling
multiple services and technologies such as IoT central with Plug and Play, or using device twins to handle events
in Azure Digital Twins. In some areas, a one-time cost is more effective than recurring costs. For example, in
security where hacking techniques are always changing, it's better to import a reliable commercial operating
system and module such as Azure Sphere that, for a one-off payment, provides monthly security patches to
devices for a long time.
The key criteria for architecture decisions are the requirements for your IoT solution. You can separate the
requirements into functional and non-functional requirements. You also need to separate the cost
considerations on each type of requirements because functional requirements affect system design and non-
functional requirements affect system architecture.
You need to develop multiple use cases based on the requirement and compare them before finalizing your
architecture choices.
A typical IoT project includes the following processes:
Design
Hardware sourcing
Prerequisites
Principles
Understand total cost of ownershipUnderstand total cost of ownership
Development
Integration
Onboarding
Deployment
Operation
The complexity of these processes leads to many opportunities to cut costs. Cost optimization in the IoT Well-
Architected Framework includes guidance to simplify cost drivers from complex IoT architecture and processes
and help you make better decisions to achieve cost effectiveness.
This guide considers various combinations of Azure IoT services and related technologies. However, it doesn't
consider cost optimization for specific industry and IoT use cases such as connected factory, predictive
maintenance, and remote monitoring.
Because IoT projects include cloud technologies, you should review Microsoft Azure Well-Architected
Framework - Cost Optimization and Cloud Cost Optimization | Microsoft Azure for overall guidance about cloud
costs.
To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected
Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:
Azure Well-Architected Review
After you complete the assessment, this guide helps you address the key cost optimization recommendations
identified for your solution.
The Microsoft Well-Architected Framework cost optimization documents is a set of generic principles. This
workload view focuses on specifics for an IoT solution and considers unique factors that add to the generic
guidance.
Cost effectiveness is one of the key success factors in IoT projects. In a typical IoT solution, devices generate
large quantities of telemetry that is then stored in a cloud repository. How you develop devices and applications,
handle large volumes of data, and design your architecture have an effect on overall costs. Because an IoT
solution is a multilayered technology stack, there are many cost saving factors to consider.
Apply the following design principles to build cost effective IoT solutions:
Understand total cost of ownership (TCO)Understand total cost of ownership (TCO) – Take into account both direct and indirect costs when you
decide on the architecture for your IoT solution.
Establish strategies for IoT technology ecosystemsEstablish strategies for IoT technology ecosystems – Use the best approach for each IoT technology
area that has its own ecosystem.
Define implementation plans for each IoT architecture layerDefine implementation plans for each IoT architecture layer – Identify action items for each layer in
the IoT Well-Architected Framework architecture.
Monitor and optimize costsMonitor and optimize costs – Establish ongoing activities for cost optimization after you implement your
IoT solution.
Many IoT solutions fail because of the difficulty of estimating long-term aggregated costs of various cloud
services. It's critical to understand how much you're spending to run and operate all services involved. Without
this understanding, it's common for budgets to run into the red, leaving you with ballooning costs.
When you evaluate infrastructure costs, account for the basics first: storage, compute, and network. Then, you
Establish strategies for IoT technology ecosystemsEstablish strategies for IoT technology ecosystems
Device management strategyDevice management strategy
Edge, cloud, or hybrid implementation strategyEdge, cloud, or hybrid implementation strategy
Connectivity and IoT protocol managementConnectivity and IoT protocol management
Platform for IoT solutionsPlatform for IoT solutions
must account for all services your solution needs to ingest, egress, and prepare data for use in business
decisions. Make sure to estimate costs based on the architecture of the solution when it's running at scale in
production, not your PoC architecture. Architecture and costs evolve rapidly after the PoC.
Furthermore, don't overlook the long-term operational costs that increase in parallel with infrastructure costs.
Examples include employing technicians to operate the solution, vendors to manage non-public clouds, and
customer support teams.
Your costs will depend greatly on the
chattiness
of your devices and the size of the messages your devices send.
Chatty devices send many messages to the cloud every minute, while others are relatively quiet, only sending
data every hour or more. Clarity about device chattiness and message size helps reduce the likelihood of over-
provisioning, which leads to unused cloud capacity, or under-provisioning, which leads to elastic scale
challenges. As you plan your IoT solution, consider carefully the size and frequency of the message payloads to
ensure your infrastructure is the correct size for where you are today and ready to scale with you. Be sure to
consider the cost of developing new features as compared to using a
Platform as a Service
(PaaS) or
Application
Platform as a Service
(aPaaS) based solution.
Device management requires a strategic approach because various options are available during the
Plan –
Provision – Configure – Monitor – Retire
device lifecycle. Understanding the roles for IoT solution development
and operation and IoT device types provides a foundation for establishing a strategy for device cost
management.
Evaluate TCO for deploying edge assets. Edge solutions can be realized with different edge architectures:
multiple endpoints, IoT devices, directly connected to the cloud, or connected through an edge and/or cloud
gateway. Different options for sourcing the edge solution devices can affect total cost and lead time. Ongoing
maintenance and support of the device fleet also affects the overall solution cost. Cost optimization is an
important step in maximizing the return on investment (ROI) of any IoT solution.
Where data should be stored and processed in a given IoT solution affects many factors such as latency, security,
and cost. Analyze each use case and examine where it makes most sense to use edge processing and data
storage and how it affects cost. Storing and processing data on the edge can often save storage, transportation,
and processing costs. However, when you take scale into account, cloud services often become better options
because of cost and development overheads. Assess each case for costs and benefits.
The Azure pricing calculator is a useful tool to compare these options.
There are many factors to take into account for device to IoT gateway connectivity. For example, device
connectivity, network, and protocol. An understanding of IoT communication protocols, network types, and
messaging patterns helps you to design and optimize a cost effective architecture.
For device connectivity, it's important to specify the type of network to use. If you select a private LAN or WAN
solution, such as WiFi or LoraWAN, consider the TCO of the network as part of the overall costs. If you use
carrier networks (such as 4G, 5G, or LPWAN), include the recurring connectivity costs.
Understand the difference between Application Platform as a Service (aPaaS) and Platform as a Service (PaaS)
and choose the right platform for your IoT solution. Depending on your business requirements, development,
and operation capacities, you can choose an aPaaS and PaaS IoT solution platform:
aPaaSaPaaS IoT Central is an Azure aPaaS. It enables you to focus on your business and product innovation
instead of maintenance, infrastructure, development, and operations. It simplifies the creation of IoT
solutions.
Build an architecture for scale by implementing deployment stampsBuild an architecture for scale by implementing deployment stamps
Define implementation plans for each IoT architecture layerDefine implementation plans for each IoT architecture layer
Monitor and optimize costsMonitor and optimize costs
Build a cost model and budgetBuild a cost model and budget
Continuously monitor and reviewContinuously monitor and review
Design
PaaSPaaSIoT Hub is and Azure PaaS. It acts as a central message hub for bi-directional communications
between your IoT app and the devices it manages. You can tune services to control the overall cost.
Deployment stamping
is a common design pattern for designing applications for flexible deployment strategies,
predictable scale, and cost. Using this pattern for IoT solutions provides several advantages such as geo-
distributing groups of devices, deploying new features to specific stamps, observability of cost-per-device. To
learn more, see Scale IoT solutions with deployment stamps.
Overview of an IoT workload - IoT Well-Architected Framework describes the elements of an IoT architecture. It
enables you to easily review cost options for each layer in a complex IoT architecture. Cost optimization for IoT
Well-Architected Framework provides actionable guides for the IoT architecture layer implementations.
After you review the IoT technology stacks, estimate the initial cost through a cost model:
Pricing Calculator helps estimate the initial and operational costs.
Develop a cost model - Microsoft Azure Well-Architected Framework.
For your solution, factor in the growth in the number of devices deployed and their messaging patterns over
time. Both can have a linear or non-linear growth, which affects your total solution cost over time. You might
start with a limited number of connected devices that grows each year.
Continuously monitor and review costs to identify gaps between planned and actual costs. A regular cost review
meeting is a good way to achieve cost optimization.
An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,
and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also
cross-cutting layers that enable the design, building, and running of IoT solutions.
Device and gateway layer
Understand design considerationsUnderstand design considerations
The following sections address the layer specifics for the cost optimization pillar:
This layer is responsible for generating the data in your solution. Cost is a key consideration for the design of
this layer. Because devices and gateways can optimize data before transferring it to the cloud, a device and
gateway strategy affects the costs in your solution.
IoT Device development requires internal or outsourced engineering resources with a specific skill set. Include
these resource costs in the overall device costs. Required skills include hardware design, embedded application
development, cloud and local connectivity, security and privacy, and IoT solution architecture. Industry-specific
expertise may also be required.
Choosing off-the-shelf hardware or a custom design affects the cost and time-to-market for an IoT device:
An off-the-shelf device may cost more per unit, but has predictable costs and lead times. Off-the-shelf
devices also remove the need for complex supply chain management.
A custom device can reduce unit costs, but development time and non-recurring engineering costs such
as design, test, certification submissions, and manufacture are incurred. Pre-certified system components
or modules can reduce time-to-market to create a semi-custom device, but these components are also
more expensive than discrete chips. Supply-chain and inventory management must be properly
Azure Certified Device catalogAzure Certified Device catalog
LPWAN devicesLPWAN devices
Azure RTOSAzure RTOS
resourced.
For a device connected to the cloud as part of an IoT Solution, you need to optimize upstream data
transmissions to maintain cost boundaries. Factors to consider include minimizing payload sizes, batching of
messages, and transmission during off-peak periods. These optimizations also have incurred costs to
implement.
A make-or-buy decision takes into account qualitative (for example, WiFi certification) and quantitative (for
example, BOM cost or time to market) variables.
Azure Certified Device catalog helps to reduce costs and time to build for devices that work well with Azure IoT.
Most of the device development process is reduced to hardware selection. IoT Plug and Play devices can reduce
both device and cloud development costs.
If LPWAN devices, such as LoRaWAN, NB-IoT, or LTE-M, are already connected to another IoT cloud, the Azure
IoT Central Device Bridge can help to bridge to Azure IoT Central without incurring costs to change your existing
devices.
Azure RTOS is an embedded development suite. Azure RTOS includes a small but powerful operating system
that provides reliable, ultra-fast performance for resource-constrained devices. It's easy-to-use and market-
proven, having been deployed on more than 10 billion devices worldwide. Azure RTOS supports the most
popular 32-bit microcontrollers and embedded development tools, so you can make the most of existing device
builder skills.
Azure RTOS is free for commercial deployment using pre-licensed processors. Azure RTOS comes with Azure IoT
cloud capabilities and features such as device update and security that work with cloud services. These features
help to reduce both device and cloud development costs.
Azure RTOS is certified for safety and security. This helps to reduce the time and cost of building compliant
Azure SphereAzure Sphere
Azure Stack solutionsAzure Stack solutions
Azure public or private multi
-
access edge compute
(
MEC
)
Azure public or private multi
-
access edge compute
(
MEC
)
Edge management and operationEdge management and operation
devices for specific verticals such as medical, automotive, and industrial.
Azure Sphere is a highly secure, end-to-end IoT solution platform with built-in communication and security
features for internet-connected devices. It comprises a secured, connected, crossover microcontroller unit
(MCU), a custom high-level Linux-based operating system (OS), and a cloud-based security service that provides
continuous, renewable security.
Azure Sphere reduces the effort to build and maintain a secure environment from device to the cloud.
Microsoft provides OS updates and zero-day renewable security for 10 years. X.509 based PKI, user app
updates, error reporting, and device management will be supported beyond 10 years without any extra
subscription cost. Azure Sphere reduces the operational cost of keeping millions of devices up to date with the
latest security.
Azure Stack solutions extend Azure services and capabilities to environments beyond Azure datacenters, such as
to a customer datacenter or edge location.
Azure Stack solutions consist of Azure Stack Edge, Azure Stack HCI, and Azure Stack Hub:
Azure Stack Edge is a Microsoft-managed appliance, ideal for hardware-accelerated machine learning
workload at edge locations. Azure Stack Edge deployed in edge location can serve multiple workloads as
the core of Azure Stack Edge runs on modern technology stacks such as containers. By sharing
computational power with other workloads, the cost of ownership can be reduced. To learn more, see
Azure Stack Edge.
Azure Stack HCI is a purposed-built hyperconverged solution, with native integration with Azure. Azure
Stack HCI offers a scalable virtualization solution to host industry solutions. Virtualization brings extra
benefits to overall IoT solutions such as security benefits on virtualization, scalability and flexibility of
virtualized environments, which can then reduce total cost of ownership by sharing the hardware with
other workloads. Azure Stack HCI offers more compute power compared to Azure Stack Edge and is ideal
for industry process transformation. To learn more, see Azure Stack HCI.
Whether you choose Azure Stack Edge or Azure Stack HCI, you should identify the use cases and estimated
compute power. While Azure Stack solutions bring Azure capability to the edge, the hardware sizing constrains
the total compute power. Therefore sizing must be factored in to optimize cost to performance needs.
Many IoT devices are small, inexpensive, and are designed for one or a few tasks such as collecting sensor or
location data and then offloading that data for further processing. Azure public or private MEC and 5G help to
optimize the costs associated with offloading data from devices.
IoT devices can generate large quantities of data. Some IoT devices also have strong requirements for low power
consumption and low costs. MEC based IoT solutions enable low-latency processing of this data at the edge
instead of on the devices or in the cloud – 1-5 ms instead of the typical 100-150 ms for the cloud. MEC based
IoT solutions remain flexible, and the devices themselves can remain inexpensive, operate with minimal
maintenance, and use smaller, cheaper, and long-lasting batteries. With MEC, data analytics, AI, and optimization
functions remain at the edge, which keeps your IoT solution simple and inexpensive.
In addition to serving as an edge processing, compute, and 5G communication device, for IoT workloads, other
workloads can use the same communication device to establish high-speed connections to the public cloud or
remote sites.
Azure IoT Edge managed devices with gateway capabilities reduce network costs and minimize the number of
Secure device provisioning and authenticationSecure device provisioning and authentication
messages through local processing and edge scenarios.
Deploying Azure services such as Azure Stream Analytics and Azure Functions to IoT Edge lets you aggregate
and filter large volumes of data at the edge and send only important data to the cloud IoT gateway. IoT Edge
reduces the costs of exchanging data between the edge and an IoT cloud gateway.
Azure Blob Storage on IoT Edge reduces the need to transfer large quantities of data and over the network. Edge
storage is useful for transforming and optimizing large quantities of data before sending it to the cloud.
There are free-of-charge Azure IoT Edge modules for open protocols such as OPC Publisher and Modbus. These
modules help you to connect various devices with minimal development. You can search for and download the
modules in the Azure Marketplace. Depending on your requirements, you need to decide between build and
buy. For example, if upload performance is critical, choosing a proven IoT Edge module from vendors can be
more cost effective than building a custom module.
Every edge solution requires IoT devices to be deployed in the field. The deployment may need networking and
power supply infrastructures that affect the cost. Using pre-existing infrastructures minimizes the installation
costs. However, you may need to guarantee that the installation doesn't affect existing systems. The installation
of your IoT devices requires dedicated internal or external resources that may need to be trained. Logistics, such
as storage, inventory management, and transport should be organized and considered in the solution cost.
Include the cost of any decommissioning activities that take place when the devices reach the end of their
operational lifecycle.
The lambda architecture pattern - hot/warm/cold - is common in IoT solutions. This pattern is often fully
implemented in the cloud, and there are also relevant implementations on the edge such as when you use more
performant edge devices or the Azure IoT Edge runtime. Optimizing this pattern on the edge can influence
overall solution costs as it lets you choose the most cost-effective service for cloud data ingestion and
processing:
Hot path processing includes near real-time processing, process alerts, or notifications at the edge. Use
Azure IoT Hub event streams to process the alerts in the cloud.
Warm path processing includes using storage solutions on the edge. For example, use open source, time
series databases, or Azure SQL Edge. Azure SQL Edge includes edge stream processing features and time
series optimized storage.
Cold path processing includes batching up events of lower importance at the edge and using a file
transfer option through the Azure Blob Storage module. This approach uses a lower cost data transfer
mechanism as compared to streaming everything directly through Azure IoT Hub or IoT Central. After
your cold data arrives in Azure blob storage, there are many options to process the data in the cloud.
Azure IoT Edge has built-in capabilities to use when you have high message volumes. These built-in features to
batch messages can help to reduce the costs of using Azure IoT Hub or IoT Central. This feature can help reduce
both the number of daily messages in IoT Hub, and the number of device-to-cloud operations per second,
enabling you to choose a lower SKU or scale in IoT Hub. To learn more, see Stretching the IoT Edge performance
limits.
Azure IoT Hub with the Device Provisioning Service (DPS) and Azure Central both support device authentication
with symmetric keys, trusted platform module (TPM) attestation, or X.509 certificates. There's a cost factor
associated with each of these options:
X.509 certificates are considered the most secure option for authenticating to Azure IoT but certificate
management can be costly. The lack of upfront certificate lifecycle management planning can make the
use of certificates even more costly in the long-run. Typically, there's a requirement to work with third-
party vendors who offer CA and certificate management solutions. This option requires using a public
key infrastructure (PKI). Options include a self-managed PKI, a third-party PKI, or the Azure Sphere
Support and maintenanceSupport and maintenance
Ingestion layer
Platform for building IoT solutionsPlatform for building IoT solutions
TIPTIP
security service.
The Azure Sphere security service is only available when you use Azure Sphere devices.
TPM when used with X.509 certificates stored in the TPM offers an added layer of security. DPS also
supports authentication through TPM endorsement keys. The main effect on costs is from hardware,
potential board redesign, and its complexity.
Symmetric key authentication is the simplest and lowest cost option. However, you must evaluate the
effect on security. You need to protect keys on the device and in the cloud. Options to securely store the
key on the device often move you towards a more secure option anyway.
Review costs associated with each of these options, and evaluate how they affect the overall security of your
solution, balancing potentially higher hardware or services costs with increased security. Integration with your
manufacturing process may also influence overall costs.
To learn more, see Security practices for Azure IoT device manufacturers.
After you deploy your IoT devices, you need to support and maintain them for the lifetime of the solution. Tasks
include hardware repairs, software upgrades, OS maintenance, and security patching. Consider ongoing
licensing costs for commercial software and proprietary drivers and protocols. If you can't do remote
maintenance, then you need to budget for onsite repairs and updates. For hardware repairs or replacements,
you should keep suitable spares in stock.
For solutions that use cellular or paid for connectivity media, select as suitable data plan based on the number of
devices, the size and frequency of data transmissions, and where the devices are deployed.
If a service level agreement (SLA) is in place, you need a cost-effective combination of hardware, infrastructure,
and trained staff to be able to meet it.
The long-term support and maintenance of field devices can escalate to become the largest cost burden for a
deployed solution. Careful consideration of the TCO of a system is crucial to realizing return on investments.
An IoT gateway is a bridge between devices and cloud services. As a front-end service to the cloud platform, a
gateway can aggregate all data with protocol translation and provide bi-directional communication with devices.
The following sections describe technology and capability considerations for a cost effective implementation of
this layer:
Azure IoT offers two types of platform for building IoT solutions: IoT Hub is a platform-as-a-service (PaaS) and
IoT Central is an application platform as a service (aPaaS). To reduce your costs and effort, you should
understand the difference between PaaS and aPaaS. Choose your platform type by considering factors such as
requirements, budget, timelines, operation resources, and outsourcing. An aPaaS solution typically has more
predictable operational costs in the device and gateway layer.
For total cost reduction, consider extended use cases as part of platform selection. Azure IoT Central with IoT
Plug and Play is one of scenarios to review for device onboarding.
TIPTIP
Reduce chatty interactions to avoid overhead:
Devices: Avoid designing device-to-cloud interactions that use many small messages. For example, group
multiple device or module twin updates into a single update. Twin updates have their own throttling. Use
message batching to send multiple telemetry messages to the cloud. Be aware of the message size used for
the daily quota (4K for non-free IoT Hub tiers), sending smaller messages leaves some capacity unused,
sending larger messages adds to the available quota. For device-to-device or module-to-module
communication at the edge, avoid designing interactions that use many small messages.
Cloud: Avoid designing cloud-to-device interactions that use many small messages. Use a single direct
method when direct feedback is required. Use a single device or module twin status update to exchange
configuration and status information asynchronously.
You can monitor chatty interactions by using Defender for IoT micro agent and on Azure IoT Hub. You can create custom
alerts on IoT Hub if any device-to-cloud or cloud-to-device interaction exceeds a certain threshold.
IoT Hub SKU selection for message ingestion: Most IoT solutions require bi-directional communication between
devices and the cloud to be fully functional and secure. The basic IoT Hub SKU provides core functionality, but
excludes bi-directional control. For some early implementations of a solution, you may be able to reduce costs
by using the basic SKU. As your solution progresses, you can switch to a standard SKU, so that you can optimize
a secure communication channel for a lower cloud-to-device messaging costs. To learn more, see Compare IoT
Hub and Event Hubs.
Understand the Azure IoT Hub limits and quotas to optimize costs for device-to-cloud and cloud-to-device
messaging:
Messages quota: Azure IoT Hub SKUs each have a set of quotas and throttling limits that apply to operations. For
example, the Standard S1 SKU has a daily quota of 400,000 messages. This quota is achieved when devices send
400,000 messages each day. The messages metric increases as a combination of several factors:
One device-to-cloud (D2C) message up to 4 KB.
D2C messages that exceed 4 KB are charged in chunks of 4 KB.
For messages smaller than 4 KB, use the Azure IoT SDK
SendEventBatchAsync
method to optimize batching
on the device side. For example, bundling up to four 1-KB messages at the edge increases the daily meter
with just one message. Batching is only applicable for AMQP or HTTPS.
Event processing and analytics layer
Select data process path based on analytics requirementsSelect data process path based on analytics requirements
Most operations such as cloud-to-device messages or device twin operations also charge messages in 4-KB
chunks. All these operations also add to the daily throughput, and maximum quota of messages.
Review the Azure IoT Hub pricing information documentation, which includes detailed pricing examples.To learn
more, see IoT Hub quotas and throttling.
Different throttling limits apply to different operations in Azure IoT Hub:
In addition to message daily quotas, operations on the service have throttling limits. A key part of cost
optimization with Azure IoT Hub is to optimize both message quotas and operations throttling limits.
Study the differences between the different limits in the form of operations per second, or bytes per
second. To learn more, see IoT Hub quotas and throttling.
Device-to-cloud operations have an operations per second throttle that depends on the SKU. In addition
to the message size, which is metered in 4-KB chunks, you should also consider the number of
operations. This is an example where the batching on the edge is useful as it lets you send more
messages in a single operation.
One single message of 2 KB, a batched message of 10 KB, or a batched message of 256 KB only counts as
one single operation, allowing you to send more data to the endpoint without reaching throttling limits.
Message size:
If payload size is critical to cost management, reducing the fixed length overhead is important when you have a
long device life-cycle or a large deployment. Options to reduce this overhead include:
Use a shorter device ID, module ID, twin name, and message topic because an MQTT packet contains this
information as payload. An MQTT payload looks like:
devices/{device_id}/modules/{module_id}/messages/events/
Abbreviating the fixed length overhead and the message reduces bandwidth costs.
Compressing the payload, for example by using Gzip, can help reduce bandwidth costs.
The purpose of the event processing and analytics layer is to enable data-driven decisions. The event timing and
purpose of analytics are the key factors to consider when you choose the service. The right service choice
increases the efficiency of the architecture and reduces cost of processing data and events.
Base on your requirements, implement hot, warm, or cold path processing for IoT data analytics. The Azure IoT
reference architecture - Azure Reference Architectures helps you understand the difference between those three
analytics paths and reviews the available analytics services on each path. The Microsoft Well-Architected
Framework includes cost considerations for these services. To get started, determine which types of data go
through the hot, warm, or cold path:
Hot path: Data is kept in-memory and analyzed in near-real-time, typically using a stream processing
engine. The output may trigger an alert or be written to a structured format that can be queried
immediately using analytical tools.
Warm path: Recent data (such as from the last day, week, or month) is kept in a storage service that can
be queried immediately.
Cold path: Historical data is kept in lower-cost storage to be queried in large batches.
Storage layer
Define an end
-
to
-
end IoT scenario for choosing storage typesDefine an end
-
to
-
end IoT scenario for choosing storage types
Storing data affects IoT solution costs because providing data to end-users is one of the goals in an IoT solution.
There are opportunities for cost optimization in this layer. It's important to understand storage types, capacity,
and pricing, to create strategy for optimizing storage costs.
The choice of a repository for telemetry depends on the use case for your IoT data. If the purpose is just to
monitor IoT data and volumes are low volume, you should consider using a database. However, if your scenario
includes data analysis, you need to save telemetry data to storage.
For time series optimized, append-only storage and querying, consider purpose-designed solutions such as
Azure Data Explorer.
Storage and database aren't mutually exclusive. Both can work together, especially if clear hot, warm, and cold
analytics paths are defined. Azure Data Explorer, or databases are commonly used for hot and warm path
scenarios.
From the Azure Storage perspective, it's also important to consider the data lifecycle throughout the solution.
For example, access frequency, retention requirements, and backup.
Azure Storage enables you to define the lifecycle of the data and to automate the process of moving data
from the hot tier to other tiers, which reduces storage costs in the long run. To learn more, see Configure a
Device management and modeling layer
Azure IoT Hub Device Provisioning Service
(
DPS
)
Azure IoT Hub Device Provisioning Service
(
DPS
)
lifecycle management policy.
For database capabilities, it's common to choose between SQL and no-SQL solutions. SQL databases are best
suited for fixed schema telemetry with simple data transformation or data aggregation requirements. To learn
more, see Types of databases on Azure
Azure SQL Database and TimescaleDB for PostgreSQL are common choices for SQL database. Adopt best
practices to optimize costs:
Plan and manage Azure SQL Database
Azure SQL Database cost optimization
Azure SQL Database for PostgreSQL Extension
Performance tuning for Azure SQL Databases
If the schema of the message isn't fixed and is best represented as an object or document, no-SQL is a better
option. Azure Cosmos DB provides multiple APIs such as SQL or MongoDB.
For any database, the choice of partition and index strategies are important for performance optimization and
reducing unnecessary costs. To learn more, see:
Azure Cosmos DB partitioning and scaling.
Azure Cosmos DB cost optimization
Azure Synapse Analytics is modern data warehouse offering from Azure. Depending on the use case, you can
pause compute when no job is running to reduce operational costs. Azure Synapse Analytics scales by Data
Warehouse Units (DWU) and you should choose the right capacity to handle your solution requirements.
Managing devices is a task that orchestrates complex processes such as supply chain management, device
inventory, deployment, installation, operational readiness, device update, bi-directional communication with
devices, and provisioning. There are many benefits of using the device management and modeling standards of
Azure IoT. The Device Provisioning Service enables low/no touch device provisioning, so you don't have to train
and send people on site. Using DPS reduces the cost for truck rolls and the time spent on training and
configuration. DPS also reduces the risk of mistakes due to manual provisioning.
DPS is a helper service for IoT Hub that enables low-cost, zero-touch, just-in-time provisioning to the right IoT
hub without requiring human intervention to reduce error and cost. DPS enables the provisioning of millions of
devices in a secure and scalable manner.
General device management stages are common in most enterprise IoT projects. In Azure IoT, there are five
stages within the device lifecycle. DPS helps device lifecycle management with IoT Hub through enrollment
allocation policies, zero-touch provisioning, initial configuration setting, reprovisioning, and de-provisioning.
IoT Plug and PlayIoT Plug and Play
Modeling assets and device state in the cloudModeling assets and device state in the cloud
Transport layer
Choose a proper messaging protocol and connectivityChoose a proper messaging protocol and connectivity
To learn more, see:
DPS pricing
Provisioning guide
Modeling devices reduces management costs and messaging traffic volumes. Use IoT Plug and Play and Azure
IoT Central for easy deployment and management. To learn more, see Modeling devices with DTDL.
Compare cost differences between several device topology and entity stores such as Azure Cosmos DB, Azure
Digital Twins, and Azure SQL Database. Each of these services has a different cost structure, but also deliver
different capabilities to your IoT solution. Depending on the required usage of the service, choose the most cost-
efficient service, in line with the lower development and operations effort from adopting PaaS services.
Azure Digital Twins can implement a graph-based model of the IoT environment, both for asset management,
device state, and telemetry data. You can use it as a tool to model entire environments, with real time IoT data
streaming, and merging business data from non-IoT sources. You can build custom ontologies, or use standards
based ontologies such as RealEstateCore, CIM, or NGSI-LD to simplify data exchange with third parties. Azure
Digital Twins has a pay-per-use pricing model without a fixed fee.
Azure Cosmos DB is a globally distributed, multi-model database. Cost is affected by storage and throughput,
with regional or globally distributed and replicated data options.
Azure SQL Database can be an efficient solution for device and asset modeling. SQL Database has several
pricing models to help you optimize costs.
The transport layer conveys data between other layers. As data travels between layers and services, the choice
of protocol affects costs. There are many options to transfer and route data at a low cost.
You can choose the protocol that your IoT devices use to send telemetry. Choosing the right protocol for your
scenario enables devices to reduce transmission sizes and costs. Use cases such as field gateways, industry open
protocol, and IoT network selection also affect costs in the transport layer.
Optimize network trafficOptimize network traffic
Interaction and reporting layer
Utilize data visualization platform for visualizing time series dataUtilize data visualization platform for visualizing time series data
Integration layer
Integrate business systems and Azure services with Azure Digital TwinsIntegrate business systems and Azure services with Azure Digital Twins
Managing connectivity and reliable messaging with the SDK helps to reduce costs of handling unexpected
behavior between device and Azure IoT services.
Device clients regularly send a
keep-alive
message to IoT Hub. For flexibility, some Azure IoT Device SDKs
provide the option to set a timespan for the message if you're using the AMQP or MQTT protocols. You don't
need to add a keep-alive property in the telemetry if there's no specific requirement for it. Charges per
operation indicates that keep-alive messages aren't charged.
Reconnect or keep-alive. Battery powered IoT devices must choose between keeping connections alive or
reconnecting whenever they wake up. This choice affects power consumption and network costs.
The tradeoff relates to the business scenario and costs:
Reconnect: Consumes packets around 6 KB for TLS connection, device authentication, and retrieving a
device twin, but saves battery capacity if the device wakes up once or twice per day.
Keep-alive: Consumes hundreds of bytes, so keeping the connection alive saves network costs if the
device wakes up every few hours or less.
Bundle messages together to decrease TLS overhead.
Device provisioning and reprovisioning for optimizing network traffic
Azure IoT Hub Device Provisioning Service (DPS) reduces the cost of device lifecycle management from zero-
touch provisioning to retirement.
Connecting to DPS every time consumes network cost for TLS and authentication. To reduce network traffic, the
device should cache the IoT Hub information when it's provisioned by DPS, and then connect to IoT Hub directly
until it needs to reprovision. To learn more, see Send a provisioning request from the device.
As IoT handles time series data, there are many interactions from a large number of devices. Reporting and
visualizing this IoT data realizes the value of the data. Building intuitive and simplified user experiences and well-
designed data interactions are costly.
Grafana is an open-source data visualization tool that provides optimized dashboards for time series data. Its
communities provide various examples that you can reuse and customize in your environment. You can
implement metrics and dashboard from time series data with less effort. Azure provides a Grafana plug-in for
Azure Monitor.
Reporting and dashboard tools such as Power BI enable a quick start from unstructured IoT data. Power BI
provides an intuitive user interface and capabilities. You can easily develop dashboards and reports using time
series data and get the benefits of security and deployment at less cost.
Integration with other systems and services is often complex. Maximizing efficiency is the way to optimize costs
within integration layer. There are many services that you should consider to optimize integration costs. Review
the guides that optimize Azure IoT for integration.
Azure Digital Twins provides capabilities to integrate various systems and services with IoT data. As Azure
Digital Twins transforms all data into its own digital entity, it's important to understand its service limits and
tuning points for cost reductions.
DevOps layer
DeploymentDeployment
Cloud GovernanceCloud Governance
Development EnvironmentDevelopment Environment
Review Azure Digital Twins service limits when designing your architecture. There are many factors to consider
for implementation. Understanding functional limitations helps effectively integrate with business systems.
Reduce the complexity queries and the number of query results to optimize Query Unit (QU) costs. When you
use the query API, Azure Digital Twins charges prices per QU. You can trace the number of QUs the query
consumed in the response header. To learn more, see Find the QU consumption in Azure Digital Twins.
Cloud platform transforms capital expenditure (CAPEX) to operational expenditure (OPEX). While this provides
flexibility and agility, you should still have a well-defined deployment and operational model to take full
advantage of the cloud platform. A well-planned deployment creates repeatable assets to shorten time to
market. A cloud platform provides agility for developers to deploy resources in seconds, but there's a risk that
resources are provisioned unintentionally, or that you over-provision cloud resources. A proper cloud
governance model can minimize such risks, and help to avoid incurring unwanted costs.
It's common to deploy workloads in multiple environments, such as development and production. Through
infrastructure-as-code, you can accelerate the deployment by reusing the code and reduce time to market.
Infrastructure-as-code can help avoid unintentional deployments such as deploying the incorrect SKU. First-
party services such as Azure Resource Manager and Azure Bicep, or third-party services such as Terraform and
Pulumi are common infrastructure-as-code options.
DevOps practices for deploying applications can be applied in IoT solutions, using the same concepts of build
and release pipelines to different environments. To learn more, see Please Use a DevOps pipeline to deploy a
predictive maintenance solution.
Coupled with Azure Policy, you can be sure that deployments follow Well-Architected framework guidance.
Cloud governance isn't just about compliance and security, it's essential to prevent unnecessary costs.
Resource tagging is helpful to provide labeling on deployed resources. Labeling can provide insights on ongoing
costs based on the label, together with Azure Cost Management. To learn more, see Common cost analysis uses.
Azure Policy comes with built-in policy to label resources automatically, or flag resources without tagging. To
learn more, see Assign policy definitions for tag compliance.
Another use case for Azure Policy is to prevent provisioning of certain SKUs, which helps to prevent over-
provisioning in the development or production environments.
Developers can take advantage of the flexibility that Azure provides to optimize development cost. Developers
can use the Azure IoT Hub free tier, limited to one instance per subscription. This tier offers standard SKU
capabilities but is limited to 8000 messages a day. This SKU is sufficient for early-stage development with a
limited number of devices and messages.
If you prefer agility over control, Azure IoT Central is a good starting point for IoT solutions development. Each
tier comes with two free devices, and some messages are included, depending on the SKU. Azure IoT Central
bills by devices. It's a cost effective and simple option to calculate total cost of ownership for end-to-end IoT
solutions.
For compute environments, serverless architecture is often adopted for cloud native IoT solutions. Some
popular Azure services for IoT workload include Azure Functions and Azure Stream Analytics. The billing
mechanism depends on respective Azure services, and some allow developers to pause services without
incurring extra compute cost. An example is Azure Stream Analytics for real time processing needs that
Testing IoT solutions for cost estimationTesting IoT solutions for cost estimation
Monitor
Azure AdvisorAzure Advisor
Azure cost management APIsAzure cost management APIs
Azure MonitorAzure Monitor
developers can pause. Some Azure services are billed by usage, for example, Azure Functions is billed based on
number of transactions.
Developers can take advantage of these cloud native capabilities to optimize both development cost and
operational cost, with the right understanding of billing mechanism for each Azure services.
An Integrated Development Environment accelerates development and deployment. Some open-source IDEs
such as Visual Studio Code provide Azure IoT extensions that enable developers to develop and deploy code to
Azure IoT services with no cost.
Azure IoT provides various free sample codes in GitHub with guidance. These samples help developers extend
device, IoT Edge, IoT gateway, and Azure Digital Twins applications. GitHub also has features to implement
seamless CI/CD environments, with less cost and effort. If the project is intended to be open-sourced, GitHub
Actions are free. To learn more, see GitHub plans and features.
When you build end-to-end IoT solutions, you need to estimate overall costs including various cloud services,
through load testing. As IoT solutions use large amounts of data, a simulator can help with load testing. Some
simulation code samples such as Azure IoT Device Telemetry Simulator help you test and estimate costs with
various parameters at scale.
There are many tools included in your Azure subscription to get more value out of your IoT services and
implement financial governance in your organization. Such tools enable organizations to track resource usage
and manage costs across all your clouds with a single, unified view, and access rich operational and financial
insights to make informed decisions:
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. It analyzes your resource configuration and usage telemetry and then recommends solutions that
can help you improve the cost effectiveness, performance, reliability, and security of your Azure resources.
Azure Advisor helps you optimize and reduce your overall Azure spend by identifying idle and underutilized
resources. You can get cost recommendations from the cost tab on the Advisor dashboard.
Although Azure advisor doesn't have specific out-of-box recommendations for IoT services, it can be useful to
gather recommendations for Azure infrastructure, storage, and analytics services. To learn more, see Reduce
service costs by using Azure Advisor.
These APIs provide the ability to explore cost and usage data through multidimensional analysis. They enable
you to create customized filters and expressions that let you answer consumption-related questions about your
Azure resources.
Azure Cost Management has an alert feature. Alerts are generated when consumption reaches a threshold.
Azure Cost Management APIs are available for:
IoT Central
IoT Hub
Device Provisioning Service
Azure Monitor and Log Analytics Workspace are commonly used for telemetry logging. Log Analytics includes 5
GB of storage and first 30 days of retention is free. Depending on business need you may need a longer
Auto
-
scale for IoT Hub unit adjustmentAuto
-
scale for IoT Hub unit adjustment
Next steps
Resources
retention period.
Review and decide the right retention period to avoid unintentional costs. Azure Log Analytics provides a
workspace environment to query logs interactively. You can export logs periodically to external locations such as
Azure Data Explorer or archive logs in a storage account for a cheaper storage option.
To learn more, see Monitor usage and estimated costs in Azure Monitor.
Each IoT Hub unit has a threshold for the number of messages. Adjusting the number of IoT Hub units
dynamically helps to optimize the cost when the message volume fluctuates. You can implement an auto-scale
service that automatically monitors and scales your IoT Hub service. IoT Hub provides a customizable Azure
Function sample to implement auto-scale capability. You can use your own custom logic to optimize IoT Hub tier
and number of units.
Operational excellence in your IoT workload
Overview of an IoT workload: Includes guidance and recommendations that apply to each of the five pillars
in IoT workload.
Cost optimization design principles - Microsoft Azure Well-Architected Framework: Understand cost
optimization principles. These principles are a series of important considerations that can help achieve
business objectives and cost justification.
Checklist - Design for cost - Microsoft Azure Well-Architected Framework: View checklists for the cost model
and architecture to use when you design a cost-effective workload in Azure
Capture cost requirements for an Azure - Microsoft Azure Well-Architected Framework: Learn to enumerate
cost requirements and considerations, and how to align costs with business goals.
Checklist - Optimize cost - Microsoft Azure Well-Architected Framework: Use these checklist considerations
to help monitor and optimize workloads by using the right resources and sizes.
Develop a cost model - Microsoft Azure Well-Architected Framework: Do cost modeling to map logical
groups of cloud resources to an organization's hierarchy, and then estimate costs for those groups.
Operational excellence in your IoT workload
12/16/2022 22 minutes to read • Edit Online
Given the complexity of non-functional or operational needs such as security, reliability, performance, and cost
of IoT devices, perhaps the single most important factor to scaling an IoT solution and driving sustainable
business value for an organization is the ability of an organization to have excellent IoT device operational
capabilities. Operational excellence in an IoT workload is about enabling an organization for full visibility and
control over the hardware and software components of an IoT solution, ensuring the best experience for users.
Operational excellence includes making the IoT solution design, development, provisioning, monitoring,
support, and maintenance practices more agile and valuable to business without increasing operational risk. The
principles of operational excellence in an IoT workload are a set of considerations that help achieve superior
operational practices. These principles are described in a following section.
As you consider operational excellence in an IoT workload, the shared responsibility model in cloud and hybrid
scenarios also applies. This model is best described in the context of security considerations. An organization's
IoT operation comprises three different stacks of technologies: cloud, communication network, and IoT devices.
Given the scale and diversity of devices in IoT solutions and the use of different types of communication
networks at geographically distributed locations where devices are installed, there's a significant shift in IoT
solutions in the cloud shared responsibility model. It leaves operational ownership of key elements of an IoT
solution with the organization. While Microsoft has made it easy in many ways for organizations to operate
these elements on their own or using third parties, the organization owns the operational responsibility for
these elements.
The IoT Well-Architected Framework operational excellence guide focuses on the operational excellence aspects
of the IoT devices and Azure services that uniquely address the core requirements of an IoT solution. The key
factor in IoT operational excellence is the ability of an organization to plan, provision, configure, monitor, and
retire IoT devices. The core Azure IoT services discussed in this guide provide an array of capabilities for an
organization to accomplish these tasks effectively and efficiently and enable an organization to have excellent
IoT device operational capabilities.
The communication network technology stack of an organization's IoT operation is typically handled by the
organization's network operations team partnering with the organization's telecommunication operator. It's
recommended that the organization coordinate with their telecommunication operator to set up and operate the
wired and wireless communication network components of their IoT solutions and operations.
An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,
and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also
cross-cutting layers that enable the design, building, and running of IoT solutions:
Prerequisites
Principles
To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected
Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:
Azure Well-Architected Review
After you complete the assessment, this guide helps you address the key operational excellence
recommendations identified for your solution.
The following principles apply to the operational excellence pillar in an IoT workload:
Solution operation and scaling:Solution operation and scaling: Follow the best practices around this principle to ensure that the IoT
solution can successfully manage the automated provisioning of devices, can be integrated with other backend
systems, is designed for use by different types of personas including solution developers, solution
administrators, and operators, and adapts/scales efficiently to any changes in demand such as new IoT devices
being deployed or higher ingestion throughput.
Device management:Device management: Any successful enterprise IoT solution requires a strategy to establish and update a
device or fleet of device's configuration. A device's configuration includes device properties, connection settings,
relationships, and firmware. IoT operators require simple and reliable tools that enable them to update a device
Solution operation and scaling
DesignDesign
ImplementImplement
Select and test IoT hardwareSelect and test IoT hardware
or fleet of device's configuration at any point during the device's lifetime.
Monitoring and aler ting:Monitoring and aler ting: Use IoT solution logging, monitoring, and alerting systems to determine whether
the solution is functioning as expected and to help troubleshoot what is wrong throughout the lifecycle of the
solution.
Automation and DevOps:Automation and DevOps: An IoT device is fundamentally a small computer with specialized hardware and
software. IoT devices are often constrained in hardware, for example having limited memory or compute
capacity. Automation and DevOps are essential to ensure that OS and software for IoT devices and gateways are
properly uploaded and deployed to minimize any operational downtime. Automation and DevOps become an
essential aspect for monitoring and managing the lifecycle of IoT devices.
The following sections address the layer specifics of these principles for the operational security pillar:
Some key design decisions to consider include:
Select IoT hardware that meets the business and technical requirements of the solution and define
appropriate testing procedures to ensure the operational reliability of your IoT devices and gateways.
Implement an automated way to identify device capabilities at scale as they connect to your IoT solution
to easily integrate them with your backend services.
Define a remote device provisioning strategy to enable zero-touch, just-in-time provisioning of IoT
devices in the field without requiring human intervention.
Determine the roles that are responsible for developing, managing, and operating the IoT solution at
scale. Determine the users in the organization to be assigned to those roles.
Implement a centralized device management solution to administer, monitor, and operate the lifecycle of
IoT devices and to manage the overall configuration of the IoT solution. Consider an integrated UI to
assist operation teams to management the device fleet.
Configure and test reliable integration with other Azure and third party services that support the backend
and frontend services of the IoT application. A typical IoT solution is composed of multiple components
such as ingestion, routing, data storage, and data processing. It's important to verify that all these
components in the data flow are functional.
Configure the ingestion and other backend layers of the IoT solution in the cloud to scale to be able to
handle expected and unexpected capacity needs as demand for the service grows.
The Azure Device Catalog site lets IoT solution developers find hardware from a list of certified partners and
select the IoT devices that meet the specific requirements of the IoT solution being implemented. For example,
during product selection or development, you may need to identify devices that comply with certain regional
certification requirements and regulations such as CE, FCC, UL, PCI, and FDA.
For a greenfield project, there's more flexibility regarding the types of devices that can be used, the required
firmware/connectivity features and the technical specification for the IoT devices. For brownfield projects, there
are typically more restrictions when you select hardware (because hardware has already been deployed in the
field) but you might still need to look for other type of hardware such as protocol/identity translation devices or
connectivity gateways - such as Bluetooth to MQTT gateway - depending on the specific scenario. For either
case, the Azure Device Catalog has search and filter capabilities that you can use to find IoT hardware that's been
certified by the IoT device certification program and are compatible with Azure IoT services.
Provision and manage the lifecycle of IoT devicesProvision and manage the lifecycle of IoT devices
Determine the list of roles and users for the IoT solutionDetermine the list of roles and users for the IoT solution
Integration with backend systemsIntegration with backend systems
An important feature to look for when selecting hardware is Azure Plug-and-Play and the Digital Twins
Definition Language (DTDL) compatibility. These features ensure the device integrates seamlessly with services
such as Azure IoT Central and Azure Digital Twins. For Azure IoT Edge scenarios, it's also important to find
devices in the catalog that have the IoT Edge Managed certification that guarantees the device can run the Azure
IoT Edge runtime and enables the deployment and management of IoT Edge Modules to support local
processing and analytical workloads.
When you select devices or components for your solution, take care to ensure a timely and secure supply of
equipment. Consider dual or multi supply sources and a trusted vendor chain. The availability of spares to cover
the solution lifetime and maintenance or support contracts should also be included, as this step is sometimes
forgotten at the start of a project and can be expensive to introduce later.
For remote provisioning of IoT devices, Azure Device Provisioning Service (DPS) enables connecting and
configuring remote devices to an IoT Hub. DPS enables zero-touch provisioning without hardcoding information
at the factory and it enables load-balancing of devices across multiple IoT Hubs. Although DPS supports
symmetric key attestation, in a production environment you should use either the X.509 certificate or TPM
attestation mechanisms. If you use X.509 certificates, the root certificate (or an intermediate certificate that has
been signed by the root certificate) should be deployed to DPS to allow devices in the field to properly
authenticate to the service and assigned to their corresponding IoT hub.
Part of an IoT solution lifecycle includes reprovisioning devices in the field and/or moving them between IoT
hubs. DPS enables the configuration of reprovisioning policies that determine the expected behavior when an
IoT device submits a new provisioning request. Devices should be programmed to send a provisioning request
on reboots and should implement a method to manually trigger a provisioning on demand. This ensures that
every time the device boots, it always contacts DPS before getting redirected to the appropriate IoT hub.
A key decision to make early in the IoT solution design phase is to determine the list of roles that implement and
manage the solution along with the corresponding responsibilities. Ideally, the solution should trust a
centralized identity provider, such as Azure Active Directory, and only let the appropriate users in those roles
perform any management and/or operation activities such as creating and provisioning new devices, sending
commands to hardware in the field, deploying updates, and modifying user permissions. If the solution uses IoT
Central, there are built-in roles that you can assign to existing users in Azure Active Directory eliminating the
need to develop a customized identity service. In a solution backed by IoT Hub, you can use Azure Active
Directory to authenticate requests to Azure IoT Hub service APIs, such as creating device identities or invoking
direct methods. You can develop a custom management UI interface for solution operators and administrators
can be developed that authenticates users against Azure Active Directory and executes the API requests to the
IoT solution backend on behalf of those users.
A successful IoT implementation requires the proper integration of IoT services such as IoT Hub and DPS with
other backend and frontend services, both Azure or third party. Although the configuration of those other
services is out of the scope of this document, it's important to document and have a good understanding of the
entire data flow of the IoT solution and to have testing procedures in place to ensure the different parts of the
solution are working as expected and meeting the technical and operational requirements of the organization.
For example, DPS supports custom allocation policies by using custom code and Azure Functions, therefore it's
important to confirm that the Azure Function allows traffic coming from DPS and IoT Hub. Another example is
the integration between IoT Hub and backend services to enable features such as message routing and file
upload. IoT Hub needs to properly authenticate to those Azure services and you should use managed identities
to eliminate the need for developers to manage those credentials manually. Integration of Azure IoT services
with other services should be tested periodically to ensure that the IoT solution is performing as expected and
meeting the operational SLAs defined by the business. The Azure Architecture Center contains many IoT
reference architectures that outline and describe the proper and secure integration of the Azure IoT platform
Design a management interfaceDesign a management interface
Capacity scalingCapacity scaling
ResourcesResources
Device management
DesignDesign
with other services.
Solution operators and administrators need an interface to interact with the IoT solution. If an IoT solution is
based on IoT Central, an easy-to-use management interface is already built into the application enabling
operators and administrators to complete tasks such as provisioning new devices, adding/removing users,
sending commands to IoT devices, and managing device updates. If the UI management interface provided by
IoT Central doesn't meet the requirements of the solution or the solution is based on PaaS services such IoT
Hub, then you can build a custom management UI that uses the service REST APIs exposed by IoT Hub and IoT
Central.
Azure offers many options to meet capacity requirements as your business grows. Capacity planning and
scaling for your IoT solution varies depending on whether you choose to build an IoT Central or IoT Hub based
solution.
When you implement an IoT solution by using the application-platform-as-a-service (aPaaS) IoT Central, little
needs to be done. IoT Central automatically manages multiple instances of its underlying services to scale your
IoT Central applications and make them highly available. However, since IoT Central only stores 30 days of data,
you're likely to export data to other services. These other services are where you should focus your time, making
sure the solution can handle expected and unexpected capacity needs.
With a solution based on IoT Hub, it's your responsibility to scale up to handle growth in the number of
messages being ingested and to scale out to handle regional demands for the solution. Understanding the
number of messages that will be sent to the IoT hub and the sustained throughput is critical to select the correct
IoT Hub tier to support the predicted demand. If you're approaching the message limit for your IoT Hub, you can
follow these steps to automatically scale the IoT Hub up to the next unit of capacity. In addition to properly
configuring IoT Hub to scale as demand changes, it's important to design any backend services in the IoT
solution (such as Azure Stream Analytics, Azure Cosmos DB, and Azure Data Explorer) with scalability in mind to
ensure there are no bottlenecks anywhere in the solution's data flow.
If your solution is tied to a connected product, the business must communicate any fluctuation in expected load.
Load can be impacted by marketing initiatives, such as sales or promotions; or, by seasonal events, such as
holidays. You should test variations of load prior to events, including unexpected ones, to ensure that your IoT
solution can scale.
Also, you should plan for the capacity needs/requirements of the IoT Edge devices. Whether you're managing
RTOS based devices or larger compute devices with IoT Edge, there are many factors that need to be taken into
account to make sure the sizing of compute and memory are adequate for the specific use-cases.
Azure IoT Hub scaling
Azure IoT Edge Requirements
DPS X.509 Attestation
Manage Users and Roles in IoT Central
Azure IoT Edge Managed Certification Requirements
Controlling access to Azure IoT Hub by using Azure Active Directory
Plan for capacity - Microsoft Azure Well-Architected Framework
An IoT solution's scale and specific use of a device's configuration influences the design of a configuration
management strategy. It's important to automate this strategy as much as possible and ensure a device or fleet
ImplementImplement
ResourcesResources
Monitoring and alerting
DesignDesign
of device's configuration is set and updated efficiently as well.
A configuration management strategy should support:
Inventory of IoT devices and IoT Edge devices deployed in the field
Gradual update rollout through device grouping
Resilient updates to support testing and rollbacks
Automatic updates for existing or new devices
Update status reports and alerts
Example services that support these requirements for configuration management are Azure IoT Hub Automatic
Device Management, Azure IoT Edge Automatic Deployment, Azure IoT Hub Scheduled Jobs, and Device Update
for IoT Hub.
For continuous updates to existing or new device and IoT Edge device configurations, such as properties,
application specific settings, and/or relationships, use either Azure IoT Hub Automatic Device Management or
Azure IoT Edge Automatic Deployment. Both services enable an efficient, secure, and reliable way to automate
configuration deployments for either a fleet or specific group of devices. Both services continuously monitor all
new and existing targeted devices, based on tags, and their configuration to ensure these devices always have a
specified configuration. The key difference between these services is that Automatic Device Management only
applies to non-Azure IoT Edge devices and Azures IoT Edge Automatic Deployment only applies to Azure IoT
Edge devices.
To update an existing device or IoT Edge device configuration, such as properties, application specific settings,
and/or relationships, based on a schedule, whether this update is onetime or recurring, use Azure IoT Hub
scheduled jobs. This service enables an efficient, secure, and reliable way to provide a configuration update for
either a fleet or specific group of devices at a scheduled time.
To update an existing device or IoT Edge device configuration, such as firmware, application, or package updates,
use Device Update for IoT Hub. This service enables a safe, secure, and reliable way to provide firmware,
application, or package updates over-the-air (OTA) to either a fleet or specific group of devices. It's a good idea to
include a manual update method when developing IoT devices. Because of root certificate changes to IoT Hub or
DPS, or connectivity issues, you may need to update devices using a manual method such as physically
connecting to a local computer or using a local connectivity protocol such as Bluetooth.
Azure IoT Hub Automatic Device Management
Azure IoT Edge Automatic Deployment
Azure IoT Hub Scheduled Jobs
Device Update for IoT Hub
Monitoring and logging systems help to answer the following operational questions:
Are devices or systems in an error condition?
Are devices or systems correctly configured?
Are devices or systems generating accurate data?
Are systems meeting the defined service level objectives?
Monitoring and logging systems can help answer these questions. If the answer is no, these systems surface
relevant information for operations teams to help mitigate problems. IoT solutions logging and monitoring
ImplementImplement
Monitor Azure IoT HubMonitor Azure IoT Hub
systems are often more complicated than those of standard line-of-business applications. The complexity arises
from the fact that IoT solutions span:
Physical sensors interacting with an environment.
Applications on the intelligent edge performing activities such as data shaping and protocol translation.
Infrastructure components such as on-premise gateways, firewalls, and switches.
Ingestion and messaging services.
Persistence mechanisms.
Insight and reporting applications.
Subsystems that operate and scale independently in the cloud.
The following simplified logging and monitoring architecture shows examples of typical IoT solution
components and how they use some recommended technologies.
When you have critical applications and business processes that rely on Azure resources, you should monitor
those resources for their availability and performance.
The Azure Monitor service supports these requirements. This service can:
Detect and diagnose issues across applications and dependencies with Application Insights.
Correlate infrastructure issues with VM insights and Container insights.
Drill into your monitoring data with Log Analytics for troubleshooting and deep diagnostics.
Support operations at scale with smart alerts and automated actions.
Create visualizations with Azure dashboards and workbooks.
Collect data from monitored resources using Azure Monitor Metrics.
The Over viewOver view page in the Azure portal for each IoT hub includes charts that provide some usage metrics, such
as the number of messages used and the number of devices connected to the IoT Hub. The information on the
Over viewOver view page is useful but represents only a small amount of the monitoring data available for an IoT hub.
Some monitoring data is collected automatically and is available for analysis as soon as you create your IoT hub.
You can configure other types of data collection.
Azure IoT Hub collects the same types of monitoring data as other Azure resources as described in Monitoring
IoT Edge metrics collectorIoT Edge metrics collector
data from Azure resources.
To learn more about the metrics and logs that IoT Hub creates, see Monitoring Azure IoT Hub data reference.
Microsoft provides a ready to use module called
Metrics Collector
. Add this first-party module to an IoT Edge
deployment to collect module metrics and send them to Azure Monitor. The module code is open source and is
provided as a multi-arch Docker container image that supports Linux x64, ARM32, ARM64 (version 1809). It's
available in the IoT Edge Module Marketplace.
Module features include:
Collect logs from all the modules that can emit metrics using the Prometheus data model. While built-in
metrics enable broad workload visibility by default, you can also use custom modules to emit scenario-
specific metrics that enhance the monitoring solution.
There are two options to send metrics from the metrics collector module to the cloud:
Send the metrics to Log Analytics. The collected metrics are ingested into the specified Log
Analytics workspace using a fixed, native table called InsightsMetrics.
Send the metrics to IoT Hub. The collector module can be configured to send the collected metrics
as UTF-8 encoded JSON device-to-cloud messages through the
edgeHub
module. This option
unlocks monitoring of locked-down IoT Edge devices that are only allowed external access to the
IoT Hub endpoint.
The
AllowedMetrics
and
BlockedMetrics
configuration options take space or comma separated lists of
metric selectors. A metric is matched to the list and included or excluded if it matches one or more
metrics in either list.
You can visually explore metrics collected from IoT Edge devices using Azure Monitor workbooks. Curated
monitoring workbooks for IoT Edge devices are provided as public templates:
For devices connected to IoT Hub, from the IoT HubIoT Hub page in the Azure portal, navigate to the
WorkbooksWorkbooks page in the MonitoringMonitoring section.
For devices connected to IoT Central, from the IoT Central page in the Azure portal, navigate to the
WorkbooksWorkbooks page in the MonitoringMonitoring section.
Curated workbooks use built-in metrics from the IoT Edge runtime. They first need metrics to be ingested into a
Log Analytics workspace. These views don't need any metrics instrumentation from the workload modules:
Monitor updatesMonitor updates
Monitor configuration managementMonitor configuration management
Monitor automation and DevOpsMonitor automation and DevOps
ResourcesResources
Automation and DevOps
As with any deployment/update, you should monitor the state of deployment and device. In the Device Update
for the IoT Hub service, compliance measures how many devices have installed the highest version compatible
update. A device is compliant if it has installed the highest version available compatible update.
As with any deployment and/or update, you should have monitoring and alerting on the status of a device
configuration deployment and/or update in place. Each Azure IoT configuration service, as mentioned
previously, collects and stores logs and metrics in Azure Monitor. You can use this data to create alerts in Azure
Monitor for sending notifications when a configuration deployment and/or update has been created, completed,
or failed.
If the monitoring data provided by each of the Azure IoT configuration services isn't enough, the Azure IoT Hub
service APIs offer a more granular view. See the following ResourcesResources section.
DPS, Azure IoT hub configurations and IoT Edge provide continuous metrics and status updates that are the key
inputs to monitor the state of CI/CD or automation script output. You can collect and analyze these metrics in a
Log Analytics workspace and then define alerts.
Monitor device connectivity using the Azure IoT Central Explorer
Monitor, diagnose, and troubleshoot Azure IoT Hub disconnects
Monitoring Azure IoT Hub
Azure IoT Hub Automatic Device Management Monitoring
Check Azure IoT Hub service and resource health
Tutorial - Set up and use metrics and logs with an Azure IoT hub
Collect and transport metrics - Azure IoT Edge
The key benefit for organizations that are mature in using DevOps is agility - the ability to quickly sense and
DesignDesign
ImplementImplement
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM FEAT URE AVA IL A B L E W IT H C O DE SN IPP E T SP L AT F ORM FE AT UR E AVA IL A B L E WIT H C O DE SNIP P E T S
Device registration (non-DPS) Bulk Device updates
respond to changes in business needs and seamlessly drive through the full lifecycle of developing, deploying,
and operating a solution. DevOps in the world of application software development is enabled by automation in
testing, integration, and deployment. Finally, when application software changes are deployed in an
infrastructure-as-code environment, the ongoing operation of deployed software is automated and managed.
In the world of IoT solutions, there are three areas where agility, DevOps and automation come into play in the
development, deployment, and operation of an IoT solution:
The automation of the IoT application software lifecycle from development through testing to
deployment to IT operations.
The IoT application software deployed in devices using Azure IoT Edge. Here, once again, you can use a
combination of services available in Azure IoT Hub and Azure IoT Edge to use Azure DevOps tools and
processes in automating software lifecycle aspects of an IoT solution at the edge.
Addressing the challenge to operationally manage the IoT devices by providing operators with tools that
let them collaborate and get the necessary visibility, insights, and control to properly maintain and
operate a reliable IoT solution.
Consider the following aspects when you design your automation and DevOps processes for IoT:
Spreading approach:Spreading approach: Thousands of devices are connected to single cloud endpoint such as Azure IoT
Hub. Any change in configuration or software must be spread across all the devices.
Cross functional aspect:Cross functional aspect: Device vendors and cross-functional solution developers work together to
develop and deploy the IoT solution. Embrace cross-functional teams to deliver continuously for the
solution.
Software or configuration defined solutions:Software or configuration defined solutions: System functionality is defined by the program or
software installed on the hardware. Therefore, to change system functionality, update software instead of
making hardware changes or local interventions.
Productivity and rapid development cycle:Productivity and rapid development cycle: A process that boosts the development cycle seamlessly
across solution development cycle. Use CI/CD DevOps principles and processes.
Evolving business and deployment model:Evolving business and deployment model: The core purposes of IoT solutions are to create
possibilities for different business models and pilot validation, deployment, and enhancements. DevOps
provides a way to consistently deliver fresh software updates to meet the solution and business
requirements.
Intelligence at the edge:Intelligence at the edge: Connected IoT Edge devices have a lifecycle that extends beyond deploy,
break and fix, and retire. Connectivity creates the opportunity to continuously add incremental innovation
across the system lifecycle. DevOps puts organizations in the best position to capitalize on that
opportunity.
When you implement automation and DevOps in IoT systems, follow the device lifecycle and specific
automation and DevOps requirements in each phase. The following tables describe three phases of the device
lifecycle:
Beginning of lifeBeginning of life:
Device provision DPS Service: Prepare the configuration required to provide
Zero touch device provisioning
Device certificate and token management Generate SAS token or public key at device for
authentication and channel security
Device certificate lifecycle management CA certificate lifecycle management (Preview feature with
DPS and DigiCert)
Device initial configurations Device Twins and device modules
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM FEAT URE AVA IL A B L E W IT H C O DE SN IPP E T SP L AT F ORM FE AT UR E AVA IL A B L E WIT H C O DE SNIP P E T S
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM FEAT URE AVA IL A B L E W IT H C O DE SN IPP E T SP L AT F ORM FE AT UR E AVA IL A B L E WIT H C O DE SNIP P E T S
Continuous device configuration management at scale Device Twins and device modules
Create a CI/CD pipeline for IoT Edge modules IoT Edge CI/CD
Reprovision device DPS device reprovision
SAS key generation if expired or require changes Generate SAS token or public key at device for
authentication and channel security
Log and device diagnostics Analyze the device logs, Azure IoT hub supported device
heartbeats and diagnostics, Azure IoT Hub workbook blade
with pre-configured workbooks
Azure IoT Edge monitoring diagnostics Collect and automate Azure IoT Edge device logs and
metrics
OTA device updates Device updates
EXPEC TAT IO N SEXP E C TAT IO NS P L AT F O RM FEAT URE AVA IL A B L E W IT H C O DE SN IPP E T SP L AT F ORM FE AT UR E AVA IL A B L E WIT H C O DE SNIP P E T S
Unenroll devices Revoke device access
Remove any configuration specific to a device Device Twins and device modules
Device replacement Follow beginning of life steps
Next steps
Middle of life:Middle of life:
End of Life:End of Life:
Performance efficiency in your IoT workload
Performance efficiency in your IoT workload
12/16/2022 22 minutes to read • Edit Online
Prerequisites
Principles
Run performance testing in the scope of developmentRun performance testing in the scope of development
Continuously monitor the application and the supporting infrastructureContinuously monitor the application and the supporting infrastructure
Performance efficiency is the ability of your workload to scale to meet the demands in an efficient manner.
IoT workloads span from millions of small devices connected to the cloud to industrial solutions where the
devices are a few powerful servers acting as gateways for cloud connectivity. A device, or more specifically an
IoT Device, may generally be defined as a computing device that collects and transmits information collected
from sensors. The number of devices, the number of messages sent/received and the physical geographical
placement of the devices are a few of the factors that define the performance efficiency of an IoT workload.
A benefit of the cloud is the geographical availability and the capability to scale services to meet the demands.
Scale our IoT Services without, or with minimal, application downtime.
To assess your IoT workload based on the principles described in the Microsoft Azure Well-Architected
Framework, complete the IoT workload questionnaires in the Azure Well-Architected Review assessment:
Azure Well-Architected Review
After you complete the assessment, this guide helps you address the key performance efficiency
recommendations identified for your solution.
This pillar represents the performance relative to the resources used under stated conditions. This characteristic
is composed of the following subcharacteristics:
Time behaviorTime behavior - Degree to which the response times, processing times and throughput rates of a
product or system, when performing its functions, meet requirements.
Resource utilizationResource utilization - Degree to which the amounts and types of resources used by a product or
system, when performing its functions, meet requirements.
CapacityCapacity - Degree to which the maximum limits of a product or system parameter meet requirements.
Overall performance principles can be found at Principles of the performance efficiency
An IoT solution has several specific challenges to address. An IoT project contains both IoT devices, edge
software and cloud software parts and, when deployed, may contain millions of devices connected from
multiple regions of the world and sending millions of messages per minute.
Be aware of the complexity of having sensors, devices and gateways in geographically different locations with
different characteristics, speed and reliability of communication. Plan for this in your testing and make sure to
test for failure scenarios like network disconnect etc. In addition, plan for stress/load test of device, edge and
cloud components of your full IoT solution.
Monitoring an IoT solution with several types of devices in multiple geographical regions requires a distributed
monitoring solution that balances the amount of information monitored and sent to the cloud versus the cost in
memory and performance of the monitoring (tune the transmission for diagnostics scenarios). Monitoring must
Invest in capacity planningInvest in capacity planning
Design
Device and gateway layer
occur at multiple levels / layers, exposing metrics etc. on a gateway etc. for industrial or gateway enabled
solutions.
An IoT solution can easily start with a few hundred devices or messages and grow to millions of devices and
messages per minute. One of the benefits of the cloud is that it can often be scaled to an increase in load. For IoT
devices and gateways, the situation is much more complex as these devices are often designed long before the
solution is finalized and the need to update the capacity is costly, as devices may need to be replaced. In
industrial IoT or similar industries, the lifespan of a device is measured in decades so it's more important than
ever in these scenarios to think ahead.
An IoT architecture consists of a set of foundational layers. Layers are realized by using specific technologies,
and the IoT Well-Architected Framework highlights options for designing and realizing each layer. There are also
cross-cutting layers that enable the design, building, and running of IoT solutions:
The following sections address the layer specifics for the performance efficiency pillar:
Devices are computing devices that connect to an IoT solution and have the ability to transmit or receive data.
Gateways are devices that serve as the connection point between an IoT solution and other devices. Design
Optimize the hardware specifications of devices and gatewaysOptimize the hardware specifications of devices and gateways
Categorize individual workloadsCategorize individual workloads
Optimize connectivityOptimize connectivity
solution accordingly, factor in the load and the limits/quotas.
Optimize the capabilities of your existing hardware by using more efficient languages and frameworks
like C and Rust.
Upgrading or replacing hardware is costly and takes time, therefore devices that are connected to the IoT
solution should be sized for the required capacity and functionality in advance.
Use gateways as units of scale. For example, if your solution adds OPC UA servers over time, other edge
gateways can be used to ingest data from these other servers.
When scaling the Device and gateway layer conduct a scale assessment for all upstream layers including
edge gateways, cloud gateways, and upstream cloud services.
Use multiple IoT hubs as scale units for the IoT solution, see Tutorial: Provision devices across load-
balanced IoT hubs. Use DPS to distribute devices over multiple IoT Hub instances.
Depending on the requirements of the solution, run CPU and I/O-intensive tasks on specific hardware
(for example, run ML algorithms on local GPUs) or run their workloads in the cloud (such as blurring
faces in a video stream).
Using the Azure IoT Device SDKs for connecting the cloud gateway because it manages required message
translation, error handling, and retry mechanisms needed for a resilient connection.
Consider using the Azure IoT Embedded C SDK when developing for constrained devices or when most
of the security and communication stack is already available on the device. Consider using the Azure IoT
device SDK for C to include all that is needed to connect to the cloud gateway.
Consider running workloads on the device/edge to use local compute for low latency and not always
connected scenarios. Consider running workloads in the cloud to use scale.
Consider separating workloads by time constraint and required latency and response times, for example
response within seconds versus and batch oriented per hour. Depending on system constraints such as
network throughput or latency the solution may demand running your workload at the edge.
On the edge, use priority queues for sending different streams of data in the order that is required.
Although messages will be sent in order of priority to the IoT Hub, this doesn't impact receiving the
messages from the IoT Hub as they're still journaled on the IoT Hub based on receipt order.
In the cloud, use different consumer groups on Event Hubs to separate out different streams of data, you
can handle and scale alarms differently from telemetry.
In the cloud, use IoT Hub routes to separate out different streams of data, with filtering and separate
endpoints. IoT Hub message routing will add some latency.
Consider using open stateful connections during the operational time of the devices and gateways to
minimize the overhead of setting up the connection
Use an open stateful connection to the IoT solution for bi-directional communication between the devices
and the IoT solution
Consider using IoT Hubs with the lowest latency to your devices. This could mean that IoT Hubs in
multiple regions are needed when devices need to connect from different geographical locations.
Use DPS to set up a connection to the IoT hub during provisioning, when the connection isn't available
anymore or during reboot of the device.
Optimize offline scenarioOptimize offline scenario
Ingestion and communication layer
Limits and throttlingLimits and throttling
Use the evenly weighted distribution policy of DPS to adjust the weight for provisioning, refer to How the
allocation policy assigns devices to IoT Hubs.
Consider devices not to connect all at once, for example after a regional power outage. Use truncated
exponential backoff with introduced jitter when retrying.
Consider using Device and Module twins to asynchronously sync state information between devices and
the cloud, even when the device isn't currently connected to the cloud gateway. Device and Module twins
contain only the current state at a point in time, not any history or removed information.
Consider that a device has enough information and context to work without the need for a connection to
the cloud and store this information locally so that the device can recover from a reboot.
Ensure a device is capable of storing the data locally, including logs and cached telemetry per priority
(when the device isn't connected).
Consider setting a time-to-live on the data, so that expired data will be removed automatically.
Consider discarding less important data when the device isn't connected to reduce the local storage
needed and reduce the synchronization time when the device reconnects.
Consider using a cache eviction strategy such as FIFO, LIFO, or priority based for when edge device
storage capacities are reached.
Consider using a separate disk (or disk controller) to store data, so that the device runtime/application
can continue to work if running low on storage.
Data ingestion is the process used to send data from the devices into the IoT solution. Next to data ingestion
there can be other patterns of communication between devices and the IoT solution. These include:
Device-to-cloud messages.
Cloud-to-device messages.
Initiate file uploads.
Retrieve and update device twin properties.
Receive direct method requests.
For more information about IoT Hub endpoints, see IoT Hub Dev Guide Endpoints
The cloud IoT gateway has well defined limits per unit dependent on the tier of the IoT Hub. These limits are
defined as
Quota
for sustained throughput and sustained send rates for the selected IoT Hub tier. Although these
quotas are defined as the sustained throughput and send rate, IoT Hub is capable of handling loads above these
quotas for short periods of time, to resiliently handle short bursts or load overshoots. This is where another
limit becomes important, which is the
Throttle Limit
, which is an hourly or daily upper limit for the load of the
service. The throttle limits protect an IoT Hub from too much load for a longer period of time.
The diagrams below show the relation between the load, the quota and the throttle limits. The left diagram
shows that an IoT Hub can handle sustained or constant high load up to the level of the quota for the selected
IoT Hub tier. The right diagram shows that an IoT Hub can handle load that is changing over time as long as it
isn't hitting the throttle limit and on average not above the quota for the selected IoT Hub tier.
Optimize interactions between device and cloud, components, and servicesOptimize interactions between device and cloud, components, and services
Optimize communication and cloud processing of messagesOptimize communication and cloud processing of messages
The number of messages sent from device to cloud is an important parameter for IoT solution performance
efficiency. Also, Azure IoT services such as IoT Hub and IoT Central have a defined limit of messages for a
selected tier and affect both performance and cost of the solution.
Consider the following recommendations for sending messages:
IoT Hub and IoT Central calculate daily quota message count based on a 4-KB max message size. Sending
smaller messages will leave some capacity unused. In general, use message sizes close to the boundary
of 4 KB.
Consider message sizing. Group smaller device-to-cloud messages into larger messages to reduce the
total number of messages. Consider the introduced latency when combining messages.
Consider message frequency. Avoid using chatty communication dependent on the use case.
Use Azure IoT SDK message batching for AMQP to send multiple telemetry messages to the cloud.
Consider using AMQP connection multiplexing to reduce the dependency on TCP connections limits per
SDK client. With AMQP connection multiplexing, a single TCP connection to IoT Hub can be used by
multiple devices.
Use built-in message batching capabilities for AMQP of IoT Edge to send multiple telemetry messages to
the cloud.
Consider using application-level batching by combining multiple smaller messages at the downstream
device and send larger messages to the gateway/edge device. This will limit the message overhead and
lower the writes to local disk storage at the edge.
For device-to-device or module-to-module communication at the edge, also avoid designing interactions
in which many small messages are used.
Consider using Direct Methods for request/reply interaction with a device that can succeed or fail
immediately (after a user-specified timeout). This approach is useful for scenarios where the course of
immediate action is different depending on whether the device was able to respond.
Consider using Device Twins for device state information including metadata and configurations. Azure
IoT Hub maintains a device twin for each device that you connect to IoT Hub.
It's common that messages from a device or gateway need to be translated, processed or enriched with
additional information before being stored. This could potentially be a time-consuming step, so it's important to
evaluate the effect on performance in your solution. Some of the recommendations are also conflicting, such as
using compression for optimizing data transfer versus avoiding cloud processing decrypting messages. These
recommendations need to be balanced and evaluated against other pillars.
Optimize the data format used to send data to the cloud. Consider both performance (and cost) of
Categorize and prioritize your dataCategorize and prioritize your data
Transport layer
Optimize resource usageOptimize resource usage
bandwidth used and performance effect of any cloud processing of data that is needed.
Consider using message enrichment in IoT Hub to add context to device messages.
For time critical event processing on the ingested data as it arrives, instead of storing unprocessed data
and requiring complex queries to acquire data. For time-critical event processing consider late arrival,
windowing, and their impacts. This depends on the use case, for example critical alarm handling versus
message enrichment.
Select the right IoT Hub tier (Basic or Standard) based on solution feature requirements. Be aware of the
features that aren't supported in the Basic tier.
Select the right IoT Hub tier size (1, 2 or 3) and the number of instances based on data throughput,
quotas and operation throttles. Select the right IoT Central tier (Standard 0, Standard 1, Standard 2) based
on the number of messages that are sent from a device to the cloud.
Consider using Event Grid for publish-subscribe event routing, see Azure IoT Hub and Event Grid and
[Compare Event Grid, routing for IoT Hub]/azure/iot-hub/iot-hub-event-grid-routing-comparison).
Often some of the data sent from devices to the cloud is more important than other data. Classifying the data
based on priority and handling the categories in different ways are often a good practice for performance
efficiency.
One good example is a thermometer sensor sending temperature, humidity, and other telemetry, but also
having a feature of sending an alarm when temperature is outside of a defined range. The alarm message would
be classified as more important than the temperature values.
Consider the following recommendations for data categories:
Use IoT Edge priority queues to make sure important data is prioritized while sending to IoT Hub. This
doesn’t impact the priority on the receiving side of IoT Hub.
Use IoT Hub message routing to separate routes for different data priorities dependent on the use case.
IoT Hub message routing will add some latency.
Consider saving and sending low priority data at longer intervals, or even using batch/file uploads.
Malware detection on uploaded files will increase latency.
When you use IoT Edge, messages will be buffered when there's no connection, but after the connection
is restored, all buffered messages will be sent in order and by priority first followed by new messages.
Consider separating messages based on timing constraints. For example, send messages to IoT Hub
directly when there's a time constraint, and utilize file upload via IoT Hub or batch data transfer (for
example, Data Factory) if there's no time constraint. When you use IoT Edge, the blob module can be used
for the file upload.
The transport layer handles connections between a device and the IoT solution, transforming IoT messages to
network packages, and sending them over the physical network.
Common protocols used in IoT solutions are:
AMQP – Advanced Message Queuing Protocol
MQTT – Message Queue Telemetry Transport
Optimize data communicationOptimize data communication
Device management and modeling layer
Device provisioningDevice provisioning
The connection between a device and the cloud needs to be secure, reliable and scalable to handle the targeted
number of devices and messages.
Use an open stateful connection from a device to the cloud gateway (for example using MQTT or AMQP).
IoT Hub is optimized for managing millions of open stateful connections (using MQTT, AMQP or
WebSocket protocols). If you keep open connections to the devices, the overhead of security handshakes,
authentication and authorization is minimized, which will improve performance and reduce the required
bandwidth considerably.
Consider using a protocol to connect to the cloud gateway (AMQP) that supports multiplexing of multiple
channels on a single connection to minimize the number of open connections required by the cloud
gateway. By using multiplexing, a transparent gateway can connect multiple leaf devices using their own
channel over a single connection.
Use the cloud gateway patterns of Device and Module twins to asynchronously exchange state
information between the device and the cloud.
Consider configuring DPS to move the device state when a device is connecting to another cloud
gateway.
The number and size of messages that are sent from device to cloud have an effect on performance and cost.
Evaluating data communication is key to performance efficiency in your IoT Workload.
Use an efficient data format and encoding used to send data to the cloud, ensure that no extensive device
to cloud bandwidth is used.
Optimize the data format used to send data to the cloud especially for low bandwidth networks. Consider
using a compressed or binary format to send data to the cloud using low bandwidth networks. Consider
the overhead of uncompressing or converting in the cloud after the cloud gateway.
Consider storing high volume data sent from device to cloud locally and uploading hourly/daily.
Group many smaller devices to cloud messages into fewer larger messages to reduce the total number of
messages. However, don't only send large messages but balance between average message size and
throughput.
Different types of devices, using different protocols, connectivity, data ingestion frequencies and communication
patterns can be connected to an IoT solution and an IoT solution can connect to many devices and gateways at
the same time. The IoT solution must be capable of managing which devices and gateways are connected and
how they're configured.
Besides the physical connection and configuration of the devices and gateways, the data captured and ingested
by the devices and gateways must be understood by the IoT solution and therefore needs to be transferred and
contextualized.
Use DPS to set up a connection to the IoT hub during provisioning, when the connection isn't available
anymore or during reboot of the device.
Consider using DPS to allocate devices to IoT Hubs in different regions based on the latency.
Use the evenly weighted distribution policy of DPS to adjust the weight for provisioning, based on the
use cases of the IoT solution.
Consider provisioning devices not all at once but over a period of time (distributed or in smaller batches)
Downstream devicesDownstream devices
Optimize sizing based on the device and message loadOptimize sizing based on the device and message load
Storage layer
Time series dataTime series data
Storage on devices and gatewaysStorage on devices and gateways
to the IoT solution to balance the load and quota of DPS. Account for maximum connections and device
registrations, including latency and retries.
When onboarding in batches, plan for the batches and overall migration timeline.
Consider using a caching strategy for connection string returned by DPS to reduce Device Provisioning
Service operations for reconnects.
Account for the limits of DPS in number of operations and registrations per minute.
Use multiple gateways devices in Translation mode when the number of downstream devices, their
messages and message size will change/increase over time, and their protocol or message must be
translated. With the ability to have multiple gateways/IoT Edge devices per site/location and downstream
devices that can connect to any of these gateways/IoT Edge devices, the solution will be horizontally
scalable. Gateways/IoT Edge devices in Translation mode will be able to translate protocols or messages
to and from downstream devices, however a mapping is needed to find the gateway a downstream
device is connected to.
Use multiple gateways/IoT Edge devices in Transparent mode when connecting downstream
MQTT/AMQP devices and their number can change/increase over time per site/location. With the ability
to have multiple gateways/IoT Edge devices per site/location and downstream devices that can connect to
any of these gateways/IoT Edge devices, the solution will be horizontally scalable. Gateways/IoT Edge
devices in Transparent mode will be able to connect MQTT/AMQP devices for bidirectional
communication.
Account for additional message translation and buffering overhead at the gateway/IoT Edge device when
using Translation mode.
Account for additional message buffering, storage and configuration overhead at the gateway/IoT Edge
device when using Transparent mode.
Account for the number of messages the cloud gateway can handle, dependent on the tier and number of
units used. Take into account anomalies in sustained throughput due to data distribution, seasonality and
bursting.
Consider using multiple cloud gateways (IoT Hubs) when millions of devices need to be managed using
the IoT solution. Use DPS to get devices assigned to one of the IoT Hubs in the IoT solution.
An IoT solution often requires multiple storage types residing from devices and gateways and into the cloud.
The different types of data collected and referenced in the IoT solution often require different storage types,
specialized, and optimized for different scenarios.
An IoT solution that is available in multiple geographical regions might also include complexity based on data
that needs to be available globally and/or locally and in some cases replicated for optimizing latency.
Use a Time Series Database for storing time series data, timestamp, value(s).
Consider enriching telemetry time series data with columns used for filtering when storing in a Time
Series Database, for example CustomerID, RoomID or any use-case specific column.
Use storage on devices and gateways for caching data.
Event processing and analytics layer
Optimize on edge versus cloud processingOptimize on edge versus cloud processing
Categorize individual workloadsCategorize individual workloads
Optimize high volume cloud data handlingOptimize high volume cloud data handling
Use storage on devices and gateways to make sure data is kept when disconnected. Account for required
storage space. Consider keeping not all data locally but use down sampling, store only aggregates, or for
a limited period of time.
Data generated by devices can be processed before sending it to the IoT solution or within the IoT solution. The
data processing can contain translation, contextualization, filtering and routing, or more advanced analytics like
trend analysis or anomaly detections.
Consider running local workloads on the device/edge using local compute. For example, running a
machine learning algorithm to count people in a video stream can run on the edge and an event
containing the count is sent to the cloud.
Consider running large workloads, or workloads that have additional or external data or compute
dependencies, in the cloud to use scale. For example, comparing actual trends between different factories.
Consider running analytics workloads on the edge using the Steam Analytics Edge module. For example,
anomaly detection can run on the edge and label the events sent to the cloud with the detected anomaly.
When running analytics on the edge, account for extra latency, late arrival, and windowing impact.
Consider running real-time and near real-time workloads on the device or at the edge.
Consider running small, optimized, low-latency processing with time constraints on a device.
Be aware of the overhead of an edge with many connected downstream devices. The edge node has to
forward or process all messages and handle caching all the data in case of lost and restored connectivity
to the cloud. Validate the performance impact on your solution by testing with the planned maximum of
downstream devices and messages per edge node.
Be aware of the performance impact that message translation or enrichment can have for the edge, IoT
Hub or the cloud event processing.
Consider separating workloads by time constraint and required latency and response times, for example
response within seconds versus and batch oriented per hour. Hybrid hardware SoCs can support this on
device level.
On the edge use priority queues to separate different streams of data with different priorities and time to
live. For example, alarms should always be sent first but have a lower time to live compared to telemetry.
In the cloud, use different consumer groups on Event Hubs to separate out different streams of data, you
can handle and scale alarms differently from telemetry.
In the cloud use IoT Hub routes to separate out different streams of data, with filtering and separate
endpoints. IoT Hub message routing will add some latency.
Use the right messaging construct (Event Hubs, Event Grids or Service Bus) to distribute workloads while
protecting against back pressure in the cloud.
Overly complex IoT Hub routing rules can have an impact on the throughput. This is especially true for
routing rules with message body JSON filters where every message needs to be deserialized and
scanned.
Consider using out-of-the-box service integration between IoT Hub and data destinations (like Azure Data
Lake Storage and Azure Data Explorer) as these have already been optimized for high performance
Integration layer
Loose couplingLoose coupling
ReportingReporting
DevOps
Optimize deployments and configurationOptimize deployments and configuration
MonitoringMonitoring
Development and deploymentDevelopment and deployment
throughput.
Consider using the Event Hubs SDK to develop custom ingestion from an IoT Hub and the included event
processor. The event processor can rebalance devices and hosts.
Consider separating the storage needed for data ingestion and event processing from the storage needed
for reporting and integration.
Account for the right number of IoT Hub partitions and consumer groups regarding the number of
simultaneous data readers and the required throughput.
Consider using the data storage that fits the need based on the required throughput, size, retention
period, data volume, CRUD requirements, and regional replication (ADLS, ADX, SQL, or Azure Cosmos DB;
see also Select an Azure data store for your application).
The integration layer is the connection between your IoT solution and the business applications utilizing the IoT
data.
Separate the IoT solution/ingestion pipeline from processing/integration.
Use well defined and versioned APIs for access to IoT data and commands.
Make sure complex queries or load from integration layer doesn't affect the data ingestion from a
performance perspective.
Consider a separate data store for integration.
Avoid tools for end users to create user-defined queries against IoT data storage.
Consider a separate data store for reporting.
Considered the use of a connected registry for local caching and deployment of container images.
Use the deployment mechanisms provided by IoT Hub to update deployments to multiple devices at
once, including devices and gateways.
Use device twins and module twins to update configurations of devices using IoT Hub in a scalable and
efficient way.
IoT Hub metrics collected using Azure Monitor with alerts for critical metrics. See Monitoring Azure IoT
Hub data reference.
Azure Service Health service alerts are set to trigger notifications when status of IoT Hub changes.
Consider setting up Azure Monitor alerts based on current scale limits such as device to cloud messages
sent per second. It's also good practice to set the alert to a percentage of the limit, such as 75%, to allow
for prenotification of upcoming scalability limits.
Consider setting up Azure Monitor alerts for metrics and logs such as "Number of throttling errors"
Next steps
Plan for and execute performance testing, including stress and load tests to replicate the production
environment, such as location, heterogenous devices, and so on.
Reliability in your IoT workload
SAP workload best practices
12/16/2022 2 minutes to read • Edit Online
SAP landing zone prerequisite
Guidance overview
WE L L -A RC H IT EC T ED F RA M E WO RK P IL L A RW EL L - A RC H IT E C T E D F RA M EW O RK P IL L A R SUM M A RYSUM M A RY
Reliability An SAP workload requires resiliency at the architecture layer.
You’ll learn how to create an SAP application with high
availability to process critical business data.
SAP is one of the world’s leading producers of software solutions for business management and customer
operations. SAP provides a suite of powerful applications that you can configure to meet specific environment
and organizational needs. These applications facilitate data processing and information flow across
organizations and provide critical capabilities that drive key organizational functions.
SAP applications thrive on infrastructure that is tailored to maximize their capabilities and with cloud services
designed to optimize SAP workload functionality. Microsoft has hosted SAP instances in its data centers for
decades and has custom built infrastructure and services to run SAP and its applications as they were designed.
We’ve used this SAP experience to create a concise set of best practices for evaluating, designing, and
optimizing an SAP workload from premigration to operations. The goal of this guidance is to help you get the
most benefits from each SAP workload.
Figure 1: Using SAP workload guidance throughout the lifecycle of an SAP workload.
The guidance is designed for one SAP workload at a time. Anytime you want to optimize a specific SAP
application (S4/HANA, NetWeaver, etc.) for Azure or in Azure, you should use this guidance. Work through the
guidance as many times as you need to derive the benefits you expect. In operations, this guidance should be
paired with the Well-Architected Review assessments and health checks. We address these tools in our
guidance.
Before using this content, you should have an SAP platform landing zone in Azure. The platform landing zone
provides shared services to one or more of your SAP workloads. If you don’t have a platform landing zone, you
should use the SAP cloud adoption framework and deploy the SAP landing zone accelerator. The landing zone
accelerator establishes the required foundation for your SAP workload. For more information, see SAP cloud
adoption framework.
We built this guidance around the Azure Well-Architected Framework and its five pillars of architectural
excellence. The table below lists each pillar and provides a general summary of the articles in this set.
Security An SAP workload contains critical business data. You’ll learn
to secure your SAP applications with multiple security layers,
including identity, access, input validation, data sovereignty,
and encryption.
Cost Optimization An SAP workload has several architecture layers and
numerous resources supporting it. You’ll learn how to make
sure your SAP application deployment meets performance
expectations while reducing the total cost of ownership.
Performance Efficiency An SAP workload needs to be high performing resources to
meet productivity requirements. You’ll learn how to make
sure that your SAP workload meets user demands while
managing costs.
Operational Excellence An SAP workload spends most of its lifecycle in operations.
You’ll learn how to manage an SAP workload and to keep it
running.
WE L L -A RC H IT EC T ED F RA M E WO RK P IL L A RW EL L - A RC H IT E C T E D F RA M EW O RK P IL L A R SUM M A RYSUM M A RY
Next Steps
We invite you to explore SAP workload design best practices and return to this content regularly throughout the
lifecycle of your SAP workload. The content highlights critical areas of focus but also refers you to other
documentation for deeper technical insight.
For more information, see:
Azure Center for SAP Solutions
SAP workload in Azure
SAP workload architectures
Reliability
Security
Cost Optimization
Operational Excellence
Performance Efficiency
SAP workload reliability
12/16/2022 9 minutes to read • Edit Online
Conduct a reliability assessment
Create architecture reliability
Create SAP central services reliability
Create SAPMNT share reliability
A reliable SAP workload is both resilient and available. Resiliency is the ability to recover from failures and
continue to function. Availability is uptime. High availability reduces SAP application downtime during critical
maintenance and improves recovery from failures such as VM crashes, backend updates, major downtime, or
ransomware incidents. Failures happen on-premises and in the cloud, so it’s important to design your SAP
workload for resiliency and availability. Below we’ve outlined key reliability recommendations.
Before you can standardize the reliability of an SAP workload and improve areas of weakness, you need to
assess its reliability. It’s critical to know how reliable an SAP workload is so steps can be taken to fix issues or
solidify those configurations. We recommend conducting a reliability assessment on your SAP workload. The
assessment asks you questions about your workload and provides specific recommendations to focus on. The
assessment builds on itself, so you can track your progress without restarting every time. For the assessment,
start an Azure Well-Architected Review and select “SAP on Azure” when prompted.
Creating a multi-tier architecture to support an SAP workload is essential for reliability. The number of tiers and
architecture varies for each SAP application. Make sure to isolate application components from each other and
create redundancy to achieve high availability. Where applicable, you should isolate the SAP Web Dispatcher,
SAP Central Services, SAP App Server, SAPMNT Share and database. We have sample architectures for several
different SAP applications you can use to inform your design. For more information, see:
SAP S/4HANA in Linux
SAP BW/4HANA
SAP NetWeaver
SAP central services (SCS) or ABAP SAP central services (ASCS) is the basis of SAP application communication. It
consists of the message server and enqueue server. The central services layer is often a single point of failure
and must be set up for high availability to achieve SAP application resiliency. To add redundancy, create a cluster
of SAP central services with compatible shared storage technology supporting the cluster. Depending on the
operating system and available shared storage technology in general availability or private/public preview,
various options are available. Availability zones provide an opportunity to create a highly available ASCS
architecture. For more information, see:
SAP workload configurations with Azure Availability Zones
High-availability architecture for an SAP ASCS/SCS instance on Windows
High-availability architecture for an SAP ASCS/SCS instance on Linux
SAPMNT hosts the physical kernel files for SAP application and can be a single point of failure. Several options
are available on Azure to created redundancy and architect a highly available SAPMNT share. We recommend
using Azure Premium Files or Azure NetApp Files for Linux and Azure Premium Files. For Windows-based
deployments, you should use Azure NetApp Files or Azure Shared Disk.
OSOS GUIDA N C EGUIDA N C E
Windows Cluster an SAP ASCS/SCS instance on a Windows failover
cluster by using a cluster shared disk in Azure
Cluster an SAP ASCS/SCS instance on a Windows failover
cluster by using a file share in Azure
High availability for SAP NetWeaver on Azure VMs on
Windows with Azure Files Premium SMB for SAP applications
High availability for SAP NetWeaver on Azure VMs on
Windows with Azure NetApp Files(SMB) for SAP applications
Red Hat Enterprise Linux (RHEL) High availability for SAP NetWeaver on Azure VMs on Red
Hat Enterprise Linux with NFS on Azure Files
Azure Virtual Machines high availability for SAP NetWeaver
on Red Hat Enterprise Linux with Azure NetApp Files for SAP
applications
SUSE Linux Enterprise Server (SLES) High-availability SAP NetWeaver with simple mount and NFS
on SLES for SAP Applications VMs
High availability for SAP NetWeaver on Azure VMs on SUSE
Linux Enterprise Server with NFS on Azure Files
Create database resiliency
There are also a few application specific configurations you should address for SAPMNT reliability. You need
shared directories in the environment ( /sapmnt/SID and /usr/sap/trans ) to deploy the SAP NetWeaver
application layer. We recommend creating highly available file systems and ensuring they're resilient. The
/sapmnt/SID and /usr/sap/SID/ASCS directories are important. You should place these file systems on NFS on
Azure Files to achieve the maximum reliability. For more general file share information see, NFS on Azure Files.
The table below provides SAPMNT guidance specific to each operating system.
An SAP application feeds data to multiple enterprise systems, making database resiliency a key workload
consideration. We recommend replicating production data for the highest resiliency. Cross-region replication is
the preferred disaster recovery solution. But for a more affordable option, you should configure zone
redundancy at a minimum. The methods you choose depends on the database management system (DBMS)
and required business service-level agreement (SLA). Below are recommendations for the database layer.
(1) Define RPO and RTO(1) Define RPO and RTO - Creating database resiliency requires a plan to recover data loss. A logical error on
the SAP database, a large-scale disaster, or a system outage can cause data loss in an SAP workload. Your
recovery plan should identify how much data you’re willing to lose and how fast you need to recover. The
amount of data loss you’re willing to lose is your recovery point objective (RPO). How fast you need to recover
is your recovery time objective (RTO). When you design for recoverability, you need to understand the desired
and actual RPO and RTO of your SAP application.
(2) Use synchronous replication for no data loss(2) Use synchronous replication for no data loss - In some scenarios, there’s no tolerance for data loss.
The recovery point objective is 0. To achieve this RPO, you need use synchronous replication on the database
layer. Synchronous replication commits database transactions to database instances in two separate zones or
regions. You should measure the latency between the two instances to ensure it meets workload needs, and you
can do it with the SAP niping measuring tool. Higher network latency will slow down the scalability of your
workload, and physical distance between the instances adds network latency. As a result, replication across
regions will have higher latency than across availability zones. Database replication between different regions
Create application server resiliency
Use backups
should be asynchronous and replication between availability zones should be synchronous. Its important to
balance resiliency and latency in SAP workload design. For more information, see:
General Azure Virtual Machines DBMS deployment for SAP workload
High-availability architecture and scenarios for SAP NetWeaver
Network latency between and within zones
The goal of application server resiliency is to have multiple application servers load balance traffic and failover
when needed. Resiliency for the SAP application server layer can be achieved through redundancy. You can
configure multiple dialog instances on different instances of Azure virtual machines with a minimum of two
application servers. Here are application server resiliency recommendations.
(1) Use Availability Sets / Availability Zones(1) Use Availability Sets / Availability Zones - An SAP application server can be deployed in an availability
set or across availability zones. The decision you make needs to be based on workload requirements. We
recommend you choose one method to improve resiliency, but we don’t recommend scale sets. For more
information, see availability zones for SAP.
(2) Use multiple application ser vers(2) Use multiple application ser vers - Using multiple smaller application servers instead of one larger
application server is recommended. This setup avoids a single point of failure. It’s a best practice to configure
SAP Logon Group (SMLG) and Batch Server Group (RZ12) for better load balancing between end-user & batch
processing. For more information, see:
Azure Virtual Machines high availability for SAP NetWeaver
SAP HANA high availability for Azure virtual machines
SAP workload configurations with Azure Availability Zones
The SAP workload should implement a regular backup solution. Backups are the backbone of disaster recovery
and help ensure continuity of operations. We have a few recommendations for backup reliability.
(1) Star t with Azure Backup(1) Star t with Azure Backup - We recommend you use Azure Backup as the foundational backup strategy for
an SAP workload. Azure Backup is the native backup solution in Azure, and it provides multiple capabilities to
help streamline your SAP backups. With Azure Backup, we want to point out a few features.
Native database backup compatibility
- Azure Backup provides native backups through the Backint connector for
SAP HANA, SQL Server, and Oracle databases used by SAP Applications. Azure backup for SAP offers an API
called Backint. Backint allows backup solutions to create backups directly on the database layer. Azure backup
also supports the database backup capability for HANA & SQL Server databases today.
Storage backup
- The storage backup feature can help optimize the backup strategy by using disk snapshots of
Azure Premium storage for selective disks. For more information on application-consistent backups, see
snapshot consistency.
Virtual Machine backup
- Back up and restore Azure VM data through the Azure portal. Cross-region restoration
lets you restore Azure VMs that were to a paired secondary region.
Long-term retention
- Azure Backup allows you to retain SAP backups years for compliance and audit needs.
Backup Management
- Azure Backup enables you to manage backups from the Azure portal with an easy user
interface.
For more information, see:
Azure Backup documentation
Next step
SAP HANA backup overview
Backup guide for SAP HANA on Azure Virtual Machines
Backup guide for SQL Server on Azure Virtual Machines
(2) Find marketplace solutions(2) Find marketplace solutions - Several certified third-party backup solutions exist in the Azure
Marketplace. These solutions offer vendor backup capabilities and SAP-certified backup capabilities. You should
consider layering these solutions on top of Azure Backup to generate custom solutions with foundational
support.
Microsoft partners provide solutions that are integrated with Azure Storage for archive, backup, and for
business continuity and disaster recovery (BCDR) workloads. The partner solutions take advantage of the scale
and cost benefits of Azure Storage. You can use the solutions to help solve backup challenges, create a disaster
recovery site, or archive unused content for long-term retention. They can replace tape-based backups and offer
an on-demand economic recovery site with all the compliance standards and storage features such as
immutable storage and lifecycle management.
(3) Use snapshots(3) Use snapshots - A snapshot is a point-in-time, copy of your data. The speed and reliability of snapshots
can help manage large databases and protect the primary database against corruption or failure. These features
make snapshots critical for disaster recovery. We have a few options to create and store backups for your SAP
workload.
Azure Backup can take database backups for HANA and SQL Server, for example. The Backup vault feature of
Azure Shared Disk can serve as your database storage solution. Azure NetApp Files (ANF) can also back up
critical data by using snapshots, such as ANF volumes Snapshot. ANF Cross Region Replication uses ANF
snapshots to replicate data from one region to another.
The right solution depends on your desired cost and availability levels. In some scenarios, you might want to
replicate your SAP on Azure data to other Azure regions for disaster recovery. However, you can achieve the
same capabilities with Azure Storage replication, such as Geo-redundant storage (GRS) or Azure Site Recovery.
For more information, see:
SAP workload configurations with Azure Availability Zones
SAP NetWeaver disaster recovery
Azure Site Recovery for SAP workloads
Azure Storage redundancy
Back up SAP HANA databases' instance snapshots in Azure VMs
(4) Implement Disaster Recover y(4) Implement Disaster Recover y - We recommend you invest in Disaster Recovery (DR) to improve the
reliability of the SAP workload. Disaster recovery is achieved by replicating primary data to a secondary
location. Several tools & methodology can be used to the achieve goal. Disaster Recovery is required when the
primary location isn't accessible due to technical or natural disaster. Disaster Recovery solutions can be across
zones within region or across regions based on your business requirements, but we recommended DR across
region for better resiliency. For more information, see:
Azure Site Recovery
Cross-region replication of Azure NetApp Files volumes
Cross-region snapshot copy for Azure Disk Storage
Backup and Disaster Recovery
Overview
Security
Cost Optimization
Operational Excellence
Performance Efficiency
SAP workload security
12/16/2022 7 minutes to read • Edit Online
Use identity management
Azure provides all the tools needed to secure your SAP workload. SAP applications can contain sensitive data
about your organization. You must protect your SAP architecture with secure authentication methods, hardened
networking, and encryption.
SAP on Azure is delivered in the infrastructure as a service (IaaS) cloud model. Microsoft builds security
protections into the service at the levels of the physical data center, physical network, physical host, and
hypervisor. But you're responsible for areas above the hypervisor, such as the guest operating system for SAP.
We recommend you regularly evaluate the services and technologies used to ensure your security posture
evolves with the threat landscape. Below are security recommendations for consideration.
Identity management is a framework to enforce the policies that control access to critical resources. Identity
management controls access your SAP workload within or outside its virtual network. There are three identity
management use cases to consider for your SAP workload, and the identity management solution differs for
each.
(1) Operating system(1) Operating system - Organizations can improve the security of Windows and Linux virtual machines in
Azure by integrating with Azure Active Directory (Azure AD). Azure AD is a fully managed identity and access
management service. Azure AD can authenticate and authorize end user’s access to the SAP operating system.
You can use Azure AD to create domains that exist on Azure, or use it integrate with your on-premises Active
Directory identities. Azure AD also integrates with Microsoft 365, Dynamics CRM Online, and many Software-
as-a-Service (SaaS) applications from partners. We recommend using System for Cross-Domain Identity
Management (SCIM) for identity propagation. This pattern enables optimal user life cycle. For more information,
see:
SCIM synchronization with Azure Active Directory
Configure SAP Cloud Platform Identity Authentication for automatic user provisioning
Azure Active Directory Single sign-on (SSO) integration with SAP NetWeaver
Sign in to a Linux virtual machine in Azure by using Azure AD and OpenSSH
Sign in to a Windows virtual machine in Azure by using Azure AD
(2) SAP application(2) SAP application – You can access the SAP application with the SAP frontend software (SAP GUI) or a
browser with HTTP/S. We recommend configuring single sign-on (SSO) using Azure Active Directory or Active
Directory Federation Services (AD FS). SSO allows end users to connect to SAP applications via browser where
possible. For more information, see:
SAP HANA SSO
SAP NetWeaver SSO
SAP Fiori SSO
SAP Cloud Platform SSO
SuccessFactors SSO
Azure Active Directory overview
(3) SAP PaaS and SaaS application(3) SAP PaaS and SaaS application - We recommend consulting the SAP Identity Authentication Service for
SAP Analytics Cloud, SuccessFactors, and SAP Business Technology Platform. You can also integrate services
from the SAP Business Technology Platform with Microsoft Graph using Azure AD and the SAP Identity
SA P SO L UT IO NSA P SOLUT ION SSO M ET H O DSSO M E T H O D
SAP NetWeaver based-web applications such as Fiori,
WebGui
Security Assertion Markup Language (SAML)
SAP GUI Kerberos with windows active directory or AAD-Domain
services or third party solution
SAP PaaS and SaaS applications such as SAP Business
Technology Platform (BTP), Analytics Cloud, Cloud Identity
Services, SuccessFactors, Cloud for Customer, Ariba
SAML / OAuth / JSON Web Tokens (JWT) and pre-configured
authentication flows with Azure AD directly or by proxy with
the SAP Identity Authentication Service
Use role-based access control (RBAC)
Enforce network and application security
Authentication Service. For more information, see:
Using Azure Active Directory to secure access to SAP platforms and applications.
SAP Identity Authentication Service
SAP Identity Provisioning Service
A common customer scenario is deploying SAP application into Microsoft Teams. This solution requires SSO
with Azure AD. We recommend browsing the Microsoft commercial marketplace to see which SAP apps are
available in Microsoft Teams. For more information, see:
Microsoft AppSource marketplace.
The table below provides a summary of the recommended SSO method for the given SAP solution.
Its important to control access to the SAP workload resources you deploy. Every Azure subscription has a trust
relationship with an Azure AD tenant. We recommend you use Azure role-based access control (Azure RBAC) to
grant users within your organization access the SAP application. Grant access by assigning Azure roles to users
or groups at a certain scope. The scope can be a subscription, a resource group, or a single resource. The scope
depends on the user and how you’ve grouped your SAP workload resources. For more information, see Azure
AD trust relationship and Azure RBAC.
Network and application security controls are baseline security measures for every SAP workload. Their
importance bears repeating to enforce the idea that the SAP network and application requires rigorous security
review and baseline controls.
(1) Azure Network Design(1) Azure Network Design - It’s critical to differentiate between shared services and SAP application services.
A hub-spoke architecture is a good approach to security. You should keep workload specific resources in its own
virtual network separate from the shared services in hub such as management services and DNS.
For SAP-native setups, you should use SAP Cloud Connector and SAP Private Link for Azure as part of the hub-
spoke setup. These technologies support the SAP extension and innovation architecture for the SAP Business
Technology Platform (BTP). Azure native integrations fully integrated with Azure virtual networks and APIs and
don’t require these components.
(2) Vir tual network security(2) Vir tual network security - Network security groups (NSGs) allow you to filter network traffic to and from
your SAP workload. You can define NSG rules to allow or deny access to your SAP application. You can allow
access to the SAP application ports from on-premises IP addresses ranges and denying public internet access.
For more information, see network security groups
(3) Application security(3) Application security - In general, the security best practices for application development also apply in the
Encrypt data
cloud. These include things like protecting against cross-site request forgery, thwarting cross-site scripting (XSS)
attacks, and preventing SQL injection attacks.
Application security groups (ASGs) make it easier to configure the network security of a workload. The ASG can
be used in security rules instead of explicit IPs for VMs. The VMs are then assigned to ASG. This configuration
supports the reuse of the same policy over different application landscapes, because of this abstraction layer.
Cloud applications often use managed services that have access keys. Never check access keys into source
control. Instead, store application secrets in Azure Key Vault. For more information, see application security
groups.
(4) Internet facing SAP workload(4) Internet facing SAP workload – An internet facing workload must be protected using services like Azure
Firewall, Web Application Firewall, Application Gateway to create separation between endpoints. For more
information, see inbound and outbound internet connections for SAP on Azure.
Azure includes tools to safeguard data according to your organization's security and compliance needs. It's
essential that you encrypt SAP workload data at rest and in transit.
(1) Encr ypt data at rest(1) Encr ypt data at rest – Encrypting data at rest is a common security requirement. Azure Storage service-
side encryption is enabled by default for all managed disks, snapshots, and images. Service-side encryption
uses service-managed keys by default, and these keys are transparent to the application.
We recommend you review and understand service/server-side encryption (SSE) with customer-managed keys
(CMKs). The combination of server-side encryption and a customer-managed key allows you to encrypt data at
rest in the operating system (OS) and data disks for available SAP OS combinations. Azure Disk Encryption
doesn’t support all SAP operating systems. The customer-managed key should be stored in Key Vault to help
ensure the integrity of the operating system. We also recommend encrypting your SAP databases. Azure Key
Vault supports database encryption for SQL Server from the database management system (DBMS) and other
storage needs. Below is an encryption workflow to help you visualize the encryption process.
When you use client-side encryption, you encrypt the data and upload the data as an encrypted blob. Key
management is done by the customer.
For more information, see:
Server-side encryption for managed disks
Collect and analyze SAP application logs
Next Step
Azure Storage service-side encryption
Service-side encryption using customer-managed key in Azure Key Vault
Client-side encryption
(2) Encr ypt data in transit(2) Encr ypt data in transit – Encryption in transit applies to the state of data moving from one location to
another. Data in transit can be encrypted in several ways, depending on the nature of the connection. For more
information, see encryption of data in transit.
Application log monitoring is essential for detecting security threats at the application level. We recommend
using the Microsoft Sentinel Solution for SAP. It’s a cloud-native security information and event management
(SIEM) solution built for your SAP workload running on a VM. For more information, see Microsoft Sentinel
Solution for SAP.
For general security information, see:
Azure security documentation
Trusted Cloud eBook
Overview
Reliability
Cost Optimization
Operational Excellence
Performance Efficiency
SAP workload cost optimization
12/16/2022 6 minutes to read • Edit Online
Optimize workload compute costs
Microsoft makes significant investments in the fast evolution of its hardware to provide more value for less. The
frequent increase in Azure hardware capability provides regular opportunity for an SAP workload to optimize
costs, eliminate waste, and improve technology. To align Azure and your SAP workload, we recommend creating
a plan for each SAP workload. The plan should contain the objectives and motivations for the workload.
Organizational objectives and investment priorities should drive cost optimization initiatives. Below are cost-
optimization recommendations for your SAP workload.
Compute cost optimization is achieved through planning, monitoring, and resizing VMs throughout the SAP
workload lifecycle. VMs provide the compute power for the SAP application and have a direct effect on cost and
performance. We recommend monitoring the compute costs of an SAP workload to ensure the dollars spent are
helping you meet organizational goals. Here are cost-optimization recommendations for SAP workload
compute.
(1) Choose the right VM type(1) Choose the right VM type - Azure has SAP-certified VMs for your workload. The wrong VM type will
require larger sizes to get the performance need, increasing cost without benefit. A smaller VM of the correct
type can give you equal or better performance than a large instance of the wrong type. Azure offers a list of
SAP-certified configurations to help you understand what VMs work well with your business needs.
For more information, see SAP certified infrastructure.
Memory-optimized VMs can meet the requirements of most SAP applications. An SAP online analytical
processing (OLAP) workload and an online transactional processing (OLTP) workload should use M-series VMs.
Examples of an OLTP workload include SAP HANA, SAP Business Suite on HANA, and SAP S/4HANA. Examples
of OLAP workloads include SAP BW on HANA BW/4HANA. SAP Business One on HANA pairs with M-series
VMs.
SAP NetWeaver, Business All-in-One, Business Suite Software, and BusinessObjects BI have broader alignment
with different VM types. They can run on VMs in the D, G, E, and M-series.
(2) Optimize compute cost throughout migration(2) Optimize compute cost throughout migration - Many SAP journeys start on-premises, so it’s
important to plan for compute optimization throughout the migration of a workload. Make sure to follow best
practices across every SAP migration. You can find process guidance in our [CAF SAP documentation] (
/azure/cloud-adoption-framework/scenarios/sap/). With the broader process guidance in place, you’ll still need
to customize your compute optimization for each SAP workload. We want you to consider pre-migration and
post-migration milestones.
Optimize pre-migration
– Pre-migration optimization ensures you have sufficient cloud resources provisioned
to support the expected migration runtime of the SAP workload. You need to verify that the Azure VM meets the
technical requirements of the on-premises workload. Planning will shorten the migration time of a workload
and minimize the time of migration will keep costs lower.
Optimize post-migration
– Post-migration optimization focuses on the end-user experience. This step coincides
with the hypercare period, a time of elevated customer service to make sure that the workload is performing.
You should monitor the workload as users begin to interact with it. The performance metrics might indicate that
you need to downsize the VM or switch to a different VM type.
(3) Optimize in operations(3) Optimize in operations - Its important to optimize VMs in operations for the most cost-savings. By VM
Optimize workload storage
operations, we're referring to the daily management of an SAP workload. This phase of a workload brings the
ability to predict the compute needs. It’s important to see how user demand affects compute needs over time.
The VM choice should change along with the SAP workload’s requirements. Here are cost-saving
recommendations for operations.
Use Azure Advisor
– We recommend you use Azure Advisor to identify VM usage that needs optimization. For
more information, see Azure Advisor cost optimization.
Enforce VM governance
– You should enforce VM governance at VM creation as a cost and security best
practice. Every VM deployment increases the operational cost and attack surface of the SAP workload. VMs
created outside of the governance process tend to create unneeded expense and have more vulnerabilities. We
recommend using Azure Policy to enforce VM governance for your SAP workload. For more information, see
Azure Policy for SAP.
Stop a non-critical SAP workload
– Each SAP workload has different levels of criticality. Some workloads aren’t
needed on nights and weekends. A sandbox environment is a good example of low criticality. We recommend
stopping VMs that support non-critical workload environment to reduce costs. You can automate this process
for the SAP workload in Azure. For more information, see automate VM start and stop.
Use Reserved Instances
– Any SAP workload that needs to run continuously should use reserved instances to
optimize cost. For budget predictability, you can make an advanced purchase for one or three years in a
specified region. For more information, see Azure Reservations.
Use the Azure Hybrid Benefit
– Azure lets you use on-premises Software Assurance-enabled Windows Server
and SQL Server licenses. The benefit applies to Red Hat and SUSE Linux subscriptions. This benefit can generate
significant savings for a hybrid SAP workload. For more information, see hybrid licensing benefit.
For more information, see:
SAP on Azure
SAP NetWeaver
SAP HANA install
SAP HANA configuration
We recommend optimizing the storage cost for your SAP workload. Storage is an essential component of an
SAP workload. Storage contains active data and backup data that is critical to your organization. Storage affects
the performance, availability, and recoverability of an SAP workload. It's important to have the right
performance at the right cost. Here are recommendations to help you reach this goal.
(1) Use reser ved capacity storage type(1) Use reser ved capacity storage type - There are several storage options available to choose from based
on the workload requirement. Managed disks, blog storage, and backup storage can support an SAP workload
in various combinations. Each of these options comes with storage reservation options that lower overall costs
for persistent data. For more information, see:
Azure disk reserved capacity
Blob storage reserved capacity
Azure Backup Storage reserved capacity
(2) Use lifecycle management policies(2) Use lifecycle management policies - Other than reserved capacity, you need to ensure the data-
retention period is right for the SAP workload. An SAP database backup can be large and add to the storage cost
if not optimized. We recommend that you create a lifecycle policy that meets the recovery time objective (RTO)
and recovery point objective (RPO) of your SAP workload. The policy should move into Premium, Standard,
Cold, Archive storage based on its age and business requirements.
Optimize the SAP application
Next Step
Optimizing your SAP application can lower the total cost of ownership without reducing capabilities. The goal is
to generate the maximum return on investment (ROI). Here’s a few ways to optimize an SAP application.
(1) Identify application responsibility(1) Identify application responsibility - Optimizing an SAP Application should be the responsibility of the
customer business application team. Having someone or a group responsible for costs will help drive decisions
that optimize costs over the lifecycle of the SAP workload.
(2) Rationalize and rearchitect(2) Rationalize and rearchitect - You should consider rationalizing or rearchitecting the SAP application,
especially during migrations. S4 HANA often replaces older SAP applications that can be added as a legacy
system. The SAP WAF assessment can help validate rearchitecting efforts and should be conducted on a periodic
basis. For more information, see Azure Well-Architected Review.
(3) Minimize investment in legacy systems(3) Minimize investment in legacy systems - You should host a legacy SAP application on minimum-
supported architecture to help reduce cost. A legacy application is slower and less performant. Any legacy
systems that remain after rationalizing and rearchitecting should receive the minimum spend possible and be
retired when appropriate. For more information, see Azure Cost Management.
Overview
Reliability
Security
Operational Excellence
Performance Efficiency
SAP workload operational excellence
12/16/2022 4 minutes to read • Edit Online
Use health checks and assessments
Operational excellence is about creating efficient processes to support your SAP workload. Operations will be
the longest phase of the SAP workload lifecycle, and teams must be equipped with operational best practices to
manage the day-today tasks. Failure in operations will affect the other design areas and the overall success of
the SAP workload. Its critical to tailor your operations to support an SAP workload in operations. Below are
recommendations to drive operational excellence.
Standard operating procedures (SOPs) are documented processes for managing a workload. Each SAP workload
should have SOPs to govern operations. Without SOPs, teams drift from management best practices, so we
recommend a continuous cycle of assessment and health checks for your SAP workload.
(1) Health checks(1) Health checks - We have four Azure SAP (AzSAP) health checks: (1) deployment checklist, (2) inventory
checklist, (3) quality checks, and (4) Linux VM OS analyzer. The image below shows how they share a cycle with
our Azure SAP assessments. For more information on the health checks, see SAP quality checks.
Figure 1: The cycle of SAP health checks and assessments throughout th journey.
(2) Assessments(2) Assessments - We have three SAP assessments: (1) landing zone accelerator (LZA), (2) Azure SAP (AzSAP)
deployment management assessment, and (3) the AzSAP Well-architected framework assessment. These
assessments are designed for different stages in the SAP workload lifecycle.
The AzSAP Well-architected framework assessment is for operations. It compares your SAP operations against
SAP workload best practices. The assessment encourages continuous improvement by building on each
previous assessment.
Figure 2: How the Well-architected assessment creates milestones and builds on these milestones over time.
The initial assessment creates a baseline, and the next iteration of assessment uses the previous assessment as
the starting point. It will maintain the selections from the last assessment to track and review the design
principle. Because the assessment builds on itself, you can track improvements overtime. The assessment is
designed for an existing SAP workload in Azure and can assess one or more of the WAF pillars.
Monitor the workload
Automate workload infrastructure
BEN EF IT DO M A INB EN EF IT DO M A IN AUTO M AT E DEP LOY M EN T B EN EF IT SA UTO M AT E DE P LO Y M E NT B EN EF IT S
M A N UA L DEP LOY M ENTM A N UA L DEP LO Y M EN T
DISA DVA N TA GESDISA DVA N TA GES
Knowledge Works immediately after some initial
preparation time. Requires little
domain knowledge.
Requires specialized knowledge in
many domains outside of SAP.
Time Consumes predictable time from 30
minutes to a couple of hours.
Can take much more time depending
on the size of the SAP landscape,
depending on the size of the SAP
landscape.
Cost Makes automated deployments cheap
due to less time spent.
Expensive due to more time spent.
We recommend using this SAP assessment to develop and realign the SOPs for your SAP workload. The
assessment identifies areas of strength and weakness that allow you to build better SOPs. For more information,
see Azure Well-Architected Review.
Monitoring is the process of collecting, analyzing, and acting on data gathered from an SAP workload.
Monitoring provides insights of the health of the workload to compare with an expected baseline. It allows you
to know when, where, and why failures occur.
A monitoring best practice is to use a common and consistent logging schema that lets you correlate events
across systems. The monitoring and diagnostics process has several distinct phases:
Instrumentation
- Generating the raw data from application logs, web server logs, the diagnostics built into
the Azure platform, and other sources.
Collection and storage
- Consolidating the data into one place.
Analysis and diagnosis
- Troubleshooting issues and seeing the overall health.
Visualization and alerts
- Using data to spot trends or alert your operations team.
We recommend using Azure Monitor for SAP solutions to drive these processes. Azure Monitor for SAP is an
Azure-native monitoring product for SAP landscapes that run on Azure. Azure Monitor for SAP solutions uses
specific parts of the Azure Monitor infrastructure to provide insights into the monitoring of SAP Netweaver, SAP
HANA, SQL Server & Pacemaker High-Availability deployments on Azure. For more information, see Azure
Monitor for SAP Solutions.
You should use infrastructure as code (IaC) to automate SAP workload deployments with minimal human
intervention and build a scalable and consistent SAP workload on Azure. The manual process of creating the
required SAP workload resources is slow and allows for errors. Microsoft has a repository of SAP deployment
templates that you should use. It’s called the SAP on Azure Deployment Automate Framework. The templates
support SAP HANA and NetWeaver with any database on any SAP-supported operating systems. For more
information, see:
SAP deployment automation framework
SAP automate repository
Azure Monitor for SAP solutions
The table below outlines benefits of automated deployments with IaC.
Testing Provides templates that include test
instrumentation during deployment
and migration.
Allows for limited testing. Requires
more work to inject tests in the
process.
Scaling Allows you to easily scale up, down,
and out. Provides new deployment
templates.
Takes more time to scale and
customize the environment.
Standardization Applies your defined standards with
each deployment.
Sometimes leads to unwanted
variations in design.
BEN EF IT DO M A INB EN EF IT DO M A IN AUTO M AT E DEP LOY M EN T B EN EF IT SA UTO M AT E DE P LO Y M E NT B EN EF IT S
M A N UA L DEP LOY M ENTM A N UA L DEP LO Y M EN T
DISA DVA N TA GESDISA DVA N TA GES
Next Step
Overview
Reliability
Security
Cost Optimization
Performance Efficiency
SAP workload performance efficiency
12/16/2022 5 minutes to read • Edit Online
Optimize workload compute
Optimize workload storage
Performance efficiency is about accelerating digital transformation with less. The goal is to get the most out of
your SAP workload and meeting user demand without over or under provisioning resources. Inefficient
performance can degrade user experience or inflate costs. Performance affects productivity for internal
applications. It determines growth for public facing applications. Designing an SAP workload that can't meet
user demand will slow the application. Overcompensating with too much compute power will drive up cost
needlessly. These scenarios are avoidable with the right guidance. Below are recommendations to drive
performance efficiency for your SAP workload.
Compute is the core that powers an SAP application. Compute includes the hardware, number of cores, and
memory. These features are foundational to organizations. If you don’t optimize your compute configuration, an
SAP workload will be unable to meet spikes in user demand or stay withing predefined budgets. It’s important
to know the demands on your workload and match those demands with the compute you use for your SAP
workload. Here are some compute performance considerations.
(1) Conduct reference sizing for on-premises workload(1) Conduct reference sizing for on-premises workload - Reference sizing is the process of checking the
configurations and resource utilization data of an SAP workload on-premises. Reference sizing data shows the
current compute needs of the workload, and these needs should be matched in Azure. To find this information,
use the SAP OS Collector. SAP OS Collector retrieves system utilization information that can be reported via SAP
transaction OS07N and the EarlyWatch Alert. Any system performance and statistics gathering tools can collect
similar information.
(2) Use SAP Quick Sizer for a new workload(2) Use SAP Quick Sizer for a new workload - SAP Quick Sizer is a free web-based tool developed by SAP
that translates business requirements into technical requirements. Use this tool when you build a new SAP
workload to find the Azure VM with the correct network and storage throughput. For more information, see SAP
quick sizer.
Its important to choose the appropriate storage solutions to support the data needs of the SAP workload. The
correct solution can improve the performance of existing capabilities and allow you to add new features. In
general, storage needs to meet the input/output operations per second (IOPS) requirements and throughput
needs of the SAP database. For more information, see storage types for an SAP workload.
(1) Use storage that suppor ts performance requirement(1) Use storage that suppor ts performance requirement - Microsoft supports different storage
technology to meet your performance requirement. For SAP workload, you can use Azure Managed Disk (for
example, Premium SSD, Premium SSD v2, Standard SSD) and Azure NetApp Files.
(2) Configure storage for performance(2) Configure storage for performance - We've published a storage configuration guideline for SAP HANA
databases. It covers production scenarios and a cost-conscious non-production variant. Following the
recommended storage configurations will ensure the storage passes all SAP hardware and cloud measurement
tool (HCMT) KPIs. For more information, see SAP HANA Azure virtual machine storage configurations.
(3) Enable write accelerator(3) Enable write accelerator - Write accelerator is a capability for M-Series VMs on Premium Storage with
Azure Managed Disks exclusively. Its imperative to enable write accelerator on the disks associated with the
/hana/log volume. This configuration facilitates sub millisecond writes latency for 4 KB and 16-KB blocks sizes.
For more information, see Azure Write Accelerator.
Optimize workload networking
(4) Choose the right VM(4) Choose the right VM - Choosing the right VM has cost and performance implications. The goal is to pick a
storage VM that supports the IOPS and throughput requirements of the SAP workload. There are three critical
areas to focus while selecting a VM.
Number of vCPUs
- The number of CPUs has a direct effect on the licenses in the database node. Most of the
databases follow a core-based licensing model. Use the amount that meets your needs and adjust licensing
agreements as necessary.
Memory
- Memory is critical to application performance, and your SAP application can have high memory
demands. In general, higher memory provides more memory-reads, less paging, and higher VM cost.
Throughput
- Throughput is important for an application hosted on one of the VMs to communicate with
outside the VM by using its network interface cards (NICs).
An SAP workload needs to communicate with other workloads. Common communication paths are to local
storage, external storage, NICs, VMs in the network, VMs in other networks, and third-party applications.
Optimize workload networking to improve these communication channels to meet workload and application
demand. If SAP network performance isn't considered, it will cause application performance issues.
(1) Understand proximity placement groups(1) Understand proximity placement groups - Proximity placement groups reduce the distance between
SAP workloads. They can group different VM types under a single network spine. As the Azure footprint grows, a
single availability zone may span multiple physical data centers. The distribution across data centers can create
network latency impacting your SAP application performance. The proximity of the VMs provides lower network
latency.
The capabilities are ideal, but there are drawbacks to be aware of. Proximity placement groups often limit your
VM choices and make resizing VMs more difficult. Proximity placement groups bind VMs to a specific network
spine. This binding limits the possible combinations of different VM types. The host hardware that is needed to
run a certain VM type might not be present in the data center or under the network spine to which the proximity
placement group was assigned. The availability of VM types can be severely restricted.
We recommend using proximity placement groups in two scenarios. (1) Use proximity placement groups in
Azure regions where latency across zones is higher than recommended for the SAP workload. (2) Use proximity
placement groups for application volume group. The Application Volume Group feature of Azure NetApp Files
(ANF) uses PPG to deploy ANF volumes close to the VM/compute cluster. We recommend using this feature as
designed. For more information, see:
Overview of proximity placement groups
SAP and proximity placement groups
(2) Use accelerated networking(2) Use accelerated networking - Accelerated Network is default for most of the VM deployments and is
recommended for every VM hosting an SAP workload. Accelerated Network improves the network performance
by bypassing the physical switch. We recommend you enable Accelerated Networking on the Azure VMs
running your SAP Application and Database. Accelerated networking provides improved latency, jitter, and CPU
utilization. You should test the latency between the SAP application server and database with the SAP ABAP
report /SSA/CAT. It's an Inventory Check for the SAP Azure Workbook. For more information, see accelerated
networking overview.
(3) Use ExpressRoute GlobalReach(3) Use ExpressRoute GlobalReach - ExpressRoute is a private and resilient way to connect your on-
premises networks to different Azure regions. This feature allows you to link ExpressRoute circuits to make a
private network between your on-premises networks. Global Reach should be used for SAP HANA Large
Instance deployments to enable direct access from on-premises to your HANA Large Instance units deployed in
different regions. For more information, see ExpressRoute Global Reach.
Next Step
Overview
Reliability
Security
Cost Optimization
Operational Excellence
Sustainable workloads
12/16/2022 3 minutes to read • Edit Online
What is a sustainable workload?
Cloud efficiency overviewCloud efficiency overview
What are the common challenges?
This section of the Microsoft Azure Well-Architected Framework aims to address the challenges of building
sustainable workloads on Azure. Review the provided guidance that applies Well-Architected best practices as a
technical foundation for building and operating sustainable solutions on Azure.
We encourage you to also read more about the Microsoft Cloud for Sustainability for opportunities to leverage
the capabilities of that platform as part of your solution architecture. However, guidance found in this article
series is focused on
any
solutions you're building or operating on Azure.
Additionally, read about The Carbon Benefits of Cloud Computing: a Study of the Microsoft Cloud to learn more
about how Azure is more energy efficient and carbon efficient than on-premises solutions.
The term
workload
refers to a collection of application resources that support a common business goal or the
execution of a common business process, with multiple services, such as APIs and data stores, working together
to deliver specific end-to-end functionality.
With
sustainability
, we refer to the environmental impact of our workloads.
A
sustainable workload
therefore describes the practice of designing solutions that maximize utilization while
minimizing waste, ultimately reducing the footprint on the environment.
Making workloads more cloud efficient requires combining efforts around cost optimization, reducing carbon
emissions, and optimizing energy consumption. Optimizing the application's cost is the initial step in making
workloads more sustainable.
Here's a conceptual overview of cloud efficiency in this context:
Scoring and measuring the cloud efficiency is essential to understand whether changes tracked over time have
any impact.
Learn about building more sustainable and efficient workloads by starting with the design area for sustainable
Application Design.
Building and designing sustainable workloads on Microsoft Azure can be challenging for these main reasons:
Is sustainability only about performance and cost?
What are the key design areas?
DESIGN A READESIGN A REA DESC RIP T IONDESC RIP T ION
Application design Cloud application patterns that allow for designing
sustainable workloads.
Application platform Choices around hosting environment, dependencies,
frameworks, and libraries.
Testing Strategies for CI/CD pipelines and automation, and how to
deliver more sustainable software testing.
Operational procedures Processes related to sustainable operations.
Storage Design choices for making the data storage options more
sustainable.
Network and connectivity Networking considerations that can help reduce traffic and
amount of data transmitted to and from the application.
Security Relevant recommendations to design more efficient security
solutions on Azure.
Next step
Understanding if your workloads are in alignment with sustainability targets. It requires assessments of
current workloads to determine if they're designed in a sustainable way.
Designing workloads that are natively environmentally friendly and optimized.
Measuring and tracking the emissions of your workloads.
While performance efficiency and cost optimization are areas of strong focus for designing sustainable
workloads, the other pillars of the Well-Architected Framework are equally important when building long-term
sustainable workloads on Azure.
Security: how the security appliances in a workload are optimized and designed to auto-scale will have an
impact on the environment.
Reliability: designing reliable workloads that meet sustainability guidelines from the Green Software
Foundation can greatly reduce the workloads' carbon and electricity footprint.
Operational Excellence: how a workload is able to effectively respond to operational issues can ultimately
reduce carbon emissions.
Sustainable guidance within this series is composed of architectural considerations and recommendations
oriented around these key design areas.
Decisions made in one design area can impact or influence decisions across the entire design. The focus is
ultimately on building a sustainable solution to minimize the footprint and impact on the environment.
We recommend that readers familiarize themselves with these design areas, reviewing provided considerations
and recommendations to better understand the consequences of encompassed decisions.
Review the sustainability design methodology.
Design methodology
Design methodology for sustainable workloads on
Azure
12/16/2022 5 minutes to read • Edit Online
1Design for business requirements
2Evaluate the design areas using the design principles
3Understanding your emissions
Briefly about emission scopesBriefly about emission scopes
Building a sustainable application on any cloud platform requires technical expertise and an understanding of
sustainability guidelines in general and for your specific cloud platform.
This design methodology aims to help establish an understanding about producing more carbon efficient
solutions, measuring your carbon impact, and ultimately reducing unnecessary energy usage and emissions.
Businesses globally have different requirements. Expect that the review considerations and design
recommendations provided by this design methodology will yield different design decisions and trade-offs for
different scenarios and organizations.
Establish your business requirements and priorities, then review the design methodologies in alignment with
those requirements.
Refer to the sustainability design principles and the design areas below for your sustainability workloads.
Decisions made within each design area will echo across other design areas. Review the considerations and
recommendations in each design area to understand the consequences and impact and any known trade-offs.
Design areas:
Application design
Application platform
Deployment and testing
Operational procedures
Storage
Network and connectivity
Security
To lower your emissions, you need to understand how to measure your sustainability efforts.
At Microsoft, we segment our greenhouse gas (GHG) emissions into three categories, consistent with the
Greenhouse Gas Protocol.
Scope 1 emissionsScope 1 emissions: direct emissions that your activities create.
Scope 2 emissionsScope 2 emissions: indirect emissions that result from the production of the electricity or heat you use.
Scope 3 emissionsScope 3 emissions: indirect emissions from all other activities you're engaged in. For a business, these
Scope 3 emissions can be extensive. They must be accounted for across its supply chain, materials in its
buildings, employee business travel, and the life cycle of its products (including the electricity customers
consume when using the products). A company's Scope 3 emissions are often far more significant than its
Measure and track carbon impactMeasure and track carbon impact
Carbon tracking and reporting with the Emissions Impact DashboardCarbon tracking and reporting with the Emissions Impact Dashboard
Leverage the Microsoft Sustainability ManagerLeverage the Microsoft Sustainability Manager
Use a proxy solution to measure emissionsUse a proxy solution to measure emissions
Scope 1 and 2 emissions combined.
As a customer, the context of Scope 3 emissions can be network configuration and delivery, power consumption,
and devices outside the data center. If an application uses excess bandwidth or packet size, it will impact from
when the traffic leaves the data center, through the various hops on the internet, down to the end-user device.
Reducing network bandwidth, therefore, can have a significant impact throughout the delivery chain. The same
considerations apply to compute resources, data storage, application platform decisions, application design, and
more.
Find more in-depth details and definitions in Azure's Scope 3 Methodology White Paper, published in 2021.
Microsoft aligns with the Green Software Foundation, responsible for creating the Software Carbon Intensity
(SCI) specification.
To measure the carbon impact of an application, the GSF provided a scoring methodology called SCI, calculated
as follows:
SCI = ((E*I)+M) per R
Where:
E = Energy consumed by a software system. Measured in kWh.
I = Location-based marginal carbon emissions. Carbon emitted per kWh of energy, gCO2/kWh.
M = Embodied emissions of a software system. Carbon that is emitted through the hardware on which the
software is running.
R = Functional unit, which is how the application scales; per extra user, per API call, per service, etc.
With this knowledge, it's essential to consider not only the application infrastructure and hardware but also the
user devices and application scalability, as it can alter the environmental footprint considerably.
Read the full SCI specification on GitHub.
Microsoft offers the Emissions Impact Dashboard for Azure and Microsoft 365, which helps you measure your
cloud-based emissions and carbon savings potential.
We recommend you use this tool to get the insights and transparency you need to understand your carbon
footprint and to measure and track emissions over time.
Download the Emissions Impact Dashboard Power BI app for Azure to get started.
Customers using Microsoft Cloud for Sustainability can leverage Microsoft Sustainability Manager. This
extensible solution unifies data intelligence and provides comprehensive, integrated, and automated
sustainability management for organizations at any stage of their sustainability journey. It automates manual
processes, enabling organizations to record, report, and reduce their emissions more efficiently.
One way of estimating the carbon emissions from workloads is to design a proxy solution architecture based on
the SCI model as described above.
Defining the proxies for applications can be done in different ways. For example, using these variables:
Any known carbon emission of the infrastructure
The cost of the infrastructure
Edge services and infrastructure carbon emissions
The number of users that are concurrently using the application
A P P L IC AT IO N P E RFORM A N C EAP P L IC AT IO N P ERFORM A N C E A P P L IC AT IO N C O STAP P L IC AT IO N C O ST L IK ELY O UTC O M EL IK ELY OUTC O M E
High Unchanged Optimized app
High Lower Optimized app
Unchanged/Lower Higher According to the green principles, a
higher energy cost can cause higher
carbon emissions. Therefore, you can
assume that the app produces
unnecessary carbon emissions.
High High The app may be producing
unnecessary carbon
4The shared responsibility model for sustainability
Ways to reduce emissionsWays to reduce emissions
A shared responsibilityA shared responsibility
Metrics of the application to inform us about the performance over time
By designing an equation using the above variables, you can estimate the carbon score (an approximation),
helping you understand if you're building sustainable solutions.
There's also the aspect of application performance. You can link performance to cost and carbon and assume this
relationship yields a value. With this relation, you can simplify the view like this:
Therefore, building a carbon score dashboard can make use of the following proxies:
Cost
Performance
Carbon emissions of the infrastructure (if known/available)
Usage over time (requests, users, API calls, etc.)
Any extra measurement that is relevant to the application
Reducing emissions is a shared responsibility between the cloud provider and the customer designing and
deploying applications on the platform.
Reducing carbon emissions can happen with three possible solutions:
Carbon neutralization; compensating carbon emissions
Carbon avoidance; not emitting carbon in the first place
Carbon removal; subtract carbon from the atmosphere
The goal of green software is to avoid unnecessary emissions in the first place, hence actively working toward a
more sustainable future. Further,
carbon removal
is the preferred goal for removing emissions from our
atmosphere.
Microsoft is committed to being carbon negative by 2030, and by 2050 to have removed all the carbon the
company has emitted since it was founded in 1975.
As a cloud provider, Microsoft is responsible for the data centers hosting your applications.
However, deploying an application in the Microsoft cloud doesn't automatically make it sustainable, even if the
data centers are optimized for sustainability. Applications that aren't optimized may still emit more carbon than
necessary.
Next steps
Let's take an example.
You deploy an app to an Azure service, but you only utilize 10% of the allocated resources. The provisioned
resources are underutilized, ultimately leading to unnecessary emissions.
It would help if you considered scaling to an appropriate tier of the resource (rightsizing) or deploying more
apps to the same provisioned resources.
We recommend making applications more efficient to utilize the data center capacity in the best way possible.
Sustainability is a shared responsibility goal that must combine the efforts of the cloud provider and the
customers in designing and implementing applications.
Review the design principles for sustainability.
Design principles
Design principles of a sustainable workload
12/16/2022 4 minutes to read • Edit Online
Principles of green software
Carbon efficiency
Energy efficiency
The sustainability design methodology provides a framework to record, report, and reduce or optimize the
environmental impact of your workloads.
To achieve an increase in carbon efficiency, consider how your workload, directly and indirectly, can reduce
carbon emissions through:
Using less physical and virtual resources
Using less energy
Using energy and resources more intelligently
Supporting older devices
It's important to effectively record, report, and reduce carbon emissions through actionable insights.
Gain transparency into your current carbon impact
Estimate savings
Take action to accelerate progress
These critical design principles for sustainability resonate and extend the quality pillars of the Azure Well-
Architected Framework—Reliability, Security, Cost Optimization, Operational Excellence, and Performance
Efficiency.
Microsoft is actively working toward sustainability targets, and empowers every organization to help reduce
emissions and improve our environmental health. The Azure Well-Architected Framework workload for
sustainability aligns with the Green Software Principles from the Green Software Foundation.
The principles of green software are the starting point to understand the SCI model and how this will be
included in our framework.
Principle:Principle: Emit the least amount of carbon possible.
The application or software must emit the least amount of carbon possible. A carbon efficient cloud application
is one that is optimized, and the starting point is the cost – streamlining the application infrastructure and cost
will ensure that no unnecessary resources are wasted in the cloud to run the software. But this isn't enough, as
you might have cost-optimized your application but still waste tons of resources that emit carbon for no reason.
Read more about the Carbon Efficiency principle from the Green Software Foundation.
Principle:Principle: Use the least amount of energy possible.
The goal of this principle is that you build applications that are energy-efficient. This is a common pattern for
mobile applications, since they must rely on a battery powered device and are optimizing its consumption. It's
less common, however, for desktop or web applications, since until now, developers have never been asked to
optimize the electricity consumption of their software.
Carbon awareness
T E C H N IQUET E C H N IQU E DESC RIP T IONDESC RIP T ION
Demand shifting Demand shifting means moving the workloads and
resources to regions or data centers, or a time in the data
center where the energy supply is high and the demand is
lower and can be met by renewable energy. Delaying
running apps to a time when there's less demand should
result in lower carbon intensity.
Demand shaping Demand shaping means changing the application's behavior
and appearance to match the energy supply in real-time. A
good practice is to build an eco-version of the app and keep
it as a benchmark for demand shaping and carbon
optimization.
Hardware efficiency
Measuring sustainability
Climate commitments
Read more about the Energy Efficiency principle from the Green Software Foundation
Principle:Principle: Do more when the electricity is cleaner and do less when the electricity is dirtier.
We need to make the application aware of how much carbon it's emitting. This way, we can react to specific
conditions of energy supply using demand shifting and demand shaping techniques:
Read more about the Carbon Awareness principle from the Green Software Foundation.
Principle:Principle: Use the least amount of embodied carbon possible.
Embodied carbon is the carbon that was emitted to build a device. Therefore, a sustainable application will make
sure older devices are supported and will maximize the efficiency of each device. The goal is to build hardware-
efficient applications.
Consider the tradeoff that older devices can have power inefficiencies, and may not always be suitable.
Read more about the Hardware Efficiency principle from the Green Software Foundation.
Principle:Principle: What you can't measure, you can't improve.
Measuring carbon emissions of a cloud application is a complex task, as it involves the whole ecosystem of the
software: from the cloud infrastructure (where we have the emissions dashboards to help us out), to the
network path that is crossed, to the edge technology and user devices. With the SCI, we aren't targeting a
discrete measurement of carbon emissions, but a score that will change over time and with our optimization
techniques.
Read more about the Measurement from the Green Software Foundation.
Principle:Principle: Understand the exact mechanism of reduction.
Many corporations and groups have made commitments to the climate. They actively work toward new
sustainability goals with a primary objective to remove, reduce and prevent carbon emissions.
There are several options for reducing the carbon footprint of any organization or entity. However, and aligned
Next steps
with the goal of the Green Software Foundation, our main direction should always be to avoid emitting carbon
in the first place. This is what we call Abatement, or Carbon Elimination.
Once we've pursued this goal, there will still be emissions that can't be avoided. All the remaining carbon
reduction methodologies will help us do so, offsetting (either compensating or neutralizing carbon).
Your company's strategy can be a mix of all the possible methodologies and, depending on the final result, can
reach a Net Zero target when carbon emissions are eliminated where possible and the residual emissions
compensated.
The SCI equation aims to eliminate emissions, which should always be the primary goal of a sustainable
workload, and the score can only be reduced with abatement.
Read more about the Climate Commitments from the Green Software Foundation.
Review the considerations for application design.
Application design
Application design of sustainable workloads on
Azure
12/16/2022 6 minutes to read • Edit Online
IMPORTANTIMPORTANT
Code efficiency
Evaluate moving monoliths to a microservice architectureEvaluate moving monoliths to a microservice architecture
Improve API efficiencyImprove API efficiency
When building new or updating existing applications, it's crucial to consider how the solution will impact the
climate and if there are ways to improve and optimize. Learn about considerations and recommendations to
optimize your code and applications for a more sustainable application design.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
Demands on applications can vary, and it's essential to consider ways to stabilize the utilization to prevent over-
or underutilization of resources, which can lead to unnecessary energy spills.
Monolithic applications usually scale as a unit, leaving little room to scale only the individual components that
may need it.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
Evaluate the microservice architecture guidance.
A microservice architecture allows for scaling of only the necessary components during peak load; ensuring
idle components are scaled down or in. Additionally, it may reduce the overhead and resources required for
deploying monolithic applications.
Consider this tradeoff: While reducing the compute resources required, you may increase the amount of
traffic on the network, and the complexity of the application may increase significantly.
Consider this other tradeoff: Moving to microservices can result in extra deployment overhead with
numerous similarities in deployment pipelines. Carefully consider the required deployment resources for
monolithic versus microservice architectures.
Additionally, read about containerizing monolithic applications.
Many modern cloud applications are designed to transact many messages between services and components
asynchronously. Consider the format used to encode the payload data. How much information does your
application need to communicate, and is there room to reduce the chattiness?
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Learn about the chatty I/O antipattern to better understand how a large number of requests can impact
performance and responsiveness.
Improve the reliability and reduce unnecessary load to your systems. Implement advanced request throttling
with API Management.
Ensure backward software compatibility to ensure it works on legacy hardwareEnsure backward software compatibility to ensure it works on legacy hardware
Leverage cloud native design patternsLeverage cloud native design patterns
Consider using circuit breaker patternsConsider using circuit breaker patterns
Optimize code for efficient resource usageOptimize code for efficient resource usage
Optimize for async access patternsOptimize for async access patterns
Minimize the amount of data the application returns from requests by being selective and encoding the
messages. See message encoding considerations.
Cache responses to avoid reprocessing the same type of information from the backend system unless
necessary. See caching in Azure API Management.
Consider how applications render information. Does the application need to critically serve everything in the
highest quality, resulting in higher bandwidth and processing? Is there room for reducing the quality of
components in the UI to serve sustainability goals better?
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Support more end-user consumer devices, like older browsers and operating systems. This backward
compatibility improves hardware efficiency by reusing existing hardware instead of requiring a hardware
upgrade for the solution to work.
Consider this tradeoff: If the most recent software updates have significant performance improvements,
using older software versions may not be more efficient.
Learning about cloud-native design patterns is helpful for building applications, whether they're hosted on
Azure or running elsewhere. Optimizing the performance and cost of your cloud application will also reduce its
resource utilization, hence its carbon emissions.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
Leverage cloud-native design patterns when writing or updating applications.
Consider evaluating and preventing applications from performing operations that are likely to fail. Repeated
failures can lead to overhead and unnecessary processing that you can avoid with proper design patterns.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
A circuit breaker can act as a proxy for operations that might fail and should monitor the number of recent
failures that have occurred and use that information to decide whether to proceed.
Study the Circuit Breaker pattern, and then consider how you can implement the Circuit Breaker patterns to
your applications.
Consider using Azure Monitor to monitor failures and set up alerts.
Applications deployed using inefficient code may result in an inherent impact on sustainability.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
Reduce CPU cycles and the number of resources you need for your application.
Use optimized and efficient algorithms and design patterns.
Consider the Don't repeat yourself (DRY) principle.
Demands on applications can vary, and it's essential to consider ways to stabilize the utilization to prevent over-
Evaluate server
-
side vs. client
-
side renderingEvaluate server
-
side vs. client
-
side rendering
Be aware of UX design for sustainabilityBe aware of UX design for sustainability
or underutilization of resources, which can lead to unnecessary energy spills.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Queue and buffer requests that don't require immediate processing, then process in batch. Designing your
applications in this way helps achieve a stable utilization and helps flatten consumption to avoid spiky
requests.
Read about optimizing for async access patterns.
Determine whether to render on the server-side or client-side when building applications with a UI.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
Consider these benefits of server-side rendering:
When the server's power comes from less-polluting alternatives than the client's locale.
When the hardware on the server has better processing-energy ratios.
Can use centralized caching to reduce multiple unnecessary renders.
Reducing the number of browser-to-server round-trips can be particularly important when the client's
device has a lossy link.
When the client devices are older and have slower CPUs. Users don't need to upgrade their devices to
support a modern browser.
Consider these benefits of client-side rendering:
When the end-user devices are more suitable, pushing the responsibility of rendering to the clients.
It's more efficient only to render what's needed and as requested, as opposed to rendering everything
at least once.
There's no need for a server, as you can rely on static storage.
Browser caching is used on the clients.
Consider how the UX design of a workload impacts sustainability and determine what options exist for
improving energy efficiency and reducing unnecessary network load, data processing, and compute resources.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Consider reducing the number of components to load and render on pages.
Determine whether the application can render lower-resolution images and videos.
Ensuring there are no unused pages will help minimize the UX design.
Consider search and findability. Making it easier for users to find what they're looking for helps lower the
amount of data stored and retrieved.
Consider providing a lighter UI, using fewer resources and with a lower impact on sustainability, and provide
users with an informed choice.
Save energy by offering your apps and websites in dark mode, with dark backgrounds.
Opt for using system fonts when possible to avoid forcing clients to download additional fonts, which causes
Don't render full-size images as thumbnails where the browser is doing the resizing.
Using full-size images as thumbnails or resized images will transfer more data, unnecessary network
traffic, and additional client-side CPU usage due to image resizing and pre-rendering.
Update legacy codeUpdate legacy code
Next step
more network load.
Consider upgrading or deprecating legacy code if it's not running on modern cloud infrastructure or with the
latest updates.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Identify inefficient legacy code suited for modernization.
Review if there are options to move to serverless or any of the optimized PaaS options.
Consider this tradeoff: Updating old code that might end up being deprecated can consume valuable time.
Review the design considerations for the application platform.
Application platform
Application platform considerations for sustainable
workloads on Azure
12/16/2022 6 minutes to read • Edit Online
IMPORTANTIMPORTANT
Platform and service updates
Review platform and service updates regularlyReview platform and service updates regularly
Regional differences
Deploy to low
-
carbon regionsDeploy to low
-
carbon regions
Designing and building sustainable workloads requires understanding the platform where you're deploying the
applications. Review the considerations and recommendations in this section to know how to make better
informed platform-related decisions around sustainability.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
Keep platform and services up to date to leverage the latest performance improvements and energy
optimizations.
Platform updates enable you to use the latest functionality and features to help increase efficiency. Running on
outdated software can result in running a suboptimal workload with unnecessary performance issues. New
software tends to be more efficient in general.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Upgrade to newer and more efficient services as they become available.
Consider backward compatibility and hardware reusability. An upgrade may not be the most efficient
solution if the hardware or the OS isn't supported.
Make use of Azure Automation Update Management to ensure software updates are deployed to Azure VMs.
The Microsoft Azure data centers are geographically spread across the planet and powered using different
energy sources. Making decisions around where to deploy your workloads can significantly impact the
emissions your solutions produce.
Learn more about sustainability from the data center to the cloud with Azure.
Learn about what Azure regions have a lower carbon footprint than others to make better-informed decisions
about where and how our workloads process data.
Green Software Foundation alignment: Carbon efficiency
Recommendation:Recommendation:
Use less carbon because the data centers where you deploy the workload are more likely to be powered by
renewable and low-carbon energy sources.
Consider these potential tradeoffs:
Process when the carbon intensity is lowProcess when the carbon intensity is low
Choose data centers close to the customerChoose data centers close to the customer
Run batch workloads during low
-
carbon intensity periodsRun batch workloads during low
-
carbon intensity periods
Modernization
Containerize workloads where applicableContainerize workloads where applicable
The effort and time it takes to move to a low-carbon region.
Migrating data between data centers may not be carbon efficient.
Consider the cost for new regions, including low-carbon regions, which may be more expensive.
If the workloads are latency sensitive, moving to a lower carbon region may not be an option.
Some regions on the planet are more carbon intense than others. Therefore it's essential to consider where we
deploy our workloads and combine this with other business requirements.
Green Software Foundation alignment: Carbon efficiency, Carbon awareness
Recommendation:Recommendation:
Where you have the data available, consider optimizing workloads when knowing that the energy mix comes
mostly from renewable energy sources.
If your application(s) allow it, consider moving workloads dynamically when the energy conditions change.
For example, running specific workloads at night may be more beneficial when renewable sources are
at their peak.
Deploying cloud workloads to data centers is easy. However, consider the distance from a data center to the
customer. Network traversal increases if the data center is a greater distance from the consumer.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Consider deploying to data centers close to the consumer.
Proactively designing batch processing of workloads can help with scheduling intensive work during low-
carbon periods.
Green Software Foundation alignment: Carbon awareness
Recommendation:Recommendation:
Where you have the data available to you, plan your deployments to maximize compute utilization for
running batch workloads during low-carbon intensity periods.
Potential tradeoffs may include the effort and time it takes to move to a low-carbon region. Additionally,
migrating data between data centers may not be carbon efficient, and the cost for new regions-including low
—carbon regions—may be more expensive.
Consider these platform design decisions when choosing how to operate workloads. Leveraging managed
services and highly optimized platforms in Azure helps build cloud-native applications that inherently contribute
to a better sustainability posture.
Consider options for containerizing workloads to reduce unnecessary resource allocation and to utilize the
deployed resources better.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Deploying apps as containers allows for bin packing and getting more out of a VM, ultimately reducing the
Evaluate moving to PaaS and serverless workloadsEvaluate moving to PaaS and serverless workloads
Use SPOT VMs where possibleUse SPOT VMs where possible
Right sizing
Turn off workloads outside of business hoursTurn off workloads outside of business hours
Utilize auto
-
scaling and bursting capabilitiesUtilize auto
-
scaling and bursting capabilities
need for duplication of libraries on the host OS.
Removes the overhead of managing an entire VM, and allows deploying more apps per physical machine.
Containerization also optimizes server utilization rates and improves service reliability, lowering operational
costs. Fewer servers are needed, and the existing servers can be better utilized.
Consider these tradeoffs: The benefit of containerization will only realize if the utilization is high. Additionally,
provisioning an orchestrator such as Azure Kubernetes Services (AKS) or Azure Red Had OpenShift (ARO) for
only a few containers would likely lead to higher emissions overall.
Managed services are highly optimized and operate on more efficient hardware than other options, contributing
to a lower carbon impact.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:Recommendation:
Build a cloud-native app without managing the infrastructure, using a fully managed and inherently
optimized platform. The platform handles scaling, availability, and performance, ultimately optimizing the
hardware efficiency.
Review design principles for Platform as a Service (PaaS) workloads.
Think about the unused capacity in Azure data centers. Utilizing the otherwise wasted capacity—at significantly
reduced prices—the workload contributes to a more sustainable platform design.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
By utilizing SPOT VMs, you take advantage of unused capacity in Azure data centers while getting a
significant discount on the VM.
Consider the tradeoff: When Azure needs the capacity back, the VMs get evicted. Learn more about the SPOT
VM eviction policy.
Ensuring workloads use all the allocated resources helps deliver a more sustainable workload. Oversized
services are a common cause of more carbon emissions.
Operating idle workloads will waste energy and contributes to an added carbon emission.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
Dev and testing workloads should be turned off or downsized when not used. Instead of leaving them
running, consider shutting them off outside regular business hours.
Learn more about starting/stopping VMs during off-hours.
It's not uncommon with oversized compute workloads where much of the capacity is never utilized, ultimately
leading to a waste of energy.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Match the scalability needsMatch the scalability needs
Evaluate Ampere Altra Arm
-
based processors for Virtual MachinesEvaluate Ampere Altra Arm
-
based processors for Virtual Machines
Delete zombie workloadsDelete zombie workloads
Next step
Review auto-scaling guidance for Azure workloads.
Review the B-series burstable virtual machine sizes.
Consider that it may require tuning to prevent unnecessary scaling during short bursts of high demand, as
opposed to a static increase in demand.
Consider the application architecture as part of scaling considerations. For example, logical components
should scale independently to match the demand of that component, as opposed to scaling the entire
application if only a portion of the components needs scaling.
Consider the platform and whether it meets the scalability needs of the solution. For example, having
provisioned resources with a dedicated allocation may lead to unused or underutilized compute resources.
Examples:
Provisioning an Azure App Service Environment (ASE) over an App Service plan may lead to having
provisioned compute, whether utilized or not.
Choosing the Azure API Management Premium tier instead of the consumption tier leads to unused
resources if you aren't utilizing it fully.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Review the platform design decisions regarding scalability, and ensure the workload utilizes as much of the
provisioned resources as possible.
Consider this tradeoff: Some services require a higher tier to access certain features and capabilities
regardless of resource utilization.
Consider and prefer services that allow dynamic tier scaling where possible.
The Arm-based VMs represent a cost-effective and power-efficient option that doesn't compromise on the
required performance.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Evaluate if the Ampere Altra Arm-based VMs is a good option for your workloads.
Read more about Azure Virtual Machines with Ampere Altra Arm–based processors on Azure.
Consider discovering unutilized workloads and resources and if there are any orphaned resources in your
subscriptions.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:Recommendation:
Delete any orphaned workloads or resources if they're no longer necessary.
Review the design considerations for deployment and testing.
Deployment and testing
Testing considerations for sustainable workloads on
Azure
12/16/2022 4 minutes to read • Edit Online
IMPORTANTIMPORTANT
Testing efficiency
Run integration, performance, load, or any other intense testing during low
-
carbon periodsRun integration, performance, load, or any other intense testing during low
-
carbon periods
Automate CI/CD to scale worker agents as neededAutomate CI/CD to scale worker agents as needed
Consider caching when using CI/CD agentsConsider caching when using CI/CD agents
Organizations developing and deploying solutions to the cloud also need reliable testing. Learn about the
considerations and recommendations for running workload tests and how to optimize for a more sustainable
testing model.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
Running integration, performance, load, or any other intense testing capability may result in much processing. A
well-crafted design for testing the deployed workloads can help ensure full utilization of the available resources,
reducing carbon emissions.
Green Software Foundation alignment: Carbon awareness
Recommendation:Recommendation:
Where you have the data available to you, plan for running testing when the data center's energy mix
primarily uses renewable energy. It may, for example, be more beneficial to run testing during the night in
some regions.
Running underutilized or inactive CI/CD agents results in more emissions.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Keeps the compute utilization high, based on the current demand, avoiding unnecessary capacity allocation.
Only scale out when necessary, and when not testing, scale in. Ultimately this ensures there's no idle compute
resources in test environments.
Consider optimized platform services like containers over testing in a VM, utilizing the platform to reduce
maintenance.
Using caching mechanisms during CI/CD can reduce compute time and, thus, carbon emissions.
Green Software Foundation alignment: Energy Efficiency
Recommendation:Recommendation:
Store results from steps in a cache and re-use them between different CI/CD runs when possible: when there
are steps that take CPU time to produce an artifact that does not often change between different runs, it is
wise to save it for future usage so that CPU time is not wasted on every run producing the same artifact, over
Split large code repositoriesSplit large code repositories
Profiling and measuring
Assess where parallelization is possibleAssess where parallelization is possible
Assess with chaos engineeringAssess with chaos engineering
and over.
If the CI/CD agent is self-hosted, use a cache local to the agent to further reduce data transfers and
emissions. This ensures that the cache is not transferred over the network, which can be a significant source
of emissions.
Splitting large repositories can help the CI/CD phases, where only the parts of the code that have changed are
compiled. This reduces compute time, which ultimately lowers carbon emissions.
Green Software Foundation alignment: Energy Efficiency
Recommendation:Recommendation:
Split large code repositories, separating main code from libraries and dependencies.
Publish and re-use artifacts and libraries of code that are common across multiple repositories.
Recommendation:Recommendation:
Split large repositories of code into smaller ones, separating main code from libraries and dependencies.
Publish and re-use artifacts and libraries of code that are common across multiple repositories.
Measuring, profiling, and testing workloads are imperative to understanding how to best use allocated
resources.
Without properly profiling and testing workloads, it's difficult to know if it's making the best use of the
underlying platform and deployed resources.
Green Software Foundation alignment: Measuring sustainability
Recommendation:Recommendation:
Test your applications to understand concurrent requests, simultaneous processing, and more.
If you're running Machine Learning (ML) for tests, consider machines with a GPU for better efficiency gains.
Identify if the workload is performance intensive and work toward optimization.
Consider this tradeoff:
Running GPU-based machines for ML tests may increase the cost.
Running integration, performance and load tests increase the reliability of a workload. However, the introduction
of chaos engineering can significantly help improve reliability and resilience and how the applications react to
failures. In doing so, the workload can be optimized to handle failures gracefully and with less wasted resources.
Green Software Foundation alignment: Measuring sustainability
Recommendation:Recommendation:
Use load testing or chaos engineering to assess how the workload handles platform outages and traffic
spikes or dips. This helps increase service resilience and the ability to react to failures, allowing for a more
optimized fault handling.
Consider this tradeoff:
Injecting fault during chaos engineering and increasing the load on any system also
increases the emissions used for the testing resources. Evaluate how and when you can utilize chaos
engineering to increase the workload reliability while considering the climate impact of running unnecessary
testing sessions.
Another angle to this is using chaos engineering to test energy faults or moments with higher carbon
emissions: consider setting up tests that will challenge your application to consume the minimum possible
Establish CPU and Memory thresholds in testingEstablish CPU and Memory thresholds in testing
Next step
energy. Define how the application will react to such conditions with a specific "eco" version informing users
that they're emitting the minimum possible carbon by sacrificing some features and possibly some
performance. This can also be your benchmark application for scoring its sustainability.
Help build tests for testing sustainability in your application. Consider having a baseline CPU utilization
measurement, and detect abnormal changes to the CPU utilization baseline when tests run. With a baseline,
suboptimal decisions made in recent code changes can be discovered earlier.
Adding tests and quality gates into the deployment and testing pipeline helps avoid deploying non-sustainable
solutions, contributing to lowered emissions.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Monitor CPU and memory allocations when running integration tests or unit tests.
Find abnormally high resource consumption areas in the application code and focus on mitigating those first.
Configure alerts or test failures if surpassing the established baseline values, helping avoid deploying non-
sustainable workloads.
Consider this tradeoff: As applications grow, the baseline may need to shift accordingly to avoid failing the
tests when introducing new features.
Review the design considerations for operational procedures.
Operational procedures
Operational procedure considerations for
sustainable workloads on Azure
12/16/2022 7 minutes to read • Edit Online
IMPORTANTIMPORTANT
Measure and track carbon impact
The Emissions Impact DashboardThe Emissions Impact Dashboard
Define emissions targetDefine emissions target
The discipline of green software and its implementation within cloud efficiency patterns is relatively recent, and
no specific and universal standards have been agreed upon yet.
The Green Software Foundation works on creating and standardizing ways of making green software. However,
it's vital that everyone considers this aspect in their daily work and that when designing, planning, and
deploying Azure workloads, we consider the best practices that are already available and prepare our
environment to incorporate new standards when ready.
This document will guide you through setting up an environment for measuring and continuously improving
your Azure workloads' cost and carbon efficiency.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
To optimize or improve something, we first must decide what we want to change and how to measure it. In this
section, you'll learn about best practices and guidelines to measure and track the sustainability impact of your
workloads.
An essential aspect of working toward any sustainability goal is tracking and quantifying progress. If you can't
track and measure the impact, you'll never be sure if the efforts are worthwhile. The Emissions Impact
Dashboard is a Power BI dashboard that will give you a measure of the carbon impact of all your services and
resource groups in your Azure subscription(s).
The Emissions Impact Dashboards produce insights in various forms, and allows for a wide range of reporting
capabilities:
Series of visual representations in the dashboard itself.
Snapshot export to Excel, Power Point and PDF.
Continuous export to Microsoft Sustainability Manager and Dataverse.
Green Software Foundation alignment: Measuring sustainability
Recommendation:Recommendation:
Use the Emissions Impact Dashboard to record current and future environmental impact.
Identify and track metrics to quantify the achievement of technical, business, and sustainability outcomes.
Rely on tooling to help measure the impact, and record any changes made to your workload.
Learn more about the Sustainability and Dataverse API access in the Microsoft Learn module Access
Microsoft Sustainability Manager data.
Identify the metrics and set improvement goalsIdentify the metrics and set improvement goals
Cost optimization as a proxyCost optimization as a proxy
The Software Carbon Intensity (SCI) is the score you're looking for to measure the carbon impact of your
application(s) by adding the scalability and cost metrics to any carbon emissions measurement.
If you aren't using the Emissions Impact Dashboard, there are still ways of building carbon proxies that allow
you to measure your application's impact on emissions.
It can be a challenge to build carbon proxies for existing applications. Therefore, we recommend planning for
efficiency targets during the design phase of every workload. When adding new workloads to Azure, you should
consider planning for costs and emissions that will add to your existing footprint. The main goal should always
be not to emit carbon, so ideally, you should immediately find an optimization pattern to make up for the new
emissions.
The next step is to define your target emissions, either for a single application or for your entire set of cloud
workloads. The target can also include cost constraints, making it even easier to build upon since shrinking costs
will give you some budget to optimize emissions. Once you know your target, the cloud efficiency continuous
optimization process can start.
Green Software Foundation alignment: Measuring sustainability
Recommendations:Recommendations:
Calculate your new workload's minimum cost and carbon emissions (where applicable).
Track progress with Service Level Objectives (SLO), Service Level Agreements (SLA), or other performance
metrics.
Provide optimization patterns to accommodate the new application to your overall cloud efficiency score.
Once you've defined your target, you'll need to identify a few metrics that you can measure to prove your
changes had a positive effect on efficiency.
The metrics can, as an example, be derived from these categories:
Application performance metrics.
Cost optimization metrics.
Carbon emissions metrics (or proxies).
Green Software Foundation alignment: Measuring sustainability
Recommendation:Recommendation:
Discuss with every application owner since the impact of optimizing can vary and might affect many users.
Make sure that any plan that impacts performance is agreed upon and communicated clearly to the app
users so that they know that a lower performance may be necessary for the greater good of fewer carbon
emissions.
If you've connected the Microsoft Emissions Impact Dashboard (EID) to your Microsoft Sustainability
Manager (MSM) instance, you can use the Goal Tracking feature in MSM to define and track your goals by
linking them to live data from EID.
Sometimes the ease of deploying cloud resources makes us forget what is useful and what is simply a waste of
resources, money, and carbon. The message here's that experiments in the cloud can sometimes be costly in
terms of overall cloud efficiency, not purely cost, while bringing no innovation.
Use cloud resources wisely, considering any extra workload's carbon footprint.
When defining your SCI, you can use carbon proxies to compensate for the lack of specific standards and
measurements. One of the safest and most potent proxies for carbon emissions are your application(s) cost.
Defining policiesDefining policies
Community and knowledge sharing
Create a sustainability communityCreate a sustainability community
Plan for learningPlan for learning
Reducing unnecessary spending lowers the number of excessive emissions from deployed workloads as you're
using fewer cloud resources.
Linking cost performance metrics to carbon efficiency can be a sound strategy because you won't necessarily
need to compromise on your defined workload Key Performance Indicators (KPI) by optimizing cost and
reducing carbon emissions. However, you might decide that you're prepared to sacrifice a KPI towards your
carbon goal, which can also be part of your strategy.
Green Software Foundation alignment: Measuring sustainability
Recommendation:Recommendation:
Review the concept of using a proxy solution to measure emissions.
Leverage the guidance in the Azure Well-Architected Framework Cost Optimization pillar.
Azure Policy is a powerful tool that can make some decisions for your cloud efficiency easier to implement.
Consider defining one of more policies to keep your Azure virtual data center continuously optimized.
Green Software Foundation alignment: Climate commitments
Recommendation:Recommendation:
Incorporate and use the cost policies available in the Cloud Adoption Framework.
Leverage built-in policies relevant to cost in Azure Policy, as they're technically closely tied to sustainability.
Customize Azure Policy policies according to green software principles. For example, create a new Azure
Policy initiative for "Sustainability".
Consider this tradeoff: Enforcement of new policies must not impact any unplanned operational
performance metric.
Teams needs to be constantly aware of new advancements in sustainability, so they leverage these learnings
when implementing workloads.
Building a community around cloud efficiency and green software is a good starting point to foster cloud
efficiency awareness and culture across your organization.
Creating a sustainability community doesn't have to be a tedious task. Start with a small team that will invest
some time in learning the sustainability status and the relevant information on green software. This team can
also join the Green Software Foundation and be part of the teams that create rules, standards, and more.
The Core cloud Efficiency team will have to be up to date with all the innovative tools and principles that drive
your Azure workload's cost and carbon footprint.
Green Software Foundation alignment: Climate commitments
Recommendation:Recommendation:
Define policies and targets, and communicate their efforts and goals with the rest of the organization.
Learn more by reading how do I start a sustainability community in my organization?
Make time for the core team to learn about advancements in sustainable operations. Meanwhile, ensure that
your entire organization starts thinking about green software and how to contribute to the sustainability picture
with their daily choices.
Share best practices across teamsShare best practices across teams
Plan for incentivesPlan for incentives
Next step
Green Software Foundation alignment: Climate commitments
Recommendation:Recommendation:
Review these popular training and learning resources:
Use the self-paced learning module to Learn about The Principles of Sustainable Software Engineering.
Use the self-paced learning path to Get started with Microsoft Cloud for Sustainability.
Find more resources in the Microsoft Sustainability Learning Center.
Driving adoption of sustainability efforts requires input and work from across the organization.
Green Software Foundation alignment: Climate commitments
Recommendation:Recommendation:
Let team members share their workload and company-specific best practices for sustainable operations.
Set up a shared repository of best practices and guidance that have been tested in your environment with
tangible results.
Consider frequent knowledge-sharing sessions or internal webinars for getting everyone up to speed.
The quickest way of enforcing policies and creating the right culture is by setting incentives for improving the
environmental sustainability of a workload by either putting sustainability as a core KPI or adding it to the
overall efficiency of the applications.
Many software partners already include green software in their best practices. Therefore, ensure that your
efficiency targets are defined and accepted when discussing the workload.
Green Software Foundation alignment: Climate commitments
Recommendations:Recommendations:
Promote carbon-aware applications. Reward application owners if the measured carbon footprint meets the
KPI.
Introduce gamification by creating a friendly culture of sustainability competition—track records to promote
green workloads, SCI scoring, and any optimization or improvement on the score.
Consider introducing loyalty programs, where participants get incentives when they can prove the cloud
efficiency of their applications.
Explore the opportunity to introduce badges like "Carbon Aware" and "Carbon Optimized".
Review the design considerations for networking and connectivity.
Networking and connectivity
Networking considerations for sustainable
workloads on Azure
12/16/2022 3 minutes to read • Edit Online
IMPORTANTIMPORTANT
Network efficiency
Make use of a CDNMake use of a CDN
Follow caching best practicesFollow caching best practices
Select Azure regions based on where the customer residesSelect Azure regions based on where the customer resides
Most workloads in the cloud rely heavily on networking to operate. Whether internal networking or public-
facing workloads, the components and services used in provisioned solutions must consider the impact of
carbon emissions. Consider that network equipment consumes electricity, including traffic between the data
centers and end consumers. Learn about considerations and recommendations to enhance and optimize
network efficiency to reduce unnecessary carbon emissions.
Internet traversal between data centers and end consumers is a significant Scope 3 emission. Therefore,
recommendations in this section are aligned with the Principles of Green Software Networking area to improve
networking efficiency.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
Reduce unnecessary network traffic and lower bandwidth requirements where possible, allowing for a more
optimized network efficiency with less carbon emission.
Unnecessary traffic on the network should be avoided, as it's a cause for extra carbon emissions.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
A CDN helps minimize latency through storing frequently read static data closer to consumers, and helps
reduce the network traversal and server load.
Ensure to follow best practices for CDN.
Minimizing the amount of data transferred is crucial.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
Caching is a well-understood design technique to improve performance and efficiency.
A caching solution helps reduce network traversal and reduces the server load.
Consider that it may require tuning of parameters to maximize the benefit and minimize the carbon
drawbacks. For example, setting a Time to Live (TTL).
Adding in-memory caching can help use idle compute resources, increasing the compute density of
resources that are already allocated.
Read caching best practices.
Use managed audio and video streaming services with built
-
in compressionUse managed audio and video streaming services with built
-
in compression
Enable network file compressionEnable network file compression
Maximize network utilization within the same cloud and regionMaximize network utilization within the same cloud and region
Next step
The location of an application's consumers can be disparate, and it can be challenging to serve requests with
good performance and energy efficiency if the distance is too great.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Deploy or move Azure resources across regions to better serve the applications from where most consumers
reside.
Applications making use of a media streaming service may have high requirements for bandwidth and
compression, and can have a substantial carbon footprint if not designed carefully.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
By making use of a managed service for audio and video, applications can leverage built-in optimizations like
encoding, compressions, and more.
Read about managed audio and video streaming services.
Networks sending uncompressed data can have a higher requirement on bandwidth, the allocated resources,
and the solution in general. Consider compressing data to optimize the workload and design for a more
network efficient solution.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Reduce the network payload by improving CDN performance.
Operating solutions in multiple regions have a networking impact. Network traversals between components in
Azure are optimized to stay within the Azure infrastructure. However, any network traffic destined for the
internet or a component in another cloud involves the public internet's router resources, which you have no
control over regarding resource impact measurement or utilization.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Keeping resources in a single cloud gives you maximum control and allows the cloud provider to optimize
the network routing.
Maximize network utilization within the same cloud and, if possible, within the same region.
Since the cost can be a proxy for sustainability, review the Azure regions documentation in the Cost
Optimization pillar of the Azure Well-Architected Framework.
Review the design considerations for storage.
Storage
Data and storage design considerations for
sustainable workloads on Azure
12/16/2022 3 minutes to read • Edit Online
IMPORTANTIMPORTANT
Storage efficiency
Enable storage compressionEnable storage compression
Optimize database query performanceOptimize database query performance
Use the best suited storage access tierUse the best suited storage access tier
Data storage in Azure is a crucial component of most provisioned workloads. Learn how to design for a more
sustainable data storage architecture and optimize existing deployments.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
Build solutions with efficient storage to increase performance, lower the required bandwidth, and minimize
unnecessary storage design climate impact.
Storing much uncompressed data can result in unnecessary bandwidth waste and increase the storage capacity
requirements.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
A solution to reduce the storage requirements, including both capacity and required bandwidth to write or
retrieve data. For example, compressing files in Azure Front Door and compressing files in Azure CDN.
Compression is a well-known design technique to improve network performance.
Consider the tradeoff of compression: Does the benefit of compression outweigh the increased
carbon
cost
in the resources (CPU, RAM) needed to perform the compression/decompression?
Querying extensive databases or retrieving much information simultaneously can have a performance penalty.
Ideally, apps should optimize for query performance.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Reduces the latency of data retrieval while also reducing the load on the database.
Understand the query performance for Azure SQL Databases
There are many well-known ways to optimize data query performance, for example tuning apps and
databases for performance in an Azure SQL database.
Consider that it may require fine-tuning to achieve optimal results.
The carbon impact of data retrieved from hot storage can be higher than data from cold- or archive storage.
Designing solutions with the correct data access pattern can enhance the application's carbon efficiency.
Green Software Foundation alignment: Energy efficiency
Only store what is relevantOnly store what is relevant
Determine the most suitable access tier for blob dataDetermine the most suitable access tier for blob data
Reduce the number of recovery points for VM backupsReduce the number of recovery points for VM backups
Revise backup and retention policiesRevise backup and retention policies
Optimize the collection of logsOptimize the collection of logs
Recommendation:Recommendation:
Use storage best suited for the application's data access patterns.
Make sure your most frequent data is stored in hot storage, making it easy to retrieve and doesn't require
more processing to access.
Infrequently used data should be stored in cold or offline archive storage, using less energy.
Backup is a crucial part of reliability. However, storing backups indefinitely can quickly allocate much
unnecessary disk space. Consider how you plan backup storage retention.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Implement policies to streamline the process of storing and keeping relevant information. Microsoft Purview
can help label data and add time-based purging to delete it after a retention period automatically.
Additionally, this lets you stay in control of your data and reduces the amount of data to process and transfer.
Workloads integrated with Azure Monitor can rely on Data Collection Rules (DCR) to specify what data
should be collected, how to transform that data, and where to send the data.
Consider whether to store data in an online tier or an offline tier. Online tiers are optimized for storing data that
is accessed or modified frequently. Offline tiers are optimized for storing data that is rarely accessed.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Read Hot, Cool, and Archive access tiers for blob data.
Recovery points aren't automatically cleaned up. Therefore, consider where soft delete is enabled for Azure
Backup. The expired recovery points aren't cleaned up automatically.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Read more about the impact of expired recovery points for items in soft deleted state.
Consider reviewing backup policies and retention periods for backups to avoid storing unnecessary data.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Review and revise backup and retention policies to minimize storage overhead.
Actively review and delete backups that are no longer needed.
Continuously collecting logs across workloads can quickly aggregate and store lots of unused data.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Make sure you are logging and retaining only data that is relevant to your needs.
Read more about the Cost optimization and Log Analytics.
Next step
Review the design considerations for security.
Security
Security considerations for sustainable workloads on
Azure
12/16/2022 7 minutes to read • Edit Online
IMPORTANTIMPORTANT
Security monitoring
Use cloud native log collection methods where applicableUse cloud native log collection methods where applicable
Avoid transferring large unfiltered data sets from one cloud service provider to anotherAvoid transferring large unfiltered data sets from one cloud service provider to another
Designing sustainable workloads on Azure must encompass security, which is a foundational principle through
all phases of a project. Learn about considerations and recommendations leading to a more sustainable security
posture.
This article is part of the Azure Well-Architected sustainable workload series. If you aren't familiar with this series, we
recommend you start with what is a sustainable workload?
Use cloud native security monitoring solutions to optimize for sustainability.
Traditionally, log collection methods for ingestion to a Security Information and Event Management (SIEM)
solution required the use of an intermediary resource to collect, parse, filter and transmit logs onward to the
central collection system. Using this design can carry an overhead with more infrastructure and associated
financial and carbon-related costs.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:Recommendation:
Using cloud native service-to-service connectors simplify the integration between the services and the SIEM,
and removes the overhead of extra infrastructure.
It's possible to ingest log data from existing compute resources using previously deployed agents such as the
Azure Monitor Analytics agent. Review how to migrate to Azure Monitor agent from Log Analytics agent.
Consider this tradeoff: Deploying more monitoring agents will increase the overhead in processing as it
needs more compute resources. Carefully design and plan for how much information is needed to cover the
security requirements of the solution and find a suitable level of information to store and keep.
A possible solution to reduce unnecessary data collection is to rely on the Azure Monitor Data
Collection Rules (DCR).
Conventional SIEM solutions required all log data to be ingested and stored in a centralized location. In a
multicloud environment, this solution can lead to a large amount of data being transferred out of a cloud service
provide and into another, causing increased burden on the network and storage infrastructure.
Green Software Foundation alignment: Carbon efficiency, Energy efficiency
Recommendation:Recommendation:
Cloud native security services can perform localized analysis on relevant security data source. This analysis
allows the bulk of log data to remain within the source cloud service provider environment. Cloud native
SIEM solutions can be connected via an API or connector to these security services to transmit only the
relevant security incident or event data. This solution can greatly reduce the amount of data transferred while
Filter or exclude log sources before transmission or ingestion into a SIEMFilter or exclude log sources before transmission or ingestion into a SIEM
Archive log data to long
-
term storageArchive log data to long
-
term storage
Network architecture
Use cloud native network security controls to eliminate unnecessary network trafficUse cloud native network security controls to eliminate unnecessary network traffic
Minimize routing from endpoints to the destinationMinimize routing from endpoints to the destination
maintaining a high level of security information to respond to an incident.
In time, using the described approach helps reduce data egress and storage costs, which inherently help reduce
emissions.
Consider the complexity and cost of storing all logs from all possible sources. For instance, applications, servers,
diagnostics and platform activity.
Green Software Foundation alignment: Carbon efficiency, Energy efficiency
Recommendation:Recommendation:
When designing a log collection strategy for cloud native SIEM solutions, consider the use cases based on the
Microsoft Sentinel analytics rules required for your environment and match up the required log sources to
support those rules.
This option can help remove the unnecessary transmission and storage of log data, reducing the carbon
emissions on the environment.
Many customers have a requirement to store log data for an extended period due to regulatory compliance
reasons. In these cases, storing log data in the primary storage location of the SIEM system is a costly solution.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Log data can be moved out to a cheaper long-term storage option which respects the retention policies of
the customer, but lowers the cost by utilizing separate storage locations.
Increase the efficiency and avoid unnecessary traffic by following good practices for network security
architectures.
When you use a centralized routing- and firewall design, all network traffic is sent to the hub for inspection,
filtering, and onward routing. While this approach centralizes policy enforcement, it can create an overhead on
the network of unnecessary traffic from the source resources.
Green Software Foundation alignment: Hardware efficiency, Energy efficiency
Recommendation:Recommendation:
Use Network security groups and Application security groups to help filter traffic at the source, and to
remove the unnecessary data transmission. Using these capabilities can help reduce the burden on the cloud
infrastructure, with lower bandwidth requirements and less infrastructure to own and manage.
In many customer environments, especially in hybrid deployments, all end user device network traffic is routed
through on-premises systems before being allowed to reach the internet. Usually, this happens due to the
requirement to inspect all internet traffic. Often, this requires higher capacity network security appliances within
the on-premises environment, or more appliances within the cloud environment.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Minimize routing from endpoints to the destination.
Use network security tools with auto
-
scaling capabilitiesUse network security tools with auto
-
scaling capabilities
Evaluate whether to use TLS terminationEvaluate whether to use TLS termination
Use DDoS protectionUse DDoS protection
Endpoint security
Integrate Microsoft Defender for EndpointIntegrate Microsoft Defender for Endpoint
Where possible, end user devices should be optimized to split out known traffic directly to cloud
services while continuing to route and inspect traffic for all other destinations. Bringing these
capabilities and policies closer to the end user device prevents unnecessary network traffic and its
associated overhead.
Based on network traffic, there will be times when demand of the security appliance will be high, and other
times where it will be lower. Many network security appliances are deployed to a scale to cope with the highest
expected demand, leading to inefficiencies. Additionally, reconfiguration of these tools often requires a reboot
leading to unacceptable downtime and management overhead.
Green Software Foundation alignment: Hardware efficiency
Recommendation:Recommendation:
Making use of auto-scaling allows the rightsizing of the backend resources to meet demand without manual
intervention.
This approach will vastly reduce the time to react to network traffic changes, resulting in a reduced waste of
unnecessary resources, and increases your sustainability effect.
Learn more about relevant services by reading how to enable a Web Application Firewall (WAF) on an
Application Gateway, and deploy and configure Azure Firewall Premium.
Terminating and re-establishing TLS is CPU consumption that might be unnecessary in certain architectures.
Green Software Foundation alignment: Energy efficiency
Recommendation:Recommendation:
Consider if you can terminate TLS at your border gateway and continue with non-TLS to your workload load
balancer and onwards to your workload.
Review the information on TLS termination to better understand the performance and utilization impact it
offers.
Consider the tradeoff: A balanced level of security can offer a more sustainable and energy efficient workload
while a higher level of security may increase the requirements on compute resources.
Distributed Denial of Service (DDoS) attacks aim to disrupt operational systems by overwhelming them,
creating a significant impact on the resources in the cloud. Successful attacks flood network and compute
resources, leading to an unnecessary spike in usage and cost.
Green Software Foundation alignment: Energy efficiency, Hardware efficiency
Recommendation:Recommendation:
DDoS protection seeks to mitigate attacks at an abstracted layer, so the attack is mitigated before reaching
any customer operated services.
Mitigating any malicious usage of compute and network services will ultimately help reduce
unnecessary carbon emissions.
It's imperative that we secure our workloads and solutions in the cloud. Understanding how we can optimize our
mitigation tactics all the way down to the client devices can have a positive outcome for reducing emissions.
Reporting
Tag security resourcesTag security resources
Next step
Many attacks on cloud infrastructure seek to misuse deployed resources for the attacker's direct gain. Two such
misuse cases are botnets and crypto mining.
Both of these cases involve taking control of customer-operated compute resources and use them to either
create new cryptocurrency coins, or as a network of resources from which to launch a secondary action like a
DDoS attack, or mass e-mail spam campaigns.
Green Software Foundation alignment: Hardware efficiency
Recommendations:Recommendations:
Integrate Microsoft Defender for Endpoint with Defender for Cloud to identify and shut down crypto mining
and botnets.
The EDR capabilities provide advanced attack detections and are able to take response actions to
remediate those threats. The unnecessary resource usage created by these common attacks can
quickly be discovered and remediated, often without the intervention of a security analyst.
Getting the right information and insights at the right time is important for producing reports around emissions
from your security appliances.
It can be a challenge to quickly find and report on all security appliances in your tenant. Identifying the security
resources can help when designing a strategy for a more sustainable operating model for your business.
Green Software Foundation alignment: Measuring sustainability
Recommendation:Recommendation:
Tag security resources to record emissions impact of security resources.
Review the design principles for sustainability.
Design principles
Azure Well
-
Architected Framework review
-
Azure
Service Fabric
12/16/2022 14 minutes to read • Edit Online
Prerequisites
Reliability
Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage
scalable and reliable microservices and containers. These resources are deployed onto a network-connected set
of virtual or physical machines, which is called a clustercluster .
There are two clusters models in Azure Service Fabric: standard clustersstandard clusters and managed clustersmanaged clusters.
Standard clustersStandard clusters require you to define a cluster resource alongside a number of supporting resources. These
resources must be set up correctly upon deployment and maintained correctly throughout the lifecycle of the
cluster. Otherwise, the cluster and your services will not function properly.
Managed clustersManaged clusters simplify your deployment and management operations. The managed cluster model
consists of a single Service Fabric managed cluster resource that encapsulates and abstracts away the
underlying resources.
This article primarily discusses the managed clustermanaged cluster model for simplicity. However, call-outs are made for any
special considerations that apply to the standard clusterstandard cluster model.
In this article, you learn architectural best practices for Azure Service Fabric. The guidance is based on the five
pillars of architectural excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
Understanding the Well-Architected Framework pillars can help produce a high quality, stable, and
efficient cloud architecture. Check out the Azure Well-Architected Framework overview page to review
the five pillars of architectural excellence.
Reviewing the core concepts of Azure Service Fabric and microservice architecture can help you
understand the context of the best practices provided in this article.
The following sections cover design considerations and configuration recommendations, specific to Azure
Service Fabric and reliability.
When discussing reliability with Azure Service Fabric, it's important to distinguish between
cluster reliability
and
workload reliability
. Cluster reliability is a shared responsibility between the Service Fabric cluster admin and
their resource provider, while workload reliability is the domain of a developer. Azure Service Fabric has
considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
Design checklistDesign checklist
RecommendationsRecommendations
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Cluster architecture:Cluster architecture: Use Standard SKU for production
scenarios.
This level ensures the resource provider maintains cluster
reliability. Standard cluster :Standard cluster : A Standard SKU managed
cluster provides the equivalent of durability level Silver. To
achieve this using the standard cluster model, you will need
to use 5 VMs (or more).
Cluster architecture:Cluster architecture: Consider using Availability Zones for
your Service Fabric clusters.
Service Fabric managed cluster supports deployments that
span across multiple Availability Zones to provide zone
resiliency. This configuration will ensure high-availability of
the critical system services and your applications to protect
from single-points-of-failure.
Cluster architecture:Cluster architecture: Consider using Azure API
Management to expose and offload cross-cutting
functionality for APIs hosted on the cluster.
API Management can integrate with Service Fabric directly.
Workload architecture:Workload architecture: For stateful workload scenarios,
consider using Reliable Services.
The Reliable Services model allows your services to stay up
even in unreliable environments where your machines fail or
hit network issues, or in cases where the services themselves
encounter errors and crash or fail. For stateful services, your
state is preserved even in the presence of network or other
failures.
Security
For more information about Azure Service Fabric cluster reliability, check out the capacity planning
documentation.
For more information about Azure Service Fabric workload reliability, reference the Reliability subsystem
included in the Service Fabric architecture.
As you make design choices for Azure Service Fabric, review the design principles for adding reliability to the
architecture.
Cluster architecture:Cluster architecture: Use Standard SKU for production scenarios. Standard cluster :Standard cluster : Use durability level
Silver (5 VMs) or greater for production scenarios.
Cluster architecture:Cluster architecture: For critical workloads, consider using Availability Zones for your Service Fabric
clusters.
Cluster architecture:Cluster architecture: For production scenarios, use the Standard tier load balancer. Managed clusters
create an Azure public Standard Load Balancer and fully qualified domain name with a static public IP for
both the primary and secondary node types. You can also bring your own load balancer, which supports
both Basic and Standard SKU load balancers.
Cluster architecture:Cluster architecture: Create additional, secondary node types for your workloads.
Explore the following table of recommendations to optimize your Azure Service Fabric configuration for service
reliability:
For more suggestions, see Principles of the reliability pillar.
The following sections cover design considerations and configuration recommendations, specific to Azure
Service Fabric and security.
When discussing security with Azure Service Fabric, it's important to distinguish between
cluster security
and
Design checklistDesign checklist
RecommendationsRecommendations
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Cluster architecture:Cluster architecture: Ensure Network Security Groups
(NSG) are configured to restrict traffic flow between subnets
and node types.
For example, you may have an API Management instance
(one subnet), a frontend subnet (exposing a website directly),
and a backend subnet (accessible only to frontend).
Cluster architecture:Cluster architecture: Deploy Key Vault certificates to
Service Fabric cluster virtual machine scale sets.
Centralizing storage of application secrets in Azure Key Vault
allows you to control their distribution. Key Vault greatly
reduces the chances that secrets may be accidentally leaked.
Cluster architecture:Cluster architecture: Apply an Access Control List (ACL)
to your client certificate for your Service Fabric cluster.
Using an ACL provides an additional level of authentication.
Cluster architecture:Cluster architecture: Use resource requests and limits to
govern resource usage across the nodes in your cluster.
Enforcing resource limits helps ensure that one service
doesn't consume too many resources and starve other
services.
Workload architecture:Workload architecture: Encrypt Service Fabric package
secret values.
Encryption on your secret values provides an additional level
of security.
workload security
. Cluster security is a shared responsibility between the Service Fabric cluster admin and their
resource provider, while workload security is the domain of a developer. Azure Service Fabric has considerations
and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
For more information about Azure Service Fabric cluster security, check out Service Fabric cluster security
scenarios.
For more information about Azure Service Fabric workload security, reference Service Fabric application and
service security.
As you make design choices for Azure Service Fabric, review the design principles for adding security to the
architecture.
Cluster architecture:Cluster architecture: Ensure Network Security Groups (NSG) are configured to restrict traffic flow between
subnets and node types. Ensure that the correct ports are opened for application deployment and workloads.
Cluster architecture:Cluster architecture: When using the Service Fabric Secret Store to distribute secrets, use a separate data
encipherment certificate to encrypt the values.
Cluster architecture:Cluster architecture: Deploy client certificates by adding them to Azure Key Vault and referencing the URI
in your deployment.
Cluster architecture:Cluster architecture: Enable Azure Active Directory integration for your cluster to ensure users can access
Service Fabric Explorer using their Azure Active Directory (AAD) credentials. Don't distribute the cluster client
certificates among users to access Explorer.
Cluster architecture:Cluster architecture: For client authentication, use admin and read-only client certificates and/or AAD
authentication.
Cluster and workload architectures:Cluster and workload architectures: Create a process for monitoring the expiration date of client
certificates.
Cluster and workload architectures:Cluster and workload architectures: Maintain separate clusters for development, staging, and
production.
Consider the following recommendations to optimize your Azure Service Fabric configuration for security:
Workload architecture:Workload architecture: Include client certificates in
Service Fabric applications.
Having your applications use client certificates for
authentication provides opportunities for security at both
the cluster and workload level.
Workload architecture:Workload architecture: Authenticate Service Fabric
applications to Azure Resources using Managed Identity.
Using Managed Identity allow you to securely manage the
credentials in your code for authenticating to various
services without saving them locally on a developer
workstation or in source control.
Cluster and workload architectures:Cluster and workload architectures: Follow Service
Fabric best practices when hosting untrusted applications.
Following the best practices provides a security standard to
follow.
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Policy definitionsPolicy definitions
Cost optimization
Design checklistDesign checklist
For more suggestions, see Principles of the security pillar.
Azure Advisor helps you ensure and improve the security of Azure Service Fabric. You can review the
recommendations in the Azure Advisor section of this article.
Azure Policy helps maintain organizational standards and assess compliance across your resources. Keep the
following built-in policies in mind as you configure Azure Service Fabric:
Service Fabric clusters should have the ClusterProtectionLevel property set to EncryptAndSign . This is the
default value for managed clusters and isn't changeable. Standard cluster :Standard cluster : Ensure you set
ClusterProtectionLevel to EncryptAndSign .
Service Fabric clusters should only use Azure Active Directory for client authentication.
All built-in policy definitions related to Azure Service Fabric are listed in Built-in policies - Service Fabric.
The following sections cover design considerations and configuration recommendations, specific to Azure
Service Fabric and cost optimization.
When discussing cost optimization with Azure Service Fabric, it's important to distinguish between
cost of
cluster resources
and
cost of workload resources
. Cluster resources are a shared responsibility between the
Service Fabric cluster admin and their resource provider, while workload resources are the domain of a
developer. Azure Service Fabric has considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
For cluster cost optimization, go to the Azure pricing calculator and select Azure Ser vice FabricAzure Ser vice Fabric from the
available products. You can test different configuration and payment plans in the calculator.
For more information about Azure Service Fabric workload pricing, check out the example cost calculation
process for application planning.
As you make design choices for Azure Service Fabric, review the design principles for optimizing the cost of
your architecture.
Cluster architecture:Cluster architecture: Select appropriate VM SKU.
Cluster architecture:Cluster architecture: Use appropriate node type and size.
Cluster and workload architectures:Cluster and workload architectures: Use appropriate managed disk tier and size.
RecommendationsRecommendations
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Cluster architecture:Cluster architecture: Avoid VM SKUs with temp disk
offerings.
Service Fabric uses managed disks by default, so avoiding
temp disk offerings ensures you don't pay for unneeded
resources.
Cluster architecture:Cluster architecture: If you need to select a certain VM
SKU for capacity reasons and it happens to offer temp disk,
consider using temporary disk support for your stateless
workloads.
Make the most of the resources you're paying for. Using a
temporary disk instead of a managed disk can reduce costs
for stateless workloads.
Cluster and workload architectures:Cluster and workload architectures: Align SKU selection
and managed disk size with workload requirements.
Matching your selection to your workload demands ensures
you don't pay for unneeded resources.
Operational excellence
Design checklistDesign checklist
RecommendationsRecommendations
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Workload architecture:Workload architecture: Use Application Insights to
monitor your workloads.
Application Insights integrates with the Azure platform,
including Service Fabric.
Cluster and workload architectures:Cluster and workload architectures: Create a process
for monitoring the expiration date of client certificates.
For example, Key Vault offers a feature that sends an email
when x% of the certificate's lifespan has elapsed.
Explore the following table of recommendations to optimize your Azure Service Fabric configuration for cost:
For more suggestions, see Principles of the cost optimization pillar.
The following sections cover design considerations and configuration recommendations, specific to Azure
Service Fabric and operational excellence.
When discussing security with Azure Service Fabric, it's important to distinguish between
cluster operation
and
workload operation
. Cluster operation is a shared responsibility between the Service Fabric cluster admin and
their resource provider, while workload operation is the domain of a developer. Azure Service Fabric has
considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
As you make design choices for Azure Service Fabric, review the design principles for operational excellence.
Cluster architecture:Cluster architecture: Prepare a cluster monitoring solution.
Cluster architecture:Cluster architecture: Review the cluster health policies in the Service Fabric health model.
Workload architecture:Workload architecture: Prepare an application monitoring solution.
Workload architecture:Workload architecture: Review the application and service type health policies in the Service Fabric health
model.
Cluster and workload architectures:Cluster and workload architectures: Prepare an infrastructure monitoring solution.
Cluster and workload architectures:Cluster and workload architectures: Design your cluster with build and release pipelines for continuous
integration and deployment.
Explore the following table of recommendations to optimize your Azure Service Fabric configuration for
operational excellence:
Cluster and workload architectures:Cluster and workload architectures: For pre-production
clusters use Azure Chaos Studio to drill service disruption on
a Virtual Machine Scale Set instance failure.
Practicing service disruption scenarios will help you
understand what is at-risk in your infrastructure and how to
best mitigate the issues if they arise.
Cluster and workload architectures:Cluster and workload architectures: Use Azure Monitor
to monitor cluster and container infrastructure events.
Azure Monitor integrates well with the Azure platform,
including Service Fabric.
Cluster and workload architectures:Cluster and workload architectures: Use Azure
Pipelines for your continuous integration and deployment
solution.
Azure Pipelines integrates well with the Azure platform,
including Service Fabric.
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Performance efficiency
Design checklistDesign checklist
RecommendationsRecommendations
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Cluster architecture:Cluster architecture: Exclude the Service Fabric processes
from Windows Defender to improve performance.
By default, Windows Defender antivirus is installed on
Windows Server 2016 and 2019. To reduce any performance
impact and resource consumption overhead incurred by
Windows Defender, and if your security policies allow you to
exclude processes and paths for open-source software, you
can exclude.
For more suggestions, see Principles of the operational excellence pillar.
The following section covers configuration recommendations, specific to Azure Service Fabric and performance
efficiency.
When discussing security with Azure Service Fabric, it's important to distinguish between
cluster operation
and
workload operation
. Cluster performance is a shared responsibility between the Service Fabric cluster admin
and their resource provider, while workload performance is the domain of a developer. Azure Service Fabric has
considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
For more information about how Azure Service Fabric can reduce performance issues for your workload with
Service Fabric performance counters, reference Monitoring and diagnostic best practices for Azure Service
Fabric.
Cluster architecture:Cluster architecture: Exclude the Service Fabric processes from Windows Defender to improve
performance.
Cluster architecture:Cluster architecture: Select appropriate VM SKU.
Workload architecture:Workload architecture: Decide what programming model you will use for your services.
Cluster and workload architectures:Cluster and workload architectures: Use appropriate managed disk tier and size.
Consider the following recommendations to optimize your Azure Service Fabric configuration for performance
efficiency:
Cluster architecture:Cluster architecture: Consider using Autoscaling for your
cluster.
Autoscaling gives great elasticity and enables addition or
reduction of nodes on demand on a secondary node type.
This automated and elastic behavior reduces the
management overhead and potential business impact by
monitoring and optimizing the amount of nodes servicing
your workload.
Cluster architecture:Cluster architecture: Consider using Accelerated
Networking.
Accelerated networking enables a high-performance path
that bypasses the host from the data path, which reduces
latency, jitter, and CPU utilization for the most demanding
network workloads.
Cluster architecture:Cluster architecture: Considering using encryption at
host instead of Azure Disk Encryption (ADE).
This encryption method improves on ADE by supporting all
OS types and images, including custom images, for your
VMs by encrypting data in the Azure Storage service.
Workload architecture:Workload architecture: Review the Service Fabric
programming models to decide what model would best suit
your services.
Service Fabric supports several programming models. Each
come with their own advantages and disadvantages.
Knowing about the available programming models can help
you make the best choices for designing your services.
Workload architecture:Workload architecture: Leverage loosely-coupled
microservices for your workloads where appropriate.
Using microservices allows you to get the most out of
Service Fabric's features.
Workload architecture:Workload architecture: Leverage event-driven
architecture for your workloads where appropriate.
Using event-driven architecture allows you to get the most
out of Service Fabric's features.
Workload architecture:Workload architecture: Leverage background processing
for your workloads where appropriate.
Using background processing allows you to get the most
out of Service Fabric's features.
Cluster and workload architectures:Cluster and workload architectures: Review the
different ways you can scale your solution in Service Fabric.
You can use scaling to enable maximum resource utilization
for your solution.
A Z URE SER VIC E FA B RIC R EC OM M E NDAT IONA ZURE SERVIC E FA B RIC RE C O M M EN DAT IO N B EN EF ITB EN EF IT
Azure Advisor recommendations
SecuritySecurity
Additional resources
For more suggestions, see Principles of the performance efficiency pillar.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost
effectiveness, performance, and operational excellence when using Azure Service Fabric.
Service Fabric clusters should have the ClusterProtectionLevel property set to EncryptAndSign . This is the
default value for managed clusters and isn't changeable. Standard cluster :Standard cluster : Ensure you set
ClusterProtectionLevel to EncryptAndSign .
Service Fabric clusters should only use Azure Active Directory for client authentication.
Check out the Azure Service Fabric managed cluster configuration options article for a list of all the options you
have while creating and maintaining your cluster.
Review the Azure application architecture fundamentals for guidance on how to develop your workloads. While
Service Fabric can be used solely as a container hosting platform, using well-architected workloads leverages
Next steps
Service Fabric's full functionality.
Use these recommendations as you create your Service Fabric managed cluster using an ARM template or
through the Azure portal:
Quickstart: Deploy a Service Fabric managed cluster with an Azure Resource Manager template
Quickstart: Deploy a Service Fabric managed cluster using the Azure portal
Azure App Service and reliability
12/16/2022 8 minutes to read • Edit Online
Design considerations
Checklist
Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. This
service adds the power of Microsoft Azure to your application, such as:
Security
Load balancing
Autoscaling
Automated management
To explore how Azure App Service can bolster the resiliency of your application workload, reference key features
in Why use App Service?
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure App Service.
Microsoft guarantees that Azure App Service will be available 99.95% of the time. However, no SLA is provided
using either the Free or Shared tiers.
For more information, reference the SLA for App Service.
Have you configured Azure App Ser vice with resiliency in mind?Have you configured Azure App Ser vice with resiliency in mind?
Consider disabling ARR Affinity for your App Service.
Use a different store for session state.
Use Web Jobs.
Enable Always On to ensure Web Jobs run reliably.
Access the on-prem database using private connections like Azure VPN or Express Route.
Set up backup and restore.
Understand IP Address deprecation impact.
Ensure App Service Environments (ASE) are deployed in highly available configurations across Availability
Zones.
Ensure the ASE Network is configured correctly.
Consider configuring Upgrade preference if multiple environments are used.
Plan for scaling out the ASE cluster.
Use Deployment slots for resilient code deployments.
Avoid unnecessary worker restarts.
Use Run From Package to avoid deployment conflicts.
Use Basic or higher plans with two or more worker instances for high availability.
Evaluate the use of TCP and SNAT ports to avoid outbound connection errors.
Enable Health check to identify non-responsive workers.
Enable Autoscale to ensure adequate resources are available to service requests.
Enable Local Cache to reduce dependencies on cluster file servers.
Configuration recommendations
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
Consider disabling ARR Affinity for your App Service. ARR Affinity (Application Request Routing) sets an affinity
cookie, which is used to redirect users to the same node that
handled their previous requests.
Use a different store for session state. Storing session state in memory can result in losing session
state when there's a problem with the application or App
Service. It also limits the possibility of spreading the load
over other instances.
Enable Always On to ensure Web Jobs run reliably. A web app can time out after 20 minutes of inactivity. Only
requests to the actual web app reset the timer. If your app
runs continuously or if you schedule Timer trigger Web Jobs,
enable Always On. Only available in the Basic, Standard, and
Premium pricing tiers.
Access the on-prem database using private connections like
Azure VPN or Express Route.
Access the on-premises database through a private
connection so that the connection is secure and predictable.
Set up backup and restore. Backup and restore lets you manually create or schedule app
backups. You can retain backups for an indefinite amount of
time.
Understand IP Address deprecation impact. Floating addresses aren't guaranteed to remain on the
resource. It's essential to check for deprecated IP addresses.
Deploy in highly available configuration across Availability
Zones.
Ensures applications can continue to operate even if there's a
datacenter-level failure. This provides excellent redundancy
without requiring multiple deployments in different Azure
regions.
Configure ASE Network correctly. A common ASE pitfall occurs when ASE is deployed into a
subnet with an IP address space that is too small to support
future expansion. In such cases, ASE can be left unable to
scale without redeploying the entire environment into a
larger subnet. We highly recommend that adequate IP
addresses be used to support either the maximum number
of workers or the largest number considered workloads will
need. A single ASE cluster can scale to 201 instances, which
would require a /24 subnet.
Enable Diagnostic Logging to provide insight into application behavior.
Enable Application Insights Alerts to signal fault conditions.
Review Azure App Service diagnostics to ensure common problems are addressed.
Evaluate per-app scaling for high density hosting on Azure App Service.
Explore the following table of recommendations to optimize your App Service configuration for service
reliability:
Configure Upgrade preferenceUpgrade preference if multiple environments
are used.
If lower environments are used for staging or testing,
consider configuring these environments to receive updates
sooner than the production environment. This will help to
identify any conflicts or problems with an update, and
provides a window to mitigate issues before they reach the
production environment. If multiple load balanced (zonal)
production deployments are used,
Upgrade preference
can
be used to protect the broader environment against issues
from platform upgrades.
Scale out the ASE cluster. Scaling ASE instances vertically or horizontally takes 30 to
60 minutes as new private instances need to be
provisioned. We highly recommend investing in up-front
planning for scaling during spikes in load or transient failure
scenarios.
Use Deployment slotsDeployment slots for resilient code deployments.
Deployment slots
allow for code to be deployed to instances
that are
warmed-up
before serving production traffic. For
more information, reference Testing in production with Azure
App Service.
Avoid unnecessary worker restarts. Many events can lead App Service workers to restart, such
as content deployment, App Settings changes, and VNet
integration configuration changes. A best practice is to make
changes in a deployment slot other than the slot currently
configured to accept production traffic. After workers are
recycled and warmed up, a
swap
can be performed without
unnecessary down time.
Run From Package to avoid deployment conflicts Run from Package provides several advantages:
- Eliminates file lock conflicts between deployment and
runtime.
- Ensures only fully deployed apps are running at any time.
- May reduce
cold-start
times, particularly for JavaScript
functions with large npm package trees.
Use Basic or higher plans with two or more worker instances
for high availability.
Azure App Service provides many configuration options that
aren't enabled by default.
Evaluate the use of TCP and SNAT ports. TCP connections are used for all outbound connections; but,
SNAT ports are used when making outbound connections to
public IP addresses. SNAT port exhaustion is a common
failure scenario that can be predicted by load testing while
monitoring ports using Azure Diagnostics. For more
information, reference TCP and SNAT ports.
Enable Health checkHealth check to identify non-responsive workers. Any health check is better than none at all. The logic behind
endpoint tests should assess all critical downstream
dependencies to ensure overall health. As a best practice, we
highly recommend tracking application health and cache
status in real time as this removes unnecessary delays
before action can be taken.
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
Enable AutoscaleAutoscale to ensure adequate resources are
available to service requests.
The default limit of App Service workers is 30 . If the App
Service routinely uses 15 or more instances, consider
opening a support ticket to increase the maximum number
of workers to 2x the instance count required to serve
normal peak load.
Enable Local_Cache to reduce dependencies on cluster file
servers.
Enabling local cache is always appropriate because it can lead
to slower worker startup times. When coupled with
Deployment slotsDeployment slots, it can improve resiliency by removing
dependencies on file servers and also reduces storage-
related recycle events. Don't use local cache with a single
worker instance or when shared storage is required.
Enable Diagnostic LoggingDiagnostic Logging to provide insight into
application behavior.
Diagnostic logging
provides the ability to ingest rich
application and platform-level logs through Log Analytics,
Azure Storage, or a third-party tool using Event Hub.
Enable Application Insights alertsApplication Insights alerts to make you aware of
fault conditions.
Application performance monitoring with Application
Insights provides deep analyses into application
performance. For Windows Plans, a
codeless deployment
approach is possible to quickly get a performance analysis
without changing any code.
Review Azure App Ser vice diagnosticsAzure App Ser vice diagnostics to ensure
common problems are addressed.
It's a good practice to regularly review service-related
diagnostics and recommendations, and take action as
appropriate.
Evaluate per-app scalingper-app scaling for high density hosting on Azure
App Service.
Per-app scaling can be enabled at the App Service plan level
to allow for scaling an app independently from the App
Service plan that hosts it. This way, an App Service plan can
be scaled to 10 instances, but an app can be set to use
only five. Apps are allocated within the available App Service
plan using a best effort approach for even distribution
across instances. While an even distribution isn't guaranteed,
the platform will make sure that two instances of the same
app won't be hosted on the same App Service plan instance.
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
TCP and SNAT portsTCP and SNAT ports
TC P P O RT STC P P ORT S SM A L L ( B 1, S1, P 1, I1)SM A L L (B1, S1, P 1, I1) M EDIUM (B 2, S2, P 2, I2)M EDIUM (B 2, S2, P 2, I2) L A RGE ( B 3, S3, P 3, I3)L A RGE ( B 3, S3, P 3, I3)
TCP ports 1920 3968 8064
If a load test results in SNAT errors, it's necessary to either scale across more or larger workers, or implement
coding practices to help preserve and reuse SNAT ports, such as connection pooling and the lazy loading of
resources. We don't recommend exceeding 100 simultaneous outbound connections to a public IP address per
worker, and to avoid communicating with downstream services through public IP addresses when a private
address (Private Endpoint) or Service Endpoint through vNet Integration could be used. TCP port exhaustion
happens when the sum of connection from a given worker exceeds the capacity. The number of available TCP
ports depend on the size of the worker.
The following table lists the current limits:
Applications with many longstanding connections require ports to be left open for long periods of time, which
can lead to TCP Connection exhaustion. TCP Connection limits are fixed based on instance size, so it's necessary
Source artifacts
Resources
| where type == "microsoft.web/serverfarms" and properties.computeMode == `Dedicated`
| where sku.capacity == 1
Learn moreLearn more
Next step
to scale up to a larger worker size to increase the allotment of TCP connections, or implement code level
mitigations to govern connection usage. Similar to SNAT port exhaustion, you can use Azure Diagnostics to
identify a problem exists with TCP port limits.
To identify App Service Plans with only one instance, use the following query:
The Ultimate Guide to Running Healthy Apps in the Cloud
Azure App Service and cost optimization
Azure App Service and cost optimization
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
Ensure the ASE subnet is appropriately sized. The size of the subnet used to host an ASE directly affects
maximum scale. An ASE with no App Service plans will use
12 to 13 addresses before you create an app. It's
recommended that you deploy ASEs into a /24 subnet.
The maximum number of nodes in an ASE is 100 . During a
scale-up event, the new machines are provisioned and
placed into the subnet before the applications are migrated
to the new machines, and the old machines are removed.
The subnet must allow for at least 200 machines to handle
the maximum deployment size, which requires a /24
subnet. If you plan for insufficient capacity, scale-out
operations will be limited.
Use App Service Premium v3 plan over the Premium v2 plan The App Service Premium (v3) Plan has a 20% discount
versus comparable Pv2 configurations. Reserved Instance
commitment (1Y, 3Y, Dev/Test) discounts are available for
App Services running in the Premium v3 plan.
Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. This
service adds the power of Microsoft Azure to your application, such as:
Security
Load balancing
Autoscaling
Automated management
To explore how to optimize costs for Azure App Service in your workload, reference key features in Why use App
Service?
The following sections include a checklist and recommended configuration options specific to Azure App
Service.
Have you configured Azure App Ser vice while considering cost optimization?Have you configured Azure App Ser vice while considering cost optimization?
Ensure the ASE subnet is appropriately sized.
Consider cost savings by using the App Service Premium v3 plan over the Premium v2 plan.
Always use a scale-out and scale-in rule combination.
Understand the behavior of multiple scaling rules in a profile.
Consider Basic or Free tier for non-production usage.
Explore the following table of recommendations to optimize your App Service configuration for service cost:
Use a scale-out and scale-in rule combination If you use only one part of the combination, autoscale will
only take action in a single direction (scale out, or in) until it
reaches the maximum, or minimum instance counts defined
in the profile. This scaling behavior isn't optimal, ideally you
want your resource to scale up at times of high usage to
ensure availability. Similarly, at times of low usage, you want
your resource to scale down, so you can realize cost savings.
Understand the behavior of multiple scaling rules in a profile. There are cases where you may have to set multiple rules in
a profile. On scale-out, autoscale runs if any rule is met. On
scale-in, autoscale requires all rules to be met.
Consider Basic or Free tier for non-production usage. For non-prod App Service plans, consider scaling to Basic or
Free Tier and scale up, as needed, and scale down when not
in use – for example, during a Load Test exercise or based on
the capabilities provided (such as custom domain, SSL, and
more).
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
Next step
Azure App Service and operational excellence
Azure App Service and operational excellence
12/16/2022 7 minutes to read • Edit Online
Design considerations
Checklist
Azure App Service is an HTTP-based service for hosting web applications, REST APIs, and mobile back ends. This
service adds the power of Microsoft Azure to your application, such as:
Security
Load balancing
Autoscaling
Automated management
To explore how Azure App Service can benefit the operational excellence of your application workload, reference
key features in Why use App Service?
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure App Service.
Microsoft guarantees that Azure App Service will be available 99.95% of the time. However, no SLA is provided
using either the Free or Shared tiers.
For more information, reference the SLA for App Service.
Have you configured Azure App Ser vice while considering operational excellence?Have you configured Azure App Ser vice while considering operational excellence?
Create a deployment plan because redeploying the app service can reset the scaled units.
Review the App Service Advisor recommendations.
Ensure you configure the App Service Environments (ASE) Network correctly.
Consider configuring Upgrade Preference if you're using multiple environments.
Plan for scaling out the ASE cluster.
Use Deployment Slots for resilient code deployments.
Avoid unnecessary worker restarts when deploying application code or configuration.
Use Run From Package to avoid deployment conflicts.
Use Basic or higher plans with two or more worker instances for high availability.
Evaluate the use of TCP and SNAT ports to avoid outbound connection errors.
Enable Health check to identify non-responsive workers.
Enable Autoscale to ensure adequate resources are available to service requests.
Enable Local Cache to reduce dependencies on cluster file servers.
Enable Diagnostic Logging to provide insight into application behavior.
Enable Application Insights Alerts to signal fault conditions.
Review Azure App Service diagnostics to ensure common problems are addressed.
Evaluate per-app scaling for high density hosting on Azure App Service.
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
Create a deployment plan because redeploying the app
service can reset the scaled units.
Automatic scaling rules apply during operation of the
environment, but redeploying the app service may cause the
plan to reset to the default number of units. Customers
should be aware of this behavior and plan for it during
deployments. Deploy only during off-peak times or deploy
maximum units with automatic scaling enabled to scale in
and out to prevent website performance implications.
Review the App Service Advisor recommendations. App Service Advisor gives you real-time recommendations in
the portal on resource exhaustion and conditions related to
CPU, memory, and connections.
Ensure you configure the App Ser vice EnvironmentsApp Ser vice Environments
(ASE) Network(ASE) Network correctly.
One common ASE pitfall occurs when ASE is deployed into a
subnet with an IP address space that is too small to support
future expansion. In such cases, ASE can be left unable to
scale without redeploying the entire environment into a
larger subnet. We highly recommended that adequate IP
addresses be used to support either the maximum number
of workers or the largest number considered workloads will
need. A single ASE cluster can scale to 201 instance, which
would require a /24 subnet.
Configure Upgrade preferenceUpgrade preference if you're using multiple
environments.
If lower environments are used for staging or testing,
consider configuring these environments to receive updates
sooner than the production environment. This will help to
identify any conflicts or problems with an update, and
provides a window to mitigate issues before they reach the
production environment. If multiple load balanced (zonal)
production deployments are used,
Upgrade preference
can
be used to protect the broader environment against issues
from platform upgrades.
Scale out the ASE cluster. Scaling ASE instances vertically or horizontally takes 30 to
60 minutes as new private instances need to be
provisioned. We highly recommend investing in up-front
planning for scaling during spikes in load or transient failure
scenarios.
Use Deployment slotsDeployment slots for resilient code deployments.
Deployment slots
allow for code to be deployed to instances
that are
warmed-up
before serving production traffic. For
more information, reference Testing in production with Azure
App Service.
Avoid unnecessary worker restarts. Many events can lead App Service workers to restart, such
as content deployment, App Settings changes, and VNet
integration configuration changes. A best practice is to make
changes in a deployment slot other than the slot currently
configured to accept production traffic. After workers are
recycled and warmed up, a
swap
can be performed without
unnecessary down time.
Run From Package to avoid deployment conflicts Run from Package provides several advantages:
- Eliminates file lock conflicts between deployment and
runtime.
- Ensures only fully deployed apps are running at any time.
- May reduce
cold-start
times, particularly for JavaScript
functions with large npm package trees.
Use Basic or higher plans with two or more worker instances
for high availability.
Azure App Service provides many configuration options that
aren't enabled by default.
Evaluate the use of TCP and SNAT ports. TCP connections are used for all outbound connections, but
SNAT ports are used when making outbound connections to
public IP addresses. SNAT port exhaustion is a common
failure scenario that can be predicted by load testing while
monitoring ports using Azure Diagnostics. For more
information, reference TCP and SNAT ports.
Enable Health checkHealth check to identify non-responsive workers. Any health check is better than none at all. The logic behind
endpoint tests should assess all critical downstream
dependencies to ensure overall health. As a best practice, we
highly recommend tracking application health and cache
status in real time as this removes unnecessary delays
before action can be taken.
Enable AutoscaleAutoscale to ensure adequate resources are
available to service requests.
The default limit of App Service workers is 30 . If the App
Service routinely uses 15 or more instances, consider
opening a support ticket to increase the maximum number
of workers to 2x the instance count required to serve
normal peak load.
Enable Local_Cache to reduce dependencies on cluster file
servers.
Enabling local cache is always appropriate because it can lead
to slower worker startup times. When coupled with
Deployment slotsDeployment slots, it can improve resiliency by removing
dependencies on file servers and also reduces storage-
related recycle events. Don't use local cache with a single
worker instance or when shared storage is required.
Enable Diagnostic LoggingDiagnostic Logging to provide insight into
application behavior.
Diagnostic logging
provides the ability to ingest rich
application and platform-level logs through Log Analytics,
Azure Storage, or a third-party tool using Event Hub.
Enable Application Insights alertsApplication Insights alerts to make you aware of
fault conditions.
Application performance monitoring with Application
Insights provides deep analyses into application
performance. For Windows Plans, a
codeless deployment
approach is possible to quickly get a performance analysis
without changing any code.
Review Azure App Ser vice diagnosticsAzure App Ser vice diagnostics to ensure
common problems are addressed.
It's a good practice to regularly review service-related
diagnostics and recommendations, and take action as
appropriate.
Evaluate per-app scalingper-app scaling for high density hosting on Azure
App Service.
Per-app scaling can be enabled at the App Service plan level
to allow for scaling an app independently from the App
Service plan that hosts it. This way, an App Service plan can
be scaled to 10 instances, but an app can be set to use
only five. Apps are allocated to available App Service plan
using a best effort approach for an even distribution across
instances. While an even distribution isn't guaranteed, the
platform will make sure that two instances of the same app
won't be hosted on the same App Service plan instance.
A SE REC O M M EN DAT IO NA SE REC O M M EN DAT IO N DESC RIP T IO NDESC RIP T ION
TCP and SNAT portsTCP and SNAT ports
If a load test results in SNAT errors, it's necessary to either scale across more or larger workers, or implement
coding practices to help preserve and reuse SNAT ports, such as connection pooling and the lazy loading of
TC P P O RT STC P P ORT S SM A L L ( B 1, S1, P 1, I1)SM A L L (B1, S1, P 1, I1) M EDIUM (B 2, S2, P 2, I2)M EDIUM (B 2, S2, P 2, I2) L A RGE ( B 3, S3, P 3, I3)L A RGE ( B 3, S3, P 3, I3)
TCP ports 1920 3968 8064
Source artifacts
Resources
| where type == "microsoft.web/serverfarms" and properties.computeMode == `Dedicated`
| where sku.capacity == 1
Learn moreLearn more
Next step
resources. We don't recommend exceeding 100 simultaneous outbound connections to a public IP address per
worker, and to avoid communicating with downstream services through public IP addresses when a private
address (Private Endpoint) or Service Endpoint through vNet Integration could be used. TCP port exhaustion
happens when the sum of connection from a given worker exceeds the capacity. The number of available TCP
ports depend on the size of the worker.
The following table lists the current limits:
Applications with many longstanding connections require ports to be left open for long periods of time, which
can lead to TCP Connection exhaustion. TCP Connection limits are fixed based on instance size, so it's necessary
to scale up to a larger worker size to increase the allotment of TCP connections, or implement code level
mitigations to govern connection usage. Similar to SNAT port exhaustion, you can use Azure Diagnostics to
identify if a problem exists with TCP port limits.
To identify App Service plans with only one instance, use the following query:
The Ultimate Guide to Running Healthy Apps in the Cloud
Azure Batch and reliability
Azure Batch and reliability
12/16/2022 2 minutes to read • Edit Online
Design and configuration checklist
Design and configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Keep application binaries and reference data up to date in all
regions.
Staying up to date will ensure the region can be brought
online quickly without waiting for file upload and
deployment.
Use fewer jobs and more tasks. Using a job to run a single task is inefficient. For example, it's
more efficient to use a single job containing 1000 tasks
rather than creating 100 jobs that contain 10 tasks each.
Running 1000 jobs, each with a single task, would be the
least efficient, slowest, and most expensive approach.
Use multiple Batch accounts in various regions to allow your
application to continue running, if an Azure Batch account in
one region becomes unavailable.
It's crucial to have multiple accounts for a highly available
application.
Azure Batch allows you to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently
in Azure.
Use Azure Batch to:
Create and manage a pool of compute nodes (virtual machines).
Install applications you want to run.
Schedule jobs to run on the compute nodes.
The following sections include a design and configuration checklist, recommended design, and configuration
options specific to Azure Batch.
Have you designed your workload and configured Azure Batch with resiliency in mind?Have you designed your workload and configured Azure Batch with resiliency in mind?
Keep application binaries and reference data up to date in all regions.
Use fewer jobs and more tasks.
Use multiple Batch accounts in various regions to allow your application to continue running, if an Azure
Batch account in one region becomes unavailable.
Build durable tasks.
Pre-create all required services in each region, such as the Batch account and storage account.
Make sure the appropriate quotas are set on all subscriptions ahead of time, so you can allocate the required
number of cores using the Batch account.
Explore the following table of recommendations to optimize your workload design and Azure Batch
configuration for service reliability:
Build durable tasks. Tasks should be designed to withstand failure and
accommodate retry, especially for long running tasks. Ensure
tasks generate the same, single result even if they're run
more than once. One way to achieve the same result is to
make your tasks
goal seeking
. Another way is to make sure
your tasks are
idempotent
(tasks will have the same
outcome no matter how many times they're run).
Pre-create all required services in each region, such as the
Batch account and storage account.
There's often no charge for creating accounts and charges
accrue only when you use the account, or when you store
data.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Azure Batch and operational excellence
Azure Batch and operational excellence
12/16/2022 2 minutes to read • Edit Online
Design and configuration checklist
Design and configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Keep application binaries and reference data up to date in all
regions.
Staying up to date will ensure the region can be brought
online quickly without waiting for file upload and
deployment.
Use fewer jobs and more tasks. Using a job to run a single task is inefficient. For example, it's
more efficient to use a single job containing 1000 tasks
rather than creating 100 jobs that contain 10 tasks each.
Running 1000 jobs, each with a single task, would be the
least efficient, slowest, and most expensive approach.
Pre-create all required services in each region, such as the
Batch account and storage account.
There's often no charge for creating accounts and charges
accrue only when you use the account, or when you store
data.
Next step
Azure Batch allows you to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently
in Azure.
Use Azure Batch to:
Create and manage a pool of compute nodes (virtual machines).
Install applications you want to run.
Schedule jobs to run on the compute nodes.
The following sections include a design and configuration checklist, recommended design, and configuration
options specific to Azure Batch.
Have you designed your workload and configured Azure Batch with operational excellence inHave you designed your workload and configured Azure Batch with operational excellence in
mind?mind?
Keep application binaries and reference data up to date in all regions.
Use fewer jobs and more tasks.
Pre-create all required services in each region, such as the Batch account and storage account.
Make sure the appropriate quotas are set on all subscriptions ahead of time, so you can allocate the required
number of cores using the Batch account.
Explore the following table of recommendations to optimize your workload design and Azure Batch
configuration for operational excellence:
Azure Batch and performance efficiency
Azure Batch and performance efficiency
12/16/2022 2 minutes to read • Edit Online
Design checklist
Design and configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use fewer jobs and more tasks. Using a job to run a single task is inefficient. For example, it's
more efficient to use a single job containing 1000 tasks
rather than creating 100 jobs that contain 10 tasks each.
Running 1000 jobs, each with a single task, would be the
least efficient, slowest, and most expensive approach.
Next step
Azure Batch allows you to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently
in Azure.
Use Azure Batch to:
Create and manage a pool of compute nodes (virtual machines).
Install applications you want to run.
Schedule jobs to run on the compute nodes.
The following sections include a design checklist and recommended design options specific to Azure Batch.
Have you designed your workload and configured Azure Batch with performance efficiency inHave you designed your workload and configured Azure Batch with performance efficiency in
mind?mind?
Use fewer jobs and more tasks.
Consider the following recommendation to optimize your workload design and Azure Batch configuration for
performance efficiency:
AKS and reliability
Azure Well
-
Architected Framework review
-
Azure
Kubernetes Service
(
AKS
)
12/16/2022 22 minutes to read • Edit Online
Prerequisites
Reliability
Design checklistDesign checklist
This article provides architectural best practices for Azure Kubernetes Service (AKS). The guidance is based on
the five pillars of architecture excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
We assume that you understand system design principles, have working knowledge of Azure Kubernetes
Service, and are well versed with its features. For more information, see Azure Kubernetes Service.
Understanding the Well-Architected Framework pillars can help produce a high-quality, stable, and efficient
cloud architecture. We recommend that you review your workload by using the Azure Well-Architected
Framework Review assessment.
For context, consider reviewing a reference architecture that reflects these considerations in its design. We
recommend that you start with the baseline architecture for an Azure Kubernetes Service (AKS) cluster and
Microservices architecture on Azure Kubernetes Service. Also review the AKS landing zone accelerator, which
provides an architectural approach and reference implementation to prepare landing zone subscriptions for a
scalable Azure Kubernetes Service (AKS) cluster.
In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component. Use the following information to minimize failed instances.
When discussing reliability with Azure Kubernetes Service, it's important to distinguish between
cluster
reliability
and
workload reliability
. Cluster reliability is a shared responsibility between the cluster admin and
their resource provider, while workload reliability is the domain of a developer. Azure Kubernetes Service has
considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
Cluster architecture:Cluster architecture: For critical workloads, use availability zones for your AKS clusters.
Cluster architecture:Cluster architecture: Plan the IP address space to ensure your cluster can reliably scale, including handling
of failover traffic in multi-cluster topologies.
Cluster architecture:Cluster architecture: Enable Container insights to monitor your cluster and configure alerts for reliability-
impacting events.
Workload architecture:Workload architecture: Ensure workloads are built to support horizontal scaling and report application
readiness and health.
AKS configuration recommendationsAKS configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Cluster and workload architectures:Cluster and workload architectures: Control pod
scheduling using node selectors and affinity.
Allows the Kubernetes scheduler to logically isolate
workloads by hardware in the node. Unlike tolerations, pods
without a matching node selector can be scheduled on
labeled nodes, which allows unused resources on the nodes
to consume, but gives priority to pods that define the
matching node selector. Use node affinity for more flexibility,
which allows you to define what happens if the pod can't be
matched with a node.
Cluster architecture:Cluster architecture: Ensure proper selection of network
plugin based on network requirements and cluster sizing.
Azure CNI is required for specific scenarios, for example,
Windows-based node pools, specific networking
requirements and Kubernetes Network Policies. Reference
Kubenet versus Azure CNI for more information.
Cluster and workload architectures:Cluster and workload architectures: Use the AKS
Uptime SLA for production grade clusters.
The AKS Uptime SLA guarantees:
- 99.95% availability of the Kubernetes API server endpoint
for AKS Clusters that use Azure Availability Zones, or
- 99.9% availability for AKS Clusters that don't use Azure
Availability Zones.
Cluster and workload architectures:Cluster and workload architectures: Configure
monitoring of cluster with Container insights.
Container insights help monitor the health and performance
of controllers, nodes, and containers that are available in
Kubernetes through the Metrics API. Integration with
Prometheus enables collection of application and workload
metrics.
Cluster architecture:Cluster architecture: Use availability zones to maximize
resilience within an Azure region by distributing AKS agent
nodes across physically separate data centers.
By spreading node pools across multiple zones, nodes in one
node pool will continue running even if another zone has
gone down. If colocality requirements exist, either a regular
VMSS-based AKS deployment into a single zone or proximity
placement groups can be used to minimize internode
latency.
Cluster architecture:Cluster architecture: Adopt a multiregion strategy by
deploying AKS clusters deployed across different Azure
regions to maximize availability and provide business
continuity.
Internet facing workloads should leverage Azure Front Door
or Azure Traffic Manager to route traffic globally across AKS
clusters.
Cluster and workload architectures:Cluster and workload architectures: Define Pod
resource requests and limits in application deployment
manifests, and enforce with Azure Policy.
Container CPU and memory resource limits are necessary to
prevent resource exhaustion in your Kubernetes cluster.
Cluster and workload architectures:Cluster and workload architectures: Keep the System
node pool isolated from application workloads.
System node pools require a VM SKU of at least 2 vCPUs
and 4 GB memory, but 4 vCPU or more is recommended.
Reference System and user node pools for detailed
requirements.
Cluster and workload architectures:Cluster and workload architectures: Ensure your workload is running on user node pools and chose the
right size SKU. At a minimum, include two nodes for user node pools and three nodes for the system node
pool.
Cluster architecture:Cluster architecture: Use the AKS Uptime SLA to meet availability targets for production workloads.
Explore the following table of recommendations to optimize your AKS configuration for Reliability.
Cluster and workload architectures:Cluster and workload architectures: Separate
applications to dedicated node pools based on specific
requirements.
Applications may share the same configuration and need
GPU-enabled VMs, CPU or memory optimized VMs, or the
ability to scale-to-zero. Avoid large number of node pools to
reduce extra management overhead.
Cluster architecture:Cluster architecture: Use a NAT gateway for clusters that
run workloads that make many concurrent outbound
connections.
To avoid reliability issues with Azure Load Balancer
limitations with high concurrent outbound traffic, us a NAT
Gateway instead to support reliable egress traffic at scale.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Azure PolicyAzure Policy
Cluster and workload architectureCluster and workload architecture
Security
Design checklistDesign checklist
RecommendationsRecommendations
For more suggestions, see Principles of the reliability pillar.
Azure Kubernetes Service offers a wide variety of built-in Azure Policies that apply to both the Azure resource
like typical Azure Policies and, using the Azure Policy add-on for Kubernetes, also within the cluster. There are a
numerous number of policies, and key policies related to this pillar are summarized here. For a more detailed
view, see built-in policy definitions for Kubernetes.
Clusters have readiness or liveness health probes configured for your pod spec.
In addition to the built-in Azure Policy definitions, custom policies can be created for both the AKS resource and
for the Azure Policy add-on for Kubernetes. This allows you to add additional reliability constraints you'd like to
enforce in your cluster and workload architecture.
Security is one of the most important aspects of any architecture. To explore how AKS can bolster the security of
your application workload, we recommend you review the Security design principles. If your Azure Kubernetes
Service cluster needs to be designed to run a sensitive workload that meets the regulatory requirements of the
Payment Card Industry Data Security Standard (PCI-DSS 3.2.1), review AKS regulated cluster for PCI-DSS 3.2.1.
To learn about DoD Impact Level 5 (IL5) support and requirements with AKS, review Azure Government IL5
isolation requirements.
When discussing security with Azure Kubernetes Service, it's important to distinguish between
cluster security
and
workload security
. Cluster security is a shared responsibility between the cluster admin and their resource
provider, while workload security is the domain of a developer. Azure Kubernetes Service has considerations and
recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
Cluster architecture:Cluster architecture: Use Managed Identities to avoid managing and rotating service principles.
Cluster architecture:Cluster architecture: Use Kubernetes role-based access control (RBAC) with Azure AD for least privilege
access and minimize granting administrator privileges to protect configuration, and secrets access.
Cluster architecture:Cluster architecture: Use Microsoft Defender for containers with Azure Sentinel to detect and quickly
respond to threats across your cluster and workloads running on them.
Cluster architecture:Cluster architecture: Deploy a private AKS cluster to ensure cluster management traffic to your API server
remains on your private network. Or use the API server allow list for non-private clusters.
Workload architecture:Workload architecture: Use a Web Application Firewall to secure HTTP(S) traffic.
Workload architecture:Workload architecture: Ensure your CI/CID pipeline is hardened with container-aware scanning.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Cluster architecture:Cluster architecture: Use Azure Active Directory
integration.
Using Azure AD centralizes the identity management
component. Any change in user account or group status is
automatically updated in access to the AKS cluster. The
developers and application owners of your Kubernetes
cluster need access to different resources.
Cluster architecture:Cluster architecture: Authenticate with Azure Active
Directory (Azure AD) to Azure Container Registry.
AKS and Azure AD enables authentication with Azure
Container Registry without the use of imagePullSecrets
secrets. Review Authenticate with Azure Container Registry
from Azure Kubernetes Service for more information.
Cluster architecture:Cluster architecture: Secure network traffic to your API
server with private AKS cluster.
By default, network traffic between your node pools and the
API server travels the Microsoft backbone network; by using
a private cluster, you can ensure network traffic to your API
server remains on the private network only.
Cluster architecture:Cluster architecture: For non-private AKS clusters, use
API server authorized IP ranges.
When using public clusters, you can still limit the traffic that
can reach your clusters API server by using the authorized IP
range feature. Include sources like the public IPs of your
deployment build agents, operations management, and
node pools' egress point (such as Azure Firewall).
Cluster architecture:Cluster architecture: Protect the API server with Azure
Active Directory RBAC.
Securing access to the Kubernetes API Server is one of the
most important things you can do to secure your cluster.
Integrate Kubernetes role-based access control (RBAC) with
Azure AD to control access to the API server. Disable local
accounts to enforce all cluster access using Azure AD-based
identities.
Cluster architecture:Cluster architecture: Use Azure network policies or Calico. Secure and control network traffic between pods in a cluster.
Cluster architecture:Cluster architecture: Secure clusters and pods with Azure
Policy.
Azure Policy can help to apply at-scale enforcement and
safeguards on your clusters in a centralized, consistent
manner. It can also control what functions pods are granted
and if anything is running against company policy.
Cluster architecture:Cluster architecture: Secure container access to resources. Limit access to actions that containers can perform. Provide
the least number of permissions, and avoid the use of root
or privileged escalation.
Workload architecture:Workload architecture: Use a Web Application Firewall to
secure HTTP(S) traffic.
To scan incoming traffic for potential attacks, use a web
application firewall such as Azure Web Application Firewall
(WAF) on Azure Application Gateway or Azure Front Door.
Cluster architecture:Cluster architecture: Control cluster egress traffic. Ensure your cluster's outbound traffic is passing through a
network security point such as Azure Firewall or an HTTP
proxy.
Cluster architecture:Cluster architecture: Use the open-source Azure AD
Workload Identity and Secrets Store CSI Driver with Azure
Key Vault.
Protect and rotate secrets, certificates, and connection
strings in Azure Key Vault with strong encryption. Provides
an access audit log, and keeps core secrets out of the
deployment pipeline.
Explore the following table of recommendations to optimize your AKS configuration for security.
Cluster architecture:Cluster architecture: Use Microsoft Defender for
Containers.
Monitor and maintain the security of your clusters,
containers, and their applications.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Policy definitionsPolicy definitions
Cluster architectureCluster architecture
Cluster and workload architectureCluster and workload architecture
Cost optimization
For more suggestions, see Principles of the security pillar.
Azure Advisor helps ensure and improve Azure Kubernetes service. It makes recommendations on a subset of
the items listed in the policy section below, such as clusters without RBAC configured, missing Microsoft
Defender configuration, unrestricted network access to the API Server. Likewise, it makes workload
recommendations for some of the pod security initiative items. Review the recommendations.
Azure Policy offers various built-in policy definitions that apply to both the Azure resource and AKS like standard
policy definitions, and using the Azure Policy add-on for Kubernetes, also within the cluster. Many of the Azure
resource policies come in both
Audit/Deny
, but also in a
Deploy If Not Exists
variant.
There are a numerous number of policies, and key policies related to this pillar are summarized here. For a more
detailed view, see built-in policy definitions for Kubernetes.
Microsoft Defender for Cloud-based policies
Authentication mode and configuration policies (Azure AD, RBAC, disable local authentication)
API Server network access policies, including private cluster
Kubernetes cluster pod security initiatives Linux-based workloads
Include pod and container capability policies such as AppArmor, sysctl, security caps, SELinux, seccomp,
privileged containers, automount cluster API credentials
Mount, volume drivers, and filesystem policies
Pod/Container networking policies, such as host network, port, allowed external IPs, HTTPs, and internal load
balancers
Azure Kubernetes Service deployments often also use Azure Container Registry for Helm charts and container
images. Azure Container Registry also supports a wide variety of Azure policies that spans network restrictions,
access control, and Microsoft Defender for Cloud, which complements a secure AKS architecture.
In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure
Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your
cluster and workload architecture.
For more suggestions, see AKS security concepts and evaluate our security hardening recommendations based
on the CIS Kubernetes benchmark.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies. We recommend you review the Cost optimization design principles. Tightly associated with (but not
limited to) cost optimization is sustainability. Cost optimization goals typically support workload sustainability
goals as well.
When discussing cost optimization with Azure Kubernetes Service, it's important to distinguish between
cost of
cluster resources
and
cost of workload resources
. Cluster resources are a shared responsibility between the
cluster admin and their resource provider, while workload resources are the domain of a developer. Azure
Kubernetes Service has considerations and recommendations for both of these roles.
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Cluster and workload architectures:Cluster and workload architectures: Align SKU selection
and managed disk size with workload requirements.
Matching your selection to your workload demands ensures
you don't pay for unneeded resources.
Cluster and workload architectures:Cluster and workload architectures: Use the Start and
Stop feature in Azure Kubernetes Services (AKS).
The AKS Stop and Start cluster feature allows AKS customers
to pause an AKS cluster, saving time and cost. The stop and
start feature keeps cluster configurations in place and
customers can pick up where they left off without
reconfiguring the clusters.
Cluster architecture:Cluster architecture: Enable cluster autoscaler to
automatically reduce the number of agent nodes in response
to excess resource capacity.
Automatically scale down the number of nodes in your AKS
cluster lets you run an efficient cluster when demand is low,
and scale up again when demand returns.
Workload architecture:Workload architecture: Consider using Azure Spot VMs
for workloads that can handle interruptions, early
terminations, and evictions.
For example, workloads such as batch processing jobs,
development and testing environments, and large compute
workloads may be good candidates for you to schedule on a
spot node pool. Using spot VMs for nodes with your AKS
cluster allows you to take advantage of unused capacity in
Azure at a significant cost savings.
Cluster architecture:Cluster architecture: Enforce resource quotas at the
namespace level.
Resource quotas provide a way to reserve and limit
resources across a development team or project. These
quotas are defined on a namespace and can be used to set
quotas on compute resources, storage resources, and object
counts. When you define resource quotas, all pods created in
the namespace must provide limits or requests in their pod
specifications.
Workload architecture:Workload architecture: Use the Horizontal pod
autoscaler.
Adjust the number of pods in a deployment depending on
CPU utilization or other select metrics, which support cluster
scale-in operations.
Cluster architecture:Cluster architecture: Configure monitoring of cluster with
Container insights.
Container insights help provide actionable insights into your
clusters idle and unallocated resources.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
For cluster cost optimization, go to the Azure pricing calculator and select Azure Kubernetes Ser viceAzure Kubernetes Ser vice from the
available products. You can test different configuration and payment plans in the calculator.
Cluster architecture:Cluster architecture: Use appropriate VM SKU per node pool and reserved instances where long-term
capacity is expected.
Cluster and workload architectures:Cluster and workload architectures: Use appropriate managed disk tier and size.
Cluster architecture:Cluster architecture: Review performance metrics, starting with CPU, memory, storage, and network, to
identify cost optimization opportunities by cluster, nodes, and namespace.
Cluster architecture:Cluster architecture: Use cluster autoscaler to scale in when workloads are less active.
Explore the following table of recommendations to optimize your AKS configuration for cost.
Cluster architecture:Cluster architecture: Select the appropriate region. Due to many factors, cost of resources varies per region in
Azure. Evaluate the cost, latency, and compliance
requirements to ensure you're running your workload cost-
effectively and it doesn't affect your end-users or create
additional networking charges.
Cluster architecture:Cluster architecture: Sign up for Azure Reservations. If you properly planned for capacity, your workload is
predictable and will exist for an extended period of time, sign
up for Azure Reserved Instances to further reduce your
resource costs.
Workload architecture:Workload architecture: Maintain small and optimized
images.
Streamlining your images helps reduce costs since new
nodes need to download these images. Build images in a
way that allows the container start as soon as possible to
help avoid user request failures or timeouts while the
application is starting up, potentially leading to
overprovisioning.
Cluster architecture:Cluster architecture: Use Kubernetes Resource Quotas. Resource quotas can be used to limit resource consumption
for each namespace in your cluster, and by extension
resource utilization for the Azure service.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Policy definitionsPolicy definitions
Cloud efficiencyCloud efficiency
Operational excellence
Design checklistDesign checklist
For more suggestions, see Principles of the cost optimization pillar.
While there are no built-in policies that are related to cost optimization, custom policies can be created for both
the AKS resource and for the Azure Policy add-on for Kubernetes. This allows you to add additional cost
optimization constraints you'd like to enforce in your cluster and workload architecture.
Making workloads more sustainable and cloud efficient, requires combining efforts around cost optimizationcost optimization,
reducing carbon emissionsreducing carbon emissions, and optimizing energy consumptionoptimizing energy consumption. Optimizing the application's cost is the
initial step in making workloads more sustainable.
Learn how to build sustainable and efficient AKS workloads, in Sustainable software engineering principles in
Azure Kubernetes Service (AKS).
Monitoring and diagnostics are crucial. Not only can you measure performance statistics, but also use metrics
troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design
principles and the Day-2 operations guide.
When discussing operational excellence with Azure Kubernetes Service, it's important to distinguish between
cluster operational excellence
and
workload operational excellence
. Cluster operations are a shared
responsibility between the cluster admin and their resource provider, while workload operations are the domain
of a developer. Azure Kubernetes Service has considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
Cluster architecture:Cluster architecture: Use a template-based deployment using Bicep, Terraform, or others. Make sure that
all deployments are repeatable, traceable, and stored in a source code repo.
Cluster architecture:Cluster architecture: Build an automated process to ensure your clusters are bootstrapped with the
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Cluster and workload architectures:Cluster and workload architectures: Review AKS best
practices documentation.
To build and run applications successfully in AKS, there are
key considerations to understand and implement. These
areas include multi-tenancy and scheduler features, cluster,
and pod security, or business continuity and disaster
recovery.
Cluster and workload architectures:Cluster and workload architectures: Review Azure
Chaos Studio.
Azure Chaos Studio can help simulate faults and trigger
disaster recovery situations.
Cluster and workload architectures:Cluster and workload architectures: Configure
monitoring of cluster with Container insights.
Container insights help monitor the performance of
containers by collecting memory and processor metrics from
controllers, nodes, and containers that are available in
Kubernetes through the Metrics API and container logs.
Workload architecture:Workload architecture: Monitor application performance
with Azure Monitor.
Configure Application Insights for code-based monitoring of
applications running in an AKS cluster.
Workload architecture:Workload architecture: Configure scraping of
Prometheus metrics with Container insights.
Container insights, which are part of Azure Monitor, provide
a seamless onboarding experience to collect Prometheus
metrics. Reference Configure scraping of Prometheus metrics
for more information.
Cluster architecture:Cluster architecture: Adopt a multiregion strategy by
deploying AKS clusters deployed across different Azure
regions to maximize availability and provide business
continuity.
Internet facing workloads should leverage Azure Front Door
or Azure Traffic Manager to route traffic globally across AKS
clusters.
Cluster architecture:Cluster architecture: Operationalize clusters and pods
configuration standards with Azure Policy.
Azure Policy can help to apply at-scale enforcement and
safeguards on your clusters in a centralized, consistent
manner. It can also control what functions pods are granted
and if anything is running against company policy.
Workload architecture:Workload architecture: Use platform capabilities in your
release engineering process.
Kubernetes and ingress controllers support many advanced
deployment patterns for inclusion in your release
engineering process. Consider patterns like blue-greem
deployments or canary releases.
necessary cluster-wide configurations and deployments. This is often performed using GitOps.
Workload architecture:Workload architecture: Use a repeatable and automated deployment processes for your workload within
your software development lifecycle.
Cluster architecture:Cluster architecture: Enable diagnostics settings to ensure control plane or core API server interactions are
logged.
Cluster and workload architectures:Cluster and workload architectures: Enable Container insights to collect metrics, logs, and diagnostics to
monitor the availability and performance of the cluster and workloads running on it.
Workload architecture:Workload architecture: The workload should be designed to emit telemetry that can be collected, which
should also include liveliness and readiness statuses.
Cluster and workload architectures:Cluster and workload architectures: Use chaos engineering practices that target Kubernetes to identify
application or platform reliability issues.
Workload architecture:Workload architecture: Optimize your workload to operate and deploy efficiently in a container.
Cluster and workload architectures:Cluster and workload architectures: Enforce cluster and workload governance using Azure Policy.
Explore the following table of recommendations to optimize your AKS configuration for operations.
Cluster and workload architectures:Cluster and workload architectures: For mission-critical
workloads, use stamp-level blue/green deployments.
Automate your mission-critical design areas, including
deployment and testing.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Policy definitionsPolicy definitions
Cluster architectureCluster architecture
Cluster and workload architectureCluster and workload architecture
Performance efficiency
Design checklistDesign checklist
For more suggestions, see Principles of the operational excellence pillar.
Azure Advisor also makes recommendations on a subset of the items listed in the policy section below, such
unsupported AKS versions and unconfigured diagnostic settings. Likewise, it makes workload recommendations
around the use of the default namespace.
Azure Policy offers various built-in policy definitions that apply to both the Azure resource and AKS like standard
policy definitions, and using the Azure Policy add-on for Kubernetes, also within the cluster. Many of the Azure
resource policies come in both
Audit/Deny
, but also in a
Deploy If Not Exists
variant.
There are a numerous number of policies, and key policies related to this pillar are summarized here. For a more
detailed view, see built-in policy definitions for Kubernetes.
Azure Policy add-on for Kubernetes
GitOps configuration policies
Diagnostics settings policies
AKS version restrictions
Prevent command invoke
Namespace deployment restrictions
In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure
Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your
cluster and workload architecture.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. We recommend you review the Performance efficiency principles.
When discussing performance with Azure Kubernetes Service, it's important to distinguish between
cluster
performance
and
workload performance
. Cluster performance is a shared responsibility between the cluster
admin and their resource provider, while workload performance is the domain of a developer. Azure Kubernetes
Service has considerations and recommendations for both of these roles.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, call-outs are made to indicate whether each
choice is applicable to cluster architecture, workload architecture, or both.
As you make design choices for Azure Kubernetes Service, review the Performance efficiency principles.
Cluster and workload architectures:Cluster and workload architectures: Perform and iterate on a detailed capacity plan exercise that
includes SKU, autoscale settings, IP addressing, and failover considerations.
Cluster architecture:Cluster architecture: Enable cluster autoscaler to automatically adjust the number of agent nodes in
response workload demands.
Cluster architecture:Cluster architecture: Use the Horizontal pod autoscaler to adjust the number of pods in a deployment
depending on CPU utilization or other select metrics.
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Cluster and workload architectures:Cluster and workload architectures: Develop a detailed
capacity plan and continually review and revise.
After formalizing your capacity plan, it should be frequently
updated by continuously observing the resource utilization
of the cluster.
Cluster architecture:Cluster architecture: Enable cluster autoscaler to
automatically adjust the number of agent nodes in response
to resource constraints.
The ability to automatically scale up or down the number of
nodes in your AKS cluster lets you run an efficient, cost-
effective cluster.
Cluster and workload architectures:Cluster and workload architectures: Separate
workloads into different node pools and consider scaling
user node pools.
Unlike System node pools that always require running
nodes, user node pools allow you to scale up or down.
Workload architecture:Workload architecture: Use AKS advanced scheduler
features.
Helps control balancing of resources for workloads that
require them.
Workload architecture:Workload architecture: Use meaningful workload scaling
metrics.
Not all scale decisions can be derived from CPU or memory
metrics. Often scale considerations will come from more
complex or even external data points. Use KEDA to build a
meaningful auto scale ruleset based on signals that are
specific to your workload.
Policy definitionsPolicy definitions
Cluster and workload architectureCluster and workload architecture
Additional resources
Azure Architecture Center guidanceAzure Architecture Center guidance
Cluster and workload architectures:Cluster and workload architectures: Perform ongoing load testing activities that exercise both the pod
and cluster autoscaler.
Cluster and workload architectures:Cluster and workload architectures: Separate workloads into different node pools allowing independent
scalling.
Explore the following table of recommendations to optimize your Azure Kubernetes Service configuration for
performance.
For more suggestions, see Principles of the performance efficiency pillar.
Azure Policy offers various built-in policy definitions that apply to both the Azure resource and AKS like standard
policy definitions, and using the Azure Policy add-on for Kubernetes, also within the cluster. Many of the Azure
resource policies come in both
Audit/Deny
, but also in a
Deploy If Not Exists
variant.
There are a numerous number of policies, and key policies related to this pillar are summarized here. For a more
detailed view, see built-in policy definitions for Kubernetes.
CPU and memory resource limits
In addition to the built-in policies, custom policies can be created for both the AKS resource and for the Azure
Policy add-on for Kubernetes. This allows you to add additional security constraints you'd like to enforce in your
cluster and workload architecture.
AKS baseline architecture
Advanced AKS microservices architecture
AKS cluster for a PCI-DSS workload
AKS baseline for multiregion clusters
Cloud Adoption Framework guidanceCloud Adoption Framework guidance
Next steps
AKS Landing Zone Accelerator
Deploy an Azure Kubernetes Service (AKS) cluster using the Azure CLI Quickstart: Deploy an Azure
Kubernetes Service (AKS) cluster using the Azure CLI
Azure Functions and security
12/16/2022 2 minutes to read • Edit Online
Design consideration checklist
Design consideration recommendations
A Z URE F UN C T ION S DESIGN REC O M M ENDAT IO N SA Z URE F UN C T ION S DESIGN REC O M M ENDAT IO N S DESC RIP T IONDESC RIP T ION
Evaluate if Azure Functions requires HTTP trigger. Azure Functions supports multiple specific triggers and
bindings. These include Azure Blob storage, Azure Cosmos
DB, Azure Service Bus, and many more. If HTTP trigger is
needed, then consider protecting that HTTP endpoint like
any other web application. Common protection measures
include keeping HTTP endpoint internal to specific Azure
virtual networks by using Private endpoint connections or
service endpoints. Consider using guidance available on
Azure Functions networking options for more information. If
Functions HTTP endpoint will be exposed to internet, then
it's recommended to secure the endpoint behind a web
application firewall (WAF).
Treat Azure Functions code just like any other code. Subject Azure Functions code to code scanning tools that
are integrated with CI/CD pipeline.
Use guidance available on Securing Azure Functions. This guidance addresses key security concerns such as
operations, deployment, and network security.
Consider using Azure Functions Proxy to act as a facade. Functions Proxy can inspect and modify incoming requests
and responses.
Next step
Azure Functions is a cloud service available on-demand that provides all the continually updated infrastructure
and resources needed to run your applications. Functions allow you to write less code, maintain less
infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud
infrastructure provides all the up-to-date resources needed to keep your applications securely running.
For more information related to network security, reference Securing Azure Functions.
The following sections include a design consideration checklist and recommendations specific to Azure
Functions, and security.
Have you designed your workload and configured Azure Functions with security in mind?Have you designed your workload and configured Azure Functions with security in mind?
Evaluate if Azure Functions requires HTTP trigger.
Treat Azure Functions code just like any other code.
Use guidance available on Securing Azure Functions.
Consider using Azure Functions Proxy to act as a facade.
The following table reflects design consideration recommendations and descriptions related to Azure Functions:
Azure Service Fabric and reliability
Azure Well
-
Architected Framework review
-
Virtual
Machines
12/16/2022 9 minutes to read • Edit Online
Prerequisites
Reliability
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Review SLAs for virtual machines. When defining test availability and recovery targets, make
sure you have a good understanding of the SLAs offered for
VMs.
Virtual Machines is an on-demand, scalable computing resource that gives you the flexibility of virtualization
without having to buy and maintain physical hardware to run it.
In this article, you learn architectural best practices for Azure Virtual Machines. The guidance is based on the five
pillars of architectural excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
Understanding the Well-Architected Framework pillars can help produce a high quality, stable, and
efficient cloud architecture. We recommend that you review your workload using the Microsoft Azure
Well-Architected Review assessment.
Use a reference architecture to review the considerations based on the guidance provided in this article.
We recommend, you start with Run a Linux VM on Azure.
As you make design choices for virtual machines, review the design principles for adding reliability to the
architecture.
Review the SLAs for virtual machines.
VMs should be deployed in a scale set using the Flexible orchestration mode.
Deployed VMs across Availability Zones .
Install applications on data disks.
Use maintenance control.
Explore the following table of recommendations to optimize your Virtual Machine configuration for service
reliability:
Deploy using Flexible scale sets. Even single instance VMs should be deployed into a scale set
using the Flexible orchestration mode to future-proof your
application for scaling and availability. Flexible orchestration
offers high availability guarantees (up to 1000 VMs) by
spreading VMs across fault domains in a region or within an
Availability Zone.
Deploy across availability zones Azure availability zones are physically separate locations
within each Azure region that are tolerant to local failures.
Install applications on data disks. Having your data on a separate disk from your OS disk
makes it easier to recover from failures and to migrate
workloads.
Use maintenance control Control when VM maintenance occurs to manage the timing
of system restarts.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Security
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Consider using Azure Bastion Authentication and access control using Azure Bastion
provides secure and seamless RDP/SSH connectivity to your
virtual machines directly from the Azure portal over TLS
Protect against malware Install antimalware protection to help identify and remove
viruses.
Manage updates Use a solution like Azure Automation to manage operating
system updates.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.
This article provides an overview of the core Azure security features that can be used with virtual machines.
As you make design choices for virtual machines, review the security principles and Security best practices for
adding security to the architecture.
As you make design choices for your virtual machine deployment, review the design principles for security.
Review the Linux security baseline
Review the Windows security baseline
Manage authentication and access control.
Protect against malware
Managed updates
Encryption
Explore the following table of recommendations to optimize your virtual machine configuration for security.
Monitor for security To monitor the security posture of your Windows and Linux
VMs, use Microsoft Defender for Cloud.
Use encryption Use Azure Disk Encryption to protect your data.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Policy definitionsPolicy definitions
For more suggestions, see Principles of the security pillar.
Azure Advisor helps you ensure and improve security. Review the recommendations.
Deploy default Microsoft IaaSAntimalware extension for Windows Server - This policy deploys a Microsoft
IaaSAntimalware extension with a default configuration when a VM is not configured with the antimalware
extension.
Microsoft IaaSAntimalware extension should be deployed on Windows servers - This policy audits any
Windows server VM without Microsoft IaaSAntimalware extension deployed.
Only approved VM extensions should be installed - This policy governs the virtual machine extensions that
are not approved.
Managed disks should be double encrypted with both platform-managed and customer-managed keys - High
security sensitive customers who are concerned of the risk associated with any particular encryption
algorithm, implementation, or key being compromised can opt for additional layer of encryption using a
different encryption algorithm/mode at the infrastructure layer using platform managed encryption keys.
The disk encryption sets are required to use double encryption. Learn more at https://aka.ms/disks-
doubleEncryption.
Managed disks should use a specific set of disk encryption sets for the customer-managed key encryption -
Requiring a specific set of disk encryption sets to be used with managed disks give you control over the keys
used for encryption at rest. You are able to select the allowed encrypted sets and all others are rejected when
attached to a disk. Learn more at https://aka.ms/disks-cmk.
Microsoft Antimalware for Azure should be configured to automatically update protection signatures - This
policy audits any Windows virtual machine not configured with automatic update of Microsoft Antimalware
protection signatures.
OS and data disks should be encrypted with a customer-managed key - Use customer-managed keys to
manage the encryption at rest of the contents of your managed disks. By default, the data is encrypted at rest
with platform-managed keys, but customer-managed keys are commonly required to meet regulatory
compliance standards. Customer-managed keys enable the data to be encrypted with an Azure Key Vault key
created and owned by you. You have full control and responsibility for the key lifecycle, including rotation
and management. Learn more at https://aka.ms/disks-cmk.
Virtual machines and virtual machine scale sets should have encryption at host enabled - Use encryption at
host to get end-to-end encryption for your virtual machine and virtual machine scale set data. Encryption at
host enables encryption at rest for your temporary disk and OS/data disk caches. Temporary and ephemeral
OS disks are encrypted with platform-managed keys when encryption at host is enabled. OS/data disk
caches are encrypted at rest with either customer-managed or platform-managed key, depending on the
encryption type selected on the disk. Learn more at https://aka.ms/vm-hbe.
Require automatic OS image patching on Virtual Machine Scale Sets - This policy enforces enabling
automatic OS image patching on Virtual Machine Scale Sets to always keep virtual Machines secure by safely
applying latest security patches every month.
All built-in policy definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for
Azure Virtual Machines.
Cost optimization
Design considerationsDesign considerations
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Stop VMs during off-hours Configuring start and stop times will shut down instances
that aren't in use. The feature is suitable as a low-cost
automation option.
Use Spot VMs when appropriate. Spot VMs are ideal for workloads that can be interrupted,
such as highly parallel batch processing jobs. These VMs
take advantage of the surplus capacity in Azure at a lower
cost. They're also well suited for experimenting, developing,
and testing large-scale solutions. Check out our Azure
Virtual Machine Spot Eviction guide to learn how to create a
reliable interruptible workload in Azure.
Right-size your VMs Identify the best VM for your workloads with the virtual
machines selector. See Windows and Linux pricing.
Prepay for added cost savings Purchasing reserved instances is a way to reduce Azure costs
for workloads with stable usage. Make sure you manage
usage. If usage is too low, then you're paying for resources
that aren't used. Keep reserved instances simple and keep
management overhead low to prevent increasing cost.
Use existing licensing through the hybrid benefit licensing
program
Hybrid benefit licensing is available for both Linux and
Windows
Policy definitionsPolicy definitions
Operational excellence
To optimize costs, review the design principles.
To estimate costs related to virtual machines, use these tools.
Identify the best VM for your workloads with the virtual machines selector. For more information, see
Linuxand Windows pricing.
Use this pricing calculator to configure and estimate the costs of your Azure VMs.
Shut down VM instances which aren't in use.
Use Spot VMs when appropriate.
Choose the right VM size for your workload.
Use Zone to Zone disaster recovery for virtual machines.
Prepay for reserved instances for one year, three years, or more.
Use hybrid benefit licensing
Explore the following table of recommendations to optimize your Virtual Machine configuration for service cost:
Azure Advisor helps you ensure and improve cost optimization. Review the recommendations.
Consider setting an Allowed virtual machine SKU policy to limit the sizes that can be used.
All built-in policy definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for
Azure Virtual Machines.
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Monitor and measure health In a production environment, it's important to monitor the
health, and performance of your VMs.
Automate tasks Building automation reduces deviations from your plans and
reduces that time it takes to manage your workload.
Build a robust testing environment Ideally, an organization will have multiple environments in
which to test deployments. These test environments should
be similar enough to production that deployment and run
time issues are detected before deployment to production.
Right-size your VMs Choose the right VM family for your workload.
Manage your quota Plan what level of quota will be required and review that
level regularly as the workload evolves and grows and
request changes early
Policy definitionsPolicy definitions
Performance efficiency
Design checklistDesign checklist
To ensure operational excellence, review the design principles.
Monitor and measure health.
Automate tasks like provisioning and updating.
Build a robust testing environment.
Right size your VMs.
Manage your quota.
For more suggestions, see Principles of the operational excellence pillar.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
recommendations.
Consider setting an Allowed virtual machine SKU policy
All built-in policy definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for
Azure Virtual Machines.
Performance efficiency is matching the resources that are available to an application with the demand that it's
receiving. Performance efficiency includes scaling resources, identifying and optimizing potential bottlenecks,
and optimizing your application code for peak performance.
As you make design choices for your virtual machine deployment, review Microsoft Azure Well-Architected
Framework - Performance efficiency for performance and efficiency.
Reduce latency by deploying VMs closer together in proximity placement groups
Convert disks from standard HDD to premium SSD
Enable Accelerated Networking to improve network performance and latency
Autoscale your Flexible scale sets.
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Reduce latency Consider deploying VMs in Creating and using proximity
placement groups using PowerShell.
Convert disks from standard HDD to premium SSD Azure premium SSDs deliver high-performance and low-
latency disk support for virtual machines (VMs) with
input/output (IO)-intensive workloads.
Consider accelerated networking Accelerated networking enables single root I/O virtualization
(SR-IOV) to a VM, greatly improving its networking
performance.
Use autoscaling Automatically increase or decrease the number of VM
instances that run your application with autoscaling.
Azure Advisor recommendations
Additional resources
Cost analysisCost analysis
Next steps
Explore the following table of recommendations to optimize your virtual machine deployment configuration for
performance and efficiency.
For more suggestions, see Principles of the performance efficiency pillar.
Azure Advisor helps you ensure and improve performance. Review the recommendations.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost
effectiveness, performance, and operational excellence of your Virtual Machines.
Reliability
Cost Optimization
Performance
Operational excellence
Here are other resources to help you query for unhealthy instances.
Planned versus actual spending can be managed through Azure Cost Management + Billing. There are several
options for grouping resources by billing unit.
Use the recommendations as you provision virtual machines for your solution.
Learn module: Introduction to Azure virtual machines
Review the Virtual Machine recommendations provided by Azure Advisor.
Review the built-in definitions provided by Azure Policy that apply to Virtual Machines. All built-in policy
definitions related to Azure Virtual Machines are listed in Azure Policy built-in definitions for Azure
Virtual Machines.
Azure Cache for Redis and reliability
12/16/2022 4 minutes to read • Edit Online
Design considerations
Azure Cache for Redis provides an in-memory data store based on the Redis (Remote Dictionary Server)
software. It's a secure data cache and messaging broker that provides high throughput and low-latency access
to data for applications.
Key concepts and best practices that support reliability include:
High availability
Failover and patching
Connection resilience
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Cache for Redis.
The Azure Cache for Redis Service Level Agreements (SLA) covers only Standard and Premium tier caches. Basic
tier isn't covered.
Redis is an in-memory cache for key value pairs and has High Availability (HA), by default, except for Basic tier.
There are three tiers for Azure Cache for Redis:
Basic
:
Not recommended for production workloads
. Basic tier is ideal for:
Single node
Multiple sizes
Development
Test
Non-critical workloads
Standard
: A replicated cache in a two-node primary and secondary configuration managed by Microsoft,
with a high availability SLA.
Premium
: Includes all standard-tier features and includes the following other features:
Faster hardware and performance compared to Basic or Standard tier.
Larger cache size, up to 120GB .
Data persistence, which includes Redis Database File (RDB) and Append Only File (AOF).
VNET support.
Clustering
Geo-Replication: A secondary cache is in another region and replicates data from the primary for
disaster recovery. To failover to the secondary, the caches need to be unlinked manually and then the
secondary is available for writes. The application writing to Redis needs to be updated with the
secondary's cache connection string.
Availability Zones: Deploy the cache and replicas across availability zones.
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Schedule updates. Schedule the days and times that Redis Server updates will
be applied to the cache, which doesn't include Azure
updates, or updates to the VM operating system.
Monitor the cache and set alerts. Set alerts for exceptions, high CPU, high memory usage,
server load, and evicted keys for insights about when to
scale the cache. If the cache needs to be scaled,
understanding when to scale is important because it will
increase CPU during the scaling event to migrate data.
Deploy the cache within a VNET. Gives the customer more control over the traffic that can
connect to the cache. Make sure that the subnet has
sufficient address space available to deploy the cache nodes
and shards (cluster).
NOTENOTE
Import and export.
By default, each deployment will have one replica per shard. Persistence, clustering, and geo-replication are
all disabled at this time with deployments that have more than one replica. Your nodes will be distributed
evenly across all zones. You should have a replica count >= number of zones.
Microsoft guarantees at least 99.9% of the time that customers will have connectivity between the Cache
Endpoints and Microsoft's Internet gateway.
Have you configured Azure Cache for Redis with resiliency in mind?Have you configured Azure Cache for Redis with resiliency in mind?
Schedule updates.
Monitor the cache and set alerts.
Deploy the cache within a VNET.
Evaluate a partitioning strategy within Redis cache.
Configure Data PersistenceData Persistence to save a copy of the cache to Azure Storage or use Geo-Replication, depending
on the business requirement.
Implement retry policies in the context of your Azure Redis Cache.
Use one static or singleton implementation of the connection multiplexer to Redis and follow the best
practices guide.
Review How to administer Azure Cache for Redis.
Explore the following table of recommendations to optimize your Azure Cache for Redis configuration for
service reliability:
Evaluate a partitioning strategy within Redis cache. Partitioning a Redis data store involves splitting the data
across instances of the Redis server. Each instance makes up
a single partition. Azure Redis Cache abstracts the Redis
services behind a facade and doesn't expose them directly.
The simplest way to implement partitioning is to create
multiple Azure Redis Cache instances and spread the data
across them. You can associate each data item with an
identifier (a partition key) that specifies which cache stores
the data item. The client application logic can then use this
identifier to route requests to the appropriate partition. This
scheme is simple, but if the partitioning scheme changes (for
example, if extra Azure Redis Cache instances are created),
client applications may need to be reconfigured.
Configure Data PersistenceData Persistence to save a copy of the cache to
Azure Storage or use Geo-Replication, depending on the
business requirement.
Data Persistence
: if the master and replica reboot, the data
will be loaded automatically from the storage account.
Geo-
Replication
: The secondary cache needs to be unlinked from
the primary. The secondary will now become the primary
and can receive
writes
.
Implement retry policies in the context of your Azure Redis
Cache.
Most Azure services and client SDKs include a retry
mechanism. These mechanisms differ because each service
has different characteristics and requirements. Each retry
mechanism is tuned to a specific service.
Review How to administer Azure Cache for Redis. Understand how data loss can occur with cache reboots and
how to test the application for resiliency.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
| where type == 'microsoft.cache/redis'
| where properties.sku.name != 'Premium'
Next step
To identify Redis instances that aren't on the Premium tier, use the following query:
Azure Cache for Redis and operational excellence
Azure Cache for Redis and operational excellence
12/16/2022 4 minutes to read • Edit Online
Design considerations
Azure Cache for Redis provides an in-memory data store based on the Redis (Remote Dictionary Server)
software. It's a secure data cache and messaging broker that provides high throughput and low-latency access
to data for applications.
Best practices that support operational excellence include:
Server load management
Memory management
Performance testing
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Cache for Redis.
The Azure Cache for Redis Service Level Agreements (SLA) covers only Standard and Premium tier caches. Basic
tier isn't covered.
Redis is an in-memory cache for key value pairs and has High Availability (HA), by default, except for Basic tier.
There are three tiers for Azure Cache for Redis:
Basic
:
Not recommended for production workloads
. Basic tier is ideal for:
Single node
Multiple sizes
Development
Test
Non-critical workloads
Standard
: A replicated cache in a two-node primary and secondary configuration managed by Microsoft,
with a high availability SLA.
Premium
: Includes all standard-tier features and includes the following other features:
Faster hardware and performance compared to Basic or Standard tier.
Larger cache size, up to 120GB .
Data persistence, which includes Redis Database File (RDB) and Append Only File (AOF).
VNET support.
Clustering
Geo-Replication: A secondary cache is in another region and replicates data from the primary for
disaster recovery. To failover to the secondary, the caches need to be unlinked manually and then the
secondary is available for writes. The application writing to Redis will need to be updated with the
secondary's cache connection string.
Availability Zones: Deploy the cache and replicas across availability zones.
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Schedule updates. Schedule the days and times that Redis Server updates will
be applied to the cache, which doesn't include Azure
updates, or updates to the VM operating system.
Monitor the cache and set alerts. Set alerts for exceptions, high CPU, high memory usage,
server load, and evicted keys for insights about when to
scale the cache. If the cache needs to be scaled,
understanding when to scale is important because it will
increase CPU during the scaling event to migrate data.
Deploy the cache within a VNET. Gives the customer more control over the traffic that can
connect to the cache. Make sure that the subnet has
sufficient address space available to deploy the cache nodes
and shards (cluster).
NOTENOTE
Import and export.
By default, each deployment will have one replica per shard. Persistence, clustering, and geo-replication are
all disabled at this time with deployments that have more than one replica. Your nodes will be distributed
evenly across all zones. You should have a replica count >= number of zones.
Microsoft guarantees at least 99.9% of the time that customers will have connectivity between the Cache
Endpoints and Microsoft's Internet gateway.
Have you configured Azure Cache for Redis with operational excellence in mind?Have you configured Azure Cache for Redis with operational excellence in mind?
Schedule updates.
Monitor the cache and set alerts.
Deploy the cache within a VNET.
Use the correct caching type (local, in role, managed, redis) within your solution.
Configure Data PersistenceData Persistence to save a copy of the cache to Azure Storage or use Geo-Replication, depending
on the business requirement.
Use one static or singleton implementation of the connection multiplexer to Redis and follow the best
practices guide.
Review How to administer Azure Cache for Redis.
Explore the following table of recommendations to optimize your Azure Cache for Redis configuration for
operational excellence:
Use the correct caching type (local, in role, managed, redis)
within your solution.
Distributed applications typically implement either or both of
the following strategies when caching data:
- Using a private cache, where data is held locally on the
machine that's running an instance of an application or
service.
- Using a shared cache, serving as a common source that
can be accessed by multiple processes and machines.
In both cases, caching can be performed client-side and
server-side. Client-side caching is done by the process that
provides the user interface for a system, such as a web
browser or desktop application. Server-side caching is done
by the process that provides the business services that are
running remotely.
Configure Data PersistenceData Persistence to save a copy of the cache to
Azure Storage or use Geo-Replication, depending on the
business requirement.
Data Persistence
: If the master and replica reboot, the data
will be loaded automatically from the storage account.
Geo-
Replication
: The secondary cache needs to be unlinked from
the primary. The secondary will now become the primary
and can receive
writes
.
Review How to administer Azure Cache for Redis. Understand how data loss can occur with cache reboots and
how to test the application for resiliency.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
| where type == 'microsoft.cache/redis'
| where properties.sku.name != 'Premium'
Next step
To identify Redis instances that aren't on the Premium tier, use the following query:
Azure Databricks and security
Azure Databricks and security
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Ensure that the cloud workspaces for your analytics are only
accessible by properly managed users.
Azure Active Directory can handle single sign-on for remote
access. For extra security, reference Conditional Access.
Azure Databricks is a data analytics platform optimized for Azure cloud services. It offers three environments for
developing data intensive applications:
Databricks SQL
Databricks Data Science and Engineering
Databricks Machine Learning
To learn more about how Azure Databricks improves the security of big data analytics, reference Azure
Databricks concepts.
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Databricks.
All users' notebooks and notebook results are encrypted at rest, by default. If other requirements are in place,
consider using customer-managed keys for notebooks.
Have you configured Azure Databricks with security in mind?Have you configured Azure Databricks with security in mind?
Use Azure Active Directory credential passthrough to avoid the need for service principals when
communicating with Azure Data Lake Storage.
Isolate your workspaces, compute, and data from public access. Make sure that only the right people have
access and only through secure channels.
Ensure that the cloud workspaces for your analytics are only accessible by properly managed users.
Implement Azure Private Link.
Restrict and monitor your virtual machines.
Use Dynamic IP access lists to allow admins to access workspaces only from their corporate networks.
Use the VNet injection functionality to enable more secure scenarios.
Use diagnostic logs to audit workspace access and permissions.
Consider using the Secure cluster connectivity feature and hub/spoke architecture to prevent opening ports,
and assigning public IP addresses on cluster nodes.
Explore the following table of recommendations to optimize your Azure Databricks configuration for security:
Implement Azure Private Link. Ensure all traffic between users of your platform, the
notebooks, and the compute clusters that process queries
are encrypted and transmitted over the cloud provider's
network backbone, inaccessible to the outside world.
Restrict and monitor your virtual machines. Clusters, which execute queries, should have SSH and
network access restricted to prevent installation of arbitrary
packages. Clusters should use only images that are
periodically scanned for vulnerabilities.
Use the VNet injection functionality to enable more secure
scenarios.
Such as:
- Connecting to other Azure services using service
endpoints.
- Connecting to on-premises data sources, taking advantage
of user-defined routes.
- Connecting to a network virtual appliance to inspect all
outbound traffic and take actions according to allow and
deny rules.
- Using custom DNS.
- Deploying Azure Databricks clusters in existing virtual
networks.
Use diagnostic logs to audit workspace access and
permissions.
Use audit logs to see privileged activity in a workspace,
cluster resizing, files, and folders shared on the cluster.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Next step
Azure Databricks source artifacts include the Databricks blog: Best practices to secure an enterprise-scale data
platform.
Azure Database for MySQL and cost optimization
Azure Database for MySQL and cost optimization
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Choose the appropriate server size for your workload. Configuration options: Single Server and Flexible Server.
Azure Database for MySQL is a relational database service in the Microsoft cloud based on the MySQL
Community Edition. You can use either Single Server or Flexible Server to host a MySQL database in Azure. It's a
fully managed database as a service offering that can handle mission-critical workloads with predictable
performance and dynamic scalability.
For more information about how Azure Database for MySQL supports cost optimization for your workload,
reference Server concepts, specifically, Stop/Start an Azure Database for MySQL.
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Database for MySQL.
Azure Database for MySQL includes the following design considerations:
Take advantage of the scaling capabilities of Azure Database for MySQL to lower consumption cost whenever
possible. To scale your database up and down, as needed, reference the following Microsoft Support article,
which covers the automation process using runbooks: How to autoscale an Azure Database for
MySQL/PostgreSQL instance with Azure run books and Python.
Plan your Recovery Point Objective (RPO) according to your operation level requirement. There's no extra
charge for backup storage for up to 100% of your total provisioned server storage. Extra consumption of
backup storage will be charged in GB/month .
The cloud native design of the Single-Server service allows it to support 99.99% of availability, eliminating
the cost of passive
hot
standby.
Consider using Flexible Server SKU for non-production workloads. Flexible servers provide better cost
optimization controls with ability to stop and start your server. They provide a burstable compute tier that is
ideal for workloads that don't need continuous full compute capacity.
Have you configured Azure Database for MySQL with cost optimization in mind?Have you configured Azure Database for MySQL with cost optimization in mind?
Choose the appropriate server size for your workload.
Consider Reserved Capacity for Azure Database for MySQL Single Server.
Explore the following table of recommendations to optimize your Azure Database for MySQL configuration for
cost optimization:
Consider Reserved Capacity for Azure Database for MySQL
Single Server.
Compute costs associated with Azure Database For MySQL
Single Server Reservation Discount. Once you've determined
the total compute capacity and performance tier for Azure
Database for MySQL in a region, this information can be
used to reserve the capacity. The reservation can span one
or three years. You can realize significant cost optimization
with this commitment.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Azure Database for PostgreSQL and cost optimization
Azure Database for PostgreSQL and cost
optimization
12/16/2022 2 minutes to read • Edit Online
Design considerations
Azure Database for PostgreSQL is a relational database service in the Microsoft cloud based on the PostgreSQL
Community Edition.
You can choose between three deployment modes, such as Single Server, Flexible Server, and Hyperscale
(Citus):
Single Server
: A fully managed database service with minimal requirements for database customizations.
This platform is designed to handle most database management functions, such as:
Patching
Backups
High Availability
Security with minimal user configuration and control
This service offers three pricing tiers and allows you to pay only for the resources you need, and
only when you need them.
Flexible Server
: A fully managed database service designed to provide more granular control and
flexibility over database management functions and configuration settings based on user requirements.
This service provides better cost optimization controls because you can stop and start the server, and the
burstable compute tier.
Hyperscale
: This option uses sharding to horizontally scale across multiple machines. Hyperscale is best
for applications that require greater scale where workloads approach 100GB of data.
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Database for PostgreSQL.
Azure Database for PostgreSQL includes the following design considerations:
Hyperscale (Citus) provides dynamic scalability without the cost of manual sharding with low application
rearchitecture required.
Distributing table rows across multiple PostgreSQL servers is a key technique for scalable queries in
Hyperscale (Citus). Together, multiple nodes can hold more data than a traditional database, and in many
cases can use worker CPUs in parallel to execute queries potentially lowering the database costs. Follow
this Shard data on worker nodes tutorial to practice this potential savings architecture pattern.
Consider using Flexible Server SKU for non-production workloads.
Flexible servers provide better cost optimization controls with ability to stop and start your server, and
burstable compute tier that is ideal for workloads that don't need continuous full compute capacity.
Plan your Recovery Point Objective (RPO) according to your operation level requirement.
There's no extra charge for backup storage for up to 100% of your total provisioned server storage. Extra
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Choose the appropriate server size for your workload. Configuration options: Single Server, Flexible Server,
Hyperscale (Citus).
Consider Reserved Capacity for Azure Database for
PostgreSQL Single Server and Hyperscale (Citus).
Compute costs associated with Azure Database For
PostgreSQL Single Server Reservation Discount and
Hyperscale (Citus) Reservation Discount. Once the total
compute capacity and performance tier for Azure Database
for PostgreSQL in a region is determined, this information
can be used to reserve the capacity. The reservation can
span one or three years. You can realize significant cost
optimization with this commitment.
Next step
consumption of backup storage will be charged in GB/month.
Take advantage of the scaling capabilities of Azure Database for PostgreSQL to lower consumption cost
whenever possible.
This Microsoft Support article about How to autoscale an Azure Database for MySQL/PostgreSQL
instance with Azure runbooks and Python covers the automation process using runbooks to scale your
database up and down, as needed.
The cloud native design of the Single- Server service allows it to support 99.99% of availability
eliminating the cost of passive
hot
standby.
Have you configured Azure Database for PostgreSQL with cost optimization in mind?Have you configured Azure Database for PostgreSQL with cost optimization in mind?
Choose the appropriate server size for your workload.
Consider Reserved Capacity for Azure Database for PostgreSQL Single Server and Hyperscale (Citus).
Explore the following table of recommendations to optimize your Azure Database for PostgreSQL configuration
for cost optimization:
Azure SQL Database and reliability
Azure Well
-
Architected Framework review
-
Azure
SQL Database
12/16/2022 19 minutes to read • Edit Online
Prerequisites
Azure SQL Database and reliability
Design considerationsDesign considerations
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include upgrades, patches,
backups, and monitoring.
The single database resource type creates a database in Azure SQL Database with its own set of resources and is
managed via a logical server. You can choose between the DTU-based purchasing model or vCore-based
purchasing model. You can create multiple databases in a single resource pool, with elastic pools.
The following sections include a design checklist and recommended design options specific to Azure SQL
Database security. The guidance is based on the five pillars of architectural excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
Understanding the Well-Architected Framework pillars can help produce a high quality, stable, and
efficient cloud architecture. Check out the Azure Well-Architected Framework overview page to review
the five pillars of architectural excellence.
Review the core concepts of Azure SQL Database and What's new in Azure SQL Database?.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades
Patches
Backups
Monitoring
This service allows you to create a highly available and high-performance data storage layer for your Azure
applications and workloads. Azure SQL Database is always running on the latest stable version of the SQL
Server database engine and patched OS with 99.99% availability.
For more information about how Azure SQL Database promotes reliability and enables your business to
continue operating during disruptions, reference Availability capabilities.
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure SQL Database and reliability.
Azure SQL Database includes the following design considerations:
ChecklistChecklist
Azure SQL Database Business Critical tier configured with geo-replication has a guaranteed Recovery
time objective (RTO) of 30 seconds for 100% of deployed hours.
Use
sharding
to distribute data and processes across many identically structured databases. Sharding
provides an alternative to traditional scale-up approaches for cost and elasticity. Consider using sharding
to partition the database horizontally. Sharding can provide fault isolation. For more information,
reference Scaling out with Azure SQL Database.
Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant Deployments,
General Purpose, Standard, or Basic tiers, or Hyperscale tier with two or more replicas have an availability
guarantee. For more information about the availability guarantee, reference SLA for Azure SQL Database.
Provides built-in regional high availability and turnkey geo-replication to any Azure region. It includes
intelligence to support self-driving features, such as:
Performance tuning
Threat monitoring
Vulnerability assessments
Fully automated patching and updating of the code base
Define an application performance SLA and monitor it with alerts. Quickly detect when your application
performance inadvertently degrades below an acceptable level, which is important to maintain high
resiliency. Use the monitoring solution previously defined to set alerts on key query performance metrics
so you can take action when the performance breaks the SLA. Go to Monitor Your Database and alerting
tools for more information.
Use geo-restore to recover from a service outage. You can restore a database on any SQL Database
server or an instance database on any managed instance in any Azure region from the most recent geo-
replicated backups. Geo-restore uses a geo-replicated backup as its source. You can request geo-restore
even if the database or datacenter is inaccessible because of an outage. Geo-restore restores a database
from a geo-redundant backup. For more information, reference Recover an Azure SQL database using
automated database backups.
Use the Business Critical tier configured with geo-replication, which has a guaranteed Recovery point
objective (RPO) of 5 seconds for 100% of deployed hours.
PaaS capabilities built into Azure SQL Database enable you to focus on the domain-specific database
administration and optimization activities that are critical for your business.
Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an
earlier point in time to recover data from changes done inadvertently. For more information, read the
Point-in-time restore (PITR) documentation.
Business Critical or Premium tiers are configured as Zone Redundant Deployments which have an
availability guarantee. For more information about the availability guarantee, reference SLA for Azure
SQL Database.
Have you configured Azure SQL Database with reliability in mind?Have you configured Azure SQL Database with reliability in mind?
Use Active Geo-Replication to create a readable secondary in a different region.
Use Auto Failover Groups that can include one or multiple databases, typically used by the same application.
Use a Zone-Redundant database.
Monitor your Azure SQL Database in near-real time to detect reliability incidents.
Implement Retry Logic.
Back up your keys.
Configuration recommendationsConfiguration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use Active Geo-Replication to create a readable secondary in
a different region.
If your primary database fails, perform a manual failover to
the secondary database. Until you fail over, the secondary
database remains read-only. Active geo-replication enables
you to create readable replicas and manually failover to any
replica if there is a datacenter outage or application upgrade.
Up to four secondaries are supported in the same or
different regions, and the secondaries can also be used for
read-only access queries. The failover must be initiated
manually by the application or the user. After failover, the
new primary has a different connection end point.
Use Auto Failover Groups that can include one or multiple
databases, typically used by the same application.
You can use the readable secondary databases to offload
read-only query workloads. Because autofailover groups
involve multiple databases, these databases must be
configured on the primary server. Autofailover groups
support replication of all databases in the group to only one
secondary server or instance in a different region. Learn
more about AutoFailover Groups and DR design.
Use a Zone-Redundant database. By default, the cluster of nodes for the premium availability
model is created in the same datacenter. With the
introduction of Azure Availability Zones, SQL Database can
place different replicas of the Business Critical database to
different availability zones in the same region. To eliminate a
single point of failure, the control ring is also duplicated
across multiple zones as three gateway rings (GW). The
routing to a specific gateway ring is controlled by Azure
Traffic Manager (ATM). Because the zone redundant
configuration in the Premium or Business Critical service
tiers doesn't create extra database redundancy, you can
enable it at no extra cost. Learn more about Zone-
redundant databases.
Monitor your Azure SQL Database in near-real time to
detect reliability incidents.
Use one of the available solutions to monitor SQL DB to
detect potential reliability incidents early and make your
databases more reliable. Choose a near real-time monitoring
solution to quickly react to incidents. Reference Azure SQL
Analytics for more information.
Implement Retry Logic. Although Azure SQL Database is resilient when it concerns
transitive infrastructure failures, these failures might affect
your connectivity. When a transient error occurs while
working with SQL Database, make sure your code can retry
the call. For more information, reference how to implement
retry logic.
Back up your keys. If you're not using encryption keys in Azure Key Vault to
protect your data, back up your keys.
Azure SQL Database and security
Explore the following table of recommendations to optimize your Azure SQL Database configuration for
reliability:
SQL Database provides a range of built-in security and compliance features to help your application meet
various security and compliance requirements.
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Review the minimum TLS version. Determine whether you have legacy applications that require
older TLS or unencrypted connections. After you enforce a
version of TLS, it's not possible to revert to the default.
Review and configure the minimum TLS version for SQL
Database connections via the Azure portal. If not, set the
latest TLS version to the minimum.
Ledger Consider designing database tables based on the Ledger to
provide auditing, tamper-evidence, and trust of all data
changes.
Always Encrypted Consider designing application access based around Always
Encrypted to protect sensitive data inside applications by
delegating data access to encryption keys.
Private endpoints and private link Private endpoint connections enforce secure communication
by enabling private connectivity to Azure SQL Database. You
can use a private endpoint to secure connections and deny
public network access by default. Azure Private Link for
Azure SQL Database is a type of private endpoint
recommended for Azure SQL Database.
Automated vulnerability assessments Monitor for vulnerability assessment scan results and
recommendations for how to remediate database
vulnerabilities.
Advanced Threat Protection Detect anomalous activities indicating unusual and
potentially harmful attempts to access or exploit databases
with Advanced Threat Protection for Azure SQL Database.
Advanced Threat Protection integrates its alerts with
Microsoft Defender for Cloud.
Auditing Track database events with Auditing for Azure SQL Database.
Have you designed your workload and configured Azure SQL Database with security in mind?Have you designed your workload and configured Azure SQL Database with security in mind?
Understand logical servers and how you can administer logins for multiple databases when appropriate.
Enable Azure AD authentication with Azure SQL. Azure AD authentication enables simplified permission
management and centralized identity management.
Azure SQL logical servers should have an Azure Active Directory administrator provisioned.
Verify contact information email address in your Azure Subscription for service administrator and co-
administrators is reaching the correct parties inside your enterprise. You don't want to miss or ignore
important security notifications from Azure!
Review the Azure SQL Database connectivity architecture. Choose the Redirect or Proxy connection policy
as appropriate.
Review Azure SQL Database firewall rules.
Use virtual network rules to control communication from particular subnets in virtual networks.
If using the Azure Firewall, configure Azure Firewall application rules with SQL FQDNs.
Managed identities Consider configuring a user-assigned managed identity
(UMI). Managed identities for Azure resources eliminate the
need to manage credentials in code.
Azure AD-only authentication Consider disabling SQL-based authentication and allowing
only on Azure AD authentication.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Policy definitionsPolicy definitions
Azure SQL Database and cost optimization
ChecklistChecklist
Configuration recommendationsConfiguration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Review the Azure security baseline for Azure SQL Database and Azure Policy built-in definitions.
All built-in policy definitions related to Azure SQL are listed in Built-in policies.
Review Tutorial: Secure a database in Azure SQL Database.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades
Patches
Backups
Monitoring
This service allows you to create a highly available and high-performance data storage layer for your Azure
applications and workloads. SQL Database includes built-in intelligence that helps you dramatically reduce the
costs of running and managing databases through automatic performance monitoring and tuning.
For more information about how Azure SQL Database provides cost-saving features, reference Plan and manage
costs for Azure SQL Database.
The following sections include a configuration checklist and recommended configuration options specific to
Azure SQL Database and cost optimization.
Have you configured Azure SQL Database with cost optimization in mind?Have you configured Azure SQL Database with cost optimization in mind?
Optimize queries.
Evaluate resource usage.
Fine-tune backup storage consumption.
Evaluate Azure SQL Database serverless.
Consider reserved capacity for Azure SQL Database.
Consider elastic pools for managing and scaling multiple databases.
Explore the following table of recommendations to optimize your Azure SQL Database configuration for cost
savings:
Optimize queries. Optimize the queries, tables, and databases using Query
Performance Insights and Performance Recommendations to
help reduce resource consumption, and arrive at appropriate
configuration.
Evaluate resource usage. Evaluate the resource usage for all databases and determine
if they've been sized and provisioned correctly. For non-
production databases, consider scaling resources down as
applicable. The DTUs or vCores for a database can be scaled
on demand, for example, when running a load test or user
acceptance test.
Fine-tune backup storage consumption For vCore databases in Azure SQL Database, the storage
consumed by each type of backup (full, differential, and log)
is reported on the database monitoring pane as a separate
metric. Backup storage consumption up to the maximum
data size for a database is not charged. Excess backup
storage consumption will depend on the workload and
maximum size of the individual databases. For more
information, see Backup storage consumption.
Evaluate Azure SQL Database Serverless. Consider using Azure SQL Database serverless over the
Provisioned Computing Tier. Serverless is a compute tier for
single databases that automatically scales compute based on
workload demand and bills for the amount of compute used
per second. The serverless compute tier also automatically
pauses databases during inactive periods when only storage
is billed. It automatically resumes databases when activity
returns. Azure SQL Database serverless isn't suited for all
scenarios. If you have a database with unpredictable or
bursty usage patterns interspersed with periods of low or
idle usage, serverless is a solution that can help you optimize
price-performance.
Consider reserved capacity for Azure SQL Database. You can reduce compute costs associated with Azure SQL
Database by using Reservation Discount. Once you've
determined the total compute capacity and performance tier
for Azure SQL databases in a region, you can use this
information to reserve the capacity. The reservation can
span one or three years. For more information, reference
Save costs for resources with reserved capacity.
Elastic pools help you manage and scale multiple databases
in Azure SQL Database
Azure SQL Database elastic pools are a simple, cost-effective
solution for managing and scaling multiple databases that
have varying and unpredictable usage demands. The
databases in an elastic pool are on a single server and share
a set number of resources at a set price. For more
information, see Elastic pools for managing and scaling
multiple databases.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Azure SQL Database and operational excellence
For more information, see Plan and manage costs for Azure SQL Database.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades
Design considerationsDesign considerations
Patches
Backups
Monitoring
This service allows you to create a highly available and high-performance data storage layer for your Azure
applications and workloads. Azure SQL Database provides advanced monitoring and tuning capabilities backed
by artificial intelligence to help you troubleshoot and maximize the performance of your databases and
solutions.
For more information about how Azure SQL Database promotes operational excellence and enables your
business to continue operating during disruptions, reference Monitoring and performance tuning in Azure SQL
Database.
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure SQL Database, and operational excellence.
Azure SQL Database includes the following design considerations:
Azure SQL Database Business Critical tier configured with geo-replication has a guaranteed Recovery
time objective (RTO) of 30 seconds for 100% of deployed hours.
Use
sharding
to distribute data and processes across many identically structured databases. Sharding
provides an alternative to traditional scale-up approaches for cost and elasticity. Consider using sharding
to partition the database horizontally. Sharding can provide fault isolation. For more information,
reference Scaling out with Azure SQL Database.
Azure SQL Database Business Critical or Premium tiers not configured for Zone Redundant Deployments,
General Purpose, Standard, or Basic tiers, or Hyperscale tier with two or more replicas have an availability
guarantee. For more information, reference SLA for Azure SQL Database.
Provides built-in regional high availability and turnkey geo-replication to any Azure region. It includes
intelligence to support self-driving features, such as:
Performance tuning
Threat monitoring
Vulnerability assessments
Fully automated patching and updating of the code base
Define an application performance SLA and monitor it with alerts. Quickly detect when your application
performance inadvertently degrades below an acceptable level, which is important to maintain high
resiliency. Use the monitoring solution previously defined to set alerts on key query performance metrics
so you can take action when the performance breaks the SLA. Go to Monitor Your Database for more
information.
Use geo-restore to recover from a service outage. You can restore a database on any SQL Database
server or an instance database on any managed instance in any Azure region from the most recent geo-
replicated backups. Geo-restore uses a geo-replicated backup as its source. You can request geo-restore
even if the database or datacenter is inaccessible because of an outage. Geo-restore restores a database
from a geo-redundant backup. For more information, reference Recover an Azure SQL database using
automated database backups.
Use the Business Critical tier configured with geo-replication, which has a guaranteed Recovery point
objective (RPO) of 5 seconds for 100% of deployed hours.
PaaS capabilities built into Azure SQL Database enable you to focus on the domain-specific database
administration and optimization activities that are critical for your business.
ChecklistChecklist
Configuration recommendationsConfiguration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use Active Geo-Replication to create a readable secondary in
a different region.
If your primary database fails, perform a manual failover to
the secondary database. Until you fail over, the secondary
database remains read-only. Active geo-replication enables
you to create readable replicas and manually failover to any
replica if there is a datacenter outage or application upgrade.
Up to four secondaries are supported in the same or
different regions, and the secondaries can also be used for
read-only access queries. The failover must be initiated
manually by the application or the user. After failover, the
new primary has a different connection end point.
Use Auto Failover Groups that can include one or multiple
databases, typically used by the same application.
You can use the readable secondary databases to offload
read-only query workloads. Because autofailover groups
involve multiple databases, these databases must be
configured on the primary server. Autofailover groups
support replication of all databases in the group to only one
secondary server or instance in a different region. Learn
more about Auto-Failover Groups and DR design.
Use a Zone-Redundant database. By default, the cluster of nodes for the premium availability
model is created in the same datacenter. With the
introduction of Azure Availability Zones, SQL Database can
place different replicas of the Business Critical database to
different availability zones in the same region. To eliminate a
single point of failure, the control ring is also duplicated
across multiple zones as three gateway rings (GW). The
routing to a specific gateway ring is controlled by Azure
Traffic Manager (ATM). Because the zone redundant
configuration in the Premium or Business Critical service
tiers doesn't create extra database redundancy, you can
enable it at no extra cost. Learn more about Zone-
redundant databases.
Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an
earlier point in time to recover data from changes done inadvertently. For more information, read the
Point-in-time restore (PITR) documentation.
Business Critical or Premium tiers are configured as Zone Redundant Deployments. For more information
about the availability guarantee, reference SLA for Azure SQL Database .
Have you configured Azure SQL Database with operational excellence in mind?Have you configured Azure SQL Database with operational excellence in mind?
Use Active Geo-Replication to create a readable secondary in a different region.
Use Auto Failover Groups that can include one or multiple databases, typically used by the same application.
Use a Zone-Redundant database.
Monitor your Azure SQL Database in near-real time to detect reliability incidents.
Implement retry logic.
Back up your keys.
Explore the following table of recommendations to optimize your Azure SQL Database configuration for
operational excellence:
Monitor your Azure SQL Database in near-real time to
detect reliability incidents.
Use one of the available solutions to monitor SQL DB to
detect potential reliability incidents early and make your
databases more reliable. Choose a near real-time monitoring
solution to quickly react to incidents. Reference Azure SQL
Analytics for more information.
Implement Retry Logic. Although Azure SQL Database is resilient when it concerns
transitive infrastructure failures, these failures might affect
your connectivity. When a transient error occurs while
working with SQL Database, make sure your code can retry
the call. For more information, reference how to implement
retry logic and Configurable retry logic in SqlClient
introduction.
Back up your keys. If you're not using encryption keys in Azure Key Vault to
protect your data, back up your keys.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Azure SQL Database and performance efficiency
Design checklistDesign checklist
RecommendationsRecommendations
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the
database management functions without user involvement. Management functions include:
Upgrades
Patches
Backups
Monitoring
The following sections include a design checklist and recommended design options specific to Azure SQL
Database performance efficiency.
Have you designed your workload and configured Azure SQL Database with performanceHave you designed your workload and configured Azure SQL Database with performance
efficiency in mind?efficiency in mind?
Review resource limits. For specific resource limits per pricing tier (also known as service objective) for single
databases, refer to either DTU-based single database resource limits or vCore-based single database
resource limits. For elastic pool resource limits, refer to either DTU-based elastic pool resource limits or
vCore-based elastic pool resource limits.
Choose the right deployment model for your workload, vCore or DTU. Compare the vCore and DTU-based
purchasing models.
Microsoft recommends the latest vCore database standard-series or premium-series hardware. Older Gen4
hardware has been retired.
When using elastic pools, familiarize yourself with resource governance.
Review the default max degree of parallelism (MAXDOP) and configure as needed based on a migrated or
expected workload.
Consider using read-only replicas of critical database to offload read-only query workloads.
Review the Performance Center for SQL Server Database Engine and Azure SQL Database.
Applications connecting to Azure SQL Database should use the latest connection providers, for example the
latest OLE DB Driver or ODBC Driver.
REC O M M EN DAT IO NREC O M M EN DAT IO N BEN EF ITB EN EF IT
Diagnose and troubleshoot high CPU utilization. Azure SQL Database provides built-in tools to identify the
causes of high CPU usage and to optimize workload
performance.
Understand blocking and deadlocking issues. Blocking due to concurrency and terminated sessions due to
deadlocks have different causes and outcomes.
Tune applications and databases for performance. Tune your application and database to improve performance.
Review best practices.
Review Azure portal utilization reporting and scale as
appropriate.
After deployment, use built-in reporting in the Azure portal
to regularly review peak and average database utilization
and right-size up or down. You can easily scale single
databases or elastic pools with no data loss and minimal
downtime.
Review Performance Recommendations. In the Intelligent Performance menu of the database page in
the Azure portal, review and consider action on any of the
Performance Recommendations and implement any index,
schema, and parameterization issues.
Review Query Performance Insight. Review Query Performance Insight for Azure SQL Database
reports to identify top resource-consuming queries, long
running queries, and more.
Configure Automatic tuning. Provide peak performance and stable workloads through
continuous performance tuning based on AI and machine
learning. Consider using Azure Automation to configure
email notifications for automatic tuning.
Evaluate potential use of in-memory database objects. In-memory technologies enable you to improve
performance of your application, and potentially reduce cost
of your database. Consider designing some database objects
in high-volume OLTP applications.
Leverage the Query Store. Enabled by default in Azure SQL Database, the Query Store
contains a wealth of query performance and resource
consumption data, as well as advanced tuning features like
Query Store hints and automatic plan correction. Review
Query Store defaults in Azure SQL Database.
Implement retry logic for transient errors. Applications should include automatic transaction retry logic
for transient errors including common connection errors.
Leverage exponential retry interval logic.
Additional resources
Next steps
For information about supported features, see Features and Resolving Transact-SQL differences during
migration to SQL Database.
Migrating to Azure SQL Database? Review our Azure Database Migration Guides.
Watch episodes of Data Exposed covering Azure SQL topics and more.
Try Azure SQL Database free with Azure free account, then get started with single databases in Azure SQL
Database.
Azure SQL Managed Instance and reliability
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
Azure SQL Managed Instance is the intelligent, scalable cloud database service that combines the broadest SQL
Server database engine compatibility with all the benefits of a fully managed and evergreen platform as a
service.
The goal of the high availability architecture in SQL Managed Instance is to guarantee that your database is up
and running without worrying about the impact of maintenance operations and outages. This solution is
designed to:
Ensure that committed data is never lost because of failures.
Ensure that maintenance failures don't affect your workload.
Ensure that the database won't be a single point of failure in your software architecture.
For more information about how Azure SQL Managed Instance supports application and workload resilience,
reference the following articles:
High availability for Azure SQL Managed Instance
Use autofailover groups to enable transparent and coordinated geo-failover of multiple databases
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure SQL Managed Instance, and reliability.
Azure SQL Managed Instance includes the following design considerations:
Define an application performance SLA and monitor it with alerts. Detecting quickly when your application
performance inadvertently degrades below an acceptable level is important to maintain high resiliency. Use a
monitoring solution to set alerts on key query performance metrics so you can take action when the
performance breaks the SLA.
Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an
earlier point in time to recover data from changes done inadvertently. For more information, read the Point-
in-time-restore (PITR) documentation for managed instance.
Use geo-restore to recover from a service outage. Geo-restore restores a database from a geo-redundant
backup into a managed instance in a different region. For more information, reference Recover a database
using Geo-restore documentation.
Consider the time required for certain operations. Make sure you separate time to thoroughly test the
amount of time required to scale up and down your existing managed instance, and to create a new
managed instance. This timing practice ensures that you understand completely how time consuming
operations will affect your RTO and RPO.
Have you configured Azure SQL Managed Instance with reliability in mind?Have you configured Azure SQL Managed Instance with reliability in mind?
Use the Business Critical Tier.
Configure a secondary instance and an Autofailover group to enable failover to another region.
Implement Retry Logic.
Monitor your SQL MI instance in near-real time to detect reliability incidents.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use the Business Critical Tier. This tier provides higher resiliency to failures and faster
failover times because of the underlying HA architecture,
among other benefits. For more information, reference SQL
Managed Instance High availability.
Configure a secondary instance and an Autofailover group
to enable failover to another region.
If an outage impacts one or more of the databases in the
managed instance, you can manually or automatically
failover all the databases inside the instance to a secondary
region. For more information, read the Autofailover groups
documentation for managed instance.
Implement Retry Logic. Although Azure SQL MI is resilient to transitive
infrastructure failures, these failures might affect your
connectivity. When a transient error occurs while working
with SQL MI, make sure your code can retry the call. For
more information, reference how to implement retry logic.
Monitor your SQL MI instance in near-real time to detect
reliability incidents.
Use one of the available solutions to monitor your SQL MI
to detect potential reliability incidents early and make your
databases more reliable. Choose a near real-time monitoring
solution to quickly react to incidents. For more information,
check out the Azure SQL Managed Instance monitoring
options.
Next step
Explore the following table of recommendations to optimize your Azure SQL Managed Instance configuration
for reliability:
Azure SQL Managed Instance and operational excellence
Azure SQL Managed Instance and operational
excellence
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
Azure SQL Managed Instance is the intelligent, scalable cloud database service that combines the broadest SQL
Server database engine compatibility with all the benefits of a fully managed and evergreen platform as a
service.
The goal of the high availability architecture in SQL Managed Instance is to guarantee that your database is up
and running without worrying about the impact of maintenance operations and outages. This solution is
designed to:
Ensure that committed data is never lost because of failures.
Ensure that maintenance failures don't affect your workload.
Ensure that the database won't be a single point of failure in your software architecture.
For more information about how Azure SQL Managed Instance supports operational excellence for your
application workloads, reference the following articles:
Overview of Azure SQL Managed Instance management operations
Monitoring Azure SQL Managed Instance management operations
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure SQL Managed Instance, and operational excellence.
Azure SQL Managed Instance includes the following design considerations:
Define an application performance SLA and monitor it with alerts. Detecting quickly when your application
performance inadvertently degrades below an acceptable level is important to maintain high resiliency. Use a
monitoring solution to set alerts on key query performance metrics so you can take action when the
performance breaks the SLA.
Use point-in-time restore to recover from human error. Point-in-time restore returns your database to an
earlier point in time to recover data from changes done inadvertently. For more information, read the Point-
in-time-restore (PITR) documentation for managed instance.
Use geo-restore to recover from a service outage. Geo-restore restores a database from a geo-redundant
backup into a managed instance in a different region. For more information, reference Recover a database
using Geo-restore documentation.
Consider the time required for certain operations. Make sure you separate time to thoroughly test the
amount of time required to scale up and down your existing managed instance, and to create a new
managed instance. This timing practice ensures that you understand completely how time consuming
operations will affect your RTO and RPO.
Have you configured Azure SQL Managed Instance with operational excellence in mind?Have you configured Azure SQL Managed Instance with operational excellence in mind?
Use the Business Critical Tier.
Configure a secondary instance and an Autofailover group to enable failover to another region.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use the Business Critical Tier. This tier provides higher resiliency to failures and faster
failover times because of the underlying HA architecture,
among other benefits. For more information, reference SQL
Managed Instance High availability.
Configure a secondary instance and an Autofailover group
to enable failover to another region.
If an outage impacts one or more of the databases in the
managed instance, you can manually or automatically
failover all the databases inside the instance to a secondary
region. For more information, read the Autofailover groups
documentation for managed instance.
Implement Retry Logic. Although Azure SQL MI is resilient to transitive
infrastructure failures, these failures might affect your
connectivity. When a transient error occurs while working
with SQL MI, make sure your code can retry the call. For
more information, reference how to implement retry logic.
Monitor your SQL MI instance in near-real time to detect
reliability incidents.
Use one of the available solutions to monitor your SQL MI
to detect potential reliability incidents early and make your
databases more reliable. Choose a near real-time monitoring
solution to quickly react to incidents. For more information,
check out the Azure SQL Managed Instance monitoring
options.
Next step
Implement Retry Logic.
Monitor your SQL MI instance in near-real time to detect reliability incidents.
Explore the following table of recommendations to optimize your Azure SQL Managed Instance configuration
for operational excellence:
Azure Cosmos DB and reliability
Azure Cosmos DB and reliability
12/16/2022 8 minutes to read • Edit Online
Design considerations
Checklist
Azure Cosmos DB is a fully managed NoSQL database for modern app development.
Key features include:
Guaranteed speed at any scale
Simplified application development
Mission-critical ready
Fully managed and cost effective
To understand how Azure Cosmos DB bolsters resiliency for your application workload, reference the following
articles:
Distribute your data globally with Azure Cosmos DB
How does Azure Cosmos DB provide high availability
Consistency levels in Azure Cosmos DB
Configure Azure Cosmos DB account with periodic backup
The following sections include design considerations, a configuration checklist, recommended configuration
options, and source artifacts specific to Azure Cosmos DB and reliability.
Azure Cosmos DB includes the following design considerations:
SLA for read availability for Database Accounts spanning two or more Azure regions.
SLAs for throughput, consistency, availability, and latency.
SLA for both read and write availability with the configuration of multiple Azure regions as writable
endpoints.
For more detailed information about SLAs specific to this product, see Azure Cosmos DB Service Level
Agreements.
Have you configured Azure Cosmos DB with reliability in mind?Have you configured Azure Cosmos DB with reliability in mind?
Deploy Azure Cosmos DB and the application in the region that corresponds to end users.
If the multi-master option is enabled on Azure Cosmos DB, it's important to understand Conflict Types and
Resolution Policies.
Start with, Session, the default consistency level.
Change the consistency level, depending on the data operation and usage.
Evaluate connectivity modes and connection protocols in Azure Cosmos DB.
Configure preferred locations.
Specify index precisions.
Use Azure Monitor to see the provisioned autoscale max RU/s (Autoscale Max Throughput) and the RU/s
the system is currently scaled to (Provisioned Throughput).
Understand your traffic pattern to pick the right option for provisioned throughput types.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Deploy Azure Cosmos DB and the application in the region
that corresponds to end users.
There are two common scenarios for configuring two or
more regions: Delivering low-latency access to data to end
users no matter where they're located around the globe.
Adding regional resiliency for business continuity and
disaster recovery (BCDR). For delivering low-latency to end
users, it's recommended to deploy both the application and
add Azure Cosmos DB in the regions that correspond to
where the application's users are located.
Start with, Session, the default consistency level. It's the recommended consistency level to start with as it
receives data later, but in the same order as the writes.
Change the consistency level, depending on the data
operation and usage.
Depending on the type of data stored, different consistency
levels may be changed on a per request basis. For example, if
log data is written to Azure Cosmos DB, an EventualEventual
consistency may be relevant, but if writing e-commerce
transactions, then StrongStrong may be more appropriate.
New applications
: If you don't know your traffic pattern yet, start at the entry point RU/s to avoid over-
provisioning in the beginning.
Existing applications
: Use Azure Monitor metrics to determine if your traffic pattern is suitable for autoscale.
Existing applications
: Find the normalized request unit consumption metric of your database or container.
Existing applications
: The closer the number is to 100% , the more you're fully using your provisioned RU/s .
Set provisioned RU/s to T for all hours in a month.
Enable automatic failover when you configure Azure Cosmos DB accounts used for production workloads.
Implement retry logic in your client.
For query-intensive workloads, use Windows 64-bit instead of Linux or Windows 32-bit host processing.
To reduce latency and CPU jitter, enable accelerated networking on client virtual machines in both Windows
and Linux.
Increase the number of threads and tasks.
To avoid network latency, collocate client in the same region as Azure Cosmos DB.
Call OpenAsync to avoid startup latency on first request.
Scale out client applications across multiple servers if client consumes more than 50,000 RU/s .
Select a partition key.
Ensure the partition key is a property that has a value that doesn't change.
You can't change a partition key after it's created with the collection.
Ensure the partition key has a high cardinality.
Ensure the partition key spreads RU consumption and data storage evenly across all logical partitions.
Ensure you're running read queries with the partitioned column to reduce RU consumption and latency.
Evaluate ways to improve data performance.
Configure data replication to ensure Azure Cosmos DB meets the SLAs.
Explore the following table of recommendations to optimize your Azure Cosmos DB configuration for reliability:
Evaluate connectivity modes and connection protocols in
Azure Cosmos DB.
Azure Cosmos DB supports two connectivity modes. In
Gateway mode
, requests are always made to the Azure
Cosmos DB gateway, which forwards it to the corresponding
data partitions. In
Direct connectivity mode
, the client
fetches the mapping of tables to partitions, and requests are
made directly against data partitions. We recommend Direct,
the default mode. Azure Cosmos DB supports two
connection protocols: HTTPS and TCP , which is the
default. TCP is recommended because it's more lightweight.
Configure preferred locations. Setting preferred locations can improve query performance.
To take advantage of global distribution, client applications
can specify the ordered preference list of regions to be used
to perform document operations, which can be done by
setting the connection policy. Based on the Azure Cosmos
DB account configuration, current regional availability and
the preference list specified, the most optimal endpoint will
be chosen by the SQL SDK to perform write and read
operations. This preference list is specified when initializing a
connection using the SQL SDKs. The SDKs accept an optional
parameter, PreferredLocations , that is an ordered list of
Azure regions. The SDK will automatically send all writes to
the current write region. All reads will be sent to the first
available region in the PreferredLocations list. If the
request fails, the client will fail down the list to the next
region, and so on. The SDKs will only attempt to read from
the regions specified in PreferredLocations .
Specify index precisions. Setting these values appropriately can improve query
performance and reduce throughput requests. You can use
index precision to make trade-offs between index storage
overhead and query performance. For numbers, we
recommend using the default precision configuration of -1
(maximum). Because numbers are 8 bytes in JSON, this is
equivalent to a configuration of 8 bytes. Choosing a lower
value for precision, such as 1 through 7 , means that
values within some ranges map to the same index entry. You
reduce index storage space, but query execution might have
to process more documents. It consumes more throughput
in request units. Index precision configuration has more
practical application with string ranges. Because strings can
be any arbitrary length, the choice of the index precision
might affect the performance of string range queries. It also
may affect the amount of index storage space that's
required. String Range indexes can be configured with 1
through 100 or -1 (maximum).
Existing applications
: Find the normalized request unit
consumption metric of your database or container.
Normalized
usage is a measure of how much you're currently
using your standard (manual) provisioned throughput.
Set provisioned RU/s to T for all hours in a month. If you set provisioned RU/s to T and use the full amount
for 66% of the hours or more, it's estimated you'll save with
standard (manual) provisioned RU/s . If you set autoscale
max RU/s to Tmax and use the full amount Tmax for
66% of the hours or less, it's estimated you'll save with
autoscale.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Scale out client applications across multiple servers if client
consumes more than 50,000 RU/s .
There could be a bottleneck because of the machine capping
out on CPU or network usage.
Ensure the partition key spreads RU consumption and data
storage evenly across all logical partitions.
This spread ensures even RU consumption and storage
distribution across your physical partitions.
Evaluate ways to improve data performance. Best practices for query performance:
- Connection policy: Use direct connection mode.
- Connection Policy: Use the TCP protocol
Call OpenAsync to avoid startup latency on first request.
- Collocate clients in the same Azure region for performance.
- Increase number of threads and tasks.
- Install the most recent SDK.
- Use a singleton Azure Cosmos DB client for the lifetime of
your application.
- Increase System.Net MaxConnections per host when
using Gateway mode.
- Tune parallel queries for partitioned collections.
- Turn on server-side GC.
- Implement backoff at RetryAfter intervals.
- Scale out your client-workload.
- Cache document URIs for lower read latency.
- Tune the page size for queries and read feeds for better
performance.
- Use 64-bit host processing.
- Exclude unused paths from indexing for faster writes.
- Measure and tune for lower request units and second
usage.
- Handle rate limiting and request rates that are too large.
- Design for smaller documents for higher throughput.
Configure data replication to ensure Azure Cosmos DB
meets the SLAs
If you've replicated your data in more than one data center,
Azure Cosmos DB automatically rolls over your operations
should a regional data center go offline. You can create a
prioritized list of failover regions using the regions in which
your data is replicated. Even within a single data center,
Azure Cosmos DB automatically replicates data for high
availability giving you the choice of consistency levels.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
|where type =~ 'Microsoft.DocumentDb/databaseAccounts'
|where properties.enableAutomaticFailover!=True
resources
| where type == "microsoft.documentdb/databaseaccounts"
and properties.enableMultipleWriteLocations == "true"
To check for cosmosdb instances where automatic failover isn't enabled, use the following query:
Use the following query to see the list of multiregion writes:
To view consistency levels for your Azure Cosmos DB accounts, use the following query:
Resources
| project name, type, location, consistencyLevel = properties.consistencyPolicy.defaultConsistencyLevel
| where type == "microsoft.documentdb/databaseaccounts"
| order by name asc
Resources
|where type =~ 'Microsoft.DocumentDb/databaseAccounts'
|where array_length( properties.locations) <=1
Learn more
Next step
To check if multilocation isn't selected, use the following query:
High availability in Azure Cosmos DB
Autoscale FAQ
Performance tips for Azure Cosmos DB
Azure Cosmos DB and operational excellence
Azure Cosmos DB and operational excellence
12/16/2022 4 minutes to read • Edit Online
Design considerations
Checklist
Azure Cosmos DB is a fully managed NoSQL database for modern app development.
Key features include:
Guaranteed speed at any scale
Simplified application development
Mission-critical ready
Fully managed and cost effective
To understand how Azure Cosmos DB promotes operational excellence for your application workload, reference
the following articles:
Monitor Azure Cosmos DB
Monitor and debug with insights in Azure Cosmos DB
Visualize Azure Cosmos DB data by using the Power BI connector
The following sections include design considerations, a configuration checklist, recommended configuration
options, and source artifacts specific to Azure Cosmos DB, and operational excellence.
Azure Cosmos DB includes the following design considerations:
SLA for read availability for Database Accounts spanning two or more Azure regions.
SLAs for throughput, consistency, availability, and latency.
SLA for both read and write availability with the configuration of multiple Azure regions as writable
endpoints.
For more granular information specific to this product, reference Azure Cosmos DB Service Level Agreements.
Have you configured Azure Cosmos DB with operational excellence in mind?Have you configured Azure Cosmos DB with operational excellence in mind?
Monitor for normal and abnormal activity.
If the multi-master option is enabled on Azure Cosmos DB, it's important to understand Conflict Types and
Resolution Policies.
Start with, Session, the default consistency level.
Use Azure Monitor to see the provisioned autoscale max RU/s (Autoscale Max Throughput) and the RU/s
the system is currently scaled to (Provisioned Throughput).
Understand your traffic pattern to pick the right option for provisioned throughput types.
New applications
: If you don't know your traffic pattern yet, start at the entry point RU/s to avoid over-
provisioning in the beginning.
Existing applications
: Use Azure Monitor metrics to determine if your traffic pattern is suitable for autoscale.
Existing applications
: Find the normalized request unit consumption metric of your database or container.
Existing applications
: The closer the number is to 100% , the more you're fully using your provisioned RU/s .
Set provisioned RU/s to T for all hours in a month.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Monitor for normal and abnormal activity. The Azure Activity Log is a subscription log that provides
insight into subscription-level events that have occurred in
Azure. The Activity Log reports control plane events for your
subscriptions under the Administrative category. Using the
Activity Log, you can determine the
what
,
who
, and
when
for
any write operations ( PUT , POST , DELETE ) taken on the
resources in your subscription. You can also understand the
status of the operation and other relevant properties. The
Activity Log differs from Diagnostic Logs. Activity Logs
provide data about the operations on a resource from the
outside (the
control plane
). In the Azure Cosmos DB context,
some of the control plane operations include create
collection, list keys, delete keys, list database, and more.
Diagnostic Logs are emitted by a resource and provide
information about the operation of that resource (the
data
plane
). Some of the data plane diagnostic log examples
include delete, insert, ReadFeed operation, and more.
Start with, Session, the default consistency level. It's the recommended consistency level to start with as it
receives data later, but in the same order as the writes.
Existing applications
: Find the normalized request unit
consumption metric of your database or container.
Normalized
usage is a measure of how much you're currently
using your standard (manual) provisioned throughput.
Set provisioned RU/s to T for all hours in a month. If you set provisioned RU/s to T and use the full amount
for 66% of the hours or more, it's estimated you'll save with
standard (manual) provisioned RU/s . If you set autoscale
max RU/s to Tmax and use the full amount Tmax for
66% of the hours or less, it's estimated you'll save with
autoscale.
Enable automatic failover when you configure Azure Cosmos DB accounts used for production workloads.
Implement retry logic in your client.
For query-intensive workloads, use Windows 64-bit instead of Linux or Windows 32-bit host processing.
To reduce latency and CPU jitter, enable accelerated networking on client virtual machines in both Windows
and Linux.
Increase the number of threads and tasks.
To avoid network latency, colocate the client in the same region as the Azure Cosmos DB instance.
Call OpenAsync to avoid startup latency on first request.
Scale out client applications across multiple servers if client consumes more than 50,000 RU/s .
Select a partition key.
Ensure the partition key is a property that has a value that doesn't change.
You can't change a partition key after it's created with the collection.
Ensure the partition key has a high cardinality.
Ensure the partition key spreads RU consumption and data storage evenly across all logical partitions.
Ensure you're running read queries with the partitioned column to reduce RU consumption and latency.
Explore the following table of recommendations to optimize operational excellence for your Azure Cosmos DB
configuration:
Scale out client applications across multiple servers if client
consumes more than 50,000 RU/s .
There could be a bottleneck because of the machine capping
out on CPU or network usage.
Ensure the partition key spreads RU consumption and data
storage evenly across all logical partitions.
This spread ensures even RU consumption and storage
distribution across your physical partitions.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
|where type =~ 'Microsoft.DocumentDb/databaseAccounts'
|where properties.enableAutomaticFailover!=True
resources
| where type == "microsoft.documentdb/databaseaccounts"
and properties.enableMultipleWriteLocations == "true"
Resources
| project name, type, location, consistencyLevel = properties.consistencyPolicy.defaultConsistencyLevel
| where type == "microsoft.documentdb/databaseaccounts"
| order by name asc
Resources
|where type =~ 'Microsoft.DocumentDb/databaseAccounts'
|where array_length( properties.locations) <=1
Learn more
Next step
To check for cosmosdb instances where automatic failover isn't enabled, use the following query:
Use the following query to see the list of multiregion writes:
To view consistency levels for your Azure Cosmos DB accounts, use the following query:
To check if multilocation isn't selected, use the following query:
Autoscale FAQ
Performance tips for Azure Cosmos DB
Azure Stack Hub and reliability
Azure Stack Hub and reliability
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Azure Stack Hub is a hybrid cloud platform that lets you provide Azure services from your datacenter. It provides
a way to run apps in an on-premises environment.
This service unlocks the following hybrid cloud use cases for customer-facing and internal line-of-business apps:
Edge and disconnected solutions
: Addresses latency and connectivity requirements by processing data
locally.
Cloud apps that meet varied regulations
: Allows you to develop and deploy apps with full flexibility to meet
regulatory or policy requirements.
Cloud app model on-premises
: Provides Azure services, containers, serverless, and microservice
architectures to update and extend existing apps or build new ones.
For more information, reference Azure Stack Hub overview.
To understand how Azure Stack Hub supports resiliency for your application workload, reference the following
articles:
Capacity planning for Azure Stack Hub overview
Storage Spaces Direct cache and capacity tiers
Datacenter integration planning considerations for Azure Stack Hub integrated systems
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Stack Hub and reliability.
Azure Stack Hub includes the following design considerations:
Microsoft doesn't provide an SLA for Azure Stack Hub because Microsoft doesn't have control over customer
datacenter reliability, people, and processes.
Azure Stack Hub only supports a single Scale Unit (SU) within a single region, which consists of between four
and 16 servers that use Hyper-V failover clustering. Each region serves as an independent Azure Stack Hub
stamp
with separate portal and API endpoints.
Azure Stack Hub doesn't support Availability Zones because it consists of a single
region
or a single physical
location. High availability to cope with outages of a single location should be implemented by using two
Azure Stack Hub instances deployed in different physical locations.
Azure Stack Hub supports premium storage to ensure compatibility. However, provisioning premium storage
accounts or disks doesn't guarantee that storage objects will be allocated onto SSD or NVMe drives.
Azure Stack Hub supports only a subset of VPN Gateway SKUs available in Azure with a limited bandwidth of
100 or 200 Mbps .
Only one site-to-site (S2S) VPN connection can be created between two Azure Stack Hub deployments. This
connection limit is because of a platform limitation that allows only a single VPN connection to the same IP
address. Multiple S2S VPN connections with higher throughput can be established using third-party NVAs.
Apply general Azure configuration recommendations for all Azure Stack Hub services.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Treat Azure Stack Hub as a scale unit and deploy multiple
instances to remove Azure Stack Hub as a single point of
failure for encompassed workloads.
Deploy workloads in either an active-active or active-passive
configuration across Azure Stack Hub stamps or Azure.
Next step
Have you configured Azure Stack Hub with reliability in mind?Have you configured Azure Stack Hub with reliability in mind?
Treat Azure Stack Hub as a scale unit and deploy multiple instances to remove Azure Stack Hub as a single
point of failure for encompassed workloads.
Consider the following recommendation table to optimize your Azure Stack Hub configuration for reliability:
Azure Stack Hub and operational excellence
Azure Stack Hub and operational excellence
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
Azure Stack Hub is a hybrid cloud platform that lets you provide Azure services from your datacenter. It provides
a way to run apps in an on-premises environment.
This service unlocks the following hybrid cloud use cases for customer-facing and internal line-of-business apps:
Edge and disconnected solutions
: Addresses latency and connectivity requirements by processing data
locally.
Cloud apps that meet varied regulations
: Allows you to develop and deploy apps with full flexibility to meet
regulatory or policy requirements.
Cloud app model on-premises
: Provides Azure services, containers, serverless, and microservice
architectures to update and extend existing apps or build new ones.
For more information, reference Azure Stack Hub overview.
To understand how Azure Stack Hub supports operational excellence for your application workload, reference
the following articles:
Monitor health and alerts in Azure Stack Hub
Monitor Azure Stack Hub hardware components
Manage network resources in Azure Stack Hub
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure Stack Hub and operational excellence.
Azure Stack Hub includes the following design considerations:
Microsoft doesn't provide an SLA for Azure Stack Hub because Microsoft doesn't have control over customer
datacenter reliability, people, and processes.
Azure Stack Hub only supports a single Scale Unit (SU) within a single region, which consists of between four
and 16 servers that use Hyper-V failover clustering. Each region serves as an independent Azure Stack Hub
stamp
with separate portal and API endpoints.
Azure Stack Hub doesn't support Availability Zones because it consists of a single
region
or a single physical
location. High availability to cope with outages of a single location should be implemented by using two
Azure Stack Hub instances deployed in different physical locations.
Apply general Azure configuration recommendations for all Azure Stack Hub services.
Have you configured Azure Stack Hub with operational excellence in mind?Have you configured Azure Stack Hub with operational excellence in mind?
Treat Azure Stack Hub as a scale unit and deploy multiple instances to remove Azure Stack Hub as a single
point of failure for encompassed workloads.
Consider the following recommendation table to optimize your Azure Stack Hub configuration for operational
excellence:
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Treat Azure Stack Hub as a scale unit and deploy multiple
instances to remove Azure Stack Hub as a single point of
failure for encompassed workloads.
Deploy workloads in either an active-active or active-passive
configuration across Azure Stack Hub stamps or Azure.
Next step
Storage Accounts and reliability
Storage Accounts and reliability
12/16/2022 7 minutes to read • Edit Online
Design considerations
Checklist
Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a
high number of input output (IOP) operations per second. Storage accounts contain all your Azure Storage data
objects, which include:
Blobs
File shares
Queues
Tables
Disks
Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS .
For more information about the different types of storage accounts that support different features, reference
Types of storage accounts.
To understand how an Azure storage account supports resiliency for your application workload, reference the
following articles:
Azure storage redundancy
Disaster recovery and storage account failover
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure storage accounts and reliability.
Azure storage accounts include the following design considerations:
General purpose v1 storage accounts provide access to all Azure Storage services, but may not have the
latest features or the lower per-gigabyte pricing. It's recommended to use general purpose v2 storage
accounts, in most cases. Reasons to use v1 include:
Applications require the classic deployment model.
Applications are transaction intensive or use significant geo-replication bandwidth, but don't require
large capacity.
The use of a Storage Service REST API that is earlier than February 14, 2014, or a client library with a
version earlier than 4.x is required. An application upgrade isn't possible.
For more information, reference the Storage account overview.
Storage account names must be between three and 24 characters and may contain numbers, and lowercase
letters only.
For current SLA specifications, reference SLA for Storage Accounts.
Go to Azure Storage redundancy to determine which redundancy option is best for a specific scenario.
Storage account names must be unique within Azure. No two storage accounts can have the same name.
Have you configured your Azure Storage Account with reliability in mind?Have you configured your Azure Storage Account with reliability in mind?
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Turn on soft delete for blob data. Soft delete for Azure Storage blobs enables you to recover
blob data after it has been deleted.
Use Azure AD to authorize access to blob data. Azure AD provides superior security and ease of use over
Shared Key for authorizing requests to blob storage. It's
recommended to use Azure AD authorization with your blob
and queue applications when possible to minimize potential
security vulnerabilities inherent in Shared Key. For more
information, reference Authorize access to Azure blobs and
queues using Azure Active Directory.
Consider the principle of least privilege when you assign
permissions to an Azure AD security principal through Azure
RBAC.
When assigning a role to a user, group, or application, grant
that security principal only those permissions necessary for
them to perform their tasks. Limiting access to resources
helps prevent both unintentional and malicious misuse of
your data.
Use managed identities to access blob and queue data. Azure Blob and Queue storage support Azure AD
authentication with managed identities for Azure resources.
Managed identities for Azure resources can authorize access
to blob and queue data using Azure AD credentials from
applications running in Azure virtual machines (VMs),
function apps, virtual machine scale sets, and other services.
By using managed identities for Azure resources together
with Azure AD authentication, you can avoid storing
credentials with your applications that run in the cloud and
issues with expiring service principals. Reference Authorize
access to blob and queue data with managed identities for
Azure resources for more information.
Turn on soft delete for blob data.
Use Azure AD to authorize access to blob data.
Consider the principle of least privilege when you assign permissions to an Azure AD security principal
through Azure RBAC.
Use managed identities to access blob and queue data.
Use blob versioning or immutable blobs to store business-critical data.
Restrict default internet access for storage accounts.
Enable firewall rules.
Limit network access to specific networks.
Allow trusted Microsoft services to access the storage account.
Enable the Secure transfer requiredSecure transfer required option on all your storage accounts.
Limit shared access signature (SAS) tokens to HTTPS connections only.
Avoid and prevent using Shared Key authorization to access storage accounts.
Regenerate your account keys periodically.
Create a revocation plan and have it in place for any SAS that you issue to clients.
Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.
Consider the following recommendations to optimize reliability when configuring your Azure Storage Account:
Use blob versioning or immutable blobs to store business-
critical data.
Consider using Blob versioning to maintain previous
versions of an object or the use of legal holds and time-
based retention policies to store blob data in a WORM
(Write Once, Read Many) state. Immutable blobs can be
read, but can't be modified or deleted during the retention
interval. For more information, reference Store business-
critical blob data with immutable storage.
Restrict default internet access for storage accounts. By default, network access to Storage Accounts isn't
restricted and is open to all traffic coming from the internet.
Access to storage accounts should be granted to specific
Azure Virtual Networks only whenever possible or use
private endpoints to allow clients on a virtual network (VNet)
to access data securely over a Private Link. Reference Use
private endpoints for Azure Storage for more information.
Exceptions can be made for Storage Accounts that need to
be accessible over the internet.
Enable firewall rules. Configure firewall rules to limit access to your storage
account to requests that originate from specified IP
addresses or ranges, or from a list of subnets in an Azure
Virtual Network (VNet). For more information about
configuring firewall rules, reference Configure Azure Storage
firewalls and virtual networks.
Limit network access to specific networks. Limiting network access to networks hosting clients
requiring access reduces the exposure of your resources to
network attacks either by using the built-in Firewall and
virtual networks functionality or by using private endpoints.
Allow trusted Microsoft services to access the storage
account.
Turning on firewall rules for storage accounts blocks
incoming requests for data by default, unless the requests
originate from a service operating within an Azure Virtual
Network (VNet) or from allowed public IP addresses. Blocked
requests include those requests from other Azure services,
from the Azure portal, from logging and metrics services,
and so on. You can permit requests from other Azure
services by adding an exception to allow trusted Microsoft
services to access the storage account. For more information
about adding an exception for trusted Microsoft services,
reference Configure Azure Storage firewalls and virtual
networks.
Enable the Secure transfer requiredSecure transfer required option on all your
storage accounts.
When you enable the Secure transfer requiredSecure transfer required option, all
requests made against the storage account must take place
over secure connections. Any requests made over HTTP will
fail. For more information, reference Require secure transfer
in Azure Storage.
Limit shared access signature (SAS) tokens to HTTPS
connections only.
Requiring HTTPS when a client uses a SAS token to access
blob data helps to minimize the risk of eavesdropping. For
more information, reference Grant limited access to Azure
Storage resources using shared access signatures (SAS).
Avoid and prevent using Shared Key authorization to access
storage accounts.
It's recommended to use Azure AD to authorize requests to
Azure Storage and to prevent Shared Key Authorization. For
scenarios that require Shared Key authorization, always
prefer SAS tokens over distributing the Shared Key.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Regenerate your account keys periodically. Rotating the account keys periodically reduces the risk of
exposing your data to malicious actors.
Create a revocation plan and have it in place for any SAS
that you issue to clients.
If a SAS is compromised, you'll want to revoke that SAS
immediately. To revoke a user delegation SAS, revoke the
user delegation key to quickly invalidate all signatures
associated with that key. To revoke a service SAS that's
associated with a stored access policy, you can delete the
stored access policy, rename the policy, or change its expiry
time to a time that is in the past.
Use near-term expiration times on an impromptu SAS,
service SAS, or account SAS.
If a SAS is compromised, it's valid only for a short time. This
practice is especially important if you can't reference a stored
access policy. Near-term expiration times also limit the
amount of data that can be written to a blob by limiting the
time available to upload to it. Clients should renew the SAS
well before the expiration to allow time for retries if the
service providing the SAS is unavailable.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Storage Accounts and security
Storage Accounts and security
12/16/2022 7 minutes to read • Edit Online
Design considerations
Checklist
Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a
high number of input output (IOP) operations per second. Storage accounts contain all your Azure Storage data
objects, which include:
Blobs
File shares
Queues
Tables
Disks
Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS .
For more information about the different types of storage accounts that support different features, reference
Types of storage accounts.
To understand how an Azure storage account boosts security for your application workload, reference the
following articles:
Azure security baseline for Azure Storage
Azure Storage encryption for data at rest
Use private endpoints for Azure Storage
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure storage accounts and security.
Azure storage accounts include the following design considerations:
Storage account names must be between three and 24 characters and may contain numbers, and lowercase
letters only.
For current SLA specifications, reference SLA for Storage Accounts.
Go to Azure Storage redundancy to determine which redundancy option is best for a specific scenario.
Storage account names must be unique within Azure. No two storage accounts can have the same name.
Have you configured your Azure Storage Account with security in mind?Have you configured your Azure Storage Account with security in mind?
Enable Azure Defender for all your storage accounts.
Turn on soft delete for blob data.
Use Azure AD to authorize access to blob data.
Consider the principle of least privilege when you assign permissions to an Azure AD security principal
through Azure RBAC.
Use managed identities to access blob and queue data.
Use blob versioning or immutable blobs to store business-critical data.
Restrict default internet access for storage accounts.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Enable Azure Defender for all your storage accounts. Azure Defender for Azure Storage provides an extra layer of
security intelligence that detects unusual and potentially
harmful attempts to access or exploit storage accounts.
Security alerts are triggered in Azure Security Center when
anomalies in activity occur. Alerts are also sent through email
to subscription administrators, with details of suspicious
activity and recommendations on how to investigate, and
remediate threats. For more information, reference
Configure Azure Defender for Azure Storage.
Turn on soft delete for blob data. Soft delete for Azure Storage blobs enables you to recover
blob data after it has been deleted.
Use Azure AD to authorize access to blob data. Azure AD provides superior security and ease of use over
Shared Key for authorizing requests to blob storage. It's
recommended to use Azure AD authorization with your blob
and queue applications when possible to minimize potential
security vulnerabilities inherent in Shared Key. For more
information, reference Authorize access to Azure blobs and
queues using Azure Active Directory.
Consider the principle of least privilege when you assign
permissions to an Azure AD security principal through Azure
RBAC.
When assigning a role to a user, group, or application, grant
that security principal only those permissions necessary for
them to complete their tasks. Limiting access to resources
helps prevent both unintentional and malicious misuse of
your data.
Use managed identities to access blob and queue data. Azure Blob and Queue storage support Azure AD
authentication with managed identities for Azure resources.
Managed identities for Azure resources can authorize access
to blob and queue data using Azure AD credentials from
applications running in Azure virtual machines (VMs),
function apps, virtual machine scale sets, and other services.
By using managed identities for Azure resources together
with Azure AD authentication, you can avoid storing
credentials with your applications that run in the cloud and
issues with expiring service principals. Reference Authorize
access to blob and queue data with managed identities for
Azure resources for more information.
Enable firewall rules.
Limit network access to specific networks.
Allow trusted Microsoft services to access the storage account.
Enable the Secure transfer requiredSecure transfer required option on all your storage accounts.
Limit shared access signature (SAS) tokens to HTTPS connections only.
Avoid and prevent using Shared Key authorization to access storage accounts.
Regenerate your account keys periodically.
Create a revocation plan and have it in place for any SAS that you issue to clients.
Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.
Consider the following recommendations to optimize security when configuring your Azure Storage Account:
Use blob versioning or immutable blobs to store business-
critical data.
Consider using Blob versioning to maintain previous
versions of an object or the use of legal holds and time-
based retention policies to store blob data in a WORM
(Write Once, Read Many) state. Immutable blobs can be
read, but can't be modified or deleted during the retention
interval. For more information, reference Store business-
critical blob data with immutable storage.
Restrict default internet access for storage accounts. By default, network access to Storage Accounts isn't
restricted and is open to all traffic coming from the internet.
Access to storage accounts should be granted to specific
Azure Virtual Networks only whenever possible or use
private endpoints to allow clients on a virtual network (VNet)
to access data securely over a Private Link. Reference Use
private endpoints for Azure Storage for more information.
Exceptions can be made for Storage Accounts that need to
be accessible over the internet.
Enable firewall rules. Configure firewall rules to limit access to your storage
account to requests that originate from specified IP
addresses or ranges, or from a list of subnets in an Azure
Virtual Network (VNet). For more information about
configuring firewall rules, reference Configure Azure Storage
firewalls and virtual networks.
Limit network access to specific networks. Limiting network access to networks hosting clients
requiring access reduces the exposure of your resources to
network attacks either by using the built-in Firewall and
virtual networks functionality or by using private endpoints.
Allow trusted Microsoft services to access the storage
account.
Turning on firewall rules for storage accounts blocks
incoming requests for data by default, unless the requests
originate from a service operating within an Azure Virtual
Network (VNet) or from allowed public IP addresses. Blocked
requests include those requests from other Azure services,
from the Azure portal, from logging and metrics services,
and so on. You can permit requests from other Azure
services by adding an exception to allow trusted Microsoft
services to access the storage account. For more information
about adding an exception for trusted Microsoft services,
reference Configure Azure Storage firewalls and virtual
networks.
Enable the Secure transfer requiredSecure transfer required option on all your
storage accounts.
When you enable the Secure transfer requiredSecure transfer required option, all
requests made against the storage account must take place
over secure connections. Any requests made over HTTP will
fail. For more information, reference Require secure transfer
in Azure Storage.
Limit shared access signature (SAS) tokens to HTTPS
connections only.
Requiring HTTPS when a client uses a SAS token to access
blob data helps to minimize the risk of eavesdropping. For
more information, reference Grant limited access to Azure
Storage resources using shared access signatures (SAS).
Avoid and prevent using Shared Key authorization to access
storage accounts.
It's recommended to use Azure AD to authorize requests to
Azure Storage and to prevent Shared Key Authorization. For
scenarios that require Shared Key authorization, always
prefer SAS tokens over distributing the Shared Key.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Regenerate your account keys periodically. Rotating the account keys periodically reduces the risk of
exposing your data to malicious actors.
Create a revocation plan and have it in place for any SAS
that you issue to clients.
If a SAS is compromised, you'll want to revoke that SAS
immediately. To revoke a user delegation SAS, revoke the
user delegation key to quickly invalidate all signatures
associated with that key. To revoke a service SAS that's
associated with a stored access policy, you can delete the
stored access policy, rename the policy, or change its expiry
time to a time that is in the past.
Use near-term expiration times on an impromptu SAS,
service SAS, or account SAS.
If a SAS is compromised, it's valid only for a short time. This
practice is especially important if you can't reference a stored
access policy. Near-term expiration times also limit the
amount of data that can be written to a blob by limiting the
time available to upload to it. Clients should renew the SAS
well before the expiration to allow time for retries if the
service providing the SAS is unavailable.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Storage Accounts and cost optimization
Storage Accounts and cost optimization
12/16/2022 2 minutes to read • Edit Online
Design considerations
C O N SIDERAT IO NSC O N SIDERAT IO NS DESC RIP T IONDESC RIP T ION
Periodically dispose and clean up unused storage resources,
such as unattached disks and old snapshots.
Unused storage resources can incur cost and it's a good idea
to regularly perform cleanup to reduce cost.
Consider Azure Blob access time tracking and access time-
based lifecycle management.
Minimize your storage cost automatically by setting up a
policy based on last access time to: cost-effective backup
storage options.
Transition your data from a hotter access tier to a cooler
access tier if there's no access for a period
For example:
- Hot to cool
- Cool to archive
- Hot to archive
Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a
high number of input output (IOP) operations per second. Storage accounts contain all your Azure Storage data
objects, which include:
Blobs
File shares
Queues
Tables
Disks
Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS .
For more information about the different types of storage accounts that support different features, reference
Types of storage accounts.
To understand how an Azure storage account can optimize costs for your workload, reference the following
articles:
Plan and manage costs for Azure Blob Storage
Optimize costs for Blob storage with reserved capacity
Understand how reservation discounts are applied to Azure storage services
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure storage accounts and cost optimization.
Azure storage accounts include the following design considerations:
Periodically dispose and clean up unused storage resources, such as unattached disks and old snapshots.
Consider Azure Blob access time tracking and access time-based lifecycle management.
Transition your data from a hotter access tier to a cooler access tier if there's no access for a period.
Delete your data if there's no access for an extended period.
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Consider cost savings by reserving data capacity for block
blob storage.
Save money by reserving capacity for block blob and for
Azure Data Lake Storage gen 2 data in standard storage
account when customer commits to one or three years
reservation.
Organize data into access tiers. You can reduce cost by placing blob data into the most cost-
effective access tier. Place frequently accessed data in a hot
tier, less frequent in a cold or archive tier. Use Premium
storage for workloads with high transaction volumes or
workloads where latency is critical.
Use lifecycle policy to move data between access tiers. Lifecycle management policy periodically moves data
between tiers. Policies can move data based on rules
specified by the user. For example, you can create rules that
move blobs to the archive tier if that blob has been modified
in 90 days. Unused data can be removed completely using a
policy. By creating policies that adjust the access tier of your
data, you can design the least expensive storage options for
your requirements.
Next step
Have you configured your Azure Storage Account with cost optimization in mind?Have you configured your Azure Storage Account with cost optimization in mind?
Consider cost savings by reserving data capacity for block blob storage.
Organize data into access tiers.
Use lifecycle policy to move data between access tiers.
Consider the following recommendations to optimize costs when configuring your Azure Storage Account:
Storage Accounts and operational excellence
Storage Accounts and operational excellence
12/16/2022 7 minutes to read • Edit Online
Design considerations
Checklist
Azure Storage Accounts are ideal for workloads that require fast and consistent response times, or that have a
high number of input output (IOP) operations per second. Storage accounts contain all your Azure Storage data
objects, which include:
Blobs
File shares
Queues
Tables
Disks
Storage accounts provide a unique namespace for your data that's accessible anywhere over HTTP or HTTPS .
For more information about the different types of storage accounts that support different features, reference
Types of storage accounts.
To understand how an Azure storage account can promote operational excellence for your workload, reference
the following articles:
Best practices for monitoring Azure Blob Storage
Use Azure Storage analytics to collect logs and metrics data
Azure Storage analytics logging
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure storage accounts and operational excellence.
Azure storage accounts include the following design considerations:
General purpose v1 storage accounts provide access to all Azure Storage services, but may not have the
latest features or the lower per-gigabyte pricing. It's recommended to use general purpose v2 storage
accounts, in most cases. Reasons to use v1 include:
Applications require the classic deployment model.
Applications are transaction intensive or use significant geo-replication bandwidth, but don't require
large capacity.
The use of a Storage Service REST API that is earlier than February 14, 2014, or a client library with a
version earlier than 4.x is required. An application upgrade isn't possible.
For more information, reference the Storage account overview.
Storage account names must be between three and 24 characters and may contain numbers, and lowercase
letters only.
For current SLA specifications, reference SLA for Storage Accounts.
Go to Azure Storage redundancy to determine which redundancy option is best for a specific scenario.
Storage account names must be unique within Azure. No two storage accounts can have the same name.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Enable Azure Defender for all your storage accounts. Azure Defender for Azure Storage provides an extra layer of
security intelligence that detects unusual and potentially
harmful attempts to access or exploit storage accounts.
Security alerts are triggered in Azure Security Center when
anomalies in activity occur. Alerts are also sent through email
to subscription administrators, with details of suspicious
activity and recommendations on how to investigate, and
remediate threats. For more information, reference
Configure Azure Defender for Azure Storage.
Turn on soft delete for blob data. Soft delete for Azure Storage blobs enables you to recover
blob data after it has been deleted.
Use Azure AD to authorize access to blob data. Azure AD provides superior security and ease of use over
Shared Key for authorizing requests to blob storage. It's
recommended to use Azure AD authorization with your blob
and queue applications when possible to minimize potential
security vulnerabilities inherent in Shared Key. For more
information, reference Authorize access to Azure blobs and
queues using Azure Active Directory.
Consider the principle of least privilege when you assign
permissions to an Azure AD security principal through Azure
RBAC.
When assigning a role to a user, group, or application, grant
that security principal only those permissions necessary for
them to complete their tasks. Limiting access to resources
helps prevent both unintentional and malicious misuse of
your data.
Have you configured your Azure Storage Account with operational excellence in mind?Have you configured your Azure Storage Account with operational excellence in mind?
Enable Azure Defender for all your storage accounts.
Turn on soft delete for blob data.
Use Azure AD to authorize access to blob data.
Consider the principle of least privilege when you assign permissions to an Azure AD security principal
through Azure RBAC.
Use managed identities to access blob and queue data.
Use blob versioning or immutable blobs to store business-critical data.
Restrict default internet access for storage accounts.
Enable firewall rules.
Limit network access to specific networks.
Allow trusted Microsoft services to access the storage account.
Enable the Secure transfer requiredSecure transfer required option on all your storage accounts.
Limit shared access signature (SAS) tokens to HTTPS connections only.
Avoid and prevent using Shared Key authorization to access storage accounts.
Regenerate your account keys periodically.
Create a revocation plan and have it in place for any SAS that you issue to clients.
Use near-term expiration times on an impromptu SAS, service SAS, or account SAS.
Consider the following recommendations to optimize operational excellence when configuring your Azure
Storage Account:
Use managed identities to access blob and queue data. Azure Blob and Queue storage support Azure AD
authentication with managed identities for Azure resources.
Managed identities for Azure resources can authorize access
to blob and queue data using Azure AD credentials from
applications running in Azure virtual machines (VMs),
function apps, virtual machine scale sets, and other services.
By using managed identities for Azure resources together
with Azure AD authentication, you can avoid storing
credentials with your applications that run in the cloud and
issues with expiring service principals. Reference Authorize
access to blob and queue data with managed identities for
Azure resources for more information.
Use blob versioning or immutable blobs to store business-
critical data.
Consider using Blob versioning to maintain previous
versions of an object or the use of legal holds and time-
based retention policies to store blob data in a WORM
(Write Once, Read Many) state. Immutable blobs can be
read, but can't be modified or deleted during the retention
interval. For more information, reference Store business-
critical blob data with immutable storage.
Restrict default internet access for storage accounts. By default, network access to Storage Accounts isn't
restricted and is open to all traffic coming from the internet.
Access to storage accounts should be granted to specific
Azure Virtual Networks only whenever possible or use
private endpoints to allow clients on a virtual network (VNet)
to access data securely over a Private Link. Reference Use
private endpoints for Azure Storage for more information.
Exceptions can be made for Storage Accounts that need to
be accessible over the internet.
Enable firewall rules. Configure firewall rules to limit access to your storage
account to requests that originate from specified IP
addresses or ranges, or from a list of subnets in an Azure
Virtual Network (VNet). For more information about
configuring firewall rules, reference Configure Azure Storage
firewalls and virtual networks.
Limit network access to specific networks. Limiting network access to networks hosting clients
requiring access reduces the exposure of your resources to
network attacks either by using the built-in Firewall and
virtual networks functionality or by using private endpoints.
Allow trusted Microsoft services to access the storage
account.
Turning on firewall rules for storage accounts blocks
incoming requests for data by default, unless the requests
originate from a service operating within an Azure Virtual
Network (VNet) or from allowed public IP addresses. Blocked
requests include those requests from other Azure services,
from the Azure portal, from logging and metrics services,
and so on. You can permit requests from other Azure
services by adding an exception to allow trusted Microsoft
services to access the storage account. For more information
about adding an exception for trusted Microsoft services,
reference Configure Azure Storage firewalls and virtual
networks.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Enable the Secure transfer requiredSecure transfer required option on all your
storage accounts.
When you enable the Secure transfer requiredSecure transfer required option, all
requests made against the storage account must take place
over secure connections. Any requests made over HTTP will
fail. For more information, reference Require secure transfer
in Azure Storage.
Limit shared access signature (SAS) tokens to HTTPS
connections only.
Requiring HTTPS when a client uses a SAS token to access
blob data helps to minimize the risk of eavesdropping. For
more information, reference Grant limited access to Azure
Storage resources using shared access signatures (SAS).
Avoid and prevent using Shared Key authorization to access
storage accounts.
It's recommended to use Azure AD to authorize requests to
Azure Storage and to prevent Shared Key Authorization. For
scenarios that require Shared Key authorization, always
prefer SAS tokens over distributing the Shared Key.
Regenerate your account keys periodically. Rotating the account keys periodically reduces the risk of
exposing your data to malicious actors.
Create a revocation plan and have it in place for any SAS
that you issue to clients.
If a SAS is compromised, you'll want to revoke that SAS
immediately. To revoke a user delegation SAS, revoke the
user delegation key to quickly invalidate all signatures
associated with that key. To revoke a service SAS that's
associated with a stored access policy, you can delete the
stored access policy, rename the policy, or change its expiry
time to a time that is in the past.
Use near-term expiration times on an impromptu SAS,
service SAS, or account SAS.
If a SAS is compromised, it's valid only for a short time. This
practice is especially important if you can't reference a stored
access policy. Near-term expiration times also limit the
amount of data that can be written to a blob by limiting the
time available to upload to it. Clients should renew the SAS
well before the expiration to allow time for retries if the
service providing the SAS is unavailable.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Disks and cost optimization
Disks and cost optimization
12/16/2022 2 minutes to read • Edit Online
Design considerations
C O N SIDERAT IO NSC O N SIDERAT IO NS DESC RIP T IONDESC RIP T ION
Use a shared disk for workload, such as SQL server failover
cluster instance (FCI), file server for general use (IW
workload), and SAP ASCS/SCS.
You can use shared disks to enable cost-effective clustering
instead of setting up your own shared disks through S2D
(Storage Spaces Direct). Sample workloads that would benefit
from shared disks include:
- SQL Server Failover Cluster Instances (FCI)
- Scale-out File Server (SoFS)
- File Server for General Use (IW workload)
- SAP ASCS/SCS
Checklist
Azure managed disks are block-level storage volumes that are managed by Azure and used with Azure Virtual
Machines. Managed disks are like a physical disk in an on-premises server, but these disks are virtualized.
Available disk types include:
Ultra disks
Premium solid-state drives (SSD)
Standard SSDs
Standard hard disk drives (HDD)
For more information about the different types of disks, reference Azure managed disk types.
To understand how Azure managed disks are cost-effective solutions for your workload, reference the following
articles:
Overview of Azure Disk Backup
Understand how your reservation discount is applied to Azure disk storage
Reduce costs with Azure Disks Reservation
The following sections include design considerations, a configuration checklist, and recommended configuration
options specific to Azure managed disks and cost optimization.
Azure Disks include the following design considerations:
Use a shared disk for workload, such as SQL server failover cluster instance (FCI), file server for general use
(IW workload), and SAP ASCS/SCS.
Consider selective disk backup and restore for Azure VMs.
Have you configured your Azure managed disk with cost optimization in mind?Have you configured your Azure managed disk with cost optimization in mind?
Configure data and log files on different disks for database workloads.
Use bursting for P20 and lower disks for workloads, such as batch jobs, workloads, which handle traffic
spikes, and to improve OS boot time.
Consider using Premium disks (P30 and greater).
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Configure data and log files on different disks for database
workloads.
You can optimize IaaS DB workload performance by
configuring system, data, and log files to be on different disk
SKUs (leveraging Premium Disks for data and Ultra Disks for
logs satisfies most production scenarios). Ultra Disk cost and
performance can be optimized by taking advantage of
configuring capacity, IOPS, and throughput independently.
Also, you can dynamically configure these attributes.
Example workloads include:
- SQL on IaaS
- Cassandra DB
- Maria DB
- MySql and
- Mongo DB on IaaS
Use bursting for P20 and lower disks for workloads, such as
batch jobs, workloads, which handle traffic spikes, and to
improve OS boot time.
Azure Disks offer various SKUs and sizes to satisfy different
workload requirements. Some of the more recent features
could help further optimize cost performance of existing disk
use cases. You can use disk bursting for Premium (disks P20
and lower). Example scenarios that could benefit from this
feature include:
- Improving OS boot time
- Handling batch jobs
- Handling traffic spikes
Consider using Premium disks (P30 and greater). Premium Disks (P30 and greater) can be reserved (one or
three years) at a discounted price.
Next step
Consider the following recommendations to optimize costs when configuring your Azure managed disk:
Event Grid and reliability
Event Grid and reliability
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Monitor Event Grid for failed event delivery. The Delivery Failed metric will increase every time a
message can't be delivered to an event handler (timeout or a
non- 200-204 HTTP status code). If an event can't be lost,
set up a Dead-Letter-Queue (DLQ) storage account. A DLQ
account is where events that can't be delivered after the
maximum retry count will be placed. Optionally, implement a
notification system on the DLQ storage account, for
example, by handling a
new file
event through Event Grid.
Azure Event Grid lets you easily build applications with event-based architectures. This solution has build-in
support for events coming from Azure services, like storage blobs and resource groups. Event Grid also has
support for your own events, using custom topics.
For more information about using Event Grid, reference Create and route custom events with Azure Event Grid.
To understand how using Event Grid creates a more reliable workload, reference Server-side geo disaster
recovery in Azure Event Grid.
The following sections are specific to Azure Event Grid and reliability:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Azure Event Grid provides an uptime SLA. For more information, reference SLA for Event Grid.
Have you configured Azure Event Grid with reliability in mind?Have you configured Azure Event Grid with reliability in mind?
Deploy an Event Grid instance per region, in case of a multi-region Azure solution.
Monitor Event Grid for failed event delivery.
Use batched events.
Event batches can't exceed 1MB in size.
Configure and optimize batch-size selection during load testing.
Ensure Event Grid messages are accepted with HTTP 200-204 responses only if delivering to an endpoint
that holds custom code.
Monitor Event Grid for failed event publishing.
Consider the following recommendations to optimize reliability when configuring Azure Event Grid:
Use batched events in high-throughput scenarios. The service will deliver a json array with multiple events to
the subscribers, instead of an array with one event. The
consuming application must be able to process these arrays.
Event batches can't exceed 1MB in size. If the message payload is large, only one or a few messages
will fit in the batch. The consuming service will need to
process more event batches. If your event has a large
payload, consider storing it elsewhere, such as in blob
storage, and passing a reference in the event. When
integrating with third-party services through the
CloudEvents schema, it's not recommended to exceed 64KB
events.
Configure and optimize batch-size selection during load
testing.
Batch size selection depends on the payload size and the
message volume.
Monitor Event Grid for failed event publishing. The Unmatched metric will show messages that are
published, but not matched to any subscription. Depending
on your application architecture, the latter may be
intentional.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']
Resources
| where type == 'microsoft.eventgrid/domains' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']
['privateEndpoint']['id']
Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']
Resources
| where type == 'microsoft.eventgrid/domains' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']
To determine the Input SchemaInput Schema type for all available Event Grid topics, use the following query:
To retrieve the Resource IDResource ID of existing private endpoints for Event Grid domains, use the following query:
To identify Public Network AccessPublic Network Access status for all available Event Grid domains, use the following query:
To identify Firewall RulesFirewall Rules for all public Event Grid domains, use the following query:
To identify Firewall RulesFirewall Rules for all public Event Grid topics, use the following query:
Resources
| where type == 'microsoft.eventgrid/topics' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']
Resources
| where type == 'microsoft.eventgrid/topics' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']
['privateEndpoint']['id']
Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']
Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']
Next step
To retrieve the Resource IDResource ID of existing private endpoints for Event Grid topics, use the following query:
To determine the Input SchemaInput Schema type for all available Event Grid domains, use the following schema:
To identify Public Network AccessPublic Network Access status for all available Event Grid topics, use the following query:
Event Grid and operational excellence
Event Grid and operational excellence
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Monitor Event Grid for failed event delivery. The Delivery Failed metric will increase every time a
message can't be delivered to an event handler (timeout or a
non- 200-204 HTTP status code). If an event can't be lost,
set up a Dead-Letter-Queue (DLQ) storage account. A DLQ
account is where events that can't be delivered after the
maximum retry count will be placed. Optionally, implement a
notification system on the DLQ storage account, for
example, by handling a
new file
event through Event Grid.
Azure Event Grid lets you easily build applications with event-based architectures. This solution has build-in
support for events coming from Azure services, like storage blobs and resource groups. Event Grid also has
support for your own events, using custom topics.
For more information about using Event Grid, reference Create and route custom events with Azure Event Grid.
To understand how using Event Grid promotes operational excellence for your workload, reference Diagnostic
logs for Event Grid topics and Event Grid domains.
The following sections are specific to Azure Event Grid and operational excellence:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Azure Event Grid provides an uptime SLA. For more information, reference SLA for Event Grid.
Have you configured Azure Event Grid with operational excellence in mind?Have you configured Azure Event Grid with operational excellence in mind?
Monitor Event Grid for failed event delivery.
Use batched events.
Event batches can't exceed 1MB in size.
Configure and optimize batch-size selection during load testing.
Ensure Event Grid messages are accepted with HTTP 200-204 responses only if delivering to an endpoint
that holds custom code.
Monitor Event Grid for failed event publishing.
Consider the following recommendations to optimize operational excellence when configuring Azure Event Grid:
Use batched events in high-throughput scenarios. The service will deliver a json array with multiple events to
the subscribers, instead of an array with one event. The
consuming application must be able to process these arrays.
Event batches can't exceed 1MB in size. If the message payload is large, only one or a few messages
will fit in the batch. The consuming service will need to
process more event batches. If your event has a large
payload, consider storing it elsewhere, such as in blob
storage, and passing a reference in the event. When
integrating with third-party services through the
CloudEvents schema, it's not recommended to exceed 64KB
events.
Configure and optimize batch-size selection during load
testing.
Batch size selection depends on the payload size and the
message volume.
Monitor Event Grid for failed event publishing. The Unmatched metric will show messages that are
published, but not matched to any subscription. Depending
on your application architecture, the latter may be
intentional.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']
Resources
| where type == 'microsoft.eventgrid/domains' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']
['privateEndpoint']['id']
Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']
Resources
| where type == 'microsoft.eventgrid/domains' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']
To determine the Input SchemaInput Schema type for all available Event Grid topics, use the following query:
To retrieve the Resource IDResource ID of existing private endpoints for Event Grid domains, use the following query:
To identify Public Network AccessPublic Network Access status for all available Event Grid domains, use the following query:
To identify Firewall RulesFirewall Rules for all public Event Grid domains, use the following query:
To identify Firewall RulesFirewall Rules for all public Event Grid topics, use the following query:
Resources
| where type == 'microsoft.eventgrid/topics' and properties['publicNetworkAccess'] == 'Enabled'
| project name, resourceGroup, location, subscriptionId, properties['inboundIpRules']
Resources
| where type == 'microsoft.eventgrid/topics' and notnull(properties['privateEndpointConnections'])
| mvexpand properties['privateEndpointConnections']
| project-rename privateEndpointConnections = properties_privateEndpointConnections
| project name, resourceGroup, location, subscriptionId, privateEndpointConnections['properties']
['privateEndpoint']['id']
Resources
| where type == 'microsoft.eventgrid/domains'
| project name, resourceGroup, location, subscriptionId, properties['inputSchema']
Resources
| where type == 'microsoft.eventgrid/topics'
| project name, resourceGroup, location, subscriptionId, properties['publicNetworkAccess']
Next step
To retrieve the Resource IDResource ID of existing private endpoints for Event Grid topics, use the following query:
To determine the Input SchemaInput Schema type for all available Event Grid domains, use the following schema:
To identify Public Network AccessPublic Network Access status for all available Event Grid topics, use the following query:
Event Hubs and reliability
Event Hubs and reliability
12/16/2022 4 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
Azure Event Hubs is a scalable event processing service that ingests and processes large volumes of events and
data, with low latency and high reliability. It can receive and process millions of events per second. Data sent to
an event hub can be transformed and stored by using any real-time analytics provider or batching and storage
adapters.
For more information about using Event Hubs, reference the Azure Event Hubs documentation to learn how to
use Event Hubs to ingest millions of events per second from connected devices and applications.
To understand how using Event Hubs creates a more reliable workload, reference Azure Event Hubs - Geo-
disaster recovery.
The following sections are specific to Azure Event Hubs and reliability:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Azure Event Hubs provides an uptime SLA. For more information, reference SLA for Event Hubs.
Have you configured Azure Event Hubs with reliability in mind?Have you configured Azure Event Hubs with reliability in mind?
Create SendOnly and ListenOnly policies for the event publisher and consumer, respectively.
When using the SDK to send events to Event Hubs, ensure the exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException ) are properly caught.
In high-throughput scenarios, use batched events.
Every consumer can read events from one to 32 partitions.
When developing new applications, use EventProcessorClient (.NET and Java) or EventHubConsumerClient
(Python and JavaScript) as the client SDK.
As part of your solution-wide availability and disaster recovery strategy, consider enabling the Event Hubs
geo disaster-recovery option.
When a solution has a large number of independent event publishers, consider using Event Publishers for
fine-grained access control.
Don't publish events to a specific partition.
When publishing events frequently, use the AMQP protocol when possible.
The number of partitions reflect the degree of downstream parallelism you can achieve.
Ensure each consuming application uses a separate consumer group and only one active receiver per
consumer group is in place.
When using the Capture feature, carefully consider the configuration of the time window and file size,
especially with low event volumes.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
When using the SDK to send events to Event Hubs, ensure
the exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException )
are properly caught.
When using HTTPS , ensure a proper retry pattern is
implemented.
In high-throughput scenarios, use batched events. The service will deliver a json array with multiple events to
the subscribers, instead of an array with one event. The
consuming application must process these arrays.
Every consumer can read events from one to 32 partitions. To achieve maximum scale on the side of the consuming
application, every consumer should read from a single
partition.
When developing new applications, use
EventProcessorClient (.NET and Java) or
EventHubConsumerClient (Python and JavaScript) as the
client SDK.
EventProcessorHost has been deprecated.
As part of your solution-wide availability and disaster
recovery strategy, consider enabling the Event Hubs geo
disaster-recovery option.
This option allows the creation of a secondary namespace in
a different region. Only the active namespace receives
messages at any time. Messages and events aren't replicated
to the secondary region. The RTO for the regional failover is
up to 30 minutes
. Confirm this RTO aligns with the
requirements of the customer and fits in the broader
availability strategy. If a higher RTO is required, consider
implementing a client-side failover pattern.
When a solution has a large number of independent event
publishers, consider using Event Publishers for fine-grained
access control.
Event Publishers automatically set the partition key to the
publisher name, so this feature should only be used if the
events originate from all publishers evenly.
Don't publish events to a specific partition. If ordering events is essential, implement ordering
downstream or use a different messaging service instead.
When publishing events frequently, use the AMQP protocol
when possible.
AMQP has higher network costs when initializing the
session, but HTTPS requires TLS overhead for every
request. AMQP has higher performance for frequent
publishers.
The number of partitions reflect the degree of downstream
parallelism you can achieve.
For maximum throughput, use the maximum number of
partitions ( 32 ) when creating the Event Hub. The maximum
number of partitions will allow you to scale up to 32
concurrent processing entities and will offer the highest send
and receive availability.
When using the Capture feature, carefully consider the
configuration of the time window and file size, especially with
low event volumes.
Data Lake will charge for minimal file size for storage (gen1)
or minimal transaction size (gen2). If you set the time
window so low that the file hasn't reached minimum size,
you'll incur extra cost.
Source artifacts
Consider the following recommendations to optimize reliability when configuring Azure Event Hubs:
To find Event Hubs namespaces with BasicBasic SKU, use the following query:
Resources
| where type == 'microsoft.eventhub/namespaces'
| where sku.name == 'Basic'
| project resourceGroup, name, sku.name
Next step
Event Hubs and operational excellence
Event Hubs and operational excellence
12/16/2022 4 minutes to read • Edit Online
Design considerations
Checklist
Azure Event Hubs is a scalable event processing service that ingests and processes large volumes of events and
data, with low latency and high reliability. It can receive and process millions of events per second. Data sent to
an event hub can be transformed and stored by using any real-time analytics provider or batching and storage
adapters.
For more information about using Event Hubs, reference the Azure Event Hubs documentation to learn how to
use Event Hubs to ingest millions of events per second from connected devices and applications.
To understand ways using Event Hubs helps you achieve operational excellence for your workload, reference the
following articles:
Monitor Azure Event Hubs
Stream Azure Diagnostics data using Event Hubs
Scaling with Event Hubs
The following sections are specific to Azure Event Hubs and operational excellence:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Azure Event Hubs provides an uptime SLA. For more information, reference SLA for Event Hubs.
Have you configured Azure Event Hubs with operational excellence in mind?Have you configured Azure Event Hubs with operational excellence in mind?
Create SendOnly and ListenOnly policies for the event publisher and consumer, respectively.
When using the SDK to send events to Event Hubs, ensure the exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException ) are properly caught.
In high-throughput scenarios, use batched events.
Every consumer can read events from one to 32 partitions.
When developing new applications, use EventProcessorClient (.NET and Java) or EventHubConsumerClient
(Python and JavaScript) as the client SDK.
As part of your solution-wide availability and disaster recovery strategy, consider enabling the Event Hubs
geo disaster-recovery option.
When a solution has a large number of independent event publishers, consider using Event Publishers for
fine-grained access control.
Don't publish events to a specific partition.
When publishing events frequently, use the AMQP protocol when possible.
The number of partitions reflect the degree of downstream parallelism you can achieve.
Ensure each consuming application uses a separate consumer group and only one active receiver per
consumer group is in place.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
When using the SDK to send events to Event Hubs, ensure
the exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException )
are properly caught.
When using HTTPS , ensure a proper retry pattern is
implemented.
In high-throughput scenarios, use batched events. The service will deliver a json array with multiple events to
the subscribers, instead of an array with one event. The
consuming application must process these arrays.
Every consumer can read events from one to 32 partitions. To achieve maximum scale on the side of the consuming
application, every consumer should read from a single
partition.
When developing new applications, use
EventProcessorClient (.NET and Java) or
EventHubConsumerClient (Python and JavaScript) as the
client SDK.
EventProcessorHost has been deprecated.
As part of your solution-wide availability and disaster
recovery strategy, consider enabling the Event Hubs geo
disaster-recovery option.
This option allows the creation of a secondary namespace in
a different region. Only the active namespace receives
messages at any time. Messages and events aren't replicated
to the secondary region. The RTO for the regional failover is
up to 30 minutes
. Confirm this RTO aligns with the
requirements of the customer and fits in the broader
availability strategy. If a higher RTO is required, consider
implementing a client-side failover pattern.
When a solution has a large number of independent event
publishers, consider using Event Publishers for fine-grained
access control.
Event Publishers automatically set the partition key to the
publisher name, so this feature should only be used if the
events originate from all publishers evenly.
Don't publish events to a specific partition. If ordering events is essential, implement ordering
downstream or use a different messaging service instead.
When publishing events frequently, use the AMQP protocol
when possible.
AMQP has higher network costs when initializing the
session, but HTTPS requires TLS overhead for every
request. AMQP has higher performance for frequent
publishers.
The number of partitions reflect the degree of downstream
parallelism you can achieve.
For maximum throughput, use the maximum number of
partitions ( 32 ) when creating the Event Hub. The maximum
number of partitions will allow you to scale up to 32
concurrent processing entities and will offer the highest send
and receive availability.
When using the Capture feature, carefully consider the
configuration of the time window and file size, especially with
low event volumes.
Data Lake will charge for minimal file size for storage (gen1)
or minimal transaction size (gen2). If you set the time
window so low that the file hasn't reached minimum size,
you'll incur extra cost.
When using the Capture feature, carefully consider the configuration of the time window and file size,
especially with low event volumes.
Consider the following recommendations to optimize reliability when configuring Azure Event Hubs:
Source artifacts
Resources
| where type == 'microsoft.eventhub/namespaces'
| where sku.name == 'Basic'
| project resourceGroup, name, sku.name
Next step
To find Event Hubs namespaces with BasicBasic SKU, use the following query:
Service Bus and reliability
Service Bus and reliability
12/16/2022 5 minutes to read • Edit Online
Design considerations
Fully manage enterprise message brokering with message queues and publish-subscribe topics used in Azure
Service Bus. This service stores messages in a
broker
(for example, a
queue
) until the consuming party is ready
to receive the messages.
Benefits include:
Load-balancing across competing workers.
Safely routing and transferring data and control across service, and application boundaries.
Coordinating transactional work that requires a high-degree of reliability.
For more information about using Service Bus, reference Azure Service Bus Messaging. Learn how to set up
messaging that connects applications and services across on-premises and cloud environments.
To understand how Service Bus contributes to a reliable workload, reference the following topics:
Asynchronous messaging patterns and high availability
Azure Service Bus Geo-disaster recovery
Handling outages and disasters
The following sections are specific to Azure Service Bus and reliability:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Maximize reliability with an Azure Service Bus uptime SLA. Properly configured applications can send or receive
messages, or do other operations on a deployed Queue or Topic. For more information, reference the Service
Bus SLA.
Other design considerations include:
Express Entities
Partitioned queues and topics
Besides the documentation on Service Bus Premium and Standard messaging tiers, the following features are
only available on the Premium Stock Keeping Unit (SKU):
Dedicated resources.
Virtual network integration: Limits the networks that can connect to the Service Bus instance. Requires
Service Endpoints to be enabled on the subnet. There are Trusted Microsoft services that are not supported
when implementing Virtual Networks(for example, integration with Event Grid). For more information,
reference Allow access to Azure Service Bus namespace from specific virtual networks.
Private endpoints.
IP Filtering/Firewall: Restrict connections to only defined IPv4 addresses or IPv4 address ranges.
Availability zones: Provides enhanced availability by spreading replicas across availability zones within one
region at no extra cost.
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Evaluate Premier-tier benefits of Azure Service Bus. Consider migrating to the Premium tier of Service Bus to
take advantage of platform-supported outage and disaster
protection.
Connect to Service Bus with the AMQP protocol and use
Service Endpoints or Private Endpoints when possible.
This recommendation keeps traffic on the Azure Backbone.
Note: The default connection protocol for
Microsoft.Azure.ServiceBus
and
Windows.Azure.ServiceBus
namespaces is
AMQP
.
Implement geo-replication on the sender and receiver side
to protect against outages and disasters.
Standard tier supports only the implementation of sender
and receiver-side geo-redundancy. An outage or disaster in
an Azure Region could cause downtime for your solution.
Configure Geo-Disaster. - Active/Active
- Active/Passive
- Paired Namespace (Active/Passive)
-
Note: The secondary region should preferably be an Azure
paired region
.
Event Grid integration: Available event types.
Scale messaging units.
Geo-Disaster Recovery (paired namespace).
BYOK (Bring Your Own Key): Azure Service Bus encrypts data at rest and automatically decrypts it when
accessed, but customers can also bring their own customer-managed key.
When deploying Service Bus with Geo-disaster recovery and in availability zones, the Service Level Operation
(SLO) increases dramatically, but does not change the uptime SLA.
Have you configured Azure Ser vice Bus with reliability in mind?Have you configured Azure Ser vice Bus with reliability in mind?
Evaluate Premier-tier benefits of Azure Service Bus.
Ensure that Service Bus Messaging Exceptions are handled properly.
Connect to Service Bus with the Advanced Messaging Queue Protocol (AMQP) and use Service Endpoints or
Private Endpoints when possible.
Review the Best Practices for performance improvements using Service Bus Messaging.
Implement geo-replication on the sender and receiver side to protect against outages and disasters.
Configure Geo-Disaster.
If you need mission-critical messaging with queues and topics, Service Bus Premium is recommended with
Geo-Disaster Recovery.
Configure Zone Redundancy in the Service Bus namespace (
only available with Premium tier
).
Implement high availability for the Service Bus namespace.
Ensure related messages are delivered in guaranteed order.
Evaluate different Java Messaging Service (JMS) features through the JMS API.
Use .NET Nuget packages to communicate with Service Bus messaging entities.
Implement resilience for transient fault handling when sending or receiving messages.
Consider the following recommendations to optimize reliability when configuring Azure Service Bus:
If you need mission-critical messaging with queues and
topics, Service Bus Premium is recommended with Geo-
Disaster Recovery.
Choosing the pattern is dependent on the business
requirements and the recovery time objective (RTO).
Configure Zone Redundancy in the Service Bus namespace
(
only available with Premium tier
).
Zone Redundancy includes three copies of the messaging
store. One zone is allocated as the primary messaging store
and the other zones are allocated as secondaries. If the
primary zone becomes unavailable, a secondary is promoted
to primary with no perceivable downtime.
Availability Zones
are available in a subset of Azure Regions with new regions
added regularly.
Implement high availability for the Service Bus namespace. Premium tier supports Geo-disaster recovery and replication
at the namespace level. At this level, Premium tier provides
high availability for metadata disaster recovery using
primary and secondary disaster recovery namespaces.
Ensure related messages are delivered in guaranteed order. Be aware of the requirement to set a Partition Key, Session
ID, or Message ID on each message to ensure related
messages send to the same partition in the messaging
entity.
Evaluate different JMS features through the JMS API. Features available through the JMS 2.0 API (and its Software
Development Kit (SDK)) are not the same as the features
available through the native SDK. For example, Service Bus
Sessions are not available in JMS.
Implement resilience for transient fault handling when
sending or receiving messages.
It is essential to implement suitable transient fault handling
and error handling for send and receive operations to
maintain throughput and to prevent message loss.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
| where
type == 'microsoft.servicebus/namespaces'
| where
sku.tier == 'Premium'
and isempty(properties.privateEndpointConnections)
Resources
| where
type == 'microsoft.servicebus/namespaces'
| where
sku.tier != 'Premium'
To identify premium Service Bus Instances that are not using private endpoints, use the following query:
To identify Service Bus Instances that are not on the premium tier, use the following query:
To identify premium Service Bus Instances that are not zone redundant, use the following query:
Next step
Resources
| where
type == 'microsoft.servicebus/namespaces'
| where
sku.tier == 'Premium'
and properties.zoneRedundant == 'false'
Service Bus and operational excellence
Service Bus and operational excellence
12/16/2022 3 minutes to read • Edit Online
Design considerations
Fully manage enterprise message brokering with message queues and publish-subscribe topics using Azure
Service Bus. This service stores messages in a
broker
(for example, a
queue
) until the consuming party is ready
to receive the messages.
Benefits include:
Load-balancing work across competing workers.
Safely routing and transferring data and control across service, and application boundaries.
Coordinating transactional work that requires a high-degree of reliability.
For more information about using Service Bus, reference Azure Service Bus Messaging. Learn how to set up
messaging that connects applications and services across on-premises and cloud environments.
To understand how Service Bus promotes operational excellence, reference the following topics:
Handling outages and disasters
Throttling operations on Azure Service Bus
The following sections are specific to Azure Service Bus and operational excellence:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Maximize reliability with an Azure Service Bus uptime Service Level Agreement (SLA). Properly configured
applications can send or receive messages, or do other operations on a deployed Queue or Topic. For more
information, reference the Service Bus SLA.
Other design considerations include:
Express Entities
Partitioned queues and topics
Besides the documentation on Service Bus Premium and Standard messaging tiers, the following features are
only available on the Premium Stock Keeping Unit (SKU):
Dedicated resources.
Virtual network integration: Limits the networks that can connect to the Service Bus instance. Requires
Service Endpoints to be enabled on the subnet. There are Trusted Microsoft services that are not supported
when implementing Virtual Networks (for example, integration with Event Grid). For more information,
reference Allow access to Azure Service Bus namespace from specific virtual networks.
Private endpoints.
IP Filtering/Firewall: Restrict connections to only defined IPv4 addresses or IPv4 address ranges.
Availability zones: Provides enhanced availability by spreading replicas across availability zones within one
region at no extra cost.
Event Grid integration: Available event types.
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Connect to Service Bus with the AMQP protocol and use
Service Endpoints or Private Endpoints when possible.
This recommendation keeps traffic on the Azure Backbone.
Note: The default connection protocol for
Microsoft.Azure.ServiceBus
and
Windows.Azure.ServiceBus
namespaces is
AMQP
.
Establish a process to actively monitor the dead-letter queue
(dlq) messages.
The dead-letter queue holds messages that cannot be
processed or cannot be delivered to any receiver. It is
important to monitor this queue to examine the issue cause,
apply required corrections, and to resubmit messages.
Analyze the differences between Azure Storage Queues and
Azure Service Bus Queues.
You will find that Azure Service Bus Messaging Entities are
more advanced, reliable, and feature-rich than Azure Storage
Queues. If your requirement is for simple queue messaging
without requirements for reliable messaging, then Azure
Storage Queues may be a more suitable option.
Source artifacts
Scale messaging units.
Geo-Disaster Recovery (paired namespace).
BYOK (Bring Your Own Key): Azure Service Bus encrypts data at rest and automatically decrypts it when
accessed, but customers can also bring their own customer-managed key.
When deploying Service Bus with Geo-disaster recovery and in availability zones, the Service Level Objective
(SLO) increases dramatically, but does not change the uptime SLA.
Have you configured Azure Ser vice Bus with operational excellence in mind?Have you configured Azure Ser vice Bus with operational excellence in mind?
Ensure that Service Bus Messaging Exceptions are handled properly.
Connect to Service Bus with the Advanced Message Queuing Protocol (AMQP) and use Service Endpoints or
Private Endpoints when possible.
Establish a process to actively monitor the dead-letter queue (dlq) messages.
Review the Best Practices for performance improvements using Service Bus Messaging.
Analyze the differences between Azure Storage Queues and Azure Service Bus Queues.
Consider the following recommendation to optimize reliability when configuring Azure Service Bus:
Resources
| where
type == 'microsoft.servicebus/namespaces'
| where
sku.tier == 'Premium'
and isempty(properties.privateEndpointConnections)
To identify premium Service Bus Instances that aren't using private endpoints, use the following query:
To identify Service Bus Instances that are not on the premium tier, use the following query:
Next step
Resources
| where
type == 'microsoft.servicebus/namespaces'
| where
sku.tier != 'Premium'
Queue Storage and reliability
Queue Storage and reliability
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
For an SLA increase, use geo-redundant storage. Use geo-redundant storage with read access and configure
the client application to fail over to secondary read
endpoints if the primary endpoints fail to respond. This
consideration should be part of the overall reliability strategy
of your solution.
Azure Queue Storage is a service for storing large numbers of messages that you can access from anywhere in
the world through authenticated calls using HTTP or HTTPS . Queues are commonly used to create a backlog of
work to process asynchronously.
For more information about Queue Storage, reference What is Azure Queue Storage?
To understand how Azure Queue Storage helps maintain a reliable workload, reference the following topics:
Azure Storage redundancy
Disaster recovery and storage account failover
The following sections are specific to Azure Queue Storage and reliability:
Design considerations
Configuration checklist
Recommended configuration options
Source artifacts
Azure Queue Storage follows the SLA statements of the general Storage Account service.
Have you configured Azure Queue Storage with reliability in mind?Have you configured Azure Queue Storage with reliability in mind?
Since Storage Queues are a part of the Azure Storage service, refer to the Storage Accounts configuration
checklist and recommendations for reliability.
Ensure that for all clients accessing the storage account, implement a proper retry policy.
Refer to the Storage guidance for specifics on data recovery for storage accounts.
For an SLA increase, use geo-redundant storage.
Use geo-zone-redundant storage (GZRS) or read-access geo-zone-redundant storage (RA-GZRS) for
durability and protection against failover if an entire data center becomes unavailable.
Consider the following recommendations to optimize reliability when configuring your Azure Queue Storage:
Use geo-zone-redundant storage (GZRS) or read-access
geo-zone-redundant storage (RA-GZRS) for durability and
protection against failover if an entire data center becomes
unavailable.
For more information, reference Azure Storage redundancy.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Source artifacts
Resources
| where
type == 'microsoft.storage/storageaccounts'
and sku.name =~ 'Standard_LRS'
Resources
| where
type == 'microsoft.storage/storageaccounts'
and kind == 'Storage'
Next step
To identify storage accounts using locally redundant storage (LRS), use the following query:
To identify storage accounts using V1 storage accounts, use the following query:
Queue Storage and operational excellence
Queue Storage and operational excellence
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Source artifacts
Resources
| where
type == 'microsoft.storage/storageaccounts'
and kind == 'Storage'
Next step
Azure Queue Storage is a service for storing large numbers of messages that you can access from anywhere in
the world through authenticated calls using HTTP or HTTPS . Queues are commonly used to create a backlog of
work to process asynchronously.
For more information about Queue Storage, reference What is Azure Queue Storage?
To understand how Azure Queue Storage promotes operational excellence, reference the following topics:
Monitoring Azure Queue Storage
Best practices for monitoring Azure Queue Storage
The following sections are specific to Azure Queue Storage and operational excellence:
Design considerations
Configuration checklist
Source artifacts
Azure Queue Storage follows the SLA statements of the general Storage Account service.
Have you configured Azure Queue Storage with operational excellence in mind?Have you configured Azure Queue Storage with operational excellence in mind?
Since Storage Queues are a part of the Azure Storage service, refer to the Storage Accounts configuration
checklist and recommendations for operational excellence.
Ensure that for all clients accessing the storage account, implement a proper retry policy.
Refer to the Storage guidance for specifics on data recovery for storage accounts.
To identify storage accounts using V1 storage accounts, use the following query:
IoT Hub and reliability
IoT Hub and reliability
12/16/2022 4 minutes to read • Edit Online
Design considerations
Checklist
Azure IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication
between an IoT application and its attached devices. You can connect millions of devices and their backend
solutions reliably and securely. Almost any device can be connected to an IoT Hub.
IoT Hub supports monitoring to help you track device creation, device connections, and device failures.
IoT Hub also supports the following messaging patterns:
Device-to-cloud telemetry
Uploading files from devices
Request-reply methods to control your devices from the cloud
For more information about IoT Hub, reference IoT Concepts and Azure IoT Hub.
To understand how IoT Hub supports a reliable workload, reference the following topics:
IoT Hub high availability and disaster recovery
How to achieve cross-region High Availability with IoT Hub
How to clone an Azure IoT Hub to another region
The following sections are specific to Azure IoT Hub and reliability:
Design considerations
Configuration checklist
Recommended configuration options
For more information about the Azure IoT Hub Service Level Agreement, reference SLA for Azure IoT Hub.
Have you configured Azure IoT Hub with reliability in mind?Have you configured Azure IoT Hub with reliability in mind?
Provision a second IoT Hub in another region and have routing logic on the device.
Use the AMQP or MQTT protocol when sending events frequently.
Use only certificates validated by a root CA in the production environment if you're using X.509 certificates
for the device connection.
For maximum throughput, use the maximum number of partitions ( 32 ) when creating the IoT Hub, if you're
planning to use the built-in endpoint.
For scaling, increase the tier and allocated IoT Hub units instead of adding more than one IoT Hub per region.
In high-throughput scenarios, use batched events.
If you require the minimum possible latency, don't use routing and read the events from the built-in
endpoint.
As part of your solution-wide availability and disaster recovery strategy, consider using the IoT Hub cross-
region Disaster Recovery option.
When reading device telemetry from the built-in Event Hub-compatible endpoint, refer to the Event Hub
consumers recommendation.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Provision a second IoT Hub in another region and have
routing logic on the device.
These configurations can be further enhanced with a
Concierge Service.
Use the AMQP or MQTT protocol when sending events
frequently.
AMQP and MQTT have higher network costs when
initializing the session, however HTTPS requires extra TLS
overhead for every request. AMQP and MQTT have higher
performance for frequent publishers.
Use only certificates validated by a root CA in the
production environment if you're using X.509 certificates for
the device connection.
Make sure you have processes in place to update the
certificate before they expire.
For maximum throughput, use the maximum number of
partitions ( 32 ) when creating the IoT Hub, if you're
planning to use the built-in endpoint.
The number of device-to-cloud partitions for the Event Hub-
compatible endpoint reflect the degree of downstream
parallelism you can achieve. This will allow you to scale up to
32 concurrent processing entities and will offer the highest
send and receive availability. This number can't be changed
after creation.
For scaling, increase the tier and allocated IoT Hub units
instead of adding more than one IoT Hub per region.
Adding more than one IoT Hub per region doesn't offer
extra resiliency because all hubs can run on the same
underlying cluster.
In high-throughput scenarios, use batched events. The service will deliver an array with multiple events to the
consumers, instead of an array with one event. The
consuming application must process these arrays.
If you require the minimum possible latency, don't use
routing and read the events from the built-in endpoint.
When using message routing in IoT Hub, latency of the
message delivery increases. On average, latency shouldn't
exceed 500 ms , but there's no guarantee for the delivery
latency.
As part of your solution-wide availability and disaster
recovery strategy, consider using the IoT Hub cross-region
Disaster Recovery option.
This option will move the IoT Hub endpoint to the paired
Azure region. Only the device registry gets replicated. Events
aren't replicated to the secondary region.
The RTO for the
customer-initiated failover is between 10 minutes to a
couple of hours. For a Microsoft-initiated failover, the RTO is
2-26
hours. Confirm this RTO aligns with the requirements
of the customer and fits in the broader availability strategy. If
a higher RTO is required, consider implementing a client-side
failover pattern.
When using an SDK to send events to IoT Hub, ensure the
exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException )
are properly caught.
When using HTTPS , implement a proper retry pattern.
When using an SDK to send events to IoT Hubs, ensure the exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException ) are properly caught.
To avoid telemetry interruption due to throttling and a fully used quota, consider adding a custom auto-
scaling solution.
Consider the following recommendations to optimize reliability when configuring Azure IoT Hub:
Next step
IoT Hub and operational excellence
IoT Hub and operational excellence
12/16/2022 4 minutes to read • Edit Online
Design considerations
Checklist
Azure IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication
between an IoT application and its attached devices. You can connect millions of devices and their backend
solutions reliably and securely. Almost any device can be connected to an IoT Hub.
IoT Hub supports monitoring to help you track device creation, device connections, and device failures.
IoT Hub also supports the following messaging patterns:
Device-to-cloud telemetry
Uploading files from devices
Request-reply methods to control your devices from the cloud
For more information about IoT Hub, reference IoT Concepts and Azure IoT Hub.
To understand how IoT Hub promotes operational excellence, reference the following topics:
Tutorial: Set up and use metrics and logs with an IoT Hub
Monitoring Azure IoT Hub
Trace Azure IoT device-to-cloud messages with distributed tracing (preview)
Check IoT Hub service and resource health
The following sections are specific to Azure IoT Hub and operational excellence:
Design considerations
Configuration checklist
Recommended configuration options
For more information about the Azure IoT Hub Service Level Agreement, reference SLA for Azure IoT Hub.
Have you configured Azure IoT Hub with operational excellence in mind?Have you configured Azure IoT Hub with operational excellence in mind?
Provision a second IoT Hub in another region and have routing logic on the device.
Use the AMQP or MQTT protocol when sending events frequently.
Use only certificates validated by a root CA in the production environment if you're using X.509 certificates
for the device connection.
For maximum throughput, use the maximum number of partitions ( 32 ) when creating the IoT Hub, if you're
planning to use the built-in endpoint.
For scaling, increase the tier and allocated IoT Hub units instead of adding more than one IoT Hub per region.
In high-throughput scenarios, use batched events.
If you require the minimum possible latency, don't use routing and read the events from the built-in
endpoint.
As part of your solution-wide availability and disaster recovery strategy, consider using the IoT Hub cross-
region Disaster Recovery option.
When reading device telemetry from the built-in Event Hub-compatible endpoint, refer to the Event Hub
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Provision a second IoT Hub in another region and have
routing logic on the device.
These configurations can be further enhanced with a
Concierge Service.
Use the AMQP or MQTT protocol when sending events
frequently.
AMQP and MQTT have higher network costs when
initializing the session, however HTTPS requires extra TLS
overhead for every request. AMQP and MQTT have higher
performance for frequent publishers.
Use only certificates validated by a root CA in the
production environment if you're using X.509 certificates for
the device connection.
Make sure you have processes in place to update the
certificate before they expire.
For maximum throughput, use the maximum number of
partitions ( 32 ) when creating the IoT Hub, if you're
planning to use the built-in endpoint.
The number of device-to-cloud partitions for the Event Hub-
compatible endpoint reflect the degree of downstream
parallelism you can achieve. This will allow you to scale up to
32 concurrent processing entities and will offer the highest
send and receive availability. This number can't be changed
after creation.
For scaling, increase the tier and allocated IoT Hub units
instead of adding more than one IoT Hub per region.
Adding more than one IoT Hub per region doesn't offer
extra resiliency because all hubs can run on the same
underlying cluster.
In high-throughput scenarios, use batched events. The service will deliver an array with multiple events to the
consumers, instead of an array with one event. The
consuming application must process these arrays.
If you require the minimum possible latency, don't use
routing and read the events from the built-in endpoint.
When using message routing in IoT Hub, latency of the
message delivery increases. On average, latency shouldn't
exceed 500 ms , but there's no guarantee for the delivery
latency.
As part of your solution-wide availability and disaster
recovery strategy, consider using the IoT Hub cross-region
Disaster Recovery option.
This option will move the IoT Hub endpoint to the paired
Azure region. Only the device registry gets replicated. Events
aren't replicated to the secondary region.
The RTO for the
customer-initiated failover is between 10 minutes to a
couple of hours. For a Microsoft-initiated failover, the RTO is
2-26
hours. Confirm this RTO aligns with the requirements
of the customer and fits in the broader availability strategy. If
a higher RTO is required, consider implementing a client-side
failover pattern.
consumers recommendation.
When using an SDK to send events to IoT Hubs, ensure the exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException ) are properly caught.
To avoid telemetry interruption due to throttling and a fully used quota, consider adding a custom auto-
scaling solution.
Consider the following recommendations for increasing operational excellence when configuring Azure IoT Hub:
When using an SDK to send events to IoT Hub, ensure the
exceptions thrown by the retry policy (
EventHubsException or OperationCancelledException )
are properly caught.
When using HTTPS , implement a proper retry pattern.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
IoT Hub Device Provisioning Service and reliability
IoT Hub Device Provisioning Service and reliability
12/16/2022 2 minutes to read • Edit Online
Design considerations
Next step
The IoT Hub Device Provisioning Service (DPS) is a helper service for IoT Hub. DPS enables zero-touch, just-in-
time provisioning to the right IoT Hub without requiring human intervention. IoT Hub DPS allows customers to
provision millions of devices in a secure and scalable manner.
For more information, reference What is Azure IoT Hub Device Provisioning Service?
To understand how IoT Hub DPS can increase workload reliability, reference IoT Hub DPS supports Availability
Zones.
For more information about the Service Level Agreement for Azure IoT Hub DPS, reference SLA for Azure IoT
Hub.
IoT Hub Device Provisioning Service and operational excellence
IoT Hub Device Provisioning Service and
operational excellence
12/16/2022 2 minutes to read • Edit Online
Design considerations
Next step
The IoT Hub Device Provisioning Service (DPS) is a helper service for IoT Hub. DPS enables zero-touch, just-in-
time provisioning to the right IoT Hub without requiring human intervention. IoT Hub DPS allows customers to
provision millions of devices in a secure and scalable manner.
For more information, reference What is Azure IoT Hub Device Provisioning Service?
To understand how IoT Hub DPS promotes operational excellence, reference How to manage device enrollments
with Azure portal.
For more information about the Service Level Agreement for Azure IoT Hub DPS, reference SLA for Azure IoT
Hub.
Application Delivery (General) and reliability
Application Delivery
(
General
)
and reliability
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
Application Delivery (General) explores key recommendations to deliver internal-facing and external-facing
applications in the Azure network through a secure, highly scalable, and highly available way. General
Application Delivery can be handled using a combination of the following networking services in Azure:
Content Delivery Network (CDN)
Azure Front Door Service
Traffic Manager
Application Gateway
Internet Analyzer
Load Balancer
For more information, reference Azure networking services overview.
The following sections are specific to general Application Delivery and reliability:
Design considerations
Configuration checklist
Recommended configuration options
Application Delivery in Azure includes the following design considerations:
Azure Load Balancer (internal and public) provides high availability for application delivery at a regional
level. (
Standard tier only
)
Azure Traffic manager allows the delivery of applications through DNS redirection, including traffic using
protocols other than HTTP/S .
Azure Front Door allows the secure delivery of highly available HTTP/S applications across Azure regions.
Azure Application Gateway allows the secure delivery of HTTP/S applications at a regional level.
Have you configured your Application Deliver y networking ser vices with reliability in mind?Have you configured your Application Deliver y networking ser vices with reliability in mind?
Use Azure Traffic Manager to deliver global applications that span protocols other than HTTP/S .
When using Azure Front Door and Application Gateway to protect HTTP/S applications, use Web Application
Firewall (WAF) policies in Front Door and lock down Application Gateway to receive traffic only from Azure
Front Door.
Create a separate health endpoint on the backend that the health probe can use. The health endpoint can
aggregate the state of the critical services and dependencies needed to serve requests.
Enable health probes for backends.
Deploy Application Gateway v2 or third-party NVAs used for inbound HTTP/S connections together with the
applications that they're securing.
When doing global load balancing for HTTP/S applications, use Front Door over Traffic Manager.
Use a third-party Network Virtual Appliance (NVA) if Application Gateway v2 can't be used for the security of
HTTP/S applications.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use Azure Traffic Manager to deliver global applications that
span protocols other than HTTP/S .
Traffic manager doesn't forward traffic, but only completes
the DNS redirection. The connection from the client is
established directly to the target using any protocol.
When using Azure Front Door and Application Gateway to
protect HTTP/S applications, use WAF policies in Front
Door and lock down Application Gateway to receive traffic
only from Azure Front Door.
Certain scenarios might force a customer to implement rules
specifically on AppGateway: For example, if ModSec CRS
2.2.9 , CRS 3.0 or CRS 3.1 rules are required, rules can
be only implemented on AppGateway. Conversely, rate-
limiting and geo-filtering are available only on Azure Front
Door, not on AppGateway. Instructions on how to lock down
traffic can be found at FAQ for Azure Front Door.
Create a separate health endpoint on the backend that the
health probe can use. The health endpoint can aggregate
the state of the critical services and dependencies needed to
serve requests.
For more information, reference Health Endpoint Monitoring
pattern.
Enable health probes for backends. Health probes are http(s) endpoints that are queried by
the load balancer (Azure Front Door, Traffic Manager,
AppGateway) service to determine if the backend is healthy
enough to handle requests.
Deploy Application Gateway v2 or third-party NVAs used for
inbound HTTP/S connections together with the
applications that they're securing.
Don't centrally manage Application Gateway v2 or third-
party NVAs within the organization and share with other
workloads.
When doing global load balancing for HTTP/S applications,
use Front Door over Traffic Manager.
- Azure Front Door optimizes the number of TCP
connections to the backend when forwarding traffic.
- Changes to the routing configuration, based on backend
health, are instantaneous. With Traffic Manager, traffic will
point to the original backend until a new DNS lookup occurs,
plus potential time for DNS propagation.
- Front Door supports caching on global edge nodes,
negating the need for a separate CDN service.
- Front Door supports Web Application Firewall rules,
negating the need for a separate WAF service.
For secure delivery of HTTP/S applications, ensure you
enable Web Application Firewall (WAF) protection and
policies.
Enable WAF protection and policies in either Application
Gateway or Front Door.
Application delivery for both internal and external facing
applications should be part of the application.
Don't centrally manage Application Delivery within an
organization.
For secure delivery of HTTP/S applications, ensure you enable Web Application Firewall (WAF) protection
and policies.
Application delivery for both internal and external facing applications should be part of the application.
Global HTTP/S applications that span Azure regions should be delivered and protected using Azure Front
Door with Web Application Firewall (WAF) policies.
All public IP addresses in the solution should be protected with a DDoS Standard protection plan.
Consider the following recommendations to optimize reliability when configuring your Application Delivery
networking services:
Next step
Application Delivery (General) and operational excellence
Application Delivery
(
General
)
and operational
excellence
12/16/2022 4 minutes to read • Edit Online
Design considerations
Checklist
Application Delivery (General) explores key recommendations to deliver internal-facing and external-facing
applications in the Azure network through a secure, highly scalable, and highly available way. General
Application Delivery can be handled using a combination of the following networking services in Azure:
Content Delivery Network (CDN)
Azure Front Door Service
Traffic Manager
Application Gateway
Internet Analyzer
Load Balancer
For more information, reference Azure networking services overview.
The following sections are specific to general Application Delivery and operational excellence:
Design considerations
Configuration checklist
Recommended configuration options
Application Delivery in Azure includes the following design considerations:
Azure Load Balancer (internal and public) provides high availability for application delivery at a regional
level. (
Standard tier only
)
Azure Traffic manager allows the delivery of applications through DNS redirection, including traffic using
protocols other than HTTP/S .
Azure Front Door allows the secure delivery of highly available HTTP/S applications across Azure regions.
Azure Application Gateway allows the secure delivery of HTTP/S applications at a regional level.
Have you configured your Application Deliver y networking ser vices with operational excellence inHave you configured your Application Deliver y networking ser vices with operational excellence in
mind?mind?
Use Azure Traffic Manager to deliver global applications that span protocols other than HTTP/S .
When using Azure Front Door and Application Gateway to protect HTTP/S applications, use Web Application
Firewall (WAF) policies in Front Door and lock down Application Gateway to receive traffic only from Azure
Front Door.
Create a separate health endpoint on the backend that the health probe can use. The health endpoint can
aggregate the state of the critical services and dependencies needed to serve requests.
Enable health probes for backends.
Deploy Application Gateway v2 or third-party NVAs used for inbound HTTP/S connections together with the
applications that they're securing.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use Azure Traffic Manager to deliver global applications that
span protocols other than HTTP/S .
Traffic manager doesn't forward traffic, but only completes
the DNS redirection. The connection from the client is
established directly to the target using any protocol.
When using Azure Front Door and Application Gateway to
protect HTTP/S applications, use WAF policies in Front
Door and lock down Application Gateway to receive traffic
only from Azure Front Door.
Certain scenarios might force a customer to implement rules
specifically on AppGateway: For example, if ModSec CRS
2.2.9 , CRS 3.0 or CRS 3.1 rules are required, rules can
be only implemented on AppGateway. Conversely, rate-
limiting and geo-filtering are available only on Azure Front
Door, not on AppGateway. Instructions on how to lock down
traffic can be found at FAQ for Azure Front Door.
Create a separate health endpoint on the backend that the
health probe can use. The health endpoint can aggregate
the state of the critical services and dependencies needed to
serve requests.
For more information, reference Health Endpoint Monitoring
pattern.
Enable health probes for backends. Health probes are http(s) endpoints that are queried by
the load balancer (Azure Front Door, Traffic Manager,
AppGateway) service to determine if the backend is healthy
enough to handle requests.
Deploy Application Gateway v2 or third-party NVAs used for
inbound HTTP/S connections together with the
applications that they're securing.
Don't centrally manage Application Gateway v2 or third-
party NVAs within the organization and share with other
workloads.
When doing global load balancing for HTTP/S applications,
use Front Door over Traffic Manager.
- Azure Front Door optimizes the number of TCP
connections to the backend when forwarding traffic.
- Changes to the routing configuration, based on backend
health, are instantaneous. With Traffic Manager, traffic will
point to the original backend until a new DNS lookup occurs,
plus potential time for DNS propagation.
- Front Door supports caching on global edge nodes,
negating the need for a separate CDN service.
- Front Door supports Web Application Firewall rules,
negating the need for a separate WAF service.
When doing global load balancing for HTTP/S applications, use Front Door over Traffic Manager.
Use a third-party Network Virtual Appliance (NVA) if Application Gateway v2 can't be used for the security of
HTTP/S applications.
For secure delivery of HTTP/S applications, ensure you enable Web Application Firewall (WAF) protection
and policies.
Application delivery for both internal and external facing applications should be part of the application.
Global HTTP/S applications that span Azure regions should be delivered and protected using Azure Front
Door with Web Application Firewall (WAF) policies.
All public IP addresses in the solution should be protected with a DDoS Standard protection plan.
Consider the following recommendations for operational excellence when configuring your Application Delivery
networking services:
For secure delivery of HTTP/S applications, ensure you
enable Web Application Firewall (WAF) protection and
policies.
Enable WAF protection and policies in either Application
Gateway or Front Door.
Application delivery for both internal and external facing
applications should be part of the application.
Don't centrally manage Application Delivery within an
organization.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Azure Well-Architected Framework review - Azure Application Gateway v2
Azure Well
-
Architected Framework review
-
Azure
Application Gateway v2
12/16/2022 16 minutes to read • Edit Online
Prerequisites
Reliability
Design checklistDesign checklist
RecommendationsRecommendations
This article provides architectural best practices for the Azure Application Gateway v2 family of SKUs. The
guidance is based on the five pillars of architecture excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
We assume that you have working knowledge of Azure Application Gateway and are well versed with v2 SKU
features. For more information, see Azure Application Gateway features.
Understanding the Well-Architected Framework pillars can help produce a high-quality, stable, and efficient
cloud architecture. We recommend that you review your workload by using the Azure Well-Architected
Framework Review assessment.
Use a reference architecture to review the considerations based on the guidance provided in this article. We
recommend that you start with Protect APIs with Application Gateway and API Management and IaaS: Web
application with relational database.
In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component. Use the following information to minimize failed instances.
As you make design choices for Application Gateway, review the Reliability design principles.
Deploy the instances in a zone-aware configuration, where available.
Use Application Gateway with Web Application Firewall (WAF) within an application virtual network to
protect inbound HTTP/S traffic from the Internet.
In new deployments, use Application Gateway v2 unless there is a compelling reason to use v1.
Plan for rule updates
Use health probes to detect backend unavailability
Review the impact of the interval and threshold settings on health probes
Verify downstream dependencies through health endpoints
Explore the following table of recommendations to optimize your Application Gateway configuration for
Reliability.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Plan for rule updates Plan enough time for updates before accessing Application
Gateway or making further changes. For example, removing
servers from backend pool might take some time because
they have to drain existing connections.
Use health probes to detect backend unavailability If Application Gateway is used to load balance incoming
traffic over multiple backend instances, we recommend the
use of health probes. These will ensure that traffic is not
routed to backends that are unable to handle the traffic.
Review the impact of the interval and threshold settings on
health probes
The health probe sends requests to the configured endpoint
at a set interval. Also, there's a threshold of failed requests
that will be tolerated before the backend is marked
unhealthy. These numbers present a trade-off.
- Setting a higher interval puts a higher load on your
service. Each Application Gateway instance sends its own
health probes, so 100 instances every 30 seconds means
100 requests per 30 seconds.
- Setting a lower interval leaves more time before an outage
is detected.
- Setting a low unhealthy threshold may mean that short,
transient failures may take down a backend.
- Setting a high threshold it can take longer to take a
backend out of rotation.
Verify downstream dependencies through health endpoints Suppose each backend has its own dependencies to ensure
failures are isolated. For example, an application hosted
behind Application Gateway may have multiple backends,
each connected to a different database (replica). When such
a dependency fails, the application may be working but
won't return valid results. For that reason, the health
endpoint should ideally validate all dependencies. Keep in
mind that if each call to the health endpoint has a direct
dependency call, that database would receive 100 queries
every 30 seconds instead of 1. To avoid this, the health
endpoint should cache the state of the dependencies for a
short period of time.
When using Azure Front Door and Application Gateway to
protect HTTP/S applications, use WAF policies in Front
Door and lock down Application Gateway to receive traffic
only from Azure Front Door.
Certain scenarios can force you to implement rules
specifically on Application Gateway. For example, if ModSec
CRS 2.2.9, CRS 3.0 or CRS 3.1 rules are required, these rules
can be only implemented on Application Gateway.
Conversely, rate-limiting and geo-filtering are available only
on Azure Front Door, not on AppGateway.
Security
Design checklistDesign checklist
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Security is one of the most important aspects of any architecture. Application Gateway provides features to
employ both the principle of least privilege and defense-in-defense. We recommend you review the Security
design principles.
Set up a TLS policy for enhanced security
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Set up a TLS policy for enhanced security Set up a TLS policy for extra security. Ensure you're using the
latest TLS policy version (AppGwSslPolicy20170401S). This
enforces TLS 1.2 and stronger ciphers.
Use AppGateway for TLS termination There are advantages of using Application Gateway for TLS
termination:
- Performance improves because requests going to different
backends to have to re-authenticate to each backend.
- Better utilization of backend servers because they don't
have to perform TLS processing
- Intelligent routing by accessing the request content.
- Easier certificate management because the certificate only
needs to be installed on Application Gateway.
Use Azure Key Vault to store TLS certificates Application Gateway is integrated with Key Vault. This
provides stronger security, easier separation of roles and
responsibilities, support for managed certificates, and an
easier certificate renewal and rotation process.
When re-encrypting backend traffic, ensure the backend
server certificate contains both the root and intermediate
Certificate Authorities (CAs)
A TLS certificate of the backend server must be issued by a
well-known CA. If the certificate was not issued by a trusted
CA, the Application Gateway checks if the certificate of the
issuing CA was issued by a trusted CA, and so on until either
a trusted CA is found. Only then a secure connection is
established. Otherwise, Application Gateway marks the
backend as unhealthy.
Use an appropriate DNS server for backend pool resources When the backend pool contains a resolvable FQDN, the
DNS resolution is based on a private DNS zone or custom
DNS server (if configured on the VNet), or it uses the default
Azure-provided DNS.
Comply with all NSG restrictions for Application Gateway NSGs are supported on Application Gateway, but there are
some restrictions. For instance, some communication with
certain port ranges is prohibited. Make sure you understand
the implications of those restrictions. For details, see
Network security groups.
Use AppGateway for TLS termination
Use Azure Key Vault to store TLS certificates
When re-encrypting backend traffic, ensure the backend server certificate contains both the root and
intermediate Certificate Authorities (CAs)
Use an appropriate DNS server for backend pool resources
Comply with all NSG restrictions for Application Gateway
Refrain from using UDRs to the backend subnet
Be aware of Application Gateway capacity changes when enabling WAF
Explore the following table of recommendations to optimize your Application Gateway configuration for
Security.
Refrain from using UDRs on the backend subnet Using User Defined Routes (UDR) on the Application
Gateway subnet cause some issues. Health status in the
back-end might be unknown. Application Gateway logs and
metrics might not get generated. We recommend that you
don't use UDRs on the Application Gateway subnet so that
you can view the back-end health, logs, and metrics. If your
organizations require to use UDR in the Application Gateway
subnet, please ensure you review the supported scenarios.
For more information, see Supported user-defined routes.
Be aware of Application Gateway capacity changes when
enabling WAF
When WAF is enabled, every request must be buffered by
the Application Gateway until it fully arrives and check if the
request matches with any rule violation in its core rule set
and then forward the packet to the backend instances. For
large file uploads (30MB+ in size), this can result in a
significant latency. Because Application Gateway capacity
requirements are different with WAF, we do not recommend
enabling WAF on Application Gateway without proper
testing and validation.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Policy definitionsPolicy definitions
Cost optimization
Design checklistDesign checklist
For more suggestions, see Principles of the security pillar.
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Gateway subnets should not be configured with a network security group. This policy denies if a gateway
subnet is configured with a network security group. Assigning a network security group to a gateway subnet
will cause the gateway to stop functioning.
Web Application Firewall (WAF) should be enabled for Application Gateway. Deploy Azure Web Application
Firewall (WAF) in front of public facing web applications for additional inspection of incoming traffic. Web
Application Firewall (WAF) provides centralized protection of your web applications from common exploits
and vulnerabilities such as SQL injections, Cross-Site Scripting, local and remote file executions. You can also
restrict access to your web applications by countries, IP address ranges, and other http(s) parameters via
custom rules.
Web Application Firewall (WAF) should use the specified mode for Application Gateway. Mandates the use of
'Detection' or 'Prevention' mode to be active on all Web Application Firewall policies for Application Gateway.
Azure DDoS Protection should be enabled. DDoS protection should be enabled for all virtual networks with a
subnet that is part of an application gateway with a public IP.
All built-in policy definitions related to Azure Networking are listed in Built-in policies - Network.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies. We recommend you review the Cost optimization design principles.
Familiarize yourself with Application Gateway pricing
Review underutilized resources
Stop Application Gateway instances that are not in use
Have a scale-in and scale-out policy
Review consumption metrics across different parameters
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Familiarize yourself with Application Gateway pricing For information about Application Gateway pricing, see
Understanding Pricing for Azure Application Gateway and
Web Application Firewall. You can also leverage the Pricing
calculator.
Ensure that the options are adequately sized to meet the
capacity demand and deliver expected performance without
wasting resources.
Review underutilized resources Identify and delete Application Gateway instances with
empty backend pools to avoid unnecessary costs.
Stop Application Gateway instances when not in use You aren't billed when Application Gateway is in the stopped
state. Continuously running Application Gateway instances
can incur extraneous costs. Evaluate usage patterns and stop
instances when you don't need them. For example, usage
after business hours in Dev/Test environments is expected to
be low.
See these articles for information about how to stop and
start instances.
- Stop-AzApplicationGateway
- Start-AzApplicationGateway
Have a scale-in and scale-out policy A scale-out policy ensures that there will be enough
instances to handle incoming traffic and spikes. Also, have a
scale-in policy that makes sure the number of instances are
reduced when demand drops. Consider the choice of
instance size. The size can significantly impact the cost. Some
considerations are described in the Estimate the Application
Gateway instance count.
For more information, see What is Azure Application
Gateway v2?
Review consumption metrics across different parameters You're billed based on metered instances of Application
Gateway based on the metrics tracked by Azure. Evaluate
the various metrics and capacity units and determine the
cost drivers. For more information, see Azure Cost
Management and Billing.
The following metrics are key for Application Gateway. This
information can be used to validate that the provisioned
instance count matches the amount of incoming traffic.
- Estimated Billed Capacity Units
- Fixed Billable Capacity Units
- Current Capacity Units
For more information, see Application Gateway metrics.
Make sure you account for bandwidth costs. For details, see
Traffic across billing zones and regions.
Explore the following table of recommendations to optimize your Application Gateway configuration for Cost
optimization.
For more suggestions, see Principles of the cost optimization pillar.
Operational excellence
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Monitor capacity metrics Use these metrics as indicators of utilization of the
provisioned Application Gateway capacity. We strongly
recommend setting up alerts on capacity. For details, see
Application Gateway high traffic support.
Troubleshoot using metrics There are other metrics that can indicate issues either at
Application Gateway or the backend. We recommend
evaluating the following alerts:
- Unhealthy Host Count
- Response Status (dimension 4xx and 5xx)
- Backend Response Status (dimension 4xx and 5xx)
- Backend Last Byte Response Time
- Application Gateway Total Time
For more information, see Metrics for Application Gateway.
Enable diagnostics on Application Gateway and Web
Application Firewall (WAF)
Diagnostic logs allow you to view firewall logs, performance
logs, and access logs. Use these logs to manage and
troubleshoot issues with Application Gateway instances. For
more information, see Back-end health and diagnostic logs
for Application Gateway.
Use Azure Monitor Network Insights Azure Monitor Network Insights provides a comprehensive
view of health and metrics for network resources, including
Application Gateway. For additional details and supported
capabilities for Application Gateway, see Azure Monitor
Network insights.
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Monitoring and diagnostics are crucial. Not only can you measure performance statistics but also use metrics
troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design
principles.
Monitor capacity metrics
Enable diagnostics on Application Gateway and Web Application Firewall (WAF)
Use Azure Monitor Network Insights
Match timeout settings with the backend application
Monitor Key Vault configuration issues using Azure Advisor
Configure and monitor SNAT port limitations
Consider SNAT port limitations in your design
Explore the following table of recommendations to optimize your Application Gateway configuration for
Operational excellence.
Match timeout settings with the backend application Ensure you have configured the IdleTimeout settings to
match the listener and traffic characteristics of the backend
application. The default value is set to four minutes and can
be configured to a maximum of 30. For more information,
see Load Balancer TCP Reset and Idle Timeout.
For workload considerations, see Application Monitoring.
Monitor Key Vault configuration issues using Azure Advisor Application Gateway checks for the renewed certificate
version in the linked Key Vault at every 4-hour interval. If it
is inaccessible due to any incorrectly modified Key Vault
configurations, it logs that error and pushes a corresponding
Advisor recommendation. You must configure the Advisor
alert to stay updated and fix such issues immediately to
avoid any Control or Data plane related problems. To set an
alert for this specific case, use the Recommendation Type as
Resolve Azure Key Vault issue for your ApplicationResolve Azure Key Vault issue for your Application
GatewayGateway.
Consider SNAT port limitations in your design SNAT port limitations are important for backend connections
on the Application Gateway. There are separate factors that
affect how Application Gateway reaches the SNAT port limit.
For example, if the backend is a public IP address, it will
require its own SNAT port. In order to avoid SNAT port
limitations, you can increase the number of instances per
Application Gateway, scale out the backends to have more IP
addresses, or move your backends into the same virtual
network and use private IP addresses for the backends.
Requests per second (RPS) on the Application Gateway will
be affected if the SNAT port limit is reached. For example, if
an Application Gateway reaches the SNAT port limit, then it
won't be able to open a new connection to the backend, and
the request will fail.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Performance efficiency
Design checklistDesign checklist
RecommendationsRecommendations
For more suggestions, see Principles of the operational excellence pillar.
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. We recommend you review the Performance efficiency principles.
Estimate the Application Gateway instance count
Define the maximum instance count
Define the minimum instance count
Define Application Gateway subnet size
Take advantage features for autoscaling and performance benefits
Explore the following table of recommendations to optimize your Application Gateway configuration for
Performance efficiency.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Estimate the Application Gateway instance count Application Gateway v2 scales out based on many aspects,
such as CPU, memory, network utilization, and more. To
determine the approximate instance count, factor in these
metrics:
Current compute units — Indicates CPU utilization. 1
Application Gateway instance is approximately 10 compute
units.
Throughput — Application Gateway instance can serve
~500 Mbps of throughput. This data depends on the type
of payload.
Consider this equation when calculating instance counts.
After you've estimated the instance count, compare that
value to the maximum instance count. This will indicate how
close you are to the maximum available capacity.
Define the minimum instance count For Application Gateway v2 SKU, autoscaling takes some
time (approximately six to seven minutes) before the
additional set of instances is ready to serve traffic. During
that time, if there are short spikes in traffic, expect transient
latency or loss of traffic.
We recommend that you set your minimum instance count
to an optimal level. After you estimate the average instance
count and determine your Application Gateway autoscaling
trends, define the minimum instance count based on your
application patterns. For information, see Application
Gateway high traffic support.
Check the Current Compute Units for the past one month.
This metric represents the gateway's CPU utilization. To
define the minimum instance count, divide the peak usage
by 10. For example, if your average Current Compute Units
in the past month is 50, set the minimum instance count to
five.
Define the maximum instance count We recommend 125 as the maximum autoscale instance
count. Make sure the subnet that has the Application
Gateway has sufficient available IP addresses to support the
scale-up set of instances.
Setting the maximum instance count to 125 has no cost
implications because you're billed only for the consumed
capacity.
Define Application Gateway subnet size Application Gateway needs a dedicated subnet within a
virtual network. The subnet can have multiple instances of
the deployed Application Gateway resource. You can also
deploy other Application Gateway resources in that subnet,
v1 or v2 SKU.
Here are some considerations for defining the subnet size:
- Application Gateway uses one private IP address per
instance and another private IP address if a private front-end
IP is configured.
- Azure reserves five IP addresses in each subnet for internal
use.
- Application Gateway (Standard or WAF SKU) can support
up to 32 instances. Taking 32 instance IP addresses + 1
private front-end IP + 5 Azure reserved, a minimum subnet
size of /26 is recommended. Because the Standard_v2 or
WAF_v2 SKU can support up to 125 instances, using the
same calculation, a subnet size of /24 is recommended.
- If you want to deploy additional Application Gateway
resources in the same subnet, consider the additional IP
addresses that will be required for their maximum instance
count for both, Standard and Standard v2.
Take advantage features for autoscaling and performance
benefits
The v2 SKU offers autoscaling to ensure that your
Application Gateway can scale up as traffic increases. When
compared to v1 SKU, v2 has capabilities that enhance the
performance of the workload. For example, better TLS
offload performance, quicker deployment and update times,
zone redundancy, and more. For more information about
autoscaling features, see Scaling Application Gateway v2 and
WAF v2.
If you are running v1 SKU gateways, consider migrating to
the v2 SKU. For more information, see Migrate Azure
Application Gateway and Web Application Firewall from v1
to v2.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Azure Advisor recommendations
ReliabilityReliability
Additional resources
Azure Architecture Center guidanceAzure Architecture Center guidance
Azure Advisor helps you ensure and improve continuity of your business-critical applications. Review the Azure
Advisor recommendations.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost
effectiveness, performance, and operational excellence of your Application Gateway.
Ensure application gateway fault tolerance
Do not override hostname to ensure website integrity
Using API gateways in microservices
Firewall and Application Gateway for virtual networks
Protect APIs with Application Gateway and API Management
Next steps
IaaS: Web application with relational database
Securely managed web applications
Zero-trust network for web applications with Azure Firewall and Application Gateway
Deploy an Application Gateway to see how it works: Quickstart: Direct web traffic with Azure Application
Gateway - Azure portal
Azure Well
-
Architected Framework review
-
Azure
Firewall
12/16/2022 13 minutes to read • Edit Online
Prerequisites
Reliability
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Use Azure Firewall Manager with Azure Virtual WAN to
deploy and manage instances of Azure Firewall across Virtual
WAN hubs or in hub virtual networks.
Easily create hub-and-spoke and transitive architectures with
native security services for traffic governance and protection.
This article provides architectural best practices for Azure Firewall. The guidance is based on the five pillars of
architecture excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
We assume that you have working knowledge of Azure Firewall and are well versed with its features. For more
information, see Azure Firewall Standard features.
Understanding the Azure Well-Architected Framework pillars can help produce a high-quality, stable, and
efficient cloud architecture. Review your workload by using the Well-Architected Framework review
assessment.
Use a reference architecture to review the considerations based on the guidance provided in this article. Start
with Network-hardened web application with private connectivity to PaaS datastores and Implement a secure
hybrid network.
To learn how Azure Firewall supports a reliable workload, see the following articles:
Introduction to Azure Firewall
Quickstart: Deploy Azure Firewall with availability zones
As you make design choices for Azure Firewall, review the design principles for reliability.
Deploy by using a secured virtual hub.
Use a global Azure Firewall policy.
Determine if you want to use third-party security as a service (SECaaS) providers.
Explore the following table of recommendations to optimize your Azure Firewall configuration for reliability.
Create a global Azure Firewall policy to govern the security
posture across global network environments. Assign the
policy to all instances of Azure Firewall.
Allow for granular policies to meet the requirements of
specific regions. Delegate incremental firewall policies to local
security teams through role-based access control (RBAC).
Configure supported third-party software as a service (SaaS)
security providers within Firewall Manager if you want to use
these solutions to protect outbound connections.
You can use your familiar, best-in-breed, third-party SECaaS
offerings to protect internet access for your users.
Deploy Azure Firewall across multiple availability zones for a
higher service-level agreement (SLA).
Azure Firewall provides different SLAs when it's deployed in a
single availability zone and when it's deployed in multizones.
For more information, see SLA for Azure Firewall. For
information about all Azure SLAs, see SLA summary for
Azure services.
In multi-region environments, deploy an instance of Azure
Firewall per region.
For workloads designed to be resistant to failures and fault
tolerant, remember to consider that instances of Azure
Firewall and Azure Virtual Network are regional resources.
Closely monitor Azure Firewall metrics to ensure this
component of your solution is healthy.
Closely monitor metrics, especially SNAT port utilization,
firewall health state, and throughput.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Security
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Create a global Azure Firewall policy to govern the security
posture across global network environments. Assign the
policy to all instances of Azure Firewall.
Allow for granular policies to meet the requirements of
specific regions. Delegate incremental firewall policies to local
security teams through RBAC.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.
Security is one of the most important aspects of any architecture. Azure Firewall is an intelligent firewall security
service that provides threat protection for your cloud workloads running in Azure.
As you make design choices for Azure Firewall, review the design principles for security.
Use a global Azure Firewall policy.
Use threat intelligence.
Use a DNS proxy.
Direct network traffic through Azure Firewall.
Validate spoke networks.
Determine if you want to use third-party SECaaS providers.
Use just-in-time (JIT) systems.
Explore the following table of recommendations to optimize your Azure Firewall configuration for security.
Enable threat intelligence on Azure Firewall. You can enable threat intelligence-based filtering for your
firewall to alert and deny traffic from or to unknown IP
addresses and domains. The IP addresses and domains are
sourced from the Microsoft Threat Intelligence Feed.
Intelligent Security Graph powers Microsoft threat
intelligence and is used by multiple services, including
Microsoft Defender for Cloud.
Enable Domain Name System (DNS) proxy and point the
infrastructure DNS to Azure Firewall.
By default, Azure Firewall uses Azure DNS. Custom DNS
allows you to configure Azure Firewall to use corporate DNS
to resolve external and internal names.
Configure the user-defined routes (UDR) to force traffic to
Azure Firewall.
Configure UDRs to force traffic to Azure Firewall for
SpoketoSpoke , SpoketoInternet , and SpoketoHybrid
connectivity.
Validate if unnecessary peering exists between the hub
virtual network where Azure Firewall is deployed, and other
spoke virtual networks.
Helps to guarantee that undesired traffic isn't being sent to
the Azure firewall or the hub network where the Azure
Firewall is deployed.
Use security partner providers for third-party SECaaS
offerings.
Security partner providers help filter internet traffic through
a virtual private network or a branch to the internet.
Use JIT systems to control access to virtual machines (VMs)
from the internet.
You can use Microsoft Defender for Cloud JIT to control
access for clients that connect from the internet by using
Azure Firewall.
Configure Azure Firewall in the forced tunneling mode to
route all internet-bound traffic to a designated next hop
instead of going directly to the internet.
Azure Firewall must have direct internet connectivity. If your
AzureFirewallSubnetAzureFirewallSubnet learns a default route to your on-
premises network via the Border Gateway Protocol, you
must configure Azure Firewall in the forced tunneling mode.
Using the forced tunneling feature, you'll need another /26
address space for the Azure Firewall Management subnet.
You're required to name it
AzureFirewallManagementSubnetAzureFirewallManagementSubnet.
If this is an existing Azure Firewall instance that can't be
reconfigured in the forced tunneling mode, create a UDR
with a 0.0.0.0/0 route. Set the NextHopTypeNextHopType value as
InternetInternet. Associate it with AzureFirewallSubnetAzureFirewallSubnet to
maintain internet connectivity.
Set the public IP address to NoneNone to deploy a fully private
data plane when you configure Azure Firewall in the forced
tunneling mode.
When you deploy a new Azure Firewall instance, if you
enable the forced tunneling mode, you can set the public IP
address to NoneNone to deploy a fully private data plane.
However, the management plane still requires a public IP for
management purposes only. The internal traffic from virtual
and on-premises networks won't use that public IP. For more
about forced tunneling, see Azure Firewall forced tunneling.
Use fully qualified domain name (FQDN) filtering in network
rules.
You can use FQDNs based on DNS resolution in Azure
Firewall and firewall policies. This capability allows you to
filter outbound traffic with any TCP/UDP protocol (including
NTP, SSH, RDP, and more). You must enable the DNS Proxy
option to use FQDNs in your network rules. To learn how it
works, see Azure Firewall FQDN filtering in network rules.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Policy definitionsPolicy definitions
Cost optimization
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Deploy the Standard and Premium SKUs where appropriate. The Standard option is usually enough for east-west traffic.
Premium has the necessary extra features for north-south
traffic, the forced tunneling feature, and many other
features. For more information, see Azure Firewall Premium
Preview features. Deploy mixed scenarios using the Standard
and Premium options according to your needs.
Stop Azure Firewall deployments that don't need to run for
24 hours.
You might have development environments that are used
only during business hours. For more information, see
Deallocate and allocate Azure Firewall.
Share the same instance of Azure Firewall across multiple
workloads and Azure Virtual Network.
You can use a central instance of Azure Firewall in the hub
virtual network and share the same firewall across many
spoke virtual networks that are connected to the same hub
from the same region.
Ensure there's no unexpected cross-region traffic as part of
the hub-spoke topology.
Azure Advisor recommendations.
All internet traffic should be routed via your Azure Firewall. Protect your subnets from potential threats by
restricting access to them with Azure Firewall or a supported next-generation firewall.
All built-in policy definitions related to Azure networking are listed in Built-in policies - Network.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies.
As you make design choices for Azure Firewall, review the design principles for cost optimization.
Determine which firewall SKUs to deploy.
Determine if some resources don't need 100% allocation.
Determine where you can optimize firewall use across workloads.
Monitor firewall usage to determine cost-effectiveness.
Review Firewall Manager capabilities to determine potential operational efficiency.
Determine the number of public IP addresses required.
Explore the following table of recommendations to optimize your Azure Firewall configuration for cost
optimization.
Review underutilized Azure Firewall instances. Identify and
delete unused Azure Firewall deployments.
To identify unused Azure Firewall deployments, start by
analyzing the monitoring metrics and UDRs associated with
subnets pointing to the firewall's private IP. Combine that
information with other validations, such as if your instance of
Azure Firewall has any rules (classic) for NAT, Network and
Application, or even if the DNS Proxy setting is configured to
DisabledDisabled, and with internal documentation about your
environment and deployments.
You can detect deployments that are cost-effective over
time.
For more information about monitoring logs and metrics,
see Monitor Azure Firewall logs and metrics and SNAT port
utilization.
Use Azure Firewall Manager and its policies to reduce
operational costs, increase efficiency, and reduce
management overhead.
Review your Firewall Manager policies, associations, and
inheritance carefully. Policies are billed based on firewall
associations. A policy with zero or one firewall association is
free of charge. A policy with multiple firewall associations is
billed at a fixed rate.
For more information, see Pricing - Azure Firewall Manager.
Delete unused public IP addresses and use IP Groups to
reduce your management overhead.
Validate whether all the associated public IP addresses are in
use. If they aren't in use, disassociate and delete them. Use
IP Groups to reduce your management overhead. Evaluate
SNAT port utilization before removing any IP addresses.
You'll only use the number of public IPs your firewall needs.
For more information, see Monitor Azure Firewall logs and
metrics and SNAT port utilization.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Operational excellence
Design checklistDesign checklist
RecommendationsRecommendations
For more suggestions, see Principles of the Cost optimization pillar.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.
Monitoring and diagnostics are crucial. You can measure performance statistics and metrics to troubleshoot and
remediate issues quickly.
As you make design choices for Azure Firewall, review the design principles for operational excellence.
Use logs for monitoring.
Use tags when possible to allow traffic through the firewall.
Use workbooks.
Use the Azure Firewall connector in Microsoft Sentinel.
Explore the following table of recommendations to optimize your Azure Firewall configuration for operational
excellence.
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Turn on logs for Azure Firewall. You can monitor Azure Firewall by using firewall logs or
workbooks. You can also use activity logs for auditing
operations on Azure Firewall resources.
Use FQDN tags on Azure Firewall. FQDN tags make it easy to allow known Azure service
network traffic through your firewall. For example, say you
want to allow Windows Update network traffic through your
firewall. You create an application rule and use the Windows
Update tag. Now network traffic from Windows Update can
flow through your firewall.
Use workbooks in Azure Log Analytics. Workbooks help visualize firewall logs.
Enable Azure Firewall connector in Microsoft Sentinel. You can use Microsoft Sentinel to create detections and logic
apps for Azure Firewall.
Migrate Azure Firewall rules to Azure Firewall Manager
policies for existing deployments.
For existing deployments, migrate Azure Firewall rules to
Azure Firewall Manager policies. Use Azure Firewall Manager
to centrally manage your firewalls and policies.
Monitoring capacity metrics are indicators of the utilization
of provisioned Azure Firewall capacity.
Set alerts as needed to get notifications after reaching a
threshold for any metric.
For information about monitoring logs and metrics, see
Monitor Azure Firewall logs and metrics.
Monitor other Azure Firewall logs and metrics for
troubleshooting and set alerts.
Azure Firewall exposes a few other logs and metrics for
troubleshooting that are suitable indicators of issues.
Evaluate alerts based on the following list.
Use diagnostics logs and policy analytics. Diagnostic logs allow you to view Azure Firewall logs,
performance logs, and access logs. You can use these logs in
Azure to manage and troubleshoot your Azure Firewall
instance.
Policy analytics for Azure Firewall Manager allows you to
start seeing rules and flows that match the rules and hit
count for those rules. You can have full traffic visibility by
watching what rule is in use and the traffic being matched.
Performance efficiency
Design checklistDesign checklist
Application Rule log: Each new connection that
matches one of your configured application rules
results in a log for the accepted/denied connection.
Network Rule log: Each new connection that matches
one of your configured network rules results in a log
for the accepted/denied connection.
DNS Proxy log: This log tracks DNS messages to a
DNS server configured using a DNS proxy.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.
Performance efficiency is the ability of your workload to scale to efficiently meet the demands placed on it by
users.
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
If you'll need more than 512,000 SNAT ports, deploy a NAT
gateway with Azure Firewall.
With a NAT gateway, you can scale up to more than 1 million
ports. For more information, see Scale SNAT ports with
Azure NAT gateway.
Create initial traffic that isn't part of your load tests 20
minutes before the test. Use diagnostics settings to capture
scale-up and scale-down events. You can use the Azure Load
Testing service to generate the initial traffic.
Allows the Azure Firewall instance to scale up its instances to
the maximum.
Use IP Groups to summarize IP address ranges. You can use IP Groups to summarize IP ranges, so you don't
exceed 10,000 network rules. For each rule, Azure multiplies
ports by IP addresses. So, if you have 1 rule with 4 IP
address ranges and 5 ports, you'll consume 20 network
rules.
Configure an Azure Firewall subnet (AzureFirewallSubnet)
with a /26 address space.
Azure Firewall is a dedicated deployment in your virtual
network. Within your virtual network, a dedicated subnet is
required for the instance of Azure Firewall. Azure Firewall
provisions more capacity as it scales.
A /26 address space for its subnets ensures that the firewall
has enough IP addresses available to accommodate the
scaling. Azure Firewall doesn't need a subnet bigger than
/26. The Azure Firewall subnet name must be
AzureFirewallSubnetAzureFirewallSubnet .
Azure Advisor recommendations
Additional resources
As you make design choices for Azure Firewall, review the design principles for performance efficiency.
Determine your SNAT port requirements and if you should deploy a NAT gateway.
Plan load tests to test auto-scale performance in your environment.
Plan network rule requirements and opportunities to summarize IP ranges.
Explore the following table of recommendations to optimize your Azure Firewall configuration for performance
efficiency.
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. Review the
Azure Advisor recommendations.
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure
deployments. Here are some recommendations that can help you improve the reliability, security, cost-
effectiveness, performance, and operational excellence of your instance of Azure Firewall.
Create Azure Service Health alerts to be notified when Azure problems affect you
Ensure you have access to Azure cloud experts when you need it
Enable Traffic Analytics to view insights into traffic patterns across Azure resources
Update your outbound connectivity protocol to Service Tags for Azure Site Recovery
Follow just enough administration (least privilege principle)
Protect your network resources with Microsoft Defender for Cloud
Azure Firewall documentation
Azure Architecture Center guidanceAzure Architecture Center guidance
Next step
Azure Firewall service limits, quotas, and constraints
Azure security baseline for Azure Firewall
Azure Firewall architecture overview
Use Azure Firewall to help protect an Azure Kubernetes Service (AKS) cluster
Hub-spoke network topology in Azure
Implement a secure hybrid network
Network-hardened web application with private connectivity to PaaS datastores
Deploy an instance of Azure Firewall to see how it works:
Tutorial: Deploy and configure Azure Firewall and policy by using the Azure portal
Azure Well
-
Architected Framework review
-
Azure
ExpressRoute
12/16/2022 10 minutes to read • Edit Online
Prerequisites
Reliability
Design checklistDesign checklist
This article provides architectural best practice for Azure ExpressRoute. The guidance is based on the five pillars
of the architecture excellence:
Reliability
Security
Cost optimization
Operational excellence
Performance efficiency
We assume that you have working knowledge of Azure ExpressRoute and are well versed with all of its features.
For more information, see Azure ExpressRoute.
For context, consider reviewing a reference architecture that reflects these considerations in its design. We
recommend that you start with Cloud Adoption Framework Ready methodology's guidance Connect to Azure
and Architect for hybrid connectivity with Azure ExpressRoute. For low-code application architectures, we
recommend reviewing Enabling ExpressRoute for Power Platform when planning and configuring ExpressRoute
for use with Microsoft Power Platform.
In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to
minimize the effects of a single failing component. Use the following information to minimize down time to and
from Azure when establishing connectivity using Azure ExpressRoute.
When discussing about reliability with Azure ExpressRoute it's important to taking into consideration bandwidth
usage, physical layout of the network, and disaster recovery if there's failures. Azure ExpressRoute is capable of
achieving these design considerations and have recommendations for each item in the checklist.
In the design checklistdesign checklist and list of recommendationslist of recommendations below, information is presented in order for you to
design a highly available network between your Azure environment and on-premises network.
As you make design choices for Azure ExpressRoute, review the design principles for adding reliability to the
architecture.
Select between ExpressRoute circuit or ExpressRoute Direct for business requirements.
Configure a diverse physical layer network to the service provider.
Configure ExpressRoute circuits with different service provider to have diverse routing paths.
Configure Active-Active ExpressRoute connections between on-premises and Azure.
Set up availability zone aware ExpressRoute Virtual Network Gateways.
Configure ExpressRoute circuits in a different location than the on-premises network.
Configure ExpressRoute Virtual Network Gateways in different regions.
Configure site-to-site VPN as a backup to ExpressRoute private peering.
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Plan for ExpressRoute circuit or ExpressRoute Direct During the initial planning phase, you want to decide
whether you want to configure an ExpressRoute circuit or an
ExpressRoute Direct connection. An ExpressRoute circuit
allows a private dedicated connection into Azure with the
help of a connectivity provider. ExpressRoute Direct allows
you to extend on-premises network directly into the
Microsoft network at a peering location. You also need to
identify the bandwidth requirement and the SKU type
requirement for your business needs.
Physical layer diversity For better resiliency, plan to have multiple paths between
the on-premises edge and the peering locations
(provider/Microsoft edge locations). This configuration can
be achieved by going through different service provider or
through a different location from the on-premises network.
Plan for geo-redundant circuits To plan for disaster recovery, set up ExpressRoute circuits in
more than one peering locations. You can create circuits in
peering locations in the same metro or different metro and
choose to work with different service providers for diverse
paths through each circuit. For more information, see
Designing for disaster recovery and Designing for high
availability.
Plan for Active-Active connectivity ExpressRoute dedicated circuits guarantee 99.95%
availability when an active-active connectivity is configured
between on-premises and Azure. This mode provides higher
availability of your Expressroute connection. It's also
recommended to configure BFD for faster failover if there's a
link failure on a connection.
Planning for Virtual Network Gateways Create availability zone aware Virtual Network Gateway for
higher resiliency and plan for Virtual Network Gateways in
different region for disaster recovery and high availability.
Monitor circuits and gateway health Set up monitoring and alerts for ExpressRoute circuits and
Virtual Network Gateway health based on various metrics
available.
Enable service health ExpressRoute uses service health to notify about planned
and unplanned maintenance. Configuring service health will
notify you about changes made to your ExpressRoute
circuits.
Set up monitoring for ExpressRoute circuit and ExpressRoute Virtual Network Gateway health.
Configure service health to receive ExpressRoute circuit maintenance notification.
Explore the following table of recommendations to optimize your ExpressRoute configuration for Reliability.
For more suggestions, see Principles of the reliability pillar.
Azure Advisor provides many recommendations for ExpressRoute circuits as they relate to reliability. For
example, Azure Advisor can detect:
ExpressRoute gateways in which only a single ExpressRoute circuit is deployed, instead of multiple. Multiple
ExpressRoute circuits are recommended for add resiliency for the peering location.
Security
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Configure Activity log to send logs to archive Activity logs provide insights into operations that were
performed at the subscription level for ExpressRoute
resources. With Activity logs, you can determine who and
when an operation was performed at the control plane. Data
retention is only 90 days and required to be stored in Log
Analytics, Event Hubs or a storage account for archive.
Maintain inventory of administrative accounts Use Azure RBAC to configure roles to limit user accounts
that can add, update, or delete peering configuration on an
ExpressRoute circuit.
Configure MD5 hash on ExpressRoute circuit During configuration of private peering or Microsoft peering,
apply an MD5 hash to secure messages between the on-
premises route and the MSEE routers.
Configure MACSec for ExpressRoute Direct resources Media Access Control security is a point-to-point security at
the data link layer. ExpressRoute Direct supports configuring
MACSec to prevent security threats to protocols such as
ARP, DHCP, LACP not normally secured on the Ethernet link.
For more information on how to configure MACSec, see
MACSec for ExpressRoute Direct ports.
Encrypt traffic using IPsec Configure a Site-to-site VPN tunnel over your ExpressRoute
circuit to encrypt data transferring between your on-
premises network and Azure virtual network. You can
configure a tunnel using private peering or using Microsoft
peering.
Cost optimization
ExpressRoute circuits that aren't being observed by Connection Monitor, as end-to-end monitoring of your
ExpressRoute circuit is critical for reliability insights.
Network topologies involving multiple peering locations that would benefit from ExpressRoute Global Reach
to improve disaster recovery designs for on-premises connectivity to account for unplanned connectivity
loss.
Security is one of the most important aspects of any architecture. ExpressRoute provides features to employ
both the principle of least privilege and defense-in-defense. We recommend you review the Security design
principles.
Configure Activity log to send logs to archive.
Maintain an inventory of administrative accounts with access to ExpressRoute resources.
Configure MD5 hash on ExpressRoute circuit.
Configure MACSec for ExpressRoute Direct resources.
Encrypt traffic over private peering and Microsoft peering for virtual network traffic.
Explore the following table of recommendations to optimize your ExpressRoute configuration for security.
For more suggestions, see Principles of the security pillar.
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Familiarize yourself with ExpressRoute pricing For information about ExpressRoute pricing, see Understand
pricing for Azure ExpressRoute. You can also use the Pricing
calculator.
Ensure that the options are adequately sized to meet the
capacity demand and deliver expected performance without
wasting resources.
Determine SKU and bandwidth required The way you're charged for your ExpressRoute usage varies
between the three different SKU types. With Local SKU,
you're automatically charged with an Unlimited data plan.
With Standard and Premium SKU, you can select between a
Metered or an Unlimited data plan. All ingress data are free
of charge except when using the Global Reach add-on. It's
important to understand which SKU types and data plan
works best for your workload to best optimize cost and
budget. For more information resizing ExpressRoute circuit,
see upgrading ExpressRoute circuit bandwidth.
Determine the ExpressRoute virtual network gateway size ExpressRoute virtual network gateways are used to pass
traffic into a virtual network over private peering. Review the
performance and scale needs of your preferred Virtual
Network Gateway SKU. Select the appropriate gateway SKU
on your on-premises to Azure workload.
Monitor cost and create budget alerts Monitor the cost of your ExpressRoute circuit and create
alerts for spending anomalies and overspending risks. For
more information, see Monitoring ExpressRoute costs.
Deprovision and delete ExpressRoute circuits no longer in
use.
ExpressRoute circuits are charged from the moment they're
created. To reduce unnecessary cost, deprovision the circuit
with the service provider and delete the ExpressRoute circuit
from your subscription. For steps on how to remove an
ExpressRoute circuit, see Deprovisioning an ExpressRoute
circuit.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies. We recommend you review the Cost optimization design principle and Plan and manage costs for
Azure ExpressRoute.
Familiarize yourself with ExpressRoute pricing.
Determine the ExpressRoute circuit SKU and bandwidth required.
Determine the ExpressRoute virtual network gateway size required.
Monitor cost and create budget alerts.
Deprovision ExpressRoute circuits no longer in use.
Explore the following table of recommendations to optimize your ExpressRoute configuration for Cost
optimization.
For more suggestions, see Principles of the cost optimization pillar.
Azure Advisor can detect ExpressRoute circuits that have been deployed for a significant time but have a
provider status of
Not Provisioned
. Circuits in this state aren't operational; and removing the unused resource
will reduce unnecessary costs.
Operational excellence
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Configure connection monitoring Connection monitoring allows you to monitor connectivity
between your on-premises resources and Azure over the
ExpressRoute private peering and Microsoft peering
connection. Connection monitor can detect networking
issues by identifying where along the network path the
problem is and help you quickly resolve configuration or
hardware failures.
Configure Service Health Set up Service Health notifications to alert when planned
and upcoming maintenance is happening to all ExpressRoute
circuits in your subscription. Service Health also displays past
maintenance along with RCA if an unplanned maintenance
were to occur.
Review metrics with Network Insights ExpressRoute Insights with Network Insights allow you to
review and analyze ExpressRoute circuits, gateways,
connections metrics and health dashboards. ExpressRoute
Insights also provide a topology view of your ExpressRoute
connections where you can view details of your peering
components all in a single place.
Metrics available:
- Availability
- Throughput
- Gateway metrics
Review ExpressRoute resource metrics ExpressRoute uses Azure Monitor to collect metrics and
create alerts base on your configuration. Metrics are
collected for ExpressRoute circuits, ExpressRoute gateways,
ExpressRoute gateway connections, and ExpressRoute Direct.
These metrics are useful for diagnosing connectivity
problems and understanding the performance of your
ExpressRoute connection.
Performance efficiency
Monitoring and diagnostics are crucial. Not only can you measure performance statistics but also use metrics
troubleshoot and remediate issues quickly. We recommend you review the Operational excellence design
principles.
Configure connection monitoring between your on-premises and Azure network.
Configure Service Health for receiving notification.
Review metrics and dashboards available through ExpressRoute Insights using Network Insights.
Review ExpressRoute resource metrics.
Explore the following table of recommendations to optimize your ExpressRoute configuration for Operational
excellence.
For more suggestions, see Principles of the operational excellence pillar.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an
efficient manner. We recommend you review the Performance efficiency principles.
Design checklistDesign checklist
RecommendationsRecommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N B EN EFITB EN EF IT
Test ExpressRoute gateway performance to meet work load
requirements.
Use Azure Connectivity Toolkit to test performance across
your ExpressRoute circuit to understand bandwidth capacity
and latency of your network connection.
Increase the size of the ExpressRoute gateway. Upgrade to a higher gateway SKU for improved throughput
performance between on-premises and Azure environment.
Upgrade ExpressRoute circuit bandwidth Upgrade your circuit bandwidth to meet your work load
requirements. Circuit bandwidth is shared between all virtual
networks connected to the ExpressRoute circuit. Depending
on your work load, one or more virtual networks can use up
all the bandwidth on the circuit.
Enable ExpressRoute FastPath for higher throughput If you're using an Ultra performance or an ErGW3AZ virtual
network gateway, you can enable FastPath to improve the
data path performance between your on-premises network
and Azure virtual network.
Monitor ExpressRoute circuit and gateway metrics Set up alerts base on ExpressRoute metrics to proactively
notify you when a certain threshold is met. These metrics are
useful to understand anomalies that can happen with your
ExpressRoute connection such as outages and maintenance
happening to your ExpressRoute circuits.
Azure Policy
Additional resources
Cloud Adoption Framework guidanceCloud Adoption Framework guidance
Test ExpressRoute gateway performance to meet work load requirements.
Increase the size of the ExpressRoute gateway.
Upgrade the ExpressRoute circuit bandwidth.
Enable ExpressRoute FastPath for higher throughput.
Monitor the ExpressRoute circuit and gateway metrics.
Explore the following table of recommendations to optimize your ExpressRoute configuration for performance
efficiency.
For more suggestions, see Principles of the performance efficiency pillar.
Azure Advisor will offer a recommendation to upgrade your ExpressRoute circuit bandwidth to accommodate
usage when your circuit has recently been consuming over 90% of your procured bandwidth. If your traffic
exceeds your allocated bandwidth, you’ll experience dropped packets, which can lead to significant performance
or reliability impact.
Azure Policy doesn't provide any built-in policies for ExpressRoute, but custom policies can be created to help
govern how ExpressRoute circuits should match your desired end state, such as SKU choice, peering type,
peering configurations and so on.
Traditional Azure network topology
Virtual WAN network topology (Microsoft-managed)
Next steps
Configure an ExpressRoute circuit or ExpressRoute Direct port to establish communication between your on-
premises network and Azure.
API Management and reliability
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Ensure you set quotas and rate limits when exposing APIs to
third parties.
Protect backend services and reduce the load placed on an
API Management scale unit. Rate limiting policies can be
applied at Global, Product, API, and Operation levels to
provide rate limit customization applied to API consumers.
Evaluate the need for response caching. Response caching can reduce API latency and bandwidth
consumption. Response caching reduces the load placed on
the backend APIs leading to improved performance, user
experience, and reduced solution cost.
Learn how to use API Management to publish APIs to external, partner, and employee developers securely and
at scale. This networking service is a hybrid, multicloud management platform for APIs across all environments.
Components include:
API gateway
Management plane
Developer portal
For more information, reference About API Management.
To understand how API Management can increase reliability for your workload, reference the following topics:
Availability zone support for Azure API Management
How to deploy an Azure API Management service instance to multiple Azure regions
How to implement disaster recovery using service backup and restore in Azure API Management
Have you configured API Management with reliability in mind?Have you configured API Management with reliability in mind?
Secure the communication between API Management and your backend.
Ensure that each party has its own credential when exposing APIs to third parties.
Ensure you set quotas and rate limits when exposing APIs to third parties.
Evaluate the need for response caching.
Plan a backup and restore process for your API Management instance.
Configure multiple Azure regions in your API Management service.
Implement a strategy to ensure availability during an outage or disaster affecting an Azure region.
Consider the following recommendations to optimize reliability when configuring your API Management
service:
Plan a backup and restore process for your API
Management instance.
Consider taking regular backups of your API Management
service so that you can easily restore it in another region.
Your recovery time objective may require that a standby is
deployed in a secondary region. It is a good practice to take
regular backups to recreate the service due to unforeseen
loss or misconfiguration of the service. Regular backups
allow you to replicate changes between your primary and
standby instances.
Configure multiple Azure regions in your API Management
service.
Configure your API Management service with multiple
regions to provide high-availability support in case an Azure
region experiences downtime or a disaster scenario.
Configuring multiple regions also reduces API call latency
because calls can be routed to the nearest region.
Implement a strategy to ensure availability during an outage
or disaster affecting an Azure region.
Consider using Azure Traffic Manager, Azure Front Door, or
Azure DNS to enable access to multiple regional
deployments of API Management. Using these services
ensures you can still serve requests due to an outage or
disaster. Requirements include syncing configurations
between these individual Standard instances.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
API Management and cost optimization
API Management and cost optimization
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Configure autoscaling where appropriate. Consider scaling your API Management instance up or down
to control costs. You can configure API Management with
Autoscale based on a metric or a specific count. Costs
depend upon the number of units, which determines
throughput in requests per seconds (RPS). An autoscaled API
Management instance switches between scale units
appropriate for RPS numbers during a specific time window.
Autoscaling helps to achieve balance between cost
optimization and performance.
Learn how to use API Management to publish APIs to external, partner, and employee developers securely and
at scale. This networking service is a hybrid, multicloud management platform for APIs across all environments.
Components include:
API gateway
Management plane
Developer portal
For more information, reference About API Management.
To understand how API Management supports cost optimization for your workload, reference the following
topics:
Automatically scale an Azure API Management instance
Use a virtual network with Azure API Management
Have you configured API Management with cost optimization in mind?Have you configured API Management with cost optimization in mind?
Configure autoscaling where appropriate.
Consider which features you need all the time.
Consider the following recommendations to optimize reliability when configuring your API Management
service:
Consider which features you need all the time. Consider switching between Basic, Standard, and Premium
tiers. If a workload does not need features available in higher
tiers, then consider switching to a lower tier. As an example,
a workload may need just 1GB of cache during off-peak
periods compared to 5GB of cache during peak periods.
Costs associated with such a workload can be reduced by
switching from a Premium to Standard tier during off-peak
periods and back to a Premium tier during peak periods. This
process can be automated as a job using Set-
AzApiManagement cmdlet. Refer to API Management pricing
about features available in different API Management tiers.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
API Management and operational excellence
API Management and operational excellence
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Ensure you set quotas and rate limits when exposing APIs to
third parties.
Protect backend services and reduce the load placed on an
API Management scale unit. Rate limiting policies can be
applied at Global, Product, API, and Operation levels to
provide rate limit customization applied to API consumers.
Learn how to use API Management to publish APIs to external, partner, and employee developers securely and
at scale. This networking service is a hybrid, multicloud management platform for APIs across all environments.
Components include:
API gateway
Management plane
Developer portal
For more information, reference About API Management.
To understand how API Management supports operational excellence, reference the following topics:
Managing Azure API Management using Azure Automation
Observability in Azure API Management
Have you configured API Management with operational excellence in mind?Have you configured API Management with operational excellence in mind?
Secure the communication between API Management and your backend.
Ensure that each party has its own credential when exposing APIs to third parties.
Ensure you set quotas and rate limits when exposing APIs to third parties.
Understand the Microsoft REST API design and architecture guidance.
Enable versioning of APIs to maintain backwards compatibility while adding other features.
Use the API Management Versioning and Revisions features to implement API versioning.
Understand the API import restrictions in API Management.
Understand the Event logging feature.
Trace calls in Azure API Management to help with debugging and testing.
Configure logging using Azure Monitor for the API Management service.
Choose the right modes to access private site connections.
Evaluate firewall rules and IP allowlists based on the API Management public IP address.
Consider the following recommendations for operational excellence when configuring your API Management
service:
Understand the Microsoft REST API design and architecture
guidance.
Follow standards and best practices when using the REST
API. Following best practices enables maximum compatibility
across platforms and implementations. Review the REST API
Guidelines and API Design guidance.
Understand the API import restrictions in API Management. Every effort is made to ensure the API import process runs
smoothly, which includes requiring no customizations. Some
scenarios impose restrictions that will require modification to
the import source. Applies to both REST and SOAP services.
Reference Policy Restrictions for the current API Import
restrictions.
Understand the Event logging feature. Supports event logging to an Azure event hub to perform
near real-time analysis. This feature integrates with external
logging, security information and event management (SIEM)
solutions, or analyzing API usage in near real time.
Trace calls in Azure API Management to help with debugging
and testing.
Tracing must be enabled on the subscription used to make
the request. Tracing is enabled on a request-by-request basis
using the Ocp-Apim-Trace header value. API Tracing is also
built into the admin portal and is enabled by default when
testing APIs from the portal.
Configure logging using Azure Monitor for the API
Management service.
Logs can be sent to a Logs Analytics workspace to enable
complex querying and analysis. Metrics can be ingested for
longer term analysis. All data is then surfaced using Azure
Monitor. It is possible to integrate Application Insights for
Application Performance Management.
Choose the right modes to access private site connections. Supports Virtual Network integration in internal and external
mode.
Evaluate firewall rules and IP allowlists based on the API
Management public IP address.
A fixed public IP address is available for the lifetime of the
service with the Basic, Developer, Standard, and Premium
plans for API Management.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Reliability and Azure Firewall
Reliability and Azure Front Door
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use WAF policies in Front Door. Lock down Application
Gateway to receive traffic only from Azure Front Door when
using Azure Front Door and Application Gateway to protect
HTTP/S applications.
Certain scenarios can force a customer to implement rules
specifically on AppGateway: For example, if ModSec Core
Rule Set (CRS) 2.2.9 , CRS 3.0 , or CRS 3.1 rules are
required, rules can be only implemented on AppGatway.
Rate-limiting and geo-filtering are available only on Azure
Front Door, not on AppGateway. Instructions on how to lock
down traffic can be found at Frequently asked questions for
Azure Front Door
Ensure that the connection to the back-end is re-encrypted. Front Door doesn't support SSL passthrough. Front Door
must hold the certificate to terminate the encrypted
inbound connection.
A scalable and secure entry point, Azure Front Door provides fast delivery of your global web applications. Front
Door uses the Microsoft global edge network to create fast, secure, and widely scalable web applications.
Key features include:
Accelerated application performance by using split TCP-based anycast protocol.
Intelligent health probe monitoring for backend resources.
URL-path based routing for requests.
Enables hosting of multiple websites for efficient application infrastructure.
Cookie-based session affinity.
For more key features and information, reference Why use Azure Front Door?
To understand how Azure Front Door creates a more reliable workload, reference the following topics:
Selecting the Front Door environment for traffic routing (Anycast)
Front Door routing methods
Have you configured Azure Front Door with reliability in mind?Have you configured Azure Front Door with reliability in mind?
Use WAF policies in Front Door. Lock down Application Gateway to receive traffic only from Azure Front Door
when using Azure Front Door and Application Gateway to protect HTTP/S applications.
Use Azure Front Door Web Application Firewall (WAF) policies to provide global protection across Azure
regions for inbound HTTP/S connections to a
Landing Zone
.
Create a rule to block access to the health endpoint from the internet.
Ensure that the connection to the back-end is re-encrypted.
Evaluate the four traffic routing configurations in Azure Front Door.
Consider the following recommendation to optimize reliability when configuring Azure Front Door:
Evaluate the four traffic routing configurations in Azure
Front Door.
The Front Door service supports various traffic-routing
methods to determine how to route your HTTP/HTTPS
traffic to the various service endpoints.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Security and Azure Front Door
Security and Azure Front Door
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Consider using geo-filtering in Azure Front Door. The Front Door service responds to user requests regardless
of the location of the user making the request.
Next step
A scalable and secure entry point, Azure Front Door provides fast delivery of your global web applications. Front
Door uses the Microsoft global edge network to create fast, secure, and widely scalable web applications.
Key features include:
Accelerated application performance by using split TCP-based anycast protocol.
Intelligent health probe monitoring for backend resources.
URL-path based routing for requests.
Enables hosting of multiple websites for efficient application infrastructure.
Cookie-based session affinity.
For more key features and information, reference Why use Azure Front Door?
To understand how Azure Front Door creates a more secure workload, reference the following topics:
Security baseline for Azure Front Door
DDoS protection on Front Door
End-to-end Transport Layer Security (TLS) with Azure Front Door
Have you configured Azure Front Door with security in mind?Have you configured Azure Front Door with security in mind?
Consider using geo-filtering in Azure Front Door.
Consider the following recommendation to optimize security when configuring Azure Front Door:
Operational excellence and Azure Front Door
Operational excellence and Azure Front Door
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Use Web Application Firewall (WAF) policies in Front Door.
Lock down Application Gateway to receive traffic only from
Azure Front Door when using Azure Front Door and
Application Gateway to protect HTTP/S applications.
Certain scenarios can force a customer to implement rules
specifically on AppGateway: For example, if ModSec Core
Rule Set (CRS) 2.2.9 , CRS 3.0 , or CRS 3.1 rules are
required, rules can be only implemented on AppGatway.
Rate-limiting and geo-filtering are available only on Azure
Front Door, not on AppGateway. Instructions on how to lock
down traffic can be found at Frequently asked questions for
Azure Front Door
Ensure that the connection to the back-end is re-encrypted. Front Door doesn't support SSL passthrough. Front Door
must hold the certificate to terminate the encrypted
inbound connection.
A scalable and secure entry point, Azure Front Door provides fast delivery of your global web applications. Front
Door uses the Microsoft global edge network to create fast, secure, and widely scalable web applications.
Key features include:
Accelerated application performance by using split TCP-based anycast protocol.
Intelligent health probe monitoring for backend resources.
URL-path based routing for requests.
Enables hosting of multiple websites for efficient application infrastructure.
Cookie-based session affinity.
For more key features and information, reference Why use Azure Front Door?
To understand how Azure Front Door supports operational excellence, reference the following topics:
How Front Door determines backend health
Monitoring metrics and logs in Azure Front Door
Front Door Standard/Premium (Preview) Logging
Have you configured Azure Front Door with operational excellence in mind?Have you configured Azure Front Door with operational excellence in mind?
Use Web Application Firewall (WAF) policies in Front Door. Lock down Application Gateway to receive traffic
only from Azure Front Door when using Azure Front Door and Application Gateway to protect HTTP/S
applications.
Use Azure Front Door Web Application Firewall (WAF) policies to provide global protection across Azure
regions for inbound HTTP/S connections to a
Landing Zone
.
Create a rule to block access to the health endpoint from the internet.
Ensure that the connection to the back-end is re-encrypted.
Consider the following recommendation to optimize operational excellence when configuring Azure Front Door:
Next step
Reliability and Network Virtual Appliances
Reliability and Network Virtual Appliances
(
NVA
)
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
NVAs should be deployed within a
Landing Zone
or
solution-level
Virtual Network.
If third-party NVAs are required for inbound HTTP/S
connections, deploy NVAs together with the applications
that they're protecting and exposing to the internet.
For Virtual Wide Area Network (VWAN) topologies, deploy
the NVAs to a separate Virtual Network (such as, NVA VNet).
Connect the NVA to the regional Virtual WAN Hub and to
the
Landing Zones
that require access to NVAs.
If third-party NVAs are required for east-west or south-
north traffic protection and filtering, reference Scenario:
Route traffic through an NVA.
Network Virtual Appliances (NVA) are typically used to control the flow of traffic between network segments
classified with different security levels, for example between a perimeter network (also known as DMZ,
demilitarized zone, and screened subnet) and the public internet.
Examples of NVAs include:
Network firewalls
Layer-4 reverse-proxies
Internet Protocol Security (IPsec) Virtual Private Network (VPN) endpoints
Web-based reverse-proxies
Internet proxies
Layer-7 load balancers
For more information about Network Virtual Appliances, reference Deploy highly available NVAs.
To understand how NVAs support a reliable workload, reference the following topics:
Scenario: Route traffic through an NVA
Scenario: Route traffic through NVAs by using custom settings
Use L7 load balancers
Have you configured your Network Vir tual Appliances (NVA) with reliability in mind?Have you configured your Network Vir tual Appliances (NVA) with reliability in mind?
NVAs should be deployed within a
Landing Zone
or
solution-level
Virtual Network.
For Virtual Wide Area Network (VWAN) topologies, deploy the NVAs to a separate Virtual Network (such as,
NVA VNet). Connect the NVA to the regional Virtual WAN Hub and to the
Landing Zones
that require access
to NVAs.
For non-Virtual Wide Are Network (WAN) topologies, deploy the third-party NVAs in the central Hub Virtual
Network (VNet).
Consider the following recommendations to optimize reliability when configuring your Network Virtual
Appliances (NVA):
For non-Virtual Wide Area Network (WAN) topologies,
deploy the third-party NVAs in the central Hub Virtual
Network (VNet).
If third-party NVAs are required for east-west or south-
north traffic protection and filtering, deploy the third-party
NVAs in the central Hub Virtual Network.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Cost optimization and Network Virtual Appliances (NVA)
Cost optimization and Network Virtual Appliances
(
NVA
)
12/16/2022 2 minutes to read • Edit Online
Design considerations
Next step
Network Virtual Appliances (NVA) are typically used to control the flow of traffic between network segments
classified with different security levels, for example between a perimeter network (also known as DMZ,
demilitarized zone, and screened subnet) and the public internet.
Examples of NVAs include:
Network firewalls
Layer-4 reverse-proxies
Internet Protocol Security (IPsec) Virtual Private Network (VPN) endpoints
Web-based reverse-proxies
Internet proxies
Layer-7 load balancers
For more information about Network Virtual Appliances, reference Deploy highly available NVAs.
When deploying a Network Virtual Appliance (NVA), keep in mind the following design considerations:
There's a difference between using a third-party app (NVA) and using an Azure native service (Firewall or
Application Gateway).
With managed Platform as a Service (PaaS) services such as Azure Firewall or Application Gateway, Microsoft
handles the management of the service and the underlying infrastructure. Using NVAs, which usually have to
be deployed on Virtual Machines or Infrastructure as a Service (IaaS), the customer has to handle the
management operations (such as patching and updating) of that Virtual Machine and the appliance on top.
Managing third-party services also involves using specific vendor tools making integration difficult.
Operational excellence and Network Virtual Appliances (NVA)
Operational excellence and Network Virtual
Appliances
(
NVA
)
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
NVAs should be deployed within a
Landing Zone
or
solution-level
Virtual Network.
If third-party NVAs are required for inbound HTTP/S
connections, deploy NVAs together with the applications
that they're protecting and exposing to the internet.
Network Virtual Appliances (NVA) are typically used to control the flow of traffic between network segments
classified with different security levels, for example between a perimeter network (also known as DMZ,
demilitarized zone, and screened subnet) and the public internet.
Examples of NVAs include:
Network firewalls
Layer-4 reverse-proxies
Internet Protocol Security (IPsec) Virtual Private Network (VPN) endpoints
Web-based reverse-proxies
Internet proxies
Layer-7 load balancers
For more information about Network Virtual Appliances, reference Deploy highly available NVAs.
To understand how NVAs promote operational excellence, reference the following topics:
Scenario: Route traffic through an NVA
Scenario: Route traffic through NVAs by using custom settings
Gateway Load Balancer
Have you configured your Network Vir tual Appliances (NVA) with operational excellence in mind?Have you configured your Network Vir tual Appliances (NVA) with operational excellence in mind?
NVAs should be deployed within a
Landing Zone
or
solution-level
Virtual Network.
For Virtual Wide Area Network (VWAN) topologies, deploy the NVAs to a separate Virtual Network (such as,
NVA VNet). Connect the NVA to the regional Virtual WAN Hub and to the
Landing Zones
that require access
to NVAs.
For non-Virtual Wide Are Network (WAN) topologies, deploy the third-party NVAs in the central Hub Virtual
Network (VNet).
Consider the following recommendations to optimize reliability when configuring your Network Virtual
Appliances (NVA):
For Virtual Wide Area Network (VWAN) topologies, deploy
the NVAs to a separate Virtual Network (such as, NVA VNet).
Connect the NVA to the regional Virtual WAN Hub and to
the
Landing Zones
that require access to NVAs.
If third-party NVAs are required for east-west or south-
north traffic protection and filtering, reference Scenario:
Route traffic through an NVA.
For non-Virtual Wide Area Network (WAN) topologies,
deploy the third-party NVAs in the central Hub Virtual
Network (VNet).
If third-party NVAs are required for east-west or south-
north traffic protection and filtering, deploy the third-party
NVAs in the central Hub Virtual Network.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Reliability and Network connectivity
Reliability and Network connectivity
12/16/2022 2 minutes to read • Edit Online
Design considerations
Network connectivity includes three Azure models for private network connectivity:
VNet injection
VNet service endpoints
Private Link
VNet injection applies to services that are deployed specifically for you, such as:
Azure Kubernetes Service (AKS) nodes
SQL Managed Instance
Virtual Machines
These resources connect directly to your virtual network.
Virtual Network (VNet) service endpoints provide secure and direct connectivity to Azure services. These service
endpoints use an optimized route over the Azure network. Service endpoints enable private IP addresses in the
VNet to reach the endpoint of an Azure service without needing a public IP address on the VNet.
Private Link provides dedicated access using private IP addresses to Azure PaaS instances, or custom services
behind an Azure Load Balancer Standard.
Network connectivity includes the following design considerations related to a reliable workload:
Use Private Link, where available, for shared Azure PaaS services. Private Link is generally available for
several services and is in public preview for numerous ones.
Access Azure PaaS services from on-premises through ExpressRoute private peering.
Use either virtual network injection for dedicated Azure services or Azure Private Link for available
shared Azure services. To access Azure PaaS services from on-premises when virtual network injection or
Private Link isn't available, use ExpressRoute with Microsoft peering. This method avoids transiting over
the public internet.
Use virtual network service endpoints to secure access to Azure PaaS services from within your virtual
network. Use virtual network service endpoints only when Private Link isn't available and there are no
concerns with unauthorized movement of data.
Service Endpoints don't allow a PaaS service to be accessed from on-premises networks. Private
Endpoints do.
To address concerns about unauthorized movement of data with service endpoints, use network-virtual
appliance (NVA) filtering. You can also use virtual network service endpoint policies for Azure Storage.
The following native network security services are fully managed services. Customers don't incur the
operational and management costs associated with infrastructure deployments, which can become
complex at scale:
Azure Firewall
Application Gateway
Checklist
Next step
Azure Front Door
PaaS services are typically accessed over public endpoints. The Azure platform provides capabilities to
secure these endpoints or make them entirely private.
You can also use third-party network-virtual appliances (NVAs) if the customer prefers them for
situations where native services don't satisfy specific requirements.
Have you configured Network connectivity with reliability in mind?Have you configured Network connectivity with reliability in mind?
Don't implement forced tunneling to enable communication from Azure to Azure resources.
Unless you use network virtual appliance (NVA) filtering, don't use virtual network service endpoints when
there are concerns about unauthorized movement of data.
Don't enable virtual network service endpoints by default on all subnets.
Cost optimization and Network connectivity
Cost optimization and Network connectivity
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
Network connectivity includes three Azure models for private network connectivity:
VNet injection
VNet service endpoints
Private Link
VNet injection applies to services that are deployed specifically for you, such as:
Azure Kubernetes Service (AKS) nodes
SQL Managed Instance
Virtual Machines
These resources connect directly to your virtual network.
Virtual Network (VNet) service endpoints provide secure and direct connectivity to Azure services. These service
endpoints use an optimized route over the Azure network. Service endpoints enable private IP addresses in the
VNet to reach the endpoint of an Azure service without needing a public IP address on the VNet.
Private Link provides dedicated access using private IP addresses to Azure PaaS instances, or custom services
behind an Azure Load Balancer Standard.
Network connectivity includes the following design considerations related to cost optimization:
Running cost of services: The services are metered. Pay for service itself and consumption on service.
VNet Peering cost: Consider the consequences of putting all resources in a single VNet to save costs. It also
prevents the infrastructure from growing. The VNet can eventually reach a point where new resources don't
fit anymore.
For two peered VNets using a private endpoint: Only the private endpoint access is billed and not the VNet
peering cost.
Azure Firewall is also metered: Pay for the instance and for usage. The same applies to load balancers.
Have you configured Network connectivity with cost optimization in mind?Have you configured Network connectivity with cost optimization in mind?
Select SKU for service so that it does the job required, which allows the customer to grow as the workload
evolves.
For the Load balancer, select two SKUs: Basic (free) and Standard (paid).
For App Gateway, select Basic or V2.
For Gateways, limit throughput and performance.
Select DDoS Standard.
Consider the following recommendation for cost optimization when configuring Network connectivity:
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
For the Load balancer, select two SKUs: Basic (free) and
Standard (paid).
Microsoft recommends Standard because it has richer
capabilities, such as:
- Outbound rules
- Granular network security configuration
- Monitoring
Standard provides a Service Level Agreement (SLA) and can
be deployed in Availability Zones. Capabilities in Basic are
limited.
Select DDoS Standard. Depending on the workload and usage patterns, Standard
can provide useful protection. Otherwise, you can use Basic
for small customers.
Next step
Operational excellence and Network connectivity
Operational excellence and Network connectivity
12/16/2022 2 minutes to read • Edit Online
Design considerations
Network connectivity includes three Azure models for private network connectivity:
VNet injection
VNet service endpoints
Private Link
VNet injection applies to services that are deployed specifically for you, such as:
Azure Kubernetes Service (AKS) nodes
SQL Managed Instance
Virtual Machines
These resources connect directly to your virtual network.
Virtual Network (VNet) service endpoints provide secure and direct connectivity to Azure services. These service
endpoints use an optimized route over the Azure network. Service endpoints enable private IP addresses in the
VNet to reach the endpoint of an Azure service without needing a public IP address on the VNet.
Private Link provides dedicated access using private IP addresses to Azure PaaS instances, or custom services
behind an Azure Load Balancer Standard.
Network connectivity includes the following design considerations related to operational excellence:
Use Private Link, where available, for shared Azure PaaS services. Private Link is generally available for
several services and is in public preview for numerous ones.
Access Azure PaaS services from on-premises through ExpressRoute private peering.
Use either virtual network injection for dedicated Azure services or Azure Private Link for available
shared Azure services. To access Azure PaaS services from on-premises when virtual network injection or
Private Link isn't available, use ExpressRoute with Microsoft peering. This method avoids transiting over
the public internet.
Use virtual network service endpoints to secure access to Azure PaaS services from within your virtual
network. Use virtual network service endpoints only when Private Link isn't available and there are no
concerns with unauthorized movement of data.
Service Endpoints don't allow a PaaS service to be accessed from on-premises networks. Private
Endpoints do.
To address concerns about unauthorized movement of data with service endpoints, use network-virtual
appliance (NVA) filtering. You can also use virtual network service endpoint policies for Azure Storage.
The following native network security services are fully managed services. Customers don't incur the
operational and management costs associated with infrastructure deployments, which can become
complex at scale:
Azure Firewall
Application Gateway
Checklist
Next step
Azure Front Door
PaaS services are typically accessed over public endpoints. The Azure platform provides capabilities to
secure these endpoints or make them entirely private.
You can also use third-party network-virtual appliances (NVAs) if the customer prefers them for
situations where native services don't satisfy specific requirements.
Have you configured Network connectivity with operational excellence in mind?Have you configured Network connectivity with operational excellence in mind?
Don't implement forced tunneling to enable communication from Azure to Azure resources.
Unless you use network virtual appliance (NVA) filtering, don't use virtual network service endpoints when
there are concerns about unauthorized movement of data.
Don't enable virtual network service endpoints by default on all subnets.
Reliability and Azure Virtual Network
Reliability and Azure Virtual Network
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
A fundamental building block for your private network, Azure Virtual Network enables Azure resources to
securely communicate with each other, the internet, and on-premises networks.
Key features of Azure Virtual Network include:
Communication with Azure resources
Communication with the internet
Communication with on-premises resources
Network traffic filtering
For more information, reference What is Azure Virtual Network?
To understand how Azure Virtual Network supports a reliable workload, reference the following topics:
Tutorial: Move Azure VMs across regions
Quickstart: Create a virtual network using the Azure portal
Virtual Network – Business Continuity
The Virtual Network (VNet) includes the following design considerations for a reliable Azure workload:
Overlapping IP address spaces across on-premises and Azure regions creates major contention challenges.
While a Virtual Network address space can be added after creation, this process requires an outage if the
Virtual Network is already connected to another Virtual Network through peering. An outage is necessary
because the Virtual Network peering is deleted and re-created.
Resizing of peered Virtual Networks is in public preview (August 20, 2021).
Some Azure services do require dedicated subnets, such as:
Subnets can be delegated to certain services to create instances of that service within the subnet.
Azure reserves five IP addresses within each subnet, which should be factored in when sizing Virtual
Networks and encompassed subnets.
Azure Firewall
Azure Bastion
Virtual Network Gateway
Have you configured Azure Vir tual Network with reliability in mind?Have you configured Azure Vir tual Network with reliability in mind?
Use Azure DDoS Standard Protection Plans to protect all public endpoints hosted within customer Virtual
Networks.
Enterprise customers must plan for IP addressing in Azure to ensure there's no overlapping IP address space
across considered on-premises locations and Azure regions.
Use IP addresses from the address allocation for private internets (Request for Comment (RFC) 1918).
For environments with limited private IP addresses (RFC 1918) availability, consider using IPv6.
Don't create unnecessarily large Virtual Networks (for example: /16 ) to ensure there's no unnecessary waste
of IP address space.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Don't create Virtual Networks without planning the required
address space in advance.
Adding address space will cause an outage once a Virtual
Network is connected through Virtual Network peering.
Use VNet Service Endpoints to secure access to Azure
Platform as a Service (PaaS) services from within a customer
VNet.
Only when Private Link isn't available and when there are no
data exfiltration concerns.
Access Azure PaaS services from on-premises through
ExpressRoute Private Peering.
Use either VNet injection for dedicated Azure services or
Azure Private Link for available shared Azure services.
To access Azure PaaS services from on-premises networks
when VNet injection or Private Link aren't available, use
ExpressRoute with Microsoft Peering when there are no data
exfiltration concerns.
Avoids transit over the public internet.
Don't replicate on-premises perimeter network (also known
as DMZ, demilitarized zone, and screened subnet) concepts
and architectures into Azure.
Customers can get similar security capabilities in Azure as
on-premises, but the implementation and architecture will
need to be adapted to the cloud.
Ensure the communication between Azure PaaS services that
have been injected into a Virtual Network is locked down
within the Virtual Network using user-defined routes (UDRs)
and network security groups (NSGs).
Azure PaaS services that have been injected into a Virtual
Network still perform management plane operations using
public IP addresses.
Next step
Don't create Virtual Networks without planning the required address space in advance.
Don't use public IP addresses for Virtual Networks, especially if the public IP addresses don't belong to the
customer.
Use VNet Service Endpoints to secure access to Azure Platform as a Service (PaaS) services from within a
customer VNet.
To address data exfiltration concerns with Service Endpoints, use Network Virtual Appliance (NVA) filtering
and VNet Service Endpoint Policies for Azure Storage.
Don't implement forced tunneling to enable communication from Azure to Azure resources.
Access Azure PaaS services from on-premises through ExpressRoute Private Peering.
To access Azure PaaS services from on-premises networks when VNet injection or Private Link aren't
available, use ExpressRoute with Microsoft Peering when there are no data exfiltration concerns.
Don't replicate on-premises perimeter network (also known as DMZ, demilitarized zone, and screened
subnet) concepts and architectures into Azure.
Ensure the communication between Azure PaaS services that have been injected into a Virtual Network is
locked down within the Virtual Network using user-defined routes (UDRs) and network security groups
(NSGs).
Don't use VNet Service Endpoints when there are data exfiltration concerns, unless NVA filtering is used.
Don't enable VNet Service Endpoints by default on all subnets.
Consider the following recommendations to optimize reliability when configuring an Azure Virtual Network:
Operational excellence and Azure Virtual Network
Operational excellence and Azure Virtual Network
12/16/2022 3 minutes to read • Edit Online
Design considerations
Checklist
A fundamental building block for your private network, Azure Virtual Network enables Azure resources to
securely communicate with each other, the internet, and on-premises networks.
Key features of Azure Virtual Network include:
Communication with Azure resources
Communication with the internet
Communication with on-premises resources
Network traffic filtering
For more information, reference What is Azure Virtual Network?
To understand how Azure Virtual Network supports operational excellence, reference the following topics:
Monitoring Azure Virtual Network
Monitoring Azure Virtual Network data reference
Azure Virtual Network concepts and best practices
The Virtual Network (VNet) includes the following design considerations for operational excellence:
Overlapping IP address spaces across on-premises and Azure regions creates major contention challenges.
While a Virtual Network address space can be added after creation, this process requires an outage if the
Virtual Network is already connected to another Virtual Network through peering. An outage is necessary
because the Virtual Network peering is deleted and re-created.
Resizing of peered Virtual Networks is in public preview (August 20, 2021).
Some Azure services do require dedicated subnets, such as:
Subnets can be delegated to certain services to create instances of that service within the subnet.
Azure reserves five IP addresses within each subnet, which should be factored in when sizing Virtual
Networks and encompassed subnets.
Azure Firewall
Azure Bastion
Virtual Network Gateway
Have you configured Azure Vir tual Network with operational excellence in mind?Have you configured Azure Vir tual Network with operational excellence in mind?
Use Azure DDoS Standard Protection Plans to protect all public endpoints hosted within customer Virtual
Networks.
Enterprise customers must plan for IP addressing in Azure to ensure there's no overlapping IP address space
across considered on-premises locations and Azure regions.
Use IP addresses from the address allocation for private internets (Request for Comment (RFC) 1918).
For environments with limited private IP addresses (RFC 1918) availability, consider using IPv6.
Don't create unnecessarily large Virtual Networks (for example: /16 ) to ensure there's no unnecessary waste
of IP address space.
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Don't create Virtual Networks without planning the required
address space in advance.
Adding address space will cause an outage once a Virtual
Network is connected through Virtual Network peering.
Use VNet Service Endpoints to secure access to Azure
Platform as a Service (PaaS) services from within a customer
VNet.
Only when Private Link isn't available and when there are no
data exfiltration concerns.
Access Azure PaaS services from on-premises through
ExpressRoute Private Peering.
Use either VNet injection for dedicated Azure services or
Azure Private Link for available shared Azure services.
To access Azure PaaS services from on-premises networks
when VNet injection or Private Link aren't available, use
ExpressRoute with Microsoft Peering when there are no data
exfiltration concerns.
Avoids transit over the public internet.
Don't replicate on-premises perimeter network (also known
as DMZ, demilitarized zone, and screened subnet) concepts
and architectures into Azure.
Customers can get similar security capabilities in Azure as
on-premises, but the implementation and architecture will
need to be adapted to the cloud.
Ensure the communication between Azure PaaS services that
have been injected into a Virtual Network is locked down
within the Virtual Network using user-defined routes (UDRs)
and network security groups (NSGs).
Azure PaaS services that have been injected into a Virtual
Network still perform management plane operations using
public IP addresses.
Next step
Don't create Virtual Networks without planning the required address space in advance.
Don't use public IP addresses for Virtual Networks, especially if the public IP addresses don't belong to the
customer.
Use VNet Service Endpoints to secure access to Azure Platform as a Service (PaaS) services from within a
customer VNet.
To address data exfiltration concerns with Service Endpoints, use Network Virtual Appliance (NVA) filtering
and VNet Service Endpoint Policies for Azure Storage.
Don't implement forced tunneling to enable communication from Azure to Azure resources.
Access Azure PaaS services from on-premises through ExpressRoute Private Peering.
To access Azure PaaS services from on-premises networks when VNet injection or Private Link aren't
available, use ExpressRoute with Microsoft Peering when there are no data exfiltration concerns.
Don't replicate on-premises perimeter network (also known as DMZ, demilitarized zone, and screened
subnet) concepts and architectures into Azure.
Ensure the communication between Azure PaaS services that have been injected into a Virtual Network is
locked down within the Virtual Network using user-defined routes (UDRs) and network security groups
(NSGs).
Don't use VNet Service Endpoints when there are data exfiltration concerns, unless NVA filtering is used.
Don't enable VNet Service Endpoints by default on all subnets.
Consider the following recommendations for operational excellence when configuring an Azure Virtual
Network:
Reliability and ExpressRoute
Reliability and Azure Load Balancer
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
For production workloads, use the Standard Stock Keeping
Units (SKU).
Basic load balancers don't have a Service Level Agreement
(SLA). The Standard SKU supports Availability Zones.
Next step
Load balancing
refers to evenly distributing load (incoming network traffic) across a group of backend resources
or servers. With Azure Load Balancer, load-balance traffic to and from virtual machines and cloud resources,
and in cross-premises virtual networks.
You can scale your applications and create highly available services with Azure Load Balancer. It supports both
inbound and outbound scenarios. Load balancer provides low latency and high throughput.
Key benefits include:
Load balance internal and external traffic to Azure virtual machines.
Increase availability by distributing resources within and across zones.
Configure outbound connectivity for Azure virtual machines.
Use health probes to monitor load-balanced resources.
For more information, reference Why use Azure Load Balancer?
To understand how Azure Load Balancer supports a reliable workload, reference the following topics:
Improve application scalability and resiliency by using Azure Load Balancer
Load Balancer and Availability Zones
High availability ports overview
Have you configured Azure Load Balancer with reliability in mind?Have you configured Azure Load Balancer with reliability in mind?
For production workloads, use the Standard Stock Keeping Units (SKU).
Consider the following recommendation to optimize reliability when configuring an Azure Load Balancer:
Operational excellence and Azure Load Balancer
Operational excellence and Azure Load Balancer
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
For production workloads, use the Standard Stock Keeping
Units (SKU).
Basic load balancers don't have a Service Level Agreement
(SLA). The Standard SKU supports Availability Zones.
Next step
Load balancing
refers to evenly distributing load (incoming network traffic) across a group of backend resources
or servers. With Azure Load Balancer, load-balance traffic to and from virtual machines and cloud resources,
and in cross-premises virtual networks.
You can scale your applications and create highly available services with Azure Load Balancer. It supports both
inbound and outbound scenarios. Load balancer provides low latency and high throughput.
Key benefits include:
Load balance internal and external traffic to Azure virtual machines.
Increase availability by distributing resources within and across zones.
Configure outbound connectivity for Azure virtual machines.
Use health probes to monitor load-balanced resources.
For more information, reference Why use Azure Load Balancer?
To understand how Azure Load Balancer supports operational excellence, reference the following topics:
Load Balancer health probes
Standard load balancer diagnostics with metrics, alerts, and resource health
Using Insights to monitor and configure your Azure Load Balancer
Have you configured Azure Load Balancer with operational excellence in mind?Have you configured Azure Load Balancer with operational excellence in mind?
For production workloads, use the Standard Stock Keeping Units (SKU).
Consider the following recommendation for operational excellence when configuring an Azure Load Balancer:
Reliability and Traffic Manager
Reliability and Traffic Manager
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
If the Time to Live (TTL) interval of the DNS record is too
long, consider adjusting the health probe timing or DNS
record TTL.
When a backend becomes unavailable, Traffic Manager won't
fail over to another region immediately. There will be a time
interval where clients can't be served. The length of this
interval depends on the time settings of the health probe
(probe interval and the number of unhealthy responses
allowed). If the resulting interval is still too large for the
scenario, consider switching to Azure Front Door for global
load balancing.
Traffic Manager is a Domain Name System (DNS)-based traffic load balancer. This service allows you to
distribute traffic to your public-facing applications across the global Azure regions. Traffic Manager also
provides your public endpoints with high availability and quick responsiveness.
Features include:
Increase application availability
Improve application performance
Service maintenance without downtime
Combine hybrid applications
Distribute traffic for complex deployments
For more information, reference What is Traffic Manager?
To learn how Traffic Manager supports a reliable workload, reference the following articles:
Enhance your service availability and data locality by using Azure Traffic Manager
Using load-balancing services in Azure
Disaster recovery using Azure DNS and Traffic Manager
Have you configured Traffic Manager with reliability in mind?Have you configured Traffic Manager with reliability in mind?
If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or
DNS record TTL.
Implement a custom page to use as a health check for your Traffic Manager.
Evaluate the three different traffic routing methods.
Consider nested Traffic Manager profiles.
Consider the following recommendations to optimize reliability when configuring Traffic Manager:
Implement a custom page to use as a health check for your
Traffic Manager.
A common practice is to implement a custom page within
your application (for example: /health.aspx ). Using this
path for monitoring, you can do application-specific checks,
such as checking performance counters or verifying database
availability. Based on these custom checks, the page returns
an appropriate HTTPS status code.
Evaluate the three different traffic routing methods. Traffic Manager supports three traffic-routing methods to
determine how to route network traffic to the various
service endpoints. Traffic Manager applies the traffic-routing
method to each DNS query it receives. The traffic-routing
method determines which endpoint is returned in the DNS
response. The customer should be aware of these endpoints
and the differences in routing between endpoints.
Consider nested Traffic Manager profiles. Each Traffic Manager profile specifies a single traffic-routing
method. There are scenarios that require more sophisticated
traffic routing than the routing provided by a single Traffic
Manager profile. You can nest Traffic Manager profiles to
combine the benefits of more than one traffic-routing
method. Nested profiles allow you to override the default
Traffic Manager behavior to support larger, more complex
application deployments.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Operational excellence and Traffic Manager
Operational excellence and Traffic Manager
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
If the Time to Live (TTL) interval of the DNS record is too
long, consider adjusting the health probe timing or DNS
record TTL.
When a backend becomes unavailable, Traffic Manager won't
fail over to another region immediately. There will be a time
interval where clients can't be served. The length of this
interval depends on the time settings of the health probe
(probe interval and the number of unhealthy responses
allowed). If the resulting interval is still too large for the
scenario, consider switching to Azure Front Door for global
load balancing.
Next step
Traffic Manager is a Domain Name System (DNS)-based traffic load balancer. This service allows you to
distribute traffic to your public-facing applications across the global Azure regions. Traffic Manager also
provides your public endpoints with high availability and quick responsiveness.
Features include:
Increase application availability
Improve application performance
Service maintenance without downtime
Combine hybrid applications
Distribute traffic for complex deployments
For more information, reference What is Traffic Manager?
To learn how Traffic Manager supports operational excellence, reference the following articles:
Troubleshooting degraded state on Azure Traffic Manager
Traffic Manager endpoint monitoring
Traffic Manager metrics and alerts
Have you configured Traffic Manager with operational excellence in mind?Have you configured Traffic Manager with operational excellence in mind?
If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or
DNS record TTL.
Consider the following recommendation for operational excellence when configuring Traffic Manager:
Cost optimization and IP addresses
Cost optimization and IP addresses
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
PIPs (Public IPs) are free until used. Static PIPs are paid even
when not assigned to resources.
There's a difference in billing for regular and static public IP
addresses. Develop a process to look for orphan network
interface cards (NICs) and PIPs that aren't being used in
production and non-production.
Next step
IP services are a collection of IP address-related services that enable communication in an Azure Virtual
Network. Public and private IP addresses are used in Azure for communication between resources. The
communication with resources can occur in a private Azure Virtual Network and the public internet.
Key features include:
Public IP addresses
Public IP address prefixes
Private IP addresses
Routing preference
Routing preference unmetered
For more information, reference What is Azure Virtual Network IP Services?
To understand how IP services support a cost-optimized workload, reference the following articles:
IP addresses pricing
Create, change, or delete an Azure public IP address
Routing over public Internet (ISP network)
Have you configured IP addresses with cost optimization in mind?Have you configured IP addresses with cost optimization in mind?
PIPs (Public IPs) are free until used. Static PIPs are paid even when not assigned to resources.
Consider the following recommendation for cost optimization when configuring IP addresses:
Cost optimization and Log Analytics
Cost optimization and Log Analytics
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Log Analytics is a tool in the Azure portal used to edit and run log queries with data in Azure Monitor Logs. You
can write a query that returns a set of records. Then, use features of Log Analytics to sort, filter, and analyze
them. Alternately, write a more advanced query to do statistical analysis. Visualize the results in a chart to
identify a particular trend.
For more information, reference Overview of Log Analytics in Azure Monitor.
To understand how Log Analytics supports cost optimization, reference Manage usage and costs with Azure
Monitor Logs. This article includes:
Estimating the costs to manage your environment
Viewing Log Analytics usage on your Azure bill
Understand your usage and optimizing your pricing tier
Log Analytics workspace includes the following design considerations:
Consider how long to retain data on Log Analytics:
Data ingested into Log Analytics workspace can be retained at no additional charge up to the first 31
days. Consider general aspects to configure the Log Analytics workspace level default retention. Consider
specific needs to configure data retention by data type that can be as low as four days. Example: Usually,
performance data doesn't need to be retained longer, instead, security logs may need to be retained
longer.
Consider exporting data for long-term retention or auditing purposes:
Data retained for audit purposes may be exported to a cheaper storage type. Refer to Log Analytics
workspace data export in Azure Monitor (preview).
Have you configured Log Analytics with cost optimization in mind?Have you configured Log Analytics with cost optimization in mind?
Consider adoption of the Commitment Tiers pricing model to the Log Analytics workspace.
Evaluate usage of daily cap to limit the daily ingestion for your workspace.
Understand Log Analytics workspace usage.
Evaluate possible data ingestion volume reducing.
Consider the following recommendation for cost optimization when configuring Log Analytics:
Consider adoption of the Commitment Tiers pricing model
to the Log Analytics workspace.
The usage of Commitment Tiers enables saving as much as
30% compared to Pay-As-You-Go pricing. Commitment
Tiers starts at 100 GB/day and any usage above the
reservation level is billed at the Pay-As-You-Go rate. Refer to
Changing pricing tier about how to change the pricing tier
to Capacity Reservations. Use the Log Analytics usage and
estimated cost page to analyze data usage and calculate
possible Commitment Tiers.
Note: Azure Defender (Security
Center) billing includes
500 MB/node/day
allocation against
the security data types. Take it into consideration when
calculating Commitment Tiers.
Evaluate daily cap usage to limit the daily ingestion for your
workspace.
Daily cap is used to manage an unexpected increase in data
volume. Use daily cap when you want to limit unplanned
charges for your workspace. Use care with this configuration
as it can cause some data to be unwritten on Log Analytics
workspace if the daily cap is reached. This configuration can
impact services whose functionality may depend on the
availability of up-to-date data in the workspace. Refer to Set
the Daily Cap about how to set the daily cap in Application
Insights.
Note: If you have a workspace-based Application
Insights, use daily cap in workspace to limit ingestion and
costs instead of using the cap in Application Insights.
Understand Log Analytics workspace usage. When Log Analytics workspace usage is higher than
expected, consider the troubleshooting guide and the
Understanding ingested data volume guide to understand
the unexpected behavior.
Evaluate possible data ingestion volume reducing. Refer to Tips for reducing data volume documentation to
help configure data ingestion properly.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Security and Application Insights
Security and Application Insights
12/16/2022 2 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Review instances where customer data is captured in your
application.
We don't recommend collecting customer data in Application
Insights, although it can be unavoidable. It's up to you and
your company to determine the strategy you'll use to handle
your private data.
Next step
Application Insights is a feature of Azure Monitor. This feature provides extensible application performance
management (APM) and monitoring for live web apps.
Key features include:
Supports a wide variety of platforms, including .NET, Node.js, Java, and Python.
Works for apps hosted on-premises, hybrid, or on any public cloud.
Integrates with DevOps processes.
Has connection points to many development tools.
Can monitor and analyze customer data from mobile apps by integrating with Visual Studio App Center.
For more information, reference Application Insights overview.
Have you configured Application Insights with security in mind?Have you configured Application Insights with security in mind?
Review instances where customer data is captured in your application.
Consider the following security recommendation when configuring Application Insights:
Cost optimization and Application Insights
Cost optimization and Application Insights
12/16/2022 2 minutes to read • Edit Online
Design considerations
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Application Insights is a feature of Azure Monitor. This feature provides extensible application performance
management (APM) and monitoring for live web apps.
Key features include:
Supports a wide variety of platforms, including .NET, Node.js, Java, and Python.
Works for apps hosted on-premises, hybrid, or on any public cloud.
Integrates with DevOps processes.
Has connection points to many development tools.
Can monitor and analyze customer data from mobile apps by integrating with Visual Studio App Center.
For more information, reference Application Insights overview.
Application Insights includes the following design considerations for cost optimization:
Consider using sampling to reduce the amount of data that's sent:
Sampling is a feature in Application Insights. It's a recommended way to reduce data traffic, data, and
storage costs. Refer to Sampling in Application Insights.
Consider turning off collection for unneeded modules:
On configuration files, you can enable or disable data modules and initializers for tracking data from your
applications. Refer to Application Insights for web pages.
Consider limiting Asynchronous JavaScript and XML (AJAX) call tracing:
AJAX calls can be limited to reduce costs. Refer to Application Insights for web pages, which explains the
fields and its configurations.
Have you configured Application Insights with cost optimization in mind?Have you configured Application Insights with cost optimization in mind?
Evaluate usage of daily cap to limit the daily ingestion for your workspace.
Use sampling in Azure Application Insights to reduce data traffic, data costs, and storage costs, while
preserving a statistically correct analysis of application data.
Consider the following recommendations for cost optimization when configuring Application Insights:
Evaluate daily cap usage to limit the daily ingestion for your
workspace.
Daily cap is used to manage an unexpected increase in data
volume. Use daily cap when you want to limit unplanned
charges for your workspace. Use care with this configuration
as it can cause some data to be unwritten on Log Analytics
workspace if the daily cap is reached. This configuration can
impact services whose functionality may depend on the
availability of up-to-date data in the workspace. Refer to Set
the Daily Cap about how to set the daily cap in Application
Insights.
Note: If you have a workspace-based Application
Insights, use the daily cap in workspace to limit ingestion
and costs instead of using the cap in Application Insights.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Operational excellence and Application Insights
Operational excellence and Application Insights
12/16/2022 3 minutes to read • Edit Online
Checklist
Configuration recommendations
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Configure Application Insights to monitor the availability and
responsiveness of your web application.
After you've deployed your application, you can set up
recurring tests to monitor availability and responsiveness.
Application Insights sends web requests to your application
at regular intervals from points around the world. It can
alert you if your application isn't responding or if it responds
too slowly.
Application Insights is a feature of Azure Monitor. This feature provides extensible application performance
management (APM) and monitoring for live web apps.
Key features include:
Supports a wide variety of platforms, including .NET, Node.js, Java, and Python.
Works for apps hosted on-premises, hybrid, or on any public cloud.
Integrates with DevOps processes.
Has connection points to many development tools.
Can monitor and analyze customer data from mobile apps by integrating with Visual Studio App Center.
For more information, reference Application Insights overview.
Have you configured Application Insights with operational excellence in mind?Have you configured Application Insights with operational excellence in mind?
Configure Application Insights to monitor the availability and responsiveness of your web application.
Be aware that Application Insights can be used to monitor deployed sites and services on-premises (or on an
Azure Virtual Machine (VM)).
Evaluate Java codeless application monitoring for your Java-based application development stack.
Configure sampling in Application Insights.
Record custom events and metrics from sites and services in Application Insights.
Use Application Insights to ingest existing log traces from common libraries, such as ILogger , Nlog , and
log4Net .
Become familiar with the Application Insights quotas and limits.
Review the need for custom analysis. Use Application Insights data with tools such as Azure Dashboards or
Power BI.
Separate data across Application Insights resources.
Consider the following recommendations for operational excellence when configuring Application Insights:
Evaluate Java codeless application monitoring for your Java-
based application development stack.
Java codeless application monitoring is all about simplicity.
There are no code changes. You can enable the Java agent
through a couple of configuration changes. The Java agent
works in any environment and allows you to monitor all
your Java applications. No matter if you're running your Java
apps on Virtual Machines, on-premises, in Azure Kubernetes
Service (AKS), on Windows, or Linux, the Java 3.0 agent
will monitor your app.
Configure sampling in Application Insights. Ingestion sampling operates at the point where the data
from your web servers, browsers, and devices reaches the
Application Insights service endpoints. Although it doesn't
reduce the data sent from your app, it does reduce the
amount processed, retained, and charged by Application
Insights. Use this type of sampling if your app often goes
above its monthly quota. Use ingestion sampling if you don't
have access to the Software Development Kit (SDK)-based
types of sampling.
Record custom events and metrics from sites and services in
Application Insights.
Use Application Insights to record domain-specific custom
events and metrics from your site or service. For example:
number-of-active-baskets
or
product-lines-out-of-stock
.
Use Application Insights to ingest existing log traces from
common libraries, such as ILogger , Nlog , and log4Net .
If you're already using a logging framework such as
ILogger , Nlog , log4Net , or
System.Diagnostics.Trace , we recommend sending your
diagnostic tracing logs to Application Insights. For Python
applications, send diagnostic tracing logs using
AzureLogHandler in OpenCensus Python for Azure
Monitor. You can explore and search these logs, which are
merged with the other log files from your application.
Merging the log files allows you to identify traces associated
with each user request and correlate them with other events
and exception reports.
Become familiar with the Application Insights quotas and
limits.
This information can influence your sampling model and
your strategy for separating Application Insights resources.
Review the need for custom analysis. Use Application
Insights data with tools such as Azure Dashboards or Power
BI.
There are several available options to analyze your
Application Insights data. For example, you can create a
dashboard in the Azure portal that includes tiles visualizing
data from multiple Azure resources across different resource
groups and subscriptions. Alternatively, you can use Power
BI to analyze data combined with data from other sources
and share insights.
Separate data across Application Insights resources. It's important to consider when to share a single Application
Insights resource and when to create a new one. For
example, you should use a single resource for application
components that you deploy together, a single Team
develops, or that the same set of DevOps or ITOps users
manages. You should use a separate resource for different
environments.
REC O M M EN DAT IO NREC O M M EN DAT IO N DE SC R IP T IO NDESC RIP T IO N
Next step
Operational excellence and Application Insights
Implementing recommendations
12/16/2022 2 minutes to read • Edit Online
Overview
Assess workload
NOTENOTE
Integrate recommendations
The purpose of this document is to provide a guide for incorporating the recommendations generated by the
Microsoft Azure Well-Architected Review and Azure Advisor into a new or established operational process for
continuous workload improvement.
This first step in the Well-Architected Recommendation Process is to conduct an assessment of your workload.
The Microsoft Azure Well-Architected Review tool generates a set of recommendations through a guided
assessment based on the Microsoft Well-Architected Framework. This tool also has the ability to pull in Azure
Advisor recommendations based on an Azure subscription or resource group. At the end of the assessment,
there is an option to export these recommendations into a CSV file that can then be used to incorporate them
into the operational process for the workload.
When using the Microsoft Azure Well-Architected Review tool, it's important to Sign inSign in and select the Azure subscription
or resource group that contains your workload. This will ensure that only the relevant Azure Advisor recommendations
are included when exporting the CSV.
The goal of this stage is to establish a backlog of work items either through a manual or an automated process
that is based on the recommendations generated by the assessment. The process and tooling for managing the
backlog for your workload should be well-defined based on your cloud operating model.
NOTENOTE
Triage backlog
Implement work items
Monitor progress
The Microsoft Cloud Adoption Framework for Azure is proven guidance that's designed to help you create and implement
the business and technology strategies necessary for your organization to succeed in the cloud. This guidance can help
you define the cloud operating model that governs the operational process for your workload.
If you're using a DevOps approach for your operational process, Microsoft has provided example scripts that will
automate the import of the recommendations from the Well-Architected Review exported CSV into a new Azure
DevOps or GitHub project within your existing organization.
Download example import scripts at DevOps Tooling for Well-Architected Recommendation Process.
Now that the recommendations have been added to the backlog, the workload owners and key stakeholders
should prioritize, then triage the recommendations in a weekly standup meeting with the workload team. Next,
they assign recommendations to a specific owner, postpone, or dismiss. When assigned to a specific owner, the
recommendation should be tracked until resolved and weekly, or monthly reminders are sent to the assignee.
When going through this process, it's recommended to align responsibilities across teams by developing a
cross-team matrix that identifies
responsible, accountable, consulted, and informed (RACI)
parties. Some of the
key benefits of this exercise include:
Assisting teams in charting roles and responsibilities in a consistent manner
Assisting teams with development of implementation tool kits
Clarifying individual and departmental roles, and responsibilities
Identifying accountabilities
Eliminating misunderstandings and encouraging teamwork
Reducing duplication of effort
Establishing
consults
and
informs
resulting in better communication
This stage of the process is focused on working through the backlog of recommendations. How you organize
this work will depend on your cloud operating model. For example, an Agile-based process would have an
emphasis on sprint planning paired with frequent stand ups so that progress can be tracked and reported.
This stage of the process is to monitor improvements to a workload relative to the Microsoft Well-Architected
Framework through each iteration. It's important to establish a baseline based on the Microsoft Azure Well-
Architected Review and Azure Advisor so that you can track improvements back to key recommendations and
metrics.
The Azure Advisor Score is a key metric that is available for tracking and socializing improvements during this
continuous process.