Network incidents can increase latency across AWS Regions, Availability Zones (AZs), and individual infrastructure components, such as server hardware. Users can also experience higher latency when network appliances are introduced into their traffic path. This post explains best practices for identifying and responding to latency-related scenarios, and for building the observability capabilities you need to proactively recover from possible failures.

Objectives

We discuss how delays in network communication can impact your workload within a Region or across Regions. We use the concepts of latency and packet loss to review failure scenarios, failover decisions, and criteria that allow you to respond to latency-specific events. The objectives are to:

  • Understand the impact of regional network latency on components of critical applications and the system as a whole
  • Review and validate metrics that are available to measure and help pinpoint latency and packet loss
  • Identify alarms and thresholds relevant to latency
  • Establish a troubleshooting and diagnosis approach
  • Review steps to mitigate the impact and/or determine when to fail over to another AWS Region during latency
  • Test operational readiness with disaster recovery (DR) exercises through the use of playbooks and runbooks

Any event that prevents a workload or system from fulfilling its business objectives is classified as a disaster. Diagnosing the impact of latency and its root cause is crucial for a resilient architecture and your business continuity plan.

Latency and packet loss

This section defines latency and packet loss. Network latency is the delay in network communication: the time a packet takes to arrive at its destination. It is measured in time units such as milliseconds. Latency can be introduced in a network for several reasons: high concurrent data volume, server performance, or incidents on links and devices such as routers, switches, or firewalls.

Packet loss measures how many data packets never reach their destination. Factors such as software bugs, hardware issues, and network congestion cause packets to be dropped during data transmission. Packet loss is expressed as the percentage of sent packets that never arrived. For example, if 90 out of 100 packets arrive, then packet loss is 10%.
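
The packet loss arithmetic above can be written as a small helper. This is a minimal sketch; the function name and its parameters are illustrative and do not come from any AWS tool.

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Return the share of sent packets that never arrived, as a percentage."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# The example from the text: 90 of 100 packets arrive.
print(packet_loss_percent(sent=100, received=90))  # 10.0
```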

To better evaluate the impact of latency in your application, you should have knowledge of your network throughput, transactions per second (TPS), and application response time.

To minimize the impact of latency, you should understand the thresholds at which your applications start to time out or fail requests. You must be notified before reaching these thresholds so that you can take the necessary actions to fail over or plan for recovery.

Understanding your thresholds

This section walks you through some possible scenarios and how to understand your thresholds.

Generally, thresholds are defined values that mark the limit of a healthy status. For latency metrics, response time, throughput, and volume of traffic are the main factors for understanding the thresholds.

Throughput measurements help you determine an application or server’s maximum load per second before failure or reduced response times.

Typically, a higher volume of requests per second results in slower response times, because transactions or processes are placed in a queue, which increases latency. Depending on your architecture, this queue can be a buffer that holds requests, a task scheduler, or a messaging queue (MQ).

If the load on the server or application remains high for long enough, a network connection can’t be established, or a server takes too long to respond, then a timeout occurs. The timeout is the time that the client is prepared to wait for the transaction to complete.

Finally, a transaction request fails immediately when the processing queue is full. This happens when the incoming request rate exceeds what the system can process for long enough that the queue overflows.
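
To make the relationship between request rate, throughput, and queue overflow concrete, here is a minimal discrete-time sketch. The model, the function name, and all the numbers are assumptions chosen for illustration: requests arrive at a fixed rate per tick, the server drains a fixed number per tick, and anything that does not fit in the queue fails immediately.

```python
def simulate_queue(arrivals_per_tick, served_per_tick, capacity, ticks):
    """Simulate a bounded request queue and return (rejected, peak_depth)."""
    depth = 0
    rejected = 0
    peak = 0
    for _ in range(ticks):
        # New requests arrive; anything beyond the free space fails immediately.
        accepted = min(arrivals_per_tick, capacity - depth)
        rejected += arrivals_per_tick - accepted
        depth += accepted
        peak = max(peak, depth)
        # The server processes up to its throughput limit for this tick.
        depth -= min(depth, served_per_tick)
    return rejected, peak


# Below capacity: the queue drains every tick and nothing is rejected.
print(simulate_queue(arrivals_per_tick=8, served_per_tick=10, capacity=20, ticks=20))   # (0, 8)
# Slightly above capacity: the queue fills after a few ticks, then requests fail.
print(simulate_queue(arrivals_per_tick=12, served_per_tick=10, capacity=20, ticks=20))  # (30, 20)
```

In the second run the queue absorbs the excess for the first five ticks, which shows up as growing wait time rather than failures. Only once the queue is full do requests start failing outright.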

Diagram 1: Concepts of latency, throughput, and failure

How infrastructure performance works at AWS

This section discusses how infrastructure performance is measured at AWS. Network latency measurements are an aggregate of latency performance information captured by active probing between AWS managed probes within the AWS global network.

  • Inter-Region latency metrics are generated by aggregating latency measurements from probes located across AWS Regions, and filtered on your chosen source and destination AWS Region pairs.
  • Inter-AZ latency metrics are generated by aggregating latency measurements from probes located across AZs, and filtered on your chosen source and destination AZs.
  • Intra-AZ latency metrics are generated by aggregating latency measurements between the probes within a single AZ. This includes the probes that are across AWS data centers, and within the same data center, for the chosen AZ.

Depending on your workload, you may have resources in a single AWS Region, in multiple AWS Regions, or in a hybrid solution (a combination of cloud and on-premises environments). Therefore, the components that may lead to slow response times for a user, or to network congestion between communication tiers, differ by architecture. For example, for a multi-Region architecture, you need observability tooling that monitors latency across your AWS Regions, between the AZs within a Region, and between your resources or applications within an AZ.

For hybrid solutions, you also need observability solutions to monitor network performance from your on-premises data centers to AWS Cloud. In this case, bandwidth, distance, packet size, and routing all affect network latency. Refer to the post Improving performance on AWS and Hybrid Networks for more information.

Recommendations

1. Create multidimensional monitoring of latency

a. Your observability tooling should be able to pinpoint a particular latency-generating event.
b. Use composite alarms to reduce alarm noise by taking actions only at an aggregated level.
c. Create a composite alarm that sends a notification when you are close to reaching your defined network latency threshold.
d. The underlying alarms in your composite alarm must be in the same account and the same Region as your composite alarm. However, if you set up a composite alarm in an Amazon CloudWatch cross-account observability monitoring account, then the underlying alarms can watch metrics in different source accounts and in the monitoring account itself.
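
As a rough illustration of items b and c, the following sketch builds a composite alarm rule from existing child alarms and passes it to the CloudWatch `put_composite_alarm` API through boto3. The alarm names and the SNS topic parameter are hypothetical placeholders, and the sketch assumes the child alarms already exist in the same account and Region.

```python
def build_alarm_rule(child_alarm_names, require_all=True):
    """Combine child alarms into a composite alarm rule expression."""
    if not child_alarm_names:
        raise ValueError("at least one child alarm is required")
    operator = " AND " if require_all else " OR "
    return operator.join(f'ALARM("{name}")' for name in child_alarm_names)


def create_latency_composite_alarm(name, child_alarm_names, sns_topic_arn):
    """Create the composite alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rule builder works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_composite_alarm(
        AlarmName=name,
        AlarmRule=build_alarm_rule(child_alarm_names),
        AlarmActions=[sns_topic_arn],  # notified when the composite goes to ALARM
        AlarmDescription="Latency is approaching the defined threshold",
    )


# Hypothetical child alarm names, shown only to illustrate the rule syntax.
print(build_alarm_rule(["inter-az-latency-high", "p99-response-time-high"]))
# ALARM("inter-az-latency-high") AND ALARM("p99-response-time-high")
```

Requiring all child alarms with `AND` is what reduces noise: a single noisy metric does not page anyone unless the other signals agree.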

a. You should understand the thresholds at which your critical application starts to time out or fail requests. You need to be alerted before reaching these thresholds so that you can take the necessary actions to avoid failure.
b. You need the ability to interpret detected suspicious events and break down the root causes and possible anomalies.
c. Understand the maximum time that a message/request takes to be processed.
d. Understand your round-trip time (RTT).
e. Understand the maximum capacity of your applications to process the transactions (without delays).
f. Finally, understand your system’s maximum tolerance for additional delays.
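
One way to turn these points into an early warning is to compare a high percentile of observed latency against a fraction of the client timeout. The sketch below is an assumption-laden example: the 80% warning fraction, the 99th percentile, and the function names are illustrative choices, not values recommended by this post.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_status(samples_ms, timeout_ms, warn_fraction=0.8, pct=99):
    """Classify latency as 'ok', 'warning', or 'breach' relative to the timeout."""
    observed = percentile(samples_ms, pct)
    if observed >= timeout_ms:
        return "breach"   # requests at this percentile are already timing out
    if observed >= warn_fraction * timeout_ms:
        return "warning"  # close to the threshold: alert before failure
    return "ok"


# 98 fast requests and two slow ones, against a 1000 ms client timeout.
samples = [100] * 98 + [850, 900]
print(latency_status(samples, timeout_ms=1000))  # warning
```

The "warning" state is the useful one here, because it gives you time to act before the application starts to time out.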

3. Prepare your alerting systems to validate AWS Health notifications and dismiss false positives

AWS will notify you about AWS service events or impairments through the AWS Health Dashboard. However, you may not be impacted by the events or impairments depending on your architecture and the static stability of your workload. This shows the importance of having the correct observability tooling in place to help you minimize false positives.

Testing is a crucial component of operational resilience. To recover from failures, you need to have a DR plan and regularly test your readiness through different scenarios. For example, your failover plans for a latency impairment in a single AZ are different from your plans for impairments in an AWS Region. Depending on your architecture, recovery time objective (RTO), recovery point objective (RPO), and the nature of the incident, you may need to plan to use a different service, or evacuate an AZ or even an AWS Region.

a. Include possible scenarios to validate the latency threshold in your DR exercises.
b. Test the latency only to understand your system’s maximum tolerance.
c. Test throughput only to understand your system’s maximum capacity.
d. Test the latency and throughput together to find the threshold where the latency and throughput would impact your system.
e. Regularly review and optimize your DR procedures related to latency or networking issues.

5. Perform post-incident analysis

a. Create a correction of error (COE) process.
b. Create a standardized method to document critical root causes, and make sure they are reviewed, validated, and addressed among your teams.
c. Assign clear ownership for the post-incident analysis process.
d. Designate a responsible team or individual who oversees incident investigations and follow-ups.

6. Validate playbooks and runbooks

Playbooks provide comprehensive strategic guidance for organizations on handling disastrous events, focusing on responsible parties, communication protocols, and overall response strategy. Runbooks, on the other hand, offer step-by-step instructions for routine tasks such as restoration and recovery, outlining detailed technical workflows, automation strategies, and troubleshooting steps. Both are crucial for comprehensive disaster planning and for a well-coordinated and effective incident response.

a. Document your procedures in playbooks to achieve a specific outcome related to latency. You need a different playbook for each type of disastrous event. For example, your procedures to recover from a network latency incident are different from the actions taken to recover from a security incident such as a DDoS attack. Generally, a well-defined playbook includes the following:

  • Roles and responsibilities of key personnel
  • Contact list and communication strategies
  • Defined disaster levels, impacts, and response requirements
  • Specified recovery time objectives (RTOs) and recovery point objectives (RPOs) per workload
  • Procedures for disaster declaration and alert response
  • Escalation processes
  • Infrastructure readiness for the specific disaster event
  • Risk factors and mitigation processes
  • Details on the failover environment and dependencies
  • Failover initiation, restoration order, and timelines
  • Phased recovery steps
  • Testing and failback procedures

b. Codify your runbooks to limit the introduction of errors from manual activity. Runbooks can be composed of multiple scripts representing the different steps that might be necessary to identify the contributing factors of an issue. For example, in the case of impairments in an AZ that cause slow response times for end users, you would use your runbooks related to AZ evacuation to temporarily remove the impaired AZ for all or specific workloads, depending on the scope of the impairments. This includes actions taken to reconfigure your resources to stop sending traffic to the impaired AZ. A well-defined runbook includes, for example:

  • Specific workload execution processes and necessary permissions
  • Comprehensive list of existing infrastructure components
  • Detailed list of impacted resources and their dependencies
  • Information on necessary backups and other necessary configurations
  • Detailed, step-by-step instructions for the recovery procedures
  • Clear documentation of the expected outcome upon successful execution

c. As you build and work through your playbooks and runbooks, you can add, revise, optimize, and improve your processes through regular testing and DR exercises. For more information about solutions for automating your playbooks or runbooks, review the following: Automate your operational playbooks with AWS Systems Manager
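
The idea of codifying a runbook (item b) can be sketched as an ordered list of steps, each with a description and a check for its expected outcome, run by a small driver that stops at the first failure. This is a generic illustration rather than an AWS Systems Manager document; the `Step` type, the `run_runbook` function, and the example step descriptions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the expected outcome is met


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order, log each result, and stop at the first failure."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later steps depend on earlier ones, so do not continue
    return log


# Placeholder actions stand in for real scripts in this sketch.
az_evacuation = [
    Step("Confirm latency alarm is scoped to one AZ", lambda: True),
    Step("Stop routing traffic to the impaired AZ", lambda: True),
    Step("Verify response times have recovered", lambda: True),
]
for line in run_runbook(az_evacuation):
    print(line)
```

Keeping the expected outcome next to each step makes the runbook testable in DR exercises, because every run produces a log that shows exactly where it stopped.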

7. Validate support best practices for opening support cases

During an impairment or a critical situation, if you are subscribed to a premium support plan, open a support case so that AWS Support can assist you with technical guidance.

  • Understand the AWS case severity levels for high impact incidents:
    • High: production system impaired
    • Urgent: production system down (only available for Business, Enterprise, and Enterprise On-Ramp users)
    • Critical: business-critical system down (only available for Enterprise and Enterprise On-Ramp users)
  • Choose chat/phone for the fastest method to get an assigned engineer when your critical workload is affected.

There are many AWS native tools that can help you build and architect your observability tooling, such as the following:

  • AWS Network Manager – Monitor network infrastructure performance
    • Use AWS Network Manager to better understand the performance of the AWS Global Network. You can monitor the real-time inter-Region, inter-AZ, and intra-AZ latency, and the health status of the AWS Global Network.
  • Subscription metrics for CloudWatch
    • Subscribe to AggregateAWSNetworkPerformance to receive notifications about inter-Region, inter-AZ, or intra-AZ latency.
  • Amazon Q network troubleshooting
  • Amazon CloudWatch Internet Monitor
    • Internet Monitor suggests insights and recommendations that can help you improve your end users’ experience and performance. You can explore, in near real-time, how to improve the projected latency of your application by switching to use other services, or by rerouting traffic to your workload through different AWS Regions.
    • Internet Monitor sends health events to Amazon EventBridge so that you can set up notifications. If an issue is caused by the AWS network, then you also automatically receive an AWS Health Dashboard notification with the steps that AWS is taking to mitigate the problem.

Diagram 2: AWS cloud services for network latency monitoring

Conclusion

Latency-related disaster events pose a threat to your system’s availability, but you can mitigate or remove these threats by proactively monitoring your workload. This post has explained the concepts and best practices that can help you create effective observability tools to understand, monitor, and react to possible delays, latency, or performance issues across your AWS Cloud environment. Amazon CloudWatch and AWS Network Manager are strong tools that allow you to monitor and receive alarms for inter-AZ and intra-AZ network latency, and also for your WAN connectivity from your user network to AWS Cloud.
