Lesson 6 - AWS SysOps Administrator Associate

EC2 Status Checks & Recovery

Master EC2 status checks to identify hardware and software issues, and configure automated recovery options using CloudWatch Alarms and Auto Scaling.

What You'll Learn

Understanding EC2 Status Checks

EC2 performs automated status checks every minute to identify hardware and software issues with your instances. Understanding these checks is crucial for maintaining high availability and implementing automated recovery. There are three types of status checks, each monitoring different aspects of your instance's health.

Key Point

EC2 performs three types of status checks: System, Instance, and Attached EBS—each requiring different remediation actions.

Vocabulary

Key Terms

  • System status check: Monitors problems with AWS systems that require AWS to repair, such as issues with the underlying host hardware or hypervisor.
  • Instance status check: Monitors software and network configuration issues that you can fix by rebooting or modifying the instance.
  • Attached EBS status check: Monitors the reachability and I/O completion status of EBS volumes attached to the instance.

Status Check Types and Resolution

The three status check types and their remediation approaches:

  • System Status: AWS infrastructure issues. Remediation: stop and start, which migrates the instance to a new host.
  • Instance Status: guest OS or configuration issues. Remediation: reboot, or fix the configuration.
  • Attached EBS: volume health issues. Remediation: stop and start the instance, or replace the affected volumes.

Comparison

Status Check Details

Check Type | What It Monitors | Resolution
System Status | Physical host, power, network | Stop and start (migrates to new host)
Instance Status | Guest OS, memory, networking config | Reboot or fix instance configuration
Attached EBS | EBS volume reachability and I/O | Stop and start the instance, or replace affected volumes

Stop/Start is different from Reboot: Stop/Start moves the instance to new hardware.

Stop vs Reboot

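Stopping and then starting an EBS-backed instance typically moves it to different underlying hardware, which is why it resolves most System Status failures. A reboot keeps the instance on the same host, so it only helps with Instance Status issues. A stop and start also releases any auto-assigned public IPv4 address, whereas a reboot does not. Know the difference!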
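As a rough illustration of the distinction above, the sketch below maps one entry from the EC2 DescribeInstanceStatus response to a remediation suggested in this lesson. The names suggest_remediation, fetch_status_entries, and REMEDIATION, along with the wording of the suggestions, are illustrative assumptions; only the response keys (SystemStatus, InstanceStatus, AttachedEbsStatus) and the boto3 call come from the EC2 API.

```python
"""Map EC2 status check results to a remediation suggested in this lesson."""

# Suggested action per failed check. The wording is illustrative, not an AWS value.
REMEDIATION = {
    "SystemStatus": "stop and start the instance (moves it to a new host)",
    "InstanceStatus": "reboot the instance or fix its configuration",
    "AttachedEbsStatus": "replace the affected volumes",
}


def suggest_remediation(status_entry: dict) -> list:
    """Return one suggestion per impaired check in a DescribeInstanceStatus entry."""
    suggestions = []
    for key, action in REMEDIATION.items():
        # Each check reports a Status such as "ok", "impaired", or "initializing".
        if status_entry.get(key, {}).get("Status") == "impaired":
            suggestions.append(action)
    return suggestions


def fetch_status_entries(instance_ids: list) -> list:
    """Fetch live status entries. Assumes boto3 is installed and credentials are configured."""
    import boto3  # imported lazily so the pure helper above works without AWS access

    ec2 = boto3.client("ec2")
    response = ec2.describe_instance_status(
        InstanceIds=instance_ids,
        IncludeAllInstances=True,  # also report instances that are not running
    )
    return response["InstanceStatuses"]
```

With credentials in place, calling suggest_remediation on each entry returned by fetch_status_entries would list the suggested action for every impaired check.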

Check AWS Health Dashboard

For System Status failures, check the AWS Health Dashboard (formerly the Personal Health Dashboard) for scheduled maintenance or known issues affecting your account. AWS may have already scheduled maintenance to repair the underlying infrastructure.
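Scheduled maintenance for a specific instance can also be read programmatically. The sketch below is one possible approach, not the dashboard itself: it filters the Events list that EC2 returns in each DescribeInstanceStatus entry. The function name pending_scheduled_events is hypothetical, and the filter relies on the assumption that finished or canceled events carry a "[Completed]" or "[Canceled]" prefix in their description.

```python
"""List scheduled EC2 events that are still pending for an instance."""

# Assumption: EC2 prefixes the description of finished or canceled events like this.
DONE_PREFIXES = ("[Completed]", "[Canceled]")


def pending_scheduled_events(status_entry: dict) -> list:
    """Return events from a DescribeInstanceStatus entry that have not finished."""
    events = status_entry.get("Events", [])
    return [
        event
        for event in events
        if not event.get("Description", "").startswith(DONE_PREFIXES)
    ]
```

Each returned event is a dictionary with fields such as Code (for example system-maintenance) and, when AWS supplies them, NotBefore and NotAfter timestamps.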

CloudWatch Metrics for Status Checks

Status check results are published to CloudWatch as metrics at 1-minute intervals. You can create alarms on these metrics to trigger automated recovery actions or notifications. The key metrics are StatusCheckFailed_System, StatusCheckFailed_Instance, StatusCheckFailed_AttachedEBS, and StatusCheckFailed (combined).

Key Point

Status check metrics are published to CloudWatch every minute and can trigger automated recovery actions.

Status Check CloudWatch Metrics

  • StatusCheckFailed_System: 1 if system status check failed, 0 if passed
  • StatusCheckFailed_Instance: 1 if instance status check failed, 0 if passed
  • StatusCheckFailed_AttachedEBS: 1 if EBS status check failed, 0 if passed
  • StatusCheckFailed: 1 if any status check failed, 0 if all passed
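To make the metrics above concrete, here is a minimal sketch that builds the arguments for the CloudWatch GetMetricStatistics call and counts the minutes in which a check failed. The helper names build_metric_query and failed_minutes, the default 10-minute window, and the choice of the Maximum statistic are assumptions made for illustration.

```python
"""Query EC2 status check metrics from CloudWatch (sketch)."""
from datetime import datetime, timedelta, timezone
from typing import Optional

STATUS_METRICS = (
    "StatusCheckFailed_System",
    "StatusCheckFailed_Instance",
    "StatusCheckFailed_AttachedEBS",
    "StatusCheckFailed",
)


def build_metric_query(
    instance_id: str,
    metric_name: str,
    minutes: int = 10,
    now: Optional[datetime] = None,
) -> dict:
    """Build keyword arguments for cloudwatch.get_metric_statistics."""
    if metric_name not in STATUS_METRICS:
        raise ValueError(f"not a status check metric: {metric_name}")
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # status check metrics are reported once per minute
        "Statistics": ["Maximum"],  # 1 if the check failed within the period
    }


def failed_minutes(datapoints: list) -> int:
    """Count datapoints whose Maximum is 1, meaning the check failed in that minute."""
    return sum(1 for point in datapoints if point.get("Maximum", 0) >= 1)
```

Assuming boto3 and valid credentials, the query could be run as boto3.client("cloudwatch").get_metric_statistics(**build_metric_query(...)), with the "Datapoints" list from the response passed to failed_minutes.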

Automated Recovery Options

There are two main approaches to automated recovery: CloudWatch Alarms with EC2 actions and Auto Scaling Groups. Each has different characteristics and preserves different aspects of your instance configuration.

Key Point

Choose CloudWatch Alarm recovery to keep the same instance, including its IP addresses, or Auto Scaling to replace a failed instance with a new one built from a launch template.

Comparison

Recovery Options Comparison

Feature | CloudWatch Alarm Recovery | Auto Scaling Group
How It Works | Recovers same instance to new host | Launches new instance
Private IP | Preserved | Not preserved (new instance)
Public IP | Preserved | Not preserved
Elastic IP | Preserved | Not preserved (must reassociate)
Metadata | Preserved | Based on launch template
Placement Group | Preserved | Based on launch template

For single-instance recovery with an Auto Scaling group (ASG), set the minimum, maximum, and desired capacity to 1.
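The single-instance Auto Scaling setup could look roughly like the sketch below, which builds the arguments for the Auto Scaling CreateAutoScalingGroup call. The function name single_instance_asg_params, the 300-second grace period, and the use of the $Latest template version are assumptions for illustration; the group, template, and subnet names are placeholders supplied by the caller.

```python
"""Build parameters for a one-instance Auto Scaling group (sketch)."""


def single_instance_asg_params(
    group_name: str, launch_template_name: str, subnet_ids: list
) -> dict:
    """Build keyword arguments for autoscaling.create_auto_scaling_group."""
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",  # assumption: always use the newest template version
        },
        # Pinning all three values to 1 makes the group replace a failed instance
        # instead of scaling in or out.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Subnets are passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # "EC2" health checks treat a failed EC2 status check as unhealthy.
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds to wait after launch; illustrative
    }
```

Assuming boto3 and valid credentials, the group could then be created with boto3.client("autoscaling").create_auto_scaling_group(**params). Listing subnets in more than one Availability Zone lets the replacement launch elsewhere if a zone is impaired.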

Creating a CloudWatch Alarm for Recovery

1. Select the Metric: In CloudWatch, choose the StatusCheckFailed_System metric for your EC2 instance.

2. Configure the Alarm: Set the threshold (typically >= 1 for 2 or more consecutive periods) to trigger when the check fails.

3. Add EC2 Action: Choose 'Recover this instance' as an alarm action. This uses the EC2 recover action.

4. Add SNS Notification: Optionally add an SNS notification to alert your team when recovery occurs.

5. Test and Monitor: Monitor the alarm and verify it's in the OK state when your instance is healthy.
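The console steps above can also be expressed through the CloudWatch PutMetricAlarm API. The following is a minimal sketch under stated assumptions: the function name recovery_alarm_params and the alarm naming scheme are hypothetical, the 60-second period and Maximum statistic are one reasonable choice, and the action ARN assumes the standard aws partition.

```python
"""Build a CloudWatch alarm that recovers an EC2 instance (sketch)."""
from typing import Optional


def recovery_alarm_params(
    instance_id: str, region: str, sns_topic_arn: Optional[str] = None
) -> dict:
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    # Step 3: the EC2 recover action is referenced by a region-specific ARN.
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    # Step 4: optionally notify an SNS topic as well.
    if sns_topic_arn:
        actions.append(sns_topic_arn)
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "AlarmDescription": "Recover the instance when the system status check fails",
        # Step 1: the system status check metric for this instance.
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # Step 2: alarm when the metric is >= 1 for 2 consecutive 1-minute periods.
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }
```

Assuming boto3 and valid credentials, the alarm could be created with boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params). For step 5, the alarm state can then be checked with describe_alarms.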

CloudWatch Alarm Recovery Flow

How automated recovery works with CloudWatch:

1. EC2 Instance: the monitored instance.

2. CloudWatch: monitors StatusCheckFailed_System.

3. Alarm Triggered: the check failed.

4. Alarm Actions: recover and notify.

5. Outcomes: Instance Recovered (migrated to a new host) and SNS Notification (team alerted).

Recovery Limitations

EC2 instance recovery is only supported for instances using EBS storage (not instance store), with specific instance types, and in a VPC. Check AWS documentation for current limitations.
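A quick pre-check against the limitations listed above might look like the sketch below. It inspects one Instance dictionary as returned by the EC2 DescribeInstances call. The function name recovery_blockers is hypothetical, the check covers only the root device and VPC conditions named in this lesson, and it does not validate the instance type or detect additional instance store volumes, so treat it as a first filter rather than a full eligibility test.

```python
"""Flag obvious blockers for EC2 instance recovery (partial sketch)."""


def recovery_blockers(instance: dict) -> list:
    """Return reasons, based on this lesson's limitations, why recovery may not apply."""
    blockers = []
    # RootDeviceType is "ebs" or "instance-store" in DescribeInstances output.
    if instance.get("RootDeviceType") != "ebs":
        blockers.append("root device is not EBS-backed")
    # Instances launched into a VPC carry a VpcId.
    if not instance.get("VpcId"):
        blockers.append("instance is not in a VPC")
    return blockers
```

An empty list means none of the checked conditions apply. The instance type still has to be compared against the current AWS documentation.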

Reflection

Pause & Ponder

How would you design a high-availability solution using EC2 status checks?

  • Consider when CloudWatch Alarm recovery is sufficient vs. when you need Auto Scaling
  • Think about applications that require IP address persistence
  • How would you handle recovery for stateful applications with data on EBS?
