EC2 Status Checks & Recovery
Master EC2 status checks to identify hardware and software issues, and configure automated recovery options using CloudWatch Alarms and Auto Scaling.
Understanding EC2 Status Checks
EC2 performs automated status checks every minute to identify hardware and software issues with your instances. Understanding these checks is crucial for maintaining high availability and implementing automated recovery. There are three types of status checks, each monitoring different aspects of your instance's health.
Key Point
EC2 performs three types of status checks: System, Instance, and Attached EBS—each requiring different remediation actions.
Key Terms
- System Status Check: Monitors problems with AWS systems that require AWS to repair, such as issues with the underlying host hardware or hypervisor
- Instance Status Check: Monitors software and network configuration issues that you can fix by rebooting or modifying the instance
- Attached EBS Status Check: Monitors the reachability and I/O completion status of EBS volumes attached to the instance
Status Check Types and Resolution
Figure: the three status check types and their remediation approaches (EC2 health monitoring).
- System Status (AWS infrastructure issues): Stop & Start, which migrates the instance to a new host
- Instance Status (guest OS/config issues): Reboot, or fix configuration
- Attached EBS (volume health issues): Reboot, or replace affected volumes
Status Check Details
| Check Type | What It Monitors | Resolution |
|---|---|---|
| System Status | Physical host, power, network | Stop and Start (migrates to new host) |
| Instance Status | Guest OS, memory, networking config | Reboot or fix instance configuration |
| Attached EBS | EBS volume reachability and I/O | Reboot instance or replace volumes |
Stop/Start is different from Reboot—Stop/Start moves the instance to new hardware
Stop vs Reboot
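Stop and Start typically moves your instance to different underlying hardware, which resolves System Status failures. Reboot keeps the instance on the same host, which only helps with Instance Status issues. Keep in mind that a stop and start also releases an auto-assigned public IPv4 address (an Elastic IP stays associated) and erases any instance store data, while a reboot does neither. Know the difference!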
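The mapping from failed check to first response can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: it assumes a dictionary shaped like one entry of the `InstanceStatuses` list returned by the EC2 DescribeInstanceStatus API (keys `SystemStatus`, `InstanceStatus`, `AttachedEbsStatus`, each with a `Status` of `ok` or `impaired`). The sample data and the `suggest_remediation` helper are hypothetical, so verify the field names against the current API reference before relying on them.

```python
from typing import Dict, List

# Suggested first response per failed check, taken from the table above.
REMEDIATION = {
    "SystemStatus": "stop and start the instance (moves it to a new host)",
    "InstanceStatus": "reboot the instance or fix its configuration",
    "AttachedEbsStatus": "reboot the instance or replace the affected volumes",
}


def suggest_remediation(instance_status: Dict) -> List[str]:
    """Return one remediation hint for each check reported as 'impaired'."""
    hints = []
    for check, action in REMEDIATION.items():
        status = instance_status.get(check, {}).get("Status")
        if status == "impaired":
            hints.append(f"{check}: {action}")
    return hints


# Hand-written sample entry (not real API output); the instance ID is a placeholder.
sample = {
    "InstanceId": "i-0123456789abcdef0",
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "ok"},
    "AttachedEbsStatus": {"Status": "ok"},
}

print(suggest_remediation(sample))
```

With boto3 installed, an entry like `sample` would come from `boto3.client("ec2").describe_instance_status(InstanceIds=[...])["InstanceStatuses"]`.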
Check AWS Health Dashboard
For System Status failures, check the AWS Health Dashboard (formerly the Personal Health Dashboard) for any scheduled maintenance or known issues. AWS may have already scheduled maintenance to repair the underlying infrastructure.
CloudWatch Metrics for Status Checks
Status check results are published to CloudWatch as metrics at 1-minute intervals. You can create alarms on these metrics to trigger automated recovery actions or notifications. The key metrics are StatusCheckFailed_System, StatusCheckFailed_Instance, StatusCheckFailed_AttachedEBS, and StatusCheckFailed (combined).
Key Point
Status check metrics are published to CloudWatch every minute and can trigger automated recovery actions.
Status Check CloudWatch Metrics
- StatusCheckFailed_System: 1 if system status check failed, 0 if passed
- StatusCheckFailed_Instance: 1 if instance status check failed, 0 if passed
- StatusCheckFailed_AttachedEBS: 1 if EBS status check failed, 0 if passed
- StatusCheckFailed: 1 if any status check failed, 0 if all passed
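As a rough sketch of how these metrics might be read programmatically, the snippet below builds the parameters for a CloudWatch GetMetricStatistics request and counts failed minutes in the returned datapoints. The helper names `status_check_query` and `failed_minutes` are hypothetical, and the 10-minute default window is an arbitrary choice made for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def status_check_query(
    instance_id: str,
    metric_name: str = "StatusCheckFailed",
    minutes: int = 10,
    now: Optional[datetime] = None,
) -> Dict:
    """Build GetMetricStatistics parameters for one status check metric."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # status check metrics arrive at 1-minute intervals
        "Statistics": ["Maximum"],  # Maximum is 1 if the check failed in that minute
    }


def failed_minutes(datapoints: List[Dict]) -> int:
    """Count the datapoints in which the check reported a failure."""
    return sum(1 for point in datapoints if point.get("Maximum", 0) >= 1)
```

With boto3 installed, the parameters could be passed as `boto3.client("cloudwatch").get_metric_statistics(**params)`, and the `Datapoints` list of the response handed to `failed_minutes`.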
Automated Recovery Options
There are two main approaches to automated recovery: CloudWatch Alarms with EC2 actions and Auto Scaling Groups. Each has different characteristics and preserves different aspects of your instance configuration.
Key Point
Choose CloudWatch Alarm recovery to preserve IP addresses, or Auto Scaling for more robust instance management.
Recovery Options Comparison
| Feature | CloudWatch Alarm Recovery | Auto Scaling Group |
|---|---|---|
| How It Works | Recovers same instance to new host | Launches new instance |
| Private IP | Preserved | Not preserved (new instance) |
| Public IP | Preserved | Not preserved |
| Elastic IP | Preserved | Not preserved (must reassociate) |
| Metadata | Preserved | Based on launch template |
| Placement Group | Preserved | Based on launch template |
Set min/max/desired to 1 in ASG for single-instance recovery
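The single-instance Auto Scaling pattern could look roughly like the following. This is a sketch under stated assumptions: the group name, launch template name, and subnet IDs are placeholders, and `single_instance_asg_params` is a hypothetical helper that only builds the request parameters for the Auto Scaling CreateAutoScalingGroup API.

```python
from typing import Dict, List


def single_instance_asg_params(
    group_name: str,
    launch_template_name: str,
    subnet_ids: List[str],
) -> Dict:
    """Build CreateAutoScalingGroup parameters that keep exactly one instance running."""
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # min = max = desired = 1: an unhealthy instance is replaced, never scaled out
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Listing several subnets lets the replacement launch in another Availability Zone
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # "EC2" health checks use the instance status checks described in this lesson
        "HealthCheckType": "EC2",
    }


# Placeholder names and IDs, for illustration only.
params = single_instance_asg_params(
    "single-instance-asg", "my-launch-template", ["subnet-aaaa1111", "subnet-bbbb2222"]
)
```

With boto3 installed, these could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**params)`. Because the replacement is a new instance, any state it needs has to come from the launch template, user data, or external storage.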
Creating a CloudWatch Alarm for Recovery
Select the Metric
In CloudWatch, choose the StatusCheckFailed_System metric for your EC2 instance.
Configure the Alarm
Set the alarm to trigger when the metric is >= 1 (the check failed), typically for 2 or more consecutive 1-minute periods so that a single transient failure does not start a recovery.
Add EC2 Action
Choose 'Recover this instance' as an alarm action. This uses the EC2 recover action.
Add SNS Notification
Optionally add an SNS notification to alert your team when recovery occurs.
Test and Monitor
Monitor the alarm and verify it's in the OK state when your instance is healthy.
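The steps above can be approximated in code. The sketch below builds the parameters for a CloudWatch PutMetricAlarm request on StatusCheckFailed_System, using the `arn:aws:automate:<region>:ec2:recover` action. The alarm naming scheme, the 2-period evaluation window, and the `recovery_alarm_params` helper are assumptions made for this example, and the instance ID and topic ARN are placeholders.

```python
from typing import Dict, Optional


def recovery_alarm_params(
    instance_id: str,
    region: str,
    sns_topic_arn: Optional[str] = None,
) -> Dict:
    """Build PutMetricAlarm parameters for an EC2 auto-recovery alarm."""
    # Step 3: the EC2 recover action; step 4: an optional SNS notification
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if sns_topic_arn:
        actions.append(sns_topic_arn)
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        # Step 1: select the system status check metric for this instance
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # Step 2: alarm when the check fails for 2 consecutive 1-minute periods
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


# Placeholder instance ID, account ID, and topic name, for illustration only.
params = recovery_alarm_params(
    "i-0123456789abcdef0",
    "us-east-1",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
```

With boto3 installed, the alarm could be created with `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`, after which its state can be checked in the console as described in the last step.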
CloudWatch Alarm Recovery Flow
Figure: how automated recovery works with CloudWatch. The monitored EC2 instance reports to CloudWatch, which monitors StatusCheckFailed. When the check fails, the alarm is triggered and runs its alarm actions (recover and notify): the instance is recovered by migrating it to a new host, and an SNS notification alerts the team.
Recovery Limitations
EC2 instance recovery is only supported for instances using EBS storage (not instance store), with specific instance types, and in a VPC. Check AWS documentation for current limitations.
Pause & Ponder
How would you design a high-availability solution using EC2 status checks?
- Consider when CloudWatch Alarm recovery is sufficient vs. when you need Auto Scaling
- Think about applications that require IP address persistence
- How would you handle recovery for stateful applications with data on EBS?