Tutorial 14: Troubleshoot Terraform
Learning Objectives
- Learn systematic approaches to Terraform troubleshooting
- Understand common error types and their solutions
- Practice using debugging tools and techniques
- Implement troubleshooting workflows for complex issues
- Develop skills for root cause analysis
Terraform Troubleshooting Methodology
Systematic Approach
- Identify the Problem: Understand what's failing and when
- Gather Information: Collect logs, error messages, and context
- Isolate the Issue: Narrow down to specific resources or operations
- Analyze Root Cause: Understand why the problem occurred
- Implement Solution: Fix the underlying issue
- Verify Resolution: Test that the fix works
- Document Learning: Record solution for future reference
Error Classification
- Syntax Errors: Invalid HCL configuration
- Validation Errors: Configuration doesn't meet provider requirements
- Authentication Errors: Credential or permission issues
- API Errors: Cloud provider API failures
- State Errors: State file corruption or inconsistencies
- Dependency Errors: Resource dependency issues
- Network Errors: Connectivity or timeout issues
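The categories above can be applied mechanically as a first triage step. The following is a minimal sketch, assuming a hypothetical helper named `classify_tf_error`; the match patterns are illustrative examples drawn from common messages, not an exhaustive or official list.

```shell
#!/usr/bin/env bash
# classify_tf_error MESSAGE
# Prints a rough category for a Terraform error message.
# Hypothetical helper: the patterns below are examples only.
classify_tf_error() {
  case "$1" in
    *"Unsupported argument"*|*"Unclosed configuration block"*|*"Invalid character"*)
      echo "syntax" ;;
    *"No valid credential sources"*|*"UnauthorizedOperation"*|*"AccessDenied"*)
      echo "authentication" ;;
    *"state lock"*|*"locking state"*|*"Failed to load state"*)
      echo "state" ;;
    *"Cycle:"*)
      echo "dependency" ;;
    *"no such host"*|*"timeout"*|*"Timeout"*)
      echo "network" ;;
    *"rate exceeded"*|*"Throttling"*|*"already exists"*)
      echo "api" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, `classify_tf_error "Error: no such host"` prints `network`. Anything the patterns do not recognize falls through to `unknown`, which is a signal to read the full message rather than trust the triage.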
Debugging Tools and Techniques
Enable Debug Logging
# Set log level (TRACE, DEBUG, INFO, WARN, ERROR)
export TF_LOG=DEBUG
# Save logs to file
export TF_LOG_PATH=./terraform-debug.log
# Run Terraform with logging
terraform plan
# Review logs
less terraform-debug.log
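Debug logs grow quickly, so it can help to see how many lines each level produced before paging through the file. This is a small sketch, assuming a hypothetical helper named `summarize_tf_log` and assuming log lines carry a bracketed level marker such as `[DEBUG]`.

```shell
#!/usr/bin/env bash
# summarize_tf_log FILE
# Prints "<LEVEL> <count>" for each log level, counting the lines that
# contain the "[LEVEL]" marker. Hypothetical helper for illustration.
summarize_tf_log() {
  local file=$1 level count
  for level in ERROR WARN INFO DEBUG TRACE; do
    count=$(grep -c "\[$level\]" "$file")
    echo "$level $count"
  done
}
```

Running `summarize_tf_log terraform-debug.log` after a plan shows at a glance whether any `ERROR` lines were logged; `grep "\[ERROR\]" terraform-debug.log` then lists them.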
Component-Specific Logging
# Set the log level for Terraform core only
export TF_LOG_CORE=DEBUG
# Set the log level for provider plugins only
export TF_LOG_PROVIDER=DEBUG
# Write the logs to a file (TF_LOG_PATH applies to core and provider logs alike)
export TF_LOG_PATH=./provider-debug.log
Validation and Formatting
# Initialize first so providers and modules are available to validate
terraform init
# Check configuration syntax and internal consistency
terraform validate
# Check formatting without modifying files (drop -check to rewrite them)
terraform fmt -check
State Inspection
# Show current state
terraform show
# List all resources in state
terraform state list
# Show specific resource
terraform state show aws_instance.web
# Refresh state from infrastructure (replaces the deprecated `terraform refresh`)
terraform apply -refresh-only
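In a large state, the raw output of `terraform state list` is easier to scan once it is grouped by resource type. This is a minimal sketch, assuming a hypothetical helper named `count_state_types` and assuming module names and keys contain no dots.

```shell
#!/usr/bin/env bash
# count_state_types
# Reads resource addresses from stdin (one per line, as printed by
# `terraform state list`) and prints "<count> <type>", highest count first.
# Hypothetical helper; assumes module names and keys contain no dots.
count_state_types() {
  sed -E 's/^(module\.[^.]+\.)+//' |   # drop leading module path(s)
    grep -v '^data\.' |                # skip data sources
    cut -d. -f1 |                      # keep only the resource type
    sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ print $1, $2 }'             # normalize the padding added by uniq
}
```

Typical usage is `terraform state list | count_state_types`. An unexpectedly high or missing count for a type is often the quickest hint about where state and configuration have diverged.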
Common Error Categories and Solutions
1. Syntax and Configuration Errors
Invalid HCL Syntax
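# Error: Unclosed configuration block
resource "aws_instance" "web" {
ami = "ami-12345"
instance_type = "t2.micro"
# Missing closing brace
Error Message:
Error: Unclosed configuration block
on main.tf line 1, in resource "aws_instance" "web":
1: resource "aws_instance" "web" {
There is no closing brace for this block before the end of the file. This may be caused by incorrect brace nesting elsewhere in this file.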
Solution:
# Fix: Add missing closing brace
resource "aws_instance" "web" {
ami = "ami-12345"
instance_type = "t2.micro"
}
Invalid Resource Arguments
# Error: Unsupported argument
resource "aws_instance" "web" {
ami = "ami-12345"
instance_type = "t2.micro"
invalid_arg = "value" # This argument doesn't exist
}
Error Message:
Error: Unsupported argument
on main.tf line 4, in resource "aws_instance" "web":
4: invalid_arg = "value"
An argument named "invalid_arg" is not expected here.
Solution:
# Inspect the provider schema to see which arguments the resource accepts
terraform providers schema -json | jq '.provider_schemas["registry.terraform.io/hashicorp/aws"].resource_schemas["aws_instance"]'
# Or check the provider documentation in the Terraform Registry,
# then remove or rename the invalid argument
Circular Dependencies
# Error: Circular dependency
resource "aws_security_group" "web" {
name = "web-sg"
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
security_groups = [aws_security_group.app.id] # Depends on app
}
}
resource "aws_security_group" "app" {
name = "app-sg"
ingress {
from_port = 3000
to_port = 3000
protocol = "tcp"
security_groups = [aws_security_group.web.id] # Depends on web
}
}
Solution:
# Fix: Create both groups without inline rules, then attach the rules as
# separate resources so that neither group references the other
resource "aws_security_group" "web" {
name = "web-sg"
}
resource "aws_security_group" "app" {
name = "app-sg"
}
resource "aws_security_group_rule" "web_to_app" {
type = "egress"
from_port = 3000
to_port = 3000
protocol = "tcp"
security_group_id = aws_security_group.web.id
source_security_group_id = aws_security_group.app.id
}
resource "aws_security_group_rule" "app_from_web" {
type = "ingress"
from_port = 3000
to_port = 3000
protocol = "tcp"
security_group_id = aws_security_group.app.id
source_security_group_id = aws_security_group.web.id
}
2. Authentication and Permission Errors
AWS Credential Issues
Error Message:
Error: No valid credential sources found for AWS Provider.
Solutions:
# Check AWS configuration
aws configure list
# Set credentials via environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"
# Or configure AWS CLI
aws configure
# Or use IAM roles (for EC2/Lambda)
# Or use AWS SSO
aws sso login --profile your-profile
export AWS_PROFILE=your-profile
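Before changing any credentials, it can be useful to see which of the environment variables are actually set in the current shell. This is a small sketch, assuming a hypothetical helper named `check_aws_env`; it covers environment variables only, not shared credential files or instance roles, and it never prints the values.

```shell
#!/usr/bin/env bash
# check_aws_env
# Reports whether each common AWS credential variable is set, without
# printing the values themselves. Hypothetical helper for illustration.
check_aws_env() {
  local name
  for name in AWS_PROFILE AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
              AWS_SESSION_TOKEN AWS_DEFAULT_REGION; do
    if [ -n "${!name}" ]; then
      echo "$name=set"
    else
      echo "$name=unset"
    fi
  done
}
```

A common finding is that both `AWS_PROFILE` and static keys are set at once, in which case the static keys usually take precedence and the profile is silently ignored.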
Insufficient Permissions
Error Message:
Error: UnauthorizedOperation: You are not authorized to perform this operation.
Solution:
# Check current identity
aws sts get-caller-identity
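# Test specific permissions (the policy source must be an IAM user, group,
# or role ARN; an assumed-role session ARN returned by STS is not accepted)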
aws iam simulate-principal-policy \
--policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
--action-names ec2:RunInstances \
--resource-arns "*"
# Required IAM policy example
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:RunInstances",
"ec2:TerminateInstances",
"ec2:DescribeInstances",
"ec2:CreateTags"
],
"Resource": "*"
}
]
}
3. Provider and API Errors
Provider Version Conflicts
Error Message:
Error: Failed to query available provider packages
Solutions:
# Remove the local working directory cache (downloaded providers and modules)
rm -rf .terraform
# Reinitialize
terraform init
# Update providers
terraform init -upgrade
# Check provider constraints
terraform providers
API Rate Limiting
Error Message:
Error: Request rate exceeded. Please retry after some time.
Solutions:
# Wait before retrying so the API rate limit window can reset
sleep 5
terraform apply
# Reduce the number of concurrent operations (the default parallelism is 10)
terraform apply -parallelism=1
# Implement retry logic
for i in {1..3}; do
terraform apply && break
echo "Retry $i failed, waiting..."
sleep 30
done
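A fixed delay works for occasional throttling, but doubling the wait after each failure usually recovers faster from short bursts and backs off further during long ones. This is a minimal sketch of that idea, assuming a hypothetical helper named `retry_with_backoff`; the attempt count and delays are placeholders to tune.

```shell
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failed attempt. Hypothetical helper for illustration.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed, retrying in ${delay}s..." >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 3 30 terraform apply -parallelism=1` waits 30 and then 60 seconds between attempts. The helper retries on any failure, so it should only wrap commands that are safe to run again.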
Resource Already Exists
Error Message:
Error: InvalidGroup.Duplicate: The security group 'web-sg' already exists
Solutions:
# Option 1: Import existing resource
terraform import aws_security_group.web sg-12345678
# Option 2: Use unique names
resource "aws_security_group" "web" {
name_prefix = "web-sg-" # Terraform adds unique suffix
# ... rest of configuration
}
# Option 3: Reference the existing group instead of creating it
# (a data source fails when nothing matches, so it cannot drive a
# conditional "create only if missing" count)
data "aws_security_group" "existing" {
name = "web-sg"
}
# Use data.aws_security_group.existing.id wherever the group ID is needed
4. State File Issues
State Lock Timeout
Error Message:
Error: Error locking state: ConditionalCheckFailedException
Solutions:
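# Check lock status (the lock item's LockID is "<bucket>/<key>"; the
# separate item ending in "-md5" only stores the state checksum)
aws dynamodb get-item \
--table-name terraform-state-locks \
--key '{"LockID":{"S":"bucket/path/terraform.tfstate"}}'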
# Force unlock (dangerous - ensure no other process is running)
terraform force-unlock LOCK_ID
# Increase timeout
terraform apply -lock-timeout=300s
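The `LOCK_ID` needed by `terraform force-unlock` is printed in the lock error itself, under a "Lock Info:" section. This is a small sketch, assuming a hypothetical helper named `extract_lock_id` and assuming the error output lists an `ID:` field right after that section header.

```shell
#!/usr/bin/env bash
# extract_lock_id
# Reads Terraform error output from stdin and prints the value of the
# "ID:" field from the "Lock Info:" section, if one is present.
# Hypothetical helper for illustration.
extract_lock_id() {
  awk '
    /Lock Info:/     { in_lock = 1; next }
    in_lock && /ID:/ { print $NF; exit }
  '
}
```

One way to use it is `terraform plan 2>&1 | extract_lock_id`. Confirm who holds the lock (the `Who:` field in the same section) before passing the ID to `terraform force-unlock`.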
State File Corruption
Error Message:
Error: Failed to load state: state file was created by a newer Terraform version
Solutions:
# Backup current state
cp terraform.tfstate terraform.tfstate.corrupt
# List the previous versions kept by S3 versioning
aws s3api list-object-versions \
--bucket terraform-state-bucket \
--prefix path/terraform.tfstate
# Download previous version
aws s3api get-object \
--bucket terraform-state-bucket \
--key path/terraform.tfstate \
--version-id VERSION_ID \
terraform.tfstate.restored
# For the "newer Terraform version" error, upgrade Terraform to at least
# the version that last wrote the state
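To decide between restoring an older state and upgrading the CLI, it helps to read the version recorded in the state file and compare it with the installed one. This is a minimal sketch, assuming two hypothetical helpers, a local copy of the state file, and a `sort` that supports `-V` (as GNU coreutils does).

```shell
#!/usr/bin/env bash
# state_terraform_version FILE
# Prints the Terraform version recorded in the state file's top-level
# "terraform_version" field. Hypothetical helper for illustration.
state_terraform_version() {
  sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# version_at_least HAVE WANT
# Succeeds when version HAVE is greater than or equal to WANT.
# Relies on `sort -V` (version sort).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_at_least 1.5.0 "$(state_terraform_version terraform.tfstate)"` fails when the state was written by something newer than 1.5.0, which matches the error shown above.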
Resource Not in State
Error Message:
Error: Resource not found in state
Solutions:
# List current state
terraform state list
# Import missing resource
terraform import aws_instance.web i-1234567890abcdef0
# Or remove resource from configuration if it shouldn't be managed
5. Network and Connectivity Errors
Timeout Errors
Error Message:
Error: Timeout while waiting for state to become 'running'
Solutions:
# Increase timeout in resource configuration
resource "aws_instance" "web" {
# ... configuration
timeouts {
create = "10m"
update = "10m"
delete = "10m"
}
}
# Check the instance's current state
aws ec2 describe-instances --instance-ids i-1234567890abcdef0
# Verify that security groups and network ACLs allow the required traffic
DNS Resolution Issues
Error Message:
Error: no such host
Solutions:
# Check DNS resolution
nslookup registry.terraform.io
# Use an alternative registry or mirror (the hostname below is a placeholder)
terraform {
required_providers {
aws = {
source = "terraform.example.com/hashicorp/aws"
}
}
}
# Configure proxy if needed
export HTTPS_PROXY=http://proxy.company.com:8080
Advanced Troubleshooting Techniques
Creating Minimal Reproduction Cases
# minimal-repro.tf - Simplest configuration that reproduces the issue
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = "us-west-2"
}
# Only the failing resource
resource "aws_instance" "test" {
ami = "ami-12345" # Placeholder: replace with a known-good AMI ID for your region
instance_type = "t2.micro"
}
Binary Search Debugging
# When multiple resources fail, use binary search
# Comment out half the resources
# If error persists, problem is in remaining half
# If error goes away, problem is in commented half
# Repeat until you find the specific resource
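The halving procedure above can also be automated when the failure is reproducible by a command. This is a minimal sketch, assuming hypothetical helpers named `bisect_failing` and `plan_targets`, and assuming exactly one resource is responsible for the failure.

```shell
#!/usr/bin/env bash
# bisect_failing CHECK ITEM...
# CHECK is a command that receives a subset of the items as arguments and
# exits non-zero when that subset still reproduces the failure.
# Assumes exactly one item is responsible. Hypothetical helper.
bisect_failing() {
  local check=$1
  shift
  local -a items=("$@") first=() second=()
  local mid
  while [ "${#items[@]}" -gt 1 ]; do
    mid=$(( ${#items[@]} / 2 ))
    first=("${items[@]:0:mid}")
    second=("${items[@]:mid}")
    if ! "$check" "${first[@]}"; then
      items=("${first[@]}")    # failure reproduces in the first half
    else
      items=("${second[@]}")   # otherwise it is in the second half
    fi
  done
  printf '%s\n' "${items[0]}"
}

# Hypothetical check: plan only the given resource addresses.
plan_targets() {
  local -a args=()
  local addr
  for addr in "$@"; do
    args+=("-target=$addr")
  done
  terraform plan "${args[@]}" > /dev/null 2>&1
}
```

A call such as `bisect_failing plan_targets aws_vpc.main aws_subnet.a aws_instance.web` narrows the list in a logarithmic number of plans instead of one plan per resource. Because `-target` also plans each target's dependencies, a failing dependency can make several targets fail at once, so treat the result as a starting point rather than a verdict.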
Using Terraform Console for Debugging
# Interactive console for testing expressions
terraform console
# Test variable values
> var.instance_type
"t2.micro"
# Test function calls
> cidrsubnet("10.0.0.0/16", 8, 1)
"10.0.1.0/24"
# Test data source queries
> data.aws_ami.amazon_linux.id
"ami-12345"
# Test resource references
> aws_vpc.main.id
"vpc-12345"
Graph Analysis
# Generate dependency graph
terraform graph | dot -Tpng > graph.png
# Or filter the graph output for specific resource types
terraform graph | grep -E "(aws_instance|aws_security_group)"
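When Graphviz is not installed, the edges in the DOT output can still be read as plain text. This is a rough sketch, assuming a hypothetical helper named `graph_edges` and assuming edges appear as `"A" -> "B"` lines, optionally decorated with `[root]` and `(expand)`; the exact output format differs between Terraform versions, so the patterns may need adjusting.

```shell
#!/usr/bin/env bash
# graph_edges
# Reads `terraform graph` output from stdin and prints one line per edge in
# the form "<resource> depends on <dependency>". Hypothetical helper; the
# DOT decorations it strips are assumptions about the output format.
graph_edges() {
  grep -e '->' |
    sed -E 's/\[root\] //g; s/ \(expand\)//g; s/[";]//g; s/^[[:space:]]+//; s/[[:space:]]+$//; s/ -> / depends on /'
}
```

Typical usage is `terraform graph | graph_edges | grep aws_instance.web`, which lists what a single resource depends on and what depends on it.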
Provider Plugin Debugging
# Enable provider debugging and write the log to a file
export TF_LOG_PROVIDER=DEBUG
export TF_LOG_PATH=./terraform-debug.log
# Run specific operation
terraform plan
# Check for provider-specific issues in logs
grep "aws_instance" terraform-debug.log
Troubleshooting Workflows
Production Issue Response
#!/bin/bash
# production-troubleshoot.sh
set -e
echo "Starting production troubleshooting..."
# 1. Gather initial information
echo "Gathering information..."
terraform version
aws sts get-caller-identity
terraform workspace show
# 2. Check current state
echo "Checking current state..."
terraform state list > current-state.txt
terraform show > current-config.txt
# 3. Validate configuration
echo "Validating configuration..."
terraform validate
# 4. Check for drift
echo "Checking for drift..."
terraform plan -refresh-only -out=refresh.plan
# 5. Analyze the issue
echo "Current state saved to current-state.txt"
echo "State details (terraform show) saved to current-config.txt"
echo "Refresh plan saved to refresh.plan"
echo "Review these files to understand the issue"
# 6. If safe, refresh state
read -p "Apply refresh plan? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
terraform apply refresh.plan
fi
Development Debugging Workflow
#!/bin/bash
# debug-workflow.sh
# Enable comprehensive logging
export TF_LOG=DEBUG
export TF_LOG_PATH="./debug-$(date +%Y%m%d-%H%M%S).log"
echo "Starting debug session with logging enabled"
echo "Log file: $TF_LOG_PATH"
# Validate syntax
echo "1. Validating syntax..."
if ! terraform validate; then
echo "Syntax errors found. Fix and retry."
exit 1
fi
# Check formatting
echo "2. Checking formatting..."
terraform fmt -check -diff
# Plan with detailed output
echo "3. Running plan..."
terraform plan -out=debug.plan 2>&1 | tee plan-output.txt
# Show plan details
echo "4. Showing plan details..."
terraform show debug.plan
echo "Debug complete. Check log file: $TF_LOG_PATH"
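The saved `plan-output.txt` can be long, so a quick way to start is to pull out just the error summary lines. This is a small sketch, assuming a hypothetical helper named `list_tf_errors` and assuming each diagnostic's summary line contains `Error: ` followed by the message.

```shell
#!/usr/bin/env bash
# list_tf_errors FILE
# Prints the summary line of every error found in saved Terraform output.
# -a treats the file as text even if it contains box-drawing characters;
# -o prints only the part of the line starting at "Error: ".
# Hypothetical helper for illustration.
list_tf_errors() {
  grep -a -o 'Error: .*' "$1"
}
```

For example, `list_tf_errors plan-output.txt | sort | uniq -c` shows which errors repeat. Running the plan with `-no-color` keeps terminal escape codes out of the saved file and out of these lines.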
Automated Issue Detection
#!/bin/bash
# health-check.sh
ISSUES_FOUND=false
echo "Running Terraform health check..."
# Check 1: Configuration validation
if ! terraform validate; then
echo "❌ Configuration validation failed"
ISSUES_FOUND=true
else
echo "✅ Configuration validation passed"
fi
# Check 2: Format check
if ! terraform fmt -check; then
echo "❌ Format check failed"
ISSUES_FOUND=true
else
echo "✅ Format check passed"
fi
# Check 3: State consistency (plan also has to acquire the state lock)
plan_output=$(terraform plan -detailed-exitcode -lock-timeout=10s 2>&1)
exitcode=$?
if [ $exitcode -eq 0 ]; then
echo "✅ No infrastructure drift"
elif [ $exitcode -eq 2 ]; then
echo "⚠️ Pending changes or infrastructure drift detected"
else
echo "❌ Terraform plan failed"
ISSUES_FOUND=true
# Check 4: State lock status (a held lock surfaces as a plan failure)
if echo "$plan_output" | grep -q "Error acquiring the state lock"; then
echo "⚠️ State is locked - check for other running operations"
fi
fi
# Check 5: Provider availability
if ! terraform providers > /dev/null 2>&1; then
echo "❌ Provider check failed"
ISSUES_FOUND=true
else
echo "✅ Providers available"
fi
if [ "$ISSUES_FOUND" = true ]; then
echo "❌ Issues found - review above output"
exit 1
else
echo "✅ All health checks passed"
fi
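The drift check depends on the three exit codes that `terraform plan -detailed-exitcode` can return: 0 for no changes, 1 for an error, and 2 for a successful plan with changes present. A tiny sketch that makes the mapping explicit, assuming a hypothetical helper named `describe_plan_exit`:

```shell
#!/usr/bin/env bash
# describe_plan_exit CODE
# Translates the exit code of `terraform plan -detailed-exitcode`.
# Hypothetical helper for illustration.
describe_plan_exit() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "error" ;;
    2) echo "changes present" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}
```

The exit code has to be captured immediately after the plan, for example `terraform plan -detailed-exitcode > /dev/null; describe_plan_exit $?`. Reading `$?` after an `if !` test returns the negated status rather than Terraform's own code.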
Environment-Specific Troubleshooting
Local Development Issues
# Common local issues and solutions
# Issue: Provider download fails
# Solution: Check network connectivity
curl -I https://registry.terraform.io
# Issue: State file permissions
# Solution: Restrict access to your own user (state can contain secrets)
chmod 600 terraform.tfstate
# Issue: Plugin cache issues
# Solution: Clear plugin cache
rm -rf ~/.terraform.d/plugin-cache/*
CI/CD Pipeline Issues
# .github/workflows/terraform-debug.yml
name: Terraform Debug
on:
  workflow_dispatch:
    inputs:
      debug_level:
        description: 'Debug level (TRACE, DEBUG, INFO)'
        required: true
        default: 'DEBUG'
jobs:
  debug:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.5.0"
      - name: Debug Terraform
        env:
          TF_LOG: ${{ github.event.inputs.debug_level }}
          TF_LOG_PATH: terraform-debug.log
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          terraform init
          terraform validate
          terraform plan -out=debug.plan
      # Plan files and debug logs can contain sensitive values;
      # restrict who can download these artifacts
      - name: Upload debug artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: terraform-debug
          path: |
            debug.plan
            terraform-debug.log
Production Environment Issues
# Production troubleshooting checklist
# 1. Check service status
echo "Checking service dependencies..."
aws sts get-caller-identity
aws ec2 describe-regions --region-names us-west-2
# 2. Verify permissions
echo "Checking permissions..."
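aws iam get-user
# get-caller-identity does not return a user name, so read it from get-user
# (this works for IAM users, not for assumed roles)
aws iam list-attached-user-policies --user-name "$(aws iam get-user --query User.UserName --output text)"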
# 3. Check state backend health
echo "Checking state backend..."
aws s3 ls s3://terraform-state-bucket/
aws dynamodb describe-table --table-name terraform-state-locks
# 4. Review recent changes
echo "Recent AWS API calls:"
aws cloudtrail lookup-events --max-results 20
Prevention Strategies
Pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.77.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
Automated Testing
#!/bin/bash
# test-terraform.sh
set -e # Stop at the first failing check
echo "Running Terraform tests..."
# Syntax validation
terraform validate
# Security scanning
tfsec .
# Best practices check (terraform-compliance evaluates a plan file)
terraform plan -out=plan.out
terraform-compliance -f compliance-rules -p plan.out
# Cost estimation
infracost breakdown --path .
echo "All tests completed"
Monitoring and Alerting
# monitoring.tf
resource "aws_cloudwatch_log_group" "terraform_logs" {
name = "/aws/terraform/operations"
retention_in_days = 30
}
resource "aws_sns_topic" "terraform_alerts" {
name = "terraform-alerts"
}
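# Count log lines containing "Error:" as a custom metric. The AWS/Logs
# namespace has no "Errors" metric, and this assumes Terraform output is
# shipped to the log group above.
resource "aws_cloudwatch_log_metric_filter" "terraform_errors" {
name = "terraform-errors"
log_group_name = aws_cloudwatch_log_group.terraform_logs.name
pattern = "\"Error:\""
metric_transformation {
name = "TerraformErrors"
namespace = "Terraform/Operations"
value = "1"
}
}
resource "aws_cloudwatch_metric_alarm" "terraform_failures" {
alarm_name = "terraform-operation-failures"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "TerraformErrors"
namespace = "Terraform/Operations"
period = 300
statistic = "Sum"
threshold = 0
alarm_description = "Alarms when Terraform operation logs contain errors"
alarm_actions = [aws_sns_topic.terraform_alerts.arn]
}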
Key Takeaways
- Use systematic approaches to troubleshooting
- Enable debug logging for complex issues
- Start with simple validation commands
- Create minimal reproduction cases
- Use Terraform console for interactive debugging
- Implement health checks and monitoring
- Document solutions for future reference
- Prevent issues with automated testing
- Have escalation procedures for production issues
- Keep troubleshooting tools and scripts ready
Next Steps
- Complete Tutorial 15: Create and Use Modules
- Learn about advanced Terraform testing strategies
- Explore infrastructure monitoring and observability
- Practice troubleshooting in different environments