Tutorial 14: Troubleshoot Terraform

Learning Objectives

Learn systematic approaches to Terraform troubleshooting
Understand common error types and their solutions
Practice using debugging tools and techniques
Implement troubleshooting workflows for complex issues
Develop skills for root cause analysis

Terraform Troubleshooting Methodology

Systematic Approach

Identify the Problem: Understand what's failing and when
Gather Information: Collect logs, error messages, and context
Isolate the Issue: Narrow down to specific resources or operations
Analyze Root Cause: Understand why the problem occurred
Implement Solution: Fix the underlying issue
Verify Resolution: Test that the fix works
Document Learning: Record solution for future reference

Error Classification

Syntax Errors: Invalid HCL configuration
Validation Errors: Configuration doesn't meet provider requirements
Authentication Errors: Credential or permission issues
API Errors: Cloud provider API failures
State Errors: State file corruption or inconsistencies
Dependency Errors: Resource dependency issues
Network Errors: Connectivity or timeout issues

Debugging Tools and Techniques

Enable Debug Logging

# Set log level (TRACE, DEBUG, INFO, WARN, ERROR)
export TF_LOG=DEBUG

# Save logs to file
export TF_LOG_PATH=./terraform-debug.log

# Run Terraform with logging
terraform plan

# Review logs
less terraform-debug.log

Component-Specific Logging

# Log only core Terraform operations
export TF_LOG_CORE=DEBUG

# Log only provider operations
export TF_LOG_PROVIDER=DEBUG

# Log path for provider logs
export TF_LOG_PROVIDER_PATH=./provider-debug.log

Validation and Formatting

# Check configuration syntax
terraform validate

# Format configuration files
terraform fmt -check

# Initialize and validate
terraform init
terraform validate

State Inspection

# Show current state
terraform show

# List all resources in state
terraform state list

# Show specific resource
terraform state show aws_instance.web

# Refresh state from infrastructure
terraform refresh

Common Error Categories and Solutions

1. Syntax and Configuration Errors

Invalid HCL Syntax

# Error: Invalid character
resource "aws_instance" "web" {
  ami = "ami-12345"
  instance_type = "t2.micro"
  # Missing closing brace

Error Message:

Error: Invalid character

  on main.tf line 10:
   1: resource "aws_instance" "web" {

This character is not used within the language.

Solution:

# Fix: Add missing closing brace
resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t2.micro"
}

Invalid Resource Arguments

# Error: Invalid argument
resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t2.micro"
  invalid_arg   = "value"  # This argument doesn't exist
}

Error Message:

Error: Unsupported argument

  on main.tf line 4, in resource "aws_instance" "web":
   4:   invalid_arg = "value"

An argument named "invalid_arg" is not expected here.

Solution:

# Check provider documentation
terraform providers schema -json | jq '.provider_schemas["registry.terraform.io/hashicorp/aws"].resource_schemas["aws_instance"]'

# Or check online documentation
# Remove invalid argument

Circular Dependencies

# Error: Circular dependency
resource "aws_security_group" "web" {
  name = "web-sg"
  
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]  # Depends on app
  }
}

resource "aws_security_group" "app" {
  name = "app-sg"
  
  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.web.id]  # Depends on web
  }
}

Solution:

# Fix: Use separate security group rules
resource "aws_security_group" "web" {
  name = "web-sg"
}

resource "aws_security_group" "app" {
  name = "app-sg"
}

resource "aws_security_group_rule" "web_to_app" {
  type                     = "egress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.web.id
  source_security_group_id = aws_security_group.app.id
}

resource "aws_security_group_rule" "app_from_web" {
  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.web.id
}

2. Authentication and Permission Errors

AWS Credential Issues

Error Message:

Error: No valid credential sources found for AWS Provider.

Solutions:

# Check AWS configuration
aws configure list

# Set credentials via environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"

# Or configure AWS CLI
aws configure

# Or use IAM roles (for EC2/Lambda)
# Or use AWS SSO
aws sso login --profile your-profile
export AWS_PROFILE=your-profile

Insufficient Permissions

Error Message:

Error: UnauthorizedOperation: You are not authorized to perform this operation.

Solution:

# Check current identity
aws sts get-caller-identity

# Test specific permissions
aws iam simulate-principal-policy \
  --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
  --action-names ec2:RunInstances \
  --resource-arns "*"

# Required IAM policy example
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }
  ]
}

3. Provider and API Errors

Provider Version Conflicts

Error Message:

Error: Failed to query available provider packages

Solutions:

# Clear provider cache
rm -rf .terraform

# Reinitialize
terraform init

# Update providers
terraform init -upgrade

# Check provider constraints
terraform providers

API Rate Limiting

Error Message:

Error: Request rate exceeded. Please retry after some time.

Solutions:

# Add delays between operations
sleep 5
terraform apply

# Use parallelism control
terraform apply -parallelism=1

# Implement retry logic
for i in {1..3}; do
  terraform apply && break
  echo "Retry $i failed, waiting..."
  sleep 30
done

Resource Already Exists

Error Message:

Error: InvalidGroup.Duplicate: The security group 'web-sg' already exists

Solutions:

# Option 1: Import existing resource
terraform import aws_security_group.web sg-12345678

# Option 2: Use unique names
resource "aws_security_group" "web" {
  name_prefix = "web-sg-"  # Terraform adds unique suffix
  # ... rest of configuration
}

# Option 3: Check if resource exists first
data "aws_security_group" "existing" {
  name = "web-sg"
}

resource "aws_security_group" "web" {
  count = length(data.aws_security_group.existing.id) == 0 ? 1 : 0
  name  = "web-sg"
  # ... rest of configuration
}

4. State File Issues

State Lock Timeout

Error Message:

Error: Error locking state: ConditionalCheckFailedException

Solutions:

# Check lock status
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID":{"S":"bucket/path/terraform.tfstate-md5"}}'

# Force unlock (dangerous - ensure no other process is running)
terraform force-unlock LOCK_ID

# Increase timeout
terraform apply -lock-timeout=300s

State File Corruption

Error Message:

Error: Failed to load state: state file was created by a newer Terraform version

Solutions:

# Backup current state
cp terraform.tfstate terraform.tfstate.corrupt

# Restore from S3 versioning
aws s3api list-object-versions \
  --bucket terraform-state-bucket \
  --prefix path/terraform.tfstate

# Download previous version
aws s3api get-object \
  --bucket terraform-state-bucket \
  --key path/terraform.tfstate \
  --version-id VERSION_ID \
  terraform.tfstate.restored

# Or upgrade Terraform to match state version

Resource Not in State

Error Message:

Error: Resource not found in state

Solutions:

# List current state
terraform state list

# Import missing resource
terraform import aws_instance.web i-1234567890abcdef0

# Or remove resource from configuration if it shouldn't be managed

5. Network and Connectivity Errors

Timeout Errors

Error Message:

Error: Timeout while waiting for state to become 'running'

Solutions:

# Increase timeout in resource configuration
resource "aws_instance" "web" {
  # ... configuration
  
  timeouts {
    create = "10m"
    update = "10m"
    delete = "10m"
  }
}

# Check network connectivity
aws ec2 describe-instances --instance-ids i-1234567890abcdef0

# Verify security groups and NACLs

DNS Resolution Issues

Error Message:

Error: no such host

Solutions:

# Check DNS resolution
nslookup registry.terraform.io

# Use alternative registry
terraform {
  required_providers {
    aws = {
      source = "terraform.example.com/hashicorp/aws"
    }
  }
}

# Configure proxy if needed
export HTTPS_PROXY=http://proxy.company.com:8080

Advanced Troubleshooting Techniques

Creating Minimal Reproduction Cases

# minimal-repro.tf - Simplest configuration that reproduces the issue
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

# Only the failing resource
resource "aws_instance" "test" {
  ami           = "ami-12345"  # Known good AMI
  instance_type = "t2.micro"
}

Binary Search Debugging

# When multiple resources fail, use binary search
# Comment out half the resources
# If error persists, problem is in remaining half
# If error goes away, problem is in commented half
# Repeat until you find the specific resource

Using Terraform Console for Debugging

# Interactive console for testing expressions
terraform console

# Test variable values
> var.instance_type
"t2.micro"

# Test function calls
> cidrsubnet("10.0.0.0/16", 8, 1)
"10.0.1.0/24"

# Test data source queries
> data.aws_ami.amazon_linux.id
"ami-12345"

# Test resource references
> aws_vpc.main.id
"vpc-12345"

Graph Analysis

# Generate dependency graph
terraform graph | dot -Tpng > graph.png

# Or use simpler format
terraform graph | grep -E "(aws_instance|aws_security_group)"

Provider Plugin Debugging

# Enable provider debugging
export TF_LOG_PROVIDER=DEBUG

# Run specific operation
terraform plan

# Check for provider-specific issues in logs
grep "aws_instance" terraform-debug.log

Troubleshooting Workflows

Production Issue Response

#!/bin/bash
# production-troubleshoot.sh

set -e

echo "Starting production troubleshooting..."

# 1. Gather initial information
echo "Gathering information..."
terraform version
aws sts get-caller-identity
terraform workspace show

# 2. Check current state
echo "Checking current state..."
terraform state list > current-state.txt
terraform show > current-config.txt

# 3. Validate configuration
echo "Validating configuration..."
terraform validate

# 4. Check for drift
echo "Checking for drift..."
terraform plan -refresh-only -out=refresh.plan

# 5. Analyze the issue
echo "Current state saved to current-state.txt"
echo "Current config saved to current-config.txt"
echo "Refresh plan saved to refresh.plan"
echo "Review these files to understand the issue"

# 6. If safe, refresh state
read -p "Apply refresh plan? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    terraform apply refresh.plan
fi

Development Debugging Workflow

#!/bin/bash
# debug-workflow.sh

# Enable comprehensive logging
export TF_LOG=DEBUG
export TF_LOG_PATH="./debug-$(date +%Y%m%d-%H%M%S).log"

echo "Starting debug session with logging enabled"
echo "Log file: $TF_LOG_PATH"

# Validate syntax
echo "1. Validating syntax..."
if ! terraform validate; then
    echo "Syntax errors found. Fix and retry."
    exit 1
fi

# Check formatting
echo "2. Checking formatting..."
terraform fmt -check -diff

# Plan with detailed output
echo "3. Running plan..."
terraform plan -out=debug.plan 2>&1 | tee plan-output.txt

# Show plan details
echo "4. Showing plan details..."
terraform show debug.plan

echo "Debug complete. Check log file: $TF_LOG_PATH"

Automated Issue Detection

#!/bin/bash
# health-check.sh

ISSUES_FOUND=false

echo "Running Terraform health check..."

# Check 1: Configuration validation
if ! terraform validate; then
    echo "❌ Configuration validation failed"
    ISSUES_FOUND=true
else
    echo "✅ Configuration validation passed"
fi

# Check 2: Format check
if ! terraform fmt -check; then
    echo "❌ Format check failed"
    ISSUES_FOUND=true
else
    echo "✅ Format check passed"
fi

# Check 3: State consistency
if ! terraform plan -detailed-exitcode > /dev/null; then
    exitcode=$?
    if [ $exitcode -eq 1 ]; then
        echo "❌ Terraform plan failed"
        ISSUES_FOUND=true
    elif [ $exitcode -eq 2 ]; then
        echo "⚠️  Infrastructure drift detected"
    fi
else
    echo "✅ No infrastructure drift"
fi

# Check 4: State lock status
if terraform force-unlock -help 2>/dev/null | grep -q "Manually unlock"; then
    echo "⚠️  State may be locked - check manually"
fi

# Check 5: Provider availability
if ! terraform providers > /dev/null 2>&1; then
    echo "❌ Provider check failed"
    ISSUES_FOUND=true
else
    echo "✅ Providers available"
fi

if [ "$ISSUES_FOUND" = true ]; then
    echo "❌ Issues found - review above output"
    exit 1
else
    echo "✅ All health checks passed"
fi

Environment-Specific Troubleshooting

Local Development Issues

# Common local issues and solutions

# Issue: Provider download fails
# Solution: Check network connectivity
curl -I https://registry.terraform.io

# Issue: State file permissions
# Solution: Fix file permissions
chmod 644 terraform.tfstate

# Issue: Plugin cache issues
# Solution: Clear plugin cache
rm -rf ~/.terraform.d/plugin-cache/*

CI/CD Pipeline Issues

# .github/workflows/terraform-debug.yml
name: Terraform Debug

on:
  workflow_dispatch:
    inputs:
      debug_level:
        description: 'Debug level (TRACE, DEBUG, INFO)'
        required: true
        default: 'DEBUG'

jobs:
  debug:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0
          
      - name: Debug Terraform
        env:
          TF_LOG: ${{ github.event.inputs.debug_level }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          terraform init
          terraform validate
          terraform plan -out=debug.plan
          
      - name: Upload debug artifacts
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: terraform-debug
          path: |
            debug.plan
            .terraform/

Production Environment Issues

# Production troubleshooting checklist

# 1. Check service status
echo "Checking service dependencies..."
aws sts get-caller-identity
aws ec2 describe-regions --region-names us-west-2

# 2. Verify permissions
echo "Checking permissions..."
aws iam get-user
aws iam list-attached-user-policies --user-name $(aws sts get-caller-identity --query UserName --output text)

# 3. Check state backend health
echo "Checking state backend..."
aws s3 ls s3://terraform-state-bucket/
aws dynamodb describe-table --table-name terraform-state-locks

# 4. Review recent changes
echo "Recent AWS API calls:"
aws logs describe-log-groups --log-group-name-prefix CloudTrail

Prevention Strategies

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.77.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint

Automated Testing

# test-terraform.sh
#!/bin/bash

echo "Running Terraform tests..."

# Syntax validation
terraform validate

# Security scanning
tfsec .

# Best practices check
terraform-compliance -f compliance-rules -p .

# Cost estimation
infracost breakdown --path .

echo "All tests completed"

Monitoring and Alerting

# monitoring.tf
resource "aws_cloudwatch_log_group" "terraform_logs" {
  name              = "/aws/terraform/operations"
  retention_in_days = 30
}

resource "aws_sns_topic" "terraform_alerts" {
  name = "terraform-alerts"
}

resource "aws_cloudwatch_metric_alarm" "terraform_failures" {
  alarm_name          = "terraform-operation-failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "Errors"
  namespace           = "AWS/Logs"
  period              = "300"
  statistic           = "Sum"
  threshold           = "0"
  alarm_description   = "This metric monitors terraform operation failures"
  alarm_actions       = [aws_sns_topic.terraform_alerts.arn]

  dimensions = {
    LogGroupName = aws_cloudwatch_log_group.terraform_logs.name
  }
}

Key Takeaways

Use systematic approaches to troubleshooting
Enable debug logging for complex issues
Start with simple validation commands
Create minimal reproduction cases
Use Terraform console for interactive debugging
Implement health checks and monitoring
Document solutions for future reference
Prevent issues with automated testing
Have escalation procedures for production issues
Keep troubleshooting tools and scripts ready

Next Steps

Complete Tutorial 15: Create and Use Modules
Learn about advanced Terraform testing strategies
Explore infrastructure monitoring and observability
Practice troubleshooting in different environments

Troubleshoot Terraform

Tutorial 14: Troubleshoot Terraform

Learning Objectives

Terraform Troubleshooting Methodology

Systematic Approach

Error Classification

Debugging Tools and Techniques

Enable Debug Logging

Component-Specific Logging

Validation and Formatting

State Inspection

Common Error Categories and Solutions

1. Syntax and Configuration Errors

Invalid HCL Syntax

Invalid Resource Arguments

Circular Dependencies

2. Authentication and Permission Errors

AWS Credential Issues

Insufficient Permissions

3. Provider and API Errors

Provider Version Conflicts

API Rate Limiting

Resource Already Exists

4. State File Issues

State Lock Timeout

State File Corruption

Resource Not in State

5. Network and Connectivity Errors

Timeout Errors

DNS Resolution Issues

Advanced Troubleshooting Techniques

Creating Minimal Reproduction Cases

Binary Search Debugging

Using Terraform Console for Debugging

Graph Analysis

Provider Plugin Debugging

Troubleshooting Workflows

Production Issue Response

Development Debugging Workflow

Automated Issue Detection

Environment-Specific Troubleshooting

Local Development Issues

CI/CD Pipeline Issues

Production Environment Issues

Prevention Strategies

Pre-commit Hooks

Automated Testing

Monitoring and Alerting

Key Takeaways

Next Steps

Additional Resources