AskLearn
Loading...
← Back to Terraform Course
IntermediateFundamentals

Troubleshoot Terraform

Debug common issues

Tutorial 14: Troubleshoot Terraform

Learning Objectives

  • Learn systematic approaches to Terraform troubleshooting
  • Understand common error types and their solutions
  • Practice using debugging tools and techniques
  • Implement troubleshooting workflows for complex issues
  • Develop skills for root cause analysis

Terraform Troubleshooting Methodology

Systematic Approach

  1. Identify the Problem: Understand what's failing and when
  2. Gather Information: Collect logs, error messages, and context
  3. Isolate the Issue: Narrow down to specific resources or operations
  4. Analyze Root Cause: Understand why the problem occurred
  5. Implement Solution: Fix the underlying issue
  6. Verify Resolution: Test that the fix works
  7. Document Learning: Record solution for future reference

Error Classification

  • Syntax Errors: Invalid HCL configuration
  • Validation Errors: Configuration doesn't meet provider requirements
  • Authentication Errors: Credential or permission issues
  • API Errors: Cloud provider API failures
  • State Errors: State file corruption or inconsistencies
  • Dependency Errors: Resource dependency issues
  • Network Errors: Connectivity or timeout issues

Debugging Tools and Techniques

Enable Debug Logging

# Set log level (TRACE, DEBUG, INFO, WARN, ERROR)
export TF_LOG=DEBUG

# Save logs to file
export TF_LOG_PATH=./terraform-debug.log

# Run Terraform with logging
terraform plan

# Review logs
less terraform-debug.log

Component-Specific Logging

# Log only core Terraform operations
export TF_LOG_CORE=DEBUG

# Log only provider operations
export TF_LOG_PROVIDER=DEBUG

# Log path for provider logs
export TF_LOG_PROVIDER_PATH=./provider-debug.log

Validation and Formatting

# Check configuration syntax
terraform validate

# Format configuration files
terraform fmt -check

# Initialize and validate
terraform init
terraform validate

State Inspection

# Show current state
terraform show

# List all resources in state
terraform state list

# Show specific resource
terraform state show aws_instance.web

# Refresh state from infrastructure
terraform refresh

Common Error Categories and Solutions

1. Syntax and Configuration Errors

Invalid HCL Syntax

# Error: Invalid character
resource "aws_instance" "web" {
  ami = "ami-12345"
  instance_type = "t2.micro"
  # Missing closing brace

Error Message:

Error: Invalid character

  on main.tf line 10:
   1: resource "aws_instance" "web" {

This character is not used within the language.

Solution:

# Fix: Add missing closing brace
resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t2.micro"
}

Invalid Resource Arguments

# Error: Invalid argument
resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t2.micro"
  invalid_arg   = "value"  # This argument doesn't exist
}

Error Message:

Error: Unsupported argument

  on main.tf line 4, in resource "aws_instance" "web":
   4:   invalid_arg = "value"

An argument named "invalid_arg" is not expected here.

Solution:

# Check provider documentation
terraform providers schema -json | jq '.provider_schemas["registry.terraform.io/hashicorp/aws"].resource_schemas["aws_instance"]'

# Or check online documentation
# Remove invalid argument

Circular Dependencies

# Error: Circular dependency
resource "aws_security_group" "web" {
  name = "web-sg"
  
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]  # Depends on app
  }
}

resource "aws_security_group" "app" {
  name = "app-sg"
  
  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.web.id]  # Depends on web
  }
}

Solution:

# Fix: Use separate security group rules
resource "aws_security_group" "web" {
  name = "web-sg"
}

resource "aws_security_group" "app" {
  name = "app-sg"
}

resource "aws_security_group_rule" "web_to_app" {
  type                     = "egress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.web.id
  source_security_group_id = aws_security_group.app.id
}

resource "aws_security_group_rule" "app_from_web" {
  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.web.id
}

2. Authentication and Permission Errors

AWS Credential Issues

Error Message:

Error: No valid credential sources found for AWS Provider.

Solutions:

# Check AWS configuration
aws configure list

# Set credentials via environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"

# Or configure AWS CLI
aws configure

# Or use IAM roles (for EC2/Lambda)
# Or use AWS SSO
aws sso login --profile your-profile
export AWS_PROFILE=your-profile

Insufficient Permissions

Error Message:

Error: UnauthorizedOperation: You are not authorized to perform this operation.

Solution:

# Check current identity
aws sts get-caller-identity

# Test specific permissions
aws iam simulate-principal-policy \
  --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \
  --action-names ec2:RunInstances \
  --resource-arns "*"

# Required IAM policy example
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }
  ]
}

3. Provider and API Errors

Provider Version Conflicts

Error Message:

Error: Failed to query available provider packages

Solutions:

# Clear provider cache
rm -rf .terraform

# Reinitialize
terraform init

# Update providers
terraform init -upgrade

# Check provider constraints
terraform providers

API Rate Limiting

Error Message:

Error: Request rate exceeded. Please retry after some time.

Solutions:

# Add delays between operations
sleep 5
terraform apply

# Use parallelism control
terraform apply -parallelism=1

# Implement retry logic
for i in {1..3}; do
  terraform apply && break
  echo "Retry $i failed, waiting..."
  sleep 30
done

Resource Already Exists

Error Message:

Error: InvalidGroup.Duplicate: The security group 'web-sg' already exists

Solutions:

# Option 1: Import existing resource
terraform import aws_security_group.web sg-12345678

# Option 2: Use unique names
resource "aws_security_group" "web" {
  name_prefix = "web-sg-"  # Terraform adds unique suffix
  # ... rest of configuration
}

# Option 3: Check if resource exists first
data "aws_security_group" "existing" {
  name = "web-sg"
}

resource "aws_security_group" "web" {
  count = length(data.aws_security_group.existing.id) == 0 ? 1 : 0
  name  = "web-sg"
  # ... rest of configuration
}

4. State File Issues

State Lock Timeout

Error Message:

Error: Error locking state: ConditionalCheckFailedException

Solutions:

# Check lock status
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID":{"S":"bucket/path/terraform.tfstate-md5"}}'

# Force unlock (dangerous - ensure no other process is running)
terraform force-unlock LOCK_ID

# Increase timeout
terraform apply -lock-timeout=300s

State File Corruption

Error Message:

Error: Failed to load state: state file was created by a newer Terraform version

Solutions:

# Backup current state
cp terraform.tfstate terraform.tfstate.corrupt

# Restore from S3 versioning
aws s3api list-object-versions \
  --bucket terraform-state-bucket \
  --prefix path/terraform.tfstate

# Download previous version
aws s3api get-object \
  --bucket terraform-state-bucket \
  --key path/terraform.tfstate \
  --version-id VERSION_ID \
  terraform.tfstate.restored

# Or upgrade Terraform to match state version

Resource Not in State

Error Message:

Error: Resource not found in state

Solutions:

# List current state
terraform state list

# Import missing resource
terraform import aws_instance.web i-1234567890abcdef0

# Or remove resource from configuration if it shouldn't be managed

5. Network and Connectivity Errors

Timeout Errors

Error Message:

Error: Timeout while waiting for state to become 'running'

Solutions:

# Increase timeout in resource configuration
resource "aws_instance" "web" {
  # ... configuration
  
  timeouts {
    create = "10m"
    update = "10m"
    delete = "10m"
  }
}

# Check network connectivity
aws ec2 describe-instances --instance-ids i-1234567890abcdef0

# Verify security groups and NACLs

DNS Resolution Issues

Error Message:

Error: no such host

Solutions:

# Check DNS resolution
nslookup registry.terraform.io

# Use alternative registry
terraform {
  required_providers {
    aws = {
      source = "terraform.example.com/hashicorp/aws"
    }
  }
}

# Configure proxy if needed
export HTTPS_PROXY=http://proxy.company.com:8080

Advanced Troubleshooting Techniques

Creating Minimal Reproduction Cases

# minimal-repro.tf - Simplest configuration that reproduces the issue
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

# Only the failing resource
resource "aws_instance" "test" {
  ami           = "ami-12345"  # Known good AMI
  instance_type = "t2.micro"
}

Binary Search Debugging

# When multiple resources fail, use binary search
# Comment out half the resources
# If error persists, problem is in remaining half
# If error goes away, problem is in commented half
# Repeat until you find the specific resource

Using Terraform Console for Debugging

# Interactive console for testing expressions
terraform console

# Test variable values
> var.instance_type
"t2.micro"

# Test function calls
> cidrsubnet("10.0.0.0/16", 8, 1)
"10.0.1.0/24"

# Test data source queries
> data.aws_ami.amazon_linux.id
"ami-12345"

# Test resource references
> aws_vpc.main.id
"vpc-12345"

Graph Analysis

# Generate dependency graph
terraform graph | dot -Tpng > graph.png

# Or use simpler format
terraform graph | grep -E "(aws_instance|aws_security_group)"

Provider Plugin Debugging

# Enable provider debugging
export TF_LOG_PROVIDER=DEBUG

# Run specific operation
terraform plan

# Check for provider-specific issues in logs
grep "aws_instance" terraform-debug.log

Troubleshooting Workflows

Production Issue Response

#!/bin/bash
# production-troubleshoot.sh

set -e

echo "Starting production troubleshooting..."

# 1. Gather initial information
echo "Gathering information..."
terraform version
aws sts get-caller-identity
terraform workspace show

# 2. Check current state
echo "Checking current state..."
terraform state list > current-state.txt
terraform show > current-config.txt

# 3. Validate configuration
echo "Validating configuration..."
terraform validate

# 4. Check for drift
echo "Checking for drift..."
terraform plan -refresh-only -out=refresh.plan

# 5. Analyze the issue
echo "Current state saved to current-state.txt"
echo "Current config saved to current-config.txt"
echo "Refresh plan saved to refresh.plan"
echo "Review these files to understand the issue"

# 6. If safe, refresh state
read -p "Apply refresh plan? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    terraform apply refresh.plan
fi

Development Debugging Workflow

#!/bin/bash
# debug-workflow.sh

# Enable comprehensive logging
export TF_LOG=DEBUG
export TF_LOG_PATH="./debug-$(date +%Y%m%d-%H%M%S).log"

echo "Starting debug session with logging enabled"
echo "Log file: $TF_LOG_PATH"

# Validate syntax
echo "1. Validating syntax..."
if ! terraform validate; then
    echo "Syntax errors found. Fix and retry."
    exit 1
fi

# Check formatting
echo "2. Checking formatting..."
terraform fmt -check -diff

# Plan with detailed output
echo "3. Running plan..."
terraform plan -out=debug.plan 2>&1 | tee plan-output.txt

# Show plan details
echo "4. Showing plan details..."
terraform show debug.plan

echo "Debug complete. Check log file: $TF_LOG_PATH"

Automated Issue Detection

#!/bin/bash
# health-check.sh

ISSUES_FOUND=false

echo "Running Terraform health check..."

# Check 1: Configuration validation
if ! terraform validate; then
    echo "❌ Configuration validation failed"
    ISSUES_FOUND=true
else
    echo "✅ Configuration validation passed"
fi

# Check 2: Format check
if ! terraform fmt -check; then
    echo "❌ Format check failed"
    ISSUES_FOUND=true
else
    echo "✅ Format check passed"
fi

# Check 3: State consistency
if ! terraform plan -detailed-exitcode > /dev/null; then
    exitcode=$?
    if [ $exitcode -eq 1 ]; then
        echo "❌ Terraform plan failed"
        ISSUES_FOUND=true
    elif [ $exitcode -eq 2 ]; then
        echo "⚠️  Infrastructure drift detected"
    fi
else
    echo "✅ No infrastructure drift"
fi

# Check 4: State lock status
if terraform force-unlock -help 2>/dev/null | grep -q "Manually unlock"; then
    echo "⚠️  State may be locked - check manually"
fi

# Check 5: Provider availability
if ! terraform providers > /dev/null 2>&1; then
    echo "❌ Provider check failed"
    ISSUES_FOUND=true
else
    echo "✅ Providers available"
fi

if [ "$ISSUES_FOUND" = true ]; then
    echo "❌ Issues found - review above output"
    exit 1
else
    echo "✅ All health checks passed"
fi

Environment-Specific Troubleshooting

Local Development Issues

# Common local issues and solutions

# Issue: Provider download fails
# Solution: Check network connectivity
curl -I https://registry.terraform.io

# Issue: State file permissions
# Solution: Fix file permissions
chmod 644 terraform.tfstate

# Issue: Plugin cache issues
# Solution: Clear plugin cache
rm -rf ~/.terraform.d/plugin-cache/*

CI/CD Pipeline Issues

# .github/workflows/terraform-debug.yml
name: Terraform Debug

on:
  workflow_dispatch:
    inputs:
      debug_level:
        description: 'Debug level (TRACE, DEBUG, INFO)'
        required: true
        default: 'DEBUG'

jobs:
  debug:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0
          
      - name: Debug Terraform
        env:
          TF_LOG: ${{ github.event.inputs.debug_level }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          terraform init
          terraform validate
          terraform plan -out=debug.plan
          
      - name: Upload debug artifacts
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: terraform-debug
          path: |
            debug.plan
            .terraform/

Production Environment Issues

# Production troubleshooting checklist

# 1. Check service status
echo "Checking service dependencies..."
aws sts get-caller-identity
aws ec2 describe-regions --region-names us-west-2

# 2. Verify permissions
echo "Checking permissions..."
aws iam get-user
aws iam list-attached-user-policies --user-name $(aws sts get-caller-identity --query UserName --output text)

# 3. Check state backend health
echo "Checking state backend..."
aws s3 ls s3://terraform-state-bucket/
aws dynamodb describe-table --table-name terraform-state-locks

# 4. Review recent changes
echo "Recent AWS API calls:"
aws logs describe-log-groups --log-group-name-prefix CloudTrail

Prevention Strategies

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.77.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint

Automated Testing

# test-terraform.sh
#!/bin/bash

echo "Running Terraform tests..."

# Syntax validation
terraform validate

# Security scanning
tfsec .

# Best practices check
terraform-compliance -f compliance-rules -p .

# Cost estimation
infracost breakdown --path .

echo "All tests completed"

Monitoring and Alerting

# monitoring.tf
resource "aws_cloudwatch_log_group" "terraform_logs" {
  name              = "/aws/terraform/operations"
  retention_in_days = 30
}

resource "aws_sns_topic" "terraform_alerts" {
  name = "terraform-alerts"
}

resource "aws_cloudwatch_metric_alarm" "terraform_failures" {
  alarm_name          = "terraform-operation-failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "Errors"
  namespace           = "AWS/Logs"
  period              = "300"
  statistic           = "Sum"
  threshold           = "0"
  alarm_description   = "This metric monitors terraform operation failures"
  alarm_actions       = [aws_sns_topic.terraform_alerts.arn]

  dimensions = {
    LogGroupName = aws_cloudwatch_log_group.terraform_logs.name
  }
}

Key Takeaways

  • Use systematic approaches to troubleshooting
  • Enable debug logging for complex issues
  • Start with simple validation commands
  • Create minimal reproduction cases
  • Use Terraform console for interactive debugging
  • Implement health checks and monitoring
  • Document solutions for future reference
  • Prevent issues with automated testing
  • Have escalation procedures for production issues
  • Keep troubleshooting tools and scripts ready

Next Steps

  1. Complete Tutorial 15: Create and Use Modules
  2. Learn about advanced Terraform testing strategies
  3. Explore infrastructure monitoring and observability
  4. Practice troubleshooting in different environments

Additional Resources