Terraform State: Managing Drifts and Preventing Disasters

Terraform’s Infrastructure as Code (IaC) approach simplifies infrastructure provisioning and management. In practice, however, the desired state described in your Terraform code can diverge from the actual state of your cloud environment. This divergence is called “drift”. In this post, we’ll walk through common drift scenarios and how you can manage them effectively.

Understanding Terraform State
Common Drift Scenarios
Preventing Accidental Deletions
Handle Temporary Changes
Accidentally Modifying Immutable Resources
Overlooking Dependency Management
Hardcoding Secrets in Configurations
Not Leveraging Modules
Ignoring Resource Limits

Understanding Terraform State

Terraform uses a state file to map the resources in your configuration to real-world resources in cloud environments. This state allows Terraform to identify what has been created, what should be modified, and what should be deleted. Drift occurs when changes are made outside of Terraform, so that the configuration and state no longer match the real-world resources.

Common Drift Scenarios

a. Detached Database Volume

Scenario: You’re performing a backup and manually detach a database volume in AWS. The next Terraform run tries to reattach the volume, interrupting the backup process.

Solution:

Remove the volume attachment from Terraform state so that Terraform stops managing it during the backup.

```
terraform state rm aws_volume_attachment.db02_volume1_attachment
```

Once the backup is finished and the volume is attached again, import the attachment back into state. The import ID has the form DEVICE_NAME:VOLUME_ID:INSTANCE_ID.

```
terraform import aws_volume_attachment.db02_volume1_attachment /dev/sdf:vol-0ae468fd40c3c12b7:i-08ec40a34d924f99e
```
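On Terraform 1.5 or later, the import step can also be written declaratively as an import block, so that it shows up in terraform plan and goes through code review. A minimal sketch, reusing the resource address and ID from the command above and assuming the matching resource block still exists in the configuration:

```hcl
# Declarative equivalent of the `terraform import` command above.
# Requires Terraform 1.5+. The ID format is DEVICE_NAME:VOLUME_ID:INSTANCE_ID.
import {
  to = aws_volume_attachment.db02_volume1_attachment
  id = "/dev/sdf:vol-0ae468fd40c3c12b7:i-08ec40a34d924f99e"
}
```

The block can be deleted again once an apply has recorded the attachment in state.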

b. Elastic IPs Disassociation

Scenario: An Elastic IP has been manually disassociated from an instance.

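Solution:

Remove the association from Terraform state.

```
terraform state rm aws_eip_association.example
```

After the Elastic IP has been associated again, import the association. The import ID is the association ID (eipassoc-...), not the allocation ID (eipalloc-...).

```
terraform import aws_eip_association.example eipassoc-12345678
```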

c. ALB Target Group Changes

Scenario: A target group for an Application Load Balancer (ALB) was modified outside Terraform.

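Solution:

Remove the target group from Terraform state.

```
terraform state rm aws_lb_target_group.example
```

Import the modified target group by its ARN.

```
terraform import aws_lb_target_group.example arn:aws:elasticloadbalancing:region:account-id:targetgroup/target-group-name/target-group-id
```

Then update the configuration to match the imported settings. Otherwise the next terraform plan will propose reverting the manual change.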

Preventing Accidental Deletions

Accidentally deleting a production resource is a concern many engineers have. Terraform provides safeguards against this:

Resource Lifecycle: With the prevent_destroy lifecycle directive set, Terraform returns an error for any plan that would destroy the resource.
```
resource "aws_db_instance" "example" {
  # ... other configuration ...
  lifecycle {
    prevent_destroy = true
  }
}
```
State Locking: With state locking enabled, only one operation can write to the state at a time, which prevents concurrent runs from conflicting or corrupting it.
Backup: Always back up your Terraform state file before applying changes. Some backends support versioning (for example, the S3 backend when versioning is enabled on the bucket), which provides an additional safety net.
Commenting out the resource: Commenting out an entire resource block to “pause” it is risky. When you run terraform plan or terraform apply, Terraform treats the resource as removed from the configuration and plans to destroy it. The prevent_destroy directive does not protect you here: it lives inside the resource block, so it is removed along with the block and the destroy is allowed to proceed. If you want Terraform to stop managing a resource without destroying it, remove it from state with terraform state rm before removing or commenting out the block.

Recommendations:

Always review your plans carefully before applying changes.
If using the ignore_changes directive, document why the change is being ignored, either as code comments or in associated documentation.
Use version control (like Git) for your Terraform configurations, so you can track changes and revert if necessary.
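State locking and state versioning are usually set up on the backend rather than on individual resources. The following is a minimal sketch, assuming an S3 backend with a DynamoDB lock table; the bucket, key, region, and table names are hypothetical, and versioning has to be enabled on the bucket itself.

```hcl
terraform {
  backend "s3" {
    # Hypothetical names: replace with your own bucket, key, and table.
    bucket = "example-terraform-state" # enable versioning on this bucket
    key    = "prod/terraform.tfstate"
    region = "us-east-1"

    # Lock table (needs a string partition key named "LockID").
    dynamodb_table = "example-terraform-locks"

    # Encrypt the state object at rest.
    encrypt = true
  }
}
```

Backend blocks cannot reference variables, so these values are literals. Newer Terraform releases also offer S3-native locking through the use_lockfile argument; check the S3 backend documentation for the version you run.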

Handle Temporary Changes

Terraform allows you to temporarily ignore changes to a resource or specific attributes of a resource. This can be particularly useful in situations where you want to prevent Terraform from managing certain parts of your infrastructure temporarily.

Using the ignore_changes lifecycle directive:

The lifecycle block within a resource configuration supports an ignore_changes directive. With ignore_changes, you can specify one or more attributes to be ignored during Terraform’s plan and apply phases.

Example: To prevent Terraform from managing the tags, instance_type, ami, and user_data of an aws_instance, use:

```
resource "aws_instance" "example" {
  # ... other configuration ...

  lifecycle {
    ignore_changes = [tags, instance_type, ami, user_data]
  }
}
```

To ignore all changes to a resource, set ignore_changes = all.

Attributes commonly listed in ignore_changes for other resource types:

EC2 Instance (aws_instance): ami, instance_type, key_name, user_data, tags

Fargate (aws_ecs_task_definition): family, cpu, memory, execution_role_arn, task_role_arn, requires_compatibilities

Application Load Balancer (aws_lb): name, enable_deletion_protection, internal, load_balancer_type, security_groups, subnet_mapping, enable_cross_zone_load_balancing

VPC (aws_vpc): cidr_block, assign_generated_ipv6_cidr_block, enable_dns_support, enable_dns_hostnames, tags

AMI (aws_ami): name, description, virtualization_type, architecture, root_device_name

In more complex cases, you may need to ignore a nested attribute or block. Use index and dot notation to address it. For instance:

```
lifecycle {
  ignore_changes = [scaling_config[0].desired_size]
}
```

These lists are not exhaustive; they cover a subset of commonly adjusted attributes. Which attributes you ignore depends on your use case.
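A typical case for ignore_changes is a value that another system is expected to change, such as the instance count of an Auto Scaling group. The sketch below assumes an existing launch template and subnets, passed in through the hypothetical variables launch_template_id and subnet_ids; the group name and sizes are illustrative only.

```hcl
variable "launch_template_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_autoscaling_group" "example" {
  name                = "example-asg"
  min_size            = 1
  max_size            = 5
  desired_capacity    = 2 # initial value only
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies adjust desired_capacity at runtime;
    # without this, every plan would try to reset it to 2.
    ignore_changes = [desired_capacity]
  }
}
```

Terraform still sets desired_capacity when the group is first created; it only stops reporting later differences for that attribute.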

Accidentally Modifying Immutable Resources

Scenario: After making changes to your configuration, Terraform wants to recreate a resource because it’s considered immutable (e.g., changing the name of an S3 bucket).

Solution:

Avoid changing properties that force resource recreation unless absolutely necessary.
If you must proceed, back up any data and configurations before applying changes.
Consider using the create_before_destroy lifecycle rule to reduce downtime.

```
resource "aws_s3_bucket" "example" {
  # ... other configuration ...
  lifecycle {
    create_before_destroy = true
  }
}
```
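One thing to watch with create_before_destroy: the replacement is created while the old resource still exists, so the two cannot share a name that must be unique. A sketch of one way around this, assuming a generated name is acceptable; the prefix is hypothetical:

```hcl
resource "aws_s3_bucket" "example" {
  # Terraform appends a unique suffix, so the old and the new
  # bucket can exist side by side during replacement.
  bucket_prefix = "example-logs-"

  lifecycle {
    create_before_destroy = true
  }
}
```

This only orders the create and destroy steps. Objects in the old bucket are not copied to the new one, and destroying a non-empty bucket fails unless it is emptied first, so the data still has to be backed up or migrated separately.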

Overlooking Dependency Management

Scenario: Resources are being created in the wrong order, causing failures due to unsatisfied dependencies.

Solution:

Use depends_on to explicitly define resource dependencies.
Alternatively, reference attributes from one resource in another, implicitly creating a dependency.

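```
resource "aws_instance" "example" {
  # ... other configuration ...

  # Referencing the data source creates an implicit dependency on it.
  user_data = data.template_file.user_config.rendered
}
```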
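The example above shows the implicit form. The explicit form, depends_on, is for dependencies that Terraform cannot see from attribute references. The sketch below is modeled on a common case: an instance whose software needs an IAM policy to be in place at boot. All names, the policy contents, and the ami_id variable are hypothetical.

```hcl
variable "ami_id" {
  type = string
}

resource "aws_iam_role" "app" {
  name = "example-app-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "app_s3" {
  name = "example-app-s3"
  role = aws_iam_role.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::example-bucket/*"
    }]
  })
}

resource "aws_iam_instance_profile" "app" {
  name = "example-app-profile"
  role = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.app.name

  # The instance references the profile, not the policy, so Terraform
  # cannot infer that the policy must exist before the instance boots.
  depends_on = [aws_iam_role_policy.app_s3]
}
```

Prefer implicit references where they exist and keep depends_on for hidden dependencies like this one.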

Hardcoding Secrets in Configurations

Scenario: Sensitive data like API keys and database passwords are hardcoded in the Terraform configuration.

Solution:

Use Terraform variables to parameterize sensitive data.
Use a secrets manager, such as AWS Secrets Manager or HashiCorp Vault, to store and fetch secrets.

```
variable "db_password" {
  description = "The password for the database"
  type        = string
  sensitive   = true
}

resource "aws_db_instance" "example" {
  # ... other configuration ...
  password = var.db_password
}
```
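The secrets manager option can be sketched in the same way. This assumes the password already exists in AWS Secrets Manager as a plain string under the hypothetical name prod/example/db-password:

```hcl
# Look up the current version of an existing secret.
# The secret name is hypothetical.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/example/db-password"
}

resource "aws_db_instance" "example" {
  # ... other configuration ...
  password = data.aws_secretsmanager_secret_version.db.secret_string
}
```

With either approach the password is still written to the Terraform state, so the state itself has to be stored encrypted and access-controlled as sensitive data.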

Not Leveraging Modules

Scenario: Repetitive code across multiple configurations, leading to inconsistency and harder maintenance.

Solution:

Create reusable Terraform modules for common infrastructure patterns.
Version your modules and store them in a module registry or Git repository.

```
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.77.0"
  # ... module inputs ...
}
```
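The version argument only applies to modules installed from a registry. For a module kept in a Git repository, the version is pinned in the source address with a ref instead. A sketch, assuming a hypothetical repository URL, subdirectory, tag, and input variable:

```hcl
module "vpc" {
  # "//vpc" selects a subdirectory of the repository;
  # "?ref=" pins a tag, branch, or commit.
  source = "git::https://example.com/acme/terraform-modules.git//vpc?ref=v1.2.0"

  # Hypothetical input defined by that module.
  cidr_block = "10.0.0.0/16"
}
```

Pinning to a tag or commit rather than a branch keeps plans reproducible when the module repository changes.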

Ignoring Resource Limits

Scenario: Hitting cloud provider resource limits, such as the maximum number of VPCs in an AWS account.

Solution:

Regularly review cloud provider documentation to be aware of resource limits.
Monitor cloud accounts for approaching limits.
Use count and for_each with care: every instance they create counts against your quotas, so a small change to an input can create many resources at once.
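One way to catch a quota problem at plan time is a validation rule on the input that drives for_each. The sketch below assumes the default AWS quota of 5 VPCs per region; the variable name and the limit are assumptions to adjust to your account.

```hcl
variable "vpc_cidrs" {
  description = "Map of VPC name to CIDR block"
  type        = map(string)

  validation {
    # 5 is the default per-region VPC quota; raise it if yours was increased.
    condition     = length(var.vpc_cidrs) <= 5
    error_message = "At most 5 VPCs can be requested under the default per-region quota."
  }
}

resource "aws_vpc" "this" {
  for_each   = var.vpc_cidrs
  cidr_block = each.value

  tags = {
    Name = each.key
  }
}
```

The check only counts VPCs requested by this configuration. VPCs created elsewhere in the same region also consume the quota, so account-level monitoring is still needed.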
