Preventing A Buildup of CloudFormation Orphaned Resources

AWS recently announced new functionality for CloudFormation that allows you to “force delete” a stack.

Put simply, when a stack fails to delete because some of its resources end up in the DELETE_FAILED state, you can retry the delete operation with force deletion enabled: CloudFormation then deletes the stack and retains the resources that are in the DELETE_FAILED state.

Note that the resources that failed to delete remain present in your AWS account even after the stack is force deleted.
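As a rough sketch of what this looks like from code, the function below only requests a force delete when the stack is already stuck in DELETE_FAILED. It assumes `cfn` is a boto3 CloudFormation client (or anything with the same `describe_stacks` and `delete_stack` methods); the function name is my own.

```python
def force_delete_if_stuck(cfn, stack_name):
    """Force delete a stack only if a normal delete has already failed.

    `cfn` is assumed to be a boto3 CloudFormation client (or a stand-in
    exposing the same methods). Returns True if a force delete was requested.
    """
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    if stack["StackStatus"] != "DELETE_FAILED":
        # Force deletion only applies to stacks stuck in DELETE_FAILED.
        return False
    # Retries the delete and retains whatever still cannot be removed.
    cfn.delete_stack(StackName=stack_name, DeletionMode="FORCE_DELETE_STACK")
    return True
```

With boto3 installed, this could be called as `force_delete_if_stuck(boto3.client("cloudformation"), "my-stack")`.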

This is not a silver bullet to keep our CloudFormation stack lists pristine. In this blog post I explain why, and what you can do on top of this feature to strive towards a clean AWS account.

The Problem with CloudFormation

Forget about dogs 🐶, if you’re building anything in AWS then CloudFormation is your best friend.

One pet peeve we all have about the service is that there is no guarantee that a CRUD operation will succeed. Issues arise such as:

Stacks go into a CREATE_FAILED state because the IAM entity does not have the required permissions to create all resources.
An existing stack cannot be updated or torn down because the IAM entity lacks update/delete permissions.
Stacks go into a DELETE_FAILED state because certain resources can be prevented from deletion (e.g. S3 buckets containing objects).

This causes automation headaches for engineers who use short-lived CloudFormation stacks, which is very common in microservice and/or serverless architectures. Perhaps developers are deploying a whole stack for the feature they are working on. Or maybe the CI/CD pipeline is deploying a stack in order to execute integration or remocal tests as part of the test suite.

When a CloudFormation stack goes into the CREATE_FAILED or DELETE_FAILED state, the stack is stuck there in perpetuity unless either an engineer manually deletes the stack or the automation is programmed to handle such errors. For the deletion error case, this code would need to know to (if applicable):

Delete all objects (and object versions) from an S3 bucket (this is actually quite complex to achieve in code).
Delete all triggers from a Lambda function (how would it even find them?).
Delete all other networking entities using a particular subnet/VPC (again, how would it even find them?).

Then re-trigger the stack deletion. And this was only for the three AWS resources I know off the top of my head that can fail to delete.
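To give a feel for the S3 case, here is a minimal sketch of emptying a bucket, including old versions and delete markers. It assumes `s3` is a boto3 S3 client; the function names are mine, and real code would also need to inspect the `Errors` returned by `delete_objects`, which is left out here.

```python
def version_batches(pages, batch_size=1000):
    """Yield lists of {"Key", "VersionId"} dicts, at most batch_size long.

    `pages` are list_object_versions response pages. Both object versions
    and delete markers have to go before a bucket can be deleted.
    """
    batch = []
    for page in pages:
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            batch.append({"Key": entry["Key"], "VersionId": entry["VersionId"]})
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def empty_bucket(s3, bucket):
    """Send delete requests for every version; returns how many were sent."""
    pages = s3.get_paginator("list_object_versions").paginate(Bucket=bucket)
    sent = 0
    for batch in version_batches(pages):
        # delete_objects accepts at most 1000 keys per call.
        s3.delete_objects(Bucket=bucket, Delete={"Objects": batch, "Quiet": True})
        sent += len(batch)
    return sent
```

Even this simple version has to deal with pagination, batching limits and versioning, and that is for just one resource type.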

What CloudFormation Force Deletion Solves

The “force delete” feature does somewhat help our AWS account administrators. It can ensure that CloudFormation stacks will successfully delete, regardless of failure (unless it is IAM-related).

This is useful if your CloudFormation stack naming conventions are hardcoded. If your CI/CD deploys stack my-app-stack-branch-foo to test against code from branch foo, it may fail to delete that stack if it involved putting an object in S3. When CI/CD is triggered for that branch again in the future, you will not be able to deploy my-app-stack-branch-foo again if it already exists (and is stuck in a DELETE_FAILED state).

If you have “force deleted” my-app-stack-branch-foo on the initial run, then the second run will have no problems.

What CloudFormation Force Deletion Doesn’t Solve

The issue is that the resources themselves that failed to delete are still in your account, contributing to your AWS costs and service quotas, but are now orphaned.

In order to have a guarantee of stack and resource CRUD, you need to handle the resources in the DELETE_FAILED state too. Unfortunately, there is no “force delete” option for resources.
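One way to see what is about to be orphaned is to list a stack's resources that are in DELETE_FAILED before you force delete it. This is only a sketch, assuming `cfn` is a boto3 CloudFormation client; the function name and the shape of the returned dicts are my own choices.

```python
def failed_deletions(cfn, stack_name):
    """Return the resources of a stack that are currently in DELETE_FAILED."""
    paginator = cfn.get_paginator("list_stack_resources")
    failed = []
    for page in paginator.paginate(StackName=stack_name):
        for res in page["StackResourceSummaries"]:
            if res["ResourceStatus"] == "DELETE_FAILED":
                failed.append({
                    "logical_id": res["LogicalResourceId"],
                    # May be missing for resources that were never created.
                    "physical_id": res.get("PhysicalResourceId"),
                    "type": res["ResourceType"],
                })
    return failed
```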

Here is how you can achieve this:

Step 1: Capture `DELETE_FAILED` Events

CloudFormation emits an event every time a stack or resource changes state. That includes resources entering the DELETE_FAILED state when the first delete attempt fails, before the stack is “force deleted”. This lets us set up an EventBridge rule that listens for resources going into the DELETE_FAILED state.

This is the event pattern that achieves this:

{
  "source": ["aws.cloudformation"],
  "detail-type": ["CloudFormation Resource Status Change"],
  "detail": {
    "status-details": {
      "status": ["DELETE_FAILED"]
    }
  }
}
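If you would rather create the rule from code than from a template, a sketch with boto3 could look like the following. It assumes `events` is a boto3 EventBridge client; the rule name is just a placeholder.

```python
import json

# The same event pattern, expressed as a Python dict.
DELETE_FAILED_PATTERN = {
    "source": ["aws.cloudformation"],
    "detail-type": ["CloudFormation Resource Status Change"],
    "detail": {"status-details": {"status": ["DELETE_FAILED"]}},
}


def create_delete_failed_rule(events, rule_name="cfn-resource-delete-failed"):
    """Create (or update) the rule on the default event bus; returns its ARN."""
    response = events.put_rule(
        Name=rule_name,  # placeholder name, pick your own
        EventPattern=json.dumps(DELETE_FAILED_PATTERN),
        State="ENABLED",
    )
    return response["RuleArn"]
```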

Step 2: Store These Events to be Addressed

These events do not need to be handled in real time, and they benefit from retries and dead-letter queues (DLQs), so writing them to an SQS queue is a good fit.

Save all events that meet the above event rule filter to an SQS queue.
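Here is a hedged sketch of that wiring with boto3, assuming the queue, its DLQ and the rule already exist. `events` and `sqs` are assumed to be boto3 clients, and the function names, target ID and `max_receive_count` default are placeholders of mine. The queue policy matters: without it EventBridge is not allowed to send messages to the queue.

```python
import json


def redrive_policy(dlq_arn, max_receive_count=5):
    """Move messages to the DLQ after max_receive_count failed receives."""
    return json.dumps({"deadLetterTargetArn": dlq_arn,
                       "maxReceiveCount": max_receive_count})


def allow_rule_policy(queue_arn, rule_arn):
    """Queue policy letting this one EventBridge rule send messages."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "events.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": rule_arn}},
        }],
    })


def attach_queue(events, sqs, rule_name, rule_arn, queue_url, queue_arn, dlq_arn):
    """Configure the queue and register it as the rule's target."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={
            "RedrivePolicy": redrive_policy(dlq_arn),
            "Policy": allow_rule_policy(queue_arn, rule_arn),
        },
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "delete-failed-queue", "Arn": queue_arn}],
    )
```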

Step 3: Address These Events

I mentioned above that your code deleting CloudFormation stacks may need complex error-handling logic in order to deal with S3 buckets, Lambda functions and networking entities on a case-by-case basis.

One nice pattern would be attempting to address failed resource deletions programmatically here in step 3, especially if this is a large AWS account where manual intervention won't scale well.

I would recommend writing a Step Functions state machine (triggered off the SQS queue in step 2 above) that attempts to handle and delete resources of expected types. It isn’t expected to run hundreds of times a day, so the overhead cost of Step Functions isn’t a big factor. If the state machine cannot deal with the resource, a message should be placed in a queue or database destined for manual intervention. This queue or database should have alerts set up for a human being to react to new messages/items.

The Step Functions state machine should deal with known or expected cases of failed resource deletions.
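The routing decision at the start of such a state machine can be sketched in a few lines of Python. It assumes each SQS message body is the EventBridge event as JSON (the default when no input transformer is configured). The branch names and the choice of resource types are hypothetical, picked to match the three cases mentioned earlier.

```python
import json

# Hypothetical branch names for resource types we know how to clean up.
KNOWN_BRANCHES = {
    "AWS::S3::Bucket": "EmptyAndDeleteBucket",
    "AWS::Lambda::Function": "RemoveTriggersAndDeleteFunction",
    "AWS::EC2::Subnet": "DetachDependenciesAndDeleteSubnet",
}


def route(message_body):
    """Decide which branch should handle a DELETE_FAILED resource event."""
    detail = json.loads(message_body)["detail"]
    resource_type = detail["resource-type"]
    return {
        # Anything we do not recognise goes to a human.
        "branch": KNOWN_BRANCHES.get(resource_type, "ManualIntervention"),
        "resource_type": resource_type,
        "physical_id": detail.get("physical-resource-id"),
        "stack_id": detail["stack-id"],
        "reason": detail.get("status-details", {}).get("status-reason", ""),
    }
```

Keeping the mapping in one place means supporting a new resource type is a matter of adding an entry and a branch, while everything else still falls through to manual intervention.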

Conclusion

If you follow the above steps and pair them with CloudFormation “force deletions”, then you can expect your AWS account to stay largely free of orphaned AWS resources.

There is still some manual effort required. AWS has over 100 services, each containing resources with their own deletion rules.

But the CloudFormation “force delete” feature is a step in the right direction. Kudos to the CloudFormation team! 💪

Interested In More?

Connect with me on LinkedIn or Twitter. You are also welcome to join the #BelieveInServerless Discord Community.