Implementing Automatic Safe Hands-off Deployment in AWS

One of my clients asked me to implement the solution Clare Liguori of AWS described in Automating safe, hands-off deployments. It’s a very interesting and detailed document describing how Amazon deploys code to production with no human interaction. It describes safe continuous delivery at cloud scale that minimizes developer interaction and failure points. When combined with the AWS Well-Architected principles, it shows you the way to build a multi-tenant SaaS product made of multiple services spanning multiple regions and accounts that follows best practices and is easy to maintain and develop. AWS provides the principles, but the implementation details vary and depend on the specific product requirements.

In this blog post I will describe how I architected and implemented this solution for one of my clients. They wanted to move their on-premises product to a SaaS offering in the cloud that can scale to millions of transactions a second. A key requirement was being able to easily deploy multiple environments in multiple regions across multiple accounts to accommodate the security pillar, service limits, and scalability.


Avoiding CDK Pipelines Support Stacks

If you’ve ever used CDK Pipelines to deploy stacks cross-region, you’ve probably come across support stacks. CDK automatically creates stacks named <PipelineStackName>-support-<region> that contain a bucket and sometimes a KMS key. CodePipeline uses the buckets in these stacks to replicate artifacts across regions for deployment.

As you add more and more pipelines to your project, these stacks and the buckets they leave behind can get daunting; the buckets don’t use autoDeleteObjects, and the pipeline’s own artifact bucket even has removalPolicy: RemovalPolicy.RETAIN. Since the support stacks are deployed to other regions, it’s also very easy to forget about them when you delete the pipeline stack. Avoiding them is straightforward, but it does take a bit of work and understanding.

CodePipeline documentation covers the basic steps, but there are a couple more for CDK Pipelines.

One-time Setup

  1. Create a bucket for each region where stacks are deployed.
  2. Set bucket policy to allow other accounts to read it.
  3. Create a KMS key for each region (might be optional if not using cross-account deployment).
  4. Set key policy to allow other accounts to decrypt using it.

Here is sample Python code:

try:
    import aws_cdk.core as core  # CDK 1
except ImportError:
    import aws_cdk as core  # CDK 2
from aws_cdk import aws_iam as iam
from aws_cdk import aws_kms as kms
from aws_cdk import aws_s3 as s3

app = core.App()
for region in ["us-east-1", "us-west-1", "eu-west-1"]:
    artifact_stack = core.Stack(
        app,
        f"common-pipeline-support-{region}",
        env=core.Environment(
            account="123456789",
            region=region,
        ),
    )
    key = kms.Key(
        artifact_stack,
        "Replication Key",
        removal_policy=core.RemovalPolicy.DESTROY,
    )
    key_alias = kms.Alias(
        artifact_stack,
        "Replication Key Alias",
        alias_name=core.PhysicalName.GENERATE_IF_NEEDED,  # helps using the object directly
        target_key=key,
        removal_policy=core.RemovalPolicy.DESTROY,
    )
    bucket = s3.Bucket(
        artifact_stack,
        "Replication Bucket",
        bucket_name=core.PhysicalName.GENERATE_IF_NEEDED,  # helps using the object directly
        encryption_key=key_alias,
        auto_delete_objects=True,
        removal_policy=core.RemovalPolicy.DESTROY,
    )

    for target_account in ["22222222222", "33333333333"]:
        bucket.grant_read(iam.AccountPrincipal(target_account))
        key.grant_decrypt(iam.AccountPrincipal(target_account))

app.synth()
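
These support stacks only need to be deployed once, and every pipeline in every project can then reuse the same buckets and keys. With the stack names above, something like cdk deploy "common-pipeline-support-*" should deploy all regions in one go.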

CDK Pipeline Setup

  1. Create a codepipeline.Pipeline object:
    • If you’re deploying stacks cross-account, set crossAccountKeys: true for the pipeline.
  2. Pass the Pipeline object as CDK CodePipeline’s codePipeline argument.

Here is sample Python code:

try:
    import aws_cdk.core as core  # CDK 1
except ImportError:
    import aws_cdk as core  # CDK 2
from aws_cdk import aws_codepipeline as codepipeline
from aws_cdk import aws_kms as kms
from aws_cdk import aws_s3 as s3
from aws_cdk import pipelines

app = core.App()
pipeline_stack = core.Stack(app, "pipeline-stack")
pipeline = codepipeline.Pipeline(
    pipeline_stack,
    "Pipeline",
    cross_region_replication_buckets={
        region: s3.Bucket.from_bucket_attributes(
            pipeline_stack,
            f"Bucket {region}",
            bucket_name="insert bucket name here",
            encryption_key=kms.Key.from_key_arn(
                pipeline_stack,
                f"Key {region}",
                key_arn="insert key arn here",
            )
        )
        for region in ["us-east-1", "us-west-1", "eu-west-1"]
    },
    cross_account_keys=True,
    restart_execution_on_update=True,
)
cdk_pipeline = pipelines.CodePipeline(
    pipeline_stack,
    "CDK Pipeline",
    code_pipeline=pipeline,
    # ... other settings here ...
)

app.synth()

Tying it Together

The missing piece from the pipeline code above is how it gets the bucket and key names. That depends on how your code is laid out. If everything is in one project, you can create the support stacks in that same project and access the objects in them. That’s what PhysicalName.GENERATE_IF_NEEDED is for.

If the project that creates the buckets is separate from the pipeline project, or if there are many different pipeline projects, you can write the bucket and key names to a central location. For example, they can be written to SSM parameters, as sketched below. Or, if your project is small enough, you can even hardcode them.
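
Here is a minimal sketch of the SSM approach, building on the samples above. The parameter names under /common-pipeline/ are just a hypothetical convention, not something CDK requires:

from aws_cdk import aws_ssm as ssm
import boto3

# In the support stack (inside the per-region loop above): publish the generated names.
ssm.StringParameter(
    artifact_stack,
    "Replication Bucket Name",
    parameter_name=f"/common-pipeline/{region}/bucket-name",  # hypothetical convention
    string_value=bucket.bucket_name,
)
ssm.StringParameter(
    artifact_stack,
    "Replication Key Arn",
    parameter_name=f"/common-pipeline/{region}/key-arn",
    string_value=key.key_arn,
)


# In the pipeline project: read the names back at synth time with boto3.
def get_replication_names(region):
    ssm_client = boto3.client("ssm", region_name=region)
    bucket_name = ssm_client.get_parameter(
        Name=f"/common-pipeline/{region}/bucket-name")["Parameter"]["Value"]
    key_arn = ssm_client.get_parameter(
        Name=f"/common-pipeline/{region}/key-arn")["Parameter"]["Value"]
    return bucket_name, key_arn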

Another option to try out is cdk-remote-stack that lets you easily “import” values from the support stacks you created even though they are in a different region.

Conclusion

CDK makes life easy by creating CodePipeline replication buckets for you using support stacks. But sometimes it’s better to do things yourself to get a less cluttered CloudFormation and S3 resource list. Avoid the mess by creating the replication buckets yourself and reusing them with every pipeline.

Sanitized RDS Snapshots

Testing on production data is very useful for rooting out real-life bugs, taking user behavior into account, and measuring the real performance of your system. But testing on production databases is dangerous. You don’t want the extra load and you don’t want the potential for data loss. So you make a copy of your production database, and before you know it two years have passed, the data is stale, and the schema has been manually modified beyond recognition. This is why I created RDS-sanitized-snapshots. It periodically takes a snapshot, sanitizes it to remove data the developers shouldn’t access, like credit card numbers, and then optionally shares it with other AWS accounts.

As usual it’s one CloudFormation template that can be deployed in one step. The template is generated using Python and troposphere.

There are many examples around the web that do parts of this. I wanted to create a complete solution that doesn’t require managing access keys and can be used without any servers. Since all of the operations take a long time and Lambda has a 15-minute time limit, I decided it was time to play with Step Functions. Step Functions let you create a state machine that can execute Lambda functions and Fargate tasks for each step. Retry and wait logic is also built in, so there is no need for long-running Lambda functions or EC2 instances. It even shows you the state in a nice graph.

To create a sanitized snapshot we need to:

  1. Create a temporary copy of the production database so we don’t affect the actual data or the performance of the production system. We do this by taking a snapshot of the production database or finding the latest available snapshot and creating a temporary database from that.
  2. Run configured SQL queries to sanitize the temporary database. This can delete passwords, remove PII, etc. Since database operations can take a long time, we can’t do this in Lambda due to its 15-minute limit. So instead we create a Fargate task that connects to the temporary database and executes the queries.
  3. Take a snapshot of the temporary database after it has been sanitized. Since this process is meant to be executed periodically, the snapshot name needs to be unique.
  4. Share the snapshot with QA and development accounts (see the sketch after this list).
  5. Clean up temporary snapshots and databases.
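
Sharing a manual snapshot with other accounts is a single API call. A minimal sketch with boto3 (the snapshot identifier and account IDs are placeholders):

import boto3

rds = boto3.client("rds")

# Sharing works by adding account IDs to the snapshot's "restore" attribute.
rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="sanitized-snapshot-2024-01-01",  # placeholder name
    AttributeName="restore",
    ValuesToAdd=["22222222222", "33333333333"],  # QA and development accounts
)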

If the database is encrypted, we might also need to re-encrypt it with a key that can be shared with the other accounts. For that purpose there is a KMS key ID option that adds another step of copying the snapshot over with a new key. The only way to change the key of an existing database or snapshot is to copy the snapshot to a new one. Sharing the key itself is not covered by this solution.
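
The re-encryption step boils down to a snapshot copy with a different key. A rough sketch with boto3 (identifiers and key ARN are placeholders):

import boto3

rds = boto3.client("rds")

# Copy the sanitized snapshot, re-encrypting it with a key the other accounts can use.
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="sanitized-snapshot-2024-01-01",  # placeholder name
    TargetDBSnapshotIdentifier="sanitized-snapshot-2024-01-01-shared",
    KmsKeyId="arn:aws:kms:us-east-1:123456789:key/insert-key-id-here",  # shareable key
)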

The step function handles all the waiting by calling the Lambda handler to check if the current step is ready. If it is, we can move on to the next step. If it’s not, the handler throws a specific NotReady exception and the step function retries in 60 seconds. The default retry parameters are a maximum of 3 attempts, with each wait twice as long as the previous one. Since this is an expected condition rather than a real failure, we increase the number of retries and remove the backoff logic that doubles the waiting time.

{
  "States": {
    "WaitForSnapshot": {
      "Type": "Task",
      "Resource": "${HandlerFunction.Arn}",
      "Parameters": {
        "state_name": "WaitForSnapshot",
      },
      "Next": "CreateTempDatabase",
      "Retry": [
        {
          "ErrorEquals": [
            "NotReady"
          ],
          "IntervalSeconds": 60,
          "MaxAttempts": 300,
          "BackoffRate": 1
        }
      ]
    }
  }
}
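
On the Lambda side, the handler for a wait step only needs to check whether the resource is ready and raise the exception if it isn’t. A minimal sketch with boto3 (the snapshot identifier in the input is an assumption, not the project’s actual interface):

import boto3

rds = boto3.client("rds")


class NotReady(Exception):
    # The exception class name is what the state machine matches in ErrorEquals.
    pass


def handler(event, context):
    if event["state_name"] == "WaitForSnapshot":
        snapshots = rds.describe_db_snapshots(
            DBSnapshotIdentifier=event["snapshot_id"],  # assumed to be passed in the input
        )["DBSnapshots"]
        if not snapshots or snapshots[0]["Status"] != "available":
            raise NotReady("snapshot is still being created")
    return event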

One complication with RDS is networking. Since databases are not accessed through the AWS API (and the RDS Data API only supports Aurora), the Fargate task needs to run in the same network as the temporary database. We could theoretically create the temporary database in the same VPC, subnet, and security group as the production database, but that would require modifying the production database’s security group and could pose a security or data-loss risk. It’s better to keep the temporary and production databases separate to avoid even the remote possibility of something going wrong by accident.

Another oddity I’ve learned from this is that Fargate tasks with no route to the internet can’t pull Docker images from Docker Hub. I would have expected image pulling to be separate from the execution of the task itself, like it is with AWS Batch, but that’s not the case. This is why the Fargate task is created with a public-facing IP. I tried using the Amazon Linux Docker image from ECR, but even that requires an internet route or a VPC endpoint.

All the source code is available on GitHub. You can open an issue or comment here if you have any questions.

CloudWatch2S3

AWS CloudWatch Logs is a very useful service, but it does have its limitations. Retaining logs for a long period of time can get expensive. Logs cannot be easily searched across multiple streams. Logs are hard to export, and integration requires AWS-specific code. Sometimes it makes more sense to store logs as text files in S3, but that’s not always an option: some AWS services, like Lambda, write logs directly to CloudWatch Logs.

One option to get around CloudWatch Logs limitations is exporting logs to S3 where data can be stored and processed longer-term for a cheaper price. Logs can be exported one-time or automatically as they come in. Setting up an automatic pipeline to export the logs is not a one-click process, but luckily Amazon detailed all the steps in a recent blog post titled Stream Amazon CloudWatch Logs to a Centralized Account for Audit and Analysis.

Amazon’s blog post has a lot of great information about the topic and the solution. In short, they create a Kinesis Stream writing to S3. CloudWatch Logs subscriptions to export logs to the new stream are created either manually with a script or in response to CloudTrail events about new log streams. This architecture is stable and scalable, but the implementation has a few drawbacks:

  • Writes compressed CloudWatch JSON files to S3.
  • Setup is still a little manual, requiring you to create a bucket, edit permissions, modify and upload source code, and run a script to initialize.
  • Requires CloudTrail.
  • Configuration requires editing source code.
  • Has a minor bug limiting initial subscription to 50 log streams.

That is why I created CloudWatch2S3 – a single CloudFormation template that sets everything up in one go while still leaving room for tweaking with parameters.

The architecture is mostly the same as Amazon’s but adds a subscription timer to remove the hard requirement on CloudTrail, and post-processing to optionally write raw log files to S3 instead of compressed CloudWatch JSON files.
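
For reference, the records CloudWatch Logs delivers through a subscription are gzipped JSON documents (base64-encoded by the time they reach a Lambda processor) containing a batch of log events. The post-processing step essentially unpacks them into plain text lines. A simplified sketch of that decoding, not the exact code in the template:

import base64
import gzip
import json


def decode_subscription_record(data):
    # Each record is base64-encoded, gzipped JSON as delivered by CloudWatch Logs.
    payload = json.loads(gzip.decompress(base64.b64decode(data)))

    # CloudWatch Logs also sends control messages to verify the destination; skip those.
    if payload["messageType"] != "DATA_MESSAGE":
        return []

    # Turn the structured events into raw text lines, one per log event.
    group = payload["logGroup"]
    stream = payload["logStream"]
    return [
        f"{group} {stream} {event['timestamp']} {event['message']}"
        for event in payload["logEvents"]
    ]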

[Architecture diagram]

Setup is simple. There is just one CloudFormation template and the default parameters should be good for most.

  1. Download the CloudFormation template
  2. Open AWS Console
  3. Go to CloudFormation page
  4. Click “Create stack”
  5. Under “Specify template” choose “Upload a template file”, choose the file downloaded in step 1, and click “Next”
  6. Under “Stack name” choose a name like “CloudWatch2S3”
  7. If you have a high volume of logs, consider increasing Kinesis Shard Count
  8. Review other parameters and click “Next”
  9. Add tags if needed and click “Next”
  10. Check “I acknowledge that AWS CloudFormation might create IAM resources” and click “Create stack”
  11. Wait for the stack to finish
  12. Go to “Outputs” tab and note the bucket where logs will be written
  13. That’s it!

Another feature is the ability to export logs from multiple accounts to the same bucket. To set this up you need to set the AllowedAccounts parameter to a comma-separated list of AWS account identifiers allowed to export logs. Once the stack is created, go to the “Outputs” tab and copy the value of LogDestination. Then simply deploy the CloudWatch2S3-additional-account.template to the other accounts while setting LogDestination to the value previously copied.

For troubleshooting and more technical details, see https://github.com/CloudSnorkel/CloudWatch2S3/blob/master/README.md.

If you are exporting logs to S3 to save money, don’t forget to also change the retention settings in CloudWatch so old logs are automatically purged and your bill actually goes down.