One of my clients asked me to implement the solution Clare Liguori of AWS described in Automating safe, hands-off deployments. It’s a very interesting and detailed document describing how Amazon deploys code to production with no human interaction. It describes safe continuous delivery at cloud scale that minimizes developer interaction and failure points. When combined with AWS Well-Architected principles, it shows you how to build a multi-tenant SaaS product made of multiple services over multiple regions and multiple accounts that follows all best practices and is easy to maintain and develop. AWS provides the principles, but the implementation details vary and depend on the specific product requirements.
In this blog post I will describe how I architected and implemented this solution for one of my clients. They wanted to move their on-premises product to a SaaS offering in the cloud that can scale to millions of transactions a second. A key requirement was being able to easily deploy multiple environments in multiple regions over multiple accounts to accommodate the security pillar, service limits, and scalability.
Prerequisites
Infrastructure as Code (IaC)
Before we dive into the implementation details that enabled deployment automation, we first have to talk about infrastructure as code. A big part of deployment safety is knowing what is currently deployed and being able to automatically update or replicate deployments. The most common use case in any company is the development environment that allows developers to safely test changes as they’re writing their code. You always want the development environment to be as close as possible to the production environment. If your infrastructure is written as code, you can easily spin up multiple environments that work the same way as the production environment. No need to manually go to the AWS console and create security groups, instances, load balancers, etc. Just deploy a copy of your infrastructure using your IaC and you’re done.
IaC is also very important for making safe changes to the infrastructure. Code can, and should, be checked into a version control system where code review is required before any changes are deployed. Pull requests can be used to ensure the code is reviewed and passes some tests. An automated process can lint the code, print a diff to the currently deployed environment, and even run some tests. Using automatically deployed version controlled code also makes auditing changes easy. Changes can be viewed as code diffs, and manual infrastructure changes are uncommon enough that auditing them doesn’t feel like looking for a needle in a haystack.
There are multiple tools to create IaC. This specific client already had CDK (AWS Cloud Development Kit) code, so we continued with that. CDK is really amazing. It provides a much higher level abstraction over CloudFormation that really speeds up your development process and helps you avoid a lot of common CloudFormation pitfalls. Instead of manually writing security group rules, you can simply use:
instance.connections.allow_from(alb, ec2.Port.tcp(8080), "Ingress")
Instead of manually writing IAM policies to allow your Lambda function to read and write to a bucket, you can use:
bucket = s3.Bucket.from_bucket_name(self, "bucket-id", "bucket-name")
bucket.grant_read_write(lambda_function)
It even has support for some higher level constructs like QueueProcessingFargateService.
All this and more was crucial for a multi-tenant, multi-region and multi-account project. It would be impossible to control a dynamic number of deployments without being able to automate them. Everything from VPC subnets down to Lambda configuration is controlled by code.
Pipelines
Next on the list of prerequisites is pipeline infrastructure. To safely automate a hands-off deployment process, we need different stages of testing and deployment. The exact details differ based on project requirements, but in all cases you start by testing the code alone, continue to an integrated environment, then a production-like environment, ideally test on a small subset of your users, and finally track the deployment in production and automatically roll back if needed. These automated steps prevent customer-impacting defects from reaching production and limit the impact of defects on customers if they do reach production. Developers are able to trust that their code will cautiously and safely deploy to production, without the need to actively watch it.
This process requires a pipeline infrastructure that automatically shepherds the code through all the required stages of tests and safety checks. Luckily, CDK provides us with CDK Pipelines which greatly reduces the code required to define and deploy pipelines using CodePipeline and CodeBuild. It takes care of all the low-level details like security groups, least-privilege IAM roles, and converting all generated CloudFormation stacks into steps in the pipeline.
Steps can be divided into stages and waves that can correspond to the different testing stages required. With only a few lines of code, you can create a pipeline that deploys your service to your alpha environment for functional testing, then the beta environment for integration testing, then the gamma environment for production-like testing, before finally deploying to production. Waves can be used to further divide production into multiple steps to pace deployments even further so the blast radius of bugs can be contained. All stages and their corresponding environments are defined as code, so it’s easy to add or remove stages according to the specific needs of your project.
It is crucial to define the pipelines themselves as code as they are part of the infrastructure and require the same safety mentioned above. It is even more important for this use case where the target environment list is dynamic and keeps growing.
Pipelines can be used to deploy infrastructure code, application code or configuration changes through all the required stages.
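To make the stage and wave progression concrete, here is a minimal sketch in plain Python. It is not CDK Pipelines code and not the code from the project. The Stage and RunResult classes, the run_pipeline function, and the environment names are all hypothetical. It only illustrates the control flow described above: deploy a stage, validate it, and stop and roll back that stage if validation fails, so later waves are never touched.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One pipeline stage or wave: environments deployed together, then validated."""
    name: str
    environments: List[str]


@dataclass
class RunResult:
    deployed: List[str] = field(default_factory=list)     # environments that kept the new version
    rolled_back: List[str] = field(default_factory=list)  # environments reverted after a failure
    failed_stage: Optional[str] = None                     # name of the stage that stopped the run


def run_pipeline(stages: List[Stage], validate: Callable[[Stage], bool]) -> RunResult:
    """Walk the stages in order and stop at the first one that fails validation."""
    result = RunResult()
    for stage in stages:
        # Hypothetical stand-in for the real deployment followed by its checks.
        if validate(stage):
            result.deployed.extend(stage.environments)
        else:
            # Only the failing stage is reverted; later stages were never deployed.
            result.rolled_back.extend(stage.environments)
            result.failed_stage = stage.name
            break
    return result


# Hypothetical layout: three pre-production stages, then production split into two waves.
EXAMPLE_STAGES = [
    Stage("alpha", ["alpha"]),
    Stage("beta", ["beta"]),
    Stage("gamma", ["gamma"]),
    Stage("prod-wave-1", ["prod-us-east-1"]),
    Stage("prod-wave-2", ["prod-eu-west-1", "prod-ap-southeast-2"]),
]
```

In this sketch, a failure in prod-wave-1 leaves the environments in prod-wave-2 on the previous version, which is the blast radius containment that waves are meant to provide.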
Architecture
When I joined the project, we had duplicated CDK Pipelines code over multiple microservices. Each pipeline was deployed differently, coded differently, and supported different features. Developers were unsure how to deploy to multiple environments and so only one alpha environment existed. There was no production environment yet or a known set of steps that would lead to a production environment.
It was clear that a new system was needed to handle all this. The new DevOps system was required to allow the team to:
- Configure all the different environments for all the different stages, accounts and regions.
- Configure which tests and safety checks need to run at which stages.
- Easily add services to the product and deploy pipelines for them.
- Provide metrics on everything (failure rates, build times, MTTR, and others).
- Simplify pipeline coding and encourage pipeline uniformity.
I therefore suggested the following 3-piece solution:
- A configuration service that holds system state for all accounts, stages, services, checks, etc.
- A library that creates pipelines based on basic input and the configuration service.
- CLI that lets developers easily interact with the system.
Configuration Service
To accommodate a dynamic number of environments, ever-evolving deployment validation tests, and code spread out over multiple git repos, I chose to create a central configuration service that holds the information on what gets deployed where and how it’s tested. To keep it simple and easy to maintain, I created it with Django. It only needs to scale with the pipelines and not with end users, so even a small instance is enough to host it. Any other framework would do just fine too. It runs on Fargate, as that was the client’s service of choice.
This service mainly provides:
- An admin dashboard that allows the DevOps team to configure all the different AWS accounts, regions, and which environment goes where. With a proper set of models in place, Django’s admin site made this very easy.
- Some developer dashboards that list:
  - Status for all pipelines in one easy view.
  - Detailed status and links for stage validation tests.
  - Overview of all environments to make it easy to find where services are running.
- APIs for developers to:
  - Add test environments.
  - List details about pipelines.
  - List details about environments.
- APIs for pipelines to:
  - Query where services need to be deployed.
  - Update pipeline and environment status.
  - Start stage validation tests.
- Reporting:
  - Broken pipeline notification.
  - CloudWatch metrics for build times, failure rates, etc.
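As an illustration of the kind of data such a service holds, here is a framework-free sketch using dataclasses instead of Django models. The class names, fields, stage names, and the targets_for method are assumptions made for this example, not the actual schema. It shows how one query can answer the pipeline's question of where a service needs to be deployed, in stage order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Assumed stage ordering, from earliest testing stage to production.
STAGE_ORDER: Dict[str, int] = {"alpha": 0, "beta": 1, "gamma": 2, "prod": 3}


@dataclass(frozen=True)
class Environment:
    name: str        # hypothetical environment name, e.g. "prod-eu"
    stage: str       # one of the keys in STAGE_ORDER
    account_id: str  # AWS account hosting the environment
    region: str      # AWS region of the environment


@dataclass(frozen=True)
class Deployment:
    service: str      # service name, here assumed to match its repo name
    environment: str  # Environment.name this service is deployed to


class ConfigStore:
    """In-memory stand-in for the configuration service's database."""

    def __init__(self, environments: Iterable[Environment], deployments: Iterable[Deployment]):
        self._environments = {env.name: env for env in environments}
        self._deployments = list(deployments)

    def targets_for(self, service: str) -> List[Environment]:
        """Return the environments a service deploys to, earliest stage first."""
        envs = [
            self._environments[d.environment]
            for d in self._deployments
            if d.service == service
        ]
        return sorted(envs, key=lambda env: (STAGE_ORDER[env.stage], env.name))
```

A pipeline for a hypothetical "billing" service would call targets_for("billing") and get back account and region details for each environment, without the service's own repo hard-coding any of them.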
Considering the entire project spanned multiple accounts, an important aspect of the service was to gather all this information in one place. If a CloudFormation stack deployment failed, developers didn’t need to switch AWS accounts and start looking. There was a simple link in the status page that pulled that information for you. This also allowed us to completely block access to production accounts while still providing some information about them.
Common Library
Using the data from the configuration service, we need common code to create the pipeline so that it queries the right data and deploys the service to the right places. The library uses CDK Pipelines to create the proper pipelines with all the required steps and stages with data from the configuration service.
The library is where most of the deployment logic is defined. It needs to be able to set up the right stages with the right tests and safety checks. This includes deploying or reusing multiple lambdas and step functions as steps in the pipeline that execute the different tests. For example, we had a step function that would call the configuration service to ask for required tests after each stage and execute them. If they all passed, it let the pipeline continue. If any failed, it would stop the pipeline.
The easier it is to use the library, the more developers will be excited to use it and the easier it gets to maintain pipeline uniformity. That’s why defining a pipeline was a two-line process with as few parameters as possible. You only needed to pass the GitHub repo name so the pipeline would know where to pull the code from. That name was also used to query all the required configuration from the configuration service.
This library is also a good place to put some common constructs that developers should use. Common constructs can help developers easily adhere to company policy like never having public buckets, required tagging, or baseline auto-scaling settings.
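The gate logic of that step function can be sketched in plain Python. This is a simplified illustration, not the actual state machine. The run_stage_validation function, the required_tests mapping (standing in for the configuration service's answer), and the runners mapping are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def run_stage_validation(
    stage: str,
    required_tests: Dict[str, List[str]],
    runners: Dict[str, Callable[[], bool]],
) -> Tuple[bool, List[str]]:
    """Run every test required after `stage` and report whether the pipeline may continue.

    required_tests stands in for the configuration service's response;
    runners maps a test name to a callable that returns True on success.
    """
    failures: List[str] = []
    for test_name in required_tests.get(stage, []):
        runner = runners.get(test_name)
        try:
            # A test with no registered runner is treated as a failure, not skipped.
            passed = runner() if runner is not None else False
        except Exception:
            # A crashing test blocks the pipeline just like a failing one.
            passed = False
        if not passed:
            failures.append(test_name)
    # Every required test runs even after a failure, so all problems are reported at once.
    return (not failures, failures)
```

Running all tests instead of stopping at the first failure is a design choice in this sketch. It costs some time on a failed stage but gives a status page the full list of failing checks.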
Developer CLI
To make life easy for developers and to encourage them to use the systems we have in place just the right way, I created a CLI that uses the configuration service. It should cover a lot of the daily tasks developers execute, like:
- Set up AWS profiles with access to all relevant accounts. This is especially important for a system with so many accounts. You don’t want to be copying credentials from Control Tower all day long. The CLI both sets up the credentials on the fly when needed and sets up SSO profiles that can be used with awscli and SDKs. My favorite part of this feature was the devops-cli aws <ENV NAME> <COMMAND like s3 ls> option that allowed you to run awscli preconfigured with the right profile for the right account and region based on the environment name.
- Deploy pipelines and/or microservices directly. It takes a very specific set of environment variables and CDK parameters to deploy exactly what you need. Instead of educating developers, it’s easier to just give them a devops-cli deploy command. The CLI always prints out what commands it runs internally, so developers can learn while using it.
- Compare the currently deployed stacks to the ones generated by local code using cdk diff. This is a critical tool that helps developers get comfortable with their changes by letting them know exactly how their stacks will change.
- List resources of certain stacks in certain environments. It can get hard to find the right bucket or log group when there are hundreds and CloudFormation randomizes parts of the name. Help developers find resources by the criteria they need instead of reading through multiple pages on the console.
- Create new environment either for testing or production usage. An environment has to be defined in the configuration service and some other 3rd party services. Help developers always do it right with just one command instead of forcing them to follow multiple steps described in a document that rarely stays up-to-date.
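The aws subcommand from the list above could be sketched with argparse as follows. This is an illustration only. The ENVIRONMENTS mapping is hypothetical (a real CLI would fetch it from the configuration service), the profile names are made up, and the sketch prints and returns the awscli command instead of executing it.

```python
import argparse
import shlex
from typing import Dict, List, Optional

# Hypothetical data; the real CLI would query the configuration service for this.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "alpha": {"profile": "alpha-dev", "region": "us-east-1"},
    "prod-eu": {"profile": "prod-readonly", "region": "eu-west-1"},
}


def build_aws_command(env_name: str, aws_args: List[str]) -> List[str]:
    """Prefix an awscli invocation with the profile and region of an environment."""
    env = ENVIRONMENTS[env_name]
    return ["aws", "--profile", env["profile"], "--region", env["region"], *aws_args]


def main(argv: Optional[List[str]] = None) -> List[str]:
    parser = argparse.ArgumentParser(prog="devops-cli")
    subparsers = parser.add_subparsers(dest="command", required=True)

    aws = subparsers.add_parser("aws", help="run awscli against a named environment")
    aws.add_argument("env", choices=sorted(ENVIRONMENTS), help="environment name")
    # Everything after the environment name is passed through to awscli untouched.
    aws.add_argument("aws_args", nargs=argparse.REMAINDER, help="awscli arguments")

    args = parser.parse_args(argv)
    command = build_aws_command(args.env, args.aws_args)
    # Print the underlying command so developers can learn what the CLI does for them.
    print("Running:", shlex.join(command))
    # A real tool would now hand `command` to subprocess.run; this sketch only returns it.
    return command
```

Calling main(["aws", "alpha", "s3", "ls"]) builds an aws s3 ls invocation with the alpha environment's profile and region already filled in.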
Conclusion
By creating a central deployment configuration service, giving developers tools to interact with it, and wrapping pipeline generation code in a common library, we were able to enable development and safe deployment of a well-architected microservice architecture project through multiple automated testing stages over multiple AWS regions and accounts.
If you’re interested in learning more about any part of this system, leave a comment. If you need help building your own automated deployment system and/or IaC, contact me at amir@cloudsnorkel.com or on the contact page.