Deployment & Production Environment Management

Russell Spitler
7 min read · Sep 23, 2020

This post is a continuation of the series started here —

The ability to incrementally evolve your production environment by allowing your developers to simply check in code opens up an incredible opportunity for velocity, but it needs appropriate controls to ensure stability, maintainability, and security. In no particular order, the following strategies and philosophies were part of what made this work for us.

Service Identification

Because the creation of our git repos, build pipelines, monitoring, and production infrastructure was automated, each microservice needed an auto-generated canonical identifier. This was accomplished through the use of two attributes: ‘application’ and ‘role.’

Application — this identifier was used to roughly scope a family of microservices. While there are many models for grouping microservices, the intention is ultimately to set a nominal bound for side effects: the more likely an update to one service is to affect the behavior of another, the more likely the two belong in the same application. Our back-of-the-napkin test was whether the microservices accessed or operated on the same dataset.

Role — this identifier is unique per application and was intended to provide a functional label for the actual microservice, simplifying identification of the resources associated with it. Throughout the code repositories, build pipelines, monitoring systems, and the AWS UI, our systems use the auto-generated naming convention <environment>-<application>-<role>. Failure to use a functional and unique label for a role causes headaches.

Consistent use of these attributes to generate names provides sanity when reviewing infrastructure and debugging issues. In addition, it allows for service discovery of shared resources (such as datastores, queues, or even REST APIs).
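To make this concrete, here is a minimal sketch (not our actual tooling) of how such a naming helper can double as a lightweight service-discovery mechanism; the function names and the SQS lookup are illustrative assumptions.

```python
# A minimal sketch of the <environment>-<application>-<role> convention doubling as
# service discovery. The helper names and the SQS lookup are illustrative assumptions.
import boto3

def canonical_name(environment: str, application: str, role: str) -> str:
    """Build the auto-generated identifier used across repos, pipelines, and AWS."""
    return f"{environment}-{application}-{role}"

def find_queue_url(environment: str, application: str, role: str) -> str:
    """Discover a shared SQS queue purely from the naming convention."""
    sqs = boto3.client("sqs")
    name = canonical_name(environment, application, role)
    return sqs.get_queue_url(QueueName=name)["QueueUrl"]

# e.g. canonical_name("prod", "ingest", "parser") -> "prod-ingest-parser"
```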

Environment

In our world there are two environments — CI and Prod — and, as the names suggest, the first is for continuous integration testing and the second for our actual production workloads. This attribute was not definitional for a service (as in, there was not one service for CI and one for Prod) but rather a build and runtime identifier. The idealized goal we were trying to achieve was programmatically identical CI and production environments. In reality this is near impossible, as the load in production is often infeasible to recreate in CI because of cost or data privacy concerns. In building our infrastructure there were facilities for providing deployment parameters specific to each of these environments. For the most part we used these parameters to specify auto-scale limits, which allowed our CI environment to be the ‘low volume’ version of our production environment (for example, a batch job limited to 1 instance at a time as opposed to the 1000 specified for production). This is particularly useful when facing service issues in production, as CI can easily be scaled up to mirror production.
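As a hedged illustration of what those environment-specific deployment parameters might look like, the following sketch keeps the service definition identical across environments and varies only the scaling limits; the parameter names and most values are assumptions (the 1-versus-1000 batch limit comes from the example above).

```python
# A sketch of per-environment deployment parameters; keys and most values are
# illustrative, not our actual configuration.
DEPLOY_PARAMS = {
    "ci": {
        "batch_max_instances": 1,     # the 'low volume' mirror of production
        "api_min_instances": 1,       # assumed values
        "api_max_instances": 2,
    },
    "prod": {
        "batch_max_instances": 1000,  # the production limit from the example above
        "api_min_instances": 4,       # assumed values
        "api_max_instances": 64,
    },
}

def autoscale_limits(environment: str) -> dict:
    """Everything else about the service definition stays identical across environments."""
    return DEPLOY_PARAMS[environment]
```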

Development

In our world CI is used for regression and compatibility testing. It is infeasible to have a local development environment that fully represents the interactions in production once you have more than a handful of services. We focused our efforts on regression testing and the abstraction of each service, as opposed to attempting to let our developers create a fully functioning environment on their local machines. This certainly introduces difficulties when debugging complex issues, but the effort that goes into testing offsets the limitations of recreating issues locally.

Roll Forward

It is a common trope of development managers and executives to discuss ‘rolling back’ a release once an issue appears. For those who have been on the frontlines of a bad release, this idea is always harder to realize than it is to say. In a world with no data and no state this may be possible, but that is also a world that is probably not much fun. Once builds go live they always have some effect on the state of the data, making a hard reversion to a previous build undesirable, as the behavior will be equally unpredictable (old logic on new data formats). Our build pipelines and production systems were not architected with the intention of reverting state to a previous version, but rather built to support the idea of ‘rolling forward.’ The way to fix a bad release was to update the code (or the configuration of a service), check in, and push a new release. Embracing this model simplified our build pipeline and production infrastructure but placed additional requirements on our testing and build pipeline throughput. Ultimately this was probably the right call for us, as our early incidents put a spotlight on our build times, a problem that, when solved, ultimately improved our developer productivity and stability.

Burn the Keys

Limiting access to production was a very unpopular choice at the onset but ultimately proved to be to our long-term benefit. The machines running in our production environment were nearly inaccessible to our developers (we never got to the point of actually destroying them). A common pattern we faced was a push to production, followed by a developer signing into the production box and then debugging/optimizing/finishing the service they had just pushed. As we moved to a fully automated infrastructure, access to the keys used to reach the production environment was severely limited. This pushed the focus towards improving our ability to extract monitoring data and analyze the runtime behavior of these environments.

Infrastructure is code, and so is monitoring

At this point the idea of keeping your entire infrastructure in a form that can be managed like your source code is relatively well accepted and embraced. Often overlooked in that discussion is the logic used to monitor your environments. As we built the building blocks for our infrastructure, we started to include the definitions for performance alerts in our infrastructure definitions. This ensured consistent monitoring for known problematic conditions across all services. For example, alerting on an increase in 5XX responses from an HTTP load balancer is a reasonable monitor to put in place for any load balancer. Ultimately, though, the major benefit of this exercise was the ability to regression test the logic of the monitors. A few times we were bitten after creating a new alert following a service incident, only to find it had faulty logic when the issue recurred. Transitioning to managing this logic as testable code was a huge step forward.
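As an illustration of the kind of alert definition that can live alongside infrastructure code, here is a hedged sketch of the 5XX monitor expressed with boto3 and CloudWatch; the alarm name, threshold, and SNS topic are assumptions, not our actual values.

```python
# A minimal sketch of "monitoring as code": the 5XX alert from the text expressed as a
# CloudWatch alarm for an Application Load Balancer. Threshold, periods, and the SNS
# topic are illustrative assumptions.
import boto3

def create_5xx_alarm(environment: str, application: str, role: str,
                     load_balancer: str, sns_topic_arn: str) -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName=f"{environment}-{application}-{role}-elb-5xx",
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_ELB_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": load_balancer}],
        Statistic="Sum",
        Period=60,                 # one-minute buckets
        EvaluationPeriods=5,       # sustained for five minutes
        Threshold=10,              # illustrative threshold
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )
```

Because the alarm definition is ordinary code, its logic can be exercised in CI like any other part of the infrastructure.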

Integration tests are artifacts

A common failure pattern in a microservice architecture is an update to a service that invalidates contracted (or implied) behavior another service relies upon. Most integration tests validate that the new behavior is consistent with the new suite of integration tests provided with the service — a tautology in most build systems. Overlooked is the regression testing of all the other microservices in the same ‘application’ (see above). Our solution to this TOCTOU testing problem was to treat integration tests as artifacts: as the artifacts for a microservice were published, so were the integration tests that were run during the build of that artifact. A dedicated stage was then introduced in each build pipeline to run the integration tests of all currently published artifacts and validate that the new service did not introduce breaking behavior. In this manner the new service had to pass its new integration tests (as pushed with the code) as well as its previous integration tests (the currently stored artifact), as sketched below.
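Here is a rough sketch of such a stage, assuming the published integration tests live alongside the service artifacts in an S3 bucket; the bucket name, layout, and pytest invocation are illustrative assumptions.

```python
# A sketch of the "integration tests are artifacts" build stage. It pulls the currently
# published tests for every service in the application and runs them against the new
# build. The bucket, prefix layout, and test runner are assumptions for illustration.
import subprocess
import boto3

TEST_BUCKET = "example-published-integration-tests"  # hypothetical bucket name

def run_published_integration_tests(application: str) -> None:
    """A new build must pass the previously published tests, not just its own."""
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=TEST_BUCKET, Prefix=f"{application}/")
    for obj in listing.get("Contents", []):
        local_path = "/tmp/" + obj["Key"].replace("/", "_")
        s3.download_file(TEST_BUCKET, obj["Key"], local_path)
        subprocess.run(["pytest", local_path], check=True)  # fail the stage on any regression
```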

Artifacts vs. Manifests

Our build system seemed particularly intent on storing the artifacts from our builds. As we did not want to trust our build system (or those administering it) with this responsibility, we found ourselves trying to work around this design. This proved frustrating and complicated the build process substantially. In the end we came to an approach of producing a build manifest as one of the artifacts of the build system, giving a referential pointer to the systems where the actual artifacts (containers, zips of Angular apps, etc.) were stored. This allowed us to use the native artifact flows built into the build system while also leveraging purpose-built artifact repositories to manage lifecycle (and minimize storage costs!). In the end our build system had a nice history of our builds, and each build had a manifest of what had been created during that run. The manifest would then point to the artifact's long-term location in an S3 bucket or Docker repository.
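For illustration, a build manifest might look like the following sketch; the field names and destinations are assumptions rather than our exact format.

```python
# A sketch of a build manifest emitted as the build system's artifact. The real
# artifacts live elsewhere; the manifest only records referential pointers to them.
import json

def write_manifest(build_id: str, artifacts: dict, path: str = "manifest.json") -> None:
    """Record where the real artifacts (containers, zips, etc.) actually live."""
    manifest = {
        "build_id": build_id,
        "artifacts": artifacts,  # name -> pointer (S3 URI, container image tag, ...)
    }
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)

# Hypothetical usage:
# write_manifest("1234", {
#     "api-container": "123456789012.dkr.ecr.us-east-1.amazonaws.com/prod-ingest-api:1234",
#     "ui-bundle": "s3://example-artifact-bucket/prod-ingest-ui/1234.zip",
# })
```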

If you write to it, you own it!

With our services we found that there was contention over ownership of the resources used when services interacted with one another. Examples might be a queue or a ‘dumb’ data store like an S3 bucket. In these cases we decided that the service writing to the resource ‘owned’ it. Its lifecycle and configuration were managed in conjunction with the service that wrote to it, in order to create clear responsibilities and avoid conflict in the case where both services attempted to manage the configuration.
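A hedged sketch of the rule in practice: the producing service's deployment creates and configures the queue, while consumers only discover it by name (using the naming convention above). The function and resource names are illustrative.

```python
# A sketch of "if you write to it, you own it": the writer's deployment provisions and
# configures the shared queue; readers never manage it. Names and settings are assumed.
import boto3

def provision_owned_queue(environment: str, application: str, role: str) -> str:
    """Called from the writing service's deployment; consumers only look the queue up."""
    sqs = boto3.client("sqs")
    response = sqs.create_queue(
        QueueName=f"{environment}-{application}-{role}-events",
        Attributes={"MessageRetentionPeriod": "86400"},  # the owner sets the configuration
    )
    return response["QueueUrl"]
```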

Persistent data stores are special

When we started applying the rule above we initially treated our databases in the same manner — as a dependent resource of one of our microservices. For example, initially our REST API ‘owned’ the database. While this allowed us to adhere to our clear ownership mechanism, it ultimately caused lifecycle issues. Keeping the microservices ‘stateless’ allowed us to treat them as ephemeral services without much concern about destroying them in CI or production. When a persistent datastore was attached, this introduced constraints on how we updated services and on our actions in the case of service disruptions. To handle this we started to treat any persistent datastore as a standalone service, which allowed us to manage its lifecycle independently and in a manner appropriate to the lifecycle of the data stored.
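As a loose illustration, the datastore gets its own identifier and lifecycle policy rather than riding along with the REST API; the attribute names and settings below are assumptions.

```python
# A sketch of a persistent datastore treated as a standalone service with its own
# application/role identifier and lifecycle policy. All names and values are assumed.
DATASTORE_SERVICE = {
    "application": "ingest",           # hypothetical application name
    "role": "postgres",                # its own role, e.g. prod-ingest-postgres
    "lifecycle": {
        "deletion_protection": True,   # never torn down with the stateless services
        "backup_retention_days": 30,   # managed to suit the data, not the API's releases
    },
}
```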

A further discussion of how these concepts are practically used can be found here-

and continued here-


Russell Spitler

Russell Spitler has spent his career in cybersecurity working as an engineer, architect, product manager, and product executive.