Creating a read-only AWS UI with ‘infra-buddy’

Russell Spitler
5 min read · Oct 12, 2020

The trap of the cloud service UI is familiar to anyone who has set up a service on AWS (or any cloud provider): well-intentioned (and stressed!) engineers respond to incidents by flipping settings and promptly forgetting about them. Or worse, your ‘AWS Cost Admin’ tweaks some setting that is only uncovered months later. Use of the web UI inevitably leads to configuration drift: a series of small, undocumented changes to your environment that cause it to slowly evolve away from whatever was initially deployed. Configuration drift of production environments ultimately poses two fundamental problems: an unknown state to restore in the case of disaster recovery, and erosion of the fidelity of the CI (continuous integration) environment. In the worst-case scenario, the drifted state is hard to replicate if the entire environment has to be rebuilt during a major service disruption. As a more daily pain point, the behaviors of CI and production start to diverge and the integrity of your regression tests is invalidated.

Our team addressed the issue of drift through an evolutionary process that resulted in the development of ‘infra-buddy’, a Python CLI that automated our deployment processes in order to keep the CI and production environments aligned, removing the need for manual configuration of the cloud environment through the web UI.
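
As an aside, CloudFormation can surface this kind of drift directly. The snippet below is a generic boto3 sketch (not part of infra-buddy) showing one way to ask AWS which resources no longer match their template; the stack name is hypothetical.

```python
# Sketch: detecting configuration drift on an existing CloudFormation stack.
# Generic boto3 example, independent of infra-buddy; the stack name is made up.
import time
import boto3

cfn = boto3.client("cloudformation")

def report_drift(stack_name: str) -> None:
    # Kick off a drift detection run and poll until it finishes.
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(5)

    # List every resource whose live configuration differs from the template.
    drifted = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    for resource in drifted:
        print(resource["LogicalResourceId"], resource["StackResourceDriftStatus"])

report_drift("my-service-stack")  # hypothetical stack name
```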

In order to achieve our goal, our team needed to solve the following problems:

  1. Completely configure our infrastructure through our source code repositories and the build system. Many deployment patterns treat infrastructure as a special case of the build pipeline, either skipping formal build / verification steps or maintaining divergent logic paths for CI and production. We chose to keep the infrastructure pipeline identical to the code pipeline: the necessary build steps, CI deployment and verification, followed by an automated promotion to production. This consistency forced us to develop the automation needed for verification, and provided the added benefit of a single view into *all* changes to production. Any change, whether a code change or a configuration update, was visible through the build system and traceable through git check-ins. The rigor of this pattern was a substantial improvement over the patterns we had seen in the wild.
  2. Provide self-service infrastructure building blocks for developers to use as needed. Reducing the number of snowflakes in our environment was critical to our model, so we needed flexible building blocks that developers could use to construct their services. As always, it is easier to find the common set of reusable code through refactoring; taking a critical eye to our first-generation infrastructure, we were able to create the following for developers to use: a REST API, batch processing, a CDN distribution for a single page app, and a persistent data store (in various flavors: Elasticsearch, Postgres, MySQL). These building blocks could then be easily extended to turn on pre-configured capabilities, such as autoscaling.
  3. Create a common monitoring system for the infrastructure that did not require opt-in or additional configuration. With the building blocks above in place, it was possible to pre-bake the integrations and configuration needed to hook each block into the common logging platform and the performance monitoring systems. For those responsible for uptime, the confidence gained from knowing no part of the system has been accidentally overlooked in logging or monitoring is game-changing.
  4. Create ‘identical’ CI and production environments (see the concepts outlined here). We used a build pipeline for the infrastructure that validated deployment in CI and then promoted to production, ensuring there were no infrastructure discrepancies between environments. To handle the natural delta in scale, we built two facilities for supplying different scaling parameters to the different environments, which also made it easy to scale up CI when a production issue appeared to be rooted in traffic volume (a minimal sketch of the idea appears after this list).
  5. Minimize duplication of the ‘code’ used to build our environment. Everyone wants to think they are special, so our first generation of infrastructure was prone to the ‘copy & modify’ design pattern. By providing the right building blocks we reduced our infrastructure ‘code’ to a minimal set; making it self-service made reuse the easy path.
  6. Be able to test our infrastructure building blocks in isolation from their use in a production system. Creating a full build pipeline for our infrastructure let us write regression tests. Ultimately this lived in the build pipeline for the infra-buddy project itself: when developers checked in updates to the common building blocks, regression tests were executed against them in isolation (a sketch of such a test follows this list). The building blocks were then also tested in context in the build pipeline of each microservice that leveraged them.
  7. Create a fully replicable process for creating our runtime environments (excluding data, the environment could be recreated without any undue burden). The replicable process is the end result of a fully automated build process and a configuration strategy that was fully git-managed. With the exception of the data in the data stores, production could be recreated in as long as it took the build process to run (~60 minutes even for complex applications, as most microservices deployed in parallel).
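
The “different scaling parameters per environment” idea from item 4 can be reduced to something like the following sketch: a shared template plus a small per-environment override file. The file names, keys, and helper functions below are hypothetical illustrations, not infra-buddy’s actual implementation.

```python
# Sketch of per-environment scaling parameters (hypothetical, not infra-buddy's code).
import json
import boto3

def load_parameters(environment: str) -> dict:
    # defaults.json holds shared values; ci.json / prod.json hold overrides such as
    # instance counts or autoscaling limits (file names are assumptions).
    with open("defaults.json") as f:
        parameters = json.load(f)
    with open(f"{environment}.json") as f:
        parameters.update(json.load(f))
    return parameters

def deploy(stack_name: str, template_path: str, environment: str) -> None:
    cfn = boto3.client("cloudformation")
    parameters = [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in load_parameters(environment).items()
    ]
    with open(template_path) as f:
        cfn.create_stack(
            StackName=f"{stack_name}-{environment}",
            TemplateBody=f.read(),
            Parameters=parameters,
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )

# CI and production use the same template; only the parameter overrides differ.
deploy("web-api", "service.template.json", "ci")
```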
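
Likewise, the isolation tests from item 6 can be sketched as an ordinary test: stand the template up in a sandbox account, assert the stack converges, and tear it down. Again, this is a hypothetical pytest-style sketch (template path and naming are assumptions), not the actual infra-buddy test suite.

```python
# Hypothetical regression test for a building-block template (not infra-buddy's suite).
import uuid
import boto3
import pytest

@pytest.fixture
def cfn():
    return boto3.client("cloudformation")

def test_rest_api_block_deploys_cleanly(cfn):
    stack_name = f"blocktest-{uuid.uuid4().hex[:8]}"  # throwaway stack in a sandbox account
    with open("rest-api.template.json") as f:         # assumed template path
        cfn.create_stack(
            StackName=stack_name,
            TemplateBody=f.read(),
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )
    try:
        # Fail the test if the stack never reaches CREATE_COMPLETE.
        cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    finally:
        # Always clean up the sandbox, pass or fail.
        cfn.delete_stack(StackName=stack_name)
        cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```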

Our path towards these goals started with a mess of bash scripts. While the words ‘mess’ and ‘bash’ often go hand in hand, this was an insane mess: the dark corners of bash were uncovered, and unholy contracts were made with environment variables and regular expressions. All of this wrapped CloudFormation templates that were hundreds of lines long, often containing copy-and-pasted snippets from other templates and similarly incomprehensible magic for generating parameters and conditional logic. While this worked for a while, it was not long before our team was reverting to modifying the UI and worrying about the next time a template would be run, for fear of critical settings being reverted. In short: unmanageable.

It was an obvious choice to refactor into a more usable utility, one that provided more stable testing and an extensible architecture. This ultimately became ‘infra-buddy’, a Python CLI that provides simple management of production infrastructure through predefined building blocks that developers can self-service.

Infra-buddy has a lot of moving parts, and probably more extensibility than the average user will need. In practice, though, the value is simple to access: all the developer needs to know is a relatively small JSON file, which in most cases is auto-generated with all needed configuration by the upstream service-buddy.
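
To give a sense of scale, the descriptor a developer touches is on the order of a handful of fields. The sketch below is purely illustrative; the field names are assumptions rather than infra-buddy’s documented schema, and it is written as a Python dict only so the assumptions can be annotated inline.

```python
# Purely illustrative: what a small service descriptor *might* contain.
# Field names here are hypothetical; consult the infra-buddy / service-buddy
# documentation for the real schema.
import json

service_descriptor = {
    "application": "otx",          # hypothetical: which product the service belongs to
    "role": "search-api",          # hypothetical: the service's name within the app
    "service-type": "rest-api",    # hypothetical: which building block to instantiate
    "deployment-parameters": {     # hypothetical: per-service overrides (e.g. scaling)
        "autoscaling-enabled": True
    },
}

with open("service.json", "w") as f:
    json.dump(service_descriptor, f, indent=2)
```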

Our goal of aligning the CI and production environments was achieved through many iterations and lessons learned, but the result was worth it: all configuration updates were recorded as git check-ins, monitoring was tested in CI before going to production, and developers could extend the architecture of the application with ready-to-use, monitored, logged, reliable building blocks. This ultimately improved the reliability and scale of the system. At this point infra-buddy has been put to use on at least two other projects of reasonable scale, proving its portability beyond OTX.


Russell Spitler

Russell Spitler has spent his career in cybersecurity working as an engineer, architect, product manager, and product executive.