For Platform Engineering teams, there is one certainty in life: you will have to undertake a migration. You might move everyone out of one data center and into another, move from one container scheduler to another, or leave one cloud provider for another. Some migrations can be nearly transparent to the user, but most require other teams to do work. And when you make that request, you’ll be met with the familiar chorus of, “Didn’t we just finish the last migration!?”
Migration Misery
You are caught in a bind: you have partner teams who are frustrated with continuous ill-defined migrations, and you have to ask your team to define and run a migration—which they likely don’t want to do. At this point, it’s tempting to do nothing, but you likely know that approach will just make for more pain later. You might go to your manager and ask for a project manager (or a team of them) to get all the dependencies mapped and hound people until they’re done.
This is where Two Sigma’s Platform Engineering team found itself on many occasions, and we did what so many teams do: turn to project managers to make sense of our migrations. For some projects, like data center build-outs, this worked well. For others, however, a different story played out. Large meetings were convened, side meetings about the large meetings were held, and spreadsheets of dubious freshness and accuracy were emailed back and forth. Even so, we never quite got to the bottom of the scope of our migrations, and new tasks or unknown dependencies frequently emerged in meetings. The answer appeared to be more meetings, more emails, and more instant messages. Something was going wrong: project managers could rescue certain projects, but others seemed thankless despite their efforts.
We were making progress but it was slow, tiring, and unpredictable. Everyone was frustrated: project managers were overwhelmed by a huge and uniquely complex web of dependencies; those carrying out the migration were feeling harassed by project managers trying to understand it all; and the team requesting the migration spent too much time in meetings and spreadsheets trying to align with the other teams involved and making sure those teams knew what they needed to do.
It was clear this approach didn’t work for anyone, and migrations would eventually burn us all out. There had to be a better approach than a fleet of people armed with spreadsheets and tasked with a problem nobody could clearly define for them. We needed to reimagine our whole approach and help our engineers define their migrations, rather than hand them off to a project manager. We were producing spreadsheets and numbers from data somewhere; didn’t that mean we should be able to automate running a migration?
Designing a Better Way
Our Platform Engineering team started by getting everyone on the same page about all of the resources in our migration, an operating system upgrade. We wrote a small piece of code that listed every host’s name, whether it still needed to be fixed, and who should fix it. By running this code automatically every day against our source-of-truth data, we could see exactly how the migration was progressing.
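To make the idea concrete, here is a minimal sketch of such a daily report. It is not ADMIRE's actual code: the inventory format, the `TARGET_OS` value, and the names `HostStatus`, `build_report`, and `progress` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical target version; the real migration's criterion is not given in the article.
TARGET_OS = "os-v2"


@dataclass(frozen=True)
class HostStatus:
    name: str        # host name
    needs_fix: bool  # True while the host has not been upgraded
    owner: str       # who should fix it


def build_report(inventory, target_os=TARGET_OS):
    """Turn raw inventory records into one status row per host, sorted by name."""
    return [
        HostStatus(h["name"], h["os"] != target_os, h["owner"])
        for h in sorted(inventory, key=lambda h: h["name"])
    ]


def progress(report):
    """Fraction of hosts already migrated (1.0 for an empty report)."""
    if not report:
        return 1.0
    done = sum(1 for row in report if not row.needs_fix)
    return done / len(report)


# Stand-in for a query against the source-of-truth inventory (assumed shape).
inventory = [
    {"name": "web-02", "os": "os-v1", "owner": "alice"},
    {"name": "web-01", "os": "os-v2", "owner": "bob"},
]
report = build_report(inventory)
print([row.name for row in report if row.needs_fix])  # ['web-02']
print(progress(report))                               # 0.5
```

Because the report is recomputed from the inventory on every run, it cannot drift out of date the way a hand-maintained spreadsheet can.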
We extended this code into a system to look at the data and create a Jira ticket for every host that needed to be fixed. This ticket included details on exactly what needed to be done to carry out the upgrade. Once the upgrade was carried out, the system automatically closed the ticket. Similarly, the system reopened tickets closed erroneously for hosts that hadn’t been upgraded, with information to help users work out what they hadn’t done correctly.
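The create, close, and reopen behavior described above amounts to reconciling the ticket state with the measured state of each host. The sketch below illustrates that loop under simplifying assumptions: tickets are modeled as an in-memory dictionary rather than real Jira issues, and the `reconcile` function and its action names are hypothetical.

```python
def reconcile(needs_fix, tickets):
    """Bring ticket state in line with measured host state.

    needs_fix: dict of host name -> True if the host still needs the upgrade.
    tickets:   dict of host name -> "open" or "closed" (absent means no ticket yet).
               Updated in place; a real system would call the ticket tracker here.
    Returns the list of (action, host) pairs that were applied.
    """
    actions = []
    for host in sorted(needs_fix):
        state = tickets.get(host)
        if needs_fix[host]:
            if state is None:
                tickets[host] = "open"
                actions.append(("create", host))
            elif state == "closed":
                # Ticket was closed, but the host is still not upgraded.
                tickets[host] = "open"
                actions.append(("reopen", host))
        elif state == "open":
            # Upgrade detected, so the ticket can be closed automatically.
            tickets[host] = "closed"
            actions.append(("close", host))
    return actions


tickets = {"web-02": "open", "web-03": "closed"}
measured = {"web-01": True, "web-02": False, "web-03": True}
print(reconcile(measured, tickets))
# [('create', 'web-01'), ('close', 'web-02'), ('reopen', 'web-03')]
```

Running the same reconciliation a second time with unchanged data produces no actions, which is what makes it safe to run on a daily schedule.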
Once we had assigned Jira tickets, we started storing the data in a database and visualized the progress and scope of the migration in dashboards.
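As an illustration of how daily measurements can feed a dashboard, the following sketch stores one snapshot per day in SQLite and queries the number of hosts still outstanding. The article does not say which database or schema ADMIRE uses; the `host_status` table and both function names are assumptions.

```python
import sqlite3


def record_snapshot(conn, day, statuses):
    """Store one day's measurements: statuses maps host name -> needs_fix."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS host_status ("
        "day TEXT, host TEXT, needs_fix INTEGER, PRIMARY KEY (day, host))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO host_status VALUES (?, ?, ?)",
        [(day, host, int(flag)) for host, flag in statuses.items()],
    )
    conn.commit()


def remaining_by_day(conn):
    """Return (day, hosts still to fix) pairs: the series behind a burn-down chart."""
    rows = conn.execute(
        "SELECT day, SUM(needs_fix) FROM host_status GROUP BY day ORDER BY day"
    ).fetchall()
    return [(day, int(remaining)) for day, remaining in rows]


conn = sqlite3.connect(":memory:")
record_snapshot(conn, "2024-01-01", {"web-01": True, "web-02": True})
record_snapshot(conn, "2024-01-02", {"web-01": False, "web-02": True})
print(remaining_by_day(conn))  # [('2024-01-01', 2), ('2024-01-02', 1)]
```

Keeping every day's snapshot, instead of only the latest state, is what allows a dashboard to show the trend as well as the current scope.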
We call this system ADMIRE, which stands for ADopt, MIgrate, and REtire. With ADMIRE, everyone knows what they need to do and how to do it. Even better, we don’t need a project manager running the day-to-day operations of the migration, and everyone can see the migration status on a dashboard.
One of the hardest parts of building this system was working out who should receive these tickets. Ownership information varied from resource to resource, was often outdated, and required mapping to specific people from department names or shared account names. To solve these challenges, we built a web service that could take various identifiers and apply heuristics to find the most likely owners. Unsurprisingly, the service was a hit and found itself used throughout Two Sigma to identify owners.
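One way to picture such a service is as an ordered chain of heuristics, each of which tries to turn an identifier into a specific person. The sketch below is illustrative only: the lookup tables, the heuristics, and their order are invented for the example and are not the actual rules used at Two Sigma.

```python
# Hypothetical reference data; a real service would query directories and account systems.
ACTIVE_PEOPLE = {"alice", "carol", "dave"}
SHARED_ACCOUNT_MEMBERS = {"svc-batch": ["bob", "alice"]}
DEPARTMENT_CONTACTS = {"research-infra": "carol"}


def as_person(identifier):
    """The identifier is already a current employee."""
    return identifier if identifier in ACTIVE_PEOPLE else None


def from_shared_account(identifier):
    """Map a shared account to its first member who is still active."""
    for member in SHARED_ACCOUNT_MEMBERS.get(identifier, []):
        if member in ACTIVE_PEOPLE:
            return member
    return None


def from_department(identifier):
    """Map a department name to its designated contact, if still active."""
    contact = DEPARTMENT_CONTACTS.get(identifier)
    return contact if contact in ACTIVE_PEOPLE else None


# Heuristics are tried in order; the first one that yields a person wins.
HEURISTICS = [as_person, from_shared_account, from_department]


def resolve_owner(identifiers):
    """Return the most likely owner for a resource, or None if nobody is found."""
    for identifier in identifiers:
        for heuristic in HEURISTICS:
            person = heuristic(identifier)
            if person is not None:
                return person
    return None


# "bob" has left, so the lookup falls through to the shared account.
print(resolve_owner(["bob", "svc-batch"]))  # alice
print(resolve_owner(["research-infra"]))    # carol
```

Returning `None` instead of guessing keeps unowned resources visible, so they can be routed to a fallback queue rather than assigned to the wrong person.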
After we wrote the code for our first migration, it only took a few hours to agree that the generated data was complete in scope and accuracy. This was clearly a big improvement over the old way of carrying out a migration and prompted the question: why not use it for more migrations?
Scaling ADMIRE Migrations
Indeed, today we don’t just run any old migration; we run ADMIRE migrations. Unlike traditional chaotic, ill-defined migration projects, they have their scope defined in code and are run by automated processes. Customers, who receive tickets with very specific instructions, trust and understand these migrations. ADMIRE provides a well-understood framework for teams that need to run migrations, so they don’t need to create one from scratch.
By reducing the cost of our migrations, we’ve been able to invest in ADMIRE itself. For example, we’ve added the ability for ADMIRE to map dependencies between tasks in a migration, enabling it to open the ticket for a task only when that task’s dependencies have been satisfied. We also invested in a reminder system that reminds users only when their specified due dates are coming up, rather than nagging them constantly. Improvements like these make ADMIRE better able to model our migrations, assist in identifying the next things to fix, and become more useful to and trusted by our customers.
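Both ideas can be sketched in a few lines. The example below is an assumption-laden illustration, not ADMIRE's implementation: the task names, the `ready_tasks` and `should_remind` functions, and the three-day reminder window are all hypothetical.

```python
from datetime import date, timedelta


def ready_tasks(dependencies, done):
    """Tasks whose tickets may be opened now.

    dependencies: dict of task -> set of tasks that must finish first.
    done:         set of completed tasks.
    """
    return sorted(
        task
        for task, deps in dependencies.items()
        if task not in done and deps <= done  # every dependency is complete
    )


def should_remind(due, today, window_days=3):
    """Remind only when the due date falls within the next window_days days."""
    return timedelta(0) <= due - today <= timedelta(days=window_days)


deps = {
    "upgrade-os": set(),
    "restart-service": {"upgrade-os"},
    "retire-old-host": {"restart-service"},
}
print(ready_tasks(deps, done=set()))           # ['upgrade-os']
print(ready_tasks(deps, done={"upgrade-os"}))  # ['restart-service']
print(should_remind(date(2024, 1, 10), today=date(2024, 1, 8)))  # True
print(should_remind(date(2024, 1, 10), today=date(2024, 1, 1)))  # False
```

In this sketch overdue tasks are not reminded at all; whether they should trigger a different kind of follow-up is a policy choice the article does not specify.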
ADMIRE has also helped us to focus on how we can automate (or, in some cases, eliminate) migration tasks. For example, when we’ve been forced to write concise ticket text for an upgrade, we’ve realized that we can automate steps easily. In other cases, as we see the whole scope of the migration in a dashboard, we’ve noticed resources that could be deleted automatically. We’ve also heard complaints and frustrations from customers that traditionally may have been lost in the back-and-forth of a meeting and we’ve used these to inform changes to make future migrations easier.
In the midst of all this automation, it’s easy to think project managers are no longer necessary, but we found them in a new role as ambassadors of migrations. Where once they were doing the day-to-day busy work of running migrations, they can now focus on long-term planning, supported by automated reporting and reminders. In addition, they have the ability to support more migrations for their teams, play to their strengths instead of constantly having to learn the minutiae of migrations, and act as a feedback conduit between their teams and the ADMIRE team. This is a much more effective use of their time.
Our embrace of this product-focused approach to migrations has helped us move from migration paralysis to having a lean and effective engine for running migrations under a trusted brand and well-understood process. In turn, this engine is giving us the time and insights to automate the migration tasks themselves, which reduces or eliminates the cost of every task. This ongoing process of optimization is an easy-to-miss but integral part of ADMIRE. Simply opening more and more Jira tickets doesn’t work in the long run. Instead, the tickets themselves are a feedback mechanism. They quantify the scale (and, by proxy, the cost) of migrations, and that helps inform how we build platforms and invest in their automation. It’s this change in attitude to building and automating platforms that delivers organizational change.
Migrations may be a certainty of Platform Engineering, but ADMIRE proves they don’t have to be the unpleasant experience most of us are used to.