Data Migration Playbook

Maintaining a software system may require periodically changing the shape of your data and occasionally changing where the data is persisted. An added complexity is that this typically must be achieved with zero service downtime, which often turns otherwise simple data migrations into complex, multi-step ones. Below are scenarios I’ve encountered and helpful solutions I’ve found along the way:

Scenario 1: Migrating entities to another data store

Let’s say we are maintaining an API, /v1/users, that allows querying and mutating users and is backed by a NoSQL database. We see a steady uptick in the user growth rate and decide to update our API to rely on a relational database as its backing store and to migrate the existing user entities.

Handling writes during the migration process

The first problem we need to solve is how to handle requests to update users that arrive during the migration process while maintaining the zero-downtime requirement. The combination of a high cardinality of users and a high mutation request rate hitting the API means a high probability of receiving a write during the migration.

The simplest way to migrate user entities is to write and run a script that eagerly copies every user. The obvious way to prevent writes from landing in the old data store during the migration is to block updates to users entirely, but doing so breaks the zero-downtime invariant.

Instead, if we make the migration script idempotent so that it can safely be run multiple times, our procedure might look as follows (a sketch of the script appears after the list):

  1. Run migration script
  2. After migration is successful, update API to read and write to new database
  3. Re-run migration script to migrate any dangling writes
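
As a rough sketch, assuming a hypothetical old_store client that exposes a scan iterator over the NoSQL store and a PostgreSQL users table reached through a DB-API connection such as psycopg2 (none of which are prescribed by the scenario), the idempotent script might look like this:

```python
# Sketch of an idempotent migration script. The old_store scan API and the
# users table columns are illustrative assumptions, not a prescribed design.

def migrate_users(old_store, pg_conn):
    """Copy every user from the old NoSQL store into the new relational store."""
    with pg_conn.cursor() as cur:
        for user in old_store.scan("users"):  # assumed iterator over the old store
            # Upsert keyed on the user id: re-running the script simply
            # overwrites rows with the latest values, which is what makes
            # step 3 (picking up dangling writes) safe.
            cur.execute(
                """
                INSERT INTO users (id, email, name)
                VALUES (%s, %s, %s)
                ON CONFLICT (id) DO UPDATE
                    SET email = EXCLUDED.email, name = EXCLUDED.name
                """,
                (user["id"], user["email"], user["name"]),
            )
    pg_conn.commit()
```

The PostgreSQL ON CONFLICT upsert is what provides the idempotence here; any keyed upsert primitive in your target store works equally well.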

Handling rollbacks

Let’s say that after a few days a critical performance issue is discovered with our new data store and we decide to temporarily roll back the update. Our /v1/users API will now return user entities persisted in our original data store, missing any writes that happened against the new data store before the rollback.

Safely handling rollbacks requires staggering our migration process over two deployments (a code sketch follows the two deployment lists):

Deployment 1:

  • Update API mutations to write to both data stores

    • If ACID invariants are implemented, this requires either relaxing some constraints or investing significant resources in building a transaction abstraction across both data stores
    • Since we are typically migrating to a new data store with better transaction primitives, this generally isn’t as problematic as one might expect

Deployment 2:

  • Update API queries to read from the new data store
  • Update API mutations to write solely to the new data store
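
As a rough sketch, assuming hypothetical old_store and new_store clients and a simple flag that distinguishes the two deployments, the staggered rollout might look like this:

```python
# Sketch of the staggered rollout. old_store / new_store and their methods are
# illustrative assumptions. Deployment 1 ships with READ_FROM_NEW_STORE = False;
# deployment 2 flips it to True and deletes the old-store write below.
READ_FROM_NEW_STORE = False


def update_user(old_store, new_store, user_id, fields):
    # Deployment 1: dual-write, so rolling reads back to the old store never
    # loses data written while the new store was live.
    old_store.update_user(user_id, fields)  # removed in deployment 2
    new_store.update_user(user_id, fields)


def get_user(old_store, new_store, user_id):
    # Deployment 2 switches reads over to the new data store.
    store = new_store if READ_FROM_NEW_STORE else old_store
    return store.get_user(user_id)
```

Whether the switch is a literal flag, a config value, or simply the next release of the code is an implementation detail; the important part is that writes reach both stores before reads ever depend on the new one.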

Scenario 2: Adding a new required field to an entity

We recently received a feature request to allow admins to deactivate user accounts. The design of this feature calls for adding a new status field to the user entity. For every current user, we want to backfill this new database column with a default value of ACTIVE and require this field going forward.

Making this field required in a single deployment, either at the API boundary or at the database, requires precisely timing releases and deployments of database schema updates, backend API updates, and front-end clients. Accomplishing this in a single deployment with zero downtime is unrealistic, but it is possible by staggering changes across two deployments:

Deployment 1:

  • Add a nullable status column to the database table
  • Update API mutations to write to the status field

    • If the field is not provided in the request because of an outdated client, set it to either null or a default value of ACTIVE
  • Add functionality to clients to deactivate users and wire up to the API request
  • Update API queries to repair on read and return a default value of ACTIVE for any user that has a null status (see the sketch after this list)

    • Older versions of the backend API software will simply ignore the new column
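
A rough sketch of the Deployment 1 behaviour, assuming a hypothetical store client and dict-shaped user payloads:

```python
# Sketch of Deployment 1 for the new status field. The store client and the
# dict-shaped payloads are illustrative assumptions.
DEFAULT_STATUS = "ACTIVE"


def update_user(store, user_id, payload):
    # Outdated clients may omit status entirely; fall back to the default
    # (writing null instead would also work thanks to the repair on read below).
    status = payload.get("status", DEFAULT_STATUS)
    store.update_user(user_id, {**payload, "status": status})


def get_user(store, user_id):
    user = store.get_user(user_id)
    # Repair on read: rows written before this deployment have a null status.
    if user.get("status") is None:
        user["status"] = DEFAULT_STATUS
    return user
```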

Deployment 2:

  • Run a migration script to backfill a default value of ACTIVE for every user with a null status (sketched below)
  • Update the database schema to make the status column non-null
  • Update API queries to remove repair on read functionality
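
A rough sketch of the Deployment 2 backfill and schema change, assuming a PostgreSQL users table and a DB-API connection such as psycopg2:

```python
# Sketch of the Deployment 2 backfill. Table and column names are assumptions.

def backfill_and_require_status(pg_conn):
    with pg_conn.cursor() as cur:
        # Backfill every pre-existing user that still has a null status.
        cur.execute("UPDATE users SET status = 'ACTIVE' WHERE status IS NULL")
        # Only once no null values remain can the column safely become non-null.
        cur.execute("ALTER TABLE users ALTER COLUMN status SET NOT NULL")
    pg_conn.commit()
```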

Scenario 3: Migrating into an incompatible schema

For our next scenario, let’s assume our company, Alice Software Co., just acquired another company, Bob Software Co., and we want to integrate Bob’s software with our platform. Furthermore, let’s assume that doing so requires migrating all of Bob’s users to our platform’s user data store.

Situations like these typically involve solving incompatibilities between the two database schemas. For a concrete example, let’s assume that Bob’s user schema does not have a manager field, but that field is required in Alice’s schema.

Lazily Migrating

One solution to this problem is to lazily migrate the users:

  • In Bob’s user management application, add the user flow to set the (now required) manager field
  • Add a migrated flag to Bob’s user schema, and update Bob’s user management API mutations to set the flag to true in Bob’s data store and also write the user to Alice’s data store
  • Update Bob’s user management API queries to repair on read, returning migrated users from Alice’s data store and any not-yet-migrated users from Bob’s data store

    • Adding an index on Bob’s migrated field helps performance when stitching the data together across the two data sources
    • Combining this technique with user-specified custom sort orders on queries typically requires sophisticated caching and/or scatter-gather operations, which are beyond the scope of this article

This solution allows us to lazily migrate users as they get updated organically. After all users have been migrated, we can remove the code to set the migrated flag and also the repair on read functionality.
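
A rough sketch of those write and read paths, assuming hypothetical bob_store and alice_store clients whose methods (including the origin filter) are illustrative assumptions:

```python
# Sketch of the lazy migration. bob_store / alice_store and their methods are
# illustrative assumptions, not real APIs.

def update_bob_user(bob_store, alice_store, user_id, payload):
    # The new user flow guarantees the required manager field is now present.
    bob_store.update_user(user_id, {**payload, "migrated": True})
    alice_store.upsert_user(user_id, payload)


def list_bob_users(bob_store, alice_store):
    # Repair on read: stitch users already migrated into Alice's store together
    # with those still only in Bob's store. The index on the migrated flag
    # keeps the second query cheap.
    migrated = alice_store.list_users(origin="bob")  # assumed filter on origin
    unmigrated = bob_store.list_users(migrated=False)
    return migrated + unmigrated
```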

What if users never get migrated?

With the lazy migration approach, there is some risk that certain users will never be updated and therefore never migrated. In practice, we can solve this problem by either:

  • Implementing a bulk user update feature and having an internal admin eagerly migrate the remaining users (sketched below)
  • Having an automatic account deactivation flow after a period of inactivity
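
A rough sketch of the first option, reusing the same hypothetical clients; the default manager assignment is purely a placeholder, since a real admin tool would pick one per user:

```python
# Sketch of a one-off bulk migration for users that were never organically
# updated. Clients, methods, and the default manager are illustrative
# assumptions.

def migrate_remaining_users(bob_store, alice_store, default_manager_id):
    for user in bob_store.list_users(migrated=False):
        # Alice's schema requires a manager; fall back to a placeholder only
        # when one has not already been chosen.
        user.setdefault("manager", default_manager_id)
        alice_store.upsert_user(user["id"], user)
        bob_store.update_user(user["id"], {"migrated": True})
```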

Conclusion

Data migrations can be incredibly complex and difficult to get correct. Hopefully some of the techniques outlined above can help with the next data migration problem you tackle.