Best Practices using Flyway for Database Migrations

This article contains a set of best practices for Flyway integration (taken from my experience and experience of other people) that I hope you will find useful.  I also assume you are already familiar with what Flyway is used for, and how it works. If you are not, check out a simple introduction on the Flywaydb.org website: https://flywaydb.org/getstarted/.

We will take a look at the following best practices in this article:

  • Team Arrangement and Branching with Flyway
  • Idempotent delta scripts
  • Baseline
  • Flyway Configuration using Spring Boot
  • Flyway and H2 Unit Tests
  • Versioned and Repeatable Migrations
  • Dealing with Hotfixes
  • Multiple Instances with Flyway
  • Manage Multiple Schemes or Shards with Flyway
  • Flyway in Production
  • Dry Runs
  • Rollback of Flyway Migrations
  • Flyway Log

This is not the exhaustive list of Flyway practices, but the ones listed are the most essential in my humble opinion.

Team Arrangement and Branching with Flyway

The best branching strategy for Flyway-based database migrations depends on your team arrangement.

Shared Development Database and  Dedicated DBA Role

If your team uses a single development database shared by all team members, if you have a dedicated DBA who responds to your requests for database changes, if you favor trunk and rarely use branches, then you should be good with the default Flyway configuration and numbering scheme, which uses integer version numbers for your delta files, such as:

  • V1__Update_1.sql
  • V2__Update_2.sql
  • … and so on.

When using this approach, you will do best with checking in all migrations to the trunk. Your DBA will be a single authority to resolve conflicts and to make sure that database doesn’t break when new delta files are applied. Your DBA will also be responsible for rolling back migrations and cleaning up database if something goes wrong.

Multiple Developers Making DB Changes

If you are in the true DevOps model and give multiple developers control over database changes, if they work independently on different features, then your team should follow a strict process when implementing migrations and resolving conflicts.  When developers have power over database changes, they have a great responsibility and they should be very careful when deploying their delta files. To make sure that database migrations work smoothly, try to follow the best practices below.

1. Each developer should work with his/her own database copy.

This requirement is essential. Since database changes can break other people’s work, every developer should have a personal copy of the database which can be hosted on developer’s laptop or on development server.

2. Each developer should work in his/her own branch.

When developer works on a separate feature then he/she should maintain dependency between the DB changes and the code changes within a separate branch. Issues may start when DB changes that this developer makes in his own branch depend on the changes that another developer makes. In this case both developers should work in the same branch.

Developer who implements migrations in his own branch and in his own database will be responsible for rollbacks and for cleaning up DB if migrations fail to work. This developer will need to use Flyway maintenance commands, such as CLEAN and MIGRATE.

3. Use timestamps for delta file versions instead of integers.

When all branches are merged into trunk, and you use integers for delta file versions (as Flyway suggests), then conflicts are possible. Developers will constantly have to check with each other when they create a new version of a migration. The alternative is to use timestamps for the delta file versions instead of integers, then possibility for conflicts is reduced dramatically. A timestamp will allow migrations to be applied in order they were originally created.

Flyway only recognizes integer numbers, so the timestamp should be converted into integer, i.e. ‘12/30/2016 12:30:55.282’ should be converted to 20161230123055282.

4. Enable out of order migrations.

By default, Flyway will ignore migrations/delta scripts that are older than the one already applied to DB. This will create an issue if the branch with the later version is merged into CI environment and built before the one with the earlier version. On the diagram above the red Branch #1 was merged into trunk before purple Branch #2. Also, an update to trunk was made before purple Branch #2 was merged. If the project utilizes continuous integration, then migrations of the purple Branch #2 will be ignored. The reason is that Flyway will check the latest applied migration version (which is 20170410171256) and ignore migration version 20170120081315 as outdated when purple Branch #2 is merged. This is true for timestamp version numbers as well as for integer version numbers.

To avoid this situation, enable the out-of-order migrations setting in Flyway. With Spring Boot you need to set the property flyway.out-of-order=true (see this link for more info: https://flywaydb.org/documentation/commandline/migrate ).

5. Use continuous integration DB environment to merge all DB changes.

A good idea is to have a dedicated environment where all your deltas (migrations) will be applied. You can use development, test or staging environment for this purpose. Usually this environment will be associated with a trunk. If you have a dedicated environment where your CI build is performed, then use this environment for Flyway database updates.

6. Do a DB code review before merging changes to CI environment (trunk).

This is a very important requirement: all developers who make changes to DB should review other teammate’s delta scripts before they get merged into trunk. This is an obvious step to make sure that one migration will not break the other.

7. If feasible, combine multiple changes from different branches into one delta file.

This is an optional step that will help maintaining cleaner deployment code. Imagine you have a hundred delta scripts (migrations) accumulated before the release. Those migrations will be sequentially applied in all higher environments including production. It will be easier to maintain all changes in one single delta file. To do that, before merging individual branches into trunk, all delta scripts in several branches should be manually merged into a single delta script that will be pushed to trunk.

This approach is not always feasible as it doesn’t play well with CI and CD processes that favor frequent incremental updates. Still, if your team merges all branches once before a release or you have a dedicated DBA who doesn’t mind additional work combining several scripts into one, then this approach will save time and effort in the long run, when the number of migrations becomes huge after a few years. Also, check the Baseline section later in this article.

Idempotent delta scripts

Every delta script that represents database migration should be written in a way that allows executing it multiple times and obtaining the same result (idempotent operation).

This means that every action in the script should be preceded by the action that checks if the change already exists or the action that reverses the change.

Checking for existence is preferable, but not always easy. MySQL, for example, does not allow IF ELSE statements outside of the procedures, and though workarounds exist, they don’t play well with H2 unit tests (I explain it later in this article).

The common idempotent workaround for MySQL is creating a stored procedure that performs the migration, then dropping this stored procedure afterwards. Stored procedure will use IF-ELSE to check for existing objects. Example is below:

DELIMITER $$

DROP PROCEDURE IF EXISTS upgrade_database_1_0_to_2_0 $$

CREATE PROCEDURE upgrade_database_1_0_to_2_0()

BEGIN

-- rename a table safely

IF NOT EXISTS( (SELECT * FROM information_schema.COLUMNS WHERE TABLE_SCHEMA=DATABASE()

AND TABLE_NAME='my_old_table_name') ) THEN

RENAME TABLE

my_old_table_name TO my_new_table_name,

END IF;

-- add a column safely

IF NOT EXISTS( (SELECT * FROM information_schema.COLUMNS WHERE TABLE_SCHEMA=DATABASE()

AND COLUMN_NAME='my_additional_column' AND TABLE_NAME='my_table_name') ) THEN

ALTER TABLE my_table_name ADD my_additional_column varchar(2048) NOT NULL DEFAULT '';

END IF;

END $$

CALL upgrade_database_1_0_to_2_0() $$

DELIMITER ;

Source: http://blog.geekq.net/2010/08/11/add-column-safely-mysql/

Baseline

You can set your existing database as a baseline version and instruct Flyway to apply all migrations on top of this version. You should start with your production database DDL – that will be your baseline. Then you can create new environments that are aligned with your production database from scratch using Flyway, thus implementing continuous integration on your DB frontier.

Obviously, you don’t want to break your production database, so when you established the baseline and tested everything in dev/test environment then create a production copy of your database (aka pre-prod or integration environment) and test the first migrations there.

Even if you used Flyway since the beginning of your DB life cycle, you might want to baseline it as an existing database later in time, so you will avoid checking and applying hundreds of migrations collected over the years. You will need to follow the process described on the Flyway web site.

A good and eaasy description of the baseline process is published on Flyway website: https://flywaydb.org/documentation/existing.html

Flyway Configuration using Spring Boot

Spring Boot comes with out-of-the-box support for Flyway, all you need to do is to add a line into build.gradle:

compile "org.flywaydb:flyway-core:4.0.3"

Flyway will be automatically called when application starts. If you use H2 for unit tests, Flyway will be called for H2 initialization as well.

Flyway will search your classpath for migration files (delta scripts) and apply them accordingly.

The list of Flyway properties that can be specified in application.properties/application.yml is the following (for the latest updates check this URL: https://docs.spring.io/spring-boot/docs/current/reference/html/common-application-properties.html ).

# FLYWAY (FlywayProperties)

flyway.baseline-description= #

flyway.baseline-version=1 # version to start migration

flyway.baseline-on-migrate= #

flyway.check-location=false # Check that migration scripts location exists.

flyway.clean-on-validation-error= #

flyway.enabled=true # Enable flyway.

flyway.encoding= #

flyway.ignore-failed-future-migration= #

flyway.init-sqls= # SQL statements to execute to initialize a connection immediately after obtaining it.

flyway.locations=classpath:db/migration # locations of migrations scripts

flyway.out-of-order= #

flyway.password= # JDBC password if you want Flyway to create its own DataSource

flyway.placeholder-prefix= #

flyway.placeholder-replacement= #

flyway.placeholder-suffix= #

flyway.placeholders.*= #

flyway.schemas= # schemas to update

flyway.sql-migration-prefix=V #

flyway.sql-migration-separator= #

flyway.sql-migration-suffix=.sql #

flyway.table= #

flyway.url= # JDBC url of the database to migrate. If not set, the primary configured data source is used.

flyway.user= # Login user of the database to migrate.

flyway.validate-on-migrate= #

For other Flyway execution modes and plugins (Maven, Ant, Grails, etc.) see Flyway documentation: https://flywaydb.org/documentation/ (check the left-side navigation).

Flyway and H2 Unit Tests

When you use H2 for unit tests, Spring will automatically run the Flyway on H2 database when application starts. This approach will make sure your H2 database matches your application database.

The recommended approach is to create two profiles in your application .yml file: one for development and one for testing, i.e.:

spring:
 profiles: development
 server :
 port : 8080
 jdbcConnection :
 driverClassName: com.mysql.jdbc.Driver
 url: jdbc:mysql://localhost/mydatabase
 username: user
 password: mypassword
spring:
 profiles: test
 server :
 port : 8080
 jdbcConnection :
 driverClassName: org.h2.Driver
 url : "jdbc:h2:mem:mydatabase;MODE=MySQL;DB_CLOSE_ON_EXIT=FALSE;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;"
 username: user
 password: mypassword

Configure your development profile with your application DB (i.e. MySQL) connection string, and test profile with H2 database connection string.
You will also need to install the latest version of H2 database, otherwise you may have compatibility errors, at the time of writing the following versions were used:

h2Version = '1.3.176'
mysqlVersion = '5.1.6'
flywayVersion = '4.0.3'

When running your application, you should select development profile with the following command:

./gradlew bootRun -Dspring.profiles.active=development

For running test cases you can select a test profile (or set it as active in application.yml).

Another issue that you may experience is that Flyway closes connection after it applies all migrations. To keep connection open you should explicitly add the following options to the connection string:

DB_CLOSE_ON_EXIT=FALSE;DB_CLOSE_DELAY=-1;

Flyway compatibility case: MySQL and H2

When you use H2 for unit tests in Spring Boot, all Flyway migrations will be automatically applied to H2 database. Therefore, H2 database should understand the syntax of your DB SQL language.

The compatibility mode must be turned on in the H2 database connection string using the following option:

MODE=MySQL;

Another compatibility issue comes from the fact that H2 automatically creates public schema in upper case letters which may conflict with MySQL database name if it was set in lower case letters. You need to turn off uppercase letters in H2 by providing the following option:

DATABASE_TO_UPPER=false;

Another workaround is to force MySQL to use upper case letters for schema name.

Versioned and Repeatable Migrations

Flyway supports both versioned and repeatable migrations. Versioned migration file name has a prefix V# where # is a version number (unless you change the prefix in Flyway configuration). This migration is applied only once.

Repeatable migrations will be applied after all versioned migrations. They will be applied again on the next run if the checksum of the delta file changes. Repeatable migrations are useful in the following situations:

  • Rebuilding indexes, views and stored procedures.
  • Adding permissions
  • Other maintenance tasks

Repeatable migrations are especially convenient for resolving the conflicts as you have one source file that multiple developers can update.

Note that if you want to apply the same post-migration script on every application run, then instead of repeatable migrations you need to use Flyway callbacks (https://flywaydb.org/documentation/callbacks.html ).

Dealing with Hotfixes

Activate outOfOrder property to allow applying the Flyway migrations out of order and fill the gaps (as described in “Enable out of order migrations” section above).

If your production environment is at version 5 and your dev and test environments are already at version 6, but you need to run hotfix 5.5, then outOfOrder property needs to be activated in dev/test environments in order to apply version 5.5 after version 6. You should have activated it anyway if you have multiple developers working in different branches.

Multiple Instances with Flyway

Flyway is advertised as a “thread-safe” application. If your production environment has several instances of the application (nodes, containers) then every instance will run a Flyway “migrate” command. For every migration Flyway locks the SCHEMA_VERSION table, so other instances will be waiting until migration gets completed, and only then will proceed further, so no conflicts or duplicate migrations occur.

Since Flyway only locks the SCHEMA_VERSION table for one of the delta scripts at a time, multiple Flyway instances can grab different migrations and implement them in the order defined by migration version numbers.

The following simplified diagram shows how 3 instances of Flyway performed a series of 6 migrations.

Instance 1 on the diagram performed 3 migrations, while Instance 2 performed 2 migrations and Instance 3 performed only 1 migration. The order in which instances were started (1,2,3) contributed to the order in which instances acquired locks and implemented migrations. Migration 2 was the longest to implement, so both instances 2 and 3 had to get in queue and wait for their turns until migration 2 was finished. Please note that though this diagram is simplified, it represents the actual sequence of operations that take place during migration.

One scenario was reported when multi-node Flyway configuration failed. This may happen during the initial deployment when database is empty and multiple nodes are initialized at exact same time. Since SCHEMA_VERSION table doesn’t exist yet, there is nothing to lock, so while the first node is creating this table, the second node attempts this too, but since the table is already being created, the second node fails. This scenario was described here: https://github.com/flyway/flyway/issues/1067

The workarounds to the issue above would be:

  • Run Flyway manually from the command line when deploying for the first time.
  • Create SCHEMA_VERSION table in advance, before the first deployment.
  • Start the second and consecutive instances of the application with some delay, so the first instance will have enough time to create the table.

Manage Multiple Schemes or Shards with Flyway

You should use one DB schema per micro service to avoid conflicts with other projects. Still, if your micro service uses multiple DB schemes or shards, you can configure Flyway to deal with them. The process is described in Flyway documentation: https://flywaydb.org/documentation/faq#multiple-schemas

There are 3 scenarios covered by Flyway:

  1. You have multiple identical schemes (i.e. one per tenant) and you want to apply the same migration scripts to all those schemes.
  2. You have similar schemes that use the same migration scripts.
  3. You have multiple different schemes and each schema needs different migration scripts.

All those scenarios are easy to implement using Flyway configuration parameters. If you have some other scenario (unlikely) then you can always extend Flyway with your custom logic written in Java.

Note that MySQL doesn’t support schemes, so you will have to deal with multiple databases instead.

Flyway in Production

There are a few things that you will need to do when planning to use Flyway in production environment. But first, remember to test flyway migrations on production DB copy! Especially if you are integrating Flyway for the first time in your project. You might already maintain your production DB copy as an integration/preview/pre-prod environment, so as long as it matches production, do your pre-release testing in this environment first!

  1. You will need to make sure you baseline your production database.
  2. You will need to disable the Flyway Clean command in production.
  3. You will need to enable out-of-order migrations to allow hotfixes (out-of-order migrations are described earlier in this document).
  4. You will need to consider a dedicated connection string for your Flyway migrations.
  5. You will need to decide if your application will trigger Flyway migrations in production, or your DBA will run it manually from command line.

1. Baseline of production database is extremely important, so when you deploy the Flyway for the first time. Read a section about setting DB baseline in this document and on Flyway web site.

2. When you run the Clean command, Flyway deletes all objects from the database, even if they were there before baseline, so this command can’t be applied in production. Flyway has the option to disable the Clean command:

cleanDisabled = true.

Add it to your production profile/configuration. More info about the command is here: https://flywaydb.org/documentation/gradle/clean

4. Flyway can be configured to use a dedicated connection string with admin access, which is especially important for production environment where you don’t want to mix admin and application accounts.

5. Some enterprises use policies that prohibit storing admin passwords in production configuration file. Therefore, application will not be able to run Flyway in such production environments. Still, DBA can invoke Flyway from the command line, manually entering admin password. This approach breaks the Continuous Delivery (CD) paradigm, but provides the policy compliance.

Dry Runs

Dry run is a recently added feature in Flyway that allows you to do two things:

  • Preview your changes to DB before running MIGRATE command
  • Perform releases without running Flyway directly.

Performing dry run is as easy as adding a key –dryRunOutput to MIGRATE command, like this:

flyway migrate -dryRunOutput=dryrun.sql

Instead of running the multiple scripts on the DB, Flyway will combine them into one big script and save it to the specified file. The file will contain the whole set of database updates as if Flyway ran them. The file (in our example it’s called dryrun.sql) can be presented to DBA for review/approval. It can also be run directly on the database which will be an alternative to running migrations with Flyway.

If Flyway can’t be run in production environment for any reason, dry run output can be used instead. Just keep in mind that you will still need to run Flyway with MIGRATE command somewhere (i.e. test environment) to advance with normal Flyway process and perform a new release next time.

Rollback of Flyway Migrations

Flyway supports rollbacks with the following features:

  • DDL transaction support
  • Rollback scripts

DDL Transaction Support. Every migration script in Flyway is automatically wrapped up in transaction. If your database supports DDL transactions (like PostgreSQL does), then any error in your migration will roll back the whole migration script automatically. If you have SQL Server then every DDL query will be automatically committed, so your script will not roll back automatically and you will need to create manual rollback scripts (or use idempotent scripts).

Of course, if your migration script does not contain DDL commands, then in case of error it will be automatically rolled back by most of the DB engines.

Rollback Scripts. Flyway recently added a support for rollback scripts. For every versioned migration file you can create a script that rollbacks changes and uses naming convention U###_ instead of V###_ as you used to do with migration scripts. Then if V123_ migration fails, Flyway will run U123_ rollback script.

There is an UNDO command in Flyway that will trigger rollback manually. By default it will rollback the latest migration.

Another approach would be do a snapshot of the database before release and restore it if any migration goes wrong. This will be the cleanest approach since it will rollback all migrations in the release, but it is not always feasible (i.e. when update is made on live database or downtime for snapshot restoration is too long), rollback scripts should be used in this case.

It’s important to remember that a rollback strategy with Flyway rests solely on your shoulders. It will be your responsibility to create a proper rollback script. It helps tremendously if you use idempotent migrations as described earlier in this document. Your rollback will be much easier with idempotent scripts.

Flyway Log

Flyway will automatically use Log4J if it is available in the classpath. To change the level of logging and log SQL scripts output you can set the following options (set in logback.xml):

<logger name="org.flywaydb" level="DEBUG"/>

<logger name="org.flywaydb.core.internal.dbsupport.SqlScript" level="DEBUG"/>

References

To be continued….

Flyway 4.1+ doesn’t work with Percona’s PXC Strict Mode

After upgrading from Flyway 4.0 to Flyway 4.1/4.2 our migrations to Percona (MySQL) XtraDB cluster 5.6 started failing. The issue is apparently a Percona’s PXC Strict Mode that Percona recommends to be set to ENFORCING. The error we get is:

Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'flywayInitializer' defined in class path resource [org/springframework/boot/autoconfigure/flyway/FlywayAutoConfiguration$FlywayConfiguration.class]: Invocation of init method failed; nested exception is org.flywaydb.core.internal.dbsupport.FlywaySqlException: 
Unable to acquire MySQL named lock: Flyway-605484229
----------------------------------------------------
SQL State  : HY000
Error Code : 1105
Message    : Percona-XtraDB-Cluster prohibits use of GET_LOCK with pxc_strict_mode = ENFORCING

Apparently, the issue with the new version of Flyway is that they moved from locking the metadata table (VERSION_HISTORY) to named locks (GET_LOCK() function) which is considered experimental in Percona.

The workaround would be to enable the PERMISSIVE mode on Percona (at least temporarily):

mysql> SET pxc_strict_mode=PERMISSIVE;

This is not recommended by Percona itself, and DBA’s don’t like that either. So far we had to rollback to the previous version of Flyway 4.0 that was working fine in this setup.

I asked this question to Flyway guys, they should do something about it.

Useful links: 

https://www.percona.com/doc/percona-xtradb-cluster/5.7/features/pxc-strict-mode.html

https://github.com/rails/rails/issues/28666

https://github.com/flyway/flyway/issues/1556

https://stackoverflow.com/questions/45332033/flyway-4-1-4-2-incompatible-with-perconas-pxc-strict-mode

 

 

 

 

Change Data Capture (CDC) and Data Auditing in MySQL – Part 2

Implementing CDC in the Application Code

Implementing CDC in the application code is a popular approach because you have freedom here to do whatever you want with your data using your favorite programming language. Developers usually prefer this approach because it falls into their domain of knowledge.  The main disadvantage of this approach is that it doesn’t track database changes made outside of the application. Also keep in mind that performance is impacted because every auditing transaction utilizes application resources and goes over the network (instead of being executed by the database server itself).

There are two options to choose from when implementing the CDC in the application code: use CDC functionality pre-packed in Object-Relational Mapping (ORM) tool or develop your own. We will take a look at both.

CDC in ORM

So far the best CDC implementation in Java ORM world comes with Hibernate and is called Entity Versioning (Envers). When you apply the @Audited annotation to your entity or property then updates to the entity will be automatically tracked in the database in the separate table called your_entity_name_AUD. Let’s see how it works.

First, you need to add a dependency:

<dependency>

     <groupId>org.hibernate</groupId>

     <artifactId>hibernate-envers</artifactId>

     <version>${your.hibernate.version}</version>

</dependency>

Then you need to add an @Audited annotation to either your entity or a specific column. You can also use @NotAudited to exclude the column from being audited.

@Entity

@Table(name=”user”)

@Audited

public class User{

     @Id

     @GeneratedValue

     @Column(name=”id”)

     private int id;


     @Column(name=”name”)

     @NotAudited

     private String name;
}

Now you can set property hibernate.hbm2ddl.auto=create to automatically create audit tables in the database when SessionFactory is created. Hibernate will create a table user_AUD that will contain two additional fields:

  • REVTYPE – indicates what type of operation it was: 0 (add), 1 (modify), 2 (delete).
  • REV – revision number

If selected columns are annotated with @Audited, then only those selected columns will be added to the audit table.

Envers also provide a way to log additional data for revisions. You can annotate one entity in your code with @RevisionEntity and this entity will be used to track the revision info. The entity should have at least two properties:

  • An integer of long-valued property annotated with @RevisionNumber.
  • A long- or Date- valued property annotated with @RevisionTimestamp.

Those properties will be used to automatically track the revision numbers and associated update timestamps. You can add other properties too. For more info see: http://docs.jboss.org/envers/docs/#revisionlog.

You can query audit records using the AuditReader interface as following:

AuditReader reader = AuditReaderFactory.get(entityManager);

User oldUser = reader.find(User.class, id, revision_number);

Envers can be configured using a set of options, for example, you can change the prefix for the audit table. See here for more info (http://docs.jboss.org/envers/docs/#configuration ).

Other options to implement CDC in ORMs are limited. You can use @PrePersist, @PreUpdate and @PreRemove callback annotations in JPA, but you don’t have a framework that manages tables, so you will need to write your own code. With Spring Security, you can also track the user who updated data.

If you program in .NET, then you can use NHibernate (http://nhibernate.info ) ORM which includes Envers.

Custom Application-Based CDC

Sometimes you get requirements that just don’t fit into any readily available CDC solution described above. For example, instead of saving the audit log into some database, you might need to post data changes to some web service (a “push” approach), or you just simply don’t use any ORM platform. In this case you might need to implement your own CDC solution in your application code.

There is a plethora of ways to achieve this goal.  A lot depends on the ORM system you use (or the lack thereof). Before design your own audit solution you might want to check the following approaches:

  • Raise an event when some change was committed, and implement the standalone audit module that subscribes to those events.
  • Extend the ORM you use to catch events when change was committed and implement your own actions (i.e. @PrePersist, @PreUpdate and @PreRemove callback annotations in JPA).
  • Extend your logging module to incorporate audit logic.
  • Just add a piece of code that implements auditing when you execute an update query – this is overly simplified, but will be useful if you need to track one or two actions rather than changes to a whole database.

Many developers still maintain audit tables in a database, but instead of triggers they fill them up from the application. This approach has a few advantages:

  • You can fill up history tables asynchronously (the triggers are usually synchronous).
  • Errors in triggers do not affect your application queries.
  • Changes to your master tables may not affect your audit tables (since your code takes care of incompatibilities).

There is no universal way to write a CDC code in an application, it depends on your data layer design, the ORM that you use (Hibernate, JOOQ, MyBatis, etc.). Generally, you would want to duplicate every database query that updates data in your code to save a copy of your data into an audit table. The structure of audit tables plays an important role, let’s talk about it in the next section.

Audit Table Structure

There are two main approaches to design an audit table structure in a database, I usually call them the snapshot approach and the pivoted approach.

The snapshot approach can be considered the industry standard; it is implemented in most of the out-of-the-box CDC solutions. With this approach the audit table mimics the structure of the application table, with the addition of some metadata columns (for example, to track a transaction time or the user who performed the update).

Snapshot Approach

Action ID Name Phone Email Updated
Insert 1 Will 555-555-5555 w235@abc.net 02/01/17 2:00pm
Update 1 William 555-555-5555 w235@abc.net 02/14/17 1:00am

The other approach is pivoted approach, when you track the changes on the column level. It might look more efficient in terms of data storage, but it is much more effort to query the data.

Pivoted Approach

Table ID Column Old New Updated
Person 1 Name Will 02/01/17 2:00pm
Person 1 Phone 555-555-5555 02/01/17 2:00pm
Person 1 Email w235@abc.net 02/01/17 2:00pm
Person 1 Name Will William 02/14/17 1:00am

Pivoted approach has the following disadvantages:

  • Increases number of queries during inserts, increases locking time.
  • Needs more locks (one audit table for the whole database), but there are workarounds.
  • Makes more difficult to restore the number of entities.
  • Is more difficult to implement.

Therefore snapshot approach was widely adopted in the industry, and is used by for SQL Server and other database engines.

Other CDC Techniques

Version Numbers

A technique that you can use for capturing changes is to add a version number and/or modification timestamp properties to your entity and keep all the audit rows along with your production data in this same table.

ID Name Phone Email Updated Version
1 Will 555-555-5555 w235@abc.net 02/01/17 2:00pm 1
1 William 555-555-5555 w235@abc.net 02/14/17 1:00am 2

This technique is not very suitable for implementing CDC in busy OLTP applications, the reasons are:

  • Performance impact: your application always needs to select the latest row, and this operation uses additional resources.
  • Frequent updates result in frequent audit inserts into the same table and it clutters the table and degrades the performance of the application.
  • It’s easy to mess up with production data during maintenance.

It is always better if you store the historical data separately from the production data, so the aforementioned approach will never be on the top of my list for auditing CDC implementation. I still encourage people to always add the modification timestamp to every table, it allows to implement optimistic locking and really helps to debug issues in case of trouble.

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) is a more advanced variant of the versioning approach described above, so it shares its’ shortcomings if applied to OLTP systems. SCD is usually implemented in data warehouses, where data is periodically added, but rarely gets updated, (therefore “slowly changing dimensions” name).

There are several types of slowly changing dimensions, from type 1 to type 6 (plus some hybrid types). Some of them directly resemble the audit approach described above, for example type 4 is using the history tables in the same way as triggers do.

In this article we are focusing on CDC for auditing, and while slowly changing dimensions will handle this use case, they will not be on the top of my list for performance reasons.

Conclusion

The CDC approach that you choose depends on your requirements. Select one or combine several to better suite your needs and choose what will be easier to implement and maintain in your system. I am usually a fan of combined replication approach that involves running triggers on a slave, but this may not work in your case.

In the table below I summarized some pros and cons of different CDC approaches on MySQL platform, I hope it will help you with your decision. 

Binary Log Parsing
Advantages Disadvantages
Application performance is not affected Need to install, configure and maintain third-party components on your servers
Outputs in JSON, easy to parse Need to write a code for your own consumer and usually design a storage for data changes
Many open-source libraries available (Maxwell, Kafka) Third-party components are not always compatible with new versions of MySQL
Capture all changes to the database, including DDL changes May be an overkill for your project

 

Trigger-Based CDC
Advantages Disadvantages
Easy to implement, no need for third-party libraries Synchronous trigger execution, application performance is affected.
Everything is done in DB, transparent for application Exceptions in triggers will break application queries
Support for most of the transactional databases (including MySQL) Bad performance when doing bulk data operations in database (need to disable triggers temporarily)
Commercial solutions available (Talend)

 

Combined Binary Log and Trigger-Based CDC
Advantages Disadvantages
Binary log parsing is easy to implement using a built-in statement-based replication Need a second database instance for replication
All advantages of trigger-based approach If you use a row-based replication, then you will need a workaround
No impact on application performance or master database

 

Application-Based CDC
Advantages Disadvantages
Hibernate has built-in Envers Tracks application changes only
Log only the data you need Additional level of effort to implement by application developer
Use all functionality of your programming language to process data Changes in audit system will require another production deployment of your main application
Audit can be run asynchronously to avoid performance impact Audit processing utilizes resources on your application server
Implementation using developer’s favorite programming language Not suitable for auditing DDL changes
Good for pinpoint auditing May be too much effort to implement auditing of all tables in a large database

 

 

Change Data Capture (CDC) and Data Auditing in MySQL – Part 1

What CDC is used for?

Change Data Capture (CDC) is an approach to identify changes in a database, extract those changes and deliver them somewhere else, for example, to a set of audit tables, another database, a logging component or some other data consumer. CDC is also a name given to a set of design patterns that serves this purpose.

CDC is mostly used for:

  • Data warehousing
  • Replication
  • Auditing

Data warehousing. CDC patterns are widely used in data warehousing, for example, to maintain slowly changing dimensions (SCD). I am not going to focus on data warehousing in this article (MySQL is not an optimal choice for OLAP platform anyway, mostly because it’s not optimized for running analytical queries), but the principles of CDC are applied to all types of databases regardless of the purpose.

Replication is the major purpose of CDC since it is less time consuming to replicate changes than to dump and copy the whole database. Replication is usually based on database binary logs: all master database changes are recorded in binary logs and then picked up by replication agents and get replicated on all slaves.

Auditing. CDC can be used to track a history of data changes, in particular: who changed the record and when. A set of history tables is usually maintained in this case; those tables store the values before and after the change. Therefore, auditing, or keeping historical records, is a special case of CDC which is often a requirement for your application. CDC for auditing will be the main focus of this article.

CDC in SQL Server

Unfortunately, there is no CDC component that comes bundled with MySQL as of now, though other databases provide built-in CDC features. Let’s see how CDC is implemented in SQL Server. This approach will help us understand the principles that we can use for MySQL.

CDC in SQL Server
CDC in SQL Server (diagram taken from Microsoft web site)

As displayed on the diagram above, in SQL Server, all changes to data or schema are being recorded to a binary log. Binary log is parsed by a capture process, which saves all identified data differences to a set of change tables. From there, you can query the changes yourself or set up an ETL to copy them to a warehouse.

In SQL Server you can enable CDC by running the stored procedure sys.sp_cdc_enable_db. Then you specify which tables to track using the stored procedure sys.sp_cdc_enable_table. Transactions are tracked in the table cdc.lsn_time_mapping along with the log sequence number, the start and the end timestamps.

The change table is automatically created for each of the tracked tables and includes:

  • Metadata columns, such as: log sequence number, operation (insert, update, delete), mask which indicates the columns that where updated, etc.
  • The same set of columns that original table has.

<structure of SQL Server tables>

For each data insert, the row is added to the change table with values that are being inserted. For each delete, the row is added with the values before delete. For update, two rows are inserted into the change table, containing before and after update values.

SQL Server uses binary log as a source of changes. MySQL has a binary log too, and it uses it for replication, but MySQL doesn’t have the embedded CDC component that maintains change tables like SQL Server does. You can either write your own (major pain) or use some third-party solution.

CDC in MySQL

In MySQL you are pretty much limited to 3 approaches when implementing CDC:

  1. Use a third party component to parse binary logs.
  2. Use triggers to capture changes.
  3. Implement CDC in your application.

Which approach should you choose? It ultimately depends on your requirements. All three approaches have different levels of effort associated with them, so it’s imperative to select the easiest. Parsing binary logs may require significant infrastructure changes while triggers may negatively affect the performance, so check with your customers first, find out what their requirements are!

In the next sections I will cover those 3 approaches and the scenarios to use them for.

Binary Log Parsing

If your customer’s requirement is to capture ALL changes, including DDL changes, then binary log is the best way to go. You will need to use some third-party parser for binary logs (unless you are paid to write your own) and configure the infrastructure. You will also need to write the code to process the changes that usually come in JSON format.

Now let’s be clear on this one. If you just want to maintain a copy of your database somewhere, then you should use MySQL Replication, the embedded feature. It parses binary logs automatically and maintains the exact copy of the database on a different server. Still, if you need to perform any custom action with your change data, or your destination database is different from the origin (i.e. it’s a warehouse), then you need to fall back to some third-party CDC solution.

This is how binary log parsing works: the parser pretends to be a replication slave, but instead of copying data to another server, it streams it out to JSON or some other format. The consumer of JSON is responsible for processing and storing those changes.

The biggest advantage of the binary log parsing is that it doesn’t affect the application or the database. There is almost no performance impact on the application, and no changes to a database schema or to an application code are required. The disadvantage of this approach is that you will need to install several third party components and, depending on your requirements, write your own code to process database changes.

There are many binary log parsing solutions for MySQL out there, check this list, for example (and it still doesn’t include all of them). Let’s see how one of the open-source CDC tools works.  The tool is called Maxwell and it’s advertised as MySQL+Kafka solution. Let’s see how it works.

First, let’s create a table in our main database that we will track:

CREATE TABLE IF NOT EXISTS `person` (
 `id` BIGINT NOT NULL AUTO_INCREMENT COMMENT 'a database generated ID sequence',
 `nickname` VARCHAR(40) NOT NULL COMMENT 'nickname ',
 `eff_date` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'effective Date',
 `exp_date` TIMESTAMP NULL DEFAULT NULL COMMENT 'expiration Date',
 `first_name` VARCHAR(50) NOT NULL COMMENT 'first name',
 `last_name` VARCHAR(50) NOT NULL COMMENT 'last name',
 `org` VARCHAR(200) NOT NULL COMMENT 'organization',
 `created_dt` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'date when records was created',
 `last_modified_dt` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'date when records was last modified',

 PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Now download Maxwell and configure it as described here. You need to switch the binary log format of MySQL to row-based and set the server-id of your MySQL instance. You also need to create a user “Maxwell” in your MySQL database.

You can then run Maxwell with stdout output using command:

./bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' --producer=stdout --binlog_connector=true

Maxwell will start to track your events and will shoot out results to command line in JSON format. Now let’s run the insert query in our main database:

insert into PERSON( nickname, eff_date,exp_date,first_name,last_name,org)
 values ('Den', current_timestamp, null, 'Dennis','Menace','My Org');

You will see Maxwell’s output in the command window:

16:12:15,456 INFO BinaryLogClient - Connected to 127.0.0.1:3306 at master.000004/21513 (sid:6379, cid:31)
{"database":"sample","table":"person","type":"insert","ts":1494879351,"xid":3210,"data":{"id":1,"nickname":"Den","eff_date":"2017-05-15 20:15:51","exp_date":null,"first_name":"Dennis","last_name":"Menace","org":"My Org","created_dt":"2017-05-15 20:15:51","last_modified_dt":"2017-05-15 20:15:51"}}

Now you can install and configure Kafka (http://kafka.apache.org/quickstart ) as recommended by Maxwell authors, and start streaming your log events.

This article has a nice example of the nodeJS script that will save your Maxwell events to MongoDB or MySQL.

Now think of your requirements again: do you want to install and maintain a third-party component (i.e. Maxwell) in production, install and maintain Kafka (another component to stream data), introduce dependencies, write a code for JSON consumer, design a storage to keep all this info, if your customer just needs a history of password changes?

Many articles will recommend binary log parsing for CDC applications, but it may be an overkill for your project. Before making a decision check the alternatives below.

Trigger-Based CDC Approach

Many people have biased opinions against triggers in MySQL. This is coming from times when MySQL triggers were half-baked and very limited. With recent versions of MySQL though (5.6+), triggers improved significantly, up to the point when it actually makes sense to use them.

Developers usually hate triggers because of two reasons: triggers fall out of their area of expertise and triggers cause unexpected data mutations. Triggers are usually executed in the same transaction with SQL query, so if trigger fails – the whole query will fail. The application code can be unaware of triggers in a database that may cause a surprise exception. Also, if trigger implements a lengthy operation, an application thread will be delayed as well, which impacts application performance. The latter is particularly true when you perform bulk data operations on tables that have triggers.

All those considerations should not stop you from using triggers if you use them correctly. Triggers were successfully utilized in CDC applications for decades, and there are solid commercial CDC solutions based on triggers out there, for example CDC from Talend. The advantage of trigger-based CDC approach is that it’s relatively easy to implement and it doesn’t require updates to your application code or infrastructure changes.

Let’s see an example of a trigger-based CDC in a sample project. First, we create or production table ‘person’ and the audit table `person_audit` that includes all the fields from `person` plus some special fields:

  • audit_id – we need it as a primary key, since original person id field will be duplicated in the audit table;
  • action – indicates what action was performed: insert, update, delete;
  • audit_dt – date and time when audit record was created, we will never update this record afterwards, so no need for the last_update field.

We will also need to create 3 triggers: one for the insert event, one for the update event and one for the delete event. Note that update trigger will put two records into the audit table: one with the old values and one with the new values. It will be easy to identify what columns were updated as part of the transaction by comparing those two records.

The SQL script is the following:

CREATE TABLE IF NOT EXISTS `person` (
 `id` BIGINT NOT NULL AUTO_INCREMENT COMMENT 'a database generated ID sequence',
 `nickname` VARCHAR(40) NOT NULL COMMENT 'nickname ',
 `eff_date` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'effective Date',
 `exp_date` TIMESTAMP NULL DEFAULT NULL COMMENT 'expiration Date',
 `first_name` VARCHAR(50) NOT NULL COMMENT 'first name',
 `last_name` VARCHAR(50) NOT NULL COMMENT 'last name',
 `org` VARCHAR(200) NOT NULL COMMENT 'organization',
 `created_dt` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'date when records was created',
 `last_modified_dt` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'date when records was last modified',

 PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


CREATE TABLE IF NOT EXISTS `person_audit` (
 `audit_id` BIGINT NOT NULL AUTO_INCREMENT COMMENT 'a database generated ID sequence',
 `action` VARCHAR(100) NULL COMMENT 'action taken with record: insert, update, delete',
 `person_id` BIGINT NULL COMMENT 'person id',
 `nickname` VARCHAR(40) NULL COMMENT 'nickname ',
 `eff_date` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'effective date',
 `exp_date` TIMESTAMP NULL DEFAULT NULL COMMENT 'expiration date',
 `first_name` VARCHAR(50) NULL COMMENT 'first name',
 `last_name` VARCHAR(50) NULL COMMENT 'last name',
 `org` VARCHAR(200) NULL COMMENT 'organization',
 `created_dt` TIMESTAMP NULL COMMENT 'date when record was created',
 `last_modified_dt` TIMESTAMP NULL COMMENT 'date when record was last modified',
 `audit_dt` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'date when audit record was created',
 PRIMARY KEY (`audit_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

delimiter //
 CREATE TRIGGER `person_after_insert`
 AFTER INSERT ON `person`
 FOR EACH ROW BEGIN
 insert into `person_audit` (`action`, person_id, nickname, eff_date, exp_date, first_name, last_name, org, created_dt, last_modified_dt)
 values ('insert', NEW.id, NEW.nickname, NEW.eff_date, NEW.exp_date, NEW.first_name, NEW.last_name, NEW.org, NEW.created_dt, NEW.last_modified_dt);
 END;//

CREATE TRIGGER `person_after_update`
 AFTER UPDATE ON `person`
 FOR EACH ROW BEGIN
 insert into `person_audit` (`action`, person_id, nickname, eff_date, exp_date, first_name, last_name, org, created_dt, last_modified_dt)
 values ('update: old', OLD.id, OLD.nickname, OLD.eff_date, OLD.exp_date, OLD.first_name, OLD.last_name, OLD.org, OLD.created_dt, OLD.last_modified_dt);

insert into `myuser_audit` (`action`, person_id, nickname, eff_date, exp_date, first_name, last_name, org, created_dt, last_modified_dt)
 values ('update: new', NEW.id, NEW.nickname, NEW.eff_date, NEW.exp_date, NEW.first_name, NEW.last_name, NEW.org, NEW.created_dt, NEW.last_modified_dt);
 END;//

CREATE TRIGGER `person_after_delete`
 AFTER DELETE ON `person`
 FOR EACH ROW BEGIN
 insert into `person_audit` (`action`, person_id, nickname, eff_date, exp_date, first_name, last_name, org, created_dt, last_modified_dt)
 values ('delete', OLD.id, OLD.nickname, OLD.eff_date, OLD.exp_date, OLD.first_name, OLD.last_name, OLD.org, OLD.created_dt, OLD.last_modified_dt);
 END;//

Note that all the audit table fields that were copied from the master table are defined as nullable in the audit table. This will help us to avoid trigger failure in case of constraint violations.

In our example we don’t track what user updated the record, because we don’t track this in the master table. If you add the column `updated_by` to the `person` table and fill it from the application, then it’s easy to add this field to the audit table and the trigger code.

Another advantage of the trigger-based CDC is that application stays unaware of the fact that you have added the whole audit layer. You need to be careful when you update the database schema though, since changes in the master tables will require changes in the audit tables and possibly in the trigger code.

Combined Binary Log and Trigger-Based Approach

Many people combine the binary log parsing approach with the trigger-based approach to allow triggers running asynchronously. The concept is simple. A master database is getting replicated to a slave database, which has CDC triggers running. The application queries running against the master database will not be affected if the triggers in the slave database fail.

This approach requires additional resources to host a second database, but if you have a cluster anyway, it makes sense to offload all audit work from a master database to a slave. You don’t usually need some third-party CDC solution to do replication, the MySQL built-in feature is usually sufficient.

Note that this approach is sensitive to the type of replication that you are using. When you use a statement-based replication, then you can run triggers on a slave. If you use a row-based replication, then you can’t use triggers on a slave, so this approach will not work out of the box.   Still, there are several workarounds, for example, you can use a third-party CDC instead of a built-in MySQL replication. MariaDB has a special option slave_run_triggers_for_rbr to overcome this limitation.

In Part 2 we will talk about implementing the CDC in the application code.