A while back, I wrote Careful with That Lock, Eugene, about an idea for how to check whether a database migration is likely to disturb production. That post came about after an inspiring chat with a colleague about the advantages of transactional migration scripts, and the ability to check the postgres system catalog views before committing a transaction.
Over the past few weeks, I've been experimenting with this idea, to see if I can use it to build valuable safety checks for DDL migrations: kind of like shellcheck, but for postgres DDL.
At this point, I've made enough progress to share some results.
I've been working on a Rust project that compiles to a command line tool named `eugene`, and I've published it to crates.io and ghcr.io/kaaveland/eugene. This post is about what `eugene` can do, how it is implemented, and what I hope to achieve with it in the future.
Demo: `eugene trace` report

Suppose I have a migration script named `alter_column_not_null.sql` that looks like this:
```sql
-- alter_column_not_null.sql
-- create table books(id serial primary key, title text);
-- fix: the title column should be not null
alter table books alter column title set not null;
-- fix: the title column should be unique
alter table books add constraint title_unique unique (title);
```
There exists some table `books` already, and we want to enforce both that the `title` column is `not null` and that it is unique. Suppose you commit this to a branch and make a pull request; a minute later, the pull request is updated with a comment generated by `eugene trace -f markdown alter_column_not_null.sql` that looks like this:
Statement number 1 for 10 ms

SQL

```sql
alter table books alter column title set not null;
```

Locks at start

No locks held at the start of this statement.

New locks taken

Schema | Object | Mode | Relkind | OID | Safe |
---|---|---|---|---|---|
public | books | AccessExclusiveLock | Table | 1 | ❌ |
Hints
Validating table with a new `NOT NULL` column

ID: `make_column_not_nullable_with_lock`

A column was changed from `NULL` to `NOT NULL`. This blocks all table access until all rows are validated. A safer way is: Add a `CHECK` constraint as `NOT VALID`, validate it later, then make the column `NOT NULL`.
The column `title` in the table `public.books` was changed to `NOT NULL`. If there is a `CHECK (title IS NOT NULL)` constraint on `public.books`, this is safe. Splitting this kind of change into 3 steps can make it safe:

1. Add a `CHECK (title IS NOT NULL) NOT VALID;` constraint on `public.books`.
2. Validate the constraint in a later transaction, with `ALTER TABLE public.books VALIDATE CONSTRAINT ...`.
3. Make the column `NOT NULL`.
Taking dangerous lock without timeout

ID: `dangerous_lock_without_timeout`

A lock that would block many common operations was taken without a timeout. This can block all other operations on the table indefinitely if any other transaction holds a conflicting lock while `idle in transaction` or `active`. A safer way is: Run `SET lock_timeout = '2s';` before the statement and retry the migration if necessary.

The statement took `AccessExclusiveLock` on the Table `public.books` without a timeout. It blocks `SELECT`, `FOR UPDATE`, `FOR NO KEY UPDATE`, `FOR SHARE`, `FOR KEY SHARE`, `UPDATE`, `DELETE`, `INSERT`, `MERGE` while waiting to acquire the lock.
Statement number 2 for 10 ms

SQL

```sql
alter table books add constraint title_unique unique (title);
```

Locks at start

Schema | Object | Mode | Relkind | OID | Safe |
---|---|---|---|---|---|
public | books | AccessExclusiveLock | Table | 1 | ❌ |

New locks taken

Schema | Object | Mode | Relkind | OID | Safe |
---|---|---|---|---|---|
public | books | ShareLock | Table | 1 | ❌ |
Hints

Running more statements after taking `AccessExclusiveLock`

ID: `holding_access_exclusive`

A transaction that holds an `AccessExclusiveLock` started a new statement. This blocks all access to the table for the duration of this statement. A safer way is: Run this statement in a new transaction.

The statement is running while holding an `AccessExclusiveLock` on the Table `public.books`, blocking all other transactions from accessing it.
Creating a new index on an existing table

ID: `new_index_on_existing_table_is_nonconcurrent`

A new index was created on an existing table without the `CONCURRENTLY` keyword. This blocks all writes to the table while the index is being created. A safer way is: Run `CREATE INDEX CONCURRENTLY` instead of `CREATE INDEX`.

A new index was created on the table `public.books`. The index `public.title_unique` was created non-concurrently, which blocks all writes to the table. Use `CREATE INDEX CONCURRENTLY` to avoid blocking writes.
Creating a new unique constraint

ID: `new_unique_constraint_created_index`

Found a new unique constraint and a new index. This blocks all writes to the table while the index is being created and validated. A safer way is: `CREATE UNIQUE INDEX CONCURRENTLY`, then add the constraint using the index.

A new unique constraint `title_unique` was added to the table `public.books`. This constraint creates a unique index on the table, and blocks all writes. Consider creating the index concurrently in a separate transaction, then adding the unique constraint by using the index: `ALTER TABLE public.books ADD CONSTRAINT title_unique UNIQUE USING INDEX public.title_unique;`
Taking dangerous lock without timeout

ID: `dangerous_lock_without_timeout`

A lock that would block many common operations was taken without a timeout. This can block all other operations on the table indefinitely if any other transaction holds a conflicting lock while `idle in transaction` or `active`. A safer way is: Run `SET lock_timeout = '2s';` before the statement and retry the migration if necessary.

The statement took `ShareLock` on the Table `public.books` without a timeout. It blocks `UPDATE`, `DELETE`, `INSERT`, `MERGE` while waiting to acquire the lock.
How to perform this change safely

`eugene` has some concrete suggestions for changes that would make this migration safer. We should always use `lock_timeout` when taking locks, and the hints give us a strategy for performing the migration in several more steps, spending less time holding dangerous locks overall:
Step 1: Add a `CHECK` constraint that is `NOT VALID`

This still takes the `AccessExclusiveLock` on `books`, but it no longer needs to validate all the rows, so the lock should be held for a shorter time. From this point on, inserts and updates on `books` must satisfy `title is not null`; only pre-existing rows remain unvalidated.

```sql
set lock_timeout = '2s';
alter table books add constraint title_not_null check (title is not null) not valid;
```
Step 2: Validate the constraint

This will take a `ShareUpdateExclusiveLock` on `books`, which is less restrictive than the `AccessExclusiveLock`. This is possible because only "old rows" could possibly be `null`, so there's no need to guard against inserts or updates with a lock.

```sql
set lock_timeout = '2s';
alter table books validate constraint title_not_null;
```
Step 3: Make the column not null

At this point, postgres already knows that `title` is not null, so the `AccessExclusiveLock` is only necessary for updating the catalog, which is instant:

```sql
set lock_timeout = '2s';
alter table books alter column title set not null;
```
Step 4: Add the unique constraint concurrently

This statement may run for a long time, but it builds the index and validates uniqueness without blocking reads or writes on the table. Note that `create index concurrently` can not run inside a transaction block:

```sql
create unique index concurrently title_unique_idx on books(title);
```
Step 5: Add the unique constraint using the index

This is also instant now, since the index is already built and postgres knows that `title` is unique. The `AccessExclusiveLock` is again only necessary for updating the catalog:

```sql
set lock_timeout = '2s';
alter table books add constraint title_unique unique using index title_unique_idx;
```
Summary

In this instance, `eugene` was able to provide some valuable hints about how to write migrations that achieve the same result, but are much less likely to cause disturbances in the database. There are many example migration scripts and reports like this in the examples directory of the repository.
When do you need a tool like `eugene`?

Breaking small database changes into this many steps isn't for everyone. I have worked on many systems where we did migrations all the time without safeguards like this, and most of the time, it was fine. In many cases, some amount of downtime was acceptable and we could stop the application prior to running DDL. In many other cases, `lock_timeout` alone would have saved us from issues with slow concurrent transactions and lock queues, as illustrated below.
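To make the lock queue problem concrete, here's a sketch (mine, not output from eugene) of how a single slow transaction plus a waiting migration can block all traffic on the `books` table from the example. The session labels are comments, so this is three concurrent connections rather than one script:

```sql
-- Session A: a slow query holds AccessShareLock on books:
begin;
select count(*) from books;

-- Session B: a migration queues up behind A, waiting for AccessExclusiveLock:
alter table books alter column title set not null;

-- Session C: ordinary traffic now queues up behind B's pending lock request:
select title from books where id = 1; -- blocked!

-- Had session B first run `set lock_timeout = '2s';`, it would abort with
-- "canceling statement due to lock timeout" instead of stalling all the
-- traffic piling up behind it.
```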
There is a lot of value in simplicity, and in simple migration scripts, if you can afford it. Here are some things you may need to deal with if you use the safer approach recommended by `eugene`:
- If your statements are unable to take their locks within 2 seconds, your script fails. You may need retries around your migration deployment.
- Creating indexes concurrently can take a long time, and you must wait for the index build to finish before you can run the next migration. You must also be prepared to handle a failure, such as a deadlock or a unique violation, that prevents the index build from completing; see the sketch after this list.
- Invariably, working like this is going to make you create a lot more migration scripts than you otherwise would have.
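On that last point about failed index builds, a hedged sketch (reusing `title_unique_idx` from the steps above): a `create index concurrently` that fails leaves an `INVALID` index behind, which has to be dropped before the build can be retried:

```sql
-- A failed concurrent build (deadlock, duplicate titles, ...) leaves an
-- INVALID index that postgres maintains but never uses for queries;
-- drop it before retrying the build:
drop index concurrently if exists title_unique_idx;
create unique index concurrently title_unique_idx on books(title);
```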
But if you'd like to learn more about how postgres locking works, and how to avoid common pitfalls when doing migrations, `eugene` can be a valuable tool. If you're working on a system where some tables are very large, or where there are many concurrent transactions of which some are slow, `eugene` could help you avoid "stopping the world" in production when you're executing migrations. On the whole, I've certainly learned a lot about how postgres works while making `eugene`, and I hope that I can impart some of that learning to others by seeing the tool adopted.
How does `eugene trace` work?

`eugene` has a lot of static data built into it. For example:

- It knows which lock modes conflict with each other. I've lifted this from the excellent postgres documentation about locks.
- It has lots of helpful text snippets that can be used to generate hints. I've lifted many of these from the awesome strong_migrations README.

But the primary mechanism it uses is to break the SQL script down into individual statements, which are then run in a transaction. Before and after each statement, important database state is saved in memory, which enables us to calculate the effect that the statement had on several important system views in the database:
- `pg_class` helps eugene figure out which database objects are visible to other transactions (and maybe in the future, which objects have been rewritten).
- `pg_locks` helps eugene figure out which locks it owns.
- `pg_attribute` helps eugene figure out which schema changes were made to tables.
- `pg_constraint` helps eugene figure out which constraints were added, altered or removed.
This means that there's a lot of knowledge that `eugene` does not need to have: for example, it does not need a very good SQL parser, or to keep track of which versions of postgres exhibit which behaviour. If postgres 17 can add constraints while taking lesser locks, `eugene` will be able to figure that out.
Running the transaction, then optionally committing it, results in a big data structure that is essentially a layered diff of database state. `eugene trace -f json` can emit this structure so that it can be processed by tools like `jq`, and `eugene trace -f md` can generate a markdown report like the one above. Both of these are essentially just views of the same layered diff.
How are hints implemented?

It is easy to add new hints to `eugene`. They consist of a lot of static text, and then a rule that checks a statement-level diff for certain conditions. If the conditions are met, the hint is emitted. Here's an example of the hint rule for the `running_statement_while_holding_access_exclusive` hint:
```rust
fn running_statement_while_holding_access_exclusive(
    sql_statement_trace: &FullSqlStatementLockTrace,
) -> Option<String> {
    // Look for an AccessExclusiveLock that was already held
    // when this statement started executing
    let lock = sql_statement_trace
        .locks_at_start
        .iter()
        .find(|lock| lock.mode == "AccessExclusiveLock")?;
    let help = format!(
        "The statement is running while holding an `AccessExclusiveLock` on the {} `{}.{}`, \
         blocking all other transactions from accessing it.",
        lock.relkind, lock.schema, lock.object_name,
    );
    Some(help)
}
```
Here's another one for discovering columns that changed from `NULL` to `NOT NULL`:
```rust
fn make_column_not_nullable_help(
    sql_statement_trace: &FullSqlStatementLockTrace,
) -> Option<String> {
    // Look for a column that went from nullable to not nullable
    // in this statement
    let column = sql_statement_trace
        .altered_columns
        .iter()
        .find(|column| !column.new.nullable && column.old.nullable)?;
    let table_name = format!("{}.{}", column.new.schema_name, column.new.table_name);
    let col_name = column.new.column_name.as_str();
    let help = format!(
        "The column `{col_name}` in the table `{table_name}` was changed to `NOT NULL`. \
         If there is a `CHECK ({col_name} IS NOT NULL)` constraint on `{table_name}`, this is safe. \
         Splitting this kind of change into 3 steps can make it safe:\n\n\
         1. Add a `CHECK ({col_name} IS NOT NULL) NOT VALID;` constraint on `{table_name}`.\n\
         2. Validate the constraint in a later transaction, with `ALTER TABLE {table_name} VALIDATE CONSTRAINT ...`.\n\
         3. Make the column `NOT NULL`\n",
    );
    Some(help)
}
```
I am hoping for some help to add many more to `hints.rs` in the future; I am certain that there are many kinds of migration patterns that I do not know about. If you install the tool, you can run `eugene hints` to display general information about the hints it knows about, for example:
```json
{
  "hints": [
    {
      "id": "validate_constraint_with_lock",
      "name": "Validating table with a new constraint",
      "condition": "A new constraint was added and it is already `VALID`",
      "effect": "This blocks all table access until all rows are validated",
      "workaround": "Add the constraint as `NOT VALID` and validate it with `ALTER TABLE ... VALIDATE CONSTRAINT` later"
    } // and many more
  ]
}
```
Where to get eugene

You can install `eugene` from crates.io using `cargo install eugene`, or you can use the docker image ghcr.io/kaaveland/eugene to run the tool. I have been trying to document it well in the github repository. If you want to play around with it, I suggest you clone the repository. There is a `docker-compose.yml` that boots up a postgres database which is compatible with the example migrations in `examples/`, so this should get you started:
```shell
docker-compose up -d # boots postgres on 5432
cargo run trace -f markdown examples/alter_column_not_null.sql \
  | glow # glow renders markdown in the terminal beautifully
```
If you make any changes to the source code, you can run `cargo test` to run the tests, and look at the diff on all the markdown files in `examples/` to see the effect of your change. The test setup uses another neat postgres trick from this blog to make sure that the tests can run DDL changes in parallel without conflicting with each other or causing deadlocks. Once you're satisfied with your changes, you can run `cargo install --path .` to install the tool to your system and try it out on your own migrations.
Future work

I have a lot of ideas for how to improve `eugene`. Here are some of them:

- Add more hints. I am hoping for ideas on the issue tracker.
- Read and parse `pg_stat_statements` to find out which queries could conflict with migrations.
- Add a `eugene ci` command that is suitable for integration with CI/CD pipelines, so that you can run `eugene ci` on a branch and get a report on the pull request.
- Add mechanisms to make `eugene` work like a linter, so it can optionally "fail the build".
- Add detection of table and index rewrites, so `eugene` can tell you if a migration could take an unexpectedly long time.
- Automatically generate safer migration scripts where it is possible to do this in a deterministic manner.
I am hoping to get some feedback on the tool, and I would love some suggestions on the issue tracker for where to go from here.
This post was originally posted on kaveland.no.