On Constraints

This memo is for novice devs.

Designing a properly functioning database schema can be a tough task. You must account for many important things: normalization, well-chosen indexes, isolation levels, constraints, and more. This time I want to discuss one simple case pertaining to constraints.

Schema

Imagine a table with a primary key, a product ID, a rating category, and a product position. The last column holds an index (not to be confused with a database index) of a product's popularity among customers within a rating category. The higher the position, the more popular the product.

ID (PK) | Product ID (FK) | Category ID (FK) | Position
--------|-----------------|------------------|---------
1       | 1000            | 150              | 2
2       | 1001            | 150              | 1
3       | 1002            | 150              | 4
4       | 1003            | 150              | 3
5       | 1004            | 150              | 5
6       | 1005            | 200              | 1
7       | 1006            | 200              | 3
8       | 1007            | 200              | 2

Next, we define several rules about the composition of the table's data. First, every product can appear in only one category: say, a vacuum cleaner cannot be a member of the kitchen knives category. Second, within a category, a single position value cannot be shared by several products.

The business logic manages this table by inserting and deleting rows and by updating the position column.

Constraints

The primary key is always unique. The product ID is a one-to-one foreign key, so it carries the same uniqueness constraint. On the other hand, the category ID is a many-to-one key and cannot be unique. What about the position column? Which constraints fit it, and why?

Any position must be a positive integer. Modern DBMSs let us express this with a check constraint like position > 0. To enforce the requirement that every product within a category has a unique position, we can tie the column to the category ID, making a uniqueness constraint over the pair (category_id, position).
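A minimal sketch of both constraints, assuming SQLite and hypothetical names (a products table with product_id, category_id, and position columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id          INTEGER PRIMARY KEY,
        product_id  INTEGER NOT NULL UNIQUE,       -- one product, one category
        category_id INTEGER NOT NULL,
        position    INTEGER NOT NULL CHECK (position > 0),
        UNIQUE (category_id, position)             -- no shared positions per category
    );
    INSERT INTO products (product_id, category_id, position) VALUES (1000, 150, 2);
""")

# A second product at an occupied position in the same category is rejected.
try:
    conn.execute(
        "INSERT INTO products (product_id, category_id, position) VALUES (1001, 150, 2)"
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The same pair can happily exist in another category; only the combination is unique.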

Are there any benefits to the last constraint? Well, it prevents a process from inserting a product into a category at a position that is already taken. Sounds good and tempting. Anything else? With concurrent access to the database, this constraint may let us get away with a weaker isolation level for transactions.

With the Read Committed and Repeatable Read isolation levels, an application trying to insert a row into the table or update an existing one may stumble upon a constraint violation, which prevents a breach of database consistency. If this constraint were not on the table, the application would need to serialize access to the data or manually lock the affected rows to prevent possible duplicates.

At the same time, this constraint does not free us from considering stricter isolation levels or locking approaches. For example, even though two processes cannot insert rows with the same position for a shared category, there are still cases where race conditions can threaten data consistency.

Issue

An experienced dev or DBA has already noticed the elephant in the room. When an application inserts a row or updates an existing one, it may violate the constraint. If we want to change a position within a category, we must be sure the new value is not already taken there. For insertion this is not a problem, because newly inserted data is simply appended; if a parallel process inserts data with the same position, we get an exception and act accordingly. But in the swap case, when position numbers are exchanged between two existing rows, the constraint makes things difficult.

Every query to the database (inside a transaction or not) applies in sequence. Within a transaction there is nothing wrong with temporarily violating data integrity, as long as integrity holds once the transaction commits (or rolls back). Even with this property, there is no way to perform the position swap: we cannot temporarily break integrity, because the very first query fails with a constraint violation.

BEGIN;
UPDATE table SET position=4 WHERE id=5;
UPDATE table SET position=5 WHERE id=3;
COMMIT;

No matter how we arrange the lines inside the transaction, it will never be committed.
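A sketch of that failure, assuming the hypothetical products table from before, with id=3 at position 4 and id=5 at position 5 in category 150:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id          INTEGER PRIMARY KEY,
        product_id  INTEGER NOT NULL UNIQUE,
        category_id INTEGER NOT NULL,
        position    INTEGER NOT NULL CHECK (position > 0),
        UNIQUE (category_id, position)
    );
    INSERT INTO products VALUES (3, 1002, 150, 4), (5, 1004, 150, 5);
""")

try:
    with conn:  # one transaction; rolls back on exception
        conn.execute("UPDATE products SET position = 4 WHERE id = 5")  # 4 is still held by id=3
        conn.execute("UPDATE products SET position = 5 WHERE id = 3")
except sqlite3.IntegrityError as e:
    print("swap failed:", e)
```

The first UPDATE collides with the other row's old value, the transaction rolls back, and the data stays as it was.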

Solutions

There are several possible solutions, each with its pros and cons. Not all of them require sacrificing the constraint.

As I mentioned, concurrent access to the database opens the door to race conditions. In our example, they can appear when we need to read the table (a category) first and change the positions afterwards (see the read-modify-write anti-pattern). This is true for every solution below.

Update With Case

Many modern DBMSs support a safe and atomic way to solve the riddle. By atomic, I mean a single query that swaps values between two rows at once.

BEGIN;
UPDATE table SET position = CASE
    WHEN position=4 THEN 5
    WHEN position=5 THEN 4
END
WHERE id IN (3, 5);
COMMIT;

I think the snippet above is pretty self-explanatory.

This solution preserves the constraint thanks to its atomic nature. One caveat: some engines (PostgreSQL and SQLite among them) check a non-deferrable unique constraint after each row even within a single statement, so the swap succeeds only where checks happen at statement level (SQL Server, for example) or where the constraint is declared deferrable.
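Whether the single-statement swap succeeds depends on when the engine validates the constraint. SQLite, for one, checks the unique pair row by row even inside one statement, so the CASE swap can still be rejected there. A sketch under the hypothetical products schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id          INTEGER PRIMARY KEY,
        category_id INTEGER NOT NULL,
        position    INTEGER NOT NULL CHECK (position > 0),
        UNIQUE (category_id, position)
    );
    INSERT INTO products VALUES (3, 150, 4), (5, 150, 5);
""")

try:
    conn.execute("""
        UPDATE products SET position = CASE
            WHEN position = 4 THEN 5
            WHEN position = 5 THEN 4
        END
        WHERE id IN (3, 5)
    """)
    print("swapped in one statement")
except sqlite3.IntegrityError as e:
    # The first updated row was checked against the other row's
    # old position, so the whole statement was backed out.
    print("per-row check tripped:", e)
```

Either way the statement is atomic: it fully applies or fully rolls back, so the pair constraint is never left violated.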

Temporary Values

A curious mind may come up with a pretty obvious but doubtful solution: what if we assign the conflicting rows some temporary, unique values? Very big ones, say, that are guaranteed not to be present in the table.

If we know the rows in advance, this requires a transaction of at least three queries: the first sets a temporary value for one conflicting row, and the others set the intended values. Now scale that up to several thousand rows.
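A sketch of the three-query swap under the same hypothetical products schema, with 1_000_000 as the assumed never-used temporary position:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id          INTEGER PRIMARY KEY,
        category_id INTEGER NOT NULL,
        position    INTEGER NOT NULL CHECK (position > 0),
        UNIQUE (category_id, position)
    );
    INSERT INTO products VALUES (3, 150, 4), (5, 150, 5);
""")

TEMP = 1_000_000  # assumed to be far above any real position

with conn:  # swap positions of id=3 and id=5 in one transaction
    conn.execute("UPDATE products SET position = ? WHERE id = 5", (TEMP,))  # park id=5
    conn.execute("UPDATE products SET position = 5 WHERE id = 3")           # 5 is now free
    conn.execute("UPDATE products SET position = 4 WHERE id = 5")           # 4 is now free
```

Every step keeps the pair constraint satisfied, which is exactly why it needs three statements instead of two.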

If we do not know them, we must query the table, process the returned data, and only then apply our transaction. Yet another round trip to the database.

In the end, we issue many redundant queries to the database, which is not good at all. This can work on small datasets under low load but should be avoided in serious production.

Chicken With No Head

The simplest way is to get rid of the constraint. For example, if we know our architecture well and are 100% sure no other process can interfere with manipulations of the table in question, we are free to remove the constraint. Say, the table is managed by a single process (no parallel or cooperative access), or there are several processes but each is in charge of a dedicated category. This way we can guarantee data integrity ourselves.

At the same time, if we are completely sure about the assumptions above, nothing stops us from using the first solution, right?

Not necessarily. Imagine we have some third-party API (it may be our own microservice or a public one) that produces data for us. Let's say it tracks the index of product popularity. We gather the API's data and save it into our table; then our backend retrieves the data (maybe caches it) and returns it to end users via one of its endpoints. Sounds pretty usual.

So far, so good. How does this relate to our constraint? First, we no longer need to query the table before swapping positions; all we need is to save the external data there. This spares us discrepancies between the data we have read and the data the database contains at update time. Second, the first two solutions may not be so handy this time.

The reason is that they would still force us to query the table, compare its data with the API's data, and only then produce updates. What do we really need in this case? We need to either insert new rows into the table or update existing ones in a single action. There is the so-called UPSERT: an INSERT statement that updates the row with the desired values when the expected conflict arises.

BEGIN;
INSERT INTO table (product_id, position) VALUES (1000, 2) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position;
INSERT INTO table (product_id, position) VALUES (1001, 1) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position;
INSERT INTO table (product_id, position) VALUES (1002, 5) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position;
INSERT INTO table (product_id, position) VALUES (1003, 3) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position;
INSERT INTO table (product_id, position) VALUES (1004, 4) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position;
INSERT INTO table (product_id, position) VALUES (1008, 6) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position;
COMMIT;

As you can see, we just insert all rows for a single category. Every insert except the last updates the position with the provided value; the last one inserts a new row. The prominent problem here is updating existing rows with the same values they already contain. Look at the first two queries: their positions do not change, but we update them anyway. This produces unnecessary writes and, if your DBMS uses row versioning, a pile of dead row versions on top. Let's improve the transaction to cope with this issue.

BEGIN;
INSERT INTO table (product_id, position) VALUES (1000, 2) ON CONFLICT (product_id) DO UPDATE SET position=EXCLUDED.position WHERE table.position IS DISTINCT FROM EXCLUDED.position;
...
COMMIT;

This time the first row (the rest are the same but omitted for brevity) is updated only when its current position is distinct from the new one.

Last but not least, the UPSERT cannot be combined with the constraint in question: the incoming batch may legitimately move a product onto a position another row still holds mid-batch, which is the swap problem all over again. That is why I said "not necessarily".
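A runnable sketch of the upsert flow in SQLite (which has supported ON CONFLICT ... DO UPDATE since version 3.24; IS NOT plays the role of IS DISTINCT FROM there), assuming a hypothetical products table that, per the above, no longer carries the (category_id, position) pair constraint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL UNIQUE,   -- conflict target for the upsert
        position   INTEGER NOT NULL CHECK (position > 0)
    );
    INSERT INTO products (product_id, position) VALUES
        (1000, 2), (1001, 1), (1002, 4), (1003, 3), (1004, 5);
""")

# Fresh data from the external API: two unchanged rows, a swap, one new row.
api_data = [(1000, 2), (1001, 1), (1002, 5), (1003, 3), (1004, 4), (1008, 6)]

with conn:
    for product_id, position in api_data:
        conn.execute("""
            INSERT INTO products (product_id, position) VALUES (?, ?)
            ON CONFLICT (product_id) DO UPDATE SET position = excluded.position
            WHERE products.position IS NOT excluded.position  -- skip no-op writes
        """, (product_id, position))
```

The WHERE clause on DO UPDATE keeps the unchanged rows untouched while the rest are inserted or rewritten in a single pass.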

Final Thoughts

Even a simple constraint can pose many challenges. This is true for almost all aspects of schema design. The best way to mitigate possible pitfalls is to know your data and all of its manipulation flows. I hope this post gives you some food for thought.