Force that data in!

Set validation is a problem with any architecture, but particularly with CQRS because of the rules about what data can be accessed from where.  Before we get into the details, let’s talk about what set validation is and why it’s a problem.

Set validation is when you need to validate whether a new item is allowed to be added to a set, or whether an existing item can be updated in a particular way.  Take, for example, email addresses in a registration system.  You don’t want a new user to be able to register with an email that already exists, and neither do you want to allow an existing user to change their email to match another user’s.  The problem is how to enforce this restriction.

The basic problem with set validation in CQRS is where to put the logic.  To make the discussion easier, let’s use the example of a large system with 1MM existing users.  When a new user registers, ensuring they register with a unique email can be costly with so many existing values to check.  Since the user is sitting at their screen waiting for the operation to finish, we want this to be as fast as possible.  Let’s go over the options.

As we know, in CQRS there are two data models: the write model and the read model.  Following the rules of CQRS, we only interact with our write model through full aggregates, which can be expensive to instantiate since they are heavy with business rules, domain logic, child classes, etc.  Querying the read model would be much faster, but our read model is probably on a separate server since our system is so big, meaning we need an API call and that slows us down again.  Using the read model also introduces the problem of eventual consistency: if two people create the same “unique” object at the same time, the faster creation may not be in the read model yet when the slower one arrives.
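To make that race concrete, here is a minimal check-then-act sketch against the read model; the ReadModelClient and UserWriteRepository shapes are assumptions for illustration, not part of any particular framework.

```typescript
// Hypothetical read-model client and write-side repository, for illustration only.
interface ReadModelClient {
  emailExists(email: string): Promise<boolean>; // API call to the read-model server
}

interface UserWriteRepository {
  save(user: { id: string; email: string }): Promise<void>;
}

async function registerUser(
  readModel: ReadModelClient,
  writeRepo: UserWriteRepository,
  id: string,
  email: string
): Promise<void> {
  // Check-then-act against the eventually consistent read model.
  if (await readModel.emailExists(email)) {
    throw new Error(`Email ${email} is already registered`);
  }

  // Race window: a competing registration with the same email may already be
  // accepted on the write side but not yet projected into the read model,
  // so both "unique" users can slip through here.
  await writeRepo.save({ id, email });
}
```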

The first question to ask ourselves is whether this “problem” is really worth solving.

What is the business impact?

If bad data succeeds in getting into your system, what are the repercussions?  Is it worth the effort to detect it up front, or can it be handled later?

Given the situation of validating whether a new user’s email is already in the system, perhaps it is better to ask forgiveness than permission.  In other words, let the bad data be persisted, then detect and fix it later in the process when time and performance are less of an issue.

The faster a system needs to operate, the more expensive it is.  If we can move the code dealing with the conflict to a system farther down the chain, i.e. something that does not run while the user is sitting at their screen waiting for a response, then we’ve improved our scalability and perceived performance at the same time.  In the CQRS context this means we let the issue be detected while updating the read model, and we send a FixDuplicateUsername command to start the process of fixing the data.  Obviously this assumes there are no DB constraints that prevent the insert/update from succeeding in the original command.
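As a rough sketch of that downstream detection, assuming an event-driven projector and a command bus (those names are mine; only the FixDuplicateUsername command comes from the text above), the read-model updater notices the collision while applying the change and starts the compensating process instead of blocking the original registration:

```typescript
// Hypothetical event and command bus shapes, for illustration only.
interface UserRegistered {
  userId: string;
  email: string;
}

interface CommandBus {
  send(command: { type: string; payload: unknown }): Promise<void>;
}

class UserReadModelProjector {
  // email -> userId, standing in for the read-model store
  private byEmail = new Map<string, string>();

  constructor(private commands: CommandBus) {}

  async onUserRegistered(event: UserRegistered): Promise<void> {
    const existingUserId = this.byEmail.get(event.email);

    if (existingUserId && existingUserId !== event.userId) {
      // Duplicate detected after the fact: kick off the fix-up process
      // rather than rejecting the user up front.
      await this.commands.send({
        type: "FixDuplicateUsername",
        payload: { email: event.email, userIds: [existingUserId, event.userId] },
      });
      return;
    }

    this.byEmail.set(event.email, event.userId);
  }
}
```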

Another option that can be difficult to swallow for technical people is to solve the problem manually.  The common belief is that if there’s a problem, then of course we fix it with software.  What is typically forgotten is the opportunity cost of fixing it in code.  If a problem happens once a year and takes 5 minutes to fix manually, is it worth spending a developer’s time on it?  That question can only be answered by the business, but I would argue that notification is sufficient.  Let the problem be detected when the read model is updated, then send an email to someone with the ability to fix it by hand.  This lets the developer move on to more important features.

If you’ve read this far, we have decided that the error is significant enough that we need to prevent it from happening in the first place.  That means we have a user sitting at their screen waiting, so we need this to be as fast as possible.  Scanning through 1MM emails looking for a duplicate is not a valid definition of “fast”.  The way we would approach this is to have a dedicated service that carries a cache of just the information that needs to be unique (email in this case).  It does not carry the full payload of aggregates since it will not be making any changes to data.  All it provides is a lookup of email addresses to validate uniqueness.  This service can live in the Domain, where it has access to the write model and can subscribe to any commands that change or create an email address.
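A minimal sketch of such a lookup service is below, assuming an in-process cache keyed by lower-cased email; all of the names here are illustrative, not taken from any particular framework.

```typescript
// Hypothetical in-process uniqueness lookup, for illustration only.
class EmailUniquenessService {
  // Only the data that must be unique is cached; no aggregates, no payloads.
  private emails = new Set<string>();

  // Seeded once from the write model at startup.
  seed(existingEmails: Iterable<string>): void {
    for (const email of existingEmails) {
      this.emails.add(email.toLowerCase());
    }
  }

  // Fast, synchronous answer for the command handler.
  isAvailable(email: string): boolean {
    return !this.emails.has(email.toLowerCase());
  }

  // Called as the Domain processes commands that create an email address,
  // keeping the cache in step with the write model.
  register(email: string): void {
    this.emails.add(email.toLowerCase());
  }

  // Called as the Domain processes commands that change an email address.
  replace(oldEmail: string, newEmail: string): void {
    this.emails.delete(oldEmail.toLowerCase());
    this.emails.add(newEmail.toLowerCase());
  }
}
```

A registration command handler would call isAvailable before loading the aggregate, and register or replace as it applies the change, so the check never has to scan the 1MM existing rows or call out to the remote read model.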