GOTO 2014 • Polyglot Data • Greg Young
This presentation was recorded at GOTO Chicago 2014
Greg Young - Independent Consultant & Entrepreneur
We are forced to solve many problems with respect to handling data. Many concepts work in our current models, many do not. Picking the wrong model can lead to massive amounts of accidental complexity. This talk will look at how to reach the point where you stop thinking about how to force your problem into your predefined thinking and how to reach a place where you focus on how to choose the right model for the problem!
By anonymous 2017-09-20
This MoneyTransferEmailNotifier is subscribed to MoneyTransferred events. But note that the system sending the MoneyTransferred event does not really care who or how many listeners there are to this event. The whole point is the decoupling here. I raise an event and don't care if there are zero or 20 listeners that subscribe to this event.
Your tangle, I believe, is here - that only the publish subscribe middleware can deliver events to where they need to go.
Summarizing: the pub/sub middleware is in the way. A pull based model, where consumers retrieve data from the durable event store gives you a reliable way to retrieve the messages from the store. So you pull the data from the store, and then use the business level data to recognize previous work as before.
For instance, upon retrieving the MoneyTransferred event with its business data, the process manager looks around for an EmailSent event with matching business data. If the second event is found, the process manager knows that at least one copy of the email was successfully delivered, and no more work need be done.
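As a rough sketch of that check, assuming the events have already been pulled from the store into a list: the `transfer_id` and `recipient` fields and the `transfers_needing_email` helper are made up for illustration and stand in for whatever business data actually ties an EmailSent event to its MoneyTransferred event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoneyTransferred:
    transfer_id: str   # hypothetical business identifier
    recipient: str


@dataclass(frozen=True)
class EmailSent:
    transfer_id: str   # same business identifier, recorded when the email goes out


def transfers_needing_email(history):
    """Return MoneyTransferred events that have no matching EmailSent event."""
    notified = {e.transfer_id for e in history if isinstance(e, EmailSent)}
    return [
        e for e in history
        if isinstance(e, MoneyTransferred) and e.transfer_id not in notified
    ]


history = [
    MoneyTransferred("t-1", "alice@example.com"),
    MoneyTransferred("t-2", "bob@example.com"),
    EmailSent("t-1"),
]
pending = transfers_needing_email(history)  # only the t-2 transfer is left
```

Running the same query again after a crash or a redelivery gives the same answer, which is what makes the pull model forgiving.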
The push based models (pub/sub, UDP multicast) become latency optimizations -- the arrival of the push message tells the subscriber to pull earlier than it normally would.
In the extreme push case, you pack into the pushed message enough information that the subscriber(s) can act upon it immediately, and trust that the idempotent handling of the message will prevent problems when the redundant copy of the message arrives on the slower channel.
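A minimal sketch of a subscriber that accepts both channels, assuming every event carries a unique id and the durable store can be read as an ordered list. The `Subscriber` class and its method names are hypothetical, not part of any particular middleware.

```python
class Subscriber:
    def __init__(self, store):
        self.store = store      # durable, ordered list of (event_id, payload)
        self.position = 0       # how far into the store we have pulled
        self.handled = {}       # event_id -> payload; makes handling idempotent

    def handle(self, event_id, payload):
        """Act on an event once; a redundant copy is recognized and skipped."""
        if event_id in self.handled:
            return False
        self.handled[event_id] = payload
        return True

    def on_push(self, event_id, payload):
        """Fast channel: the pushed message carries enough data to act on."""
        return self.handle(event_id, payload)

    def pull(self):
        """Slow, reliable channel: catch up from the durable store."""
        new_events = self.store[self.position:]
        self.position = len(self.store)
        return sum(self.handle(event_id, payload) for event_id, payload in new_events)


store = [("e1", "first"), ("e2", "second")]
subscriber = Subscriber(store)
pushed = subscriber.on_push("e2", "second")   # e2 arrives early via push
caught_up = subscriber.pull()                 # pull sees e1 and e2, acts only on e1
```

If the push is lost, nothing breaks; the next pull picks the event up anyway.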
By anonymous 2017-09-20
During a new user sign-up I want to check if the username the user provided is already taken.
You may want to review Greg Young's essay on Set Validation.
In my understanding of how ES works the controller that processes the sign-up request will check if the request is valid, it will then send a new event (e.g. NewUser) to Kafka, and finally that event will be picked up by another controller which will persist it in a materialized view (e.g. Postgres DB).
That's a little bit different from the usual arrangement. (You may also want to review Greg's talk on polyglot data.)
Suppose we begin with two writers; that's fine, but if there is going to be a single point of truth, then you are going to need synchronization somewhere.
The usual arrangement is to use a form of optimistic concurrency; when processing a request, you reserve a copy of your original state, then you do your calculation, and finally you send the book of record a `replace(originalState,newState)`.
So at this point, we have two writes racing toward the book of record, say replace(red,blue) and replace(red,green).
At the book of record, the writes are processed in series.
So when the book of record processes replace(red,blue), it performs a check that yes, the state is currently red, and swaps in blue. Later, when the book of record tries to process replace(red,green), the book of record performs the check, which fails because the state is no longer red.
So one of the writes has succeeded, and the other fails; the latter can propagate the failure outwards, or retry, or..., precisely what depends on the specific mechanics in question. A retry should mean, of course, reload the "original state", at which point the model would discover that some previous edit already claimed the username.
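The red/blue/green race can be sketched in a few lines. This is an in-memory stand-in, not a real store: `BookOfRecord` is a hypothetical class, and the sketch assumes the writes reach it one at a time (a real implementation has to make the check and the swap atomic).

```python
class BookOfRecord:
    def __init__(self, state):
        self.state = state

    def replace(self, original_state, new_state):
        """Swap in new_state only if the current state is still original_state."""
        if self.state != original_state:
            return False        # someone else got there first
        self.state = new_state
        return True


book = BookOfRecord("red")
first = book.replace("red", "blue")     # check passes, state becomes blue
second = book.replace("red", "green")   # check fails, state is no longer red
```

The losing writer sees `False` and decides what to do next: report the failure, or reload the state and try again.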
Any ideas on how to address this?
Single writer per stream makes the rest of the problem pretty simple, by eliminating the ambiguity introduced by having multiple in memory copies of the model.
Multiple writers using a synchronous write to the durable store is probably the most common design. It requires an event store that understands the idea of writing to a specific location in a stream -- aka "expected version".
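To illustrate the "expected version" idea, here is a sketch against a toy in-memory stream. `InMemoryStream`, `VersionConflict` and `claim_username` are invented for this example and are not the API of any real event store; the version is simply the number of events already in the stream.

```python
class VersionConflict(Exception):
    """Raised when the stream has moved on since the writer last read it."""


class InMemoryStream:
    def __init__(self):
        self._events = []

    def read(self):
        """Return a copy of the events together with the current version."""
        return list(self._events), len(self._events)

    def append(self, event, expected_version):
        if expected_version != len(self._events):
            raise VersionConflict(
                f"expected {expected_version}, stream is at {len(self._events)}"
            )
        self._events.append(event)


def claim_username(stream, username):
    """Load the history, check the rule, then write at the version we read."""
    events, version = stream.read()
    if username in events:
        return False
    stream.append(username, expected_version=version)
    return True


stream = InMemoryStream()
_, version_a = stream.read()            # writer A loads at version 0
_, version_b = stream.read()            # writer B loads at version 0
stream.append("peter", expected_version=version_a)
try:
    stream.append("peter", expected_version=version_b)
    conflict = False
except VersionConflict:
    conflict = True                     # B's write is rejected
retry_result = claim_username(stream, "peter")   # B reloads and sees the name is taken
```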
Alternatively, you can perform an asynchronous write, and then carry on with other work until you get an acknowledgement that the write succeeded (or failed, or until you time out, and so on).
There's no magic -- if you want uniqueness (or any other sort of invariant enforcement, for that matter), then everybody needs to agree on a single authority, and anybody else who wants to propose a change won't know if it has been accepted without getting word back from the authority, and needs to be prepared for a rejected proposal.
(Note: this shouldn't be a surprise -- if you were using a traditional design with current state stored in a RDBMS, then your authority would be a user table in the database, with a uniqueness constraint on the username column, and the race would be between the two insert statements trying to finish their transaction first....)
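For comparison, the traditional version of the same race, sketched with Python's built-in sqlite3 module and an in-memory database. The `users` table and the `sign_up` helper are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT NOT NULL UNIQUE)")


def sign_up(username):
    """Try to insert; the database, as the single authority, accepts or rejects."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (username) VALUES (?)", (username,))
        return True
    except sqlite3.IntegrityError:
        return False            # uniqueness constraint rejected the proposal


first = sign_up("peter")    # accepted
second = sign_up("peter")   # rejected, the name is already taken
```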
By anonymous 2017-09-20
Which Event store can be used for it ...?
If you have the luxury of choosing the technology to use, then I would suggest you start out by looking into Greg Young's Event Store.
Yes, that's the same guy that introduced CQRS to the world.
(You may also want to review his talk on polyglot data, which includes discussion of pull vs push based models).
By anonymous 2017-11-27
How do you solve those problems?
One remedy is to avoid trying to rebuild images from unstable representations of the event history. When you are loading state into the write model, you will normally do so by querying a "document" that has all of the events of your aggregate in the order that they were written.
Taking the same approach in the read model, whereby you read the stable event history for each topic, avoids the problems that you might face because the topic events arrive out of order.
See Greg Young's talk on polyglot data.
You can take the same approach when building a read model from multiple topics, which gives you a consistent history for each topic... but not necessarily a synchronized whole.
So, to use your specific example, you might have
ContactCreated (contactId: "123", name: "Peter")
ContactAddedToGroup (contactId: "123", groupId: "456")
but without the event that belongs in the "middle" (GroupCreated). So now what?
One possible answer is to build the view using the unaligned histories - you have Contact information as of 00:15, and Group information as of 00:00, and you make that temporal discrepancy part of the read model. This might include using a variation of the NullObject pattern to represent objects that don't exist yet.
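A sketch of what that could look like, assuming events arrive as plain dictionaries. The `name` field on GroupCreated, the `known` flag and the `build_view` function are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    group_id: str
    name: str
    known: bool = True


def unknown_group(group_id):
    """Null object: stands in for a group whose GroupCreated we have not seen."""
    return Group(group_id, "(not yet known)", known=False)


def build_view(contact_events, group_events):
    groups = {
        e["groupId"]: Group(e["groupId"], e["name"])
        for e in group_events if e["type"] == "GroupCreated"
    }
    view = {}
    for e in contact_events:
        if e["type"] == "ContactCreated":
            view[e["contactId"]] = {"name": e["name"], "groups": []}
        elif e["type"] == "ContactAddedToGroup":
            group = groups.get(e["groupId"], unknown_group(e["groupId"]))
            view[e["contactId"]]["groups"].append(group)
    return view


contact_events = [
    {"type": "ContactCreated", "contactId": "123", "name": "Peter"},
    {"type": "ContactAddedToGroup", "contactId": "123", "groupId": "456"},
]
# The group history lags behind, so the view shows a placeholder for group 456.
stale_view = build_view(contact_events, group_events=[])
```

Rebuilding the view once the group history catches up replaces the placeholder with the real group.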
Another possibility would be to use something like a Lamport Clock to keep track of the dependencies between events in different topics. That might look like metadata in ContactAddedToGroup that lets the consumer know that event is consequent to GroupCreated. The consumer could then decide whether or not to ignore events that are missing precedents.
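A simplified sketch of the consumer side. This is not a full Lamport clock: it assumes each event may carry a hypothetical `requires` entry naming, per topic, the position the consumer must already have read before the event makes sense. The `split_by_precedents` function and the topic name `groups` are invented for illustration.

```python
def split_by_precedents(events, seen_positions):
    """Separate events whose precedents have been seen from those still waiting.

    seen_positions maps topic name -> number of events consumed from that topic.
    """
    ready, deferred = [], []
    for event in events:
        requires = event.get("requires", {})
        satisfied = all(
            seen_positions.get(topic, 0) >= position
            for topic, position in requires.items()
        )
        (ready if satisfied else deferred).append(event)
    return ready, deferred


events = [
    {"type": "ContactCreated", "contactId": "123", "name": "Peter"},
    {
        "type": "ContactAddedToGroup", "contactId": "123", "groupId": "456",
        "requires": {"groups": 1},   # consequent to the first event in "groups"
    },
]
# Nothing has been read from the "groups" topic yet.
ready, deferred = split_by_precedents(events, seen_positions={"groups": 0})
```

Whether deferred events are dropped, parked for later, or shown with a warning is up to the consumer.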
By anonymous 2018-05-01
I can see a few ways of doing this but I'm not really sure which is the right way to proceed.
There's no perfect answer - in most cases, externally observable side effects are independent of your book of record; you're always likely to have some failure mode where an email is sent but the system doesn't know, or where the system records that an email was sent but there was actually a failure.
For a pretty good answer: you're normally going to start with a facility that sends an email and reports as an event that the email was sent successfully, or not. That's fundamentally an event stream - your model doesn't get to veto whether or not the email was sent.
With that piece in place, you effectively have a query to run, which asks "what emails do I need to send now?" You fold the ApplicationSaved events with the EmailSent events, and compute from that what new work needs to be done.
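That fold might be sketched like this, assuming events are dictionaries in recorded order. The `applicationId` and `succeeded` fields are assumptions; the latter stands in for however the sending facility reports "sent successfully, or not".

```python
from functools import reduce


def step(pending, event):
    """Fold one event into the set of applications still owed an email."""
    application_id = event["applicationId"]
    if event["type"] == "ApplicationSaved":
        return pending | {application_id}
    if event["type"] == "EmailSent" and event.get("succeeded", True):
        return pending - {application_id}
    return pending      # failed sends leave the application pending


def emails_to_send(events):
    return reduce(step, events, frozenset())


events = [
    {"type": "ApplicationSaved", "applicationId": "a1"},
    {"type": "ApplicationSaved", "applicationId": "a2"},
    {"type": "EmailSent", "applicationId": "a1", "succeeded": True},
    {"type": "EmailSent", "applicationId": "a2", "succeeded": False},
]
todo = emails_to_send(events)   # a2 still needs an email
```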
Rinat Abdullin, writing Evolving Business Processes a la Lokad, suggested using a human operator to drive the process. Imagine building a screen, that shows what emails need to be sent, and then having buttons where the human says to actually do "it", and the work of sending an email happens when the human clicks the button.
What the human is looking at is a view, or projection, which is to say a read model of the state of the system computed from the recorded events. The button click sends a message to the "write model" (the button clicked event tells the system to try to send the email and write down the outcome).
When all of the information you need to act is included in the representation of the event you are reacting to, it is normal to think in terms of "pushing" data to the subscribers. But when the subscriber needs information about prior state, a "pull" based approach is often easier to reason about. The delivered event then just signals the projection to wake up and pull (reducing latency).
Greg Young covers push vs pull in some detail in his Polyglot Data talk.