Dad, what's an event stream??



Make me understand that technical jargon like I'm 6...


Once upon a time, I had to explain change data capture to my son… I mean, to a client.

Here's the gist of how that went:

Change data capture (CDC) is basically reading the event stream of everything that has happened on a database: every update, insert, or delete (and sometimes even DDL commands) that has been run against it. The database itself actually uses that same stream to apply the requested changes/events to the main node, as well as to any replicas it might have. CDC is therefore already used internally by the database itself.
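To make that concrete, here's a toy sketch (in plain Python, not any real database's API) of the core idea: a replica is just the result of replaying the same ordered stream of change events that the primary produced. The event shapes and field names here are purely illustrative.

```python
# Toy illustration: replaying a change stream reconstructs table state,
# which is exactly what a replica (or a CDC consumer) does.

def apply_event(table, event):
    """Apply one change event (insert/update/delete) to an in-memory 'table'."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op == "insert":
        table[key] = row
    elif op == "update":
        table[key] = {**table[key], **row}
    elif op == "delete":
        del table[key]
    return table

# The same ordered stream drives the primary and any replica alike:
stream = [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "update", "key": 1, "row": {"name": "Ada Lovelace"}},
    {"op": "insert", "key": 2, "row": {"name": "Alan"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in stream:
    apply_event(replica, event)

print(replica)  # {1: {'name': 'Ada Lovelace'}}
```

Replaying the same events in the same order always lands on the same state, which is why "external" CDC consumers (warehouse, data lake) can stay in sync too.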

That being said, when we talk about CDC as data engineers, more often than not we're referring to reading that same event stream for "external" consumption/replication (i.e. towards a data warehouse or data lake, generally for post-hoc analytical purposes...)

And that got me thinking that the underlying concept, the event stream, is nebulous at best... Which is a shame, because it's really useful!! So it would be worth being able to explain it to anyone, even my 6yo son (now 8yo 🤦).

Anyhow, let's get on with the story, shall we?

The Never-Ending Book

Long ago, storytellers could only share tales by speaking out loud. If you wanted to hear their stories, you had to be there listening at that very moment. If you left or got distracted, you'd miss out on parts of the narrative.

But then, writing came along. Suddenly, storytellers could write their stories down in books. This allowed listeners to become readers who could pick up where they left off whenever they wanted. The storytellers no longer had to wait for an audience before sharing their tales.

Now, imagine there was a magical author who was so inspired, her stories never ended! She just kept writing and writing without ever stopping. To share this endless tale, she created a very special magical book.

This book was truly magical because no matter how many new pages the author added to it, every reader could see the full book at the same time! The author could keep writing forever, with new pages just appearing at the very end for all to see.

Since readers all experienced the story at their own pace, each one needed a special bookmark to keep their place. Thanks to the bookmarks, they could put the book down and come back to it later without losing their spot.

The author's inspiration was so endless that the book just kept growing bigger and bigger and bigger. Before long, it became too huge for anyone to pick up!

So the author cast another spell: This time, as soon as every single reader had finished with a page, that page would magically disappear from the front of the book. This kept the book from becoming too heavy to lift.

But what if a reader fell behind for too long, or stopped reading altogether? If their bookmark didn't move for a while, it would eventually disappear too, so it didn't hold back all the other readers.

And that, kids, is a lot like the "event stream" systems that power so many of the apps and websites we use today...

There you have it: that's basically an event stream. New pages (events) keep getting added at the end, and pages that every reader (consumer) has finished can be discarded along the way.
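The whole magical book fits in a few lines of code. Here's a minimal sketch of it as an append-only log with per-reader bookmarks (offsets) and trimming of fully-read pages; the class and method names are made up for illustration, not taken from any real streaming library.

```python
# A minimal "magical book": an append-only log where each reader keeps
# its own bookmark, and pages everyone has read vanish from the front.

class EventStream:
    def __init__(self):
        self.events = []     # the "pages" still in the book
        self.start = 0       # absolute offset of events[0]
        self.bookmarks = {}  # reader name -> next absolute offset to read

    def append(self, event):
        """The author writes a new page at the very end."""
        self.events.append(event)

    def read(self, reader):
        """A reader consumes everything from their bookmark onward."""
        pos = self.bookmarks.get(reader, self.start)
        pages = self.events[pos - self.start:]
        self.bookmarks[reader] = self.start + len(self.events)
        return pages

    def trim(self):
        """Pages that every reader has finished disappear from the front."""
        if not self.bookmarks:
            return
        done = min(self.bookmarks.values())
        self.events = self.events[done - self.start:]
        self.start = done

stream = EventStream()
stream.append("page 1")
stream.append("page 2")
stream.read("alice")   # alice reads pages 1-2
stream.append("page 3")
stream.read("bob")     # bob reads pages 1-3
stream.trim()          # alice hasn't read page 3 yet, so only it remains
print(stream.events)   # ['page 3']
```

Note that each reader advances independently, and the book only shrinks up to the slowest bookmark; that's exactly why a stalled consumer can make a real stream grow unboundedly (and why stale bookmarks eventually get dropped).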

This illustrates:

  • the need for asynchronous communication in the first place

  • how event streams (or message queues) enable such communication

  • the never-ending aspect of streams

  • the basics of keeping them within a reasonable size.

However, it does not make explicit that readers can consume concurrently, which is the magic scaling benefit of such systems (when configured properly). But I find it still captures most of the intuition around this piece of the stack.

And to close the loop: that's also how the Postgres Write-Ahead Log (WAL, the stream that CDC reads) generally works and can be configured...