A Startup CTO’s Guide to the Modern Data Stack
So you’ve hooked up Metabase on your prod replica. Now what?
Plugging your first BI tool on your first source of data (your production backend’s database) is already a step in the right direction. Well done!
But if you’re lucky enough to get some traction internally on your brand new BI tool’s usage, you’ll soon end up with a sprawling collection of disparate queries and dashboards. Because everyone will be trying to find their own answers.
And that’s when things generally get messy… with other additional issues you’ll eventually encounter:
- Your production database won’t scale to handle the analytical workloads on top of your transactional ones.
- Your production database won’t contain the data of your whole business, only from the app it’s serving (no data from your CRM, ERP, ad spend, etc)
- Your production data won’t be ready for easy human consumption, quite the opposite, it should only care about being correctly normalized for your app’s consistent usage.
Now, don’t get me wrong, you can get away with those for some time, by “simply” vertically scaling your replica, by adding other data sources as additional “connections” to your BI tool, and by educating all data users on each and every separate data source available.
Or, you can take the next real steps on your analytics journey: bootstrapping your Modern Data Stack.
Personally, I’ve been into data engineering for a few years now and have had the chance of setting up such stacks for a bunch of startups and scaleups over the past couple of years.
Step 1: set up a cloud data warehouse
“Where they are put in boxes, and they come out all the same.”
In order to decouple the scaling of your app’s transactional workloads from that of your analytics workload, the idea is to add a resource specifically tailored to the latter.
Furthermore, the aim will be to make it your central source of truth for what’s going on across your whole business.
- if you already rely on GCP for other stuff in your tech stack
- or you have some machine learning use cases you already want to teeth out (as BQ is pretty tightly integrated with Google’s ML services, among others)
- and/or you prefer a less hands-on approach to elasticity and scaling
- if none of the above really matters
- and your team has some — even very limited — experience with database administration: you’ll feel right at home here (e.g user management through DDL statements)
Step 2: extract & load into your warehouse
“Lock and load.”
Now that you have a brand new and immaculate data warehouse, it’s time to start loading things in it.
While we’re at it, why not dump every potentially useful data source from across your business?
Typically, your future rockstar CMO might be interested in measuring your ads’ ROI (i.e Is our spending on Facebook Ads leading new users to sign up? Or maybe bringing past ones back? …)
You have quite a few options in terms of SaaS providers to do precisely that.
What you’ll generally get, are click-based UIs that allow to quickly connect your database(s) and/or a bunch of third party APIs as data sources to be loaded into your warehouse.
This even allows less technical folks to quickly become independent in dumping the data from their favorite tool right into the common warehouse.
Here, I’d highlight Fivetran — current EL(T) market leader — as the more mature option, albeit currently very pricey.
Or Airbyte as the latest but strongest upcoming contender. Especially if you’re comfortable hosting you’re own ingestion infrastructure: it’s open sourced and just a
docker-compose up away.
Step 3: model, model, model
Also known as the ‘T’ in ELT (i.e “Transform”)
What you’ll need now is to get usable results — KPIs — from all of that data. Maybe merging data across different sources, aggregating it per user segment, and/or per day, week, month, quarter? Comparing that year over year, or over the previous period?
There are quite a few specialist terms involved because of a long history of techniques used to do that holistically.
These techniques have roots back in the 80s’ when on-premises proprietary data warehouses were the norm (and ETL was a necessary evil). So the arguments focused on performance are way less relevant nowadays with cloud data warehouses being the norm.
But all the other benefits remain essential.
The problem is, in a start up environment where everything is going too fast, taking 6 months to design that perfect star schema from scratch will rarely be an option.
Hence my general advice in that context is: start from the use case and work your way back to the data sources (not the other way around).
Then you’ll refactor into more generic and intermediate layers/models once multiple — think rule of three — use cases need similar transformation logic applied upstream.
A framework that really helps pinning all of that down is dbt.
How does it help ?
It allows anyone that speaks SQL to contribute to your shared data warehouse construction. All of that in a versioned, tested and templated way, with builtin dependency resolution between all your models and sources.
Then, what’s next?
This one should actually go hand in hand with that previous step. Focusing on the actual use case(s) is what will align the data team to quickly bring strong value to the table.
This can take multiple forms depending on the use case at hand:
- from dashboards (or alerts) to allow the whole company to gain insight on the business
- to reverse ETLs that’ll make that transformed data directly actionable by sharing it back to the operational tools the other teams are using (e.g calculate your segments in your data warehouse to then (re)load them in your email marketing tool, weighing clients at risk for your CRM, …)
- or serving data APIs to feed some insights/recommendations/predictions into your product directly
Something I could’ve mentioned: of course you’ll need to start recruiting some data specialists along the way. A data analyst or two for starters, helped by an analytics engineer.
I’ve held back on that one to highlight the fact that: there are quite a few things you can progress on, even before that very first data hire! That would definitely help with the hiring process, by conveying the importance you want to place on having a sustainable data stack. And it could actually enable them to make an impact from the get-go.
There you have it. You’re just a few steps away from data-enabled greatness! So start building ;-)