Your data pipeline matters more than your database.
Your engineering maturity matters more than your data pipeline.

– Engineers at Every Tech Company
Every company’s data needs are different. Having a team of skilled, flexible and up-to-date engineers can solve your problems. Right?
Iterations upon iterations of data pipelines have been built over the last 20 years at least (it is 2020 now). Shouldn’t there be a ‘best practice’ way of doing things, or a set of tools that achieves a consistent level of performance?
Well, it’s not that simple. Every company has different needs. Let’s see why.
Determine Your Data Complexity
At some point, management needs to start thinking of how to sustain an engineering team to fulfil the data needs of an organisation. How hard could that be?
Well it depends on the complexity of your requirements.
Let’s think in terms of three components: the types of data you handle, the amount of data, and the speed at which it moves.
These ultimately result in the data volume you have to deal with (i.e. flow rate, if you know a little about plumbing), and methods of storing them.
Thereafter, you can start thinking about how these will be used. Depending on your use cases, you may or may not need sophisticated tools to do so.
Oftentimes, these three components are lumped into the category of “Big Data”.
This term is overused. If your collected data is significantly lower than 20TB a day, you likely do not have a “Big Data” problem. Note that your needs are more specific than industry jargon!
Types of Data
Numerical: Do you deal mostly with numerical data? Are they a static snapshot, or does it incorporate multiple timeframes?
Methods to extract and store these data are relatively straightforward. These can be your usual flat-file numerical data, such as accounting statements, transactions, or payroll-type data.
For higher-frequency formats, processing this requires a more sophisticated engine.
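For instance, a higher-frequency feed of transactions usually needs to be aggregated into coarser timeframes before it lands in a report. A minimal sketch in Python, with made-up records:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction records: (ISO timestamp, amount)
records = [
    ("2020-06-01T09:15:00", 120.0),
    ("2020-06-01T09:45:00", 80.0),
    ("2020-06-01T10:05:00", 200.0),
]

# Bucket amounts into hourly totals -- the kind of aggregation a
# higher-frequency numerical feed needs before reporting.
hourly = defaultdict(float)
for ts, amount in records:
    hour = datetime.fromisoformat(ts).replace(minute=0, second=0)
    hourly[hour] += amount

for hour, total in sorted(hourly.items()):
    print(hour.isoformat(), total)
```

At real volumes you would reach for a processing engine rather than a loop, but the shape of the work is the same.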
Text: How about text from emails, messaging applications and websites?
Image/Video: Do you deal with images and/or videos?
And likely, how about a combination of all three?
Amount of Data
How much data is coming in per second? Per hour? Per day?
Are you generating data to the tune of 1GB an hour? Per minute, even?
This directly affects how you will store and process the data into the formats you want, and it also affects the internal SLAs for your users.
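A quick back-of-envelope calculation is usually enough to size this. A sketch with illustrative figures (the rates below are made-up examples, not benchmarks):

```python
# Back-of-envelope throughput check: does the incoming rate fit your
# storage budget and SLA window? All figures are hypothetical.
events_per_second = 500      # e.g. a clickstream feed
bytes_per_event = 2_000      # ~2 KB per JSON record

daily_bytes = events_per_second * bytes_per_event * 86_400
daily_gb = daily_bytes / 1e9
print(f"{daily_gb:.1f} GB/day")
```

Running the numbers like this early tells you whether you are in flat-file territory, database territory, or genuinely distributed-systems territory.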
Speed of Data
How often is data coming into your organization? One record per event? How often are the events (per customer message, per day, per week etc)? This relates to the channels you receive data by.
How often do your users need their insights? By when? Are there stringent SLAs to meet? (e.g. An updated report by end of the day? hour?)
I ticked yes to almost all of the above…
Congratulations! Your data requirements are complex, and you probably need to get a team of data engineers.
On both the receiving and delivery ends of data, you need pipelines to handle them. That’s where the complexity comes in, and where your engineering team should focus its attention.
Just Follow These Instructions…
The problem sounds hard. What do I do?
1. Organize and Take Inventory
What types of data do I have? What files and databases are being used now? How are they being used?
Who do they belong to, who is responsible for maintaining them?
From here, you understand who you have to deal with, and what everyone needs to get their job done.
2. Identify Key Data fields
Who is accessing what data the most?
What kind of data fields/columns?
What kind of end-result reports?
Once you map this out, it becomes clearer how to streamline your data operations. You may notice overlaps between data fields, and that some fields are requested far more often than others.
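The mapping can be as simple as counting which fields each report requests. A sketch with hypothetical report and column names:

```python
from collections import Counter

# Hypothetical access log: which report touched which columns.
requests = [
    ("weekly_sales_report", ["customer_id", "amount", "region"]),
    ("churn_dashboard", ["customer_id", "last_login"]),
    ("finance_summary", ["amount", "region"]),
]

# Count how often each field is requested; heavily-used fields are
# candidates for shared, pre-computed tables.
field_counts = Counter()
for _, fields in requests:
    field_counts.update(fields)

print(field_counts.most_common(3))
```

Fields near the top of this count are where streamlining pays off first.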
3. How Optimized?
There are a few phases you can optimize towards, and they can be categorized by their technological requirements. Each level of data maturity has its growing pains: what you need to do to maintain it, and what it takes to move to the next level when ready.
Level 1: Flat files – Excel, CSV
There may not be a data warehouse, and you’ve probably got a monolithic architecture. You deal mostly with flat files and some scattered databases.
Department A has its own way of structuring files, which is passed to department B, who extracts these files either manually or does a static reference in Excel. Some database calls are used to combine with the Excel data, or save them as a backup or ‘final’ state.
You probably also don’t have too many technical folks and so you’re doing this on the side until requests become too much to handle.
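A typical Level 1 task is stitching together two departments’ exports by hand. A sketch of that merge, with illustrative CSV data and column names:

```python
import csv
import io

# Two departments exporting the same entities with different column
# names -- a typical Level 1 headache. All data here is illustrative.
dept_a = "order_id,total\n1,100\n2,250\n"
dept_b = "OrderID,Region\n1,APAC\n2,EU\n"

# Join department B's columns onto department A's rows by order id.
orders = {row["order_id"]: {"total": row["total"]}
          for row in csv.DictReader(io.StringIO(dept_a))}
for row in csv.DictReader(io.StringIO(dept_b)):
    orders[row["OrderID"]]["region"] = row["Region"]

print(orders)
```

When this kind of script is being run by hand every week, that is the signal you have outgrown Level 1.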
Level 2: Unoptimised Databases
Here, you may or may not have dedicated resources to maintain this system and the data coming in. Maybe a few of your staff know a little SQL here, some Python there.
Someone mentioned MySQL was cool, and a few teams ran with it, leaving “rogue” scripts calling the databases all over the company to get stuff done.
However, if there is more than one large job running, you notice things start to tip over. Queries are getting timed out, and users are impacting each other when they add/update data. These are normal growing pains, and signs you should move to the next level.
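One common culprit at this level is ad-hoc scripts doing full-table scans. A sketch using SQLite (any RDBMS works similarly; table and column names are hypothetical) showing how an index changes the query plan:

```python
import sqlite3

# A common Level 2 fix: add an index so ad-hoc scripts stop doing
# full-table scans. Table and column names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "acme", 100.0), (2, "globex", 250.0)])
con.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# EXPLAIN QUERY PLAN shows whether the index is actually used.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer = 'acme'"
).fetchall()
print(plan)
```

Checking the query plan before and after an index is a cheap habit that catches many of the timeouts described above.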
Level 3: Batch Data Handling
At this point, your data is growing pretty fast. You went from single database calls, to calling them in batch, say end of the day, and the data you need should be ready by the next day.
Batch seems to work, for now. A little while after, you realize databases are hitting resource limits too.
This can come from running out of memory or space. Some batch queries can’t complete on time. Jobs may even impact each other, and you need to manually insert/remove data for all sorts of reasons.
You find data being duplicated somewhere, but you don’t know where exactly. Data is also dirty, yet quite little is done as there is just so much coming in, and not enough people to clean it.
The usual source of dirty data is user input – so sanitize, sanitize, sanitize.
Make input sanitization a top priority before moving any further, along with integrity checks on your databases.
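A minimal sanitization sketch; the field rules below are examples, not a prescription:

```python
# A sketch of input sanitization before rows hit the database.
# The fields and rules here are illustrative; adapt them to your schema.
def sanitize(record):
    clean = {}
    clean["email"] = record.get("email", "").strip().lower()
    # Reject rows that fail basic integrity checks instead of loading them.
    if "@" not in clean["email"]:
        raise ValueError(f"bad email: {record!r}")
    amount = str(record.get("amount", "")).replace(",", "")
    clean["amount"] = round(float(amount), 2)
    return clean

print(sanitize({"email": "  Alice@Example.COM ", "amount": "1,234.50"}))
```

Failing loudly on bad rows, rather than loading them, is what keeps the duplication and dirty-data hunt from recurring.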
After that, you would want to optimize how the queries are being done, while getting your tech team (I hope you have one by now) to move to the next stage.
If you have independent teams all writing their own scripts, you may notice some are actually written inefficiently. Additionally, the way these scripts call the database tables can create a ‘locking’ effect, resulting in long run times and timeouts.
What needs to be done: impose dependencies on data schemas, and requirements on the data being loaded and queried.
Clearly identify data workflows and start to use data pipeline management systems (e.g. Azkaban, Airflow) to organize and monitor them. You need to sort this out, as the next phase will require a lot more capacity.
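Under the hood, what these tools formalize is that a workflow is a directed acyclic graph of tasks executed in dependency order. A sketch with made-up task names, using Python’s standard library (graphlib requires Python 3.9+):

```python
from graphlib import TopologicalSorter

# What tools like Airflow or Azkaban formalize: a workflow is a DAG of
# tasks, run in dependency order. Task names here are hypothetical.
workflow = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "build_report": {"join_tables"},
}

# static_order() yields tasks so every dependency runs before its users.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Once workflows are declared this way, the scheduler can also retry, parallelize, and alert on them, which is what you will need at the next level.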
Level 4: Real-time
At this stage, you have done what you can with current data pipeline designs. Still, load times are taking too long. Pipelines are no longer stable, or even considered reliable. Workflows are failing. MySQL, or whichever RDBMS you use, is having trouble even serving data.
Workflows of database jobs are complicated and have lots of dependencies between each other. Data latency is becoming a bigger problem. Data engineering is probably a full-time job. Everything is overloaded.
It’s time to move from batch processes to something streaming.
What does this mean? Well, whenever something in your data changes, everything down the pipeline reflects it. And we mean everything.
Does a table have a new record? Data was updated or deleted? Every other table cascades the updates in the same order.
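A toy sketch of this cascade, using an in-memory publish/subscribe pattern (real systems use a durable log such as Kafka; everything here, including the table and consumer names, is illustrative):

```python
# Toy change propagation: when a source table changes, every downstream
# subscriber is notified in order. All names are hypothetical.
subscribers = {"orders": []}
audit_log = []

def on_change(table, callback):
    subscribers[table].append(callback)

def publish(table, event):
    # Fan the change event out to every downstream consumer, in order.
    for callback in subscribers[table]:
        callback(event)

on_change("orders", lambda e: audit_log.append(("mart_updated", e)))
on_change("orders", lambda e: audit_log.append(("report_refreshed", e)))

publish("orders", {"op": "insert", "id": 42})
print(audit_log)
```

The hard engineering at this level is making that fan-out durable, ordered, and replayable at scale, not the pattern itself.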
Now you have to note that this is a complete overhaul of what you have – not many organizations make such a change because it can be disruptive to operations if not managed well.
Yet if you have the chance to start fresh with a new team, this approach effectively leapfrogs your competitors when you can handle workloads at scale.
This opens up more use cases. You get real-time metrics and business insights straight off the data warehouses. Engineers can see the state of databases in production there and then, and can figure out quickly what to do.
You can also run fancy checks to verify that the shape of the data is what you expect, which even lets you tell whether web services are healthy. All from a well-organized pipeline.
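Such a shape check can be lightweight. A sketch with hypothetical column names:

```python
# A lightweight "shape of data" check: compare an incoming batch against
# what the pipeline expects. Column names here are hypothetical.
expected_columns = {"id", "amount", "created_at"}

def check_batch(rows):
    # An empty batch often means an upstream service is down.
    assert rows, "empty batch -- upstream may be unhealthy"
    for row in rows:
        missing = expected_columns - row.keys()
        assert not missing, f"missing columns: {missing}"
    return True

ok = check_batch([{"id": 1, "amount": 9.99, "created_at": "2020-06-01"}])
print("batch healthy:", ok)
```

Wired into the streaming pipeline, a failing check becomes an early alarm for the service that produced the data.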
Not everything can be migrated over though. So you still have some of the painful, older stuff moving alongside this “lightspeed” pipeline. We’ll deal with that soon enough.
Level 5: Integration
So, about that painful stuff. Once you are ready to start (getting buy-in, etc.), begin by integrating the older systems into your workflows. Reuse components of the existing system to do so, so your team does not have to deal with so many disparate systems.
You know you are ready when almost all of your services can be accessed through a microservice. You really should have a team of data engineers by now. *Hint hint*
At the very least, you need enough responsible people to manage some of this complex workload, plus a happy, mature site-reliability engineering team that’s more than willing to take on all these connectors and integrations for you.
Bonus Level: Decentralize Everything
In technology companies, it is not unheard of to automate most (i.e. all) of the data pipeline. Mature data operations move to a “decentralized” structure. This covers the following:
Data Catalog: All metadata of all data, dimensions, data types, who owns it. Remember the first step of taking inventory? This is now fully automated, so you know where everything is.
This also includes access control, so you can control who in the organization can get access to what, and how often. That’s really important if you’re dealing with sensitive data (don’t forget PDPA, Eurozone’s privacy laws and so on).
With this capability, anyone in the organization can find the data they need. Which brings the next capability…
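A data catalog can be pictured as metadata plus access control in one place. A sketch with hypothetical datasets and roles:

```python
# A data catalog sketch: metadata and access control together.
# Dataset names, fields, and roles are all hypothetical.
catalog = {
    "sales.orders": {"owner": "finance", "fields": ["id", "amount"],
                     "allowed_roles": {"analyst", "finance"}},
    "hr.salaries": {"owner": "hr", "fields": ["emp_id", "salary"],
                    "allowed_roles": {"hr"}},  # sensitive: locked down
}

def can_access(role, dataset):
    # Access control lives next to the metadata, so it can be audited.
    return role in catalog[dataset]["allowed_roles"]

print(can_access("analyst", "sales.orders"))
print(can_access("analyst", "hr.salaries"))
```

Production catalogs add lineage, freshness, and audit trails on top, but the core lookup is this simple.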
Self-service Analytics: Users just need to know SQL to retrieve the data they need. If they need it updated often, they simply raise a ticket for robots to create the data marts/warehouses within SLAs, varying from 1–3 hours to, at the latest, the next day.
Automatic Pipeline Creation: From the self-service requests, bots can build an entire pipeline from a new data source to marts and warehouses, where users can pick and choose the kind of data they need and when they need it.
As a result of levelling up, your organisation will need more advanced skillsets. For example, business analysts need to know SQL to retrieve data from the self-service analytics portal, and Python becomes essential for plumbing data into their reports or visualising it.
All in all, this results in a flatter, leaner, but super-optimized organization. Consider, though, that it also requires overall skillsets to be higher, amid a worldwide shortage of IT talent. Can your organization attract the people to bring your company forward?
That’s a lot to digest…Now what?
Well, you just covered your organizational needs. You are better prepared to find a solution that works for you, rather than grasping at the first hyped-up technology and throwing the problem to the first person that says “Big Data”. Caveat emptor, as always.
Is this going to be worth it? Why go through all the hassle?
Think about Metcalfe’s law applied to your company data.
The value of a network increases as more nodes and connections are added to it. The same applies to your data – individual data points combine into more dimensions, resulting in more valuable insights.
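In miniature, the effect is combinatorial: the number of potential pairwise connections grows roughly with the square of the number of data points.

```python
# Metcalfe's law in miniature: potential pairwise connections among
# n data points is n * (n - 1) / 2, i.e. roughly quadratic growth.
def connections(n):
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, connections(n))
```

Ten data points give 45 possible pairings; a thousand give nearly half a million, which is why joining datasets compounds their value.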
Google did not launch its own Payments solution for no reason – transaction data, along with transaction size, merchants, and regularity are important data for its advertising operations. It raises the value of their user base. And maybe it can even become a digital bank some day…
You also get much faster ‘response’ times for business decisions. Since almost everything is automated, you receive insights at least a day, or even a week, ahead of your nearest competitors. You begin the analysis, and thus the decision-making process, much earlier.
Even if you don’t have a plan for a new business line, collecting the data helps you to observe trends. Tracked over time, this data helps you uncover new relationships especially with the help of machine learning.
Lastly, mature data operations also speak volumes (literally) about how your technology stack is handled. You are competing against the FAANGs in some way or another, and this is the kind of thing engineers look at when deciding whether to stay long at a company.
In a catch-22 situation, if you don’t get engineering talent in and help them to stay, your data operations remain stagnant, and the vicious cycle continues. Start making a difference with small improvements today!
So where do you start? Ask yourself:
As-is analysis: What is the current state of data operations?
To-be analysis: What level of data operations do you foresee in the next 3 years? What kind of data are you likely to be handling at that time? Do you have a project roadmap?
Do you have the skills in your organisation to do this?
Are you and the decision-makers willing to invest in it and take the risk?
What does it take to get the resources?
Approach everything in layers, and move only when you are ready. Data pipelines are as tricky as ERP implementations.
Consider the tradeoff of not moving versus what your competitors are doing. There are always benefits to have a faster response time to the market, and more analytical points to make business decisions.
Start thinking of mature data operations as a talent attraction and retention strategy.