My Take On Data Engineering
After reading many articles on Data Engineering (like this one), I thought it would be a good idea to share my own experience and thoughts, especially given the fear that Data Engineering will go to zero as new tools arrive and the field matures. Data Engineering is challenging mainly because it is still immature, with a lot of confusion about the role and the work. Every company has its own meaning of Data Engineering. A few years ago I wrote an article, Path to become a Data Engineer, which is relevant in this context.
I will split this article into three sections:
Challenges: the high-level issues
Roles: different roles with the same name
Tech: the ever-evolving tech
Challenges
Based on my Data Engineering experience, there are far too many challenges that still need a good solution, from data observability to transformation to the maintainability of codebases. Every now and then a data company launches a tool to solve one of these challenges, but the space still has a long way to go before it is mature.
Most of the tools on the market solve one aspect of the data problem, such as data discovery, documentation, testing, lineage or orchestration. Integrating all of these systems together is still a challenge. The biggest issue I see with being in an immature space is that there is a lack not only of tools but also of good practices, good resources and support, unlike Software Engineering, where you have many seasoned engineers and credible resources all over the internet. A good example is a simple issue: searching on Google returns only a few credible solutions for a Data Engineering problem. There is a long way to go, and hopefully in five to ten years we will have better tools, frameworks and standards. These challenges can be associated with tools in the tech section of the article.
However, the good thing is that it is an interesting space. If you really like open-ended and vague challenges which might have never been solved before, I would recommend checking out Data Engineering.
Roles
The title of Data Engineer is very confusing. What it means depends on the company and, most importantly, on the job description. You might get a job as a Data Engineer and end up doing something that is not really Data Engineering. To understand more, let's dive into what Data Engineer jobs may look like:
Some companies hire Data Engineers who mainly work as Analysts: mostly SQL, heavy analysis and some visualization, and rarely any Python or other technical work.
Data Analyst
Business Intelligence Engineer
A traditional BI Engineer specializes in BI tools and is good with data intelligence, reporting and visualization. A BIE may become an Analytics Engineer by mastering modern tools like DBT.
An Analytics Engineer is more like a technical version of a Data Analyst with the visualization skills of a BI Engineer. An Analytics Engineer is familiar with modern data tools like DBT, Airbyte and Great Expectations, and has some experience with Python.
Analytics Engineer
Data Platform Engineer/ Software Engineer, Data
This engineer is mainly responsible for building the data platform used by Analytics Engineers, Data Analysts and others, and for automating its components to make their lives easier. The role calls for experience with distributed systems like Kafka, Beam and Spark, with end-to-end real-time or batch ingestion processes, and with building tools, frameworks and services to support data.
Data Infra Engineer
If we compare this to the software world, we can call it a DataOps role, mainly focused on CI/CD, IaC (Terraform), operations and maintaining the infrastructure associated with the Data Platform, such as a Kubernetes cluster. Data Platform and Data Infra are very close and can even be considered one role.
So if you are looking for a job, I would suggest not going only by the title ‘Data Engineer’. Read the job description and discuss the responsibilities with the hiring manager, as they can be really different per company or even per team. I think the role should only be called Data Engineer if it requires you to wear multiple hats, especially on the technical side.
I have seen this in many companies, and I really hope they will start to differentiate the roles more often, using a more specialized name rather than just a buzzword to attract more talent. One example is Facebook: the title is Data Engineer, but based on Blind the role is closer to a glorified Data Analyst or, you may say, an Analytics Engineer.
The evolution of Data Engineering roles depends heavily on the technology. The Analytics Engineer role, for example, started when tools like DBT became popular, as they give Data Analysts the power to work more like engineers. Also, with Data Warehouses getting more powerful, cheaper and more efficient, less tech seems to be needed to solve bigger problems, and most things are done via SQL. Let's take a look at how the tech evolved and where it might end up.
Tech
Processing and Storage Systems
Hadoop contributed to, and I would say initiated, the role of the Data Engineer. Back then a Data Engineer had to be really good with the Hadoop ecosystem, which in my opinion was the only tech able to process and store large amounts of data. I even started with Hadoop myself by taking a course during my college days.
The Hadoop ecosystem includes tools like Spark, Hive and HDFS, which were the backbone of data systems for quite a number of years. In recent years we have seen a decline, with the rise of powerful systems like Snowflake and BigQuery, which have become a great alternative to processing systems like Spark. This is effectively a move from ETL to ELT: data is loaded first and transformed inside the warehouse, which gives end users the power to transform data themselves rather than depending on engineers to do it for them. Meanwhile, HDFS is being replaced by cloud storage solutions like S3 and GCS. Overall this takes a lot of technical pieces away from Data Engineers, but there is still a good side, with new challenges coming along, especially in the real-time space with technologies like Apache Beam. As a Data Engineer, I like technical challenges.
Python, Scala and Java are still used in the data world for processing, but the choice depends heavily on the company and the team. Most of the Apache tools have APIs in several of these languages; Kafka, for example, can be used from Java, Python and others. A bit of Python can even land you a job as a Data Engineer. The demand for programming skills increases as you move towards the later roles described above.
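To make the ETL versus ELT distinction concrete, here is a minimal sketch of the ELT pattern. It assumes an in-memory SQLite database as a stand-in for a warehouse such as Snowflake or BigQuery; the table names and sample rows are hypothetical and only there for illustration.

```python
import sqlite3

# Hypothetical raw records, landed as untyped strings the way an
# ingestion tool might deliver them.
raw_orders = [
    ("1", "alice", "19.99"),
    ("2", "bob", "5.00"),
    ("3", "alice", "10.01"),
]

# In-memory SQLite stands in for the warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Extract + Load: store the data as-is, with no transformation in application code.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: done afterwards, inside the "warehouse", in plain SQL.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY customer
    """
)

customer_revenue = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(customer_revenue)
```

The point of the sketch is that the transformation step is just SQL on top of already loaded data, which is why an analyst can own it without a separate processing system.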
One of the biggest fears of Data Engineers is the no-code or low-code tool. When I started, we had to integrate data from multiple sources ourselves; with tools like Fivetran, Stitch, Airbyte and Workato, you don’t really need any special knowledge. These tools cover a large chunk of the market, but data-focused companies that depend heavily on data need scalable, custom solutions. So, in general, these tools make life easier for simple tasks but are not enough on their own. Great article by Zach around the low code tool.
Airflow is the most popular orchestration tool and is widely used across the industry. Since it is written in Python, you still have to do a lot of Python work, and the same holds for other tools like Dagster or Luigi. I think Airflow is here to stay for a number of years. But again, you might not need it if you land in the non-tech bucket of Data Engineer.
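As a rough illustration of what an orchestrator does at its core, the sketch below runs tasks in dependency order using only the Python standard library. It is not Airflow's API; the task names and the `run_pipeline` helper are hypothetical, and a real orchestrator adds scheduling, retries, logging and much more.

```python
from graphlib import TopologicalSorter

run_log = []  # records the order in which tasks actually ran


def make_task(name):
    """Build a placeholder task that only records that it ran."""
    def task():
        run_log.append(name)
    return task


# Hypothetical pipeline steps.
tasks = {name: make_task(name) for name in ("extract", "transform", "load", "report")}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(tasks, dependencies):
    """Run every task after all of its upstream tasks have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


execution_order = run_pipeline(tasks, dependencies)
print(execution_order)
```

Declaring dependencies as data and letting the tool work out the order is the same basic idea behind a DAG definition in tools like Airflow, Dagster or Luigi.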
Data Build and Transformation Framework
Tools like DBT have revolutionised the Data Engineering world by introducing a new role known as the Analytics Engineer. DBT has only just started to become popular, and I expect its adoption to keep growing over the next few years. I am pretty sure there will be better versions of, and competitors for, this type of solution.
Data Quality and Reliability
Data Quality is currently one of the most difficult tasks: it is super hard to define consistency rules and hard to manage them. A little has been solved by tools like DBT Expectations, Great Expectations and Soda, but a lot still needs to be done. High-level checks like null and unique are easy, but custom, complex quality checks still need to be written out as a service.
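The gap between generic and custom checks can be sketched in a few lines of plain Python. This is an illustration only, not how Great Expectations, Soda or DBT implement their checks; the sample rows, column names and the `check_shipped_after_ordered` rule are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical order records with a few deliberate problems.
rows = [
    {"order_id": 1, "customer": "alice", "ordered": date(2023, 1, 5), "shipped": date(2023, 1, 7)},
    {"order_id": 2, "customer": None, "ordered": date(2023, 1, 6), "shipped": date(2023, 1, 8)},
    {"order_id": 2, "customer": "bob", "ordered": date(2023, 1, 9), "shipped": date(2023, 1, 8)},
    {"order_id": 3, "customer": "carol", "ordered": date(2023, 1, 10), "shipped": None},
]


def check_not_null(rows, column):
    """Generic check: return rows where the column is missing."""
    return [r for r in rows if r.get(column) is None]


def check_unique(rows, column):
    """Generic check: return values that appear more than once."""
    counts = Counter(r[column] for r in rows)
    return [value for value, n in counts.items() if n > 1]


def check_shipped_after_ordered(rows):
    """Custom business rule: a shipped order cannot ship before it was placed."""
    return [
        r for r in rows
        if r["shipped"] is not None and r["shipped"] < r["ordered"]
    ]


null_customers = check_not_null(rows, "customer")
duplicate_ids = check_unique(rows, "order_id")
bad_shipping = check_shipped_after_ordered(rows)
print(len(null_customers), duplicate_ids, [r["customer"] for r in bad_shipping])
```

The first two functions work on any table and any column, which is why tools can ship them out of the box. The third encodes knowledge about one specific dataset, and that is the kind of check that usually still has to be written by hand.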
Data Discovery, Observability And Governance
Along with data quality, it is very important to have good standards and approaches for discoverability, observability and governance. Cataloguing tools like Amundsen and DataPortal, and even DBT documentation, help with the required metadata, testing, ownership and lineage. These areas are still challenging: the tools do not cover all aspects, so custom solutions and integrations are still required.
Visualisation and reporting tools like Looker and Tableau come into play mainly for the first few roles described above. These tools are powerful and, in my view, have already taken over from more traditional tools like Power BI. Having expertise in these tools is great for the job market.
In this article, I shared my opinion on how things are going in the Data Engineering world, from the current challenges to the role itself to the tech. I covered why I still believe it is a good opportunity to be in this space and, if you want to get in, what to look out for when searching for a job, along with the tools and tech for the coming years. My one piece of advice would be not to focus on a single tech, as things have been evolving very quickly in this space.
This article was written months ago, so a few things might look a bit outdated, or I may have missed new, important tools that solve Data Engineering challenges. Feel free to add them in the comment section.