Path to become a Data Engineer
Data Engineering is definitely one of the most in-demand jobs in today's world. As data grows, the need for Data Engineers grows with it, and with new technologies like Spark and Hadoop becoming common, companies are looking to hire people who can handle data efficiently. I personally believe there is still a shortage of Data Engineers in the market, and it's still a good time to consider this field if you are interested in pursuing it.
Data Engineering is actually not a new name; it has been in the market for decades but was never used the way we see nowadays. Back then, a Data Engineer's job was also to handle, manage, and transfer data, but the difference was in the technologies and the size of the data. Because data was small, companies used old-school SQL systems, which were enough at that time to process it, but with the rise of data in recent years a better solution was needed. A few of the most common technologies you will hear about are Spark, Hive, and Presto; they are cheaper and faster.
This article is going to answer the most common questions I see every other day on Quora.
Typical questions are:
How to become a Data Engineer?
What's the path to becoming a Data Engineer?
How to switch from a Software Engineer to a Data Engineer?
What technologies are required for a Data Engineer?
A typical path to becoming a Data Engineer includes a few important things:
Love for Data
Big Data Technologies
Love for Data
It shows how passionate you are about data. You should work on the type of data you love; for example, looking into healthcare data might not be interesting to you, so you had better look for what you like. Almost every type of data is out there in the market. Top examples include healthcare, financial, real estate, ad tech, and social media data.
Big Data Technologies
In this space, technologies are still not that mature and companies are still adapting, with improvements coming every year, so if you want to be in this field you will need to be proactive in keeping yourself up to date with new tools and features.
Technologies can also be subdivided into categories:
Platforms: AWS and Google Cloud. These comprise various technologies that help build a fully scalable and reliable system. Compute instances like EMR (on EC2), databases like Redshift, and query services like Athena are some common applications used by Data Engineers. These platforms also ship improvements and new tools constantly, so you have to keep an eye out.
Data Processing: The (open-source) Hadoop ecosystem, which includes top tools like Spark, Hive, and HDFS, is very common in the market. Spark is commonly used to process large amounts of data, but SQL still holds its place: tools like Hive and Presto (Athena in AWS) use the same SQL to query a filesystem, and even Spark has Spark SQL, which shows that SQL is still worth learning. A new Spark release, Spark 3.0, is coming up.
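To make the "SQL is still worth learning" point concrete: the same aggregation skills carry over between Hive, Presto, and Spark SQL. As a toy sketch, using Python's built-in sqlite3 (not Hive or Presto) and an invented `events` table, a typical daily query looks like this:

```python
import sqlite3

# An in-memory database stands in for a Hive/Presto table;
# the schema and rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_type TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "purchase", 30.0), ("u1", "purchase", 20.0), ("u2", "view", 0.0)],
)

# The aggregation itself is plain SQL -- the same statement would run,
# with minor dialect changes, on Hive, Presto/Athena, or Spark SQL.
rows = conn.execute(
    "SELECT user_id, SUM(amount) AS total "
    "FROM events WHERE event_type = 'purchase' "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [('u1', 50.0)]
```

The engines differ in how they execute the query, but the SQL you write barely changes, which is why it stays a core skill.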
Schedulers: Airflow and Luigi are the most common schedulers in the market at the moment. Both are open-source projects, by Airbnb and Spotify respectively; they do the same workflow management but take different approaches.
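At their core, both tools manage a DAG of tasks and run each task only after its dependencies have finished. A minimal sketch of that idea in plain Python (this is not Airflow's or Luigi's actual API, and the task names are invented):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical pipeline: each task maps to the set of tasks it depends on,
# much as Airflow DAGs and Luigi tasks declare upstream dependencies.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A scheduler's core job: produce an execution order that
# respects every dependency edge in the graph.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Real schedulers add retries, scheduling intervals, and parallel execution of independent tasks on top of this dependency-ordering core.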
Databases/Data Warehouses: Old-school database systems are likely to go away pretty soon as new data warehousing technologies like Snowflake emerge very quickly. AWS Redshift is still popular among companies, though it's costly and lacks scalability.
Yes! Data Engineers are programmers as well: they write code to support data, including pre-processing logic, ETL, schedulers, and much more. That's why many Data Engineers used to be Software Engineers, and companies might prefer a Software Engineer as a Data Engineer instead of hiring someone fresh. The top programming languages used in this field are Scala, Python, and Java. Spark is written in Scala and supports all three languages mentioned, while schedulers like Luigi and Airflow are usually written in Python.
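To show what "code to support data" typically means, here is a minimal ETL sketch using only Python's standard library; the record format, cleaning rule, and data are invented for illustration, and a real pipeline would read from S3 or HDFS and load into a warehouse:

```python
import csv
import io

# Extract: in production this would come from S3/HDFS; an in-memory CSV stands in.
raw = "name,age\nAlice,34\nBob,not_a_number\nCarol,29\n"

def extract(text):
    """Parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop malformed rows and normalise types -- typical pre-processing logic."""
    clean = []
    for row in rows:
        try:
            clean.append({"name": row["name"], "age": int(row["age"])})
        except ValueError:
            continue  # skip rows whose age is not numeric
    return clean

# Load: in production this would write to a warehouse such as Redshift;
# here we simply collect the cleaned records.
loaded = transform(extract(raw))
print(loaded)  # [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 29}]
```

The same extract/transform/load shape scales up: swap the in-memory CSV for distributed storage and the Python loop for Spark, and the structure of the job stays the same.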
If you are looking to apply for a job then typically these are the minimum set of skills a Data Engineer must have:
AWS (EMR, S3, Redshift)
Hadoop (Spark, HDFS)