
What is Data Engineering? How is it different from Data Science?

· 8 min read
Parham Parvizi

While the terms Data Engineering and Data Science are often used interchangeably, the two roles differ distinctly in their responsibilities and skills.

In short... Data Engineering is the practice of gathering and preparing the data that data science models and algorithms use.

Before data can be used in data science models, it needs to be acquired, cleansed, stored, organized, and finally passed on to the model in a standard format that the model accepts and understands. These are the responsibilities of a Data Engineer.

Data Engineers ensure that:

  • The data is collected from various sources
  • The data is stored, processed, and analyzed by models
  • The data is delivered to the end-users

[Image: data engineering tasks]

There's an old saying:

You know you've done everything right when no one thinks you've done anything at all!

Unfortunately, this holds true for Data Engineers! They are the glue that holds everything together. Data Engineers ensure that data is always collected on time, processed, and passed to its user, be that the data scientist, the business analyst, or the end-user's screen. When everything goes smoothly, it often goes unnoticed, but when things go wrong, people notice their stale screens right away. This is why we like to call Data Engineers the Silent or Unsung Heroes of data.

A Familiar Example: iOS Memories

To differentiate Data Engineering from Data Science let's take a look at a familiar example...

If you're an iPhone user, you're probably familiar with the iOS feature called Photo Memories.

This is where Apple automatically discovers a common story between your pictures and creates a clever and pleasing video tailored around the experience. It slides through the pictures and adds a suitable soundtrack to create a short video clip.

Take my personal example:

I have hundreds of pictures taken on my phone this year, ranging from pictures of my niece, cool things I saw walking around, and tons of selfies due to the inevitable 2020 COVID quarantines, to my hiking pictures.

[Image: my pictures]


Here, iOS detected that I went on a bunch of hikes in the beautiful Pacific Northwest (where I live) this year. It grouped these pictures together and created a "Memory" for me titled "On the Trail".



Let's dig into what goes into creating this video technically:

  1. First, my pictures are taken, stored on my iPhone, and then transferred to iCloud
  2. The quality of each picture is checked to filter out low-res and blurry images
  3. A series of processes is run to do what's called data enrichment. This tags my photos with useful information such as geolocation, timeline, orientation, or simply whether they are photos or screenshots
  4. Geolocation data is used to do proximity detection to further tag my photos with famous landmarks or known places like nearby trails
  5. Advanced Computer Vision and Image Recognition algorithms are run to do what's called feature extraction. This tags my photos with things like people, friends, landmarks, objects, trees, my car, etc.
  6. A Data Science Model classifies and groups my pictures together based on the tags created above
  7. The model detects a common theme ("hiking") and decides whether there are enough strongly themed pictures to make a movie
  8. A suitable soundtrack is added based on the detected "theme"
  9. Pictures, along with the soundtrack, are stitched together into a video
  10. A notification is sent to my phone to say I have a new "Memory"

Plus, this entire list is repeated every time I take a new set of photos, to check whether there could be a new "Memory"!


In these steps:

  • Tasks #1 through #4 are common Data Engineering tasks. A Data Engineer is in charge of collecting, organizing, cleansing, and enriching the data.

  • Task #5 is often shared between a Data Engineer and a Data Scientist. Parallel processing of the photos by various computer vision algorithms is often done by Data Engineers, while the types of feasible algorithms are often decided by Data Scientists.

  • Tasks #6 through #8 are distinctly Data Scientist tasks: developing models that classify and group pictures with high confidence that they tell a single "story". Having detected the "theme" of the story, the Data Scientist often assigns a suitable soundtrack: classic, epic, calming, dancing, ...

  • Tasks #9 and #10 are again a Data Engineer's job: to stitch together the video and notify the user of their new "Memory"

The Data Engineer is also in charge of automating this entire flow and building the data pipeline to refresh the process every time I take a new group of pictures.
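To make the flow above a little more concrete, here is a minimal sketch of a few of those steps in Python. It is only an illustration and has nothing to do with Apple's actual implementation: the `Photo` fields, the `LANDMARKS` table and its coordinates, the resolution and distance thresholds, and the "most common tag" rule standing in for the data science model are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt


@dataclass
class Photo:
    name: str
    width: int
    height: int
    lat: float
    lon: float
    tags: set = field(default_factory=set)


# Hypothetical landmark table: name -> (latitude, longitude, tag).
LANDMARKS = {"Example Trailhead": (47.50, -121.80, "hiking")}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def filter_low_res(photos, min_pixels=1_000_000):
    """Step 2: drop photos below an assumed resolution threshold."""
    return [p for p in photos if p.width * p.height >= min_pixels]


def tag_nearby_landmarks(photos, max_km=2.0):
    """Steps 3-4: tag each photo with landmarks found within max_km."""
    for p in photos:
        for lat, lon, tag in LANDMARKS.values():
            if haversine_km(p.lat, p.lon, lat, lon) <= max_km:
                p.tags.add(tag)
    return photos


def detect_theme(photos, min_photos=3):
    """Steps 6-7 (toy stand-in for a model): most common tag, if shared by enough photos."""
    counts = Counter(tag for p in photos for tag in p.tags)
    if not counts:
        return None
    theme, n = counts.most_common(1)[0]
    return theme if n >= min_photos else None


def run_pipeline(photos):
    """Chain the steps; this would be rerun whenever new photos arrive."""
    return detect_theme(tag_nearby_landmarks(filter_low_res(photos)))


photos = [
    Photo("hike1.jpg", 4032, 3024, 47.501, -121.801),
    Photo("hike2.jpg", 4032, 3024, 47.502, -121.799),
    Photo("hike3.jpg", 4032, 3024, 47.499, -121.800),
    Photo("tiny_thumbnail.jpg", 320, 240, 47.500, -121.800),
    Photo("selfie.jpg", 4032, 3024, 45.52, -122.68),
]
print(run_pipeline(photos))  # hiking
```

The filtering and tagging functions are the kind of work that falls to the Data Engineer, while `detect_theme` is a placeholder for what a Data Scientist's model would do. A real pipeline would replace each function with a far more capable service and run them on a schedule or on an event trigger.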

Data Engineering Roles & Tools

There's an excellent blog post by Monica Rogati on the data science hierarchy of needs:

[Image: data engineering workflow]

Looking at the pyramid above:

  • Bottom 3 tasks (collect, move/store, explore/transform) are clear Data Engineering tasks
  • 4th task (aggregate/label) is commonly shared between Data Engineers and Data Scientists
  • Top two tasks are often done strictly by Data Scientists

Data Engineers are tasked with creating automated data pipelines to collect, transform, and store data within an organization. You might have heard the term "ETL", which stands for "Extract, Transform, Load", used to describe this.
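As a rough illustration of what an ETL job looks like, here is a tiny sketch using only Python's standard library. The sample data, the `users` table, and the cleaning rules (drop rows with no signup date, upper-case the country code) are assumptions invented for this example; a production pipeline would read from real sources and load into a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,signup_date,country
1,2020-01-15,us
2,,ca
3,2020-03-02,US
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a signup date and normalize country codes."""
    return [
        (int(r["user_id"]), r["signup_date"], r["country"].upper())
        for r in rows
        if r["signup_date"]
    ]


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (user_id INTEGER, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT * FROM users").fetchall())
# [(1, '2020-01-15', 'US'), (3, '2020-03-02', 'US')]
```

Keeping the three stages as separate functions makes each one easy to test and swap out on its own, which is most of what "maintaining a data pipeline" comes down to day to day.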

Data Engineers work closely with Data Scientists or Business Analysts to provide the data needed to refresh their ML/AI models and Reports and Dashboards. They're in charge of making sure everything runs smoothly and data is delivered from its source to the destination without a hitch. They make sure that everything is cleansed, labeled, and organized correctly for easy and timely retrieval.

Data Engineers often wear many different hats. Their most common responsibilities include:

  1. Developing and maintaining data pipelines
  2. Acquisition of data from various sources
  3. Architecture and design of systems to store and label data
  4. Data security and governance to ensure only the right people gain access to only what they need
  5. Distribution of data to internal users such as Data Scientists or Business Analysts
  6. Automating processes and monitoring to ensure things run smoothly
  7. Serving data to end-users

Data Engineers often set up and maintain distributed applications such as Databases, Cloud Services, or Big Data tools like Hadoop and Spark. While any programming language could be used for Data Engineering, the most prevalent languages are Python, Java/Scala, and more recently Go, with Python being the most used and Go closing the gap. Among the distributed applications used by Data Engineers, Apache Hive and Spark are the most commonly used Big Data tools. That said, companies are quickly shifting to the Cloud, where Serverless, pay-per-use technologies like Cloud Containers and Functions are heavily favored for their cost and relative ease of startup and maintenance.

Data Engineering Demand and Salaries

Needless to say, Data Engineering is one of the hottest job markets in the tech industry. Data is everywhere, and every company needs to make smart data decisions to stay competitive in its industry; hence, the need for Data Engineering is on the rise.

Searches for Data Engineer jobs on Indeed and Glassdoor yield, on an average day, over 100,000 openings in the United States. That is 4x the number returned by a search for Data Scientists, which yields around 25,000 jobs.

Currently, the average base salary for a Data Engineer is $130,000 per year on Indeed, while a Data Scientist is tracking at around $125,000. In my personal experience, though, Data Scientists make slightly more than Data Engineers, while there are typically 3-4x more Data Engineers on a given team.

Conclusion

In short, numbers don't lie: the demand for Data Engineers surpasses every other field in tech.

Data is only growing, and therefore the demand for Data Engineers will continue to rise. As much as my mother told me to become a Doctor, I might be telling my children to become Data Engineers! Seeking a career in Data Engineering is surely one of the safest paths to a stable and high-paying job.

Of course, I'm going to tell you that you can learn to become a Data Engineer now! Yes, at Tura Labs we provide a completely FREE Data Engineering Bootcamp.

I know I sound like a used-car dealer, but I'm not here to sell anything. It's free!

Data Engineering has been an amazing career for me, and we aim to provide the tools to let others have the same. Yes, it might not be as glamorous as showing your friends that you made the coolest new app or video game, but I assure you that it is every bit as satisfying and challenging. If you are a Generalist who loves data and strives to make things perfect, Data Engineering might be a good career for you!