Getting started with Data Engineering - Google Cloud Notes

Data engineering, at a high level, is the practice of designing, building, monitoring, and securing data processing pipelines.

Data pipelines mainly perform ETL (Extract, Transform, Load) jobs, which transform source data (structured or unstructured) and load it into a data warehouse in a meaningful way, where it can then be used for analytics or machine learning models.

Data processing can take the form of batch or streaming. On GCP, we can use Dataproc (Apache Hadoop, Spark, etc.) for batch processing and Dataflow (Apache Beam programming) for both batch and streaming.

Batch Pipeline:

Processing the data in the form of batches. Example: a nightly job to extract sales transactions for analysis.
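As a rough illustration, here is a minimal batch pipeline sketch using the Apache Beam Python SDK. The bucket paths and CSV layout (status in the third column) are hypothetical placeholders, not a real dataset:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical batch job: read sales records from Cloud Storage,
# keep only completed transactions, and write the results back out.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | 'ReadSales' >> beam.io.ReadFromText('gs://my-bucket/sales/*.csv')
        | 'ParseCsv' >> beam.Map(lambda line: line.split(','))
        # Assumes column index 2 holds the transaction status.
        | 'KeepCompleted' >> beam.Filter(lambda row: row[2] == 'COMPLETED')
        | 'FormatOutput' >> beam.Map(lambda row: ','.join(row))
        | 'WriteResults' >> beam.io.WriteToText('gs://my-bucket/output/sales')
    )
```

The same code runs locally for testing or on Dataflow by passing the Dataflow runner in the pipeline options.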

Streaming Pipeline:

Processing a continuous stream of data. Examples: IoT events, payment processing events, logs. Stream processing is used when you need a near real-time outcome or decision.
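And a minimal streaming sketch in the same Beam Python SDK, assuming a hypothetical Pub/Sub topic of IoT events: it groups the stream into fixed 60-second windows and counts the events in each window.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Hypothetical streaming job: read IoT events from Pub/Sub and
# count how many arrive in each 60-second window.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadEvents' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/iot-events')
        | 'FixedWindows' >> beam.WindowInto(window.FixedWindows(60))
        | 'Ones' >> beam.Map(lambda event: 1)
        # without_defaults() skips emitting a count for empty windows.
        | 'CountPerWindow' >> beam.CombineGlobally(sum).without_defaults()
        | 'Print' >> beam.Map(print)
    )
```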


To get started with data engineering, I would recommend going through the courses and labs below.


I personally feel that if we are planning to build new data processing pipelines, or trying to learn data engineering, then we should consider Dataflow over traditional Hadoop/Spark jobs. The reason being that much of the infrastructure setup complexity is abstracted away as part of Google's managed service offering, which makes the learning curve a bit smoother.


Will continue posting further Dataflow and data engineering learning exercises.

-----------------------

If you need a short description of Google Cloud products, you can refer to the cheat sheet below:

Google Cloud Cheat Sheet


