Getting started with Data Engineering - Google Cloud Notes

At a high level, data engineering is the practice of designing, building, monitoring, and securing data processing pipelines.

Data pipelines mainly perform ETL (Extract, Transform, Load) jobs, which take source data (structured or unstructured), transform it into a meaningful form, and load it into a data warehouse, where it can then be used for analytics or machine learning models.
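To make the three ETL stages concrete, here is a minimal sketch using only the Python standard library. It is not how Dataflow or Dataproc implement ETL; the sample data, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from files, APIs, or databases.
RAW_CSV = """order_id,amount,currency
1, 19.99 ,usd
2,5.00,USD
3,,usd
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, then normalise types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(amount), row["currency"].strip().upper())
        )
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a warehouse-like table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(text, conn):
    """Chain the three stages together."""
    load(transform(extract(text)), conn)

# SQLite in memory is an assumption standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
```

After the run, the `orders` table holds only the two complete rows, with amounts stored as numbers and currency codes in upper case, which is the "meaningful form" that analytics queries can rely on.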

Data processing can be done in batch or streaming mode. On GCP, we can use Dataproc (Apache Hadoop, Spark, etc.) for batch processing and Dataflow (Apache Beam programming model) for both batch and streaming.

Batch Pipeline:

Processes data in batches. Example: a nightly job that extracts sales transactions for analysis.
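The nightly sales job can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the transaction tuples, store names, and the `nightly_sales_batch` function are hypothetical, and a real job would read from a source system and run on a scheduler.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (transaction date, store, amount).
TRANSACTIONS = [
    (date(2024, 1, 1), "store_a", 120.0),
    (date(2024, 1, 1), "store_b", 80.0),
    (date(2024, 1, 1), "store_a", 30.0),
    (date(2024, 1, 2), "store_a", 50.0),
]

def nightly_sales_batch(transactions, run_date):
    """Process one bounded batch: every transaction recorded on run_date."""
    totals = defaultdict(float)
    for tx_date, store, amount in transactions:
        if tx_date == run_date:
            totals[store] += amount
    return dict(totals)

print(nightly_sales_batch(TRANSACTIONS, date(2024, 1, 1)))
# prints {'store_a': 150.0, 'store_b': 80.0}
```

The key property of a batch job is that the input is bounded: the job sees the complete set of records for the day before it produces any result.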

Streaming Pipeline:

Processes a continuous stream of data. Examples: IoT events, payment processing events, logs. Stream processing is used when you need near real-time results or decision making.
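In contrast to a batch job, a streaming job emits results while data is still arriving. The sketch below averages hypothetical IoT temperature readings over fixed 10-second windows. The event tuples and function names are assumptions for illustration, the stream is stubbed with a short generator, and events are assumed to arrive in timestamp order (managed services such as Dataflow also deal with late and out-of-order data, which this sketch does not).

```python
def event_stream():
    """Hypothetical unbounded source, stubbed with a few
    (timestamp_seconds, sensor_id, temperature) events."""
    yield from [
        (0, "s1", 20.0),
        (3, "s1", 22.0),
        (11, "s1", 30.0),
        (14, "s1", 34.0),
        (21, "s1", 25.0),
    ]

def tumbling_window_avg(events, window_seconds=10):
    """Yield (window_start, average) once an event from a later window arrives."""
    current_window, values = None, []
    for ts, _sensor, temp in events:
        window = ts - ts % window_seconds  # start of the window this event falls in
        if current_window is not None and window != current_window:
            yield current_window, sum(values) / len(values)
            values = []
        current_window = window
        values.append(temp)
    if values:  # flush the last window once the stubbed stream ends
        yield current_window, sum(values) / len(values)

for window_start, avg in tumbling_window_avg(event_stream()):
    print(window_start, avg)
# prints:
# 0 21.0
# 10 32.0
# 20 25.0
```

Each window's result becomes available as soon as the window closes, rather than at the end of the day, which is what makes near real-time decisions possible.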


To get started with data engineering, I recommend going through the courses and labs below.


Personally, I feel that if we are planning to build new data processing pipelines, or trying to learn data engineering, then we should consider Dataflow over traditional Hadoop/Spark jobs. The reason is that much of the infrastructure setup complexity is abstracted away as part of Google's managed service offering, which makes the learning curve a bit smoother.


I will continue posting more Dataflow and data engineering learning exercises.

-----------------------

If you need a short description of Google Cloud products, you can refer to the cheat sheet below:

Google Cloud Cheat Sheet


