Getting started with Data Engineering - Google Cloud Notes
At a high level, data engineering is about designing, building, monitoring, and securing data processing pipelines.
Data pipelines mainly perform ETL (Extract, Transform, Load) jobs: they take source data (structured or unstructured), transform it, and load it into a data warehouse in a meaningful form, where it can then be used for analytics or machine learning models.
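To make the three ETL stages concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the sample CSV, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_SALES_CSV = """order_id,amount,currency
1001, 19.99 ,usd
1002,,usd
1003,5.00,USD
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows without an amount and normalise types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(amount), row["currency"].strip().upper())
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(raw_text):
    """Run the three stages in order against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    load(conn, transform(extract(raw_text)))
    return conn


if __name__ == "__main__":
    connection = run_etl(RAW_SALES_CSV)
    print(connection.execute("SELECT * FROM sales ORDER BY order_id").fetchall())
```

Keeping extract, transform, and load as separate functions mirrors how pipeline stages are usually composed, so each one can be tested or swapped out on its own.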
Data processing can take the form of batch or streaming. On GCP, we can use Dataproc (Apache Hadoop, Spark, etc.) for batch processing and Dataflow (Apache Beam programming model) for both batch and streaming.
Batch Pipeline:
Processes data in batches. Example: a nightly job that extracts sales transactions for analysis.
Streaming Pipeline:
Processes a continuous stream of data. Examples: IoT events, payment processing events, logs. Stream processing is used when you need near real-time outcomes or decisions.
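The difference between the two styles can be sketched in a few lines of plain Python. This is not Dataflow or Apache Beam code: the event data, the function names, and the 60-second window are assumptions made purely for illustration, and events are assumed to arrive in timestamp order.

```python
# Hypothetical events: (timestamp_in_seconds, device_id, reading)
EVENTS = [
    (0, "sensor-a", 20.0),
    (30, "sensor-b", 22.0),
    (65, "sensor-a", 24.0),
    (70, "sensor-b", 26.0),
    (130, "sensor-a", 28.0),
]


def batch_average(events):
    """Batch: the whole data set is available, so compute one result over all of it."""
    readings = [value for _, _, value in events]
    return sum(readings) / len(readings)


def streaming_window_averages(event_stream, window_seconds=60):
    """Streaming: consume events one at a time and emit one result per fixed window.

    Assumes events arrive in timestamp order; real streaming systems also
    have to deal with late and out-of-order data.
    """
    current_window = None
    total, count = 0.0, 0
    for timestamp, _, value in event_stream:
        window = timestamp // window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete, so emit (window_start, average).
            yield current_window * window_seconds, total / count
            total, count = 0.0, 0
        current_window = window
        total += value
        count += 1
    if count:
        # Flush the last window once the stream ends.
        yield current_window * window_seconds, total / count


if __name__ == "__main__":
    print("batch:", batch_average(EVENTS))
    for window_start, average in streaming_window_averages(iter(EVENTS)):
        print("window starting at", window_start, "->", average)
```

The batch function returns a single answer after seeing everything, while the streaming generator produces a result as soon as each window closes, which is what makes near real-time decisions possible.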
To get started with data engineering, I would recommend going through the courses and labs below.
- Modernizing Data Lakes and Data Warehouses with Google Cloud
- Serverless Data Processing with Dataflow: Foundations
- Serverless Data Processing with Dataflow: Develop Pipelines
Personally, I feel that if we are planning to build new data processing pipelines, or trying to learn data engineering, then we should consider Dataflow over traditional Hadoop/Spark jobs. The reason is that much of the infrastructure setup complexity is abstracted away as part of Google's managed service offering, which makes the learning curve a bit smoother.
I will continue posting further Dataflow and data engineering learning exercises.
-----------------------
If you need a short description of Google Cloud products, you can refer to the cheat sheet below: