Showing posts from August, 2022

Getting started with Data Engineering - Google Cloud Notes

Data engineering, at a high level, is the practice of designing, building, monitoring, and securing data processing pipelines. Data pipelines mainly perform ETL (Extract, Transform, Load) jobs, which transform source data (structured or unstructured) into a meaningful form in a data warehouse, which can then be used for analytics or machine learning models. Data processing can take the form of batch or streaming. On GCP, we can use Dataproc (Apache Hadoop, Spark, etc.) for batch processing and Dataflow (Apache Beam programming) for both batch and streaming.

Batch pipeline: processes data in batches. Example: a nightly job that extracts sales transactions for analysis.

Streaming pipeline: processes a continuous stream of data. Examples: IoT events, payment processing events, logs. Stream processing is used when you need near real-time outcomes or decision making.

To get started with data engineering, I would recommend going through the courses and labs below.

Modernizing Data Lakes and Data Warehouses w
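To make the ETL idea concrete, here is a minimal batch sketch in plain Python. It is only an illustration of the extract/transform/load stages with made-up sales records; a real pipeline on GCP would express the same stages as an Apache Beam pipeline run on Dataflow, or as a Spark job on Dataproc, loading into a warehouse such as BigQuery.

```python
# Minimal batch ETL sketch (hypothetical data; not a Dataflow/Beam pipeline).

def extract():
    # Extract: raw sales transactions, e.g. as read from source files or a DB.
    # Note the amounts arrive as strings, as raw source data often does.
    return [
        {"id": 1, "amount": "120.50", "region": "east"},
        {"id": 2, "amount": "80.00", "region": "west"},
        {"id": 3, "amount": "200.25", "region": "east"},
    ]

def transform(rows):
    # Transform: cast types and aggregate total sales per region.
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])
    return totals

def load(totals, warehouse):
    # Load: write the results into the "warehouse" (here just a dict;
    # in practice a warehouse table, e.g. in BigQuery).
    warehouse.update(totals)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'east': 320.75, 'west': 80.0}
```

A nightly batch job is exactly this shape run on a schedule; a streaming pipeline differs in that `extract` never ends, so the transform step must aggregate over windows of events instead of the whole input at once.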