Data Engineering

Building a data ETL pipeline in streaming and batch modes.

Project Background

To ensure effective analysis, both incoming and existing data in the BigQuery tables must be aggregated and transformed. In addition to batch processing, a long-running streaming data pipeline has been established. Given the vast amount of data from 20K brands across various social media platforms, an auto-scalable cloud architecture is essential, along with a disaster recovery mechanism. The system should also be adaptable enough to accommodate new data features from constantly evolving data sources.

Challenges & Requirements

The objective of this project is to transform data dynamically and surface insights from real-time streaming data. Once transformed, the data is stored in the Google BigQuery data warehouse; batch processing jobs handle ad-hoc requirements.

The project is built on the Google Cloud Platform, with data arriving from Google Pub/Sub or read from the BigQuery data warehouse. The ETL transformation runs in Dataflow and is written in Java using the Apache Beam SDK. The resulting data is stored in a BigQuery dataset.
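As a concrete illustration of that flow, here is a minimal streaming sketch in Python (the production pipeline is written in Java, but the Beam pipeline shape is the same). The subscription path, table name, and message fields are hypothetical placeholders, not the production schema:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_message(msg_bytes):
    # Pub/Sub delivers raw bytes; the JSON fields here are assumed,
    # not taken from the actual production schema.
    record = json.loads(msg_bytes.decode("utf-8"))
    return {"brand": record["brand"], "platform": record["platform"]}


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Consume messages from a (hypothetical) Pub/Sub subscription.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/social-posts")
        | "Parse" >> beam.Map(parse_message)
        # Append each parsed record to a (hypothetical) BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.brand_posts",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```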

Apache Beam offers a unified programming model for data ETL: a single codebase pattern covers both continuous streaming and one-time batch processing, the latter of which was used for ad-hoc tasks.
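That unified model means a transform written once can be dropped into either mode. As a hedged illustration (NormalizeBrand is a hypothetical cleanup step, not the project's actual logic), the same DoFn works in the streaming sketch above and the batch sketch further below:

```python
import apache_beam as beam


class NormalizeBrand(beam.DoFn):
    """Hypothetical cleanup step: trims and lowercases the brand field.

    Beam transforms are source-agnostic, so the same ParDo can run
    against an unbounded Pub/Sub stream or a bounded BigQuery read.
    """

    def process(self, record):
        yield {**record, "brand": record["brand"].strip().lower()}


# Usable in either pipeline as:
#   ... | "Normalize" >> beam.ParDo(NormalizeBrand()) | ...
```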

Actions & Outcomes

After considering the above factors, Google Dataflow was chosen as the data pipeline infrastructure, with Pub/Sub as the buffer layer for incoming data.
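Dataflow's autoscaling is configured through pipeline options at launch time. A sketch of those options in Python follows; the project, region, bucket, and worker ceiling are placeholder values, not the actual deployment settings:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# All identifiers below are hypothetical; substitute real project values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/dataflow-tmp",
    streaming=True,
    # Let Dataflow add or remove workers based on backlog and
    # throughput, up to the given ceiling.
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=20,
)
```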

Here is a demonstration of the Python batch pattern, with BigQuery as both the data source and the data destination.
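The sketch below follows that pattern; the query, table names, and schema are illustrative placeholders rather than the production code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        # Read rows produced by a standard-SQL query against BigQuery.
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
            query=(
                "SELECT brand, COUNT(*) AS posts "
                "FROM `my-project.social.posts` GROUP BY brand"
            ),
            use_standard_sql=True)
        # Rows arrive as dicts; reshape them for the destination table.
        | "ToOutputRow" >> beam.Map(
            lambda row: {"brand": row["brand"], "posts": row["posts"]})
        # Write the aggregates back to a (hypothetical) BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.brand_post_counts",
            schema="brand:STRING,posts:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```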

Technologies Used

The streaming Dataflow pipeline was developed in Java on the Apache Beam framework, while the batch processing code was implemented in Python. Both components were thoroughly tested to verify their core functionality. BigQuery serves as both the data source and the destination for this pipeline, and Pub/Sub acts as an intermediate buffer for streaming data.
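The project's actual test suite is not reproduced here, but Beam ships testing utilities that make this kind of verification straightforward; below is a minimal sketch using hypothetical cleanup logic:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


class NormalizeBrand(beam.DoFn):
    # Same hypothetical transform as sketched earlier.
    def process(self, record):
        yield {**record, "brand": record["brand"].strip().lower()}


def test_normalize_brand():
    # TestPipeline executes the transform locally on the DirectRunner.
    with TestPipeline() as p:
        output = (
            p
            | beam.Create([{"brand": "  ACME  ", "platform": "twitter"}])
            | beam.ParDo(NormalizeBrand())
        )
        assert_that(
            output,
            equal_to([{"brand": "acme", "platform": "twitter"}]))
```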

Conclusions

Consolidating the data ETL work into a single, visualizable pipeline has significantly reduced cost, simplified the architecture, and streamlined the management of its components. The new design replaced multiple sub-projects and automated what was previously manual scaling of worker servers, optimizing resource utilization. Combined with the Python batch processing, it reduces errors and provides greater flexibility to adapt to evolving business requirements.

Interested in hiring me for your project?

Looking for an experienced full-stack developer to build your web app or ship your software product? To start an initial chat, just drop me an email at vincent.zhu@storytreat.com or use the form on the contact page.