Architecture for a High-Throughput, Low-Latency Big Data Pipeline on the Cloud

Excerpts from the article "Scalable Efficient Big Data Pipeline Architecture" by Satish Chandra Gupta

The big data pipeline is the railroad on which the heavy wagons of ML run.

A data pipeline stitches together the end-to-end operation: collecting the data, transforming it into insights, training a model, delivering insights, and applying the model wherever and whenever action needs to be taken to achieve the business goal.

There are five stages in the big data pipeline, illustrated with a minimal code sketch after the list:

🔹 Collect - Collect data from internal & external sources
🔹 Ingest - Ingest data through batch jobs and streams
🔹 Store - Store in Data Lake and/or Warehouse
🔹 Compute - Compute analytics aggregations and ML features
🔹 Use - Use it in dashboards, data science, and ML
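To make the five stages concrete, here is a minimal, self-contained Python sketch that wires them together in memory. Every name and the sample event data are hypothetical stand-ins, not part of the original article; in a real pipeline each stage would be backed by a managed cloud service rather than Python lists and dicts.

```python
from collections import defaultdict
from statistics import mean

# Collect: gather raw events from internal & external sources
# (hypothetical sample data standing in for app/SDK telemetry)
def collect():
    return [
        {"user": "alice", "action": "click", "latency_ms": 120},
        {"user": "bob",   "action": "view",  "latency_ms": 340},
        {"user": "alice", "action": "click", "latency_ms": 95},
    ]

# Ingest: accept events via batch jobs and streams; here, one batch pass
def ingest(events):
    # Drop malformed records so downstream stages see clean data
    return [e for e in events if "user" in e and "latency_ms" in e]

# Store: land events in a "data lake" (a list standing in for object storage)
data_lake = []

def store(events):
    data_lake.extend(events)

# Compute: aggregate analytics / ML features, e.g. mean latency per user
def compute_features():
    by_user = defaultdict(list)
    for e in data_lake:
        by_user[e["user"]].append(e["latency_ms"])
    return {user: mean(vals) for user, vals in by_user.items()}

# Use: serve the result to a dashboard, notebook, or model (here, print)
def use(features):
    for user, avg_latency in features.items():
        print(f"{user}: avg latency {avg_latency:.1f} ms")

if __name__ == "__main__":
    store(ingest(collect()))
    use(compute_features())
```

The point of the sketch is the shape, not the implementation: each stage consumes the previous stage's output, so any stage can be swapped for a managed service without touching the others.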

Typical serverless architectures of big data pipelines on Amazon Web Services, Microsoft Azure, and Google Cloud Platform (GCP) are shown below.
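The diagrams themselves are not reproduced in this excerpt, but the stage-to-service mapping they illustrate can be summarized in code. The mapping below is one plausible combination, not a prescription from the article; many equally valid service choices exist for each stage.

```python
# One plausible stage-to-service mapping for serverless big data
# pipelines on the three major clouds. Illustrative only.
PIPELINE_SERVICES = {
    "AWS": {
        "Collect": "App SDKs, IoT Core, CloudWatch Logs",
        "Ingest":  "Kinesis Data Streams / Firehose",
        "Store":   "S3 (lake), Redshift (warehouse)",
        "Compute": "Lambda, Glue, EMR",
        "Use":     "Athena, QuickSight, SageMaker",
    },
    "Azure": {
        "Collect": "App SDKs, IoT Hub",
        "Ingest":  "Event Hubs",
        "Store":   "Data Lake Storage, Synapse (warehouse)",
        "Compute": "Azure Functions, Azure Databricks",
        "Use":     "Synapse Analytics, Power BI, Azure ML",
    },
    "GCP": {
        "Collect": "App SDKs, Cloud Logging",
        "Ingest":  "Pub/Sub",
        "Store":   "Cloud Storage (lake), BigQuery (warehouse)",
        "Compute": "Cloud Functions, Dataflow, Dataproc",
        "Use":     "BigQuery, Looker Studio, Vertex AI",
    },
}

if __name__ == "__main__":
    for cloud, stages in PIPELINE_SERVICES.items():
        print(cloud)
        for stage, services in stages.items():
            print(f"  {stage:8s} {services}")
```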
