ISSN : 2583-2646

Building a Real-time Data Ingestion Platform for Web Log Analytics using GCP Pub/Sub and Dataflow

ESP Journal of Engineering & Technology Advancements
© 2025 by ESP JETA
Volume 5  Issue 4
Year of Publication : 2025
Authors : Vamshi Krishna Pamula
DOI : 10.56472/25832646/JETA-V5I4P104

Citation:

Vamshi Krishna Pamula, 2025. "Building a Real-time Data Ingestion Platform for Web Log Analytics using GCP Pub/Sub and Dataflow", ESP Journal of Engineering & Technology Advancements 5(4): 19-22.

Abstract:

This paper proposes a scalable, low-latency, fault-tolerant architecture for real-time web log analytics built on the native stream processing services of Google Cloud Platform. The main contribution is an end-to-end system design that uses Pub/Sub for high-volume ingestion and a custom Dataflow (Apache Beam) pipeline to process high-throughput unstructured log streams. The design covers custom parsing, real-time enrichment via the Beam Enrichment transform, and event-time-based aggregation. Most importantly, the paper presents an empirical analysis of the performance trade-offs between exactly-once and at-least-once stream processing semantics, with the goal of optimizing both latency and operating cost; the optimized setting is reported to reduce latency by a large factor in demanding web analytics workloads. In its current version, the system writes its output to BigQuery in a format that can be queried directly at minimal cost through partitioning and clustering.
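
To make the parsing and event-time aggregation steps concrete, the following is a minimal, library-free Python sketch of the idea rather than the paper's Dataflow pipeline. It assumes the logs arrive in Common Log Format and are counted per 60-second fixed window and HTTP status; the log format, the window size, and the names `parse_line`, `window_start`, and `count_by_window` are all illustrative assumptions that the abstract does not specify.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Common Log Format input (the abstract does not name a format).
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

WINDOW_SECONDS = 60  # hypothetical fixed-window size


def parse_line(line):
    """Parse one raw log line; return a dict, or None if the line is malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    # Event time comes from the log record itself, not from arrival time.
    event_time = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "ip": match["ip"],
        "path": match["path"],
        "status": int(match["status"]),
        "event_ts": int(event_time.timestamp()),
    }


def window_start(event_ts, size=WINDOW_SECONDS):
    """Map an event timestamp (epoch seconds) to the start of its fixed window."""
    return event_ts - (event_ts % size)


def count_by_window(lines):
    """Count requests per (window start, HTTP status), skipping malformed lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # a production pipeline would likely route these to a dead-letter output
        counts[(window_start(record["event_ts"]), record["status"])] += 1
    return counts
```

Because each record is assigned to a window by its own timestamp, the counts do not depend on the order in which lines arrive. In an Apache Beam pipeline the same roles would typically be played by a parsing step followed by fixed windowing and a per-key count, with watermarks and triggers deciding when each window's result is emitted.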

Keywords:

Cloud Computing, Stream Processing, Apache Beam, Dataflow, Web Log Analytics, Low Latency, Event Time Semantics.