| ESP Journal of Engineering & Technology Advancements |
| © 2025 by ESP JETA |
| Volume 5 Issue 3 |
| Year of Publication : 2025 |
| Authors : Raviteja Narra |
| DOI : 10.5281/zenodo.19675152 |
Raviteja Narra, 2025. "Data Quality Engineering for AI Systems: A Survey of Validation, Drift Detection, and Monitoring Techniques", ESP Journal of Engineering & Technology Advancements 5(3): 202-211.
The growing dependence of industries on data-driven decision-making makes high-quality data and sound governance essential. Contemporary organisations manage large, heterogeneous data sets and must attend closely to data quality and compliance. This survey addresses these challenges by proposing a taxonomy that covers both traditional data-quality dimensions (e.g., accuracy, completeness, and timeliness) and machine-learning-specific dimensions (e.g., training-serving skew, label noise, feature freshness, and feedback-loop contamination). It also offers a structured comparison that maps validation, drift detection, and monitoring methods to the circumstances in which each is best suited. The reviewed articles, published between 2022 and 2024, cover algorithmic drift detection, serverless monitoring architectures, deep-learning-based validation, and industry-focused reliability. This synthesis identifies several trends: model-based drift detection, cost-efficient data cleaning, and adaptive monitoring. Gaps remain, however, including the lack of standardized benchmarks, the limited scalability of available solutions, and insufficient enterprise readiness to adopt them. The survey thus positions data-quality engineering as an important foundation for trustworthy artificial intelligence and identifies its benefits and drawbacks. More specific taxonomies and practical frameworks can help practitioners build pipelines that sustain model performance and maintain confidence in real-world deployments.
Keywords : Data quality engineering, Artificial intelligence, Data validation, Concept drift, Data monitoring and drift detection.