ESP Journal of Engineering & Technology Advancements
© 2024 by ESP JETA
Volume 4, Issue 4
Year of Publication: 2024
Author: Anand Polamarasetti
DOI: 10.56472/25832646/JETA-V4I4P125
Anand Polamarasetti, 2024. "Training Generative AI for Low-Resource Languages in Cloud Infrastructure", ESP Journal of Engineering & Technology Advancements 4(4): 190-202.
Despite generative AI's transformative potential in natural language processing (NLP), low-resource languages lag behind because of data scarcity and limited computational resources. This study investigates whether cloud infrastructure can be used to train generative models for low-resource languages and thereby overcome both limitations. It begins by reviewing related work on generative AI and on techniques optimized for low-resource settings, identifying the challenges and distinctive requirements of such languages. The methodology couples data-collection techniques tailored to limited-resource settings with cloud-based model training on platforms such as AWS and Google Cloud, using transfer learning and cross-lingual methods to improve model performance. The results show that cloud infrastructure enabled efficient, scalable training and yielded significant improvements in coherence, cultural relevance, and linguistic accuracy for low-resource language generation. Cost-effective, accessible models optimized in the cloud thus prove viable for under-represented language communities. The study contributes to research on low-resource language processing by presenting a comprehensive cloud-based approach, and it concludes with recommendations for future work on advanced cloud-based optimization and on how AI models can better serve speakers of low-resource languages in an inclusive digital world.
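As a minimal illustration of the transfer-learning step described in the abstract, the sketch below fine-tunes a publicly available multilingual checkpoint on a small low-resource corpus using the Hugging Face Transformers library. The paper does not name a specific model or dataset, so the choice of google/mt5-small, the file low_resource_corpus.jsonl, and its source/target fields are assumptions made for illustration; the same script would run unchanged on an AWS or Google Cloud GPU instance.

```python
# Minimal sketch: cross-lingual transfer by fine-tuning a multilingual
# seq2seq checkpoint on a small low-resource parallel corpus.
# The corpus path and its "source"/"target" fields are hypothetical.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/mt5-small"  # assumed multilingual backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Hypothetical JSONL corpus of text pairs for the target language.
dataset = load_dataset("json", data_files="low_resource_corpus.jsonl")["train"]

def preprocess(batch):
    # Tokenize inputs and targets; truncation keeps sequences bounded.
    model_inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-low-resource",   # checkpoints land here (e.g., on a cloud VM disk)
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    fp16=True,                       # mixed precision to reduce cloud GPU cost
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads variable-length batches
    tokenizer=tokenizer,
)
trainer.train()
```

Starting from a multilingual checkpoint rather than training from scratch is what lets a few thousand in-language examples move the needle: the model reuses representations learned from high-resource languages, which is the cross-lingual transfer effect the study relies on.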
Keywords: Natural Language Processing, Language Models, Transfer Learning, Cross-Lingual Models, AWS, Google Cloud, Model Optimization, Data Scarcity, Computational Scalability, Linguistic Inclusivity, Digital Transformation.