Indic-BERT: A Comprehensive Analysis of Statistical Properties, Qualitative Qualities, Training Methodology, Applications, and Evaluation Metrics
1. Introduction: The Significance of Indic-BERT
India needs its own AI, and high-quality datasets and embedding models are prerequisites for building foundation models.
India, a subcontinent with 22 officially recognized languages and a multitude of dialects, presents a unique challenge for developing NLP systems that can effectively cater to its multilingual populace 1. The development of robust language models for these Indian languages is further complicated by the scarcity of annotated datasets and the intricate morphological structures inherent in many of them 3. This necessitates a dedicated effort to create language models specifically designed to understand and process these linguistic nuances.
Indic-BERT, developed by AI4Bharat at IIT Madras, emerges as a significant milestone in this endeavor 1. It represents a pioneering effort to construct a robust multilingual artificial intelligence model tailored for the Indian context 1. The model supports 12 major languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu, with the primary aim of enhancing NLP performance across these languages 1.
The creation of Indic-BERT addresses the limitations often encountered when applying general-purpose multilingual models like mBERT and XLM-R to the intricacies of Indian languages. By focusing specifically on this group of languages, the developers aimed to capture the unique linguistic characteristics more effectively than broader multilingual approaches.
Indic-BERT is built upon the ALBERT architecture, which is itself a derivative of the widely successful BERT model, known for its efficiency in learning language representations 5. The model was pre-trained on an extensive corpus of approximately 9 billion tokens spanning the 12 supported languages 4. The selection of the ALBERT architecture suggests a deliberate strategy to develop a high-performing model while minimizing the number of parameters compared to standard BERT, thereby making it more accessible for broader use. This architectural choice likely facilitates efficient learning from a substantial multilingual dataset through techniques like parameter sharing.
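To make this concrete, the sketch below loads the publicly released checkpoint through the Hugging Face Transformers library and inspects the contextual representations it produces. The model ID "ai4bharat/indic-bert" matches the Hugging Face listing cited above; the example sentence and the printed shape are purely illustrative.

```python
# A minimal sketch, assuming the Hugging Face Transformers library and the
# publicly listed "ai4bharat/indic-bert" checkpoint (sentencepiece required).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# Encode a Hindi sentence and inspect the contextual token representations.
inputs = tokenizer("भारत एक विविधतापूर्ण देश है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```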
2. Statistical Properties: A Quantitative Analysis
A key aspect of understanding Indic-BERT lies in examining its statistical properties, which provide insights into its quantitative characteristics and design choices. Notably, Indic-BERT possesses a significantly lower number of parameters compared to other prevalent multilingual models such as mBERT and XLM-R 1. This reduction in parameter count is not merely a design choice but a crucial factor contributing to the model's efficiency, resulting in faster training and inference times 1. This design emphasis on efficiency likely aimed to make the model more practical for deployment in environments where computational resources might be limited.
The foundation of Indic-BERT's learning lies in the vast amount of data it was trained on. The model was pre-trained using the IndicCorp corpus, which comprises approximately 9 billion tokens 4. This corpus covers the 12 major Indian languages, with varying amounts of data allocated to each. For instance, English and Hindi have the largest representation in terms of token count, while other languages have comparatively fewer tokens 5. This imbalance in the distribution of training data across languages could potentially influence the model's performance on different languages, with those having more training data possibly exhibiting superior results.
To gauge Indic-BERT's effectiveness, it has been benchmarked against other models on various NLP tasks. Despite its smaller size, Indic-BERT achieves performance that is on par with or even surpasses mBERT and XLM-R on several tasks 4. Specifically, its performance on the IndicGLUE benchmark, which includes tasks like News Article Headline Prediction, Wikipedia Section Title Prediction, and Cloze-style multiple-choice Question Answering, demonstrates its capabilities. In some instances, Indic-BERT outperforms or closely matches the performance of these larger multilingual models 5. These benchmark results underscore the effectiveness of Indic-BERT's architecture and the targeted training on Indic languages, suggesting that a language-specific focus can indeed yield better outcomes for certain language groups compared to broad multilingual training.
Table 1: Performance comparison of Indic-BERT with mBERT and XLM-R on IndicGLUE tasks
Table 1 provides a clear quantitative comparison of Indic-BERT with mBERT and XLM-R across various tasks. It highlights instances where Indic-BERT demonstrates superior or comparable performance, reinforcing the idea of its statistical efficiency and effectiveness for Indian language processing.
3. Qualitative Qualities: Linguistic Strengths and Weaknesses
Beyond the quantitative metrics, Indic-BERT exhibits notable qualitative qualities that underscore its linguistic capabilities in understanding and processing Indian languages. Its ability to comprehend and handle text in 12 major Indian languages makes it an invaluable asset for developing multilingual NLP applications 1. This capability stems from its training on a diverse corpus, which allows it to capture the intricate syntactic and semantic nuances inherent in these languages 1. This multilingual support and the capacity to understand linguistic subtleties are key qualitative strengths that enable Indic-BERT to effectively navigate the complexities of Indian languages.
Indic-BERT has demonstrated strong performance across a range of NLP tasks, further highlighting its qualitative strengths. It achieves impressive accuracy in tasks such as news article headline prediction, sentiment analysis, named entity recognition, and question answering 4. Moreover, it has been successfully fine-tuned for sentiment classification in code-mixed Dravidian languages, showcasing its adaptability to the linguistic phenomenon prevalent in multilingual societies 10. The successful application of Indic-BERT to these diverse downstream tasks indicates a robust qualitative ability to understand and represent the meaning of text in Indian languages. The success in fine-tuning for specific tasks suggests that the pre-trained representations are semantically rich and can be effectively adapted for various NLP applications.
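As an illustration of this adaptability, the following hedged sketch fine-tunes Indic-BERT for sentence-level sentiment classification using the Hugging Face Trainer API. The two-example dataset, the three-class label scheme, and the hyperparameters are placeholders for illustration, not the setup used in the cited Dravidian-language work.

```python
# A hedged sketch of fine-tuning Indic-BERT for sentiment classification.
# Dataset, labels, and hyperparameters below are illustrative placeholders.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

model_name = "ai4bharat/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tiny illustrative dataset: text + sentiment label (0=negative, 1=neutral, 2=positive).
train_data = Dataset.from_dict({
    "text": ["यह फिल्म बहुत अच्छी थी", "सेवा खराब थी"],
    "label": [2, 0],
})

def tokenize(batch):
    # Indic-BERT was pre-trained with a 128-token context, so truncate accordingly.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indic-bert-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()
```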
Despite its strengths, Indic-BERT also has certain limitations. One notable limitation is its context window size, which is restricted to 128 tokens 4. This can pose a challenge when analyzing longer texts, as it necessitates splitting documents into smaller segments, which might not be ideal for all applications 4. This limited context window represents a qualitative weakness that could potentially affect the model's ability to grasp long-range dependencies and context within extended pieces of text.
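One common workaround, sketched below, is to split long documents into overlapping 128-token segments before encoding. The helper function, the stride value, and the overlap strategy are illustrative choices under that assumption, not part of the model itself.

```python
# A minimal sketch of chunking a long document to fit the 128-token window.
# Stride and helper name are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

def chunk_document(text, max_length=128, stride=32):
    """Split a long text into overlapping chunks of at most max_length tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    window = max_length - 2          # leave room for [CLS] and [SEP]
    step = window - stride           # overlap consecutive chunks by `stride` tokens
    chunks = []
    for start in range(0, max(len(ids), 1), step):
        piece = ids[start:start + window]
        chunks.append(tokenizer.build_inputs_with_special_tokens(piece))
        if start + window >= len(ids):
            break
    return chunks

long_text = "..."  # a long news article or document would go here
segments = chunk_document(long_text)
print(f"{len(segments)} segment(s) of <= 128 tokens each")
```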
Furthermore, while Indic-BERT supports 12 major Indian languages, scaling this to include the hundreds of dialects and addressing the complexities of different scripts and regional variations remain significant qualitative challenges 1. The linguistic diversity within India is substantial, and encompassing all these variations within a single model is a complex undertaking that requires extensive data and sophisticated modeling techniques.
4. The Making of Indic-BERT: Training and Data
The development of Indic-BERT involved a carefully designed training process centered around the Masked Language Modeling (MLM) objective 1. In this self-supervised learning approach, the model learns to predict randomly masked words within sentences, thereby developing a deep understanding of the contextual relationships between words 12. This method enables the model to learn rich language representations from unlabeled text, which is particularly valuable in the context of Indian languages where labeled data can be scarce.
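The sketch below illustrates the MLM objective at inference time: a token in a Hindi sentence is replaced by the mask token and the model recovers it from context. The example sentence is an illustrative assumption, and the top-1 decoding shown is a simplification of how the MLM loss is actually computed during pre-training.

```python
# A minimal sketch of masked-token prediction with Indic-BERT.
# The Hindi sentence is illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForMaskedLM.from_pretrained("ai4bharat/indic-bert")

text = f"दिल्ली भारत की {tokenizer.mask_token} है।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```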
The initial version of Indic-BERT was pre-trained on the IndicCorp dataset, which contains approximately 9 billion tokens across the 12 supported languages 5. Subsequently, newer and more advanced versions, such as IndicBERT v2, were trained on the significantly larger IndicCorp v2 dataset. This expanded corpus boasts 20.9 billion tokens and covers 24 constitutionally recognized Indian languages 14. The distribution of tokens within the original IndicCorp dataset varied across the 12 languages, with specific counts available for each 5. The progression from IndicCorp to IndicCorp v2 signifies a clear effort to broaden the model's language coverage and enhance its performance by leveraging larger training datasets. The increase in data generally contributes to better generalization and a more robust language model.
The curation of the training data and the training methodology were guided by specific principles. The data was sourced from a variety of domains, including news articles, literary works, and social media content, with the aim of exposing the model to diverse sentence structures, grammatical patterns, and vocabulary 1. For a related model, IndicBART, the principle of using the Devanagari script to represent most Indic languages was employed to facilitate transfer learning among languages with similar scripts 18. While not explicitly stated for Indic-BERT, a similar consideration of script overlap might have influenced the data preparation, given the shared Brahmi script origins of many Indian languages. This focus on diverse data and the potential leveraging of script similarities suggests a principle of maximizing the model's ability to handle real-world text and to promote cross-lingual understanding. Training on varied data helps the model become more adaptable to different writing styles and subject matter.
5. Core Principles and Design Philosophy
Several core principles and a distinct design philosophy underpinned the development of Indic-BERT. A primary objective was to create a language model that not only performs well but is also specifically tailored to the unique linguistic characteristics of Indian languages, thereby overcoming the limitations often encountered with general multilingual models 1. Another guiding principle was efficiency, which led to the adoption of the ALBERT architecture, known for its ability to achieve high performance with a reduced number of parameters 1. The fundamental design philosophy, therefore, centered on achieving state-of-the-art performance on Indic languages through a computationally efficient model, likely driven by the practical need for models that can be trained and deployed without requiring extensive computational resources.
A significant emphasis was placed on ensuring strong performance on Indic languages, as evidenced by the model's evaluation on benchmarks like IndicGLUE and IndicXTREME, which showcase its robust capabilities across various NLP tasks relevant to these languages 5. Furthermore, Indic-BERT was designed to excel at cross-language transfer learning, enabling it to generalize its understanding from one Indian language to another, even in scenarios where training data for a specific language might be limited 1. A central principle here is the exploitation of the inherent similarities among Indian languages to facilitate learning and enable the effective transfer of knowledge across them. This is particularly beneficial for low-resource Indian languages where the availability of training data might be scarce.
The choice of the ALBERT architecture played a crucial role in realizing these design goals. ALBERT's architectural innovations, including parameter sharing between layers and factorization techniques, contribute to the model's efficiency and its capacity to learn robust representations from large volumes of multilingual data 5. This allows Indic-BERT to achieve performance levels comparable to or even better than larger models, all while requiring fewer computational resources. The selection of ALBERT was a strategic decision to balance performance and efficiency, positioning Indic-BERT as a practical and effective solution for NLP in the Indian linguistic context. This architectural decision reflects a focus on real-world usability and accessibility.
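The parameter savings from ALBERT's factorized embedding parameterization can be illustrated with simple arithmetic: instead of a single V x H embedding matrix, the model learns a V x E matrix followed by an E x H projection. The sketch below uses assumed values for the vocabulary size and dimensions, not the exact Indic-BERT configuration.

```python
# Back-of-the-envelope arithmetic for ALBERT-style factorized embeddings.
# Vocabulary size and dimensions are assumed, illustrative values.
vocab_size = 200_000   # large multilingual vocabulary (assumed)
hidden_size = 768      # transformer hidden dimension (assumed)
embed_size = 128       # factorized embedding dimension (assumed)

bert_style = vocab_size * hidden_size                               # V x H
albert_style = vocab_size * embed_size + embed_size * hidden_size   # V x E + E x H

print(f"Unfactorized embedding parameters: {bert_style:,}")
print(f"Factorized embedding parameters:   {albert_style:,}")
print(f"Reduction factor: {bert_style / albert_style:.1f}x")
```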
6. Indic-BERT in Action: Famous Products and Projects
Indic-BERT's capabilities have led to its application in various real-world products and projects across diverse domains. Its foundational ability to perform NLP tasks such as text classification, sentiment analysis, named entity recognition, and question answering makes it a versatile tool 1. For example, Indic-BERT has been utilized in sentiment analysis for product reviews 8 and for the classification of news articles 8. Its potential extends to various AI-powered tools and platforms designed for Indian language users. The fundamental NLP capabilities of the model serve as a strong foundation for more complex downstream applications.
The Hugging Face model hub, a popular platform for sharing and accessing pre-trained language models, lists several "Spaces" that utilize ai4bharat/indic-bert, indicating active community engagement and usage in various projects 5. These include leaderboards for evaluating performance on Indic language understanding tasks 5. Furthermore, Indic-BERT has been employed in the development of question answering systems specifically designed for Indian languages 12. Its capabilities also extend to supporting speech recognition and natural language understanding in voice assistants, making these technologies more accessible to non-English speakers in India 1. The presence of Indic-BERT in community-driven projects and its potential in areas like voice assistance and question answering highlight its practical utility in addressing real-world needs.
Beyond Indic-BERT itself, AI4Bharat has also developed other related models, such as IndicBART 14. IndicBART can be fine-tuned for natural language generation tasks in Indian languages, suggesting a broader ecosystem of tools being developed to support NLP for these languages. While not directly Indic-BERT, the existence of such related models indicates a larger and ongoing effort to build comprehensive NLP capabilities for the diverse set of Indian languages. This suggests a growing ecosystem and potential for synergy between different models developed by AI4Bharat.
7. Staying Current: The Latest Version of Indic-BERT
Keeping abreast of the latest developments is crucial in the rapidly evolving field of language models. The most recent iteration of Indic-BERT is IndicBERT v2 8. Based on the publication dates of associated research and model updates, IndicBERT v2 was released around December 2022 14. This latest version represents the current state-of-the-art for the Indic-BERT family of models, incorporating significant advancements and expanded capabilities.
IndicBERT v2 boasts several key improvements over its predecessor. Notably, it is pre-trained on the IndicCorp v2 dataset, which is significantly larger, containing 20.9 billion tokens, and covers a broader range of 24 constitutionally recognized Indian languages 14. This is a substantial increase in language support compared to the original Indic-BERT's coverage of 12 languages. Furthermore, IndicBERT v2 has been evaluated on the IndicXTREME benchmark, a more comprehensive and challenging evaluation suite designed specifically for Indic languages 14. Various versions of IndicBERT v2 have been trained using different objectives, such as Masked Language Modeling (MLM) and Translation Language Modeling (TLM), and with different datasets, including Samanantar and data generated through back-translation 16. These enhancements in v2, including expanded language coverage, a larger training dataset, and evaluation on a more rigorous benchmark, indicate significant progress in the model's overall capabilities. These improvements likely lead to enhanced performance and broader applicability across a wider array of Indian languages.
8. Evaluating the Fairness Quotient: Bias in Indic-BERT
Evaluating the fairness of language models like Indic-BERT is a critical aspect of responsible AI development. However, assessing bias in multilingual models presents unique challenges, particularly due to the limited availability of benchmarks and resources for bias evaluation beyond the English language 26. Methodologies developed for detecting bias in English word embeddings and language models might not be directly applicable to other languages, especially those with different grammatical structures and gender representations, as is the case with many Indian languages 27.
One approach used for evaluating bias in multilingual settings is the Multilingual Bias Evaluation (MBE), which often utilizes parallel corpora to assess bias by comparing the likelihood assigned to sentences with male and female pronouns or entities 26. Research has also specifically explored gender bias in Hindi language models, revealing the presence of such biases 27. Given that Indic-BERT is trained on a large corpus of text that reflects societal norms and potentially biases, it is also susceptible to encoding similar biases, which could lead to unfair or discriminatory outputs in downstream applications. These biases often originate from the data used during the pre-training phase 27.
Efforts are underway to adapt and develop specific metrics for quantifying bias in multilingual models, including those relevant to the nuances of Indian languages. For instance, metrics like DisCo perform a chi-squared test to identify statistically significant differences in model predictions based on gendered contexts 26. The development of such metrics is essential for objectively measuring the extent of bias and for guiding efforts to mitigate it in models like Indic-BERT.
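As a rough illustration of the idea behind DisCo, the sketch below compares Indic-BERT's top fill-in predictions for a gendered Hindi template pair and applies a chi-squared test to the resulting counts. The templates, the add-one smoothing, and the scoring recipe are simplified assumptions for illustration, not the official DisCo implementation.

```python
# A hedged, simplified sketch of a DisCo-style chi-squared check.
# Templates and smoothing are assumptions, not the published protocol.
from collections import Counter
from scipy.stats import chisquare
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForMaskedLM.from_pretrained("ai4bharat/indic-bert")

def top_predictions(template, k=10):
    inputs = tokenizer(template, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    top_ids = logits[0, mask_pos].topk(k, dim=-1).indices[0]
    return [tokenizer.decode(i).strip() for i in top_ids]

male = top_predictions(f"वह आदमी एक {tokenizer.mask_token} है।")
female = top_predictions(f"वह औरत एक {tokenizer.mask_token} है।")

# Count how often each candidate appears in either context and test whether
# the two prediction distributions differ more than chance would suggest.
candidates = sorted(set(male) | set(female))
male_counts = [Counter(male)[c] + 1 for c in candidates]      # add-one smoothing
female_counts = [Counter(female)[c] + 1 for c in candidates]
stat, p_value = chisquare(f_obs=male_counts, f_exp=female_counts)
print(f"chi-squared = {stat:.2f}, p = {p_value:.3f}")
```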
9. Ensuring Impartiality: Assessing Fairness in Indic-BERT
While bias evaluation focuses on identifying the presence of potentially harmful associations, assessing fairness in Indic-BERT involves examining the model's impartiality in its language understanding and generation capabilities. Fairness in NLP is concerned with addressing the broader issue of social biases that language models might perpetuate 28. Traditional fairness metrics, such as demographic parity, which are commonly used in other areas of machine learning, might not be directly applicable to the complexities of natural language tasks 29. Instead, the evaluation of fairness often involves assessing the model's performance on various downstream tasks to identify and measure extrinsic bias 29.
One aspect of fairness to consider is how Indic-BERT performs across the different demographic groups or languages it is designed to support. Given the variations in training data size and the inherent linguistic complexities of the 12 languages covered 5, the model's performance might not be uniform across all of them. Significant underperformance on certain languages could be viewed as a fairness concern. Furthermore, bias evaluation needs to extend beyond gender and consider fairness across other relevant social groups within the diverse Indian context.
The development and evaluation of related models like IndicSBERT, a variant of BERT tailored for sentence-level understanding in 10 Indian languages, also contribute to the broader understanding of fairness and performance in Indic NLP 30. Different model architectures and training approaches can exhibit varying degrees of fairness, and studying these variations is crucial for advancing the field. Ensuring fairness in Indic-BERT requires a comprehensive approach that takes into account performance across different languages and demographic groups, alongside the development and application of appropriate fairness metrics specifically designed for the Indian linguistic and cultural context.
10. Decoding the Representation: Vector Length
The vector representations generated by language models like Indic-BERT are crucial for their ability to understand and process text. Sentence-transformers based on Indic-BERT typically map sentences to a 768-dimensional dense vector space 31. Similarly, Vyakyarth-1-Indic-Embedding, another Indic language embedding model, also utilizes a 768-dimensional vector space 32. While the exact embedding dimension of the base Indic-BERT model isn't explicitly stated in the cited sources, it is common for base versions of BERT-based models to use a vector length of 768 33. However, given that Indic-BERT is based on the ALBERT architecture, which incorporates parameter reduction techniques, the embedding dimension might differ. Further investigation into the specific architectural details of Indic-BERT is needed to confirm its base embedding dimension.
The dimensionality of these vector representations has significant implications for the model's ability to capture semantic information. Higher dimensionality allows the model to encode more complex and nuanced semantic relationships between words and sentences. These dense vectors can then be effectively used for various downstream tasks such as semantic search, text clustering, and cross-lingual NLP applications 31. The 768-dimensional embedding space, commonly used in sentence embeddings derived from Indic-BERT, provides a robust representation of the meaning of text in the supported Indian languages, enabling effective performance in a wide range of tasks. Each dimension in the vector can be thought of as capturing a different aspect of the word or sentence's meaning.
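As a concrete illustration, the sketch below derives fixed-length sentence vectors from Indic-BERT by mean-pooling its final hidden states and compares them with cosine similarity. Mean pooling is assumed here as one common recipe; the sentence-transformer checkpoints cited above ship their own pooling configuration, and the example sentences are illustrative.

```python
# A minimal sketch of sentence embeddings via mean pooling over Indic-BERT's
# final hidden states, compared with cosine similarity. Pooling recipe assumed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

def embed(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state            # (batch, seq, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)              # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # mean pooling

vectors = embed(["दिल्ली भारत की राजधानी है।", "New Delhi is the capital of India."])
print(vectors.shape)  # (2, hidden_size); 768 for most base-sized variants
similarity = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```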
It is also important to note that the embedding vector length can vary depending on the specific variant or fine-tuned version of a BERT model. For instance, BERT-Tiny has a hidden embedding size of 128, which is significantly smaller than the 768 dimensions often found in base BERT models 33. This highlights the trade-off between computational efficiency and the richness of the semantic representation. Smaller vector lengths can lead to more efficient computation and lower memory usage but might potentially sacrifice some of the semantic detail that can be captured with higher-dimensional embeddings.
11. The Spectrum of Meaning: Diversity of Representations
The ability of a language model to generate diverse vector representations is crucial for its capacity to understand and distinguish between different words and concepts. Indic-BERT is trained on a diverse corpus of Indian languages, which enables it to capture the unique syntactic and semantic nuances inherent in each language 1. Its cross-lingual transfer learning capabilities further suggest that the model learns a shared representation space to some extent, allowing for the generalization of knowledge and understanding across the supported languages 1. This indicates that Indic-BERT likely creates diverse representations that reflect the individual characteristics of each language while also capturing shared semantic concepts that transcend language boundaries. This balance between language-specific and cross-lingual understanding is essential for the effective functioning of a multilingual model.
The performance of Indic-BERT on tasks such as sentiment analysis and named entity recognition provides further evidence of the diversity in its representations and its ability to distinguish subtle differences in meaning both within and across languages 4. Evaluation using cosine similarity scores can offer additional insights into the semantic similarity and dissimilarity between different representations generated by the model 3. The model's success in these meaning-dependent tasks suggests a significant level of diversity in its representations, enabling it to discern fine-grained semantic nuances. For example, effective sentiment analysis requires the model to understand the often subtle differences between positive and negative expressions.
Comparisons with other multilingual language models, such as MuRIL, indicate that models specifically trained on Indic languages tend to generate more linguistically accurate and diverse representations for these languages compared to more universal multilingual models 32. This reinforces the idea that targeted training, like that employed for Indic-BERT, is beneficial for capturing the specific linguistic features and nuances of a particular set of languages.
12. Conclusion and Future Directions
Indic-BERT represents a significant advancement in the field of multilingual NLP, specifically addressing the linguistic diversity of India. Its statistical properties highlight its efficiency, achieving competitive performance with fewer parameters than larger models. Qualitatively, it demonstrates strong linguistic understanding across 12 major Indian languages, proving effective in various NLP tasks. The training process, leveraging large-scale monolingual corpora, underscores the commitment to building robust language representations for these languages. The core principles guiding its development prioritize performance on Indic languages, computational efficiency, and cross-lingual transfer learning. Indic-BERT has found practical applications in sentiment analysis, news classification, and question answering, with a growing ecosystem of related tools like IndicBART further expanding the NLP capabilities for Indian languages. The latest version, IndicBERT v2, marks a substantial improvement with expanded language coverage and enhanced performance.
However, challenges remain. Evaluating and mitigating bias in Indic-BERT requires further research and the development of culturally and linguistically appropriate metrics for Indian languages. Ensuring fairness across the diverse linguistic and demographic landscape of India is an ongoing endeavor. While the vector representations are rich and facilitate semantic understanding, the base model's embedding dimension warrants further investigation. The diversity of representations allows the model to distinguish nuances in meaning, but continuous efforts are needed to improve its ability to capture the full spectrum of linguistic variations.
Future research directions could focus on expanding the context window of Indic-BERT to handle longer sequences more effectively. Increasing language coverage to include more of India's numerous dialects would further enhance its inclusivity. Deeper investigations into potential biases and the development of effective mitigation strategies are crucial for ensuring fairness. Exploring more efficient architectures and training techniques could lead to even better performance and reduced computational costs. Finally, the development of more comprehensive and standardized evaluation benchmarks specifically designed for the intricacies of Indic languages is essential for driving further progress in this vital area of multilingual NLP. While Indic-BERT has made remarkable strides, continued research and development are necessary to fully address the unique challenges and opportunities presented by the rich linguistic heritage of India.
Works cited
IndicBERT: Multilingual AI model for Indian Languages | by Vaibhav Srivastava | Medium, accessed on March 24, 2025, https://vabnix.medium.com/indicbert-multilingual-ai-model-for-indian-languages-85601995915e
Advancements in IndicBERT: A Leap for Multilingual AI | by Vaibhav Srivastava - Medium, accessed on March 24, 2025, https://vabnix.medium.com/advancements-in-indicbert-a-leap-for-multilingual-ai-823a4717c485
IndicMMLU-Pro: Benchmarking the Indic Large Language Models - arXiv, accessed on March 24, 2025, https://arxiv.org/html/2501.15747v1
Indic Bert · Models - Dataloop, accessed on March 24, 2025, https://dataloop.ai/library/model/ai4bharat_indic-bert/
ai4bharat/indic-bert - Hugging Face, accessed on March 24, 2025, https://huggingface.co/ai4bharat/indic-bert
AI4Bharat - IndicBERT: Multilingual Language Representation Model - AIKosha, accessed on March 24, 2025, https://aikosha.indiaai.gov.in/home/models/details/ai4bharat_indicbert_multilingual_language_representation_model.html
IndicBERT | AI4Bharat IndicNLP, accessed on March 24, 2025, https://indicnlp.ai4bharat.org/pages/indic-bert/
AI4Bharat/Indic-BERT-v1: Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT - GitHub, accessed on March 24, 2025, https://github.com/AI4Bharat/Indic-BERT-v1
Pretraining, fine-tuning and evaluation scripts for IndicBERT-v2 and IndicXTREME - GitHub, accessed on March 24, 2025, https://github.com/AI4Bharat/IndicBERT
IndicBERT based approach for Sentiment Analysis on Code-Mixed Tamil Tweets - CEUR-WS.org, accessed on March 24, 2025, https://ceur-ws.org/Vol-3159/T3-16.pdf
ML&AI_IIITRanchi@DravidianLangTech: Fine-Tuning IndicBERT for Exploring Language-specific Features for Sentiment Classification in Code-Mixed Dravidian Languages - ACL Anthology, accessed on March 24, 2025, https://aclanthology.org/2023.dravidianlangtech-1.27/
Question Answering System with Indic multilingual-BERT - Semantic Scholar, accessed on March 24, 2025, https://www.semanticscholar.org/paper/Question-Answering-System-with-Indic-Jha-Akana/2cde187456f0c05bcac4d0e87c4e0e97b1286ab4
What Is BERT Language Model? Its Advantages And Applications - Neurond AI, accessed on March 24, 2025, https://www.neurond.com/blog/what-is-bert
Large Language Models - AI4Bharat, accessed on March 24, 2025, https://ai4bharat.iitm.ac.in/areas/llm
IndicBERTv2 - AI4Bharat, accessed on March 24, 2025, https://ai4bharat.iitm.ac.in/areas/model/LLM/IndicBERTv2
ai4bharat/IndicBERTv2-MLM-Sam-TLM - Hugging Face, accessed on March 24, 2025, https://huggingface.co/ai4bharat/IndicBERTv2-MLM-Sam-TLM
IndicBERT v2 - a ai4bharat Collection - Hugging Face, accessed on March 24, 2025, https://huggingface.co/collections/ai4bharat/indicbert-v2-66c5a0bd4ee34ebc59303bc5
ai4bharat/IndicBART - Hugging Face, accessed on March 24, 2025, https://huggingface.co/ai4bharat/IndicBART
IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages, accessed on March 24, 2025, https://www.researchgate.net/publication/354435178_IndicBART_A_Pre-trained_Model_for_Natural_Language_Generation_of_Indic_Languages
arXiv:2109.02903v2 [cs.CL] 27 Oct 2022, accessed on March 24, 2025, https://arxiv.org/pdf/2109.02903
IndicBART: A Pre-trained Model for Indic Natural Language Generation - ResearchGate, accessed on March 24, 2025, https://www.researchgate.net/publication/361063446_IndicBART_A_Pre-trained_Model_for_Indic_Natural_Language_Generation
IndicBART: A Pre-trained Model for Indic Natural Language Generation - ACL Anthology, accessed on March 24, 2025, https://aclanthology.org/2022.findings-acl.145.pdf
IndicBART: A Pre-trained Model for Indic Natural Language Generation of Indic Languages - Microsoft Research, accessed on March 24, 2025, https://www.microsoft.com/en-us/research/publication/indicbart-a-pre-trained-model-for-indic-natural-language-generation-of-indic-languages/
IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages, accessed on March 24, 2025, https://www.semanticscholar.org/paper/IndicBART%3A-A-Pre-trained-Model-for-Natural-Language-Dabre-Shrotriya/a05ff6a06948992ecfa93f4c7576583b5272e4c2
Daily Papers - Hugging Face, accessed on March 24, 2025, https://huggingface.co/papers?q=IndicBERT
On Evaluating and Mitigating Gender Biases in Multilingual Settings - arXiv, accessed on March 24, 2025, https://arxiv.org/html/2307.01503
Evaluating Gender Bias in Pre-trained Indic Language Models - WiNLP, accessed on March 24, 2025, https://www.winlp.org/wp-content/uploads/2022/11/63_Paper.pdf
A Survey on Fairness in Large Language Models - arXiv, accessed on March 24, 2025, https://arxiv.org/html/2308.10149v2
Measuring Fairness with Biased Rulers: A Comparative Study on Bias Metrics for Pre-trained Language Models - Lirias, accessed on March 24, 2025, https://lirias.kuleuven.be/retrieve/667403
Unmask It! AI-Generated Product Review Detection in Dravidian Languages - arXiv, accessed on March 24, 2025, https://arxiv.org/html/2503.09289
aditeyabaral/sentencetransformer-indic-bert - Hugging Face, accessed on March 24, 2025, https://huggingface.co/aditeyabaral/sentencetransformer-indic-bert
Vyakyarth-1-Indic-Embedding - Krutrim AI Labs, accessed on March 24, 2025, https://ai-labs.olakrutrim.com/models/Vyakyarth-1-Indic-Embedding
Using BERT Model to Generate Real-time Embeddings - Target Tech Blog, accessed on March 24, 2025, https://tech.target.com/blog/bert-model
L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT - ACL Anthology, accessed on March 24, 2025, https://aclanthology.org/2023.paclic-1.16.pdf
IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? - arXiv, accessed on March 24, 2025, https://arxiv.org/html/2410.02611v1