BigBERT Telugu: A New Era for Telugu Natural Language Processing

July 5, 2025 | Brainmage Team
BigBERT Telugu is an encoder-only transformer model designed specifically for the Telugu language, aiming to advance the field of Natural Language Processing (NLP) for low-resource Indic languages. The model introduces architectural and efficiency improvements and is trained on a massive, diverse dataset.
Key Highlights of BigBERT Telugu:
Enhanced Architecture: BigBERT Telugu builds upon the standard BERT architecture, incorporating several modern advancements. These include the removal of bias terms from most linear and LayerNorm components to improve parameter allocation. It utilizes rotary positional embeddings (RoPE) for better performance across varying context lengths and employs a pre-normalization structure for improved training stability. The model also features the Swish-Gated Linear Unit (SwiGLU) activation function to enhance model expressiveness.
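To make these architectural choices concrete, here is a minimal PyTorch sketch of a bias-free, pre-normalized feed-forward block with a SwiGLU activation. The hidden size of 768 matches the published embedding dimension, but the expansion width, module names, and wiring are illustrative assumptions rather than the released BigBERT Telugu code; RoPE and the attention layers are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Bias-free SwiGLU feed-forward block with pre-normalization.

    Illustrative sketch only: hidden_dim and the exact wiring are assumptions,
    not the published BigBERT Telugu implementation.
    """

    def __init__(self, dim: int = 768, hidden_dim: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(dim, bias=False)        # pre-norm, no bias (PyTorch >= 2.1)
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                                 # normalize before the sublayer
        gated = F.silu(self.w_gate(h)) * self.w_up(h)    # SwiGLU: Swish gate times linear "up" path
        return x + self.w_down(gated)                    # residual connection

x = torch.randn(2, 16, 768)
print(SwiGLUFeedForward()(x).shape)  # torch.Size([2, 16, 768])
```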
Efficiency Improvements: To optimize performance, BigBERT Telugu incorporates alternating global and local attention mechanisms, allowing for efficient processing of long-context inputs. It also adopts an unpadding strategy, eliminating the computational overhead of padding tokens by processing sequences in a concatenated, unpadded form. The model leverages Flash Attention 2 for highly efficient attention computation, further boosting speed and memory efficiency.
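The unpadding idea can be illustrated with a small PyTorch helper that drops padding tokens from a batch and records cumulative sequence lengths (cu_seqlens), the metadata that variable-length Flash Attention kernels consume. The function name and shapes below are assumptions for illustration, not the model's internal API.

```python
import torch
import torch.nn.functional as F

def unpad_sequences(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Concatenate the non-pad tokens of a [batch, seq_len] batch into one flat
    sequence and return cumulative sequence lengths, so attention can be run
    without ever computing over padding positions. Minimal sketch only."""
    seqlens = attention_mask.sum(dim=1)                          # real tokens per row
    keep = attention_mask.flatten().nonzero(as_tuple=True)[0]    # indices of non-pad tokens
    flat_ids = input_ids.flatten()[keep]                         # padding removed
    cu_seqlens = F.pad(seqlens.cumsum(0), (1, 0))                # [0, l1, l1+l2, ...]
    return flat_ids, cu_seqlens, int(seqlens.max())

ids = torch.tensor([[5, 6, 7, 0, 0],
                    [8, 9, 0, 0, 0]])
mask = (ids != 0).long()
flat, cu, max_len = unpad_sequences(ids, mask)
print(flat.tolist(), cu.tolist(), max_len)   # [5, 6, 7, 8, 9] [0, 3, 5] 3
```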
Massive and Diverse Training Data: Unlike previous models that often relied on limited or outdated datasets, BigBERT Telugu was pretrained on an extensive corpus of 13.4 billion tokens. This dataset comprises a wide variety of Telugu text, including filtered Common Crawl data, Wikipedia pages, recent news articles, and educational textbooks, ensuring comprehensive linguistic coverage and a robust understanding of both formal and informal Telugu. The data underwent rigorous preprocessing steps like language identification and deduplication to ensure high quality.
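As a rough sketch of what such preprocessing can look like, the snippet below combines exact deduplication by hashing with a crude script-based Telugu check over the Unicode block U+0C00–U+0C7F. The actual language-identification model and deduplication strategy used for BigBERT Telugu are not detailed here, so treat this as a stand-in.

```python
import hashlib
import re

TELUGU_CHAR = re.compile(r"[\u0C00-\u0C7F]")   # Telugu Unicode block

def looks_telugu(text: str, threshold: float = 0.5) -> bool:
    """Keep documents where at least `threshold` of non-space characters are
    in the Telugu script. A stand-in for a real language-ID model."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    telugu = sum(1 for c in chars if TELUGU_CHAR.match(c))
    return telugu / len(chars) >= threshold

def dedup(docs):
    """Exact deduplication by hashing whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["తెలుగు భాష అందమైనది", "తెలుగు భాష అందమైనది", "hello world"]
clean = [d for d in dedup(corpus) if looks_telugu(d)]
print(len(clean))  # 1
```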
Tokenizer: BigBERT Telugu uses a WordPiece tokenizer, which offers improved token efficiency for Telugu text. The vocabulary contains 16,000 tokens, including five reserved special tokens ([CLS], [SEP], [PAD], [MASK], [UNK]) that maintain backward compatibility with BERT-style pipelines and provide flexibility for downstream applications.
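A comparable WordPiece vocabulary can be trained with the Hugging Face tokenizers library, as sketched below. The corpus path and output filename are placeholders, and the exact normalization and training settings of the released tokenizer are not specified here.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical corpus file; substitute your own Telugu text files.
files = ["telugu_corpus.txt"]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=16_000,                                          # matches the published vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files, trainer)
tokenizer.save("bigbert-telugu-wordpiece.json")                 # placeholder output path
```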
Strong Performance: BigBERT Telugu achieves state-of-the-art performance across a range of downstream tasks. In Masked Language Modeling (MLM) evaluation, it recorded a low perplexity of 2.4712 and a token prediction accuracy of 79.45% on unseen Telugu text, significantly outperforming multilingual models such as ModernBERT, multilingual BERT, MuRIL, and XLM-RoBERTa. For downstream classification tasks, it achieved 91.94% accuracy on Sentiment Analysis and an impressive 96.02% on Topic Classification/News Categorization. These results highlight the model's ability to capture Telugu's linguistic nuances and semantic content effectively.
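An MLM evaluation of this kind can be reproduced along the following lines with the transformers library. The checkpoint name is a placeholder rather than a confirmed model ID, and the 15% masking rate and single example sentence are assumptions for illustration.

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "brainmage/bigbert-telugu"          # placeholder checkpoint ID
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

text = "హైదరాబాద్ తెలంగాణ రాజధాని."          # a held-out Telugu sentence
enc = tok(text, return_tensors="pt")
labels = enc["input_ids"].clone()

# Mask roughly 15% of the non-special tokens, as in standard MLM evaluation.
special = torch.tensor(
    tok.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
).bool()
probs = torch.full(labels.shape, 0.15) * ~special
masked = torch.bernoulli(probs).bool()
enc["input_ids"][masked] = tok.mask_token_id
labels[~masked] = -100                      # the loss ignores unmasked positions

with torch.no_grad():
    out = model(**enc, labels=labels)

preds = out.logits.argmax(dim=-1)
acc = (preds[masked] == labels[masked]).float().mean()
print(f"perplexity={math.exp(out.loss.item()):.4f}  masked accuracy={float(acc):.2%}")
```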
Computational Efficiency: The model is optimized for speed and memory efficiency, processing sequences of up to 2048 tokens nearly twice as fast as previous models and remaining practical to deploy on standard GPUs.
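A simple way to check throughput at the full 2048-token context on your own hardware is a warm-up-then-time loop like the one below; the checkpoint name, batch size, and iteration counts are placeholders.

```python
import time
import torch
from transformers import AutoModelForMaskedLM

# Placeholder checkpoint ID, batch size, and iteration counts; adjust to your setup.
model = AutoModelForMaskedLM.from_pretrained("brainmage/bigbert-telugu").eval().cuda()
ids = torch.randint(0, 16_000, (8, 2048), device="cuda")   # full-length random sequences
mask = torch.ones_like(ids)

with torch.inference_mode():
    for _ in range(3):                                      # warm-up passes
        model(input_ids=ids, attention_mask=mask)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):                                     # timed passes
        model(input_ids=ids, attention_mask=mask)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{10 * ids.numel() / elapsed:,.0f} tokens/s at sequence length 2048")
```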
BigBERT Telugu represents a significant step forward for Telugu NLP, combining a compact architecture with modern deep learning advancements to offer a powerful and practical choice for real-world applications. The model has 13 layers, 89 million parameters, and an embedding dimension of 768, and pretraining took approximately 112.12 hours on a single RTX 4090 GPU.