How are Indian languages faring in the age of AI and language models?
The Hindu
As large language models like ChatGPT find more applications around the world, their adoption also passively spreads a prejudice against languages other than English, including Indian languages. Some researchers are working to remedy this.
“Sanskrit suits the language of computers and those learning artificial intelligence learn it,” Indian Space Research Organisation chairman S. Somanath said at an event in Ujjain on May 25. His was the latest in a line of statements exalting Sanskrit and its value for computing but without any evidence or explanation.
But beyond Sanskrit, how are other Indian languages faring in the realm of artificial intelligence (AI), at a time when its language-based applications have taken the world by storm?
The answer is a mixed bag. There is some passive discrimination even as the languages’ fates are buoyed by public-spirited research and innovation.
Behind both seemingly intelligent chatbots and art-making computers, algorithms and data-manipulation techniques turn linguistic and visual data into mathematical objects (like vectors), and combine them in specific ways to produce the desired output. This is how ChatGPT is able to respond to your questions.
When working with a language, a machine first has to break a sentence or a word down into little bits in a process called tokenisation. These are the bits that the machine’s data-processing model will work with. For example, “there’s a star” can be tokenised to “there”, “is”, “a”, and “star”.
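To make the idea concrete, here is a minimal sketch in Python of a word-level tokeniser that reproduces the example above. The function name `tokenise` and the one-entry contraction table are assumptions made for illustration; production tokenisers handle far more cases and do not necessarily work this way.

```python
import re

# Hypothetical contraction table, just large enough for the example.
CONTRACTIONS = {"there's": ["there", "is"]}

def tokenise(text):
    """Split text into lowercase word tokens, expanding known contractions."""
    # Normalise the typographic apostrophe so "there’s" matches the table.
    normalised = text.lower().replace("’", "'")
    tokens = []
    for word in re.findall(r"[a-z]+(?:'[a-z]+)?", normalised):
        # Replace a known contraction with its parts; keep other words as-is.
        tokens.extend(CONTRACTIONS.get(word, [word]))
    return tokens

print(tokenise("there’s a star"))  # ['there', 'is', 'a', 'star']
```

The list of tokens, not the original string, is what a language model actually receives.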
There are several tokenisation techniques. A treebank tokeniser breaks up words and sentences based on the rules that linguists use to study them. A subword tokeniser allows the model to learn a common word and the modifications to that word separately, such as “dusty” and “dustier”/“dustiest”.
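The subword idea can be sketched with a toy example in Python. The vocabulary below is hand-picked and hypothetical, and the greedy longest-match rule is only one simple way to split a word into known pieces; real subword tokenisers learn their vocabularies from large amounts of text.

```python
# Hypothetical, hand-picked vocabulary of subword pieces.
VOCAB = {"dust", "y", "ier", "iest"}

def subword_tokenise(word, vocab=VOCAB):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

for w in ["dusty", "dustier", "dustiest"]:
    print(w, subword_tokenise(w))
# dusty ['dust', 'y']
# dustier ['dust', 'ier']
# dustiest ['dust', 'iest']
```

All three words share the piece “dust”, so whatever the model learns about that piece carries over to each variant.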
OpenAI, the maker of ChatGPT and the GPT series of large language models, uses a type of subword tokeniser called byte-pair encoding (BPE). Here’s an example of the OpenAI API using this on a statement by Gayatri Chakravorty Spivak: