Tokenizing Efficiently:
Ancient India's 32-Fault System for Compressing Knowledge
Vyom A. Shah | January 31, 2026
In the world of artificial intelligence, technologies evolve rapidly, with new and smarter AI models debuting weekly, ranging from open-source to proprietary, and from free to costly. For instance, a recent release of Meta’s LLaMA model showcased advancements that made its predecessor seem outdated. Such fast-paced developments highlight the transformative capabilities of AI, much of it driven by transformers, a key technology pushing AI to new heights.
Whether a transformer AI model is processing text, images, audio clips, videos or another modality, it first translates the data into tokens. This process is known as tokenization. Tokens are like the syllables in a language, each serving as a fundamental unit that communicates a piece of information. Just as syllables combine to form words and phrases in human languages, tokens are combined by AI models to understand and generate language and information. For instance, in the word ‘computer’, the syllables ‘com’, ‘pu’ and ‘ter’ each carry a part of the word’s pronunciation and meaning. Similarly, in AI, the word might be broken into tokens that the model uses to predict the next word in a sequence or understand the context of a sentence. Efficient tokenization matters because representing the same data in fewer tokens reduces the computing power required for training and inference. During pre-training and post-training, tokens equate to investment into intelligence, and during inference, they drive cost and revenue. Thus, how well a model compresses data into tokens ought to be one of the key factors differentiating it from other models.
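To make the idea of breaking a word into tokens concrete, here is a minimal sketch of byte-pair encoding (BPE), one common subword technique. The article does not name a specific algorithm, so BPE is an assumption chosen for illustration, and the toy corpus and the function names (`merge_pair`, `learn_merges`, `tokenize`) are hypothetical. The sketch repeatedly merges the most frequent adjacent pair of symbols, so that frequent fragments end up as single tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Pick the most frequent pair; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only for illustration.
corpus = ["computer", "compute", "computing", "commuter"]
merges = learn_merges(corpus, num_merges=3)
tokens = tokenize("computer", merges)
print(len("computer"), "characters ->", len(tokens), "tokens:", tokens)
```

Each additional merge shortens the token sequence for frequent words, which is the sense in which better tokenization lowers the number of units a model has to process. Production tokenizers work on far larger corpora and differ in many details.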
Ancient India had a similar method for the transmission of knowledge. Knowledge was compressed, in a manner comparable to efficient tokenization, so that philosophical concepts and tenets could be transmitted in written or oral form from one generation to another at maximum efficiency. This method was particularly prominent during the sūtra-era, a period in Indian literary history that spanned several centuries from around 600 BCE to 200 BCE. The sūtra-era was significant for its contributions to the development of succinct, yet profound, literary traditions that influenced various areas of thought including philosophy and linguistics.
The parallel between these two systems—separated by millennia—is striking. Both AI tokenization and ancient Indian sūtra composition grapple with the same fundamental challenge: how to compress maximum information into minimum space while maintaining accuracy and utility. Modern AI engineers optimize tokenization to reduce computational costs and improve model performance; ancient Indian scholars optimized sūtra composition to enable oral transmission, memorization, and preservation of knowledge across generations. Where AI tokens serve machine learning algorithms, sūtras served human memory. Where tokens are measured in computational efficiency and training costs, sūtras were measured by memorability and transmission fidelity. Yet both represent sophisticated information compression technologies designed to preserve meaning while minimizing form. This raises an intriguing question: what principles did ancient Indian scholars develop to guide this compression, and how comprehensive were their systems for evaluating it?
What is a sūtra anyway?
स्वल्पाक्षरमसन्दिग्धं सारवद्विश्वतोमुखम् । [var. अल्पाक्षरमसन्दिग्धं a.]
अस्तोभमनवद्यं च सूत्रं सूत्रविदो विदुः ॥
The knowers of sūtra define a sūtra as that which contains only a few syllables, is devoid of any confusion, contains the complete essence, faces all sides, is without stoppage, and is faultless.
The aforementioned verse has nearly become synonymous with the definition of sūtra and is found quoted in every other commentary on sūtra-styled philosophical and grammatical works. In this blog post, I shall try to explore another ancient (and probably more comprehensive) tradition discussing the definition of sūtra.
Prologus
The meaning of sutta in Jaina literature is wide and far-reaching. The word 'sutta' can itself derive from 'sūkta' or 'sūtra', but the literature predominantly follows the 'sūtra' interpretation. When defining sutta, the Mūlācāra (5.80) states (in Prākṛta):
सुत्तं गणधरकधिदं तहेव पत्तेयबुद्धिकधिदं च।
सुदकेवलिणा कधिदं अभिण्णदसपुव्विकधिदं च॥ [1]
That which is uttered by Gaṇadhara (pontiffs), that which is spoken by pratyeka-buddha (self-enlightened), that which is stated by śruta-kevalin (ascetic who has fathomed the entire śruta), and that which is said by abhinna-daśa-pūrvin (knower of ten pūrva-s) is termed as sutta.
The former verse, the Saṁskṛta definition quoted earlier, finds its earliest mention in the Vāyupurāṇa (ASS 59.142) and Brahmāṇḍapurāṇa (1.3.58). While Dikshitar suggests that the purāṇa started to take shape around 350 BCE, modern scholarship places it later, between 300 and 500 CE. Ghatge places both of these purāṇa-s in the pre-500 CE era in his chronology.
An alternative characteristic
Bhadrabāhu (3rd-2nd century BCE), one of the most honoured and respected figures among Jaina scholar monks, made pioneering contributions to the development of Jaina philosophy and commentarial literature. He is credited with the authorship of ten niryukti-s, which are the earliest commentaries on Jaina canons, composed exclusively in Prākṛta. In one such niryukti, on the Āvaśyaka-sūtra[2], an alternative (and more comprehensive) characterization of a sūtra is suggested.[3]
That which has a small volume, encompasses huge meaning, is devoid of the thirty-two faults, fulfils the characteristics and contains the eight virtues is a sūtra.
After this preliminary definition, he went on to enumerate the thirty-two flaws of a sūtra[4]:
- Aliya (alīka) – untruthfulness
The sūtra shall not preach something which is not the truth. Bhūtanihnava and Abhūtodbhāvana are two types of untruths. The former refers to denying the existence of something which exists, while the latter refers to making up things which do not exist. Since truth is subjective, this is to be understood with reference to the philosophy studied.
- Uvaghāyajaṇaya (upaghātajanaka) – bringing destruction
A sūtra shall not teach something which leads to sacrificing the life of any living being.
- Niratthaya (nirarthaka) – useless
A sūtra shall not contain anything useless. This refers to the practice of not putting any filler words in the sūtra.
- Avatthaya (apārthaka) – incoherent meaning
The sense of the sūtra shall never be incoherent with reference to the subject under discussion. The idea should be expressed with clarity such that there is no room for confusion in understanding the meaning of the sūtra in line with other connected sūtras. Statements like 'daśa dāḍimāni' fall under this category of faults.
- Chala (chala) – unstable in meaning
The choice of words used in the sūtra should be such that it doesn't create any illusion. Statements like 'navakambalo devadattaḥ' fall under this category of faults. In the mentioned statement, 'nava' means both 'nine' and 'new', which creates an illusion.
- Duhila (druhila) – rebellious
The sūtra shall not promote rebellion of any kind.
- Ṇissāra (niḥsāra) – pointlessness
A sūtra shall never be pointless. It shall be put forward with clear points and emphasis.
- Ahiya (adhika) &
- Ūṇa (ūna)
A sūtra shall be precise in the number of terms and syllables to the extent that the idea cannot be transmitted with an extra or a lesser syllable/term.
- Punarutta (punarukta)
A sūtra shall not contain any repetition. The fault of repetition is broadly considered within three categories:
  - By syllables: the terms or words within the sūtra shall not be repeated.
  - By meaning: the idea/meaning of the word shall not be repeated. For instance, 'ghaṭa, kumbha, kuṭa'. Words of the same meaning are repeated here.
  - By implied meaning: the idea/meaning that the sūtra carries shall not be repeated. For instance, saying 'Devadatta doesn't eat in the day' implies that he eats at night. Even writing this implied meaning is a fault of the sūtra.
- Vāhaya (vyāhata)
The sūtra shall not be obstructed in meaning by the context. For instance, saying that there is an action and its fruit but no doer of the action is an example of this.
- Ajutta (ayukta)
The sūtra shall not embody anything which is devoid of reasoning, that is, something which is beyond comprehension.
- Kamabhiṇṇa (kramabhinna) – discrepant in arrangement
If a sūtra, while stating one group's terms in relation to another group's terms, breaks the arrangement, it incurs a serious error. This can be understood as analogous to the correct usage of 'respectively' in a sentence.
- Vayaṇabhiṇṇa (vacanabhinna) – discrepant in number
If a sūtra contains words/terms which are pointed out in relation to others, they shall have uniform use of number. Usages like 'They goes' are faults.
- Vibhattibhiṇṇa (vibhaktibhinna) – discrepant in vibhakti
The words/terms which are referred to in relation to others shall have the same vibhakti as the others.
- Liṅgabhiṇṇa (liṅgabhinna) – discrepant in gender of words
In Saṁskṛta, each word carries its gender, which is based on its inherent qualities and prevailing usages in speech and literature. Words that are mutually related shall be of the same gender. A violation would be like saying 'He's a woman'.
- Aṇabhihiya (anabhihita) – unasserted
If a sūtra contains something which has not been discussed in one's philosophy, it is a fault. For instance, talking about a tenth element, Prakṛti, in Vaiśeṣika philosophy.
- Apaya (apada) – fault in uniformity
In a chapter where everything is written in a particular chandas, the use of a different chandas is considered a fault. When a tenet is to be expressed in āryā, the use of vaitālīya chandas is a fault.
- Sahāvahīṇa (svabhāvahīna) – opposite of inherent property
When the inherent property of an object/element is expressed otherwise than already known and considered, it is also a fault of the sūtra. For instance, 'cool fire'.
- Vavahia (vyavahita) – interruption
When a concept under discussion is set aside to teach another concept in detail, and the original concept is then taken up again, it is termed vyavahita.
- Kāladosa (kāladoṣa) – improper use of tenses
Improper use of tenses within the sūtra is a fault. Instead of saying 'Rāma went to the forest', if the sūtra states 'Rāma is going to the forest', then the sūtra is irregular.
- Jaidosa (yatidoṣa) – unnecessary stoppage
If there is an unnecessary stoppage within the sūtra, it is a fault.
- Chavidosa (chavi-doṣa) – absence of Chavi
Chavi is the name of a certain figure of speech. A sūtra devoid of that figure of speech incurs the fault of its absence.
- Samayaviruddha (samayaviruddha) – opposed to the scriptures
When a sūtra contains something which is in opposition to one's own proposition, it incurs the fault of contradicting one's own established doctrine (samaya).
- Vayaṇamitta (vacanamātra) – mere talk
Merely asserting something without any evidence. For instance, just telling people, 'This part is the middle of the world'.
- Atthāvattīdosa (arthāpattidoṣa) – undesired implication
If what a sūtra implies gives rise to undesired meanings, it incurs arthāpattidoṣa. For instance, by saying 'don't kill the pet dogs', it is implied that dogs which are not pets can be killed.
- Asamāsadosa (asamāsadoṣa) – non-compounding
If the composer, where there is an opportunity for compounding, does not compound words or does so improperly, it is regarded as the error of compounding.
- Uvamādosa (upamādoṣa) – faulty comparison
When the comparisons made are too far-fetched, such as 'Meru is as large as a sarṣapa (mustard seed)' or vice versa, it is upamādoṣa.
- Rūvagadosa (rūpakadoṣa) – disrupting explanation
When some part is to be explained through expression, the expression made may suggest incorrect or completely opposite parts. For instance, if a mountain were to be described and, instead of describing its pinnacles, etc., one describes the ocean, its currents, flow, etc.
- Ṇiddesadosa (nirdeśadoṣa) – missing out information
When something is missed out in an expression, it is considered a fault. For instance, when meaning to say 'Devadatta cooks in the vessel', the speaker leaves out 'cooks'.
- Payatthadosa (padārthadoṣa) – mis-terming
This occurs when a synonym of a certain element is treated as another element, leading to confusion. For instance, thinking that sattā is nothing but vastuparyāya, which is a completely different element in Vaiśeṣika philosophy.
- Sandhidosa (sandhidoṣa)
In case of sandhi, if the rules are not applied or the forms are incorrectly formed, it is regarded as sandhi-doṣa.
After the explanation of the faults of the sūtra, he went on to enumerate the virtues of a sūtra[5]:
- Ṇiddosa (nirdoṣa) – devoid of the faults
A sūtra shall not contain any of the faults enumerated above.
- Sāravaṁta (sāravat) – full of essence
The sūtra can embody multiple desired meanings through the synonyms of the words used.
- Heujutta (hetuyukta) – complemented with causes
A sūtra shall contain causes of the anvaya or vyatireka kind.
- Alaṁkia (alaṅkṛta) – with figures of speech
A sūtra ornamented with the use of figures of speech is virtuous.
- Uvaṇīya (upanīta) – conclusiveness
A sūtra shall be conclusive.
- Sovayāra (sopacāra) – non-vernacular
A sūtra shall not appear vernacular in nature.
- Miya (mita) – limited
A sūtra shall be limited to a finite number of syllables.
- Mahura (madhura) – sweet
A sūtra shall be such that it sounds sweet to the ear.
Based on the extant texts available today, this seems to have been first mentioned in Bhadrabāhusvāmī's niryukti on the Āvaśyakasūtra and discussed at length thereafter by Haribhadrasūri, Jinabhadragaṇī Kṣamāśramaṇa and Maladhārin Hemacandrasūri in their subsequent commentaries. The thirty-two faults are enumerated in the Anuyogadvārasūtra as well. A complete picture of how a sūtra works has been compiled, explained and enumerated again at a single place by Dharmasāgaragaṇī in his Sūtravyākhyānavidhi Śataka. After thus enumerating the virtues and faults of a sūtra, Bhadrabāhu proceeded to quote a renowned characteristic of sūtra in its Prākṛta rendering.
That which contains only a few syllables, is devoid of any confusion, contains the essence, faces all directions, is without stoppage, is devoid of any faults and is spoken by the omniscient is the sutta.[6]
Conclusion:
All these accounts reveal something profound about how seriously ancient Indian scholars treated the challenge of information compression. This was not merely aesthetic preference or literary convention; it was systematic engineering of knowledge transmission. Every fault identified represents a failure mode that could corrupt meaning across generations. Every virtue enumerated represents a design principle that enhances preservation fidelity.
Modern AI tokenization and ancient sūtra composition emerge from radically different contexts—one algorithmic and computational, the other mnemonic and pedagogical. Yet both wrestle with the same tension: compression inevitably risks information loss, but verbosity undermines utility. The solutions differ—AI uses statistical frequency analysis and subword segmentation; sūtra composers used grammatical precision and semantic density—but the underlying problem remains timeless. One can imagine ancient scholars using this framework much as modern engineers use design specifications—checking their compositions against each criterion, debugging their expressions for logical inconsistencies, grammatical ambiguities, and semantic redundancies.
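As a loose illustration of that analogy, the sketch below treats one of the thirty-two faults, punarukta (repetition), as a lint rule. It is an assumption-laden toy: the function name `find_repetition_faults` is hypothetical, a statement is simplified to a list of words, and only repetition by word and by meaning is checked. Repetition by implied meaning, like most of the other faults, needs human judgment and is not attempted here.

```python
def find_repetition_faults(words, synonym_groups=()):
    """Report punarukta (repetition) faults in a statement given as a list of words."""
    faults = []
    # Repetition by word: the same word occurs more than once.
    seen = set()
    for word in words:
        if word in seen:
            faults.append(("punarukta by word", word))
        seen.add(word)
    # Repetition by meaning: two or more distinct words from one synonym group.
    distinct = list(dict.fromkeys(words))  # unique words, original order kept
    for group in synonym_groups:
        present = [word for word in distinct if word in group]
        if len(present) > 1:
            faults.append(("punarukta by meaning", tuple(present)))
    return faults


# The synonym group comes from the example given for this fault;
# the checker itself is a hypothetical illustration.
pots = {"ghaṭa", "kumbha", "kuṭa"}
print(find_repetition_faults(["ghaṭa", "kumbha", "kuṭa"], [pots]))
```

A statement with no repeated word and no two words from the same synonym group yields an empty list, which is the machine equivalent of passing this one criterion.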
In an age where we celebrate AI’s ability to compress and process information at unprecedented scales, it’s humbling to recognize that the fundamental challenge—and the need for systematic approaches to solving it—has ancient precedent. The technologies differ, but the engineering mindset persists: compress carefully, preserve meaning rigorously, and establish clear criteria for quality. Whether in silicon or in syllables, information wants to be both dense and clear.
Post-scriptum:
Jayadhavalā, a commentary on Guṇadhara's Kaṣāyapāhuḍa, quotes yet another definition of sūtra:
अर्थस्य सूचनात् सम्यक् सूतेर्वार्थस्य सूरिणा ।
सूत्रमुक्तमनल्पार्थं सूत्रकारेण तत्त्वतः ॥
Bibliography
Primary
- Sūtravyākhyānavidhi Śataka by Dharmasāgaragaṇī
- Āvaśyakaniryukti by Bhadrabāhu
- Viśeṣāvaśyakabhāṣya by Jinabhadragaṇī Kṣamāśramaṇa
- Vāyupurāṇa
- Brahmāṇḍapurāṇa
Secondary
- History of Sanskrit Literature by Arthur Anthony Macdonell
- Saṁskṛta Vaṅmaya kā Bṛhad Itihāsa
लक्खणजुत्तं सुत्तं अट्ठहि य गुणेहि उववेअं॥८८०॥ ĀvaNi
णिस्सारमहिअमूणं पुणरुत्तं वाहयमजुत्तं॥८८१॥ ĀvaNi
कमभिण्णं वयणभिण्णं विभत्तिभिन्नं च लिंगभिन्नं च।
अणभिहियमपयमेव य सहावहीणं ववहिअं च॥८८२॥ ĀvaNi
कालजइच्छविदोसा समयविरुद्धं च वयणमित्तं च।
अत्थावत्तीदोसो णेओ असमासदोसो अ॥८८३॥ ĀvaNi
उवमा रूवगदोसो णिद्देसपयत्थसंधिदोसो अ।
एए उस्सुत्तदोसा बत्तीसं हुंति णायव्वा॥८८४॥ ĀvaNi
उवणीयं सोवयारं च मुयं महुरमेव च॥८८५॥ ĀvaNi; SthāSū 7.553 [gāhā 72]
अत्थोभमणवज्जं च सुत्तं सव्वण्णुभासिअं॥८८६॥
The author of this blog is an enthusiast of Saṃskṛta, Prākṛta and Apabhraṃśa, and has pursued a Master's degree in Sanskrit.