Corpus linguistic studies are often based on the assumption that the temporal structure of the corpus under consideration is known. While this assumption holds true for important texts of many European (e.g., Classical Greek and Latin, Middle High German) and East Asian corpora, the dates proposed for pre-modern Indian texts in scholarly research often vary by several centuries. Temporal uncertainties are especially serious for texts written in Vedic Sanskrit, one of the oldest Indo-European languages. Vedic Sanskrit was in active use in the second and first millennia BCE in Northern India, and a large corpus of Vedic texts, which was, remarkably, transmitted orally until the beginning of the Common Era, is still extant today. While these texts mainly deal with questions of religion and ritual, they also contain innumerable details and observations from daily life and the material culture of ancient India. Since the Vedic corpus is the main source for reconstructing religious, linguistic and socio-cultural developments in the second and first millennia BCE, the uncertainty about its diachronic structure poses a real problem for linguistic, cultural and historical research.

Building on preliminary work by members of our research group, the project ChronBMM develops probabilistic Bayesian models that use linguistic features to narrow down the temporal range of important Vedic texts and thus provide future research with a more reliable temporal structure of the Vedic corpus. Concretely, we will study various directed graphical models (Bayesian mixture models) and combinations of Dirichlet processes with undirected graphical models (Markov random fields), which can be used to impose additional conditions on the probability distributions. Notably, the project does not rely on a small number of manually selected features, but will draw on a wide range of linguistic feature types comprising, among others, phonetic, morpho-syntactic, semantic, derivational and syntactic information. Despite previous work, no large-scale digital resource for Vedic syntax exists. One of the key aims of our project is therefore the annotation of a sizable syntactic treebank of Vedic Sanskrit, for which we are currently developing a syntactic parser.

The quantitative methods developed in this project will allow us to obtain more reliable dates for central Vedic texts than have been proposed so far. In order to validate our methods with independent data, we will test them, as a proof of concept, on datable corpora of European languages. The text-historical evaluation will concentrate on the Atharvaveda, an early tradition of religious texts that has been somewhat neglected in previous research, and the Upaniṣads, an important group of philosophical treatises. In addition, the project aims to provide new insights into the development of Vedic syntax, with a special focus on word order, for which no corpus-based studies have been published so far. Although ChronBMM concentrates on Vedic texts, the models developed are generic and can therefore be applied to comparable cases such as early New Indo-Aryan languages or Epic Greek.

The project is funded by the German Federal Ministry of Education and Research, FKZ 01UG2121, for the years 2021-2023.