In December 2021, Ajda Gokçen successfully defended and filed her dissertation, "Resourceful at Any Size: A Predictive Methodology Using Linguistic Corpus Metrics for Multi-Source Training in Neural Dependency Parsing," supervised by Gina-Anne Levow.
Multilingual modeling comes up in natural language processing at any scale. Corpora for high-resource languages (like English) train high-performing models, and can be combined with corpora of all sizes from other languages to build better models for low-resource languages (like Sahaptin). Projects like Universal Dependencies even make it possible to train highly multilingual models from standardized morphosyntactic labels. However, multilingual (or, more generally, multi-source) training does not consistently improve model performance. With an abundance of language resources comes a difficult design choice: which corpora train better together than separately? More specifically, when is it worthwhile to supplement one corpus with another (i.e., concatenate them) during training, rather than training on the first corpus alone? Approaches to selecting and evaluating candidate combinations have tended toward two extremes: ad hoc or exhaustive.
In her dissertation, Gokçen proposed an alternative: a predictive methodology for the outcomes of concatenative training in dependency parsing, which leverages treebanks constructed using the Universal Dependencies framework to assess the utility of linguistic corpus metrics in multi-source modeling. She found this approach to be both robust and practical: it uses computationally simple metrics that expand upon intuitions of linguistic similarity to predict reasonably well which conditions will yield significant improvement for a target corpus. Although the results are specific to a particular family of models and the task of dependency parsing, the approach holds promise for many other natural language processing applications.