D. Karpov

M. Burtsev

Models trained on a Russian topical dataset of knowledge-grounded human-human conversation can handle real-world tasks across languages.

Chatbots and virtual assistants are on the rise, but building them is not easy. Collecting and labelling conversational data takes effort, and cross-lingual topical datasets are scarce. We address this by training a multilingual model on a monolingual topical dataset. Our model can transfer topical knowledge across multiple languages, with accuracy that correlates with the size of language-specific pre-training data.

This article investigates knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of samples (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We prepared the dataset from the "Yandex Que" raw data. By evaluating RuQTopics-trained models on the six matching classes of the Russian MASSIVE subset, we show that the RuQTopics dataset is suitable for real-world conversational tasks: Russian-only models trained on it consistently reach an accuracy of around 85\% on this subset. We also find that for multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the per-language accuracy correlates closely (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of BERT's pretraining data for the corresponding language. In contrast, the correlation between per-language accuracy and linguistic distance from Russian is not statistically significant.
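The correlation analysis described above can be illustrated with a minimal sketch. It assumes per-language accuracy is the fraction of correct predictions and that the Spearman coefficient is computed as the Pearson correlation of average ranks. The function names, the language codes, and all numbers in the usage example are hypothetical placeholders, not values or code from the study, and the p-value computation is omitted.

```python
from math import sqrt


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical placeholder values, NOT results from the study.
    pretraining_size = {"aa": 120.0, "bb": 45.0, "cc": 8.0, "dd": 2.5}
    language_accuracy = {"aa": 0.81, "bb": 0.78, "cc": 0.66, "dd": 0.69}
    languages = sorted(pretraining_size)
    rho = spearman(
        [pretraining_size[lang] for lang in languages],
        [language_accuracy[lang] for lang in languages],
    )
    print(f"Spearman correlation: {rho:.3f}")
```

In practice a library routine such as `scipy.stats.spearmanr` would return both the coefficient and the p-value; the hand-written version only makes the rank-based definition explicit.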

Cross-lingual knowledge

Monolingual and cross-lingual knowledge transfer for topic classification