Semantic Indexing of Multilingual Corpora

The growing number of multilingual text collections available in different domains makes their automatic processing essential for the development of the respective fields. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline that overcomes most of these limitations and enables direct comparison across texts in different languages without the need for translation. In our paper we show its potential for semantically indexing multilingual text collections. For the experiments we used a multilingual version of the Bible (available for download), evaluating the precision of our semantic indexing pipeline and showing its reliability on the cross-lingual text retrieval task.


Download the whole package [32.4 MB].

Alternatively, download each file individually:

Reference paper

When using these data, please refer to the following paper:

Alessandro Raganato, José Camacho-Collados, Antonio Raganato and Yunseo Joung.
Semantic Indexing of Multilingual Corpora and its Application on the History Domain. [paper] [bib] [poster]
LT4DH, COLING 2016, Osaka, Japan.


Should you have any enquiries about any of the resources, please contact Alessandro Raganato (raganato [at] di.uniroma1 [dot] it) or José Camacho-Collados (collados [at] di.uniroma1 [dot] it).

Last update: 8 Dec. 2016