Mark Granroth-Wilding and Hannu Toivonen (2019).
In Proceedings of the Society for Computation in Linguistics (SCiL)
- Paper PDF
- Code
- Trained embeddings
- Visualizations with t-SNE and MDS
Abstract
We present a new method for unsupervised learning of multilingual symbol (e.g. character) embeddings, without any parallel data or prior knowledge about correspondences between languages. It is able to exploit similarities across languages between the distributions over symbols' contexts of use within their language, even in the absence of any symbols in common to the two languages. In experiments with an artificially corrupted text corpus, we show that the method can retrieve character correspondences obscured by noise. We then present encouraging results of applying the method to real linguistic data, including for low-resourced languages. The learned representations open the possibility of fully unsupervised comparative studies of text or speech corpora in low-resourced languages with no prior knowledge regarding their symbol sets.
Files
Code
We have released the code used to train the Xsym models on a variety of language pairs and to output character embeddings. It includes all the code relevant to the results reported in the paper.
Some of the models were trained on textual data for low-resourced Uralic dialects. The original data comes from the University of Helsinki Language Corpus Server (UHLCS) collection, available via a request form from the CSC. It is an old corpus and required some preprocessing to fix encoding errors and other problems. We release the code for preprocessing the corpus (for the languages used here).
Pre-trained embeddings
We release here the trained embeddings output by the above code for all language pairs for which results are reported in the paper, plus some additional pairs not included in the paper.
Formats
Each set of embeddings is made available in two different formats, each a standard format for storing vector embeddings.
- Word2vec format:
The binary format used by the word2vec tool to store its embeddings. See word2vec for details, or refer to the Gensim implementation. You can load these using Gensim's KeyedVectors.load_word2vec_format() method.
Note that, since the format does not support spaces in word strings, the space character is stored as <space>. A short loading sketch is given below.
- TSV format:
The TSV (tab-separated values) format used, for example, by the TensorFlow embedding projector, accompanied by metadata identifying the language and character of each embedding. The TSV file, together with the metadata, can be loaded into the online projector to explore the vector space in a 3D projection, using the annotations in the metadata. A brief reading sketch is also given below.
In each case, each character is represented by its Unicode character with a prefix denoting its language, e.g. fi:a is the name of the Finnish letter a. Note that in the embeddings for the Finnish-Estonian language pair, fi:a and et:a are considered distinct characters and have separate embeddings. See the paper for more details.
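As a quick illustration, here is a minimal sketch of loading one of the word2vec-format files with Gensim and inspecting the space. The file name fi_et.bin is a hypothetical placeholder, and the assumption that <space> carries a language prefix like other symbols should be checked against the file you download.

```python
from gensim.models import KeyedVectors

# Load one of the binary word2vec-format embedding files released above.
# "fi_et.bin" is a placeholder name: substitute the file you actually download.
kv = KeyedVectors.load_word2vec_format("fi_et.bin", binary=True)

# Each entry is a language-prefixed character, e.g. the Finnish letter a.
print(kv["fi:a"].shape)

# Space characters are stored via the reserved token <space>; here we assume
# it carries a language prefix like every other symbol.
if "fi:<space>" in kv:
    print(kv["fi:<space>"][:5])

# Nearest neighbours in the shared space. If the cross-lingual training has
# worked, the Estonian et:a should appear close to fi:a even though the two
# are treated as distinct symbols.
for symbol, similarity in kv.most_similar("fi:a", topn=5):
    print(symbol, round(similarity, 3))
```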
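Correspondingly, a minimal sketch of reading a TSV-format file together with its metadata, using only numpy and the standard library. The file names vectors.tsv and metadata.tsv and the metadata column order are assumptions rather than the actual layout of the released files.

```python
import csv
import numpy as np

# Placeholder file names: substitute the vector and metadata files you download.
vectors = np.loadtxt("vectors.tsv", delimiter="\t")

with open("metadata.tsv", encoding="utf-8") as f:
    # Assumed layout: one metadata row per embedding, with the language code in
    # one column and the character in another (check the released files).
    metadata = list(csv.reader(f, delimiter="\t"))

print(vectors.shape, len(metadata))

# Example: collect the row indices belonging to Finnish symbols, assuming the
# first metadata column holds the language prefix.
fi_rows = [i for i, row in enumerate(metadata) if row and row[0] == "fi"]
print(len(fi_rows), "Finnish symbols")
```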
Files
The following embeddings are those for which results are reported in the paper. Refer to the paper for more details of the corpora used.
| Corpus 1 | Corpus 2 | Language prefixes | Files |
|---|---|---|---|
| Ylilauta (Finnish forum) | Estonian Reference (news) | fi, et | Word2vec, TSV |
| Ylilauta (Finnish forum) | Estonian Reference (forum) | fi, et | Word2vec, TSV |
| North Karelian Bible | Olonets Karelian Bible | dvi, liv | Word2vec, TSV |
| Ingrian Bible | Olonets Karelian Bible | ing, liv | Word2vec, TSV |
| Ingrian Bible | Ylilauta (Finnish forum) | ing, fi | Word2vec, TSV |
The following language pairs are not described in the paper, but are included here in case they are of interest.
For Danish, we use a Danish Wikipedia dump, downloaded from Linguatools. For Swedish, we use the Swedish portion of the Europarl corpus of transcripts from the European Parliament. As with some pairs used in the paper, this choice deliberately spans contrasting domains.
For Spanish and Portuguese, the corresponding portions of the Europarl corpus are used.
| Corpus 1 | Corpus 2 | Language prefixes | Files |
|---|---|---|---|
| Danish Wikipedia | Swedish Europarl | da, sv | Word2vec, TSV |
| Spanish Europarl | Portuguese Europarl | es, pt | Word2vec, TSV |
Funding
This work was funded by the Academy of Finland Digital Language Typology project (no. 12933481), as part of the DIGIHUM programme.