Mark Granroth-Wilding

Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data


The paper includes a couple of plots of the trained embeddings for different language pairs, using an MDS reduction to 2D. Here we present a complete set of plots for all the trained language pairs.

Plots using t-SNE, which is rather better at preserving the proximity between neighbours, are also available here.

NB: These plots show embeddings reduced from 30 dimensions to two. A lot of information is lost in the reduction, so the plots often do not accurately represent the proximity between individual pairs of symbols. They are mainly meant to give a general idea of the layout of the space.
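
To make the loss of information concrete, here is a minimal sketch of reducing 30-dimensional vectors to 2D and checking how often a point keeps its nearest neighbour. It assumes classical (Torgerson) MDS on Euclidean distances, which may differ from the MDS variant used for the plots, and it uses random vectors as a stand-in for the trained embeddings. The names `classical_mds`, `pairwise_distances` and `nearest_neighbour_agreement` are hypothetical helpers written for this illustration.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS of the rows of X into n_components dimensions."""
    n = X.shape[0]
    d2 = pairwise_distances(X) ** 2
    # Double-centre the squared distances to obtain a Gram matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J
    # Keep the directions with the largest eigenvalues
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


def nearest_neighbour_agreement(high, low):
    """Fraction of points whose nearest neighbour is the same in both spaces."""
    d_high = pairwise_distances(high)
    d_low = pairwise_distances(low)
    # Exclude each point from being its own nearest neighbour
    np.fill_diagonal(d_high, np.inf)
    np.fill_diagonal(d_low, np.inf)
    return float((d_high.argmin(axis=1) == d_low.argmin(axis=1)).mean())


# Hypothetical stand-in for trained embeddings: 40 symbols, 30 dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 30))

coords = classical_mds(embeddings)
agreement = nearest_neighbour_agreement(embeddings, coords)
print(f"2D shape: {coords.shape}, nearest-neighbour agreement: {agreement:.2f}")
```

An agreement well below 1 means that many points which look closest in the 2D plot are not closest in the original space, which is why the plots are best read for overall layout rather than for individual pairs.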

Each set has a plot of:

  1. the single-character embeddings for both languages;
  2. these plus the most frequent bi-grams and tri-grams for both languages, based on a sample from the training set.
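
Selecting the most frequent bi-grams and tri-grams from a text sample can be sketched as below. This is only an illustration: the sample size, the number of n-grams kept, and whether n-grams may span spaces are not stated here, so the function `most_frequent_ngrams`, its parameters, and the sample text are all assumptions.

```python
from collections import Counter


def most_frequent_ngrams(lines, n, top_k):
    """Return the top_k most frequent character n-grams in a sample of text lines."""
    counts = Counter()
    for line in lines:
        # Slide a window of n characters across the line
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return [ngram for ngram, _ in counts.most_common(top_k)]


# Hypothetical sample standing in for lines drawn from a training set
sample = ["abab", "ab"]
bigrams = most_frequent_ngrams(sample, n=2, top_k=1)
trigrams = most_frequent_ngrams(sample, n=3, top_k=2)
print(bigrams, trigrams)
```

Counting within each line separately keeps n-grams from spanning line boundaries; a different choice there would change the counts slightly.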

You can pan and zoom all the plots.

See the main paper page for more details of the methods and datasets.

Finnish-Estonian, mixed domain

Finnish-Estonian, forum domain

North Karelian (dvi) - Olonets Karelian (liv), Bible

Ingrian - Olonets Karelian (liv), Bible

Ingrian Bible - Finnish forum

Danish Wikipedia - Swedish Europarl

Spanish-Portuguese Europarl