Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data

Plots using TSNE

The paper includes a couple of plots of the trained embeddings for different language pairs, using an MDS reduction to 2D. Here we present a complete set of plots for all the trained language pairs using TSNE.

Plots using MDS can be seen here.

NB: These plots show embeddings reduced from 30 dimensions to two. Although the TSNE plots tend to give a clearer layout than those using MDS, a lot of information is still lost in the reduction, and the distance between two individual points in a plot often does not accurately reflect their proximity in the original space. The plots are meant mainly to give a general idea of the layout of the space.
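As a rough illustration of this caveat, the sketch below reduces 30-dimensional vectors to 2D and measures how often a point keeps the same nearest neighbour after the reduction. It assumes scikit-learn's `TSNE`; this page does not say which implementation or settings were used for the plots, so the perplexity, initialisation and seed are arbitrary choices. The random `embeddings` array is a placeholder, not the trained vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: 60 random 30-d vectors standing in for trained embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 30))

# Reduce to 2D. The settings below are illustrative assumptions only.
tsne = TSNE(n_components=2, perplexity=15.0, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)


def nearest_neighbour(points):
    """Index of each point's nearest neighbour under Euclidean distance."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    return dists.argmin(axis=1)


# Fraction of points whose nearest neighbour is the same in 30-d and in 2D.
preserved = float(np.mean(nearest_neighbour(embeddings) == nearest_neighbour(coords)))
print(f"nearest neighbour preserved for {preserved:.0%} of points")
```

A low fraction here is a reminder to read the plots for overall layout rather than for exact pairwise distances.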

Each set has a plot of:

  1. the single-character embeddings for both languages;
  2. these plus the most frequent bi-grams and tri-grams for both languages, based on a sample from the training set.
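Selecting the most frequent bi-grams and tri-grams from a text sample can be sketched as follows. This is a minimal illustration, not the procedure used for the plots: the function name `top_ngrams`, the sample string and the cut-off `k` are hypothetical, and skipping n-grams that contain whitespace is an assumption, since this page does not say how spaces were handled.

```python
from collections import Counter


def top_ngrams(text, n, k):
    """Return the k most frequent character n-grams in text.

    N-grams containing whitespace are skipped (an assumption of this sketch).
    """
    counts = Counter(
        text[i:i + n]
        for i in range(len(text) - n + 1)
        if not any(ch.isspace() for ch in text[i:i + n])
    )
    return [gram for gram, _ in counts.most_common(k)]


# Hypothetical sample standing in for text drawn from a training set.
sample = "kala kalassa kalat"
bigrams = top_ngrams(sample, 2, 3)
trigrams = top_ngrams(sample, 3, 2)
print(bigrams, trigrams)
```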

You can pan and zoom all the plots.

See the main paper page for more details of the methods and datasets.

Finnish-Estonian, mixed domain

Finnish-Estonian, forum domain

North Karelian (dvi) - Olonets Karelian (liv), Bible

Ingrian - Olonets Karelian (liv), Bible

Ingrian Bible - Finnish forum

Danish Wikipedia - Swedish Europarl

Spanish-Portuguese Europarl