Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data
Plots using TSNE
The paper includes a couple of plots of the trained embeddings for different language pairs, using an MDS reduction to 2D. Here we present a complete set of plots for all the trained language pairs using TSNE.
Plots using MDS can be seen here.
NB: These plots show embeddings reduced from 30 dimensions to two. Although the TSNE plots tend to be better than those using MDS, a lot of information is still lost in the reduction, and the distance between any individual pair of symbols in a plot often does not accurately reflect their proximity in the original space. The plots are meant mainly to give a general idea of the layout of the space.
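One rough way to gauge how much neighbourhood structure a 2D reduction loses is to compare each point's nearest neighbours before and after the reduction. The sketch below is only an illustration and is not the evaluation used in the paper. The helper names `knn` and `neighbour_preservation` are hypothetical, and the demo uses random 30-dimensional vectors with a crude "keep the first two coordinates" projection as a stand-in for real embeddings and a real TSNE layout.

```python
import math
import random


def knn(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def neighbour_preservation(high, low, k=5):
    """Mean fraction of each point's k nearest neighbours in `high`
    that are also among its k nearest neighbours in `low`.

    1.0 means every neighbourhood is kept; values near 0 mean the
    reduced layout says little about which points are actually close.
    """
    if len(high) != len(low):
        raise ValueError("high and low must describe the same points")
    if not 1 <= k < len(high):
        raise ValueError("k must be between 1 and len(points) - 1")
    hi_nn = knn(high, k)
    lo_nn = knn(low, k)
    shared = sum(len(a & b) for a, b in zip(hi_nn, lo_nn))
    return shared / (k * len(high))


# Demo with made-up data (assumption: 40 random points in 30 dimensions).
rng = random.Random(0)
embeddings_30d = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(40)]
# Stand-in for a real 2D reduction: simply keep the first two coordinates.
layout_2d = [vec[:2] for vec in embeddings_30d]

score = neighbour_preservation(embeddings_30d, layout_2d, k=5)
print(f"fraction of 5-nearest neighbours preserved: {score:.2f}")
```

To apply this to real data, pass the 30-dimensional embeddings as `high` and the plotted 2D coordinates as `low`, in the same order.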
Each set has a plot of:
- the single-character embeddings for both languages;
- these plus the most frequent bi-grams and tri-grams for both languages, based on a sample from the training set.
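As an illustration of how the most frequent bi-grams and tri-grams might be picked from a sample, here is a minimal sketch. It is not the paper's code: the function name `most_frequent_ngrams` is hypothetical, the sample lines are invented, and counting n-grams only within whitespace-separated tokens is an assumption.

```python
from collections import Counter


def most_frequent_ngrams(lines, n, top_k):
    """Return the top_k most frequent character n-grams in `lines`.

    Assumption: n-grams are counted inside whitespace-separated tokens,
    so they never span a word boundary.
    """
    counts = Counter()
    for line in lines:
        for word in line.split():
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


# Invented sample text, standing in for a sample of the training set.
sample = ["talo talossa talosta", "kala kalassa kalasta"]
print(most_frequent_ngrams(sample, n=2, top_k=5))
print(most_frequent_ngrams(sample, n=3, top_k=5))
```

The selected n-grams would then be embedded and plotted alongside the single characters.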
You can pan and zoom all the plots.
See the main paper page for more details of the methods and datasets.
- Finnish-Estonian (mixed domain)
- Finnish-Estonian (forum domain)
- North Karelian-Olonets Karelian
- Ingrian-Olonets Karelian