Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data
Plots made with t-SNE, which is rather better at preserving the proximity between neighbours, are also available here.
NB: These plots show embeddings reduced from 30 dimensions to two. A lot of information is lost in the reduction, so the plots often do not accurately represent the proximity between individual pairs of symbols. They are meant mainly to give a general idea of the layout of the space.
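One rough way to see how much neighbourhood structure a two-dimensional view loses is to compare each point's nearest neighbours before and after the reduction. The sketch below is only an illustration. The random vectors stand in for real embeddings, keeping the first two coordinates is a crude substitute for whatever reduction the plots actually use, and the function names `nearest` and `neighbour_overlap` are hypothetical.

```python
import math
import random


def nearest(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return set(order[:k])


def neighbour_overlap(high, low, k=5):
    """Mean fraction of each point's k nearest neighbours shared by both spaces."""
    shared = [
        len(nearest(high, i, k) & nearest(low, i, k)) / k
        for i in range(len(high))
    ]
    return sum(shared) / len(shared)


# Hypothetical data: 200 random 30-dimensional vectors standing in for embeddings.
rng = random.Random(0)
high = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(200)]

# Crude stand-in for a 2-D reduction: keep only the first two coordinates.
low = [v[:2] for v in high]

print(f"neighbour overlap: {neighbour_overlap(high, low):.2f}")
```

A score of 1.0 would mean every point keeps all of its nearest neighbours in the reduced space. Scores well below that are the reason to read the plots as an overview of the layout rather than as a record of pairwise distances.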
Each set has a plot of:
- the single-character embeddings for both languages;
- these plus the most frequent bi-grams and tri-grams for both languages, based on a sample from the training set.
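As a rough illustration of how the symbols for the second kind of plot might be chosen, the sketch below counts character bigrams and trigrams in a text sample and keeps the most frequent ones. It rests on assumptions: the function `top_ngrams`, the toy sample, and the cut-off of three n-grams are all hypothetical, and the page does not say whether n-grams were allowed to span spaces (here they are).

```python
from collections import Counter


def top_ngrams(lines, n, k):
    """Return the k most frequent character n-grams across the given lines."""
    counts = Counter()
    for line in lines:
        # Slide a window of width n over the line, including any spaces.
        counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    return [gram for gram, _ in counts.most_common(k)]


# Hypothetical stand-in for a sample drawn from the training set.
sample = ["tere tere", "terve terve"]

# Single characters first, then the most frequent bigrams and trigrams.
symbols = sorted(set("".join(sample)))
symbols += top_ngrams(sample, 2, 3) + top_ngrams(sample, 3, 3)

print(symbols)
```

In practice the same selection would be run separately for each language, so that both sets of symbols can be drawn in the same plot.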
You can pan and zoom all the plots.
See the main paper page for more details of the methods and datasets.
- Finnish-Estonian (mixed domain)
- Finnish-Estonian (forum domain)
- North Karelian-Olonets Karelian
- Ingrian-Olonets Karelian