Alternative title | Plongements de mots pré-entraînés |
---|
Author | Lo, Chi-Kiu (ORCID: https://orcid.org/0000-0001-8714-7846) |
---|
Affiliation | National Research Council of Canada, Digital Technologies |
---|
Format | Text, Dataset |
---|
Physical description | 14 .tgz files (approximately 65 GB total) |
---|
Subject | YiSi; embeddings; machine translation; BLEU score; NRC Portage |
---|
Abstract | NRC pretrained word embeddings: word representations in a high-dimensional vector space
The NRC pretrained word embeddings are a collection of high-dimensional vector representations of words in fourteen languages:
• Chinese
• Czech
• English
• Estonian
• Finnish
• French
• German
• Hindi
• Latvian
• Polish
• Romanian
• Russian
• Spanish
• Turkish
The word embeddings are trained using word2vec (Mikolov et al. 2013) on the data released for the news translation task of the Conference on Machine Translation (WMT). All the pretrained word embeddings are normalized to unit vectors in 300-dimensional space. The pretrained word embeddings can be used as a building block in neural models for other natural language processing tasks, such as word similarity, semantic textual similarity, machine translation evaluation, and other applications.
For more information about using the word embeddings with YiSi, the NRC’s open-source machine translation quality evaluation and estimation metric, please visit the NRC’s GitHub repository: http://github.com/nrc-cnrc/YiSi. |
---|
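The abstract notes that the embeddings are unit-normalized 300-dimensional vectors, so cosine similarity between two words reduces to a plain dot product. The sketch below is illustrative only, not code from the NRC release: it assumes the files use the standard word2vec text format (a `"<vocab> <dim>"` header followed by one `"<word> v1 ... vdim"` line per word), and the tiny 3-dimensional vectors are made-up stand-ins for demonstration.

```python
# Minimal sketch: loading word2vec-style embeddings and scoring word
# similarity. Assumes standard word2vec text format; the toy vectors
# below are hypothetical stand-ins, not values from the NRC dataset.
import math


def load_word2vec_text(path):
    """Parse a word2vec text-format file: a "<vocab> <dim>" header line,
    then one "<word> v1 v2 ... vdim" line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1 : dim + 1]]
    return vectors


def cosine(u, v):
    """Cosine similarity; for unit vectors this equals the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy 3-dimensional, unit-length stand-ins, purely for illustration.
emb = {
    "cat": [1.0, 0.0, 0.0],
    "feline": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}

print(round(cosine(emb["cat"], emb["feline"]), 2))  # 0.8  (similar words)
print(cosine(emb["cat"], emb["car"]))               # 0.0  (unrelated words)
```

In practice one would extract a language's `.tgz` archive and point `load_word2vec_text` at the resulting file; since the released vectors are already normalized, the `norm` term in `cosine` is 1 and can be skipped for speed.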
Publication date | 2019-05-23 |
---|
Date created | 2018 |
---|
Publisher | National Research Council of Canada |
---|
Licence | |
---|
Related publication | |
---|
Language | English |
---|
Collection | NRC Research Data |
---|
Record identifier | 41bc88cd-5362-4d43-b4fd-61ef661018c8 |
---|
Record created | 2019-05-23 |
---|
Record modified | 2022-05-09 |
---|