COMET Metrics
COMET models can be optimized towards different kinds of human judgements (for example HTER or DA). Because of this, we provide a list of different metrics that you can use to test your systems:
| Model | Description |
|---|---|
| ↑ wmt-large-da-estimator-1719 | RECOMMENDED: Estimator model built on top of XLM-R (large), trained on DA from WMT17, WMT18 and WMT19. |
| ↑ wmt-base-da-estimator-1719 | Estimator model built on top of XLM-R (base), trained on DA from WMT17, WMT18 and WMT19. |
| ↓ wmt-large-hter-estimator | Estimator model built on top of XLM-R (large), trained to regress on HTER. |
| ↓ wmt-base-hter-estimator | Estimator model built on top of XLM-R (base), trained to regress on HTER. |
| ↑ emnlp-base-da-ranker | Translation ranking model that uses XLM-R to encode sentences. This model was trained with WMT17 and WMT18 Direct Assessments Relative Ranks (DARR). |
The first four models (wmt-*) were trained and tested for the WMT2020 shared task, so they were introduced only in our submission to that shared task (paper still under review).
NOTE: Scores from different metrics are not comparable, even when the models regress on the same type of human judgement (e.g. scores from a large and a base model are not comparable even though both were trained on the same type of judgements)! Please make sure you use the same metric when comparing two systems!
Also, since HTER measures the number of edits needed to correct an MT hypothesis, models regressing on HTER produce lower scores for better systems.
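The two rules above (compare only scores from the same metric, and read HTER scores as lower-is-better) can be sketched as a small guard in plain Python. This is an illustration, not part of COMET: `SCORE_DIRECTION` and `better_system` are hypothetical names, the scores in the usage lines are made-up values, and the direction assigned to the DA and ranker models assumes that the ↑ marker in the table means "higher is better".

```python
# Score direction per metric. "lower" for the HTER models follows the note
# above; "higher" for the other models is an assumption based on the
# arrow markers in the table.
SCORE_DIRECTION = {
    "wmt-large-da-estimator-1719": "higher",
    "wmt-base-da-estimator-1719": "higher",
    "wmt-large-hter-estimator": "lower",
    "wmt-base-hter-estimator": "lower",
    "emnlp-base-da-ranker": "higher",
}


def better_system(metric_a, score_a, metric_b, score_b):
    """Return "a", "b" or "tie" for two system-level scores.

    Hypothetical helper: refuses to compare scores that were produced
    by two different metrics, since those are not comparable.
    """
    if metric_a != metric_b:
        raise ValueError(
            f"Scores from {metric_a!r} and {metric_b!r} are not comparable; "
            "score both systems with the same metric."
        )
    if score_a == score_b:
        return "tie"
    if SCORE_DIRECTION[metric_a] == "higher":
        return "a" if score_a > score_b else "b"
    return "a" if score_a < score_b else "b"


# Illustrative (made-up) scores for two systems.
da = "wmt-large-da-estimator-1719"
hter = "wmt-large-hter-estimator"
print(better_system(da, 0.50, da, 0.30))      # DA model: the higher score wins
print(better_system(hter, 0.50, hter, 0.30))  # HTER model: the lower score wins
```

Passing two different metric names raises a `ValueError` instead of returning a winner, which keeps a large-model score from being compared against a base-model score by accident.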