Discussion:
[Moses-support] When to truecase
Lane Schwartz
2015-05-20 16:31:22 UTC
Philipp (and others),

I'm wondering what people's experience is regarding when truecasing is
applied.

One option is to truecase the training data, then train your TM and LM
using that truecased data. Another option would be to lowercase the data,
train TM and LM on the lowercased data, and then perform truecasing after
decoding.

I assume that the former gives better results, but the latter approach has
an advantage in terms of extensibility (namely if you get more data and
update your truecase model, you don't have to re-train all of your TMs and
LMs).
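
A minimal sketch of the two orderings in Python (a toy
illustration, not the actual Moses scripts; train_truecaser,
apply_casing, and decode are hypothetical stand-ins, with
decode() reduced to an identity placeholder for a decoder
whose TM and LM are trained on whichever casing its input
carries):

    from collections import Counter

    def train_truecaser(cased_sentences):
        # Toy truecase model: most frequent surface form per lowercased token.
        counts = Counter(tok for s in cased_sentences for tok in s.split())
        model = {}
        for form, n in counts.items():
            low = form.lower()
            if low not in model or n > counts[model[low]]:
                model[low] = form
        return model

    def apply_casing(sentence, model):
        return " ".join(model.get(t.lower(), t) for t in sentence.split())

    def decode(sentence):
        # Identity stand-in for the decoder; the real TM/LM would be
        # trained on data in whichever casing this input uses.
        return sentence

    corpus = ["the EU summit", "the EU budget", "The summit ended"]
    tc = train_truecaser(corpus)

    # Option 1: truecase first, then train and decode on truecased data.
    print(decode(apply_casing("the eu summit", tc)))          # the EU summit

    # Option 2: train and decode on lowercased data, restore case after.
    print(apply_casing(decode("the eu summit".lower()), tc))  # the EU summit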

Does anyone have any insights they would care to share on this?

Thanks,
Lane
Philipp Koehn
2015-05-20 17:43:09 UTC
Hi,

see Section 2.2 in our WMT 2009 submission:
http://www.statmt.org/wmt09/pdf/WMT-0929.pdf

One practical reason to avoid recasing is the need
for a second large cased language model.

But there is of course also the practical issue
of having a unique truecasing scheme for each
data condition, handling headlines, all-caps
emphasis, etc.

It would be worthwhile to revisit this issue
under different data conditions / language pairs.
Both options are readily available in EMS.

Each of the two alternative methods could be
improved as well. See for instance:
http://www.aclweb.org/anthology/N06-1001

-phi

Lane Schwartz
2015-05-20 18:01:11 UTC
Philipp,

In Table 2 of the WMT 2009 paper, are the "baseline" and "truecased"
columns directly comparable? In other words, do the two columns indicate
identical conditions other than a single variable (how and/or when casing
was handled)?

In the baseline condition, how and when was casing handled?

Thanks,
Lane
--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
Philipp Koehn
2015-05-20 18:30:57 UTC
Hi,

no, the changes are made incrementally.

So the recased "baseline" is the previous "mbr/mp" column.

-phi
Lane Schwartz
2015-05-20 18:50:41 UTC
Got it. So then, how was casing handled in the "mbr/mp" column? Was all of
the data lowercased, then models trained, then recasing applied after
decoding? Or something else?
Ondrej Bojar
2015-05-22 09:20:16 UTC
Hi,

we also have an experiment on truecasing; see Table 1 in http://www.statmt.org/wmt13/pdf/WMT08.pdf

What works best for us is relying on the casing guessed by the lemmatizer. (Our lemmatizer recognizes names as separate lemmas and keeps those lemmas upcased; we then cast the lemma's casing onto the token in the sentence.)

The Moses recaser was the worst option; it was actually better to lowercase only the source side of the parallel data, i.e. let the main search also pick the casing.
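
A minimal sketch of that casting step (hypothetical function,
not their actual tool):

    def cast_lemma_case(token, lemma):
        # If the lemmatizer kept the lemma upcased (i.e. a name),
        # mirror that casing on the surface token; otherwise lowercase.
        if lemma[:1].isupper():
            return token[:1].upper() + token[1:]
        return token.lower()

    print(cast_lemma_case("praze", "Praha"))  # Praze (name: lemma upcased)
    print(cast_lemma_case("Domu", "dům"))     # domu  (sentence-initial common noun)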

Cheers, O.

--
Ondrej Bojar (mailto:***@cuni.cz / ***@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
Matthias Huck
2015-05-22 12:24:24 UTC
Hi,

If your system output is lowercase, you could try SRILM's `disambig`
tool for predicting the correct casing in a postprocessing step.

http://www.speech.sri.com/projects/srilm/manpages/disambig.1.html
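
A minimal sketch of the idea in Python (disambig itself reads
a map file of per-word alternatives plus an N-gram LM; here
both are derived from a toy cased corpus):

    import math
    from collections import Counter, defaultdict

    # Each lowercased token expands to its observed cased variants; a
    # bigram LM over cased text then picks the best sequence via Viterbi.
    cased = "the EU summit opened . The ministers left the summit .".split()

    uni = Counter(cased)
    bi = Counter(zip(cased, cased[1:]))
    variants = defaultdict(set)
    for w in uni:
        variants[w.lower()].add(w)
    V = len(uni)

    def logp(prev, w):
        # Add-one-smoothed bigram log-probability (unigram at sentence start).
        if prev is None:
            return math.log((uni[w] + 1) / (sum(uni.values()) + V))
        return math.log((bi[(prev, w)] + 1) / (uni[prev] + V))

    def recase(lowered):
        beams = {None: (0.0, [])}               # last cased word -> (score, path)
        for tok in lowered.split():
            new = {}
            for w in variants.get(tok, {tok}):  # unseen tokens pass through
                for prev, (score, path) in beams.items():
                    cand = (score + logp(prev, w), path + [w])
                    if w not in new or cand[0] > new[w][0]:
                        new[w] = cand
            beams = new
        return " ".join(max(beams.values())[1])

    print(recase("the ministers left the eu summit ."))
    # -> The ministers left the EU summit .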

Cheers,
Matthias
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
Philipp Koehn
2015-05-20 19:07:52 UTC
Hi,

yes, this is what the RECASER section in EMS enables.

-phi
Post by Lane Schwartz
Got it. So then, how was casing handled in the "mbr/mp" column? Was all
of the data lowercased, then models trained, then recasing applied after
decoding? Or something else?
Ergun Bicici
2015-05-21 15:34:03 UTC
Recaser: builds a Moses model for word-level translation from lowercased
to cased text, and also uses a language model. The input to the recaser is
lowercased.

Truecaser: builds a casing model from the number of times each cased
version of a word appears in the text (e.g. rivet (4/8), Rivet (3),
RIVET (1)). The input to the truecaser is left as it is, not lowercased.

Therefore, if the text is noisy, such as tweets, the recaser may perform
better.
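
A minimal sketch of the difference in training inputs (toy
code, not the actual Moses training scripts):

    from collections import Counter

    text = ["rivet"] * 4 + ["Rivet"] * 3 + ["RIVET"] * 1

    # Truecaser: counts each cased variant exactly as it appears.
    variant_counts = Counter(text)
    print(variant_counts)                       # rivet: 4, Rivet: 3, RIVET: 1
    print(variant_counts.most_common(1)[0][0])  # 'rivet' -> preferred form

    # Recaser: trains a lowercased->cased translation model (plus a cased
    # LM), so both its training pairs and its decode-time input are
    # lowercased first.
    recaser_pairs = [(w.lower(), w) for w in text]
    print(recaser_pairs[4])                     # ('rivet', 'Rivet')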


Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/