Word segmentation of Vietnamese texts: a comparison of approaches

NGUYEN Minh Huyen, ROSSIGNOL Mathias, LE H.P., DINH Q. T., VU X. L., NGUYEN C. T.
LREC '08 - May 2008
We present in this paper a comparison of three word segmentation systems for the Vietnamese language. Most Vietnamese words are built by semantic composition from about 7,000 syllables, which also carry meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often arise. Beyond presenting the tested systems, we also propose a standard definition of word segmentation in Vietnamese and introduce a reference corpus developed for evaluating this task. The observed results confirm that the task can be handled relatively well by automatic means, although a solution is still needed to account for out-of-vocabulary words.
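To make the boundary ambiguity concrete, here is a minimal sketch of a greedy longest-matching segmenter over syllables. It is an illustration only and is not claimed to be one of the three systems compared in the paper. The function name `segment`, the parameter `max_len`, and the four-entry `TOY_LEXICON` are assumptions introduced for this example.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy left-to-right longest matching over a list of syllables.

    Illustrative sketch only: `lexicon` is a hypothetical set of known
    words, each written as space-separated syllables.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable is always
        # accepted, so out-of-vocabulary syllables become one-syllable words.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon (assumption): "học sinh" = pupil, "sinh học" = biology,
# "học" = to study, "sinh" = to give birth.
TOY_LEXICON = {"học sinh", "sinh học", "học", "sinh"}

# Intended reading: "học sinh | học | sinh học" (pupils study biology).
print(segment("học sinh học sinh học".split(), TOY_LEXICON))
```

On this input the greedy strategy returns `học sinh | học sinh | học` rather than the intended reading, because the same syllables form valid words under several groupings. The single-syllable fallback also shows why out-of-vocabulary words are a problem for lexicon-driven approaches: an unknown multi-syllable word is simply split into its syllables.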

BibTeX reference

@InProceedings{NRLDVN08,
  author       = {NGUYEN, M. and ROSSIGNOL, M. and LE, H. and DINH, Q. and VU, X. and NGUYEN, C.},
  title        = {Word segmentation of Vietnamese texts: a comparison of approaches},
  booktitle    = {LREC '08},
  month        = {May},
  year         = {2008},
  address      = {Marrakech, Morocco},
  url          = {/2008/NRLDVN08},
}
