Language Identification and IT, by Peter Constable and Gary Simons

Language Identification and IT

Language identification

About the Ethnologue

Linguistic diversity

Motivation for this paper

Five issues

The need for language identifiers

Issue #1: change

Issue #2: categorization

Issue #3: inadequate definition

Issue #4: Scale

Issue #5: documentation

Solving these problems

What the Ethnologue offers

Integration with RFC 1766, XML

Conclusions

Language Identification and IT

Peter Constable and Gary Simons

SIL International

peter_constable@sil.org

gary_simons@sil.org

www.sil.org

Language identification

The use of identificational codes for tagging information objects to indicate the language in which the information is expressed

Automated language detection is not considered here

About the Ethnologue

SIL Ethnologue

Existing user base for Ethnologue codes:

Linguistic diversity

Number of languages:

Motivation for this paper

Languages covered by standards

Five issues

Change

Categorization

Inadequate definition

Scale

Documentation

The need for language identifiers

Language-specific processing

Two distinct issues:

Language detection

Language-specific processing

Issue #1: change

Languages are constantly changing

Implications:

Issue #2: categorization

Typical question: Are Serbian and Croatian the same language, or different languages?

Issue #3: inadequate definition

Existing systems do not consistently employ a single operational definition

Consistent use of a single definition in a given namespace is beneficial

“Requiring a single definition imposes too much constraint on users”

Issue #4: Scale

The number of languages exceeds what existing systems cover by an order of magnitude (400 vs. 6,800)

Existing systems do not scale well

ISO 639-x

RFC 1766
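
One part of the scale problem on the ISO 639 side can be seen by counting how many identifiers a fixed-length alphabetic code can express. The sketch below assumes codes drawn from the 26 basic Latin letters and uses the figure of 6,800 languages quoted above; the function name is illustrative only, and raw capacity is just one aspect of whether a system scales.

```python
# Capacity of fixed-length alphabetic code spaces versus the number of
# languages cited in the slides (6,800).

ALPHABET_SIZE = 26        # assumption: codes use only the basic Latin letters a-z
LANGUAGES_CITED = 6_800   # figure quoted in the slides


def code_space(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Return how many distinct codes of exactly `length` letters exist."""
    return alphabet_size ** length


two_letter = code_space(2)
three_letter = code_space(3)

# Compare each code space against the number of languages to be identified.
print(f"two-letter codes:   {two_letter} (enough: {two_letter >= LANGUAGES_CITED})")
print(f"three-letter codes: {three_letter} (enough: {three_letter >= LANGUAGES_CITED})")
```

Under these assumptions a two-letter code space cannot cover the cited number of languages even in principle, while a three-letter code space has room to spare.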

Issue #5: documentation

Existing systems: users cannot tell what the codes denote

Solving these problems

Requirements of an adequate system:

What the Ethnologue offers

Scale: already there

Change: careful management

Definition: single definition, applied quite consistently

Documentation

Integration with RFC 1766, XML

Ethnologue codes immediately available using “x-”

Register thousands of new tags with IANA

Register “i-sil-” to specify a namespace maintained by a particular agency

Possible refinement: define primary tag “n-”
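
The “x-” and “i-sil-” options above can be sketched in code. This is a minimal sketch, assuming tags of the form `x-sil-CODE` and `i-sil-CODE`, where `CODE` stands for a three-letter Ethnologue code; `ABC` is a placeholder rather than a real code, and the helper names `make_tag` and `split_namespace` are hypothetical. It checks a tag against the RFC 1766 syntax (a primary tag and subtags of one to eight letters each, separated by hyphens) and reads the tag back from an `xml:lang` attribute.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# RFC 1766 syntax: a primary tag followed by zero or more subtags, each made
# of 1 to 8 ASCII letters and separated by hyphens.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")

# Attribute key that ElementTree uses for xml:lang after parsing.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tag(code: str, prefix: str = "x-sil") -> str:
    """Build a tag such as 'x-sil-ABC' (assumed form) and validate its syntax."""
    tag = f"{prefix}-{code}"
    if not RFC1766_TAG.fullmatch(tag):
        raise ValueError(f"not a well-formed RFC 1766 tag: {tag!r}")
    return tag


def split_namespace(tag: str) -> Optional[Tuple[str, str]]:
    """Return (namespace, code) for 'x-sil-CODE' or 'i-sil-CODE', else None."""
    parts = tag.lower().split("-")  # RFC 1766 tags are case-insensitive
    if len(parts) == 3 and parts[0] in ("x", "i") and parts[1] == "sil":
        return ("sil", parts[2].upper())
    return None


private_tag = make_tag("ABC")              # private-use form, usable without registration
registered_tag = make_tag("ABC", "i-sil")  # form that would rely on an IANA registration

# Tag a piece of XML content and read the language identifier back.
doc = ET.fromstring(f'<p xml:lang="{private_tag}">sample text</p>')
found = doc.get(XML_LANG)

print(found, split_namespace(found))
print(registered_tag, split_namespace(registered_tag))
print("en", split_namespace("en"))
```

Both forms pass the same syntax check, so the choice between them is a question of registration and governance rather than of what an XML processor can parse.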

Conclusions

Language identifiers required for language-specific processing

Immediate need for thousands of new language identifiers; in particular, for use in XML

Five problem areas need to be considered in any system

SIL Ethnologue codes address all five problems

Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits
