Language Identification and IT, by Peter Constable and Gary Simons





Language Identification and IT

  • Peter Constable and Gary Simons

  • SIL International

  • peter_constable@sil.org

  • gary_simons@sil.org

  • www.sil.org


Language identification

  • The use of identificational codes for tagging information objects to indicate the language in which the information is expressed



Language identification

  • Automated language detection is not considered here



About the Ethnologue

  • SIL Ethnologue

    • catalogue of all modern languages in the world
    • lists over 6,800 living languages
    • result of decades of research
    • system of three-letter codes
    • http://www.sil.org/ethnologue


About the Ethnologue

  • Existing user base for Ethnologue codes:

    • SIL
    • UNESCO
    • Linguistic Data Consortium (850+ agencies)
    • The Linguist List (12,500 individual linguists)
    • The Endangered Language Fund
    • others


Linguistic diversity

  • # of languages:



Motivation for this paper

  • Languages covered by standards

    • ISO 639-x covers approx. 400 languages
    • existing needs go much further: over 6,800 languages
    • immediate need among linguists and other researchers for use in XML


Five issues

  • Change

  • Categorization

  • Inadequate definition

  • Scale

  • Documentation



The need for language identifiers

  • Language-specific processing

    • spell-checking
    • sorting
    • morphological parsing
    • speech recognition/synthesis
    • language-specific typographic behaviour
    • etc.


The need for language identifiers

  • Language-specific processing

    • choosing appropriate resources


The need for language identifiers

  • Two distinct issues:

    • identify the language
    • apply the specific processing for that language
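The split between these two issues can be sketched in Python. This is a minimal illustration, not something from the original slides: the function names and the toy sort rules are hypothetical, and the rules cover only the letter "ä". It assumes each word list already carries a language tag, so step one is a lookup of the tag and step two selects the processing for that language.

```python
# Toy per-language sort keys (hypothetical; only "ä" is handled).
def german_key(word):
    # German dictionary order treats "ä" like "a".
    return word.lower().replace("ä", "a")

def swedish_key(word):
    # Swedish places "ä" after "z"; "{" is the ASCII character right after "z".
    return word.lower().replace("ä", "{")

# Language-specific resources, keyed by language tag.
SORT_KEYS = {"de": german_key, "sv": swedish_key}

def sort_words(words, lang_tag):
    """Step 1: read the tag. Step 2: apply that language's sort rule."""
    try:
        key = SORT_KEYS[lang_tag]
    except KeyError:
        raise ValueError(f"no sorting resource for language tag {lang_tag!r}")
    return sorted(words, key=key)

words = ["zebra", "äpple", "apa"]
print(sort_words(words, "de"))
print(sort_words(words, "sv"))
```

The same data sorts differently depending on the tag, which is why the tag has to travel with the data.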


The need for language identifiers

  • Language detection

    • identify language by inspection of data itself
    • available only for a few languages
    • not practical for searching large corpora (e.g. the Internet)
    • doesn’t work on short text segments


The need for language identifiers

  • Language-specific processing

    • in general, must tag information objects to indicate language
    • identifiers are needed to distinguish every language


Issue #1: change

  • Languages are constantly changing

  • Implications:

    • systems of language tags cannot be static
    • the speech variety (varieties) denoted by a tag is time-bound


Issue #2: categorization

  • Typical question: Are Serbian and Croatian the same language, or different languages?



Issue #3: inadequate definition

  • Consistent use of a single definition in a given namespace is beneficial

  • Objection: “Requiring a single definition imposes too much constraint on users”

    • users may legitimately have different requirements
    • but no control results in confusion, especially when thousands of identifiers are added


Issue #4: Scale

  • The number of languages exceeds the coverage of existing systems by an order of magnitude (400 vs. 6,800)

  • Existing systems do not scale well



Issue #4: Scale

  • ISO 639-x

    • slow process unable to cope with large volume of requests
    • minimum attestation requirement (50 documents) not appropriate for lesser-known languages
    • mnemonic codes (impossible for thousands of languages)
    • confusion due to inconsistent definition


Issue #4: Scale

  • RFC 1766

    • process unable to cope with large volume of requests
    • confusion due to inconsistent definition
    • unclear how to create tags


Issue #5: documentation

  • Existing systems: can’t tell what codes denote

    • ISO 639-x: language, or group of languages?


Issue #5: documentation

    • ISO 639-x: 2- vs. 3-letter codes


Solving these problems

  • Requirements of an adequate system:

    • able to scale
    • able to deal with change, track history of change
    • use a single operational definition for a given namespace
    • apply definition consistently within a namespace
    • complete, maintained, online documentation


What the Ethnologue offers

  • Scale: already there

    • enumeration of languages
    • set of three-letter codes

  • Change: careful management

    • no re-use of codes
    • have begun recording revision history
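The two change-management practices above (no re-use of codes, a recorded revision history) can be sketched as a small registry. This is a hypothetical illustration under stated assumptions, not a description of how the Ethnologue is actually maintained: the class name, its fields, and the placeholder code "AAA" are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CodeRegistry:
    """Hypothetical registry: codes are never reused and every change is logged."""
    active: dict = field(default_factory=dict)    # code -> language name
    retired: set = field(default_factory=set)     # codes that must not be reissued
    history: list = field(default_factory=list)   # (action, code, note) tuples

    def assign(self, code, name):
        # A code that was ever used, active or retired, cannot be assigned again.
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append(("assign", code, name))

    def retire(self, code, reason):
        del self.active[code]          # raises KeyError if the code is not active
        self.retired.add(code)
        self.history.append(("retire", code, reason))

registry = CodeRegistry()
registry.assign("AAA", "Language A")               # placeholder code and name
registry.retire("AAA", "merged with another entry")
print(registry.history)
```

Because retired codes stay in the registry, a tag written years ago can never silently come to mean a different language.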


What the Ethnologue offers

  • Definition: single definition, applied quite consistently

    • definition: primary criterion of mutual non-intelligibility as a basis for identifying candidates for separate literacy, literature
    • all categories are of the same type; no language families, groups, writing systems


What the Ethnologue offers

  • Documentation

    • extensive information maintained for every language
    • new site will provide various reports
      • alternate names, location, population, etc.
      • related ISO codes, relationship
      • return Ethnologue data given an ISO code
    • evaluating possibilities for returning results as XML


Integration with RFC 1766, XML

  • Ethnologue codes immediately available using “x-”
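As a rough sketch of what a private-use tag looks like in practice: RFC 1766 defines a tag as a primary tag plus optional subtags, each 1 to 8 ASCII letters, separated by hyphens, with the primary tag "x" reserved for private use. The Python below builds and checks such a tag and places it in an XML `xml:lang` attribute. The "sil" subtag and the code "ABC" are placeholders assumed for illustration; the examples from the original slide did not survive.

```python
import re

# RFC 1766 syntax: primary tag and subtags are each 1-8 ASCII letters.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")

def private_use_tag(*subtags):
    """Build an 'x-' (private use) tag from the given subtags and validate it."""
    if not subtags:
        raise ValueError("a private-use tag needs at least one subtag")
    tag = "-".join(("x",) + subtags)
    if not RFC1766_TAG.fullmatch(tag):
        raise ValueError(f"not a well-formed RFC 1766 tag: {tag!r}")
    return tag

# "sil" and "ABC" are hypothetical placeholders, not real registrations.
tag = private_use_tag("sil", "ABC")
element = f'<p xml:lang="{tag}">...</p>'
print(element)
```

Private-use tags need no registration, which is what makes them available immediately, but their meaning depends on agreement between the parties exchanging the data.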



Integration with RFC 1766, XML

  • Register thousands of new tags with IANA

    • process would not be able to cope
    • problems devising that many tags
    • create considerable confusion in the single namespace


Integration with RFC 1766, XML

  • Register “i-sil-” to specify a namespace maintained by a particular agency
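One way to picture the namespace idea is a function that reads the primary tag to decide which authority defines the rest. Under RFC 1766, a two-letter primary tag is an ISO 639 code, "i" marks an IANA registration, and "x" marks private use. The sketch below assumes, from the wording of this slide only, that a namespaced tag would take the form "i-", an agency subtag, then that agency's code; the function names and the code "ABC" are hypothetical.

```python
def tag_authority(tag):
    """Classify a tag by its primary tag, following RFC 1766 conventions."""
    primary = tag.split("-", 1)[0].lower()
    if primary == "x":
        return "private use"
    if primary == "i":
        return "IANA registration"
    if len(primary) == 2 and primary.isascii() and primary.isalpha():
        return "ISO 639 two-letter code"
    return "unassigned"

def namespace_of(tag):
    """Return (agency, code) for an assumed 'i-<agency>-<code>' tag, else None."""
    parts = tag.split("-")
    if len(parts) == 3 and parts[0].lower() == "i":
        return parts[1].lower(), parts[2]
    return None

for example in ["en-US", "i-sil-ABC", "x-sil-ABC"]:
    print(example, "->", tag_authority(example), namespace_of(example))
```

With a registered namespace, IANA would record one entry for the agency, and the agency's own documentation would define the individual codes.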



Integration with RFC 1766, XML

  • Possible refinement: define primary tag “n-”



Conclusions

  • Language identifiers required for language-specific processing

  • Immediate need for thousands of new language identifiers; in particular, for use in XML

  • Five problem areas need to be considered in any system

  • SIL Ethnologue codes address all five problems

  • Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits


