Language Identification and IT
Peter Constable and Gary Simons
SIL International
peter_constable@sil.org, gary_simons@sil.org
www.sil.org
Language identification
- the use of identifying codes to tag information objects with the language in which the information is expressed
- automated language detection is not considered here
About the Ethnologue
SIL Ethnologue
- catalogue of all modern languages in the world
- lists over 6,800 living languages
- result of decades of research
- system of three-letter codes
- http://www.sil.org/ethnologue
About the Ethnologue
Existing user base for Ethnologue codes:
- SIL
- UNESCO
- Linguistic Data Consortium (850+ agencies)
- The Linguist List (12,500 individual linguists)
- The Endangered Language Fund
- others
Linguistic diversity
Motivation for this paper
Languages covered by standards
- ISO 639-x covers approx. 400 languages
- existing needs go much further: over 6,800 languages
- immediate need among linguists and other researchers for use in XML
Five issues
- Change
- Categorization
- Inadequate definition
- Scale
- Documentation
The need for language identifiers
Language-specific processing
- spell-checking
- sorting
- morphological parsing
- speech recognition/synthesis
- language-specific typographic behaviour
- etc.
The need for language identifiers
Language-specific processing
- choosing appropriate resources
The need for language identifiers
Two distinct issues:
- identify the language
- apply the specific processing for that language
The need for language identifiers
Language detection
- identify language by inspection of data itself
- available only for a few languages
- not practical for searching large corpora (e.g. the Internet)
- doesn’t work on short text segments
The need for language identifiers
Language-specific processing
- in general, must tag information objects to indicate language
- identifiers are needed to distinguish every language
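Tagging of this kind can be sketched in Python with the standard `xml:lang` attribute. This is only an illustration: the sample document, the resource names in `SPELL_CHECKERS`, and the tag `x-sil-abc` are hypothetical and are not taken from the Ethnologue or any registry.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document; "x-sil-abc" is a made-up private-use tag.
SAMPLE = """<doc xml:lang="en">
  <p>colour</p>
  <p xml:lang="fr">couleur</p>
  <p xml:lang="x-sil-abc">text</p>
</doc>"""

# Hypothetical mapping from language tag to a processing resource.
SPELL_CHECKERS = {"en": "english-dictionary", "fr": "french-dictionary"}


def tagged_segments(element, inherited=None):
    """Yield (language tag, text) pairs; xml:lang is inherited by children.

    Only each element's leading text is considered in this sketch.
    """
    lang = element.get(XML_LANG, inherited)
    text = (element.text or "").strip()
    if text:
        yield lang, text
    for child in element:
        yield from tagged_segments(child, lang)


def choose_resource(lang):
    """Step 2: pick the language-specific resource once the tag is known."""
    return SPELL_CHECKERS.get(lang, "no resource available")


segments = list(tagged_segments(ET.fromstring(SAMPLE)))
for lang, text in segments:
    print(lang, text, choose_resource(lang))
```

The sketch keeps the two issues apart: `tagged_segments` only reads the identifier, and `choose_resource` only applies processing for it. A tag with no matching resource is still identified correctly.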
Issue #1: change
Languages are constantly changing
Implications:
- systems of language tags cannot be static
- the speech variety (varieties) denoted by a tag is time-bound
Issue #2: categorization
Typical question: Are Serbian and Croatian the same language, or different languages?
Issue #3: inadequate definition
- ISO 639-2: codes for “languages” and for groups of languages
Issue #3: inadequate definition
Consistent use of a single definition in a given namespace is beneficial
Objection: “Requiring a single definition imposes too much constraint on users”
- users may legitimately have different requirements
- but no control results in confusion, especially when thousands of identifiers are added
Issue #4: Scale
- the number of languages exceeds the coverage of existing systems by an order of magnitude (approx. 400 vs. 6,800)
- existing systems do not scale well
Issue #4: Scale
ISO 639-x
- slow process unable to cope with large volume of requests
- minimal attestation (50 documents) not appropriate for lesser-known languages
- mnemonic codes (impossible for thousands of languages)
- confusion due to inconsistent definition
Issue #4: Scale
RFC 1766
- process unable to cope with large volume of requests
- confusion due to inconsistent definition
- unclear how to create tags
Issue #5: documentation
Existing systems: can’t tell what codes denote
- ISO 639-x: language, or group of languages?
Issue #5: documentation
- ISO 639-x: 2- vs. 3-letter codes
Solving these problems
Requirements of an adequate system:
- able to scale
- able to deal with change, track history of change
- use a single operational definition for a given namespace
- apply definition consistently within a namespace
- complete, maintained, online documentation
What the Ethnologue offers
Scale: already there
- enumeration of languages
- set of three-letter codes
Change: careful management
- no re-use of codes
- have begun recording revision history
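The two change-management rules above (codes are never re-used, and revisions are recorded) can be sketched as a small registry. This is a minimal illustration, not the Ethnologue's actual data model: the class `CodeRegistry`, its fields, and the sample code `aaa` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRegistry:
    """Hypothetical registry: codes are never re-used, changes are logged."""

    active: dict = field(default_factory=dict)    # code -> language name
    retired: set = field(default_factory=set)     # codes withdrawn from use
    history: list = field(default_factory=list)   # (year, action, code, note)

    def assign(self, code, name, year):
        # A code that was ever used, even if retired, cannot be assigned again.
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append((year, "assign", code, name))

    def retire(self, code, year, note=""):
        # Retiring keeps the code reserved so old tags stay unambiguous.
        name = self.active.pop(code)
        self.retired.add(code)
        self.history.append((year, "retire", code, note or name))


registry = CodeRegistry()
registry.assign("aaa", "Example language", 2000)   # made-up code and name
registry.retire("aaa", 2001, "merged with another entry")
```

Because retired codes stay reserved, data tagged under an old code can still be interpreted by consulting `history` for the period in which the code was active.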
What the Ethnologue offers
- definition: the primary criterion is mutual non-intelligibility, used as a basis for identifying candidates for separate literacy and literature
- all categories are of the same type; no language families, groups, writing systems
What the Ethnologue offers
Documentation
- extensive information maintained for every language
- new site will provide various reports
- alternate names, location, population, etc.
- related ISO codes, relationship
- return Ethnologue data given an ISO code
- evaluating possibilities for returning results as XML
Integration with RFC 1766, XML
Ethnologue codes immediately available using “x-”
Integration with RFC 1766, XML
Register thousands of new tags with IANA
- process would not be able to cope
- problems devising that many tags
- would create considerable confusion in the single namespace
Integration with RFC 1766, XML
Register “i-sil-” to specify a namespace maintained by a particular agency
Integration with RFC 1766, XML
Possible refinement: define primary tag “n-”
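The private-use (“x-”) and registered-namespace (“i-sil-”) options can be sketched as a small tag builder that checks RFC 1766 syntax (a primary tag plus optional subtags, each 1 to 8 ASCII letters). Several things here are assumptions: the function `build_tag`, the use of a `sil` subtag under “x-”, the placement of the three-letter code as the final subtag, and the placeholder code `abc`.

```python
import re

# RFC 1766 syntax: primary tag and subtags are each 1-8 ASCII letters,
# joined by hyphens.
_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def build_tag(primary, *subtags):
    """Join a primary tag and subtags, rejecting anything RFC 1766 forbids."""
    tag = "-".join((primary,) + subtags)
    if not _TAG.fullmatch(tag):
        raise ValueError(f"not a well-formed RFC 1766 tag: {tag!r}")
    return tag


CODE = "abc"  # hypothetical three-letter code, not a real Ethnologue entry

# Private use: available immediately, no registration needed.
private_tag = build_tag("x", "sil", CODE)

# Registered namespace: one registration covers every code under "i-sil-".
namespace_tag = build_tag("i", "sil", CODE)

print(private_tag, namespace_tag)
```

Under the namespace approach, only the prefix would need to be registered; individual codes would be resolved against the documentation of the agency that maintains the namespace.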
Conclusions
- language identifiers are required for language-specific processing
- there is an immediate need for thousands of new language identifiers, in particular for use in XML
- five problem areas need to be considered in any system
- SIL Ethnologue codes address all five problems
- revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits