


Date: 29.09.2018


Named Entity Recognition

  • What is NE?

  • What isn’t NE?

  • Problems and solutions with NE task definitions

  • Problems and solutions with NE task

  • Some applications


Why do NE Recognition?

  • Key part of Information Extraction system

  • Robust handling of proper names essential for many applications

  • Pre-processing for different classification levels

  • Information filtering

  • Information linking



NE Definition

  • NE involves identification of proper names in texts, and classification into a set of predefined categories of interest.

  • Three universally accepted categories: person, location and organisation

  • Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.

  • Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.



What NE is NOT

  • NE is not event recognition.

  • NE recognises entities in text, and classifies them in some way, but it does not create templates, nor does it perform co-reference or entity linking, though these processes are often implemented alongside NE as part of a larger IE system.

  • NE is not just matching text strings with pre-defined lists of names. It only recognises entities which are being used as entities in a given context.

  • NE is not easy!



Problems in NE Task Definition

  • Category definitions are intuitively quite clear, but there are many grey areas.

  • Many of these grey areas are caused by metonymy.

  • Person vs. Artefact: “The ham sandwich wants his bill.” vs. “Bring me a ham sandwich.”

  • Organisation vs. Location: “England won the World Cup” vs. “The World Cup took place in England”.

  • Company vs. Artefact: “shares in MTV” vs. “watching MTV”

  • Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”



Solutions

  • The task definition must be very clearly specified at the outset.

  • The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and “logic” behind the intuition.

  • MUC essentially adopted the simplistic approach of disregarding metonymous uses of words, e.g. “England” was always identified as a location. However, this is not always useful for practical applications of NER (e.g. in the football domain, where “England” usually refers to the team).

  • Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.



Basic Problems in NE

  • Variation of NEs – e.g. John Smith, Mr Smith, John.

  • Ambiguity of NE types

    • John Smith (company vs. person)
    • May (person vs. month)
    • Washington (person vs. location)
    • 1945 (date vs. time)
  • Ambiguity with common words, e.g. “may”



More complex problems in NER

  • Issues of style, structure, domain, genre etc.

    • Punctuation, spelling, spacing, formatting ... all have an impact
    • e.g. an address laid out over several lines:
        Dept. of Computing and Maths
        Manchester Metropolitan University
        Manchester
        United Kingdom
    • e.g. a name split across quoted lines:
        > Tell me more about Leonardo
        > Da Vinci



List Lookup Approach

  • System that recognises only entities stored in its lists (gazetteers).

  • Advantages: simple, fast, language independent, easy to retarget

  • Disadvantages: lists must be collected and maintained, cannot deal with name variants, cannot resolve ambiguity
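
The list lookup approach can be illustrated with a minimal Python sketch. The gazetteer entries and the `list_lookup` function are hypothetical and exist only for this illustration; real gazetteers are far larger and usually compiled into a more efficient structure.

```python
import re

# Hypothetical gazetteer: surface string -> category.
GAZETTEER = {
    "John Smith": "person",
    "Manchester": "location",
    "May": "person",
    "Microsoft": "organisation",
}


def list_lookup(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, category) for each gazetteer entry found."""
    matches = []
    for name, category in gazetteer.items():
        # \b stops "May" from matching inside "Mayor"; matching is case-sensitive.
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            matches.append((m.start(), m.end(), m.group(), category))
    return sorted(matches)


if __name__ == "__main__":
    print(list_lookup("John Smith moved to Manchester in May."))
    print(list_lookup("Mr Smith stayed."))
```

In the first sentence "May" is tagged as a person although it is used as a month, and in the second sentence the variant "Mr Smith" is not found at all. Both disadvantages listed above follow directly from matching strings without looking at context.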



Shallow Parsing Approach

  • Internal evidence – names often have internal structure. These components can be either stored or guessed.

  • location:

  • CapWord + {City, Forest, Center}

  • e.g. Sherwood Forest

  • CapWord + {Street, Boulevard, Avenue, Crescent, Road}

  • e.g. Portobello Street
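
As a rough sketch, the two internal-evidence patterns above can be written as one regular expression in Python. The definition of `CAPWORD` (one ASCII capital followed by lowercase letters) and the function name are assumptions made for this example, not part of the original system.

```python
import re

# Assumed definition of CapWord: a capital letter followed by lowercase letters.
CAPWORD = r"[A-Z][a-z]+"

# Key words taken from the two patterns above.
LOCATION_KEYS = [
    "City", "Forest", "Center",
    "Street", "Boulevard", "Avenue", "Crescent", "Road",
]

INTERNAL_LOCATION = re.compile(
    r"\b" + CAPWORD + r" (?:" + "|".join(LOCATION_KEYS) + r")\b"
)


def find_locations_internal(text):
    """Return every 'CapWord + key word' sequence found in text."""
    return [m.group(0) for m in INTERNAL_LOCATION.finditer(text)]


if __name__ == "__main__":
    print(find_locations_internal("We walked from Sherwood Forest to Portobello Street."))
    print(find_locations_internal("The Street was closed."))
```

The second call returns "The Street", because a sentence-initial capital looks like a CapWord to this pattern. A pattern this simple cannot tell the two cases apart.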



Shallow Parsing Approach

  • External evidence - names are often used in very predictive local contexts

  • Location:

  • “to the” COMPASS “of” CapWord

  • e.g. to the south of Loitokitok

  • “based in” CapWord

  • e.g. based in Loitokitok

  • CapWord “is a” (ADJ)? GeoWord

  • e.g. Loitokitok is a friendly city
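
The three context patterns above could be sketched in Python roughly as follows. The word lists for `COMPASS` and `GEOWORD`, and the use of "any single lowercase word" as a stand-in for ADJ, are assumptions made for this example; a real system would take them from a lexicon or a part-of-speech tagger.

```python
import re

CAPWORD = r"(?P<name>[A-Z][a-z]+)"                  # assumed CapWord definition
COMPASS = r"(?:north|south|east|west)"              # assumed COMPASS word list
GEOWORD = r"(?:city|town|village|country|region)"   # assumed GeoWord list
ADJ = r"(?:[a-z]+ )?"                               # crude stand-in for (ADJ)?

PATTERNS = [
    re.compile(r"\bto the " + COMPASS + r" of " + CAPWORD),
    re.compile(r"\bbased in " + CAPWORD),
    re.compile(r"\b" + CAPWORD + r" is a " + ADJ + GEOWORD + r"\b"),
]


def find_locations_external(text):
    """Return CapWords that occur in one of the location contexts."""
    found = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append(m.group("name"))
    return found


if __name__ == "__main__":
    for sentence in (
        "to the south of Loitokitok",
        "based in Loitokitok",
        "Loitokitok is a friendly city",
    ):
        print(find_locations_external(sentence))
```

Unlike list lookup, none of these patterns needs "Loitokitok" to be known in advance; the surrounding words alone trigger the match.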



Difficulties in Shallow Parsing Approach

  • Ambiguously capitalised words (first word in sentence)

  • [All American Bank] vs. All [State Police]

  • Semantic ambiguity

  • “John F. Kennedy” = airport (location)

  • “Philip Morris” = organisation

  • Structural ambiguity

  • [Cable and Wireless] vs. [Microsoft] and [Dell]

  • [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith].



Technology

  • JAPE (Java Annotation Patterns Engine)

  • Based on Doug Appelt’s CPSL

  • Reimplementation of NE recogniser from LaSIE



NE System Architecture



Modules

  • Tokeniser

    • segments text into tokens, e.g. words, numbers, punctuation
  • Gazetteer lists

    • NEs, e.g. towns, names, countries, ...
    • key words, e.g. company designators, titles, ...
  • Grammar

    • hand-coded rules for NE recognition


JAPE

  • Set of phases consisting of pattern/action rules

  • Phases run sequentially and constitute a cascade of FSTs over annotations

  • LHS - annotation pattern containing regular expression operators

  • RHS - annotation manipulation statements

  • Annotations matched on LHS referred to on RHS using labels attached to pattern elements
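
The idea of phases running one after another over a shared set of annotations can be illustrated with a small Python sketch. This is not JAPE and does not use finite state transducers; the phase functions, the token-index offsets, and `LOCATION_KEYS` are all hypothetical and only show how a later phase can build on annotations created by an earlier one.

```python
LOCATION_KEYS = {"City", "Forest", "Center"}


def make_tokens(text):
    """Create one Token annotation per whitespace-separated word (token offsets)."""
    return [
        {"type": "Token", "start": i, "end": i + 1, "string": word}
        for i, word in enumerate(text.split())
    ]


def phase_capwords(annotations):
    """Phase 1: add a CapWord annotation over every capitalised Token."""
    new = [
        dict(a, type="CapWord")
        for a in annotations
        if a["type"] == "Token" and a["string"][0].isupper()
    ]
    return annotations + new


def phase_locations(annotations):
    """Phase 2: CapWord followed directly by a key word becomes a Location."""
    capwords = [a for a in annotations if a["type"] == "CapWord"]
    new = []
    for first in capwords:
        for second in capwords:
            if first["end"] == second["start"] and second["string"] in LOCATION_KEYS:
                new.append({
                    "type": "Location",
                    "start": first["start"],
                    "end": second["end"],
                    "string": first["string"] + " " + second["string"],
                })
    return annotations + new


def run_cascade(annotations, phases):
    """Run the phases in order; each phase sees the output of the previous one."""
    for phase in phases:
        annotations = phase(annotations)
    return annotations


if __name__ == "__main__":
    result = run_cascade(make_tokens("based in Sherwood Forest"),
                         [phase_capwords, phase_locations])
    print([a for a in result if a["type"] == "Location"])
```

Running `phase_locations` on its own finds nothing, because the CapWord annotations it matches on are only created by the first phase.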



Tokeniser

  • Set of rules producing annotations

  • LHS is regular expression matched on input

  • RHS describes annotations to be added to AnnotationSet

  • e.g. (UPPERCASE_LETTER) (LOWERCASE_LETTER)* > Token; orth = upperInitial; kind = word
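
A rule-based tokeniser of this kind can be approximated in Python with an ordered list of regular expressions, each paired with the features to attach. Only the upperInitial rule comes from the slide; the other rules, the ASCII-only character classes, and the `tokenise` function are assumptions made for this sketch.

```python
import re

# Ordered rules: (LHS regular expression, annotation type, features to add).
TOKEN_RULES = [
    (re.compile(r"[A-Z]{2,}"), "Token", {"kind": "word", "orth": "allCaps"}),
    # The rule from the slide: one uppercase letter, then lowercase letters.
    (re.compile(r"[A-Z][a-z]*"), "Token", {"kind": "word", "orth": "upperInitial"}),
    (re.compile(r"[a-z]+"), "Token", {"kind": "word", "orth": "lowercase"}),
    (re.compile(r"[0-9]+"), "Token", {"kind": "number"}),
    (re.compile(r"\s+"), "SpaceToken", {"kind": "space"}),
    # Fallback: any other single character is treated as punctuation.
    (re.compile(r".", re.DOTALL), "Token", {"kind": "punctuation"}),
]


def tokenise(text):
    """Apply the first matching rule at each position and collect annotations."""
    annotations = []
    pos = 0
    while pos < len(text):
        for pattern, ann_type, features in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                annotations.append({
                    "type": ann_type,
                    "start": pos,
                    "end": m.end(),
                    "string": m.group(),
                    **features,
                })
                pos = m.end()
                break
    return annotations


if __name__ == "__main__":
    for annotation in tokenise("John owes 25 USD."):
        print(annotation)
```

The sketch takes the first rule that matches rather than the longest match, so mixed-case words such as "McDonald" are split into two tokens, and non-ASCII letters fall through to the punctuation rule.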



Gazetteer

  • Set of lists compiled into Finite State Machines

  • Each list has attributes MajorType and MinorType (and optionally, Language)

  • city.lst: location: city

  • currency_prefix.lst: currency_unit: pre_amount

  • currency_unit.lst: currency_unit: post_amount
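
The mapping from list files to MajorType and MinorType can be sketched in Python as below. The index lines are the three shown above; the list contents in `LISTS` are invented for the example, and a plain dictionary stands in for the compiled finite state machine.

```python
# Index lines as shown above: file name, MajorType, MinorType.
INDEX = """\
city.lst: location: city
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
"""

# Hypothetical list contents; real lists would be read from the .lst files.
LISTS = {
    "city.lst": ["Manchester", "Sheffield"],
    "currency_prefix.lst": ["$", "£"],
    "currency_unit.lst": ["dollars", "pounds"],
}


def parse_index(index_text):
    """Map each list file name to its majorType / minorType features."""
    entries = {}
    for line in index_text.splitlines():
        if not line.strip():
            continue
        parts = [part.strip() for part in line.split(":")]
        filename, major = parts[0], parts[1]
        minor = parts[2] if len(parts) > 2 else None
        entries[filename] = {"majorType": major, "minorType": minor}
    return entries


def build_lookup(index_text, lists):
    """Build a table from entry string to the features of the list it came from."""
    table = {}
    for filename, features in parse_index(index_text).items():
        for entry in lists.get(filename, []):
            table[entry] = features
    return table


if __name__ == "__main__":
    lookup = build_lookup(INDEX, LISTS)
    print(lookup["Manchester"])
    print(lookup["$"])
```

A grammar rule can then test these features, for example `majorType == location`, without knowing which list an entry came from.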



Named entity grammar

  • hand-coded rules applied to annotations to identify NEs

  • annotations from format analysis, tokeniser and gazetteer modules

  • use of contextual information

  • rule priority based on pattern length, rule status and rule ordering
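
One plausible reading of the priority scheme, sketched in Python: when several rules match at the same position, prefer the longest match, then the highest explicit priority, then the rule that appears first in the grammar. Treating "rule status" as a numeric priority value is an assumption of this sketch, and the candidate records are hypothetical.

```python
def select_match(candidates):
    """Pick one match among rules that all matched at the same start position.

    Each candidate is a dict with:
      'rule'     - rule name
      'length'   - number of annotations covered by the match
      'priority' - explicit priority value (assumed meaning of "rule status")
      'order'    - position of the rule in the grammar file (0 = first)
    """
    return max(
        candidates,
        key=lambda c: (c["length"], c["priority"], -c["order"]),
    )


if __name__ == "__main__":
    candidates = [
        {"rule": "Location1", "length": 2, "priority": 25, "order": 0},
        {"rule": "Person1", "length": 2, "priority": 50, "order": 1},
        {"rule": "Org1", "length": 1, "priority": 100, "order": 2},
    ]
    print(select_match(candidates)["rule"])
```

Here `Org1` loses despite its high priority because its match is shorter, and `Person1` beats `Location1` on priority once the lengths are equal.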



Example of JAPE Grammar rule

    Rule: Location1
    Priority: 25
    (
      ( { Lookup.majorType == loc_key, Lookup.minorType == pre }
        { SpaceToken } )?
      { Lookup.majorType == location }
      ( { SpaceToken }
        { Lookup.majorType == loc_key, Lookup.minorType == post } )?
    )
    :locName -->
      :locName.Location = { kind = "gazetteer", rule = Location1 }
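
To make the logic of the Location1 rule concrete, here is a rough Python equivalent. It assumes the annotations arrive as a flat, ordered list of dictionaries, which is a simplification (real annotations can overlap), and the helper names and the "Lake Geneva" example data are hypothetical.

```python
def is_lookup(ann, major, minor=None):
    """True if ann is a Lookup with the given majorType (and minorType, if set)."""
    return (
        ann["type"] == "Lookup"
        and ann.get("majorType") == major
        and (minor is None or ann.get("minorType") == minor)
    )


def match_location1(anns, i):
    """Try to match Location1 starting at index i; return the end index or None."""
    j = i
    # Optional prefix: location key word (pre) followed by a space.
    if (j + 1 < len(anns) and is_lookup(anns[j], "loc_key", "pre")
            and anns[j + 1]["type"] == "SpaceToken"):
        j += 2
    # Obligatory part: a gazetteer location.
    if j >= len(anns) or not is_lookup(anns[j], "location"):
        return None
    j += 1
    # Optional suffix: a space followed by a location key word (post).
    if (j + 1 < len(anns) and anns[j]["type"] == "SpaceToken"
            and is_lookup(anns[j + 1], "loc_key", "post")):
        j += 2
    return j


def apply_location1(anns):
    """Scan left to right and create a Location annotation for each match."""
    results = []
    i = 0
    while i < len(anns):
        end = match_location1(anns, i)
        if end is None:
            i += 1
            continue
        span = anns[i:end]
        results.append({
            "type": "Location",
            "start": span[0]["start"],
            "end": span[-1]["end"],
            "kind": "gazetteer",
            "rule": "Location1",
        })
        i = end
    return results


if __name__ == "__main__":
    # Hypothetical annotations for the text "Lake Geneva is".
    annotations = [
        {"type": "Lookup", "majorType": "loc_key", "minorType": "pre", "start": 0, "end": 4},
        {"type": "SpaceToken", "start": 4, "end": 5},
        {"type": "Lookup", "majorType": "location", "start": 5, "end": 11},
        {"type": "SpaceToken", "start": 11, "end": 12},
        {"type": "Token", "start": 12, "end": 14},
    ]
    print(apply_location1(annotations))
```

The new Location annotation spans the key word and the gazetteer entry together (characters 0 to 11), which corresponds to what the `:locName` label does in the JAPE rule.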


MUSE

  • MUlti-Source Entity recognition

  • Named entity recognition from a variety of text types, domains and genres.

  • 2 years from Feb 2000 – 2002

  • Sponsors: GCHQ



PASTA

  • Protein Active Site Template Acquisition

  • Aim: Use of IE techniques to create a database of protein active site data to support protein structure analysis

  • Partners: Dept. of Computer Science, Information Studies, Mol. Biology and Biotechnology, Univ. of Sheffield

  • Sponsors: BBSRC-EPSRC Bioinformatics Initiative



Molecular Biology



PASTA System Architecture



Recognition of Biological Terminology



MUMIS

  • MUltiMedia Indexing and Searching environment

  • Application of IE technology to multimedia, multilingual video indexing in football domain

  • 2 years: June 2000 - 2002

  • CTIT (NL), University of Sheffield (UK), DFKI (D), Max Planck Institute (D), University of Nijmegen (NL), ESTeam (SWE), VDA (NL)


