Named Entity Recognition

[Screenshot: Stanford's Named Entity Recognizer run on a World Bank foreign aid document]

During my second and third weeks of research, I investigated whether named entity recognition could be the method I use to extract location names from foreign aid documents.

The image above is a screenshot of the Stanford Natural Language Processing Group's Named Entity Recognizer applied to a document from the same World Bank project in Cote d'Ivoire referenced in my last post. As the key on the right side shows, deep purple indicates that the highlighted word is a location. On this document, the software identifies all but one location (Indencie).

Stanford's Named Entity Recognizer is a CRF classifier: a general implementation of linear-chain Conditional Random Field sequence models, which are widely used for structured prediction in pattern recognition and machine learning. CRFs are a type of discriminative undirected probabilistic graphical model. Unlike ordinary classifiers, which predict a label for each sample without regard to its neighbors, CRFs take context into account, which explains their popularity in natural language processing: they predict a sequence of labels for a sequence of input samples by encoding relationships learned from a training set of documents and constructing a consistent interpretation of the whole sequence. Stanford has trained several classifiers on different corpora; I'll be using the 7-class model trained on the MUC corpus, since it yielded the most accurate results in my sample project.
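
To make the sequence-labeling idea concrete, here is a minimal toy sketch of a linear-chain CRF using the sklearn-crfsuite Python package. This is purely illustrative: it is not Stanford's (Java) implementation, and the sentence, labels, and features are made up.

    import sklearn_crfsuite

    # One toy training sentence with made-up labels ("O" = not an entity).
    train_sentences = [["Abidjan", "is", "in", "Cote", "d'Ivoire"]]
    train_labels = [["LOCATION", "O", "O", "LOCATION", "LOCATION"]]

    def token_features(sentence, i):
        # Features for the current token plus a little left context, so the
        # model can use neighboring words when predicting a label.
        features = {
            "word.lower": sentence[i].lower(),
            "word.istitle": sentence[i].istitle(),
        }
        if i > 0:
            features["prev_word.lower"] = sentence[i - 1].lower()
        return features

    X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sentences]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X_train, train_labels)
    print(crf.predict(X_train))  # predicts a label sequence per sentence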

The Natural Language Toolkit (NLTK), the leading platform for working with human language data in Python, is open-source software containing text-processing modules for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. One of these modules interfaces with the Stanford NER software to tag tokens as times, locations, organizations, and so on. I originally used it in my program to identify locations, but quickly realized that its implementation was far too slow for the quantity and length of documents I needed to analyze. After testing several alternatives, I settled on the pyner module, which tags tokens by connecting to a remote Stanford NER server.
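
For reference, using pyner looks roughly like this. I'm assuming the Stanford NER server is already running locally; the port number and file paths below are just examples.

    # Start the server first with something along these lines (paths vary):
    # java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
    #     -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz \
    #     -port 9191 -outputFormat inlineXML
    import ner

    # Connect to the running NER server over a socket.
    tagger = ner.SocketNER(host="localhost", port=9191)

    # get_entities returns a dict mapping entity types to matched strings.
    entities = tagger.get_entities("The project repaved roads near Abidjan, Cote d'Ivoire.")
    print(entities.get("LOCATION"))  # e.g. ["Abidjan", "Cote d'Ivoire"]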

My program currently identifies locations using pyner, sorts them by how frequently they appear across the documents, and prints every sentence that contains a location name. Simply handing these sentences to a human geocoder would already speed up the geocoding process, but I plan to use them to improve the accuracy of automation by searching them for missed locations (Indencie, for instance, appears in a sentence alongside other locations and could potentially be recovered) and for activity descriptions.
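
In outline, that step could be sketched like this (simplified: the input filename is hypothetical, and the real documents need more cleanup first):

    from collections import Counter

    import ner
    import nltk  # nltk.download("punkt") may be needed once for sent_tokenize

    tagger = ner.SocketNER(host="localhost", port=9191)
    document = open("project_document.txt").read()  # hypothetical input file

    location_counts = Counter()
    location_sentences = []
    for sentence in nltk.sent_tokenize(document):
        locations = tagger.get_entities(sentence).get("LOCATION", [])
        if locations:
            location_counts.update(locations)
            location_sentences.append(sentence)

    # Most frequently mentioned locations first, then their context sentences.
    for location, count in location_counts.most_common():
        print(location, count)
    for sentence in location_sentences:
        print(sentence)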

Next, I plan to find a way to connect to and query GeoNames, the geographic database AidData uses to geocode, from within my program.
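
One candidate is GeoNames' free REST search endpoint, which can be queried directly from Python. A sketch, assuming a registered GeoNames username (the shared "demo" account is heavily rate-limited):

    import requests

    def geocode(name, username="demo"):
        # Query the GeoNames search web service for the best match.
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={"q": name, "maxRows": 1, "username": username},
        )
        results = response.json().get("geonames", [])
        if results:
            top = results[0]
            return top["name"], top["lat"], top["lng"]
        return None

    print(geocode("Abidjan"))  # e.g. ("Abidjan", "5.30966", "-4.01266")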

If you have any questions or suggestions, please comment below!