Integrating Geographic Data

During my fifth, sixth, and seventh weeks of research, I investigated how to use my GeoNames data to eliminate incorrectly tagged locations and to add locations that were missed.

Here’s an example of a dictionary entry to refresh your memory:

{'retiro': [((40.41317, -3.68307), 'ES', '6544495'), ((-34.58333, -58.38333), 'AR', '3429576')]}

According to this entry, there are two locations with the name Retiro. One is in Spain with the coordinates (40.41317, -3.68307) and GeoID 6544495. The other is in Argentina with the coordinates (-34.58333, -58.38333) and GeoID 3429576.

Initially, I use this dictionary to eliminate tokens tagged as locations by Stanford NER that aren’t in the GeoNames database. Occasionally, listed project locations aren’t in the database (apparently there are over nine million named locations in the world), and human geocoders have to find their coordinates manually and enter them into GeoNames through its wiki-style interface. For the purposes of this program, however, I only consider locations already in the database, since I need their information (coordinates, country codes, etc.) to make further decisions about whether to keep or eliminate them from the list of correct locations.
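As a rough illustration (not my actual code), this filtering pass might look like the sketch below. The names `geonames` and `ner_locations` are placeholders for the dictionary shown above and for the tokens Stanford NER tagged as locations:

```python
# A minimal sketch of the first filtering pass. `geonames` is assumed to be
# the dictionary shown above; `ner_locations` the list of tokens that
# Stanford NER tagged as locations.
def keep_known_locations(ner_locations, geonames):
    """Drop tagged locations that have no entry in the GeoNames dictionary."""
    return {token: geonames[token.lower()]
            for token in ner_locations
            if token.lower() in geonames}

# e.g. keep_known_locations(['Retiro', 'Narnia'], geonames)
# keeps only the 'retiro' entry.
```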

I search for the location names tagged by Stanford NER in GeoNames using “fuzzy” search, which allows for slight misspellings of locations in documents instead of requiring exact matches. Aside from ordinary human typos, these misspellings can be caused by language translation, errors in PDF-to-TXT conversion, and, in my case, by my program itself. When importing text from the project documents, it eliminates non-ASCII characters because they’re untokenizable by Stanford NER, meaning that accented characters are deleted from location names. Thus, a method is necessary to ensure that these locations aren’t overlooked. The most basic form of fuzzy search bases matches on Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another. The Natural Language Toolkit contains a Levenshtein distance calculator that I use to compare each found location with each GeoNames location, calling them a match if their Levenshtein distance is less than or equal to 1.
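To make that concrete, a fuzzy version of the lookup might look like this sketch, using NLTK’s edit_distance function; the helper name and the threshold parameter are illustrative:

```python
import nltk

# A sketch of the fuzzy lookup, assuming the same `geonames` and
# `ner_locations` placeholders as above. nltk.edit_distance computes
# Levenshtein distance (insertions, deletions, substitutions).
def fuzzy_match(ner_locations, geonames, max_distance=1):
    """Match tagged locations to GeoNames names within `max_distance` edits."""
    matched = {}
    for token in ner_locations:
        for name, entries in geonames.items():
            if nltk.edit_distance(token.lower(), name) <= max_distance:
                matched[name] = entries
                break  # take the first sufficiently close GeoNames name
    return matched
```

With a threshold of 1, a de-accented token like 'bouake' still matches a GeoNames entry such as 'bouaké'.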

Next, I compile a list of all sentences mentioning any location in this new abbreviated list. I search through these sentences for other GeoNames location names, as sentences mentioning one location often contain others not detected by Stanford NER. An example from a World Bank project document mentioned in previous blog posts is the sentence: “Project locations will include Abidjan (in particular Abobo, Yopougon and Cocody), Bonoua, Bouake, Korhogo, and in some selected cities including Indenie and Cocody bays.” Stanford NER identifies Abidjan, Abobo, Yopougon, Cocody, Bonoua, Bouake, and Korhogo, but not Indenie. This step of the program would add Indenie to the program’s list of considered locations if it were in GeoNames. Because not all location names are individual words (like the United States of America), I search for each GeoNames location name within each sentence, so multiword phrases are considered, rather than fuzzy-searching each word of the sentence against the GeoNames dictionary. If I were using the full GeoNames dataset this would take far too long, so in the future I plan to make this method more efficient by dividing each sentence into “chunks” and identifying their approximate matches in the GeoNames dictionary. This would involve word-level tokenization and part-of-speech tagging to determine multi-token sequences, which could be achieved with modules in the Natural Language Toolkit.
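Here is a minimal sketch of this sentence-scanning step, under the same placeholder names as above; `sentences` stands for the compiled list of sentence strings:

```python
# A sketch of the sentence-scanning step. `sentences` is assumed to be the
# list of sentences mentioning a matched location; `geonames` and `matched`
# carry over from the earlier sketches.
def expand_from_sentences(sentences, geonames, matched):
    """Add GeoNames names, including multiword ones, found in the sentences."""
    expanded = dict(matched)
    for sentence in sentences:
        lowered = sentence.lower()
        for name, entries in geonames.items():
            # Substring search so a multiword name like
            # 'united states of america' is considered as a whole.
            if name in lowered:
                expanded[name] = entries
    return expanded
```

Checking every GeoNames name against every sentence is exactly what makes the full dataset impractical here, hence the planned chunking optimization.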

With this new expanded list of locations, I use each location’s country and coordinates to eliminate geographic outliers. Documents often mention the location of the donor, the capital city of the recipient country, and other financial and governmental hubs as headquarters addresses, conference locations, and the like, but these are not where the aid is actually going. First, I eliminate locations outside the country that the majority lie within, as aid projects typically occur within only one recipient country. Then I calculate coordinate outliers. Currently, I’m using the standard statistical definition of an outlier and evaluating latitude and longitude separately: I calculate the first and third quartiles and the interquartile range (the difference between these quartiles), and call a location an outlier if it falls outside the range (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR) in its latitude, its longitude, or both. This could potentially be made more accurate by testing different parameters (instead of 1.5), or even by calculating the distance between each pair of locations and identifying distance outliers instead of coordinate outliers. But the latter would be overly complicated, as the distances would need to be calculated pairwise and labeling an individual location as an outlier would require more conditions.
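Roughly, the outlier test looks like the sketch below; the helper names and the use of Python’s statistics module are illustrative, and `locations` is assumed to be a list of (name, (latitude, longitude)) pairs that survived the country filter:

```python
import statistics

# A rough sketch of the coordinate-outlier test, evaluating latitude and
# longitude separately as described above.
def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the standard IQR outlier test."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_coordinate_outliers(locations, k=1.5):
    lat_lo, lat_hi = iqr_fences([lat for _, (lat, _) in locations], k)
    lon_lo, lon_hi = iqr_fences([lon for _, (_, lon) in locations], k)
    # A location is an outlier if either its latitude or its longitude
    # falls outside the corresponding fences.
    return [(name, (lat, lon)) for name, (lat, lon) in locations
            if lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi]
```

Exposing k as a parameter makes it easy to experiment with values other than 1.5, as mentioned above.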

In the final week of research, I plan to reflect on my research thus far and determine what I can improve in my program.

If you have any questions or suggestions, please comment below!