Reflecting on Progress

During my eighth week of research, I gave a presentation to the EXTREEMS-QED research program that outlined my research thus far and future goals, allowing me to reflect on my progress and gain insight from other budding mathematicians and computer scientists.

You can view the presentation here – it explains the human process of geocoding and my attempts at replicating it, with real output from my program. Both have been discussed in my previous blog posts, so I’ll focus on what’s new: the usefulness of my current program and its future.

The big question in applied research is whether what you’re doing is actually useful, and I think it is in this case. Beyond expanding my knowledge of programming in Python and handling big data, it’s a step in the right direction for AidData’s approach to data extraction. It’s not the first attempt at automating geocoding, and I’m sure it won’t be the last as people gain interest in speeding up the process and improving its accuracy, but it has definitely added new material to the discussion.

Currently, two student interns geocode each aid project, and if their codes don’t exactly match, a research assistant must arbitrate them. There were around 50 AidData employees geocoding this summer, and while many of them grew to be very efficient after doing it for 8 hours a day, 5 days a week, they’re only human. They rely on speed reading, skimming, and searching for keywords like “locations” and “sites”. To handle the vast supply of aid projects and the demand for their quantification, technology needs to intervene. In the sample World Bank project from my presentation and previous blog posts, the program would provide the human geocoder with a list of 4 correct locations and no incorrect locations, plus the sentences containing those locations and the remaining 4 correct locations it missed. Thus, from a list of 4 words and sentences totaling 477 words, the geocoder would come to the same conclusions as they would after reading 20 pages of project documents. And this is a small-scale project; many have hundreds of pages of documents, so there’s even greater potential to increase efficiency.

Whether this program has a future is debatable – in my work at AidData I’ll be moving on from geocoding to join the “data team”, which is concerned with these advancements, so I’ll find out. If I were to continue work on it, I’ve determined an array of things I could improve upon.

The highest priority, and the most deceptively simple, is finding a more comprehensive GeoNames connection. Employees at AidData are constantly adding new locations to GeoNames as they geocode, and I want to be able to utilize those additions.
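One possible route would be querying the GeoNames search web service directly instead of relying on a static local copy. Here’s a minimal sketch assuming the requests library; the username is a placeholder for a registered (free) GeoNames account, and the place name is just an example:

```python
# Minimal sketch: query the GeoNames search web service for up-to-date records
# instead of relying on a static local copy. Assumes the requests library;
# "demo_username" is a placeholder for a registered GeoNames account name.
import requests

def search_geonames(name, max_rows=5, username="demo_username"):
    """Return candidate GeoNames records matching a place name."""
    response = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": name, "maxRows": max_rows, "username": username},
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("geonames", [])

if __name__ == "__main__":
    for record in search_geonames("Bahir Dar"):
        print(record["geonameId"], record["name"], record.get("countryName", ""))
```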

Beyond that, I’d like to retain the structure of each document. Stanford NER is trained to tag sentences, but project documents can list locations in paragraphs or in tables, and tables are converted to TXT as unrelated sequences of words that the software is unequipped to understand. I’m currently converting documents from PDF to TXT in Adobe Reader, but ideally I would find a way to convert documents within my program, differentiate between paragraphs and tables, and develop a new location identification method for tables.
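For the conversion step, a library like pdfminer.six could do the job inside the program; here’s a minimal sketch (the filename is a placeholder, and telling paragraphs apart from tables would still be a separate problem):

```python
# Minimal sketch: convert a project document from PDF to plain text inside the
# program instead of exporting by hand from Adobe Reader. Assumes the
# pdfminer.six package; "project_document.pdf" is a placeholder filename.
from pdfminer.high_level import extract_text

def pdf_to_text(pdf_path):
    """Return the full text of a PDF as a single string."""
    return extract_text(pdf_path)

if __name__ == "__main__":
    text = pdf_to_text("project_document.pdf")
    # Paragraphs usually survive as blank-line-separated blocks; tables do not,
    # so distinguishing the two would still need its own logic downstream.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    print(len(blocks), "text blocks extracted")
```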

More specific to the inner workings of my current program, I want to determine more accurate parameters for labeling locations as outliers, and whether any specific types of locations can generally be eliminated, like capital cities (which can be government ministry locations) or countries (which typically aren’t coded unless no other locations are listed).

Additionally, instead of eliminating non-ASCII characters when importing text from the documents, I want to replace accented characters with their plain ASCII equivalents (so é would become e) so that tagged locations can be more accurately matched with GeoNames locations.

I’d also like to develop faster implementations of many of my current methods, particularly those in which I iterate through each GeoNames location. For fuzzy searching, I’d use a Levenshtein automaton, which produces the list of words within a predetermined Levenshtein distance of a given word, and look up each of those words in the GeoNames dictionary. To search for missed locations in sentences containing found locations, I’d use chunking to create multiword proper noun phrases, which would then be fuzzy searched against the GeoNames dictionary.
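To make the accent handling and chunking ideas concrete, here’s a minimal sketch of how they might look in Python. It assumes NLTK (with the punkt and averaged_perceptron_tagger data downloaded), uses a plain dynamic-programming edit distance in place of a true Levenshtein automaton (so it shows the matching logic, not the speedup), and the sample sentence and tiny GeoNames dictionary are made up for illustration:

```python
# Sketch of two proposed improvements: folding accented characters to plain
# ASCII before matching, and chunking sentences into multiword proper noun
# phrases that can then be fuzzy searched against GeoNames.
# Assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger' data installed;
# the sample sentence and tiny GeoNames dictionary below are placeholders.
import unicodedata
import nltk

def ascii_fold(text):
    """Replace accented characters with plain ASCII equivalents (é -> e)."""
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii"))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (stand-in for an automaton)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def proper_noun_phrases(sentence):
    """Chunk a sentence into candidate multiword proper noun phrases."""
    grammar = "CANDIDATE: {<NNP>+}"   # one or more consecutive proper nouns
    parser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = parser.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "CANDIDATE")]

# Placeholder GeoNames dictionary and sentence, purely for illustration.
geonames = {"addis ababa": 344979, "bahir dar": 342884}
sentence = "Training centres were built in Bahir Dár and Addis Abeba."

for phrase in proper_noun_phrases(sentence):
    folded = ascii_fold(phrase).lower()
    matches = [name for name in geonames if levenshtein(folded, name) <= 2]
    print(phrase, "->", matches)
```

A real Levenshtein automaton (or an indexed structure like a BK-tree) would avoid scanning the whole dictionary for every candidate phrase, which is where the speedup would come from.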

Overall, this summer has been an incomparable learning experience, and I’m so grateful to the Charles Center and EXTREEMS-QED program for giving me the opportunity to pursue this research. Thanks for staying updated with my progress – hope you learned something too! As always, if you have any questions or suggestions, please comment below.

Comments

  1. Sooo cool, Miranda! I liked the comp sci pun: “I’ve determined an array of things I could improve upon.” Heh. But in all seriousness, I like the applicability and usefulness to the real world your project has. Would you say R or Python were more useful throughout your research?

  2. Miranda Elliott says:

    Thanks Yussre! Python ended up being more useful, as it was the language I wrote my final program in. R was helpful in the initial stages of my research when I was learning the basics of general text mining, but my work with it ended there. I’m sure everything I executed in Python could have also been accomplished in R, but I’m more familiar with Python from my course background at the College so I defaulted to that.