Connecting to GeoNames

During my fourth week of research, I investigated how to connect to GeoNames within my program.

GeoNames is the database of locations that AidData uses to geocode. It contains millions of geographic features, storing information about their alternate names (Cote d’Ivoire vs. Ivory Coast), latitude and longitude, country, feature class (city vs. district vs. river), and more.

Currently, I’m using the geonamescache module to connect to GeoNames. From its data, I’ve built a Python dictionary keyed by location name. Each value is a list of tuples (a list, because several locations can share the same name), and each tuple holds one location’s latitude and longitude, country code, and GeoName ID.
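To make that structure concrete, here is a minimal sketch of how such a dictionary could be built. It is not my exact code. The function name build_location_index is made up for illustration, the sample records are hand-written placeholders rather than real GeoNames rows, and the field names ("name", "latitude", "longitude", "countrycode", "geonameid") are my assumption about what geonamescache’s get_cities() returns.

```python
from collections import defaultdict


def build_location_index(cities):
    """Map each place name to a list of (lat, lon, country code, GeoName ID) tuples.

    `cities` is assumed to be a dict of per-city dicts, the shape that
    geonamescache's get_cities() is assumed to return.
    """
    index = defaultdict(list)
    for city in cities.values():
        index[city["name"]].append(
            (city["latitude"], city["longitude"], city["countrycode"], city["geonameid"])
        )
    return dict(index)


# Hand-written placeholder records (IDs and coordinates are illustrative only).
sample_cities = {
    "1": {"geonameid": 1, "name": "Alexandria", "latitude": 38.8, "longitude": -77.05, "countrycode": "US"},
    "2": {"geonameid": 2, "name": "Alexandria", "latitude": 31.2, "longitude": 29.92, "countrycode": "EG"},
    "3": {"geonameid": 3, "name": "Williamsburg", "latitude": 37.27, "longitude": -76.71, "countrycode": "US"},
}

index = build_location_index(sample_cities)

# With the real module installed, the call would presumably look like:
# import geonamescache
# index = build_location_index(geonamescache.GeonamesCache().get_cities())
```

Looking up index["Alexandria"] in this sample returns two tuples, one per country, which is exactly why the values need to be lists.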

Unfortunately, geonamescache covers only a subset of the complete database: it contains just city names. I need a more comprehensive source that has every location and, ideally, more information about each one, like alternate names and feature class. I tried downloading the file that contains every location and all of its associated data directly from GeoNames, but at 1.14 GB with 9,115,154 locations, using it has proven more complicated than I anticipated.

This file marked a milestone: my first direct interaction with the phenomenon of big data! It was too big even for Excel, which apparently can hold only a mere 1,048,576 rows. Every attempt to tame it ended in a memory error. First off, the file was a TXT, and it took me a while to realize that this didn’t mean a bunch of unorganized gibberish (I had previously only encountered data stored in CSV or XML files). So I broke it up into 10 separate files with Python code, converted them to CSV in Excel, and tried to iterate through them to recreate a more complete version of the dictionary described above. Memory error. Then I discovered how to read the original file in its tab-delimited text format with Python and, clinging to a faint hope that the earlier error came from some glitch in the TXT to CSV conversion, tried again to build the dictionary. Again, memory error.

I’ve been scouring Stack Overflow and other programmer forums and resources for two things: more advanced data structures that are better equipped to handle big data, and a way of storing the eventual structure so that it’s readable by my program without being massive. I haven’t found my answers yet.
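One direction I may try, sketched here under several assumptions rather than as a tested solution: read the file one line at a time and write only the fields I need into an on-disk SQLite database, so the whole dataset never has to sit in memory as a dictionary. The column positions follow the tab-delimited layout of the GeoNames dump (19 columns, no header row). The function names, the table name, and the sample rows are all made up for illustration.

```python
import csv
import sqlite3

# Column positions in the GeoNames dump (tab-delimited, no header row).
GEONAME_ID, NAME, LATITUDE, LONGITUDE = 0, 1, 4, 5
FEATURE_CLASS, COUNTRY_CODE = 6, 8


def iter_rows(lines):
    """Yield one trimmed tuple per input line, without holding the file in memory."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield (
            int(row[GEONAME_ID]), row[NAME],
            float(row[LATITUDE]), float(row[LONGITUDE]),
            row[FEATURE_CLASS], row[COUNTRY_CODE],
        )


def load_into_sqlite(lines, conn):
    """Stream rows into a 'places' table and index it by name for fast lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS places ("
        "geoname_id INTEGER PRIMARY KEY, name TEXT, latitude REAL, "
        "longitude REAL, feature_class TEXT, country_code TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO places VALUES (?, ?, ?, ?, ?, ?)", iter_rows(lines))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_places_name ON places (name)")
    conn.commit()


def lookup(conn, name):
    """Return (lat, lon, country code, GeoName ID) tuples for every place with this name."""
    query = (
        "SELECT latitude, longitude, country_code, geoname_id "
        "FROM places WHERE name = ? ORDER BY geoname_id"
    )
    return conn.execute(query, (name,)).fetchall()


# Placeholder rows in the 19-column layout (values are illustrative only).
sample_lines = [
    "\t".join(["1", "Springfield", "Springfield", "", "39.8", "-89.64", "P", "PPLA", "US"] + [""] * 10),
    "\t".join(["2", "Springfield", "Springfield", "", "37.22", "-93.3", "P", "PPL", "US"] + [""] * 10),
    "\t".join(["3", "Nile", "Nile", "", "30.17", "31.1", "H", "STM", "EG"] + [""] * 10),
]

conn = sqlite3.connect(":memory:")  # a file path here would keep the data on disk
load_into_sqlite(sample_lines, conn)
matches = lookup(conn, "Springfield")

# For the full download (file name assumed to be allCountries.txt):
# with open("allCountries.txt", encoding="utf-8", newline="") as f:
#     load_into_sqlite(f, sqlite3.connect("geonames.db"))
```

The appeal of this design is that a lookup by name still returns a list of tuples, the same shape as my current dictionary, while the database file lives on disk and can be reused between runs. Whether it is fast enough on all nine million rows is something I would still have to measure.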

In the meantime, I’ve created a program that makes use of the geographic data I do have control over, which I’ll talk about in my next post.

If you have any questions or suggestions (very open to these!), please comment below!