Text Mining with R

[Image: word cloud made with R]

Ever wondered how Google knows exactly what you’re looking for given just a couple of words?

The secret is text mining. Google compares its vast collection of webpages against your search terms and returns the pages that mention those terms most frequently and most prominently.


During my first week of research, I explored text mining, both conceptually and computationally, with the programming language R.

To build a background in the basics of text mining, I read a few chapters of Springer Publishing’s Text Mining: Predictive Methods for Analyzing Unstructured Information. Text mining has a variety of applications apart from fueling search engines and will be the process I use to extract location names and activity descriptions from foreign aid documents.

Text mining, which ultimately transforms text into analyzable data, involves building a matrix with rows representing documents and columns representing tokens, which can be single words or multi-word phrases (e.g., United States of America). This matrix can be filled with binary values (0 indicating that the document does not contain the token, 1 indicating that it does), the frequency with which the token occurs, a score based on the token’s frequency and weight (e.g., a higher weight if the token appears in the title rather than the body), or many other kinds of representative numeric data. Text mining also encompasses more advanced methods very relevant to my project, like named entity recognition, which identifies proper nouns such as locations, names, and organizations based on capitalization, relationships to other words derived from sentence parsing, and a range of other factors.
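To make the matrix idea concrete, here is a small R sketch using the tm package (the two example “documents” are invented for illustration, and weightBin is just one of several weighting options tm offers):

library(tm)

# Two made-up documents for illustration
toy <- Corpus(VectorSource(c("roads in Abidjan",
                             "roads and drainage in Abidjan and Cocody")))

# Frequency-weighted matrix (the default): each cell counts how often a token occurs
dtm_freq <- DocumentTermMatrix(toy)
inspect(dtm_freq)

# Binary-weighted matrix: 1 if the document contains the token, 0 if it does not
dtm_bin <- DocumentTermMatrix(toy, control = list(weighting = weightBin))
inspect(dtm_bin)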

I tested out basic text mining with R, a free, open-source language and environment for statistical computing and graphics. Because of my involvement in the EXTREEMS-QED research program, each morning of the week began with intensive R training, so with this newfound expertise I tackled the steps detailed by Graham Williams in Data Science with R: Text Mining on sample documents from a World Bank urban infrastructure project in Cote d’Ivoire. After converting these documents from PDF to TXT using Adobe Reader and removing non-ASCII characters with basic Python code, I used R to create a corpus of the project’s documents, make all words lowercase, remove punctuation and numbers, remove English stop words (the, and, is, etc.), build a document term matrix containing words and their frequencies throughout the documents, and create a word cloud of the most frequent words (the image above!).

As you can see, these words are not all significant for accomplishing my goal (Abidjan and Cocody are the only locations), so more advanced analysis is needed. However, this activity did convince me of R’s practicality as a tool for text mining and improved my understanding of the structure and features of foreign aid documents.
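For anyone curious, here is a minimal sketch of that pipeline in R using the tm and wordcloud packages (the folder name txt_docs is a placeholder for wherever the converted TXT files live):

library(tm)
library(wordcloud)

# Build a corpus from the converted TXT files (folder name is a placeholder)
docs <- Corpus(DirSource("txt_docs"))

# Clean the text: lowercase, strip punctuation and numbers, drop English stop words
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)

# Document term matrix: rows are documents, columns are words, cells are frequencies
dtm <- DocumentTermMatrix(docs)

# Word cloud of the most frequent words across the corpus
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100)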

Next, I plan on investigating named entity recognition further and testing it out on the same World Bank project corpus.

If you have any questions or suggestions, please comment below!

Comments

  1. swnordstrom says:

    I like working in R a lot, especially if I’m doing any sort of statistical analysis (and I’m trying to use it for simulations in my project). There are a ton of libraries/packages out there (some more developed than others) that you might want to look into; they could have some pre-written functions that could save you some time. If you can’t find one, maybe it would be a fun project to try to make one.

  2. dfmcpherson says:

    This is great! I was looking into doing a project like this this summer, and am actually a little jealous because you make it seem so cool. I’m also new to R, and have been using it to graph columns of data against each other. It’s interesting to see other ways it can be used.

  3. Darice Xue says:

    Speaking as someone who has also passed many an hour geocoding, I’m very excited to know that you’re working on this project! I was also surprised when I started as an intern that there wasn’t already an automated process to do this, but over time I saw that the process was less a science and more an art — oftentimes geocoders and arbitrators would have to make “judgment” calls in their geocoding. Classic examples: coding roads, 6 vs. 8, etc., determining beneficiaries vs. project locations… I’ve only taken CS through 241 and I’m not very familiar with R (yet), so creating something that could potentially encompass all the “judgment calls” seems mind-boggling to me — but if there’s a way to do it, I hope you find it! Thanks a bunch!

  4. wjevans01 says:

    Text mining sounds like a really useful skill to have in R. I’m working on a project that finds keywords in newspaper articles covering the Affordable Care Act. How did you go about getting the text to text mine? Were you able to read the text files into R and then run code to text mine? One last question: do you have any tips for getting better at R, any books to read, websites to work with, etc.?

  5. Miranda Elliott says:

    Thanks for all of your comments and interest!
    wjevans01 – to learn about text mining in R, definitely read this article: http://onepager.togaware.com/TextMiningO.pdf. I followed it step by step, and it walks you through the code and what it’s doing. I originally learned R by attending workshops at school, so unfortunately I can’t recommend a specific introductory resource from experience, but any time I have a question on syntax, packages, etc., Google is overflowing with results, so you don’t need to look far for information on learning and troubleshooting R. Because it’s free, it has tons of amateur and experienced users from various fields asking and answering questions.