Cleaning Data: the end of an era

Summer Research is in full swing! My project is starting to take shape and I am finally done cleaning my data!

To sort my data not only by state but also by individual newspaper, I have spent the last two weeks cleaning it by hand. I went through the data file for each state I am interested in (Florida, Texas, and Massachusetts) and made sure that every newspaper name is spelled correctly and written identically each time, with no extra spaces or parentheses. I also classified every article from these three states into a “section” category, such as national news, local news, letters to the editor, or opinion pieces. There was no reliable way to do this except by hand, which meant opening over 3,500 individual newspaper articles and reading enough of each to determine its classification. Needless to say, it has been a tedious two weeks, but I’m finally finished!
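
For anyone facing a similar cleanup, the mechanical part of the name standardization (extra spaces and parentheses) could in principle be scripted. The following is only a minimal sketch and not the process I followed: the function name `normalize_name` and the example newspaper names are made up for illustration, and it assumes that anything inside parentheses can simply be dropped. Spelling mistakes and section classification would still need a human reader.

```python
import re

def normalize_name(raw):
    """Return a newspaper name without parentheticals or extra spaces."""
    # Assumption: text in parentheses, e.g. "(FL)", is not part of the name
    no_parens = re.sub(r"\([^)]*\)", " ", raw)
    # Collapse runs of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", no_parens).strip()

# Hypothetical variants of one newspaper name
examples = ["Miami  Herald (FL)", " Miami Herald", "Miami Herald"]
cleaned = {normalize_name(name) for name in examples}
print(cleaned)
```

Collecting the cleaned names into a set makes it easy to check that the variants collapse to a single spelling.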

What does this mean going forward?

Well, now that every article has been categorized and the newspaper names have been standardized, I can move ahead with testing which newspapers use liberal and conservative keywords the most. I can also now see how the language of opinion pieces and letters to the editor differs from that of local or national news articles. I am continuing work on my literature review too. I have found several really interesting articles on a host of topics, from how newspaper endorsements affect voting patterns to how conservative and liberal newspapers affect their readerships differently.
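
As a rough idea of what the keyword test could look like, here is a minimal sketch that counts keyword hits per newspaper or per section. Everything in it is a placeholder: the two keyword sets, the article records, and the function name `keyword_counts` are invented for illustration and are not my actual lists, data, or method.

```python
import re
from collections import Counter, defaultdict

# Placeholder keyword sets; the real lists are not part of this post
LIBERAL_KEYWORDS = {"progressive"}
CONSERVATIVE_KEYWORDS = {"traditional"}

# Made-up records standing in for the cleaned article data
articles = [
    {"newspaper": "Example Gazette", "section": "opinion",
     "text": "A progressive plan and a traditional response."},
    {"newspaper": "Example Gazette", "section": "local news",
     "text": "Traditional festival returns."},
    {"newspaper": "Sample Times", "section": "letters to the editor",
     "text": "Progressive, progressive, progressive!"},
]

def keyword_counts(records, group_key):
    """Tally liberal and conservative keyword hits for each group."""
    counts = defaultdict(Counter)
    for rec in records:
        # Lowercase the text and split it into simple word tokens
        words = re.findall(r"[a-z']+", rec["text"].lower())
        group = rec[group_key]
        counts[group]["liberal"] += sum(w in LIBERAL_KEYWORDS for w in words)
        counts[group]["conservative"] += sum(w in CONSERVATIVE_KEYWORDS for w in words)
    return counts

by_paper = keyword_counts(articles, "newspaper")
by_section = keyword_counts(articles, "section")
print(dict(by_paper))
print(dict(by_section))
```

The same function groups by either field, which is why the standardized newspaper names and the section labels both matter: without them, one paper or section would be split across several groups.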

Lastly, you may notice a slight change from the abstract in my first post. I am no longer using California in my project; I have replaced it with Massachusetts. California had over 2,200 newspaper articles, more than double that of any other state, while Massachusetts had a much more manageable 1,100. Both states are considered to be strongly liberal, and the switch saves me time (and sanity).

That’s all for now. I will give another update once I have completed my literature review, and I might even have some concrete numbers about the newspapers’ word use!