Data Wrangling

Near the conception of this project, I was part of several discussions on “data workflow” and other such monikers that allude to the crazy, messy world of what we call data. In a time when information and “big data” are valuable and only recently tapped sources of knowledge, the ability to extract insights from that mess is a skill that makes everything else easier. So naturally, I was excited at the idea of learning some of the techniques used to make sense of it all over the course of my summer research project. After all, when practicing science and statistics, there is a lot of information to keep track of. Unfortunately, that means there are that many more ways for all of that information to get mixed up and jumbled around… Here are several important things I have learned so far about cleaning and managing data.

  1. Plan everything from the beginning. From a scientific and logistical standpoint, sitting down before you even have any data and thinking about what you need and how you are going to deal with it makes data collection much easier to actually carry out later, and it eliminates a lot of the interrupting “next steps” planning in the middle of your project. If everyone agrees at the start on which types of information go with which others, where files will be stored, how everything gets backed up, and, from a coding standpoint, what inputs and outputs to expect from each script, there is far less room for the confusion and inconsistencies that lead to missing data. For example, my project involves several processing steps where data moves from field computers to PCs to the cloud, where it is supposed to merge with our other data. Knowing the path your data should take makes it much easier to find when it isn’t where you expect.
  2. Organization is crucial. At every step, not just the end result, not just at the beginning. If you don’t stay organized and consistent about where you keep your data, how you name it, and what each set of it is for, it will quickly become an unintelligible mess that is no longer reliable and may not even be usable for gaining any insights. This is especially important if you are part of a collaboration or will eventually hand your data to someone else (which is often the case). If you have clear and consistent naming conventions and keep your files in designated folders clear of clutter and side projects, it becomes a simple task for someone else to step in and get right up to speed. If not… well, good luck finding your way through endless folders in search of the middle half of that spreadsheet that accidentally got deleted that one time. This goes for code as well. Making your variable or file names two-letter abbreviations might seem like a good idea now, but later, when you have to go back and fix a bug, have fun trying to remember what vg.Rdata was (there’s a quick sketch of what I mean right after this list).
  3. Human error is hard to get rid of. And it takes a long time to fix. Part of my project involved automating the process of cleaning up the field data and finding errors while we were still on site. I vastly underestimated the amount of processing that goes on in our heads when we look over a spreadsheet to see whether “everything makes sense.” I also underestimated the number of times we get it wrong. Writing code that corrects all this for you and hands you a list of things to go back and check was a challenge (the second sketch after this list gives a flavor of the kind of checks I mean). Countless things we take for granted when manually verifying data became apparent once I had to explicitly program them in, and I have a much better appreciation for the effort that goes into getting good data from the get-go.
  4. Minor changes and improvements keep coming. Even after you have already moved on to other steps in the project. Coding and working with data requires you to constantly think about the form your data is in and the intermediate steps between what you have and what you want. Sometimes you connect the dots, only to realize further down the line that you could skip one stage entirely if you went back and changed an earlier one. Other times, you don’t notice there was a problem until it makes itself known later, when you’re trying to do something else with the data. I see this back and forth as a kind of built-in quality control. As annoying and disruptive as it can be to your daily workflow, it improves the overall effectiveness and simplicity of whatever code you’re working with.
  5. Every manipulation has subtle consequences. And you must try to consider all of them. This is something the math side of my project has really introduced me to. Every little choice you make in how you manipulate your data has a consequence, and there are trade-offs everywhere: a loss of statistical power against an easy computation, a straightforward plot against a loss of data points, an extremely precise analysis against a high degree of abstraction (i.e., you have done a lot of manipulating to get there). It’s a delicate balance to meet all your statistical assumptions, keep as much of the information in your data as possible, and still get results that are easy to interpret. The struggle continues for me in this regard.
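
To illustrate the naming point from #2, here is a quick, purely made-up R sketch; none of these names are the real files or variables from my project, they just show how much a descriptive name helps when you come back months later.

```r
# Hypothetical names only, not my project's actual files or variables.

# Hard to decipher six months later:
sm <- read.csv("sm.csv")
save(sm, file = "sm.Rdata")

# Explains itself:
site_measurements_2015 <- read.csv("site_measurements_2015.csv")
save(site_measurements_2015, file = "site_measurements_2015_cleaned.Rdata")
```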
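
And to give a flavor of the error-checking from #3, here is a minimal R sketch assuming a hypothetical field data frame with plot_id, species, and height_cm columns (not the actual structure of our data, and the real checks are more involved). The idea is simply that the code hands you a short list of rows to go back and verify by hand.

```r
# Sketch of an automated field-data check on a made-up data frame with
# columns plot_id, species, and height_cm. Each element of the returned
# list is a set of suspicious rows to review manually.
check_field_data <- function(df) {
  problems <- list(
    missing_species    = df[is.na(df$species) | df$species == "", ],
    negative_height    = df[!is.na(df$height_cm) & df$height_cm < 0, ],
    implausible_height = df[!is.na(df$height_cm) & df$height_cm > 500, ],
    duplicate_entries  = df[duplicated(df[, c("plot_id", "species")]), ]
  )
  # Keep only the categories that actually flagged something
  problems[sapply(problems, nrow) > 0]
}

# e.g. flagged <- check_field_data(field_data), then work through each element of flagged
```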

This isn’t a very biological post compared to the rest of mine, but in my lab this is a lot of what we do. I find it really engaging to use math and computers to answer questions that at first seem pretty far removed from those areas. Hopefully this gives you a little bit of an idea of the power these tools give us, and encourages those who might not think them relevant to give them a try!

P.S. Here’s a nice thistle just to add some color to the page.