When you have to discard data

The last post discussed potential uncontrollable factors and how researchers might handle them. This one will briefly discuss how a researcher might decide when to discard data in order to preserve the validity of the data analysis.

One might claim that tampering with data at all should be frowned upon, since researchers might intentionally bend and twist the data to obtain a data set that supports their hypothesis. However, I would argue the opposite. If we define the validity of data as the degree to which the data answer the question or hypothesis at hand, then discarding certain data is sometimes necessary to keep the validity of the data set high. This is because, most of the time, potential uncontrollable factors do present themselves as actual uncontrollable factors and compromise the validity of the data.

Therefore, as long as the reasoning for discarding data is valid and clearly stated, I would even claim that a researcher is encouraged to discard data from specific sessions that compromise the validity of the entire data set by clouding and blurring the rest of it. For example, outliers could be discarded from a data set because they might skew the results of the data analysis, leading researchers to miss the real meaning of the data. Data from a session in which an accident happened might also be discarded, depending on the severity of the accident. An accident here means anything that appears to have a sufficiently negative impact on the validity of the data. A police officer entering the research lab while an experimental session is running is one example of such a situation: the sudden appearance of a police officer could act as a strong confounding variable, since it may elicit responses of varying degrees depending on the individual.
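To make the outlier case concrete, here is a minimal sketch of one common (hypothetical here, since the post does not prescribe a specific rule) cutoff, the 1.5 × IQR criterion. The example scores are invented for illustration, and the split into kept/discarded values supports the kind of reporting described later in the post:

```python
# Hypothetical sketch: flagging outliers with the common 1.5 * IQR rule.
# Neither the rule nor the example scores come from the post itself.

def iqr_bounds(values):
    """Return (lower, upper) cutoffs using the 1.5 * IQR criterion."""
    xs = sorted(values)
    n = len(xs)

    def quartile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (n - 1)
        lo, frac = int(pos), pos - int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + frac * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def discard_outliers(values):
    """Split values into (kept, discarded) so the discards can be reported."""
    low, high = iqr_bounds(values)
    kept = [v for v in values if low <= v <= high]
    discarded = [v for v in values if v < low or v > high]
    return kept, discarded

scores = [12, 14, 13, 15, 14, 13, 47, 12]  # 47 is an artificial outlier
kept, discarded = discard_outliers(scores)
print(kept, discarded)  # the 47 lands in the discarded list
```

Returning the discarded values explicitly, rather than silently dropping them, makes it easy to state in a report exactly how many observations were removed and why.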

A researcher should discard data when the validity of the data is compromised. However, the researcher ought to be able to clearly state in the final report how many participants’ data were discarded and the valid reason for doing so.


  1. mlmendonca says:

    I completely agree with your stance on discarding data. Although it may sometimes feel like we are purposefully searching for the perfect data, it’s important to have a data set that is representative of the research. I work in a physical chemistry lab and there are many times when my microscope scans from one day are much worse when compared to the rest of my work. Much like your research, there are some confounding variables such as light interference that can negatively affect the validity of a day’s work. Just because I choose to discard some scans doesn’t mean I am “cheating” in favor of a hypothesis, but rather I am compiling data that is only affected by the appropriate variables.

  2. kthuynh says:

    I completely agree. A scientist can learn a lot from bad data. It requires them to think more critically about the experiment, the procedure, and all the variables involved. Learning why a set of data is unusable and determining what changes need to be made in order to get more accurate data is as much of a learning experience as getting perfectly good data.

    Good luck with the rest of the data acquisition.