Creating the Processing Pipeline

The early stages of my research have consisted of building a pipeline of scripts to process the large amount of genomic data I have. Because the files I’m dealing with are so large (10 GB text files), none of the data cleaning and processing can feasibly be done by hand. I tried several strategies, and after weeks of failed attempts I was able to process the main file and break it down into much more reasonably sized files. Those smaller files still need further processing before I can use them to create a phylogeny.
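For illustration, here is a minimal Python sketch of one way to break a very large text file into smaller pieces without loading it all into memory. It assumes the file can safely be cut at line boundaries, which may not hold for every genomic format. The function name, chunk size, and file names are hypothetical and are not taken from the actual pipeline.

```python
import tempfile
from pathlib import Path


def split_by_lines(source_path, out_dir, lines_per_chunk=100_000):
    """Stream a text file line by line into numbered chunk files.

    Only one line is held in memory at a time, so the size of the
    input file does not matter. Returns the list of chunk paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    try:
        with open(source_path) as source:
            for line_number, line in enumerate(source):
                # Start a new chunk file every `lines_per_chunk` lines.
                if line_number % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    chunk_path = out_dir / f"chunk_{len(chunk_paths):05d}.txt"
                    chunk_paths.append(chunk_path)
                    out = open(chunk_path, "w")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths


# Small demonstration on a five-line stand-in for the real file.
with tempfile.TemporaryDirectory() as tmp:
    big_file = Path(tmp) / "big.txt"
    big_file.write_text("".join(f"line {i}\n" for i in range(5)))
    chunks = split_by_lines(big_file, Path(tmp) / "chunks", lines_per_chunk=2)
    chunk_names = [path.name for path in chunks]
    rejoined = "".join(path.read_text() for path in chunks)
```

Splitting on a fixed line count is the simplest choice. If each record spans several lines, the split would instead need to happen at record boundaries.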

I am now trying to write simple scripts that pull genes from a list and copy their DNA sequences into a file, along with the gene ID and the ID of the individual, to streamline the process of creating phylogenetic trees. So far the biggest problem has been automatically extracting the data I want from the text files and converting it to FASTA format. I’m hoping to find a solution quickly.
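As a rough sketch of the extraction step described above, the Python below filters records against a gene list and writes the matches as FASTA. The input layout is an assumption: one tab-separated record per line, in the order gene ID, individual ID, sequence. The real files may well be laid out differently. The function names and the `gene|individual` header format are also hypothetical.

```python
import io


def load_gene_ids(handle):
    """Return the set of gene IDs listed one per line, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}


def read_records(handle):
    """Yield (gene_id, individual_id, sequence) from tab-separated lines.

    Assumed layout: gene ID, individual ID, sequence (hypothetical).
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield fields[0], fields[1], fields[2]


def write_fasta(records, wanted, out, width=60):
    """Write records whose gene ID is in `wanted` as FASTA.

    Each entry gets a '>gene|individual' header, followed by the
    sequence wrapped at `width` characters. Returns the number of
    entries written.
    """
    written = 0
    for gene_id, individual_id, sequence in records:
        if gene_id not in wanted:
            continue
        out.write(f">{gene_id}|{individual_id}\n")
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")
        written += 1
    return written


# Small in-memory demonstration; real use would pass open file handles.
gene_list = io.StringIO("geneA\ngeneC\n")
table = io.StringIO(
    "geneA\tplant1\tACGTACGT\n"
    "geneB\tplant1\tTTTTGGGG\n"
    "geneC\tplant2\tGGCCAATT\n"
)
fasta = io.StringIO()
count = write_fasta(read_records(table), load_gene_ids(gene_list), fasta, width=4)
print(fasta.getvalue())
```

Because the records are read one line at a time and the gene list is held as a set, this approach should scale to large input files as long as each individual line fits in memory.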

The next part of my research involves finding genes known to influence the petal spot phenotype. I need a list of these genes so that I know which sequences to pull out of the larger files. I’ve been lucky here: my advisor has pulled a list of them for me to use for preliminary work while we work out how to fix the processing method.

If you’re interested in seeing what a petal spot looks like, I’ve included a picture of some Mimulus flowers, where you can see some of the distinctive petal patterning that the Puzey lab is researching.