Week 2: A glimpse of TE evolution from how completely the query TE sequences match the reference sequences

In my week 1 post, I summarized how I obtained the raw data for some key features of the Mimulus TE sequences. I aligned the query sequences with the reference sequences using RepeatMasker. From the data in the output files, I wrote Python programs to build tables and plot frequency histograms for each TE family. In the second week, I moved on to the matching ratio of the sequences in each TE family, that is, how much of the corresponding sequence in the reference file each detected region of the query sequences matched. To get it, I first calculated the length of each transposon in the reference file. Next, I assigned each reference length to the TEs I had found in the plant genome file, using the TE name as the index. Then I built a CSV file containing the name, the family, the number of bases matched, and the matching ratio of every single TE in the plant genome file. This file serves as the data frame for generating frequency histograms of the matching ratio for every TE family. I carried out all of these steps with Python programs I wrote myself.
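The steps above can be sketched roughly as follows. This is a minimal illustration, not the original scripts. It assumes the reference file is FASTA with the TE name as the first token of each header, that each hit records 1-based inclusive start and end positions within the reference TE, and that the matching ratio is bases matched divided by reference length. The function names, the example TE names, and the CSV column names are all hypothetical.

```python
import csv
import io


def reference_lengths(fasta_text):
    """Map each reference TE name to the length of its sequence."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the TE name is the first whitespace-delimited token.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def matching_ratios(hits, ref_len):
    """Build one row per detected TE.

    hits: iterable of (name, family, start, end), where start and end are
    assumed to be 1-based inclusive positions within the reference TE.
    """
    rows = []
    for name, family, start, end in hits:
        matched = end - start + 1
        rows.append({
            "name": name,
            "family": family,
            "bases_matched": matched,
            # Look up the reference length by TE name.
            "matching_ratio": matched / ref_len[name],
        })
    return rows


def write_rows(rows, handle):
    """Write the rows as CSV with a header line."""
    fields = ["name", "family", "bases_matched", "matching_ratio"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    ref = reference_lengths(">TE_A\nACGT\nACGT\n>TE_B\nACG\n")
    rows = matching_ratios([("TE_A", "DNA", 1, 4), ("TE_B", "LTR", 1, 3)], ref)
    out = io.StringIO()
    write_rows(rows, out)
    print(out.getvalue())
```

Keying the reference lengths by TE name keeps the lookup to a single dictionary access per hit. A hit whose name is missing from the reference file raises a `KeyError` instead of silently producing a wrong ratio.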

[Read more…]

Week 1: Find where TEs are in the M. guttatus genome and generate frequency histograms for important features of DNA and LTR transposons

In the first week of my research, I started by locating the TE sequences in the Mimulus guttatus genome file provided by the Puzey Lab. To do that, I used RepeatMasker, a command-line program that runs in the terminal. It aligned the M. guttatus genome sequences with the reference file, which was also provided by the Puzey Lab, and found the loci of TEs in the plant genome. I first ran the program on the whole genome, but it encountered an issue and failed. I then divided the plant genome into 14 separate parts and ran the program on each of them. This gave me output files containing a range of data for each matching repeat. To narrow these down to the information useful for my research, I focused on the actual TEs, which fall into two main categories: DNA transposons and LTR transposons. Using Python programs, I extracted the percentages of insertion, deletion, and divergence for those TEs in the first piece of TE sequences, named scaffold_1. I then made frequency histograms of the three percentages, and of TE size as well, separately for the two types of TEs. Once I had successfully generated the plots for scaffold_1, I scaled up to do the same for the whole genome. The histograms for the entire plant genome are as follows:

[Read more…]

Research on preferential TE insertion and/or retention

Transposons (TEs) are abundant "jumping" pieces of DNA that are inserted and/or retained extensively along genomes. My research project aims to understand whether TEs preferentially insert into specific locations, and how this preference relates to the fitness of individuals. To conduct the research, I will analyze a database of sample TEs from the Puzey Lab. I will write code to quantify the correlation between the density and the age of TEs within regions at various distances from genes. That correlation would indicate whether TEs are constantly removed from regions near genes by purifying selection and left alone when they are far from coding regions. If such a preference in insertion and/or retention is confirmed, I can infer that TEs carry a potential negative fitness cost for individuals, because selection continuously acts against them.
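One possible shape for the planned analysis is sketched below. It is only an illustration under stated assumptions, not the project's actual method. Each TE is assumed to come with a distance to its nearest gene and some numeric age proxy (which proxy to use is left open here), TEs are grouped into equal-width distance bins, and the per-bin TE count stands in for density. The function names and the bin width are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def summarize_bins(tes, bin_size):
    """Group TEs into distance bins.

    tes: iterable of (distance_to_nearest_gene_bp, age_proxy).
    Returns a sorted list of (bin_start_bp, te_count, mean_age_proxy).
    """
    ages_by_bin = defaultdict(list)
    for distance, age in tes:
        bin_start = (distance // bin_size) * bin_size
        ages_by_bin[bin_start].append(age)
    return [
        (start, len(ages), sum(ages) / len(ages))
        for start, ages in sorted(ages_by_bin.items())
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up values, for illustration only.
    tes = [(100, 2.0), (900, 4.0), (1500, 10.0), (2500, 12.0), (2600, 14.0)]
    bins = summarize_bins(tes, bin_size=1000)
    counts = [count for _, count, _ in bins]
    mean_ages = [age for _, _, age in bins]
    print(bins)
    print(pearson(counts, mean_ages))
```

Equal-width bins let the raw count serve as a density measure. If the regions differed in size, the count would need to be divided by the region length before correlating.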