Week 2: A glimpse of TE evolution from how closely the query TE sequences match the reference sequences

In the week 1 post, I summarized how I obtained the raw data for some key features of the Mimulus TE sequences. I aligned the query sequences with the reference genome sequences using RepeatMasker. Based on the data in the output files, I wrote Python programs to build tables and plot frequency histograms for each of the TE families. In the second week, I moved on to the matching ratio of the sequences in each TE family, that is, the ratio at which a detected region of the query sequences matched the corresponding sequence in the reference file. To get it, I first calculated the length of each transposon in the reference file. I then assigned that reference length to each TE I found in the plant genome file, using the TE name as the index. Finally, I built a CSV file containing the name, the family, the number of bases matched, and the matching ratio of every single TE in the plant genome file. This file serves as the data frame for generating the matching-ratio frequency histograms of every TE family. All of these steps were carried out with Python programs I wrote myself.
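The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual programs used for this post: it assumes the reference TEs are in FASTA format, that each hit has already been reduced to a (name, family, bases matched) tuple, and that the matching ratio is the number of bases matched divided by the reference length. The function names, the CSV column names, and the toy data are all hypothetical.

```python
import csv
import io

# Hypothetical column names for the output table.
FIELDS = ["name", "family", "bases_matched", "matching_ratio"]


def fasta_lengths(lines):
    """Return {sequence name: length} from FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first token of the header is the TE name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def matching_ratios(hits, ref_lengths):
    """Attach the reference length to each hit by name and compute the ratio."""
    rows = []
    for name, family, matched in hits:
        ref_len = ref_lengths.get(name)
        if not ref_len:
            continue  # no reference sequence with this name: skip the hit
        rows.append({
            "name": name,
            "family": family,
            "bases_matched": matched,
            "matching_ratio": round(matched / ref_len, 4),
        })
    return rows


def write_csv(rows, handle):
    """Write the rows as CSV to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Toy data standing in for the reference file and the RepeatMasker hits.
reference = [">TE_A", "ACGTACGTAC", "ACGTACGTAC", ">TE_B", "ACGTA"]
hits = [("TE_A", "DNA/hAT", 18), ("TE_A", "DNA/hAT", 5), ("TE_B", "LTR/Copia", 5)]

ref_lengths = fasta_lengths(reference)
rows = matching_ratios(hits, ref_lengths)
buffer = io.StringIO()
write_csv(rows, buffer)
```

Looking up the reference length by name in a dictionary keeps the join simple, at the cost of silently dropping hits whose name has no entry in the reference file.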

The histograms below show the matching ratio for all transposons as a whole and for each family.

Figure 1: the frequency histogram of the matching ratio of all TEs from the Mimulus genome file to the reference TE sequences.

Figure 2: the frequency histograms of the matching ratio of DNA transposons from 6 families to the reference TE sequences.

Figure 3: the frequency histograms of the matching ratio of LTR transposons from 3 families to the reference TE sequences.

The histogram of the matching ratio of all TEs was drawn from a sample covering 9% of the entire genome file. The histograms for the individual families were drawn from the full data of each TE subclass.
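For reference, the counting behind such a frequency histogram can be sketched without any plotting library. The plotting code used for the figures is not shown in this post, so the snippet below only computes bin counts, assuming matching ratios between 0 and 1 and ten equal-width bins. The function name and the bin count are hypothetical.

```python
def histogram_counts(ratios, n_bins=10):
    """Count ratios in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for ratio in ratios:
        if not 0.0 <= ratio <= 1.0:
            continue  # assumption: values outside [0, 1] are ignored
        # A ratio of exactly 1.0 is placed in the last bin.
        index = min(int(ratio * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy matching ratios, not real data.
example_counts = histogram_counts([0.05, 0.15, 0.5, 0.95, 1.0])
```

The resulting list can be passed to any bar-plot function to draw the histogram.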

In addition, I generated plots for LINE and SINE transposons, as I did for DNA and LTR transposons in week 1. These histograms cover the percent divergence, percent insertion, percent deletion, and the size of the TEs.

The histograms are shown below:

Figure 4: frequency histograms of the percent divergence, percent insertion, percent deletion, and TE size for LINE transposons.

Figure 5: frequency histograms of the percent divergence, percent insertion, percent deletion, and TE size for SINE transposons.

The last thing I did this week was to count the intact TEs and the TE fragments in the query genome file. Specifically, I treated TEs with a matching ratio higher than 90% as intact and all others as fragments. The 90% threshold can be adjusted later if needed. The distinction matters because intact TEs are likely to be younger and fragments are likely to be older. Younger TEs come from more recent events in TE evolution, such as an explosion of TE activity, while older TEs come from more ancient events. So with this step I am starting to touch on determining the ages of TEs and their evolutionary history. The data showed that intact TEs were far fewer than TE fragments. This was expected, because TE insertions are considered to happen at a relatively low frequency, while all the older TEs keep accumulating changes in their sequences. However, the total number of TEs was much larger than it should be. The reason is that the sequence alignment performed by RepeatMasker does not consolidate pieces that were split apart at some point but originally came from one single TE.
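The intact-versus-fragment rule can be written down as a small sketch. It assumes each TE is represented by a (family, matching ratio) pair with the ratio expressed as a fraction, and it uses the 90% cutoff described above as an adjustable default. The names and the toy records are hypothetical, not the code used for this post.

```python
from collections import Counter

# Adjustable cutoff: ratios strictly above it count as intact.
INTACT_THRESHOLD = 0.90


def classify(ratio, threshold=INTACT_THRESHOLD):
    """Label a TE as 'intact' or 'fragment' from its matching ratio."""
    return "intact" if ratio > threshold else "fragment"


def count_by_status(records, threshold=INTACT_THRESHOLD):
    """Count TEs per (family, status) pair."""
    counts = Counter()
    for family, ratio in records:
        counts[(family, classify(ratio, threshold))] += 1
    return counts


# Toy records, not real data.
records = [("DNA", 0.95), ("DNA", 0.40), ("LTR", 0.90), ("LTR", 0.99)]
status_counts = count_by_status(records)
```

Because the comparison is strict, a TE with a ratio of exactly 90% is counted as a fragment. Passing a different `threshold` shows how sensitive the counts are to the cutoff.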

Thus, the next step is to find software that can consolidate TEs, and then to adjust the matching ratio based on the new and more reasonable data.