Week 1: Find where TEs are in the M. guttatus genome and generate frequency histograms for important features of DNA and LTR transposons

In the first week of my research, I started by locating the TE sequences in the Mimulus guttatus genome file provided by the Puzey Lab. To do that, I used RepeatMasker, a command-line program that runs in the terminal. It aligned the M. guttatus genome sequences with the reference genome file, which was also provided by the Puzey Lab, and reported the loci of TEs in the plant genome. I first ran the program on the whole genome, but it encountered an issue and failed. I then divided the plant genome into 14 separate parts and ran the program on each of them. This gave me output containing a set of statistics for each matching repeat. To narrow this down to the information useful for my research, I focused on the actual TEs, which fall into two main categories: DNA transposons and LTR transposons. Using Python programs, I extracted the percentages of insertion, deletion, and divergence of those TEs in the first piece of TE sequences, named scaffold_1. I then made frequency histograms of the three percentages, as well as of TE size, for the two types of TEs separately. Once I had successfully generated the plots for scaffold_1, I scaled up to do the same for the whole genome. The histograms for the entire plant genome are as follows:

Figure 1: frequency histograms of percentage of divergence, insertion, and deletion, and the size of DNA transposons in the whole genome.

Figure 2: frequency histograms of percentage of divergence, insertion, and deletion, and the size of LTR transposons in the whole genome.
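For readers who want to reproduce the extraction step described above, here is a minimal sketch of how it might be done. It is not the script I actually used. It assumes the standard whitespace-separated layout of a RepeatMasker `.out` file (score, percent divergence, percent deletion, percent insertion, query name, query begin, query end, and so on, with the repeat class/family in the eleventh column). The function names `parse_repeatmasker_out` and `bin_counts` are made up for this illustration.

```python
from collections import Counter

def parse_repeatmasker_out(lines):
    """Collect DNA and LTR transposon hits from RepeatMasker .out lines.

    Assumes the usual column order of the .out table; other repeat
    classes (simple repeats, low complexity, etc.) are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        # Data rows start with a numeric score; header and blank lines do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        te_class = fields[10].split("/")[0]  # e.g. "DNA/hAT" -> "DNA"
        if te_class not in ("DNA", "LTR"):
            continue
        begin, end = int(fields[5]), int(fields[6])
        records.append({
            "scaffold": fields[4],
            "class": te_class,
            "family": fields[10],
            "div": float(fields[1]),
            "del": float(fields[2]),
            "ins": float(fields[3]),
            "size": end - begin + 1,  # length of the hit in the genome
        })
    return records

def bin_counts(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter()
    for value in values:
        counts[int(value // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))
```

Calling `parse_repeatmasker_out(open(path))` on each output file and then passing, for example, the `"div"` values of the DNA records to `bin_counts` gives the counts behind a frequency histogram. A plotting library such as matplotlib can also draw the histogram directly from the raw values.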

I then analyzed the subclasses of the DNA and LTR transposons further, this time focusing only on the percentage of divergence. The histograms are as follows:

Figure 3: frequency histograms of percentages of divergence of 6 types of DNA transposons in the whole genome.

Figure 4: frequency histograms of percentage of divergence of 3 types of LTR transposons in the whole genome.
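The per-subclass breakdown can be sketched in a similar way. The snippet below is only an illustration under stated assumptions: it takes (class/family, divergence) pairs, where the class/family label is the one RepeatMasker reports (such as `DNA/hAT` or `LTR/Gypsy`, used here purely as example labels), and groups the divergence values by that label. The function names `divergence_by_subclass` and `summarize` are hypothetical.

```python
from collections import defaultdict
from statistics import median

def divergence_by_subclass(pairs):
    """Group percent-divergence values by their class/family label."""
    groups = defaultdict(list)
    for class_family, divergence in pairs:
        groups[class_family].append(divergence)
    return dict(groups)

def summarize(groups):
    """Return (number of hits, median divergence) for each subclass."""
    return {
        name: (len(values), median(values))
        for name, values in sorted(groups.items())
    }
```

Each list in the grouped result holds the values for one histogram panel, and the summary gives a quick numerical check on what the panels show.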

The next step is to examine the patterns in the histograms and to work out the proportion of each TE sequence in the plant genome that matches the reference genome sequence.
