Summary: data extraction and collection of M. guttatus transposons

I have completed seven weeks of research this summer. My goal was to determine whether TE insertion and/or retention is preferential. To do this, I worked in the Puzey lab, where I obtained the raw genome file of M. guttatus and a reference genome file. I aligned the raw data file with the reference genome file using RepeatMasker to locate and extract the transposons. From there, I analyzed the TEs and composed several tables and plots to make sense of the data. I first made density histograms of DNA and LTR transposons separately. The parameters I was interested in were the percentages of divergence, insertion, and deletion; the size of the TEs was also taken into account. I found that TEs gradually accumulate mutations, which makes them more divergent from their original sequences. This pattern also held for each family of DNA and LTR transposons.


Week 7: Realigning TEs with the reference sequence and finding the mutation rate

Last week, I modified the density plots by adding several horizontal lines. The lines marked the average density of transposons across the chromosome and one and two standard deviations above and below that average. The aim was to find out whether there was biased insertion and/or retention of TEs in the chromosomes. Dr. Gregory Conradi Smith sent me a MATLAB script with example code that performs a statistical test for biased insertion of TEs. I was still learning how to modify the code to suit my data.
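The idea behind the horizontal lines can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: the function names, the fixed window size, and the 0-based start positions are all assumptions made for the example.

```python
from statistics import mean, pstdev


def window_counts(starts, chrom_length, window_size):
    """Count TE start positions (assumed 0-based) in fixed-size windows."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for s in starts:
        counts[s // window_size] += 1
    return counts


def sd_bands(counts, k_values=(1, 2)):
    """Return the mean count and the (low, high) band for each k standard deviations."""
    mu = mean(counts)
    sd = pstdev(counts)  # population SD over all windows of the chromosome
    return mu, {k: (mu - k * sd, mu + k * sd) for k in k_values}


def outlier_windows(counts, k=2):
    """Indices of windows whose count lies outside mean +/- k standard deviations."""
    _, bands = sd_bands(counts, (k,))
    low, high = bands[k]
    return [i for i, c in enumerate(counts) if c < low or c > high]


# Hypothetical toy data: one TE in each of nine windows, twenty in the last one.
starts = [5, 15, 25, 35, 45, 55, 65, 75, 85] + [90] * 20
counts = window_counts(starts, chrom_length=100, window_size=10)
print(outlier_windows(counts))
```

A window falling outside the two-standard-deviation band is only a hint of biased insertion or retention; it does not replace a proper statistical test.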


Week 6: Interpreting the distance matrix and trying to detect biased insertion/retention of TEs

Last week, I fixed the bug in the code and regenerated the density plots of TEs. I also regrouped the transposons by family and ran a program called emma on each family, which gave me an alignment file and a tree file. I then used PHYLIP to turn the tree file into a visualized tree. This week, I spent some time trying to find a way to make inferences from the phylogenetic trees. However, after talking with Dr. Puzey, we decided to use distance matrices instead of the trees. It was hard to extract accurate information from a tree graph, especially when the trees were built from samples of the data. I therefore ran distmat on the alignment files that emma produced for each TE family, and then ran transform_matrix.py to get CSV files of the reformatted distance matrices. This turned each matrix into a data frame. I imported the data frames into Jupyter Notebook and made distance histograms for each transposon family. Note that the distances ranged from 0 to 100 because they were numbers of substitutions per 100 bases, so I had to divide them by 100 to get the number of substitutions per base. Each value shows how divergent a given TE is from the others.
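The conversion from substitutions per 100 bases to substitutions per base can be sketched as follows. This is a minimal illustration and not the contents of transform_matrix.py; the function names, the nested-list input, and the choice of ten bins over the range 0 to 1 are assumptions for the example.

```python
def pairwise_distances_per_base(matrix_per_100):
    """Take a symmetric matrix of substitutions per 100 bases and return
    each unique pairwise distance as substitutions per base."""
    n = len(matrix_per_100)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so each pair counts once
            distances.append(matrix_per_100[i][j] / 100.0)
    return distances


def histogram_counts(values, n_bins=10):
    """Bin values into n_bins equal bins over [0, 1].
    Values at or above 1 are placed in the last bin (an assumption)."""
    counts = [0] * n_bins
    for v in values:
        index = min(int(v * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical 3 x 3 distance matrix for three TE copies of one family.
matrix = [
    [0, 5, 25],
    [5, 0, 50],
    [25, 50, 0],
]
per_base = pairwise_distances_per_base(matrix)
print(per_base)
print(histogram_counts(per_base))
```

Only the upper triangle is used because the matrix is symmetric and the diagonal compares each TE with itself.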


Week 5: Modifying the density plots and grouping the TEs by family when conducting the alignment

Last week, I aligned the transposons using a series of programs and obtained an alignment file and a distance matrix for each DNA subfamily. This week, I continued on this track but changed how the TEs were grouped. Instead of grouping them by subfamily, which is a very small unit, I regrouped them by family. There were only six major families of DNA transposons. With this grouping, we could see a phylogenetic hierarchy within each family.


Week 4: Aligning the TE sequences together and finding the distance matrices within a subfamily

Last week, I removed the overlapping TE sequences from the file containing all the consolidated transposon data. I then revised the length ratio of each transposon and separated the transposons into intact ones and fragmented ones. Finally, I generated density plots for all DNA and LTR transposons, and for each of their TE families as well. I got some really nice plots indicating a possible relationship between the insertion and/or retention of intact TEs and TE fragments. These plots needed further analysis.
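The split into intact and fragmented TEs can be sketched like this. It is a rough illustration under stated assumptions: the length ratio is taken to be the copy length divided by the length of its consensus sequence, coordinates are 1-based and inclusive, and the 0.8 cutoff is a hypothetical value, not the one used in the project.

```python
def length_ratio(start, end, consensus_length):
    """Length of a TE copy (1-based, inclusive coordinates) relative to its consensus."""
    return (end - start + 1) / consensus_length


def split_by_length_ratio(records, threshold=0.8):
    """Separate TE records into intact and fragmented lists.
    Each record is (name, start, end, consensus_length).
    The threshold is a hypothetical cutoff chosen for illustration."""
    intact, fragmented = [], []
    for name, start, end, consensus_length in records:
        if length_ratio(start, end, consensus_length) >= threshold:
            intact.append(name)
        else:
            fragmented.append(name)
    return intact, fragmented


# Hypothetical records: one full-length copy and one short fragment.
records = [
    ("te1", 100, 1099, 1000),
    ("te2", 1, 200, 1000),
]
intact, fragmented = split_by_length_ratio(records)
print(intact, fragmented)
```

Density plots can then be drawn separately for the two lists to compare where intact copies and fragments sit along the chromosome.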
