Week 4: Aligning TE sequences and computing distance matrices within each subfamily

Last week I removed the overlaps between TE sequences in the file containing all the consolidated transposon data. I then revised the length ratio of each transposon and separated the transposons into intact and fragmented ones. Finally, I generated density plots for all DNA and LTR elements, and for each of their TE families as well. Some of the plots indicate a possible relationship between the insertion and/or retention of intact TEs and TE fragments, but they still need further analysis.

This week, I aligned the transposons within each subfamily to see how different they are from one another. These differences can be collected as distance matrices, which will be the foundation for a phylogenetic tree of TEs.
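To make "how different" concrete, here is a minimal sketch of one common pairwise measure, the uncorrected p-distance (the fraction of compared aligned positions that differ). This is only an illustration: the function name `p_distance` is hypothetical, skipping gap positions is an assumption, and it is not necessarily the formula or scaling used by the tool that produced my matrices.

```python
def p_distance(a, b, gap="-"):
    """Fraction of mismatching positions between two aligned sequences.

    Assumption: positions where either sequence has a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # ignore gapped columns
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return mismatches / compared
```

Computing this value for every pair of sequences in a subfamily gives one symmetric matrix per subfamily.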

To do that, I first used a program named bedtools to extract all the consolidated TE sequences from the original genome file. Using their start and end positions, more than 200,000 sequences were extracted and written to one file. I wrote a Python script to split them and save each sequence as its own file, which produced hundreds of thousands of single-sequence files. Then I gathered the sequences that belonged to the same subfamily and put them together in one combined file per subfamily. After that, I ran each combined file through an alignment tool called MUSCLE to find the portions of the sequences that matched each other. One thing worth noting is that some of the TE subfamilies failed to pass through MUSCLE; the most common error was a bus error. I will do some research on that and try to fix it. Once the aligned files were available, I used a program named distmat to generate the distance matrix for each of the TE subfamilies.
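The splitting and grouping steps can be sketched roughly as follows. This is not my actual script, just a minimal illustration. It assumes each FASTA header starts with the subfamily name followed by `::` (for example `CACTA_na2a::chr1:10-20`); that header layout, the separator, and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_subfamily(text, sep="::"):
    """Group records by the part of the header before `sep`.

    Assumption: the subfamily name is the first field of the header.
    """
    groups = defaultdict(list)
    for header, seq in parse_fasta(text):
        subfamily = header.split(sep, 1)[0]
        groups[subfamily].append((header, seq))
    return dict(groups)


def to_fasta(records):
    """Serialize (header, sequence) pairs back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

Each value of the returned dictionary can be written out with `to_fasta` to give one combined file per subfamily, ready to hand to the aligner.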

Here is one example of what a distance matrix looks like.

Figure 1: A distance matrix of CACTA na2a (subfamily) of DNA CACTA (family).

Finally, I wrote a Python script to convert the distance matrices into CSV files, which have a more readable format.
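A conversion like this could look roughly like the sketch below. It does not parse the distance program's own output file. Instead it assumes the sequence labels and the upper-triangle values (diagonal included) have already been read into Python lists, mirrors them into a full symmetric matrix, and writes CSV text. The function names and the two-decimal formatting are choices made for this example.

```python
import csv
import io


def upper_to_full(upper):
    """Mirror an upper-triangular matrix (row i holds columns i..n-1)."""
    n = len(upper)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        if len(row) != n - i:
            raise ValueError(f"row {i} should have {n - i} values")
        for offset, value in enumerate(row):
            j = i + offset
            full[i][j] = value
            full[j][i] = value  # distances are symmetric
    return full


def matrix_to_csv(labels, full):
    """Return CSV text with labels as both the header row and first column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + list(labels))
    for label, row in zip(labels, full):
        writer.writerow([label] + [f"{v:.2f}" for v in row])
    return buf.getvalue()
```

Writing the returned text to a `.csv` file gives a table that opens directly in a spreadsheet, with every pairwise distance visible in both halves of the matrix.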

The next step is to find a program to visualize the phylogenetic tree.