Introduction

The shotgun assembly method takes a divide-and-conquer approach to sequencing large genomes. Multiple copies of the genome are randomly sheared into thousands of fragments. Fragments that come from the same part of the DNA carry the same information, so their sequences partly overlap. Assembly software uses these overlaps to reconstruct the entire genome, as illustrated in figure 1. Modern assemblers typically maintain a k-mer hash table for overlap detection. A k-mer is a sequence of k consecutive bases. Two fragments are said to overlap if they share k-mers across the entire length of the overlap. This is a powerful technique and has been used to sequence many large and complex genomes.
 Figure 1: Shotgun assembly process
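
The k-mer hash table idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of any particular assembler: the function names (build_kmer_table, candidate_overlaps, is_full_overlap) and the toy fragments are assumptions made for this example, and it handles exact matches on one strand only.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def build_kmer_table(fragments, k):
    """Map each k-mer to the list of (fragment id, offset) where it occurs."""
    table = defaultdict(list)
    for frag_id, seq in enumerate(fragments):
        for offset, kmer in kmers(seq, k):
            table[kmer].append((frag_id, offset))
    return table

def candidate_overlaps(fragments, k):
    """Count shared k-mers per (a, b, shift), for fragment ids a < b.

    shift is the k-mer offset in fragment a minus its offset in fragment b,
    so all hits from one gap-free alignment share the same shift.
    """
    table = build_kmer_table(fragments, k)
    hits = defaultdict(int)
    for occurrences in table.values():
        for a, off_a in occurrences:
            for b, off_b in occurrences:
                if a < b:
                    hits[(a, b, off_a - off_b)] += 1
    return hits

def is_full_overlap(len_a, len_b, shift, count, k):
    """True if the shared k-mers cover the whole overlap implied by shift."""
    start = max(0, shift)            # overlap start, in a's coordinates
    end = min(len_a, shift + len_b)  # overlap end, in a's coordinates
    overlap_len = end - start
    return overlap_len >= k and count == overlap_len - k + 1

# Toy fragments (hypothetical): the last 5 bases of the first
# equal the first 5 bases of the second.
fragments = ["ACGTACGGA", "ACGGATTC"]
hits = candidate_overlaps(fragments, k=4)
```

Here the two fragments share the k-mers ACGG and CGGA at a shift of 4, which is every k-mer in the 5-base overlap, so is_full_overlap accepts the pair. A real assembler would also need to handle reverse complements and sequencing errors.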

All large genomes contain regions that are repeated multiple times. These repeat regions cause problems for shotgun assembly because fragments from different copies of a repeat all appear to overlap one another, which can cause the assembler to misassemble the genome. One such case is shown in figure 2, which shows two copies of a single repeat region. The fragments that lie entirely inside the repeat region are mapped into 'contig 1' (indicated by the brown lines), while the boundary fragments of R2 are inconsistent with the layout of the rest of the assembly. The assembler is therefore forced to split the assembly into two contigs.

Figure 2: Assembly in the presence of repeats in the genome
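
A small Python example shows why repeats confuse overlap detection. The toy genome, the repeat sequence and the helper suffix_prefix_overlap are all made up for this illustration. A fragment that ends inside the first copy of the repeat overlaps its true neighbour and a fragment from the second copy equally well.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

# Hypothetical genome: unique + repeat + unique + repeat + unique
repeat = "GATTACAGATTC"
genome = "AACCGGTT" + repeat + "TTGGCCAATA" + repeat + "CGCGTATA"

frag_a = genome[2:18]      # starts in unique sequence, ends inside copy 1
true_next = genome[10:24]  # starts inside copy 1, runs into the middle unique region
false_next = genome[32:46] # starts inside copy 2, runs into the last unique region

true_len = suffix_prefix_overlap(frag_a, true_next, min_len=5)
false_len = suffix_prefix_overlap(frag_a, false_next, min_len=5)
```

Both overlaps are 8 bases long, so from the overlap alone the assembler cannot tell which fragment really follows frag_a in the genome.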

 

In some cases the assembler may rearrange the entire genome without detecting any error, as shown in figure 3. Regions labelled 'R' are repeat regions, while all others are unique. In such a case, all overlaps in the incorrect layout are consistent, so the misassembly goes undetected.

 

 

Figure 3: Rearranged genome

Aims

The aim of the project was to investigate the advantages of our k-mer walking technique, a novel approach to sequencing repeats in the genome. The technique attempts to sequence the repeat region at the base level. The repeat regions are stored in the form of a graph. Post-processing algorithms on the graph can then extract repeat characteristics such as base differences, base insertions and tandem repeats. To read more about the approach, click here.
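
To give a rough feel for how a graph can expose differences between repeat copies, here is a generic k-mer graph in Python. It is not the project's k-mer walking algorithm, whose details are not given on this page. The graph layout, the function names and the two toy repeat copies are all assumptions made for this sketch.

```python
from collections import defaultdict

def build_kmer_graph(sequences, k):
    """Directed graph with an edge from each k-mer to the k-mer one base further on."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph

def branch_points(graph):
    """k-mers with more than one successor, i.e. places where the copies diverge."""
    return {node: sorted(nexts) for node, nexts in graph.items() if len(nexts) > 1}

# Two hypothetical copies of a repeat that differ at a single base
copy_1 = "CCGATACTGAAG"
copy_2 = "CCGATTCTGAAG"

graph = build_kmer_graph([copy_1, copy_2], k=4)
branches = branch_points(graph)
```

In this toy case the only branch point is the k-mer CGAT, which is followed by GATA in one copy and GATT in the other. The two paths merge again a few k-mers later, which is the signature of a single base difference in this kind of graph.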

 
Author: Muhammad Yasir Aheer        Supervisor: Dr. Trevor Dix          Assoc. Supervisor: Dr. Darren Platt