Haplotype Assembly

Reconstructing the two haplotypes of a diploid organism (also known as phasing) is an important problem with applications in fundamental research but also in clinical settings. So far, second generation sequencing data was insufficient to accomplish this task: fragments were too short to bridge "variant deserts". Emerging sequencing technologies hold the promise of allowing for read-based phasing. On the computational side, most formalizations of the corresponding optimization problem are NP-hard. In an approach called WhatsHap, we demonstrated that (i) the problem instances encountered in practice can be solved using a fixed parameter tractable (FPT) algorithm and (ii) that read-based phasing indeed delivers excellent performance once the reads are sufficiently long.

References

M. Patterson*, T. Marschall*, N. Pisanti, L. v. Iersel, L. Stougie, G. W. Klau, A. Schönhuth.
WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads.
Proceedings of RECOMB, pp. 237-249, 2014.
DOI: 10.1007/978-3-319-05269-4_19

Discovery and Genotyping of Structural Variations

Besides single-nucleotide polymorphisms (SNPs), second-generation sequencing (in principle) allows us to detect larger structural changes like insertions, deletions, translocations, duplications, and inversions. Due to limited read length, sequencing errors, and read mapping ambiguity, however, designing reliable methods for that is not trivial. In case of deletions, especially mid-size events of 30 to 150bp are hard to discover in practice; some speak of a deletion twilight zone. We have developed methods that are designed to deliver good performance also for such events:

  • CLEVER is a read-pair based method that features a fast algorithm for clique enumeration. It is able to find twilight zone deletions because it does not discard concordant alignments but processes all read pairs.
  • LASER is a read mapper geared towards detecting alignments that cover insertions and (potentially long) deletions with high sensitivity.
  • MATE-CLEVER is an integrated workflow that uses CLEVER and LASER and is able to genotype insertions and deletions. Furthermore, it can leverage information on family structure in a Bayesian framework.

Software

CLEVER, LASER, and MATE-CLEVER are available as open-source software as part of the CLEVER ToolKit (CTK).

References

T. Marschall*, I. G. Costa*, S. Canzar, M. Bauer, G. W. Klau, A. Schliep, A. Schönhuth.
CLEVER: clique-enumerating variant finder.
Bioinformatics, 28(22), pp. 2875-2882, 2012.
DOI: 10.1093/bioinformatics/bts566


T. Marschall, I. Hajirasouliha, A. Schönhuth.
MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels.
Bioinformatics, 29(24), pp. 3143-3150, 2013.
DOI: 10.1093/bioinformatics/btt556


T. Marschall, A. Schönhuth.
Sensitive Long-Indel-Aware Alignment of Sequencing Reads.
arXiv, 1303.3520, 2013.

Viral Quasispecies Assembly

Viruses (like HIV, for instance) exhibit a fast mutation rate and hence evolve within a host. As a result, the host is not infected by a single virus type, but by a population of genetically diverse viruses. Knowledge of the spectrum of present virus haplotypes and their relative abundancies can be important for the choice of treatment.

On current second-generation sequencing machines, such a virus population can be sequenced to very deep coverage at moderate cost. Reconstructing haplotypes from the resulting sequencing reads is computationally challenging, though. In a recent project, we met these challenges and introduced a haplotype reconstruction algorithm that is able to reconstruct full-length haplotypes (given sufficient coverage and genetic diversity). The observed error rates are lower by about two orders of magnitude compared to previous approaches.

Software

Our software implemetation HaploClique is available from github.

References

A. Töpfer, T. Marschall, R. A. Bull, F. Luciani, A. Schönhuth, N. Beerenwinkel.
Viral Quasispecies Assembly via Maximal Clique Enumeration.
Proceedings of RECOMB / PLoS Computational Biology, 10(3), pp. e1003515, 2014.
DOI: 10.1371/journal.pcbi.1003515

The Genome of the Netherlands

The Genome of the Netherlands project has sequenced the whole genomes of 750 Dutch individuals from 250 families. Applications of these data include building high-quality reference panels for imputation, studying de novo mutations and the corresponding mechanisms, estimating the rate of such events, analyzing population structure, among many others. We contributed to this project as part of the Structural Variations subgroup and provided algorithms for the discovery and genotyping of structural variations.

References

The Genome of the Netherlands Consortium.
Whole-genome sequence variation, population structure and demographic history of the Dutch population.
Nature Genetics, advance online publication, 2014.
DOI: 10.1038/ng.3021

Pan-Genome Data Structures

Many bioinformatics tools use the reference genome of a species under study. The used reference genomes are linear, i.e. they consist of one DNA sequence per chromosome. For instance, programs to align next-generation sequencing reads will map the reads to such a linear reference genome. Likewise, tools to call variants like SNPs and structural variations do that with respect to this reference genome.

Today, however, information on common and rare variants is available for many species (and, most prominently, for Homo sapiens). To leverage this additional information, linear reference genomes should be replaced by variant-aware reference genomes. We develop data structures and algorithms to achieve that.