Simulation of Genomes from Arab and German Datasets

Big Data Simulation

This interdisciplinary research project by AGYA members Mohammad Adm of the Palestine Polytechnic University and Olfa Messaoud of the Pasteur Institute in Tunis aims to simulate whole-genome medical population genetics datasets that reflect the characteristics of the human genome, particularly in individuals of Arab, German and other European descent. These datasets will be used to derive one or more consensus NGS variant detection methods tailored to Arab, German and non-German European populations through a systematic evaluation of various variant calling methods.

Recent technological advances in high-throughput DNA sequencing and genomics have facilitated the development of a range of statistical genomic approaches, from ancestry inference to disease scoring statistics, including genome-wide association studies. These advances have led to an increase in bioinformatics tool development, particularly with the advancement of next-generation sequencing (NGS) technologies.

NGS technologies allow massively parallel DNA sequencing, resulting in higher data throughput, though at the cost of a higher error rate and shorter reads than traditional Sanger sequencing. A myriad of computational algorithms and tools is therefore needed to interpret the raw sequence reads coming from the sequencing platforms. However, validating and benchmarking methods for genome analysis is hindered by the scarcity of reference datasets for which all the mutational events of the sample genome are known and fully validated. Furthermore, assessing the accuracy, effectiveness and performance of the analytical methods used to analyse NGS data and to perform variant calling is an important aspect of biomedical and population genetics. Indeed, the performance of these tools can be significantly affected by the genetic structure of the sample genome under analysis, which reflects population ancestry. This is particularly true for NGS data generated from individuals with Arab/North African ancestry, for which no comprehensive investigation has been performed to assess the accuracy and performance of these algorithms.
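To illustrate what benchmarking against a fully known reference dataset can look like, the following minimal Python sketch compares a set of called variants with a truth set and reports precision, recall and F1. It is an illustration only, not the project's evaluation method: the function name benchmark_calls, the tuple layout (chromosome, position, reference allele, alternate allele) and the toy variants are all assumptions made for this example.

```python
def benchmark_calls(truth, calls):
    """Compare called variants with a truth set (hypothetical helper).

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    Matching is by exact tuple equality, which is a simplification.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # variants called and present in the truth set
    fp = len(calls - truth)   # variants called but absent from the truth set
    fn = len(truth - calls)   # true variants the caller missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 900, "T", "C"),
}
caller_a = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 41, "G", "A"),   # off by one position: counted as a false positive
}

result = benchmark_calls(truth_set, caller_a)
print(result)
```

Exact tuple matching is the simplest possible comparison. A real evaluation would typically first normalize how variants are represented, since the same insertion or deletion can be written in more than one way.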

Furthermore, although a large amount of NGS data from samples of European ancestry is publicly available, data from individuals of German descent are largely under-represented and practically missing from the majority of these datasets. Simulation tools for NGS reads, which mimic the base-pair sequences generated by sequencing platforms, may play a critical role in enabling efficient NGS analysis. Moreover, simulated datasets can be constructed to mimic many properties of human data while also being freely shareable among users and software developers without exposing personal health information. Thus, simulations can provide a gold standard available to all software engineers and researchers for the design and evaluation of variant calling workflows. These synthetic data are functionally similar to the output of a sequencer, but all of the underlying mutational events are known.
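The idea of synthetic reads whose underlying mutational events are all known can be sketched in a few lines of Python. The example below is a deliberately simplified toy, not the simulator used in the project: it plants single-nucleotide variants in a random reference, records them as the truth set, and samples fixed-length reads with independent substitution errors. The function names plant_snvs and simulate_reads and all parameter values are assumptions chosen for illustration.

```python
import random

BASES = "ACGT"


def plant_snvs(reference, n_variants, rng):
    """Insert n_variants substitutions; return (genome, truth).

    truth maps each 0-based position to its (ref, alt) pair, so every
    mutational event in the simulated genome is known by construction.
    """
    genome = list(reference)
    truth = {}
    for pos in rng.sample(range(len(reference)), n_variants):
        ref = reference[pos]
        alt = rng.choice([b for b in BASES if b != ref])
        genome[pos] = alt
        truth[pos] = (ref, alt)
    return "".join(genome), truth


def simulate_reads(genome, read_length, n_reads, error_rate, rng):
    """Sample reads uniformly along the genome with substitution errors.

    Returns a list of (start, sequence) pairs. Real sequencers have
    position- and platform-dependent error profiles; a single uniform
    error_rate is a simplifying assumption.
    """
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = []
        for base in genome[start:start + read_length]:
            if rng.random() < error_rate:
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        reads.append((start, "".join(read)))
    return reads


rng = random.Random(42)  # fixed seed so the toy dataset is reproducible
reference = "".join(rng.choice(BASES) for _ in range(1000))
sample_genome, truth = plant_snvs(reference, n_variants=10, rng=rng)
reads = simulate_reads(sample_genome, read_length=50, n_reads=400,
                       error_rate=0.01, rng=rng)
print(len(reads), "reads simulated;", len(truth), "known variants")
```

Because the truth dictionary is produced together with the reads, any variant caller run on such data can be scored against it directly, and the dataset contains no personal health information.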

The research cooperation in this project is based on the exchange of expertise and knowledge transfer between the corresponding project institutions and working groups in Tunis and Palestine, as well as a research visit by a Tunisian PhD student to the Palestine Polytechnic University.

Image: majcot/Shutterstock.com

Disciplines Involved: Mathematics, Human Genetics
Cooperation Partners: Palestine Polytechnic University, Palestine; Pasteur Institute of Tunis, Tunisia

Project Title: Simulation of Genomes from Arab and German Datasets
Year: 2020
Funding Scheme: Tandem Project
Countries Involved: Palestine, Tunisia

Project Partners

Mohammad Adm

Mathematics

Palestine Polytechnic University, Palestine

Olfa Messaoud

Human Genetics

University Tunis El Manar, Tunisia