How to speed up DNA analysis using the BLAST algorithm on supercomputers |

How to speed up DNA analysis using the BLAST algorithm on supercomputers

4 September 2018

Genetics is one of the fastest growing fields of science. Knowledge of the DNA code allows for more and more sophisticated applications: from diagnostics of various diseases, through the development of new drugs and methods of treatment to predicting the probability of particular disease. The first analyzes of the human genome made several years ago were a huge and very expensive international project (Human Genome Project). Currently, along with the advancement of biochemistry and the development of necessary technologies, including high performance computing, learning key DNA fragments is fast and relatively cheap. Gathering information about entire human DNA (Whole Genome Sequencing) is also not so expensive and can be done in a relatively short time – days or weeks. Modern techniques of finding the order of amino acids in DNA are commonly called next generation sequencing (Next Generation Sequencing). As a result, genetic information for diagnostics is increasingly used, especially in the case of rare diseases, often having autoimmune and therefore genetic basis. Knowledge of DNA is also helpful in determining the probability of developing some diseases (including oncological) or in determining the type of disease and choosing the right treatment. For this reason, DNA sequencing is becoming an increasingly important diagnostic method and sequencing devices are more often found in hospital laboratories.

The sequencing process is based on the treatment of material taken from humans (usually in the form of blood) with appropriate enzymes. The effects are read using appropriate detection systems. The proper preparation of the sample followed by its analysis is relatively quick and takes a few hours with the use of modern sequencers. Unfortunately, the information obtained as a result of biochemical reactions requires further processing, which is done using computers. As a result of biochemical analyzes we obtain short (about 100) strings of bases found in the analyzed DNA. These fragments (so-called reads) should be compared with information in available databases. And this element – searching for analogous DNA fragments in databases – is the most time-consuming and requires the involvement of large computing power. Depending on the questions asked, and so depending on how much and which database you are searching and how much similarity is required, this process may take hours, weeks or even months on a typical computer.

Why so long? It would seem that finding similar strings is not so difficult. Well, the problem is in the search for similar and not necessarily identical strings. While finding the same strings is not difficult, allowing modifications (mutations, as geneticists call it) significantly complicates the problem. Algorithms based on searching for all possible combinations are unfortunately too slow and only the development of algorithms based on stochastic models allowed to significantly speed up the search. The first such algorithm used is the BLAST (Basic Local Alignment Search Tool) published in the 1990. The development of this algorithm was a breakthrough event and is considered by many to be the beginning of bioinformatics. BLAST, and its implementation created and developed by the US National Center for Biotechnology Information (NCBI) is widely used by scientists around the world. Unfortunately, this software is adapted to work on a single computer (workstation) and is not able to use the capabilities of modern parallel computers. In recent years there have been many attempts to implement BLAST for parallel computers, but they have not been sufficiently efficient or have not kept up with the rapidly changing BLAST. As a result, the analysis of genetic data takes too long to be used in diagnostics or even in scientific research.

Our associates from the Karolinska Instituet in Stockholm (prof. J. Dilner and D. Bzhalava, PhD) have encountered this problem. In their work they dealt with the identification of human DNA fragments that are similar to the DNA sequence of viruses. It is estimated that several percent of human DNA are virus-like sequences – i.e. fragments similar to those found in viruses. They probably serve as a kind of memory used to synthesize antibodies or proteins that fight specific viruses. These mechanisms are not yet well researched, but it seems that they can have a significant relationship with human immunity to viral infections.

Due to the required high accuracy of the comparison, but also due to the size of the searched databases, the analysis performed using NCBI software and available equipment (workstations) lasted for months. To speed up the work we were able to use the ICM UW experience in parallelizing applications and use the tools developed by ICM UW. In a short time (less than 3 months) we managed to prepare a program implemented using the PCJ library which allowed to divide data for analysis into smaller fragments and simultaneous launch of many BLAST instances. The developed solution uses dynamic allocation of workload for individual processors, which turned out to be couple times more efficient (faster) than competitive solutions. Our solution also allows you to use different versions of NCBI BLAST and is as easy to use as the original code. At the same time, it allows for significant (more than 100-fold) reduction of the waiting time for the result, limiting it to a few or a dozen or so hours.

At the moment, we are working on launching applications on hadoop clusters or cloud resources. The software is available under the BSD license and can be used free of charge (https://github.com/hpdcj/PCJ-blast). Recently, a publication on this has been published in the Journal of Computational Biology.

Piotr Bała

Prof. dr hab. Piotr Bała – physicist, programmer, specializing in molecular and computer physics, parallel and distributed computations and modeling of dynamic systems with classical-quantum methods. Creator of the PCJ library allowing for effective parallelization of calculations in Java. Co-founder of the UNICORE software that allows distributed computations in molecular biology and quantum chemistry. Author or co-author of over 120 scientific publications. Since 1993 he has been cooperating with the ICM UW. Creator of studies in computational engineering at the ICM UW. Originator of hackathons Wielkie Wyzwania Programistyczne (Great Programming Challenges).

News

Be FAIR – Open Your Data
30 June 2025
The Open Science Platform of the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw (ICM UW), provides an informational website for users of the University of Warsaw Research Data Repository. The service was created with researchers in mind, offering practical information on using the University of Warsaw Research Data Repository, as well as guidelines for managing data at every stage of its life cycle.
Join the 4EU+ Summer School 2025: Against Cancer
2 June 2025
The 4EU+ Against Cancer Summer School returns for its 2025 edition, taking place from July 7 to 9 at Sorbonne University in Paris. This intensive three-day program offers 20 hours of instruction and awards 4 ECTS credits. It is designed for students from the 4EU+ Alliance universities who are eager to deepen their understanding of oncology through a multidisciplinary approach.
RepOD Repository Awarded the CoreTrustSeal Certification
26 May 2025
The Open Data Repository RepOD has been awarded the CoreTrustSeal certification. This certificate confirms the high quality of the service, which has been supporting Polish researchers for the past ten years. RepOD is one of the services offered by the Open Science Platform at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw […]
40 Years of the Free Software Foundation
21 May 2025
The Open Science Platform at the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, cordially invites you to join in celebrating the 40th anniversary of the Free Software Foundation’s work in defending the freedom of computer program users.
Condolences
8 May 2025
On May 7th, 2025, dramatic events took place on the Main Campus of the University of Warsaw, deeply affecting the entire academic community. Our hearts go out to the Family and Loved Ones of the murdered Employee and the severely injured UW Security Guard. In these difficult moments, we wish to express our deepest sympathy […]
Rapid HPV genotype detection
16 April 2025
A team led by the late Prof Piotr Bała, as part of the HEAP – Human Exposome Assessment Platform project funded by Horizon 2020, has developed a novel tool for rapid HPV viral sequence detection. The paper entitled “HPV-KITE: sequence analysis software for rapid HPV genotype detection”, whose first author is Dr Marek Nowicki from […]

How to speed up DNA analysis using the BLAST algorithm on supercomputers

News

Be FAIR – Open Your Data

Join the 4EU+ Summer School 2025: Against Cancer

RepOD Repository Awarded the CoreTrustSeal Certification

40 Years of the Free Software Foundation

Condolences

Rapid HPV genotype detection