How to speed up DNA analysis using the BLAST algorithm on supercomputers

Genetics is one of the fastest growing fields of science. Knowledge of the DNA code allows for more and more sophisticated applications: from diagnostics of various diseases, through the development of new drugs and methods of treatment to predicting the probability of particular disease. The first analyzes of the human genome made several years ago were a huge and very expensive international project (Human Genome Project). Currently, along with the advancement of biochemistry and the development of necessary technologies, including high performance computing, learning key DNA fragments is fast and relatively cheap. Gathering information about entire human DNA (Whole Genome Sequencing) is also not so expensive and can be done in a relatively short time – days or weeks. Modern techniques of finding the order of amino acids in DNA are commonly called next generation sequencing (Next Generation Sequencing). As a result, genetic information for diagnostics is increasingly used, especially in the case of rare diseases, often having autoimmune and therefore genetic basis. Knowledge of DNA is also helpful in determining the probability of developing some diseases (including oncological) or in determining the type of disease and choosing the right treatment. For this reason, DNA sequencing is becoming an increasingly important diagnostic method and sequencing devices are more often found in hospital laboratories.

The sequencing process is based on the treatment of material taken from humans (usually in the form of blood) with appropriate enzymes. The effects are read using appropriate detection systems. The proper preparation of the sample followed by its analysis is relatively quick and takes a few hours with the use of modern sequencers. Unfortunately, the information obtained as a result of biochemical reactions requires further processing, which is done using computers. As a result of biochemical analyzes we obtain short (about 100) strings of bases found in the analyzed DNA. These fragments (so-called reads) should be compared with information in available databases. And this element – searching for analogous DNA fragments in databases – is the most time-consuming and requires the involvement of large computing power. Depending on the questions asked, and so depending on how much and which database you are searching and how much similarity is required, this process may take hours, weeks or even months on a typical computer.

Why so long? It would seem that finding similar strings is not so difficult. Well, the problem is in the search for similar and not necessarily identical strings. While finding the same strings is not difficult, allowing modifications (mutations, as geneticists call it) significantly complicates the problem. Algorithms based on searching for all possible combinations are unfortunately too slow and only the development of algorithms based on stochastic models allowed to significantly speed up the search. The first such algorithm used is the BLAST (Basic Local Alignment Search Tool) published in the 1990. The development of this algorithm was a breakthrough event and is considered by many to be the beginning of bioinformatics. BLAST, and its implementation created and developed by the US National Center for Biotechnology Information (NCBI) is widely used by scientists around the world. Unfortunately, this software is adapted to work on a single computer (workstation) and is not able to use the capabilities of modern parallel computers. In recent years there have been many attempts to implement BLAST for parallel computers, but they have not been sufficiently efficient or have not kept up with the rapidly changing BLAST. As a result, the analysis of genetic data takes too long to be used in diagnostics or even in scientific research.

Our associates from the Karolinska Instituet in Stockholm (prof. J. Dilner and D. Bzhalava, PhD) have encountered this problem. In their work they dealt with the identification of human DNA fragments that are similar to the DNA sequence of viruses. It is estimated that several percent of human DNA are virus-like sequences – i.e. fragments similar to those found in viruses. They probably serve as a kind of memory used to synthesize antibodies or proteins that fight specific viruses. These mechanisms are not yet well researched, but it seems that they can have a significant relationship with human immunity to viral infections.

Due to the required high accuracy of the comparison, but also due to the size of the searched databases, the analysis performed using NCBI software and available equipment (workstations) lasted for months. To speed up the work we were able to use the ICM UW experience in parallelizing applications and use the tools developed by ICM UW. In a short time (less than 3 months) we managed to prepare a program implemented using the PCJ library which allowed to divide data for analysis into smaller fragments and simultaneous launch of many BLAST instances. The developed solution uses dynamic allocation of workload for individual processors, which turned out to be couple times more efficient (faster) than competitive solutions. Our solution also allows you to use different versions of NCBI BLAST and is as easy to use as the original code. At the same time, it allows for significant (more than 100-fold) reduction of the waiting time for the result, limiting it to a few or a dozen or so hours.

At the moment, we are working on launching applications on hadoop clusters or cloud resources. The software is available under the BSD license and can be used free of charge (https://github.com/hpdcj/PCJ-blast). Recently, a publication on this has been published in the Journal of Computational Biology.

Piotr Bała

Prof. dr hab. Piotr Bała – physicist, programmer, specializing in molecular and computer physics, parallel and distributed computations and modeling of dynamic systems with classical-quantum methods. Creator of the PCJ library allowing for effective parallelization of calculations in Java. Co-founder of the UNICORE software that allows distributed computations in molecular biology and quantum chemistry. Author or co-author of over 120 scientific publications. Since 1993 he has been cooperating with the ICM UW. Creator of studies in computational engineering at the ICM UW. Originator of hackathons Wielkie Wyzwania Programistyczne (Great Programming Challenges).