What do I need to know about parallel computing in genomic research?

Today I was reading about genomics and Perl and found a “good question” posted on the Biostars website: Perl or Python for Comparative Genomics?

I want to start with this question because it is difficult to give a clear answer. I think the real question is what you want to achieve and whether you can do it in a programming language suited to the task. As a researcher who is not a programmer, I have found Python useful for genome analysis, but working with text databases as large as genomes (on the order of GBs) can be a painful process. You can spend hours analysing sequence alignments (more than 8 h on my Intel i5 laptop with 8 GB of RAM…), so I think it’s time to go beyond the analysis per se and start looking for ways to improve speed in this specific field.

Python is a good language to start working with genomes and multi-FASTA files, and there are good tutorials on how to do genome analysis with it. But then what? For a researcher who is not a bioinformatician the next step is not easy to see, and I am sure many others are asking themselves the same thing: what are the alternatives for boosting the performance of genome analysis? My time is short, like that of many other researchers, so we need a very clear picture of the options for making better use of our hardware. I was stuck here for a few weeks because I didn’t know whether it would be better to start by (i) learning a new programming language or (ii) learning parallel computing. Right now I think both are good options. Why? Because at the moment few algorithms with parallel implementations are available in either Perl or Python. Learning Perl can be worthwhile: if you read a little more about Perl (a programming language created for text processing) you will see that BioPerl can be a good choice for working with genomes too. Indeed, my final goal is to compare Burrows-Wheeler/SAM/BLAST alignments implemented in Python and in Perl.

On the other hand, implementing parallel computing is the most complex step, because it can be done on the CPU or on the GPU. CPU vs GPU isn’t an easy choice. In my case I’m limited on the CPU side (Intel i5), but my laptop has an NVIDIA GTX GPU, so I can use CUDA to learn parallel computing. Indeed, I bought this specific GPU with genome analysis in mind!
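To make the CPU option a bit more concrete, here is a minimal sketch of what CPU parallelism can look like in Python, using only the standard library. It is not a real alignment pipeline: it just computes the GC content of each record in a multi-FASTA file, one record per task, spread over several processes. The function names (`parse_fasta`, `gc_content`, `parallel_gc`) and the toy sequences are my own choices for illustration, and a real analysis would replace `gc_content` with something much heavier.

```python
from concurrent.futures import ProcessPoolExecutor
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) tuples from a multi-FASTA text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record id.
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


def gc_content(record):
    """Return (record_id, GC fraction) for one (record_id, sequence) pair."""
    record_id, seq = record
    if not seq:
        return record_id, 0.0
    gc = seq.count("G") + seq.count("C")
    return record_id, gc / len(seq)


def parallel_gc(records, workers=4):
    """Compute GC content for every record, one task per record, across processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(gc_content, records))


if __name__ == "__main__":
    # Toy multi-FASTA input (hypothetical data, only for illustration).
    demo = io.StringIO(">seq1 toy record\nGGCC\nAATT\n>seq2\nATAT\n")
    print(parallel_gc(list(parse_fasta(demo)), workers=2))
```

For a task as cheap as counting G and C, the cost of starting processes and copying sequences between them will probably outweigh any gain. The pattern only pays off when the work per record is heavy, as it is with alignments.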

CUDA is the API provided by NVIDIA to access and use its GPU hardware, but getting it to work isn’t easy. A few days ago I tried to get CUDA working on Windows 7 with Visual Studio 2010, but all the VS/CUDA software slowed my machine down, so I tried a less painful approach and got CUDA working on a dual-boot Ubuntu installation. An alternative to CUDA is OpenCL, which works on both NVIDIA and AMD GPUs, but I chose an NVIDIA GPU because it was the most widespread GPU brand on the market, and CUDA can work with multiple GPUs.

The next step is to find out how to implement Burrows-Wheeler/SAM/BLAST alignments with GPU parallel computing, which will require a vectorized implementation. GPU parallel computing offers more computing speed for the money if we consider how expensive CPU hardware is but, on the other hand, more algorithms are available for CPU parallel computing. Maybe in the future CPUs and GPUs will be more integrated, but that isn’t my point here. I think this text shows how complex it is to achieve better performance in genome computing. This is a very important task for genomic bioinformaticians, and a good reason to try new approaches that improve the performance of the available hardware. I hope this text has given you a basic idea of some of the alternatives available for boosting performance in genome analysis.