Cheap Parallelisation

Mon, 22 Jun 2020 00:00:00 +0000

What is fd?
Baby steps: finding our files
Constructing execution commands
Putting it all together: Parallel MSA
Benchmark
- Results
- Conclusion
Final Remarks

Motivation

I was recently creating a snakemake pipeline and needed to write a rule/process that would perform a multiple sequence alignment (MSA) on 2,582 fasta files. Usually, it is easy to parallelise this kind of task using snakemake. To cut a long story short, using snakemake to parallelise across the files was not feasible. I knew there were ways of doing this kind of thing with tools such as parallel, xargs, and find, but I had never really invested the time to get comfortable with them. This post is an attempt to document that process using one of my favourite CLI tools: fd. We’ll see how fd can be used to execute multiple MSAs (with MAFFT) simultaneously, and benchmark how much faster it is than a conventional “synchronous” approach.

Benchmarking Guppy algorithms

Fri, 01 Feb 2019 00:00:00 +0000

Methods
Results
Conclusions
Supplementary code

ONT’s basecaller Guppy has recently been released to the masses and, with the announcement of the new “flip-flop” basecalling algorithm, there is now a choice of two different basecalling algorithms.

ONT has obviously been singing flip-flop’s praises, and understandably so, as the initial results look like a decent step up in read accuracy.

For an upcoming project I am going to be doing a lot of basecalling of Mycobacterium tuberculosis and, given the project will involve assessing metrics heavily reliant on read accuracy, I thought it best to invest some time in deciding which algorithm to go with. Another reason for my indecision came when I read a recent blog post from Keith Robison, which suggested that the new flip-flop algorithm may not work well with organisms that have a higher GC content.

Benchmark on Microbes made me do it
