Cheap Parallelisation

Table of Contents What is fd? Baby steps: finding our files Constructing execution commands Simple: counting characters in files Intermediate: changing file extensions Advanced: redirecting stdout to a file within the command Putting it all together: Parallel MSA Benchmark Results Conclusion Final Remarks

Motivation

I was recently creating a snakemake pipeline and needed to write a rule/process that would perform a multiple sequence alignment (MSA) on 2,582 fasta files. Usually, it is easy to parallelise this kind of task using snakemake. To cut a long story short, using snakemake to parallelise across the files was not feasible in this case. I knew there were ways of doing this kind of thing with tools such as parallel, xargs, and find, but I had never really invested the time to get comfortable with them. This post is an attempt to document that process using one of my favourite CLI tools: fd. We’ll see how fd can be used to execute multiple MSAs (with MAFFT) simultaneously, and benchmark how much faster it is than a conventional “synchronous” approach. ...

June 22, 2020 · Michael Hall