Bioinformatics

Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens

Table of Contents Motivation Shared k-mer content Aligning reads Summary References Motivation We are in the early stages of planning a Mycobacterium tuberculosis (MTB) analysis pipeline for a research project in Papua New Guinea. We’ll be sequencing sputum samples with Oxford Nanopore Technologies (ONT) devices and were thinking of different ways of decontaminating the data - i.e. remove anything non-MTB. Sputum samples traditionally have a lot of host (human) reads and reads from a variety of bacteria. Traditionally the MTB component is quite small1. One component of this pipeline will be to upload sequencing reads to a remote/cloud server, so any reduction in file size will make uploads faster. As human reads are not used in any analysis steps, and will need to be removed prior to making any data available, we thought we could simplify things by removing human data as the first step. Our idea was to align reads to the human genome and just remove anything that aligns. However, one concern with this approach was whether any MTB reads could be lost in the process. This effectively boils down to the question: Do Mycobacterium tuberculosis and Homo sapiens share genomic sequence? After a literature search, I was unable to find an answer - which seemed quite surprising. My suspicion is that most people just assume they do not. (Or my literature searching skills are poor.) So let’s take a look. ...

Cheap Parallelisation

Table of Contents What is fd? Baby steps: finding our files Constructing execution commands Simple: counting characters in files Intermediate: changing file extensions Advanced: redirecting stdout to a file within the command Putting it all together: Parallel MSA Benchmark Results Conclusion Final Remarks Motivation I was recently creating a snakemake pipeline and needed to write a rule/process that would perform a multiple sequence alignment (MSA) on 2,582 fasta files. Usually, it is easy to parallelise this kind of task using snakemake. To cut a long story short; using snakemake to parallelise across the files was not feasible. I knew there were ways of doing this kind of thing with tools such as parallel, xargs, and find, but I had never really invested the time to get comfortable with them. This post is an attempt to document that process using one of my favourite CLI tools: fd. We’ll see how fd can be used to execute multiple MSAs (with MAFFT) simultaneously, and benchmark how much faster it is than a conventional “synchronous” approach. ...

Benchmarking Guppy algorithms

Methods Results Conclusions Supplementary code {:toc} ONT’s basecaller Guppy has recently been released to the masses. And with the announcement of the new “flip-flop” basecalling algorithm there is now the choice of two different algorithms for basecalling. ONT has obviously been singing flip-flop’s praises, and understandably so, as the initial results look like a decent step up in read accuracy. For an upcoming project I am going to be doing a lot of basecalling of Mycobacterium tuberculosis and given the project will involve assessing metrics heavily reliant on read accuracy I thought it best to invest some time in deciding which algorithm to go with. Another reason for my indecision came when I read a recent blog from Keith Robison which showed that maybe the new flip-flop algorithm doesn’t work well with organisms that have a higher GC content. ...