<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Posts on Microbes made me do it</title><link>https://mbhall88.github.io/post/</link><description>Recent content in Posts on Microbes made me do it</description><generator>Hugo</generator><language>en-au</language><lastBuildDate>Wed, 22 Apr 2026 14:16:33 +1000</lastBuildDate><atom:link href="https://mbhall88.github.io/post/index.xml" rel="self" type="application/rss+xml"/><item><title>Minimap2 lr:hq preset testing</title><link>https://mbhall88.github.io/post/minimap2-lrhq-preset-testing/</link><pubDate>Wed, 22 Apr 2026 14:16:33 +1000</pubDate><guid>https://mbhall88.github.io/post/minimap2-lrhq-preset-testing/</guid><description>&lt;h2 id="evaluating-minimap2s-lrhq-preset-for-bacterial-nanopore-variant-calling"&gt;Evaluating minimap2&amp;rsquo;s &lt;code&gt;lr:hq&lt;/code&gt; preset for bacterial nanopore variant calling&lt;/h2&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Oxford Nanopore Technologies (ONT) sequencing accuracy has improved dramatically in recent years. With basecalling models like &lt;a href="https://github.com/nanoporetech/dorado"&gt;Dorado&lt;/a&gt; v5.2.0 super-accuracy (&lt;code&gt;sup&lt;/code&gt;), error rates are consistently hovering around the 1% mark. To match this shift in raw read quality, &lt;a href="https://github.com/lh3/minimap2/"&gt;&lt;code&gt;minimap2&lt;/code&gt;&lt;/a&gt;[&lt;a href="#references" class="citation-link"&gt;1&lt;/a&gt;] introduced the &lt;code&gt;lr:hq&lt;/code&gt; preset in &lt;a href="https://github.com/lh3/minimap2/releases/tag/v2.27"&gt;version 2.27 (March 2024)&lt;/a&gt;, which is calibrated for long reads with an error rate of &amp;lt;1%.&lt;/p&gt;
&lt;p&gt;This introduction was driven by internal benchmarking from ONT developers (see &lt;a href="https://github.com/lh3/minimap2/issues/1127"&gt;minimap2 issue #1127&lt;/a&gt;) who found that &lt;code&gt;-x map-ont -k19 -w19 -U50,500&lt;/code&gt; maximised both speed and downstream accuracy for high-quality reads. As such, the &lt;code&gt;lr:hq&lt;/code&gt; preset was added to mirror those options.&lt;/p&gt;</description></item><item><title>Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens</title><link>https://mbhall88.github.io/post/mtb-human-shared-sequence/</link><pubDate>Wed, 21 Jun 2023 00:00:00 +0000</pubDate><guid>https://mbhall88.github.io/post/mtb-human-shared-sequence/</guid><description>&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#motivation"&gt;Motivation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#shared-k-mer-content"&gt;Shared &lt;em&gt;k&lt;/em&gt;-mer content&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#aligning-reads"&gt;Aligning reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#summary"&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href="https://doi.org/10.5281/zenodo.8068147"&gt;&lt;img alt="DOI" loading="lazy" src="https://zenodo.org/badge/DOI/10.5281/zenodo.8068147.svg"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id="motivation"&gt;Motivation&lt;/h1&gt;
&lt;p&gt;We are in the early stages of planning a &lt;em&gt;Mycobacterium tuberculosis&lt;/em&gt; (MTB) analysis pipeline for a research project in Papua New Guinea. We&amp;rsquo;ll be sequencing sputum samples with Oxford Nanopore Technologies (ONT) devices and have been considering different ways of decontaminating the data, i.e. removing anything non-MTB. Sputum samples typically contain a lot of host (human) reads and reads from a variety of bacteria, while the MTB component is usually quite small&lt;sup&gt;1&lt;/sup&gt;. One component of this pipeline will be to upload sequencing reads to a remote/cloud server, so any reduction in file size will make uploads faster. As human reads are not used in any analysis steps, and will need to be removed prior to making any data available, we thought we could simplify things by removing human data as the first step. Our idea was to align reads to the human genome and remove anything that aligns. However, one concern with this approach was whether any MTB reads could be lost in the process. This effectively boils down to the question: &lt;strong&gt;Do &lt;em&gt;Mycobacterium tuberculosis&lt;/em&gt; and &lt;em&gt;Homo sapiens&lt;/em&gt; share genomic sequence&lt;/strong&gt;? After a literature search, I was unable to find an answer, which seemed quite surprising. My suspicion is that most people just assume they do not. (Or my literature searching skills are poor.) So let&amp;rsquo;s take a look.&lt;/p&gt;</description></item><item><title>Cheap Parallelisation</title><link>https://mbhall88.github.io/post/cheap-parallelisation/</link><pubDate>Mon, 22 Jun 2020 00:00:00 +0000</pubDate><guid>https://mbhall88.github.io/post/cheap-parallelisation/</guid><description>&lt;h1 id="table-of-contents"&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-is-fd"&gt;What is &lt;code&gt;fd&lt;/code&gt;?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#baby-steps-finding-our-files"&gt;Baby steps: finding our files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#constructing-execution-commands"&gt;Constructing execution commands&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#simple-counting-characters-in-files"&gt;Simple: counting characters in files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#intermediate-changing-file-extensions"&gt;Intermediate: changing file extensions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#advanced-redirecting-stdout-to-a-file-within-the-command"&gt;Advanced: redirecting stdout to a file within the command&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#putting-it-all-together-parallel-msa"&gt;Putting it all together: Parallel MSA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#benchmark"&gt;Benchmark&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#results"&gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#final-remarks"&gt;Final Remarks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="motivation"&gt;Motivation&lt;/h1&gt;
&lt;p&gt;I was recently creating a &lt;a href="https://snakemake.readthedocs.io/en/stable/"&gt;&lt;code&gt;snakemake&lt;/code&gt;&lt;/a&gt; pipeline and needed to write a
rule/process that would perform a multiple sequence alignment (MSA) on 2,582 fasta
files. Usually, it is easy to parallelise this kind of task using &lt;code&gt;snakemake&lt;/code&gt;. To cut a
long story short, using &lt;code&gt;snakemake&lt;/code&gt; to parallelise across the files was not feasible in this case. I
knew there were ways of doing this kind of thing with tools such as
&lt;a href="https://www.gnu.org/software/parallel/"&gt;&lt;code&gt;parallel&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.man7.org/linux/man-pages/man1/xargs.1.html"&gt;&lt;code&gt;xargs&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://www.gnu.org/software/findutils/"&gt;&lt;code&gt;find&lt;/code&gt;&lt;/a&gt;, but I had never really
invested the time to get comfortable with them. This post is an attempt to document that
process using one of my favourite CLI tools: &lt;a href="https://github.com/sharkdp/fd"&gt;&lt;code&gt;fd&lt;/code&gt;&lt;/a&gt;. We&amp;rsquo;ll see how &lt;code&gt;fd&lt;/code&gt; can be used
to execute multiple MSAs (with MAFFT) simultaneously, and benchmark how much faster it is than
a conventional &amp;ldquo;synchronous&amp;rdquo; approach.&lt;/p&gt;</description></item><item><title>Benchmarking Guppy algorithms</title><link>https://mbhall88.github.io/post/benchmark-guppy-algorithms/</link><pubDate>Fri, 01 Feb 2019 00:00:00 +0000</pubDate><guid>https://mbhall88.github.io/post/benchmark-guppy-algorithms/</guid><description>&lt;ul&gt;
&lt;li&gt;Methods&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusions&lt;/li&gt;
&lt;li&gt;Supplementary code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;ONT&amp;rsquo;s basecaller Guppy has recently been released to the masses. With the announcement of the new &amp;ldquo;&lt;a href="https://community.nanoporetech.com/posts/pre-release-of-stand-alone"&gt;flip-flop&lt;/a&gt;&amp;rdquo; basecalling algorithm, there is now a choice between two different algorithms for basecalling.&lt;/p&gt;
&lt;p&gt;ONT has obviously been singing flip-flop&amp;rsquo;s praises, and understandably so, as the &lt;a href="https://community.nanoporetech.com/posts/pre-release-of-stand-alone"&gt;initial results&lt;/a&gt; look like a decent step up in read accuracy.&lt;/p&gt;
&lt;p&gt;For an upcoming project I am going to be doing &lt;em&gt;a lot&lt;/em&gt; of basecalling of &lt;em&gt;Mycobacterium tuberculosis&lt;/em&gt; and, given the project will involve assessing metrics heavily reliant on read accuracy, I thought it best to invest some time in deciding which algorithm to go with. Another reason for my indecision came when I read a &lt;a href="https://omicsomics.blogspot.com/2018/12/flappie-vs-albacore-via-counterr.html"&gt;recent blog post from Keith Robison&lt;/a&gt;, which suggested that the new flip-flop algorithm may not work well with organisms that have a higher GC content.&lt;/p&gt;</description></item></channel></rss>