Minimap2 lr:hq preset testing

Evaluating minimap2’s `lr:hq` preset for bacterial nanopore variant calling

Introduction

Oxford Nanopore Technologies (ONT) sequencing accuracy has improved dramatically in recent years. With basecalling models like Dorado v5.2.0 super-accuracy (sup), error rates now consistently hover around the 1% mark. To match this shift in raw read quality, minimap2 [1] introduced the lr:hq preset in version 2.27 (March 2024), which is calibrated for long reads with an error rate of <1%.

This addition was driven by internal benchmarking from ONT developers (see minimap2 issue #1127), who found that -x map-ont -k19 -w19 -U50,500 maximised both speed and downstream accuracy for high-quality reads. As such, the lr:hq preset was added to mirror those options.

I have had an increasing number of questions about whether people doing variant calling should be using this new preset or not. I don’t like making recommendations without empirical data, so here is my attempt at providing some evidence for which preset should be used.

What are the preset differences?

To understand the results that follow, we have to look at the seeding mechanics defined by these presets in minimap2.

map-ont: This is the default preset and was designed for noisier long reads with an expected error rate of ~10%. It uses shorter k-mers (-k15) and samples minimizers densely, using a small window (-w10)¹. It is extremely tolerant of repetitive regions, allowing k-mers that occur as few as 10, and up to a million, times (-U10,1000000). It handles high error rates by creating a dense map of seeds to anchor alignments.

lr:hq: This preset was designed for reads with <1% error and requires longer, perfect matches (-k19) and samples them less frequently (-w19)¹. Additionally, it caps k-mer occurrences at a maximum of 500 (a big decrease from map-ont) and raises the minimum to 50 (-U50,500). Because the reads are highly accurate, minimap2 does not need a dense seed map to anchor the alignment. It saves compute time and prevents multi-mapping ambiguity by aggressively ignoring repetitive noise.
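To get a feel for how -k and -w change seed density, here is a toy minimizer sampler in Python. It is only a sketch under loud assumptions: it ranks k-mers by CRC32 rather than minimap2's own hash, ignores the reverse strand, and runs on a random sequence. The function and variable names are made up for illustration, and the -U occurrence filter is not modelled at all.

```python
import random
import zlib


def minimizer_positions(seq: str, k: int, w: int) -> set[int]:
    """Start positions of the k-mers chosen as minimizers.

    Toy scheme: in every window of w consecutive k-mers, keep the k-mer with
    the smallest CRC32 value (leftmost on ties). minimap2 uses its own hash
    and also considers the reverse strand; this sketch does neither.
    """
    hashes = [zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        chosen.add(start + window.index(min(window)))
    return chosen


random.seed(0)  # fixed seed so the toy sequence is reproducible
toy_seq = "".join(random.choice("ACGT") for _ in range(5000))

seeds_map_ont = minimizer_positions(toy_seq, k=15, w=10)
seeds_lr_hq = minimizer_positions(toy_seq, k=19, w=19)
print(f"map-ont-like (k=15, w=10): {len(seeds_map_ont)} seeds")
print(f"lr:hq-like   (k=19, w=19): {len(seeds_lr_hq)} seeds")
```

On random sequence a window of w k-mers keeps roughly 2/(w+1) of the positions, so the wider lr:hq-like window should give noticeably fewer seeds, each of which must match over a longer stretch.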

Methods and data

I have a nice dataset and methodology from our recent paper benchmarking variant calling in bacterial genomes with which to assess the impact of these presets [2]. This post details a direct benchmarking of map-ont against lr:hq. Using Clair3 [3] on both high-accuracy (hac) and sup ONT reads, we ask a simple question:

Does swapping to lr:hq actually translate to improvements (or regressions) in downstream bacterial variant calling?

While lr:hq is designed for sup reads, I thought it would be interesting to also see how it impacts hac, as I suspect this is the predominant accuracy level for many users.

I have tried to ensure easy reproducibility with this analysis in case I need to revisit for any other assessments in the future. The basic outline of the pipeline is:

Data: I downloaded the FASTQs from the benchmark paper that were submitted to the SRA. These were basecalled with Dorado model v4.3.0 hac and sup. I used the truth VCFs from our paper (which are stored on Zenodo).
Standardisation: Reads were randomly subsampled to 50x depth (using rasusa). To guarantee a 1:1 comparison, the same read IDs were extracted from the sup dataset as those chosen for the hac random subset dataset.
Alignment and variant calling: Reads were mapped with minimap2 (v2.30) using both presets, followed by variant calling with Clair3 (v1.0.5) using the respective Dorado v4.3.0 models to match the original pipeline from the paper.
Assessment: Variants were filtered and then evaluated against the truth sets using vcfdist [4], which handles differences in variant representation and generates precision, recall, and F1 scores.
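For reference, the three metrics reduce to simple ratios of true-positive, false-positive, and false-negative counts. The sketch below is a minimal Python illustration, not vcfdist's implementation: the function name `prf1` is hypothetical, and the pairing of counts (query-side for precision, truth-side for recall) is an assumption inferred from the column names in Table S1. Treating an empty denominator as a perfect score is also my own choice for the sketch.

```python
def prf1(truth_tp: int, query_tp: int, truth_fn: int, query_fp: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from truth-side and query-side counts."""
    # Assumption: an empty denominator counts as a perfect score.
    precision = query_tp / (query_tp + query_fp) if (query_tp + query_fp) else 1.0
    recall = truth_tp / (truth_tp + truth_fn) if (truth_tp + truth_fn) else 1.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Counts copied from the ATCC_10708__202309 hac / lr-hq INDEL row of Table S1.
p, r, f1 = prf1(truth_tp=385, query_tp=392, truth_fn=14, query_fp=6)
print(f"precision={p:.6f} recall={r:.6f} F1={f1:.6f}")
```

For that row, the result should agree with the PREC, RECALL, and F1_SCORE columns of Table S1 up to rounding.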

(Note: The complete set of Bash and Python scripts used to reproduce this workflow is included in the Appendix at the end of this post, as are the accessions for the reads.)

Results

Variant Type	Read Model	Preset	Mean Precision	Mean Recall	Mean F1 Score	Mean F1 Q-Score²
SNP	hac	lr-hq	99.997%	99.790%	99.892%	45.46
SNP	hac	map-ont	99.995%	99.790%	99.891%	45.27
SNP	sup	lr-hq	99.998%	99.795%	99.895%	50.97
SNP	sup	map-ont	99.999%	99.785%	99.891%	50.74
INDEL	hac	lr-hq	99.440%	97.697%	98.556%	24.48
INDEL	hac	map-ont	99.421%	97.646%	98.521%	24.37
INDEL	sup	lr-hq	99.980%	98.594%	99.281%	22.05
INDEL	sup	map-ont	99.968%	98.581%	99.268%	21.89
ALL	hac	lr-hq	99.985%	99.739%	99.861%	32.79
ALL	hac	map-ont	99.983%	99.738%	99.860%	32.65
ALL	sup	lr-hq	99.997%	99.770%	99.882%	35.16
ALL	sup	map-ont	99.997%	99.761%	99.878%	34.94

Across the board, lr:hq is a marginal improvement. For SNPs, the F1 Q-score² sees a bump of about 0.2 to 0.25. For indels, we see a similar bump of about 0.13 to 0.17. A shift this deep in the decimal points might seem trivial, but ONT is improving so much now that progress is measured by hunting down the last few false calls. These aren’t massive, earth-shattering percentage leaps anymore. But for something like bacterial outbreak tracking where a single SNP can make a big difference, squeezing out those last false calls is important.
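To put a Q-score bump into more familiar terms, it helps to convert it back into an error fraction. The sketch below assumes the F1 Q-score is the Phred-style transform -10 * log10(1 - F1), capped at 100 for a perfect score. That definition is inferred from the per-sample values in Table S1, and the function names are hypothetical.

```python
import math


def f1_qscore(f1: float, cap: float = 100.0) -> float:
    """Phred-style score for an F1 value; a perfect F1 is capped at `cap`."""
    if f1 >= 1.0:
        return cap
    return min(cap, -10 * math.log10(1 - f1))


def error_ratio(delta_q: float) -> float:
    """Factor by which (1 - F1) shrinks when the Q-score rises by delta_q."""
    return 10 ** (-delta_q / 10)


# F1 from the ATCC_10708__202309 hac / lr-hq INDEL row of Table S1.
print(round(f1_qscore(0.974816), 2))
# Multiplicative change in (1 - F1) for a +0.2 bump in Q-score.
print(round(error_ratio(0.2), 3))
```

Under that assumption, a 0.2 increase in Q-score means (1 - F1) shrinks by a factor of 10^(-0.02), which is roughly 4.5% fewer combined false positives and false negatives. Because the Results table averages per-sample Q-scores, applying this conversion to the means is only approximate.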

When looking at the improvement given by lr:hq on SNPs, we see that for hac, the higher F1 score is driven solely by a small increase in precision (0.002 percentage points), with recall remaining the same. In contrast, for sup, the higher SNP F1 score comes from a 0.01 percentage point increase in recall, although there was a very small decrease in precision (0.001 percentage points).

Indels are a little more clear cut. For both hac and sup there is an increase in both precision and recall. These results can be visualised in Figure 1 (F1 scores) below and Figures S1 (precision) and S2 (recall) in the Appendix, along with the full per-sample results (Table S1).
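Because every sample was run with both presets, the per-sample rows in Table S1 can also be compared pairwise rather than only through their means. Below is a minimal sketch of that comparison, using three hac SNP pairs copied from Table S1. The `paired_deltas` helper is a hypothetical name, and a real run would read the full CSV instead of the inline excerpt.

```python
import csv
import io

# Excerpt of Table S1 (SNP, hac) with only the columns needed here.
EXCERPT = """VAR_TYPE,F1_QSCORE,read_model,preset,sample
SNP,36.18446,hac,lr-hq,AJ292__202310
SNP,36.18446,hac,map-ont,AJ292__202310
SNP,39.725952,hac,lr-hq,ATCC_10708__202309
SNP,38.758312,hac,map-ont,ATCC_10708__202309
SNP,18.872646,hac,lr-hq,ATCC_25922__202309
SNP,18.873606,hac,map-ont,ATCC_25922__202309
"""


def paired_deltas(rows) -> dict:
    """Map (VAR_TYPE, read_model, sample) -> lr-hq minus map-ont F1 Q-score."""
    scores = {}
    for row in rows:
        key = (row["VAR_TYPE"], row["read_model"], row["sample"])
        scores.setdefault(key, {})[row["preset"]] = float(row["F1_QSCORE"])
    # Keep only keys for which both presets are present.
    return {
        key: by_preset["lr-hq"] - by_preset["map-ont"]
        for key, by_preset in scores.items()
        if {"lr-hq", "map-ont"} <= set(by_preset)
    }


deltas = paired_deltas(csv.DictReader(io.StringIO(EXCERPT)))
for (var_type, model, sample), delta in deltas.items():
    print(f"{var_type} {model} {sample}: {delta:+.3f}")
```

A positive delta favours lr:hq. The three excerpted samples were picked to show one tie, one gain, and one very slight loss, so they are not representative of the overall distribution.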

**Figure 1:** Boxplots of F1 score for SNPs (left) and indels (right) for `minimap2` presets `lr:hq` (black) and `map-ont` (orange).

Conclusion

If you’re aligning reads basecalled with modern sup models (v4.3.0+) for variant calling, lr:hq should be your new default. As always, it is worth doing your own assessment on your own data, as I am sure there are edge cases for certain difficult genomes/regions. As an overarching finding, though, it seems to be mainly upside with no real downside on this dataset.

The most surprising finding for me though is the (slight) improvement for the hac data. Again, there didn’t seem to be any downside to using lr:hq for these data.

On a final note, Clair3 is a deep learning-based variant caller. So it is very possible that it “hides” some of the benefits (and drawbacks) of these different alignment presets. Having said that, as we point out in our benchmarking paper [2], Clair3, and the other deep learning-based variant callers, provide much better variant calls than traditional callers. My point here is around ensuring these results are taken purely in the context of variant calling; I cannot make claims about how these presets will impact other applications of read alignment.

Appendix

**Figure S1:** Boxplots of precision for SNPs (left) and indels (right) for `minimap2` presets `lr:hq` (black) and `map-ont` (orange).

**Figure S2:** Boxplots of recall for SNPs (left) and indels (right) for `minimap2` presets `lr:hq` (black) and `map-ont` (orange).

Table S1: Interactive per-sample variant calling results

Raw data for the above Table S1.

VAR_TYPE,THRESHOLD,MIN_QUAL,TRUTH_TP,QUERY_TP,TRUTH_FN,QUERY_FP,PREC,RECALL,F1_SCORE,F1_QSCORE,read_model,preset,sample
ALL,BEST,0,23032,23032,11,3,0.99987,0.999523,0.999696,35.173203,hac,lr-hq,AJ292__202310
ALL,BEST,0,23032,23032,11,3,0.99987,0.999523,0.999696,35.173203,hac,map-ont,AJ292__202310
INDEL,BEST,0,191,191,2,1,0.994792,0.989637,0.992208,21.083414,hac,lr-hq,AJ292__202310
INDEL,BEST,0,191,191,2,1,0.994792,0.989637,0.992208,21.083414,hac,map-ont,AJ292__202310
SNP,BEST,0,22841,22841,9,2,0.999912,0.999606,0.999759,36.18446,hac,lr-hq,AJ292__202310
SNP,BEST,0,22841,22841,9,2,0.999912,0.999606,0.999759,36.18446,hac,map-ont,AJ292__202310
ALL,BEST,0,23039,23039,4,0,1.0,0.999826,0.999913,40.615585,sup,lr-hq,AJ292__202310
ALL,BEST,0,23039,23039,4,0,1.0,0.999826,0.999913,40.615585,sup,map-ont,AJ292__202310
INDEL,BEST,0,190,190,3,0,1.0,0.984456,0.992167,21.060783,sup,lr-hq,AJ292__202310
INDEL,BEST,0,190,190,3,0,1.0,0.984456,0.992167,21.060783,sup,map-ont,AJ292__202310
SNP,BEST,0,22849,22849,1,0,1.0,0.999956,0.999978,46.60054,sup,lr-hq,AJ292__202310
SNP,BEST,0,22849,22849,1,0,1.0,0.999956,0.999978,46.60054,sup,map-ont,AJ292__202310
ALL,BEST,3,2274,2274,7,0,1.0,0.996931,0.998463,28.133865,hac,lr-hq,AMtb_1__202402
ALL,BEST,3,2274,2274,7,0,1.0,0.996931,0.998463,28.133865,hac,map-ont,AMtb_1__202402
INDEL,BEST,3,172,172,7,0,1.0,0.960894,0.980057,17.002096,hac,lr-hq,AMtb_1__202402
INDEL,BEST,3,172,172,7,0,1.0,0.960894,0.980057,17.002096,hac,map-ont,AMtb_1__202402
SNP,BEST,0,2102,2102,0,0,1.0,1.0,1.0,100.0,hac,lr-hq,AMtb_1__202402
SNP,BEST,0,2102,2102,0,0,1.0,1.0,1.0,100.0,hac,map-ont,AMtb_1__202402
ALL,BEST,0,2278,2278,3,0,1.0,0.998685,0.999342,31.817509,sup,lr-hq,AMtb_1__202402
ALL,BEST,0,2278,2278,3,0,1.0,0.998685,0.999342,31.817509,sup,map-ont,AMtb_1__202402
INDEL,BEST,0,176,176,3,0,1.0,0.98324,0.991549,20.731079,sup,lr-hq,AMtb_1__202402
INDEL,BEST,0,176,176,3,0,1.0,0.98324,0.991549,20.731079,sup,map-ont,AMtb_1__202402
SNP,BEST,0,2102,2102,0,0,1.0,1.0,1.0,100.0,sup,lr-hq,AMtb_1__202402
SNP,BEST,0,2102,2102,0,0,1.0,1.0,1.0,100.0,sup,map-ont,AMtb_1__202402
ALL,BEST,1,19165,19158,18,6,0.999687,0.999062,0.999374,32.035305,hac,lr-hq,ATCC_10708__202309
ALL,BEST,2,19162,19155,21,4,0.999791,0.998905,0.999348,31.857821,hac,map-ont,ATCC_10708__202309
INDEL,BEST,1,385,392,14,6,0.984925,0.964912,0.974816,15.988695,hac,lr-hq,ATCC_10708__202309
INDEL,BEST,2,383,390,16,4,0.989848,0.9599,0.974644,15.959143,hac,map-ont,ATCC_10708__202309
SNP,BEST,0,18780,18766,4,0,1.0,0.999787,0.999893,39.725952,hac,lr-hq,ATCC_10708__202309
SNP,BEST,0,18779,18765,5,0,1.0,0.999734,0.999867,38.758312,hac,map-ont,ATCC_10708__202309
ALL,BEST,0,19175,19168,8,0,1.0,0.999583,0.999791,36.806519,sup,lr-hq,ATCC_10708__202309
ALL,BEST,0,19175,19168,8,0,1.0,0.999583,0.999791,36.806519,sup,map-ont,ATCC_10708__202309
INDEL,BEST,0,391,398,8,0,1.0,0.97995,0.989873,19.945368,sup,lr-hq,ATCC_10708__202309
INDEL,BEST,0,391,398,8,0,1.0,0.97995,0.989873,19.945368,sup,map-ont,ATCC_10708__202309
SNP,BEST,0,18784,18770,0,0,1.0,1.0,1.0,100.0,sup,lr-hq,ATCC_10708__202309
SNP,BEST,0,18784,18770,0,0,1.0,1.0,1.0,100.0,sup,map-ont,ATCC_10708__202309
ALL,BEST,0,58452,58445,19,2,0.999966,0.999675,0.99982,37.455765,hac,lr-hq,ATCC_17802__202309
ALL,BEST,0,58453,58446,18,2,0.999966,0.999692,0.999829,37.669895,hac,map-ont,ATCC_17802__202309
INDEL,BEST,0,575,590,9,1,0.998308,0.984589,0.991401,20.655499,hac,lr-hq,ATCC_17802__202309
INDEL,BEST,0,576,591,8,1,0.998311,0.986301,0.99227,21.118034,hac,map-ont,ATCC_17802__202309
SNP,BEST,0,57877,57855,10,1,0.999983,0.999827,0.999905,40.222317,hac,lr-hq,ATCC_17802__202309
SNP,BEST,0,57877,57855,10,1,0.999983,0.999827,0.999905,40.222317,hac,map-ont,ATCC_17802__202309
ALL,BEST,0,58463,58456,8,0,1.0,0.999863,0.999932,41.651566,sup,lr-hq,ATCC_17802__202309
ALL,BEST,0,58457,58452,14,1,0.999983,0.999761,0.999872,38.918777,sup,map-ont,ATCC_17802__202309
INDEL,BEST,0,581,596,3,0,1.0,0.994863,0.997425,25.892059,sup,lr-hq,ATCC_17802__202309
INDEL,BEST,0,580,593,4,1,0.998317,0.993151,0.995727,23.692553,sup,map-ont,ATCC_17802__202309
SNP,BEST,0,57882,57860,5,0,1.0,0.999914,0.999957,43.643818,sup,lr-hq,ATCC_17802__202309
SNP,BEST,0,57877,57859,10,0,1.0,0.999827,0.999914,40.636517,sup,map-ont,ATCC_17802__202309
ALL,BEST,1,8887,8885,11,4,0.99955,0.998764,0.999157,30.740248,hac,lr-hq,ATCC_19119__202309
ALL,BEST,1,8887,8885,11,4,0.99955,0.998764,0.999157,30.740248,hac,map-ont,ATCC_19119__202309
INDEL,BEST,0,439,441,7,5,0.988789,0.984305,0.986542,18.710171,hac,lr-hq,ATCC_19119__202309
INDEL,BEST,0,439,441,7,5,0.988789,0.984305,0.986542,18.710171,hac,map-ont,ATCC_19119__202309
SNP,BEST,1,8448,8444,3,0,1.0,0.999645,0.999822,37.506493,hac,lr-hq,ATCC_19119__202309
SNP,BEST,1,8448,8444,3,0,1.0,0.999645,0.999822,37.506493,hac,map-ont,ATCC_19119__202309
ALL,BEST,7,8888,8886,10,1,0.999887,0.998876,0.999382,32.087318,sup,lr-hq,ATCC_19119__202309
ALL,BEST,7,8888,8886,10,1,0.999887,0.998876,0.999382,32.087318,sup,map-ont,ATCC_19119__202309
INDEL,BEST,7,437,439,9,0,1.0,0.979821,0.989807,19.917187,sup,lr-hq,ATCC_19119__202309
INDEL,BEST,7,437,439,9,0,1.0,0.979821,0.989807,19.917187,sup,map-ont,ATCC_19119__202309
SNP,BEST,0,8450,8446,1,1,0.999882,0.999882,0.999882,39.269592,sup,lr-hq,ATCC_19119__202309
SNP,BEST,0,8450,8446,1,1,0.999882,0.999882,0.999882,39.269592,sup,map-ont,ATCC_19119__202309
ALL,BEST,0,4774,4773,118,1,0.999791,0.975879,0.98769,19.097452,hac,lr-hq,ATCC_25922__202309
ALL,BEST,0,4775,4774,117,2,0.999581,0.976083,0.987693,19.098335,hac,map-ont,ATCC_25922__202309
INDEL,BEST,0,358,359,3,0,1.0,0.99169,0.995828,23.796095,hac,lr-hq,ATCC_25922__202309
INDEL,BEST,0,358,359,3,0,1.0,0.99169,0.995828,23.796095,hac,map-ont,ATCC_25922__202309
SNP,BEST,0,4416,4414,115,1,0.999774,0.974619,0.987036,18.872646,hac,lr-hq,ATCC_25922__202309
SNP,BEST,0,4417,4415,114,2,0.999547,0.97484,0.987039,18.873606,hac,map-ont,ATCC_25922__202309
ALL,BEST,0,4774,4773,118,1,0.999791,0.975879,0.98769,19.097452,sup,lr-hq,ATCC_25922__202309
ALL,BEST,0,4769,4768,123,1,0.99979,0.974857,0.987166,18.916435,sup,map-ont,ATCC_25922__202309
INDEL,BEST,0,359,360,2,1,0.99723,0.99446,0.995843,23.81213,sup,lr-hq,ATCC_25922__202309
INDEL,BEST,0,359,360,2,1,0.99723,0.99446,0.995843,23.81213,sup,map-ont,ATCC_25922__202309
SNP,BEST,0,4415,4413,116,0,1.0,0.974399,0.987033,18.871708,sup,lr-hq,ATCC_25922__202309
SNP,BEST,0,4410,4408,121,0,1.0,0.973295,0.986467,18.686003,sup,map-ont,ATCC_25922__202309
ALL,BEST,0,6581,6578,11,2,0.999696,0.998331,0.999013,30.05817,hac,lr-hq,ATCC_33560__202309
ALL,BEST,0,6581,6578,11,2,0.999696,0.998331,0.999013,30.05817,hac,map-ont,ATCC_33560__202309
INDEL,BEST,0,215,218,8,2,0.990909,0.964126,0.977334,16.44622,hac,lr-hq,ATCC_33560__202309
INDEL,BEST,0,215,218,8,2,0.990909,0.964126,0.977334,16.44622,hac,map-ont,ATCC_33560__202309
SNP,BEST,0,6366,6360,3,0,1.0,0.999529,0.999764,36.27903,hac,lr-hq,ATCC_33560__202309
SNP,BEST,0,6366,6360,3,0,1.0,0.999529,0.999764,36.27903,hac,map-ont,ATCC_33560__202309
ALL,BEST,0,6583,6580,9,0,1.0,0.998635,0.999317,31.654974,sup,lr-hq,ATCC_33560__202309
ALL,BEST,0,6583,6580,9,0,1.0,0.998635,0.999317,31.654974,sup,map-ont,ATCC_33560__202309
INDEL,BEST,0,217,220,6,0,1.0,0.973094,0.986364,18.653,sup,lr-hq,ATCC_33560__202309
INDEL,BEST,0,217,220,6,0,1.0,0.973094,0.986364,18.653,sup,map-ont,ATCC_33560__202309
SNP,BEST,0,6366,6360,3,0,1.0,0.999529,0.999764,36.27903,sup,lr-hq,ATCC_33560__202309
SNP,BEST,0,6366,6360,3,0,1.0,0.999529,0.999764,36.27903,sup,map-ont,ATCC_33560__202309
ALL,BEST,0,16653,16637,12,2,0.99988,0.99928,0.99958,33.76469,hac,lr-hq,ATCC_35221__202309
ALL,BEST,0,16653,16637,12,2,0.99988,0.99928,0.99958,33.76469,hac,map-ont,ATCC_35221__202309
INDEL,BEST,0,120,138,4,1,0.992806,0.967742,0.980114,17.014458,hac,lr-hq,ATCC_35221__202309
INDEL,BEST,0,120,138,4,1,0.992806,0.967742,0.980114,17.014458,hac,map-ont,ATCC_35221__202309
SNP,BEST,0,16533,16499,8,1,0.999939,0.999516,0.999728,35.65184,hac,lr-hq,ATCC_35221__202309
SNP,BEST,0,16533,16499,8,1,0.999939,0.999516,0.999728,35.65184,hac,map-ont,ATCC_35221__202309
ALL,BEST,1,16654,16638,11,2,0.99988,0.99934,0.99961,34.086777,sup,lr-hq,ATCC_35221__202309
ALL,BEST,1,16654,16638,11,1,0.99994,0.99934,0.99964,34.43539,sup,map-ont,ATCC_35221__202309
INDEL,BEST,1,121,139,3,0,1.0,0.975806,0.987755,19.120455,sup,lr-hq,ATCC_35221__202309
INDEL,BEST,1,121,139,3,0,1.0,0.975806,0.987755,19.120455,sup,map-ont,ATCC_35221__202309
SNP,BEST,0,16533,16499,8,2,0.999879,0.999516,0.999698,35.193695,sup,lr-hq,ATCC_35221__202309
SNP,BEST,0,16533,16499,8,1,0.999939,0.999516,0.999728,35.65184,sup,map-ont,ATCC_35221__202309
ALL,BEST,1,17206,17203,10,1,0.999942,0.999419,0.99968,34.954742,hac,lr-hq,ATCC_35897__202309
ALL,BEST,1,17203,17201,13,3,0.999826,0.999245,0.999535,33.326809,hac,map-ont,ATCC_35897__202309
INDEL,BEST,1,259,262,4,1,0.996198,0.984791,0.990461,20.205185,hac,lr-hq,ATCC_35897__202309
INDEL,BEST,1,258,260,5,3,0.988593,0.980989,0.984776,18.174751,hac,map-ont,ATCC_35897__202309
SNP,BEST,1,16947,16941,6,0,1.0,0.999646,0.999823,37.519634,hac,lr-hq,ATCC_35897__202309
SNP,BEST,1,16945,16941,8,0,1.0,0.999528,0.999764,36.270248,hac,map-ont,ATCC_35897__202309
ALL,BEST,2,17211,17208,5,0,1.0,0.99971,0.999855,38.378643,sup,lr-hq,ATCC_35897__202309
ALL,BEST,2,17211,17208,5,0,1.0,0.99971,0.999855,38.378643,sup,map-ont,ATCC_35897__202309
INDEL,BEST,2,261,264,2,0,1.0,0.992395,0.996183,24.183092,sup,lr-hq,ATCC_35897__202309
INDEL,BEST,2,261,264,2,0,1.0,0.992395,0.996183,24.183092,sup,map-ont,ATCC_35897__202309
SNP,BEST,2,16950,16944,3,0,1.0,0.999823,0.999911,40.529934,sup,lr-hq,ATCC_35897__202309
SNP,BEST,2,16950,16944,3,0,1.0,0.999823,0.999911,40.529934,sup,map-ont,ATCC_35897__202309
ALL,BEST,0,9230,9226,4,3,0.999675,0.999567,0.999621,34.211945,hac,lr-hq,ATCC_BAA-679__202309
ALL,BEST,0,9230,9226,4,3,0.999675,0.999567,0.999621,34.211945,hac,map-ont,ATCC_BAA-679__202309
INDEL,BEST,0,142,146,2,2,0.986486,0.986111,0.986299,18.632414,hac,lr-hq,ATCC_BAA-679__202309
INDEL,BEST,0,142,146,2,2,0.986486,0.986111,0.986299,18.632414,hac,map-ont,ATCC_BAA-679__202309
SNP,BEST,5,9088,9080,2,0,1.0,0.99978,0.99989,39.587234,hac,lr-hq,ATCC_BAA-679__202309
SNP,BEST,5,9088,9080,2,0,1.0,0.99978,0.99989,39.587234,hac,map-ont,ATCC_BAA-679__202309
ALL,BEST,0,9233,9229,1,0,1.0,0.999892,0.999946,42.66156,sup,lr-hq,ATCC_BAA-679__202309
ALL,BEST,0,9233,9229,1,0,1.0,0.999892,0.999946,42.66156,sup,map-ont,ATCC_BAA-679__202309
INDEL,BEST,0,143,147,1,0,1.0,0.993056,0.996516,24.578835,sup,lr-hq,ATCC_BAA-679__202309
INDEL,BEST,0,143,147,1,0,1.0,0.993056,0.996516,24.578835,sup,map-ont,ATCC_BAA-679__202309
SNP,BEST,0,9090,9082,0,0,1.0,1.0,1.0,100.0,sup,lr-hq,ATCC_BAA-679__202309
SNP,BEST,0,9090,9082,0,0,1.0,1.0,1.0,100.0,sup,map-ont,ATCC_BAA-679__202309
ALL,BEST,0,8051,8051,1,0,1.0,0.999876,0.999938,42.068523,hac,lr-hq,BPH2947__202310
ALL,BEST,0,8051,8051,1,0,1.0,0.999876,0.999938,42.068523,hac,map-ont,BPH2947__202310
INDEL,BEST,0,158,158,0,0,1.0,1.0,1.0,100.0,hac,lr-hq,BPH2947__202310
INDEL,BEST,0,158,158,0,0,1.0,1.0,1.0,100.0,hac,map-ont,BPH2947__202310
SNP,BEST,0,7893,7893,1,0,1.0,0.999873,0.999937,41.981865,hac,lr-hq,BPH2947__202310
SNP,BEST,0,7893,7893,1,0,1.0,0.999873,0.999937,41.981865,hac,map-ont,BPH2947__202310
ALL,BEST,0,8050,8050,2,0,1.0,0.999752,0.999876,39.060307,sup,lr-hq,BPH2947__202310
ALL,BEST,0,8050,8050,2,0,1.0,0.999752,0.999876,39.060307,sup,map-ont,BPH2947__202310
INDEL,BEST,0,157,157,1,0,1.0,0.993671,0.996825,24.983025,sup,lr-hq,BPH2947__202310
INDEL,BEST,0,157,157,1,0,1.0,0.993671,0.996825,24.983025,sup,map-ont,BPH2947__202310
SNP,BEST,0,7893,7893,1,0,1.0,0.999873,0.999937,41.981865,sup,lr-hq,BPH2947__202310
SNP,BEST,0,7893,7893,1,0,1.0,0.999873,0.999937,41.981865,sup,map-ont,BPH2947__202310
ALL,BEST,0,16021,16019,24,1,0.999938,0.998504,0.99922,31.081453,hac,lr-hq,KPC2__202310
ALL,BEST,0,16020,16018,25,2,0.999875,0.998442,0.999158,30.7467,hac,map-ont,KPC2__202310
INDEL,BEST,0,161,163,7,1,0.993902,0.958333,0.975794,16.160751,hac,lr-hq,KPC2__202310
INDEL,BEST,0,161,163,7,1,0.993902,0.958333,0.975794,16.160751,hac,map-ont,KPC2__202310
SNP,BEST,0,15860,15856,17,0,1.0,0.998929,0.999464,32.711052,hac,lr-hq,KPC2__202310
SNP,BEST,0,15859,15855,18,1,0.999937,0.998866,0.999401,32.22813,hac,map-ont,KPC2__202310
ALL,BEST,0,16025,16022,20,0,1.0,0.998753,0.999376,32.05022,sup,lr-hq,KPC2__202310
ALL,BEST,0,16023,16020,22,0,1.0,0.998629,0.999314,31.636446,sup,map-ont,KPC2__202310
INDEL,BEST,0,166,169,2,0,1.0,0.988095,0.994012,22.227137,sup,lr-hq,KPC2__202310
INDEL,BEST,0,166,169,2,0,1.0,0.988095,0.994012,22.227137,sup,map-ont,KPC2__202310
SNP,BEST,0,15859,15853,18,0,1.0,0.998866,0.999433,32.462654,sup,lr-hq,KPC2__202310
SNP,BEST,0,15857,15851,20,0,1.0,0.99874,0.99937,32.004807,sup,map-ont,KPC2__202310
ALL,BEST,2,10646,10638,10,1,0.999906,0.999062,0.999484,32.870014,hac,lr-hq,MMC234__202311
ALL,BEST,2,10646,10638,10,1,0.999906,0.999062,0.999484,32.870014,hac,map-ont,MMC234__202311
INDEL,BEST,2,174,182,8,1,0.994536,0.956044,0.97491,16.004986,hac,lr-hq,MMC234__202311
INDEL,BEST,2,174,182,8,1,0.994536,0.956044,0.97491,16.004986,hac,map-ont,MMC234__202311
SNP,BEST,0,10472,10456,2,0,1.0,0.999809,0.999905,40.200573,hac,lr-hq,MMC234__202311
SNP,BEST,0,10472,10456,2,0,1.0,0.999809,0.999905,40.200573,hac,map-ont,MMC234__202311
ALL,BEST,0,10649,10641,7,0,1.0,0.999343,0.999671,34.833321,sup,lr-hq,MMC234__202311
ALL,BEST,0,10649,10641,7,0,1.0,0.999343,0.999671,34.833321,sup,map-ont,MMC234__202311
INDEL,BEST,0,178,186,4,0,1.0,0.978022,0.988889,19.542437,sup,lr-hq,MMC234__202311
INDEL,BEST,0,178,186,4,0,1.0,0.978022,0.988889,19.542437,sup,map-ont,MMC234__202311
SNP,BEST,0,10471,10455,3,0,1.0,0.999714,0.999857,38.439663,sup,lr-hq,MMC234__202311
SNP,BEST,0,10471,10455,3,0,1.0,0.999714,0.999857,38.439663,sup,map-ont,MMC234__202311
ALL,BEST,0,5487,5485,2,0,1.0,0.999636,0.999818,37.392822,hac,lr-hq,RDH275__202311
ALL,BEST,0,5487,5485,2,0,1.0,0.999636,0.999818,37.392822,hac,map-ont,RDH275__202311
INDEL,BEST,0,126,128,2,0,1.0,0.984375,0.992126,21.03804,hac,lr-hq,RDH275__202311
INDEL,BEST,0,126,128,2,0,1.0,0.984375,0.992126,21.03804,hac,map-ont,RDH275__202311
SNP,BEST,0,5361,5357,0,0,1.0,1.0,1.0,100.0,hac,lr-hq,RDH275__202311
SNP,BEST,0,5361,5357,0,0,1.0,1.0,1.0,100.0,hac,map-ont,RDH275__202311
ALL,BEST,0,5487,5485,2,0,1.0,0.999636,0.999818,37.392822,sup,lr-hq,RDH275__202311
ALL,BEST,0,5487,5485,2,0,1.0,0.999636,0.999818,37.392822,sup,map-ont,RDH275__202311
INDEL,BEST,0,127,129,1,0,1.0,0.992188,0.996078,24.065403,sup,lr-hq,RDH275__202311
INDEL,BEST,0,127,129,1,0,1.0,0.992188,0.996078,24.065403,sup,map-ont,RDH275__202311
SNP,BEST,0,5360,5356,1,0,1.0,0.999813,0.999907,40.302055,sup,lr-hq,RDH275__202311
SNP,BEST,0,5360,5356,1,0,1.0,0.999813,0.999907,40.302055,sup,map-ont,RDH275__202311

Scripts

The following scripts detail the complete pipeline used to generate the data for this analysis.

Config file

sample,species,biosample,pod5,illumina,ont_simplex_fast,ont_simplex_hac,ont_simplex_sup,ont_duplex_hac,ont_duplex_sup,assembly
ATCC_10708__202309,Salmonella enterica,SAMN38321309,10.26188/25521883,SRR26899135,SRR28370662,SRR28370670,SRR27638402,SRR28370653,SRR28370644,CP149507-CP149508
ATCC_17802__202309,Vibrio parahaemolyticus,SAMN38321311,10.26188/25495063,SRR26899141,SRR28370661,SRR28370669,SRR27638400,SRR28370652,SRR28370643,CP149505-CP149506
ATCC_25922__202309,Escherichia coli,SAMN38321313,10.26188/25521892,SRR26899128,SRR28370659,SRR28370668,SRR27638398,SRR28370651,SRR28370642,CP149500-CP149504
ATCC_33560__202309,Campylobacter jejuni,SAMN38321314,10.26188/25495054,SRR26899120,SRR28370658,SRR28370667,SRR27638397,SRR28370650,SRR28370641,CP149499
ATCC_35221__202309,Campylobacter lari,SAMN38321315,10.26188/25493905,SRR26899115,SRR28370657,SRR28370666,SRR27638396,SRR28370648,SRR28370640,CP149498
ATCC_19119__202309,Listeria ivanovii,SAMN38321312,10.26188/25495057,SRR26899136,SRR28370656,SRR28370665,SRR27638399,SRR28370647,SRR28370639,CP149497
ATCC_35897__202309,Listeria welshimeri,SAMN38321316,10.26188/25495081,SRR26899109,SRR28370655,SRR28370664,SRR27638395,SRR28370646,SRR28370637,CP149496
ATCC_BAA-679__202309,Listeria monocytogenes,SAMN38321317,10.26188/25495069,SRR26899101,SRR28370654,SRR28370663,SRR27638394,SRR28370645,SRR28370636,CP149495
BPH2947__202310,Staphylococcus aureus,SAMN40453078,10.26188/25495075,ERR2929425,SRR28370690,SRR28370638,SRR28370694,SRR28370677,SRR28370684,CP149492-CP149494
AJ292__202310,Klebsiella variicola,SAMN40453079,10.26188/25495048,SRR28370702,SRR28370689,SRR28370697,SRR28370693,SRR28370676,SRR28370683,CP149491
KPC2__202310,Klebsiella pneumoniae,SAMN40453080,10.26188/25495078,SRR28370701,SRR28370688,SRR28370696,SRR28370682,SRR28370675,SRR28370681,CP149487-CP149490
RDH275__202311,Streptococcus pyogenes,SAMN40453081,10.26188/25495072,SRR28370700,SRR28370687,SRR28370695,SRR28370671,SRR28370674,SRR28370680,CP149486
MMC234__202311,Streptococcus dysgalactiae,SAMN40453082,10.26188/25495066,SRR28370699,SRR28370686,SRR28370692,SRR28370660,SRR28370673,SRR28370679,CP149485
AMtb_1__202402,Mycobacterium tuberculosis,SAMN40453083,10.26188/25495045,SRR28370698,SRR28370685,SRR28370691,SRR28370649,SRR28370672,SRR28370678,CP149484

1. Download Data

#!/bin/bash
set -euo pipefail

# Download and extract truth VCFs
cd /scratch/user/uqmhal11/minimap_preset_testing/data/truth_vcfs
wget -O truth_vcfs.zip "https://zenodo.org/api/records/10867171/files-archive"
unzip truth_vcfs.zip -d .
rm truth_vcfs.zip

for archive in *.tar.gz; do
    tar -xzf "$archive"
    rm "$archive"
done

# Create list of accessions to download
cd /scratch/user/uqmhal11/minimap_preset_testing/data/reads
csvtk cut -Uf ont_simplex_hac ../../config/accessions.csv >hac_accessions.txt
csvtk cut -Uf ont_simplex_sup ../../config/accessions.csv >sup_accessions.txt

# Download reads
ssubmit -t 12h -m 8g download_hac "kingfisher get --run-identifiers-list hac_accessions.txt -m ena-ascp ena-ftp --output-directory hac --check-md5sums"
ssubmit -t 12h -m 8g download_sup "kingfisher get --run-identifiers-list sup_accessions.txt -m ena-ascp ena-ftp --output-directory sup --check-md5sums"

2. Subsample Reads

#!/bin/bash
cd /scratch/user/uqmhal11/minimap_preset_testing/data/reads || exit 1

csv_file="../../config/accessions.csv"
truth_dir="../truth_vcfs"

mkdir -p hac_subsampled sup_subsampled

seed=23867

tail -n +2 "$csv_file" | while IFS=, read -r sample species biosample pod5 illumina ont_simplex_fast ont_simplex_hac ont_simplex_sup ont_duplex_hac ont_duplex_sup assembly remainder; do

    fai_file="${truth_dir}/${sample}/reference.fna.fai"

    if [[ ! -f "$fai_file" ]]; then
        echo "Error: Index file $fai_file not found for $sample. Skipping."
        continue
    fi

    hac_in="hac/${ont_simplex_hac}_1.fastq.gz"
    hac_out="hac_subsampled/${sample}.fastq.gz"

    if [[ -f "$hac_in" ]]; then
        if [[ -f "$hac_out" ]]; then
            echo "Skipping HAC subsampling: $hac_out already exists."
        else
            echo "Subsampling HAC: $hac_in -> $hac_out (50x, using $fai_file)"
            rasusa reads "$hac_in" -c 50 -g "$fai_file" -o "$hac_out" -s "$seed"
        fi
    fi

    sup_in="sup/${ont_simplex_sup}_1.fastq.gz"
    sup_out="sup_subsampled/${sample}.fastq.gz"

    if [[ -f "$sup_in" ]]; then
        if [[ -f "$sup_out" ]]; then
            echo "Skipping SUP extraction: $sup_out already exists."
        else
            id_file="sup_subsampled/${sample}_read_ids.txt"

            echo "Step 1: Extracting read IDs to $id_file"
            seqkit seq -n "$hac_out" | cut -f 2 -d' ' >"$id_file"

            echo "Step 2: Extracting corresponding SUP reads to $sup_out"
            ssubmit -t 6h -m 32g "${sample}_sup_extract" "rg -zFf $id_file -A 3 --no-context-separator  $sup_in | gzip > $sup_out"
        fi
    fi
done

echo "Subsampling complete!"
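
The read-matching step above relies on `rg` printing each matching header plus the three lines after it. For readers who prefer an explicit FASTQ parser, the same idea can be sketched in Python. This is an illustrative alternative, not the code used for this post. The function name is hypothetical, and it assumes four-line FASTQ records with the original read ID appearing as a whitespace-separated token in the header (as the `cut -f 2 -d' '` step implies).

```python
import gzip
import os
import tempfile
from typing import Iterator


def extract_reads(fastq_gz: str, wanted_ids: set[str]) -> Iterator[tuple[str, ...]]:
    """Yield 4-line FASTQ records whose header contains a wanted read ID."""
    with gzip.open(fastq_gz, "rt") as handle:
        while True:
            record = tuple(handle.readline() for _ in range(4))
            if not record[0]:  # end of file
                break
            # Drop the leading '@' and compare every header token to the ID set.
            tokens = record[0][1:].split()
            if any(tok in wanted_ids for tok in tokens):
                yield record


# Tiny made-up FASTQ file to exercise the function.
demo_fq = os.path.join(tempfile.mkdtemp(), "toy.fastq.gz")
with gzip.open(demo_fq, "wt") as out:
    out.write("@SRR0.1 read-a\nACGT\n+\nIIII\n@SRR0.2 read-b\nGGCC\n+\nIIII\n")

kept = list(extract_reads(demo_fq, {"read-b"}))
print(len(kept), kept[0][0].strip())
```

One difference from the `rg -F` approach is that this sketch requires a whole header token to equal the ID, whereas a fixed-string search also matches IDs that appear as substrings of longer tokens or of other lines.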

3. Align Reads

#!/bin/bash
cd /scratch/user/uqmhal11/minimap_preset_testing || exit 1

out_dir="alignments"
truth_dir="data/truth_vcfs"
reads_dir="data/reads"
threads=8

mkdir -p "$out_dir"

for hac_fq in ${reads_dir}/hac_subsampled/*.fastq.gz; do
    filename=$(basename "$hac_fq")
    sample="${filename%.fastq.gz}"
    ref="${truth_dir}/${sample}/mutreference.fna"
    sup_fq="${reads_dir}/sup_subsampled/${sample}.fastq.gz"

    if [[ ! -f "$ref" ]]; then
        echo "Error: Reference not found at $ref. Skipping."
        continue
    fi

    # HAC + map-ont
    if [[ ! -f "${out_dir}/${sample}_hac_map-ont.bam" ]]; then
        echo " Aligning HAC reads with map-ont preset..."
        minimap2 -t "$threads" --cs --MD -aLx map-ont "$ref" "$hac_fq" |
            samtools sort --write-index -@ "$threads" -o "${out_dir}/${sample}_hac_map-ont.bam"
    else
        echo " Skipping HAC map-ont: output already exists."
    fi

    # HAC + lr:hq
    if [[ ! -f "${out_dir}/${sample}_hac_lr-hq.bam" ]]; then
        echo " Aligning HAC reads with lr:hq preset..."
        minimap2 -t "$threads" --cs --MD -aLx lr:hq "$ref" "$hac_fq" |
            samtools sort --write-index -@ "$threads" -o "${out_dir}/${sample}_hac_lr-hq.bam"
    else
        echo " Skipping HAC lr:hq: output already exists."
    fi

    if [[ -f "$sup_fq" ]]; then
        # SUP + map-ont
        if [[ ! -f "${out_dir}/${sample}_sup_map-ont.bam" ]]; then
            echo " Aligning SUP reads with map-ont preset..."
            minimap2 -t "$threads" --cs --MD -aLx map-ont "$ref" "$sup_fq" |
                samtools sort --write-index -@ "$threads" -o "${out_dir}/${sample}_sup_map-ont.bam"
        else
            echo " Skipping SUP map-ont: output already exists."
        fi

        # SUP + lr:hq
        if [[ ! -f "${out_dir}/${sample}_sup_lr-hq.bam" ]]; then
            echo " Aligning SUP reads with lr:hq preset..."
            minimap2 -t "$threads" --cs --MD -aLx lr:hq "$ref" "$sup_fq" |
                samtools sort --write-index -@ "$threads" -o "${out_dir}/${sample}_sup_lr-hq.bam"
        else
            echo " Skipping SUP lr:hq: output already exists."
        fi
    else
        echo " Warning: No SUP reads found for $sample."
    fi
done

4. Variant Calling

#!/usr/bin/env bash
set -euo pipefail

threads=1
log_file=""

usage() {
    cat <<EOF
Usage: $(basename "$0") -b <bam> -r <ref> -o <outvcf> -m <model> -v <version> -s <sample> [-t <threads>] [-l <log>]

Required arguments:
    -b, --bam      Input alignment BAM file
    -r, --ref      Reference FASTA file
    -o, --outvcf   Output VCF file path
    -m, --model    Base model name (e.g., dna_r10.4.1_e8.2_400bps_sup@v4.3.0)
    -v, --version  Model version (e.g., v4.3.0)
    -s, --sample   Sample name

Optional arguments:
    -t, --threads  Number of threads to use (default: 1)
    -l, --log      Log file to redirect stdout and stderr
    -h, --help     Show this help message
EOF
    exit 1
}

while [[ $# -gt 0 ]]; do
    case $1 in
    -b | --bam)
        aln="$2"
        shift 2
        ;;
    -r | --ref)
        ref="$2"
        shift 2
        ;;
    -o | --outvcf)
        outvcf="$2"
        shift 2
        ;;
    -m | --model)
        model="$2"
        shift 2
        ;;
    -v | --version)
        version="$2"
        shift 2
        ;;
    -s | --sample)
        sample="$2"
        shift 2
        ;;
    -t | --threads)
        threads="$2"
        shift 2
        ;;
    -l | --log)
        log_file="$2"
        shift 2
        ;;
    -h | --help) usage ;;
    *)
        echo "Error: Unknown parameter passed: $1"
        usage
        ;;
    esac
done

if [[ -z "${aln:-}" || -z "${ref:-}" || -z "${outvcf:-}" || -z "${model:-}" || -z "${version:-}" || -z "${sample:-}" ]]; then
    echo "Error: Missing required arguments."
    usage
fi

if [[ -n "$log_file" ]]; then
    exec &>"$log_file"
fi

model_name=$(echo "$model" | sed -E 's/.*dna_(.*)@.*/\1/')
model_name=$(echo "$model_name" | sed -E 's/\.//g')
model_name="${model_name}_${version}"
model_name=$(echo "$model_name" | sed -E 's/\.//g')

if [[ "$model" == *_fast@* ]]; then
    model_name=$(echo "$model_name" | sed -E 's/_fast/_hac/')
fi

model_path="/opt/models/${model_name}"
tmpoutdir=$(mktemp -d)
trap 'rm -rf "$tmpoutdir"' EXIT

echo "Running Clair3 with model: $model_path"

run_clair3.sh \
    --bam_fn="$aln" \
    --ref_fn="$ref" \
    --threads="$threads" \
    --platform="ont" \
    --model_path="$model_path" \
    --output="$tmpoutdir" \
    --sample_name="$sample" \
    --include_all_ctgs \
    --haploid_precise \
    --no_phasing_for_fa \
    --enable_long_indel

mv "${tmpoutdir}/merge_output.vcf.gz" "$outvcf"
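
The chain of `sed` calls that turns a Dorado model string into a Clair3 model directory name is easy to misread, so here is the same transformation as a small Python sketch. The function name is hypothetical, and the code only mirrors the string handling in the wrapper above. It does not check that the resulting directory exists under /opt/models.

```python
import re


def clair3_model_dir(model: str, version: str) -> str:
    """Mirror the sed steps: strip 'dna_' and '@...', drop dots, append version."""
    # Keep only the part between 'dna_' and '@'.
    name = re.sub(r".*dna_(.*)@.*", r"\1", model)
    if "_fast@" in model:
        # The wrapper maps fast basecalls onto the hac model.
        name = name.replace("_fast", "_hac", 1)
    # Append the version, then remove every dot from the combined name.
    return f"{name}_{version}".replace(".", "")


print(clair3_model_dir("dna_r10.4.1_e8.2_400bps_sup@v4.3.0", "v4.3.0"))
```

For the example model string from the usage text, this gives `r1041_e82_400bps_sup_v430`, which is the directory name the wrapper appends to /opt/models/.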

#!/bin/bash
cd /scratch/user/uqmhal11/minimap_preset_testing/variants || exit 1

URI="docker://quay.io/mbhall88/clair3:1.0.5"
script="/scratch/user/uqmhal11/minimap_preset_testing/scripts/04a_clair3_wrapper.sh"
threads=8

for bam in ../alignments/*.bam; do
    filename=$(basename "$bam")
    base="${filename%.bam}"

    if [[ "$base" == *"_hac_"* ]]; then
        read_model="hac"
        sample="${base%%_hac_*}"
        preset="${base##*_hac_}"
    elif [[ "$base" == *"_sup_"* ]]; then
        read_model="sup"
        sample="${base%%_sup_*}"
        preset="${base##*_sup_}"
    else
        echo "Warning: Could not parse filename $filename. Skipping."
        continue
    fi

    outdir="${read_model}/${preset}/${sample}"
    mkdir -p "$outdir"

    ref="../data/truth_vcfs/${sample}/mutreference.fna"
    outvcf="${outdir}/${sample}.vcf.gz"
    log="${outdir}/${sample}.log"

    if [[ -f "$outvcf" ]]; then
        echo "Skipping variant calling: $outvcf already exists."
        continue
    fi

    clair_model="dna_r10.4.1_e8.2_400bps_${read_model}@v4.3.0"
    job_name="clair_${sample}_${read_model}_${preset}"

    echo "Submitting: $job_name -> Output to $outdir/"

    ssubmit -t 2h -m 16g "$job_name" \
        "apptainer exec $URI bash $script -b $bam -r $ref -o $outvcf -m $clair_model -v v4.3.0 -s $sample -t $threads -l $log" -- -c $threads
done
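
The submission loop above recovers the sample, read model and preset from each BAM file name using bash parameter expansion. A rough Python equivalent is sketched below; `parse_alignment_name` and the example file names are hypothetical, and the sketch assumes the `<sample>_<hac|sup>_<preset>.bam` naming the loop expects.

```python
from typing import NamedTuple, Optional


class AlignmentName(NamedTuple):
    sample: str
    read_model: str
    preset: str


def parse_alignment_name(filename: str) -> Optional[AlignmentName]:
    """Split '<sample>_<hac|sup>_<preset>.bam' into its parts, or return None."""
    base = filename[:-4] if filename.endswith(".bam") else filename
    # Same order as the bash loop: try hac first, then sup.
    for read_model in ("hac", "sup"):
        tag = f"_{read_model}_"
        if tag in base:
            sample = base.split(tag, 1)[0]    # like ${base%%_hac_*}: before the first tag
            preset = base.rsplit(tag, 1)[1]   # like ${base##*_hac_}: after the last tag
            return AlignmentName(sample, read_model, preset)
    return None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ("sampleA_sup_lr:hq.bam", "sample_B_hac_map-ont.bam", "unparseable.bam"):
        print(name, "->", parse_alignment_name(name))
```

Returning `None` for unparseable names corresponds to the loop's warning-and-skip branch.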

5. Assessment

#!/bin/bash
set -euo pipefail

vcf=$1
ref=$2
faidx=$3
filter_script=$4
outdir=$5
sample=$6

max_indel=50
filter_vcf="${outdir}/${sample}.filter.vcf.gz"
log_file="${outdir}/${sample}_filter.log"

exec 2>"$log_file"

echo "Filtering variants for $sample..."

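# mktemp creates the file it names, so use the returned paths directly; appending
# a suffix would leave the real temporary files behind after the trap runs
contigs=$(mktemp)
header=$(mktemp)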
trap 'rm -f "$contigs" "$header"' EXIT

awk '{print "##contig=<ID="$1",length="$2">"}' "$faidx" >"$contigs"
(bcftools view -h "$vcf" |
    grep -v "^##contig=" |
    sed -e "3r $contigs") >"$header"

(bcftools reheader -h "$header" "$vcf" |
    python "$filter_script" |
    bcftools view -i 'GT="alt"' |
    bcftools view -e 'ALT="."' |
    bcftools norm -f "$ref" -a -c e -m - |
    bcftools norm -aD |
    bcftools filter -e "abs(ILEN)>${max_indel} || ALT=\"*\"" |
    bcftools +setGT - -- -t a -n c:M |
    bcftools sort |
    bcftools view -i 'GT="A"' -o "$filter_vcf")

bcftools index -f "$filter_vcf"
echo "Filtering complete!"

#!/bin/bash
set -euo pipefail

filter_vcf=$1
truth_vcf=$2
ref=$3
bed=$4
outdir=$5
sample=$6

max_indel=50
log_file="${outdir}/${sample}_assess.log"

exec 2>"$log_file"
echo "Assessing variants for $sample..."

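# grep exits non-zero when the VCF has no records; under pipefail that would abort
# the script before the default below is applied, hence the `|| true`
MAX_QUAL=$(bgzip -dc "$filter_vcf" | { grep -v '^#' || true; } | cut -f 6 | sort -gr | sed -n '1p')
MAX_QUAL=${MAX_QUAL:-100}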

vcfdist \
    "$filter_vcf" \
    "$truth_vcf" \
    "$ref" \
    --largest-variant "$max_indel" \
    --credit-threshold 1.0 \
    -p "${outdir}/${sample}." \
    -b "$bed" \
    -mx "$MAX_QUAL"

echo "Assessment complete!"

#!/bin/bash
cd /scratch/user/uqmhal11/minimap_preset_testing/variants || exit 1

filter_bash="/scratch/user/uqmhal11/minimap_preset_testing/scripts/05a_filter.sh"
assess_bash="/scratch/user/uqmhal11/minimap_preset_testing/scripts/05b_assess.sh"
filter_py="/scratch/user/uqmhal11/minimap_preset_testing/scripts/filter_hets.py"
truth_base="../data/truth_vcfs"

find . -mindepth 4 -maxdepth 4 -name "*.vcf.gz" | grep -v "\.filter\.vcf\.gz" | while read -r vcf; do
    vcf_clean=${vcf#./}
    read_model=$(echo "$vcf_clean" | cut -d'/' -f1)
    preset=$(echo "$vcf_clean" | cut -d'/' -f2)
    sample=$(echo "$vcf_clean" | cut -d'/' -f3)

    outdir=$(dirname "$vcf")
    ref="${truth_base}/${sample}/mutreference.fna"
    faidx="${ref}.fai"
    truth_vcf="${truth_base}/${sample}/truth.vcf.gz"
    bed="${truth_base}/${sample}/${sample}.bed"
    filter_vcf="${outdir}/${sample}.filter.vcf.gz"

    # skip submitting a job if the output already exists
    if [[ -f "$filter_vcf" ]]; then
        echo "Output $filter_vcf already exists, skipping $vcf"
        continue
    fi

    job_name="eval_${sample}_${read_model}_${preset}"

    ssubmit -t 1h -m 4g "$job_name" \
        "bash $filter_bash $vcf $ref $faidx $filter_py $outdir $sample && bash $assess_bash $filter_vcf $truth_vcf $ref $bed $outdir $sample"
done

6. Aggregation and Plotting

#!/usr/bin/env python3
import pandas as pd
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt
import math

base_dir = Path("/scratch/user/uqmhal11/minimap_preset_testing/variants")

print("Gathering summary files...")
summary_data = []

for filepath in base_dir.rglob("*.precision-recall-summary.tsv"):
    rel_path = filepath.relative_to(base_dir)
    read_model = rel_path.parts[0]
    preset = rel_path.parts[1]
    sample = rel_path.parts[2]
    
    df = pd.read_csv(filepath, sep='\t')
    df['read_model'] = read_model
    df['preset'] = preset
    df['sample'] = sample
    
    summary_data.append(df)

if not summary_data:
    print("Error: No summary data found. Check your paths.")
    exit(1)

merged_df = pd.concat(summary_data, ignore_index=True)
# Filter for the 'BEST' threshold to standardise the comparison
best_df = merged_df[merged_df['THRESHOLD'] == 'BEST'].copy()

# Filter out SVs
best_df = best_df[best_df['VAR_TYPE'] != 'SV']

# Sort the data by sample, read_model, and VAR_TYPE
best_df = best_df.sort_values(['sample', 'read_model', 'VAR_TYPE'])

out_csv = base_dir / "aggregated_precision_recall_summaries.csv"
best_df.to_csv(out_csv, index=False)
print(f"Aggregated data saved to {out_csv}")

print("Generating comparative plots...")
sns.set_theme(style="whitegrid")

for var_type in ['SNP', 'INDEL']:
    plot_df = best_df[best_df['VAR_TYPE'] == var_type].copy()
    
    if plot_df.empty:
        continue

    fig, ax1 = plt.subplots(figsize=(10, 8))
    fig.suptitle(f"{var_type} Performance: minimap2 Presets", fontsize=16)

    sns.boxplot(
        data=plot_df, x='preset', y='F1_QSCORE', hue='read_model', 
        ax=ax1, palette="Set2"
    )
    sns.stripplot(
        data=plot_df, x='preset', y='F1_QSCORE', hue='read_model', 
        ax=ax1, dodge=True, color='black', alpha=0.5, size=4, legend=False
    )
    
    ax1.set_xlabel("minimap2 Preset", fontsize=12)
    ax1.set_ylabel("Phred-scaled F1 Q-Score", fontsize=12)

    ax2 = ax1.twinx()
    
    f1_targets = [0.5, 0.8, 0.9, 0.95, 0.98, 0.99, 0.995, 0.998, 0.999, 0.9995, 0.9999, 0.99995, 0.99999, 0.999999]
    f1_labels = ["0.50", "0.80", "0.90", "0.95", "0.98", "0.99", "0.995", "0.998", "0.999", "0.9995", "0.9999", "0.99995", "0.99999", "0.999999"]
    q_targets = [-10 * math.log10(1 - f1) for f1 in f1_targets]
    
    ax2.set_yticks(q_targets)
    ax2.set_yticklabels(f1_labels)
    ax2.set_ylim(ax1.get_ylim())
    ax2.set_ylabel("Raw F1 Score", rotation=270, labelpad=20, fontsize=12)
    ax2.grid(False)

    handles, labels = ax1.get_legend_handles_labels()
    ax1.legend(handles, labels, title="Read Model", loc="lower right")

    plt.tight_layout()
    plot_path = base_dir / f"preset_comparison_{var_type}.png"
    plt.savefig(plot_path, dpi=300)
    plt.close()
    print(f"Saved plot: {plot_path}")

#!/usr/bin/env python3
import pandas as pd
from pathlib import Path

base_dir = Path("/scratch/user/uqmhal11/minimap_preset_testing/variants")
csv_path = base_dir / "aggregated_precision_recall_summaries.csv"
out_table_path = base_dir / "preset_comparison_summary.md"

if not csv_path.exists():
    print(f"Error: {csv_path} not found. Please run the aggregation script first.")
    exit(1)

df = pd.read_csv(csv_path)
df = df[df['VAR_TYPE'] != 'SV']
summary_df = df.groupby(['VAR_TYPE', 'read_model', 'preset'])[['PREC', 'RECALL', 'F1_SCORE', 'F1_QSCORE']].mean().reset_index()
summary_df = summary_df.sort_values(['VAR_TYPE', 'read_model', 'preset'], ascending=[False, True, True])

summary_df['PREC'] = summary_df['PREC'].apply(lambda x: f"{x * 100:.3f}%")
summary_df['RECALL'] = summary_df['RECALL'].apply(lambda x: f"{x * 100:.3f}%")
summary_df['F1_SCORE'] = summary_df['F1_SCORE'].apply(lambda x: f"{x * 100:.3f}%")
summary_df['F1_QSCORE'] = summary_df['F1_QSCORE'].apply(lambda x: f"{x:.2f}")

summary_df.columns = ['Variant Type', 'Read Model', 'Preset', 'Mean Precision', 'Mean Recall', 'Mean F1 Score', 'Mean F1 Q-Score']

try:
    markdown_table = summary_df.to_markdown(index=False)
    with open(out_table_path, "w") as f:
        f.write("# minimap2 Preset Performance Summary\n\n")
        f.write("*Values represent the mean across all tested samples at the 'BEST' threshold.*\n\n")
        f.write(markdown_table)
        f.write("\n")
    print(f"Markdown table saved to: {out_table_path}")
except ImportError:
    fallback_path = base_dir / "preset_comparison_summary.tsv"
    summary_df.to_csv(fallback_path, sep='\t', index=False)
    print(f"TSV table saved to: {fallback_path}")

#!/usr/bin/env python3
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

base_dir = Path("/scratch/user/uqmhal11/minimap_preset_testing/variants")
plot_dir = base_dir / "plots"
plot_dir.mkdir(exist_ok=True)

csv_path = base_dir / "aggregated_precision_recall_summaries.csv"

if not csv_path.exists():
    print(f"Error: {csv_path} not found.")
    exit(1)

data = pd.read_csv(csv_path)

named_colors = {
    "black": "#000000",
    "orange": "#e69f00",
    "skyblue": "#56b4e9",
    "bluish green": "#009e73",
    "yellow": "#f0e442",
    "blue": "#0072b2",
    "vermilion": "#d55e00",
    "reddish purple": "#cc79a7",
}
cud_palette = list(named_colors.values())

def cud(n: int = len(cud_palette), start: int = 0) -> list[str]:
    remainder = cud_palette[:start]
    palette = cud_palette[start:] + remainder
    return palette[:n]

sns.set_theme(style="whitegrid")

metrics = ["F1_SCORE", "PREC", "RECALL"]
x = "read_model"
hue = "preset"
var_types = ["SNP", "INDEL"]

order = sorted(data[x].unique())
hue_order = sorted(data[hue].unique())
pal = {c: cud()[i] for i, c in enumerate(hue_order)}

for y in metrics:
    fig, axes = plt.subplots(
        nrows=1,
        ncols=len(var_types),
        figsize=(12, 6),
        dpi=300,
        sharey=True,
    )
    
    for i, vartype in enumerate(var_types):
        ax = axes[i]
        legend = (i == 0)
        
        df = data.query("VAR_TYPE == @vartype").copy()
        if df.empty:
            continue

        cap = 0.99999
        df.loc[:, y] = df[y].apply(lambda v: cap if v > cap else v)
        
        yticks = [0.5, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999, cap]
        yticklabels = [f"{yval:.2%}" for yval in yticks]

        box_kws = {
            "data": df, "x": x, "y": y, "order": order, "hue": hue,
            "ax": ax, "palette": pal, "fliersize": 0, "legend": legend,
        }
        
        if int(sns.__version__.split('.')[1]) >= 13:
            box_kws["fill"] = False
            box_kws["gap"] = 0.2
        
        sns.boxplot(**box_kws)

        sns.stripplot(
            data=df, x=x, y=y, order=order, hue=hue, ax=ax,
            palette=pal, alpha=0.5, dodge=True, legend=False,
            linewidth=0.5, edgecolor="black",
        )

        ax.set_yscale("logit", nonpositive="clip")
        ax.set_yticks(yticks)
        ax.set_yticklabels(yticklabels)
        
        ylabel = {"F1_SCORE": "F1 score", "PREC": "Precision", "RECALL": "Recall"}[y]
        ax.set_ylabel(f"{vartype} {ylabel}")
        ax.set_xlabel("")
        ax.tick_params(axis="x", labelsize=12)
        ax.set_title(f"{vartype} {ylabel}")

        if legend:
            handles, labels = ax.get_legend_handles_labels()
            for h in handles:
                h.set_linewidth(3)
            ax.legend(
                handles=handles, labels=labels, framealpha=1.0,
                fancybox=True, shadow=True, title="Preset",
            )

    fig.tight_layout()
    out_file = plot_dir / f"boxplot_strip_{y}.png"
    fig.savefig(out_file, dpi=300)
    print(f"Saved plot: {out_file}")

#!/usr/bin/env python3
import math
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

base_dir = Path("/scratch/user/uqmhal11/minimap_preset_testing/variants")
plot_dir = base_dir / "plots"
plot_dir.mkdir(exist_ok=True)

named_colors = {
    "black": "#000000",
    "orange": "#e69f00",
    "skyblue": "#56b4e9",
    "bluish green": "#009e73",
    "yellow": "#f0e442",
    "blue": "#0072b2",
    "vermilion": "#d55e00",
    "reddish purple": "#cc79a7",
}
cud_palette = list(named_colors.values())

def cud(n: int = len(cud_palette), start: int = 0) -> list[str]:
    remainder = cud_palette[:start]
    palette = cud_palette[start:] + remainder
    return palette[:n]

sns.set_theme(style="whitegrid")

frames = []
pr_files = list(base_dir.rglob("*.precision-recall.tsv"))

for p in pr_files:
    rel_path = p.relative_to(base_dir)
    read_model = rel_path.parts[0]
    preset = rel_path.parts[1]
    sample = rel_path.parts[2]      
    
    df = pd.read_csv(p, sep="\t")
    df["sample"] = sample
    df["read_model"] = read_model
    df["preset"] = preset
    frames.append(df)

if not frames:
    print("Error: No precision-recall data found.")
    exit(1)

pr_df = pd.concat(frames, ignore_index=True)
samples = set(pr_df["sample"])

metrics = []
for vartype in ["SNP", "INDEL"]:
    for model in ["hac", "sup"]:
        for preset in pr_df["preset"].unique():
            data = pr_df.query("VAR_TYPE == @vartype and read_model == @model and preset == @preset")
            if data.empty: continue
                
            for q in sorted(set(data["MIN_QUAL"])):
                subdf = data.query("MIN_QUAL == @q")
                if set(subdf["sample"]) == samples:
                    tps = subdf["TRUTH_TP"].sum()
                    fps = subdf["QUERY_FP"].sum()
                    fns = subdf["TRUTH_FN"].sum()
                    
                    if (tps + fps) == 0 or (tps + fns) == 0: continue
                        
                    precision = tps / (tps + fps)
                    recall = tps / (tps + fns)
                    f1 = 2 * (precision * recall) / (precision + recall)
                    
                    metrics.append((preset, q, precision, recall, f1, vartype, model))

aggdf = pd.DataFrame(
    metrics,
    columns=["preset", "QUAL", "precision", "recall", "f1", "vartype", "read_model"],
)

aggdf.to_csv(plot_dir / "aggregated_pr_metrics.tsv", sep="\t", index=False)

vartypes = ["SNP", "INDEL"]
read_models = ["hac", "sup"]

fig, axes = plt.subplots(
    nrows=len(vartypes),
    ncols=len(read_models),
    figsize=(12, 10),
    dpi=300,
    sharex=True,
    sharey=True,
)

x = "recall"
y = "precision"
hue = "preset"

hue_order = sorted(set(aggdf[hue]))
pal = {c: cud()[i] for i, c in enumerate(hue_order)}

i = 0
legend = True
for vartype in vartypes:
    for model in read_models:
        ax = axes.flatten()[i]
        data = aggdf.query("vartype == @vartype and read_model == @model").copy()
        
        cap = 0.99999
        data.loc[:, y] = data[y].apply(lambda v: cap if v > cap else v)

        sns.lineplot(
            data=data, x=x, y=y, hue=hue, hue_order=hue_order,
            ax=ax, palette=pal, alpha=0.9, linewidth=2, legend=legend,
        )

        if legend:
            handles, labels = ax.get_legend_handles_labels()
            ax.legend().remove()
            legend = False

        ax.set_yscale("logit", nonpositive="clip")
        yticks = [0.8, 0.9, 0.95, 0.99, 0.999, 0.9999, cap]
        yticklabels = [f"{yval:.2%}" for yval in yticks]
        ax.set_yticks(yticks)
        ax.set_yticklabels(yticklabels)

        xticks = [0, 0.25, 0.5, 0.75, 1.0]
        xticklabels = [f"{xval:.2%}" for xval in xticks]
        ax.set_xticks(xticks)
        ax.set_xticklabels(xticklabels)
        ax.set_title(f"{vartype} ({model})")
        i += 1

for h in handles:
    h.set_linewidth(3)

plt.tight_layout()
leg_cols = math.ceil(len(hue_order))
fig.legend(
    handles=handles, labels=labels, loc="upper center",
    bbox_to_anchor=(0.5, 1.05), ncol=leg_cols, title="minimap2 preset",
    framealpha=1.0, fancybox=True, shadow=True,
)

out_png = plot_dir / "aggregated_precision_recall.png"
fig.savefig(out_png, bbox_inches="tight", dpi=300)
print(f"Saved plot: {out_png}")
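
Note that the precision-recall script above pools `TRUTH_TP`, `QUERY_FP` and `TRUTH_FN` across samples at each quality threshold before computing the metrics, whereas the summary table script averages per-sample values. The pooled calculation is sketched below in plain Python. `pooled_metrics` is a hypothetical helper and the counts in the usage example are made up. Unlike the script, the sketch also guards against the case where precision and recall are both zero.

```python
from typing import Iterable, Optional, Tuple


def pooled_metrics(
    counts: Iterable[Tuple[int, int, int]],
) -> Optional[Tuple[float, float, float]]:
    """Pool (tp, fp, fn) tuples across samples, then return (precision, recall, f1)."""
    rows = list(counts)
    tp = sum(r[0] for r in rows)
    fp = sum(r[1] for r in rows)
    fn = sum(r[2] for r in rows)
    # Mirrors the script's `continue` when a denominator would be zero.
    if tp + fp == 0 or tp + fn == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Two made-up samples at a single quality threshold.
    print(pooled_metrics([(99, 1, 1), (49, 1, 1)]))
```

Pooling weights each sample by its number of variants, so a sample with many variants influences the curve more than it influences a per-sample mean.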

7. Helper Scripts

"""This script takes a VCF file and forces HETs to be homozygous for the allele with 
the highest depth
"""

import argparse

import cyvcf2


def main(args):
    vcf = cyvcf2.VCF(args.vcf, gts012=True)
    vcf_out = cyvcf2.Writer("-", vcf, mode="w")
    for variant in vcf:
        v_type = variant.gt_types[0]
        if v_type == 1:  # HET
            use_ref = True
            allelic_counts = variant.INFO.get("AC")
            if "AD" in variant.FORMAT:
                allele_depths = variant.format("AD")[0]
                ref_depth = allele_depths[0]
                alt_depth = allele_depths[1]
                if alt_depth > ref_depth:
                    use_ref = False
            elif allelic_counts is not None:
                ref_count = allelic_counts[0]
                alt_count = allelic_counts[1]
                if alt_count > ref_count:
                    use_ref = False
            else:
                raise KeyError(f"Could not find allele counts for variant {variant}")

            if not use_ref:
                variant.genotypes[0] = [1, 1, False]
            else:
                variant.genotypes[0] = [0, 0, False]
            # reassign so cyvcf2 writes the modified genotypes back to the record
            variant.genotypes = variant.genotypes

        vcf_out.write_record(variant)

    vcf_out.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "vcf",
        help="Input VCF file",
        default="-",
        type=argparse.FileType("r"),
        nargs="?",
    )
    args = parser.parse_args()
    main(args)
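
Stripped of the VCF handling, the decision rule in the helper above is small enough to state as a pure function. The sketch below uses a hypothetical `resolve_het` helper and made-up depths; it only restates the comparison the script performs on the `AD` values.

```python
from typing import Tuple


def resolve_het(ref_depth: int, alt_depth: int) -> Tuple[int, int]:
    """Return the homozygous genotype that replaces a heterozygous call.

    The ALT allele wins only when it has strictly more supporting reads;
    ties fall back to the reference, as in the script.
    """
    return (1, 1) if alt_depth > ref_depth else (0, 0)


if __name__ == "__main__":
    # Made-up allele depths, for illustration only.
    for ref_depth, alt_depth in [(3, 10), (5, 5), (10, 3)]:
        print(ref_depth, alt_depth, "->", resolve_het(ref_depth, alt_depth))
```

In the filtering pipeline, calls reset to homozygous reference this way are then dropped by the subsequent `bcftools view -i 'GT="alt"'` step.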

¹ The $2/(w+1)$ retention rate is a fundamental property of the minimizer (or winnowing) algorithm, formalised by Schleimer et al. (2003) and Roberts et al. (2004), and it dictates k-mer sampling density [5, 6]. When a window of size $w$ slides forward by one position, the algorithm is effectively evaluating a combined pool of $w+1$ k-mers (one dropping out, $w-1$ shared between the two windows, and one entering). Assuming a (relatively) random DNA sequence, the chosen minimizer only changes if the lowest hash value in that entire $w+1$ pool sits at one of the two ends: either the k-mer that just exited the window (probability $1/(w+1)$) or the k-mer that just entered (probability $1/(w+1)$). Summing these mutually exclusive events gives the $2/(w+1)$ probability that a new seed is saved. Therefore, map-ont ($w=10$) retains 2/11 (~18%) of its k-mers as minimizers, while lr:hq ($w=19$) retains 2/20 (10%).
² The F1 Q-score is the Phred-scaled equivalent of the standard F1 score, calculated as $-10 \log_{10}(1 - F1)$. This is useful when variant calling accuracies exceed 99.9%, as comparing linear F1 scores (e.g., 0.9990 vs 0.9999) becomes visually and intuitively difficult. Applying the standard Phred scale converts these fractional monstrosities into simpler logarithmic values (for instance, an F1 of 0.999 becomes Q30, and 0.9999 becomes Q40), making microscopic differences in pipeline performance much easier to quantify.
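
Both footnotes can be checked numerically. The sketch below is a toy model, not minimap2's implementation: random numbers stand in for k-mer hashes, and the function names (`f1_to_qscore`, `expected_density`, `simulated_density`) are hypothetical. It converts F1 to a Q-score and compares the $2/(w+1)$ expectation with the fraction of positions selected as minimizers in a simulated sequence.

```python
import math
import random


def f1_to_qscore(f1: float) -> float:
    """Phred-scale an F1 score; undefined for a perfect F1 of exactly 1."""
    return -10 * math.log10(1 - f1)


def expected_density(w: int) -> float:
    """Expected fraction of k-mers kept as minimizers for window size w."""
    return 2 / (w + 1)


def simulated_density(w: int, n_kmers: int = 50_000, seed: int = 0) -> float:
    """Fraction of positions picked as the window minimum over random 'hashes'."""
    rng = random.Random(seed)
    hashes = [rng.random() for _ in range(n_kmers)]  # stand-ins for k-mer hashes
    selected = set()
    for start in range(n_kmers - w + 1):
        window = hashes[start:start + w]
        selected.add(start + window.index(min(window)))
    return len(selected) / n_kmers


if __name__ == "__main__":
    for w in (10, 19):  # the map-ont and lr:hq window sizes
        print(f"w={w}: expected {expected_density(w):.3f}, simulated {simulated_density(w):.3f}")
    for f1 in (0.999, 0.9999):
        print(f"F1={f1}: Q{f1_to_qscore(f1):.1f}")
```

The simulated values should land close to 2/11 and 2/20 rather than match them exactly, because the sequence is finite and the random stand-ins ignore details such as strand handling.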

Cite this post

Hall, M. B. (2026). Minimap2 lr:hq preset testing. mbhall88.github.io. Zenodo. doi:10.5281/zenodo.19717304

@misc{hall2026index,
  author = {Hall, Michael B.},
  title = {Minimap2 lr:hq preset testing},
  year = {2026},
  howpublished = {\url{https://mbhall88.github.io/post/minimap2-lrhq-preset-testing/}},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.19717304}
}

References

1. Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. [doi:10.1093/bioinformatics/bty191]
2. Hall, M. et al. (2024). Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data. eLife. [doi:10.7554/eLife.98300]
3. Zheng, Z. et al. (2022). Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science. [doi:10.1038/s43588-022-00387-x]
4. Dunn, T. et al. (2023). vcfdist: accurately benchmarking phased small variant calls in human genomes. Nature Communications. [doi:10.1038/s41467-023-43876-x]
5. Schleimer, S. et al. (2003). Winnowing. Proceedings of the 2003 ACM SIGMOD international conference on Management of data. [doi:10.1145/872757.872770]
6. Roberts, M. et al. (2004). Reducing storage requirements for biological sequence comparison. Bioinformatics. [doi:10.1093/bioinformatics/bth408]
