Halfbakery: Homopolymers begone

So when you're sequencing some genome... as you do... using one of the high-throughput third-generation sequencing technologies, one of the annoyances of the data is indels in homopolymer runs.

In essence: Both PacBio and nanopore long reads have trouble determining the exact length of a run of the same base, so the resulting sequence has systematic errors which can't be fixed by higher read depth (reading the same DNA template more times). This is true even though they're very distinct technologies, so it's probably going to be an issue until it's fixed.
Obviously, one can resolve many of these with a short-read technology (e.g. Illumina), since that has a different error profile. However, this may not be a panacea, since short reads may not be able to resolve repeats.

Now, you may say, hey, it's just one more or less base in a run of umpteen bases in a set of degenerate regions, who cares?
But I say no I'm sequencing this shit and I want it to be right, dammit.[1]

So. How do we resolve this?
The long read tech is basically trying to read off the bases in a single molecule of DNA as it goes through an enzyme or channel of some kind, somehow. The issue arises because the signal of each base in a run (of the same base) is indistinguishable. You know how if you're trying to count a set of identical things, sometimes you lose your place? Well, that.
The solution to losing your place is to add variety, so that you are able to avoid slipping position.
I propose that we add a pre-processing stage which corrupts or substitutes a fraction of the bases, such that they give a distinct signal. Obviously it's not as simple as just bollixing bases in random ways; the product will need to pass through the molecular machinery without jamming it up. This will be highly dependent on what the sequencing technology is doing.
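To make the "add variety" idea concrete, here's a rough sketch in Python. It's a toy, not a model of any real chemistry: the function names and the example sequence are made up for illustration, and a "modified" base is simply shown in lower case on the assumption that it reads as a distinct signal.

```python
from itertools import groupby
import random


def homopolymer_runs(seq, min_len=2):
    """Return (start, base, length) for each run of identical symbols
    that is at least min_len long."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def mark_bases(seq, fraction, rng):
    """Toy pre-processing step: each base is independently 'modified'
    with probability `fraction`. Modified bases are written in lower
    case, so they count as a different symbol from the unmodified base."""
    return "".join(b.lower() if rng.random() < fraction else b for b in seq)


if __name__ == "__main__":
    seq = "ACGTTTTTTTTTTGCAAAAAAAAC"  # made-up example sequence
    print("before:", homopolymer_runs(seq, min_len=6))
    marked = mark_bases(seq, fraction=1 / 6, rng=random.Random(0))
    print("marked:", marked)
    print("after: ", homopolymer_runs(marked, min_len=6))
```

Because upper and lower case are different symbols, any marked base splits a long run into shorter, countable pieces, which is the whole point of the proposal.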

As I understand it, nanopore sequencing essentially observes the leakage of ions around the DNA as it passes through a pore. Provided the DNA can pass through the associated unwinding enzymes, perhaps any change will help.
However, one (pretty elegant, I think) approach would be to find a methylase enzyme which methylates one of the bases of a run. For genomic sequencing, at least two would be required - one for polyA or polyT, and one for polyG or polyC. Four enzymes, one for each nucleotide, would however be desirable to give complete coverage. NEB currently supplies EcoGII Methyltransferase commercially, which apparently indiscriminately methylates adenine residues (N6), so that's a start - and this would be immediately testable.

PacBio essentially reads off individual base additions as DNA is synthesised by DNA polymerase, through detection of a side-product. That's a bit more finicky, because the inserted base needs to match specifically. One -pretty extreme- approach would be to try to /insert/ additional bases into the strand. Ideally, non-canonical bases forming a new base-pair. Being distinct isn't completely essential, though. And I note there is precedent for such an enzymatic system in the RNA insertion editing pathway of Trypanosoma brucei.
PacBio has the option of forming a loop of DNA, the bases of which can be read in multiple cycles. This currently seems to solve everything except homopolymer run lengths over about 6. So if we could incorporate about one additional base per 5 originals, or alter about one in six to something that reads as non-canonical, essentially all the systematic errors would be solved.
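A quick back-of-envelope check on the "one in six" figure, as a Python sketch. The big assumption here (mine, not anything known about a real enzyme) is that each base in a run gets modified independently at random with the same probability. Under that assumption, the function below computes the chance that a run still contains a stretch of more than 6 unmodified bases in a row. The function name and the threshold parameter are just for illustration.

```python
def prob_unbroken_stretch(run_len, p_mod, max_ok=6):
    """Probability that a homopolymer run of run_len bases still contains
    more than max_ok consecutive unmodified bases, assuming each base is
    independently modified with probability p_mod."""
    # state[k] = probability that the trailing unmodified stretch currently
    # has length k AND no stretch has exceeded max_ok so far.
    state = [0.0] * (max_ok + 1)
    state[0] = 1.0
    for _ in range(run_len):
        new = [0.0] * (max_ok + 1)
        for k, pr in enumerate(state):
            new[0] += pr * p_mod                 # modified base resets the stretch
            if k < max_ok:
                new[k + 1] += pr * (1 - p_mod)   # stretch grows, still countable
            # k == max_ok and unmodified: stretch is now too long, so that
            # probability mass leaves the "still fine" states.
        state = new
    return 1.0 - sum(state)


if __name__ == "__main__":
    for n in (6, 7, 10, 15, 20):
        print(n, round(prob_unbroken_stretch(n, 1 / 6), 3))
```

For a run of exactly 7 this is just (5/6)^7, about 28%, so purely random modification at one in six would leave a fair number of runs unbroken. Evenly spaced modification, or a somewhat higher fraction, would do better than this sketch suggests.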

[1] But /is/ this a massive engineering effort for mostly very little gain?
I spent a great deal of time and effort correcting 415 errors in a PacBio-generated 6 megabase microbial genome; all but one were homopolymer indels. Most other people sequencing genomes don't bother, and probably most of those in smaller labs getting individual organisms of interest sequenced don't even realise there's an issue. However, errors like these probably affect something like 5% of protein sequences in these genomes, so it's not insignificant.
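For a rough sense of scale, here's the arithmetic as a Python sketch. The 415 errors and 6 megabases are from above; the 1 kb gene length is my own round-number assumption for a typical bacterial gene, and the calculation pretends errors land uniformly and independently along the genome (a Poisson approximation), which homopolymer runs certainly don't do. The function name is illustrative.

```python
import math


def fraction_of_genes_hit(errors, genome_bp, gene_bp):
    """Poisson estimate of the fraction of genes of length gene_bp that
    carry at least one error, assuming errors are scattered uniformly."""
    rate_per_bp = errors / genome_bp
    return 1.0 - math.exp(-rate_per_bp * gene_bp)


if __name__ == "__main__":
    errors = 415           # from the post
    genome_bp = 6_000_000  # from the post (approximate)
    gene_bp = 1_000        # ASSUMPTION: round figure for a bacterial gene
    print(f"one error per {genome_bp / errors:,.0f} bp")
    print(f"genes with at least one error: "
          f"{fraction_of_genes_hit(errors, genome_bp, gene_bp):.1%}")
```

Under those assumptions it comes out at a few percent of genes, which is the same ballpark as the "something like 5%" figure.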

***

I think Max would have liked this idea, so I dedicate it to his memory.