halfbakery
So when you're sequencing some genome... as you do...
using one of the high-throughput third-generation
sequencing technologies, one of the
annoyances of the data is indels in
homopolymer runs.
In essence: Both pacbio and nanopore long-sequence
reads have problems in determining the exact
length of a
run of the same base, so the
resultant sequence exhibits systematic errors
which can't be fixed by higher read-depth (reading the
same DNA template more times). This is true even
though they're very distinct
technologies, so it's probably going to be an issue until
it's fixed.
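
To make "homopolymer run" concrete, here is a minimal Python sketch. The function names and the `min_len` cut-off are my own choices for illustration, not anything from a real sequencing toolkit. It lists the runs in a sequence, and also shows the "collapsed" sequence, which is roughly the part the long-read technologies tend to get right even when the run lengths are off.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

def collapse_runs(seq):
    """Squash every run down to a single base, discarding the run lengths."""
    return "".join(base for base, _ in groupby(seq))

# Made-up example sequence: a run of six T and a run of four A.
example = "ACG" + "T" * 6 + "GC" + "A" * 4 + "C"
print(homopolymer_runs(example))
print(collapse_runs(example))
```

Two reads which differ only by a homopolymer indel collapse to the same string, which is why these errors are easy to spot but hard to call.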
Obviously, one can resolve many of these with a short
read technology (e.g. illumina), since these have a
different error profile. However, this
may not be a panacea, since they may not be
able to resolve repeats.
Now, you may say, hey, it's just one more or less base in
a run of umpteen bases in a set of degenerate regions,
who cares?
But I say no: I'm sequencing this shit and I want it to be
right, dammit.[1]
So. How do we resolve this?
The long read tech is basically trying to read off the
bases in a single molecule of DNA as it goes through an
enzyme or channel of some kind, somehow. The issue
arises because the signal of each base in a run (of the same base)
is indistinguishable. You know how if you're trying to
count a set of identical things, sometimes you lose your
place? Well, that.
The solution to losing your place is to add variety, so that you
are able to avoid slipping position.
I propose that we add a pre-processing stage which corrupts or substitutes a fraction of the
bases, such that they give a distinct signal. Obviously it's
not as simple as just bollixing
bases in random ways; the product will need
to pass through the molecular machinery without
jamming it up. This will be highly dependent on what the
sequencing technology is doing.
As I understand it, nanopore sequencing essentially
observes the leakage of ions around the DNA as it passes
through a pore. Provided the DNA
can pass through the associated unwinding
enzymes, perhaps any change will help.
However, one (pretty elegant, I think) approach would
be to find a methylase enzyme which methylates one of
the bases of a run. For genomic
sequencing, at least two would be required -
one for polyA or polyT, and one for polyG or polyC. Four
enzymes, one for each nucleotide, would however be
desirable to give complete
coverage.
NEB currently supplies EcoGII Methyltransferase
commercially, which apparently indiscriminately
methylates adenine residues (N6), so that's a
start - and this would be immediately testable.
Pacbio essentially reads off individual base additions as
DNA is synthesised by DNA polymerase, through detection
of a side-product. That's a bit
more finicky, because the inserted base
needs to match specifically.
One -pretty extreme- approach would be to try to
/insert/ additional bases into the strand. Ideally, non-canonical
bases forming a new base-pair. Being distinct isn't completely essential,
though. And I note there is precedent for such an
enzymatic system in the RNA insertion editing pathway
of Trypanosoma brucei.
Pacbio has the option of forming a loop of DNA, the bases of which can be read in multiple
cycles.
This currently seems to solve everything except homopolymer run lengths over about 6. So
if we could incorporate about one additional base per 5 originals, or alter about one in six
so that it gives a non-canonical read, essentially all the systematic errors would be solved.
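
A rough way to see what marking fraction would be needed, as a Python sketch. The big assumptions are mine: each base in a run gets marked independently at random with probability `mark_prob` (a real enzyme may well not behave like that), and a run counts as "resolvable" if no unmarked stretch in it is longer than `max_countable`, taken as 6 from the figure above. All the names are made up for the example.

```python
import random

def longest_unmarked_stretch(marks):
    """Length of the longest run of unmarked (False) positions."""
    longest = current = 0
    for marked in marks:
        current = 0 if marked else current + 1
        longest = max(longest, current)
    return longest

def fraction_resolvable(run_len, mark_prob, max_countable=6, trials=10000, seed=0):
    """Monte Carlo estimate of how often a homopolymer run of run_len bases
    ends up with every unmarked stretch short enough to count reliably."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        marks = [rng.random() < mark_prob for _ in range(run_len)]
        if longest_unmarked_stretch(marks) <= max_countable:
            ok += 1
    return ok / trials

if __name__ == "__main__":
    for p in (1 / 6, 1 / 3, 1 / 2):
        print(f"mark_prob={p:.2f}: {fraction_resolvable(20, p):.3f} of 20-base runs resolvable")
```

Perfectly regular marking of every sixth base would leave unmarked stretches of exactly five. Random marking is less tidy, so the sketch is mostly useful for seeing how much higher than one in six the marking rate would need to be.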
[1] But /is/ this a massive engineering effort for mostly
very little gain?
I spent a great deal of time and effort correcting 415
errors in a pacbio-generated 6 megabase microbial
genome; all but one were homopolymer indels.
Most other people sequencing genomes don't bother; probably most of those in smaller labs getting
individual organisms of interest sequenced
don't even realise there's an issue. However, that probably affects
something like 5% of protein sequences
in these genomes, so it's not insignificant.
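
As a back-of-envelope check on that figure, here is a small Python sketch. The 415 errors and 6 megabases are from the footnote above. The mean gene length of about 1 kb and the idea that errors land uniformly along the genome are my assumptions (homopolymer runs are probably over-represented outside coding regions, so this likely overestimates a little).

```python
import math

ERRORS = 415            # from the footnote
GENOME_BP = 6_000_000   # from the footnote
MEAN_GENE_BP = 1000     # assumption: a typical bacterial gene is about 1 kb

def errors_per_base(errors, genome_bp):
    return errors / genome_bp

def fraction_of_genes_hit(errors, genome_bp, mean_gene_bp):
    """Poisson estimate of the fraction of genes carrying at least one error,
    assuming errors fall uniformly along the genome."""
    mean_errors_per_gene = errors_per_base(errors, genome_bp) * mean_gene_bp
    return 1.0 - math.exp(-mean_errors_per_gene)

rate = errors_per_base(ERRORS, GENOME_BP)
print(f"one error per {1 / rate:,.0f} bases")
print(f"genes hit: {fraction_of_genes_hit(ERRORS, GENOME_BP, MEAN_GENE_BP):.1%}")
```

Under those assumptions the estimate comes out at roughly 6 to 7% of genes, which is in the same ballpark as the "something like 5%" above.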
***
I think Max would have liked this idea, so I dedicate it to
his memory.
Annotation:
|
|
Is this a genuine idea, or is it a HOX ... ? |
|
|
......@...... @...... @ ...... |
|
|
You recognised it as a witticism ? And then added to the pun ? |
|
|
Are you quite sure you're American ? |
|
|
Your Homeland Security file must be interesting ... "Can read, has travelled outside home state ... dangerous subversive intellectual, to be kept under close surveillance ... " |
|
|
Question: Do they pay for the laundry of the white robe with the pointy hood, or do you offset it against tax as a legitimate expense, or do you just have to do it yourself ? Presumably it counts as obligatory "work uniform" ... |
|
|
Bloodstains can be so stubborn, even with biological washing powders. |
|
|
Fair enough. And if you want to know, that's one of the reasons for BorgCo's hostile takeover of ACME Corp. It's not something we're ashamed of or anything. Frustrated and embarrassed, yes, but not actually ashamed ... |
|
|
(Disclaimer: this is WAAY outside my knowledge-base...)
Q1: Can you run 2 scans in parallel (at a molecularly-close
proximity)? As in, they work along in lock-step.
Q2: Can you create an "artificial" DNA-like molecule that the
scan CAN differentiate accurately between the "bases"?
If "yes" to these, create the "fake" with a different base
precisely every 10 (or whatever) steps. So, as the "actual"
molecule is read, the "fake" provides a count. |
|
|
I believe the answers to these are:
Q1 - not with current technology
Q2 - not with current technology |
|
|
Although... short read technologies (i.e. illumina, 454) don't have this run-length issue, because they
do almost exactly this - read off a lot of molecules (of the same sequence) in lockstep.
But they're short read technologies at least in part because some fraction of the population falls out of
synchrony in each cycle. |
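
The falling-out-of-synchrony point can be put as a one-line model, sketched here in Python. It assumes a fixed fraction `loss_per_cycle` of the still-synchronised molecules drops out of phase at every cycle, independently. That is my simplification, and the numbers in the example are invented rather than taken from any real instrument.

```python
import math

def in_phase_fraction(cycles, loss_per_cycle):
    """Fraction of the molecule population still in lock-step after some cycles."""
    return (1.0 - loss_per_cycle) ** cycles

def usable_cycles(loss_per_cycle, min_fraction=0.5):
    """Largest cycle count for which at least min_fraction is still in phase."""
    if not 0.0 < loss_per_cycle < 1.0:
        raise ValueError("loss_per_cycle must be between 0 and 1 (exclusive)")
    return math.floor(math.log(min_fraction) / math.log(1.0 - loss_per_cycle))

if __name__ == "__main__":
    for loss in (0.02, 0.01, 0.005):
        print(f"loss {loss:.3f} per cycle -> about {usable_cycles(loss)} usable cycles")
```

The decay is geometric, so halving the per-cycle loss roughly doubles the usable read length, but nothing short of zero loss gets you to long reads this way.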
|
|
To be honest, I'd say it's pretty impressive that sequencing technologies work as well as they do. They're
almost ridiculous. |
|
|
//it's pretty impressive that sequencing technologies
work as well as they do. They're almost ridiculous.// |
|
|
Indeed, we're now in the genomic age. Not back when we
first did the human genome, that's just one, and only bits
of it too. Now we can actually compare and contrast,
which is where all the interesting stuff will be. There
must be a lot of worried criminals right now. |
|
|
Anyhow, how about a nucleic acid guided methylase? You
can choose your guide, say 5 x T, and then you'll get a
methylated pulse every 5 A bases that you can use to get
a handle on the length, incubate, then just melt it off
before sequencing. |
|
|
// how about a nucleic acid guided methylase? // |
|
|
<Obligatory gratuitous Ethyl Methane Sulphonate reference/> |
|
|
Reading this posting and its annotations has made my head hurt and it will probably be days before I get to look up half of these words if I even remember to, and... aurgh! |
|
|
//this is WAAY outside my knowledge-base// |
|
|
But this is how we learn. (Well, this and the data of regrettable
experience). [+] |
|
|
Sorry about that. Would a glossary help? These are very informal definitions: |
|
|
base : in this context, a single constituent unit of DNA (a monomeric unit). In standard DNA this is one of four options : A,
G, C or T (which stand for the chemical names). Bases can be methylated and still recognised as the same base by
molecular machinery.
DNA : a polymeric molecule in which an organism's genetic information is encoded
enzyme : a protein which performs some reaction; think of it as a biological machine carrying out some function.
genome : all the genetic information of an organism
high-throughput : the ability to do something lots - often by running the process massively in parallel.
homopolymer : a polymer (or part thereof) comprising a series of identical monomeric units
illumina : a short-read sequencing technology
indel : insertion or deletion of a base (or multiple bases), either as a mutation or as a sequencing error
long-read sequencing technology : approximately, any method of reading out several kilobases of DNA sequence at a
time
monomer : a small molecule which can be joined to other similar molecules to form a polymer
nanopore [sequencing]: a long-read sequencing technology
nucleotide : another word for base. (the word 'base' is often used for counting numbers of monomeric units, or in the abstract,
and is not nucleic-acid-specific, while 'nucleotide' is explicitly nucleic-acid, and tends to be used to refer to what
the monomer is - i.e. which of A,C,G,T it is)
nucleic acid : DNA (or RNA, or some other related molecule typically encoding genetic information)
methyl~ (methylation, methyl group etc) : A small part of a molecule, comprising a carbon and three hydrogen
atoms.
methylase : an enzyme which adds a methyl group to a molecule. (also, methyltransferase; an enzyme which moves
a methyl group from one place to another.)
pacbio [sequencing]: short for Pacific Biosciences sequencing; a long-read sequencing technology
PCR : Polymerase Chain Reaction; a method of generating many copies of a DNA sequence from a template molecule. Uses
primers to define the ends.
polymer : a chemical made up of a chain of small units. (Some are branched, but DNA is generally not)
polymerase : an enzyme which builds a polymer out of monomeric units.
short-read sequencing technology : approximately, any method of reading out less than about a kilobase of DNA
sequence at a time. Initial versions could often read only a few tens of bases, but technology has increased the
lengths significantly.
primer : in this context, a short molecule (of the order of 17 bases) with a sequence making it capable of specifically
binding to a DNA molecule of interest. It can then be extended by a polymerase.
sequencing : determining the order of As, Gs, Cs, and Ts in some DNA.
systematic error : an error which is inherent to a measuring process and hence can't be fixed by e.g. repeated
readings and averaging.
umpteen : slang - lots; about 10 or more.
|
|
|
Good work, thank you [Loris]. |
|
|
When you use short-read sequencing, how do you address a
specific section of the longer molecule? Do you have to dissect
it first, and then remember carefully where you put the severed
sections? |
|
|
//When you use short-read sequencing, how do you address a specific section of the longer molecule? Do
you have to dissect it first, and then remember carefully where you put the severed sections?// |
|
|
1) Old-style sequencing, where you get one sequence per reaction, e.g. "Sanger sequencing".
You would address a specific region of a larger DNA molecule by designing a "primer" - a short piece of DNA
(something like 17 bases long) which is complementary to (base-pairs exactly with) the region just before the
part you're trying to read. This anneals (sticks to) the matching sequence in the sequencing reaction and can
be extended using a polymerase.
Works well, and very useful for "finishing" - completing any awkward parts of an unfinished sequence,
checking mutations and so on. [1]
Nowadays reads may be over a kilobase of decent sequence.
It can be scaled up, but scale becomes an issue - the original human genome used warehouses of
sequencing machines, each the size of a fridge-freezer, which would run something like 384 independent
reactions at a time, supported by an army of technicians. |
|
|
2) "Shotgun sequencing"
Smash up your large DNA molecule(s) into bits, tag at least one of the ends of each (methods vary), and
read along from there.
You get back a 'jigsaw puzzle' of overlapping - and possibly error-ridden - ranges which can be pieced
together.
It's possible to do this with Sanger sequencing; you create a library of 'clones' - bits of your sequence held in
a plasmid vector, or similar - but this is still beholden to the one-sequence-per-reaction condition. So the
'high throughput' methods do multiple reads in a single reaction.
Methods vary enormously - 454 does its PCR in an emulsion, attaching the template to beads; illumina
attaches the template strands to a glass slide, then does 'bridge PCR' to form a patchwork of attached
clones; and so on. |
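
The 'jigsaw puzzle' step can be sketched as a toy greedy assembler in Python. This is only an illustration of the principle: real assemblers cope with sequencing errors, reverse-complement reads, repeats, and reads wholly contained in other reads, none of which this handles. The function names, the `min_len` threshold and the example reads are all invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if there is no such overlap of at least min_len bases."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["GTACGTTA", "ATGCGTAC", "GTTAGC"]))
```

Repeats are where this goes wrong: if the same stretch occurs twice in the genome, overlaps become ambiguous, which is the reason short reads can't always rescue long-read data.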
|
|
Note that even the long-read technologies will often be using a shotgun strategy - they're not long enough
to read out an entire molecule of biologically realistic length. |
|
|
[1] Another old technology, "Maxam-Gilbert sequencing", used terminus marking and a strand breakage strategy, but
it was technically demanding and at best only gave passable data. It saw some early popularity due to parochialism
and a single 'pro' vs the initial Sanger method, but the many 'cons' meant it became obsolete as Sanger-style
sequencing improved. |
|
|
Outside of my wheelhouse as well. But I will [+] for
the dedication to our dear friend who left us because
he would have understood it and splained it to me
when I asked. :-( |
|
|
Can't we just put a sewing machine pin in where the count
was stopped, and then restart the machine later? I have a
cork board if that helps... |
|
|
Doesn't count slippage just mean the read head of the nanopore is not specific enough, and that a finer or, more correctly, a more massive base disturbance in the measurement variable is needed? Can DNA conduct free electrons without breaking apart? |
|
|
bliss, just ask, I'll splain it- or at least try to. I splain stuff to myself with imaginary sock
puppets all the time. |
|
|
//Doesn't count slippage just mean the read head of the nanopore is not specific enough,
and that a finer or, more correctly, a more massive base disturbance in the measurement
variable is needed? Can DNA conduct free electrons without breaking apart?// |
|
|
It's not really slippage, it's a failure to read out every base as exactly one base.
I've no idea about DNA's conductivity, but that's not what is going on. |
|
|
In nanopore's case, the equipment is measuring the leakage of ions past the DNA as it goes
through the pore. The readout at any instant is a function of multiple bases, plus some
noise. How they deconvolve that seems more involved than I want to look into in detail,
but my naive model is that they're using a hidden Markov model or similar to guess the
oligomer which is present in the pore, given the very high time-resolution measurement
series. The DNA doesn't pass through the pore at a linear rate, and perhaps might even go
backwards (because Brownian motion) so if the sequence is all the same base, i.e. a
homopolymeric tract, they have to guess how long the run is based on the time it takes to
get through. |
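
The "guess the run length from the time taken" part can be shown with a deliberately crude Python toy. To be clear, this is not how any real basecaller works. It assumes each base takes a random time to pass (gamma-distributed with a mean of 1 unit; the `shape` value is an arbitrary choice of mine) and the caller simply divides the total time by the mean.

```python
import random

def call_run_length(true_len, rng, shape=16.0):
    """Simulate one pass of a homopolymer run and guess its length from total time."""
    mean_dwell = 1.0
    # gammavariate(alpha, beta) has mean alpha * beta, so each dwell averages mean_dwell.
    total = sum(rng.gammavariate(shape, mean_dwell / shape) for _ in range(true_len))
    return max(1, round(total / mean_dwell))

def call_accuracy(true_len, trials=5000, seed=1):
    """Fraction of simulated passes in which the guessed length is exactly right."""
    rng = random.Random(seed)
    hits = sum(call_run_length(true_len, rng) == true_len for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(f"run of {n:2d}: called exactly right in {call_accuracy(n):.2f} of passes")
```

The timing noise adds up along the run, so the guess gets steadily worse as the run gets longer, even though each individual base is timed no less accurately.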
|
|
For pacbio on the other hand, they're detecting a fluorescent molecule created through the
addition of a base to a synthesised DNA chain. DNA polymerase is in general very reliable,
making very few errors in copying the chain. However, the read-out is not so accurate. The
fluorescence is a different colour for each nucleotide. Each DNA strand being read is at the
end of a short tunnel, and the fluorescence is only detected while the fluorescent molecule
diffuses out of the tunnel because it acts as something called a zero-mode waveguide.
Don't ask me how that works, I don't know. Anyway, that escape may be faster or slower
than average - it's random - and maybe there's some measurement error or the
fluorescence decays (because fluorophores generally do). The time between base additions is also
random, because that too relies on diffusion.
It's probably relevant to mention that the raw base reads are very error-prone - something
like 5-15% error is generally reported, but I don't know what the profile is on that.
So normally, this is resolved by getting multiple reads and reconciling them. This works
well for the most part because the reads are very long. But it isn't so reliable for base runs. |
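
Why reconciling multiple reads fixes random errors but not systematic ones can be seen in a small simulation, sketched in Python. Everything here is a made-up model: each read reports the run length as one too short with probability `p_under`, one too long with probability `p_over`, and otherwise correctly, and the consensus is just the most common call.

```python
import random
from collections import Counter

def read_run_length(true_len, rng, p_under, p_over):
    """One read's call of a homopolymer run length, under the toy error model."""
    r = rng.random()
    if r < p_under:
        return true_len - 1
    if r < p_under + p_over:
        return true_len + 1
    return true_len

def consensus(true_len, depth, p_under, p_over, seed=0):
    """Most common call across `depth` independent reads."""
    rng = random.Random(seed)
    calls = [read_run_length(true_len, rng, p_under, p_over) for _ in range(depth)]
    return Counter(calls).most_common(1)[0][0]

if __name__ == "__main__":
    # Noisy but unbiased reads: more depth converges on the true length.
    print("unbiased:", consensus(8, 501, p_under=0.15, p_over=0.15))
    # Reads that usually undercount: more depth converges on the wrong length.
    print("biased:  ", consensus(8, 501, p_under=0.60, p_over=0.05))
```

In the biased case, adding depth only makes the consensus more confident in the wrong answer, which is the sense in which these errors are systematic.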
|
|
Why don't we yet have a proper magnet based DNA reader? Using an extremely thin probe it would physically move up and down the DNA strand a few dozen times and infer the molecule from magnetic field strength. It doesn't need to touch it, just be very close. |
|
|
Or better yet, an RNA based device that pulls the DNA strand through and outputs the data in the form of electrical pulses. |
|
|
//Why don't we yet have a proper magnet based DNA reader? Using an
extremely thin probe it would physically move up and down the DNA strand a
few dozen times and infer the molecule from magnetic field strength. It
doesn't need to touch it, just be very close.// |
|
|
An electron microscope approach?
Over 10 years ago there was this TV show about a mathematician helping out a
detective, (called num three ers; I dunno..). And in one episode there's a virus
or something, and the maths guy gets an EM image of maybe 10 or 20 bases of
nucleic acid, and is like ... "Oh, I'm just trying to get a very early idea ahead
of the labs".
I was... well : "Aaargh, no! That's so wrong in so many ways...". |
|
|
//Or better yet, an RNA based device that pulls the DNA strand through and
outputs the data in the form of electrical pulses.// |
|
|
I'd say both these suggestions were wishful thinking, but ... hell, I'd have said
what is now existing technology was crazy magic at that point, so I
wouldn't rule them out as a future option.
But making them work is probably harder than idly speculating in chat. |
|
|
So, you line up the dna along a spiral track on an LP,
find the finest pickup needle, and amplify the result
with a warm-sounding vacuum tube... |
|
|
Really, this comes down to finding a system that is part ribosomic-like stepping method and part amplification of the unique metric each base generates. |
|
|
It won't matter that the strand acts nothing like it does in the nucleus so long as the strand stays in sequence and gives the needed identifying signals. |
|
|
//Really, this comes down to finding a system that is part ribosomic-like stepping method
[...]// |
|
|
I'm not sure you meant ribosome - they're the molecular machinery which translates RNA
sequence into protein, and are just massively more complicated than DNA polymerases.
However, just in case you did mean that, it's not an idea completely without possibility.
Ribosomes interpret, and step along, the RNA (not DNA, but let's not worry...) three bases at
a time. If it wasn't for all the horrendous complexity of setting things up, and extracting the
information, it might be useful to read out multiple bases at once with some sort of
ribosome-based technology. There is another sequencing technology, SOLiD, which
progresses in such longer steps (not involving ribosomes)... but it's a short read tech, rather
involved to use and although it did have some advantages, the platform didn't really take
off. |
|
|
//[...] and part amplification of the unique metric each base generates.
It won't matter that the strand acts nothing like it does in the nucleus so long as the
strand stays in sequence and gives the needed identifying signals// |
|
|
Um, yes. This is the basis of all sequencing tech. |
|
|
Q What do you call gay plastics?
A Homopolymers |
|
|
Why thank you [Loris], for the splaining. I certainly get it
now...I'll gets back to you with a snappy comment in
hmmm...a year or so. |
|
|
What about a myosin-actin engine dragging the strand* to be sequenced, up through the pore against gravity or heavy weight termination. The actin filament might give a stepped frame reference. A force stress on the polymer might even exaggerate base signals. |
|
|
* if it can be attached, is to scale and works distance-wise. |
|
|
//What about a myosin-actin engine dragging the strand* to be sequenced, up through the
pore against gravity. The actin filament might give a stepped frame reference.// |
|
|
I have no idea, and only the enormous development cost and time needed to test such a
thing is stopping me. |
|
|
Well. That and I'm not sure that would be an improvement, to be honest. I don't think gravity
is significant at that scale. In nanopore sequencing, I think the DNA is actively driven by an
enzyme, a helicase. (I just need to go out at short notice, so can't check right now.) I kind
of doubt that a muscle-based strategy would be a big win, to be honest. There are probably
some issues with setting up to pull the DNA rather than push it (as nanopore does now) as
well. |
|
|
//the region just before the part you're trying to read// |
|
|
Is there any meaningful difference between "just before" and
"just after" in this context? |
|
|
//Is there any meaningful difference between "just before" and "just after"
[the part you're trying to read] in this context?// |
|
|
Yes.
DNA has a polarity - that is, a direction. Basically the monomers aren't
symmetrical along the chain. Without going into the gory chemical details
(which I'd have to look up anyway), people talk about the 5' ("five prime")
and 3' ("three prime") ends. (These names relate to that chemistry.)
All known polymerases extend the chain by adding nucleotides to the 3' end.
Ribosomes also process RNA in the same direction.
(As an aside, DNA sequence is canonically written with the 5' end on the
left, and the 3' end on the right, at least in the English speaking world - so
you can read it in the same direction as the cellular machinery.)
It's probably worth mentioning that the two DNA strands of a duplex
molecule are "anti-parallel", that is, they run in opposite directions. |
|
|
Helicase?, I wasn't imagining the need to messily unzip the DNA, rather just topologically expose each base pairing for signal interrogation. But then I have been imagining a lot of fanciful stuff around this one. |
|
|
Then again there aren't the physical methods or environmental variables that can manipulate DNA at base to base bond scale. Life all seems to rest on large molecular weight machinery manipulating the ladder. |
|
|
One day soon*, protein engineers may design a large molecule that changes as it traverses the major groove, such that it communicates the sequence to our higher scale.
*Might not be alive. |
|
|
//Helicase?, I wasn't imagining the need to messily unzip the DNA, rather just topologically expose each base
pairing for signal interrogation. But
then I have been imagining a lot of fanciful stuff around this one.// |
|
|
I think it's chemically possible to flip out a base; IIRC I read about a DNA binding protein which does that (in the
process of interacting with its specific binding site). Many DNA binding proteins pattern-match their binding site
from the side, impinging into the major (or, sometimes, the minor) groove. But in terms of reading out the
information in arbitrary sequence, I think biological processes are -without exception- all about matching up
base-pairs. It's the easiest approach. |
|
|
//Then again there aren't the physical methods or environmental variables that can manipulate DNA at base to
base bond scale. Life all seems to
rest on large molecular weight machinery manipulating the ladder.// |
|
|
At the molecular scale, I think that the processes involved are really quite "messy", and a large part of the
nature of biochemistry is about managing that.
When people talk about nanobots it's easy to imagine little machines which work like robots, but the reality is -
that's just not feasible. Sure, there are machines (scanning tunneling microscopes) which can visualise and
move around individual atoms, but (a) under very specific circumstances - atoms on a flat surface in a vacuum
at a few degrees above absolute zero, without vibration; and (b) sure, the scanning tip is small, but the
functional parts of the machine as a whole are very much larger. |
|
|
//One day soon*, protein engineers may design a large molecule that changes as it traverses the major
groove, such that it communicates the sequence to our higher scale. *Might not be alive.// |
|
|
Possibly, but the difficult part there is communicating the information, not crawling along the DNA. Unwinding
the DNA is really not an issue; it's done routinely. |
|
|
//the processes involved are really quite "messy"// Yeah, that's what I find fantastic, that such a complexity of machinery can work with Brownian motion influence. I more imagine there is an overarching, guiding dimension at work, such as all molecules being little magnets, and it is that that makes the clock work, underpins protein folds and the like. To have a transported amino acid come out of the set when needed has to be more than just random bump. |
|
|
//communicating the information//
I did imagine using an expanding wave front to magnify surface bumps. Maybe a network of printer bubble jet heads to pulse a medium fluid and create an engineered wave scope. But even to me it sounds too out there.
NcNcNH2..OcN or from the other edge cNccO..2HNcCC |
|
| |