OCR OCD Apparatus Criticus XSD
Did they really write that?

First, some prior art.

It's been a long day in the scriptorium for Prior Cadfael. He has to stand at the front, and read from the master copy of a codex, while the brother scribes write down new manuscript copies of ... whatever it might be. Today, let's say it's Cicero's Pro Murena. There is no coffee, because it has not yet reached Europe. So no-one can really blame him when, instead of "nemo paene sobrius saltat, nisi insanit", he reads "[...] salsat [...]". Which makes just as good sense, and may well be true.

This is a well-known problem, with a well-known solution. The well-known solution, for people wanting to know many generations later what Cicero really said, is called an apparatus criticus, which will be familiar to scholarly readers of, for example, Oxford Classical Text editions.

Brother Cerdic has started doodling dicks in the marginal illuminations again, but we will leave him to it.

For now we are reading an Amazon e-book that has been scanned in along with thousands of others, and never properly proof-read. The original texts are not at all ancient; on the contrary, there are often people still alive who knew the authors. However, it would have cost too much to find them and ask their opinion, so the e-books were just churned out as-is.

And scanning means OCR errors.

Now, the Kindle App offers a "report content error" option, where we can submit our observations of what look like OCR errors, so that they can be completely ignored. But in any case, this is not just an Amazon problem. Students often lack money, and Humanities students often also lack any hope of earning any, so they often do their own scanning to PDF, and then share the results on the internet. And the kinds of text they scan tend to diverge from the texts probably supplied as learning data to OCR AI algorithms, especially with respect to the prevalence of foreign words. It gets worse when the original text is, say, a hundred years old and printed on cheap, high-acid paper in fonts that have since fallen out of fashion. And then stained with fly-poop (or something), which gets mistaken for full stops.

So, let us imagine a text in which multiple versions of any given passage may be given in-line, but tagged with XML tags specifying the origin of each variant, and where the document as a whole is accompanied by a stemma codicum in XML form, to which attributes of those in-line tags refer.
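To make that a little more concrete, here is a rough sketch of what such a marked-up passage might look like, with a few lines of Python to check that every in-line variant points at a witness that the stemma actually declares. The element and attribute names (edition, stemma, witness, app, rdg, wit, parent) are invented for illustration and are not taken from any agreed schema; the two witnesses are simply the master copy and the copy taken down from Prior Cadfael's dictation.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element and attribute name here is an assumption.
SAMPLE = """\
<edition>
  <stemma>
    <witness id="M" label="master copy"/>
    <witness id="C" parent="M" label="copy taken down from dictation"/>
  </stemma>
  <body>nemo paene sobrius <app><rdg wit="M">saltat</rdg><rdg wit="C">salsat</rdg></app>, nisi insanit</body>
</edition>
"""


def load_witnesses(root):
    """Map each witness id to a (parent id or None, label) pair."""
    return {w.get("id"): (w.get("parent"), w.get("label"))
            for w in root.find("stemma")}


def unknown_witnesses(root):
    """Return witness ids used by in-line readings but absent from the stemma."""
    known = set(load_witnesses(root))
    used = {r.get("wit") for r in root.iter("rdg")}
    return sorted(used - known)


root = ET.fromstring(SAMPLE)
witnesses = load_witnesses(root)
dangling = unknown_witnesses(root)
```

A real schema would presumably need more than this (several witnesses per reading, conjectures, lacunae), but the cross-reference from the in-line tag to the stemma is the part that matters here.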

If all these tags, both the in-line ones and the "stemma" ones, conformed to an agreed XML Schema Definition (XSD), then the marked up text could be rendered using any one of a range of different XSLT transformations, of which the first two would probably include one "scholarly" version, which showed the apparatus in the footnotes, and one "Ee-Zee Read" version, which moved it to the back.
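As a stand-in for the two XSLT stylesheets, here is a minimal Python sketch of the same two renderings. It is not XSLT, and the markup names (body, app, rdg, wit) are the same invented ones as before; the function names are hypothetical too. One function walks the text once, picks the reading from a preferred witness, and collects the rejected readings as notes; the "scholarly" version prints those notes under the text, and the "Ee-Zee Read" version returns clean text and hands the notes over separately, to be put at the back.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, as an assumption for this sketch only.
BODY = ET.fromstring(
    '<body>nemo paene sobrius '
    '<app><rdg wit="M">saltat</rdg><rdg wit="C">salsat</rdg></app>'
    ', nisi insanit</body>'
)


def render(body, preferred, mark=True):
    """Return (running text, list of apparatus notes) for one passage."""
    parts = [body.text or ""]
    notes = []
    for child in body:
        if child.tag == "app":
            readings = {r.get("wit"): (r.text or "") for r in child.iter("rdg")}
            # Fall back to the first reading if the preferred witness is silent.
            chosen = readings.get(preferred, next(iter(readings.values())))
            others = [f"{text} {wit}" for wit, text in readings.items()
                      if wit != preferred]
            notes.append(f"{chosen}] " + "; ".join(others))
            parts.append(f"{chosen}[{len(notes)}]" if mark else chosen)
        parts.append(child.tail or "")
    return "".join(parts), notes


def scholarly(body, preferred):
    """Text with note markers, apparatus printed directly underneath."""
    text, notes = render(body, preferred, mark=True)
    footer = "\n".join(f"[{i}] {n}" for i, n in enumerate(notes, 1))
    return f"{text}\n----\n{footer}"


def easy_read(body, preferred):
    """Clean text, with the notes returned separately for the back matter."""
    return render(body, preferred, mark=False)


page = scholarly(BODY, "M")
clean_text, endnotes = easy_read(BODY, "M")
```

Changing the preferred witness from "M" to "C" gives the scribes' version in the main text and demotes Cicero to the notes, which is roughly the situation the e-book reader is already in.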

Each text with its tags could be held in a public project on, say, GitLab, and the XSD itself in a separate project.

And then the idea might have a witty and apposite punchline.
-- pertinax, Oct 15 2022

Apparatus Criticus https://en.wikipedi.../Critical_apparatus
Also, stemma codicum [pertinax, Oct 15 2022]

XSD https://www.w3schoo...ml/schema_intro.asp
[pertinax, Oct 15 2022]

OCR https://en.wikipedi...aracter_recognition
[pertinax, Oct 15 2022]

OCD https://www.mayocli...causes/syc-20354432
Look, it's just annoying, OK? [pertinax, Oct 15 2022]

Pro Murena: when rules of evidence haven't been invented yet ... https://www.perseus....0019%3Atext%3DMur.
... so you can just distract the jury by telling lawyer-jokes. [pertinax, Oct 15 2022]

Is that you, Brother Cerdic? https://d3h6k4kfl8m...07/phallus-tree.jpg
[pertinax, Oct 16 2022]

An example of where this idea was needed http://homepages.cs...l/NATO/nato1969.PDF
See the second page - especially this: "One of the problems was that the OCR software used kept trying to convert the original British spellings of words like ‘realise’ to the American spelling ‘realize’ and made other stupid mistakes. Whenever the OCR program was unsure of a reading, it called it to the attention of the operator, but there were a number of occasions in which it was sure, but wrong. Not all of these instances are guaranteed to have been caught." [pertinax, Nov 30 2024]

this is brilliant [+] although if this idea's title is an elaborate pun, i don't understand it.
-- sninctown, Oct 16 2022


B=o
-- pocmloc, Oct 16 2022


I understood one of those references.
-- 2 fries shy of a happy meal, Oct 16 2022


I don't understand my own annotation
-- pocmloc, Nov 30 2024


