Making Light :: The Implications are Staggering :: comments

The Implications are Staggering

Apparently some models of Xerox photocopiers are substituting one number for another in photocopied documents. This isn't an OCR error,...

The Implications are Staggering -- comment #1 from Edd Vick

Wed, 07 Aug 2013 01:29:39 -0500

I for one welcome our duplicating overlords.

While fearing for my life the next time I get a prescription...

Posted August 7, 2013 1:29 AM by Edd Vick

The Implications are Staggering -- comment #2 from Miramon

Wed, 07 Aug 2013 01:32:44 -0500

It's a hardware bug!
It's a software bug!
It's two, two, two bugs in one!

Posted August 7, 2013 1:32 AM by Miramon

The Implications are Staggering -- comment #3 from Lawrence

Wed, 07 Aug 2013 01:44:42 -0500

Well, damn, that really screws up one of the tests of parallel-world travel my characters used in "The Drifter."

Posted August 7, 2013 1:44 AM by Lawrence

The Implications are Staggering -- comment #4 from DriveBy

Wed, 07 Aug 2013 01:58:18 -0500

It's an image compression error.

Data compression algorithms can work, in part, by recognizing repetition in the data being compressed and storing only one copy of the repeated data. Where the input data is grainy, as is often the case with scanned print, "repetition" becomes a matter of judgment; insisting on pixel-perfect matches as the determinant of repetition would result in no matches, and no compression (at least by that part of the algorithm). So, in practice, the machine applies fudge factors to determine whether this grainy part of the image over here is close enough to that grainy part of the image over there to be called a duplicate.

When the fudge factors are set too liberally, similar-looking grainy parts can be mistaken for duplicates. When an erroneous match is made, one or the other of the grainy parts will be assigned as the master part and will be plugged into both places. This can happen to image parts containing text, especially if there is a common visual theme decorating the text. Apparently, the Xerox machines in question are susceptible to this problem primarily when the text in question contains numbers, but not when the text is alphabetical.
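
The matching-with-a-fudge-factor idea described above can be sketched in a few lines of Python. This is only a toy model, not the JBIG2 algorithm or anything taken from Xerox firmware: the 5x7 glyphs, the `mismatch` pixel count, and the `tolerance` parameter are all assumptions made up for illustration.

```python
from typing import List, Tuple

Glyph = List[str]  # one string per pixel row; '#' is ink, '.' is blank

# Two hypothetical low-resolution glyphs that differ in only a few pixels.
SIX: Glyph = [
    ".###.",
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    ".###.",
]
EIGHT: Glyph = [
    ".###.",
    "#...#",
    "#...#",
    ".###.",
    "#...#",
    "#...#",
    ".###.",
]


def mismatch(a: Glyph, b: Glyph) -> int:
    """Count the pixels at which two same-sized glyphs differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(patches: List[Glyph], tolerance: int) -> Tuple[List[Glyph], List[int]]:
    """Store each distinct patch once and refer to it by index.

    A patch counts as a duplicate of an earlier "master" patch when at most
    `tolerance` pixels differ; this is the fudge factor.
    """
    dictionary: List[Glyph] = []
    indices: List[int] = []
    for patch in patches:
        for i, master in enumerate(dictionary):
            if mismatch(patch, master) <= tolerance:
                indices.append(i)  # reuse the master, discard this patch
                break
        else:
            dictionary.append(patch)  # no close match: keep as a new master
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary: List[Glyph], indices: List[int]) -> List[Glyph]:
    """Rebuild the page by plugging the stored master into every position."""
    return [dictionary[i] for i in indices]


page = [SIX, EIGHT, SIX]
strict = decompress(*compress(page, tolerance=0))  # pixel-perfect matching
loose = decompress(*compress(page, tolerance=4))   # fudge factor set too liberally
print(strict == page)    # True: nothing was altered
print(loose[1] == SIX)   # True: the 8 has silently become a 6
```

With `tolerance=0` the only savings come from exact repeats; with a tolerance larger than the three pixels that separate these two glyphs, the page still looks clean and sharp but contains the wrong digit.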

There is an update on the situation which includes words like JBIG2, and gets into the specifics of settings on Xerox machines. There are fingers yet to be pointed, and still some question about who knew what when and why certain people didn't understand sooner what was happening. Lawyers may at some point become involved. But at the lowest technical level, it's an image compression issue.

Posted August 7, 2013 1:58 AM by DriveBy

The Implications are Staggering -- comment #5 from Laura

Wed, 07 Aug 2013 02:02:05 -0500

Add in that many Xerox machines have fax capability, and all now have a boatload of memory, and I see even more problems - industrial espionage by chipset in trusted hardware.

Posted August 7, 2013 2:02 AM by Laura

The Implications are Staggering -- comment #6 from Bill Stewart

Wed, 07 Aug 2013 02:19:20 -0500

I'm sure their statement will be interesting, but I'm not going to believe any of the numbers in it...

Posted August 7, 2013 2:19 AM by Bill Stewart

The Implications are Staggering -- comment #7 from Bruce Cohen (Speaker to Managers)

Wed, 07 Aug 2013 02:49:58 -0500

Architects, contractors, airplane designers, and artillery units had all better be checking what model copier they're using, or things may start falling down, crashing, or blowing up unexpectedly.

Posted August 7, 2013 2:49 AM by Bruce Cohen (Speaker to Managers)

The Implications are Staggering -- comment #8 from dcb

Wed, 07 Aug 2013 04:19:39 -0500

My local Metro (free paper associated with trains, Tube etc.) says that Xerox's "principal engineer, Francis Tse" is saying that the problem can be combatted by copying at higher resolution - which would make sense from what DriveBy @4 says.

I note that the large copier at work which I use occasionally to scan e.g. a book chapter is set at 200 x 200 as standard and I have to keep resetting it to 300 x 300 (for each document, if I'm copying several, which is a pain, 'cos it resets to the lower res automatically at the end of each document). I'm sure the lower standard setting is to save memory, but it makes for lousy copies, particularly if, as I tend to do, you print two pages to a side* (and double sided of course) to save paper.

*Note: two pages to a side works better with UK/European paper sizes, A4 etc. than with American paper sizes.

Posted August 7, 2013 4:19 AM by dcb

The Implications are Staggering -- comment #9 from Rob Rusick

Wed, 07 Aug 2013 06:49:45 -0500

If I understood Mr. Kriesel's article, this is an issue when one uses a 'scan to PDF' feature of the copier, where the alterations are made in the saved PDF files.

I didn't see a claim that standard copying was affected.

Posted August 7, 2013 6:49 AM by Rob Rusick

The Implications are Staggering -- comment #10 from Daniel Martin

Wed, 07 Aug 2013 07:03:28 -0500

Specifically, from reading the updates it's a problem of using the quality setting "normal" when scanning to PDF. The three possible settings for quality are "normal", "higher", and "high"; the factory default for quality isn't the "normal" setting but because it doesn't warn "normal activates patch-based image compression (JBIG2), which will corrupt text" in big red letters, apparently many penny-wise users will change the default to "normal" to store their documents in less memory.

Memory/storage space is cheap. Efforts to conserve it almost always come around to bite you in the end.

Posted August 7, 2013 7:03 AM by Daniel Martin

The Implications are Staggering -- comment #11 from Jim Macdonald

Wed, 07 Aug 2013 07:31:38 -0500

Another problem of photocopiers which store documents to disk in the process of making copies is that, when you eventually discard/resell/whatever the machine, copies of all documents you ever copied, however confidential, may still be on it and floating around out there somewhere.

See: Copier Data Security: A Guide for Businesses

Posted August 7, 2013 7:31 AM by Jim Macdonald

The Implications are Staggering -- comment #12 from Nangleator

Wed, 07 Aug 2013 09:22:31 -0500

I'm sure only copiers can make electronic mistakes. I'm sure any other conceivable electronic glitches are completely unimportant. Even multiplied by, say, 314 million. The population of the United States.

Insignificant. Even if then multiplied by the number of phone calls and emails U.S. citizens make every day, all year long.

We can rely on our data, even in a court of law. Even secret courts of law.

Posted August 7, 2013 9:22 AM by Nangleator

The Implications are Staggering -- comment #13 from Remus Shepherd

Wed, 07 Aug 2013 09:32:23 -0500

The big problem with this bug is that the 'scan to PDF' feature is intended for users who want a paperless office, which means they shred the original paper documents once they're done scanning them.

It would be interesting if civilization were to collapse over a software bug as trivial as this.

Posted August 7, 2013 9:32 AM by Remus Shepherd

The Implications are Staggering -- comment #14 from P J Evans

Wed, 07 Aug 2013 09:53:20 -0500

The place I worked switched to Ricohs a few years back. In some ways, not an improvement over Xerox, and we swore a lot at them. (We were printing graphics, like PDFs reduced from 24x36 to 11x17. The old Xerox copies were readable.)

Posted August 7, 2013 9:53 AM by P J Evans

The Implications are Staggering -- comment #15 from Fragano Ledgister

Wed, 07 Aug 2013 11:05:50 -0500

Creative copying?

Posted August 7, 2013 11:05 AM by Fragano Ledgister

The Implications are Staggering -- comment #16 from Stan

Wed, 07 Aug 2013 12:23:01 -0500

A knowledgeable MeFi user commented on this problem:
http://www.metafilter.com/130641/Cat-images-reportedly-unaffected#5125209

Posted August 7, 2013 12:23 PM by Stan

The Implications are Staggering -- comment #17 from Tom Whitmore

Wed, 07 Aug 2013 12:59:22 -0500

The original photocopiers worked by making a direct master, reversed, of the original document on a photosensitive drum. Now, most copiers work by scanning the image and then printing it out. I wonder, what other sorts of errors are introduced by this change?

Posted August 7, 2013 12:59 PM by Tom Whitmore

The Implications are Staggering -- comment #18 from Bill Higgins-- Beam Jockey has been gnomed

Wed, 07 Aug 2013 14:40:38 -0500

Bruce Cohen writes in #7:

> Architects, contractors, airplane designers, and artillery units had all better be checking what model copier they're using, or things may start falling down, crashing, or blowing up unexpectedly.

I don't believe you'll find many artillery units using Xerox copiers. They tend to favor Canon.

Posted August 7, 2013 2:40 PM by Bill Higgins-- Beam Jockey has been gnomed

The Implications are Staggering -- comment #19 from Jim Macdonald

Wed, 07 Aug 2013 15:02:22 -0500

Bill Higgins #18 --

Can't find your gnomed post. Sorry about that.

-- JDM

Posted August 7, 2013 3:02 PM by Jim Macdonald

The Implications are Staggering -- comment #20 from Fragano Ledgister

Wed, 07 Aug 2013 15:04:32 -0500

Bill Higgins #18: That's Serge-level punditry.

Posted August 7, 2013 3:04 PM by Fragano Ledgister

The Implications are Staggering -- comment #21 from Ken Fletcher

Wed, 07 Aug 2013 15:45:32 -0500

"It's not a bug; it's a feature!"

A useful optional setting to discourage clandestine copying of documents thick with numbers. Could maybe even track the individual copy machine, and when the copies were made.

Posted August 7, 2013 3:45 PM by Ken Fletcher

The Implications are Staggering -- comment #22 from David Goldfarb

Wed, 07 Aug 2013 16:33:53 -0500

dcb @8: Moving between any two standard paper sizes is easier with European A/B series than with the stupid American letter/ledger/architectural etc. non-system. I worked in a copy shop for two decades, and people were constantly wanting documents designed for letter size blown up to ledger or 24x36, or something larger reduced to letter, and it was always a pain because the aspect ratio was different. Copy shop clerks in Europe have it so much easier: A4 to A3 is 141%, A4 to A5 is 77%, drop it on the glass and go. Sigh.

Posted August 7, 2013 4:33 PM by David Goldfarb

The Implications are Staggering -- comment #23 from albatross

Wed, 07 Aug 2013 16:35:28 -0500

The reason this seems like a big deal to me is that I am extremely skeptical that Xerox is the only copier/scanner on which this is a problem. We have spent some years now trying to get rid of paper in as many places as possible, ranging from voting to e-receipts to e-prescriptions to MERS (electronic mortgage records). And in all cases, there were cost and hassle savings up front. But part of the cost is that the underlying paper documents go away, and we can easily end up with *nothing but* electronic records. If those records are wrong--via glitch or misentry or tampering--there is nothing to check them against.

Our electronic technology is not reliable or secure enough to support this. With a fallback paper record, you have something that neither a glitch nor an attacker can change, which can be used to decide what should be in the electronic records. Without that, you trust the electronic records whether they deserve it or not.

Posted August 7, 2013 4:35 PM by albatross

The Implications are Staggering -- comment #24 from Nancy Lebovitz

Wed, 07 Aug 2013 16:47:38 -0500

The reason this seems like a big deal to me is that even if it's "just" two Xerox models (how long has this been going on?), it's probably enough to wreak a lot of subtly caused havoc. It wouldn't surprise me if it's enough to get people killed.

Posted August 7, 2013 4:47 PM by Nancy Lebovitz

The Implications are Staggering -- comment #25 from eric

Wed, 07 Aug 2013 17:33:52 -0500

I predict that, no matter the real cause, the users will be blamed.
Where really, it's negligent use of a compression algorithm.

Posted August 7, 2013 5:33 PM by eric

The Implications are Staggering -- comment #26 from P J Evans

Wed, 07 Aug 2013 17:45:15 -0500

23
The company I worked at was scanning (at a high resolution) all their old maps, and at more reasonable resolutions all their other construction documents. Because they have to be able to track what was done until it's removed permanently, possibly 80 to 90 years.

Posted August 7, 2013 5:45 PM by P J Evans

The Implications are Staggering -- comment #27 from Matthew Brown

Wed, 07 Aug 2013 18:35:26 -0500

Eric@25:

Yes, I agree. This error falls outside of user expectations of a scanner or copier. Expected failure modes include unreadable scans. Readable-but-wrong scans fall outside of that.

I haven't seen any mention anywhere of whether this is the normally configured compression mode of these devices, or whether it's an option the user has to set. Regardless, though, this is not a user error, since it falls outside of reasonable expectations. It'll just influence how many people have been bitten by this error.

Is it possible to determine if your scanned PDFs from one of these devices were scanned with the problematic compression option? If not, nobody who's got documents scanned with these can trust them.

Otherwise, it'll be hard to know if one is among the affected users and thus one will generally have to assume your documents might be wrong.
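
One rough first pass at that question, offered as a sketch rather than a reliable test: PDF image streams compressed with JBIG2 are tagged with the filter name `/JBIG2Decode`, so the raw bytes of a file can be searched for it. The function names below (`mentions_jbig2`, `scan_folder`) are made up for illustration. A hit only says that JBIG2 was used somewhere in the file, not that any character was actually substituted, and a miss is not proof of safety, since the filter name can be hidden inside compressed object streams or written with escaped characters.

```python
from pathlib import Path
from typing import List

# Filter name that the PDF format uses for JBIG2-compressed image streams.
JBIG2_FILTER = b"/JBIG2Decode"


def mentions_jbig2(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes name the JBIG2 filter anywhere."""
    return JBIG2_FILTER in pdf_bytes


def scan_folder(folder: str) -> List[str]:
    """List the PDFs directly inside `folder` whose raw bytes mention JBIG2."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.pdf")
        if mentions_jbig2(path.read_bytes())
    )


print(mentions_jbig2(b"<< /Type /XObject /Filter /JBIG2Decode >>"))  # True
print(mentions_jbig2(b"<< /Type /XObject /Filter /DCTDecode >>"))    # False
```

Anything this flags would still need to be checked against the paper original, where one survives.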

Posted August 7, 2013 6:35 PM by Matthew Brown

The Implications are Staggering -- comment #28 from Tom Whitmore

Wed, 07 Aug 2013 18:59:16 -0500

It's not just scanned PDFs, Matthew Brown -- it's simple photocopies. Things that look like they're a printed picture of the original, only with these anomalies. The copier makes a digital file, then prints that file, rather than simply taking a picture and printing that -- and this error comes in when it makes the digital file.

Posted August 7, 2013 6:59 PM by Tom Whitmore

The Implications are Staggering -- comment #29 from Matthew Brown

Wed, 07 Aug 2013 19:19:13 -0500

At least one of the linked articles I read mentioned that it's only the PDFs that are affected: is that inaccurate?

Posted August 7, 2013 7:19 PM by Matthew Brown

The Implications are Staggering -- comment #30 from Heather Rose Jones

Wed, 07 Aug 2013 21:32:59 -0500

In my mind I am running through all the sorts of documents scanned at my Place of Employment where the scan/copy is treated as equivalent to the original document. Documents like QC test results or Certificates of Analysis shipped with our products as proof that they meet specifications. Lots of numerals in pharmaceutical manufacturing documents. I don't believe there are any points where disposition decisions are made based on a scan/copy rather than an original document (or more often on purely electronic data). But there are certainly points where scanned/copied documents are used to document those decisions, which could present the appearance of non-conformance.

Posted August 7, 2013 9:32 PM by Heather Rose Jones

The Implications are Staggering -- comment #31 from lorax

Wed, 07 Aug 2013 21:37:53 -0500

As I understand the nature of the issue from reading the linked article, there is no a priori reason why this shouldn't affect alphabetical characters as well as numerical ones in similar circumstances (different isolated blocks of characters occurring in different locations in a primarily non-textual document). Things like names, for instance.

Posted August 7, 2013 9:37 PM by lorax

The Implications are Staggering -- comment #32 from Doug Burbidge

Thu, 08 Aug 2013 00:12:24 -0500

David Goldfarb @22:

> A4 to A5 is 70.7%

FTFY.

Going up a size, of course, involves multiplying by √2 (i.e. 141%); going down a size involves dividing by √2 (i.e. multiplying by 70.7%).

Another annoying property of US paper is that there are several different weight scales, all of which are abbreviated to "pounds": bond, index, cover, etc. The rest of the world has just one scale: gsm, or grams per square metre -- a calculation made simple when you know that an A0 sheet is one square metre, or 16 A4 sheets is one square metre.
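
The arithmetic above is easy to check. The following Python sketch uses the ideal, unrounded A-series definition (A0 has an area of one square metre and every size has sides in the ratio 1:√2); the function names are invented for illustration, and the US sheet sizes are the nominal inch dimensions.

```python
import math


def a_size_mm(n: int) -> tuple:
    """Ideal (unrounded) width and height of ISO A<n> in millimetres."""
    width = 1000 * 2 ** (-(2 * n + 1) / 4)
    height = 1000 * 2 ** (-(2 * n - 1) / 4)
    return width, height


def scale_factor(from_n: int, to_n: int) -> float:
    """Linear enlargement factor when copying A<from_n> onto A<to_n>."""
    return 2 ** ((from_n - to_n) / 2)


def sheet_weight_g(gsm: float, n: int) -> float:
    """Weight in grams of one A<n> sheet of paper rated at `gsm`."""
    return gsm * 2 ** (-n)  # an A<n> sheet is 1 / 2**n of a square metre


def aspect(short_side: float, long_side: float) -> float:
    """Long side divided by short side."""
    return long_side / short_side


print(round(scale_factor(4, 3) * 100, 1))  # 141.4 (A4 up to A3)
print(round(scale_factor(4, 5) * 100, 1))  # 70.7  (A4 down to A5)
print(sheet_weight_g(80, 4))               # 5.0 grams for one 80 gsm A4 sheet
print([round(side) for side in a_size_mm(4)])  # [210, 297]

# US sizes do not share one aspect ratio, which is why resizing is a pain.
print(round(aspect(8.5, 11), 3))  # 1.294 (letter)
print(round(aspect(11, 17), 3))   # 1.545 (ledger)
print(round(aspect(24, 36), 3))   # 1.5   (24x36)
```

Because every A size has the same √2 aspect ratio, one percentage takes a page from any size to the next with nothing cropped or padded.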

Posted August 8, 2013 12:12 AM by Doug Burbidge

The Implications are Staggering -- comment #33 from Doug Burbidge

Thu, 08 Aug 2013 00:15:57 -0500

Mr. Kriesel's sample scans for room dimensions show the error in a rectangular block, containing the dimension number inside the block. The JBIG2 encoder inside the Xerox scanner has decided that this rectangular block is a single glyph, and has (incorrectly) replaced the whole glyph.

Further down, he shows scans where '6' has been substituted with '8'.

Wikipedia says that JBIG2 can use either of two methods for encoding data that it thinks is text: pattern matching and substitution, or soft pattern matching. Re the first method, it says "substitution errors could be made during the process if the image resolution is low." The second method stores the differences, so nominally even if it guessed the wrong glyph, the JBIG2 viewer used to view the encoded file would start from the wrong glyph and apply the differences, thus producing something with the correct appearance, which is all you need.
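
The difference between the two methods can be illustrated with a toy Python sketch. It is not the real JBIG2 coding scheme: a plain XOR of pixel rows stands in for the stored differences, and the two 5x7 glyphs are hypothetical.

```python
from typing import List

Bitmap = List[int]  # one integer per pixel row; each bit is one pixel

# Hypothetical low-resolution glyphs, written as binary rows.
SIX: Bitmap = [0b01110, 0b10000, 0b10000, 0b11110, 0b10001, 0b10001, 0b01110]
EIGHT: Bitmap = [0b01110, 0b10001, 0b10001, 0b01110, 0b10001, 0b10001, 0b01110]


def differences(patch: Bitmap, master: Bitmap) -> Bitmap:
    """Pixels where the scanned patch disagrees with the chosen master glyph."""
    return [p ^ m for p, m in zip(patch, master)]


def decode_substitution(master: Bitmap) -> Bitmap:
    """First method: the master glyph is shown as-is; differences were discarded."""
    return list(master)


def decode_with_differences(master: Bitmap, diff: Bitmap) -> Bitmap:
    """Second method: start from the master glyph, then apply the stored differences."""
    return [m ^ d for m, d in zip(master, diff)]


# Suppose the encoder wrongly decides that a scanned 8 matches the stored 6.
diff = differences(EIGHT, SIX)
print(decode_substitution(SIX) == EIGHT)            # False: the page now shows a 6
print(decode_with_differences(SIX, diff) == EIGHT)  # True: the wrong guess is repaired
```

The wrong guess costs a few extra stored pixels in the second method but never changes what the reader sees, which is the point of the distinction above.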

You sometimes see this when copy-pasting text from a JBIG2-encoded PDF: the text looks roughly correct in the PDF, but when you paste into Word or whatever, you see 'I' substituted for '1', or 'O' for '0' or whatever. (This could also be a problem, but not as insidious as the one Mr. Kriesel has identified.)

Posted August 8, 2013 12:15 AM by Doug Burbidge

The Implications are Staggering -- comment #34 from dcb

Thu, 08 Aug 2013 04:10:36 -0500

David Goldfarb @22: Yeah. I really only realised the US problem when I was visiting a specialist library and suggested to another person there that they could reduce-copy to give two pages to a side - which, as you know, is so easy with A4 etc. I was flabbergasted when I realised that two American letter size sheets didn't easily reduce-copy-70% (or 71%, since as Doug Burbidge says @32 it should be 70.7%) onto one sheet of American letter. Whoever designed the "A" system deserves a prize - and we just take the advantages for granted.

Posted August 8, 2013 4:10 AM by dcb

The Implications are Staggering -- comment #35 from Lila

Thu, 08 Aug 2013 07:47:14 -0500

Heather Rose Jones @#30, that was the first thing that leapt to mind about EMR (electronic medical records)--that the PDF file would make it look like someone had gotten the wrong dosage of meds, or too strenuous an exercise prescription (wrist curls with 8# 2 weeks after surgery?), and hello malpractice suit.

Scanning paper documents and storing them as PDFs is pretty common practice for EMR, at least in the small clinics I'm familiar with.

Posted August 8, 2013 7:47 AM by Lila

The Implications are Staggering -- comment #36 from Dave Bell

Thu, 08 Aug 2013 08:24:35 -0500

Lila @35

I read the original report, and what struck me was that the examples they showed were of small print, only just large enough to resolve some of the differences between a 6 and an 8. It was the sort of thing you could get with a 9-pin dot-matrix printer, a huge relative pixel size. It was marginally readable even without the errors.

If critical records are in print that small, it maybe isn't the Xerox machine that would be your problem.

Posted August 8, 2013 8:24 AM by Dave Bell

The Implications are Staggering -- comment #37 from oldster

Thu, 08 Aug 2013 09:51:22 -0500

dcb @34:

"Whoever designed the "A" system deserves a prize - and we just take the advantages for granted."

That would be Georg Christoph Lichtenberg for the really cool part of the insight (i.e. seeing the role of root-2), and Walter Porstmann for specifying it to the metric system and making it a standard. (From the Wiki page for "paper size".)

Probably no prizes will be forthcoming, but some people refer to the aspect ratio as the "Lichtenberg ratio." Having your name used as a nickname for root-2 is pretty cool, and better than most prizes. I mean, what prize could I possibly prefer to having some fundamental constant referred to as "the oldster number"?

Pie?

Posted August 8, 2013 9:51 AM by oldster

The Implications are Staggering -- comment #38 from Bob Webber

Thu, 08 Aug 2013 10:13:58 -0500

Xerox WorkCentre like Bank of America: JBIG2 FAIL.

Posted August 8, 2013 10:13 AM by Bob Webber

The Implications are Staggering -- comment #39 from Clifton

Thu, 08 Aug 2013 12:46:20 -0500

Forwarded the original link to my wife yesterday, because I knew that she and her workplace have spent part of the past year scanning documents on a Xerox WorkCentre for permanent storage in PDF form. Uh-oh....

The kind of work she does could be (and sometimes is) the subject of lawsuits and/or "administrative hearings" though it's not the kind that usually depends on the values of a few digits.

Meanwhile on Mefi, a user comments that they've seen this problem in the form of inappropriate keming, and had been wondering what caused that - so it's not just numbers, even if numbers seem to be particularly vulnerable.

Posted August 8, 2013 12:46 PM by Clifton

The Implications are Staggering -- comment #40 from SarahS

Thu, 08 Aug 2013 14:56:52 -0500

Did anyone else immediately think, "Oh my lord...royalty statements!"

Posted August 8, 2013 2:56 PM by SarahS

The Implications are Staggering -- comment #41 from David Goldfarb

Thu, 08 Aug 2013 17:16:00 -0500

Doug Burbidge @32: Thanks for the correction.

Half letter size is actually close enough in aspect ratio to letter that reducing two pages onto one isn't that hard; usually you just have to clip a little bit of whitespace around the edge. The other stuff I mentioned is much more annoying.

Posted August 8, 2013 5:16 PM by David Goldfarb

The Implications are Staggering -- comment #42 from Bill Higgins-- Beam Jockey

Thu, 08 Aug 2013 18:35:37 -0500

Oldster in #37:

That Lichtenberg guy must have been a million laughs. He was the Jon Singer of his day.

Posted August 8, 2013 6:35 PM by Bill Higgins-- Beam Jockey

The Implications are Staggering -- comment #43 from Jacque

Thu, 08 Aug 2013 18:41:31 -0500

This is fascinating, as I am, at this very moment, proofing a scanned document. Maybe I'm just behind the times, but I'm still boggled that the OCR tech is as good as it is. And, to be fair, the original document has eentsy weentsy teeny tiny type, printed on that gray, speckled "recycled" paper that was all the rage back in the '90s, so that the speckle in the paper is less than an order of magnitude smaller than the features in the font. So I am seeing quite a bit of...interpretation. Unsurprisingly, commas become semicolons and so on. Interestingly, less in the numerals than in the letters.

For example:

Home > Honie
Petroleum > Petroi"Lim
PROPERTIES > Pl'loPERTIES, or l'flOPER:ms
Equipment > Equip,;ent, or Eqilipnieht (the latter is obviously the German spelling)

(I find this perversely amusing. If you cock your head and squint just so, you can kinda see how it got there.)

But I do see some numeracy fails:

222,195,990 > 222, 195,99\)_
6,775,900 > 6,775,90P
57,147,790 > 5}; 147:790
71,242 > 7i ,242

Interestingly, the errors actually look correct on the PDF, as compared to the paper. It's only when the text is copied out and pasted into another program that the parsing errors show up.

I haven't yet caught it out swapping one number for another. But I haven't actually proofed my processed content against the actual paper yet, either.

But I would never, in a million years, even think about putting out the processed text as "correct," without proofing it first.

Posted August 8, 2013 6:41 PM by Jacque

The Implications are Staggering -- comment #44 from Jacque

Thu, 08 Aug 2013 19:11:42 -0500

Okay, then. Just finished proofing, and I did, indeed, find three instances where it changed the numbers. (One of them where it saw 13 and read it as 8.)

Posted August 8, 2013 7:11 PM by Jacque

The Implications are Staggering -- comment #45 from P J Evans

Thu, 08 Aug 2013 20:13:49 -0500

OCR translations of text can be almost as fun as running things through Babelfish a few times.

Posted August 8, 2013 8:13 PM by P J Evans

The Implications are Staggering -- comment #46 from Cally Soukup

Thu, 08 Aug 2013 20:26:03 -0500

Proofing OCRed text for Distributed Proofreaders for Project Gutenberg has made me smile every time I see the word "arid". Because, you see, in an OCRed text, ninety-nine times or more out of a hundred the word is supposed to be "and". Also: watch out for he/be errors. That's another very, very common swap, in both directions.

Posted August 8, 2013 8:26 PM by Cally Soukup

The Implications are Staggering -- comment #47 from P J Evans

Thu, 08 Aug 2013 20:33:32 -0500

I spent a few months proofing OCRd text. My favorites were when the software scrambled 'District' to 'Omelet' and turned 'legal obligation' to 'lethal obligation' (well, it was about DC bonds).

Posted August 8, 2013 8:33 PM by P J Evans

The Implications are Staggering -- comment #48 from Erik Nelson

Thu, 08 Aug 2013 20:46:03 -0500

This is what I deal with every day on the job.

Posted August 8, 2013 8:46 PM by Erik Nelson

The Implications are Staggering -- comment #49 from albatross

Fri, 09 Aug 2013 11:03:35 -0500

eric:

The users will be blamed, but shipping a product with a normal setting that can cause this kind of damage is an amazingly bad idea. And I'll bet that there are other copiers and scanners with the same problems.

Posted August 9, 2013 11:03 AM by albatross

The Implications are Staggering -- comment #50 from Alan Hamilton

Fri, 09 Aug 2013 12:01:54 -0500

This is inherent to anything using JBIG2 compression, so yeah, it could affect other documents. I've seen the same issue on scanned documents converted to PDFs on a PC. For example, this NTSB report. There are a lot of typos and weird font substitutions. This was converted using PDFWriter 3 in 2000, so this is nothing new.

Posted August 9, 2013 12:01 PM by Alan Hamilton

The Implications are Staggering -- comment #51 from Lin Daniel

Sat, 10 Aug 2013 19:28:27 -0500

Speech to text software does some amusing stuff, too. jan finder would send me email with uncorrected speech to text. There were times I'd have to read it out loud, with his inflections, to figure out what it was he was saying. And once, sent it back saying, "I love it when you talk dirty to me" because it was totally unintelligible.

Posted August 10, 2013 7:28 PM by Lin Daniel

The Implications are Staggering -- comment #52 from Kevin Reid

Sun, 11 Aug 2013 13:09:20 -0500

> Interestingly, the errors actually look correct on the PDF, as compared to the paper. It's only when the text is copied out and pasted into another program that the parsing errors show up.

An OCR'd PDF simply has the text computed by the OCR algorithm placed behind a copy of the original image, so that it is invisible (but still possible to select). When you select some text, you're actually getting the hidden character text, but it happens to line up with the part of the original image.

(Similar scenario that was a topic a while ago: failed redactions of PDFs, where black boxes are just added on top, but the original text is unaltered.)

Posted August 11, 2013 1:09 PM by Kevin Reid

The Implications are Staggering -- comment #53 from oldster

Sun, 11 Aug 2013 14:49:19 -0500

Relax, people!

It's all cool. Xerox knew about this all along. In fact, they warned you about it, in their documentation.

From a user-manual:
“Normal/Small produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and *character substitution errors may occur* with some originals.” [asterisks added].

From the Xerox spox:
"You are also correct that we have documented in our user guides as well as within our devices that the high compression mode may cause character substitution which means we have known about the potential for this issue. Our design philosophy was to make available a very useful mode that creates small files while at the same time providing information about its limitations."

Both of these quotes are from their blog:

http://realbusinessatxerox.blogs.xerox.com/2013/08/07/update-on-scanning-issue-software-patches-to-come/#.UgfbRmTF2RY

where they also say that they are working on a software patch for this.

I give them some credit for being very honest on the blog. Of course, Corporate Counsel's Office doesn't mind being honest, because they know that they warned about it in the original user manuals. And that warning means that they are shielded from any liability.

Don't you feel more relaxed now? Xerox does!

Posted August 11, 2013 2:49 PM by oldster

The Implications are Staggering -- comment #54 from Pfusand

Sun, 11 Aug 2013 14:50:40 -0500

The OCR error I was most grateful to spot was the one that changed "She licked her lips" into "She licked her hips."

Posted August 11, 2013 2:50 PM by Pfusand

The Implications are Staggering -- comment #55 from David Weingart

Tue, 13 Aug 2013 13:34:32 -0500

Pfusand @ 54: She'd have to be pretty flexible

Posted August 13, 2013 1:34 PM by David Weingart

The Implications are Staggering -- comment #56 from Pfusand

Tue, 13 Aug 2013 15:39:44 -0500

David @ 55,
Well, she was an alien, but not that sort of alien.

Posted August 13, 2013 3:39 PM by Pfusand

The Implications are Staggering -- comment #57 from Allan Beatty

Tue, 13 Aug 2013 18:55:08 -0500

Corporate counsel may think they are honest when there is a warning buried deep in a manual.

It would be more honest if the button on the screen said "Sub-Normal" instead of "Normal".

Posted August 13, 2013 6:55 PM by Allan Beatty

The Implications are Staggering -- comment #58 from Lee

Wed, 14 Aug 2013 11:48:22 -0500

Allan, #57: Exactly. A format subject to errors of this type should never have been set as the DEFAULT. Designating it as "smallest" or "ultra-compressed" or, yes, "sub-normal" -- combined with the manual's warning that in using this mode you risk substitution errors -- would have been the ethical approach.

Posted August 14, 2013 11:48 AM by Lee

The Implications are Staggering -- comment #59 from Mary Aileen

Wed, 14 Aug 2013 13:14:41 -0500

Lee (58): Daniel@10 says the overly compressed setting was not the factory default, but calling it "Normal" implies that it's perfectly okay. And yes, that's *very* problematic.

Posted August 14, 2013 1:14 PM by Mary Aileen

The Implications are Staggering -- comment #60 from Renee

Thu, 15 Aug 2013 22:27:38 -0500

Hmm. I've seen read errors of this sort when using a hand-held scanner to do inventories of barcoded items. Here I thought such errors were all caused by smudged barcodes. Nice to know what causes this -- in the sense that 'nice' means I can now explain why random nonsense is showing up in the data.

...And now I get to worry about non-nonsensical number swaps. Sigh.

Posted August 15, 2013 10:27 PM by Renee

The Implications are Staggering -- comment #61 from paul

Fri, 16 Aug 2013 15:21:37 -0500

Not that it really matters, but how many of the people who use a typical office copier even have access to the user manual, much less have read the thing cover to cover?

Posted August 16, 2013 3:21 PM by paul

The Implications are Staggering -- comment #62 from Kaleberg

Sun, 18 Aug 2013 19:40:48 -0500

The real problem is that only one in a thousand font designers realizes that "6" and "8" are different characters. The rest of them seem to lack the mental capacity to understand the concept. I'm not surprised a Xerox machine couldn't tell a six from an eight. Most of the time, I can't either.

Posted August 18, 2013 7:40 PM by Kaleberg

The Implications are Staggering -- comment #63 from Daniel Martin

Tue, 01 Sep 2015 14:03:30 -0500

The discoverer of the bug has recorded an hour-long presentation on the experience of finding and reporting the bug here: https://www.youtube.com/watch?v=c0O6UXrOZJo.

It includes some interesting details not available when this first broke, and among other things I need to retract my statement in comment 10: there are Xerox scanners that, in the factory default setting, mangled text including substituting some characters for others.

Patches now exist for those pieces of hardware, but many places haven't applied them. If you depend on scanned documents that would be made useless by the transposition of one or more characters, ensure that your IT people know of this problem and have applied the appropriate patches.

Posted September 1, 2015 2:03 PM by Daniel Martin