I was pleased to see that a Salon piece on plagiarism published yesterday mentions the Lenore Hart affair, but disappointingly — and quite misleadingly — it refers to the evidence as mere "allegations". Since I get the impression that some people may still not fully appreciate just how clear-cut this case is, I thought it'd be useful to explain the mechanics behind my conviction that the media's fear of any comeback from calling a plagiarist a plagiarist is misplaced.
Here's a quick primer on plagiarism-hunting in the Internet age. It may once have all hinged on subjective judgment calls and grey areas but it's now advanced far beyond that. Today it's about examining quantifiable data and calculating probabilities — probabilities that offer a lot more certainty of making the right call than state-of-the-art DNA profiling.[1]
Here's how it works.
Last year, blogging about the glut of one-minute’s silences at Spanish football matches, I wrote this sentence:
Now no self-respecting Liga match can be without one to mark the passing of Alderman Mumble (sorry, the PA system isn't all it might be)
Now let’s suppose a luckless hypothetical plagiarist (we'll call her “LHP” for short) came along and rewrote my sentence like this:
This season, would any Spanish football ground worth its salt go without one to mark Councillor Somebodyorother's sad demise? (The stadium’s Tannoy is playing up again, I’m afraid.)
LHP has been careful enough to cover her tracks by paraphrasing almost every important word or phrase. If we mark the text she's copied verbatim, all we're left with is this:
This season, would any Spanish football ground worth its salt go without one to mark Councillor Somebodyorother’s sad demise? (The stadium’s Tannoy is playing up again, I'm afraid.)
Yes, there is. Ridiculously commonplace though the words themselves may seem, the exact string "without one to mark" has only ever been used once by anybody on the whole of the Internet: by me. Type it into Google (including the inverted commas, to avoid hits for each individual word) and see for yourself.
As if that weren’t "huh?" enough, now comes the really weird part. LHP could have stayed much closer to my original, like this, yet even so remained practically Google-proof:[2]
Now no self-respecting Spanish match can forgo having one to mark the passing of Councillor Mumble (the PA system wasn’t all it might have been, sorry)
It turns out that "now no self-respecting" has been used half a million times, and even the six-word string "one to mark the passing of" gets over a thousand hits, but put "without" before it and we find nobody has ever used that exact string except me and LHP.[3] And if, of all the possible contexts for LHP to have used that string in, it appeared in a piece that also happened to be about one-minute's silences at Spanish football grounds, and if we then factor in the extensive paraphrasing (changing "passing" to "sad demise" and "match" to "game" and so on), then ... you get the picture.
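For anyone who'd rather not hunt for the surviving strings by eye, here is a minimal sketch of the first step in Python. It only lists the word sequences that two passages share verbatim; it does not query Google, so each candidate still has to be typed into a search engine in inverted commas by hand. The function names and the choice of four-word windows are my own assumptions for illustration, not part of any established tool.

```python
import re

def word_ngrams(text, n):
    """Return the set of n-word sequences in text (lower-cased, punctuation dropped)."""
    # Normalise curly apostrophes so "isn’t" and "isn't" compare equal.
    words = re.findall(r"[a-z0-9'-]+", text.lower().replace("’", "'"))
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(a, b, n=4):
    """Return the n-word strings that appear verbatim in both passages."""
    return sorted(word_ngrams(a, n) & word_ngrams(b, n))

original = ("Now no self-respecting Liga match can be without one to mark "
            "the passing of Alderman Mumble (sorry, the PA system isn't all "
            "it might be)")
rewrite = ("This season, would any Spanish football ground worth its salt go "
           "without one to mark Councillor Somebodyorother's sad demise? "
           "(The stadium's Tannoy is playing up again, I'm afraid.)")

# Candidate strings to check as exact-phrase searches.
print(shared_ngrams(original, rewrite))
```

Run on my sentence and LHP's rewrite, the only four-word string the two have in common is "without one to mark", which is exactly the string worth searching for.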
That's how Lenore Hart came such a spectacular cropper. Presumably to avoid detection, she took great pains to change words that were of obvious semantic consequence — she must have worn her thesaurus to dust — but she failed to pay enough attention to those piddling little text strings that are of merely syntactic significance. What she did is like a burglar meticulously vacuuming the furniture and carpet to remove any trace of hair or fibre evidence, and then leaving a big fat fingerprint on the doorknob on the way out.
As Poe wouldn't have put it, just do the math
There are fifteen million volumes in Google Books' database. Hitting by chance on a non-subject-specific text string — like our "without one to mark" or O'Neal/Hart's "privacy or as protection against" — in only two of them, which happen to deal with exactly the same topic, is several orders of magnitude more difficult than winning the lottery.
But if you end up with a tally of not just one but thirty-one exclusive string matches like this, we leave the realm of mere allegation behind us and stride firmly into dead-cert fact.
Expressed in the simplest, most conservative terms possible, the odds that Lenore Hart didn't plagiarise Cothburn O'Neal's novel are 1 in 15,000,000^31. To give you an idea of just how big that number is, 15 million squared is 225 trillion, while 15 million cubed is 3.375 billion trillion. So 15 million to the power of 31 is ... you get the idea.[4]
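If you'd like to check the sums rather than take my word for it, a few lines of Python will do it, since Python handles integers of any size exactly. This is only a sketch of the simple model used above, which assumes each of the thirty-one matches is an independent one-in-fifteen-million event; the variable names are mine.

```python
VOLUMES = 15_000_000  # volumes in Google Books' database, as quoted above
MATCHES = 31          # exclusive string matches, as quoted above

squared = VOLUMES ** 2       # 225 trillion
cubed = VOLUMES ** 3         # 3.375 billion trillion
full = VOLUMES ** MATCHES    # the denominator of the odds

print(f"{squared:,}")
print(f"{cubed:,}")
print(len(str(full)), "digits")
```

The last line reports that 15 million to the power of 31 is a number 223 digits long.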
Isn't it time to stop alleging that Lenore Hart is a plagiarist and start calling her what she is: a proven one?
_____
1. A margin of error of 1 in 7000 is accepted as sufficiently conclusive for positive identification in crime-scene or paternity-case DNA analyses.
2. I say "practically" because although Google allows you to use an asterisk to stand in for any word in a string (e.g. "now no self-respecting * match"), it's too cumbersome a method to be viable when checking for plagiarism, because you have to work blind, with no idea which words you need to turn into asterisks.
3. Jeremy Duns has tried the same exercise, typing text strings from his own novels into Google Books, always with the same results: either no hits at all or far too many for it to be practical to wade through them. Just one match is always a loud plagiarism alert.
4. It's actually an even bigger number than 15,000,000^31, because you also have to allow for things like motive (Lenore Hart's first draft, on the same subject as O'Neal's novel, had been rubbished by her editor and she'll also have been under deadline pressure to get the book out before Poe stopped being a hot property because of the bicentennial celebrations) and opportunity (of the seven billion people on the planet, probably only a few hundred are alive who have read The Very Young Mrs Poe, but, on her own admission, Lenore Hart is one of them), etc.