[BCAB] How to Carry out a Big OCR/Indexing Project
tony at seeingear.org
Wed May 15 11:18:39 BST 2013
I tried to automate ABBYY FineReader to batch process books and found it
impossible because it is not reliable: it might process three books or fifty,
then crash and need restarting. FineReader works well if you use it
interactively but is not so good for a big project. The only OCR server I am
aware of that will handle a huge input without falling over is PrimeOCR, but
the price is prohibitive (well over £5,000 depending on options).
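One way round the crash-and-restart problem is to start a fresh OCR process for each file, so that a crash costs one book rather than the whole run. The following is only a rough Python sketch of that idea, not something tested against FineReader or PrimeOCR. The names `run_batch`, `build_command`, `max_attempts` and `timeout` are invented for illustration, and it assumes your OCR tool can be started from the command line and returns exit code 0 on success.

```python
import subprocess


def run_batch(paths, build_command, max_attempts=3, timeout=600):
    """Run one OCR process per file and return (succeeded, failed) lists.

    build_command is a caller-supplied (hypothetical) function that turns a
    file path into the argument list for whatever OCR program is in use.
    """
    succeeded, failed = [], []
    for path in paths:
        for _attempt in range(max_attempts):
            try:
                result = subprocess.run(build_command(path), timeout=timeout)
            except (subprocess.TimeoutExpired, OSError):
                continue  # treat a hang or a failed launch like a crash
            if result.returncode == 0:
                succeeded.append(path)
                break
        else:
            # Every attempt crashed, hung or returned an error code.
            failed.append(path)
    return succeeded, failed
```

You would pass something like `lambda p: ["your-ocr-tool", str(p)]` as `build_command`, where `your-ocr-tool` is a placeholder for whatever command-line OCR program you actually have. Files that end up in the failed list can be retried later or processed interactively.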
When you say indexing, do you mean picking out various headings to stick in
a database so that you can reference related judgements? This will be the
trickiest part, depending on the form of the text you scan. ABBYY do a
product (I forget the exact name) specifically to read standardised documents
such as invoices and index them from a template you supply of what the
document looks like. When I played with it, it could not handle multiple
formats in the same document, so it is no good for books. I suggest you
contact their R&D; they are pretty helpful (and OmniPage are not).
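If by indexing you only mean being able to search the OCR output by word, rather than pulling out headings, the core idea is an inverted index: a table from each word to the documents that contain it. Below is a minimal Python sketch, assuming the OCR text of each judgment is already available as a string. `build_index` and `search` are hypothetical names, and a real project would more likely use an existing full-text search tool.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document names containing it.

    documents is a dict of {document name: OCR text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

This only matches whole words exactly, so OCR errors in a word will hide that document from the search; that is one more reason the quality of the OCR matters.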
There was a big project to digitise newspapers in New Zealand, and what they
did was to OCR the text as best they could, then put it online in a sort of
wiki, inviting members of the public to correct any OCR errors they found.
This worked well. Could you get a grant to do something like that? Or get a
university to do it as a project?
I have not tried it myself, but transcribers at Dublin University use ABBYY
PDF convertor for OCR in preference to FineReader, as they say it gives
better results.
The Seeing Ear
From: "M Lakhani" <muzz.lakhani at googlemail.com>
Sent: Wednesday, May 15, 2013 12:51 AM
To: "BCAB Discussion List" <bcab at lists.bcab.org.uk>
Subject: Re: [BCAB] How to Carry out a Big OCR/Indexing Project
> Hmm, I'm trying to work out what the issue is here!
> You said that you get better results with your own OCR than with what's
> on offer from the Irish Law Society. I'm assuming that
> the purpose of an OCR conversion is to provide a text-searchable document,
> right? You said that their PDF service is inferior to your OCR,
> meaning one of these possibilities:
> Their PDFs are sometimes inaccessible?
> Or what they put out can't be navigated or searched by text?
> Or the search function on their site is primitive and can't provide
> good results, Google can't provide the results either, and you want
> documents for offline use?
> What OCR do you currently use? How do you want to categorise these
> documents? If I could please get some more info, I'd be happy to assist :)
> On 14 May 2013, at 23:36, Gerard Sadlier <gerard.sadlier at gmail.com> wrote:
>> Hi all,
>> The Law Society in Ireland (where I live) has put its collection of
>> unreported judgments online, in PDF.
>> This is a brilliant development, in many ways, since before I would
>> have had to scan these judgments to read them.
>> However, I do not think the judgments are searchable in their current
>> form (I will check with the library but I am almost certain).
>> I want to be able to search the judgments.
>> Also, I want ideally to OCR them myself. This is because I have read
>> several of the judgments. While the Law Society has provided text for
>> the PDFs, I find my own OCR works better.
>> I am looking for suggestions as to how I can:
>> 1. OCR these; and
>> 2. Render them searchable.
>> I can download the judgments automatically to folders on my machine
>> and I would like to OCR and index them automatically too.
>> Your help would be much appreciated.
>> Kind regards
>> To find out more about BCAB and the benefits that membership can bring,
>> please visit our website:
>> To manage your subscription to the BCAB mailing list, please visit our
>> To discuss matters relating to the mailing list, please email
>> moderator at bcab.org.uk.