[BCAB] How to Carry out a Big OCR/Indexing Project

Tony tony at seeingear.org
Wed May 15 11:18:39 BST 2013


Hello

I tried to automate ABBYY FineReader to batch process books and found it 
impossible because it is not reliable: it might process three books or 
fifty, then crash and need restarting. FineReader works well if you use it 
interactively but is not so good for a big project. The only OCR server I am 
aware of that will handle a huge input without falling over is PrimeOCR, but 
the price is prohibitive (well over £5,000 depending on options).
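
One way to live with an OCR engine that crashes part-way through a batch is 
to run it once per file from a small driver script and keep a log of what 
has finished, so a restart picks up where it left off. This is only a 
minimal sketch in Python: the function name ocr_batch, the done-log file 
and the build_cmd callback (which returns the command line for whatever 
command-line OCR tool you have) are my own assumptions, not part of any 
product mentioned here.

```python
import subprocess
from pathlib import Path


def ocr_batch(files, build_cmd, done_log, timeout=600, retries=2):
    """Run one OCR process per file so a crash only loses that file.

    build_cmd(path) must return the command line (a list of strings) for
    your OCR tool; it is a placeholder supplied by the caller.
    """
    done_log = Path(done_log)
    done = set(done_log.read_text().splitlines()) if done_log.exists() else set()
    failed = []
    for f in map(str, files):
        if f in done:
            continue  # finished in an earlier run, so skip it on restart
        for _ in range(retries + 1):
            try:
                result = subprocess.run(build_cmd(f), timeout=timeout)
                ok = result.returncode == 0
            except (subprocess.TimeoutExpired, OSError):
                ok = False  # hung, or the tool could not be started
            if ok:
                with done_log.open("a") as log:
                    log.write(f + "\n")  # record success straight away
                break
        else:
            failed.append(f)  # every attempt failed; report it at the end
    return failed
```

You would call it with something like 
ocr_batch(Path("judgments").glob("*.pdf"), lambda f: ["my-ocr-tool", f], "done.txt"), 
where "my-ocr-tool" stands in for the real command. Files that still fail 
after the retries are returned so they can be handled by hand.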

When you say indexing, do you mean picking out various headings to store in 
a database so that you can cross-reference related judgments? This will be 
the trickiest part, depending on the form of the text you scan. ABBYY do a 
product (I forget the exact name) specifically to read standardised 
documents such as invoices and index them, working from a template you 
supply of what the document looks like. When I played with it, it could not 
handle multiple formats in the same document, so it is no good for books. I 
suggest you contact their R&D; they are pretty helpful (and OmniPage are 
not).
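
If by indexing you only mean being able to search the OCR text, rather 
than pulling out headings, a simple word index may be enough. The 
following is a rough sketch in Python, assuming each judgment has already 
been converted to plain text; the function names build_index and search 
are made up for illustration, and a real project would more likely use an 
existing search tool.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    hits = [index.get(w, set()) for w in words]
    return set.intersection(*hits)
```

Here docs is a dictionary of file name to OCR text. Because OCR errors 
change the spelling of words, an exact-match index like this will miss 
some documents, which is one reason the quality of the OCR matters.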

There was a big project to digitise newspapers in New Zealand. What they 
did was to OCR the text as best they could, then put it online in a sort of 
wiki, inviting members of the public to correct any OCR errors they found. 
This worked well. Could you get a grant to do something like that, or get a 
university to do it as a project?

I have not tried it myself, but transcribers at Dublin University use 
ABBYY's PDF converter for OCR in preference to FineReader, as they say it 
gives better results.

HTH

Tony Dart
CTO
The Seeing Ear
www.seeingear.org



--------------------------------------------------
From: "M Lakhani" <muzz.lakhani at googlemail.com>
Sent: Wednesday, May 15, 2013 12:51 AM
To: "BCAB Discussion List" <bcab at lists.bcab.org.uk>
Subject: Re: [BCAB] How to Carry out a Big OCR/Indexing Project

> Hmm, I'm trying to understand what the issue is here.
> You said that you want something that works better with your OCR than 
> what's on offer from the Irish Law Society. I'm assuming that the purpose 
> of an OCR conversion is to provide a text-searchable document, right? You 
> said that their PDF service is inferior to your OCR, meaning one of these 
> possibilities:
> Their PDFs are sometimes inaccessible?
> Or what they put out can't be navigated or searched by text?
>
> Or the search function on their site is primitive and can't provide 
> good results, Google can't provide the results either, and you want 
> documents for offline use?
> What OCR do you currently use? How do you want to categorise these 
> documents? If I could get some more info, I'd be happy to assist :)
>
> HTH
> Muzz
>
> Sent from my iPhone
>
> On 14 May 2013, at 23:36, Gerard Sadlier <gerard.sadlier at gmail.com> wrote:
>
>> Hi all,
>>
>> The Law Society in Ireland (where I live) has put its collection of
>> unreported judgments online, in pdf.
>>
>> This is a brilliant development, in many ways, since before I would
>> have had to scan these judgments to read them.
>>
>> However, I do not think the judgments are searchable in their current
>> form (I will check with the library but I am almost certain).
>>
>> I want to be able to search the judgments.
>>
>> Also, I want ideally to OCR them myself. This is because I have read
>> several of the judgments. While the Law Society has provided text for
>> the pdfs, I find my own ocr works better.
>>
>> I am looking for suggestions as to how I can:
>>
>> 1. OCR these; and
>> 2. Render them searchable.
>>
>> I can download the judgments automatically to folders on my machine
>> and I would like to OCR and index them automatically too.
>>
>> Your help would be much appreciated.
>>
>> Kind regards
>>
>> Ger
>>
>> -- 
>> To find out more about BCAB and the benefits that membership can bring, 
>> please visit our website:
>> http://www.bcab.org.uk/
>>
>> To manage your subscription to the BCAB mailing list, please visit our 
>> website:
>> http://www.bcab.org.uk/bcab-discussion-list/
>>
>> To discuss matters relating to the mailing list, please email 
>> moderator at bcab.org.uk.



