[BCAB] How to Carry out a Big OCR/Indexing Project

M Lakhani muzz.lakhani at googlemail.com
Wed May 15 21:53:15 BST 2013


I'd be happy to carry it out if you want :)
If interested, please contact me off list to discuss it :)

Muzz

Sent from my iPhone

On 15 May 2013, at 11:18, "Tony" <tony at seeingear.org> wrote:

> Hello
> 
> I tried to automate ABBYY FineReader to batch process books and found it impossible as it is not reliable - it might process three books or fifty, then crash. Then it would need restarting. FineReader works great if you use it interactively but is not so good for a big project. The only OCR server I am aware of that will handle a huge input without falling over is PrimeOCR, but the price is prohibitive (well over £5,000 depending on options).
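
On the crash-and-restart problem: one way round it is to start a fresh OCR process for each book, so a crash loses only that book and the batch carries on. Below is a rough sketch only. It assumes the OCR engine can be driven from a command line, which may not be true of FineReader; run_batch and build_command are names made up for the illustration, and "ocr-tool" in the comment is a placeholder, not a real program.

```python
import subprocess


def run_batch(files, build_command, max_attempts=3, timeout=600):
    """Run one OCR process per file, retrying a file if the process fails.

    build_command is a caller-supplied function (hypothetical) that turns a
    file name into the command line to run, for example
    lambda f: ["ocr-tool", f]  # "ocr-tool" is a placeholder name
    """
    done, failed = [], []
    for path in files:
        for _attempt in range(max_attempts):
            try:
                result = subprocess.run(build_command(path), timeout=timeout)
            except subprocess.TimeoutExpired:
                continue  # treat a hang like a crash and try again
            if result.returncode == 0:
                done.append(path)
                break
        else:
            # Every attempt failed: set the file aside for manual attention.
            failed.append(path)
    return done, failed
```

The failed list is what you would then put through interactively, which is where FineReader is said to do well anyway.
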
> 
> When you say indexing, do you mean picking out various headings to stick in a database so that you can reference related judgements? This will be the trickiest part, depending on the form of the text you scan. ABBYY do a product (forget the exact name) specifically to read standardised documents such as invoices and index them by supplying a template of what the document looks like. When I played with it, it could not handle multiple formats in the same document, so it is no good for books. Suggest you contact their R&D - they are pretty helpful (and OmniPage are not).
> 
> There was a big project to digitise newspapers in New Zealand and what they did was to OCR the text as best they could, then put it on-line in a sort of wiki, inviting members of the public to correct any OCR errors they found. This worked well. Could you get a grant to do something like that? Get a university to do it as a project?
> 
> Untried myself, but transcribers at Dublin University use ABBYY PDF convertor for OCR in preference to FineReader as they say it gives better results.
> 
> HTH
> 
> Tony Dart
> CTO
> The Seeing Ear
> www.seeingear.org
> 
> 
> 
> --------------------------------------------------
> From: "M Lakhani" <muzz.lakhani at googlemail.com>
> Sent: Wednesday, May 15, 2013 12:51 AM
> To: "BCAB Discussion List" <bcab at lists.bcab.org.uk>
> Subject: Re: [BCAB] How to Carry out a Big OCR/Indexing Project
> 
>> Hmm, I'm trying to get what the issue is here!
>> You said that you want something that will do better with your OCR than what's on offer from the Irish Law Society. I'm assuming that the purpose of an OCR conversion is to provide a text-searchable document, right? You said that their PDF service is inferior to your OCR, which suggests one of these possibilities:
>> Their PDFs are sometimes inaccessible?
>> Or that what they put out can't be navigated or searched by text?
>> 
>> Or that the search function on their site is primitive & can't provide good results, & Google can't provide the results either, & you want documents for offline use?
>> What OCR do you currently use? How do you want to categorise these documents? If I could please get some more info, I'd be happy to assist :)
>> 
>> HTH
>> Muzz
>> 
>> Sent from my iPhone
>> 
>> On 14 May 2013, at 23:36, Gerard Sadlier <gerard.sadlier at gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> The Law Society in Ireland (where I live) has put its collection of
>>> unreported judgments online, in pdf.
>>> 
>>> This is a brilliant development, in many ways, since before I would
>>> have had to scan these judgments to read them.
>>> 
>>> However, I do not think the judgments are searchable in their current
>>> form (I will check with the library but I am almost certain).
>>> 
>>> I want to be able to search the judgments.
>>> 
>>> Also, I ideally want to OCR them myself. This is because I have read
>>> several of the judgments. While the Law Society has provided text for
>>> the PDFs, I find my own OCR works better.
>>> 
>>> I am looking for suggestions as to how I can:
>>> 
>>> 1. OCR these; and
>>> 2. Render them searchable.
>>> 
>>> I can download the judgments automatically to folders on my machine
>>> and I would like to OCR and index them automatically too.
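
For the "render them searchable" part, here is a minimal sketch of a word index in Python. It assumes the OCR step has already saved each judgment as a plain .txt file in one folder; build_index and search are names invented for this illustration, and a real project would more likely use an existing desktop search tool or a full-text database.

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(text_dir):
    """Map each lower-cased word to the set of file names containing it."""
    index = defaultdict(set)
    for path in sorted(Path(text_dir).glob("*.txt")):
        # errors="replace" keeps going if the OCR output has odd bytes in it.
        text = path.read_text(encoding="utf-8", errors="replace")
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path.name)
    return index


def search(index, query):
    """Return the sorted file names that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Calling build_index on the folder of OCR output and then search(index, "high court injunction") would list the files containing all three words. It matches exact words only, so any OCR misreadings in the text will be missed by the search.
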
>>> 
>>> Your help would be much appreciated.
>>> 
>>> Kind regards
>>> 
>>> Ger
>>> 
>>> -- 
>>> To find out more about BCAB and the benefits that membership can bring, please visit our website:
>>> http://www.bcab.org.uk/
>>> 
>>> To manage your subscription to the BCAB mailing list, please visit our website:
>>> http://www.bcab.org.uk/bcab-discussion-list/
>>> 
>>> To discuss matters relating to the mailing list, please email moderator at bcab.org.uk.



