
The Problem

The TA contains entries in a large number of different languages, including transcriptions of Arabic and of languages written in the Cyrillic alphabet. Even single entries may contain chunks in several different languages. We expected this to be a serious obstacle to digitization with the Optical Character Recognition (OCR) software available at the KJC: even very good OCR results are hardly acceptable for building a database, since entries containing recognition mistakes cannot be reliably retrieved in search. It turned out, however, that with appropriate fine-tuning the OCR results were of such high quality that the few remaining errors were mostly irrelevant for typical search queries.

While this meant that the effort of developing automatic OCR correction software would not be justified for our project, we encountered problems of another kind: syntax analysis of the TA entries proved much harder than anticipated, as entry types and data structures are often only implicitly marked, and some of them change from volume to volume. In addition, syntax analysis (parsing) had to cope with structural errors in the entries - errors that the human editors made and that human readers would not even notice, but that can seriously mislead a parser. Our parsing software therefore needed to be tailored to the data in order to be both comprehensive and robust.
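To illustrate the kind of robustness this requires, the following is a minimal sketch in Python of a parser that tries a strict entry pattern first and falls back to a lenient one, so that structurally deviant entries are still captured rather than discarded. The entry format, field names, and regular expressions here are hypothetical illustrations, not the actual TA conventions:

```python
import re

# Hypothetical entry pattern: "Author: Title. Source, Year."
STRICT = re.compile(
    r"^(?P<author>[^:]+):\s*(?P<title>[^.]+)\.\s*(?P<source>.+?),\s*(?P<year>\d{4})\.?$"
)
# Lenient fallback: recover at least the author and a trailing year, if present.
LENIENT = re.compile(r"^(?P<author>[^:]+):\s*(?P<rest>.*?)(?P<year>\d{4})?\.?$")

def parse_entry(line):
    """Parse one bibliographic entry, degrading gracefully on structural errors."""
    text = line.strip()
    m = STRICT.match(text)
    if m:
        return {**m.groupdict(), "parse": "strict"}
    m = LENIENT.match(text)
    if m:
        d = m.groupdict()
        return {
            "author": d["author"].strip(),
            "title": d["rest"].strip(" .,"),
            "source": None,
            "year": d["year"],
            "parse": "lenient",
        }
    # Last resort: keep the raw text so no entry is silently lost.
    return {"raw": text, "parse": "failed"}

print(parse_entry("Doe, J.: A Study of Entries. Journal of Examples, 1987."))
print(parse_entry("Doe, J.: A malformed entry 1987"))
```

The design point is the cascade: a well-formed entry yields fully structured fields, a deviant one yields partial fields, and even an unparseable one is retained verbatim - mirroring the requirement that human-introduced structural errors must not cause entries to drop out of the database.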