The project was a collaboration between Islamic Studies of the Department of Languages and Cultures of the Near East, the Department of Computational Linguistics, and the Heidelberg Research Architecture (HRA) at the Cluster of Excellence Asia and Europe. It received funding from the Cluster in 2009-10 and was active through 2013. It digitized the first 26 printed volumes of the "Turkologischer Anzeiger" (Turkology Annual), created an OCR'ed text version, and developed a methodology to improve the automatic extraction of structured data from this kind of multilingual text resource.

Data extraction and post-processing

This Phase 1 project is a good example of what can be achieved with interdisciplinary research using a Digital Humanities approach. It not only created a new open access resource but also developed and published state-of-the-art automated data extraction methods.

Digitization and creation of the OCR'ed text were carried out in the MediaLab run by the HRA. Because of the high quality of ABBYY FineReader's OCR output, the 20+ different languages within the source turned out to be less problematic than expected. Instead, the syntactic analysis of the individual records proved to be the challenge, as entry types and data structures are often only implicitly marked and some of them change from volume to volume. Together with partners from Computational Linguistics, the project developed an enhanced approach to citation segmentation, using joint inference with Markov logic networks, and published it in a major DH journal (cf. publication). To further improve the output at the level of individual records, the project created a prototype of a data correction interface as a basis for possible future crowdsourcing efforts. All project outcomes are published on the Turkology Annual Online website, which provides open access to the data.

During the project, the main researcher from Ottoman Studies found a job outside academia. While the project was able to set up a frontend to the database and publish on aspects of segmentation, maintenance of the web resource is difficult. In October 2016 the HRA issued a call to the community (e.g. via the H-Turk list) asking for help to "Sustain the Turkology Annual Online," but received little feedback.

Dustin Heckmann, Anette Frank, Matthias Arnold, Peter Gietz, Christian Roth; Citation segmentation from sparse & noisy data: A joint inference approach with Markov logic networks. Lit Linguist Computing (Digital Scholarship in the Humanities) 2014; 31 (2): 333-356. doi: 10.1093/llc/fqu061