
Heidelberg Research Architecture Services

The HRA provides various services to both students and researchers. We maintain a small MediaLab for digitizing and editing visual material, and we co-host services such as the video annotation platform Pan.do/ra. We also offer workshops and training sessions tailored to your needs, for example on image, audio or video editing software, on bibliographic management software like Zotero or Juris-M, or on heidICON, the object and multimedia repository of Heidelberg University Library.

In addition, the HCTS IT and the CATS Library offer items for short-term rental, such as digital camera equipment.

Contact: Matthias Arnold

MediaLab

The MediaLab is open to all CATS members and students. It comprises three well-equipped workstations for digitizing and editing visual material. The workstations feature professional image scanners (Epson Expression 10000 XL, Epson Perfection v700 Photo) and book scanners (Plustek OpticBook A300), as well as advanced image editing software (Adobe Creative Suite 6) and specialized OCR tools (Abbyy FineReader). In addition, all workstations have essential helper tools (like IrfanView) and the full Microsoft Office suite installed. Our team will be happy to assist you with all matters related to visual material and digitization. Together with our IT, we also offer training for all equipment and software.

The MediaLab at the HCTS is located in room 400.00.05B, on the ground floor of the east wing. Since the number of workstations is limited, please make sure to book in advance via SharePoint.
The CATS Library has an excellent A2 overhead book scanner, which university members can also book in advance via SharePoint.

Video Annotation

Together with the ZO-IT, the HRA offers a video annotation platform: Pan.do/ra. It allows collaborative annotation of audio and video sequences in different layers, one for each annotation type: transcription, description, keywords, locations, events. Videos and annotations can be set to private, group, or public access. Annotations can be imported and exported in the .srt format.
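
For readers who want to process exported annotations outside the platform, the following is a minimal sketch in Python. It assumes the standard SubRip (.srt) layout: a running number, a time line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and one or more lines of text, with cues separated by blank lines. How Pan.do/ra maps its annotation layers onto .srt files is not described here. The names `Cue` and `parse_srt` and the sample annotation text are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches an SRT time line such as "00:00:01,000 --> 00:00:04,500".
TIME_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Cue:
    """One annotation: running number, start/end in seconds, and its text."""
    index: int
    start: float
    end: float
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert the four captured time fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into a list of cues, skipping malformed blocks."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.strip().splitlines()
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = TIME_LINE.match(lines[1].strip())
        if match is None:
            continue
        parts = match.groups()
        cues.append(
            Cue(
                index=int(lines[0].strip()),
                start=_seconds(*parts[:4]),
                end=_seconds(*parts[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return cues


# Hypothetical sample data, not taken from a real export.
SAMPLE = """1
00:00:01,000 --> 00:00:04,500
Opening shot of the harbour

2
00:00:05,000 --> 00:00:09,250
Speaker introduces the archive
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(f"{cue.index}: {cue.start:.3f}-{cue.end:.3f} {cue.text}")
```

Blocks that do not match the assumed layout are skipped rather than raising an error, which is usually the more forgiving choice for hand-edited files.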

Pan.do/ra was presented at the workshop “Annotation of audio-visual data” held at HCH19 - The 2nd Heidelberg Summer School for Computational Humanities 2019. The training material used can be accessed via heiBOX.
If you are interested in using the platform, want to learn more about its features, or wish to upload your own material, please contact us.

References:

  • Arnold, Matthias, Hans Martin Krämer, Hanno Lecher, Jan Scholz, Max Stille, Sebastian Vogt. “Möglichkeiten und Grenzen der Videoannotation mit Pan.do/ra - Forschung, Lehre und institutionelles Repositorium.” In: Bilddaten in den digitalen Geisteswissenschaften. Ed. by Canan Hastik and Philipp Hegel. Harrassowitz 2020, pp. 231-253.
  • Wübbena, Thorsten, Eric Decker, Matthias Arnold. “»Losing My Religion« – Einsatz der Videoannotationsdatenbank Pan.do/ra in der kunstgeschichtlichen Analyse von Musikvideos.” In: Grenzen und Möglichkeiten der Digital Humanities. Ed. by Constanze Baum and Thomas Stäcker. 2015 (= Sonderband der Zeitschrift für digitale Geisteswissenschaften, 1). DOI: 10.17175/sb001_018.

 

Text Recognition

The IT offers text recognition services for everyday use (optical character recognition, OCR). The simplest option is Adobe Acrobat Pro and its “Edit PDF” function, which lets you OCR individual PDF files using a single recognition language. You can also use the Abbyy OCR Server, which is available via the internal network at CATS. It can recognize texts containing letters or characters in multiple modern languages and scripts. On your own computer you can install many other tools, like the free PDF-XChange, which also provides language packs, e.g., for Japanese or Chinese.

All these solutions work very well with contemporary (usually digitally printed) texts. However, recognizing text from historic documents (and “historic” usually starts around the 1970s) can become quite a challenging task. To go deeper into text recognition, you have various options. One is to use the MediaLab at the HCTS, which provides the advanced Abbyy FineReader software. With this tool, you can not only freely combine recognition languages, but also train the recognition algorithm on your material. You can also define your own set of characters (your own "language") and use it to train recognition. In addition, you can define the areas you want to recognize, and mark tables or lists separately. FineReader also offers a correction interface and various output formats, including plain text, docx/odt or XML.
Another option is the Transkribus platform. It uses artificial intelligence to recognize even handwritten texts, and offers powerful transcription services. While its focus lies on texts in Latin script, a number of initiatives have shown that Transkribus can also be used for languages in non-Latin scripts. At the CATS, the digital project “Naval Kishore Press” successfully used the platform to recognize Hindi and Sanskrit texts written in Devanagari.

A number of other platforms deal with text recognition, using approaches of optical character recognition (OCR), handwritten text recognition (HTR), or computational text recognition (CTR). Examples include eScriptorium (currently being tested by the University Library) and OCR4all, to name just two. Even more advanced approaches use dedicated recognition pipelines, like those of the German OCR-D initiative. Unlike Adobe Acrobat, these pipelines cannot simply be installed on your computer: they consist of multiple steps that often require more powerful hardware, such as that of high-performance computing centers. One example of a project at the CATS that is developing such a text recognition pipeline, in this case for newspapers from Republican China, is the Early Chinese Periodicals Online (ECPO) project.

If you are interested in learning more about computational text recognition or are thinking about implementing such algorithms in your research project, please contact Matthias Arnold.