
Heidelberg Research Architecture: Data Sets

Heidelberg University provides the research data repository heiDATA, a professional platform for Open Research Data. The system is maintained and coordinated by the University's Research Data Unit and is based on the Dataverse software.
Below you will find an annotated list of data sets published in the context of, or with the help of, the HRA. Not all of the sets are accessible through heiDATA.

Early Chinese Periodicals Online (ECPO)

ECPO joins several important digital collections of the early Chinese press and puts them into a single overarching framework. In the first phase, several databases on early women’s periodicals and entertainment publishing were created: “Chinese Women’s Magazines in the Late Qing and Early Republican Period” (WoMag), “Chinese Entertainment Newspapers” (Xiaobao), and databases hosted at the Academia Sinica in Taiwan. ECPO not only provides image scans, but also preserves materials often excluded in reprint, microfilm, or digital (even full-text) editions, such as advertising inserts and illustrations. In addition, it aims at incorporating metadata in both English and Chinese, including keywords and biographical information on editors, authors and individuals represented in illustrations and advertisements in the journals. The project developed an Agent service that provides central access to all names annotated in the material, and assigns them to their respective agents.

As the material basis of the database consists mostly of image scans, the project has been running experiments on one Republican newspaper to explore approaches toward full-text generation. Extremely complex layouts make reliable automatic page segmentation difficult and have prevented full-text generation for these newspapers even within China. In the fall of 2021, the project successfully implemented OCR on a sample of the newspaper 晶報 Jing bao (The Crystal) with a character error rate below 3% (Henke 2021). On that basis, the project is now expanding and generalizing its approach. With additional funding by the Research Council Cultural Dynamics in Globalized Worlds, the project aims to offer a solution that automatically produces full text from Republican newspapers using neural networks and machine learning. The project’s current work will further develop its original aims and contribute to the field of research as a whole. With the disclosure of its network models and data sets, its results can be reproduced and evaluated, and others in the field can adopt its approaches. Although processing non-Latin scripts is still a challenge in many cases, the project hopes its work may serve as a good-practice example for such initiatives. The data set provides a first and complete extract of all metadata edited by the project so far. Future versions will also incorporate the full text produced in the project’s OCR pipeline.
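The character error rate mentioned above is conventionally defined as the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The following is a minimal sketch of that calculation, not the project's evaluation code; the function names and the two sample strings are invented for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing in the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the ground truth."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented sample: ten ground-truth characters, one misrecognised (報 read as 报).
ground_truth = "晶報是上海的一份小報"
ocr_output = "晶报是上海的一份小報"

cer = character_error_rate(ground_truth, ocr_output)
print(f"CER: {cer:.1%}")
```

A rate below 3% therefore means fewer than three character edits per hundred ground-truth characters.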

For a list of publications and presentations related to ECPO please see our bibliography.

Cataloging Cultural Objects - Sample Records

The work “Cataloging Cultural Objects - a Guide to Describing Cultural Works and Their Images” (CCO) [1] provides a data content standard for catalogers of cultural heritage, and is further supported by the online examples provided on CCO Commons [2]. A completed project of the Heidelberg Research Architecture at Universität Heidelberg focused on providing the CCO Commons examples encoded in the VRA Core 4.0 XML metadata schema [3]. The records make use of the multilingual extension of VRA Core 4 developed by the HRA [4]. We provide the XML data as a use case for adapting VRA Core 4 XML and implementing multilingual metadata. All records were published on GitHub [5]. Agnes Dober, a former member of the HRA, presented a poster at the DARIAH-DE Grand Tour 2018 [6], taking the CCO XML records as an example [7].

[1] Baca, Murtha et al. 2006. Cataloging Cultural Objects: CCO ; a Guide to Describing Cultural Works and Their Images. Chicago: American Library Association. Available online at: https://vraweb.org/resources/cataloging-cultural-objects/.
[2] Visual Resources Association. “Examples.” CCO Commons. http://web.archive.org/web/20190307174324/http://cco.vrafoundation.org/index.php/toolkit/index_of_examples. In 2018 and 2019, CCO (Cataloging Cultural Objects) was the subject of Special Interest Groups organized by the VRA Cataloging and Metadata Standards Committee.
[3] VRA Core: http://www.loc.gov/standards/vracore/.
[4] http://cluster-schemas.uni-hd.de/vra-strictCluster.xsd, see also the draft online specification.
[5] Arnold, Matthias, Tobi Krebs, Simon Grüning, Matthias Guth and Agnes Dober. “CCO Samples.” GitHub - exc-asia-and-europe/CCO-Samples. https://github.com/exc-asia-and-europe/CCO-Samples.
[6] DARIAH-DE Grand Tour 2018 https://de.dariah.eu/dariah-de-grand-tour-2018
[7] Agnes Dober. "'Transcultural Metadata' - An exploration of the way our metadata is culturally limited." DARIAH-DE Grand Tour 2018, Poster presentation.
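
To illustrate how language-tagged values in a VRA Core 4 record might be read, the sketch below parses a small record and indexes its titles by their xml:lang attribute. The record is an invented placeholder, not one of the published CCO samples, and it uses only plain VRA Core 4 elements; the additional attributes of the HRA extension are not shown here.

```python
import xml.etree.ElementTree as ET

# VRA Core 4 namespace and the predefined XML namespace used for xml:lang.
NS = {"vra": "http://www.vraweb.org/vracore4.htm"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented placeholder record (assumption: not taken from the CCO samples).
SAMPLE_RECORD = """\
<vra:vra xmlns:vra="http://www.vraweb.org/vracore4.htm">
  <vra:work id="w_example">
    <vra:titleSet>
      <vra:title type="descriptive" pref="true" xml:lang="en">Example work title</vra:title>
      <vra:title type="translated" xml:lang="de">Beispieltitel des Werks</vra:title>
    </vra:titleSet>
  </vra:work>
</vra:vra>
"""

root = ET.fromstring(SAMPLE_RECORD)

# Map each language code to the title text carrying that code.
titles_by_language = {
    title.get(XML_LANG): title.text
    for title in root.iterfind(".//vra:title", NS)
}

for language, text in titles_by_language.items():
    print(f"{language}: {text}")
```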

The Abou Naddara Collection - James Sanua's complete works

The data set contains the bibliographic descriptions of the majority of James Sanua's magazines from 1878 to 1910. Publication records are provided in MODS XML and include the publication title in Arabic, in western transcription, and in English or French translation (using the translation printed on the originals, if available), the date of each issue in Arabic and western notation, as well as the issue number and number of pages. Scans were produced in TIFF format at 300 ppi, 8-bit.
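
As an illustration of how title variants in a MODS record could be read, the sketch below parses a small record with Python's standard library. The record is a simplified placeholder and is an assumption: the published records may organise their titles and dates differently, and the transliteration value and translated title shown here are invented.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Simplified placeholder record (assumption: not copied from the data set).
SAMPLE_RECORD = """\
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo lang="ara" script="Arab">
    <title>أبو نظارة</title>
  </titleInfo>
  <titleInfo lang="ara" transliteration="placeholder-scheme">
    <title>Abou Naddara</title>
  </titleInfo>
  <titleInfo type="translated" lang="eng">
    <title>[English translation]</title>
  </titleInfo>
  <originInfo>
    <dateIssued encoding="w3cdtf">1878</dateIssued>
  </originInfo>
</mods>
"""

root = ET.fromstring(SAMPLE_RECORD)

# Classify each title as original, transliterated or translated.
titles = {}
for title_info in root.findall("mods:titleInfo", MODS_NS):
    if title_info.get("type") == "translated":
        kind = "translated"
    elif title_info.get("transliteration"):
        kind = "transliterated"
    else:
        kind = "original"
    titles[kind] = title_info.findtext("mods:title", namespaces=MODS_NS)

date_issued = root.findtext("mods:originInfo/mods:dateIssued", namespaces=MODS_NS)

print(titles)
print("Issued:", date_issued)
```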

Data was compiled by Eliane Ursula Ettmueller, Matthias Arnold, Johannes Alisch, Nina Sassani, Florian Kempf, and Hans Harder in 2017.

 

Turkology Annual Online

The Turkology Annual Online project digitised the first 26 volumes of the Turkologischer Anzeiger / Turkology Annual (TA). The contents were OCR'ed and parsed into individual bibliographic records.[1] Using the digital data, we related the records to their respective entries in the TA-specific subject headings, which were also translated into English. In addition, we created a full list of cited publications and connected related records, e.g. reviews to the main publications. We extracted 61540 of the 61639 total records and provide them in JSON format.
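
The record counts above correspond to an extraction coverage of roughly 99.84%, with 99 records missing. The sketch below reproduces that arithmetic and shows how records in a JSON file of this kind might be grouped by subject heading and linked to their reviews. The field names (id, subject_heading, review_of) and the three sample records are hypothetical and do not reflect the actual structure of the published file.

```python
import json
from collections import Counter

EXTRACTED_RECORDS = 61540
TOTAL_RECORDS = 61639

coverage = EXTRACTED_RECORDS / TOTAL_RECORDS
missing = TOTAL_RECORDS - EXTRACTED_RECORDS
print(f"{coverage:.2%} of records extracted, {missing} missing")

# Hypothetical sample records; the real field names may differ.
sample_json = """[
  {"id": "ta-1", "subject_heading": "A", "review_of": null},
  {"id": "ta-2", "subject_heading": "A", "review_of": "ta-1"},
  {"id": "ta-3", "subject_heading": "B", "review_of": null}
]"""
records = json.loads(sample_json)

# Count records per subject heading.
per_heading = Counter(record["subject_heading"] for record in records)

# Collect the reviews attached to each main publication.
reviews_by_publication = {}
for record in records:
    if record["review_of"] is not None:
        reviews_by_publication.setdefault(record["review_of"], []).append(record["id"])

print(dict(per_heading))
print(reviews_by_publication)
```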

As of 2023, all TA data produced by the project has been migrated to the Specialised Information Service Near East (FID Nah-Ost), where it can be accessed at the University of Halle.

The data set was produced by Dustin Heckmann, Matthias Arnold, Christian Roth, Mateusz Dolata, Nicolas Bellm, Arina Chitavong, Jens Hansche; Supervision: Peter Gietz, Anette Frank, Michael Ursinus. Publication of final data set: 2021.
Note: Without the continuous efforts of Dustin Heckmann, who worked especially on improving the parsing - even long after his contract ended - this data set would not exist.

[1] Heckmann, Dustin, Anette Frank, Matthias Arnold, Peter Gietz, and Christian Roth. "Citation Segmentation from Sparse & Noisy Data: A Joint Inference Approach with Markov Logic Networks." Digital Scholarship in the Humanities 31, no. 2 (2016, First published online 8 December 2014): 333-356. doi: 10.1093/llc/fqu061

GECCA mapped

GECCA mapped is a visualization providing geo-referenced metadata of sixty exhibition entries from the Group Exhibitions of Contemporary Chinese Art (GECCA) database.

The data set was compiled by Franziska Koch in 2018.

 

„Altfundstellen“ der Neumark

Full title: Katalog der „Altfundstellen“ der Neumark (in der ehemaligen Provinz Brandenburg in den heutigen Wojewodschaften Zachodnio-Pomorskie, Lubuskie und Wielkopolskie) aus den Archiven der Prähistorischen Abteilung des Märkischen Museums Berlin, des Museums für Vor- und Frühgeschichte Berlin und des Brandenburgischen Landesamts für Denkmalpflege.

The data set was compiled by Dr. Armin Volkmann in 2018.

Nepal Heritage Documentation Project (NHDP)

The heart of the Nepal Heritage Documentation Project (NHDP) is the Digital Archive for Nepalese Art and Monuments (DANAM). The project is located at the Heidelberg Centre for Transcultural Studies (HCTS) and the Academy of Sciences (AdW), and is operated in cooperation with the Saraf Foundation and the Department of Archaeology, Nepal. The database offers visual and textual documentation of heritage monuments that are threatened by urbanisation and natural disasters. The data sets contain structured information on the monuments, i.e. details of their location, history, architectural structure, and religious and social activities.

The data set was compiled by the NHDP team headed by Prof. Christiane Brosius and Prof. em. Axel Michaels. The first data sets were uploaded in 2020.

Search and Retrieval of Indic Texts (SARIT)

SARIT provides electronic editions of texts in Sanskrit and other Indian languages. These are documented, dated and have embedded notes about their change history, so that they can be publicly cited and used with confidence as scholarly sources. It currently also offers tools for text search, retrieval and analysis of the works in the SARIT library. You can search for words and phrases, and have your search results displayed as keywords-in-context. All the texts at SARIT are licensed under a Creative Commons license. You can download all the texts in the following formats: XML, EPUB and PDF; you can also open the XML file online.
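
A keywords-in-context display shows each search hit together with a few words on either side of it. The sketch below illustrates the general idea on a whitespace-tokenised line of transliterated Sanskrit; it is not SARIT's search implementation, and the function name and the bracket notation for hits are illustrative choices.

```python
def keywords_in_context(text: str, keyword: str, width: int = 2) -> list[str]:
    """Return every occurrence of `keyword` with up to `width`
    whitespace-separated tokens of context on each side."""
    tokens = text.split()
    hits = []
    for index, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, index - width):index])
            right = " ".join(tokens[index + 1:index + 1 + width])
            hits.append(f"{left} [{token}] {right}".strip())
    return hits


# Sample line: the opening of the Bhagavadgītā in transliteration.
line = "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ"

for hit in keywords_in_context(line, "kurukṣetre", width=1):
    print(hit)
```

Exact token matching is the simplest possible choice here; a real search over Sanskrit would also have to deal with sandhi, compounds and inflected forms.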