Page break: The start of each new page is marked up at exactly the place where it occurs in the text. This can be before the start of an entry or in the middle of a text. Page breaks are marked with the tag <pb>, which should carry two attributes: the page number (n) and the facsimile file name (facs). Do not forget the closing slash: because the tag contains no text, the short form <pb/> can be used instead of <pb></pb>.

  <pb n="[pagenumber]" facs="[folder/filename]"/>

Facsimiles: Our database accepts PNG files for facsimiles. If the source is a PDF file, each page has to be extracted as a separate PNG file. Scans that show two pages together have to be cropped into two single-page files. If possible, facsimile filenames should include the original page number, a short title, and brief information on edition, volume, date, part, or whatever else seems necessary and useful. For example,

  <pb n="657" facs="images/EncBrit/EncBrit.ed11.v02.0657.png"/>

indicates in the filename that the image shows page 657 of Encyclopedia Britannica, 11th edition, vol. 2. Another example, from a Chinese encyclopedia, shows page 25 in the section of characters with 4 strokes:

  <pb n="二五" facs="images/Falue/FalueJingjiCidian.strokes04.0025.png"/>
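
The naming convention above can be applied mechanically when many facsimiles have to be tagged. The following is a minimal sketch, not part of our toolchain: the helper names facs_path and pb_tag are hypothetical, and the four-digit zero padding of the page number is an assumption read off the two examples above.

```python
from xml.sax.saxutils import quoteattr


def facs_path(folder, short_title, page, parts=()):
    """Join short title, optional info parts, and a zero-padded page number."""
    # Assumption: page numbers are padded to four digits, as in the examples.
    name = ".".join([short_title, *parts, f"{page:04d}"])
    return f"{folder}/{name}.png"


def pb_tag(n, facs):
    """Return a self-closing <pb/> tag; quoteattr escapes and quotes the values."""
    return f"<pb n={quoteattr(n)} facs={quoteattr(facs)}/>"


# Rebuilds the Encyclopedia Britannica example shown above.
path = facs_path("images/EncBrit", "EncBrit", 657, ["ed11", "v02"])
print(pb_tag("657", path))
```

The n attribute is passed as a string on purpose, so that a page number can be recorded as it is printed in the source (for example 二五), while the filename always uses Arabic digits.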

Milestone: We also want to mark up the beginning of a new volume or of a special section; this is done with the tag <milestone/>. For a huge encyclopedia of many volumes, the beginning of each volume should be indicated. For a Chinese dictionary ordered by stroke count, it is useful to mark the beginning of each new stroke count. Other criteria may also prove useful in the course of data mining.

  <milestone unit="volume" n="2"/>

Gap: For most encyclopedias, we do not digitize the full work, but only selected articles from it. The location and extent of the text that was not digitized is indicated by the tag <gap/>. <gap/> can also be used for unreadable parts of the text, or for text omitted for other reasons. It should also be indicated whether the gap extends over only a few words, many lines, or even whole volumes. To make this information explicit, <gap/> should take at least two attributes: reason and extent.

  <gap reason="not digitized" extent="156 pages"/>

Example from Encyclopedia Britannica, 11th ed.:

  <gap reason="not digitized" extent="1 volume"/>
  <milestone unit="volume" n="2"/>
  <gap reason="not digitized" extent="156 pages"/>
  <pb n="657" facs="images/EncBrit.11.02.0657.png"/>
  <gap reason="not digitized" extent="50 lines"/>
  <div type="entry" subtype="art">
    <head>
      <term xml:id="ART">
        <hi rend="bold">ART</hi>
      </term>
    </head>
    <div type="article">
      <p>,<gloss target="#ART" rend="unmarked"> a word in its most extended and most popular sense meaning everything which we distinguish from Nature</gloss>. Art and Nature are the two most comprehensive genera of which the human mind has formed the conception.
        Under the genus Nature, or the genus Art, we include all the phenomena of the universe. But as our conception of Nature is ...
      </p>
    </div>
  </div>
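
Markup like the example above can be checked automatically for the attributes these guidelines ask for (n and facs on <pb/>, reason and extent on <gap/>). The following is a rough sketch using only the Python standard library; the function name check_markers and the REQUIRED table are hypothetical, and the fragment is wrapped in an artificial root element because a run of sibling tags is not a complete XML document on its own.

```python
import xml.etree.ElementTree as ET

# Attributes these guidelines ask for; extend the table as needed.
REQUIRED = {"pb": ("n", "facs"), "gap": ("reason", "extent")}


def check_markers(fragment):
    """Return one message per required attribute missing from <pb/> or <gap/>."""
    # Assumption: the fragment is well-formed apart from lacking a single root.
    root = ET.fromstring(f"<root>{fragment}</root>")
    problems = []
    for el in root.iter():  # visits elements in document order
        tag = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if present
        for attr in REQUIRED.get(tag, ()):
            if attr not in el.attrib:
                problems.append(f"<{tag}> is missing {attr}")
    return problems


sample = '<gap reason="not digitized" extent="156 pages"/><pb n="657"/><gap/>'
for message in check_markers(sample):
    print(message)
```

For the sample string, the check reports the missing facs on <pb/> and both missing attributes on the bare <gap/>; the first <gap/> passes. The check only tests that the attributes are present, not that their values are sensible.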