A Minimal Set of Tags for Marking Up Encyclopaedias

by Jens Østergaard Petersen, May 2010.

For additions and recent changes, see pages accessible on the left.

Contents:

Why mark up texts?
Mark-up and formatting
Mark-up technicalities
Chunk-based mark-up
Contents-based mark-up: Names and titles
Quotations
Highlighting
Soft Hyphens
Notes
Character variants
Bibliographies
Verse
Authorship
Terms
Dictionary-type information
Language

Why mark up texts?

The function of mark-up is to describe the structure of a document. A document "as such" is just a string of letters, all on the same level, but when we read it, we interpret it in terms of structures, giving each element a function in relation to the whole. By using mark-up, we can make explicit the structure that we "read into" a document. We can note, e.g., that some parts of a document are chapters with a heading and a body and that other parts are personal names consisting of given and family names. By marking up a document, we enrich it semantically, not by writing about it in some other document, but by lodging semantic information inside it, inside the tags that surround the elements of the text that we are interested in. A document can become quite complicated and information-heavy in this way. Using a database that understands the structure, it is possible to search in a very detailed manner in the document, treating it as a database. The "added" information can also be displayed in the form of informative formatting when displaying the document on the web or printing it.

Our task is to mark up the encyclopaedia documents in a manner that all can accept, with a degree of detail that makes sense in terms of the effort involved. We should basically mark up the things we will want to search for, things that are relevant to the project aims.
The mark-up scheme we will use is based on TEI (Text Encoding Initiative), the most comprehensive mark-up standard available in the humanities. You are encouraged to download the TEI Guidelines and look through them; these guidelines have been formulated by a great number of people in the humanities over several decades and contain much of interest to a student of the humanities. The guidelines contain "A gentle introduction to XML", which is recommended.
Special TEI modules have been made for many different kinds of text, but not for encyclopaedias, so we will have to find the tools in TEI that work for us.
Each selection of encyclopaedia entries will have bibliographical metadata in what is called the (TEI) header. Here title, language, time of publication and so on will be noted. This is not the mark-up discussed here: here I will present a set of tags to mark up the body of the text itself.
Marking up a text is often an iterative process. You go through the text several times, each time marking up things you have only just discovered were important. It will probably be the case that we will want to mark up more features than those in the list below, and if you find any features that you think it would be a good idea for all to use, we should discuss them. The ability to add features to a document without having to restructure it is very valuable, allowing many people to work on one document over a long period of time, continually enriching it.
All mark-up is interpretative, but "real" interpretations, in the form of individual research contributions, will not be stored in the form of mark-up; such interpretations can instead be linked to the sections of the text that they discuss.
Since the Cluster is concerned with concepts, one might think that concepts were the things we would mark up in a detailed way. However, since there is no organised structured corpus of concepts that is generally agreed upon, we will not do this, at the most marking up where central concepts occur in the text.


Mark-up and formatting

Mark-up is added to "pure text", i.e. text without any formatting. Pure text is what you see e.g. in Notepad in Windows and in TextEdit (with Plain Text turned on) in Mac OS: just letters, punctuation and carriage returns and so on in a row. All word processing programs allow you to save a document as "pure text" or "plain text," but be aware that this usually means that all headers and footers, as well as all endnotes and footnotes disappear (or lose their link with the text itself). The text should be saved in UTF-8 format (if you have the option of saving it without a BOM (Byte Order Mark), do this). All 107,154 characters in Unicode can be used, except two: "<" must be represented as "&lt;" (for "less than") and "&" as "&amp;" (for "ampersand") – this is because these two characters play a special role in XML. TEI is a "dialect" of XML, so the document should have the extension ".xml".
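The two replacements and the UTF-8 requirement can also be applied by a small script rather than by hand. The following is only a sketch in Python (which is not part of the mark-up scheme itself); the function names are made up for illustration, and the standard-library function escape() also turns ">" into "&gt;", which is harmless.

```python
from xml.sax.saxutils import escape


def to_xml_text(raw: str) -> str:
    """Replace the characters that play a special role in XML."""
    # escape() converts "&" to "&amp;" and "<" to "&lt;" (and ">" to "&gt;").
    return escape(raw)


def save_plain_utf8(path: str, text: str) -> None:
    """Write text as UTF-8; Python's "utf-8" codec does not add a BOM."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


print(to_xml_text("a < b & c"))
```

The escaping should be done on the pure text before any tags are added, since the "<" of a real tag must of course stay as it is.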
Usually, we use formatting (large size, quotation marks, bold, italics, extra space between letters, underlining and so on) to signal that certain passages have a special meaning, but exactly which meaning this formatting has is not always very clear. Tagging is about making things like that explicit, as explicit as is reasonable.
When one marks up text one may choose to convert formatting to tags before one saves it as pure text and starts working with it as XML. In Chinese texts, for instance, titles and proper names are marked up by different kinds of underlining and it makes good sense to try to preserve this information. If one has a document in a European language where all titles are marked with italics, one should convert the formatting to title tags in a word processing program before saving it as pure text. If you have documents where a large investment has been made into formatting features with precision, you should contact me and I will see how much of this can be converted automatically into mark-up tags.


Mark-up technicalities


Mark-up consists of tags put around strings of text. Every mark-up is "opened" and "closed." It opens by having the name of the tag with "<" and ">" around it and it closes by having the name of the tag with "</" and ">" around it. It's that simple.
As an example we will take the following piece of text:

Monty Python's Life of Brian, also known as Life of Brian, is a 1979 comedy film written, directed and largely performed by the Monty Python comedy team. It tells the story of Brian Cohen (played by Graham Chapman), a young Jewish man who is born in Judea, and is subsequently mistaken for Jesus Christ.

One could here choose to mark up all names and titles.

<title>Monty Python's Life of Brian</title>, also known as <title>Life of Brian</title>, is a 1979 comedy film written, directed and largely performed by the <name>Monty Python</name> comedy team. It tells the story of <name>Brian Cohen</name> (played by <name>Graham Chapman</name>), a young Jewish man who is born in <name>Judea</name>, and is subsequently mistaken for <name>Jesus Christ</name>.

Tags must never overlap, but they can nest in each other. One can e.g. elaborate on the first tag and write "<title><name>Monty Python</name>'s Life of Brian</title>", but it will not do to write "<name><title>Monty Python</name>'s Life of Brian</title>", since then the name and title tag do not nest, because "name" is not completely contained within "title". One can nest tags to any degree of depth.
A tag has a name; tags with capital letters are not identical to tags with minuscules, so <Name>Monty Python</name> is not valid, because </name> does not close <Name> – the opening tag would have to be written as "<name>".
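These rules can be checked mechanically, since any XML parser will refuse a document in which tags overlap or do not match in capitalisation. The following is a minimal sketch using Python's standard library; the function name is made up for illustration, and the fragment is wrapped in a <p> only so that the parser always sees a single outermost element.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML when wrapped in one <p>."""
    try:
        ET.fromstring("<p>" + fragment + "</p>")
        return True
    except ET.ParseError:
        return False


nested = "<title><name>Monty Python</name>'s Life of Brian</title>"
overlapping = "<name><title>Monty Python</name>'s Life of Brian</title>"
wrong_case = "<Name>Monty Python</name>"

for sample in (nested, overlapping, wrong_case):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted; an XML editor will report the same kind of error as soon as you type it.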

Of course, one can also be more specific about which type of name is involved. This is typically done with an attribute. An attribute is written inside the opening tag and consists of a name followed by an equals sign and a value in quotation marks. Here capitalisation also matters. One can have only one attribute with a given name inside a tag, and it can have only one value. Do not use smart (curly) quotation marks (“” or ‘’), only straight ones.
For names and titles we use a "type" attribute. Names can e.g. be of type "person" or type "place" and so on, and the titles mentioned could be marked as being of type "film".

<title type="film">Monty Python's Life of Brian</title>, also known as <title type="film">Life of Brian</title>, is a 1979 comedy film written, directed and largely performed by the <name type="troupe">Monty Python</name> comedy team. It tells the story of <name type="person">Brian Cohen</name> (played by <name type="person">Graham Chapman</name>), a young Jewish man who is born in <name type="place">Judea</name>, and is subsequently mistaken for <name type="person">Jesus Christ</name>.


Note that the closing tag never holds attributes – it simply consists of the name of the tag itself.
When the text is displayed in the browser, the tags are not directly visible. They can be made to show up as formatting – one might apply a different colour to each of the different name and title types – or they are discarded and only serve as aids in searching.
If we have a corpus of texts marked up according to these conventions, we will be able to make more exact searches for names and titles and we could easily list all names occurring in the text, which kinds of names were used, and so on.
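To give an impression of what such a listing could look like, here is a minimal sketch in Python using only the standard library. It is not the search function of the project's database; the function name and the shortened sample passage are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<p><title type="film">Life of Brian</title> was performed by the
<name type="troupe">Monty Python</name> comedy team. It tells the story of
<name type="person">Brian Cohen</name> (played by
<name type="person">Graham Chapman</name>), who is born in
<name type="place">Judea</name>.</p>"""


def names_by_type(xml_text: str) -> dict:
    """Group the text of every <name> element by its "type" attribute."""
    groups = defaultdict(list)
    for name in ET.fromstring(xml_text).iter("name"):
        # Join the text of the element and collapse runs of white space.
        text = " ".join("".join(name.itertext()).split())
        groups[name.get("type", "untyped")].append(text)
    return dict(groups)


print(names_by_type(SAMPLE))
```

Names without a "type" attribute are collected under "untyped", so nothing that has been marked up is lost from the listing.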


Chunk-based mark-up


There is no one way to mark up a text: in a project like ours, there have to be some guidelines on which specific tags can be used, otherwise the tags will not be of general use.
There are different classes of mark-up. One class is concerned with reproducing the structure of the text, how it divides into books, chapters, sections and paragraphs. This happens on the level of what one might call a text "chunk". Another kind is concerned with marking up passages inside such chunks with tags like <name> and <title>, to make explicit their meaning. Such mark-up occurs on the phrase level. There is also mark-up that can occur in both classes: a list may occur instead of a paragraph or it may occur inside a paragraph (the same goes for figures and notes).
The standard chunk of text is a paragraph. It is marked by <p></p>, so you have:

<p>
(text of paragraph 1)
</p>
<p>
(text of paragraph 2)
</p>

There will not necessarily be several paragraphs in each encyclopaedia entry, but even if it consists of only one paragraph, it will have to be marked with <p></p> as well.
    Indentation is disregarded. Indentations (made with tabs) are a kind of white space, just like carriage returns, regular spaces and so on. To the program that processes the XML document, these are all the same and they are all collapsed, meaning that if you have any number of these in a row, the program automatically regards them as a single space. A single space is a single space, and therefore significant, but several carriage returns, tabs and spaces are also a single space. This is quite important when you work with the text in an XML editor, because it means that you can make the structure of the text visually clear by adding as many spaces, tabs and carriage returns as you want. An XML editor can do this automatically for you, adding one indentation (four spaces) for each time the document nests one level, but you can also do it by hand.
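The collapsing described here is done by the program that processes the document; a bare XML parser (Python's ElementTree, for example) hands over the white space exactly as typed. A minimal sketch of the collapsing step, with a function name made up for illustration:

```python
def collapse_whitespace(text: str) -> str:
    """Treat any run of spaces, tabs and line breaks as a single space."""
    # str.split() without arguments splits on any run of white space
    # and also drops white space at the very beginning and end.
    return " ".join(text.split())


indented = "\n        (text of\n\t    paragraph   1)\n    "
print(collapse_whitespace(indented))
```

Note that this sketch also removes white space at the beginning and end of the string, which is usually what one wants for the content of a <p>.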
In encyclopaedias, paragraphs "lodge" within entries. Such entries are marked by a <div>. A <div> is a general term for any kind of division, so if one wishes to specify which kind of division one is talking about, one will have to type the <div>. This happens by supplying it with an attribute. Attributes occur inside a tag – the attribute name is written after the tag name, followed by an =, after which the attribute value is written in quotation marks.
The entry as such will thus be marked "<div type="entry">" and it will be closed by "</div>" (from now on, there will be no mention of closing tags).

<div type="entry">
<p>
(text of paragraph 1)
</p>
<p>
(text of paragraph 2)
</p>
</div>

An entry is of course made up of a headword and the entry itself. We will use <head> for the headword. Possibly, there will be alternative headwords, in the form of a small dictionary-like entry, in some of the more modern encyclopaedias. <head type="alt"> can be used for this purpose.
The good thing is that <div>s can nest inside <div>s, meaning that you can create a hierarchy of such divisions. This can also be a little confusing, if you have several layers of nesting, and one should always see to it that the <div>s one has opened are closed in the right place.
The main part of the encyclopaedia entry we will also mark with a <div>, but this time with a different attribute, as <div type="article">. This makes the core structure of an encyclopaedia entry look like the following:

<div type="entry">
<head>
(some headword)
</head>
<head type="alt">
(some alternative headword)
</head>
<div type="article">
<p>
(text of paragraph 1)
</p>
<p>
(text of paragraph 2)
</p>
</div>
</div>

Note that there are two </div>s at the end. The inner one closes <div type="article"> and the outer one closes <div type="entry">.
    If you have headers inside the entry, you just write the <head> information above the <p> or <p>s that it corresponds to.
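Once entries follow this core structure, a program can pick out headwords and paragraphs without guessing. The following is a minimal sketch in Python using only the standard library; the function name and the keys of the result are made up for illustration.

```python
import xml.etree.ElementTree as ET

ENTRY = """<div type="entry">
    <head>(some headword)</head>
    <head type="alt">(some alternative headword)</head>
    <div type="article">
        <p>(text of paragraph 1)</p>
        <p>(text of paragraph 2)</p>
    </div>
</div>"""


def summarise_entry(xml_text: str) -> dict:
    """Return headwords, alternative headwords and the number of paragraphs."""
    entry = ET.fromstring(xml_text)
    heads = entry.findall("head")  # only the <head>s directly inside the entry
    main = [(h.text or "").strip() for h in heads if h.get("type") is None]
    alt = [(h.text or "").strip() for h in heads if h.get("type") == "alt"]
    paragraphs = entry.findall("div[@type='article']/p")
    return {"headwords": main, "alternatives": alt, "paragraphs": len(paragraphs)}


print(summarise_entry(ENTRY))
```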
Other "chunks" include figures and tables/lists. These may occur instead of <p>s.
Figures may be referred to on this model:

<figure>
<head>Emblemi d'Amore</head>
<figDesc>A pair of naked winged cupids, each holding a
flaming torch, in a rural setting.</figDesc>
</figure>

The description inside <figDesc> is from your own hand; the <head> is what is actually written in the source.
No effort should be made to replicate the layout of the page at this stage, but the approximate position of the figure should be indicated by its position in relation to a paragraph. In practice this means that a figure should be placed after the paragraph containing it. The same goes for lists and tables.
Lists are made up according to this pattern:

<list>
<item>
(first item)
</item>
<item>
(second item)
</item>
</list>

There are more variations on this: consult me if you come across a list that does not fit in here.
Tables are more complicated. They contain an optional header and one or more columns and rows; each intersection of a column and a row is a cell:

<table rows="3" cols="4">
<head>Poor Man's Lodgings in Norfolk (Mayhew, 1843)</head>
<row role="label">
<cell></cell>
<cell>Dossing Cribs or Lodging Houses</cell>
<cell>Beds</cell>
<cell>Needys or Nightly Lodgers</cell>
</row>
<row>
<cell role="label">Bury St Edmund's</cell>
<cell>5</cell>
<cell>8</cell>
<cell>128</cell>
</row>
<row>
<cell role="label">Thetford</cell>
<cell>3</cell>
<cell>6</cell>
<cell>36</cell>
</row>
</table>

Note that the first cell is empty, since there is nothing between the tags <cell> and </cell>. This is to ensure that the column labels are correctly aligned with the data below.
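Because the numbers of rows and columns are stated in the opening tag, it is easy to get them out of step with the actual table. A minimal sketch of a check in Python, using only the standard library (the function name is made up for illustration, and the sketch assumes that no cell spans several rows or columns):

```python
import xml.etree.ElementTree as ET

TABLE = """<table rows="3" cols="4">
<head>Poor Man's Lodgings in Norfolk (Mayhew, 1843)</head>
<row role="label"><cell></cell><cell>Dossing Cribs or Lodging Houses</cell>
<cell>Beds</cell><cell>Needys or Nightly Lodgers</cell></row>
<row><cell role="label">Bury St Edmund's</cell><cell>5</cell><cell>8</cell><cell>128</cell></row>
<row><cell role="label">Thetford</cell><cell>3</cell><cell>6</cell><cell>36</cell></row>
</table>"""


def table_is_consistent(xml_text: str) -> bool:
    """Compare the rows/cols attributes with the rows and cells actually present."""
    table = ET.fromstring(xml_text)
    rows = table.findall("row")
    declared_rows = int(table.get("rows"))
    declared_cols = int(table.get("cols"))
    if len(rows) != declared_rows:
        return False
    return all(len(row.findall("cell")) == declared_cols for row in rows)


print(table_is_consistent(TABLE))
```

If the empty first cell of the label row were left out, that row would have only three cells and the check would fail.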
Above I wrote that each tag must be opened and closed, but this is not quite true: one can also have empty tags. In our case, they will only be used for noting page breaks. Page breaks are of course not at all a "chunk," quite the opposite, since they often break up logical chunks in arbitrary ways, but they might as well be introduced here. We want to be able to display the encyclopaedia text as an image beside the text, and images come one per page, so we will want to note the position (even in the middle of a word) of a page break. A page break is noted in the following way:

<pb n="5"/>

"5" is of course the page number: the notation says that a new page, page 5, begins where the mark is. Note that an empty tag opens and closes itself, using the closing notation "/>". (This of course means that the empty cell could have been written "<cell />" as well as "<cell></cell>" – the two are equivalent – but whereas something can occur between "<cell>" and "</cell>", nothing can occur between "<pb n="5"/>" and itself.)
    The semantic structure of the text and its physical structure of course overlap to some degree (a chapter starts on a new page), but because in XML one can only have a single hierarchy, with elements nesting completely inside each other, it is not possible to represent the two structures inside the same XML document.
If the following started on page 4 and ran over into page 5 in the middle, we would have the following mark-up.

<p>
<pb n="4"/>
<title>Monty Python's Life of Brian</title>, also known as <title>Life of Brian</title>, is a 1979 comedy film written, directed and largely performed by the <name type="troupe">Monty Python</name> comedy team. It tells the story of <name type="person">Brian Cohen</name> (played by <name type="person">Graham <pb n="5"/>Chapman</name>), a young Jewish man who is born in <name type="place">Judea</name>, and is subsequently mistaken for <name type="person">Jesus Christ</name>.
</p>
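Since <pb/> is only a point in the text, the text belonging to each page has to be worked out by walking through the document. The following is a minimal sketch of how this could be done in Python with the standard library; the function name and the shortened sample are made up for illustration, and this is not necessarily how the project's web site will do it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p><pb n="4"/>played by <name type="person">Graham <pb n="5"/>Chapman</name>,
a young Jewish man.</p>"""


def text_by_page(xml_text: str) -> dict:
    """Collect the text that follows each <pb n="..."/> up to the next one."""
    pages = {}
    current = None  # text before the first <pb/> is filed under None

    def walk(elem):
        nonlocal current
        if elem.tag == "pb":
            current = elem.get("n")
        elif elem.text:
            pages.setdefault(current, []).append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text following the child, inside its parent
                pages.setdefault(current, []).append(child.tail)

    walk(ET.fromstring(xml_text))
    return {n: " ".join("".join(parts).split()) for n, parts in pages.items()}


print(text_by_page(SAMPLE))
```

The name "Graham Chapman" stays a single <name> in the mark-up, although its two halves end up on different pages.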

If we gather all "chunk" tags introduced into one entry and add in the <pb> tag, we will see something like the following nonsense passage.

<div type="entry">
    <head>
        Monty Python's Life of Brian
    </head>
    <head type="alt">
        Life of Brian
    </head>
    <div type="article">
        <p>
            <pb n="4"/><title>Monty Python's Life of Brian</title>, also known as <title>Life of Brian</title>, is a 1979 comedy film written, directed and largely performed by the <name type="troupe">Monty Python</name> comedy team. It tells the story of <name type="person">Brian Cohen</name> (played by <name type="person">Graham <pb n="5"/>Chapman</name>), a young Jewish man who is born in <name type="place">Judea</name>, and is subsequently mistaken for <name type="person">Jesus Christ</name>.
        </p>
        <table rows="3" cols="4">
            <head>Poor Man's Lodgings in Norfolk (Mayhew, 1843)</head>
            <row role="label">
                <cell></cell>
                <cell>Dossing Cribs or Lodging Houses</cell>
                <cell>Beds</cell>
                <cell>Needys or Nightly Lodgers</cell>
            </row>
            <row>
                <cell role="label">Bury St Edmund's</cell>
                <cell>5</cell>
                <cell>8</cell>
                <cell>128</cell>
            </row>
            <row>
                <cell role="label">Thetford</cell>
                <cell>3</cell>
                <cell>6</cell>
                <cell>36</cell>
            </row>
        </table>
        <p>
            (text of paragraph 2)
        </p>
        <list>
            <item>
                (first item)
            </item>
            <item>
                (second item)
            </item>
        </list>
        <p>
            (text of paragraph 3)
        </p>
    </div>
</div>
  


Contents-based mark-up: Names and titles

The contents we are concerned about are ideas, but ideas are not that easy to mark up. However, ideas are expressed by persons in books and articles, and we would want to be able to search for all mentions of a certain person, e.g. Rousseau, and all mentions of a certain document, e.g. The United States Declaration of Independence, in encyclopaedias written in different languages. But if one searches for "Rousseau", one will obviously not find "盧梭" or "ルソー" or "Жан-Жак Руссо". Something more than <name> is needed – this something is called a key, an externally defined means of identifying the thing being named, by reference to something usually called an authority file.
However, to make up and maintain such a file is a lot of work; therefore we use one of the services available for that purpose, subj3ct. The web site <https://subj3ct.com/> has registered (as of 3 Dec. 2009) 15,681,217 different subjects, i.e. "things", things that we can only talk meaningfully about if we can be more or less sure that we are talking about the same thing. The question of the identity of "things" can, as we all know, be a problem in human communication, but for a computer it is of crucial importance, since a computer is not able to draw the context of a statement into consideration when evaluating an expression.
If you look up "Jean-Jacques Rousseau" on subj3ct.com, you will get to the page <https://subj3ct.com/subject?si=http%3A%2F%2Fdbpedia.org%2Fresource%2FJean-Jacques_Rousseau>. This page gives what is intended to be a unique identifier to the philosopher, etc., Jean-Jacques Rousseau. If one mentions a name, in whatever language, and notes that one means to refer to the subject referred to on this page, no one should be in doubt about who is being mentioned.
On the page you will find a short description, but also "Representations", that is resources on the subject in question, and "Equivalent Subjects", that is other services that identify what is probably the same subject.
At the bottom of the page it says:
 To create a section on a page that is about this subject, paste the following snippet into the document body:

<div about="http://dbpedia.org/resource/Jean-Jacques_Rousseau" >
  <!-- Insert your content here -->
</div>

We use this for <name> keys, transforming it into

<name type="person" key="http://dbpedia.org/resource/Jean-Jacques_Rousseau">
  Жан-Жак Руссо
</name>

or

<name type="person" key="http://dbpedia.org/resource/Jean-Jacques_Rousseau">
  盧梭
</name>

If a subject cannot be found on subj3ct.com, one can add one; more about this later.
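The point of the key is that a search can ignore the spelling of the name altogether. The following is a minimal sketch in Python, using only the standard library, of how all mentions of one subject could be collected; the function name and the tiny sample corpus are made up for illustration.

```python
import xml.etree.ElementTree as ET

ROUSSEAU = "http://dbpedia.org/resource/Jean-Jacques_Rousseau"

CORPUS = f"""<div>
<p><name type="person" key="{ROUSSEAU}">
  Жан-Жак Руссо
</name></p>
<p><name type="person" key="{ROUSSEAU}">
  盧梭
</name></p>
<p><name type="person">Voltaire</name></p>
</div>"""


def mentions(xml_text: str, key: str) -> list:
    """Return the text of every <name> that carries the given key."""
    found = []
    for name in ET.fromstring(xml_text).iter("name"):
        if name.get("key") == key:
            found.append(" ".join("".join(name.itertext()).split()))
    return found


print(mentions(CORPUS, ROUSSEAU))
```

A name without a key, like the third one in the sample, is simply not found by such a search, which is a good reason to key names whenever possible.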
The same mechanism can be used for titles; if we have a reference to The United States Declaration of Independence in German, we could write:

<title key="http://dbpedia.org/resource/United_States_Declaration_of_Independence">
  Unabhängigkeitserklärung der Vereinigten Staaten
</title>

and if we wish to refer to Hegel's Phänomenologie des Geistes in Japanese, we could write

<title key="http://dbpedia.org/resource/The_Phenomenology_of_Spirit" >
  精神現象学
</title>

I think most of Wikipedia's articles have been "harvested" by subj3ct. If one finds the name of a person or a title in one language in Wikipedia, one can fairly reliably find the corresponding name or title in one of the other Wikipedia languages, simply by changing language on the left. Here one can find that the title is "Fyrirbærafræði andans" in Icelandic, "Феноменология духа" in Russian and so on.
How many types of names are there? I think we have to discuss this, but at least we will use the following:

    person
    place
    period
    people
    institution

By "people" are meant a people, i.e. an ethnic group. We also need to refer to spiritual beings, but I suggest that we type all gods and spirits as "persons" for the time being.
Titles also come in different types, but I don't think it would serve any purpose to identify them; titles should, however, be keyed to a subject identifier.
If there are cross-references in the encyclopaedia articles, it is very important to maintain them. They can be marked in the following way:

<ref key="http://dbpedia.org/resource/Marxism">Marxism</ref>
  


Quotations

Quotation marks are one of the fairly clear forms of "pre-modern" mark-up, one that also survives in "pure text". Quotation marks can, however, be used for a multitude of purposes: report of speech, ironic "scare quotes", mentions of words – and for quotations. I suggest that we use two different tags for this: one for quotations from texts, <quote>, and one for all other cases, <q>. In the case of quotations from texts, there will usually be given some sort of bibliographical reference – more about this below. The concrete marks in the texts, "“”‘’ and so on, I suggest we leave as they are. The tags go on the "inside" of the quotation marks, in the following way:

<name type="person" key="http://dbpedia.org/resource/Georg_Wilhelm_Friedrich_Hegel">Hegel</name> wrote in his <title key="http://dbpedia.org/resource/The_Phenomenology_of_Spirit" >The Phenomenology of Spirit</title> that <quote>"the spiritual condition of self-estrangement exists in the sphere of culture as a fact"</quote>. However, I do not believe this to be the case; clearly, <q>'culture'</q> is not a fact.

This is of course a citation and I think we will want to mark up citations. This will be the most complicated mark-up I propose. It has the following structure:

<cit>
    <bibl>
        <author>
        </author>
        <title>
        </title>
    </bibl>
    <quote>
    </quote>
</cit>

The order of the elements is immaterial. If we wrap up the above inside these tags, we get the following:

<cit>
    <bibl>
        <author>
            <name type="person" key="http://dbpedia.org/resource/Georg_Wilhelm_Friedrich_Hegel">Hegel</name>
        </author>
        wrote in his
        <title key="http://dbpedia.org/resource/The_Phenomenology_of_Spirit">The Phenomenology of Spirit</title>
    </bibl>
    that
    <quote>"the spiritual condition of self-estrangement exists in the sphere of culture as a fact"</quote>.
</cit>
However, I do not believe this to be the case; clearly, <q>'culture'</q> is not a fact.
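What the <cit> construct buys us is that author, title and quotation can be picked out of running prose. The following is a minimal sketch in Python, using only the standard library; the function names are made up for illustration, and the keys have been left out of the sample only to keep it short.

```python
import xml.etree.ElementTree as ET

CIT = """<cit>
    <bibl>
        <author><name type="person">Hegel</name></author>
        wrote in his
        <title>The Phenomenology of Spirit</title>
    </bibl>
    that
    <quote>"the spiritual condition of self-estrangement exists in the sphere of culture as a fact"</quote>.
</cit>"""


def flat_text(elem) -> str:
    """Text of an element and everything inside it, with white space collapsed."""
    return " ".join("".join(elem.itertext()).split())


def describe_citation(xml_text: str) -> dict:
    """Pick author, title and quotation out of a single <cit>."""
    cit = ET.fromstring(xml_text)
    return {
        "author": flat_text(cit.find("bibl/author")),
        "title": flat_text(cit.find("bibl/title")),
        "quote": flat_text(cit.find("quote")),
    }


print(describe_citation(CIT))
```

The words "wrote in his" and "that" remain part of the text, but since they stand outside <author>, <title> and <quote>, they do not get in the way.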

Often, there are just bibliographical references, not author and title, and one can then make do with:

<cit>
    <quote>
        You shall know a word by the company it keeps.
    </quote>
    <ref>
        (Firth, 1957)
    </ref>
</cit>

People who are into TEI love such constructs, but I can see why it looks confusing and over-elaborate. Such constructs will help us, however, in making the web site do its job: pointing to ideas expressed by certain persons in certain publications.
Later on, we can perhaps use a more specific way of marking the other uses of quotation marks, one that represents the meaning of the text, not its appearance.


Highlighting

There is one other area where I am tempted to cut corners: that is the many different ways of highlighting a passage of text: italics, bold, underlining, spacing, colouring – all kinds of methods are used to set off one passage of text from its surroundings. This formatting disappears from a word processing document when you save it in pure text format: if one has such formatting in one's text, one should capture it by means of explicit tags before saving in pure text format. I do not think we should make a big fuss about these, but simply mark all of these different forms of highlighting with the tag <hi>. If there are some who (rightly) feel that we are actually subtracting from the value of the text, they are free to use the following attributes on this tag to indicate how the highlighting is rendered:

<hi rend="italic">
<hi rend="bold">
<hi rend="wavy underline">
<hi rend="dotted underline">


Soft hyphens

Hyphens can be an orthographic part of how an expression is rendered (as in "cross-reference" and "twenty-three"), but they can also occur inside a word at the end of a line, or a page, breaking a word in two in order to achieve an aesthetically more pleasing spacing of the lines. Whereas orthographic hyphens are "hard," such hyphens are "soft." They break up words if they occur near the end of a line, but hide themselves elsewhere. Soft hyphens are difficult to work with for two reasons. First, in word processing applications they will normally be completely invisible, but they will show up in pure text; your word processing application will have an option to show all spaces and carriage returns, and this function should also display soft hyphens. Second, there are cases in which one cannot know for sure if a hyphen is soft or hard. Use your knowledge of the text and its language and delete all the hyphens that you know are soft, leave hard hyphens as they are, and mark all doubtful occurrences with <g ref="#shy"/>, as in "re<g ref="#shy"/>creation" (both "recreation" and "re-creation" are possible).
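Marking a doubtful hyphen in this way keeps both readings available to a program. The following is a minimal sketch in Python, using only the standard library; the function name and the <w> wrapper (needed only so that the parser sees one outermost element) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def hyphen_readings(word_xml: str) -> tuple:
    """Return the word without the doubtful hyphen and with a hard hyphen."""
    word = ET.fromstring("<w>" + word_xml + "</w>")
    parts = [word.text or ""]
    for g in word.iter("g"):
        if g.get("ref") == "#shy":
            # A doubtful hyphen: start a new part after it.
            parts.append(g.tail or "")
        else:
            # Any other <g> is ordinary content of the current part.
            parts[-1] += "".join(g.itertext()) + (g.tail or "")
    return "".join(parts), "-".join(parts)


print(hyphen_readings('re<g ref="#shy"/>creation'))
```

A search function could then match the word under either reading.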


Notes

Encyclopaedias do not, as a rule, have many footnotes, but if there should be some, mark them up in the following manner. Where the reference to the footnote occurs, insert the text of the footnote inside <note>, deleting the reference. The following text

This is a span of text.¹ This is another span of text.

1  This is a footnote.
    
becomes
 
This is a span of text.<note>This is a footnote.</note> This is another span of text.

All notes that are untyped are authorial, i.e. they stem from the source itself. If you wish to insert your own notes for everybody to see, use <note type="editorial"> instead. Use this as if you were editing a book for publication and wished to make the book more understandable.

Uncertainty


If you are uncertain about the transcription or the tagging of a piece of text, you write a <note> immediately after it.

<name type="person">Elizabeth</name> went to <name type="place">Essex</name>. She had always liked <name type="place">Essex</name>.<note type="uncertainty" resp="#jens.petersen">It is not clear here whether “Essex” refers to the place or to the nobleman. –Jens Østergaard Petersen</note>

The attribute “resp” notes who has the responsibility for the note in question. After “#”, you write your Cluster username. In the note itself, you can refer to yourself by your real name.
    This can be seen as a special kind of an “editorial” <note>.
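With notes typed in this way, the different kinds can be told apart by a program, for instance in order to list all places where somebody has recorded a doubt. The following is a minimal sketch in Python, using only the standard library; the function name and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

PASSAGE = """<p>This is a span of text.<note>This is a footnote.</note>
She had always liked <name type="place">Essex</name>.<note type="uncertainty"
resp="#jens.petersen">It is not clear here whether "Essex" refers to the place
or to the nobleman.</note></p>"""


def list_notes(xml_text: str) -> list:
    """Describe every <note>; untyped notes are reported as authorial."""
    notes = []
    for note in ET.fromstring(xml_text).iter("note"):
        resp = note.get("resp")
        notes.append({
            "type": note.get("type", "authorial"),
            "resp": resp.lstrip("#") if resp else None,
            "text": " ".join("".join(note.itertext()).split()),
        })
    return notes


for entry in list_notes(PASSAGE):
    print(entry)
```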

Notes to yourself


When you do mark-up, you may want to leave notes for yourself. Such notes can be written in the following way: you start out with "<!--" and end up with "-->". In between, you can write whatever you want (except "--"): your note will not be displayed. Use this as a way to record whatever observations you may have on the mark-up, such as doubts, ambiguities – things you want to take up later.
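That such notes are not displayed follows from the way XML is processed: a program reading the document can simply skip them. As a small illustration in Python (standard library only; the sample sentence is made up), the parser used below discards comments by default, so only the text itself comes out.

```python
import xml.etree.ElementTree as ET

SOURCE = "<p>Hegel wrote this <!-- check the date later -->in 1807.</p>"

paragraph = ET.fromstring(SOURCE)
# The comment is dropped by the parser; only the text of the paragraph remains.
visible = "".join(paragraph.itertext())
print(visible)
```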


Character variants

If your text uses a form of a letter or character that differs from the one commonly used, i.e. from the orthographic representation of that letter or character, you may use the following formula:

<g rend="鎭">鎮</g>

Inside the <g> tag, you write what you see in the source after “rend=” and between the opening and closing tag, you write what you believe to be a standard representation of the same thing. This mechanism may be used for other scripts than Chinese.
Each group should come to an agreement about what is to be considered orthography for their texts. This is not easy, but will improve the searchability of the texts. In the search function of the database, we will attempt to cover systematic variants, meaning that if you search for "鎮" or "Straße", you will find "鎭" and "Strasse" as well.
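How the database will cover variants has not been decided; the following is only a sketch of one possible approach in Python, in which both the query and the text are reduced to the agreed standard forms before they are compared. The table of variants and the function names are made up for illustration.

```python
# Hypothetical table: variant form -> agreed standard form.
VARIANTS = {"鎭": "鎮", "Strasse": "Straße"}


def normalise(text: str) -> str:
    """Replace every known variant with its standard form."""
    for variant, standard in VARIANTS.items():
        text = text.replace(variant, standard)
    return text


def matches(query: str, passage: str) -> bool:
    """True if the query occurs in the passage once variants are ignored."""
    return normalise(query) in normalise(passage)


print(matches("鎮", "景德鎭"))
print(matches("Straße", "Berliner Strasse"))
```

A table like this only works if each group has agreed on what counts as the standard form, which is why the agreement mentioned above matters.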
If letters, symbols or characters occur that you think are not in Unicode, please contact Jens Petersen.


Bibliographies

Many encyclopaedia entries have proper bibliographies. The bibliography should be inserted in a <div type="bibliography">.
For bibliographies, there are several formats: we use the looser <bibl> structure that we also use within citations (see above).
    Generally speaking, there are two kinds of publications: independent ones and dependent ones. The typical example of an independent publication is a book that does not appear in a series or suchlike and typical examples of dependent publications are articles in periodicals and contributions to anthologies. In a dependent publication, a publication is lodged inside another publication; the place within this publication (periodical/anthology) is then generally noted and of course the “host” publication must be described as well.
    Bibliographies can thus be quite complicated, but since we are using the loose <bibl> format, just marking titles and authors/editors inside the bibliographic entry is enough.
    A book entry might be as simple as the following:

<bibl>
    <author>Nelson, T. H.</author>,
    <title>Replacing the printed word: a complete literary system.</title>
    London: Thames and Hudson, 1988.
</bibl>

One can also be slightly more elaborate and add tags indicating place of publication (<pubPlace>), publisher and date as in the following:

<bibl>
    <author>Nelson, T. H.</author>,
    <title>Replacing the printed word: a complete literary system.</title>
    <pubPlace>London</pubPlace>: <publisher>Thames and Hudson</publisher>, <date>1988</date>.
</bibl>

All the punctuation that occurs in the text you mark up should be retained. A way to format a reference to an article in a periodical would be like this:

<bibl>
    <author>Thaller, Manfred</author>
    <title>A Draft Proposal for a Standard for the Coding of Machine Readable Sources</title>;
    <title>Historical Social Research</title>
    Vol. 40 (<date>1986</date>), pp. 3-46.
</bibl>

If you want to go a little further, you can add information about where in the periodical the article occurs, its bibliographical scope (<biblScope>):

<bibl>
    <author>Thaller, Manfred</author>
    <title>A Draft Proposal for a Standard for the Coding of Machine Readable Sources</title>;
    <title>Historical Social Research</title>
    <biblScope>Vol. 40</biblScope> (<date>1986</date>), <biblScope>pp. 3-46</biblScope>.
</bibl>

Of course, the two <biblScope>s do not have the same meaning, but this is as far as we will go.
If the same publication were in an anthology, we would have the following:

<bibl>
    <author>Thaller, Manfred</author>
    <title>"A Draft Proposal for a Standard for the Coding of Machine Readable Sources"</title>;
    <title>Modelling Historical Data: Towards a Standard for Encoding and Exchanging Machine-Readable Texts</title>, ed. by
    <editor>Daniel I. Greenstein</editor>,
    (<pubPlace>St. Katharinen</pubPlace>:
    <publisher>Max-Planck-Institut für Geschichte</publisher>,
    <date>1991</date>),
    <biblScope>pp. 3-46</biblScope>.
</bibl>

This is more than enough for our purposes, but if you wish to format more elaborate bibliographies, you should consult the TEI Guidelines (Section 3.11). If you come across any information that you cannot mark up with the tags introduced thus far, you may simply leave the text unmarked. I have not done so in the examples above, but if anything is rendered in italics or bold, as is quite common, you may, if you wish, add the appropriate <hi> mark-up.
    If the article contains a short pointer to an entry in its own bibliography, the pointer should be placed in a <ref> tag with a target attribute pointing to the xml:id of the bibliographical entry. This xml:id attribute must be unique; please construct one yourself. We may thus have a reference in the body of the text itself, like the following:

Nelson claims <ref target="#NELSON1988">(ibid, passim)</ref>

or

Nelson claims (<ref target="#NELSON1988">Nelson [1988]</ref> pages 13–37)

Elsewhere, presumably at the end of the article, there is a proper bibliographical description. We use the format described above, but with an xml:id attribute in <bibl>:

<bibl xml:id="NELSON1988">
    <author>Nelson, T. H.</author>,
    <title>Replacing the printed word: a complete literary system.</title>
    <pubPlace>London</pubPlace>: <publisher>Thames and Hudson</publisher>, <date>1988</date>.
</bibl>

Now, a link from the pointer to the bibliographical entry can be constructed automatically on the web page. If the bibliographic entry referred to is not in the article itself, the passage is simply to be marked up with <bibl>:

Nelson claims (<bibl>Nelson [1980]</bibl> pages 13–37)

The bibliography as a whole is wrapped up in the <list> element; this may have a <head> (e.g. “Bibliography” or whatever appears in the text). It would have a structure like this:

<list>
    <head> Bibliography </head>
    <item>
        <bibl>
            (Book 1)
        </bibl>
        ;
    </item>
    <item>
        <bibl>
            (Book 2)
        </bibl>
        ;
    </item>
</list>

This may seem a little overelaborate, but if you have text (including punctuation as here) in between the individual <bibl> entries, you have to wrap each <bibl> element inside <item>. There is a dedicated <listBibl> element, but it does not allow contents between <bibl> elements.
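    Putting this together with the <div type="bibliography"> mentioned at the beginning of this section, the overall shape might be sketched as follows. That the <list> sits directly inside the <div> is an assumption made for this illustration, and the entries are placeholders:

<div type="bibliography">
    <list>
        <head>Bibliography</head>
        <item>
            <!-- an xml:id is only needed if a <ref> in the article points to this entry -->
            <bibl xml:id="NELSON1988">
                (Book 1)
            </bibl>
            ;
        </item>
        <item>
            <bibl>
                (Book 2)
            </bibl>
            .
        </item>
    </list>
</div>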

Verse

If poetry occurs in an encyclopaedia article, it can be marked up in the following way.
    Each line of verse is enclosed in <l> and </l>. Lines are “full lines”; in poetry, a line may break, but this is not noted here.
    A group of lines is enclosed in <lg> and </lg>. A stanza would be marked as a line group in this way; one may give the <lg> a type attribute as appropriate. A <lg> may nest inside another <lg>.
    A division of verse may be wrapped in a suitably typed <div> and supplied with a <head> (if there is one).

<div type="poem">
    <lg type="octet">
        <l>Thus speaks the Muse, and bends her brow severe:—</l>
        <l>“Did I, <name>Lætitia</name>, lend my choicest lays,</l>
        <l>And crown thy youthful head with freshest bays,</l>
        <l>That all the' expectance of thy full-grown year</l>
        <l>Should lie inert and fruitless? O revere</l>
        <l>Those sacred gifts whose meed is deathless praise,</l>
        <l>Whose potent charms the' enraptured soul can raise</l>
        <l>Far from the vapours of this earthly sphere!</l>
    </lg>
</div>
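
The example above shows a single line group. A poem with a heading and several stanzas could be sketched as below, with one <lg> nested inside another; the type values "poem" and "stanza" and the placeholder lines are assumptions chosen for this illustration:

<div type="poem">
    <head>(Heading of the poem)</head>
    <lg type="poem">
        <!-- each stanza is a line group nested inside the outer line group -->
        <lg type="stanza">
            <l>(Line 1)</l>
            <l>(Line 2)</l>
        </lg>
        <lg type="stanza">
            <l>(Line 3)</l>
            <l>(Line 4)</l>
        </lg>
    </lg>
</div>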

Authorship


In cases where an article is signed, one should mark the author in the following manner.

<byline>J. H. S.</byline>

The byline belongs inside the <div type="entry">, i.e. just before its closing </div>.
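    As a rough sketch, an entry with a signed article might then look as follows. The <head> and the inner <div type="article"> follow the pattern used in other examples in these guidelines; the contents are placeholders:

<div type="entry">
    <head>(Headword)</head>
    <div type="article">
        <p>(Text of the article)</p>
    </div>
    <!-- the byline comes last, just before the entry is closed -->
    <byline>J. H. S.</byline>
</div>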

Terms


One can mark up central concepts with the <term> tag. References to lemmata elsewhere in the encyclopaedia can be marked up in this way too. <term> contains a single-word, multi-word, or symbolic designation which is regarded as a technical term. If the term is not marked by highlighting or quotation marks, this should be noted in an attribute: <term rend="unmarked">. Just as with names and titles, one can add a key attribute to a term.
The element <term> is intended for use with words or phrases identified as terminological in nature;
where words or phrases are simply being cited, discussed, or glossed in a text, it will often be more appropriate to use the <mentioned> element.
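    The following lines sketch both cases: a <term> carrying a key attribute, and a word that is merely cited and therefore tagged with <mentioned>. The sentences and the key value "sovereignty" are invented for the sake of illustration; the actual key values to use have to be agreed upon within the project:

<p>The doctrine of <term rend="unmarked" key="sovereignty">sovereignty</term> is treated in a separate article.</p>
<p>The word <mentioned>empire</mentioned> is derived from <mentioned xml:lang="la">imperium</mentioned>.</p>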

Dictionary-type information

Many encyclopaedia entries begin with a dictionary-like passage, i.e. a passage which does not state facts, but analyses a word semantically and syntactically. If such a passage is separated from the article proper by having one or more paragraphs to itself, it should be enclosed in a <div type="dictionary">. There are formats for marking up dictionaries, but these are very elaborate. Enclosing such a passage in its own <div> and marking up <hi> is all we will do.
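    Such an entry might be laid out as in the following sketch. The placement of the <div type="dictionary"> before a <div type="article"> is an assumption made for this illustration, and the contents are placeholders:

<div type="entry">
    <head>(Headword)</head>
    <div type="dictionary">
        <p>(Dictionary-like passage, with <hi rend="italic">highlighted words</hi> marked up)</p>
    </div>
    <div type="article">
        <p>(Text of the article proper)</p>
    </div>
</div>
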
    If one wishes, one can mark up passages which contain a term and its gloss in the following way.
    The term is enclosed with <term> and given an xml:id attribute with a value that is unique inside the document (very often, just repeating the term will be enough). The explanation of the meaning of the term is wrapped in <gloss> and a target attribute is added, pointing to the unique xml:id of the term.

We may define <term xml:id="TDPv" rend="sc">discoursal point of view</term> as <gloss target="#TDPv">the relationship, expressed through discourse structure, between the implied author or some other addresser, and the fiction.</gloss>

If a dictionary-like passage occurs in the beginning of an entry, one can use this pattern.

<head><term xml:id="empire">EMPIRE</term></head>
<div type="article">
    <p><gloss target="#empire">(<foreign xml:lang="la">imperium</foreign>), commandement, domination, et, dans une signification secondaire, état gouverné par un empereur (<hi rend="italic">voy.</hi>).</gloss> Dans ce dernier sens …</p>
</div>

    If the gloss is not marked by highlighting or quotation marks, this should be noted in an attribute: <gloss rend="unmarked">.
    A <term> does not have to have a <gloss>.

Language

In cases where this makes sense, one can add a language attribute to a tag.

<p>
Diet, <term xml:lang="grc">diaitotiku</term>, <term xml:lang="la">victus</term> or living
</p>

All elements can have this attribute. The language codes can be found at <http://rishida.net/utils/>.

<note rend="parenthesis" xml:lang="zh">不自矜夸</note>
<foreign xml:lang="la">subsequently</foreign>
