EARLY 19TH CENTURY
RUSSIAN READERSHIP & CULTURE:


TEXT ENCODING OVERVIEW


The ENCRRC Project has enriched its texts using XML (Extensible Markup Language), following TEI Lite, a subset of the guidelines prepared by the TEI (Text Encoding Initiative). As noted on our project home page, we also attempt to follow the Level 4 (Basic Content Analysis) recommendations endorsed by the Digital Library Federation. For ease of encoding, we subdivide our Basic Content Analysis into (1) Structural and (2) Basic Content encoding. We also perform (3) extensive analytical encoding:
  1. Structure (Paragraphs, Front Matter, etc.)
  2. Basic Content (Foreign expressions, etc.)
  3. Advanced Content (Analytical Categories)
NB: See below for a summary of our Attribute Values.

STRUCTURE
When considered appropriate, ENCRRC makes sparing use of the following structural elements (besides <text> and <body>):

<front>: used for prefaces, tables of contents;
<back>: used for afterwords, appendices, endnotes, apparatus (when included);
<titlepage>: including verso if present, divided by <pb N="verso">;
<list>: used with <item> to reflect tables of contents, errata, subscription lists, "other titles by the same author," cast lists, etc.;
<div1, etc.>: used with N= attribute to record sequence;
<head>;
<argument>;
<epigraph>;
<opener>; <dateline>; <salute>; <signed>; <closer>; <trailer>;
<q>: used only for quotations that are set off typographically (i.e., not used for inline quotations, or for direct speech in prose fiction);
<q>: used for letters quoted in text as follows: q/text/body/div1 type=letter, including "opener," "dateline," "salute," "signed," "closer" as appropriate;
<p>;
<lg>: used within "div" for all verse of more than one line--even without stanzas--to assist retrieval;
<l>: includes use of the REND attribute to record indentation;
<milestone>: used with UNIT="typography" N="****" to represent divisions within poems so marked;
<pb>: the page break is placed at the beginning of the page;
<figure>: also used to encode frontispieces, within a separate div/p.
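
As an illustration of the quoted-letter pattern listed above (q/text/body/div1 type=letter), a minimal sketch might look as follows. The wording of the letter is invented for the example and is not taken from any ENCRRC text; element names are given in the form the XML version of the TEI DTD expects.

```xml
<p>He handed me the following note:</p>
<q>
  <text>
    <body>
      <div1 type="letter">
        <!-- opening formulae of the letter -->
        <opener>
          <dateline>Moscow, 3 May</dateline>
          <salute>Dear friend,</salute>
        </opener>
        <p>The new journal has arrived at last.</p>
        <!-- closing formulae and signature -->
        <closer>
          <salute>Your devoted</salute>
          <signed>N.</signed>
        </closer>
      </div1>
    </body>
  </text>
</q>
```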

NB:
*Regarding <note>: the ENCRRC project does not currently reproduce notes (although this policy is being re-examined).
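
The verse-related elements above (<lg>, <l>, <milestone>, <pb>) can be sketched together as follows. This is an illustration only: the lines of verse, the page number, and the value type="poem" are hypothetical. Attribute names are written in lowercase (n, rend, unit), as the XML version of the TEI DTD requires, whereas this overview prints them in capitals.

```xml
<div1 type="poem" n="1">
  <!-- page break placed at the beginning of the page -->
  <pb n="12"/>
  <head>To a Friend</head>
  <!-- lg is used even though the poem has no stanzas -->
  <lg>
    <l>The first line stands at the margin,</l>
    <l rend="indent(1)">the second is indented one tab stop.</l>
  </lg>
  <!-- typographic division printed as a row of asterisks -->
  <milestone unit="typography" n="****"/>
  <lg>
    <l>A new division begins here,</l>
    <l rend="indent(1)">again with an indented line.</l>
  </lg>
</div1>
```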


BASIC CONTENT
When considered appropriate, ENCRRC makes sparing use of the following basic content elements:

<foreign lang=xx>: uses 3-character language abbreviations. If appropriate, this tag also includes the attribute rend=ital;
<title>;
<emph>:
(a) used for words that are emphasized linguistically or rhetorically, rather than only typographically;
(b) easiest to spot in dialog;
<hi>:
(a) used for ambiguous and/or typographically emphasized text that is not "foreign," "title," "emph";
(b) often used in texts with multiple instances of italics;
(c) used--instead of <q>--for inline quotations, but only when italicized;
<sic>: used to indicate typographic errors, with the CORR attribute to note corrections;
<reg>: used in preference to <orig>, <corr>, etc., to regularize unusual forms of names in text, together with the ORIG attribute to indicate form in source text;
<add>; <del>; <unclear>;
<sp>: used to encode speeches, with speakers identified within <speaker> elements.
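
A short sketch of how several of these basic content elements might appear together is given below. The sentences, the title, the names, and the language code "fre" (the ISO 639-2 bibliographic code for French) are assumptions chosen for illustration, not excerpts from the project's texts; attribute names are lowercase, as the XML version of the TEI DTD requires.

```xml
<div1 type="chapter" n="1">
  <p>She called the evening
    <foreign lang="fre" rend="ital">charmant</foreign>
    and asked whether I had read <title>The Northern Traveller</title>.
    <!-- emph: rhetorical stress; hi: italics with no clear function -->
    <emph>Never</emph>, I said, had I seen a <hi rend="ital">finer</hi> edition,
    <!-- sic keeps the printed error, corr records the correction -->
    though the title page gave the year as <sic corr="1825">1852</sic>
    <!-- reg gives the regularized name, orig the form in the source -->
    and named the author as <reg orig="Pouchkine">Pushkin</reg>.</p>
  <sp>
    <speaker>Sofia.</speaker>
    <p>And where did you find it?</p>
  </sp>
</div1>
```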

NB:
*Regarding <name>: the ENCRRC project does not currently encode names, dates, or times.


ADVANCED CONTENT: ANALYTICAL CATEGORIES

Here is the interpretation structure that we use for the ENCRRC project:

<back>
<div1 type="Interpretations">

<interpGrp type="Publishing">
<interp value="Commercial" ID="pub-commer"/>
<interp value="Patronage" ID="pub-patron"/>
<interp value="Technology" ID="pub-tech"/>
</interpGrp>

<interpGrp type="Print Categories">
<interp value="Lang-French" ID="cat-frlang"/>
<interp value="Lang-Russian" ID="cat-rlang"/>
<interp value="Prose" ID="cat-prose"/>
<interp value="Verse" ID="cat-verse"/>
<interp value="Historical" ID="cat-hist"/>
<interp value="Nationalistic" ID="cat-nation"/>
<interp value="Political" ID="cat-polit"/>
<interp value="Prohibited" ID="cat-prohib"/>
<interp value="Religious" ID="cat-relig"/>
<interp value="Romantic" ID="cat-roman"/>
<interp value="Secular" ID="cat-secul"/>
</interpGrp>

<interpGrp type="Novels">
<interp value="Edition size" ID="novel-edsize"/>
<interp value="Original: FR" ID="novel-french"/>
<interp value="Original: RU" ID="novel-rus"/>
<interp value="Prices" ID="novel-price"/>
<interp value="Reading" ID="novel-read"/>
<interp value="Provinces" ID="novel-prov"/>
<interp value="Spb/Moscow" ID="novel-spbmos"/>
</interpGrp>

<interpGrp type="Journals">
<interp value="Circulation" ID="jour-circ"/>
<interp value="Prices" ID="jour-price"/>
<interp value="Reading" ID="jour-read"/>
<interp value="Provinces" ID="jour-prov"/>
<interp value="Spb/Moscow" ID="jour-spbmos"/>
</interpGrp>

<interpGrp type="Newspapers">
<interp value="Circulation" ID="news-circ"/>
<interp value="Political" ID="news-pol"/>
<interp value="Prices" ID="news-price"/>
<interp value="Reading" ID="news-read"/>
<interp value="Provinces" ID="news-prov"/>
<interp value="Spb/Moscow" ID="news-spbmos"/>
</interpGrp>

<interpGrp type="Booktrade">
<interp value="Provinces" ID="trade-prov"/>
<interp value="Spb/Moscow" ID="trade-spbmos"/>
</interpGrp>

<interpGrp type="Text Access">
<interp value="Bookstore" ID="access-store"/>
<interp value="Coffee-house" ID="access-coffee"/>
<interp value="Club" ID="access-club"/>
<interp value="Library (Circ)" ID="access-cirlib"/>
<interp value="Library (Personal)" ID="access-perlib"/>
<interp value="Library (Public)" ID="access-publib"/>
<interp value="Lighting" ID="access-light"/>
<interp value="Manuscripts" ID="access-mss"/>
<interp value="Market" ID="access-market"/>
<interp value="Relatives" ID="access-relat"/>
<interp value="Subscription" ID="access-sub"/>
</interpGrp>

<interpGrp type="Reading Publics">
<interp value="Expansion" ID="reapub-expan"/>
<interp value="Size" ID="reapub-size"/>
<interp value="Provinces" ID="reapub-prov"/>
</interpGrp>

<interpGrp type="Social Groups">
<interp value="Aristocracy" ID="grp-aristo"/>
<interp value="Civil servants" ID="grp-civil"/>
<interp value="Gentry" ID="grp-gentry"/>
<interp value="Merchants" ID="grp-merch"/>
<interp value="Military" ID="grp-milit"/>
<interp value="Professionals" ID="grp-prof"/>
<interp value="Women" ID="grp-women"/>
</interpGrp>

<interpGrp type="Job titles">
<interp value="Bookdealer" ID="job-bkd"/>
<interp value="Publisher" ID="job-pub"/>
<interp value="Doctor" ID="job-doct"/>
<interp value="Engineer" ID="job-engin"/>
<interp value="Lawyer" ID="job-law"/>
<interp value="Teacher" ID="job-teach"/>
</interpGrp>

</div1>
</back>
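
This overview does not state how passages in the body of a text are linked to the categories above. One common TEI mechanism is the global ana attribute, which takes one or more identifiers of <interp> elements. The sketch below assumes that mechanism; the sentence itself is invented, while the two identifiers come from the Publishing and Booktrade groups.

```xml
<!-- ana points to the identifiers of two interp elements,
     separated by a space -->
<p ana="pub-commer trade-spbmos">The bookseller printed two thousand
copies at his own risk and sold them from his shop in St. Petersburg.</p>
```

With links of this kind, a search for one identifier (for example, pub-commer) could retrieve every passage assigned to that analytical category.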


ATTRIBUTE VALUES
  1. for TYPE: values are defined in editorialDecl;
  2. for REND: (a) use only to override a default value; (b) with "indent", include # of tabstops (e.g., <l REND="indent(1)">);
  3. for FONT: italics, bold, fsc, smallcap, underlined, gothic;
  4. for ALIGN: right, left, center, block;
  5. for "indent": see REND;
  6. for LANG: use ISO 639-2 3-character codes.
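
Item 1 says that TYPE values are defined in editorialDecl, which sits inside <encodingDesc> in the TEI header. A minimal sketch of such a declaration follows. The wording of the paragraph is hypothetical; the two values shown are those that appear elsewhere in this overview.

```xml
<encodingDesc>
  <editorialDecl>
    <!-- prose statement of the TYPE values used in this text -->
    <p>The TYPE attribute on div1 takes the values "letter" (for letters
    quoted in the text) and "Interpretations" (for the analytical
    categories in the back matter).</p>
  </editorialDecl>
</encodingDesc>
```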

Last update: 2006-06-30
University of Illinois at Urbana-Champaign

Comments to: Miranda Remnek
Last updated by MBR on Monday, 10-Jul-2006 11:22:54 CDT
