Document type definition
A document type definition (DTD) is a set of markup declarations that define a document type for an SGML-family markup language.
A DTD defines the valid building blocks of an XML document. It defines the document structure with a list of validated elements and attributes. A DTD can be declared inline inside an XML document, or as an external reference.
XML uses a subset of SGML DTD.
Newer XML namespace-aware schema languages have largely superseded DTDs. A namespace-aware version of DTDs is being developed as Part 9 of ISO DSDL. DTDs persist in applications that need special publishing characters, such as the XML and HTML Character Entity References, which derive from larger sets defined as part of the ISO SGML standard effort.
Associating DTDs with documents
A DTD is associated with an XML or SGML document by means of a document type declaration. The DOCTYPE appears in the syntactic fragment doctypedecl near the start of an XML document. The declaration establishes that the document is an instance of the type defined by the referenced DTD.
DOCTYPEs make two sorts of declaration:
- an optional external subset
- an optional internal subset.
Any valid SGML or XML document that references an external subset in its DTD, or whose body contains references to parsed external entities declared in its DTD, may only be partially parsed but cannot be fully validated by validating SGML or XML parsers in their standalone mode.
However, such documents are still fully parsable in the non-standalone mode of validating parsers, which signal an error if they cannot locate these external entities by their specified public identifier or system identifier, or if the entities are inaccessible. Non-validating parsers may also attempt to locate these external entities in the non-standalone mode, but do not validate the content model of these documents.
Examples
The following example of a DOCTYPE contains both public and system identifiers:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
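All HTML 4.01 documents conform to one of three SGML DTDs. The public identifiers of these DTDs are constant and are as follows:
- -//W3C//DTD HTML 4.01//EN
- -//W3C//DTD HTML 4.01 Transitional//EN
- -//W3C//DTD HTML 4.01 Frameset//EN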
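A DOCTYPE can only appear after the optional XML declaration, and before the document body, if the document syntax conforms to XML. This includes XHTML documents:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>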
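An additional internal subset can also be provided after the external subset:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!-- an internal subset can be embedded here -->
]>
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>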
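Alternatively, only the internal subset may be provided:
<!DOCTYPE html [
<!-- an internal subset can be embedded here -->
]>
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>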
Finally, the document type declaration may include no subset at all; in that case, it just specifies that the document has a single top-level element, and it indicates the type name of the root element:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>
Markup declarations
DTDs describe the structure of a class of documents via element and attribute-list declarations. Element declarations name the allowable set of elements within the document, and specify whether and how declared elements and runs of character data may be contained within each element. Attribute-list declarations name the allowable set of attributes for each declared element, including the type of each attribute value, if not an explicit set of valid values.
DTD markup declarations declare which element types, attribute lists, entities, and notations are allowed in the structure of the corresponding class of XML documents.
Element type declarations
An element type declaration defines an element and its possible content. A valid XML document contains only elements that are defined in the DTD. Various keywords and characters specify an element's content:
- EMPTY for specifying that the defined element allows no content, i.e., it cannot have any child elements, not even text elements;
- ANY for specifying that the defined element allows any content, without restriction, i.e., that it may have any number and type of child elements;
- or an expression, specifying the only elements allowed as direct children in the content of the defined element; this content can be either:
- * a mixed content, which means that the content may include at least one text element and zero or more named elements, but their order and number of occurrences cannot be restricted; this can be:
- ** ( #PCDATA ): historically meaning parsed character data, this means that only one text element is allowed in the content;
- ** ( #PCDATA | child | ... )*: a limited choice of two or more child elements (text elements or the named elements) may be used in any order and number of occurrences in the content.
- * an element content, which means that there must be no text elements among the child elements of the content. Such element content is specified as a content particle in a variant of Backus–Naur form without terminal symbols and with element names as non-terminal symbols. Element content consists of:
- ** a content particle, which can be either the name of an element declared in the DTD, or a sequence list or choice list. It may be followed by an optional quantifier.
- *** a sequence list (written between parentheses and separated by commas) means an ordered list of one or more content particles: all the content particles must appear successively as direct children in the content of the defined element, at the specified position and relative order;
- *** a choice list (written between parentheses and separated by "|" characters) means a mutually exclusive list of two or more content particles: only one of these content particles may appear in the content of the defined element at the same position.
- ** a quantifier, which is a single character that immediately follows the item it applies to, to restrict the number of successive occurrences of that item at the specified position in the content of the element; it may be either:
- *** + for specifying that there must be one or more occurrences of the item; the effective content of each occurrence may be different;
- *** * for specifying that any number (zero or more) of occurrences is allowed; the item is optional and the effective content of each occurrence may be different;
- *** ? for specifying that there must not be more than one occurrence; the item is optional;
- *** if there is no quantifier, the specified item must occur exactly one time at the specified position in the content of the element.
Element type declarations are ignored by non-validating SGML and XML parsers, but these declarations are still checked for well-formedness and validity.
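As a minimal sketch of the content models described above, the following declarations combine EMPTY, ANY, mixed content, sequence and choice lists, and the three quantifiers. The element names (article, title, para, figure, caption, appendix, em, br) are hypothetical and chosen only for illustration; they do not come from any standard DTD.

```xml
<!-- element content: a sequence whose second item is a repeated choice -->
<!ELEMENT article (title, (para | figure)+, appendix*)>
<!-- mixed content with text only: no quantifier is used -->
<!ELEMENT title (#PCDATA)>
<!-- mixed content with named children: the trailing * is required -->
<!ELEMENT para (#PCDATA | em | br)*>
<!ELEMENT em (#PCDATA)>
<!-- no content at all -->
<!ELEMENT br EMPTY>
<!-- element content with one optional child -->
<!ELEMENT figure (caption?)>
<!ELEMENT caption (#PCDATA)>
<!-- any declared elements and text, in any order -->
<!ELEMENT appendix ANY>
```

Under these assumed declarations, an article holding a title followed by a single para would be valid, whereas an article without a title, or with an appendix placed before the first para, would not.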
Attribute list declarations
An attribute list specifies, for a given element type, the list of all possible attributes associated with that type. For each possible attribute, it contains:
- the declared name of the attribute,
- its data type,
- and its default value.
For example:
<!ATTLIST img
src CDATA #REQUIRED
id ID #IMPLIED
sort CDATA #FIXED "true"
print (yes | no) "yes"
>
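Here are some attribute types supported by both SGML and XML:
- CDATA: character data; the value may be any text.
- ID: a unique identifier; the value must be a valid name that is unique among the ID attribute values of the document.
- IDREF or IDREFS: a reference, or a whitespace-separated list of references, to ID values defined elsewhere in the document.
- NMTOKEN or NMTOKENS: a name token, or a whitespace-separated list of name tokens.
- ENTITY or ENTITIES: the name, or a whitespace-separated list of names, of unparsed external entities declared in the DTD.
- (value1|...): an enumeration; the value must be one of the listed tokens.
- NOTATION (notation1|...): the value must be one of the listed notation names, each declared in the DTD.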
A default value can define whether an attribute must occur or not, or whether it has a fixed value, or which value should be used as a default value in case the given attribute is left out in an XML tag.
Attribute list declarations are ignored by non-validating SGML and XML parsers, but these declarations are still checked for well-formedness and validity.
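The following sketch shows how several of these attribute types and default-value forms might be combined. The document type (book, chapter, xref) and all attribute names are hypothetical, introduced here only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book [
  <!ELEMENT book (chapter+, xref*)>
  <!ELEMENT chapter (#PCDATA)>
  <!ELEMENT xref EMPTY>
  <!ATTLIST chapter
    id     ID              #REQUIRED
    status (draft | final) "draft"
    lang   NMTOKEN         #IMPLIED>
  <!-- an IDREF value must match the ID of some element in the document -->
  <!ATTLIST xref
    target IDREF #REQUIRED>
]>
<book>
  <chapter id="ch1">Introduction</chapter>
  <chapter id="ch2" status="final" lang="en">Syntax</chapter>
  <xref target="ch2"/>
</book>
```

In this sketch the first chapter omits the status attribute, so a parser that reads the declarations supplies the default value "draft"; a validating parser would reject the document if two chapters shared the same id or if target named an id that does not exist.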
Entity declarations
An entity is similar to a macro. The entity declaration assigns it a value that is retained throughout the document. A common use is to have a name more recognizable than a numeric character reference for an unfamiliar character. Entities help to improve legibility of an XML text. In general, there are two types: internal and external.
- Internal entities associate a name with arbitrary textual content defined in their declaration. When a named entity reference is then encountered in the rest of the document, and if this entity name has effectively been defined as a parsed entity, the reference itself is replaced immediately by the textual content defined in the parsed entity, and the parsing continues within this replacement text.
- * Predefined named character entities are similar to internal entities; five of them, however, are treated specially in all SGML, HTML and XML parsers. These entities are a bit different from normal parsed entities: when a named character entity reference is encountered in the document, the reference is also replaced immediately by the character content defined in the entity, but the parsing continues after the replacement text, which is inserted literally in the currently parsed token. This allows some characters that are needed for the core syntax of HTML or XML themselves to be escaped from their special syntactic role. Predefined character entities also include numeric character references, which are handled the same way and can also be used to escape the characters they represent, or to bypass limitations in the character repertoire supported by the document encoding.
- * In basic profiles for SGML or in HTML documents, the declaration of internal entities is not possible.
- * Instead, HTML standards predefine a large set of several hundred named character entities, which can still be handled as standard parsed entities defined in the DTD used by the parser.
- External entities refer to external storage objects. They are just declared by a unique name in the document, and defined with a public identifier and/or a system identifier specifying the source of their content. They exist in two variants:
- * parsed external entities, which are not associated in their definition with a named notation, in which case validating XML or SGML parsers retrieve their contents and parse them as if they were declared as internal entities;
- * unparsed external entities, which are defined and associated with a notation name, in which case they are treated as opaque references and signaled as such to the application using the SGML or XML parser: their interpretation, retrieval and parsing is left to the application, according to the types of notations it supports.
- * External entities are not supported in basic profiles for SGML or in HTML documents, but are valid in full implementations of SGML and in XML 1.0 or 1.1.
Internal entities may be defined in any order, as long as they are not referenced and parsed, in the DTD or in the body of the document, before they are defined: it is valid to include a reference to a still undefined entity within the content of a parsed entity, but it is invalid to include anywhere else any named entity reference before this entity has been fully defined, including all other internal entities referenced in its defined content.
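As a minimal sketch of this ordering rule, the document below declares a "signature" entity whose value refers to an "author" entity that is only declared afterwards. The entity values and the text in the body are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!-- the character reference &#x2014; is expanded when this declaration
       is read; the general entity reference &author; is kept as-is -->
  <!ENTITY signature " &#x2014; &author;.">
  <!-- declared after "signature", but before any use of &signature; -->
  <!ENTITY author "William Shakespeare">
]>
<sgml>To be, or not to be&signature;</sgml>
```

When the body is parsed, &signature; is replaced by its stored text, and only then is the nested &author; reference resolved, so the effective content of the sgml element is "To be, or not to be — William Shakespeare."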
The reference to the "author" internal entity is not substituted in the replacement text of the "signature" internal entity. Instead, it is replaced only when the "signature" entity reference is parsed within the content of the "sgml" element. This deferred substitution applies only to general entity references. The "%" character that introduces parameter entity references in the DTD loses its special role outside the DTD and becomes a literal character.
However, the references to predefined numeric character entities are substituted wherever they occur, without needing a validating parser.
Notation declarations
Notations are used in SGML or XML. They provide a complete reference to unparsed external entities whose interpretation is left to the application, by assigning them a simple name, which is usable in the body of the document. For example, notations may be used to reference non-XML data in an XML 1.1 document, such as SVG images that are to be associated with a specific renderer. A declaration of this kind states the MIME type of external images of this type, and associates it with a notation name such as "type-image-svg". However, notation names usually follow a naming convention that is specific to the application generating or using the notation: notations are interpreted as additional metadata whose effective content is an external entity and either a PUBLIC FPI, registered in the catalogs used by XML or SGML parsers, or a SYSTEM URI, whose interpretation is application dependent.
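A notation declaration of the kind described here might look like the following sketch. The notation name "type-image-svg" is the one used in the text; the system identifier shown is an assumption, since the choice of identifier is left to the application.

```xml
<!-- associates the notation name with a system identifier; here the
     identifier is assumed to carry the MIME type of SVG images -->
<!NOTATION type-image-svg SYSTEM "image/svg+xml">
```

The parser does not interpret the identifier itself; it only passes the notation name and its identifier on to the application.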
The declared notation name must be unique within the whole document type declaration, i.e., in the external subset as well as the internal subset, at least for conformance with XML.
Notations can be associated with unparsed external entities included in the body of the SGML or XML document.
Within the body of the SGML document, these referenced external entities are not replaced like usual named entities, but are left as distinct unparsed tokens that may be used either as the value of an element attribute or within the element contents, provided that either the DTD allows such external entities in the declared content type of elements or in the declared type of attributes, or the SGML parser is not validating the content.
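The association between an unparsed external entity and a notation can be sketched as follows. The entity name "example1GIF", the file name, the "data" attribute and the notation identifier are illustrative assumptions rather than part of any standard DTD.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sgml [
  <!ELEMENT sgml (img)*>
  <!ELEMENT img EMPTY>
  <!-- an ENTITY attribute value must name an unparsed entity -->
  <!ATTLIST img
    data ENTITY #IMPLIED>
  <!NOTATION type-image-gif PUBLIC "image/gif">
  <!-- NDATA marks the entity as unparsed and names its notation -->
  <!ENTITY example1GIF SYSTEM "example1.gif" NDATA type-image-gif>
]>
<sgml>
  <img data="example1GIF"/>
</sgml>
```

Note that the attribute value is the bare entity name, not an entity reference of the form &example1GIF;, which would not be allowed for an unparsed entity.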
Notations may also be associated directly with elements as additional metadata, without associating them with another external entity, by giving their names as possible values of some additional attributes. For example:
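A reduced sketch of such declarations follows. It covers only the NOTATION-typed "type" attributes and the notation declarations; the entity declarations are left out. The GIF and vendor-specific notation names and their identifiers are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sgml [
  <!ELEMENT sgml (img)*>
  <!-- the optional "type" attribute can only be set to this notation -->
  <!ATTLIST sgml
    type NOTATION (type-vendor-specific) #IMPLIED>
  <!ELEMENT img ANY>
  <!-- the optional "type" attribute can only be one of these two notations -->
  <!ATTLIST img
    type NOTATION (type-image-svg | type-image-gif) #IMPLIED>
  <!NOTATION type-image-svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
    "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
  <!NOTATION type-image-gif PUBLIC "image/gif">
  <!NOTATION type-vendor-specific PUBLIC "application/VND.specific+sgml">
]>
<sgml type="type-vendor-specific">
  <img type="type-image-svg"/>
  <img type="type-image-gif"/>
</sgml>
```

In XML, every name listed in a NOTATION attribute type must be declared as a notation, and an element type may carry at most one attribute of that type.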
The example above shows a notation named "type-image-svg" that references the standard public FPI and the system identifier of an SVG 1.1 document, instead of specifying just a system identifier as in the first example. This notation is referenced directly within the unparsed "type" attribute of the "img" element, but its content is not retrieved. It also declares another notation for a vendor-specific application, to annotate the "sgml" root element in the document. In both cases, the declared notation name is used directly in a declared "type" attribute, whose content is specified in the DTD with the "NOTATION" attribute type.
However, the "title" attribute of the "img" element specifies the internal entity "example1SVGTitle", whose declaration does not define a notation, so it is parsed by validating parsers and the entity replacement text is "Title of example1.svg".
The content of the "img" element references another external entity "example1SVG" whose declaration also does not define a notation, so it is also parsed by validating parsers and the entity replacement text is located by its defined SYSTEM identifier "example1.svg". The effective content for the "img" element is the content of this second external resource. The difference from the GIF image is that the SVG image is parsed within the SGML document, according to the declarations in the DTD, whereas the GIF image is just referenced as an opaque external object via its "data" attribute.
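Only one notation name may be specified in the value of ENTITY attributes. However, multiple external entities may be referenced, as a space-separated list of names, in attributes declared with the type ENTITIES.
Notations are also completely opaque for XML and SGML parsers, so they are not differentiated by the type of the external entity that they may reference: for these parsers, a notation is just a unique name associated with a public identifier (an FPI) and/or a system identifier (a URI).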
Some applications also allow referencing notations indirectly by naming them in the
Notations are not used in HTML, or in basic profiles for XHTML and SVG, because:
- All external entities used by these standard document types are referenced by simple attributes, declared with the CDATA type in their standard DTD
- All external entities for additional meta-data are referenced by either:
- * Additional attributes
- * Additional elements with their own attributes
- * Standard pseudo-attributes in XML and XHTML.
- If the application can't use any notation, these notations may be either ignored silently by the application or the application could signal an error.
- Otherwise, applications decide for themselves how to interpret them, and then whether the external entities must be retrieved and parsed separately.
- Applications may then signal an error, if such interpretation, retrieval or separate parsing fails.
- Unrecognized notations that may cause an application to signal an error should not block interpretation of the validated document using them.
XML DTDs and schema validation
Most XML schema languages are only replacements for element declarations and attribute list declarations, in such a way that it becomes possible to parse XML documents with non-validating XML parsers. In addition, documents for these XML schema languages must be parsed separately, so validating the schema of XML documents in pure standalone mode is not really possible with these languages: the document type declaration remains necessary for at least identifying the schema used in the parsed XML document and that is validated in another language.
A common misconception holds that a non-validating XML parser does not have to read document type declarations, when in fact, the document type declarations must still be scanned for correct syntax as well as validity of declarations, and the parser must still parse all entity declarations in the internal subset, and substitute the replacement texts of internal entities occurring anywhere in the document type declaration or in the document body.
A non-validating parser may, however, elect not to read parsable external entities, and does not have to honor the content model restrictions defined in element declarations and in attribute list declarations.
If the XML document depends on parsable external entities, it should assert standalone="no" in its XML declaration. The validating DTD may be identified by using XML Catalogs to retrieve its specified external subset.
In the example below, the XML document is declared with standalone="no" because it has an external subset in its document type declaration:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list />
If the XML document type declaration includes any SYSTEM identifier for the external subset, it cannot be safely processed as standalone: the URI should be retrieved, otherwise there may be unknown named character entities whose definition may be needed to correctly parse the effective XML syntax in the internal subset or in the document body. If it includes only a PUBLIC identifier, it may be processed as standalone, provided the XML processor knows this PUBLIC identifier in its local catalog, from which it can retrieve an associated DTD entity.
XML DTD schema example
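An example of a very simple external XML DTD to describe the schema of a list of persons might consist of:
<!ELEMENT people_list (person)*>
<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT socialsecuritynumber (#PCDATA)>
Taking this line by line: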
- people_list is a valid element name, and an instance of such an element contains any number of person elements. The * denotes there can be 0 or more person elements within the people_list element.
- person is a valid element name, and an instance of such an element contains one element named name, followed by one named birthdate, then gender and socialsecuritynumber. The ? indicates that an element is optional. The reference to the name element name has no ?, so a person element must contain a name element.
- name is a valid element name, and an instance of such an element contains "parsed character data".
- birthdate is a valid element name, and an instance of such an element contains parsed character data.
- gender is a valid element name, and an instance of such an element contains parsed character data.
- socialsecuritynumber is a valid element name, and an instance of such an element contains parsed character data.
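An XML file that conforms to the declarations described above might look like the following sketch. It assumes the DTD has been saved as example.dtd in the same directory, so that a relative SYSTEM identifier can locate it; the person data are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- the DTD is referenced as an external subset through a relative URI -->
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list>
  <person>
    <name>Jane Example</name>
    <birthdate>2008-11-27</birthdate>
    <gender>Female</gender>
  </person>
</people_list>
```

The socialsecuritynumber element is omitted here, which the description above permits because that element is optional; omitting name would make the document invalid.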
One can render this in an XML-enabled browser by pasting and saving the DTD component above to a text file named example.dtd and the XML file to a differently-named text file, and opening the XML file with the browser. The files should both be saved in the same directory. However, many browsers do not check that an XML document conforms to the rules in the DTD; they are only required to check that the DTD is syntactically correct. For security reasons, they may also choose not to read the external DTD.
The same DTD can also be embedded directly in the XML document itself as an internal subset, by enclosing it in square brackets within the document type declaration, in which case the document no longer depends on external entities and can be processed in standalone mode:
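A sketch of this embedded form, using the people_list declarations described earlier and made-up person data, could read:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE people_list [
  <!ELEMENT people_list (person*)>
  <!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT birthdate (#PCDATA)>
  <!ELEMENT gender (#PCDATA)>
  <!ELEMENT socialsecuritynumber (#PCDATA)>
]>
<people_list>
  <person>
    <name>Jane Example</name>
    <birthdate>2008-11-27</birthdate>
    <gender>Female</gender>
  </person>
</people_list>
```

Because every declaration sits inside the document, a parser needs no other file to validate it, which is why standalone="yes" is asserted in this sketch.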
Alternatives to DTDs are available:
- XML Schema, also referred to as XML Schema Definition (XSD), has achieved Recommendation status within the W3C, and is popular for "data oriented" XML use because of its stronger typing and easier round-tripping to Java declarations. Most of the publishing world has found that the added complexity of XSD would not bring them any particular benefits, so DTDs are still far more popular there. An XML Schema Definition is itself an XML document while a DTD is not.
- RELAX NG, which is also a part of DSDL, is an ISO international standard. It is more expressive than XSD, while providing a simpler syntax, but commercial software support has been slow in coming.
Security
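Because DTD processing can be abused, for example through nested entity declarations whose expansion grows exponentially or through external entities that direct the parser to unintended resources, .NET Framework provides a property that allows prohibiting or skipping DTD parsing, and recent versions of Microsoft Office applications refuse to open XML files that contain DTD declarations.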
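One commonly cited risk of this kind is nested entity expansion. The deliberately tiny and harmless sketch below, with made-up entity names, shows the mechanism: each entity refers twice to the previous one, so every added level doubles the size of the expanded text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sample [
  <!ELEMENT sample (#PCDATA)>
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;">
  <!ENTITY c "&b;&b;">
  <!ENTITY d "&c;&c;">
]>
<!-- &d; expands to eight copies of "ha" -->
<sample>&d;</sample>
```

With only three doubling levels the result is sixteen characters long, but a hostile document using many more levels or more references per level could force a parser to allocate far more memory than the size of the file suggests, which is one motivation for disabling or limiting DTD processing on untrusted input.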