All predicatesShow sourcesgml.pl -- SGML, XML and HTML parser

This library allows you to parse SGML, XML and HTML data into a Prolog data structure. The library defines several families of predicates:

High-level predicates
Most users will only use load_html/3, load_xml/3 or load_sgml/3 to parse arbitrary input into a DOM structure. These predicates all call load_structure/3, which provides more options and may be used for processing non-standard documents.

The DOM structure can be used by library(xpath) to extract information from the document.

The low-level parser
The actual parser is written in C and consists of two parts: one for processing DTD (Document Type Definitions) and one for parsing data. The data can either be parsed to a Prolog (DOM) term or the parser can perform callbacks for the DOM events.
Utility predicates
Finally, this library provides prmitives for classifying characters and strings according to the XML specification such as xml_name/1 to verify whether an atom is a valid XML name (identifier). It also provides primitives to quote attributes and CDATA elements.
Sourcedtd(+Type, -DTD) is det
DTD is a DTD object created from the file dtd(Type). Loaded DTD objects are cached. Note that DTD objects may not be shared between threads. Therefore, dtd/2 maintains the pool of DTD objects using a thread_local predicate. DTD objects are destroyed if a thread terminates.
Errors
- existence_error(source_sink, dtd(Type))
Sourceload_dtd(+DTD, +DtdFile, +Options)
Load DtdFile into a DTD. Defined options are:
dialect(+Dialect)
Dialect to use (xml, xmlns, sgml)
encoding(+Encoding)
Encoding of DTD file
Arguments:
DTD- is a fresh DTD object, normally created using new_dtd/1.
Sourcedestroy_dtds[private]
Destroy DTDs cached by this thread as they will become unreachable anyway.
Sourceregister_cleanup[private]
Register cleanup of DTDs created for this thread.
Sourceload_structure(+Source, -ListOfContent, :Options) is det
Parse Source and return the resulting structure in ListOfContent. Source is handed to open_any/5, which allows for processing an extensible set of input sources.

A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content.

The encoding(+Encoding) option is treated special for compatibility reasons:

  • If Encoding is one of iso-8859-1, us-ascii or utf-8, the stream is opened in binary mode and the option is passed to the SGML parser.
  • If Encoding is present, but not one of the above, the stream is opened in text mode using the given encoding.
  • Otherwise (no Encoding), the stream is opened in binary mode and doing the correct decoding is left to the parser.
Sourceparser_meta_options(+Options0, +Module, -Options)[private]
Qualify meta-calling options to the parser.
Sourceset_input_location(+Parser, +In:stream) is det[private]
Set the input location if this was not set explicitly
Sourceload_sgml_file(+File, -DOM) is det
Load SGML from File and unify the resulting DOM structure with DOM.
deprecated
- New code should use load_sgml/3.
Sourceload_xml_file(+File, -DOM) is det
Load XML from File and unify the resulting DOM structure with DOM.
deprecated
- New code should use load_xml/3.
Sourceload_html_file(+File, -DOM) is det
Load HTML from File and unify the resulting DOM structure with DOM.
deprecated
- New code should use load_html/3.
Sourceload_html(+Input, -DOM, +Options) is det
Load HTML text from Input and unify the resulting DOM structure with DOM. Options are passed to load_structure/3, after adding the following default options:
dtd(DTD)
Pass the DTD for HTML as obtained using dtd(html, DTD).
dialect(Dialect)
Current dialect from the Prolog flag html_dialect
max_errors(-1)
syntax_errors(quiet)
Most HTML encountered in the wild contains errors. Even in the context of errors, the resulting DOM term is often a reasonable guess at the intend of the author.

You may also want to use the library(http/http_open) to support loading from HTTP and HTTPS URLs. For example:

:- use_module(library(http/http_open)).
:- use_module(library(sgml)).

load_html_url(URL, DOM) :-
    load_html(URL, DOM, []).
Sourceload_xml(+Input, -DOM, +Options) is det
Load XML text from Input and unify the resulting DOM structure with DOM. Options are passed to load_structure/3, after adding the following default options:
  • dialect(xml)
Sourceload_sgml(+Input, -DOM, +Options) is det
Load SGML text from Input and unify the resulting DOM structure with DOM. Options are passed to load_structure/3, after adding the following default options:
  • dialect(sgml)
Sourcexml_quote_attribute(+In, -Quoted) is det
Sourcexml_quote_cdata(+In, -Quoted) is det
Backward compatibility for versions that allow to specify encoding. All characters that cannot fit the encoding are mapped to XML character entities (&#dd;). Using ASCII is the safest value.
Sourcexml_name(+Atom) is semidet
True if Atom is a valid XML name.
Sourcexml_basechar(+CodeOrChar) is semidet
Sourcexml_ideographic(+CodeOrChar) is semidet
Sourcexml_combining_char(+CodeOrChar) is semidet
Sourcexml_digit(+CodeOrChar) is semidet
Sourcexml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Sourcexml_is_dom(@Term) is semidet
True if term statisfies the structure as returned by load_structure/3 and friends.
Sourcexml_quote_attribute(+In, -Quoted) is det
Sourcexml_quote_cdata(+In, -Quoted) is det
Backward compatibility for versions that allow to specify encoding. All characters that cannot fit the encoding are mapped to XML character entities (&#dd;). Using ASCII is the safest value.
Sourcexml_basechar(+CodeOrChar) is semidet
Sourcexml_ideographic(+CodeOrChar) is semidet
Sourcexml_combining_char(+CodeOrChar) is semidet
Sourcexml_digit(+CodeOrChar) is semidet
Sourcexml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Sourcexml_basechar(+CodeOrChar) is semidet
Sourcexml_ideographic(+CodeOrChar) is semidet
Sourcexml_combining_char(+CodeOrChar) is semidet
Sourcexml_digit(+CodeOrChar) is semidet
Sourcexml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Sourcexml_basechar(+CodeOrChar) is semidet
Sourcexml_ideographic(+CodeOrChar) is semidet
Sourcexml_combining_char(+CodeOrChar) is semidet
Sourcexml_digit(+CodeOrChar) is semidet
Sourcexml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Sourcexml_basechar(+CodeOrChar) is semidet
Sourcexml_ideographic(+CodeOrChar) is semidet
Sourcexml_combining_char(+CodeOrChar) is semidet
Sourcexml_digit(+CodeOrChar) is semidet
Sourcexml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816

Undocumented predicates

The following predicates are exported, but not or incorrectly documented.

Sourcexsd_time_string(Arg1, Arg2, Arg3)
Sourcesgml_parse(Arg1, Arg2)
Sourcenew_sgml_parser(Arg1, Arg2)
Sourcedtd_property(Arg1, Arg2)
Sourcexml_quote_cdata(Arg1, Arg2, Arg3)
Sourceget_sgml_parser(Arg1, Arg2)
Sourceopen_dtd(Arg1, Arg2, Arg3)
Sourceiri_xml_namespace(Arg1, Arg2)
Sourcexml_name(Arg1, Arg2)
Sourcexml_quote_attribute(Arg1, Arg2, Arg3)
Sourceset_sgml_parser(Arg1, Arg2)
Sourcefree_dtd(Arg1)
Sourceload_dtd(Arg1, Arg2)
Sourceiri_xml_namespace(Arg1, Arg2, Arg3)
Sourcexsd_number_string(Arg1, Arg2)
Sourcesgml_register_catalog_file(Arg1, Arg2)
Sourcefree_sgml_parser(Arg1)
Sourcenew_dtd(Arg1, Arg2)