
Parser APIs

Extensible Markup Language (XML) describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application.

This C implementation of the XML processor (or parser) follows the W3C XML specification (rev REC-xml-19980210) and includes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

The following is the default behavior of this parser:

  1. The character set encoding is UTF-8. If all your documents are ASCII, you are encouraged to set the encoding to US-ASCII for better performance.
  2. Messages are printed to stderr unless msghdlr is given.
  3. A parse tree which can be accessed by DOM APIs is built unless saxcb is set to use the SAX callback APIs. Note that any of the SAX callback functions can be set to NULL if not needed.
  4. The parser checks that the input is well-formed but does not check whether it is valid. Set the flag XML_FLAG_VALIDATE to validate the input. Whitespace processing is fully conformant to the XML 1.0 specification: all whitespace is reported back to the application, with an indication of which whitespace is ignorable. Applications that prefer otherwise may set XML_FLAG_DISCARD_WHITESPACE, which discards all whitespace between an end-element tag and the following start-element tag.

Calling Sequence

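The sequence of calls to the parser can be:

  xmlinit() - xmlparse() or xmlparsebuf() - xmlterm()

or

  xmlinit() - xmlparse() or xmlparsebuf() - xmlclean() - xmlparse() or xmlparsebuf() - xmlclean() - ... - xmlterm()

or

  xmlinit() - xmlparse() or xmlparsebuf() - xmlparse() or xmlparsebuf() - ... - xmlterm()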
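
As a rough illustration, an init, parse, clean, term loop over several files might look like the sketch below. The prototypes are copied from the Syntax entries in this document. Everything else is an assumption made so the sketch compiles on its own: the typedefs, the stub function bodies (which stand in for the real library), the helper name parse_files, and the convention that a zero return code means success.

```c
#include <stddef.h>

/* ASSUMPTIONS: stand-in typedefs so the sketch compiles on its own.
   The real definitions come from the parser's header file. */
typedef unsigned char oratext;
typedef unsigned int  ub4;
typedef unsigned int  uword;
typedef struct xmlsaxcb xmlsaxcb;
typedef struct xmlmemcb xmlmemcb;
typedef struct xmlctx { int live; } xmlctx;  /* opaque in the real library */

/* ---- STUBS (not the real parser): remove when linking the library ---- */
static xmlctx stub_ctx;

xmlctx *xmlinit(uword *err, const oratext *encoding,
                void (*msghdlr)(void *msgctx, const oratext *msg, ub4 errcode),
                void *msgctx, const xmlsaxcb *saxcb, void *saxcbctx,
                const xmlmemcb *memcb, void *memcbctx, const oratext *lang)
{
    (void)encoding; (void)msghdlr; (void)msgctx; (void)saxcb;
    (void)saxcbctx; (void)memcb; (void)memcbctx; (void)lang;
    if (err == NULL) return NULL;      /* err is the only required argument */
    *err = 0;
    stub_ctx.live = 1;
    return &stub_ctx;
}

uword xmlparse(xmlctx *ctx, const oratext *filename,
               const oratext *encoding, ub4 flags)
{
    (void)encoding; (void)flags;
    /* Stub rule: fail on a missing or empty file name. */
    return (ctx && ctx->live && filename && filename[0]) ? 0u : 1u;
}

void  xmlclean(xmlctx *ctx) { (void)ctx; }
uword xmlterm(xmlctx *ctx)  { if (ctx) ctx->live = 0; return 0u; }
/* ---- end of stubs ---- */

/* Hypothetical helper: parse n files with one context.
   Returns the number of files parsed without error, or -1 if
   initialization failed. ASSUMPTION: a zero return code means success. */
int parse_files(const char *const *paths, size_t n)
{
    uword   err = 0;
    xmlctx *ctx = xmlinit(&err, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);
    int     ok = 0;
    size_t  i;

    if (ctx == NULL || err != 0)
        return -1;             /* make no further parser calls after a failed init */

    for (i = 0; i < n; i++) {
        if (xmlparse(ctx, (const oratext *)paths[i], NULL, 0) == 0)
            ok++;
        /* Work with the parse results for this document would go here. */
        xmlclean(ctx);         /* release this parse's memory before the next one */
    }

    xmlterm(ctx);
    return ok;
}
```

Dropping the xmlclean() call gives the third sequence, in which the next parse call frees the memory of the previous one.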

Memory

The memory callback functions memcb may be used if you wish to use your own memory allocation. If they are used, all of the functions should be specified.

The memory allocated for parameters passed to the SAX callbacks or for nodes and data stored with the DOM parse tree will not be freed until one of the following is done:

  1. xmlparse() or xmlparsebuf() is called to parse another file or buffer.
  2. xmlclean() is called.
  3. xmlterm() is called.

Thread Safety

If threads are forked off in the midst of the init-parse-term sequence of calls, behavior and results are unpredictable.

Data Types Index

oratext String pointer
xmlctx Master XML context
xmlmemcb Memory callback structure (optional)
xmlsaxcb SAX callback structure (SAX only)
ub4 32-bit (or larger) unsigned integer
uword Native unsigned integer

Function Index

xmlinit Initialize XML parser
xmlclean Clean up memory used during parse
xmlparse Parse a file
xmlparsebuf Parse a buffer
xmlterm Shut down XML parser
createDocument Create a new document
isStandalone Return document's standalone flag

Data Structures and Types


oratext

xmlctx

xmlmemcb

xmlsaxcb

ub4

uword


Functions


xmlinit

Purpose

Initializes the C XML parser. It must be called before any parsing can take place.

Syntax
xmlctx *xmlinit(uword *err, const oratext *encoding, 
                 void (*msghdlr)(void *msgctx, const oratext *msg, ub4 errcode), 
                 void *msgctx, const xmlsaxcb *saxcb, void *saxcbctx, 
                 const xmlmemcb *memcb, void *memcbctx, const oratext *lang);
Parameters
 err      (OUT) - The error, if any
 encoding (IN) - default character set encoding
 msghdlr  (IN) - Error message handler function
 msgctx   (IN) - Context for the error message handler
 saxcb    (IN) - SAX callback structure filled with function pointers
 saxcbctx (IN) - Context for SAX callbacks
 memcb    (IN) - Memory function callbacks
 memcbctx (IN) - Context for the memory function callbacks
 lang     (IN) - Language for error messages
Comments

If xmlinit() is not successful, do not call any other XML parser functions.

This function should only be called once before starting the processing of one or more XML files. xmlterm() should be called after all processing of XML files has completed.

Error codes: XMLERR_LEH_INIT, XMLERR_BAD_ENCODING, XMLERR_NLS_INIT, XMLERR_NO_MEMORY, XMLERR_NULL_PTR

All parameters may be NULL except for err.

By default, the character set encoding is UTF-8. If all your documents are ASCII, you are encouraged to set the encoding to US-ASCII for better performance.

By default, messages are printed to stderr unless msghdlr is given.

By default, a parse tree is built (accessible by DOM APIs) unless saxcb is set (in which case the SAX callback APIs are invoked). Note that any of the SAX callback functions can be set to NULL if not needed.

The memory callback functions memcb may be used if you wish to use your own memory allocation. If they are used, all of the functions should be specified.

The parameters msgctx, saxcbctx, and memcbctx are structures that you may define and use to pass information to your callback routines for the message handler, SAX functions, or memory functions, respectively. They should be set to NULL if your callback functions do not need any additional information passed in to them.

The lang parameter is not used currently and may be set to NULL. It will be used in future releases to determine the language of the error messages.
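
To illustrate how msghdlr and msgctx work together, here is a minimal sketch of a message handler that counts messages in an application-defined context instead of printing to stderr. The handler signature is the one shown in the Syntax above. The typedefs, the msg_stats structure, and the name count_messages are assumptions made for this example and are not part of the parser API.

```c
#include <stdio.h>

/* ASSUMPTIONS: stand-in typedefs; the real ones come from the parser's header. */
typedef unsigned char oratext;
typedef unsigned int  ub4;

/* Hypothetical application-defined context, passed to xmlinit() as msgctx. */
typedef struct {
    unsigned count;      /* number of messages seen so far */
    ub4      last_code;  /* error code of the most recent message */
    FILE    *out;        /* where to echo messages, or NULL to stay silent */
} msg_stats;

/* Matches the msghdlr parameter type:
   void (*)(void *msgctx, const oratext *msg, ub4 errcode) */
void count_messages(void *msgctx, const oratext *msg, ub4 errcode)
{
    msg_stats *st = (msg_stats *)msgctx;

    if (st == NULL)
        return;                      /* nothing to record into */

    st->count++;
    st->last_code = errcode;

    if (st->out != NULL && msg != NULL)
        fprintf(st->out, "[%u] %s\n", (unsigned)errcode, (const char *)msg);
}
```

An application would then pass count_messages as msghdlr and the address of a msg_stats variable as msgctx in its xmlinit() call, and inspect the counters after each parse.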


xmlclean

Purpose

Frees any memory used during the previous parse.

Syntax
void xmlclean(xmlctx *ctx);
Parameters

ctx (IN) - The XML parser context

Comments

This function is provided as a convenience for those who want to parse multiple files but would like to free the memory used for parses before the subsequent call to xmlparse() or xmlparsebuf().


xmlparse

Purpose

Invokes the XML parser on an input file. The parser must have been initialized successfully with a call to xmlinit() first.

Syntax
uword xmlparse(xmlctx *ctx, const oratext *filename, const oratext *encoding, ub4 flags);
Parameters

 ctx      (IN/OUT) - The XML parser context
 filename (IN) - path to XML document
 encoding (IN) - default character set encoding
 flags    (IN) - what options to use
Comments

Flag bits must be OR'd to override the default behavior of the parser. The following flag bits may be set:

  • XML_FLAG_VALIDATE turns validation on.
  • XML_FLAG_DISCARD_WHITESPACE will discard whitespace where it appears to be insignificant.

By default, the input is not validated. Whitespace processing is by default fully conformant to the XML 1.0 specification: all whitespace is reported back to the application, with an indication of which whitespace is ignorable. Applications that prefer otherwise may set XML_FLAG_DISCARD_WHITESPACE, which discards all whitespace between an end-element tag and the following start-element tag.
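
The flag word can be built up with bitwise OR, as in the short sketch below. The two flag names come from this document, but their numeric values are not given here, so the #define lines are placeholders that only take effect when the parser's header has not already defined them. The ub4 typedef and the helper name make_parse_flags are likewise assumptions.

```c
/* ASSUMPTION: stand-in typedef; the real one comes from the parser's header. */
typedef unsigned int ub4;

/* PLACEHOLDER values, used only if the real header has not defined the flags.
   The actual bit values are not stated in this document. */
#ifndef XML_FLAG_VALIDATE
#define XML_FLAG_VALIDATE           0x01u
#endif
#ifndef XML_FLAG_DISCARD_WHITESPACE
#define XML_FLAG_DISCARD_WHITESPACE 0x02u
#endif

/* Hypothetical helper: combine the requested options into one flags word.
   Zero means "all defaults": no validation, all whitespace reported. */
ub4 make_parse_flags(int validate, int discard_whitespace)
{
    ub4 flags = 0;

    if (validate)
        flags |= XML_FLAG_VALIDATE;
    if (discard_whitespace)
        flags |= XML_FLAG_DISCARD_WHITESPACE;

    return flags;
}
```

The result would be passed as the flags argument of xmlparse() or xmlparsebuf().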

The memory passed to the SAX callbacks or stored with the DOM parse tree will not be freed until one of the following is done:

  1. xmlparse() or xmlparsebuf() is called to parse another file or buffer.
  2. xmlclean() is called.
  3. xmlterm() is called.

This function will free any memory used during the previous parse.


xmlparsebuf

Purpose

Invokes the XML parser on a buffer. The parser must have been initialized successfully with a call to xmlinit() first.

Syntax
uword xmlparsebuf(xmlctx *ctx, const oratext *buffer, size_t len, const oratext *encoding, ub4 flags);
Parameters

 ctx      (IN/OUT) - The XML parser context
 buffer   (IN) - buffer to be parsed
 len      (IN) - length of the buffer
 encoding (IN) - default character set encoding
 flags    (IN) - what options to use
Comments

This function is identical to xmlparse() except that input is taken from the user's buffer instead of from an external file.
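
As a small illustration, a wrapper that parses an in-memory C string could look like the sketch below. The xmlparsebuf() prototype is the one from the Syntax above. The typedefs, the stub body (a stand-in for the real library that only records the length it was given), and the helper name parse_string are assumptions. The sketch also assumes that len is a byte count that excludes the terminating NUL, which this document does not state.

```c
#include <stddef.h>
#include <string.h>

/* ASSUMPTIONS: stand-in typedefs; the real ones come from the parser's header. */
typedef unsigned char oratext;
typedef unsigned int  ub4;
typedef unsigned int  uword;
typedef struct xmlctx { size_t last_len; } xmlctx;  /* opaque in the real library */

/* STUB (not the real parser): records the length and reports success. */
uword xmlparsebuf(xmlctx *ctx, const oratext *buffer, size_t len,
                  const oratext *encoding, ub4 flags)
{
    (void)encoding; (void)flags;
    if (ctx == NULL || buffer == NULL || len == 0)
        return 1u;
    ctx->last_len = len;
    return 0u;
}

/* Hypothetical helper: parse a NUL-terminated string held in memory.
   xml must not be NULL. The encoding is left at its default (NULL). */
uword parse_string(xmlctx *ctx, const char *xml, ub4 flags)
{
    return xmlparsebuf(ctx, (const oratext *)xml, strlen(xml), NULL, flags);
}
```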


xmlterm

Purpose

Terminates the XML parser. It should be called after xmlinit(), and before exiting the main program.

Syntax
uword xmlterm(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function will free any memory used during the previous parse. No additional XML parser calls can be made until xmlinit() is called again.


createDocument

Purpose

Creates a new document in memory.

Syntax
xmlnode *createDocument(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function is used when constructing a new document in memory. An XML document is always rooted in a node of type DOCUMENT_NODE; this function creates that root node and sets it in the context. There can be only one current document and hence only one document node; if one already exists, this function does nothing and returns NULL.


isStandalone

Purpose

Returns the value of the document's standalone flag.

Syntax
boolean isStandalone(xmlctx *ctx);
Parameters
ctx (IN) - the XML parser context
Comments

This function returns the boolean value of the document's standalone flag, as specified in the XML declaration (<?xml ... ?>) at the start of the document.