superwaba.ext.xplat.xml
Class XmlTokenizer

java.lang.Object
  |
  +--superwaba.ext.xplat.xml.XmlTokenizer
Direct Known Subclasses:
DumpXml, XmlReader

public class XmlTokenizer
extends java.lang.Object

A tokenizer for XML input. In non-strict mode (the default), it also recognizes HTML constructs, e.g. unquoted attribute values, unterminated references, etc.

Four "tokenize" methods are provided: one takes a byte[] array; another takes a byte[] array with offset and count; a third takes a (byte) Stream; the last takes an already buffered Stream.

Tokenization events are reported via the overridable "foundXxx" methods listed in the Method Summary.

Some of these methods pass parameters that describe the tokenized event: tag name, attribute name and value, and so on. These values are valid only while the event is being reported. Never assume that the reported data is still available after a "foundXxx" method returns! A persistent value is, however, available through the "getAbsoluteOffset()" method, which returns the absolute offset of the data parameters of the event currently reported by a "foundXxx" method.
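
The lifetime rule above can be illustrated without the library. The following is a minimal standalone sketch, not SuperWaba code: the Handler interface and the report method are hypothetical stand-ins that simulate a producer reusing one buffer for every event, which is one plausible reason why reported data does not survive the callback.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Standalone illustration; none of these types belong to SuperWaba.
class EventDataLifetimeSketch {
    // Hypothetical callback with the same (buffer, offset, count) shape as a foundXxx method.
    interface Handler {
        void found(byte[] buffer, int offset, int count);
    }

    // Simulates a producer that reuses one buffer for every reported token.
    // Tokens are assumed to be ASCII and shorter than 64 bytes.
    static void report(String[] tokens, Handler handler) {
        byte[] shared = new byte[64];
        for (String token : tokens) {
            byte[] raw = token.getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(raw, 0, shared, 0, raw.length);
            handler.found(shared, 0, raw.length);
        }
    }

    // Safe: each token is copied into a String while the callback is running.
    static List<String> collectCopies(String[] tokens) {
        List<String> copies = new ArrayList<>();
        report(tokens, (buffer, offset, count) ->
            copies.add(new String(buffer, offset, count, StandardCharsets.US_ASCII)));
        return copies;
    }

    // Unsafe: keeps the buffer reference of the first event and decodes it after all events.
    static String firstTokenDecodedLate(String[] tokens) {
        byte[][] keptBuffer = new byte[1][];
        int[] keptCount = new int[1];
        report(tokens, (buffer, offset, count) -> {
            if (keptBuffer[0] == null) {
                keptBuffer[0] = buffer;
                keptCount[0] = count;
            }
        });
        return new String(keptBuffer[0], 0, keptCount[0], StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String[] tokens = {"html", "body"};
        System.out.println(collectCopies(tokens));
        System.out.println(firstTokenDecodedLate(tokens));
    }
}
```

In this simulation the late decode of the first token yields "body" rather than "html", because the shared buffer was overwritten by the second event. Copying inside the callback, or recording an offset that stays meaningful, avoids the problem.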

Typical invocation

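 import waba.sys.Vm;
 import superwaba.ext.xplat.xml.SyntaxException;
 import superwaba.ext.xplat.xml.XmlTokenizer;

 class XmlTokenizerTest {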
    static class MyXmlTokenizer extends XmlTokenizer
    {
       public void foundStartOfInput(byte buffer[], int offset, int count) {
          Vm.debug("Start: " +  new String(buffer, offset, count));
       }
       public void foundStartTagName(byte buffer[], int offset, int count) {
          Vm.debug("StartTagName: " + new String(buffer, offset, count));
       }
       public void foundEndTagName(byte buffer[], int offset, int count) {
          Vm.debug("EndTagName: " +  new String(buffer, offset, count));
       }
       public void foundEndEmptyTag() {
          Vm.debug("EndEmptyTag");
       }
       public void foundCharacterData(byte buffer[], int offset, int count) {
          Vm.debug("Content: " + new String(buffer, offset, count));
       }
       public void foundCharacter(char charFound) {
          Vm.debug("Content Ref  |" + charFound + '|');
       }
       public void foundAttributeName(byte buffer[], int offset, int count) {
          Vm.debug("AttributeName: "  + new String(buffer, offset, count));
       }
       public void foundAttributeValue(byte buffer[], int offset, int count, byte dlm) {
          Vm.debug("AttributeValue: "  + new String(buffer, offset, count));
       }
       public void foundEndOfInput(int count) {
          Vm.debug("Ended: " + count + " bytes parsed.");
       }
    }
    public static void testMe()
    {
       String input = "<p>Hello<i>World!</i></p>";
       MyXmlTokenizer xtk = new MyXmlTokenizer();
       try {
          xtk.tokenize(input.getBytes());
       } catch (SyntaxException ex) {
          Vm.debug(ex.getMessage());
       }
    }
 }
 

Note: A Tokenizer is not a Parser.  The correctness of the tag structure (stack) is not examined.
Ex: the dangling markup "<foo><bar>opop</bar></foo>" is syntactically valid.
As a result, a Tokenizer can work on document fragments.
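
Because the tokenizer does not check tag structure, a caller that needs that check has to add it. The sketch below is a standalone assumption of how such a check could be layered on top of the start-tag, end-tag and empty-tag events; the class TagNestingSketch and its method names are hypothetical and do not exist in SuperWaba.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Standalone illustration: a nesting check that a subclass could drive from
// its start-tag, end-tag and empty-tag callbacks. Not part of SuperWaba.
class TagNestingSketch {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean balanced = true;

    // Call from the start-tag callback, with a copy of the tag name.
    void startTag(String name) {
        open.push(name);
    }

    // Call from the end-tag callback; the name must match the innermost open tag.
    void endTag(String name) {
        if (open.isEmpty() || !open.pop().equals(name)) {
            balanced = false;
        }
    }

    // Call from the empty-tag callback; it closes the most recent start tag.
    void endEmptyTag() {
        if (open.isEmpty()) {
            balanced = false;
        } else {
            open.pop();
        }
    }

    // Call at end of input: true only if every tag was closed in order.
    boolean isWellNested() {
        return balanced && open.isEmpty();
    }

    public static void main(String[] args) {
        TagNestingSketch checker = new TagNestingSketch();
        checker.startTag("p");
        checker.startTag("i");
        checker.endTag("i");
        checker.endTag("p");
        System.out.println(checker.isWellNested());
    }
}
```

A document fragment would simply leave tags on the stack, so whether that counts as an error is a decision for the caller, not the tokenizer.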


Constructor Summary
protected XmlTokenizer()
          Constructor
 
Method Summary
 void disableReferenceResolution(boolean disable)
          Turn off or on the automatic resolution of references.
protected  void foundAttributeName(byte[] input, int offset, int count)
          Method called when an attribute name has been found.
protected  void foundAttributeValue(byte[] input, int offset, int count, byte dlm)
          Method called when an attribute value has been found.
protected  void foundCharacter(char charFound)
          Method called when a character has been found in contents, this character resulting from a character reference resolution.
protected  void foundCharacterData(byte[] input, int offset, int count)
          Method called when character data content has been found.
protected  void foundComment(byte[] input, int offset, int count)
          Method called when a comment has been found.
protected  void foundDeclaration(byte[] input, int offset, int count)
          Method called when a declaration has been found.
protected  void foundEndEmptyTag()
          Method called when an empty-tag has been found.
protected  void foundEndOfInput(int count)
          Method called when the end of the input was found, and tokenization is about to end.
protected  void foundEndTagName(byte[] input, int offset, int count)
          Method called when an end-tag has been found.
protected  void foundInvalidData(byte[] input, int offset, int count)
          Method called when invalid data was found.
protected  void foundProcessingInstruction(byte[] input, int offset, int count)
          Method called when a processing instruction has been found.
protected  void foundReference(byte[] input, int offset, int count)
          Method called when a reference has been found in content.
protected  void foundStartOfInput(byte[] input, int offset, int count)
          Method called before tokenization starts.
protected  void foundStartTagName(byte[] input, int offset, int count)
          Method called when a start-tag has been found.
 int getAbsoluteOffset()
          Get the absolute offset of the data parameters of the currently reported event.
 boolean isDataCDATA()
          Tell if the data which is currently reported by foundCharacterData is CDATA versus PCDATA.
static char resolveCharacterReference(byte[] input, int offset, int count)
          Resolve a numeric or named character reference.
protected  void setCdataContents(byte[] input, int offset, int count)
          Declare the input to be CDATA, until the end tag of the element tagName is found.
 void setStrictlyXml(boolean toSet)
          Set or unset the strict Xml mode of the Parser.
 void tokenize(byte[] input)
          Tokenize an array of bytes.
 void tokenize(byte[] input, int offset, int count)
          Tokenize an array of bytes.
 void tokenize(Stream input)
          Tokenize a stream.
 void tokenize(Stream input, byte[] buffer, int start, int end, int pos)
          Tokenize an already buffered Stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XmlTokenizer

protected XmlTokenizer()
Constructor
Method Detail

tokenize

public final void tokenize(byte[] input,
                           int offset,
                           int count)
                    throws SyntaxException
Tokenize an array of bytes.
Parameters:
input - byte array to tokenize
offset - position of the first byte in the array
count - number of bytes to tokenize
Throws:
SyntaxException -  

tokenize

public final void tokenize(byte[] input)
                    throws SyntaxException
Tokenize an array of bytes.
Parameters:
input - byte array to tokenize
Throws:
SyntaxException -  

tokenize

public final void tokenize(Stream input)
                    throws SyntaxException
Tokenize a stream.
Parameters:
input - stream to tokenize
Throws:
SyntaxException -  

tokenize

public final void tokenize(Stream input,
                           byte[] buffer,
                           int start,
                           int end,
                           int pos)
                    throws SyntaxException
Tokenize an already buffered Stream.

Compared with the general method above, this tokenize method requires more arguments. It should be used when the HTML document is embedded within an HTTP stream.

Parameters:
input - stream to tokenize
buffer - buffer, already filled with bytes read from the input stream
start - starting position in the buffer
end - ending position in the buffer
pos - read position of the byte at offset 0 in the buffer
Throws:
SyntaxException -  
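
When the document follows HTTP headers in the same stream, the caller has typically already read some bytes into a buffer and needs to know where the document begins. The standalone sketch below only shows how a start position could be found by locating the blank line that ends the headers; the class and method names are hypothetical, and whether "end" is an exclusive index is an assumption not confirmed by this page.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration; it does not call any SuperWaba API.
class HttpBodyOffsetSketch {
    // Returns the index of the first byte after the CRLF CRLF that ends the
    // HTTP headers, or -1 if the blank line is not within the filled part.
    static int findBodyStart(byte[] buffer, int filled) {
        for (int i = 0; i + 3 < filled; i++) {
            if (buffer[i] == '\r' && buffer[i + 1] == '\n'
                    && buffer[i + 2] == '\r' && buffer[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buffer = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<p>Hi</p>"
                .getBytes(StandardCharsets.US_ASCII);
        int start = findBodyStart(buffer, buffer.length);
        // Assumption: these values would then play the roles of start and end.
        System.out.println(new String(buffer, start, buffer.length - start,
                StandardCharsets.US_ASCII));
    }
}
```

With the buffer holding the first bytes read from the stream, the value found here would be passed as start, the number of filled bytes as end, and the stream position of buffer offset 0 as pos.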

resolveCharacterReference

public static final char resolveCharacterReference(byte[] input,
                                                   int offset,
                                                   int count)
Resolve a numeric or named character reference. See the XML predefined entities.
Parameters:
input - byte array which describes the reference
offset - position of the first byte in the array
count - number of bytes of the reference
Returns:
the resulting character, or '\uffff' (not a unicode character) if the conversion could not be done
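
To make the contract concrete, here is a minimal standalone sketch of a resolver with the same shape. It is not the library's implementation: it assumes the bytes hold the reference without the leading '&' and trailing ';' (for example "#65", "#x41" or "amp"), and it knows only the five XML predefined entities.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration of the documented contract; not SuperWaba code.
class CharRefSketch {
    static final char INVALID = '\uffff';

    // Assumption: input holds the reference without '&' and ';'.
    static char resolve(byte[] input, int offset, int count) {
        String ref = new String(input, offset, count, StandardCharsets.US_ASCII);
        if (ref.startsWith("#")) {
            try {
                boolean hex = ref.startsWith("#x") || ref.startsWith("#X");
                int code = Integer.parseInt(ref.substring(hex ? 2 : 1), hex ? 16 : 10);
                // Only values that fit in one char and are not the sentinel are accepted.
                return (code >= 0 && code < 0xFFFF) ? (char) code : INVALID;
            } catch (NumberFormatException e) {
                return INVALID;
            }
        }
        switch (ref) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "apos": return '\'';
            case "quot": return '"';
            default:     return INVALID;
        }
    }

    public static void main(String[] args) {
        byte[] ref = "#x41".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve(ref, 0, ref.length));
    }
}
```

The real method may accept more named references, in particular HTML ones in non-strict mode; this page does not say.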

getAbsoluteOffset

public final int getAbsoluteOffset()
Get the absolute offset of the data parameters of the currently reported event.
Returns:
the absolute offset of the data parameters of the currently reported event.

setCdataContents

protected final void setCdataContents(byte[] input,
                                      int offset,
                                      int count)
Declare the input to be CDATA, until the end tag of the element tagName is found.

This setting makes it possible to handle raw character data.  For example, when the <Script> tag is reported, the derived class calls this method (written skipToEndOf("SCRIPT"); in this example) before returning.  From that point on, all input is reported as data until </SCRIPT> is found.

Note: The Tokenizer is a low-level class and does not register the tag name. Therefore, this method must be called each time the caller wants to suppress markup recognition until the end tag is found.

Parameters:
input - byte array containing the name of the element the end tag of which ends the character data
offset - position of the first character in the array
count - number of relevant bytes
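
The effect described above amounts to scanning for the end tag of the named element while ignoring all other markup. The following standalone sketch shows one way such a scan could work; it is an assumption for illustration, not the tokenizer's code, and it matches the name case-insensitively only because the example above pairs <Script> with </SCRIPT>.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration; not SuperWaba code.
class CdataEndTagSketch {
    // Returns the index of "</name" (ASCII case-insensitive) in input[from, end),
    // or -1 if it does not occur. The byte following the name is not checked.
    static int indexOfEndTag(byte[] input, int from, int end, byte[] name) {
        for (int i = from; i + 2 + name.length <= end; i++) {
            if (input[i] != '<' || input[i + 1] != '/') {
                continue;
            }
            boolean match = true;
            for (int k = 0; k < name.length; k++) {
                if (toUpper(input[i + 2 + k]) != toUpper(name[k])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // Upper-cases an ASCII letter; other bytes are returned unchanged.
    static int toUpper(byte b) {
        return (b >= 'a' && b <= 'z') ? b - 32 : b;
    }

    public static void main(String[] args) {
        byte[] input = "if (a<b) x();</script>".getBytes(StandardCharsets.US_ASCII);
        byte[] name = "SCRIPT".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOfEndTag(input, 0, input.length, name));
    }
}
```

Note how the '<' inside the script text is skipped: everything before the end tag is treated as data, which is the point of declaring the contents CDATA.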

isDataCDATA

public final boolean isDataCDATA()
Tell if the data which is currently reported by foundCharacterData is CDATA versus PCDATA.

In ISO 8879 (SGML) terminology, CDATA describes "non displayable" data, such as the contents of a SCRIPT element.  It differs from "regular" data, such as the contents of a P element, which is named PCDATA (Parsed Character Data).


setStrictlyXml

public final void setStrictlyXml(boolean toSet)
Set or unset the strict Xml mode of the Parser.

By default, the Parser will allow most commonly used HTML constructs.

Parameters:
toSet - if true, set the strict Xml mode; if false, allows HTML constructs.

disableReferenceResolution

public final void disableReferenceResolution(boolean disable)
Turn off or on the automatic resolution of references.

References are normally resolved and reported via foundCharacter(char).  When automatic resolution is turned off, foundReference(byte[],int,int) is called instead.  By default, automatic resolution of references is on, and foundReference(byte[],int,int) is not called.

This option should be set before tokenization starts.  See foundReference(byte[],int,int) for more details.

Parameters:
disable - boolean: if true automatic resolution of references is turned off, otherwise, it is turned on.

foundStartOfInput

protected void foundStartOfInput(byte[] input,
                                 int offset,
                                 int count)
Method called before tokenization starts.

A derived class may override this method to do any appropriate housekeeping (sniffing the encoding, etc.).

Parameters:
input - byte array containing the first bytes of the input about to be tokenized
offset - position of the first byte to be tokenized
count - number of bytes to be tokenized

foundStartTagName

protected void foundStartTagName(byte[] input,
                                 int offset,
                                 int count)
Method called when a start-tag has been found.

Derived class may override this method.

Parameters:
input - byte array containing the name of the tag that started
offset - position of the first character of the tag name in the array
count - number of bytes the tag name is made of

foundEndTagName

protected void foundEndTagName(byte[] input,
                               int offset,
                               int count)
Method called when an end-tag has been found.

Derived class may override this method.

Parameters:
input - byte array containing the name of the tag that ended
offset - position of the first character of the tag name in the array
count - number of bytes the tag name is made of

foundEndEmptyTag

protected void foundEndEmptyTag()
Method called when an empty-tag has been found.

This method is called just after all events related to the start tag have been reported. The implied tag name is that of the start tag (i.e., the most recently reported start-tag).

Derived class may override this method.

 Example:
    <FOO A="B"/> generates:
   - foundStartTagName("FOO");
   - foundAttributeName("A");
   - foundAttributeValue("B");
   - foundEndEmptyTag();
 

foundCharacterData

protected void foundCharacterData(byte[] input,
                                  int offset,
                                  int count)
Method called when character data content has been found.

Derived class may override this method.

Parameters:
input - byte array containing the character data that was found
offset - position of the first byte of the character data in the array
count - number of bytes the character data content is made of

foundCharacter

protected void foundCharacter(char charFound)
Method called when a character has been found in contents, this character resulting from a character reference resolution.

Derived class may override this method.

Parameters:
charFound - resolved character - if the character is invalid, this value is set to '\uffff', which is not a Unicode character.
See Also:
foundReference(byte[],int,int)

foundAttributeName

protected void foundAttributeName(byte[] input,
                                  int offset,
                                  int count)
Method called when an attribute name has been found.

Derived class may override this method.

Parameters:
input - byte array containing the attribute name
offset - position of the first character of the attribute name in the array
count - number of bytes the attribute name is made of

foundAttributeValue

protected void foundAttributeValue(byte[] input,
                                   int offset,
                                   int count,
                                   byte dlm)
Method called when an attribute value has been found.

Derived class may override this method.

Parameters:
input - byte array containing the attribute value
offset - position of the first character of the attribute value in the array
count - number of bytes the attribute value is made of
dlm - delimiter that started the attribute value (' or "). '\0' if none
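
One use of the dlm parameter is to write an attribute value back out with the quoting it had in the input. The helper below is a hypothetical standalone sketch, not part of SuperWaba; it assumes single-byte (ISO-8859-1) data and falls back to double quotes when the value was unquoted.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration; not SuperWaba code.
class AttributeQuoteSketch {
    // Rebuilds a quoted attribute value from the parameters of the event.
    // dlm is '\'' or '"', or 0 when the value had no delimiter.
    static String requote(byte[] input, int offset, int count, byte dlm) {
        String value = new String(input, offset, count, StandardCharsets.ISO_8859_1);
        char quote = (dlm == 0) ? '"' : (char) dlm;
        // No escaping is done: an unquoted value containing '"' would need it.
        return quote + value + quote;
    }

    public static void main(String[] args) {
        byte[] input = "width=100".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(requote(input, 6, 3, (byte) 0));
    }
}
```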

foundComment

protected void foundComment(byte[] input,
                            int offset,
                            int count)
Method called when a comment has been found.

Derived class may override this method.

Parameters:
input - byte array containing the comment (without the <!-- and --> delimiters)
offset - position of the first character of the comment in the array
count - number of bytes the comment is made of

foundProcessingInstruction

protected void foundProcessingInstruction(byte[] input,
                                          int offset,
                                          int count)
Method called when a processing instruction has been found.

Derived class may override this method.

Parameters:
input - byte array containing the processing instruction (without the <? and ?> delimiters)
offset - position of the first character of the processing instruction in the array
count - number of bytes the processing instruction is made of

foundDeclaration

protected void foundDeclaration(byte[] input,
                                int offset,
                                int count)
Method called when a declaration has been found.

Derived class may override this method.

Parameters:
input - byte array containing the declaration (without the <! and > delimiters)
offset - position of the first character of the declaration in the array
count - number of bytes the declaration is made of

foundReference

protected void foundReference(byte[] input,
                              int offset,
                              int count)
Method called when a reference has been found in content.

It can be either a named or numeric character reference, or an entity reference.  Given the several syntaxes of reference, no verification is made a priori on the validity of the "name" of the reference.

For convenience, the static method resolveCharacterReference(byte[],int,int) converts a character reference into its UCS-2 encoded value.

Note:  foundReference is called only if disableReferenceResolution(boolean disable) has been called first, with disable set to true.  Otherwise, foundReference is never called, and foundCharacter(char) is called instead.  This design makes it easy to handle both simple XML documents (only predefined named character entities and numeric character entities) and documents that have user-defined internal/external entities.  This is explained below.

When working with a set of externally defined entities, call disableReferenceResolution(true) to turn off automatic reference resolution. Your code in foundReference can then quickly check whether the found reference is numeric.  If it is numeric (it starts with a # character), call resolveCharacterReference.  If it is not numeric, check whether the reference belongs to the list of entities defined for the parsed document.  If it does, do the substitution; if not, call resolveCharacterReference, because it could be one of the XML predefined entities.
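
The decision procedure in the preceding paragraph can be sketched as follows. This is a standalone illustration under stated assumptions: the entity table, the class ReferenceDispatchSketch and the stand-in resolveCharacter (which plays the role of resolveCharacterReference) are all hypothetical, and the reference is assumed to arrive as a name without '&' and ';'.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the dispatch order; not SuperWaba code.
class ReferenceDispatchSketch {
    static final char INVALID = '\uffff';
    private final Map<String, String> entities = new HashMap<>();

    // Registers a user-defined entity for the document being processed.
    void define(String name, String replacement) {
        entities.put(name, replacement);
    }

    // Numeric first, then user-defined entities, then predefined names.
    String expand(String ref) {
        if (ref.startsWith("#")) {
            return String.valueOf(resolveCharacter(ref));
        }
        String replacement = entities.get(ref);
        if (replacement != null) {
            return replacement;
        }
        return String.valueOf(resolveCharacter(ref));
    }

    // Stand-in for resolveCharacterReference: numeric and XML predefined names only.
    static char resolveCharacter(String ref) {
        if (ref.startsWith("#")) {
            try {
                boolean hex = ref.startsWith("#x") || ref.startsWith("#X");
                int code = Integer.parseInt(ref.substring(hex ? 2 : 1), hex ? 16 : 10);
                return (code >= 0 && code < 0xFFFF) ? (char) code : INVALID;
            } catch (NumberFormatException e) {
                return INVALID;
            }
        }
        switch (ref) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "apos": return '\'';
            case "quot": return '"';
            default:     return INVALID;
        }
    }

    public static void main(String[] args) {
        ReferenceDispatchSketch sketch = new ReferenceDispatchSketch();
        sketch.define("company", "Example Corp");
        System.out.println(sketch.expand("company") + " " + sketch.expand("#38") + " Co");
    }
}
```

In a real subclass, expand would be called from foundReference with a copy of the reported name, after disableReferenceResolution(true) has been set.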

By default, each character reference is naturally reported via foundCharacter(char), which, again, supersedes the foundReference notification.

Derived class may override this method.

Parameters:
input - byte array containing the reference name
offset - position of the first character of the reference name in the array
count - number of bytes the reference name is made of
See Also:
setStrictlyXml(boolean toSet)

foundInvalidData

protected void foundInvalidData(byte[] input,
                                int offset,
                                int count)
Method called when invalid data was found. This is often due to a bad tag syntax.

Derived class may override this method.

Parameters:
input - byte array containing the invalid data
offset - position of the first character of the invalid data in the array
count - number of bytes the invalid data is made of

foundEndOfInput

protected void foundEndOfInput(int count)
Method called when the end of the input was found, and tokenization is about to end.

Derived class may override this method.

Parameters:
count - count of bytes parsed