superwaba.ext.xplat.xml
Class XmlReader

java.lang.Object
  |
  +--superwaba.ext.xplat.xml.XmlTokenizer
        |
        +--superwaba.ext.xplat.xml.XmlReader
Direct Known Subclasses:
HtmlReader, XmlRpcClient

public class XmlReader
extends XmlTokenizer

Class to read HTML or XML documents, reporting events to handlers (for example, ContentHandler).

Note: Although written in the spirit of SAX 2.0, this implementation is not fully compliant.  Speed and footprint took precedence over what the author judged to be details.

Unlike SAX, methods that report tag names, such as ContentHandler.startElement(int, superwaba.ext.xplat.xml.AttributeList), pass an integral tag code rather than the name itself.  This is, again, for performance reasons: comparing integers is notably more efficient than comparing strings, and tag name comparison is heavily used in XML applications.

The tag code must uniquely identify the name of the tag.  The default implementation (see getTagCode(byte[], int, int) in this class) simply hashes the tag name.  It can be overridden to suit specific needs.

Tag names should be translated to tag codes as soon as they are known, when reading the DTD for instance, or computed in advance and saved into a static correspondence table.
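The following is a minimal sketch of tag codes computed in advance and kept in a static correspondence table. The class TagCodes, its hash function, and the ASCII case folding are all assumptions made for illustration: the hash actually used by getTagCode(byte[], int, int) is not shown in this documentation. It is written against the standard desktop Java library for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the library: the real hash used by
// XmlReader.getTagCode(byte[], int, int) is not shown in this documentation.
final class TagCodes {
    // Tag codes computed once, in advance, for the names the application cares about.
    static final int HTML = hash("html");
    static final int BODY = hash("body");
    static final int PRE = hash("pre");

    // Static correspondence table from tag code back to tag name.
    static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        for (String name : new String[] {"html", "body", "pre"}) {
            String previous = NAMES.put(hash(name), name);
            if (previous != null) {
                // A tag code must uniquely identify a tag name.
                throw new IllegalStateException("hash collision: " + previous + " / " + name);
            }
        }
    }

    // Assumed hash: 31-based polynomial over the bytes, with ASCII upper case
    // folded to lower case so that "PRE" and "pre" share one code.
    static int hash(byte[] b, int offset, int count) {
        int h = 0;
        for (int i = offset; i < offset + count; i++) {
            int c = b[i] & 0xFF;
            if (c >= 'A' && c <= 'Z') {
                c += 'a' - 'A';
            }
            h = 31 * h + c;
        }
        return h;
    }

    static int hash(String name) {
        byte[] b = name.getBytes(StandardCharsets.US_ASCII);
        return hash(b, 0, b.length);
    }
}
```

A handler can then compare the reported tag code against TagCodes.PRE with a plain integer comparison. A subclass that overrides getTagCode with a different hash would need to build its table with that same hash.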


Field Summary
protected  CharacterConverter converter
          charsetName - protected to allow non-default locale encoding
protected  int tagNameHashId
          hash ID of current tag name, set by foundStartTagName or foundEndTagName
 
Constructor Summary
XmlReader()
          Constructor
 
Method Summary
 void foundAttributeName(byte[] buffer, int offset, int count)
          Method called when an attribute name has been found.
 void foundAttributeValue(byte[] buffer, int offset, int count, byte dlm)
          Method called when an attribute value has been found.
 void foundCharacter(char charFound)
          Method called when a character has been found in contents, this character resulting from a character reference resolution.
 void foundCharacterData(byte[] buffer, int offset, int count)
          Method called when character data content has been found.
 void foundComment(byte[] buffer, int offset, int count)
          Method called when a comment has been found.
 void foundEndEmptyTag()
          Method called when an empty-tag has been found.
 void foundEndOfInput(int count)
          Method called when the end of the input was found, and tokenization is about to end.
 void foundEndTagName(byte[] buffer, int offset, int count)
          Method called when an end-tag has been found.
 void foundStartTagName(byte[] buffer, int offset, int count)
          Method called when a start-tag has been found.
 ContentHandler getContentHandler()
          Return the current content handler.
protected  int getTagCode(byte[] b, int offset, int count)
          Method to compute the tag code identifying a tag name.
 void parse(byte[] input, int offset, int count)
          Parse XML data from an array of bytes, offset and count.
 void parse(Stream input)
          Parse an XML document from a Stream.
 void parse(Stream input, byte[] buffer, int start, int end, int pos)
          Parse an XML document from an already buffered Stream.
 void parse(XmlReadable input)
          Parse an XmlReadable
 AttributeList.Filter setAttributeListFilter(AttributeList.Filter filter)
          Set an AttributeList.Filter to filter the attributes entered in the AttributeList
 void setContentHandler(ContentHandler cntHandler)
          Allow an application to register a content event handler.
 void setNewlineSignificant(boolean val)
          Enable or disable coalescing white spaces, according to HTML rules.
 
Methods inherited from class superwaba.ext.xplat.xml.XmlTokenizer
disableReferenceResolution, foundDeclaration, foundInvalidData, foundProcessingInstruction, foundReference, foundStartOfInput, getAbsoluteOffset, isDataCDATA, resolveCharacterReference, setCdataContents, setStrictlyXml, tokenize, tokenize, tokenize, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

converter

protected CharacterConverter converter
charsetName - protected to allow non-default locale encoding

tagNameHashId

protected int tagNameHashId
hash ID of current tag name, set by foundStartTagName or foundEndTagName
Constructor Detail

XmlReader

public XmlReader()
Constructor
Method Detail

setContentHandler

public void setContentHandler(ContentHandler cntHandler)
Allow an application to register a content event handler.

If the application does not register a content handler, all content events reported by the SAX parser will be silently ignored.

Applications may register a new or different handler in the middle of a parse, and the SAX parser must begin using the new handler immediately.

Parameters:
cntHandler - The content handler.
Throws:
java.lang.NullPointerException - If the cntHandler argument is null.
See Also:
getContentHandler()

setAttributeListFilter

public AttributeList.Filter setAttributeListFilter(AttributeList.Filter filter)
Set an AttributeList.Filter to filter the attributes entered in the AttributeList.
Parameters:
filter - AttributeList.Filter to set, or null if the current AttributeList filter must be removed
Returns:
previous AttributeList.Filter, or null if none was set

getContentHandler

public ContentHandler getContentHandler()
Return the current content handler.
Returns:
The current content handler, or null if none has been registered.
See Also:
setContentHandler(superwaba.ext.xplat.xml.ContentHandler)

parse

public final void parse(Stream input)
                 throws SyntaxException
Parse an XML document from a Stream.

The application can use this method to instruct the XML reader to begin parsing an XML document read from a Stream.

Here is the general contract for all parse methods.

Applications may not invoke this method while a parse is in progress (they should instead create a new XmlReader for each nested XML document). Once a parse is complete, an application may reuse the same XmlReader object, possibly with a different input source.

During the parse, the XmlReader will provide information about the XML document through the registered event handlers.

This method is synchronous: it will not return until parsing has ended. If a client application wants to terminate parsing early, it should throw an exception.
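As an illustration of terminating a synchronous parse early, here is a minimal sketch of the throw-to-stop pattern. TagListener, StopParsing, MiniParser and EarlyExit are stand-in types invented for this example; they are not the library's classes and only model a parser that calls back into a handler.

```java
// Stand-in types for illustration only; they are NOT the library's classes.
interface TagListener {
    void startElement(int tagCode);
}

// Unchecked, so it can cross callback methods that declare no exceptions.
final class StopParsing extends RuntimeException {
    final int tagCode;

    StopParsing(int tagCode) {
        super("stopped at tag " + tagCode);
        this.tagCode = tagCode;
    }
}

final class MiniParser {
    int reported; // number of events delivered during the last parse

    // Synchronous: returns only when every code was reported or a listener threw.
    void parse(int[] tagCodes, TagListener listener) {
        reported = 0;
        for (int code : tagCodes) {
            reported++;
            listener.startElement(code);
        }
    }
}

final class EarlyExit {
    // Returns true if the wanted tag code occurs, stopping at its first occurrence.
    static boolean contains(MiniParser parser, int[] tagCodes, int wanted) {
        try {
            parser.parse(tagCodes, code -> {
                if (code == wanted) {
                    throw new StopParsing(code);
                }
            });
            return false; // the whole input was consumed without a match
        } catch (StopParsing stop) {
            return true;  // the listener aborted the parse
        }
    }
}
```

Using a dedicated exception type lets the caller tell a deliberate stop apart from a genuine failure such as a SyntaxException.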

Parameters:
input - The input source for the top-level XML document.
Throws:
SyntaxException -  
See Also:
setContentHandler(superwaba.ext.xplat.xml.ContentHandler)

parse

public final void parse(Stream input,
                        byte[] buffer,
                        int start,
                        int end,
                        int pos)
                 throws SyntaxException
Parse an XML document from an already buffered Stream.

Compared with the general method above, this method takes more arguments. It should be used when the HTML document is embedded within an HTTP stream.

See the general contract of parse(Stream).

Parameters:
input - stream to parse
buffer - buffer, already filled with bytes read from the input stream
start - starting position in the buffer
end - ending position in the buffer
pos - read position of the byte at offset 0 in the buffer
Throws:
SyntaxException -  
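As a sketch of how the start argument might be obtained when the first bytes of an HTTP response have already been read into a buffer, the helper below locates the first byte after the header block. HttpBodyLocator and bodyStart are hypothetical names, not part of the library, and the sketch assumes the complete header block is already in the buffer.

```java
// Hypothetical helper, not part of the library.
final class HttpBodyLocator {
    // Returns the index of the first body byte, that is, just past the blank
    // line (CR LF CR LF) that ends the HTTP headers, or -1 if that blank line
    // is not found within buffer[0..end).
    static int bodyStart(byte[] buffer, int end) {
        for (int i = 0; i + 3 < end; i++) {
            if (buffer[i] == '\r' && buffer[i + 1] == '\n'
                    && buffer[i + 2] == '\r' && buffer[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }
}
```

The returned index would then serve as start, with end marking where the buffered bytes stop. Whether end is inclusive or exclusive is not stated above, so check it against the implementation before relying on it. If the buffer holds the very first bytes read from the stream, pos would presumably be 0.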

parse

public final void parse(XmlReadable input)
                 throws SyntaxException
Parse an XmlReadable
Parameters:
input - The input source for the top-level XML document.

parse

public final void parse(byte[] input,
                        int offset,
                        int count)
                 throws SyntaxException
Parse XML data from an array of bytes, offset and count.

See the general contract of parse(Stream).

Parameters:
input - byte array to parse
offset - position of the first byte in the array
count - number of bytes to parse
Throws:
SyntaxException -  

setNewlineSignificant

public void setNewlineSignificant(boolean val)
Enable or disable coalescing white spaces, according to HTML rules.

White space is any character less than or equal to the ASCII space (0x20).
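To make the definition concrete, here is a minimal sketch of coalescing. It assumes that "HTML rules" means each run of white space collapses to a single space; WhitespaceSketch and its methods are hypothetical and do not reflect how the library implements this.

```java
// Hypothetical helper, not part of the library.
final class WhitespaceSketch {
    // White space as defined above: any byte value less than or equal to 0x20.
    static boolean isWhitespace(byte b) {
        return (b & 0xFF) <= 0x20;
    }

    // Collapses each run of white space in b[offset..offset+count) to one space.
    static String coalesce(byte[] b, int offset, int count) {
        StringBuilder out = new StringBuilder(count);
        boolean inRun = false;
        for (int i = offset; i < offset + count; i++) {
            if (isWhitespace(b[i])) {
                if (!inRun) {
                    out.append(' ');
                }
                inRun = true;
            } else {
                // Bytes are widened one-to-one to chars (ISO-8859-1 style) for brevity.
                out.append((char) (b[i] & 0xFF));
                inRun = false;
            }
        }
        return out.toString();
    }
}
```

With newlines made significant, a reader would skip this step so that line breaks inside pre-formatted content survive.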

This method makes it possible to process the contents of pre-formatted lines, such as the contents of the <PRE> tag.  When the parse starts, newlines are not significant.  Hence, setNewlineSignificant must be called after the parse has started.  For example, to make all newlines significant:


 class MyXmlReader extends XmlReader {
   public void foundStartOfInput(byte[] input, int offset, int count) {
     setNewlineSignificant(true);
   }
 }

 

Note: this is a "stacked" call.


 setNewlineSignificant(true);  // newlines are significant - stack is 1
 setNewlineSignificant(true);  // newlines are significant - stack is 2
 setNewlineSignificant(false); // newlines are still significant - stack is 1
 setNewlineSignificant(false); // newlines are no longer significant - stack is 0

 
Parameters:
val - true if newline characters must be significant, false if they must be collapsed according to HTML rules.
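The "stacked" behavior can be modeled with a simple counter. This is a sketch only: NewlineStack is a hypothetical class, and ignoring surplus false calls is an assumption, since the documentation does not say what happens when the stack is already at 0.

```java
// Hypothetical model of the "stacked" behavior; the library's internals are not shown here.
final class NewlineStack {
    private int depth; // number of true calls not yet undone by a false call

    void setNewlineSignificant(boolean val) {
        if (val) {
            depth++;
        } else if (depth > 0) {
            depth--; // assumption: extra false calls are ignored rather than going negative
        }
    }

    // Newlines are significant while at least one true call is outstanding.
    boolean isNewlineSignificant() {
        return depth > 0;
    }
}
```

Because the calls nest, a handler can call setNewlineSignificant(true) on each <PRE> start-tag and setNewlineSignificant(false) on the matching end-tag without tracking nesting itself.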

getTagCode

protected int getTagCode(byte[] b,
                         int offset,
                         int count)
Method to compute the tag code identifying a tag name.

This is the value passed to ContentHandlers to report a tag name.  Derived classes may override it.

Parameters:
b - byte array containing the bytes to be hashed
offset - position of the first byte in the array
count - number of bytes to be hashed
Returns:
the corresponding hash code

foundStartTagName

public void foundStartTagName(byte[] buffer,
                              int offset,
                              int count)
Description copied from class: XmlTokenizer
Method called when a start-tag has been found.

Derived classes may override this method.

Overrides:
foundStartTagName in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
buffer - byte array containing the name of the tag that started
offset - position of the first character of the tag name in the array
count - number of bytes the tag name is made of

foundEndTagName

public void foundEndTagName(byte[] buffer,
                            int offset,
                            int count)
Description copied from class: XmlTokenizer
Method called when an end-tag has been found.

Derived classes may override this method.

Overrides:
foundEndTagName in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
buffer - byte array containing the name of the tag that ended
offset - position of the first character of the tag name in the array
count - number of bytes the tag name is made of

foundEndEmptyTag

public final void foundEndEmptyTag()
Description copied from class: XmlTokenizer
Method called when an empty-tag has been found.

This method is called just after all events related to the starting tag have been reported. The implied tagName is the one of the starting tag (i.e. the most recently reported start-tag).

Derived classes may override this method.

 Example:
    <FOO A="B"/> generates:
   - foundStartTagName("FOO");
   - foundAttributeName("A");
   - foundAttributeValue("B");
   - foundEndEmptyTag();
 
Overrides:
foundEndEmptyTag in class XmlTokenizer

foundCharacterData

public final void foundCharacterData(byte[] buffer,
                                     int offset,
                                     int count)
Description copied from class: XmlTokenizer
Method called when character data content has been found.

Derived classes may override this method.

Overrides:
foundCharacterData in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
buffer - byte array containing the character data that was found
offset - position of the first character data in the array
count - number of bytes the character data content is made of

foundCharacter

public final void foundCharacter(char charFound)
Description copied from class: XmlTokenizer
Method called when a character has been found in contents, this character resulting from a character reference resolution.

Derived classes may override this method.

Overrides:
foundCharacter in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
charFound - resolved character - if the character is invalid, this value is set to '\uffff', which is not a Unicode character.
See Also:
XmlTokenizer.foundReference(byte[],int,int)
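For illustration, here is a minimal sketch of resolving a numeric character reference to a char, returning '\uffff' when the reference is invalid. CharRefSketch and resolve are hypothetical names. Only the '\uffff' sentinel comes from the description above; how the library itself resolves references is not shown here.

```java
// Hypothetical helper, not part of the library.
final class CharRefSketch {
    static final char INVALID = '\uffff';

    // Resolves the body of a numeric character reference, e.g. "#65" or "#x41"
    // (the text between '&' and ';'), to a char. Returns INVALID if the body is
    // malformed or the value does not fit below the sentinel.
    static char resolve(byte[] b, int offset, int count) {
        if (count < 2 || b[offset] != '#') {
            return INVALID;
        }
        int i = offset + 1;
        int radix = 10;
        if (b[i] == 'x' || b[i] == 'X') {
            radix = 16;
            i++;
        }
        int end = offset + count;
        if (i == end) {
            return INVALID; // "#x" with no digits
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = Character.digit((char) (b[i] & 0xFF), radix);
            if (digit < 0) {
                return INVALID;
            }
            value = value * radix + digit;
            if (value >= 0xFFFF) {
                return INVALID; // too large for a char, or equal to the sentinel itself
            }
        }
        return (char) value;
    }
}
```

A handler receiving '\uffff' in foundCharacter can therefore treat it as "unresolvable reference" without a separate error channel.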

foundAttributeName

public final void foundAttributeName(byte[] buffer,
                                     int offset,
                                     int count)
Description copied from class: XmlTokenizer
Method called when an attribute name has been found.

Derived classes may override this method.

Overrides:
foundAttributeName in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
buffer - byte array containing the attribute name
offset - position of the first character of the attribute name in the array
count - number of bytes the attribute name is made of

foundAttributeValue

public final void foundAttributeValue(byte[] buffer,
                                      int offset,
                                      int count,
                                      byte dlm)
Description copied from class: XmlTokenizer
Method called when an attribute value has been found.

Derived classes may override this method.

Overrides:
foundAttributeValue in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
buffer - byte array containing the attribute value
offset - position of the first character of the attribute value in the array
count - number of bytes the attribute value is made of
dlm - delimiter that started the attribute value (' or "). '\0' if none

foundComment

public final void foundComment(byte[] buffer,
                               int offset,
                               int count)
Description copied from class: XmlTokenizer
Method called when a comment has been found.

Derived classes may override this method.

Overrides:
foundComment in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
buffer - byte array containing the comment (without the <!-- and --> delimiters)
offset - position of the first character of the comment in the array
count - number of bytes the comment is made of

foundEndOfInput

public final void foundEndOfInput(int count)
Description copied from class: XmlTokenizer
Method called when the end of the input was found, and tokenization is about to end.

Derived classes may override this method.

Overrides:
foundEndOfInput in class XmlTokenizer
Tags copied from class: XmlTokenizer
Parameters:
count - count of bytes parsed