java.lang.Object | +--superwaba.ext.xplat.xml.XmlTokenizer
A tokenizer for XML input. In non-strict mode (the default), it also recognizes HTML constructs, e.g. unquoted attribute values, unterminated references, etc.
Three "tokenize" methods are provided: one takes a byte[] array; another takes a byte[] array with offset and count; the last takes a (byte) Stream. A variant of the last accepts a Stream whose first bytes have already been read into a buffer.
Tokenization events are reported via overridable "foundXxx" methods, listed in the Method Summary.
Some of these methods pass parameters pertinent to the kind of event being reported: tag name, attribute name and value, etc. These values are valid only while the event is being reported. Never assume that the reported information is still available after a "foundXxx" method returns! A persistent value is, however, available through the "getAbsoluteOffset()" method, which returns the absolute offset of the data parameters of the foundXxx method currently being called.
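Because the reported bytes are valid only during the call, a handler that needs a value later has to copy it before returning. The following is a minimal standalone sketch of that pattern, not code from this library: NameCollector and its names list are hypothetical, the method only mirrors the shape of foundStartTagName so that the sketch compiles without the SuperWaba classes, and single-byte (ISO-8859-1) names are assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical handler: same callback shape as foundStartTagName, but standalone.
final class NameCollector {
    final List<String> names = new ArrayList<>();

    // Copy the reported slice right away; the caller may reuse the buffer afterwards.
    void foundStartTagName(byte[] buffer, int offset, int count) {
        names.add(new String(buffer, offset, count, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        NameCollector collector = new NameCollector();
        byte[] shared = "html".getBytes(StandardCharsets.ISO_8859_1);
        collector.foundStartTagName(shared, 0, 4);
        // Simulate a tokenizer overwriting its working buffer with the next name.
        shared[0] = 'b'; shared[1] = 'o'; shared[2] = 'd'; shared[3] = 'y';
        collector.foundStartTagName(shared, 0, 4);
        // The first name is still intact because it was copied into a String.
        System.out.println(collector.names);
    }
}
```

Keeping a reference to the byte array itself, rather than a copy, would leave both entries pointing at whatever the buffer holds last.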
Typical invocation
class XmlTokenizerTest {
    static class MyXmlTokenizer extends XmlTokenizer
    {
        public void foundStartOfInput(byte[] buffer, int offset, int count) {
            Vm.debug("Start: " + new String(buffer, offset, count));
        }
        public void foundStartTagName(byte[] buffer, int offset, int count) {
            Vm.debug("StartTagName: " + new String(buffer, offset, count));
        }
        public void foundEndTagName(byte[] buffer, int offset, int count) {
            Vm.debug("EndTagName: " + new String(buffer, offset, count));
        }
        public void foundEndEmptyTag() {
            Vm.debug("EndEmptyTag");
        }
        public void foundCharacterData(byte[] buffer, int offset, int count) {
            Vm.debug("Content: " + new String(buffer, offset, count));
        }
        public void foundCharacter(char charFound) {
            Vm.debug("Content Ref |" + charFound + '|');
        }
        public void foundAttributeName(byte[] buffer, int offset, int count) {
            Vm.debug("AttributeName: " + new String(buffer, offset, count));
        }
        public void foundAttributeValue(byte[] buffer, int offset, int count, byte dlm) {
            Vm.debug("AttributeValue: " + new String(buffer, offset, count));
        }
        public void foundEndOfInput(int count) {
            Vm.debug("Ended: " + count + " bytes parsed.");
        }
    }

    public static void testMe()
    {
        String input = "<p>Hello<i>World!</i></p>";
        MyXmlTokenizer xtk = new MyXmlTokenizer();
        try {
            xtk.tokenize(input.getBytes());
        } catch (SyntaxException ex) {
            Vm.debug(ex.getMessage());
        }
    }
}
Note: A Tokenizer is not a Parser.
The correctness of the tag structure (stack) is not examined.
Ex: the dangling markup "<foo><bar>opop</bar></foo>"
is syntactically valid.
As a result, a Tokenizer can work on document fragments.
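Since the tokenizer does not examine the tag stack, a caller that needs well-formedness has to track it. The sketch below shows one way such a check might be layered on top of the start-tag, end-tag and empty-tag events. TagStackChecker and its method names are hypothetical and take already-copied String names; it is an illustration, not part of this class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical well-formedness checker driven by tokenizer-style events.
final class TagStackChecker {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean balanced = true;

    // A start-tag opens an element.
    void startTag(String name) {
        open.push(name);
    }

    // An empty-tag closes the element opened by the most recent start-tag.
    void endEmptyTag() {
        if (open.isEmpty()) balanced = false; else open.pop();
    }

    // An end-tag must match the innermost element that is still open.
    void endTag(String name) {
        if (open.isEmpty() || !open.pop().equals(name)) balanced = false;
    }

    // True when every end-tag matched and nothing is left open.
    boolean isWellFormed() {
        return balanced && open.isEmpty();
    }
}
```

A document fragment that legitimately leaves elements open would fail isWellFormed(), which is one reason the tokenizer itself does not impose this check.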
| Constructor Summary | | |
| protected | XmlTokenizer() | Constructor |
| Method Summary | | |
| void | disableReferenceResolution(boolean disable) | Turn off or on the automatic resolution of references. |
| protected void | foundAttributeName(byte[] input, int offset, int count) | Method called when an attribute name has been found. |
| protected void | foundAttributeValue(byte[] input, int offset, int count, byte dlm) | Method called when an attribute value has been found. |
| protected void | foundCharacter(char charFound) | Method called when a character has been found in contents, this character resulting from a character reference resolution. |
| protected void | foundCharacterData(byte[] input, int offset, int count) | Method called when character data content has been found. |
| protected void | foundComment(byte[] input, int offset, int count) | Method called when a comment has been found. |
| protected void | foundDeclaration(byte[] input, int offset, int count) | Method called when a declaration has been found. |
| protected void | foundEndEmptyTag() | Method called when an empty-tag has been found. |
| protected void | foundEndOfInput(int count) | Method called when the end of the input was found, and tokenization is about to end. |
| protected void | foundEndTagName(byte[] input, int offset, int count) | Method called when an end-tag has been found. |
| protected void | foundInvalidData(byte[] input, int offset, int count) | Method called when invalid data was found. |
| protected void | foundProcessingInstruction(byte[] input, int offset, int count) | Method called when a processing instruction has been found. |
| protected void | foundReference(byte[] input, int offset, int count) | Method called when a reference has been found in content. |
| protected void | foundStartOfInput(byte[] input, int offset, int count) | Method called before tokenization starts. |
| protected void | foundStartTagName(byte[] input, int offset, int count) | Method called when a start-tag has been found. |
| int | getAbsoluteOffset() | Get the absolute offset of the data parameters of the currently reported event. |
| boolean | isDataCDATA() | Tell if the data which is currently reported by foundCharacterData is CDATA versus PCDATA. |
| static char | resolveCharacterReference(byte[] input, int offset, int count) | Resolve a numeric or named character reference. |
| protected void | setCdataContents(byte[] input, int offset, int count) | Declare the input to be CDATA, until the end tag of the element tagName is found. |
| void | setStrictlyXml(boolean toSet) | Set or unset the strict XML mode of the tokenizer. |
| void | tokenize(byte[] input) | Tokenize an array of bytes. |
| void | tokenize(byte[] input, int offset, int count) | Tokenize an array of bytes. |
| void | tokenize(Stream input) | Tokenize a stream. |
| void | tokenize(Stream input, byte[] buffer, int start, int end, int pos) | Tokenize an already buffered Stream. |
| Methods inherited from class java.lang.Object |
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
protected XmlTokenizer()
| Method Detail |
public final void tokenize(byte[] input,
int offset,
int count)
throws SyntaxException
input - byte array to tokenize
offset - position of the first byte in the array
count - number of bytes to tokenize
public final void tokenize(byte[] input)
throws SyntaxException
input - byte array to tokenize
public final void tokenize(Stream input)
throws SyntaxException
input - stream to tokenize
public final void tokenize(Stream input,
byte[] buffer,
int start,
int end,
int pos)
throws SyntaxException
Unlike the general tokenize(Stream) method above, this variant takes additional arguments that describe a buffer already filled with bytes read from the stream. It should be used when the HTML document is embedded within an HTTP stream.
input - stream to tokenize
buffer - buffer, already filled with bytes read from the input stream
start - starting position in the buffer
end - ending position in the buffer
pos - read position of the byte at offset 0 in the buffer
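To make the roles of start, end and pos concrete, here is a hedged sketch of how a caller might derive them after reading an HTTP response header into a buffer. HttpBodyLocator is hypothetical, java.io.InputStream stands in for this library's Stream class, and the sketch assumes that end is exclusive and that the header terminator fits within the first read; none of this is confirmed by the documentation above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: locates the document inside an already buffered HTTP response.
final class HttpBodyLocator {
    // Index just past the first CRLF CRLF in buffer[0..end), or -1 if it is absent.
    static int bodyStart(byte[] buffer, int end) {
        for (int i = 0; i + 3 < end; i++) {
            if (buffer[i] == '\r' && buffer[i + 1] == '\n'
                    && buffer[i + 2] == '\r' && buffer[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] response = "HTTP/1.0 200 OK\r\n\r\n<p>Hi</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(response);
        byte[] buffer = new byte[64];
        int end = in.read(buffer);          // number of bytes now held in the buffer
        int start = bodyStart(buffer, end); // first byte of the document itself
        int pos = 0;                        // stream position of buffer[0] after the first read
        // Under the assumptions stated above, these are the values that would be
        // handed to the buffered tokenize variant along with the stream and buffer.
        System.out.println("start=" + start + " end=" + end + " pos=" + pos);
    }
}
```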
public static final char resolveCharacterReference(byte[] input,
int offset,
int count)
input - byte array which describes the reference
offset - position of the first byte in the array
count - number of bytes of the reference
public final int getAbsoluteOffset()
protected final void setCdataContents(byte[] input,
int offset,
int count)
Declare the input to be CDATA, until the end tag of the element tagName is found.
This setting makes it possible to handle raw character data. For example, when the <Script> tag is reported, the derived class calls this method with the name of the element, conceptually:
skipToEndOf("SCRIPT");
before returning. From this point on, all input is reported as data until </SCRIPT> is found.
Note: The Tokenizer is a low-level class and does not record the tag name. Therefore, this method must be called each time the caller wants to suppress markup recognition until the end tag is found.
input - byte array containing the name of the element whose end tag ends the character data
offset - position of the first character in the array
count - number of relevant bytes
public final boolean isDataCDATA()
Tell if the data which is currently reported by foundCharacterData is CDATA versus PCDATA.
In ISO 8879 (SGML) terminology, CDATA describes "non displayable" data, for instance the contents of a SCRIPT element. It differs from "regular data", for instance the contents of a P element, which is named PCDATA (Parsed Character Data).
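As an illustration of what suppressing markup recognition until an end tag means for CDATA content such as SCRIPT, the sketch below scans for the end tag of a named element and treats everything before it as data. CdataEndScanner is hypothetical and is not this class's implementation; it assumes ASCII tag names compared case-insensitively and does not check what follows the name.

```java
// Hypothetical scanner: finds where CDATA content ends for a given element name.
final class CdataEndScanner {
    // Index of the "</" that begins the end tag of `name` at or after `from`, or -1.
    static int findEndTag(byte[] input, int from, byte[] name) {
        for (int i = from; i + 2 + name.length <= input.length; i++) {
            if (input[i] != '<' || input[i + 1] != '/') continue;
            int k = 0;
            while (k < name.length && lower(input[i + 2 + k]) == lower(name[k])) k++;
            if (k == name.length) return i; // every byte of the name matched
        }
        return -1;
    }

    // ASCII-only lower-casing; other bytes are returned unchanged.
    private static int lower(byte b) {
        return (b >= 'A' && b <= 'Z') ? b + 32 : b;
    }
}
```

With this approach a '<' inside the script text, as in a comparison, is passed over because it is not followed by '/' and the element name.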
public final void setStrictlyXml(boolean toSet)
By default, the tokenizer allows the most commonly used HTML constructs.
toSet - if true, set the strict XML mode; if false, allow HTML constructs.
public final void disableReferenceResolution(boolean disable)
References are normally resolved and reported via foundCharacter(char). When automatic resolution is turned off, foundReference(byte[],int,int) is called instead.
By default, automatic resolution of references is on, and foundReference(byte[],int,int) is not called.
This option should be set before tokenization starts.
See foundReference(byte[],int,int) for more details.
disable - if true, automatic resolution of references is turned off; otherwise, it is turned on.
protected void foundStartOfInput(byte[] input,
int offset,
int count)
Derived classes may override this method to do any appropriate housekeeping (sniffing the encoding, etc.).
input - byte array containing the first bytes of the input about to be tokenized
offset - position of the first byte to be tokenized
count - number of bytes to be tokenized
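One form of encoding sniffing that an override of foundStartOfInput might perform is a byte-order-mark check on the first bytes. The sketch below is an assumption about what such housekeeping could look like, not something this class provides. EncodingSniffer is hypothetical, and it deliberately ignores UTF-32 and XML encoding declarations.

```java
// Hypothetical helper for a foundStartOfInput override: BOM-based encoding guess.
final class EncodingSniffer {
    // Returns an encoding name derived from a byte-order mark, or null when none is present.
    static String sniff(byte[] input, int offset, int count) {
        if (count >= 3
                && (input[offset] & 0xFF) == 0xEF
                && (input[offset + 1] & 0xFF) == 0xBB
                && (input[offset + 2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (count >= 2) {
            int b0 = input[offset] & 0xFF;
            int b1 = input[offset + 1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        return null; // no BOM: the caller falls back to its default encoding
    }
}
```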
protected void foundStartTagName(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the name of the tag that started
offset - position of the first character of the tag name in the array
count - number of bytes the tag name is made of
protected void foundEndTagName(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the name of the tag that ended
offset - position of the first character of the tag name in the array
count - number of bytes the tag name is made of
protected void foundEndEmptyTag()
This method is called just after all events related to the start tag have been reported. The implied tag name is that of the start tag (i.e. the most recently reported start-tag).
Derived classes may override this method.
Example: an empty-element tag named FOO with an attribute A whose value is B generates:
- foundStartTagName("FOO");
- foundAttributeName("A");
- foundAttributeValue("B");
- foundEndEmptyTag();
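Because foundEndEmptyTag carries no arguments, a handler that needs the element name has to remember the most recent start-tag itself. A minimal sketch of that bookkeeping follows. EmptyTagTracker and its log list are hypothetical, the methods only mirror the callback shapes described here, and ISO-8859-1 names are assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical handler that pairs foundEndEmptyTag with the last reported start-tag.
final class EmptyTagTracker {
    private String lastStartTag;
    final List<String> log = new ArrayList<>();

    // Remember the name as a String, since the bytes are only valid during the call.
    void foundStartTagName(byte[] input, int offset, int count) {
        lastStartTag = new String(input, offset, count, StandardCharsets.ISO_8859_1);
        log.add("start " + lastStartTag);
    }

    // The empty-tag event refers implicitly to the most recently reported start-tag.
    void foundEndEmptyTag() {
        log.add("empty " + lastStartTag);
    }
}
```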
protected void foundCharacterData(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the character data that was found
offset - position of the first character data in the array
count - number of bytes the character data content is made of
protected void foundCharacter(char charFound)
Derived classes may override this method.
charFound - resolved character; if the character is invalid, this value is set to '\uffff', which is not a Unicode character.
See Also: foundReference(byte[],int,int)
protected void foundAttributeName(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the attribute name
offset - position of the first character of the attribute name in the array
count - number of bytes the attribute name is made of
protected void foundAttributeValue(byte[] input,
int offset,
int count,
byte dlm)
Derived classes may override this method.
input - byte array containing the attribute value
offset - position of the first character of the attribute value in the array
count - number of bytes the attribute value is made of
dlm - delimiter that started the attribute value (' or "), or '\0' if none
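The dlm argument lets a handler reproduce an attribute with its original quoting. As a hedged illustration, the hypothetical AttributeWriter below rebuilds name=value text from the reported slice and falls back to double quotes when dlm is '\0', i.e. the value was unquoted. The fallback is a choice made for this sketch, not documented behaviour.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-serializes an attribute from foundAttributeValue-style arguments.
final class AttributeWriter {
    // dlm is '\'' or '"', or 0 when the value had no delimiter in the input.
    static String serialize(String name, byte[] input, int offset, int count, byte dlm) {
        String value = new String(input, offset, count, StandardCharsets.ISO_8859_1);
        char quote = (dlm == 0) ? '"' : (char) dlm; // quote unquoted values on output
        return name + "=" + quote + value + quote;
    }
}
```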
protected void foundComment(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the comment (without the <!-- and --> delimiters)
offset - position of the first character of the comment in the array
count - number of bytes the comment is made of
protected void foundProcessingInstruction(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the processing instruction (without the <? and ?> delimiters)
offset - position of the first character of the processing instruction in the array
count - number of bytes the processing instruction is made of
protected void foundDeclaration(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the declaration (without the <! and > delimiters)
offset - position of the first character of the declaration in the array
count - number of bytes the declaration is made of
protected void foundReference(byte[] input,
int offset,
int count)
It can be either a named or numeric character reference, or an entity reference. Because references have several syntaxes, the validity of the "name" of the reference is not verified a priori.
For convenience, the static method resolveCharacterReference(byte[],int,int) converts a character reference into its UCS-2 encoded value.
Note: foundReference is called only if disableReferenceResolution(boolean disable) has been called first, with disable set to true. If not, foundReference is never called, and foundCharacter(char) is called instead. This design makes it easy to handle both simple XML documents (only predefined named character entities and numeric character entities) and documents which have user-defined internal/external entities. This is explained below.
When working with a set of externally defined entities, issue disableReferenceResolution(true) to turn off automatic reference resolution. Then, your code in foundReference could make a quick check to see if the found reference is numeric. If it is numeric (it starts with a # character), call resolveCharacterReference; if it is not a numeric reference, check whether the reference belongs to the known list of entities defined for the parsed document. If it does, do the substitution; if not, call resolveCharacterReference, because it could be one of the XML Predefined Entities.
By default, each character reference is reported via foundCharacter(char), which, again, supersedes the foundReference notification.
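The decision procedure described above can be sketched as follows. This is a standalone illustration under stated assumptions, not this library's code: ReferenceResolver, define and resolve are hypothetical names, resolveCharacter is a simplified stand-in for resolveCharacterReference, and the reference name is assumed to arrive without the leading & and trailing ; characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical foundReference helper: numeric first, then user entities, then predefined.
final class ReferenceResolver {
    static final char INVALID = '\uffff';
    private final Map<String, String> userEntities = new HashMap<>();

    void define(String name, String replacement) {
        userEntities.put(name, replacement);
    }

    // Simplified stand-in: handles numeric and XML predefined references only.
    static char resolveCharacter(String name) {
        try {
            if (name.startsWith("#x") || name.startsWith("#X")) {
                return toChar(Integer.parseInt(name.substring(2), 16));
            }
            if (name.startsWith("#")) {
                return toChar(Integer.parseInt(name.substring(1), 10));
            }
        } catch (NumberFormatException e) {
            return INVALID; // malformed digits
        }
        switch (name) {
            case "lt": return '<';
            case "gt": return '>';
            case "amp": return '&';
            case "quot": return '"';
            case "apos": return '\'';
            default: return INVALID;
        }
    }

    // Only code points that fit a single UCS-2 unit are accepted.
    private static char toChar(int code) {
        return (code >= 0 && code < 0xFFFF) ? (char) code : INVALID;
    }

    // Returns the replacement text, or null when the reference is unknown.
    String resolve(byte[] input, int offset, int count) {
        String name = new String(input, offset, count, StandardCharsets.ISO_8859_1);
        if (!name.startsWith("#")) {
            String user = userEntities.get(name);
            if (user != null) return user; // document-defined entity
        }
        char c = resolveCharacter(name);
        return c == INVALID ? null : String.valueOf(c);
    }
}
```

Checking the document's own entity table before the predefined names lets a document override nothing it should not, since numeric references never reach the table at all.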
Derived classes may override this method.
input - byte array containing the reference name
offset - position of the first character of the reference name in the array
count - number of bytes the reference name is made of
See Also: setStrictlyXml(boolean toSet)
protected void foundInvalidData(byte[] input,
int offset,
int count)
Derived classes may override this method.
input - byte array containing the invalid data
offset - position of the first character of the invalid data in the array
count - number of bytes the invalid data is made of
protected void foundEndOfInput(int count)
Derived classes may override this method.
count - count of bytes parsed