info.monitorenter.cpdetector.io
Class ParsingDetector

java.lang.Object
  extended by info.monitorenter.cpdetector.io.AbstractCodepageDetector
      extended by info.monitorenter.cpdetector.io.ParsingDetector
All Implemented Interfaces:
ICodepageDetector, Serializable, Comparable

public class ParsingDetector
extends AbstractCodepageDetector

A Façade that internally uses an ANTLR-based parser / lexer.

The underlying lexer acts more as a filter: it does not verify lexical correctness by matching a defined order of tokens, but just filters out certain tokens. Currently the following tokens are filtered:

Token Name: META_CONTENT_TYPE
Match: "meta" "http-equiv" "=" '"Content-Type"' "content" "=" '"' IDENTIFIER "charset" "=" <EncName> '"'>
Lang.: HTML
Specification: W3C HTML 4.01 Specification Chapter 5.2.2

Token Name: XML_ENCODING_DECL
Match: "<?xml" VersionInfo "encoding" "=" <EncName>
Lang.: XML
Specification: Extensible Markup Language (XML) 1.0 (Third Edition) Chapter 2.8
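
To illustrate the filtering idea described above, the following is a minimal sketch that scans the head of a document for the two declarations and returns the declared encoding name. It is not the ANTLR grammar used by ParsingDetector: regular expressions are swapped in as a rough approximation, and the class name EncodingTokenFilter and the method findDeclaredEncoding are hypothetical names introduced only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of cpdetector. */
public class EncodingTokenFilter {

    // Rough regex stand-in for XML_ENCODING_DECL:
    // "<?xml" VersionInfo "encoding" "=" <EncName>
    private static final Pattern XML_ENCODING_DECL = Pattern.compile(
            "<\\?xml\\s+version\\s*=\\s*[\"'][^\"']*[\"']"
            + "\\s+encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Rough regex stand-in for META_CONTENT_TYPE:
    // "meta" "http-equiv" "=" "Content-Type" "content" "=" ... "charset" "=" <EncName>
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']Content-Type[\"']"
            + "\\s+content\\s*=\\s*[\"'][^\"']*charset\\s*=\\s*([A-Za-z][A-Za-z0-9._-]*)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the encoding name declared in the given bytes, if any.
     * Everything that is not one of the two declarations is simply skipped,
     * which mirrors the "filter" behaviour: no overall syntax check is made.
     */
    static Optional<String> findDeclaredEncoding(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // stays readable for any ASCII-compatible document encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        for (Pattern p : new Pattern[] {XML_ENCODING_DECL, META_CONTENT_TYPE}) {
            Matcher m = p.matcher(text);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(findDeclaredEncoding(html).orElse("(none found)"));
    }
}
```

A document that contains neither declaration yields an empty result, in which case a caller would have to fall back to another detection strategy.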

Author:
Achim Westermann
See Also:
Serialized Form

Constructor Summary
ParsingDetector()
           
ParsingDetector(boolean verbose)
           
 
Method Summary
 Charset detectCodepage(InputStream in, int length)
           Detects the charset encoding of a document read from any source (even a String, which a URL cannot decorate).
 
Methods inherited from class info.monitorenter.cpdetector.io.AbstractCodepageDetector
compareTo, detectCodepage, open
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ParsingDetector

public ParsingDetector()

ParsingDetector

public ParsingDetector(boolean verbose)
Method Detail

detectCodepage

public Charset detectCodepage(InputStream in,
                              int length)
                       throws IOException
Description copied from interface: ICodepageDetector

Detects the charset encoding of a document read from any source (even a String, which a URL cannot decorate).

Note that you cannot reuse the given InputStream unless it supports marking (InputStream.markSupported() == true), you mark the initial position with a sufficient readlimit before the call, and you invoke reset afterwards (without getting any exception).
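
The mark / reset requirement above can be sketched with the standard library alone. In this minimal example the class MarkResetReuse and its methods are hypothetical, and consume(...) is only a stand-in for a detector call that reads up to length bytes; it is not cpdetector code. The stream is built from a String to show that any source exposing an InputStream qualifies. The example assumes Java 9 or later for InputStream.readAllBytes().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not part of cpdetector. */
public class MarkResetReuse {

    /** Stand-in for a detector: reads at most length bytes and returns the count. */
    static int consume(InputStream in, int length) throws IOException {
        byte[] buf = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buf, total, length - total);
            if (n < 0) {
                break; // end of stream reached before length bytes were read
            }
            total += n;
        }
        return total;
    }

    /** Lets the stand-in detector read the stream, then rereads it from the start. */
    static String detectThenReread(String document, int length) {
        try {
            InputStream raw = new ByteArrayInputStream(
                    document.getBytes(StandardCharsets.UTF_8));
            // Wrap only if the stream cannot be marked on its own.
            InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);

            in.mark(length);     // remember the initial position, readlimit = length
            consume(in, length); // the detector would read here
            in.reset();          // rewind so the document can be read again

            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detectThenReread("<?xml version=\"1.0\"?><a/>", 10));
    }
}
```

Without the mark and reset calls, the second read would start where the stand-in detector stopped and the beginning of the document would be lost.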

Parameters:
in - An InputStream for the document that supports mark with a readlimit of at least the length argument.
length - The number of bytes to take into account. It should not exceed the number of bytes retrievable from the InputStream, but should be as large as possible to give the fallback detection (chardet) more hints to guess.
Throws:
IOException


Copyleft ㊢ 2003-2004 MPL 1.1, All Rights Footloose.