info.monitorenter.cpdetector.io
Class JChardetFacade

java.lang.Object
  extended by info.monitorenter.cpdetector.io.AbstractCodepageDetector
      extended by info.monitorenter.cpdetector.io.JChardetFacade
All Implemented Interfaces:
ICodepageDetector, Serializable, Comparable, org.mozilla.intl.chardet.nsICharsetDetectionObserver

public final class JChardetFacade
extends AbstractCodepageDetector
implements org.mozilla.intl.chardet.nsICharsetDetectionObserver

A façade for jchardet codepage detection. JChardet is the Java port of Frank Yung-Fong Tang's Mozilla charset detector.

This charset detector works by guessing the codepage. "The algorithm looks into the byte sequence and based on the values of each byte uses a elimination logic to narrow down to the final charset. If there is a tie between EUC charsets, it uses the second logic to narrow down. This logic uses the frequency statistics of characters in a given language." (source of description).

It is a singleton for performance reasons (buffer allocation). Because it is stateful (internal buffer), the method detectCodepage(InputStream, int) (delegated to by AbstractCodepageDetector.detectCodepage(URL)) has to be synchronized.
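The combination of a singleton and a synchronized, buffer-backed method can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the class SharedBufferReader, its method countNonAscii and the buffer size of 4096 are hypothetical names and values chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a stateful singleton; NOT the real JChardetFacade.
final class SharedBufferReader {
    private static final SharedBufferReader INSTANCE = new SharedBufferReader();

    // Shared state: the buffer is allocated once and reused by every call.
    private final byte[] buffer = new byte[4096];

    private SharedBufferReader() {
    }

    static SharedBufferReader getInstance() {
        return INSTANCE;
    }

    // synchronized: two threads must not fill and scan the shared buffer at the same time.
    synchronized int countNonAscii(InputStream in, int length) throws IOException {
        int remaining = length;
        int nonAscii = 0;
        while (remaining > 0) {
            int read = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (read < 0) {
                break; // end of stream reached before `length` bytes were seen
            }
            for (int i = 0; i < read; i++) {
                if ((buffer[i] & 0x80) != 0) {
                    nonAscii++; // high bit set: this byte is outside 7-bit ASCII
                }
            }
            remaining -= read;
        }
        return nonAscii;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(getInstance().countNonAscii(new ByteArrayInputStream(data), data.length));
    }
}
```

The trade-off is the usual one for this pattern: the buffer is allocated only once, but concurrent callers are serialized on the single instance.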

Author:
Achim Westermann
See Also:
Serialized Form

Method Summary
 Charset detectCodepage(InputStream in, int length)
           This method detects the charset encoding of any source (even a String, which a URL does not decorate!).
static JChardetFacade getInstance()
           
 boolean isGuessing()
           
 void Notify(String charset)
           
 void Reset()
           
 void setGuessing(boolean guessing)
           If it was impossible to narrow down possible results to one, an internal set of possible character encodings exists.
 
Methods inherited from class info.monitorenter.cpdetector.io.AbstractCodepageDetector
compareTo, detectCodepage, open
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getInstance

public static JChardetFacade getInstance()

detectCodepage

public Charset detectCodepage(InputStream in,
                              int length)
                       throws IOException
Description copied from interface: ICodepageDetector

This method detects the charset encoding of any source (even a String, which a URL does not decorate!).

Note that the given InputStream can be reused afterwards only if it supports marking (InputStream.markSupported() == true), the initial position was marked with a sufficient readlimit before the call, and reset is invoked afterwards without throwing an exception.
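The mark/reset requirement can be sketched with JDK classes alone. In the sketch below, consume is a hypothetical stand-in for a detector call that reads up to length bytes; it is not part of this library. Wrapping a non-markable stream in a BufferedInputStream is one common way to obtain mark support and is an assumption of this example, not something this class does for you.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class MarkResetSketch {

    // Hypothetical stand-in for a detector call: reads up to `length` bytes and discards them.
    static int consume(InputStream in, int length) throws IOException {
        byte[] buf = new byte[length];
        int total = 0;
        while (total < length) {
            int read = in.read(buf, total, length - total);
            if (read < 0) {
                break; // stream ended early
            }
            total += read;
        }
        return total;
    }

    // Lets `consume` inspect the first `length` bytes, then rewinds and reads the whole stream.
    static String readAfterDetection(InputStream raw, int length) throws IOException {
        // Streams without mark support are wrapped so that mark/reset becomes available.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(length);      // readlimit covers every byte the detector may read
        consume(in, length);
        in.reset();           // rewind to the marked start position
        return new String(in.readAllBytes(), StandardCharsets.UTF_8); // readAllBytes needs Java 9+
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAfterDetection(new ByteArrayInputStream(data), 5));
    }
}
```

If the readlimit passed to mark is smaller than the number of bytes actually read, reset may fail with an IOException, which is why the readlimit here equals the length handed to the detector.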

Specified by:
detectCodepage in interface ICodepageDetector
Parameters:
in - An InputStream for the document that supports mark with a readlimit of at least the argument length.
length - The number of bytes to take into account. This number should not exceed the number of bytes retrievable from the InputStream, but should be as large as possible to give the fallback detection (chardet) more hints to guess.
Throws:
IOException

Notify

public void Notify(String charset)
Specified by:
Notify in interface org.mozilla.intl.chardet.nsICharsetDetectionObserver
See Also:
nsICharsetDetectionObserver.Notify(java.lang.String)

Reset

public void Reset()

isGuessing

public boolean isGuessing()
Returns:
Whether guessing is enabled (the value of m_guessing).

setGuessing

public void setGuessing(boolean guessing)

If it was impossible to narrow down possible results to one, an internal set of possible character encodings exists. When guessing is set to true, calls to detectCodepage(InputStream, int) and AbstractCodepageDetector.detectCodepage(URL) return an arbitrary one of the possible Charsets.

Currently the following precedence is implemented to choose the possible Charset:

  1. If US-ASCII is possible, it is chosen.
  2. If US-ASCII is not possible, the first supported one in the set of possible charsets is returned. No information about the semantics of the order in that list is available. If none of the possible charsets is supported, an instance of UnsupportedCharset is returned.
ASCII itself is never detected directly: no internal verifier exists for ASCII, because all Charsets support ASCII. ASCII is considered possible when no Charset has been excluded, that is, when the number of possible Charsets equals the number of all detectable Charsets.
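
The precedence described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class GuessPrecedenceSketch, the method guess and the parameter detectableCount are hypothetical, and an empty Optional stands in for the UnsupportedCharset instance that the real method returns.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class GuessPrecedenceSketch {

    // `possible` holds the charset names that were not excluded, in their given order.
    // `detectableCount` is the number of charsets the detector can recognise at all.
    static Optional<Charset> guess(List<String> possible, int detectableCount) {
        // Rule 1: nothing was excluded, which is treated as "US-ASCII is possible".
        if (possible.size() == detectableCount) {
            return Optional.of(StandardCharsets.US_ASCII);
        }
        // Rule 2: otherwise take the first name that this JVM supports.
        for (String name : possible) {
            try {
                if (Charset.isSupported(name)) {
                    return Optional.of(Charset.forName(name));
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: treat it as unsupported and continue.
            }
        }
        // Stand-in for the UnsupportedCharset instance mentioned in the text.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList("x-hypothetical-unknown", "UTF-8"), 5));
    }
}
```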

Parameters:
guessing - true to enable guessing, false to disable it.


Copyleft ㊢ 2003-2004 MPL 1.1, All Rights Footloose.