info.monitorenter.cpdetector.io
Class UnicodeDetector

java.lang.Object
  extended by info.monitorenter.cpdetector.io.AbstractCodepageDetector
      extended by info.monitorenter.cpdetector.io.UnicodeDetector
All Implemented Interfaces:
ICodepageDetector, Serializable, Comparable

public class UnicodeDetector
extends AbstractCodepageDetector

This detector recognizes the byte order marks (BOMs) of the following codepages. When a mark is found, the result is fully deterministic.

00 00 FE FF UCS-4, big-endian machine (1234 order)
FF FE 00 00 UCS-4, little-endian machine (4321 order)
00 00 FF FE UCS-4, unusual octet order (2143)
FE FF 00 00 UCS-4, unusual octet order (3412)
FE FF ## ## UTF-16, big-endian
FF FE ## ## UTF-16, little-endian
EF BB BF UTF-8
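
The table above can be read as a prefix match on the first bytes of a document. The following is a minimal sketch of that idea, not the actual UnicodeDetector source: the class BomSniffer, its method names and the returned labels are hypothetical and chosen for illustration only. It assumes the four-byte UCS-4 marks have to be tested before the two-byte UTF-16 marks, since FF FE 00 00 also begins with FF FE.

```java
/** Hypothetical helper, not part of cpdetector: maps a leading byte order mark to a label. */
final class BomSniffer {

    private BomSniffer() {
        // static helper, no instances
    }

    /**
     * Returns a label for the byte order mark at the start of head,
     * or null if none of the marks from the table matches.
     */
    static String detect(byte[] head) {
        // Four-byte UCS-4 marks first: FF FE 00 00 would otherwise match UTF-16 little-endian.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UCS-4 big-endian (1234)";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UCS-4 little-endian (4321)";
        if (startsWith(head, 0x00, 0x00, 0xFF, 0xFE)) return "UCS-4 unusual order (2143)";
        if (startsWith(head, 0xFE, 0xFF, 0x00, 0x00)) return "UCS-4 unusual order (3412)";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16 big-endian";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16 little-endian";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        return null;
    }

    /** Compares the leading bytes of data, read as unsigned values, against prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch a caller passes the first four bytes of a document to BomSniffer.detect and falls back to another detection strategy when it returns null.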

Note that this detector is very fast, as it has to read at most 8 bytes to provide a result. Nevertheless, adding it to the configuration is pointless if only a few of the documents to detect are encoded in one of the codepages listed above. If added to the configuration of CodepageDetectorProxy, it should be placed at the front position to spare the computations of the subsequent detection processes.

This implementation is based on:
W3C XML Specification 1.0 3rd Edition, F.1 Detection Without External Encoding Information.

Version:
$Revision: 1.1 $
Author:
Achim Westermann
See Also:
Serialized Form

Method Summary
 Charset detectCodepage(InputStream in, int length)
           Allows detecting the charset encoding of any source (even a String, which a URL cannot wrap).
 Charset detectCodepage(URL url)
          Delegates to ICodepageDetector.detectCodepage(java.io.InputStream, int) with a buffered input stream.
static ICodepageDetector getInstance()
           
 
Methods inherited from class info.monitorenter.cpdetector.io.AbstractCodepageDetector
compareTo, open
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getInstance

public static ICodepageDetector getInstance()

detectCodepage

public Charset detectCodepage(InputStream in,
                              int length)
                       throws IOException
Description copied from interface: ICodepageDetector

Allows detecting the charset encoding of any source (even a String, which a URL cannot wrap).

Note that the given InputStream can only be reused if it supports marking (InputStream.markSupported() == true), you mark the initial position with a sufficient readlimit before the call, and a subsequent reset succeeds without an exception.

Parameters:
in - An InputStream for the document that supports mark with a readlimit of at least the argument length.
length - The number of bytes to take into account. This number should not exceed the number of bytes retrievable from the InputStream, but should be as large as possible to give the fallback detection (chardet) more hints to guess from.
Throws:
IOException
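
The note about reusing the stream can be illustrated with the standard mark/reset calls of java.io.InputStream. This is a sketch under stated assumptions rather than library code: the class MarkResetDemo and its peek method are hypothetical, and the IOException is wrapped in an UncheckedIOException only to keep the example compact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical helper, not part of cpdetector: reads a prefix and rewinds the stream. */
final class MarkResetDemo {

    private MarkResetDemo() {
        // static helper, no instances
    }

    /**
     * Reads up to length bytes from in and resets the stream to where it started,
     * so that the caller can read the same bytes again.
     */
    static byte[] peek(InputStream in, int length) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            // The readlimit must cover every byte read before reset() is called.
            in.mark(length);
            byte[] buffer = new byte[length];
            int total = 0;
            while (total < length) {
                int read = in.read(buffer, total, length - total);
                if (read < 0) {
                    break; // end of stream reached before length bytes were available
                }
                total += read;
            }
            in.reset();
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A stream that does not support marking, such as a plain FileInputStream, can be wrapped in a java.io.BufferedInputStream first.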

detectCodepage

public Charset detectCodepage(URL url)
                       throws IOException
Description copied from class: AbstractCodepageDetector
Delegates to ICodepageDetector.detectCodepage(java.io.InputStream, int) with a buffered input stream.

Specified by:
detectCodepage in interface ICodepageDetector
Overrides:
detectCodepage in class AbstractCodepageDetector
Returns:
null if the codepage of the document specified by the given URL was not detected, or the Charset that represents the document's codepage.
Throws:
IOException - thrown to indicate that it was not possible to open the document specified by the given URL.
See Also:
ICodepageDetector.detectCodepage(java.net.URL)


Copyleft ㊢ 2003-2004 MPL 1.1, All Rights Footloose.