UnicodeDetector (cpdetector, an extensible codepage-detection framework.)

info.monitorenter.cpdetector.io
Class UnicodeDetector

java.lang.Object
  info.monitorenter.cpdetector.io.AbstractCodepageDetector
      info.monitorenter.cpdetector.io.UnicodeDetector

All Implemented Interfaces:: ICodepageDetector, Serializable, Comparable

public class UnicodeDetector
extends AbstractCodepageDetector

This detector identifies the byte order marks of the following codepages. When a byte order mark is found, the result is fully deterministic.

00 00 FE FF	UCS-4, big-endian machine (1234 order)
FF FE 00 00	UCS-4, little-endian machine (4321 order)
00 00 FF FE	UCS-4, unusual octet order (2143)
FE FF 00 00	UCS-4, unusual octet order (3412)
FE FF ## ##	UTF-16, big-endian
FF FE ## ##	UTF-16, little-endian
EF BB BF	UTF-8
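
The table above can be read as a simple prefix match on the first bytes of a document. The following is a minimal sketch of that idea, not the actual implementation of this class: `BomSniffer` and `sniff` are hypothetical names, and the returned strings are plain labels rather than `Charset` instances. The four-byte UCS-4 marks are tested first, because `FF FE 00 00` and `FE FF 00 00` would otherwise match the two-byte UTF-16 patterns.

```java
/** Hypothetical helper, not part of cpdetector: maps a leading byte order mark to a label. */
final class BomSniffer {

    private BomSniffer() {
    }

    /** Returns a label for the byte order mark at the start of head, or null if none matches. */
    static String sniff(byte[] head) {
        int len = head.length;
        // Widen to unsigned values; -1 marks a position past the end of the input.
        int b0 = len > 0 ? head[0] & 0xFF : -1;
        int b1 = len > 1 ? head[1] & 0xFF : -1;
        int b2 = len > 2 ? head[2] & 0xFF : -1;
        int b3 = len > 3 ? head[3] & 0xFF : -1;

        // Four-byte UCS-4 marks come first so they are not mistaken for UTF-16.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UCS-4, big-endian (1234)";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UCS-4, little-endian (4321)";
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFF && b3 == 0xFE) return "UCS-4, unusual octet order (2143)";
        if (b0 == 0xFE && b1 == 0xFF && b2 == 0x00 && b3 == 0x00) return "UCS-4, unusual octet order (3412)";

        // Three-byte UTF-8 mark.
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // Two-byte UTF-16 marks; the following bytes may be anything at this point.
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16, big-endian";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16, little-endian";

        return null; // no byte order mark recognised
    }
}
```

Returning `null` for "no match" mirrors the documented behaviour of `detectCodepage`, which lets a caller fall through to another detector.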

Note that this detector is very fast, as it has to read at most 8 bytes to provide a result. Nevertheless, adding it to the configuration is pointless if only a few of the documents to inspect are expected to be in one of the codepages listed above. If added to the configuration of CodepageDetectorProxy, it should be placed at the front to save the computations of the subsequent detection processes.

This implementation is based on:
W3C XML Specification 1.0 3rd Edition, F.1 Detection Without External Encoding Information.

Version:: $Revision: 1.1 $
Author:: Achim Westermann
See Also:: Serialized Form

Method Summary
`Charset`	`detectCodepage(InputStream in, int length)` Detects the charset encoding from any source (even a String, which a URL does not decorate!).
`Charset`	`detectCodepage(URL url)` Delegates to `ICodepageDetector.detectCodepage(java.io.InputStream, int)` with a buffered input stream.
`static ICodepageDetector`	`getInstance()`

Methods inherited from class info.monitorenter.cpdetector.io.AbstractCodepageDetector
`compareTo, open`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Method Detail

getInstance

public static ICodepageDetector getInstance()

detectCodepage

public Charset detectCodepage(InputStream in,
                              int length)
                       throws IOException

Description copied from interface: ICodepageDetector

This method detects the charset encoding from any source (even a String, which a URL does not decorate!).

Note that the given InputStream can be reused only if it supports marking (InputStream.markSupported() == true), the initial position was marked with a sufficient readlimit, and reset is invoked afterwards without throwing an exception.
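
The mark/reset contract described above can be illustrated with plain `java.io` classes. This is a sketch under stated assumptions: `MarkResetSketch` and `peek` are hypothetical names that stand in for a detection call, and the method only reads the leading bytes instead of invoking a detector. It marks the start with a readlimit equal to `length`, consumes at most `length` bytes, and then rewinds so the caller can read the document from the beginning.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical illustration of reusing a stream after inspecting its first bytes. */
final class MarkResetSketch {

    private MarkResetSketch() {
    }

    /** Reads up to length bytes from in, then rewinds in to where it started. */
    static byte[] peek(InputStream in, int length) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(length); // readlimit must cover every byte read before reset()
        byte[] buffer = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buffer, total, length - total);
            if (n < 0) {
                break; // end of stream reached before length bytes were available
            }
            total += n;
        }
        in.reset(); // rewind so the caller can read the document from the start
        return Arrays.copyOf(buffer, total);
    }
}
```

A stream that does not support marking, such as a raw `FileInputStream`, can be wrapped in a `java.io.BufferedInputStream` before it is passed in.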

Parameters:: in - An InputStream for the document that supports mark with a readlimit of the argument length.; length - The number of bytes to take into account. This number should not exceed the number of bytes retrievable from the InputStream, but should be as large as possible to give the fallback detection (chardet) more hints to guess.
Throws:: IOException

detectCodepage

public Charset detectCodepage(URL url)
                       throws IOException

Description copied from class: AbstractCodepageDetector

Delegates to ICodepageDetector.detectCodepage(java.io.InputStream, int) with a buffered input stream.

Specified by:: detectCodepage in interface ICodepageDetector
Overrides:: detectCodepage in class AbstractCodepageDetector

Returns:: null if the codepage of the document specified by the given URL was not detected, or the Charset that represents the document's codepage.
Throws:: IOException - thrown to indicate that it was not possible to open the document specified by the given URL.
See Also:: ICodepageDetector.detectCodepage(java.net.URL)


Copyleft ㊢ 2003-2004 MPL 1.1, All Rights Footloose.
