java.lang.Object
  info.monitorenter.cpdetector.io.AbstractCodepageDetector
    info.monitorenter.cpdetector.io.UnicodeDetector
public class UnicodeDetector
This detector identifies the byte order marks of the following codepages. When a mark is detected, the result is 100 % deterministic.
| Byte order mark | Codepage |
|---|---|
| 00 00 FE FF | UCS-4, big-endian machine (1234 order) |
| FF FE 00 00 | UCS-4, little-endian machine (4321 order) |
| 00 00 FF FE | UCS-4, unusual octet order (2143) |
| FE FF 00 00 | UCS-4, unusual octet order (3412) |
| FE FF ## ## | UTF-16, big-endian |
| FF FE ## ## | UTF-16, little-endian |
| EF BB BF | UTF-8 |
Note that this detector is very fast, as it has to read at most 8 bytes to produce a result. Nevertheless, there is
little point in adding it to the configuration if few of the documents to be examined are expected to use the
codepages it detects. If it is added to the configuration of a CodepageDetectorProxy, it
should be placed at the front to save the computations of the subsequent detection processes.
This implementation is based on:
W3C XML Specification 1.0 3rd Edition,
F.1 Detection Without External Encoding Information.
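
The byte order mark table above can be read as a simple prefix check on the first bytes of a document. The following is a minimal sketch of that idea, not the cpdetector source: the class `BomSniffer`, its `Bom` enum and the `detect` method are hypothetical names introduced only for illustration. It assumes the caller has already read the leading bytes into an array, and it treats any `FE FF` or `FF FE` prefix that is not a UCS-4 mark as UTF-16.

```java
/** Hypothetical helper (not part of cpdetector) that classifies a byte order mark. */
final class BomSniffer {

    /** One constant per row of the byte order mark table, plus NONE. */
    enum Bom { UCS4_BE, UCS4_LE, UCS4_2143, UCS4_3412, UTF16_BE, UTF16_LE, UTF8, NONE }

    private BomSniffer() {}

    /** Returns the mark found at the start of {@code head}, or NONE. */
    static Bom detect(byte[] head) {
        // The four-byte UCS-4 marks are tested first, because FF FE 00 00 and
        // FE FF 00 00 also begin with the two-byte UTF-16 marks.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Bom.UCS4_BE;
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Bom.UCS4_LE;
        if (startsWith(head, 0x00, 0x00, 0xFF, 0xFE)) return Bom.UCS4_2143;
        if (startsWith(head, 0xFE, 0xFF, 0x00, 0x00)) return Bom.UCS4_3412;
        if (startsWith(head, 0xFE, 0xFF)) return Bom.UTF16_BE;
        if (startsWith(head, 0xFF, 0xFE)) return Bom.UTF16_LE;
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Bom.UTF8;
        return Bom.NONE;
    }

    /** True if {@code data} begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8Head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8Head)); // prints UTF8
    }
}
```

The order of the checks matters in this sketch: testing the longer marks first is what keeps a UCS-4 little-endian document from being reported as UTF-16.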
| Method Summary | |
|---|---|
| Charset | detectCodepage(InputStream in, int length) Detects the charset encoding of any source (even a String, which a URL cannot decorate). |
| Charset | detectCodepage(URL url) Delegates to ICodepageDetector.detectCodepage(java.io.InputStream, int) with a buffered input stream. |
| static ICodepageDetector | getInstance() |
| Methods inherited from class info.monitorenter.cpdetector.io.AbstractCodepageDetector |
|---|
| compareTo, open |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Method Detail |
|---|
public static ICodepageDetector getInstance()
public Charset detectCodepage(InputStream in,
int length)
throws IOException
Description copied from interface: ICodepageDetector
This method detects the charset encoding of any source (even a String, which a URL cannot decorate).
Note that you cannot reuse the given InputStream unless it supports marking (InputStream.markSupported() ==
true), you mark the initial position with a sufficient readlimit, and you invoke
reset afterwards (without getting any exception).
Parameters:
in - An InputStream for the document that supports mark and a
readlimit of argument length.
length - The number of bytes to take into account. This number should not
be greater than the number of bytes retrievable from the
InputStream, but should be as large as possible to give the fallback
detection (chardet) more hints to guess.
Throws:
IOException
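
The note on reusing the InputStream can be illustrated with the standard `mark` and `reset` calls of `java.io.InputStream`. The sketch below is an assumption-laden example rather than cpdetector code: `MarkResetSketch` and `peek` are hypothetical names, and the helper simply reads up to `length` bytes and rewinds the stream so that a later consumer sees the document from its start.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of cpdetector) showing mark/reset around a read. */
final class MarkResetSketch {

    private MarkResetSketch() {}

    /** Reads up to {@code length} bytes, then rewinds {@code in} to where it was. */
    static byte[] peek(InputStream in, int length) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // The readlimit must cover every byte read before reset() is called.
        in.mark(length);
        try {
            byte[] buffer = new byte[length];
            int total = 0;
            while (total < length) {
                int n = in.read(buffer, total, length - total);
                if (n < 0) {
                    break; // end of stream reached before length bytes
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] document = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // BufferedInputStream adds mark support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(document));
        byte[] head = peek(in, 3);
        System.out.println(head.length);              // prints 3
        System.out.println(Integer.toHexString(in.read())); // prints ef
    }
}
```

Wrapping a stream in a `BufferedInputStream` is the usual way to obtain mark support when the underlying stream does not offer it.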
public Charset detectCodepage(URL url)
throws IOException
Description copied from class: AbstractCodepageDetector
Delegates to ICodepageDetector.detectCodepage(java.io.InputStream, int) with a buffered input stream.
Specified by:
detectCodepage in interface ICodepageDetector
Overrides:
detectCodepage in class AbstractCodepageDetector
Returns:
the Charset that represents the document's codepage.
Throws:
IOException - thrown to indicate that it was not possible to open the
document specified by the given URL.
See Also:
ICodepageDetector.detectCodepage(java.net.URL)
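
Delegating from a URL to a stream-based detection with a buffered input stream could look roughly like the following. This is a sketch under stated assumptions, not the library's implementation: `UrlDelegationSketch` and the `StreamDetector` interface are hypothetical stand-ins, and closing the stream with try-with-resources is a choice made here that the documentation above does not specify.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch (not part of cpdetector) of URL-to-stream delegation. */
final class UrlDelegationSketch {

    /** Hypothetical stand-in for a stream-based detection method. */
    interface StreamDetector {
        Charset detect(InputStream in, int length) throws IOException;
    }

    private UrlDelegationSketch() {}

    /** Opens the URL, buffers the stream, and hands it to the detector. */
    static Charset detect(URL url, StreamDetector detector, int length) throws IOException {
        // Assumption of this sketch: the stream is closed once detection is done.
        try (InputStream in = new BufferedInputStream(url.openStream())) {
            return detector.detect(in, length);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bom-demo", ".txt");
        try {
            Files.write(file, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
            // Toy detector: recognizes only the UTF-8 byte order mark.
            StreamDetector utf8Only = (in, length) -> {
                byte[] head = new byte[3];
                int n = in.read(head, 0, 3);
                boolean isUtf8 = n == 3
                        && (head[0] & 0xFF) == 0xEF
                        && (head[1] & 0xFF) == 0xBB
                        && (head[2] & 0xFF) == 0xBF;
                return isUtf8 ? StandardCharsets.UTF_8 : null;
            };
            System.out.println(detect(file.toUri().toURL(), utf8Only, 8));
        } finally {
            Files.delete(file);
        }
    }
}
```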