
What is cpdetector?

The name cpdetector is short for "code page detector" and has nothing to do with Java classpaths. cpdetector is a framework for configurable code page detection of documents. It may be used to detect the code page of documents retrieved from remote hosts. Code page detection is needed whenever the encoding of a document is not known. It is therefore a core requirement for any application in the field of information mining or information retrieval.

What is a code page?

Excerpt from http://www.unicode.org/standard/WhatIsUnicode.html:

"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one."

At first, a textual document is nothing more than a sequence of bits. A computer has to decide how to display this data as characters (which the computer identifies by numbers). A code page, also known as a charset encoding, maps the raw data of a textual document to characters. The original ASCII code page, for example, uses only 7 bits of an octet (byte) to determine the character, which allows only 128 different characters to be mapped. In the past, memory was expensive and computers most often had only 8-bit registers and buses. When a mainframe was conceived, it had to be decided which characters it should support. Physicists and mathematicians, for example, needed special characters for equations. As a result, a computer often shipped with a special code page.

Even today, when Unicode unifies the different character encodings by providing a unique number (code point) for every character, documents on the internet are encoded in many different code pages. Asian documents in particular draw on a very large number of characters and are therefore often encoded in language-specific code pages. In order to process a textual document, its bits have to be mapped (decoded) to characters by the correct character encoding table (code page).
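The effect of choosing the wrong code page can be shown with the JDK alone. The following is a minimal sketch, not part of cpdetector; the class name DecodeDemo and its methods are made up for illustration. It decodes the same two bytes once as UTF-8 and once as ISO-8859-1 and gets different text each time.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class (not part of cpdetector): one byte sequence, two code pages. */
final class DecodeDemo {

    // The two bytes that UTF-8 uses for the single character U+00E9 (e with acute accent).
    static final byte[] RAW = {(byte) 0xC3, (byte) 0xA9};

    /** Decodes the bytes with the code page they were written in. */
    static String decodeAsUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes with a wrong code page: every byte becomes its own character. */
    static String decodeAsLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Print lengths instead of the characters so the result does not depend on the console encoding.
        System.out.println("UTF-8 decoding yields " + decodeAsUtf8().length() + " character(s)");
        System.out.println("ISO-8859-1 decoding yields " + decodeAsLatin1().length() + " character(s)");
    }
}
```

With the correct code page the bytes form one character; with ISO-8859-1 they form the two characters U+00C3 and U+00A9, which is the kind of garbage a reader sees when a document is opened with the wrong encoding.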

A further definition is given on http://www.webster-dictionary.org.

Code page detection is needed in:

  • Search engines:
    Code page detection is the first step that has to be performed on an incoming document. A crawler would retrieve the raw binary document, detect the code page and then map the bits to characters to continue with tokenization, annotation, indexing, language identification and further desirable steps.
  • Browsers:
    You will have seen weird documents in your browser containing lots of empty squares or question marks. In that case, you either requested a document that was opened with the wrong character encoding or your computer does not know how to render the characters of the code page.
    Browsers like Mozilla contain clever strategies for code page detection.
  • File sharing software:
    BitTorrent, for example, needs to exchange metadata (.torrent files) that has to be interpreted as a character stream.

Why configurable code page detection?

One may need different techniques to find out the code page of a document. These techniques vary with the type of document to be processed. XML documents may specify the "encoding" attribute in ASCII characters (the ASCII range is defined in almost all code pages, so the search for it may be performed by interpreting the unknown document as ASCII). HTML pages may specify the "charset" attribute in a meta tag. The hard way would be to exclude code pages by inspecting byte sequences and to narrow down the remaining candidates by frequency analysis of characters. This step can be skipped if other techniques are successful, but those other techniques are only useful for certain types of documents.
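The idea of trying cheap techniques first and falling back to the hard way can be sketched with the JDK alone. The following does not use the cpdetector API; the class DetectionSketch, its method names, and the 1024-byte limit for the ASCII scan are assumptions made for illustration. It first looks for a declared charset in an XML declaration or HTML meta tag, and only if none is found does it exclude candidates by strict decoding. Frequency analysis is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of cpdetector): cheap checks first, exclusion as fallback. */
final class DetectionSketch {

    // encoding="..." inside an XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_ENCODING =
            Pattern.compile("<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // charset=... inside an HTML meta tag, quoted or unquoted
    private static final Pattern HTML_CHARSET =
            Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._-]+)",
                    Pattern.CASE_INSENSITIVE);

    /** Reads the head of the document as ASCII and looks for a declared charset name. */
    static Optional<Charset> declaredCharset(byte[] document) {
        // Assumption: the declaration sits within the first 1024 bytes.
        // This does not work for encodings that are not ASCII-compatible, such as UTF-16.
        int length = Math.min(document.length, 1024);
        String head = new String(document, 0, length, StandardCharsets.US_ASCII);
        for (Pattern pattern : new Pattern[] {XML_ENCODING, HTML_CHARSET}) {
            Matcher matcher = pattern.matcher(head);
            if (matcher.find()) {
                try {
                    return Optional.of(Charset.forName(matcher.group(1)));
                } catch (IllegalArgumentException unknownOrIllegalName) {
                    // Declared name is not usable on this JVM; try the next technique.
                }
            }
        }
        return Optional.empty();
    }

    /** Returns false if the bytes contain a sequence that is invalid in the candidate code page. */
    static boolean decodesCleanly(byte[] document, Charset candidate) {
        try {
            candidate.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(document));
            return true;
        } catch (CharacterCodingException invalidForCandidate) {
            return false;
        }
    }

    /** Tries the declaration first, then returns the first candidate that is not excluded. */
    static Optional<Charset> detect(byte[] document, List<Charset> candidates) {
        Optional<Charset> declared = declaredCharset(document);
        if (declared.isPresent()) {
            return declared;
        }
        // Order matters: ISO-8859-1 accepts every byte sequence, so strict
        // encodings such as UTF-8 should be listed before it.
        return candidates.stream()
                .filter(candidate -> decodesCleanly(document, candidate))
                .findFirst();
    }
}
```

A caller would pass the raw bytes and an ordered candidate list, for example detect(bytes, Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)). Exclusion alone only rules code pages out; it cannot tell apart two single-byte code pages that both accept the input, which is where frequency analysis would come in.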

Latest News

The news is exported by a weekly cron job (which tends to be ignored), so it might take a day before changes are reflected.
Last retrieval: Monday, 26-Nov-2007 18:00:05 PST.

cpDetector in comparison   2007-05-03 23:30 - cpDetector
Fred Eaker published a blog article in which he compares eight different code page detection libraries. One of the available strategies of cpDetector (charset detection by parsing) did quite a good job.

cpDetector 1.0.5 released   2007-04-21 11:06 - cpDetector
cpDetector is a configurable Java framework for code page detection of textual documents. It may be used to allow applications like browsers, file-sharing software or search engines to correctly handle documents received over a network.

cpDetector awakes   2006-10-18 06:24 - cpDetector
After more than a year without any new release or work on cpDetector, the latest report of two severe bugs will lead to a new release soon. Project work has already begun. Along with the bug fixes, new code conventions and improved Java documentation will be released.

cpDetector 1.04 released   2005-03-02 02:48 - cpDetector
cpDetector is a configurable Java framework for code page detection of textual documents.

The new version 1.04 is a stability release. A bug in the Ant build has been fixed, and the fit document encoding test has been documented.

cpdetector 1.03 released   2004-12-14 07:38 - cpDetector
cpdetector is a configurable Java framework for code page detection of textual documents. It may be used to allow applications like browsers, file-sharing software or search engines to correctly handle data received over a network.

cpDetector welcomes new Developer   2004-10-25 01:57 - cpDetector
Demian downloaded the project and quickly contributed a best-practice solution that integrates cpdetector (http://cpdetector.sourceforge.net/doc/javadoc/index.html?cpdetector/CharsetPrinter.html).

cpdetector makes it to sourceforge's front page!   2004-10-21 12:48 - cpDetector
This is not intended for the front page but is a diary entry that marks a milestone for cpdetector. Started in July 2004, the project has steadily grown into a known name in information mining / internationalization. This is the result not only of development but also of promotional work. For its problem domain, cpdetector now has a high (the highest) ranking in search engines and is visited regularly (not only upon announcements).

cpdetector 1.02 released   2004-10-21 06:45 - cpDetector
cpdetector is a configurable Java framework for code page detection of textual documents. It may be used to allow applications like browsers, file-sharing software or search engines to correctly handle data received over a network. The new version 1.02 is the result of starting quality assurance and covers 2 severe bugs and additional features (XML DTDs, test/build automation and a new ASCII fallback detection implementation).

cpdetector 1.01 (stable) released   2004-09-24 09:10 - cpDetector
cpdetector is a configurable Java framework for code page detection of textual documents. It may be used to allow applications like browsers, file-sharing software or search engines to correctly handle data received over a network.

cpdetector documentation released.   2004-08-21 06:06 - cpDetector
Documentation for cpdetector, a Java code page detection framework, is available on the new website: http://cpdetector.sourceforge.net.


Last updated by Achim Westermann.