cpdetector, free java code page detection.

What is cpdetector?

The name cpdetector is a short form for code page - detector and has nothing to do with java classpaths. cpdetector is a framework for configurable code page-detection of documents. It may be used to detect the code page of documents retrieved from remote hosts. Code page - detection is needed whenever it is not known, which encoding a document belongs to. Therefore it is a core requirement for any application in the field of information mining or just information retrieval.

What is a code page?

Excerpt from http://www.unicode.org/standard/WhatIsUnicode.html:

"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one."

At first, a textual document is nothing more than sequences of bits. A computer has to decide, how he can display this data in form of characters (which are identified by the computer as numbers). A code page - which is also known as charset encoding - maps the raw data of a textual document to characters. The original ASCII code page for example only uses 7 bits of an octet (byte) for deciding the character that is represented thus allowing only to map 128 different characters. In the past memory was expensive and computers most often only had registers and busses for 8 bit. When a mainframe was conceived it had to be decided, which characters it should support. Physicians and mathematicians for example needed special characters for equations. As a result, a computer often shipped with a special codepage.

Even today - where unicode unifies all different character encodings by providing a unique number (codepoint) for every character - the documents in the internet are encoded in various different code pages. Especially asian documents consist of a huge amount of characters and therefore often are encoded in special language-specific codepages. In order to process a textual document, it's bits have to be mapped (decoded) to characters by the correct character encoding table (code page).

A further definition is given here http://en.wikipedia.org/wiki/Code_page.

Codepage-detection is needed in:

Search engines:
Codepage detection is the first step that has to be performed with an incoming document. A crawler would retrieve the raw binary document, detect the codepage and then map the bits to characters to continue with tokenization, annotation, indexing, language identification and further desireable steps.
Browsers:
You will have seen weird documents in your browser conatining lots of empty squares or question marks. In that case, you either requested a document that was opened with the wrong character encoding or your computer just does know how to render the characters of the code page.
Browser like mozilla contain clever strategies for code page - detection.
File sharing software:

Bittorrent for example needs to exchange metadata (.torrent files) that has to be interpreted as character stream.

Why configurable code page - detection?

One may need different techniques to find out the codepage of a document. These techniques vary from the type of documents to be processed. XML documents may specify the "encoding" attribute in ASCII characters (the ASCII range is defined in almost all code pages so the search for it may be performed by interpretation of the unknown document as ASCII). HTML pages may specify the "charset" attribute in a meta tag. The hard way would be to perform exclusion of code pages by inspection of byte-sequences and narrow down the remaining candidates by frequency analyis for characters. This way could be skipped if other techniques are successful. But these other techniques are only useful for certain types of documents... .

Latest News

Read the latest news here: https://sourceforge.net/news/?group_id=114421

Alternativley use the rss feed: https://sourceforge.net/export/rss2_projnews.php?group_id=114421&rss_fulltext=1

Last updated on by Achim Westermann

hits: