cpdetector, free Java code page detection.

Usage

This page provides a quick howto for cpdetector. Additional information may be found at the documentation page.

Design notes

The framework is small and simple, but clever.

An interface for code page detection is the top-level entry point for all applications. Applications that want to use cpdetector have to import this interface.
A proxy implements this interface. It delegates code page detection requests to the contained implementations, each of which implements a different detection technique. The proxy may be configured with different concrete implementations to delegate to. Currently a simple "first one returning non-null wins" strategy is used.
Concrete implementations are shipped. One is a facade for the Java port of Mozilla's chardet (jchardet), which uses exclusion and guessing by frequency analysis. Another implementation is a facade for a parser/lexer combination generated from an ANTLR grammar; it currently searches for the HTML charset attribute (XML encoding will follow if a feature request is made).
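
The "first one returning non-null wins" delegation described above can be sketched in a few lines of plain Java. This is only an illustration of the pattern, not cpdetector's code: the names Detector, FirstMatchProxy and ProxySketch are hypothetical, and the two toy detectors (a regular expression for the HTML charset attribute and a 7-bit check) are simple stand-ins for the shipped implementations.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the detection interface (not cpdetector's real API). */
interface Detector {
    /** Returns the detected charset, or null if this technique cannot decide. */
    Charset detect(byte[] data);
}

/** Hypothetical proxy: asks each delegate in order; the first non-null answer wins. */
final class FirstMatchProxy implements Detector {
    private final List<Detector> delegates = new ArrayList<>();

    FirstMatchProxy add(Detector delegate) {
        delegates.add(delegate);
        return this;
    }

    @Override
    public Charset detect(byte[] data) {
        for (Detector delegate : delegates) {
            Charset result = delegate.detect(data);
            if (result != null) {
                return result;
            }
        }
        return null; // no delegate could decide
    }
}

final class ProxySketch {
    // Charset names must start with a letter or digit, so the pattern enforces that.
    private static final Pattern META_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)", Pattern.CASE_INSENSITIVE);

    /** Toy detector: looks for an HTML charset attribute (a regex, not a real parser). */
    static Charset htmlMeta(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding never fails.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }

    /** Toy detector: pure 7-bit input is reported as US-ASCII. */
    static Charset sevenBitAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return null; // high bit set, so not plain ASCII
            }
        }
        return StandardCharsets.US_ASCII;
    }

    /** Builds a chain in which the HTML check runs before the ASCII fallback. */
    static Detector defaultChain() {
        return new FirstMatchProxy()
            .add(ProxySketch::htmlMeta)
            .add(ProxySketch::sevenBitAscii);
    }

    public static void main(String[] args) {
        byte[] html = "<meta charset=\"UTF-8\">".getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(defaultChain().detect(html));
        System.out.println(defaultChain().detect(plain));
    }
}
```

The order of the add calls matters: a cheap or very specific detector placed first can short-circuit the more expensive guessing that follows.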

There are two ways to use cpdetector:

Taxonomy sorting: The shipped cpdetector.jar may be used to start a command line tool that recursively searches a given directory for input documents and sorts them into subfolders of an output directory, each named after the detected code page. Optionally, the documents may be transformed to a desired target code page.
I used this tool to sort a huge collection of crawled documents by their code page and to run tests on documents in specific code pages (Unicode normalisation performance tests for documents in different encodings).
Application integration: The framework may be configured and integrated into your application in order to perform operations depending on the code page. Maybe you want to add code page detection as a feature to a search engine. Feel the power of internationalization for only 0 cents per license. Order now! I will never become a salesman.

Taxonomy sorting

cpdetector.jar, along with the 3rd party libraries (both available in the binary release at SourceForge), may be used to sort documents by their detected code page. Simply invoke the following command (the classpath below uses the Windows separator ";"; on Unix-like systems use ":" instead):

java -cp jargs.jar;cpdetector_1.0.10.jar;antlr.jar;chardet.jar info.monitorenter.cpdetector.CodepageProcessor

Because the mandatory arguments are missing, you will see the usage output:


usage: java -jar codepageProcessor.jar [options]
options:

  Optional:
  -e <extensions> : A comma- or semicolon- separated string for document extensions 
                    like "-e txt,dat" (without dot or space!).
  -m              : Move files with unknown charset to directory "unknown".
  -v              : Verbose output.
  -w <int>        : Wait <int> seconds before trying next document 
                    (good, if you want to work on the very same machine).

  -t <charset>    : Try to transform the document to given charset (code page) name.
                    This is only possible for documents that are detected to have a
                    code page that is supported by the current java VM. If not possible
                    sorting will be done as normal.
  -c              : Semicolon-separated list of fully qualified classnames. 
                    These classenames will be instantiadet, casted to ICodepageDetector instances
                    and used to detect the code page of documents in the order specified.
                    If this argument is ommited, a HTMLCodepageDetector followed by .
                    a JChardetFacade is used by default.
  Mandatory:
  -r            : Root directory containing the collection (recursive).
  -o            : Output directory containing the sorted collection.
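
To make the sorting and the -t transformation more concrete, here is a minimal sketch in plain Java of what such a step could look like for a single document. It is not the CodepageProcessor implementation: the class SortSketch and its methods are hypothetical, the charset is passed in instead of being detected, the file is copied rather than moved, and the "unknown" folder for a null charset only mirrors the wording of the -m option.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SortSketch {
    /**
     * Copies the file into outRoot/<charset name>/ and returns the new path.
     * If target is non-null, the content is re-encoded from detected to target.
     * A null detected charset sorts the file, unchanged, into "unknown".
     */
    static Path sortInto(Path file, Charset detected, Path outRoot, Charset target) {
        try {
            String folder = (detected == null) ? "unknown" : detected.name();
            Path dir = Files.createDirectories(outRoot.resolve(folder));
            Path dest = dir.resolve(file.getFileName());
            if (detected != null && target != null) {
                // Decode with the detected code page, encode with the target one.
                String text = new String(Files.readAllBytes(file), detected);
                Files.write(dest, text.getBytes(target));
            } else {
                Files.copy(file, dest, StandardCopyOption.REPLACE_EXISTING);
            }
            return dest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Round trip in temp directories: a Latin-1 file is sorted and re-encoded as UTF-8. */
    static String demo() {
        try {
            Path in = Files.createTempDirectory("cp-in");
            Path out = Files.createTempDirectory("cp-out");
            Path doc = in.resolve("doc.txt");
            Files.write(doc, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
            Path sorted = sortInto(doc, StandardCharsets.ISO_8859_1, out, StandardCharsets.UTF_8);
            String text = new String(Files.readAllBytes(sorted), StandardCharsets.UTF_8);
            // Normalise the separator so the result looks the same on every platform.
            return out.relativize(sorted).toString().replace('\\', '/') + " -> " + text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A java.nio.charset.Charset instance can only exist for a code page the running Java VM supports, which is why a transformation like this is limited to documents whose detected code page the VM knows.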

Application integration

The following demo code shows how to configure the proxy for code page detection and how to use it. The example is kept simple. If you want to reuse the proxy, just keep it as a member of a class that provides a service involving code page detection. Please add the 3rd party libraries of the binary download to the classpath of your IDE.

import info.monitorenter.cpdetector.io.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class MyUsage {
  // Create the proxy:
  private final CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance(); // A singleton.

  // Constructor:
  public MyUsage() {
    // Add the implementations of info.monitorenter.cpdetector.io.ICodepageDetector.
    // The proxy asks them in the order in which they are added.
    // This one is quick if we deal with Unicode code pages:
    detector.add(new ByteOrderMarkDetector());
    // This one tries to detect the meta charset attribute in HTML pages:
    detector.add(new ParsingDetector(true)); // Be verbose about parsing.
    // This one does the tricks of exclusion and frequency detection, if the previous
    // implementations are unsuccessful:
    detector.add(JChardetFacade.getInstance()); // Another singleton.
    detector.add(ASCIIDetector.getInstance()); // Fallback, see javadoc.
  }
  // ...
  public boolean someMethod(File document) throws IOException {
    boolean ret = false;
    // Work with the configured proxy:
    Charset charset = detector.detectCodepage(document.toURI().toURL());
    if (charset == null) {
      // Detection failed: report the bogus document.
      System.err.println("bogus document: " + document);
    } else {
      // Open the document in the given code page:
      try (Reader reader = new InputStreamReader(new FileInputStream(document), charset)) {
        // Read from it, do something, whatever you desire. The characters are now (hopefully) correct.
        ret = true;
      }
    }
    return ret;
  }
  // ...
}

In order to compile this example, you have to put cpdetector.jar into your classpath. Your IDE will certainly support inclusion of *.jar files into your javac classpath.
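
The comment in the demo notes that the byte order mark detector is quick for Unicode code pages. The reason is that only the first few bytes have to be inspected. The following plain Java sketch illustrates the idea; BomSketch is a hypothetical name, the code is not taken from cpdetector's ByteOrderMarkDetector, and it only covers UTF-8 and UTF-16 (UTF-32 marks are ignored for brevity).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSketch {
    /** Returns the charset announced by a leading byte order mark, or null if none. */
    static Charset fromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE; // would also match a UTF-32LE mark
        }
        return null;
    }

    /** Compares the leading bytes of data, as unsigned values, with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        System.out.println(fromBom(utf8));
    }
}
```

A document without a byte order mark yields null here, which is exactly the case in which the proxy moves on to the next detector.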

Last updated on by Achim Westermann
