info.monitorenter.unicode.decoder
Class DecodeUtil

java.lang.Object
  extended by info.monitorenter.unicode.decoder.DecodeUtil

public final class DecodeUtil
extends Object

Easy-to-use utility functions for decoding to Unicode.

Be careful with the methods that work on String data (as opposed to streams): large documents can cause an OutOfMemoryError.

Version:
$Revision: 1.10 $
Author:
Achim Westermann

Method Summary
static String decodeHtmlEntities(String html, boolean recursive)
          Decodes HTML entities in the given String into their Unicode representation.
static void main(String[] args)
          Main hook used for short test.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

decodeHtmlEntities

public static String decodeHtmlEntities(String html,
                                        boolean recursive)
                                 throws antlr.RecognitionException,
                                        antlr.TokenStreamException,
                                        IOException
Decodes HTML entities (e.g. &nbsp;) in the given String into their Unicode representation.

This method should perform quickly, as an ANTLR-generated parser is used.

HTML entities are described in http://www.w3.org/TR/html401/sgml/entities.html

For enterprise support of arbitrarily large files, prefer the approach of HtmlEntityDecoderReader.
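
The reason a stream-based approach scales to large input can be illustrated with a minimal sketch. This is not the API or implementation of HtmlEntityDecoderReader, which is not documented on this page; the class StreamingDecodeSketch, its tiny entity table and the MAX_NAME cap are all hypothetical. The sketch reads one character at a time and buffers at most one pending entity, so memory use does not grow with the size of the document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Map;

final class StreamingDecodeSketch {

    // Hypothetical, deliberately tiny entity table (assumption, not the library's table).
    private static final Map<String, String> ENTITIES =
        Map.of("amp", "&", "lt", "<", "gt", ">", "ouml", "\u00F6");

    // Hypothetical cap on the entity name length, so the pending buffer stays bounded.
    private static final int MAX_NAME = 10;

    static void decode(Reader in, Writer out) throws IOException {
        StringBuilder pending = new StringBuilder(); // holds "&name" while an entity is open
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (pending.length() == 0) {
                // Not inside an entity: either open one or pass the character through.
                if (ch == '&') pending.append(ch); else out.write(ch);
            } else if (ch == ';') {
                // Entity closed: replace it if known, otherwise emit it unchanged.
                String replacement = ENTITIES.get(pending.substring(1));
                out.write(replacement != null ? replacement : pending + ";");
                pending.setLength(0);
            } else if (Character.isLetter(ch) && pending.length() <= MAX_NAME) {
                pending.append(ch);
            } else {
                // Not an entity after all: flush the buffered text verbatim.
                out.write(pending.toString());
                pending.setLength(0);
                if (ch == '&') pending.append(ch); else out.write(ch);
            }
        }
        out.write(pending.toString()); // flush an unterminated "&..." at end of input
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        decode(new StringReader("a &lt; b &amp; c"), out);
        System.out.println(out); // a < b & c
    }
}
```

In real use the Reader would wrap a file or network stream rather than a StringReader, and the Writer would be a file or response stream.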

Parameters:
html - the HTML data in which to decode HTML entities.
recursive - if true, the input is processed until it no longer contains any character entity references (decoding &amp;ouml; will produce ö).
Returns:
a new String with the Unicode representation of the HTML entities in the input html.
Throws:
IOException - if something goes wrong.
antlr.TokenStreamException - if invalid character data was found in the underlying stream. This is unlikely to happen, as the lexer covers all characters, but if it should happen (ANTLR error?) this method cannot deal with the problem and does not catch the exception.
antlr.RecognitionException - if an invalid format was found in the given html. This is unlikely to happen, as the grammar accepts any tokens, but if it should happen (ANTLR error?) this method cannot deal with the problem and does not catch the exception.
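
The effect of the recursive parameter can be illustrated with a small, self-contained sketch. It does not use the ANTLR-generated parser or any class of this library; RecursiveDecodeSketch, its regular expression and its five-entry entity table are hypothetical stand-ins chosen only to show single-pass versus repeated decoding.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecursiveDecodeSketch {

    // Hypothetical, deliberately tiny entity table (assumption, not the library's table).
    private static final Map<String, String> ENTITIES = Map.of(
        "amp", "&",
        "lt", "<",
        "gt", ">",
        "nbsp", "\u00A0",
        "ouml", "\u00F6");

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // One pass: replace every known named entity, leave unknown ones untouched.
    static String decodeOnce(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // With recursive == true, repeat the pass until it no longer changes the text.
    static String decode(String html, boolean recursive) {
        String current = decodeOnce(html);
        if (!recursive) {
            return current;
        }
        String next = decodeOnce(current);
        while (!next.equals(current)) {
            current = next;
            next = decodeOnce(current);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(decode("&amp;ouml;", false)); // &ouml;
        System.out.println(decode("&amp;ouml;", true));  // the single character o-umlaut (U+00F6)
    }
}
```

A single pass turns the escaped ampersand into a literal one, which leaves a new entity reference behind; only the repeated pass resolves that reference as well.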

main

public static void main(String[] args)
                 throws antlr.RecognitionException,
                        antlr.TokenStreamException,
                        IOException,
                        jargs.gnu.CmdLineParser.IllegalOptionValueException,
                        jargs.gnu.CmdLineParser.UnknownOptionException
Main hook used for short test.

Parameters:
args - ignored.
Throws:
antlr.RecognitionException - if something goes wrong in the parser.
antlr.TokenStreamException - if something goes wrong in the lexer.
IOException - if something goes wrong during I/O.
jargs.gnu.CmdLineParser.UnknownOptionException - if an unknown option is given.
jargs.gnu.CmdLineParser.IllegalOptionValueException - if an option is given an illegal value.


Copyleft ㊢ 2003-2004 MPL 1.1, All Rights Footloose.