CharsetToolkit (Groovy 1.8.5)

groovy.util
[Java] Class CharsetToolkit

java.lang.Object
  groovy.util.CharsetToolkit

public class CharsetToolkit
extends Object

Utility class to guess the encoding of a given text file.

Unicode files encoded in UTF-16 (little or big endian), and UTF-8 files that start with a byte order mark (BOM), are detected reliably. For UTF-8 files with no BOM, the charset should also be detected, provided the buffer is large enough.

A 4 KB byte buffer is used to guess the encoding.
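
The BOM-less UTF-8 case can be illustrated with a small sketch. It is not CharsetToolkit's actual implementation: the class `Utf8Sniffer` and its methods are hypothetical names, and the approach (running the JDK's strict UTF-8 decoder over the buffer) is only one plausible way to perform such a check. A multi-byte sequence cut off at the end of the buffer is tolerated, since a fixed-size buffer can end in the middle of a character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class Utf8Sniffer {

    /** Returns true if every byte in buf[0..len) is below 0x80, i.e. plain US-ASCII. */
    static boolean isAscii(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (buf[i] < 0) {
                return false; // high bit set: not 7-bit ASCII
            }
        }
        return true;
    }

    /**
     * Returns true if buf[0..len) contains no malformed UTF-8.
     * A sequence truncated at the very end of the buffer is not treated as an error.
     */
    static boolean isValidUtf8Prefix(byte[] buf, int len) {
        // A new decoder reports malformed input instead of replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(buf, 0, len);
        // UTF-8 never produces more chars than input bytes.
        CharBuffer out = CharBuffer.allocate(len);
        // endOfInput = false: a trailing partial sequence yields UNDERFLOW, not an error.
        CoderResult result = decoder.decode(in, out, false);
        return !result.isError();
    }
}
```

A buffer that passes `isAscii` carries no evidence either way, which is why a wider buffer improves the guess: the more bytes are inspected, the more likely a non-ASCII sequence appears that either confirms or rules out UTF-8.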

Usage:

 CharsetToolkit toolkit = new CharsetToolkit(file);

 // guess the encoding
 Charset guessedCharset = toolkit.getCharset();

 // create a reader with the correct charset
 BufferedReader reader = toolkit.getReader();

 // read the file content
 String line;
 while ((line = reader.readLine()) != null)
 {
     System.out.println(line);
 }

Authors:: Guillaume Laforge

Constructor Summary
`CharsetToolkit(File file)` Constructor of the `CharsetToolkit` utility class.

Method Summary
`static Charset[]`	`getAvailableCharsets()` Retrieves all the `Charset`s available on the platform, including the default charset.
`Charset`	`getCharset()`
`Charset`	`getDefaultCharset()` Retrieves the default `Charset`.
`static Charset`	`getDefaultSystemCharset()` Retrieves the default charset of the system.
`boolean`	`getEnforce8Bit()` Gets the enforce8Bit flag, used when a US-ASCII encoding should never be returned.
`BufferedReader`	`getReader()` Gets a `BufferedReader` (actually a `LineNumberReader`) for the `File` specified in the constructor of `CharsetToolkit`, using the discovered charset, or the default charset if an 8-bit `Charset` is encountered.
`boolean`	`hasUTF16BEBom()` Tests whether the buffer has a byte order mark for UTF-16 big endian (utf-16 and ucs-2).
`boolean`	`hasUTF16LEBom()` Tests whether the buffer has a byte order mark for UTF-16 little endian (ucs-2le, ucs-4le, and ucs-16le).
`boolean`	`hasUTF8Bom()` Tests whether the buffer has a byte order mark for UTF-8 (used by Microsoft's Notepad and other editors).
`void`	`setDefaultCharset(Charset defaultCharset)` Defines the default `Charset` used when the buffer represents an 8-bit `Charset`.
`void`	`setEnforce8Bit(boolean enforce)` If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII.

Methods inherited from class Object
wait, wait, wait, equals, toString, hashCode, getClass, notify, notifyAll

Constructor Detail

CharsetToolkit

public CharsetToolkit(File file)

Constructor of the CharsetToolkit utility class.

Parameters:: file - the file whose encoding is to be guessed.

Method Detail

getAvailableCharsets

public static Charset[] getAvailableCharsets()

Retrieves all the Charsets available on the platform, including the default charset.

Returns:: an array of Charsets.
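
As an illustration only, an array like this can be built from the standard JDK call `Charset.availableCharsets()`. Whether CharsetToolkit does it this way is an assumption; `AvailableCharsetsSketch` is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.util.Collection;

// Hypothetical sketch, not the CharsetToolkit source.
final class AvailableCharsetsSketch {

    /** Returns every charset the running JVM supports, as an array. */
    static Charset[] availableCharsets() {
        // availableCharsets() returns a map from canonical name to Charset.
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }
}
```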

getCharset

public Charset getCharset()

getDefaultCharset

public Charset getDefaultCharset()

Retrieves the default Charset.

getDefaultSystemCharset

public static Charset getDefaultSystemCharset()

Retrieves the default charset of the system.

Returns:: the default Charset.

getEnforce8Bit

public boolean getEnforce8Bit()

Gets the enforce8Bit flag, used when a US-ASCII encoding should never be returned.

Returns:: the enforce8Bit flag: true if the default charset is returned in place of US-ASCII.

getReader

public BufferedReader getReader()

Gets a BufferedReader (actually a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.

Throws:: FileNotFoundException if the file is not found.

Returns:: a BufferedReader
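
The kind of reader described here can be sketched with standard java.io classes. This is a minimal illustration under assumptions, not the CharsetToolkit source: `ReaderSketch.openReader` is a hypothetical helper that takes the charset as a parameter, whereas the real method uses the charset it detected.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.Charset;

// Hypothetical sketch, not the CharsetToolkit source.
final class ReaderSketch {

    /**
     * Opens the file with the given charset. The result is declared as a
     * BufferedReader but is really a LineNumberReader, so callers that need
     * line numbers can cast it.
     */
    static BufferedReader openReader(File file, Charset charset) throws FileNotFoundException {
        return new LineNumberReader(new InputStreamReader(new FileInputStream(file), charset));
    }
}
```

Because `LineNumberReader` extends `BufferedReader`, the declared return type stays simple while line tracking remains available through a cast.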

hasUTF16BEBom

public boolean hasUTF16BEBom()

Tests whether the buffer has a byte order mark (BOM) for UTF-16 big endian (utf-16 and ucs-2).

Returns:: true if the buffer has a BOM for UTF-16 big endian.

hasUTF16LEBom

public boolean hasUTF16LEBom()

Tests whether the buffer has a byte order mark (BOM) for UTF-16 little endian (ucs-2le, ucs-4le, and ucs-16le).

Returns:: true if the buffer has a BOM for UTF-16 little endian.

hasUTF8Bom

public boolean hasUTF8Bom()

Tests whether the buffer has a byte order mark (BOM) for UTF-8 (used by Microsoft's Notepad and other editors).

Returns:: true if the buffer has a BOM for UTF-8.
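
The three BOM checks above amount to comparing the first bytes of the buffer against fixed signatures: EF BB BF for UTF-8, FE FF for UTF-16 big endian, and FF FE for UTF-16 little endian. The sketch below shows that comparison. `BomSniffer` and its method names are hypothetical and take the buffer as a parameter, unlike the instance methods documented here.

```java
// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class BomSniffer {

    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] buf) {
        return buf.length >= 3
                && (buf[0] & 0xFF) == 0xEF
                && (buf[1] & 0xFF) == 0xBB
                && (buf[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 big-endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] buf) {
        return buf.length >= 2
                && (buf[0] & 0xFF) == 0xFE
                && (buf[1] & 0xFF) == 0xFF;
    }

    /** UTF-16 little-endian BOM: FF FE (a UTF-32 little-endian BOM also starts with these bytes). */
    static boolean hasUtf16LeBom(byte[] buf) {
        return buf.length >= 2
                && (buf[0] & 0xFF) == 0xFF
                && (buf[1] & 0xFF) == 0xFE;
    }
}
```

Masking with `0xFF` is needed because Java bytes are signed, so a raw comparison such as `buf[0] == 0xEF` would always be false.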

setDefaultCharset

public void setDefaultCharset(Charset defaultCharset)

Defines the default Charset used when the buffer represents an 8-bit Charset.

Parameters:: defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.

setEnforce8Bit

public void setEnforce8Bit(boolean enforce)

If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII. The file may simply contain no characters in the range 128-255 yet, while actually being, or later becoming, a file encoded with the default charset rather than US-ASCII.

Parameters:: enforce - true to return the default charset instead of US-ASCII.
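
The effect of the flag can be summarized as a small decision rule. The sketch below is an interpretation of the description above, not the CharsetToolkit source; `Enforce8BitSketch.resolve` and its parameters are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the enforce8Bit rule, not the CharsetToolkit source.
final class Enforce8BitSketch {

    /**
     * Returns the charset to report: when the flag is set and the guess is
     * US-ASCII, the default charset replaces it; otherwise the guess stands.
     */
    static Charset resolve(Charset guessed, Charset defaultCharset, boolean enforce8Bit) {
        if (enforce8Bit && StandardCharsets.US_ASCII.equals(guessed)) {
            return defaultCharset;
        }
        return guessed;
    }
}
```

With the flag set, a file that currently contains only 7-bit characters is still read with the default charset, so it keeps decoding correctly once accented characters are added to it.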
