CharsetToolkit (Groovy 2.4.3)

java.lang.Object
- groovy.util.CharsetToolkit

public class CharsetToolkit
extends Object

Utility class to guess the encoding of a given text file.

Unicode files encoded in UTF-16 (little or big endian) or UTF-8 files with a byte order mark (BOM) are detected reliably. For UTF-8 files without a BOM, the charset should also be detected, provided the buffer is large enough.
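
As an illustration of the BOM check described above, here is a minimal sketch in plain Java. The class `BomSniffer` and its method `charsetFromBom` are hypothetical names introduced for this example; they are not part of `CharsetToolkit`, and the sketch does not claim to mirror its implementation. It only tests for the three standard BOM byte sequences.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of groovy.util.CharsetToolkit.
class BomSniffer {

    /** Returns the charset implied by a leading BOM, or null if no known BOM is present. */
    static Charset charsetFromBom(byte[] buffer) {
        // UTF-8 BOM: EF BB BF
        if (buffer.length >= 3
                && (buffer[0] & 0xFF) == 0xEF
                && (buffer[1] & 0xFF) == 0xBB
                && (buffer[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (buffer.length >= 2) {
            int b0 = buffer[0] & 0xFF;
            int b1 = buffer[1] & 0xFF;
            // UTF-16 little endian BOM: FF FE (UTF-32LE, which starts the same way, is ignored here)
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
            // UTF-16 big endian BOM: FE FF
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(charsetFromBom(utf8WithBom));
    }
}
```

Masking each byte with `0xFF` matters because Java bytes are signed, so `0xEF` would otherwise compare as a negative value.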

A 4 KB byte buffer is used to guess the encoding.
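
One way to picture how a BOM-less UTF-8 file can be recognized from a bounded buffer is a strict decode of the first 4 KB. The sketch below uses only the standard `CharsetDecoder` API; the class `Utf8Probe`, its constant, and its method name are assumptions made for this example, and the approach shown is not necessarily the algorithm `CharsetToolkit` uses.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of groovy.util.CharsetToolkit.
class Utf8Probe {

    // Assumed sample size, matching the 4 KB buffer mentioned in the text.
    static final int BUFFER_SIZE = 4096;

    /** Returns true if the first BUFFER_SIZE bytes contain no malformed UTF-8 sequence. */
    static boolean looksLikeUtf8(byte[] data) {
        int length = Math.min(data.length, BUFFER_SIZE);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // UTF-8 never yields more chars than input bytes, so this output buffer cannot overflow.
        CharBuffer out = CharBuffer.allocate(length);
        // endOfInput = false: a multi-byte sequence cut off at the end of the sample
        // is reported as underflow, not as an error.
        CoderResult result = decoder.decode(ByteBuffer.wrap(data, 0, length), out, false);
        return !result.isError();
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));
        System.out.println(looksLikeUtf8(latin1));
    }
}
```

Pure 7-bit ASCII also passes this check, since ASCII is a subset of UTF-8, so a real detector needs a separate rule for that case. A larger sample makes it more likely that a non-ASCII byte is actually seen.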

Usage:

 CharsetToolkit toolkit = new CharsetToolkit(file);

 // guess the encoding
 Charset guessedCharset = toolkit.getCharset();

 // create a reader with the correct charset
 BufferedReader reader = toolkit.getReader();

 // read the file content
 String line;
 while ((line = reader.readLine()) != null)
 {
     System.out.println(line);
 }

Author: Guillaume Laforge

Constructor Summary

Constructors
Constructor and Description
`CharsetToolkit(File file)` Constructor of the `CharsetToolkit` utility class.

Method Summary

Methods
Modifier and Type	Method and Description
`static Charset[]`	`getAvailableCharsets()` Retrieves all the `Charset`s available on the platform, including the default charset.
`Charset`	`getCharset()`
`Charset`	`getDefaultCharset()` Retrieves the default `Charset`.
`static Charset`	`getDefaultSystemCharset()` Retrieves the default charset of the system.
`boolean`	`getEnforce8Bit()` Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
`BufferedReader`	`getReader()` Gets a `BufferedReader` (actually a `LineNumberReader`) for the `File` specified in the constructor of `CharsetToolkit`, using the discovered charset, or the default charset if an 8-bit `Charset` is encountered.
`boolean`	`hasUTF16BEBom()` Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
`boolean`	`hasUTF16LEBom()` Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
`boolean`	`hasUTF8Bom()` Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).
`void`	`setDefaultCharset(Charset defaultCharset)` Defines the default `Charset` used in case the buffer represents an 8-bit `Charset`.
`void`	`setEnforce8Bit(boolean enforce)` If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - CharsetToolkit
```
public CharsetToolkit(File file)
               throws IOException
```
    Constructor of the CharsetToolkit utility class.
    
    Parameters:
    file - of which we want to know the encoding.
    
    Throws:
    
    IOException
- Method Detail
  - setDefaultCharset
```
public void setDefaultCharset(Charset defaultCharset)
```
    Defines the default Charset used in case the buffer represents an 8-bit Charset.
    
    Parameters:
    defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.
  - getCharset
```
public Charset getCharset()
```
  - setEnforce8Bit
```
public void setEnforce8Bit(boolean enforce)
```
    If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII. The file may simply contain no characters in the range 128-255 yet, while actually being (or later becoming) a file encoded with the default charset rather than US-ASCII.
    
    Parameters:
    enforce - true to return the default charset instead of US-ASCII.
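    
    The effect of the flag can be pictured with a small sketch. The class `AsciiFallback` and its methods are hypothetical names introduced for this example and do not reproduce the actual `CharsetToolkit` code; UTF-8 and BOM detection are deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of groovy.util.CharsetToolkit.
class AsciiFallback {

    /** Returns true if every byte is in the 7-bit range 0-127. */
    static boolean isSevenBit(byte[] buffer) {
        for (byte b : buffer) {
            if (b < 0) { // values 128-255 are negative as signed Java bytes
                return false;
            }
        }
        return true;
    }

    /**
     * Simplified decision: a pure 7-bit buffer is reported as US-ASCII
     * unless enforce8Bit is set; everything else falls back to the
     * default charset (UTF-8 detection is omitted in this sketch).
     */
    static Charset resolve(byte[] buffer, boolean enforce8Bit, Charset defaultCharset) {
        if (isSevenBit(buffer) && !enforce8Bit) {
            return StandardCharsets.US_ASCII;
        }
        return defaultCharset;
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve(plain, false, StandardCharsets.ISO_8859_1));
        System.out.println(resolve(plain, true, StandardCharsets.ISO_8859_1));
    }
}
```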
  - getEnforce8Bit
```
public boolean getEnforce8Bit()
```
    Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
    
    Returns:
    the current value of the enforce8Bit flag.
  - getDefaultCharset
```
public Charset getDefaultCharset()
```
    Retrieves the default Charset.
  - getDefaultSystemCharset
```
public static Charset getDefaultSystemCharset()
```
    Retrieves the default charset of the system.
    
    Returns:
    the default Charset.
  - hasUTF8Bom
```
public boolean hasUTF8Bom()
```
    Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).
    
    Returns:
    true if the buffer has a BOM for UTF8.
  - hasUTF16LEBom
```
public boolean hasUTF16LEBom()
```
    Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
    
    Returns:
    true if the buffer has a BOM for UTF-16 Little Endian.
  - hasUTF16BEBom
```
public boolean hasUTF16BEBom()
```
    Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
    
    Returns:
    true if the buffer has a BOM for UTF-16 Big Endian.
  - getReader
```
public BufferedReader getReader()
                         throws FileNotFoundException
```
    Gets a BufferedReader (actually a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
    
    Returns:
    a BufferedReader
    
    Throws:
    
    FileNotFoundException - if the file is not found.
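    
    For readers unfamiliar with charset-aware readers, the following sketch shows how a `LineNumberReader` can be opened over a file with an explicit charset using only the standard library. The class `CharsetReaders` and its methods are hypothetical; the charset is passed in by the caller here, whereas `getReader()` uses the one the toolkit discovered.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper for illustration only; not part of groovy.util.CharsetToolkit.
class CharsetReaders {

    /** Opens a line-counting reader that decodes the file with the given charset. */
    static BufferedReader open(File file, Charset charset) throws FileNotFoundException {
        return new LineNumberReader(
                new InputStreamReader(new FileInputStream(file), charset));
    }

    /** Writes text to a temporary file in the given charset and reads its first line back. */
    static String firstLineRoundTrip(String text, Charset charset) {
        try {
            File tmp = File.createTempFile("charset-demo", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), text.getBytes(charset));
            // try-with-resources closes the reader and the underlying stream
            try (BufferedReader reader = open(tmp, charset)) {
                return reader.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                firstLineRoundTrip("caf\u00e9\nsecond line", StandardCharsets.ISO_8859_1));
    }
}
```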
  - getAvailableCharsets
```
public static Charset[] getAvailableCharsets()
```
    Retrieves all the Charsets available on the platform, including the default charset.
    
    Returns:
    an array of Charsets.
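    
    A comparable array can be built with the standard `java.nio.charset.Charset` API, as sketched below. The class `CharsetListing` is a hypothetical name for this example, and whether `CharsetToolkit` builds its array this way is an assumption.

```java
import java.nio.charset.Charset;
import java.util.Collection;

// Hypothetical helper for illustration only; not part of groovy.util.CharsetToolkit.
class CharsetListing {

    /** Returns every charset the running JVM supports, as an array. */
    static Charset[] availableCharsets() {
        // Charset.availableCharsets() maps canonical names to Charset objects.
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    public static void main(String[] args) {
        Charset[] charsets = availableCharsets();
        System.out.println(charsets.length + " charsets available");
        System.out.println("JVM default: " + Charset.defaultCharset());
    }
}
```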
