Package groovy.util
Class CharsetToolkit
java.lang.Object
    groovy.util.CharsetToolkit
public class CharsetToolkit extends Object
Utility class to guess the encoding of a given text file. Unicode files encoded in UTF-16 (little or big endian), and UTF-8 files with a Byte Order Mark (BOM), are correctly discovered. For UTF-8 files with no BOM, the charset should also be discovered, provided the buffer is large enough.
A byte buffer of 4 KB is used to guess the encoding.
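The BOM-less UTF-8 case can be illustrated with a minimal sketch that uses only the JDK. This is not the CharsetToolkit implementation; the class name Utf8Probe and the method looksLikeUtf8 are hypothetical. The sketch simply asks a strict decoder whether the buffer is well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Probe {
    // Hypothetical helper (not part of CharsetToolkit):
    // returns true if the whole buffer decodes as well-formed UTF-8.
    static boolean looksLikeUtf8(byte[] buffer) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(buffer));
            return true;
        } catch (CharacterCodingException e) {
            // At least one byte sequence is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));
        System.out.println(looksLikeUtf8(latin1));
    }
}
```

Two caveats apply to this sketch: pure ASCII input also passes the check, and a multi-byte sequence cut off at the end of a fixed-size buffer is reported as malformed, which a real detector would have to tolerate.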
Usage:
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
Constructor Summary
Constructor | Description
CharsetToolkit(File file) | Constructor of the CharsetToolkit utility class.
Method Summary
Modifier and Type | Method | Description
static Charset[] | getAvailableCharsets() | Retrieves all the available Charsets on the platform, including the default charset.
Charset | getCharset() |
Charset | getDefaultCharset() | Retrieves the default Charset.
static Charset | getDefaultSystemCharset() | Retrieves the default charset of the system.
boolean | getEnforce8Bit() | Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
BufferedReader | getReader() | Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
boolean | hasUTF16BEBom() | Tests for a Byte Order Mark for UTF-16 Big Endian (utf-16 and ucs-2).
boolean | hasUTF16LEBom() | Tests for a Byte Order Mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
boolean | hasUTF8Bom() | Tests for a Byte Order Mark for UTF-8 (used by Microsoft's Notepad and other editors).
void | setDefaultCharset(Charset defaultCharset) | Defines the default Charset used in case the buffer represents an 8-bit Charset.
void | setEnforce8Bit(boolean enforce) | If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII.
Constructor Detail
CharsetToolkit
public CharsetToolkit(File file) throws IOException
Constructor of the CharsetToolkit utility class.
Parameters:
 file - the file whose encoding we want to know.
Throws:
 IOException
 
Method Detail
setDefaultCharset
public void setDefaultCharset(Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
Parameters:
 defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.
 
getCharset
public Charset getCharset()
 
setEnforce8Bit
public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII. A file may contain no special characters in the range 128-255 and still be, or later become, a file encoded with the default charset rather than US-ASCII.
Parameters:
 enforce - true to return the default charset instead of US-ASCII; false to allow US-ASCII.
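The effect of the flag can be sketched as a small decision function. This is an illustration of the documented behaviour under simplifying assumptions, not the CharsetToolkit source; the class AsciiPolicy and the method resolve are hypothetical, and the sketch skips the UTF-8 and BOM checks a real detector would run first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiPolicy {
    // Hypothetical helper: picks a charset for a buffer that has no BOM.
    static Charset resolve(byte[] buffer, boolean enforce8Bit, Charset defaultCharset) {
        for (byte b : buffer) {
            if ((b & 0x80) != 0) {
                // A byte above 127 was found, so the buffer is not pure ASCII.
                // Simplification: fall back to the default 8-bit charset.
                return defaultCharset;
            }
        }
        // Only 7-bit bytes were seen: US-ASCII fits, unless the flag
        // asks for the default charset to be returned instead.
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }

    public static void main(String[] args) {
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve(ascii, true, StandardCharsets.ISO_8859_1));
        System.out.println(resolve(ascii, false, StandardCharsets.ISO_8859_1));
    }
}
```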
 
getEnforce8Bit
public boolean getEnforce8Bit()
Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
Returns:
 a boolean representing the flag of use of US-ASCII.
 
 
getDefaultCharset
public Charset getDefaultCharset()
Retrieves the default Charset.

getDefaultSystemCharset
public static Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
Returns:
 the default Charset.
 
hasUTF8Bom
public boolean hasUTF8Bom()
Tests whether the buffer has a Byte Order Mark for UTF-8 (used by Microsoft's Notepad and other editors).
Returns:
 true if the buffer has a BOM for UTF-8.
 
 
hasUTF16LEBom
public boolean hasUTF16LEBom()
Tests whether the buffer has a Byte Order Mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
Returns:
 true if the buffer has a BOM for UTF-16 Little Endian.
 
 
hasUTF16BEBom
public boolean hasUTF16BEBom()
Tests whether the buffer has a Byte Order Mark for UTF-16 Big Endian (utf-16 and ucs-2).
Returns:
 true if the buffer has a BOM for UTF-16 Big Endian.
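The three BOM checks above come down to comparing the first bytes of the buffer with fixed signatures: EF BB BF for UTF-8, FF FE for UTF-16 Little Endian, and FE FF for UTF-16 Big Endian. A minimal stand-alone sketch follows; the class BomSniffer and its method names are hypothetical and only mirror the documented methods.

```java
final class BomSniffer {
    // UTF-8 BOM: EF BB BF
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    // UTF-16 Little Endian BOM: FF FE
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFF
                && (b[1] & 0xFF) == 0xFE;
    }

    // UTF-16 Big Endian BOM: FE FF
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFE
                && (b[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(hasUtf8Bom(sample));
        System.out.println(hasUtf16LeBom(sample));
        System.out.println(hasUtf16BeBom(sample));
    }
}
```

Masking each byte with 0xFF is needed because Java bytes are signed, so a raw comparison of b[0] with 0xEF would never match.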
 
 
getReader
public BufferedReader getReader() throws FileNotFoundException
Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
Returns:
 a BufferedReader
Throws:
 FileNotFoundException - if the file is not found.
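For readers who want to see what such a reader looks like in plain JDK terms, the sketch below opens a LineNumberReader over a file with an explicitly chosen charset. It is an assumption about the general shape only, not the CharsetToolkit source; the class ReaderSketch and the method open are hypothetical, and the sketch does not skip a leading BOM.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.Charset;

final class ReaderSketch {
    // Hypothetical helper: a LineNumberReader is a BufferedReader,
    // so it can be returned under the more general type.
    static BufferedReader open(File file, Charset charset) throws FileNotFoundException {
        return new LineNumberReader(
                new InputStreamReader(new FileInputStream(file), charset));
    }
}
```

The caller is responsible for closing the returned reader, for example with try-with-resources.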
 
getAvailableCharsets
public static Charset[] getAvailableCharsets()
Retrieves all the available Charsets on the platform, including the default charset.
Returns:
 an array of Charsets.
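The JDK exposes the same information through java.nio.charset.Charset, so an equivalent array can be built without the toolkit. The sketch below is one plausible way to do it, not necessarily how CharsetToolkit does; the class CharsetListing and its method are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Collection;

final class CharsetListing {
    // Hypothetical helper: all charsets the current JVM supports, as an array.
    static Charset[] availableCharsets() {
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    public static void main(String[] args) {
        Charset[] charsets = availableCharsets();
        System.out.println(charsets.length + " charsets available");
        System.out.println("Default: " + Charset.defaultCharset());
    }
}
```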
 