Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Marker (BOM), are detected correctly. For UTF-8 files with no BOM, the charset should also be detected if the buffer is large enough.
A byte buffer of 4 KB is used to guess the encoding.
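
The BOM-less UTF-8 case can be illustrated with a small sketch. This is not the CharsetToolkit implementation: `Utf8Probe` and its methods are hypothetical names, and the check relies only on the JDK's strict `CharsetDecoder`. It assumes the caller has already read a sample (for example the first 4 KB) into a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class Utf8Probe {
    private Utf8Probe() {}

    /** True if every sampled byte is below 0x80, i.e. the sample is plain US-ASCII. */
    static boolean isAscii(byte[] sample, int length) {
        for (int i = 0; i < length; i++) {
            if ((sample[i] & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    /** True if the first {@code length} bytes decode as UTF-8 without any error. */
    static boolean looksLikeUtf8(byte[] sample, int length) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(sample, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

One limitation of this sketch: if the sample ends in the middle of a multi-byte sequence, the strict decoder reports it as malformed, so a real detector would need to tolerate a truncated final character. A sample that is pure ASCII also passes the UTF-8 check, which is why the two tests are kept separate.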
Usage:

    CharsetToolkit toolkit = new CharsetToolkit(file);

    // guess the encoding
    Charset guessedCharset = toolkit.getCharset();

    // create a reader with the correct charset
    BufferedReader reader = toolkit.getReader();

    // read the file content
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }

| Constructor and description |
|---|
| CharsetToolkit(File file) Constructor of the CharsetToolkit utility class. |

| Type Params | Return Type | Name and description |
|---|---|---|
|  | static Charset[] | getAvailableCharsets() Retrieves all the available Charsets on the platform, including the default charset. |
|  | Charset | getCharset() |
|  | Charset | getDefaultCharset() Retrieves the default Charset. |
|  | static Charset | getDefaultSystemCharset() Retrieves the default charset of the system. |
|  | boolean | getEnforce8Bit() Gets the enforce8Bit flag, in case we never want to get a US-ASCII encoding. |
|  | BufferedReader | getReader() Gets a BufferedReader (in fact a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered. |
|  | boolean | hasUTF16BEBom() Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2). |
|  | boolean | hasUTF16LEBom() Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
|  | boolean | hasUTF8Bom() Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors). |
|  | void | setDefaultCharset(Charset defaultCharset) Defines the default Charset used in case the buffer represents an 8-bit Charset. |
|  | void | setEnforce8Bit(boolean enforce) If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. |

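
As a rough illustration of what getAvailableCharsets() and getDefaultSystemCharset() describe, the JDK exposes the same information through `java.nio.charset.Charset`. The sketch below assumes those JDK calls are a reasonable stand-in; `CharsetListing` is a hypothetical class, and the toolkit's own implementation may differ.

```java
import java.nio.charset.Charset;
import java.util.Collection;

/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class CharsetListing {
    private CharsetListing() {}

    /** All charsets the running JVM supports, as an array. */
    static Charset[] availableCharsets() {
        // availableCharsets() returns a map from canonical name to Charset.
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    /** The JVM's default charset. */
    static Charset systemDefault() {
        return Charset.defaultCharset();
    }
}
```

Calling `CharsetListing.availableCharsets()` on any compliant JVM yields an array that contains at least the standard charsets such as UTF-8 and US-ASCII.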
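CharsetToolkit(File file)
Constructor of the CharsetToolkit utility class.
Parameters: file - the file whose encoding we want to know.

static Charset[] getAvailableCharsets()
Retrieves all the available Charsets on the platform, including the default charset.
Returns: the available Charsets.

Charset getDefaultCharset()
Retrieves the default Charset.

static Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
Returns: the default Charset.

boolean getEnforce8Bit()
Gets the enforce8Bit flag, in case we never want to get a US-ASCII encoding.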
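
The effect of the enforce8Bit flag can be sketched as a single decision. This is an assumption about the behaviour based only on the descriptions on this page, not the toolkit's source; `Enforce8BitSketch.resolve` and its parameters are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class Enforce8BitSketch {
    private Enforce8BitSketch() {}

    /**
     * Returns the default charset when the guess is US-ASCII and the flag is set;
     * otherwise returns the guess unchanged.
     */
    static Charset resolve(Charset guessed, Charset defaultCharset, boolean enforce8Bit) {
        if (enforce8Bit && StandardCharsets.US_ASCII.equals(guessed)) {
            return defaultCharset;
        }
        return guessed;
    }
}
```

With the flag set, a file that currently contains only 7-bit characters is still opened with the default charset, so it keeps working if characters above 127 are added later.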

BufferedReader getReader()
Gets a BufferedReader (in fact a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
Returns: a BufferedReader.
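
For readers who want to see how a charset-aware LineNumberReader is typically built with plain JDK classes, here is a minimal sketch. It assumes the charset has already been chosen; `ReaderSketch.open` is a hypothetical helper and may not match what getReader() does internally (for example, it does not skip a leading BOM).

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.Charset;

/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class ReaderSketch {
    private ReaderSketch() {}

    /** Opens the file with an explicit charset; the caller must close the reader. */
    static BufferedReader open(File file, Charset charset) throws IOException {
        // LineNumberReader extends BufferedReader, so it can be returned as one.
        return new LineNumberReader(
                new InputStreamReader(new FileInputStream(file), charset));
    }
}
```

Wrapping the call in a try-with-resources statement, as in `try (BufferedReader r = ReaderSketch.open(file, charset)) { ... }`, ensures the underlying stream is closed.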

boolean hasUTF16BEBom()
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

boolean hasUTF16LEBom()
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

boolean hasUTF8Bom()
Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors).
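
The three BOM checks come down to comparing the first bytes of the buffer against fixed signatures: EF BB BF for UTF-8, FE FF for UTF-16 big endian, and FF FE for UTF-16 little endian. The sketch below shows that comparison on a byte array; `BomSketch` and its method names are hypothetical and only mirror the methods documented here.

```java
/** Hypothetical helper for illustration; not part of CharsetToolkit. */
final class BomSketch {
    private BomSketch() {}

    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        // Mask with 0xFF because Java bytes are signed.
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 big endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    }

    /** UTF-16 little endian BOM: FF FE. */
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE;
    }
}
```

Note that the UTF-32 little endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 little endian one, so a two-byte check alone cannot tell them apart.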

void setDefaultCharset(Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
Parameters: defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.

void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. The file might contain no special characters in the range 128-255, yet still be, or later become, a file encoded with the default charset rather than US-ASCII.
Parameters: enforce - a boolean specifying whether or not US-ASCII is used.

Copyright © 2003-2020 The Apache Software Foundation. All rights reserved.