Package groovy.util

Class CharsetToolkit


  • public class CharsetToolkit
    extends Object
    Utility class to guess the encoding of a given text file.

    Unicode files encoded in UTF-16 (little or big endian), and UTF-8 files with a byte order mark (BOM), are detected reliably. For UTF-8 files with no BOM, the charset should also be detected, provided the buffer is large enough.

    A 4 KB byte buffer is used to guess the encoding.
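To illustrate how a BOM-less UTF-8 buffer can be recognized at all, here is a minimal sketch using only the JDK. It is not the code of CharsetToolkit: the class name Utf8Sniffer and the method looksLikeUtf8 are hypothetical, and the approach (a strict decoder that rejects malformed input) is one common technique, assumed here for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.
class Utf8Sniffer {

    /** Returns true if the whole buffer decodes as well-formed UTF-8. */
    static boolean looksLikeUtf8(byte[] buffer) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(buffer));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));
        System.out.println(looksLikeUtf8(latin1));
    }
}
```

Note that a pure ASCII buffer is also well-formed UTF-8, so this check alone cannot tell the two apart, and a buffer cut off in the middle of a multi-byte character is rejected by this sketch. Both are reasons why a larger buffer gives a more trustworthy guess.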

    Usage:

     CharsetToolkit toolkit = new CharsetToolkit(file);
    
     // guess the encoding
     Charset guessedCharset = toolkit.getCharset();
    
     // create a reader with the correct charset
     BufferedReader reader = toolkit.getReader();
    
     // read the file content
     String line;
     while ((line = reader.readLine()) != null)
     {
         System.out.println(line);
     }
     
    • Constructor Detail

      • CharsetToolkit

        public CharsetToolkit​(File file)
                       throws IOException
        Constructor of the CharsetToolkit utility class.
        Parameters:
        file - the file whose encoding is to be guessed.
        Throws:
        IOException
    • Method Detail

      • setDefaultCharset

        public void setDefaultCharset​(Charset defaultCharset)
        Defines the default Charset used in case the buffer represents an 8-bit Charset.
        Parameters:
        defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.
      • getCharset

        public Charset getCharset()
      • setEnforce8Bit

        public void setEnforce8Bit​(boolean enforce)
        If US-ASCII is recognized, forces the default encoding to be returned instead of US-ASCII. Such a file may simply contain no special characters in the range 128-255 yet, while still being (or later becoming) a file encoded with the default charset rather than US-ASCII.
        Parameters:
        enforce - true to return the default charset instead of US-ASCII; false to allow US-ASCII to be returned.
      • getEnforce8Bit

        public boolean getEnforce8Bit()
        Gets the enforce8Bit flag, which is set when a US-ASCII encoding should never be returned.
        Returns:
        true if the default charset is returned in place of US-ASCII.
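The effect of the enforce8Bit flag can be sketched as a small decision function. This is a simplified illustration, not the toolkit's implementation: the class AsciiFallback and its methods are hypothetical, and the sketch ignores the UTF-8 and UTF-16 checks that the real class performs on non-ASCII input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.
class AsciiFallback {

    /** True if no byte is in the range 128-255. */
    static boolean isPureAscii(byte[] buffer) {
        for (byte b : buffer) {
            if (b < 0) {
                return false; // values 128-255 are negative as signed Java bytes
            }
        }
        return true;
    }

    /**
     * Returns US-ASCII only when the buffer is pure ASCII and the flag is off.
     * Simplification: every other case falls back to the default charset.
     */
    static Charset choose(byte[] buffer, boolean enforce8Bit, Charset defaultCharset) {
        if (isPureAscii(buffer) && !enforce8Bit) {
            return StandardCharsets.US_ASCII;
        }
        return defaultCharset;
    }

    public static void main(String[] args) {
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(choose(ascii, false, fallback));
        System.out.println(choose(ascii, true, fallback));
    }
}
```

With the flag on, a file that happens to contain only ASCII today is still reported with the default charset, so later edits that add accented characters do not change the reported encoding.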
      • getDefaultCharset

        public Charset getDefaultCharset()
        Retrieves the default Charset.
      • getDefaultSystemCharset

        public static Charset getDefaultSystemCharset()
        Retrieves the default charset of the system.
        Returns:
        the default Charset.
      • hasUTF8Bom

        public boolean hasUTF8Bom()
        Checks whether the buffer starts with a byte order mark (BOM) for UTF-8 (used by Microsoft's Notepad and other editors).
        Returns:
        true if the buffer has a BOM for UTF-8.
      • hasUTF16LEBom

        public boolean hasUTF16LEBom()
        Checks whether the buffer starts with a byte order mark (BOM) for UTF-16 little endian (ucs-2le, ucs-4le, and ucs-16le).
        Returns:
        true if the buffer has a BOM for UTF-16 little endian.
      • hasUTF16BEBom

        public boolean hasUTF16BEBom()
        Checks whether the buffer starts with a byte order mark (BOM) for UTF-16 big endian (utf-16 and ucs-2).
        Returns:
        true if the buffer has a BOM for UTF-16 Big Endian.
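For reference, the three BOM checks above come down to comparing the first bytes of the buffer against fixed signatures: EF BB BF for UTF-8, FF FE for UTF-16 little endian, and FE FF for UTF-16 big endian. The following is a minimal sketch of that comparison; the class BomSniffer and its method names are hypothetical and are not the toolkit's own code.

```java
// Hypothetical helper, not part of groovy.util.
class BomSniffer {

    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 little endian BOM: FF FE. */
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFF
                && (b[1] & 0xFF) == 0xFE;
    }

    /** UTF-16 big endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFE
                && (b[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        System.out.println("UTF-8 BOM:    " + hasUtf8Bom(sample));
        System.out.println("UTF-16LE BOM: " + hasUtf16LeBom(sample));
        System.out.println("UTF-16BE BOM: " + hasUtf16BeBom(sample));
    }
}
```

The `& 0xFF` mask is needed because Java bytes are signed, so a raw comparison of `b[0]` with `0xEF` would never be true.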
      • getReader

        public BufferedReader getReader()
                                 throws FileNotFoundException
        Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
        Returns:
        a BufferedReader
        Throws:
        FileNotFoundException - if the file is not found.
      • getAvailableCharsets

        public static Charset[] getAvailableCharsets()
        Retrieves all the Charsets available on the platform, including the default charset.
        Returns:
        an array of Charsets.
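A comparable listing can be produced with the JDK alone, which may help when checking what such an array is expected to contain. This is a sketch under the assumption that the standard Charset.availableCharsets() map is an adequate stand-in. It is not the toolkit's implementation, and the class CharsetListing is hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not part of groovy.util.
class CharsetListing {

    /** Copies the JDK's map of supported charsets into an array. */
    static Charset[] availableCharsets() {
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    public static void main(String[] args) {
        Charset[] charsets = availableCharsets();
        System.out.println("Available charsets: " + charsets.length);
        System.out.println("Includes the default ("
                + Charset.defaultCharset() + "): "
                + Arrays.asList(charsets).contains(Charset.defaultCharset()));
    }
}
```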