

26.8 External Formats and File Streams

The :external-format argument of open and related functions should be an ef-spec, whose name can be :default. If the argument is omitted, it defaults to the symbol :default.

If you know the format of the data when doing file I/O, specify :external-format explicitly, using the ef-spec syntax described in this section.

26.8.1 Complete external format ef-specs

An ef-spec is "complete" if and only if the name is not :default and the parameters include :eol-style.

All external formats have an :eol-style parameter. If :eol-style is not given explicitly in an ef-spec, a default is used. The allowed values are:

:lf

This is the default on non-Windows systems, meaning that lines are terminated by Linefeed.

:crlf

This is the default on Windows, meaning that lines are terminated by Carriage-Return followed by Linefeed.

:cr

Lines are terminated by Carriage-Return.
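
As a rough illustration of the three styles, the following sketch writes the same two lines with a given :eol-style and returns the raw bytes of the resulting file. The function name eol-bytes and the file name used below are hypothetical, and the sketch assumes that a binary stream opened with :element-type '(unsigned-byte 8) returns the file's bytes untranslated.

```lisp
(defun eol-bytes (path eol-style)
  "Write the lines a and b to PATH as ASCII using EOL-STYLE, then
return the bytes of the file as a list of integers."
  (with-open-file (out path
                       :direction :output
                       :if-exists :supersede
                       ;; A complete ef-spec: explicit name and :eol-style.
                       :external-format (list :ascii :eol-style eol-style))
    (write-line "a" out)
    (write-line "b" out))
  ;; Read the file back as raw bytes so the line terminators are visible.
  ;; Linefeed is byte 10 and Carriage-Return is byte 13.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for byte = (read-byte in nil nil)
          while byte
          collect byte)))
```

For instance, (eol-bytes "C:/temp/eol-demo.txt" :crlf) should show each of the two characters followed by the byte pair 13 10, whereas :lf and :cr should show a single terminator byte after each character.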

26.8.2 Using complete external formats

If open or with-open-file gets a complete :external-format argument, then it is used as is. For example, this form opens an ASCII linefeed-terminated stream:

(with-open-file (ss "C:/temp/ascii-lf"
                    :direction :output 
                    :external-format 
                    '(:ascii :eol-style :lf)) 
  (stream-external-format ss))
=>
(:ASCII :EOL-STYLE :LF)

If you know the encoding of a file you are opening, then you should pass the appropriate :external-format argument.

26.8.3 Guessing the external format

If open or with-open-file gets a non-complete :external-format argument ef-spec then the system decides which external format to use by calling the function guess-external-format.

The default behavior of guess-external-format is as follows:

  1. When ef-spec's name is :default, this tries each of the following in turn until one succeeds: it finds a match based on the filename; it looks in the Emacs-style (-*-) attribute line for an option called ENCODING, EXTERNAL-FORMAT or CODING; it chooses from among likely encodings by analyzing the bytes near the start of the file; finally, it uses a default encoding. When ef-spec's name is not :default, the name is assumed to name an encoding and this encoding is used.
  2. When ef-spec does not include the :eol-style parameter, it then also analyzes the start of the file for byte patterns indicating the end-of-line style, and uses a default end-of-line style if no such pattern is found.

The file in this example was written by a Windows program which writes the Byte Order Mark at the start of the file, indicating that it is Unicode encoded. The routine in step 1 above detects this:

(set-default-character-element-type 'character)
=>
CHARACTER
 
(with-open-file (ss "C:/temp/unicode-notepad.txt") 
  (stream-external-format ss))
=>
(:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)

The behavior of guess-external-format is configurable via the variables *file-encoding-detection-algorithm* and *file-eol-style-detection-algorithm*. See the manual pages for details.
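
To see what the guessing steps above settled on for a particular file, one option is to open it with an incomplete ef-spec and inspect the resulting stream. The following is a minimal sketch: the helper name guessed-external-format is hypothetical, and it passes :element-type :default on the assumption that this avoids element type errors for whichever encoding is chosen.

```lisp
(defun guessed-external-format (path &optional (ef-spec :default))
  "Open PATH for input with the possibly incomplete EF-SPEC and return
the complete external format of the resulting stream."
  (with-open-file (in path
                      ;; Let the external format determine the
                      ;; stream-element-type (see 26.8.4).
                      :element-type :default
                      :external-format ef-spec)
    (stream-external-format in)))

;; Possible calls (the file name is the one from the example above):
;;
;; Guess both the encoding and the end-of-line style:
;;   (guessed-external-format "C:/temp/unicode-notepad.txt")
;;
;; Fix the end-of-line style and guess only the encoding:
;;   (guessed-external-format "C:/temp/unicode-notepad.txt"
;;                            '(:default :eol-style :crlf))
```

Supplying only an encoding name, such as :ascii, leaves just the end-of-line style to be guessed.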

26.8.3.1 Example of using UTF-8 by default

To change the default for all file access via open, compile-file and so on, you can modify the value of *file-encoding-detection-algorithm*.

For example, given the following definition:

(defun utf-8-file-encoding (pathname ef-spec buffer length)
  (declare (ignore pathname buffer length))
  (system:merge-ef-specs ef-spec :utf-8))

then this makes the system use UTF-8 as a fallback:

(setq system:*file-encoding-detection-algorithm*
      (substitute 'utf-8-file-encoding
                  'system:locale-file-encoding
                  system:*file-encoding-detection-algorithm*))

and this forces it to always use UTF-8:

(setq system:*file-encoding-detection-algorithm*
      '(utf-8-file-encoding))
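
Both forms above change the setting globally. If the change is wanted only around particular file operations, one possible approach is to rebind the variable for the duration of a call. This is a sketch only: the helper call-with-utf-8-fallback is hypothetical, and it assumes that system:*file-encoding-detection-algorithm* is an ordinary special variable that can be rebound with let.

```lisp
(defun utf-8-file-encoding (pathname ef-spec buffer length)
  ;; Same definition as in the text above, repeated so that this
  ;; sketch is self-contained.
  (declare (ignore pathname buffer length))
  (system:merge-ef-specs ef-spec :utf-8))

(defun call-with-utf-8-fallback (function)
  "Call FUNCTION with UTF-8 substituted for the locale-based fallback.
The previous value is restored when the call returns or exits."
  (let ((system:*file-encoding-detection-algorithm*
          (substitute 'utf-8-file-encoding
                      'system:locale-file-encoding
                      system:*file-encoding-detection-algorithm*)))
    (funcall function)))

;; Possible use, with a hypothetical file name:
;;
;;   (call-with-utf-8-fallback
;;    (lambda ()
;;      (with-open-file (in "C:/temp/notes.txt" :element-type :default)
;;        (read-line in nil))))
```

Because the binding is dynamic, file operations outside the call keep using the original detection algorithm.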

26.8.3.2 Example of using UTF-8 if possible

The example in 26.8.3.1 Example of using UTF-8 by default uses UTF-8 even if the file is known to contain bytes that are not valid in this encoding. As an alternative that uses UTF-8 only when possible, you can modify the value of *specific-valid-file-encodings*.

For example:

(pushnew :utf-8 system:*specific-valid-file-encodings*)

26.8.4 External formats and stream-element-type

The :element-type argument in open and with-open-file defaults to the value of *default-character-element-type*.

If element-type is not :default, checks are made to ensure that the resulting stream's stream-element-type is compatible with its external format:

  1. If direction is :input or :io, the element-type argument must be a supertype of the type of characters produced by the external format.
  2. If direction is :output or :io, the element-type argument must be a subtype of the type of characters accepted by the external format.

If the element-type argument does not satisfy these requirements, an error is signaled.

If element-type is :default the system chooses the stream-element-type on the basis of the external format.
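
The rules above can be explored with a small sketch like the following. Both helper names are hypothetical. The second helper treats any error raised while opening the file as a failed check, which is a simplification: a missing file would be reported in the same way.

```lisp
(defun element-type-for (path ef-spec)
  "Return the stream-element-type the system chooses for an input
stream on PATH when :element-type is :default."
  (with-open-file (in path
                      :element-type :default
                      :external-format ef-spec)
    (stream-element-type in)))

(defun open-input-ok-p (path ef-spec element-type)
  "Return true if PATH can be opened for input with ELEMENT-TYPE and
EF-SPEC, or NIL if opening signals an error."
  (handler-case
      (with-open-file (in path
                          :element-type element-type
                          :external-format ef-spec)
        (declare (ignorable in))
        t)
    ;; Simplification: any error counts as failure, not only an
    ;; element type mismatch.
    (error () nil)))
```

For example, requesting 'base-char for an input stream whose external format is '(:unicode :eol-style :lf) would be expected to fail under rule 1, because base-char is not a supertype of the characters a Unicode format can produce, whereas requesting 'character should succeed.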

26.8.5 External formats and the LispWorks Editor

The LispWorks Editor uses open with :element-type :default to read and write files. On reading a file, the external format is remembered and used when saving the file. On writing a Unicode (UTF-16) file, the Byte Order Mark is written.

It is possible to insert characters in the Editor (for example, by pasting clipboard text) that are not supported by the chosen external format. This leads to an error when you attempt to save the buffer. You can handle this by setting the external format appropriately.

See the LispWorks Editor User Guide for more details.

26.8.6 Byte Order Mark

The Unicode Byte Order Mark (BOM) is treated as whitespace in the default readtable. This allows the Lisp reader to read a 16-bit (UTF-16 or BMP encoded) file regardless of whether the BOM is present. See 16-bit External formats guide for more information.

Some editors, including Microsoft Notepad and the LispWorks Editor, write the BOM when writing a file with 16-bit (UTF-16 or BMP) encoding.


LispWorks User Guide and Reference Manual - 20 Sep 2017
