26.6 External Formats to translate Lisp characters from/to external encodings

External formats are two-way translations from Lisp's internal encoding to an external encoding. They can be used in file I/O, and in passing and receiving string data in foreign function calls.

An external format is named in LispWorks by an external format specification (ef-spec). An ef-spec is a symbol naming the external format, or a list with such a name as its first element followed by parameter/value pairs.

26.6.1 External format names

LispWorks has a number of predefined external formats:

win32:code-page

The Windows code page with identifier given by the :id parameter. Implemented only on Windows.

:latin-1
ISO8859-1.
:latin-1-terminal

As Latin-1, except that if a non-Latin-1 character is output, it is written as <xxxx>, where xxxx is the hexadecimal character code, and no error is signaled.

:latin-1-safe

As Latin-1, except that if a non-Latin-1 character is output, it is written as ? and no error is signaled.

:macos-roman

The Mac OS Roman encoding.

:ascii
ASCII.

:unicode

The :utf-16 format with default native byte order. See 26.6.2 16-bit External formats guide for details and variants.

Compatibility note: In LispWorks 6.1 and earlier versions, :unicode reads and writes only 16-bit characters (raw UCS-2).

:utf-8
The UTF-8 encoding of Unicode.
:utf-16
The UTF-16 encoding of Unicode with big-endian byte order. See 26.6.2 16-bit External formats guide for details and variants.

:utf-32
The UTF-32 encoding of Unicode with big-endian byte order.

Note: There is a :utf-32 external format corresponding to each of the :utf-16 variants.

:bmp
Reads and writes 16-bit characters with native byte order. See 26.6.2 16-bit External formats guide for details and variants.
:jis

JIS. The encoding data is read from a file Uni2JIS and is pre-built into LispWorks.

Note: Uni2JIS is provided by way of documentation in the directory lib/8-0-0-0/etc/. It is also used at run time by the function cl:char-name.

:euc-jp
EUC-JP. The encoding data is read from a file Uni2JIS and is pre-built into LispWorks.
:sjis
Shift JIS.
:windows-cp936

Windows code page 936. The encoding data is read from a file windows-936-2000.ucm and is pre-built into LispWorks.

Note: windows-936-2000.ucm is provided by way of documentation in the directory lib/8-0-0-0/etc/. It is not read at run time.

:gbk
A synonym for :windows-cp936.
:gb18030
GB18030-2005 character encoding.
:koi8-r
The KOI8-R (RFC 1489) encoding.
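
The substitution performed by :latin-1-safe in the list above can be pictured on an ordinary Lisp string. The following is a minimal portable sketch, not LispWorks code: the function name latin-1-safe-string is hypothetical, and the real external format applies the substitution while encoding the stream rather than on a string.

```lisp
;; Hypothetical helper, not a LispWorks function. It mimics what
;; :latin-1-safe does to characters outside Latin-1 on output.
(defun latin-1-safe-string (string)
  "Return a copy of STRING with every non-Latin-1 character replaced by ?."
  (map 'string
       (lambda (char)
         ;; Latin-1 covers character codes 0 to 255.
         (if (< (char-code char) 256) char #\?))
       string))
```

With plain :latin-1 the same output would signal an error instead of writing ?.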

26.6.2 16-bit External formats guide

LispWorks has several external formats that generate 16-bit encodings as documented below.

26.6.2.1 Unicode

The :unicode format maps to :utf-16 with the native endianness (by default). Note that :unicode differs from :utf-16 by the default byte order that it uses: :utf-16 defaults to big-endian (matching the Unicode standard), while :unicode defaults to the native byte order.

Compatibility note: In LispWorks 6.1 and earlier versions the external format :unicode is actually "raw UCS-2", that is, it reads and writes only 16-bit characters. That format treats surrogate code points (#xd800 to #xdfff) as if they are actual characters, whereas in LispWorks 7.0 and later :utf-16 (and hence :unicode) interprets them as encoding the supplementary characters (codes #x10000 to #x10ffff). The latter behavior is probably what you need, so in most cases there is no need to replace usage of :unicode. There is no external format that interprets surrogate code points as characters in LispWorks 7.0 and later, but you can use any of the :bmp formats with :use-replacement t to read 16-bit characters without signaling an error. The result does not exactly match the input, because surrogate code points are replaced by the replacement character. The only format that can read anything without any loss is :latin-1.
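
The relationship between surrogate code points and supplementary characters in the compatibility note is the standard UTF-16 arithmetic. A minimal portable sketch follows; the function name utf-16-code-units is hypothetical and the code does not use any LispWorks API.

```lisp
;; Hypothetical helper illustrating standard UTF-16 encoding arithmetic.
(defun utf-16-code-units (code-point)
  "Return the list of 16-bit code units that encode CODE-POINT in UTF-16."
  (if (< code-point #x10000)
      ;; A BMP code point is stored as a single 16-bit unit.
      (list code-point)
      ;; A supplementary code point is split into a high and a low
      ;; surrogate, each carrying 10 bits of the offset from #x10000.
      (let ((offset (- code-point #x10000)))
        (list (+ #xd800 (ash offset -10))
              (+ #xdc00 (logand offset #x3ff))))))
```

For example, (utf-16-code-units #x1f600) returns the two surrogates #xd83d and #xde00, which a raw UCS-2 reader would treat as two separate 16-bit characters.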

26.6.2.2 UTF-16

There are several UTF-16 external formats. There is more than one because UTF-16 is actually two different encodings: UTF-16 big-endian and UTF-16 little-endian.

:utf-16-native and :utf-16-reversed are the actual implementation formats. They implement UTF-16 with the native byte order (:utf-16-native) or the reversed byte order (:utf-16-reversed).

:utf-16be and :utf-16le implement the big-endian (:utf-16be) and little-endian (:utf-16le) UTF-16. The system maps these formats to :utf-16-native or :utf-16-reversed as appropriate, depending on the byte order of the computer.

:utf-16 implements the UTF-16 standard, defaulting to UTF-16BE unless there is a BOM (Byte Order Mark).

In general, you will need to decide which of these to use depending on the circumstances.
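
The rule stated for :utf-16 (big-endian unless a BOM says otherwise) can be sketched as follows. This is an illustration under assumptions, not the LispWorks implementation: the function name is hypothetical, and the input is assumed to be a vector of octets taken from the start of the data.

```lisp
;; Hypothetical helper: choose a UTF-16 byte order from leading octets.
(defun utf-16-byte-order (octets)
  "Return :big-endian or :little-endian for OCTETS, a vector of integers 0-255."
  (if (and (>= (length octets) 2)
           (= (aref octets 0) #xff)
           (= (aref octets 1) #xfe))
      ;; The BOM U+FEFF stored low byte first.
      :little-endian
      ;; Either a big-endian BOM (FE FF) or no BOM at all.
      :big-endian))
```

When the byte order is known in advance, :utf-16be or :utf-16le states it directly and no such inspection is needed.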

26.6.2.3 BMP

BMP stands for Basic Multilingual Plane in Unicode and there are a few BMP external formats, which read and write only 16-bit characters (characters in the range 0 to #xffff, excluding the surrogate range #xd800 to #xdfff).

:bmp-native and :bmp-reversed are the actual implementation formats. They implement reading 16-bit characters with the native byte order (:bmp-native) or the reversed byte order (:bmp-reversed). These formats never read supplementary characters. When they encounter a surrogate code point, they either signal an error or replace it by the replacement character, depending on the parameter :use-replacement.

:bmp implements 16-bit character reading and writing, defaulting to the native byte order.
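
The effect of the :use-replacement parameter on reading can be sketched on a sequence of already byte-swapped 16-bit units. The function below is hypothetical and portable. It assumes that the replacement character is U+FFFD, which the text above does not state, and it returns code points rather than characters.

```lisp
;; Hypothetical helper, not a LispWorks function.
(defun decode-bmp-units (units &key use-replacement)
  "Return the list of code points read from UNITS, a sequence of 16-bit integers."
  (map 'list
       (lambda (unit)
         (cond ((not (<= #xd800 unit #xdfff)) unit)
               ;; Assumption: the replacement character is U+FFFD.
               (use-replacement #xfffd)
               (t (error "Surrogate code point #x~X in BMP input." unit))))
       units))
```

Called with :use-replacement t, the function never signals an error, but the surrogate units of the input are lost.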

Notes: In LispWorks 6.1 and earlier versions, the :unicode external format is similar to :bmp now, but handles surrogate code points as if they represent characters. In LispWorks 7.0 and later, :unicode maps to :utf-16, and there is no external format that reads surrogate code points as characters.

26.6.3 External Formats and File Streams

The :external-format argument of open and related functions should be an ef-spec, where the name can be :default. The symbol :default is the default value.

If you know the format of the data when doing file I/O, you should definitely specify external-format explicitly, in the ef-spec syntax described in this section.

26.6.3.1 Complete external format ef-specs

An ef-spec is "complete" if and only if the name is not :default and the parameters include :eol-style.

All external formats have an :eol-style parameter. If eol-style is not explicit in an ef-spec a default is used. The allowed values are:

:lf
This is the default on non-Windows systems, meaning that lines are terminated by Linefeed.
:crlf
This is the default on Windows, meaning that lines are terminated by Carriage-Return followed by Linefeed.
:cr
Lines are terminated by Carriage-Return.
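
The definition of a complete ef-spec can be written down as a small predicate. This is a portable sketch with hypothetical function names that are not part of LispWorks. It assumes that an ef-spec is either a symbol or a list whose tail is a property list.

```lisp
;; Hypothetical helpers illustrating the definition of "complete".
(defun ef-spec-name (ef-spec)
  "Return the name of EF-SPEC, which is a symbol or a list starting with one."
  (if (consp ef-spec) (first ef-spec) ef-spec))

(defun ef-spec-complete-p (ef-spec)
  "Return T if EF-SPEC names a format other than :default and gives :eol-style."
  (and (not (eq (ef-spec-name ef-spec) :default))
       (consp ef-spec)
       ;; Every allowed :eol-style value is non-nil, so GETF suffices.
       (getf (rest ef-spec) :eol-style)
       t))
```

Under this definition '(:ascii :eol-style :lf) is complete, while :utf-8, '(:utf-8) and '(:default :eol-style :lf) are not.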

26.6.3.2 Using complete external formats

If open or with-open-file gets a complete :external-format argument, then it is used as is. For example, this form opens an ASCII linefeed-terminated stream:

(with-open-file (ss "C:/temp/ascii-lf"
                    :direction :output 
                    :external-format 
                    '(:ascii :eol-style :lf)) 
  (stream-external-format ss))
=>
(:ASCII :EOL-STYLE :LF)

If you know the encoding of a file you are opening, then you should pass the appropriate :external-format argument.

26.6.3.3 Guessing the external format

If open or with-open-file gets a non-complete :external-format argument ef-spec then the system decides which external format to use by calling the function guess-external-format.

The default behavior of guess-external-format is as follows:

  1. When ef-spec's name is :default, it finds a match based on the filename; or (if that fails) looks in the Emacs-style (-*-) attribute line for an option called ENCODING or EXTERNAL-FORMAT or CODING; or (if that fails) chooses from amongst likely encodings by analyzing the bytes near the start of the file; or (if that fails) uses a default encoding. Otherwise ef-spec's name is assumed to name an encoding and this encoding is used.
  2. When ef-spec does not include the :eol-style parameter, it then also analyzes the start of the file for byte patterns indicating the end-of-line style, and uses a default end-of-line style if no such pattern is found.
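
The kind of byte-pattern analysis described in step 2 can be pictured with the sketch below. It is not the algorithm that LispWorks uses: the function name is hypothetical, the input is assumed to be a vector of octets in a single-byte or UTF-8 encoding, and only the first line terminator is inspected.

```lisp
;; Hypothetical sketch of end-of-line detection, not LispWorks code.
(defun sketch-guess-eol-style (octets &optional (default :lf))
  "Guess :lf, :crlf or :cr from the first line terminator in OCTETS."
  (let ((pos (position-if (lambda (octet)
                            (or (= octet 13) (= octet 10)))
                          octets)))
    (cond ((null pos) default)            ; no terminator found
          ((= (aref octets pos) 10) :lf)  ; bare Linefeed
          ((and (< (1+ pos) (length octets))
                (= (aref octets (1+ pos)) 10))
           :crlf)                         ; Carriage-Return then Linefeed
          (t :cr))))                      ; bare Carriage-Return
```

The default argument plays the role of the default end-of-line style that is used when no pattern is found.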

The file in this example was written by a Windows program which writes the Byte Order Mark at the start of the file, indicating that it is Unicode encoded. The routine in step 1 above detects this:

(set-default-character-element-type 'character)
=>
CHARACTER
 
(with-open-file (ss "C:/temp/unicode-notepad.txt") 
  (stream-external-format ss))
=>
(:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)

The behavior of guess-external-format is configurable via the variables *file-encoding-detection-algorithm* and *file-eol-style-detection-algorithm*. See the manual pages for details.

26.6.3.4 Example of using UTF-8 by default

To change the default for all file access via open, compile-file and so on, you can modify the value of *file-encoding-detection-algorithm*.

For example given the following definition:

(defun utf-8-file-encoding (pathname ef-spec buffer length)
  (declare (ignore pathname buffer length))
  (system:merge-ef-specs ef-spec :utf-8))

then this makes it use UTF-8 as a fallback:

(setq system:*file-encoding-detection-algorithm*
      (substitute 'utf-8-file-encoding
                  'system:locale-file-encoding
                  system:*file-encoding-detection-algorithm*))

and this forces it to always use UTF-8:

(setq system:*file-encoding-detection-algorithm*
      '(utf-8-file-encoding))

26.6.3.5 Example of using UTF-8 if possible

The example in 26.6.3.4 Example of using UTF-8 by default will use UTF-8 even if the file contains bytes that cannot be in this encoding. As an alternative way to use UTF-8 when possible, you can modify the value of *specific-valid-file-encodings*.

For example, the following will cause LispWorks to use UTF-8 if the file begins with valid UTF-8 bytes:

(pushnew :utf-8 system:*specific-valid-file-encodings*)
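
What "begins with valid UTF-8 bytes" means can be pictured with a structural check such as the one below. This is a simplified portable sketch under stated assumptions, not the check that LispWorks performs: the function name is hypothetical, overlong forms and encoded surrogates are not rejected, and a sequence truncated at the end of the buffer counts as invalid.

```lisp
;; Hypothetical, simplified validity check, not LispWorks code.
(defun plausible-utf-8-p (octets)
  "Return T if OCTETS, a vector of integers 0-255, is structurally valid UTF-8."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    ;; Number of continuation bytes implied by the lead byte.
                    (extra (cond ((< lead #x80) 0)
                                 ((<= #xc2 lead #xdf) 1)
                                 ((<= #xe0 lead #xef) 2)
                                 ((<= #xf0 lead #xf4) 3)
                                 (t (return-from plausible-utf-8-p nil)))))
               ;; The whole sequence must fit in the buffer.
               (when (>= (+ i extra) n)
                 (return-from plausible-utf-8-p nil))
               ;; Each continuation byte must have the form 10xxxxxx.
               (loop for j from 1 to extra
                     unless (= (logand (aref octets (+ i j)) #xc0) #x80)
                       do (return-from plausible-utf-8-p nil))
               (incf i (1+ extra))))
    t))
```

A Latin-1 file containing an accented letter followed by an ASCII letter, such as the octets #xe9 #x74, fails this check, so a fallback encoding would be chosen for it.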

26.6.3.6 External formats and stream-element-type

The :element-type argument in open and with-open-file defaults to the value of *default-character-element-type*.

If element-type is not :default, checks are made to ensure that the resulting stream's stream-element-type is compatible with its external format:

  1. If direction is :input or :io, the element-type argument must be a supertype of the type of characters produced by the external format.
  2. If direction is :output or :io, the element-type argument must be a subtype of the type of characters accepted by the external format.

If the element-type argument does not satisfy these requirements, an error is signaled.
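
The two rules above amount to subtype tests in opposite directions. The sketch below expresses them with cl:subtypep. The function name is hypothetical, and it simplifies by assuming that an external format produces and accepts the same character type, which need not hold for every format.

```lisp
;; Hypothetical helper illustrating the two compatibility rules.
(defun element-type-compatible-p (element-type format-char-type direction)
  "Return T if ELEMENT-TYPE suits a format handling FORMAT-CHAR-TYPE."
  (and (or (not (member direction '(:input :io)))
           ;; Rule 1: ELEMENT-TYPE must be a supertype on input.
           (subtypep format-char-type element-type))
       (or (not (member direction '(:output :io)))
           ;; Rule 2: ELEMENT-TYPE must be a subtype on output.
           (subtypep element-type format-char-type))
       t))
```

For direction :io both tests apply, so the two types must be equivalent for the check to pass.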

If element-type is :default the system chooses the stream-element-type on the basis of the external format.

26.6.3.7 External formats and the LispWorks Editor

The LispWorks Editor uses open with :element-type :default to read and write files. On reading a file, the external format is remembered and used when saving the file. On writing a Unicode (UTF-16) file, the Byte Order Mark is written.

It is possible to insert characters in the Editor (for example by pasting clipboard text) which are not supported by the chosen external format. This will lead to errors when you attempt to save the buffer. You can handle this by setting the external format appropriately.

See the Editor User Guide for more details.

26.6.3.8 Byte Order Mark

The Unicode Byte Order Mark (BOM) is treated as whitespace in the default readtable. This allows the Lisp reader to read a 16-bit (UTF-16 or BMP encoded) file regardless of whether the BOM is present. See 26.6.2 16-bit External formats guide for more information.

Some editors including Microsoft Notepad and the LispWorks editor write the BOM when writing a file with 16-bit (UTF-16 or BMP) encoding.

26.6.4 External Formats and the Foreign Language Interface

External formats can be used to pass and receive string data via the FLI. See the section on string types in the Foreign Language Interface User Guide and Reference Manual.


LispWorks® User Guide and Reference Manual - 01 Dec 2021 19:30:24