All Manuals > LispWorks User Guide and Reference Manual > 26 Internationalization: characters, strings and encodings


26.7 16-bit External formats guide

26.7.1 Unicode

In LispWorks 6.1 and earlier versions the external format :unicode is actually "raw UCS-2", that is reading and writing only 16-bit characters (including character objects corresponding to surrogate code points). The :unicode format now maps to :utf-16 with the native endianness (by default). This interprets surrogate code points (#xd800 to #xdfff) differently: the old :unicode would read these as if they are actual characters, while :utf-16 and hence the new :unicode will try to interpret them as encoding supplementary characters (codes #x10000 to #x10ffff). The latter behavior is probably what you need, so in most cases there is no need to replace usage of :unicode. There is no external format which interprets surrogate code points as characters now, but you can uses any of the :bmp formats with :use-replacement t to read 16-bit characters without giving an error, although this does not exactly match the input, because surrogate code points are translated by the replacement character. The only format that can read anything without any loss is :latin-1.

Note that :unicode differs from :utf-16 by the default byte order that it uses: :utf-16 defaults to big-endian (matching the Unicode standard), while :unicode defaults to the native byte order.

26.7.2 UTF-16

There are now several UTF-16 external formats. There are more than one because UTF-16 is actually two different encodings: UTF-16 big-endian and UTF-16 little-endian.

:utf-16-native and :utf-16-reversed are the actual implementation formats. They implement UTF-16 with the native byte order (:utf-16-native) or the reversed byte order (:utf-16-reversed)

:utf-16be and :utf-16le implement the big-endian (:utf-16be) and little-endian (:utf-16le) UTF-16. The system maps these formats to :utf-16-native or :utf-16-reversed as appropriate, depending on the byte order of the computer.

:utf-16 implements the UTF-16 standard, defaulting to UTF-16BE unless there is a BOM (Byte Order Mark).

In general, you will need to decide which of these to use depending on the circumstances.


26.7.3 BMP

There are now a few BMP external formats, which read and write only 16-bit characters (characters in the range 0 to #xffff, excluding the surrogate range #xd800 to #xdfff).

:bmp-native and :bmp-reversed are the actual implementation formats. They implement reading 16-bit characters with the native byte order (:bmp-native) or the reversed byte order (:bmp-reversed). These formats never read supplementary characters. When they encounter a surrogate code point, they either signal an error or replace it by the replacement character, depending on the parameter :use-replacement.

:bmp implements 16-bit character reading and writing, defaulting to the native one.

Notes: In LispWorks 6.1 and earlier versions, the :unicode external format is similar to :bmp now, but handles surrogate code points as if they represent characters. In LispWorks 7.0 and later :unicode maps to :utf-16, and there is no external format that reads surrogate code points as characters.

LispWorks User Guide and Reference Manual - 20 Sep 2017