Character Encodings supported by PARLANSE (and DMS)

PARLANSE supports a wide variety of character encodings, which in turn enables it to support document processing. The same support lets the DMS Software Reengineering Toolkit be applied to programming language source code and other formal documents written in a broad variety of standard and national character sets. PARLANSE/DMS can read or generate documents in 7-bit or 8-bit ASCII, Unicode (Version 13), EBCDIC, Asian character sets, and Chinese (including Mainland China's GB-18030 UTF variant).

PARLANSE supports these character sets via its Streams library, which enables opening and creating local devices and files, as well as network-accessible file paths, on Windows or Linux. Characters internal to PARLANSE/DMS are always 16-bit Unicode characters; the Streams module handles all conversions to and from other character encodings.

When a stream is created for output, a default encoding is used unless overridden by a specific encoding request. A stream may be opened for reading with a specific encoding, or the stream may be sniffed to determine its encoding by checking for the various Byte Order Marks and/or for apparently legal conformance to UTF-8 conventions. If such sniffing does not produce an obvious choice, a default encoding is used.
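
The sniffing order described above can be illustrated with a minimal Python sketch. It is not the Streams implementation: the function name `sniff_encoding`, the ISO-8859-1 default, and the treatment of fe ff as big-endian UTF-16 are assumptions made for illustration only.

```python
import codecs
from typing import Tuple

def sniff_encoding(data: bytes, default: str = "iso-8859-1") -> Tuple[str, int]:
    """Return (codec name, number of leading BOM bytes to skip).

    Hypothetical helper; the default encoding is an assumption.
    """
    # A Byte Order Mark, when present, decides the encoding outright.
    if data.startswith(codecs.BOM_UTF8):       # ef bb bf
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):   # ff fe
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):   # fe ff (assumed handling)
        return "utf-16-be", 2
    # No BOM: accept UTF-8 only if a high-bit byte occurs and the whole
    # buffer is a well-formed UTF-8 byte sequence.
    if any(b >= 0x80 for b in data):
        try:
            data.decode("utf-8")
            return "utf-8", 0
        except UnicodeDecodeError:
            pass
    # Nothing obvious: fall back to the default encoding.
    return default, 0
```

A caller would skip the reported number of BOM bytes and decode the remainder with the returned codec, for example `data[skip:].decode(name)`.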

The Streams module also allows explicit control over the treatment of the control character codes 0x00-0x1F and 0x7F. They may be individually ignored or retained as printing or non-printing characters. CR and LF can be treated as uninteresting characters or as various ways of indicating end-of-line. Finally, TAB characters can be treated as "tabs" to various prespecified columns.

The input stream module keeps accurate track of line and column numbers according to the interpretation of the characters.
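
How the interpretation of control characters affects line and column numbers can be sketched in Python as follows. This is a rough illustration, not the Streams module's logic: the name `track_positions`, the uniform tab width of 8, the 1-based numbering, and the choice to count CR LF as a single line end are all assumptions.

```python
def track_positions(text, tab_width=8, ignored=frozenset()):
    """Yield (char, line, column) using 1-based positions.

    Hypothetical helper: characters in `ignored` are dropped without
    advancing the column; CR, LF, and CR LF each end a line.
    """
    line, col = 1, 1
    prev_cr = False
    for ch in text:
        if ch in ignored:
            continue                      # ignored control character
        if ch == "\n" and prev_cr:
            prev_cr = False               # LF completing a CR LF pair
            continue
        yield ch, line, col
        prev_cr = (ch == "\r")
        if ch == "\r" or ch == "\n":
            line, col = line + 1, 1       # end of line: next line, column 1
        elif ch == "\t":
            # Advance to the next tab stop (columns 1, 9, 17, ... for width 8).
            col = ((col - 1) // tab_width + 1) * tab_width + 1
        else:
            col += 1
```

With these assumptions, a TAB in column 2 moves the following character to column 9, and the character after a CR LF pair is reported at column 1 of the next line.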

PARLANSE also offers HTML and XML output streams, providing many useful tag and content generation calls that encode data according to HTML/XML escape conventions, on top of the specified character encoding.
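
The layering of escape conventions on top of a character encoding can be shown with a short Python sketch using only the standard library. The function name `html_text_bytes` is hypothetical, and falling back to numeric character references for characters the target encoding cannot represent is an assumption, not a documented behavior of the PARLANSE streams.

```python
from html import escape

def html_text_bytes(text, encoding="iso-8859-1"):
    """Escape markup-significant characters, then apply the byte encoding.

    Hypothetical helper illustrating the two layers.
    """
    # Layer 1: HTML/XML escaping of &, <, > and quotes.
    escaped = escape(text, quote=True)
    # Layer 2: the character encoding; unrepresentable characters become
    # numeric character references (an assumption for this sketch).
    return escaped.encode(encoding, errors="xmlcharrefreplace")
```

For example, under ISO-8859-1 the text `x < y` becomes the bytes `x &lt; y`, while a character such as the euro sign, which ISO-8859-1 lacks, is emitted as a numeric reference.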

Available Stream Encodings

  • Encoding-CP-037
    The characters are encoded as defined by CP-037, the IBM EBCDIC US/Canada code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-037+R80
    The characters are encoded as defined by CP-037, the IBM EBCDIC US/Canada code page. Additionally, a line is assumed to consist of exactly 80 characters with newline characters being considered normal characters that increase the column number by 1. The encoding is only supported by certain DMS domains that have reference formats with fixed line length. Specifying this encoding for any other DMS domain results in undefined behavior. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-500
    The characters are encoded as defined by CP-500, the IBM EBCDIC International code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-500+R80
    The characters are encoded as defined by CP-500, the IBM EBCDIC International code page. Additionally, a line is assumed to consist of exactly 80 characters with newline characters being considered normal characters that increase the column number by 1. The encoding is only supported by certain DMS domains that have reference formats with fixed line length. Specifying this encoding for any other DMS domain results in undefined behavior. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-932
    The characters are encoded as defined by CP-932, the Microsoft Shift-JIS 0208 code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1250
    The characters are encoded as defined by CP-1250, the Microsoft Windows Latin 2 code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1251
    The characters are encoded as defined by CP-1251, the Microsoft Windows Cyrillic code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1252
    The characters are encoded as defined by CP-1252, the Microsoft Windows Latin 1 code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1253
    The characters are encoded as defined by CP-1253, the Microsoft Windows Greek code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1254
    The characters are encoded as defined by CP-1254, the Microsoft Windows Latin 5 (Turkish) code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1255
    The characters are encoded as defined by CP-1255, the Microsoft Windows Hebrew code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1256
    The characters are encoded as defined by CP-1256, the Microsoft Windows Arabic code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1257
    The characters are encoded as defined by CP-1257, the Microsoft Windows Baltic code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-CP-1258
    The characters are encoded as defined by CP-1258, the Microsoft Windows Vietnamese code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-646-US
    The characters are encoded as defined by the ISO-646-US standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-1
    The characters are encoded as defined by the ISO-8859-1 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-1+R80
    The characters are encoded as defined by the ISO-8859-1 standard. Additionally, a line is assumed to consist of exactly 80 characters with newline characters being considered normal characters that increase the column number by 1. The encoding is only supported by certain DMS domains that have reference formats with fixed line length. Specifying this encoding for any other DMS domain results in undefined behavior. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-2
    The characters are encoded as defined by the ISO-8859-2 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-3
    The characters are encoded as defined by the ISO-8859-3 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-4
    The characters are encoded as defined by the ISO-8859-4 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-5
    The characters are encoded as defined by the ISO-8859-5 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-6
    The characters are encoded as defined by the ISO-8859-6 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-7
    The characters are encoded as defined by the ISO-8859-7 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-8
    The characters are encoded as defined by the ISO-8859-8 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-9
    The characters are encoded as defined by the ISO-8859-9 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-10
    The characters are encoded as defined by the ISO-8859-10 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-11
    The characters are encoded as defined by the ISO-8859-11 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-13
    The characters are encoded as defined by the ISO-8859-13 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-14
    The characters are encoded as defined by the ISO-8859-14 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-15
    The characters are encoded as defined by the ISO-8859-15 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-ISO-8859-16
    The characters are encoded as defined by the ISO-8859-16 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-Shift-JIS-0208
    The characters are encoded as defined by JIS X 0208, Appendix 1, the Shift-JIS 0208 code page. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-Unicode-UTF-8
    The characters are encoded using the encoding UTF-8 as defined in the Unicode standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-Unicode-UTF-8?ISO-8859-1
    If the input stream contains a byte with the high bit set and the input stream consists of a well-formed UTF-8 encoded byte sequence, the characters are encoded using the encoding UTF-8 as defined in the Unicode standard. Otherwise, the characters are encoded as defined by the ISO-8859-1 standard. An input stream using this character encoding must not start with any of the byte sequences ff fe, fe ff, or ef bb bf.
  • Encoding-Unicode-UTF-8-BOM
    The characters are encoded using the encoding UTF-8 as defined in the Unicode standard. An input stream using this character encoding must start with the byte sequence ef bb bf. This initial byte sequence is not considered to be part of the character sequence provided by the input stream. If this encoding is specified in an environment variable, it is ignored, i.e. considered as not specified. The encoding is detected automatically by consulting the initial byte sequence of the input stream.
  • Encoding-Unicode-UTF-16LE-BOM
    The characters are encoded using the encoding UTF-16LE as defined in the Unicode standard. An input stream using this character encoding must start with the byte sequence ff fe. This initial byte sequence is not considered to be part of the character sequence provided by the input stream. If this encoding is specified in an environment variable, it is ignored, i.e. considered as not specified. The encoding is detected automatically by consulting the initial byte sequence of the input stream.
  • Encoding-MS-ANSI
    The characters are encoded according to the National character set defined by the ANSI setting on Microsoft Windows systems.
  • Encoding-GB-18030
    The characters are encoded according to Mainland China's mandated variant of Unicode.
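
The fixed-length "+R80" variants in the list above can be pictured with the following Python sketch, which decodes an EBCDIC buffer and cuts it into 80-character records. It relies on Python's built-in `cp037` codec; the helper names and the 1-based line and column numbering are assumptions for illustration, not the DMS implementation.

```python
RECORD_LENGTH = 80  # fixed line length implied by the "+R80" suffix

def read_r80_records(data, codec="cp037"):
    """Decode a byte buffer and split it into 80-character records.

    Hypothetical helper: no character is treated as a line terminator,
    so any newline code is simply part of a record.
    """
    text = data.decode(codec)
    return [text[i:i + RECORD_LENGTH]
            for i in range(0, len(text), RECORD_LENGTH)]

def position_of(offset):
    """Map a 0-based character offset to an assumed 1-based (line, column)."""
    return offset // RECORD_LENGTH + 1, offset % RECORD_LENGTH + 1
```

Because CP-037 is a single-byte code page, each record is also exactly 80 bytes, so character offsets and byte offsets coincide.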