Inspecting Codecs

Abstract

Codecs that determine the encoding by looking for a BOM or encoding declaration.

Copyright

© 2022 Günter Milde.

License

Released under the terms of the 2-Clause BSD license, in short:

Copying and distribution of this package, with or without modification, are permitted in any medium without royalty provided the copyright notices and this notice are preserved. This package is offered as-is, without any warranty.

Features

The inspecting_codecs module provides two codecs:

utf-sig

BOM sniffing (only decoding)

declared

scan for encoding declaration similar to PEP 263.

If no encoding can be determined, a UnicodeError is raised. A fallback can be specified by chaining codec names with the "or" operator (see Usage below).
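The two detection strategies can be illustrated with a minimal standard-library sketch. It is not the module's implementation: the names sniff_bom, find_declaration, BOMS and DECLARATION are hypothetical, and the regular expression follows PEP 263 itself, whereas the "declared" codec is only described as similar to it.

```python
import codecs
import re

# UTF-32 is tested before UTF-16 because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)

# Pattern from PEP 263 (assumption: the real codec may accept more forms).
DECLARATION = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def sniff_bom(data):
    """Return a codec name matching the BOM at the start of `data`."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    raise UnicodeError('no BOM found')


def find_declaration(data):
    """Return the encoding declared in the first two lines of `data`."""
    for line in data.splitlines()[:2]:
        match = DECLARATION.match(line)
        if match:
            return match.group(1).decode('ascii')
    raise UnicodeError('no encoding declaration found')
```

Both helpers raise UnicodeError when nothing is found, mirroring the behaviour described above. The returned names ('utf-8-sig', 'utf-16', 'utf-32') are standard Python codecs that consume the BOM while decoding.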

Currently, the codecs don't support the legacy codecs.StreamReader and codecs.StreamWriter interfaces (cf. PEP 400). Use the built-in open() instead of codecs.open().

This module is provisional. API and implementation details may change.

Usage

Importing the module registers the "utf-sig" and "declared" codec names and adds support for specifying fallback codecs:

import inspecting_codecs

Open a file with BOM sniffing:

open(filename, encoding='utf-sig')

Open a file whose encoding is given by an encoding declaration:

open(filename, encoding='declared')

A fallback can be specified, e.g. UTF-8:

open(filename, encoding='declared or utf-8')

Open a file with the encoding indicated by a BOM or encoding declaration or the fallback UTF-8:

open(filename, encoding='utf-sig or declared or utf-8')

Open a file with BOM or encoding declaration (the fallback is locale dependent) [1]:

open(filename, encoding='utf-sig or declared or locale')
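Because the "locale" encoding name only exists in Python 3.10 and later, older interpreters need a concrete name as the last fallback. The following sketch picks one; the helper locale_fallback_name is hypothetical, and it assumes that the chained syntax accepts any codec name known to Python, as the "utf-8" and "latin1" examples in this section suggest.

```python
import locale
import sys


def locale_fallback_name():
    """Return a name for the locale-dependent encoding (hypothetical helper)."""
    if sys.version_info >= (3, 10):
        return 'locale'
    # Older Pythons: ask for the preferred encoding without changing
    # the process locale settings.
    return locale.getpreferredencoding(False)


# Assumption: the resulting string is passed to open() after
# `import inspecting_codecs` has registered the codec names.
encoding = 'utf-sig or declared or ' + locale_fallback_name()
```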

A fallback can also be specified for standard encodings, e.g. try UTF-8, fall back to latin1 in case of errors:

open(filename, encoding='utf-8 or latin1')
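For comparison, the idea behind such a fallback chain can be sketched in plain Python for a complete byte string. This is only an illustration of the principle, not the module's incremental implementation; the function name decode_with_fallback is hypothetical.

```python
def decode_with_fallback(data, encodings=('utf-8', 'latin1')):
    """Decode `data` with the first encoding that does not fail."""
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeError as error:
            # Remember the failure and try the next candidate.
            last_error = error
    if last_error is None:
        raise UnicodeError('no encodings given')
    raise last_error
```

Since latin1 maps every byte to a character, it never fails and is therefore only useful as the last entry in a chain.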

The default input encoding handling of Docutils <= 0.19 can be recreated with:

open(filename, encoding='utf-sig or declared or utf-8 or locale or latin1')

One difference remains: Docutils does not use the fallback in case of decoding errors after detecting an encoding by BOM or declaration.

Warning

When encoding a string with the "declared" codec and

  • the data has fewer than two newlines,
  • there is no valid encoding declaration, and
  • there is no last call with final=True,

the data may silently disappear!

Unfortunately, writing to a file opened with the built-in open() never calls the encoder with the final argument set to True. As a test, call fd.seek(0): it raises a UnicodeError if the encoder's buffer is not empty (cf. tests/test_declared.py).

When using the IncrementalEncoder programmatically, make sure the last call to encode() sets final to True. If in doubt, finish with an empty string: encode('', final=True).
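The calling pattern can be sketched with the standard library's incremental encoder API. The helper encode_chunks is hypothetical, and the standard "utf-8" codec stands in for "declared" so that the example runs without the module; using "declared" instead is assumed to work the same way once inspecting_codecs has been imported.

```python
import codecs


def encode_chunks(chunks, encoding='utf-8'):
    """Encode an iterable of strings incrementally and flush at the end."""
    encoder = codecs.getincrementalencoder(encoding)()
    parts = [encoder.encode(chunk) for chunk in chunks]
    # The closing call with final=True lets a buffering encoder
    # emit whatever it still holds.
    parts.append(encoder.encode('', final=True))
    return b''.join(parts)
```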


  1. Support for the encoding name "locale" is new in Python 3.10.