Inspecting Codecs
- Abstract
Codecs that determine the encoding by looking for a BOM or encoding declaration.
- Copyright
© 2022 Günter Milde.
- License
Released under the terms of the 2-Clause BSD license, in short:
Copying and distribution of this package, with or without modification, are permitted in any medium without royalty provided the copyright notices and this notice are preserved. This package is offered as-is, without any warranty.
Features
The inspecting_codecs module provides two codecs:
- utf-sig
BOM sniffing (only decoding)
- declared
scan for encoding declaration similar to PEP 263.
If no encoding can be determined, a UnicodeError is raised. A fallback can be specified by chaining codecs with the "or" operator (see usage below).
Currently, the codecs don't support the legacy codecs.StreamReader and codecs.StreamWriter interface (cf. PEP 400). Use standard open() instead of codecs.open().
This module is provisional. API and implementation details may change.
Usage
Importing the module registers the "utf-sig" and "declared" codec names and support for specifying fallback codecs:
import inspecting_codecs
Open a file with BOM sniffing:
open(filename, encoding='utf-sig')
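To illustrate what BOM sniffing means, here is a minimal sketch built only on the standard library. It is not this package's implementation: the helper name sniff_bom and the mapping from BOM to codec name are assumptions made for the example.

```python
import codecs

# (BOM, codec name) pairs. UTF-32 is tested before UTF-16 because the
# UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE BOM.
# The codec names are standard codecs that consume the BOM themselves.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data: bytes) -> str:
    """Return a codec name for the BOM at the start of `data` (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    # Mirrors the documented behaviour: no detectable encoding is an error.
    raise UnicodeError("no byte order mark found")
```

With this sketch, data.decode(sniff_bom(data)) returns the text without the BOM, because the "utf-8-sig", "utf-16" and "utf-32" codecs strip it while decoding.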
Open a file with encoding declaration:
open(filename, encoding='declared')
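The idea behind scanning for a declaration can be sketched as follows. This is an illustration under assumptions, not the codec's actual code: the helper name find_declared_encoding is hypothetical, and the pattern is a simplified form of the regular expression given in PEP 263, applied to the first two lines only.

```python
import re

# Simplified PEP 263 pattern (an assumption; the codec's real pattern may differ).
# With a bytes pattern, \w matches ASCII letters, digits and "_" only.
_DECLARATION = re.compile(rb"coding[:=]\s*([-\w.]+)")


def find_declared_encoding(data: bytes) -> str:
    """Return the encoding named in the first two lines (hypothetical helper)."""
    for line in data.splitlines()[:2]:
        match = _DECLARATION.search(line)
        if match:
            return match.group(1).decode("ascii")
    # Mirrors the documented behaviour: no detectable encoding is an error.
    raise UnicodeError("no encoding declaration found")
```

Both Emacs-style (`-*- coding: latin-1 -*-`) and Vim-style (`fileencoding=utf-8`) declarations match this pattern, since each contains "coding" followed by ":" or "=".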
A fallback can be specified, e.g. UTF-8:
open(filename, encoding='declared or utf-8')
Open a file with the encoding indicated by a BOM or encoding declaration or the fallback UTF-8:
open(filename, encoding='utf-sig or declared or utf-8')
Open a file with BOM or encoding declaration (the fallback is locale-dependent) [1]:
open(filename, encoding='utf-sig or declared or locale')
A fallback can also be specified for standard encodings, e.g. try UTF-8, fall back to latin1 in case of errors:
open(filename, encoding='utf-8 or latin1')
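For comparison, the effect of such a fallback chain on a complete byte string can be approximated with the standard library alone. This is a rough sketch, assuming the chain simply tries each codec in order; the function name decode_with_fallbacks is hypothetical and the package's incremental handling may differ.

```python
def decode_with_fallbacks(data: bytes, encodings=("utf-8", "latin1")) -> str:
    """Try each encoding in turn and return the first successful result."""
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeError as err:
            # Remember the failure and move on to the next candidate.
            last_error = err
    raise UnicodeError(f"none of {encodings!r} could decode the data") from last_error
```

Because latin1 maps every byte to a character, placing it last means the chain never fails, at the price of possibly decoding to the wrong characters.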
The default input encoding handling of Docutils <= 0.19 can be recreated with:
open(filename, encoding='utf-sig or declared or utf-8 or locale or latin1')
One difference remains: Docutils does not use the fallback in case of decoding errors after detecting an encoding by BOM or declaration.
Warning
When encoding a string with the "declared" codec, the data may silently disappear if all of the following hold:
- the data has fewer than two newlines,
- there is no valid encoding declaration, and
- there is no final call with final=True.
Unfortunately, writing to a file opened with the standard open() never calls the encoder with the final argument set to True. As a test, you can call fd.seek(0), which raises a UnicodeError if the buffer is not empty (cf. tests/test_declared.py).
When using the IncrementalEncoder programmatically, make sure the last call to encode() sets final to True. If in doubt, call it with an empty string: encode('', final=True).
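The calling pattern described above can be sketched with the standard incremental encoder API. The example uses the built-in UTF-8 codec as a stand-in so that it runs without this package installed; the helper name encode_chunks is hypothetical. With inspecting_codecs imported, the name "declared" could presumably be passed instead, but that is an assumption not verified here.

```python
import codecs


def encode_chunks(chunks, encoding="utf-8"):
    """Encode an iterable of strings incrementally, flushing at the end."""
    encoder = codecs.getincrementalencoder(encoding)()
    output = [encoder.encode(chunk) for chunk in chunks]
    # The closing call with final=True lets a buffering encoder release
    # whatever it is still holding back.
    output.append(encoder.encode("", final=True))
    return b"".join(output)
```

For a non-buffering codec such as UTF-8 the final call returns an empty byte string, so adding it is harmless.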
[1] Support for the encoding name "locale" is new in Python 3.10.