4.8 codecs -- Codec registry and base classes

 

This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec lookup process.

It defines the following functions:

register(search_function)
Register a codec search function. Search functions are expected to take one argument, the encoding name in all lower case letters, and return a tuple of functions (encoder, decoder, stream_reader, stream_writer) taking the following arguments:

encoder and decoder: These must be functions or methods which have the same interface as the encode()/decode() methods of Codec instances (see Codec Interface). The functions/methods are expected to work in a stateless mode.

stream_reader and stream_writer: These have to be factory functions providing the following interface:

factory(stream, errors='strict')

The factory functions must return objects providing the interfaces defined by the base classes StreamWriter and StreamReader, respectively. Stream codecs can maintain state.

Possible values for errors are 'strict' (raise an exception in case of an encoding error), 'replace' (replace malformed data with a suitable replacement marker, such as "?") and 'ignore' (ignore malformed data and continue without further notice).

In case a search function cannot find a given encoding, it should return None.
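
As an illustration of the search function contract described above, the following sketch registers a function that answers for a single encoding name by handing back the function tuple of the existing UTF-8 codec, and returns None for every other name. The encoding name demoutf8 and the function demo_search are hypothetical names made up for this example, and the code assumes a current Python 3 interpreter, where encoders produce bytes objects.

```python
import codecs

def demo_search(encoding):
    """Search function that resolves the hypothetical name 'demoutf8' only."""
    # The registry passes the encoding name in lower case.
    if encoding == "demoutf8":
        # Reuse the (encoder, decoder, stream_reader, stream_writer)
        # tuple of the built-in UTF-8 codec.
        return codecs.lookup("utf-8")
    # Unknown encoding: let the registry try the next search function.
    return None

codecs.register(demo_search)

# The new name now resolves through the registry like any other encoding.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("demoutf8")
encoded, consumed = encoder("caf\u00e9")
```

Delegating to an existing codec keeps the sketch short; a real search function would normally return the functions of its own codec implementation.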

lookup(encoding)
Looks up a codec tuple in the Python codec registry and returns the function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found, the list of registered search functions is scanned. If no codecs tuple is found, a LookupError is raised. Otherwise, the codecs tuple is stored in the cache and returned to the caller.
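
A brief sketch of lookup() in use, assuming a current Python 3 interpreter (where the returned object still unpacks as the four-element tuple described above and encoders return bytes objects). It also exercises the three errors values with the ASCII codec. The encoding name no-such-codec-demo is a made-up name chosen so that the lookup fails.

```python
import codecs

encoder, decoder, stream_reader, stream_writer = codecs.lookup("ascii")

# Stateless encoder and decoder calls return (output, length consumed).
encoded, consumed = encoder("abc")
decoded, _ = decoder(encoded)

# 'replace' substitutes a marker, 'ignore' silently drops the character.
replaced, _ = encoder("caf\u00e9", "replace")
ignored, _ = encoder("caf\u00e9", "ignore")

# 'strict' raises an exception that is a subclass of ValueError.
try:
    encoder("caf\u00e9", "strict")
    strict_failed = False
except ValueError:
    strict_failed = True

# An unknown encoding raises LookupError.
try:
    codecs.lookup("no-such-codec-demo")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```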

To simplify access to the various codecs, the module provides these additional functions, which use lookup() for the codec lookup:

getencoder(encoding)
Look up the codec for the given encoding and return its encoder function.

Raises a LookupError in case the encoding cannot be found.

getdecoder(encoding)
Look up the codec for the given encoding and return its decoder function.

Raises a LookupError in case the encoding cannot be found.

getreader(encoding)
Look up the codec for the given encoding and return its StreamReader class or factory function.

Raises a LookupError in case the encoding cannot be found.

getwriter(encoding)
Look up the codec for the given encoding and return its StreamWriter class or factory function.

Raises a LookupError in case the encoding cannot be found.
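
The four accessor functions above might be used as in the following sketch. It assumes a current Python 3 interpreter and uses io.BytesIO as a stand-in for a real file; the variable names are illustrative only.

```python
import codecs
import io

# Stateless functions: each call returns (output, length consumed).
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")
raw, consumed = encode("caf\u00e9")
text, _ = decode(raw)

# Stream factories wrap an existing byte stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("caf\u00e9")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```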

To simplify working with encoded files or streams, the module also defines these utility functions:

open(filename, mode[, encoding[, errors[, buffering]]])
Open an encoded file using the given mode and return a wrapped version providing transparent encoding/decoding.

Note: The wrapped version will only accept the object format defined by the codecs, i.e. Unicode objects for most built-in codecs. Output is also codec-dependent and will usually be Unicode as well.

encoding specifies the encoding which is to be used for the file.

errors may be given to define the error handling. It defaults to 'strict' which causes a ValueError to be raised in case an encoding error occurs.

buffering has the same meaning as for the built-in open() function. It defaults to line buffered.
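
A minimal sketch of open() writing and reading an encoded file, assuming a current Python 3 interpreter (where the wrapper accepts and returns str objects) and a temporary directory created only for this example. The final step shows the ValueError behaviour of errors='strict' by decoding the same file as ASCII.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")

    # Write text; the wrapper encodes it to UTF-8 on the way out.
    f = codecs.open(path, "w", "utf-8")
    try:
        f.write("caf\u00e9\n")
    finally:
        f.close()

    # Read it back; the wrapper decodes the bytes transparently.
    f = codecs.open(path, "r", "utf-8")
    try:
        text = f.read()
    finally:
        f.close()

    # Inspect the bytes that actually reached the file.
    with open(path, "rb") as raw:
        raw_bytes = raw.read()

    # Decoding the same file as ASCII fails under errors='strict'.
    f = codecs.open(path, "r", "ascii", "strict")
    try:
        f.read()
        strict_failed = False
    except ValueError:
        strict_failed = True
    finally:
        f.close()
```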

EncodedFile(file, input[, output[, errors]])
Return a wrapped version of file which provides transparent encoding translation.

Strings written to the wrapped file are interpreted according to the given input encoding and then written to the original file as strings using the output encoding. The intermediate encoding will usually be Unicode but depends on the specified codecs.

If output is not given, it defaults to input.

errors may be given to define the error handling. It defaults to 'strict', which causes ValueError to be raised in case an encoding error occurs.
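
The translation performed by EncodedFile() can be sketched as follows, assuming a current Python 3 interpreter (where the wrapped file accepts and returns bytes objects) and an in-memory io.BytesIO object standing in for a real file. Latin-1 is used as the input encoding and UTF-8 as the output encoding; both arguments are passed by position.

```python
import codecs
import io

# Data written as Latin-1 is stored in the underlying file as UTF-8.
backing = io.BytesIO()
wrapped = codecs.EncodedFile(backing, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")          # Latin-1 bytes for "cafe" with e-acute
stored = backing.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 for the caller.
source = io.BytesIO(b"caf\xc3\xa9")
read_back = codecs.EncodedFile(source, "latin-1", "utf-8").read()
```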

The module also provides the following constants, which are useful for reading from and writing to platform-dependent files:

BOM
BOM_BE
BOM_LE
BOM32_BE
BOM32_LE
BOM64_BE
BOM64_LE
These constants define the byte order marks (BOM) used in data streams to indicate the byte order used in the stream or file. BOM is either BOM_BE or BOM_LE depending on the platform's native byte order, while the others represent big endian ("_BE" suffix) and little endian ("_LE" suffix) byte order using 32-bit and 64-bit encodings.
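
One possible use of these constants is to inspect the first bytes of a data stream. The sketch below assumes a current Python 3 interpreter, where the constants are bytes objects; the helper detect_byte_order is a hypothetical function written for this example and is not part of the module.

```python
import codecs
import sys

def detect_byte_order(data):
    """Return 'big', 'little', or None based on a leading BOM_BE/BOM_LE."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None

# BOM matches whichever constant corresponds to the native byte order.
native_bom = codecs.BOM_BE if sys.byteorder == "big" else codecs.BOM_LE

big = detect_byte_order(codecs.BOM_BE + b"\x00a")
little = detect_byte_order(codecs.BOM_LE + b"a\x00")
unmarked = detect_byte_order(b"plain bytes")
```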

See Also:

http://sourceforge.net/projects/python-codecs/
A SourceForge project working on additional support for Asian codecs for use with Python. They are in the early stages of development at the time of this writing -- look in their FTP area for downloadable files.

