Earlier this week, I had to deal with some files in Python's marshal format (some .pyc files, specifically) in Ruby and discovered that the details of this format aren't documented. Since it's meant to be purely internal, the Python team has decided not to document it in any way.
The marshal format is used in .pyc files, lots of internal storage for random apps, etc. It's a shame that it's undocumented, as this means that there are, to my knowledge, no implementations for other languages. This also means that if you have a malicious marshal blob, you have to load it up with Python to play around with it, which is not a good idea.
Fortunately, you can read the source (Python/marshal.c) and figure out how it works pretty easily. However, to make it even easier, I've decided to write up some simple documentation on the format. I'll give types as int<n>/uint<n>, where n is the number of bits. The object type indicates that this is a marshalled object.
The marshal format in and of itself is very simple. It consists of a series of nested objects, each represented by a type (uint8 -- a char, in fact) followed by some serialized data. All data is little-endian.
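To make the "type byte, then little-endian data" layout concrete, here is a minimal Python sketch that splits off the type byte and reads one int32 after it. The function name is made up for illustration, and the `i` tag used in the example is what I believe marshal.c uses for TYPE_INT in 2.x; check the source before relying on it.

```python
import struct

def read_type_and_int32(blob, offset=0):
    """Return (type tag, int32 payload, offset just past the payload)."""
    type_tag = blob[offset:offset + 1]                      # the uint8 type, kept as a 1-byte string
    (value,) = struct.unpack_from("<i", blob, offset + 1)   # "<" forces little-endian
    return type_tag, value, offset + 5

# Hand-built example: tag byte "i" followed by 42 as a little-endian int32.
tag, value, end = read_type_and_int32(b"i\x2a\x00\x00\x00")
```

Every other type follows the same pattern: one tag byte, then whatever payload that tag implies.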
Note: I wrote all of this for the 2.x line. I don't know how much has changed in 3.x.
These types contain no data and are simply representations of Python constants.
TYPE_NULL -- Used to null terminate dictionaries and to represent the serialization of a null object internally (not sure if this can happen or not).
TYPE_NONE -- Represents the None constant.
TYPE_FALSE -- Represents the False constant.
TYPE_TRUE -- Represents the True constant.
TYPE_STOPITER -- Represents the StopIteration exception.
TYPE_ELLIPSIS -- Represents the Ellipsis constant.
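Since these types carry no data, decoding them is a single table lookup. The sketch below is in Python; the tag bytes are the ones I believe Python/marshal.c defines for the 2.x line, so treat them as an assumption and verify against the source. `NULL` and `read_constant` are names invented for this example.

```python
NULL = object()  # stand-in for the internal null marker; it has no Python-level value

# Tag bytes as I believe they are defined in Python/marshal.c (2.x); verify before use.
CONSTANT_TAGS = {
    b"0": NULL,            # TYPE_NULL
    b"N": None,            # TYPE_NONE
    b"F": False,           # TYPE_FALSE
    b"T": True,            # TYPE_TRUE
    b"S": StopIteration,   # TYPE_STOPITER
    b".": Ellipsis,        # TYPE_ELLIPSIS
}

def read_constant(blob, offset=0):
    """Return (constant, next offset) for a data-less type tag."""
    tag = blob[offset:offset + 1]
    if tag not in CONSTANT_TAGS:
        raise ValueError("not a constant tag: %r" % tag)
    return CONSTANT_TAGS[tag], offset + 1
```

A dedicated sentinel is used for TYPE_NULL so that it can't be confused with a real None in the decoded data.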
TYPE_INT -- Represents an int on a 32-bit machine. Stored as an int32.
TYPE_INT64 -- Represents an int on a 64-bit machine. Stored as an int64. When read on a 32-bit machine, this may automatically become a long (if it's above what a 32-bit int can hold).
TYPE_FLOAT -- Represents a float in the old (version < 2) marshal format. Stored as a string with a uint8 before it indicating the size.
TYPE_BINARY_FLOAT -- Represents a float in the new marshal format. Stored as a float64. (Thanks to Trevor Blackwell for noting that these are not
TYPE_COMPLEX -- Represents a complex in the old (version < 2) marshal format. Contains the real and imaginary components stored like TYPE_FLOAT; that is, as strings.
TYPE_BINARY_COMPLEX -- Represents a complex in the new marshal format. Stored as two float64s representing the real and imaginary components.
TYPE_LONG -- Represents a long. Haven't yet figured out how this works; I'll update shortly with that.
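Here is a rough Python sketch of decoding the numeric types. The tag bytes are the ones I believe marshal.c uses in 2.x. The TYPE_LONG branch is not based on anything written up above: as far as I can tell from marshal.c, a long is a signed int32 digit count (the sign is the sign of the number) followed by that many uint16 values holding 15 bits each, least significant first. Treat that branch in particular as an assumption to verify.

```python
import struct

def read_number(blob, offset=0):
    """Return (value, next offset) for a marshalled numeric object."""
    tag = blob[offset:offset + 1]
    offset += 1
    if tag == b"i":   # TYPE_INT: int32
        return struct.unpack_from("<i", blob, offset)[0], offset + 4
    if tag == b"I":   # TYPE_INT64: int64
        return struct.unpack_from("<q", blob, offset)[0], offset + 8
    if tag == b"g":   # TYPE_BINARY_FLOAT: float64
        return struct.unpack_from("<d", blob, offset)[0], offset + 8
    if tag == b"y":   # TYPE_BINARY_COMPLEX: two float64s
        real, imag = struct.unpack_from("<dd", blob, offset)
        return complex(real, imag), offset + 16
    if tag == b"f":   # TYPE_FLOAT: uint8 size, then the float as text
        size = blob[offset]
        text = blob[offset + 1:offset + 1 + size]
        return float(text.decode("ascii")), offset + 1 + size
    if tag == b"l":   # TYPE_LONG (assumed layout, see the note above)
        (count,) = struct.unpack_from("<i", blob, offset)
        offset += 4
        digits = struct.unpack_from("<%dH" % abs(count), blob, offset)
        magnitude = sum(digit << (15 * i) for i, digit in enumerate(digits))
        return (-magnitude if count < 0 else magnitude), offset + 2 * abs(count)
    raise ValueError("not a numeric tag: %r" % tag)
```

TYPE_COMPLEX is left out for brevity; it would read two size-prefixed strings the same way the TYPE_FLOAT branch reads one.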
TYPE_STRING -- Represents a str. Stored as an int32 representing the size, followed by that many bytes.
TYPE_INTERNED -- Represents a str. Identical to TYPE_STRING, with the exception that it's added to an "interned" list as well.
TYPE_STRINGREF -- Represents a str. Stored as an int32 reference into the interned list mentioned above. Note that this is zero-indexed.
TYPE_UNICODE -- Represents a unicode. Stored as an int32 representing the size in bytes, followed by that many bytes. This is always UTF-8.
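The interned list is the only bit of state a reader has to carry around, so here's a small Python sketch of the four string types. The class and method names are invented for the example, and the tag bytes are the ones I believe marshal.c uses in 2.x.

```python
import struct

class StringReader:
    """Reads marshalled strings from a blob, tracking the interned list."""

    def __init__(self, blob):
        self.blob = blob
        self.pos = 0
        self.interned = []   # filled in the order TYPE_INTERNED strings appear

    def _int32(self):
        (value,) = struct.unpack_from("<i", self.blob, self.pos)
        self.pos += 4
        return value

    def read(self):
        tag = self.blob[self.pos:self.pos + 1]
        self.pos += 1
        if tag == b"R":                      # TYPE_STRINGREF: zero-based index
            return self.interned[self._int32()]
        if tag not in (b"s", b"t", b"u"):
            raise ValueError("not a string tag: %r" % tag)
        size = self._int32()
        data = self.blob[self.pos:self.pos + size]
        self.pos += size
        if tag == b"u":                      # TYPE_UNICODE: UTF-8 payload
            return data.decode("utf-8")
        if tag == b"t":                      # TYPE_INTERNED: remember it for later refs
            self.interned.append(data)
        return data                          # TYPE_STRING / TYPE_INTERNED: raw bytes
```

Plain strings are returned as raw bytes, since a 2.x str is a byte string; only TYPE_UNICODE is decoded.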
TYPE_TUPLE -- Represents a tuple. Stored as an int32 followed by that many objects, which are marshalled as well.
TYPE_LIST -- Represents a list. Stored identically to TYPE_TUPLE.
TYPE_DICT -- Represents a dict. Stored as a series of marshalled key-value pairs. At the end of the dict, you'll have a "key" that consists of a TYPE_NULL; there's no value following it.
TYPE_FROZENSET -- Represents a frozenset. Stored identically to TYPE_TUPLE.
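The containers are where the format becomes recursive. The Python sketch below handles only ints as leaf values to keep it short; `NULL` and `read_object` are names made up for the example, and the tag bytes are the ones I believe marshal.c uses in 2.x.

```python
import struct

NULL = object()  # stand-in for TYPE_NULL, used here only as the dict terminator

def read_object(blob, pos=0):
    """Return (object, next offset); supports ints, tuples, lists, dicts, frozensets."""
    tag = blob[pos:pos + 1]
    pos += 1
    if tag == b"0":                          # TYPE_NULL
        return NULL, pos
    if tag == b"i":                          # TYPE_INT
        return struct.unpack_from("<i", blob, pos)[0], pos + 4
    if tag in (b"(", b"[", b">"):            # tuple, list, frozenset: count + items
        (count,) = struct.unpack_from("<i", blob, pos)
        pos += 4
        items = []
        for _ in range(count):
            item, pos = read_object(blob, pos)
            items.append(item)
        if tag == b"(":
            return tuple(items), pos
        if tag == b"[":
            return items, pos
        return frozenset(items), pos
    if tag == b"{":                          # dict: pairs until a TYPE_NULL "key"
        result = {}
        while True:
            key, pos = read_object(blob, pos)
            if key is NULL:
                return result, pos
            value, pos = read_object(blob, pos)
            result[key] = value
    raise ValueError("unsupported tag: %r" % tag)
```

Note that the dict carries no length at all; the reader only knows it's done when the next "key" turns out to be TYPE_NULL.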
Code objects (like that in a .pyc file, or in the func_code property of a function) use the TYPE_CODE type flag. Even in the case of the top level (as in a .pyc), they represent a function.
They consist of the following fields:
int32 -- Number of arguments.
int32 -- Number of local variables.
int32 -- Max stack depth used.
int32 -- Flags for the function.
- This list is not all encompassing; certain __future__ declarations will set their own flags.
object -- String representation of the bytecode.
object -- Tuple of constants used.
object -- Tuple of names.
object -- Tuple of variable names (this includes arguments and locals).
object -- Tuple of "free" variables. (Can anyone clarify this a bit?)
object -- Tuple of variables used in nested functions.
object -- String containing the original filename this code object was generated from.
object -- Name of the function. If it's the top level code object in a .pyc, this will be
int32 -- First line number of the code this code object was generated from.
object -- String mapping bytecode offsets to line numbers. Haven't delved into the details here.
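The field order above can be written down as a simple table and walked in sequence. This Python sketch assumes you already have two callables, one that reads an int32 and one that reads a full marshalled object; both are hypothetical and supplied by the caller. The field names are my labels, borrowed from the co_* attributes on Python's code objects with the prefix dropped, so check them against marshal.c.

```python
# (name, kind) in the order the fields appear after the TYPE_CODE tag.
# Names are illustrative labels, not something the format itself stores.
CODE_FIELDS = [
    ("argcount", "int32"),
    ("nlocals", "int32"),
    ("stacksize", "int32"),
    ("flags", "int32"),
    ("code", "object"),
    ("consts", "object"),
    ("names", "object"),
    ("varnames", "object"),
    ("freevars", "object"),
    ("cellvars", "object"),
    ("filename", "object"),
    ("name", "object"),
    ("firstlineno", "int32"),
    ("lnotab", "object"),
]

def read_code_fields(read_int32, read_object):
    """Read the code object fields in order using caller-supplied readers."""
    fields = {}
    for name, kind in CODE_FIELDS:
        fields[name] = read_int32() if kind == "int32" else read_object()
    return fields
```

Keeping the layout in a table means a change in field order between versions only touches the table, not the reading logic.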
Hopefully this will be of use to someone. One potential use is in studying malicious marshalled data; the Python guys strongly recommend against unmarshalling untrusted data, but we all know how well such notices are regarded. In addition, it may help you manipulate Python bytecode from non-Python languages.
Drop me a line if you do anything cool with it.
- Cody Brocious (Daeken)