Abstract
If Python Virtual Machine (PVM) bytecode is not "well-formed" it
is possible to crash or exploit the PVM by causing various errors
such as under/overflowing the value stack or reading/writing into
arbitrary areas of the PVM program space. Most of these kinds of
errors can be eliminated by verifying that PVM bytecode does not
violate a set of simple constraints before execution.
This PEP proposes a set of constraints on the format and structure
of Python Virtual Machine (PVM) bytecode and provides an
implementation in Python of this verification process.
Pronouncement
Guido believes that a verification tool has some value. If
someone wants to add it to Tools/scripts, no PEP is required.
Such a tool may have value for validating the output from
"bytecodehacks" or from direct edits of PYC files. As security
measure, its value is somewhat limited because perfectly valid
bytecode can still do horrible things. That situation could
change if the concept of restricted execution were to be
successfully resurrected.
Motivation
The Python Virtual Machine executes Python programs that have been
compiled from the Python language into a bytecode representation.
The PVM assumes that any bytecode being executed is "well-formed"
with regard to a number implicit constraints. Some of these
constraints are checked at run-time, but most of them are not due
to the overhead they would create.
When running in debug mode the PVM does do several run-time checks
to ensure that any particular bytecode cannot violate these
constraints that, to a degree, prevent bytecode from crashing or
exploiting the interpreter. These checks add a measurable
overhead to the interpreter, and are typically turned off in
common use.
Bytecode that is not well-formed and executed by a PVM not running
in debug mode may create a variety of fatal and non-fatal errors.
Typically, ill-formed code will cause the PVM to seg-fault and
cause the OS to immediately and abruptly terminate the
interpreter.
Conceivably, ill-formed bytecode could exploit the interpreter and
allow Python bytecode to execute arbitrary C-level machine
instructions or to modify private, internal data structures in the
interpreter. If used cleverly this could subvert any form of
security policy an application may want to apply to its objects.
Practically, it would be difficult for a malicious user to
"inject" invalid bytecode into a PVM for the purposes of
exploitation, but not impossible. Buffer overflow and memory
overwrite attacks are commonly understood, particularly when the
exploit payload is transmitted unencrypted over a network or when
a file or network security permission weakness is used as a
foothold for further attacks.
Ideally, no bytecode should ever be allowed to read or write
underlying C-level data structures to subvert the operation of the
PVM, whether the bytecode was maliciously crafted or not. A
simple pre-execution verification step could ensure that bytecode
cannot over/underflow the value stack or access other sensitive
areas of PVM program space at run-time.
This PEP proposes several validation steps that should be taken on
Python bytecode before it is executed by the PVM so that it
compiles with static and structure constraints on its instructions
and their operands. These steps are simple and catch a large
class of invalid bytecode that can cause crashes. There is also
some possibility that some run-time checks can be eliminated up
front by a verification pass.
There is, of course, no way to verify that bytecode is "completely
safe", for every definition of complete and safe. Even with
bytecode verification, Python programs can and most likely in the
future will seg-fault for a variety of reasons and continue to
cause many different classes of run-time errors, fatal or not.
The verification step proposed here simply plugs an easy hole that
can cause a large class of fatal and subtle errors at the bytecode
level.
Currently, the Java Virtual Machine (JVM) verifies Java bytecode
in a way very similar to what is proposed here. The JVM
Specification version 2 [1], Sections 4.8 and 4.9 were therefore
used as a basis for some of the constraints explained below. Any
Python bytecode verification implementation at a minimum must
enforce these constraints, but may not be limited to them.
Static Constraints on Bytecode Instructions
1. The bytecode string must not be empty. (len(co_code) > 0).
2. The bytecode string cannot exceed a maximum size
(len(co_code) < sizeof(unsigned char) - 1).
3. The first instruction in the bytecode string begins at index 0.
4. Only valid byte-codes with the correct number of operands can
be in the bytecode string.
Static Constraints on Bytecode Instruction Operands
1. The target of a jump instruction must be within the code
boundaries and must fall on an instruction, never between an
instruction and its operands.
2. The operand of a LOAD_* instruction must be an valid index into
its corresponding data structure.
3. The operand of a STORE_* instruction must be an valid index
into its corresponding data structure.
Structural Constraints between Bytecode Instructions
1. Each instruction must only be executed with the appropriate
number of arguments in the value stack, regardless of the
execution path that leads to its invocation.
2. If an instruction can be executed along several different
execution paths, the value stack must have the same depth prior
to the execution of the instruction, regardless of the path
taken.
3. At no point during execution can the value stack grow to a
depth greater than that implied by co_stacksize.
4. Execution never falls off the bottom of co_code.
Implementation
This PEP is the working document for an Python bytecode
verification implementation written in Python. This
implementation is not used implicitly by the PVM before executing
any bytecode, but is to be used explicitly by users concerned
about possibly invalid bytecode with the following snippet:
import verify
verify.verify(object)
The `verify` module provides a `verify` function which accepts the
same kind of arguments as `dis.dis`: classes, methods, functions,
or code objects. It verifies that the object's bytecode is
well-formed according to the specifications of this PEP.
If the code is well-formed the call to `verify` returns silently
without error. If an error is encountered, it throws a
'VerificationError' whose argument indicates the cause of the
failure. It is up to the programmer whether or not to handle the
error in some way or execute the invalid code regardless.
Phillip Eby has proposed a pseudo-code algorithm for bytecode
stack depth verification used by the reference implementation.
Verification Issues
This PEP describes only a small number of verifications. While
discussion and analysis will lead to many more, it is highly
possible that future verification may need to be done or custom,
project-specific verifications. For this reason, it might be
desirable to add a verification registration interface to the test
implementation to register future verifiers. The need for this is
minimal since custom verifiers can subclass and extend the current
implementation for added behavior.
Required Changes
Armin Rigo noted that several byte-codes will need modification in
order for their stack effect to be statically analyzed. These are
END_FINALLY, POP_BLOCK, and MAKE_CLOSURE. Armin and Guido have
already agreed on how to correct the instructions. Currently the
Python implementation punts on these instructions.
This PEP does not propose to add the verification step to the
interpreter, but only to provide the Python implementation in the
standard library for optional use. Whether or not this
verification procedure is translated into C, included with the PVM
or enforced in any way is left for future discussion.
References
[1] The Java Virtual Machine Specification 2nd Edition
http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html
Copyright
This document has been placed in the public domain.