CHAPTER I -- PYTHON BASICS
-------------------------------------------------------------------

This chapter discusses Python capabilities that are likely to be used in text processing applications. For an introduction to Python syntax and semantics per se, readers might want to skip ahead to Appendix A (A Selective and Impressionistic Short Review of Python); Guido van Rossum's _Python Tutorial_ is also quite excellent. The focus here is at a somewhat higher level: not the Python language narrowly, but also not yet anything specific to text processing.

In Section 1.1, I look at some programming techniques that flow out of the Python language itself, but that are usually not obvious to Python beginners--and are sometimes not obvious even to intermediate Python programmers. The programming techniques that are discussed are ones that tend to be applicable to text processing contexts--other programming tasks are likely to have their own tricks and idioms that are not explicitly documented in this book.

In Section 1.2, I document modules in the Python standard library that you will probably use in your text processing application, or at the very least want to keep in the back of your mind. A number of other Python standard library modules are far enough afield of text processing that you are unlikely to use them in this type of application. Such remaining modules are documented very briefly with one- or two-line descriptions. More details on each module can be found in Python's standard documentation.

SECTION 1 -- Techniques and Patterns
------------------------------------------------------------------------

TOPIC -- Utilizing Higher-Order Functions in Text Processing
--------------------------------------------------------------------

This first topic merits a warning. It jumps feet-first into higher-order functions (HOFs) at a fairly sophisticated level and may be unfamiliar even to experienced Python programmers. Do not be too frightened by this first topic--you can understand the rest of the book without it. If the functional programming (FP) concepts in this topic seem unfamiliar to you, I recommend you jump ahead to Appendix A, especially its final section on FP concepts.

In text processing, one frequently acts upon a series of chunks of text that are, in a sense, homogeneous. Most often, these chunks are lines, delimited by newline characters--but sometimes other sorts of fields and blocks are relevant. Moreover, Python has standard functions and syntax for reading in lines from a file (sensitive to platform differences). Obviously, these chunks are not entirely homogeneous--they can contain varying data. But at the level we worry about during processing, each chunk contains a natural parcel of instruction or information.

As an example, consider an imperative style code fragment that selects only those lines of text that match a criterion 'isCond()':

      #*---------- Imperative style line selection ------------#
      selected = []                 # temp list to hold matches
      fp = open(filename)
      for line in fp.readlines():   # Py2.2 -> "for line in fp:"
          if isCond(line):          # (2.2 version reads lazily)
              selected.append(line)
      del line                      # Cleanup transient variable

There is nothing -wrong- with these few lines (see [xreadlines] on efficiency issues). But it does take a few seconds to read through them. In my opinion, even this small block of lines does not parse as a -single thought-, even though its operation really is such.
Also, the variable 'line' is slightly superfluous (it retains a value as a side effect after the loop, and it could conceivably step on a previously defined value). In FP style, we could write the simpler:

      #*---------- Functional style line selection ------------#
      selected = filter(isCond, open(filename).readlines())
      # Py2.2 -> filter(isCond, open(filename))

In the concrete, a textual source that one frequently wants to process as a list of lines is a log file. All sorts of applications produce log files, most typically either ones that cause system changes that might need to be examined or long-running applications that perform actions intermittently. For example, the PythonLabs Windows installer for Python 2.2 produces a file called 'INSTALL.LOG' that contains a list of actions taken during the install. Below is a highly abridged copy of this file from one of my computers:

      #------------ INSTALL.LOG sample data file --------------#
      Title: Python 2.2
      Source: C:\DOWNLOAD\PYTHON-2.2.EXE | 02-23-2002 | 01:40:54 | 7074248
      Made Dir: D:\Python22
      File Copy: D:\Python22\UNWISE.EXE | 05-24-2001 | 12:59:30 | | ...
      RegDB Key: Software\Microsoft\Windows\CurrentVersion\Uninstall\Py...
      RegDB Val: Python 2.2
      File Copy: D:\Python22\w9xpopen.exe | 12-21-2001 | 12:22:34 | | ...
      Made Dir: D:\PYTHON22\DLLs
      File Overwrite: C:\WINDOWS\SYSTEM\MSVCRT.DLL | | | | 295000 | 770c8856
      RegDB Root: 2
      RegDB Key: Software\Microsoft\Windows\CurrentVersion\App Paths\Py...
      RegDB Val: D:\PYTHON22\Python.exe
      Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Uninstall Py...
      Link Info: D:\Python22\UNWISE.EXE | D:\PYTHON22 | | 0 | 1 | 0 |
      Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Python ...
      Link Info: D:\Python22\python.exe | D:\PYTHON22 | D:\PYTHON22\...

You can see that each action recorded belongs to one of several types. A processing application would presumably handle each type of action differently (especially since each action has different data fields associated with it). It is easy enough to write Boolean functions that identify line types, for example:

      #*------- Boolean "predicative" functions on lines -------#
      def isFileCopy(line):
          return line[:10]=='File Copy:'   # or line.startswith(...)
      def isFileOverwrite(line):
          return line[:15]=='File Overwrite:'

The string method `"".startswith()` is less error prone than an initial slice for recent Python versions, but these examples are compatible with Python 1.5. In a slightly more compact functional programming style, you can also write these as:

      #*----------- Functional style predicates ---------------#
      isRegDBRoot = lambda line: line[:11]=='RegDB Root:'
      isRegDBKey  = lambda line: line[:10]=='RegDB Key:'
      isRegDBVal  = lambda line: line[:10]=='RegDB Val:'

Selecting lines of a certain type is done exactly as above:

      #*--------- Select lines that fulfill predicate ---------#
      lines = open(r'd:\python22\install.log').readlines()
      regroot_lines = filter(isRegDBRoot, lines)

But if you want to select upon multiple criteria, an FP style can initially become cumbersome. For example, suppose you are interested in all the "RegDB" lines; you could write a new custom function for this filter:

      #*--------------- Find the RegDB lines ------------------#
      def isAnyRegDB(line):
          if   line[:11]=='RegDB Root:': return 1
          elif line[:10]=='RegDB Key:':  return 1
          elif line[:10]=='RegDB Val:':  return 1
          else:                          return 0
      # For recent Pythons, line.startswith(...) is better

Programming a custom function for each combined condition can produce a glut of named functions.
More importantly, each such custom function requires a modicum of work to write and has a nonzero chance of introducing a bug. For conditions that should be jointly satisfied, you can either write custom functions or nest several filters within each other. For example:

      #*------------- Filter on two line predicates -----------#
      shortline = lambda line: len(line) < 25
      short_regvals = filter(shortline, filter(isRegDBVal, lines))

In this example, we rely on previously defined functions for the filter. Any error in the filters will be in either 'shortline()' or 'isRegDBVal()', but not independently in some third function 'isShortRegVal()'. Such nested filters, however, are difficult to read--especially if more than two are involved.

Calls to `map()` are sometimes similarly nested if several operations are to be performed on the same string. For a fairly trivial example, suppose you wished to reverse, capitalize, and normalize whitespace in lines of text. Creating the support functions is straightforward, and they could be nested in `map()` calls:

      #*------------ Multiple line transformations ------------#
      from string import upper, join, split
      def flip(s):
          a = list(s)
          a.reverse()
          return join(a, '')
      normalize = lambda s: join(split(s), ' ')
      cap_flip_norms = map(upper, map(flip, map(normalize, lines)))

This type of `map()` or `filter()` nest is difficult to read and should be avoided. Moreover, one can sometimes be drawn into nesting alternating `map()` and `filter()` calls, making matters still worse. For example, suppose you want to perform several operations on each of the lines that meet several criteria. To avoid this trap, many programmers fall back to a more verbose imperative coding style that simply wraps the lists in a few loops and creates some temporary variables for intermediate results.

Within a functional programming style, it is nonetheless possible to avoid the pitfall of excessive call nesting. The key to doing this is an intelligent selection of a few combinatorial -higher-order functions-. In general, a higher-order function is one that takes as argument or returns as result a function object. First-order functions just take some data as arguments and produce a datum as an answer (perhaps a data structure like a list or dictionary). In contrast, the "inputs" and "outputs" of a HOF are themselves function objects--ones generally intended to be eventually called somewhere later in the program flow.

One example of a higher-order function is a -function factory-: a function (or class) that returns a function, or collection of functions, that are somehow "configured" at the time of their creation. The "Hello World" of function factories is an "adder" factory. Like "Hello World," an adder factory exists just to show what can be done; it doesn't really -do- anything useful by itself. Pretty much every explanation of function factories uses an example such as:

      >>> def adder_factory(n):
      ...     return lambda m, n=n: m+n
      ...
      >>> add10 = adder_factory(10)
      >>> add10
      <function <lambda> at 0x00FB0020>
      >>> add10(4)
      14
      >>> add10(20)
      30
      >>> add5 = adder_factory(5)
      >>> add5(4)
      9

For text processing tasks, simple function factories are of less interest than are -combinatorial- HOFs. The idea of a combinatorial higher-order function is to take several (usually first-order) functions as arguments and return a new function that somehow synthesizes the operations of the argument functions. A minimal sketch of the idea appears below, before the fuller library.
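To make the idea concrete, here is one such combinator written out with a plain 'def', using the same default-argument closure trick as 'adder_factory()' and the 'shortline' and 'isRegDBVal' predicates defined earlier. This sketch shortcuts the way the Python `and` operator does; the library that follows names this variant 'and_()' and provides non-shortcutting versions as well:

      #*-------- Minimal combinatorial HOF (sketch) ---------#
      def and_(f, g):
          # Return a NEW predicate, true just where both
          # 'f' and 'g' are true of the same argument
          return lambda x, f=f, g=g: f(x) and g(x)

      # One pass, one combined predicate--no nested filters
      short_regvals = filter(and_(shortline, isRegDBVal), lines)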
Below is a simple library of combinatorial higher-order functions that achieve surprisingly much in a small number of lines:

      #------------------- combinatorial.py -------------------#
      from operator import mul, add, truth
      apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
      bools = lambda lst: map(truth, lst)
      bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
      conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
      all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
      both = lambda f,g: all((f,g))
      all3 = lambda f,g,h: all((f,g,h))
      and_ = lambda f,g: lambda x, f=f, g=g: f(x) and g(x)
      disjoin = lambda fns, args=[]: reduce(add, bool_each(fns, args))
      some = lambda fns: lambda arg, fns=fns: disjoin(fns, (arg,))
      either = lambda f,g: some((f,g))
      anyof3 = lambda f,g,h: some((f,g,h))
      compose = lambda f,g: lambda x, f=f, g=g: f(g(x))
      compose3 = lambda f,g,h: lambda x, f=f, g=g, h=h: f(g(h(x)))
      ident = lambda x: x

Even at just over a dozen lines, many of these combinatorial functions are merely convenience functions that wrap other, more general ones. Let us take a look at how we can use these HOFs to simplify some of the earlier examples. The same names are used for results, so look above for comparisons:

      #----- Some examples using higher-order functions -----#
      # Don't nest filters, just produce func that does both
      short_regvals = filter(both(shortline, isRegDBVal), lines)
      # Don't multiply ad hoc functions, just describe need
      regroot_lines = \
          filter(some([isRegDBRoot, isRegDBKey, isRegDBVal]), lines)
      # Don't nest transformations, make one combined transform
      capFlipNorm = compose3(upper, flip, normalize)
      cap_flip_norms = map(capFlipNorm, lines)

In the example, we bind the composed function 'capFlipNorm' for readability. The corresponding `map()` line expresses just the -single thought- of applying a common operation to all the lines. But the binding also illustrates some of the flexibility of combinatorial functions. By condensing the several operations previously nested in several `map()` calls, we can save the combined operation for reuse elsewhere in the program.

As a rule of thumb, I recommend not using more than one `filter()` and one `map()` in any given line of code. If these "list application" functions need to nest more deeply than this, readability is preserved by saving results to intermediate names. Successive lines of such functional programming style calls themselves revert to a more imperative style--but a wonderful thing about Python is the degree to which it allows seamless combinations of different programming styles. For example:

      #*------ Limit nesting depth of map()/filter() ------#
      intermed = filter(niceProperty, map(someTransform, lines))
      final = map(otherTransform, intermed)

Any nesting of successive `filter()` or `map()` calls, however, can be reduced to single functions using the proper combinatorial HOFs. Therefore, the number of procedural steps needed is pretty much always quite small. However, the reduction in total lines-of-code is offset by the lines used for giving names to combinatorial functions. Overall, FP style code is usually about one-half the length of imperative style equivalents (and fewer lines generally mean correspondingly fewer bugs).

A nice feature of combinatorial functions is that they can provide a complete Boolean algebra for functions that have not been called yet (the use of `operator.add` and `operator.mul` in 'combinatorial.py' is more than accidental, in that sense).
For example, with a collection of simple values, you might express a (complex) relation of multiple truth values as:

      #*---------- Simple Boolean algebra of values ----------#
      satisfied = (this or that) and (foo or bar)

In the case of text processing on chunks of text, these truth values are often the results of predicative functions applied to a chunk:

      #*---------- Boolean algebra of return values ----------#
      satisfied = (thisP(s) or thatP(s)) and (fooP(s) or barP(s))

In an expression like the one above, several predicative functions are applied to the same string (or other object), and a set of logical relations on the results are evaluated. But this expression is itself a logical predicate of the string. For naming clarity--and especially if you wish to evaluate the same predicate more than once--it is convenient to create an actual function expressing the predicate:

      #*------ Boolean algebra of composed functions ------#
      satisfiedP = both(either(thisP,thatP), either(fooP,barP))

Using a predicative function created with combinatorial techniques is the same as using any other function:

      #*------ Use of a compositional Boolean function ------#
      selected = filter(satisfiedP, lines)

EXERCISE: More on combinatorial functions
--------------------------------------------------------------------

The module 'combinatorial.py' presented above provides some of the most commonly useful combinatorial higher-order functions. But there is room for enhancement in the brief example. Creating a personal or organization library of useful HOFs is a way to improve the reusability of your current text processing libraries.

QUESTIONS:

1. Some of the functions defined in 'combinatorial.py' are not, strictly speaking, combinatorial. In a precise sense, a combinatorial function should take one or several functions as arguments and return one or more function objects that "combine" the input arguments. Identify which functions are not "strictly" combinatorial, and determine exactly what type of thing each one -does- return.

2. The functions 'both()' and 'and_()' do almost the same thing. But they differ in an important, albeit subtle, way. 'and_()', like the Python operator `and`, uses -shortcutting- in its evaluation. Consider these lines:

      >>> f = lambda n: n**2 > 10
      >>> g = lambda n: 100/n > 10
      >>> and_(f,g)(5)
      1
      >>> both(f,g)(5)
      1
      >>> and_(f,g)(0)
      0
      >>> both(f,g)(0)
      Traceback (most recent call last):
      ...

   The shortcutting 'and_()' can potentially allow the first function to act as a "guard" for the second one. The second function never gets called if the first function returns a false value on a given argument.

   a. Create a similarly shortcutting combinatorial 'or_()' function for your library.

   b. Create general shortcutting functions 'shortcut_all()' and 'shortcut_some()' that behave similarly to the functions 'all()' and 'some()', respectively.

   c. Describe some situations where nonshortcutting combinatorial functions like 'both()', 'all()', or 'anyof3()' are more desirable than similar shortcutting functions.

3. The function 'ident()' would appear to be pointless, since it simply returns whatever value is passed to it. In truth, 'ident()' is an almost indispensable function for a combinatorial collection. Explain the significance of 'ident()'. Hint: Suppose you have a list of lines of text, where some of the lines may be empty strings. What filter can you apply to find all the lines that start with a '#'?

4. The function 'not_()' might make a nice addition to a combinatorial library.
We could define this function as:

      >>> not_ = lambda f: lambda x, f=f: not f(x)

   Explore some situations where a 'not_()' function would aid combinatoric programming.

5. The function 'apply_each()' is used in 'combinatorial.py' to build some other functions. But the utility of 'apply_each()' is more general than its supporting role might suggest. A trivial usage of 'apply_each()' might look something like:

      >>> apply_each(map(adder_factory, range(5)), (10,))
      [10, 11, 12, 13, 14]

   Explore some situations where 'apply_each()' simplifies applying multiple operations to a chunk of text.

6. Unlike the functions 'all()' and 'some()', the functions 'compose()' and 'compose3()' take a fixed number of input functions as arguments. Create a generalized composition function that takes a list of input functions, of any length, as an argument.

7. What other combinatorial higher-order functions that have not been discussed here are likely to prove useful in text processing? Consider other ways of combining first-order functions into useful operations, and add these to your library. What are good names for these enhanced HOFs?

TOPIC -- Specializing Python Datatypes
--------------------------------------------------------------------

Python comes with an excellent collection of standard datatypes--Appendix A discusses each built-in type. At the same time, an important principle of Python programming makes types less important than programmers coming from other languages tend to expect. According to Python's "principle of pervasive polymorphism" (my own coinage), it is more important what an object -does- than what it -is-. Another common way of putting the principle is: if it walks like a duck and quacks like a duck, treat it like a duck.

Broadly, the idea behind polymorphism is letting the same function or operator work on things of different types. In C++ or Java, for example, you might use signature-based method overloading to let an operation apply to several types of things (acting differently as needed). For example:

      #------------ C++ signature-based polymorphism -----------#
      #include <stdio.h>
      class Print {
      public:
        void print(int i)    { printf("int %d\n", i); }
        void print(double d) { printf("double %f\n", d); }
        void print(float f)  { printf("float %f\n", f); }
      };
      main() {
        Print *p = new Print();
        p->print(37);     /* --> "int 37" */
        p->print(37.0);   /* --> "double 37.000000" */
      }

The most direct Python translation of signature-based overloading is a function that performs type checks on its argument(s). It is simple to write such functions:

      #------- Python "signature-based" polymorphism -----------#
      def Print(x):
          from types import *
          if   type(x) is FloatType: print "float", x
          elif type(x) is IntType:   print "int", x
          elif type(x) is LongType:  print "long", x

Writing signature-based functions, however, is extremely un-Pythonic. If you find yourself performing these sorts of explicit type checks, you have probably not understood the problem you want to solve correctly! What you -should- (usually) be interested in is not what type 'x' is, but rather whether 'x' can perform the action you need it to perform (regardless of what type of thing it strictly is).

PYTHONIC POLYMORPHISM:

Probably the single most common case where pervasive polymorphism is useful is in identifying "file-like" objects. There are many objects that can do things that files can do, such as those created with [urllib], [cStringIO], [zipfile], and by other means.
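For instance, the small sketch below passes both a true file and an in-memory buffer to the same function. The helper name 'count_lines()' is purely illustrative (it is not from this book's libraries); the point is that the function touches only the '.read()' capability, so anything with a working '.read()' will do:

      #*---- Capability-based use of "file-like" objects ----#
      from cStringIO import StringIO

      def count_lines(src):
          # 'src' needs only a .read() method; it need not BE a file
          return len(src.read().split('\n'))

      print count_lines(open('INSTALL.LOG'))      # a true file (if present)
      print count_lines(StringIO('one\ntwo\n3'))  # an in-memory buffer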
Various objects can perform only subsets of what actual files can: some can read, others can write, still others can seek, and so on. But for many purposes, you have no need to exercise every "file-like" capability--it is good enough to make sure that a specified object has those capabilities you actually need.

Here is a typical example. I have a module that uses DOM to work with XML documents; I would like users to be able to specify an XML source in any of several ways: using the name of an XML file, passing a file-like object that contains XML, or indicating an already-built DOM object to work with (built with any of several XML libraries). Moreover, future users of my module may get their XML from novel places I have not even thought of (an RDBMS, over sockets, etc.). By looking at what a candidate object can -do-, I can just utilize whichever capabilities that object -has-:

      #-------- Python capability-based polymorphism -----------#
      from types import StringType, UnicodeType
      def toDOM(xml_src=None):
          from xml.dom import minidom
          if hasattr(xml_src, 'documentElement'):
              return xml_src    # it is already a DOM object
          elif hasattr(xml_src, 'read'):
              # it is something that knows how to read data
              return minidom.parseString(xml_src.read())
          elif type(xml_src) in (StringType, UnicodeType):
              # it is a filename of an XML document
              xml = open(xml_src).read()
              return minidom.parseString(xml)
          else:
              raise ValueError, "Must be initialized with " +\
                                "filename, file-like object, or DOM object"

Even simple-seeming numeric types have varying capabilities. As with other objects, you should not usually care about the internal representation of an object, but rather about what it can do. Of course, as one way to assure that an object has a capability, it is often appropriate to coerce it to a type using the built-in functions `complex()`, `dict()`, `float()`, `int()`, `list()`, `long()`, `str()`, `tuple()`, and `unicode()`. All of these functions make a good effort to transform anything that looks a little bit like the type of thing they name into a true instance of it. It is usually not necessary, however, actually to transform values to prescribed types; again we can just check capabilities.

For example, suppose that you want to remove the "least significant" portion of any number--perhaps because the numbers represent measurements of limited accuracy. For whole numbers--ints or longs--you might mask out some low-order bits; for fractional values you might round to a given precision. Rather than testing value types explicitly, you can look for numeric capabilities. One common way to test a capability in Python is to -try- to do something, and catch any exceptions that occur (then try something else). Below is a simple example:

      #----------- Checking what numbers can do ---------------#
      def approx(x):                 # int attributes require 2.2+
          if hasattr(x, '__and__'):  # supports bitwise-and
              return x & ~0x0FL
          try:                       # supports real/imag
              return (round(x.real,2)+round(x.imag,2)*1j)
          except AttributeError:
              return round(x,2)

ENHANCED OBJECTS:

The reason that the principle of pervasive polymorphism matters is that Python makes it easy to create new objects that behave mostly--but not exactly--like basic datatypes. File-like objects were already mentioned as examples; you may or may not think of a file object as a datatype precisely. But even basic datatypes like numbers, strings, lists, and dictionaries can be easily specialized and/or emulated. There are two details to pay attention to when emulating basic datatypes.
The most important matter to understand is that the capabilities of an object--even those utilized with syntactic constructs--are generally implemented by its "magic" methods, each named with leading and trailing double underscores. Any object that has the right magic methods can act like a basic datatype in those contexts that use the supplied methods. At heart, a basic datatype is just an object with some well-optimized versions of the right collection of magic methods.

The second detail concerns exactly how you get at the magic methods--or rather, how best to make use of existing implementations. There is nothing stopping you from writing your own version of any basic datatype, except for the piddling details of doing so. However, there are quite a few such details, and the easiest way to get the functionality you want is to specialize an existing class. Under all non-ancient versions of Python, the standard library provides the pure-Python modules [UserDict], [UserList], and [UserString] as starting points for custom datatypes. You can inherit from an appropriate parent class and specialize (magic) methods as needed. No sample parents are provided for tuples, ints, floats, and the rest, however.

Under Python 2.2 and above, a better option is available. "New-style" Python classes let you inherit from the underlying C implementations of all the Python basic datatypes. Moreover, these parent classes have become the self-same callable objects that are used to coerce types and construct objects: `int()`, `list()`, `unicode()`, and so on. There is a good deal of arcana and subtle profundity that accompanies new-style classes, but you generally do not need to worry about it. All you need to know is that a class that inherits from 'str' is faster than one that inherits from [UserString]; likewise for 'list' versus [UserList] and 'dict' versus [UserDict] (assuming your scripts all run on a recent enough version of Python).

Custom datatypes, however, need not specialize full-fledged implementations. You are free to create classes that implement "just enough" of the interface of a basic datatype to be used for a given purpose. Of course, in practice, the reason you would create such custom datatypes is either because you want them to contain non-magic methods of their own or because you want them to implement the magic methods associated with multiple basic datatypes. For example, below is a custom datatype that can be passed to the prior 'approx()' function, and that also provides a (slightly) useful custom method:

      >>> class I:   # "Fuzzy" integer datatype
      ...     def __init__(self, i):  self.i = i
      ...     def __and__(self, i):   return self.i & i
      ...     def err_range(self):
      ...         lbound = approx(self.i)
      ...         return "Value: [%d, %d)" % (lbound, lbound+0x0F)
      ...
      >>> i1, i2 = I(29), I(20)
      >>> approx(i1), approx(i2)
      (16L, 16L)
      >>> i2.err_range()
      'Value: [16, 31)'

Despite supporting an extra method and being able to get passed into the 'approx()' function, 'I' is not a very versatile datatype. If you try to add, or divide, or multiply using "fuzzy integers," you will raise a 'TypeError'. Since there is no module called [UserInt], under an older Python version you would need to implement every needed magic method yourself.

Using new-style classes in Python 2.2+, you could derive a "fuzzy integer" from the underlying 'int' datatype. A partial implementation could look like:

      >>> class I2(int):   # New-style fuzzy integer
      ...     def __add__(self, j):
      ...         vals = map(int, [approx(self), approx(j)])
      ...         k = int.__add__(*vals)
      ...         return I2(int.__add__(k, 0x0F))
      ...     def err_range(self):
      ...         lbound = approx(self)
      ...         return "Value: [%d, %d)" % (lbound, lbound+0x0F)
      ...
      >>> i1, i2 = I2(29), I2(20)
      >>> print "i1 =", i1.err_range(), ": i2 =", i2.err_range()
      i1 = Value: [16, 31) : i2 = Value: [16, 31)
      >>> i3 = i1 + i2
      >>> print i3, type(i3)
      47 <class '__main__.I2'>

Since the new-style class 'int' already supports bitwise-and, there is no need to implement it again. With new-style classes, you refer to data values directly with 'self', rather than as an attribute that holds the data (e.g., 'self.i' in class 'I'). As well, it is generally unsafe to use syntactic operators within magic methods that define their operation; for example, I utilize the '.__add__()' method of the parent 'int' rather than the '+' operator in the 'I2.__add__()' method.

In practice, you are less likely to want to create number-like datatypes than you are to emulate container types. But it is worth understanding just how and why even plain integers are a fuzzy concept in Python (the fuzziness of the concept is of a different sort than the fuzziness of 'I2' integers, though). Even a function that operates on whole numbers need not operate on objects of 'IntType' or 'LongType'--just on an object that satisfies the desired protocols.

TOPIC -- Base Classes for Datatypes
--------------------------------------------------------------------

There are several magic methods that are often useful to define for -any- custom datatype. In fact, these methods are useful even for classes that do not really define datatypes (in some sense, every object is a datatype since it can contain attribute values, but not every object supports special syntax such as arithmetic operators and indexing). Not quite every magic method that you can define is documented in this book, but most are, under the parent datatype to which each is most relevant. Moreover, each new version of Python has introduced a few additional magic methods; those covered either have been around for a few versions or are particularly important.

In documenting class methods of base classes, the same general conventions are used as for documenting module functions. The one special convention for these base class methods is the use of 'self' as the first argument to all methods. Since the name 'self' is purely arbitrary, this convention is less special than it might appear. For example, both of the following uses of 'self' are equally legal:

      >>> import string
      >>> self = 'spam'
      >>> object.__repr__(self)
      '<str object at 0x38f120>'
      >>> string.upper(self)
      'SPAM'

However, there is usually little reason to use class methods in place of perfectly good built-in and module functions with the same purpose. Normally, these methods of datatype classes are used only in child classes that override the base classes, as in:

      >>> class UpperObject(object):
      ...     def __repr__(self):
      ...         return object.__repr__(self).upper()
      ...
      >>> uo = UpperObject()
      >>> print uo
      <__MAIN__.UPPEROBJECT OBJECT AT 0X1C2C6C>

=================================================================
  BUILTIN -- object : Ancestor class for new-style datatypes
=================================================================

Under Python 2.2+, 'object' has become a base for new-style classes. Inheriting from 'object' enables a custom class to use a few new capabilities, such as slots and properties. But usually if you are interested in creating a custom datatype, it is better to inherit from a child of 'object', such as 'list', 'float', or 'dict'. A short sketch of the property capability appears below.
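As a minimal sketch of the properties just mentioned (the 'Celsius' class is purely illustrative, not a class used elsewhere in this book), a computed, read-only attribute can be attached to any 'object' child under Python 2.2+:

      >>> class Celsius(object):
      ...     def __init__(self, c):  self.c = c
      ...     def _get_f(self):       return self.c * 9.0/5 + 32
      ...     fahrenheit = property(_get_f)   # computed attribute
      ...
      >>> Celsius(100).fahrenheit
      212.0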
METHODS:

object.__eq__(self, other)
    Return a Boolean comparison between 'self' and 'other'. Determines how a datatype responds to the '==' operator. The parent class 'object' does not implement '.__eq__()' since by default object equality means the same thing as identity (the 'is' operator). A child is free to implement this in order to affect comparisons.

object.__ne__(self, other)
    Return a Boolean comparison between 'self' and 'other'. Determines how a datatype responds to the '!=' and '<>' operators. The parent class 'object' does not implement '.__ne__()' since by default object inequality means the same thing as nonidentity (the 'is not' operator). Although it might seem that equality and inequality always return opposite values, the methods are not explicitly defined in terms of each other. You could force the relationship with:

      >>> class EQ(object):
      ...     # Abstract parent class for equality classes
      ...     def __eq__(self, o): return not self <> o
      ...     def __ne__(self, o): return not self == o
      ...
      >>> class Comparable(EQ):
      ...     # By def'ing inequality, get equality (or vice versa)
      ...     def __ne__(self, other):
      ...         return someComplexComparison(self, other)

object.__nonzero__(self)
    Return a Boolean value for an object. Determines how a datatype responds to the Boolean operators 'or', 'and', and 'not', and to 'if' and 'filter(None,...)' tests. An object whose '.__nonzero__()' method returns a true value is itself treated as a true value.

object.__len__(self)
len(object)
    Return an integer representing the "length" of the object. For collection types, this is fairly straightforward--how many objects are in the collection? Custom types may change the behavior to some other meaningful value.

object.__repr__(self)
repr(object)
object.__str__(self)
str(object)
    Return a string representation of the object 'self'. Determines how a datatype responds to the `repr()` and `str()` built-in functions, to the 'print' keyword, and to the back-tick operator. Where feasible, it is desirable to have the '.__repr__()' method return a representation with sufficient information in it to reconstruct an identical object. The goal here is to fulfill the equality 'obj==eval(repr(obj))'. In many cases, however, you cannot encode sufficient information in a string, and the 'repr()' of an object is either identical to, or slightly more detailed than, the 'str()' representation of the same object.

    SEE ALSO, [repr], [operator]

=================================================================
  BUILTIN -- file : New-style base class for file objects
=================================================================

Under Python 2.2+, it is possible to create a custom file-like object by inheriting from the built-in class 'file'. In older Python versions you may only create file-like objects by defining the methods that make an object "file-like." However, even in recent versions of Python, inheritance from 'file' buys you little--if the data contents come from somewhere other than a native filesystem, you will have to reimplement every method you wish to support.

Even more than for other object types, what makes an object file-like is a fuzzy concept. Depending on your purpose, you may be happy with an object that can only read, or one that can only write. You may need to seek within the object, or you may be happy with a linear stream. In general, however, file-like objects are expected to read and write strings.
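For instance, here is a minimal sketch of a write-only file-like object--a 'Null' sink that simply discards whatever is printed to it (the class name is illustrative only):

      >>> class Null:                    # implements only .write()
      ...     def write(self, s): pass   # accept a string, discard it
      ...
      >>> import sys
      >>> sys.stdout = Null()            # 'print' now goes nowhere
      >>> print "invisible"
      >>> sys.stdout = sys.__stdout__    # restore the real stdout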
Custom classes need only implement those methods that are meaningful to them, and should only be used in contexts where their capabilities are sufficient.

In documenting the methods of file-like objects, I adopt a slightly different convention than for other built-in types. Since actually inheriting from 'file' is unusual, I use the capitalized name 'FILE' to indicate a general file-like object. Instances of the actual 'file' class are examples (and implement all the methods named), but other types of objects can be equally good 'FILE' instances.

BUILT-IN FUNCTIONS:

open(fname [,mode [,buffering]])
file(fname [,mode [,buffering]])
    Return a file object that attaches to the filename 'fname'. The optional argument 'mode' describes the capabilities and access style of the object. An 'r' mode is for reading; 'w' for writing (truncating any existing content); 'a' for appending (writing to the end). Each of these modes may also have the binary flag 'b' for platforms like Windows that distinguish text and binary files. The flag '+' may be used to allow both reading and writing. The argument 'buffering' may be 0 for none, 1 for line-oriented buffering, or a larger integer for a number of bytes.

      >>> open('tmp','w').write('spam and eggs\n')
      >>> print open('tmp','r').read(),
      spam and eggs
      >>> open('tmp','w').write('this and that\n')
      >>> print open('tmp','r').read(),
      this and that
      >>> open('tmp','a').write('something else\n')
      >>> print open('tmp','r').read(),
      this and that
      something else

METHODS AND ATTRIBUTES:

FILE.close()
    Close a file object. Reading and writing are disallowed after a file is closed.

FILE.closed
    Return a Boolean value indicating whether the file has been closed.

FILE.fileno()
    Return a file descriptor number for the file. File-like objects that do not attach to actual files should not implement this method.

FILE.flush()
    Write any pending data to the underlying file. File-like objects that do not cache data can still implement this method as 'pass'.

FILE.isatty()
    Return a Boolean value indicating whether the file is a TTY-like device. The standard documentation says that file-like objects that do not attach to actual files should not implement this method, but implementing it to always return '0' is probably a better approach.

FILE.mode
    Attribute containing the mode of the file, normally identical to the 'mode' argument passed to the object's initializer.

FILE.name
    The name of the file. For file-like objects without a filesystem name, some string identifying the object should be put into this attribute.

FILE.read([size=sys.maxint])
    Return a string containing up to 'size' bytes of content from the file. Stop the read if an EOF is encountered or upon another condition that makes sense for the object type. Move the file position forward, immediately past the read-in bytes. A negative 'size' argument is treated as the default value.

FILE.readline([size=sys.maxint])
    Return a string containing one line from the file, including the trailing newline, if any. A maximum of 'size' bytes are read. The file position is moved forward past the read. A negative 'size' argument is treated as the default value.

FILE.readlines([size=sys.maxint])
    Return a list of lines from the file, each line including its trailing newline. If the argument 'size' is given, limit the read to -approximately- 'size' bytes worth of lines. The file position is moved forward past the read-in bytes. A negative 'size' argument is treated as the default value.
FILE.seek(offset [,whence=0])
    Move the file position by 'offset' bytes (positive or negative). The argument 'whence' specifies where the initial file position is prior to the move: 0 for BOF; 1 for current position; 2 for EOF.

FILE.tell()
    Return the current file position.

FILE.truncate([size=0])
    Truncate the file contents (it becomes 'size' length).

FILE.write(s)
    Write the string 's' to the file, starting at the current file position. The file position is moved forward past the written bytes.

FILE.writelines(lines)
    Write the lines in the sequence 'lines' to the file. No newlines are added during the write. The file position is moved forward past the written bytes.

FILE.xreadlines()
    Memory-efficient iterator over lines in a file. In Python 2.2+, you might implement this as a generator that returns one line per each 'yield'.

    SEE ALSO, [xreadlines]

=================================================================
  BUILTIN -- int : New-style base class for integer objects
=================================================================
  BUILTIN -- long : New-style base class for long integers
=================================================================

In Python, there are two standard datatypes for representing integers. Objects of type 'IntType' have a fixed range that depends on the underlying platform--usually between plus and minus 2**31. Objects of type 'LongType' are unbounded in size. In Python 2.2+, operations on integers that exceed the range of an 'int' object result in automatic promotion to 'long' objects. However, no operation on a 'long' will demote the result back to an 'int' object (even if the result is of small magnitude)--with the exception of the `int()` function, of course.

From a user point of view, ints and longs provide exactly the same interface. The difference between them is only in the underlying implementation, with ints typically being significantly faster to operate on (since they use raw CPU instructions fairly directly).

Most of the magic methods integers have are shared by floating point numbers as well and are discussed below. For example, consult the discussion of `float.__mul__()` for information on the corresponding `int.__mul__()` method. The special capability that integers have over floating point numbers is their ability to perform bitwise operations.

Under Python 2.2+, you may create a custom datatype that inherits from 'int' or 'long'; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary bit operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic 'int' and a custom object, the custom right-associative method will be chosen over the basic operation. For example:

      >>> class I(int):
      ...     def __xor__(self, other):
      ...         return "XOR"
      ...     def __rxor__(self, other):
      ...         return "RXOR"
      ...
      >>> 0xFF ^ 0xFF
      0
      >>> 0xFF ^ I(0xFF)
      'RXOR'
      >>> I(0xFF) ^ 0xFF
      'XOR'
      >>> I(0xFF) ^ I(0xFF)
      'XOR'

METHODS:

int.__and__(self, other)
int.__rand__(self, other)
    Return a bitwise-and between 'self' and 'other'. Determines how a datatype responds to the '&' operator.

int.__hex__(self)
    Return a hex string representing 'self'. Determines how a datatype responds to the built-in `hex()` function.

int.__invert__(self)
    Return a bitwise inversion of 'self'.
    Determines how a datatype responds to the '~' operator.

int.__lshift__(self, other)
int.__rlshift__(self, other)
    Return the result of bit-shifting 'self' to the left by 'other' bits. The right-associative version shifts 'other' by 'self' bits. Determines how a datatype responds to the '<<' operator.

int.__oct__(self)
    Return an octal string representing 'self'. Determines how a datatype responds to the built-in `oct()` function.

int.__or__(self, other)
int.__ror__(self, other)
    Return a bitwise-or between 'self' and 'other'. Determines how a datatype responds to the '|' operator.

int.__rshift__(self, other)
int.__rrshift__(self, other)
    Return the result of bit-shifting 'self' to the right by 'other' bits. The right-associative version shifts 'other' by 'self' bits. Determines how a datatype responds to the '>>' operator.

int.__xor__(self, other)
int.__rxor__(self, other)
    Return a bitwise-xor between 'self' and 'other'. Determines how a datatype responds to the '^' operator.

    SEE ALSO, [float], `int`, `long`, `sys.maxint`, [operator]

=================================================================
  BUILTIN -- float : New-style base class for floating point numbers
=================================================================

Python floating point numbers are mostly implemented using the underlying C floating point library of your platform; that is, the implementation is based, to a greater or lesser degree, on the IEEE 754 standard. A complex number is just a Python object that wraps a pair of floats with a few extra operations on these pairs.

DIGRESSION:

Although the details are far outside the scope of this book, a general warning is in order. Floating point math is harder than you think! If you think you -understand- just how complex IEEE 754 math is, you are not yet aware of all of its subtleties. By way of indication, Python luminary and erstwhile professor of numeric computing Alex Martelli commented in 2001 (on <comp.lang.python>):

    Anybody who thinks he knows what he's doing when floating point is involved IS either naive, or Tim Peters (well, it COULD be W. Kahan I guess, but I don't think he writes here).

Fellow Python guru Tim Peters observed:

    I find it's possible to be both (wink). But *nothing* about fp comes easily to anyone, and even Kahan works his butt off to come up with the amazing things that he does.

Peters illustrated further by way of Donald Knuth (_The Art of Computer Programming_, Third Edition, Addison-Wesley, 1997; ISBN: 0201896842, vol. 2, p. 229):

    Many serious mathematicians have attempted to analyze a sequence of floating point operations rigorously, but found the task so formidable that they have tried to be content with plausibility arguments instead.

The trick about floating point numbers is that although they are extremely useful for representing real-life (fractional) quantities, operations on them do not obey the arithmetic rules we learned in middle school: associativity, transitivity, commutativity; moreover, many very ordinary-seeming numbers can be represented only approximately with floating point numbers. For example:

      >>> 1./3
      0.33333333333333331
      >>> .3
      0.29999999999999999
      >>> 7 == 7./25 * 25
      0
      >>> 7 == 7./24 * 24
      1

CAPABILITIES:

In the hierarchy of Python numeric types, floating point numbers are higher up the scale than integers, and complex numbers higher than floats. That is, operations on mixed types get promoted upwards (see the short session following this paragraph). However, the magic methods that make a datatype "float-like" are strictly a subset of those associated with integers.
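A minimal interactive sketch of this upward promotion (the exact float repr may differ across platforms):

      >>> 1 + 2.5                         # int + float --> float
      3.5
      >>> 2.5 + 1j                        # float + complex --> complex
      (2.5+1j)
      >>> type(1 + 2.5), type(2.5 + 1j)
      (<type 'float'>, <type 'complex'>)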
All of the magic methods listed below for floats apply equally to ints and longs (or to integer-like custom datatypes). Complex numbers support a few additional methods.

Under Python 2.2+, you may create a custom datatype that inherits from 'float' or 'complex'; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic datatype and a custom object, the custom right-associative method will be chosen over the basic operation. See the example under [int].

METHODS:

float.__abs__(self)
    Return the absolute value of 'self'. Determines how a datatype responds to the built-in function `abs()`.

float.__add__(self, other)
float.__radd__(self, other)
    Return the sum of 'self' and 'other'. Determines how a datatype responds to the '+' operator.

float.__cmp__(self, other)
    Return a value indicating the order of 'self' and 'other'. Determines how a datatype responds to the numeric comparison operators '<', '>', '<=', '>=', '==', '<>', and '!='. Also determines the behavior of the built-in `cmp()` function. Should return -1 for 'self<other', 0 for 'self==other', and 1 for 'self>other'. If other comparison methods are defined, they take precedence over '.__cmp__()': '.__ge__()', '.__gt__()', '.__le__()', and '.__lt__()'.

float.__div__(self, other)
float.__rdiv__(self, other)
    Return the ratio of 'self' and 'other'. Determines how a datatype responds to the '/' operator under classic division; once true division is in effect (it can be enabled in Python 2.2+ with 'from __future__ import division'), '/' is handled by '.__truediv__()' instead.

float.__divmod__(self, other)
float.__rdivmod__(self, other)
    Return the pair '(div, remainder)'. Determines how a datatype responds to the built-in `divmod()` function.

float.__floordiv__(self, other)
float.__rfloordiv__(self, other)
    Return the number of whole times 'other' goes into 'self'. Determines how a datatype responds to the Python 2.2+ floor division operator '//'.

float.__mod__(self, other)
float.__rmod__(self, other)
    Return the modulo remainder of dividing 'self' by 'other'. Determines how a datatype responds to the '%' operator.

float.__mul__(self, other)
float.__rmul__(self, other)
    Return the product of 'self' and 'other'. Determines how a datatype responds to the '*' operator.

float.__neg__(self)
    Return the negative of 'self'. Determines how a datatype responds to the unary '-' operator.

float.__pow__(self, other)
float.__rpow__(self, other)
    Return 'self' raised to the 'other' power. Determines how a datatype responds to the '**' operator.

float.__sub__(self, other)
float.__rsub__(self, other)
    Return the difference between 'self' and 'other'. Determines how a datatype responds to the binary '-' operator.

float.__truediv__(self, other)
float.__rtruediv__(self, other)
    Return the ratio of 'self' and 'other'. Determines how a datatype responds to the '/' operator when true division is in effect (via 'from __future__ import division' in Python 2.2+).

    SEE ALSO, [complex], [int], `float`, [operator]

=================================================================
  BUILTIN -- complex : New-style base class for complex numbers
=================================================================

Complex numbers implement all the above documented methods of floating point numbers, and a few additional ones.
Inequality operations on complex numbers are not supported in recent versions of Python, even though they were previously. In Python 2.1+, the methods `complex.__ge__()`, `complex.__gt__()`, `complex.__le__()`, and `complex.__lt__()` all raise 'TypeError' rather than return Boolean values indicating the order. There is a certain logic to this change, inasmuch as complex numbers do not have a "natural" ordering. But there is also significant breakage with this change--this is one of the few changes in Python, since version 1.4 when I started using it, that I feel was a real mistake. The important breakage comes when you want to sort a list of various things, some of which might be complex numbers:

      >>> lst = ["string", 1.0, 1, 1L, ('t','u','p')]
      >>> lst.sort()
      >>> lst
      [1.0, 1, 1L, 'string', ('t', 'u', 'p')]
      >>> lst.append(1j)
      >>> lst.sort()
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      TypeError: cannot compare complex numbers using <, <=, >, >=

It is true that there is no obvious correct ordering between a complex number and another number (complex or otherwise), but there is also no natural ordering between a string, a tuple, and a number. Nonetheless, it is frequently useful to sort a heterogeneous list in order to create a canonical (even if meaningless) order. In Python 2.2+, you can remedy this shortcoming of recent Python versions in the style below (under 2.1 you are largely out of luck):

      >>> class C(complex):
      ...     def __lt__(self, o):
      ...         if hasattr(o, 'imag'):
      ...             return (self.real,self.imag) < (o.real,o.imag)
      ...         else:
      ...             return self.real < o
      ...     def __le__(self, o): return self < o or self==o
      ...     def __gt__(self, o): return not (self==o or self < o)
      ...     def __ge__(self, o): return self > o or self==o
      ...
      >>> lst = ["str", 1.0, 1, 1L, (1,2,3), C(1+1j), C(2-2j)]
      >>> lst.sort()
      >>> lst
      [1.0, 1, 1L, (1+1j), (2-2j), 'str', (1, 2, 3)]

Of course, if you adopt this strategy, you have to create all of your complex values using the custom datatype 'C'. And unfortunately, unless you also override arithmetic operations, a binary operation between a 'C' object and another number reverts to a basic complex datatype. The reader can work out the details of this solution if she needs it.

METHODS:

complex.conjugate(self)
    Return the complex conjugate of 'self'. A quick refresher here: if 'self' is 'n+mj', its conjugate is 'n-mj'.

complex.imag
    Imaginary component of a complex number.

complex.real
    Real component of a complex number.

    SEE ALSO, [float], `complex`

=================================================================
  MODULE -- UserDict : Custom wrapper around dictionary objects
=================================================================
  BUILTIN -- dict : New-style base class for dictionary objects
=================================================================

Dictionaries in Python provide a well-optimized mapping between immutable objects and other Python objects (see Glossary entry on "immutable"). You may create custom datatypes that respond to various dictionary operations. There are a few syntactic operations associated with dictionaries, all involving indexing with square braces. But unlike with numeric datatypes, there are several regular methods that are reasonable to consider as part of the general interface for dictionary-like objects.

If you create a dictionary-like datatype by subclassing from `UserDict.UserDict`, all the special methods defined by the parent are proxies to the true dictionary stored in the object's '.data' member.
If, under Python 2.2+, you subclass from 'dict' itself, the object itself inherits dictionary behaviors. In either case, you may customize whichever methods you wish. Below is an example of the two styles for subclassing a dictionary-like datatype:

      >>> from sys import stderr
      >>> from UserDict import UserDict
      >>> class LogDictOld(UserDict):
      ...     def __setitem__(self, key, val):
      ...         stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
      ...         self.data[key] = val
      ...
      >>> ldo = LogDictOld()
      >>> ldo['this'] = 'that'
      Set: this->that
      >>> class LogDictNew(dict):
      ...     def __setitem__(self, key, val):
      ...         stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
      ...         dict.__setitem__(self, key, val)
      ...
      >>> ldn = LogDictNew()
      >>> ldn['this'] = 'that'
      Set: this->that

METHODS:

dict.__cmp__(self, other)
UserDict.UserDict.__cmp__(self, other)
    Return a value indicating the order of 'self' and 'other'. Determines how a datatype responds to the comparison operators '<', '>', '<=', '>=', '==', '<>', and '!='. Also determines the behavior of the built-in `cmp()` function. Should return -1 for 'self<other', 0 for 'self==other', and 1 for 'self>other'. If other comparison methods are defined, they take precedence over '.__cmp__()': '.__ge__()', '.__gt__()', '.__le__()', and '.__lt__()'.

dict.__contains__(self, x)
UserDict.UserDict.__contains__(self, x)
    Return a Boolean value indicating whether 'self' "contains" the value 'x'. By default, being contained in a dictionary means matching one of its keys, but you can change this behavior by overriding it (e.g., check whether 'x' is in a value rather than a key). Determines how a datatype responds to the 'in' operator.

dict.__delitem__(self, x)
UserDict.UserDict.__delitem__(self, x)
    Remove an item from a dictionary-like datatype. By default, removing an item means removing the pair whose key equals 'x'. Determines how a datatype responds to the 'del' statement, as in: 'del self[x]'.

dict.__getitem__(self, x)
UserDict.UserDict.__getitem__(self, x)
    By default, return the value associated with the key 'x'. Determines how a datatype responds to indexing with square braces. You may override this method to either search differently or return special values. For example:

      >>> class BagOfPairs(dict):
      ...     def __getitem__(self, x):
      ...         if self.has_key(x):
      ...             return (x, dict.__getitem__(self,x))
      ...         else:
      ...             tmp = dict([(v,k) for k,v in self.items()])
      ...             return (dict.__getitem__(tmp,x), x)
      ...
      >>> bop = BagOfPairs({'this':'that', 'spam':'eggs'})
      >>> bop['this']
      ('this', 'that')
      >>> bop['eggs']
      ('spam', 'eggs')
      >>> bop['bacon'] = 'sausage'
      >>> bop
      {'this': 'that', 'bacon': 'sausage', 'spam': 'eggs'}
      >>> bop['nowhere']
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
        File "<stdin>", line 7, in __getitem__
      KeyError: nowhere

dict.__len__(self)
UserDict.UserDict.__len__(self)
    Return the length of the dictionary. By default this is simply a count of the key/val pairs, but you could perform a different calculation if you wished (e.g., perhaps you would cache the size of a record set returned from a database query that emulated a dictionary). Determines how a datatype responds to the built-in `len()` function.

dict.__setitem__(self, key, val)
UserDict.UserDict.__setitem__(self, key, val)
    Set the dictionary key 'key' to value 'val'. Determines how a datatype responds to indexed assignment; that is, 'self[key]=val'. A custom version might actually perform some calculation based on 'val' and/or 'key' before adding an item.

dict.clear(self)
UserDict.UserDict.clear(self)
    Remove all items from 'self'.
dict.copy(self)
UserDict.UserDict.copy(self)
    Return a copy of the dictionary 'self' (i.e., a distinct object with the same items).

dict.get(self, key [,default=None])
UserDict.UserDict.get(self, key [,default=None])
    Return the value associated with the key 'key'. If no item with the key exists, return 'default' instead of raising a 'KeyError'.

dict.has_key(self, key)
UserDict.UserDict.has_key(self, key)
    Return a Boolean value indicating whether 'self' has the key 'key'.

dict.items(self)
UserDict.UserDict.items(self)
dict.iteritems(self)
UserDict.UserDict.iteritems(self)
    Return the items in a dictionary, in an unspecified order. The '.items()' method returns a true list of '(key,val)' pairs, while the '.iteritems()' method (in Python 2.2+) returns a generator object that successively yields items. The latter method is useful if your dictionary is not a true in-memory structure, but rather some sort of incremental query or calculation. Either method responds externally similarly to a 'for' loop:

      >>> d = {1:2, 3:4}
      >>> for k,v in d.iteritems(): print k,v,':',
      ...
      1 2 : 3 4 :
      >>> for k,v in d.items(): print k,v,':',
      ...
      1 2 : 3 4 :

dict.keys(self)
UserDict.UserDict.keys(self)
dict.iterkeys(self)
UserDict.UserDict.iterkeys(self)
    Return the keys in a dictionary, in an unspecified order. The '.keys()' method returns a true list of keys, while the '.iterkeys()' method (in Python 2.2+) returns a generator object.

    SEE ALSO, `dict.items()`

dict.popitem(self)
UserDict.UserDict.popitem(self)
    Return a '(key,val)' pair for the dictionary, or raise a 'KeyError' if the dictionary is empty. Removes the returned item from the dictionary. As with other dictionary methods, the order in which items are popped is unspecified (and can vary between versions and platforms).

dict.setdefault(self, key [,default=None])
UserDict.UserDict.setdefault(self, key [,default=None])
    If 'key' is currently in the dictionary, return the corresponding value. If 'key' is not currently in the dictionary, set 'self[key]=default', then return 'default'.

    SEE ALSO, `dict.get()`

dict.update(self, other)
UserDict.UserDict.update(self, other)
    Update the dictionary 'self' using the dictionary 'other'. If a key in 'other' already exists in 'self', the corresponding value from 'other' replaces the one in 'self'. If a '(key,val)' pair in 'other' is not in 'self', it is added.

dict.values(self)
UserDict.UserDict.values(self)
dict.itervalues(self)
UserDict.UserDict.itervalues(self)
    Return the values in a dictionary, in an unspecified order. The '.values()' method returns a true list of values, while the '.itervalues()' method (in Python 2.2+) returns a generator object.

    SEE ALSO, `dict.items()`

    SEE ALSO, `dict`, [list], [operator]

=================================================================
  MODULE -- UserList : Custom wrapper around list objects
=================================================================
  BUILTIN -- list : New-style base class for list objects
=================================================================
  BUILTIN -- tuple : New-style base class for tuple objects
=================================================================

A Python list is a (possibly) heterogeneous mutable sequence of Python objects. A tuple is a similar immutable sequence (see Glossary entry on "immutable"). Most of the magic methods of lists and tuples are the same, but a tuple does not have those methods associated with internal transformation, as the short session below illustrates.
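For instance (a minimal illustration; any mutating method behaves the same way):

      >>> lst, tup = [1,2,3], (1,2,3)
      >>> lst.append(4)          # lists may be transformed in place
      >>> lst
      [1, 2, 3, 4]
      >>> tup.append(4)          # tuples have no such methods
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      AttributeError: 'tuple' object has no attribute 'append'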
If you create a list-like datatype by subclassing from `UserList.UserList`, all the special methods defined by the parent are proxies to the true list stored in the object's '.data' member. If, under Python 2.2+, you subclass from 'list' (or 'tuple') itself, the object itself inherits list (tuple) behaviors. In either case, you may customize whichever methods you wish. The discussion of [dict] and [UserDict] shows an example of the different styles of specialization.

The difference between a list-like object and a tuple-like object runs less deep than you might think. Mutability is only really important for using objects as dictionary keys, but dictionaries only check the mutability of an object by examining the return value of an object's '.__hash__()' method. If this method fails to return an integer, an object is considered mutable (and ineligible to serve as a dictionary key). The reason that tuples are useful as keys is that every tuple composed of the same items has the same hash; two lists (or dictionaries), by contrast, may also have the same items, but only as a passing matter (since either can be changed).

You can easily give a hash value to a list-like datatype. However, there is an obvious and wrong way to do so:

>>> class L(list):
...     __hash__ = lambda self: hash(tuple(self))
...
>>> lst = L([1,2,3])
>>> dct = {lst:33, 7:8}
>>> print dct
{[1, 2, 3]: 33, 7: 8}
>>> dct[lst]
33
>>> lst.append(4)
>>> print dct
{[1, 2, 3, 4]: 33, 7: 8}
>>> dct[lst]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: [1, 2, 3, 4]

As soon as 'lst' changes, its hash changes, and you cannot reach the dictionary item keyed to it. What you need is something that does not change as the object changes:

>>> class L(list):
...     __hash__ = lambda self: id(self)
...
>>> lst = L([1,2,3])
>>> dct = {lst:33, 7:8}
>>> dct[lst]
33
>>> lst.append(4)
>>> dct
{[1, 2, 3, 4]: 33, 7: 8}
>>> dct[lst]
33

As with most everything about Python datatypes and operations, mutability is merely a protocol that you can choose to support or not support in your custom datatypes. Sequence datatypes may choose to support order comparisons--in fact they probably should. The methods '.__cmp__()', '.__ge__()', '.__gt__()', '.__le__()', and '.__lt__()' have the same meanings for sequences that they do for other datatypes; see [operator], [float], and [dict] for details.

METHODS:

list.__add__(self, other)
UserList.UserList.__add__(self, other)
tuple.__add__(self, other)
list.__iadd__(self, other)
UserList.UserList.__iadd__(self, other)

Determine how a datatype responds to the '+' and '+=' operators. Augmented assignments ("in-place add") are supported in Python 2.0+. For list-like datatypes, normally the statements 'lst+=other' and 'lst=lst+other' have the same effect, but the augmented version might be more efficient.

Under standard meaning, addition of the two sequence objects produces a new (distinct) sequence object with all the items in both 'self' and 'other'. An in-place add ('.__iadd__') mutates the left-hand object without creating a new object. A custom datatype might choose to give a special meaning to addition, perhaps depending on the datatype of the object added in. For example:

>>> class XList(list):
...     def __iadd__(self, other):
...         if issubclass(other.__class__, list):
...             return list.__iadd__(self, other)
...         else:
...             from operator import add
...             return map(add, self, [other]*len(self))
...
>>> xl = XList([1,2,3])
>>> xl += [4,5,6]
>>> xl
[1, 2, 3, 4, 5, 6]
>>> xl += 10
>>> xl
[11, 12, 13, 14, 15, 16]

list.__contains__(self, x)
UserList.UserList.__contains__(self, x)
tuple.__contains__(self, x)

Return a Boolean value indicating whether 'self' contains the value 'x'. Determines how a datatype responds to the 'in' operator.

list.__delitem__(self, x)
UserList.UserList.__delitem__(self, x)

Remove an item from a list-like datatype. Determines how a datatype responds to the 'del' statement, as in 'del self[x]'.

list.__delslice__(self, start, end)
UserList.UserList.__delslice__(self, start, end)

Remove a range of items from a list-like datatype. Determines how a datatype responds to the 'del' statement applied to a slice, as in 'del self[start:end]'.

list.__getitem__(self, pos)
UserList.UserList.__getitem__(self, pos)
tuple.__getitem__(self, pos)

Return the value at offset 'pos' in the list. Determines how a datatype responds to indexing with square braces. The default behavior on list indices is to raise an 'IndexError' for nonexistent offsets.

list.__getslice__(self, start, end)
UserList.UserList.__getslice__(self, start, end)
tuple.__getslice__(self, start, end)

Return a subsequence of the sequence 'self'. Determines how a datatype responds to indexing with a slice parameter, as in 'self[start:end]'.

list.__hash__(self)
UserList.UserList.__hash__(self)
tuple.__hash__(self)

Return an integer that distinctly identifies an object. Determines how a datatype responds to the built-in `hash()` function--and probably more importantly the hash is used internally in dictionaries. By default, tuples (and other immutable types) will return hash values but lists will raise a 'TypeError'. Dictionaries will handle hash collisions gracefully, but it is best to try to make hashes unique per object.

>>> hash(219750523), hash((1,2))
(219750523, 219750523)
>>> dct = {219750523:1, (1,2):2}
>>> dct[219750523]
1

list.__len__(self)
UserList.UserList.__len__(self)
tuple.__len__(self)

Return the length of a sequence. Determines how a datatype responds to the built-in `len()` function.

list.__mul__(self, num)
UserList.UserList.__mul__(self, num)
tuple.__mul__(self, num)
list.__rmul__(self, num)
UserList.UserList.__rmul__(self, num)
tuple.__rmul__(self, num)
list.__imul__(self, num)
UserList.UserList.__imul__(self, num)

Determine how a datatype responds to the '*' and '*=' operators. Augmented assignments ("in-place multiplication") are supported in Python 2.0+. For list-like datatypes, normally the statements 'lst*=num' and 'lst=lst*num' have the same effect, but the augmented version might be more efficient.

The right-associative version '.__rmul__()' determines the value of 'num*self', the left-associative '.__mul__()' determines the value of 'self*num'. Under standard meaning, the product of a sequence and a number produces a new (distinct) sequence object with the items in 'self' duplicated 'num' times:

>>> [1,2,3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]

list.__setitem__(self, pos, val)
UserList.UserList.__setitem__(self, pos, val)

Set the value at offset 'pos' to the value 'val'. Determines how a datatype responds to indexed assignment; that is, 'self[pos]=val'. A custom version might actually perform some calculation based on 'val' and/or 'pos' before adding an item.

list.__setslice__(self, start, end, other)
UserList.UserList.__setslice__(self, start, end, other)

Replace the subsequence 'self[start:end]' with the sequence 'other'.
The replaced and new sequences are not necessarily the same length, and the resulting sequence might be longer or shorter than 'self'. Determines how a datatype responds to assignment to a slice, as in 'self[start:end]=other'.

list.append(self, item)
UserList.UserList.append(self, item)

Add the object 'item' to the end of the sequence 'self'. Increases the length of 'self' by one.

list.count(self, item)
UserList.UserList.count(self, item)

Return the integer number of occurrences of 'item' in 'self'.

list.extend(self, seq)
UserList.UserList.extend(self, seq)

Add each item in 'seq' to the end of the sequence 'self'. Increases the length of 'self' by 'len(seq)'.

list.index(self, item)
UserList.UserList.index(self, item)

Return the offset index of the first occurrence of 'item' in 'self'.

list.insert(self, pos, item)
UserList.UserList.insert(self, pos, item)

Add the object 'item' to the sequence 'self' before the offset 'pos'. Increases the length of 'self' by one.

list.pop(self [,pos=-1])
UserList.UserList.pop(self [,pos=-1])

Return the item at offset 'pos' of the sequence 'self', and remove the returned item from the sequence. By default, remove the last item, which lets a list act like a stack using the '.pop()' and '.append()' operations.

list.remove(self, item)
UserList.UserList.remove(self, item)

Remove the first occurrence of 'item' in 'self'. Decreases the length of 'self' by one.

list.reverse(self)
UserList.UserList.reverse(self)

Reverse the list 'self' in place.

list.sort(self [,cmpfunc])
UserList.UserList.sort(self [,cmpfunc])

Sort the list 'self' in place. If a comparison function 'cmpfunc' is given, perform comparisons using that function.

SEE ALSO, `list`, `tuple`, [dict], [operator]

=================================================================
  MODULE -- UserString : Custom wrapper around string objects
=================================================================
  BUILTIN -- str : New-style base class for string objects
=================================================================

A string in Python is an immutable sequence of characters (see Glossary entry on "immutable"). There is special syntax for creating strings--single and triple quoting, character escaping, and so on--but in terms of object behaviors and magic methods, most of what a string does a tuple does, too. Both may be sliced and indexed, and both respond to pseudo-arithmetic operators '+' and '*'.

For the [str] and [UserString] magic methods that are strictly a matter of the sequence quality of strings, see the corresponding [tuple] documentation. These include `str.__add__()`, `str.__getitem__()`, `str.__getslice__()`, `str.__hash__()`, `str.__len__()`, `str.__mul__()`, and `str.__rmul__()`. Each of these methods is also defined in [UserString]. The [UserString] module also includes a few explicit definitions of magic methods that are not in the new-style [str] class: `UserString.__iadd__()`, `UserString.__imul__()`, and `UserString.__radd__()`. However, you may define your own implementations of these methods, even if you inherit from [str] (in Python 2.2+). In any case, augmented assignments work on all strings--but since strings are immutable, these "in-place" operations actually create new string objects internally.

Strings have quite a number of nonmagic methods as well. If you wish to create a custom datatype that can be utilized in the same functions that expect strings, you may want to specialize some of these common string methods.
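As a minimal sketch of this kind of specialization (the class 'CIStr' is hypothetical, not part of the standard library), consider a string-like datatype that compares equal case-insensitively while inheriting everything else from [str]:

>>> class CIStr(str):
...     def __eq__(self, other):
...         # compare lowercased versions of both operands
...         return str.lower(self) == str(other).lower()
...
>>> CIStr('Spam') == 'sPAM'
1

Since 'CIStr' inherits from [str], instances can still be passed to any function that expects a real string.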
The behavior of string methods is documented in the discussion of the [string] module, even for the few string methods that are not also defined in the [string] module. However, inheriting from either [str] or [UserString] provides very reasonable default behaviors for all these methods.

SEE ALSO, `"".capitalize()`, `"".title()`, `"".center()`, `"".count()`, `"".endswith()`, `"".expandtabs()`, `"".find()`, `"".index()`, `"".isalpha()`, `"".isalnum()`, `"".isdigit()`, `"".islower()`, `"".isspace()`, `"".istitle()`, `"".isupper()`, `"".join()`, `"".ljust()`, `"".lower()`, `"".lstrip()`, `"".replace()`, `"".rfind()`, `"".rindex()`, `"".rjust()`, `"".rstrip()`, `"".split()`, `"".splitlines()`, `"".startswith()`, `"".strip()`, `"".swapcase()`, `"".translate()`, `"".upper()`, `"".encode()`

METHODS:

str.__contains__(self, x)
UserString.UserString.__contains__(self, x)

Return a Boolean value indicating whether 'self' contains the character 'x'. Determines how a datatype responds to the 'in' operator.

In Python versions through 2.2, the 'in' operator applied to strings has a semantics that tends to trip me up. Fortunately, Python 2.3+ has the behavior that I expect. In older Python versions, 'in' can only be used to determine the presence of a single character in a string--this makes sense if you think of a string as a sequence of characters, but I nonetheless intuitively want something like the code below to work:

>>> s = "The cat in the hat"
>>> if "the" in s: print "Has definite article"
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'in <string>' requires character as left operand

It is easy to get the "expected" behavior in a custom string-like datatype (while still always producing the same result whenever 'x' is indeed a character):

>>> class S(str):
...     def __contains__(self, x):
...         for i in range(len(self)):
...             if self.startswith(x,i): return 1
...
>>> s = S("The cat in the hat")
>>> "the" in s
1
>>> "an" in s
0

Python 2.3 strings behave the same way as my datatype 'S'.

SEE ALSO, `string`, [string], [operator], [tuple]

EXERCISE: Filling out the forms (or deciding not to)
--------------------------------------------------------------------

DISCUSSION:

A particular little task that was quite frequent and general before the advent of Web servers has become absolutely ubiquitous for slightly dynamic Web pages. The pattern one encounters is that one has a certain general format that is desired for a document or file, but miscellaneous little details differ from instance to instance. Form letters are another common case where one comes across this pattern, but thematically related collections of Web pages rule the roost of templating techniques.

It turns out that everyone and her sister has developed her own little templating system. Creating a templating system is a very appealing task for users of most scripting languages, just a little while after they have gotten a firm grasp of "Hello World!" Some of these are discussed in Chapter 5, but many others are not addressed. Often, these templating systems will be HTML/CGI oriented and will often include some degree of dynamic calculation of fill-in values--the inspiration in these cases comes from systems like Allaire's ColdFusion, Java Server Pages, Active Server Pages, and PHP, in which some program code gets sprinkled around in documents that are primarily made of HTML.

At the very simplest, Python provides interpolation of special characters in strings, in a style similar to the C 'sprintf()' function.
So a simple example might appear like:

>>> form_letter="""Dear %s %s,
...
... You owe us $%s for account (#%s). Please Pay.
...
... The Company"""
>>> fname = 'David'
>>> lname = 'Mertz'
>>> due = 500
>>> acct = '123-T745'
>>> print form_letter % (fname,lname,due,acct)
Dear David Mertz,

You owe us $500 for account (#123-T745). Please Pay.

The Company

This approach does the basic templating, but it would be easy to make an error in composing the tuple of insertion values. And moreover, a slight change to the 'form_letter' template--such as the addition or subtraction of a field--would produce wrong results. A bit more robust approach is to use Python's dictionary-based string interpolation. For example:

>>> form_letter="""Dear %(fname)s %(lname)s,
...
... You owe us $%(due)s for account (#%(acct)s). Please Pay.
...
... The Company"""
>>> fields = {'lname':'Mertz', 'fname':'David'}
>>> fields['acct'] = '123-T745'
>>> fields['due'] = 500
>>> fields['last_letter'] = '01/02/2001'
>>> print form_letter % fields
Dear David Mertz,

You owe us $500 for account (#123-T745). Please Pay.

The Company

With this approach, the fields need not be listed in a particular order for the insertion. Furthermore, if the order of fields is rearranged in the template, or if the same fields are used for a different template, the 'fields' dictionary may still be used for insertion values. If 'fields' has unused dictionary keys, it doesn't hurt the interpolation, either.

The dictionary interpolation approach is still subject to failure if dictionary keys are missing. The [UserDict] module can improve matters in two different (and incompatible) ways. In Python 2.2+, the built-in 'dict' type can serve as the parent of a "new-style class"; if Python 2.2+ is available everywhere you need the code to run, 'dict' is a better parent than `UserDict.UserDict`. One approach is to avoid all key misses during dictionary interpolation:

>>> form_letter="""%(salutation)s %(fname)s %(lname)s,
...
... You owe us $%(due)s for account (#%(acct)s). Please Pay.
...
... %(closing)s The Company"""
>>> from UserDict import UserDict
>>> class AutoFillingDict(UserDict):
...     def __init__(self,dict={}): UserDict.__init__(self,dict)
...     def __getitem__(self,key):
...         return UserDict.get(self, key, '')
>>> fields = AutoFillingDict()
>>> fields['salutation'] = 'Dear'
>>> fields
{'salutation': 'Dear'}
>>> fields['fname'] = 'David'
>>> fields['due'] = 500
>>> fields['closing'] = 'Sincerely,'
>>> print form_letter % fields
Dear David ,

You owe us $500 for account (#). Please Pay.

Sincerely, The Company

Even though the fields 'lname' and 'acct' are not specified, the interpolation has managed to produce a basically sensible letter (instead of crashing with a 'KeyError').

Another approach is to create a custom dictionary-like object that will allow for "partial interpolation." This approach is particularly useful to gather bits of the information needed for the final string over the course of the program run (rather than all at once):

>>> form_letter="""%(salutation)s %(fname)s %(lname)s,
...
... You owe us $%(due)s for account (#%(acct)s). Please Pay.
...
... %(closing)s The Company"""
>>> from UserDict import UserDict
>>> class ClosureDict(UserDict):
...     def __init__(self,dict={}): UserDict.__init__(self,dict)
...     def __getitem__(self,key):
...         return UserDict.get(self, key, '%('+key+')s')
>>> name_dict = ClosureDict({'fname':'David','lname':'Mertz'})
>>> print form_letter % name_dict
%(salutation)s David Mertz,

You owe us $%(due)s for account (#%(acct)s). Please Pay.
%(closing)s The Company

Interpolating using a 'ClosureDict' simply fills in whatever portion of the information it knows, then returns a new string that is closer to being filled in.

SEE ALSO, [dict], [UserDict], [UserList], [UserString]

QUESTIONS:

1. What are some other ways to provide "smart" string interpolation? Can you think of ways that the [UserList] or [UserString] modules might be used to implement a similar enhanced interpolation?

2. Consider other "magic" methods that you might add to classes inheriting from `UserDict.UserDict`. How might these additional behaviors make templating techniques more powerful?

3. How far do you think you can go in using Python's string interpolation as a templating technique? At what point would you decide you had to apply other techniques, such as regular expression substitutions or a parser? Why?

4. What sorts of error checking might you implement for customized interpolation? The simple list or dictionary interpolation could fail fairly easily, but at least those were trappable errors (they let the application know something is amiss). How would you create a system with both flexible interpolation and good guards on the quality and completeness of the final result?

PROBLEM: Working with lines from a large file
--------------------------------------------------------------------

At its simplest, reading a file in a line-oriented style is just a matter of using the '.readline()', '.readlines()', and '.xreadlines()' methods of a file object. Python 2.2+ provides a simplified syntax for this frequent operation by letting the file object itself efficiently iterate over lines (strictly in forward sequence). To read in an entire file, you may use the '.read()' method and possibly split it into lines or other chunks using the `string.split()` function. Some examples:

>>> for line in open('chap1.txt'):   # Python 2.2+
...     # process each line in some manner
...     pass
...
>>> linelist = open('chap1.txt').readlines()
>>> print linelist[1849],
  PROBLEM: Working with lines from a large file
>>> txt = open('chap1.txt').read()
>>> from os import linesep
>>> linelist2 = txt.split(linesep)

For moderately sized files, reading the entire contents is not a big issue. But large files make time and memory issues more important. Complex documents or active log files, for example, might be multiple megabytes, or even gigabytes, in size--even if the contents of such files do not strictly exceed the size of available memory, reading them can still be time consuming. A technique related to those discussed here is presented in Chapter 2: "Reading a file backwards by record, line, or paragraph."

Obviously, if you -need- to process every line in a file, you have to read the whole file; [xreadlines] does so in a memory-friendly way, assuming you are able to process the lines sequentially. But for applications that only need a subset of lines in a large file, it is not hard to make improvements. The most important module to look to for support here is [linecache].

A CACHED LINE LIST:

It is straightforward to read a particular line from a file using [linecache]:

>>> import linecache
>>> print linecache.getline('chap1.txt',1850),
  PROBLEM: Working with lines from a large file

Notice that `linecache.getline()` uses one-based counting, in contrast to the zero-based list indexing in the prior example. While there is not much to this, it would be even nicer to have an object that combined the efficiency of [linecache] with the interfaces we expect in lists.
Existing code might process lists of lines, or you might want to write a function that is agnostic about the source of a list of lines. In addition to being able to enumerate and index, it would be useful to be able to slice [linecache]-based objects, just as we might do to real lists (including with extended slices, which were added to lists in Python 2.3).

#------------------ cachedlinelist.py --------------------#
import linecache, types

class CachedLineList:
    # Note: in Python 2.2+, it is probably worth including:
    # __slots__ = ('_fname')
    # ...and inheriting from 'object'
    def __init__(self, fname):
        self._fname = fname
    def __getitem__(self, x):
        if type(x) is types.SliceType:
            return [linecache.getline(self._fname, n+1)
                    for n in range(x.start, x.stop, x.step)]
        else:
            return linecache.getline(self._fname, x+1)
    def __getslice__(self, beg, end):
        # pass to __getitem__ which does extended slices also
        return self[beg:end:1]

Using these new objects is almost identical to using a list created by 'open(fname).readlines()', but more efficient (especially in memory usage):

>>> from cachedlinelist import CachedLineList
>>> cll = CachedLineList('../chap1.txt')
>>> cll[1849]
'  PROBLEM: Working with lines from a large file\r\n'
>>> for line in cll[1849:1851]: print line,
...
  PROBLEM: Working with lines from a large file
  ----------------------------------------------------------
>>> for line in cll[1853:1857:2]: print line,
...
  a matter of using the '.readline()', '.readlines()' and
  simplified syntax for this frequent operation by letting the

A RANDOM LINE:

Occasionally--especially for testing purposes--you might want to check "typical" lines in a line-oriented file. It is easy to fall into the trap of making sure that a process works for the first few lines of a file, and maybe for the last few, then assuming it works everywhere. Unfortunately, the first and last few lines of many files tend to be atypical: sometimes headers or footers are used; sometimes a log file's first lines were logged during development rather than usage; and so on. Then again, exhaustive testing of entire files might provide more data than you want to worry about. Depending on the nature of the processing, complete testing could be time consuming as well.

On most systems, seeking to a particular position in a file is far quicker than reading all the bytes up to that position. Even using [linecache], you need to read a file byte-by-byte up to the point of a cached line. A fast approach to finding random lines from a large file is to seek to a random position within a file, then read comparatively few bytes before and after that position, identifying a line within that chunk.
#-------------------- randline.py ------------------------#
#!/usr/bin/python
"""Iterate over random lines in a file (req Python 2.2+)

From command-line use:
% randline.py <fname> <numlines>
"""
import sys
from os import stat, linesep
from stat import ST_SIZE
from random import randrange
MAX_LINE_LEN = 4096

#-- Iterable class
class randline(object):
    __slots__ = ('_fp','_size','_limit')
    def __init__(self, fname, limit=sys.maxint):
        self._size = stat(fname)[ST_SIZE]
        self._fp = open(fname,'rb')
        self._limit = limit
    def __iter__(self):
        return self
    def next(self):
        if self._limit <= 0:
            raise StopIteration
        self._limit -= 1
        pos = randrange(self._size)
        priorlen = min(pos, MAX_LINE_LEN)   # maybe near start
        self._fp.seek(pos-priorlen)
        # Add extra linesep at beg/end in case pos at beg/end
        prior = linesep + self._fp.read(priorlen)
        post = self._fp.read(MAX_LINE_LEN) + linesep
        begln = prior.rfind(linesep) + len(linesep)
        endln = post.find(linesep)
        return prior[begln:]+post[:endln]

#-- Use as command-line tool
if __name__=='__main__':
    fname, numlines = sys.argv[1], int(sys.argv[2])
    for line in randline(fname, numlines):
        print line

The presented [randline] module may be used either imported into another application or as a command-line tool. In the latter case, you could pipe a collection of random lines to another application, as in:

#*---------- Piping random lines to application ----------#
% randline.py reallybig.log 1000 | testapp

A couple details should be noted in my implementation. (1) The same line can be chosen more than once in a line iteration. If you choose a small number of lines from a large file, this probably will not happen (but the so-called "birthday paradox" makes an occasional collision more likely than you might expect; see the Glossary). (2) What is selected is "the line that contains a random position in the file," which means that short lines are less likely to be chosen than long lines. That distribution could be a bug or feature, depending on your needs. In practical terms, for testing "enough" typical cases, the precise distribution is not all that important.

SEE ALSO, [xreadlines], [linecache], [random]

SECTION 2 -- Standard Modules
------------------------------------------------------------------------

There are a variety of tasks that many or most text processing applications will perform, but that are not themselves text processing tasks. For example, texts typically live inside files, so for a concrete application you might want to check whether files exist, whether you have access to them, and whether they have certain attributes; you might also want to read their contents. The text processing per se does not happen until the text makes it into a Python value, but getting the text into local memory is a necessary step.

Another task is making Python objects persistent so that final or intermediate processing results can be saved in computer-usable forms. Or again, Python applications often benefit from being able to call external processes and possibly work with the results of those calls.

Yet another class of modules helps you deal with Python internals in ways that go beyond what the inherent syntax does. I have made a judgment call in this book as to which such "Python internal" modules are sufficiently general and frequently used in text processing applications; a number of "internal" modules are given only one-line descriptions under the "Other Modules" topic.
TOPIC -- Working with the Python Interpreter
--------------------------------------------------------------------

Some of the modules in the standard library contain functionality that is nearly as important to Python as the basic syntax. Such modularity is an important strength of Python's design, but users of other languages may be surprised to find capabilities for reading command-line arguments, catching exceptions, copying objects, or the like in external modules.

=================================================================
  MODULE -- copy : Generic copying operations
=================================================================

Names in Python programs are merely bindings to underlying objects; many of these objects are mutable. This point is simple, but it winds up biting almost every beginning Python programmer--and even a few experienced Pythoners get caught, too. The problem is that binding another name (including a sequence position, dictionary entry, or attribute) to an object leaves you with two names bound to the same object. If you change the underlying object using one name, the other name also points to a changed object. Sometimes you want that, sometimes you do not.

One variant of the binding trap is a particularly frequent pitfall. Say you want a 2D table of values, initialized as zeros. Later on, you would like to be able to refer to a row/column position as, for example, 'table[2][3]' (as in many programming languages). Here is what you would probably try first, along with its failure:

>>> row = [0]*4
>>> print row
[0, 0, 0, 0]
>>> table = [row]*4   # or 'table = [[0]*4]*4'
>>> for row in table: print row
...
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
>>> table[2][3] = 7
>>> for row in table: print row
...
[0, 0, 0, 7]
[0, 0, 0, 7]
[0, 0, 0, 7]
[0, 0, 0, 7]
>>> id(table[2]), id(table[3])
(6207968, 6207968)

The problem with the example is that 'table' is a list of four positional bindings to the -exact same- list object. You cannot change just one row, since all four point to just one object. What you need instead is a -copy- of 'row' to put in each row of 'table'.

Python provides a number of ways to create copies of objects (and bind them to names). Such a copy is a "snapshot" of the state of the object that can be modified independently of changes to the original. A few ways to correct the table problem are:

>>> table1 = map(list, [(0,)*4]*4)
>>> id(table1[2]), id(table1[3])
(6361712, 6361808)
>>> table2 = [lst[:] for lst in [[0]*4]*4]
>>> id(table2[2]), id(table2[3])
(6356720, 6356800)
>>> from copy import copy
>>> row = [0]*4
>>> table3 = map(copy, [row]*4)
>>> id(table3[2]), id(table3[3])
(6498640, 6498720)

In general, slices always create new lists. In Python 2.2+, the constructors 'list()' and 'dict()' likewise construct new/copied lists/dicts (possibly using other sequence or association types as arguments).

But the most general way to make a new copy of -whatever object you might need- is with the [copy] module. If you use the [copy] module, you do not need to worry about whether a given sequence is a list or merely list-like; the 'list()' coercion, by contrast, forces everything into a true list.

FUNCTIONS:

copy.copy(obj)

Return a shallow copy of a Python object. Most (but not quite all) types of Python objects can be copied. A shallow copy binds its elements/members to the same objects as bound in the original--but the object itself is distinct.

>>> import copy
>>> class C: pass
...
>>> o1 = C()
>>> o1.lst = [1,2,3]
>>> o1.str = "spam"
>>> o2 = copy.copy(o1)
>>> o1.lst.append(17)
>>> o2.lst
[1, 2, 3, 17]
>>> o1.str = 'eggs'
>>> o2.str
'spam'

copy.deepcopy(obj)

Return a deep copy of a Python object. Each element or member in an object is itself recursively copied. For nested containers, it is usually more desirable to perform a deep copy--otherwise you can run into problems like the 2D table example above.

>>> o1 = C()
>>> o1.lst = [1,2,3]
>>> o3 = copy.deepcopy(o1)
>>> o1.lst.append(17)
>>> o3.lst
[1, 2, 3]
>>> o1.lst
[1, 2, 3, 17]

=================================================================
  MODULE -- exceptions : Standard exception class hierarchy
=================================================================

Various actions in Python raise exceptions, and these exceptions can be caught using an 'except' clause. Although strings can serve as exceptions for backwards-compatibility reasons, it is greatly preferable to use class-based exceptions.

When you catch an exception using an 'except' clause, you also catch any descendent exceptions. By utilizing a hierarchy of standard and user-defined exception classes, you can tailor exception handling to meet your specific code requirements.

>>> class MyException(StandardError): pass
...
>>> try:
...     raise MyException
... except StandardError:
...     print "Caught parent"
... except MyException:
...     print "Caught specific class"
... except:
...     print "Caught generic leftover"
...
Caught parent

Notice that the 'except' clauses are tried in order: Since 'MyException' descends from 'StandardError', and the 'StandardError' clause is listed first, the more specific clause is never reached. List the specific handler before its parent if you want it to run.

In general, if you need to raise exceptions manually, you should either use a built-in exception close to your situation, or inherit from that built-in exception. The outline in Figure 1.1 shows the exception classes defined in [exceptions].

#----- Standard exceptions (hierarchy chart in Figure 1.1) -----#

=================================================================
  MODULE -- getopt : Parser for command line options
=================================================================

Utility applications--whether for text processing or otherwise--frequently accept a variety of command-line switches to configure their behavior. In principle, and frequently in practice, all that you need to do to process command-line options is read through the list 'sys.argv[1:]' and handle each element of the option line. I have certainly written my own small "sys.argv parser" more than once; it is not hard if you do not expect too much.

The [getopt] module provides some automation and error handling for option parsing. It takes just a few lines of code to tell [getopt] what options it should handle, and which switch prefixes and parameter styles to use. However, [getopt] is not necessarily the final word in parsing command lines. Python 2.3 includes Greg Ward's [optik] module renamed as [optparse], and the Twisted Matrix library contains [twisted.python.usage]. These modules, and other third-party tools, were written because of perceived limitations in [getopt]. For most purposes, [getopt] is a perfectly good tool. Moreover, even if some enhanced module is included in later Python versions, either this enhancement will be backwards compatible or [getopt] will remain in the distribution to support existing scripts.

SEE ALSO, `sys.argv`

FUNCTIONS:

getopt.getopt(args, options [,long_options])

The argument 'args' is the actual list of options being parsed, most commonly 'sys.argv[1:]'. The argument 'options' and the optional argument 'long_options' contain formats for acceptable options.
If any options specified in 'args' do not match any acceptable format, a `getopt.GetoptError` exception is raised. All options must begin with either a single dash for single-letter options or a double dash for long options (DOS-style leading slashes are not usable, unfortunately).

The return value of `getopt.getopt()` is a pair containing an option list and a list of additional arguments. The latter is typically a list of filenames the utility will operate on. The option list is a list of pairs of the form '(option, value)'. Under recent versions of Python, you can convert an option list to a dictionary with 'dict(optlist)', which is likely to be useful.

The 'options' format string is a sequence of letters, each optionally followed by a colon. Any option letter followed by a colon takes a (mandatory) value after the option.

The format for 'long_options' is a list of strings indicating the option names (excluding the leading dashes). If an option name ends with an equal sign, it requires a value after the option.

It is easiest to see [getopt] in action:

>>> import getopt
>>> opts='-a1 -b -c 2 --foo=bar --baz file1 file2'.split()
>>> optlist, args = getopt.getopt(opts,'a:bc:',['foo=','baz'])
>>> optlist
[('-a', '1'), ('-b', ''), ('-c', '2'), ('--foo', 'bar'), ('--baz', '')]
>>> args
['file1', 'file2']
>>> nodash = lambda s: \
...       s.translate(''.join(map(chr,range(256))),'-')
>>> todict = lambda l: \
...       dict([(nodash(opt),val) for opt,val in l])
>>> optdict = todict(optlist)
>>> optdict
{'a': '1', 'c': '2', 'b': '', 'baz': '', 'foo': 'bar'}

You can examine options given either by looping through 'optlist' or by performing 'optdict.get(key, default)' type tests as needed in your program flow.

=================================================================
  MODULE -- operator : Standard operations as functions
=================================================================

All of the standard Python syntactic operators are available in functional form using the [operator] module. In most cases, it is more clear to use the actual operators, but in a few cases functions are useful. The most common usage for [operator] is in conjunction with functional programming constructs. For example:

>>> import operator
>>> lst = [1, 0, (), '', 'abc']
>>> map(operator.not_, lst)   # fp-style negated bool vals
[0, 1, 1, 1, 0]
>>> tmplst = []               # imperative style
>>> for item in lst:
...     tmplst.append(not item)
...
>>> tmplst
[0, 1, 1, 1, 0]
>>> del tmplst                # must cleanup stray name

As well as being shorter, I find the FP style more clear. The source code below provides -sample- implementations of the functions in the [operator] module. The actual implementations are faster and are written directly in C, but the samples illustrate what each function does.
#------------------ operator2.py -------------------------#
### Comparison functions
lt = __lt__ = lambda a,b: a < b
le = __le__ = lambda a,b: a <= b
eq = __eq__ = lambda a,b: a == b
ne = __ne__ = lambda a,b: a != b
ge = __ge__ = lambda a,b: a >= b
gt = __gt__ = lambda a,b: a > b
### Boolean functions
not_ = __not__ = lambda o: not o
truth = lambda o: not not o
### Arithmetic functions
abs = __abs__ = abs             # same as built-in function
add = __add__ = lambda a,b: a + b
and_ = __and__ = lambda a,b: a & b    # bitwise, not boolean
div = __div__ = \
      lambda a,b: a/b           # depends on __future__.division
floordiv = __floordiv__ = lambda a,b: a//b   # Only for 2.2+
inv = invert = __inv__ = __invert__ = lambda o: ~o
lshift = __lshift__ = lambda a,b: a << b
rshift = __rshift__ = lambda a,b: a >> b
mod = __mod__ = lambda a,b: a % b
mul = __mul__ = lambda a,b: a * b
neg = __neg__ = lambda o: -o
or_ = __or__ = lambda a,b: a | b      # bitwise, not boolean
pos = __pos__ = lambda o: +o          # identity for numbers
sub = __sub__ = lambda a,b: a - b
truediv = __truediv__ = lambda a,b: 1.0*a/b  # New in 2.2+
xor = __xor__ = lambda a,b: a ^ b
### Sequence functions (note overloaded syntactic operators)
concat = __concat__ = add
contains = __contains__ = lambda a,b: b in a
countOf = lambda seq,a: len([x for x in seq if x==a])
def delitem(seq,a): del seq[a]
__delitem__ = delitem
def delslice(seq,b,e): del seq[b:e]
__delslice__ = delslice
getitem = __getitem__ = lambda seq,i: seq[i]
getslice = __getslice__ = lambda seq,b,e: seq[b:e]
indexOf = lambda seq,o: seq.index(o)
repeat = __repeat__ = mul
def setitem(seq,i,v): seq[i] = v
__setitem__ = setitem
def setslice(seq,b,e,v): seq[b:e] = v
__setslice__ = setslice
### Functionality functions (not implemented here)
# The precise interfaces required to pass the below tests
# are ill-defined, and might vary at limit-cases between
# Python versions and custom data types.
import operator
isCallable = callable   # just use built-in 'callable()'
isMappingType = operator.isMappingType
isNumberType = operator.isNumberType
isSequenceType = operator.isSequenceType

=================================================================
  MODULE -- sys : Information about current Python interpreter
=================================================================

As with the Python "userland" objects you create within your applications, the Python interpreter itself is very open to introspection. Using the [sys] module, you can examine and modify many aspects of the Python runtime environment. However, as with much of the functionality in the [os] module, some of what [sys] provides is too esoteric to address in this book about text processing. Consult the _Python Library Reference_ for information on those attributes and functions not covered here.

The module attributes `sys.exc_type`, `sys.exc_value`, and `sys.exc_traceback` have been deprecated in favor of the function `sys.exc_info()`. All of these, and also `sys.last_type`, `sys.last_value`, `sys.last_traceback`, and `sys.tracebacklimit`, let you poke into exceptions and stack frames to a finer degree than the basic `try` and `except` statements do. `sys.exec_prefix` and `sys.executable` provide information on installed paths for Python.

The functions `sys.displayhook()` and `sys.excepthook()` control where program output goes, and `sys.__displayhook__` and `sys.__excepthook__` retain their original values (e.g., STDOUT and STDERR). `sys.exitfunc` affects interpreter cleanup. The attributes `sys.ps1` and `sys.ps2` control prompts in the Python interactive shell.
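For example, a trivial interactive session (the prompt string 'py> ' is an arbitrary choice) shows `sys.ps1` taking effect immediately:

>>> import sys
>>> sys.ps1 = 'py> '
py> print 'prompts are just attributes'
prompts are just attributes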
Other attributes and methods simply provide more detail than you almost ever need to know for text processing applications. The attributes `sys.dllhandle` and `sys.winver` are Windows specific; `sys.setdlopenflags()` and `sys.getdlopenflags()` are Unix only. Attributes and methods like `sys.builtin_module_names`, `sys.getrecursionlimit()`, `sys._getframe()`, `sys.setcheckinterval()`, `sys.modules`, `sys.prefix`, `sys.setprofile()`, `sys.setrecursionlimit()`, `sys.settrace()`, and `sys.warnoptions` concern Python internals. Unicode behavior is affected by `sys.setdefaultencoding()` (but is overridable with arguments anyway).

ATTRIBUTES:

sys.argv

A list of command-line arguments passed to a Python script. The first item, 'argv[0]', is the script name itself, so you are normally interested in 'argv[1:]' when parsing arguments.

SEE ALSO, [getopt], `sys.stdin`, `sys.stdout`

sys.byteorder

The native byte order (endianness) of the current platform. Possible values are 'big' and 'little'. Available in Python 2.0+.

sys.copyright

A string with copyright information for the current Python interpreter.

sys.hexversion

The version number of the current Python interpreter as an integer. This number increases with every version, even nonproduction releases. This attribute is not very human-readable; `sys.version` or `sys.version_info` is generally easier to work with.

SEE ALSO, `sys.version`, `sys.version_info`

sys.maxint

The largest positive integer supported by Python's regular integer type, on most platforms, 2**31-1. The largest negative integer is -sys.maxint-1.

sys.maxunicode

The integer value of the largest supported code point for a Unicode character under the current configuration. Unicode characters are stored as UCS-2 or UCS-4.

sys.path

A list of the pathnames searched for modules. You may modify this path to control module loading.

sys.platform

A string identifying the OS platform.

SEE ALSO, `os.uname()`

sys.stderr
sys.__stderr__

File object for standard error stream (STDERR). `sys.__stderr__` retains the original value in case `sys.stderr` is modified during program execution. Error messages and warnings from the Python interpreter are written to `sys.stderr`. The most typical use of `sys.stderr` is for application messages that indicate "abnormal" conditions. For example:

#*------ Typical usage of sys.stderr and sys.stdout -----#
% cat cap_file.py
#!/usr/bin/env python
import sys, string
if len(sys.argv) < 2:
    sys.stderr.write("No filename specified\n")
else:
    fname = sys.argv[1]
    try:
        input = open(fname).read()
        sys.stdout.write(string.upper(input))
    except:
        sys.stderr.write("Could not read '%s'\n" % fname)
% ./cap_file.py this > CAPS
% ./cap_file.py nosuchfile > CAPS
Could not read 'nosuchfile'
% ./cap_file.py > CAPS
No filename specified

SEE ALSO, `sys.argv`, `sys.stdin`, `sys.stdout`

sys.stdin
sys.__stdin__

File object for standard input stream (STDIN). `sys.__stdin__` retains the original value in case `sys.stdin` is modified during program execution. `input()` and `raw_input()` read from `sys.stdin`, but the most typical use of `sys.stdin` is for piped and redirected streams on the command line. For example:

#*-------------- Typical usage of sys.stdin -------------#
% cat cap_stdin.py
#!/usr/bin/env python
import sys, string
input = sys.stdin.read()
print string.upper(input)
% echo "this and that" | ./cap_stdin.py
THIS AND THAT

SEE ALSO, `sys.argv`, `sys.stderr`, `sys.stdout`

sys.stdout
sys.__stdout__

File object for standard output stream (STDOUT).
`sys.__stdout__` retains the original value in case `sys.stdout` is modified during program execution. The formatted output of the 'print' statement goes to `sys.stdout`, and you may also use regular file methods, such as `sys.stdout.write()`.

SEE ALSO, `sys.argv`, `sys.stderr`, `sys.stdin`

sys.version

A string containing version information on the current Python interpreter. The form of the string is 'version (#build_num, build_date, build_time) [compiler]'. For example:

>>> print sys.version
1.5.2 (#0 Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)]

Or:

>>> print sys.version
2.2 (#1, Apr 17 2002, 16:11:12) [GCC 2.95.2 19991024 (release)]

This version-independent way to find the major, minor, and micro version components should work for 1.5-2.3.x (at least):

>>> from string import split
>>> from sys import version
>>> ver_tup = map(int, split(split(version)[0],'.'))+[0]
>>> major, minor, point = ver_tup[:3]
>>> if (major, minor) >= (1, 6):
...     print "New Way"
... else:
...     print "Old Way"
...
New Way

sys.version_info

A 5-tuple containing the components of the version number of the current Python interpreter: '(major, minor, micro, releaselevel, serial)'. 'releaselevel' is a descriptive phrase; the others are integers.

>>> sys.version_info
(2, 2, 0, 'final', 0)

Unfortunately, this attribute was only added in Python 2.0, so checking it cannot help when a script must also run under earlier versions.

SEE ALSO, `sys.version`

FUNCTIONS:

sys.exit([code=0])

Exit Python with exit code 'code'. Cleanup actions specified by 'finally' clauses of 'try' statements are honored, and it is possible to intercept the exit attempt by catching the 'SystemExit' exception. You may specify a numeric exit code for those systems that codify them; you may also specify a string exit code, which is printed to STDERR (with the actual exit code set to 1).

sys.getdefaultencoding()

Return the name of the default Unicode string encoding in Python 2.0+.

sys.getrefcount(obj)

Return the number of references to the object 'obj'. The value returned is one higher than you might expect, because it includes the (temporary) reference passed as the argument.

>>> x = y = "hi there"
>>> import sys
>>> sys.getrefcount(x)
3
>>> lst = [x, x, x]
>>> sys.getrefcount(x)
6

SEE ALSO, [os]

=================================================================
  MODULE -- types : Standard Python object types
=================================================================

Every object in Python has a type; you can find it by using the built-in function `type()`. Often Python functions use a sort of -ad hoc- overloading, which is implemented by checking features of objects passed as arguments. Programmers coming from languages like C or Java are sometimes surprised by this style, since they are accustomed to seeing multiple "type signatures" for each set of argument types the function can accept. But that is not the Python way.

Experienced Python programmers try not to rely on the precise types of objects, not even in an inheritance sense. This attitude is also sometimes surprising to programmers of other languages (especially statically typed ones). What is usually important to a Python program is what an object can -do-, not what it -is-. In fact, it has become much more complicated to describe what many objects -are- with the "type/class unification" in Python 2.2 and above (the details are outside the scope of this book).
For example, you might be inclined to write an overloaded function in the following manner:

#-------- Naive overloading of argument ---------#
import types, exceptions
def overloaded_get_text(o):
    if type(o) is types.FileType:
        text = o.read()
    elif type(o) is types.StringType:
        text = o
    elif type(o) in (types.IntType, types.FloatType,
                     types.LongType, types.ComplexType):
        text = repr(o)
    else:
        raise exceptions.TypeError
    return text

The problem with this rigidly typed code is that it is far more fragile than is necessary. Something need not be an actual 'FileType' to read its text, it just needs to be sufficiently "file-like" (e.g., a `urllib.urlopen()` or `cStringIO.StringIO()` object is file-like enough for this purpose). Similarly, a new-style object that descends from `types.StringType` or a `UserString.UserString()` object is "string-like" enough to return as such, and similarly for other numeric types.

A better implementation of the function above is:

#---- "Quacks like a duck" overloading of argument -----#
def overloaded_get_text(o):
    if hasattr(o,'read'):
        return o.read()
    try:
        return ""+o
    except TypeError:
        pass
    try:
        return repr(0+o)
    except TypeError:
        pass
    raise

At times, nonetheless, it is useful to have symbolic names available to name specific object types. In many such cases, an empty or minimal version of the type of object may be used in conjunction with the `type()` function equally well--the choice is mostly stylistic:

>>> type('') == types.StringType
1
>>> type(0.0) == types.FloatType
1
>>> type(None) == types.NoneType
1
>>> type([]) == types.ListType
1

BUILT-IN:

type(o)

Return the datatype of any object 'o'. The return value of this function is itself an object of the type `types.TypeType`. TypeType objects implement '.__str__()' and '.__repr__()' methods to create readable descriptions of object types.

>>> print type(1)
<type 'int'>
>>> print type(type(1))
<type 'type'>
>>> type(1) is type(0)
1

CONSTANTS:

types.BuiltinFunctionType
types.BuiltinMethodType

The type for built-in functions like `abs()`, `len()`, and `dir()`, and for functions in "standard" C extensions like [sys] and [os]. However, extensions like [string] and [re] are actually Python wrappers for C extensions, so their functions are of type `types.FunctionType`. A general Python programmer need not worry about these fussy details.

types.BufferType

The type for objects created by the built-in `buffer()` function.

types.ClassType

The type for user-defined classes.

>>> from operator import eq
>>> from types import *
>>> map(eq, [type(C), type(C()), type(C().foo)],
...         [ClassType, InstanceType, MethodType])
[1, 1, 1]

SEE ALSO, `types.InstanceType`, `types.MethodType`

types.CodeType

The type for code objects such as returned by `compile()`.

types.ComplexType

Same as 'type(0+0j)'.

types.DictType
types.DictionaryType

Same as 'type({})'.

types.EllipsisType

The type for the built-in 'Ellipsis' object.

types.FileType

The type for open file objects.

>>> from sys import stdout
>>> fp = open('tst','w')
>>> [type(stdout), type(fp)] == [types.FileType]*2
1

types.FloatType

Same as 'type(0.0)'.

types.FrameType

The type for frame objects like 'tb.tb_frame' where 'tb' has type `types.TracebackType`.

types.FunctionType
types.LambdaType

Same as 'type(lambda:0)'.

types.GeneratorType

The type for generator-iterator objects in Python 2.2+.

>>> from __future__ import generators
>>> def foo(): yield 0
...
>>> type(foo) == types.FunctionType
1
>>> type(foo()) == types.GeneratorType
1

SEE ALSO, `types.FunctionType`

types.InstanceType

The type for instances of user-defined classes.

SEE ALSO, `types.ClassType`, `types.MethodType`

types.IntType

Same as 'type(0)'.

types.ListType

Same as 'type([])'.

types.LongType

Same as 'type(0L)'.

types.MethodType
types.UnboundMethodType

The type for methods of user-defined class instances.

SEE ALSO, `types.ClassType`, `types.InstanceType`

types.ModuleType

The type for modules.

>>> import os, re, sys
>>> [type(os), type(re), type(sys)] == [types.ModuleType]*3
1

types.NoneType

Same as 'type(None)'.

types.StringType

Same as 'type(" ")'.

types.TracebackType

The type for traceback objects found in `sys.exc_traceback`.

types.TupleType

Same as 'type(())'.

types.UnicodeType

Same as 'type(u"")'.

types.SliceType

The type for objects returned by `slice()`.

types.StringTypes

Same as '(types.StringType,types.UnicodeType)'.

SEE ALSO, `types.StringType`, `types.UnicodeType`

types.TypeType

Same as 'type(type(obj))' (for any 'obj').

types.XRangeType

Same as 'type(xrange(1))'.

TOPIC -- Working with the Local Filesystem
--------------------------------------------------------------------

=================================================================
  MODULE -- dircache : Read and cache directory listings
=================================================================

The [dircache] module is an enhanced version of the `os.listdir()` function. Unlike the [os] function, [dircache] keeps prior directory listings in memory to avoid the need for a new call to the filesystem. Since [dircache] is smart enough to check whether a directory has been touched since last caching, [dircache] is a complete replacement for `os.listdir()` (with possible minor speed gains).

FUNCTIONS:

dircache.listdir(path)

Return a directory listing of path 'path'. Uses a list cached in memory where possible.

dircache.opendir(path)

Identical to `dircache.listdir()`. Legacy function to support old scripts.

dircache.annotate(path, lst)

Modify the list 'lst' in place to indicate which items are directories, and which plain files. The string 'path' should indicate the path to reach the listed files.

>>> l = dircache.listdir('/tmp')
>>> l
['501', 'md10834.db']
>>> dircache.annotate('/tmp', l)
>>> l
['501/', 'md10834.db']

=================================================================
  MODULE -- filecmp : Compare files and directories
=================================================================

The [filecmp] module lets you check whether two files are identical, and whether two directories contain some identical files. You have several options in determining how thorough a comparison is performed.

FUNCTIONS:

filecmp.cmp(fname1, fname2 [,shallow=1 [,use_statcache=0]])

Compare the file named by the string 'fname1' with the file named by the string 'fname2'. If the default true value of 'shallow' is used, the comparison is based only on the mode, size, and modification time of the two files. If 'shallow' is a false value, the files are compared byte by byte. Unless you are concerned that someone will deliberately falsify timestamps on files (as in a cryptography context), a shallow comparison is quite reliable. However, 'tar' and 'untar' can also change timestamps.

>>> import filecmp
>>> filecmp.cmp('dir1/file1', 'dir2/file1')
0
>>> filecmp.cmp('dir1/file2', 'dir2/file2', shallow=0)
1

The 'use_statcache' argument is not relevant for Python 2.2+.
In older Python versions, the [statcache] module provided (slightly) more efficient cached access to file stats, but its use is no longer needed.

filecmp.cmpfiles(dirname1, dirname2, fnamelist [,shallow=1 [,use_statcache=0]])

Compare those filenames listed in 'fnamelist' if they occur in both the directory 'dirname1' and the directory 'dirname2'. `filecmp.cmpfiles()` returns a tuple of three lists (some of the lists may be empty): '(matches,mismatches,errors)'. 'matches' are identical files in both directories, 'mismatches' are nonidentical files in both directories. 'errors' will contain names if a file exists in neither, or in only one, of the two directories, or if either file cannot be read for any reason (permissions, disk problems, etc.).

>>> import filecmp, os
>>> filecmp.cmpfiles('dir1','dir2',['this','that','other'])
(['this'], ['that'], ['other'])
>>> print os.popen('ls -l dir1').read()
-rwxr-xr-x  1 quilty  staff  169 Sep 27 00:13 this
-rwxr-xr-x  1 quilty  staff  687 Sep 27 00:13 that
-rwxr-xr-x  1 quilty  staff  737 Sep 27 00:16 other
-rwxr-xr-x  1 quilty  staff  518 Sep 12 11:57 spam
>>> print os.popen('ls -l dir2').read()
-rwxr-xr-x  1 quilty  staff  169 Sep 27 00:13 this
-rwxr-xr-x  1 quilty  staff  692 Sep 27 00:32 that

The 'shallow' and 'use_statcache' arguments are the same as those to `filecmp.cmp()`.

CLASSES:

filecmp.dircmp(dirname1, dirname2 [,ignore=... [,hide=...]])

Create a directory comparison object. 'dirname1' and 'dirname2' are two directories to compare. The optional argument 'ignore' is a sequence of pathnames to ignore and defaults to '["RCS","CVS","tags"]'; 'hide' is a sequence of pathnames to hide and defaults to '[os.curdir,os.pardir]' (i.e., '[".",".."]').

METHODS AND ATTRIBUTES:

The attributes of `filecmp.dircmp` are read-only. Do not attempt to modify them.

filecmp.dircmp.report()

Print a comparison report on the two directories.

>>> mycmp = filecmp.dircmp('dir1','dir2')
>>> mycmp.report()
diff dir1 dir2
Only in dir1 : ['other', 'spam']
Identical files : ['this']
Differing files : ['that']

filecmp.dircmp.report_partial_closure()

Print a comparison report on the two directories, including immediate subdirectories. The method name has nothing to do with the theoretical term "closure" from functional programming.

filecmp.dircmp.report_full_closure()

Print a comparison report on the two directories, recursively including all nested subdirectories.

filecmp.dircmp.left_list

Pathnames in the 'dirname1' directory, filtering out the 'hide' and 'ignore' lists.

filecmp.dircmp.right_list

Pathnames in the 'dirname2' directory, filtering out the 'hide' and 'ignore' lists.

filecmp.dircmp.common

Pathnames in both directories.

filecmp.dircmp.left_only

Pathnames in 'dirname1' but not 'dirname2'.

filecmp.dircmp.right_only

Pathnames in 'dirname2' but not 'dirname1'.

filecmp.dircmp.common_dirs

Subdirectories in both directories.

filecmp.dircmp.common_files

Filenames in both directories.

filecmp.dircmp.common_funny

Path names in both directories, but of different types.

filecmp.dircmp.same_files

Filenames of identical files in both directories.

filecmp.dircmp.diff_files

Filenames of nonidentical files whose name occurs in both directories.

filecmp.dircmp.funny_files

Filenames in both directories where something goes wrong during comparison.
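Continuing the 'dir1'/'dir2' example used for '.report()' above, the list attributes might hold values along these lines (hypothetical output, consistent with the report shown):

>>> mycmp = filecmp.dircmp('dir1','dir2')
>>> mycmp.common
['that', 'this']
>>> mycmp.left_only
['other', 'spam']
>>> mycmp.same_files
['this']
>>> mycmp.diff_files
['that']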
filecmp.dircmp.subdirs
A dictionary mapping `filecmp.dircmp.common_dirs` strings to corresponding `filecmp.dircmp` objects, for example:

>>> usercmp = filecmp.dircmp('/Users/quilty','/Users/alb')
>>> usercmp.subdirs['Public'].common
['Drop Box']

SEE ALSO, `os.stat()`, `os.listdir()`

=================================================================
MODULE -- fileinput : Read multiple files or STDIN
=================================================================

Many utilities, especially on Unix-like systems, operate line-by-line on one or more files and/or on redirected input. Flexibility in treating input sources in a homogeneous fashion is part of the "Unix philosophy." The [fileinput] module allows you to write a Python application that uses these common conventions with almost no special programming to adjust to input sources.

A common, minimal, but extremely useful Unix utility is 'cat', which simply writes its input to STDOUT (allowing redirection of STDOUT as needed). Below are a few simple examples of 'cat':

#*---------- Examples of 'cat' utility ---------#
% cat a
AAAAA
% cat a b
AAAAA
BBBBB
% cat - b < a
AAAAA
BBBBB
% cat < b
BBBBB
% cat a < b
AAAAA
% echo "XXX" | cat a -
AAAAA
XXX

Notice that STDIN is read only if either "-" is given as an argument, or no arguments are given at all. We can implement a Python version of 'cat' using the [fileinput] module as follows:

#------------- cat.py -----------------#
#!/usr/bin/env python
import fileinput
for line in fileinput.input():
    print line,

FUNCTIONS:

fileinput.input([files=sys.argv[1:] [,inplace=0 [,backup=".bak"]]])
Most commonly, this function will be used without any of its optional arguments, as in the introductory example of 'cat.py'. However, behavior may be customized for special cases.

The argument 'files' is a sequence of filenames to process. By default, it consists of all the arguments given on the command line. Commonly, however, you might want to treat some of these arguments as flags rather than filenames (e.g., if they start with '-' or '/'). Any list of filenames you like may be used as the 'files' argument, whether or not it is built from 'sys.argv'.

If you specify a true value for 'inplace', output will go into each file specified rather than to STDOUT. Input taken from STDIN, however, will still go to STDOUT. For in-place operation, a temporary backup file is created as the actual input source and is given the extension indicated by the 'backup' argument. For example:

#*------ Modifying files in place with [fileinput] ------#
% cat a b
AAAAA
BBBBB
% cat modify.py
#!/usr/bin/env python
import fileinput, sys
for line in fileinput.input(sys.argv[1:], inplace=1):
    print "MODIFIED", line,
% echo "XXX" | ./modify.py a b -
MODIFIED XXX
% cat a b
MODIFIED AAAAA
MODIFIED BBBBB

fileinput.close()
Close the input sequence.

fileinput.nextfile()
Close the current file, and proceed to the next one. Any unread lines in the current file will not be counted towards the line total.

There are several functions in the [fileinput] module that provide information about the current input state. These tests can be used to process the current line in a context-dependent way (a short sketch follows the entries below).

fileinput.filelineno()
The number of lines read from the current file.

fileinput.filename()
The name of the file from which the last line was read. Before a line is read, the function returns 'None'.

fileinput.isfirstline()
Same as 'fileinput.filelineno()==1'.

fileinput.isstdin()
True if the last line read was from STDIN.
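Drawing on the state functions just described, the short sketch below (my own illustration, not an example from the module's documentation) prefixes every input line with its origin; the exact output format is an arbitrary choice:

#*------ Locating lines with fileinput state functions ------#
import fileinput
for line in fileinput.input():
    if fileinput.isstdin():
        source = '<STDIN>'             # line arrived via STDIN
    else:
        source = fileinput.filename()  # name of the current file
    # filelineno() restarts from 1 in each new input file
    print "%s:%d: %s" % (source, fileinput.filelineno(), line),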
fileinput.lineno()
The number of lines read during the input loop, cumulative between files.

CLASSES:

fileinput.FileInput([files [,inplace=0 [,backup=".bak"]]])
The methods of `fileinput.FileInput` are the same as the module-level functions, plus an additional '.readline()' method that matches that of file objects. `fileinput.FileInput` objects also have a '.__getitem__()' method to support sequential access. The arguments to initialize a `fileinput.FileInput` object are the same as those passed to the `fileinput.input()` function. The class exists primarily in order to allow subclassing. For normal usage, it is best to just use the [fileinput] functions.

SEE ALSO, `multifile`, [xreadlines]

=================================================================
MODULE -- glob : Filename globbing utility
=================================================================

The [glob] module provides a list of pathnames matching a glob-style pattern. The [fnmatch] module is used internally to determine whether a path matches.

FUNCTIONS:

glob.glob(pat)
Return a list of the pathnames that match the glob-style pattern 'pat'. Both directories and plain files are returned, so if you are only interested in one type of path, use `os.path.isdir()` or `os.path.isfile()`; other functions in [os.path] also support other filters. Pathnames returned by `glob.glob()` contain as much absolute or relative path information as the pattern 'pat' gives. For example:

>>> import glob, os.path
>>> glob.glob('/Users/quilty/Book/chap[3-4].txt')
['/Users/quilty/Book/chap3.txt', '/Users/quilty/Book/chap4.txt']
>>> glob.glob('chap[3-6].txt')
['chap3.txt', 'chap4.txt', 'chap5.txt', 'chap6.txt']
>>> filter(os.path.isdir, glob.glob('/Users/quilty/Book/[A-Z]*'))
['/Users/quilty/Book/SCRIPTS', '/Users/quilty/Book/XML']

SEE ALSO, [fnmatch], [os.path]

=================================================================
MODULE -- linecache : Cache lines from files
=================================================================

The module [linecache] can be used to simulate relatively efficient random access to the lines in a file. Lines that are read are cached for later access.

FUNCTIONS:

linecache.getline(fname, linenum)
Read line 'linenum' from the file named 'fname'. If an error occurs reading the line, the function will catch the error and return an empty string. 'sys.path' is also searched for the filename if it is not found in the current directory.

>>> import linecache
>>> linecache.getline('/etc/hosts', 15)
'192.168.1.108 hermes hermes.gnosis.lan\n'

linecache.clearcache()
Clear the cache of read lines.

linecache.checkcache()
Check whether files in the cache have been modified since they were cached.

=================================================================
MODULE -- os.path : Common pathname manipulations
=================================================================

The [os.path] module provides a variety of functions to analyze and manipulate filesystem paths in a cross-platform fashion.

FUNCTIONS:

os.path.abspath(pathname)
Return an absolute path for a (relative) pathname.

>>> os.path.abspath('SCRIPTS/mk_book')
'/Users/quilty/Book/SCRIPTS/mk_book'

os.path.basename(pathname)
Same as 'os.path.split(pathname)[1]'.

os.path.commonprefix(pathlist)
Return the path to the most nested parent directory shared by all elements of the sequence 'pathlist'. (Note that the comparison is performed character-by-character, so the returned prefix need not itself fall on a directory boundary.)

>>> os.path.commonprefix(['/usr/X11R6/bin/twm',
...                       '/usr/sbin/bash',
...                       '/usr/local/bin/dada'])
'/usr/'

os.path.dirname(pathname)
Same as 'os.path.split(pathname)[0]'.

os.path.exists(pathname)
Return true if the pathname 'pathname' exists.
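For example (a small sketch of my own; the results shown assume a typical Unix-like system where '/etc/hosts' exists):

>>> import os.path
>>> os.path.exists('/etc/hosts')
1
>>> os.path.exists('/no/such/path')
0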
os.path.expanduser(pathname)
Expand pathnames that include the tilde character: '~'. Under standard Unix shells, an initial tilde refers to a user's home directory, and a tilde followed by a name refers to the named user's home directory. This function emulates that behavior on other platforms.

>>> os.path.expanduser('~alb')
'/Users/alb'
>>> os.path.expanduser('~/Book')
'/Users/quilty/Book'

os.path.expandvars(pathname)
Expand 'pathname' by replacing environment variables in a Unix shell style. While this function is in the [os.path] module, you could equally use it for bash-like scripting in Python, generally (this is not necessarily a good idea, but it is possible).

>>> os.path.expandvars('$HOME/Book')
'/Users/quilty/Book'
>>> from os.path import expandvars as ev  # Python 2.0+
>>> if ev('$HOSTTYPE')=='macintosh' and ev('$OSTYPE')=='darwin':
...     print ev("The vendor is $VENDOR, the CPU is $MACHTYPE")
...
The vendor is apple, the CPU is powerpc

os.path.getatime(pathname)
Return the last access time of 'pathname' (or raise 'os.error' if checking is not possible).

os.path.getmtime(pathname)
Return the modification time of 'pathname' (or raise 'os.error' if checking is not possible).

os.path.getsize(pathname)
Return the size of 'pathname' in bytes (or raise 'os.error' if checking is not possible).

os.path.isabs(pathname)
Return true if 'pathname' is an absolute path.

os.path.isdir(pathname)
Return true if 'pathname' is a directory.

os.path.isfile(pathname)
Return true if 'pathname' is a regular file (symbolic links to regular files are included, since the link is followed).

os.path.islink(pathname)
Return true if 'pathname' is a symbolic link.

os.path.ismount(pathname)
Return true if 'pathname' is a mount point (on POSIX systems).

os.path.join(path1 [,path2 [...]])
Join multiple path components intelligently.

>>> os.path.join('/Users/quilty/','Book','SCRIPTS/','mk_book')
'/Users/quilty/Book/SCRIPTS/mk_book'

os.path.normcase(pathname)
Convert 'pathname' to canonical lowercase on case-insensitive filesystems. Also converts slashes to backslashes on Windows systems.

os.path.normpath(pathname)
Remove redundant path information.

>>> os.path.normpath('/usr/local/bin/../include/./slang.h')
'/usr/local/include/slang.h'

os.path.realpath(pathname)
Return the "real" path to 'pathname' after de-aliasing any symbolic links. New in Python 2.2+.

>>> os.path.realpath('/usr/bin/newaliases')
'/usr/sbin/sendmail'

os.path.samefile(pathname1, pathname2)
Return true if 'pathname1' and 'pathname2' are the same file.
SEE ALSO, [filecmp]

os.path.sameopenfile(fp1, fp2)
Return true if the file handles 'fp1' and 'fp2' refer to the same file. Not available on Windows.

os.path.split(pathname)
Return a tuple containing the path leading up to the named pathname and the named directory or filename in isolation.

>>> os.path.split('/Users/quilty/Book/SCRIPTS')
('/Users/quilty/Book', 'SCRIPTS')

os.path.splitdrive(pathname)
Return a tuple containing the drive letter and the rest of the path. On systems that do not use a drive letter, the drive component is an empty string (as it also is on Windows-like systems when no drive is specified).

os.path.walk(pathname, visitfunc, arg)
For every directory recursively contained in 'pathname', call 'visitfunc(arg,dirname,pathnames)', where 'dirname' is the directory visited and 'pathnames' is a list of the names it contains.

>>> def big_files(minsize, dirname, files):
...     for file in files:
...         fullname = os.path.join(dirname,file)
...         if os.path.isfile(fullname):
...             if os.path.getsize(fullname) >= minsize:
...                 print fullname
...
>>> os.path.walk('/usr/', big_files, 5e6)
/usr/lib/libSystem.B_debug.dylib
/usr/lib/libSystem.B_profile.dylib

=================================================================
MODULE -- shutil : Copy files and directory trees
=================================================================

The functions in the [shutil] module make working with files a bit easier. There is nothing in this module that you could not do using basic file objects and [os.path] functions, but [shutil] often provides a more direct means and handles minor details for you. The functions in [shutil] match fairly closely the capabilities you would find in Unix file system utilities like 'cp' and 'rm'.

FUNCTIONS:

shutil.copy(src, dst)
Copy the file named 'src' to the pathname 'dst'. If 'dst' is a directory, the created file is given the name 'os.path.join(dst, os.path.basename(src))'.
SEE ALSO, `os.path.join()`, `os.path.basename()`

shutil.copy2(src, dst)
Same as `shutil.copy()` except that the access and modification times of 'dst' are set to the values in 'src'.

shutil.copyfile(src, dst)
Copy the file named 'src' to the filename 'dst' (overwriting 'dst' if present). Basically, this has the same effect as 'open(dst,"wb").write(open(src,"rb").read())'.

shutil.copyfileobj(fpsrc, fpdst [,buffer=-1])
Copy the file-like object 'fpsrc' to the file-like object 'fpdst'. If the optional argument 'buffer' is given, only the specified number of bytes are read into memory at a time; this allows copying very large files.

shutil.copymode(src, dst)
Copy the permission bits from the file named 'src' to the filename 'dst'.

shutil.copystat(src, dst)
Copy the permission and timestamp data from the file named 'src' to the filename 'dst'.

shutil.copytree(src, dst [,symlinks=0])
Copy the directory 'src' to the destination 'dst' recursively. If the optional argument 'symlinks' is a true value, copy symbolic links as links rather than the default behavior of copying the content of the link target. This function may not be entirely reliable on every platform and filesystem.

shutil.rmtree(dirname [,ignore_errors=0 [,onerror]])
Remove an entire directory tree rooted at 'dirname'. If optional argument 'ignore_errors' is a true value, errors will be silently ignored. If 'onerror' is given, a custom error handler is used to catch errors. This function may not be entirely reliable on every platform and filesystem.

SEE ALSO, `open()`, [os.path]

=================================================================
MODULE -- stat : Constants/functions for os.stat()
=================================================================

The [stat] module provides two types of support for analyzing the results of `os.stat()`, `os.lstat()`, and `os.fstat()` calls.

Several functions exist to allow you to perform tests on a file. If you simply wish to check one predicate of a file, it is more direct to use one of the `os.path.is*()` functions, but for performing several such tests, it is faster to read the mode once and perform several `stat.S_*()` tests. As well as helper functions, [stat] defines symbolic constants to access the fields of the 10-tuple returned by `os.stat()` and friends. For example:

>>> from stat import *
>>> import os
>>> fileinfo = os.stat('chap1.txt')
>>> fileinfo[ST_SIZE]
68666L
>>> mode = fileinfo[ST_MODE]
>>> S_ISSOCK(mode)
0
>>> S_ISDIR(mode)
0
>>> S_ISREG(mode)
1

FUNCTIONS:

stat.S_ISDIR(mode)
Mode indicates a directory.

stat.S_ISCHR(mode)
Mode indicates a character special device file.

stat.S_ISBLK(mode)
Mode indicates a block special device file.
stat.S_ISREG(mode)
Mode indicates a regular file.

stat.S_ISFIFO(mode)
Mode indicates a FIFO (named pipe).

stat.S_ISLNK(mode)
Mode indicates a symbolic link.

stat.S_ISSOCK(mode)
Mode indicates a socket.

CONSTANTS:

stat.ST_MODE
I-node protection mode.

stat.ST_INO
I-node number.

stat.ST_DEV
Device.

stat.ST_NLINK
Number of links to this i-node.

stat.ST_UID
User id of file owner.

stat.ST_GID
Group id of file owner.

stat.ST_SIZE
Size of file.

stat.ST_ATIME
Last access time.

stat.ST_MTIME
Modification time.

stat.ST_CTIME
Time of last status change.

=================================================================
MODULE -- tempfile : Temporary files and filenames
=================================================================

The [tempfile] module is useful when you need to store transient data using a file-like interface. In contrast to the file-like interface of [StringIO], [tempfile] uses the actual filesystem for storage rather than simulating the interface to a file in memory. In memory-constrained contexts, therefore, [tempfile] is preferable.

The temporary files created by [tempfile] are as secure against external modification as is supported by the underlying platform. You can be fairly confident that your temporary data will not be read or changed either while your program is running or afterwards (temporary files are deleted when closed). While you should not count on [tempfile] to provide you with cryptographic-level security, it is good enough to prevent accidents and casual inspection.

FUNCTIONS:

tempfile.mktemp([suffix=""])
Return an absolute path to a unique temporary filename. If optional argument 'suffix' is specified, the name will end with the 'suffix' string.

tempfile.TemporaryFile([mode="w+b" [,bufsize=-1 [,suffix=""]]])
Return a temporary file object. In general, there is little reason to change the default 'mode' argument of 'w+b'; there is no existing file to append to before the creation, and it does little good to write temporary data you cannot read. Likewise, the optional 'suffix' argument generally will not ever be visible, since the file is deleted when closed. The default 'bufsize' uses the platform defaults, but may be modified if needed.

>>> tmpfp = tempfile.TemporaryFile()
>>> tmpfp.write('this and that\n')
>>> tmpfp.write('something else\n')
>>> tmpfp.tell()
29L
>>> tmpfp.seek(0)
>>> tmpfp.read()
'this and that\nsomething else\n'

SEE ALSO, [StringIO], [cStringIO]

=================================================================
MODULE -- xreadlines : Efficient iteration over a file
=================================================================

Reading over the lines of a file had some pitfalls in older versions of Python: There was a memory-friendly way, and there was a fast way, but never the twain shall meet. These techniques were:

>>> fp = open('bigfile')
>>> line = fp.readline()
>>> while line:
...     # Memory-friendly but slow
...     # ...do stuff...
...     line = fp.readline()

>>> for line in open('bigfile').readlines():
...     # Fast but memory-hungry
...     # ...do stuff...

Fortunately, with Python 2.1 a more efficient technique was provided. In Python 2.2+, this efficient technique was also wrapped into a more elegant syntactic form (in keeping with the new iterator protocol). With Python 2.3+, [xreadlines] is officially deprecated in favor of the idiom "'for line in file:'".

FUNCTIONS:

xreadlines.xreadlines(fp)
Iterate over the lines of file object 'fp' in an efficient way (both speed-wise and in memory usage).

>>> for line in xreadlines.xreadlines(open('tmp')):
...     # Efficient all around
...     # ...do stuff...

Corresponding to this [xreadlines] module function is the '.xreadlines()' method of file objects.

>>> for line in open('tmp').xreadlines():
...     # As a file object method
...     # ...do stuff...

If you use Python 2.2 or above, an even nicer version is available:

>>> for line in open('tmp'):
...     # ...do stuff...

SEE ALSO, [linecache], `FILE.xreadlines()`, `os.tmpfile()`

TOPIC -- Running External Commands and Accessing OS Features
--------------------------------------------------------------------

=================================================================
MODULE -- commands : Quick access to external commands
=================================================================

The [commands] module exists primarily as a convenience wrapper for calls to `os.popen*()` functions on Unix-like systems. STDERR is combined with STDOUT in the results.

FUNCTIONS:

commands.getoutput(cmd)
Return the output from running 'cmd'. This function could also be implemented as:

>>> def getoutput(cmd):
...     import os
...     return os.popen('{ '+cmd+'; } 2>&1').read()

commands.getstatusoutput(cmd)
Return a tuple containing the exit status and output from running 'cmd'. This function could also be implemented as:

>>> def getstatusoutput(cmd):
...     import os
...     fp = os.popen('{ '+cmd+'; } 2>&1')
...     output = fp.read()
...     status = fp.close()
...     if not status: status=0  # Want zero rather than None
...     return (status, output)
...
>>> getstatusoutput('ls nosuchfile')
(256, 'ls: nosuchfile: No such file or directory\n')
>>> getstatusoutput('ls c*[1-3].txt')
(0, 'chap1.txt\nchap2.txt\nchap3.txt\n')

commands.getstatus(filename)
Same as 'commands.getoutput('ls -ld '+filename)'.

SEE ALSO, `os.popen()`, `os.popen2()`, `os.popen3()`, `os.popen4()`

=================================================================
MODULE -- os : Portable operating system services
=================================================================

The [os] module contains a large number of functions, attributes, and constants for calling on or determining features of the operating system that Python runs on. In many cases, functions in [os] are internally implemented using modules like [posix], [os2], [riscos], or [mac], but for portability it is better to use the [os] module.

Not everything in the [os] module is documented in this book. You can read about those features that are unlikely to be used in text processing applications in the _Python Library Reference_ that accompanies Python distributions.

Functions and constants not documented here fall into several categories. The functions and attributes `os.confstr()`, `os.confstr_names`, `os.sysconf()`, and `os.sysconf_names` let you probe system configuration. As well, I skip some functions specific to process permissions on Unix-like systems: `os.ctermid()`, `os.getegid()`, `os.geteuid()`, `os.getgid()`, `os.getgroups()`, `os.getlogin()`, `os.getpgrp()`, `os.getppid()`, `os.getuid()`, `os.setegid()`, `os.seteuid()`, `os.setgid()`, `os.setgroups()`, `os.setpgrp()`, `os.setpgid()`, `os.setreuid()`, `os.setregid()`, `os.setsid()`, and `os.setuid()`.

The functions `os.abort()`, `os.exec*()`, `os._exit()`, `os.fork()`, `os.forkpty()`, `os.plock()`, `os.spawn*()`, `os.times()`, `os.wait()`, `os.waitpid()`, `os.WIF*()`, `os.WEXITSTATUS()`, `os.WSTOPSIG()`, and `os.WTERMSIG()` and the constants `os.P_*` and `os.WNOHANG` all deal with process creation and management.
These are not documented in this book, since creating and managing multiple processes is not typically central to text processing tasks. However, I briefly document the basic capabilities in `os.kill()`, `os.nice()`, `os.startfile()`, and `os.system()` and in the `os.popen()` family. Some of the omitted functionality can also be found in the [commands] and [sys] modules.

A number of functions in the [os] module allow you to perform low-level I/O using file descriptors. In general, it is simpler to perform I/O using file objects created with the built-in `open()` function or the `os.popen*()` family. These file objects provide methods like `FILE.readline()`, `FILE.write()`, `FILE.seek()`, and `FILE.close()`. Information about files can be determined using the `os.stat()` function or functions in the [os.path] and [shutil] modules. Therefore, the functions `os.close()`, `os.dup()`, `os.dup2()`, `os.fpathconf()`, `os.fstat()`, `os.fstatvfs()`, `os.ftruncate()`, `os.isatty()`, `os.lseek()`, `os.open()`, `os.openpty()`, `os.pathconf()`, `os.pipe()`, `os.read()`, `os.statvfs()`, `os.tcgetpgrp()`, `os.tcsetpgrp()`, `os.ttyname()`, `os.umask()`, and `os.write()` are not covered here. As well, the supporting constants `os.O_*` and `os.pathconf_names` are omitted.

SEE ALSO, [commands], [os.path], [shutil], [sys]

FUNCTIONS:

os.access(pathname, operation)
Check the permission for the file or directory 'pathname'. If the type of operation specified is allowed, return a true value. The argument 'operation' is a number between 0 and 7, inclusive, and encodes four features: exists, executable, writable, and readable. These features have symbolic names:

>>> import os
>>> os.F_OK, os.X_OK, os.W_OK, os.R_OK
(0, 1, 2, 4)

To query a specific combination of features, you may add or bitwise-or the individual features.

>>> os.access('myfile', os.W_OK | os.R_OK)
1
>>> os.access('myfile', os.X_OK + os.R_OK)
0
>>> os.access('myfile', 6)
1

os.chdir(pathname)
Change the current working directory to the path 'pathname'.
SEE ALSO, `os.getcwd()`

os.chmod(pathname, mode)
Change the mode of file or directory 'pathname' to numeric mode 'mode'. See the 'man' page for the 'chmod' utility for more information on modes.

os.chown(pathname, uid, gid)
Change the owner and group of file or directory 'pathname' to 'uid' and 'gid' respectively. See the 'man' page for the 'chown' utility for more information.

os.chroot(pathname)
Change the root directory under Unix-like systems (on Python 2.2+). See the 'man' page for the 'chroot' utility for more information.

os.getcwd()
Return the current working directory as a string.

>>> os.getcwd()
'/Users/quilty/Book'

SEE ALSO, `os.chdir()`

os.getenv(var [,value=None])
Return the value of environment variable 'var'. If the environment variable is not defined, return 'value'. An equivalent call is 'os.environ.get(var, value)'.
SEE ALSO, `os.environ`, `os.putenv()`

os.getpid()
Return the current process id. Possibly useful for calls to external utilities that use process IDs.
SEE ALSO, `os.kill()`

os.kill(pid, sig)
Kill an external process on Unix-like systems. You will need to determine values for the 'pid' argument by some means, such as a call to the 'ps' utility. Values for the signal 'sig' sent to the process may be found in the [signal] module or with 'man signal'. For example:

>>> from signal import *
>>> SIGHUP, SIGINT, SIGQUIT, SIGIOT, SIGKILL
(1, 2, 3, 6, 9)
>>> def kill_by_name(progname):
...     pidstr = os.popen('ps|grep '+progname+'|sort').read()
...     pid = int(pidstr.split()[0])
...     os.kill(pid, 9)
...
>>> kill_by_name('myprog')

os.link(src, dst)
Create a hard link from path 'src' to path 'dst' on Unix-like systems. See the 'man' page on the 'ln' utility for more information.
SEE ALSO, `os.symlink()`

os.listdir(pathname)
Return a list of the names of files and directories at path 'pathname'. The special entries for the current and parent directories (typically "." and "..") are excluded from the list.

os.lstat(pathname)
Information on file or directory 'pathname'. See `os.stat()` for details. `os.lstat()` does not follow symbolic links.
SEE ALSO, `os.stat()`, [stat]

os.mkdir(pathname [,mode=0777])
Create a directory named 'pathname' with the numeric mode 'mode'. On some operating systems, 'mode' is ignored. See the 'man' page for the 'chmod' utility for more information on modes.
SEE ALSO, `os.chmod()`, `os.makedirs()`

os.makedirs(pathname [,mode=0777])
Create a directory named 'pathname' with the numeric mode 'mode'. Unlike `os.mkdir()`, this function will create any intermediate directories needed for a nested directory.
SEE ALSO, `os.mkdir()`

os.mkfifo(pathname [,mode=0666])
Create a named pipe on Unix-like systems.

os.nice(increment)
Decrease the process priority of the current application under Unix-like systems. This is useful if you do not wish for your application to hog system CPU resources.

The four functions in the `os.popen*()` family allow you to run external processes and capture their STDOUT and STDERR and/or set their STDIN. The members of the family differ somewhat in how these three pipes are handled.

os.popen(cmd [,mode="r" [,bufsize]])
Open a pipe to or from the external command 'cmd'. The return value of the function is an open file object connected to the pipe. The 'mode' may be 'r' for read (the default) or 'w' for write. The exit status of the command is returned when the file object is closed. An optional buffer size 'bufsize' may be specified.

>>> import os
>>> def ls(pat):
...     stdout = os.popen('ls '+pat)
...     result = stdout.read()
...     status = stdout.close()
...     if status: print "Error status", status
...     else: print result
...
>>> ls('nosuchfile')
ls: nosuchfile: No such file or directory
Error status 256
>>> ls('chap[7-9].txt')
chap7.txt

os.popen2(cmd [,mode [,bufsize]])
Open both STDIN and STDOUT pipes to the external command 'cmd'. The return value is a pair of file objects connecting to the two respective pipes. 'mode' and 'bufsize' work as with `os.popen()`.
SEE ALSO, `os.popen3()`, `os.popen()`

os.popen3(cmd [,mode [,bufsize]])
Open STDIN, STDOUT and STDERR pipes to the external command 'cmd'. The return value is a 3-tuple of file objects connecting to the three respective pipes. 'mode' and 'bufsize' work as with `os.popen()`.

>>> import os
>>> stdin, stdout, stderr = os.popen3('sed s/line/LINE/')
>>> print >>stdin, 'line one'
>>> print >>stdin, 'line two'
>>> stdin.write('line three\n')
>>> stdin.close()
>>> stdout.read()
'LINE one\nLINE two\nLINE three\n'
>>> stderr.read()
''

os.popen4(cmd [,mode [,bufsize]])
Open STDIN, STDOUT, and STDERR pipes to the external command 'cmd'. In contrast to `os.popen3()`, `os.popen4()` combines STDOUT and STDERR on the same pipe. The return value is a pair of file objects connecting to the two respective pipes. 'mode' and 'bufsize' work as with `os.popen()`.
SEE ALSO, `os.popen3()`, `os.popen()`

os.putenv(var, value)
Set the environment variable 'var' to the value 'value'.
Changes to the current environment only affect subprocesses of the current process, such as those launched with `os.system()` or `os.popen()`, not the whole OS. Calls to `os.putenv()` will update the environment, but not the `os.environ` variable. Therefore, it is better to update `os.environ` directly (which also changes the external environment).
SEE ALSO, `os.environ`, `os.getenv()`, `os.popen()`, `os.system()`

os.readlink(linkname)
Return a string containing the path symbolic link 'linkname' points to. Works on Unix-like systems.
SEE ALSO, `os.symlink()`

os.remove(filename)
Remove the file named 'filename'. This function is identical to `os.unlink()`. If the file cannot be removed, an 'OSError' is raised.
SEE ALSO, `os.unlink()`

os.removedirs(pathname)
Remove the directory named 'pathname', then try to remove each successive parent directory in the path, stopping at the first parent that is not empty. This function will not remove directories that contain files and will raise an 'OSError' if you attempt to remove one.
SEE ALSO, `os.rmdir()`

os.rename(src, dst)
Rename the file or directory 'src' as 'dst'. Depending on the operating system, the operation may raise an 'OSError' if 'dst' already exists.
SEE ALSO, `os.renames()`

os.renames(src, dst)
Rename the file or directory 'src' as 'dst'. Unlike `os.rename()`, this function will create any intermediate directories needed for a nested directory.
SEE ALSO, `os.rename()`

os.rmdir(pathname)
Remove the directory named 'pathname'. This function will not remove nonempty directories and will raise an 'OSError' if you attempt to do so.
SEE ALSO, `os.removedirs()`

os.startfile(path)
Launch an application under Windows systems. The behavior is the same as if 'path' were double-clicked in a Drives window or as if you typed 'start <path>' at a command line. Using Windows associations, a data file can be launched in the same manner as an actual executable application.
SEE ALSO, `os.system()`

os.stat(pathname)
Create a 'stat_result' object that contains information on the file or directory 'pathname'. A 'stat_result' object has a number of attributes and also behaves like a tuple of numeric values. Before Python 2.2, only the tuple was provided. The attributes of a 'stat_result' object are named the same as the constants in the [stat] module, but in lowercase.

>>> import os, stat
>>> file_info = os.stat('chap1.txt')
>>> file_info.st_size
87735L
>>> file_info[stat.ST_SIZE]
87735L

On some platforms, additional attributes are available. For example, Unix-like systems usually have '.st_blocks', '.st_blksize', and '.st_rdev' attributes; MacOS has '.st_rsize', '.st_creator', and '.st_type'; RISCOS has '.st_ftype', '.st_attrs', and '.st_obtype'.
SEE ALSO, [stat], `os.lstat()`

os.strerror(code)
Give a description for a numeric error code 'code', such as that returned by 'os.popen(bad_cmd).close()'.
SEE ALSO, `os.popen()`

os.symlink(src, dst)
Create a soft link from path 'src' to path 'dst' on Unix-like systems. See the 'man' page on the 'ln' utility for more information.
SEE ALSO, `os.link()`, `os.readlink()`

os.system(cmd)
Execute the command 'cmd' in a subshell. Unlike execution using `os.popen()`, the output of the executed process is not captured (but it may still echo to the same terminal as the current Python application). In some cases, you can use `os.system()` on non-Windows systems to detach an application in a manner similar to `os.startfile()`.
For example, under MacOSX, you could launch the TextEdit application with:

>>> import os
>>> cmd="/Applications/TextEdit.app/Contents/MacOS/TextEdit &"
>>> os.system(cmd)
0

SEE ALSO, `os.popen()`, `os.startfile()`, [commands]

os.tempnam([dir [,prefix]])
Return a unique filename for a temporary file. If optional argument 'dir' is specified, that directory will be used in the path; if 'prefix' is specified, the file will have the indicated prefix. For most purposes, it is more secure to use `os.tmpfile()` to directly obtain a file object rather than first generating a name.
SEE ALSO, [tempfile], `os.tmpfile()`

os.tmpfile()
Return an "invisible" file object in update mode. This file does not create a directory entry, but simply acts as a transient buffer for data on the filesystem.
SEE ALSO, [tempfile], [StringIO], [cStringIO]

os.uname()
Return detailed information about the current operating system on recent Unix-like systems. The returned 5-tuple contains sysname, nodename, release, version, and machine, each as descriptive strings.

os.unlink(filename)
Remove the file named 'filename'. This function is identical to `os.remove()`. If the file cannot be removed, an 'OSError' is raised.
SEE ALSO, `os.remove()`

os.utime(pathname, times)
Set the access and modification timestamps of file 'pathname' to the tuple '(atime, mtime)' specified in 'times'. Alternately, if 'times' is 'None', set both timestamps to the current time.
SEE ALSO, [time], `os.chmod()`, `os.chown()`, `os.stat()`

CONSTANTS AND ATTRIBUTES:

os.altsep
Usually 'None', but an alternative path delimiter ("/") under Windows.

os.curdir
The string the operating system uses to refer to the current directory; for example, "." on Unix or ":" on Macintosh (before MacOSX).

os.defpath
The search path used by `os.exec*p*()` and `os.spawn*p*()` absent a PATH environment variable.

os.environ
A dictionary-like object containing the current environment.

>>> os.environ['TERM']
'vt100'
>>> os.environ['TERM'] = 'vt220'
>>> os.getenv('TERM')
'vt220'

SEE ALSO, `os.getenv()`, `os.putenv()`

os.linesep
The string that delimits lines in a file; for example "\n" on Unix, "\r" on Macintosh, "\r\n" on Windows.

os.name
A string identifying the operating system the current Python interpreter is running on. Possible strings include 'posix', 'nt', 'dos', 'mac', 'os2', 'ce', 'java', and 'riscos'.

os.pardir
The string the operating system uses to refer to the parent directory; for example, ".." on Unix or "::" on Macintosh (before MacOSX).

os.pathsep
The string that delimits search paths; for example, ";" on Windows or ":" on Unix.

os.sep
The string the operating system uses to refer to path delimiters; for example "/" on Unix, "\" on Windows, ":" on Macintosh.

SEE ALSO, [sys], [os.path]

TOPIC -- Special Data Values and Formats
--------------------------------------------------------------------

=================================================================
MODULE -- random : Pseudo-random value generator
=================================================================

Python provides better pseudo-random number generation than do most C libraries with a 'rand()' function, but not good enough for cryptographic purposes. The period of Python's Wichmann-Hill generator is about 7 trillion (7e13), but that merely indicates how long it will take a particular seeded generator to cycle; a different seed will produce a different sequence of numbers. Python 2.3 uses the superior Mersenne Twister generator, which has a longer period and has been better analyzed.
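A quick sketch of the role the seed plays (my own illustration; the seed value 42 is arbitrary, and the particular floats produced vary between Python versions):

>>> import random
>>> random.seed(42)
>>> a = [random.random() for i in range(3)]
>>> random.seed(42)              # same seed ...
>>> b = [random.random() for i in range(3)]
>>> a == b                       # ... same sequence
1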
For practical purposes, pseudo-random numbers generated by Python are more than adequate for random-seeming behavior in applications.

The underlying pseudo-random numbers generated by the [random] module can be mapped into a variety of nonuniform patterns and distributions. Moreover, you can capture and tinker with the state of a pseudo-random generator; you can even subclass the `random.Random` class that operates behind the scenes. However, this latter sort of specialization is outside the scope of this book, and the class `random.Random`, and functions `random.getstate()`, `random.jumpahead()`, and `random.setstate()` are omitted from this discussion. The functions `random.whseed()` and `random.randint()` are deprecated.

FUNCTIONS:

random.betavariate(alpha, beta)
Return a floating point value in the range [0.0, 1.0) with a beta distribution.

random.choice(seq)
Select a random element from the nonempty sequence 'seq'.

random.cunifvariate(mean, arc)
Return a floating point value in the range ['mean-arc/2', 'mean+arc/2') with a circular uniform distribution. Arguments and result are expressed in radians.

random.expovariate(lambda_)
Return a floating point value in the range [0.0, +inf) with an exponential distribution. The argument 'lambda_' gives the -inverse- of the mean of the distribution.

>>> import random
>>> t1,t2 = 0,0
>>> for x in range(100):
...     t1 += random.expovariate(1./20)
...     t2 += random.expovariate(20.)
...
>>> print t1/100, t2/100
18.4021962198 0.0558234063338

random.gamma(alpha, beta)
Return a floating point value with a gamma distribution (not the gamma function).

random.gauss(mu, sigma)
Return a floating point value with a Gaussian distribution; the mean is 'mu' and the sigma is 'sigma'. `random.gauss()` is slightly faster than `random.normalvariate()`.

random.lognormvariate(mu, sigma)
Return a floating point value with a log normal distribution; the natural logarithm of this distribution is Gaussian with mean 'mu' and sigma 'sigma'.

random.normalvariate(mu, sigma)
Return a floating point value with a Gaussian distribution; the mean is 'mu' and the sigma is 'sigma'.

random.paretovariate(alpha)
Return a floating point value with a Pareto distribution. 'alpha' specifies the shape parameter.

random.random()
Return a floating point value in the range [0.0, 1.0).

random.randrange([start=0,] stop [,step=1])
Return a random element from the specified range. Functionally equivalent to the expression 'random.choice(range(start,stop,step))', but it does not build the actual range object. Use `random.randrange()` in place of the deprecated `random.randint()`.

random.seed([x=time.time()])
Initialize the Wichmann-Hill generator. You do not necessarily -need- to call `random.seed()`, since the current system time is used to initialize the generator upon module import. But if you wish to provide more entropy in the initial state, you may pass any hashable object as argument 'x'. Your best choice for 'x' is a positive long integer less than 27814431486575L, whose value is selected at random by independent means.

random.shuffle(seq [,random=random.random])
Permute the mutable sequence 'seq' in place. An optional argument 'random' may be specified to use an alternate random generator, but it is unlikely you will want to use one. The number of possible permutations grows very quickly with sequence length, so even for moderately sized sequences, not every permutation can be produced by the generator.

random.uniform(min, max)
Return a random floating point value in the range [min, max).
random.vonmisesvariate(mu, kappa)
Return a floating point value with a von Mises distribution. 'mu' is the mean angle expressed in radians, and 'kappa' is the concentration parameter.

random.weibullvariate(alpha, beta)
Return a floating point value with a Weibull distribution. 'alpha' is the scale parameter, and 'beta' is the shape parameter.

=================================================================
MODULE -- struct : Create and read packed binary strings
=================================================================

The [struct] module allows you to compactly encode Python numeric values as packed binary strings. This module may also be used to read C structs that use the same formats; some formatting codes are only useful for reading C structs. The exception `struct.error` is raised if a format does not match its string or values.

A format string consists of a sequence of alphabetic formatting codes. Each code is represented by zero or more bytes in the encoded packed binary string. Each formatting code may be preceded by a number indicating a number of occurrences. The entire format string may be preceded by a global flag. If the flag '@' is used, platform-native data sizes and endianness are used. In all other cases, standard data sizes are used. The flag '=' explicitly indicates platform endianness; '<' indicates little-endian representations; '>' or '!' indicates big-endian representations.

The available formatting codes are listed below. The standard sizes are given (check your platform for its sizes if platform-native sizes are needed).

#------ Formatting codes for struct module -----#
x   pad byte             0 bytes
c   char                 1 bytes
b   signed char          1 bytes
B   unsigned char        1 bytes
h   short int            2 bytes
H   unsigned short       2 bytes
i   int                  4 bytes
I   unsigned int         4 bytes
l   long int             4 bytes
L   unsigned long        4 bytes
q   long long int        8 bytes
Q   unsigned long long   8 bytes
f   float                4 bytes
d   double               8 bytes
s   string               padded to size
p   Pascal string        padded to size
P   char pointer         4 bytes

Some usage examples clarify the encoding:

>>> import struct
>>> struct.pack('5s5p2c', 'sss','ppp','c','c')
'sss\x00\x00\x03ppp\x00cc'
>>> struct.pack('h', 1)
'\x00\x01'
>>> struct.pack('I', 1)
'\x00\x00\x00\x01'
>>> struct.pack('l', 1)
'\x00\x00\x00\x01'
>>> struct.pack('<l', 1)
'\x01\x00\x00\x00'
>>> struct.pack('f', 1)
'?\x80\x00\x00'
>>> struct.pack('hil', 1,2,3)
'\x00\x01\x00\x00\x00\x00\x00\x02\x00\x00\x00\x03'

FUNCTIONS:

struct.calcsize(fmt)
Return the length of the string that corresponds to the format 'fmt'.

struct.pack(fmt, v1 [,v2 [...]])
Return a string with values 'v1', et alia, packed according to the format 'fmt'.

struct.unpack(fmt, s)
Return a tuple of values represented by string 's' packed according to the format 'fmt'.

=================================================================
MODULE -- time : Functions to manipulate date/time values
=================================================================

The [time] module is useful both for computing and displaying dates and time increments, and for simple benchmarking of applications and functions. For some purposes, eGenix.com's [mx.Date] module is more useful for manipulating datetimes than is [time]; you may obtain [mx.Date] from eGenix.com.

Time tuples--used by several functions--consist of year, month, day, hour, minute, second, weekday, Julian day, and Daylight Savings flag. All values are integers. Month, day, and Julian day (day of year) are one-based; hour, minute, second, and weekday are zero-based (Monday is 0). The Daylight Savings flag uses 1 for DST, 0 for Standard Time, and -1 for "best guess."
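For example, a small sketch of pulling fields out of a time tuple (the specific moment is borrowed from the sessions below and assumes the same US/Eastern timezone; the weekday-name list is my own):

>>> import time
>>> tup = time.localtime(1035526125)
>>> tup
(2002, 10, 25, 2, 8, 45, 4, 298, 1)
>>> year, month, day = tup[0], tup[1], tup[2]
>>> ['Mon','Tue','Wed','Thu','Fri','Sat','Sun'][tup[6]]
'Fri'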
CONSTANTS AND ATTRIBUTES:

time.accept2dyear
Boolean to allow two-digit years in date tuples. The default is a true value, in which case the first matching date since 'time.gmtime(0)' is extrapolated.

>>> import time
>>> time.accept2dyear
1
>>> time.localtime(time.mktime((99,1,1,0,0,0,0,0,0)))
(1999, 1, 1, 0, 0, 0, 4, 1, 0)
>>> time.gmtime(0)
(1970, 1, 1, 0, 0, 0, 3, 1, 0)

time.altzone
time.daylight
time.timezone
time.tzname
These several constants show information on the current timezone. Different locations use Daylight Savings adjustments during different portions of the year, usually but not always a one-hour adjustment. `time.daylight` indicates only whether such an adjustment is available in `time.altzone`. `time.timezone` indicates how many seconds west of UTC the current zone is; `time.altzone` adjusts that for Daylight Savings if possible. `time.tzname` gives a tuple of strings describing the current zone.

>>> time.daylight, time.tzname
(1, ('EST', 'EDT'))
>>> time.altzone, time.timezone
(14400, 18000)

FUNCTIONS:

time.asctime([tuple=time.localtime()])
Return a string description of a time tuple.

>>> time.asctime((2002, 10, 25, 1, 51, 48, 4, 298, 1))
'Fri Oct 25 01:51:48 2002'

SEE ALSO, `time.ctime()`, `time.strftime()`

time.clock()
Return the processor time for the current process. The raw value returned has little inherent meaning, but the value is guaranteed to increase roughly in proportion to the amount of CPU time used by the process. This makes `time.clock()` useful for comparative benchmarking of various operations or approaches. The values returned should not be compared between different CPUs, OSs, and so on, but are meaningful on one machine. For example:

#*---------- Use of time.clock() for benchmarking --------#
import time
start1 = time.clock()
approach_one()
time1 = time.clock()-start1
start2 = time.clock()
approach_two()
time2 = time.clock()-start2
if time1 > time2:
    print "The second approach seems better"
else:
    print "The first approach seems better"

Always use `time.clock()` for benchmarking rather than `time.time()`. The latter is a low-resolution "wall clock" only.

time.ctime([seconds=time.time()])
Return a string description of 'seconds' since epoch.

>>> time.ctime(1035526125)
'Fri Oct 25 02:08:45 2002'

SEE ALSO, `time.asctime()`

time.gmtime([seconds=time.time()])
Return a time tuple of 'seconds' since epoch, giving Greenwich Mean Time.

>>> time.gmtime(1035526125)
(2002, 10, 25, 6, 8, 45, 4, 298, 0)

SEE ALSO, `time.localtime()`

time.localtime([seconds=time.time()])
Return a time tuple of 'seconds' since epoch, giving the local time.

>>> time.localtime(1035526125)
(2002, 10, 25, 2, 8, 45, 4, 298, 1)

SEE ALSO, `time.gmtime()`, `time.mktime()`

time.mktime(tuple)
Return a number of seconds since epoch corresponding to a time tuple.

>>> time.mktime((2002, 10, 25, 2, 8, 45, 4, 298, 1))
1035526125.0

SEE ALSO, `time.localtime()`

time.sleep(seconds)
Suspend execution for approximately 'seconds' measured in "wall clock" time (not CPU time). The argument 'seconds' is a floating point value (precision subject to system timer) and is fully thread safe.

time.strftime(format [,tuple=time.localtime()])
Return a custom string description of a time tuple.
The format given in the string 'format' may contain the following fields: '%a'/'%A'/'%w' for abbreviated/full/decimal weekday name; '%b'/'%B'/'%m' for abbreviated/full/decimal month; '%y'/'%Y' for abbreviated/full year; '%d' for day-of-month; '%H'/'%I' for 24/12 clock hour; '%j' for day-of-year; '%M' for minute; '%p' for AM/PM; '%S' for seconds; '%U'/'%W' for week-of-year (Sunday/Monday start); '%c'/'%x'/'%X' for locale-appropriate datetime/date/time; '%Z' for timezone name. Other characters may occur in the format also and will appear as literals (a literal '%' is escaped as '%%').

>>> import time
>>> tuple = (2002, 10, 25, 2, 8, 45, 4, 298, 1)
>>> time.strftime("%A, %B %d '%y (week %U)", tuple)
"Friday, October 25 '02 (week 42)"

SEE ALSO, `time.asctime()`, `time.ctime()`, `time.strptime()`

time.strptime(s [,format="%a %b %d %H:%M:%S %Y"])
Return a time tuple based on a string description of a time. The format given in the string 'format' follows the same rules as in `time.strftime()`. Not available on most platforms.
SEE ALSO, `time.strftime()`

time.time()
Return the number of seconds since the epoch for the current time. You can specifically determine the epoch using 'time.ctime(0)', but normally you will use other functions in the [time] module to generate useful values. Even though `time.time()` is also generally nondecreasing in its return values, you should use `time.clock()` for benchmarking purposes.

>>> time.ctime(0)
'Wed Dec 31 19:00:00 1969'
>>> time.time()
1035585490.484154
>>> time.ctime(1035585437)
'Fri Oct 25 18:37:17 2002'

SEE ALSO, `time.clock()`, `time.ctime()`

SEE ALSO, `calendar`

SECTION 3 -- Other Modules in the Standard Library
------------------------------------------------------------------------

If your application performs other types of tasks besides text processing, a skim of this module list can suggest where to look for relevant functionality. As well, readers who find themselves maintaining code written by other developers may find that unfamiliar modules are imported by the existing code. If an imported module is not summarized in the list below, nor documented elsewhere, it is probably an in-house or third-party module. For standard library modules, the summaries here will at least give you a sense of the general purpose of a given module.

__builtin__
Access to built-in functions, exceptions, and other objects. Python does a great job of exposing its own internals, but "normal" developers do not need to worry about this.

TOPIC -- Serializing and Storing Python Objects
--------------------------------------------------------------------

In object-oriented programming (OOP) languages like Python, compound and structured data are frequently represented at runtime as native objects. At times these objects belong to basic datatypes--lists, tuples, dictionaries--but more often, once you reach a certain degree of complexity, hierarchies of instances containing attributes become more likely.

For simple objects, especially sequences, serialization and storage is rather straightforward. For example, lists can easily be represented in delimited or fixed-length strings. Lists-of-lists can be saved in line-oriented files, each line containing delimited fields, or in rows of RDBMS tables. But once the dimension of nested sequences goes past two, and even more so for heterogeneous data structures, traditional table-oriented storage is a less-obvious fit.
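To make the simple end of this concrete, a minimal sketch of saving and restoring a list-of-lists using delimited fields (the tab delimiter and filename are arbitrary choices of this example; fields are assumed not to contain tabs or newlines):

#*------ Sketch: list-of-lists as delimited lines ------#
rows = [['apple', '3', 'red'], ['pear', '7', 'green']]
fp = open('rows.txt', 'w')
for row in rows:
    fp.write('\t'.join(row) + '\n')   # one delimited line per row
fp.close()
# Recover the same nested structure
rows2 = [line[:-1].split('\t') for line in open('rows.txt').readlines()]
assert rows2 == rows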
While it is -possible- to create "object/relational adaptors" that write OOP instances to flat tables, that usually requires custom programming. A number of more general solutions exist, both in the Python standard library and in third-party tools.

There are actually two separate issues involved in storing Python objects. The first issue is how to convert them into strings; the second issue is how to create a general persistence mechanism for such serialized objects. At a minimal level, of course, it is simple enough to store (and retrieve) a serialization string the same way you would any other string--to a file, a database, and so on. The various [*dbm] modules create a "dictionary on disk," while the [shelve] module automatically utilizes [cPickle] serialization to write arbitrary objects as values (keys are still strings).

Several third-party modules support object serialization with special features. If you need an XML dialect for your object representation, the modules [gnosis.xml.pickle] and [xmlrpclib] are useful. The YAML format is both human-readable/editable and has support libraries for Python, Perl, Ruby, and Java; using these various libraries, you can exchange objects between these several programming languages.

SEE ALSO, `gnosis.xml.pickle`, `yaml`, `xmlrpclib`

=================================================================
MODULES -- DBM : Interfaces to dbm-style databases
=================================================================

A dbm-style database is a "dictionary on disk." Using a database of this sort allows you to store a set of key/val pairs to a file, or files, on the local filesystem, and to access and set them as if they were an in-memory dictionary. A dbm-style database, unlike a standard dictionary, always maps strings to strings. If you need to store other types of objects, you will need to convert them to strings (or use the [shelve] module as a wrapper).

Depending on your platform, and on which external libraries are installed, different dbm modules might be available. The performance characteristics of the various modules vary significantly. As well, some DBM modules support some special functionality. Most of the time, however, your best approach is to access the locally supported DBM module using the wrapper module [anydbm]. Calls to this module will select the best available DBM for the current environment without a programmer or user having to worry about the underlying support mechanism.

Functions and methods are documented using the nonspecific capitalized form 'DBM'. In real usage, you would use the name of a specific module. Most of the time, you will get or set DBM values using standard named indexing, for example, 'db["key"]'. A few methods characteristic of dictionaries are also supported, as well as a few methods special to DBM databases.

SEE ALSO, [shelve], [dict], [UserDict]

FUNCTIONS:

DBM.open(fname [,flag="r" [,mode=0666]])
Open the filename 'fname' for dbm access. The optional argument 'flag' specifies how the database is accessed. A value of 'r' is for read-only access (on an existing dbm file); 'w' opens an already existing file for read/write access; 'c' will create a database or use an existing one, with read/write access; the option 'n' will always create a new database, erasing the one named in 'fname' if it already existed. The optional 'mode' argument specifies the Unix mode of the file(s) created.

METHODS:

DBM.close()
Close the database and flush any pending writes.
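A minimal sketch of typical usage, via the generic [anydbm] wrapper (the filename 'scratch.db' is an arbitrary choice of this example):

>>> import anydbm
>>> db = anydbm.open('scratch.db', 'c')  # create if not present
>>> db['key'] = 'value'                  # keys and values are strings
>>> db.close()                           # flush to disk
>>> db = anydbm.open('scratch.db')       # reopen, read-only by default
>>> db['key']
'value'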
DBM.first()
Return the first key/val pair in the DBM. The order is arbitrary but stable. You may use the `DBM.first()` method, combined with repeated calls to `DBM.next()`, to process every item in the dictionary. In Python 2.2+, you can implement an 'items()' function to emulate the behavior of the '.items()' method of dictionaries for DBMs:

>>> from __future__ import generators
>>> def items(db):
...     try:
...         yield db.first()
...         while 1:
...             yield db.next()
...     except KeyError:
...         raise StopIteration
...
>>> for k,v in items(d):   # typical usage
...     print k,v

DBM.has_key(key)
Return a true value if the DBM has the key 'key'.

DBM.keys()
Return a list of string keys in the DBM.

DBM.last()
Return the last key/val pair in the DBM. The order is arbitrary but stable. You may use the `DBM.last()` method, combined with repeated calls to `DBM.previous()`, to process every item in the dictionary in reverse order.

DBM.next()
Return the next key/val pair in the DBM. A pointer to the current position is always maintained, so the methods `DBM.next()` and `DBM.previous()` can be used to access relative items.

DBM.previous()
Return the previous key/val pair in the DBM. A pointer to the current position is always maintained, so the methods `DBM.next()` and `DBM.previous()` can be used to access relative items.

DBM.sync()
Force any pending data to be written to disk.
SEE ALSO, `FILE.flush()`

MODULES:

anydbm
Generic interface to underlying DBM support. Calls to this module use the functionality of the "best available" DBM module. If you open an existing database file, its type is guessed and used--assuming the current machine supports that style.
SEE ALSO, `whichdb`

bsddb
Interface to the Berkeley DB library.

dbhash
Interface to the BSD DB library.

dbm
Interface to the Unix (n)dbm library.

dumbdbm
Interface to slow, but portable pure Python DBM.

gdbm
Interface to the GNU DBM (GDBM) library.

whichdb
Guess which db package to use to open a db file. This module contains the single function `whichdb.whichdb()`. If you open an existing DBM file with [anydbm], this function is called automatically behind the scenes.

SEE ALSO, [shelve]

=================================================================
MODULE -- cPickle : Fast Python object serialization
=================================================================
MODULE -- pickle : Standard Python object serialization
=================================================================

The module [cPickle] is a comparatively fast C implementation of the pure Python [pickle] module. The streams produced and read by [cPickle] and [pickle] are interchangeable. The only time you should prefer [pickle] is in the uncommon case where you wish to subclass the pickling base class; [cPickle] is many times faster to use. The class `pickle.Pickler` is not documented here.

The [cPickle] and [pickle] modules support both a binary and an ASCII format. Neither is designed for human readability, but it is not hugely difficult to read an ASCII pickle. Nonetheless, if readability is a goal, [yaml] or [gnosis.xml.pickle] are better choices. Binary format produces smaller pickles that are faster to write or load.

It is possible to fine-tune the pickling behavior of objects by defining the methods '.__getstate__()', '.__setstate__()', and '.__getinitargs__()'. The particular black magic invocations involved in defining these methods, however, are not addressed in this book and are rarely necessary for "normal" objects (i.e., those that represent data structures).
Use of the [cPickle] or [pickle] module is quite simple:

>>> import cPickle
>>> from somewhere import my_complex_object
>>> s = cPickle.dumps(my_complex_object)
>>> new_obj = cPickle.loads(s)

FUNCTIONS:

pickle.dump(o, file [,bin=0])
cPickle.dump(o, file [,bin=0])
Write a serialized form of the object 'o' to the file-like object 'file'. If the optional argument 'bin' is given a true value, use binary format.

pickle.dumps(o [,bin=0])
cPickle.dumps(o [,bin=0])
Return a serialized form of the object 'o' as a string. If the optional argument 'bin' is given a true value, use binary format.

pickle.load(file)
cPickle.load(file)
Return an object that was serialized as the contents of the file-like object 'file'.

pickle.loads(s)
cPickle.loads(s)
Return an object that was serialized in the string 's'.

SEE ALSO, `gnosis.xml.pickle`, `yaml`

marshal
Internal Python object serialization. For more general object serialization, use [pickle], [cPickle], or [gnosis.xml.pickle], or the YAML tools; [marshal] is a limited-purpose serialization to the pseudo-compiled byte-code format used by Python '.pyc' files.

=================================================================
MODULE -- pprint : Pretty-print basic datatypes
=================================================================

The module [pprint] is similar to the built-in function `repr()` and the module [repr]. The purpose of [pprint] is to represent objects of basic datatypes in a more readable fashion, especially in cases where collection types nest inside each other. In simple cases `pprint.pformat()` and `repr()` produce the same result; for more complex objects, [pprint] uses newlines and indentation to illustrate the structure of a collection. Where possible, the string representation produced by [pprint] functions can be used to re-create objects with the built-in `eval()`.

I find the module [pprint] somewhat limited in that it does not produce a particularly helpful representation of objects of custom types, which might themselves represent compound data. Instance attributes are very frequently used in a manner similar to dictionary keys. For example:

>>> import pprint
>>> dct = {1.7:2.5, ('t','u','p'):['l','i','s','t']}
>>> dct2 = {'this':'that', 'num':38, 'dct':dct}
>>> class Container: pass
...
>>> inst = Container()
>>> inst.this, inst.num, inst.dct = 'that', 38, dct
>>> pprint.pprint(dct2)
{'dct': {('t', 'u', 'p'): ['l', 'i', 's', 't'], 1.7: 2.5},
 'num': 38,
 'this': 'that'}
>>> pprint.pprint(inst)
<__main__.Container instance at 0x415770>

In the example, 'dct2' and 'inst' have the same structure, and either might plausibly be chosen in an application as a data container. But the latter [pprint] representation only tells us the barest information about -what- an object is, not what data it contains.
The mini-module below enhances pretty-printing:

#--------------------- pprint2.py ------------------------#
from pprint import pformat
import string, sys

def pformat2(o):
    if hasattr(o,'__dict__'):
        lines = []
        klass = o.__class__.__name__
        module = o.__module__
        desc = '<%s.%s instance at 0x%x>' % (module, klass, id(o))
        lines.append(desc)
        for k,v in o.__dict__.items():
            lines.append('instance.%s=%s' % (k, pformat(v)))
        return string.join(lines,'\n')
    else:
        return pformat(o)   # only pformat() itself was imported

def pprint2(o, stream=sys.stdout):
    stream.write(pformat2(o)+'\n')

Continuing the session above, we get a more useful report:

>>> import pprint2
>>> pprint2.pprint2(inst)
<__main__.Container instance at 0x415770>
instance.this='that'
instance.dct={('t', 'u', 'p'): ['l', 'i', 's', 't'], 1.7: 2.5}
instance.num=38

FUNCTIONS:

pprint.isreadable(o)
    Return a true value if the equality below holds:

      #*------------ Round-tripping with pprint ----------------#
      o == eval(pprint.pformat(o))

pprint.isrecursive(o)
    Return a true value if the object 'o' contains recursive
    containers. Objects that contain themselves at any nested
    level cannot be restored with `eval()`.

pprint.pformat(o)
    Return a formatted string representation of the object 'o'.

pprint.pprint(o [,stream=sys.stdout])
    Print the formatted representation of the object 'o' to the
    file-like object 'stream'.

CLASSES:

pprint.PrettyPrinter(indent=1, width=80, depth=None, stream=sys.stdout)
    Return a pretty-printing object that will format using a width
    of 'width', will limit recursion to depth 'depth', and will
    indent each new level by 'indent' spaces. The method
    `pprint.PrettyPrinter.pprint()` will write to the file-like
    object 'stream'.

      >>> pp = pprint.PrettyPrinter(width=30)
      >>> pp.pprint(dct2)
      {'dct': {1.7: 2.5,
               ('t', 'u', 'p'): ['l',
                                 'i',
                                 's',
                                 't']},
       'num': 38,
       'this': 'that'}

METHODS:

The class `pprint.PrettyPrinter` has the same methods as the
module-level functions. The only difference is that the stream
used for `pprint.PrettyPrinter.pprint()` is configured when an
instance is initialized rather than passed as an optional
argument.

SEE ALSO, `gnosis.xml.pickle`, `yaml`

=================================================================
MODULE -- repr : Alternative object representation
=================================================================

The module [repr] contains code for customizing the string
representation of objects. In its default behavior the function
`repr.repr()` provides a length-limited string representation of
objects--in the case of large collections, displaying the entire
collection can be unwieldy, and unnecessary for merely
distinguishing objects. For example:

>>> dct = dict([(n,str(n)) for n in range(6)])
>>> repr(dct)     # much worse for, e.g., 1000 item dict
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
>>> from repr import repr
>>> repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', ...}"
>>> `dct`
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"

The back-tick operator does not change behavior if the built-in
`repr()` function is replaced. You can change the behavior of
`repr.repr()` by modifying attributes of the instance object
`repr.aRepr`:

>>> dct = dict([(n,str(n)) for n in range(6)])
>>> repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
>>> import repr
>>> repr.repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', ...}"
>>> repr.aRepr.maxdict = 5
>>> repr.repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', ...}"

In my opinion, the choice of the name for this module is
unfortunate, since it is identical to that of the built-in
function.
You can avoid some of the collision by using the 'as' form of
importing, as in:

>>> import repr as _repr
>>> from repr import repr as newrepr

For fine-tuned control of object representation, you may subclass
the class `repr.Repr`. Potentially, you could use substitutable
'repr()' functions to change the behavior of application output,
but if you anticipate such a need, it is better practice to give a
name that indicates this, for example, 'overridable_repr()'.

CLASSES:

repr.Repr()
    Base for customized object representations. The instance
    `repr.aRepr` automatically exists in the module namespace, so
    this class is useful primarily as a parent class. To change an
    attribute, it is simplest just to set it in an instance.

ATTRIBUTES:

repr.maxlevel
    Depth of recursive objects to follow.

repr.maxdict
repr.maxlist
repr.maxtuple
    Number of items in a collection of the indicated type to
    include in the representation. Sequences default to 6, dicts
    to 4.

repr.maxlong
    Number of digits of a long integer to stringify. Default is 40.

repr.maxstring
    Length of string representation (e.g., 's[:N]'). Default is 30.

repr.maxother
    "Catch-all" maximum length of other representations.

FUNCTIONS:

repr.repr(o)
    Behaves like built-in `repr()`, but potentially with a
    different string representation created.

repr.repr_TYPE(o, level)
    Represent an object of the type 'TYPE', where the names used
    are the standard type names. The argument 'level' indicates
    the level of recursion when this method is called (you might
    want to decide what to print based on how deep within the
    representation the object is). The _Python Library Reference_
    gives the example:

      #*--------------- Custom Repr class ---------------------#
      import repr, sys
      class MyRepr(repr.Repr):
          def repr_file(self, obj, level):
              if obj.name in ['<stdin>', '<stdout>', '<stderr>']:
                  return obj.name
              else:
                  return `obj`
      aRepr = MyRepr()
      print aRepr.repr(sys.stdin)     # prints '<stdin>'

=================================================================
MODULE -- shelve : General persistent dictionary
=================================================================

The module [shelve] builds on the capabilities of the DBM modules,
but takes things a step forward. Unlike with the DBM modules, you
may write arbitrary Python objects as values in a [shelve]
database. The keys in [shelve] databases, however, must still be
strings.

The methods of [shelve] databases are generally the same as those
for their underlying DBMs. However, shelves do not have the
'.first()', '.last()', '.next()', or '.previous()' methods; nor do
they have the '.items()' method that actual dictionaries do. Most
of the time you will simply use name-indexed assignment and
access. But from time to time, the available `shelve.get()`,
`shelve.keys()`, `shelve.sync()`, `shelve.has_key()`, and
`shelve.close()` methods are useful.

Usage of a shelve consists of a few simple steps like the ones
below:

>>> import shelve
>>> sh = shelve.open('test_shelve')
>>> sh.keys()
['this']
>>> sh['new_key'] = {1:2, 3:4, ('t','u','p'):['l','i','s','t']}
>>> sh.keys()
['this', 'new_key']
>>> sh['new_key']
{1: 2, 3: 4, ('t', 'u', 'p'): ['l', 'i', 's', 't']}
>>> del sh['this']
>>> sh.keys()
['new_key']
>>> sh.close()

In the example, I opened an existing shelve, and the previously
existing key/value pair was available. Deleting a key/value pair
is the same as doing so from a standard dictionary. Opening a new
shelve automatically creates the necessary file(s).
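Since shelve values may be any picklable objects, instances of
your own classes can go in just as easily as dictionaries. A
minimal sketch follows; the 'Point' class and the shelve filename
are hypothetical:

#*------- Sketch: shelving class instances ---------------#
import shelve

class Point:
    # Hypothetical class representing compound data
    def __init__(self, x, y):
        self.x, self.y = x, y

sh = shelve.open('points_shelve')   # hypothetical filename
sh['origin'] = Point(0, 0)          # value is pickled transparently
pt = sh['origin']                   # and unpickled upon access
print pt.x, pt.y                    # prints '0 0'
sh.close()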
Although [shelve] only allows strings to be used as keys, in a
pinch it is not difficult to generate strings that characterize
other types of immutable objects. For the same reasons that you do
not generally want to use mutable objects as dictionary keys, it
is also a bad idea to use mutable objects as [shelve] keys. Using
the built-in `hash()` function is a good way to generate
strings--but keep in mind that this technique does not strictly
guarantee uniqueness, so it is possible (but unlikely) to
accidentally overwrite entries using this hack:

>>> '%x' % hash((1,2,3,4,5))
'866123f4'
>>> '%x' % hash(3.1415)
'6aad0902'
>>> '%x' % hash(38)
'26'
>>> '%x' % hash('38')
'92bb58e3'

Integers, notice, are their own hash, and strings of digits are
common. Therefore, if you adopted this approach, you would want to
hash strings as well, before using them as keys. There is no real
problem with doing so, merely an extra indirection step that you
need to remember to use consistently:

>>> sh['%x' % hash('another_key')] = 'another value'
>>> sh.keys()
['new_key', '8f9ef0ca']
>>> sh['%x' % hash('another_key')]
'another value'
>>> sh['another_key']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/sw/lib/python2.2/shelve.py", line 70, in __getitem__
    f = StringIO(self.dict[key])
KeyError: another_key

If you want to go beyond the capabilities of [shelve] in several
ways, you might want to investigate the third-party library Zope
Object Database (ZODB). ZODB allows arbitrary objects to be
persistent, not only dictionary-like objects. Moreover, ZODB lets
you store data in ways other than in local files, and also has
adaptors for multiuser simultaneous access. Look for details at:

SEE ALSO, [DBM], [dict]

-*-

The rest of the listed modules are comparatively unlikely to be
needed in text processing applications. Some modules are specific
to a particular platform; if so, this is indicated
parenthetically. Recent distributions of Python have taken a
"batteries included" approach--much more is included in a base
Python distribution than is with other free programming languages
(but other popular languages still have a range of existing
libraries that can be downloaded separately).

TOPIC -- Platform-Specific Operations
--------------------------------------------------------------------

_winreg
    Access to the Windows registry (Windows).

AE
    AppleEvents (Macintosh; replaced by [Carbon.AE]).

aepack
    Conversion between Python variables and AppleEvent data
    containers (Macintosh).

aetypes
    AppleEvent objects (Macintosh).

applesingle
    Rudimentary decoder for AppleSingle format files (Macintosh).

buildtools
    Build MacOS applets (Macintosh).

calendar
    Print calendars, much like the Unix 'cal' utility. A variety
    of functions allow you to print or stringify calendars for
    various time frames. For example,

      >>> print calendar.month(2002,11)
         November 2002
      Mo Tu We Th Fr Sa Su
                   1  2  3
       4  5  6  7  8  9 10
      11 12 13 14 15 16 17
      18 19 20 21 22 23 24
      25 26 27 28 29 30

Carbon.AE, Carbon.App, Carbon.CF, Carbon.Cm, Carbon.Ctl,
Carbon.Dlg, Carbon.Evt, Carbon.Fm, Carbon.Help, Carbon.List,
Carbon.Menu, Carbon.Mlte, Carbon.Qd, Carbon.Qdoffs, Carbon.Qt,
Carbon.Res, Carbon.Scrap, Carbon.Snd, Carbon.TE, Carbon.Win
    Interfaces to Carbon API (Macintosh).

cd
    CD-ROM access on SGI systems (IRIX).

cfmfile
    Code Fragment Resource module (Macintosh).

ColorPicker
    Interface to the standard color selection dialog (Macintosh).

ctb
    Interface to the Communications Tool Box (Macintosh).

dl
    Call C functions in shared objects (Unix).
EasyDialogs
    Basic Macintosh dialogs (Macintosh).

fcntl
    Access to Unix 'fcntl()' and 'ioctl()' system functions
    (Unix).

findertools
    AppleEvents interface to MacOS finder (Macintosh).

fl, FL, flp
    Functions and constants for working with the FORMS library
    (IRIX).

fm, FM
    Functions and constants for working with the Font Manager
    library (IRIX).

fpectl
    Floating point exception control (Unix).

FrameWork, MiniAEFrame
    Structured development of MacOS applications (Macintosh).

gettext
    The module [gettext] eases the development of multilingual
    applications. While actual translations must be performed
    manually, this module aids in identifying strings for
    translation and runtime substitutions of language-specific
    strings.

grp
    Information on Unix groups (Unix).

locale
    Control the language and regional settings for an application.
    The 'locale' setting affects the behavior of several
    functions, such as `time.strftime()` and `string.lower()`. The
    [locale] module is also useful for creating strings such as
    numbers with grouped digits and currency strings for specific
    nations.

mac, macerrors, macpath
    Macintosh implementation of [os] module functionality. It is
    generally better to use [os] directly and let it call [mac]
    where needed (Macintosh).

macfs, macfsn, macostools
    File system services (Macintosh).

MacOS
    Access to MacOS Python interpreter (Macintosh).

macresource
    Locate script resources (Macintosh).

macspeech
    Interface to Speech Manager (Macintosh).

mactty
    Easy access to serial line connections (Macintosh).

mkcwproject
    Create CodeWarrior projects (Macintosh).

msvcrt
    Miscellaneous Windows-specific functions provided in
    Microsoft's Visual C++ Runtime libraries (Windows).

Nav
    Interface to Navigation Services (Macintosh).

nis
    Access to Sun's NIS Yellow Pages (Unix).

pipes
    Manage pipes at a finer level than done by `os.popen()` and
    its relatives. Reliability varies between platforms (Unix).

PixMapWrapper
    Wrap PixMap objects (Macintosh).

posix, posixfile
    Access to operating system functionality under Unix. The [os]
    module provides a more portable version of the same
    functionality and should be used instead (Unix).

preferences
    Application preferences manager (Macintosh).

pty
    Pseudo terminal utilities (IRIX, Linux).

pwd
    Access to Unix password database (Unix).

pythonprefs
    Preferences manager for Python (Macintosh).

py_resource
    Helper to create PYC resources for compiled applications
    (Macintosh).

quietconsole
    Buffered, nonvisible STDOUT output (Macintosh).

resource
    Examine resource usage (Unix).

syslog
    Interface to Unix syslog library (Unix).

tty, termios, TERMIOS
    POSIX tty control (Unix).

W
    Widgets for the Mac (Macintosh).

waste
    Interface to the WorldScript-Aware Styled Text Engine
    (Macintosh).

winsound
    Interface to audio hardware under Windows (Windows).

xdrlib
    Implements (a subset of) Sun eXternal Data Representation
    (XDR). In concept, [xdrlib] is similar to the [struct] module,
    but the format is less widely used.

TOPIC -- Working with Multimedia Formats
--------------------------------------------------------------------

aifc
    Read and write AIFC and AIFF audio files. The interface to
    [aifc] is the same as for the [sunau] and [wave] modules.

al, AL
    Audio functions for SGI (IRIX).

audioop
    Manipulate raw audio data.

chunk
    Read chunks of IFF audio data.

colorsys
    Convert between RGB color model and YIQ, HLS, and HSV color
    spaces.

gl, DEVICE, GL
    Functions and constants for working with Silicon Graphics'
    Graphics Library (IRIX).

imageop
    Manipulate image data stored as Python strings.
    For most operations on image files, the third-party -Python
    Imaging Library- () is a versatile and powerful tool.

imgfile
    Support for imglib files (IRIX).

jpeg
    Read and write JPEG files on SGI (IRIX). The -Python Imaging
    Library- () provides a cross-platform means of working with a
    large number of image formats and is preferable for most
    purposes.

rgbimg
    Read and write SGI RGB files (IRIX).

sunau
    Read and write Sun AU audio files. The interface to [sunau] is
    the same as for the [aifc] and [wave] modules.

sunaudiodev, SUNAUDIODEV
    Interface to Sun audio hardware (SunOS/Solaris).

videoreader
    Read QuickTime movies frame by frame (Macintosh).

wave
    Read and write WAV audio files. The interface to [wave] is the
    same as for the [aifc] and [sunau] modules.

TOPIC -- Miscellaneous Other Modules
--------------------------------------------------------------------

array
    Typed arrays of numeric values. More efficient than standard
    Python lists, where applicable.

atexit
    Exit handlers. Same functionality as `sys.exitfunc`, but
    different interface.

BaseHTTPServer, SimpleHTTPServer, SimpleXMLRPCServer, CGIHTTPServer
    HTTP server classes. [BaseHTTPServer] should usually be
    treated as an abstract class. The other modules provide
    sufficient customization for usage in the specific context
    indicated by their names. All may be customized for your
    application's needs.

Bastion
    Restricted object access. Used in conjunction with [rexec].

bisect
    List insertion maintaining sort order.

cmath
    Mathematical functions over complex numbers.

cmd
    Build line-oriented command interpreters.

code
    Utilities to emulate Python's interactive interpreter.

codeop
    Compile possibly incomplete Python source code.

compileall
    Module/script to compile .py files to cached byte-code files.

compiler, compiler.ast, compiler.visitor
    Analyze Python source code and generate Python byte-codes.

copy_reg
    Helper to provide extensibility for pickle/cPickle.

curses, curses.ascii, curses.panel, curses.textpad, curses.wrapper
    Full-screen terminal handling with the (n)curses library.

dircache
    Cached directory listing. This module enhances the
    functionality of `os.listdir()`.

dis
    Disassembler of Python byte-code into mnemonics.

distutils
    Build and install Python modules and packages. [distutils]
    provides a standard mechanism for creating distribution
    packages of Python tools and libraries, and also for
    installing them on target machines. Although [distutils] is
    likely to be useful for text processing applications that are
    distributed to users, a discussion of the details of working
    with [distutils] is outside the scope of this book. Useful
    information can be found in the Python standard documentation,
    especially Greg Ward's _Distributing Python Modules_ and
    _Installing Python Modules_.

doctest
    Check the accuracy of __doc__ strings.

errno
    Standard 'errno' system symbols.

fpformat
    General floating point formatting functions. Duplicates string
    interpolation functionality.

gc
    Control Python's (optional) cyclic garbage collection.

getpass
    Utilities to collect a password without echoing to screen.

imp
    Access the internals of the 'import' statement.

inspect
    Get useful information from live Python objects for Python
    2.1+.

keyword
    Check whether string is a Python keyword.

math
    Various trigonometric and algebraic functions and constants.
    These functions generally operate on floating point
    numbers--use [cmath] for calculations on complex numbers.

mutex
    Work with mutual exclusion locks, typically for threaded
    applications.
new
    Create special Python objects in customizable ways. For
    example, Python hackers can create a module object without
    using a file of the same name or create an instance while
    bypassing the normal '.__init__()' call. "Normal" techniques
    generally suffice for text processing applications.

pdb
    A Python debugger.

popen2
    Functions to spawn commands with pipes to STDIN, STDOUT, and
    optionally STDERR. In Python 2.0+, this functionality is
    copied to the [os] module in slightly improved form. Generally
    you should use the [os] module (unless you are running Python
    1.5.2 or earlier).

profile
    Profile the performance characteristics of Python code. If
    speed becomes an issue in your application, your first step in
    addressing the problem should be profiling the code. But
    details of using [profile] are outside the scope of this book.
    Moreover, it is usually a bad idea to -assume- speed is a
    problem until it is actually found to be so.

pstats
    Print reports on profiled Python code.

pyclbr
    Python class browser; useful for implementing code development
    environments for editing Python.

pydoc
    Extremely useful script and module for examining Python
    documentation. [pydoc] is included with Python 2.1+, but is
    compatible with earlier versions if downloaded. [pydoc] can
    provide help similar to Unix 'man' pages, help in the
    interactive shell, and also a Web browser interface to
    documentation. This tool is worth using frequently while
    developing Python applications, but its details are outside
    the scope of this book.

py_compile
    "Compile" a .py file to a .pyc (or .pyo) file.

Queue
    A multiproducer, multiconsumer queue, especially for threaded
    programming.

readline, rlcompleter
    Interface to GNU readline (Unix).

rexec
    Restricted execution facilities.

sched
    General event scheduler.

signal
    Handlers for asynchronous events.

site, user
    Customizable startup modules that can be modified to change
    the behavior of the local Python installation.

statcache
    Maintain a cache of `os.stat()` information on files.
    Deprecated in Python 2.2+.

statvfs
    Constants for interpreting the results of `os.statvfs()` and
    `os.fstatvfs()`.

thread, threading
    Create multithreaded applications with Python. Although text
    processing applications--like other applications--might use a
    threaded approach, this topic is outside the scope of this
    book. Most, but not all, Python platforms support threaded
    applications.

Tkinter, ScrolledText, Tix, turtle
    Python interface to Tcl/Tk and higher-level widgets for Tk.
    Supported on many platforms, but not on all Python
    installations.

traceback
    Extract, format, and print information about Python stack
    traces. Useful for debugging applications.

unittest
    Unit testing framework. Like a number of other documenting,
    testing, and debugging modules, [unittest] is a useful
    facility--and its usage is recommended for Python applications
    in general. But this module is not specific enough to text
    processing applications to be addressed in this book.

warnings
    Python 2.1 added a set of warning messages for conditions a
    user should be aware of, but that fall below the threshold for
    raising exceptions. By default, such messages are printed to
    STDERR, but the [warnings] module can be used to modify the
    behavior of warning messages.

weakref
    Create references to objects that do not limit garbage
    collection. At first brush, weak references seem strange, and
    the strangeness does not really go away quickly. If you do not
    know why you would want to use these, do not worry about
    it--you do not need to.
whrandom
    Wichmann-Hill random number generator. Deprecated since Python
    2.1, and not necessary to use directly before that--use the
    module [random] to create pseudo-random values.
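    As a quick illustration of the preferred [random] module, a
    couple of typical calls follow (the specific values shown are,
    by design, not reproducible):

      >>> import random
      >>> random.random()        # uniform float in [0.0, 1.0)
      0.34662842452684039
      >>> random.choice(['spam', 'eggs', 'toast'])
      'eggs'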