CHAPTER 1 The Python Data Model
One of the best qualities of Python is its consistency. After working with Python for a while, you are able to start making informed, correct guesses about features that are new to you. However, if you learned another object-oriented language before Python, you may have found it strange to use len(collection) instead of collection.len(). This apparent oddity is the tip of an iceberg that, when properly understood, is the key to everything we call Pythonic. The iceberg is called the Python data model, and it describes the API that you can use to make your own objects play well with the most idiomatic language features. You can think of the data model as a description of Python as a framework. It formalizes the interfaces of the building blocks of the language itself, such as sequences, iterators, functions, classes, context managers, and so on. While coding with any framework, you spend a lot of time implementing methods that are called by the framework. The same happens when you leverage the Python data model. The Python interpreter invokes special methods to perform basic object oper‐ ations, often triggered by special syntax. The special method names are always written with leading and trailing double underscores (i.e., __getitem__). For example, the syn‐ 1. Story of Jython, written as a Foreword to Jython Essentials (O’Reilly, 2002), by Samuele Pedroni and Noel Rappin. 3 tax obj[key] is supported by the __getitem__ special method. In order to evaluate my_collection[key], the interpreter calls my_collection.__getitem__(key). The special method names allow your objects to implement, support, and interact with basic language constructs such as: • Iteration • Collections • Attribute access • Operator overloading • Function and method invocation • Object creation and destruction • String representation and formatting • Managed contexts (i.e., with blocks) Magic and Dunder The term magic method is slang for special method, but when talking about a specific method like __getitem__, some Python developers take the shortcut of saying “under-under-getitem” which is ambiguous, because the syntax __x has another special meaning.2 Being precise and pronouncing “under-under-getitem under-under” is tiresome, so I follow the lead of author and teach‐ er Steve Holden and say “dunder-getitem.” All experienced Pytho‐ nistas understand that shortcut. As a result, the special methods are also known as dunder methods.3 A Pythonic Card Deck The following is a very simple example, but it demonstrates the power of implementing just two special methods, __getitem__ and __len__. Example 1-1 is a class to represent a deck of playing cards. Example 1-1. A deck as a sequence of cards import collections 2. See “Private and “Protected” Attributes in Python” on page 262. 3. I personally first heard “dunder” from Steve Holden. Wikipedia credits Mark Johnson and Tim Hochberg for the first written records of “dunder” in responses to the question “How do you pronounce __ (double underscore)?” in the python-list on September 26, 2002: Johnson’s message; Hochberg’s (11 minutes later). 
4 | Chapter 1: The Python Data Model Card = collections.namedtuple('Card', ['rank', 'suit']) class FrenchDeck: ranks = [str(n) for n in range(2, 11)] + list('JQKA') suits = 'spades diamonds clubs hearts'.split() def __init__(self): self._cards = [Card(rank, suit) for suit in self.suits for rank in self.ranks] def __len__(self): return len(self._cards) def __getitem__(self, position): return self._cards[position] The first thing to note is the use of collections.namedtuple to construct a simple class to represent individual cards. Since Python 2.6, namedtuple can be used to build classes of objects that are just bundles of attributes with no custom methods, like a database record. In the example, we use it to provide a nice representation for the cards in the deck, as shown in the console session: >>> beer_card = Card('7', 'diamonds') >>> beer_card Card(rank='7', suit='diamonds') But the point of this example is the FrenchDeck class. It’s short, but it packs a punch. First, like any standard Python collection, a deck responds to the len() function by returning the number of cards in it: >>> deck = FrenchDeck() >>> len(deck) 52 Reading specific cards from the deck—say, the first or the last—should be as easy as deck[0] or deck[-1], and this is what the __getitem__ method provides: >>> deck[0] Card(rank='2', suit='spades') >>> deck[-1] Card(rank='A', suit='hearts') Should we create a method to pick a random card? No need. Python already has a function to get a random item from a sequence: random.choice. We can just use it on a deck instance: >>> from random import choice >>> choice(deck) Card(rank='3', suit='hearts') >>> choice(deck) Card(rank='K', suit='spades') A Pythonic Card Deck | 5 >>> choice(deck) Card(rank='2', suit='clubs') We’ve just seen two advantages of using special methods to leverage the Python data model: • The users of your classes don’t have to memorize arbitrary method names for stan‐ dard operations (“How to get the number of items? Is it .size(), .length(), or what?”). • It’s easier to benefit from the rich Python standard library and avoid reinventing the wheel, like the random.choice function. But it gets better. Because our __getitem__ delegates to the [] operator of self._cards, our deck auto‐ matically supports slicing. Here’s how we look at the top three cards from a brand new deck, and then pick just the aces by starting on index 12 and skipping 13 cards at a time: >>> deck[:3] [Card(rank='2', suit='spades'), Card(rank='3', suit='spades'), Card(rank='4', suit='spades')] >>> deck[12::13] [Card(rank='A', suit='spades'), Card(rank='A', suit='diamonds'), Card(rank='A', suit='clubs'), Card(rank='A', suit='hearts')] Just by implementing the __getitem__ special method, our deck is also iterable: >>> for card in deck: # doctest: +ELLIPSIS ... print(card) Card(rank='2', suit='spades') Card(rank='3', suit='spades') Card(rank='4', suit='spades') ... The deck can also be iterated in reverse: >>> for card in reversed(deck): # doctest: +ELLIPSIS ... print(card) Card(rank='A', suit='hearts') Card(rank='K', suit='hearts') Card(rank='Q', suit='hearts') ... 6 | Chapter 1: The Python Data Model Ellipsis in doctests Whenever possible, the Python console listings in this book were extracted from doctests to ensure accuracy. When the output was too long, the elided part is marked by an ellipsis (...) like in the last line in the preceding code. In such cases, we used the # doctest: +ELLIPSIS directive to make the doctest pass. 
If you are trying these examples in the interactive console, you may omit the doctest directives altogether. Iteration is often implicit. If a collection has no __contains__ method, the in operator does a sequential scan. Case in point: in works with our FrenchDeck class because it is iterable. Check it out: >>> Card('Q', 'hearts') in deck True >>> Card('7', 'beasts') in deck False How about sorting? A common system of ranking cards is by rank (with aces being highest), then by suit in the order of spades (highest), then hearts, diamonds, and clubs (lowest). Here is a function that ranks cards by that rule, returning 0 for the 2 of clubs and 51 for the ace of spades: suit_values = dict(spades=3, hearts=2, diamonds=1, clubs=0) def spades_high(card): rank_value = FrenchDeck.ranks.index(card.rank) return rank_value * len(suit_values) + suit_values[card.suit] Given spades_high, we can now list our deck in order of increasing rank: >>> for card in sorted(deck, key=spades_high): # doctest: +ELLIPSIS ... print(card) Card(rank='2', suit='clubs') Card(rank='2', suit='diamonds') Card(rank='2', suit='hearts') ... (46 cards ommitted) Card(rank='A', suit='diamonds') Card(rank='A', suit='hearts') Card(rank='A', suit='spades') Although FrenchDeck implicitly inherits from object,4 its functionality is not inherited, but comes from leveraging the data model and composition. By implementing the spe‐ cial methods __len__ and __getitem__, our FrenchDeck behaves like a standard Python sequence, allowing it to benefit from core language features (e.g., iteration and slicing) 4. In Python 2, you’d have to be explicit and write FrenchDeck(object), but that’s the default in Python 3. A Pythonic Card Deck | 7 and from the standard library, as shown by the examples using random.choice, reversed, and sorted. Thanks to composition, the __len__ and __getitem__ imple‐ mentations can hand off all the work to a list object, self._cards. How About Shuffling? As implemented so far, a FrenchDeck cannot be shuffled, be‐ cause it is immutable: the cards and their positions cannot be changed, except by violating encapsulation and handling the _cards attribute directly. In Chapter 11, that will be fixed by adding a one-line __setitem__ method. How Special Methods Are Used The first thing to know about special methods is that they are meant to be called by the Python interpreter, and not by you. You don’t write my_object.__len__(). You write len(my_object) and, if my_object is an instance of a user-defined class, then Python calls the __len__ instance method you implemented. But for built-in types like list, str, bytearray, and so on, the interpreter takes a short‐ cut: the CPython implementation of len() actually returns the value of the ob_size field in the PyVarObject C struct that represents any variable-sized built-in object in memory. This is much faster than calling a method. More often than not, the special method call is implicit. For example, the statement for i in x: actually causes the invocation of iter(x), which in turn may call x.__iter__() if that is available. Normally, your code should not have many direct calls to special methods. Unless you are doing a lot of metaprogramming, you should be implementing special methods more often than invoking them explicitly. The only special method that is frequently called by user code directly is __init__, to invoke the initializer of the superclass in your own __init__ implementation. 
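As an illustration of that one exception, here is a brief sketch, a hypothetical subclass that is not part of Example 1-1, whose own __init__ starts by invoking the superclass initializer:

class ShorterDeck(FrenchDeck):  # assumes the FrenchDeck class from Example 1-1 is in scope
    """A deck trimmed to the lowest ranks (hypothetical example)."""
    def __init__(self, top_rank='10'):
        super().__init__()  # the typical direct call to a special method: the superclass __init__
        limit = self.ranks.index(top_rank) + 1
        self._cards = [card for card in self._cards
                       if self.ranks.index(card.rank) < limit]

With top_rank='10', len() on such a deck would report 36 instead of 52, and every other sequence behavior inherited from the data model keeps working unchanged.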
If you need to invoke a special method, it is usually better to call the related built-in function (e.g., len, iter, str, etc). These built-ins call the corresponding special meth‐ od, but often provide other services and—for built-in types—are faster than method calls. See, for example, “A Closer Look at the iter Function” on page 436 in Chapter 14. Avoid creating arbitrary, custom attributes with the __foo__ syntax because such names may acquire special meanings in the future, even if they are unused today. 8 | Chapter 1: The Python Data Model Emulating Numeric Types Several special methods allow user objects to respond to operators such as +. We will cover that in more detail in Chapter 13, but here our goal is to further illustrate the use of special methods through another simple example. We will implement a class to represent two-dimensional vectors—that is Euclidean vectors like those used in math and physics (see Figure 1-1). |
Figure 1-1. Example of two-dimensional vector addition; Vector(2, 4) + Vector(2, 1) re‐ sults in Vector(4, 5). The built-in complex type can be used to represent two dimensional vectors, but our class can be extended to represent n dimensional vectors. We will do that in Chapter 14. We will start by designing the API for such a class by writing a simulated console session that we can use later as a doctest. The following snippet tests the vector addition pictured in Figure 1-1: >>> v1 = Vector(2, 4) >>> v2 = Vector(2, 1) >>> v1 + v2 Vector(4, 5) Note how the + operator produces a Vector result, which is displayed in a friendly manner in the console. How Special Methods Are Used | 9 The abs built-in function returns the absolute value of integers and floats, and the magnitude of complex numbers, so to be consistent, our API also uses abs to calculate the magnitude of a vector: >>> v = Vector(3, 4) >>> abs(v) 5.0 We can also implement the * operator to perform scalar multiplication (i.e., multiplying a vector by a number to produce a new vector with the same direction and a multiplied magnitude): >>> v * 3 Vector(9, 12) >>> abs(v * 3) 15.0 Example 1-2 is a Vector class implementing the operations just described, through the use of the special methods __repr__, __abs__, __add__ and __mul__. Example 1-2. A simple two-dimensional vector class from math import hypot class Vector: def __init__(self, x=0, y=0): self.x = x self.y = y def __repr__(self): return 'Vector(%r, %r)' % (self.x, self.y) def __abs__(self): return hypot(self.x, self.y) def __bool__(self): return bool(abs(self)) def __add__(self, other): x = self.x + other.x y = self.y + other.y return Vector(x, y) def __mul__(self, scalar): return Vector(self.x * scalar, self.y * scalar) Note that although we implemented four special methods (apart from __init__), none of them is directly called within the class or in the typical usage of the class illustrated by the console listings. As mentioned before, the Python interpreter is the only frequent 10 | Chapter 1: The Python Data Model caller of most special methods. In the following sections, we discuss the code for each special method. String Representation The __repr__ special method is called by the repr built-in to get the string represen‐ tation of the object for inspection. If we did not implement __repr__, vector instances would be shown in the console like <Vector object at 0x10e100070>. The interactive console and debugger call repr on the results of the expressions evalu‐ ated, as does the %r placeholder in classic formatting with the % operator, and the !r conversion field in the new Format String Syntax used in the str.format method. Speaking of the % operator and the str.format method, you will notice I use both in this book, as does the Python community at large. I am increasingly favoring the more powerful str.for mat, but I am aware many Pythonistas prefer the simpler %, so we’ll probably see both in Python source code for the foreseea‐ ble future. Note that in our __repr__ implementation, we used %r to obtain the standard repre‐ sentation of the attributes to be displayed. This is good practice, because it shows the crucial difference between Vector(1, 2) and Vector('1', '2')—the latter would not work in the context of this example, because the constructor’s arguments must be num‐ bers, not str. The string returned by __repr__ should be unambiguous and, if possible, match the source code necessary to re-create the object being represented. 
That is why our chosen representation looks like calling the constructor of the class (e.g., Vector(3, 4)). Contrast __repr__ with __str__, which is called by the str() constructor and implicitly used by the print function. __str__ should return a string suitable for display to end users. If you only implement one of these special methods, choose __repr__, because when no custom __str__ is available, Python will call __repr__ as a fallback. “Difference between __str__ and __repr__ in Python” is a Stack Overflow question with excellent contributions from Pythonistas Alex Martelli and Martijn Pieters. How Special Methods Are Used | 11 Arithmetic Operators Example 1-2 implements two operators: + and *, to show basic usage of __add__ and __mul__. Note that in both cases, the methods create and return a new instance of Vector, and do not modify either operand—self or other are merely read. This is the expected behavior of infix operators: to create new objects and not touch their operands. I will have a lot more to say about that in Chapter 13. As implemented, Example 1-2 allows multiplying a Vector by a number, but not a number by a Vector, which violates the com‐ mutative property of multiplication. We will fix that with the spe‐ cial method __rmul__ in Chapter 13. Boolean Value of a Custom Type Although Python has a bool type, it accepts any object in a boolean context, such as the expression controlling an if or while statement, or as operands to and, or, and not. To determine whether a value x is truthy or falsy, Python applies bool(x), which always returns True or False. By default, instances of user-defined classes are considered truthy, unless either __bool__ or __len__ is implemented. Basically, bool(x) calls x.__bool__() and uses the result. If __bool__ is not implemented, Python tries to invoke x.__len__(), and if that returns zero, bool returns False. Otherwise bool returns True. Our implementation of __bool__ is conceptually simple: it returns False if the mag‐ nitude of the vector is zero, True otherwise. We convert the magnitude to a Boolean using bool(abs(self)) because __bool__ is expected to return a boolean. Note how the special method __bool__ allows your objects to be consistent with the truth value testing rules defined in the “Built-in Types” chapter of The Python Standard Library documentation. A faster implementation of Vector.__bool__ is this: def __bool__(self): return bool(self.x or self.y) This is harder to read, but avoids the trip through abs, __abs__, the squares, and square root. The explicit conversion to bool is needed because __bool__ must return a boolean and or returns either operand as is: x or y evaluates to x if that is truthy, other‐ wise the result is y, whatever that is. 12 | Chapter 1: The Python Data Model Overview of Special Methods The “Data Model” chapter of The Python Language Reference lists 83 special method names, 47 of which are used to implement arithmetic, bitwise, and comparison opera‐ tors. As an overview of what is available, see Tables 1-1 and 1-2. The grouping shown in the following tablesis not exactly the same as in the official documentation. Table 1-1. 
Special method names (operators excluded) Category Method names String/bytes representation __repr__, __str__, __format__, __bytes__ Conversion to number __abs__, __bool__, __complex__, __int__, __float__, __hash__, __index__ Emulating collections __len__, __getitem__, __setitem__, __delitem__, __contains__ Iteration __iter__, __reversed__, __next__ Emulating callables __call__ Context management __enter__, __exit__ Instance creation and destruction __new__, __init__, __del__ Attribute management __getattr__, __getattribute__, __setattr__, __delattr__, __dir__ Attribute descriptors __get__, __set__, __delete__ Class services __prepare__, __instancecheck__, __subclasscheck__ Table 1-2. Special method names for operators Category Method names and related operators Unary numeric operators __neg__ -, __pos__ +, __abs__ abs() Rich comparison operators __lt__ >, __le__ <=, __eq__ ==, __ne__ !=, __gt__ >, __ge__ >= Arithmetic operators __add__+, __sub__ -, __mul__ *, __truediv__ /, __floordiv__ //, __mod__ %, __divmod__ divmod() , __pow__ ** or pow(), __round__ round() Reversed arithmetic operators __radd__,__rsub__,__rmul__,__rtruediv__,__rfloordiv__,__rmod__, __rdivmod__, __rpow__ Augmented assignment arithmetic operators __iadd__,__isub__,__imul__,__itruediv__,__ifloordiv__,__imod__, __ipow__ Bitwise operators __invert__ ~, __lshift__ <<, __rshift__ >>, __and__ &, __or__ |, __xor__ ^ Overview of Special Methods | 13 Category Method names and related operators Reversed bitwise operators __rlshift__, __rrshift__, __rand__, __rxor__, __ror__ Augmented assignment bitwise operators __ilshift__, __irshift__, __iand__, __ixor__, __ior__ The reversed operators are fallbacks used when operands are swapped (b * a instead of a * b), while augmented assignments are shortcuts combining an infix operator with variable assign‐ ment (a = a * b becomes a *= b). Chapter 13 explains both reversed operators and augmented assignment in detail. Why len Is Not a Method I asked this question to core developer Raymond Hettinger in 2013 and the key to his answer was a quote from The Zen of Python: “practicality beats purity.” In “How Special Methods Are Used” on page 8, I described how len(x) runs very fast when x is an instance of a built-in type. No method is called for the built-in objects of CPython: the length is simply read from a field in a C struct. Getting the number of items in a collection is a common operation and must work efficiently for such basic and diverse types as str, list, memoryview, and so on. In other words, len is not called as a method because it gets special treatment as part of the Python data model, just like abs. But thanks to the special method __len__, you can also make len work with your own custom objects. This is a fair compromise between the need for efficient built-in objects and the consistency of the language. Also from The Zen of Python: “Special cases aren’t special enough to break the rules.” If you think of abs and len as unary operators, you may be more inclined to forgive their functional look-and-feel, as opposed to the method call syntax one might expect in an OO language. In fact, the ABC language—a direct ancestor of Python that pio‐ neered many of its features—had an # operator that was the equivalent of len (you’d write #s). When used as an infix opera‐ tor, written x#s, it counted the occurrences of x in s, which in Python you get as s.count(x), for any sequence s. 
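To see the speed difference behind this design decision for yourself, a quick timeit comparison (a sketch of my own; absolute numbers vary by machine and Python version) contrasts the built-in len call with the explicit method call on a list:

from timeit import timeit

setup = 'data = list(range(1000))'
print(timeit('len(data)', setup=setup))       # reads ob_size directly in CPython
print(timeit('data.__len__()', setup=setup))  # regular attribute lookup plus a method call

On CPython, the first measurement is typically the smaller one, which is the practical payoff of treating len as an operator-like built-in.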
Chapter Summary By implementing special methods, your objects can behave like the built-in types, en‐ abling the expressive coding style the community considers Pythonic. 14 | Chapter 1: The Python Data Model A basic requirement for a Python object is to provide usable string representations of itself, one used for debugging and logging, another for presentation to end users. That is why the special methods __repr__ and __str__ exist in the data model. Emulating sequences, as shown with the FrenchDeck example, is one of the most widely used applications of the special methods. Making the most of sequence types is the subject of Chapter 2, and implementing your own sequence will be covered in Chap‐ ter 10 when we create a multidimensional extension of the Vector class. Thanks to operator overloading, Python offers a rich selection of numeric types, from the built-ins to decimal.Decimal and fractions.Fraction, all supporting infix arith‐ metic operators. Implementing operators, including reversed operators and augmented assignment, will be shown in Chapter 13 via enhancements of the Vector example. The use and implementation of the majority of the remaining special methods of the Python data model is covered throughout this book. Further Reading The “Data Model” chapter of The Python Language Reference is the canonical source for the subject of this chapter and much of this book. Python in a Nutshell, 2nd Edition (O’Reilly) by Alex Martelli has excellent coverage of the data model. As I write this, the most recent edition of the Nutshell book is from 2006 and focuses on Python 2.5, but there have been very few changes in the data model since then, and Martelli’s description of the mechanics of attribute access is the most author‐ itative I’ve seen apart from the actual C source code of CPython. Martelli is also a prolific contributor to Stack Overflow, with more than 5,000 answers posted. See his user profile at Stack Overflow. David Beazley has two books covering the data model in detail in the context of Python 3: Python Essential Reference, 4th Edition (Addison-Wesley Professional), and Python Cookbook, 3rd Edition (O’Reilly), coauthored with Brian K. Jones. The Art of the Metaobject Protocol (AMOP, MIT Press) by Gregor Kiczales, Jim des Rivieres, and Daniel G. Bobrow explains the concept of a metaobject protocol (MOP), of which the Python data model is one example. Soapbox Data Model or Object Model? What the Python documentation calls the “Python data model,” most authors would say is the “Python object model.” Alex Martelli’s Python in a Nutshell 2E, and David Beazley’s Python Essential Reference 4E are the best books covering the “Python data model,” but Further Reading | 15 they always refer to it as the “object model.” On Wikipedia, the first definition of object model is “The properties of objects in general in a specific computer programming language.” This is what the “Python data model” is about. In this book, I will use “data model” because the documentation favors that term when referring to the Python object model, and because it is the title of the chapter of The Python Language Reference most relevant to our discussions. Magic Methods The Ruby community calls their equivalent of the special methods magic methods. Many in the Python community adopt that term as well. I believe the special methods are actually the opposite of magic. 
Python and Ruby are the same in this regard: both em‐ power their users with a rich metaobject protocol that is not magic, but enables users to leverage the same tools available to core developers. In contrast, consider JavaScript. Objects in that language have features that are magic, in the sense that you cannot emulate them in your own user-defined objects. For ex‐ ample, before JavaScript 1.8.5, you could not define read-only attributes in your Java‐ Script objects, but some built-in objects always had read-only attributes. In JavaScript, read-only attributes were “magic,” requiring supernatural powers that a user of the lan‐ guage did not have until ECMAScript 5.1 came out in 2009. The metaobject protocol of JavaScript is evolving, but historically it has been more limited than those of Python and Ruby. Metaobjects The Art of the Metaobject Protocol (AMOP) is my favorite computer book title. Less subjectively, the term metaobject protocol is useful to think about the Python data model and similar features in other languages. The metaobject part refers to the objects that are the building blocks of the language itself. In this context, protocol is a synonym of interface. So a metaobject protocol is a fancy synonym for object model: an API for core language constructs. A rich metaobject protocol enables extending a language to support new programming paradigms. Gregor Kiczales, the first author of the AMOP book, later became a pioneer in aspect-oriented programming and the initial author of AspectJ, an extension of Java implementing that paradigm. Aspect-oriented programming is much easier to imple‐ ment in a dynamic language like Python, and several frameworks do it, but the most important is zope.interface, which is briefly discussed in “Further Reading” on page 342 of Chapter 11. 16 | Chapter 1: The Python Data Model PART II Data Structures CHAPTER 2 An Array of Sequences As you may have noticed, several of the operations mentioned work equally for texts, lists and tables. Texts, lists and tables together are called trains. […] The FOR command also works generically on trains.1 — Geurts, Meertens, and Pemberton ABC Programmer’s Handbook Before creating Python, Guido was a contributor to the ABC language—a 10-year re‐ search project to design a programming environment for beginners. ABC introduced many ideas we now consider “Pythonic”: generic operations on sequences, built-in tuple and mapping types, structure by indentation, strong typing without variable declara‐ tions, and more. It’s no accident that Python is so user-friendly. Python inherited from ABC the uniform handling of sequences. Strings, lists, byte se‐ quences, arrays, XML elements, and database results share a rich set of common oper‐ ations including iteration, slicing, sorting, and concatenation. Understanding the variety of sequences available in Python saves us from reinventing the wheel, and their common interface inspires us to create APIs that properly support and leverage existing and future sequence types. Most of the discussion in this chapter applies to sequences in general, from the familiar list to the str and bytes types that are new in Python 3. Specific topics on lists, tuples, arrays, and queues are also covered here, but the focus on Unicode strings and byte sequences is deferred to Chapter 4. Also, the idea here is to cover sequence types that are ready to use. Creating your own sequence types is the subject of Chapter 10. 1. 
Leo Geurts, Lambert Meertens, and Steven Pemberton, ABC Programmer’s Handbook, p. 8. 19 Overview of Built-In Sequences The standard library offers a rich selection of sequence types implemented in C: Container sequences list, tuple, and collections.deque can hold items of different types. Flat sequences str, bytes, bytearray, memoryview, and array.array hold items of one type. Container sequences hold references to the objects they contain, which may be of any type, while flat sequences physically store the value of each item within its own memory space, and not as distinct objects. Thus, flat sequences are more compact, but they are limited to holding primitive values like characters, bytes, and numbers. Another way of grouping sequence types is by mutability: Mutable sequences list, bytearray, array.array, collections.deque, and memoryview Immutable sequences tuple, str, and bytes Figure 2-1 helps visualize how mutable sequences differ from immutable ones, while also inheriting several methods from them. Note that the built-in concrete sequence types do not actually subclass the Sequence and MutableSequence abstract base classes (ABCs) depicted, but the ABCs are still useful as a formalization of what functionality to expect from a full-featured sequence type. |
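That last point is easy to check in the console. The sketch below (mine, not one of the book's numbered examples) shows that the built-ins still pass issubclass tests against the ABCs, because collections.abc registers them as virtual subclasses, even though no real inheritance is involved:

from collections import abc

print(issubclass(list, abc.MutableSequence))  # True: list is registered as a virtual subclass
print(issubclass(tuple, abc.Sequence))        # True, for the same reason
print(abc.MutableSequence in list.__mro__)    # False: the ABC is not a real base class of list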
Figure 2-1. UML class diagram for some classes from collections.abc (superclasses are on the left; inheritance arrows point from subclasses to superclasses; names in italic are abstract classes and abstract methods) Keeping in mind these common traits—mutable versus immutable; container versus flat—is helpful to extrapolate what you know about one sequence type to others. 20 | Chapter 2: An Array of Sequences The most fundamental sequence type is the list—mutable and mixed-type. I am sure you are comfortable handling them, so we’ll jump right into list comprehensions, a powerful way of building lists that is somewhat underused because the syntax may be unfamiliar. Mastering list comprehensions opens the door to generator expressions, which—among other uses—can produce elements to fill up sequences of any type. Both are the subject of the next section. List Comprehensions and Generator Expressions A quick way to build a sequence is using a list comprehension (if the target is a list) or a generator expression (for all other kinds of sequences). If you are not using these syntactic forms on a daily basis, I bet you are missing opportunities to write code that is more readable and often faster at the same time. If you doubt my claim that these constructs are “more readable,” read on. I’ll try to convince you. For brevity, many Python programmers refer to list comprehen‐ sions as listcomps, and generator expressions as genexps. I will use these words as well. List Comprehensions and Readability Here is a test: which do you find easier to read, Example 2-1 or Example 2-2? Example 2-1. Build a list of Unicode codepoints from a string >>> symbols = '$¢£¥€¤' >>> codes = [] >>> for symbol in symbols: ... codes.append(ord(symbol)) ... >>> codes [36, 162, 163, 165, 8364, 164] Example 2-2. Build a list of Unicode codepoints from a string, take two >>> symbols = '$¢£¥€¤' >>> codes = [ord(symbol) for symbol in symbols] >>> codes [36, 162, 163, 165, 8364, 164] Anybody who knows a little bit of Python can read Example 2-1. However, after learning about listcomps, I find Example 2-2 more readable because its intent is explicit. List Comprehensions and Generator Expressions | 21 A for loop may be used to do lots of different things: scanning a sequence to count or pick items, computing aggregates (sums, averages), or any number of other processing tasks. The code in Example 2-1 is building up a list. In contrast, a listcomp is meant to do one thing only: to build a new list. Of course, it is possible to abuse list comprehensions to write truly incomprehensible code. I’ve seen Python code with listcomps used just to repeat a block of code for its side effects. If you are not doing something with the produced list, you should not use that syntax. Also, try to keep it short. If the list comprehension spans more than two lines, it is probably best to break it apart or rewrite as a plain old for loop. Use your best judgment: for Python as for English, there are no hard-and-fast rules for clear writing. Syntax Tip In Python code, line breaks are ignored inside pairs of [], {}, or (). So you can build multiline lists, listcomps, genexps, dictionar‐ ies and the like without using the ugly \ line continuation escape. Listcomps No Longer Leak Their Variables In Python 2.x, variables assigned in the for clauses in list comprehensions were set in the surrounding scope, sometimes with tragic consequences. 
See the following Python 2.7 console session: Python 2.7.6 (default, Mar 22 2014, 22:59:38) [GCC 4.8.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> x = 'my precious' >>> dummy = [x for x in 'ABC'] >>> x 'C' As you can see, the initial value of x was clobbered. This no longer happens in Python 3. List comprehensions, generator expressions, and their siblings set and dict compre‐ hensions now have their own local scope, like functions. Variables assigned within the expression are local, but variables in the surrounding scope can still be referenced. Even better, the local variables do not mask the variables from the surrounding scope. This is Python 3: >>> x = 'ABC' >>> dummy = [ord(x) for x in x] >>> x 'ABC' >>> dummy 22 | Chapter 2: An Array of Sequences [65, 66, 67] >>> The value of x is preserved. The list comprehension produces the expected list. List comprehensions build lists from sequences or any other iterable type by filtering and transforming items. The filter and map built-ins can be composed to do the same, but readability suffers, as we will see next. Listcomps Versus map and filter Listcomps do everything the map and filter functions do, without the contortions of the functionally challenged Python lambda. Consider Example 2-3. Example 2-3. The same list built by a listcomp and a map/filter composition >>> symbols = '$¢£¥€¤' >>> beyond_ascii = [ord(s) for s in symbols if ord(s) > 127] >>> beyond_ascii [162, 163, 165, 8364, 164] >>> beyond_ascii = list(filter(lambda c: c > 127, map(ord, symbols))) >>> beyond_ascii [162, 163, 165, 8364, 164] I used to believe that map and filter were faster than the equivalent listcomps, but Alex Martelli pointed out that’s not the case—at least not in the preceding examples. The 02- array-seq/listcomp_speed.py script in the Fluent Python code repository is a simple speed test comparing listcomp with filter/map. I’ll have more to say about map and filter in Chapter 5. Now we turn to the use of listcomps to compute Cartesian products: a list containing tuples built from all items from two or more lists. Cartesian Products Listcomps can generate lists from the Cartesian product of two or more iterables. The items that make up the cartesian product are tuples made from items from every input iterable. The resulting list has a length equal to the lengths of the input iterables mul‐ tiplied. See Figure 2-2. List Comprehensions and Generator Expressions | 23 |
Figure 2-2. The Cartesian product of a sequence of three card ranks and a sequence of four suits results in a sequence of twelve pairings For example, imagine you need to produce a list of T-shirts available in two colors and three sizes. Example 2-4 shows how to produce that list using a listcomp. The result has six items. Example 2-4. Cartesian product using a list comprehension >>> colors = ['black', 'white'] >>> sizes = ['S', 'M', 'L'] >>> tshirts = [(color, size) for color in colors for size in sizes] >>> tshirts [('black', 'S'), ('black', 'M'), ('black', 'L'), ('white', 'S'), ('white', 'M'), ('white', 'L')] >>> for color in colors: ... for size in sizes: ... print((color, size)) ... ('black', 'S') ('black', 'M') ('black', 'L') ('white', 'S') ('white', 'M') ('white', 'L') >>> tshirts = [(color, size) for size in sizes ... for color in colors] >>> tshirts [('black', 'S'), ('white', 'S'), ('black', 'M'), ('white', 'M'), ('black', 'L'), ('white', 'L')] 24 | Chapter 2: An Array of Sequences This generates a list of tuples arranged by color, then size. Note how the resulting list is arranged as if the for loops were nested in the same order as they appear in the listcomp. To get items arranged by size, then color, just rearrange the for clauses; adding a line break to the listcomp makes it easy to see how the result will be ordered. In Example 1-1 (Chapter 1), the following expression was used to initialize a card deck with a list made of 52 cards from all 13 ranks of each of the 4 suits, grouped by suit: self._cards = [Card(rank, suit) for suit in self.suits for rank in self.ranks] Listcomps are a one-trick pony: they build lists. To fill up other sequence types, a genexp is the way to go. The next section is a brief look at genexps in the context of building nonlist sequences. Generator Expressions To initialize tuples, arrays, and other types of sequences, you could also start from a listcomp, but a genexp saves memory because it yields items one by one using the iterator protocol instead of building a whole list just to feed another constructor. Genexps use the same syntax as listcomps, but are enclosed in parentheses rather than brackets. Example 2-5 shows basic usage of genexps to build a tuple and an array. Example 2-5. Initializing a tuple and an array from a generator expression >>> symbols = '$¢£¥€¤' >>> tuple(ord(symbol) for symbol in symbols) (36, 162, 163, 165, 8364, 164) >>> import array >>> array.array('I', (ord(symbol) for symbol in symbols)) array('I', [36, 162, 163, 165, 8364, 164]) If the generator expression is the single argument in a function call, there is no need to duplicate the enclosing parentheses. The array constructor takes two arguments, so the parentheses around the generator expression are mandatory. The first argument of the array constructor defines the storage type used for the numbers in the array, as we’ll see in “Arrays” on page 48. Example 2-6 uses a genexp with a Cartesian product to print out a roster of T-shirts of two colors in three sizes. In contrast with Example 2-4, here the six-item list of T-shirts is never built in memory: the generator expression feeds the for loop producing one List Comprehensions and Generator Expressions | 25 item at a time. If the two lists used in the Cartesian product had 1,000 items each, using a generator expression would save the expense of building a list with a million items just to feed the for loop. Example 2-6. 
Cartesian product in a generator expression >>> colors = ['black', 'white'] >>> sizes = ['S', 'M', 'L'] >>> for tshirt in ('%s %s' % (c, s) for c in colors for s in sizes): ... print(tshirt) ... black S black M black L white S white M white L The generator expression yields items one by one; a list with all six T-shirt variations is never produced in this example. Chapter 14 is devoted to explaining how generators work in detail. Here the idea was just to show the use of generator expressions to initialize sequences other than lists, or to produce output that you don’t need to keep in memory. Now we move on to the other fundamental sequence type in Python: the tuple. Tuples Are Not Just Immutable Lists Some introductory texts about Python present tuples as “immutable lists,” but that is short selling them. Tuples do double duty: they can be used as immutable lists and also as records with no field names. This use is sometimes overlooked, so we will start with that. Tuples as Records Tuples hold records: each item in the tuple holds the data for one field and the position of the item gives its meaning. If you think of a tuple just as an immutable list, the quantity and the order of the items may or may not be important, depending on the context. But when using a tuple as a collection of fields, the number of items is often fixed and their order is always vital. Example 2-7 shows tuples being used as records. Note that in every expression, sorting the tuple would destroy the information because the meaning of each data item is given by its position in the tuple. 26 | Chapter 2: An Array of Sequences Example 2-7. Tuples used as records >>> lax_coordinates = (33.9425, -118.408056) >>> city, year, pop, chg, area = ('Tokyo', 2003, 32450, 0.66, 8014) >>> traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'), ... ('ESP', 'XDA205856')] >>> for passport in sorted(traveler_ids): ... print('%s/%s' % passport) ... BRA/CE342567 ESP/XDA205856 USA/31195855 >>> for country, _ in traveler_ids: ... print(country) ... USA BRA ESP Latitude and longitude of the Los Angeles International Airport. Data about Tokyo: name, year, population (millions), population change (%), area (km²). A list of tuples of the form (country_code, passport_number). As we iterate over the list, passport is bound to each tuple. The % formatting operator understands tuples and treats each item as a separate field. The for loop knows how to retrieve the items of a tuple separately—this is called “unpacking.” Here we are not interested in the second item, so it’s assigned to _, a dummy variable. Tuples work well as records because of the tuple unpacking mechanism—our next sub‐ ject. Tuple Unpacking In Example 2-7, we assigned ('Tokyo', 2003, 32450, 0.66, 8014) to city, year, pop, chg, area in a single statement. Then, in the last line, the % operator assigned each item in the passport tuple to one slot in the format string in the print argument. Those are two examples of tuple unpacking. Tuples Are Not Just Immutable Lists | 27 Tuple unpacking works with any iterable object. The only require‐ ment is that the iterable yields exactly one item per variable in the receiving tuple, unless you use a star (*) to capture excess items as explained in “Using * to grab excess items” on page 29. The term tuple unpacking is widely used by Pythonistas, but iterable un‐ packing is gaining traction, as in the title of PEP 3132 — Exten‐ ded Iterable Unpacking. 
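Both points in the preceding tip are easy to demonstrate with a short sketch of my own (not a numbered example from the book): unpacking accepts any iterable, and a mismatched item count raises ValueError.

first, second = (n * n for n in range(2))  # any iterable works, even a generator expression
print(first, second)                       # 0 1

try:
    a, b = range(3)                        # three items, but only two variables
except ValueError as exc:
    print(exc)                             # too many values to unpack (expected 2)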
The most visible form of tuple unpacking is parallel assignment; that is, assigning items from an iterable to a tuple of variables, as you can see in this example: >>> lax_coordinates = (33.9425, -118.408056) >>> latitude, longitude = lax_coordinates # tuple unpacking >>> latitude 33.9425 >>> longitude -118.408056 An elegant application of tuple unpacking is swapping the values of variables without using a temporary variable: >>> b, a = a, b Another example of tuple unpacking is prefixing an argument with a star when calling a function: >>> divmod(20, 8) (2, 4) >>> t = (20, 8) >>> divmod(*t) (2, 4) >>> quotient, remainder = divmod(*t) >>> quotient, remainder (2, 4) The preceding code also shows a further use of tuple unpacking: enabling functions to return multiple values in a way that is convenient to the caller. For example, the os.path.split() function builds a tuple (path, last_part) from a filesystem path: >>> import os >>> _, filename = os.path.split('/home/luciano/.ssh/idrsa.pub') >>> filename 'idrsa.pub' Sometimes when we only care about certain parts of a tuple when unpacking, a dummy variable like _ is used as placeholder, as in the preceding example. 28 | Chapter 2: An Array of Sequences If you write internationalized software, _ is not a good dummy variable because it is traditionally used as an alias to the get text.gettext function, as recommended in the gettext module documentation. Otherwise, it’s a nice name for placeholder vari‐ able. Another way of focusing on just some of the items when unpacking a tuple is to use the *, as we’ll see right away. Using * to grab excess items Defining function parameters with *args to grab arbitrary excess arguments is a classic Python feature. In Python 3, this idea was extended to apply to parallel assignment as well: >>> a, b, *rest = range(5) >>> a, b, rest (0, 1, [2, 3, 4]) >>> a, b, *rest = range(3) >>> a, b, rest (0, 1, [2]) >>> a, b, *rest = range(2) >>> a, b, rest (0, 1, []) In the context of parallel assignment, the * prefix can be applied to exactly one variable, but it can appear in any position: >>> a, *body, c, d = range(5) >>> a, body, c, d (0, [1, 2], 3, 4) >>> *head, b, c, d = range(5) >>> head, b, c, d ([0, 1], 2, 3, 4) Finally, a powerful feature of tuple unpacking is that it works with nested structures. Nested Tuple Unpacking The tuple to receive an expression to unpack can have nested tuples, like (a, b, (c, d)), and Python will do the right thing if the expression matches the nesting structure. Example 2-8 shows nested tuple unpacking in action. Example 2-8. Unpacking nested tuples to access the longitude metro_areas = [ ('Tokyo', 'JP', 36.933, (35.689722, 139.691667)), # ('Delhi NCR', 'IN', 21.935, (28.613889, 77.208889)), ('Mexico City', 'MX', 20.142, (19.433333, -99.133333)), Tuples Are Not Just Immutable Lists | 29 ('New York-Newark', 'US', 20.104, (40.808611, -74.020386)), ('Sao Paulo', 'BR', 19.649, (-23.547778, -46.635833)), ] print('{:15} | {:^9} | {:^9}'.format('', 'lat.', 'long.')) fmt = '{:15} | {:9.4f} | {:9.4f}' for name, cc, pop, (latitude, longitude) in metro_areas: # if longitude <= 0: # print(fmt.format(name, latitude, longitude)) Each tuple holds a record with four fields, the last of which is a coordinate pair. By assigning the last field to a tuple, we unpack the coordinates. if longitude <= 0: limits the output to metropolitan areas in the Western hemisphere. The output of Example 2-8 is: | lat. | long. 
Mexico City | 19.4333 | -99.1333 New York-Newark | 40.8086 | -74.0204 Sao Paulo | -23.5478 | -46.6358 Before Python 3, it was possible to define functions with nested tuples in the formal parameters (e.g., def fn(a, (b, c), d):). This is no longer supported in Python 3 function definitions, for practical reasons explained in PEP 3113 — Removal of Tuple Pa‐ rameter Unpacking. To be clear: nothing changed from the per‐ spective of users calling a function. The restriction applies only to the definition of functions. As designed, tuples are very handy. But there is a missing feature when using them as records: sometimes it is desirable to name the fields. That is why the namedtuple func‐ tion was invented. Read on. Named Tuples The collections.namedtuple function is a factory that produces subclasses of tuple enhanced with field names and a class name—which helps debugging. Instances of a class that you build with namedtuple take exactly the same amount of memory as tuples because the field names are stored in the class. They use less memory than a regular object because they don’t store attributes in a per-instance __dict__. 30 | Chapter 2: An Array of Sequences Recall how we built the Card class in Example 1-1 in Chapter 1: Card = collections.namedtuple('Card', ['rank', 'suit']) Example 2-9 shows how we could define a named tuple to hold information about a city. Example 2-9. Defining and using a named tuple type >>> from collections import namedtuple >>> City = namedtuple('City', 'name country population coordinates') >>> tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667)) >>> tokyo City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722, 139.691667)) >>> tokyo.population 36.933 >>> tokyo.coordinates (35.689722, 139.691667) >>> tokyo[1] 'JP' Two parameters are required to create a named tuple: a class name and a list of field names, which can be given as an iterable of strings or as a single space delimited string. Data must be passed as positional arguments to the constructor (in contrast, the tuple constructor takes a single iterable). You can access the fields by name or position. A named tuple type has a few attributes in addition to those inherited from tuple. Example 2-10 shows the most useful: the _fields class attribute, the class method _make(iterable), and the _asdict() instance method. Example 2-10. Named tuple attributes and methods (continued from the previous ex‐ ample) >>> City._fields ('name', 'country', 'population', 'coordinates') >>> LatLong = namedtuple('LatLong', 'lat long') >>> delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889)) >>> delhi = City._make(delhi_data) >>> delhi._asdict() OrderedDict([('name', 'Delhi NCR'), ('country', 'IN'), ('population', 21.935), ('coordinates', LatLong(lat=28.613889, long=77.208889))]) >>> for key, value in delhi._asdict().items(): print(key + ':', value) name: Delhi NCR country: IN population: 21.935 Tuples Are Not Just Immutable Lists | 31 coordinates: LatLong(lat=28.613889, long=77.208889) >>> _fields is a tuple with the field names of the class. _make() allow you to instantiate a named tuple from an iterable; City(*del hi_data) would do the same. _asdict() returns a collections.OrderedDict built from the named tuple instance. That can be used to produce a nice display of city data. Now that we’ve explored the power of tuples as records, we can consider their second role as an immutable variant of the list type. 
Tuples as Immutable Lists When using a tuple as an immutable variation of list, it helps to know how similar they actually are. As you can see in Table 2-1, tuple supports all list methods that do not involve adding or removing items, with one exception—tuple lacks the __re versed__ method. However, that is just for optimization; reversed(my_tuple) works without it. Table 2-1. Methods and attributes found in list or tuple (methods implemented by ob‐ ject are omitted for brevity) list tuple s.__add__(s2) ● ● s + s2—concatenation s.__iadd__(s2) ● s += s2—in-place concatenation s.append(e) ● Append one element after last s.clear() ● Delete all items s.__contains__(e) ● ● e in s s.copy() ● Shallow copy of the list s.count(e) ● ● Count occurrences of an element s.__delitem__(p) ● Remove item at position p s.extend(it) ● Append items from iterable it s.__getitem__(p) ● ● s[p]—get item at position s.__getnewargs__() ● Support for optimized serialization with pickle s.index(e) ● ● Find position of first occurrence of e s.insert(p, e) ● Insert element e before the item at position p s.__iter__() ● ● Get iterator s.__len__() ● ● len(s)—number of items s.__mul__(n) ● ● s * n—repeated concatenation 32 | Chapter 2: An Array of Sequences list tuple s.__imul__(n) ● s *= n—in-place repeated concatenation s.__rmul__(n) ● ● n * s—reversed repeated concatenationa s.pop([p]) ● Remove and return last item or item at optional position p s.remove(e) ● Remove first occurrence of element e by value s.reverse() ● Reverse the order of the items in place s.__reversed__() ● Get iterator to scan items from last to first s.__setitem__(p, e) ● s[p] = e—put e in position p, overwriting existing item s.sort([key], [reverse]) ● Sort items in place with optional keyword arguments key and reverse a Reversed operators are explained in Chapter 13. Every Python programmer knows that sequences can be sliced using the s[a:b] syntax. We now turn to some less well-known facts about slicing. Slicing A common feature of list, tuple, str, and all sequence types in Python is the support of slicing operations, which are more powerful than most people realize. In this section, we describe the use of these advanced forms of slicing. Their imple‐ mentation in a user-defined class will be covered in Chapter 10, in keeping with our philosophy of covering ready-to-use classes in this part of the book, and creating new classes in Part IV. Why Slices and Range Exclude the Last Item The Pythonic convention of excluding the last item in slices and ranges works well with the zero-based indexing used in Python, C, and many other languages. Some convenient features of the convention are: • It’s easy to see the length of a slice or range when only the stop position is given: range(3) and my_list[:3] both produce three items. • It’s easy to compute the length of a slice or range when start and stop are given: just subtract stop - start. • It’s easy to split a sequence in two parts at any index x, without overlapping: simply get my_list[:x] and my_list[x:]. For example: >>> l = [10, 20, 30, 40, 50, 60] >>> l[:2] # split at 2 [10, 20] >>> l[2:] [30, 40, 50, 60] Slicing | 33 >>> l[:3] # split at 3 [10, 20, 30] >>> l[3:] [40, 50, 60] But the best arguments for this convention were written by the Dutch computer scientist Edsger W. Dijkstra (see the last reference in “Further Reading” on page 59). Now let’s take a close look at how Python interprets slice notation. 
Slice Objects This is no secret, but worth repeating just in case: s[a:b:c] can be used to specify a stride or step c, causing the resulting slice to skip items. The stride can also be negative, returning items in reverse. Three examples make this clear: >>> s = 'bicycle' >>> s[::3] 'bye' >>> s[::-1] 'elcycib' >>> s[::-2] 'eccb' Another example was shown in Chapter 1 when we used deck[12::13] to get all the aces in the unshuffled deck: >>> deck[12::13] [Card(rank='A', suit='spades'), Card(rank='A', suit='diamonds'), Card(rank='A', suit='clubs'), Card(rank='A', suit='hearts')] The notation a:b:c is only valid within [] when used as the indexing or subscript operator, and it produces a slice object: slice(a, b, c). As we will see in “How Slicing Works” on page 281, to evaluate the expression seq[start:stop:step], Python calls seq.__getitem__(slice(start, stop, step)). Even if you are not implementing your own sequence types, knowing about slice objects is useful because it lets you assign names to slices, just like spreadsheets allow naming of cell ranges. Suppose you need to parse flat-file data like the invoice shown in Example 2-11. Instead of filling your code with hardcoded slices, you can name them. See how readable this makes the for loop at the end of the example. Example 2-11. Line items from a flat-file invoice >>> invoice = """ ... 0.....6.................................40........52...55........ ... 1909 Pimoroni PiBrella $17.50 3 $52.50 ... 1489 6mm Tactile Switch x20 $4.95 2 $9.90 ... 1510 Panavise Jr. - PV-201 $28.00 1 $28.00 ... 1601 PiTFT Mini Kit 320x240 $34.95 1 $34.95 ... """ 34 | Chapter 2: An Array of Sequences >>> SKU = slice(0, 6) >>> DESCRIPTION = slice(6, 40) >>> UNIT_PRICE = slice(40, 52) >>> QUANTITY = slice(52, 55) >>> ITEM_TOTAL = slice(55, None) >>> line_items = invoice.split(' ')[2:] >>> for item in line_items: ... print(item[UNIT_PRICE], item[DESCRIPTION]) ... $17.50 Pimoroni PiBrella $4.95 6mm Tactile Switch x20 $28.00 Panavise Jr. - PV-201 $34.95 PiTFT Mini Kit 320x240 We’ll come back to slice objects when we discuss creating your own collections in “Vector Take #2: A Sliceable Sequence” on page 280. Meanwhile, from a user perspective, slicing includes additional features such as multidimensional slices and ellipsis (...) notation. Read on. Multidimensional Slicing and Ellipsis The [] operator can also take multiple indexes or slices separated by commas. This is used, for instance, in the external NumPy package, where items of a two-dimensional numpy.ndarray can be fetched using the syntax a[i, j] and a two-dimensional slice obtained with an expression like a[m:n, k:l]. Example 2-22 later in this chapter shows the use of this notation. The __getitem__ and __setitem__ special methods that handle the [] operator simply receive the indices in a[i, j] as a tuple. In other words, to evaluate a[i, j], Python calls a.__getitem__((i, j)). The built-in sequence types in Python are one-dimensional, so they support only one index or slice, and not a tuple of them. The ellipsis—written with three full stops (...) and not … (Unicode U+2026)—is rec‐ ognized as a token by the Python parser. It is an alias to the Ellipsis object, the single instance of the ellipsis class.2 As such, it can be passed as an argument to functions and as part of a slice specification, as in f(a, ..., z) or a[i:...]. NumPy uses ... as a shortcut when slicing arrays of many dimensions; for example, if x is a four dimensional array, x[i, ...] is a shortcut for x[i, :, :, :,]. 
See the Tentative NumPy Tutorial to learn more about this. At the time of this writing, I am unaware of uses of Ellipsis or multidimensional indexes and slices in the Python standard library. If you spot one, let me know. These syntactic features exist to support user-defined types and extensions such as NumPy. 2. No, I did not get this backwards: the ellipsis class name is really all lowercase and the instance is a built in named Ellipsis, just like bool is lowercase but its instances are True and False. Slicing | 35 Slices are not just useful to extract information from sequences; they can also be used to change mutable sequences in place—that is, without rebuilding them from scratch. Assigning to Slices Mutable sequences can be grafted, excised, and otherwise modified in place using slice notation on the left side of an assignment statement or as the target of a del statement. The next few examples give an idea of the power of this notation: >>> l = list(range(10)) >>> l [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] >>> l[2:5] = [20, 30] >>> l [0, 1, 20, 30, 5, 6, 7, 8, 9] >>> del l[5:7] >>> l [0, 1, 20, 30, 5, 8, 9] >>> l[3::2] = [11, 22] >>> l [0, 1, 20, 11, 5, 22, 9] >>> l[2:5] = 100 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: can only assign an iterable >>> l[2:5] = [100] >>> l [0, 1, 100, 22, 9] When the target of the assignment is a slice, the right side must be an iterable object, even if it has just one item. Everybody knows that concatenation is a common operation with sequences of any type. Any introductory Python text explains the use of + and * for that purpose, but there are some subtle details on how they work, which we cover next. Using + and * with Sequences Python programmers expect that sequences support + and *. Usually both operands of + must be of the same sequence type, and neither of them is modified but a new sequence of the same type is created as result of the concatenation. To concatenate multiple copies of the same sequence, multiply it by an integer. Again, a new sequence is created: >>> l = [1, 2, 3] >>> l * 5 [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3] 36 | Chapter 2: An Array of Sequences >>> 5 * 'abcd' 'abcdabcdabcdabcdabcd' Both + and * always create a new object, and never change their operands. Beware of expressions like a * n when a is a sequence contain‐ ing mutable items because the result may surprise you. For exam‐ ple, trying to initialize a list of lists as my_list = [[]] * 3 will result in a list with three references to the same inner list, which is probably not what you want. The next section covers the pitfalls of trying to use * to initialize a list of lists. Building Lists of Lists Sometimes we need to initialize a list with a certain number of nested lists—for example, to distribute students in a list of teams or to represent squares on a game board. The best way of doing so is with a list comprehension, as in Example 2-12. Example 2-12. A list with three lists of length 3 can represent a tic-tac-toe board >>> board = [['_'] * 3 for i in range(3)] >>> board [['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']] >>> board[1][2] = 'X' >>> board [['_', '_', '_'], ['_', '_', 'X'], ['_', '_', '_']] Create a list of three lists of three items each. Inspect the structure. Place a mark in row 1, column 2, and check the result. A tempting but wrong shortcut is doing it like Example 2-13. Example 2-13. 
A list with three references to the same list is useless >>> weird_board = [['_'] * 3] * 3 >>> weird_board [['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']] >>> weird_board[1][2] = 'O' >>> weird_board [['_', '_', 'O'], ['_', '_', 'O'], ['_', '_', 'O']] The outer list is made of three references to the same inner list. While it is unchanged, all seems right. Placing a mark in row 1, column 2, reveals that all rows are aliases referring to the same object. Using + and * with Sequences | 37 The problem with Example 2-13 is that, in essence, it behaves like this code: row = ['_'] * 3 board = [] for i in range(3): board.append(row) The same row is appended three times to board. On the other hand, the list comprehension from Example 2-12 is equivalent to this code: >>> board = [] >>> for i in range(3): ... row = ['_'] * 3 # ... board.append(row) ... >>> board [['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']] >>> board[2][0] = 'X' >>> board # [['_', '_', '_'], ['_', '_', '_'], ['X', '_', '_']] Each iteration builds a new row and appends it to board. Only row 2 is changed, as expected. If either the problem or the solution in this section are not clear to you, relax. Chapter 8 was written to clarify the mechanics and pitfalls of references and mutable objects. So far we have discussed the use of the plain + and * operators with sequences, but there are also the += and *= operators, which produce very different results depending on the mutability of the target sequence. The following section explains how that works. Augmented Assignment with Sequences The augmented assignment operators += and *= behave very differently depending on the first operand. To simplify the discussion, we will focus on augmented addition first (+=), but the concepts also apply to *= and to other augmented assignment operators. The special method that makes += work is __iadd__ (for “in-place addition”). However, if __iadd__ is not implemented, Python falls back to calling __add__. Consider this simple expression: >>> a += b 38 | Chapter 2: An Array of Sequences If a implements __iadd__, that will be called. In the case of mutable sequences (e.g., list, bytearray, array.array), a will be changed in place (i.e., the effect will be similar to a.extend(b)). However, when a does not implement __iadd__, the expression a += b has the same effect as a = a + b: the expression a + b is evaluated first, producing a new object, which is then bound to a. In other words, the identity of the object bound to a may or may not change, depending on the availability of __iadd__. In general, for mutable sequences, it is a good bet that __iadd__ is implemented and that += happens in place. For immutable sequences, clearly there is no way for that to happen. What I just wrote about += also applies to *=, which is implemented via __imul__. The __iadd__ and __imul__ special methods are discussed in Chapter 13. 
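To make the fallback concrete, here is a minimal sketch using a made-up Bag class that implements __add__ but not __iadd__; in that case += rebinds the name to a new object:

>>> class Bag:
...     """Hypothetical class with __add__ but no __iadd__."""
...     def __init__(self, items):
...         self.items = list(items)
...     def __add__(self, other):
...         return Bag(self.items + list(other))
...
>>> b = Bag([1, 2])
>>> original_id = id(b)
>>> b += [3, 4]              # no __iadd__, so this runs as b = b + [3, 4]
>>> b.items
[1, 2, 3, 4]
>>> id(b) == original_id     # b is now bound to a brand new Bag
False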
Here is a demonstration of *= with a mutable sequence and then an immutable one: >>> l = [1, 2, 3] >>> id(l) 4311953800 >>> l *= 2 >>> l [1, 2, 3, 1, 2, 3] >>> id(l) 4311953800 >>> t = (1, 2, 3) >>> id(t) 4312681568 >>> t *= 2 >>> id(t) 4301348296 ID of the initial list After multiplication, the list is the same object, with new items appended ID of the initial tuple After multiplication, a new tuple was created Repeated concatenation of immutable sequences is inefficient, because instead of just appending new items, the interpreter has to copy the whole target sequence to create a new one with the new items concatenated.3 We’ve seen common use cases for +=. The next section shows an intriguing corner case that highlights what “immutable” really means in the context of tuples. 3. str is an exception to this description. Because string building with += in loops is so common in the wild, CPython is optimized for this use case. str instances are allocated in memory with room to spare, so that concatenation does not require copying the whole string every time. Augmented Assignment with Sequences | 39 A += Assignment Puzzler Try to answer without using the console: what is the result of evaluating the two ex‐ pressions in Example 2-14?4 Example 2-14. A riddle >>> t = (1, 2, [30, 40]) >>> t[2] += [50, 60] What happens next? Choose the best answer: a. t becomes (1, 2, [30, 40, 50, 60]). b. TypeError is raised with the message 'tuple' object does not support item assignment. c. Neither. d. Both a and b. When I saw this, I was pretty sure the answer was b, but it’s actually d, “Both a and b.”! Example 2-15 is the actual output from a Python 3.4 console (actually the result is the same in a Python 2.7 console).5 Example 2-15. The unexpected result: item t2 is changed and an exception is raised >>> t = (1, 2, [30, 40]) >>> t[2] += [50, 60] Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'tuple' object does not support item assignment >>> t (1, 2, [30, 40, 50, 60]) Online Python Tutor is an awesome online tool to visualize how Python works in detail. Figure 2-3 is a composite of two screenshots showing the initial and final states of the tuple t from Example 2-15. 4. Thanks to Leonardo Rochael and Cesar Kawakami for sharing this riddle at the 2013 PythonBrasil Confer‐ ence. 5. A reader suggested that the operation in the example can be performed with t[2].extend([50,60]), without errors. We’re aware of that, but the intent of the example is to discuss the odd behavior of the += operator. 40 | Chapter 2: An Array of Sequences |
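As footnote 5 notes, the inner list can be grown without tripping the tuple's immutability, because calling a method mutates the list in place and no assignment to t[2] ever happens:

>>> t = (1, 2, [30, 40])
>>> t[2].extend([50, 60])    # in-place change; no item assignment on the tuple
>>> t
(1, 2, [30, 40, 50, 60])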
Figure 2-3. Initial and final state of the tuple assignment puzzler (diagram generated by Online Python Tutor) If you look at the bytecode Python generates for the expression s[a] += b (Example 2-16), it becomes clear how that happens. Example 2-16. Bytecode for the expression s[a] += b >>> dis.dis('s[a] += b') 1 0 LOAD_NAME 0 (s) 3 LOAD_NAME 1 (a) 6 DUP_TOP_TWO 7 BINARY_SUBSCR 8 LOAD_NAME 2 (b) 11 INPLACE_ADD 12 ROT_THREE 13 STORE_SUBSCR 14 LOAD_CONST 0 (None) 17 RETURN_VALUE Put the value of s[a] on TOS (Top Of Stack). Perform TOS += b. This succeeds if TOS refers to a mutable object (it’s a list, in Example 2-15). Assign s[a] = TOS. This fails if s is immutable (the t tuple in Example 2-15). This example is quite a corner case—in 15 years of using Python, I have never seen this strange behavior actually bite somebody. I take three lessons from this: • Putting mutable items in tuples is not a good idea. Augmented Assignment with Sequences | 41 • Augmented assignment is not an atomic operation—we just saw it throwing an exception after doing part of its job. • Inspecting Python bytecode is not too difficult, and is often helpful to see what is going on under the hood. After witnessing the subtleties of using + and * for concatenation, we can change the subject to another essential operation with sequences: sorting. list.sort and the sorted Built-In Function The list.sort method sorts a list in place—that is, without making a copy. It returns None to remind us that it changes the target object, and does not create a new list. This is an important Python API convention: functions or methods that change an object in place should return None to make it clear to the caller that the object itself was changed, and no new object was created. The same behavior can be seen, for example, in the random.shuffle function. The convention of returning None to signal in-place changes has a drawback: you cannot cascade calls to those methods. In con‐ trast, methods that return new objects (e.g., all str methods) can be cascaded in the fluent interface style. See Wikipedia’s Wikipe‐ dia’s “Fluent interface” entry for further description of this topic. In contrast, the built-in function sorted creates a new list and returns it. In fact, it accepts any iterable object as an argument, including immutable sequences and gener‐ ators (see Chapter 14). Regardless of the type of iterable given to sorted, it always returns a newly created list. Both list.sort and sorted take two optional, keyword-only arguments: reverse If True, the items are returned in descending order (i.e., by reversing the comparison of the items). The default is False. key A one-argument function that will be applied to each item to produce its sorting key. For example, when sorting a list of strings, key=str.lower can be used to perform a case-insensitive sort, and key=len will sort the strings by character length. The default is the identity function (i.e., the items themselves are compared). 42 | Chapter 2: An Array of Sequences The key optional keyword parameter can also be used with the min() and max() built-ins and with other functions from the stan‐ dard library (e.g., itertools.groupby() and heapq.nlargest()). 
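As a quick illustration of key= beyond sorting, here is a small sketch with a made-up list of integers compared by absolute value:

>>> nums = [3, -7, 2, -1]
>>> max(nums, key=abs)
-7
>>> min(nums, key=abs)
-1
>>> import heapq
>>> heapq.nlargest(2, nums, key=abs)
[-7, 3]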
Here are a few examples to clarify the use of these functions and keyword arguments6: >>> fruits = ['grape', 'raspberry', 'apple', 'banana'] >>> sorted(fruits) ['apple', 'banana', 'grape', 'raspberry'] >>> fruits ['grape', 'raspberry', 'apple', 'banana'] >>> sorted(fruits, reverse=True) ['raspberry', 'grape', 'banana', 'apple'] >>> sorted(fruits, key=len) ['grape', 'apple', 'banana', 'raspberry'] >>> sorted(fruits, key=len, reverse=True) ['raspberry', 'banana', 'grape', 'apple'] >>> fruits ['grape', 'raspberry', 'apple', 'banana'] >>> fruits.sort() >>> fruits ['apple', 'banana', 'grape', 'raspberry'] This produces a new list of strings sorted alphabetically. Inspecting the original list, we see it is unchanged. This is simply reverse alphabetical ordering. A new list of strings, now sorted by length. Because the sorting algorithm is stable, “grape” and “apple,” both of length 5, are in the original order. These are the strings sorted in descending order of length. It is not the reverse of the previous result because the sorting is stable, so again “grape” appears before “apple.” So far, the ordering of the original fruits list has not changed. This sorts the list in place, and returns None (which the console omits). Now fruits is sorted. Once your sequences are sorted, they can be very efficiently searched. Fortunately, the standard binary search algorithm is already provided in the bisect module of the Python standard library. We discuss its essential features next, including the convenient 6. The examples also demonstrate that Timsort—the sorting algorithm used in Python—is stable (i.e., it pre‐ serves the relative ordering of items that compare equal). Timsort is discussed further in the “Soapbox” sidebar at the end of this chapter. list.sort and the sorted Built-In Function | 43 bisect.insort function, which you can use to make sure that your sorted sequences stay sorted. Managing Ordered Sequences with bisect The bisect module offers two main functions—bisect and insort—that use the bi‐ nary search algorithm to quickly find and insert items in any sorted sequence. Searching with bisect bisect(haystack, needle) does a binary search for needle in haystack—which must be a sorted sequence—to locate the position where needle can be inserted while main‐ taining haystack in ascending order. In other words, all items appearing up to that position are less than or equal to needle. You could use the result of bisect(haystack, needle) as the index argument to haystack.insert(index, needle)—however, using insort does both steps, and is faster. Raymond Hettinger—a prolific Python contributor—has a Sorted Collection recipe that leverages the bisect module but is easier to use than these standalone functions. Example 2-17 uses a carefully chosen set of “needles” to demonstrate the insert positions returned by bisect. Its output is in Figure 2-4. Example 2-17. 
bisect finds insertion points for items in a sorted sequence import bisect import sys HAYSTACK = [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30] NEEDLES = [0, 1, 2, 5, 8, 10, 22, 23, 29, 30, 31] ROW_FMT = '{0:2d} @ {1:2d} {2}{0:<2d}' def demo(bisect_fn): for needle in reversed(NEEDLES): position = bisect_fn(HAYSTACK, needle) offset = position * ' |' print(ROW_FMT.format(needle, position, offset)) if __name__ == '__main__': if sys.argv[-1] == 'left': bisect_fn = bisect.bisect_left else: 44 | Chapter 2: An Array of Sequences bisect_fn = bisect.bisect print('DEMO:', bisect_fn.__name__) print('haystack ->', ' '.join('%2d' % n for n in HAYSTACK)) demo(bisect_fn) Use the chosen bisect function to get the insertion point. Build a pattern of vertical bars proportional to the offset. Print formatted row showing needle and insertion point. Choose the bisect function to use according to the last command-line argument. Print header with name of function selected. |
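Assuming the listing is saved as bisect_demo.py (a filename I am inventing here), the demo is driven from the command line; passing left as the last argument selects bisect_left, otherwise the default bisect (an alias for bisect_right) is used. The header lines of the output look like this, with the needle rows elided:

$ python3 bisect_demo.py left
DEMO: bisect_left
haystack ->  1  4  5  6  8 12 15 20 21 23 23 26 29 30
...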
Figure 2-4. Output of Example 2-17 with bisect in use—each row starts with the nota‐ tion needle @ position and the needle value appears again below its insertion point in the haystack The behavior of bisect can be fine-tuned in two ways. First, a pair of optional arguments, lo and hi, allow narrowing the region in the sequence to be searched when inserting. lo defaults to 0 and hi to the len() of the sequence. Second, bisect is actually an alias for bisect_right, and there is a sister function called bisect_left. Their difference is apparent only when the needle compares equal to an item in the list: bisect_right returns an insertion point after the existing item, and bisect_left returns the position of the existing item, so insertion would occur before Managing Ordered Sequences with bisect | 45 it. With simple types like int this makes no difference, but if the sequence contains objects that are distinct yet compare equal, then it may be relevant. For example, 1 and 1.0 are distinct, but 1 == 1.0 is True. Figure 2-5 shows the result of using bisect_left. |
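A minimal interactive sketch of that difference, reusing the haystack from Example 2-17; the last call also shows the lo and hi arguments narrowing the search:

>>> import bisect
>>> haystack = [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]
>>> bisect.bisect(haystack, 23)        # alias for bisect_right: lands after the existing 23s
11
>>> bisect.bisect_left(haystack, 23)   # lands before the first 23
9
>>> bisect.bisect(haystack, 23, 0, 5)  # search only haystack[0:5]
5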
Figure 2-5. Output of Example 2-17 with bisect_left in use (compare with Figure 2-4 and note the insertion points for the values 1, 8, 23, 29, and 30 to the left of the same numbers in the haystack). An interesting application of bisect is to perform table lookups by numeric values— for example, to convert test scores to letter grades, as in Example 2-18. Example 2-18. Given a test score, grade returns the corresponding letter grade >>> def grade(score, breakpoints=[60, 70, 80, 90], grades='FDCBA'): ... i = bisect.bisect(breakpoints, score) ... return grades[i] ... >>> [grade(score) for score in [33, 99, 77, 70, 89, 90, 100]] ['F', 'A', 'C', 'C', 'B', 'A', 'A'] The code in Example 2-18 is from the bisect module documentation, which also lists functions to use bisect as a faster replacement for the index method when searching through long ordered sequences of numbers. These functions are not only used for searching, but also for inserting items in sorted sequences, as the following section shows. 46 | Chapter 2: An Array of Sequences Inserting with bisect.insort Sorting is expensive, so once you have a sorted sequence, it’s good to keep it that way. That is why bisect.insort was created. insort(seq, item) inserts item into seq so as to keep seq in ascending order. See Example 2-19 and its output in Figure 2-6. Example 2-19. Insort keeps a sorted sequence always sorted import bisect import random SIZE = 7 random.seed(1729) my_list = [] for i in range(SIZE): new_item = random.randrange(SIZE*2) bisect.insort(my_list, new_item) print('%2d ->' % new_item, my_list) |
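Here is a smaller, deterministic sketch of what each insort call does, independent of the random values used in Example 2-19:

>>> import bisect
>>> my_list = [1, 5, 8, 12]
>>> bisect.insort(my_list, 6)    # binary search for the spot, then insert
>>> my_list
[1, 5, 6, 8, 12]
>>> bisect.insort(my_list, 0)
>>> my_list
[0, 1, 5, 6, 8, 12]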
Figure 2-6. Output of Example 2-19 Like bisect, insort takes optional lo, hi arguments to limit the search to a sub sequence. There is also an insort_left variation that uses bisect_left to find inser‐ tion points. Much of what we have seen so far in this chapter applies to sequences in general, not just lists or tuples. Python programmers sometimes overuse the list type because it is so handy—I know I’ve done it. If you are handling lists of numbers, arrays are the way to go. The remainder of the chapter is devoted to them. Managing Ordered Sequences with bisect | 47 When a List Is Not the Answer The list type is flexible and easy to use, but depending on specific requirements, there are better options. For example, if you need to store 10 million floating-point values, an array is much more efficient, because an array does not actually hold full-fledged float objects, but only the packed bytes representing their machine values—just like an array in the C language. On the other hand, if you are constantly adding and removing items from the ends of a list as a FIFO or LIFO data structure, a deque (double-ended queue) works faster. If your code does a lot of containment checks (e.g., item in my_collection), consider using a set for my_collection, espe‐ cially if it holds a large number of items. Sets are optimized for fast membership checking. But they are not sequences (their content is unordered). We cover them in Chapter 3. For the remainder of this chapter, we discuss mutable sequence types that can replace lists in many cases, starting with arrays. Arrays If the list will only contain numbers, an array.array is more efficient than a list: it supports all mutable sequence operations (including .pop, .insert, and .extend), and additional methods for fast loading and saving such as .frombytes and .tofile. A Python array is as lean as a C array. When creating an array, you provide a typecode, a letter to determine the underlying C type used to store each item in the array. For example, b is the typecode for signed char. If you create an array('b'), then each item will be stored in a single byte and interpreted as an integer from –128 to 127. For large sequences of numbers, this saves a lot of memory. And Python will not let you put any number that does not match the type for the array. Example 2-20 shows creating, saving, and loading an array of 10 million floating-point random numbers. Example 2-20. Creating, saving, and loading a large array of floats >>> from array import array >>> from random import random >>> floats = array('d', (random() for i in range(10**7))) >>> floats[-1] 0.07802343889111107 >>> fp = open('floats.bin', 'wb') >>> floats.tofile(fp) >>> fp.close() >>> floats2 = array('d') 48 | Chapter 2: An Array of Sequences >>> fp = open('floats.bin', 'rb') >>> floats2.fromfile(fp, 10**7) >>> fp.close() >>> floats2[-1] 0.07802343889111107 >>> floats2 == floats True Import the array type. Create an array of double-precision floats (typecode 'd') from any iterable object—in this case, a generator expression. Inspect the last number in the array. Save the array to a binary file. |
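The range checking implied by the typecode can be seen interactively; this is a hedged sketch, and the exact OverflowError message may differ between Python versions:

>>> from array import array
>>> a = array('b', [1, 2, 3])   # 'b': signed char, one byte per item, -128 to 127
>>> a[0] = 127                  # in range: accepted
>>> a.append(128)               # out of range: rejected
Traceback (most recent call last):
  ...
OverflowError: signed char is greater than maximum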