Three Axioms of Python Part I: List Comprehensions



Contents:
Introduction to List Comprehensions
An Excursion Into Lisp
List Comprehension Useful Tips and Tricks
The Monty Hall Problem

Developers who use Python often highlight the minimalist nature of its syntax as one of the main reasons they like it: the amount of stuff you have to know to be effective in the language is unbelievably small.

Understanding just three Python ideas: list comprehensions, tuples, and name spaces instantly transports you to the level of facility with the language that takes months to achieve in many others. The choice of syntax for a programming language is much akin to the choice of axiom sets in mathematics: though it is ultimately arbitrary, a useful criterion for selecting one set of axioms over another is the richness of its implications — the amount of interesting theorems that can be derived from it. Judged on this basis, Python is impressive: the yield of interesting implications per pound of syntax is very attractive.

It is true that there are other languages with syntax that is even more minimal. However, the sheer minimality is not the object — in that sense nothing can beat a good old Turing machine. What makes Python effective is not just the fact that its base syntax is terse, but that the code that can be naturally and easily derived from it happens to be the stuff a programmer actually does on daily basis.

Although understanding these three “axioms” (list comprehensions, tuples, and name spaces) is central to writing Pythonic code, they can be a bit counter-intuitive to programmers coming from other languages, and, since Python makes it easy to avoid using them, it is quite possible to program in Python for years without gaining true fluency. This is a three part series of posts that focuses on demonstrating the utility of these concepts with practical examples.

It’s worth noting that some or all of these concepts exist in various forms in many other languages including Clojure, Lisp, and Haskell — to name a few. These concepts certainly predate Python by many years. Python’s syntax just happens to combine them in a particularly elegant manner, and this article focuses on how a programmer can become more effective by utilizing their full potential.

Introduction to List Comprehensions

First, the name “list comprehension” can be confusing to a newcomer since in standard English we don’t normally use the word comprehension as a countable noun, as in: I improved my comprehensions. In Python (and in other languages that support them), it is perfectly possible to say that “I wrote several list comprehensions”, so, perhaps, “a list evaluation” would be another useful way to to think about the concept.

A list comprehension is a flexible a way to make a list from another list. As such, it has three logical parts:

1. the input list (or the source list)
2. the condition based on which to select a subset of the input list
3. the expression (or the function) that should be applied to transform each element of the subset

In Python, the syntax logically looks as follows:

output_list = [EXPRESSION INPUT_LIST CONDITION]

This order of this syntax should be read: (1) for each element of the INPUT_LIST; (2) select only the elements that satisfy CONDITION; (3) apply EXPRESSION to each selected element and assign the resulting list to output_list.

Let’s consider a specific example: we want to transform a list of natural numbers into a list of their squares:

Example 1:

>>> nums = [1, 2, 3, 4, 5]
>>> [x*x for x in nums]
[1, 4, 9, 16, 25]


In this example, the INPUT_LIST is specified by for x in nums, the CONDITION is left empty which means that all elements of the INPUT_LIST are selected, and the EXPRESSION is x*x. Now let’s add a condition by selecting only the even numbers from the subset:

Example 2:

>>> nums = [1, 2, 3, 4, 5]
>>> [x*x for x in nums if x%2==0]
[4, 16]


In this example CONDITION if x%2==0 is a filter that selects numbers divisible by 2 prior to applying the EXPRESSION x*x.

Note that the order in which these syntactic elements are positioned in the list comprehension is not necessarily the most intuitive, but if you take a couple of seconds to memorize it you will never find it problematic again. In many tutorials, the CONDITION element is usually glossed over, but it is a worthwhile investment to be aware of it as we’ll see with more complex examples.

For SQL experts: Someone familiar with SQL will notice that this is in fact identical to the order used in SQL:

SELECT EXPRESSION(x) FROM SOME_TABLE WHERE CONDITION



This no coincidence since the syntax for both SQL and list comprehensions derive from set_builder notation. This can be a very helpful mnemonic for those coming from enterprise Java or .NET background.


An Excursion Into Lisp

Python provides a set of Lisp-like functional tools: lambda,filter,map, and reduce. When I first starting using Python, I found its list-comprehension syntax a bit off-putting and out of familiarity stuck with the functional tools for all my list transformation needs. Then I read Guido’s reasoning for why Lisp tools should be dropped from Python. The crux of the argument was that list comprehensions already gave you all those things, and, in keeping with Python minimalist philosophy, functional tools had to go as there should be one– and preferably only one –obvious way to do it. I decided to give the man a fair hearing by making myself use only list comprehensions for a week and found that in most everyday situations they were a good match, resulting in more readable code. I still use the functional tools, but only when the problem truly calls for them. It’s not a matter of better or worse, but, given Python’s overall syntactic structure, list comprehensions fit more naturally in the overall program flow.

For those who are coming from the Lisp background, I will be mentioning the equivalents using Python’s functional tools to provide a point of comparison.

In Example 1 [x*x for x in nums] would be equivalent to map(lambda x:x*x, nums)
In Example 2 [x*x for x in nums if x%2==0] would be equivalent to map(lambda x:x*x, filter(lambda x:x%2==0, nums)).


List Comprehension Useful Tips and Tricks






A. The EXPRESSION does not have to be scalar.


This is absolutely key to making the most of Python’s list comprehension syntax. In Python, almost anywhere you can use a scalar, you can also use a list or a tuple. We will cover the use of tuples in more detail separately (they can be thought of loosely as multivalued primitives). For now, in the interest of clarity, we can stick with lists without loss of generality.

For instance, to generate a list of pairs x, x^2 – 1 for odd numbers you could write:

Example A1: Generate a list of pairs

>>> nums = [1, 2, 3, 4, 5]
>>> pairs = [[x, x*x-1] for x in nums if x%2>0]
>>> pairs
[[1, 0], [3, 8], [5, 24]]





B. You can have more than one CONDITION.


To generate the same list as above excluding number 5 you would write:

Example B1: Use of multiple conditions

>>> nums = [1, 2, 3, 4, 5]
>>> pairs = [[x, x*x-1] for x in nums if x%2>0 if x!=5]
>>> pairs
[[1, 0], [3, 8]]





C. List comprehensions can be used for any kind of reordering.


Let’s say we wanted to swap the elements in our pairs from the previous example:

Example C1: Swap order of elements

>>> nums = [1, 2, 3, 4, 5]
>>> pairs = [[x, x*x-1] for x in nums if x%2>0 if x!=5]
>>> pairs
[[1, 0], [3, 8]]
>>> swapped_pairs = [[p[1], p[0]] for p in pairs]
>>> swapped_pairs
[[0, 1], [8, 3]]





D. You can have several for-clauses in a single list comprehension.


This may seem a little involved, but it comes in very handy when you work with multi-dimensional datastructures. Note that this is NOT the same as a nested list comprehension shown in Example F1.

For SQL experts: this creates a cartesian product of several INPUT_LISTs specfied in the for-clauses to which CONDITIONs and EXPRESSIONs can then be applied. This is similar to specifying several tables in the SQL FROM clause.

To flatten a two dimensional matrix into a single list:

Example D1: Flatten a matrix into a list

>>> mx = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> [value for row in mx for value in row]
[1, 2, 3, 4, 5, 6, 7, 8, 9]


This is equivalent to:


result = []
for row in mx:
    for col in row:
        result.append(col)


You can of course augment this with any CONDITIONs and EXPRESSIONs. Say we wanted to only pick the odd values from the matrix and then subtract 1.

Example D2: Flatten a matrix into a list with a condition and an expression


>>> mx = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> [value-1 for row in mx for value in row if value%2>0]
[0, 2, 4, 6, 8]





E. Use enumerate() to inject an explicit index into the list comprehension.


Sometimes developers use for-loops instead of list comprehensions because they need an explicit iterator index. This is elegantly accomplished with enumerate().

Example E1: Use enumerate() to add index to a list


>>> nums = [5, 6, 7, 8, 9]
>>> idx_sqrs = [[idx, value*value] for idx, value in enumerate(nums)]
>>> idx_sqrs
[[0, 25], [1, 36], [2, 49], [3, 64], [4, 81]]


Another basic use is to employ the for-loop index as an explicit part of the CONDITION.

Example E2: Pick odd numbered elements from a list


# adding 1 in the if condition because Python is 0-based
>>> nums = [32, 58, 43, 78, 29]
>>> [value for idx, value in enumerate(nums) if (idx+1)%2>0]
[32, 43, 29]


You can mix value and index CONDITIONs.

Example E3: Mixed conditions on index and value


# adding 1 because Python is 0-based
>>> nums = [32, 58, 43, 78, 29]
>>> [value for idx, value in enumerate(nums)
>>> if (idx+1)%2>0 if value>30]
[32, 43]


For a more advanced example let's use enumerate() to select the diagonal of a matrix. That is, we need to select the elements whose row index and column index are equal.

Example E4: Select diagonal members of a matrix


>>> mx = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> [value for row_num, row in enumerate(mx) for col_num,
>>> value in enumerate(row) if row_num == col_num]
[1, 5, 9]





F. You can nest list comprehensions.


Note this is NOT the same as Example D1 where we had multiple for elements in a single list comprehension. The idea is that each EXPRESSION is itself a list comprehension, and thus what results is a list of lists. Here we use a nested list comprehension to transpose a matrix.

Example F1: Transpose a matrix


>>> mx = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> mx_trans = [[row[idx] for row in mx] for idx in range(len(mx))]
>>> mx_trans
[[1, 4, 7], [2, 5, 8], [3, 6, 9]]


range(size) is a Python function that provides a list of sequential integers of the specified size that we are using here as an independently varying index.


G. You can use list comprehensions with string-oriented data as well.



Let's say we have a list (or file reader) with urls of various websites and we want to select only the urls that contain cnn.com (this is the CONDITION). Furthermore, we want to remove any leading or trailing spaces (this is the EXPRESSION). For matching the pattern we will use the Python regex package "re" (which is really awesome but that's another story).

Example G1: Filter string data


>>> import re
>>> urls = [' http://www.abc.com/ ',
>>> ' http://cnn.com/ ', ' http://nbc.com/' ]
>>> cnn_urls = [url.strip() for url in urls
>>> if re.findall('.*cnn.com.*', url)]
>>> cnn_urls
['http://cnn.com/']



Using Lisp tools the equivalent of this would be:


>>> cnn_urls = map(lambda url: url.strip(),
>>> filter(lambda url: re.findall('.*cnn.com.*', url), urls))
>>> cnn_urls
['http://cnn.com/']





H. The EXPRESSION does not have to depend on the iteration variable.


Typically, a list comprehension is used to iterate over an input list and output some function of the list elements into the resulting list. That has been the case with all the preceding examples. However, that is not required.

To illustrate, we will generate a list of 10 random numbers, each number ranging from 1 to 5 inclusively. We will use the randint() function from the Python random library.

In this case, the range() function provides the input list 0,1,2,3 and so on. However, the EXPRESSION does not depend on the iteration variable x and instead makes a call to randint().

Example H1: Generate a list of random numbers


>>> from random import randint
>>> nums = [randint(1, 5) for x in range(0, 10)]
>>> nums
[3, 1, 2, 2, 1, 5, 4, 1, 4, 3]






The Monty Hall Problem

For our list comprehension coup-de-grace, we will implement a simulation of the famous Monty Hall problem. The problem is fascinating in its own right and touches on many deep philosophical questions regarding probability. In fact the paradox underlying the problem still remains unsolved and is a subject of debate among scientists.

For our purposes the gist is that it's a game show, and there are three doors. Behind one door there is a fancy car, and the other two are empty (in many versions the non-prize doors have goats but that's not nice to the goats, and besides the goats can give away the game by making noise behind the doors).

The player picks a door (at random). The show host then opens one of the doors that does not have the prize (the host knows which doors are which, the player does not). The host then offers to the player to either stick with his original choice or to switch to the other closed door (now there are only two doors remaining -- the player's original choice and the one the host did not open).

The question is this: "Should the player switch or stick with his original choice to maximize his chances of winning?" It seems hard to believe, but the right answer is that you should always switch -- and switching makes a huge difference in the probability of picking the right door. According to Wikipedia, the famous mathematician Paul Erdos refused to believe the proven solution to this problem until he was shown a computer simulation.

Note: The implementation of the Monty Hall simulation discussed below is designed to bring the three axioms of Python (list comprehensions, tuples, namespaces) together in one exercise. The latter two will be discussed in separate posts.

Here is the whole simulation with detailed explanations that follow.


from random import choice, randint
size = 1000
num_doors = 3

# generate simulation data 
prize_door = [randint(0, num_doors - 1) for x in range(0, size)]
player_door1 = [randint(0, num_doors - 1) for x in range(0, size)]
host_choice = lambda x, y: choice(list(set(range(0, num_doors)) - set([x, y])))
host_open_door = [host_choice(x, y) for x, y in zip(prize_door, player_door1)]
player_choice = lambda x,y: choice(list(set(range(0, num_doors)) - set([x, y])))
player_door2 = [player_choice(x, y) for x, y  in zip(player_door1, host_open_door)]

# total up successful picks to compare the two strategies
noswitch_success_num = len([1 for x, y in zip(prize_door, player_door1) if x == y])
switch_success_num = len([1 for x, y in zip(prize_door, player_door2) if x == y])

# output results
print "Without door switch win chance pct = ",  (100 * noswitch_success_num / size)
print "With door switch win chance pct = ",  (100 * switch_success_num / size)

Program output:
Without door switch win chance pct =  32
With door switch win chance pct =  67



This is nearly exactly correct. According to the theoretical solution the chances are respectively 1/3 and 2/3.

Detailed Commentary and Analysis


1. We will simulate 1000 game shows with the player using the two different strategies, and we will see which one fares better. We import Python's random module from which we will use two functions: randint() and choice().


from random import choice, randint
size = 1000
num_doors = 3


2. First we have to simulate 1000 random placements of the prize behind the 3 doors and 1000 initial door picks made by the player. To generate random numbers we will use the Python library function random.randint(a,b) which returns a single random integer between in the interval [a,b] inclusively. Here we will use an occasionally useful trick: the EXPRESSION in this case does not depend on the iteration variable x, as explained in example H1. The iteration variable x is there solely to create 1000 evaluations of the EXPRESSION which in this case is random.randint(0, num_doors - 1). Each evaluation of the EXPRESSION generates a random number 0,1, or 2. Thus, the list comprehension generates a list of 1000 random numbers between 0 and 2.


prize_door = [randint(0, num_doors - 1) for x in range(0, size)]
player_door1 = [randint(0, num_doors - 1) for x in range(0, size)]


3. By definition of the problem the host opens the door that is (a) not the player original choice and (b) is not the prize. So to generate one instance of host's choice we need to (a) take the total range of available choices, that is the doors numbered 0,1,2 then (b) remove the door that contains the prize and the door that the player chose. Since the prize door and the player's original pick can be the same door we need to remove duplicates. To achieve this we use Python's "set" datatype which automatically removes duplicates and also allows us to compute set differences.

After we take the difference between the full set of 0,1,2 and the doors that have been eliminated we need to choose between the remaining doors at random. To accomplish this we will use Python's random.choice() function that picks an item from a list at random. Note: in the 3 door version of the problem there will be either 2 doors left (if player picked right the first time) or just 1 (if the player picked wrong and the host has to open the only remaining empty door). When there is only 1 door remaining it is a special case and the random.choice() function only has 1 number to pick from.

Since the function definition easily fits on one line, for readability we will use an in-line lambda rather than creating a separate single-line function.


host_choice = lambda x, y: choice(list(set(range(0, num_doors)) - set([x, y])))


Note that, regrettably, the random.randint(a,b) is inclusive of its left bound, whereas range(a,b) is not inclusive of its left bound. For this reason we use randint(0, num_doors - 1) but range(0, num_doors). (I'm sure the must be a good reason for this difference in syntax but I don't know what it is.)

4. Now we use the host_choice function to create an array of 1000 empty doors opened by the host. Here we see heavy use of tuples and zip() which will be discussed in a separate post.


host_open_door = [host_choice(x, y) for x, y in zip(prize_door, player_door1)]


5. Now that the host opened an empty door he or she offers the player to make an alternate choice. By definition the alternate choice is one that (a) is not the host's choice because it is empty and (b) is not the player's orginal choice. Here we use the Python set() datatype similar to the way we did in step (3) above and then we generate 1000 player alternate door picks as we did for the host in step (4)


player_choice = lambda x,y: choice(list(set(range(0, num_doors)) - set([x, y])))
player_door2 = [player_choice(x, y) for x, y  in zip(player_door1, host_open_door)]


That's it! We have everything we need:

prize_door - the actual position of the prize
player_door1 - player's original choice
players_door2 - the alternate door the player picks if he takes the host's offer and makes the switch

6. Now we just need to see in how many cases out of 1000 player_door1 indicates the correct door and compare it with the success rate of players_alt_door. We scale the success ratio of each strategy by a 100 to obtain percentage.

Note that in this case we are only interested in counting the number of times the CONDITION matches in the list comprehension. Therefore, the EXPRESSION is not relevant and for readability we put "1" as the EXPRESSION to highlight the fact we are not using it.


# total up successful picks to compare the two strategies
noswitch_success_num = len([1 for x, y in zip(prize_door, player_door1) if x == y])
switch_success_num = len([1 for x, y in zip(prize_door, player_door2) if x == y])

# output results
print "Without door switch win chance pct = ",  (100 * noswitch_success_num / size)
print "With door switch win chance pct = ",  (100 * switch_success_num / size)

Program output:
Without door switch win chance pct =  32
With door switch win chance pct =  67


submit to reddit
Add this Article to Stumbleupon Add this Article to Digg Add this Article to Del.icio.us