Python Notes, Part 2 - Sequences

Python provides a variety of sequences; understanding these builtin sequences saves us from reinventing the wheel. We can also aspire to create APIs that support existing and user created sequence types. Python offers two types of sequences:

  • Container sequences: These hold items of different types, for example list, tuple, etc.
  • Flat sequences: These hold items of the same type, for example str, bytes, etc.

A container sequence holds references to objects where as a flat sequence contains the value of the contents in it’s own memory. To understand the above point clearly, we need to understand how python stores objects.

Every python object has a header with metadata. For example, a float has:

  • ob_refcnt: Object’s reference count.
  • ob_type: Object’s type.
  • ob_fval: Object’s value.

Each of these fields take 8 bytes in a 64-bit built. So an array.array of floats actually holds the raw values of the floats whereas a list of floats has several objects - the list iteself and reference to each float object. Another way the sequences can be classified is based on it’s mutability:

  • Mutable sequences: list, array.array, collections.deque, etc.
  • Immutable sequences: tuple, str, bytes, etc.

List Comprehensions

List comprehensions or listcomps are used to create a new list from a list by transforming or/and filtering the data in the original list.

Listcomps can be created using mulltiple loops over multiple iterators as well, the key idea is to make sure that the script is readable and the transformation is obvious to the end user. It’s very easy to abuse listcomps to write incomprehensible code. We should stick to plain old for loops in such cases. Listcomps are also faster than map and filter.

1
2
%%timeit
even_squares = [x**2 for x in range(100) if x % 2 == 0]
14.6 µs ± 47.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1
print(even_squares)
(0, 4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576, 676, 784, 900, 1024, 1156, 1296,
1444, 1600, 1764, 1936, 2116, 2304, 2500, 2704, 2916, 3136, 3364, 3600, 3844, 4096, 4356, 4624,
4900, 5184, 5476, 5776, 6084, 6400, 6724, 7056, 7396, 7744, 8100, 8464, 8836, 9216, 9604)
1
2
%%timeit
even_squares = list(filter(lambda x: x % 2 == 0, map(lambda x: x**2, range(100))))
33.6 µs ± 867 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1
print(even_squares)
(0, 4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576, 676, 784, 900, 1024, 1156, 1296, 1444,
1600, 1764, 1936, 2116, 2304, 2500, 2704, 2916, 3136, 3364, 3600, 3844, 4096, 4356, 4624, 4900,
5184, 5476, 5776, 6084, 6400, 6724, 7056, 7396, 7744, 8100, 8464, 8836, 9216, 9604)

Generator Expressions

To initialize sequences of types other than list, like arrays, tuples, etc., we can use genexp as they save memory by yielding one item at a time using the iterator protocol. A listcomp creates a whole list to feed into another constructor.

1
2
even_squares = tuple(x**2 for x in range(100) if x % 2 == 0)
print(even_squares)
(0, 4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576, 676, 784, 900, 1024, 1156, 1296, 1444,
1600, 1764, 1936, 2116, 2304, 2500, 2704, 2916, 3136, 3364, 3600, 3844, 4096, 4356, 4624, 4900,
5184, 5476, 5776, 6084, 6400, 6724, 7056, 7396, 7744, 8100, 8464, 8836, 9216, 9604)

Tuples

Beyond using tuples as “immutable lists”, they can also be thought of as a system of records. Tuples along with “unpacking” make for very readable way to access elements from a sequence by avoiding indexing.

1
2
3
for country, capital in [("India", "New Delhi"), ("Afganistan", "Kabul"),
                         ("Canada", "Ottawa")]:
    print(f"{country}/{capital}")
India/New Delhi
Afganistan/Kabul
Canada/Ottawa

Immutaiblity of Tuples

The fact that tuples are immutable has many benefits:

  • We know that the value of a tuple object will never change.
  • Tuples are faster than lists as they have less overheads.

Tuple are faster than lists because:

Tuples can be constant folded

1
2
3
from dis import dis

dis(compile("(10, 'abc')", "", "eval"))
  1           0 LOAD_CONST               0 ((10, 'abc'))
              2 RETURN_VALUE
1
dis(compile("[10, 'abc']", "", "eval"))
  1           0 LOAD_CONST               0 (10)
              2 LOAD_CONST               1 ('abc')
              4 BUILD_LIST               2
              6 RETURN_VALUE

Tuples do not need to be copied

1
2
3
a = (10, 20)
b = tuple(a)
a is b
True
1
2
3
a = [10, 20]
b = list(a)
a is b
False

Tuples do not over-allocate

1
2
3
import sys

sys.getsizeof(tuple(iter(range(10))))
120
1
sys.getsizeof(list(iter(range(10))))
192

Here is the comment from Objects/listobject.c that explains what lists are doing:

* This over-allocates proportional to the list size, making room
* for additional growth.  The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
* Note: new_allocated won't overflow because the largest possible value
*       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.

Coming back to the immutability aspect of Tuples, the fact that tuples are container style sequences, they can hold mutable objects.

1
2
3
a = (10, "alpha", [1, 2])
b = (10, "alpha", [1, 2])
a == b
True
1
2
b[-1].append(99)
a == b
False
1
b
(10, 'alpha', [1, 2, 99])

Examples like this can be a source of bugs. It’s always a good idea to avoid mutable objects in a tuple. A tuple without mutable objects can be used as a key for a dict. This can be verified using the hash method.

1
hash((10, "alpha", 1, 2))
1614661222696736202
1
hash((10, "alpha", [1, 2]))
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-58-6818f30e4402> in <module>
----> 1 hash((10, "alpha", [1, 2]))


TypeError: unhashable type: 'list'

Unpacking sequences and iterables

Unpacking helps avoid indexing to extract items from sequences. It works with any iterable object, even if the object doesn’t support index notation [].

1
2
country, capital = ("India", "New Delhi")
country
'India'
1
capital
'New Delhi'

An elegant way to use unpacking is to swap values without using a temporary variable.

1
2
3
4
a = 10
b = 20
a, b = b, a
a, b
(20, 10)

We can use * to grab excess items.

1
2
a, b, *rest = range(5)
a, b, rest
(0, 1, [2, 3, 4])
1
2
a, *rest, b = range(10)
a, rest, b
(0, [1, 2, 3, 4, 5, 6, 7, 8], 9)

Unpacking works with nested structures as well.

1
2
3
4
5
shopping_cart = [(1, "apple", (3.2, 10)), (2, "milk", (3, 20))]

print(f"{'item':<10} | {'cost':>4} {'qnt':>4} {'total':>5}\n{'-'*28}")
for _, item, (cost, qnt) in shopping_cart:
    print(f"{item:<10} | {cost:>4} {qnt:>4} {cost*qnt:>5}")
item       | cost  qnt total
----------------------------
apple      |  3.2   10  32.0
milk       |    3   20    60

Slicing

Slices in python excldue the last item. This convention is common across many languages. This feature has many convenient side effects:

  • Length of range(3) and my_list[:3] is the same.
  • It’s easy to compute the length of a slice when start and stop are given using stop - start.
  • It’s easy to split a sequence at an index x using my_list[:x] and my_list[x:]
1
2
my_list = list(range(10))
my_list[:3], my_list[3:]
([0, 1, 2], [3, 4, 5, 6, 7, 8, 9])

One uncommon use of slices is to create constants to make the code more readable. By assigning names to slices we can be more explicit instead of needing the user to figure out what the indexes mean.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
invoice = """
1  apple  $3.2 10
2  banana $1.5. 5
"""

SKU_ID = slice(0, 2)
SKU_NAME = slice(3, 9)
PRICE = slice(10, 14)
QUANTITY = slice(15, 17)

for sku_line in invoice.split("\n")[1:-1]:
    print(sku_line[SKU_ID], sku_line[SKU_NAME], sku_line[PRICE], sku_line[QUANTITY])
1  apple  $3.2 10
2  banana $1.5  5

The [] operator can also take multpile indexes and slices separated by commas. This pattern is common in external libraries like numpy and pandas where this notation is used to indicate index/slice across rows and columns.

1
2
3
4
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
arr
array([[1, 2, 3],
       [4, 5, 6]])
1
arr[:, 1:]  # all rows, columns 2 and 3
array([[2, 3],
       [5, 6]])

Slices can also be used on the left-hand side of an assignment operation or as the target of a del operation.

1
2
my_list = list(range(10))
my_list
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1
2
my_list[3:5] = [30, 40]
my_list
[0, 1, 2, 30, 40, 5, 6, 7, 8, 9]
1
2
del my_list[3:5]
my_list
[0, 1, 2, 5, 6, 7, 8, 9]

Using * with Sequences

To concatenate multiple copies of the same sequence, we multiply it by an integer which creates a new sequence.

1
2
3
my_list = [1, 2, 3]
new_list = my_list * 3
new_list
[1, 2, 3, 1, 2, 3, 1, 2, 3]
1
id(my_list), id(new_list)
(140325040574144, 140325040580864)

Using * to multiply sequences with mutable objects can lead to weird behavior due to the fact that we end of referring to the same object.

1
2
weird_board = [["_"] * 3] * 3
weird_board
[['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']]
1
weird_board[0] is weird_board[1]
True

Looking at the above(both referring to the same object), the below result should not be a surprise.

1
2
weird_board[0][2] = "X"
weird_board
[['_', '_', 'X'], ['_', '_', 'X'], ['_', '_', 'X']]

Augmented Assignment with Sequences

The augmented assignment += works using the special method __iadd__. If __iadd__ is not implemented, Python falls back to calling __add__.

1
2
3
my_list = [1, 2, 3]
my_list += [4, 5]
my_list
[1, 2, 3, 4, 5]

__iadd__ works similar to .extend(). If my_list does not implement __iadd__, a+=b is the same as a=a+b.

Usage

Although it’s highlighted everywhere that lists are container type sequences that can hold objects of different types, in practice we typically store objects of similar type in a list so that we can apply some common operation on them(i.e. they should all “quack” whether or not they are 100% ducks).

Tuples being used as records can be used to store objects of different types in practice.

1
2
my_list = [1, 2, "30", 4, "5"]
my_list
[1, 2, '30', 4, '5']
1
sorted(my_list)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-135-cb83e3f42d3f> in <module>
----> 1 sorted(my_list)


TypeError: '<' not supported between instances of 'str' and 'int'

Most of the languages need us to provide a cmp(a, b) method to make comparisons to be able to sort a sequence. In python, using key argument in list.sort, max, min or sorted methods removes the need for this cmp() method. Python does compare keys but that is done in optimized C code and not using a method provided by us.

1
sorted(my_list, key=int)
[1, 2, 4, '5', '30']
1
sorted(my_list, key=str)
[1, 2, '30', 4, '5']


References

[1] Luciano Ramalho. “Fluent Python”.