Iterators for Curious Minds
My talk given at PyCon CZ 2019.
Abstract
Do you know the difference between iterators and iterables? Are you able to implement a custom collection? Why something so basic as a for-loop got its chapter in the famous Gang of Four book?
The iterator protocol is a prime example of Python language design.
I will not only show how to implement custom iterators but also illustrate how other Python features are provided using similar idioms. I will informally introduce the iterator design pattern and show how it is built into the language. After mentioning some examples of iterables, we will look into Python data model, implement some generators, and maybe touch async frameworks.
Video
Transcript
Here is a transcript of my talk.
PDF version of my slides is available too, but the slide deck is meant for illustration, so it may not be useful alone.
Hi, thank you for coming. It’s great to see so many curious minds.
My name is Miloslav Pojman, and my job to make the Internet faster. I’m very curious, especially about Python.
In this talk, I would like to share with you what my curiosity taught me about Python for loops.
You know, the for loops: For something in something else. Seriously, we often forget what smart software design is hidden beneath that statement.
After this talk, you should understand what’s happening under the hood when you write a for loop. With this knowledge, you should be able to write custom classes that can be iterated over. I will show you
how iteration fits into Python data model.
You probably know the iter()
function, but you may not be why the following expression is useful.
Let’s make a short poll:
Who knows what does the iter(v) is v
condition return? Who knows why this expressions can be useful, please raise your hand.
I have one more poll for you:
Who read the “Gang of Four” book? Who read “Design Patterns: Elements of Reusable Object-Oriented Software”, please raise your hand.
The Gang of Four describes 23 classic design patterns, including the iterator. I would not be afraid to call it a bible of software design.
Martin Fowler describes it as “the best book ever written on object-oriented design.”, maybe on software design overall.
Even so, many Python developers don’t feel the need to read it. I assume that we do not encounter the problems that the book is solving. I think that it is because Python solved those problems for us by incorporating design patterns into the language.
The iterator is a great example: we use iterators every day, but we may not realize it because they are hidden.
Language Comparison
Design patterns, including the iterator, are not specific to one programming language. So, for motivation, let’s look at examples of iteration in other languages.
In C++, maps are an equivalent of Python dictionaries. The maps in C++ are implemented using red-black trees. You don’t have to know about the red-black trees, the iterator does that.
The code example on the slide is quite compact if you consider what logic is necessary. But still, the example on the slide is longer than it would be in Python because the iterator is visible.
For comparison, look at JavaScript. JavaScript supports a for-in statement for iterating over objects.
Unfortunately, the statement is almost useless because its behavior is hardcoded and it cannot be customized. If you wrote a map class, it would not be supported by the for-in statement.
In C++, we can see iterators without syntactic sugar. In JavaScript, we can see syntactic sugar without iterators.
Both these states are suboptimal. Both C++ and JavaScript addressed that by introducing a new syntax that uses iterators.
Back to Python.
We should appreciate that Python since its beginning provides a nice for-loop syntax.
And since Python 2.2, since year 2001, the for-loop syntax uses iterators under the hood.
If you are interested about iterators’ history, I recommend you David Beazley’s talk on that topic. What I found most interesting is that the current model started as a hack abusing indexing of sequences.
Iterator Pattern
If this has to be a proper software engineering talk, we have to start with the serious stuff.
The UML.
You can see an UML diagram of the iterator pattern. A very similar one can be found in the Gang of Four book. Do you understand it? Do you like it? Don’t worry, it’s not as bad as it looks like
First, ignore that the diagram has two rows.
The top row is exactly same as the bottom one. The top row describes how it should look like, and the bottom says that it looks like it should:)
Now look what’s on the left and what’s on the right.
On the left is a container, aka an aggregate, aka a collection. The container is an object that holds other objects. It can be a list, a set, a dictionary, you name it.
On the right is an iterator. The iterator is what you may not see, but what is necessary, because the iteration is implemented there. The iterator is created when a for loop is encountered.
So the whole UML diagram shrinks to an observation that there is something hidden called iterator, and it is created when necessary.
Let’s look how this is translated to Python code.
In Python, the iter()
function returns an iterator for a given container. The iter()
function is called once for each for loop, at the begging of the iteration.
For getting individual items, the next()
function is available. When next()
is called for the first time, it returns the first item. When you call it again, it returns subsequent items. When there are no more items to be produced, the next()
function raises StopIteration
.
The next()
function is called once for each loop cycle, plus once on top that of that to receive the StopIteration
exception.
In Python, there is no method to check whether a next item is available. You have ask for that item and catch possible exception.
This approach is called EAFP, “Easier to ask for forgiveness than permission”. Some may argue that throwing exceptions for normal control flow is a bad practice, but this coding style is common in Python.
A curious mind should ask: What superpowers does the iter()
function have? How is it able to construct an iterator for any container it gets?
A curious mind should ask: How can the next()
function understand to all possible iterators?
Answers to previous questions will be obvious if we observe a common pattern in the the Python data model.
- The
str()
builtin calls__str__()
, len()
calls__len__()
,bool()
calls__bool__()
,- and so on.
It won’t surprise you that the iter()
function calls __iter__()
and next()
calls __next()__
.
Just be careful, in Python 2, they forgot the underscores in __next()__
, so upgrade to Python 3 to fix that bug.
I you saw some Python talks before, you probably noticed that these double-underscore methods are often abbreviated “dunder”. So, dunder-iter, dunder-next.
Python provides convenient syntax and builtins which delegate to double-underscore methods which can be overridden to customize Python behavior.
Containers vs. Iterators
A curious mind should ask: Why do we need iterators? Isn’t iteration something that should be implemented in containers? Why do we need separate classes for that?
I will give you two reasons.
The first reason is a software-engineering design principle called separation of concerns.
Let’s look at a Python list. I can call the iter()
function and I will get items in order.
Or, I can decide to call the reversed()
function, and I will get items from the last one to the first one.
Thanks to separation between containers and iterators, you can come up with any iteration order and write an iterator that will execute it.
The even more important reason for separating iterators from containers is that iteration state has to be stored somewhere.
Python strings or tuples are immutable, they cannot be changed after creation. Something that is immutable cannot hold a current index.
Even if we ignored immutability, it would not be a good idea to store position in containers. One container can be used in two nested loops, so storing the position with the container would not be sufficient.
That’s where iterators become handy. For each for loop, for each iteration, a new iterator with a new variable storing a position is created.
I have shown you that we need two kinds of things: the containers and the iterators. What I haven’t told you is that both the containers and the iterators are iterable.
We use containers in for loops routinely.
You can add an iter()
call to a for loop and its behavior will not change.
Or more importantly, you request an alternative iterator, for example using the reversed()
function.
This is specific for Python. For example in Java, iterators are not iterable. Also the Gang of Four book does not mention this tweak.
How does this work?
It’s quite simple. When you call iter()
on an existing iterator, the iterator is returned unmodified.
Do you remember when I was explaining that each for loop creates a fresh iterator? Well, it’s not true if you use an existing iterator. It means that iterators cannot be reused.
Reuse of an existing iterator can lead to unexpected bug. For example, look at the slide. The second for loop does not print anything. The reversed iterator is exhausted after the first loop.
Now you have enough information to answer my question from the first poll: What does the iter(v) is v
condition return and why it can be useful?
It checks whether the given value is an iterator. It can useful because when we know that an object is an iterator, then we know that it cannot be used more than once. So, if we plan multiple iteration, we can take some countermeasures, for example store product of the iterator to a list.
Speaking about iterables, special mention belongs to files and file like objects. You probably know that you can iterate over a file object to get its content line by line.
The question is: Are file objects containers or iterators? Can they be reused?
The correct answer is that files are iterators. They are exhausted after the first iteration. When you open a file, the operating system tracks position in that file for you, so you cannot have more than one position.
Think of the standard input stream. When you iterate over it, a user has to type one the keyboard. She will not retype all input just because you added another for loop to your code.
Custom Iterables
Now, you should understand now what happens when you write a for loop. Let’s use this knowledge to write a custom class, that can be iterated over.
Let’s write a DateRange
class. The DateRange
class will behave very similarly to the Python built-in range, but it will produce date instances instead of numbers.
The class will need the start
and the stop
of the range.
If the class has to be iterable, it has to implement the __iter__()
method. This method has to return an iterator. Let’s call that iterator DateRangeIterator
.
What you can see is a half of an UML diagram. We implemented the left side of it, the container. Let’s move on to the right side.
The DateRangeIterator
has to remember the state of the iteration. Let’s store what date value should be produced next. And of course, we have to remember when to stop.
Don’t forget that iterators are iterables too. We have to implement the __iter__()
method which should return self.
You should remember that the iterators are advanced by calling the __next__()
method.
The __next__()
method should raise StopIteration
when the stop is reached.
In other cases, the current date should be returned and advanced by one day. And that’s all. We are done. We implemented the whole UML diagram. Was that difficult? What do you think? Can we simplify it? Do I hear generators?
Generators
Yes, we can do better. The magic keyword is called yield
.
The DateRangeIterator
class, the right side of the UML diagram, can be replaced by the __iter__
method that you can see on the slide.
Any function that contains the yield keyword returns a generator, which is a special type of an iterator. When next is called on that iterator, the function is executed to the yield statement. Then it is suspended, waiting to be resumed by another next call.
Our example can be simplified even more. You don’t need a class to return a generator. A simple function is enough. Five lines of code can replace both classes we defined a few slides ago.
But don’t forget that without a container, you are directly creating a generator, and generators are iterators, and iterators are exhausted after the first use.
Generators would be a great topic for a separate talk, or more likely a workshop. I won’t go into details here.
My recommendation is: When you see a list, consider replacing it by a generator.
The code on the slide somehow reads a file and appends processed lines to a list.
If I use a generator instead, I save two lines of code and potentially lots of memory.
List comprehensions can be replaced by generator comprehensions. Just replace square brackets by rounded parentheses.
Iterators and generators can be piped into each other, composing large data flows with almost no memory requirements.
This is a really powerful construct. The whole async/await support, which landed to Python recently, is built on top generators.
Summary
Time is running, so let’s sum it up.
We have two kind of iterables:
- Containers
- Iterators
Containers, which can be iterated again and again.
Iterators are one-off, they are exhausted after the first use.
I hope that you learnt something new. Thank you for your attention.
License
Slides are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.