Switching from 2 to 3: Love, asyncio, and more

Published: 2016-02-27
Tagged: python networking essay

Updated with more Python 3 awesomeness from the good people at r/Python at the bottom

Like many Python developers who have been riding the 2.x line, I've been itching to write some Python 3 code. Just think about amazing things like easy unicode handling, a built-in asynchronous library, iterable unpacking, tracebacks on exception objects, an end to the *.pyc madness, and so much more!

If you follow r/Python, you know that Python 3 has been a pretty hot topic recently, with a nice summary here as well as a huge thread of 2 vs 3 stats. When I started my adventure with Python, the lack of libraries for Py3k was still a major issue. That is no longer the case. Up until now I was unsure whether to believe the people who said "Write everything new in Python 3 while slowly migrating the old stuff".

I mean, once you grasp stuff like .encode('utf-8-sig') and .decode('utf-8'), or running find ./ -name "*.pyc" -delete as the first thing when your previously green tests suddenly fail, is there really any incentive to switch over to Py3?

The best thing to do in such a situation is to build something!

I decided to build a simple, asynchronous Trivial File Transfer Protocol server. The code is available on GitHub and PyPI - for all your trivial file sharing needs. This gave me the chance to really get a feel for Python 3, especially for the asyncio module. To time-box the project, I decided to implement RFCs 1350 and 2347-2349.

Asyncio

I'll start with asyncio since it determined the shape of the whole project. Asyncio is Python's way of working asynchronously. This means that your script can run multiple pieces of code at the same time in a single thread. "Same time"? Not really - multiple chunks of code are in the process of being executed; however, only one chunk is actually running at any single moment - until it stops executing and relinquishes control to another chunk.

This arrangement is efficient because code that does slow IO can pause, handing control back to the event loop, and resume once the IO operation completes. While this code is paused, other code can run - handling other requests, for example. This approach is great for scripts that handle a lot of network or disk IO - web applications, scrapers, etc. This is actually the way that NodeJS works. This approach is not so great for CPU-bound operations - these only benefit if the work can be split across multiple processors.

On top of this, writing asynchronous code is easier to reason about because it can be imagined as a sequence of code segments executed in a linear fashion. This is due to the fact that the programmer tells these code segments explicitly when to relinquish execution. This control, as far as I know, is absent from multi-threading or multi-processing.
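To make that concrete, here's a minimal sketch (using the async/await syntax from 3.5; not code from py3tftp) of two coroutines taking turns on a single thread - every await is an explicit point where a coroutine gives up control:

import asyncio

async def worker(name, delay):
    print(name, 'starting')
    # 'await' is the explicit hand-off point: this coroutine pauses
    # here and the event loop is free to run other coroutines.
    await asyncio.sleep(delay)
    print(name, 'done')

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(worker('a', 0.2), worker('b', 0.1)))
loop.close()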

Using multiple threads would likely be more efficient than asynchronous operations, but there are two big issues to consider:

  1. There's more multi-threaded code due to the necessity of tackling shared state between threads.
  2. This code is harder to write and maintain.

The combined cost of these is big enough that the Twisted library was born. If you want to read a more in-depth explanation, check out this excellent post: Unyielding.

The asyncio package exposes an interface (Transports and Protocols) that will be instantly familiar to anyone who has used Twisted to write networked applications. This is exactly the route I went when writing py3tftp. The script runs the main server coroutine on start up. Whenever the main coroutine gets a request, it schedules a new coroutine that handles everything related to that request until a file is transferred, or the connection times out or errors out, after which the coroutine is cleaned up. Because the networking is asynchronous, this simple server can accept and service around 50 small requests on a first-gen Raspberry Pi mod. B.
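For a taste of that interface, here's a minimal sketch of a UDP echo server in the Transports and Protocols style (a toy, not py3tftp's actual protocol class; the port number is arbitrary):

import asyncio

class EchoProtocol(asyncio.DatagramProtocol):
    def connection_made(self, transport):
        # The transport is the handle used to send datagrams back out.
        self.transport = transport

    def datagram_received(self, data, addr):
        # The event loop calls this for every datagram that arrives.
        self.transport.sendto(data, addr)

loop = asyncio.get_event_loop()
endpoint = loop.create_datagram_endpoint(
    EchoProtocol, local_addr=('127.0.0.1', 9069))
transport, protocol = loop.run_until_complete(endpoint)
loop.run_forever()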

If I were to go back in time, I would read the whole of section 18.5 of the Python docs and actually run the examples - the documentation goes a long way in explaining this new concept, especially the sections on coroutines, futures, and tasks. If something isn't covered in the documentation, you can download Python's source code and rummage through the asyncio package until you understand what's going on.

The Small Stuff

The things that I want to talk about next are conceptually "smaller" than the asyncio package, but they are just as crucial. To me, these changes represent a step forward for the language itself - a response to how everything around the language is changing - paradigms shift, newer platforms appear, old problems fade away to make room for new ones, etc.

Text and Bytes

Python 3 treats source code as UTF-8 and strings as unicode by default.

In the old days, you were able to get away with treating bytes and strings in a similar fashion, as long as you steered clear of non-ASCII text. If you needed to sprinkle internationalization on top of your app, you'd most likely start with Joel Spolsky's post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets and then, after some time and a lot of frustration, you would pin down how Python's .encode() and .decode() work. This, I believe, is the biggest hurdle in translating Python 2 code to Python 3 code - most Pythonistas never had to explicitly differentiate between text and bytes.
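The split is easy to demonstrate (any UTF-8 text will do; the Polish word below is just an example):

text = 'zażółć'                # str: a sequence of unicode code points
data = text.encode('utf-8')    # bytes: b'za\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
assert data.decode('utf-8') == text
# text + data raises TypeError - Python 3 refuses to silently mix the two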

That permissive approach might have worked in 2006, when roughly 17%, or a little more than a billion souls, were accessing the Internet. But that number reached 40%, or almost three billion souls, in 2014. The need to process unicode will only grow as more and more people get access to computers and the Internet. Not only has the end-user group grown more diverse, but so has the Python community.

These are the reasons why I think the whole bytes-n-text thing is a real big deal. How did it look in Py3tftp? Well, the TFTP world is mainly bytes, so I only had to mark every string literal as bytes with the b'' prefix. I also had to convert a part of each datagram, which is just a bunch of bytes, to an integer and vice-versa. These bytes also have to be big-endian because, you know, networks. Nothing could be easier:

msg = b'\x00\x65'
number = int.from_bytes(msg, byteorder='big') # = 101
reply = (number + 1).to_bytes(2, byteorder='big') # = b'\x00\x66'
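For comparison, the closest spelling in Python 2 went through the struct module (still valid in Python 3, just less direct):

import struct

msg = b'\x00\x65'
number, = struct.unpack('!H', msg)    # '!H' = big-endian unsigned short; number == 101
reply = struct.pack('!H', number + 1) # b'\x00\x66'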

Variable Unpacking

PEP 448 (available in 3.5+) extends the way we can unpack lists and dictionaries.

It used to be just this:

opts = {'user': 'BillyBob', 'password': '*****'}
authenticate(True, **opts) # authenticate(True, user='BillyBob', password='*****')

def i_am_function(*args):
    for arg in args:
        print(arg)  # process each arg

Get ready, this is going to be crazy:

opts = {'user': 'BillyBob'}
extra_opts = {'password': '*****'}
# multiple ** unpackings in a single call (positional args still come first):
authenticate(True, **opts, **extra_opts) # authenticate(True, user='BillyBob', password='*****')

# merging dictionaries:
merged_opts = {**opts, **extra_opts}

def totally_crazy(i):
    return (15, *range(i))

a = totally_crazy(5)
# a == (15, 0, 1, 2, 3, 4)
a, *b = totally_crazy(5)
# a == 15, b == [0, 1, 2, 3, 4]
a, b, *c = totally_crazy(5)
# a == 15, b == 0, c == [1, 2, 3, 4]
# etc.

This small change makes it easier to work with iterables of all kinds. Instead of being forced to either unpack iterables manually or feed them to a function one by one - we can now, in a most natural manner, pass and return as many iterables as we wish.

I used this functionality only once in Py3tftp, to merge dictionaries of options, i.e. default options should be overwritten by user options, which in turn should be overwritten by TFTP options:

self.opts = {**self.default_opts, **timeout_opts, **self.r_opts}
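On key collisions the right-most dictionary wins, which gives exactly the override order described above. A toy example with made-up option values:

defaults = {'timeout': 5, 'blksize': 512}
user_opts = {'timeout': 30}
merged = {**defaults, **user_opts}
# merged == {'timeout': 30, 'blksize': 512} - user_opts overrode the default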

Pattern matching was one of the biggest joys to discover in Elixir and Clojure. Those languages make the programmer use a lot of collections (think: lists and dicts) and pattern matching makes working with them a breeze. I'm happy that a certain form of it has made its way into Python in the form of unpacking.

Caching

Every Python developer I know has sooner or later been bitten by an undead *.pyc file. *.pyc files contain compiled Python bytecode, which can be executed directly. This contrasts with Python source code, which has to be compiled before execution. To see the difference, try running a Python project on an under-clocked Raspberry Pi and compare the start-up time with and without *.pyc files.

Sometimes, your *.pyc files won't get updated. I've never looked into why this happens, but as I mentioned earlier, it has happened to me and every Python developer that I know.

It usually starts with a test failing. No amount of moving code around seems to affect whatever is causing the test to fail. Pdb does nothing to help. After some time, you break down and call your colleague for help. They ask you to run ls -la and you notice that some *.pyc files' timestamps are from an hour ago. You delete the *.pyc files, execute your scripts, and all the tests are suddenly green. Your team lead tasers you while you laugh maniacally.

Now all that evil stuff is shoved into a __pycache__ directory. Much easier to keep an eye on it there. Apparently, this change had a different rationale, but we all know that it saves thousands of developer-hours, not to mention that it keeps psych wards empty.
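The compiled files in __pycache__ are also tagged with the interpreter that built them, so you can ask importlib where a module's bytecode cache lives (the exact tag in the output depends on your interpreter version):

import importlib.util

print(importlib.util.cache_from_source('module.py'))
# e.g. __pycache__/module.cpython-35.pyc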

Wrapping Up

These are the biggest surprises that Python 3 sprung on me in the course of this small project. There is a lot more left to check out and assimilate - a lot more stuff from the asyncio package, more stdlib improvements (reading the changelogs from every update brings a smile to my face), the possibly amazing type hinting functionality, and much more that I simply didn't have the chance to take out for a spin.

I'm not saying that Python 3 is the One and Only True Path for everyone and every project, but I feel that the changes it brings are bigger and more important than the community thinks and that these changes are what will keep the language strong and relevant in the years to come.

More Python 3 Awesomeness

All of this extra info comes from replies to this r/Python thread.

User mmpix wrote:

One IMO underrated improvement is that since 3.4 the gc is able to collect objects with cyclic references despite them implementing __del__. This eliminates an important class of memory leaks.

More info in PEP 442

Ah, I forgot to mention another "technicality" also due to Antoine Pitrou: he not only improved the GC but he rewrote the GIL. So we have a new GIL since 3.2:

Among the objectives were more predictable switching intervals and reduced overhead due to lock contention and the number of ensuing system calls.

Check this presentation by David Beazley.

To the OP: it would be nice if you add something about these internals enhancements to your post, they need more press.

The presentation
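To see the __del__ fix mmpix mentions in action, here's a minimal sketch - before 3.4, a cycle like this would get stuck in gc.garbage instead of being freed:

import gc

class Leaky:
    def __del__(self):
        # A finalizer - pre-3.4, this made cycles uncollectable.
        pass

a, b = Leaky(), Leaky()
a.partner, b.partner = b, a   # reference cycle between two finalized objects
del a, b
gc.collect()                  # on 3.4+ the cycle is collected and finalized
print(gc.garbage)             # [] - nothing left stranded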

User FFX01 wrote about how much better Python 3 is at speed and memory management along with some benchmarks:

Reddit comment thread
