How does Twisted do away with the thread problems in the context of network connections? Twisted runs a main loop called the reactor, which schedules the callbacks. It is the coach of our prior comparison. The reactor's scheduling decisions derive directly from the availability of data on the supervised file descriptors. The reactor is twofold:
it is a wrapper around a specialised system call which monitors events on an array of sockets. Instead of supervising sockets from userland, Twisted delegates this hard work to the kernel, via the best system call available on the platform: epoll on Linux, kqueue on BSD, etc.
In a nutshell, these system calls return either after a timeout or after the reception of data on one of the sockets. The system call returns an array of the events received, for each supervised socket[1].
It is a big bonus for a developer to be able to leverage efficient, advanced system calls on diverse operating systems with the same code. Another bonus is the delegation of the concurrent supervision of the sockets to the kernel. As the kernel offers to do it, why should developers re-invent the wheel in userland?
the reactor maintains a list of Twisted Protocol instances. In Twisted, a Protocol serves many purposes; in particular, it holds a reference to a socket supervised by epoll and a method dataReceived(). When epoll returns and presents an array of events for each socket, the reactor dutifully runs the dataReceived() method of the protocol associated with the socket.
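The two roles above can be sketched with Python's standard selectors module, which, like the reactor, picks the best multiplexing system call available on the platform. This is only an illustration and not Twisted's implementation: it is written for Python 3, and the names ToyProtocol, run_once() and demo() are hypothetical. Local socket pairs stand in for real network connections.

```python
import selectors
import socket


class ToyProtocol:
    """Hypothetical stand-in for a Twisted Protocol: a socket plus dataReceived()."""

    def __init__(self, sock):
        self.transport = sock
        self.received = []

    def dataReceived(self, data):
        self.received.append(data)


def run_once(selector, timeout=1.0):
    """One iteration of a toy reactor loop: wait in the kernel, then dispatch."""
    dispatched = 0
    # select() returns after the timeout, or as soon as a socket is readable
    for key, _events in selector.select(timeout):
        protocol = key.data                # the protocol registered with this socket
        data = key.fileobj.recv(4096)
        if data:
            protocol.dataReceived(data)
            dispatched += 1
    return dispatched


def demo():
    selector = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD, ...
    protocols, writers = [], []
    for _ in range(3):
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        protocol = ToyProtocol(reader)
        selector.register(reader, selectors.EVENT_READ, data=protocol)
        protocols.append(protocol)
        writers.append(writer)

    # Only the first and the third sockets receive data
    writers[0].sendall(b"hello")
    writers[2].sendall(b"world")
    dispatched = run_once(selector)

    results = [b"".join(p.received) for p in protocols]
    for protocol, writer in zip(protocols, writers):
        selector.unregister(protocol.transport)
        protocol.transport.close()
        writer.close()
    selector.close()
    return dispatched, results
```

Calling demo() performs a single loop iteration: the kernel reports which sockets are readable, and only the protocols attached to those sockets see their dataReceived() method called. A real reactor repeats this iteration until it is stopped.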
The reactor is the runtime hub of the Twisted framework: it handles the network connections and triggers the processing of received data as soon as it arrives, by calling specific methods of the Protocol associated with the socket. Let’s focus on a single page download, first, with the sequential urlopen() function:
Here are the corresponding steps of how Twisted operates with the getPage() function:
getPage() parses the input URL, formats the HTTP request string, and uses the reactor.connectTCP() method to queue a socket creation and monitoring request with the reactor. The arguments of connectTCP() are a host, a port and an instance of the HTTPGetClient class, deriving from the Protocol class.
connectTCP() transparently inserts a DNS request if the host is a domain name and not an IP address. This makes the HTTP request conditional on the availability of the IP address, in a non-blocking manner,
getPage() returns a deferred, a slot that the developer must fill with a function which will be executed when the HTTP reply arrives (more on the deferred in the next section). This function should expect the HTML body of the response as the argument,
the reactor is run: for each queued Protocol object, the reactor opens a socket, copies the corresponding file descriptor into the transport attribute of the Protocol instance, and puts the socket under supervision.
The reactor calls the connectionMade() method of the Protocol instance which, in the case of getPage(), writes the formatted HTTP request to the transport and returns immediately to the reactor loop,
when the reactor detects the reply bytes on the socket associated with transport, it calls the dataReceived() method of the associated Protocol which, in the case of getPage(), is written to separate the HTTP header from the HTML body.
Finally, the dataReceived() method for this protocol fires the developer callback attached to the instance deferred, with the HTML as the parameter.
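The steps above can be mimicked without any network, as a rough sketch. Everything here is hypothetical and written for Python 3 with the standard library only: ToyPageGetter is not the class used by getPage(), FakeTransport records bytes instead of writing to a socket, example.org is a placeholder host, and a plain function plays the role of the callback attached to the deferred. The toy also assumes that the reply carries a Content-Length header.

```python
class FakeTransport:
    """Hypothetical transport: records written bytes instead of using a socket."""

    def __init__(self):
        self.written = b""

    def write(self, data):
        self.written += data


class ToyPageGetter:
    """Toy protocol, NOT Twisted's class: sends a GET, then splits header and body."""

    def __init__(self, host, path, callback):
        self.host = host
        self.path = path
        self.callback = callback  # plays the role of the developer callback
        self.transport = None     # attached by the (toy) reactor
        self.buffer = b""
        self.done = False

    def connectionMade(self):
        # Write the formatted request and return immediately
        request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(self.path, self.host)
        self.transport.write(request.encode("ascii"))

    def dataReceived(self, data):
        # May be called several times, with chunks of arbitrary size
        self.buffer += data
        if self.done or b"\r\n\r\n" not in self.buffer:
            return
        header, body = self.buffer.split(b"\r\n\r\n", 1)
        length = self._content_length(header)
        if length is not None and len(body) >= length:
            self.done = True
            self.callback(body[:length])  # fire the callback with the HTML

    @staticmethod
    def _content_length(header):
        # Skip the status line, then look for the Content-Length header
        for line in header.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                return int(value.strip())
        return None


def simulate():
    pages = []
    protocol = ToyPageGetter("example.org", "/", pages.append)
    protocol.transport = FakeTransport()  # the reactor attaches the transport
    protocol.connectionMade()             # the reactor signals the connection
    reply = b"HTTP/1.0 200 OK\r\nContent-Length: 13\r\n\r\n<html></html>"
    for i in range(0, len(reply), 10):    # the reply arrives in small chunks
        protocol.dataReceived(reply[i:i + 10])
    return protocol.transport.written, pages
```

In simulate(), the loop plays the part of the reactor: it never waits inside the protocol, it only hands over whatever bytes are available. The callback runs once, when the body is complete.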
Additional abstractions such as the Factory interface are left out of this article to ease the learning curve; they are described in the official documentation. For our third problem, let’s compare two complete versions, one sequential and one concurrent, of a simple script which prints, 30 times, the HTML title of the http://twistedmatrix.com web site.
# trivial_sequential.py
from lxml.html import parse
from urllib2 import urlopen

url = 'http://twistedmatrix.com'

def title(url):
    print parse(urlopen(url)).xpath('/html/head/title')[0].text

# let's download the page 30 times
for i in range(30):
    title(url)
Note that in the following version, the Twisted main loop started by reactor.run() does not return until the reactor is stopped: a line of code placed below reactor.run() is not executed while the loop is running.
# trivial_deferred.py
from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage
from lxml.html import fromstring

url = 'http://twistedmatrix.com'

def getpage_callback(html):
    print fromstring(html).xpath('/html/head/title')[0].text

# 30 pending asynchronous network calls, and attachment of the callback
for i in range(30):
    getPage(url).addCallback(getpage_callback)

reactor.run()  # opens the network connections, and fires the callbacks
               # as soon as the replies are available
# Use Ctrl-C to terminate the script
Pay particular attention to the following blocking snippet:
html = urlopen(url)
print parse(html).xpath( ... )
which becomes, with Twisted primitives:
def getpage_callback(html):
    print fromstring(html).xpath( ... )
getPage(url).addCallback(getpage_callback)
It is indeed bewildering to realize that in Twisted, the calling function cannot manipulate the result of the request. Here is a longer form, which might seem simpler to read because the callback code is presented after the request code:
d = getPage(url)
def getpage_callback(html):
    print fromstring(html).xpath( ... )
d.addCallback(getpage_callback)
If you like neither of these styles, stay tuned: you will appreciate the section The yield keyword simplifies Twisted code. There is something unexplained in the last code snippet: what is the object to which d is bound? What does getPage() return if it’s not the server reply? You will find out in the next section.
[1] The C10K problem is a reference on servers concurrently handling ten thousand clients.