Mostly, getting stuff done fast in the web world requires doing things in parallel, and splitting the problem up so that this guy doesn’t have to wait for that guy. However, every now and then you bump into a problem where this just doesn’t work.

Imagine you have a bunch of sequential tickets to hand out. Tickets in the lucky dip, seats on the lifeboat, whatever. Normally, to make the process faster, you’d allocate N processes, make each of them responsible for 1/N of the tickets, and off you go. However in this case just imagine that it is important that the tickets are handed out in order. Since our N processes aren’t waiting for each other, one can lag behind the others, and that’s not fair.

There’s no way around this, we’re just going to have to have, somewhere in the stack, an atomically incrementing counter.

In the following discussion I’m assuming two things:

Python / Tornado

My first approach was to use Python and Tornado. The ticket counter just lives in a global variable ticket_number, and whenever there is a request it gets incremented and returned. Pretty simple:

``` {.sourceCode .python} #!/usr/bin/env python

import sys

import tornado.web import tornado.httpserver import tornado.template

ticket_number = 0

class TicketHandler(tornado.web.RequestHandler):

def get(self):

    global ticket_number

    ticket_number += 1

    self.finish("ticket number %010d" % ticket_number)

handlers = [ (r”/$”, TicketHandler), ]

def main(): portnum = int(sys.argv[1]) if len(sys.argv) > 1 else 8000 application = tornado.web.Application(handlers, template_path=”template”)
http_server = tornado.httpserver.HTTPServer(application) http_server.listen(portnum) tornado.ioloop.IOLoop.instance().start()

if name == ‘main’: main()

It's also pretty quick, right up until you try to handle thousands of
simultaneous connections, at which point performance tapers off rapidly.
The problem is, there's no way to expand it ... if you fire up another
process, it gets its own `ticket_number`.

FastCGI / Memcached

The amount of CPU time spent incrementing that counter is dwarfed by the
amount spent negotiating TCP handshakes, handling HTTP protocol options,
writing log files, etc. And really, if we want to get this thing moving
along, we shouldn't be doing all that in Python anyway.

[Nginx]( and [FastCGI]( to the
rescue. Nginx is a fast and reliable asynchronous HTTP server with
support for various protocols, including FastCGI. FastCGI is a very
simple improvement to the original
[CGI](, allowing
many requests to be handled by each process. It has good library support
in all sorts of languages, including

To get the best performance out of Nginx, we're going to end up with a
pool of worker processes running, about one per CPU core. And each of
those will be maintaining a pool of FastCGI daemons to handle its ticket
requests. So how do we get all these processes to handle one atomic
counter? One way would be to use shared memory and locks, and we might
get to that eventually, but for the moment I'd rather piggyback off
someone else's hard work.

[Memcached]( has done that hard work. It comes
with [increment/decrement
methods]( which allow a
memcache record to act as an atomic counter. It also has a rather simple
C API, so it is easy to write a small piece of C to join FastCGI to

``` {.sourceCode .c}
// Based on

#include <fcgi_stdio.h>
#include <stdlib.h>
#include <libmemcached/memcached.h>

char key[] = "tick";

int main(int argc, char **argv)
    const char *mc_config = "--SERVER= --BINARY-PROTOCOL";
    memcached_st *mc_handle = memcached(mc_config, strlen(mc_config));
    memcached_return rc;

    uint64_t ticket_number;

    while (FCGI_Accept() >= 0)   {

        rc = memcached_increment_with_initial(mc_handle, key, sizeof(key), 1, 1, 0x7FFFFFFF, &ticket_number);

        if (rc != MEMCACHED_SUCCESS) {
            printf("Status: 500 Server Error\r\n\r\nMemcache %d %s\r\n", rc, memcached_strerror(mc_handle, rc));
        } else {

            printf("Content-type: text/plain\r\n\r\nticket number %010d\r\n", ticket_number);

    return 0;

Using this method, my laptop was able to handle a hundred thousand connections in ten seconds, which isn’t too shoddy, and well and truly fast enough to handle my original problem.

Nginx Module

It turns out that (I think since I originally wrote this code and started on this article) someone has actually gone to the bother of writing a Nginx Memcached Module supporting INCR. I’ll have to do some benchmarking, but frankly this is likely to be a winner just because there are fewer moving parts.

Another option would be a Nginx module which uses shared memory and locking to maintain the counter across the nginx processes. This would be very fast, but would restrict the ticket issuer to a single machine.


Another approach I think is really interesting is to use Mongrel2 or the Nginx 0MQ module to just have the web server queue requests, and then set up a single process to “own” the ticket counter, answering queued requests as fast as it can.