Belt & Suspenders: Why put a .pex file inside a Docker container?

November 19, 2016

Recently I have been talking about deploying Python, and some people had the reasonable question: if a .pex file is used for isolating dependencies, and a Docker container is used for isolating dependencies, why use both? Isn’t it redundant?

Why use containers?

I really like glyph’s explanation for containers: they isolate not just the filesystem stack but the processes and the network, giving a lot of the power that UNIX was supposed to give but missed out on. Containers isolate the file system, making it easier for code to write/read files from known locations. For example, its log files will be carefully segregated, and can be moved to arbitrary places by the operator without touching the code.

The other part is that none of the reasonable packaging options bundles a Python interpreter. This means that a pex file would still have to be tested with multiple Pythons, and perhaps do some checking at start-up that it is using the right interpreter. If PyPy is the right choice, that is a choice the operator has to make and implement, and the container image is where it gets baked in.

Why use pex?

Containers are an easy sell. They are right on the hype train. But if we use containers, what use is pex?

In order to explain, it is worthwhile to compare a correctly built runtime container that does not use pex with one that does (parts that are not relevant have been removed):

# Without pex:
ADD wheelhouse /wheelhouse
RUN . /appenv/bin/activate; \
    pip install --no-index -f wheelhouse DeployMe

# With pex:
COPY twist.pex /

Note that in the first option, we are left with extra gunk in the /wheelhouse directory. Note also that we still have to have pip and virtualenv installed in the runtime container. Pex files bring the double-dutch philosophy to its logical conclusion: do even more of the build on the builder side, do even less of it on the runtime side.


Deploying with Twisted: Saying Hello

November 13, 2016

Too Long: Didn’t Read

The build script builds a Docker container, moshez/sayhello:MY_VERSION, tagged with the version you pass to it.

$ ./build MY_VERSION
$ docker run --rm -it --publish 8080:8080 \
  moshez/sayhello:MY_VERSION --port 8080

There will be a simple application running on port 8080.

If you own the domain name hello.example.com, you can point it at a machine you control, and then run the following on that machine:

$ docker run --rm -it --publish 443:443 \
  moshez/sayhello:MY_VERSION --port le:/srv/www/certs:tcp:443 \
  --empty-file /srv/www/certs/hello.example.com.pem

It will result in the same application running on a secure web site: https://hello.example.com.


All source code is available on GitHub.

Introduction

WSGI has been a successful standard. Very successful. It allows people to write Python applications using many frameworks (Django, Pyramid, Flask and Bottle, to name but a few) and deploy using many different servers (uwsgi, gunicorn and Apache).

Twisted makes a good WSGI container. Like Gunicorn, it is pure Python, simplifying deployment. Like Apache, it sports a production-grade web server that does not need a front end.

Modern web applications tend to be complex beasts. In order to be trusted by users, they need to have TLS support, signed by a trusted CA. They also need to transmit a lot of static resources — images, CSS and JavaScript files, even if all HTML is dynamically generated. Deploying them often requires complicated set-ups.

Containers

Container images allow us to package an application together with all of its dependencies. They often tempt people into using them as the configuration management layer as well. However, the Dockerfile is a challenging language in which to write big parts of an application. People writing WSGI applications probably think Python is a good programming language; the more of the application logic is in Python, the easier it is for a WSGI-based team to master it.

PEX

Pex is a way to package several Python “distributions” (sometimes informally called “Packages”, the things that are hosted by PyPI) into one file, optionally with an entry-point so that running the file will call a pre-defined function. It can take an explicit list of wheels but can also, as in our example here, take arguments compatible with the ones pip takes. The best practice is to give it a list of wheels, and build the wheels with pip wheel.

pkg_resources

The pkg_resources module allows access to files packaged in a distribution in a way that is agnostic to how the distribution was deployed. Specifically, it is possible to install a distribution as a zipped directory instead of unpacking it into site-packages. The pex format relies on this feature of Python, so using pkg_resources to access data files is important in order not to break pex compatibility.
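
For example, here is a minimal sketch (the names are illustrative, not the actual sayhello code) of reading the bundled HTML file via pkg_resources; it works whether the distribution is unpacked into site-packages or zipped inside a pex file:

import pkg_resources

# Load data/index.html from the installed "sayhello" distribution.
# Unlike building a path from __file__, this also works when the
# package is zipped inside a .pex file.
index_html = pkg_resources.resource_string('sayhello', 'data/index.html')
print(index_html.decode('utf-8'))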

Let’s Encrypt

Let’s Encrypt is a free, automated, and open Certificate Authority. It has invented the ACME protocol in order to make getting secure certificates a simple operation. txacme is an implementation of an ACME client, i.e., something that asks for certificates, for Twisted applications. It uses the server endpoint plugin mechanism in order to allow any application that builds a listening endpoint to support ACME.
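
As a sketch of what this buys us, the same endpoint description used on the command line above can be turned into a listening endpoint from Python (assuming txacme is installed, so that the le: prefix is registered):

from twisted.internet import reactor
from twisted.internet.endpoints import serverFromString

# 'le:' is provided by txacme; /srv/www/certs is where the certificates
# are kept, and tcp:443 is the underlying listening endpoint.
endpoint = serverFromString(reactor, 'le:/srv/www/certs:tcp:443')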

Twist

The twist command-line tool allows running any Twisted service plugin. Service plugins allow us to configure a service using Python, a pretty nifty language, while still allowing specific customizations at the point of use via command-line parameters.
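
As an illustration, here is a minimal sketch of what such a plugin can look like. This is a hypothetical example, not the actual sayhello plugin, and it assumes a WSGI application object at sayhello.wsgi.app:

from zope.interface import implementer

from twisted.application import strports
from twisted.application.service import IServiceMaker
from twisted.plugin import IPlugin
from twisted.python import usage
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource


class Options(usage.Options):
    optParameters = [['port', 'p', 'tcp:8080', 'Endpoint description to listen on']]


@implementer(IServiceMaker, IPlugin)
class HelloServiceMaker(object):
    tapname = 'hello'
    description = 'Greet visitors over HTTP'
    options = Options

    def makeService(self, options):
        from twisted.internet import reactor
        from sayhello.wsgi import app  # hypothetical WSGI application object
        resource = WSGIResource(reactor, reactor.getThreadPool(), app)
        return strports.service(options['port'], Site(resource))


serviceMaker = HelloServiceMaker()

Dropped into a twisted/plugins/ directory, this would make twist hello --port tcp:8080 start the service.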

Putting it all together

Our setup.py file defines a distribution called sayhello. In it, we have three parts:

  • src/sayhello/wsgi.py: A simple Flask-based WSGI application
  • src/sayhello/data/index.html: an HTML file meant to serve as the root
  • src/twisted/plugins/sayhello.py: A Twist plugin

There is also some build infrastructure:

  • build is a Python script to run the build.
  • build.docker is a Dockerfile designed to build pex files, not to run as a production server.
  • run.docker is a Dockerfile designed for the production container.

Note that build does not push the resulting container to DockerHub.

Credits

Glyph Lefkowitz has inspired me in his blog about how to build efficient containers. He has also spoken about how deploying applications should be no more than one file copy.

Tristan Seligmann has written txacme.

Amber “Hawkowl” Brown has written “twist”, which is much better at running Twisted-based services than the older “twistd”.

Of course, all mistakes and problems here are completely my responsibility.


Twisted as Your WSGI Container

October 29, 2016

Introduction

WSGI is a great standard. It has been amazingly successful. In order to describe how successful it is, let me describe life before WSGI. In the beginning, CGI existed. CGI was just a standard for how a web server can run a process — what environment variables to pass, and so forth. In order to write a web-based application, people would write programs that complied with CGI. At that time, Apache’s only competition was commercial web servers, and CGI allowed you to write applications that ran on both. However, starting a process for each request was slow and wasteful.

For Python applications, people wrote mod_python for Apache. It allowed people to write Python programs that ran inside the Apache process and directly used Apache’s API to access the HTTP request details. Since Apache was the only server that mattered, that was fine. However, as more servers arrived, a standard was needed. mod_wsgi was originally a way to run the same Django application on many servers. However, as a side effect, it also allowed the second wave of Python web application frameworks — Paste, Flask and more — to have something to run on. In order to make life easier, Python included wsgiref, a module implementing a single-threaded, single-process, blocking web server that speaks the WSGI protocol.

Development

Some web frameworks come with their own development web servers that will run their WSGI apps. Some use wsgiref. Almost always, those options are carefully documented as “just for development use, do not use in production.” Wouldn’t it be nice to use the same WSGI container in both development and production, eliminating one potential source of bugs that show up in one environment but not the other?

For ease of use, it should probably be written in Python. Luckily, “twist web --wsgi” is just such a server. In order to show-case how easy it is to use, twist-wsgi shows commands to run Django, Flask, Pyramid and Bottle apps as easily as the frameworks’ built-in development servers.

Production

In production, using the Twisted WSGI container comes with several advantages. Production-grade SSL support using pyOpenSSL and cryptography allows the elimination of “SSL terminators”, removing one moving piece from the equation. With third-party extensions like txsni and txacme, it allows modern support for “easy SSL”. The built-in HTTP/2 support, starting with Twisted 16.3, allows better support for parallel requests from modern browsers.

The Twisted web server also has a built-in static file server, allowing the elimination of a “front-end” web server that serves static files itself and forwards dynamic requests to the application server.
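
To make that concrete, here is a minimal sketch (module names and file-system paths are hypothetical) of serving a static directory and a WSGI application from one Twisted process:

from twisted.internet import endpoints, reactor
from twisted.web.resource import Resource
from twisted.web.server import Site
from twisted.web.static import File
from twisted.web.wsgi import WSGIResource

from myapp.wsgi import application  # hypothetical WSGI callable

wsgi = WSGIResource(reactor, reactor.getThreadPool(), application)


class Root(Resource):
    def getChild(self, path, request):
        if path == b'static':
            # Serve /static/* straight from disk.
            return File('/srv/www/static')
        # Anything else goes to the WSGI application; put the consumed
        # path segment back so the app sees the full path.
        request.prepath.pop()
        request.postpath.insert(0, path)
        return wsgi


endpoints.serverFromString(reactor, 'tcp:8080').listen(Site(Root()))
reactor.run()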

Twisted is also not limited to web serving. As a full-stack network framework, it has support for scheduling repeated tasks, running processes and speaking other protocols (for example, a side-channel for online control). Last but not least, all of this is integrated in one language: Python. As an example of an integrated solution, the Frankensteinian monster plugin show-cases a combo web application combining four frameworks, a static file server and a scheduled task that updates a file.

While the goal is not to encourage using four web frameworks and a couple of side services in order to greet the user and tell them what time it is, it is nice that if the need strikes this can all be integrated into one process in one language, without the need to remember how to spell “every 4 seconds” in cron or how to quote a string in the nginx configuration file.


Post-Object-Oriented Design

September 15, 2016

In the beginning came the so-called “procedural” style. Data was data, and behavior, implemented as procedures, was a separate thing. Object-oriented design is the idea of bundling data and behavior into a single thing, usually called a “class”. In return for having to tie the two together, the thought went, we would get polymorphism.

Polymorphism is pretty neat. We send different objects the same message, for example, “turn yourself into a string”, and they respond appropriately — each according to their uniquely defined behavior.

But what if we could separate the data and behavior, and still get polymorphism? This is the idea behind post-object-oriented design.

In Python, we achieve this with two external packages. One is the “attr” package. It provides a concise way to define bundles of data that still exhibit the minimum amount of behavior we do want: initialization, string representation, hashing and more.

The other is the “singledispatch” package (available as functools.singledispatch in Python 3.4+).

import attr
import singledispatch

In order to be specific, we imagine a simple protocol. The low-level details of the protocol do not concern us, but we assume some lower-level parsing allows us to communicate in dictionaries back and forth (perhaps serialized/deserialized using JSON).

Our protocol is one to send changes to a map. The only two messages are “set”, to set a key to a given value, and “delete”, to delete a key.

messages = (
{
    'type': 'set',
    'key': 'language',
    'value': 'python'
},
{
    'type': 'delete',
    'key': 'human'
}
)

We want to represent those as attr-based classes.

@attr.s
class Set(object):
    key = attr.ib()
    value = attr.ib()

@attr.s
class Delete(object):
    key = attr.ib()

print(Set(key='language', value='python'))
print(Delete(key='human'))
Set(key='language', value='python')
Delete(key='human')

When incoming dictionaries arrive, we want to convert them to the logical classes. In this example, the code could not be simpler (mostly because the protocol is simple).

def from_dict(dct):
    tp = dct.pop('type')
    name_to_klass = dict(set=Set, delete=Delete)
    try:
        klass = name_to_klass[tp]
    except KeyError:
        raise ValueError('unknown type', tp)
    return klass(**dct)

Note how we take advantage of the fact that attr-based classes accept correctly-named keyword arguments.

from_dict(dict(type='set', key='name', value='myname')), from_dict(dict(type='delete', key='data'))
(Set(key='name', value='myname'), Delete(key='data'))

But this was easy! There was no need for polymorphism: we always get one type in (dictionaries), and we consult a mapping to decide which type to produce.

However, for serialization, we do need polymorphism. Enter our second tool — the singledispatch package. The default function is equivalent to a method defined on “object”: the ultimate super-class. Since we do not want to serialize generic objects, our default implementation errors out.

@singledispatch.singledispatch
def to_dict(obj):
    raise TypeError("cannot serialize", obj)

Now, we implement the actual serializers. The names of the functions are not important. To emphasize they should not be used directly, we make them “private” by prepending an underscore.

@to_dict.register(Set)
def _to_dict_set(st):
    return dict(type='set', key=st.key, value=st.value)

@to_dict.register(Delete)
def _to_dict_delete(dlt):
    return dict(type='delete', key=dlt.key)

Indeed, we do not call them directly.

print(to_dict(Set(key='k', value='v')))
print(to_dict(Delete(key='kk')))
{'type': 'set', 'value': 'v', 'key': 'k'}
{'type': 'delete', 'key': 'kk'}

However, arbitrary objects cannot be serialized.

try:
    to_dict(object())
except TypeError as e:
    print(e)
('cannot serialize', <object object at 0x7fbdb254ac60>)

Now that the structure of adding such an “external method” has been shown, another example can be given: “act on”: applying the changes requested to an in-memory map.

@singledispatch.singledispatch
def act_on(command, d):
    raise TypeError("Cannot act on", command)

@act_on.register(Set)
def act_on_set(st, d):
    d[st.key] = st.value

@act_on.register(Delete)
def act_on_delete(dlt, d):
    del d[dlt.key]

d = {}
act_on(Set(key='name', value='woohoo'), d)
print("After setting")
print(d)
act_on(Delete(key='name'), d)
print("After deleting")
print(d)
After setting
{'name': 'woohoo'}
After deleting
{}

In this case, we kept the functionality “near” the code. However, note that the functionality could be implemented in a different module: these functions, even though they are polymorphic, follow Python namespace rules. This is useful: several different modules could implement “act_on”: for example, an in-memory map (as we defined above), a module using Redis or a module using a SQL database.
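
For instance, a hypothetical redis_act_on module (not part of the code above) could define its own generic function for the same command classes:

# redis_act_on.py -- an alternative, namespace-separated implementation
# of "act on" that applies the same commands to Redis instead of a dict.
import redis
import singledispatch

from commands import Set, Delete  # hypothetical module holding the classes above


@singledispatch.singledispatch
def act_on(command, connection):
    raise TypeError("Cannot act on", command)


@act_on.register(Set)
def _act_on_set(st, connection):
    connection.set(st.key, st.value)


@act_on.register(Delete)
def _act_on_delete(dlt, connection):
    connection.delete(dlt.key)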

Actual methods are not completely obsolete. It would still be best to make methods do anything that requires private attribute access. In simple cases, as above, there is no difference between the public interface and the implementation, so external functions work just as well.


Time Series Data

August 25, 2016

When operating computers, we are often exposed to so-called “time series”. Whether it is database latency, page fault rate or total memory used, these are all exposed as numbers that are usually sampled at frequent intervals.

However, computer engineers are not the only ones exposed to such data. It is worthwhile to know what other disciplines are exposed to such data, and what they do with it. “Earth sciences” (geology, climate, etc.) have a lot of numbers, and often need to analyze trends and make predictions. Sometimes these predictions have, literally, billions of dollars’ worth of decisions hinging on them. It is worthwhile to read some of the textbooks for students of those disciplines to see how to approach those series.

Another discipline that needs to visually inspect time series data is medicine. EKG data is often vital for analyzing patients’ health — and especially when compared to their historical records. For that, the data needs to be saved. A lot of EKG research has been done on how to compress numerical data while still keeping it “visually the same”. While the research on that is not as rigorous, or as settled, as the trend analysis in geology, it is still useful to look into. Indeed, even the basics are already better than so-called “roll-ups”, which preserve none of the visual distinctiveness of the data, flattening peaks and filling valleys while keeping a “standard deviation” score that is not as helpful as is usually hoped.


Extension API: An exercise in a negative case study

August 20, 2016

I was idly contemplating implementing a new Jupyter kernel. Luckily, they try to provide facilities to make this easier. Unfortunately, they made a number of suboptimal choices in their API. Fortunately, those mistakes are both common and easily avoidable.

Subclassing as API

They suggest subclassing IPython.kernel.zmq.kernelbase.Kernel. Errr…not “suggest”. It is a “required step”. The reason is probably that this class already implements 21 methods. When you subclass, make sure not to use any of these names, or things will break randomly. If you do not want to subclass, good luck figuring out what assumptions the system makes about these 21 methods, because there is no interface or even prose documentation.

The return statement in their example is particularly illuminating:

        return {'status': 'ok',
                # The base class increments the execution count
                'execution_count': self.execution_count,
                'payload': [],
                'user_expressions': {},
               }

Note the comment “base class increments the execution count”. This is a classic code smell: this seems like it would be needed in every single overrider, which means it really belongs in the helper class, not in every kernel.

None

The signature for the example do_execute is:

    def do_execute(self, code, silent, store_history=True, 
                   user_expressions=None,
                   allow_stdin=False):

Of course, this means that user_expressions will sometimes be a dictionary and sometimes None. It is likely that the code will be written to anticipate one or the other, and will fail in interesting ways if None is actually sent.

Optional Overrides

As described in this section, there are also ways to make the kernel better with optional overrides. The convention used, which is nowhere explained, is that do_ methods are the ones to override to make a better kernel. Nowhere is it explained why there is no default history implementation, or where to get one, or why a simple, stupid implementation is wrong.

Dictionaries


All overrides return dictionaries, which get serialized directly into the underlying communication platform. This is a poor abstraction, especially when the documentation is direct links to the underlying protocol. When wrapping a protocol, it is much nicer to use an Interface as the documentation of what is assumed — and define an attr.s-based class to allow returning something which is automatically the correct type, and will fail in nice ways if a parameter is forgotten.
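
As an illustration, an attr.s-based reply type (the class and field names here are hypothetical, not part of the Jupyter API) could replace the raw dictionary:

import attr


@attr.s
class ExecuteReply(object):
    # A structured stand-in for the do_execute reply dictionary.
    status = attr.ib()
    execution_count = attr.ib()
    payload = attr.ib(default=attr.Factory(list))
    user_expressions = attr.ib(default=attr.Factory(dict))


# Forgetting a required field fails immediately and loudly,
# instead of producing a silently malformed message.
reply = ExecuteReply(status='ok', execution_count=1)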

Summary

If you are providing an API, here are a few positive lessons based on the issues above:

  • You should expect interfaces, not subclasses. Use composition, not subclassing. If you want to provide a default implementation in composition, just check for a return of NotImplemented, and use the default.
  • Do the work of abstracting your customers from the need to use dictionaries and unwrap automatically. Use attr.s to avoid customer boilerplate.
  • Send all arguments. Isolate your customers from the need to come up with sane defaults.
  • As much as possible, try to have your interfaces be side-effect free. Instead of asking the customer to directly make a change, allow the customer to make the “needed change” be part of the return type. This will let the customers test their class much more easily.

__name__ == __main__ considered harmful

June 7, 2016

Every single Python tutorial shows the pattern of

# define functions, classes,
# etc.

if __name__ == '__main__':
    main()

This is not a good pattern. If your code is not going to be in a Python module, there is no reason not to unconditionally call ‘main()’ at the bottom. So this code will only be used in modules — where it leads to unpredictable effects. If this module is imported as ‘foo’, then the identity of ‘foo.something’ and ‘__main__.something’ will be different, even though they share code.

This leads to hilarious effects like @cache decorators not doing what they are supposed to, parallel registry lists and all kinds of other issues. Hilarious unless you spend a couple of hours debugging why ‘isinstance()’ is giving incorrect results.
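
To make the failure mode concrete, here is a small illustrative example (the file name is made up):

# foo.py
class Config(object):
    pass

if __name__ == '__main__':
    import foo
    # Running "python foo.py" defines this file twice: once as __main__
    # and once as foo. The two Config classes are different objects.
    print(Config is foo.Config)              # False
    print(isinstance(foo.Config(), Config))  # False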

If you want to write a main module, make sure it cannot be imported. In this case, reversed stupidity is intelligence — just reverse the idiom:

# at the top
if __name__ != '__main__':
    raise ImportError("this module cannot be imported")

This, of course, will mean that this module cannot be unit tested: therefore, any non-trivial code should go in a different module that this one imports. Because of this, it is easy to gravitate towards a package. In that case, put the code above in a module called ‘__main__.py’. This will lead to the following layout for a simple package:

PACKAGE_NAME/
             __init__.py
                 # Empty
             __main__.py
                 if __name__ != '__main__':
                     raise ImportError("this module cannot be imported")
                 from PACKAGE_NAME import api
                 api.main()
             api.py
                 # Actual code
             test_api.py
                 import unittest
                 # Testing code

And then, when executing:

$ python -m PACKAGE_NAME arg1 arg2 arg3

This will work in any environment where the package is on sys.path: in particular, in any virtualenv where it was pip-installed. Unless a short command line is important, this allows skipping the creation of a console script in setup.py completely, and letting “python -m” be the official CLI. Since pex supports setting a module as an entry point, if this tool needs to be deployed in other environments, it is easy to package it into a tool that will execute the script:

$ pex . --entry-point SOME_PACKAGE --output-file toolname