__name__ == __main__ considered harmful

June 7, 2016

Every single Python tutorial shows the pattern of

# define functions, classes,
# etc.

if __name__ == '__main__':
    main()

This is not a good pattern. If your code is never going to be imported as a module, there is no reason not to unconditionally call ‘main()’ at the bottom. So this idiom only matters in modules — where it leads to unpredictable effects. If the file is run as a script and also imported as ‘foo’, Python executes it twice, and ‘foo.something’ and ‘__main__.something’ end up with different identities, even though they share code.

This leads to hilarious effects like @cache decorators not doing what they are supposed to, parallel registry lists and all kinds of other issues. Hilarious unless you spend a couple of hours debugging why ‘isinstance()’ is giving incorrect results.
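A minimal sketch of the failure mode; the module and class names here are illustrative, not from any real project:

# foo.py
class Message(object):
    pass

def main():
    import bar            # bar, in turn, imports foo
    bar.check(Message())  # this instance is of __main__.Message

if __name__ == '__main__':
    main()

# bar.py
import foo

def check(msg):
    # Prints False when foo.py is run as a script: the file is executed
    # twice, so foo.Message and __main__.Message are distinct classes.
    print(isinstance(msg, foo.Message))

Running ‘python foo.py’ prints False, even though only one class definition exists in the source.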

If you want to write a main module, make sure it cannot be imported. In this case, reversed stupidity is intelligence — just reverse the idiom:

# at the top
if __name__ != '__main__':
    raise ImportError("this module cannot be imported")

This, of course, will mean that this module cannot be unit tested: therefore, any non-trivial code should go in a different module that this one imports. Because of this, it is easy to gravitate towards a package. In that case, put the code above in a module called ‘__main__.py’. This will lead to the following layout for a simple package:

PACKAGE_NAME/
             __init__.py
                 # Empty
             __main__.py
                 if __name__ != '__main__':
                     raise ImportError("this module cannot be imported")
                 from PACKAGE_NAME import api
                 api.main()
             api.py
                 # Actual code
             test_api.py
                 import unittest
                 # Testing code

And then, when executing:

$ python -m PACKAGE_NAME arg1 arg2 arg3

This will work in any environment where the package is on sys.path: in particular, in any virtualenv where it was pip-installed. Unless a short command line is important, this allows skipping the creation of a console script in setup.py completely, and letting “python -m” be the official CLI. Since pex supports setting a module as an entry point, if the tool needs to be deployed in other environments, it is easy to package it into a tool that will execute the script:

$ pex . --entry-point PACKAGE_NAME --output-file toolname

Stop Feeling Guilty about Writing Classes

May 31, 2016

I’ve seen a few talks about “stop writing classes”. I think they have a point, but it is a little over-stated. All debates are bravery debates, so it is hard to say which problem is harder — but as a recovering class-writing-guiltoholic, let me admit this: I took this too far. I was avoiding classes when I shouldn’t have.

Classes are best kept small

It is true that classes are best kept small. Any “method” which is not really designed to be overridden is often best implemented as a function that accepts a “duck-type” (or a more formal interface).

This, of course, sometimes leads to…

If a class has only one public method, except __init__, it wants to be a function

Especially given functools.partial, there is no need to decide ahead of time which arguments are “static” and which are “dynamic”
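A hedged sketch of that conversion; the names are made up for illustration:

from functools import partial

# Instead of a Renderer class whose only public method is render()...
def render(template, context):
    return template.format(**context)

# ...bind the “static” argument only if and when that is convenient:
render_greeting = partial(render, "Hello, {name}!")
print(render_greeting({"name": "world"}))  # Hello, world!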

Classes are useful as data packets

This is the usual counter-point to the first two anti-class sentiments: a class which is nothing more than a bunch of attributes (a good example is the TCP envelope: source IP/target IP/source port/target port) is useful. Sure, the data could be passed around as dictionaries, but this does not make things better. Just use attrs — and it is often useful to write two more methods:

  • Some variant of “serialize”, an instance method that returns some lower-level format (dictionary, string, etc.)
  • Some variant of “deserialize”, a class method that takes the lower-level format above and returns a corresponding instance.
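A sketch of what such a class might look like with attrs; the class and field names are illustrative:

import attr

@attr.s
class TCPEnvelope(object):
    source_ip = attr.ib()
    source_port = attr.ib()
    target_ip = attr.ib()
    target_port = attr.ib()

    def serialize(self):
        # Lower-level format: a plain dictionary.
        return attr.asdict(self)

    @classmethod
    def deserialize(cls, data):
        # Rebuild an instance from the dictionary form.
        return cls(**data)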

It is perfectly ok to write this class rather than shipping dictionaries around. If nothing else, error messages will be a lot nicer. Please do not feel guilty.


Forking Skip-level Dependencies

May 6, 2016

I have recently found myself explaining this concept over and over to people, so I want to have a reference.

Most modern languages come with a “dependency manager” of sorts that helps manage the third-party libraries a given project uses. Rust has Cargo, Node.js has npm, Python has pip and so on. All of these do some things well and some things poorly. But one thing that can be done (well or poorly) is supporting the forking of skip-level dependencies.

In order to explain what I mean, here is an example: our project is PlanetLocator, a program to tell the user which direction they should face to see a planet. It depends on a library called Astronomy. Astronomy depends on Physics. Physics depends on Math.

  • PlanetLocator
    • Astronomy
      • Physics
        • Math

PlanetLocator is a SaaS, running on our servers. One day, we find Math has a critical bug, leading to a remote execution vulnerability. This is pretty bad, because it can be triggered via our application by simply asking PlanetLocator for the location of Mars at a specific date in the future. Luckily, the bug is simple — in Math’s definition of Pi, we need to add a couple of significant digits.

How easy is it to fix?

Well, assume PlanetLocator is written in Go, and not using any package manager. A typical import statement in PlanetLocator is

import "github.com/astronomy/astronomy"

A typical import statement in Astronomy is

import "github.com/physics/physics"

…and so on.

We fork Math over to “github.com/planetlocator/math” and fix the vulnerability. Now we have to fork Physics to use the forked Math, and Astronomy to use the forked Physics, and finally change all of our imports to use the forked Astronomy — even though Physics, Astronomy and PlanetLocator had no bugs!

Now assume, instead, we had used Python. In our requirements.txt file, we could put

git+https://github.com/planetlocator/math#egg=math

and voilà! Even though Physics’ “setup.py” says “install_requires=[‘math’]”, it will get our forked Math.
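For concreteness, a full requirements.txt along those lines might look like this (the version pin is illustrative):

# PlanetLocator's direct dependency:
astronomy==1.4
# Forked skip-level dependency, satisfying the 'math' requirement that
# Physics' install_requires would otherwise pull from PyPI:
git+https://github.com/planetlocator/math#egg=math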

When starting to use a new language/dependency manager, the first question to ask is: will it support me forking skip-level dependencies? Because every upstream maintainer is, effectively, an absent maintainer when rapid response is at stake (for any reason — I chose security above, but it might be beating the competition to a deadline, or fulfilling contractual obligations).


Use virtualenv

April 24, 2016

In a conversation recently with a friend, we agreed that “sometimes the instructions tell you to ‘sudo pip install’ something…which is good, because then you know to ignore them”.

There is never a need for “sudo pip install”, and doing it is an anti-pattern. Instead, all installation of packages should go into a virtualenv. The only exception is, of course, virtualenv itself (and, arguably, pip and wheel). I got enough questions about this that I wanted to write up an explanation of the how, the why, and why the counter-arguments are wrong.

What is virtualenv?

The documentation says:

virtualenv is a tool to create isolated Python environments.

The basic problem being addressed is one of dependencies and versions, and indirectly permissions. Imagine you have an application that needs version 1 of LibFoo, but another application requires version 2. How can you use both these applications? If you install everything into /usr/lib/python2.7/site-packages (or whatever your platform’s standard location is), it’s easy to end up in a situation where you unintentionally upgrade an application that shouldn’t be upgraded.

Or more generally, what if you want to install an application and leave it be? If an application works, any change in its libraries or the versions of those libraries can break the application.

The tl;dr is:

  • virtualenv allows not needing administrator privileges
  • virtualenv allows installing different versions of the same library
  • virtualenv allows installing an application and never accidentally updating a dependency

The first problem is the one the “sudo” comment addresses — but the real issues stem from the second and third: not using a virtual environment leads to the potential of conflicts and dependency hell.

How to use virtualenv?

Creating a virtual environment is easy:

$ virtualenv dirname

will create the directory, if it does not exist, and then create a virtual environment in it. It is possible to use it either activated or unactivated. Activating a virtual environment is done by

$ . dirname/bin/activate
(dirname)$

This will put python, as well as any script installed into the virtual environment using setuptools’ “console_scripts” option, on the command-execution path. The most important of those is pip, and so using pip will install into the virtual environment.

It is also possible to use a virtual environment without activating it, by directly calling dirname/bin/python or any other console script. Again, pip is an example of those, and is used for installing into the virtual environment.
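For example (“requests” is just an illustrative package here):

$ dirname/bin/pip install requests
$ dirname/bin/python -c "import requests; print(requests.__version__)"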

Installing tools for “general use”

I have seen a couple of times the argument that when installing tools for general use it makes sense to install them into the system install. I do not think that this is a reasonable exception for two reasons:

  • It still forces using root to install/upgrade those tools
  • It still runs into the dependency/conflict hell problems

There are a few good alternatives for this:

  • Create a (handful of) virtual environments, and add their bin directories to users’ paths.
  • Use “pex” to install Python tools in a way that isolates them even further from system dependencies.

Exploratory programming

People often use Python for exploratory programming. That’s great! Note that since pip 7, pip builds and caches wheels by default. This means that creating virtual environments is even cheaper: tearing down an environment and building a new one will not require recompilation. Because of that, it is easy to treat virtual environments as disposable, except for their configuration: activate a virtual environment, explore — and whenever things need to move into production, ‘pip freeze’ will allow easy recreation of the environment.
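A minimal sketch of that round trip; the directory names are illustrative:

(dirname)$ pip freeze > requirements.txt
$ virtualenv newenv
$ newenv/bin/pip install -r requirements.txt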


Weak references, caches and immutable objects

March 26, 2016

Consider the following situation:

  • We have a lot of immutable objects. For our purposes, an “immutable” object is one where “hash(…)” is defined.
  • We have a (pure) function that is fairly expensive to compute.
  • The objects get created and destroyed regularly.

We often would like to cache the function. As an example, consider a function to serialize an object — if the same object is serialized several times, we would like to avoid recomputing the serialization.

One naive solution would be to implement a cache:

cache = {}
def serialize(obj):
    if obj not in cache:
        cache[obj] = _really_serialize(obj)
    return cache[obj]

The problem with that is that the cache would keep references to our objects long after they should have died. We can try to use an LRU (for example, repoze.lru) so that only a certain number of objects would extend their lifetimes in that way, but the size of the LRU trades off space overhead against time overhead.
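For comparison, here is a sketch of the LRU approach using the standard library’s functools.lru_cache (Python 3.2 and later); ‘_really_serialize’ stands in for the expensive function:

import functools

def _really_serialize(obj):
    # Stand-in for the expensive, pure serialization function.
    return repr(obj)

@functools.lru_cache(maxsize=1024)
def serialize(obj):
    # Up to maxsize objects stay alive through the cache's strong
    # references -- exactly the space/time trade-off described above.
    return _really_serialize(obj)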

An alternative is to use weak references. Weak references are references that do not keep objects from being collected. There are several ways to use weak references, but here one is ideally suited:

import weakref
cache = weakref.WeakKeyDictionary()
def serialize(obj):
    if obj not in cache:
        cache[obj] = _really_serialize(obj)
    return cache[obj]

Note that this is the same code as before — except that the cache is a weak key dictionary. A weak key dictionary keeps weak references to the keys, but strong references to the value. When a key is garbage collected, the entry in the dictionary disappears.

>>> import weakref
>>> a=weakref.WeakKeyDictionary()
>>> fs = frozenset([1,2,3])
>>> a[fs] = "three objects"
>>> print a[fs]
three objects
>>> len(a)
1
>>> fs = None
>>> len(a)
0

Learn Public Speaking: The Heart-Attack Way

February 11, 2016

I know there are all kinds of organizations, and classes, that teach public speaking. But if you want to learn how to do public speaking the way I did, it’s pretty easy. Get a bunch of friends together, who are all interested in improving their skills. Meet once a week.

The rules are this: one person gets to give a talk with no more props than a marker and a whiteboard. If it’s boring, or they hesitate, another person can butt in, say “boring” and give a talk about something else. Talks are to be prepared somewhat in advance, and to be about five minutes long.

This is scary. You stand there, at the board, never knowing when someone will decide you’re boring. You try to entertain the audience. You try to keep their interest. You lose your point for 10 seconds…and you’re out. It’s the fight club of public speaking. After a few weeks of this, you’ll give flowing talks, and nothing will ever scare you again about public speaking — you’ll laugh as you realize that nobody is going to kick you out.

Yeah, Israel was a good training ground for speaking.


Docker: Are we there yet?

February 1, 2016

Obviously, we know the answer. This blog post is intended to give me an easy place to point people when they ask me “so what’s wrong with Docker?”.

[To clarify, I use Docker myself, and it is pretty neat. All the more reason missing features annoy me.]

Docker itself:

  • User namespaces — slated to land in February 2016, so pretty close.
  • Temporary adds/squashing — the issue is currently “closed”, and the suggestion is that people use work-arounds.
  • Dockerfile syntax is limited — this is related to the issue above, but there are a lot of missing features in Dockerfile (for example, a simple form of “reuse” other than chaining). No clear idea when it will be possible to actually implement the build in terms of an API, because there is no link to an issue or PR.

Tooling:

  • Image size — Minimal versions of Debian, Ubuntu or CentOS are all unreasonably big. Alpine does a lot better. People really should move to Alpine. I am disappointed there is no competition on being a “minimal container-oriented distribution”.
  • Build determinism — Currently, almost all Dockerfiles in the wild call out to the network to grab some files while building. This is really bad — it assumes networking, depends on servers being up and assumes files on servers never change. The alternative seems to be checking big files into one’s own repo.
    • The first thing to do would be to have an easy way to disable networking while the container is being built.
    • The next thing would be a “download and compare hash” operation in a build-prep step, so that all dependencies can be downloaded and verified, while the hashes would be checked into the source (see the sketch after this list).
    • Sadly, Alpine Linux specifically makes it non-trivial to “just download the package” from outside of Alpine.
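A hedged sketch of such a build-prep step; the URL and hash are placeholder values that would be checked into source control:

import hashlib
import urllib.request

PINNED = {
    "https://example.com/dependency.tar.gz":
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

for url, expected in PINNED.items():
    data = urllib.request.urlopen(url).read()
    if hashlib.sha256(data).hexdigest() != expected:
        raise SystemExit("hash mismatch for %s" % url)
    with open(url.rsplit("/", 1)[-1], "wb") as f:
        f.write(data)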


