Use virtualenv

April 24, 2016

In a recent conversation with a friend, we agreed that “sometimes the instructions tell you to do ‘sudo pip install’…which is good, because then you know to ignore them”.

There is never a need for “sudo pip install”, and doing it is an anti-pattern. Instead, all installation of packages should go into a virtualenv. The only exception is, of course, virtualenv itself (and, arguably, pip and wheel). I have gotten enough questions about this that I wanted to write up an explanation of the how, the why, and why the counter-arguments are wrong.

What is virtualenv?

The documentation says:

virtualenv is a tool to create isolated Python environments.

The basic problem being addressed is one of dependencies and versions, and indirectly permissions. Imagine you have an application that needs version 1 of LibFoo, but another application requires version 2. How can you use both these applications? If you install everything into /usr/lib/python2.7/site-packages (or whatever your platform’s standard location is), it’s easy to end up in a situation where you unintentionally upgrade an application that shouldn’t be upgraded.

Or more generally, what if you want to install an application and leave it be? If an application works, any change in its libraries or the versions of those libraries can break the application.

The tl;dr is:

  • virtualenv removes the need for administrator privileges
  • virtualenv allows installing different versions of the same library
  • virtualenv allows installing an application and never accidentally updating a dependency

The first problem is the one the “sudo” comment addresses — but the real issues stem from the second and third: not using a virtual environment leads to the potential of conflicts and dependency hell.

How to use virtualenv?

Creating a virtual environment is easy:

$ virtualenv dirname

will create the directory, if it does not exist, and then create a virtual environment in it. It is possible to use it either activated or unactivated. Activating a virtual environment is done by

$ . dirname/bin/activate
(dirname)$

This will put python, as well as any script installed using setuptools’ “console_scripts” option in the virtual environment, on the command-execution path. The most important of those is pip, and so using pip will install into the virtual environment.

It is also possible to use a virtual environment without activating it, by directly calling dirname/bin/python or any other console script. Again, pip is an example of those, and is used for installing into the virtual environment.

Installing tools for “general use”

I have seen, a couple of times, the argument that tools meant for general use should be installed into the system Python. I do not think that this is a reasonable exception, for two reasons:

  • It still forces the use of root to install/upgrade those tools
  • It still runs into the dependency/conflict hell problems

There are a few good alternatives for this:

  • Create a (handful of) virtual environments, and add them to users’ path.
  • Use “pex” to install Python tools in a way that isolates them even further from system dependencies.

Exploratory programming

People often use Python for exploratory programming. That’s great! Note that since pip 7, pip is building and caching wheels by default. This means that creating virtual environments is even cheaper: tearing down an environment and building a new one will not require recompilation. Because of that, it is easy to treat virtual environments as disposable except for configuration: activate a virtual environment, explore, and whenever you need to move things into production, ‘pip freeze’ will allow easy recreation of the environment.


Weak references, caches and immutable objects

March 26, 2016

Consider the following situation:

  • We have a lot of immutable objects. For our purposes, an “immutable” object is one where “hash(…)” is defined.
  • We have a (pure) function that is fairly expensive to compute.
  • The objects get created and destroyed regularly.

We often would like to cache the function. As an example, consider a function to serialize an object — if the same object is serialized several times, we would like to avoid recomputing the serialization.

One naive solution would be to implement a cache:

cache = {}
def serialize(obj):
    if obj not in cache:
        cache[obj] = _really_serialize(obj)
    return cache[obj]

The problem with that is that the cache would keep references to our objects long after they should have died. We can try to use an LRU cache (for example, repoze.lru) so that only a certain number of objects extend their lifetimes that way, but the size of the LRU trades off space overhead against time overhead (cache misses).
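As a rough sketch of the LRU approach, here is what it might look like with the standard library’s functools.lru_cache (Python 3) standing in for repoze.lru; the maxsize value is arbitrary, and _really_serialize is, as above, the expensive function:

import functools

@functools.lru_cache(maxsize=1024)
def serialize(obj):
    # The cache keeps strong references to the last 1024 distinct objects,
    # so it still extends their lifetimes, just for a bounded number of them.
    return _really_serialize(obj)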

An alternative is to use weak references. Weak references are references that do not keep objects from being collected. There are several ways to use weak references, but here one is ideally suited:

import weakref
cache = weakref.WeakKeyDictionary()
def serialize(obj):
    if obj not in cache:
        cache[obj] = _really_serialize(obj)
    return cache[obj]

Note that this is the same code as before — except that the cache is a weak key dictionary. A weak key dictionary keeps weak references to the keys, but strong references to the value. When a key is garbage collected, the entry in the dictionary disappears.

>>> import weakref
>>> a=weakref.WeakKeyDictionary()
>>> fs = frozenset([1,2,3])
>>> a[fs] = "three objects"
>>> print a[fs]
three objects
>>> len(a)
1
>>> fs = None
>>> len(a)
0

Learn Public Speaking: The Heart-Attack Way

February 11, 2016

I know there are all kinds of organizations, and classes, that teach public speaking. But if you want to learn how to do public speaking the way I did, it’s pretty easy. Get a bunch of friends together, who are all interested in improving their skills. Meet once a week.

The rules are this: one person gets to give a talk with no more props than a marker and a whiteboard. If it’s boring, or they hesitate, another person can butt in, say “boring” and give a talk about something else. Talks are to be prepared somewhat in advance, and to be about five minutes long.

This is scary. You stand there, at the board, never knowing when someone will decide you’re boring. You try to entertain the audience. You try to keep their interest. You lose your point for 10 seconds…and you’re out. It’s the fight club of public speaking. After a few weeks of this, you’ll give flowing talks, and nothing will ever scare you again about public speaking — you’ll laugh as you realize that nobody is going to kick you out.

Yeah, Israel was a good training ground for speaking.


Docker: Are we there yet?

February 1, 2016

Obviously, we know the answer. This post is intended to give me an easy place to point people when they ask me “so what’s wrong with Docker?”

[To clarify, I use Docker myself, and it is pretty neat. All the more reason missing features annoy me.]

Docker itself:

  • User namespaces — slated to land in February 2016, so pretty close.
  • Temporary adds/squashing — currently “closed”, with the suggestion that people use work-arounds.
  • Dockerfile syntax is limited — this is related to the issue above, but there are a lot of missing features in Dockerfile (for example, a simple form of “reuse” other than chaining). No clear idea when it will be possible to actually implement the build in terms of an API, because there is no link to an issue or PR.

Tooling:

  • Image size — Minimal versions of Debian, Ubuntu or CentOS are all unreasonably big. Alpine does a lot better. People really should move to Alpine. I am disappointed there is no competition on being a “minimal container-oriented distribution”.
  • Build determinism — Currently, almost all Dockerfiles in the wild call out to the network to grab some files while building. This is really bad — it assumes networking, depends on servers being up and assumes files on servers never change. The alternative seems to be checking big files into one’s own repo.
    • The first thing to do would be to have an easy way to disable networking while the container is being built.
    • The next thing would be a “download and compare hash” operation in a build-prep step, so that all dependencies can be downloaded and verified, while the hashes would be checked into the source. A minimal sketch of such a check appears after this list.
    • Sadly, Alpine Linux specifically makes it non-trivial to “just download the package” from outside of Alpine.
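As an illustration of the “download and compare hash” idea, here is a minimal, hypothetical build-prep sketch; the URL and hash are placeholders, and in practice the pinned hashes would be checked into the source tree:

import hashlib
import urllib.request

# Hypothetical pinned dependency list: (url, expected sha256).
DEPENDENCIES = [
    ("https://example.com/some-package.tar.gz",
     "0000000000000000000000000000000000000000000000000000000000000000"),
]

for url, expected in DEPENDENCIES:
    data = urllib.request.urlopen(url).read()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise SystemExit("hash mismatch for %s: %s" % (url, actual))
    # Only write the file once its hash has been verified.
    with open(url.rsplit("/", 1)[-1], "wb") as f:
        f.write(data)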



Learning Python: The ecosystem

January 27, 2016

When first learning Python, the tutorial is a pretty good resource to get acquainted with the language and, to some extent, the standard library. I have written before about how to release open source libraries — but it is quite possible that one’s first foray into Python will not be to write a reusable library, but an application to accomplish something — maybe a web application with Django or a tool to send commands to your TV. Much of what I said there will not apply — no need for a README.rst if you are writing a Django app for your personal website!

However, it probably is useful to learn a few tools that the Python eco-system engineered to make life more pleasant. In a perfect world, those would be built-in to Python: the “cargo” to Python’s “Rust”. However, in the world we live in, we must cobble together a good tool-chain from various open source projects. So strap in, and let’s begin!

The first three are cribbed from my “open source” link above, because good code hygiene is always important.

Testing

There are several reasonably good test runners. If there is no clear reason to choose one, py.test is a good default. “Using Twisted” is a good reason to choose trial. Using coverage is a no-brainer. It is good to run some functional tests too. Test runners should be able to help with this too, but even writing a Python program that fails if things are not working can be useful.
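For example, a minimal py.test-style test file might look like this (add is a trivial stand-in for your own code under test):

def add(x, y):
    # Stand-in for the code under test.
    return x + y

def test_add_is_commutative():
    # py.test collects test_* functions; a failing assert marks the test as failed.
    assert add(2, 3) == add(3, 2)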

Static checking

There are a lot of tools for static checking of Python programs — pylint, flake8 and more. Use at least one. Using more is not completely free (more ways to have to say “ignore this, this is ok”) but can be useful to catch more style and static issues. At worst, if there are local conventions that are not easily plugged into these checkers, write a Python program that will check for them and fail if they are violated.
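Such a home-grown check can be very small; as a sketch (the convention enforced here, forbidding “XXX” markers under src/, is a made-up example):

import glob
import sys

# Hypothetical local convention: no "XXX" markers may be committed under src/.
offenders = [path for path in glob.glob("src/**/*.py", recursive=True)
             if "XXX" in open(path).read()]
if offenders:
    sys.exit("XXX markers found in: %s" % ", ".join(offenders))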

Meta testing

Use tox. Put tox.ini at the root of your project, and make sure that “tox” (with no arguments) works and runs your entire test-suite. All unit tests, functional tests and static checks should be run using tox.

Set tox to put all build artifacts in a build/ top-level directory.
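A minimal tox.ini along these lines might look roughly like the following; the environment names, the pytest/flake8 choices and the toxworkdir location are examples rather than a prescription:

[tox]
envlist = py27,flake8
toxworkdir = {toxinidir}/build/tox

[testenv]
deps = pytest
commands = pytest {posargs}

[testenv:flake8]
deps = flake8
commands = flake8 src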

Pex

A tox test-environment of “pex” should result in a Python EXecutable created and put somewhere under “build/”. Running your Python application to actually serve web pages should be as simple as taking that pex and running it without arguments. BoredBot shows an example of how to create such a pex that includes a web application, a Twisted application and a simple loop/sleep-based application. This pex build can take a requirements.txt file with exact dependencies, though if it is built by tox, you can inline those dependencies directly in the tox file.

Collaboration

If you do collaborate with others on the project, whether it is open source or not, it is best if the collaboration instructions are as easy as possible. Ideally, collaboration instructions should be no more complicated than “clone this, make changes, run tox, possibly do whatever manual verification using ‘./build/my-thing.pex’ and submit a pull request”.

If they are more complicated than that, consider investing some effort into changing the code to be more self-sufficient and make fewer assumptions about its environment. For example, default to a local SQLite-based database if no “--database” option is specified, and initialize it with whatever your code needs. This will also make it easier to practice the “infinite environment” methodology, since if one file is all it takes to “bring up” an environment, it should be easy enough to run it on a cloud server and allow people to look at it.
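A minimal sketch of that kind of default (the option name follows the example above; the “events” table is a made-up placeholder for whatever schema your code needs):

import argparse
import sqlite3

parser = argparse.ArgumentParser()
# Fall back to a local SQLite file when no database is specified.
parser.add_argument("--database", default="local.sqlite")
args = parser.parse_args()

connection = sqlite3.connect(args.database)
connection.execute(
    "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body TEXT)")
connection.commit()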


Big O for the working programmer

December 6, 2015

Let’s say you’re writing code, and you plan out one function. You try to figure out the constraints on the algorithmic efficiency of this function — how good should your algorithm be? This depends, of course, on many things. But less than you’d think. First, let’s assume you are ok with around a billion operations (conveniently, modern gigahertz processors do about a billion low-level operations per second, so it’s a good first assumption).

If your algorithm is O(n), that means n can’t be much bigger than a billion.

O(n**2) — n should be no more than the root of a billion — around 30,000.

O(n**3) — n should be no more than the third root of a billion — a thousand.

O(fib(n)) — n should be no more than 43

O(2**n) — a billion is right around 2**30, so n should be no more than 30.

O(n!) — n should be no more than 12

OK, now let’s assume you’re the NSA. You can fund a million cores, and your algorithm parallelizes perfectly. How do these numbers change?

O(n) — a quadrillion (a million times a billion)

O(n**2) — 30 million

O(n**3) — hundred thousand

O(fib(n)) — 71

O(2**n) — 50

O(n!) — 16
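The numbers above are rounded for easy memorization; the exact cutoffs shift by one or two depending on conventions (for example, how fib is indexed). A small sketch to re-derive them for both budgets:

import math

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def largest_n(cost, budget):
    # Largest n with cost(n) <= budget: double until too big, then bisect.
    lo = 1
    while cost(lo * 2) <= budget:
        lo *= 2
    hi = lo * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

costs = [
    ("n**2", lambda n: n ** 2),
    ("n**3", lambda n: n ** 3),
    ("fib(n)", fib),
    ("2**n", lambda n: 2 ** n),
    ("n!", math.factorial),
]

for budget in (10 ** 9, 10 ** 15):
    for name, cost in costs:
        print("%s operations: O(%s) handles n up to %d"
              % (budget, name, largest_n(cost, budget)))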

You will notice that the difference between someone with a Raspberry Pi and a nation-state espionage agency is important for O(n)/O(n**2) algorithms, but quickly becomes meaningless for the higher-order ones. You will also notice log(n) was not really factored in — even at a billion, it would mean the difference between handling a billion and handling a hundred million.

This table is useful to memorize for a quick gut check — “is this algorithm feasible?” If the sizes you plan to attack are way smaller, then yes; if way bigger, then no; and if “right around” — that’s where you might need to go to a lower-level language or micro-optimize to get it to work. It is useful to know these things before starting to implement, regardless of whether the code is going to run on a smartwatch or in an NSA data center.

Conveniently, these numbers are also useful for memory performance — whether you need to support a one GB device or if you have a terabyte of memory.



Deploying Python Applications

October 10, 2015

Related to:

  • 2L2T: DjangoCon Feedback
  • Deploying Python Applications with Docker – A Suggestion
  • Deploying With Docker
  • Software You Can Use

I have tried to make an example of some of the ways to avoid mistakes with BoredBot. Hopefully the README explains the problem domain and motivation, but here is the tl;dr:

Deploying applications is a place where it is possible to do things badly, amazingly badly and “oh God!”. I hope to slightly improve on this consensus with the current state of 2015 technology and get to “well, ok, this does not suck too much”. BoredBot is basically my attempt to show how combining NaCl, NColony, Pex and Docker can get you to a simple image which can be deployed to Amazon ECS, Google Container Engine or your own Docker servers without too much trouble.

There is still a lot BoredBot doesn’t do that I think any good deployment infrastructure should support — staging vs. production, dry-run mode, etc. Pull requests and issues are happily accepted!

