Forking Skip-level Dependencies

May 6, 2016

I have recently found myself explaining this concept over and over to people, so I want to have a reference.

Most modern languages come with a “dependency manager” of sorts that helps manage the 3rd-party libraries a given project uses. Rust has Cargo, Node.js has npm, Python has pip and so on. All of these do some things well and some things poorly. But one thing a dependency manager can do (well or poorly) is support forking skip-level dependencies.

In order to explain what I mean, here is an example: our project is PlanetLocator, a program to tell the user which direction they should face to see a planet. It depends on a library called Astronomy. Astronomy depends on Physics. Physics depends on Math.

  • PlanetLocator
    • Astronomy
      • Physics
        • Math

PlanetLocator is a SaaS, running on our servers. One day, we find Math has a critical bug, leading to a remote execution vulnerability. This is pretty bad, because it can be triggered via our application by simply asking PlanetLocator for the location of Mars at a specific date in the future. Luckily, the bug is simple — in Math’s definition of Pi, we need to add a couple of significant digits.

How easy is it to fix?

Well, assume PlanetLocator is written in Go, and not using any package manager. A typical import statement in PlanetLocator is

import "github.com/astronomy/astronomy"

A typical import statement in Astronomy is

import "github.com/physics/physics"

…and so on.

We fork Math over to “github.com/planetlocator/math” and fix the vulnerability. Now we have to fork Physics to use the forked Math, and Astronomy to use the forked Physics, and finally change all of our imports to import the forked Astronomy — and Physics, Astronomy and PlanetLocator had no bugs!

Now assume, instead, we had used Python. In our requirements.txt file, we could put

git+https://github.com/planetlocator/math#egg=math

and voilà! Even though Physics’ “setup.py” said “install_requires=['math']”, it will get our forked math.
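
A full requirements.txt for PlanetLocator might then look something like the following (the version pins are invented for illustration). Note that only the bottom of the dependency tree is forked; Astronomy and Physics keep coming from upstream:

astronomy==1.4.2
physics==2.0.1
git+https://github.com/planetlocator/math#egg=math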

When starting to use a new language/dependency manager, the first question to ask is: will it support me forking skip-level dependencies? Because every upstream maintainer is, effectively, an absent maintainer when rapid response is at stake (for any reason — I chose security above, but it might be beating the competition to a deadline, or fulfilling contractual obligations).


Use virtualenv

April 24, 2016

In a conversation recently with a friend, we agreed that something is wrong when the instructions tell you to do “sudo pip install”…which is good, because then you know to ignore them.

There is never a need for “sudo pip install”, and doing it is an anti-pattern. Instead, all installation of packages should go into a virtualenv. The only exception is, of course, virtualenv itself (and arguably, pip and wheel). I got enough questions about this that I wanted to write up an explanation of the how, the why, and why the counter-arguments are wrong.

What is virtualenv?

The documentation says:

virtualenv is a tool to create isolated Python environments.

The basic problem being addressed is one of dependencies and versions, and indirectly permissions. Imagine you have an application that needs version 1 of LibFoo, but another application requires version 2. How can you use both these applications? If you install everything into /usr/lib/python2.7/site-packages (or whatever your platform’s standard location is), it’s easy to end up in a situation where you unintentionally upgrade an application that shouldn’t be upgraded.

Or more generally, what if you want to install an application and leave it be? If an application works, any change in its libraries or the versions of those libraries can break the application.

The tl;dr is:

  • virtualenv allows installing packages without administrator privileges
  • virtualenv allows installing different versions of the same library
  • virtualenv allows installing an application and never accidentally updating a dependency

The first problem is the one the “sudo” comment addresses — but the real issues stem from the second and third: not using a virtual environment leads to the potential of conflicts and dependency hell.

How to use virtualenv?

Creating a virtual environment is easy:

$ virtualenv dirname

will create the directory, if it does not exist, and then create a virtual environment in it. It is possible to use it either activated or unactivated. Activating a virtual environment is done by

$ . dirname/bin/activate
(dirname)$

This will put python, as well as any script installed using the setuptools “console_scripts” option in the virtual environment, on the command-execution path. The most important of those is pip, and so using pip will install into the virtual environment.

It is also possible to use a virtual environment without activating it, by directly calling dirname/bin/python or any other console script. Again, pip is an example of those, and can be used to install into the virtual environment.
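
For example, assuming a virtual environment created in “dirname” as above, and picking requests as an arbitrary package:

$ dirname/bin/pip install requests
$ dirname/bin/python -c "import requests; print(requests.__version__)"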

Installing tools for “general use”

I have seen, a couple of times, the argument that when installing tools for general use, it makes sense to install them into the system install. I do not think this is a reasonable exception, for two reasons:

  • It still forces the use of root to install/upgrade those tools
  • It still runs into the dependency/conflict hell problems

There are a few good alternatives for this:

  • Create a (handful of) virtual environments, and add them to users’ path (see the sketch below).
  • Use “pex” to install Python tools in a way that isolates them even further from system dependencies.
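
For the first of these alternatives, a minimal sketch might look like the following; the /opt/tools location and the httpie tool are arbitrary choices for illustration:

$ virtualenv /opt/tools
$ /opt/tools/bin/pip install httpie
$ export PATH=/opt/tools/bin:$PATH    # typically added to users' shell profiles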

Exploratory programming

People often use Python for exploratory programming. That’s great! Note that since pip 7, pip builds and caches wheels by default. This means that creating virtual environments is even cheaper: tearing down an environment and building a new one will not require recompilation. Because of that, it is easy to treat virtual environments as disposable except for configuration: activate a virtual environment, explore — and whenever things need to move toward production, “pip freeze” will allow easy recreation of the environment.
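
That workflow might look something like this (the package names are arbitrary):

$ virtualenv explore
$ . explore/bin/activate
(explore)$ pip install requests lxml      # explore freely
(explore)$ pip freeze > requirements.txt  # snapshot what worked
(explore)$ deactivate
$ virtualenv fresh
$ fresh/bin/pip install -r requirements.txt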


Weak references, caches and immutable objects

March 26, 2016

Consider the following situation:

  • We have a lot of immutable objects. For our purposes, an “immutable” object is one where “hash(…)” is defined.
  • We have a (pure) function that is fairly expensive to compute.
  • The objects get created and destroyed regularly.

We often would like to cache the function. As an example, consider a function to serialize an object — if the same object is serialized several times, we would like to avoid recomputing the serialization.

One naive solution would be to implement a cache:

cache = {}
def serialize(obj):
    if obj not in cache:
        cache[obj] = _really_serialize(obj)
    return cache[obj]

The problem with that is that the cache would keep references to our objects long after they should have died. We can try to use an LRU cache (for example, repoze.lru) so that only a certain number of objects would extend their lifetimes in that way, but the size of the LRU trades off space overhead against time overhead.
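
For comparison, here is a minimal sketch of the LRU approach using functools.lru_cache from the standard library (Python 3.2 and later; repoze.lru provides a similar decorator). The maxsize value is the knob that trades memory against recomputation, and unlike a weak reference, the cache keeps its cached objects alive:

import functools

@functools.lru_cache(maxsize=1024)  # bigger maxsize: fewer recomputations, more objects kept alive
def serialize(obj):
    return _really_serialize(obj)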

An alternative is to use weak references. Weak references are references that do not keep objects from being collected. There are several ways to use weak references, but one is ideally suited here:

import weakref
cache = weakref.WeakKeyDictionary()
def serialize(obj):
    if obj not in cache:
        cache[obj] = _really_serialize(obj)
    return cache[obj]

Note that this is the same code as before — except that the cache is a weak key dictionary. A weak key dictionary keeps weak references to the keys, but strong references to the values. When a key is garbage collected, the entry in the dictionary disappears.

>>> import weakref
>>> a=weakref.WeakKeyDictionary()
>>> fs = frozenset([1,2,3])
>>> a[fs] = "three objects"
>>> print a[fs]
three objects
>>> len(a)
1
>>> fs = None
>>> len(a)
0

Learn Public Speaking: The Heart-Attack Way

February 11, 2016

I know there are all kinds of organizations, and classes, that teach public speaking. But if you want to learn how to do public speaking the way I did, it’s pretty easy. Get a bunch of friends together, who are all interested in improving their skills. Meet once a week.

The rules are these: one person gets to give a talk with no more props than a marker and a whiteboard. If it’s boring, or they hesitate, another person can butt in, say “boring” and give a talk about something else. Talks are to be prepared somewhat in advance, and to be about five minutes long.

This is scary. You stand there, at the board, never knowing when someone will decide you’re boring. You try to entertain the audience. You try to keep their interest. You lose your point for 10 seconds…and you’re out. It’s the fight club of public speaking. After a few weeks of this, you’ll give flowing talks, and nothing will ever scare you again about public speaking — you’ll laugh as you realize that nobody is going to kick you out.

Yeah, Israel was a good training ground for speaking.


Docker: Are we there yet?

February 1, 2016

Obviously, we know the answer. This post is intended to give me an easy place to point people to when they ask me “so, what’s wrong with Docker?”

[To clarify, I use Docker myself, and it is pretty neat. All the more reason missing features annoy me.]

Docker itself:

  • User namespaces — slated to land in February 2016, so pretty close.
  • Temporary adds/squashing — currently “closed”, with a suggestion that people use work-arounds.
  • Dockerfile syntax is limited — this is related to the issue above, but there are a lot of missing features in Dockerfile (for example, a simple form of “reuse” other than chaining). No clear idea when it will be possible to actually implement the build in terms of an API, because there is no link to an issue or PR.

Tooling:

  • Image size — Minimal versions of Debian, Ubuntu or CentOS are all unreasonably big. Alpine does a lot better. People really should move to Alpine. I am disappointed there is no competition on being a “minimal container-oriented distribution”.
  • Build determinism — Currently, almost all Dockerfiles in the wild call out to the network to grab some files while building. This is really bad — it assumes networking, depends on servers being up and assumes files on servers never change. The alternative seems to be checking big files into one’s own repo.
    • The first thing to do would be to have an easy way to disable networking while the container is being built.
    • The next thing would be a “download and compare hash” operation in a build-prep step, so that all dependencies can be downloaded and verified, while the hashes would be checked into the source.
    • Sadly, Alpine Linux specifically makes it non-trivial to “just download the package” from outside of Alpine.

Learning Python: The ecosystem

January 27, 2016

When first learning Python, the tutorial is a pretty good resource to get acquainted with the language and, to some extent, the standard library. I have written before about how to release open source libraries — but it is quite possible that one’s first foray into Python will not be to write a reusable library, but an application to accomplish something — maybe a web application with Django or a tool to send commands to your TV. Much of what I said there will not apply — no need for a README.rst if you are writing a Django app for your personal website!

However, it probably is useful to learn a few tools that the Python ecosystem has engineered to make life more pleasant. In a perfect world, those would be built into Python: the “cargo” to Python’s “Rust”. In the world we live in, though, we must cobble together a good tool-chain from various open source projects. So strap in, and let’s begin!

The first three are cribbed from my “open source” link above, because good code hygiene is always important.

Testing

There are several reasonably good test runners. If there is no clear reason to choose one, py.test is a good default. “Using Twisted” is a good reason to choose trial. Using coverage is a no-brainer. It is good to run some functional tests too; test runners should be able to help with this, but even writing a Python program that fails if things are not working can be useful.
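
As a minimal (and entirely hypothetical) example of the py.test style: tests are plain functions named test_* in files named test_*.py, and use bare assert statements:

import pytest

def parse_age(text):
    # Stand-in for real application code; normally this would be imported.
    return int(text.strip())

def test_parse_age_strips_whitespace():
    assert parse_age(" 42 ") == 42

def test_parse_age_rejects_garbage():
    with pytest.raises(ValueError):
        parse_age("not a number")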

Static checking

There are a lot of tools for static checking of Python programs — pylint, flake8 and more. Use at least one. Using more is not completely free (more ways to have to say “ignore this, this is ok”), but it can be useful for catching more static and style issues. At worst, if there are local conventions that are not easily plugged into these checkers, write a Python program that will check for them and fail if they are violated.

Meta testing

Use tox. Put tox.ini at the root of your project, and make sure that “tox” (with no arguments) works and runs your entire test-suite. All unit tests, functional tests and static checks should be run using tox.

Set tox to put all build artifacts in a build/ top-level directory.
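
A minimal sketch of what such a tox.ini might look like (the environments, dependencies and project name here are illustrative, not prescriptive):

[tox]
envlist = py27,flake8
# Keep tox's own work directories under build/ as well.
toxworkdir = {toxinidir}/build/tox

[testenv]
deps =
    pytest
    coverage
commands =
    coverage run -m pytest
    coverage report

[testenv:flake8]
deps = flake8
commands = flake8 myproject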

Pex

A tox test-environment of “pex” should result in a Python EXecutable being created and put somewhere under “build/”. Running your Python application to actually serve web pages should be as simple as taking that pex and running it without arguments. BoredBot shows an example of how to create such a pex that includes a web application, a Twisted application and a simple loop/sleep-based application. The pex build can take a requirements.txt file with exact dependencies, though if it is built by tox, you can inline those dependencies directly in the tox file.

Collaboration

If you do collaborate with others on the project, whether it is open source or not, it is best if the collaboration instructions are as easy as possible. Ideally, collaboration instructions should be no more complicated than “clone this, make changes, run tox, possibly do whatever manual verification using ‘./build/my-thing.pex’ and submit a pull request”.

If they are more complicated than that, consider investing some effort into changing the code to be more self-sufficient and make fewer assumptions about its environment. For example, default to a local SQLite-based database if no “--database” option is specified, and initialize it with whatever your code needs. This will also make it easier to practice the “infinite environment” methodology, since if one file is all it takes to “bring up” an environment, it should be easy enough to run it on a cloud server and allow people to look at it.
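
Here is a sketch of what that might look like with argparse and the standard library's sqlite3 module; every name in it is hypothetical, and the point is only that the default path requires no setup:

import argparse
import sqlite3

def main():
    parser = argparse.ArgumentParser(description="hypothetical self-sufficient application")
    parser.add_argument("--database", default="local.sqlite",
                        help="database to use (defaults to a local SQLite file)")
    args = parser.parse_args()
    connection = sqlite3.connect(args.database)
    # Initialize whatever the code needs on a fresh database.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS things (id INTEGER PRIMARY KEY, name TEXT)")
    connection.commit()

if __name__ == '__main__':
    main()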


Big O for the working programmer

December 6, 2015

Let’s say you’re writing code, and you plan out one function. You try to figure out the constraints on the algorithmic efficiency of this function — how good does your algorithm need to be? This depends, of course, on many things. But less than you’d think. First, let’s assume you are OK with around a billion operations (conveniently, modern gigahertz processors do about a billion low-level operations per second, so it’s a good first assumption).

If your algorithm is O(n), that means n can’t be much bigger than a billion.

O(n**2) — n should be no more than the root of a billion — around 30,000.

O(n**3) — n should be no more than the third root of a billion — a thousand.

O(fib(n)) — n should be no more than 43.

O(2**n) — a billion is right around 2**30, so n should be no more than 30.

O(n!) — n should be no more than 12.

OK, now let’s assume you’re the NSA. You can fund a million cores, and your algorithm parallelizes perfectly. How do these numbers change?

O(n) — a quadrillion

O(n**2) — 30 million

O(n**3) — hundred thousand

O(fib(n)) — 71

O(2**n) — 50

O(n!) — 16

You will notice that the difference between someone with a Raspberry PI and a nation-state espionage agency is important for O(n)/O(n**2) algorithms, but quickly becomes meaningless for the higher order ones. You will also notice log(n) was not really factored in — even in a billion, it would mean the difference between handling a billion and handling a hundred million.

This table is useful to memorize for a quick gut check — “is this algorithm feasible?” If the sizes you plan to attack are way smaller, then yes; if way bigger, then no; and if “right around”, that’s where you might need to go to a lower-level language or micro-optimize to get it to work. It is useful to know these things before starting to implement, regardless of whether the code is going to run on a smartwatch or in an NSA data center.

Conveniently, these numbers are also useful for memory performance — whether you need to support a 1 GB device or have a terabyte of memory.
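
For those who would rather recompute these cutoffs than memorize them, here is a small sketch. Pass it a budget of operations (or bytes); expect the exact answers to differ by a step or two from the rounded numbers above:

import math

def max_n(cost, budget):
    # Largest n with cost(n) <= budget, assuming cost is non-decreasing:
    # double until we overshoot, then binary-search the boundary.
    lo, hi = 0, 1
    while cost(hi) <= budget:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

budget = 10 ** 9  # one core for one second; use 10 ** 15 for the million-core case
for name, cost in [
    ("n", lambda n: n),
    ("n**2", lambda n: n ** 2),
    ("n**3", lambda n: n ** 3),
    ("fib(n)", fib),
    ("2**n", lambda n: 2 ** n),
    ("n!", math.factorial),
]:
    print("%-7s %d" % (name, max_n(cost, budget)))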