These are the notes for a lightning talk. These are not slides for a lightning talk. This is what I am going to say. At some point, I’ll illustrate it with slides. Perhaps…
Many people find themselves in a position to design a ‘big application’. Maybe implementing a bank’s web site. Maybe an OCR system. Maybe an airport kiosk. Maybe a cross-data-center firewall management system. Maybe a system to select and show music clips, with selected advertisements. All these systems are “big”, at least in a colloquial sense. While standards for “big” vary, each of these will need at least a couple of person-years to become a useful product. With a good team of 2-3 programmers and 5-6 months, there is a good chance of having a version which can actually be used for its intended purpose.
The first step is to make some strategic choices. The first choice is the language. Luckily, that is an easy one: use Python. For all of the examples above, Python is clearly the best fit. The next choice is how the application will be structured. The obvious instinct many of us have is to “be modular”: segment the application into areas of responsibility, have packages and sub-packages follow that division, and make sure the APIs are clearly defined. Unfortunately, that is probably a bad choice.
Ironically, the name “modular” implies the opposite of what it seems to. Modules are almost the tightest form of binding, second only to copy-and-paste: a module explicitly imports another module and can use any of its “public” APIs, so the dependencies pile up quickly. A much better way is to write self-contained “small” programs, each of which does one thing and does it well. These small programs can communicate in various ways, and in general, the motto is “the looser the better!”.
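As a sketch of how loosely coupled two small programs can be, here are two tiny “programs” connected by a pipe. Both programs are invented for this example and inlined with `python -c` only so the snippet is self-contained; in a real system each would be its own script, and the only contract between them would be “lines of text on a pipe”.

```python
import subprocess
import sys

# Hypothetical small program #1: emits three lines of output.
producer_src = 'print("\\n".join(["a", "b", "c"]))'
# Hypothetical small program #2: counts the lines it reads on stdin.
counter_src = "import sys; print(len(sys.stdin.readlines()))"

# Run the producer and capture its output.
produced = subprocess.run(
    [sys.executable, "-c", producer_src],
    capture_output=True, text=True, check=True,
).stdout

# Feed that output to the counter; neither program imports the other.
counted = subprocess.run(
    [sys.executable, "-c", counter_src],
    input=produced, capture_output=True, text=True, check=True,
).stdout

count = int(counted)
```

Either program can be rewritten, or replaced by a program in another language, without the other one noticing.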
The next tightest way, after importing code directly, is using remote calls. These often take the form of web APIs: XML-RPC, JSON-RPC, RESTful APIs or others. The relative merits of the different styles of remote calling, including web APIs, will not be discussed here. As a class, they offer the customary advantages of segregating concerns into multiple processes: different address spaces, different privileges and the possibility of running them on different computers. However, these methods still need both processes to be alive at the same time in order to communicate. Sometimes this is unavoidable; sometimes it is not.
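As an illustration, here is a minimal sketch of a RESTful-style call using only the standard library. The `/status` path and the response payload are invented for the example, and the server runs in a thread only to keep the sketch self-contained; in the architecture described above it would be a separate process, possibly on a different machine.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical endpoint: report service status as JSON.
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), StatusHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "other process": a plain HTTP GET, no shared code needed.
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/status") as resp:
    result = json.load(resp)
server.shutdown()
```

Note that the client knows nothing about the server except a URL and a JSON shape; but both sides must be up at the same time for the call to succeed.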
The loosest way of integration is through a database. If the concerns are separate enough that one process can read the database and another write it, at different times, the system will be truly resilient. Not only do the separate privileges make it easy to have mutually untrusting processes, but databases are specifically geared towards serving different users with different permissions. This makes it easy to ensure that the web side does not write to a database, even if it is completely hacked. “Databases” should not be construed to mean only classic SQL ACID databases like Postgres or Oracle; the various non-SQL databases, like CouchDB or Cassandra, qualify as well. One important special case of a non-SQL database is the filesystem itself. Some things are easier, some are harder, but one advantage is that, by definition, the filesystem will crash only if the operating system itself does. The process in charge of “maintaining” a database table can have an associated library for reading the database, offered as a module to be used by the other processes. This trades off code separation for separation of knowledge of the table structure.
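A minimal sketch of this pattern, using SQLite from the standard library. The `events` table and the function names are invented for the example; the point is that the writer and the reader would normally be separate processes, and they never need to be alive at the same time.

```python
import sqlite3

def record_event(db_path, event):
    # Writer side: the process that "maintains" the table.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    conn.execute("INSERT INTO events (payload) VALUES (?)", (event,))
    conn.commit()
    conn.close()

def read_events(db_path):
    # Reader side: a separate process, possibly running much later,
    # possibly with read-only permissions on the database.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT payload FROM events ORDER BY id").fetchall()
    conn.close()
    return [payload for (payload,) in rows]
```

In a real deployment the reader-side function is the kind of thing that would live in the small shared library mentioned above, so only one process needs to know the table structure in detail.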
There are options in the middle. Message queues, like Apache ActiveMQ or RabbitMQ, allow communication between processes which crash independently, while still keeping the messages. The binding is tighter than with a database (the consumer still has to clean the queue in a timely fashion), but for one-way communication it is certainly superior to a direct TCP connection, over any protocol.
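To illustrate the pattern itself (not as a substitute for a real broker), here is a sketch of a persistent one-way queue built on a spool directory; the file-naming scheme is invented for the example. Messages survive a crash of either side, but, as noted, the consumer is still obliged to drain the queue.

```python
import os
import time
import uuid

def enqueue(spool_dir, message):
    # Producer side. Write to a temporary name, then rename:
    # rename is atomic on POSIX, so the consumer never sees a
    # half-written message.
    os.makedirs(spool_dir, exist_ok=True)
    tmp = os.path.join(spool_dir, f".tmp-{uuid.uuid4().hex}")
    with open(tmp, "w") as f:
        f.write(message)
    final = f"msg-{time.time_ns()}-{uuid.uuid4().hex}"
    os.rename(tmp, os.path.join(spool_dir, final))

def consume(spool_dir):
    # Consumer side: yield each message, then delete it.
    # This is the "clean the queue in a timely fashion" obligation.
    for name in sorted(os.listdir(spool_dir)):
        if name.startswith("msg-"):
            path = os.path.join(spool_dir, name)
            with open(path) as f:
                yield f.read()
            os.remove(path)
```

A real message queue adds acknowledgements, fan-out and back-pressure on top of this basic idea; the coupling, however, is the same: shared storage plus an obligation on the consumer.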
Using small processes also allows the use of different languages (gasp!). If a single task is better done in Java (because it has a library to do that) or in C (because it needs the speed boost), no “integration strategy” is needed. Even more importantly, it allows using incompatible frameworks without ugly kludges: the web UI can be done in Django, the part that communicates with remote devices can be in Twisted, and the database maintainer can be a plain old Python process which does its job in a single-threaded, blocking way.
Big applications are a recipe for disaster. Building them instead as small, co-operating processes, as loosely coupled as possible, yields applications which are more resilient, more debuggable and more scalable.