Keeping changing configuration data

I wrote a post, a long time ago, about how all configuration mechanisms, when pushed far enough, end up as one of two things:

  • Programming language
  • Database

Back when I wrote it, I went on to explain why you should foresee this, and use a production-grade programming language (such as, say, Python) instead of adding more and more capabilities to an ad-hoc configuration language with few specs and less consistency. That works well for the standard situation with classic UNIX servers, which execute the complicated configuration when they start up, and are restarted whenever the configuration changes, usually manually by the system administrator.

However, many modern systems need their configuration data to change online, in response to network traffic, user actions or other outside influences. These systems must change their configuration data in real time, and adapt to it. Often, several concurrent, mutually unaware sources will want to modify the configuration. It is OK if some of these modifications fail, of course (if two people remove the same thing, the second removal fails), but it is not OK if the semantics are unclear (an attempt to remove one item causes the removal of another). So, we need data structures which ensure that level of consistency.

Some simplifying assumptions: the configuration is “small” (limited to 10K-100K objects) and is read much more often than it is changed. In the application which my system serves, configuration may be read tens of thousands of times per second, yet can stay the same for days.

One data structure which is well adapted to this is the “map”. A map from a tuple of keys to a tuple of values has the properties we want. Since few other data structures satisfy the needed properties, we assume all of our configuration data is kept in a set of maps. Luckily, this is no practical limitation: just like with modern database interfaces, we can build ORMs to have objects in our programming languages, which are saved as maps (much as a database ORM saves them as tables). The details of such ORMs are beyond the scope of this post (though they are no less interesting!). So we have our first “approximation”:

Configuration data is a bunch of maps, each between some defined key-tuple and a value tuple.
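
To make this concrete, here is a minimal Python sketch of what “a bunch of maps” could look like. The map names (“routes”, “users”) and their fields are made up for illustration; nothing in this post prescribes them.

```python
from typing import Dict, Tuple

# One map: key tuple -> value tuple.
ConfigMap = Dict[Tuple, Tuple]

# The whole configuration: map name -> map.
# Map names and fields below are hypothetical examples.
config: Dict[str, ConfigMap] = {
    # (host, port) -> (backend, weight)
    "routes": {("example.com", 80): ("backend-1", 10)},
    # (username,) -> (role, enabled)
    "users": {("alice",): ("admin", True)},
}


def lookup(config: Dict[str, ConfigMap], map_name: str, key: Tuple) -> Tuple:
    """Read one value tuple. A read is two dictionary lookups."""
    return config[map_name][key]


backend, weight = lookup(config, "routes", ("example.com", 80))
```

Since reads vastly outnumber writes, the point of this shape is that a read touches nothing but in-memory dictionaries.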

Now comes the most interesting part: as we said at the beginning, this configuration data can change in real time, and the last thing we want is to restart the program. The other insight relevant here is that a good configuration system makes “manual caching” of the configuration data inside the program unnecessary: the configuration data structures are always available at zero cost. Otherwise, the user of the system will write their own caching system, badly. This means that an application can always adjust to a stream of configuration change requests. Thus, we can have an application start from “zero configuration” (all maps are empty), and send it the real history of the configuration changes that the system underwent from its inception. Each change does one of three things to a map: adds an item, removes an item, or modifies the value of an existing item.

Configuration is the list of changes from zero configuration to the current state.
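
A sketch of replaying such a history, in Python. The change format (operation, map name, key, value) and the ChangeRejected exception are my assumptions for this example, not a fixed format. Note how a change either applies exactly as stated or is rejected; it never touches a different item.

```python
from typing import Dict, Iterable, Tuple

# Assumed change format: (operation, map name, key tuple, value tuple).
Change = Tuple[str, str, tuple, tuple]
State = Dict[str, Dict[tuple, tuple]]


class ChangeRejected(Exception):
    """Raised when a change does not apply to the current state."""


def apply_change(state: State, change: Change) -> None:
    op, map_name, key, value = change
    table = state.setdefault(map_name, {})
    if op == "add":
        if key in table:
            raise ChangeRejected(f"{map_name}{key} already exists")
        table[key] = value
    elif op == "modify":
        if key not in table:
            raise ChangeRejected(f"{map_name}{key} does not exist")
        table[key] = value
    elif op == "remove":
        if key not in table:
            raise ChangeRejected(f"{map_name}{key} does not exist")
        del table[key]
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(history: Iterable[Change]) -> State:
    """Start from zero configuration and apply every change in order."""
    state: State = {}
    for change in history:
        apply_change(state, change)
    return state


history = [
    ("add", "routes", ("a.example", 80), ("backend-1",)),
    ("add", "routes", ("b.example", 80), ("backend-2",)),
    ("modify", "routes", ("a.example", 80), ("backend-3",)),
    ("remove", "routes", ("b.example", 80), ()),
]
state = replay(history)
```

If a second source now tries to remove ("b.example", 80) again, apply_change raises ChangeRejected and the state is left as it was, which is exactly the “OK to fail, not OK to be unclear” behavior described earlier.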

Of course, the above might be rather wasteful. A system can have a long history, but only a small configuration. Note that the reverse is not true: a system’s configuration is bounded by its history. Again there is a critical insight here: any prefix of the history can be “optimized” to a provably minimal list which is equivalent as far as the resulting state is concerned. The way to do it is to start from zero configuration, go over the prefix’s changes one by one while keeping the current state, and then go over all maps, emitting an “add” for each item in each map. The result is minimal because every item in the resulting state needs at least one change to create it.

In a list of changes, any prefix can be rewritten to a provably minimal equivalent prefix, with O(prefix length) time and O(configuration size) memory.
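
The optimization can be sketched in a few lines of Python. As before, the (operation, map name, key, value) change format is an assumption of this example, and the sketch assumes the prefix contains only changes that were already accepted, so it does no validation.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed change format: (operation, map name, key tuple, value tuple).
Change = Tuple[str, str, tuple, tuple]
State = Dict[str, Dict[tuple, tuple]]


def replay(changes: Iterable[Change]) -> State:
    """Apply already-accepted changes to an initially empty state."""
    state: State = {}
    for op, map_name, key, value in changes:
        table = state.setdefault(map_name, {})
        if op == "remove":
            del table[key]
            if not table:            # drop maps that became empty
                del state[map_name]
        else:                        # "add" and "modify" both set the value
            table[key] = value
    return state


def optimize(prefix: Iterable[Change]) -> List[Change]:
    """Rewrite a prefix as one "add" per item in the resulting state.

    One pass over the prefix, and only the running state is held in memory.
    """
    state = replay(prefix)
    return [
        ("add", map_name, key, value)
        for map_name, table in state.items()
        for key, value in table.items()
    ]


prefix = [
    ("add", "routes", ("a.example", 80), ("backend-1",)),
    ("add", "routes", ("b.example", 80), ("backend-2",)),
    ("modify", "routes", ("a.example", 80), ("backend-3",)),
    ("remove", "routes", ("b.example", 80), ()),
]
minimal = optimize(prefix)
```

Here four changes collapse into a single “add”, and replaying the minimal list gives the same state as replaying the original prefix.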

This means that if the system that is being changed just logs every change, and from time to time closes one file and opens a new one, then a separate “optimizer” can periodically go over all non-current files, and optimize them. Such an optimizer will frequently also need to keep some order between the maps, in order for dependencies to work correctly. The online system being configured can just keep logs, confident that if it needs to shut down and start again, it can just read those logs (perhaps partially optimized) to achieve the same configuration state.

Configuration is a list of changes to maps from zero configuration to the current state, with some optimized prefix.
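
Here is a rough Python sketch of that arrangement, with in-memory lists standing in for log files. The SegmentedLog class, its method names and the map_order parameter are all hypothetical; a real system would write files and run the optimizer as a separate process.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

# Assumed change format: (operation, map name, key tuple, value tuple).
Change = Tuple[str, str, tuple, tuple]
State = Dict[str, Dict[tuple, tuple]]


def replay(changes: Iterable[Change], state: Optional[State] = None) -> State:
    """Apply already-accepted changes on top of state (empty by default)."""
    state = {} if state is None else state
    for op, map_name, key, value in changes:
        table = state.setdefault(map_name, {})
        if op == "remove":
            del table[key]
        else:  # "add" and "modify" both set the value
            table[key] = value
    return state


class SegmentedLog:
    """Closed segments play the role of old log files; current is the open one."""

    def __init__(self) -> None:
        self.closed: List[List[Change]] = []
        self.current: List[Change] = []

    def log(self, change: Change) -> None:
        self.current.append(change)

    def rotate(self) -> None:
        """Close the current segment and open a new one."""
        self.closed.append(self.current)
        self.current = []

    def optimize_closed(self, map_order: Sequence[str]) -> None:
        """Replace all closed segments with one minimal segment.

        Maps listed in map_order are emitted first, in that order, so that
        maps which others depend on are loaded before their dependents.
        """
        state: State = {}
        for segment in self.closed:
            replay(segment, state)
        names = [n for n in map_order if n in state]
        names += sorted(n for n in state if n not in map_order)
        self.closed = [
            [("add", n, k, v) for n in names for k, v in state[n].items()]
        ]

    def restore(self) -> State:
        """Rebuild the state after a restart: optimized prefix, then the rest."""
        state: State = {}
        for segment in self.closed + [self.current]:
            replay(segment, state)
        return {n: t for n, t in state.items() if t}  # skip empty maps


log = SegmentedLog()
log.log(("add", "routes", ("a.example", 80), ("b1",)))
log.log(("add", "backends", ("b1",), ("10.0.0.1",)))
log.log(("add", "routes", ("b.example", 80), ("b1",)))
log.log(("remove", "routes", ("b.example", 80), ()))
log.rotate()
log.log(("add", "backends", ("b2",), ("10.0.0.2",)))
before = log.restore()
log.optimize_closed(map_order=["backends", "routes"])
after = log.restore()
```

The online system only ever calls log and rotate; the optimizer only touches closed segments, so the two never contend over the same data.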

Since all configuration is distributed as changes, it is trivial to chain several of these systems together. One component can interact with the client, and distribute the configuration changes to others. When a new component connects, it gets the current optimized configuration, and then gets an update whenever its upstream gets an update. The chaining can be recursive, too, with configuration changes being pushed downstream as needed. This feature can be used when distributing the system to avoid undue load on the central configuration manager.
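
A toy, in-process Python sketch of the chaining. The Node class and its methods are invented for this example, and direct method calls stand in for whatever network transport a real deployment would use.

```python
from typing import Dict, List, Tuple

# Assumed change format: (operation, map name, key tuple, value tuple).
Change = Tuple[str, str, tuple, tuple]


class Node:
    """Holds a copy of the configuration and forwards changes downstream."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, Dict[tuple, tuple]] = {}
        self.downstream: List["Node"] = []

    def snapshot(self) -> List[Change]:
        """The current configuration as a minimal list of "add" changes."""
        return [
            ("add", map_name, key, value)
            for map_name, table in self.state.items()
            for key, value in table.items()
        ]

    def connect(self, child: "Node") -> None:
        """A new component first gets the optimized state, then live updates."""
        for change in self.snapshot():
            child.receive(change)
        self.downstream.append(child)

    def receive(self, change: Change) -> None:
        """Apply a change locally, then push it to everything downstream."""
        op, map_name, key, value = change
        table = self.state.setdefault(map_name, {})
        if op == "remove":
            del table[key]
            if not table:
                del self.state[map_name]
        else:  # "add" and "modify" both set the value
            table[key] = value
        for child in self.downstream:
            child.receive(change)


root, middle, leaf = Node("root"), Node("middle"), Node("leaf")
root.receive(("add", "routes", ("a.example", 80), ("b1",)))
root.connect(middle)      # middle catches up from the snapshot
middle.connect(leaf)      # leaf hangs off middle, not off root
root.receive(("modify", "routes", ("a.example", 80), ("b2",)))
root.receive(("add", "routes", ("b.example", 80), ("b1",)))
```

The root only ever talks to its direct children, which is what keeps load off the central configuration manager as the tree grows.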


One Response to Keeping changing configuration data

  1. […] You will want to be able to split your code into many processes. These will need to communicate. Since processes tend to die randomly, in the while, you will want the communication to be loosely coupled. The best loosely-coupled communication is shared state. One way to share state is to use an external state storer — AKA database. Relational or not, and perhaps both, you will want some sort of database. But some state needs to be accessed without calling out to an external process. For that, you will want something like what I already blogged about. […]
