Spoilers below for the Kushiel’s Legacy books (first trilogy). If you have not read them, I highly recommend reading them before looking below; they are pretty awesome. Technically, there are also spoilers for Lord of the Rings, though if you have not read it by now, I don’t know what else to do.
I don’t like console-scripts. What I dislike includes both their magic and the actual code they produce. I mean, the autogenerated Python looks like:
    #!/home/moshez/src/mainland/build/toxenv/bin/python2
    # -*- coding: utf-8 -*-
    import re
    import sys

    from tox import cmdline

    if __name__ == '__main__':
        sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
        sys.exit(cmdline())
Notice the encoding header for an ASCII-only file, and the weird Windows-compatibility code, generated right alongside a Unix-style header that also contains an absolute path. Truly, a file that only its generator could love.
Then again, I don’t much like what we have in Twisted, with the preamble:
    #!/home/moshez/src/ncolony/build/env/bin/python2
    # Copyright (c) Twisted Matrix Laboratories.
    # See LICENSE for details.
    import os, sys

    try:
        import _preamble
    except ImportError:
        sys.exc_clear()

    sys.path.insert(0, os.path.abspath(os.getcwd()))

    from twisted.scripts.twistd import run
    run()
This time, it’s not auto-generated, it is artisanal. What it lacks in codegen ugliness it makes up for by importing an underscore-prefixed module at top level, and who doesn’t like a little bit of sys.path games?
I will note, furthermore, that neither of these options would play nice with something like pex. Worst of all, all of these options, regardless of their internal beauty (or horrible lack thereof), must be named veeeeery caaaaarefully to avoid collision with existing UNIX, Mac or Windows commands. Think this is easy? My Ubuntu has two commands called “chromium-browser” and “chromium-bsu” because Google didn’t check what popular games there are on Ubuntu before naming their browser.
Enter the best thing about Python, “-m”, which allows executing modules. What “-m” gains in awesomeness, it loses in the horribleness of the two-named-module, where a module is executed twice, once resulting in sys.modules['__main__'] and once in sys.modules[real_name], with hilarious consequences for class hierarchies, exceptions and other things relying on identity of defined things.
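The double execution is easy to reproduce. The following is only a sketch under stated assumptions: the module name twonames_demo is made up, and runpy is used to approximate what “-m” does (running the code with the name __main__) before the same file is imported under its real name.

```python
import importlib
import os
import runpy
import sys
import tempfile

# A throwaway module; "twonames_demo" is a made-up name for this sketch.
SOURCE = "class Thing(object):\n    pass\n"


def load_both_ways():
    """Execute the same file once as __main__ and once under its real name."""
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "twonames_demo.py"), "w") as handle:
            handle.write(SOURCE)
        sys.path.insert(0, directory)
        try:
            importlib.invalidate_caches()
            # Approximates "python -m twonames_demo": run with __name__ == "__main__".
            as_main = runpy.run_module("twonames_demo", run_name="__main__")
            # What every other module does: import it under its real name.
            as_import = importlib.import_module("twonames_demo")
        finally:
            sys.path.remove(directory)
    return as_main["Thing"], as_import.Thing


main_thing, imported_thing = load_both_ways()
print(main_thing is imported_thing)                 # False: two distinct classes
print(isinstance(main_thing(), imported_thing))     # False: identity checks break
```

The same mismatch applies to exception classes, which is why an "except" clause written against one copy will not catch an instance raised from the other.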
Luckily, packages will automatically execute “package.__main__” if “python -m package” is called. Ideally, nobody would try to import __main__, but this means packages can contain only one command. Enter “mainland”, which defines a main which will, carefully and accurately, dispatch to the right module’s main without any code generation. It has not been released yet, but is available from GitHub. Pull requests and issues are happily accepted!
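To make the idea concrete, here is a hand-rolled sketch of this kind of dispatch. It is not mainland’s actual API: the function name dispatch, the package name “mypkg” and the convention that each command module exposes main(argv) are all assumptions made for illustration.

```python
import importlib


def dispatch(package, argv):
    """Import "<package>.<argv[1]>" and call its main() with the remaining arguments."""
    if len(argv) < 2:
        raise SystemExit("usage: python -m {} <command> [args...]".format(package))
    command = argv[1]
    # Refuse private modules and dotted paths, so only direct submodules are reachable.
    if command.startswith("_") or "." in command:
        raise SystemExit("invalid command: {}".format(command))
    module = importlib.import_module("{}.{}".format(package, command))
    # Assumed convention: every command module defines main(argv).
    return module.main(argv[1:])


# In a hypothetical package "mypkg" that keeps this function in mypkg/_dispatch.py,
# the whole of mypkg/__main__.py could be:
#
#     import sys
#     from mypkg._dispatch import dispatch
#     sys.exit(dispatch("mypkg", sys.argv))
```

With that in place, “python -m mypkg serve --port 8080” would import mypkg.serve under its real name and call its main, so nothing but the tiny __main__.py is ever executed as __main__.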
Edit: Released! 0.0.1 is up on PyPI and on GitHub
I’ve seen elsewhere (thankfully, not my project, and no, I won’t link to it) that people want the code of conduct to “protect contributors from 3rd parties as well as 3rd parties from contributors”. I would like to first suggest the idea that this is…I don’t even know. The US government has literal nuclear bombs, as well as aircraft carriers, and it cannot enforce its law everywhere. The idea that an underfunded OSS project can do literally anything to 3rd parties is ludicrous. The only enforcement mechanism any project can have is “you can’t contribute here” (which, by necessity, only applies to those who wanted to contribute in the first place).
So why would a bunch of people take on a code of conduct that will only limit them?
Because, to quote the related article above, “the death cry of liberalism is not ‘death to the unbelievers’, it is ‘if you’re nice, you can join our cuddle pile’.” Perhaps describing open source projects as “cuddle piles” is somewhat of an exaggeration, but the point remains. A code of conduct defines what “nice enough” is. The cuddle piles? Those conquer the world. WebKit is powering both Android and iPhone browsers, for example, making open source be, literally, the way people see the online world.
Adopting a code of conduct that says that we do not harass, we do not retaliate and we accept all people is a powerful statement. This is how we keep our own garden clear of the pests of prejudice, of hatred. Untended gardens end up wilderness. Well-tended gardens grow until we need to keep a fence around the wild and call it a “preserve”.
When I first saw the Rust code of conduct I thought “cool, but why does an open source project need a code of conduct?” Now I wonder how any open source project will survive without one. It will have to survive without me: in three years, I commit to not contributing to any project without a code of conduct, and I hope others will follow. Any project that does not have a CoC, in the interim, had better behave as though it has one.
Twisted is working hard on adopting a code of conduct, and I will check one into NColony soon (a simpler one, appropriate to what is, so far, a one-person show).
(Or, “So you want to build a new open-source Python project”)
This is not going to be a “here is a list of options, but you do what is right for you” pandering post. This is going to be a “this is 2015, there are right ways to do things, and here they are” post. This is going to be an opinionated, in-your-face, THIS IS BEST post. That said, if you think that anything here does not represent the best practices in 2015, please do leave a comment. Also, if you have specific opinions on most of these, you’re already ahead of the curve: the main target audience is people new to creating open-source Python projects, who could use a clear guide.
This post will also emphasize the new project. It is not worth it, necessarily, to switch to these things in an existing project — certainly not as a “whole-sale, stop the world and let’s change” thing. But when starting a new project with zero legacy, there is no reason to do things wrong.
tl;dr: Use GitHub, put MIT license in “LICENSE”, README.rst with badges, use py.test and coverage, flake8 for static checking, tox to run tests, package with setuptools, document with sphinx on RTD, Travis CI/Coveralls for continuous integration, SemVer and versioneer for versioning, support Py2+Py3+PyPy, avoid C extensions, avoid requirements.txt.
When publishing kids’ photos, you do it on Facebook, because that’s where everybody is. LinkedIn is where you connect with colleagues and lead your professional life. When publishing projects, put them on GitHub, for exactly the same reason. Have GitHub pull requests be the way contributors propose changes. (There is no need to use the “merge” button on GitHub — it is fine to merge changes via git and push. But the official “how do I submit a change” should be “pull request”, because that’s what people know).
Put the license in a file called “LICENSE” at the root. If you do not have a specific reason to choose otherwise, MIT is reasonably permissive and compatible. Otherwise, use something like the license chooser and remember the three most important rules:
- Don’t create your own license
- No, really, don’t create your own license
- Don’t do it
At the end of the license file, you can have a list of the contributors. This is an easy place to credit them. It is a good idea to ask people who send in pull requests to add themselves to the contributor list in their first one (this allows them to spell their name and e-mail exactly the way they want to).
Note that if you use the GPL or LGPL, they will recommend putting it in a file called “COPYING”. Put it in “LICENSE” (the licenses explicitly allow it as an option, and it makes it easier for people to find the license if it always has the same name).
The GitHub default is README.md, but README.rst (reStructuredText) is perfectly supported via Sphinx, and is a better place to put Python-related documentation, because ReST plays better with Pythonic toolchains. It is highly encouraged to put badges on top of the document to link to CI status (usually Travis), ReadTheDocs and PyPI.
There are several reasonably good test runners. If there is no clear reason to choose one, py.test is a good default. “Using Twisted” is a good reason to choose trial. Using the built-in unittest runner is not a good option — there is a reason the cottage industry of “test runner” evolved. Using coverage is a no-brainer. It is good to run some functional tests too. Test runners should be able to help with this too, but even writing a Python program that fails if things are not working can be useful.
Distribute your tests alongside your code, by putting them under a subpackage called “tests” of the main package. This allows people who “pip install …” to run the tests, which means sending you bug reports is a lot easier.
There are a lot of tools for static checking of Python programs: pylint, flake8 and more. Use at least one. Using more is not completely free (more ways to have to say “ignore this, this is ok”) but can be useful to catch more style and static issues. At worst, if there are local conventions that are not easily plugged into these checkers, write a Python program that will check for them and fail if those are violated.
Use tox. Put tox.ini at the root of your project, and make sure that “tox” (with no arguments) works and runs your entire test-suite. All unit tests, functional tests and static checks should be run using tox. It is not a bad idea to write a tox clause that builds and tests an installed wheel. This will require including all test code in the deployed package, which is a good idea.
Set tox to put all build artifacts in a build/ top-level directory.
Have a setup.py file that uses setuptools. Tox will need it anyway to work correctly.
It is unlikely that you have a good reason to take more than one top-level name in the package namespace. Barring completely unavoidable name conflicts, your PyPI package name should be the same as your Python package name should be the same as your GitHub project. Your Python package should live at the top-level, not under “src/” or “py/”.
Use sphinx for prose documentation. Put it in doc/ with a relevant conf.py. Use either pydoctor or sphinx autodoc for API docs. “Pydoctor” has the potential for nicer docs, sphinx is well integrated with ReadTheDocs. Configure ReadTheDocs to auto-build the documentation every check-in.
If you enjoy owning your own machines, or platform diversity in testing really matters, use buildbot. Otherwise, take advantage of the free Travis CI and configure your project with a .travis.yml that breaks your tox tests into one test per Travis clause. Integrate with coveralls to have coverage monitored.
A full run of “tox” should leave in its wake tested .zip and .whl files. A successful, post-tag run of tox, combined with versioneer, should leave behind tested .zip and .whl files whose version matches the tag. The release script could be as simple as “tox && git tag $1 && (tox || (git tag -d $1; exit 1)) && cp …whl and zip locations… dist/”
GPG sign dist/ files, and then use “twine” to upload them to PyPI. Make sure to upload to TestPyPI first, and verify the upload, before uploading to PyPI. Twine is a great tool, but badly documented — among other things, it is hard to find information about .pypirc. “.pypirc” is an ini file, which needs to have the following sections:
- [distutils]: only needed field is “index-servers = pypi testpypi”
- [testpypi]: fields are “repository: https://testpypi.python.org/pypi”, username: and password:
- [pypi]: fields are “repository: https://pypi.python.org/pypi”, username: and password:
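As a rough sketch, the three sections listed above can be produced with the standard library’s configparser. The function name build_pypirc and the placeholder credentials are made up for this example; the section names, the “index-servers” field and the repository URLs are copied from the list.

```python
import configparser
import io


def build_pypirc(username, password):
    """Return the text of a .pypirc with the three sections described above."""
    # interpolation=None so that a "%" in a password is stored verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser["distutils"] = {"index-servers": "pypi testpypi"}
    parser["testpypi"] = {
        "repository": "https://testpypi.python.org/pypi",
        "username": username,
        "password": password,
    }
    parser["pypi"] = {
        "repository": "https://pypi.python.org/pypi",
        "username": username,
        "password": password,
    }
    output = io.StringIO()
    parser.write(output)
    return output.getvalue()


# Placeholder credentials, for illustration only.
print(build_pypirc("your-username", "your-password"))
```

Since the result contains a password in clear text, it makes sense to keep the file readable only by its owner.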
The .gitignore file should list at least:
- build: all your build artifacts will go here
- dist: this is where “ready to release” output will be
- *.egg-info: this is an artifact of sdist that is really hard to put elsewhere
- *.pyc: ignore byte-code files
- .coverage: coverage artifact
If all your dependencies support Python 2 and 3, support Python 2 and 3. That will almost certainly require using “six” (or one of its competitors, like “future”). Run your unit tests under both Python 2 and 3. Make sure to run your unit tests under PyPy, as well.
Avoid C extensions, if possible. Certainly do not use C extensions for performance improvements before (1) making sure they’re needed, (2) making sure they’re helpful and (3) trying other performance improvements. Ideally, structure your C extensions to be optional, and fall back to a slow(er) Python implementation if they are unavailable. If they speed up something more general than your specific needs, consider breaking them out into a C-only project which your Python will depend on.
If using C extensions, regardless of whether to improve performance or integrate with 3rd party libraries, use CFFI.
If C extensions have successfully been avoided, and Python 3 compatibility kept, build universal wheels.
The only good “requirements.txt” file is a non-existing one. The “setup.py” file should have the dependencies (ideally as weak-versioned as possible, usually just a “>=” for a library that tends not to break backwards compatibility a lot). Tox will maintain the virtualenv needed based on the things in the tox file, and if needing to test against specific versions, this is where specific versions belong. The “requirements.txt” file belongs in Salt-style (Chef, Fab, …) configurations, Docker-style (Vagrant-, etc.) configurations or Pants-style (Buck-, etc.) build scripts when building a Pex. This is the “deployment configuration”, and needs to be decided by a deployer.
If your package has dependencies, they belong in setup.py. Use extras_require for test-only dependencies. Lists of test dependencies, and reproducible tests, can be configured in tox.ini. Tox will take care of pip-installing relevant packages in a virtualenv when running the tests.
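A minimal sketch of how that split might look, assuming a hypothetical project called “exampleproject” that depends on six at run time and on pytest and coverage for its tests. The names and the version bound are illustrative; install_requires and extras_require are the setuptools keywords for run-time and optional dependencies.

```python
def setup_arguments():
    """Keyword arguments that a setup.py could pass to setuptools.setup()."""
    return dict(
        name="exampleproject",  # hypothetical project name
        packages=["exampleproject", "exampleproject.tests"],
        # Run-time dependencies: a loose lower bound, no exact pins.
        install_requires=["six>=1.9"],
        # Test-only dependencies live in an "extra" named "test".
        extras_require={"test": ["pytest", "coverage"]},
    )


# setup.py itself would then end with:
#
#     import setuptools
#     setuptools.setup(**setup_arguments())
```

With an extra defined this way, “pip install exampleproject[test]” installs the test dependencies on top of the run-time ones, while a plain “pip install exampleproject” does not.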
Thanks to John A. Booth, Dan Callahan, Robert Collins, Jack Diedrich, Steve Holden for their reviews and insightful comments. Any mistakes that remain are mine!
There are a few interesting new languages that would be fun to play with: Rust, Julia, LFE, Go and D all have interesting takes on some domain. But ultimately, a language is only as interesting as the things it can do. It used to be that “things it can do” referred to the standard library. I am old enough to remember when “batteries included” was one of the most interesting benefits Python had: the language had dared to do such avant-garde things in ’99 as having a *built-in* URL fetching library. That behaved, pretty much, the same on all platforms. Out of the box. (It’s hard to convey how amazing this was in ’99.)
This is no longer the case. Now what distinguishes Python as “can do lots of things” is its built-in package management. Python, pip and virtualenv together give the ability to work on multiple Python projects, that need different versions of libraries, without any interference. They are, to a first approximation, 100% reliable. With pip 7.0 supporting caching wheels, virtualenvs are even more disposable (I am starting to think pip uninstall is completely unnecessary). In fact, except for virtualenv itself, it is rare nowadays to install any Python module globally. There are some rusty corners, of course:
- Uploading packages is…non-trivial. Any time the default tool is so bad that the official docs say “use this 3rd party library instead”, something is wrong.
- The distinction between “PyPI name” and “Python package name”, while theoretically reasonable, is just annoying and hard to explain
- (Related) There is no official PyPI-name-conflict-resolution procedure.
I’ll note npm, for example, clearly does better on these three (while doing worse on some things that Python does better). Of the languages mentioned above, it is nice to see that most have out-of-the-box built-in tools for ecosystem creation. Julia’s hilariously piggybacks on GitHub’s. Go’s…well…uses URLs as the equivalent PyPI-level-names. Rust has a pretty decent system in cargo and crates.io (the .toml file is kind of weird, but I’ve seen worse). While there is justifiable excitement about Rust’s approach to memory and thread-safety, it might be that it will win in the end based on having a better internal package management system than the low-level alternatives.
Note: OS package mgmt (yum, apt-get, brew and whatever Windows uses nowadays) is not a good fit for what developers need from a language-level package manager. Locality, quick package refreshes — these matter more to developers than to OS end-users.
Unicode is not a panacea. Some people’s names can’t even be written in unicode. However, as far as universal encodings go, it is the best we have got — warts and all. It is the only reasonable way to represent text inside programs, except for very very specialized needs (no, you don’t qualify).
Now, programs are made of libraries, and often there are several layers of abstraction between the library and the program. Sometimes, some weird abstraction layer in the middle will make it hard to convey user configuration into the library’s guts. Code should figure things out itself, most of the time.
So, there are several ways to make dealing with unicode not-horrible.
I’ve already mentioned it, but it bears repeating. Internal representation should use the language’s built-in type (str in Python 3, String in Java, unicode in Python 2). All formatting, templating, etc. should be, internally, represented as taking unicode parameters and returning unicode results.
Obviously, when interacting with an external protocol that allows the other side to specify encoding, follow the encoding it specifies. Your program should support, at least, UTF-8, UTF-16, UTF-32 and Latin-1 through Latin-9. When choosing output encoding, choose UTF-8 by default. If there is some way for the user to specify an encoding, allow choosing between that and UTF-16. Anything else should be under “Advanced…” or, possibly, not at all.
When reading input that is not marked with an encoding, attempt to decode as UTF-8, then as UTF-16 (most UTF-16 decoders will auto-detect endianness, but it is pretty easy to hand-hack if people put in the BOM). UTF-8/16 are unlikely to have false positives, so if either succeeds, it’s likely correct. Otherwise, as-ASCII-and-ignore-high-order is often the best that can be done. If it is reasonable, allow user-intervention in specifying the encoding.
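The fallback order described above can be sketched in a few lines of Python 3. The function name decode_unmarked is made up, and reading “ignore high-order” as “drop the non-ASCII bytes” is an assumption. Note that Python’s “utf-16” codec uses the BOM when one is present and otherwise assumes the machine’s native byte order.

```python
def decode_unmarked(data):
    """Best-effort decoding of bytes that carry no encoding label."""
    for encoding in ("utf-8", "utf-16"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            pass  # Not valid in this encoding; try the next candidate.
    # Last resort (assumed interpretation): keep the ASCII bytes, drop the rest.
    return data.decode("ascii", errors="ignore")


utf8_text = decode_unmarked("h\u00e9llo".encode("utf-8"))
utf16_text = decode_unmarked("h\u00e9llo".encode("utf-16"))  # codec prepends a BOM
fallback_text = decode_unmarked(b"caf\xe9!")  # valid as neither UTF-8 nor UTF-16
```

Here the first two calls both recover the original five-character string, and the last one yields "caf!". Where a user can intervene, this function would only supply the initial guess.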
When writing output, the default should be UTF-8. If it is non-trivial to allow user specification of the encoding, that is fine. If it is possible, UTF-16 should be offered (and BOM should be prepended to start-of-output). Other encodings are not recommended if there is no way to specify them: the reader will have to guess correctly. At the least, giving the user such options should be hidden behind an “Advanced…” option.
The most popular I/O that does not have explicit encoding, or any way to specify one, is file names on UNIX systems. UTF-8 should be assumed, and reasonably recovered from when it proves false. No other encoding is reasonable (UTF-16 is uniquely unsuitable since UNIX filenames cannot have NULs, and other encodings cannot encode some characters).
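One possible way to “reasonably recover” in Python 3 is the surrogateescape error handler from PEP 383. This is a choice made for this sketch, not something prescribed above, and the two helper names are made up. Bytes that are not valid UTF-8 are carried through as lone surrogate code points, so the original bytes can be rebuilt exactly.

```python
def filename_to_text(raw_name):
    """Decode a UNIX filename as UTF-8, keeping undecodable bytes recoverable."""
    return raw_name.decode("utf-8", errors="surrogateescape")


def text_to_filename(name):
    """Inverse of filename_to_text: rebuild the exact original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


good = filename_to_text(b"caf\xc3\xa9.txt")  # valid UTF-8: decodes normally
odd = filename_to_text(b"caf\xe9.txt")       # Latin-1 bytes: 0xE9 becomes U+DCE9
assert text_to_filename(odd) == b"caf\xe9.txt"  # lossless round trip
```

The escaped characters cannot be encoded with the default strict error handler, so they should be replaced or escaped before being written to output that is supposed to be real UTF-8.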