I have been working on a Python template repository as part of my day-job at Orcfax.
It is based on the popular pypa sample project and adds important tooling that supports quality control for projects worked on by multiple developers. Primarily, I add editor defaults and linting, and prepare the repository for unit testing and then deployment.
I have migrated a copy of the template I created for Orcfax to a new file format organisation I have created to capture work I am doing around tools such as ffdev.info (the PRONOM signature development utility).
The new template repository can be found here: ffdev-info/template.py.
Linting as understanding
The title of this blog came from my introduction to a new piece of code. I wanted to see if I could use the script, but the code was a bit out of date and didn’t use strong Python coding standards – such as those captured in the famous Python PEP-8 guidelines.
I also ran into a similar thing just last week when I converted the code supporting Genesis of a File Format into Python 3 from Python 2.
In both cases I needed help with:
- introducing more idiomatic patterns to the code-base,
- increasing its readability,
- deciphering parts of the code that weren’t immediately clear, such as redundant flow of execution.
Linting is a process that can help with all three of these things. Linting refers to a process of “static analysis” that identifies programming errors, bugs, stylistic errors, and (according to Wikipedia) suspicious constructs; I’d re-frame that last one as under-optimized code layout, though it is true that linting can also identify potential safety and security concerns.
Some linting tools can fix code in-place, others require human interaction. Both have their benefits that can help us to understand a new codebase better.
Adding a touch of lint
I took the linting components of the template repository, copied and pasted them into the unfamiliar codebase, and started to understand the project through its linting output.
The eyeglass repository is a good example of a very flat project, i.e. a project that isn’t expected to be packaged and is likely to be run as a standalone script.
Its layout before adding linting:
│ ├── eyeglass-big-endian.eygl
│ ├── eyeglass-bof-eof.eygl
│ ├── eyeglass-characterisation-signature-file.xml
│ ├── eyeglass-complete-signature-file-DROID-6.0-only.xml
│ ├── eyeglass-complete-signature-file.xml
│ ├── eyeglass-id-signature-file.xml
│ ├── eyeglass-invalid-endianness.eygl
│ ├── eyeglass-little-endian.eygl
│ └── eyeglass-no-eof.eygl
And the files I added:
│ ├── local.txt
│ └── requirements.txt
Breaking the changes down
The files added can be summarized as follows. It’s a bit opaque to begin with but I will try to go into more useful detail. I also recommend taking a look at the different links in-line for more information.
We add:
- pytest.ini to allow us to run linting and test processes.
- .pre-commit-config.yaml to allow us to configure a runner for other linting processes.
- .gitignore to allow us to ignore artifacts from linting or testing that we don’t want to commit to source control.
- requirements/requirements.txt to allow us to install linting dependencies.
- .ruff.toml, a configuration file for some of the more opinionated tooling we are adding.
- .vscode/settings.json to help us configure our code editors consistently so that our settings do not change existing code unexpectedly.
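As an illustration, a minimal .pre-commit-config.yaml might look something like the sketch below. The repositories and hook ids mirror tools listed later in this post, but the rev values are placeholders that you would pin to current releases:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0  # placeholder: pin to a current release
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-case-conflict
  - repo: https://github.com/psf/black
    rev: 24.3.0  # placeholder: pin to a current release
    hooks:
      - id: black
```

pre-commit reads this file and runs each hook against the files being committed (or against everything, when run via a tool such as tox).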
All in all, those files enable us to run all of the following:
- pre-commit: check-yaml – checks yaml files for parseable syntax.
- pre-commit: check-json – checks json files for parseable syntax.
- pre-commit: check-toml – checks toml files for parseable syntax.
- pre-commit: end-of-file-fixer – ensures that a file is either empty, or ends with one newline.
- pre-commit: trailing-whitespace – trims trailing whitespace.
- pre-commit: check-case-conflict – checks for files that would conflict in case-insensitive filesystems.
- psf/black – Python code (layout) formatter.
- pycqa/isort – sorts Python imports idiomatically (“isort your imports, so you don’t have to.”)
- astral-sh/ruff – Python linter, written in Rust.
- igorshubovych/markdownlint-cli – checks markdown files and flags style issues.
- codespell-project/codespell – checks code for common misspellings.
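As a sketch of the kind of change these formatters make, isort (with its default settings, as I understand them) alphabetizes imports within each section and places straight imports ahead of “from” imports. A jumbled import block would settle into something like this purely illustrative example:

```python
# Illustrative only: before isort the imports might have read
#   from collections import OrderedDict
#   import sys
#   import os
# After isort, the standard-library imports are grouped and alphabetized:
import os
import sys
from collections import OrderedDict

# The imports remain fully usable; only their layout has changed.
config = OrderedDict([("path", os.sep), ("python", sys.version_info[0])])
print(config["python"])
```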
Installation and running
Installation can be done as follows:
python3 -m venv venv
source venv/bin/activate
python -m pip install -r requirements/local.txt
python -m tox -e linting
Depending on the project, the output of the tools running for the first time will vary. The eyeglass project looked as follows:
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook
trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook
check for case conflicts.................................................Passed
black....................................................................Failed
- hook id: black
- files were modified by this hook
All done! ✨ 🍰 ✨
2 files reformatted.
markdownlint.............................................................Failed
- hook id: markdownlint
- exit code: 1
README.md:4:81 MD013/line-length Line length [Expected: 80; Actual: 195]
README.md:6:1 MD018/no-missing-space-atx No space after hash on atx style heading [Context: "###Specification"]
README.md:35:1 MD018/no-missing-space-atx No space after hash on atx style heading [Context: "###Further reading"]
README.md:37:1 MD034/no-bare-urls Bare URL used [Context: "http://exponentialdecay.co.uk/..."]
pylint...................................................................Failed
- hook id: pylint
- exit code: 28
************* Module eyeglass_default
eyeglass_default.py:1:0: C0114: Missing module docstring (missing-module-docstring)
************* Module eyeglass
eyeglass.py:1:0: C0114: Missing module docstring (missing-module-docstring)
eyeglass.py:10:0: R0902: Too many instance attributes (20/7) (too-many-instance-attributes)
eyeglass.py:54:8: C0103: Variable name "d" doesn't conform to snake_case naming style (invalid-name)
eyeglass.py:75:15: R1732: Consider using 'with' for resource-allocating operations (consider-using-with)
eyeglass.py:127:4: C0116: Missing function or method docstring (missing-function-docstring)
eyeglass.py:131:4: C0116: Missing function or method docstring (missing-function-docstring)
eyeglass.py:192:4: C0116: Missing function or method docstring (missing-function-docstring)
eyeglass.py:60:8: W0201: Attribute 'bigendian' defined outside __init__ (attribute-defined-outside-init)
eyeglass.py:62:12: W0201: Attribute 'float' defined outside __init__ (attribute-defined-outside-init)
eyeglass.py:70:12: W0201: Attribute 'int' defined outside __init__ (attribute-defined-outside-init)
Given an output like this, I would then start to work through the output and fix the issues. It may not be immediately clear how this helps my understanding, so I will elaborate a bit more below.
Automated fixes help to create an idiomatic view of code across distributed environments, e.g. when working on code with bigger teams, or even just moving code between two computers.
Black is an exceptional tool here: the end result for a user is that they can cast their eye over black-formatted code and understand its shape with greater ease than if the code remained unformatted, i.e. as it was perhaps first drafted.
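As a rough, hypothetical sketch of what that means in practice (not taken from the eyeglass code), black normalizes spacing and quoting without changing behaviour:

```python
# Illustrative only: before black this might have read
#   totals = { 'a':1,'b':2 }
#   print( sum( totals.values( ) ) )
# After black, the layout is normalized; the behaviour is identical.
totals = {"a": 1, "b": 2}
print(sum(totals.values()))  # prints 3
```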
Opinionated project settings also provide automated fixes. These tend to happen as the developer writes code. The most important setting I have seen reduce friction on a project is the removal of whitespace at the end of individual lines of code. Trailing whitespace is often added accidentally when typing or when browsing a codebase. The problem with trailing whitespace, however, is that it can impact the output of a “diff” (an important tool used in code review to view changes, differences, or “diffs”), because diffs compare code line-by-line. Even if there is no syntactic change to a line of code, the introduction of whitespace creates a diff and asks someone reviewing the code to answer why something might have changed.
diff --git a/eyeglass.py b/eyeglass.py
index f8c515a..86387b3 100644
@@ -109,7 +109,7 @@ class Eyeglass:
- with open(filename + ".eygl", "wb") as file: <-- original line
+ with open(filename + ".eygl", "wb") as file: <-- added whitespace
file.write(struct.pack(self.byte, self.version)) # unsigned char
file.write(struct.pack(self.bool, self.bigendian)) # bool
Given the increasing prevalence of markdown, improving markdown’s consistency across projects is as important as code consistency. Markdownlint looks for issues with semantic headings, overly long lines, and other issues that might prevent correct display across platforms, as well as identifying improvements, such as marking up code snippets to make use of syntax-highlighting, thus improving readability.
It is the automatic identification of coding issues that perhaps helps me the most when finding my way into a codebase.
Linting tools will return issues with code, and the process of visiting those issues one-by-one to fix them, even if it isn’t your own codebase (it can always result in a pull-request!), helps one to grok the codebase and its intentions.
I created a few example linting outputs below. Some flag actual errors, which you are less likely to find in an already working piece of code, but others are corrections you might want to make to render the code more readable and understandable.
example.py:1:0: C0114: Missing module docstring (missing-module-docstring)
example.py:1:0: C0116: Missing function or method docstring (missing-function-docstring)
example.py:1:17: C0103: Argument name "x" doesn't conform to snake_case naming style (invalid-name)
example.py:1:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
example.py:5:4: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return)
example.py:14:4: C0103: Variable name "testVar" doesn't conform to snake_case naming style (invalid-name)
example.py:18:7: C1802: Do not use `len(SEQUENCE)` without comparison to determine if a sequence is empty (use-implicit-booleaness-not-len)
example.py:11:8: W0612: Unused variable 'key' (unused-variable)
example.py:22:0: C0116: Missing function or method docstring (missing-function-docstring)
example.py:24:4: E0602: Undefined variable 'sys' (undefined-variable)
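For context, a hypothetical example.py (my own reconstruction, not any real project file) that would trigger several of the style messages above might look like the sketch below. Note that it runs without error, which is exactly why static analysis is needed to surface these issues:

```python
def do_something(x, defaults=[]):  # W0102: mutable default; C0103: name "x"
    if x:
        return defaults
    else:  # R1705: unnecessary "else" after "return"
        testVar = {"a": 1}  # C0103: "testVar" is not snake_case
        for key, value in testVar.items():  # W0612: "key" is unused
            print(value)
        if len(defaults):  # C1802: len() used without comparison
            return defaults
        return []
```

Missing module and function docstrings (C0114, C0116) would be reported against this file too.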
From the errors above:
Adding docstrings (documentation strings), e.g. for the module and its functions, asks that we look at a piece of code and answer what the code might be doing overall, and what its smaller units are trying to achieve. A docstring for the function do_something() might look as follows:
def do_something() -> str:
    """This function performs a very specific function and outputs some
    information about the result as a string."""
Flow of execution can be improved by looking at redundant constructs such as the “unnecessary else” above. Redundancy adds complexity and makes code more difficult to read and understand. Removing redundancy can help you to understand more precisely what something may be doing. More advanced concepts such as following the happy path may follow after you start to pick up on these things.
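A before-and-after sketch (hypothetical, not from eyeglass) of the “unnecessary else” fix looks like this:

```python
# Flagged version: the "else" is redundant because the "if" branch
# always returns (pylint R1705, no-else-return).
def classify_flagged(value):
    if value >= 0:
        return "non-negative"
    else:
        return "negative"


# After the fix the function is flatter and the happy path reads
# straight down the page.
def classify(value):
    if value >= 0:
        return "non-negative"
    return "negative"
```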
Unused variables are also redundant and once removed it becomes easier to see how the code fits together.
Naming variables with meaningful names means that a reader can come along and immediately see what information a variable is supposed to hold. In only a small number of instances should they be single character names, e.g. i is often used for “index” but idx is often already a lot better; x could have any meaning and use. Identifying the use of x in my example code and trying to rename it helps to reveal the code’s intentions.
There are some clues that some of this code may be dangerous. Having a look at why a list is supplied to a function as an argument (in this example) and considering how the code can be refactored to avoid this helps you to improve your understanding, as well as improve the code for the benefit of others. It may be that the function has too many responsibilities, or the function sits within a bigger process that can be further separated out into other smaller functions. There may also be other design patterns that the code can adopt that the original developer hasn’t considered.
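To make the “dangerous default value” warning concrete, here is a sketch of the usual problem and the usual refactor (the function names are my own, purely illustrative):

```python
# The default list is created once, when the function is defined, and
# is then shared between every call that relies on it.
def record_flagged(item, items=[]):  # pylint: dangerous-default-value
    items.append(item)
    return items


# The common refactor: use None as a sentinel and build a fresh list
# per call, so calls no longer leak state into one another.
def record_fixed(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items


print(record_flagged("a"))  # ["a"]
print(record_flagged("b"))  # ["a", "b"] -- state leaked between calls!
print(record_fixed("a"))    # ["a"]
print(record_fixed("b"))    # ["b"]
```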
While you might not want to see linting errors in a project, their existence in a codebase that you are trying to learn can be a benefit. You can start to walk through each of the issues in turn, cleaning up the codebase as you go, and at the end of the process you should be able to read things more clearly as well as understand what the code is doing.
For your future projects, your use of linting will help make your work more understandable to someone else. As you develop your knowledge of a programming language, following the linting trail helps you to build your own knowledge base of good and bad practices and you will see the shape of your code improve as well as its readability and reliability.
There are other benefits to good linting, including improving the code review process. The first step of which is to have agreed standards, and making those machine actionable in the first instance means that colleagues can focus on more substantial changes that improve the quality of the project.
Maintenance is a word I am yet to touch upon in this blog, but following an idiomatic approach to coding that improves its overall quality when it is first written helps improve its maintainability for others (and yourself when you revisit your past-self’s work). This is always something we want to consider in the field of digital preservation and beyond.
Like a compiled language
The output of linting such as in this template repository makes Python work much more like a compiled language, in which syntactic and semantic errors need to be identified up front so that code can be compiled into an executable. There are benefits and drawbacks, as one might expect. Sometimes the number of issues is very high the first time the tools are run, and it may take a while to fix them all. On the other hand, as you get more used to the errors, you also become more articulate in the language, creating fewer and fewer issues as you write your code.
Give it a try!
This is probably one of the hardest blogs I’ve tried to write – making very explicit (probably overly so) something that works very seamlessly when put into practice. Once linting tools are added to a Python project it is as simple as running python -m tox -e linting, and then following the output.
The sample project is available for anyone to use.
If you give it a whirl, let me know! If there are improvements you’d like to see made, either leave an issue on GitHub, or let me know in the comments here. Alternatively, submit a pull request!
Finally, let me know what some of your favorite linting tools are and any that I should be using!
- As I was writing this blog a potentially useful study group for Python was announced that may interest readers: https://www.dpconline.org/events/eventdetail/211/-/python-study-group-launch-and-information-session
- I write a little bit more about the benefits of learning to code in the GLAM sector here: http://exponentialdecay.co.uk/blog/context-switching-do-you-really-need-to-be-an-archivist-programmer-one-perspective/
A point of note
I didn’t go into the fact that some of the linting messages also have “codes” associated with them, e.g. in
'example.py:22:0: C0116: Missing function or method docstring (missing-function-docstring)'
the code for this message is C0116.
These codes can be traced back to different industry standards for what they mean, and pylint, which I find one of the more useful tools for its messages, lists them here: https://pylint.readthedocs.io/en/latest/user_guide/messages/messages_overview.html. Marking up error messages like this is something we’ve been trying to do in digital preservation as well, such as in JHOVE. Maybe there’s some more we can learn from this linting example than I have already described?
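As a small illustration of these codes in action, pylint messages can also be suppressed inline, by code or by symbolic name, though fixing the underlying issue is usually better. This snippet is purely hypothetical:

```python
"""Example module demonstrating an inline pylint pragma."""


def helper():  # pylint: disable=missing-function-docstring
    return 42


print(helper())  # prints 42
```

The symbolic name (missing-function-docstring) is generally preferred over the raw code (C0116) because it is self-describing to the next reader.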