Information Maintainers talk: Something something twenty years open source…
It was back in May, yes, way back when, that Jordan Hale of the Information Maintainers group put the following to me:
I write today to ask if you’d be interested in being our special guest on the next Information Maintainers call … we thought your perspective on working within and maintaining decentralized, small-group systems and development infrastructures would be really rad to hear about. What do you think?
I was pretty stoked. I have followed Information Maintainers for 18 months or so, and a small change in my work schedule meant that I have been able to attend their meetings this past year in person. I finally jumped in when I heard that the sister of a colleague in New Zealand, Kelly Pendergrast, would be talking alongside Juliana Castro on design and communication.
Since then we have had the pleasure to hear about delivering logistics and services within Toronto Library; the reaction to C19 in a health services clinic in Pennsylvania – right at the moment of the initial lock-down period. And in just the last month we have questioned what the role might be for memory institutions in dismantling structural inequalities.
The group embodies their community values and it is always a pleasure to be involved in their conversations, which are always at such a high degree of social consciousness.
Long-form notes and minutes from past meetings can be found in a shared Google Doc. The information maintainers mailing list is also a good read.
I ended up speaking mid-August (21 August 2020). I did need all that time to prepare. To understand if I could offer something, and maybe help initiate a productive conversation. I do not know if I quite managed to hit the notes Jordan maybe would have initially envisioned, but Jordan was a great facilitator. There were a few new faces in the group too. And Devon from Information Maintainers shared a link in their post about the talk the week before, which gave me a lot more confidence before the meeting and suggested that maybe something of what I put together might be of interest.
In case it is of interest to you too, a transcript follows. Let me know what you think in the comments, or via @beet_keeper on Twitter.
Something something twenty years open source…
I noted to Jordan when we first spoke that this might be an existential meditation but I do not know if it really lives up to that. But what I hope I get right, given this group and how rich discussions are, is that I will ask a lot of questions that I hope will stimulate some good conversation. I hope.
Introduction
My GitHub profile shows that I have been at this since 2012. But actually, outside of GitHub, I was publishing on a site called PlanetSourceCode.com (no longer an active domain) since 98, 99.
Applications like Yoda Lottery Number Generator, and the original Pigeon Racing game: Pidge Racer.
My publishing did not really die-down, but GitHub did really make things quote-unquote take-off? For a number of us, I think. At least, provided better tools to socialize our work.
It was at The National Archives UK that my work really became open by default. I say open by default because I could not get the support I needed at work and recognized open source might be one way around that.
I created two related projects there. Around file format identification and democratizing that work. This was related to in-house tools we also published to the public and other government departments.
Interestingly, both tools I created were open source, but more important than their being open-source was just their publication as a service/services that could be used.
One tool – the signature development utility, enables users to write file format signatures compatible with the DROID format identification tool which scans file formats for byte sequences that match file format magic numbers.
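To make the idea of a magic number concrete, here is a minimal sketch of byte-sequence matching in Python. It is not how DROID or the signature development utility work internally; the `SIGNATURES` table and the `identify` function are hypothetical names for illustration, and real signatures can also describe variable offsets, wildcards, and end-of-file sequences.

```python
# Illustrative only: a simplified stand-in for a signature file.
# Each entry maps a format name to (offset, byte sequence).
SIGNATURES = {
    "PDF": (0, b"%PDF-"),
    "PNG": (0, b"\x89PNG\r\n\x1a\n"),
    "ZIP": (0, b"PK\x03\x04"),
}


def identify(data: bytes) -> list[str]:
    """Return the names of all signatures whose bytes sit at their offset."""
    matches = []
    for name, (offset, magic) in SIGNATURES.items():
        # Compare the slice of the file at the given offset to the sequence.
        if data[offset:offset + len(magic)] == magic:
            matches.append(name)
    return matches


print(identify(b"%PDF-1.7\n"))
```

Writing a new signature, in this simplified picture, amounts to adding one more entry to the table, which is the part of the process the utility opened up to colleagues.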
Being able to publish that onto a web-site was key to opening up that process for other colleagues in the field. Where the processes we used internally were very much linked to a closed database with restricted access.
It is interesting for me to reflect on taking the “code” out of open source for a moment, and think about the relationship with publishing in general. We really did not get any investment in code from external contributors but folks did get some use from the tool.
Ideally, what I would have loved to have happened by having the code in the open would be for others to take it and extend the user-interface, making it more user-friendly, and able to aggregate lots of these file format signatures into a test file; where the tool could only really make one at a time. The code is all there, but I did not have the user interface skills at the time. Though, I kind of do have a bit more experience now, so this month, there might be some movement there.
Still, even without a great deal of participation in that effort, and working more as a lone developer, I do ask questions, both from the time and now with hindsight:
- Should I have promoted it more? What are the implications of that? I really only had my personal blog to talk about it…
- Was it the right language to write in? PHP with clunky vanilla JavaScript and some pretty shaky HTML!
- If it was the right language for the need, would things like the lack of testing or quality of code put people off? I imagine it might…
Same career in a new town…
Another nice open-source project came about in New Zealand. Again related to content analysis.
Though whether we would have written it, or found ourselves contributing to another open source development was probably 50/50. Contributing to another project would have been really nice.
I started with some due diligence on a tool called C3PO. (The C3 came from Clever Crafty Content Profiling.) The tool was developed by Petar Petrov and Artur Kulmukhametov at the Austrian Institute of Technology in Vienna.
It would take the analysis of another tool called FITS and provide neat collection level statistics and charts.
For a developer without the ability to create compelling visuals it was a good looking choice.
Even with some decent effort over one weekend, and then back at work on the first Monday back after deciding to try the tool out – I was not able to get the application up and running. I did not understand the dependencies well enough. It used Java which I had always found difficult, and something called the Play Framework for publishing web apps. Any particular dependencies in general were going to be tough to sell to the department, which had pretty rigid rules around that because of the other departments on the same network.
So we started to look at creating our own. It takes the output of format identification tools, and splits it out into more atomic data components that help with appraisal and submission into a digital preservation system.
I’m not very good at naming. For a while it lived as the droid-siegfried-sqlite-analysis-engine but a colleague at Archives NZ has since suggested Demystify, so let us go with that!
The features of the tool.
- Naive coding style, at the time!
- Core Python – that is, it only used what was standard in the Python language.
The language (Python) allows you to import third party libraries through a tool called PIP. PIP will save those libraries into the correct places in the operating system, or virtual environments for you, and then they can be used in your modules.
But again working around dependencies in the department was going to be tough.
At the time, PIP could not even be used over the network, so there would need to be some managing upwards across teams and outside of Archives itself if we were to try that.
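As a rough illustration of what "core Python only" can look like for this kind of analysis, the sketch below loads a CSV export from a format identification tool into SQLite and counts files per format, using nothing outside the standard library. It is not the tool's actual code: the column names, the sample rows, and the `load` and `format_counts` functions are all assumptions made for the example, and real exports carry different and many more fields.

```python
import csv
import io
import sqlite3

# Hypothetical export: column names and values are illustrative only.
SAMPLE_CSV = """path,format_id,format_name,size
/data/report.pdf,fmt/276,PDF 1.7,1024
/data/scan.png,fmt/11,PNG 1.0,2048
/data/notes.pdf,fmt/276,PDF 1.7,512
"""


def load(csv_text: str) -> sqlite3.Connection:
    """Load identification results into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files "
        "(path TEXT, format_id TEXT, format_name TEXT, size INTEGER)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    # Named placeholders are filled from each row's dictionary keys.
    conn.executemany(
        "INSERT INTO files VALUES (:path, :format_id, :format_name, :size)",
        rows,
    )
    conn.commit()
    return conn


def format_counts(conn: sqlite3.Connection) -> list:
    """Return (format_id, count) pairs, most common format first."""
    query = (
        "SELECT format_id, COUNT(*) FROM files "
        "GROUP BY format_id ORDER BY COUNT(*) DESC, format_id"
    )
    return conn.execute(query).fetchall()


conn = load(SAMPLE_CSV)
print(format_counts(conn))
```

Because `csv`, `io`, and `sqlite3` all ship with Python, a script like this runs anywhere the interpreter itself is allowed, with nothing to install over the network.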
I work a little more freely now, but it taught me a lot. And having observed this situation in two organisations, I try not to assume what liberties folks have at their disposal in their organisations to install and use others’ software. Everyone is dealing with something different. If you can get hold of a programming language and use it, I still believe there is a lot that can be achieved but I know for some even that will be tough.
I should note as well that I am not advocating that everyone should code here, there is a lot we can all contribute to open source without code.
Another feature of the tool was that as we started to demonstrate the value of it, we were able to socialize it a little more with the organization, and we sought the input of the archivists and potential users to include entry-level digital preservation tool-tips and in-application documentation.
Because it is open source, that knowledge is encoded now and available just by downloading the tool from GitHub.
Which I think particularly reinforces my, and I hope others’, decisions to create open source software.
Some questions I have about that tool’s experience are:
- Were the C3PO repository more active, would it have made a difference to how we first attempted to engage with it? (Dearly, I think contributing to it would have been perfect for us at the time. I think it can enhance reputation and it is really important not to reinvent the wheel and to support projects as we ourselves hope projects might be supported.)
- How many others in the call have looked at something but found its entry point too difficult, and so chosen to write something for yourselves? Or just find a better alternative? Maybe even paid for something?
- Do many others have restrictions on what they can install at work? Not just to code, but any tools they might want?
Creating it initially was easy to do alone, but eventually it needed the team at Archives NZ contributing their feedback into it to create something more useful.
Making it public was an important goal achieved.
We were sharing knowledge…
Ideally though putting it into the commons. We were:
- Seeking input from others, looking for bugs and fixing them, there was a hope for things like translation too. Which would have been cool.
I mentioned in response to Devon’s forum message this week, that there is a temptation for me to say there is a great intersection between what we need to do in digital preservation and the benefits of open source. But for all the analysis tools, arrangement, or packaging tools we create, is it in fact just a reaction to the closed sourced-ness of what we work with around the outside of digital preservation workflows?
If you look at some of the other work we were doing – we were chaining things like the output of our analysis tool to other mechanisms that interacted with much bigger systems.
Open source tool development does start to look like a reaction…
- Tools for parsing metadata from enterprise records management systems which largely seem to be closed-source. And the extract of data not always intuitive. The records are yours of course! One hopes!
- Tools for creating submission information packages for digital preservation systems. Which is work I do see lots of folks repeating. Depending on your system.
When talking about records we talk about create-to-maintain. It would be great to see more open source options when we create and store records. Rather than having to write more code in response when it comes to disposal and archiving.
I do not really have a point there. It is just something I like to meditate on.
Working professionally…
I should touch on my current role.
How do things change working in open source professionally?
Well, first, I think for all of us doing this. We are already working professionally.
I appreciate there might be great gaps in the resources available to you. Or the amount of buy-in from management.
But yeah. Working for a company that is known for software, perhaps, more specifically services around their software does not change too much. (I do not feel, as I write this today anyway).
One of the things folks can really benefit from is peer review of code. I had missed that from my first role many moons ago, but it has been good to shape my code style up recently.
Being able to work with code daily too, and not in your spare time outside of work to sell it up to an organisation really makes a difference to your personal well-being!
And also, it does prepare you a little more to take on more complex projects, and make your work feel more professional and make it something you are better able to maintain into the future.
Some of the challenges are familiar. Even a well known application does not necessarily have a massive number of public submissions. In a more well-known app there is an understanding too that more submissions are not necessarily a blessing to the product as it is relied upon in certain ways by its users. It is a skill for our product owner to help maintain the tool’s consistency.
And for developers, a skill in shaping external contributions into something compatible to the styles and encoded knowledge of the system proper. In Jordan’s email before the talk – the ‘ton’ of patience – the patience and community building skills.
We spoke about resources. Well, even having two of you looking at the same thing seems to be able to make a difference in quality. In splitting responsibilities, like coding while another writes documentation, or testing, or deployment scripts, for example. Of course where I work, probably has a factor of ten more who work on the product life-cycle. Not insignificant, but not quite the Linux kernel project either.
And if you only need two, at the very least for peer-review, there are ways you can set that up with distributed systems today like GitHub or GitLab.
All the projects…
I have spoken about two of my projects, but my portfolio of projects is many. I have selected these two because they are probably the ones I have invested the most in. And also, they speak most about my intentions and hopes for working open source in digital preservation and information records management.
There is an emphasis on maintaining the long-term record here. Looking after information and opening it up to others.
Making it open is always in the hope that it will be useful to at least one other person or group of people.
I am a fan of the null hypothesis. Knowing something is out there and has been tried, even if it did not work, or showed something other than you anticipated is important too. You can save others’ time by putting even your most esoteric, or drafty work out there.
There are a lot of other projects available on my repository and sub-repositories.
None of them really got open source magic sprinkles though! It would be nice to figure out the magic for just one tool. But I do suspect it would be asking to bite off quite a lot… which I have other questions about.
Participating in a study over Christmas, from University of San Diego, one question that popped into my head was about what it means for a project to ostensibly go viral? To be what some might judge as successful?
They say to make a web-site successful you need an Instagram/Facebook page/and a Twitter account, as well as your website. But what about open source software projects? The more you would like to share, do you also need to prepare to write the code, and then follow up with a Code4Lib-like journal?
Or be out there promoting your work in-person at conferences?
- How do you do that without backing?
- How do you do that if you are working in your spare time?
- Does it differ culturally? Does a larger country make a difference? Or do some locations have greater openness to technology?
- Would a Reddit post be enough? There are good threads for specific languages but I have not done that…
Which speaks to another point. It is scary. And there are still plenty of issues around that out there…
It is almost easier to not promote work.
I feel publishing might be safe (from my experience anyway); fewer people are looking than you might think. And fewer have the skills to look too deeply. But actively promoting something to seek contribution does feel like a much greater leap.
The perfect language for us?
Going back to the top a little.
We had one project that was difficult to resurrect in C3PO. (Though I think others might have had better luck with that one since). But it does make me think, using existing code within a few years of its creation is such a low bar. Others in the room have more experience here – so how do we really preserve software for use later on?
And related I noted there were restrictions at previous organisations with what we could do. The resolution was to work in core language libraries only. I personally think there is a resilience that brings. And a potential to “preserve” – all you need is the code, and its runtime. But there is some discipline to that. There are trade-offs to enjoyment, and ease of creation. Would it be worth it? (After all, what is the life-span of some of these tools?)
And related further – I really like to think about this question – is there a perfect language? Is it Python? Is it something that exists? Is it something new that does not exist yet? Would some standardization of languages and practices help specifically, code sustainability? And socialization, and sharing of knowledge?
What would it need – ease of installation, ease of translation, ease of writing tests, and ease of writing documentation are big considerations when choosing something as well.
In short. Is there a framework that the heritage sector could adopt to share vast knowledge and skills and amplify capability?
That is not really a conclusion to this talk today, but it does happen to bring me to the end.
Really the last thing to put out there is that I would love to know what resonates with others in this very brief reflection here? And what does open source mean to them?
Thank you.
Video:
A video of the talk was published to YouTube:
See also:
- The Information Maintainers
- Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure by Nadia Eghbal