User research report: software forges exports & usage

Executive Summary


The Friendly Forge Format (abbreviated F3) is a project to develop an Open File Format for storing the information from a forge such as issues, pull/merge requests, milestones, release assets, etc. as well as the associated VCS (Git, Mercurial, etc.). It will allow software developers to easily move from one forge to another, work offline or in a private network. The interviews revealed developers would greatly benefit from a high quality data repository that provides them with:

  • A stable data schema, backward compatible so the software they write does not unexpectedly break, with detailed release notes when it changes
  • An exhaustive and up to date reference documentation for all data structures
  • Downloadable data files with their modification history (i.e. git or another VCS) to know when new data is available and to see the differences when an update happens, for debugging and tracking purposes

Some findings were unexpected and heavily influenced the recommendations:

  • All developers write scripts to cleanup the data and cope with errors originating from the software forges from which they download the data. This is a very significant part of their initial development and continues with the maintenance process when data needs to be updated. A reliable data repository that contains data that is carefully checked for errors before publication is a significant added value for all developers because it reduces their workload.
  • Most forges make an effort to publish data in multiple formats and protocols but it turns out most developers transform the data they download instead of using the original version. Therefore using a single format that is universally supported (e.g. JSON) is enough.
  • Most developers do not rely on the documentation: they try to guess the structure of the data and its meaning by observing the content of the forges, their API and export format when available. It follows that high quality documentation will only be of use when the developer gets stuck guessing the meaning of a data field.

The recommendations for an Open Format targeting developers using forges are therefore to focus on (in that order):

  1. Theme: Cleanup: ensure F3 exports are published only if they validate against a well documented schema. When the schema changes, ensure it is backward compatible. When a new F3 schema is not backward compatible, the data should be published in both the old schema and the new schema for a period of time that allows developers to update their software.
  2. Theme: Documentation: a detailed documentation should be written for each F3 version. It should be included, for the most part, within the schema describing the data to facilitate the maintenance. The research does not show how and when the documentation is used. Data about the documentation usage should be collected after it is published.
  3. Theme: Modification history: the forge exports should be made available in VCS repositories so their modification history is published as well as their content.
  4. Theme: Format: publishing the data in a single, well documented format is acceptable because developers always convert the dataset they download into another format, even when it is available in multiple formats.

Aim


Identify emerging themes related to developers moving data from one forge to another and provide recommendations to guide a new version of the F3 format. Its ambition is to publish forge data originating from existing forges in a manner that takes into account the needs of developers.

Research Methods


Intercept interviews

The users willing to participate in this research answered the interview designed for developers working with software forges. The invitation email explained what F3 and the research are about.

Affinity mapping

The interview transcripts were used to draw an affinity diagram. The outcome is the following list of emerging themes.

Collecting and preparing material

While the interviews were conducted, supporting material was collected or prepared: it is the raw material to be published and required a significant amount of work. Smaller items that should be present (such as the License, information about the source of the data, etc.) were also collected.

All this material is published on the website https://f3.forgefriends.org, the forum and the repositories.

Participants/Sample


The participants are developers who work with software forges (as admins or users), with different backgrounds and focus.

  • Software developers
  • System administrators operating a software forge
  • Forge developers
  • Software developers writing tools that rely on forges

It should be noted that project managers are not participants because they organize the work but are not involved in the actual movement of data between forges.

Results


The results are presented as chapters matching the themes that emerged from the user interviews, in order of importance. The quotes are extracted (translated when needed) from the interview transcripts.

Developers guess before reading the documentation

Interviews with software developers suggest the documentation for existing export formats is lacking. Because of this widespread shortage of quality documentation, developers do not read it and prefer to reverse engineer the structure and meaning of the data based on its content. They focus on what they think is relevant for the task at hand. Only when a problem arises or data seems to be missing do they browse the documentation, searching for the answer. In other words, no matter how good the documentation is, the developer will attempt to guess the meaning of the data and is unlikely to read the documentation unless they face a problem.

Q: Have you ever been blocked by the lack of documentation?
A: No, it is not a blocker. You may waste a day or two but you manage. Or, if there is data you don’t understand, you don’t use it.
Q: Did you ever find the answer to a question about a field in the documentation?
A: On some topics, yes. On fields, I don’t think so. But there are other official documentation sources and by cross-referencing them, one can find the desired information.

Formats are converted into other formats

Most of the time developers convert the format in which forge information is available into their preferred format. None of the interviewees use the data as-is.

Data on the Web Best Practices suggests that a dataset should be available in multiple formats:

Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.

This desirable goal conflicts with the variety of motivations behind the systematic format conversion, ranging from facilitating fast search in a large data corpus to unifying heterogeneous sources into a format common to all of them.

the tools that I installed enable the usage of a pivot format to normalize the data from N forges that have datasets containing similar things but under different formats

Identifying all use cases and providing the datasets in a format that would effectively relieve the developer from the burden of format transformation is a huge undertaking. Even when the dataset distribution is available in multiple formats (CSV, XML, JSON, etc.), the majority of developers still feel the need for a format conversion. It is sometimes just a manifestation of the NIH (Not Invented Here) syndrome but is almost always justified by the need to clean up the data, as explained below.
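
The pivot-format normalization the interviewee describes could be sketched as follows. All forge names and field names below are hypothetical, invented only to illustrate mapping heterogeneous exports onto a common schema:

```python
# Minimal sketch of a "pivot format" normalization step, assuming two
# hypothetical forges that name the same issue fields differently.
# The field names are illustrative, not part of any real forge API.

def normalize_issue(raw: dict, forge: str) -> dict:
    """Map a forge-specific issue record onto a common pivot schema."""
    if forge == "forge_a":  # e.g. {"id": 1, "summary": "...", "state": "open"}
        return {"id": raw["id"], "title": raw["summary"], "open": raw["state"] == "open"}
    if forge == "forge_b":  # e.g. {"iid": 7, "title": "...", "closed": true}
        return {"id": raw["iid"], "title": raw["title"], "open": not raw["closed"]}
    raise ValueError(f"unknown forge: {forge}")

issues = [
    normalize_issue({"id": 1, "summary": "Crash on start", "state": "open"}, "forge_a"),
    normalize_issue({"iid": 7, "title": "Typo in docs", "closed": True}, "forge_b"),
]
print(issues)
```

Once every source is mapped onto the pivot schema, downstream tooling only has to understand one set of field names regardless of which forge the data came from.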

Most forge exports need cleaning and are not backward compatible

The quality of most forge exports is poor: not only are they undocumented, their content is inconsistent and contains errors. The most time consuming activity for developers working on forge exports is to clean up the data and they often need to write dedicated software to do so. A few middleware tools help with this task (solidata is Free Software and there are non-free alternatives) but developers mostly rely on off-the-shelf tooling and home grown recipes.

for instance, out of 1000 projects, a number of them contain references to non existent issues because they were deleted but the references stayed
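
A cleanup pass for the dangling references in this example could be as small as the following sketch; the data layout (issues carrying a `refs` list of issue ids) is hypothetical:

```python
# Sketch of a cleanup pass dropping references to deleted issues,
# as in the interview example. The record layout is hypothetical.

def drop_dangling_refs(issues: list[dict]) -> list[dict]:
    """Remove references pointing at issue ids that no longer exist."""
    known = {issue["id"] for issue in issues}
    return [
        {**issue, "refs": [r for r in issue.get("refs", []) if r in known]}
        for issue in issues
    ]

raw = [
    {"id": 1, "refs": [2, 999]},  # issue 999 was deleted but the reference stayed
    {"id": 2, "refs": []},
]
print(drop_dangling_refs(raw))
```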

When updates are automated, sophisticated strategies need to be put in place to cope with problems originating from the source of the data. Recurring errors are fixed by dedicated software and, when the update fails despite these efforts, a human intervention is required and the update is delayed.

For example, if the exported projects suddenly went down by 10% our script would send an alert with a warning that something is probably wrong with the export because such a variance is not expected
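
The sanity check quoted above amounts to a simple threshold test. In this sketch the 10% tolerance comes from the example; the function name and alerting mechanism are hypothetical:

```python
# Sketch of the sanity check described in the quote: flag the export
# when the number of exported projects drops by more than an expected
# variance (10% in the interviewee's example).

def export_looks_wrong(previous_count: int, current_count: int,
                       tolerance: float = 0.10) -> bool:
    """Return True when the project count dropped by more than `tolerance`."""
    if previous_count == 0:
        return False  # nothing to compare against yet
    drop = (previous_count - current_count) / previous_count
    return drop > tolerance

print(export_looks_wrong(1000, 850))  # 15% drop: probably a broken export
print(export_looks_wrong(1000, 980))  # 2% drop: within normal variance
```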

The backward compatibility of the data format is not given nearly as much attention as it should, because breaking changes are treated in the same fashion as other, more frequent, problems. Most developers did not even think about the specific problem of the backward compatibility of data formats, except when standards are used.

“one of the key things is making sure the standards are versioned so that they are interoperable and you know what version of the data you’re using and hopefully the schema versions are backward compatible.”

Cleaning the incoming forge export every time it is downloaded is entirely unnecessary if the provider of the dataset does the cleaning upstream and guarantees the backward compatibility of the schemas.
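
Upstream validation could be as simple as refusing to publish an export whose records are missing required fields, as in this sketch. The required fields are illustrative, not the actual F3 schema, and a real pipeline would rely on a full schema language such as JSON Schema:

```python
# Sketch of upstream validation: publish an export only if every record
# carries the fields a (hypothetical) schema requires.

REQUIRED_ISSUE_FIELDS = {"id", "title", "state"}  # illustrative, not the F3 schema

def validate_export(issues: list[dict]) -> list[str]:
    """Return a list of validation errors; an empty list means publishable."""
    errors = []
    for i, issue in enumerate(issues):
        missing = REQUIRED_ISSUE_FIELDS - issue.keys()
        if missing:
            errors.append(f"issue #{i}: missing fields {sorted(missing)}")
    return errors

export = [
    {"id": 1, "title": "Crash on start", "state": "open"},
    {"id": 2, "title": "No state field"},  # would block publication
]
for err in validate_export(export):
    print(err)
```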

VCS as a universal tool to keep the history of a dataset

When a forge export is updated on a regular basis, the provider does nothing to help track the changes. It also happens that the dataset changes for technical reasons (such as renumbering all the ids used for references) although the content remains the same. Although forges have all the tooling for keeping a history of software, the forge data itself (issues, pull requests, etc.) is kept in a database that is not versioned.

The main motivations for keeping the history of changes and using sophisticated tooling to explore them are:

  • Diagnosing problems (i.e. why is this forge export corrupt today although it was good yesterday?)
  • Figuring out if the forge export changed or not in the absence of a reliable notification from the provider (no difference compared to yesterday means no change).

    “…we just continually fetch that data so we can see that new data comes online by the fact that there new forge export are available, new URLs.”

Since the vast majority of changes affect text files (as opposed to binary files such as images), it is easier for both the producer and the developer to store the datasets in a VCS such as git or mercurial, an approach that also addresses the Data on the Web Best Practices recommendations on dataset versioning.
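
Storing each downloaded export in git turns change detection into looking at a diff, as in this sketch (the file name and contents are hypothetical):

```python
# Sketch of tracking a dataset's history in git, as the interviewees do:
# commit each downloaded export; an empty diff means no new data.
import pathlib
import subprocess
import tempfile

def run(repo: pathlib.Path, *args: str) -> str:
    """Run a git command in `repo` with a throwaway identity."""
    return subprocess.run(
        ["git", "-C", str(repo),
         "-c", "user.name=bot", "-c", "user.email=bot@example.com", *args],
        check=True, capture_output=True, text=True,
    ).stdout

repo = pathlib.Path(tempfile.mkdtemp())
run(repo, "init", "--quiet")

export = repo / "export.json"
export.write_text('{"projects": 1000}\n')
run(repo, "add", "export.json")
run(repo, "commit", "--quiet", "-m", "first export")

export.write_text('{"projects": 998}\n')    # simulate the next download
changed = run(repo, "diff", "--stat") != ""  # empty diff => no new data
print("new data available" if changed else "no change since last fetch")
```

Beyond detecting changes, the commit history makes it possible to ask when a field first appeared or which download introduced a corruption, which is exactly the diagnostic use case above.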

APIs are not favored by developers

Forge exports can be made available as files to download via a URL: this is the way developers prefer to get access to them. They could also use an API but, even if they did, it would not be in the scope of this user research.

  • In scope: datasets as files available for download to developers
  • Not in scope: APIs built on top of forge databases and made available to developers

This needs clarification because there can be confusion between:

  • an API, which requires both a protocol and a file format to represent the data being exchanged with the protocol
  • files available for download, which only require a file format describing the downloaded file

This sometimes leads to:

  • files available for download being presented as an API
  • providers claiming their API can be used to download the forge export and that there is no need for them to provide files for download

Recommendations


The theme cleanup comes first because it relieves the developer from the burden of writing sophisticated cleaning procedures. The theme documentation comes second because documentation is not the primary source of information developers use to figure out what the dataset means.

  1. Theme: Cleanup: ensure F3 exports are published only if they validate against a well documented schema. When the schema changes, ensure it is backward compatible. When a new F3 schema is not backward compatible, the data should be published in both the old schema and the new schema for a period of time that allows developers to update their software.
  2. Theme: Documentation: a detailed documentation should be written for each F3 version. It should be included, for the most part, within the schema describing the data to facilitate the maintenance. The research does not show how and when the documentation is used. Data about the documentation usage should be collected after it is published.
  3. Theme: Modification history: the forge exports should be made available in VCS repositories so their modification history is published as well as their content.
  4. Theme: Format: publishing the data in a single, well documented format is acceptable because developers always convert the dataset they download into another format, even when it is available in multiple formats.