The Tricoteuses non-profit collective wants to publish datasets originating from the French National Assembly at data.tricoteuses.fr for software developers to use. The original motivation for this project was the lack of documentation, but the user research revealed that developers would greatly benefit from a high-quality data repository that provides them with:
- A stable data schema, backward compatible so the software they write does not unexpectedly break, with detailed release notes when it changes
- Exhaustive and up-to-date reference documentation for all data structures
- Downloadable data files with their modification history (i.e. git or another VCS) to know when new data is available and to see the differences when an update happens, for debugging purposes and tracking
Some findings were unexpected and heavily influenced the recommendations:
- All developers write scripts to clean up the data and cope with errors originating from the repositories from which they download the data. This is a very significant part of their initial development and continues during maintenance whenever the data needs to be updated. A reliable data repository whose data is carefully checked for errors before publication is a significant added value for all developers because it reduces their workload.
- Most repositories make an effort to publish data in multiple formats, but it turns out most developers transform the data they download instead of using the original file. Therefore using a single format that is universally supported (e.g. JSON) is enough.
- Most developers do not rely on the documentation: they try to guess the structure of the data and its meaning by observing the content of the repository. It follows that high-quality documentation will only be of use if and when the developer gets stuck while guessing the meaning of a data field.
The recommendations for a data repository targeting developers are therefore to focus on (in that order):
- Cleanup: ensure all datasets are published only if they validate against a well-documented schema. When the schema changes, ensure it is backward compatible. When a new schema is not backward compatible, the data should be published in both the old schema and the new schema for a period of time that allows developers to update their software.
- Documentation: detailed documentation should be written for each dataset because there currently is none. It should be included, for the most part, within the schema describing the data to make it easier to maintain.
- Modification history: the datasets should be made available in VCS repositories so their modification history is published as well as their content.
- Format: publishing the data in a single, well documented format is enough.
Identify emerging themes related to the usage of a dataset repository by developers and provide recommendations to guide a user interface prototype for the data.tricoteuses.fr service. Its ambition is to publish datasets originating from the French national assembly in a manner that takes into account the needs of developers.
The five users willing to participate in this research answered the interview designed for developers working with datasets. The invitation email explains what Tricoteuses and the research are about. A small number of participants is acceptable for this kind of research.
While the interviews were conducted, the following material was collected or prepared: the documentation, the JSON schemas and the data repositories.
This is the raw material to be published on data.tricoteuses.fr and it required a significant amount of work (about four weeks for the documentation, one week for the JSON schemas and a few days for the data repositories). Smaller items that should be present (such as the license and information about the source of the data) were also collected and a checklist is available for details.
All this material is published on the experimental website https://data.tricoteuses.fr/ but no effort was made to figure out which of those elements should eventually be published or not. It is not a draft of the user interface.
The participants are five developers who work on datasets (as producers or consumers), with different backgrounds and focus, as detailed at the beginning of their interview.
- Data oriented middleware authors
- Service provider specializing in datasets
- Dataset publisher
- Individual developer contributing to the Open Data commons
It should be noted that front-end web application programmers are not among the participants because they consume APIs, which are a layer for accessing the datasets that is outside the scope of this user research.
The results are presented as five chapters matching the themes that emerged from the user interviews, in order of importance.
The interviews with software developers suggest that documentation is lacking, not only for the Open Data published by the French National Assembly, but for most datasets (with the notable exception of those relying on standards such as Dublin Core or Socle Commun des Données Locales). Because of the widespread shortage of quality documentation, developers do not read it and prefer to reverse engineer the structure and meaning of the data based on its content. They focus on what they think is relevant for the task at hand. Only when a problem arises or data seems to be missing do they browse the documentation, searching for the answer. In other words, no matter how good the documentation is, developers will attempt to guess the meaning of the data and are unlikely to read the documentation unless they face a problem.
Question: Ça ne t’a jamais bloqué ? (Have you ever been blocked by the lack of documentation?)
Answer: Non, en fait ça bloque pas. On peu perdre une journée ou deux mais on arrive à s’en sortir. Ou alors, s’il y a des données que tu ne comprends pas, tu ne t’en sert pas. (No, it is not a blocker. You may waste a day or two but you manage. Or, if there is data you don’t understand, you don’t use it.)
Question: Est-ce qu’il t’es arrivé de te poser une question sur un champ et de trouver la réponse dans la documentation ? (Did you ever find the answer to a question about a data field in the documentation?)
Answer: Sur certains points oui. Sur les champs je ne crois pas. Mais il y a d’autres docs officielles et en croisant on trouve les informations. (On some topics, yes. On data fields, I don’t think so. But there are other official documents and, by cross-referencing them, one can find the desired information.)
Most of the time developers convert the format in which a dataset is available into their preferred format. None of the interviewees use the data as-is.
@seb35 “Il y a un autre projet libre qui permet de transformer ces données en base SQL, beaucoup plus utilisable. C’est le projet GitHub - Legilibre/legi.py: Outils de manipulation des archives LEGI (lois françaises) auquel je contribue et c’est ce que j’utilise dans Archéo Lex” (there is another Free Software project that transforms this data into an SQL database, which is much more usable. It is Legilibre/legi.py, a set of tools for manipulating the LEGI archives (French laws), to which I contribute and which I use in Archéo Lex)
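This kind of conversion can be illustrated with a minimal sketch. It is not how legi.py works; it only assumes a hypothetical JSON array of flat records with `uid` and `titre` fields (names chosen for illustration) and loads them into an in-memory SQLite database with Python's standard library so they can be searched with SQL.

```python
import json
import sqlite3

# Hypothetical sample; the field names are illustrative assumptions.
SAMPLE_JSON = (
    '[{"uid": "A1", "titre": "Projet de loi"},'
    ' {"uid": "A2", "titre": "Amendement"}]'
)


def load_into_sqlite(raw_json: str) -> sqlite3.Connection:
    """Load a JSON array of flat records into an in-memory SQLite table."""
    records = json.loads(raw_json)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (uid TEXT PRIMARY KEY, titre TEXT)")
    # Named placeholders are filled from the keys of each record.
    conn.executemany(
        "INSERT INTO documents (uid, titre) VALUES (:uid, :titre)", records
    )
    conn.commit()
    return conn


def find_by_title(conn: sqlite3.Connection, fragment: str) -> list[str]:
    """Return the uids of documents whose title contains the fragment."""
    rows = conn.execute(
        "SELECT uid FROM documents WHERE titre LIKE ? ORDER BY uid",
        (f"%{fragment}%",),
    )
    return [uid for (uid,) in rows]
```

Calling `find_by_title(load_into_sqlite(SAMPLE_JSON), "loi")` returns the uids of the matching records, which is the sort of search that is cumbersome to run directly on the downloaded files.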
Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.
This desirable goal conflicts with the variety of motivations behind the systematic format conversion, ranging from facilitating fast search in a large data corpus to unifying heterogeneous sources into a format common to all of them.
“les outils que j’ai mis en place permettent de faire un format pivot qui permet de normaliser les données de N clients qui ont des jeux de données contenant des choses similaires mais sous différents formats.” (the tools I put in place produce a pivot format that normalizes the data of N clients whose datasets contain similar things in different formats)
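The idea of a pivot format can be sketched as follows. This is an illustration only, not the interviewee's tooling: the client names, the field names and the `to_pivot` helper are all hypothetical.

```python
# Hypothetical per-client mappings: source field name -> pivot field name.
CLIENT_MAPPINGS = {
    "client_a": {"nom": "name", "code_postal": "postcode"},
    "client_b": {"Name": "name", "zip": "postcode"},
}


def to_pivot(client: str, record: dict) -> dict:
    """Rename a client's fields to the pivot names and drop unmapped fields."""
    mapping = CLIENT_MAPPINGS[client]
    return {
        pivot: record[source]
        for source, pivot in mapping.items()
        if source in record
    }
```

Once every source is expressed in the pivot format, the rest of the software only has to handle one structure, which is why such a conversion happens regardless of the formats the provider publishes.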
Identifying all use cases and providing the datasets in a format that would effectively relieve the developer of the burden of format transformation is a huge undertaking. Even when the dataset distribution is available in multiple formats (CSV, XML, JSON, etc.), the majority of developers still feel the need for a format conversion. It is sometimes just a manifestation of the NIH (not invented here) syndrome but is almost always justified by the need to clean up the data, as explained below.
The quality of most datasets is poor: not only are they undocumented, but their content is inconsistent and contains errors. The most time-consuming activity for developers working on datasets is cleaning up the data, and they often need to write dedicated software to do so. A few middleware tools help with this task (solidata is Free Software and there are non-free alternatives) but developers mostly rely on off-the-shelf tooling and home-grown recipes.
“Par exemple sur 1000 lignes dans cette colonne c’est uniquement des chiffres avec des virgules, sauf un ou c’est écrit #ERR. C’est du nettoyage.” (for instance, out of 1000 lines in this column, there are only numbers with commas, except one which reads #ERR. This is cleaning.)
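The situation described in this quote can be handled with a small home-grown recipe. The sketch below assumes the column holds numbers written with a decimal comma and optional spaces as thousands separators; the function names are illustrative.

```python
import re
from typing import Optional

# Digits with an optional decimal part introduced by a comma, e.g. "12,5".
_NUMBER = re.compile(r"-?\d+(?:,\d+)?")


def parse_french_decimal(cell: str) -> Optional[float]:
    """Parse a number written with a decimal comma, or return None."""
    # Remove regular and non-breaking spaces used as thousands separators.
    text = cell.strip().replace("\u00a0", "").replace(" ", "")
    if not _NUMBER.fullmatch(text):
        return None
    return float(text.replace(",", "."))


def clean_column(cells: list[str]) -> tuple[list[float], list[int]]:
    """Return the parsed values and the indexes of the rejected cells."""
    values, rejected = [], []
    for index, cell in enumerate(cells):
        number = parse_french_decimal(cell)
        if number is None:
            rejected.append(index)
        else:
            values.append(number)
    return values, rejected
```

Returning the indexes of rejected cells, rather than silently dropping them, makes it possible to report the errors back to the provider of the data.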
When updates are automatic, sophisticated strategies need to be put in place to cope with problems originating from the source of the data. Recurring errors are fixed by dedicated software and, when the update fails despite these efforts, human intervention is required and the update is delayed.
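One such strategy is a plausibility check on the size of each update. The following is a minimal sketch, assuming the number of records in the previous and the current version is known; the function name and the default threshold of 10% are illustrative assumptions, not a description of any participant's actual tooling.

```python
def is_suspicious_drop(
    previous_count: int, current_count: int, threshold: float = 0.10
) -> bool:
    """Return True when the count fell by at least `threshold` (a fraction)."""
    if previous_count <= 0:
        # Nothing to compare against: do not raise an alert.
        return False
    drop = (previous_count - current_count) / previous_count
    return drop >= threshold
```

When the function returns True, the automatic update would be paused and a person asked to look at the source before the data is used.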
"For example, if a publisher has suddenly decreased the number of [documents] they publish, if it suddenly went down by 10% we would get an alert that says it’s not a very likely scenario. And then someone would go and have a look to see what’s actually happening. "
The backward compatibility of the data format is not nearly as important as it should be because breaking changes are treated in the same fashion as other, more frequent, problems. Most developers did not even think about the specific problem of the backward compatibility of data formats, except when standards are used.
“one of the key things is making sure the standards are versioned so that they are interoperable and you know what version of the data you’re using and hopefully the schema versions are backward compatible.”
Cleaning an incoming dataset every time it is downloaded is entirely unnecessary if the provider of the dataset does the cleaning upstream and guarantees the backward compatibility of the schemas.
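What a backward compatibility guarantee means for a consumer can be made concrete with a rough sketch. It assumes flat, JSON Schema-like dictionaries with only `properties` and `required` keys, and it treats a change as breaking when a field disappears, changes type, or stops being required; real schemas need a more complete comparison.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """List changes that could break software written against old_schema."""
    problems = []
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    for name, old_def in old_props.items():
        if name not in new_props:
            problems.append(f"field removed: {name}")
        elif new_props[name].get("type") != old_def.get("type"):
            problems.append(f"type changed: {name}")
    # A field consumers could rely on must remain mandatory.
    for name in old_schema.get("required", []):
        if name in new_props and name not in new_schema.get("required", []):
            problems.append(f"field no longer required: {name}")
    return problems
```

An empty list suggests the new schema can replace the old one; a non-empty list is a signal to publish the data in both schemas for a transition period and to document the change in the release notes.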
When data is updated on a regular basis, the provider either does nothing to help track these changes or relies on techniques such as splitting the data into files with the date of the update in their name. It also happens that the dataset changes for technical reasons (such as renumbering all the ids used for indexing) although the content remains the same.
“…ni le sénat ni l’assemblée ne conservent l’historique.” (neither the Senate nor the National Assembly keeps the history of changes)
The main motivations for keeping the history of changes and using sophisticated tooling to explore it are:
- Diagnosing problems (e.g. why is this dataset corrupt today although it was good yesterday?)
- Figuring out if the dataset changed or not in the absence of a reliable notification from the provider (no difference compared to yesterday means no change).
“…we just continually fetch that data so we can see that new data comes online by the fact that there are new dataset are being published, new URLs.”
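In the absence of a notification from the provider, the simplest way to figure out whether a downloaded file changed is to compare a fingerprint of its content with the one recorded at the previous fetch. The sketch below uses SHA-256 from Python's standard library; the function names are illustrative.

```python
import hashlib
from typing import Optional


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of a downloaded file's content."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_fingerprint: Optional[str], content: bytes) -> bool:
    """Return True when the content differs from the previous fetch.

    A missing previous fingerprint (first fetch) counts as a change.
    """
    return previous_fingerprint != fingerprint(content)
```

A byte-level comparison only says that something changed. It cannot tell a real update from a purely technical one such as renumbered ids, which is where a VCS diff of the files becomes useful.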
Since the vast majority of changes are made to text (as opposed to binary files such as images), it is easier for both the producer and the developer to store the datasets in a VCS such as git or mercurial. Here is a list of recommendations from Data on the Web Best Practices that this approach addresses:
- Requirements for Data Access
  - Data should be available for bulk download (example: data.tricoteuses.fr · GitLab)
  - Data should be available in an up-to-date manner and the update cycle made explicit (example: even without any other documentation, the git history shows the chronology of changes and their frequency)
- Requirements for Data Identification
  - Each data resource should be associated with a unique identifier (example: the commit hash of a file associated with line intervals)
- Requirements for Preservation
- Requirements for Provenance
  - If different versions of data exist, data versioning should be provided (example: the entire modification history is available at all times and tags can be placed to identify a remarkable point in time)
- Requirements for Data Usage
All datasets are available as files to download via a URL: this is the only access method developers use. Although they could, in theory, use an API or connect to a database, they don’t. And even if they did, it would not be in the scope of this user research.
- In scope: datasets as files available for download to developers
- Not in scope: API built on top of a dataset and made available to developers
This needs clarification because there can be confusion between:
- an API, which requires both a protocol and a file format to represent the data being exchanged with the protocol
- files available for download, which only require a file format describing the downloaded file
This sometimes leads to:
- files available for download being presented as an API (a harmless confusion since downloading files requires a protocol)
- providers claiming their API can be used to download the dataset instead of using files and that there is no need for them to provide files for download
Data visualization is even further from what developers use, although almost all of them work on software or services that include some sort of visualization. This is probably why they seem to value the visualization of datasets much more than the software and services that allow them to access the raw data. During the research, a majority of participants questioned the usefulness of focusing on such a narrow topic and advised that it would be much more useful to research how to make datasets more accessible with visualization.
The relative importance of the themes was revised to take into account the results of the research. The cleanup theme now comes first because it relieves developers of the burden of writing sophisticated cleaning procedures. The documentation theme comes second because it turns out not to be the primary source of information developers use to figure out what a dataset means. The visualization theme is discarded because it is not in the scope of this research.
- Theme: Cleanup: ensure all datasets are published only if they validate against a well-documented schema. When the schema changes, ensure it is backward compatible. When a new schema is not backward compatible, the data should be published in both the old schema and the new schema for a period of time that allows developers to update their software.
- Theme: Documentation: detailed documentation should be written for each dataset because there currently is none. It should be included, for the most part, within the schema describing the data to make it easier to maintain. The research does not show how and when the documentation is used. Data about the documentation usage should be collected after it is published.
- Theme: Modification history: the datasets should be made available in VCS repositories so their modification history is published as well as their content.
- Theme: Format: publishing the data in a single, well-documented format is sufficient because developers always convert the datasets they download into another format, even when they are available in multiple formats.
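The cleanup and documentation recommendations can be combined in a single artifact: a schema that both gates publication and carries the description of each field. The sketch below is a hand-rolled check that supports only a small subset of JSON Schema (`required`, `properties`, `type`, `description`). The schema content and the function names are hypothetical, and a production setup would more likely rely on a full JSON Schema validator.

```python
# Hypothetical schema; each "description" doubles as reference documentation.
SCHEMA = {
    "required": ["uid", "legislature"],
    "properties": {
        "uid": {"type": "string", "description": "Unique identifier of the record."},
        "legislature": {"type": "integer", "description": "Number of the legislature."},
    },
}

_PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validation_errors(record: dict, schema: dict) -> list[str]:
    """Return the reasons why a record does not conform to the schema."""
    errors = [
        f"missing field: {name}"
        for name in schema.get("required", [])
        if name not in record
    ]
    for name, definition in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        expected = definition["type"]
        # bool is a subclass of int in Python, so exclude it from numeric types.
        bool_as_number = isinstance(value, bool) and expected != "boolean"
        if bool_as_number or not isinstance(value, _PYTHON_TYPES[expected]):
            errors.append(f"wrong type for {name}: expected {expected}")
    return errors


def publishable(records: list[dict], schema: dict) -> bool:
    """A dataset is published only if every record validates."""
    return all(not validation_errors(record, schema) for record in records)


def reference_documentation(schema: dict) -> list[str]:
    """Render one documentation line per field from the schema itself."""
    return [
        f"{name} ({definition['type']}): {definition.get('description', '')}"
        for name, definition in schema.get("properties", {}).items()
    ]
```

Because the documentation lines are generated from the same schema that gates publication, the two cannot drift apart, which is the maintenance benefit the documentation recommendation aims for.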