A practical approach to fixing bugs in data repositories


The National Assembly data repositories have bugs. When someone suspects a problem, a discussion is initiated in the data.tricoteuses.fr category. If it turns out to be a problem that should be fixed in the National Assembly data repositories, the bug tag is set and an email is sent to the contact address so that upstream gets a chance to act on it.

This workflow is very similar to reporting bugs on a Free Software project. Even when the upstream project is very quick to provide a fix, it is very common to implement and use a local bug fix in the meantime.

The problem

In software or OpenData there is no tooling to support the tracking of bugs across projects that depend on each other. The workflow is however the same in both cases:

  • Discuss a problem
  • File a bug report with a reproducer so the upstream can repeat it
  • Implement a fix and submit it upstream
  • Package the fix locally for immediate use
  • From time to time run the reproducer to check if the bug is still present
  • If the bug cannot be reproduced, remove the local fix because it is no longer necessary

The proposed solution

The bug fixing workflow in the context of data.tricoteuses.fr could be as follows:

  • Discuss a problem: a discussion is initiated in the data.tricoteuses.fr category
  • File a bug report: when the problem is confirmed, file a new bug, mimicking the structure of other bug reports.
  • Create a reproducer: if possible, implement a reproducer in tricoteuses-assemblee in the src/bugs directory and name it after the number of the bug report and the schema of the repository in which it was found. For instance a bug found in Agenda_XV and filed as issue number 43 should have its reproducer in the scrutin-00043.ts file.
  • Implement a fix: The fix, if any, should be implemented in the same file as the reproducer.
  • Submit it upstream: the code that creates the OpenData repositories is not published and the fix itself can’t be used as is. It is however useful to demonstrate how the data should be modified. A mail is sent on a regular basis to opendata@assemblee-nationale.fr with a link to the list of open bugs.
  • Package the fix locally for immediate use: if a fix is implemented, it will be applied for each repository found in assemblee-nettoye before it is pushed to its counterpart in data.tricoteuses.fr. This process is done via the tricoteuses-assemblee-QA CI each time a change is pushed in the tricoteuses-assemblee codebase or a repository is updated. If applying a fix fails for any reason or if the resulting files do not validate with the JSON schemas, the changes are not pushed and data.tricoteuses.fr does not have the latest data.
  • From time to time run the reproducer to check if the bug is still present: on a weekly basis the tricoteuses-assemblee-QA CI runs the reproducers. It is not done more often because they can be resource consuming. Each reproducer is expected to produce a report and commit it in the repository. The report should be an empty file if the problem no longer exists or a description of the problem if it still exists. The URL to the report, if any, should be linked from the issue. On a regular basis a maintainer should go through the issues and review the reports to find out whether each bug still exists. This process cannot be fully automated because problems can be transient and only show up from time to time. The maintainer can browse the report history to decide if the problem is gone for good or if it resurfaces on a regular basis and still needs fixing. If a problem is fixed, the issue can be closed.
  • If the bug cannot be reproduced, remove the local fix because it is no longer necessary: for every issue that is closed, the corresponding file can be deleted from the tricoteuses-assemblee repository. The fix it contains will no longer be applied and the reproducer will no longer be run on a weekly basis.
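
The file naming convention for reproducers described in the list above can be sketched as a pair of small helpers. This is only an illustration: the function names are hypothetical, and the five-digit zero padding is an assumption inferred from the single scrutin-00043.ts example.

```typescript
// Hypothetical helper: builds the reproducer file name for a schema and issue number.
// Assumption: issue numbers are zero-padded to five digits, as in "scrutin-00043.ts".
function reproducerFileName(schema: string, issue: number): string {
  if (issue < 0 || Math.floor(issue) !== issue) {
    throw new Error("issue must be a non-negative integer, got " + issue);
  }
  let digits = String(issue);
  while (digits.length < 5) {
    digits = "0" + digits;
  }
  return schema + "-" + digits + ".ts";
}

// Hypothetical inverse: recovers the schema and issue number from a file name,
// or returns null when the name does not follow the convention.
function parseReproducerFileName(name: string): { schema: string; issue: number } | null {
  const match = /^(.+)-(\d{5,})\.ts$/.exec(name);
  if (match === null) {
    return null;
  }
  return { schema: match[1], issue: parseInt(match[2], 10) };
}
```

Being able to go from a file name back to an issue number is what would let a script list the files whose issue has been closed and can therefore be deleted.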

The bugs_helper script is the entry point for the CI to apply all fixes and run the reproducers.
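
To make the contract between a bug file and the CI more concrete, here is a minimal sketch of what applying fixes and running reproducers could look like. It is not the actual bugs_helper interface, which is not shown here: the Bug type, applyAllFixes, runReproducers, the Row type and the sample bug are all hypothetical. The sketch only follows the rules stated above: a report is empty when the problem is gone, and nothing is pushed when a fix fails or the result does not validate.

```typescript
// Hypothetical shape of one file in src/bugs: a reproducer and a fix side by side.
interface Bug<T> {
  issue: number;
  // Returns the report body: "" when the problem is gone, a description otherwise.
  reproduce(data: T): string;
  // Returns a corrected copy of the data.
  fix(data: T): T;
}

interface FixOutcome<T> {
  pushable: boolean; // false means the changes must not be pushed
  data: T;
  error: string;
}

// Applies every fix in order, then validates the result (for example against JSON schemas).
function applyAllFixes<T>(bugs: Bug<T>[], data: T, validate: (d: T) => boolean): FixOutcome<T> {
  let current = data;
  for (const bug of bugs) {
    try {
      current = bug.fix(current);
    } catch (e) {
      return { pushable: false, data, error: `fix for issue ${bug.issue} failed: ${String(e)}` };
    }
  }
  if (!validate(current)) {
    return { pushable: false, data, error: "fixed data does not validate" };
  }
  return { pushable: true, data: current, error: "" };
}

// Runs every reproducer and returns one report body per issue number.
function runReproducers<T>(bugs: Bug<T>[], data: T): { [issue: number]: string } {
  const reports: { [issue: number]: string } = {};
  for (const bug of bugs) {
    reports[bug.issue] = bug.reproduce(data);
  }
  return reports;
}

// Made-up example bug: titles carrying stray leading or trailing whitespace.
type Row = { title: string };
const bug43: Bug<Row[]> = {
  issue: 43,
  reproduce: (rows) => {
    const bad = rows.filter((r) => r.title !== r.title.trim());
    return bad.length === 0 ? "" : `${bad.length} title(s) with stray whitespace`;
  },
  fix: (rows) => rows.map((r) => ({ title: r.title.trim() })),
};
```

Keeping the reproducer and the fix in the same object mirrors the rule that both live in the same file, so deleting that file removes both at once.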

On updating a repository containing fixes

The workflow is best explained with an example and by reading the implementation.

The naive implementation would be:

But what happens the next time commits are pulled from assemblee-nettoye Agenda_XV? They may conflict with the fixes and require manual intervention, which is very impractical.

The proposed implementation is to:

  • Pull changes from assemblee-nettoye Agenda_XV into the upstream branch of data.tricoteuses.fr Agenda_XV
  • In a tmp branch created from master
    • Revert the last commit which contains all bug fixes
    • Merge upstream into tmp; this never conflicts because tmp is back to the state it had before the bug fixes were applied
    • Apply all fixes in the bugfixes branch
    • Merge the bugfixes branch in the tmp branch
  • If the content of the tmp branch is exactly the same as the content of the master branch, meaning neither the bug fixes nor the content of the upstream branch were modified, discard it and leave the master branch unmodified
  • If the content of tmp and master are different (according to git diff), pull the tmp branch into master
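
The branch handling in the steps above can be sketched as lists of git invocations. This is a rough illustration rather than the real implementation: the helper names are hypothetical, the upstream branch is assumed to be already up to date, the bugfixes branch is assumed to be created from tmp, and applying and committing the fixes on bugfixes is left to other tooling.

```typescript
type Step = { note: string; argv: string[] };

// Rebuilds tmp from master without the bug fixes, then brings in upstream.
function prepareTmp(): Step[] {
  return [
    { note: "start tmp from master", argv: ["git", "checkout", "-B", "tmp", "master"] },
    { note: "revert the last commit, assumed to hold all bug fixes", argv: ["git", "revert", "--no-edit", "HEAD"] },
    { note: "merge upstream into the now unfixed tmp", argv: ["git", "merge", "--no-edit", "upstream"] },
    { note: "fixes are applied and committed on bugfixes", argv: ["git", "checkout", "-B", "bugfixes", "tmp"] },
  ];
}

// Brings the freshly applied fixes back into tmp.
function mergeFixes(): Step[] {
  return [
    { note: "go back to tmp", argv: ["git", "checkout", "tmp"] },
    { note: "merge the bug fixes", argv: ["git", "merge", "--no-edit", "bugfixes"] },
  ];
}

// git diff --quiet exits with 0 when both trees are identical and 1 when they differ.
function compareWithMaster(): Step {
  return { note: "compare content of master and tmp", argv: ["git", "diff", "--quiet", "master", "tmp"] };
}

// Decides what to do with tmp from the exit code of the comparison.
function finish(diffExitCode: number): Step[] {
  const toMaster: Step = { note: "switch to master", argv: ["git", "checkout", "master"] };
  if (diffExitCode === 0) {
    return [toMaster, { note: "nothing changed, discard tmp", argv: ["git", "branch", "-D", "tmp"] }];
  }
  if (diffExitCode === 1) {
    return [toMaster, { note: "content changed, bring tmp into master", argv: ["git", "merge", "tmp"] }];
  }
  throw new Error("git diff failed with exit code " + diffExitCode);
}
```

Returning the commands as data instead of running them keeps the sketch easy to inspect; a runner would execute each argv in order and stop at the first failure.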

The result is that the data.tricoteuses.fr Agenda_XV master branch ends up being a series of commits with the following pattern: