Intercept Interview 04: developer using a dataset

2019-12-05T23:00:00Z
Interview script

About you

Q: What is your background?
A: I work on Open Source projects and application programming. Linux build systems are traditionally what I’ve been doing. Currently I work on tooling, and on organizing and wrangling data.
Q: You work for [organization]. Do you have a title, or are you just a member?
A: I am a member, yes.
Q: How long have you been working with [organization]?
A: Seven months, something like that.
Q: What were you doing before that?
A: I was working with [corporation].
Q: Could you describe what a typical day is for you at [organization]?
A: Usually we all get together in the morning and have a quick chat, checking what we’re going to do during the day. That includes working with customers and clients on their systems. For example, I’m working on [data publisher] at the moment. It is a [activity sector] transparency system. I’m working on the system that helps automate the flow of data from people who are publishing information about [activity] into their online platform for transparency. Today I’m going to be working on some database schema stuff and on how we manage the data flow in an automated way, because currently there are quite a few manual processes involved in making the data public.
Q: The people you work with on a daily basis are your colleagues who essentially do things around data, as you do, but maybe in different areas? And clients.
A: Yes. We sort of have two main specialties. One is tooling and systems around Open Data. The other is analysis of Open Data and getting information out of that Open Data. So we have developers and data analysts.
Q: So sometimes you do things for clients and sometimes you do things to help your colleagues?
A: Yes.
Q: What programming language do you use?
A: Mostly Python.

About data sets

Q: In your own words, what is a data set?
A: For me a data set would be some sort of structured data which has an obvious start and end. So it could be limited by a date range (a particular year), or you could have a data set that’s limited by a particular topic. And in my context most of the data sets are structured according to a particular schema: they are “schema bound”.
Q: By schema you mean XSD or JSON schema or something else?
A: Yeah, usually the technologies are coupled with schemas. Sometimes you get schemas that are just written out completely as a document. You could have a documented schema for people using a spreadsheet, for example, or whatever format they use. Usually it boils down to a JSON schema or XML schema.
Q: In your own words, what is Open Data?
A: Open Data is usually data which is licensed in a way that means that any member of the public can access it and manipulate it without any kind of restrictions.
Q: What kind of people in your organisation use data sets?
A: Pretty much everyone.

About the usage of data sets

Q: How long have you been working with data sets?
A: There were a lot of data sets in the build systems we were creating at [corporation]. The project was involved in what is called OpenEmbedded, which uses a lot of metadata to create Linux OSs. Those are discrete data sets in themselves. When people are building operating systems they want to know things like: “Why did a particular software package get built? What depends on what? What driver is needed for this thing?”
Q: Could you tell me the kind of things you do with data sets?
A: One example would be around making data sets accessible. That’s things like putting a frontend on the data sets and collating them. We use Elasticsearch on the [data publisher] system, which allows people to do full-text searches of [activity] information. We create UIs on top of data sets which allow people who don’t have a high level of technical skill to interrogate the data sets. So you can do filtering and fairly advanced queries on the data without having to know things like Python or SQL. Another example would be around aggregating and collating the data. Different clients work in different ways in terms of how they receive information. Some of the organisations publish to the transparency organisation; some have a kind of registry where publishers list their data sets and basically create a whole bunch of links to their Open Data. And then we create a system which goes through all those lists, fetches that information (usually from their web sites) and converts it into the standard format for presenting on the transparency organisation’s web site.
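To illustrate the aggregation flow described in this answer, here is a minimal sketch, not the interviewee's actual system. The registry contents, the URLs, the field names (`id`, `title`, `amount`) and the canned responses that stand in for the publishers' web sites are all hypothetical, chosen only to show the registry, fetch and convert steps.

```python
import json
from typing import Callable, Dict, List

# Hypothetical registry: each publisher lists links to its data sets.
REGISTRY: Dict[str, List[str]] = {
    "publisher-a": ["https://example.org/a/2019.json"],
    "publisher-b": ["https://example.org/b/data.json"],
}

# Stand-in for the publishers' web sites, so the sketch needs no network access.
FAKE_WEB: Dict[str, str] = {
    "https://example.org/a/2019.json":
        '[{"id": "a-1", "title": "First", "amount": "100"}]',
    "https://example.org/b/data.json":
        '[{"id": "b-1", "title": " Second ", "amount": 250}]',
}


def fetch(url: str) -> str:
    """Return the body published at url (here: a canned response)."""
    return FAKE_WEB[url]


def to_standard(publisher: str, raw: dict) -> dict:
    """Convert one publisher record into an assumed common format."""
    return {
        "publisher": publisher,
        "id": str(raw["id"]),
        "title": str(raw["title"]).strip(),
        "amount": float(raw["amount"]),
    }


def collect(registry: Dict[str, List[str]],
            fetcher: Callable[[str], str] = fetch) -> List[dict]:
    """Walk the registry, fetch every listed data set, normalise each record."""
    records = []
    for publisher, urls in sorted(registry.items()):
        for url in urls:
            for raw in json.loads(fetcher(url)):
                records.append(to_standard(publisher, raw))
    return records
```

Passing the fetcher as a parameter keeps the conversion logic testable without touching the network; a real system would presumably swap in an HTTP client there.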
Q: Do you work on software that you publish to work on data sets?
A: All of our projects are on GitHub and we always encourage our clients to make sure that all of our tooling and software (that we create) is Open Source.
Q: You maintain one specific software which is [toolname].
A: Yeah. Some of our clients have their own GitHub organisations and some don’t. Some of the stuff we do is on our own git pages. [toolname] is one example.
Q: You mentioned working with clients to help them publish their data. Does your [organisation] publish data sets?
A: I’m not sure. I think we have published (in the past) our own transparency reports and things like that. But I’m not entirely sure where we are at the moment.
Q: Is there third-party software (other than general-purpose software like databases) that you use to work on data sets?
A: One example is Elasticsearch, which is more like a search engine than a database. We also use things like https://colab.research.google.com/ (a Python and SQL system). There are also GitHub, Travis etc., which we use around the software development process.

About consuming data sets

Q: How do you cope with changes in the format of a data set you rely on?
A: Often this will come down from the standards organisations that set up the standard. Usually there will be a standard working group, and we will have someone on that working group so that we are aware of any changes. But, obviously, one of the key things is making sure the standards are versioned so that they are interoperable and you know what version of the data you’re using, and hopefully the schema versions are backward compatible.
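The point about versioned, backward-compatible schemas can be sketched as a simple compatibility check. This is only an illustration under an assumed "major.minor" numbering convention, where minor releases add to the schema without breaking older data; the interview does not say which versioning scheme the standards actually use, and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Split an assumed 'major.minor' version string into two integers."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_read(consumer_version: str, data_version: str) -> bool:
    """True if a consumer built for consumer_version can read the data.

    Assumption: within one major version every minor release is backward
    compatible, so a consumer can read data of the same or an older minor.
    """
    c_major, c_minor = parse_version(consumer_version)
    d_major, d_minor = parse_version(data_version)
    return c_major == d_major and d_minor <= c_minor
```

For example, `can_read("2.3", "2.1")` is true under this convention, while `can_read("2.3", "2.4")` and `can_read("2.3", "1.9")` are not.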
Q: Would you say the majority of the data sets you work on are based on standards?
A: Yes. Usually part of the process is making sure there is a very well defined standard before we even approach this kind of problem. And as a company we’ve been involved in helping create various standards for different organisations.
Q: In the case where the documentation is lacking, how do you cope with it?
A: Sometimes in those cases you fall back on precedent. You infer the meaning of a field based on previous data that you’ve collected. That’s generally the way forward. But obviously we try to rectify the problem before it gets that far.
Q: Going back to the standardization process, if I rephrase what you said: before working with the data set you try to engage with the producer of the data set so that they use a standard. Did I understand correctly?
A: Yes.
Q: And you’ve been successful in doing that? I’m sorry to ask you to repeat but this is very unusual.
A: I guess our company is quite niche, in that when someone has an Open Data problem and they need a standard created, or help building a standard, our company is quite well known for helping to do that. We will be heavily involved in that from quite early on. And those people are already on board with the idea of needing a standard.
Q: Regarding the updates of the data sets. How do you get notified when the data set is updated?
A: It depends on the client. Some are set up with a push method, so the publisher pushes data to the transparency organisation, and they are notified that the push is coming. More successful examples are based on pulling the data from the publishers. [data publisher], for example, and [org2] and [org3]: they publish URLs which tell us where their data is located, and we periodically fetch that data. We can see that new data has come online because new data sets are being published, with new URLs, or because the data sets themselves grow in size.
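The pull-based detection described here (new URLs appearing, or existing data sets growing) can be sketched as a comparison of two snapshots. This is a minimal illustration: the snapshot shape (a mapping from URL to size in bytes) and the function name are assumptions, and how the sizes are actually obtained is left out.

```python
from typing import Dict, List, Tuple


def detect_updates(previous: Dict[str, int],
                   current: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Compare two {url: size_in_bytes} snapshots taken on successive fetches.

    Returns (new_urls, grown_urls): URLs that were not listed before, and
    URLs that were already listed but whose data set is now larger.
    """
    new_urls = sorted(url for url in current if url not in previous)
    grown_urls = sorted(
        url for url, size in current.items()
        if url in previous and size > previous[url]
    )
    return new_urls, grown_urls
```

A periodic job would keep the last snapshot, build a fresh one on each run, and treat anything in either returned list as newly published data.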
Q: With all this machinery in place, how do you go about when something breaks?
A: We have validators which help to filter out data that is broken. It could be a date in the past or vastly in the future, or data in the wrong fields. Before data enters any kind of system, it is validated against the schema. We do a few checks as well, based on what we think is likely to be happening. For example, if the number of [documents] a publisher publishes suddenly went down by 10%, we would get an alert saying that this is not a very likely scenario. And then someone would go and have a look to see what’s actually happening. So there is some automated validation and verification, and some heuristics based on what we think is likely to happen in the system. And it falls back to manual intervention from the [data publisher] or from us.
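The two kinds of check mentioned in this answer, a plausibility test on dates and an alert when a publisher's document count drops, might look roughly like the sketch below. The function names, the date bounds and the ISO date format are assumptions made for illustration; only the 10% figure comes from the interview.

```python
from datetime import date


def date_is_plausible(value: str, today: date,
                      earliest: date = date(1990, 1, 1),
                      max_years_ahead: int = 5) -> bool:
    """Reject unparseable dates and dates outside an assumed plausible window.

    Both bounds are illustrative defaults, not values from the interview.
    """
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return False
    latest = date(today.year + max_years_ahead, 12, 31)
    return earliest <= parsed <= latest


def count_dropped(previous_count: int, current_count: int,
                  threshold: float = 0.10) -> bool:
    """True if the document count fell by at least threshold (10% by default)."""
    if previous_count <= 0:
        return False  # nothing to compare against
    drop = (previous_count - current_count) / previous_count
    return drop >= threshold
```

A record failing `date_is_plausible` would be filtered out before entering the system, while a true result from `count_dropped` would only raise an alert for someone to investigate, matching the split between automated validation and manual follow-up.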
Q: Do you remember the last time you had a problem and how you fixed it?
A: Recently the web sites of some publishers went down. We contacted our client to escalate that and contact the publishers. Some of them are really small, have their data published on a WordPress instance somewhere and do not have IT people. Often it will be the [data publisher] who tells them about the error fetching their data. And then they will fix it.

About publishing data

Q: How useful are the Data on the Web Best Practices (https://www.w3.org/TR/dwbp/)?
A: It is useful in terms of it being a benchmark for an organisation where people can easily understand the usefulness of data standards and the way in which people can collaborate to extend or create the standards in the first place.
Q: The backward compatibility you already mentioned in standards, how challenging is it?
A: I personally have not been involved in that. I’ve got colleagues who have. There is always a tension between backward compatibility and people wanting to extend the standards. Sometimes you want a clean break from the past.
Q: You mentioned that [organisation] publishes UI to help people search data, for example. What kind of user research was involved?
A: The transparency organisation would do that kind of work. I’ve seen documentation from [data publisher] where they are looking into who their data users are and creating user stories based on that. It informs the frontend and the features it will have.