The ABCE Open Data Project
Have some data but not sure how best to share it? This is a new project about best practice for Open Data and FAIR Data which might help.
What is the background to this work?
One of my roles at Loughborough University is Open Research Lead for the School of Architecture, Building and Civil Engineering (ABCE). This is a brand new post, and recently the University gave me some funds for a short-term project of my choosing about Open Research. This led to the ABCE Open Data Project, a three-month project where I am working with eight PhD researchers to develop useful guidelines and suggestions for publishing Open Data and FAIR Data. The project started a few weeks ago and this is the first blog post giving an update.
What is the project about?
Open Research involves many aspects including Open Access, Open Data, Open Methods and Open Software. Open Access, i.e. publishing papers so that they are public and free to read, has now been largely implemented by both universities and funders. I believe the next challenge is Open Data, where researchers publish their research data so that others can use it. This might be to repeat and better understand the research of the original researchers, or it might be to carry out new research using the data (perhaps in novel ways or different disciplines that the original researchers would not have thought of).
However, publishing research data is not simple. The key is not just sharing the data itself, but sharing it in a way that makes it possible for other researchers to reuse it. Ideally the dataset should be completely self-explanatory, and researchers new to the dataset should be able to reuse the data with confidence, without worrying that they might have misinterpreted what a particular variable means or how the data was collected.
The ABCE Open Data Project aims to help others to publish their data as Open Data by providing a series of best-practice examples of Open Data. There will never be a single set of instructions on how best to publish Open Data, but I think good examples are very helpful to others who are interested in sharing their data for others to use.
What will the project do?
The project is looking at the concept of FAIR data - Findable, Accessible, Interoperable and Reusable. In particular it focuses on the ‘Interoperable’ part of FAIR which looks at using unique identifiers and common vocabularies when describing data - essentially trying to use the same terms which have the same meanings across the community of researchers in a discipline.
True interoperability is a difficult technical task and the FAIR guidelines hint that data structures such as RDF combined with OWL ontologies may be required for this. However, in my experience this isn’t a practical solution for many researchers, who are working with tabular datasets in Excel or CSV formats and have no real need to learn a new data model such as RDF.
A neat solution to this is the CSV on the Web (CSVW) standard. CSVW combines CSV files, which everybody knows and can use, with a JSON metadata file that provides full descriptions of the data in the CSV file. It is possible, but not required, for the JSON metadata file to include unique identifiers and common vocabulary terms (say from OWL ontologies), so the metadata file can provide a route towards the FAIR Interoperability recommendations. Not everyone knows what a JSON file is, but as JSON files can be opened in text editors and are easy for humans to read, JSON is a much simpler format to work with than, say, RDF.
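To give a feel for the idea, here is a minimal Python sketch of a small CSV file and a CSVW metadata description to go with it. The dataset, the column names, the descriptions and the `http://example.org/...` vocabulary link are all made up for illustration and are not taken from the project. The property names (`@context`, `url`, `tableSchema`, `columns`, `datatype`, `propertyUrl`) come from the CSVW standard.

```python
import csv
import io
import json

# Hypothetical sensor readings, used only to illustrate the idea.
CSV_FILENAME = "sensor_data.csv"
CSV_TEXT = (
    "timestamp,room,air_temperature\n"
    "2021-01-01T00:00:00,kitchen,18.5\n"
    "2021-01-01T00:10:00,kitchen,18.7\n"
)

# By convention the metadata file sits next to the CSV file with this suffix.
METADATA_FILENAME = CSV_FILENAME + "-metadata.json"


def build_metadata(csv_filename):
    """Return a CSVW metadata description for the example CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_filename,
        "dc:title": "Example room temperature readings",
        "tableSchema": {
            "columns": [
                {
                    "name": "timestamp",
                    "titles": "timestamp",
                    "datatype": "dateTime",
                    "dc:description": "Time at which the reading was taken",
                },
                {
                    "name": "room",
                    "titles": "room",
                    "datatype": "string",
                    "dc:description": "Room in which the sensor was placed",
                },
                {
                    "name": "air_temperature",
                    "titles": "air_temperature",
                    "datatype": "number",
                    "dc:description": "Air temperature in degrees Celsius",
                    # Placeholder link to a shared vocabulary term (optional).
                    "propertyUrl": "http://example.org/vocab/airTemperature",
                },
            ]
        },
    }


metadata = build_metadata(CSV_FILENAME)
metadata_json = json.dumps(metadata, indent=2)

# Check that every CSV column is described in the metadata, in the same order.
header = next(csv.reader(io.StringIO(CSV_TEXT)))
column_names = [column["name"] for column in metadata["tableSchema"]["columns"]]

print(metadata_json)
print("CSV header matches metadata:", header == column_names)
```

The CSV file stays exactly as a researcher would normally produce it. All the extra meaning (units, descriptions, links to shared vocabularies) lives in the separate JSON file, which can start simple and be enriched later.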
So what will the project deliver?
The objectives of the project are to:
- Create a series of best-practice examples of publishing FAIR data using the CSVW standard. This will use small, example datasets to demonstrate the principles and provide suggestions for others to build upon.
- Work with new or existing ‘real-world’ datasets to publish them as FAIR data using the CSVW standard. These will be complex, multi-faceted datasets from actual research projects and large-scale national surveys. This will both demonstrate how the CSVW approach can be used in practice and highlight the challenges when working with complex datasets.
Current progress and next steps
We have started by looking at the following different types of data:
- Sensor data
- Questionnaire data
- Interview data
- Image data
- Simulation data
We’ve created some small, simple examples of these datasets and are now looking at how to represent them as CSVW. The idea is to post these example datasets on the Loughborough University Research Repository. We may also share some examples on GitHub of how to analyse data which has been published in the CSVW format.
It sounds simple, but is proving to be a little more complex than anticipated. The CSVW standard is well developed and can cover many different use cases, which is a good thing but which also makes it a little difficult to get started with. So the next few weeks will be about really getting to grips with how to create the CSVW metadata files for the different data formats.
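As a rough idea of what analysing a CSVW dataset might look like, the sketch below reads a CSV file and uses the datatypes declared in its metadata file to convert each value. It uses only the Python standard library and handles just a handful of datatypes. The data, the column names and the `read_csvw` helper are hypothetical; this is not the project's analysis code, and a real analysis would likely use a dedicated CSVW library.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical CSV content and matching CSVW metadata, for illustration only.
CSV_TEXT = (
    "timestamp,room,air_temperature\n"
    "2021-01-01T00:00:00,kitchen,18.5\n"
    "2021-01-01T00:10:00,kitchen,18.7\n"
)

METADATA_JSON = """
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "sensor_data.csv",
  "tableSchema": {
    "columns": [
      {"name": "timestamp", "datatype": "dateTime"},
      {"name": "room", "datatype": "string"},
      {"name": "air_temperature", "datatype": "number"}
    ]
  }
}
"""

# Only a few CSVW datatypes are covered here; anything else is kept as text.
CONVERTERS = {
    "string": str,
    "number": float,
    "integer": int,
    "dateTime": datetime.fromisoformat,
}


def read_csvw(csv_text, metadata_text):
    """Return the CSV rows as dicts, with values converted per the metadata."""
    metadata = json.loads(metadata_text)
    converters = {}
    for column in metadata["tableSchema"]["columns"]:
        datatype = column.get("datatype", "string")
        # CSVW also allows a datatype object; fall back to its base type.
        if isinstance(datatype, dict):
            datatype = datatype.get("base", "string")
        converters[column["name"]] = CONVERTERS.get(datatype, str)

    rows = []
    for raw_row in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {name: convert(raw_row[name]) for name, convert in converters.items()}
        )
    return rows


rows = read_csvw(CSV_TEXT, METADATA_JSON)
temperatures = [row["air_temperature"] for row in rows]
print("Number of readings:", len(rows))
print("Highest temperature:", max(temperatures))
```

The point is that someone new to the dataset does not have to guess that `air_temperature` is a number or that `timestamp` is a date and time, because the metadata file states it.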