CSV on the Web: Creating descriptive metadata files

This post looks at creating CSV on the Web (CSVW) metadata files with simple descriptions.

CSV on the Web: Creating descriptive metadata files
An example of a CSVW metadata.json file

Introduction

CSVW involves having the data of a dataset in a CSV file and the metadata about the data in a JSON file. This metadata might include information such as a description of the dataset, who collected the data and how it was collected. It can also include information on the columns in the CSV file (i.e. the variables of the data) such as a description of the variable, the units used and what datatype is being used (string, number etc.).

In this post we look at including such information in the metadata.json file using simple descriptions. This is the level of detail ‘Level 1’ as discussed in my previous blog post CSV on the Web: An Introduction. For many existing datasets this information will already exist but it may be stored in a non-standard format which is difficult for computers to read, such as a readme.txt file, a Word document or on a webpage.

What is a descriptive metadata file?

Here I’m defining a descriptive metadata file as one which the information is largely provided through descriptions using text strings. It will make use of the existing properties available to CSVW metadata objects where possible and, if any extra properties are required, it will use well-known vocabularies drawing on the examples in the CSVW Primer in the first instance. The aim is to get the descriptive information for a CSV file into the metadata.json file in the simplest manner possible.

A basic metadata file

Let’s consider an example which is given the the CSVW Primer. Example 6 shows a CSV file about countries, their names in different languages and their latitude and longitude. Let’s assume this CSV file is saved somewhere with the name ‘countries.csv’.

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
The 'countries.csv' file. This is Example 6 of the CSVW Primer.

The basic metadata.json file for this CSV file is given in Example 7.

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "tableSchema": {
      "columns": [
        {
          "titles": "country"
        },
        {
          "titles": "country group"
        },
        {
          "titles": "name (en)"
        },
        {
          "titles": "name (fr)"
        },
        {
          "titles": "name (de)"
        },
        {
          "titles": "latitude"
        },
        {
          "titles": "longitude"
        }
      ]
    }
  }
The 'countries.csv-metadata.json' file. This is Example 7 of the CSVW Primer.

There are a number of things to note about this metadata file:

  • This is a metadata file describing a Table object (i.e. a description of a single table). As such it is following the rules from Section 5.4 of the Metadata Vocabulary.
  • The file also contains a Schema object which is the value of the tableSchema property. Schema objects are described in Section 5.5 of the Metadata Vocabulary.
  • The file also contains seven Column objects where each one is a description of an individual column in the CSV file. These are provided as a list in the columns property of the Schema object. The rules for Column objects are given in Section 5.6 of the Metadata Vocabulary.

Looking at the properties of the Table object:

  • The @context property tells us that this is a CSVW metadata file and is always used in the first object in the metadata.json file. This is a Top Level Property as described in Section 5.2 of the Metadata Vocabulary.
  • The url property points to the location of the CSV file which this metadata document is describing. Assuming that the CSV file and the metadata.json file are stored in the same location, this can then be the name of the CSV file.
  • The tableSchema property holds the Schema object for the table. Schema objects describe the structure of a table. They are separate from the Table object as then the same schema might be reused to describe different tables of data.

Looking at the properties of the Schema object:

  • The columns property provides a list of Column objects. Note that the order of the Column objects in the list is important. This order should exactly match the order of the columns in the CSV file.

Looking at the properties of the Column object:

  • The titles property provides one or more titles for the column. If the CSV file includes a header row, then the header used for the column should match one of the titles given in this property.

How to start to edit a metadata.json file

The CSVW metadata files are in JSON format. These can be edited with a basic text editor such as Notepad. But it is much easier to use a dedicated JSON editor such as Visual Studio Code. You can find JSON editor software by googling ‘JSON editor’ and there are options for desktop applications, Chrome extensions and websites. The advantages of using a JSON editor is that is will highlight if the syntax of the JSON file is not correct, it will autocomplete closing parenthesis for objects and lists, and it will indent the file to make it easier to read.

Here I use Visual Studio Code to edit the metadata.json file of Example 7 to include @type properties for the Table, Schema and Column objects. This isn’t strictly speaking necessary, but I thinking this helps to make clearer the objects which are described by the metadata file.

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "tableSchema": {
      "columns": [
        {
          "titles": "country",
          "@type": "Column"
        },
        {
          "titles": "country group",
          "@type": "Column"
        },
        {
          "titles": "name (en)",
          "@type": "Column"
        },
        {
          "titles": "name (fr)",
          "@type": "Column"
        },
        {
          "titles": "name (de)",
          "@type": "Column"
        },
        {
          "titles": "latitude",
          "@type": "Column"
        },
        {
          "titles": "longitude",
          "@type": "Column"
        }
      ],
      "@type": "Schema"
    },
    "@type": "Table"
  }
The 'countries.csv-metadata.json' file including @type properties.

Adding descriptions about the table of data (the dataset)

Let’s say we want to include a description of the CSV file itself in the metadata. This might include what the data is describing, who collected the data and how it was collected. The CSVW Primer suggests that this can be done by adding additional properties to the Table object. These additional properties are defined as Common Properties, as described in Section 5.8 of the Metadata vocabulary. There is no restriction on what the additional properties are, but the property names have to be prefixed names, in the syntax prefix:name.

Here we are using the methods given in the CSVW primer where possible. It suggests two possible options for property names for table descriptions as shown in Example 8 and Example 9. These are property names with the prefix dc (the Dublin Core vocabulary) or the schema prefix (the schema.org vocabulary).

{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Unemployment in Europe (monthly)"
  "dc:description": "Harmonized unemployment data for European countries."
  "dc:creator": "Eurostat",
  "tables": [{
    "url": "countries.csv",
    "dc:title": "Countries"
  }, {
    "url": "country-groups.csv",
    "dc:title": "Country groups"
  }, {
    "url": "unemployment.csv",
    "dc:title": "Unemployment (monthly)",
    "dc:description": "The total number of people unemployed"
  }]
}
Table descriptions using the Dublin Core vovaculary. This is Example 8 of the CSVW Primer document.
{
  "@context": "http://www.w3.org/ns/csvw",
  "schema:name": "Unemployment in Europe (monthly)",
  "schema:description": "Harmonized unemployment data for European countries."
  "schema:creator": { "schema:name": "Eurostat" },
  "tables": [{
    "url": "countries.csv",
    "schema:name": "Countries"
  }, {
    "url": "country-groups.csv",
    "schema:name": "Country groups"
  }, {
    "url": "unemployment.csv",
    "schema:name": "Unemployment (monthly)",
    "schema:description": "The total number of people unemployed"
  }]
}
Table descriptions using the schema.org vocabulary. This is Example 9 of the CSVW Primer document.

Let’s take a moment to consider what these prefix:name property names represent. In the case of the dc property names these are referring to the Dublin Core vocabulary. This is a well known vocabulary of terms which anyone can use in their data and metadata. The idea behind such vocabularies is that they provide a stable and unique way of describing things. If some uses these terms to describe their data, it is possible for others to understand what is meant. The dc terms are exactly defined so there is less chance of misinterpretation.

The dc terms are described on the Dublin Core website. Example 8 of the CSVW primers uses the terms dc:title, dc:description and dc:creator. The definitions of these terms, as provided on the Dublin Core website, are as follows:

Let’s extend Example 6 above by using these terms to describe the table of data:

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "dc:title": "Country data table (source: Example 8 of the 'CSV on the Web: A Primer' Working Group Note).",
    "dc:description": "This table describes groupings, names and location for different countries",
    "dc:creator": "The CSV on the Web Working Group https://www.w3.org/2013/csvw/wiki/Main_Page",
    "tableSchema": {
      "columns": [
        {
          "titles": "country",
          "@type": "Column"
        },
        {
          "titles": "country group",
          "@type": "Column"
        },
        {
          "titles": "name (en)",
          "@type": "Column"
        },
        {
          "titles": "name (fr)",
          "@type": "Column"
        },
        {
          "titles": "name (de)",
          "@type": "Column"
        },
        {
          "titles": "latitude",
          "@type": "Column"
        },
        {
          "titles": "longitude",
          "@type": "Column"
        }
      ],
      "@type": "Schema"
    },
    "@type": "Table"
  }
The 'countries.csv-metadata.json' file including Dubln Core (dc:) table descriptions.

Adding descriptions about columns

This can be done in the same method as for Table objects. Example 10 in the CSVW primer shows how the dc:description  property is used to provide descriptions for each individual column. As for Table objects, this is classed as a Common Property which in this case is part of the Column objects. Let’s include these descriptions in our example metadata file:

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "dc:title": "Country data table (source: Example 8 of the 'CSV on the Web: A Primer' Working Group Note).",
    "dc:description": "This table describes groupings, names and location for different countries",
    "dc:creator": "The CSV on the Web Working Group https://www.w3.org/2013/csvw/wiki/Main_Page",
    "tableSchema": {
      "columns": [{
          "titles": "country",
          "dc:description": "The ISO two-letter code for a country, in lowercase.",
          "@type": "Column"
        },
        {
          "titles": "country group",
          "dc:description": "A lowercase two-letter code for a group of countries.",
          "@type": "Column"
        },
        {
          "titles": "name (en)",
          "dc:description": "The official name of the country in English.",
          "@type": "Column"
        },
        {
          "titles": "name (fr)",
          "dc:description": "The official name of the country in French.",
          "@type": "Column"
        },
        {
          "titles": "name (de)",
          "dc:description": "The official name of the country in German.",
          "@type": "Column"
        },
        {
          "titles": "latitude",
          "dc:description": "The latitude of an indicative point in the country.",
          "@type": "Column"
        },
        {
          "titles": "longitude",
          "dc:description": "The longitude of an indicative point in the country.",
          "@type": "Column"
        }
      ],
      "@type": "Schema"
    },
    "@type": "Table"
  }
The 'countries.csv-metadata.json' file including Dubln Core (dc:) column descriptions.

Adding datatypes to columns

It is very useful to know what type of data a column should contain. This allows us to check that the correct datatype is being used when adding new data. It also allows us to read the data into analysis software using the correct datatype. Datatypes can be added to Column objects using the datatype property, which is an Inherited Property as described in Section 5.7 of the Metadata Vocabulary. Inherited properties can be applied to Column, Schema, Table and TableGroup objects, and the basic idea is that an inherited property applied to, say, a table will then also apply to all columns in that table (unless overridden by an additional inherited property of the same type on an individual column within the table).

There are two options for the values of the datatype property:

  • A predetermined string value such as "string", "number", "date" etc.
  • A JSON object such as {“base”: “integer”, “minimum”: 1, “maximum”: 5}. This is taken from Example 19 of the CSVW primer and tells us that this datatype is an integer and its value must be between 1 and 5 inclusive.

Further information on datatypes and the meaning of the terms ‘base’, ‘minimum’, ‘maximum’ etc. is given in Section 5.11 of the Metadata Vocabulary.

The CSVW primer gives many examples of using datatypes (for example see Section 3). Here I’ve gathered these all together to extend our example metadata file:

{
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "dc:title": "Country data table (source: Example 8 of the 'CSV on the Web: A Primer' Working Group Note).",
    "dc:description": "This table describes groupings, names and location for different countries",
    "dc:creator": "The CSV on the Web Working Group https://www.w3.org/2013/csvw/wiki/Main_Page",
    "tableSchema": {
      "columns": [{
          "titles": "country",
          "dc:description": "The ISO two-letter code for a country, in lowercase.",
          "datatype": {
            "dc:title": "Country Code",
            "dc:description": "Country codes as specified in ISO 3166.",
            "base": "string",
            "format": "[a-z]{2}"
          },
          "@type": "Column"
        },
        {
          "titles": "country group",
          "dc:description": "A lowercase two-letter code for a group of countries.",
          "datatype": {
            "base": "string",
            "format": "eu|non-eu"
          },
          "@type": "Column"
        },
        {
          "titles": "name (en)",
          "dc:description": "The official name of the country in English.",
          "datatype": {
            "base": "string",
            "minLength": "3",
            "maxLength": "128"
          },
          "@type": "Column"
        },
        {
          "titles": "name (fr)",
          "dc:description": "The official name of the country in French.",
          "datatype": "string",
          "@type": "Column"
        },
        {
          "titles": "name (de)",
          "dc:description": "The official name of the country in German.",
          "datatype": "string",
          "@type": "Column"
        },
        {
          "titles": "latitude",
          "dc:description": "The latitude of an indicative point in the country.",
          "datatype": {
            "base": "number",
            "minimum": "-90",
            "maximum": "90",
            "format": "#0.000000##"
          },
          "@type": "Column"
        },
        {
          "titles": "longitude",
          "dc:description": "The longitude of an indicative point in the country.",
          "datatype": {
            "base": "number",
            "minimum": "-90",
            "maximum": "90",
            "format": "#0.000000##"
          },
          "@type": "Column"
        }
      ],
      "@type": "Schema"
    },
    "@type": "Table"
  }
The 'countries.csv-metadata.json' file including datatypes for each column.

This metadata.json file now achieves the goal of this blog post to create a descriptive metadata file. It includes information about the table of data and information about each individual column.

Summary

This blog post has shown how to create a metadata.json file for a CSVW dataset. It has shown how to describe the table of data itself and the individual columns within the table. The Dublin Core vocabulary is used to provide descriptions based on the dc prefix. The types of data expected in each column in the CSV file is also described using the datatype property.

This, I believe, is a great point to aim for, for anyone who is working with CSV data and who would like to share the data with the wider world.

Further information