Table Schema
Authors | Rufus Pollock, Paul Walsh, Adam Kariv, Evgeny Karev, Peter Desmet, Ethan Welty, DC Slagel |
---|---|
Profile | https://datapackage.org/profiles/2.0/tableschema.json |
A simple format to declare a schema for tabular data. The schema is designed to be expressible in JSON
Language
The key words MUST
, MUST NOT
, REQUIRED
, SHALL
, SHALL NOT
, SHOULD
, SHOULD NOT
, RECOMMENDED
, MAY
, and OPTIONAL
in this document are to be interpreted as described in RFC 2119
Introduction
Table Schema is a simple language- and implementation-agnostic way to declare a schema for tabular data. Table Schema is well suited for use cases around handling and validating tabular data in text formats such as CSV, but its utility extends well beyond this core usage, towards a range of applications where data benefits from a portable schema format.
Concepts
This specification heavily relies on the following concepts:
In this document, we will explicitly refer to either the Native Representation or Logical Representation of data in places where it prevents ambiguity for those engaging with the specification, especially implementors. Note, that this specification does not deal in any way with Physical Representation of data.
Note, that whenever a native value is allowed to be provided in this spec, the most similar JSON type should be used to represent it. If no such type exists (e.g. in case there’s a native date value), a string representation of that value should be provided. Such mappings between native types and JSON types, and the string representations described above are file format specific and left for implementors to decide (unless defined explicitly in this specification or its appendixes).
Descriptor
Table Schema descriptor MUST
be a descriptor as per Descriptor definition. A list of standard properties that can be included into a descriptor is defined in the Properties section.
An example of a Table Schema descriptor:
Properties
Schema
A Table Schema descriptor MAY
contain these standard properties:
fields
[required]
A Table Schema descriptor MUST
contain a property fields
. fields
MUST
be an array where each entry in the array is a field descriptor as defined below.
The way Table Schema fields
are mapped onto the data source fields are defined by the fieldsMatch
property. By default, the most strict approach is applied, i.e. fields in the data source MUST
completely match the elements in the fields
array, both in number and order. Using different options of the fieldsMatch
property, a data producer can relax requirements for the data source.
$schema
A root level Table Schema descriptor MAY
have a $schema
property that MUST
point to a profile as per Profile definition that MUST
include all the metadata constraints required by this specification.
The default value is https://datapackage.org/profiles/1.0/tableschema.json
and the recommended value is https://datapackage.org/profiles/2.0/tableschema.json
.
fieldsMatch
A Table Schema descriptor MAY
contain a property fieldsMatch
that MUST
be a string with the following possible values and the exact
value by default:
- exact (default): The data source
MUST
have exactly the same fields as defined in thefields
array. FieldsMUST
be mapped by their order. - equal: The data source
MUST
have exactly the same fields as defined in thefields
array. FieldsMUST
be mapped by their names. - subset: The data source
MUST
have all the fields defined in thefields
array, butMAY
have more. FieldsMUST
be mapped by their names. - superset: The data source
MUST
only have fields defined in thefields
array, butMAY
have fewer. FieldsMUST
be mapped by their names. - partial: The data source
MUST
have at least one field defined in thefields
array. FieldsMUST
be mapped by their names.
missingValues
Many datasets arrive with missing data values, either because a value was not collected or it never existed. Missing values may be indicated simply by the value being empty in other cases a special value may have been used e.g. -
, NaN
, 0
, -9999
etc.
The missingValues
property configures which native values MUST
be treated as logical null
values. If provided, the missingValues
property MUST
be an array
of values.
This conversion to null
is done before any other attempted type-specific conversion. The default value [ "" ]
means that empty strings will be converted to null before any other processing takes place. Providing the empty list []
means that no conversion to null will be done, on any value.
Examples of the missingValues
property:
primaryKey
A primary key is a field or set of fields that uniquely identifies each row in the table. Per SQL standards, the fields cannot be null
, so their use in the primary key is equivalent to adding required: true
to their constraints
.
The primaryKey
entry in the schema object
is optional. If present it specifies the primary key for this table.
The primaryKey
, if present, MUST
be an array of strings with each string corresponding to one of the field name
values in the fields
array (denoting that the primary key is made up of those fields). It is acceptable to have an array with a single value (indicating just one field in the primary key). Strictly, order of values in the array does not matter. However, it is RECOMMENDED
that one follow the order the fields in the fields
has as client applications MAY
utilize the order of the primary key list (e.g. in concatenating values together).
Here’s an example:
uniqueKeys
A unique key is a field or a set of fields that are required to have unique logical values in each row in the table. It is directly modeled on the concept of unique constraint in SQL.
The uniqueKeys
property, if present, MUST
be a non-empty array. Each entry in the array MUST
be a uniqueKey
. A uniqueKey
MUST
be an array of strings with each string corresponding to one of the field name
values in the fields
array, denoting that the unique key is made up of those fields. It is acceptable to have an array with a single value, indicating just one field in the unique key.
An example of using the uniqueKeys
property:
In the case of the definition above, the data in the table has to be considered valid only if:
- each row has a unique logical value in the field
a
- each row has a unique set of logical values in the fields
a
andb
- each row has a unique set of logical values in the fields
a
andc
Handling null
values
All the field values that are on the logical level are considered to be null
values MUST
be excluded from the uniqueness check, as the uniqueKeys
property is modeled on the concept of unique constraint in SQL.
Relation to constraints.unique
In contrast with field.constraints.unique
, uniqueKeys
allows to define uniqueness as a combination of fields. Both properties SHOULD
be assessed separately.
foreignKeys
A foreign key is a reference where values in a field (or fields) on the table (‘resource’ in data package terminology) described by this Table Schema connect to values a field (or fields) on this or a separate table (resource). They are directly modelled on the concept of foreign keys in SQL.
The foreignKeys
property, if present, MUST
be an Array. Each entry in the array MUST
be a foreignKey
. A foreignKey
MUST
be a object
and MUST
have the following properties:
fields
-fields
is an array of strings specifying the field or fields on this resource that form the source part of the foreign key. The structure of the array is as perprimaryKey
above.reference
-reference
MUST
be aobject
. Theobject
MUST
have a propertyfields
which is an array of strings of the same length as the outerfields
, describing the field (or fields) references on the destination resource. The structure of the array is as perprimaryKey
above.MAY
have a propertyresource
which is the name of the resource within the current data package, i.e. the data package within which this Table Schema is located. For referencing another data resource theresource
propertyMUST
be provided. For self-referencing, i.e. references between fields in this Table Schema, theresource
propertyMUST
be omitted.
Here’s an example:
An example of a self-referencing foreign key:
Foreign Keys create links between one Table Schema and another Table Schema, and implicitly between the data tables described by those Table Schemas. If the foreign key is referring to another Table Schema how is that other Table Schema discovered? The answer is that a Table Schema will usually be embedded inside some larger descriptor for a dataset, in particular as the schema for a resource in the resources array of a Data Package. It is the use of Table Schema in this way that permits a meaningful use of a non-empty resource
property on the foreign key.
Field
A field descriptor MUST
be a JSON object
that describes a single field. The descriptor provides additional human-readable documentation for a field, as well as additional information that can be used to validate the field or create a user interface for data entry.
Here is an illustration:
The field descriptor object
MAY
contain any number of other properties. Some specific properties are defined below. Of these, only the name
property is REQUIRED
.
name
[required]
The field descriptor MUST
contain a name
property and it MUST
be unique amongst other field names in this Table Schema. This property SHOULD
correspond to the name of a column in the data file if it has a name.
type
and format
These properties are used to give the type of the field (string, number, etc.) - see below for more detail. If type is not provided a consumer MUST
utilize the any
type for the field instead of inferring it from the field’s values.
A field’s type
property is a string indicating the type of this field.
A field’s format
property is a string, indicating a format for the field type.
Both type
and format
are optional: in a field descriptor, the absence of a type
property indicates that the field is of the type “any”, and the absence of a format
property indicates that the field’s type format
is “default”.
Types are based on the type set of json-schema with some additions and minor modifications (cf other type lists include those in Elasticsearch types).
title
A human readable label or title for the field
description
A description for this field e.g. “The recipient of the funds”
example
An example value for the field
constraints
missingValues
A list of missing values for this field as per Missing Values definition. If this property is defined, it takes precedence over the schema-level property and completely replaces it for the field without combining the values.
For example, for the Table Schema below:
A data consumer MUST
:
- interpret
""
andNA
as missing values forcolumn1
- interpret only
-
as a missing value forcolumn2
rdfType
A richer, “semantic”, description of the “type” of data in a given column MAY
be provided using a rdfType
property on a field descriptor.
The value of the rdfType
property MUST
be the URI of a RDF Class, that is an instance or subclass of RDF Schema Class object.
Here is an example using the Schema.org RDF Class http://schema.org/Country
:
The corresponding Table Schema is:
Field Types
The type list with associated formats and other related properties is as follows.
string
The field contains strings, that is, sequences of characters.
Supported formats:
- default: any valid string.
- email: A valid email address.
- uri: A valid URI.
- binary: A base64 encoded string representing binary data.
- uuid: A string that is a uuid.
Native Representaiton
Values MUST
be represented as strings.
number
The field contains numbers of any kind including decimals.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings following the rules below.
Formatting follows that of decimal in XMLSchema: a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, ”+” is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: ‘-1.23’, ‘12678967.543233’, ‘+100000.00’, ‘210’.
The following special string values are permitted (case need not be respected):
- NaN: not a number
- INF: positive infinity
- -INF: negative infinity
A number MAY
also have a trailing:
- exponent: this
MUST
consist of an E followed by an optional + or - sign followed by one or more decimal digits (0-9)
Formatting MAY
be modified using these additional properties:
- decimalChar: A string whose value is used to represent a decimal point within the number. The default value is ”.“.
- groupChar: A string whose value is used to group digits within the number. This property does not have a default value. A common value is ”,” e.g. “100,000”.
- bareNumber: a boolean field with a default of
true
. Iftrue
the contents of this fieldMUST
follow the formatting constraints already set out. Iffalse
the contents of this field may contain leading and/or trailing non-numeric characters (which implementorsMUST
therefore strip). The purpose ofbareNumber
is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g.95%
or leading characters such as currencies e.g.€95
orEUR 95
. Note that it is entirely up to implementors what, if anything, they do with stripped text.
integer
The field contains integers - that is whole numbers.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings following the rules below.
Integer values are indicated in the standard way for any valid integer. Formatting MAY
be modified using these additional properties:
- groupChar: A string whose value is used to group digits within the integer. This property does not have a default value. A common value is ”,” e.g. “100,000”.
- bareNumber: a boolean field with a default of
true
. Iftrue
the contents of this fieldMUST
follow the formatting constraints already set out. Iffalse
the contents of this field may contain leading and/or trailing non-numeric characters (which implementorsMUST
therefore strip). The purpose ofbareNumber
is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g.95%
or leading characters such as currencies e.g.€95
orEUR 95
. Note that it is entirely up to implementors what, if anything, they do with stripped text.
boolean
The field contains boolean data i.e. logical true
or logical false
.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as defined by the trueValues
and falseValues
properties that can be customized to user need:
- trueValues: An array of native values to be interpreted as logical
true
. The default is[ "true", "True", "TRUE", "1" ]
. - falseValues: An array of native values to be interpreted as logical
false
. The default is[ "false", "False", "FALSE", "0" ]
.
object
The field contains a valid JSON object.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be strings that are valid serialized JSON objects.
array
The field contains a valid JSON array.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be strings that are valid serialized JSON arrays.
list
The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. In comparison to the array
type, the list
type is directly modelled on the concept of SQL typed collections.
The list field can be customised with this additional property:
- itemType: specifies the list item type in terms of existent Table Schema types. If present, it
MUST
be one ofstring
,integer
,boolean
,number
,datetime
,date
, andtime
. If not present, the default isstring
. A data consumerMUST
process list items as it were individual values of the corresponding data type.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, the field MUST
contain a string with list items separated by a delimiter which is ,
(comma) by default e.g. value1,value2
. The list items MUST
be serialized using a default format of the corresponding itemType
. The delimiter can be customised with this additional property:
- delimiter: specifies the character sequence which separates list items. If not present, the default is
,
(comma).
datetime
The field contains a date with a time and an optional timezone.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings in one of the following formats:
- default: values
MUST
be in a form defined by XML Schema containing required date and time parts, followed by optional milliseconds and timezone parts, for example,2024-01-26T15:00:00
or2024-01-26T15:00:00.300-05:00
. - <PATTERN>: values in this field can be parsed according to
<PATTERN>
.<PATTERN>
MUST
follow the syntax of standard Python / C strptime. Values in the this fieldSHOULD
be parsable by Python / C standardstrptime
using<PATTERN>
. Example for"format": ""%d/%m/%Y %H:%M:%S"
which would correspond to a date with time like:12/11/2018 09:15:32
. - any: Any parsable representation of the value. The implementing library can attempt to parse the datetime via a range of strategies. An example is
dateutil.parser.parse
from thepython-dateutils
library. It isNOT RECOMMENDED
to useany
format as it might cause interoperability issues.
date
The field contains a date without a time.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings in one of the following formats:
- default: values
MUST
beyyyy-mm-dd
e.g.2024-01-26
- <PATTERN>: The same as for
datetime
- any: The same as for
datetime
time
The field contains a time without a date.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings in one of the following formats:
- default: values
MUST
behh:mm:ss
e.g.15:00:00
- <PATTERN>: The same as for
datetime
- any: The same as for
datetime
year
The field contains a calendar year.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings as per XMLSchema gYear
. Usual representation as a string is YYYY
.
yearmonth
The field containts a specific month in a specific year.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings as per XMLSchema gYearMonth
. Usual representation as a string is YYYY-MM
.
duration
The field contains a duration of time.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings as per XML Schema duration
.
The duration MUST
be in the ISO 8601 extended format PnYnMnDTnHnMnS
, where nY
represents the number of years, nM
the number of months, nD
the number of days, T
is the date/time separator, nH
the number of hours, nM
the number of minutes and nS
the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator MAY
be omitted if their value is zero, and lower order elements MAY
also be omitted for reduced precision.
geopoint
The field contains data describing a geographic point i.e. lon
and lat
values that are floating point numbers.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings in one of the following formats:
- default: A string of the pattern “lon, lat”, where each value is a number, and
lon
is the longitude andlat
is the latitude (note the space is optional after the,
). E.g."90.50, 45.50"
. - array: A JSON array, or a string parsable as a JSON array, of exactly two items, where each item is a number, and the first item is
lon
and the second item islat
e.g.[90.50, 45.50]
- object: A JSON object with exactly two keys,
lon
andlat
and each value is a number e.g.{"lon": 90.50, "lat": 45.50}
geojson
The field contains a JSON object according to GeoJSON or TopoJSON specifications.
Supported formats:
- default: A geojson object as per the GeoJSON spec.
- topojson: A topojson object as per the TopoJSON spec.
Native Representaiton
If supported, values MUST
be natively represented by a data format. If not supported, values MUST
be represented as strings that are valid serialized JSON objects.
any
The field contains values of a unspecified or mixed type. A data consumer MUST NOT
perform any processing on values and MUST
interpret them as it is in native representation of data. This data type is directly modelled on the concept of the any
type of strongly typed object-oriented languages like TypeScript.
For example, having a Table Schema below:
This CSV data file will have logical values as below:
While this JSON data file will have logical values as below:
Note, that for the CSV data source the id
field is interpreted as a string because CSV supports only one data type i.e. string, and for the JSON data source the id
field is interpreted as an integer because JSON supports a numeric data type and the value was declared as an integer. Also, for the Table Schema above a type
property for each field can be omitted as it is a default field type.
Native Representaiton
Values MUST
be natively represented by a data format.
Field Constraints
The constraints
property on Table Schema Fields can be used by consumers to list constraints for validating field values. For example, validating the data in a Tabular Data Resource against its Table Schema; or as a means to validate data being collected or updated via a data entry interface.
All constraints MUST
be tested against the logical representation of data, and the native representation of constraint values MAY
be primitive types as possible in JSON, or represented as strings that are castable with the type
and format
rules of the field.
A constraints descriptor MUST
be a JSON object
and MAY
contain one or more of the following properties:
required
- Type: boolean
- Fields: all
Indicates whether this field cannot be null
. If required is false
(the default), then null
is allowed. See the section on missingValues
for how, in the native representation of the data, strings can represent null
values.
unique
- Type: boolean
- Fields: all
If true
, then all values for that field MUST
be unique within the data file in which it is found.
minLength
- Type: integer
- Fields: collections (string, list, array, object)
An integer that specifies the minimum length of a value.
maxLength
- Type: integer
- Fields: collections (string, list, array, object)
An integer that specifies the maximum length of a value.
minimum
- Type: integer, number, date, time, datetime, duration, year, yearmonth
- Fields: integer, number, date, time, datetime, duration, year, yearmonth
Specifies a minimum value for a field. This is different to minLength
which checks the number of items in the value. A minimum
value constraint checks whether a field value is greater than or equal to the specified value. The range checking depends on the type
of the field. E.g. an integer field may have a minimum value of 100; a date field might have a minimum date. If a minimum
value constraint is specified then the field descriptor MUST
contain a type
key.
maximum
- Type: integer, number, date, time, datetime, duration, year, yearmonth
- Fields: integer, number, date, time, datetime, duration, year, yearmonth
As for minimum
, but specifies a maximum value for a field.
exclusiveMinimum
- Type: integer, number, date, time, datetime, duration, year, yearmonth
- Fields: integer, number, date, time, datetime, duration, year, yearmonth
As for minimum
, but for expressing exclusive range.
exclusiveMaximum
- Type: integer, number, date, time, datetime, duration, year, yearmonth
- Fields: integer, number, date, time, datetime, duration, year, yearmonth
As for maximum
, but for expressing exclusive range.
jsonSchema
- Type: object
- Fields: array, object
A valid JSON Schema object to validate field values. If a field value conforms to the provided JSON Schema then this field value is valid.
pattern
- Type: string
- Fields: string
A regular expression that can be used to test field values. If the regular expression matches then the value is valid. The values of this field MUST
conform to the standard XML Schema regular expression syntax.
enum
- Type: array
- Fields: all
The value of the field MUST
exactly match one of the values in the enum
array.
Related Work
Table Schema draws content and/or inspiration from, among others, the following specifications and implementations: