A Data Package specification for archaeological stratigraphy data following the Harris Matrix convention.


Harris Matrix Data Package

This repository contains archaeological stratigraphic datasets in CSV format, following the table schema developed by Thomas S. Dye for the hm Lisp package, together with a Python command-line tool that can check consistency of data with the format.

Each dataset contains various tables and a data package descriptor (datapackage.json) that enables consistency checks and streamlined data access with the Frictionless Data tools and programming libraries.

Setting up the environment

I installed the Python datapackage and goodtables packages with Pipenv. The repository contains a Pipfile, so it should be enough to run:

pipenv install

Then install the hmdp package with:

pipenv run python setup.py install

This will make the hmdp command available in the virtual environment.

All source code is formatted with Black.


In the Frictionless Data glossary:

  • a data descriptor is a JSON file, named datapackage.json, that is found in the top-level directory of a data package and contains metadata about the entire data package (name, description, creation date, author names, references) together with the data package schema
  • a resource is a single block of data, such as a CSV table or a JSON data file

In the Harris Matrix Data Package:

  • each Harris Matrix is a data package
  • there is 1 data descriptor
  • there are between 2 and 7 CSV tables
  • each CSV table is a resource

The two resources that MUST be present are:

  • contexts
  • observations

Most often, excavation data will make use of three other resources:

  • inferences
  • periods
  • phases

Two further resources should be used only when radiocarbon dates or other absolute chronology data are available:

  • events
  • event-order

Resource names are standardized so that the data descriptor can remain largely untouched, except for the specific metadata.
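As a rough illustration of the naming rules above, the following sketch checks the resource names of a data descriptor that has already been loaded as a dictionary. It is not part of hmdp: the function name check_resource_names and the example descriptor are hypothetical, and only the resource names themselves come from this README.

```python
# Resource names as listed in this README.
REQUIRED = {"contexts", "observations"}
OPTIONAL = {"inferences", "periods", "phases", "events", "event-order"}


def check_resource_names(descriptor):
    """Return a list of problems found in the descriptor's resource names.

    Hypothetical helper, not part of hmdp.
    """
    names = [res.get("name") for res in descriptor.get("resources", [])]
    problems = []
    # Every required resource must be declared.
    for missing in sorted(REQUIRED - set(names)):
        problems.append(f"missing required resource: {missing}")
    # Any other name must be one of the standardized optional names.
    for name in names:
        if name not in REQUIRED | OPTIONAL:
            problems.append(f"unexpected resource name: {name}")
    return problems


# Made-up descriptor for illustration; a real one would be read from
# datapackage.json with json.load().
example = {"resources": [{"name": "contexts"}, {"name": "finds"}]}
for problem in check_resource_names(example):
    print(problem)
```

An empty list means that both required resources are declared and that no non-standard names are used.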

Using the hmdp program from the command line

hmdp matrix datapackage.json will check stratigraphy data consistency and output a matrix.gv file for processing with Graphviz.

To create a graphical representation of the resulting matrix, the default procedure is to use the dot command, like this:

dot matrix.gv -Tpng -o matrix.png
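To give an idea of what a Graphviz file for a matrix can look like, here is a minimal sketch that turns pairs of contexts into DOT text. It assumes that each observation is a (younger, older) pair; the function name observations_to_dot and the layout choices are hypothetical, and the file actually written by hmdp may well differ.

```python
def observations_to_dot(observations):
    """Build Graphviz DOT text from (younger, older) context pairs.

    Hypothetical helper, not the hmdp implementation.
    """
    lines = ["digraph matrix {", "  node [shape=box];"]
    for younger, older in observations:
        # Each edge points from the younger context to the older one.
        lines.append(f'  "{younger}" -> "{older}";')
    lines.append("}")
    return "\n".join(lines)


# Made-up contexts: 1 is younger than 2, which is younger than 3.
example = [("1", "2"), ("2", "3")]
print(observations_to_dot(example))
```

Saving the printed text to a .gv file would let the dot command above render it.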

If something goes wrong, or if you are experimenting with the data format, the check command is a useful shortcut that runs all the available automated checks.

hmdp check datapackage.json will perform three checks on the dataset:

  • validate the data descriptor without looking at the data (e.g. resources can be missing or broken as long as the JSON file is well formed); this is equivalent to running datapackage validate datapackage.json
  • validate every resource for internal consistency (e.g. column headers are present, each row has the right number of columns, and constraints such as integer values and enums are respected); this is equivalent to running goodtables datapackage.json, although in case of errors the separate command gives more details
  • check the consistency of foreign keys based on the data descriptor, again using the goodtables programming library
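The idea behind the third check can be sketched in plain Python, without goodtables: every value in a referencing column must also appear in the referenced column. This is only an illustration of the principle, not how hmdp or goodtables implement it. The helper check_foreign_key and the field names label, younger and older are assumptions made for the example.

```python
def check_foreign_key(child_rows, child_field, parent_rows, parent_field):
    """Return child values that do not appear in the parent table.

    Hypothetical helper illustrating a foreign key check.
    """
    known = {row[parent_field] for row in parent_rows}
    return [row[child_field] for row in child_rows if row[child_field] not in known]


# Made-up rows, as csv.DictReader would yield them from two resources.
contexts = [{"label": "1"}, {"label": "2"}]
observations = [
    {"younger": "1", "older": "2"},
    {"younger": "2", "older": "3"},  # context 3 is not declared
]

dangling = check_foreign_key(observations, "older", contexts, "label")
print(dangling)  # ['3']
```

A non-empty result lists the values that refer to undeclared contexts.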

How to cite this work

If you use this software in your research, please provide a citation to the paper introducing it:

Costa, Stefano. “Una proposta di standard per l’archiviazione e la condivisione di dati stratigrafici.” Archeologia e Calcolatori, 30, 2019, pp. 459–62, DOI: https://doi.org/10.19282/ac.30.2019.29