Reliable Granular References to Changing Linked Data

Tobias Kuhn, Egon Willighagen, Chris Evelo, Núria Queralt-Rosinach, Emilio Centeno, Laura I. Furlong

International Semantic Web Conference (ISWC)
Vienna, 2017

These slides: http://purl.org/tkuhn/presentations/iswc2017-nanodiff

Reproducibility

Reproducible (Linked) Data Science is complicated...

Even Just Specifying the Used Data is Complicated

Specify Input Data:
Current Best Practice

In papers:

"... we used DisGeNET-RDF version 4.0 [32]"

[32] N. Queralt-Rosinach, J. Piñero, À. Bravo, F. Sanz, and L. I. Furlong. DisGeNET-RDF: harnessing the innovative power of the semantic web to explore the genetic basis of diseases. Bioinformatics, 32(14), 2016.

In code:

wget http://rdf.disgenet.org/download/v4.0.0/gda.ttl.gz
# Run analysis here

Requirements and Related Work

We need:

Principled Linked Data versioning
Cryptographically reliable dataset identifiers
References to subsets of larger datasets

Related work addresses these individually but not in combination.
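
To illustrate what "cryptographically reliable" means here, the following is a minimal sketch of a content-derived identifier. It is not the Trusty URI algorithm, which defines its own normalization rules; the function name, the `ex:` terms, and the canonical form (sorted, space-joined triples) are assumptions made for this example only.

```python
import base64
import hashlib


def content_identifier(triples):
    """Return a hash-based identifier for a collection of (s, p, o) triples.

    Hypothetical helper: the canonical form used here is an assumption,
    not the normalization defined by Trusty URIs.
    """
    # Deduplicate and sort so the identifier does not depend on statement order.
    canonical = "\n".join(" ".join(t) for t in sorted(set(triples)))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # URL-safe Base64 without padding, so the result can be appended to a URI.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# Made-up example data: two dataset versions differing by one statement.
v1 = [("ex:gene1", "ex:associatedWith", "ex:diseaseA")]
v2 = v1 + [("ex:gene2", "ex:associatedWith", "ex:diseaseA")]

id_v1 = content_identifier(v1)
id_v2 = content_identifier(v2)
```

Any change to the content changes the identifier, so a reference that contains the identifier can be checked against the data that was actually retrieved.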

Nanopublications

http://nanopub.org

Trusty URIs make resources ...

http://trustyuri.net

Trusty Nanopublications

A Server Network for Nanopublications

9M Nanopublications on the Server Network

http://purl.org/nanopub/monitor

The "Overhead" of Nanopublications

Decontextualizing

For comparison, we can decontextualize the triples of the nanopublications of a given dataset:

Attach provenance/metadata to the entire dataset (instead of to each individual nanopublication)
Drop the context (graph) URI
Then count the unique triples

This measures the hypothetical size of a dataset if nanopublications had not been used
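
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the procedure used for the reported numbers: quads are plain tuples, the `ex:` and `prov:` terms are made-up example data, and `local_terms` is a hypothetical set naming the URIs that are specific to one nanopublication.

```python
DATASET = "ex:dataset"  # hypothetical URI standing for the dataset as a whole


def decontextualize(quads, local_terms):
    """Return the unique triples left after decontextualization.

    quads: iterable of (subject, predicate, object, graph) tuples.
    local_terms: URIs identifying a single nanopublication or one of its graphs.
    """
    def lift(term):
        # Step 1: metadata about one nanopublication now describes the dataset.
        return DATASET if term in local_terms else term

    # Step 2: drop the graph URI. Step 3: the set keeps only unique triples.
    return {(lift(s), p, lift(o)) for s, p, o, _graph in quads}


quads = [
    ("ex:gene1", "ex:associatedWith", "ex:diseaseA", "ex:np1#assertion"),
    ("ex:np1#assertion", "prov:wasDerivedFrom", "ex:source", "ex:np1#provenance"),
    ("ex:gene2", "ex:associatedWith", "ex:diseaseA", "ex:np2#assertion"),
    ("ex:np2#assertion", "prov:wasDerivedFrom", "ex:source", "ex:np2#provenance"),
]
local_terms = {"ex:np1#assertion", "ex:np2#assertion"}

triples = decontextualize(quads, local_terms)
relative_size = len(triples) / len(quads)
```

In this toy input the two provenance statements collapse into one statement about the dataset, so three of the four original statements remain.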

Decontextualized Datasets

Dataset	relative size after decontextualization
LIDDI	82%
neXtProt	55%
GeneRIF-AIDA	43%
DisGeNET v4.0.0.0	14%
DisGeNET v3.0.0.0	14%
DisGeNET v2.1.0.0	14%
OpenBEL 20131211	69%
OpenBEL 1.0	69%

Can we better represent the overlap between dataset versions to counter this significant overhead?

Approach: Granular Versioning
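
As a rough sketch of the idea, assume each dataset version is described by the set of identifiers of the nanopublications it contains, and that an unchanged nanopublication keeps its identifier across versions. A new version then only needs to publish what is new. The function name and the identifiers below are hypothetical.

```python
def version_delta(previous, current):
    """Compare two dataset versions given as collections of nanopublication IDs."""
    prev, cur = set(previous), set(current)
    return {
        "reused": cur & prev,   # unchanged, already published
        "added": cur - prev,    # must be newly published
        "retired": prev - cur,  # no longer part of the dataset
    }


# Made-up identifiers for two consecutive releases.
release_1 = {"RA-aaa", "RA-bbb", "RA-ccc"}
release_2 = {"RA-aaa", "RA-bbb", "RA-ddd"}

delta = version_delta(release_1, release_2)
```

Here only one nanopublication would have to be published for the second release; the other two are referenced again rather than duplicated.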

Evaluation 1

How well does it work on the data publisher side?

Evaluation based on WikiPathways, a community-curated open database of biological pathways:

~ 10 000 nanopublications
Monthly releases over 11 months

http://www.wikipathways.org

WikiPathways Versions

Evaluation 2

How well does it work on the data consumer side?

Evaluation based on DisGeNET, a database on human diseases and their related genes:

1 414 902 nanopublications in version 4.0
Highly cited: 31 publications in 2017 (until 5 May)

http://www.disgenet.org

20 Publications that Used DisGeNET

Sizes of DisGeNET Subsets (Compared to Decontextualized Full Dataset)

Time to Download Typical Subset

Downloading full dataset from DisGeNET server

VERSUS

Downloading 18 098 nanopublications through the server network

Data Publishing with Nanopublication Datasets

Researchers can now exactly specify their input data.

In papers:

"... we used DisGeNET data about these diseases [27]"

[27] Nanopublications from DisGeNET v4.0.0.0 about umls:C0003507 or umls:C1956346. http://purl.org/np/RAcf4tihZLL_aK81hwThIrNxjOhks4sEloBStEgzyR1tI, 11 May 2017.

In code:

np get -c -o data.trig \
  RAcf4tihZLL_aK81hwThIrNxjOhks4sEloBStEgzyR1tI
# Run analysis here

Nanopublication "Overhead" Disappears

The price of nanopublications is offset by:

the benefits of incremental versioning
the ability to refer to just the needed subset

Querying the Nanopublication Cloud

Separation between Publishing and Querying:

Nanopublication network for publishing ✓
Ongoing: development of query services

Early acknowledgements:

Loading all nanopublications into a performant graph DB (Michel Dumontier, Alexander Malic)
Applying Quad Pattern Fragments (Ruben Verborgh)
Applying HDT for quads (Javier Fernández)