Reliable Granular References to Changing Linked Data

Tobias Kuhn, Egon Willighagen, Chris Evelo, Núria Queralt-Rosinach, Emilio Centeno, Laura I. Furlong

International Semantic Web Conference (ISWC)
Vienna, 2017

These slides: http://purl.org/tkuhn/presentations/iswc2017-nanodiff

Reproducibility

Reproducible (Linked) Data Science is complicated...

Even Just Specifying the Used Data is Complicated

Specify Input Data: Current Best Practice

In papers:

"... we used DisGeNET-RDF version 4.0 [32]"

[32] N. Queralt-Rosinach, J. Piñero, À. Bravo, F. Sanz, and L. I. Furlong. DisGeNET-RDF: harnessing the innovative power of the semantic web to explore the genetic basis of diseases. Bioinformatics, 32(14), 2016.

In code:

wget http://rdf.disgenet.org/download/v4.0.0/gda.ttl.gz
# Run analysis here

Requirements and Related Work

We need:

  • Principled Linked Data versioning
  • Cryptographically reliable dataset identifiers
  • References to subsets of larger datasets

Related work addresses these individually but not in combination.

Nanopublications

http://nanopub.org


Trusty URIs make resources ...

http://trustyuri.net
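
The core idea of a hash-based identifier can be illustrated with a much simplified sketch: if the identifier is derived from a cryptographic hash of the content, anyone holding the data can check that it is what the identifier refers to. The helper names content_id and verify_content are hypothetical, and the sketch hashes raw file bytes with sha256sum. The actual trusty URI scheme defines its own normalization and encoding, so these digests will not match real trusty URI codes.

```shell
# Simplified illustration only (NOT the trusty URI algorithm):
# derive an identifier from the SHA-256 digest of a file's raw bytes.
content_id() {
  # sha256sum prints "<hex digest>  <filename>"; keep only the digest
  sha256sum "$1" | cut -d ' ' -f 1
}

# Succeeds only if the file still matches the digest it was published under.
# $1 = file, $2 = expected hex digest
verify_content() {
  [ "$(content_id "$1")" = "$2" ]
}
```

Any change to the file, however small, makes verify_content fail, which is what makes such an identifier a reliable reference.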

Trusty Nanopublications

A Server Network for Nanopublications

9M Nanopublications on the Server Network

http://purl.org/nanopub/monitor

The "Overhead" of Nanopublications

Decontextualizing

For comparison, we can decontextualize the triples of the nanopublications of a given dataset:

  • Attach provenance/metadata to entire dataset (instead of individual nanopublication)
  • Drop context (graph) URI
  • Then count unique triples

This measures the hypothetical size the dataset would have had if nanopublications had not been used.
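
The last two steps can be sketched on an N-Quads dump as follows. This is a rough approximation under stated assumptions: every non-empty line is a quad ending in an IRI graph term, the function names are hypothetical, and "relative size" is taken to mean unique triples divided by the original number of quads. Moving provenance and metadata to the dataset level (the first step) is not modeled.

```shell
# Count unique triples after dropping the graph (context) IRI.
# Assumes one quad per line of the form: <s> <p> <o-or-literal> <graph> .
count_decontextualized() {
  grep '[^[:space:]]' "$1" |
    sed -E 's/[[:space:]]+<[^<>]*>[[:space:]]*\.[[:space:]]*$/ ./' |
    sort -u | wc -l | tr -d ' '
}

# Relative size in percent: unique triples / original quads (assumed definition).
# Expects a non-empty input file.
relative_size() {
  total=$(grep -c '[^[:space:]]' "$1")
  unique=$(count_decontextualized "$1")
  awk -v u="$unique" -v t="$total" 'BEGIN { printf "%.0f\n", 100 * u / t }'
}
```

A low percentage means that many triples recur across nanopublications, so the per-nanopublication context accounts for much of the dataset's size.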

Decontextualized Datasets

Dataset              Relative size after decontextualization
LIDDI                82%
neXtProt             55%
GeneRIF-AIDA         43%
DisGeNET v4.0.0.0    14%
DisGeNET v3.0.0.0    14%
DisGeNET v2.1.0.0    14%
OpenBEL 20131211     69%
OpenBEL 1.0          69%

Can we better represent the overlap between dataset versions to counter this significant overhead?

Approach: Granular Versioning

Evaluation 1

How well does it work on the data publisher side?


Evaluation based on WikiPathways, a community-curated open database of biological pathways:

  • ~ 10 000 nanopublications
  • Monthly releases over 11 months

http://www.wikipathways.org

WikiPathways Versions

Evaluation 2

How well does it work on the data consumer side?


Evaluation based on DisGeNET, a database on human diseases and their related genes:

  • 1 414 902 nanopublications in version 4.0
  • Highly cited: 31 publications in 2017 (until 5 May)

http://www.disgenet.org

20 Publications that Used DisGeNET

Sizes of DisGeNET Subsets (Compared to Decontextualized Full Dataset)

Time to Download Typical Subset

Downloading full dataset from DisGeNET server

VERSUS

Downloading 18 098 nanopublications through the server network

Data Publishing with Nanopublication Datasets

Researchers can now exactly specify their input data.

In papers:

"... we used DisGeNET data about these diseases [27]"

[27] Nanopublications from DisGeNET v4.0.0.0 about umls:C0003507 or umls:C1956346. http://purl.org/np/RAcf4tihZLL_aK81hwThIrNxjOhks4sEloBStEgzyR1tI, 11 May 2017.

In code:

np get -c -o data.trig \
  RAcf4tihZLL_aK81hwThIrNxjOhks4sEloBStEgzyR1tI
# Run analysis here

Nanopublication "Overhead" Disappears

The price of nanopublications is offset by:

  • the benefits of incremental versioning
  • the ability to refer to only the needed subset
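
The benefit of incremental versioning depends on how many nanopublications two versions share. A minimal sketch of measuring that overlap, assuming each version is described by a plain-text file with one nanopublication URI per line (a hypothetical format chosen for illustration, as are the function names):

```shell
# Number of distinct identifiers listed in a version file.
unique_count() {
  sort -u "$1" | wc -l | tr -d ' '
}

# Identifiers present in both versions ($1 = old list, $2 = new list).
# Each list is de-duplicated first so only cross-version repeats are counted.
shared_count() {
  { sort -u "$1"; sort -u "$2"; } | sort | uniq -d | wc -l | tr -d ' '
}

# Identifiers that appear only in the new version.
added_count() {
  echo $(( $(unique_count "$2") - $(shared_count "$1" "$2") ))
}
```

With granular versioning, only the nanopublications reported by added_count would need to be newly published for a release; the shared ones can be referenced again unchanged.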

Querying the Nanopublication Cloud

Separation between Publishing and Querying:

  • Nanopublication network for publishing
  • Ongoing: development of query services

Early acknowledgements:

  • Loading all nanopublications into a performant graph DB (Michel Dumontier, Alexander Malic)
  • Applying Quad Pattern Fragments (Ruben Verborgh)
  • Applying HDT for quads (Javier Fernández)

Thank You for Your Attention!


Questions?