Reliable Granular References to Changing Linked Data
Tobias Kuhn, Egon Willighagen, Chris Evelo, Núria Queralt-Rosinach, Emilio Centeno, Laura I. Furlong
International Semantic Web Conference (ISWC)
Vienna, 2017
These slides: http://purl.org/tkuhn/presentations/iswc2017-nanodiff
Reproducibility
Reproducible (Linked) Data Science is complicated...
Even Just Specifying the Used Data is Complicated
Specify Input Data:
Current Best Practice
In papers:
"... we used DisGeNET-RDF version 4.0 [32]"
[32] N. Queralt-Rosinach, J. Piñero, À. Bravo, F. Sanz, and L. I. Furlong. DisGeNET-RDF: harnessing the innovative power of the semantic web to explore the genetic basis of diseases. Bioinformatics, 32(14), 2016.
In code:
wget http://rdf.disgenet.org/download/v4.0.0/gda.ttl.gz
# Run analysis here
Requirements and Related Work
We need:
- Principled Linked Data versioning
- Cryptographically reliable dataset identifiers
- References to subsets of larger datasets
Related work addresses these individually but not in combination.
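To make the second requirement more concrete, here is a minimal sketch of a content-derived identifier: the identifier is computed from the data itself, so any change to the data yields a different identifier. This is an illustration only. The helper name `content_id`, the example statements, and the sort-based normalization are assumptions, and this is not the actual trusty URI algorithm.

```python
import base64
import hashlib

def content_id(statements):
    """Return a URL-safe identifier derived from a collection of statement strings."""
    # Sort and de-duplicate so the identifier does not depend on statement order.
    canonical = "\n".join(sorted(set(statements))).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    # URL-safe Base64 without padding keeps the identifier usable inside a URI.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

# Hypothetical example statements (not taken from any real dataset).
v1 = [
    "ex:gene1 ex:associatedWith ex:diseaseA .",
    "ex:gene2 ex:associatedWith ex:diseaseB .",
]
v2 = v1 + ["ex:gene3 ex:associatedWith ex:diseaseA ."]

print(content_id(v1) == content_id(list(reversed(v1))))  # True: same content
print(content_id(v1) == content_id(v2))                  # False: content changed
```

A plain version label such as "version 4.0" offers no such guarantee: the file behind a download URL can change without the label changing.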
Trusty Nanopublications
A Server Network for Nanopublications
The "Overhead" of Nanopublications
Decontextualizing
For comparison, we can decontextualize the triples of the nanopublications of a given dataset:
- Attach provenance/metadata to the entire dataset (instead of to each individual nanopublication)
- Drop context (graph) URI
- Then count unique triples
This measures the hypothetical size of a nanopublication dataset if nanopublications had not been used.
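The three steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce the reported numbers: quads are plain tuples, the function name `decontextualize` and all URIs are hypothetical, and "attaching metadata to the entire dataset" is modeled by rewriting the subject of every statement in a metadata graph to a single dataset URI.

```python
def decontextualize(quads, metadata_graphs, dataset_uri):
    """Return the unique triples left after dropping graph context.

    quads: iterable of (subject, predicate, object, graph) tuples.
    metadata_graphs: set of graph URIs holding provenance/metadata statements.
    dataset_uri: node that metadata is re-attached to (assumption of this sketch).
    """
    triples = set()
    for s, p, o, g in quads:
        if g in metadata_graphs:
            # Step 1: metadata now describes the dataset as a whole.
            s = dataset_uri
        # Step 2: drop the graph URI. Step 3: the set keeps only unique triples.
        triples.add((s, p, o))
    return triples

# Two toy nanopublications sharing the same provenance (hypothetical data).
quads = [
    ("ex:gene1", "ex:associatedWith", "ex:diseaseA", "np1:assertion"),
    ("np1:assertion", "prov:wasDerivedFrom", "ex:sourceDB", "np1:provenance"),
    ("np1:assertion", "dct:creator", "ex:curator", "np1:provenance"),
    ("ex:gene2", "ex:associatedWith", "ex:diseaseB", "np2:assertion"),
    ("np2:assertion", "prov:wasDerivedFrom", "ex:sourceDB", "np2:provenance"),
    ("np2:assertion", "dct:creator", "ex:curator", "np2:provenance"),
]

triples = decontextualize(quads, {"np1:provenance", "np2:provenance"}, "ex:dataset")
print(f"relative size: {len(triples) / len(quads):.0%}")  # relative size: 67%
```

The more metadata the nanopublications of a dataset have in common, the more the triple count shrinks when that metadata is stated only once.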
Decontextualized Datasets
Dataset | relative size after decontextualization |
LIDDI | 82% |
neXtProt | 55% |
GeneRIF-AIDA | 43% |
DisGeNET v4.0.0.0 | 14% |
DisGeNET v3.0.0.0 | 14% |
DisGeNET v2.1.0.0 | 14% |
OpenBEL 20131211 | 69% |
OpenBEL 1.0 | 69% |
Can we better represent the overlap between dataset versions to counter this significant overhead?
Approach: Granular Versioning
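One way to picture the idea is sketched here, assuming each dataset version is represented as a set of immutable nanopublication identifiers. The identifiers and the `release_delta` helper are hypothetical. Under that assumption, a new version only has to publish the nanopublications that are not already present, and unchanged ones are simply referenced again.

```python
def release_delta(previous, current):
    """Split a new release into reused, newly published, and retired identifiers."""
    reused = current & previous    # unchanged: referenced again, not republished
    new = current - previous       # must actually be published for this release
    retired = previous - current   # no longer part of the new version
    return reused, new, retired

# Hypothetical identifiers standing in for nanopublication URIs.
v1 = {"RA-aaa", "RA-bbb", "RA-ccc"}
v2 = {"RA-aaa", "RA-bbb", "RA-ddd", "RA-eee"}

reused, new, retired = release_delta(v1, v2)
print(sorted(new))      # ['RA-ddd', 'RA-eee']
print(sorted(retired))  # ['RA-ccc']
print(f"{len(reused) / len(v2):.0%} of the new version is shared with the previous one")
```

Because identifiers are assumed to be immutable, set intersection is enough to detect overlap; no content comparison between versions is needed.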
Evaluation 1
How well does it work on the data publisher side?
Evaluation based on WikiPathways, a community-curated open database of biological pathways:
- ~10 000 nanopublications
- Monthly releases over 11 months
http://www.wikipathways.org
WikiPathways Versions
Evaluation 2
How well does it work on the data consumer side?
Evaluation based on DisGeNET, a database on human diseases and their related genes:
- 1 414 902 nanopublications in version 4.0
- Highly cited: 31 publications in 2017 (until 5 May)
http://www.disgenet.org
20 Publications that Used DisGeNET
Sizes of DisGeNET Subsets (Compared to Decontextualized Full Dataset)
Time to Download Typical Subset
Downloading full dataset from DisGeNET server
VERSUS
Downloading 18 098 nanopublications through the server network
Data Publishing with Nanopublication Datasets
Researchers can now exactly specify their input data.
In papers:
"... we used DisGeNET data about these diseases [27]"
[27] Nanopublications from DisGeNET v4.0.0.0 about umls:C0003507 or umls:C1956346. http://purl.org/np/RAcf4tihZLL_aK81hwThIrNxjOhks4sEloBStEgzyR1tI, 11 May 2017.
In code:
np get -c -o data.trig \
RAcf4tihZLL_aK81hwThIrNxjOhks4sEloBStEgzyR1tI
# Run analysis here
Nanopublication "Overhead" Disappears
The price of nanopublications is offset by:
- the benefits of incremental versioning
- the ability to refer to just the needed subset
Querying the Nanopublication Cloud
Separation between Publishing and Querying:
- Nanopublication network for publishing ✓
- Ongoing: development of query services
Early acknowledgements:
- Loading all nanopublications in performant graph DB (Michel Dumontier, Alexander Malic)
- Applying Quad Pattern Fragments (Ruben Verborgh)
- Applying HDT for quads (Javier Fernández)
Thank You for Your Attention!
Questions?