
Modeling VCS with PROV

Tim L edited this page Jun 18, 2014 · 11 revisions

What is first

What we will cover

This page discusses a potential new RDFS vocabulary to model the provenance of Version Control Systems (svn, git, etc.). Perhaps we'll name it http://prefix.cc/pvcs, the PROVenance for Version Control Systems?

Let's get to it

What we can reuse

As of 8 Jan 2013, Linked Open Vocabularies (LOV) lists eight vocabularies that specialize PROV, and one that draws equivalences. Names of PROV-WG members are noted, since we can assume their work is aligned with the W3C Recommendation. Vocabularies are listed with the most relevant first.

  • Provenance Vocabulary (Jun/Olaf)
    • Their (outdated) documentation.
    • They offer prv:File that is disjoint from prv:DataItem. prv:DataItems are prv:serializedBy prv:Files. We can also use Nepomuk.
    • DataItems are prv:containedBy other [larger] DataItems, e.g. a triple in an RDF graph.
    • They offer prv:DataCreation and prv:DataAccess, two disjoint subclasses of prov:Activity. The narrative says "creation of data items", but does that mean prv:DataItem specifically (and thus not files)?
    • They offer disjoint classes prv:HumanAgent and prv:NonHumanAgent.
    • They offer prv:DataProvidingService, a kind of prv:NonHumanAgent (which includes nokia:Service, nokia:Server, and WebServer).
    • They offer prv:Immutable, a kind of prov:Entity.
    • They offer prv:performedBy, a subproperty of prov:wasAssociatedWith that names the agent who performed the activity.
    • Their distinction between prv:usedData and prv:usedGuideline (both subproperties of prov:used) can be very helpful.
    • They offer prv:precededBy to tie a new version to the previous one. It is a subproperty of dcterms:replaces (and should it also be a subproperty of prov:wasDerivedFrom?).
  • NLP Interchange Format
    • This focuses on the provenance of string manipulations, and is thus too granular for most of our needs.
    • However, nif:String allows us to zero in on the serial nature of a File.
  • W3C Organization Ontology
    • This is useful only if we want to model the organizational structures of the developers who contribute to the repository. For example, which of the 50 OPeNDAP committers are actually in the non-profit http://www.opendap.org/about? We know that some RPI people contributed, and that they also did so earlier, when they worked at UCAR. But this is ancillary.
  • Provenance, Authoring and Versioning (Stian)
    • This offers a hierarchy of prov:wasInfluencedBy, but nothing that seems general enough to warrant adoption.
  • Open Annotation Core Data Model (Stian)
    • This focuses on describing "portions" that can then be annotated, and so does not apply to our needs.
  • Vocabulary Of Attribution and Governance (Ralph/TopQuadrant)
    • Very little of this actually extends PROV, so it's more difficult to determine how we would reuse their concepts.
  • Spitfire: describes sensors and observations.
    • Very little of this actually extends PROV (they reproduce Agent and Activity into their own namespace), so it's more difficult to determine how we would reuse their concepts.
  • P-Plan: links plans and parts of plans to their respective executions (Daniel, Yolanda)
    • This is focused on planning: they extend Activity, Bundle, Entity, and Plan in the prescriptive sense (as opposed to the retrospective sense), so it doesn't apply here.

Not extending PROV (according to LOV):

  • Nepomuk

    • We've used this a lot to model the provenance of files. It includes file hash, file name, etc.
  • COGS: vocabulary for describing ETL and data transformation activities.

Concepts that we need

We want to minimize the number of concepts that we introduce, to avoid a cluttered and redundant vocabulary. We also don't want to get into the nuances between different version control systems. We're just looking for a few anchor points that let users navigate and leverage the change history, and that let provenance developers reference VCS elements more easily from within the rest of their systems.

  • pvcs:CommitActivity subclassOf prov:Activity .

    • The name "Commit" can be ambiguous -- it could be the activity or the resulting state.
    • Perhaps contrast CommitActivity with RepositoryState
    • "Joe committed the code" (CommitActivity) vs. "Which commit should I grab to fix the bug?" (RepositoryState)
    • The CommitActivity could include not just the "sending files", but also everything that went into creating what is being sent -- the file edits, ticket conversations, etc. This would make the CommitActivity quite abstract, with many sub-activities.
  • pvcs:Committer subclassOf prov:Role .

  • prov:Entities generated by pvcs:CommitActivities:

    • are either Files or RepositoryStates (if we model both, then Files can be related to the RepositoryStates they are part of)
    • have a schema:version literal
    • have a prov:specializationOf [the general, version-less {file, repository}]
  • pvcs:FileDataObject, an abstract superclass of nfo:FileDataObject with exactly one pvcs:hasHash, exactly one prov:value, and exactly one nfo:fileName, but NO nfo:fileCreated, nfo:fileLastAccessed, nfo:fileLastModified, nfo:fileOwner, nfo:fileSize, nfo:fileUrl, or nfo:permissions. Each concrete copy of the file can be a prov:specializationOf this abstract FileDataObject and can carry the more detailed attributes that we exclude from the abstract one. We can't reuse nfo:hasHash because its domain is the too-specific nfo:FileDataObject, so we introduce pvcs:hasHash as its superproperty.

  • The latest (more abstract) file is Mutable, while each revision is prv:Immutable.

  • The original git2prov distinguishes whether a file was A(dded), M(odified), or D(eleted).

author

pvcs:Author a prov:Role .

committer

pvcs:Committer a prov:Role .
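
The concepts listed in this section can be combined into a small sketch that turns one commit into PROV statements. This is only an illustration under stated assumptions: the `http://example.org/repo/` IRIs, the helper names, and the way A/M/D statuses map onto `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, and `prov:wasInvalidatedBy` are hypothetical choices, not the mapping used by git2prov or an agreed part of pvcs. It also assumes, for simplicity, that a file's previous revision was made in the parent commit.

```python
# Placeholder namespace: no IRIs for repositories have been agreed on this page.
REPO = "http://example.org/repo/"


def file_iri(path):
    """IRI of the general, version-less file (hypothetical scheme)."""
    return f"<{REPO}file/{path}>"


def revision_iri(path, commit_id):
    """IRI of the file as it stands at one commit (hypothetical scheme)."""
    return f"<{REPO}commit/{commit_id}/file/{path}>"


def commit_triples(commit_id, author, committer, changes, parent=None):
    """Return (subject, predicate, object) triples describing one commit.

    changes maps a file path to a status letter: "A", "M", or "D".
    parent is the id of the preceding commit, if known.
    """
    activity = f"<{REPO}commit/{commit_id}>"
    triples = [(activity, "rdf:type", "pvcs:CommitActivity")]

    # Author and committer are kept apart with qualified associations,
    # since the committer need not be the person who did the work.
    for agent, role in ((author, "pvcs:Author"), (committer, "pvcs:Committer")):
        assoc = f"_:{commit_id}_{role.split(':')[1].lower()}"
        triples += [
            (activity, "prov:qualifiedAssociation", assoc),
            (assoc, "rdf:type", "prov:Association"),
            (assoc, "prov:agent", f"<{REPO}agent/{agent}>"),
            (assoc, "prov:hadRole", role),
        ]

    for path, status in sorted(changes.items()):
        if status in ("A", "M"):
            revision = revision_iri(path, commit_id)
            triples += [
                (revision, "rdf:type", "prv:Immutable"),
                (revision, "prov:wasGeneratedBy", activity),
                (revision, "prov:specializationOf", file_iri(path)),
            ]
            if status == "M" and parent is not None:
                # Assumption: the previous revision was made in the parent commit.
                triples.append(
                    (revision, "prov:wasDerivedFrom", revision_iri(path, parent))
                )
        elif status == "D":
            if parent is not None:
                # Assumed mapping: a deletion invalidates the previous revision.
                triples.append(
                    (revision_iri(path, parent), "prov:wasInvalidatedBy", activity)
                )
        else:
            raise ValueError(f"unknown status {status!r} for {path}")
    return triples
```

Calling `commit_triples("c2", "alice", "bob", {"README.md": "M"}, parent="c1")` yields one activity, two role-qualified associations, and an immutable README.md revision that specializes the version-less file and derives from its revision at `c1`.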

Concerns

  • How to distinguish the file metadata from the file access itself? This is a perennial problem in web resource modeling. We address it using three layers: the immutable string generated by a commit, the mutable file that was "updated", and the location (prv:serializedBy/nfo:fileUrl) of the mutable file on the SVN server.
  • Make sure our extension shows up in LOV; follow guidance here.
  • Being a Committer doesn't imply that you actually did the work that you're committing.
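
The three layers from the first concern can be sketched as a handful of triples. This is a rough illustration, not a settled design: the `example.org` IRIs and the `three_layers` helper are hypothetical, SHA-1 is an arbitrary choice of hash, and writing the hash as a plain literal on pvcs:hasHash is a simplification made for the example.

```python
import hashlib

# Placeholder namespaces, chosen only for this example.
REPO = "http://example.org/repo/"
SERVER = "http://svn.example.org/trunk/"


def three_layers(path, content, revision):
    """Return triples linking the three layers for one file revision.

    Layer 1: the immutable content produced by a commit.
    Layer 2: the mutable, version-less file that was "updated".
    Layer 3: the location of the mutable file on the server.
    """
    digest = hashlib.sha1(content).hexdigest()
    mutable = f"<{REPO}file/{path}>"
    immutable = f"<{REPO}file/{path}/r{revision}>"
    location = f"<{SERVER}{path}>"
    return [
        (immutable, "rdf:type", "prv:Immutable"),
        (immutable, "prov:specializationOf", mutable),
        # Simplification: hash written as a literal rather than a hash resource.
        (immutable, "pvcs:hasHash", f'"sha1:{digest}"'),
        (mutable, "nfo:fileUrl", location),
    ]
```

Only the immutable layer carries a hash, because the content behind the mutable file and its server location changes with every commit.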

Version Control Systems

What is next