This document is licensed under a Creative Commons Attribution 3.0 License.
This specification defines a file format for storage and distribution of Research Objects as a ZIP archive, called a Research Object Bundle (RO Bundle). RO Bundles allow capturing a Research Object to a single file or byte-stream by including its manifest, annotations and some or all of its aggregated resources for the purposes of exporting, archiving, publishing and transferring research objects.
This document is merely a public working draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organisation.
This document is a Working Draft published by the Wf4Ever project. This document is currently work in progress and should not be used as a basis for implementations. Questions, feedback and comments are kindly requested to be sent to the wf4ever-public mailing list/forum.
This section is non-normative.
The Wf4Ever Research Object model [RO] defines a model for aggregating the resources that contribute to a scientific work, including domain-specific annotations and provenance traces. The unit that collects these resources is called a Research Object (RO) and is described in an RDF-based manifest according to the Wf4Ever OWL ontologies. The RO model has been formed in particular for the purpose of preservation of scientific workflows, but is also applicable in a general sense for capturing resources that are related to each other, and which together form a trackable whole. The Research Object primer [ROPrimer] provides further details and examples of using the RO model.
The specification for the RO model does not mandate any particular form for the representation of Research Objects. The Wf4Ever RO Storage and Retrieval Service API [ROSRS] defines how research objects can be accessed and maintained on the web through a RESTful web service exposing RDF/XML and Turtle representations. Practical use of the RO model has however shown that it is also beneficial to represent a research object as a single ZIP archive or as file system folders for the purposes of downloading, editing and archiving a research object.
For instance a scientific workflow system can export a workflow run by saving the workflow definition, runtime provenance trace and generated results to a set of files. A research object that represents the workflow run can aggregate and relate these resources. However, at the time of running the workflow (e.g. on a desktop computer) it is often not known where or if the user would choose to publish the RO; thus the direct use of a ROSRS service or minting public URIs is problematic in this situation.
A Research Object Bundle, as specified by this document, provides a way to collect the resources that are aggregated in a research object, represented as files in a ZIP archive, in addition to their metadata and annotations. The ZIP archive thus becomes a single representation of a research object, which can be exported, archived, published and transferred like a regular file or resource.
A Research Object Bundle is a structured [ZIP] archive, specializing the Adobe Universal Container Format [UCF]. UCF is based on the EPUB [OCF] format, but generalized to be any kind of container. The following section gives an informal introduction to the UCF format. For the complete, normative details, see the [UCF] specification.
This section is non-normative.
A UCF container is based on the ZIP compression file format [ZIP], enforcing additional restrictions. The most important restrictions are:
mimetype
and META-INF
mimetype
and without any extra attributes
UCF says about mimetype:
The first file in the Zip container MUST be a file with the ASCII name of mimetype, which holds the MIME type for the Zip container (application/epub+zip as an ASCII string; no padding, white-space, or case change).
The actual media type to include in mimetype depends on the specific container type (the above quote uses ePub as an example). See section 2.2 RO bundle container.
Best Practice 1: Use zip -0 -X
To add the mimetype file correctly on a UNIX/Linux installation with InfoZip, use echo -n and zip -0 -X. Below is an example which adds mimetype correctly as the first, uncompressed file, then the remaining files (excluding mimetype) with the default compression:
    stain@ahtissuntu:~/test$ echo -n application/vnd.wf4ever.robundle+zip > mimetype
    stain@ahtissuntu:~/test$ zip -0 -X ../example.robundle mimetype
      adding: mimetype (stored 0%)
    stain@ahtissuntu:~/test$ zip -X -r ../example.robundle . -x mimetype
      adding: META-INF/ (stored 0%)
      adding: META-INF/container.xml (stored 0%)
      adding: .ro/ (stored 0%)
      adding: .ro/manifest.json (stored 0%)
      adding: helloworld.txt (stored 0%)
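Where InfoZip is not available, the same layout can be produced programmatically. The following is a minimal sketch using Python's standard zipfile module; the function name write_bundle and the sample file contents are illustrative assumptions, not part of this specification.

```python
import io
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_bundle(target, files):
    """Write `files` (archive path -> bytes) to `target` as a ZIP archive.

    `write_bundle` is a hypothetical helper. It adds `mimetype` as the
    first entry, stored without compression and without an extra field,
    then adds the remaining entries with deflate compression.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        info = zipfile.ZipInfo("mimetype")
        info.compress_type = zipfile.ZIP_STORED  # equivalent of zip -0
        zf.writestr(info, MIMETYPE.encode("ascii"))
        for path, data in files.items():
            if path == "mimetype":
                continue  # already written as the first entry
            zf.writestr(path, data)


# Illustrative content only.
buffer = io.BytesIO()
write_bundle(buffer, {
    ".ro/manifest.json": b'{"id": "bundle:"}',
    "helloworld.txt": b"Hello, world\n",
})
```

Because the first entry is stored and has no extra field, the media type string appears as plain ASCII near the start of the byte stream, which is what allows simple tools to sniff the container type.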
UCF says about META-INF/container.xml and rootfiles:
A UCF Container MAY include a file named container.xml in the META-INF directory at the root level of the container file system. If present, the container.xml file MAY identify the MIME type of, and path to, the root file for the container and any OPTIONAL alternative renditions included in the container.
An example of META-INF/container.xml which defines the rootfile as .ro/manifest.json:
    <?xml version="1.0"?>
    <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
      <rootfiles>
        <rootfile full-path=".ro/manifest.json" media-type="application/ld+json" />
      </rootfiles>
    </container>
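A consumer can locate the rootfile by reading this document with any namespace-aware XML parser. The sketch below uses Python's xml.etree.ElementTree; the helper name find_rootfiles is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

EXAMPLE_CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path=".ro/manifest.json" media-type="application/ld+json" />
  </rootfiles>
</container>
"""


def find_rootfiles(container_xml):
    """Return (full-path, media-type) pairs for every rootfile entry.

    Hypothetical helper: it only reads the two attributes used in the
    example above and ignores anything else in container.xml.
    """
    root = ET.fromstring(container_xml)
    query = "{%s}rootfiles/{%s}rootfile" % (CONTAINER_NS, CONTAINER_NS)
    return [(el.get("full-path"), el.get("media-type"))
            for el in root.findall(query)]


rootfiles = find_rootfiles(EXAMPLE_CONTAINER)
```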
.ro
mimetype SHOULD be application/vnd.wf4ever.robundle+zip (see below)
META-INF/container.xml, if present, SHOULD contain a rootfile entry equivalent to: <rootfile full-path=".ro/manifest.json" media-type="application/ld+json" />
.ro/manifest.json SHOULD be present, and MUST describe the RO according to section 3. Manifest.
Applications that specialize RO Bundles MAY specify a different mimetype, for instance because the bundle is used to distribute application-specific data. It is RECOMMENDED for such extensions that their media type end with +zip according to [RFC6839] unless it is not considered meaningful for a user to treat such bundles as a general ZIP archive.
Beyond rootfiles, the UCF specification does not specify how to find the media-type when resolving individual resources in a bundle. If an application requires a media-type for a resource in the RO bundle, it MAY use the defaults below based on case-insensitive comparison of the file extension. In the absence of a resolved media type, the media type application/octet-stream MAY be assumed.
Extension | Media type |
---|---|
.txt | text/plain; charset="utf-8" |
.ttl | text/turtle; charset="utf-8" |
.rdf | application/rdf+xml |
.json | application/json |
.xml | application/xml |
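The extension lookup described above could be sketched as follows. The table values are taken from this section; the function name guess_media_type is an illustrative assumption.

```python
import posixpath

# Defaults copied from the table above.
DEFAULT_MEDIA_TYPES = {
    ".txt": 'text/plain; charset="utf-8"',
    ".ttl": 'text/turtle; charset="utf-8"',
    ".rdf": "application/rdf+xml",
    ".json": "application/json",
    ".xml": "application/xml",
}


def guess_media_type(path):
    """Return a default media type for a path within the bundle.

    Hypothetical helper: the extension is compared case-insensitively,
    and application/octet-stream is assumed when nothing matches.
    """
    extension = posixpath.splitext(path)[1].lower()
    return DEFAULT_MEDIA_TYPES.get(extension, "application/octet-stream")
```

For instance, guess_media_type("folder/NOTES.TXT") gives the text/plain entry, while a path with an unlisted extension falls back to application/octet-stream.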
Applications MAY use the file META-INF/manifest.xml, if present, to resolve media types for resources in the RO bundle according to the manifest:media-type of the corresponding manifest:file-entry, as defined in the ODF Package specification [ODF]; see however the warnings in section 2.2.2 META-INF/manifest.xml below.
To avoid confusion with the somewhat overlapping RO manifest it is NOT RECOMMENDED to include META-INF/manifest.xml in RO Bundles. Applications and specializations of this specification MAY however include META-INF/manifest.xml, for instance to provide media types as specified in section 2.2.1 Resource media type.
If META-INF/manifest.xml is present, it MUST follow the ODF Package specification [ODF]. That means that, if present, the META-INF/manifest.xml file MUST list all resources in the RO bundle, including the folder .ro and its content, but excluding mimetype and META-INF and its content.
The research object SHOULD be described in the file .ro/manifest.json as specified below.
The file .ro/manifest.json, if present, MUST contain the [ORE] manifest for the research object according to this section. The file MUST be in JSON format [RFC4627], and SHOULD be valid [JSON-LD].
Identifiers used below are either:
.ro/ folder, which SHOULD NOT contain the : character. For instance manifest.json or annotations/ann2. Depending on how meta-resources are used, the ZIP might or might not include a corresponding entry for the given path.
bundle: to indicate the root of the bundle, for instance bundle:hello.txt or bundle:folder2/. Folders SHOULD have a path terminating with /. The resource identified by the path SHOULD be included as a corresponding file or folder in the ZIP file.
:), external to the bundle. For instance http://example.com/external
The structure of the JSON manifest is given by a JSON object with the keys:
@context
"http://purl.org/wf4ever/ro-bundle/context.json", but MAY be a list, which SHOULD have this value as the last item.
id
"bundle:" indicating the relative top-level folder as the identifier. Note that this means the absolute URI identifying the research object depends on the base URI this Research Object Bundle is considered to be accessed at, for instance file:///Users/alice/ro13.robundle/ (See section 4. Identifiers.)
manifest
.ro/ folder. SHOULD be literal "manifest.json", but MAY be a list, in which case the list MUST contain "manifest.json"
createdOn
createdBy
authoredBy
. The creator SHOULD be an object
with the following keys:
uri
http://example.com/fred#fred
orcid
http://orcid.org/0000-0001-9842-9718
. An ORCID
MAY be present if known.
name
"John Doe"
or "University of Manchester"
@graph
according to section 3.1.2 Custom JSON-LD
by
using a @id
equal to the creator uri
.
authoredOn
authoredBy
createdBy
.
SHOULD be an object with the same keys and requirements as
for createdBy
, but MAY be a list to indicate
multiple authors.
Additional authorship information (curation, contribution, etc) MAY be added using the pav: namespace within the top-level @graph key according to section 3.1.2 Custom JSON-LD by using an @id value equal to the bundle id, e.g. "bundle:".
history
.ro/ folder. This property MAY be present, in which case it SHOULD be "evolution.ttl", indicating that the file .ro/evolution.ttl contains the provenance trace. This value MAY be a URI. The property MAY give a list if several provenance traces are known, in which case the list SHOULD include "evolution.ttl".
The file .ro/evolution.ttl, if present, SHOULD include a provenance trace of this research object according to the roevo ontology.
aggregates
bundle:
file
or uri
. Its members are:
file
bundle:
uri
uri
MUST NOT be provided at the
same time as file
.folder
uri
) belongs to, relative to the
root of the bundle. The path SHOULD be prefixed with
bundle:
and SHOULD end with /
,
for instance bundle:folder2/
.mediatype
file
) resource. This SHOULD be specified
for a resource identified by file
,
unless its media type is correctly identified
according to
section 2.2.1 Resource media type.
createdOn
createdBy
proxy
The proxy identifier SHOULD consist of the prefix proxy: and a lowercased UUID string [RFC4122]. For example: proxy:d4f09040-272e-467f-9250-59593bd4ac8f
The order of the values in the aggregates list is insignificant, however the list MUST NOT contain duplicate entries. An entry is considered duplicate by comparing literal values and members file and uri uniformly as URIs [URI].
annotations
An annotation is specified as an object, which has the following members:
annotation
annotation:
and
a lowercased UUID string [RFC4122]. For example:
annotation:1a876f9e-4ffe-4c99-a05d-cd9d0cbd4cbb
about
id
, e.g.
bundle:
bundle:
, which SHOULD be listed under
aggregates
if that key is presentproxy:
, which MUST be defined under
aggregates
with a matching
value for proxy
annotation:
, which MUST be defined under
annotations
content
bundle:
, which SHOULD be listed under
aggregates
if that key is present@graph
according to section 3.1.2 Custom JSON-LD
by using a @id
matching the
annotation
identifier.
@graph
An example of a manifest which is valid JSON-LD is included below:
    {
      "@context": [
        { "@base": "widget://129b8efe-a692-48a0-85d4-ebc6c0a9b057/.ro/" },
        "http://purl.org/wf4ever/ro-bundle/context.json"
      ],
      "id": "/",
      "manifest": "manifest.json",
      "createdOn": "2013-03-05T17:29:03Z",
      "createdBy": {
        "uri": "http://example.com/foaf#alice",
        "orcid": "http://orcid.org/0000-0002-1825-0097",
        "name": "Alice W. Land"
      },
      "history": "evolution.ttl",
      "aggregates": [
        "/folder/soup.jpeg",
        "http://example.com/blog/",
        {
          "file": "/README.txt",
          "mediatype": "text/plain",
          "createdBy": {
            "uri": "http://example.com/foaf#bob",
            "name": "Bob Builder"
          },
          "createdOn": "2013-02-12T19:37:32.939Z"
        },
        {
          "uri": "http://example.com/external.txt",
          "folder": "/folder/",
          "proxy": "uuid:a0cf8616-bee4-4a71-b21e-c60e6499a644"
        }
      ],
      "annotations": [
        {
          "annotation": "uuid:d67466b4-3aeb-4855-8203-90febe71abdf",
          "about": "/folder/soup.jpeg",
          "content": "annotations/soup-properties.ttl"
        },
        {
          "about": "uuid:a0cf8616-bee4-4a71-b21e-c60e6499a644",
          "content": "http://example.com/blog/they-aggregated-our-file"
        },
        {
          "about": [
            "/",
            "uuid:d67466b4-3aeb-4855-8203-90febe71abdf"
          ],
          "content": "annotations/a-meta-annotation-in-this-ro.txt"
        }
      ]
    }
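As a rough illustration of the key descriptions in this section, the sketch below builds a small manifest and checks the aggregates list for duplicates. The helper names (minimal_manifest, entry_identifier, find_duplicates) are assumptions, it uses the bundle: and proxy: identifier forms described in the text, and the duplicate check compares identifiers as plain strings, which is a simplification of comparing them as URIs [URI].

```python
import json
import uuid

CONTEXT = "http://purl.org/wf4ever/ro-bundle/context.json"


def minimal_manifest(files, external_uris):
    """Build a manifest dict aggregating bundled files and external URIs.

    Hypothetical helper: bundled paths are prefixed with "bundle:" and
    each external URI is given a freshly minted proxy identifier.
    """
    aggregates = ["bundle:" + path for path in files]
    for uri in external_uris:
        # str() of a UUID is already lowercase.
        aggregates.append({"uri": uri, "proxy": "proxy:" + str(uuid.uuid4())})
    return {
        "@context": CONTEXT,
        "id": "bundle:",
        "manifest": "manifest.json",
        "aggregates": aggregates,
    }


def entry_identifier(entry):
    """Return the identifier of an aggregates entry (string or object)."""
    if isinstance(entry, str):
        return entry
    return entry.get("file") or entry.get("uri")


def find_duplicates(aggregates):
    """Return identifiers occurring more than once (exact string match)."""
    seen, duplicates = set(), []
    for entry in aggregates:
        identifier = entry_identifier(entry)
        if identifier in seen:
            duplicates.append(identifier)
        seen.add(identifier)
    return duplicates


manifest = minimal_manifest(["hello.txt"], ["http://example.com/external"])
manifest_json = json.dumps(manifest, indent=2)
```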
Manifests following the JSON structure defined in section 3.1 .ro/manifest.json with a "@context": "http://purl.org/wf4ever/ro-bundle/context.json" are intended to be valid [JSON-LD] without any additional modifications. Mapping .ro/manifest.json to the ORE and [RO] models in RDF SHOULD be performed according to the algorithm for conversion from JSON to RDF, as specified in the JSON-LD API [JSON-LD].
Describe JSON-LD context
    {
      "@context": {
        "ao": "http://purl.org/ao/",
        "oa": "http://www.w3.org/ns/oa#",
        "dc": "http://purl.org/dc/elements/1.1/",
        "dct": "http://purl.org/dc/terms/",
        "ore": "http://www.openarchives.org/ore/terms/",
        "ro": "http://purl.org/wf4ever/ro#",
        "roterms": "http://purl.org/wf4ever/roterms#",
        "robundle": "http://purl.org/wf4ever/robundle#",
        "prov": "http://www.w3.org/ns/prov#",
        "pav": "http://purl.org/pav/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "uuid": "urn:uuid:",
        "id": "@id",
        "file": "@id",
        "uri": "@id",
        "annotation": "@id",
        "manifest": { "@id": "ore:isDescribedBy", "@type": "@id" },
        "createdOn": { "@id": "pav:createdOn", "@type": "xsd:dateTime" },
        "createdBy": { "@id": "pav:createdBy", "@type": "@id" },
        "authoredOn": { "@id": "pav:authoredOn", "@type": "xsd:dateTime" },
        "authoredBy": { "@id": "pav:authoredBy", "@type": "@id" },
        "curatedOn": { "@id": "pav:curatedOn", "@type": "xsd:dateTime" },
        "curatedBy": { "@id": "pav:curatedBy", "@type": "@id" },
        "contributedOn": { "@id": "pav:contributedOn", "@type": "xsd:dateTime" },
        "contributedBy": { "@id": "pav:contributedBy", "@type": "@id" },
        "name": { "@id": "foaf:name" },
        "orcid": { "@id": "roterms:orcid", "@type": "@id" },
        "history": { "@id": "prov:has_provenance", "@type": "@id" },
        "aggregates": { "@id": "ore:aggregates", "@type": "@id" },
        "mediatype": { "@id": "dc:format" },
        "folder": { "@id": "robundle:inFolder", "@type": "@id" },
        "proxy": { "@id": "robundle:hasProxy", "@type": "@id" },
        "annotations": { "@id": "robundle:hasAnnotation", "@type": "@id" },
        "content": { "@id": "oa:hasBody", "@type": "@id" },
        "about": { "@id": "oa:hasTarget", "@type": "@id" }
      }
    }
As an example of this processing, below is a Turtle representation after processing the .ro/manifest.json shown as an example in section 3.1 .ro/manifest.json:
Generate example
Applications that support JSON-LD (rather than just JSON) MAY choose to parse and generate additional statements in .ro/manifest.json according to the [JSON-LD] specifications.
Applications generating JSON-LD MAY use a @context list, but SHOULD include http://purl.org/wf4ever/ro-bundle/context.json as the last item in the list to indicate to JSON parsers that the manifest can be parsed as plain JSON according to section 3.1 .ro/manifest.json.
Applications SHOULD NOT use @context at deeper nesting levels, except within the top level @graph.
Applications SHOULD NOT write additional properties directly to JSON-LD nodes defined from section 3.1 .ro/manifest.json. Instead, additional statements SHOULD be made within an additional @graph node according to JSON-LD Named Graphs. @graph SHOULD only be added to the top-level object.
For example:
    {
      "@context": "http://purl.org/wf4ever/ro-bundle/context.json",
      "id": "bundle:",
      "manifest": "manifest.json",
      "aggregates": [
        "http://example.com/blog/2012",
        "http://example.com/blog/2013"
      ],
      "@graph": [
        {
          "@id": "http://example.com/blog/2013",
          "dct:replaces": "http://example.com/blog/2012"
        },
        {
          "@id": "http://example.com/blog/2012",
          "dct:isReplacedBy": "http://example.com/blog/2013"
        }
      ]
    }
Note that rather than using the above extension mechanism, it is generally RECOMMENDED to instead store such additional statements in an annotation body for purposes of provenance and separation of concern. Although technically valid, it is NOT RECOMMENDED to use the member @graph to embed semantic annotation bodies within annotations nodes, as it would duplicate the content of the annotation body in the bundle and may lead to inconsistencies.
In addition to the .ro/manifest.json representation specified in section 3.1 .ro/manifest.json, a Research Object Bundle MAY include the ORE manifest in alternative representations like RDF/XML [RDF-SYNTAX-GRAMMAR] and Turtle [TURTLE], for instance by generating them using the conversion from JSON to RDF algorithm in JSON-LD API [JSON-LD].
.ro/manifest, for instance .ro/manifest.ttl for a Turtle representation.
.ro/manifest.json as the authoritative representation of the research object.
.ro/manifest.json (see section 3.1.1 JSON-LD and mapping to RO model)
META-INF/container.xml as <rootfile> entries with corresponding media-type attributes.
.ro/manifest.json
This section is non-normative.
Objects in a research object bundle are identified within the JSON manifest using different JSON-LD prefixes, which can be thought of as local URI schemes that resolve to relative URI references based at the root of the ZIP archive.
Prefix | Relative URI reference |
---|---|
(no prefix) | .ro/ |
bundle: | ./ |
proxy: | .ro/proxies/ |
annotation: | .ro/annotations/ |
(other) | Absolute URI |
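The prefix table can be applied mechanically. Below is a minimal sketch, assuming a hypothetical helper to_relative_reference and treating any other identifier containing a colon as an absolute URI.

```python
# Prefix mappings copied from the table above.
PREFIXES = {
    "bundle:": "./",
    "proxy:": ".ro/proxies/",
    "annotation:": ".ro/annotations/",
}


def to_relative_reference(identifier):
    """Map a manifest identifier to a URI reference.

    Hypothetical helper: the result is relative to the root of the ZIP
    archive, except for absolute URIs, which are returned unchanged.
    """
    for prefix, base in PREFIXES.items():
        if identifier.startswith(prefix):
            return base + identifier[len(prefix):]
    if ":" in identifier:
        return identifier  # (other): already an absolute URI
    return ".ro/" + identifier  # (no prefix): relative to the .ro/ folder
```

For instance, bundle:folder2/ maps to ./folder2/ and manifest.json maps to .ro/manifest.json.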
Due to their nature as ZIP files, Research Object Bundles might be downloaded, copied, moved and republished. In order to avoid ambiguity about RO identity and evolution, each Research Object Bundle serialization is considered to represent a unique Research Object. Thus any of the prefixes above describing resources within the bundle are relative to the root of the ZIP file, and the id identifying the Research Object is set to bundle:, meaning the root represents the RO itself.
This section is non-normative.
Applications which require an absolute URI for identifying a resource within a Research Object Bundle may choose to use one of the approaches presented in this section in combination with resolving against the prefix table above.
This section is non-normative.
If an RO bundle is published at an HTTP (or HTTPS) server, then URIs to the bundled resources can be minted by assuming a base URI of the RO Bundle URI with / appended.
For instance, if:
http://example.com/example1.robundle
contains the file folder/helloworld.txt (bundle:folder/helloworld.txt in the manifest.json), then we can assume the base URI:
http://example.com/example1.robundle/
and can refer to the file as:
http://example.com/example1.robundle/folder/helloworld.txt
A web server that exposes RO bundles MAY support resolving such nested URIs by internally extracting the resources from the ZIP archive or redirecting to an existing resource, for instance because it is implementing the [ROSRS] API.
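A client-side sketch of this minting approach, using standard relative reference resolution from Python's urllib.parse; the helper names bundle_base and resolve_in_bundle are illustrative assumptions.

```python
from urllib.parse import urljoin


def bundle_base(bundle_uri):
    """Return the assumed base URI: the RO Bundle URI with / appended."""
    return bundle_uri if bundle_uri.endswith("/") else bundle_uri + "/"


def resolve_in_bundle(bundle_uri, path):
    """Mint a URI for `path` (relative to the bundle root).

    Hypothetical helper built on urljoin, which implements the usual
    relative reference resolution for hierarchical URIs.
    """
    return urljoin(bundle_base(bundle_uri), path)


file_uri = resolve_in_bundle("http://example.com/example1.robundle",
                             "folder/helloworld.txt")
```

With the inputs above, file_uri is http://example.com/example1.robundle/folder/helloworld.txt, matching the example in this section.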
Semantically, the distinction between the URI with or without the trailing / is that, say, example1.robundle identifies the RO Bundle, i.e. the ZIP archive, which has attributes such as size in bytes, checksum, etc., while example1.robundle/ identifies the slightly more abstract concept of the Research Object (the aggregation) that is serialized as an RO bundle.
Advantages:
img/picture.jpeg linked from http://example.com/example1.robundle/document.html resolves to http://example.com/example1.robundle/img/picture.jpeg)
Disadvantages
.robundle should be removed. (URL hacking).
file scheme. For instance the reference ../../etc/passwd from file:///tmp/example1.robundle/evil.html could be resolved to file:///etc/passwd
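An application resolving references this way may want to reject anything that lands outside the bundle. The following is a minimal sketch of such a check; the function name stays_inside is an assumption, and a simple prefix comparison is used rather than any mechanism defined by this specification.

```python
from urllib.parse import urljoin


def stays_inside(base_uri, document_uri, reference):
    """Report whether `reference` resolves to a URI under `base_uri`.

    Hypothetical helper: `base_uri` is the assumed bundle base ending
    in "/", and `document_uri` is the resource containing the reference.
    """
    resolved = urljoin(document_uri, reference)
    return resolved.startswith(base_uri)


BASE = "file:///tmp/example1.robundle/"
escapes = not stays_inside(BASE, BASE + "evil.html", "../../etc/passwd")
```

Here escapes is true for the ../../etc/passwd reference, while a reference such as img/picture.jpeg from the same document stays within the base.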
This technique SHOULD NOT be used if:
urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6
)
http://example.com/ro?id=13
application/vnd.wf4ever.robundle+zip
MAY be interpreted to have a fragment identifier
that is resolvable as the path within the ZIP archive.
For instance, if:
http://example.com/example1.robundle
contains the file folder/helloworld.txt (bundle:folder/helloworld.txt in the manifest.json), then we can refer to the Research Object as
http://example.com/example1.robundle#
and can refer to the file as:
http://example.com/example1.robundle#folder/helloworld.txt
Advantages:
img.jpg from http://example.com/example1.robundle#document.html would seem to refer to http://example.com/img.jpg
#para2 in document.html becomes http://example.com/example1.robundle#document.html#para2
jar:, the original URI of the JAR file, the separator !/, and the path within the JAR file. For all practical purposes, an RO bundle, being a ZIP archive, can be interpreted as a JAR file. For instance, if:
http://example.com/example1.robundle
contains the file folder/helloworld.txt (bundle:folder/helloworld.txt in the manifest.json), then we can assume the base URI
jar:http://example.com/example1.robundle!/
and can refer to the file as:
jar:http://example.com/example1.robundle!/folder/helloworld.txt
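Forming such a URI is plain string concatenation, as in this small sketch (the helper name jar_uri is an assumption):

```python
def jar_uri(bundle_uri, path=""):
    """Form a jar: URI for `path` within the bundle at `bundle_uri`.

    Hypothetical helper: the parts are concatenated as-is, so `path`
    is expected to be relative to the bundle root without a leading /.
    """
    return "jar:" + bundle_uri + "!/" + path


base = jar_uri("http://example.com/example1.robundle")
hello = jar_uri("http://example.com/example1.robundle", "folder/helloworld.txt")
```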
Advantages:
Disadvantages:
jar: scheme is not hierarchical (it does not use //), and so relative URI references within RO bundle resources are not correctly resolved (not even by java.net.URI).
META-INF/MANIFEST.MF.
The Widget URI scheme defines how a URI can be formed for the purposes of accessing resources within a ZIP file as if it was an HTTP server. While this is intended for sandboxing Packaged web apps, it is equally applicable to Research Object bundles for the purposes of sandboxing.
The Widget URI scheme recommends generating a UUID string [RFC4122] for minting the authority, forming the base URI for the RO bundle. For instance, if:
http://example.com/example1.robundle
contains the file folder/helloworld.txt (bundle:folder/helloworld.txt in the manifest.json), then we generate a new UUID 8191dee8-0b8e-452d-8d64-7706a140185e and refer to the Research Object as
widget://8191dee8-0b8e-452d-8d64-7706a140185e/
and can refer to the file as:
widget://8191dee8-0b8e-452d-8d64-7706a140185e/folder/helloworld.txt
For purposes of security/sandboxing when interpreting RO bundles, the authority should be a v4 UUID generated from random numbers. For purposes of describing the content of an RO bundle at a given URI, the authority should be a name-based UUID using v5 (SHA-1 hashing). For purposes of describing the content of an RO bundle as a bytestream independent of its location (for instance on a USB stick), the authority should be the hexadecimal SHA-256 checksum of the ZIP archive.
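The three kinds of authority could be minted as sketched below with Python's uuid and hashlib modules. The function names are illustrative, and the use of the standard URL namespace (uuid.NAMESPACE_URL) for the v5 UUID is an assumption, since this section only states that a name-based v5 UUID is used.

```python
import hashlib
import uuid


def sandbox_base():
    """Base URI with a random (v4) UUID authority, for sandboxing."""
    return "widget://%s/" % uuid.uuid4()


def location_base(bundle_url):
    """Base URI with a name-based (v5) UUID derived from the bundle URL.

    Assumption: the RFC 4122 URL namespace is used as the v5 namespace.
    """
    return "widget://%s/" % uuid.uuid5(uuid.NAMESPACE_URL, bundle_url)


def content_base(bundle_bytes):
    """Base URI whose authority is the hex SHA-256 of the ZIP bytes."""
    return "widget://%s/" % hashlib.sha256(bundle_bytes).hexdigest()
```

The v5 and SHA-256 variants are deterministic: the same URL or the same bytes always give the same base URI, whereas the v4 variant differs on every call.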
Example widget base URIs
widget://15259726-dcbb-42ff-8fc6-36282c98d4e6/
UUID v4 using pseudo-random number
widget://7878e885-327c-5ad4-9868-7338f1f13b3b/
UUID v5 of the URL http://example.com/bundle1.robundle
widget://587cff3ae37d58af6886d656623bd91237759a42d8fe1575a9744898c01d97d7/
SHA-256 of an empty RO bundle
Advantages:
Disadvantages:
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [RFC2119].
Thanks to Khalid Belhajjame, Graham Klyne and Piotr Holubowicz for reviewing this specification. The underlying work has been funded as part of the Wf4Ever project, funded by the European Commission's FP7 programme (FP7-ICT-2007-6 270192). Many thanks to Robin Berjon for making ReSpec.js which generated this page.