A Project of OCLC Research OCLC Online Computer Library Center

PURLS

PURL Validation README


I. Introduction

Validating PURLs is basically a four-step process:
  1. identify the list of PURLs to validate,
  2. validate each PURL in the list identified in step 1,
  3. output (and possibly store) information about the results of step 2, and
  4. process the output from step 3 to produce reports for the appropriate individuals.

Most people will want to use the online forms available from the PURL Validation page to validate a few PURLs once in a while and immediately view the results. Those people do not need to read the rest of this document. The online forms should contain sufficient information to validate PURLs and view the results.

The rest of this document provides an overview of PURL validation steps for PURL administrators so they can control the validation process at their site, run batch validations, and possibly create their own validation processes.

Note to administrators: The directory WebRoot/docs/maint/validate under your PURL installation's top-level directory contains code and examples demonstrating the use of PURL metadata to validate PURLs. The code and form files referenced below are located in this directory unless explicitly specified otherwise.


II. Identifying a list of PURLs to validate

Every validation begins with identifying a list of PURLs to validate. The PURL Validation page provides three methods of identifying a list of PURLs to validate and each method has a corresponding input form:

  1. searching the PURL metadata by PURL, URL, and/or maintainer (search_check.html),
  2. searching the previous validation results (results_check.html), and
  3. hand editing the list of PURLs (select.pl.cgi).

The input from the forms for methods 1 and 2 is sent to the Perl script select.pl.cgi for processing. select.pl.cgi finds the list of PURLs that match the form input and presents that list to the user for further editing in a subsequent form. select.pl.cgi is also used to accomplish method 3 by outputting an empty list into which the user can enter the PURLs directly.


III. Validating a list of PURLs

Once a list of PURLs has been selected, the actual validation process can begin. When the user submits the results from the forms described in Section II above, the Perl script validate.pl.cgi validates each PURL in the list by performing an HTTP GET on its associated URL and checking the results. The Perl script follows redirects where necessary and obeys robots.txt restrictions.
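The redirect-following and robots.txt checks described above can be sketched as follows. This is a minimal illustration in Python, not the code in validate.pl.cgi. The function names, the agent string "purl-validator", the hop limit, and the caller-supplied fetch callback (which returns an HTTP status and an optional Location value) are all assumptions made for the example.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_hops=10):
    """Follow redirects starting at url.

    fetch(url) is a hypothetical callback returning (status, location),
    where location is the Location header value or None.
    Returns (hops, status, final_url).
    """
    hops = 0
    while True:
        status, location = fetch(url)
        # Stop on a non-redirect response, a redirect without a target,
        # or when the hop limit is reached.
        if status not in REDIRECT_CODES or not location or hops >= max_hops:
            return hops, status, url
        url = urljoin(url, location)  # resolve relative Location values
        hops += 1


def allowed_by_robots(robots_txt, url, agent="purl-validator"):
    """Return True if the given robots.txt text permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

In a real validator, fetch would issue the HTTP GET with automatic redirects disabled, and allowed_by_robots would be consulted before each request.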

IV. Outputting validation information

Each attempted PURL validation results in the creation of a "validation record". Validation records contain the following fields.

PURL: The PURL that was validated.
TIME: The time validation occurred.
HOPS: The number of redirections encountered during validation. If this value is zero, the PURL resolver served the document directly without returning a redirect to the validation process. If the PURL has an associated URL that does not itself redirect, this value will be one. Otherwise, the total number of redirects encountered will be listed.
STATUS: The HTTP status returned in the final non-redirect response from the HTTP server responsible for the URL-identified item. Note that 200 is the standard successful HTTP response. However, it is up to you to decide which codes are "valid" and which codes require some further action. Some problems that can occur during validation do not have standard HTTP response numbers. See status-codes.html for an explanation of these codes.
URL: The ultimate URL checked. If the validation process encountered redirects, this field will contain the last URL checked. Note that in the case of redirects, this field will not be the same as the URL directly associated with the PURL in the PURL system.
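As an illustration of how these fields fit together, the sketch below models a validation record in Python. The class name, method names, and the choice to keep TIME and STATUS as strings are assumptions for the example. The set of "valid" status codes is a parameter because that decision is left to each site.

```python
from dataclasses import dataclass


@dataclass
class ValidationRecord:
    purl: str    # PURL that was validated
    time: str    # time of validation (format is site-specific; kept as text)
    hops: int    # number of redirections encountered
    status: str  # final status; text, since not every code is a standard HTTP number
    url: str     # last URL checked

    def served_by_resolver(self):
        # Zero hops: the PURL resolver returned the document itself.
        return self.hops == 0

    def url_is_registered_target(self):
        # Exactly one hop: the URL registered for the PURL did not redirect
        # further, so the URL field is the registered URL.
        return self.hops == 1

    def is_valid(self, ok_codes=("200",)):
        # Which codes count as valid is a local policy choice.
        return self.status in ok_codes
```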

While the PURL validation records could naturally be placed in a relational database management system (RDBMS), we did not want to include an RDBMS in the PURL release. Therefore, we have made the information available in a format that others can easily load into their own RDBMS, manipulate, and use to generate their own reports.

Because no RDBMS ships with the release, the validation records are simply written to a flat-file log with the fields delimited by whitespace. This information can easily be loaded into your own RDBMS (or any other tool) for more extensive processing. As an example, the PURL Resolver at OCLC uses PostgreSQL and the RDBMS scripts distributed in the general release (but not hooked up by default) to do its validation. This way, we can both provide scripts that work against the flat-file version (the default) and give real, working code as examples of how to process the validation information out of a relational database.
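A minimal sketch of loading the flat-file log into a database is shown below. It is written in Python and uses SQLite as a self-contained stand-in for whichever RDBMS you run. It assumes one record per line, fields in the order PURL, TIME, HOPS, STATUS, URL, and no whitespace inside any field. Check these assumptions against your own log before relying on them. The table and function names are hypothetical.

```python
import sqlite3

# Assumed field order within each whitespace-delimited log line.
FIELDS = ("purl", "time", "hops", "status", "url")


def parse_line(line):
    """Split one log line into a (purl, time, hops, status, url) tuple."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    purl, time, hops, status, url = parts
    return (purl, time, int(hops), status, url)


def load_records(lines, conn):
    """Insert every non-blank log line into the validation table.

    Returns the number of records loaded.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS validation "
        "(purl TEXT, time TEXT, hops INTEGER, status TEXT, url TEXT)"
    )
    rows = [parse_line(line) for line in lines if line.strip()]
    conn.executemany("INSERT INTO validation VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Once loaded, a query such as SELECT purl FROM validation WHERE status != '200' lists the PURLs that may need attention.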


V. Viewing/Reporting validation results

Report writing is a tricky topic. We could never hope to generate reports that will satisfy everyone's needs for every given condition. Therefore, we have focused on providing a few simple examples of how to analyze and use the validation result logs. The logs should provide sufficient flexibility to create your own customized reports.

As with validation, the first step to reporting validation information is to identify a list of PURLs for which validation information is required. Identifying a list of PURLs for viewing results uses forms nearly identical to those listed above in Section II. Just replace "check" with "view" in the filename (e.g., search_view.html instead of search_check.html). The main difference is that titles, button text, and hints have been modified slightly to reflect that the resulting list of PURLs will not be validated, but used to look up validation results. The forms are still processed by select.pl.cgi.

Once a list of PURLs of interest has been identified, the validation records are used to generate reports which are then sent to the appropriate individuals, or presented immediately online.
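One simple report of this kind can be sketched as follows: pick the most recent validation record for each PURL of interest and flag those whose status needs attention. This is an illustrative Python sketch, not one of the distributed scripts. The function names, the tuple layout (purl, time, hops, status, url), the assumption that TIME values compare in chronological order, and the default set of acceptable codes are all choices made for the example.

```python
def latest_results(records, purls):
    """Return {purl: record} holding the newest record for each wanted PURL.

    records is an iterable of (purl, time, hops, status, url) tuples.
    """
    wanted = set(purls)
    latest = {}
    for rec in records:
        purl, time = rec[0], rec[1]
        if purl in wanted and (purl not in latest or time > latest[purl][1]):
            latest[purl] = rec
    return latest


def format_report(latest, ok_codes=("200",)):
    """Render one line per PURL, marking statuses outside ok_codes."""
    lines = []
    for purl in sorted(latest):
        _, time, hops, status, url = latest[purl]
        flag = "ok" if status in ok_codes else "CHECK"
        lines.append(f"{flag:5} {status:>4} {purl} -> {url} ({hops} hops, {time})")
    return "\n".join(lines)
```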


VI. Controlling validation requests

Since validating PURLs can produce "unwanted" system load, PURL validation is only run when the requesting user is a member of the VALIDATE_ALL group, or is a member of the VALIDATE_OWN group and is responsible for the PURLs being validated. By default, VALIDATE_ALL only contains the PURL administration group, WHEEL; and VALIDATE_OWN contains the ALL group. Thus, by default, users can only validate PURLs that they are responsible for and the administrative group can validate anything. To restrict validation to just the PURL administrators, remove ALL from the VALIDATE_OWN group. See the PURL FAQ for more information on groups and PURL maintainers.
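The default policy described here can be expressed as a small check. The sketch below is in Python and only illustrates the logic. The function name and parameters are hypothetical, and it assumes the caller passes in the full set of groups a user belongs to (including ALL, if every user is a member of it at your site).

```python
def may_validate(user_groups, maintains_purl,
                 validate_all=frozenset({"WHEEL"}),
                 validate_own=frozenset({"ALL"})):
    """Decide whether a user may validate a given PURL.

    user_groups: groups the user belongs to.
    maintains_purl: True if the user is responsible for the PURL.
    validate_all / validate_own: members of the two control groups;
    the defaults mirror the shipped configuration described above.
    """
    groups = set(user_groups)
    if groups & validate_all:
        return True  # administrators may validate anything
    # Everyone else may validate only the PURLs they are responsible for.
    return bool(groups & validate_own) and maintains_purl
```

Passing an empty validate_own models the restricted setup in which only PURL administrators may validate.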

VII. Batch processing

The PURL administrator can validate PURLs and automatically report errors to the PURL owners in a batch mode. This process is very similar to normal validation, except that error results can be sent to the owner of a PURL. The conditions for when to identify a user directly are controlled by the parameters to the scripts. We assume that batch processing will be done using scripts like the shell examples we have included (the files ending in ".sh").
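The core of such a batch run, grouping failed validations by owner so that each owner receives only their own errors, might look like the Python sketch below. It is not a translation of the included shell scripts. The function name, the record tuple layout (purl, time, hops, status, url), and the owners mapping from PURL to a list of owner identifiers are assumptions; how owners are looked up and how messages are delivered depends on your installation.

```python
from collections import defaultdict


def errors_by_owner(records, owners, ok_codes=("200",)):
    """Group failing validation records by PURL owner.

    records: iterable of (purl, time, hops, status, url) tuples.
    owners: hypothetical mapping of PURL -> list of owner identifiers.
    Returns {owner: [(purl, status, url), ...]} for statuses outside ok_codes.
    """
    report = defaultdict(list)
    for purl, _time, _hops, status, url in records:
        if status in ok_codes:
            continue  # nothing to report for acceptable results
        for owner in owners.get(purl, ()):
            report[owner].append((purl, status, url))
    return dict(report)
```

Each entry in the returned mapping can then be formatted and sent to the corresponding owner by whatever mechanism your site uses.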

VIII. Miscellaneous files

Since there are many files used to support the processing outlined here, a list of the files in the validation directory and their functions can be found in the Manifest file.

IX. Conclusion

It is our hope that this document and the corresponding forms and examples explain validation in enough detail that PURL administrators can use PURL validation to improve PURLs by ensuring their "correctness". The validation implemented here is not meant to be "complete", but exemplary. That is, we expect that many will want to modify and enhance these forms and scripts to meet their own validation requirements. We hope we have provided enough information to do so.