Persistent Identifiers (PIDs) are unique entity names that have the organizational commitment and technical infrastructure to support them indefinitely. While a PID can be unique in any given context, it is most powerful when it is globally unique in a widely known and used namespace (e.g. ISBN). In digital repository infrastructures, PIDs have two roles: 1) to support system integrity by providing stable identification for system entities, and 2) to enable long-term access to managed digital resources. Both are becoming increasingly important as individual repositories begin to seek out economies of scale through federated catalogs, one-stop portals, and shared functionality.
With relation to digital content, PIDs are most efficacious when they are sustained by services and protocols which make them actionable or 'resolvable' through the internet. A PID can bind a resource's permanent identity to its current, but potentially changeable, location on the web and help direct requests for the resource to its location. In this case, both the PID's global uniqueness and its binding with the current location of the resource must be persisted through organizational and technical frameworks. Today there are several widely-used actionable PID systems supported through various frameworks. Nevertheless, PID uniqueness and persistence are ensured first and foremost through the long-term commitment of the organization which owns the resource.
Benefits of implementing a PID system include:
- ''Global uniqueness:'' context-independent identification supports the referencing of entities through various systems. More concretely, PIDs are independent of any cataloging or repository software and remain stable when software changes. In aggregation or harvesting services, globally unique identifiers facilitate data supply, helping to identify duplicates and manage updates and deletions.
- ''Persistence:'' supported bindings to a resource's current location and other information about the resource, along with the technical means to direct requests for the resource to the bound information. PIDs serve as a 'key', allowing the identifier to be used in place of the resource itself. More concretely, PIDs can prevent broken links in the form of altered resources, redirects, or in the worst case "404 not found" messages.
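The two benefits above can be illustrated with a minimal sketch of a registry that binds a PID to a changeable location. The class `PidRegistry`, its method names, and the example URLs are hypothetical and do not correspond to the API of any real PID service; the sketch only shows that the identifier stays fixed while its binding is updated.

```python
from dataclasses import dataclass, field


@dataclass
class PidRegistry:
    """Toy in-memory registry mapping each PID to its current location (hypothetical)."""

    bindings: dict[str, str] = field(default_factory=dict)

    def register(self, pid: str, location: str) -> None:
        # Uniqueness within the namespace: a PID may be issued only once.
        if pid in self.bindings:
            raise ValueError(f"PID already registered: {pid}")
        self.bindings[pid] = location

    def rebind(self, pid: str, new_location: str) -> None:
        # The PID itself never changes; only its binding to a location does.
        if pid not in self.bindings:
            raise KeyError(f"Unknown PID: {pid}")
        self.bindings[pid] = new_location

    def resolve(self, pid: str) -> str:
        # Direct a request for the PID to the resource's current location.
        return self.bindings[pid]


registry = PidRegistry()
registry.register("hdl:10891/12345abcd", "https://old.example.org/objects/12345")
registry.rebind("hdl:10891/12345abcd", "https://repo.example.org/items/12345abcd")
print(registry.resolve("hdl:10891/12345abcd"))
```

Citations that use the PID keep working after the move, because only the registry entry was updated.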
Selecting a PID System
A robust PID system can have two possible components: a registry and naming scheme that supports the creation of globally unique identifiers; and binding and resolver services that ensure long-term access over the internet to the resources named by the identifier. PID systems do not necessarily have both components. When selecting a PID system to support a digital repository, the following issues should be considered:
''Identifier actionability:'' Is the identifier resolvable on the web? Can the identifier be expressed as a URI that can be used directly in web browsers or is the mediation of the resolver service necessary?
''Identifier form and scope:''
- Is the identifier opaque or semantically meaningful? If it is semantically-laden, are the qualities on which it is based likely to persist? Can the syntax incorporate local identifier systems?
- Does the identifier syntax support digital object variants and versions? Does it support the relation of component parts? Would it support non-document entities (e.g. people, places, or concepts)?
''Supporting Services, Interoperability, Community:''
- Does the identifier scheme come bundled together with one or more services to create, bind, resolve, and manage PIDs effectively on the internet? Are the services reliable, sustainable, secure, and cost effective? Are the services centralized or locally hosted?
- Does the identifier system create any technical or administrative dependencies? Are there potential administrative or technical obstacles to using these services?
- Does the service support access restrictions for resources not intended for access on the web? How would the service support an identifier for a resource that is no longer available?
- Is the identifier system a formal and well-documented standard? Does it comply with major web standards? Is the system flexible enough to interoperate with or incorporate other schemes? Is it dependent on protocols which may change over time or become obsolete?
- Is the identifier system (with its attendant services) a mature, well-supported, and widely-adopted system with a committed community of users?
For digital content, it is appropriate to choose a PID system that is actionable through the internet and syntactically flexible to support a variety of resource types and object formats. The system should be in wide use, particularly in the cultural heritage sector, and supported by a robust technical service infrastructure through a service provider or community with a demonstrated long-term commitment. For social history institutions, it is important to choose a system that is inexpensive and straightforward to introduce and maintain through changing financial circumstances. Three systems currently meet the above requirements:
- PURL (Persistent Uniform Resource Locators)
- Handle System/DOI (Digital Object Identifier)
- ARK (Archival Resource Key)
(See PID System Profiles for more detailed information.)
PURLs, Handles, DOIs, and ARKs each have advantages and disadvantages, and their suitability for a particular institution or institutional collection will depend on local factors. While DOI offers a robust schema, it also requires a substantial professional and financial commitment, likely owing to its roots in the private sector. ARK and DOI have descriptive metadata requirements that would seem redundant for institutions that specialize in creating descriptive data according to their own domain standards. The ability to reflect a resource's structure in the identifiers may also be superfluous for organizations using METS or other structural metadata schemas. ARKs require no service commitment and are free to use, but the institutional commitment central to the ARK concept (and the administrative effort underlying this commitment) may prove difficult for small organizations. Handles by themselves are less expensive than DOIs and offer a robust service package and, unlike DOIs and ARKs, relatively low barriers to implementation, but the institution must depend indefinitely on a service provider for resolution. PURLs have the lowest financial and technical barriers for implementation and create no dependencies, but they offer little nuance in their syntax and support no additional metadata (to facilitate access control, for instance).
The HOPE PID Service has opted to use the CNRI Handle System, citing Handle's large user base and strong documentation on APIs, workflows, and service agreements. It was an additional advantage that CNRI is seeking to move the service under the auspices of a large international body such as the UN. (Given that the HOPE PID Service is administered on behalf of several institutions, the steeper learning curve and additional requirements for ARKs may also have proved an obstacle.) HOPE has implemented the Handle system in such a way that each content provider using the HOPE PID Service will retain their own Naming Authority, allowing their PIDs and related data to be transferred to a local Handle server at any time.
Application of PIDs
For local repositories, selecting a system is only half the battle. Putting in place policies and procedures to support the creation, binding, and maintenance of PIDs is more important and less obvious than it might seem. Current PID systems are designed with flexible syntax and data structures making possible a range of potential applications. An institution must make decisions about whether to assign PIDs:
- To abstract works or to specific manifestations/copies of a work?
- To digital resources only or to physical objects as well?
- To digital masters, to access copies, or to all representations?
- To whole documents only or individual files or other components?
- To object metadata or solely to objects? Or to some package which encompasses them both?
- To currently maintained versions of resources or to older versions as well?
- To document resources only or to non-document resources such as the people, groups, places, and concepts that are the topic of linked data efforts? Or to the rights, events, agents so central to intellectual property and preservation management? Or to the operations performed by the repository itself?
- And finally, to publicly available resources only or to all resources, including those that are internal, private, secure, or restricted?
Here it is important to consider the underlying motivations for applying PIDs. An institution must determine: which resources will themselves be managed and persisted and which are by nature temporary or in flux; which resources are well served by existing local or international identification schemes; which resources will be 'networked' in some form and which will remain offline; which resources might be shared, transferred, or exchanged and which will remain local.
In the case of archival material or other rare or unique collections, the question of 'abstract works' vs. 'manifestations' is of course less relevant than it is in the traditional library domain. More relevant may be the relation between the physical originals (what METS refers to as the Source) and digital versions. In OAIS, "identifiers that allow outside systems to refer, unambiguously to particular content information" (Consultative Committee for Space Data Systems, ''Reference Model for an Open Archival Information System'', 4-28.) are stored as part of preservation description reference information. Such identifiers refer to the physical or digital content data objects that are the object of preservation. Thus in OAIS terms, only the base object of preservation must be persistently identified, whether it be an analog original, a master object (potentially composed of multiple files), or each manifestation of the original object. PREMIS, with its focus on digital objects, suggests a more atomistic approach. PREMIS recommends that digital repositories generate and store persistent identifiers on all representations, files, and even bitstreams, depending on the level at which the repository stores and manages objects. While use of persistent identifiers is permitted for other PREMIS entities, namely intellectual entities (i.e. descriptive metadata), agents, events, and rights, it is not explicitly recommended.
In short, an institution will have to decide if the resource as envisioned is an abstraction that includes a package of files and related metadata (something akin to a METS digital library object or an OAIS Information Package), or, on the other extreme, if each concrete manifestation of a work (be it a variant, version, component, or metadata) should be considered a resource in its own right. The former would help enforce a more rigorous data model and, if well conceived, could ease administration; the latter is perhaps more concrete and comprehensible and might offer more flexibility to adapt to changing circumstances.
In most cases, an institution will fix on a model somewhere in between the two extremes, often depending on current data structures, supporting management systems, and well-worn local identification schemes. Though less pure, these systems are often the most intuitive and feasible. For archival and other unique material, institutions generally choose to assign PIDs in some combination to: one or more digital representations of an object, each separate master and even derivative file, a descriptive record, and/or the original physical object (for which the descriptive record may serve as a digital proxy). In any case, institutions should seriously consider assigning PIDs to restricted as well as open resources: it is the access restrictions in this case which are temporary, while the resources themselves will persist.
In HOPE, PIDs are required for each descriptive metadata record and attendant landing page (often the same PID is used for both) as well as for each digital file submitted to the Aggregator and Shared Object Repository (SOR). A PID is also required for any submitted authority records on agents, places, and concepts. The HOPE Aggregator will also generate and manage PIDs on its own enriched HOPE agent, place, and theme entities when these are implemented.
Form of PIDs
Though PID syntax is broadly set by the system that is chosen, each of the systems described above has a [Local Name] element. In the case of Handle/DOI there is also the possibility to bind a single PID to multiple locations and assign attributes, while in ARK, a single PID can resolve to a digital object, descriptive metadata, or a commitment statement. ARK's syntax also permits the expression of variants and components as qualifiers. Thus an institution must determine:
- What the relationship will be between the PID Local Name element and other locally-supported identifiers, such as accession numbers, ISBN numbers, call numbers, reference codes, and system identifiers;
- Whether to express entity relationships through PIDs, and if so, to do so through adaptions to the Local Name root or through PID system syntax (e.g. qualifiers, attributes, and inflections);
- How/when to use the possible forms for a single PID—more precisely when it is necessary to use the actionable form and when to simply use the root form.
The relationship between PIDs and local identifier systems is tricky. It is common practice to use an established internal identification system as the source of Local Names, and there is some logic in it. Basing the PID local naming convention on an existing system has several benefits: 1) it eases workflow and administration, often saving an additional look-up step; 2) it allows an institution to infer the form of the PID; this can ease the stress on internal workflows—allowing the institution to create the PID in their local system before formally registering it with the PID service (or to forgo storing the PID value at all); and 3) it can reflect the institution's broader data model and relation between various entities, without the need for introducing PIDs throughout.
Given these arguments, perhaps the better question is not whether existing identifiers should be used, but which. The main criterion is that the internal identification system should be locally unique and as stable and persistent as the intended PID; once created, PIDs cannot be changed. In this case, semantically less meaningful names are generally preferable. If no stable local identification system exists, then the PID service can generate a random Local Name.
The next question is whether to express entity relationships through PIDs or to keep PIDs atomized and semantically opaque. One disadvantage (or advantage, depending on the aim) to using a standard syntax to express relationships is that PIDs can be guessed by end users. Kunze and Rodgers refer to these as 'hackable identifiers'. If the institution hopes to control access passively, by not publishing PIDs for restricted or internal resources, this may cause problems. If an institution is basing PIDs on a local identification system which itself expresses relationships, then this cannot be avoided, though embedding relationships in an idiosyncratic local system may be less hackable than expressing structure through the universal syntactical rules offered by PID systems.
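A naming convention of this kind can be sketched as follows. The permitted-character pattern, the consonant-and-digit alphabet, and the function names are assumptions made for illustration; the prefix `10891` is borrowed from the examples in this section. In a real deployment the random name would normally come from the PID service itself.

```python
import re
import secrets
from typing import Optional

# Assumed convention: 4 to 32 lowercase letters, digits, or underscores.
LOCAL_NAME_PATTERN = re.compile(r"[a-z0-9_]{4,32}")
# Consonants and digits only, so random names cannot spell words by accident.
OPAQUE_ALPHABET = "bcdfghjkmnpqrstvwxz23456789"


def random_local_name(length: int = 10) -> str:
    """Generate an opaque Local Name for resources with no stable local identifier."""
    return "".join(secrets.choice(OPAQUE_ALPHABET) for _ in range(length))


def make_pid(prefix: str, local_id: Optional[str] = None) -> str:
    """Build 'prefix/local-name', reusing a stable local identifier when one exists."""
    local_name = local_id if local_id is not None else random_local_name()
    if not LOCAL_NAME_PATTERN.fullmatch(local_name):
        raise ValueError(f"Local Name violates the naming convention: {local_name!r}")
    return f"{prefix}/{local_name}"


print(make_pid("10891", "12345abcd"))  # prints 10891/12345abcd
print(make_pid("10891"))               # random Local Name, different on each run
```

Rejecting non-conforming identifiers, rather than silently normalizing them, avoids two distinct local identifiers collapsing into the same Local Name.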
Example: In the ARK syntax the use of qualifiers presupposes the existence of related resources. The sample ark:/12025/654/xz/321 actually contains three separate ARKs: ark:/12025/654, ark:/12025/654/xz, and ark:/12025/654/xz/321.
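The containment implied by the slash-separated qualifiers can be derived mechanically, which is also what makes such identifiers 'hackable'. The sketch below assumes that every slash-delimited prefix after the name assigning authority number is itself an ARK; the function name is hypothetical, and variant qualifiers (introduced by a period) are not handled.

```python
def containing_arks(ark: str) -> list[str]:
    """List the ARK of each hierarchy level, from the outermost object to the ARK itself."""
    scheme = "ark:/"
    if not ark.startswith(scheme):
        raise ValueError(f"Not an ARK in root form: {ark}")
    # First segment is the name assigning authority number; the rest is the hierarchy.
    naan, *parts = ark[len(scheme):].split("/")
    if not parts:
        raise ValueError(f"ARK has no name part: {ark}")
    return [f"{scheme}{naan}/" + "/".join(parts[:depth]) for depth in range(1, len(parts) + 1)]


for level in containing_arks("ark:/12025/654/xz/321"):
    print(level)
# prints:
# ark:/12025/654
# ark:/12025/654/xz
# ark:/12025/654/xz/321
```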
As a rule, if the institutional data model is likely to change over time, it may be better to avoid expressing relationships through PIDs. In this case, the use of multiple bindings (such as with Handle's attributes) may allow an institution to capture resource relationships without fixing them permanently into the PID. In general, it is recommended to explore the possibilities in the chosen PID system and optimize built-in features to ease administration, while at the same time constructing a naming convention that can be sustainable, maintainable, and appropriately opaque.
Example: Resource relationships expressed through varied PID syntax.
Resource PID: pid:10891/12345abcd
Related Resource PID, variation 1: pid:10891/12345abcd_master_001
(relationship expressed through change to Local Name root)
Related Resource PID, variation 2: hdl:10891/12345abcd_001?locatt=level:master
(relationship expressed through attribute)
Related Resource PID, variation 3: ark:/10891/12345abcd/001.master
(relationship expressed through component and variant qualifiers)
Finally, current thinking suggests that PIDs should be used in the root form whenever possible, basically in all contexts where actionability is not important. As noted by Hilse and Kothe, "The integration of persistent identifiers into URL strings is risky: it introduces problems if those URLs are not clearly marked as containing a certain encoded identifier. In this case, the URL cannot easily be converted at a later point in time because it is hard to determine which element is, in fact, an encoded identifier." (Hans-Werner Hilse and Jochen Kothe, ''Implementing Persistent Identifiers: Overview of concepts, guidelines and recommendations'', p. 45.) This would indicate that the actionable form of a PID should be produced at the latest possible moment in the workflow and only when needed. In any case, whenever the root form is used, it is necessary to reference the namespace, either as part of the syntax, e.g. ark:/13030/tf5p30086k or doi:10.1006/jmbi.1998, or through labels or qualifiers. And of course, a local naming convention should be drafted to reflect the above policies, specifying the syntax and permitted characters.
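One way to follow this advice is to store only the root form and derive the actionable form at display time. The sketch below assumes the public resolvers hdl.handle.net, doi.org, and n2t.net; an institution running its own resolver would substitute its own base URLs. The function name and the mapping are illustrative only.

```python
from urllib.parse import quote

# Assumed resolver base URLs, keyed by the namespace label of the root form.
RESOLVERS = {
    "hdl": "https://hdl.handle.net/",
    "doi": "https://doi.org/",
    "ark": "https://n2t.net/ark:",
}


def to_actionable(root_form: str) -> str:
    """Convert a namespace-labelled root form into a resolvable URL."""
    namespace, separator, name = root_form.partition(":")
    if not separator or namespace not in RESOLVERS:
        raise ValueError(f"Unknown or missing namespace: {root_form}")
    # Percent-encode anything unsafe in a URL path, keeping the slashes.
    return RESOLVERS[namespace] + quote(name, safe="/")


print(to_actionable("ark:/13030/tf5p30086k"))  # prints https://n2t.net/ark:/13030/tf5p30086k
print(to_actionable("hdl:10891/12345abcd"))    # prints https://hdl.handle.net/10891/12345abcd
```

Because the namespace label travels with the root form, the mapping can be changed in one place if a resolver address changes.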
Local PID Workflows and Maintenance
Long-term commitments expressed in policies should be supported through formal institutional procedures that answer the following questions:
- When should PIDs be assigned?
- Where should PIDs be stored?
- How should PIDs be maintained?
PIDs should be assigned at the earliest possible point after the creation or accession of a resource—it is, in fact, mandatory for the Archival Information Package (AIP). However, the (re)binding of PIDs to their networked location can only happen if and when the resource is made available on the internet. In the interval between, PIDs may be: 1) registered in the PID service as unresolved, 2) set to resolve to a stand-in page, or 3) in the most lightweight scenario (and if a predictable naming convention has been observed), assigned locally without recourse to the PID service. In any case, it is recommended that PIDs be based on stable local identifiers which are also assigned early in the creation or accession workflow.
PIDs should be stored in local repositories or collection management systems along with the metadata on the resource. PIDs assigned to metadata can be stored within the metadata record itself while PIDs assigned to files or representations should be stored along with file-level technical metadata and/or in a METS container. However, in many cases such advice is facile, easily given but ignoring the realities faced by small institutions. Proprietary collection management systems often fail to support PIDs. Moreover, many small institutions still rely on file servers, rather than proper digital repositories to store and structure digital objects. In these cases, institutions can create workarounds, such as lookup tables matching local IDs or file names to PIDs, but a good naming convention (e.g. aligning PIDs to file names or to database identifiers or collection reference codes) may also alleviate much of the problem.
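A lookup table of the kind mentioned above might look like the following sketch. The column names, file names, and PIDs are invented for illustration, and the table is held in a string here only to keep the example self-contained; in practice it would live in a file or database.

```python
import csv
import io

# Hypothetical lookup table matching file names to PIDs.
LOOKUP_CSV = """local_id,pid
ARCH00123_0001.tif,hdl:10891/arch00123_0001
ARCH00123_0002.tif,hdl:10891/arch00123_0002
"""


def load_lookup(csv_text: str) -> dict[str, str]:
    """Read local_id -> PID pairs, refusing duplicates on either side."""
    table: dict[str, str] = {}
    seen_pids: set[str] = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        local_id, pid = row["local_id"], row["pid"]
        if local_id in table or pid in seen_pids:
            raise ValueError(f"Duplicate entry: {local_id} -> {pid}")
        table[local_id] = pid
        seen_pids.add(pid)
    return table


lookup = load_lookup(LOOKUP_CSV)
print(lookup["ARCH00123_0001.tif"])  # prints hdl:10891/arch00123_0001
```

In this invented data the PIDs follow the file names, so the table could in principle be regenerated from the naming convention alone.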
It is, unfortunately, not sufficient to create and bind PIDs once and then promptly forget them—as many institutions may hope to do. PIDs are a long-term institutional commitment which must be sustained through constant vigilance. It is necessary to put procedures in place to administer PIDs and their bindings over the long term. Procedures should be established for binding and rebinding PIDs to one or more current locations, particularly as part of any server change, repository update, or website redesign. Procedures should also be produced for the removal, update, or replacement of the actual resources identified by PIDs—possibly through the use of place holders, fixed policy statements, or references to alternative resources. With complex PID systems like Handle, ARK, or DOI, all procedures must take into account the creation and update of all bindings stored with PIDs, in the form of additional metadata, access rights, and service commitments. The policy to use multiple resolve locations and PID qualifiers must also be supported by procedures which manage these through all processes. Finally, the institution needs to draft procedures related to its commitment to the service itself: renewal of subscriptions, software updates, etc.
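As an illustration of a rebinding procedure, the sketch below updates every binding that points at a retired server. It operates on a plain dictionary of PID-to-URL bindings and uses invented host names; with a real PID system the same loop would call that system's own update mechanism, which is not shown here.

```python
def rebind_after_migration(bindings: dict[str, str], old_base: str, new_base: str) -> list[str]:
    """Point every binding under old_base at new_base; return the PIDs that changed."""
    changed = []
    for pid, location in list(bindings.items()):
        if location.startswith(old_base):
            # Keep the path below the base; only the server part is replaced.
            bindings[pid] = new_base + location[len(old_base):]
            changed.append(pid)
    return changed


bindings = {
    "hdl:10891/12345abcd": "http://files.example.org/objects/12345abcd",
    "hdl:10891/67890efgh": "https://other.example.org/67890efgh",
}
updated = rebind_after_migration(bindings, "http://files.example.org/", "https://repo.example.org/")
print(updated)   # prints ['hdl:10891/12345abcd']
print(bindings["hdl:10891/12345abcd"])  # prints https://repo.example.org/objects/12345abcd
```

Returning the list of changed PIDs gives the institution a record to check against after the migration.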
At a higher level, PID policies should likewise lay out the long-term institutional commitment to PIDs, including contingency planning to counter technological or service dependencies created by the application of a particular PID system. As aptly argued by the ARK system creators, in final reckoning "persistence is purely a matter of service, and is neither inherent in an object nor conferred on it by a particular naming syntax. The best an identifier can do is lead users to those services." (John A. Kunze, "Towards Electronic Persistence Using ARK Identifiers," ''Proceedings of the 3rd ECDL Workshop on Web Archives'', p. 2.)
At the end of the day, the only guarantee of the usefulness and persistence of identifier systems is the commitment of the organizations which assign, manage, and resolve identifiers.
(Stuart Weibel, Senior Research Scientist, OCLC)
''ARK (Archival Resource Key) Identifiers'' (http://n2t.net/e/ark_ids.html)
CASPAR: Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. ''D2301 Report on OAIS-Access Model''. February 2008. (Doc. Identifier: CASPARRPD230101011_3)
Consultative Committee for Space Data Systems. ''Reference Model for an Open Archival Information System''. CCSDS 650.0-M-2 Magenta Book. Washington D.C.: NASA, 2012. (https://public.ccsds.org/pubs/650x0m2.pdf)
''DOI System'' (http://www.doi.org/doi_handbook/TOC.html)
''Handle System'' (http://www.handle.net/hnr_documentation.html)
Hilse, Hans-Werner, and Jochen Kothe. ''Implementing Persistent Identifiers: Overview of concepts, guidelines and recommendations''. London: Consortium of European Research Libraries, 2006. (http://xml.coverpages.org/ECPA-PersistentIdentifiers.pdf)
Jantz, Ronald, and Michael J. Giarlo. "Digital Preservation: Architecture and Technology for Trusted Digital Repositories." In: ''D-Lib Magazine 11: 6'' (June 2005). (http://www.dlib.org/dlib/june05/jantz/06jantz.html)
Kunze, John A. "Towards Electronic Persistence Using ARK Identifiers." In: ''Proceedings of the 3rd ECDL Workshop on Web Archives''. August 2003. (http://bibnum.bnf.fr/ecdl/2003/proceedings.php?f=kunze)
Kunze, J., and R. P. C. Rodgers. ''The ARK Persistent Identifier Scheme''. Internet Draft. 2008. (https://tools.ietf.org/html/draft-kunze-ark-14)
NISO. ''Report of the NISO Identifiers Roundtable''. Bethesda, MD.: National Library of Medicine, 2006. (https://groups.niso.org/news/events/niso/past/ID-06-wkshp/ID-workshop-Re...)
OCLC/RLG Working Group on Preservation Metadata. ''Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects''. Dublin, OH: OCLC, 2002. (http://www.oclc.org/research/activities/past/orprojects/pmwg/pm_framewor...)
Paradigm: Personal Archives Accessible in Digital Media. ''Workbook on Digital Private Papers''. 2008. (http://www.paradigm.ac.uk/workbook)
PREMIS. ''PREMIS Data Dictionary for Preservation Metadata, v. 2.2''. Washington D.C.: Library of Congress, 2012. (https://www.loc.gov/standards/premis)
Reilly, Sean, and Robert Tupelo-Schneck. "Digital Object Repository Server: A Component of the Digital Object Architecture." In: ''D-Lib Magazine 16: 1 / 2'' (January/February 2010). (http://www.dlib.org/dlib/january10/reilly/01reilly.html)
Tonkin, Emma. "Persistent Identifiers: Considering the Options." In: ''Ariadne 56'' (July 2008).