Prism Working Paper  2000-02-07 01:12:35 PM

The Prism Project: Vision and Focus

Over the past several months we have successfully started a number of research efforts within the context of Prism.

With the beginning of the new year, it is time to move from this “collection of research efforts” to a more coordinated vision and vehicle for our research activities.  This short working paper is an attempt to establish that vision for the project and define the work that needs to be done in the context of that vision.

Visions and Definitions

During our meeting prior to the holiday break, Geri's group presented some results that indicated general agreement on the goals of the project.  We need to move beyond general agreement to more specific agreement on the nature of the beast - a digital library - on which we are working:

A digital library is a managed collection of digital objects (content) and services (mechanisms) associated with the storage, discovery, retrieval, and preservation of those objects.  The task of management is three-fold.  

  1. It includes the selection of the digital objects that are components of the collections from which the library is composed.  These objects are selected from a global information space (e.g., the set of all published books, or the set of all digital objects on the World Wide Web), and become components of the library collections based on criteria applied by collection managers (which may be human, automated, or some hybrid).  
  2. Management also entails the definition of the services included in the digital library.  Some common examples of services are indexing, which allows discovery of content in the collections; preservation, which assures longevity of the objects in the collections; and awareness, which alerts users to changes in the collections.  
  3. Finally, management includes the development and enforcement of policies for tasks such as managing access to collection contents and preserving items in the collection.

This definition contains the notion of a digital library as management layered on top of infrastructure.  In particular: a digital library infrastructure provides the service definitions, protocols, and digital object model; digital library instances exploit this infrastructure by selecting services and content and administering policies on those selections.
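The layering can be pictured with a minimal sketch.  It is an illustration under our own assumptions, not a description of the Dienst or FEDORA interfaces: the names DigitalObject, Infrastructure, and LibraryInstance, the "discover" service, and the selection criteria are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DigitalObject:
    """A unit of content in the global information space (hypothetical model)."""
    identifier: str
    genre: str      # e.g. "technical-report", "web-page"
    location: str   # where the bits live; not necessarily under library control


# A service takes a collection and a query and returns matching objects.
Service = Callable[[List[DigitalObject], str], List[DigitalObject]]


@dataclass
class Infrastructure:
    """Shared layer: objects and service definitions available to every instance."""
    objects: List[DigitalObject] = field(default_factory=list)
    services: Dict[str, Service] = field(default_factory=dict)


@dataclass
class LibraryInstance:
    """Management layer: selected content, selected services, and policies."""
    name: str
    infrastructure: Infrastructure
    select: Callable[[DigitalObject], bool]   # collection criterion
    service_names: List[str]                  # services this instance offers
    policies: Dict[str, str] = field(default_factory=dict)

    def collection(self) -> List[DigitalObject]:
        # The collection is a selection over the shared information space.
        return [o for o in self.infrastructure.objects if self.select(o)]

    def call(self, service: str, query: str) -> List[DigitalObject]:
        if service not in self.service_names:
            raise KeyError(f"{self.name} does not offer service {service!r}")
        return self.infrastructure.services[service](self.collection(), query)


def discover_by_genre(collection: List[DigitalObject], genre: str) -> List[DigitalObject]:
    """Toy discovery service: match objects on genre."""
    return [o for o in collection if o.genre == genre]


infra = Infrastructure(
    objects=[
        DigitalObject("tr-001", "technical-report", "local"),
        DigitalObject("page-17", "web-page", "remote"),
        DigitalObject("acm-42", "journal-article", "remote"),
    ],
    services={"discover": discover_by_genre},
)

# Two instances share one infrastructure but differ in selection and policy.
cs_portal = LibraryInstance(
    "cs-portal", infra,
    select=lambda o: o.genre != "journal-article",
    service_names=["discover"],
    policies={"preservation": "local technical reports"},
)
research_library = LibraryInstance(
    "research-library", infra,
    select=lambda o: True,
    service_names=["discover"],
    policies={"preservation": "all content"},
)
```

In this sketch the policies are only labels attached to an instance; nothing enforces them.  The point is the separation itself: both instances draw on the same objects and the same service, and differ only in what they select and what they promise.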

It is illustrative to compare these digital library instances to the current notion of a portal on the Web, for example Lycos, Yahoo, and AltaVista.  These portals offer access to distributed content and provide some services over this content (e.g., searching).  Yet few would argue that these portals provide the level of curatorial responsibility undertaken by existing libraries.  The model of a digital library instance proposed here can be thought of as a hybrid portal.  It not only expands the traditional portal concept with enhanced services and content (as described in Sections 2.1 and 2.2), but also recognizes the need for integrity maintenance (curatorial responsibility) ranging from casual (a distributed e-print archive) to strict and broad-ranging (a research library).  The effect, then, is that content and mechanisms may be shared among multiple digital libraries, but the policies applied to that content and those services are tailored to the requirements of the organization (e.g., research library, academic department) administering the digital library instance.

Overall Research Goal 

Traditionally, libraries have primarily asserted policies over objects through full control (ownership and containment) of the (physical) artifacts.  With the arrival of digitized content, policy enforcement has become more complex and has mainly been handled through two models:

  1. Full control of the bits - e.g., as in many of the digitization projects that largely extend standard library practices to a new genre of materials.

  2. Contractual agreements - which allow the bits to be under someone else's control but assume a high level of cooperation with one or more trusted parties.

The hybrid portal model described above presents significant challenges to these traditional practices.  Bits (content and mechanisms) are scattered across the Internet.  These bits are controlled by parties with varying levels of integrity and management, cooperative agreements with them may be impractical or impossible, and the "importance" of the bits (from the perspective of a specific portal) bears no relation to their location or level of management (e.g., my research may consider a working paper on somebody's personal web page more important than a publication in an established scholarly journal).

The general questions we are exploring in Prism are then:

Note that this research introduces important questions from both the library (information management) direction and CS direction.  Clearly policies must adapt to the new distributed digital context and mechanisms must be developed to enforce these policies.

Research Directions and Assumptions

This section distills these broader questions into a set of working hypotheses and corresponding work and research tasks (which are shown in brackets beneath the hypotheses).

  1. Individual digital libraries (portals) will provide access to a mixture of shared and unshared content and services.

    [Continue to refine the infrastructure elements we are developing (e.g., Dienst, FEDORA)
    to enable research and demonstration of policy enforcement capabilities.]


  2. Integrity (security and preservation) requirements will vary across digital library instances (even though they may share content and services).  

    1. An example in the context of access management: the Cornell library may establish a campus-wide license for free access to all ACM content, whereas the digital library for researchers at Xerox PARC may allow access on a "pay per view" basis.

    2. An example in the context of preservation: a digital library for the Cornell computer science department may consider the preservation of Gerry Salton's technical reports as central to its intellectual mission, but may not feel the same about older content in the ACM digital library.  At the same time, the research library may decide that all TRs and all legacy content in the ACM digital library are mission critical and must be preserved.

    [Undertake a study to characterize security and preservation issues in the context of sample communities (portals) and the range of resources (local, distributed, formal, multiple genre) that they use.]


  3. Policies are the means of formally stating these integrity requirements.  Using this terminology, then, each digital library (portal) will have policies that express its integrity requirements on distributed content and services.  

    [Investigate various integrity requirements for diverse digital library portals and translate these into policies.]

    [Investigate notations for expressing policies, and develop methods and tools for translating these diverse security and preservation requirements into enforceable policies.]


  4. Policy enforcement mechanisms need to be constructed that permit enforcement regardless of the level of control the digital library (the policy formulator and enforcement agent) has over the objects and services.

    [Investigate the implementation of a "policy layer" through which each portal can enforce its policies across the full range of content and services, whatever level of control it has over them.]

    [Investigate the implementation of this "policy layer" through the creation of object surrogates using the FEDORA digital object model and reference monitors.]

    [Investigate the tension between levels of control over content and services and the types of policies that need to be enforced (e.g., it may be necessary to "assert" control in certain situations - suck down the bits - in order to achieve certain policies).]
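To make hypotheses 2 through 4 concrete, the following is a minimal sketch of a surrogate acting as a reference monitor for an object the portal does not control.  It rests on our own assumptions and is not the FEDORA object model: Request, SharedObject, and Surrogate are hypothetical names, a policy is reduced to a simple predicate over requests, and the affiliation and payment fields stand in for real authentication and billing.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Request:
    user: str
    affiliation: str     # e.g. "cornell", "parc" (stand-in for real authentication)
    action: str          # e.g. "read"
    paid: bool = False   # stand-in for a pay-per-view transaction


# A policy is a predicate over requests (a deliberately simple, hypothetical notation).
Policy = Callable[[Request], bool]


class SharedObject:
    """Content held by a third party, outside any single portal's control."""

    def __init__(self, identifier: str, body: str) -> None:
        self.identifier = identifier
        self._body = body

    def read(self) -> str:
        return self._body


class Surrogate:
    """Portal-local stand-in for a shared object; it checks every request
    against the portal's policies before touching the target."""

    def __init__(self, target: SharedObject, policies: Dict[str, Policy]) -> None:
        self._target = target
        self._policies = policies

    def perform(self, request: Request) -> str:
        policy = self._policies.get(request.action)
        # Default deny: an action with no policy, or a failed check, is refused.
        if policy is None or not policy(request):
            raise PermissionError(f"{request.action} denied for {request.user}")
        if request.action == "read":
            return self._target.read()
        raise NotImplementedError(request.action)


# One shared object, two portals, two access policies.
article = SharedObject("acm-42", "full text")

# Campus-wide license: any Cornell user may read.
cornell_view = Surrogate(article, {"read": lambda r: r.affiliation == "cornell"})

# Pay per view: a PARC user may read only after paying.
parc_view = Surrogate(article, {"read": lambda r: r.affiliation == "parc" and r.paid})
```

A surrogate of this kind mediates only the accesses that pass through it.  It could not, for example, enforce a preservation policy if the third party removed the object, which is one face of the tension noted in the last task above.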

Building a Testbed

The ideas outlined above need to be exercised in well-defined test environments.  The testbed(s) should provide us with a sufficiently rich set of policy contexts to provide interesting test cases.  

If we believe in this hybrid portal model of digital libraries, then we should focus substantial effort on the prototyping of a limited number of portals - investigating, formulating, implementing, and testing policy enforcement in the context of those portals.  Given our resource limitations, we should focus on two:

  1. A broad, library-driven orientation.
  2. A "customer-driven" portal orientation.

There will certainly be interesting relationships between these two focuses at the policy and mechanism level.

The Broad Focus

One testbed focus will be the traditional research library model, which can be distinguished in a number of ways by its content, clientele, and services: 

content

 clientele

 services

The research library in the digital realm may be characterized as a complex, emerging, and fairly amorphous entity motivated by multiple goals that compete with one another.  There is a built-in tension between serving current customers and safeguarding information resources.  Because it serves a broad community and a broad mission, the policy requirements of the research library are necessarily complex and multi-conditional, and will provide a test environment in which the tensions between a customer focus and a collections focus will play out in policies and mechanisms.  It will also serve as a “moving target” in which new means are defined to provide content and services and to meet the needs of clients.

The Narrow Focus

A concurrent testbed focus will be the investigation of a single portal and its policy requirements - a digital library serving researchers in the CS department at Cornell.  This library would be a portal into the information space (both local and distributed materials) relevant to the needs of CS researchers.  This will include some fairly conventional materials (textual technical reports) and some radically new resources (the lecture browser materials that are coming out of Brian Smith's group).

Content

Researchers in the computer science department use an extremely diverse variety of digital resources - print-only materials, multimedia, licensed materials (e.g., ACM), informal web pages, software, etc.  This diversity of resources and their diverse integrity requirements will provide a rich testbed for our work.  The testbed supports resources critical to research, teaching, and scholarship, which makes the policy issues multi-dimensional.

Clientele 

Primary clientele are faculty, students, and staff of the CS department, who are on the leading edge in their use of electronic resources. In the future, use may be extended to students and faculty from other departments and researchers and distance learners outside of Cornell.

Services

The creation of such a testbed with the involvement of the library will provide sufficient balance to prevent this testbed from addressing only the "current" requirements of the CS researchers.  The library will add to the efforts the same aspect that it has traditionally added to information management - curatorial responsibility that goes beyond the immediate needs of its patrons.  The library knows better than current patrons do what is best for the collection as a whole and for future patrons.  This will raise important issues arising from the tension between the policy needs of the immediate patron (the individual CS researcher) and the broader policy needs as determined by the library.

Therefore, we think we should devote a substantial part of our testbed efforts - including requirements analysis, collection selection, collection prototyping, and evaluation - to building a digital library for the CS community at Cornell.  This will serve as an alternative model to the research library so as to invite comparative analysis.