
This is the static archive copy of the old wiswiki, decommissioned on June 1 2020


oai-pmh-guidelines

OAI-PMH guidelines for WIS catalogue synchronization between GISCs.



Introduction

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is used in WIS for the synchronization of the metadata catalogue between GISCs. These guidelines elaborate how exactly OAI-PMH is employed to implement synchronization.

Structure of the distributed catalogue


The distributed catalogue of WIS is held in its entirety by each GISC. To synchronize the catalogue, each GISC contributes the metadata from its area of responsibility. This means that each GISC provides the metadata from its area and receives (harvests, in OAI jargon) the metadata from all other GISCs.
Full-mesh and structured connectivity topologies are used as synchronization strategies between GISCs.

Structured connectivity and full-mesh


Full-mesh and structured connectivity are related in that full-mesh is a special case of structured connectivity.
[Image: three GISCs in full-mesh topology] [Image: GISC A relaying between GISC B and GISC C]
The left picture shows three GISCs in full-mesh topology. On the right, GISC A serves as relay for GISC B and GISC C in structured connectivity mode, providing the metadata of GISC B to GISC C and vice versa.

Basic idea


In a nutshell, WIS catalogue synchronization works like this. Each GISC offers the metadata it has obtained from its area of responsibility on its OAI provider. In order to get the full catalogue, each GISC harvests this data from all other GISCs. Should two GISCs be unable to communicate directly, they can obtain each other's metadata from a third GISC, which exposes the corresponding metadata in addition to its own.

OAI-PMH in WIS


The OAI-PMH operations relevant for WIS are ListRecords, ListIdentifiers, GetRecord and ListSets. The metadataPrefix parameter required for some operations MUST be set to "iso19139", that is, the request contains "metadataPrefix=iso19139". The ListIdentifiers and GetRecord operations use an identifier to refer to a specific metadata record. While OAI-PMH allows this identifier to be of arbitrary nature, as long as it is guaranteed to be unique, in WIS the identifier MUST be equal to the fileIdentifier in the metadata.
The OAI-PMH specification provides further general details on OAI.
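The identifier rule can be checked mechanically. The following is a minimal sketch, assuming a record as returned by GetRecord or ListRecords with ISO 19139 metadata; the sample record, its identifier and the function name are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of OAI-PMH 2.0 and of the ISO 19139 gmd/gco schemas.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical record, shortened to the elements needed for the check.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>example-record-1</identifier>
    <datestamp>2011-07-12T09:14:30Z</datestamp>
  </header>
  <metadata>
    <gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                     xmlns:gco="http://www.isotc211.org/2005/gco">
      <gmd:fileIdentifier>
        <gco:CharacterString>example-record-1</gco:CharacterString>
      </gmd:fileIdentifier>
    </gmd:MD_Metadata>
  </metadata>
</record>"""


def identifier_matches_file_identifier(record_xml):
    """Return True if the OAI identifier equals the ISO 19139 fileIdentifier."""
    record = ET.fromstring(record_xml)
    oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
    file_id = record.findtext(
        "oai:metadata/gmd:MD_Metadata/gmd:fileIdentifier/gco:CharacterString",
        namespaces=NS,
    )
    if oai_id is None or file_id is None:
        return False
    return oai_id.strip() == file_id.strip()
```

A provider could run such a check before exposing a record, and a harvester could use it to reject records that violate the rule.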


GISC as OAI-PMH provider

In OAI-PMH, servers are known as providers and clients as harvesters. In WIS, each GISC is both OAI provider and harvester. In order to expose a set of metadata (such as its own or that of a different GISC), the provider creates an OAI set and puts the corresponding metadata into it.

Although for basic WIS metadata synchronization a GISC is only REQUIRED to expose the metadata of its own area of responsibility on the provider, for backup and flexibility each GISC SHOULD also expose the OAI sets of all other GISCs, as well as a set containing the whole WIS catalogue. This means that the following sets need to be created:

  1. one set for the GISC's own metadata
  2. one set for each of the other GISCs' metadata
  3. one set containing the whole WIS metadata catalogue (the set union of 1 & 2)

A naming convention governs the names of the OAI-PMH sets. The metadata from the area of responsibility of a GISC is contained in the OAI-PMH set "WIS-#GISCNAME#" (e.g. WIS-GISC-TOKYO for JMA or WIS-GISC-TOULOUSE for Meteo France). The entire catalogue SHOULD be available as WIS-CATALOGUE.
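The naming convention and the three kinds of sets can be sketched as follows. This is an illustration only; the function names are hypothetical, and the GISC names are the examples given in the text.

```python
CATALOGUE_SET = "WIS-CATALOGUE"


def gisc_set_name(gisc_name):
    """Apply the WIS-#GISCNAME# convention, e.g. GISC-TOKYO -> WIS-GISC-TOKYO."""
    return "WIS-" + gisc_name.upper()


def sets_to_expose(own_gisc, other_giscs):
    """List the sets a provider exposes: own set, other GISCs' sets, whole catalogue."""
    sets = [gisc_set_name(own_gisc)]                        # 1. own metadata
    sets += [gisc_set_name(name) for name in other_giscs]   # 2. other GISCs
    sets.append(CATALOGUE_SET)                              # 3. union of 1 & 2
    return sets


tokyo_sets = sets_to_expose("GISC-TOKYO", ["GISC-TOULOUSE"])
```

Here `tokyo_sets` contains WIS-GISC-TOKYO, WIS-GISC-TOULOUSE and WIS-CATALOGUE, in that order.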

Deletions


To delete a record from the WIS catalogue, a GISC MUST NOT simply remove the record from its OAI provider. Instead, it MUST set the record status to "deleted" and keep it that way for a period of 3 months. This is important for smooth operations: without a list of deleted records, an OAI harvester is forced to retrieve the full list of records to determine which records were deleted, which adds significant load and bandwidth (see Performance).
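On the provider side, this rule amounts to keeping a tombstone for each deleted record. The sketch below uses an in-memory dictionary as a stand-in for the provider's record store and approximates "3 months" as 92 days; both are assumptions made for illustration, as are the function names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "3 months" is approximated as 92 days.
TOMBSTONE_RETENTION = timedelta(days=92)


def delete_record(store, identifier, now):
    """Flag a record as deleted instead of removing it from the store."""
    store[identifier] = {"status": "deleted", "datestamp": now}


def purge_expired_tombstones(store, now):
    """Remove tombstones whose retention period has elapsed; return their ids."""
    expired = [
        identifier
        for identifier, record in store.items()
        if record["status"] == "deleted"
        and now - record["datestamp"] >= TOMBSTONE_RETENTION
    ]
    for identifier in expired:
        del store[identifier]
    return expired
```

The datestamp is updated on deletion so that a harvester using the "from" parameter sees the deletion as a change.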


Selective harvesting


To reduce the amount of metadata that has to be processed, a harvester can use OAI-PMH to retrieve only the metadata that has changed since a certain point in time. The harvester does so by sending the timestamp of its last visit in the "from" parameter of the harvesting request. For example, the following request to the JMA OAI-PMH provider lists all metadata records from JMA's area of responsibility that have changed since 2011-07-12T09:14:30Z:
http://www.shiken.kishou.go.jp/metasearch/oaiprovider.jsp?verb=ListIdentifiers&metadataPrefix=iso19139&set=WISTEST-JMA&from=2011-07-12T09:14:30Z
GISCs MUST support selective harvesting as specified in the OAI-PMH specification, even though it is only an optional feature in vanilla OAI-PMH.
Technically, OAI-PMH sets are also part of selective harvesting, since they allow a harvester to retrieve only a subset of the metadata available on a provider.
GISCs MUST thus also support selective harvesting for sets, as specified above.
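A harvester could assemble such a request as sketched here. The function names are hypothetical; the base URL, set name and timestamp are taken from the example request in this section.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def oai_timestamp(moment):
    """Format an aware datetime as a UTC timestamp with seconds granularity."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def selective_request(base_url, verb, set_name, since):
    """Build a request limited to one set and to changes since a point in time."""
    query = {
        "verb": verb,
        "metadataPrefix": "iso19139",
        "set": set_name,
        "from": oai_timestamp(since),
    }
    # Keep ':' unescaped so the timestamp stays readable in the URL.
    return base_url + "?" + urlencode(query, safe=":")


url = selective_request(
    "http://www.shiken.kishou.go.jp/metasearch/oaiprovider.jsp",
    "ListIdentifiers",
    "WISTEST-JMA",
    datetime(2011, 7, 12, 9, 14, 30, tzinfo=timezone.utc),
)
```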


GISC as OAI-PMH harvester


In order to retrieve all the metadata records from an OAI provider, a harvester can employ different strategies. The two main ones are either to first get a list of available metadata records with ListIdentifiers and then retrieve each metadata record individually with GetRecord, or to retrieve all relevant metadata records with a single ListRecords operation. Differential harvesting SHOULD be used to limit the amount of data that has to be transmitted (see Selective harvesting below).

Each GISC MUST configure a harvesting job for each other GISC, harvesting the metadata of that remote GISC's area of responsibility. It SHOULD harvest the metadata directly from the corresponding GISC, but MAY also retrieve it via a proxy GISC, as described under structured connectivity.
The metadata harvested from the remote GISC SHOULD be put into a local set corresponding to the remote GISC and SHOULD also be put into the WIS-CATALOGUE OAI-PMH set.
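The resulting job configuration might look like the following sketch. The job layout, the function name and the example.org URLs are assumptions for illustration; only the set names follow the naming convention of these guidelines.

```python
CATALOGUE_SET = "WIS-CATALOGUE"


def build_harvest_jobs(own, providers, relays=None):
    """Return one harvesting job per remote GISC.

    providers maps a GISC name to its OAI provider URL. relays optionally
    maps a remote GISC name to the GISC that relays its metadata.
    """
    relays = relays or {}
    jobs = []
    for gisc in sorted(providers):
        if gisc == own:
            continue  # a GISC does not harvest its own metadata
        source = relays.get(gisc, gisc)  # direct harvesting is the default
        set_name = "WIS-" + gisc
        jobs.append({
            "provider_url": providers[source],
            "remote_set": set_name,
            "local_sets": [set_name, CATALOGUE_SET],
        })
    return jobs


# Hypothetical provider URLs; GISC-C reaches GISC-B only via GISC-A.
PROVIDERS = {
    "GISC-A": "https://a.example.org/oai",
    "GISC-B": "https://b.example.org/oai",
    "GISC-C": "https://c.example.org/oai",
}
jobs_for_c = build_harvest_jobs("GISC-C", PROVIDERS, relays={"GISC-B": "GISC-A"})
```

In this example both jobs of GISC-C point at the provider of GISC-A, once for the set WIS-GISC-A and once for the relayed set WIS-GISC-B.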



Selective harvesting


When acting as OAI-PMH harvesters, GISCs SHOULD use selective harvesting to retrieve day-to-day metadata updates from the other GISCs. A full retrieval should be performed only in exceptional cases, such as re-synchronization.

Deletions


In order to learn about deleted records at a remote GISC, the harvester MUST parse the OAI response for records that are flagged as deleted. Due to performance considerations, it is NOT feasible to retrieve all records and compare them with the local catalogue to find out which records have been removed remotely.
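In OAI-PMH, a deleted record is marked by the attribute status="deleted" on its header element. The sketch below separates changed from deleted identifiers in a response; the sample response and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, shortened ListIdentifiers response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>example-record-1</identifier>
      <datestamp>2011-07-12T10:00:00Z</datestamp>
      <setSpec>WIS-GISC-TOKYO</setSpec>
    </header>
    <header status="deleted">
      <identifier>example-record-2</identifier>
      <datestamp>2011-07-12T11:00:00Z</datestamp>
      <setSpec>WIS-GISC-TOKYO</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def split_changed_and_deleted(response_xml):
    """Return (changed, deleted) identifier lists found in an OAI response."""
    root = ET.fromstring(response_xml)
    changed, deleted = [], []
    # Headers appear directly in ListIdentifiers and inside record elements
    # in ListRecords; iter() finds them in both cases.
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            changed.append(identifier)
    return changed, deleted


changed_ids, deleted_ids = split_changed_and_deleted(SAMPLE_RESPONSE)
```

The harvester would then remove the identifiers in `deleted_ids` from its local catalogue and fetch or update those in `changed_ids`.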


Important issues


  • If a GISC loses its history (the history of deleted records), the other GISCs might become inconsistent. This happens if a GISC deletes a metadata record and then suffers a crash. If the OAI-PMH index is rebuilt from scratch, the metadata record will no longer be listed among the deleted records. Other GISCs will not notice that the record has been deleted, unless they harvested between the deletion and the crash. The consequence is a lost-update situation in which the deleted record persists at the other GISCs although it has officially been removed.
    The GISC would have to somehow remember externally which records it deleted in the past and be able to restore this information after a crash. Rebuilding a current index from the available metadata records is not sufficient.

  • If a harvesting GISC loses an update (for whatever reason) and uses differential harvesting, it will perpetuate the lost update, since the lost update will not show up in a later sync.

  • If a harvesting GISC switches OAI providers for the same set, it needs to take precautionary measures; otherwise it can become inconsistent.

  • Time information is crucial for differential harvesting, because the clocks of harvester and provider need to be in sync. All GISCs MUST thus use the Network Time Protocol (NTP) to guarantee time synchronization.

  • Need to define what happens if a long answer is split across several resumptionTokens and an error occurs while processing a later resumptionToken. Do the previously harvested records have to be deleted, too?
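The last issue is left open by these guidelines. Purely as an illustration of one conservative option, and not as a decision, the sketch below stages all pages of a multi-part answer and applies them to the local catalogue only if every page was fetched successfully. All names are hypothetical, and `make_fetcher` is a fake transport used instead of real HTTP requests.

```python
def harvest_complete_list(fetch_page, first_request):
    """Follow resumptionTokens until the list is complete; return all records.

    fetch_page(request) must return (records, next_token_or_None) and may
    raise OSError on a transfer error.
    """
    staged = []
    request = dict(first_request)
    while True:
        records, token = fetch_page(request)
        staged.extend(records)
        if not token:
            return staged
        # A resumptionToken is sent together with the verb only.
        request = {"verb": first_request["verb"], "resumptionToken": token}


def sync(catalogue, fetch_page, first_request):
    """Apply a harvest to the catalogue only if all pages arrived."""
    try:
        staged = harvest_complete_list(fetch_page, first_request)
    except OSError:
        return False  # nothing was applied; retry later with the same "from"
    catalogue.update(staged)
    return True


def make_fetcher(pages, fail_on=None):
    """Fake transport for the demo: pages maps a token to (records, next_token)."""
    def fetch_page(request):
        token = request.get("resumptionToken")
        if fail_on is not None and token == fail_on:
            raise OSError("simulated transfer error")
        return pages[token]
    return fetch_page


REQUEST = {"verb": "ListRecords", "metadataPrefix": "iso19139"}
PAGES = {
    None: ([("rec-1", "<metadata 1>")], "token-2"),
    "token-2": ([("rec-2", "<metadata 2>")], None),
}
```

With this policy a failed later page leaves the local catalogue untouched, at the cost of re-transferring the earlier pages on the next attempt.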

Performance


The WIS metadata catalogue will eventually grow to exceed one million records. For the first version of WIS, catalogue sizes greater than 200,000 records are expected. Each synchronization strategy must thus handle large catalogue sizes efficiently. This means that in normal operation only the metadata that has changed since the catalogue was last synchronized is transmitted. In OAI, this can be achieved using selective harvesting, in which the harvester supplies the "from" parameter in a ListRecords or ListIdentifiers operation. The parameter must be set to the time at which the harvester started its last harvesting operation from this provider. (Note that, to guarantee consistency, the timestamp must not be later than the start of that operation.)
The provider will answer with a list of the metadata records that have changed since the harvester last visited. Note that the state is maintained by the client, which makes the system robust against server reboots etc.
GISC implementers MUST implement differential harvesting and deletions in order to guarantee smooth synchronization without burdening their partner GISCs.
In order to avoid retransmission of the whole list of records on the OAI provider, the providers MUST support "deletions", meaning that metadata records that have been removed from the catalogue are flagged as deleted (see above).
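The client-side state described here can be as small as one timestamp per provider. The following is a minimal sketch, assuming the state is kept in a dictionary and the actual transfer is done by a caller-supplied `harvest` function; both names are hypothetical.

```python
from datetime import datetime, timezone


def run_harvest(state, provider, harvest, now=None):
    """Harvest changes since the start of the previous successful run.

    state maps a provider name to the start time of its last successful
    harvest. harvest(provider, since) performs the transfer; since is None
    for a first, full harvest.
    """
    # Take the timestamp before harvesting, so that changes made on the
    # provider during this run are picked up again by the next run.
    started = now if now is not None else datetime.now(timezone.utc)
    since = state.get(provider)
    harvest(provider, since)
    # Reached only if harvest() did not raise: advance the stored timestamp.
    state[provider] = started
    return since
```

If `harvest` raises, the stored timestamp is not advanced, so the next run asks again for the same period.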

Discussion


This approach has numerous advantages. First, the harvesting of a GISC's metadata takes place at the GISC that is responsible for it. This guarantees fast convergence of the catalogue, since there is only one hop. Second, loops and duplicates are avoided, thanks to the simple and straightforward topology. Finally, the strategy allows for a fast restart after an outage. For instance, if a GISC crashes and is offline for some time, differential harvesting allows it to retrieve only the data that changed during this period, instead of the whole catalogue. Should the GISC lose all its data, it can recover the catalogue by harvesting each GISC's metadata. Should it have lost even its own metadata, it can easily retrieve that metadata from another GISC, using the OAI set functionality.





Page last modified on Tuesday 06 of September, 2011 13:00:17 CEST