
This is the static archive copy of the old wiswiki, decommissioned on June 1 2020


oai-pmh-docu

OAI-PMH for WIS catalogue synchronization between GISCs.




Introduction


ET-WISC, ICT-ISS and ICG/WIS have decided that for the first version of WIS, OAI-PMH will be used for the synchronization of the metadata catalogue between GISCs. Since OAI-PMH is mainly a protocol for metadata harvesting, the following text elaborates how exactly OAI-PMH will be employed to implement synchronization.

Prior discussion


Prior discussions focused on three different models: ring, full-mesh and structured connectivity. A proposal details the three different approaches. Discussions in ET-WISC and among the GISC implementers led to the adoption of full-mesh and structured connectivity as synchronization strategies for GISCs.
Full-mesh and structured connectivity are related in that full-mesh is a special case of structured connectivity.
[Images: full-mesh topology (left), structured-connectivity topology (right)]
The left picture shows three GISCs in full-mesh topology. On the right, GISC A serves as a relay for GISC B and GISC C in structured-connectivity mode, providing the metadata of GISC B to GISC C and vice versa.

The following text focuses on a thorough description of the strategy.

Basic idea


In a nutshell, WIS catalogue synchronization works like this: each GISC offers the metadata it has obtained from its area of responsibility on its OAI provider. In order to get the full catalogue, each GISC harvests this data from all other GISCs. Should two GISCs be unable to communicate directly, they can obtain each other's metadata from a third GISC, which exposes the corresponding metadata in addition to its own.
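
The relay rule above can be sketched as a small selection function. This is only an illustration: the function name choose_sources, the GISC labels A, B and C, and the assumption that any reachable GISC exposes the metadata of an unreachable one are not taken from a WIS specification.

```python
def choose_sources(own, giscs, reachable):
    """Decide, for every other GISC, which GISC its metadata is harvested from.

    own       -- name of the harvesting GISC
    giscs     -- names of all GISCs
    reachable -- set of GISCs that `own` can contact directly
    """
    sources = {}
    for gisc in giscs:
        if gisc == own:
            continue  # a GISC does not harvest its own metadata from itself
        if gisc in reachable:
            # Normal case: harvest directly from the responsible GISC.
            sources[gisc] = gisc
        else:
            # Fallback: use a third GISC as relay (assumed to expose
            # the unreachable GISC's metadata in addition to its own).
            relays = sorted(reachable)
            sources[gisc] = relays[0] if relays else None
    return sources


# GISC B cannot reach GISC C directly, so it uses GISC A as relay.
print(choose_sources("B", ["A", "B", "C"], {"A"}))
```

If every GISC is reachable, the result maps each GISC to itself, which corresponds to the full-mesh case.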

OAI-PMH in WIS


The OAI-PMH operations relevant for WIS are ListRecords, ListIdentifiers, GetRecord and ListSets. Where an operation requires the metadataPrefix parameter, it must be set to iso19139 ("metadataPrefix=iso19139"). The ListIdentifiers and GetRecord operations use an identifier to refer to a specific metadata record. In OAI-PMH this identifier can be of arbitrary nature as long as it is guaranteed to be unique, but in WIS the identifier MUST be equal to the fileIdentifier in the metadata.
Please consult the OAI specifications for further general details on OAI.
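
As an illustration of the requests described above, the following sketch assembles OAI-PMH request URLs for the four operations. The function name build_oai_request and the base URL are placeholders chosen for this example; only the verb names, the metadataPrefix value and the standard OAI-PMH parameter names (verb, metadataPrefix, identifier, set) come from the text and the protocol.

```python
from urllib.parse import urlencode

WIS_METADATA_PREFIX = "iso19139"
WIS_VERBS = ("ListRecords", "ListIdentifiers", "GetRecord", "ListSets")


def build_oai_request(base_url, verb, identifier=None, set_spec=None):
    """Return the URL of an OAI-PMH request for one of the WIS-relevant verbs."""
    if verb not in WIS_VERBS:
        raise ValueError("verb not used in WIS: " + verb)
    params = {"verb": verb}
    if verb != "ListSets":
        # ListRecords, ListIdentifiers and GetRecord need a metadataPrefix.
        params["metadataPrefix"] = WIS_METADATA_PREFIX
    if verb == "GetRecord":
        if identifier is None:
            raise ValueError("GetRecord requires an identifier")
        # In WIS the OAI identifier equals the fileIdentifier of the record.
        params["identifier"] = identifier
    elif set_spec is not None and verb != "ListSets":
        # Restrict ListRecords / ListIdentifiers to one OAI set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder provider address, not a real GISC endpoint.
print(build_oai_request("http://provider.example.org/oai", "ListSets"))
print(build_oai_request("http://provider.example.org/oai", "GetRecord",
                        identifier="some-file-identifier"))
```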

Topology

In OAI-PMH, servers are known as providers and clients as harvesters. In WIS, each GISC is both OAI provider and harvester. In order to expose a set of metadata (such as its own or that of a different GISC), the provider MUST create an OAI set and put the corresponding metadata into it. Sets can be used for selective harvesting, allowing a harvester to retrieve only a subset of the metadata available on a provider. For example, the following OAI request retrieves all metadata records of JMA's area of responsibility from JMA: http://www.shiken.kishou.go.jp/metasearch/oaiprovider.jsp?verb=ListRecords&metadataPrefix=iso19139&set=WISTEST-JMA . The metadata of a GISC's area of responsibility shall henceforth be referred to as "the GISC's metadata". Although for basic WIS metadata synchronization it is only necessary for a GISC to expose its own metadata on the provider, it has been decided that for backup and flexibility each GISC MUST have the following OAI sets.

  1. one set for the GISC's own metadata
  2. one set for the metadata of each of the other GISCs
  3. one set containing the whole WIS metadata catalogue (the set union of 1 & 2)
  4. a GISC may create other sets for local or special use

Please note that all GISCs must offer an OAI set WIS-CATALOGUE (item 3 above) which contains all records.

Currently (2013) these sets are referred to as "WIS-GISC-#GISCNAME#" (e.g. WIS-GISC-TOULOUSE) and WISTEST-CATALOGUE.

Each GISC MUST thus configure one harvesting job for each other GISC, harvesting from it the metadata contained in the set "WIS-GISC-#GISCNAME#". It MUST also configure its own provider such that the metadata harvested from "WIS-GISC-#GISCNAME#" is inserted into a local OAI set of the same name. In addition, the metadata MUST also be part of the WIS-CATALOGUE OAI set.
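
The set layout and the per-GISC harvesting jobs can be written down as a small configuration sketch. The helper names, the dictionary layout of a job and the provider URLs are assumptions made for this example; the set names follow the "WIS-GISC-#GISCNAME#" and WIS-CATALOGUE convention from the text, and apart from TOULOUSE the GISC names are used purely as sample values.

```python
CATALOGUE_SET = "WIS-CATALOGUE"  # referred to as WISTEST-CATALOGUE in 2013


def gisc_set_name(gisc_name):
    """Name of the OAI set holding one GISC's metadata."""
    return "WIS-GISC-" + gisc_name.upper()


def required_sets(all_giscs):
    """Sets every GISC must offer: one per GISC plus the full catalogue."""
    return [gisc_set_name(name) for name in sorted(all_giscs)] + [CATALOGUE_SET]


def harvest_jobs(own, providers):
    """One harvesting job per other GISC.

    providers -- dict mapping GISC name to the base URL of its OAI provider
    """
    jobs = []
    for name, base_url in sorted(providers.items()):
        if name.upper() == own.upper():
            continue  # own metadata is not harvested from oneself
        set_name = gisc_set_name(name)
        jobs.append({
            "source": base_url,
            "remote_set": set_name,
            # Harvested records go into the local set of the same name
            # and into the catalogue set.
            "local_sets": [set_name, CATALOGUE_SET],
        })
    return jobs


# Placeholder URLs; the GISC names are sample values only.
providers = {
    "TOULOUSE": "http://toulouse.example.org/oai",
    "TOKYO": "http://tokyo.example.org/oai",
    "OFFENBACH": "http://offenbach.example.org/oai",
}
for job in harvest_jobs("TOULOUSE", providers):
    print(job)
```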

Harvesting


In order to retrieve all the metadata records from an OAI provider, a harvester can employ different strategies. The two main ones are either to first get a list of available metadata records with ListIdentifiers and then retrieve each metadata record individually with GetRecord, or to retrieve all relevant metadata records with a ListRecords operation. Differential harvesting can be used to limit the amount of data that has to be transmitted (see Performance).
To delete a record from the WIS catalogue, a GISC SHALL, rather than simply removing the record from its OAI provider, set the record status to "deleted" for a period of 3 months. This is important for smooth operations, since without a list of deleted records an OAI harvester is forced to retrieve the full list of records to determine which records were deleted, adding significant load and bandwidth usage (see Performance).
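
To show how a harvester can tell deleted records from live ones, the sketch below reads the record headers of a ListIdentifiers or ListRecords response. It relies on the OAI-PMH 2.0 response format, in which a deleted record carries status="deleted" on its header element. The function name split_records and the sample identifiers are made up for this example, and paging of long responses is left out.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def split_records(response_xml):
    """Return (updated, deleted) identifier lists from an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    updated, deleted = [], []
    # Headers appear directly under ListIdentifiers and inside each
    # record element under ListRecords; ".//" matches both.
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            deleted.append(identifier)  # remove from the local catalogue
        else:
            updated.append(identifier)  # insert or update locally
    return updated, deleted


# Hand-written sample response with hypothetical identifiers.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>record-a</identifier>
      <datestamp>2013-11-01T00:00:00Z</datestamp>
    </header>
    <header status="deleted">
      <identifier>record-b</identifier>
      <datestamp>2013-11-02T00:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

print(split_records(SAMPLE))
```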


Performance


The WIS metadata catalogue will eventually grow to exceed one million records. For the first version of WIS, catalogue sizes greater than 200,000 records are expected. Each synchronization strategy must thus handle large catalogue sizes efficiently. This means that in normal operation only the metadata that has changed since the catalogue was last synchronized is transmitted. In OAI, this can be achieved using selective harvesting. In selective harvesting the harvester supplies the "from" parameter in a ListRecords or ListIdentifiers operation. The parameter must be set to the time the harvester last started a harvesting operation from this provider. (Note that the date must be before the start of the last operation to guarantee consistency.)
The provider will answer with a list of the metadata records that have changed since the harvester last visited. Note that the state is maintained by the client, which makes the system robust against server reboots etc.
GISC implementers SHALL implement differential harvesting and deletions in order to guarantee smooth synchronization without burdening their partner GISCs.
In order to avoid retransmission of the whole list of records on the OAI provider, the providers SHALL support "deletions", meaning that metadata records which have been removed from the catalogue are flagged as deleted (see above).
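
A minimal sketch of how a harvester might derive the "from" parameter from its own state follows. The five-minute safety margin, the function names and the provider URL are assumptions for illustration, and the example further assumes that the provider accepts timestamps with seconds granularity.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumed margin so that "from" lies before the start of the last harvest.
SAFETY_MARGIN = timedelta(minutes=5)


def next_from(last_start):
    """UTC timestamp for the "from" parameter, given the last harvest start.

    last_start must be a timezone-aware datetime.
    """
    moment = last_start.astimezone(timezone.utc) - SAFETY_MARGIN
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def differential_request(base_url, set_spec, last_start=None):
    """ListRecords URL that asks only for changes since the last harvest."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "iso19139",
        "set": set_spec,
    }
    if last_start is not None:
        # The harvester keeps this state; the provider stays stateless.
        params["from"] = next_from(last_start)
    # Without last_start this is a full harvest, e.g. the very first run.
    return base_url + "?" + urlencode(params)


last = datetime(2013, 11, 22, 15, 0, 0, tzinfo=timezone.utc)
print(differential_request("http://provider.example.org/oai",
                           "WIS-GISC-TOULOUSE", last))
```

The harvester should record the start time of each run before sending the first request and store it only after the run has completed successfully, so that a failed run is repeated from the previous timestamp.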

Advantages


This approach has numerous advantages. First, the harvesting of a GISC's metadata takes place at the GISC that is responsible for it. This guarantees a fast convergence of the catalogue, since there is only one hop. Second, loops and duplicates are avoided, due to the simple and straightforward topology. Finally, the strategy allows for a fast restart after an outage. For instance, if a GISC crashes and is offline for some time, differential harvesting allows it to retrieve only the data that changed during this period, instead of the whole catalogue. Should the GISC lose all its data, it can recover the catalogue by harvesting each GISC's metadata. Should it have lost even its own metadata, it can easily retrieve its own metadata from another GISC, using the OAI set functionality.


Issues that need fixing


  • If a GISC loses its history (the history of deleted records), the other GISCs might become inconsistent. The GISC would have to somehow remember externally which records it deleted in the past. Rebuilding a current index from the available metadata records is not sufficient. This can be part of a backup strategy, though.
  • If a harvesting GISC loses an update (for whatever reason) and uses differential harvesting, it will perpetuate the lost update, since the lost update will not show up in a later sync.
  • If a harvesting GISC switches OAI providers for the same set, it needs to take precautionary measures, otherwise it can become inconsistent.
  • When harvesting selectively with "from" and "to", the clocks of the servers need to be in sync. Ergo, NTP synchronization of the GISC portal should be required.


Page last modified on Friday 22 of November, 2013 16:00:17 CET