Welcome to the documentation on metadata synchronization with OAI
The Open Archives Initiative (OAI) protocol, version 2, has been designated as the protocol for GISC-to-GISC synchronization in WIS, and it is expected that it will also be widely used for uploading metadata from NCs and DCPCs to GISCs.
Functioning of OAI
In a nutshell, OAI publishes the metadata available in a catalogue by responding to requests sent to the OAI service via HTTP. For example, the WMO test OAI servers are reachable at
- http://85.10.219.212/geonetwork/srv/en/oaipmh?verb=ListIdentifiers&metadataPrefix=iso19139
- http://85.10.219.212/oai/provider?verb=ListIdentifiers&metadataPrefix=ISO19139
In both cases the ListIdentifiers operation is called by specifying the URL parameter "verb". Other important operations are "ListRecords", "ListSets" and "GetRecord". See the official OAI reference for more information and further operations.
The OAI service then returns an XML document containing either the actual metadata as payload (ListRecords operation), or a metadata identifier and a timestamp identifying each metadata record and its time of modification (ListIdentifiers operation).
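To make the shape of a ListIdentifiers reply more concrete, the following minimal sketch parses one with Python's standard library. The sample reply, its identifiers and the function name parse_headers are made up for illustration; only the element names (header, identifier, datestamp) and the namespace come from the OAI v2 protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI v2 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical sample reply; identifiers and datestamps are invented.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-01-01T00:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="iso19139">http://example.org/oai</request>
  <ListIdentifiers>
    <header>
      <identifier>example-record-1</identifier>
      <datestamp>2009-11-30T10:15:00Z</datestamp>
    </header>
    <header>
      <identifier>example-record-2</identifier>
      <datestamp>2009-12-05T08:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def parse_headers(xml_text):
    """Return a dict mapping each record identifier to its datestamp."""
    root = ET.fromstring(xml_text)
    headers = {}
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        headers[identifier] = datestamp
    return headers


if __name__ == "__main__":
    for identifier, datestamp in parse_headers(SAMPLE_RESPONSE).items():
        print(identifier, datestamp)
```

A real harvester would fetch the reply over HTTP instead of using an embedded string; that part is left out here so the sketch runs without network access.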
OAI for metadata synchronization
OAI is used for synchronization in the following way. An OAI client obtains the list of available metadata by issuing a ListIdentifiers query to the OAI provider. The client then processes the result set (containing the identifiers and timestamps of all metadata available on the provider) and issues GetRecord queries for all metadata records that are not yet in the client's catalogue, as well as for all records whose local copy is outdated.
Because ListIdentifiers returns only headers and not the full documents, the XML reply is small and bandwidth is saved.
Selective harvesting allows a harvester to supply a timestamp parameter ("from") to ListIdentifiers or ListRecords. In this case only records that have changed since the given time, typically the last synchronization, are returned. Selective harvesting also allows harvesting only metadata belonging to a certain set. This is explained in more detail below.
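The comparison step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the example base URL and the dictionaries are hypothetical, and timestamps are assumed to be UTC strings of the same granularity so that they can be compared as plain strings.

```python
from urllib.parse import urlencode


def records_to_fetch(remote_headers, local_catalogue):
    """Return identifiers that are missing locally or older than the remote copy.

    Both arguments map identifier -> timestamp string. Assumes UTC timestamps
    of equal granularity, which sort correctly as strings.
    """
    to_fetch = []
    for identifier, remote_stamp in remote_headers.items():
        local_stamp = local_catalogue.get(identifier)
        if local_stamp is None or local_stamp < remote_stamp:
            to_fetch.append(identifier)
    return sorted(to_fetch)


def get_record_url(base_url, identifier, metadata_prefix="iso19139"):
    """Build the GetRecord request for one identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    remote = {"rec-1": "2009-12-01T00:00:00Z", "rec-2": "2009-12-05T00:00:00Z",
              "rec-3": "2009-12-07T00:00:00Z"}
    local = {"rec-1": "2009-12-01T00:00:00Z", "rec-2": "2009-11-20T00:00:00Z"}
    for identifier in records_to_fetch(remote, local):
        print(get_record_url("http://example.org/oai", identifier))
```

In this example rec-1 is up to date and skipped, rec-2 is outdated and rec-3 is missing, so only two GetRecord requests are issued.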
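As an illustration of selective harvesting, the sketch below builds a ListIdentifiers request with the optional OAI v2 request arguments "from" (timestamp) and "set". The helper name list_identifiers_url, the example base URL and the set name EXAMPLESET are hypothetical.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix="iso19139",
                         from_date=None, set_spec=None):
    """Build a ListIdentifiers request, optionally restricted by date and set."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # only records changed since this time
    if set_spec is not None:
        params["set"] = set_spec    # only records belonging to this set
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_identifiers_url("http://example.org/oai",
                               from_date="2009-12-01",
                               set_spec="EXAMPLESET"))
```

A harvester would typically remember the time of its last successful run and pass it as from_date on the next run.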
OAI in Geonetwork
Geonetwork is an OAI provider by default. This means that public metadata can be harvested from a Geonetwork node without any configuration, by pointing the harvester to geonetwork/srv/en/oaipmh. To make metadata public, log in as admin and search for all metadata by pressing search in the portal without specifying a search term. Click on "select all" and "actions on selected", and choose to update privileges. Check all boxes and the metadata will be public.
The availability of the data can be verified by going to /geonetwork/srv/en/oaipmh?verb=ListIdentifiers&metadataPrefix=iso19139. Only metadata classified as ISO19139 will be output.
The metadataPrefix is an important variable in the synchronization process, and it is imperative that it be set correctly when configuring the synchronization.
When ISO19139 data is imported into Geonetwork, it is automatically recognized as iso19139.
In order to harvest metadata from other providers with Geonetwork, go into the administration section and select "harvesting management". If any harvesting jobs are configured, you will see them listed.
Click on "add" to add a new OAI provider. Select "OAI Protocol for Metadata Harvesting 2.0" and click "add" again.
You can choose the name freely. The URL has to be set to the URI of the remote OAI provider. For example, to harvest the WMO Geonetwork test server, set the URL to http://85.10.219.212/geonetwork/srv/en/oaipmh. For the WMO JOAI, set it to http://85.10.219.212/oai/provider.
In WIS, metadata is public, so no username or password is needed (untick the box as well).
The next step is important. Click on "add" in the "Search criteria" section. Then click on "Retrieve info". Geonetwork will now query the server for the supported metadata prefixes and sets. Select "iso19139" as prefix. The set can be used to retrieve only a certain subset of the metadata, but is not important for the moment (see more below).
The options can be used to configure a regular harvesting pattern. For testing, nothing has to be configured here.
Using the privileges, you can assign access rights to the newly harvested data. If nothing is selected, the data is readable by admin only.
The categories feature allows you to classify the harvested data. Categories can be managed under "administration" -> "category management".
Make sure to leave the screen by pressing "save". You have now set up a new provider which can be harvested. By ticking the box on the left and pressing the "run" button, you can trigger the harvesting. Metadata records should now start being added to the catalogue. (You can search for everything in the portal and see how the number increases, or watch the logfile.)
The most frequent reason for Geonetwork (GN) not harvesting is that the wrong metadataPrefix is requested (you can see this in the logfile of the target server). In WIS, iso19139 is needed. If, e.g., the harvester requested oai_dc, a WIS-compliant OAI provider would not return any documents. For testing, you can point your browser to the OAI provider and use ListIdentifiers with the right metadataPrefix argument. For instance, the WMO Geonetwork could be queried like this: http://85.10.219.212/geonetwork/srv/en/oaipmh?verb=ListIdentifiers&metadataPrefix=iso19139.
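When testing by hand, it can help to look for an error element in the reply. OAI v2 reports problems inside the XML document with a code such as cannotDisseminateFormat or noRecordsMatch; which of these a particular server sends for an unsupported prefix is not stated here and may vary. The sketch below extracts such codes from a made-up reply; the sample text and the function name oai_errors are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical reply to a request with an unsupported metadataPrefix.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-01-01T00:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc">http://example.org/oai</request>
  <error code="cannotDisseminateFormat">illustrative message</error>
</OAI-PMH>
"""


def oai_errors(xml_text):
    """Return a list of (code, message) pairs for every error element in the reply."""
    root = ET.fromstring(xml_text)
    return [
        (error.get("code"), (error.text or "").strip())
        for error in root.iterfind("oai:error", OAI_NS)
    ]


if __name__ == "__main__":
    for code, message in oai_errors(SAMPLE_ERROR):
        print(f"OAI error {code}: {message}")
```

An empty list means the reply carried no OAI error, in which case the headers or records can be processed as usual.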
Selective harvesting with OAI
One option currently being discussed for synchronization in WIS is that, from any GISC, only the metadata that GISC is responsible for is harvested (full mesh). A variant of this is the "managed connectivity approach", where metadata under the responsibility of GISC A can also be harvested from a number of other GISCs that provide backup services for GISC A.
Both can be accomplished with OAI using "selective harvesting". Selective harvesting allows a harvester to harvest only a subset of the metadata available at an OAI provider.
This allows a GISC to hold the full catalogue, while other GISCs can still harvest only the subset of data this GISC is responsible for.
For this, the GISCs have to categorize the metadata. In OAI jargon the categories are called "sets". In Geonetwork they are called "categories". On top of this, Geonetwork (and other OAI implementations) can categorize metadata when it is harvested.
Using these two features one can implement the "full mesh" and "managed connectivity" approaches. Each GISC classifies the metadata it is responsible for into a category (e.g. the imaginary WMO GISC classifies its data into a category WMOGISCMETADATA). The WMO GISC also sets up harvesting tasks to retrieve metadata from other GISCs. Since the other GISCs also classify their data into categories, the WMO GISC can retrieve the metadata of an Asian GISC by using selective harvesting, e.g. oai?verb=ListIdentifiers&metadataPrefix=iso19139&set=ASIANGISCMETADATA (the OAI request argument is named "set"; "setSpec" is the element that names a set in OAI responses). The WMO GISC will categorize this data as ASIANGISCMETADATA (or any other name distinct from WMOGISCMETADATA). Now the WMO GISC holds the Asian metadata as well as its own, but harvesters harvesting from it can request only the authoritative data by using selective harvesting with the set=WMOGISCMETADATA parameter.
It is also clear that "managed connectivity" can be implemented this way, since the above allows a third GISC to harvest the Asian GISC's metadata from the WMO GISC by using set=ASIANGISCMETADATA. The WMO GISC thereby plays the role of a backup GISC for the Asian GISC.
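The effect of categorizing metadata into sets can be modelled with a few lines of code. This is a toy model, not a provider implementation: the catalogue contents, the record identifiers and the function name list_identifiers are hypothetical; only the two set names come from the example above.

```python
# Hypothetical catalogue held by the WMO GISC: identifier -> category (OAI set).
WMO_GISC_CATALOGUE = {
    "wmo-record-1": "WMOGISCMETADATA",    # authoritative at the WMO GISC
    "wmo-record-2": "WMOGISCMETADATA",
    "asia-record-1": "ASIANGISCMETADATA",  # harvested from the Asian GISC
}


def list_identifiers(catalogue, set_spec=None):
    """Return the identifiers in the catalogue, optionally restricted to one set."""
    return sorted(
        identifier
        for identifier, category in catalogue.items()
        if set_spec is None or category == set_spec
    )


if __name__ == "__main__":
    # Full mesh: other GISCs request only the authoritative WMO records.
    print(list_identifiers(WMO_GISC_CATALOGUE, "WMOGISCMETADATA"))
    # Managed connectivity: a third GISC uses the WMO GISC as a backup source.
    print(list_identifiers(WMO_GISC_CATALOGUE, "ASIANGISCMETADATA"))
```

The same catalogue serves both approaches; the only difference is which set the harvesting GISC asks for.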
Issues and problems
There is an issue with the metadata's unique identifier. If the same metadata file is harvested from two different sources, OAI considers the two copies to be different. The algorithm to determine the uniqueness of metadata1 and metadata2 is:
IF ( uid1 == uid2 AND source1 == source2 ) THEN return equal; ELSE return distinct; END.
So if the sources are not the same, the same metadata is considered distinct, which leads to a database error (the DB enforces the uniqueness of identifiers).
A possible workaround is to change the definition of equality, but it is not clear what consequences this would have.
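The problem can be reproduced in a simplified model. The sketch below is not the harvester's actual code: the record layout, the function names and the in-memory list standing in for the database are assumptions made for illustration. It applies the equality rule quoted above and then shows what a uid-only comparison would do instead.

```python
def same_record(a, b):
    """Equality rule quoted above: identifier and source must both match."""
    return a["uid"] == b["uid"] and a["source"] == b["source"]


def same_record_by_uid(a, b):
    """Possible workaround: compare identifiers only (consequences unclear)."""
    return a["uid"] == b["uid"]


def harvest(catalogue, record, is_same):
    """Add or update a record; the catalogue enforces unique identifiers."""
    for existing in catalogue:
        if is_same(existing, record):
            existing.update(record)
            return "updated"
    if any(existing["uid"] == record["uid"] for existing in catalogue):
        # Stand-in for the database rejecting a duplicate identifier.
        return "error: duplicate uid"
    catalogue.append(dict(record))
    return "inserted"


if __name__ == "__main__":
    from_wmo = {"uid": "example-uid", "source": "WMO GISC"}
    from_asia = {"uid": "example-uid", "source": "ASIAN GISC"}
    catalogue = []
    print(harvest(catalogue, from_wmo, same_record))
    print(harvest(catalogue, from_asia, same_record))
    print(harvest(catalogue, from_asia, same_record_by_uid))
```

With the uid-only rule the second copy silently overwrites the first, including its source, which is one example of a consequence that would need to be evaluated before adopting the workaround.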