Harvesting content from the Pure web services (changes)
It is sometimes necessary to extract large amounts of content from Pure, for example for use in other systems that cannot query the web service at runtime.
Depending on the use case, content can be harvested in several ways:
OAI-PMH
The OAI interface is specifically designed for harvesting content. The format available in OAI is not as complete as that in the REST service, but if an infrastructure is already in place for OAI this can be the best way to go.
Read more on http://www.openarchives.org/pmh/
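As a rough illustration of OAI-PMH harvesting, the sketch below builds ListRecords request URLs. The verb, metadataPrefix and resumptionToken arguments come from the OAI-PMH protocol itself; the address of the OAI endpoint on a Pure server is not given in this article, so it is passed in as a parameter, and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(oai_base_url: str,
                     resumption_token: Optional[str] = None,
                     metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords URL.

    oai_base_url is an assumption supplied by the caller (the OAI endpoint
    address is not stated in this article).
    """
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{oai_base_url}?{urlencode(params)}"
```

A harvester would request the first URL, read the resumptionToken from the response, and keep requesting with that token until the server no longer returns one.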
REST
Most of the content types in Pure can be retrieved from Pure's web service endpoints.
In combination with the Changes Stream Endpoint (described later), it is possible to harvest all publicly available content from Pure. The suggested flow for an initial harvest of Pure data, followed by synchronization via the change stream, is as follows:
- Service records the start date for the harvest (for example 2019-01-01)
- Service harvests all publications (or other content) from the web service and stores them locally
  - This is done by paging through the research output endpoint via windowing (can be done in parallel if the server hardware is adequate):
    - http://your-pure-server/ws/api/524/research-outputs?size=20
    - http://your-pure-server/ws/api/524/research-outputs?size=20&offset=20
    - http://your-pure-server/ws/api/524/research-outputs?size=20&offset=40
    - http://your-pure-server/ws/api/524/research-outputs?size=20&offset=60
    - ... and so on until no more data is returned
- Service requests changes from the changes stream endpoint:
  - Passing a date (using example date 2019-01-01) to get all changes that have occurred since the harvest was started: http://your-pure-server/ws/api/524/changes/2019-01-01
  - The change stream response contains all content changes and a resumption token to fetch the next set of changes
- Service processes the change response, and if any changes were for a publication, it performs the needed processing:
  - Requests the changed research output(s) from the content endpoint using the UUID of the content: http://your-pure-server/ws/api/524/research-outputs/{id}
- Service stores the resumption token for future interactions with the change stream.
- The service now begins a loop that is repeated every NN minutes/hours and asks the change stream for changes again:
  - This time passing the resumption token instead of a date: http://your-pure-server/ws/api/524/changes/{token}
  - The change stream response contains all content changes, a resumption token to fetch the next set of changes, and a 'moreChanges' value indicating if there are more changes.
- Service processes the change response, and if any changes were for a research output, it performs the needed processing:
  - Requests the changed research output(s) from the content endpoint using the UUID of the content: http://your-pure-server/ws/api/524/research-outputs/{id}
- Service stores the resumption token for future interactions with the change stream.
- and repeats the process indefinitely ...
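The windowed paging in the initial harvest step can be sketched as follows. This is a minimal illustration, not a definitive client: the function names are hypothetical, and fetch_items stands in for whatever code performs the HTTP GET (including any authentication your server requires) and returns the items of one page.

```python
from typing import Callable, Iterator, List

BASE_URL = "http://your-pure-server/ws/api/524"  # placeholder host used in this article


def page_urls(endpoint: str, size: int = 20) -> Iterator[str]:
    """Yield window URLs for offsets 0, size, 2*size, ... without end."""
    offset = 0
    while True:
        url = f"{BASE_URL}/{endpoint}?size={size}"
        if offset:
            url += f"&offset={offset}"
        yield url
        offset += size


def harvest_all(endpoint: str,
                fetch_items: Callable[[str], List[dict]],
                size: int = 20) -> List[dict]:
    """Page through an endpoint until a window comes back empty.

    fetch_items is a hypothetical callable supplied by the caller; it is
    assumed to return the list of items contained in one page.
    """
    harvested: List[dict] = []
    for url in page_urls(endpoint, size):
        items = fetch_items(url)
        if not items:
            break  # no more data is returned, so the harvest is complete
        harvested.extend(items)
    return harvested
```

Injecting the fetch function keeps the paging logic independent of the HTTP library and makes it easy to run several windows in parallel later if the server can handle it.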
Content Rendering
The web service allows the output format for content endpoints to be controlled with an additional request parameter called rendering. Pure supports, among others, various XHTML citation formats such as harvard. For example: http://your-pure-server/ws/api/524/research-outputs/{id}?rendering=harvard
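A small sketch of building such a URL is shown here. The helper name is hypothetical; only the rendering parameter and the harvard value are taken from the text above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://your-pure-server/ws/api/524"  # placeholder host used in this article


def content_url(endpoint: str, uuid: str, rendering: Optional[str] = None) -> str:
    """Build the URL for one content item, optionally asking for a rendering."""
    url = f"{BASE_URL}/{endpoint}/{uuid}"
    if rendering:
        # Append the rendering request parameter, e.g. rendering=harvard
        url += "?" + urlencode({"rendering": rendering})
    return url
```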
Harvesting all content by paging, as described above, is a fine strategy if data is to be exported once or a few times. But if content should be kept in sync as data changes in Pure, the full harvest would have to be repeated at regular intervals, which would impose an unnecessary load on the Pure server. Instead, we recommend using the changes operation described next.
Changes Stream Endpoint
The Changes Stream Endpoint allows external services to receive updates to all content maintained in Pure. Each creation, update or deletion of content is visible in the stream. The changes stream emits only very basic information about the content, such as the content UUID and family.
The change stream endpoint, in combination with the content REST endpoints, enables the creation of external services that continuously synchronize data from Pure with relatively low delay.
External services interact with the change stream endpoint by issuing GET requests that contain either a 'fromDate' parameter, which defines the start date for changes, or a 'resumptionToken' parameter, which is returned by the change stream endpoint in each response:
The external service makes its initial request to the changes endpoint with the 'fromDate' parameter and receives a batch of the changes that have happened from that date onwards. The result also includes a resumption token that should be passed in the next change request. The service then makes another request to the change stream endpoint, this time passing the resumption token as the 'resumptionToken' parameter. This is repeated a number of times (either until a number of events have been received or until the change stream returns an empty result set).
Based on the changes that were received, the external service then performs a number of requests to the Pure content REST endpoints to get data for the updated content, and performs other types of processing based on the received changes.
Then, after some time, the process starts again, with the service asking for changes by passing the last received resumption token. Through this simple process, it is possible to keep an external service synchronized with the Pure installation with very limited delays in updates.
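The request cycle described above could look roughly like this. It is a sketch under assumptions: fetch_changes is a hypothetical callable that performs the GET and returns the parsed batch as (changes, resumption token, moreChanges), and max_batches is an illustrative safety cap rather than anything defined by Pure.

```python
from typing import Callable, List, Tuple

BASE_URL = "http://your-pure-server/ws/api/524"  # placeholder host used in this article

# One parsed response: (list of changes, resumption token, moreChanges flag)
Batch = Tuple[List[dict], str, bool]


def drain_changes(start: str,
                  fetch_changes: Callable[[str], Batch],
                  max_batches: int = 100) -> Tuple[List[dict], str]:
    """Follow the change stream from a date or a stored resumption token.

    Stops when the server reports no more changes, when a batch is empty,
    or when max_batches is reached. Returns the collected changes and the
    last resumption token, which the caller should persist for the next run.
    """
    collected: List[dict] = []
    cursor = start  # a date such as 2019-01-01 on the first run, a token afterwards
    for _ in range(max_batches):
        items, token, more = fetch_changes(f"{BASE_URL}/changes/{cursor}")
        collected.extend(items)
        cursor = token
        if not more or not items:
            break
    return collected, cursor
```

A scheduler would call drain_changes with the stored token every NN minutes or hours, process the returned changes, and store the new token.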
Changes in associated content
The changes API only returns changes within a single content item, not changes to its associated content. For example, if a ResearchOutput changes its title or year, this is output as a change event of type ResearchOutput, whereas if the name of an associated author changes, that is not included in the ResearchOutput change. Instead, it is part of a Person change.
If you need data to be re-fetched when associated data changes, as in the example above where a rendering of the author list could be affected, you will need to keep track of the relations between content items so that the Person change can be used to look up the affected research outputs.
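One possible way to keep track of these relations is a reverse index that is updated from the relation changes in the stream (ADDED and REMOVED, as described in the response format section). The class and method names below are hypothetical, and a real service would persist the index rather than keep it in memory.

```python
from collections import defaultdict
from typing import Dict, Set


class RelationIndex:
    """Maps a related item's UUID (e.g. a Person) to the content that references it."""

    def __init__(self) -> None:
        self._dependents: Dict[str, Set[str]] = defaultdict(set)

    def apply(self, content_uuid: str, relation_uuid: str, change_type: str) -> None:
        """Record one relation change reported for content_uuid."""
        if change_type == "ADDED":
            self._dependents[relation_uuid].add(content_uuid)
        elif change_type == "REMOVED":
            self._dependents[relation_uuid].discard(content_uuid)

    def affected_by(self, relation_uuid: str) -> Set[str]:
        """Return the UUIDs of content to re-fetch when relation_uuid changes."""
        return set(self._dependents.get(relation_uuid, ()))
```

When a Person change arrives, affected_by(person_uuid) gives the research outputs whose stored rendering may need to be refreshed.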
Change Stream Format
After content is initially exported, as above, the changes operation can be used to monitor when subsequent changes occur:
- First request http://your-pure-server/ws/api/524/changes/2019-01-01
- Then followed by http://your-pure-server/ws/api/524/changes/{token}
Below is an example change stream response that shows the creation, update and deletion of a single research output. The XML schema references have been removed for brevity:
<result xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://experts-master.demo.atira.dk/ws/api/515/xsd/schema1.xsd">
  <count>3</count>
  <navigationLinks>
    <navigationLink ref="next" href="http://experts-master.demo.atira.dk/ws/api/515/changes/eyJzZXF1ZW5jZU51bWJlciI6MTk1NzQzNDl9"/>
  </navigationLinks>
  <resumptionToken>eyJzZXF1ZW5jZU51bWJlciI6MTk1NzQzNDl9</resumptionToken>
  <moreChanges>false</moreChanges>
  <items>
    <!-- Creation of a publication -->
    <contentChange>
      <uuid>909acdcb-c0e5-4130-9aa9-19d329696b30</uuid>
      <changeType>CREATE</changeType>
      <family>dk.atira.pure.api.shared.model.researchoutput.ResearchOutput</family>
      <familySystemName>ResearchOutput</familySystemName>
      <version>0</version>
      <!-- When content is created, all content associations are listed as additions -->
      <relationChanges>
        <relationChange>
          <uuid>c712b1cf-dcb8-4426-9107-ffd70aec17ce</uuid>
          <family>dk.atira.pure.api.shared.model.organisation.Organisation</family>
          <familySystemName>Organisation</familySystemName>
          <changeType>ADDED</changeType>
        </relationChange>
        <relationChange>
          <uuid>32ceb44e-c066-4a3f-b25b-7e680ff36bf5</uuid>
          <family>dk.atira.pure.api.shared.model.journal.Journal</family>
          <familySystemName>Journal</familySystemName>
          <changeType>ADDED</changeType>
        </relationChange>
        <relationChange>
          <uuid>91d7c0b2-4e04-42bd-a099-aaa4357835e6</uuid>
          <family>dk.atira.pure.api.shared.model.person.Person</family>
          <familySystemName>Person</familySystemName>
          <changeType>ADDED</changeType>
        </relationChange>
        <relationChange>
          <uuid>02ca812c-55f2-485e-aa7d-8a3659f76693</uuid>
          <family>dk.atira.pure.api.shared.model.organisation.Organisation</family>
          <familySystemName>Organisation</familySystemName>
          <changeType>ADDED</changeType>
        </relationChange>
      </relationChanges>
    </contentChange>
    <!-- Update of a publication -->
    <contentChange>
      <uuid>909acdcb-c0e5-4130-9aa9-19d329696b30</uuid>
      <changeType>UPDATE</changeType>
      <family>dk.atira.pure.api.shared.model.researchoutput.ResearchOutput</family>
      <familySystemName>ResearchOutput</familySystemName>
      <version>2</version>
    </contentChange>
    <!-- Deletion of a publication -->
    <contentChange>
      <uuid>909acdcb-c0e5-4130-9aa9-19d329696b30</uuid>
      <changeType>DELETE</changeType>
      <family>dk.atira.pure.api.shared.model.researchoutput.ResearchOutput</family>
      <familySystemName>ResearchOutput</familySystemName>
      <version>2</version>
      <!-- When content is deleted, all content associations are listed as deletions as well -->
      <relationChanges>
        <relationChange>
          <uuid>02ca812c-55f2-485e-aa7d-8a3659f76693</uuid>
          <family>dk.atira.pure.api.shared.model.organisation.Organisation</family>
          <familySystemName>Organisation</familySystemName>
          <changeType>REMOVED</changeType>
        </relationChange>
        <relationChange>
          <uuid>c712b1cf-dcb8-4426-9107-ffd70aec17ce</uuid>
          <family>dk.atira.pure.api.shared.model.organisation.Organisation</family>
          <familySystemName>Organisation</familySystemName>
          <changeType>REMOVED</changeType>
        </relationChange>
        <relationChange>
          <uuid>91d7c0b2-4e04-42bd-a099-aaa4357835e6</uuid>
          <family>dk.atira.pure.api.shared.model.person.Person</family>
          <familySystemName>Person</familySystemName>
          <changeType>REMOVED</changeType>
        </relationChange>
        <relationChange>
          <uuid>32ceb44e-c066-4a3f-b25b-7e680ff36bf5</uuid>
          <family>dk.atira.pure.api.shared.model.journal.Journal</family>
          <familySystemName>Journal</familySystemName>
          <changeType>REMOVED</changeType>
        </relationChange>
      </relationChanges>
    </contentChange>
  </items>
</result>
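A response shaped like the example above could be parsed with Python's standard library as sketched below. The function name and the returned dictionary layout are hypothetical, and the sketch assumes the elements carry no XML namespace, as in the example; a real response may differ.

```python
import xml.etree.ElementTree as ET


def parse_changes(xml_text: str) -> dict:
    """Extract the token, the moreChanges flag and the changes from a response."""
    root = ET.fromstring(xml_text)
    changes = []
    for node in root.iter("contentChange"):
        # findtext only looks at direct children, so the uuid and changeType
        # read here belong to the content item, not to its relations.
        relations = [
            {
                "uuid": rel.findtext("uuid"),
                "family": rel.findtext("familySystemName"),
                "changeType": rel.findtext("changeType"),
            }
            for rel in node.iter("relationChange")
        ]
        changes.append({
            "uuid": node.findtext("uuid"),
            "changeType": node.findtext("changeType"),
            "family": node.findtext("familySystemName"),
            "version": int(node.findtext("version", "0")),
            "relations": relations,
        })
    return {
        "resumptionToken": root.findtext("resumptionToken"),
        "moreChanges": root.findtext("moreChanges") == "true",
        "changes": changes,
    }
```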
Change Stream Response Format
A 'change' element ('contentChange' in the XML example above) defines a change that has occurred for a specific content object. It contains the following attributes:
- family: The Java class name for the content family
- familySystemName: A unique identifier for the family
- uuid: The globally unique ID for the content
- changeType: The change type:
  - CREATE when content is created
  - UPDATE when existing content is updated
  - DELETE when content is deleted
- version: The content version. Each time content is updated, its version is incremented.
Each 'change' element may contain a list of 'rel' elements ('relationChange' in the XML example above) that list changes in relations to other content. Each 'rel' element contains the following attributes:
- family: The Java class name for the content family
- familySystemName: A unique identifier for the family
- uuid: The globally unique ID for the content
- relationChangeType (the 'changeType' of a 'relationChange' in the XML example above): The type of relation change:
  - ADDED is emitted when the relation is created. This can happen when the 'change' element's changeType is CREATE or UPDATE
  - REMOVED is emitted when the relation is deleted. This can happen when the 'change' element's changeType is UPDATE or DELETE
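To illustrate how the change types and the version might drive a local copy, here is a minimal sketch. The store layout, the function name and the fetch_content callable are hypothetical, and skipping events whose version is not newer than the stored one is an assumption made for this example, not documented Pure behaviour.

```python
from typing import Callable, Dict


def apply_change(store: Dict[str, dict],
                 change: dict,
                 fetch_content: Callable[[str], dict]) -> str:
    """Apply one change event to a local store keyed by content UUID.

    fetch_content is a hypothetical callable that retrieves the full content
    for a UUID from the content REST endpoint.
    """
    uuid = change["uuid"]
    if change["changeType"] == "DELETE":
        store.pop(uuid, None)  # remove the local copy if there is one
        return "deleted"
    current = store.get(uuid)
    # Assumption: an event whose version is not newer than the stored one
    # has already been handled and can be skipped.
    if current is not None and current["version"] >= change["version"]:
        return "skipped"
    # CREATE and UPDATE are handled the same way: (re)fetch and store.
    store[uuid] = {"version": change["version"], "data": fetch_content(uuid)}
    return "stored"
```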
Updated at July 27, 2024