Automatic deduplication job
Potential duplicates are identified in two steps:
- Candidate location: potential duplicates are located in the database using specific search criteria.
- Duplicate validation: potential duplicates are checked against a set of rules (different for every content type).
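The two steps above can be sketched as a small pipeline. This is only an illustration, not the job's actual implementation: the `Record` class, the DOI-based search criterion, and the `same_title` rule are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified content record used only for this sketch."""
    id: int
    content_type: str
    doi: Optional[str]
    title: str


Rule = Callable[[Record, Record], bool]


def locate_candidates(target: Record, database: List[Record]) -> List[Record]:
    """Step 1, candidate location: search by one criterion (here: same DOI)."""
    if target.doi is None:
        return []
    return [
        r for r in database
        if r.id != target.id
        and r.content_type == target.content_type
        and r.doi == target.doi
    ]


def same_title(a: Record, b: Record) -> bool:
    """Example validation rule (assumption): titles equal, ignoring case."""
    return a.title.casefold() == b.title.casefold()


def find_duplicates(target: Record, database: List[Record],
                    rules_by_type: Dict[str, List[Rule]]) -> List[Record]:
    """Step 2, duplicate validation: every rule for the content type must hold."""
    rules = rules_by_type[target.content_type]
    return [
        c for c in locate_candidates(target, database)
        if all(rule(target, c) for rule in rules)
    ]
```

Keeping the rules in a dictionary keyed by content type mirrors the statement that the validation rules differ for every content type.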
By default, a field is handled in the following manner:
Single/Multi-valued Fields:
- Merge Action: ADD
- If a duplicate of the target publication holds a value for a field that the target publication does not contain, the field value from the duplicate will be used
- Merge Action: REPLACE
- If a duplicate of the target publication holds a value for a field that the target publication also contains, the field value from the duplicate will be used
Multi-valued Fields:
- Merge Action: MERGE
- Merges values from a multi-valued field on a duplicate to the multi-valued field on the target
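As a rough sketch, the three merge actions could be expressed as one function. The function name, the notion of "empty", and the handling of cases the text does not describe (for example, REPLACE when the target has no value, or whether MERGE drops repeated values) are assumptions made for the example.

```python
def is_empty(value):
    """Assumption: None, an empty string and an empty list count as 'no value'."""
    return value is None or value == "" or value == []


def merge_field(action, target_value, duplicate_value):
    """Return the value the target holds after merging one field."""
    if action == "ADD":
        # The duplicate's value is used only when the target has none.
        if is_empty(target_value) and not is_empty(duplicate_value):
            return duplicate_value
        return target_value
    if action == "REPLACE":
        # When both hold a value, the duplicate's value is used.
        # Other cases are left unchanged here (assumption).
        if not is_empty(target_value) and not is_empty(duplicate_value):
            return duplicate_value
        return target_value
    if action == "MERGE":
        # Multi-valued fields: append the duplicate's values to the target's.
        # Skipping values that are already present is an assumption.
        merged = list(target_value or [])
        for item in duplicate_value or []:
            if item not in merged:
                merged.append(item)
        return merged
    raise ValueError(f"Unknown merge action: {action}")
```

For example, `merge_field("MERGE", ["a", "b"], ["b", "c"])` gives `["a", "b", "c"]` under these assumptions.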
Overview of automated deduplication job
The table below specifies the criteria for deduplication for all supported content types.
Search criteria: one of the listed elements has to be fulfilled.
| Content type | Criteria in the order listed. Location: search criteria (one has to be fulfilled). Validation: found candidates must match on (all have to be fulfilled). | Notes |
|---|---|---|
| Research output* | source/source ID combination*; Title (80 pct match); Subtitle (80 pct match) | One of the lines has to be fulfilled. |
| | DOI; Title (80 pct match); Subtitle (80 pct match) | |
| | Title (80 pct match); Subtitle (80 pct match); Number of pages; Persons (by name); Year; If found: | |
| Journal | at least one title | |
| Publisher | name; country | |
| Event | title; if found, then also: city and country; if found, then also: period | |
| External Organisation | name; type (ignored if one is "unknown"); country (ignored if both are null); subdivision (ignored if both are null); city (ignored if both are null); state (ignored if both are null) | |
| Person* | at least one first name; one last name; if found, then also: Scopus Id (must not contradict); Orcid Id (must not contradict) | |
| External Person | name; possible to configure: match all External organisations | |
| Activity* | Title (90 pct match), or if the title is generically generated: Description (90 pct match); Visibility; Type/template; Period; can be configured: Persons and organisations; if of a type where one Event, Publisher etc. can be chosen, there needs to be a match on those | See here for more details: Deduplication of Activities |
| Prize* | Title (90 pct match); Visibility; Type/template; Date; Persons; Organisations | See here for more details: Deduplication of Prizes |
| Application* | Title (90 pct match); Visibility; Type/template; Period; Persons; Organisations | See here for more details: |
| Award* | Title (90 pct match); Visibility; Type/template; Period; Persons; Organisations | See here for more details: Deduplication of Awards |
| Course* | Title (90 pct match); Visibility; Type/template; Period; Persons; Organisations | See here for more details: Deduplication of Courses |
| DataSet* | Type/template; Doi; if doi not found: Title (90 pct match); Description; Visibility; Person; Organisation | See here for more details: Deduplication of DataSets |
| Press/Media* | Title (90 pct match); Visibility; Type/template; Period; Persons; Organisation | See here for more details: Deduplication of Press/Media (clipping) |
| Projects | Title (90 pct match); Visibility; Type/template; Period; Persons; Organisation | See here for more details: Deduplication of Projects |
*Research output: Title and subtitle with a high similarity score.
*Person: The more recent person, based on employment dates and whether each person has active employments, is used as the target of the merge.
*Activity, Prize, Application, Award, Course, DataSet, Press/Media, Projects: The jobs are only available on request.
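The comparison rules named in the table ("80 pct match", "must not contradict", "ignored if both are null") can be illustrated with a few small helpers. This is a sketch under stated assumptions: the source does not say which similarity measure the job uses, so `difflib.SequenceMatcher` from the Python standard library stands in for it, and the reading of the two identifier and null rules is an interpretation of the table wording. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace before comparing (assumption)."""
    return " ".join(text.casefold().split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 (stand-in measure)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def titles_match(a, b, threshold=0.80):
    """'80 pct match': similarity at or above the threshold (use 0.90 for 90 pct)."""
    return similarity(a, b) >= threshold


def ids_do_not_contradict(a, b):
    """'Must not contradict': fails only if both IDs exist and differ."""
    return a is None or b is None or a == b


def optional_field_matches(a, b):
    """'Ignored if both are null': skipped when both are missing,
    otherwise the values must be equal (interpretation)."""
    if a is None and b is None:
        return True
    return a == b
```

With these helpers, two Person candidates where only one has an Orcid Id would still pass `ids_do_not_contradict`, while two different Orcid Ids would not.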
Detailed description
Research Output
Search Criteria
In order to determine whether a publication is a duplicate, we use three strategies.
Validation criteria
Configuration:
Updated at July 27, 2024