Pure Help Center for Pure Administrators

Automatic deduplication job

Potential duplicates are identified in two steps:

  1. Candidate location: potential duplicates are located in the database using specific search criteria.
  2. Duplicate validation: potential duplicates are checked against a set of rules (different for every content type).
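
The two steps can be pictured as a broad search followed by a strict filter. The sketch below is a minimal Python illustration, not Pure's implementation: the `Record` type, the two sample location criteria and the `rules` list are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, heavily simplified content record."""
    id: int
    doi: Optional[str]
    title: str


Rule = Callable[[Record, Record], bool]


def locate_candidates(target: Record, database: List[Record]) -> List[Record]:
    """Step 1, candidate location: one fulfilled search criterion is enough."""
    return [
        r for r in database
        if r.id != target.id
        and (
            (target.doi is not None and r.doi == target.doi)  # sample criterion: same DOI
            or r.title.casefold() == target.title.casefold()  # sample criterion: same title
        )
    ]


def validate(target: Record, candidate: Record, rules: List[Rule]) -> bool:
    """Step 2, duplicate validation: every rule for the content type must pass."""
    return all(rule(target, candidate) for rule in rules)


def find_duplicates(target: Record, database: List[Record], rules: List[Rule]) -> List[Record]:
    return [c for c in locate_candidates(target, database) if validate(target, c, rules)]
```

Keeping the two steps separate means the location step can be cheap and permissive, while the validation rules decide which candidates are actually treated as duplicates.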

By default, fields are handled as follows when duplicates are merged into the target:

Single/Multi-valued Fields:

  • Merge Action: ADD
    • If a duplicate of the target publication holds a value for a field that the target publication does not contain, the field value from the duplicate will be used
  • Merge Action: REPLACE
    • If a duplicate of the target publication holds a value for a field that the target publication also contains, the field value from the duplicate will be used

Multi-valued Fields:

  • Merge Action: MERGE
    • Merges values from a multi-valued field on a duplicate to the multi-valued field on the target
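
The three merge actions can be sketched as a single function. This is an illustrative Python sketch under stated assumptions, not Pure's code: `MergeAction` and `merge_field` are hypothetical names, a duplicate with no value is assumed to leave the target untouched, and MERGE is assumed to skip values the target already holds.

```python
from enum import Enum


class MergeAction(Enum):
    ADD = "add"
    REPLACE = "replace"
    MERGE = "merge"


def is_empty(value) -> bool:
    """Treat None, empty strings and empty lists as 'no value'."""
    return value is None or value == "" or value == []


def merge_field(target_value, duplicate_value, action: MergeAction):
    """Return the value the target field holds after the merge."""
    if is_empty(duplicate_value):
        # Assumption: a duplicate without a value never changes the target.
        return target_value
    if action is MergeAction.ADD:
        # Only fill the field when the target has nothing there.
        return duplicate_value if is_empty(target_value) else target_value
    if action is MergeAction.REPLACE:
        # The duplicate's value wins.
        return duplicate_value
    if action is MergeAction.MERGE:
        # Multi-valued fields: append the duplicate's values to the target's.
        merged = list(target_value or [])
        for value in duplicate_value:
            if value not in merged:  # assumption: no repeated values
                merged.append(value)
        return merged
    raise ValueError(f"Unknown merge action: {action}")


# Example: merge_field(["a", "b"], ["b", "c"], MergeAction.MERGE) returns ["a", "b", "c"]
```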

Overview of automated deduplication job


The overview below specifies the criteria for deduplication for all supported content types. For each content type:

  • Location: search criteria. One of the listed criteria has to be fulfilled for a candidate to be located.
  • Validation: found candidates must match on. All of the listed criteria have to be fulfilled, unless stated otherwise.

Research output*
  • Location:
    • Secondary sources
    • DOI
    • ISBN
    • Journal ISSNs
    • The aggregated title and publication year
    • Authors (same ID), publication year and pages
  • Validation (one of the lines has to be fulfilled):
    • Source/source ID combination*, Title (80 pct match; title + subtitle is also tried), Subtitle (80 pct match)
    • DOI, Title (80 pct match; title + subtitle is also tried), Subtitle (80 pct match)
    • Title (80 pct match), Subtitle (80 pct match), Number of pages (or the same from-to pages), Persons (by name), Year. If found: ISBN, Host, Journal, Publishers, Volume

Journal
  • Location:
    • ISSN
  • Validation:
    • At least one title

Publisher
  • Location:
    • Publisher name
  • Validation:
    • Name
    • Country

Event
  • Location:
    • Event title
  • Validation:
    • Title
    • If found, then also city and country
    • If found, then also period

External Organisation
  • Location:
    • Name, Country (if present, else ignored) and Subdivision (if present, else ignored)
  • Validation:
    • Name
    • Type (ignored if one is "unknown")
    • Country (ignored if both are null)
    • Subdivision (ignored if both are null)
    • City (ignored if both are null)
    • State (ignored if both are null)

Person*
  • Location:
    • Scopus ID
    • ORCID
  • Validation:
    • At least one first name
    • One last name
    • If found, then also Scopus ID (must not contradict)
    • If found, then also ORCID iD (must not contradict)

External Person
  • Location:
    • All names
  • Validation:
    • Name
    • Possible to configure: Match all External organisations

Activity*
  • Location:
    • Titles are similar and date is the same
    • Classified Id
    • Persons and titles
  • Validation:
    • Title (90 pct match)
    • If title generically generated: Description (90 pct match)
    • Visibility
    • Type/template
    • Period
    • Can be configured: Persons and organisations
    • If of a type where one Event, Publisher etc. can be chosen, there needs to be a match on those
  • See here for more details: Deduplication of Activities

Prize*
  • Location:
    • Titles are similar and year is the same
    • Classified Id
    • Persons are the same and titles are similar
  • Validation:
    • Title (90 pct match)
    • Visibility
    • Type/template
    • Date
    • Persons
    • Organisations
  • See here for more details: Deduplication of Prizes

Application*
  • Location:
    • Titles are similar and date is the same, or
    • Classified Id, or
    • Receivers and titles are the same
  • Validation:
    • Title (90 pct match)
    • Visibility
    • Type/template
    • Period
    • Persons
    • Organisations
  • See here for more details: Deduplication of Applications

Award*
  • Location:
    • Titles are similar and date is the same
    • Classified Id
    • Persons and titles are the same
  • Validation:
    • Title (90 pct match)
    • Visibility
    • Type/template
    • Period
    • Persons
    • Organisations
  • See here for more details: Deduplication of Awards

Course*
  • Location:
    • Titles are similar and date is the same
    • Classified Id
    • Persons and titles are the same
  • Validation:
    • Title (90 pct match)
    • Visibility
    • Type/template
    • Period
    • Persons
    • Organisations
  • See here for more details: Deduplication of Courses

DataSet*
  • Location:
    • Titles are similar and year is the same
    • Classified Id
    • Persons and titles are the same
  • Validation:
    • Type/template
    • DOI
    • If DOI not found: Title (90 pct match)
    • Description
    • Visibility
    • Person
    • Organisation
  • See here for more details: Deduplication of DataSets

Press/Media*
  • Location:
    • Titles are similar and date is the same
    • Persons and dates and title are the same
  • Validation:
    • Title (90 pct match)
    • Visibility
    • Type/template
    • Period
    • Persons
    • Organisation
  • See here for more details: Deduplication of Press/Media (clipping)

Projects
  • Location:
    • Titles are similar and date is the same
    • Classified Id
    • Persons and dates are the same
  • Validation:
    • Title (90 pct match)
    • Visibility
    • Type/template
    • Period
    • Persons
    • Organisation
  • See here for more details: Deduplication of Projects

*Research output: Title and subtitle with a high similarity score.
*Person: The more recent person, based on employment dates and whether each person has active employments, is used as the target of the merge.

*Activity, Prize, Application, Award, Course, DataSet, Press/Media, Projects: These jobs are only available on request.
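
Two qualifiers used in the validation criteria, "ignored if both are null" and "must not contradict", treat missing values differently. The Python sketch below shows one possible reading and is an assumption rather than documented behaviour: in particular, it assumes that "ignored if both are null" still counts a value present on only one side as a mismatch. Both function names are hypothetical.

```python
from typing import Optional


def matches_ignoring_both_null(a: Optional[str], b: Optional[str]) -> bool:
    """'Ignored if both are null': skip the field only when neither record has a value."""
    if a is None and b is None:
        return True
    # Assumption: a value on only one side is treated as a mismatch.
    return a == b


def does_not_contradict(a: Optional[str], b: Optional[str]) -> bool:
    """'Must not contradict': fail only when both values are present and differ."""
    if a is None or b is None:
        return True
    return a == b


# External organisation country: both null, so the field is ignored.
print(matches_ignoring_both_null(None, None))
# Person ORCID: only one record has an ORCID, so there is no contradiction.
print(does_not_contradict("0000-0002-1825-0097", None))
```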

 

Detailed description


Research Output

Search Criteria

In order to determine if a publication is a duplicate we use three strategies.

  • Source ID strategy
    • A publication is a duplicate if publication B has the same source ID as publication A
    • A publication is a duplicate if publication B has the same secondary source ID as publication A
  • DOI strategy
    • A publication is a duplicate if publication B contains a DOI contained by publication A 
  • Fallback strategy (all applicable criteria below must match before a publication is considered a duplicate; for example, the patent IPC and NUMBER match is only applied to the Patent type)
    • A publication is a duplicate if publication B matches publication A on title + subtitle (80% similarity)
    • A publication is a duplicate if publication B matches publication A on number of pages
    • A publication is a duplicate if publication B matches publication A on persons
    • A publication is a duplicate if publication B matches publication A on publication date (within a 1 year difference)
    • A publication is a duplicate if publication B matches publication A on pages
    • A publication is a duplicate if publication B matches publication A on ISBN
    • A publication is a duplicate if publication B matches publication A on host publication title
    • A publication is a duplicate if publication B matches publication A on journal association
    • A publication is a duplicate if publication B matches publication A on publisher association
    • A publication is a duplicate if publication B matches publication A on volume
    • A publication is a duplicate if publication B matches publication A on patent IPC and NUMBER
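
The three strategies can be read as a short-circuiting chain: source ID first, then DOI, then the fallback where all applicable criteria must agree. The Python sketch below is illustrative only. The `Publication` fields are hypothetical, only a few of the fallback criteria are modelled, and `difflib.SequenceMatcher` is a stand-in for the similarity measure, which the article does not specify.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Publication:
    """Hypothetical, simplified publication record."""
    title: str                                      # title + subtitle
    year: int
    source_ids: set = field(default_factory=set)    # (source, source ID) pairs, incl. secondary sources
    dois: set = field(default_factory=set)
    persons: frozenset = frozenset()
    pages: Optional[str] = None


def title_similarity(a: str, b: str) -> float:
    # Stand-in similarity measure; the real algorithm is not documented here.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def source_id_match(a: Publication, b: Publication) -> bool:
    return bool(a.source_ids & b.source_ids)


def doi_match(a: Publication, b: Publication) -> bool:
    # DOIs are compared case-insensitively.
    return bool({d.lower() for d in a.dois} & {d.lower() for d in b.dois})


def fallback_match(a: Publication, b: Publication) -> bool:
    checks = [
        title_similarity(a.title, b.title) >= 0.80,  # 80% similarity
        abs(a.year - b.year) <= 1,                   # within a 1 year difference
        a.persons == b.persons,
    ]
    # "Where applicable": only compare pages when both records have them (assumption).
    if a.pages is not None and b.pages is not None:
        checks.append(a.pages == b.pages)
    return all(checks)


def is_duplicate(a: Publication, b: Publication) -> bool:
    return source_id_match(a, b) or doi_match(a, b) or fallback_match(a, b)
```

The remaining fallback criteria (ISBN, host publication title, journal, publisher, volume, patent IPC and NUMBER) would be added to `checks` in the same conditional way.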

Validation criteria 



 

 

Configuration:

  • Configuration of workflow: Publications that have been merged end up in the selected workflow step, provided the required permission to move to that workflow step is present.

 

 

Published at April 03, 2024

Keywords
  • automatic job
  • duplicate removal

Copyright © 2025 Elsevier, except certain content provided by third parties.
