One of the challenges when adding new data to KB+ is matching each imported title against the titles already in the database, so that we don’t end up with multiple records for the same journal title.
While this is challenging, it is also a key way we can improve the data in KB+ as discrepancies between existing data and new data usually indicate an opportunity to improve things.
The most obvious thing to use to match titles is the ISSN (or eISSN), and this is the first thing we look for when doing matches. As well as ISSN and eISSN, we also check for matches on DOI where we have them.
Originally we thought this would be enough to say whether we had a good match or not. However, we quickly learnt that relying on identifiers alone wasn’t enough. A typical problem arose when a journal changed its title and was given a new ISSN. Some data files would contain the new (changed) title but still carry the old ISSN, so matching on the ISSN linked the record to the old title in KB+ rather than the new one.
We also saw examples where a title would appear with conflicting identifiers – i.e. the correct ISSN but an eISSN from a different journal.
So in order to catch these errors we now do multiple identifier matches and a title string match. If we don’t get a unique match based on all of these match points, then we don’t import the title until we have investigated the problems and resolved them.
The algorithm we use for the matching is:
- For each of the key identifiers (ISSN, eISSN, DOI, ISBN) look for a title match
- Do all the key identifiers we have in the file match to a single existing title?
- If we’ve matched multiple titles, report the error to the data manager and don’t load the file
- If we have a single title match, check the title string of the imported title against the title string of the matched title. So that small differences between the two strings don’t cause a false mismatch, create a ‘fingerprint’ for both titles before comparing, using the following method:
  - Replace all occurrences of ‘&’ with ‘and’
  - Remove all punctuation
  - Convert string to lowercase
  - Sort the words in the title into alphabetical order
  - Convert characters to nearest ASCII equivalent
  - This is a modified version of the OpenRefine ‘fingerprint’ mechanism. There is a Java implementation at https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java
- If the title ‘fingerprints’ don’t match, report the error to the data manager and don’t load the file
- If the title ‘fingerprints’ match then we can say with a high degree of confidence we have a good match and can proceed with the upload.
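The steps above can be sketched in Python roughly as follows. This is an illustrative sketch rather than the KB+ code: the names `fingerprint`, `match_title`, `index` and `titles`, the status strings, and the in-memory dictionaries standing in for the database are all assumptions. Two further choices belong to the sketch and not to the description above: it converts to ASCII before sorting (so accented and unaccented spellings sort the same way), and it ignores identifiers that match nothing, returning a separate `no_match` status when none of them match.

```python
import unicodedata


def fingerprint(title):
    """Build a comparison key for a title string (hypothetical helper)."""
    s = title.replace("&", " and ")
    # Remove every character in a Unicode punctuation category.
    s = "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))
    s = s.lower()
    # Nearest ASCII equivalent: decompose accented characters, then drop
    # anything that is still not ASCII (assumption: done before sorting).
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Sort the words so that word order does not affect the comparison.
    return " ".join(sorted(s.split()))


def match_title(record, index, titles):
    """Return (status, title_id) for one imported record.

    record -- {"title": str, "identifiers": {identifier type: value}}
    index  -- {(identifier type, value): title id}, a stand-in for the database lookup
    titles -- {title id: title string}
    """
    matched = {
        index[(id_type, value)]
        for id_type, value in record["identifiers"].items()
        if (id_type, value) in index
    }
    if not matched:
        # Assumption: the description above does not cover this case.
        return ("no_match", None)
    if len(matched) > 1:
        # Identifiers point at different titles: report and do not load.
        return ("multiple_titles", None)
    title_id = next(iter(matched))
    if fingerprint(record["title"]) != fingerprint(titles[title_id]):
        # Identifier matched, but the title strings disagree: report and do not load.
        return ("title_mismatch", title_id)
    return ("ok", title_id)


# Made-up example data, for illustration only.
titles = {1: "Journal of Old Things", 2: "Journal of New Things"}
index = {
    ("issn", "1234-5678"): 1,
    ("eissn", "8765-4321"): 1,
    ("issn", "1111-2222"): 2,
    ("eissn", "2222-1111"): 2,
}

good = {"title": "Journal of Old Things.", "identifiers": {"issn": "1234-5678"}}
stale_issn = {"title": "Journal of New Things", "identifiers": {"issn": "1234-5678"}}
mixed_ids = {
    "title": "Journal of Old Things",
    "identifiers": {"issn": "1234-5678", "eissn": "2222-1111"},
}
```

With this made-up data, `good` passes both checks, `stale_issn` reproduces the changed-title problem described earlier (the old ISSN matches title 1 but the fingerprints differ), and `mixed_ids` is rejected because its ISSN and eISSN belong to different titles.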
When an error is identified it can mean there is a problem with the file we are trying to upload, or it can mean there is a problem with the data already in KB+. In either case we use this information to investigate and correct any errors, and we feed back to the content providers/publishers where appropriate.
This mechanism means we avoid adding incorrect data to KB+, and we continually improve the data that is already in there.