Shortly after going live with Primo (Library Search) we disabled records from Eprints (Sussex Research Online) as they were of a poor standard and were confusing users.
There were a number of issues, but the key one was this: there was no way to go from the Primo record to the SRO record, or the full text, or any other link. A complete dead end. Frustrating to users: here is something you might be interested in but there’s no way of accessing it.
The de facto standard way of harvesting metadata from repositories is via OAI-PMH as the transport and Dublin Core as the metadata format. The problem is that the number (and scope) of available fields does not really meet the requirements of the data that needs to be shared.
Extracting the data
Let’s look at what we might want to exchange between the two systems in a structured way:
- Item type
- Bibliographic data (title, author, publication title (journal title), ISSN/ISBN, publication date, Issue number, Volume, etc)
- Whether the item is publicly available as Open Access
- Link to SRO record
- Link to Full text on SRO
- Link to publisher’s site etc
Now let’s look at the metadata as Eprints provides it for a typical record
<dc:title>Digital futures — an agenda for a sustainable digital economy</dc:title>
<dc:creator>Miller, Paul</dc:creator>
<dc:creator>Wilsdon, James</dc:creator>
<dc:subject>H Social Sciences (General)</dc:subject>
<dc:subject>J Political Science</dc:subject>
<dc:description>Despite the death of the dot-coms, there's no doubt that the digital revolution is reshaping the way we do business and relate to each other. Some would argue that this is happening with precious little thought for the environment or society at large, yet others say it has real potential to benefit both. Over the past year, Forum for the Future's Digital Futures project has examined the sustainability opportunities of e-business and the project's findings are summarised in this article.</dc:description>
<dc:publisher>Elsevier</dc:publisher>
<dc:date>2001-09</dc:date>
<dc:type>Article</dc:type>
<dc:type>PeerReviewed</dc:type>
<dc:format>application/pdf</dc:format>
<dc:identifier>http://sro.sussex.ac.uk/47853/1/Digital_Futures_%2D_An_Agenda_for_a_Sustainable_Digital_Economy.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1016/S1066-7938(01)00116-6</dc:relation>
<dc:relation>doi:10.1016/S1066-7938(01)00116-6</dc:relation>
<dc:identifier>Miller, Paul and Wilsdon, James (2001) Digital futures — an agenda for a sustainable digital economy. Corporate Environmental Strategy, 8 (3). pp. 275-280. ISSN 1066-7938</dc:identifier>
<dc:relation>http://sro.sussex.ac.uk/47853/</dc:relation>
Let’s go through extracting information from this in a logical way:
- The Good: The author(s) and title are easy: use dc:title and dc:creator. So are the publisher in dc:publisher (which isn’t always the most important piece of information to have, but still good to have it) and the abstract in dc:description.
- Subjects are ok. They are in dc:subject, but do start with the LoC classmark (in this example the very broad top-level H and J)
- The item type is in dc:type – but so is the PeerReviewed status. So to extract the item type, use the dc:type which doesn’t contain PeerReviewed or NonPeerReviewed. A little messy, but fine.
- The creation date is in dc:date
- Other bibliographic information is hard to extract. There is a citation in a dc:identifier field. The first hurdle here is that dc:identifier is used for more than just the citation, and unlike, say, a DOI where you can check if the field starts with ‘doi:’ there is no simple test to identify which identifier field is the citation. But even if we resolve this, parsing a citation is dubious at best. Citations were not designed to be parsed by computers: the delimiter – a comma – is used between different parts of the citation, between author names, and between an individual author’s surname and given-name/initials. At best you might be able to extract the ISSN by looking for the string “ISSN ” and then extracting the next nine characters (including the dash). But the rest is not really usable, other than quoting the citation as a whole (that is, as mentioned above, if you can identify the correct identifier field in the first place).
- The link to the original record on SRO, the doi and the link to the publisher’s site (often a doi url) are all dc:relation. To extract the link (and hence identifier) to SRO we need to find the dc:relation field which starts with the same host/domain name as our repository (in the example above http://sro.sussex.ac.uk/47853/ ). This is fine for what we are trying to do here. As an aside, if you were connecting to many repositories to harvest, the identifiers to the records may not match the hostname of the repository (they may use a cname/alias or resolver) – so this may not work more generally.
- The link to the publisher’s version (which may be the only way of obtaining the full text if no OA version on the repository) can be obtained by selecting the dc:relation field which (a) does not start with the hostname of the repository (b) does not start with ‘doi:’. The doi can be obtained by using the dc:relation field which starts with doi: – a very useful piece of data.
- And finally how to tell if this item is full text. Well, you can’t. You can look to see if there is a dc:format field; if there is, it is almost certainly describing the full text file attached, but the file may not be publicly available.
As an aside, I raised a question about the decisions behind using DC in this way on the JISC-REPOSITORIES mailing list. This produced quite an informative discussion.
In summary, the data requires a bit of magic, presumptions and dodgy guesses to extract the information we want, and some key information – such as if the record is OA – is not extract-able at all.
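To make the guesswork concrete, here is a minimal Python sketch of the item type and ISSN extraction described above. It is only an illustration, not what Primo does internally: the wrapping record element, the function names and the ISSN pattern are all my own assumptions, and the dash in the title is written as the character reference &#8212; so the sample stays plain ASCII.

```python
import re
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Part of the example record above, wrapped in a hypothetical <record>
# element so that it parses as a standalone XML document.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:type>Article</dc:type>
<dc:type>PeerReviewed</dc:type>
<dc:format>application/pdf</dc:format>
<dc:identifier>http://sro.sussex.ac.uk/47853/1/Digital_Futures_%2D_An_Agenda_for_a_Sustainable_Digital_Economy.pdf</dc:identifier>
<dc:identifier>Miller, Paul and Wilsdon, James (2001) Digital futures &#8212; an agenda for a sustainable digital economy. Corporate Environmental Strategy, 8 (3). pp. 275-280. ISSN 1066-7938</dc:identifier>
</record>"""

PEER_REVIEW_FLAGS = {"PeerReviewed", "NonPeerReviewed"}
# "ISSN " followed by nine characters: four digits, a dash, three digits
# and a final check character (a digit or X).
ISSN_PATTERN = re.compile(r"ISSN (\d{4}-\d{3}[\dXx])")


def values(record, field):
    """Return the text of every dc:<field> element in the record."""
    return [el.text or "" for el in record.iter(DC + field)]


def item_type(record):
    """The first dc:type that is not a peer-review flag, or None."""
    for value in values(record, "type"):
        if value not in PEER_REVIEW_FLAGS:
            return value
    return None


def issn(record):
    """The first ISSN found in any dc:identifier, or None."""
    for value in values(record, "identifier"):
        match = ISSN_PATTERN.search(value)
        if match:
            return match.group(1)
    return None


record = ET.fromstring(RECORD)
print(item_type(record))  # Article
print(issn(record))       # 1066-7938
# A dc:format tells us a file is attached, not that it is publicly available.
print(bool(values(record, "format")))
```

Note that nothing in this sketch can answer the Open Access question; the last line only shows that a file exists.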
How it looks
I mentioned at the top that the key issue was the records didn’t link anywhere, and a second major problem was that the records could not be filtered down to those with full text available.
But there were other issues; let’s look at the display to see some of them:
The URL to full text is hidden in the citation and isn’t a link (making the latter even more unreadable). Also shown below, ‘application/pdf’ is not very friendly.
Subjects start with a LoC classmark. Not only hard to read but means they can’t be merged with more common subject terms used by other data sources.
And as already said, all appear to have Online Access.
Under the hood
Primo stores metadata in what are called PNX records. The structure is a little different to how you would expect data to be structured in a traditional source record; various fields are duplicated for different purposes. For example there are sections of the PNX record called ‘display’, ‘search’ and ‘facets’, and each of these sections has its own Creator field, to allow the creators as displayed on screen to differ from those which are searched, and those displayed as facets. Though in most cases they are the same value.
To create a PNX record, Primo uses a set of normalisation rules unique to that data source (i.e. Eprints/SRO) to map fields in the import to those in the PNX records, with various rules to apply logic and transformations along the way. In practice most of the logic is set in creating the Display section, with other sections using the value of the corresponding display field.
Let’s take a look at a typical Eprints PNX record on Gist. (Note: you will need to scroll sideways to see the whole lines.)
And if that is too much to read, here is just the display and links section of the same record.
Notice how <identifier> has the url for the full text file and the citation, and relation contains a couple of links.
Making things better
There are a number of approaches to making things better for our users.
- Use a different export format. While Simple Dublin Core tends to be the default option for harvesting records from repositories via OAI-PMH it is not the only option. Out of the box our Eprints install also supports RDF, METS, rem atom, ‘context object‘, and a metadata profile specific to theses (each of the links goes to an example record).
- Modify how Eprints exports Dublin Core.
- Make the best of working with what we currently have, i.e. use the rules and transformations in Primo’s normalisation process to improve how the data is imported.
Option 1 would take some work: Primo does not support these formats out of the box, and it would leave us using quite a non-standard method for doing this.
Option 2 has legs, though it could cause problems for other third-party systems that harvest our data (though it might be possible to create a new profile – a copy of oai_dc profile – with the changes, leaving the original intact for other systems to use).
Option 3 seemed like the starting point with the least disruption. So I set to work using Primo’s normalisation rules editor. Which looks much more confusing than it actually is.
Perhaps to cut a long story short, I’ll show you what we have so far:
- We now have links on the right, to the source record on SRO, to the full text, and to the publisher’s version via the DOI.
- The creation date is now just the year, bringing it in line with other sources.
- If the full text file is PDF it now says “Format: PDF file” not “application/pdf”
- Subjects no longer have LoC classification numbers at the start.
- The URL for the full text and the citation are no longer merged together under ‘Identifier’. The full text url is not shown (no reason to) but used for the link on the right. The citation is now moved to a specific PNX citation field.
- The identifier field just shows the DOI (in url form)
- Source now shows as ‘Sussex Research Online’.
It’s not easy to share changes to the normalisation rules; however, the screenshots below should capture most changes.
Just use the year for date.
If dc:format is application/pdf then use “PDF file” for the format field.
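Outside Primo, these two rules amount to something like the Python below. This is only a sketch of the logic, since the real rules live in the normalisation editor; the function names and the lookup table are hypothetical, and only the PDF entry comes from our actual rules.

```python
import re

# Hypothetical lookup table; only the PDF entry reflects the rule above.
FORMAT_LABELS = {"application/pdf": "PDF file"}


def display_year(dc_date):
    """Return the leading four-digit year of a dc:date, else the value unchanged."""
    match = re.match(r"\d{4}", dc_date)
    return match.group(0) if match else dc_date


def display_format(dc_format):
    """Return a friendlier label for a known MIME type, else the value unchanged."""
    return FORMAT_LABELS.get(dc_format, dc_format)


print(display_year("2001-09"))            # 2001
print(display_format("application/pdf"))  # PDF file
```

Anything the functions do not recognise is passed through untouched, so an unexpected date or format still displays rather than vanishing.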
For the identifier (where we now want just the official publisher url, which is normally a doi), use the dc:relation field if it does not start with http://sro.sussex (i.e. the dc:relation field which contains the link to the SRO record) – note that Condition Logic 1 is set to False to make this a not – AND if it does start with http:// (to make sure we don’t select the dc:relation that contains the non-url form of the DOI) then use it as an identifier. Put another way, we take the multiple dc:relation fields and look for the one starting with a URL, but not a URL to SRO itself.
If the above results in nothing, then see if we have a dc:relation field that starts with ‘doi:’ and use that. Not convinced of the value of this yet, and it may be easier just to use the DOI if we have it.
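The two identifier rules above, written out as a Python sketch rather than as Primo conditions. The function name is hypothetical, and the prefix is the same http://sro.sussex string the rule tests against.

```python
SRO_PREFIX = "http://sro.sussex"


def publisher_identifier(relations):
    """Pick the publisher URL from the dc:relation values.

    First choice: a value that starts with http:// but does not point at SRO.
    Fallback: the non-url form of the DOI (a value starting with 'doi:').
    """
    for value in relations:
        if value.startswith("http://") and not value.startswith(SRO_PREFIX):
            return value
    for value in relations:
        if value.startswith("doi:"):
            return value
    return None


# The dc:relation values from the example record.
relations = [
    "http://dx.doi.org/10.1016/S1066-7938(01)00116-6",
    "doi:10.1016/S1066-7938(01)00116-6",
    "http://sro.sussex.ac.uk/47853/",
]
print(publisher_identifier(relations))
# http://dx.doi.org/10.1016/S1066-7938(01)00116-6
```

One thing the sketch makes obvious: a publisher link served over https:// would fail the first test and fall through to the ‘doi:’ fallback.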
For the Subjects field, use a regex to drop the characters before the first space (i.e. the classmark).
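The exact expression used in Primo is in the screenshot, so treat the pattern below as my assumption of an equivalent one, shown here in Python.

```python
import re


def strip_classmark(subject):
    """Drop everything up to and including the first run of whitespace."""
    return re.sub(r"^\S+\s+", "", subject)


print(strip_classmark("H Social Sciences (General)"))  # Social Sciences (General)
print(strip_classmark("J Political Science"))          # Political Science
```

The pattern assumes every subject starts with a classmark. A subject without one would lose its first word, although a single-word subject is left alone because there is no space to match.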
Make the source field a little nicer.
Set the PNX citation field (which isn’t used much, but as we had a citation we might as well put it in the right field). The citation is one of the dc:identifiers; the other dc:identifier is the url of the full text item on SRO, so ignore any dc:identifier starting with http (again, note that condition logic 1 is False). Of course, any citation starting with those four letters is doomed!
The other key section is the links section. Here we add the link to the Eprints record, and the full text, and, if we have a doi, to the publisher. (much of the logic here is just the reverse of the logic above)
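The citation rule and the full text link both depend on telling the two dc:identifier values apart. A rough Python equivalent of that split, with a hypothetical function name, looks like this:

```python
def split_identifiers(identifiers):
    """Return (full_text_url, citation) from the dc:identifier values.

    Anything starting with 'http' is treated as the full text url,
    anything else as the citation. The first of each kind wins.
    """
    full_text_url = None
    citation = None
    for value in identifiers:
        if value.startswith("http"):
            full_text_url = full_text_url or value
        else:
            citation = citation or value
    return full_text_url, citation


# The dc:identifier values from the example record (\u2014 is the dash in the title).
identifiers = [
    "http://sro.sussex.ac.uk/47853/1/Digital_Futures_%2D_An_Agenda_for_a_Sustainable_Digital_Economy.pdf",
    "Miller, Paul and Wilsdon, James (2001) Digital futures \u2014 an agenda for a "
    "sustainable digital economy. Corporate Environmental Strategy, 8 (3). "
    "pp. 275-280. ISSN 1066-7938",
]
url, citation = split_identifiers(identifiers)
print(url)
print(citation)
```

It shares the weakness already mentioned: a citation that happens to begin with ‘http’ would be mistaken for a link.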
In the Additional data section of the PNX, set the DOI if we have it. I’m not sure if this is used much, but it seems a good thing to set.
One final thing, we also now set facets/toplevel to online_resources, so these records show up when a search is restricted to full text online.
So that’s the Primo changes so far.
But you might have noticed one thing that is yet to be resolved: we need to know whether a record has the full text available or not.
The other problem described above is that Primo thinks every record is ‘available’ online. This is quite a problem, it’s a key feature of the service to know what users can access. Not that Primo is really to blame, the metadata really provides no firm clues as to whether the item is available OA from the repository or not.
My original plan was to come up with some hideous logic which somehow tried to guess if the item was online or not, and set this in the PNX record during normalisation.
It was when I was looking at UCL’s Primo and repository OAI-PMH interface (as you do) that I realised this was the wrong approach.
We really only wanted the records for those with full text available. A metadata only record – even if it is a metadata only record which identifies itself as such – is not much use. So the task therefore is not to correctly identify those with full text and those which are metadata only (most of our records are the latter), but to only load in those records with full text.
And to do that, we need to configure the source (Eprints) and not Primo. OAI-PMH has a concept of sets. You can see ours here. So can we set up a set which only contains records where the full text is publicly available? It turns out we already have. Our system already had a set called DRIVERset which did just this. For naming clarity I created an identical set but called greenoa.
Now we can set the pipe in Primo to use the greenoa set, as below.
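For anyone harvesting without Primo, restricting a harvest to a set is just an extra parameter on the OAI-PMH ListRecords request. The Python sketch below only builds the URL. Two assumptions to flag: the /cgi/oai2 path is the usual Eprints location rather than something confirmed here, and I am assuming the setSpec is literally greenoa (a ListSets request will show the real value).

```python
from urllib.parse import urlencode

# Assumption: Eprints usually serves OAI-PMH at /cgi/oai2; check your own install.
BASE_URL = "http://sro.sussex.ac.uk/cgi/oai2"


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request limited to one set."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return f"{base_url}?{query}"


print(list_records_url(BASE_URL, "greenoa"))
# http://sro.sussex.ac.uk/cgi/oai2?verb=ListRecords&metadataPrefix=oai_dc&set=greenoa
```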
And this allowed me to do something I have put off for a long time: add our SRO records to Primo Central. The ability to do this was announced over a year ago. However, at the time, after talking to them it became clear they would treat any record with a link to the full text on a publisher’s site as ‘Available’ to the user, when really it is nothing of the sort. I didn’t want people, across the world, finding SRO records in Primo Central and being frustrated at the lack of full text. No good for the user, for their library, for Ex Libris or for ourselves. So with this new set (well, really with my new knowledge of sets that already existed on our system) I was finally able to submit our repository to Primo Central.