Each year academic libraries in the UK spend time collecting a set of statistics known as the SCONUL Stats. SCONUL then collate the responses to produce a dataset for all UK Libraries, useful for comparison etc.

These stats now include a few questions about repositories, for example:

D13. Number of accesses to full-text items in the institutional repository during the year

D14. Number of accesses to bibliographic record items in the institutional repository during the year

D15. Total number of accesses to items in the institutional repository during the year = D13 + D14

In previous years I have taken D13 to mean the number of full text downloads, and D14 the number of hits to the bibliographic metadata pages.

However, the more detailed commentary for the questions says something different:

D13. Number of accesses to complete works in the institutional repository during the year

Only required from those libraries which manage the repository on behalf of their institution. Count the number of ‘hits’, not the number of downloads. This question relates to use of those items reported in C23b. If the breakdown is not available, please enter ‘n/k’ here and put the total figure in D15.

‘Hits, not downloads’ for complete works is somewhat contradictory: a hit to a PDF is the same as a download of a PDF. It seems what they are actually asking for is how many hits/page views we had to the bibliographic pages of items that we have the full text for (but not the hits to the full text item itself). This is a fairly odd statistic to ask for. Surely the real interest is in downloads (hits) of the full text. Knowing that someone looked at the metadata, without knowing whether they then went on to the item, and without knowing about the people who went straight to the full text and bypassed the metadata page (probably most), does not seem very useful.

D14. Number of accesses to bibliographic record items in the institutional repository during the year

Only required from those libraries which manage the repository on behalf of their institution. Count the number of ‘hits’, not the number of downloads. This question relates to use of those items reported in C24. If the breakdown is not available, please enter ‘n/k’ here and put the total figure in D15.

So, based on D13 at the top, this seems to be asking for the number of hits to bibliographic pages of items where we have no public full text document for that item. Hence D13 and D14 together will give the total views to our bibliographic metadata pages.

I’ve already questioned how useful this is. The second issue is that it is not easy to obtain this information. We are able to narrow down to hits to metadata pages, but here we are being asked to split that number in a way that no simple pattern or logic can provide.

For example, imagine ten records (items). Records 2, 3, 6, 8 and 9 have full text items, whereas 1, 4, 5, 7 and 10 do not. How do you separate the hits to the first group from the hits to the second?
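To make the problem concrete, here is a minimal Python sketch of the split being asked for. The hit counts are invented purely for illustration; only the record numbers come from the example above. The point is that the split needs a per-record lookup of whether full text exists, which a URL pattern alone cannot give you.

```python
# Hypothetical hit counts per record id (illustrative numbers only).
hits_per_record = {1: 12, 2: 40, 3: 7, 4: 3, 5: 19, 6: 25, 7: 8, 8: 14, 9: 30, 10: 5}

# Records that have a full text item attached, as in the example above.
has_full_text = {2, 3, 6, 8, 9}


def split_hits(hits, full_text_ids):
    """Return (hits to records with full text, hits to records without)."""
    with_ft = sum(n for rid, n in hits.items() if rid in full_text_ids)
    without_ft = sum(n for rid, n in hits.items() if rid not in full_text_ids)
    return with_ft, without_ft


d13, d14 = split_hits(hits_per_record, has_full_text)
d15 = d13 + d14  # the total is easy; the split is the hard part
print(d13, d14, d15)
```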

Total Bibliographic Views using Google Analytics

First, let’s try to find the total number of hits to bibliographic pages on SRO (our repository) using Google Analytics. It doesn’t provide the split we need above (only the D15 total), but it should give us what D13 and D14 add up to, so it is useful for reference. EPrints, the software used, simply uses the item’s id number as the URL path for each record. So http://sro.sussex.ac.uk/ is our base URL, and record one is http://sro.sussex.ac.uk/1/

Using Google Analytics we can use the Page View report (Content, Site Content, All Pages), select the date range, and then use an advanced filter. Just before the textfield that is displayed, we can use the drop-down box and select “matching regex”

In the text field paste in ^/[0-9]*/$

This will match any URL path that consists of a /, then a number, then a closing /, which is the form of a record page.
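As a quick check of what that pattern does and does not catch, here is a small sketch using Python’s `re` module as a stand-in for the Google Analytics filter (the two regex engines are not identical, but this pattern uses only basic syntax). The sample paths other than /1/ are made up for illustration.

```python
import re

# The filter pasted into Google Analytics, applied to the path part of the URL.
RECORD_PAGE = re.compile(r"^/[0-9]*/$")

# Sample paths: only "/1/" comes from the post, the rest are hypothetical.
paths = ["/1/", "/12345/", "/", "/1/thesis.pdf", "/view/year/", "//", "/cgi/search"]

matches = [p for p in paths if RECORD_PAGE.search(p)]
print(matches)
```

Because `*` allows zero digits, the path `//` also matches. Such a path is unlikely to occur in practice, but `^/[0-9]+/$` would be the stricter form.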

This shows 33,967 page views for the year. That is much lower than the numbers below, but we only installed Google Analytics late in the year, so it really only covers a little over two months.

IRStats

EPrints itself has a stats module. However, it is really aimed at full text items, and the user interface leaves a lot to be desired (and documented).

However it turns out the table that powers this service is quite useful. It keeps a record of every access to the bibliographic page or full text document. It records information you would normally find in a web log file, but also the item id number and whether it was the bibliographic or full text that was accessed.

Each row in this table carries the item id, and the table that holds information about full text items also holds the item id, as you would expect. So maybe we can get the information we need for SCONUL by counting all the bibliographic views logged in this table between the specified dates, and only counting those where the item id is also attached to a file in the file table (and hence the item has a full text file, as included in D13 and excluded from D14).

Let’s not run before we can walk. First we shall try to get the number of bibliographic views for the time period (i.e. D15, the total, which measures the same thing as the Google Analytics number above).

SELECT COUNT(*)
FROM access a
WHERE a.service_type_id = '?abstract=yes'
AND ((a.datestamp_year = 2012 AND a.datestamp_month < 8)
 OR (a.datestamp_year = 2011 AND a.datestamp_month > 7))

(There are actually a few more lines to remove bogus entries, but I shall come to that in a bit.) The service_type_id field contains either ?abstract=yes for a hit to a bibliographic page or ?fulltext=yes for a hit to a PDF or other full text file.

This gives a number of: 90,920
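The date logic is easy to get wrong by a month, so here is a minimal sketch that runs the same WHERE clause against a toy table using Python’s built-in `sqlite3`. The table is a cut-down stand-in with only the columns the query touches, the rows are invented, and the real EPrints database is MySQL rather than SQLite, so treat this as a check of the logic only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for the access table: only the columns the query uses.
conn.execute(
    """CREATE TABLE access (
           referent_id INTEGER, service_type_id TEXT,
           datestamp_year INTEGER, datestamp_month INTEGER)"""
)
rows = [
    (1, "?abstract=yes", 2011, 7),   # July 2011: before the window
    (1, "?abstract=yes", 2011, 8),   # August 2011: first month counted
    (2, "?fulltext=yes", 2011, 12),  # inside the window, but a full text hit
    (2, "?abstract=yes", 2012, 7),   # July 2012: last month counted
    (3, "?abstract=yes", 2012, 8),   # August 2012: after the window
]
conn.executemany("INSERT INTO access VALUES (?, ?, ?, ?)", rows)

QUERY = """
SELECT COUNT(*) FROM access a
WHERE a.service_type_id = '?abstract=yes'
AND ((a.datestamp_year = 2012 AND a.datestamp_month < 8)
  OR (a.datestamp_year = 2011 AND a.datestamp_month > 7))
"""
abstract_hits = conn.execute(QUERY).fetchone()[0]
print(abstract_hits)
```

Only the August 2011 and July 2012 abstract hits are counted, i.e. the window runs from 1 August 2011 to 31 July 2012.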

Let us try and select the hits to bibliographic pages where the item has the full text (publicly available):

SELECT COUNT(*)
FROM access a
JOIN eprint e ON e.eprintid = a.referent_id
JOIN document d ON d.eprintid = e.eprintid
WHERE a.service_type_id = '?abstract=yes'
AND ((a.datestamp_year = 2012 AND a.datestamp_month < 8)
 OR (a.datestamp_year = 2011 AND a.datestamp_month > 7))
AND d.format LIKE 'application%'
AND d.security = 'public'

This gives us: 25,803

Here, we count all rows which are hits to the bibliographic page (?abstract=yes), where there exists a row in the document table (i.e. full text) which is publicly available and is actually a full text item (format = application%), not a thumbnail or similar.

This leaves 65,117 (90,920 – 25,803) hits to bibliographic pages for items with no full text.
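One thing worth checking with a join like this: if an item has more than one public document attached, each hit to its bibliographic page is counted once per document. Whether that affects the figures above depends on the data, which I can’t verify here. The sketch below uses Python’s `sqlite3` with invented rows and cut-down stand-in tables to show the effect, and an EXISTS variant that counts each hit once. The `staffonly` value is just an example of a non-public security setting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE access (referent_id INTEGER, service_type_id TEXT);
    CREATE TABLE document (eprintid INTEGER, format TEXT, security TEXT);
    -- Item 1 has two public PDFs, item 2 has none, item 3 has a restricted PDF.
    INSERT INTO document VALUES (1, 'application/pdf', 'public');
    INSERT INTO document VALUES (1, 'application/pdf', 'public');
    INSERT INTO document VALUES (3, 'application/pdf', 'staffonly');
    -- One bibliographic page hit for each item.
    INSERT INTO access VALUES (1, '?abstract=yes');
    INSERT INTO access VALUES (2, '?abstract=yes');
    INSERT INTO access VALUES (3, '?abstract=yes');
    """
)

# JOIN version: the single hit to item 1 is counted twice, once per document.
join_count = conn.execute(
    """SELECT COUNT(*) FROM access a
       JOIN document d ON d.eprintid = a.referent_id
       WHERE a.service_type_id = '?abstract=yes'
       AND d.format LIKE 'application%' AND d.security = 'public'"""
).fetchone()[0]

# EXISTS version: each hit is counted once if any qualifying document exists.
exists_count = conn.execute(
    """SELECT COUNT(*) FROM access a
       WHERE a.service_type_id = '?abstract=yes'
       AND EXISTS (SELECT 1 FROM document d
                   WHERE d.eprintid = a.referent_id
                   AND d.format LIKE 'application%' AND d.security = 'public')"""
).fetchone()[0]

total = conn.execute(
    "SELECT COUNT(*) FROM access WHERE service_type_id = '?abstract=yes'"
).fetchone()[0]

print(join_count, exists_count, total - exists_count)
```

With the EXISTS form, the hits with full text and the hits without always add up to the total, which is what the subtraction above relies on.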

I mentioned above that I left out a few lines of SQL for conciseness. These lines try to ignore requests from robots and crawlers:

SELECT COUNT(*)
FROM access a
JOIN eprint e ON e.eprintid = a.referent_id
JOIN document d ON d.eprintid = e.eprintid
WHERE a.service_type_id = '?abstract=yes'
AND ((a.datestamp_year = 2012 AND a.datestamp_month < 8)
 OR (a.datestamp_year = 2011 AND a.datestamp_month > 7))
AND d.format LIKE 'application%'
AND d.security = 'public'
AND a.requester_user_agent NOT LIKE '%Baiduspider%'
AND a.requester_user_agent NOT LIKE '%scirus%'
AND a.requester_user_agent NOT LIKE '%robot%'
AND a.requester_user_agent NOT LIKE '%spider%'
AND a.requester_user_agent NOT LIKE 'Java%'
AND a.requester_user_agent NOT LIKE 'ia_archiver%'
AND a.requester_user_agent NOT LIKE 'HttpComponents%'
AND a.requester_user_agent NOT LIKE '%crawler%'
AND a.requester_user_agent NOT LIKE '%Slurp%'
AND a.requester_user_agent NOT LIKE '%bot%'
AND a.requester_user_agent NOT LIKE '%Bot%'
AND a.requester_user_agent NOT LIKE 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2) Firefox/3.5.2'
AND a.requester_user_agent NOT LIKE 'panscient.com'
;

I think this gets most of them. I really wish crawlers and robots used a common unique string in the user_agent field to easily identify them.
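If the access rows were exported and filtered outside the database, the same exclusion list could be sketched in Python as below. The patterns are lifted from the NOT LIKE clauses above; the function name and the sample user agents are made up for illustration. Matching is done on lower-cased strings, which is a choice on my part: whether LIKE is case-sensitive in MySQL depends on the column’s collation.

```python
# Patterns from the NOT LIKE '%...%' clauses, lower-cased for matching.
BOT_SUBSTRINGS = ("baiduspider", "scirus", "robot", "spider", "crawler", "slurp", "bot")
# Patterns from the NOT LIKE '...%' clauses (prefix matches).
BOT_PREFIXES = ("java", "ia_archiver", "httpcomponents")
# Patterns with no wildcard at all (whole-string matches).
BOT_EXACT = (
    "mozilla/5.0 (windows; u; windows nt 5.1; zh-cn; rv:1.9.1.2) firefox/3.5.2",
    "panscient.com",
)


def looks_like_bot(user_agent):
    """Return True if the user agent matches any of the exclusion patterns."""
    ua = user_agent.lower()
    return (
        ua in BOT_EXACT
        or ua.startswith(BOT_PREFIXES)
        or any(s in ua for s in BOT_SUBSTRINGS)
    )


# Hypothetical sample of user agent strings.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Java/1.6.0_26",
    "panscient.com",
    "Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0",
]
human_hits = [ua for ua in samples if not looks_like_bot(ua)]
print(len(human_hits))
```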

This provides the answers to the SCONUL stats questions.

3 Comments

  1. Rory McNicholl

    Hello Chris,

    I too find it difficult to see how such stats would be useful, especially given that this would ignore direct accesses to a pdf via a search engine results page (for one example).

    Thanks for posting the SQL queries, I may well use those. However I felt I should add that as far as I can see the tables queried are part of the EPrints core database structure. So IRStats is *not* a pre-requisite for getting those stats. That may or may not save someone a lot of trouble.

    Cheers,

    Rory

    • Hi Rory

      Thanks for the comment.
      Didn’t realise access was a core table.

      But note that this year the questions have changed, for the better. They only seem interested in the number of full text items and their downloads (I think), which is more encouraging.

      • Rory McNicholl

        Yep, it’s the reason that you can install IRstats on a repository that’s been running a while and it will have all those old download stats.

        Thanks for the heads-up re SCONUL. I’ll pass that on to the person who directed me here in the first place 😉
