I was pulling my hair out trying to get some MARC files ready for import into a system (actually it was a pre-import script I was using for VuFind).

The files held around 90 thousand records, but only a subset of these would import.

My first step was to determine whether these files really did include all the records I thought they did. For this I used a trusty set of simple Perl scripts which use the CPAN module MARC::Record to display and count records.

These were only showing a thousand-odd records before quitting. The problem here was that I hadn't called '$batch->strict_off();' before entering the main loop that iterates through each record. But even then the script would only get so far before quitting on an error, in fact at the same place (the same record count) as the import process. Unlike the import, though, Perl was giving me an error message:

utf8 "\x81" does not map to Unicode at /usr/lib/perl/5.10/Encode.pm line 174

I get a little lost with Unicode, but that was clearly the cause. And of course, MARC is a binary format, so I couldn't simply convert the file to a different text format.
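Because each record in a binary MARC file ends with a record terminator byte (0x1D), one way to cross-check the record count without decoding any characters is to count those terminators directly. The following is only a minimal sketch using core Perl; the sub name count_raw_records is hypothetical, and it assumes the file is a plain ISO 2709 / MARC 21 file rather than MARCXML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count records by reading raw bytes up to each MARC record terminator (0x1D).
# No character decoding happens, so odd bytes cannot make the count fail.
# (count_raw_records is a hypothetical helper name, not part of MARC::Record.)
sub count_raw_records {
    my ($path) = @_;
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    local $/ = "\x1D";    # read one record at a time
    my $count = 0;
    while ( my $raw = <$fh> ) {
        # ignore any trailing bytes that are not closed by a terminator
        $count++ if substr( $raw, -1 ) eq "\x1D";
    }
    close $fh;
    return $count;
}

# Usage: perl count_raw.pl records.mrc
print count_raw_records( $ARGV[0] ), "\n" if @ARGV;
```

If this number matches the expected total while a MARC::Record based script stops early, the records are all present and the problem lies in how they are being decoded.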

The fact that even my counting script was causing problems showed me this was something on the MARC::Record side and not in my code.

The counting script looks basically like this; you can't get much simpler.

use MARC::Batch;
my $recordcount = 0;
my $batch = MARC::Batch->new( 'USMARC', $ARGV[0] );
while ( my $marc = $batch->next ) {
    $recordcount++;
    print "$recordcount\n";    # running count, one per line
}

This had all been on a Linux box, and I went back to MarcEdit to investigate. Sure enough, there was a non-standard character (at least to my UK English eyes!) in one of the records: Ørskov, E. R. (Egil Robert), 1934-.

Looking around in MarcEdit I found a somewhat hidden option. Under the Tools menu is a 'batch process records' option, and one of its functions is 'Character Conversions'. I wasn't exactly sure what I was dealing with, but tried MARC8 to UTF8.

Trying again, the pre-import script processed all 90 thousand records!
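One way to take some of the guesswork out of choosing a conversion: MARC 21 records declare their character encoding in position 09 of the leader, where a blank means MARC-8 and 'a' means UCS/Unicode. The sketch below tallies that flag across a file using core Perl only. The sub name leader_encodings is hypothetical, it assumes records follow one another with no padding in between, and the flag only says what a record claims to be, which may not match the bytes it actually contains.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally the character coding flag (leader position 09) of every record.
# (leader_encodings is a hypothetical helper name.)
sub leader_encodings {
    my ($path) = @_;
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    local $/ = "\x1D";    # MARC record terminator
    my %seen;
    while ( my $raw = <$fh> ) {
        next if length($raw) < 24;    # too short to hold a 24-byte leader
        my $flag  = substr( $raw, 9, 1 );
        my $label = $flag eq 'a' ? 'UTF-8'
                  : $flag eq ' ' ? 'MARC-8'
                  :                "other ($flag)";
        $seen{$label}++;
    }
    close $fh;
    return \%seen;
}

# Usage: perl leader_check.pl records.mrc
if (@ARGV) {
    my $seen = leader_encodings( $ARGV[0] );
    printf "%-12s %d\n", $_, $seen->{$_} for sort keys %$seen;
}
```

A file whose records are all flagged MARC-8 is a reasonable candidate for the MARC8 to UTF8 conversion; a mixed or unexpected tally suggests looking more closely before converting.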

However my simple tools were still not playing ball. 

The answer came in the bug tracking list of the MARC::Record CPAN package.

The following script will correctly count the number of records in the file (a somewhat important aspect of a script whose sole job is counting records):

#!/usr/bin/perl
##
use strict;
use warnings;
use MARC::Record;
use MARC::Batch;
use Encode qw(encode decode);
use utf8;
use open qw( :std :utf8);
use IO::File;

# don't let MARC::Batch open the file, as it applies the ':utf8' IO layer
my $fh = IO::File->new( $ARGV[0] ) or die "Cannot open $ARGV[0]: $!";
my $batch = MARC::Batch->new( 'USMARC', $fh );
$batch->warnings_off();
$batch->strict_off();
my $i = 0;

RECORD: while (1) {
    my $record;
    # get records
    eval { $record = $batch->next() };
    if ( $@ ) {
        print "Bad MARC record: skipped\n";
        next;
    }
    # next() returns nothing once the file is exhausted, so stop here
    last unless ( $record );
    $i++;
}
print "$i\n";

Additional code can be added where the ‘$i++’ sits near the end of the script to perform any tasks required on the records.
