Time again to highlight some great conversations from our listserv, with tips and tricks that you might want for your Islandora:
In a nutshell, it lets you put certain objects behind a login wall without suppressing them from searching and browsing. When a user tries to access one of these objects, they get an explanation of why the object is restricted and a prompt to register an account with the extra information that copyright holders or grant funders would need to know. All of that data is stored for later viewing on the back end. There are also options for whitelisted IP ranges and automatic deletion of these users. It could be useful in a digital archive where copyright is an issue, or in an IR that holds sensitive data.
I manage PDF/A files (scanned PDF + OCR text) as books, and I use these steps:
- pdftk + ImageMagick to generate TIFFs, one file per page
- docsplit utility to extract text from the PDF pages, one file per page
- prepare the directory structure needed by book batch ingest (one directory per page with OBJ.tif, OCR.txt, DC.xml, ...)
- batch ingest (see the Islandora book ingest module)
While OCR.txt is indexed by Solr and used by the simple or advanced search block, IA uses the HOCR datastream, which at the moment is generated by Tesseract during derivative generation at ingest. I searched but didn't find any way to generate HOCR directly from PDF/A,
so I have full-text search based on the OCR datastream while IA search is based on the HOCR datastream. At the moment this is OK for me.
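The directory-preparation step in that workflow can be sketched in a few lines of shell. This is only an illustration, not the poster's actual script: the function name layout_book_pages and the page_NNNN.tif / page_NNNN.txt input naming are assumptions, and DC.xml is left out. It expects the per-page TIFF and text files to already exist (from pdftk, ImageMagick, and docsplit) and copies them into one numbered directory per page.

```shell
#!/bin/sh
# Hypothetical helper: arrange per-page files into one directory per page.
# Assumes inputs are named page_0001.tif, page_0001.txt, ... (zero-padded,
# so that the shell glob returns them in page order).
layout_book_pages() {
    src_dir=$1     # directory holding the per-page TIFF and text files
    book_dir=$2    # book directory to fill with page subdirectories
    seq=0
    for tif in "$src_dir"/page_*.tif; do
        [ -e "$tif" ] || continue          # glob matched nothing
        seq=$((seq + 1))
        base=${tif%.tif}
        page_dir="$book_dir/$seq"
        mkdir -p "$page_dir"
        cp "$tif" "$page_dir/OBJ.tif"
        # Copy the extracted text only when it exists for this page.
        if [ -e "$base.txt" ]; then
            cp "$base.txt" "$page_dir/OCR.txt"
        fi
    done
    echo "$seq"    # number of page directories created
}
```

Calling layout_book_pages ./pages ./mybook (both paths are placeholders) would leave ./mybook/1/OBJ.tif, ./mybook/1/OCR.txt, and so on, ready to be pointed at by the batch ingest.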
And finally, a stop on the islandora-dev listserv, with an update to a question from back in April, when SFU's Mark Jordan asked:
Has anybody tried running multiple (say 2 or 3) Islandora Batch loads via drush at one time? Or would that be a Dumb Thing To Do? Would love to hear if anyone has any experience.
Back then, UNCC's Brad Spry noted that it could cause ingest failures but that there was a possible solution he would explore. Last week, he updated the community with his work and some promising prospects:
I've been working to implement a cool server-side book batch pre-processing workflow this week, so I've been working on our nifty ingest scripts.
I ran into the "collision" issue I wrote about previously... After wrestling with it for days, my current theory is that the issues are caused by multiple simultaneous or near-simultaneous executions of islandora_batch_scan_preprocess.
I have one error documented so far:<ASSERT>Datastream must have a datastream id. (foxml:datastream: value of ID is missing)</ASSERT>
The cause of that error is still eluding me; I've even been disassembling BLOBs created by islandora_batch_scan_preprocess in search of answers :-)
I had to keep moving forward though, so I implemented a locking mechanism and precision set ingest. All of my ingest-ready objects and related directories pass through here:
batch_set_id=$(/usr/local/bin/drush -c /usr/local/drush/drushrc.php -v --user=user --uri=https://server islandora_batch_scan_preprocess --namespace=$1 --content_models=$2 --parent=$3 --parent_relationship_pred=isMemberOfCollection --type=directory --target=$4 2>&1 | sed -E '/^SetId:/! d; s/^SetId: ([0-9]+).*/\1/')
#ready_for_ingest
/usr/local/bin/drush -c /usr/local/drush/drushrc.php vset islandora_bagit_create_on_modify '0'
/usr/local/bin/drush -c /usr/local/drush/drushrc.php -v --user=user --uri=https://server islandora_batch_ingest --ingest_set=$batch_set_id >> /mnt/islandora-loadingdock/ingest_log/ingest.log
/usr/local/bin/drush -c /usr/local/drush/drushrc.php vset islandora_bagit_create_on_modify '1'
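To make the "precision set ingest" part easier to follow, here is a small sketch of what the sed filter in the first command does on its own. The SetId: line format is taken from the sed expression itself. The other sample lines and the function name extract_set_id are invented for illustration and are not real drush output.

```shell
#!/bin/sh
# Same sed filter as in the preprocess command: drop every line that does
# not start with "SetId:", then reduce the remaining line to its number.
extract_set_id() {
    sed -E '/^SetId:/! d; s/^SetId: ([0-9]+).*/\1/'
}

# Invented sample input; only the "SetId: <number>" line matters.
printf 'Scanning target directory\nSetId: 42\nFinished\n' | extract_set_id
```

Only the numeric set ID comes out, and that is what gets passed to --ingest_set, so the later ingest touches only the objects preprocessed in this run. If no SetId: line appears, the result is empty, so a script built this way may want to check for that before ingesting.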
After implementing the locking mechanism and precision set ingest, I've seen no "collisions". My testbed has been 2-3 books, audio, and images all trying to ingest simultaneously. I no longer allow them to fight each other; objects now form a single-file line.
I intend to keep pushing it and see how it holds up!
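Brad's message does not show the locking mechanism itself, so the following is only one possible sketch of the idea, not his implementation. It uses mkdir as an atomic lock so that only one ingest run proceeds at a time. The lock path and the function name with_ingest_lock are assumptions.

```shell
#!/bin/sh
# Hypothetical lock location; any path on a local filesystem would do.
LOCK_DIR=${LOCK_DIR:-/tmp/islandora_ingest.lock}

# Run the given command while holding the lock, waiting until it is free.
with_ingest_lock() {
    # mkdir is atomic: only one caller can create the directory, so every
    # other caller keeps retrying until the current holder removes it.
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        sleep 1
    done
    "$@"
    rc=$?
    rmdir "$LOCK_DIR"    # release the lock for the next waiting caller
    return "$rc"
}
```

A wrapper like this could be used as with_ingest_lock ./ingest.sh followed by the usual arguments, where ingest.sh is a placeholder for a script containing the preprocess and ingest commands. One known limitation of this simple form: if the process is killed mid-run, the lock directory is left behind and has to be removed by hand.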