Automatic indexing of digital libraries

Context

The ULB Libraries are the technical coordinator of a European project called NEEO (Network of European Economists Online), which aims to aggregate economics content, including publications and datasets, from the repositories of about 20 institutions with a high reputation in economics research. NEEO wants to bring this information to the user community through various added-value services, including a feature-rich portal that permits searching the metadata and full text of the publications and links to the publication texts in the repositories.

Every publication is described through an XML structure according to a specific DIDL application profile, consisting mainly of descriptive metadata (title, authors, abstract, keywords, etc.) and information on the files constituting the publication (references to the files, version, accessibility restrictions, etc.). All these DIDL records are stored and indexed in a so-called “metadata store”, which is implemented with the open source Lucene search engine. This store is made available to the outside world through, among other interfaces, an SRU-based Web service.

NEEO also addresses the problem of automated enrichment of the original metadata. This is done through so-called enrichment servers, which harvest DIDL records from the metadata store, pull these through a specific enrichment procedure, and make the result of the enrichment available to the metadata store over OAI-PMH as DIDL-formatted XML structures. The search engine can then aggregate the original and enriched metadata and present the combined result to the end user through its interfaces.
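As an illustration of the OAI-PMH exchange described above, the following minimal sketch builds the request URLs a harvester could send to an enrichment server. The verb and argument names (ListRecords, metadataPrefix, from, resumptionToken) come from the OAI-PMH protocol; the class name, the base URL and the "didl" metadata prefix are assumptions made for the example and are not taken from the NEEO setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds OAI-PMH ListRecords request URLs.
final class OaiPmhRequest {

    // First request of a harvest. "from" is optional (pass null to harvest everything).
    static String listRecords(String baseUrl, String metadataPrefix, String from) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("verb", "ListRecords");
        params.put("metadataPrefix", metadataPrefix);
        if (from != null) {
            params.put("from", from);
        }
        return baseUrl + "?" + encode(params);
    }

    // Follow-up request. In OAI-PMH, resumptionToken is an exclusive argument,
    // so it is sent with the verb only.
    static String resume(String baseUrl, String resumptionToken) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("verb", "ListRecords");
        params.put("resumptionToken", resumptionToken);
        return baseUrl + "?" + encode(params);
    }

    // URL-encodes the values and joins the pairs in insertion order.
    private static String encode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Assumed endpoint and metadata prefix, for illustration only.
        System.out.println(listRecords("http://example.org/oai", "didl", "2009-01-01"));
    }
}
```

The harvester would repeat the resume request until the server returns an empty resumption token, which signals the end of the list.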

At this moment NEEO is running an experimental setup of such an enrichment server which automatically categorizes publications in economics according to the JEL ontology, and which extracts the references from these publications. These enrichments are “calculated” using text mining technologies.

The ULB Libraries have a similar setup of metadata store, search engine and enrichment server, called DI-fusion. The objective is to implement an enrichment server that extracts the full text from PDF publications and makes it available to DI-fusion, which will in turn be able to build a full-text index. Many PDF publications are only available as scanned images and therefore need to go through a preliminary OCR process.

Task

The task is to build an enrichment service that extracts text from PDF publications. This consists of:

a literature study of available open source OCR software packages
a comparative study of some of these packages; this includes
- installation of the software on a server of the libraries
- testing the software on a set of publications
- reporting on the results (performance, correctness, etc.)
setup of an enrichment service which extracts text from PDF publications; if the PDF file is a scanned image, OCR needs to be performed in a preliminary step; the extracted text needs to be made available as DIDL XML structures, according to a well-defined XML Schema
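
The decision step in the last item (use the embedded text if the PDF has any, otherwise fall back to OCR) could be sketched as follows. This is only one possible heuristic: the class name, the threshold of 50 visible characters per page, and the two extractor functions are hypothetical placeholders. In a real service they would be backed by a PDF library and by the OCR package selected in the comparative study.

```java
import java.util.function.Function;

// Hypothetical sketch of the "embedded text or OCR" decision.
final class TextEnrichment {

    // Assumed threshold: fewer visible characters per page than this
    // suggests that the PDF contains scanned images only.
    static final int MIN_CHARS_PER_PAGE = 50;

    // Returns true when the embedded text is too sparse to be a real text layer.
    static boolean needsOcr(String embeddedText, int pageCount) {
        long visible = embeddedText == null
                ? 0
                : embeddedText.chars().filter(c -> !Character.isWhitespace(c)).count();
        return visible < (long) MIN_CHARS_PER_PAGE * Math.max(pageCount, 1);
    }

    // Tries the embedded text first and falls back to OCR when it is too sparse.
    // Both extractors are placeholders supplied by the caller.
    static String extract(byte[] pdf, int pageCount,
                          Function<byte[], String> embeddedTextExtractor,
                          Function<byte[], String> ocrExtractor) {
        String text = embeddedTextExtractor.apply(pdf);
        return needsOcr(text, pageCount) ? ocrExtractor.apply(pdf) : text;
    }

    public static void main(String[] args) {
        // Stub extractors stand in for a PDF library and an OCR engine.
        String result = extract(new byte[0], 1, pdf -> "", pdf -> "text recovered by OCR");
        System.out.println(result);
    }
}
```

Passing the extractors in as functions keeps the decision logic independent of the OCR package, which makes it easier to swap packages during the comparative study.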

This work can be accomplished within a timeframe of 3 months.

Dissemination of results

The result of this work will be used in the DI-fusion search engine of the ULB libraries, in the Bictel catalog of e-theses of the French-speaking university community of Belgium, and in the European Economists Online service.

Profile

Team spirit: this work needs to be done in close collaboration with the IT department of the ULB Libraries
Good knowledge of XML and the Java programming language is a prerequisite
Reporting must be done in English

If this experience proves to be a positive one (for both parties), there may be opportunities for subsequent employment within the IT department of the ULB Libraries.
