OpenWGA 7.4 - OpenWGA Concepts and Features
Indexable file types
The ability to add the contents of file attachments to the fulltext index is part of the OpenWGA Enterprise Edition.
Indexing of file attachments is based on the Adobe Portable Document Format, commonly known as PDF. The OpenWGA Enterprise Edition can index the contents of PDF files out of the box. All other file types that should be indexed must first be converted to PDF.
Customers of the Enterprise Edition can install the OpenWGA unoconv plugin for this purpose, which is available on the IG customer service portal. This plugin uses a locally installed Unoconv service, which is essentially a headless instance of the open source office suite LibreOffice. With it, OpenWGA can convert to PDF every file that LibreOffice can read, which includes various file formats of Libre/OpenOffice, Microsoft Office and other office suites. For a complete list, see the homepage of the project.
Installing unoconv is beyond the scope of this documentation. Please use the documentation on the Unoconv website for this task. If you are using a Linux operating system, chances are that unoconv is already available as a package in the package repository of your distribution. You can either just install it or additionally configure it to run as a daemon in the background. The latter has the advantage that OpenWGA does not need to start a separate office instance for every conversion task and therefore should perform better.
Once unoconv and the accompanying OpenWGA plugin are installed on your server, the Lucene indexer can use them for PDF conversion right away.
Alternatively, you can extend OpenWGA by writing your own "custom file to PDF" converter module. Write an OpenWGA Java module of module type "de.innovationgate.enterprise.modules.dms.PDFConversionServiceModuleType". Your module must implement the simple interface "de.innovationgate.enterprise.dms.PDFConversionService", whose method toPDF() receives an input stream of the file to convert and must return an input stream of the converted PDF. The method getSupportedMimeTypes() identifies the MIME types for which OpenWGA will use this converter:
public interface PDFConversionService {
    public Set<String> getSupportedMimeTypes();
    public InputStream toPDF(InputStream in) throws IOException;
}
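To illustrate the shape of such a converter, here is a minimal sketch of an implementation that delegates the conversion to an external command-line tool. Several things in it are assumptions and not part of OpenWGA: the class name CommandLinePdfConverter, the helper buildCommand(), the chosen MIME types, and the unoconv-style invocation "-f pdf --stdout <file>". The interface is redeclared locally only so that the example compiles on its own; a real module would implement the interface shipped with the Enterprise Edition.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Local stand-in for de.innovationgate.enterprise.dms.PDFConversionService,
// declared here only to keep the sketch self-contained.
interface PDFConversionService {
    Set<String> getSupportedMimeTypes();
    InputStream toPDF(InputStream in) throws IOException;
}

// Hypothetical converter that hands the file to an external executable.
class CommandLinePdfConverter implements PDFConversionService {

    private final String executable;

    CommandLinePdfConverter(String executable) {
        this.executable = executable;
    }

    // MIME types chosen for illustration; list whatever your tool can read.
    @Override
    public Set<String> getSupportedMimeTypes() {
        return Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(
                "application/msword",
                "application/vnd.oasis.opendocument.text")));
    }

    // Assumed invocation: "<executable> -f pdf --stdout <file>" (unoconv style).
    List<String> buildCommand(Path input) {
        return Arrays.asList(executable, "-f", "pdf", "--stdout", input.toString());
    }

    @Override
    public InputStream toPDF(InputStream in) throws IOException {
        // Spool the incoming stream to a temporary file the tool can open.
        Path input = Files.createTempFile("wga-convert-", ".bin");
        try {
            Files.copy(in, input, StandardCopyOption.REPLACE_EXISTING);
            Process process = new ProcessBuilder(buildCommand(input))
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            // Read the complete PDF from the tool's standard output (Java 9+).
            byte[] pdf = process.getInputStream().readAllBytes();
            int exitCode;
            try {
                exitCode = process.waitFor();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while waiting for converter", e);
            }
            if (exitCode != 0) {
                throw new IOException("Converter exited with code " + exitCode);
            }
            return new ByteArrayInputStream(pdf);
        } finally {
            Files.deleteIfExists(input);
        }
    }
}
```

The sketch buffers the whole PDF in memory, which keeps the code short but may be unsuitable for very large attachments. Registering the class under the module type named above is not shown here, since that depends on the OpenWGA module registration mechanism.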