OpenWGA 7.6 - Query languages reference

lucene

Finding matches in file attachments

The lucene fulltext index can also index the contents of file attachments and find fulltext matches there. Indexing file attachments, however, is a feature of the OpenWGA enterprise edition.

There are three ways in which lucene queries can find matches in file attachments:

Querying doctype "attachment"

This is the preferred and most powerful variant, as it finds matches in file attachments and also identifies which file actually matched.

File attachments are included in a lucene query when the doctype of the query is set to either "attachment" or "all" (see "Native query options reference"). On the search results, use the WebTML metadata field "SEARCHDOCTYPE" to find out whether a match was against the content document or a file attachment. In the case of "attachment", use the metadata field "SEARCHFILENAME" to find out which file attachment actually matched your query. Matches against file attachments are nevertheless executed under the WebTML context of the document that contains them, just like regular content matches.

Here is a small example of a lucene query that also queries file attachments and varies its output depending on the doctype of the match, providing a link to whatever matched the query:

<tml:collection>
  <tml:query type="lucene" options="doctype:all">WGA*</tml:query>
  <tml:foreach>
    <tml:if condition="SEARCHDOCTYPE=='attachment'">
      <tml:then>
        <a href="<tml:url type="file" file="{SEARCHFILENAME}"/>">Attachment <tml:meta name="SEARCHFILENAME"/> on document <tml:meta name="TITLE"/></a>
      </tml:then>
      <tml:else>
        <a href="<tml:url/>">Content <tml:meta name="TITLE"/></a>
      </tml:else>
    </tml:if>
  </tml:foreach>
</tml:collection>
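
If only attachment matches are of interest, the same pattern can presumably be narrowed by setting doctype to "attachment". The following is a minimal sketch under that assumption. It uses only the tags and metadata fields shown above, and the search term WGA* is just a placeholder:

```xml
<tml:collection>
  <tml:comment>Restrict the query to file attachments only</tml:comment>
  <tml:query type="lucene" options="doctype:attachment">WGA*</tml:query>
  <tml:foreach>
    <tml:comment>Link to the matching file, then name the document that holds it</tml:comment>
    <a href="<tml:url type="file" file="{SEARCHFILENAME}"/>"><tml:meta name="SEARCHFILENAME"/></a>
    on document <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>
```

Since every result should then be an attachment match, the SEARCHDOCTYPE check from the previous example is left out here.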

You can only query the contents of file attachments that your OpenWGA installation can actually parse for text content. See "Indexable file types" for how to configure your installation's fulltext capabilities for special file types.

Some facts regarding this type of query:

  • You can search specifically for properties of file attachments by using the metadata fields "FILE_*" from the metadata fields table below in your query.
  • Field-unspecific searches will not only match file contents. They will also find matches based on file names and on the title and description from the file metadata.
  • If doctype is set to "all" and the query matches both the item content of a content document AND the contents of a file attachment on the same document, then that document will appear twice in the result: once for the attachment with SEARCHDOCTYPE "attachment" and once for the content document itself with SEARCHDOCTYPE "content".
  • A lucene query can never match both a content item of index type "fulltext" and anything from a file attachment. So if your query needs to find something in a "fulltext" item, you will not get file attachment results. However, it is possible to find file attachment matches while the query also requires a match on a content item of index type "keyword".
  • You can use the TMLScript method Lucene.bestFileFragments() to retrieve the content fragments that matched the query for a match on a file attachment.
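
The rule about "keyword" items can be illustrated with a short sketch. It assumes a hypothetical content item named "category" that is configured with index type "keyword". Neither the item name nor its value "news" comes from this reference:

```xml
<tml:collection>
  <tml:comment>"category" is a hypothetical item of index type "keyword"</tml:comment>
  <tml:query type="lucene" options="doctype:attachment">category:news AND WGA*</tml:query>
  <tml:foreach>
    <tml:meta name="SEARCHFILENAME"/> on document <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>
```

If "category" were indexed as "fulltext" instead, the rule above implies that this query would return no file attachment matches.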

Querying virtual field "allattachments"

You can also query file attachments on content documents by specifically searching against the virtual item "allattachments". It contains the indexed contents of all file attachments of a content document. However, doing so will not allow you to find out which attachment matched:

<tml:query type="lucene">allattachments:WGA*</tml:query>

This is a legacy function that has no advantages over doctype "attachment". Therefore it should only be used in legacy functionality and on content stores that do not have the version needed for the doctype "attachment".

Querying for file contents that are indexed on documents

If your app is configured to "Index File Contents on Documents" (an option available in the web app's configuration in the OpenWGA admin client), then the contents of file attachments will also be added to the field-unspecific search of the "content" entries representing the content documents to which they are attached. A simple field-unspecific search will therefore return content documents because their file attachments match the query, but it cannot determine which attachment matched.

<tml:query type="lucene">WGA*</tml:query>

This is also a legacy function that has no advantages over doctype "attachment". Therefore it should only be used in legacy functionality and on content stores that do not have the version needed for the doctype "attachment".