OpenWGA 7.7 - Query languages reference


lucene

General

Lucene is a library for executing fulltext queries that is embedded in OpenWGA. It is available for all WGA Content Store types to index and query all of the contained content documents and their data.

The lucene fulltext index covers the contents of the content stores of an OpenWGA installation so that they can be queried in a "fulltext" way, meaning that lucene can determine which documents in a content store contain a certain word. It also rates which documents are more "relevant" to a given term than others, depending on the position and frequency of the term in the contents. Most internet users know this form of search from search engines like Google, which provide similar functionality for web pages.

You may use lucene as a site-internal search engine to perform fulltext queries for specific terms on content documents. You can also use lucene like a database query language to run more specific queries on individual items and metadata fields and have the results sorted the way you want.

It is also the only query language in WGA that allows queries on multiple WGA Content Stores at once.

The ability to query the lucene fulltext index must be enabled for individual content stores in administration. There you can also configure how lucene treats items and metadata fields in the index, modifying importance, sorting capability and indexing type. If a lucene query does not return the results you expect, chances are that the behaviour of the lucene index can be adapted to your needs in the administrative setup.

One general drawback of lucene is that the index is updated asynchronously after each data change. Because of that, the index may not yet include the latest additions and modifications to the content store. If you need your query to return realtime results, you should choose another query language like HQL.

Syntax

This document demonstrates the most commonly used search syntaxes for lucene. For more in-depth information, see the official lucene documentation.

A lucene query consists of a number of individual search clauses. A search clause may be a simple term or a search for a term in a specific field. Individual clauses are separated by space characters. Therefore the following query consists of two clauses:

<tml:query type="lucene">
    Content Management
</tml:query>


It searches for documents that have both words "Content" and "Management" somewhere in their item data, or in textual metadata fields like title and description.
A clause that searches for a term in a specific field contains the field name and the term, separated by a colon:

<fieldname>:<term>


The field names are interpreted as content items when they are lowercase, or as metadata fields when they are uppercase, for example:

body:WGA TITLE:Google


This searches for documents whose item "body" contains the term "WGA" and whose metadata field "title" contains the term "Google". A list of valid metadata field names is at the end of this document.

Sorting

The default sort order of lucene results is by "relevance", i.e. those documents with the "best matches" are displayed first. What distinguishes a "better match" from a "worse match" depends on the field that the terms are found in. For example, a match in the title has a higher relevance than a match in any item. The configuration of the lucene index in administration can also "boost" specific items so that matches in them are regarded as "better" than matches in other items. For an in-depth treatment of the mathematics behind relevance determination, you can read the lucene documentation about Scoring.

In any case, you can retrieve a numerical representation of the individual relevance from the metadata field "SEARCHSCORE" on each result document, which returns a fractional value between 0 and 1.

Alternatively, you can sort the lucene results by item and metadata values, provided the field used was indexed to be sortable (which again can be configured in administration). Use the <tml:query> attribute options to specify the desired sorting:

<tml:query type="lucene" options="sort: myitem (asc)">...search terms...</tml:query>


The sort expression has the following syntax as seen in the example above:
  • The prefix "sort:"
  • The name of the field that should determine the sorting order, as it would be specified in the query itself (so items in lowercase, metadata fields in uppercase)
  • The suffix "(asc)" or "(desc)", determining whether you want ascending or descending sort order

The metadata table at the end of this document shows which metadata fields are sortable.

Sorting based on more than one field at once is currently not possible. If you need this, you might want to fall back on WebTML's sorting capabilities.
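
As a minimal sketch of sorting by a metadata field, the following query returns matches ordered by their last modification date, newest first. It assumes that an item named "body" exists in your content store and relies on MODIFIED being listed as sortable in the metadata table of this document:

```xml
<tml:collection>
  <tml:query type="lucene" options="sort: MODIFIED (desc)">body:WGA</tml:query>
  <tml:foreach>
    <a href="<tml:url/>"><tml:meta name="TITLE"/></a><br/>
  </tml:foreach>
</tml:collection>
```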

Operators

Search clauses can be combined with a variety of operators. An operator is placed either before a clause or between clauses, depending on its nature. The complete syntax of a search query including optional operators is therefore:

<preceding-operator><fieldname>:<term>  <between operator> <preceding-operator><fieldname>:<term>  ...


As we have seen before, most queries have neither a preceding nor a "between" operator. In that case lucene implicitly uses default operators. The default preceding operator is "+" (meaning that the clause is positive). The default operator between clauses is "AND" (meaning that result documents must match both clauses).

The following operators are available:

Operator | Description | Position
"AND", "&&" | Combines two clauses so that all documents are found that match both clauses. This is a default operator of lucene, which is implicitly used if multiple clauses are just separated by space characters without an explicit operator. | Between two clauses
"OR", "||" | Combines two clauses so that all documents are found that match either one of them or both. | Between two clauses
"+" | Marks the clause as "positive", i.e. all documents must match the clause. This is a default operator of lucene, which is implicitly used when clauses have no preceding operator. | Directly preceding the clause
"NOT", "-", "!" | Marks the clause as "negative", i.e. documents must not match the clause. A query may not consist of negative clauses only. | Directly preceding the clause in the case of "-" and "!"; preceding the clause but separated from it by a space character in the case of "NOT"
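
To illustrate, here is a small sketch of explicit operators in use. The item name "body" and the search terms are only examples:

```xml
<!-- Documents whose item "body" contains "WGA" and whose title does not contain "Google" -->
<tml:query type="lucene">
    +body:WGA -TITLE:Google
</tml:query>

<!-- Documents that contain "WGA" in the title, in the item "body", or in both -->
<tml:query type="lucene">
    TITLE:WGA OR body:WGA
</tml:query>
```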

Finding matches in file attachments

The lucene fulltext index is also capable of indexing the contents of file attachments and finding fulltext matches there. The functionality to index file attachments, however, is part of the OpenWGA Enterprise Edition.

There are three ways in which lucene queries can find matches in file attachments:

Querying doctype "attachment"

This is the preferred and most powerful variant, as it lets you identify which file attachment actually matched.

File attachments are included in a lucene query when the doctype of the query is set to either "attachment" or "all" (see "Native query option reference"). On the search results, use the WebTML metadata field "SEARCHDOCTYPE" to find out whether a match was against the content document or a file attachment, and in the case of "attachment" use meta "SEARCHFILENAME" to find out which file attachment actually matched your query. Matches against file attachments are nevertheless executed under the WebTML context of the document that contains them, just like regular content matches.

Here is a small example of a lucene query that also queries file attachments and varies its output depending on the match doctype, providing links to whatever matched the query:

<tml:collection>
  <tml:query type="lucene" options="doctype:all">WGA*</tml:query>
  <tml:foreach>
    <tml:if condition="SEARCHDOCTYPE=='attachment'">
      <tml:then>
        <a href="<tml:url type="file" file="{SEARCHFILENAME}"/>">Attachment <tml:meta name="SEARCHFILENAME"/> on document <tml:meta name="TITLE"/></a>
      </tml:then>
      <tml:else>
        <a href="<tml:url/>">Content <tml:meta name="TITLE"/></a>
      </tml:else>
    </tml:if>
  </tml:foreach>
</tml:collection>

You can only query the contents of file attachments that your OpenWGA installation can actually parse for text content. See Indexable file types on how to configure your installation's fulltext capabilities regarding special file types.

Some facts regarding this type of query:

  • You can search specifically for properties of file attachments by using the metadata fields "FILE_*" from the metadata fields table below in your query.
  • Field-unspecific searches will not only match file contents. They will also find matches based on file names and on the title and description from the file metadata.
  • If doctype is set to "all" and the query matches both the item content of a content document AND the contents of a file attachment on the same document, then that document will appear twice in the result, once for the attachment with SEARCHDOCTYPE "attachment" and once for the content document itself with SEARCHDOCTYPE "content".
  • A lucene query will never be able to match both a content item of index type "fulltext" and anything from a file attachment. So if your query needs to find something in a "fulltext" item, you will not get file attachment results. However, it is possible to find file attachment matches while the query also needs a match for a content item of index type "keyword".
  • You can use the TMLScript method Lucene.bestFileFragments() to retrieve the content fragments that matched the query for a match on a file attachment.

Querying virtual field "allattachments"

You can also query file attachments on content documents by specifically searching against the virtual item "allattachments". It contains the indexed contents of all file attachments of a content document. However, doing so will not allow you to find out which attachment matched:

<tml:query type="lucene">allattachments:WGA*</tml:query>

This is a legacy function that has no advantages over doctype "attachment". Therefore it should only be used in legacy functionality and on content stores that do not have the version needed for the doctype "attachment".

Querying for file contents that are indexed on documents

If your app is configured to "Index File Contents on Documents" (an option available in the web app's configuration in the OpenWGA admin client), then the contents of file attachments will also be added to the field-unspecific search of the "content" entries representing the content documents to which they are attached. So a simple field-unspecific search will return content documents because their file attachments match the query, but it will not be able to determine which attachment matched.

<tml:query type="lucene">WGA*</tml:query>

This is also a legacy function that has no advantages over doctype "attachment". Therefore it should only be used in legacy functionality and on content stores that do not have the version needed for the doctype "attachment".

Advanced syntax

Wildcards

A search term can contain two types of wildcard characters:

  • A question mark "?" is a wildcard for one arbitrary character.
  • An asterisk "*" is a wildcard for any number of arbitrary characters (including none).

Wildcards may NOT be used as the first character of a search term.
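
A short sketch of both wildcard types, using the item "body" purely as an example name. The first clause is intended to match titles containing words such as "Management" or "Manager", while the second is intended to match three-letter terms such as "WGA" in the item "body":

```xml
<tml:query type="lucene">
    TITLE:Manage* body:W?A
</tml:query>
```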

Space characters in search terms

When searching for terms that contain space characters, it does not work to just specify the term. As lucene normally uses the space character to separate individual search clauses, it will take everything after the space character as a separate clause.

For example, the following query will search for the term "Content" in item "body", but then for the terms "Management", "with" and "WGA" in a field-unspecific way, i.e. in all items (plus metas title and description):

body:Content Management with WGA

To search for a term with space characters exactly the way it is entered, you have to enclose it in double quotes. This will make lucene recognize it as one single term:

body:"Content Management with WGA"

Searching the contents of file attachments


Optionally, lucene can also index files attached to content documents. This is disabled in the default configuration, and it needs special "analyzer" modules to interpret the contents of the file types used, which are not part of the OpenWGA standard distribution. Analyzer modules for the most frequently used file types are available in the OpenWGA Enterprise Edition.

There is a special "item name" in lucene, "allattachments", for explicitly searching the contents of file attachments. So if you also want to search in file attachments, you may add an item-specific clause for it to your query:

"Content Management" AND allattachments:"Content Management"

Searching content relations

Content relations are also indexed by lucene but do not provide direct links to the target, as lucene can only index text. The index name of a normal relation is $rel_relationname, and its index field contains the struct key and language of the target content, separated by a dot. For example: "4028fbe5125651ea01125656704d000f.en".

Relation groups are indexed with name $relgroup_groupname and contain the same data.


Relations are indexed as keywords and are also sortable.

Searching for date and number values

As lucene is a fulltext indexing engine it treats all values as text, including dates and numbers which are converted to a standard text format. This must be considered when searching for those value types.

Date values

Date values are indexed as text in the format "yyyyMMddHHmmss". The characters mean (y)ear, (M)onth, (d)ay in month, (H)our, (m)inute and (s)econd. If a date contains no time information, the time values are indexed as 0. So 1 September 2005 is indexed as "20050901000000". You can use wildcards when searching for dates if the time does not matter. This searches for documents that were modified on that day, no matter at what time:

MODIFIED:20050901*


If you want to search for date items that way they must be configured in WGA Admin Client to be indexed as type "KEYWORD".

Number values

Numbers are just converted to text, optionally with the dot "." as decimal separator and without any grouping separator.

VERSION:5

Again items with number values must be configured to be indexed as type "KEYWORD" if they are meant to be queried that way.

Specifying ranges

The following syntax specifies a range of values that a field may have:

<fieldname>:[<start> TO <end>] or
<fieldname>:{<start> TO <end>}


The difference between these two syntaxes is that the square bracket syntax treats the start and end values as inclusive (documents whose values are exactly equal to <start> or <end> are also found), while the curly bracket syntax treats them as exclusive (the values must be higher than <start> and lower than <end> for a document to be found).

The range syntax is most useful when searching for date ranges. The following search finds documents that were modified between 15 August and 1 September 2005 inclusive:

MODIFIED:[20050815000000 TO 20050901235959]
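
For comparison, a sketch of the same kind of search with exclusive bounds. Assuming the date format described above, this query leaves out documents whose modification time is exactly one of the two boundary values and finds only those modified strictly between them:

```xml
<tml:query type="lucene">
    MODIFIED:{20050815000000 TO 20050901235959}
</tml:query>
```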


Searching multiple databases

As stated, lucene is able to search multiple content stores at once. To specify which databases to include in the search, you can use the following values on the <tml:query> attribute db:

Value for attribute "db" | Description
dbkey [, dbkey, ...] | Comma-separated list of databases to be searched
* | Search all lucene-indexed databases in the same domain as the context database
** | Search all lucene-indexed databases

The default value for attribute db is the dbkey of the current context database. So if you just want to search this database, you may omit this attribute.
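
A minimal sketch of queries across several content stores. The database keys "intranet" and "website" are hypothetical placeholders for the dbkeys of your own lucene-indexed content stores:

```xml
<!-- Search two explicitly named content stores (hypothetical dbkeys) -->
<tml:query type="lucene" db="intranet,website">"Content Management"</tml:query>

<!-- Search all lucene-indexed content stores in the domain of the context database -->
<tml:query type="lucene" db="*">"Content Management"</tml:query>
```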

Further functionality

Search score

As stated above, lucene provides a "search score" for each found content document, providing information about the relevance of the document to the search query. However, it is only available when the query result is sorted by relevance (which is the default when no other sort order is declared).

It is retrievable as metadata field "SEARCHSCORE" on each result document and is a numeric fraction value ranging from 1 (perfect match) to nearly 0 (weak match).

The relevance of a document for a search query is calculated based on many parameters:
  • Count of found terms
  • Items/Metadata fields where the terms were found and their importance
  • Position of the terms inside the field data
  • Configured "boost" value for the field (settable in WGA Admin Client under the "Fulltext configuration" of the content store)
Further information about this topic is found in the chapter "Sorting".
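
As a sketch, the score can be output next to each result like any other metadata field. This assumes the default sort order by relevance, since the score is not available otherwise:

```xml
<tml:collection>
  <tml:query type="lucene">Content Management</tml:query>
  <tml:foreach>
    <a href="<tml:url/>"><tml:meta name="TITLE"/></a>
    (Score: <tml:meta name="SEARCHSCORE"/>)<br/>
  </tml:foreach>
</tml:collection>
```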

Highlighting

The highlighting feature allows you to highlight the searched terms in the data of found documents. To enable this, set the attribute highlight of the <tml:query> tag to "true". Also, when outputting the data of found documents via <tml:item>, set the attribute highlight of this tag to "true" to enable automatic highlighting.

The default highlighting simply marks the terms bold. You can change this by using the <tml:item> attributes highlightprefix and highlightsuffix to explicitly specify the HTML code that is output right before and after the term.

The following example highlights terms by wrapping them in a HTML span of CSS class "highlight":

<tml:item name="body" highlight="true"
    highlightprefix="<span class='highlight'>" highlightsuffix="</span>"/>


This feature does not help you find the fields where the matches occurred. You need to know the item that is to be output via <tml:item> and enable highlighting there. Therefore this feature is most useful with documents whose main data is in just one "body" item that can always be output.
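
Put together, a minimal sketch of a result list with default highlighting could look like this. It assumes that the main text of your documents is stored in an item named "body":

```xml
<tml:collection>
  <tml:query type="lucene" highlight="true">body:WGA</tml:query>
  <tml:foreach>
    <h3><tml:meta name="TITLE"/></h3>
    <tml:item name="body" highlight="true"/>
  </tml:foreach>
</tml:collection>
```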

Best fragments

The "best fragments" feature automatically detects those text fragments in an item that matched the query terms and is able to return them. This is useful if the text of a data item is normally too long to be output in full on a search result page.

You can retrieve these fragments by the TMLScript method this.bestFragments(), which returns the fragments for a specific item on the current result document. It always uses the fragments data for the last lucene search on the current user session. So executing another lucene query will delete the fragments data of a previous search.

Including virtual documents

Virtual documents by default are excluded from the result list of lucene as their data is not the one shown when the virtual document is displayed. You may however choose to include them by specifying the native query option "includeVirtualContent":

<tml:query type="lucene" options="includeVirtualContent" ... />

Native query option reference

Native query options are options given to the WebTML attribute options that control aspects of the query which are native to the current query type. The following options are available:

  • "doctype:content|attachment|all": Determines where to search for matching text. Choose "content" for matches on the fields of content documents only (the default), "attachment" for matches against the contents of file attachments, or "all" for both.
  • "explain": Adds "lucene explain data" to the query result, explaining why a document is contained in the query result and how its search score came about. This is rather technical and specific to lucene's internals. It can be retrieved on the WGAPI content object "WGContent" via the method "getSearchExplanation()".
  • "includeVirtualContent": Includes virtual content documents in the search result. Note that the terms by which virtual documents are indexed are not taken from the data of their target documents. They are only indexed by the data that is stored directly on the virtual document.
  • "sort:fieldname (asc|desc)": Sorts the query result by the given field. Specify fieldname in lowercase to sort by an item, in uppercase to sort by a metadata field. Specify (asc) for ascending or (desc) for descending sort order.

Metadata fields in lucene index

This table shows all metadata fields that are contained in the lucene index and can be queried.

The fulltext index contains entries for two different doctypes: "content" which indexes the fields on content documents and "attachment" which indexes the contents of file attachments. Both also contain queryable metadata fields identified in this table. The column "Doctype" in this table identifies which entries in the index contain the respective field, "content", "attachment" or "all" if both contain it. Querying for a metadata field only available for one doctype will mean that this term will only find matches of this doctype.

There are also different indexing types in which these fields are indexed and which allow different usages:

  • keyword: The field value is stored unmodified and is not analyzed. It can therefore only be found when querying for the exact and complete contents of the field.
  • analyzed: The field value is analyzed and tokenized. It can be found by querying for any single word token.
  • fulltext: Like "analyzed", but the field can also be found when using field-unspecific search clauses.
  • date: Like "keyword", but only for dates, which are indexed in the text form "yyyyMMddHHmmss". See chapter "Date values" for details.

Metadata field | Description | Doctype | Index type | Sortable
AREA | Name of the area containing the content | all | keyword | Yes
AUTHOR | Author of the content | all | analyzed | Yes
COAUTHORS | Additional authors of the content (only OpenWGA content stores of version 5 or higher) | all | fulltext | No
CONTENTCLASS | Name of the content class of the content | all | keyword | Yes
CONTENTTYPE | Name of the content type of the page | all | keyword | Yes
CREATED | Date and time of creation | all | date | Yes
DBKEY | Key of the containing database | all | keyword | Yes
DESCRIPTION | Short description of the content | all | fulltext | Yes
DOCNAME, NAME, UNIQUENAME | Unique name of the content | all | keyword | Yes
FILE_COPYRIGHT | For file attachment matches: Copyright information from the file's metadata | attachment | keyword | Yes
FILE_CREATED | For file attachment matches: The date the file attachment was created | attachment | keyword | Yes
FILE_DESCRIPTION | For file attachment matches: The description from the file's metadata | attachment | fulltext | Yes
FILE_LASTMODIFIED | For file attachment matches: The date the file attachment was last modified | attachment | keyword | Yes
FILE_MD5CHECKSUM | For file attachment matches: The MD5 checksum of the file contents | attachment | keyword | Yes
FILE_MIMETYPE | For file attachment matches: The file's MIME type from its metadata | attachment | keyword | Yes
FILE_NAME | For file attachment matches: The file name | attachment | keyword | Yes
FILE_SHA512CHECKSUM | For file attachment matches: The SHA512 checksum of the file contents | attachment | keyword | Yes
FILE_SIZE | For file attachment matches: The size of the file in bytes | attachment | keyword | Yes
FILE_TITLE | For file attachment matches: The title of the file from its metadata | attachment | keyword | Yes
HIDDENINNAV | Is "true" if the document is to be hidden in navigators, "false" otherwise | all | keyword | Yes
HIDDENINSEARCH | Is "true" if the document is to be hidden in query results, "false" otherwise | all | keyword | Yes
HIDDENINSITEMAP | Is "true" if the document is to be hidden in sitemaps, "false" otherwise | all | keyword | Yes
KEY | The complete content key in the syntax "structkey.language.version" | all | keyword | Yes
KEYWORDS | Keywords for this content to be used by internet search engines | all | keyword | No
LANGUAGE | Code of the language of this content, for example "en" or "de" | all | keyword | Yes
LASTCLIENT | Type of the last OpenWGA authoring client that edited this content | all | keyword | Yes
LASTMODIFIED, MODIFIED | Date and time of last modification | all | date | Yes
OWNER | The owner of the content (only OpenWGA content stores of version 5 or higher) | all | fulltext | Yes
PAGEPUBLISHED | The published date of the first ever published version of this content (only OpenWGA content stores of version 5 or higher) | all | keyword | Yes
PARENT | Struct key of the parent page | all | keyword | No
PATH | Struct keys of all pages up the page hierarchy to the root page. Querying for the struct key of a specific page on PATH will return all contents that are in the hierarchy below that page | all | keyword | No
PUBLISHED | The published date of the content (only OpenWGA content stores of version 5 or higher) | all | keyword | Yes
STATUS | Workflow state of the content: "w" (working copy), "g" (in approval process), "p" (published), "a" (archived) | all | keyword | Yes
STRUCTENTRY, STRUCTKEY | Key of the struct entry belonging to this content | all | keyword | Yes
TITLE | Title of the content | all | fulltext | Yes
VALIDFROM | Optional date and time before which the document should be invisible | all | date | Yes
VALIDTO | Optional date and time after which the document should be invisible | all | date | Yes
VERSION | Version number of this content | all | keyword | Yes
VIRTUALLINK | If this document is a virtual document, points to its target. The contents depend on the type of virtual link (which is indexed as VIRTUALLINKTYPE): "int" (content key of the target document), "exturl" (URL of an external website), "file" (syntax <documentkey>/<filename>, where <documentkey> is the name of a file container or the key of a content document), "intfile" (name of a file attachment on this content) | all | keyword | Yes
VIRTUALLINKTYPE | Type of virtual document: "int" (targets a content document in this database), "exturl" (targets some custom URL), "file" (targets a file attachment on a file container or content document in this database), "intfile" (targets a file attachment on this content document) | all | keyword | Yes
VISIBLE | General visibility flag holding "true" or "false" | all | keyword | Yes
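
As an example of querying these metadata fields, the following sketch restricts a fulltext search to the contents below a specific page by using PATH. The struct key shown is only an example value and must be replaced by the struct key of a page in your own content store:

```xml
<tml:query type="lucene">
    WGA PATH:4028fbe5125651ea01125656704d000f
</tml:query>
```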