OpenWGA 7.0 - Query languages reference

lucene

General

Lucene is a library for executing fulltext queries that is embedded in OpenWGA. It is available for all WGA Content Store types and can index and query all of the contained content documents and their data. You may use lucene as a site-internal search engine to perform fulltext queries for specific terms. You can also use lucene like a database query language to run more specific queries on particular items and metadata fields and have the results sorted the way you want.

It is the only query language in WGA that allows queries on multiple WGA Content Stores at once.

Querying the lucene fulltext index must be enabled for individual content stores in administration. There you can also configure how lucene treats items and metadata fields in the index, modifying importance, sorting capability and indexing type. If a lucene query does not return the results you expect, chances are that the behaviour of the lucene index can be adapted to your needs in the administrative setup.

One general drawback of lucene is that the index is updated asynchronously after each data change. Because of that, the index may not yet include the latest additions and modifications to the content store. If you need your query to return realtime results, you should choose another query language such as HQL.

Syntax

This document demonstrates the most commonly used search syntaxes for lucene. For more in-depth documentation, refer to the official lucene documentation.

A lucene query consists of a number of individual search clauses. A search clause may be a simple term or a specific search for terms in a field. Individual clauses are separated by space characters. Therefore the following query consists of two clauses:

<tml:query type="lucene">
    Content Management
</tml:query>

It searches for documents that have both words "Content" and "Management" somewhere in their item data, or in textual metadata fields like title and description.

A clause that searches for a term in a specific field contains the field name and the term, separated by a colon:

<fieldname>:<term>

The field names are interpreted as content items when they are lowercase, or as metadata fields when they are uppercase, for example:

body:WGA TITLE:Google

This searches for documents whose item "body" contains the term "WGA" and whose metadata field "title" contains the term "Google". A list of valid metadata field names is at the end of this document.
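
When queries are assembled by a program rather than typed by hand, the case convention above is easy to get wrong. The following Python sketch shows one way to build field clauses. The helper name field_clause and its metadata flag are illustrative assumptions and not part of OpenWGA.

```python
def field_clause(name: str, term: str, metadata: bool = False) -> str:
    """Build '<fieldname>:<term>'.

    Item names are written in lowercase, metadata field names in uppercase,
    following the convention described above. (Hypothetical helper.)
    """
    field = name.upper() if metadata else name.lower()
    return f"{field}:{term}"


# Clauses are separated by space characters.
query = " ".join([
    field_clause("body", "WGA"),
    field_clause("title", "Google", metadata=True),
])
print(query)
```

The resulting string is the text that would be placed inside the <tml:query type="lucene"> tag.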

Sorting

The default sort order of lucene results is by "relevance", i.e. those documents with the "best matches" are displayed first. What distinguishes a "better match" from a "worse match" depends on the field in which the terms are found. For example, a match in the title has a higher relevance than a match in any item. The configuration of the lucene index in administration can also "boost" specific items so that matches in them are regarded as "better" than matches in other items. For an in-depth treatment of the mathematics behind relevance determination, you can read the lucene documentation about Scoring.

In any case, you can retrieve a numerical representation of the individual relevance as metadata field "searchscore" on each result document, which returns a fractional value between 0 and 1.

Alternatively, you can sort the lucene results by item and metadata values, provided the field used was indexed as sortable (which again can be configured in administration). Use the <tml:query> attribute options to specify the desired sorting:

<tml:query type="lucene" options="sort: myitem (asc)">...search terms...</tml:query>

The sort expression has the following syntax as seen in the example above:

The prefix "sort:"
The name of the field that should determine the sorting order, as it would be specified in the query itself (so items lowercase, metadata fields uppercase)
The suffix "(asc)" or "(desc)", determining whether you want ascending or descending sort order

The metadata table at the end of this document describes which metadata fields are sortable.

Sorting based on more than one field at once is currently not possible. If you need something like this, you might want to fall back on WebTML's sorting capabilities.
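
The three parts of the sort expression can be put together mechanically. The following Python sketch builds the value for the options attribute. The function sort_option and its parameters are hypothetical names chosen for illustration.

```python
def sort_option(field: str, metadata: bool = False, descending: bool = False) -> str:
    """Build a sort expression such as 'sort: myitem (asc)'.

    Items are written in lowercase and metadata fields in uppercase,
    exactly as they would be in the query itself. (Hypothetical helper.)
    """
    name = field.upper() if metadata else field.lower()
    direction = "desc" if descending else "asc"
    return f"sort: {name} ({direction})"


print(sort_option("myitem"))
print(sort_option("modified", metadata=True, descending=True))
```

The sketch accepts exactly one field, mirroring the restriction that sorting by more than one field is not possible.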

Operators

Search clauses can be combined with a variety of operators. An operator is placed either before a clause or between clauses, depending on its nature. The complete syntax of a search query including optional operators would therefore be:

<preceding-operator><fieldname>:<term> <between-operator> <preceding-operator><fieldname>:<term> ...

But as we have seen before, most queries have neither a preceding nor a "between" operator. In that case lucene implicitly uses default operators. The default preceding operator is "+" (meaning that the clause is positive). The default operator between clauses is "AND" (meaning that result documents must match both clauses).

The following operators are available:

Operator	Description	Position
AND, &&	Combines two clauses so all documents are found that match both clauses. This is a default operator of lucene which is implicitly used if multiple clauses are just divided by space characters without explicit operator.	Between two clauses
OR, ||	Combines two clauses so all documents are found that match either one of them or both clauses.	Between two clauses
+	Marks the clause as "positive", i.e. all documents must match the clause. This is a default operator of lucene which is implicitly used when clauses have no preceding operator.	Directly preceding the clause
NOT, -, !	Marks the clause as "negative", i.e. all documents must not match the clause. A query may not consist solely of negative clauses.	Directly preceding the clause in case of "-" and "!"; preceding the clause but separated from it by a space character in case of NOT
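
The rule that a query may not consist solely of negative clauses can be checked before a query is sent. The following Python sketch combines clauses with the preceding operators "+" and "-" and refuses queries without a positive clause. The function build_query is a hypothetical helper and not part of OpenWGA or lucene.

```python
def build_query(required, excluded=()):
    """Combine clauses using the preceding operators '+' and '-'.

    required: clauses every result document must match
    excluded: clauses no result document may match
    Raises ValueError if there is no positive clause at all.
    """
    if not required:
        raise ValueError("a query may not consist solely of negative clauses")
    parts = [f"+{clause}" for clause in required]
    parts += [f"-{clause}" for clause in excluded]
    # Clauses are separated by space characters.
    return " ".join(parts)


print(build_query(["body:WGA"], excluded=["TITLE:Google"]))
```

Writing the "+" explicitly is equivalent to leaving it out, since it is the default preceding operator. It is spelled out here only to make the intent visible.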

Advanced syntax

Wildcards

A search term can contain two types of wildcard characters:

A question mark "?" is a wildcard for one arbitrary character.
An asterisk "*" is a wildcard for any number of arbitrary characters (including none).

Wildcards may NOT be used as the first character of a search term.
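
To get a feeling for which words a wildcard term covers, the two wildcard characters can be translated into a regular expression. The following Python sketch does this purely for illustration. It does not reproduce how lucene evaluates wildcards internally, and the function name wildcard_to_regex is an assumption.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a wildcard search term into an equivalent regular expression.

    '?' stands for exactly one arbitrary character,
    '*' stands for any number of arbitrary characters (including none).
    """
    if term[:1] in ("?", "*"):
        raise ValueError("wildcards may not be used as the first character")
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = wildcard_to_regex("te?t*")
words = ["test", "text", "testing", "tet", "attest"]
matches = [word for word in words if pattern.fullmatch(word)]
print(matches)
```

Here "tet" is not matched because "?" requires exactly one character, and "attest" is not matched because the term must match the whole word from its first character.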

Space characters in search terms

When searching for terms that contain space characters, simply specifying the term does not work. As lucene normally uses the space character to separate individual search clauses, it will take everything after the space character as a separate clause.

For example, the following query will search for the term "Content" in item "body", but then for the terms "Management", "with" and "WGA" in all items (plus the metadata fields title and description):

body:Content Management with WGA

To search for a term with space characters exactly the way it is entered, you have to enclose it in double quotes. This makes lucene recognize it as one single term:

body:"Content Management with WGA"
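
The effect of the space character can be reproduced with a few lines of Python. The sketch below first shows how the unquoted query falls apart into four clauses and then builds the quoted form. The helper names quote_if_needed and clause are hypothetical, and terms that themselves contain double quotes are simply rejected, since their treatment is not covered here.

```python
def quote_if_needed(term: str) -> str:
    """Wrap a term in double quotes if it contains whitespace."""
    if '"' in term:
        raise ValueError("terms containing double quotes are not handled by this sketch")
    if any(ch.isspace() for ch in term):
        return f'"{term}"'
    return term


def clause(field: str, term: str) -> str:
    """Build '<fieldname>:<term>' with the term quoted where necessary."""
    return f"{field}:{quote_if_needed(term)}"


# Without quotes, every space starts a new clause.
unquoted = "body:Content Management with WGA".split()
print(unquoted)

# With quotes, the whole phrase stays one term of the field clause.
quoted = clause("body", "Content Management with WGA")
print(quoted)
```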

Searching the contents of file attachments

Optionally, lucene can also index files attached to content documents. This is disabled in the default configuration, and it needs special "analyzer" modules, which are not part of the OpenWGA standard distribution, to interpret the contents of the file types used. Analyzer modules for the most frequently used file types are available in the OpenWGA Enterprise Edition.

There is a special "item name" in lucene, named "allattachments", for explicitly searching the contents of file attachments. So if you also want to search in file attachments, you may add an item-specific clause for it to your query:

"Content Management" AND allattachments:"Content Management"

Searching content relations

Content relations are also indexed by lucene but do not provide direct links to the target, as lucene can only index text. The index name of a normal relation is $rel_relationname, and its index field contains the struct key and language of the target content, separated by a dot. For example: "4028fbe5125651ea01125656704d000f.en".

Relation groups are indexed with name $relgroup_groupname and contain the same data.

Relations are indexed as keywords and are also sortable.
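
Since relations are indexed as keywords, a clause has to name the complete indexed value. The following Python sketch composes the field name and value from their parts. The relation name "author" and the helper functions are purely hypothetical, and the sketch assumes that the "$" in the index name can be written unchanged in a query, which is not stated above.

```python
def relation_target(struct_key: str, language: str) -> str:
    """Indexed value of a relation: struct key and language, separated by a dot."""
    return f"{struct_key}.{language}"


def relation_clause(name: str, struct_key: str, language: str, group: bool = False) -> str:
    """Build a clause on '$rel_<name>' or, for relation groups, '$relgroup_<name>'."""
    prefix = "$relgroup_" if group else "$rel_"
    return f"{prefix}{name}:{relation_target(struct_key, language)}"


# "author" is a made-up relation name used only for this example.
print(relation_clause("author", "4028fbe5125651ea01125656704d000f", "en"))
```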

Searching for date and number values

As lucene is a fulltext indexing engine it treats all values as text, including dates and numbers which are converted to a standard text format. This must be considered when searching for those value types.

Date values

Date values are indexed as text in the format "yyyyMMddHHmmss". The characters mean (y)ear, (M)onth, (d)ay in month, (H)our, (m)inute and (s)econd. If a date contains no time information, the time values are indexed as 0. So 1 September 2005 is indexed as "20050901000000". You can use wildcards when searching for dates if the time does not matter. The following query searches for documents that were modified on that day, no matter at what time:

MODIFIED:20050901*

If you want to search date items that way, they must be configured in the WGA Admin Client to be indexed as type "KEYWORD".
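
The index format for dates is easy to produce programmatically. The following Python sketch converts dates into the text form described above and builds the wildcard clause for a whole day. The function names lucene_date and on_day are illustrative assumptions; "%Y%m%d%H%M%S" is the Python equivalent of the pattern "yyyyMMddHHmmss".

```python
from datetime import date, datetime


def lucene_date(value) -> str:
    """Format a date or datetime in the text form 'yyyyMMddHHmmss'.

    A plain date carries no time information, so its time part becomes zeros.
    """
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value.strftime("%Y%m%d%H%M%S")
    return value.strftime("%Y%m%d") + "000000"


def on_day(field: str, day: date) -> str:
    """Build a wildcard clause matching any time on the given day."""
    return f"{field}:{day.strftime('%Y%m%d')}*"


print(lucene_date(date(2005, 9, 1)))
print(on_day("MODIFIED", date(2005, 9, 1)))
```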

Number values

Numbers are just converted to text, optionally with the dot "." as decimal separator and without any grouping separator.

VERSION:5

Again, items with number values must be configured to be indexed as type "KEYWORD" if they are meant to be queried that way.

Specifying ranges

The following syntax specifies a range of values that a field may have:

<fieldname>:[<start> TO <end>] or
<fieldname>:{<start> TO <end>}

The difference between these two syntaxes is that the square bracket syntax treats the start and end values as inclusive (documents whose values are exactly equal to <start> or <end> are also found), while the curly bracket syntax treats them as exclusive (the values must be higher than <start> and lower than <end> for a document to be found).

The range syntax is most useful when searching for date ranges. The following search finds documents that were modified between 15 August and 1 September 2005 inclusive:

MODIFIED:[20050815000000 TO 20050901235959]
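
The range above can be derived from two calendar days. The following Python sketch builds inclusive and exclusive range clauses and shows why the fixed-width date format suits ranges: the text order of the values is the same as their chronological order. The names range_clause and day_range are hypothetical helpers.

```python
from datetime import date, datetime, time

INDEX_FORMAT = "%Y%m%d%H%M%S"  # Python equivalent of "yyyyMMddHHmmss"


def range_clause(field, start, end, inclusive=True):
    """Build '<fieldname>:[<start> TO <end>]' or the exclusive '{...}' form."""
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opening}{start} TO {end}{closing}"


def day_range(field, first_day, last_day):
    """Inclusive range from the first second of first_day to the last second of last_day."""
    start = datetime.combine(first_day, time(0, 0, 0)).strftime(INDEX_FORMAT)
    end = datetime.combine(last_day, time(23, 59, 59)).strftime(INDEX_FORMAT)
    return range_clause(field, start, end)


print(day_range("MODIFIED", date(2005, 8, 15), date(2005, 9, 1)))

# Fixed-width date strings sort as text in chronological order.
print("20050815000000" < "20050820120000" < "20050901235959")
```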

Searching multiple databases

As stated, lucene is able to search multiple content stores at once. To specify which databases to include in the search, you can use the following values for the <tml:query> attribute db:

Value for attribute "db"	Description
dbkey [, dbkey, ...]	Comma separated list of databases to be searched.
*	Search all lucene indexed databases in the same domain as the context database
**	Search all lucene indexed databases

The default value for attribute db is the dbkey of the current context database. So if you just want to search this database, you may omit this attribute.

Further functionality

Search score

As stated above, lucene provides a "search score" for each found content document, indicating the relevance of the document for the search query. However, it is only available when the query result is sorted by relevance (which is the default when no other sort order is declared).

It is retrievable as metadata field "SEARCHSCORE" on each result document and is a numeric fraction value ranging from 1 (perfect match) to nearly 0 (weak match).

The relevance of a document for a search query is calculated based on many parameters:

Count of found terms
Items/Metadata fields where the terms were found and their importance
Position of the terms inside the field data
Configured "boost" value for the field (settable in WGA Admin Client under the "Fulltext configuration" of the content store)

Further information about this topic is found in the chapter "Sorting".

Highlighting

The highlighting feature allows you to highlight the searched terms in the data of found documents. To enable this, set the attribute highlight of the <tml:query> tag to "true". Also, when outputting the data of found documents via <tml:item>, set the attribute highlight of this tag to "true" to enable automatic highlighting.

The default highlighting simply marks the terms bold. You can change this by using the <tml:item> attributes highlightprefix and highlightsuffix to explicitly specify the HTML code that is output right before and after the term.

The following example highlights terms by wrapping them in an HTML span of CSS class "highlight":

<tml:item name="body" highlight="true"
    highlightprefix="<span class='highlight'>" highlightsuffix="</span>"/>

This feature does not help you find the fields where the matches occurred. You need to know the item that is to be output via <tml:item> and enable highlighting there. Therefore this feature is most useful with documents whose main data is in just one "body" item that can always be output.
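
The effect of a highlight prefix and suffix can be pictured with a small Python sketch that wraps each occurrence of a search term in the given HTML code. This is only an illustration of the idea and not OpenWGA's implementation. The function highlight, its case-insensitive matching and the default "<b>" markup for bold output are assumptions.

```python
import re


def highlight(text, terms, prefix="<b>", suffix="</b>"):
    """Wrap every occurrence of the given terms in prefix and suffix."""
    if not terms:
        return text
    # Escape the terms so that they are matched literally, ignoring case.
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    return pattern.sub(lambda match: f"{prefix}{match.group(0)}{suffix}", text)


print(highlight("Content Management with WGA", ["wga"],
                prefix='<span class="highlight">', suffix="</span>"))
```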

Best fragments

The "best fragments" feature automatically detects those text fragments in an item that matched the query terms and is able to return them. This is useful if the text of a data item is normally too long to be output in full on a search result page.

You can retrieve these fragments by the TMLScript method this.bestFragments(), which returns the fragments for a specific item on the current result document. It always uses the fragments data for the last lucene search on the current user session. So executing another lucene query will delete the fragments data of a previous search.

Including virtual documents

By default, virtual documents are excluded from the result list of lucene, as their data is not what is shown when the virtual document is displayed. You may however choose to include them by specifying the native query option "includeVirtualContent":

<tml:query type="lucene" options="includeVirtualContent" ... />

Native query option reference

Native query options are options given to the WebTML attribute options, which control aspects of the query that are native to the current query type. The following options are available:

Option	Purpose
explain	Adds "lucene explain data" to the query result, explaining why the document is contained in the query result and the cause for its search score. This is rather technical and specific to lucene's internals. It can be retrieved on the WGAPI content object "WGContent" via method "getSearchExplanation()".
includeVirtualContent	Includes virtual content documents in the search result. Note that the terms by which virtual documents are indexed are not from the data of their target documents. They are only indexed by the data that is stored directly on the virtual document.
sort:fieldname (asc|desc)	Sorts the query result by the given field. Specify fieldname in lowercase to sort by an item, in uppercase to sort by a metadata field. Specify (asc) for ascending or (desc) for descending sort order.

Metadata fields in lucene index

This table shows all metadata fields that are contained in the lucene index. There are different indexing types which allow different usages:

keyword: The field value is stored unmodified and is not analyzed. It can therefore only be found when querying for the exact and complete contents of the field.
analyzed: The field value is analyzed and tokenized. It can be found by querying for any single word token.
fulltext: Like "analyzed", but the field can also be found when using field-unspecific search clauses.
date: Like "keyword", but only for dates, which are indexed in the text form "yyyyMMddHHmmss". See chapter "Date values" for details.

Metadata field	Description	Index type	Sortable
AREA	Name of the area containing the content	keyword	Yes
AUTHOR	Author of the content	analyzed	Yes
COAUTHORS	Additional authors of the content (Only OpenWGA content stores of version 5 or higher)	fulltext	No
CONTENTCLASS	Name of the content class of the content	keyword	Yes
CONTENTTYPE	Name of the content type of the page	keyword	Yes
CREATED	Date and time of creation	date	Yes
DBKEY	Key of containing database	keyword	Yes
DESCRIPTION	Short description of the content	fulltext	Yes
DOCNAME, NAME, UNIQUENAME	Unique name of the content	keyword	Yes
HIDDENINNAV	Is "true" if the document is to be hidden in navigators, "false" otherwise	keyword	Yes
HIDDENINSEARCH	Is "true" if the document is to be hidden in query results, "false" otherwise	keyword	Yes
HIDDENINSITEMAP	Is "true" if the document is to be hidden in sitemaps, "false" otherwise	keyword	Yes
KEY	The complete content key of syntax "structkey.language.version"	keyword	Yes
KEYWORDS	Keywords for this content to be used by internet search engines	keyword	No
LANGUAGE	Code of the language of this content, for example "en" or "de"	keyword	Yes
LASTCLIENT	Type of the last OpenWGA authoring client that edited this content	keyword	Yes
LASTMODIFIED, MODIFIED	Date and time of last modification	date	Yes
OWNER	The owner of the content (Only OpenWGA content stores of version 5 or higher)	fulltext	Yes
PAGEPUBLISHED	The published date of the first ever published version of this content (Only OpenWGA content stores of version 5 or higher)	keyword	Yes
PARENT	Struct key of the parent page	keyword	No
PATH	Struct keys of all pages up the page hierarchy to the root page. Querying for the struct key of a specific page on PATH will return all contents that are in the hierarchy below that page	keyword	No
PUBLISHED	The published date of the content (Only OpenWGA content stores of version 5 or higher)	keyword	Yes
STATUS	Workflow state of the content: "w" - Working copy "g" - In approval process "p" - Published "a" - Archived	keyword	Yes
STRUCTENTRY, STRUCTKEY	Key of the struct entry belonging to this content	keyword	Yes
TITLE	Title of the content	fulltext	Yes
VALIDFROM	Optional date and time before which the document should be invisible	date	Yes
VALIDTO	Optional date and time after which the document should be invisible	date	Yes
VERSION	Number of version of this content	keyword	Yes
VIRTUALLINK	If this document is a virtual document points to its target. Contents depends on type of virtual link (which is indexed as VIRTUALLINKTYPE): "int" - Content key of the target document "exturl" - URL to an external website "file" - Syntax: <documentkey>/<filename>, where <documentkey> is the name of a file container or the key of a content document "intfile" - Name of a file attachment on this content	keyword	Yes
VIRTUALLINKTYPE	Type of virtual document: "int" - Targets a content document in this database "exturl" - Targets some custom URL "file" - Targets a file attachment on a file container or content document in this database "intfile" - Targets a file attachment on this content document	keyword	Yes
VISIBLE	General visibility flag holding "true" or "false".	keyword	Yes
