OpenWGA 7.0 - Query languages reference
lucene
General
Lucene is a library for executing fulltext queries that is embedded in OpenWGA. It is available for all WGA Content Store types to index and query all of the contained content documents and their data. You may use lucene as a site-internal search engine to perform fulltext queries for specific terms. You can also use lucene like a database query language to run more specific queries on individual items and metadata fields and have the results sorted the way you want. It is the only query language in WGA that allows queries on multiple WGA Content Stores at once.
The feature to query the lucene fulltext index must be enabled for individual content stores in administration. There you can also configure the way that lucene treats items and metadata fields in the index, modifying importance, sorting capability and indexing type. If a lucene query does not return the results you expect, chances are that the behaviour of the lucene index can be adapted to your needs in the administrative setup.
One general drawback of lucene is that the index is updated asynchronously after each data change. Because of that, the index may not include the latest additions and modifications of the content store. If you need your query to return real-time results, choose another query language such as HQL.
Syntax
This document demonstrates the most commonly used search syntaxes for lucene. For more in-depth documentation, see the official lucene documentation.
A lucene query consists of a number of individual search clauses. A search clause may be a simple term or a search for terms in a specific field. Individual clauses are separated by space characters. Therefore the following query consists of two clauses:
<tml:query type="lucene">
Content Management
</tml:query>
It searches for documents that have both words "Content" and "Management" somewhere in their item data, or in textual metadata fields like title and description.
A clause that searches for a term in a specific field contains the field name and the term, separated by a colon:
<fieldname>:<term>
The field names are interpreted as content items when they are lowercase, or as metadata fields when they are uppercase, for example:
body:WGA TITLE:Google
This searches for documents whose item "body" contains the term "WGA" and whose metadata field "title" contains the term "Google". A list of valid metadata field names is at the end of this document.
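The naming rule can be sketched in a few lines of Python that assemble the query text before it is placed inside <tml:query>. The helper name field_clause is hypothetical and not part of OpenWGA; it only illustrates the lowercase/uppercase convention described above.

```python
def field_clause(name: str, term: str, is_metadata: bool = False) -> str:
    """Return '<fieldname>:<term>' (hypothetical helper, not an OpenWGA API)."""
    # Items are written in lowercase, metadata fields in uppercase.
    field = name.upper() if is_metadata else name.lower()
    return f"{field}:{term}"


# Individual clauses are separated by a space character.
query = " ".join([
    field_clause("body", "WGA"),
    field_clause("title", "Google", is_metadata=True),
])
print(query)  # body:WGA TITLE:Google
```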
Sorting
The default sort order of lucene results is by "relevance", i.e. the documents with the "best matches" are displayed first. What distinguishes a "better match" from a "worse match" depends on the field in which the terms are found. For example, a match in the title has a higher relevance than a match in any item. The configuration of the lucene index in administration can also "boost" individual items so that matches in them are regarded as "better" than matches in other items. For an in-depth treatment of the mathematics behind relevance determination, see the lucene documentation about Scoring.
In any case you can retrieve a numerical representation of the individual relevance as metadata field "searchscore" on each result document, which returns a fraction value between 1 and 0.
Alternatively you can sort the lucene results by item and metadata values, provided that the field used was indexed as sortable (which again can be configured in administration). Use the options attribute of <tml:query> to specify the desired sorting:
<tml:query type="lucene" options="sort: myitem (asc)">...search terms...</tml:query>
The sort expression has the following syntax as seen in the example above:
- The prefix "sort:"
- The name of the field that should determine the sorting order, written as it would be in the query itself (items in lowercase, metadata fields in uppercase)
- The suffix "(asc)" or "(desc)" determining if you want ascending or descending sort order
The metadata table at the end of this document describes what metadata fields are sortable.
Sorting by more than one field at once is currently not possible. If you need this, you may want to fall back on WebTML's sorting capabilities.
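As a minimal sketch of the three parts of the sort expression, the following Python helper assembles the value for the options attribute. The function name sort_option is hypothetical; the output format follows the example shown earlier in this section.

```python
def sort_option(field: str, descending: bool = False) -> str:
    """Build a 'sort: <field> (asc|desc)' option string (hypothetical helper)."""
    # The caller writes items in lowercase and metadata fields in uppercase.
    direction = "desc" if descending else "asc"
    return f"sort: {field} ({direction})"


print(sort_option("myitem"))                     # sort: myitem (asc)
print(sort_option("MODIFIED", descending=True))  # sort: MODIFIED (desc)
```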
Operators
Search clauses can be combined with a variety of operators. An operator is placed either before a clause or between clauses, depending on its nature. The complete syntax of a search query including optional operators would therefore be:
<preceding-operator><fieldname>:<term> <between operator> <preceding-operator><fieldname>:<term> ...
As seen before, however, most queries have neither a preceding nor a "between" operator. In that case lucene implicitly uses default operators. The default preceding operator is "+" (meaning that the clause is positive). The default operator between clauses is "AND" (meaning that result documents must match both clauses).
The following operators are available:
Operator | Description | Position |
---|---|---|
AND, && | Combines two clauses so all documents are found that match both clauses. This is a default operator of lucene which is implicitly used if multiple clauses are just divided by space characters without explicit operator. | Between two clauses |
OR, || | Combines two clauses so all documents are found that match either one of them or both clauses. | Between two clauses |
+ | Marks the clause as "positive", i.e. all documents must match the clause. This is a default operator of lucene which is implicitly used when clauses have no preceding operator. | Directly preceding the clause |
NOT, -, ! | Marks the clause as "negative", i.e. all documents must not match the clause. A query may not consist of negative clauses only. | Directly preceding the clause in case of "-" and "!", preceding the clause but separated from it by a space character in case of NOT |
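To illustrate the default operators, this Python sketch spells out what this section describes as implicit: a clause without a preceding operator is treated as positive, and clauses separated only by spaces are combined with AND. The function make_explicit is hypothetical and deliberately simple; it does not handle the NOT keyword or quoted phrases.

```python
def make_explicit(clauses: list[str]) -> str:
    """Write the implicit default operators explicitly (illustration only)."""
    explicit = []
    for clause in clauses:
        # A clause without preceding operator is implicitly positive ("+").
        if not clause.startswith(("+", "-", "!")):
            clause = "+" + clause
        explicit.append(clause)
    # Clauses separated only by spaces are implicitly combined with AND.
    return " AND ".join(explicit)


print(make_explicit(["body:WGA", "TITLE:Google"]))
# +body:WGA AND +TITLE:Google
```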
Advanced syntax
Wildcards
A search term can contain two types of wildcard characters:
- A question mark "?" is a wildcard for one arbitrary character.
- A star sign "*" is a wildcard for any number of arbitrary characters (including none).
Wildcards may NOT be used as the first character of a search term.
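The meaning of the two wildcards can be illustrated by translating a term into a regular expression. This Python sketch only demonstrates the matching rules stated above; it is not how lucene evaluates wildcards internally, and the function name wildcard_to_regex is hypothetical.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate '?' and '*' into a regular expression (illustration only)."""
    if term[:1] in ("?", "*"):
        raise ValueError("wildcards may not be the first character")
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(".")    # exactly one arbitrary character
        elif ch == "*":
            parts.append(".*")   # any number of characters, including none
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = wildcard_to_regex("te?t*")
print(bool(pattern.fullmatch("test")))     # True
print(bool(pattern.fullmatch("testing")))  # True
print(bool(pattern.fullmatch("tet")))      # False
```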
Space characters in search terms
When searching for terms that contain space characters, it is not enough to just specify the term. As lucene normally uses the space character to separate individual search clauses, it treats everything after the space character as a separate clause.
For example, the following query searches for the term "Content" in item "body", but then for the terms "Management", "with" and "WGA" in all items (plus the metadata fields title and description):
body:Content Management with WGA
To search for a term with space characters exactly the way it is entered, you have to enclose it in double quotes. This makes lucene recognize it as one single term:
body:"Content Management with WGA"
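When query text is assembled from user input, the quoting rule can be applied programmatically. The following Python sketch wraps a term in double quotes only if it contains whitespace. The helper quote_term is hypothetical, and it assumes that the term itself contains no double quotes.

```python
def quote_term(term: str) -> str:
    """Wrap a term in double quotes if it contains whitespace."""
    # Assumption: the term contains no double quotes of its own.
    if any(ch.isspace() for ch in term):
        return f'"{term}"'
    return term


clause = "body:" + quote_term("Content Management with WGA")
print(clause)  # body:"Content Management with WGA"
```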
Searching the contents of file attachments
Optionally lucene can also index files attached to content documents. This is disabled in the default configuration, and it needs special "analyzer" modules to interpret the contents of the file types used, which are not part of the OpenWGA standard distribution. Analyzer modules for the most frequently used file types are available in the OpenWGA Enterprise Edition.
There is a special "item name" in lucene for explicitly searching the contents of file attachments, named "allattachments". So if you also want to search in file attachments, you can add an item-specific clause for it to your query:
"Content Management" AND allattachments:"Content Management"
Searching content relations
Content relations are also indexed by lucene, but they do not provide direct links to the target because lucene can only index text. The index name of a normal relation is $rel_relationname, and its index field contains the struct key and language of the target content, separated by a dot. For example: "4028fbe5125651ea01125656704d000f.en".
Relation groups are indexed with name $relgroup_groupname and contain the same data.
Relations are indexed as keywords and are also sortable.
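As a rough sketch, a clause for a relation could be assembled as follows. The relation name "myrelation" and the helper relation_clause are hypothetical. Whether the "$" character needs escaping in the query is not stated here, so the unescaped form is an assumption. Because relations are indexed as keywords, the value has to match the indexed text exactly.

```python
def relation_clause(relation_name: str, struct_key: str, language: str) -> str:
    """Build a clause for a content relation (hypothetical helper)."""
    # Index field name: "$rel_" followed by the relation name.
    field = f"$rel_{relation_name}"
    # Indexed value: struct key and language of the target, separated by a dot.
    value = f"{struct_key}.{language}"
    return f"{field}:{value}"


clause = relation_clause("myrelation", "4028fbe5125651ea01125656704d000f", "en")
print(clause)  # $rel_myrelation:4028fbe5125651ea01125656704d000f.en
```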
Searching for date and number values
As lucene is a fulltext indexing engine it treats all values as text, including dates and numbers, which are converted to a standard text format. This must be considered when searching for those value types.
Date values
Date values are indexed as text in the format "yyyyMMddHHmmss". The characters mean (y)ear, (M)onth, (d)ay in month, (H)our, (m)inute and (s)econd. If a date contains no time information, the time values are indexed as 0. So 1 September 2005 is indexed as "20050901000000". You can use wildcards when searching for dates if the time does not matter. This searches for documents that were modified on that day, no matter at what time:
MODIFIED:20050901*
If you want to search for date items that way, they must be configured in the WGA Admin Client to be indexed as type "KEYWORD".
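The index format corresponds to the strftime pattern "%Y%m%d%H%M%S". The following Python sketch produces the indexed text form of a date and a wildcard term for a whole day; the helper names index_date and whole_day are hypothetical.

```python
from datetime import date, datetime

INDEX_DATE_FORMAT = "%Y%m%d%H%M%S"  # equivalent of yyyyMMddHHmmss


def index_date(value: datetime) -> str:
    """Return the text form in which a date is indexed."""
    return value.strftime(INDEX_DATE_FORMAT)


def whole_day(value: date) -> str:
    """Return a wildcard term matching any time on the given day."""
    return value.strftime("%Y%m%d") + "*"


print(index_date(datetime(2005, 9, 1)))           # 20050901000000
print("MODIFIED:" + whole_day(date(2005, 9, 1)))  # MODIFIED:20050901*
```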
Number values
Numbers are simply converted to text, optionally with the dot "." as decimal separator and without any grouping separator.
VERSION:5
Again items with number values must be configured to be indexed as type "KEYWORD" if they are meant to be queried that way.
Specifying ranges
The following syntax specifies a range of values that a field may have:
<fieldname>:[<start> TO <end>]
or
<fieldname>:{<start> TO <end>}
The difference between the two syntaxes is that the square bracket syntax treats start and end values as inclusive (documents whose values are exactly equal to <start> or <end> are also found), while the curly bracket syntax treats them as exclusive (the values must be higher than <start> and lower than <end> for a document to be found).
The range syntax is most useful when searching for date ranges. The following search finds documents that were modified between 15 August and 1 September 2005 inclusive:
MODIFIED:[20050815000000 TO 20050901235959]
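An inclusive range over whole days starts at 00:00:00 of the first day and ends at 23:59:59 of the last day. The following Python sketch builds such a clause; the function inclusive_day_range is a hypothetical helper, not an OpenWGA API.

```python
from datetime import date, datetime, time


def inclusive_day_range(field: str, first_day: date, last_day: date) -> str:
    """Build '<field>:[<start> TO <end>]' covering both days completely."""
    fmt = "%Y%m%d%H%M%S"  # equivalent of yyyyMMddHHmmss
    start = datetime.combine(first_day, time.min)        # 00:00:00
    end = datetime.combine(last_day, time(23, 59, 59))   # 23:59:59
    return f"{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"


clause = inclusive_day_range("MODIFIED", date(2005, 8, 15), date(2005, 9, 1))
print(clause)  # MODIFIED:[20050815000000 TO 20050901235959]
```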
Searching multiple databases
As stated, lucene is able to search multiple content stores at once. To specify which databases to include in the search, use the following values for the db attribute of <tml:query>:
Value for attribute "db" | Description |
---|---|
dbkey [, dbkey, ...] | Comma separated list of databases to be searched. |
* | Search all lucene indexed databases in the same domain as the context database |
** | Search all lucene indexed databases |
The default value for attribute db is the dbkey of the current context database. So if you just want to search this database, you may omit this attribute.
Further functionality
Search score
As stated above, lucene provides a "search score" for each content document found, indicating the relevance of the document for the search query. However, it is only available when the query result is sorted by relevance (which is the default when no other sort order is declared).
It is retrievable as metadata field "SEARCHSCORE" on each result document and is a numeric fraction value ranging from 1 (perfect match) to nearly 0 (weak match).
The relevance of a document for a search query is calculated based on many parameters:
- Count of found terms
- Items/Metadata fields where the terms were found and their importance
- Position of the terms inside the field data
- Configured "boost" value for the field (settable in WGA Admin Client under the "Fulltext configuration" of the content store)
Highlighting
The highlighting feature allows you to highlight the searched terms in the data of found documents. To enable this, set the attribute highlight on the <tml:query> tag to "true". Also, when outputting the data of found documents via <tml:item>, set the attribute highlight of this tag to "true" to enable automatic highlighting.
The default highlighting simply marks the terms bold. You can change this by using the <tml:item> attributes highlightprefix and highlightsuffix to explicitly specify the HTML code that is output right before and after the term.
The following example highlights terms by wrapping them in an HTML span of CSS class "highlight":
<tml:item name="body" highlight="true"
highlightprefix="<span class='highlight'>" highlightsuffix="</span>"/>
This feature does not help you find the fields where the matches occurred. You need to know the item that is to be output via <tml:item> and enable highlighting there. Therefore this feature is most useful with documents whose main data is in a single "body" item that can always be output.
Best fragments
The feature "best fragments" automatically detects those text fragments in an item that matched the query terms and is able to return them. This is useful if the text of a data item is normally too long to be output in full on a search result page.
You can retrieve these fragments via the TMLScript method this.bestFragments(), which returns the fragments for a specific item on the current result document. It always uses the fragments data of the last lucene search in the current user session, so executing another lucene query discards the fragments data of a previous search.
Including virtual documents
Virtual documents by default are excluded from the result list of lucene as their data is not the one shown when the virtual document is displayed. You may however choose to include them by specifying the native query option "includeVirtualContent":
<tml:query type="lucene" options="includeVirtualContent" ... />
Native query option reference
Native query options are options passed in the WebTML attribute options, which control aspects of the query that are native to the current query type. The following options are available:
Option | Purpose |
---|---|
explain | Adds "lucene explain data" to the query result, explaining why a document is contained in the query result and how its search score came about. This is rather technical and specific to lucene's internals. It can be retrieved on the WGAPI content object "WGContent" via method "getSearchExplanation()". |
includeVirtualContent | Includes virtual content documents in the search result. Note that the terms by which virtual documents are indexed are not from the data of their target documents. They are only indexed by the data that is stored directly on the virtual document. |
sort:fieldname (asc|desc) | Sorts the query result by the given field. Specify fieldname in lowercase to sort by an item, in uppercase to sort by a metadata field. Specify (asc) for ascending or (desc) for descending sort order. |
Metadata fields in lucene index
This table shows all metadata fields that are contained in the lucene index. There are different indexing types which allow different usages:
- keyword: The field value is stored unmodified and is not analyzed. It can therefore (only) be found when querying for the exact and complete contents of the field.
- analyzed: The field value is analyzed and tokenized. It can be found by querying for any single word token.
- fulltext: Like "analyzed", but the field can also be found when using field-unspecific search clauses.
- date: Like "keyword", but only for dates, which are indexed in the text form "yyyyMMddHHmmss". See section "Date values" for details.
Metadata field | Description | Index type | Sortable |
---|---|---|---|
AREA | Name of the area containing the content | keyword | Yes |
AUTHOR | Author of the content | analyzed | Yes |
COAUTHORS | Additional authors of the content (Only OpenWGA content stores of version 5 or higher) | fulltext | No |
CONTENTCLASS | Name of the content class of the content | keyword | Yes |
CONTENTTYPE | Name of the content type of the page | keyword | Yes |
CREATED | Date and time of creation | date | Yes |
DBKEY | Key of containing database | keyword | Yes |
DESCRIPTION | Short description of the content | fulltext | Yes |
DOCNAME, NAME, UNIQUENAME | Unique name of the content | keyword | Yes |
HIDDENINNAV | Is "true" if the document is to be hidden in navigators, "false" otherwise | keyword | Yes |
HIDDENINSEARCH | Is "true" if the document is to be hidden in query results, "false" otherwise | keyword | Yes |
HIDDENINSITEMAP | Is "true" if the document is to be hidden in sitemaps, "false" otherwise | keyword | Yes |
KEY | The complete content key of syntax "structkey.language.version" | keyword | Yes |
KEYWORDS | Keywords for this content to be used by internet search engines | keyword | No |
LANGUAGE | Code of the language of this content, for example "en" or "de" | keyword | Yes |
LASTCLIENT | Type of the last OpenWGA authoring client that edited this content | keyword | Yes |
LASTMODIFIED, MODIFIED | Date and time of last modification | date | Yes |
OWNER | The owner of the content (Only OpenWGA content stores of version 5 or higher) | fulltext | Yes |
PAGEPUBLISHED | The published date of the first ever published version of this content (Only OpenWGA content stores of version 5 or higher) | keyword | Yes |
PARENT | Struct key of the parent page | keyword | No |
PATH | Struct keys of all pages up the page hierarchy to the root page. Querying for the struct key of a specific page on PATH will return all contents that are in the hierarchy below that page | keyword | No |
PUBLISHED | The published date of the content (Only OpenWGA content stores of version 5 or higher) | keyword | Yes |
STATUS | Workflow state of the content: "w" - Working copy; "g" - In approval process; "p" - Published; "a" - Archived | keyword | Yes |
STRUCTENTRY, STRUCTKEY | Key of the struct entry belonging to this content | keyword | Yes |
TITLE | Title of the content | fulltext | Yes |
VALIDFROM | Optional date and time before which the document should be invisible | date | Yes |
VALIDTO | Optional date and time after which the document should be invisible | date | Yes |
VERSION | Number of version of this content | keyword | Yes |
VIRTUALLINK | If this document is a virtual document, this field points to its target. The contents depend on the type of virtual link (which is indexed as VIRTUALLINKTYPE): "int" - Content key of the target document; "exturl" - URL to an external website; "file" - Syntax: <documentkey>/<filename>, where <documentkey> is the name of a file container or the key of a content document; "intfile" - Name of a file attachment on this content | keyword | Yes |
VIRTUALLINKTYPE | Type of virtual document: "int" - Targets a content document in this database; "exturl" - Targets some custom URL; "file" - Targets a file attachment on a file container or content document in this database; "intfile" - Targets a file attachment on this content document | keyword | Yes |
VISIBLE | General visibility flag holding "true" or "false". | keyword | Yes |