OpenWGA 7.8 - Query languages reference
lucene
General
Lucene is a library for executing fulltext queries that is embedded in OpenWGA. It is available for all WGA Content Store types and can index and query all of their content documents and data.
The lucene fulltext index covers the contents of the content stores of an OpenWGA installation in a way that allows "fulltext" queries, meaning that it can determine which documents in a content store contain a certain word. It also ranks documents by how "relevant" they are for a certain term, depending on the position and frequency of the term in the contents. Most internet users know this form of search from internet search engines like Google, which provide similar functionality for web pages.
You may use lucene as a site-internal search engine to perform fulltext queries for special terms on content documents. You can also use lucene like a database query language to do more specific queries on special items and metadata fields and have the results sorted the way you want.
It is the only query language in WGA that allows queries on multiple WGA Content Stores at once.
The ability to query the lucene fulltext index must be enabled for individual content stores in administration. There you can also configure how lucene treats items and metadata fields in the index, modifying importance, sorting capability and indexing type. If a lucene query does not return the results you expect, chances are that the behaviour of the lucene index can be adapted to your needs in the administrative setup.
One general drawback of lucene is the fact that the index is updated asynchronously after each data change. Because of that the index may not include the latest data additions and modifications of the content store. If you need your query to return realtime results you should choose another query language like HQL.
Syntax
This section demonstrates the most commonly used search syntaxes for lucene. For more in-depth information, see the official lucene documentation.
A lucene query consists of a number of individual search clauses. A search clause may be a simple term or a specific search for terms in a field. Individual clauses are separated by space characters. Therefore the following query consists of two clauses:
<tml:query type="lucene">
Content Management
</tml:query>
It searches for documents that have both words "Content" and "Management" somewhere in their item data, or in textual metadata fields like title and description.
A clause that searches a term in a specific field contains fieldname and term divided by a colon:
<fieldname>:<term>
The field names are interpreted as content items when they are lowercase, or as metadata fields when they are uppercase, for example:
body:WGA TITLE:Google
This searches for documents whose item "body" contains the term "WGA" and whose metadata field "title" contains the term "Google". A list of valid metadata field names is at the end of this document.
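As a rough sketch, the field-specific query above could be placed in a collection and its results listed with the tags used elsewhere in this document. The output markup is only an assumption for illustration.

```xml
<tml:collection>
  <!-- Item "body" must contain "WGA" and metadata field TITLE must contain "Google" -->
  <tml:query type="lucene">body:WGA TITLE:Google</tml:query>
  <tml:foreach>
    <!-- Runs once per result document, in the context of that document -->
    <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>
```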
Sorting
By default, lucene results are ordered by relevance. Whatever the sort order, each result document offers a numerical representation of its individual relevance as metadata field "SEARCHSCORE", which returns a fraction value between 1 and 0.
Alternatively you can sort the lucene results by item and metadata values, provided the field was indexed as sortable (which again can be configured in administration). Use the <tml:query> attribute options to specify the desired sorting:
<tml:query type="lucene" options="sort: myitem (asc)">...search terms...</tml:query>
The sort expression has the following syntax as seen in the example above:
- The prefix "sort:"
- The name of the field that should determine the sorting order, written as it would be in the query itself (items in lowercase, metadata fields in uppercase)
- The suffix "(asc)" or "(desc)" determining if you want ascending or descending sort order
The metadata table at the end of this document shows which metadata fields are sortable.
Sorting by more than one field at once is currently not possible. If you need this, you might want to fall back on WebTML's sorting capabilities.
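A sketch of sorting by a metadata field rather than an item, assuming the fulltext configuration leaves MODIFIED sortable as listed in the metadata table. The search clause is only a placeholder.

```xml
<tml:collection>
  <!-- Most recently modified documents first; metadata field names are uppercase -->
  <tml:query type="lucene" options="sort: MODIFIED (desc)">body:WGA</tml:query>
  <tml:foreach>
    <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>
```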
Operators
With explicit operators, a query takes the following general form:
<preceding-operator><fieldname>:<term> <between-operator> <preceding-operator><fieldname>:<term> ...
As the earlier examples show, most queries have neither a preceding nor a "between" operator. In that case lucene implicitly uses default operators. The default preceding operator is "+" (the clause is positive). The default operator between clauses is "AND" (result documents must match both clauses).
The following operators are available:
Operator | Description | Position |
---|---|---|
AND, && | Combines two clauses so all documents are found that match both clauses. This is a default operator of lucene which is implicitly used if multiple clauses are just divided by space characters without explicit operator. | Between two clauses |
OR, \|\| | Combines two clauses so all documents are found that match either one of them or both clauses. | Between two clauses |
+ | Marks the clause as "positive", i.e. all documents must match the clause. This is a default operator of lucene which is implicitly used when clauses have no preceding operator. | Directly preceding the clause |
NOT, -, ! | Marks the clause as "negative", i.e. all documents must not match the clause. A query may not consist only of negative clauses. | Directly preceding the clause in case of "-" and "!"; preceding the clause but separated from it by a space character in case of NOT |
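The operators from the table can be sketched as follows. Each <tml:query> below is a separate alternative that would go into its own <tml:collection>; the field names and terms are the placeholders used earlier in this document.

```xml
<!-- Implicit defaults: equivalent to "+body:WGA AND +TITLE:Google" -->
<tml:query type="lucene">body:WGA TITLE:Google</tml:query>

<!-- Either clause may match -->
<tml:query type="lucene">body:WGA OR TITLE:Google</tml:query>

<!-- Item "body" must contain "WGA", TITLE must not contain "Google" -->
<tml:query type="lucene">body:WGA -TITLE:Google</tml:query>

<!-- The same exclusion written with NOT, separated from the clause by a space -->
<tml:query type="lucene">body:WGA NOT TITLE:Google</tml:query>
```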
Finding matches in file attachments
The lucene fulltext index is also capable of indexing the contents of file attachments and finding fulltext matches there. The functionality to index file attachments however is part of the OpenWGA enterprise edition.
There are three ways how lucene queries can find matches in file attachments:
Querying doctype "attachment"
This is the preferred and most powerful variant, as it finds matches in file attachments and lets you identify which file actually matched.
File attachments are included in a lucene query when the doctype of the query is set to either "attachment" or "all" (see "Native query option reference"). On the search results, use the WebTML metadata field "SEARCHDOCTYPE" to find out whether a match was against the content document or a file attachment, and in case of "attachment" use meta "SEARCHFILENAME" to find out which file attachment actually matched your query. Matches against file attachments are nevertheless executed under the WebTML context of the document that contains them, just like regular content matches.
Here is a small example of a lucene query which also queries file attachments and varies its output depending on the match doctype, providing links to whatever matched the query:
<tml:collection>
  <tml:query type="lucene" options="doctype:all">WGA*</tml:query>
  <tml:foreach>
    <tml:if condition="SEARCHDOCTYPE=='attachment'">
      <tml:then>
        <a href="<tml:url type="file" file="{SEARCHFILENAME}"/>">Attachment <tml:meta name="SEARCHFILENAME"/> on document <tml:meta name="TITLE"/></a>
      </tml:then>
      <tml:else>
        <a href="<tml:url/>">Content <tml:meta name="TITLE"/></a>
      </tml:else>
    </tml:if>
  </tml:foreach>
</tml:collection>
You can only query for the contents of file attachments that your OpenWGA installation can actually parse for text content. See "Indexable file types" on how to configure your installation's fulltext capabilities regarding special file types.
Some facts regarding this type of query:
- You can search specifically for properties of file attachments using the metadata fields "FILE_*" from the metadata fields table below in your query
- Field-unspecific searches match not only file contents but also file names and the title and description from the file metadata.
- If doctype is set to "all" and the query matches both the item content of a content document AND the contents of a file attachment on the same document, then that document appears twice in the result: once for the attachment with SEARCHDOCTYPE "attachment" and once for the content document itself with SEARCHDOCTYPE "content".
- A lucene query can never match both a content item of index type "fulltext" and anything from a file attachment. So if your query requires a match in a "fulltext" item, it will not return file attachment results. However, it is possible to find file attachment matches while the query also requires a match for a content item of index type "keyword".
- You can use the TMLScript method Lucene.bestFileFragments() to retrieve content fragments that matched the query for a match on a file attachment.
Querying virtual field "allattachments"
You can also query file attachments on content documents by specifically searching against the virtual item "allattachments". It contains the indexed contents of all file attachments of a content document. However, doing so will not allow you to find out which attachment matched:
<tml:query type="lucene">allattachments:WGA*</tml:query>
This is a legacy function that has no advantages over doctype "attachment". Therefore it should only be used in legacy functionality and on content stores that do not have the version needed for the doctype "attachment".
Querying for file contents that are indexed on documents
If your app is configured to "Index File Contents on Documents" (an option available in the web app's configuration in the OpenWGA admin client), then the contents of file attachments will also be added to the field-unspecific search of the "content" entries representing the content documents to which they are attached. So a simple field-unspecific search will return content documents because their file attachments match the query, but it will not be able to determine which attachment matched.
This is also a legacy function that has no advantages over doctype "attachment". Therefore it should only be used in legacy functionality and on content stores that do not have the version needed for the doctype "attachment".
Advanced syntax
Wildcards
A search term can contain two types of wildcard characters:
- A question mark "?" is a wildcard for exactly one arbitrary character.
- A star sign "*" is a wildcard for any number of arbitrary characters (including none).
Wildcards may NOT be used as the first character of a search term.
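The wildcard rules above can be sketched with two separate queries. The terms are placeholders, and the words named in the comments are only examples of what could match.

```xml
<!-- "*" stands for any number of characters: Manage, Management, Manager, ... -->
<tml:query type="lucene">body:Manage*</tml:query>

<!-- "?" stands for exactly one character: Test, Text, ... -->
<tml:query type="lucene">TITLE:Te?t</tml:query>

<!-- Not allowed: a wildcard as the first character, for example body:*ment -->
```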
Space characters in search terms
When searching for terms that contain space characters, it does not work to just specify the term. As lucene normally uses the space character to separate individual search clauses, it will treat everything after the space character as a separate clause.
For example, the following query will search for the term "Content" in item "body", but then for the terms "Management", "with" and "WGA" in all other items (plus metas title and description):
body:Content Management with WGA
To search for a term with space characters exactly the way that it is entered, you have to enclose it in double quotes. This will make lucene recognize it as one single term:
body:"Content Management with WGA"
Searching the contents of file attachments
Optionally lucene can also index files attached to content documents. This is disabled in the default configuration and needs special "analyzer" modules to interpret the contents of the file types in use; these are not part of the OpenWGA standard distribution. Analyzer modules for the most frequently used file types are available in the OpenWGA Enterprise Edition.
There is a special "item name" in lucene named "allattachments" for explicitly searching the contents of file attachments. So if you also want to search in file attachments, you may add an item-specific clause for it to your query:
"Content Management" AND allattachments:"Content Management"
Searching content relations
Content relations are also indexed by lucene but do not provide direct links to the target, as lucene can only index text. The index name of a normal relation is $rel_relationname, and its index field contains the struct key and language of the target content, separated by a dot. For example: "4028fbe5125651ea01125656704d000f.en".
Relation groups are indexed with name $relgroup_groupname and contain the same data.
Searching for date and number values
As lucene is a fulltext indexing engine it treats all values as text, including dates and numbers, which are converted to a standard text format. This must be considered when searching for those value types.
Date values
Date values are indexed as text in the format "yyyyMMddHHmmss". The characters mean (y)ear, (M)onth, (d)ay in month, (H)our, (m)inute and (s)econd. If a date contains no time information, the time values are indexed as 0. So 1 September 2005 is indexed as "20050901000000". You can use wildcards when searching for dates if the time does not matter. This searches for documents that were modified on that day, no matter at what time:
MODIFIED:20050901*
If you want to search for date items that way they must be configured in WGA Admin Client to be indexed as type "KEYWORD".
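Because dates are indexed as text with the most significant part first, a shorter prefix followed by a wildcard should select a whole month or year. A sketch using metadata fields from the table at the end of this document:

```xml
<!-- All documents modified in September 2005, on any day and at any time -->
<tml:query type="lucene">MODIFIED:200509*</tml:query>

<!-- All documents created in 2005 -->
<tml:query type="lucene">CREATED:2005*</tml:query>
```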
Number values
Numbers are just converted to text, with the dot "." as decimal separator where needed and without any grouping separator. For example:
VERSION:5
Again items with number values must be configured to be indexed as type "KEYWORD" if they are meant to be queried that way.
Specifying ranges
A range of values that a field may have can be specified with either of the following syntaxes:
<fieldname>:[<start> TO <end>]
<fieldname>:{<start> TO <end>}
The difference between the two is that the square bracket syntax treats the start and end values as inclusive (documents whose values equal <start> or <end> are also found), while the curly bracket syntax treats them as exclusive (the values must be higher than <start> and lower than <end> for a document to be found).
The range syntax is most useful when searching for date ranges. The following search finds documents that were modified between 15 August and 1 September 2005 inclusive:
MODIFIED:[20050815000000 TO 20050901235959]
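A sketch of a range query inside a collection, combined with the sort option described earlier. It assumes CREATED is sortable as listed in the metadata table; the date boundaries are arbitrary.

```xml
<tml:collection>
  <!-- Square brackets: both boundary values belong to the range (all of 2005) -->
  <tml:query type="lucene" options="sort: CREATED (asc)">
    CREATED:[20050101000000 TO 20051231235959]
  </tml:query>
  <tml:foreach>
    <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>
```

Because the index treats dates as text, it is probably safest to write both boundaries in the full 14-digit form, as in the example in the text.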
Searching multiple databases
As stated, lucene is able to search multiple content stores at once. To specify which databases to include in the search, use one of the following values on the <tml:query> attribute db:
Value for attribute "db" | Description |
---|---|
dbkey [, dbkey, ...] | Comma separated list of databases to be searched. |
* | Search all lucene indexed databases in the same domain as the context database |
** | Search all lucene indexed databases |
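A sketch of the db attribute in use. The database keys "news" and "products" are hypothetical and stand for the keys of lucene indexed content stores in your installation.

```xml
<tml:collection>
  <!-- "news" and "products" are hypothetical database keys -->
  <tml:query type="lucene" db="news,products">"Content Management"</tml:query>
  <tml:foreach>
    <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>

<!-- Alternative: all lucene indexed databases in the domain of the context database -->
<tml:query type="lucene" db="*">"Content Management"</tml:query>
```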
Further functionality
Search score
The search score expresses how relevant a result document is for the query. It is retrievable as metadata field "SEARCHSCORE" on each result document and is a numeric fraction value ranging from 1 (perfect match) to nearly 0 (weak match).
The relevance of a document for a search query is calculated based on many parameters:
- Count of found terms
- Items/Metadata fields where the terms were found and their importance
- Position of the terms inside the field data
- Configured "boost" value for the field (settable in WGA Admin Client under the "Fulltext configuration" of the content store)
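A minimal sketch of showing the score next to each result, using only the metadata fields named in this document:

```xml
<tml:collection>
  <tml:query type="lucene">Content Management</tml:query>
  <tml:foreach>
    <!-- SEARCHSCORE is a fraction between nearly 0 (weak match) and 1 (perfect match) -->
    <tml:meta name="TITLE"/> (score: <tml:meta name="SEARCHSCORE"/>)<br/>
  </tml:foreach>
</tml:collection>
```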
Highlighting
The default highlighting simply marks the terms bold. You can change this by using the <tml:item> attributes highlightprefix and highlightsuffix to explicitly specify the HTML code that is put out right before and after the term.
The following example highlights terms by wrapping them in a HTML span of CSS class "highlight":
<tml:item name="body" highlight="true"
highlightprefix="<span class='highlight'>" highlightsuffix="</span>"/>
Best fragments
You can retrieve the content fragments that best matched the query with the TMLScript method this.bestFragments(), which returns the fragments for a specific item on the current result document. It always uses the fragments data of the last lucene search in the current user session, so executing another lucene query will delete the fragments data of a previous search.
Including virtual documents
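Virtual content documents are only included in the search result when the native query option includeVirtualContent is set:
<tml:query type="lucene" options="includeVirtualContent" ... />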
Native query option reference
Option | Purpose |
---|---|
doctype:content\|attachment\|all | Determines where to search for matching text: Choose "content" for matches on the fields of content documents only (the default), "attachment" for matches against the contents of file attachments or "all" for both. |
explain | Adds "lucene explain data" to the query, explaining why the document is contained in the query result and the cause for its search score. This is rather technical and specific to lucene's internals. It can be retrieved on the WGAPI content object "WGContent" via method "getSearchExplanation()". |
includeVirtualContent | Includes virtual content documents in the search result. Note that the terms by which virtual documents are indexed are not from the data of their target documents. They are only indexed by the data that is stored directly on the virtual document. |
sort:fieldname (asc\|desc) | Sorts the query result by the given field. Specify fieldname in lowercase to sort by an item, in uppercase to sort by a metadata field. Specify (asc) for ascending or (desc) for descending sort order. |
Metadata fields in lucene index
This table shows all metadata fields that are contained in the lucene index and can be queried.
The fulltext index contains entries for two different doctypes: "content" which indexes the fields on content documents and "attachment" which indexes the contents of file attachments. Both also contain queryable metadata fields identified in this table. The column "Doctype" in this table identifies which entries in the index contain the respective field, "content", "attachment" or "all" if both contain it. Querying for a metadata field only available for one doctype will mean that this term will only find matches of this doctype.
There are also different indexing types in which these fields are indexed and which allow different usages:
- keyword: The field value is stored unmodified and is not analyzed, so it can only be found when querying for the exact and complete contents of the field.
- analyzed: The field value is analyzed and tokenized. It can be found querying for any single word token.
- fulltext: Like "analyzed", but the field can also be found when using field-unspecific search clauses
- date: Like "keyword". Only for dates, that will be indexed in the text form "yyyyMMddHHmmss". See chapter "date values" for details.
Metadata field | Description | Doctype | Index type | Sortable |
---|---|---|---|---|
AREA | Name of the area containing the content | all | keyword | Yes |
AUTHOR | Author of the content | all | analyzed | Yes |
COAUTHORS | Additional authors of the content (Only OpenWGA content stores of version 5 or higher) | all | fulltext | No |
CONTENTCLASS | Name of the content class of the content | all | keyword | Yes |
CONTENTTYPE | Name of the content type of the page | all | keyword | Yes |
CREATED | Date and time of creation | all | date | Yes |
DBKEY | Key of containing database | all | keyword | Yes |
DESCRIPTION | Short description of the content | all | fulltext | Yes |
DOCNAME, NAME, UNIQUENAME | Unique name of the content | all | keyword | Yes |
FILE_COPYRIGHT | For file attachment matches: Copyright information from the file's metadata | attachment | keyword | Yes |
FILE_CREATED | For file attachment matches: The date the file attachment was created | attachment | keyword | Yes |
FILE_DESCRIPTION | For file attachment matches: The description from the file's metadata | attachment | fulltext | Yes |
FILE_LASTMODIFIED | For file attachment matches: The date the file attachment was last modified | attachment | keyword | Yes |
FILE_MD5CHECKSUM | For file attachment matches: The MD5 checksum of the file contents | attachment | keyword | Yes |
FILE_MIMETYPE | For file attachment matches: The file's MIME type from its metadata | attachment | keyword | Yes |
FILE_NAME | For file attachment matches: The file name | attachment | keyword | Yes |
FILE_SHA512CHECKSUM | For file attachment matches: The SHA512 checksum of the file contents | attachment | keyword | Yes |
FILE_SIZE | For file attachment matches: The size of the file in bytes | attachment | keyword | Yes |
FILE_TITLE | For file attachment matches: The title of the file from its metadata | attachment | keyword | Yes |
HIDDENINNAV | Is "true" if the document is to be hidden in navigators, "false" otherwise | all | keyword | Yes |
HIDDENINSEARCH | Is "true" if the document is to be hidden in query results, "false" otherwise | all | keyword | Yes |
HIDDENINSITEMAP | Is "true" if the document is to be hidden in sitemaps, "false" otherwise | all | keyword | Yes |
KEY | The complete content key of syntax "structkey.language.version" | all | keyword | Yes |
KEYWORDS | Keywords for this content to be used by internet search engines | all | keyword | No |
LANGUAGE | Code of the language of this content, for example "en" or "de" | all | keyword | Yes |
LASTCLIENT | Type of the last OpenWGA authoring client that edited this content | all | keyword | Yes |
LASTMODIFIED, MODIFIED | Date and time of last modification | all | date | Yes |
OWNER | The owner of the content (Only OpenWGA content stores of version 5 or higher) | all | fulltext | Yes |
PAGEPUBLISHED | The published date of the first ever published version of this content (Only OpenWGA content stores of version 5 or higher) | all | keyword | Yes |
PARENT | Struct key of the parent page | all | keyword | No |
PATH | Struct keys of all pages up the page hierarchy to the root page. Querying for the struct key of a specific page on PATH will return all contents that are in the hierarchy below that page | all | keyword | No |
PUBLISHED | The published date of the content (Only OpenWGA content stores of version 5 or higher) | all | keyword | Yes |
STATUS | Workflow state of the content: "w" - Working copy, "g" - In approval process, "p" - Published, "a" - Archived | all | keyword | Yes |
STRUCTENTRY, STRUCTKEY | Key of the struct entry belonging to this content | all | keyword | Yes |
TITLE | Title of the content | all | fulltext | Yes |
VALIDFROM | Optional date and time before which the document should be invisible | all | date | Yes |
VALIDTO | Optional date and time after which the document should be invisible | all | date | Yes |
VERSION | Version number of this content | all | keyword | Yes |
VIRTUALLINK | If this document is a virtual document, points to its target. The contents depend on the type of virtual link (which is indexed as VIRTUALLINKTYPE): "int" - Content key of the target document, "exturl" - URL to an external website, "file" - Syntax: <documentkey>/<filename>, where <documentkey> is the name of a file container or the key of a content document, "intfile" - Name of a file attachment on this content | all | keyword | Yes |
VIRTUALLINKTYPE | Type of virtual document: "int" - Targets a content document in this database, "exturl" - Targets some custom URL, "file" - Targets a file attachment on a file container or content document in this database, "intfile" - Targets a file attachment on this content document | all | keyword | Yes |
VISIBLE | General visibility flag holding "true" or "false". | all | keyword | Yes |
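As a sketch of querying metadata fields from this table, the following restricts a title search to the pages below one page by using PATH. The struct key is a placeholder for the key of an existing page in your content store.

```xml
<tml:collection>
  <!-- Contents in the hierarchy below the page with the given (placeholder) struct key whose title contains "WGA" -->
  <tml:query type="lucene">PATH:4028fbe5125651ea01125656704d000f TITLE:WGA</tml:query>
  <tml:foreach>
    <tml:meta name="TITLE"/><br/>
  </tml:foreach>
</tml:collection>
```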