About search tokenization

Search functionality relies on Elasticsearch indices. When you run a search query, the Intelligence Center searches for matches in the content that has been ingested and indexed.

Besides full text search, you can use Boolean operators and wildcards.
You can combine these filtering options to create more refined searches.

Ingested data is indexed in Elasticsearch. Elasticsearch analyzes the incoming data and breaks it up into tokens: smaller, meaningful bits of information.
The tokenization process is based on predefined rule sets.

If a data field is not mapped in the Elasticsearch index mapping, Elasticsearch also stores a non-analyzed version of the analyzed and tokenized data.
This version of the data holds the original, non-analyzed and non-tokenized value.

Elasticsearch can apply multiple tokenizers to text fields. This enables searching for and retrieving content using different search strategies:

  • Search based on the Elasticsearch standard tokenizer.

  • Search based on the Elasticsearch pattern tokenizer.

  • Search based on an alphanumeric tokenizer that uses any non-alphanumeric characters as token separators ([^a-zA-Z0-9_]).

  • Search for non-tokenized data.

  • Search for non-tokenized data spelled backward (reverse text).
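To make the difference between these strategies concrete, here is a rough Python sketch. The regular expressions only approximate the tokenizers; this is an illustration, not Elasticsearch's actual analysis chain:

```python
import re

SAMPLE = "King's Landing"

# Rough stand-in for the grammar-based standard tokenizer:
# word-internal apostrophes are kept, so "King's" stays one token.
# (Approximation only; the real tokenizer follows the Unicode
# Text Segmentation algorithm.)
standard_like = re.findall(r"\w+(?:'\w+)*", SAMPLE)

# Alphanumeric tokenizer: any non-alphanumeric character,
# including the apostrophe, acts as a token separator.
alphanumeric = [t for t in re.split(r"[^a-zA-Z0-9_]+", SAMPLE) if t]

print(standard_like)  # ["King's", 'Landing']
print(alphanumeric)   # ['King', 's', 'Landing']
```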

Search for tokens and keywords

You can search for analyzed and tokenized, as well as for non-analyzed and non-tokenized data.

Elasticsearch analyzes and tokenizes ingested content using its grammar-based standard tokenizer: it splits content into text elements, based on the Unicode Text Segmentation algorithm.
Example:
A search for data.city_name:"King's Landing" returns [ King's, Landing ]

You can also search for indexed content based on different tokenization criteria.
To do so, append the following parameters to the JSON paths pointing to the JSON data field names whose values you want to look up:

  • tokens: based on an alphanumeric tokenizer, it uses any non-alphanumeric characters as token separators ([^a-zA-Z0-9_]).
    It is useful when searching alphanumeric IDs that should not be split into multiple tokens.
    Token delimiters include white space, punctuation, hyphens, apostrophes, and quotes.
    Example:
    A search for data.city_name.tokens:"King's Landing" returns [ King, s, Landing ].

  • keyword: based on the Elasticsearch keyword tokenizer, it returns the data exactly as it was received: the output is identical to the input.
    It is useful when searching text where words are joined by characters such as hyphens, underscores, or other characters that the other tokenizers would interpret as token separators.
    Example:
    A search for data.city_name.keyword:"King's Landing" returns King's Landing.

  • keyword_r: based on the Elasticsearch reverse token filter, it returns the original input data with its characters in reverse order (reverse text).
    Example:
    A search for data.city_name.keyword_r:"King's Landing" returns gnidnaL s'gniK.
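The behavior of the three suffixes can be sketched in Python. The regex emulates the alphanumeric tokenizer described above; this is an illustration, not the actual Elasticsearch implementation:

```python
import re

def tokens(value):
    """Emulate the .tokens suffix: split on any
    non-alphanumeric character ([^a-zA-Z0-9_])."""
    return [t for t in re.split(r"[^a-zA-Z0-9_]+", value) if t]

def keyword(value):
    """Emulate the .keyword suffix: the value is returned
    exactly as it was received."""
    return value

def keyword_r(value):
    """Emulate the .keyword_r suffix: the value spelled backward."""
    return value[::-1]

city = "King's Landing"
print(tokens(city))     # ['King', 's', 'Landing']
print(keyword(city))    # King's Landing
print(keyword_r(city))  # gnidnaL s'gniK
```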


Examples

Append tokens, keyword, or keyword_r to the JSON data field names whose values you want to search and retrieve.
The following examples search for observable values and enrichment observable values.


  • extracts.value.tokens: Non-alphanumeric characters are token separators. Non-alphanumeric characters in the observable value are replaced with whitespace, and the resulting value is split on whitespace to create tokens.

  • extracts.value.keyword: The original observable value is returned as is, without any modifications.

  • extracts.value.keyword_r: The original observable value is returned spelled backward (reverse text).

  • enrichment_extracts.value.tokens: Non-alphanumeric characters are token separators. Non-alphanumeric characters in the enrichment observable value are replaced with whitespace, and the resulting value is split on whitespace to create tokens.

  • enrichment_extracts.value.keyword: The original enrichment observable value is returned as is, without any modifications.

  • enrichment_extracts.value.keyword_r: The original enrichment observable value is returned spelled backward (reverse text).
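As a minimal sketch, clauses targeting these fields can be assembled like the query examples earlier in this section. The helper function is hypothetical, shown only to illustrate the field-name pattern:

```python
def clause(field, value):
    """Build a query-string clause of the form field:"value"."""
    return f'{field}:"{value}"'

# Match the exact observable value:
print(clause("extracts.value.keyword", "King's Landing"))

# Match against the reverse-text field by reversing the value first:
print(clause("extracts.value.keyword_r", "King's Landing"[::-1]))
```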

Search for raw field values

You can bypass tokenization and search for raw, non-tokenized field values. To do so, append a trailing .raw element to the JSON path representing the field name.
Format: ${field.name.json.path}.raw

Example

  • meta.title: Enables accessing the indexed, tokenized field value. It is possible to retrieve the field value by looking for any of its constituent tokens: any search literal or data pattern that matches at least one word in the title returns the whole title content. In the example, the field returns an entity name or its alias, if any; otherwise, its STIX title.

  • meta.title.raw: Enables accessing the indexed, non-tokenized field value. It is possible to retrieve the field value by looking for the whole field value as a string. In the example, the field returns an entity name or its alias, if any; otherwise, its STIX title.
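The practical difference between the two access paths can be sketched as follows. The title value is hypothetical, and the token split is a simplified stand-in for the real analyzer:

```python
import re

def tokens(value):
    # Simplified tokenization: lowercase word split.
    return [t.lower() for t in re.findall(r"\w+", value)]

def matches_tokenized(field_value, term):
    """meta.title-style match: any single token is enough."""
    return term.lower() in tokens(field_value)

def matches_raw(field_value, term):
    """meta.title.raw-style match: the whole value as one string."""
    return field_value == term

title = "Operation BlackEnergy"  # hypothetical entity title
print(matches_tokenized(title, "blackenergy"))      # True
print(matches_raw(title, "blackenergy"))            # False
print(matches_raw(title, "Operation BlackEnergy"))  # True
```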

Search in root elements other than data

To specify selection criteria pointing to entity data outside the predefined data root JSON object, you can define a different root element than data.
For example, you may want a rule to return matches based on specific tags, metadata, or observable attributes.

To set a JSON path defining a field name other than data as a root field, prefix the field name with raw.:

  • raw. must be the first element in the JSON path defining the field name.

  • The second element in the JSON path after raw. becomes the designated JSON path root element for the specified path.

Example

  • raw.tags (custom root field: tags): Enables accessing entity tag field values through searching, filtering, and rules.

  • raw.extracts.kind (custom root field: extracts.kind): Enables accessing observable type field values through searching, filtering, and rules.
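The raw. prefix rules above can be sketched as a small path parser. This is illustrative only; it assumes that data remains the default root when no prefix is given:

```python
def resolve_root(path):
    """Return (root_element, effective_path) for a search path.
    raw. must be the first element; the element after it becomes
    the designated root element for the path."""
    parts = path.split(".")
    if parts[0] == "raw" and len(parts) > 1:
        return parts[1], ".".join(parts[1:])
    return "data", path  # default root element

print(resolve_root("raw.tags"))           # ('tags', 'tags')
print(resolve_root("raw.extracts.kind"))  # ('extracts', 'extracts.kind')
print(resolve_root("city_name"))          # ('data', 'city_name')
```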