Organize | Datasets#

Datasets are generic Entity containers that help you organize Entities and Observables around shared characteristics or context.

They can be Search query datasets, i.e. based on a search query, or Collection datasets, i.e. arbitrary datasets that you manually fill with Entities.

You can create datasets to group information based on any criterion that matters to you. For example, you can create datasets to group Entities based on:

  • Entity type.

  • A specific threat scenario you are analyzing.

  • An incident.

  • A threat actor.

  • A targeted victim, and so on.

Or you can create datasets based on themes, for example:

  • Countries.

  • APT groups.

  • Vulnerability types.

  • Targeted infrastructure.

Subdividing a heterogeneous cyber threat intelligence corpus into smaller, more consistent, and more manageable chunks brings structure and clarity. This helps you see the forest for the trees, so that you can identify what matters to you more quickly and efficiently.

Dataset access control#

To control user access to datasets, save them to workspaces.

Like graphs, datasets inherit their access control rights from the workspace(s) they belong to.

Only workspace owners and collaborators can access datasets that belong to a workspace.

Collection and Search query datasets#

Collection datasets are arbitrary collections that you create by manually adding Entities to them.

Search query datasets are dynamic collections that contain all the Entities matching their underlying search query. If an Entity changes so that it no longer satisfies the search query’s criteria, it is removed from the dataset. If a new Entity matching the search query is created, or an existing Entity changes in a way that makes it start matching the query, it is added to the dataset.
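The contrast between the two dataset types can be sketched in a few lines of illustrative Python. The `Entity`, `CollectionDataset`, and `SearchQueryDataset` classes below are hypothetical stand-ins for the concepts described above, not the platform's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Hypothetical stand-in for a platform Entity."""
    name: str
    entity_type: str

@dataclass
class CollectionDataset:
    """A Collection dataset holds an explicit, manually curated list of Entities."""
    members: list = field(default_factory=list)

    def add(self, entity):
        # Membership changes only when you add or remove Entities yourself.
        self.members.append(entity)

@dataclass
class SearchQueryDataset:
    """A Search query dataset stores only the query; membership is computed
    from the current Entity corpus, so it always reflects the latest data."""
    query: callable  # predicate standing in for the search query

    def members(self, corpus):
        return [e for e in corpus if self.query(e)]

corpus = [Entity("APT-X report", "report"), Entity("evil.example", "indicator")]
indicators = SearchQueryDataset(query=lambda e: e.entity_type == "indicator")
print([e.name for e in indicators.members(corpus)])  # only matching Entities

# If an Entity changes so that it no longer matches, it drops out automatically:
corpus[1].entity_type = "sighting"
print([e.name for e in indicators.members(corpus)])
```

Because the Search query dataset recomputes membership from the query, no per-Entity bookkeeping is needed when Entities change.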

Collection or Search query?

Collection and Search query datasets have different computational costs, with the former being more expensive than the latter.

Avoid applying rules to Collection datasets. Use rules with Search query datasets.

Collection datasets#

Collection datasets are defined in the PostgreSQL database, meaning that each time data is added to or removed from a Collection dataset, the corresponding database tables need to be updated. This process can be expensive, and as a consequence performance can degrade.

If you apply rules to Collection datasets, the Entity version with the most recent timestamp replaces the older version of the same Entity in the Collection dataset.

The replacement can be a newer version of the Entity, or the same version of the Entity with changes only in its meta content section:

  • Changes to the data section of an Entity create a new version of the Entity.

    They also add a new log entry to the Entity history to record the changes.

  • Changes to the meta section of an Entity do not create a new version of the Entity.

    However, they do update the timestamp value of the last_updated_at database field.

  • Update strategies rely on the last_updated_at database field to identify Entities whose timestamp value was updated since the previous execution of the outgoing feed.

    Entities with a timestamp more recent than the previous execution are packaged and included in the published content of the outgoing feed.
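The versioning behavior above can be sketched as follows. This is illustrative Python only; the `last_updated_at` field name follows the description above, while the `Entity` class, `now()` clock, and `entities_for_feed()` helper are hypothetical:

```python
from itertools import count

_clock = count(1)          # deterministic stand-in for wall-clock timestamps
def now():
    return next(_clock)

class Entity:
    """Hypothetical Entity model illustrating the versioning rules above."""
    def __init__(self, data, meta):
        self.data = data                  # data section
        self.meta = meta                  # meta content section
        self.version = 1
        self.history = []                 # Entity history log entries
        self.last_updated_at = now()

    def update_data(self, **changes):
        # Changes to the data section create a new version and a log entry.
        self.data.update(changes)
        self.version += 1
        self.history.append(f"data changed: {sorted(changes)}")
        self.last_updated_at = now()

    def update_meta(self, **changes):
        # Changes to the meta section only refresh last_updated_at.
        self.meta.update(changes)
        self.last_updated_at = now()

def entities_for_feed(entities, previous_run):
    # An update strategy selects Entities touched since the previous feed run.
    return [e for e in entities if e.last_updated_at > previous_run]

e = Entity(data={"title": "Report"}, meta={"tags": []})
previous_run = now()
e.update_meta(tags=["tlp:green"])         # same version, no history entry...
print(e.version, e.history)               # 1 []
print(entities_for_feed([e], previous_run) == [e])  # ...but included in the feed
```

Note that a meta-only change still bumps `last_updated_at`, which is why it is enough to get the Entity republished by an outgoing feed even though no new version exists.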

Search query datasets#

Search query datasets handle rules more efficiently than Collection datasets.

If a rule changes an Entity in a Search query dataset, the old version of the Entity is excluded from the dataset, and the new version is included only if it satisfies the search query criteria.

This is computationally cheaper and faster.