Datasets#
Datasets are generic Entity containers that help you organize Entities and Observables around shared characteristics or context.
They can be Search query datasets, i.e. based on a search query, or Collection datasets, i.e. arbitrary datasets that you manually fill with Entities.
You can create datasets to group information based on any criterion that matters to you. For example, you can create datasets to group Entities based on:
Entity type.
A specific threat scenario you are analyzing.
An incident.
A threat actor.
A targeted victim, and so on.
Or you can create datasets based on themes, for example:
Countries.
APT groups.
Vulnerability types.
Targeted infrastructure.
Subdividing a heterogeneous cyber threat intelligence corpus into smaller, more consistent, and more manageable chunks brings structure and clarity. This helps you see the forest for the trees, so that you can identify what matters to you more quickly and efficiently.
Dataset access control#
To control user access to datasets, save them to workspaces.
Like graphs, datasets inherit their access control rights from the workspace(s) they belong to.
Only workspace owners and collaborators can access datasets that belong to a workspace.
Collection and Search query datasets#
Collection datasets are arbitrary collections that you create by manually adding Entities to them.
Search query datasets are dynamic collections that contain all the Entities matching the underlying search query. If an Entity changes and no longer satisfies the search query criteria, it is removed from the dataset. If a new Entity matching the search query is created, or an existing Entity changes in such a way that it starts matching the query, it is added to the dataset.
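The difference between the two dataset kinds can be sketched in Python. This is an illustrative model only: the class names, fields, and methods below are assumptions for the sketch, not the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Entity:
    name: str
    type: str

@dataclass
class CollectionDataset:
    # Static membership: changes only when you add or remove Entities yourself.
    members: List[Entity] = field(default_factory=list)

    def add(self, entity: Entity) -> None:
        self.members.append(entity)

@dataclass
class SearchQueryDataset:
    # Dynamic membership: re-evaluated against the query on every read,
    # so Entities enter and leave the dataset as they change.
    query: Callable[[Entity], bool]

    def members(self, all_entities: List[Entity]) -> List[Entity]:
        return [e for e in all_entities if self.query(e)]

# A Search query dataset of all indicator-type Entities:
entities = [Entity("report-1", "report"), Entity("bad.example.com", "indicator")]
indicators = SearchQueryDataset(query=lambda e: e.type == "indicator")
assert [e.name for e in indicators.members(entities)] == ["bad.example.com"]

# If the Entity changes so that it no longer matches, it drops out automatically:
entities[1].type = "ttp"
assert indicators.members(entities) == []
```

The sketch shows why Search query membership stays current without manual upkeep: it is recomputed from the query, whereas a Collection only reflects the Entities you explicitly added.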
Collection or Search query?
Collection and Search query datasets have different computational costs: Collection datasets are more expensive than Search query datasets.
Avoid applying rules to Collection datasets. Use rules with Search query datasets.
Collection datasets#
Collection datasets are defined in the PostgreSQL database, meaning that each time data is added to or removed from a Collection dataset, the database tables need to be updated accordingly. This process can be expensive, and as a consequence performance can slow down.
If you apply rules to Collection datasets, the Entity version with the most recent timestamp replaces the version of the same Entity with an older timestamp in the Collection dataset. This can be a newer version of the Entity, or the same version of the Entity with changes only in its meta content section:
Changes to the data section of an Entity create a new version of the Entity. They also add a new log entry to the Entity history to record the changes.
Changes to the meta section of an Entity do not create a new version of the Entity. However, they do update the timestamp value of the last_updated_at database field.
Update strategies rely on the last_updated_at database field to identify Entities whose timestamp value was updated since the previous execution of the outgoing feed.
Entities with a more recent timestamp value compared to the previous execution of the outgoing feed are packaged and included in the published content of the outgoing feed.
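The update-strategy behavior described above amounts to filtering on last_updated_at. A minimal sketch, assuming a simple dict-based record shape; the function name and fields are illustrative, not the platform's actual schema:

```python
from datetime import datetime, timezone
from typing import Dict, List

def select_updated_entities(entities: List[Dict], previous_run: datetime) -> List[Dict]:
    # Keep only Entities whose last_updated_at timestamp is more recent
    # than the previous execution of the outgoing feed.
    return [e for e in entities if e["last_updated_at"] > previous_run]

previous_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
entities = [
    # Unchanged since the previous run: excluded from the feed package.
    {"name": "stale", "last_updated_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    # A meta-only change bumped last_updated_at: included in the feed package.
    {"name": "fresh", "last_updated_at": datetime(2024, 2, 1, tzinfo=timezone.utc)},
]
assert [e["name"] for e in select_updated_entities(entities, previous_run)] == ["fresh"]
```

Note that because meta-only changes also bump last_updated_at, an Entity can be republished by the feed even though no new version of it was created.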
Search query datasets#
Search query datasets are more rule-friendly than Collection datasets.
If the rules you apply change Entities in a Search query dataset, the old version of the Entity is excluded from the dataset, and the new version of the Entity is included only if it satisfies the search query criteria.
This is computationally cheaper and faster.