About datasets


Datasets are generic containers that help you organize and manage sets of entities and observables around shared characteristics, context, and themes.

Datasets are arbitrary data collections: you can edit and delete their contents at any time.
You can create datasets to group entities for reference, for further analysis, to set them aside temporarily and pick them up again later, and so on.

Datasets help you organize your intelligence. You can create datasets to group information based on any criteria that matter to you.
For example, you can create datasets to group entities based on:

  • Entity type.

  • A specific threat scenario you are analyzing.

  • An incident.

  • A threat actor.

  • A targeted victim, and so on.

Or you can create datasets based on themes, for example:

  • Countries.

  • APT groups.

  • Vulnerability types.

  • Targeted infrastructure.

Subdividing a heterogeneous cyber threat intelligence corpus into smaller, more consistent, and more manageable chunks brings structure and clarity.
This helps you see the forest for the trees, so that you can identify what matters to you more quickly and efficiently.


About dataset access control

To control user access to datasets, save them to workspaces.
Like graphs, datasets inherit their access control rights from the workspace(s) they belong to.

Only workspace owners and collaborators can access datasets that belong to a workspace.
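The inheritance rule above can be sketched in a few lines of Python. This is an illustrative model only, not the platform's implementation: the Workspace and Dataset classes and the can_access function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical workspace model: a name plus its owners and collaborators."""
    name: str
    owners: set[str] = field(default_factory=set)
    collaborators: set[str] = field(default_factory=set)

    def members(self) -> set[str]:
        # Owners and collaborators are the only users with access.
        return self.owners | self.collaborators


@dataclass
class Dataset:
    """Hypothetical dataset model: it can belong to one or more workspaces."""
    name: str
    workspaces: list[Workspace] = field(default_factory=list)


def can_access(user: str, dataset: Dataset) -> bool:
    # The dataset has no access rules of its own: it inherits them from
    # the workspace(s) it belongs to.
    # Assumption of this sketch: a dataset in no workspace is inaccessible.
    return any(user in ws.members() for ws in dataset.workspaces)
```

For example, if a dataset is saved to a workspace owned by "alice" with "bob" as a collaborator, can_access returns True for both of them and False for any other user.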


Static and dynamic datasets

Static or dynamic?

Summary

  • Avoid rules with static datasets.

  • Use rules with dynamic datasets.

Static and dynamic datasets have different computational costs: static datasets are more expensive than dynamic datasets.

About static datasets

As a general guideline, it is better to avoid applying rules to static datasets.

Static datasets are defined in the PostgreSQL database.

Each time data are added to or removed from static datasets, the database tables need to be updated accordingly.
This process can be expensive, and as a consequence performance can slow down.

If you apply rules to static datasets, the entity with the most recent timestamp replaces the same entity with an older timestamp in the static dataset.
The replacement can be a newer version of the entity, or the same version of the entity with changes only in its meta section:

  • Changes to the data section of an entity create a new version of the entity.
    They also add a new log entry to the entity history to record the changes.

  • Changes to the meta section of an entity do not create a new version of the entity.
    However, they do update the timestamp value of the last_updated_at database field.

  • Update strategies rely on the last_updated_at database field to identify entities whose timestamp value was updated since the previous execution of the outgoing feed.
    Entities with a more recent timestamp value compared to the previous execution of the outgoing feed are packaged and included in the published content of the outgoing feed.
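The three points above can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not the platform's code: the Entity class and the update_data, update_meta, and changed_since functions are hypothetical, and the sketch assumes that a data change also refreshes last_updated_at.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Entity:
    """Hypothetical entity model with a data section and a meta section."""
    entity_id: str
    data: dict
    meta: dict
    last_updated_at: datetime
    version: int = 1
    history: list[str] = field(default_factory=list)


def update_data(entity: Entity, new_data: dict, now: datetime) -> None:
    # A change to the data section creates a new version and a history entry.
    entity.data = new_data
    entity.version += 1
    entity.history.append(f"v{entity.version}: data changed")
    # Assumption: the timestamp is refreshed together with the new version.
    entity.last_updated_at = now


def update_meta(entity: Entity, new_meta: dict, now: datetime) -> None:
    # A change to the meta section keeps the version as is,
    # but still updates the last_updated_at timestamp.
    entity.meta = new_meta
    entity.last_updated_at = now


def changed_since(entities: list[Entity], previous_run: datetime) -> list[Entity]:
    # An update strategy only compares timestamps: it selects every entity
    # touched after the previous execution of the outgoing feed.
    return [e for e in entities if e.last_updated_at > previous_run]
```

In this model, a meta-only change is enough for changed_since to select an entity again, even though its version number stays the same.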

About dynamic datasets

Dynamic datasets are more rule-friendly.

If you apply rules to dynamic datasets, a more recent version of an entity that a rule retrieves replaces the corresponding previous version in the dynamic dataset only if the new version satisfies the search query criteria.
This is computationally cheaper and faster than updating a static dataset.
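The replacement condition for dynamic datasets can be sketched as follows. This is a minimal illustration, not the platform's implementation: the DynamicDataset class and its offer method are hypothetical, and the search query is modeled as a plain Python predicate. The text does not say what happens to the stored version when the new version does not match the query; this sketch leaves it untouched, which is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DynamicDataset:
    """Hypothetical dynamic dataset: a search query plus the matching entities."""
    query: Callable[[dict], bool]
    members: dict[str, dict] = field(default_factory=dict)

    def offer(self, entity_id: str, new_version: dict) -> bool:
        """Replace the stored version only if the new version matches the query."""
        if self.query(new_version):
            self.members[entity_id] = new_version
            return True
        # Assumption: a non-matching version leaves the stored one unchanged.
        return False
```

For example, with a query that accepts only entities of type "indicator", offering a newer indicator version replaces the stored one, while offering a version whose type changed to "report" is rejected.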