About ingestion
Ingestion process
EclecticIQ Intelligence Center can ingest data in different formats from multiple sources.
Data ingestion is a multi-step process:
Incoming data flows in, and it is added to the ingestion queue.
Packages are fetched from the ingestion queue to be processed.
Packages are processed, and data is:
Deduplicated
Normalized
Transformed to the Intelligence Center's internal data model:
Entities
Observables
Relationships
Rules and filters further treat the data to:
Enrich entities
Create relationships
Apply tags to entities
Save entities and observables to specific datasets
Assign a maliciousness confidence level to observables.
The Intelligence Center attempts to resolve ID references (idrefs) by checking whether the data the IDs point to is available.
Ingested entities and observables are stored in the databases (main data store, search database, and graph database), where they are searchable and available in the Intelligence Center.
Deduplication
Data deduplication sifts through ingested data to check if it already exists in the Intelligence Center. It avoids cluttering the databases with unnecessary copies of the same data.
The Intelligence Center deduplicates data at package, entity, and observable level.
Package-level deduplication compares newly ingested packages with existing ones.
If an ingested package is a copy of an existing one, that is, if both packages have the same hash value:
The ingested duplicate package is discarded.
The entities from the existing package are updated so that their source information includes the origin of the duplicate package as an additional data source, as sketched below.
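For illustration only, a minimal Python sketch of this behavior, assuming packages are hashed over their raw content and kept in a simple dictionary; the hashing and storage details are hypothetical:

import hashlib

def ingest_package(raw_package: bytes, origin: str, store: dict) -> None:
    """Discard duplicate packages, but record their origin as an extra source."""
    package_hash = hashlib.sha256(raw_package).hexdigest()
    if package_hash in store:
        # Duplicate package: do not ingest it again; only extend the source list
        # associated with the existing package.
        existing = store[package_hash]
        if origin not in existing["sources"]:
            existing["sources"].append(origin)
        return
    # New package: store it together with its first known source.
    store[package_hash] = {"content": raw_package, "sources": [origin]}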
Entity-level deduplication compares newly ingested entities with existing ones, and handles the following scenarios:
Same STIX ID, same data: merged to one entity
Multiple entities with identical STIX ID AND identical content are handled like identical copies.
Identical copies of the same entity are merged to one entity.

Before deduplication | After deduplication
Multiple entities with the same STIX ID and the same data content (identical copies) | The identical copies are merged to one entity.
Same STIX ID, slightly different data, different timestamps: entity update and new version
Multiple entities with identical STIX ID AND slightly different content are handled like minor revisions of the same entity.
In this case, slightly different content means different timestamps and minor differences in the entity content.
For example, an updated minor revision reference, or a small edit such as fixing a typo in the title, description, or a similar field.
The existing entity is updated with the most recent content.
As a result of the entity update, a new version of the entity is created.

Before deduplication | After deduplication
Multiple entities with the same STIX ID, different timestamps, and slightly different content (minor revisions of the same entity, not different versions of it) | The existing entity is updated with the content from the entity variant with the most recent timestamp. The process produces a new version of the entity.
Same STIX ID, same data, different timestamps: timestamp update, no new version
Multiple entities with identical STIX ID AND identical content BUT different timestamps are handled like chronological copies of the same entity.
The timestamp of the existing entity is updated to the most recent value, without creating a new version of the entity.

Before deduplication | After deduplication
Multiple entities with the same STIX ID, the same data content, and different timestamps (chronological copies of the same entity) | The existing entity timestamp value is updated to the most recent one. Updating only the timestamp and leaving all other data unchanged does not produce a new version of the entity.
If you upload an entity by sending a POST request to the /private/entities/ API endpoint, entity deduplication relies on the entity data hash, instead of the entity STIX ID.
In this case, the Intelligence Center checks if the hash value of the uploaded entity already exists.
It does not compare the entity.data["id"] value of the uploaded entity to check if it matches the data.id field value of an existing entity.
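For illustration, a minimal Python sketch of such an upload, assuming the requests library, a placeholder instance URL and token, and token-based authentication; the payload fields other than data.id are purely illustrative:

import requests

IC_URL = "https://ic.example.com"        # hypothetical Intelligence Center instance
API_TOKEN = "replace-with-your-token"    # placeholder credential

# Simplified example payload; only data.id is referenced by the documentation above.
entity = {
    "data": {
        "id": "{http://example.com/}indicator-1a2b3c",
        "type": "indicator",
        "title": "Example indicator",
    }
}

# Deduplication for this endpoint relies on the hash of the entity data,
# not on the data.id value.
response = requests.post(
    f"{IC_URL}/private/entities/",
    json=entity,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()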
Entity resolution
An entity is resolved when its idref matches existing, available data in the Intelligence Center.
An entity is unresolved when its idref does not match any existing data in the Intelligence Center.
idref resolution at entity level
Entities and observables can reference external data through the STIX and CybOX idref field.
When the idref is at entity level – that is, when the entity has a STIX ID – the Intelligence Center creates an entity relationship between the entities.
When a STIX idref references an entity, the Intelligence Center creates a direct relationship to associate the two entities.
When a STIX idref references an entity with the same type and content as the entity it belongs to, and the only differences between the two entities are the timestamp or the version reference values, the Intelligence Center marks all but the latest version in the entity update chain as outdated.
Outdated entities are kept for historical reference; they are not searchable.
idref resolution at nested object level
When the idref id reference is at nested object level – that is, when an entity includes either a referenced entity, or a bundled observable/an embedded CybOX observable with an idref – the Intelligence Center attempts to resolve the bundled object idref by looking for the referenced data:
If the bundled object idref references data available inside the same ingested package, or data stored in the Intelligence Center, the newly ingested idref is removed and it is replaced with the existing referenced data.
Referenced data is available, either inside the same ingested package, or in the Intelligence Center
If the idref references data that is currently unavailable, the reference is resolved if and when the referenced data becomes available.
idrefs pointing to unknown/unavailable data produce placeholder entities that are populated if and when actual data becomes available at a later time.
Referenced data is unavailable; a placeholder points to it, in case it becomes available in the future
The process works identically in the opposite direction: if bundled entity or observable/CybOX data is ingested, the Intelligence Center looks for an existing idref pointing to it.
If it finds a matching idref, the existing idref is removed, and it is replaced with the newly ingested referenced data.
If no matching idref is available, nothing happens.
If a matching idref becomes available at a later time, it will be resolved then by replacing it with the corresponding referenced data.
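The following Python sketch illustrates the general resolution logic described above; the data structures are hypothetical and the sketch is not the Intelligence Center implementation:

def resolve_idrefs(ingested_objects, known_objects):
    """Replace idref references with the referenced data when it is available.

    ingested_objects: list of dicts that may carry an "idref" key.
    known_objects: dict mapping object IDs to already available data.
    """
    resolved = []
    for obj in ingested_objects:
        idref = obj.get("idref")
        if idref is None:
            # Not a reference: keep the object and remember its ID for later lookups.
            known_objects[obj["id"]] = obj
            resolved.append(obj)
        elif idref in known_objects:
            # Referenced data is available: drop the idref and use the actual data.
            resolved.append(known_objects[idref])
        else:
            # Referenced data is unavailable: keep a placeholder that can be
            # populated if the data becomes available later.
            resolved.append({"id": idref, "placeholder": True})
    return resolved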
Refanging URIs
Defanging defuses suspicious or harmful URLs. Although there are no set standards, several approaches to defanging have gained popularity over time, and are now widely used.
The Intelligence Center recognizes defanged URLs by detecting common defanging patterns in potentially malicious URLs.
During ingestion, the Intelligence Center refangs the following types of defanged URIs:
URIs/URLs
IP addresses
Domain names
Email addresses.
Incoming data may contain defanged URIs and URLs: defanging defuses them and makes them safe for dissemination.
Intelligence vendors and providers often defang malicious IPs and domains to prevent recipients from reaching those malicious resources by mistake.
During ingestion, the Intelligence Center looks for defanged URIs; when it detects them, it refangs them – that is, the URIs become dangerous again because they point to potentially existing harmful resources.
Refanged URIs are not clickable hyperlinks: users need to manually copy and paste them into a web browser address bar to reach a malicious resource.
Refanging URIs is a preliminary step to data deduplication.
On ingestion, when the Intelligence Center detects duplicate data that replicates the same information about the same URIs, and the only difference lies in the defanged vs. refanged URI scheme, it refangs the defanged URI scheme, and then it deduplicates the data.
Defanged | Refanged | Example
hxxp | http | hxxp://evil.domain.com/ → http://evil.domain.com/
hxxps | https | hxxps://super.evil.domain.com/ → https://super.evil.domain.com/
fxp | ftp | fxp://super.evil.domain.com/ → ftp://super.evil.domain.com/
://] | :// | http://]evil.domain.com/ → http://evil.domain.com/
\\. [.] (.) DOT [dot] [DOT] (dot) (DOT) | . | https://evil[.]domain[.]com/ → https://evil.domain.com/ ; 217(dot)112(dot)7(dot)96 → 217.112.7.96
[@] (@) AT (at) (AT) | @ | im_hacker(at)evil[.]com → im_hacker@evil.com
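As an illustration of the substitutions in the table above, the following simplified Python sketch applies a few of the common patterns; it is not the Intelligence Center's refanging implementation and does not cover every defanging variant:

import re

# A selection of defanging patterns and their refanged replacements (simplified).
REFANG_RULES = [
    (re.compile(r"hxxps", re.IGNORECASE), "https"),
    (re.compile(r"hxxp", re.IGNORECASE), "http"),
    (re.compile(r"fxp", re.IGNORECASE), "ftp"),
    (re.compile(r"://\]"), "://"),
    (re.compile(r"\[\.\]|\(\.\)|\\\."), "."),
    (re.compile(r"\[dot\]|\(dot\)", re.IGNORECASE), "."),
    (re.compile(r"\[@\]|\(@\)"), "@"),
    (re.compile(r"\(at\)", re.IGNORECASE), "@"),
]

def refang(text: str) -> str:
    # Apply each substitution in order; later rules see the output of earlier ones.
    for pattern, replacement in REFANG_RULES:
        text = pattern.sub(replacement, text)
    return text

print(refang("hxxps://evil[.]domain[.]com/"))  # https://evil.domain.com/
print(refang("im_hacker(at)evil[.]com"))       # im_hacker@evil.com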
Multiple data sources
The Intelligence Center can ingest the same entity from more than one source. While the entity data is deduplicated during ingestion, available information about the data sources the entity originates from is retained in a list of sources.
Multiple data sources for an entity are indexed, and they are searchable.
Information about the multiple data sources of an entity is stored in the sources JSON field as an object array.
{
  "sources": [
    {
      "name": "TAXII Stand Samples Cypress",
      "source_id": "09d01570-476d-4515-a458-faddb43hse86",
      "source_type": "incoming_feed"
    },
    {
      "name": "test.taxiistand.com",
      "source_id": "0bd6014d-e0b4-a8d5-83ac-c107fd034855",
      "source_type": "incoming_feed"
    },
    {
      "name": "TAXII Stand Samples",
      "source_id": "fc602bf6-f653-1234-8dde-b939f2bb13bd",
      "source_type": "incoming_feed"
    }
  ]
}
When an entity has more than one data source, a counter is displayed next to the main entity data source name under the Source column.
Click it to view a tooltip with a list of all the data sources the entity refers to.
About data source changes
Changes to data sources can trigger rules
When the Intelligence Center ingests duplicate data, it updates existing entities to add the origin of the duplicate data as a new data source.
When changes to the data sources for an entity occur – for example, adding a new data source, renaming an enricher, or deleting an incoming feed – the process triggers entity and observable rules that use the changed data sources as a criterion:
The current data sources of the affected entities are reassessed.
For example, any unavailable sources, such as a deleted incoming feed, are dropped; new sources, such as the origin of a discarded duplicate package, are added.
The timestamp recording the most recent entity update is set to now; that is, the time the data source change occurs.
Entity and observable rules are triggered to update any entity information that may have changed because of the data source modification.
It is possible to manually rename enrichers.
However, changing the name of an enricher is equivalent to reassigning enrichment observables to a different source.
This process drops the affected observables from the current source, attaches them to the newly designated source, and triggers a number of rules:
The current data source of the observable is dropped.
The new data source is attached.
The timestamp recording the most recent observable update is set to now – that is, the time the data source change is applied.
All entities in the Intelligence Center are re-indexed, as well as all observables whose data source has changed as a consequence of renaming the enricher they refer to.
Changes to data sources can trigger database reindexing
Entity deletion and incoming feed purge actions trigger a database reindex process, because they involve dropping any data sources referring to the deleted incoming feed content and entities:
When you delete an entity, you also detach from the deleted entity any data sources referring to it.
When you delete or purge an incoming feed, you also remove it as a data source from the entities that refer to it.
Changes to data sources can affect user access
Changes to entity data sources affect user access to entity data.
For example, if the data source an entity refers to changes from group A to group B, group A members who are not also members of group B can no longer access the entity.
About aggregate values
Entities ingested from multiple sources can have different values for timestamp, source reliability, TLP, and half-life.
During ingestion, these values are aggregated based on value type to produce a single value to assign to the ingested entity, as sketched after this list:
Timestamp
When processing multiple timestamps for an entity, the oldest one prevails – that is, the one indicating the earliest moment in time when the entity was ingested.
Source reliability
When processing multiple source reliability values for an entity, the highest one prevails – that is, the one representing the highest data source reliability level.
TLP
When processing multiple TLP values for an entity, the least restrictive TLP value prevails, and it is applied to the resulting ingested entity.
TLP overrides have precedence over the original entity TLP value.
TLP overrides always supersede the original TLP value assigned to an entity, regardless of whether the TLP override is more or less restrictive than the original TLP value.
Half-life
When processing multiple half-life values for an entity, the highest one prevails, and it is applied to the resulting ingested entity.
Processed half-life values always supersede the system-defined or the default half-life value for an entity.
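A minimal Python sketch of the aggregation rules above, assuming TLP colors ordered from least to most restrictive (WHITE, GREEN, AMBER, RED) and numeric values for source reliability and half-life where higher means more reliable or longer-lived; the field names are illustrative only:

# TLP colors ordered from least to most restrictive.
TLP_ORDER = ["WHITE", "GREEN", "AMBER", "RED"]

def aggregate(per_source_values):
    """Collapse per-source values for one entity into a single set of values."""
    return {
        # The oldest (earliest) timestamp prevails.
        "timestamp": min(v["timestamp"] for v in per_source_values),
        # The highest source reliability prevails.
        "source_reliability": max(v["source_reliability"] for v in per_source_values),
        # The least restrictive TLP prevails (TLP overrides are applied separately).
        "tlp": min((v["tlp"] for v in per_source_values), key=TLP_ORDER.index),
        # The highest half-life prevails.
        "half_life": max(v["half_life"] for v in per_source_values),
    }

sources = [
    {"timestamp": "2022-01-10T09:00:00Z", "source_reliability": 2, "tlp": "AMBER", "half_life": 30},
    {"timestamp": "2022-01-08T14:30:00Z", "source_reliability": 3, "tlp": "GREEN", "half_life": 90},
]
print(aggregate(sources))
# {'timestamp': '2022-01-08T14:30:00Z', 'source_reliability': 3, 'tlp': 'GREEN', 'half_life': 90}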
Saving data
Finally, the Intelligence Center saves ingested entities, observables, enrichments, and any established relationships to the databases in the following order:
Main data store (PostgreSQL).
Indexing and search data store (Elasticsearch).
Graph data store (Neo4j).
Saving data URIs
The extracted entity data that is stored inside observables ranges from short, simple data such as email addresses, domain names, IP addresses, and so on, to binary data.
When an entity contains binary data — for example, a file, a memory region, or packet capture (PCAP) data — the data can be represented as either a data URI or a CybOX raw artifact element.
During ingestion, extraction logic handles binary data URI and raw artifact objects embedded in CybOX objects in the following way:
data URIs are extracted and stored as entity attachments and new hash values:
The data URI value is recalculated to a new hash: uri-hash-sha256.
The SHA-256 hash value for uri-hash-sha256 is calculated over the UTF-8 encoding of the data URI string.
The uri-hash-sha256 hash substitute enables entity correlation among entities containing the same data URI.
The binary data/raw content embedded in the data URI is decoded and processed:
The extracted content is hashed using SHA-512, SHA-256, SHA-1, and MD5.
Each resulting hash is added to the relevant entities as an observable.
Example
A data URI with image content nested inside a CybOX object generates the following output:
1 uri-hash-sha256 hash to facilitate entity correlation.
4 calculated hash observables: hash-sha512, hash-sha256, hash-sha1, and hash-md5.
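A minimal Python sketch of the hash calculations described above, using a small, hypothetical base64-encoded data URI; it illustrates the calculations only, not the actual extraction code:

import base64
import hashlib

# A hypothetical data URI with base64-encoded content (here, a short text payload).
data_uri = "data:text/plain;base64," + base64.b64encode(b"example payload").decode()

# uri-hash-sha256: SHA-256 calculated over the UTF-8 encoding of the data URI string.
uri_hash_sha256 = hashlib.sha256(data_uri.encode("utf-8")).hexdigest()

# Decode the embedded raw content and hash it with each algorithm;
# each resulting hash becomes an observable on the relevant entities.
raw_content = base64.b64decode(data_uri.split(",", 1)[1])
content_hashes = {
    "hash-sha512": hashlib.sha512(raw_content).hexdigest(),
    "hash-sha256": hashlib.sha256(raw_content).hexdigest(),
    "hash-sha1": hashlib.sha1(raw_content).hexdigest(),
    "hash-md5": hashlib.md5(raw_content).hexdigest(),
}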
Failed packages
Sometimes ingestion may fail to process a package.
Instead of immediately flagging the package as failed, the process automatically attempts to ingest the package again.
If the package still fails to ingest after a predefined number of attempts, the Intelligence Center flags it as failed.
The following parameters in the /etc/eclecticiq/platform_settings.py configuration file define the behavior of the mechanism that automatically retries to ingest failed packages:
INGESTION_TASK_RETRY_SETTINGS = {
    "max_attempts": 10,
    "retry_delay": 60,  # one minute
    "max_retry_delay": 60 * 90,  # 90 minutes
}
# this results in roughly a week worth of retrying a task in sum
INGESTION_EXTERNAL_TASK_RETRY_SETTINGS = {
    "max_attempts": 25,
    "retry_delay": 60 * 5,  # 5 minutes
    "max_retry_delay": 60 * 60 * 12,  # 12 hours
}
The max_attempts parameter defines how many times the Intelligence Center tries to ingest a package again before flagging it as failed.
The delay parameters are expressed in seconds.
They define the lower and upper boundaries used to calculate the retry interval between two consecutive attempts.
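As an illustration of how such boundaries can translate into a retry schedule, the sketch below assumes exponential backoff starting at retry_delay and capped at max_retry_delay; the exact schedule used by the Intelligence Center may differ:

def retry_delays(max_attempts, retry_delay, max_retry_delay):
    """Yield the wait time in seconds before each retry, assuming exponential backoff."""
    for attempt in range(max_attempts):
        # Double the delay on every attempt, but never exceed the upper boundary.
        yield min(retry_delay * (2 ** attempt), max_retry_delay)

# Using the INGESTION_TASK_RETRY_SETTINGS values: 10 attempts, 60 s initial delay, 90 min cap.
print(list(retry_delays(10, 60, 60 * 90)))
# [60, 120, 240, 480, 960, 1920, 3840, 5400, 5400, 5400]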
To attempt again to ingest a package that has been flagged as failed, you need to manually trigger reingestion.
Ingestion discrepancies
EclecticIQ JSON is the only supported format that enables sending entities, observables, enrichments, and source information from a source Intelligence Center instance to a recipient Intelligence Center instance.
To exchange data across Intelligence Center instances:
In the data source Intelligence Center instance, set the outgoing feed content type to EclecticIQ JSON.
In the data destination Intelligence Center instance, set the incoming feed content type to EclecticIQ JSON.
Fewer entities in target Intelligence Center
When you exchange entity data between two Intelligence Center instances, and if the source data is in a format other than EclecticIQ JSON, an entity count discrepancy between the source and the destination Intelligence Centers may arise at the end of the operation.
This is expected behavior.
For example, let’s assume the source data format is STIX:
STIX and JSON are different formats.
STIX is much more nested than JSON, which has a flatter structure.
STIX entities are bundled in packages; after ingestion, the package is not needed any longer.
The ingestion process needs to parse the source data before ingesting it:
During parsing, STIX relationships between the package and the entities it contains are discarded, as they hold no intel value.
If the source package contains unresolved entities or observables whose idref is resolved during the parsing and ingestion steps, the original unresolved objects are discarded, as they hold no intel value because they are now resolved inside the Intelligence Center.
Relationships between entities or between entities and observables are retained and ingested into the Intelligence Center.
What looks like a discrepancy is the result of a data optimization and normalization process whose purpose is to separate the wheat from the chaff to retain only data with actual intelligence value.
Entity count mismatch
When you add a batch of entities to a dataset, the GUI may display different entity counts in the dataset view and in the Add [n] entities to a dataset dialog window.
This UI inconsistency can impact all supported entity types.
A possible cause of the count mismatch is malformed STIX input, in particular malformed idrefs. These idrefs can point to existing data, in which case they are resolved when the Intelligence Center replaces them with the matching data, or to data that is not yet available in the Intelligence Center, in which case they remain unresolved until matching data becomes available.
The count inconsistency is at GUI level only, and it can occur in the following scenario:
New data is being written to the Intelligence Center, AND
A user is examining the dataset view counter and the Add [n] entities to a dataset dialog window counter at the same time; that is, both GUI counters are displayed at the same time.