Why STIX 2.1 as the output format
Every persisted behavioural finding passes through
orchestrator.stix_builder and ends up as a STIX 2.1 object in
Postgres. This page explains that choice and its consequences.
What STIX is
STIX (Structured Threat Information eXpression) is an
OASIS-standardised schema for sharing threat intelligence. Version 2.1
defines 19 “domain objects” (SDOs: malware, indicator,
attack-pattern, etc.), 14 “cyber observable objects” (SCOs: file,
process, network-traffic, domain-name, etc.), and a set of
relationship types.
Objects are JSON with a deterministic id field ({type}--{UUID}), a
type, a spec_version, timestamps, and type-specific fields. Every
SDO/SCO we emit includes an x_analysis_metadata extension that ties
it back to an analysis_id.
Why STIX
Because consumers expect it. The downstream target is GNAT — a threat-intel platform that speaks STIX natively, has a STIX ORM, and mostly exchanges STIX with other platforms. Any custom format we invented would be a translation layer GNAT would have to own.
Because it survives schema churn. SandGNAT’s Postgres tables (see database-schema.md) will evolve — Phase 3 added intake columns, Phase 4 added similarity, Phase 5 added static fingerprint fields. STIX objects don’t see any of that: an external consumer cares about “what did the malware do” and STIX answers that question at a stable abstraction level.
Because relationships are a first-class concept. Every STIX object
carries object_refs and can be linked via explicit relationship
SDOs. That lets SandGNAT express “this file was dropped by this
process” or “this indicator points to this network-traffic”
without reinventing the graph shape every integration.
Because IDs are deterministic. SandGNAT uses UUIDv5 derived from
(analysis_id, type, natural_key) so re-ingesting the same artifacts
produces the same STIX IDs. That makes the writes idempotent: a
re-run of the analyzer on the same completed/ workspace is a no-op
at the DB layer thanks to ON CONFLICT DO NOTHING.
What STIX is not
STIX isn’t a storage-efficient format. A bundle with hundreds of process-registry-write observables is tens of KB of JSON. We accept the bloat because:
- Postgres JSONB is compressed on disk (TOAST).
- We extract hot columns (hashes,
analysis_id,type) into relational indices, so queries don’t scan JSON. - The read path through the export API serves whole bundles rarely —
most analyst queries hit
/analyses,/static,/similar, which don’t return STIX.
STIX also isn’t a great match for clustering metadata. Trigram signatures, LSH bands, similarity edges, and lineage relations are not STIX objects — they live in normalised Postgres tables and never leak into the bundle. The boundary is: anything an external threat analyst would want to know about the sample goes through STIX; anything SandGNAT needs for its own clustering stays in native tables.
What we actually emit
For a typical completed detonation:
| STIX type | Count | Source |
|---|---|---|
malware |
1 | wraps the entire analysis |
file |
1 + per drop | sample itself + every dropped file |
process |
per PID | ProcMon events grouped by PID |
network-traffic |
per flow | PCAP flows |
indicator |
per persistence key + per confirmed IOC | RegShot + PCAP |
domain-name, ipv4-addr |
per DNS/flow | PCAP |
Full per-type field coverage is in reference/stix-output.md.
The bundle
persistence.export_bundle(analysis_id) assembles everything STIX
objects with that x_analysis_metadata.analysis_id into a single
STIX 2.1 Bundle. id is derived from the bundle’s contents so two
identical bundles get the same id.
That’s what GET /analyses/<id>/bundle returns.
Custom extensions we use
STIX allows x_-prefixed keys on any object. We use a handful:
x_analysis_metadataon every object — links toanalysis_id, VM UUID, tool provenance, analyst confidence.x_quarantine_pathonfileobjects for dropped files.x_artifact_disposition("quarantined") on file objects.x_created_by_process_reflinking a droppedfileto its creatingprocess.x_registry_modificationsonprocessobjects — summaries of RegSetValue events captured in ProcMon.x_http_headersonnetwork-trafficwhen the PCAP parser extracts them.x_imphashkey inside the standardhashesdict onfileobjects (imphash isn’t in the STIXhash-algorithm-ovvocabulary).
All extensions are documented in reference/stix-output.md.
What we don’t do
- We don’t emit Relationship SDOs. The
object_refsfield onmalwareis enough for our graph shape; explicitrelationshipSDOs are more ceremony than value. - We don’t emit CourseOfAction, CampaignReport, IntrusionSet. Those are analyst-curated; SandGNAT only emits machine-observed facts.
- We don’t emit CAPA capabilities as
attack-patternSDOs. CAPA findings live instatic_analysis.capa_capabilitiesas normalised JSONB. Promoting them to attack-pattern STIX is additive and could ship in a later phase.
STIX vs our own tables: who owns what
For a given sample:
| Question | Where it lives |
|---|---|
| What was the SHA-256? | analysis_jobs + STIX file |
| What files did it drop? | dropped_files (normalised) + STIX file(s) |
| What registry keys did it modify? | registry_modifications + STIX indicator |
| What domains/IPs did it contact? | network_iocs + STIX network-traffic |
| Is this a near-duplicate of prior analysis X? | analysis_jobs.near_duplicate_of (not STIX) |
| What’s its CAPA capability fingerprint? | static_analysis.capa_capabilities (not STIX) |
| What’s its byte-trigram MinHash signature? | sample_trigrams (definitely not STIX) |
The rule: STIX for analyst-facing facts; native tables for SandGNAT’s operational bookkeeping. Blurring that line causes pain for both external consumers (who see SandGNAT-specific fields in STIX they don’t understand) and operators (who can’t answer “why didn’t we detonate?” from a STIX bundle).