Trigram similarity and LSH
SandGNAT’s Linux static-analysis stage computes a similarity signature for every sample and uses it to cluster submissions. Two questions this page answers:
- What’s a trigram MinHash signature and how does it estimate similarity?
- Why banded LSH instead of just comparing signatures pairwise?
If you just want to use similarity, see
how-to/query-export-api.md. The short
version: GET /analyses/<id>/similar returns a JSON list of peer
analyses sorted by estimated Jaccard similarity.
The problem
Malware is heavily reused. Two submissions that look byte-identical aren’t interesting (the sha256 deduplicator catches those). Two submissions that are 95% the same — a repacked variant, a new config baked in, a minor string change — are what we want to catch, and sha256 gives zero signal there.
We want a function that, given a new sample’s bytes, can answer “have we seen anything like this before?” in O(corpus_size) at worst and sub-linear for most samples. Full byte-level comparison is O(N²) and doesn’t scale past a few thousand samples.
Trigrams
A trigram is a 3-element window over a sequence.
- Byte trigram: 3 consecutive bytes of a sample’s executable sections.
- Opcode trigram: 3 consecutive mnemonics from a capstone disassembly.
The intuition: two programs with similar control flow produce similar opcode trigram sets, and the set-theoretic overlap (Jaccard) is a decent proxy for “how related are these?”
Jaccard similarity of two sets A and B:
J(A, B) = |A ∩ B| / |A ∪ B|
Perfect overlap = 1.0. Zero overlap = 0.0. Repacked malware typically scores 0.3–0.6 on byte trigrams (resource sections and data drift, but code is preserved); 0.7–0.95 on opcode trigrams after unpacking.
Computing Jaccard directly requires storing every trigram set and intersecting them — too much memory and too slow for a million-sample corpus.
MinHash
MinHash is a probabilistic datastructure that collapses a set of any size into a fixed-length signature such that
Pr(signatures match at position i) ≈ J(A, B)
for independent hash functions h_1..h_k. SandGNAT uses k=128
permutations, so each sample’s signature is 128 × 32-bit ints =
512 bytes.
The algorithm:
- Pick k independent hash functions
h_1..h_k. SandGNAT derives them deterministically from a fixed seed, so signatures are reproducible across hosts and runs. (Seeorchestrator.trigrams._generate_hash_coeffs.) - For a set S, the i-th signature entry is
min(h_i(x) for x in S). - To estimate J(A, B), count positions where signature(A) and signature(B) agree, divide by k.
Error decays as O(1/sqrt(k)). With k=128 the 1-sigma error is about
0.088 — comfortable for a 0.85 short-circuit threshold.
Code: orchestrator.trigrams.minhash() produces a MinHashSignature.
Determinism is covered by test_trigrams.test_minhash_is_deterministic.
Banded LSH
Comparing a new signature against every prior signature is O(N × k) — 128 int-compares times N corpus samples. Still linear, still too slow at scale.
Locality-sensitive hashing solves this with a clever subdivision:
- Split each signature into
bbands ofrrows each (b × r = k). SandGNAT uses b=16, r=8. - For each band, hash its slice of the signature to a short band-value. Two samples share a band iff their band-values at the same index are equal.
- Store
(analysis_id, band_index, band_value)rows in an indexed table (sample_minhash_bandsin Postgres). - Query: for a new signature, look up rows matching each of its 16 band-values. That’s 16 O(1) index probes. The union of matching analysis_ids is the candidate set — samples that might be similar.
- Compute exact Jaccard only against candidates.
The probability that two samples with true Jaccard J share at least
one band is 1 - (1 - J^r)^b. For b=16, r=8:
| True Jaccard | P(share a band) |
|---|---|
| 0.5 | 6.2% |
| 0.7 | 64% |
| 0.8 | 94% |
| 0.85 | 98.8% |
| 0.9 | 99.9%+ |
So at the default short-circuit threshold of 0.85, we miss fewer than 1.2% of true near-duplicates. Below 0.7 we mostly don’t see them — which is fine because we don’t want to short-circuit on those anyway.
Tuning b and r trades recall for precision. More bands = higher
recall but more false-positive candidates to exact-compare. The
defaults are the standard “S-curve at ~0.7” values; tune if your
threshold changes significantly.
Two flavours: byte vs opcode
Byte trigrams are cheap (no disassembly) and language/architecture agnostic — they work on packed PEs, ELFs, Mach-Os, shellcode fragments. Weakness: byte-level rewrites (packers, string encoding) kill the signal.
Opcode trigrams require successful disassembly but are robust to byte-level obfuscation — an xor-encoded code region looks totally different byte-wise but disassembles to the same mnemonic stream once unpacked (assuming the sample unpacks itself; we disassemble the on-disk form).
SandGNAT stores both flavours and computes similarity on each
independently. The flavour config (STATIC_SHORT_CIRCUIT_FLAVOUR=
byte|opcode|either) picks which flavour’s hit triggers the
short-circuit. Default: either — accept whichever scores higher.
See the sample_trigrams and sample_minhash_bands tables in the
database reference for how the
signatures are persisted.
When it goes wrong
- Highly packed samples with code-level entropy near 8.0 — every byte looks random, byte trigrams are useless. Opcode trigrams only help if the packer exposes a small stub we can disassemble. For these cases SandGNAT falls through to detonation (which is what we’d want anyway — you can’t statically understand a packer).
- Tiny samples (<1 KiB) — both the byte trigram set and the disassembly window are too small for stable Jaccard estimates. Signatures are still emitted, but similarity scores should be treated as noisy.
- Signature-version bump — changing the hash family in
orchestrator.trigramsinvalidates every stored signature. Thesignature_versioncolumn lets us coexist with old signatures during a migration, but until a background re-hash job ships, bumping the version means losing similarity data.
References
- Broder (1997): “On the resemblance and containment of documents” — the original MinHash paper.
- Rajaraman & Ullman, Mining of Massive Datasets, Chapter 3 — best tutorial explanation of banded LSH.
orchestrator/trigrams.py— stdlib-only implementation (~150 LoC).orchestrator/similarity.py— candidate fetch + ranking + short-circuit decision.