Threat intel databases, part two

This post continues from “Threat intel databases, part one”. For simplicity, mentions of threat intel can be considered to include geolocation data.

Threat Intel Acquisition #

Day 0

Flat files versus the world. On day 0, your focus should be flat files; streaming and API-based feeds can wait. Flat files provide the most lift for the effort applied. This assumes that well-known sources such as abuse.ch, PaaS mappings1, and customer-submitted threat intel / trusted entities are important. With flat files, all of your customers will be able to contribute a CSV of trusted IPs or malicious entities.

Day 0+

You are acquiring abuse.ch, AWS ip-json, and all the other feeds that are table stakes. Your research team's submissions are integrated via flat files. Now you are ready to move into commercial feeds. Once you do, you will find that most vendors want you to use an API-based model where you pay per entity (IP/domain). Common reasons vendors push the API model:

  • They cannot derive any product insights from a flat file model
  • Hard entitlement enforcement: your only relief from a 1:1 observations-to-queries ratio is caching
  • They do not have a production-ready bulk file option

For your vendor, the API-based model drives account management, product improvement, and increased spend.

There is a window where an API-based model is acceptable; past this window, it no longer scales as a technology or as a cost.

API best practices #

Caching API responses can significantly improve application performance. Generally, domains and file hashes can be cached for long durations. IP addresses should not be cached for more than a few hours, or you risk false positives. Practically, your max cache duration should be aligned with your acquisition windows2. e.g., if you acquire every 12 hours, don’t cache for 24. If you cache longer than your acquisition window, troubleshooting false positives will devolve into a negative QA experience executed by your research and response teams.

Using Redis keys with TTLs is a practical solution: refresh the TTLs when you process new threat intel, and old data will automatically expire. Whatever caching implementation you choose, your research and response teams will need access to it. Specifically, they need to be able to validate detections and purge false positives from the cache without pulling engineering into the incident response.
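The TTL pattern above can be sketched in pure Python. In production this would be Redis (SET with an EX option, or EXPIRE); the class and method names here are illustrative, not a real API.

```python
import time

class ThreatIntelCache:
    """Minimal in-memory sketch of the Redis TTL pattern."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds          # align with your acquisition window
        self._store = {}                # key -> (value, expires_at)

    def set(self, key, value):
        # Refresh the TTL every time new threat intel is processed.
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:  # expired, treat as a miss
            del self._store[key]
            return None
        return value

    def purge(self, key):
        # Response-team escape hatch: remove a false positive immediately,
        # without waiting for the TTL or involving engineering.
        self._store.pop(key, None)
```

The `purge` method is the piece that matters for your research and response teams; expose it to them directly.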

Do not cache your cache. No happiness comes from Dante’s 9 circles of threat intel caching.

Common ways cyber-security companies discover they are over their commercial threat intel API limits:

  • Fail closed: the API integration is blocking for your application, you hit your API limit and trigger an outage.
  • Fail open: the API integration is not blocking, you miss a table stakes detection and your customer escalates. This is a toxic false negative3.

Monitor your API call counts versus quota counts.
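A minimal sketch of that monitoring, assuming a simple call counter checked against the contracted quota; the class name and the alert hook are stand-ins for whatever metrics or paging system you actually use.

```python
class QuotaMonitor:
    """Track API calls against a commercial feed's quota.

    `alert` is a placeholder hook; wire it to your metrics/paging system.
    """

    def __init__(self, quota, warn_ratio=0.8, alert=print):
        self.quota = quota
        self.warn_ratio = warn_ratio
        self.alert = alert
        self.calls = 0

    def record_call(self):
        self.calls += 1
        if self.calls >= self.quota:
            self.alert(f"quota exhausted: {self.calls}/{self.quota}")
        elif self.calls >= self.quota * self.warn_ratio:
            self.alert(f"quota warning: {self.calls}/{self.quota}")

    def remaining(self):
        return max(self.quota - self.calls, 0)
```

Alerting at a fraction of the quota (80% here) gives you time to throttle or renegotiate before you hit fail-open or fail-closed behavior.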

Post API #

Bulk flat files are your friend. Flat files still need caching and research-and-response enablement. The major downsides to bulk flat files are:

  • Many vendors have no idea how to price it. This will be apparent in your discovery calls with their sales team
  • Cost: you are unlikely to encounter a vendor where spend goes down when transitioning from API to bulk
  • Data quality: bulk flat files expose the data quality issues that are often less apparent with APIs

Additional considerations #

All threat intel sources will have false positives: commercial, free, open-source, and your research team. False positives range from the obvious and outright, e.g., 1.2.3.4 flagged as malware when it is not, to those derived from cultural differences or opinions. Obvious false positives are easy to solve through improved processes and threat acquisition filtering capabilities. Opinion-derived false positives will need to be handled via product enhancements accessible to your customer.

There is no standard format for threat intel, and there are extensive quality control issues. You will often see the same source use null, "None", and "N/A" values interchangeably. Timestamps can have wild variations.

File hashes can have prolific growth. For example, we once discovered a malicious Android app. For weeks after the initial discovery, we were acquiring 700k+ new hashes for the same malware every day. Even prolific growth still fits in your Redis cache.

File size (bulk) can vary wildly from a few megabytes to hundreds of megabytes per feed per download.

Threat intel and geolocation data persistence #

Amazon Athena / S3 #

Marmot acquires threat intel and persists it to S3. Post acquisition and validation, data is written to two locations:

  • latest
  • archive
latest:  s3://acme-bucket/threat_intel/external/latest/abuse_ch/latest.jsonl.gz
archive: s3://acme-bucket/threat_intel/external/archive/abuse_ch/year=2022/month=08/day=24/<relevant_filename>.jsonl.gz
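Building that key pair per acquisition event can be sketched as follows. The bucket prefix mirrors the layout above; the archive filename convention is an assumption, since the real filename is elided in the path.

```python
from datetime import datetime, timezone

def s3_keys(source_name, acquired_at=None, prefix="threat_intel/external"):
    """Build the latest/archive S3 key pair for one acquisition event.

    The archive filename pattern below is illustrative; substitute your
    own convention for the relevant filename.
    """
    ts = acquired_at or datetime.now(timezone.utc)
    latest = f"{prefix}/latest/{source_name}/latest.jsonl.gz"
    archive = (f"{prefix}/archive/{source_name}/"
               f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
               f"{source_name}_{ts:%Y_%m_%d__%H_%M}.jsonl.gz")
    return latest, archive
```

Keeping the Hive-style year/month/day segments in the archive key is what makes the partitioned Athena queries shown below possible.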

Latest and archive are available for searching via Athena.

Latest

Latest is your primary query source for any recent observations that occurred within your last acquisition window. This should cover the majority of your product’s Athena-based queries.

Archive

Archive threat intel provides value for

  • Evidence: A customer has a question on an event that is days to weeks old. The threat intel artifact persisted with the event does not contain enough evidence.
  • Research: Your team is working on a threat report or investigating events from a relevant date in your archive
  • Analytics: Statistical analysis of threat metadata to create new filters or derivative threat intel

The S3 keys year, month, and day are Hive style partitions which can be queried as columns. For example:

SELECT *
FROM abuse_ch
WHERE "year" = 2022
  AND "month" = 5
  AND "day" = 5
LIMIT 10;

Partitions are a powerful feature that improves performance by limiting the amount of data crawled with a query. Many relevant variations are possible including partitioning by source names, customer UUIDs, etc.

PostgreSQL #

Create a similar structure as S3 in PostgreSQL, where threat intel from the same source is not co-mingled across acquisition events.

DB considerations

  • Avoid updates and deletes. Threat intel metadata can be highly ephemeral
  • Build your indexes in one shot
  • Consider truncating threat intel data before requiring TOAST

These constraints, minus TOAST, push towards a design where a table is created for each acquisition event.

Persistence process #

Post acquisition to S3, sanitization, and validation4:

  • Create connection with autocommit=False
  • Create cursor with context manager
    • Create database table with a unique name. ex: abuse_ch_2022_05_05__06_00
    • Insert all rows
    • Commit
  • Create cursor with context manager
    • Add indexes
    • Commit
  • Create cursor with context manager
    • Add new row, referencing the new table, to table inventory table
    • Commit
  • Table is now accessible for new queries
  • Close connection5
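The steps above might be sketched as follows. The connection is passed in already configured with autocommit=False; the column layout, the inventory table name, and the helper names are illustrative, not a fixed schema.

```python
import re
from datetime import datetime, timezone

def unique_table_name(source_name, acquired_at=None):
    """e.g. abuse_ch_2022_05_05__06_00; one table per acquisition event."""
    ts = acquired_at or datetime.now(timezone.utc)
    name = f"{source_name}_{ts:%Y_%m_%d__%H_%M}"
    # The name is built from values we control, but sanity-check it anyway
    # before it is interpolated into DDL.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError(f"unsafe table name: {name}")
    return name

def load_acquisition(conn, source_name, rows):
    """Sketch of the commit sequence above (requires autocommit=False).

    `rows` is an iterable of (indicator, metadata) tuples.
    """
    table = unique_table_name(source_name)
    with conn.cursor() as cur:                   # create table, insert rows
        cur.execute(f"CREATE TABLE {table} (indicator text, metadata jsonb)")
        cur.executemany(f"INSERT INTO {table} VALUES (%s, %s)", rows)
    conn.commit()                                # one commit for all rows
    with conn.cursor() as cur:                   # indexes in one shot
        cur.execute(f"CREATE INDEX ON {table} (indicator)")
    conn.commit()
    with conn.cursor() as cur:                   # publish via the inventory
        cur.execute(
            "INSERT INTO tmp_ti_table_inv (source_name, table_name)"
            " VALUES (%s, %s)",
            (source_name, table),
        )
    conn.commit()                                # table now queryable
    return table
```

Because the application only discovers tables through the inventory insert, the final commit is the atomic "go live" switch for the new acquisition.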

How you use your context manager and when you perform commits is highly dependent on your application structure. The two most impactful commits are:

  1. The commit after inserting all rows. Commits per row will slow down the process.
  2. The commit to the inventory table. This makes the data accessible to the application.

Table inventory table

The inventory table helps with two items: determining the latest table per source, and determining which tables can be pruned.

Determining the latest table

Example query to identify all tables related to a threat intel source and return the latest one:

SELECT
  table_name
FROM tmp_ti_table_inv
WHERE source_name = %(source_name)s
ORDER BY id DESC
LIMIT 1;

Pruning tables

Consider two tables per threat intel source to be a minimum requirement: the latest table plus the table you failed over from. Additional tables are helpful for quick QA. In the following query, the number of tables to keep per source is applied as lowest_rank. Setting lowest_rank to 5 would return all tables older than the most recent 5 tables for each threat intel source; the returned tables are the ones you prune.

SELECT
    id,
    table_name
FROM (
    SELECT tmp_ti_table_inv.*,
           rank() OVER (
               PARTITION BY source_name
               ORDER BY id DESC
           )
    FROM tmp_ti_table_inv
) rf
WHERE rank > %(lowest_rank)s;
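The rank logic in the query above can be mirrored in application code, which is handy for unit-testing your pruning policy without a database. The function and row layout here are illustrative, assuming inventory rows of (id, source_name, table_name).

```python
from collections import defaultdict

def tables_to_prune(inventory_rows, lowest_rank):
    """Return tables older than the newest `lowest_rank` tables per source.

    `inventory_rows` is an iterable of (id, source_name, table_name),
    mirroring the rank()-based SQL shown above.
    """
    by_source = defaultdict(list)
    for row_id, source_name, table_name in inventory_rows:
        by_source[source_name].append((row_id, table_name))
    pruned = []
    for rows in by_source.values():
        rows.sort(key=lambda r: r[0], reverse=True)  # newest id first
        pruned.extend(table for _, table in rows[lowest_rank:])
    return pruned
```

Whichever side computes the prune list, the actual DROP TABLE statements should run in their own transaction, separate from acquisition.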

Simplifying tables #

I use the same table structure for datasets that are similar; AWS and GCP ip-json ranges are a great example of this. The main benefit is query re-use across sources. This does, however, mean you will need to parameterize table names.

Psycopg2 provides functionality for table name parameterization via its sql module.

src_n and src_tn will be safely interpolated into the query.

from psycopg2 import sql

if source_name in {'aws_ip_ranges', 'gcp_ip_ranges'}:
    select_q = sql.SQL("""
    SELECT
      props AS {src_n}
    FROM {src_tn}
    WHERE ip_prefix >> %(ip)s::inet;
    """).format(src_n=sql.Identifier(source_name),
                src_tn=sql.Identifier(src_table_name))

Summary #

Threat intel is a fascinating adventure into data acquisition, sanitization, and filtering, as well as presentation layers. It presents a broad set of challenges. As a table stakes capability, missing these challenges means low efficacy, toxic false positives, and damaged NPS scores. Meeting them is a moving target that is fun and provides tangible value to your research teams, customers, and sales enablement. This post lightly touches on these challenges, and nothing here is gospel. Adapt it to your needs, be flexible, and most of all: have fun, enable your teams, and detect malicious things.


Appendix #

References

  1. a) https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html , b) https://cloud.google.com/compute/docs/faq#find_ip_range
  2. Some threat intel sources offer full threat intel downloads and some offer updates only. The choice of update model has impacts on persistence, pruning, and performance.
  3. Toxic False Negative: A false negative that can impact your sales pipeline, company brand, and product trust. Often associated with a table stakes detection that cannot be explained away.
  4. Do not trust external data. Always sanitize and validate contents.
  5. Connection generation is costly. Closing connections is highly dependent on application structure. In some scenarios, such as connection proxies, it may not be needed. https://aws.amazon.com/rds/proxy/

Notes

  • PostgreSQL interactions occur with Python3 and psycopg2
  • What about STIX and TAXII? I consider this a customer-oriented feature, often not available for acquisition of external feeds. It should exist on your roadmap.
  • Images created using midjourney.com
  • Queries, S3 paths shown are updated to be more generic

(c) Michael Bentley 2022

Contents may not be republished without written consent.