Threat intel databases, part one

August 22, 2022

Intro #

Three types of content I manage are threat intel, geolocation, and honeypot observations.

Threat Intel is an opinion on an entity. Often that entity is a file hash, IP address, or domain that is associated with malware.

Geolocation is location information associated with an IP address. For example, an IP associated with cloud providers like AWS and Alibaba, ASN’s, or countries, states and cities. The main data differentiator of geolocation data from threat intel is how the data is queried.

Honeypot observations are data associated with a connection to the Marmot1 honeypot. This includes data such as source IP, ports, and payloads.

Example datasets

Threat intel example from abuse.ch

ioc_value	threat_type	malware_alias
xx.161.27.133:80	botnet_cc	Agentemis,BEACON,CobaltStrike,cobeacon
xx.95.30.6:443	botnet_cc	Agentemis,BEACON,CobaltStrike,cobeacon

Geo ASN example from Ip2Location

beginning_ip	ending_ip	cidr	asn	asn name
16859136	16871423	1.1.96.0/20	2519	Arteria Networks Corporation
16871424	16873471	1.1.112.0/21	2519	Arteria Networks Corporation

Geo example from Ip2Location

network_int	broadcast_int	iso_country	country	region
16777216	16777471	US	United States of America	California
16777472	16778239	CN	China	Fujian

PAAS example from AWS

ip_prefix	region	service	network_border_group
3.2.34.0/26	af-south-1	AMAZON	af-south-1
3.5.140.0/22	ap-northeast-2	AMAZON	ap-northeast-2

The purpose for acquiring threat intel, geolocation, and honeypot data is to ask questions of it. These questions can be for threat hunting or policy-based reasons. Asking questions of the datasets means matching on the following

IP address equals IP address
IP address in CIDR block
IP integer2 between network IP integer and broadcast IP integer

Marmot Databases #

Questions at Marmot are primarily asked via

Amazon Athena
PostgreSQL

Athena #

Athena is used for operational simplicity, cost, and threat hunting on medium-data3 with limited sanitization. Athena was used for the honeypot blog post and was the only content database for the first iterations of Marmot. It continues to be used in parallel with PostgreSQL.

PostgreSQL #

PostgreSQL4 was introduced to help with

rapid iteration on queries and table schemas for new features
support highly repetitive queries
flexibility over how IP addresses are queried

Rapid iteration on queries and table schemas for new features #

Design considerations: All threat intel, geolocation, and honeypot data is persisted to S3 on acquisition. Post acquisition, the data is ETL’d into a bucket for Athena or a table in PostgreSQL. The ETL job is not related to the pruning of acquired data. Once your first ETL scripts for Athena and PostgreSQL are configured in Terraform, additional prototype jobs become ~trivial.

When developing a new feature, I focus on what questions I want to ask before I dive into normalizing data or choose the database. Running queries in Athena and updating Athena schemas has an inherent latency that slows down my development flow. Compounding the latency, I sometimes get the answer to my question and realize I was asking the wrong question.

Once I have validated my questions, how often those questions will be asked, and other details, I choose Athena or PostgreSQL.

Support highly repetitive queries #

For non-bulk data that will see query rates of several per minute, I lean towards PostgreSQL for cost and user-experience.

Cost: Ultimately, I can control costs with API rate limiting. But in an MVP phase, I need to see how users want to use the product, not how they have to use it. With Athena, I may have to use stricter API limits for cost control, and that could be premature.

User-Experience: Athena is asynchronous. Without a UI your users will have to run two to three commands per question. PostgreSQL can be implemented with synchronous behavior via your application for small responses5.

Asynchronous:

Ask the question
Is the answer ready
Download the response

Synchronous:

Ask the question, receive response.

For the right use case, Athena is very cost-effective versus an RDS for PostgreSQL instance. For queries that can rely on indexes and caching, PostgreSQL can be cheaper and faster.

Flexibility over how IP addresses are queried #

IP addresses are queried in Athena as integers. There is no data-type for IP address. Querying via integers occurs in databases that support IP addresses and CIDR notation as well.

BETWEEN queries #

Design considerations: Geolocation is not bound by CIDR blocks. It may include partial ranges. Consequently, you may always have a requirement to search IP ranges instead of CIDR ranges.

BETWEEN: JOIN, Integers

Example of a many-to-many query using a JOIN, BETWEEN

Use-case: Bulk preparation of location data for every IP address observed over a time range in the honeypot or customer dataset. Useful for prepping data for your ETL’s, reducing costs downstream and improving performance for customer queries. This query is applicable via Athena or PostgreSQL. This query gets expensive with IPv66.

Query requires that you have your IP addresses stored as integers (bigint, etcetera).

SELECT
 hp.*,
 l4.country,
 l4.region
FROM honeypot hp
  INNER JOIN ip2location_ipv4 l4 ON hp.src_ip_int BETWEEN l4.begin_int AND l4.end_int;

BETWEEN: WHERE, Integers

Example of a one-to-many query using WHERE, BETWEEN

Use-case: Ad-hoc, user-driven queries. You will need to implement the conversion from IP to integer in your application for a positive user experience. This query is applicable via Athena or PostgreSQL.

Query requires that you have your IP addresses stored as integers (bigint, etcetera).

SELECT
  *
FROM ip2location_ipv4 l4
WHERE 16909060 BETWEEN l4.begin_int AND l4.end_int;

16909060 is equivalent to 1.2.3.47. The user provided 1.2.3.4 in this example.

BETWEEN: WHERE, IP Address

Example of a one-to-many query using WHERE, BETWEEN

Use-case: Ad-hoc, user-driven queries where there is no CIDR block to query. Often associated with higher resolution geolocation queries. This case casts an IP address string to the inet data-type. tmp.network_addr and tmp.broadcast_addr are already set to data-type inet in the schema. This query is applicable via PostgreSQL.

SELECT
  tmp.*
FROM ip2location_ipv4 tmp
WHERE '1.2.3.4'::inet BETWEEN tmp.network_addr AND tmp.broadcast_addr;

The user provided 1.2.3.4 in this example.

IP Addresses and CIDR blocks

If your dataset has CIDR blocks, like AWS and GCP ranges, you can use the PostgreSQL network operators.

Use-case: User-driven queries where the dataset has a CIDR block to query. This example casts the data-type for the input data as well as the dataset.

SELECT
    *
FROM aws_ranges aws
WHERE '1.2.3.4'::inet << aws.ip_prefix::cidr;

The user provided 1.2.3.4 in this example.

Summary #

Security is not one size fits all. Different features and different customers will have varying access patterns for data. It is important to have the ability to serve data via varying database types for the best user-experience and product efficacy. It is equally essential to support your engineers with the tooling they need to experiment and iterate quickly. Plan for any medium-data or larger dataset to need a map-reduce and relational database capabilities.

Appendix #

Remarks

Marmot: Ambiguous name for my security platform.
integer: A number without a fractional component. Not to be confused with a database or language data-type.
medium-data: Smaller than Big-Data, unless you have a marketing department.
PostgreSQL is the AWS RDS implementation for this post.
Pagination could be async. Recommend reading the following if you need pagination.
More data more problems. IPv6 Geo mappings are large in the number of rows and the size of the integers.

 from ipaddress import IPv4Address
 print(int(IPv4Address('1.2.3.4')))
 >>> 16909060

Image in post created using midjourney.com

Databases

I reference Athena and PostgreSQL exclusively in this doc. Other databases support these use-cases. Choose the database appropriate for your environment or infrastructure.

Data sanitization

Validate all user supplied data before submitting it to your database. Reject what is not expected.

Basic examples:

Length limitations for integers
Validate IP addresses with the ipaddress library or your language of choice’s version.

Parameterize all data submitted by your users to the database. There is no wiggle room on this.

Contents may not be republished without written consent.