
Intro #
Three types of content I manage are threat intel, geolocation, and honeypot observations.
Threat Intel is an opinion on an entity. Often that entity is a file hash, IP address, or domain that is associated with malware.
Geolocation is location information associated with an IP address. For example, an IP associated with cloud providers like AWS and Alibaba, ASN’s, or countries, states and cities. The main data differentiator of geolocation data from threat intel is how the data is queried.
Honeypot observations are data associated with a connection to the Marmot1 honeypot. This includes data such as source IP, ports, and payloads.
Example datasets
Threat intel example from abuse.ch
ioc_value | threat_type | malware_alias |
---|---|---|
xx.161.27.133:80 | botnet_cc | Agentemis,BEACON,CobaltStrike,cobeacon |
xx.95.30.6:443 | botnet_cc | Agentemis,BEACON,CobaltStrike,cobeacon |
Geo ASN example from Ip2Location
beginning_ip | ending_ip | cidr | asn | asn name |
---|---|---|---|---|
16859136 | 16871423 | 1.1.96.0/20 | 2519 | Arteria Networks Corporation |
16871424 | 16873471 | 1.1.112.0/21 | 2519 | Arteria Networks Corporation |
Geo example from Ip2Location
network_int | broadcast_int | iso_country | country | region |
---|---|---|---|---|
16777216 | 16777471 | US | United States of America | California |
16777472 | 16778239 | CN | China | Fujian |
PAAS example from AWS
ip_prefix | region | service | network_border_group |
---|---|---|---|
3.2.34.0/26 | af-south-1 | AMAZON | af-south-1 |
3.5.140.0/22 | ap-northeast-2 | AMAZON | ap-northeast-2 |
The purpose for acquiring threat intel, geolocation, and honeypot data is to ask questions of it. These questions can be for threat hunting or policy-based reasons. Asking questions of the datasets means matching on the following
- IP address equals IP address
- IP address in CIDR block
- IP integer2 between network IP integer and broadcast IP integer
Marmot Databases #
Questions at Marmot are primarily asked via
- Amazon Athena
- PostgreSQL
Athena #
Athena is used for operational simplicity, cost, and threat hunting on medium-data3 with limited sanitization. Athena was used for the honeypot blog post and was the only content database for the first iterations of Marmot. It continues to be used in parallel with PostgreSQL.
PostgreSQL #
PostgreSQL4 was introduced to help with
- rapid iteration on queries and table schemas for new features
- support highly repetitive queries
- flexibility over how IP addresses are queried
Rapid iteration on queries and table schemas for new features #
Design considerations: All threat intel, geolocation, and honeypot data is persisted to S3 on acquisition. Post acquisition, the data is ETL’d into a bucket for Athena or a table in PostgreSQL. The ETL job is not related to the pruning of acquired data. Once your first ETL scripts for Athena and PostgreSQL are configured in Terraform, additional prototype jobs become ~trivial.
When developing a new feature, I focus on what questions I want to ask before I dive into normalizing data or choose the database. Running queries in Athena and updating Athena schemas has an inherent latency that slows down my development flow. Compounding the latency, I sometimes get the answer to my question and realize I was asking the wrong question.
Once I have validated my questions, how often those questions will be asked, and other details, I choose Athena or PostgreSQL.
Support highly repetitive queries #
For non-bulk data that will see query rates of several per minute, I lean towards PostgreSQL for cost and user-experience.
Cost: Ultimately, I can control costs with API rate limiting. But in an MVP phase, I need to see how users want to use the product, not how they have to use it. With Athena, I may have to use stricter API limits for cost control, and that could be premature.
User-Experience: Athena is asynchronous. Without a UI your users will have to run two to three commands per question. PostgreSQL can be implemented with synchronous behavior via your application for small responses5.
Asynchronous:
- Ask the question
- Is the answer ready
- Download the response
Synchronous:
- Ask the question, receive response.
For the right use case, Athena is very cost-effective versus an RDS for PostgreSQL instance. For queries that can rely on indexes and caching, PostgreSQL can be cheaper and faster.
Flexibility over how IP addresses are queried #
IP addresses are queried in Athena as integers. There is no data-type for IP address. Querying via integers occurs in databases that support IP addresses and CIDR notation as well.
BETWEEN queries #
Design considerations: Geolocation is not bound by CIDR blocks. It may include partial ranges. Consequently, you may always have a requirement to search IP ranges instead of CIDR ranges.
BETWEEN: JOIN, Integers
Example of a many-to-many query using a JOIN, BETWEEN
Use-case: Bulk preparation of location data for every IP address observed over a time range in the honeypot or customer dataset. Useful for prepping data for your ETL’s, reducing costs downstream and improving performance for customer queries. This query is applicable via Athena or PostgreSQL. This query gets expensive with IPv66.
Query requires that you have your IP addresses stored as integers (bigint, etcetera).
SELECT
hp.*,
l4.country,
l4.region
FROM honeypot hp
INNER JOIN ip2location_ipv4 l4 ON hp.src_ip_int BETWEEN l4.begin_int AND l4.end_int;
BETWEEN: WHERE, Integers
Example of a one-to-many query using WHERE, BETWEEN
Use-case: Ad-hoc, user-driven queries. You will need to implement the conversion from IP to integer in your application for a positive user experience. This query is applicable via Athena or PostgreSQL.
Query requires that you have your IP addresses stored as integers (bigint, etcetera).
SELECT
*
FROM ip2location_ipv4 l4
WHERE 16909060 BETWEEN l4.begin_int AND l4.end_int;
16909060
is equivalent to 1.2.3.4
7. The user provided 1.2.3.4
in this example.
BETWEEN: WHERE, IP Address
Example of a one-to-many query using WHERE, BETWEEN
Use-case: Ad-hoc, user-driven queries where there is no CIDR block to query. Often associated with higher resolution geolocation queries. This case casts an IP address string to the inet
data-type. tmp.network_addr
and
tmp.broadcast_addr
are already set to data-type inet
in the schema. This query is applicable via PostgreSQL.
SELECT
tmp.*
FROM ip2location_ipv4 tmp
WHERE '1.2.3.4'::inet BETWEEN tmp.network_addr AND tmp.broadcast_addr;
The user provided 1.2.3.4
in this example.
IP Addresses and CIDR blocks
If your dataset has CIDR blocks, like AWS and GCP ranges, you can use the PostgreSQL network operators.
Use-case: User-driven queries where the dataset has a CIDR block to query. This example casts the data-type for the input data as well as the dataset.
SELECT
*
FROM aws_ranges aws
WHERE '1.2.3.4'::inet << aws.ip_prefix::cidr;
The user provided 1.2.3.4
in this example.
Summary #
Security is not one size fits all. Different features and different customers will have varying access patterns for data. It is important to have the ability to serve data via varying database types for the best user-experience and product efficacy. It is equally essential to support your engineers with the tooling they need to experiment and iterate quickly. Plan for any medium-data or larger dataset to need a map-reduce and relational database capabilities.
Appendix #
Remarks
- Marmot: Ambiguous name for my security platform.
- integer: A number without a fractional component. Not to be confused with a database or language data-type.
- medium-data: Smaller than Big-Data, unless you have a marketing department.
- PostgreSQL is the AWS RDS implementation for this post.
- Pagination could be async. Recommend reading the following if you need pagination.
- More data more problems. IPv6 Geo mappings are large in the number of rows and the size of the integers.
-
from ipaddress import IPv4Address print(int(IPv4Address('1.2.3.4'))) >>> 16909060
Image in post created using midjourney.com
Databases
I reference Athena and PostgreSQL exclusively in this doc. Other databases support these use-cases. Choose the database appropriate for your environment or infrastructure.
Data sanitization
Validate all user supplied data before submitting it to your database. Reject what is not expected.
Basic examples:
- Length limitations for integers
- Validate IP addresses with the ipaddress library or your language of choice’s version.
Parameterize all data submitted by your users to the database. There is no wiggle room on this.
(c) Michael Bentley 2022
Contents may not be republished without written consent.