bentley.foo

Books I've read in 2026

Tue, 24 Feb 2026 00:00:00 -0000

Shroud — Adrian Tchaikovsky

Klara and the Sun — Kazuo Ishiguro

Sun of Blood and Ruin — Mariely Lares

The Wild Fox of Yemen — Threa Almontaser

The Lathe of Heaven — Ursula K. Le Guin

1491 — Charles C. Mann (in progress)


Hiring and growth for security research and response teams

Tue, 30 Aug 2022 00:00:00 -0000

Hiring

Hire for Curiosity

Expand your candidate list from experienced researchers to experienced engineers who have strong curiosity. Software engineers often have a background that the most experienced researchers do not: how applications are deployed at scale and how systems communicate.

These engineers know that critical credentials are stored in Terraform state files. They also know the nuances, such as the fact that instances in your private subnet may be able to communicate externally with a C2 over IPv6 without a NAT gateway. Exposing software engineers to experienced researchers enables team discoveries that have wider coverage of allure and day in the life practicality. Joining forces is the way.

Hire for Locale

Some threats are obvious, blatant, and in those instances we can classify and move on. But you cannot apply a blanket of cultural norms to a customer base that crosses cultural boundaries. Data collection and monetization practices vary across cultural demarcation points. Your best bet at understanding these impacts and backgrounds is to hire from the locale. Specifically, if you are a US company analyzing applications or threat actors from LATAM and Asia, you need to have researchers from that region. Colloquialisms and slang simply do not translate via automated tools. Further, what is considered acceptable for data collection and monetization can vary wildly by region. It is imperative to understand the intent and the outcome before you classify. Otherwise, you risk false positives in your Serviceable Obtainable Market(SOM).

Outputs

Researchers need and deserve a technical outlet. This can be positively influenced with

public facing technical blog
internal technical blog
regular internal documentation on threat actors and malware

Product, research leads, and product marketing (PMM) can create a pipeline of technical and product relevant posts based on your internal documentation. Credit your researchers in public posts. Your sales engineering (SE) team will leverage internal technical blogs on their own. Promote SE self-service and prevent premature disclosure by clearly marking internal blogs with traffic light protocol (TLP). Incorporate TLP and the blogs into the SE onboarding process.

Creating a PMM gateway for all outputs is the worst outcome. I've brought this up at several organizations and heard several reasons why companies don't include their researchers as contributors or allow direct outputs. The worst reason to date being they don't want their researchers poached. The second worst was thinking that researchers couldn't do the writing.

Researchers writing will help you win over the customer personas and roles needed to land a new customer.

Personas

Economic Buyer: The person holding the purse strings. This person is the Wolf Blitzer problem. If Wolf Blitzer is talking about it, you need a market ready answer. PMM is a good resource for this. Every other persona and role is whispering in the ear of the economic buyer.

Technical expert / detractor: They are looking for technical outputs and solutions that solve day in the life issues for them. They may already have a solution in mind, which may not be you. This requires strong research led outputs and practical product solutions. PMM is unlikely to solve this. Research content that shows expertise and references product capabilities that solve day in the life issues is the key.

Champion: This is often a relationship managed directly by sales or sales engineering. Your research team is your hidden weapon. Feeding technical details via your internal blog to the SE team powers competitive knowledge as well as deeply technical conversations.

Leveling

Diversity of experience presents a leveling option that allows for different verticals of expertise. When you create your leveling matrix, you can have the same levels as engineering: Senior Engineer, Staff, Principal, etc. Within those levels, you have to account for the verticals of expertise. Comparing your kernel hacker directly to your Kubernetes expert may not yield results that retain people, allow them to be fulfilled in their growth, and let you help them plot a path to what they want to achieve.

Consider using the Four Stages of Competence throughout your leveling. This helps measure the level of competence each person is achieving within their vertical of expertise, rather than relying on direct comparisons.

Chart a course

Iteration based retrospectives are critical for security research teams to grow. You have to cover

What went well
What could have gone better
What you want to try differently

Have your team contribute their feedback ahead of the retrospective. This should be in written form via a shared medium like Confluence. Have your team rotate through roles in the retrospective for timekeeper and moderator. Set timelines for what they want to try differently, e.g., six weeks, and let them grow. Your role is to ensure they focus on processes and ideas and not people. A strong program manager can be a pivotal role in success for this process.

Summary

Experienced researchers are critical for your team's success. Hiring for diversity of experience and locale will improve your odds of success. Expansion of team and SOM needs software engineers and regional knowledge. Provide your team with the means to communicate and support with guardrails in the background. Enable growth and paths for success by measuring competence within verticals of expertise.

There is no blueprint for every team. Go forth, adapt, build, and grow.

Appendix

Many of the items in this post were sourced or influenced from research team retrospectives.

Images created using midjourney.com

Contents may not be republished without written consent.


Threat intel databases, part two

Thu, 25 Aug 2022 00:00:00 -0000

This post continues from "Threat intel databases, part one". For simplicity, mentions of threat intel can be considered to include geolocation data.

Threat Intel Acquisition

Day 0

Flat files versus the world. Day 0, your focus should be flat files. Streaming and API-based feeds can wait. Flat files provide the most lift for the effort applied. This assumes that well-known sources such as abuse.ch, PAAS mappings¹, and customer submitted threat intel / trusted entities are important. With flat files, all of your customers will be able to contribute a CSV of trusted IPs or malicious entities.
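A customer CSV of trusted IPs can be read with very little code, which is part of why flat files give so much lift. The following is a minimal sketch, not a definitive implementation: the function name, the "ip" column name, and the sample rows are assumptions made for illustration.

```python
import csv
import io
import ipaddress


def parse_trusted_ips(csv_text, column="ip"):
    """Split a customer CSV into valid IP addresses and rejected rows.

    The column name "ip" is an assumption for this sketch.
    """
    valid, rejected = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        raw = (row.get(column) or "").strip()
        try:
            valid.append(str(ipaddress.ip_address(raw)))
        except ValueError:
            # Do not trust external data: record and skip bad rows.
            rejected.append((line_no, raw))
    return valid, rejected


# Made-up sample using documentation address ranges.
sample = "ip,label\n192.0.2.10,office\nnot-an-ip,typo\n2001:db8::1,vpn\n"
valid, rejected = parse_trusted_ips(sample)
print(valid)
print(rejected)
```

Returning the rejected rows with their line numbers gives the customer something actionable instead of a silent drop.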

Day 0+

You are acquiring abuse.ch, AWS ip-json, and all the other feeds that are table stakes. Your research team submissions are integrated with flat files. Now you are ready to move into commercial feeds. Once you transition to commercial feeds, you will find that most vendors want you to use an API-based model where you pay per entity (IP/Domain). Common reasons vendors push the API model:

They cannot derive any product insights from a flat file model
Hard entitlement enforcement. Your only capability around 1:1 observations:queries is caching.
They do not have a production ready bulk file option

For your vendor, the API-based model drives account management, product improvement and increasing spend.

There is a window where an API-based model is acceptable. Past this window, it no longer scales as a technology or a cost.

API best practices

Caching API responses can significantly improve application performance. Generally, domains and file hashes can be cached for long durations. IP addresses should not be cached for more than a few hours, or you risk false positives. Practically, your max cache duration should be aligned with your acquisition windows². For example, if you acquire every 12 hours, don't cache for 24. If you cache longer than your acquisition window, troubleshooting false positives will devolve into a negative QA experience executed by your research and response teams.

Using Redis keys with TTLs is a practical solution: refresh the TTLs when you process new threat intel, and old data will automatically expire. Whatever caching implementation you choose, your research and response team will need access to it. Specifically, they will need to validate detections as well as purge false positives from the cache without engineering involvement in the incident response.
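The TTL behavior can be illustrated without a Redis server. The sketch below is an in-memory stand-in written for this example, not Redis itself: the class name, the key format, and the fake clock are all assumptions. With the redis-py client, the equivalent operations would likely be set(name, value, ex=seconds) to write or refresh a key and delete(name) to purge one.

```python
import time


class TTLCache:
    """In-memory stand-in for Redis keys with TTLs (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        # Re-setting a key refreshes its TTL, as happens on each acquisition.
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave as if never cached
            return None
        return value

    def purge(self, key):
        """Manual removal of a false positive by research and response."""
        return self._store.pop(key, None) is not None


ACQUISITION_WINDOW_SECONDS = 12 * 60 * 60  # acquire every 12 hours

now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(clock=lambda: now[0])
cache.set("ip:192.0.2.10", "botnet_cc", ACQUISITION_WINDOW_SECONDS)

now[0] = 6 * 60 * 60
fresh = cache.get("ip:192.0.2.10")  # still inside the window

now[0] = 13 * 60 * 60
stale = cache.get("ip:192.0.2.10")  # window passed without a refresh
```

Tying the TTL to the acquisition window means an entry that is not re-observed in the next acquisition simply disappears, and purge gives the response team a way to act without a deploy.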

Do not cache your cache. No happiness comes from Dante's 9 circles of threat intel caching.

Common ways cyber-security companies discover they are over their commercial threat intel API limits:

Fail closed: the API integration is blocking for your application, you hit your API limit and trigger an outage.
Fail open: the API integration is not blocking, you miss a table stakes detection and your customer escalates. This is a toxic false negative³.

Monitor your API call counts versus quota counts.
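One way to act on this is a small check that runs wherever you already count API calls. This is a sketch under assumptions: the function name, the three status labels, and the 80% warning threshold are illustrative choices, not vendor values.

```python
def quota_status(calls_made, quota, warn_ratio=0.8):
    """Classify API usage against the contracted quota.

    warn_ratio is an assumed alerting threshold, not a vendor value.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    ratio = calls_made / quota
    if ratio >= 1.0:
        return "over"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


print(quota_status(850, 1000))
```

Alerting at "warn" rather than "over" leaves time to adjust caching or renegotiate before you fail open or closed.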

Post API

Bulk flat files are your friend. Flat files still need caching and research and response enablement. Major downsides to bulk flat files are

Many vendors have no idea how to price it. This will be apparent in your discovery calls with their sales team
Cost: you are unlikely to encounter a vendor where spend will go down transitioning from API to bulk
Data quality: bulk flat files expose the data quality issues that are often less apparent with APIs

Additional considerations

All threat intel sources will have false positives: commercial, free, open-source, and your research team. False positives range from the obvious and outright, e.g., 1.2.3.4 is labeled malware when it is not, to those derived from cultural differences or opinions. Obvious false positives are easy to solve through improved processes and threat acquisition filtering capabilities. False positives that are derived from cultural differences or opinions will need to be handled via product enhancements accessible to your customer.

There is no standard format for threat intel, and there are extensive quality control issues. You will often see the same source use null, "None", "N/A" values interchangeably. Timestamps can have wild variations.
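A normalization pass during acquisition keeps these inconsistencies out of your tables. The following is a minimal sketch: the set of null spellings, the list of timestamp layouts, and the decision to treat naive timestamps as UTC are all assumptions that will differ per feed.

```python
from datetime import datetime, timezone

# Assumed set of null spellings; extend it per feed.
NULL_TOKENS = {"", "null", "none", "n/a"}


def clean_value(value):
    """Map the various null spellings to None, else return trimmed text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_TOKENS else text


# Assumed timestamp layouts for this sketch; real feeds will need more.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d")


def parse_timestamp(value):
    """Return a UTC datetime, or None when the value is null or unparseable."""
    text = clean_value(value)
    if text is None:
        return None
    for fmt in TIMESTAMP_FORMATS:
        try:
            # Assumption: feed timestamps without an offset are UTC.
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


cleaned = [clean_value(v) for v in (None, "None", "N/A", " null ", "BEACON")]
parsed = parse_timestamp("2022-08-24T06:00:00Z")
```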

File hashes can have prolific growth. For example, we once discovered a malicious Android app. For weeks after the initial discovery, we were acquiring 700k+ new hashes for the same malware every day. Prolific growth still fits in your Redis cache.

File size (bulk) can vary wildly from a few megabytes to hundreds of megabytes per feed per download.

Threat intel and geolocation data persistence

Amazon Athena / S3

Marmot acquires threat intel and persists it to S3. Post acquisition and validation, data is written to two locations.

latest
archive

latest:  s3://acme-bucket/threat_intel/external/latest/abuse_ch/latest.jsonl.gz
archive: s3://acme-bucket/threat_intel/external/archive/abuse_ch/year=2022/month=08/day=24/.jsonl.gz

Latest and archive are available for searching via Athena.
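The two locations can be derived from the source name and the acquisition date. This sketch only builds key strings and does not talk to S3. The function names are hypothetical, and the archive object name is deliberately left to the caller because it is not shown in the paths above.

```python
from datetime import date

BASE_PREFIX = "threat_intel/external"  # taken from the example paths


def latest_key(source):
    """Key of the single 'latest' object for a source."""
    return f"{BASE_PREFIX}/latest/{source}/latest.jsonl.gz"


def archive_prefix(source, acquired_on):
    """Hive-style partition prefix for one acquisition date."""
    return (
        f"{BASE_PREFIX}/archive/{source}/"
        f"year={acquired_on:%Y}/month={acquired_on:%m}/day={acquired_on:%d}/"
    )


print(latest_key("abuse_ch"))
print(archive_prefix("abuse_ch", date(2022, 8, 24)))
```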

Latest

Latest is your primary query source for any recent observations that occurred within your last acquisition window. This should cover the majority of your product's Athena-based queries.

Archive

Archive threat intel provides value for

Evidence: A customer has a question on an event that is days to weeks old. The threat intel artifact persisted with the event does not contain enough evidence.
Research: Your team is working on a threat report or investigating events from a relevant date in your archive
Analytics: Statistical analysis of threat metadata to create new filters or derivative threat intel

The S3 keys year, month, and day are Hive style partitions which can be queried as columns. For example:

SELECT * 
FROM abuse_ch 
WHERE "year" = 2022
 AND "month" = 5
 AND "day" = 5 
LIMIT 10;

Partitions are a powerful feature that improves performance by limiting the amount of data crawled with a query. Many relevant variations are possible including partitioning by source names, customer UUIDs, etc.

PostgreSQL

Create a similar structure as S3 in PostgreSQL, where threat intel from the same source is not co-mingled across acquisition events.

DB considerations

Avoid updates and deletes. Threat intel metadata can be highly ephemeral
Build your indexes in one shot
Consider truncating threat intel data before requiring TOAST

These constraints, minus TOAST, push towards a design where a table is created for each acquisition event.

Persistence process

Post acquisition to S3, sanitization, and validation⁴:

Create connection with autocommit=False
Create cursor with context manager
Create database table with a unique name. ex: abuse_ch_2022_05_05__06_00
Insert all rows
Commit
Create cursor with context manager
Add indexes
Commit
Create cursor with context manager
Add new row, referencing the new table, to table inventory table
Commit
Table is now accessible for new queries
Close connection⁵
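The commit boundaries in the steps above can be sketched as follows. To keep the example runnable anywhere, it swaps in Python's built-in sqlite3 module for psycopg2 and PostgreSQL; that substitution is mine, not the setup used for Marmot. The function name, the two-column schema, and the inventory column names are assumptions. With psycopg2 the placeholders would be %s rather than ?.

```python
import re
import sqlite3
from contextlib import closing


def persist_acquisition(conn, source_name, table_name, rows):
    """Load one acquisition event into its own table, then publish it."""
    # Table names cannot be bound as parameters, so restrict them strictly.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")

    with closing(conn.cursor()) as cur:
        cur.execute(f"CREATE TABLE {table_name} (ioc_value TEXT, threat_type TEXT)")
        cur.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", rows)
    conn.commit()  # one commit for all rows, not one per row

    with closing(conn.cursor()) as cur:
        cur.execute(f"CREATE INDEX {table_name}_ioc_idx ON {table_name} (ioc_value)")
    conn.commit()  # indexes built in one shot, after the load

    with closing(conn.cursor()) as cur:
        cur.execute(
            "INSERT INTO tmp_ti_table_inv (source_name, src_table_name) VALUES (?, ?)",
            (source_name, table_name),
        )
    conn.commit()  # this commit makes the table visible to readers


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tmp_ti_table_inv ("
    "id INTEGER PRIMARY KEY, source_name TEXT, src_table_name TEXT)"
)
persist_acquisition(
    conn,
    "abuse_ch",
    "abuse_ch_2022_05_05__06_00",
    [("192.0.2.10:80", "botnet_cc"), ("198.51.100.7:443", "botnet_cc")],  # made-up rows
)
latest = conn.execute(
    "SELECT src_table_name FROM tmp_ti_table_inv "
    "WHERE source_name = ? ORDER BY id DESC LIMIT 1",
    ("abuse_ch",),
).fetchone()[0]
conn.close()
```

Because the inventory row is written last, readers never see a table that is still loading or missing its indexes.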

How you use your context manager and when you perform commits is highly dependent on your application structure. The two most impactful commits are

1. The commit after inserting all rows. Commits per row will slow down the process.
2. The commit to the inventory table. This makes the data accessible to the application.

Table inventory table

The inventory table helps with two items. Determining the latest table per source and determining which tables can be pruned.

Determining the latest table

Example query to identify all tables related to a threat intel source and return the latest table.

SELECT
  table_name
FROM tmp_ti_table_inv
WHERE source_name = %(source_name)s
ORDER BY id DESC
LIMIT 1;

Pruning tables

Consider two tables per threat intel source to be a minimum requirement; the latest table plus the table you failed over from. Additional tables are helpful for quick QA. In the following query, number of tables per source is applied as lowest_rank. Setting lowest_rank to 5 would return all tables older than the most recent 5 tables for each threat intel source. The returned tables are the tables you prune.

SELECT
    id,
    src_table_name
FROM (
    SELECT
        tmp_ti_table_inv.*,
        rank() OVER (
            PARTITION BY source_name
            ORDER BY id DESC
        ) AS rank
    FROM tmp_ti_table_inv
) rf
WHERE rank > %(lowest_rank)s;
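To make the ranking concrete, here is the same selection expressed in plain Python over a handful of made-up inventory rows. It is a worked illustration of what the query returns, not a replacement for it. The function name and the sample rows are assumptions.

```python
from collections import defaultdict


def tables_to_prune(inventory, lowest_rank):
    """Return (id, table) pairs ranked below the newest `lowest_rank` per source.

    `inventory` is an iterable of (id, source_name, src_table_name) tuples,
    where a larger id means a more recent acquisition.
    """
    by_source = defaultdict(list)
    for row_id, source_name, table_name in inventory:
        by_source[source_name].append((row_id, table_name))

    prune = []
    for rows in by_source.values():
        rows.sort(reverse=True)           # ORDER BY id DESC
        prune.extend(rows[lowest_rank:])  # rank > lowest_rank
    return sorted(prune)


# Made-up inventory rows for illustration.
inventory = [
    (1, "abuse_ch", "abuse_ch_2022_05_05__06_00"),
    (2, "aws_ip_ranges", "aws_ip_ranges_2022_05_05__06_00"),
    (3, "abuse_ch", "abuse_ch_2022_05_05__18_00"),
    (4, "abuse_ch", "abuse_ch_2022_05_06__06_00"),
]
print(tables_to_prune(inventory, lowest_rank=2))
```

With lowest_rank set to 2, only the oldest abuse_ch table is returned; the single aws_ip_ranges table is untouched because it is still within the newest two for its source.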

Simplifying tables

I use the same table structure for datasets that are similar. AWS and GCP ip-json ranges are a great example of this. The main benefit being query re-use across sources. This does, however, mean you will need to parameterize table names.

Psycopg2 provides functionality for table name parameterization via its sql module.

src_n and src_tn will be safely added into the query.

from psycopg2 import sql

if source_name in {'aws_ip_ranges', 'gcp_ip_ranges'}:
    # Identifiers are quoted by psycopg2; the IP value stays a bound parameter.
    select_q = sql.SQL("""
    SELECT
      props AS {src_n}
    FROM {src_tn}
    WHERE ip_prefix >> %(ip)s::inet;
    """).format(src_n=sql.Identifier(source_name),
                src_tn=sql.Identifier(src_table_name))

Summary

Threat Intel is a fascinating adventure into data acquisition, sanitization, and filtering as well as presentation layers. It presents a broad degree of challenges. As a table stakes capability, missing these challenges incurs low efficacy and toxic false positives, and damages NPS scores. Meeting the challenges is a moving target that is fun and provides tangible value to your research teams, customers, and sales enablement. This post lightly touches on these challenges, and nothing here is gospel. Adapt this to your needs, be flexible, and most of all, have fun, enable your teams, and detect malicious things.

Appendix

References

¹ a) https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html , b) https://cloud.google.com/compute/docs/faq#find_ip_range
² Some threat intel sources offer full threat intel downloads and some offer updates only. The choice of update model impacts persistence, pruning, and performance.
³ Toxic False Negative: A false negative that can impact your sales pipeline, company brand, and product trust. Often associated with a table stakes detection that cannot be explained away.
⁴ Do not trust external data. Always sanitize and validate contents.
⁵ Connection generation is costly. Closing connections is highly dependent on application structure. In some scenarios, such as connection proxies, it may not be needed. https://aws.amazon.com/rds/proxy/

Notes

PostgreSQL interactions occur with Python3 and psycopg2
What about STIX and TAXII? I consider this a customer oriented feature, often not available for acquisition of external feeds. It should exist on your roadmap.
Images created using midjourney.com
Queries, S3 paths shown are updated to be more generic

Contents may not be republished without written consent.


Threat intel databases, part one

Mon, 22 Aug 2022 00:00:00 -0000

Intro

Three types of content I manage are threat intel, geolocation, and honeypot observations.

Threat Intel is an opinion on an entity. Often that entity is a file hash, IP address, or domain that is associated with malware.

Geolocation is location information associated with an IP address. For example, an IP associated with cloud providers like AWS and Alibaba, ASNs, or countries, states and cities. The main data differentiator of geolocation data from threat intel is how the data is queried.

Honeypot observations are data associated with a connection to the Marmot¹ honeypot. This includes data such as source IP, ports, and payloads.

Example datasets

Threat intel example from abuse.ch

ioc_value	threat_type	malware_alias
xx.161.27.133:80	botnet_cc	Agentemis,BEACON,CobaltStrike,cobeacon
xx.95.30.6:443	botnet_cc	Agentemis,BEACON,CobaltStrike,cobeacon

Geo ASN example from IP2Location

beginning_ip	ending_ip	cidr	asn	asn name
16859136	16871423	1.1.96.0/20	2519	Arteria Networks Corporation
16871424	16873471	1.1.112.0/21	2519	Arteria Networks Corporation

Geo example from IP2Location

network_int	broadcast_int	iso_country	country	region
16777216	16777471	US	United States of America	California
16777472	16778239	CN	China	Fujian

PAAS example from AWS

ip_prefix	region	service	network_border_group
3.2.34.0/26	af-south-1	AMAZON	af-south-1
3.5.140.0/22	ap-northeast-2	AMAZON	ap-northeast-2

The purpose for acquiring threat intel, geolocation, and honeypot data is to ask questions of it. These questions can be for threat hunting or policy-based reasons. Asking questions of the datasets means matching on the following

IP address equals IP address
IP address in CIDR block
IP integer² between network IP integer and broadcast IP integer
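The three kinds of match can be tried out with Python's ipaddress module before any database is involved. This is a small sketch: the address 3.2.34.5 is an arbitrary value chosen to fall inside the AWS sample prefix 3.2.34.0/26 shown above.

```python
import ipaddress

ip = ipaddress.ip_address("3.2.34.5")

# 1. IP address equals IP address
equal_match = ip == ipaddress.ip_address("3.2.34.5")

# 2. IP address in CIDR block (prefix from the AWS sample rows)
network = ipaddress.ip_network("3.2.34.0/26")
cidr_match = ip in network

# 3. IP integer between network IP integer and broadcast IP integer
low = int(network.network_address)
high = int(network.broadcast_address)
range_match = low <= int(ip) <= high

print(equal_match, cidr_match, range_match)
```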

Marmot Databases

Questions at Marmot are primarily asked via

Amazon Athena
PostgreSQL

Athena

Athena is used for operational simplicity, cost, and threat hunting on medium-data³ with limited sanitization. Athena was used for the honeypot blog post and was the only content database for the first iterations of Marmot. It continues to be used in parallel with PostgreSQL.

PostgreSQL

PostgreSQL⁴ was introduced to help with

1. rapid iteration on queries and table schemas for new features
2. support highly repetitive queries
3. flexibility over how IP addresses are queried

Rapid iteration on queries and table schemas for new features

Design considerations: All threat intel, geolocation, and honeypot data is persisted to S3 on acquisition. Post acquisition, the data is ETL'd into a bucket for Athena or a table in PostgreSQL. The ETL job is not related to the pruning of acquired data. Once your first ETL scripts for Athena and PostgreSQL are configured in Terraform, additional prototype jobs become ~trivial.

When developing a new feature, I focus on what questions I want to ask before I dive into normalizing data or choose the database. Running queries in Athena and updating Athena schemas has an inherent latency that slows down my development flow. Compounding the latency, I sometimes get the answer to my question and realize I was asking the wrong question.

Once I have validated my questions, how often those questions will be asked, and other details, I choose Athena or PostgreSQL.

Support highly repetitive queries

For non-bulk data that will see query rates of several per minute, I lean towards PostgreSQL for cost and user-experience.

Cost: Ultimately, I can control costs with API rate limiting. But in an MVP phase, I need to see how users want to use the product, not how they have to use it. With Athena, I may have to use stricter API limits for cost control, and that could be premature.

User-Experience: Athena is asynchronous. Without a UI your users will have to run two to three commands per question. PostgreSQL can be implemented with synchronous behavior via your application for small responses⁵.

Asynchronous: 1. Ask the question 2. Is the answer ready 3. Download the response

Synchronous: 1. Ask the question, receive response.

For the right use case, Athena is very cost-effective versus an RDS for PostgreSQL instance. For queries that can rely on indexes and caching, PostgreSQL can be cheaper and faster.

Flexibility over how IP addresses are queried

IP addresses are queried in Athena as integers. There is no data-type for IP address. Querying via integers occurs in databases that support IP addresses and CIDR notation as well.

BETWEEN queries

Design considerations: Geolocation is not bound by CIDR blocks. It may include partial ranges. Consequently, you may always have a requirement to search IP ranges instead of CIDR ranges.

BETWEEN: JOIN, Integers

Example of a many-to-many query using a JOIN, BETWEEN

Use-case: Bulk preparation of location data for every IP address observed over a time range in the honeypot or customer dataset. Useful for prepping data for your ETLs, reducing costs downstream and improving performance for customer queries. This query is applicable via Athena or PostgreSQL. This query gets expensive with IPv6⁶.

Query requires that you have your IP addresses stored as integers (bigint, etcetera).

SELECT
 hp.*,
 l4.country,
 l4.region
FROM honeypot hp
  INNER JOIN ip2location_ipv4 l4 ON hp.src_ip_int BETWEEN l4.begin_int AND l4.end_int;

BETWEEN: WHERE, Integers

Example of a one-to-many query using WHERE, BETWEEN

Use-case: Ad-hoc, user-driven queries. You will need to implement the conversion from IP to integer in your application for a positive user experience. This query is applicable via Athena or PostgreSQL.

Query requires that you have your IP addresses stored as integers (bigint, etcetera).

SELECT
  *
FROM ip2location_ipv4 l4
WHERE 16909060 BETWEEN l4.begin_int AND l4.end_int;

16909060 is equivalent to 1.2.3.4⁷. The user provided 1.2.3.4 in this example.
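The same range lookup can be prototyped in application code before committing to a schema. The sketch below uses a binary search over ranges sorted by their starting integer. It uses the two geolocation sample rows from earlier in this post and assumes ranges do not overlap. The function name is hypothetical.

```python
import bisect
import ipaddress

# (begin_int, end_int, country, region), sorted by begin_int.
# Rows taken from the geolocation sample earlier in the post.
RANGES = [
    (16777216, 16777471, "US", "California"),
    (16777472, 16778239, "CN", "Fujian"),
]
BEGINS = [row[0] for row in RANGES]


def locate(ip_text):
    """Return (country, region) for an IPv4 string, or None if unmatched."""
    ip_int = int(ipaddress.IPv4Address(ip_text))  # raises ValueError on bad input
    idx = bisect.bisect_right(BEGINS, ip_int) - 1  # last range starting <= ip
    if idx < 0:
        return None
    begin_int, end_int, country, region = RANGES[idx]
    return (country, region) if ip_int <= end_int else None


print(locate("1.0.0.5"))
print(locate("1.2.3.4"))
```

1.2.3.4 is not covered by either sample row, so the second lookup returns None, which is the application-side equivalent of the query returning no rows.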

BETWEEN: WHERE, IP Address

Example of a one-to-many query using WHERE, BETWEEN

Use-case: Ad-hoc, user-driven queries where there is no CIDR block to query. Often associated with higher resolution geolocation queries. This case casts an IP address string to the inet data-type. tmp.network_addr and tmp.broadcast_addr are already set to data-type inet in the schema. This query is applicable via PostgreSQL.

SELECT
  tmp.*
FROM ip2location_ipv4 tmp
WHERE '1.2.3.4'::inet BETWEEN tmp.network_addr AND tmp.broadcast_addr;

The user provided 1.2.3.4 in this example.

IP Addresses and CIDR blocks

If your dataset has CIDR blocks, like AWS and GCP ranges, you can use the PostgreSQL network operators.

Use-case: User-driven queries where the dataset has a CIDR block to query. This example casts the data-type for the input data as well as the dataset.

SELECT
    *
FROM aws_ranges aws
WHERE '1.2.3.4'::inet << aws.ip_prefix::cidr;

The user provided 1.2.3.4 in this example.

Summary

Security is not one size fits all. Different features and different customers will have varying access patterns for data. It is important to have the ability to serve data via varying database types for the best user-experience and product efficacy. It is equally essential to support your engineers with the tooling they need to experiment and iterate quickly. Plan for any medium-data or larger dataset to need both map-reduce and relational database capabilities.

Appendix

Remarks

¹ Marmot: Ambiguous name for my security platform.
² integer: A number without a fractional component. Not to be confused with a database or language data-type.
³ medium-data: Smaller than Big-Data, unless you have a marketing department.
⁴ PostgreSQL is the AWS RDS implementation for this post.
⁵ Pagination could be async. Recommend reading the following if you need pagination.
⁶ More data more problems. IPv6 Geo mappings are large in the number of rows and the size of the integers.
⁷ In Python:

from ipaddress import IPv4Address
print(int(IPv4Address('1.2.3.4')))  # 16909060

Image in post created using midjourney.com

Databases

I reference Athena and PostgreSQL exclusively in this doc. Other databases support these use-cases. Choose the database appropriate for your environment or infrastructure.

Data sanitization

Validate all user supplied data before submitting it to your database. Reject what is not expected.

Basic examples:

Length limitations for integers
Validate IP addresses with the ipaddress library or your language of choice's version.

Parameterize all data submitted by your users to the database. There is no wiggle room on this.
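Validation and parameterization can be combined in one small step that sits between the user and the database. The following is a sketch under assumptions: the function name is hypothetical, it handles IPv4 only, and the query text reuses the ip2location_ipv4 example from this post. The returned pair is shaped for a psycopg2 call such as cursor.execute(query, params), which is not executed here.

```python
import ipaddress

QUERY = """
SELECT *
FROM ip2location_ipv4 l4
WHERE %(ip_int)s BETWEEN l4.begin_int AND l4.end_int;
"""


def build_lookup(user_input):
    """Validate user input and return (query, params) for a parameterized call."""
    if not isinstance(user_input, str) or len(user_input) > 15:
        # The longest dotted-quad IPv4 string is 15 characters.
        raise ValueError("input is not a plausible IPv4 address")
    ip_int = int(ipaddress.IPv4Address(user_input))  # rejects anything malformed
    return QUERY, {"ip_int": ip_int}


query, params = build_lookup("1.2.3.4")
print(params)
```

The user's text never reaches the query string. Only a validated integer is passed, and it is passed as a bound parameter.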

Contents may not be republished without written consent.


Data brokers, spam messages, voicemail and Stan

Thu, 11 Aug 2022 00:00:00 -0000

My cell phone receives what I consider to be an excessive amount of unsolicited text messages. Between January 1 and August 10, 2022, it received 76 unsolicited messages or 1 message every 2.9 days.

Number of unsolicited text messages per day since Jan 1, 2022

Types of messages and how I respond

Banking fraud

When I receive a text message with a URL that is likely banking fraud, I do the following

Run a whois to get the registrar’s abuse@ email address.
Take a screenshot of the message and send it to the abuse@ email with context to support the fraud claim.

Registrars take this seriously and domain take-downs often happen within 24 hours. My time involved is typically less than five minutes. This task can be accomplished while waiting for a Terraform deploy, or while someone is talking about the demise of snack selection during all-hands meetings.

Ad-hoc blocking of the sending phone number for banking fraud has had little to no effect on repeat messages. I suspect carriers remediate the sources of activity fairly quickly on their own, reducing any impact I can achieve.

Odd messages that are potentially phishing attempts

I reply if I have time or someone is still going on about snacks. Responses from the sender are rare. When I get a response it appears to be either nonsense or a case of mistaken identity/wrong number.

Real estate sale calls and messages

For 2+ years, I have been receiving large numbers of voice calls and some text messages with cash offers to buy someone named Stan's house. These calls and texts occur at inconvenient hours for me, something I attribute to having a phone number that is not related to the timezone that I live or work in.

Initially, I used my phone settings to route all unknown calls to voicemail and silence unknown numbers for texts. This is a less than desirable response as it also skips legitimate calls from unexpected numbers. For example, my doctor or calls about an incident response in progress. Additionally, my voicemail alerts stopped once I reached 99 unread voicemails.

I'm not a fan of voicemail, but I'm less a fan of service degradation due to

calls not intended for me
the red dot notification indicating I have dozens of unread text messages

I decided to review the messages and start calling the realtors back. Reviewing transcribed voicemails helped me recognize that the callers were all using the same or similar sources of data. Each voicemail left detailed personal information about Stan, including his address.

My goal was to try and find out where the realtors were sourcing their data from. The vast majority of realtors never answered my calls or voice messages. But one realtor did. I explained that they were calling at inappropriate hours, and that I was not their intended recipient. I queried their source of data. She did not know the answer on the spot, but she did reach back out and stated that they source their material from Lexis Nexis.

I went to the Lexis Nexis site to search for a way to correct the information. The takeaway from my search was

There are dozens of data brokers
There appears to be no way to correct when your information is associated with another person's account
I am not the only one with this issue.

A DuckDuckGo search yields many references for this issue such as

https://www.newsweek.com/2019/10/04/lexisnexis-mistake-data-insurance-costs-1460831.html
https://old.reddit.com/r/legaladvice/comments/69mgbx/lexisnexis_has_incorrect_information_about_me_and/

I filled out the Lexis Nexis form to receive a copy of my data and opt out of their services that I never opted in to. My report never arrived. I did get a confirmation that my data would be removed from my account. Removing my data has zero impact on any data that they have already sold, which has subsequently been re-shared into the data ether.

The real estate text messages and voicemails for Stan have not stopped.

Political messages for Stan and Gayle

By far, the largest number of messages I receive are politically oriented. Like the real estate messages, these do not come at a reasonable hour for where I live. I have responded to these messages by

Replying STOP and variations of it
Attempting to call the number. This was ineffective because the majority of messages come from some type of API-based service
Requesting them to stop in a frustrated tone

Nothing has stemmed the influx of political text messages.

Analyzing Text Messages

On August 10, 2022, I received a text message while on roaming and decided to see if I could take a similar approach to the political messages as I do banking fraud.

These messages are not fraud, but I do expect them to violate an acceptable use policy for a carrier since

I can't opt out or stop the messages
I never opted in
I'm not the intended recipient
The quantity is excessive

To start my investigation I signed up for a Twilio account and used Twilio's phone lookup system. In this instance, the carrier for the message was identified by Twilio to be bandwidth.com.

Twilio Phone Lookup

I browsed to the bandwidth.com site and filled in the form. Good news, they "are here to help".

Bandwidth.com Site

Initially the bandwidth.com form would not work, citing incorrect form data. After several attempts with the same information the form allowed submission. I received a response so fast I knew it would not be good news.

Bandwidth Response

My interpretation of Bandwidth.com's response is

they are a wholesale provider
they are not responsible for how their network is used
they would forward my complaint to an unknown entity

I replied that this was a problem of excess and cc'd their legal@ email and what I guessed to be their CEO's email.

Their legal team auto-responded to use the form that I had already filled out. I suppose they have received these messages before, and the most logical response was to do nothing. I did not feel helped.

legal@bandwidth.com response

Summary of my experience escalating to Bandwidth.com

Bandwidth.com will not disclose the sub-carrier or service, thus blocking any attempt on my part to resolve this at the carrier level. Further, they push acceptable use entirely to their undisclosed customer.

This was a frustrating outcome. At this point I decided to analyze all unsolicited messages I received since Jan 1, 2022 and look for common ground.

Methodology

Define spam

Any unsolicited text message. Messages that do not meet this requirement

Personal messages
SMS verifications for service sign-ups
Automated messages for appointments

Transcribe

I did not find a way to copy the messages from my phone directly to my computer with OS provided tools. I did not want to use a third-party tool to do this. Not deterred, I took the age-old accepted security response approach of spreadsheet-triage. I manually copied the date and source phone number for every unsolicited message into a spreadsheet. I labeled the messages with

Classification category
Primary Type
Secondary Type
Domain - If there was a domain and what the domain was
Path - any URL text past the domain
Image - If an image was contained
Domain registrar
Mentions Stan
Mentions Gayle
Carrier

The classifications are

Spam - Greeting
Spam - Survey
Election Poll
Fraud - Banking
Britney Spears
Real Estate
Political

Primary Type and Secondary Type focus on the contents of the message. Carrier is the phone number carrier as reported by the Twilio.com phone number lookup service.
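
The spreadsheet columns above map naturally onto a small record type. The sketch below is an illustration, not the original spreadsheet: the field names, the `extract_domain_and_path` helper, and the example message are hypothetical. The helper only recognizes links that start with `http://` or `https://`, so bare domains in a message would still need to be entered by hand.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Matches the first http(s) URL in a message body: host, then optional path.
URL_PATTERN = re.compile(r"https?://([^/\s]+)(/\S*)?", re.IGNORECASE)


def extract_domain_and_path(body: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (domain, path) for the first URL in the text, or (None, None)."""
    match = URL_PATTERN.search(body)
    if match is None:
        return None, None
    return match.group(1).lower(), match.group(2) or ""


@dataclass
class MessageRecord:
    """One row of the triage spreadsheet (hypothetical field names)."""
    date: str
    source_number: str
    classification: str       # e.g. "Political", "Fraud - Banking"
    primary_type: str
    secondary_type: str
    domain: Optional[str]     # None when the message has no domain
    path: Optional[str]       # any URL text past the domain
    has_image: bool
    registrar: Optional[str]
    mentions_stan: bool
    mentions_gayle: bool
    carrier: Optional[str]    # as reported by the phone lookup
```

For example, `extract_domain_and_path("Reply now https://Example.com/a1b2 to help")` yields the lowercased domain and the path as separate values, matching the Domain and Path columns.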

Data points

Classification Over Time

Classification To Carrier

For classification to carrier I shortened the carrier name. For example, Bandwidth SMSEnabled - Bandwidth CLEC - Sybase365 was shortened to Bandwidth.com. The mapping is posted in the Appendix.
Null and Unknown are distinct responses from the Twilio lookup service. Those responses have been preserved in the chart.
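
The shortening is a plain lookup from the name reported by Twilio to the short name used in the graphs. A minimal sketch, using a few rows from the Appendix mapping; the function name `short_carrier_name` is hypothetical, and passing unmapped names through unchanged is an assumption rather than something the charts did.

```python
# A subset of the Appendix mapping: reported name -> short name for graphs.
CARRIER_SHORT_NAMES = {
    "Bandwidth SMSEnabled - Bandwidth CLEC - Sybase365": "Bandwidth.com",
    "Bandwidth.com CLEC, LLC": "Bandwidth.com",
    "TextNow - Bandwidth.com - SVR": "Bandwidth.com",
    "TextNow - Neutral Tandem - SVR": "TextNow",
    "Telnyx - Level3 - SVR": "Telnyx",
    "Telnyx - Telnyx - SVR": "Telnyx",
    # Null and Unknown are kept as distinct values, not merged.
    "Null": "Null",
    "Unknown": "Unknown",
}


def short_carrier_name(reported: str) -> str:
    """Map a reported carrier name to its short name (assumption: unmapped names pass through)."""
    cleaned = reported.strip()
    return CARRIER_SHORT_NAMES.get(cleaned, cleaned)
```

Note that the two TextNow rows map to different short names, which is why a prefix match on the reported name would not be enough.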

Bandwidth.com and Telnyx cover the majority of political messages.
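
A statement like this comes from counting messages per classification and carrier. The graphs were built in Amazon Quicksight; the sketch below shows the same kind of tally with the Python standard library. The rows are placeholder values for illustration only, not the collected data, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Placeholder (classification, short carrier name) pairs. NOT the real data set.
PLACEHOLDER_ROWS = [
    ("Political", "Bandwidth.com"),
    ("Political", "Telnyx"),
    ("Political", "Bandwidth.com"),
    ("Real Estate", "Twilio"),
]


def carrier_share(rows: Iterable[Tuple[str, str]], classification: str) -> Dict[str, float]:
    """Fraction of messages in one classification that arrived via each carrier."""
    counts = Counter(carrier for cls, carrier in rows if cls == classification)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {carrier: n / total for carrier, n in counts.items()}
```

Calling `carrier_share(rows, "Political")` on the real spreadsheet rows gives the per-carrier proportions behind the chart.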

Carriers To Text Messages With Domains

Text messages with no domain are indicated by FALSE.

Domain mapping

The majority of political domains are fronts for hxxps://winred.com

True: Domain fronts for winred
False: Domain does not front for winred
Text messages without domains have been filtered out
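
One way to reproduce a True/False label like this is to follow each link and check where it finally lands. The post does not say how the label was assigned, so treat this as an assumption: the helper below only inspects an already-resolved URL and calls the domain a front when the final host is winred.com or one of its subdomains. The function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def fronts_for_winred(final_url: str) -> bool:
    """True when the resolved URL's host is winred.com or a subdomain of it."""
    host = (urlsplit(final_url).hostname or "").lower()
    return host == "winred.com" or host.endswith(".winred.com")
```

Comparing the parsed hostname, rather than searching the whole URL for the string, avoids counting lookalike hosts or query parameters that merely mention the name.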

I browsed to the Winred site to opt out of messages. Their site has a chatbot that provides categories of questions, including text messages.

Winred's response is to work with each campaign individually. The issues with this are

1. Volume - You are attempting to mitigate a many-to-one attack
2. New sources - There is no source of truth; each campaign is a net new source
3. Stop and other replies via text message are partially or entirely ignored

Winred's approach is similar to Bandwidth.com's; in effect, there is no practical way to stop unsolicited messages via carriers or organizations using the carriers.

In every case, the parties involved in sending claim no responsibility or authority.

Suggested industry requirements

Note: I do not work in the carrier industry.

Carrier identification for any text message

A consumer should be able to identify the carrier or carrier customer account responsible for sending an API-based or automated message. This should be trivial to accomplish via the message itself and any references via domains.

Identification via Twilio is not a consumer friendly option.

Opt-out or block the carrier's customer account

A consumer should be able to send a single STOP response and block all messages from the carrier customer account.
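
To make the suggestion concrete, the sketch below keys an opt-out on the carrier's customer account rather than on the individual sending number, so a sender that rotates numbers is still blocked. Everything here is hypothetical (the `OptOutRegistry` class, the account identifiers, the number-to-account table); it illustrates the idea and does not describe any carrier's actual system.

```python
from typing import Dict, Set, Tuple


class OptOutRegistry:
    """Tracks STOP requests per (recipient, customer account). Hypothetical design."""

    def __init__(self, number_to_account: Dict[str, str]) -> None:
        # The carrier knows which customer account owns each sending number.
        self._number_to_account = dict(number_to_account)
        self._blocked: Set[Tuple[str, str]] = set()

    def record_stop(self, recipient: str, sender_number: str) -> None:
        """A single STOP blocks the whole account behind the sending number."""
        account = self._number_to_account.get(sender_number)
        if account is not None:
            self._blocked.add((recipient, account))

    def may_deliver(self, recipient: str, sender_number: str) -> bool:
        """False when the recipient has opted out of the sender's account."""
        account = self._number_to_account.get(sender_number)
        return (recipient, account) not in self._blocked
```

With two numbers registered to the same account, one STOP sent to either number makes `may_deliver` return False for both, while numbers belonging to other accounts are unaffected.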

Mobile service providers should take a hostile approach

Some API-based senders are not operating in good faith. Providers and MVNOs should take a hostile approach towards senders. I am not a fan of my mobile phone provider choosing what I can and cannot see. But this is service degradation at this point. Take the same approach you would for any other network-based attack and null route it.

STOP should not convert to known sender

Replying STOP to an unknown sender moves the customer to a known sender status on iPhones. This effectively disables the mitigation. Further, STOP should be standardized. Not Stop2End or Stop=End.

The FTC should add a portal

Currently, the FTC recommends using your phone messaging settings to block this activity. This method does not work in a many-to-one attack. The FTC should add a portal where complaints can be escalated, investigated, and sources fined or shut down.

Potential downsides of suggestions

I suspect customers of API-based carriers will continue to act in bad faith. Any identification capabilities are likely to take the route of cookies, where companies choose overtly obtrusive implementations rather than following the spirit of the regulation. However, text messages offer limited real estate, and I suspect that egregious implementations will have as negative an impact on the sender as on the receiver.

In summary: if you have a website or legal response that immediately acknowledges abuse on your platform and explains how you are not responsible, you are operating in bad faith.

Let's go Bandwidth.com

Source image from Wikipedia

Appendix

Tools used

Graphs built using Amazon Quicksight
Keynote
Spreadsheets
Skitch
Twilio Phone Lookup

Screenshots of text messages

I have truncated domains and other information that may be associated with my phone.

How to read this data

Messages are organized by categories

Spam - Greeting
Spam - Survey
Election Poll
Fraud - Banking
Britney Spears
Real Estate
Political
Democratic
Republican

Within categories, messages are ordered left to right, top to bottom.

Messages on the left arrived before messages on the right
Messages above arrived before messages below

The number of screenshots will exceed the number of messages in the charts. I did not count images and multiple text messages sent at the same time as distinct messages.

Spam - Survey

Spam - Greeting

Election Poll

Fraud - Banking

Britney Spears

Real Estate

Political

Democratic

Republican

Bandwidth.com's Acceptable Use Policy

Section: Continuous or Repetitive Calls and Messaging.

Carrier Mappings

Name reported by Twilio	Short name for graphs
Bandwidth SMSEnabled - Bandwidth CLEC - Sybase365	Bandwidth.com
Bandwidth.com CLEC, LLC	Bandwidth.com
Bandwidth/13 - Bandwidth.com - SVR	Bandwidth.com
Bandwidth/20 - Bandwidth.com - SVR	Bandwidth.com
Bandwidth/Zipwhip/3 - Toll-Free - SVR	Bandwidth.com
Commio, LLC	Commio
Google (Grand Central) - SVR	Google Grand Central
Hook Mobile - Sybase365	Hook Mobile
Null	Null
Unknown	Unknown
Plivo - SVR	Plivo
T-Mobile USA, Inc.	T-Mobile USA
Telefinity/teli.net - SVR	Telefinity
Telnyx - Level3 - SVR	Telnyx
Telnyx - Telnyx - SVR	Telnyx
Telnyx - Windstream - SVR	Telnyx
TextNow - Bandwidth.com - SVR	Bandwidth.com
TextNow - Neutral Tandem - SVR	TextNow
Twilio - SMS/MMS-SVR	Twilio
Twilio - Toll-Free - SMS-Sybase365/MMS-SVR	Twilio

Updates

August 29, 2022

This morning, I received another realtor call for Stan. I explained the situation to the realtor, and he took the time to share information on their data set with me. This particular realtor is sourcing data from three companies.

True People Search [ truepeoplesearch.io ]
Fast People Search [ fastpeoplesearch.info ]
Lexis Nexis [ lexisnexis.com ]

He also shared that there were two phone numbers listed for Stan: my phone number and another number with the same last seven digits but a different area code. Example:

415-123-1234
210-123-1234

I appreciate another realtor taking the time to help triage data broker false positives. I recommend inquiring into the false positive rate and data validation methods before entering into any contract or commercial services with these companies. There is no method for me to validate the statistical occurrence of matching numbers for individuals across area codes. I suspect the occurrence rate to be exceptionally small and this example to be indicative of serious quality control issues.
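
Someone with access to such a data set could flag this pattern mechanically: normalize each number to digits, then compare the last seven digits and the area code separately. A minimal sketch using the example numbers above; the function names are hypothetical, and it assumes 10-digit North American numbers with an optional leading 1.

```python
import re


def normalize(number: str) -> str:
    """Reduce a phone number to 10 digits, dropping punctuation and a leading 1."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected a 10-digit number, got {number!r}")
    return digits


def same_local_number_different_area_code(first: str, second: str) -> bool:
    """True when the last seven digits match but the area codes differ."""
    a, b = normalize(first), normalize(second)
    return a[3:] == b[3:] and a[:3] != b[:3]
```

Running this over every pair of numbers attached to one person would give a rough count of how often a record carries this kind of near-duplicate.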

I have reached out via a contact form to Fast People Search and email to True People Search in an attempt to have my information removed from Stan's records. Both sites have opt-out forms, but they are only for the person associated with the data, Stan.

Contents may not be republished without written consent.

]]>