Data - Kenneth's Corner

Why I built Serj

24 January 2023 - Posted in Data by Kenneth Kinion

Validin started as a company with two data sources and a small amount of related intellectual property: a huge database of IP address and domain references from hundreds of places around the Internet, some of which generate thousands of artifacts a day; an absolutely enormous database of DNS associations going back to mid-2019; and the tooling to collect this data.

My initial plan with Validin was to find insightful ways to package the DNS data so it could be sold as a service. However, the scale of the data made it challenging to find appropriate and cost-effective tooling. By the numbers:

~2TB/day of source PCAPs, whittled down to ~30GB of data per day and gathered every day for years, in a bz2-compressed, parquet-like format
~2.5 billion fully-qualified domain names (FQDNs) with at least daily DNS association refreshes in 3 categories
~7.5 billion total associations gathered per day (more, if including bi-directional associations)
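
To put those numbers in perspective, here is a quick back-of-the-envelope check. It assumes one association per FQDN per category per day and decimal units (1 TB = 1000 GB); the constant names are just for illustration.

```python
# Back-of-the-envelope check of the figures above.
# Assumptions (not from the collection pipeline itself): one association per
# FQDN per category per day, and decimal units (1 TB = 1000 GB).
FQDNS = 2_500_000_000        # ~2.5 billion FQDNs
CATEGORIES = 3               # DNS association categories
RAW_GB_PER_DAY = 2_000       # ~2TB/day of source PCAPs
STORED_GB_PER_DAY = 30       # ~30GB/day after reduction

associations_per_day = FQDNS * CATEGORIES
reduction_factor = RAW_GB_PER_DAY / STORED_GB_PER_DAY
stored_tb_per_year = STORED_GB_PER_DAY * 365 / 1000

print(f"associations/day: {associations_per_day:,}")
print(f"PCAP reduction:   ~{reduction_factor:.0f}x")
print(f"stored per year:  ~{stored_tb_per_year:.2f} TB")
```

Even after the reduction from raw PCAPs, that is roughly 11 TB of new data per year that has to stay queryable.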

I considered a number of solutions for storing, indexing, and querying this data:

Traditional databases, like SQLite and PostgreSQL
Non-traditional databases, like MongoDB
Cloud databases, like AWS DocumentDB, GCP Bigtable, and AWS DynamoDB
Cloud serverless databases, like Athena
Semi-managed cloud big data platforms, like Amazon EMR

In each case, off-the-shelf solutions had some combination of at least two (and possibly ALL) of the following problems:

The solution was going to cost too much
- Usually: WAY too much
- Additionally, pricing was often VERY difficult to predict
The scale of my data couldn't easily be handled by the solution OR the latency between query and response would have been too high to put behind an API
The solution couldn't handle my questions without non-negligible engineering effort (e.g., CIDR range queries)
The solution was non-portable - I'd be locked into a specific cloud provider

However, I'd spent the previous 3.5 years building, tweaking, and using custom, high-performance databases for my previous employer. I had some ideas. I wanted:

Type flexibility - support arbitrary data types and complex (multi-dimensional) associations at the platform layer
Support for hierarchical queries, like those needed for searching for keys within a domain zone, or for IPs within an arbitrary CIDR
Serialized backing files that are small enough to be stored uncompressed at rest, allowing them to be queried directly from the cloud
Low fleet maintenance overhead - support full query range with very limited static infrastructure so I only have to scale when needed
Move operational burden to the transform/serialization step, which is one of my core competencies

The above were turned into requirements, and Serj was born!
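
To make the "hierarchical queries" item from the wish list concrete, here is a minimal sketch of one common way to support them: turn each query into a range scan over sorted keys. This is not how Serj is implemented; it only illustrates the idea. The function names (`domain_key`, `zone_range`, `cidr_range`) are hypothetical, the in-memory lists stand in for a real index, and the CIDR part handles IPv4 only.

```python
import bisect
import ipaddress


def domain_key(fqdn: str) -> str:
    """Reverse the labels so every name in a zone shares a key prefix.

    "www.example.com" -> "com.example.www."
    """
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(reversed(labels)) + "."


def zone_range(sorted_keys: list[str], zone: str) -> list[str]:
    """Return all keys at or below `zone` from a sorted list of domain keys."""
    prefix = domain_key(zone)
    # Smallest string that sorts after every key starting with `prefix`:
    # bump the trailing "." to the next character.
    upper = prefix[:-1] + chr(ord(".") + 1)
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


def cidr_range(sorted_ips: list[int], cidr: str) -> list[str]:
    """Return all IPv4 addresses inside `cidr` from a sorted list of ints."""
    net = ipaddress.ip_network(cidr, strict=False)
    lo = bisect.bisect_left(sorted_ips, int(net.network_address))
    hi = bisect.bisect_right(sorted_ips, int(net.broadcast_address))
    return [str(ipaddress.ip_address(n)) for n in sorted_ips[lo:hi]]


# Hypothetical sample data standing in for a real index.
names = ["example.com", "www.example.com", "a.b.example.com",
         "notexample.com", "example.com.evil.net"]
keys = sorted(domain_key(n) for n in names)

addrs = ["10.0.0.1", "10.0.1.5", "10.0.1.200", "10.0.2.1", "192.168.1.1"]
ips = sorted(int(ipaddress.ip_address(a)) for a in addrs)

in_zone = zone_range(keys, "example.com")
in_cidr = cidr_range(ips, "10.0.1.0/24")
print(in_zone)
print(in_cidr)
```

Reversing the labels is what keeps lookalikes such as notexample.com and example.com.evil.net out of the example.com zone scan, and storing IPs as integers lets a CIDR collapse into a single contiguous range.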