Kenneth's Corner

Engineering, data, and testing

Why I built Serj

- Posted in Data by

Validin started as a company with two data sources and a small amount of related intellectual property: a huge database of IP and domain references from hundreds of places around the Internet, some of which generate thousands of artifacts a day; an absolutely enormous database of DNS associations going back to mid-2019; and the tooling to collect this data.

My initial plan with Validin was to find insightful ways to package the DNS data so it could be sold as a service. However, the scale of the data made it challenging to find appropriate and cost-effective tooling. By the numbers:

  • ~2 TB/day of source PCAPs, whittled down to ~30 GB/day of data in a bz2-compressed, Parquet-like format, gathered every day for years
  • ~2.5 billion fully-qualified domain names (FQDNs) with at least daily DNS association refreshes in 3 categories
  • ~7.5 billion total associations gathered per day (more, if including bi-directional associations)
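
As a quick sanity check on how these figures relate, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 1000 GB) and roughly one association per FQDN per category per day. The bytes-per-association figure additionally assumes that the ~30 GB/day holds exactly these associations, which the numbers above do not state.

```python
# Back-of-the-envelope arithmetic for the figures above.
# All inputs are the approximate numbers quoted in the list.
FQDNS = 2_500_000_000        # ~2.5 billion FQDNs
CATEGORIES = 3               # association categories refreshed daily
RAW_GB_PER_DAY = 2_000       # ~2 TB/day of source PCAPs (decimal units assumed)
STORED_GB_PER_DAY = 30       # ~30 GB/day after reduction and compression

# Assumption: about one association per FQDN per category per day.
associations_per_day = FQDNS * CATEGORIES

# How much smaller the stored data is than the raw capture.
reduction_factor = RAW_GB_PER_DAY / STORED_GB_PER_DAY

# Storage growth if the daily volume stays constant.
stored_tb_per_year = STORED_GB_PER_DAY * 365 / 1000

# Assumption: the stored 30 GB covers exactly these associations.
compressed_bytes_per_association = STORED_GB_PER_DAY * 1e9 / associations_per_day

print(f"{associations_per_day:,} associations/day")
print(f"~{reduction_factor:.0f}x reduction from raw PCAPs")
print(f"~{stored_tb_per_year:.1f} TB/year of stored data")
print(f"~{compressed_bytes_per_association:.0f} compressed bytes per association")
```

Under those assumptions the stored data grows by roughly 11 TB per year, which gives a sense of why per-query and per-GB cloud pricing matters so much at this scale.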

I considered a number of solutions for storing, indexing, and querying this data:

  • Traditional databases, like SQLite and PostgreSQL
  • Non-traditional databases, like MongoDB
  • Cloud databases, like AWS DocumentDB, GCP Bigtable, and AWS DynamoDB
  • Cloud serverless databases, like Athena
  • Semi-managed cloud big data platforms, like Amazon EMR

In each case, off-the-shelf solutions had some combination of at least two (and possibly ALL) of the following problems:

  • The solution was going to cost too much
    • Usually: WAY too much
    • Additionally, pricing was often VERY difficult to predict
  • The scale of my data couldn't easily be handled by the solution OR the latency between query and response would have been too high to put behind an API
  • The solution couldn't handle my questions (e.g., CIDR range queries) without non-negligible engineering effort
  • The solution was non-portable - I'd be locked into a specific cloud provider

However, I'd spent the previous 3.5 years building, tweaking, and using custom, high-performance databases for my previous employer. I had some ideas. I wanted:

  • Type flexibility - support arbitrary data types and complex (multi-dimensional) associations at the platform layer
  • Support for hierarchical queries, like those needed for searching for keys within a domain zone, or for IPs within an arbitrary CIDR
  • Serialized backing files that are small enough to be stored uncompressed at rest, allowing them to be queried directly from the cloud
  • Low fleet maintenance overhead - support full query range with very limited static infrastructure so I only have to scale when needed
  • Operational burden shifted to the transform/serialization step, which is one of my core competencies
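
To make the hierarchical-query requirement concrete, here is a minimal Python sketch of one common way to support it: store keys in sorted order so that "everything in a zone" or "everything in a CIDR" becomes a contiguous range found by binary search. This is an illustration only, not a description of how Serj works; the function names (`domain_key`, `zone_range`, `cidr_range`) and the sample data are hypothetical, and the IP part assumes IPv4 addresses only.

```python
import bisect
import ipaddress


def domain_key(fqdn):
    """Reverse the labels so that a zone becomes a key prefix.

    "www.example.com" -> ("com", "example", "www")
    """
    return tuple(reversed(fqdn.lower().rstrip(".").split(".")))


def zone_range(sorted_keys, zone):
    """Return the zone apex and every name beneath it from a sorted key list."""
    prefix = domain_key(zone)
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Smallest tuple that sorts after every key starting with `prefix`:
    # append "\0" to the last label, so ("com", "example") is bounded by
    # ("com", "example\0"), which excludes ("com", "example-foo").
    upper = prefix[:-1] + (prefix[-1] + "\0",)
    hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


def cidr_range(sorted_ips, cidr):
    """Return every address (as an int) inside `cidr` from a sorted int list."""
    net = ipaddress.ip_network(cidr)
    lo = bisect.bisect_left(sorted_ips, int(net.network_address))
    hi = bisect.bisect_right(sorted_ips, int(net.broadcast_address))
    return sorted_ips[lo:hi]


# Hypothetical sample data.
names = ["www.example.com", "example.com", "mail.example.com",
         "example-foo.com", "example.org", "a.b.example.com"]
keys = sorted(domain_key(n) for n in names)
in_zone = [".".join(reversed(k)) for k in zone_range(keys, "example.com")]

addrs = ["192.0.2.1", "192.0.2.255", "192.0.3.0", "198.51.100.7", "192.0.1.255"]
ips = sorted(int(ipaddress.ip_address(a)) for a in addrs)
in_cidr = [str(ipaddress.ip_address(i)) for i in cidr_range(ips, "192.0.2.0/24")]

print(in_zone)
print(in_cidr)
```

The design point is that both query types reduce to two binary searches over sorted keys, which works equally well on a sorted file as on an in-memory list. A general-purpose database usually needs extra indexing work to get the same behavior.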

I turned these into requirements, and Serj was born!