Validin started as a company with two data sources and a small amount of related IP: a huge database of IP and domain references from hundreds of places around the Internet, some of which generate thousands of artifacts a day; an absolutely enormous database of DNS associations going back to mid-2019; and the tooling to collect this data.
My initial plan with Validin was to find insightful ways to package the DNS data so it could be sold as a service. However, the scale of the data made it challenging to find appropriate and cost-effective tooling. By the numbers:
- ~2TB/day of source PCAPs, whittled down to ~30GB of data per day, every day for years, stored in a bz2-compressed, parquet-like format
- ~2.5 billion fully-qualified domain names (FQDNs) with at-least-daily DNS association refreshes in 3 categories
- ~7.5 billion total associations gathered per day (2.5 billion FQDNs times 3 categories; more, if including bi-directional associations)
I considered a number of solutions for storing, indexing, and querying this data:
- Traditional databases, like SQLite and PostgreSQL
- Non-traditional databases, like MongoDB
- Cloud databases, like AWS DocumentDB, GCP Bigtable, and AWS DynamoDB
- Cloud serverless databases, like Athena
- Semi-managed cloud big data platforms, like Amazon EMR
In each case, off-the-shelf solutions had some combination of at least two (and possibly ALL) of the following problems:
- The solution was going to cost too much
  - Usually: WAY too much
  - Additionally, pricing was often VERY difficult to predict
- The solution couldn't easily handle the scale of my data, OR the latency between query and response would have been too high to put behind an API
- The solution couldn't handle my questions without non-negligible engineering effort (e.g., CIDR range queries; see the sketch after this list)
- The solution was non-portable - I'd be locked into a specific cloud provider
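
To make that engineering effort concrete: in an engine with no native IP type, such as SQLite, even a simple CIDR containment query has to be hand-rolled by the application as an integer range scan. Here's a minimal sketch of the required glue code (the schema is hypothetical, purely for illustration):

```python
import ipaddress
import sqlite3

# Hypothetical schema: SQLite has no native IP type, so IPv4
# addresses are stored as integers to make range scans indexable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dns_assoc (ip INTEGER, fqdn TEXT)")
conn.execute("CREATE INDEX idx_ip ON dns_assoc (ip)")
conn.execute(
    "INSERT INTO dns_assoc VALUES (?, ?)",
    (int(ipaddress.ip_address("192.0.2.7")), "example.com"),
)

def query_cidr(conn, cidr):
    """The engine knows nothing about CIDRs: the application must
    translate the block into inclusive integer bounds and issue a
    plain BETWEEN query against the integer index."""
    net = ipaddress.ip_network(cidr)
    lo, hi = int(net.network_address), int(net.broadcast_address)
    return conn.execute(
        "SELECT ip, fqdn FROM dns_assoc WHERE ip BETWEEN ? AND ?",
        (lo, hi),
    ).fetchall()

print(query_cidr(conn, "192.0.2.0/24"))  # [(3221225991, 'example.com')]
```

Multiply that kind of shim by every query shape I needed to support, and the integration cost adds up quickly.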
However, I'd spent the preceding 3.5 years building, tweaking, and using custom, high-performance databases for my previous employer. I had some ideas. I wanted:
- Type flexibility - support arbitrary data types and complex (multi-dimensional) associations at the platform layer
- Support for hierarchical queries, like those needed for searching for keys within a domain zone, or for IPs within an arbitrary CIDR (sketched below)
- Serialized backing files that are small enough to be stored uncompressed at rest, allowing them to be queried directly from the cloud
- Low fleet maintenance overhead - support the full range of queries with very limited static infrastructure so I only have to scale when needed
- Operational burden moved to the transform/serialization step, which is one of my core competencies
The above were turned into requirements, and Serj was born!
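
Serj's internals are a story for another post, but the standard trick behind hierarchical queries like these is an order-preserving key encoding: reverse a domain's labels so that every name in a zone shares a common prefix, and store IPs as big-endian bytes so that any CIDR block maps to one contiguous key range. A minimal sketch of the idea (plain Python, with helper names of my own invention, not Serj's actual format):

```python
import ipaddress

def domain_key(fqdn: str) -> bytes:
    """Reverse an FQDN's labels so every name in a zone shares a
    prefix: 'mail.example.com' -> b'com.example.mail'. A zone query
    then becomes a prefix scan over sorted keys. (A real encoding
    would terminate each label so 'com.example' can't falsely match
    'com.examplefoo'.)"""
    return ".".join(reversed(fqdn.rstrip(".").split("."))).encode()

def ip_key(ip: str) -> bytes:
    """Encode an IP as big-endian bytes so numeric order matches
    byte order; any CIDR then maps to one contiguous key range."""
    return ipaddress.ip_address(ip).packed

def cidr_bounds(cidr: str) -> tuple[bytes, bytes]:
    """Inclusive (low, high) key bounds covering a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return net.network_address.packed, net.broadcast_address.packed

# Both query shapes reduce to a single range scan over sorted keys.
assert domain_key("mail.example.com").startswith(domain_key("example.com"))
lo, hi = cidr_bounds("203.0.113.0/24")
assert lo <= ip_key("203.0.113.7") <= hi
```

With keys like these kept sorted inside the serialized backing files, a range scan over a sorted file can be served with binary search and byte-range reads against cloud object storage, which is one way to satisfy the backing-file requirement above.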