Why use a specialist database at all#
Before getting into why we picked BigQuery, let’s look at why we need a specialist database at all.
The way data is physically laid out in a database makes a huge difference to query performance. Databases like Postgres and MySQL are row-oriented: rows are often arranged more or less randomly on disk, wherever free space happens to be. That’s fine, because the normal operations are read-one-row, write-one-row. These databases use locking to guarantee that once a row has been changed, subsequent queries are applied sequentially and there’s no ‘uncertainty’ about the underlying data.
It makes a lot of sense to use this type of database for, say, an e-commerce platform, where you don’t want any uncertainty about a user’s transactions.
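To make that concrete, here is a minimal sketch of the row-at-a-time, strongly consistent workload a row store is built for (Postgres-style SQL; the accounts table and its columns are hypothetical):

```sql
BEGIN;

-- Lock this single row so no other transaction can change it mid-flight.
SELECT balance FROM accounts WHERE user_id = 42 FOR UPDATE;

-- Update just that one row.
UPDATE accounts SET balance = balance - 100 WHERE user_id = 42;

COMMIT;
```

Every query that runs after the COMMIT sees the new balance; that is exactly the kind of certainty a row store is designed to deliver.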
The trouble with these guarantees is that changing the database is inherently slow; every row change has to go through the machinery that keeps the database consistent. What’s more, supporting fast individual row updates means data ends up scattered across the disk, which prevents long contiguous reads.
The solution to this problem is to transpose the data into columns and relax the guarantees. Instead of storing the data row-wise, column-oriented databases save each column as one long list. For applications like time series (which is the underlying data structure of all web analytics), this makes a lot of sense, as they are append-only; data is always written at the end of the list. Updating an individual row becomes slow, since it means scanning and rewriting a whole column list to change a single value, but reading a column into memory is now fast.
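In BigQuery terms, a web-analytics table looks something like the sketch below (the dataset, table, and column names are ours for illustration, not a real schema): writes only ever append new rows, and analytical reads sweep whole columns.

```sql
-- Hypothetical append-only events table for pageviews.
CREATE TABLE analytics.pageviews (
  event_time TIMESTAMP NOT NULL,
  site_id    STRING    NOT NULL,
  path       STRING,
  referrer   STRING,
  country    STRING
)
PARTITION BY DATE(event_time)  -- one partition per day of data
CLUSTER BY site_id;            -- co-locate each site's rows within a partition

-- Writes only ever append new rows to the end of the table...
INSERT INTO analytics.pageviews (event_time, site_id, path, referrer, country)
VALUES (CURRENT_TIMESTAMP(), 'site_123', '/pricing', 'https://example.com', 'NL');

-- ...and analytical reads scan whole columns rather than individual rows.
SELECT country, COUNT(*) AS pageviews
FROM analytics.pageviews
GROUP BY country;
```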
Time series databases#
The truth is, any reasonable time-series database should be able to do fast analytics. What’s more, time-series data is usually simple enough that (in theory) it should be portable between databases, or even replicated across several.
What our competitors have done#
Our (very clever) competitors have gone a different route to us.
Fathom Analytics settled on SingleStore, a closed-source, commercial, managed, MySQL-compatible database. I estimate this is costing them between $5k and $10k a month, perhaps more. SingleStore tends to focus on enterprise clients in financial services, and it is likely a very good fit; the raw speed of Fathom Analytics shows just how impressive SingleStore is.
Plausible and PostHog use ClickHouse, an open-source project that spun off from Yandex and is the leading time-series database. ClickHouse is an incredible product, with some remarkable optimisations for time series. Both companies manage their own servers, although it is possible to have ClickHouse hosted on Yandex Cloud.
Managing servers for analytics can be tricky: data volumes grow quickly over time, and because ingest runs 24/7, downtime isn’t an option, even while expanding storage, applying updates, or scaling instances.
Why we have gone a different way#
In a solution, we were looking for scalable, near-limitless storage and compute; the ability to isolate customer data and keep it in the EU for GDPR purposes; and freedom from maintaining a service, hardware, and backups. On top of that, we get Google’s guarantees around uptime and data encryption.
Trade-offs#
Speed
There’s no doubt about it: BigQuery is slower than most time-series databases. Almost all of the difference is query start-up time, which is generally on the order of 1-3 seconds. (The new JOB_CREATION_OPTIONAL setting in BigQuery reduces this somewhat, but it’s still slow.)
Once it gets going, a properly optimised analytics query is very fast: the data all sits on SSDs, and 100+ cores can be assigned to your query on demand.
The question for us was whether this was an acceptable amount of time for a user to wait to see the results on their dashboard. Most queries return in under 3 seconds, and we felt this was an appropriate trade-off. Our competitor Fathom Analytics has remarkable speed on their dashboard; there’s no way we can match that with BigQuery (at least not yet, but we’re working on it!).
Uptime
Because BigQuery is fully managed, uptime is Google-level. Suffice it to say, this guarantee is far better than anything an independent company can provide.
Cost
There are three main ‘costly’ activities on BigQuery.
Ingest & Storage
Ingest and storage cost very little. Data is ingested directly from Cloudflare, bypassing any application servers we would otherwise have to manage. As it ages, data is automatically moved to cheaper storage classes within BigQuery.
Query & Compute
BigQuery’s pay-as-you-go model charges only for the amount of data scanned, which, at the time of writing, is $6.25 per TB.
With good query optimisation, partitioning, and clustering, it’s possible, to a degree, to reduce the amount of data each query scans.
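Sticking with the hypothetical pageviews table sketched earlier, a dashboard query that filters on the partitioning column (event_time) and the clustering column (site_id) only pays for the blocks it actually touches:

```sql
-- Only the January 2024 partitions are scanned, and within them only the
-- blocks clustered under this site_id; everything else is pruned.
SELECT DATE(event_time) AS day, COUNT(*) AS pageviews
FROM analytics.pageviews
WHERE event_time >= TIMESTAMP('2024-01-01')
  AND event_time <  TIMESTAMP('2024-02-01')
  AND site_id = 'site_123'
GROUP BY day
ORDER BY day;
```

At $6.25/TB, a query that prunes its scan down to, say, 10 GB costs roughly $0.06; sweeping an entire unpartitioned table for the same answer could easily cost a hundred times that.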
Our usage pattern is ingest-heavy and query-light (customers typically run a modest number of queries per month relative to the data ingested), which makes BigQuery cost-effective for us.
The real cost saving
Our engineers never have to think about instances, provisioning, running out of disk space, backup scripts, replication, running out of RAM, concurrency, locking, or uptime. All we really do is write SQL and the rest just happens. Compared with managing a ClickHouse cluster, which requires specialised engineers familiar with ClickHouse internals and performance tuning, we hit the ground running.
It’s why, in a short time, we’ve managed to turn out a scalable product that we believe is well made, more reliable, and more functional than many of our competitors’: by leaving the backend tech to Google and focussing on our product. We can always migrate (or replicate) to another platform later.
Interested in seeing what we built with BigQuery? Try out Glass Analytics today.