Using ClickHouse to scale an events engine (github.com/getlago)
237 points by wyndham on April 11, 2024 | 97 comments


> Recently, the most interesting rift in the Postgres vs OLAP space is [Hydra](https://www.hydra.so), an open-source, column-oriented distribution of Postgres that was very recently launched (after our migration to ClickHouse). Had Hydra been available during our decision-making time period, we might’ve made a different choice.

There will likely be a good OLAP solution (possibly implemented as an extension) in Postgres in the next year or so. A few companies are working on it (Hydra, Parade[0], Tembo, etc.).

0 - https://www.paradedb.com/


> 0 - https://www.paradedb.com/

this looks like a repackaging of DataFusion as a PG extension?


yes, that's a succinct way to put it.



That’s cool. ClickHouse’s and Alloy’s performance is impressive.


That benchmark is very weak: they used just 100M rows, which is laughable, and no joins were tested.


Having no joins heavily favors ClickHouse (the creator of the benchmark). I'm not sure whether it's gotten better since I last seriously looked, but CH's join performance was really bad.


ParadeDB founder here. You can see how we compare to other Postgres-based analytical offerings on ClickBench here: https://blog.paradedb.com/pages/introducing_analytics


[flagged]


That is an incorrect and baseless accusation; we had nothing to do with "Postgres (tuned)". My commits are only in the `hydra` folder. There are no restrictions on how you set up the benchmark in ClickBench, and the settings we use there are analogous to what we use on our cloud service for a similarly sized instance.

As the linked post points out, the main 'advantage' of the "tuned" benchmark is the indexes, which are tuned specifically to the queries in the benchmark. We do not use indexes in our version of the benchmark, aside from the primary key (which actually provides no performance advantage).


I apologize for falsely claiming that it was Hydra devs who committed the Postgres config.

However, I think the problem stands: your main marketing pitch is comparing HydraDB to an undertuned PG, right on the landing page of your project.

> the main 'advantage' of the "tuned" benchmark is the indexes

I am not sure which post you are referring to, but unless you or someone else has analyzed the execution plans for all PG queries in that benchmark and verified that the indexes are actually used, it is just speculation without evidence.

Another issue with this comparison is that ClickBench is a toy micro-benchmark with just 100M records. Increasing the data size may or may not be beneficial for HydraDB.


I did not know that... That's very disingenuous! We'll make sure to keep this in mind as we make our own comparisons. We haven't tuned ParadeDB at all in our published ClickBench benchmarks, so they should be more representative.


I don't think `tembo` is working on it though; they're probably just hosting an existing extension.


So are ParadeDB and Hydra using the same codebase, or just a similar approach?


ParadeDB and Hydra are completely different. We're tackling the same problem of bringing analytics inside Postgres, but using different approaches.

ParadeDB integrates industry standards like Arrow, Parquet, and DataFusion to offer columnar storage + vectorized processing. Hydra is building on top of Citus Columnar.

You can read about our approach here: https://blog.paradedb.com/pages/introducing_analytics


I feel like with all the Clickhouse praise on HN that we /must/ be doing something fundamentally wrong because I hate every interaction I have with Clickhouse.

* Timeouts (only 30s???) unless I used the cli client

* Cancelling rows - Just kill me, so many bugs and FINAL/PREWHERE are massive foot-guns

* Clusters just feel annoying and fragile; don't forget "ON CLUSTER" or you'll have a bad time

Again, I feel like we must be doing something wrong but we are paying an arm and a leg for that "privilege".


What is your use case? If you're deleting rows, that already feels like it may not be the intended use case. I think of ClickHouse as taking in a firehose of immutable data that you want to aggregate/analyze/report on; let's say a million records per second. To make up an example: the orientation, speed, and acceleration of every Tesla vehicle in the world, in real time, every second.


It's to power all our analytics. We ETL data into it; some data is write-once, so we don't have updates/deletes, but a number of our tables have summary data ETL'd into them, which means cleaning up the old rows.

I'm sure CH shines for insert-only workloads but that doesn't cover all our needs.


You have already gotten excellent options from the other comments, but here's another one that's not been mentioned yet.

You may want to consider adjusting your partition key (if feasible) as a function of datetime so you can just drop a complete partition when required, rather than needing separate delete queries.

In my experience, it has proven to be a very quick and clean way to clear out older data.
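A minimal sketch of that pattern (table and column names are made up; not tested against any particular setup):

    CREATE TABLE events
    (
        event_time DateTime,
        user_id    UInt64,
        payload    String
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (user_id, event_time);

    -- Dropping a whole month is a cheap metadata operation,
    -- unlike deleting the same rows with a mutation.
    ALTER TABLE events DROP PARTITION 202303;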


Have you looked into the ReplacingMergeTree table engine? (Although we still needed to use FINAL with this one)


You can always use different databases for different use cases.

There are many applications that require extremely high insertion rates (millions of records per second), very large total row counts (billions, trillions), and flexible/fast querying/aggregation with high read rates (hundreds of millions of rows/s or higher). That's the sweet spot IMO for ClickHouse, and where you'll be hard-pressed to find alternatives. I'm sure it can be used in other situations, but there may be more choices if you're in one of those.


>You can always use different databases for different use cases.

Unfortunately this is not always realistic, especially in large organizations. Where I am, there is a big push from the top (i.e. the IT budget people) to standardize everything; they want to simplify licenses, support contracts, etc.

I may not be doing cutting-edge stuff (I work at an industrial plant), but we do have mixed data use cases where it could be beneficial to use different DBs; realistically, though, I don't see it happening.


CH works just fine for cleaning up rows: delete with mutations_sync=1, use OPTIMIZE with DEDUPLICATE BY, use aggregating tables and OPTIMIZE FINAL, or query aggregate tables with the final=1 setting.

Numerous ways to achieve removal of old/stale rows.
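Rough examples of the first two, assuming a hypothetical `events` table:

    -- Mutation-based delete; mutations_sync = 1 makes the statement
    -- wait until the mutation has been applied on this replica.
    ALTER TABLE events DELETE WHERE event_time < '2023-01-01'
    SETTINGS mutations_sync = 1;

    -- Rewrite parts and drop rows that are duplicates on the chosen columns.
    OPTIMIZE TABLE events FINAL DEDUPLICATE BY user_id, event_time;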


Sounds like you need to use a ReplacingMergeTree + final keyword.


Tableau


Most, though certainly not all, problems I see with ClickHouse usage come from pretending it is another database or that it is intended for other use cases.


> * Timeouts (only 30s???) unless I used the cli client

Almost all clients (client libraries) allow a configurable timeout. In the server settings there is a max query time setting which can be adjusted if necessary: https://clickhouse.com/docs/en/operations/settings/query-com...
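For example, it can be overridden per query or per session (hypothetical `events` table; the value here is arbitrary):

    -- Per query:
    SELECT count() FROM events SETTINGS max_execution_time = 300;

    -- Per session:
    SET max_execution_time = 300;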


From the docs on FINAL:

> However, using FINAL is sometimes necessary in order to produce accurate results

Welp.


If you use tables like “ReplacingMergeTree”, the docs _explicitly_ state that merges happen in the background and that non-merged rows _will_ be visible.

It’s a table design optimised for specific workloads, and the docs and design detail those tradeoffs.

We use it at work for workloads that can tolerate “retreading” over stale data, because it means they can write to the DB efficiently, without round-tripping, locking, or row updates, and without the table growing massive. It works fantastically for our use case.
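A minimal sketch of that pattern (hypothetical table; the version column and sorting key are illustrative):

    -- ReplacingMergeTree keeps the newest row per sorting key, but only
    -- collapses duplicates when parts are merged in the background.
    CREATE TABLE user_state
    (
        user_id    UInt64,
        updated_at DateTime,
        state      String
    )
    ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY user_id;

    -- Without FINAL you may read several versions of the same user_id;
    -- FINAL applies the merge semantics at read time, at a cost.
    SELECT * FROM user_state FINAL WHERE user_id = 42;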


It's meant to store immutable data, and isn't great if you need low-latency updates. Also it's quirky in some ways.


> It's meant to store immutable data

I don't disagree; I feel like we might be using it wrong. We were trying to replace ES with it, but it just doesn't feel like it fits our use case.


How many rows do you have on average per day?


A couple million (<10M); I don't have a better number available right now. Thankfully, not all (or even most) of those need cancelling rows.


You don’t need a cluster, nor should you be having any of the issues you mentioned. I run 10x that volume daily on a single GCP box (8 core / 64GB). We migrated off BigQuery and went from $10k/mo to about $250/mo. And it’s faster for both low-latency and big slow queries.


The plan is to 10x that volume in the not too distant future but given what you've said I can believe we are horribly over-provisioned/over-scaled. Thank you!


It sounds like you’re probably using Clickhouse Cloud? If so, I was not impressed. Overly pushy sales people, pricing isn’t competitive, and they’re trying to cater to the snowflake/databricks crowd without smoothing any rough edges (like the default timeout being enabled on a GUI).

Overall I’d say CH isn’t as tolerant or forgiving as BigQuery, Snowflake, or Databricks. You can write the worst SQL possible and BQ will happily charge you $5/TB for that cartesian self-join. CH meanwhile will error with memory limit or even crash.


We are using Altinity. I believe it's a 3-server cluster and we have 2 clusters (our prod one and another one we are trying to promote to production once our data integrity checks pass, at which point we will spin the other down).


Robert Hodges of Altinity will likely be here shortly, they monitor HN for any mention of clickhouse, maybe they can get you fixed up without their standard $10k/mo support contract.


Hi Josh, sorry to hear about the issues. Sounds like things that should be solvable. ClickHouse does require thinking a bit differently from other stores like BigQuery. I sent you an email. Feel free to contact me at the email posted on my HN account. Looking forward to helping you bend ClickHouse to your will.


Interesting about "Timeouts (only 30s???)" - most likely, this is a limitation configured explicitly for a user on your server. You can control it with the `max_execution_time` setting; by default, it is unlimited.

For example, I've set it up, along with many more limitations for my public playground https://play.clickhouse.com/, and it allows me to, at least, make it public and not worry much.

It could also be a configuration of a proxy if you connect through a proxy. ClickHouse has built-in HTTP API, so you can query it directly from the browser or put it behind Cloudflare, etc... Where do you host ClickHouse?


I can believe it's a config issue; I'll have to look into it. I didn't set up the cluster/DBs, and when I asked about it I was told "use the CLI". I'll see if I can get that fixed.


Any chance your CH is proxied through a Heroku app? Heroku has 30s timeouts.


What foot guns have you run into with FINAL?


Just forgetting to use it, or PREWHERE. Since queries run just fine without them, you can think you have something working when you actually have duplicate rows.


Is ClickHouse a suitable engine for analyzing events? Absolutely, as long as you're analyzing a large table, its speed is definitely fast enough. However, you might want to consider the cost of maintaining an OSS ClickHouse cluster, especially when you need to scale up, as the operational costs can be quite high.

If your analysis in Postgres was based on multiple tables and required a lot of JOIN operations, I don't think ClickHouse is a good choice. In such cases, you often need to denormalize multiple data tables into one large table in advance, which means complex ETL and maintenance costs.

For these more common scenarios, I think StarRocks (www.StarRocks.io) is a better choice. It's a Linux Foundation open-source project, with single-table query speeds comparable to ClickHouse (you can check Clickbench), and unmatched multi-table join query speeds, plus it can directly query open data lakes.


> consider the cost of maintaining an OSS ClickHouse cluster

I mean... it is pretty straightforward: 40-60 lines of Terraform, plus Ansible with templates for the proper configs that get exported from Terraform, so you can write out the IPs and the nodes can see each other, and you are done.

What else could you possibly need? Backing up is built into it with S3 support: https://clickhouse.com/docs/en/operations/backup#configuring...

Upgrades are a breeze: https://clickhouse.com/docs/en/operations/update

People insist that OMG MAINTENANCE I NEED TO PAY THOUSANDS FOR MANAGED is better, when in reality, it is not.


ClickHouse is awesome, but as the post shows, some code is involved in getting the data there.

I have been working on Scratchdata [1], which makes it easy to try out a column database to optimize aggregation queries (avg, sum, max). We have helped people [2] take their Postgres with 1 billion rows of information (1.5 TB) and significantly reduce their real-time data analysis query time. Because their data was stored more efficiently, they saved on their storage bill.

You can send data as a curl request and it will get batch-processed and flattened into ClickHouse:

    curl -X POST "http://app.scratchdata.com/api/data/insert/your_table?api_ke..." --data '{"user": "alice", "event": "click"}'

The founder, Jay, is super nice and just wants to help people save time and money. If you give us a ring, he or I will personally help you [3].

[1] https://www.scratchdb.com/ [2] https://www.scratchdb.com/blog/embeddables/ [3] https://q29ksuefpvm.typeform.com/to/baKR3j0p?typeform-source...


My first big win for ClickHouse was replacing a 1.2 TB, billion+ row PostgreSQL DB with ClickHouse. It was static data with occasional full-replacement loads. We got the DB down to ~60 GB, with query speeds about 45x faster.

Now, the Postgres schema wasn't ideal; we could have saved ~3x on it, with corresponding speed increases for queries, with a refactor similar to the ClickHouse schema, but that wasn't really enough to move the needle to near-real-time queries.

Ultimately, the entire clickhouse DB was smaller than the original postgres primary key index. The index was too big to fit in memory on an affordable machine, so it's pretty obvious where the performance is coming from.


This is a nice illustration of the effects of different choices for storage layout and use of compute. ClickHouse blows away single-threaded queries on row-based data for analytic questions. On the other hand PostgreSQL can offer far higher throughput and concurrency when updating a shopping cart.


We use BigQuery a lot for internal analytics and we've been super happy. I don't see a lot of love for BigQuery on HN and I wonder why. Tons of features, no hassle and easy to throw a bunch of TB at it.

I guess maybe the cost?


I'm a big fan of big query as well, but the cost can be problematic if you're not careful.

Generally speaking I've found it manageable if you make good use of partitioning and do incremental aggregation (we use dbt, though you have to do some macro gymnastics to make the partition key filter eligible for pruning due to restrictions on use of dynamic values https://docs.getdbt.com/docs/build/incremental-models)

It's also important to monitor your cost and watch for the point where switching from the per-tb queried pricing model to slots makes sense.
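For the partition-pruning part, a small sketch (dataset, table, and column names are made up):

    -- Assuming `analytics.events` is partitioned on `event_date`, this
    -- scans only the last 7 days of partitions rather than the whole
    -- table, which is what keeps the bytes billed down.
    SELECT user_id, COUNT(*) AS event_count
    FROM analytics.events
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY user_id;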


yeah between partitioning, clustering, materialized views, and smart tuning it seems like there are enough knobs to control costs.


Probably also because it is proprietary and only exists in one cloud platform.


No, it’s because it’s Google, and HN is certain it will get cancelled at any moment.


Seems unlikely; I think it's the most popular Google Cloud product.


We are lucky enough to be able to run BigQuery with flat rate billing. It's incredibly powerful and it's a really good example of SaaS and Serverless done right. It just works.


Yep love it too, especially with external data on GCS. Costs this way are very low. And the convenience is amazing (getting caches you can stream from for every query is a godsend)


What do you mean by the streaming caches?


I was quite surprised that other clouds don’t have an easy-to-get-started analytics data warehouse solution like BigQuery.


This change may make sense for Lago as a hosted multi-tenant service, as offered by Lago the company.

Simultaneously this change may not make sense for Lago as an open-source project self-hosted by a single tenant.

But that may also mean that it effectively makes sense for Lago as a business... to make it harder to self host.

I don't at all fault Lago for making decisions to prioritize their multi-tenant cloud offering. That's probably just the nature of running open-source SaaS these days.


Exactly, I've seen this at Sentry where you now have to run Kafka, Clickhouse, Redis, PG, Zookeeper, memcached and what have you. I get it, but the amount of baggage to handle is a bit difficult.


How were they doing millions of events per minute with Postgres?

I'm struggling with pg write performance ATM and want some tips.


If you're not already doing this: remove unnecessary indices, partition the table, batch your inserts/updates, or try COPY instead of INSERT.
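A rough sketch of the batching/COPY part (table, columns, and file path are made up; COPY reads a server-side path, so from a client you'd typically use psql's \copy instead):

    -- One multi-row INSERT instead of many single-row round trips:
    INSERT INTO events (user_id, event, created_at)
    VALUES (1, 'click', now()),
           (2, 'view',  now()),
           (3, 'click', now());

    -- Bulk loads are usually fastest with COPY:
    COPY events (user_id, event, created_at)
    FROM '/tmp/events.csv' WITH (FORMAT csv);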


Turn off indexing and other optimizations done on a table level


What do you do to then query the data? I usually need indexes so queries are not slow. Perhaps I could insert into a staging table then bulk copy the data over to an indexed table, but that seems silly.


If your application language/framework allows, you can do the batching there: e.g., have your single request handler put work into an (in-memory) queue, then have another thread/async worker pull batches off the queue, do your DB work in batch, and trigger the response to the original handler. In an HTTP context, this is all synchronous from the client's perspective, and you can get 2-10x the throughput at a cost of around 2 ms of latency under load.

I gave more detail with a toy example here: https://news.ycombinator.com/item?id=39245416

I've since played around with this a little more and you can do it pretty generically (at least make the worker generic where you give it a function `Chunk[A] => Task[Chunk[Result[B]]]` to do the database logic). I don't have that handy to post right now, but probably you're not using Scala anyway so the details aren't that relevant.

I've tried out a similar thing in Rust and it's a lot more finicky but still doable there. Should be similar in go I'd think.


Isn't that basically the idea behind the "lambda architecture"? Of course you typically don't use the same product for both the real time and the batch aspects.


You said you struggled with writes... so I offered advice on how to speed up writes... the internet knows a lot more about this than me, though.


Could replicating to a DB with indexing (purely for queries) work?


If one can't keep up, the other one can't either.

You could use partitions though.


What’s your hardware? RDS? Nvme storage?


It's Google Cloud SQL.


And if you use MariaDB, just enable ColumnStore. Why not treat yourself to S3-backed storage while you are there?

It is extremely cost effective when you can scale a different workload without migrating.


This is no shade to Postgres or Maria, but they don’t hold a candle to the simplicity, speed, and cost efficiency of ClickHouse for OLAP needs.


I have tons of OOMs with ClickHouse on larger-than-RAM OLAP queries.

While Postgres works fine (even if it is slower, it actually returns results).


There are various knobs in ClickHouse that allow you to trade memory usage for performance. ( https://clickhouse.com/docs/en/operations/settings/query-com... e.g.)

But yes, I've seen similar issues: running out of memory during query processing is a price you pay for higher performance. You need to know what's happening under the hood and do more work to make sure your queries will work well. I think Postgres can be a thousand or more times slower, and it doesn't have the horizontal scalability, so if you need to do complex queries/aggregations over billions of records, then "actually returns results" doesn't cut it. If Postgres addresses your needs, then great: you don't need to use ClickHouse...


> There are various knobs in ClickHouse that allow you to trade memory usage for performance.

But which knobs should you use, and with what values, in each specific case? The query usually just fails with a generic OOM message, without much information.


It's not actually so esoteric. The two main knobs are

- max_concurrent_queries, since each query uses a certain amount of memory

- max_memory_usage, which is the max per-query memory usage

Here's my full config for running clickhouse on a 2GiB server without OOMs. Some stuff in here is likely irrelevant, but it's a starting point.

    diff --git a/clickhouse-config.xml b/clickhouse-config.xml
    index f8213b65..7d7459cb 100644
    --- a/clickhouse-config.xml
    +++ b/clickhouse-config.xml
    @@ -197,7 +197,7 @@
     
         <!-- <listen_backlog>4096</listen_backlog> -->
     
    -    <max_connections>4096</max_connections>
    +    <max_connections>2000</max_connections>
     
         <!-- For 'Connection: keep-alive' in HTTP 1.1 -->
         <keep_alive_timeout>3</keep_alive_timeout>
    @@ -270,7 +270,7 @@
         -->
     
         <!-- Maximum number of concurrent queries. -->
    -    <max_concurrent_queries>100</max_concurrent_queries>
    +    <max_concurrent_queries>4</max_concurrent_queries>
     
         <!-- Maximum memory usage (resident set size) for server process.
              Zero value or unset means default. Default is "max_server_memory_usage_to_ram_ratio" of available physical RAM.
    @@ -335,7 +335,7 @@
              In bytes. Cache is single for server. Memory is allocated only on demand.
              You should not lower this value.
           -->
    -    <mark_cache_size>5368709120</mark_cache_size>
    +    <mark_cache_size>805306368</mark_cache_size>
     
     
         <!-- If you enable the `min_bytes_to_use_mmap_io` setting,
    @@ -981,11 +980,11 @@
         </distributed_ddl>
     
         <!-- Settings to fine tune MergeTree tables. See documentation in source code, in MergeTreeSettings.h -->
    -    <!--
         <merge_tree>
    -        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
    +        <merge_max_block_size>2048</merge_max_block_size>
    +        <max_bytes_to_merge_at_max_space_in_pool>1073741824</max_bytes_to_merge_at_max_space_in_pool>
    +        <number_of_free_entries_in_pool_to_lower_max_size_of_merge>0</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
         </merge_tree>
    -    -->
     
         <!-- Protection from accidental DROP.
              If size of a MergeTree table is greater than max_table_size_to_drop (in bytes) than table could not be dropped with any DROP query.
    diff --git a/clickhouse-users.xml b/clickhouse-users.xml
    index f1856207..bbd4ced6 100644
    --- a/clickhouse-users.xml
    +++ b/clickhouse-users.xml
    @@ -7,7 +7,12 @@
             <!-- Default settings. -->
             <default>
                 <!-- Maximum memory usage for processing single query, in bytes. -->
    -            <max_memory_usage>10000000000</max_memory_usage>
    +            <max_memory_usage>536870912</max_memory_usage>
    +
    +            <queue_max_wait_ms>1000</queue_max_wait_ms>
    +            <max_execution_time>30</max_execution_time>
    +            <background_pool_size>4</background_pool_size>
    +
     
                 <!-- How to choose between replicas during distributed query processing.
                      random - choose random replica from set of replicas with minimum number of errors


> The two main knobs are

My experience is that those are not enough; multiple algorithms will just fail saying you hit the max memory limit. There are many other knobs, for example when to start external aggregation or sorting. For some cases I couldn't figure out the right setup, and the query just hits OOM with no indication of how to fix it.


How is your table setup? It’s plausible the on-disk/index layout is not amenable to the kinds of queries you’re trying to do.

What kind of queries are you trying to do? Also, what kind of machine are you running on?


A trivial example would be to run SELECT count(DISTINCT ...) on a large table with high-cardinality values: https://github.com/ClickHouse/ClickHouse/issues/47520
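Hedged examples of the usual workarounds (hypothetical `events` table; whether they help depends on the query shape and ClickHouse version, and the values are arbitrary):

    -- Approximate distinct count: bounded memory, usually close enough.
    SELECT uniq(user_id) FROM events;

    -- For heavy GROUP BYs, allow aggregation to spill to disk before the
    -- memory limit is hit (often set to roughly half of max_memory_usage).
    SELECT user_id, count()
    FROM events
    GROUP BY user_id
    SETTINGS max_bytes_before_external_group_by = 8000000000,
             max_memory_usage = 16000000000;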


And I mean why should they? They work great for what they are made for and that is all that matters!


As a caveat, I'd probably say 'at large volumes.'

For a lot of what people may want to do, they'd probably notice very little difference between the three.


That's true, but we're trying to change that at ParadeDB. Postgres is still way ahead of ClickHouse in terms of operational simplicity, ease of hiring for DBAs who are used to operating it at scale, ecosystem tooling, etc. If you can patch the speed and cost efficiency of Postgres for analytics to a level comparable to ClickHouse, then you get the best of both worlds


> Postgres is still way ahead of ClickHouse in terms of operational simplicity

Having served as both ClickHouse and Postgres SRE, I don't agree with this statement.

- Minimal-downtime major version upgrades in PostgreSQL are very challenging.

- A glibc version upgrade breaks Postgres indices (collation changes), which basically prevents you from upgrading the Linux OS.

And there are other things which make Postgres operationally difficult.

Any database with primary-replica architecture is operationally difficult IMO.


For multi-TB or PB needs I would not stray from MariaDB, especially when using ColumnStore. I have taken the Pepsi challenge, even after trying Vertica and Netezza. Not HANA though; one has had enough of SAP.


I'm curious: how many rows does Lago store in its CH cluster? Do they collect data for fighting fraud?

PG can handle a billion rows easily.


OLAP databases need to be able to handle billions of rows per hour/day.

I super love PG but PG is too far away from that.


Reading between the lines, given they're talking > 1 million rows per minute, I'd guess on the order of trillions of rows rather than billions (assuming they retain data for more than a couple of weeks)


PG can handle billions of rows for certain use cases, but not easily. Generally you can make things work but you definitely start entering "heroic effort" territory.


scale is becoming more and more important, not just for cost, but also as a key technology feature to help deal with unexpected traffic and reduce the cost of manual operations.


I have a tangentially related question, since I don’t use an OLAP DB: is deleting data really that hard to do? Is the storage necessarily immutable?

If so, is it a GDPR-compliant storage solution? I am asking since GDPR compliance may require data deletion (or at least anonymization).


Columnar DBs want stuff to be contiguous on disk, and deletes cause the rest of the data in that “block” to be rewritten (imagine deleting a chunk out of the middle of an Excel table: you’ve got to move everything else up).

This, in turn, creates read+write load. Modern OLAP DBs often support deletes, usually via mitigating strategies that minimise the extra work incurred: mark tainted rows, exclude them from queries, clean up asynchronously, etc.
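In ClickHouse terms, that mitigation is roughly what lightweight deletes do (a sketch on a hypothetical `events` table; available in newer versions for MergeTree-family tables):

    -- Matching rows are masked out of subsequent SELECTs right away and
    -- physically removed later, when the affected parts are merged.
    DELETE FROM events WHERE user_id = 42;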


deleting this comment because apparently jokes are not received well here


for others curious

ParadeDB - AGPL License https://github.com/paradedb/paradedb/blob/dev/LICENSE

Hydra - Apache 2.0 https://github.com/hydradatabase/hydra/blob/main/LICENSE

also hydra seems derived from citusdata's columnar implementation.


Don't feel bad; lots of people get bitten by not reading all the way down to the bottom of their README: https://github.com/hydradatabase/hydra/blob/v1.1.2/README.md... While Hydra may very well license their own code as Apache 2, they ship the AGPLv3 columnar extension, which, to my very best IANAL understanding, taints the whole stack and AGPLv3's everything all the way through: https://github.com/hydradatabase/hydra/blob/v1.1.2/columnar/...


The only additional requirement that the AGPL introduces is that if you modify the AGPL software, you have to provide the code to people who can access it over the network.

If you just use a pre-built package, the AGPL has the exact same requirements as the GPL.



