Hacker Newsnew | past | comments | ask | show | jobs | submit | giovannibonetti's commentslogin

Clustered indexes only save up to 2x write amplification in the very rare case where you're indexing the entire table (e.g. if it has very few columns).

However, that is usually the least of your concerns with write amplification. If you don't batch your writes, you can easily get 100x write amplification. For any primary key or any other index not strongly correlated with your INSERTs, you can get perhaps another 100x write amplification even if you batch you writes.


For inserts, you cannot escape writing into the base table and all indexes. However, my understanding is that for updates PostgreSQL has a write amplification problem due to the fact that each time a row is updated this creates a new row (to implement MVCC), and a new physical location in the heap, so all indexes need to be updated to point to the new location, even those not containing the updated columns.

OTOH, with a heap-less (aka. clustered, aka. index organized) table, you would only have to update the indexes containing the columns that are actually being updated. You don't need to touch any other index. Furthermore, only if you are updating a key column would you physically "move" the entry into a different part of the B-tree. If you update an included column (PK columns are automatically "included" in all secondary indexes, even if not explicitly mentioned in the index definition), you can do that in-place, without moving the entry.

Here is how this works in SQL Server - consider the following example:

    CREATE TABLE T (

        ID int,
        NAME nvarchar(255) NOT NULL,
        AMOUNT int NOT NULL,

        CONSTRAINT T_PK PRIMARY KEY (ID)

    );

    GO

    CREATE INDEX T_I1 ON T (NAME);

    GO

    CREATE INDEX T_I2 ON T (AMOUNT);
Now, doing this...

    UPDATE T SET AMOUNT = 42 WHERE ID = 100;
...will only write to T_PK and T_I2, but not T_I1. Furthermore T_PK's entry will not need to be moved to a different place in the B-tree. SQL Server uses row versioning similar to PostgreSQL, so it's conceivable that PostgreSQL could behave similarly to SQL Server if it supported clustered (index-organized) tables.

>in the very rare case where you're indexing the entire table (e.g. if it has very few columns).

Not sure I follow most tables are accessed primarily in one way (primary key) while maybe sometimes in others for analysis. Having the PK written twice because it's almost always indexed is normally a waste and good candidate for a clustered index. So much so that many DB's like SQLite and MySql always do clustered indexes on primary key because their storage engine is built such that tables are a b-tree anyway vs PG that has separate b-tree indexes and heap tables. MSSQL and Oracle give you a choice whether the table is a index structure or a heap.

If you have very specific use case tables they can typically have a clustered index and no secondary indexes, you can still scan them for ad-hoc analysis but you get better insert performance and space usage because you aren't double writing to the heap and a PK index like you would in PG.

As far as batch writes that is a separate issue and has to due with whether that even makes sense for durability, if you need to commit a single random row due to something occurring you can't batch that up and maintain consistency, if your bulk loading data sure and is common practice to do commit batches there, clustered indexes could still be a 100 vs 200x write amplification if you have to insert both an index row and heap row vs just a single clustered index row.


Clustered indexes aren't just about write amplification. They also reduce the reads needed to get the data. Sometimes by quite a bit.

That's true for seeks into the clustered (primary) index because that index includes all fields, so you don't need to "jump" to the heap to get them.

However, seeking into a secondary index, and then reading a column not included in that index incurs an additional index seek (into the clustered index), which may be somewhat slower than what would happen in a heap-based table.

So there are pros and cons, as usual...


I have found very minimal penalty on secondary index reads in practice such that it has never made a difference.

Remember some databases always use clustered index internally (SQLite, MySql) such that even if you have no primary key they will create a hidden one instead for use with the index.

https://www.sqlite.org/rowidtable.html

It is nice to have the choice which way to go and would be nice if PG implemented this. It can have significant space savings on narrow table with one primary index and performance advantages.


You assume it is valid, until it isn't and you can have different strategies to handle that, like just skipping the broken part and carrying on.

Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser is eager to spit out a stream of data that matches the query as soon as you find it.

The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.


I have been using Powersync, and it works great for mobile apps with offline-first functionality.


I suppose the only part the OP would need to look at is "Being able to form and execute queries in my backend" for the read path, since Sync Rules / Sync Streams need to define those queries. For the write path, you should be able to re-use the existing APIs / permission middleware for the most part.

Then: "Further, I wonder how such a system would work with connection pooling, sharding, replication etc."

These are all topics on their own, but briefly: - Connection Pooling: Not supported in PowerSync Cloud atm, but supported in theory when self-hosting since pgbouncer 1.23 added WAL support. Having said that, clients connect to the PowerSync Service for sync operations, so you only need a single direct connection between the service and your postgres. - Sharding: PowerSync can be used in a sharded setup [1], with better support planned - Replication: PowerSync uses the Postgres WAL for replication [2]

[1] https://docs.powersync.com/usage/sync-rules/advanced-topics/... [2] https://docs.powersync.com/installation/database-setup#other...


I think only Turso — SQLite rewritten in Rust — supports that.


That's Mojo's selling point.

https://www.modular.com/mojo


Roc says hi!


Why does type inference contribute to that issue?


> Soon we'll have shiitake replacing transistors in our airplane and spacecraft computers, while sitting and eating ramen on the vehicles themselves. The future is shaping up to be interesting.

By the way, some people say eating meat is not going to be sustainable as more and more people become able to afford it, and fungi are a great option for providing the equivalent protein intake.


It's already not sustainable, but that hasn't really stopped us.


It absolutely is possible though.

We don’t incentivize properly


> This community has pockets of people who like authoritarian control, and genuinely believe in Apple or Google Play as some kind of superego that they need to defend, that they believe is protecting us.

Perhaps you meant Leviathan instead of superego?


I've noticed that LIMIT 1 makes a huge difference when working with LATERAL JOINs in Postgres, even when the WHERE condition has a unique constraint.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: