> The database is obviously not able to tell that there will never be any values above 7

You say "obviously", but with updated statistics this is exactly the kind of thing you might expect the planner to know and use to aid index decisions.

I'm a huge fan of Postgres, coming to it around 5 years ago from at least 10 previous years with SQL Server, but I have hit a few things like this in that time. IME the planner is much more fickle about how you specify your predicates than SQL Server is.



No, I don't think statistics can let you get away with this. Databases are concurrent; you can't guarantee that a different session will not insert a record that invalidates your current statistics.

You could argue that it should be able to use it if the table has a check constraint preventing a severity_id above 7 from ever being inserted. That is something that could be done; I don't know if PostgreSQL does it (I doubt it) or how feasible it would be.
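
For illustration, roughly what that would look like (the table and column names here are just placeholders based on this thread's example, and whether PostgreSQL would actually use the constraint to prove the partial index safe is exactly the open question):

    -- hypothetical schema: a CHECK constraint guarantees no severity_id above 7,
    -- so in principle a planner could treat "severity_id != 7" as implying "severity_id < 7"
    CREATE TABLE events (
        id          bigserial PRIMARY KEY,
        severity_id int NOT NULL,
        CONSTRAINT severity_id_max CHECK (severity_id <= 7)
    );

    -- partial index covering only the low severities
    CREATE INDEX events_low_severity_idx
        ON events (severity_id)
        WHERE severity_id < 7;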

Is SQL Server able to make an assumption like that purely based on statistics? Genuine question.


> No, I don't think statistics can let you get away with this. Databases are concurrent; you can't guarantee that a different session will not insert a record that invalidates your current statistics.

Of course you can't make a guarantee like that, but why would you need to? Statistics are there to guide planner choices, not make cast iron predictions.


Let us say that in our example table we have 100,000 records with severity_id < 7, 200,000 with severity_id = 7, and 3 records with severity_id = 8.

Statistics claim 100k with id < 7, 200k with id = 7, and 0 with id > 7. The last 3 inserts could have happened right before our query, so the statistics haven't been updated yet.

Let us assume that we blindly trust the statistics, which currently state that there are absolutely no values with severity_id > 7, and that we have a query WHERE severity_id != 7 and a partial index on severity_id < 7.

If you trust the statistics and actually use the index, the rows containing severity_id = 8 will never be returned by the query even if they exist. So by using the index you only scan 100k rows and never touch the remaining ~200k. However, this query can't be answered correctly without scanning all ~300k records, since the only index available is the partial one. This means that on the same database you would get two different results for the exact same query if you decided to drop the index after the first run. The database can't fall back and change the plan during execution.
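
To make the scenario concrete, this is roughly the setup being described (all names are illustrative):

    -- table with only a partial index available
    CREATE TABLE events (id bigserial PRIMARY KEY, severity_id int NOT NULL);
    CREATE INDEX events_low_severity_idx ON events (severity_id) WHERE severity_id < 7;

    -- data: 100k rows with severity_id < 7, 200k rows with severity_id = 7,
    -- plus 3 freshly inserted rows with severity_id = 8 that the statistics
    -- don't know about yet

    -- if the planner "trusted" the stale statistics and answered this from the
    -- partial index alone, the three severity_id = 8 rows would silently be missing
    SELECT * FROM events WHERE severity_id != 7;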

Perhaps I misunderstood you originally. I thought you were suggesting that the database should be able to know that it can still use the index because the statistics currently claim that there are no records that would make the result incorrect. You are of course correct that the statistics are there to guide planner choices, and that is how they are used within PostgreSQL; however, some plans will give different results if your assumptions about the data are wrong.


Because the database has to decide whether or not to use the index. If it decides to use the index, and there are values above 7, then it will misbehave (the query will miss those results). Now of course the database could then scan the rest of the table for values above 7 that it missed, but at that point there's no point in using the index and you might as well have scanned all the rows for the original query.

As a result, the database has to be 100% sure that there are no values _at all_ above 7 to safely and efficiently use the index, e.g. when there's a constraint.
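
If you want to see which choice the planner actually makes, EXPLAIN shows it (assuming a table and partial index like the ones sketched above):

    -- with only a partial index on severity_id < 7 available and no constraint,
    -- the expected plan for this predicate is a sequential scan, not the index
    EXPLAIN SELECT * FROM events WHERE severity_id != 7;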


I doubt it? At least the number of times the "last updated" column appears in SQL Server stats [1] leads me to believe it collects stats asynchronously from updates to the table.
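
For reference, SQL Server exposes that "last updated" timestamp per statistics object, e.g. something like this (the table name is just a placeholder):

    -- when were the statistics on a given table last refreshed?
    SELECT s.name AS stats_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID('dbo.events');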

The only system I've heard of that relies on up-to-date statistics for correctness is Snowflake (long but interesting talk here [2]), where having accurate max/min values for each micro-partition is really helpful for cutting down the amount of data in the large range-scan queries common in BI. I'd guess that, being a BI system, Snowflake can get away with higher row-update latency too.

[1] https://www.sqlshack.com/sql-server-statistics-and-how-to-pe...

[2] https://www.youtube.com/watch?v=CPWn1SZUZqE


All statistics in Postgres are considered best-effort guidance. Even if the statistics are wrong, they can never impact the correctness of the results.
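
For anyone curious what that guidance actually looks like, the per-column statistics the planner consults live in pg_stats and are refreshed by ANALYZE (table and column names are just examples):

    -- refresh and inspect the planner statistics for one column; these values
    -- steer plan choice but are never relied on for result correctness
    ANALYZE events;

    SELECT n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'severity_id';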



