Decent article, but some remarks:

1) It seems these two statements conflict with each other:

> The oldest such transaction sets the cutoff—referred to as the "MVCC horizon." Until that transaction completes, every dead tuple newer than its snapshot is retained.

and

> For example, imagine three analytics queries, each running for 40 seconds, staggered 20 seconds apart. No individual query would trigger a timeout for running too long. But because one is always active, the horizon never advances, and the effect on vacuum is the same as one transaction that never ends.

If the three analytics *transactions* (it's transactions that matter, not queries, although there is some subtlety around deferred transactions not acquiring a snapshot until the first query) are started at different times, they will have staggered snapshots and so once the first completes, this should allow the vacuum to advance.
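To make the objection concrete, here is a toy Python sketch (my own, not from the article) of the staggered schedule, where the "horizon" is simply the snapshot time of the oldest still-active transaction:

```python
# Hypothetical model (not Postgres code): three transactions, each running
# 40 s, staggered 20 s apart. The MVCC horizon is the start (snapshot) time
# of the oldest transaction still active.

def horizon_at(t, txns):
    """Return the oldest snapshot among transactions active at time t,
    or None if no transaction is active."""
    active = [start for (start, end) in txns if start <= t < end]
    return min(active) if active else None

# (start, end) pairs: started at t=0, 20, 40; each runs for 40 seconds.
txns = [(0, 40), (20, 60), (40, 80)]

print(horizon_at(10, txns))   # 0  -> pinned at the first snapshot
print(horizon_at(45, txns))   # 20 -> first txn committed, horizon advanced
print(horizon_at(65, txns))   # 40 -> advanced again
```

Under this model the horizon advances every time the oldest transaction commits, which is the point of the objection: overlap alone doesn't pin the horizon, only a single long-lived snapshot does.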

2) Although the problem with this query:

    SELECT * FROM jobs
    WHERE status = 'pending'
    ORDER BY run_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED;
having to consider dead tuples is a genuine concern and performance problem, it can also be mitigated by adding a monotonically increasing column and a `WHERE column > ?` clause (jobs are processed in increasing order, so the dead tuples of completed jobs fall below the threshold), provided you have also added an index to make that pagination efficient. This way you don't need to consider dead tuples, and they 'only' waste space whilst waiting to be vacuumed, rather than also bogging down read performance.
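For example, the polling query might become something like this (the `job_id` column and the watermark parameter are hypothetical, and this assumes you can process jobs in `job_id` order rather than `run_at` order):

```sql
-- $1 is the highest job_id known to be completed; index entries for the
-- dead tuples of finished jobs fall below it and are never scanned.
SELECT * FROM jobs
WHERE status = 'pending'
  AND job_id > $1
ORDER BY job_id
LIMIT 1
FOR UPDATE SKIP LOCKED;
```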

There is a little subtlety around how you guarantee that the column is monotonically increasing, given concurrent writers, but the answer to that depends on what tricks you can fit into your application.
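For instance (my sketch, not from the thread), a sequence-backed identity column gives monotonically *assigned* ids, but with concurrent writers a transaction holding a lower id can commit after one holding a higher id, so rows may become visible out of order; coping with that is the application-level trick alluded to here.

```sql
-- Ids are assigned monotonically from a sequence, but commit order is not
-- guaranteed to match assignment order under concurrent writers.
ALTER TABLE jobs
  ADD COLUMN job_id bigint GENERATED ALWAYS AS IDENTITY;
```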

3) I almost want to say that the one-line summary is 'Don't combine (very) long-running transactions with (very) high transaction rates in Postgres'

(Is this a fair representation?)

nine_k (3 days ago):
For (2): the problem is that the index would still need to keep the dead tuples until Postgres is positively certain that no transaction can still see them, and runs vacuum over them. It may speed things up a bit, but would still overflow the disk storage eventually. It may still prevent other tables from being vacuumed, too!
cataflutter (3 days ago):
For disk usage, yes this doesn't address anything.

But for read performance (which is IMO what the section in the article was motivated by), it doesn't actually matter to have a bunch of entries corresponding to dead tuples in your index, provided Postgres doesn't need to actually consider the dead tuples as part of your query.

So if you have a monotonically increasing `job_id` and that's indexed, then so long as you process your jobs in increasing `job_id` order, you can use the index and guarantee you don't have to keep reconsidering the dead tuples corresponding to jobs that already completed (if that makes sense).

[This is because the index is a b-tree, which supports efficient (O(log n) page reads for n entries) seeking on (any prefix of) the columns in the index.]

nine_k (3 days ago):
This is fair! This should work as you describe with a partial index, picking the lowest ID that has status = 'pending' (via that index) and is not locked (via `SELECT ... FOR UPDATE SKIP LOCKED`). The query plan should be triple-checked, though, to make sure it actually uses the index.
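A sketch of that setup (the index name and the `job_id` column are made up for illustration):

```sql
-- Partial index: only tuple versions with status = 'pending' get entries.
CREATE INDEX jobs_pending_idx
    ON jobs (job_id)
    WHERE status = 'pending';

-- Triple-check the plan. Plain EXPLAIN (without ANALYZE) doesn't execute
-- the query or take row locks, so it is safe to run in production.
EXPLAIN
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY job_id
LIMIT 1
FOR UPDATE SKIP LOCKED;
```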
jmalicki (3 days ago):
The article is sparse on what 'pending' means, but I would guess that WHERE condition would be enough?
cataflutter (1 day ago):
`WHERE status = 'pending'` is not enough to avoid the performance problem, even if you have an index on `(status)`, because the index will still contain dead tuples until it is vacuumed.

Rough intuition: Postgres doesn't immediately delete rows, it just marks them as invalid after a certain snapshot version/transaction ID (and this mark is in the heap, not the indexes, AFAIK).

For every potential tuple that your index returns, Postgres needs to visit the heap to see if that tuple is alive, UNLESS all the tuples in that heap page are alive, in which case an optimisation called the 'visibility map' allows that check to be skipped (relevant for Index-Only Scans, where Postgres can get all the results for your query from the index directly).
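You can see this cost directly in the plan (sketch; assumes an index like `CREATE INDEX ON jobs (status, job_id)` so an Index Only Scan is possible):

```sql
-- With EXPLAIN (ANALYZE), an Index Only Scan node reports "Heap Fetches: N":
-- the number of times the visibility map could NOT vouch for a page being
-- all-visible, forcing a heap visit to check tuple liveness.
EXPLAIN (ANALYZE)
SELECT job_id FROM jobs
WHERE status = 'pending';
```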

The only way to avoid the problem is therefore either to vacuum frequently enough that Postgres doesn't get any dead tuples returned from the index (which it must then visit the heap to check for liveness), or to bake some condition into your queries that prevents the dead tuples from being returned from the index altogether. Vacuuming frequently is expensive and conflicts with having long-running transactions, so the latter is generally the option to go for when it matters.
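A quick way to check whether this problem applies to your table (`pg_stat_user_tables` is a standard Postgres statistics view):

```sql
-- Shows how many dead tuples the table has accumulated and when it was
-- last vacuumed (manually or by autovacuum).
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'jobs';
```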

[n.b. I feel I should note I am not a Postgres developer and wouldn't call myself an expert, just an enthusiast and dealt with a few problems here and there. So what I say should be taken with a grain or two of salt, though I believe it to be accurate.]