If you are using PostgreSQL for analytics or large-scale aggregations, you might occasionally notice that the planner makes wrong assumptions about the number of rows. While this isn't a problem for small aggregates, it can indeed be an issue for large-scale aggregations featuring many different dimensions.

In short: The more columns your GROUP BY clause contains, the more likely it is that the optimizer overestimates the row count.

This blog explains how this can be handled in PostgreSQL.

Populating sample data in PostgreSQL

For this example, I have created a small table which contains a couple of integer values:

blog=# CREATE TABLE t_sample (
a int,
b int,
c int,
d int,
e int,
f int DEFAULT random()*100000
);
CREATE TABLE

As we can see, we have five integer columns (a to e) and an additional column (f) that is filled with a random value by default. The easiest way to add 10 million rows to this table is to make use of the generate_series function:

blog=# INSERT INTO t_sample
SELECT x % 10, x % 11, x % 12, x % 13, x % 14
FROM generate_series(1, 10000000) AS x;
INSERT 0 10000000

The key point here is that our data has an interesting property: each column can only contain a fixed set of values (10 distinct values in the first column, 11 distinct values in the second column, and so on).
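If you want to double-check this property, counting the distinct values per column should do the trick (this is just a sanity check and takes a moment on 10 million rows):

blog=# SELECT count(DISTINCT a) AS a,   -- expected: 10
       count(DISTINCT b) AS b,          -- expected: 11
       count(DISTINCT c) AS c,          -- expected: 12
       count(DISTINCT d) AS d,          -- expected: 13
       count(DISTINCT e) AS e           -- expected: 14
FROM t_sample;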

Once the data has been loaded, I am calling the ANALYZE command to create fresh optimizer statistics:

blog=# ANALYZE ;
ANALYZE

What does "optimizer statistics" mean?

How does the PostgreSQL optimizer work? The most fundamental concept is column statistics. For each column, PostgreSQL keeps important information such as the number of distinct values, the most common values, a histogram, and a lot more:

blog=# SELECT tablename, attname,
n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 't_sample'
AND attname IN ('a', 'b');
tablename | attname | n_distinct | most_common_vals
-----------+---------+------------+--------------------------
t_sample | b | 11 | {9,6,7,5,0,8,10,1,4,3,2}
t_sample | a | 10 | {9,0,4,5,8,1,2,3,6,7}
(2 rows)

Note that this information is stored for every single column. Under normal circumstances, autovacuum will ensure that the statistics are kept up to date.
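If you want to see when the statistics for our table were last refreshed, a quick look at pg_stat_user_tables is enough (the timestamps will of course look different on your system):

blog=# SELECT relname,
       last_analyze,       -- last manual ANALYZE
       last_autoanalyze    -- last analyze run by autovacuum
FROM pg_stat_user_tables
WHERE relname = 't_sample';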

When columns depend on each other

The following query is quite a common one: We want to determine how many entries exist in each group. The key detail is that the GROUP BY clause contains five dimensions.

To make the example easier to understand, we first disable parallel queries in PostgreSQL. This is not a general performance recommendation; it is done solely to simplify the execution plan:

blog=# SET max_parallel_workers_per_gather TO 0;
SET
blog=# explain analyze
SELECT a, b, c, d, e, count(*)
FROM t_sample
GROUP BY 1, 2, 3, 4, 5;
QUERY PLAN
--------------------------------------------------------------------------
HashAggregate (cost=982455.57..1102046.81 rows=240240 width=28)
(actual time=1638.327..1784.077 rows=60060 loops=1)
Group Key: a, b, c, d, e
Planned Partitions: 4 Batches: 5
Memory Usage: 8241kB Disk Usage: 53072kB
-> Seq Scan on t_sample (cost=0.00..163696.15 rows=10000115 width=20)
(actual time=0.051..335.137 rows=10000000 loops=1)
Planning Time: 0.167 ms
Execution Time: 1787.892 ms
(6 rows)

What stands out here? It comes down to two numbers: "240240" and "60060". The first number is the optimizer's estimate - it assumes that more than 200,000 rows will be returned. However, the actual result is only 60060 rows. Why is this the case? The reason is that certain combinations of a, b, c, d and e simply don't exist - but PostgreSQL does not know that. Remember, the optimizer maintains statistics about each column individually; it does not know which combinations of values actually occur in the data.

How does PostgreSQL come up with "240240"? Consider the following calculation:

blog=# SELECT 10 * 11 * 12 * 13 * 14;
?column?
----------
240240
(1 row)

PostgreSQL simply multiplied the number of distinct values in each column, which leads to an overestimation (which is often less problematic than an underestimation).
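If you want to see how many combinations really exist in the table, counting them directly gives the number we already saw as "actual rows" in the plan:

blog=# SELECT count(*) AS existing_combinations
FROM (
    SELECT DISTINCT a, b, c, d, e   -- combinations that actually occur
    FROM t_sample
) AS sub;

This returns 60060 - far fewer than the theoretical maximum of 240240.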

CREATE STATISTICS: Fixing estimates

The way to fix those estimates is to use extended statistics. The following command can be used to approach the problem:

blog=# \h CREATE STATISTICS
Command: CREATE STATISTICS
Description: define extended statistics
Syntax:
CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
ON ( expression )
FROM table_name

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
[ ( statistics_kind [, ... ] ) ]
ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
FROM table_name

URL: https://www.postgresql.org/docs/17/sql-createstatistics.html

In our case we need the "ndistinct" option of CREATE STATISTICS:

blog=# CREATE STATISTICS mystats (ndistinct)
ON a, b, c, d, e
FROM t_sample ;
CREATE STATISTICS
blog=# ANALYZE;
ANALYZE

This creates statistics for combinations of columns, not just for individual columns and values. When we run the query again, we will see that the estimates have improved significantly:

blog=# explain analyze
SELECT a, b, c, d, e, count(*)
FROM t_sample
GROUP BY 1, 2, 3, 4, 5;
QUERY PLAN
----------------------------------------------------------------------------
HashAggregate (cost=313697.88..314288.34 rows=59046 width=28)
(actual time=1587.681..1733.595 rows=60060 loops=1)
Group Key: a, b, c, d, e
Batches: 5 Memory Usage: 8241kB Disk Usage: 53072kB
-> Seq Scan on t_sample (cost=0.00..163696.15 rows=10000115 width=20)
(actual time=0.060..309.344 rows=10000000 loops=1)
Planning Time: 0.623 ms
Execution Time: 1737.495 ms
(6 rows)

Wow, this is close: "59046" vs "60060" - PostgreSQL is spot on. In this case, it doesn't impact performance. However, in real-world scenarios, fixing statistics can have a huge impact on speed and overall database efficiency.

The question may naturally arise: What information does the optimizer use to come up with such an accurate estimate? Here is the answer:

blog=# \x
Expanded display is on.
blog=# SELECT * FROM pg_stats_ext;
-[ RECORD 1 ]----------+----------------------------------------------------------------
schemaname | public
tablename | t_sample
statistics_schemaname | public
statistics_name | mystats
statistics_owner | hs
attnames | {a,b,c,d,e}
exprs |
kinds | {d}
inherited | f
n_distinct | {"1, 2": 110, "1, 3": 60, "1, 4": 130, "1, 5": 70,
"2, 3": 132, "2, 4": 143, "2, 5": 154, "3, 4": 156,
"3, 5": 84, "4, 5": 182, "1, 2, 3": 660, "1, 2, 4": 1430,
"1, 2, 5": 770, "1, 3, 4": 780, "1, 3, 5": 420,
"1, 4, 5": 910, "2, 3, 4": 1716, "2, 3, 5": 924,
"2, 4, 5": 2002, "3, 4, 5": 1092, "1, 2, 3, 4": 8576,
"1, 2, 3, 5": 4621, "1, 2, 4, 5": 10026, "1, 3, 4, 5": 5457,
"2, 3, 4, 5": 11949, "1, 2, 3, 4, 5": 59046}
dependencies |
most_common_vals |
most_common_val_nulls |
most_common_freqs |
most_common_base_freqs |

The pg_stats_ext view shows how PostgreSQL stores the n_distinct estimates for our column combinations in this case.
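Note that n_distinct also contains estimates for the smaller column combinations, so a GROUP BY on just a subset of those columns should benefit as well. Something along these lines can be used to verify that:

blog=# EXPLAIN
SELECT a, b, c, count(*)   -- a, b, c correspond to attnums 1, 2, 3 in the statistics
FROM t_sample
GROUP BY 1, 2, 3;

The group estimate for this query can now be based on the "1, 2, 3" entry (660) rather than on a simple multiplication of the per-column values.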

Finally …

Optimizing queries and the optimizer in general are fascinating topics worth exploring. If you want to learn more and dig deeper, consider reading my post about "How the PostgreSQL query optimizer works".
