Using the GROUPING SETS, ROLLUP, and CUBE syntax for aggregate function invocation
This section shows how to use the GROUPING SETS
, ROLLUP
, and CUBE
syntax, as part of the GROUP BY
clause, in concert with the invocation, as a SELECT
list item, of one or more aggregate functions. These constructs are useful when the relation defined by a subquery's FROM
list has two or more columns that you want to use in the GROUP BY
clause and when you want to use them with singly or in combinations that have fewer columns than the available number.
Semantics
GROUPING SETS (...)
, is a shorthand notation to let you achieve, in a single terse subquery, what you could achieve by the union of several subqueries that each uses the plain GROUP BY <expression list>
syntax.
ROLLUP (...)
and CUBE
are each shorthand notations for specifying two common uses of the GROUPING SET
syntax.
GROUPING SETS
Suppose that the list of candidate columns (or expressions) is "g1",... "g2",... "gN" and you want to produce the results from a set of aggregate functions using each individual one in turn as the GROUP BY
argument and also using no GROUP BY
clause at all. You can easily simply write several individual queries, like this:
select... fn(...) filter (where... ),... from... group by g1 having...;
select... fn(...) filter (where... ),... from... group by g2 having...;
...
select... fn(...) filter (where... ),... from... group by gN having...;
select... fn(...) filter (where... ),... from... having...;
You could also, by appropriate use of subqueries defined in a WITH
clause, UNION
the results of each to create a single joint results set. While doing this is straightforward, it can become quite verbose—especially if, for example, you want to use a FILTER
clause as part of the aggregate function invocation and want to use a HAVING
clause to restrict the result set.
The GROUPING SETS
syntax lets you achieve the required result in a terse fashion:
select fn(...) filter (where... ),... from... group by grouping sets ((g1), (g2), ()) having...
You can include any number of columns (or expressions) within each successive parenthesis pair. The empty parenthesis brings the effect of no GROUP BY
clause at all.
ROLLUP
This:
rollup (g1, g2, g3,... )
represents the given list of expressions and all prefixes of the list including the empty list; thus it is equivalent to this:
grouping sets (
(g1, g2, g3,... ),
...
(g1, g2, g3),
(g1, g2),
(g1),
()
)
This is commonly used for analysis over hierarchical data, for example "total salary" by "department", "division", and "company-wide total".
CUBE
This:
cube (g1, g2, g3,... )
is equivalent to GROUPING SETS with the given list and all of its possible subsets (i.e. the power set). Therefore this:
cube (g1, g2, g3)
is equivalent to this:
grouping sets (
(g1, g2, g3),
(g1, g2 ),
( g2, g3),
(g1, g3),
(g1 ),
( g2 ),
( g3),
( )
)
Using the GROUPING keyword in the SELECT list to label the different GROUPING SETS
The GROUPING
keyword is used to introduce a SELECT
list item when the GROUP BY
clause uses GROUPING SETS
, ROLLUP
, or CUBE
. It produces a label so that result rows can be distinguished. The GROUPING
arguments are not evaluated, but they must match exactly expressions given in the GROUP BY
clause of the associated query level. Bits are assigned with the rightmost argument being the least-significant bit; each bit is 0 if the corresponding expression is included in the grouping criteria of the grouping set generating the result row, and 1 if it is not. See the section Using the GROUPING keyword in the SELECT list to label the different GROUPING SETS at the end of the Examples section below.
Further detail
The individual elements of a ROLLUP
or CUBE
clause may be either individual expressions, or sublists of elements in parentheses. In the latter case, the sublists are treated as single units for the purposes of generating the individual grouping sets. For example, this:
rollup (g1, (g2, g3), g4)
is equivalent to this:
grouping sets (
(g1, g2, g3, g4),
(g1, g2, g3 ),
(g1 ),
( )
)
And this:
cube ((g1, g2), (g3, g4))
is equivalent to this:
grouping sets (
(g1, g2, g3, g4),
(g1, g2 ),
( g3, g4),
( )
)
The ROLLUP
and CUBE
constructs can be used either directly in the GROUP BY
clause, or nested inside a GROUPING SETS
clause. If one GROUPING SETS
clause is nested inside another, the effect is the same as if all the elements of the inner clause had been written directly in the outer clause.
If many grouping items are specified in a single GROUP BY
clause, then the final list of grouping sets is the cross product of the individual items. For example, this:
group by g1, cube (g2, g3), grouping sets ((g4), (g5))
is equivalent to this:
group by grouping sets (
(g1, g2, g3, g4), (g1, g2, g3, g5),
(g1, g2, g4 ), (g1, g2, g5 ),
(g1, g3, g4 ), (g1, g3, g5 ),
(g1, g4 ), (g1, g5 )
)
Examples
Create and populate the test table. Notice the assert
that tests that each of the grouping columns "g1" and "g2" has at least two rows for each of its two values. The reason for this test is that stddev()
returns NULL
when it is presented with just a single row. This unlucky outcome is very unlikely. But it was seen while developing this code example. This outcome, while not wrong, or even remarkable, makes the results produced by the verbose approach and the terse approach, below, a little harder to compare—thereby making the semantics demonstration a little less convincing.
drop table if exists t cascade;
create table t(
k int primary key,
g1 int not null,
g2 int not null,
v int not null);
drop procedure if exists populate_t cascade;
create procedure populate_t(no_of_rows in int)
language plpgsql
as $body$
declare
s0 constant double precision not null := to_number(to_char(clock_timestamp(), 'ms'), '9999');
s constant double precision not null := s0/500.0 - 1.0;
begin
perform setseed(s);
delete from t;
insert into t(k, g1, g2, v)
select
g.v,
case random() < 0.5::double precision
when true then 1
else 2
end,
case random() < 0.5::double precision
when true then 1
else 2
end,
round(100*random())
from generate_series(1, no_of_rows) as g(v);
declare
c_g1_1 constant int = (select count(v) from t where g1 = 1);
c_g1_2 constant int = (select count(v) from t where g1 = 2);
c_g2_1 constant int = (select count(v) from t where g2 = 1);
c_g2_2 constant int = (select count(v) from t where g2 = 2);
begin
assert
c_g1_1 > 1 and c_g1_2 > 1 and c_g2_1 > 1 and c_g2_2 > 1,
'Unlucky outcome. Try again.';
end;
end;
$body$;
call populate_t(100);
GROUPING SETS
Now do three simple GROUP BY
queries. (See avg()
and stddev()
for the specification of these two aggregate functions.)
-- First:
select
g1,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by g1
order by g1;
-- Second:
select
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by g2
order by g2;
-- Third:
select
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t;
Here are typical results:
g1 | avg | stddev
----+---------+---------
1 | 53.65 | 26.94
2 | 52.16 | 30.64
g2 | avg | stddev
----+---------+---------
1 | 54.55 | 30.04
2 | 51.81 | 26.72
avg | stddev
---------+---------
53.10 | 28.22
Now combine these three queries into a single, verbose, query:
with
a1 as (
select g1, avg(v) as avg, stddev(v) as stddev from t group by g1),
a2 as (
select g2, avg(v) as avg, stddev(v) as stddev from t group by g2),
a3 as (
select avg(v) as avg, stddev(v) as stddev from t),
a4 as (
select g1, null::int as g2, avg, stddev from a1
union all
select null::int as g1, g2, avg, stddev from a2
union all
select null::int as g1, null::int as g2, avg, stddev from a3)
select
g1,
g2,
to_char(avg, '999.99') as avg,
to_char(stddev, '999.99') as stddev
from a4
order by g1 nulls last, g2 nulls last;
This is the result for the table population that produced the typical results shown above:
g1 | g2 | avg | stddev
----+----+---------+---------
1 | | 53.65 | 26.94
2 | | 52.16 | 30.64
| 1 | 54.55 | 30.04
| 2 | 51.81 | 26.72
| | 53.10 | 28.22
The same result is given by this terse query using the GROUPING SETS
syntax:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by grouping sets ((g1), (g2), ())
order by g1 nulls last, g2 nulls last;
Notice that it's easy to include a HAVING
clause in the terse GROUPING SETS
syntax:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by grouping sets ((g1), (g2), ())
having avg(v) > 53.0::numeric
order by g1 nulls last, g2 nulls last;
This is the result with the table population that produced the results above:
g1 | g2 | avg | stddev
----+----+---------+---------
1 | | 53.65 | 26.94
| 1 | 54.55 | 30.04
| | 53.10 | 28.22
The actual result here will depend on the outcome of the "no_of_rows" invocations of the random()
function. It's possible (but quite rare) that you'll see no rows at all.
To get the same semantics with the verbose query, you must, of course, include the identical HAVING
clause in each of the three WITH
clause subqueries, "a1", "a2", and "a3" that it uses. This makes the verbose formulation yet more verbose and, more importantly, yet more subject to copy-and-paste error when the aim is to repeat this identical text three times:
...avg(v) as avg, stddev(v) as stddev from t... having avg(v) > 50.0::numeric...
Extending this thinking, the terse syntax also makes it easier to add a FILTER
clause:
select
g1,
g2,
to_char(avg(v) filter (where v > 20), '999.99') as avg,
to_char(stddev(v) filter (where v > 20), '999.99') as stddev
from t
group by grouping sets ((g1), (g2), ())
having avg(v) > 53.0::numeric
order by g1 nulls last, g2 nulls last;
This is the result with the table population that produced the results above:
g1 | g2 | avg | stddev
----+----+---------+---------
1 | | 59.93 | 22.60
| 1 | 63.79 | 23.85
| | 60.78 | 23.13
ROLLUP
This GROUP BY ROLLUP
clause:
group by rollup (g1, g2)
has the same meaning as this GROUP BY GROUPING SETS
clause:
group by grouping sets ((g1, g2), (g1), ())
So this query:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by rollup (g1, g2)
order by g1 nulls last, g2 nulls last;
has the same meaning as this:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by grouping sets ((g1, g2), (g1), ())
order by g1 nulls last, g2 nulls last;
Each produces this result with the table population that produced the results above:
g1 | g2 | avg | stddev
----+----+---------+---------
1 | 1 | 54.00 | 29.37
1 | 2 | 53.35 | 25.12
1 | | 53.65 | 26.94
2 | 1 | 55.44 | 31.93
2 | 2 | 49.05 | 29.90
2 | | 52.16 | 30.64
| | 53.10 | 28.22
CUBE
This GROUP BY CUBE
clause:
group by cube (g1, g2)
has the same meaning as this GROUP BY GROUPING SETS
clause:
group by grouping sets ((g1, g2), (g1), (g2), ())
So this query:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by cube (g1, g2)
order by g1 nulls last, g2 nulls last;
has the same meaning as this:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by grouping sets ((g1, g2), (g1), (g2), ())
order by g1 nulls last, g2 nulls last;
Each produces this result with the table population that produced the results above:
g1 | g2 | avg | stddev
----+----+---------+---------
1 | 1 | 54.00 | 29.37
1 | 2 | 53.35 | 25.12
1 | | 53.65 | 26.94
2 | 1 | 55.44 | 31.93
2 | 2 | 49.05 | 29.90
2 | | 52.16 | 30.64
| 1 | 54.55 | 30.04
| 2 | 51.81 | 26.72
| | 53.10 | 28.22
Using the GROUPING keyword in the SELECT list to label the different GROUPING SETS
Try this:
select
grouping(g1, g2),
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by cube (g1, g2)
order by g1 nulls last, g2 nulls last;
It's equivalent to this:
select
grouping(g1, g2),
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by grouping sets ((g1, g2), (g1), (g2), ())
order by g1 nulls last, g2 nulls last;
Each produces this result with the table population that produced the results above:
grouping | g1 | g2 | avg | stddev
----------+----+----+---------+---------
0 | 1 | 1 | 54.00 | 29.37
0 | 1 | 2 | 53.35 | 25.12
1 | 1 | | 53.65 | 26.94
0 | 2 | 1 | 55.44 | 31.93
0 | 2 | 2 | 49.05 | 29.90
1 | 2 | | 52.16 | 30.64
2 | | 1 | 54.55 | 30.04
2 | | 2 | 51.81 | 26.72
3 | | | 53.10 | 28.22
You might prefer to order first by the GROUPING
item and not include it in the SELECT
list:
select
g1,
g2,
to_char(avg(v), '999.99') as avg,
to_char(stddev(v), '999.99') as stddev
from t
group by grouping sets ((g1, g2), (g1), (g2), ())
order by grouping(g1, g2), g1 nulls last, g2 nulls last;
It produces this result with the table population that produced the results above:
g1 | g2 | avg | stddev
----+----+---------+---------
1 | 1 | 54.00 | 29.37
1 | 2 | 53.35 | 25.12
2 | 1 | 55.44 | 31.93
2 | 2 | 49.05 | 29.90
1 | | 53.65 | 26.94
2 | | 52.16 | 30.64
| 1 | 54.55 | 30.04
| 2 | 51.81 | 26.72
| | 53.10 | 28.22