Signature and purpose of each window function
The two tables at the end classify the eleven built-in window functions into two groups according to their general common characteristics.
Note: The navigation bar lists these window functions in four functional groups. The members in each group bear a strong family resemblance to each other. The first two groups list functions from the first table below. And the second two groups list functions from the second table.
Aggregate function variants
A few of these also have an aggregate function variant. This can be seen with the \df
meta-command. For example, df lag
shows this:
Result data type | Argument data types | Type
------------------+----------------------------------------+--------
bigint | | window
bigint | VARIADIC "any" ORDER BY VARIADIC "any" | agg
This property is marked by "Y" in the column "agg?" in the following tables; a blank in this column means that the entry has only a window function variant.
Functions with both a 'window' and an 'aggregate' variant
The definitive description of the use, as an aggregate function, of a window function that has such a variant, is described within the Aggregate functions major section in the section Within-group hypothetical-set aggregate functions .Frame_clause sensitivity
The results of all of the window functions depend upon what the window ORDER BY
clause and the optional PARTITION BY
clause say. Though you don't get an error if you omit the window ORDER BY
clause, its omission brings unpredictable and therefore meaningless results.
The results of a few of the window functions are sensitive to what the frame_clause
says. This is marked by "Y" in the column "frame?" in the following tables; a blank in this column means that the entry is not sensitive to what the frame_clause
says.
Frame_clause-insensitive window functions
All of the window functions listed in the first table, and lag()
and lead()
from the second table, are insensitive to whatever the frame_clause
might say. You can easily show this by trying a few variants like, for example, this:
-- Counter example. Don't use this for window functions
-- that aren't frame_clause-sensitive.
range between 1 preceding and 1 following exclude current row
It says "consider only the row immediately before and the row immediately after the current row". You'll see that including this, or any other variant, in the window_definition
brings the identical result to what including only the window ORDER BY
clause brings for each of the window functions that aren't sensitive to the frame_clause
.
Yugabyte recommends that you never include a frame_clause
in the window_definition
that you use when you invoke a window function that isn't sensitive to the frame_clause
.
Frame_clause-sensitive window functions
The names of the other window functions that the second table lists, first_value()
, nth_value()
and last_value()
, tell you that the output of each makes obvious sense when the scope within which the specified row is found is the entire window. The results of these three functions certainly are sensitive to the frame_clause
. This is the default for the frame_clause
:
-- This specification is used if you omit the frame_clause.
-- You probably don't want the results that this specifies.
between unbounded preceding and current row
See the section Window function invocation—SQL syntax and semantics. But the default does not specify the entire window. To do this, use this variant:
-- You must specify this explicitly unless you are sure
-- that you want a different specification.
range between unbounded preceding and unbounded following
Yugabyte recommends, therefore, that you include frame_clause
in the window_definition
that you use when you invoke a window function that is sensitive to the frame_clause
, unless you have one of the very rare use cases where the output that you want is produced by a different frame_clause
.
Note: The frame_clause
's many variants are useful when an aggregate function is invoked using the OVER
clause. The section Using the aggregate function avg()
to compute a moving average shows an example.
Window functions that return an "int" or "double precision" value as a "classifier" of the rank of the row within its window
The only information, in the input row set, that the functions in this group depend upon is the emergent order as determined by the window ORDER BY
clause. The actual order depends in the usual way on what this clause says. With the one exception of ntile()
, these functions have no formal parameters. The ntile()
function has a single, mandatory int
formal parameter. It specifies the number of subsets into which the input row set should be classified. This means that it reflects only the invoker's intention and does not reflect the shape of the input.
Function | agg? | frame? | Description |
---|---|---|---|
row_number() |
Returns a unique integer for each row in a window, from a dense series that starts with 1, according to the emergent order that the window ORDER BY clause specifies. For the two or more rows in a tie group, the unique values are assigned randomly. |
||
rank() |
Y | Returns the integer ordinal rank of each row according to the emergent order that the window ORDER BY clause specifies. The series of values starts with 1 but, when the window contains ties, the series is not dense. The "ordinal rank" notion is familiar from sporting events. If three runners reach the finish line at the same time, then they are all deemed to have tied for first place. The runner who finishes next after these is deemed to have come in fourth place because three runners came in before this finisher. |
|
dense_rank() |
Y | Returns the integer ordinal rank of the distinct value of each row according to what the window ORDER BY clause specifies. The series of values starts with 1 and, even when the window contains ties, the series is dense. The "dense rank" notion reflects the ordering of distinct values of the list of expressions that the window ORDER BY clause specifies. In the running race example, the three runners who tied for first place would get a dense rank of 1. And the runner who finished next after these would get a dense rank of 2, because this finisher got the second fastest distinct finish time. |
|
percent_rank() |
Y | Returns the percentile rank of each row within the window, with respect to the argument of the window_definition 's window ORDER BY clause, as a number in the range 0.0 through 1.0. The lowest value of percent_rank() within the window will always be 0.0, even when there is a tie between the lowest-ranking rows. The highest possible value of percent_rank() within the window is 1.0, but this value is seen only when the highest-ranking row has no ties. If the highest-ranking row does have ties, then the highest value of percent_rank() within the window will be correspondingly less than 1.0 according to how many rows tie for the top rank. The notion is well-known from statistics. More details are given in this function's dedicated account. |
|
cume_dist() |
Y | Returns a value that represents the number of rows with values less than or equal to the current row's value divided by the total number of rows—in other words, the relative position of a value in a set of values. The graph of all values of cume_dist() within the window is known as the cumulative distribution of the argument of the window_definition 's window ORDER BY clause. The value c returned by cume_dist() is a number in the range 0 < c <= 1. You can use cume_dist() to answer questions like this: Show me the rows whose score is within the top "x%" of the window's population, ranked by score. The notion is well-known from statistics. More details are given in this function's dedicated account. |
|
ntile() |
Returns an integer value for each row that maps it to a corresponding percentile. For example, if you wanted to mark the boundaries between the highest-ranking 20% of rows, the next-ranking 20% of rows, and so on, then you would use ntile(5) . The top 20% of rows would be marked with 1, the next-to-top 20% of rows would be marked with 2, and so on, so that the bottom 20% of rows would be marked with 5. If the number of rows in the window, N, is a multiple of the actual value with which you invoke ntile() , n , then each percentile set would have exactly N/n rows. This is achieved, if there are ties right at the boundary between two percentile sets, by randomly assigning some to one set and some to the other. If N is not a multiple of n, then ntile() assigns the rows to the percentile sets so that the numbers assigned to each are as close as possible to being the same. |
Window functions that return columns of another row within the window
The functions in this group depend, in the same way as those in the first group do, upon the emergent order as determined by the window ORDER BY
clause. Each has, at least, a single, mandatory, anyelement
formal parameter that specifies which value to fetch from the designated other row in the window. If you want to fetch values from more than one column, you must combine them into a scalar value. The most obvious way to do this is to list the columns in the constructor for a user-defined "row" type. The nth_value()
function has a mandatory second int
formal parameter that specifies the value of "N" to give "Nth" its meaning. The other functions in this group have just one formal parameter.
Function | agg? | frame? | Description |
---|---|---|---|
first_value() |
Y | Returns the specified value from the first row, in the specified sort order, in the current window frame. If you specify the frame_clause to start at a fixed offset before the current row, then first_value() would produce the same result as would the correspondingly parameterized lag() . If this is your aim, then you should use lag() for clarity. |
|
nth_value() |
Y | Returns the specified value from the "Nth" row, in the specified sort order, in the current window frame. The second, mandatory, parameter specifies "N" in "Nth". | |
last_value() |
Y | Returns the specified value from the last row, in the specified sort order, in the current window frame. | |
lag() |
Returns, for the current row, the designated value from the row in the ordered input set that is "lag_distance " rows before it. The data type of the return value matches that of the input value. NULL is returned when the value of "lag_distance" places the earlier row before the start of the window. Use the optional last parameter to specify the value to be returned, instead of NULL , when the looked-up row falls outside of the current window. |
||
lead() |
Returns, for the current row, the designated value from the row in the ordered input set that is "lead_distance" rows after it. The data type of the return value matches that of the input value. NULL is returned when the value of "lead_distance" places the later row after the start of the window. Use the optional last parameter to specify the value to be returned, instead of NULL , when the looked-up row falls outside of the current window. |