MySQL 8.0 窗口函数(Window Functions)


MySQL 8.0 版本以后,终于支持了窗口函数,以下为在 MySQL 中使用窗口函数的相关笔记。

更多请参考:官方文档

窗口表现记录的集合,窗口函数也就是满足某种条件的记录集合上执行的特殊函数,对于每条记录都要在此窗口内执行函数,有的函数,随着记录不同,窗口大小是固定的,这种属于静态窗口,有的函数则相反,不同的记录对应不同的窗口,这种动态变化的窗口叫滑动窗口。

MySQL中窗口函数,又称为 nonaggregate window 非聚合窗口函数,对应的正是如 group by 的聚合函数,如官方文档所述

A window function performs an aggregate-like operation on a set of query rows. However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row:

  • The row for which function evaluation occurs is called the current row.
  • The query rows related to the current row over which function evaluation occurs comprise the window for the current row. ... By Contrast, window operations do not collapse groups of query rows to a single output row. Instead they product a result of each row.

基础示例

建立了一张 numbers 表,如下所示:

mysql> select * from numbers;
+------+
| val  |
+------+
|    1 |
|    1 |
|    2 |
|    3 |
|    3 |
|    3 |
|    4 |
|    4 |
|    5 |
+------+
9 rows in set (0.00 sec)

执行窗口函数

mysql> SELECT
    -> val,
    -> ROW_NUMBER ( ) OVER w AS 'row_number',
    -> CUME_DIST ( ) OVER w AS 'cume_dist',
    -> PERCENT_RANK ( ) OVER w AS 'percent_rank'
    -> FROM
    -> numbers WINDOW w AS ( ORDER BY val );
+------+------------+--------------------+--------------+
| val  | row_number | cume_dist          | percent_rank |
+------+------------+--------------------+--------------+
|    1 |          1 | 0.2222222222222222 |            0 |
|    1 |          2 | 0.2222222222222222 |            0 |
|    2 |          3 | 0.3333333333333333 |         0.25 |
|    3 |          4 | 0.6666666666666666 |        0.375 |
|    3 |          5 | 0.6666666666666666 |        0.375 |
|    3 |          6 | 0.6666666666666666 |        0.375 |
|    4 |          7 | 0.8888888888888888 |         0.75 |
|    4 |          8 | 0.8888888888888888 |         0.75 |
|    5 |          9 |                  1 |            1 |
+------+------------+--------------------+--------------+
9 rows in set (0.00 sec)

窗口函数(Window Function) 基础语法

在查询中的每个窗口函数都是通过包含 OVER() 的子句来指定的,该子句阐述如何将行划分为组,以便于窗口函数进行处理。

例如:

SELECT
    ID,
    SCORE,
    AVG(SCORE) OVER() AS total_avg_score,
    AVG(SCORE) OVER(PARTITION BY ID) AS rest_avg_score
FROM
    `ID_SCORE`

如上述代码 l4 所示,如果 OVER() 子句为空,则表明其将整个查询集作为一个分区。 与之对比的 l5 中,指明了使用 ID 作为分区键,则会针对每个ID进行一组分区(组),有多个唯一ID,则有多少个分区(组)。

窗口函数仅允许在 SELECT 和 ORDER BY 中使用,查询的结果取决于 FROM 子句,并在执行完 WHERE / GROUP BY / HAVING 后执行,在 ORDER_BY / LIMIT / SELECT DISTINCT 之前执行。

例如:

SELECT -- STEP 2 
    ID,
    SCORE,
    AVG(SCORE) OVER() AS total_avg_score,
    AVG(SCORE) OVER(PARTITION BY ID) AS rest_avg_score
FROM
    `ID_SCORE_1W` -- STEP 0
WHERE SCORE > 0.2 -- STEP 1
ORDER BY SCORE DESC -- STEP 3
LIMIT 5 -- STEP 4

则如上述 SQL 语句所示,首先执行的 FROM ... WHERE, 然后再做 SELECT, 最后执行 ORDER_BY & LIMIT

大部分聚合函数,在窗口函数中都可以沿用,例如:

AVG()
BIT_AND()
BIT_OR()
BIT_XOR()
COUNT()
JSON_ARRAYAGG()
JSON_OBJECTAGG()
MAX()
MIN()
STDDEV_POP(), STDDEV(), STD()
STDDEV_SAMP()
SUM()
VAR_POP(), VARIANCE()
VAR_SAMP()

此外还有一些窗口函数独有的非聚合函数,例如:

CUME_DIST()
DENSE_RANK()
FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()
NTH_VALUE()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()

非聚合函数,最常见的应用场景为组内排序,例如:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> ROW_NUMBER () OVER ( ) AS total_rank,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ORDER BY SCORE DESC) AS partition_rank
    -> FROM
    -> `ID_SCORE_1W`
    -> LIMIT 10;
+----------+-------------------+------------+----------------+
| ID       | SCORE             | total_rank | partition_rank |
+----------+-------------------+------------+----------------+
| 10000694 | 0.538435152172434 |       5816 |              1 |
| 10000694 | 0.204906177029889 |       5817 |              2 |
| 10000694 | 0.178224311905739 |       5682 |              3 |
| 10000694 | 0.17722191737495  |       5683 |              4 |
| 10000694 | 0.173294815249342 |       5684 |              5 |
| 10000694 | 0.157300442035558 |       5818 |              6 |
| 10002527 | 0.123965125555045 |       8676 |              1 |
| 10002527 | 0.119915039695458 |       1893 |              2 |
| 10002527 | 0.114557146409275 |       8941 |              3 |
| 10002527 | 0.114177958425522 |       1894 |              4 |
+----------+-------------------+------------+----------------+
10 rows in set (0.05 sec)

这里分别做了两个排序,一个针对全体查询数据集,另一个很对组内

OVER ( PARTITION BY ID ORDER BY SCORE DESC)

这块可以特别注意到,OVER 子句中,使用 ORDER BY 对 SCORE 得分进行了排序,默认情况下,分区(partition)并不会对划分的小组,进行排序,即意味着 ROW_NUMBER 的结果也是不确定的。

例如:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> ROW_NUMBER () OVER ( ) AS total_rank,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ) AS partition_row_number
    -> FROM
    -> `ID_SCORE_1W`
    -> LIMIT 10;
+----------+-------------------+------------+----------------------+
| ID       | SCORE             | total_rank | partition_row_number |
+----------+-------------------+------------+----------------------+
| 10000694 | 0.178224311905739 |       5682 |                    1 |
| 10000694 | 0.17722191737495  |       5683 |                    2 |
| 10000694 | 0.173294815249342 |       5684 |                    3 |
| 10000694 | 0.538435152172434 |       5816 |                    4 |
| 10000694 | 0.204906177029889 |       5817 |                    5 |
| 10000694 | 0.157300442035558 |       5818 |                    6 |
| 10002527 | 0.119915039695458 |       1893 |                    1 |
| 10002527 | 0.114177958425522 |       1894 |                    2 |
| 10002527 | 0.111689829230984 |       1895 |                    3 |
| 10002527 | 0.110194202616688 |       1896 |                    4 |
+----------+-------------------+------------+----------------------+
10 rows in set (0.05 sec)

可以看到,这次执行中,并没有指定 ORDER BY SCORE,则最终标号的结果并没有进行排序,这块有个点,如果采用如下形式:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> ROW_NUMBER () OVER ( ) AS total_rank,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ORDER BY SCORE DESC) AS partition_rank,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ) AS partition_row_number
    -> FROM
    -> `ID_SCORE_1W`
    -> LIMIT 10;
+----------+-------------------+------------+----------------+----------------------+
| ID       | SCORE             | total_rank | partition_rank | partition_row_number |
+----------+-------------------+------------+----------------+----------------------+
| 10000694 | 0.538435152172434 |       5816 |              1 |                    1 |
| 10000694 | 0.204906177029889 |       5817 |              2 |                    2 |
| 10000694 | 0.178224311905739 |       5682 |              3 |                    3 |
| 10000694 | 0.17722191737495  |       5683 |              4 |                    4 |
| 10000694 | 0.173294815249342 |       5684 |              5 |                    5 |
| 10000694 | 0.157300442035558 |       5818 |              6 |                    6 |
| 10002527 | 0.123965125555045 |       8676 |              1 |                    1 |
| 10002527 | 0.119915039695458 |       1893 |              2 |                    2 |
| 10002527 | 0.114557146409275 |       8941 |              3 |                    3 |
| 10002527 | 0.114177958425522 |       1894 |              4 |                    4 |
+----------+-------------------+------------+----------------+----------------------+
10 rows in set (0.08 sec)

由于 l5 中已经对 ID 的分区结果进行了排序,所以虽然 l6 中没有指定排序,但是结果却会和 l5 保持一致,但是如果 l5 和 l6 调换位置,则结果不同,如下所示:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> ROW_NUMBER () OVER ( ) AS total_rank,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ) AS partition_row_number,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ORDER BY SCORE DESC) AS partition_rank
    -> FROM
    -> `ID_SCORE_1W`
    -> LIMIT 10;
+----------+-------------------+------------+----------------------+----------------+
| ID       | SCORE             | total_rank | partition_row_number | partition_rank |
+----------+-------------------+------------+----------------------+----------------+
| 10000694 | 0.538435152172434 |       5816 |                    4 |              1 |
| 10000694 | 0.204906177029889 |       5817 |                    5 |              2 |
| 10000694 | 0.178224311905739 |       5682 |                    1 |              3 |
| 10000694 | 0.17722191737495  |       5683 |                    2 |              4 |
| 10000694 | 0.173294815249342 |       5684 |                    3 |              5 |
| 10000694 | 0.157300442035558 |       5818 |                    6 |              6 |
| 10002527 | 0.123965125555045 |       8676 |                   14 |              1 |
| 10002527 | 0.119915039695458 |       1893 |                    1 |              2 |
| 10002527 | 0.114557146409275 |       8941 |                    6 |              3 |
| 10002527 | 0.114177958425522 |       1894 |                    2 |              4 |
+----------+-------------------+------------+----------------------+----------------+
10 rows in set (0.08 sec)

可以看到,此时 partition_row_number 和 partition_rank 的结果出现了差异。 partition_row_number 依据是原来行排序的序号,而 partition_rank 依据是组内排序后的结果。

OVER 子句语法

over_clause:
    {OVER (window_spec) | OVER window_name}

如上述语法所示,OVER 子句,有两种定义形式,一种是直接在 OVER 中书写窗口的划分方式,另一种则是在 OVER 中书写已定义好的窗口的名称(window name)

window_spec 语法

window_spec:
    [window_name] [partition_clause] [order_clause] [frame_clause]

window_spec 指定了窗口的划分方式,其中:

  • window_name 表现为窗口的名称,可以不写

  • partition_clause 指示如何将查询划分为组

语法表现为:

partition_clause:
    PARTITION BY expr [, expr] ...

这个特别需要注意的是,expr 表示为一个表达式,也就是说,除了直接 BY COLUMN_NAME 进行分区,我们完全可以配合相关函数,进行更复杂的分区操作。

  • order_clause 指示如何对已划分的组进行排序

语法表现为:

order_clause:
    ORDER BY expr [ASC|DESC] [, expr [ASC|DESC]] ...

这块需要注意的是,对于每个表达式,我们都可以使用 [ASC|DESC] 来控制其排序方式,如果 ORDER_BY 多个参数,需要注意此处的排序形式,是否符合自身需求。

  • frame_clause 是分区的子集

Named Window

使用 WINDOW 子句,可以独立的定义窗口,以便在 OVER 中使用。

语法:

WINDOW window_name AS (window_spec)
    [, window_name AS (window_spec)] ...

示例:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> ROW_NUMBER () OVER ( orderd_id_group ) AS partition_rank,
    -> ROW_NUMBER () OVER ( PARTITION BY ID ORDER BY SCORE DESC) AS orderd_partition_rank
    -> FROM
    -> `ID_SCORE_1W`
    -> WINDOW orderd_id_group AS (
    -> PARTITION BY ID ORDER BY SCORE DESC
    -> )
    -> LIMIT 10;
+----------+-------------------+----------------+-----------------------+
| ID       | SCORE             | partition_rank | orderd_partition_rank |
+----------+-------------------+----------------+-----------------------+
| 10000694 | 0.538435152172434 |              1 |                     1 |
| 10000694 | 0.204906177029889 |              2 |                     2 |
| 10000694 | 0.178224311905739 |              3 |                     3 |
| 10000694 | 0.17722191737495  |              4 |                     4 |
| 10000694 | 0.173294815249342 |              5 |                     5 |
| 10000694 | 0.157300442035558 |              6 |                     6 |
| 10002527 | 0.123965125555045 |              1 |                     1 |
| 10002527 | 0.119915039695458 |              2 |                     2 |
| 10002527 | 0.114557146409275 |              3 |                     3 |
| 10002527 | 0.114177958425522 |              4 |                     4 |
+----------+-------------------+----------------+-----------------------+
10 rows in set (0.06 sec)

可以看到在 WINDOW 子句中,我们定义了 orderd_id_group,并在 l4 的 OVER 子句中进行了引用,其效果与 l5 中直接定义的 window_spec 发挥了同等效果。

NAMED WINDOW 可以简化当 SELECT 中存在大量窗口函数应用时的场景。

Window Function Frame

如上 window_spec 所示,我们定义 frame 子句,以划分当前分区的子集。 可以将 frame 理解为一个分区中,滑动的子集,这个子集的范围、位置,我们可以依赖于相应行来进行设置,比如相比于计算每组的平均值 我们可以进而利用 frame 计算,一个小组中,每个行其上下两行的平均值。

示例:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> AVG (SCORE) OVER ( PARTITION BY ID) AS AVG_SCORE,
    -> AVG (SCORE) OVER ( PARTITION BY ID ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS CONTEXT_AVG_SCORE
    -> FROM
    -> `ID_SCORE_1W`
    -> LIMIT 10;
+----------+-------------------+---------------------+---------------------+
| ID       | SCORE             | AVG_SCORE           | CONTEXT_AVG_SCORE   |
+----------+-------------------+---------------------+---------------------+
| 10000694 | 0.178224311905739 | 0.23823046929465197 | 0.17772311464034452 |
| 10000694 | 0.17722191737495  | 0.23823046929465197 | 0.17624701484334368 |
| 10000694 | 0.173294815249342 | 0.23823046929465197 | 0.29631729493224196 |
| 10000694 | 0.538435152172434 | 0.23823046929465197 |  0.3055453814838883 |
| 10000694 | 0.204906177029889 | 0.23823046929465197 | 0.30021392374596034 |
| 10000694 | 0.157300442035558 | 0.23823046929465197 | 0.18110330953272352 |
| 10002527 | 0.119915039695458 | 0.11237516501964284 |    0.11704649906049 |
| 10002527 | 0.114177958425522 | 0.11237516501964284 | 0.11526094245065466 |
| 10002527 | 0.111689829230984 | 0.11237516501964284 |   0.112020663424398 |
| 10002527 | 0.110194202616688 | 0.11237516501964284 |   0.110665581155379 |
+----------+-------------------+---------------------+---------------------+
10 rows in set (0.11 sec)

如上述代码所示,我们可以看到,相对于对分组进行平均值计算,在加入了 frame 子句 BETWEEN 1 PRECEDING AND 1 FOLLOWING 后 其平均得分发生了变化,例如 l1 中 AVG_SCORE 为 0.23823046929465197, CONTEXT_AVG_SCORE 为 0.17772311464034452 这块 AVG_SCORE 很好理解,而 CONTEXT_AVG_SCORE 的值实际等于 (l1.SCORE + l2.SCORE)/2

同样,l2 的 CONTEXT_AVG_SCORE 则等于 (l1.SCORE + l2.SCORE + l3.SCORE)/3 -> l2.CONTEXT_AVG_SCORE

语法:

frame_clause:
    frame_units frame_extent

frame_units:
    {ROWS | RANGE}

frame_extent:
    {frame_start | frame_between}

frame_between:
    BETWEEN frame_start AND frame_end

frame_start, frame_end: {
    CURRENT ROW -- 当前行
  | UNBOUNDED PRECEDING -- 边界是分区的第一行
  | UNBOUNDED FOLLOWING -- 边界是分区的最后一行
  | expr PRECEDING -- 当前行的前 n 行
  | expr FOLLOWING -- 当前行的后 n 行
}

这块的核心在于 frame_start, frame_end, 举例来说,如果 frame 子句书写为

BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

则实际等同于未做 frame,用的还是全部分区。

如下所示:

mysql> SELECT DISTINCT
    -> ID,
    -> SCORE,
    -> AVG( SCORE ) OVER ( PARTITION BY ID ) AS AVG_SCORE,
    -> AVG( SCORE ) OVER ( PARTITION BY ID ROWS BETWEEN UNBOUNDED  PRECEDING AND UNBOUNDED  FOLLOWING ) AS CONTEXT_AVG_SCORE
    -> FROM
    -> `ID_SCORE_1W`
    -> LIMIT 10;
+----------+-------------------+---------------------+---------------------+
| ID       | SCORE             | AVG_SCORE           | CONTEXT_AVG_SCORE   |
+----------+-------------------+---------------------+---------------------+
| 10000694 | 0.178224311905739 | 0.23823046929465197 | 0.23823046929465197 |
| 10000694 | 0.17722191737495  | 0.23823046929465197 | 0.23823046929465197 |
| 10000694 | 0.173294815249342 | 0.23823046929465197 | 0.23823046929465197 |
| 10000694 | 0.538435152172434 | 0.23823046929465197 | 0.23823046929465197 |
| 10000694 | 0.204906177029889 | 0.23823046929465197 | 0.23823046929465197 |
| 10000694 | 0.157300442035558 | 0.23823046929465197 | 0.23823046929465197 |
| 10002527 | 0.119915039695458 | 0.11237516501964284 | 0.11237516501964284 |
| 10002527 | 0.114177958425522 | 0.11237516501964284 | 0.11237516501964284 |
| 10002527 | 0.111689829230984 | 0.11237516501964284 | 0.11237516501964284 |
| 10002527 | 0.110194202616688 | 0.11237516501964284 | 0.11237516501964284 |
+----------+-------------------+---------------------+---------------------+
10 rows in set (0.09 sec)

注意,有部分函数是无法与 frame 子句共用的,例如:

CUME_DIST()
DENSE_RANK()
LAG()
LEAD()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()

这些语句即使使用了 frame,也依然会使用整个分区的。

其他

注意,如果要使用窗口函数,需要满足一些基础条件,包括:

  • MySQL 版本 > 8.0
  • 存储引擎为 InnoDB 或 NDB,例如 MyISAM 就暂时不行