記一次 1000+每秒 UPDATE 引發的 Redo 風暴案例

原創孫瑩 2025-12-11

963

記一次 1000+每秒 UPDATE 引發的 Redo 風暴：AWR 定位 log file switch (checkpoint incomplete) 并落地解決

一、問題現象

12 月 8 日上午 11 點多，監控告警提示生產數據庫磁盤使用率超過 90%。初步判斷：磁盤使用突增，需要關注數據庫整體負載與日志增長情況。

二、問題排查

2.1 操作系統磁盤空間

查看 OS 空間與 Oracle 目錄占用，發現歸檔/恢復區相關目錄增長較快（fast_recovery_area 約 43G）。

[root@md-db-out ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root  461G  399G   39G  92% /
tmpfs                         32G  160K   32G   1% /dev/shm
/dev/sda1                    485M   39M  421M   9% /boot
[root@md-db-out ~]# su - oracle
[oracle@md-db-out ~]$ cd /u01/app/oracle/
[oracle@md-db-out oracle]$ du -h --max-depth=1
4.0K    ./checkpoints
55G     ./diag
4.6G    ./admin
70G     ./oradata
43G     ./fast_recovery_area
1.8M    ./cfgtoollogs
4.4G    ./product
177G    .
[oracle@md-db-out oracle]$

雖然oradata目錄不小但故障前沒有添加過表空間數據文件，所以這次告警更像是“歸檔/恢復區增長過快”觸發

2.2 歸檔日志切換頻率異常

歸檔日志增長過快后，進一步統計日志切換頻率，發現上午 9 點開始切換次數從 11 次 → 755 次（提升近 68 倍）。

[oracle@md-db-out oracle]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Mon Dec 8 11:27:01 2025

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set lines 300
SQL> column h0 format 999
column h1 format 999
column h2 format 999
column h3 format 999
column h4 format 999
column h5 format 999
column h6 format 999
column h7 format 999
column h8 format 999
column h9 format 999
column h10 format 999
column h11 format 999
column h12 format 999
column h13 format 999
column h14 format 999
column h15 format 999
column h16 format 999
column h17 format 999
column h18 format 999
column h19 format 999
column h20 format 999
column h21 format 999
column h22 format 999
column h23 format 999
column avg format 999.99
column day format a6

SELECT TRUNC (first_time) "Date", TO_CHAR (first_time, 'Dy') "Day", COUNT (1) "Total",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) h0,
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23", to_char(ROUND (COUNT (1) / 24, 2),'fm99999999990.00') "Avg"
FROM gv$log_history
WHERE first_time >= trunc(SYSDATE) - 7
and thread# = inst_id
GROUP BY TRUNC (first_time), TO_CHAR (first_time, 'Dy')
ORDER BY 1 DESC;
SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL>   2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
Date                Day         Total   H0   h1   h2   h3   h4   h5   h6   h7   h8   h9  h10  h11  h12  h13  h14  h15  h16  h17  h18  h19  h20  h21  h22  h23 Avg
------------------- ------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---------------
2025-12-08 00:00:00 Mon          1823   48    6    8    7    8    7    8    7   11  755  884   74    0    0    0    0    0    0    0    0    0    0    0    0 75.96
2025-12-07 00:00:00 Sun           480   38    8    9    8    6    8    9    7    9    9    9    8    8    8    9    8    9    8    8    9    7    7   10  261 20.00
2025-12-06 00:00:00 Sat           520   32    8    8    7    7   10    9    9    9    9   42    9    9    9    8    9    9    9    8    9    9    8    8  266 21.67
2025-12-05 00:00:00 Fri           568   33    9   11    8    7    9    9    9    9   19   19   18   13   11   15   17   23   16   11   10    7    9   10  266 23.67
2025-12-04 00:00:00 Thu           581   42    9   10    9    8    9    9    8   12   24   18   17   11   24   16   18   20   14   12    9    9    8   11  254 24.21
2025-12-03 00:00:00 Wed           585   55    8   10    8    7   10    9   10   12   14   19   20   11   13   19   20   20   17   11   10    8    9   10  255 24.38
2025-12-02 00:00:00 Tue           603   48    9   10    7    7   10    8    8   12   31   18   21   11   12   22   32   21   27   20   14    8   10   10  227 25.13
2025-12-01 00:00:00 Mon           554   51    7    7    6    7    7    8    7   10   14   18   17   11   13   16   18   17   17   14    8    8    9   11  253 23.08

8 rows selected.

SQL>

三、AWR報告分析

拉取10:00~11:00的AWR報告

SQL> @$ORACLE_HOME/rdbms/admin/awrrpt.sql

3.1 整體負載概況

DB Time：29,845s ≈ 497 分鐘，AAS ≈ 29,845 / (59×60) ≈ 8.4 個活躍會話

DB CPU ≈ 6,541s → 每秒約 1.84 CPU 秒，用滿 16 核只用了 約 11–12% 的 CPU

OS 負載：%busy ~ 11.8%，%idle ~ 88%，%iowait ~ 5.4%

結論：CPU 很空閑，系統總體是“等待型”而不是“CPU 打滿型”。

3.2 Top Wait Events

查看Top Wait 事件，找到真正卡在哪里。Top 10 Foreground Events：

log file switch (checkpoint incomplete)

Waits：12,509
Total Wait Time：≈ 12,288s（占 DB time ~41%）
Avg wait：982 ms / 次

log file sync

Waits：4,112,201
Total Wait Time：≈ 6,929s（占 DB time ~23%）
Avg wait：2 ms / 次
大約 1,160 次 commit/s 左右

Wait Class 匯總也印證了這一點：

Configuration：45%（主要就是各種 log file switch、SQ contention 等）
Commit：23%（log file sync）
DB CPU：22%
Concurrency：8.6%
User I/O：7%

結論：絕大部分時間都耗在 redo / checkpoint 配置和并發控制上。

3.3 定位 SQL：高頻 UPDATE

在 “SQL ordered by Elapsed Time” 中，你定位到 SQL ID：0vq0s6rm8fawn。它在 60 分鐘內執行 3,763,703 次，約 1000+ 次/秒 UPDATE，直接帶來大量 redo / 歸檔與頻繁 log switch。

SQL如下：

UPDATE SOA_EX_DRUG SET DRUG_ID=:1 , CMM_NAME=:2 , CMMDT_NM=:3 , MED_CTGY_TYPE=:4 , NORM=:5 , MIN_PCKNG_CMPNY=:6 , APRV_NO=:7 , PRDCT_PLT=:8 , status=:9 , CNTRY_NAME=:10 , DEVICES=:11 , PRV_APRV_NO=:12 , CMM_NAME_PY=:13 , PRDCT_CREDIT=:14 , PRDCT_ID=:15 , PACK_QUANTITY=:16 , MGT_CAT_NAME=:17 , PROD_CAT_NAME=:18 , REGIST_FORM_NAME=:19 , CLASS_SUBCATALOG_V17_NAME=:20 , LEVEL1_PROD_CAT_V17_NAME=:21 , LEVEL2_PROD_CAT_V17_NAME=:22 , SPU_FLAG=:23 WHERE ID=:24

進一步用 dba_hist_sqlstat 驗證：12/5 之前 executions 僅 2 萬級，到了 12/8 突增到 300 萬級（增長 167 倍）。

SQL> set linesize 300
SQL> select a.snap_id,b.begin_interval_time,a.executions_delta from dba_hist_sqlstat a,dba_hist_snapshot b where a.sql_id='0vq0s6rm8fawn' and a.snap_id=b.snap_id order by a.snap_id;

   SNAP_ID BEGIN_INTERVAL_TIME                                                              EXECUTIONS_DELTA
---------- -------------------------------------------------------------------------------- ----------------
     64968 02-DEC-25 02.00.04.126 PM                                                                   22446
     64969 02-DEC-25 03.00.06.304 PM                                                                   35605
     64989 03-DEC-25 11.00.53.713 AM                                                                   25956
     65011 04-DEC-25 09.00.47.148 AM                                                                   22293
     65018 04-DEC-25 04.00.05.309 PM                                                                   15736
     65040 05-DEC-25 02.00.01.717 PM                                                                    5283
     65042 05-DEC-25 04.00.06.122 PM                                                                   51054
     65107 08-DEC-25 09.00.51.667 AM                                                                 3144734
     65108 08-DEC-25 10.00.56.171 AM                                                                 3763703
     65109 08-DEC-25 11.00.03.946 AM                                                                  318640
     65110 08-DEC-25 12.00.35.037 PM                                                                 3151397
     65111 08-DEC-25 01.00.43.053 PM                                                                 3999423
     65112 08-DEC-25 02.00.29.114 PM                                                                 2629413

13 rows selected.

SQL>

一直到下午14點多查看oratop還是顯示SQL Id：0vq0s6rm8fawn頻繁執行。造成大量log file switch (checkpoint incomplete)等待。

結論：高頻的更新SQL語句，導致log file switch (checkpoint incomplete)一小時內 12,509 次，平均每次接近 1 秒

四、問題解決

4.1 應用側

第一步：與開發溝通，把高頻 SQL 告知對方，推動應用修改。

應用修改后，日志切換頻率正常。如下：

SQL> SELECT TRUNC (first_time) "Date", TO_CHAR (first_time, 'Dy') "Day", COUNT (1) "Total",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) h0,
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23", to_char(ROUND (COUNT (1) / 24, 2),'fm99999999990.00') "Avg"
FROM gv$log_history
WHERE first_time >= trunc(SYSDATE) - 7
and thread# = inst_id
GROUP BY TRUNC (first_time), TO_CHAR (first_time, 'Dy')
ORDER BY 1 DESC;  2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30

Date                Day         Total   H0   h1   h2   h3   h4   h5   h6   h7   h8   h9  h10  h11  h12  h13  h14  h15  h16  h17  h18  h19  h20  h21  h22  h23 Avg
------------------- ------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---------------
2025-12-10 00:00:00 Wed           333   56    9   10    7    7    8    7    9   18   22   20   16   13   12   16   18   19   16   10    7    8    7    9    9 13.88
2025-12-09 00:00:00 Tue           607    0    0    0    0    0    0    0    0   67   72   16   15   10   12   17   19   44   27   11   10    7    8    7  265 25.29
2025-12-08 00:00:00 Mon          4364   48    6    8    7    8    7    8    7   11  755  884   83  725  884  622   17   17   14    9    9    7    7   10  211 181.83
2025-12-07 00:00:00 Sun           480   38    8    9    8    6    8    9    7    9    9    9    8    8    8    9    8    9    8    8    9    7    7   10  261 20.00
2025-12-06 00:00:00 Sat           520   32    8    8    7    7   10    9    9    9    9   42    9    9    9    8    9    9    9    8    9    9    8    8  266 21.67
2025-12-05 00:00:00 Fri           568   33    9   11    8    7    9    9    9    9   19   19   18   13   11   15   17   23   16   11   10    7    9   10  266 23.67
2025-12-04 00:00:00 Thu           581   42    9   10    9    8    9    9    8   12   24   18   17   11   24   16   18   20   14   12    9    9    8   11  254 24.21
2025-12-03 00:00:00 Wed           585   55    8   10    8    7   10    9   10   12   14   19   20   11   13   19   20   20   17   11   10    8    9   10  255 24.38
SQL>

4.2 數據庫側：擴大 redo log size（從 50MB 擴到 2GB）

當時的 redo 配置：3 組、每組 50MB(默認)。然后新增 3 組 2GB（group 4/5/6），切換后刪除舊組（1/2/3）。檢查 checkpoint / mttr 參數，查看 FAST_START_MTTR_TARGET / LOG_CHECKPOINT_INTERVAL / LOG_CHECKPOINT_TIMEOUT避免設置得太小，導致頻繁 checkpoint。

-- 1) 查看現狀
SQL> select group#, bytes/1024/1024 mb, status from v$log;

    GROUP#         MB STATUS
---------- ---------- ----------------
         1         50 CURRENT
         2         50 INACTIVE
         3         50 INACTIVE

SQL> col member for a80
SQL> select group#, member from v$logfile;

    GROUP# MEMBER
---------- --------------------------------------------------------------------------------
         1 /u01/app/oracle/oradata/mddb/redo01.log
         2 /u01/app/oracle/oradata/mddb/redo02.log
         3 /u01/app/oracle/oradata/mddb/redo03.log

-- 2) 新增大 redo
SQL> alter database add logfile group 4 '/u01/app/oracle/oradata/mddb/redo04.log' size 2g;

Database altered.

SQL> alter database add logfile group 5 '/u01/app/oracle/oradata/mddb/redo05.log' size 2g;

Database altered.

SQL> alter database add logfile group 6 '/u01/app/oracle/oradata/mddb/redo06.log' size 2g;

Database altered.

-- 3) 手工切換，確保舊組不再 ACTIVE/CURRENT，再刪除
SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> select group#, bytes/1024/1024 mb, status from v$log;

    GROUP#         MB STATUS
---------- ---------- ----------------
         1         50 INACTIVE
         2         50 INACTIVE
         3         50 INACTIVE
         4       2048 INACTIVE
         5       2048 ACTIVE
         6       2048 CURRENT

6 rows selected.

SQL> alter database drop logfile group 1;

Database altered.

SQL> alter database drop logfile group 2;

Database altered.

SQL> alter database drop logfile group 3;

Database altered.

-- 4) 驗證
SQL> select group#, bytes/1024/1024 mb, status from v$log;

    GROUP#         MB STATUS
---------- ---------- ----------------
         4       2048 INACTIVE
         5       2048 ACTIVE
         6       2048 CURRENT

-- 5) 檢查參數，為默認，系統自動觸發/控制 checkpoint
SQL> show parameter fast_start_mttr_target

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_mttr_target               integer     0
SQL> show parameter log_check

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_checkpoint_interval              integer     0
log_checkpoint_timeout               integer     1800
log_checkpoints_to_alert             boolean     FALSE
SQL>

五、總結

12 月 8 日 11 點生產庫磁盤使用率超過 90%，排查發現主要由歸檔日志快速增長導致（/u01 下 fast_recovery_area 占用明顯）。進一步查看日志切換，9 點后歸檔切換從 11 次飆升到 755 次。拉取 10:00–11:00 AWR 后確認系統屬于“等待型”，主要等待集中在 log file switch (checkpoint incomplete)（12,509 次，約 41% DB time）和 log file sync（4,112,201 次，約 23% DB time）。同期 Top SQL 中 0vq0s6rm8fawn 在 60 分鐘內執行 3,763,703 次（約 1000+ 次/秒），并且相較 12/5 執行次數突增 167 倍，是導致 redo/歸檔暴漲的關鍵誘因。處置上先通知開發處理高頻 SQL，同時將 redo log 從 50MB 3 組擴容為 2GB 3 組并完成切換與清理，緩解日志切換與 checkpoint 壓力。

墨力計劃 oracle 性能優化數據庫實操

最后修改時間：2025-12-16 10:50:20

「喜歡這篇文章，您的關注和贊賞是給作者最好的鼓勵」

關注作者

文章被以下合輯收錄

墨天輪數據庫性能優化征文-優質文章合集（共37篇）

本合輯匯總了「實操看我的」數據庫性能優化篇征文活動中“最佳實操獎” 文章及部分高閱讀量合格文章，歡迎參考~

特色一级强游戏,海奥华预言免费阅读,51漫画兑换码,美女裸体无遮挡永久免费观看网站,lubuntu线路检测入口