記錄一次CPU使用率過(guò)高故障的分析與處理

一、問(wèn)題現(xiàn)象
下午接到系統(tǒng)運(yùn)行卡頓的反饋后,首先對(duì)主機(jī)性能進(jìn)行了檢查,發(fā)現(xiàn)CPU使用率異常偏高,系統(tǒng)整體負(fù)載壓力巨大(環(huán)境是CentOS7.9+Oracle11.2.0.4單機(jī))。具體情況如下圖所示:

二、初步分析
通過(guò)sqlplus連接數(shù)據(jù)庫(kù)后,查看當(dāng)前會(huì)話等待事件,發(fā)現(xiàn)存在大量的resmgr:cpu quantum等待:
[oracle@qxdb ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 1 12:49:57 2025
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select event,count(*) from v$session group by event;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
SQL*Net message from client 51
resmgr:cpu quantum 398
Streams AQ: waiting for messages in the queue 2
read by other session 13
rdbms ipc message 16
smon timer 1
pmon timer 1
Streams AQ: qmn slave idle wait 1
latch free 2
SQL*Net message to client 1
Streams AQ: waiting for time management or cleanup tasks 1
EVENT COUNT(*)
---------------------------------------------------------------- ----------
Streams AQ: qmn coordinator idle wait 1
VKTM Logical Idle Wait 1
DIAG idle wait 2
VKRM Idle 1
15 rows selected.
SQL>
resmgr:cpu quantum等待事件表明會(huì)話正在等待Oracle資源管理器(Resource Manager)分配CPU資源。
參考Oracle官方文檔(MOS)的相關(guān)說(shuō)明:
- High “Resmgr:Cpu Quantum” Wait Events In 11g Even When Resource Manager Is Disabled [ID 949033.1]
- 11g: Scheduler Maintenance Tasks or Autotasks [ID 756734.1]
- Large Waits With The Wait Event “Resmgr:Cpu Quantum” [ID 806893.1]
三、初步處理
根據(jù)上述分析,嘗試調(diào)整資源管理計(jì)劃參數(shù),并關(guān)閉自動(dòng)任務(wù)以減少資源爭(zhēng)用:
-- 檢查當(dāng)前資源管理計(jì)劃
SQL> show parameter resource_manager_plan
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
resource_manager_plan string SCHEDULER[0x3009]:DEFAULT_MAIN
-- 清空資源管理計(jì)劃 TENANCE_PLAN
SQL> alter system set resource_manager_plan='';
System altered.
-- 清除維護(hù)窗口的資源計(jì)劃
SQL> execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('WEEKEND_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('MONDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('TUESDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('WEDNESDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('THURSDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('FRIDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('SATURDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
SQL> execute dbms_scheduler.set_attribute('SUNDAY_WINDOW','RESOURCE_PLAN','');
PL/SQL procedure successfully completed.
-- 禁用自動(dòng)空間顧問(wèn)和SQL調(diào)優(yōu)顧問(wèn)任務(wù)
SQL> BEGIN
DBMS_AUTO_TASK_ADMIN.DISABLE(
client_name => 'auto space advisor',
operation => NULL,
window_name => NULL);
END;
/ 2 3 4 5 6 7
PL/SQL procedure successfully completed.
SQL> BEGIN
DBMS_AUTO_TASK_ADMIN.DISABLE(
client_name => 'sql tuning advisor',
operation => NULL,
window_name => NULL);
END;
/ 2 3 4 5 6 7
PL/SQL procedure successfully completed.
SQL> select client_name,status from dba_autotask_client;
CLIENT_NAME STATUS
---------------------------------------------------------------- --------
auto optimizer stats collection ENABLED
auto space advisor DISABLED
sql tuning advisor DISABLED
-- 重啟數(shù)據(jù)庫(kù)使配置生效
SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 7.2688E+10 bytes
Fixed Size 2260728 bytes
Variable Size 6979322120 bytes
Database Buffers 6.5498E+10 bytes
Redo Buffers 208642048 bytes
Database mounted.
Database opened.
SQL> show parameter resource_manager_plan
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
resource_manager_plan string
SQL>
然而,重啟后CPU使用率仍然居高不下:

四、深入分析
再次檢查會(huì)話等待事件,發(fā)現(xiàn)等待類(lèi)型已發(fā)生變化,出現(xiàn)了大量的latch free和latch: cache buffers chains等待:
SQL> select event,count(*) from v$session group by event;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
SQL*Net message from client 41
Streams AQ: waiting for messages in the queue 5
read by other session 3
null event 2
rdbms ipc message 15
latch: cache buffers chains 30
db file sequential read 7
pmon timer 1
smon timer 1
job scheduler coordinator slave wait 1
latch free 92
EVENT COUNT(*)
---------------------------------------------------------------- ----------
Streams AQ: qmn slave idle wait 1
SQL*Net message to client 1
Disk file operations I/O 1
Streams AQ: qmn coordinator idle wait 1
VKTM Logical Idle Wait 1
Streams AQ: waiting for time management or cleanup tasks 1
DIAG idle wait 2
18 rows selected.
SQL>
通過(guò)關(guān)聯(lián)SQL_ID定位到具體問(wèn)題語(yǔ)句:
SQL> select event,sql_id,count(1) from v$session where event='latch free' group by event,sql_id
2 ;
EVENT SQL_ID
---------------------------------------------------------------- -------------
COUNT(1)
----------
latch free d4guxxn00hu01
1
latch free gm684ycvk15qd
1
latch free 0z45dukpd4nan
144
EVENT SQL_ID
---------------------------------------------------------------- -------------
COUNT(1)
----------
latch free gptqjj4hx10rm
1
SQL> select sql_text from gv$sql where sql_id='0z45dukpd4nan';
SQL_TEXT
--------------------------------------------------------------------------------
SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT WHERE STATDATE = TRUNC(LAST_DAY(A
DD_MONTHS(SYSDATE,-1))) AND COMPID = :B2 AND OWNERID = :B1
SQL>
同時(shí)發(fā)現(xiàn)數(shù)據(jù)庫(kù)中存在大量作業(yè)運(yùn)行:
SQL> select count(1) from dba_jobs;
COUNT(1)
----------
6201
SQL> select count(1) from dba_jobs_runnings;
COUNT(1)
----------
16
SQL>

分析問(wèn)題SQL的執(zhí)行計(jì)劃,發(fā)現(xiàn)對(duì)SCM_INOUT_LOT_HISTORY_EXT表(約2億條記錄)進(jìn)行了全表掃描:
SQL> conn MED_GKCS
Enter password:
Connected.
SQL> set autotrace trace exp
SQL> SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT WHERE STATDATE = TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE,-1))) AND COMPID = 2 AND OWNERID = 1;
Execution Plan
----------------------------------------------------------
Plan hash value: 343819289
--------------------------------------------------------------------------------
----------------
| Id | Operation | Name | Rows | Bytes | Cost (%
CPU)| Time |
--------------------------------------------------------------------------------
----------------
| 0 | SELECT STATEMENT | | 1 | 14 | 10820
(1)| 00:02:10 |
| 1 | SORT AGGREGATE | | 1 | 14 |
| |
|* 2 | TABLE ACCESS FULL| SCM_INOUT_LOT_HISTORY_EXT | 1 | 14 | 10820
(1)| 00:02:10 |
--------------------------------------------------------------------------------
----------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNERID"=1 AND "STATDATE"=TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE@!,-1)
)) AND
"COMPID"=2)
SQL> set autot off
SQL> SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT;
COUNT(1)
----------
195323892
SQL>
五、根本原因與最終處理
綜合以上分析,確定故障的根本原因:
-
開(kāi)發(fā)人員誤操作導(dǎo)致作業(yè)異常重復(fù)創(chuàng)建,產(chǎn)生大量并發(fā)進(jìn)程
-
關(guān)鍵SQL語(yǔ)句缺乏合適索引,導(dǎo)致全表掃描
-
大量并發(fā)全表掃描操作引發(fā)緩沖鏈閂鎖爭(zhēng)用
采取以下處理措施:
1.緊急重啟數(shù)據(jù)庫(kù)并停止監(jiān)聽(tīng):
SQL> conn / as sysdba
Connected.
SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
[oracle@qxdb ~]$ lsnrctl stop
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 01-SEP-2025 13:30:31
Copyright (c) 1991, 2013, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.168.135)(PORT=1521)))
The command completed successfully
[oracle@qxdb ~]$
2.創(chuàng)建缺失索引并更新統(tǒng)計(jì)信息:
[oracle@qxdb ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 1 13:30:36 2025
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 7.2688E+10 bytes
Fixed Size 2260728 bytes
Variable Size 6979322120 bytes
Database Buffers 6.5498E+10 bytes
Redo Buffers 208642048 bytes
Database mounted.
Database opened.
SQL> conn MED_GKCS
Enter password:
Connected.
SQL> create index idx_scm_inout_lot_hist_ext_st on SCM_INOUT_LOT_HISTORY_EXT(STATDATE);
Index created.
SQL> exec dbms_stats.gather_table_stats('MED_GKCS','SCM_INOUT_LOT_HISTORY_EXT',cascade=>true);
PL/SQL procedure successfully completed.
SQL>
3.驗(yàn)證執(zhí)行計(jì)劃已改用索引:
SQL> set autotrace trace exp
SQL> SELECT COUNT(1) FROM SCM_INOUT_LOT_HISTORY_EXT WHERE STATDATE = TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE,-1))) AND COMPID = 2 AND OWNERID = 1;
Execution Plan
----------------------------------------------------------
Plan hash value: 651601031
--------------------------------------------------------------------------------
------------------------------
| Id | Operation | Name | Rows | B
ytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
------------------------------
| 0 | SELECT STATEMENT | | 1 |
14 | 7753 (1)| 00:01:34 |
| 1 | SORT AGGREGATE | | 1 |
14 | | |
|* 2 | TABLE ACCESS BY INDEX ROWID| SCM_INOUT_LOT_HISTORY_EXT | 1 |
14 | 7753 (1)| 00:01:34 |
|* 3 | INDEX RANGE SCAN | IDX_SCM_INOUT_LOT_HIST_EXT_ST | 542K|
| 1467 (1)| 00:00:18 |
--------------------------------------------------------------------------------
------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNERID"=1 AND "COMPID"=2)
3 - access("STATDATE"=TRUNC(LAST_DAY(ADD_MONTHS(SYSDATE@!,-1))))
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
[oracle@qxdb ~]$
4.重啟監(jiān)聽(tīng)并清理異常作業(yè):
[oracle@qxdb ~]$ lsnrctl start
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 01-SEP-2025 13:59:38
Copyright (c) 1991, 2013, Oracle. All rights reserved.
Starting /app/oracle/product/11.2.0.4/db_1/bin/tnslsnr: please wait...
TNSLSNR for Linux: Version 11.2.0.4.0 - Production
System parameter file is /app/oracle/product/11.2.0.4/db_1/network/admin/listener.ora
Log messages written to /app/oracle/diag/tnslsnr/qxdb/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.168.135)(PORT=1521)))
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.168.135)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.2.0.4.0 - Production
Start Date 01-SEP-2025 13:59:38
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /app/oracle/product/11.2.0.4/db_1/network/admin/listener.ora
Listener Log File /app/oracle/diag/tnslsnr/qxdb/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.168.135)(PORT=1521)))
Services Summary...
Service "csqxcs" has 1 instance(s).
Instance "csqxcs", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully
[oracle@qxdb ~]$
-- 聯(lián)系開(kāi)發(fā)人員確認(rèn)并清理異常作業(yè)
[oracle@qxdb ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Mon Sep 1 14:00:01 2025
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select count(1) from dba_jobs_running;
COUNT(1)
----------
0
SQL> select count(1) from dba_jobs;
COUNT(1)
----------
16
SQL>
六、處理結(jié)果
系統(tǒng)恢復(fù)正常運(yùn)行,CPU使用率降至正常水平:

七、經(jīng)驗(yàn)總結(jié)
本次CPU使用率過(guò)高故障是由多方面因素共同導(dǎo)致的:
-
直接原因:開(kāi)發(fā)誤操作導(dǎo)致作業(yè)異常重復(fù)創(chuàng)建,產(chǎn)生大量并發(fā)進(jìn)程
-
性能瓶頸:關(guān)鍵SQL缺乏適當(dāng)索引,導(dǎo)致全表掃描和緩沖鏈閂鎖爭(zhēng)用
-
加劇因素:資源管理器配置可能放大了資源爭(zhēng)用問(wèn)題
處理經(jīng)驗(yàn):
- 系統(tǒng)性能問(wèn)題需要綜合多方面因素進(jìn)行分析,不能僅關(guān)注表面現(xiàn)象
- 等待事件分析是定位數(shù)據(jù)庫(kù)性能問(wèn)題的重要方法
- 對(duì)于大規(guī)模數(shù)據(jù)表,適當(dāng)?shù)乃饕O(shè)計(jì)至關(guān)重要
- 建立健全的監(jiān)控體系,能夠及時(shí)發(fā)現(xiàn)作業(yè)異常等異常情況
- 重要的配置變更和作業(yè)調(diào)度應(yīng)建立嚴(yán)格的審核機(jī)制
以上就是本次故障處理全過(guò)程,分享出來(lái)希望能給各位小伙伴一些幫助??。




