ORA-00600 [4194] 故障處理

原創 yuqi.zhou 云和恩墨技術服務團隊 2021-03-31

6302

客戶一套測試庫主機宕機，主機啟動后數據庫啟動報ORA-00600 [4194]，本文介紹處理過程。

1. 問題現象

啟動數據庫，數據進行實例恢復完成后，能成功open，但數據庫在幾分鐘后自動宕掉，通過alert日志，可以看到宕庫前報錯是ORA-00600 arguments: [4194]。

alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
Started redo scan     -----> 掃描redo，進行實例恢復
Completed redo scan
 read 149 KB redo, 107 data blocks need recovery
Started redo application at
 Thread 1: logseq 84918, block 3
Recovery of Online Redo Log: Thread 1 Group 3 Seq 84918 Reading mem 0
  Mem# 0: /data/orauser01/app/oradata/orcl/redo03.log
Completed redo application of 0.12MB      -----> 進行實例恢復完成
Completed crash recovery at
 Thread 1: logseq 84918, block 302, scn 13343268733370
 107 data blocks read, 107 data blocks written, 149 redo k-bytes read
Wed Mar 24 18:10:56 2021
Thread 1 advanced to log sequence 84919 (thread open)
Thread 1 opened at log sequence 84919
  Current log# 1 seq# 84919 mem# 0: /data/orauser01/app/oradata/orcl/redo01.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Mar 24 18:10:56 2021
SMON: enabling cache recovery
[3151] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:30524 end:30584 diff:60 (0 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is ZHS16GBK
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Wed Mar 24 18:10:56 2021
QMNC started with pid=23, OS id=3405 
Completed: alter database open                  -----> open成功
Wed Mar 24 18:10:57 2021
db_recovery_file_dest_size of 4182 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Wed Mar 24 18:10:57 2021
Errors in file /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/trace/chnldev_m001_3409.trc  (incident=1905820):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/incident/incdir_1905820/chnldev_m001_3409_i1905820.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Mar 24 18:10:57 2021
Starting background process CJQ0
Wed Mar 24 18:10:57 2021
CJQ0 started with pid=26, OS id=3422 
Wed Mar 24 18:10:59 2021
Dumping diagnostic data in directory=[cdmp_20210324181059], requested by (instance=1, osid=3409 (M001)), summary=[incident=1905820].
Block recovery from logseq 84919, block 115 to scn 13343268734123
Recovery of Online Redo Log: Thread 1 Group 1 Seq 84919 Reading mem 0
  Mem# 0: /data/orauser01/app/oradata/orcl/redo01.log
Block recovery completed at rba 84919.3699.16, scn 3106.3100312748
Block recovery from logseq 84919, block 60 to scn 13343268733483
Recovery of Online Redo Log: Thread 1 Group 1 Seq 84919 Reading mem 0
  Mem# 0: /data/orauser01/app/oradata/orcl/redo01.log
Block recovery completed at rba 84919.119.16, scn 3106.3100312115
Errors in file /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/trace/chnldev_m001_3409.trc  (incident=1905821):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/incident/incdir_1905821/chnldev_m001_3409_i1905821.trc
Wed Mar 24 18:11:00 2021
Errors in file /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/trace/chnldev_j001_3429.trc  (incident=1905844):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/incident/incdir_1905844/chnldev_j001_3429_i1905844.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Mar 24 18:11:01 2021
Sweep [inc][1905844]: completed
Sweep [inc][1905821]: completed
Sweep [inc][1905820]: completed
Sweep [inc2][1905820]: completed
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/trace/chnldev_m001_3409.trc:
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Dumping diagnostic data in directory=[cdmp_20210324181101], requested by (instance=1, osid=3409 (M001)), summary=[incident=1905821].
Dumping diagnostic data in directory=[cdmp_20210324181103], requested by (instance=1, osid=3429 (J001)), summary=[incident=1905844].
Block recovery from logseq 84919, block 115 to scn 13343268734123
Recovery of Online Redo Log: Thread 1 Group 1 Seq 84919 Reading mem 0
  Mem# 0: /data/orauser01/app/oradata/orcl/redo01.log
Block recovery completed at rba 84919.3699.16, scn 3106.3100312748
Block recovery from logseq 84919, block 60 to scn 13343268734143
Recovery of Online Redo Log: Thread 1 Group 1 Seq 84919 Reading mem 0
  Mem# 0: /data/orauser01/app/oradata/orcl/redo01.log
Block recovery completed at rba 84919.3841.16, scn 3106.3100312768
Errors in file /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/trace/chnldev_j001_3429.trc  (incident=1905845):
ORA-00600: internal error code, arguments: [600], [ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], [], []
Incident details in: /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/incident/incdir_1905845/chnldev_j001_3429_i1905845.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /data/orauser01/app/oracle/diag/rdbms/orcl/chnldev/trace/chnldev_j001_3429.trc:
ORA-00600: internal error code, arguments: [600], [ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], [], []
Dumping diagnostic data in directory=[cdmp_20210324181105], requested by (instance=1, osid=3429 (J001)), summary=[incident=1905845].

2. 問題分析

參考MOS文檔：Step by step to resolve ORA-600 4194 4193 4197 on database crash (Doc ID 1428786.1)，有該錯誤的介紹和處理方法

The following error is occurring in the alert.log right before the database crashes.

ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []

This error indicates that a mismatch has been detected between redo records and rollback (undo) records.

ARGUMENTS:

Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block

Since we are adding a new undo record to our undo block, we would expect that the new record number is equal to the maximum record number in the undo block plus one. Before Oracle can add a new undo record to the undo block it validates that this is correct. If this validation fails, then an ORA-600 [4194] will be triggered.

通過文檔介紹，該錯誤是由重做記錄與回滾記錄不匹配引發。Oracle在驗證Undo record number時，會對比redo change 和回滾段中的undo record number，若發現2者存在差異則報該4194錯誤。其錯誤argument[a][b]，a代表回滾塊中的最大undo record number，b代表重做日志中記錄的undo record number。

這個問題通常發生在掉電或硬件故障導致數據庫crash，在啟動時，數據庫執行正常的前滾(重做)，然后回滾(撤銷)，這就是回滾時產生錯誤的地方。

3. 處理思路

通常最好的辦法是通過備份進行恢復。
如果沒有備份，那么可以通過特殊的初始化參數進行強制啟動，然后進行做進一步處理。
我這里先按照Doc ID 1428786.1里提供的方法嘗試處理。

4. 處理過程

（1）啟動數據庫到nomount，創建pfile，方便添加參數

SYS@chnldev> startup nomount
ORACLE instance started.

Total System Global Area  901971968 bytes
Fixed Size                  2258360 bytes
Variable Size             843057736 bytes
Database Buffers           50331648 bytes
Redo Buffers                6324224 bytes
SYS@chnldev> create pfile='/data/orauser01/pfile.ora' from spfile;
create pfile='/data/orauser01/pfile.ora' from spfile
*
ERROR at line 1:
ORA-01565: error in identifying file '?/dbs/spfile@.ora'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3

報錯文件不存在，那么數據庫應該是用pfile啟動的，確認一下

SYS@chnldev> show parameter spfile;

NAME                                 TYPE                   VALUE
------------------------------------ ---------------------- ------------------------------
spfile                               string

確認是通過pfile啟動，那可以直接去$ORACLE_HOME/dbs下去修改pfile

（2）修改pfile，添加如下參數

undo_management = manual
event = '10513 trace name context forever, level 2'

（3）關閉數據庫，restrict模式啟動

SYS@chnldev> shutdown immediate
ORA-01507: database not mounted


ORACLE instance shut down.
SYS@chnldev> 
SYS@chnldev> startup restrict
ORACLE instance started.

Total System Global Area  901971968 bytes
Fixed Size                  2258360 bytes
Variable Size             843057736 bytes
Database Buffers           50331648 bytes
Redo Buffers                6324224 bytes
Database mounted.
Database opened.

（4）查詢online的回滾段

SYS@chnldev> select tablespace_name, status, segment_name from dba_rollback_segs where status != 'OFFLINE';

TABLESPACE_NAME                STATUS           SEGMENT_NAME
------------------------------ ---------------- ------------------------------
SYSTEM                         ONLINE           SYSTEM

這里十分重要，我們這里所有的回滾段都是offlien狀態（system回滾段是永久online的）。
如果有online的非system回滾段，那么處理過程會更加復雜。

（5）重建undo表空間，并保留現有undo_tablespace參數

SYS@chnldev> create undo tablespace new_undotbs datafile '/data/orauser01/app/oradata/orcl/new_undotbs01.dbf' size 2000M;

Tablespace created.

SYS@chnldev> show parameter undo

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
undo_management                      string      MANUAL
undo_retention                       integer     900
undo_tablespace                      string      UNDOTBS1

（6）修改undo_tablespace參數，并重啟數據庫

修改pfile，將undo_tablespace參數修改為new_undotbs，然后重啟數據庫

SYS@chnldev> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SYS@chnldev> 
SYS@chnldev> startup
ORACLE instance started.

Total System Global Area  901971968 bytes
Fixed Size                  2258360 bytes
Variable Size             843057736 bytes
Database Buffers           50331648 bytes
Redo Buffers                6324224 bytes
Database mounted.
Database opened.

（7）觀察alert日志，沒有再報錯

Thu Mar 25 10:15:19 2021
ALTER DATABASE   MOUNT
Successful mount of redo thread 1, with mount id 1595909047
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Thu Mar 25 10:15:23 2021
ALTER DATABASE OPEN
Thread 1 opened at log sequence 84920
  Current log# 2 seq# 84920 mem# 0: /data/orauser01/app/oradata/orcl/redo02.log
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
[27669] Successfully onlined Undo Tablespace 8.
Undo initialization finished serial:0 start:57898154 end:57898234 diff:80 (0 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is ZHS16GBK
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Thu Mar 25 10:15:24 2021
QMNC started with pid=20, OS id=27672 
Completed: ALTER DATABASE OPEN
Thu Mar 25 10:15:25 2021
db_recovery_file_dest_size of 4182 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Starting background process CJQ0
Thu Mar 25 10:15:26 2021
CJQ0 started with pid=22, OS id=27688