Datagaurd是oracle 數(shù)據(jù)庫常用的容災(zāi)保護方案,但standby不再需要或刪除后,在primary 需要清理之前的配置,通常認為很容易,比如禁用日志傳輸,no force logging, 清理log_archive _destN \LOG_ARCHIVE_CONFIG相關(guān)的參數(shù)等。是否很容易? 但里面隱藏著一個風(fēng)險。上周一客戶遇到個案例,主庫Oracle RAC 11.2.0.4原來有dataguard,但是清理后,直到有1個節(jié)點重啟后實例無法啟動,簡單記錄。
DB alert log
Tue Jul 25 22:27:41 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_m000_58781.trc (incident=296583): ORA-01578: ORACLE data block corrupted (file # 403, block # 2) ORA-01110: data file 403: '+DATAVG/anbob/tempfile/temp2.dbf' Tue Jul 25 22:27:41 2023 Sweep [inc][296583]: completed Tue Jul 25 22:37:41 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_m000_64973.trc (incident=296584): ORA-01578: ORACLE data block corrupted (file # 403, block # 2) ORA-01110: data file 403: '+DATAVG/anbob/tempfile/temp2.dbf' Tue Jul 25 22:37:43 2023 Sweep [inc][296584]: completed Tue Jul 25 22:48:18 2023 IPC Send timeout detected. Receiver ospid 25486 [ Tue Jul 25 22:48:18 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: IPC Send timeout detected. Receiver ospid 25478 [ Tue Jul 25 22:48:18 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms0_25478.trc: IPC Send timeout detected. Receiver ospid 25490 [ Tue Jul 25 22:48:19 2023 ... Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: IPC Send timeout detected. Receiver ospid 25482 [ Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms1_25482.trc: IPC Send timeout detected. Receiver ospid 25486 [ Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: IPC Send timeout detected. Receiver ospid 25486 [ Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms2_25486.trc: Tue Jul 25 22:48:53 2023 IPC Send timeout detected. Receiver ospid 25482 [ Tue Jul 25 22:48:53 2023 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lms1_25482.trc: Tue Jul 25 22:49:07 2023 Detected an inconsistent instance membership by instance 1 Tue Jul 25 22:49:08 2023 Received an instance abort message from instance 1 Please check instance 1 alert and LMON trace files for detail. LMD0 (ospid: 25476): terminating the instance due to error 481 Tue Jul 25 22:49:08 2023 System state dump requested by (instance=2, osid=25476 (LMD0)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_25464_20230725224908.trc Tue Jul 25 22:49:09 2023 ORA-1092 : opitsk aborting process Tue Jul 25 22:49:09 2023 License high water mark = 1404 Instance terminated by LMD0, pid = 25476 USER (ospid: 8175): terminating the instance Instance terminated by USER, pid = 8175 Wed Jul 26 22:14:00 2023 Adjusting the default value of parameter parallel_max_servers from 2560 to 1970 due to the value of parameter processes (2000) Starting ORACLE instance (normal) ... ... ARC3 started with pid=47, OS id=11845 ARC1: Archival started ARC2: Archival started ARC1: Becoming the 'no FAL' ARCH ARC1: Becoming the 'no SRL' ARCH ARC2: Becoming the heartbeat ARCH WARNING: The 'LOG_ARCHIVE_CONFIG' init.ora parameter settings are inconsistent with another started instance. This may be caused by the 'DB_UNIQUE_NAME' init.ora parameter being specified differently on one or more of the other RAC instances; theDB_UNIQUE_NAME parameter value MUST be identical for allinstances of the database. Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_lgwr_11631.trc: ORA-16188: LOG_ARCHIVE_CONFIG settings inconsistent with previously started instance LGWR (ospid: 11631): terminating the instance due to error 16188 Wed Jul 26 23:06:28 2023 System state dump requested by (instance=2, osid=11631 (LGWR)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_diag_11578_20230726230628.trc Dumping diagnostic data in directory=[cdmp_20230726230628], requested by (instance=2, osid=11631 (LGWR)), summary=[abnormal instance termination]. Instance terminated by LGWR, pid = 11631
Note:
這里實際有3個問題我用三種顏色表示, 綠色是tempfile發(fā)現(xiàn)了壞塊, 橙色是RAC 節(jié)點間出現(xiàn)IPC通信錯誤, 紅色是實例異常crash,CRS自動拉起后啟動失敗,提示ORA-16188: LOG_ARCHIVE_CONFIG 節(jié)點間參數(shù)不一致。 前兩個問題不在本篇的描述范圍,關(guān)鍵是ora-16188, 想起了前不久《Troubleshooting Oracle Exadata X5 db instance mount fail with ORA-01105 & ORA-01154》那個故障,也是兩個實例間不兼容無法啟動,當(dāng)時還排查過這個參數(shù)。接下來分析看一下當(dāng)前參數(shù)
LOG_ARCHIVE_CONFIG參數(shù)
SQL> show parameter pfile NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ spfile string +DATAVG/anbob/spfileanbob.ora SQL> show parameter config NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ dg_broker_config_file1 string /u01/app/oracle/product/11.2.0/db_1/dbs/dr1anbob.dat dg_broker_config_file2 string /u01/app/oracle/product/11.2.0/db_1/dbs/dr2anbob.dat log_archive_config string
Note:
對比了RAC所有實例使用的是同一 spfile, 另一存活實例參數(shù)與spfile參數(shù)相同,都是null 或”, 通過create pfile能發(fā)現(xiàn)有*.log_archive_config=” 的配置。
ORA-16188
LOG_ARCHIVE_CONFIG settings are inconsistent with the previously started instance. Cause The settings for the LOG_ARCHIVE_CONFIG parameter were inconsistent with the settings of a previously started instance. The settings for this parameter must be exactly the same for all instances. Action Ensure that all instances use the exact same LOG_ARCHIVE_CONFIG settings.
Note:
這里給ORACLE做個宣傳,oracle已悄悄上線了error-help功能, 該功能確實早就應(yīng)該有,在線可以查詢錯誤編碼,增加到google瀏覽器的自定義搜索引擎,可以快速的搜索錯誤代碼,有些內(nèi)容比oerr程序 提示更加豐富。頁面如Ora-16188
也許你從網(wǎng)上能搜到或Standby Instance Crash or Failed to Startup with ORA-16188 (Doc ID 1580482.1) ,對于standby啟動相關(guān)的錯誤的解決方法 Cause: Setting log_archive_config to null in RAC standby database and Startup of node 2 fails with ORA-16188. Alter system set log_archive_config=” scope=both sid=’*’ Solution: This is due to v$dataguard_config not updated between instances. To fix this do the below in any one of the nodes. alter system set log_archive_config=NODG_CONFIG scope=both sid=’*’; And retry the standby setup.
上面這種修改為”的方式,可能會導(dǎo)致實例v$dataguard_config 可能沒有及時更新,提示參數(shù)不一致,似乎是個bug, 從oracle 12c已經(jīng)修復(fù)了該問題, 這之前可能更新值為NODG_CONFIG, 我們在幸存的節(jié)點修改參數(shù).
SQL> alter system set log_archive_config=NODG_CONFIG scope=both sid='*'; alter system set log_archive_config=NODG_CONFIG scope=both sid='*' * ERROR at line 1: ORA-02097: parameter cannot be modified because specified value is invalid ORA-16135: Invalid LOG_ARCHIVE_CONFIG modification while in protected mode
Note:
注意這里提示修改的值是無效值,無語,既然開始提示2實例參數(shù)不一致,那重啟幸存實例肯定就可以解決了吧? 這里因為時間緊急,客戶反饋可以重啟幸存節(jié)點,但如果重啟就盡快因為20分鐘后有批處理作業(yè)。于是決定重啟, 至于這個nodg_config值是無效值,這里其實忽略了一個問題.
重啟幸存實例
-- 備份了一個spfile SQL> alter system reset log_archive_config scope=spfile; System altered. SQL> shut immediate SQL> startup ORACLE instance started. Total System Global Area 2.2448E+11 bytes Fixed Size 2266664 bytes Variable Size 3.4897E+10 bytes Database Buffers 1.8952E+11 bytes Redo Buffers 64827392 bytes Database mounted. ORA-03113: end-of-file on communication channel Process ID: 26774 Session ID: 2065 Serial number: 3
Note:
重啟以后壞了,一個也無法啟動了,查看db alert log
DB ALERT LOG
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {0:7:13143} */
ALTER DATABASE OPEN /* db agent *//* {0:7:13143} */
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
LGWR: STARTING ARCH PROCESSES
Wed Jul 26 23:39:09 2023
ARC0 started with pid=44, OS id=27434
Wed Jul 26 23:39:10 2023
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
LGWR: Minimum of 1 LGWR standby database required
Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_lgwr_27288.trc:
ORA-16072: a minimum of one standby database destination is required
Wed Jul 26 23:39:10 2023
ARC1 started with pid=45, OS id=27436
LGWR (ospid: 27288): terminating the instance due to error 16072
Wed Jul 26 23:39:10 2023
System state dump requested by (instance=1, osid=27288 (LGWR)), summary=[abnormal instance te
System State dumped to trace file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_diag_2
Dumping diagnostic data in directory=[cdmp_20230726233910], requested by (instance=1, osid=27
Instance terminated by LGWR, pid = 27288
Note:
注意這里的提示已經(jīng)很明確,主庫目前是最大可用模式,至少需要1個standby, 而我們這個庫已經(jīng)清理了所有的standby ,只是之前一直未重啟過, 所以這個雷留到了現(xiàn)在。 知道原因解決就容易了。不過我們再回看上面提示nodg_config報錯時提到的:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-16135: Invalid LOG_ARCHIVE_CONFIG modification while in protected mode
ORA-16135
Cause: The LOG_ARCHIVE_CONFIG parameter can not be modified while any RAC instance is open in either maximum protection or maximum availability mode. Also, the parameter can not be modified in such way that would cause all destinations to fail while in maximum protection mode.
Action: Make the modification before the database is opened by any instance.
這個錯誤時忽略了檢查一下原來database 的保護模式. 以后禁用dg或清理參數(shù)前記的檢查, 既然一個dg standby都沒有,也只能重啟所有節(jié)點,切到最大性能模式。
select database_name,database_role,force_logging,PROTECTION_MODE from v$database;
修改保護模式
SQL> startup mount ORACLE instance started. Total System Global Area 2.2448E+11 bytes Fixed Size 2266664 bytes Variable Size 3.4897E+10 bytes Database Buffers 1.8952E+11 bytes Redo Buffers 64827392 bytes Database mounted. SQL> alter database set standby to maximize performance; Database altered. SQL> alter database open; Database altered.
Summary:
當(dāng)清理或刪除Standby時,在主庫需要檢查當(dāng)前的保護模式,如最大可用和最大保護模式至少要有1 standby, 在12c之前RAC 修改LOG_ARCHIVE_CONFIG參數(shù)記的使用NODG_CONFIG而不是”,否則可能會出現(xiàn)后期實例重啟無法啟動的風(fēng)險。




