DG日志應(yīng)用延時(shí)嚴(yán)重故障處理及優(yōu)化（數(shù)據(jù)延遲6天將近3T歸檔日志文件）

原創(chuàng) 企系山頂上噶雪梨 2022-07-11

5258

DG日志應(yīng)用延時(shí)嚴(yán)重故障處理及優(yōu)化（數(shù)據(jù)延遲6天將近3T歸檔日志文件）

【說(shuō)明】：操作部分若沒(méi)有特別說(shuō)明，都是在目標(biāo)端

本文包含故障的處理過(guò)程，DG目標(biāo)端應(yīng)用速度優(yōu)化，DG監(jiān)控思路

摘要：11G DG（Data Gurad）單實(shí)例目標(biāo)端應(yīng)用延遲6天，將近3T的歸檔日志為應(yīng)用，故障處理及數(shù)據(jù)積壓后怎么快速追平數(shù)據(jù)

故障原因分析：

1）第一時(shí)間查看了數(shù)據(jù)庫(kù)的打開(kāi)狀態(tài)為“ READ ONLY ”，而非正常的狀態(tài) “ READ ONLY WITH APPLY ”

SQL> select open_mode from v$database;  

OPEN_MODE

READ ONLY

2）日志怎么會(huì)掛了呢？查看告警日志，其中一段顯示，磁盤(pán)空間不足，造成后臺(tái)進(jìn)程MRP0 shutdown掉


09812 ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1: '/mnt/vdd1/archivelog/ARC0000202222_1013199719.0001.arc' (error 19502) (orcl)
309813 ARCH: Archival stopped, error occurred. Will continue retrying
309814 ORACLE Instance orcl - Archival Error
309815 ORA-16038: log 4 sequence# 202222 cannot be archived
309816 ORA-19502: write error on file "", block number (block size=)
309817 ORA-00312: online log 4 thread 1: '/mnt/oracle/oradata/orcl/redo04.log'
309818 Sat Jul 02 14:14:06 2022
309819 Media Recovery Log /mnt/vdd1/archivelog/ARC0000201136_1013199719.0001.arc
309820 Sat Jul 02 14:14:08 2022
309821 ARCH: Archival stopped, error occurred. Will continue retrying
309822 ORACLE Instance orcl - Archival Error
309823 ORA-16014: log 4 sequence# 202222 not archived, no available destinations
309824 ORA-00312: online log 4 thread 1: '/mnt/oracle/oradata/orcl/redo04.log'
309825 Errors with log /mnt/vdd1/archivelog/ARC0000201136_1013199719.0001.arc
309826 MRP0: Background Media Recovery terminated with error 1237
309827 Errors in file /mnt/oracle/diag/rdbms/orcl_st/orcl/trace/orcl_pr00_29562.trc:
309828 ORA-01237: cannot extend datafile 91
309829 ORA-01110: data file 91: '/mnt/vdd1/oradata/TS_PART4_01'
309830 ORA-19502: write error on file "/mnt/vdd1/oradata/TS_PART4_01", block number 4130560 (block size=8192)
309831 ORA-27072: File I/O error
309832 Linux-x86_64 Error: 28: No space left on device
309833 Additional information: 4
309834 Additional information: 4130560
309835 Additional information: -1
309836 Managed Standby Recovery not using Real Time Apply
309837 Sat Jul 02 14:15:29 2022
309838 Recovery interrupted!
309839 Recovered data files to a consistent state at change 17592401117937
309840 Sat Jul 02 14:15:31 2022
309841 MRP0: Background Media Recovery process shutdown (orcl)

3）14點(diǎn)多發(fā)生磁盤(pán)空間不足，刪除目標(biāo)端日志的計(jì)劃任務(wù)是在18點(diǎn)，那18點(diǎn)后，DG檢測(cè)又空間了，繼續(xù)將歸檔日志重傳到目標(biāo)端，但是由于日志應(yīng)用的進(jìn)程MRP0掛掉，導(dǎo)致數(shù)據(jù)同步失敗

4）大坑：未及時(shí)發(fā)現(xiàn)，需設(shè)置監(jiān)控告警DG情況

故障解決：

1）考慮歸檔日志已經(jīng)被計(jì)劃任務(wù)刪除，需要先進(jìn)行歸檔日志恢復(fù)，目標(biāo)端最新應(yīng)用到201136，而主庫(kù)還存在歸檔是207163，限于磁盤(pán)空間，分多次進(jìn)行恢復(fù)，先進(jìn)行恢復(fù)202315到203000直接的歸檔日志文件，并后臺(tái)執(zhí)行

建立shell腳本：
cat arch_restore.sh

#!/bin/bash
. $HOME/.bash_profile
rman target / nocatalog msglog /mnt/script/logs/rman-arch.log <<EOF
run
{
set archivelog destination to '/mnt/vdd1/archivelog';
allocate channel ch1 type disk;
allocate channel ch2 type disk;
allocate channel ch3 type disk;
allocate channel ch4 type disk;
allocate channel ch5 type disk;
allocate channel ch6 type disk;
allocate channel ch7 type disk;
allocate channel ch8 type disk;
restore archivelog from logseq 202315 until logseq 203000;
release channel ch1;
release channel ch2;
release channel ch3;
release channel ch4;
release channel ch5;
release channel ch6;
release channel ch7;
release channel ch8;
}
exit
EOF

后臺(tái)執(zhí)行腳本 :

nohup sh arch_restore.sh >alog07070813.log &

2）重啟MRP0，恢復(fù)日志應(yīng)用 (### 然后忙去了，幾個(gè)小時(shí)候去檢查，發(fā)現(xiàn)才應(yīng)用了三百個(gè)多文件… ###)

alter database recover managed standby database cancel;
alter database recover managed standby database using current logfile disconnect;

3）再查日志，發(fā)現(xiàn)需要一個(gè)，到源端去拿一個(gè)，最后還直接掛了 (### 哭暈在廁所o(╥﹏╥)o… ###)

日志顯示：

Wed Jul 06 18:59:07 2022
Archived Log entry 206097 added for thread 1 sequence 206015 ID 0xe08f092f dest 1:
Wed Jul 06 18:59:41 2022
Media Recovery Log /mnt/vdd1/archivelog/ARC0000201380_1013199719.0001.arc
Error opening /mnt/vdd1/archivelog/ARC0000201380_1013199719.0001.arc
Attempting refetch
Media Recovery Waiting for thread 1 sequence 201380
Fetching gap sequence in thread 1, gap sequence 201380-201380
Wed Jul 06 18:59:42 2022
RFS\[7513\]: Allowing overwrite of partial archivelog for thread 1 sequence 201380
RFS\[7513\]: Opened log for thread 1 sequence 201380 dbid 1116521861 branch 1013199719
Archived Log entry 206098 added for thread 1 sequence 201380 rlc 1013199719 ID 0x0 dest 2:
Media Recovery Log /mnt/vdd1/archivelog/ARC0000201380_1013199719.0001.arc
Wed Jul 06 19:00:09 2022
Archived Log entry 206099 added for thread 1 sequence 206016 ID 0xe08f092f dest 1:
Wed Jul 06 19:01:35 2022
Media Recovery Log /mnt/vdd1/archivelog/ARC0000201381_1013199719.0001.arc
Error opening /mnt/vdd1/archivelog/ARC0000201381_1013199719.0001.arc
Attempting refetch
Media Recovery Waiting for thread 1 sequence 201381
Fetching gap sequence in thread 1, gap sequence 201381-201381
Media Recovery Waiting for thread 1 sequence 201381
Fetching gap sequence in thread 1, gap sequence 201381-201381
Wed Jul 06 19:05:07 2022
FAL\[client\]: Failed to request gap sequence
GAP - thread 1 sequence 201381-201381
DBID 1116521861 branch 1013199719
FAL\[client\]: All defined FAL servers have been attempted.
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that's sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.

4）將日志批量復(fù)制到備庫(kù)，并解決GAP

– 【目標(biāo)端】GAP查詢(xún)語(yǔ)句

select count(*) from v$archive_gap;

–【目標(biāo)端】有時(shí)候該視圖查詢(xún)超級(jí)慢，也可以考慮用下面語(yǔ)句代替：

select USERENV('Instance'), high.thread#, low.lsq, high.hsq
from
(select a.thread#, rcvsq, min(a.sequence#)-1 hsq
from v$archived_log a,
(select lh.thread#, lh.resetlogs_change#, max(lh.sequence#) rcvsq
from v$log_history lh, v$database_incarnation di
where lh.resetlogs_time = di.resetlogs_time
and lh.resetlogs_change# = di.resetlogs_change#
and di.status = 'CURRENT'
and lh.thread# is not null
and lh.resetlogs_change# is not null
and lh.resetlogs_time is not null
group by lh.thread#, lh.resetlogs_change#
) b
where a.thread# = b.thread#
and a.resetlogs_change# = b.resetlogs_change#
and a.sequence# > rcvsq
group by a.thread#, rcvsq) high,
(select srl_lsq.thread#, nvl(lh_lsq.lsq, srl_lsq.lsq) lsq
from
(select thread#, min(sequence#)+1 lsq
from
v$log_history lh, x$kccfe fe, v$database_incarnation di
where to_number(fe.fecps) <= lh.next_change#
and to_number(fe.fecps) >= lh.first_change#
and fe.fedup!=0 and bitand(fe.festa, 12) = 12
and di.resetlogs_time = lh.resetlogs_time
and lh.resetlogs_change# = di.resetlogs_change#
and di.status = 'CURRENT'
group by thread#) lh_lsq,
(select thread#, max(sequence#)+1 lsq
from
v$log_history
where (select min( to_number(fe.fecps))
from x$kccfe fe
where fe.fedup!=0 and bitand(fe.festa, 12) = 12) >= next_change#
group by thread#) srl_lsq
where srl_lsq.thread# = lh_lsq.thread#(+)
) low
where low.thread# = high.thread#
and lsq < = hsq
and hsq > rcvsq;

–【源端】復(fù)制歸檔日志目標(biāo)端

scp -P port username@IP:目標(biāo)端目錄

–【目標(biāo)端】手動(dòng)注冊(cè)歸檔日志(注冊(cè)斷掉的第一個(gè)就好，后續(xù)再注冊(cè)會(huì)提示已注冊(cè))

alter database register logfile '/${歸檔目錄絕對(duì)路徑}/ARC0000201381_1013199719.0001.arc';

(### 你以為這就完了么？？？ ###)

在線(xiàn)重做日志我設(shè)置是500M，大部分歸檔日志文件在400~500M之間，觀(guān)察日志應(yīng)用發(fā)現(xiàn)，有時(shí)候是1分鐘一個(gè)，有時(shí)候3到5分鐘應(yīng)用一個(gè)，有時(shí)候又1分鐘倆三個(gè)，任由下去的話(huà)，要被業(yè)務(wù)部門(mén)KO了。怎么加速呢？？

DG目標(biāo)端應(yīng)用優(yōu)化加速

1）開(kāi)啟并行（視負(fù)載情況及CPU核數(shù)，最大設(shè)置cpu*2）

– 取消日志應(yīng)用

alter database recover managed standby database cancel;

– 默認(rèn)8通道，更改為12通道

alter database recover managed standby database parallel 16 using current logfile disconnect;

2）主庫(kù)到備庫(kù)的日志傳輸，考慮積壓量大，可以傳輸一段時(shí)間關(guān)閉一段時(shí)間的方式（源端主庫(kù)操作）

– 停止主庫(kù)歸檔日志傳輸?shù)絺鋷?kù)

ALTER system SET log_archive_dest_state_2 = 'defer';

– 恢復(fù)歸檔日志傳輸

ALTER system SET log_archive_dest_state_2 = enable;

3）優(yōu)化參數(shù)并重啟數(shù)據(jù)庫(kù)（目標(biāo)端執(zhí)行）

– 查看參數(shù)設(shè)置

show parameter parallel_execution_message_size --值為16384
show parameter filesystemio_options --默認(rèn)none

– linux操作系統(tǒng)層面可查看

grep kio /proc/slabinfo

– 更改設(shè)置為異步IO（單機(jī)實(shí)例，11.2.0.4）

alter system set filesystemio_options=setall scope=spfile;

–指定并行執(zhí)行中meassage的大小，11G取值2148~32768

alter system set parallel_execution_message_size=32768 scope=spfile;

通過(guò)上述1 2 點(diǎn)優(yōu)化設(shè)置后，速度有點(diǎn)提升，但是速度不穩(wěn)定（感謝lgs提供的思路）；

停掉日志傳輸，并行度從4、6、8、12、16、 20都嘗試過(guò)，但資源監(jiān)控：CPU、IO都閑的發(fā)慌，占比20%以下，這讓我再想加資源都沒(méi)有理由。

最后發(fā)現(xiàn)第3點(diǎn)的參數(shù)調(diào)優(yōu)，可以一試，趁著中午業(yè)務(wù)低峰期，調(diào)整參數(shù)后，導(dǎo)出pfile，比對(duì)之前的初始化參數(shù)文件，確定沒(méi)問(wèn)題后，重啟目標(biāo)端端數(shù)據(jù)庫(kù)【慎之又慎】。

通過(guò)以上三點(diǎn)，優(yōu)化完成后效果顯著，結(jié)果是可喜的：
速度提升10-20倍，由原來(lái)的2分鐘1-3個(gè)，提升到2分鐘30-50個(gè)文件

開(kāi)啟DG監(jiān)控，通過(guò)shell腳本+任務(wù)計(jì)劃實(shí)現(xiàn)：

1）設(shè)定日志監(jiān)控：通過(guò)監(jiān)控alert日志的關(guān)鍵字（在使用中逐步優(yōu)化）進(jìn)行告警

2）通過(guò)視圖v$dataguard_stats進(jìn)行監(jiān)控（單位分鐘）：

select to_number(substr(s.value,2,2))2460+to_number(substr(s.value,5,2))*60+to_number(substr(s.value,8,2)) as nums
from v$dataguard_stats s where name='apply lag';

其他相關(guān)查詢(xún)：

– 查詢(xún)歸檔還原情況，若是正常在同步中，YES后是IN-MEMORY狀態(tài)，表示正在應(yīng)用的日志文件,或通過(guò)文件號(hào)定位最新的應(yīng)用到的歸檔，YES為已應(yīng)用，NO為未應(yīng)用

select g.recid,g.sequence#,g.applied,status,archived,g.first_time from v$archived_log g
where 1=1-- g.sequence#> 201134
and applied='IN-MEMORY'
order by g.sequence#

– 查看備庫(kù)DG狀態(tài)

SELECT PROCESS,CLIENT_PROCESS,PROCESS,STATUS,THREAD#,SEQUENCE#,BLOCK#,BLOCKS FROM V$MANAGED_STANDBY;

– 查詢(xún)?cè)诰€(xiàn)日志情況

select thread#, group#, sequence#, bytes, archived, s.status , blocksize, member
FROM v$standby_log s
join v$logfile using (GROUP#)
order by thread#, group#;

– 查詢(xún)進(jìn)程狀態(tài) dg庫(kù)日志應(yīng)用性能監(jiān)控

select to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Current time"
,s.process
, p.spid
, substr(s.program, -6) PROC
, s.event
, s.p1
, s.p2
, s.p3
, s.seconds_in_wait SIW
, s.seq#
from v$session s, v$process p
where p.addr = s.paddr and (s.program like '%MRP%' or s.program like '%PR0%' or s.program like '%DBW%' or s.program like '%CKPT%')
order by s.process;

最后修改時(shí)間：2022-09-08 18:07:30

「喜歡這篇文章，您的關(guān)注和贊賞是給作者最好的鼓勵(lì)」

關(guān)注作者

【版權(quán)聲明】本文為墨天輪用戶(hù)原創(chuàng)內(nèi)容，轉(zhuǎn)載時(shí)必須標(biāo)注文章的來(lái)源（墨天輪），文章鏈接，文章作者等基本信息，否則作者和墨天輪有權(quán)追究責(zé)任。如果您發(fā)現(xiàn)墨天輪中有涉嫌抄襲或者侵權(quán)的內(nèi)容，歡迎發(fā)送郵件至：contact@modb.pro進(jìn)行舉報(bào)，并提供相關(guān)證據(jù)，一經(jīng)查實(shí)，墨天輪將立刻刪除相關(guān)內(nèi)容。

特色一级强游戏,海奥华预言免费阅读,51漫画兑换码,美女裸体无遮挡永久免费观看网站,lubuntu线路检测入口

DG日志應(yīng)用延時(shí)嚴(yán)重故障處理及優(yōu)化（數(shù)據(jù)延遲6天將近3T歸檔日志文件）

DG日志應(yīng)用延時(shí)嚴(yán)重故障處理及優(yōu)化（數(shù)據(jù)延遲6天將近3T歸檔日志文件）

故障原因分析：

故障解決：

DG目標(biāo)端應(yīng)用優(yōu)化加速

開(kāi)啟DG監(jiān)控，通過(guò)shell腳本+任務(wù)計(jì)劃實(shí)現(xiàn)：

其他相關(guān)查詢(xún)：

評(píng)論