MySQL從庫異常宕機-I/O和SQL線程起不來【bug】

原創馮剛 2022-10-18

1766

前言

上一篇文章介紹了因為bug導致主從數據不一致的案例

MySQL主從數據不一致-BUG：http://www.sunline.cc/db/518430

此案例介紹碰到的另外一個bug：因為所在主機宕機，MySQL從庫I/O和SQL線程啟動報錯問題。

1 環境信息

自建MySQL：5.7.22

OS 版本：CentOS Linux release 7.6

2 錯誤描述

2022-08-17 06:45:04 從庫所在主機告警，提示宕機，并伴隨MySQL異常告警；

2022-08-17 06:45:53 宕機告警恢復；

2022-08-17 06:46:52 提示MySQL I/O和SQL線程停止；

3 排查

3.1 登錄實例，查看從庫同步狀態

此時I/O和SQL線程都是"NO"。

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: XXX.XXX.XXX.XXX
                  Master_User: XXXXXX
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.003334
          Read_Master_Log_Pos: 102990348
               Relay_Log_File: relay.011586
                Relay_Log_Pos: 111435792
        Relay_Master_Log_File: mysql-bin.003334
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1872
                   Last_Error: Slave failed to initialize relay log info structure from the repository
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 111435579
              Relay_Log_Space: 0
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1872
               Last_SQL_Error: Slave failed to initialize relay log info structure from the repository
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
                  Master_UUID: acd7aa9e-cc1c-11eb-8e2e-0c42a1474c26
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 220817 09:17:54
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 3f6aeb45-13b5-11e9-be3e-7cd30ae4123c:1-154276204,
5157c677-13b5-11e9-9a24-7cd30ae43356:1-2847256991,
f537f1cc-a575-11ea-9d39-52540014ff9c:1-6
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

3.2 查看錯誤日志

## 下面日志是主機啟動后，MySQL實例的啟動日志：2022-08-17T06:45:17.529616+08:00 0 [Warning] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged wi
th strict mode in a future release.
2022-08-17T06:45:17.532546+08:00 0 [Note] /usr/local/mysql57/sbin/mysqld (mysqld 5.7.22-log) starting as process 2277 ...
2022-08-17T06:45:17.550221+08:00 0 [Warning] option 'innodb-sort-buffer-size': unsigned value 134217728 adjusted to 67108864
2022-08-17T06:45:17.552817+08:00 0 [Note] InnoDB: PUNCH HOLE support available
2022-08-17T06:45:17.552844+08:00 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-08-17T06:45:17.552850+08:00 0 [Note] InnoDB: Uses event mutexes
2022-08-17T06:45:17.552852+08:00 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2022-08-17T06:45:17.552855+08:00 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2022-08-17T06:45:17.552877+08:00 0 [Note] InnoDB: Using Linux native AIO
2022-08-17T06:45:17.555735+08:00 0 [Note] InnoDB: Number of pools: 1
2022-08-17T06:45:17.557545+08:00 0 [Note] InnoDB: Using CPU crc32 instructions
2022-08-17T06:45:17.561520+08:00 0 [Note] InnoDB: Initializing buffer pool, total size = 8G, instances = 16, chunk size = 128M
2022-08-17T06:45:17.561541+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_INTERLEAVE
2022-08-17T06:45:18.001568+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_DEFAULT
2022-08-17T06:45:18.001601+08:00 0 [Note] InnoDB: Completed initialization of buffer pool
2022-08-17T06:45:18.054769+08:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2022-08-17T06:45:18.070395+08:00 0 [Note] InnoDB: Highest supported file format is Barracuda.
2022-08-17T06:45:18.134991+08:00 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 13018018506772
2022-08-17T06:45:18.197995+08:00 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 13018019491644
2022-08-17T06:45:18.198032+08:00 0 [Note] InnoDB: Database was not shutdown normally!
2022-08-17T06:45:18.198036+08:00 0 [Note] InnoDB: Starting crash recovery.
2022-08-17T06:45:18.307661+08:00 0 [Note] InnoDB: Transaction 10209047171 was in the XA prepared state.
2022-08-17T06:45:18.310909+08:00 0 [Note] InnoDB: Transaction 10209047171 was in the XA prepared state.
2022-08-17T06:45:18.311385+08:00 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
2022-08-17T06:45:18.311399+08:00 0 [Note] InnoDB: Trx id counter is 10209047552
2022-08-17T06:45:18.311601+08:00 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
2022-08-17T06:45:19.168448+08:00 0 [Note] InnoDB: Apply batch completed
2022-08-17T06:45:19.168500+08:00 0 [Note] InnoDB: Last MySQL binlog file position 0 358270216, file name mysql_bin.003520
2022-08-17T06:45:19.332107+08:00 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
2022-08-17T06:45:19.332133+08:00 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2022-08-17T06:45:19.332171+08:00 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2022-08-17T06:45:19.332190+08:00 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2022-08-17T06:45:19.332240+08:00 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2022-08-17T06:45:19.344599+08:00 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2022-08-17T06:45:19.345375+08:00 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2022-08-17T06:45:19.345396+08:00 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2022-08-17T06:45:19.345772+08:00 0 [Note] InnoDB: 5.7.22 started; log sequence number 13018019491644
2022-08-17T06:45:19.346378+08:00 0 [Note] InnoDB: Loading buffer pool(s) from /data/mysql/mysql57_3307/data/ib_buffer_pool
2022-08-17T06:45:19.347415+08:00 0 [Note] Plugin 'FEDERATED' is disabled.
220817  6:45:19 server_audit: INCL_DML_USERS and EXCL_DML_USERS specified simultaneously - both set to empty
220817  6:45:19 server_audit: server_audit_incl_users set to ''.
220817  6:45:19 server_audit: MariaDB Audit Plugin version 1.4.4 STARTED.
220817  6:45:19 server_audit: Could not create file '/xxxxx/auditlog/audit.log'.
2022-08-17T06:45:19.354790+08:00 0 [Note] Recovering after a crash using /xxxxx/binlog/mysql_bin
2022-08-17T06:45:21.272713+08:00 0 [Note] InnoDB: Buffer pool(s) load completed at 220817  6:45:21
2022-08-17T06:45:24.365162+08:00 0 [Note] Starting crash recovery...
2022-08-17T06:45:24.365218+08:00 0 [Note] InnoDB: Starting recovery for XA transactions...
2022-08-17T06:45:24.365224+08:00 0 [Note] InnoDB: Transaction 10209047171 in prepared state after recovery
2022-08-17T06:45:24.365227+08:00 0 [Note] InnoDB: Transaction contains changes to 2 rows
2022-08-17T06:45:24.365230+08:00 0 [Note] InnoDB: 1 transactions in prepared state after recovery
2022-08-17T06:45:24.365232+08:00 0 [Note] Found 1 prepared transaction(s) in InnoDB
2022-08-17T06:45:24.365654+08:00 0 [Note] Crash recovery finished.
2022-08-17T06:45:25.533770+08:00 0 [Warning] Failed to set up SSL because of the following SSL library error: SSL context is not usable without certificate and private key
2022-08-17T06:45:25.533805+08:00 0 [Note] Server hostname (bind-address): '*'; port: 3307
2022-08-17T06:45:25.533832+08:00 0 [Note] IPv6 is available.
2022-08-17T06:45:25.533837+08:00 0 [Note]   - '::' resolves to '::';
2022-08-17T06:45:25.533851+08:00 0 [Note] Server socket created on IP: '::'.
2022-08-17T06:45:25.541233+08:00 0 [Warning] 'user' entry 'dba-docp@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541250+08:00 0 [Warning] 'user' entry 'mysql.session@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541256+08:00 0 [Warning] 'user' entry 'zabbix_monitor@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541291+08:00 0 [Warning] 'db' entry 'performance_schema mysql.session@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541295+08:00 0 [Warning] 'db' entry 'sys mysql.sys@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541304+08:00 0 [Warning] 'proxies_priv' entry '@ root@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.552222+08:00 0 [Warning] 'tables_priv' entry 'user mysql.session@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.552241+08:00 0 [Warning] 'tables_priv' entry 'sys_config mysql.sys@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.559579+08:00 0 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name /xxxxx/relaylog/relay.011626, group_relay_log_pos 4217146 group_master_log_name mysql-bin.005810, group_master_log_pos 4216933## 實例啟動后，slave進程的報錯
2022-08-17T06:45:25.559872+08:00 0 [ERROR] Error looking for file after /xxxxx/relaylog/relay.011627.
2022-08-17T06:45:25.560208+08:00 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2022-08-17T06:45:25.560215+08:00 0 [ERROR] Failed to create or recover replication info repositories.
2022-08-17T06:45:25.560221+08:00 0 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T06:45:25.560224+08:00 0 [ERROR] /usr/local/mysql57/sbin/mysqld: Slave failed to initialize relay log info structure from the repository
2022-08-17T06:45:25.560225+08:00 0 [ERROR] Failed to start slave threads for channel ''
2022-08-17T06:45:25.560227+08:00 0 [Note] Some of the channels are not created/initialized properly. Check for additional messages above. You will not be able to start replication on those channels until the issue is resolved and the server restarted.
2022-08-17T06:45:25.581309+08:00 0 [Note] Event Scheduler: Loaded 0 events
2022-08-17T06:45:25.581497+08:00 0 [Note] /usr/local/mysql57/sbin/mysqld: ready for connections.
Version: '5.7.22-log'  socket: '/xxxxx/socketdir/mysql.sock'  port: 3307  MySQL Community Server (GPL)

3.3 嘗試手動啟動slave

報錯：ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository

mysql> start slave;
ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository

2022-08-17T09:02:30.689881+08:00 551 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T09:02:33.382601+08:00 551 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T09:03:31.630680+08:00 551 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T09:05:21.955292+08:00 564 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

4 原因

根據報錯，推測是MySQL BUG導致。

4.1 bug名稱

Bug #83713 : Slave failed to initialize relay log info after OS crash when use MTS and GTID

MySQL BUG鏈接：https://bugs.mysql.com/bug.php?id=83713

4.2 bug描述

Description:
When i tried to configurate a crash safe slave with MTS and GTID based replication, but after a OS crash
replication failed to be start.

4.3 參數配置

[root@localhost] {18:09:49} ((none)) [5]> show VARIABLES where variable_name in('log_slave_updates', 'gtid_mode', 'enforce_gtid_consistency', 'master_info_repository', 'relay_log_info_repository', 'relay_log_recovery', 'sync_binlog', 'innodb_flush_log_at_trx_commit', 'slave_parallel_workers','sync_master_info', 'sync_relay_log', 'sync_relay_log_info'); 
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| enforce_gtid_consistency       | ON    |
| gtid_mode                      | ON    |
| innodb_flush_log_at_trx_commit | 1     |
| log_slave_updates              | ON    |
| master_info_repository         | TABLE |
| relay_log_info_repository      | TABLE |
| relay_log_recovery             | ON    |
| slave_parallel_workers         | 16    |
| sync_binlog                    | 1     |
| sync_master_info               | 10000 |
| sync_relay_log                 | 10000 |
| sync_relay_log_info            | 10000 |
+--------------------------------+-------+
12 rows in set (0.01 sec)

4.4 復現

STEP1:
run a script to execute update sql on the master continuously

STEP2:
use "kill -9" kill the kvm process of the slave in the physical machine

STEP3:  
start mysql of the slave

按照上面方法，沒有復現。下面是具體步驟

步驟1 主庫會話持續插入數據

CREATE TABLE `t2` (
  `id` bigint(11) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL,
  `sex` char(1) NOT NULL,
  `score` char(2) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

drop procedure p_1;

delimiter //
create procedure p_1(in num int)
BEGIN
	DECLARE i INT DEFAULT 0;
	DECLARE sex CHAR(1);
	DECLARE score CHAR(2);
	SET score= FLOOR(40 + (RAND() *60));
	SET sex=FLOOR(0 + (RAND() * 2));
	WHILE i < 1000000
	DO
		INSERT INTO t2(name,sex,score) VALUES (concat('f',i),sex,score);
		SET i = i + 1;
	END WHILE;
END
//
delimiter ;
call p_1(1000000);

步驟2 從庫主機層面kill實例進程

[ root@mysql10-dba:~ ]# kill -9 53372

步驟3 主庫啟動MySQL實例，啟動slave

[ root@mysql10-dba:~ ]# systemctl start mysql57@7777

4.5 原因/建議

Suggested fix:
The error ocuurs when mts_recovery_groups() reading the corrupted relay log files in MTS group recovery.
But with "gtid_mode=ON" and "relay_log_recovery = 1", the relay log files will be discard later, and so, at the beginning should skip the relay log read.
當mts_recovery_groups()讀取MTS組恢復中損壞的中繼日志文件時發生錯誤。
但是對于"gtid_mode=ON"和"relay_log_recovery = 1"，中繼日志文件稍后將被丟棄，因此，在開始時應該跳過中繼日志讀取。
workaround:
The following steps can start slave successfully

    reset slave;
    start slave IO_THREAD;
    stop slave IO_THREAD;
    reset slave;
    start slave;

5 解決方案

[root@localhost:] {09:18:17} ((none)) [1]> reset slave;
Query OK, 0 rows affected (0.01 sec)
[root@localhost:] {09:18:36} ((none)) [3]> start slave;
Query OK, 0 rows affected (0.04 sec)

6 補充

reset slave干的那些事：

刪除slave_master_info ，slave_relay_log_info兩個表中數據；
刪除所有relay log文件，并重新創建新的relay log文件；
不會改變gtid_executed 或者 gtid_purged的值；

官方文檔鏈接描述：

RESET SLAVE makes the replica forget its replication position in the source's binary log. This statement is meant to be used for a clean start: It clears the replication metadata repositories, deletes
all the relay log files, and starts a new relay log file. It also resets to 0 the replication delay specified with the MASTER_DELAY option to CHANGE MASTER TO.

All relay log files are deleted, even if they have not been completely executed by the replication SQL thread. (This is a condition likely to exist on a replica if
you have issued a STOP SLAVE statement or if the replica is highly loaded.)
For a server where GTIDs are in use (gtid_mode is ON), issuing RESET SLAVE has no effect on the GTID execution history. The statement does not change the values of gtid_executed or
gtid_purged, or the mysql.gtid_executed table. If you need to reset the GTID execution history, use RESET MASTER, even if the GTID-enabled server is a replica where binary logging is disabled.
RESET SLAVE requires the RELOAD privilege.

「喜歡這篇文章，您的關注和贊賞是給作者最好的鼓勵」

關注作者

特色一级强游戏,海奥华预言免费阅读,51漫画兑换码,美女裸体无遮挡永久免费观看网站,lubuntu线路检测入口