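Background
Our department's test environment runs a MogDB cluster with one primary and one standby, version 2.1.1. The cluster was healthy before both machines lost power. After the power failure and reboot, the operating system came up normally, but starting the MogDB cluster failed.
Primary IP: 192.168.137.110
Standby IP: 192.168.137.111
Symptoms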
$ gs_om -t start
Starting cluster.
=========================================
=========================================
[GAUSS-53600]: Can not start the database, the cmd is . /home/omm/.bashrc; python3 '/dbdata/app/tools/script/local/StartInstance.py' -U omm -R /dbdata/app/mogdb -t 300 --security-mode=off, Error:
[FAILURE] master:
[GAUSS-51607] : Failed to start instance. Error: Please check the gs_ctl log for failure details.
[2022-09-08 17:47:43.905][1718][][gs_ctl]: gs_ctl started,datadir is /dbdata/data/db1
[2022-09-08 17:47:51.180][1718][][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: master
0 LOG: [Alarm Module]Host IP: 192.168.137.110
0 LOG: [Alarm Module]Cluster Name: dbCluster
..2022-09-08 17:47:53.433 6319ba47.1 [unknown] 140561613260352 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 2, max = 4, actual = 2
2022-09-08 17:47:53.433 6319ba47.1 [unknown] 140561613260352 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
Failed to read gaussdb.state: 0Failed to set gaussdb.state with UNKNOWN_STATE[2022-09-08 17:47:54.185][1718][][gs_ctl]: waitpid 1722 failed, exitstatus is 256, ret is 2
[2022-09-08 17:47:54.185][1718][][gs_ctl]: stopped waiting
[2022-09-08 17:47:54.185][1718][][gs_ctl]: could not start server
Examine the log output.[FAILURE] standby:
[GAUSS-51607] : Failed to start instance. Error: Please check the gs_ctl log for failure details.
[2022-09-08 17:48:00.805][1344][][gs_ctl]: gs_ctl started,datadir is /dbdata/data/db1
[2022-09-08 17:48:02.935][1344][][gs_ctl]: waiting for server to start...
.0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: standby
0 LOG: [Alarm Module]Host IP: 192.168.137.111
0 LOG: [Alarm Module]Cluster Name: dbCluster
2022-09-08 17:48:03.632 6319ba53.1 [unknown] 139726745114176 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 2, max = 4, actual = 2
2022-09-08 17:48:03.632 6319ba53.1 [unknown] 139726745114176 [unknown] 0 dn_6001_6002 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
Failed to read gaussdb.state: 0Failed to set gaussdb.state with UNKNOWN_STATE[2022-09-08 17:48:03.937][1344][][gs_ctl]: waitpid 1347 failed, exitstatus is 256, ret is 2
[2022-09-08 17:48:03.937][1344][][gs_ctl]: stopped waiting
[2022-09-08 17:48:03.937][1344][][gs_ctl]: could not start server
Examine the log output.
Analysis
1. Check the cluster status
#su - omm
$ gs_om -t status --detail
[ Cluster State ]
cluster_state : Unavailable
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
-----------------------------------------------------------------------------------
1 master 192.168.137.110 26000 6001 /dbdata/data/db1 P Down Manually stopped
2 standby 192.168.137.111 26000 6002 /dbdata/data/db1 S Down Manually stopped
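When the same check has to be repeated after each fix attempt, the one field that matters can be pulled out of the status report. The following is a minimal sketch; the function name cluster_state is hypothetical, and it assumes the report keeps the "cluster_state : value" layout shown above.

```shell
# Hypothetical helper: read `gs_om -t status --detail` output on stdin
# and print only the value of the cluster_state field.
cluster_state() {
    awk -F: '/^[[:space:]]*cluster_state[[:space:]]*:/ {
        # Trim surrounding whitespace from the value after the colon.
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2)
        print $2
        exit
    }'
}
```

It would be used as `gs_om -t status --detail | cluster_state`.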
2. Check the database version
$ gs_ctl -V
gs_ctl (openGauss) 9.2.4
$ mogdb -V
gaussdb (MogDB 2.1.1 build b5f25b20) compiled at 2022-03-21 14:42:30 commit 0 last mr
3. Check the logs
# Find the log directory
cat /dbdata/data/db1/postgresql.conf |grep -i log_dir
log_directory = '/dbdata/log/omm/pg_log/dn_6001' # directory where log files are written,
# List the log files
cd /dbdata/log/omm/pg_log/dn_6001
ls -l
-rw------- 1 omm dbgrp 100076 Sep 8 15:47 postgresql-2022-09-08_144356.log
-rw------- 1 omm dbgrp 0 Sep 8 17:04 postgresql-2022-09-08_170442.log
The newest log file is 0 bytes: nothing has been written to the server log since the restart.
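Because the newest file is empty, the last useful messages are in an older file. The sketch below picks the newest log that actually has content. The function name latest_nonempty_log is hypothetical, and it assumes the postgresql-YYYY-MM-DD_HHMMSS.log naming seen in the listing above, which sorts chronologically by name.

```shell
# Hypothetical helper: print the newest non-empty postgresql-*.log in a directory.
# The shell expands the glob in sorted order, so the last non-empty match is the newest.
latest_nonempty_log() {
    latest=""
    for f in "$1"/postgresql-*.log; do
        if [ -s "$f" ]; then
            latest=$f
        fi
    done
    if [ -n "$latest" ]; then
        printf '%s\n' "$latest"
        return 0
    fi
    return 1
}
```

For example, `tail -n 50 "$(latest_nonempty_log /dbdata/log/omm/pg_log/dn_6001)"` would show the end of the last log that received any output.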
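4. Check the official manual
Look up the error codes [GAUSS-53600] and [GAUSS-51607] in the official manual:
GAUSS-53600: "CA password must contain at least eight characters."
SQLSTATE: none
Cause: Internal system error.
Solution: Contact a technical support engineer.
GAUSS-51607: "Failed to start %s."
Cause: Failed to start the cluster/node/instance.
Solution: 1. Check whether the network connection is normal; 2. Check whether the configuration file is correct.
Neither entry explains the failure seen here.
5. Check the source code
The error output mentions a file named gaussdb.state. Searching the official manual for gaussdb.state returns no topic.
Searching the official source code for the message "Failed to read gaussdb.state" leads to the following code:
postmaster.cpp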
/*
* Only update gaussdb.state file's state field.
*
* PARAMETERS:
* state: INPUT new state
* RETURN:
* true if success, otherwise false.
*
* NOTE: unsafe function is not expected here since it is referred in signal handler.
*/
bool SetDBStateFileState(DbState state, bool optional)
{
    /* do nothing while core dump be appeared so early. */
    if (strlen(gaussdb_state_file) > 0) {
        char temppath[MAXPGPATH] = {0};
        GaussState s;
        int len = 0;
        /* zero it in case gaussdb.state doesn't exist. */
        int rc = memset_s(&s, sizeof(GaussState), 0, sizeof(GaussState));
        securec_check_c(rc, "\0", "\0");
        rc = snprintf_s(temppath, MAXPGPATH, MAXPGPATH - 1, "%s.temp", gaussdb_state_file);
        securec_check_intval(rc, , false);
        /* Write the new content into a temp file and rename it at last. */
        int fd = open(gaussdb_state_file, O_RDONLY);
        if (fd == -1) {
            if (errno == ENOENT && optional) {
                write_stderr("gaussdb.state does not exist, and skipt setting since it is optional.");
                return true;
            } else {
                write_stderr("Failed to open gaussdb.state.temp: %d", errno);
                return false;
            }
        }
        /* Read old content from file. */
        len = read(fd, &s, sizeof(GaussState));
        /* sizeof(int) is for current_connect_idx of GaussState */
        if ((len != sizeof(GaussState)) && (len != sizeof(GaussState) - sizeof(int))) {
            write_stderr("Failed to read gaussdb.state: %d", errno);
            (void)close(fd);
            return false;
        }
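The message comes from the function SetDBStateFileState in postmaster.cpp. When MogDB starts, it reads gaussdb.state in order to set the database running state. The number of bytes read must equal either sizeof(GaussState) or sizeof(GaussState) - sizeof(int). In this case neither comparison matched, so the function printed the error and returned false, and startup was aborted. The errno value printed in the log is 0, which suggests that read() itself did not fail but simply returned fewer bytes than expected.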
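The length check can be mirrored from the shell before attempting a start. This is only a sketch under stated assumptions: sizeof(GaussState) is taken to be 72, inferred from the size of the healthy file shown later in this post, sizeof(int) is taken to be 4, and the function name state_file_read_ok is hypothetical. Other builds may use different sizes.

```shell
# Assumptions (not confirmed by the source code excerpt):
#   sizeof(GaussState) == 72, inferred from the healthy file's size
#   sizeof(int) == 4
GAUSS_STATE_SIZE=72
INT_SIZE=4

# Hypothetical helper: succeed if the byte count that read() would return
# for this file passes the comparison in SetDBStateFileState.
state_file_read_ok() {
    size=$(( $(wc -c < "$1") ))
    # read() is asked for sizeof(GaussState) bytes, so it returns at most that many.
    len=$size
    if [ "$len" -gt "$GAUSS_STATE_SIZE" ]; then
        len=$GAUSS_STATE_SIZE
    fi
    if [ "$len" -eq "$GAUSS_STATE_SIZE" ] || [ "$len" -eq $((GAUSS_STATE_SIZE - INT_SIZE)) ]; then
        return 0
    fi
    return 1
}
```

For example, `state_file_read_ok /dbdata/data/db1/gaussdb.state || echo "gaussdb.state has an unexpected length"`. The sketch checks length only; it says nothing about whether the content is meaningful.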
Solution
1. Inspect gaussdb.state
cd /dbdata/data/db1/
ll gaussdb.state
-rw------- 1 omm dbgrp 0 Sep 8 17:04 gaussdb.state
The permissions and ownership are normal, but a file size of 0 is not.
cat gaussdb.state
(no output)
vi gaussdb.state
(empty)
2. Replace gaussdb.state
rm -f gaussdb.state
Copy a gaussdb.state file from another, healthy MogDB environment to both the primary and the standby.
cp gaussdb.state
ll gaussdb.state
-rw-r--r-- 1 root root 72 Sep 9 15:14 gaussdb.state
chown omm:dbgrp gaussdb.state
ll gaussdb.state
-rw-r--r-- 1 omm dbgrp 72 Sep 9 15:24 gaussdb.state
Inspect the healthy gaussdb.state
cat gaussdb.state
(no visible output, because the file is binary)
vi gaussdb.state
^B^@^@^@^A^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
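A binary file like this is easier to read as a hex dump than in vi. In the caret notation above, ^B is byte 0x02, ^A is 0x01 and ^@ is 0x00. The sketch below uses od for the dump; the function name dump_state_file is hypothetical.

```shell
# Hypothetical helper: print every byte of the file in hex, without offsets.
# -An suppresses the address column, -v disables collapsing of repeated lines,
# -tx1 prints one hex value per byte.
dump_state_file() {
    od -An -v -tx1 "$1"
}
```

Running `dump_state_file gaussdb.state` on the healthy file should begin with the bytes 02 00 00 00 01 00 00 00 01 00 00 00, followed by zeros. An empty file produces no output at all, which makes the damage easy to spot.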
3. Start the cluster and check its status
gs_om -t start
Starting cluster.
=========================================
[SUCCESS] master
[SUCCESS] standby
=========================================
Successfully started.
gs_om -t status --detail
[ Cluster State ]
cluster_state : Normal
redistributing : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip port instance state
-----------------------------------------------------------------------------------
1 master 192.168.137.110 26000 6001 /dbdata/data/db1 P Primary Normal
2 standby 192.168.137.111 26000 6002 /dbdata/data/db1 S Standby Normal
Follow-up experiments
1. Can the database start normally if gaussdb.state is deleted by mistake?
Delete gaussdb.state
rm -f gaussdb.state
Start the database
gs_ctl -D /dbdata/data/db1/ start
Startup succeeds, and a new gaussdb.state is generated.
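This result suggests a simpler recovery for a zero-length state file than copying one from another environment: move the empty file out of the way and let the next start recreate it. The sketch below does that while keeping a backup. The function name set_aside_empty_state is hypothetical, and the approach rests only on this single test against MogDB 2.1.1.

```shell
# Hypothetical helper: if gaussdb.state in the given data directory exists
# and is empty, rename it to a timestamped backup so the next start can
# generate a fresh one. Returns 1 and changes nothing otherwise.
set_aside_empty_state() {
    state="$1/gaussdb.state"
    if [ -f "$state" ] && [ ! -s "$state" ]; then
        mv "$state" "$state.bak.$(date +%Y%m%d%H%M%S)"
        return 0
    fi
    return 1
}
```

For example, `set_aside_empty_state /dbdata/data/db1 && gs_ctl -D /dbdata/data/db1/ start`, run as the omm user on the affected node.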
2. Can the database start normally if the content of gaussdb.state is modified?
vi gaussdb.state
Clear the existing content, insert a few arbitrary digits, and save.
Start the database
gs_ctl -D /dbdata/data/db1/ start
The failure described above is reproduced.
Remove the invalid content, put the original byte string back, and save:
^B^@^@^@^A^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
The database starts successfully.
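Restoring binary content through vi is fragile, since control characters are awkward to type and vi may append a trailing newline. A more repeatable way to write the same bytes is sketched below. It assumes the content is exactly the 72 bytes shown above: 02 00 00 00 01 00 00 00 01 00 00 00 followed by 60 zero bytes. The function name write_sample_state is hypothetical, and these bytes reflect the state of the environment they were copied from, not a documented format.

```shell
# Hypothetical helper: write the 72-byte sample content to the given path.
# Assumption: 12 leading bytes as shown in the vi view, then 60 zero bytes.
write_sample_state() {
    {
        printf '\002\000\000\000\001\000\000\000\001\000\000\000'
        head -c 60 /dev/zero
    } > "$1"
}
```

After `write_sample_state gaussdb.state`, `ls -l` should report a size of 72 bytes.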
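Summary
1. Start and stop the database cleanly and keep the power supply stable. A sudden power loss can easily corrupt data files and leave the database unable to run.
2. When the database fails to start, work from the error message or the error log to find the cause. The official manual and a keyword search of the official source code are both useful.
References
MogDB official manual: https://docs.mogdb.io/zh/mogdb/v3.0/overview
openGauss source code: https://gitee.com/opengauss/openGauss-server/blob/master/src/gausskernel/process/postmaster/postmaster.cpp#L877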




