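Abstract
PostgreSQL's control file stores information initialized during initdb, WAL information, checkpoint information, and more. The file lives at $PGDATA/global/pg_control. Throughout the lifetime of a PostgreSQL cluster (running or stopped), several tools and processes can read or modify this file. This article collects (almost) every place where pg_control is read or modified and walks through the relevant source code, in the hope of giving a deeper understanding of the PostgreSQL control file.
Overview
Starting with the big picture: five server-side tools, four built-in functions, and one backend process can query or modify the pg_control file. Each is introduced below.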

Data structures & functions
pg_control is a binary file of 8192 bytes. Its content is the ControlFileData struct written out in binary form; the struct is smaller than the file, and the remaining bytes are zero padding.
- pg_control file size
#define PG_CONTROL_FILE_SIZE 8192
- The ControlFileData data structure
The source is in src/include/catalog/pg_control.h
/*
* Contents of pg_control.
*/
typedef struct ControlFileData
{
uint64 system_identifier;
uint32 pg_control_version; /* PG_CONTROL_VERSION */
uint32 catalog_version_no; /* see catversion.h */
DBState state; /* see enum above */
pg_time_t time; /* time stamp of last pg_control update */
XLogRecPtr checkPoint; /* last check point record ptr */
CheckPoint checkPointCopy; /* copy of last check point record */
XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels */
XLogRecPtr minRecoveryPoint;
TimeLineID minRecoveryPointTLI;
XLogRecPtr backupStartPoint;
XLogRecPtr backupEndPoint;
bool backupEndRequired;
int wal_level;
bool wal_log_hints;
int MaxConnections;
int max_worker_processes;
int max_wal_senders;
int max_prepared_xacts;
int max_locks_per_xact;
bool track_commit_timestamp;
uint32 maxAlign; /* alignment requirement for tuples */
double floatFormat; /* constant 1234567.0 */
#define FLOATFORMAT_VALUE 1234567.0
uint32 blcksz; /* data block size for this DB */
uint32 relseg_size; /* blocks per segment of large relation */
uint32 xlog_blcksz; /* block size within WAL files */
uint32 xlog_seg_size; /* size of each WAL segment */
uint32 nameDataLen; /* catalog name field width */
uint32 indexMaxKeys; /* max number of columns in an index */
uint32 toast_max_chunk_size; /* chunk size in TOAST tables */
uint32 loblksize; /* chunk size in pg_largeobject */
bool float8ByVal; /* float8, int8, etc pass-by-value? */
uint32 data_checksum_version;
char mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];
pg_crc32c crc;
} ControlFileData;
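Because the struct is written out as raw bytes, the leading fields can be picked out of a pg_control image by offset. The following is a minimal sketch, not PostgreSQL code: ControlHeader and parse_control_header are hypothetical names, the offsets 0, 8 and 12 follow from the first three fields being uint64, uint32 and uint32, and the image is assumed to have been written on a machine with the same byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: only the first three fields of ControlFileData. */
typedef struct ControlHeader
{
    uint64_t system_identifier;
    uint32_t pg_control_version;
    uint32_t catalog_version_no;
} ControlHeader;

/*
 * Copy the leading fields out of a raw pg_control image.
 * Returns 0 on success, -1 if the buffer is too short.
 * Assumption: same byte order as the machine that wrote the image.
 */
static int
parse_control_header(const unsigned char *buf, size_t len, ControlHeader *out)
{
    if (len < 16)
        return -1;
    memcpy(&out->system_identifier, buf, 8);        /* offset 0  */
    memcpy(&out->pg_control_version, buf + 8, 4);   /* offset 8  */
    memcpy(&out->catalog_version_no, buf + 12, 4);  /* offset 12 */
    return 0;
}
```

Using memcpy rather than casting the buffer to a struct pointer avoids alignment problems when the buffer is a plain byte array.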
- get_controlfile
get_controlfile reads the binary file pg_control into a ControlFileData struct.
/* function declaration */
ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);
/* key line: read the control file into the struct */
r = read(fd, ControlFile, sizeof(ControlFileData));
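A single read() call may legitimately return fewer bytes than requested, so the caller has to compare the result with the expected size. The sketch below illustrates that idea with a small loop; read_fully is a hypothetical helper, not a PostgreSQL function, and it assumes a POSIX system.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to `want` bytes from the start of `path` into `buf`.
 * Returns the number of bytes actually read (less than `want` if the
 * file is shorter), or -1 if the file cannot be opened or read.
 */
static ssize_t
read_fully(const char *path, void *buf, size_t want)
{
    size_t got = 0;
    int    fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    while (got < want)
    {
        ssize_t r = read(fd, (char *) buf + got, want - got);

        if (r < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            close(fd);
            return -1;
        }
        if (r == 0)
            break;              /* end of file */
        got += (size_t) r;
    }

    close(fd);
    return (ssize_t) got;
}
```

A caller that expects a full struct would treat any return value other than the struct size as an error.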
- update_controlfile
update_controlfile writes the contents of the ControlFileData struct to pg_control in binary form.
/* function declaration */
void update_controlfile(const char *DataDir, ControlFileData *ControlFile, bool do_sync);
/* open the file in binary mode */
if ((fd = open(ControlFilePath, O_WRONLY | PG_BINARY, pg_file_create_mode)) == -1)
/* write the contents */
if (write(fd, buffer, PG_CONTROL_FILE_SIZE) != PG_CONTROL_FILE_SIZE)
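Note that the write above covers PG_CONTROL_FILE_SIZE bytes, not sizeof(ControlFileData): the struct is first copied into a zero-filled buffer of the full file size. A minimal sketch of that padding step follows; build_control_image is a hypothetical name, and the real function does more than this (for example, it recomputes the CRC before writing).

```c
#include <stddef.h>
#include <string.h>

#define PG_CONTROL_FILE_SIZE 8192

/*
 * Build the on-disk image: zero-fill the whole buffer, then copy the
 * struct bytes to the front. Returns -1 if the data would not fit.
 */
static int
build_control_image(const void *data, size_t data_len,
                    unsigned char image[PG_CONTROL_FILE_SIZE])
{
    if (data_len > PG_CONTROL_FILE_SIZE)
        return -1;
    memset(image, 0, PG_CONTROL_FILE_SIZE);
    memcpy(image, data, data_len);
    return 0;
}
```

Writing a fixed-size, zero-padded image keeps the file length constant regardless of how the struct size changes between versions.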
- read_controlfile
pg_resetwal implements its own read_controlfile function. Besides reading pg_control, its main job is to check the control file's length, version number, CRC, and WAL segment size. In my opinion the name read_controlfile does not describe this very well.
static bool
read_controlfile(void)
{
    ...
    if ((fd = open(XLOG_CONTROL_FILE, O_RDONLY | PG_BINARY, 0)) < 0)
    {
        ...
    }
    ...
    len = read(fd, buffer, PG_CONTROL_FILE_SIZE);
    ...
    if (len >= sizeof(ControlFileData) &&
        ((ControlFileData *) buffer)->pg_control_version == PG_CONTROL_VERSION)
    {
        ...
        if (!EQ_CRC32C(crc, ((ControlFileData *) buffer)->crc))
        {
            ...
        }
        ...
        if (!IsValidWalSegSize(ControlFile.xlog_seg_size))
        {
            ...
        }
        return true;
    }
    ...
}
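The two checks at the end of read_controlfile can be illustrated in isolation. The sketch below uses a plain bitwise CRC-32C (Castagnoli polynomial) instead of PostgreSQL's optimized macros, and assumes the CRC covers every byte before the crc field. The names crc32c, image_crc_ok and wal_seg_size_ok are hypothetical, and the 1 MB to 1 GB power-of-two range reflects my understanding of the WAL segment size limits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78); slow but simple. */
static uint32_t
crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Check an image whose 4-byte CRC is stored at crc_offset and covers
 * the bytes [0, crc_offset). Assumes native byte order.
 */
static bool
image_crc_ok(const unsigned char *image, size_t crc_offset)
{
    uint32_t stored;

    memcpy(&stored, image + crc_offset, sizeof stored);
    return crc32c(image, crc_offset) == stored;
}

/* A WAL segment size must be a power of two between 1 MB and 1 GB. */
static bool
wal_seg_size_ok(uint32_t size)
{
    return size >= 1024u * 1024u &&
           size <= 1024u * 1024u * 1024u &&
           (size & (size - 1u)) == 0;
}
```

A CRC mismatch means the file content cannot be trusted as read, which is why a tool such as pg_resetwal has to decide how to proceed in that case rather than use the values blindly.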
Who reads pg_control
Server tool pg_controldata
This tool is very simple: it reads pg_control and prints its contents.

Built-in functions
PostgreSQL provides four built-in functions that display the information in the control file by category.
pg_control_init (parameters fixed when the cluster was initialized by initdb)

pg_control_system (system parameters)

pg_control_checkpoint (checkpoint parameters)

pg_control_recovery (recovery parameters)

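Server tool pg_checksums
pg_checksums checks, enables, or disables data checksums in a PostgreSQL cluster. The server must be shut down cleanly before running pg_checksums. When verifying checksums, the exit status is zero if there are no checksum errors and nonzero if at least one checksum failure is detected. When enabling or disabling checksums, the exit status is nonzero if the operation fails.
When verifying checksums, every file in the cluster is scanned. When enabling checksums, every file in the cluster is rewritten. When disabling checksums, only the pg_control file is updated.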

Server tool pg_ctl
When pg_ctl promotes a standby, it first checks the standby's state in pg_control: the state must be DB_IN_ARCHIVE_RECOVERY.
static DBState
get_control_dbstate(void)
{
    DBState     ret;
    bool        crc_ok;
    ControlFileData *control_file_data = get_controlfile(pg_data, &crc_ok);

    if (!crc_ok)
    {
        write_stderr(_("%s: control file appears to be corrupt\n"), progname);
        exit(1);
    }

    ret = control_file_data->state;
    pfree(control_file_data);
    return ret;
}
The possible DBState values are:
/*
* System status indicator. Note this is stored in pg_control; if you change
* it, you must bump PG_CONTROL_VERSION
*/
typedef enum DBState
{
DB_STARTUP = 0,
DB_SHUTDOWNED,
DB_SHUTDOWNED_IN_RECOVERY,
DB_SHUTDOWNING,
DB_IN_CRASH_RECOVERY,
DB_IN_ARCHIVE_RECOVERY,
DB_IN_PRODUCTION
} DBState;
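The promote check described above boils down to a single comparison against this enum. The sketch below repeats the enum so that it compiles on its own; can_promote and dbstate_name are hypothetical helpers, and the label strings follow the wording I recall from pg_controldata's output, so treat them as illustrative.

```c
#include <stdbool.h>

/* Copy of the enum above, repeated so the sketch is self-contained. */
typedef enum DBState
{
    DB_STARTUP = 0,
    DB_SHUTDOWNED,
    DB_SHUTDOWNED_IN_RECOVERY,
    DB_SHUTDOWNING,
    DB_IN_CRASH_RECOVERY,
    DB_IN_ARCHIVE_RECOVERY,
    DB_IN_PRODUCTION
} DBState;

/* Promotion only makes sense for a server running as a standby. */
static bool
can_promote(DBState state)
{
    return state == DB_IN_ARCHIVE_RECOVERY;
}

/* Human-readable label for a state (illustrative wording). */
static const char *
dbstate_name(DBState state)
{
    switch (state)
    {
        case DB_STARTUP:                return "starting up";
        case DB_SHUTDOWNED:             return "shut down";
        case DB_SHUTDOWNED_IN_RECOVERY: return "shut down in recovery";
        case DB_SHUTDOWNING:            return "shutting down";
        case DB_IN_CRASH_RECOVERY:      return "in crash recovery";
        case DB_IN_ARCHIVE_RECOVERY:    return "in archive recovery";
        case DB_IN_PRODUCTION:          return "in production";
    }
    return "unrecognized status code";
}
```

Because the numeric values are stored in pg_control, reordering the enum would change the meaning of existing files, which is why the source comment asks for a PG_CONTROL_VERSION bump on any change.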
Server tool pg_resetwal
pg_resetwal reads pg_control through its read_controlfile function so that it can modify the file afterwards.
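Server tool pg_rewind
pg_rewind also reads pg_control, for both the source and the target cluster, so that it can modify the target's copy afterwards. Since read_controlfile is a static function private to pg_resetwal, pg_rewind uses its own code path for this (in the pg_rewind source, a helper named digestControlFile).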
Who writes pg_control
Server tool pg_checksums
As mentioned above, pg_checksums reads the pg_control file. It also updates it, mainly the value of "Data page checksum version".

Running pg_checksums -e enables checksums and sets "Data page checksum version" in the control file to 1; running it with -d disables checksums and sets the value to 0.
/*
* Finally make the data durable on disk if enabling or disabling
* checksums. Flush first the data directory for safety, and then update
* the control file to keep the switch consistent.
*/
if (mode == PG_MODE_ENABLE || mode == PG_MODE_DISABLE)
{
ControlFile->data_checksum_version = (mode == PG_MODE_ENABLE) ? PG_DATA_CHECKSUM_VERSION : 0;
if (do_sync)
{
pg_log_info("syncing data directory");
fsync_pgdata(DataDir, PG_VERSION_NUM);
}
pg_log_info("updating control file");
update_controlfile(DataDir, ControlFile, do_sync);
if (verbose)
printf(_("Data checksum version: %u\n"), ControlFile->data_checksum_version);
if (mode == PG_MODE_ENABLE)
printf(_("Checksums enabled in cluster\n"));
else
printf(_("Checksums disabled in cluster\n"));
}
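The field update in the excerpt above reduces to a small mapping from mode to version number. The sketch below isolates it; the enum and the constant are stand-in definitions that mirror the names in the excerpt, and new_checksum_version and mode_updates_control are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in definitions mirroring the names used in the excerpt. */
typedef enum
{
    PG_MODE_CHECK,
    PG_MODE_DISABLE,
    PG_MODE_ENABLE
} PgChecksumMode;

#define PG_DATA_CHECKSUM_VERSION 1

/* Only enabling or disabling touches the control file. */
static bool
mode_updates_control(PgChecksumMode mode)
{
    return mode == PG_MODE_ENABLE || mode == PG_MODE_DISABLE;
}

/* Value of data_checksum_version after running in the given mode. */
static uint32_t
new_checksum_version(PgChecksumMode mode, uint32_t current)
{
    if (!mode_updates_control(mode))
        return current;     /* check mode leaves the field alone */
    return (mode == PG_MODE_ENABLE) ? PG_DATA_CHECKSUM_VERSION : 0;
}
```

In the excerpt the data directory is synced before the control file is updated, so the flag only changes once the rewritten pages are durable.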
Server tool pg_resetwal
pg_resetwal can reset a damaged WAL, or reset the WAL based on values such as a transaction ID, and it rewrites the pg_control file where necessary.
/*
* Write out the new pg_control file.
*/
static void
RewriteControlFile(void)
{
/*
* Adjust fields as needed to force an empty XLOG starting at
* newXlogSegNo.
*/
XLogSegNoOffsetToRecPtr(newXlogSegNo, SizeOfXLogLongPHD, WalSegSz,
ControlFile.checkPointCopy.redo);
ControlFile.checkPointCopy.time = (pg_time_t) time(NULL);
ControlFile.state = DB_SHUTDOWNED;
ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
ControlFile.minRecoveryPoint = 0;
ControlFile.minRecoveryPointTLI = 0;
ControlFile.backupStartPoint = 0;
ControlFile.backupEndPoint = 0;
ControlFile.backupEndRequired = false;
/*
* Force the defaults for max_* settings. The values don't really matter
* as long as wal_level='minimal'; the postmaster will reset these fields
* anyway at startup.
*/
ControlFile.wal_level = WAL_LEVEL_MINIMAL;
ControlFile.wal_log_hints = false;
ControlFile.track_commit_timestamp = false;
ControlFile.MaxConnections = 100;
ControlFile.max_wal_senders = 10;
ControlFile.max_worker_processes = 8;
ControlFile.max_prepared_xacts = 0;
ControlFile.max_locks_per_xact = 64;
/* The control file gets flushed here. */
update_controlfile(".", &ControlFile, true);
}
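The XLogSegNoOffsetToRecPtr call at the top of RewriteControlFile turns a segment number and a byte offset into a WAL position. A minimal sketch of that arithmetic, together with the usual high/low hexadecimal notation for an LSN, is shown below; seg_offset_to_lsn and format_lsn are hypothetical names, and the offset of 40 in the usage note is only an illustrative stand-in for SizeOfXLogLongPHD.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* WAL position = segment number * segment size + offset in the segment. */
static uint64_t
seg_offset_to_lsn(uint64_t segno, uint32_t offset, uint32_t wal_segsz_bytes)
{
    return segno * (uint64_t) wal_segsz_bytes + offset;
}

/* Format a 64-bit LSN as two 32-bit hexadecimal halves, "hi/lo". */
static void
format_lsn(uint64_t lsn, char *out, size_t outlen)
{
    snprintf(out, outlen, "%X/%X",
             (unsigned int) (lsn >> 32),
             (unsigned int) (lsn & 0xFFFFFFFFu));
}
```

With 16 MB segments, segment 1 and offset 40 give 0x1000028, which format_lsn renders as 0/1000028.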
Server tool pg_rewind
pg_rewind synchronizes a PostgreSQL data directory to a new timeline.
static ControlFileData ControlFile_target;
static ControlFileData ControlFile_source;
static ControlFileData ControlFile_source_after;
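It reads the necessary information from the source control file and rewrites the target control file.
Summary
pg_control stores four kinds of information: the cluster's initialization information, system information, checkpoint information, and recovery information. Several server-side tools read or modify pg_control. This article walked through the code that reads and writes pg_control, in the hope of helping readers understand the PostgreSQL control file.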




