Leveldb 源码分析 03: WAL

Table of Contents

WAL 概述
#

崩溃恢复
#

DBMS 往往采用恢复算法在故障发生时确保数据库一致性, 原子性和持久性的技术. 当发生崩溃时, 所有未提交到磁盘的内存中缓存的数据都有丢失的风险. 恢复算法的作用是在崩溃后防止数据的丢失.

恢复算法主要包括有两个部分:

在正常事务处理期间采取的操作, 以确保数据库管理系统能够从故障中恢复
在故障后采取的操作, 将数据库恢复到确保原子性、一致性和持久性的状态

恢复算法中使用的关键概念称为 UNDO 和 REDO, 但并非所有恢复算法都使用这两个.

UNDO：重做未完成或中止的事务
REDO：重新应用已提交的事务

Buffer Pool 和 WAL
#

传统的面向磁盘的 DBMS 往往会使用 Buffer Pool 缓冲数据页面, 多个并发事务运行过程中可能会修改相同页面, Buffer Pool 需要提供两个保证

所有已提交的事务都是持久的
未提交的事务不会出现部分提交（即事务终止前的操作覆盖了磁盘上已经提交的事务的内容）

因此，维护 Buffer Pool 时产生了一些策略，一般考虑

steal policy：是否允许未提交的事务覆盖已经提交的事务的最新写入的值
force policy：是否要求事务提交时将事务所有的更新都持久化，如果不能持久化则不会提交

steal policy

STEAL 允许
NO-STEAL 不允许

force policy

FORCE 要求
NO-FORCE 不要求

NO-STEAL + FORCE 是最简单的

不需要 undo，因为不会覆盖
不需要 redo，因为未提交的事务都丢弃了

但是 NO-STEAL + FORCE 存在限制，要求事务的 write-set 需要完全放入内存. STEAL + NO-FORCE 是目前大多数数据库使用的, 但该策略需要 wal 来进行 redo 和 undo. 简单举个栗子，例如 A 向 B 转账 100 元，Buffer Pool 使用 STEAL + NO-FORCE 策略

A 的账户扣减之后，由于 STEAL，这次更改所在的 Page 被 Buffer Pool 标记未 dirty 并写入磁盘
数据库崩溃，A 事务未完成, 客户端认为此次转账失败, 但 page 被刷到磁盘覆盖了 B 账户

如果没有 WAL 进行 undo, 上面数据就出现了不一致, 无法满足 ACID.

根据 Architecture of a Database System 论文的描述，WAL 协议具备3个规则

对于数据库页的每一次更新都会产生一个日志记录，在数据库页被刷新到磁盘之前，该日志记录必须被刷新到日志系统中。
数据库日志记录必须按顺序刷新，即在日志记录 r 被刷新时，必须保证所有 r 之前的日志记录都已经被刷新了。
如果一个事务提出提交请求，那么在提交返回成功之前，该提交日志记录必须被刷新到日志设备上。

leveldb 中的 wal
#

leveldb 使用 WAL 为 memtable 提供 ACID 中的 A (原子性) 和 D (持久性).
leveldb memtable 可以看作是 buffer pool
leveldb 不提供事务, 也不存在多个并发 writer.
leveldb 的缓冲策略是 no-steal, 写入首先到 memtable, 然后合并时刷新到磁盘, 所以恢复时只需要 redo 不需要 undo.
leveldb 支持多版本快照, 因此除了数据使用 wal 做 redo, 还会使用 wal 维护版本信息.

leveldb wal 格式设计特点

编码和内容分离: wal 格式无关的, 存储的数据类型取决于使用 wal 的部分如何编码
可移植性: wal 通过leveldb 的文件抽象实现了跨平台读写

每个 wal 文件由多个 block 构成, 每个 block 大小为 kBlockSize (默认 32kb), 每个 block 可以存储多个 record, record 由 7bytes header + data 构成.

+--------------------+--------------------+----------------+--------------------------------+
| checksum (4 bytes) |  length (2 bytes)  | type (1 bytes) |     data (32 kb - 7 bytes)     |
+--------------------+--------------------+----------------+--------------------------------+
|                                                          |
|                                                          |
|                                                          |
|                                                          |
+----------------------->   header   <---------------------+

通过格式设计可以看出，data 部分的内容是一个字节序列，通过编码决定具体的类型，所以是格式无关的。

使用 type 的原因在于 block 是固定大小，而一个 record 可能不止 32 kb，也可能远远小于 32kb, 所以其规则对应为:

一个 block 若只包含一个 record，则类型为 full
一个 block 包含若干个 record, 则第一个 record 的类型为 first, 最后一个为 last, 中间多个为 middle
一个 record 跨越多个 block, record 所在第一个 block 的类型为 first, 所在的最后一个 block 类型为 last, 中间若干个 block 类型记录为 middle

                    +-----------+--------+-------+---------------------------+-------------------+
 block 1  --------->| checksum  | length | full  |           data            | padding (6 bytes) |
                    +-----------+--------+-------+---------------------------+-------------------+



              +---->+-----------+--------+-------+-----------------------+
              |     | checksum  | length | first |         data          |
              |     +-----------+--------+-------+-----------------------+
              |     | checksum  | length |middle |         data          |
 block 2  ----+     +-----------+--------+-------+-----------------------+
              |     | checksum  | length |middle |         data          |
              |     +-----------+--------+-------+-----------------------+
              |     | checksum  | length | last  |         data          |
              +---->+-----------+--------+-------+-----------------------+



                    +-----------+--------+-------+-----------------------------------------------+
              +---->| checksum  | length | first |                     data                      |
              |     +-----------+--------+-------+-----------------------------------------------+
              |
              |     +-----------+--------+-------+-----------------------------------------------+
 block 3      |     | checksum  | length |middle |                     data                      |
 block 4      |     +-----------+--------+-------+-----------------------------------------------+
 block 5  ----+
 block 6      |     +-----------+--------+-------+-----------------------------------------------+
              |     | checksum  | length |middle |                     data                      |
              |     +-----------+--------+-------+-----------------------------------------------+
              |
              |     +-----------+--------+-------+---------------------------+
              +---->| checksum  | length | last  |           data            |
                    +-----------+--------+-------+---------------------------+

block 1 存储了一个完整的 record, 但剩余了 6byte 空间，由于够一个 header 大小，因此 leveldb 会做填充
block 2 存储了多个 reorcd
block{3,4,5,6} 存储了一个较大的 record, 同时 block6 还剩余一些空间

// vim db/log_format.h +14
enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,

  kFullType = 1,

  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4
};
static const int kMaxRecordType = kLastType;

static const int kBlockSize = 32768;

// Header is checksum (4 bytes), length (2 bytes), type (1 byte).
static const int kHeaderSize = 4 + 2 + 1;

leveldb 在 wal 中的每个 block 的 record data 部分编码写入的 key/value 数据，每个 record data 包含多个 WriteBatch

            +-->+-----------------+-----------------+-----------------------------------------------------+
            |   |  SN (8 bytes)   | count (8 bytes) |                       Content                       |
  record    |   +-----------------+-----------------+-----------------------------------------------------+
   data   --+   |  SN (8 bytes)   | count (8 bytes) |                       Content                       |
            |   +-----------------+-----------------+-----------------------------------------------------+
            |   |  SN (8 bytes)   | count (8 bytes) |                       Content                       |
            +-->+-----------------+-----------------+-----------------------------------------------------+
                                                                               |
                                                                               |
             +-------------------------------------    Content Layout    ------+------------------------------+
             |                                                                                                |
             |                                                                                                |
             |                                                                                                |
             |  +-----------------+-----------------+-------------+--------------------+-------------------+  |
             +->| type (1 bytes)  |key len (4 bytes)|  key data   |value len (4 bytes) |    value data     |<-+
                +-----------------+-----------------+-------------+--------------------+-------------------+

每个 WriteBatch 包括

sequence number, 写入是分配的 SN, leveldb 使用 SN 为多版本确定顺序、实现快照读
count, 包含多少个 key/value data
content, 编码后 key/value 数据

WriteBatch 实现上将 header 和 content 独立

WriteBatch

内部表示为一个字符串 rep_
编码 key/value 信息

// vim include/leveldb/write_batch.h +30
class LEVELDB_EXPORT WriteBatch {
  // Store the mapping "key->value" in the database.
  void Put(const Slice& key, const Slice& value);

  // If the database contains a mapping for "key", erase it.  Else do nothing.
  void Delete(const Slice& key);

  // Copies the operations in "source" to this batch.
  //
  // This runs in O(source size) time. However, the constant factor is better
  // than calling Iterate() over the source batch with a Handler that replicates
  // the operations into this batch.
  void Append(const WriteBatch& source);

 private:
  friend class WriteBatchInternal;

  std::string rep_;  // See comment in write_batch.cc for the format of rep_
};

WriteBatchInternal

使用 WriteBatch 的 rep_ 编码头部信息

// vim db/write_batch_internal.h +15

// WriteBatchInternal provides static methods for manipulating a
// WriteBatch that we don't want in the public WriteBatch interface.
class WriteBatchInternal {
 public:
  // Return the number of entries in the batch.
  static int Count(const WriteBatch* batch);

  // Set the count for the number of entries in the batch.
  static void SetCount(WriteBatch* batch, int n);

  // Return the sequence number for the start of this batch.
  static SequenceNumber Sequence(const WriteBatch* batch);

  // Store the specified number as the sequence number for the start of
  // this batch.
  static void SetSequence(WriteBatch* batch, SequenceNumber seq);

  static Slice Contents(const WriteBatch* batch) { return Slice(batch->rep_); }

  static size_t ByteSize(const WriteBatch* batch) { return batch->rep_.size(); }

  static void SetContents(WriteBatch* batch, const Slice& contents);

  static Status InsertInto(const WriteBatch* batch, MemTable* memtable);

  static void Append(WriteBatch* dst, const WriteBatch* src);
};

日志写入
#

写入主要处理

在文件中按照格式写入多个逻辑 block
处理 record 跨越多个 block 的情况

wal 写入由 leveldb::log::Writer 类实现, 先来看看该 class 的数据结构为

// vim db/log_writer.h +20
class Writer {
 private:
  WritableFile* dest_;
  int block_offset_;  // Current offset in block

  // crc32c values for all supported record types.  These are
  // pre-computed to reduce the overhead of computing the crc of the
  // record type stored in the header.
  uint32_t type_crc_[kMaxRecordType + 1];
};

dest_ leveldb 抽象的一个顺序写文件
block_offset_ 记录一个 block 数据段的 offset，当切换 block 时重置为 0，写入时增长

block 编码由EmitPhysicalRecord 执行，按照 wal 格式追加写入, 依次为

计算 crc
写入 header
写入 data
计算 block_offset_

// vim db/log_write.cc +82
Status Writer::EmitPhysicalRecord(RecordType t, const char* ptr,
                                  size_t length) {
  assert(length <= 0xffff);  // Must fit in two bytes
  assert(block_offset_ + kHeaderSize + length <= kBlockSize);

  // Format the header
  char buf[kHeaderSize];
  buf[4] = static_cast<char>(length & 0xff);
  buf[5] = static_cast<char>(length >> 8);
  buf[6] = static_cast<char>(t);

  // Compute the crc of the record type and the payload.
  uint32_t crc = crc32c::Extend(type_crc_[t], ptr, length);
  crc = crc32c::Mask(crc);  // Adjust for storage
  EncodeFixed32(buf, crc);

  // Write the header and the payload
  Status s = dest_->Append(Slice(buf, kHeaderSize));
  if (s.ok()) {
    s = dest_->Append(Slice(ptr, length));
    if (s.ok()) {
      s = dest_->Flush();
    }
  }
  block_offset_ += kHeaderSize + length;
  return s;
}

#

graph TD
  AddRecord --> begin
  begin[设置 begin = true]
  loop{{"数据完全写入?"}}
  left_header{"当前 block 剩余空间\n可以写入一个 header?"}

  begin --> loop
  loop --> |no| left_header

  reset_block_offset["重置 block_offset_ 为 0"]
  fill_block_trailer["在当前 block 填充空白"]
  left_over_lg_0{"当前 block 剩余空间 > 0"}

  left_header --> |no| left_over_lg_0
  left_over_lg_0 --> fill_block_trailer
  fill_block_trailer --> reset_block_offset
  reset_block_offset --> avail
  avail["计算写入 header 后当前 block 剩余可空间"]
  left_header --> |yes| avail

  full_space{"block 剩余空间可以\n写入多少数据?"}
  avail --> full_space

  full["一次写入全部\n(kFullType)"]
  first["第一次写入, 但只写入部分\n(kFirstType)"]
  last["最后一次写入, 数据写完\n(kLastType)"]
  middle["其他循环次数写入\n(kMiddleType)"]
  e["结束"]
  begin_f["begin 设置为 false"]
  full_space --> full
  full_space --> last
  full_space --> first
  full_space --> middle

  full --> e
  last --> e

  middle --> loop
  first --> begin_f
  first --> loop

日志读取
#

读取按照 Record 为单位

但是 leveldb 每次读一个 block 大小
对于机械硬盘来说是好的

// vim db/log_reader.h +51

  // Read the next record into *record.  Returns true if read
  // successfully, false if we hit end of the input.  May use
  // "*scratch" as temporary storage.  The contents filled in *record
  // will only be valid until the next mutating operation on this
  // reader or the next mutation to *scratch.
  bool ReadRecord(Slice* record, std::string* scratch);

Reader 类的 ReadPhysicalRecord 实现了从文件顺序读 block 并解出来 record

// vim db/log_reader.h +20
class Reader {
 private:
  // Extend record types with the following special values
  enum {
    kEof = kMaxRecordType + 1,
    // Returned whenever we find an invalid physical record.
    // Currently there are three situations in which this happens:
    // * The record has an invalid CRC (ReadPhysicalRecord reports a drop)
    // * The record is a 0-length record (No drop is reported)
    // * The record is below constructor's initial_offset (No drop is reported)
    kBadRecord = kMaxRecordType + 2
  };

  // Return type, or one of the preceding special values
  unsigned int ReadPhysicalRecord(Slice* result);

  SequentialFile* const file_;
  Reporter* const reporter_;
  bool const checksum_;
  char* const backing_store_;
  Slice buffer_;
  bool eof_;  // Last Read() indicated EOF by returning < kBlockSize

  // Offset of the last record returned by ReadRecord.
  uint64_t last_record_offset_;
  // Offset of the first location past the end of buffer_.
  uint64_t end_of_buffer_offset_;

  // Offset at which to start looking for the first record to return
  uint64_t const initial_offset_;

  // True if we are resynchronizing after a seek (initial_offset_ > 0). In
  // particular, a run of kMiddleType and kLastType records can be silently
  // skipped in this mode
  bool resyncing_;
};

file_ 是 leveldb 抽象的一个顺序读文件类
buffer_ 将 block 大小的数据一个从文件中读到内存
- 一个 block 可能包含多个 record, 所以 buffer_ 可以多次使用
- 当 buffer_ 不足一个 kHeaderSize 大小时，会再次从文件中读取一个新的
- buffer_ 不会手动 shink
end_of_buffer_offset_ 指向 buffer_ 末尾
checksum_ 记录是否解码 block 时计算 checksum, 默认为 true
resyncing_ 处理跨 block 的 record
- 如果一次可以读完一个完整的 block 或者读到 record 所在的最后一个 block, 设置为 false
- 其他情况设置为 false 表示需要读取多个 block 才能解出 record
last_record_offset , 如果 block 包含多个 record，记录在 buffer_ 中读取的最后一个 record 的 offset