On-Disk Data Structures

Waltz Storage provides persistency of data. It stores transaction data in its local disk.

Directory Structure

Waltz Storage stores transaction data in the local file system. The root directory is called the storage directory which is configured using a configuration file. The storage directory contains the control file (waltz-storage.ctl) which contains a version information, creation timestamp and partition information. Under the storage directory, there are partition directories. Each partition directory contains data files and index files. For each partition, transaction data are split into segments chronologically. A new segment is created when the current segment grow beyond the configured size. Each segment consists of a data file and an index file.

<storage directory>/                # the root directory of the storage (configurable)
    waltz-storage.ctl               # the control file
    0/                              # the directory for partition 0
        0000000000000000000.seg     # the segment data file. The file name is
                                    # <first transaction id in the segment>.seq
        0000000000000000000.idx     # the segment's index file
        ....
    1/
        ....

Control File

Control File Header

The control file begins with the header which contains the following information.

Field	Data Type	Size (bits)
format version number	int	32
creation time	long	64
key	UUID	128
the number of partitions	int	32
reserved for future use	-	768

The total header size is 128 bytes (1024 bits).

The key is UUID which is generated when the cluster is configured by CreateCluster utility. The key identifies the cluster to which the cluster it belongs. If an open request comes from a Waltz Server whose key does not match the key in the control file, Waltz Storage rejects the request.

Control File Body

After the header follows the actual body of control data. It is a list of Partition Info. The number of Partition Infos is the number of partitions recorded in the header.

Field	Data Type	size(bits)
partition id	int	32
partition info struct 1 session id	long	64
partition info struct 1 low-water mark	long	64
partition info struct 1 local low-water mark	long	64
partition info struct 1 checksum	int	32
partition info struct 2 session id	long	64
partition info struct 2 low-water mark	long	64
partition info struct 2 local low-water mark	long	64
partition info struct 2 checksum	int	32

Each Partition Info record is 60 bytes (480 bits)

A partition info struct records the session ID, the low-water mark, the local low-water mark, and the checksum of the struct itself. The low-water mark is the high-water mark of the partition in the cluster when the session is successfully started. The local low-water mark is the highest valid transaction ID of the partition in the storage when the session is successfully started. The local low-water mark can be smaller than the low-water mark when the storage is falling behind.

Two partition info structs are updated alternately when a new storage session is started, and the update is immediately flushed to the disk. The checksum is checked when a partition of opened. Since the atomicity of I/O is not guaranteed, it is possible that an update is not completely written to the file when a fault occurs during I/O. If one of the structs has a checksum error, we ignore it and use the other struct, which means we rollback the partition. We assume at least one of them is always valid. If neither of structs is valid, we fail to open the partition.

Segment Data File

Data File Header

Field	Data Type	Size (bits)
format version number	int	32
creation time	long	64
cluster key	UUID	128
partition id	int	32
first transaction ID	long	64
reserved for future use	-	704

The header size is 128 bytes. The cluster key is a UUID assigned to a cluster.

The first transaction ID is the ID of the first transaction in the segment. The data file body is a list of transaction records. Each transaction record contains the following information.

Transaction Record

Field	Data Type
transaction ID	long
request id	ReqId
transaction header	int
transaction data length	int
transaction data checksum	int
transaction data	byte[]
checksum	int

When new records are written, Waltz Storage flushes the file channel to guarantee the record persistence before responding to Waltz Server. The index file is also updated, but flush is delayed to reduce physical I/Os until checkpoint. The checkpoint interval is 1000 transactions (hardcoded). When a checkpoint is reached, Waltz Storage flushes the index file before adding a new record. This means, if a fault occurs between checkpoints, we are not sure if the index is valid. So, the index file recovery is necessary every time Waltz Storage starts up. Waltz Storage scans the records from the last checkpoint and rebuild index for record after the last checkpoint.

Segment Index File

Index File Header

Exactly same as the data file header.

Index File Body

Index File Body is an array of transaction record offsets.

Field	Data Type
transaction record offset	long

Each element corresponds to a transaction in the segment. The array index is <transaction id> - <first transaction id>. Each element is byte offsets of the transaction record in the data file.

Checkpoint Interval

In the recovery process described above, the last known clean transaction ID is updated more often than a stable environment since it is updated during the recovery process. A drawback is that the number of transactions after the last known clean transaction ID can become large when no fault occurs for a long period of time. This is bad when a recovery requires a truncation to the last known clean transaction ID. So, Waltz provides a configuration parameter "storage.checkpointInterval" which is an interval in transactions for forced initiation of a new session.

Handling Snapshot or Backup

Waltz does not provide a snapshot or backup making functionality. It is not a high priority at this moment since Waltz storage is fault tolerant. If necessary, use of a journaling file system like ZFS is a possible solution to this for now.

Let’s assume a snapshot is available somehow. We may restore stale storage files from a snapshot when storage files on a storage node is damaged by a disk failure or mistake. The issue is that the state information in Zookeeper and the state information storage becomes inconsistent. Waltz already handle this case. The storage is simply truncated to the last known clean transaction ID (recorded in the storage) to remove any possibly dirty transaction, then the catch-up process will be started and bring the storage up-to-date.