2022 APFS Advent Challenge Day 6 - B-Trees (Part 1)

Thursday, December 8, 2022

In yesterday’s post, we discussed Checkpoint Maps, the simple linear-time data structures that APFS uses to manage persistent, ephemeral objects. Today, we will give a general overview of B-Trees and detail the layout and on-disk structures of B-Tree Nodes.

Background

A B-Tree is a self-balancing tree data structure that maintains sorted data and allows for fast search, insertion, and deletion operations. B-Trees are a generalization of binary trees that enable each node to have more than two children, increasing the number of key/value pairs and thus the fan-out. B-Trees are self-balancing – automatically adjusting their structure to maintain a specific balance factor that guarantees operations in logarithmic time.

B-Trees are used in APFS to store and reference many different objects and data types. APFS B-Trees differ from traditional implementations in several ways. Leaf nodes store values, and non-leaf nodes store their children’s object identifiers (oids). Because oids are usually smaller than values, non-leaf nodes in APFS can reference more children than other implementations, reducing tree depth and increasing fan-out. B-Tree nodes in APFS support fixed or variable-length keys and values. This flexibility enables the storage of heterogeneous types of data. APFS provides for the specialization of B-Tree sub-objects that define their configuration and storage order. APFS B-Trees are well-suited for storing large amounts of data on disk or other sequential access storage devices.

On-Disk Structure and Layout

There are two types of APFS B-Tree node objects, and their layer differs slightly: B-Tree Root Node Objects and B-Tree Node Objects. Traversal of APFS B-Trees must start from a Root Node Object because nodes only store references to their children, not their siblings.

Structure of a Root B-Tree Node

Structure of a Root B-Tree Node

Structure of a Non-Root B-Tree Node

Structure of a Non-Root B-Tree Node

The layout of root nodes differs from non-root nodes in that they end with a btree_info_t structure. This structure gives information about the entire B-Tree. To avoid data duplication and to make more efficient use of space, non-root nodes do not store a copy of this information.

B-Tree Info

typedef struct btree_info_fixed {
    uint32_t bt_flags;     // 0x00
    uint32_t bt_node_size; // 0x04
    uint32_t bt_key_size;  // 0x08
    uint32_t bt_val_size;  // 0x0C
} btree_info_fixed_t;      // 0x10
  • bt_flags: A set of B-Tree node flags
  • bt_node_size: The on-disk size, in bytes, of a node in this B-Tree
  • bt_key_size: The size of a key, or zero if the keys have variable size
  • bt_val_size: The size of a key, or zero if the values have variable size
typedef struct btree_info {
    btree_info_fixed_t bt_fixed; // 0x00
    uint32_t bt_longest_key;     // 0x10
    uint32_t bt_longest_val;     // 0x14
    uint64_t bt_key_count;       // 0x18
    uint64_t bt_node_count;      // 0x20
} btree_info_t;                  // 0x28
  • bt_fixed: Information about the B-tree that doesnʼt change over time
  • bt_longest_key: The size (in bytes) of the largest key stored in the tree
  • bt_longest_value: The size (in bytes) of the largest value stored in the tree
  • bt_key_count: The number of keys stored in the B-Tree
  • bt_node_count: The number of nodes that make up the B-Tree

B-Tree Flags

Name Value Description
BTREE_UINT64_KEYS 0x00000001 Keys are all of type uint64_t
BTREE_SEQUENTIAL_INSERT 0x00000002 Keys are inserted sequentially
BTREE_ALLOW_GHOSTS 0x00000004 Allow keys with no corresponding value
BTREE_EPHEMERAL 0x00000008 Child node oids are ephemeral
BTREE_PHYSICAL 0x00000010 Child node oids are physical
BTREE_NONPERSISTENT 0x00000020 never found on disk
BTREE_KV_NONALIGNED 0x00000040 Keys and values aren’t 8-byte aligned
BTREE_HASHED 0x00000080 Non-leaf nodes store child hashes
BTREE_NOHEADER 0x00000100 Nodes are all stored without headers

Both types of node objects begin with a btree_node_phys_t structure.

// A location within a B-tree node.
typedef struct nloc {
    uint16_t off; // 0x00
    uint16_t len; // 0x02
} nloc_t;         // 0x04
  • off: The context-specific offset (in bytes)
  • len: The length (in bytes)
typedef struct btree_node_phys {
    obj_phys_t btn_o;         // 0x00
    uint16_t btn_flags;       // 0x20
    uint16_t btn_level;       // 0x22
    uint32_t btn_nkeys;       // 0x24
    nloc_t btn_table_space;   // 0x28
    nloc_t btn_free_space;    // 0x2C
    nloc_t btn_key_free_list; // 0x30
    nloc_t btn_val_free_list; // 0x34
} btree_node_phys_t;          // 0x38
  • btn_o: The node’s object header
  • btn_flags: A set of node-specific bit-flags
  • btn_level: The number of child levels below this node
  • btn_nkeys: The number of keys stored in this node
  • btn_table_space: The location of the table of contents relative to the end of the btree_node_phys_t structure
  • btn_free_space: The location of the table of contents relative to the beginning of the key area
  • brn_key_free_list: A linked list that tracks free space in the key area
  • brn_val_free_list: A linked list that tracks free space in the value area

Node Flags

Name Value Description
BTNODE_ROOT 0x0001 Root node
BTNODE_LEAF 0x0002 Leaf node
BTNODE_FIXED_KV_SIZE 0x0004 Fixed-sized keys and values
BTNODE_HASHED 0x0008 Node contains child hashes
BTNODE_NOHEADER 0x0010 Node is stored without an object header
BTNODE_CHECK_KOFF_INVAL 0x8000 never found on disk

Table of Contents

Each node has a table of contents (ToC) which stores an entry to each key/value pair contained in the node. The table space can be located by reading the btn_table_space member of the header. The start of this area is btn_table_space.off bytes after the btree_node_phys_t and is btn_table_space.len bytes in size.

ToC entries identify the location of key and value data in their respective storage areas. The structure of these entries differs depending on whether the BTNODE_FIXED_KV_SIZE flag is set in the header. When this flag is set, kvoff_t structures in the ToC store offsets of keys and values relative to their respective storage areas.

// The location, within a B-tree node, of a fixed-size key and value.
typedef struct kvoff {
    uint16_t k; // 0x00
    uint16_t v; // 0x02
} kvoff_t;      // 0x04
  • k: Offset of the key (relative to the start of the key area)
  • v: Offset of the value (relative to the end of the value area)

It is necessary to store both offsets and sizes for nodes with variable keys.

// The location, within a B-tree node, of a key and value.
typedef struct kvloc {
    nloc_t k; // 0x00
    nloc_t v; // 0x04
} kvloc_t;    // 0x08
  • k: Location of the key (relative to the start of the key area)
  • v: Location of the value (relative to the end of the value area)

Key Area

The area where key data is stored is called the key area and begins immediately after the table space. The size of this area grows downward towards the end of the node. Key offsets in the ToC are relative to the start of this area.

Value Area

The area where value data is stored is called the value area. For non-root nodes, the value area starts at the end of the node. For root nodes, the start is before the btree_info_t structure. This area grows upward towards the beginning of the node, and value offsets in the ToC are interpreted as negative offsets relative to this point.

Conclusion

B-Trees are tree-like data structures that provide fast access and efficient storage of key/value data. B-Trees are widely used in APFS and serve several specialized purposes.

In the next post, we will continue our discussion of B-Trees and detail methods of enumerating and looking up their referenced objects.

This post is part of my 2022 APFS Advent Challenge

Every weekday in the month of December, I will attempt to post a blog about APFS internals. For each day that I miss a post, I will donate $100 to support humanitarian aid for the Ukrainian people. If you find value in this series, and would like to support this effort, please consider donating to the GoFundMe. Slava Ukraini! 🇺🇦

Find an issue or technical inaccuracy in this post? Please file an issue so that it may be corrected.