File System Trees
Each APFS volume has a logical file system stored on disk as a collection of File System Objects. Unlike other APFS Objects, File System Objects consist of one or more File System Records, which are stored in the volume’s File System Tree (FS-Tree). Each record stores specific information about a file or directory. Analyzing each record and associating them with other records with the same identifier gives a complete picture of the file system entry. This post will discuss how these records are organized in the volume’s FS-Tree.
Overview
The File System Tree is a specialized B-Tree that differs in several ways from the other trees that we’ve discussed so far:
-
FS-Trees are virtual B-Trees. Each node in the tree is a virtual object owned by the Volume’s Object Map. This means that querying the FS-Tree requires using the Object Map to locate each node.
-
FS-Tree nodes can be optionally encrypted. (We will discuss encryption in a future post.) This allows for select volumes to encrypt not only their files’ contents but their metadata as well.
-
FS-Trees store a heterogeneous set of records – multiple types of keys and values are stored in the same tree.
One advantage of being virtual trees is that FS-Trees can take full advantage of the Object Map’s snapshotting capabilities to restore their state to previous points in time. Apple also uses the snapshots to compare an FS-Tree with an earlier version of itself to create deltas for Time Machine backups.
Keys
Because FS-Trees have multiple key types, they require a way to identify record types. All keys begin with a common structure for this purpose. Specific types may add additional fields to their keys.
#define OBJ_ID_MASK 0x0fffffff'ffffffff
#define OBJ_TYPE_MASK 0xf0000000'00000000
#define OBJ_TYPE_SHIFT 60
typedef struct j_key {
uint64_t obj_id_and_type;
} j_key_t;
obj_id_and_type: A bit field that encodes the record’s object identifier (in the 60 least-significant bits) and type (in the four most-significant bits).
Keys are ordered first by an object identifier and then by type. A File System Object’s records will be stored together sequentially. Search the FS-Tree for the first record with a given identifier and then enumerate subsequent records until reaching one with a different ID.
Reserved Inode Numbers
Several inode numbers are reserved for special purposes:
| Name | Value | Description |
|---|---|---|
| INVALID_INO_NUM | 0 | Invalid (no inode) |
| ROOT_DIR_PARENT | 1 | Sentinel parent ID for the root directory (not an actual inode) |
| ROOT_DIR_INO_NUM | 2 | The volume’s root directory |
| PRIV_DIR_INO_NUM | 3 | The private directory (private-dir), used for implementation-specific bookkeeping |
| SNAP_DIR_INO_NUM | 6 | Snapshot metadata directory |
| PURGEABLE_DIR_INO_NUM | 7 | Purgeable file references (reserved, no actual directory) |
| MIN_USER_INO_NUM | 16 | First inode number available for user content |
All inode numbers below 16 are reserved. On system volumes in a volume group, these same numbers are offset by UNIFIED_ID_SPACE_MARK (0x0800000000000000).
File System Record Types
Below is a table of the documented File System Record Types. We will discuss the on-disk format of each record type soon.
| Name | Value | Description |
|---|---|---|
| APFS_TYPE_SNAP_METADATA | 1 | Metadata about a snapshot |
| APFS_TYPE_EXTENT | 2 | A physical extent record |
| APFS_TYPE_INODE | 3 | An inode |
| APFS_TYPE_XATTR | 4 | An extended attribute |
| APFS_TYPE_SIBLING_LINK | 5 | A mapping from an inode to hard links |
| APFS_TYPE_DSTREAM_ID | 6 | A data stream |
| APFS_TYPE_CRYPTO_STATE | 7 | A per-file encryption state |
| APFS_TYPE_FILE_EXTENT | 8 | A physical extent record for a file |
| APFS_TYPE_DIR_REC | 9 | A directory entry |
| APFS_TYPE_DIR_STATS | 10 | Information about a directory |
| APFS_TYPE_SNAP_NAME | 11 | The name of a snapshot |
| APFS_TYPE_SIBLING_MAP | 12 | A mapping from a hard link to its target inode |
| APFS_TYPE_FILE_INFO | 13 | Additional information about file data |
On volumes with the APFS_INCOMPAT_EXPANDED_RECORDS flag set, the type value APFS_TYPE_EXPANDED (14) serves as a marker. When this type is set in j_key_t, the actual record subtype is stored in a separate byte at key offset +8. Known expanded subtypes:
| Subtype | Description |
|---|---|
| 16 | Purgeable file tracking |
| 17 | Tombstone records for deleted entries |
| 18 | Expanded directory statistics |
| 19 | Clone mapping records for file cloning |
Key Encoding Details
The obj_id_and_type field in j_key_t packs both the object identifier and type into a single 64-bit value. In a volume group, SYSTEM_OBJ_ID_MARK (0x0fffffff00000000) is the smallest object identifier used by the system volume: object identifiers below this value belong to the data volume, and identifiers at or above it belong to the system volume.
When sorting FS-Tree records, the comparison proceeds in three stages:
- Compare object identifiers (lower 60 bits of
obj_id_and_type) numerically. - Compare types (upper 4 bits) numerically. This groups all record types for a single file system object together.
- For directory records and extended attributes, compare the remaining key data. For
j_drec_hashed_key_t, comparename_len_and_hashnumerically first, then name bytes. Forj_xattr_key_t, compare name bytes directly.
This ordering ensures that all records for a single file system object (inode, directory entries, extents, xattrs, siblings) are stored adjacently in the tree, making it efficient to gather all information about a file in a single range scan.
File System Object Composition
Different file system objects are composed of different record types. Here are the common compositions:
- Files:
INODE(required), plus optionalCRYPTO_STATE,DSTREAM_ID,EXTENT,FILE_EXTENT,SIBLING_LINK, andXATTRrecords - Directories:
INODE(required), plusDIR_RECfor each child, optionalDIR_STATS,CRYPTO_STATE, andXATTR - Symbolic links:
INODE(required), plus anXATTRwith the namecom.apple.fs.symlinkwhose value is the target path - Snapshots:
SNAP_METADATA(required) andSNAP_NAME(required), plus optionalCRYPTO_STATEandEXTENT
Conclusion
The File System Tree (FS-Tree) in an APFS volume is a specialized B-Tree that stores information about the files and directories on the volume. A unique object identifier and type identify each record in the tree, and the FS-Tree is ordered by these keys. FS-Tree nodes can be encrypted, and the tree takes advantage of the Object Map’s snapshotting capabilities. By analyzing the records in the FS-Tree, one can gain a complete understanding of the volume’s file system. In our next post, we will discuss the details of some of these records.
Find an issue or technical inaccuracy in this post? Please file an issue so that it may be corrected.