Containers
APFS is a pooled storage, transactional, copy-on-write file system. Its design relies on a core management layer known as the Container. APFS containers consist of a collection of several specialized components: The Space Manager, the Checkpoint Areas, and the Reaper. In today’s post, we will give an overview of APFS containers and these components, including the mount procedure, transaction lifecycle, and container resize mechanisms.
History
Prior to the introduction of APFS, Apple’s primary file system of choice was HFS+. HFS+ is a journaling file system that was introduced by Apple in 1998 as an improvement over its legacy HFS file system.
Like most file systems of its era, each HFS+ volume can only manage the space of a single physical disk partition. While it is possible to have more than one HFS+ volume on a disk, the limitation of “one volume per partition” requires that the storage space for each volume be fixed and pre-allocated. This means that HFS+ volumes that are low on storage space cannot make use of any available free space elsewhere on disk.
In 2012, Apple introduced its hybrid Fusion Drives, which consist of a larger hard disk drive (HDD) combined with a smaller, but faster solid state drive (SSD) in a single package. The HDD is intended to be used as the primary storage device, providing the baseline storage capacity, and the SSD provides faster access to the most recently accessed data by acting as a cache.
This caching logic is not built into the Fusion Drive hardware. The two drives are presented to the operating system as separate storage devices. HFS+ does not have the ability to span a volume across multiple partitions, and it was not designed to support the desired caching mechanisms.
Rather than massively overhauling HFS+ to support these new capabilities, Apple decided instead to add an additional storage layer, called Core Storage. Core Storage acts as a logical volume manager that has the ability to pool the storage of multiple devices on the same drive into a single, logical volume. It also implements a tiered storage model that allows blocks to be duplicated and cached on Fusion Drives. Incidentally, Core Storage also provides the mechanism for the volume-level encryption facilities of FileVault on HFS+ systems. Because HFS+ only sees a single logical volume, these complexities are completely transparent to the file system’s implementation.
Apple introduced APFS in 2017. The design of APFS takes many lessons from both HFS+ and Core Storage, and eliminates the need for both of them.
Space Manager
APFS containers provide pooled and tiered storage capabilities, without the need for a Core Storage layer. A container presents one logical view of storage, whose blocks can be shared among multiple volumes without the need for pre-partitioning and pre-allocation of space. As volumes’ storage requirements change over time, blocks are allocated from or returned to the container. This allows for quite a bit of flexibility, as you can now have multiple volumes that serve different roles without having to figure out their space requirements ahead of time. For example, you can now have more than one system volume with different versions of macOS installed that can share the same user data volume.
A container supports storage devices as small as 1 MiB in size (APFS on a 1.44 MiB HD floppy, anyone?) and has no apparent upper storage limit. It supports the sharing of blocks among as many as 100 volumes (with some limitations). In addition to that hard-coded maximum of 100 volumes, APFS requires that there be no more than one volume per 512 MiB of storage space. This helps limit storage contention and reduces the amount of space needed to maintain file system metadata on-disk.
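The two volume limits above combine into a simple bound. The following is a minimal sketch rather than APFS code: the function name is hypothetical, and it assumes that a partially filled 512 MiB unit still counts toward the limit (that is, the division rounds up).

```python
MIB = 1024 * 1024
NX_MAX_FILE_SYSTEMS = 100        # hard-coded ceiling named in the text
BYTES_PER_VOLUME = 512 * MIB     # at most one volume per 512 MiB

def max_volumes(container_bytes):
    """Upper bound on the number of volumes for a container of this size.

    Assumption: the size-based limit rounds up, so even a tiny
    container may hold one volume.
    """
    per_size = -(-container_bytes // BYTES_PER_VOLUME)  # ceiling division
    return min(NX_MAX_FILE_SYSTEMS, max(1, per_size))

for size in (1 * MIB, 512 * MIB, 513 * MIB, 64 * 1024 * MIB):
    print(size // MIB, "MiB ->", max_volumes(size), "volume(s)")
```

Under these assumptions, the size-based limit governs small containers and the fixed ceiling of 100 takes over once a container reaches roughly 50 GiB.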
The Space Manager keeps track of which blocks across storage tiers are in-use. It is also responsible for the allocation and freeing of blocks for volumes on-demand.
Checkpoint Areas
As mentioned in last Friday’s post, APFS provides fault tolerance by batching together copies of updated objects and committing them to disk in transactions known as checkpoints. This transactional, copy-on-write strategy ensures that there is always at least one valid and complete set of APFS objects on disk. The latest checkpoint may be used as the authoritative source of information, and since checkpoints aren’t immediately invalidated, the entire state of APFS can be reverted to an earlier point in time.
APFS containers maintain two distinct checkpoint areas: the Checkpoint Data Area, which is reserved for storage of ephemeral objects, and the Checkpoint Descriptor Area.
The Checkpoint Descriptor Area provides a logically (but not necessarily physically) contiguous area on disk that is reserved to act as a circular buffer to store two types of objects that are used to store information about checkpoints: Checkpoint Map Objects and NX Superblock Objects.
After a checkpoint is flushed to disk, both types of objects are written to the descriptor area. The Checkpoint Map Objects provide a list of all ephemeral objects, their types, and their storage location within the checkpoint data area. An NX Superblock object is written to the descriptor area buffer after the map objects. This superblock is the root object of APFS and serves as the initial source of information about the state of the container in each checkpoint. All other valid objects in APFS are either directly or indirectly reachable from the NX superblock object.
Both checkpoint areas normally occupy contiguous ranges of blocks on disk, but can be fragmented when contiguous space is unavailable. When fragmented, bit 31 is set in the nx_xp_desc_blocks or nx_xp_data_blocks fields of the NX Superblock, and the corresponding _base field becomes the object identifier of a Metadata Fragmented Extent List Tree rather than a direct base address. This tree maps logical offsets within the metadata region to physical block ranges, allowing the checkpoint areas to span non-contiguous regions.
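Decoding the fragmentation flag described above takes two bit operations. This is an illustrative sketch; the helper names are hypothetical, and the raw values are assumed to be the 32-bit nx_xp_desc_blocks or nx_xp_data_blocks fields already read from the superblock.

```python
FRAGMENTED_FLAG = 1 << 31    # bit 31 marks a fragmented checkpoint area
COUNT_MASK = 0x7FFFFFFF      # the remaining 31 bits hold the block count

def decode_xp_blocks(raw):
    """Split a raw *_blocks field into (block_count, is_fragmented)."""
    return raw & COUNT_MASK, bool(raw & FRAGMENTED_FLAG)

def describe_area(raw_blocks, base):
    """Explain how the matching _base field should be interpreted."""
    count, fragmented = decode_xp_blocks(raw_blocks)
    if fragmented:
        return f"{count} blocks, located via extent-list tree with object id {base:#x}"
    return f"{count} blocks, contiguous from block address {base:#x}"

print(describe_area(0x00000008, 0x1))
print(describe_area(0x80000008, 0x402))
```

Masking with 0x7FFFFFFF before using the count matters because a fragmented area would otherwise appear to be more than two billion blocks long.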
Reaper
Once a checkpoint transaction is successfully flushed to disk, APFS may choose to invalidate the oldest checkpoint. At this point, all newly unreferenced objects are subject to a process of garbage collection, where their blocks can be wiped and returned to the space manager for reuse. The Reaper is responsible for managing this garbage collection process, keeping track of the state of objects so that they may be freed across transactions.
Mounting a Container
Mounting an APFS container involves locating the most recent valid checkpoint and using it to bootstrap access to all other structures. The procedure follows these steps:
- Read block zero. This block contains a copy of the container superblock (nx_superblock_t). It may be the latest version or an older one, depending on whether the drive was unmounted cleanly. Validate that nx_magic equals NX_MAGIC ('BSXN'), the block size is valid, and the checksum is correct.
- Locate the checkpoint descriptor area using the nx_xp_desc_base field.
- Find the latest valid checkpoint. Two paths exist:
  - Clean-unmount fast path: If NX_CLEAN_UNMOUNT is set in nx_flags, the storage is trusted, and both nx_xp_desc_len and nx_xp_data_len are nonzero, read the superblock directly at the index (nx_xp_desc_index + nx_xp_desc_len - 1) % (nx_xp_desc_blocks & 0x7FFFFFFF).
  - Full scan: Scan all blocks in the checkpoint descriptor area to find the superblock with the highest valid transaction identifier (o_xid). Walk backward from that point, validating each candidate superblock’s checksum, feature flags, UUID consistency, and self-reported position. On untrusted storage (external or removable media), perform additional consistency checks on recently-changed container structures.
- Validate checkpoint mappings. Read the nx_xp_desc_len - 1 checkpoint mapping blocks that precede the superblock in the descriptor area. Verify each mapping block’s type, transaction ID, and entry count. The final mapping block has CHECKPOINT_MAP_LAST set.
- Load ephemeral objects. Read each object listed in the checkpoint mappings from the checkpoint data area. Verify checksums, types, and transaction IDs.
- Locate the container object map using nx_omap_oid.
- Mount volumes. Read the volume list from nx_fs_oid, look up each volume’s superblock via the container object map, and access each volume’s file system tree.
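The first steps of this procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a full parser: it assumes a 32-byte little-endian object header (checksum, object ID, transaction ID, type, subtype) followed immediately by nx_magic and the block size, it assumes valid block sizes are powers of two between 4 KiB and 64 KiB, and it skips checksum verification entirely. The function names are hypothetical.

```python
import struct

NX_MAGIC_ON_DISK = b"NXSB"     # the 32-bit value 'BSXN' stored little-endian
OBJ_HEADER_FMT = "<QQQII"      # assumed: checksum, oid, xid, type, subtype
OBJ_HEADER_SIZE = struct.calcsize(OBJ_HEADER_FMT)   # 32 bytes

def parse_block_zero(block):
    """Check the magic and block size of a candidate container superblock."""
    _cksum, _oid, xid, _type, _subtype = struct.unpack_from(OBJ_HEADER_FMT, block, 0)
    magic, block_size = struct.unpack_from("<4sI", block, OBJ_HEADER_SIZE)
    if magic != NX_MAGIC_ON_DISK:
        raise ValueError("not an APFS container superblock")
    # Assumed validity rule: a power of two between 4 KiB and 64 KiB.
    if not (4096 <= block_size <= 65536) or block_size & (block_size - 1):
        raise ValueError(f"invalid block size {block_size}")
    return {"xid": xid, "block_size": block_size}

def fast_path_index(desc_index, desc_len, desc_blocks_raw):
    """Descriptor-area index of the latest superblock after a clean unmount."""
    return (desc_index + desc_len - 1) % (desc_blocks_raw & 0x7FFFFFFF)

# Synthetic block zero, built here for demonstration only.
header = struct.pack(OBJ_HEADER_FMT, 0, 1, 42, 0, 0)
block = (header + NX_MAGIC_ON_DISK + struct.pack("<I", 4096)).ljust(4096, b"\0")
print(parse_block_zero(block))
print(fast_path_index(6, 3, 8))
```

The modulo in fast_path_index reflects the circular layout of the descriptor area: a checkpoint that starts near the end of the buffer wraps around to its beginning.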
If any step fails, the implementation falls back to an older valid checkpoint from the descriptor area. This ensures that even after a crash or incomplete write, the container can always recover to a consistent state.
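The fallback behavior amounts to trying candidates from newest to oldest. Here is a minimal sketch, assuming the per-superblock validation results are already known and that the remaining mount steps are wrapped in a caller-supplied function; the Candidate type and try_mount callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    index: int    # position within the checkpoint descriptor area
    xid: int      # transaction identifier (o_xid) of the superblock
    valid: bool   # outcome of checksum, flag, and UUID checks done elsewhere

def pick_checkpoint(candidates, try_mount):
    """Return the newest candidate that mounts, or None if all fail.

    try_mount(candidate) stands in for the remaining steps (mappings,
    ephemeral objects, object map, volumes) and returns True on success.
    """
    usable = (c for c in candidates if c.valid)
    for cand in sorted(usable, key=lambda c: c.xid, reverse=True):
        if try_mount(cand):
            return cand
    return None

found = [Candidate(0, 10, True), Candidate(2, 12, False), Candidate(4, 11, True)]
# Pretend the checkpoint with xid 11 has a damaged mapping block.
chosen = pick_checkpoint(found, lambda c: c.xid != 11)
print(chosen)
```

In this example the superblock with xid 12 is rejected up front, xid 11 fails during the later steps, and the container falls back to xid 10.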
Transaction Lifecycle
APFS maintains a pool of up to 4 transaction objects. Each transaction progresses through these states:
- Open: A new transaction identifier is assigned from nx_next_xid. Participants (volume operations, space manager updates) enter the transaction and increment its active reference count. New reads and writes operate within this transaction.
- Closing: When conditions are met (sufficient dirty objects, space pressure, or an explicit flush request), the transaction transitions to closing. No new participants may enter.
- Flushing: The checkpoint write sequence begins. All dirty ephemeral objects are written to the checkpoint data area, checkpoint mapping blocks are written to the descriptor area, a storage barrier is issued, and the new superblock is written as the commit point.
- Complete: The checkpoint is fully committed. The transaction object is recycled to the pool.
A new transaction can open while a previous one is still flushing. At most two transactions are active simultaneously: one accepting modifications and one being written to disk. If the flush pipeline is full, new transactions block until space is available. An error at any stage aborts the transaction, reverting all uncommitted changes.
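The states and the two-transaction limit can be modeled as a small state machine. This is a toy sketch, not the real implementation: the class and method names are hypothetical, the pool of four transaction objects is not modeled, and where a real implementation would block the caller, this version raises an exception instead.

```python
from enum import Enum, auto

class State(Enum):
    OPEN = auto()
    CLOSING = auto()
    FLUSHING = auto()
    COMPLETE = auto()

class TransactionManager:
    MAX_ACTIVE = 2   # one accepting modifications, one being written to disk

    def __init__(self, next_xid):
        self.next_xid = next_xid   # plays the role of nx_next_xid
        self.active = {}           # xid -> State for unfinished transactions

    def open(self):
        if len(self.active) >= self.MAX_ACTIVE:
            raise RuntimeError("flush pipeline full; a real caller would block")
        if State.OPEN in self.active.values():
            raise RuntimeError("a transaction is already open")
        xid = self.next_xid
        self.next_xid += 1
        self.active[xid] = State.OPEN
        return xid

    def advance(self, xid):
        """Move a transaction to its next state and return that state."""
        state = self.active[xid]
        if state is State.OPEN:
            new = State.CLOSING
        elif state is State.CLOSING:
            if State.FLUSHING in self.active.values():
                raise RuntimeError("another checkpoint is still flushing")
            new = State.FLUSHING
        else:
            # FLUSHING -> COMPLETE: the transaction leaves the active set.
            del self.active[xid]
            return State.COMPLETE
        self.active[xid] = new
        return new

tm = TransactionManager(next_xid=100)
first = tm.open()
tm.advance(first)        # OPEN -> CLOSING
tm.advance(first)        # CLOSING -> FLUSHING
second = tm.open()       # allowed while the first is still flushing
print(first, second, tm.active)
```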
Checkpoint Write Sequence
The commit process writes a checkpoint in a carefully ordered sequence:
- Write checkpoint data: Dirty ephemeral objects (space manager, object map, reaper, B-Tree nodes) are written to the checkpoint data area ring buffer.
- Write checkpoint mappings: checkpoint_map_phys_t blocks recording the type, location, and size of each ephemeral object are written to the descriptor area. The last mapping block is marked with CHECKPOINT_MAP_LAST.
- Storage barrier: A cache flush ensures all mapping and data blocks reach persistent storage before the superblock.
- Write superblock: The NX Superblock is written with updated nx_xp_desc_index, nx_xp_desc_len, nx_xp_data_index, nx_xp_data_len, and the current transaction’s o_xid. This is the atomic commit point: once persisted, the checkpoint is valid.
The ordering guarantees that if a crash occurs before the superblock is persisted, the previous checkpoint remains valid. The superblock is the single point of atomicity for each transaction.
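A small simulation shows why this ordering is safe. The sketch below models a disk with a volatile write cache entirely in memory; FakeDisk, commit_checkpoint, and latest_checkpoint are hypothetical names, and the second barrier after the superblock write is an assumption added so the simulated commit becomes durable.

```python
class FakeDisk:
    """In-memory stand-in for a disk with a volatile write cache."""

    def __init__(self):
        self.persistent = {}   # contents that survive a crash
        self.cache = {}        # writes not yet flushed

    def write(self, key, value):
        self.cache[key] = value

    def barrier(self):
        # Flush: everything written so far becomes durable.
        self.persistent.update(self.cache)
        self.cache.clear()

    def crash(self):
        # Power loss: unflushed writes are gone.
        self.cache.clear()

def commit_checkpoint(disk, xid, ephemeral, crash_before_superblock=False):
    for i, obj in enumerate(ephemeral):                    # 1. checkpoint data
        disk.write(("data", xid, i), obj)
    disk.write(("map", xid), list(range(len(ephemeral))))  # 2. mappings
    disk.barrier()                                         # 3. storage barrier
    if crash_before_superblock:
        disk.crash()
        return
    disk.write(("superblock", xid), {"xid": xid})          # 4. commit point
    disk.barrier()   # assumption: flush so the commit itself is durable

def latest_checkpoint(disk):
    """Highest xid with a persisted superblock, or None."""
    xids = [key[1] for key in disk.persistent if key[0] == "superblock"]
    return max(xids, default=None)

disk = FakeDisk()
commit_checkpoint(disk, 1, ["spaceman v1"])
commit_checkpoint(disk, 2, ["spaceman v2"], crash_before_superblock=True)
print(latest_checkpoint(disk))
```

After the simulated crash, the data and mapping blocks for transaction 2 are on disk, but no superblock refers to them, so a mount would still select checkpoint 1.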
Container Resize
A container can be grown or shrunk while mounted. This modifies all statically allocated metadata areas (checkpoint descriptor area, checkpoint data area, space manager internal pool, and internal pool bitmaps) to fit the new container size.
Growing
Growing a container extends nx_block_count to cover additional space on the device. New metadata areas are allocated in the expanded region, old metadata blocks are freed, and the space manager is updated to reflect the larger free pool.
Shrinking
Shrinking is more complex because data may occupy the blocks being removed:
- Block-out phase: The physical range at the tail of the container is identified for removal. Any data in this range is relocated to free space elsewhere in the container using the block-out mechanism (nx_blocked_out_prange in the NX Superblock). This eviction uses the evict-mapping tree (nx_evict_mapping_tree_oid) to track source-to-destination block relocations. The block-out may span multiple transactions.
- Metadata relocation: Transactions are frozen and new metadata areas are allocated within the reduced container bounds. The space manager, checkpoint areas, and internal pool are recreated at their new locations.
- Persist: The updated superblock is written and old metadata blocks are freed.
If the shrink fails due to insufficient space, APFS computes and reports the minimum achievable container size based on currently occupied data blocks and required metadata overhead.
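The block-out phase can be pictured as building a source-to-destination map for every occupied block in the tail range. The following is a deliberately naive sketch: it treats the container as a flat set of occupied block numbers, ignores metadata areas, and picks the lowest free blocks as destinations. The function names and the metadata_blocks parameter are hypothetical, and the real allocation policy is not described in this post.

```python
def minimum_block_count(occupied, metadata_blocks):
    """Assumed lower bound: occupied data blocks plus metadata overhead."""
    return len(occupied) + metadata_blocks

def plan_block_out(occupied, old_count, new_count, metadata_blocks=0):
    """Map each occupied tail block to a free block below new_count.

    Raises ValueError, reporting the minimum achievable size, when the
    remaining region cannot hold the evicted blocks.
    """
    tail = sorted(b for b in occupied if new_count <= b < old_count)
    free = [b for b in range(new_count) if b not in occupied]
    if len(free) < len(tail) + metadata_blocks:
        minimum = minimum_block_count(occupied, metadata_blocks)
        raise ValueError(f"cannot shrink below {minimum} blocks")
    # The returned dict plays the role of the evict-mapping records.
    return dict(zip(tail, free))

occupied = {0, 1, 5, 7}
print(plan_block_out(occupied, old_count=8, new_count=6))
```

In the example, block 7 is the only occupied block in the removed range, so it is paired with the first free block in the surviving region.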
Conclusion
Containers provide the core management layer of APFS using several specialized subsystems. The mount procedure ensures crash recovery by always having access to at least one valid checkpoint. The transaction lifecycle provides atomic commits through carefully ordered writes with storage barriers. Container resize enables live storage management without unmounting. Future posts in this series will discuss each of these subsystems in more detail, including the Space Manager’s allocation algorithms and the Reaper’s multi-phase garbage collection.
Find an issue or technical inaccuracy in this post? Please file an issue so that it may be corrected.