Head First Dive Into The File System

Now that my Proxmox VE is up and running, I plan to run TrueNAS as a KVM virtual machine and pass through my SSDs to create a ZFS storage pool inside it.

I currently have four SSDs: one 512 GB and three 256 GB. I intend to add a quad M.2 PCIe adapter and enable PCIe bifurcation (a feature that allows a single physical PCIe slot to be logically divided into multiple smaller independent PCIe links, e.g. x16 → four x4).

With a total of 4 SATA SSDs and 4 NVMe SSDs available, I want to implement ZFS in the most efficient way possible.

I know ZFS has built-in caching features to speed up reads, but before setting it up, I’d like to understand ZFS properly. And even before that — what exactly is a file system?

Let’s start by exploring file systems in general, then move on to ZFS, and eventually cover creating our storage pool and setting up caching.

But even before starting with the file system, let's get an overview of the mass storage structure.

Mass Storage Structure

Hard Disk Drives:

HDD

In a hard disk, a drive motor spins the platters; the spindle speed is measured in rotations per minute (RPM) and largely determines the transfer rate. Current HDDs average 7,200 RPM, with transfer rates of about 250–300 MB/s. Some vendors use dual or multi-actuator designs to push sequential read/write speeds up to around 554 MB/s. Unlike the picture above, a dual actuator means two independent sets of read/write heads.

Nonvolatile Memory Devices

NVME

| Generation / Standard | Interface | Max Theoretical Speed | Real-world Sequential Speed | Notes / Examples |
|---|---|---|---|---|
| SATA I | SATA 1.5 Gb/s | ~150 MB/s | ~100–140 MB/s | Very old (pre-2007), rarely used today |
| SATA II | SATA 3 Gb/s | ~300 MB/s | ~250–280 MB/s | Obsolete for SSDs, backward compatible |
| SATA III (current) | SATA 6 Gb/s | ~600 MB/s | ~500–560 MB/s | Dominant standard since ~2010; e.g., Samsung 870 EVO, Crucial MX500, WD Blue 3D NAND |

| Generation | Interface | Max Sequential Speed | Notes / Examples |
|---|---|---|---|
| NVMe Gen 3 | PCIe 3.0 x4 | ~3,500 MB/s | Common in 2018–2022 drives |
| NVMe Gen 4 | PCIe 4.0 x4 | ~7,500 MB/s | e.g., Samsung 990 Pro, WD Black SN850X |
| NVMe Gen 5 | PCIe 5.0 x4 | ~14,900 MB/s | High-end consumer drives since ~2024–2025 |
| NVMe Gen 6 | PCIe 6.0 x4 | ~28,000–30,000 MB/s | Enterprise-only (as of late 2025) |

Note:

Although both use NAND flash memory, the difference in performance comes from the interface and its bandwidth. PCIe is like a multi-lane highway, whereas SATA is like driving on a narrow road. Also, SATA is based on the AHCI protocol (Advanced Host Controller Interface), which was originally designed for HDDs.

Storage Abstraction Layers

| Level | Name / Concept | Typical Size (2024–2025) | Managed by | What the OS sees / uses it for | Key purpose |
|---|---|---|---|---|---|
| 1 | Physical Sector | 512 bytes – 4 KiB | SSD / HDD firmware | Raw hardware address | Smallest unit the drive itself can read/write |
| 2 | Logical Block / LBA | Usually 512 B or 4 KiB | OS block layer + driver | `/dev/sda`, `\\.\PhysicalDrive0` | Uniform "blocks" the OS talks to |
| 3 | Partition | Many GiB – a few TiB | Partition table (GPT/MBR) | `/dev/sda1`, `C:`, `D:` | Divide the disk into independent areas |
| 4 (optional) | RAID / Storage Spaces / mdadm | Very flexible | Hardware RAID card / OS software RAID | One big "virtual disk" | Performance + redundancy across multiple drives |
| 5 (optional) | Logical Volume (LVM / LDM) | Very flexible | LVM (Linux), Storage Spaces / Dynamic Disks (Windows) | `/dev/mapper/vg-lv`, a bigger & resizable "disk" | Easy resizing, pooling many disks |
| 6 | File System | — | ext4 / NTFS / APFS / XFS / Btrfs / ReFS | Formatted volume ready for files | Organizes files & directories |
| 7 | Allocation Unit / Cluster | 1–64 KiB (most often 4 KiB) | File system itself | Smallest unit the file system allocates | Efficiency trade-off: speed vs. wasted space |
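You can peek at a few of these layers on any live Linux box. A quick sketch (device names and sizes will differ on your system):

```shell
# Which partition/volume backs the current directory (layers 3-6)
df -P .

# What block size the mounted file system uses (layer 7 territory)
stat -f -c 'preferred I/O block: %s bytes (fundamental block: %S bytes)' .
```

On a typical ext4 install both block sizes come back as 4096 bytes, which matches the "most often 4 KiB" figure in the table.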

To sum up, here is a simplified five-level view of the same abstraction:

  • Level 5 – You: “I want to save my homework”
  • Level 4 – Programs: “Please OS, save this as homework.docx”
  • Level 3 – File system: “Okay, I’ll use these clusters and remember the name”
  • Level 2 – Disk driver: “Here are some clean 4KB blocks to write into”
  • Level 1 – Real disk: “Fine… writing electricity/magnetic stuff now…”

Cluster / Allocation Unit

In the table above, the allocation unit/cluster (level 7) is introduced by the file system (level 6). Think of a cluster as the smallest piece of disk space that the file system is willing to give to a file.

Let's use a simple real-life analogy. Imagine you have a huge warehouse with shelves. Each shelf holds exactly 4 boxes, and you cannot rent half a shelf. When you want to store something, you must take a whole shelf (even if you only have one small box). That shelf is the cluster (or allocation unit), and the size of one shelf is the cluster size (most often 4 KB today).
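You can see the "whole shelf" effect yourself: a one-byte file still occupies an entire cluster. A minimal sketch (the 4096-byte figure assumes a 4 KB-cluster file system such as ext4; other file systems will report different numbers):

```shell
cd "$(mktemp -d)"   # scratch directory

# A file containing exactly one byte...
printf 'x' > tiny.txt
stat -c 'apparent size: %s bytes' tiny.txt   # 1 byte of content

# ...still costs a whole cluster on disk
du -B1 tiny.txt                              # typically 4096 bytes on ext4
```

The gap between apparent size and allocated size is exactly the "wasted space for tiny files" column in the table below.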

Cluster Size

| File System | Most common cluster size (block size) | Wasted space for tiny files | Typical use-case |
|---|---|---|---|
| NTFS (Windows) | 4 KB | medium | Normal computers, external drives, enterprise servers |
| exFAT | 32 KB – 128 KB | very high | USB flash drives, SD cards, cross-platform large files |
| ext4 (Linux) | 4 KB | medium | Linux servers & desktops (very common default) |
| APFS (macOS) | 4 KB | medium | Modern Macs (since 2017) |
| FAT32 | 4 KB – 32 KB | high | Old USB sticks, very small drives, legacy devices |
| ZFS | Variable (default recordsize 128 KB) | high for tiny files | NAS, servers, data centers, backups, large storage pools (FreeBSD, TrueNAS, Linux, macOS via OpenZFS) |

Where does ZFS fit in the storage abstraction layers?

ZFS combines several layers that are usually separate in the traditional stack.

Normal Stack: Physical disk → Partitions → (optional RAID/LVM) → File system (ext4/NTFS) → Files & folders

Physical disks (many) 
      ↓
ZFS Pool (vdevs → mirrors/RAID-Z/RAID-Z2/etc.)   # ZFS does the RAID part itself
      ↓
ZFS Datasets / Volumes / Snapshots / Clones      # ZFS acts as both "file system" and "logical volume manager"
      ↓
Files & Folders

ZFS uses the term recordsize, which is not exactly the same as a cluster/allocation unit. The default recordsize is 128 KB, and you can set it per dataset, from 512 B up to 1 MB out of the box (and up to 16 MB with the `zfs_max_recordsize` module tunable).
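Once a pool exists, recordsize is a per-dataset property you can tune. A hedged sketch of the commands (the pool name `tank` and the dataset names are hypothetical; requires OpenZFS and root):

```shell
# Show the current recordsize (default is 128K)
zfs get recordsize tank/data

# Hypothetical dataset for a database: small random I/O likes small records
zfs set recordsize=16K tank/postgres

# Hypothetical dataset for large media files: bigger records, fewer blocks
zfs set recordsize=1M tank/media
```

Note that recordsize only affects blocks written after the change; existing data keeps its old block size.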

Files and Directories

Now that we have a mental model of the storage abstraction layers, let's go one level deeper to better understand the file system.

Before we do that, a little bit of housekeeping:

  1. What is an offset? It simply means how far something is from the beginning of a file.

    
       file content: [byte 0] [byte 1] [byte 2] [byte 3] [byte 4]
                         H        E        L        L        O
       # Offset 0 = start of the file (first byte: 'H')
       # Offset 4 = 4 bytes from the beginning (fifth byte: 'O')
    
  2. Hard and Soft Link:

    
       # Create hard link
       ln original.txt backup.txt
    
       # Now both names point to same data
       # If you delete original.txt → backup.txt STILL WORKS
       # rm original.txt
       # cat backup.txt   ← still shows content
    
    
       # Create symbolic link
       ln -s /home/user/documents/real_file.txt ~/Desktop/shortcut.txt
    
       # If you move or delete real_file.txt → shortcut becomes broken
       # ls -l shows:
       # lrwxrwxrwx 1 user user  ... shortcut.txt -> /home/user/documents/real_file.txt
    
  3. Bitmap: A simple map made of 0s and 1s that shows whether something is free or used
    
       Disk has 1,000,000 blocks (very small example)
    
       Bitmap = 000101100011... (1,000,000 bits long = only ~125 KB of space!)
    
       Wherever you see a 0 → that block is free → you can use it
       Wherever you see a 1 → someone already uses that block
    
  4. Copy-On-Write: Don’t copy data right now when someone asks for a copy. Just pretend you copied it. Only make a real copy later — at the exact moment someone tries to change their version.
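The housekeeping items above are easy to poke at from a shell. A minimal runnable sketch (all file names are made up):

```shell
cd "$(mktemp -d)"          # scratch directory so nothing collides

# --- Offset: read the single byte 4 bytes from the start of a file ---
printf 'HELLO' > hello.txt
dd if=hello.txt bs=1 skip=4 count=1 2>/dev/null   # prints: O

# --- Hard link: two names, one inode, one copy of the data ---
ln hello.txt hard.txt
stat -c '%n inode=%i links=%h' hello.txt hard.txt  # same inode for both
rm hello.txt
cat hard.txt               # still prints HELLO; the data survives

# --- Symlink: a tiny file that just stores a path ---
ln -s hard.txt soft.txt
readlink soft.txt          # prints: hard.txt
rm hard.txt
cat soft.txt 2>&1 || echo "dangling symlink"
```

The hard link survives deletion of the original name because both names point at the same inode; the symlink breaks because it only stores a path.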

How do we go from a file name and offset to a block number?

A simple answer is that file systems implement a dictionary that maps keys (file name and offset) to values (block numbers on a device).

Depending on the file system type (NTFS, ext4, XFS, ZFS), this dictionary is organized as a linked list, a tree, a hash map, etc. This is well explained in the book Operating Systems: Principles and Practice, Chapter 13 (Files and Directories).
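To make the dictionary idea concrete, here is a toy FAT-style lookup (all cluster numbers are invented): the table `fat[c]` says which cluster follows cluster `c`, so translating an offset means following `offset / cluster_size` links from the file's first cluster.

```shell
# Toy FAT-style (file, offset) -> cluster translation. Purely illustrative.
awk 'BEGIN {
  # fat[c] = next cluster in the chain; -1 marks end-of-file
  fat[7] = 2; fat[2] = 9; fat[9] = -1      # file occupies clusters 7 -> 2 -> 9
  start = 7; cluster_size = 4096
  offset = 9000                            # the byte we want to read
  hops = int(offset / cluster_size)        # how many chain links to follow
  c = start
  for (i = 0; i < hops; i++) c = fat[c]
  printf "offset %d lives in cluster %d\n", offset, c
}'
```

This linear walk is exactly why FAT is slow for random access in big files, and why later file systems switched to trees.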

| | FAT (very old & simple, like old USB sticks) | FFS (Berkeley Fast File System, classic 1980s Unix) | NTFS (modern Windows default) | ZFS (very modern, used in big servers & data storage) |
|---|---|---|---|---|
| **Index structure**: how the FS remembers where the pieces of a file live on disk (like a table of contents) | Very simple linked list (each piece points to the next) | Fixed, somewhat asymmetric tree (multi-level index) | Dynamic tree (can grow/shrink smartly) | Very smart dynamic tree with Copy-On-Write (COW) |
| **Index structure granularity**: what size chunks it uses when pointing to file data | One block at a time | One block at a time | Bigger pieces called extents (good for large files) | One block at a time |
| **Free space management**: how it tracks which parts of the disk are still empty | Old-school table (the FAT array) that marks every block | Fixed bitmap (simple on/off map, fixed size) | Bitmap, but stored inside a special file (more flexible) | Very clever log-structured space map |
| **Locality heuristics**: tricks to keep related data physically close together on disk (faster reading) | Needs a separate defragmenter tool | Puts related things in block groups + reserves space | Tries to find the best-fit place + defrags when needed | "Write anywhere", but still tries to keep groups together |

The table above is just a quick summary of how different types of file systems work. Under the hood, a tremendous amount is happening, and I dared to dive deep into the underlying structures. My head is still spinning from it, but at the end of this rabbit hole, I finally saw the light—or perhaps I even found enlightenment, like Buddha did.

Now when I look at these file systems, I can see right through them. I understand how they actually work, where they fall short, the limitations of older file systems, and how the newer ones were specifically designed to overcome the shortcomings of their predecessors.

Here, I’m simply going to list diagrams of the file system structures. That way, if I ever need to revisit this topic in the future, I can quickly refresh my memory.

File Allocation Table (FAT)

FAT

Unix Fast File System (FFS)

FFS

Microsoft New Technology File System (NTFS)

NTFS

Zettabyte File System (ZFS): CoW-based

ZFS

The images above are taken from the book Operating Systems: Principles and Practice.

RAID

A stitch in time saves nine.

Redundant Array of Inexpensive/Independent Disks

  • RAID 0 – striping
  • RAID 1 – mirroring
  • RAID 5 – striping with parity
  • RAID 6 – striping with double parity
  • RAID 10 – combining mirroring and striping

The average time from when a disk fails until it has been replaced and rewritten is called the mean time to repair (MTTR).
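MTTR matters because redundancy only protects you while the array is degraded. A back-of-the-envelope sketch using the textbook approximation MTTDL ≈ MTTF² / (N · (N−1) · MTTR) for an N-disk mirror (the MTTF and MTTR figures below are illustrative, not from any datasheet):

```shell
# Rough mean time to data loss (MTTDL) for a 2-disk mirror
awk 'BEGIN {
  mttf = 1000000   # hours per disk (typical spec-sheet figure, illustrative)
  mttr = 24        # hours to detect, replace, and rebuild
  n    = 2         # disks in the mirror
  mttdl = mttf * mttf / (n * (n - 1) * mttr)
  printf "MTTDL ~ %.0f hours (~%.0f thousand years)\n", mttdl, mttdl / 8760 / 1000
}'
```

Notice that MTTR appears in the denominator: halving the repair time (hot spares, fast resilvering) improves reliability as much as doubling disk quality would.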

Here is a good read about RAID.

  1. RAID 0 offers great performance but is not fault tolerant.

    RAID0

  2. RAID 1 : Software RAID 1 solutions do not always allow a hot swap of a failed drive. That means the failed drive can only be replaced after powering down the computer it is attached to. For servers that are used simultaneously by many people, this may not be acceptable. Such systems typically use hardware controllers that do support hot swapping.

    RAID1

  3. RAID 5: Read data transactions are very fast while write data transactions are somewhat slower (due to the parity that has to be calculated).

    RAID5

  4. RAID 6: RAID 6 is like RAID 5, but the parity data are written to two drives. That means it requires at least 4 drives and can withstand 2 drives dying simultaneously.

    RAID6

  5. RAID 10: This is a nested or hybrid RAID configuration. It provides security by mirroring all data on secondary drives while using striping across each set of drives to speed up data transfers.

    RAID10
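For my own four 256 GB drives, the usable capacity under each level works out as follows (a sketch; real arrays lose a little more to metadata, and I'm treating RAID 1 here as one 4-way mirror):

```shell
n=4; size=256   # four drives, 256 GB each

echo "RAID 0  : $(( n * size )) GB usable, no failures tolerated"
echo "RAID 1  : $(( size )) GB usable (4-way mirror), up to 3 failures tolerated"
echo "RAID 5  : $(( (n - 1) * size )) GB usable, 1 failure tolerated"
echo "RAID 6  : $(( (n - 2) * size )) GB usable, 2 failures tolerated"
echo "RAID 10 : $(( n * size / 2 )) GB usable, 1 failure per mirror pair"
```

So with these drives the real decision is between 768 GB with single-disk protection (RAID 5 / RAID-Z1) and 512 GB with stronger protection (RAID 6 / RAID-Z2 or RAID 10).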

What's Next?

Let's go back to our Proxmox server, spin up a TrueNAS VM, and create a ZFS pool with these drives.

My Drives

Reference

https://www.cs.cornell.edu/courses/cs4410/2019sp/schedule/slides/Lecture22.pdf

https://www.kea.nu/files/textbooks/ospp/osppv4.pdf

https://cs.uml.edu/~bill/cs308/ZFS_File_Sys.pdf

https://www.prepressure.com/library/technology/raid

This post is licensed under CC BY 4.0 by the author.