The Uptime Engineer
👋 Hi, I am Yoshik Karnawat
Skip this and you'll be Googling "df shows space but can't create files"
Facts About Linux Filesystem
90% of public cloud workloads run on Linux - making filesystem security and reliability critical
ext4 reduces metadata overhead by 90% compared to ext3 using extents instead of block mapping
Linux ransomware attacks increased 62%, with filesystem-level exploits (buffer overflows, privilege escalation) as primary attack vectors
A single fragmented 50GB MySQL database can experience 30-50% slower read performance
You delete a 100GB database backup.
Gone in under a second.
Then you try to remove a 1KB log file and your system hangs.
Or you get "No space left on device" errors when df -h shows 40% free disk space.
Sounds broken.
But it's not.
It's filesystem architecture doing exactly what it was designed to do.
And understanding why changes how you debug production at 3 AM.
The three invisible layers
Your filesystem isn't just a place to store files.
It's three systems working together:
1. The catalog (inodes) - Metadata about every file: size, permissions, timestamps, location
2. The shelves (data blocks) - Where your actual data lives
3. The transaction log (journal) - A record of changes before they happen
Understanding these three layers is the difference between blindly running commands and knowing exactly why your system just broke.
Why "No space left" doesn't mean no space
Here's a production scenario you'll hit eventually:
$ df -h /var
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 60G 40G 60% /var
$ touch /var/test
touch: cannot touch '/var/test': No space left on device

You have 40GB free.
But you can't create files.
Check this:
$ df -i /var
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 6.5M 6.5M 0 100% /var
You ran out of inodes, not disk space.
What's an inode?
Every file on Linux has an inode - a data structure that stores everything except the filename and data:
File type (regular, directory, symlink)
Permissions (rwxr-xr-x)
Owner UID/GID
Size in bytes
Timestamps (ctime, mtime, atime)
Number of hard links
Pointers to data blocks
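All of these fields can be inspected directly with stat (the /tmp path below is illustrative):

```shell
# Create a throwaway file and dump its inode metadata
touch /tmp/inode-demo.txt
stat /tmp/inode-demo.txt
# stat prints the inode number, size, permissions, link count,
# owner UID/GID, and the Access/Modify/Change timestamps
```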
The filename is stored separately in directory entries.
This separation enables hard links - multiple filenames pointing to the same inode:
echo "production secrets" > /etc/app/config.conf
ln /etc/app/config.conf /backup/config.conf
ls -i /etc/app/config.conf /backup/config.conf
# Both show: 1234567

Deleting /etc/app/config.conf doesn't delete the data.
It just decrements the link count.
Only when the link count hits zero - and no process still holds the file open - does the data get freed.
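You can watch the link count drop in real time (the /tmp paths are illustrative):

```shell
# Two names, one inode: deleting one name only drops the link count
echo "production secrets" > /tmp/config.conf
ln /tmp/config.conf /tmp/config-backup.conf

stat -c '%h' /tmp/config.conf        # link count: 2
rm /tmp/config.conf                  # decrements the count, frees nothing
stat -c '%h' /tmp/config-backup.conf # link count: 1
cat /tmp/config-backup.conf          # data is still intact
```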
Why this breaks production
Millions of tiny log files (1KB each) can consume all inodes while using minimal disk space.
Each file needs one inode - regardless of size.
A 1-byte file and a 1GB file both consume exactly one inode.
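A quick sketch of that claim, using throwaway files under /tmp:

```shell
# File size has no effect on inode consumption: one file, one inode
printf 'x' > /tmp/tiny                                   # 1 byte
dd if=/dev/zero of=/tmp/big bs=1M count=100 2>/dev/null  # ~100 MB
stat -c '%n: %s bytes, inode %i' /tmp/tiny /tmp/big
# Each file occupies exactly one inode, whatever its size
```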
Example:
An application generating thousands of small files (session files, cached images, thumbnails). Each small file still consumes an inode. Once inodes are gone, no new files can be created, even with free disk space available.
The fix:
# Find the inode hogs
find /var -xdev -type f | cut -d "/" -f 2-3 | sort | uniq -c | sort -rn | head -10
# Clean up old logs
find /var/log -type f -name "*.log" -mtime +30 -delete

Why deleting 100GB is instant
dd if=/dev/zero of=/tmp/bigfile bs=1M count=100000
rm /tmp/bigfile  # Returns instantly

What actually happens:
rm doesn't touch the data blocks. It just:
Removes the directory entry
Decrements the inode link count to zero
Marks blocks as "free" in metadata
The kernel deallocates blocks asynchronously in the background.
No data is touched during the delete operation.
Why "mv" within a filesystem is instant
mv /data/huge.tar.gz /backup/huge.tar.gz  # Instant

Why:
The inode number doesn't change.
Data blocks don't move.
mv only updates directory entries - removes from /data, adds to /backup.
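You can verify this yourself - the inode number survives the move (directories under /tmp are illustrative, and both must be on the same filesystem):

```shell
# mv within one filesystem: same inode before and after
mkdir -p /tmp/src /tmp/dst
echo data > /tmp/src/huge.tar.gz
before=$(stat -c '%i' /tmp/src/huge.tar.gz)
mv /tmp/src/huge.tar.gz /tmp/dst/huge.tar.gz
after=$(stat -c '%i' /tmp/dst/huge.tar.gz)
[ "$before" = "$after" ] && echo "same inode: $before"
```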
But moving across filesystems forces a full copy:
mv /data/huge.tar.gz /mnt/external/huge.tar.gz  # Slow (full copy + delete)

Journaling: The system that saves you from corruption
Here's the problem journaling solves:
You run:
mv /tmp/large-file.log /var/log/archive/

The filesystem must:
1. Remove the directory entry in /tmp
2. Update the inode link count
3. Add a directory entry in /var/log/archive
4. Update both directories' metadata
5. Update the free block bitmaps
6. Flush changes to disk
Power fails between step 2 and step 3:
File exists nowhere
Inode shows link count 0, but blocks aren't freed
Filesystem is corrupted
Without journaling, fsck scans the entire disk - taking hours.
How journaling actually works
ext4 uses a write-ahead log (journal):
Transaction sequence:
Log the operation to the journal (fast sequential write)
Mark transaction as "committed"
Write data to filesystem (slower random writes)
Clear journal entry
On crash recovery:
The kernel replays committed-but-unwritten transactions from the journal and discards incomplete ones.
Recovery takes seconds instead of hours.
The three journal modes
data=journal (Safest, Slowest)
Logs both metadata AND file data
Everything written twice
Use for: Financial systems, critical databases
data=ordered (Default, Balanced)
Logs only metadata
Data written before metadata is journaled
Use for: Most production systems (Ubuntu/RHEL default)
data=writeback (Fastest, Least Safe)
Logs only metadata
Data can be written anytime
Risk: After crash, files can contain garbage
Use for: Scratch space, temp directories
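A mode is selected with the data= mount option. A sketch of an /etc/fstab entry for a scratch filesystem (the device and mountpoint are illustrative):

```
# /etc/fstab - scratch space, speed over crash safety
/dev/sdb1  /scratch  ext4  defaults,data=writeback  0  2
```

Note that changing data= on a mounted filesystem requires a remount.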
Check your mode:
tune2fs -l /dev/sda1 | grep "Default mount options"

Why ext4 outperforms ext3
Extents vs Block Mapping
Old way (ext3):
A 1GB file = thousands of 4KB block pointers stored in metadata
Modern way (ext4):
A 1GB file = single extent: "blocks 1000-250000"
Result:
90%+ reduction in metadata overhead
Faster sequential reads
Less fragmentation
Check fragmentation:
filefrag /var/log/syslog
# Output: /var/log/syslog: 1 extent found (good)
# Output: /var/log/messages: 347 extents found (fragmented)

Why this matters:
A heavily fragmented 50GB MySQL database with 10,000 extents will see 30-50% slower read performance vs the same file in 50 extents.
Advanced ext4 features
Delayed Allocation
ext4 doesn't allocate blocks immediately when you write.
It keeps data in memory and decides block allocation only during flush.
Benefits:
Reduces fragmentation
Improves write performance
Risk:
Data can be lost if system crashes before flush
Applications assuming write() is durable can lose data
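Where durability matters, flush explicitly. One sketch, assuming GNU coreutils (whose sync accepts file arguments and fsyncs just those files; the path is illustrative):

```shell
# write() may only reach the page cache under delayed allocation;
# sync FILE forces this one file's data down to disk
echo "critical record" > /tmp/journal-entry.txt
sync /tmp/journal-entry.txt
```

Applications that need the same guarantee in code call fsync() on the file descriptor before treating the write as complete.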
Faster fsck with Uninitialized Block Groups
ext4 marks unallocated areas as "uninitialized."
During fsck, these areas are skipped entirely.
Result: Checking a 10TB filesystem that's 20% full takes 80% less time.
The bottom line
Understanding filesystems isn't academic. It's the difference between systems that survive crashes and systems that corrupt data.
The filesystem is invisible until it breaks.
Now you know how to see it before that happens.
