The Uptime Engineer
👋 Hi, I am Yoshik Karnawat
Skip this and you'll be Googling "df shows space but can't create files"
Facts About Linux Filesystem
90% of public cloud workloads run on Linux - making filesystem security and reliability critical
ext4 reduces metadata overhead by 90% compared to ext3 using extents instead of block mapping
Linux ransomware attacks increased 62%, with filesystem-level exploits (buffer overflows, privilege escalation) as primary attack vectors
A single fragmented 50GB MySQL database can experience 30-50% slower read performance
You delete a 100GB database backup.
Gone in under a second.
Then you try to remove a 1KB log file and your system hangs.
Or you get "No space left on device" errors when df -h shows 40% free disk space.
Sounds broken.
But it's not.
It's filesystem architecture doing exactly what it was designed to do.
And understanding why changes how you debug production at 3 AM.
The three invisible layers
Your filesystem isn't just a place to store files.
It's three systems working together:
1. The catalog (inodes) - Metadata about every file: size, permissions, timestamps, location
2. The shelves (data blocks) - Where your actual data lives
3. The transaction log (journal) - A record of changes before they happen
Understanding these three layers is the difference between blindly running commands and knowing exactly why your system just broke.
Why "No space left" doesn't mean no space
Here's a production scenario you'll hit eventually:
$ df -h /var
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 60G 40G 60% /var
$ touch /var/test
touch: cannot touch '/var/test': No space left on device

You have 40GB free.
But you can't create files.
Check this:
$ df -i /var
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 6.5M 6.5M 0 100% /var
You ran out of inodes, not disk space.
What's an inode?
Every file on Linux has an inode - a data structure that stores everything except the filename and data:
File type (regular, directory, symlink)
Permissions (rwxr-xr-x)
Owner UID/GID
Size in bytes
Timestamps (ctime, mtime, atime)
Number of hard links
Pointers to data blocks
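All of these fields can be inspected directly with stat (the /tmp path below is illustrative):

```shell
# Create a throwaway file and dump its inode metadata
touch /tmp/inode-demo.txt
stat /tmp/inode-demo.txt
# stat prints the inode number, size, permissions, link count,
# owner UID/GID, and the Access/Modify/Change timestamps
```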
The filename is stored separately in directory entries.
This separation enables hard links - multiple filenames pointing to the same inode:
echo "production secrets" > /etc/app/config.conf
ln /etc/app/config.conf /backup/config.conf
ls -i /etc/app/config.conf /backup/config.conf
# Both show: 1234567

Deleting /etc/app/config.conf doesn't delete the data.
It just decrements the link count.
Only when the link count hits zero - and no process still holds the file open - does the data get freed.
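You can watch the link count drop in real time (the /tmp paths are illustrative):

```shell
# Two names, one inode: deleting one name only drops the link count
echo "production secrets" > /tmp/config.conf
ln /tmp/config.conf /tmp/config-backup.conf

stat -c '%h' /tmp/config.conf        # link count: 2
rm /tmp/config.conf                  # decrements the count, frees nothing
stat -c '%h' /tmp/config-backup.conf # link count: 1
cat /tmp/config-backup.conf          # data is still intact
```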
Why this breaks production
Millions of tiny log files (1KB each) can consume all inodes while using minimal disk space.
Each file needs one inode - regardless of size.
A 1-byte file and a 1GB file both consume exactly one inode.
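A quick sketch of that claim, using throwaway files under /tmp:

```shell
# File size has no effect on inode consumption: one file, one inode
printf 'x' > /tmp/tiny                                   # 1 byte
dd if=/dev/zero of=/tmp/big bs=1M count=100 2>/dev/null  # ~100 MB
stat -c '%n: %s bytes, inode %i' /tmp/tiny /tmp/big
# Each file occupies exactly one inode, whatever its size
```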
Example:
An application generating thousands of small files (session files, cached images, thumbnails). Each small file still consumes an inode. Once inodes are gone, no new files can be created, even with free disk space available.
The fix:
# Find the inode hogs
find /var -xdev -type f | cut -d "/" -f 2-3 | sort | uniq -c | sort -rn | head -10
# Clean up old logs
find /var/log -type f -name "*.log" -mtime +30 -delete

Why deleting 100GB is instant
dd if=/dev/zero of=/tmp/bigfile bs=1M count=100000
rm /tmp/bigfile  # Returns instantly

What actually happens:
rm doesn't touch the data blocks. It just:
Removes the directory entry
Decrements the inode link count to zero
Marks blocks as "free" in metadata
The kernel deallocates blocks asynchronously in the background.
No data is touched during the delete operation.
Why "mv" within a filesystem is instant
mv /data/huge.tar.gz /backup/huge.tar.gz  # Instant

Why:
The inode number doesn't change.
Data blocks don't move.
mv only updates directory entries - removes from /data, adds to /backup.
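You can verify this yourself - the inode number survives the move (directories under /tmp are illustrative, and both must be on the same filesystem):

```shell
# mv within one filesystem: same inode before and after
mkdir -p /tmp/src /tmp/dst
echo data > /tmp/src/huge.tar.gz
before=$(stat -c '%i' /tmp/src/huge.tar.gz)
mv /tmp/src/huge.tar.gz /tmp/dst/huge.tar.gz
after=$(stat -c '%i' /tmp/dst/huge.tar.gz)
[ "$before" = "$after" ] && echo "same inode: $before"
```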
But moving across filesystems forces a full copy:
mv /data/huge.tar.gz /mnt/external/huge.tar.gz  # Slow (full copy + delete)

Journaling: The system that saves you from corruption
Here's the problem journaling solves:
You run:
mv /tmp/large-file.log /var/log/archive/

The filesystem must:
1. Remove the directory entry in /tmp
2. Update the inode link count
3. Add a directory entry in /var/log/archive
4. Update both directories' metadata
5. Update the free block bitmaps
6. Flush changes to disk
Power fails between step 2 and step 3:
File exists nowhere
Inode shows link count 0, but blocks aren't freed
Filesystem is corrupted
Without journaling, fsck scans the entire disk - taking hours.
How journaling actually works
ext4 uses a write-ahead log (journal):
Transaction sequence:
Log the operation to the journal (fast sequential write)
Mark transaction as "committed"
Write data to filesystem (slower random writes)
Clear journal entry
On crash recovery:
The kernel replays committed-but-unwritten transactions from the journal and discards incomplete ones.
Recovery takes seconds instead of hours.
The three journal modes
data=journal (Safest, Slowest)
Logs both metadata AND file data
Everything written twice
Use for: Financial systems, critical databases
data=ordered (Default, Balanced)
Logs only metadata
Data written before metadata is journaled
Use for: Most production systems (Ubuntu/RHEL default)
data=writeback (Fastest, Least Safe)
Logs only metadata
Data can be written anytime
Risk: After crash, files can contain garbage
Use for: Scratch space, temp directories
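A mode is selected with the data= mount option. A sketch of an /etc/fstab entry for a scratch filesystem (the device and mountpoint are illustrative):

```
# /etc/fstab - scratch space, speed over crash safety
/dev/sdb1  /scratch  ext4  defaults,data=writeback  0  2
```

Note that changing data= on a mounted filesystem requires a remount.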
Check your mode:
tune2fs -l /dev/sda1 | grep "Default mount options"

Why ext4 outperforms ext3
Extents vs Block Mapping
Old way (ext3):
A 1GB file = thousands of 4KB block pointers stored in metadata
Modern way (ext4):
A 1GB file = single extent: "blocks 1000-250000"
Result:
90%+ reduction in metadata overhead
Faster sequential reads
Less fragmentation
Check fragmentation:
filefrag /var/log/syslog
# Output: /var/log/syslog: 1 extent found (good)
# Output: /var/log/messages: 347 extents found (fragmented)

Why this matters:
A heavily fragmented 50GB MySQL database with 10,000 extents will see 30-50% slower read performance vs the same file in 50 extents.
Advanced ext4 features
Delayed Allocation
ext4 doesn't allocate blocks immediately when you write.
It keeps data in memory and decides block allocation only during flush.
Benefits:
Reduces fragmentation
Improves write performance
Risk:
Data can be lost if system crashes before flush
Applications assuming write() is durable can lose data
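Where durability matters, flush explicitly. One sketch, assuming GNU coreutils (whose sync accepts file arguments and fsyncs just those files; the path is illustrative):

```shell
# write() may only reach the page cache under delayed allocation;
# sync FILE forces this one file's data down to disk
echo "critical record" > /tmp/journal-entry.txt
sync /tmp/journal-entry.txt
```

Applications that need the same guarantee in code call fsync() on the file descriptor before treating the write as complete.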
Faster fsck with Uninitialized Block Groups
ext4 marks unallocated areas as "uninitialized."
During fsck, these areas are skipped entirely.
Result: Checking a 10TB filesystem that's 20% full takes 80% less time.
The bottom line
Understanding filesystems isn't academic. It's the difference between systems that survive crashes and systems that corrupt data.
The filesystem is invisible until it breaks.
Now you know how to see it before that happens.
