How much room does a file take on disk? Surely a trivial question, right? And who cares anyway, when my hard disk has 160GB to spare? But wait, just installing Windows Vista ate half of it :) So the question may be relevant after all.
First let's summarize the basics. Each storage device, be it a hard disk, floppy or CD, has some physical and logical storage elements. Unlike RAM, which is addressable down to each individual byte, disks have a larger basic unit, the so-called sector, which is typically 512 bytes. And for reasons that are not 100% clear to me, the minimum allocation unit is even larger: a cluster is worth a few sectors, which means a few KBytes. There must be hardware and OS limitations that cause this coarse addressability. The formatting plays a role too: NTFS partitions normally default to a cluster size of 4096 bytes, whereas FAT32 pushes the cluster size up to 32KB depending on how large your drive is.
So whereas file sizes are counted in bytes, their physical storage uses clusters. Even a 1-byte file will take a full cluster on disk, which can be up to 32KB (talk about a waste of space!). If you turn on the "size on disk" column in xplorer² (View > Select columns command) you will see the difference clusters make for small files.
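The rounding is simple arithmetic, and a minimal sketch makes the waste visible. The function name `size_on_disk` and the sample sizes are my own illustration, and the calculation ignores the NTFS extras discussed further down (compression, sparse files, streams).

```python
def size_on_disk(file_size: int, cluster_size: int = 4096) -> int:
    """Round a file size in bytes up to a whole number of clusters."""
    if file_size < 0 or cluster_size <= 0:
        raise ValueError("file_size must be >= 0 and cluster_size > 0")
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


# Compare a typical NTFS cluster (4KB) with a large FAT32 cluster (32KB)
for size in (1, 4096, 4097, 100_000):
    for cluster in (4096, 32 * 1024):
        used = size_on_disk(size, cluster)
        print(f"{size:>7} bytes, {cluster:>5}-byte clusters: "
              f"{used:>7} on disk, {used - size:>6} wasted")
```

The wasted part is at most one cluster per file, so it hurts most when you have thousands of tiny files on a drive with big clusters.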
NTFS 5 has brought some extra twists to this basic picture. Jeffrey Richter's classic article on Windows NT 5.0 File System talks about several features that affect file size: hard links, compressed and sparse files, and alternate data streams (actually not all of these are new in NTFS5; NTFS4 had ADS too). Let's consider each one in turn.
Hard links are a UNIX concept ported to Windows. At first sight they are similar to Windows shortcuts that point to another file. But a hard link is really a second name for the target file; once you create one, it is indistinguishable from the "real" file. It's like having 2 or more instances of the same file; change one and all the others are automatically updated. The smart thing about hard links is that each new "instance" does not consume extra disk space. The file data are stored once and each link merely occupies a small control node with reference counting and name information (each link can have a different name). Sadly hard links are limited to files within the same partition, but I've read that in Vista the other UNIX type of link, the symbolic link, is available, which overcomes this limitation. This should be coming in a future xplorer² version.
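To see the "same file, two names" behaviour for yourself, here is a small sketch using Python's standard `os.link`. The file names and the helper `demo_hard_link` are made up for the illustration, and it assumes the temporary folder sits on a filesystem that supports hard links (NTFS does).

```python
import os
import tempfile


def demo_hard_link(directory):
    """Create a file plus a hard link to it; return (link count, text read via the link)."""
    original = os.path.join(directory, "original.txt")
    alias = os.path.join(directory, "alias.txt")

    with open(original, "w") as f:
        f.write("first version")

    # A second name for the same data; both names must live on the same volume
    os.link(original, alias)

    # Overwrite the contents through the original name...
    with open(original, "w") as f:
        f.write("second version")

    # ...and read them back through the other name
    with open(alias) as f:
        text = f.read()

    return os.stat(original).st_nlink, text


with tempfile.TemporaryDirectory() as tmp:
    links, text = demo_hard_link(tmp)
    print("names pointing at the data:", links)
    print("read through the alias:", text)
```

The link count is the reference counting mentioned above: the data is only released when the last name is deleted.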
File compression also affects storage space. Think zip files but at the filesystem level. When a file has the compressed attribute it is stored using less space. This is all transparent to the user and to all programs that use the file. Behind the scenes NTFS decompresses the file when it is read and recompresses it when changes are saved. You don't need special software to deal with such compressed files, so you save space without losing convenient access. I guess the downside is a slight performance penalty and degraded I/O speed from the background (de)compression. Compression is still constrained by the cluster size limitation, but it can make a difference for large files.
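If you want to check for the compressed attribute from a script, a rough sketch follows. It relies on `st_file_attributes`, which Python only fills in on Windows, so on other systems the check simply reports False. The helper names `has_compressed_flag` and `is_compressed` are my own.

```python
import os
import stat


def has_compressed_flag(attributes: int) -> bool:
    """True if the Windows attribute bits include the 'compressed' flag."""
    return bool(attributes & stat.FILE_ATTRIBUTE_COMPRESSED)


def is_compressed(path: str) -> bool:
    """Check a file or folder; st_file_attributes exists only on Windows."""
    attributes = getattr(os.stat(path), "st_file_attributes", 0)
    return has_compressed_flag(attributes)


print(is_compressed(os.curdir))
```

This only tells you that the attribute is set, not how much space was actually saved.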
Sparse files are a curiosity which I will only discuss in passing. You can have a file that declares a large size for supposed future needs, but only the parts that actually hold data take up disk space. It is like the difference between reserving and committing virtual memory. So you end up with a file that is reported as larger than the space it actually occupies. I am not aware of any tools to manage sparse files and anyway can't think of any use for them other than as pranks or party tricks.
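To see the gap between reported size and occupied space, here is a hedged sketch. Note the swap: it measures allocation with `st_blocks`, which is a POSIX facility and is not available in Python on Windows, so there the second number comes back as None. Also, as far as I know NTFS only treats a file as sparse once it has been explicitly marked as such, so the seek trick below demonstrates the idea mainly on UNIX-style filesystems. The helper name `logical_and_allocated` and the file name are hypothetical.

```python
import os
import tempfile
from typing import Optional, Tuple


def logical_and_allocated(path: str) -> Tuple[int, Optional[int]]:
    """Return (reported size, bytes actually allocated) for a file.

    The allocated figure comes from st_blocks, counted in 512-byte units;
    it is None on platforms that do not provide st_blocks.
    """
    info = os.stat(path)
    blocks = getattr(info, "st_blocks", None)
    allocated = None if blocks is None else blocks * 512
    return info.st_size, allocated


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "holey.bin")
    with open(path, "wb") as f:
        f.seek(10 * 1024 * 1024 - 1)  # jump ahead without writing anything
        f.write(b"\0")                # a single real byte at the very end
    print(logical_and_allocated(path))
```

Where the filesystem supports holes, the second number should come out far smaller than the first.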
Finally, alternate data streams (ADS) contribute to the file size confusion. One way to view ADS is as separate chapters within a file. What you usually regard as the file is the main stream where your data is held, e.g. your text. With NTFS you can associate other streams with a file or folder. So file.txt can carry a stream called other, and it will be addressed as file.txt:other, using a ":" to separate the stream name. Not many system tools are ADS-aware, and Explorer certainly isn't. When it lists a file it won't tell you about any streams it may have or how big they are. An ADS can be as big as the "real" file. xplorer² can tell you many things about ADS; it has a detailed view column called "Streams" that hints at ADS existence. For more details you can use the Actions > ADS submenu to check stream contents.
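The file.txt:other naming works with ordinary file functions too, which a short sketch can show. The helpers `stream_path` and `write_stream` are names I made up, and the demo part only runs on Windows and assumes the temporary folder is on an NTFS volume; elsewhere the ":" is just part of an ordinary file name.

```python
import os
import tempfile


def stream_path(path: str, stream: str) -> str:
    """Build the 'file:stream' name that addresses an alternate data stream."""
    if not stream or any(ch in stream for ch in ':\\/'):
        raise ValueError("invalid stream name")
    return f"{path}:{stream}"


def write_stream(path: str, stream: str, text: str) -> None:
    """Write text into a named stream of an existing file (NTFS only)."""
    with open(stream_path(path, stream), "w") as f:
        f.write(text)


if os.name == "nt":
    with tempfile.TemporaryDirectory() as tmp:
        main = os.path.join(tmp, "file.txt")
        with open(main, "w") as f:
            f.write("main text")
        write_stream(main, "other", "a hidden chapter, longer than the main text")
        # The reported size covers the main stream only
        print(os.path.getsize(main))
```

That last line is the whole confusion in a nutshell: the usual size figure says nothing about what the extra streams hold.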
In the past ADS were used for things like file comments and other summary information. From Vista onwards Microsoft took a U-turn and is making my life harder. I'm still waiting to see the first malware to hide behind this obscure mechanism. But remember that the xplorer² "size on disk" column shows space occupied by streams too, so it will be a giveaway for any aspiring miscreants.
I conclude this short post with a table that summarizes the file size-affecting NTFS features and how you can detect and change them if need be. The accompanying brief demonstration video does some detective work on NTFS file storage.
© 2002–2007 Nikos Bozinis, all rights reserved