Exploring the Filesystem

Exploring the Filesystem Portion of the Namespace

We will look at two different ways of exploring the shell namespace. Typically this involves enumerating the contents of folders and subfolders, and obtaining basic information like display names, date stamps, sizes and attributes. In this section we concentrate on the regular filesystem portion; the full COM-based exploration of all areas of the namespace is the subject of the following section.

Note that no single approach delivers the full picture; shell COM and the filesystem API provide complementary information, albeit overlapping at times, as will become evident shortly.

Topics: The filesystem part | Pattern matching | Notifications of change | Parsing pathnames

Exploring the filesystem part

Let's start with the easy stuff. FindFirstFile, FindNextFile and FindClose are the only API functions you need to enumerate regular filesystem folders. It all starts with FindFirstFile creating a search "handle" for some path. This handle is then passed to FindNextFile, which is called repeatedly to enumerate all the contents. Finally, FindClose wraps things up by releasing the handle and the associated system resources. Here's an example:

    #include <windows.h>
    #include <tchar.h>
    #include <Shlwapi.h>   // for PathAppend(); link with shlwapi.lib
    #include <iostream>
    using namespace std;

    void EnumerateFolderFS(LPCTSTR path)
    {
        TCHAR searchPath[MAX_PATH];
        // a wildcard needs to be added to the end of the path, e.g. "C:\*"
        lstrcpy(searchPath, path);
        PathAppend(searchPath, _T("*")); // defined in shell lightweight API (v4.71)

        WIN32_FIND_DATA ffd; // file information struct
        HANDLE sh = FindFirstFile(searchPath, &ffd);
        if(INVALID_HANDLE_VALUE == sh) return; // not a proper path i guess

        // enumerate all items; NOTE: FindFirstFile has already got info for an item
        do {
            cout << "Name = " << ffd.cFileName << endl;
            cout << "Type = " << ( (ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                                   ? "dir\n" : "file\n" );
            cout << "Size = " << ffd.nFileSizeLow << endl;
        } while (FindNextFile(sh, &ffd));

        FindClose(sh);
    }

FindFirstFile requires a wildcard to be appended to the path name, and PathAppend is used for that. You can also write your own function to join paths if you don't want your code to depend on version 4.71. Note that although it is possible to use a partial wildcard like "*.exe" to get a subset of the folder contents, it's better to read all the files and do your own filtering, as explained in the pattern matching section below.
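As a rough illustration of the "write your own function to join paths" option, here is a minimal sketch that needs nothing from shlwapi. JoinPath is a hypothetical helper name, not a Windows API, and the sketch assumes narrow (char) strings held in std::string; a TCHAR build would use the matching string type instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of any Windows API): joins a folder path and
// a leaf name so that exactly one backslash separates them.
std::string JoinPath(const std::string& folder, const std::string& leaf)
{
    if (folder.empty())
        return leaf;

    std::string result = folder;
    // add a separator only if the folder does not already end with one
    if (result[result.size() - 1] != '\\')
        result += '\\';

    // skip any leading separators on the leaf to avoid doubled backslashes
    std::string::size_type start = 0;
    while (start < leaf.size() && leaf[start] == '\\')
        ++start;

    result.append(leaf, start, std::string::npos);
    return result;
}
```

Calling JoinPath("C:\\Temp", "*") gives the search string that FindFirstFile expects. Because the result is a std::string, the sketch also sidesteps the fixed MAX_PATH buffer used in the sample above.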

The information for each item is placed in a WIN32_FIND_DATA structure. There is much more in this struct than the simple example above uses, such as the remaining file attributes (read-only, system, etc.), all three dates (modified, created, accessed) and some "reserved" stuff (call me a conspiracy theorist, but whenever I see such reserved members or function arguments I can't help thinking that M/S are withholding information from us, which they themselves use to get a competitive advantage :).

The byte size is held in two DWORD members, nFileSizeHigh and nFileSizeLow. The simple reason behind this oddity is that nowadays a 32-bit number may not be enough to hold information for huge files over 4GB. This calls for this double-DWORD solution that should be ok for the next couple of years or so <g>. To manage 64-bit numbers use the __int64 data type of Visual Studio:

    // using the WIN32_FIND_DATA ffd from previous example...
    __int64 size = ( ((__int64)ffd.nFileSizeHigh) << 32 ) + ffd.nFileSizeLow;
    printf("Big size = %I64d bytes.\n", size);

    /* NOTE
     * If only m/s had defined nFileSizeLow first in WIN32_FIND_DATA,
     * so that the order would be nFileSizeLow, nFileSizeHigh, then
     * you could do a direct conversion to a 64-bit value as in:
     *     size = *((__int64*)(&ffd.nFileSizeLow));
     * which looks kinda cooler, no? <g>
     */

The above EnumerateFolderFS() sample can easily be converted to recurse into subfolders. Within the do-while loop, whenever you hit an item with FILE_ATTRIBUTE_DIRECTORY set, add its name to the end of the path and call yourself. The only details you need to watch out for are the '.' and '..' MS/DOS remnants that correspond to the current and the parent folder. You wouldn't want to recurse into those, unless you want to blow the stack. Here's the modified do-while loop of the previous example:

    do {
        // don't process the '.' and '..' pseudo-items
        if(ffd.cFileName[0] == '.') { // why bother with lstrcmp() ?
            if( ffd.cFileName[1] == 0 ||
                (ffd.cFileName[1] == '.' && ffd.cFileName[2] == 0) )
                continue; // skip this
        }

        // here process the item as you wish...

        if(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            // add this name to the path and recurse into
            TCHAR innerPath[MAX_PATH];
            lstrcpy(innerPath, path);
            PathAppend(innerPath, ffd.cFileName);
            EnumerateFolderFS(innerPath); // call ourselves again
        }
    } while (FindNextFile(sh, &ffd));

The filesystem approach to reading folder contents has two main advantages, in that it provides information that is not available through shell COM. Both have to do with file date stamps:

You get valid information in the ftCreationTime and ftLastAccessTime fields of WIN32_FIND_DATA, for the creation and last access time, respectively. The shell equivalent, SHGetDataFromIDList, will not fill these members. Apparently this is an oversight going back to Windows 95, when M/S were trying to save on memory usage so that explorer could run on machines with 8MB of RAM.
All file times are accurate to 1 second when the underlying filesystem supports it (e.g. NTFS). SHGetDataFromIDList, on the other hand, always rounds times to an even number of seconds. This may seem like a small detail, but it affects time-based comparisons across different filesystems, e.g. NTFS and FAT32.
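To make the comparison problem concrete, here is a small sketch of a tolerant time comparison. It relies only on the fact that a FILETIME is a 64-bit count of 100-nanosecond ticks split into dwHighDateTime and dwLowDateTime. The helper names and the default 2-second tolerance are my own assumptions, chosen to absorb the rounding described above, and plain integers stand in for the FILETIME struct so the snippet has no Windows dependency.

```cpp
#include <cassert>
#include <cstdint>

// FILETIME counts 100-nanosecond ticks, so one second is 10,000,000 ticks.
const std::uint64_t kTicksPerSecond = 10000000ULL;

// Combine the two 32-bit halves of a FILETIME
// (dwHighDateTime, dwLowDateTime) into a single 64-bit tick count.
std::uint64_t FileTimeTicks(std::uint32_t high, std::uint32_t low)
{
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Hypothetical helper: true when two stamps differ by no more than
// toleranceSeconds. The 2-second default is an assumption.
bool SameFileTime(std::uint64_t a, std::uint64_t b,
                  unsigned toleranceSeconds = 2)
{
    const std::uint64_t diff = (a > b) ? (a - b) : (b - a);
    return diff <= toleranceSeconds * kTicksPerSecond;
}
```

A synchronisation tool comparing a file on NTFS with its copy on FAT32 would treat the two as unchanged when SameFileTime returns true, instead of testing the stamps for strict equality.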

On the down side, FindFirstFile and the related APIs are useless for exploring virtual folders. That is the job of shell COM, which will be presented shortly.

Pattern matching: filtering filenames

Pattern matching is one of those tricky programming subjects revolving around recursion. I have seen classes that perform UNIX grep-like pattern matching whose size runs to thousands of lines. Fortunately, MS/DOS-type wildcard matching is much simpler, involving just two special characters:

char	Description and use
*	Matches zero or more occurrences of any characters. Example: "*.cpp" will match "1.cpp", "12.cpp" and so on.
?	Matches zero or one occurrences of any character. Example: "a?.cpp" will match "a1.cpp", "aa.cpp", but not "a12.cpp"

These two special characters can be combined to form complex wildcards like "*.c??". This MS/DOS matching is not very powerful, but you don't need a PhD in astrophysics to implement it in your program. Windows doesn't offer any API for wildcard matching, but there are loads of ready-made functions that can be pasted directly into your code. I won't present any sample code because it can be lengthy and difficult to understand. I recommend grabbing a matching function from the online docs (download the sample and look for MatchPattern in file "common.c") and using it without much quibbling. It's one of those things that you (I) can't be ar**ed understanding in full before using them.

I recommend using your own wildcard matching code instead of letting FindFirstFile do the work for you for a number of reasons:

You avoid reading the folder twice, once to get the subfolders and once for the files that match some wildcard. Usually you only filter out files, but keep all the subfolders that users need for navigating the namespace.

COM doesn't provide any similar facility for virtual folders. You'd have to develop name matching code yourself, anyway.

You have more control over the filtering process. For example, you can use two or more wildcards at the same time, or even provide exclusion filters, i.e. filter out names that match a wildcard.
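The points above can be sketched as a small filter that always keeps folders, accepts several include wildcards and supports exclusion wildcards. NameFilter and ShouldShow are hypothetical names, and the matcher is passed in as a parameter on purpose: any function taking (pattern, name) and returning bool will do, including whichever MS/DOS-style matcher you grabbed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical filter description: a file is shown when it matches no
// exclude pattern and at least one include pattern (or when the include
// list is empty).
struct NameFilter {
    std::vector<std::string> include;
    std::vector<std::string> exclude;
};

// 'match' is any callable taking (pattern, name) and returning bool.
template <typename Match>
bool ShouldShow(const NameFilter& filter, const std::string& name,
                bool isFolder, Match match)
{
    if (isFolder)
        return true;   // subfolders are kept so the user can navigate

    for (std::size_t i = 0; i < filter.exclude.size(); ++i)
        if (match(filter.exclude[i], name))
            return false;   // exclusion filters win

    if (filter.include.empty())
        return true;   // no include list means "show everything else"

    for (std::size_t i = 0; i < filter.include.size(); ++i)
        if (match(filter.include[i], name))
            return true;

    return false;
}
```

Inside the FindFirstFile loop you would call ShouldShow once per item, passing the directory attribute test as isFolder, so the folder is read only once no matter how many wildcards are active.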

If you need to handle more complicated pattern matching, there are full classes available on programmers' sites like CodeGuru. Still, I feel that the simple MS/DOS wildcards are enough where filename matching is concerned. You'd only need more expressive power for searching larger chunks of text.
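The advice to grab a tested function still stands. Purely to show the shape of the recursion involved, here is a bare-bones sketch of an MS/DOS-style matcher. WildMatch is a hypothetical name and is not the MatchPattern function from the sample mentioned earlier. The sketch follows the table in this section ('*' is zero or more characters, '?' is zero or one) and ignores case, which is my assumption based on how Windows treats filenames.

```cpp
#include <cassert>
#include <cctype>

// Returns true when 'name' matches 'pattern'.
//   '*' matches zero or more characters
//   '?' matches zero or one character
// Letters are compared without regard to case (an assumption).
bool WildMatch(const char* pattern, const char* name)
{
    for (; *pattern; ++pattern) {
        if (*pattern == '*') {
            // try the rest of the pattern against every possible tail
            for (const char* tail = name; ; ++tail) {
                if (WildMatch(pattern + 1, tail))
                    return true;
                if (*tail == 0)
                    return false;
            }
        }
        if (*pattern == '?') {
            // first try matching zero characters...
            if (WildMatch(pattern + 1, name))
                return true;
            // ...otherwise consume exactly one
            if (*name == 0)
                return false;
            ++name;
            continue;
        }
        // ordinary character: must be present and equal
        if (*name == 0)
            return false;
        if (std::tolower(static_cast<unsigned char>(*pattern)) !=
            std::tolower(static_cast<unsigned char>(*name)))
            return false;
        ++name;
    }
    return *name == 0;   // pattern exhausted: name must be exhausted too
}
```

The backtracking can become slow for long strings with many wildcards, but for filenames of a few dozen characters this is unlikely to matter.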

ADVANCED: Notifications of change

Sooner or later, the contents of a folder will change, one way or another. If you are showing its contents to a user you'd want to keep them up to date. There is only one documented way to monitor such changes that works both for 9x- and NT-based platforms, and that's FindFirstChangeNotification and related functions. It works only for regular filesystem folders. There's no documented way for monitoring virtual folders.

FindFirstChangeNotification creates a waitable handle that is signalled each time some kind of change occurs in the specified filesystem directory. The problem is that it won't tell you exactly what has changed, so the only option is to refresh (re-read) the folder from scratch. If you were monitoring the whole subtree underneath the folder, then tough, you'd have to refresh everything, since you won't be told where the change occurred. ReadDirectoryChangesW is more advanced in this respect, but it only works on NT, so I won't bother with it. Anyway, re-reading directory contents nowadays in the era of the Pentium is not too much trouble, even if it involves thousands of files.

You do have some control over what kind of change you will be notified of, using the appropriate FILE_NOTIFY_CHANGE_xxx value for the dwNotifyFilter argument. The description of these values in the docs leaves much to be desired, so I've compiled the following table, which shows how many notifications are received for some typical events that change folder contents. The "experiment" was performed on both win9x and NT. Pursuant to the law of the sod <g>, the behaviour of the two operating systems was slightly different. Where they differ, the win9x figure comes first and the NT figure follows after "NT:".

Event	_FILE_NAME	_DIR_NAME	_ATTRIBUTES	_LAST_WRITE
File creation (or paste)	1
File renaming	2; NT:1	2; NT:0
File deletion (or cut)	2	2; NT:0
File size modification	5; NT:3	2; NT:0	1	1
Change of file attributes			1
Folder creation		1
Folder renaming	2; NT:0	2; NT:1
Folder deletion	2; NT:0	2

The most important detail is that often you get multiple notifications for a single event. For example, if you edit a file in a folder monitored using FILE_NOTIFY_CHANGE_FILE_NAME, the moment you save it you'd receive up to 5 notifications. All these figures shown are for a single file. If you copied, say 10 files, you'd get (up to) 10 times the number of notifications you'd receive for a single file, i.e. 10 notifications.

You may also combine values to obtain a single handle that is signalled for all events triggered by each individual value. Hence, FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_DIR_NAME would ensure notifications for all changes that would be of interest for a browser like 2xExplorer. When you combine values this way the number of notifications per event is not the sum of the individual cases, but the maximum of the two as I have discovered. Hence, for a modification to the last write time of a file you'd get 5 notifications, not 7. Here's a code sample that tackles the problem of multiple notifications per event:

    // request a handle to monitor c:\temp (only) for all changes
    HANDLE cnh = FindFirstChangeNotification("C:\\Temp", FALSE,
                     FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_DIR_NAME);

    while(1) {
        DWORD wr = WaitForSingleObject(cnh, INFINITE);

        // get rid of all similar events that occur shortly after this
        DWORD dwMilSecs = 200; // arbitrary; enlarge for floppies
        do {
            FindNextChangeNotification(cnh);
            wr = WaitForSingleObject(cnh, dwMilSecs);
        } while(wr == WAIT_OBJECT_0);

        // now wr == WAIT_TIMEOUT, no more pending events
        printf("Event intercepted, refresh contents!\n");
        // thus, we have avoided unnecessary folder refreshes, see? :)

        // ...test some condition to break the infinite monitoring loop...
    }

    FindCloseChangeNotification(cnh); // release notification handle

The procedure is reminiscent of the FindFirstFile sample presented earlier, which read folder contents. After we've successfully obtained the handle via FindFirstChangeNotification, we enter a loop and use WaitForSingleObject. This blocks us until the event is signalled, so a separate thread is more or less compulsory. Note that the handle is waitable in the same sense as e.g. a normal mutex or process handle. I have to admit that mikrosoft have done a good job homogenizing all the handles in this way.

Once fired, the handle needs to be treated with FindNextChangeNotification if we are planning to start waiting on it once more. The code uses a secondary do-loop to remove any multiple notifications for the same event, waiting for a very short period. Finally, once all is said and done, good housekeeping dictates releasing the handle via FindCloseChangeNotification.
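To see why the short secondary wait pays off, here is a toy model of the loop that runs on plain numbers instead of real handles. Given the arrival times of notifications in milliseconds, it counts how many refreshes the wait-and-drain logic would perform. CountRefreshes is a hypothetical name, and the model assumes that refreshing and re-arming the handle take no time, which is of course a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the wait/drain loop: 'arrivalsMs' holds notification arrival
// times in ascending order, 'quietMs' plays the role of the short timeout.
std::size_t CountRefreshes(const std::vector<unsigned long>& arrivalsMs,
                           unsigned long quietMs)
{
    std::size_t refreshes = 0;
    std::size_t i = 0;
    while (i < arrivalsMs.size()) {
        // outer wait: the first notification of a burst wakes us up
        unsigned long last = arrivalsMs[i++];

        // inner loop: swallow followers that arrive before the timeout
        while (i < arrivalsMs.size() && arrivalsMs[i] - last < quietMs)
            last = arrivalsMs[i++];

        ++refreshes;   // timeout expired: refresh the folder once
    }
    return refreshes;
}
```

Five notifications arriving within a few tens of milliseconds of each other, as happens when a file is saved, collapse into a single refresh with a 200 ms quiet period, while a change one second later triggers a second refresh.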

FindFirstChangeNotification can also be used to monitor a whole drive like "c:\" using its bWatchSubtree parameter. That kind of information is useful for building self-refreshing folder tree views like the left pane of Windows Explorer. However, the only feasible strategy would be to monitor just for changes in folders, else there would be too many notifications to deal with. This is exactly what FILE_NOTIFY_CHANGE_DIR_NAME (used on its own) is for.

Unfortunately, the table above indicates that win9x will do the stupid thing and notify you even when plain file details change. Talking about crappy OS's... if you needed any more proof that 9x sucks, that's another one then <g>. Imagine how many false alarms this would generate: just think that whenever windows writes to the swap file, you'd be alerted. The same table shows that 9x also cacks up the FILE_NOTIFY_CHANGE_FILE_NAME flag, since you'd receive notification of folder-related changes, too. NT on the other hand has much more sense, furnishing you with the correct information in both cases.

ADVANCED: Network/Mapped drives

Simply put, the documented change notification framework collapses completely when changes occur on other computers, accessible via a LAN etc. FindFirstChangeNotification will seemingly succeed, but the returned handle is almost never signalled, so you end up waiting in vain. I've spent incalculable hours trying to find a way around this problem, searching newsgroups and obscure interfaces like ICopyHook and IShellChangeNotify, but it was all to no avail. So how does explorer manage? All processes that change shell contents in any way usually let explorer know about it by calling SHChangeNotify. This is meant to be private information for the benefit of explorer alone, but people have hacked in and come up with the goods: SHChangeNotifyRegister. This adds a "hook" to the notification chain so your app has access to all the insider information. The down side is that it is undocumented and I wouldn't want to touch it with a 10-foot pole <g>

Additional information

If you need to hear more lame excuses about the failings of the notification framework, here are a couple from miniSoft <g>

Q188321 - FindFirstChangeNotification May Not Notify All Processes on File Changes
Q245214 - PRB: ReadDirectoryChangesW Not Giving Consistent Notification
Q268817 - Deadlock in Redirector Delays Opening a File on the Network
ARTICLE: Keeping an Eye on Your NTFS Drives: the Windows 2000 Change Journal Explained — MSJ, September 1999
ARTICLE: Wicked Code: CDriveView by Jeff Prosise

Parsing pathnames for PIDLs

If you want to enumerate the namespace using COM, you'll need a folder object, and its IShellFolder interface in particular. The starting point in your explorations is invariably SHGetDesktopFolder, which returns the desktop's IShellFolder. In the next section we'll see how to enumerate its contents. For the present, let's concentrate on how to obtain another folder object, given the namespace root.

BindToObject bridges the gap between two folders. Given a relative PIDL, it returns the IShellFolder of the target folder (and other directly exposed interfaces too). The desktop folder itself uses absolute PIDLs to address items; given its nature as root, however, this statement is rather tautological <g>. At any rate, we need a PIDL to go any further.

One way to obtain an absolute PIDL is SHGetSpecialFolderLocation, which knows the location of a bunch of important folders in your system, identified by CSIDL_xxx constants. Such a service is undeniably useful, but more often you would like to convert a path name to a PIDL (the inverse process to SHGetPathFromIDList presented earlier). ParseDisplayName serves this purpose, but normally you'd have to do some processing yourself first to determine whether the pathname is global or relative, by looking at the first few characters of the string.

Start	Path interpretation
x:\	Regular full path, targeting device x, for example "c:\windows". Mapped network drives fall into this category, too.
\\computer	Universal Naming Convention (UNC) full path, targeting shared folders on networked computers, as in "\\sunffd3\homes\umeca74"
\\?\x:\	This is a variation of the UNC full path, which allows very long paths to be specified (up to 32K characters; normal paths are limited to a length of MAX_PATH, or 260). An example is "\\?\C:\thatWasntSoLong"
::{GUID}	Full "path" for namespace extensions, i.e. virtual folders, where GUID is the unique identifier of the COM object. Recycle bin for example can be accessed with "::{645FF040-5081-101B-9F08-00AA002F954E}"

Paths that fall into the above categories must be dealt with by the desktop's IShellFolder, since they are global. All other paths are relative and need to be parsed using the folder where they are "rooted", which usually is the current directory of the process, as returned by GetCurrentDirectory.
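The prefix test described by the table can be sketched as a small classifier. PathKind and ClassifyPath are hypothetical names of mine, and the sketch only looks at the leading characters; it makes no attempt to check that the rest of the string is a valid path.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum PathKind {
    kDrivePath,      // "x:\..."        regular full path
    kUncPath,        // "\\computer..." UNC full path
    kLongPath,       // "\\?\x:\..."    long path variation
    kNamespaceGuid,  // "::{GUID}"      namespace extension
    kRelativePath    // anything else, parsed relative to some folder
};

// Hypothetical helper: decides which IShellFolder should parse the string
// by inspecting only its first few characters.
PathKind ClassifyPath(const std::string& path)
{
    // the long-path prefix must be tested before the plain UNC prefix
    if (path.compare(0, 4, "\\\\?\\") == 0)
        return kLongPath;
    if (path.compare(0, 2, "\\\\") == 0)
        return kUncPath;
    if (path.compare(0, 3, "::{") == 0)
        return kNamespaceGuid;
    if (path.size() >= 3 &&
        std::isalpha(static_cast<unsigned char>(path[0])) &&
        path[1] == ':' && path[2] == '\\')
        return kDrivePath;
    return kRelativePath;
}
```

Anything other than kRelativePath would go to the desktop's IShellFolder; relative paths would go to the folder they are rooted in.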

An interesting hybrid case is paths like "C:Path". Old MS/DOS hands will surely remember the notation; it addresses folders relative to the "active" directory in some device. For instance, if you were last browsing "C:\Windows", then the above path resolves to "C:\Windows\Path". MS/DOS used to remember all the active folders in all drives, but the win32 shell has developed some form of amnesia. The solution I implemented is to keep a list of these active folders per device, and resolve such paths within 2xExplorer.
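The per-device bookkeeping could look roughly like the sketch below. It is an illustration under my own assumptions, not the actual 2xExplorer code: ActiveFolders is a hypothetical class, and falling back to the drive root when nothing has been remembered for a drive is a choice I made for the example.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical bookkeeping class: remembers the last folder visited on each
// drive so that MS/DOS-style "C:Path" strings can be expanded.
class ActiveFolders {
public:
    // Record a full path such as "C:\Windows" as the active folder of its drive.
    void Remember(const std::string& fullPath)
    {
        if (fullPath.size() >= 3 && fullPath[1] == ':' && fullPath[2] == '\\')
            folders_[DriveKey(fullPath[0])] = fullPath;
    }

    // Expand "C:Path" to "<active folder of C>\Path".
    // Strings that are not drive-relative are returned unchanged.
    std::string Resolve(const std::string& path) const
    {
        const bool driveRelative =
            path.size() >= 2 && path[1] == ':' &&
            std::isalpha(static_cast<unsigned char>(path[0])) &&
            (path.size() == 2 || path[2] != '\\');
        if (!driveRelative)
            return path;

        std::string base;
        std::map<char, std::string>::const_iterator it =
            folders_.find(DriveKey(path[0]));
        if (it != folders_.end()) {
            base = it->second;
        } else {
            // assumption: nothing remembered means the drive root
            base = std::string(1, DriveKey(path[0])) + ":\\";
        }

        if (path.size() > 2) {
            if (base[base.size() - 1] != '\\')
                base += '\\';
            base += path.substr(2);
        }
        return base;
    }

private:
    // drive letters are compared without regard to case
    static char DriveKey(char letter)
    {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(letter)));
    }

    std::map<char, std::string> folders_;
};
```

A browser would call Remember every time the user navigates to a filesystem folder, and Resolve on whatever the user types before handing the string to ParseDisplayName.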

Finally, note that ParseDisplayName is confused by paths containing indirect specifiers like '..' for the parent of a folder. The only solution is to detect these cases and manually convert them to proper full paths before passing them to ParseDisplayName. The easiest way to do the conversion is to call SetCurrentDirectory (which understands such old-fashioned notations) on the indirect path, followed by GetCurrentDirectory.
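If changing the current directory of the process is undesirable (it is shared by all threads), the conversion can also be done by plain string manipulation. The sketch below is such an alternative, not the SetCurrentDirectory method described above. CollapseDots is a hypothetical name, and the function assumes its input is already a full drive path like "C:\a\..\b"; a relative path would first have to be joined to its base folder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collapses "." and ".." segments of a backslash-separated path.
// Assumes 'path' is a full drive path, so the first segment is "x:".
std::string CollapseDots(const std::string& path)
{
    std::vector<std::string> parts;
    std::string::size_type pos = 0;

    // split on backslashes, resolving the pseudo-folders as we go
    while (pos <= path.size()) {
        std::string::size_type end = path.find('\\', pos);
        if (end == std::string::npos)
            end = path.size();
        const std::string part = path.substr(pos, end - pos);

        if (part == "..") {
            if (parts.size() > 1)   // never climb above the drive itself
                parts.pop_back();
        } else if (!part.empty() && part != ".") {
            parts.push_back(part);
        }
        pos = end + 1;
    }

    // stitch the surviving segments back together
    std::string result;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0)
            result += '\\';
        result += parts[i];
    }
    if (parts.size() == 1)
        result += '\\';   // a bare drive becomes its root, "C:" -> "C:\"
    return result;
}
```

The output is a string without indirect specifiers, which can then be passed on for parsing in the usual way.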
