date 29.Jan.12

■ Search for funny strings in text files: a Master's course

One of xplorer²'s biggest strengths is searching for text in files. As I've mentioned before (warning! last century template :) xplorer²'s search engine digs into all sorts of files and finds exactly the string you are typing, without arbitrarily splitting words and what have you. Especially for plain text files, xplorer² is formidable both in scope and speed. Whether you are scanning English, Greek or Chinese text, with or without an encoding indicator (BOM), if it is there xplorer² will find it. I am quite proud of it <g>

Plain text files, be it HTML or server logs or C++ source code, are easy to deal with as long as the character set remains in plain English or ANSI. Dealing with localized plain text on the other hand is very tricky, as there are many ways to encode the Greek "καλημέρα" in plain text, that is, as a sequence of bytes, each a number from 0 to 255. Many different approaches are in use for the encoding problem, like the Unicode standard, UTF-8, code pages etc. If you are lucky the file contains information about its encoding in the form of a byte order mark (BOM), but many don't.
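To make the encoding problem concrete, here is a small Python sketch (my own illustration, nothing to do with how xplorer² works internally) that encodes the same Greek word in a few common ways and sniffs a file's leading bytes for a BOM. The helper names `encode_all` and `sniff_bom` are made up for this example, and the BOM table deliberately ignores UTF-32.

```python
import codecs
import unicodedata
from typing import Dict, Optional

# Normalise to precomposed letters so that the accented epsilon is one code point.
WORD = unicodedata.normalize("NFC", "καλημέρα")

def encode_all(text: str, encodings=("utf-8", "utf-16-le", "cp1253")) -> Dict[str, bytes]:
    """Encode the same text under several encodings (hypothetical helper)."""
    return {name: text.encode(name) for name in encodings}

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none.

    UTF-32 is left out on purpose to keep the sketch short.
    """
    boms = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None

if __name__ == "__main__":
    for name, data in encode_all(WORD).items():
        print(f"{name:10} {len(data):2} bytes: {data.hex(' ')}")
```

The eight letters take 16 bytes in UTF-8 and in UTF-16, but only 8 bytes in the Greek code page 1253, and the byte values differ in every case. A byte-level search that guesses the wrong encoding would simply never see the word.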

If you have the latest xplorer² version then you don't need to worry about most of these things. Just type in the string you are looking for in the search box (see pic) and leave all the deciphering to xplorer². The default encoding option is the easiest to use as xplorer² does most of the encoding conversions automatically. Most of the forced encoding options offered are no longer required; they are just a remnant of older xplorer² versions.
[Image: find text dialog]

Sometimes you want to search for unprintable characters like a newline (byte code 10, or $0A in hex). xplorer² lets you do that as well, so for example searching for final$0A will match only those files where the word final is at the end of a line. If you know the ASCII code of the character you are looking for, you can enter it in the search box as $xx (where xx is the hexadecimal value). xplorer² offers the most typical ones via the Special characters drop-down box. Another way would be to tick the regular expression option and use the power of regular expressions to find special characters.
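If you wonder what a $xx search amounts to at the byte level, the following Python sketch mimics the idea. It is only an approximation under my own assumptions: `parse_pattern` and `file_contains` are hypothetical names, the search text is assumed to be UTF-8, and xplorer²'s real parser may well treat edge cases differently.

```python
import re

# "$" followed by exactly two hex digits stands for one raw byte.
_HEX_ESCAPE = re.compile(rb"\$([0-9A-Fa-f]{2})")

def parse_pattern(text: str, encoding: str = "utf-8") -> bytes:
    """Expand $xx escapes, so that 'final$0A' becomes the bytes of 'final' plus a newline."""
    raw = text.encode(encoding)
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

def file_contains(path, text: str) -> bool:
    """Report whether the file holds the byte sequence described by text."""
    needle = parse_pattern(text)
    with open(path, "rb") as fh:
        return needle in fh.read()
```

Note that in this sketch `final$0A` will not match a file with Windows line endings where "final" is followed by the bytes 0D 0A; you would search for `final$0D$0A` there instead.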

Ok then, how can you search for the byte sequence EF BB BF? These 3 bytes at the beginning of a file are the marker for UTF-8 encoding. If you search for the string $EF$BB$BF in the default mode then you won't find anything, as xplorer² "eats" these bytes when importing the file. That's where the Raw encoding mode comes in handy. Selecting Raw from the encoding drop-down and ticking the case sensitive option will allow you to find those files that bear the UTF-8 BOM!
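The difference between the default and raw modes can be imitated in a few lines of Python. This is an analogy rather than a description of xplorer² internals: Python's `utf-8-sig` codec also swallows the BOM while decoding, so the marker is only visible when you look at the undecoded bytes. The function names and the `*.txt` filter are my own choices for the example.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"

def has_utf8_bom(path) -> bool:
    """Check the first 3 raw bytes of a file for the UTF-8 marker."""
    with open(path, "rb") as fh:
        return fh.read(3) == UTF8_BOM

def files_with_bom(folder, pattern: str = "*.txt"):
    """List files under folder (recursively) that start with the UTF-8 BOM."""
    return sorted(
        p for p in Path(folder).rglob(pattern)
        if p.is_file() and has_utf8_bom(p)
    )

def decoded_text(path) -> str:
    """Decode the way a text importer might: the BOM is consumed, not returned."""
    with open(path, "rb") as fh:
        return fh.read().decode("utf-8-sig")
```

Searching the result of `decoded_text` for the marker finds nothing because the marker has already been stripped, whereas `has_utf8_bom` works on the untouched bytes.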

To recap: when searching for text in files, leave the Default encoding option so that xplorer² will figure out the plain text format. The only exceptions are:

For the sake of completeness I mention in passing that you can also add Boolean operators when searching for multiple keywords. Finally, you cannot search for embedded zero bytes (typing $00 won't work, and neither will \00 in a regular expression); this is the only drawback of an otherwise complete system that makes one proud to behold!

©2002-2012 ZABKAT, all rights reserved