TxtSplit: A Text File Parsing Utility
TXTSPLIT is a tool for chopping apart big text files on specific lines. It was designed with a very specific application in mind: extracting individual messages from a single, big, globby file.
Where would you get such a thing? Why would you ever have a "single, big globby file" which contained lots of separate messages? I can think of two examples: email programs, and command-line newsreaders.
If you don't know what a command-line newsreader is, chances are you'll never need to know. The basic use of this program is the same regardless of the source of the text file, however, so I'm going to discuss its use in conjunction with email.
Suppose you subscribe to a mailing list. Every day, along with your daily dose of spam, you receive about a dozen messages from the mailing list. You want to keep them for future reference. But you don't want to keep them in your mailbox. No problem... you can just save them. But it's very time-consuming to save them one by one. You have to name each file separately, and you have to remember to save them all in the same place. Moreover, let's suppose that you're like most people in this Windows-dominated world. You're probably using Outlook to read your mail.
Wouldn't it be nice if you could just select five hundred messages and save them all at once? With Outlook, you actually can do that. Ah, but what does that give you? Hint: the correct answer contains the words "big" and "globby." (OK, OK, so "globby" is not an actual word, but you get the idea: it means fat and sloppy.)
So now you have this big, giant, fat text file, and it contains ALL of your saved messages, crammed into one colossal document. (This, by the way, is basically what you'll get if you use an old-style shell program such as Pine to read email, or if you use a newsreader like Tin. But I digress. Oh... and another thing... I'm not going to talk about email attachments. That's a whole 'nother problem.)
What do you do with this file? After all, the reason you're saving these messages in the first place is that they might have information that would be useful for you. It isn't very convenient to have to read through the same fifty-page document every time you want to find a little info. You want to leverage the power of your computer. Let the computer find things for you. If you could use the computer's "Find" or "Search" feature to look for text inside individual files, you'd be able to find what you want fast. But to do that, you'll want to break that big file apart again. That's where TXTSPLIT comes in.
If you open the big text file you got when you saved everything from Outlook, you'll see that the message headers are retained. They are also consistent, in that each message has the same sequence of headers.
TXTSPLIT allows you to specify text strings to identify the message headers. It will then save each individual message into a separate file, using a filename prefix that you've defined, and appending a number to the prefix to create a unique filename automatically.
TXTSPLIT is configured via an .ini file (txtsplit.ini). The .ini file contains five lines. If you open the default .ini file, you'll find that it looks like this:
What do those lines mean?
- The first line is a counter, used for incrementing the filenames.
- The second line is the prefix for the filenames. With the default settings, the first new file extracted from the big, globby text file would be msg1.txt, the second would be msg2.txt, and so on.
- The third line is the default output directory. This is where the extracted messages will be saved.
- The fourth line is the text string which identifies the first line of the message header.
- The fifth line is the text string which identifies the second line of the message header. Why does the program use TWO lines to identify the header? Relying on one line doesn't seem sufficiently safe. It's fairly likely that sooner or later that same text string would wind up in a message body somewhere, and you'd have messages chopped apart improperly. But if for some reason you NEED to use only one line, you can do that. More on that in a moment.
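To make the five-line layout concrete, here is a minimal Python sketch of how such a file could be read. It is not TXTSPLIT's own code (that is QuickBASIC), and the names `Settings` and `parse_ini` are invented for illustration. The sample header strings in the usage note are placeholders, not the program's real defaults.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    counter: int   # line 1: number used for the next output filename
    prefix: str    # line 2: filename prefix, e.g. "msg"
    out_dir: str   # line 3: output directory
    header1: str   # line 4: text identifying the first header line
    header2: str   # line 5: text identifying the second header line (may be blank)


def parse_ini(text: str) -> Settings:
    """Parse the five-line settings layout described above (hypothetical helper)."""
    lines = text.splitlines()
    # A blank fifth line still counts as a line; a deleted one does not.
    if len(lines) != 5:
        raise ValueError(f"expected 5 lines in the .ini file, found {len(lines)}")
    return Settings(int(lines[0]), lines[1], lines[2], lines[3], lines[4])
```

For example, `parse_ini("1\nmsg\ntxt_out\nFrom:\nSubject:\n")` yields a counter of 1, the prefix `msg`, and the output directory `txt_out`, with the two placeholder header strings. A file whose fifth line is blank parses with an empty `header2`, while a file with only four lines is rejected.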
So how do we use this program? It's a command-line program, which means that you should be running it from a DOS prompt, or from within a batch file. The syntax is simple:
If you do not specify a filename, it will assume that you want to use its default input filename, rawfile.txt.
When the program runs, it will first display its current settings. At this point, you'll have four options: you can type "H" to view the program help, "R" to reset the .ini file to its original defaults, "D" to create the default output directory (txt_out), or "X" to exit without doing anything at all. You'll have 15 seconds to make a choice. If you don't select one of these four options within that time, the program will run, and will (presumably) extract your messages from the original file. (The original file, incidentally, will not be altered in any way.)
What if you want to split a file using a one-line match? You can do that by modifying txtsplit.ini. Suppose you had a big file that you wanted to split apart on every occurrence of a line beginning with the word "Error." The fifth line of the .ini file contains the text that must match the second header line, so edit the file so that its fifth line is blank. By setting the line to an empty string, you're telling TXTSPLIT, "Screw the second header line. Just split the thing when you find a match for the first header line." One important point, though: to get this to work properly, the fifth line has to be blank, not missing. Don't delete the whole line! You still need a carriage return there, or TXTSPLIT will complain that the format of the .ini file is incorrect.
There's one potential problem that should be noted: if your text file doesn't begin with the search string, any text before the first matching line will be ignored.
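The splitting rules described so far can be sketched in a few lines of Python. This is an illustration only, not the program's actual logic. It assumes that a header string matches when a line begins with it, and that the two header lines must be consecutive; neither detail is confirmed above. The function names `split_messages` and `output_names` are hypothetical.

```python
from typing import List, Optional


def split_messages(lines: List[str], header1: str, header2: str = "") -> List[List[str]]:
    """Group lines into messages, starting a new one at each header match."""
    messages: List[List[str]] = []
    current: Optional[List[str]] = None  # stays None until the first header is seen
    for i, line in enumerate(lines):
        is_start = line.startswith(header1)
        if is_start and header2:
            # Two-line match: the following line must match the second string too.
            is_start = i + 1 < len(lines) and lines[i + 1].startswith(header2)
        if is_start:
            current = []
            messages.append(current)
        if current is not None:
            current.append(line)  # lines before the first match are dropped
    return messages


def output_names(prefix: str, counter: int, count: int) -> List[str]:
    """Build filenames such as msg1.txt, msg2.txt from the prefix and counter."""
    return [f"{prefix}{counter + k}.txt" for k in range(count)]
```

With an empty `header2`, every line beginning with `header1` starts a new message, which is the one-line match. With both strings set, a stray line that merely begins with the first string stays inside the message body unless the next line matches as well.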
If I really wanted to be thorough, I'd probably explain a lot more about why you might want to save your emails or newsgroup postings this way. However, this is such an obscure, single-purpose tool that I doubt if many people will need it. If you *do* need it, you probably already know WHY you need it, and you are probably savvy enough to figure out whether or not this will help you.
One question I've been asked from time to time: is there a maximum limit to the size of the file that can be split? I don't believe there is, beyond the limits imposed by the maximum file size on your drive's file system. There is, however, a limit to the maximum length of a line within the file. See "Bugs and Known Flaws," below.
On the outside chance that you are just certain that this is the tool you need, but you can't figure out how to use it because the documentation is so damn skimpy, go ahead and email me. If you put "TXTSPLIT" in the subject, I'll try to help you out.
Also available: a "super-Canadian" version of TXTSPLIT, known as QSPLIT, optimized for use in batch files. QSPLIT differs from TXTSPLIT in that it does not have the fifteen-second delay during which you can choose other actions, such as showing the help screen, rebuilding the .ini file, or creating the default output folder. (You can, however, still do those things with QSPLIT by passing it the command-line parameters /h, /r, or /d.) Click here to download QSPLIT.
Bugs and Known Flaws
Elegant error handling? Haaaahaaahaaa!!! This program has no such thing. I wrote this to meet a very specific need, and it was done in a big hurry. I'll let you in on a secret: this is a truly BASIC program. That is, BASIC as in "Microsoft QuickBASIC." (Yes, there was such a thing as a compiler for QuickBASIC.) I'm not even going to try to create a comprehensive list of its flaws; I'll just list a few significant limitations, and beyond that, you're on your own.
- This program cannot process long filenames. It is a 16-bit program, and must use old-fashioned DOS-style 8.3 filenames.
- There is a limit to the maximum line length within the files you are trying to split. If you exceed that limit, you'll encounter the dreaded "Out of string space in module TXTSPLIT" error. This occurs when a line in the file exceeds 65,535 characters. This limitation is imposed by QuickBASIC's "line input" function. Fortunately, the Most Excellent Patty Bristol of IxReveal figured out a workaround: if you open the file in WordPad or some other editor and insert line breaks into the offending lines, you'll be back in business. Thanks also to Robert Erdely of the Pennsylvania State Police, who was the first person to send me the detailed error message, and to Matt Mason for pointing out an error in the online documentation.
- It will crash if the output directory does not exist. That will give you a nasty error along the lines of "Path not found in module TXTSPLIT at address 0DD7:0EC6. Hit any key to return to system." If you get this error, no problem: just run the program again, and choose option D (create default output directory) at the prompt.
- If you put junk in the .ini file, it will choke. For example, if you use a file prefix that is too long, it's going to crash and burn in some weird, screwy fashion.
- If you are careless with filenames and output locations, it will cheerfully overwrite existing files.
- It will generate ugly errors if you try to split a file that doesn't exist. You'll get some scary-sounding message like "File not found in module TXTSPLIT at address 0DD7:03BB."
- Can you traverse full paths with this silly program? I dunno. Probably you could if they did not include long filenames. I'm too lazy to test it. I just put the file I want to split in the same directory as TXTSPLIT itself.
- If you see a message saying "Program execution complete," but you didn't see any indication that it was saving files, that probably means that it didn't find lines matching the expected header sequence.
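Several of the problems listed above can be caught before TXTSPLIT ever runs. The Python sketch below is a separate, hypothetical helper script and not part of the TXTSPLIT package. It breaks overlong lines (the same workaround as inserting line breaks by hand) and checks for a missing input file, a missing output directory, and a filename that is not in 8.3 form. The default `max_len` is an arbitrary conservative value, and the 8.3 check is deliberately simplified.

```python
import os
from typing import List


def break_long_lines(lines: List[str], max_len: int = 32000) -> List[str]:
    """Split any line longer than max_len into chunks of at most max_len characters."""
    out: List[str] = []
    for line in lines:
        if len(line) <= max_len:
            out.append(line)
        else:
            out.extend(line[i:i + max_len] for i in range(0, len(line), max_len))
    return out


def is_dos_8_3(name: str) -> bool:
    """Rough 8.3 check: 1-8 character base, optional extension of up to 3, no spaces."""
    base, _, ext = name.partition(".")
    return 1 <= len(base) <= 8 and len(ext) <= 3 and "." not in ext and " " not in name


def preflight(input_path: str, out_dir: str) -> List[str]:
    """Return a list of problems likely to make the split fail; empty means none found."""
    problems: List[str] = []
    if not is_dos_8_3(os.path.basename(input_path)):
        problems.append("input filename is not in 8.3 form")
    if not os.path.isfile(input_path):
        problems.append("input file does not exist")
    if not os.path.isdir(out_dir):
        problems.append("output directory does not exist")
    return problems
```

Running `preflight("rawfile.txt", "txt_out")` from the program's directory before a batch job returns an empty list when everything is in place. Note that breaking a long line changes the text of the extracted message, so keep the original file if the exact content matters.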
Builds 1 - 17 were debugging builds, and were not released. Build 18 is the first release version.
You can download TxtSplit here:
Program Description: A quick-and-dirty tool for splitting a text file into new files on each occurrence of matching lines of text. This is Build 18.
The zip file contains the following items:
txtsplit.exe - (the executable file)
txtsplit.ini - (configuration file)
txtsplit.txt - (program notes and documentation)
install.bat - (batch file to create the default output directory)
rawfile.txt - (a sample file to demonstrate the use of the program)
txtsplit.bas - (QuickBasic source code)
source - (directory containing source code)
txt_out - (empty directory, used as default output destination)
Program and documentation by Bruce Sharp. Last update: Nov. 2007.