Notes for Name Cleaning string class Example

First, a side trip not related to strings at all: enumerations. The strip_spaces() function uses an argument of type 'EndType'. This is not a built-in type or a 'class' type (like cin's and cout's types or the string type). It is, however, a data type. It is one defined by an enumeration:

// for use with strip_spaces() (see).
enum EndType { Left, Right, Both, Neither };

This statement defines 'EndType' to be an 'enum'eration data type. It says that the type has four possible values: 'Left', 'Right', 'Both', and 'Neither'. So if we declare a variable, constant, or function argument of this data type, the only values that are valid to be stored or compared with would be these four. (It is kind of like the bool type and its two values true and false except that we the programmer have defined it.)
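
As a minimal sketch of what 'only these four values' looks like in practice, here is a small helper that turns an EndType into readable text. The names describe_end() and check_end_type() are made up for illustration; they are not part of the example program.

```cpp
#include <cassert>
#include <string>
using std::string;

// the enumeration from the notes
enum EndType { Left, Right, Both, Neither };

// hypothetical helper (not part of the original program):
// turns an EndType value into readable text
string describe_end(EndType which_end)
{
    switch (which_end)
    {
        case Left:    return "left end";
        case Right:   return "right end";
        case Both:    return "both ends";
        case Neither: return "neither end";
    }
    return "unknown";    // only reached if which_end holds some other value
}

// a variable of the type can be assigned and compared like any other
void check_end_type()
{
    EndType choice = Both;
    assert(choice == Both && choice != Neither);
    assert(describe_end(choice) == "both ends");
}
```

Note that the values are written as plain names (Both, not "Both"): they are values of the type, just as true and false are values of bool.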

Its use here is to allow the caller of strip_spaces() to specify whether they want the spaces stripped from the left end, right end, both ends, or neither end of the string. (I don't know why they would call a function for stripping spaces if they didn't want to strip any spaces, but it finishes out the complete set. It is also good style/form to provide all possible options for such a generically useful function.) Since we believe that the caller will most likely want spaces stripped from both ends of the string, we default the argument to 'Both'.

Now the prototype of the strip_spaces function itself:

// removes spaces (' ', '\t', '\n', '\r', '\v') from the left end,
// right end, or both of s.  (or neither if you are weird
// enough to ask it to.)
void strip_spaces(string & s, EndType which_end = Both);

It takes a string by reference and an argument to tell it which_end the caller wishes spaces removed from. This argument, as mentioned above, defaults to Both.

Next a prototype for replace_all():

// for replace_all() (see).  a variety to assuage various
// programmers' tastes.
const bool REPEATED = true, WITHIN = true, FORCE = true,
           ONE_PASS = false, NO_CROSS = false;
// replaces all occurrences of this_stuff in in_here with
// with_this.  will *NOT* replace occurrences of this_stuff
// that are in with_this unless forced to!
void replace_all(string & in_here, const string & this_stuff,
                 const string & with_this, bool force_within_with);
inline
void replace_all(string & in_here, const string & this_stuff,
                 const string & with_this)
{
    replace_all(in_here, this_stuff, with_this,
                this_stuff.find(with_this) == 0);
    return;
}

This function exists because the string class's own replace() function only replaces one range of characters (a position and a length that we must supply ourselves) with another string. We will need (see below) to replace all occurrences of a certain thing with another thing (all tabs with a single space; all double'd spaces with a single space).

The last argument is minorly confusing at first glance. Its purpose is to allow the caller to specify that the with_this string may actually be contained in the this_stuff string (as a prefix) and should be checked for replacement itself. Okay, so it probably doesn't happen this way very often, but the replacement may abut against something in the original string to form the string to be replaced (this_stuff).

Still confused, eh? Let's see an example:

    in_here:         "this is my string and it has stuff in it"
    this_stuff:      "his "
    with_this:       "h"

Notice how when the "his " there at the beginning of in_here is replaced with "h", we end up with: "this my string...". Thus, the replacement has produced a new occurrence of the string to be replaced within the overall string. Technically we would not be doing our duty to replace *ALL* occurrences of "his " if we didn't replace this new one, as well. But, this isn't the behavior most programmers would expect. Most programmers I've worked with/taught would expect that replacement would be autonomous/atomic. That is, once replaced, that replaced text would not be considered again. But who are we to say that *NO* programmer will ever desire such repeated in-place replacement? Unqualified, that's who.

Remember, examples are there for clarity and may not make real-world sense. We are not here to come up with every possibility that could ever happen, but to give people options when we think something may prove useful someday.
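
To make the "his " example concrete, here is a sketch that runs it both ways. The function replaced() is a hypothetical stand-in: it uses the same loop as the replace_all() defined later in these notes, but works on a copy and returns the result so the two modes can be compared side by side.

```cpp
#include <cassert>
#include <string>
using std::string;

const bool FORCE = true, NO_CROSS = false;   // two of the constants from above

// same logic as replace_all() (shown later in these notes), but
// returning the result instead of changing the caller's string
string replaced(string in_here, const string & this_stuff,
                const string & with_this, bool force_within_with)
{
    string::size_type pos_jump = force_within_with ? 0 : with_this.length();
    string::size_type at = in_here.find(this_stuff);
    while (at != string::npos)
    {
        in_here.replace(at, this_stuff.length(), with_this);
        at = in_here.find(this_stuff, at + pos_jump);
    }
    return in_here;
}

void check_his_example()
{
    const string text = "this is my string and it has stuff in it";

    // one pass: the newly formed "his " is left alone
    assert(replaced(text, "his ", "h", NO_CROSS)
           == "this my string and it has stuff in it");

    // forced: the newly formed "his " is replaced as well
    assert(replaced(text, "his ", "h", FORCE)
           == "thmy string and it has stuff in it");
}
```

The forced result looks silly, which is rather the point: most callers would want the one-pass answer here.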

Also note the helper inline'd overload'ed version that tries to detect when this is going to happen so the programmer calling the function doesn't need to think about it. (This sort of thing is a variation on default'ing an argument. The programmer can still pass the 4th argument to turn it on/off explicitly, but they can also just pass the first 3 args and let us detect that it may/may not happen. If that isn't doing what they want, they can go back to the call and add the 4th arg explicitly! We couldn't just use a default value because we didn't know if it should be on or off until we looked at the this_stuff and with_this strings and their relationship to one another. Since we needed code, we made an overload'ed helper. Since it was short, we inline'd it.)
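
The detection itself is a single expression: this_stuff.find(with_this) == 0, which is true exactly when with_this is a prefix of this_stuff. Here is a small sketch of that test on its own; the function name replacement_is_prefix() is invented for illustration.

```cpp
#include <cassert>
#include <string>
using std::string;

// the test the three-argument overload performs: is with_this
// a prefix of this_stuff?  (hypothetical name, for illustration)
bool replacement_is_prefix(const string & this_stuff,
                           const string & with_this)
{
    return this_stuff.find(with_this) == 0;
}

void check_detection()
{
    assert( replacement_is_prefix("  ", " "));    // double space -> space
    assert(!replacement_is_prefix("\t", " "));    // tab -> space
    assert( replacement_is_prefix("his ", "h"));  // the example above
    assert( replacement_is_prefix("ab", "ab"));   // identical strings, too
}
```

One caution that follows from that last line: as far as I can tell from the code, if with_this is identical to this_stuff the detection turns forcing on, and each replacement then recreates the very match it just replaced, so the loop would never finish. A caller who might pass identical strings should check for that first.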

We'll see exactly how this should work below in the function definition, but at least we know what it means, now. (Don't worry if you don't, it only occurs once in this program, after all.)

Oh, also note the several constants defined above the prototype. These give the calling programmer different ways to state that they do/do not desire the 'repeated in-place replacement' feature. They could simply pass true/false, but words are often more clear than plain data values. (Hence the idea of the [named] constant and its solution to the 'magic number' dilemma.)

Next come a pair of functions for reading in a whole line of text into a string (not just space-delimited as >> would allow):

// reads a whole line into the string (including leading,
// internal, and trailing spaces/tabs).  avoids grabbing
// an empty line if called after extraction!
inline void get_line(string & s)
{
    cout.flush();    // peek doesn't print prompts on some systems
    if (cin.peek() == '\n')   // stray \n from prior extraction
    {
        cin.ignore();    // toss it!
    }
    getline(cin, s);   // now get the whole (fresh) line
    return;
}

This first version takes in the string variable to fill in from the caller. Then it makes sure cin is ready to get a whole-line string and that cout has prompted properly. Finally, it calls a function from the string library to actually read the whole line from cin into the string.

Whew! That was a chunk. Let's start at the bottom: getline(cin, s). This calls a library function that will get an entire line (from the current reading position within the buffer to the next new-line character) and store it in the string object specified. The first argument is the input stream to read from. We want to read from cin, so that is what we specify. (We'll see next semester that there are other kinds of input streams that could be used -- files, for instance.)

But, there is a potential problem that creeps in when getline() is mixed with extraction (>>). Extraction leaves new-line characters within the buffer when it sees that as the end of the data that was requested. That is, when we ask cin to read a double and the user types:

    4.2<Enter>
    _

at the keyboard, the buffer will look like this:

    +---+---+---+----+---+---+---
    | 4 | . | 2 | \n |   |   | ...
    +---+---+---+----+---+---+---

Then cin will translate the double, leaving the buffer like this (the '^' marks the next character to be read):

    +---+---+---+----+---+---+---
    | 4 | . | 2 | \n |   |   | ...
    +---+---+---+----+---+---+---
                  ^

Note that the '4.2' part has been translated and stored in a variable, but the new-line character is just sitting there. Having served its purpose of ending translation of the double, it waits to be skipped before the next extraction.

However, getline() doesn't behave like extraction: it doesn't skip over spacing before real data because spaces are part of its valid data. It also considers the new-line character a special value and will simply declare its job done when it finds one.

So, if the above extraction were followed by our getline() call, we wouldn't get the information the user wanted to type, but rather an empty string! getline() would see the new-line character as the first data and say, "Hey, a new-line! I'm done. Here's your empty string." That isn't the behavior we desired. We wanted the program to stop and let the user type a string -- not skip it and move on with no real data!

To fix this, we ask cin what the next character is (peek(), remember?) and check if it is a new-line. If it is, then we must have been preceded by an extraction (a prior getline() would have removed the new-line character that ended its line). So we know bad things are about to happen and we tell cin to ignore that new-line. Now, if we were preceded by an extraction, we'll ignore() the '\n' character and if we weren't, we'll do nothing. Finally, we'll getline() the user's actual data!
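
The problem and the fix can both be tried without a keyboard by reading from a string stream instead of cin. This is only a sketch: get_line_from() is a hypothetical variation of get_line() that takes the stream as an argument (and skips the cout.flush(), since no prompt is involved).

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
using namespace std;

// like get_line() from the notes, but reading from any input
// stream so it can be tried without a keyboard (the stream
// parameter is an addition for this sketch)
void get_line_from(istream & in, string & s)
{
    if (in.peek() == '\n')    // stray \n from prior extraction
    {
        in.ignore();          // toss it!
    }
    getline(in, s);           // now get the whole (fresh) line
}

void check_stray_newline()
{
    double number;
    string line;

    istringstream plain("4.2\nJacob Miller\n");
    plain >> number;          // leaves the '\n' in the buffer
    getline(plain, line);     // sees that '\n' first...
    assert(line.empty());     // ...and hands back an empty string

    istringstream fixed("4.2\nJacob Miller\n");
    fixed >> number;
    get_line_from(fixed, line);
    assert(line == "Jacob Miller");
}
```

Note that the fix only tosses a single new-line character; if the user typed spaces after the number, those would still be sitting in front of the '\n'.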

Um...but what about that first statement? Well, some compilers (okay, most) take the idea of 'cout shall print its buffer when cin reads' quite literally. They only tell cout to print the current buffer contents when cin actually reads something (takes something out of *its* buffer). Since peek() doesn't remove anything from the buffer, it doesn't 'read' according to this definition. So, peek() won't force cout to print any waiting prompt. Eek! We must do so. The flush() function owned by cout fits our needs precisely. We can't read because we don't know what to read. We can't emit an endl because we don't want to ruin the caller's user interface with extra new-lines being printed. We certainly can't wait until cout is full or the end of the program! flush() will tell cout to print its current buffer -- NOW! (I'll leave the imagery to you...*bleah*)

This second version of our line-getting function doesn't require the caller to have a variable of their own to store the line into. (Note the previous incarnation took a reference argument.)

// companion overload if you don't need to store it...or just
// like the function return style of call...
inline string get_line(void)
{
    string s;
    get_line(s);
    return s;
}

This might prove useful if the caller simply needs to 'echo' the user's input back:

    cout << get_line() << '\n';

(I don't know why this might be needed; we are just providing the caller with options! Overloading on a string-reference argument list and a void argument list, respectively, does this here.)

Note how the cout needs to print a new-line character separately since getline() doesn't store the new-line character that stopped its reading.

And in main:

    string whole_name;

    cout << "Enter your name:  ";
    whole_name = get_line();
    // or we could do:  get_line(whole_name);

We make a string variable and call one of the get_line() functions. (Even show how we could have called the other one, if we'd so chosen.)

Next, to process the user's name:

    strip_spaces(whole_name);  // remove stray spaces from both ends
    replace_all(whole_name, "\t", " ");  // replace any tabs with single
                                         // spaces
    replace_all(whole_name, "  ", " ");  // replace double spaces
                                         // with single spaces --
                                         // until they are all gone!

We remove all extra spaces from both the left and right ends of the string (note we used the default for the EndType argument). Next we replace any tab characters they've typed with a single space. Now for the fun line: we replace all double'd spaces with single spaces -- even ones we 'create' with our replacement!

Notice the following scenario:

    whole_name:  "Jacob     Miller"

The first name and last name are separated by 5 spaces. If we just called replace_all() normally (like we did for tabs), we'd end up with:

    whole_name:  "Jacob   Miller"

There are still 3 spaces in there! We wanted just one space between the names! Watch how this happens:

    whole_name:  "Jacob     Miller"
    found:             --
    ...replace...
    whole_name:  "Jacob    Miller"
    replacement:       _

Two became one -- so far so good. Next:

    whole_name:  "Jacob    Miller"
    found:              --
    ...replace...
    whole_name:  "Jacob   Miller"
    replacement:        _

It found the double space after the first replacement text of a single space -- not the double space formed with the replacement text and the following space. And on the next look for a double space, it'll find none because the only remaining double space is before the place it'll begin to look (the space just before Miller).

By requesting replacement within replaced text, we get this sequence of events instead:

    I    whole_name:  "Jacob     Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob    Miller"
         replacement:       _

   II    whole_name:  "Jacob    Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob   Miller"
         replacement:       _

  III    whole_name:  "Jacob   Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob  Miller"
         replacement:       _

   IV    whole_name:  "Jacob  Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob Miller"
         replacement:       _

A total of four replacements are made and we've removed all the extra spacing between the names! (Hopefully the replace within replaced text thing is becoming a little clearer now...keep thinking about it...)
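
The two walk-throughs above can be checked by counting the replacements as they happen. In this sketch, replace_and_count() is a hypothetical variation of replace_all() with a counter added; everything else is the same loop.

```cpp
#include <cassert>
#include <string>
using std::string;

// same loop as replace_all(), but also counting how many
// replacements were made (the counting is added for illustration)
int replace_and_count(string & in_here, const string & this_stuff,
                      const string & with_this, bool force_within_with)
{
    int count = 0;
    string::size_type pos_jump = force_within_with ? 0 : with_this.length();
    string::size_type at = in_here.find(this_stuff);
    while (at != string::npos)
    {
        in_here.replace(at, this_stuff.length(), with_this);
        ++count;
        at = in_here.find(this_stuff, at + pos_jump);
    }
    return count;
}

void check_jacob_miller()
{
    string one_pass = "Jacob     Miller";     // five spaces
    int made = replace_and_count(one_pass, "  ", " ", false);
    assert(made == 2);
    assert(one_pass == "Jacob   Miller");     // three spaces remain

    string repeated = "Jacob     Miller";     // five spaces again
    made = replace_and_count(repeated, "  ", " ", true);
    assert(made == 4);
    assert(repeated == "Jacob Miller");       // one space remains
}
```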

Now for some function code (we'll see those last few lines in a bit):

// removes spaces (' ', '\t', '\n', '\r', '\v') from the left end,
// right end, or both of from.  (or neither if you are weird
// enough to ask it to.)
void strip_spaces(string & from, EndType which_end)
{
    const string space_chars = " \t\n\r\v";
    string::size_type at;
    if (which_end == Left || which_end == Both)
    {
        at = from.find_first_not_of(space_chars);
        from.erase(0, at);
    }
    if (which_end == Right || which_end == Both)
    {
        at = from.find_last_not_of(space_chars);
        from.erase(at+1);            // erase to end of string
    }
    return;
}

If the caller requested we strip either at the left end or at both ends (which includes the left end, you see), we ask the string to find the first character it contains that isn't one of our spacing characters (space, tab, new-line, carriage return, and vertical tab). This is done with the find_first_not_of() function, oddly enough. (*grin*) Notice that 'at' has that funny type 'string::size_type'. Recall that all values that refer to either the size of a string or a position within a string are of this 'guaranteed to be big enough' type defined within the string class. That's where the string:: bit came from -- to say that size_type is inside of the string class (read it from right to left).

Once we've found the position of the first non-whitespace character in the string, we have that many characters removed from the beginning of the string by calling the erase() function. Note that the beginning of the string is position 0. I suppose they aren't really positions, after all. Truth-be-told, the 'positions' that the string functions deal with are really distances from the beginning of the string (sometimes also called offsets from the beginning of the string). So, the first 'position' is 0 away from the beginning of the string.

Here's an example to clarify a bit:

                             1111111111222222
    positions:     01234567890123456789012345
    string:       "    Ignacious   Harvey    "

This string contains 26 characters. The first four are spaces. So the position of the first non-whitespace character ('I') is 4. Since it is 4 places distant from the beginning, there must be 4 characters that precede it. Hence the position does double duty: it is an offset and a count of preceding characters! (This wouldn't have been possible if we'd counted positions from 1 like most people do.)

So, the call to erase() says to remove from position 0 (the very first character) the next 4 characters (since 'at' will hold the 4 for the position of 'I' here). We end up with:

                             111111111122
    positions:     0123456789012345678901
    string:       "Ignacious   Harvey    "

Now the string only has 22 characters and the leading spaces are gone!
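
Here is the same left-end example as a short check, using only the calls already shown in strip_spaces(). The function name check_left_strip() is just for this sketch.

```cpp
#include <cassert>
#include <string>
using std::string;

void check_left_strip()
{
    const string space_chars = " \t\n\r\v";
    string name = "    Ignacious   Harvey    ";
    assert(name.length() == 26);

    string::size_type at = name.find_first_not_of(space_chars);
    assert(at == 4);              // position of 'I' == count of leading spaces
    assert(name[at] == 'I');

    name.erase(0, at);            // start at offset 0, remove 'at' characters
    assert(name == "Ignacious   Harvey    ");
    assert(name.length() == 22);
}
```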

If the caller didn't ask us to remove from the left (or both), we'll skip this bit of code and proceed to the next check: stripping from the right end.

If the caller asked us to remove spaces from the right end (or both ends), we need to find the last character in the string that isn't one of our white-space characters (space, tab, new-line, carriage return, and vertical tab). Again, we find that the oddly named function 'find_last_not_of()' comes in handy here. It finds the last character in the string that called it that is not present in the string passed! Just what we needed...

For our example, find_last_not_of() will return 17 (the position of the 'y' in the last name). Now we call erase() again to request the part of the string following this position be removed. Note that this is done differently than the other call:

        from.erase(at+1);            // erase to end of string

The second argument to erase() is the number of characters to remove, and it defaults to string::npos, which erase() takes to mean 'everything from the starting position to the end of the string'. We could have specified the count directly:

        from.erase(at+1, from.length() - (at+1));    // erase to end of string

But that would have been extra work on our part and easy to get wrong: the second argument is a count of characters, not an ending position. Letting erase() fall back on its default says 'the entire remainder of the string' with no arithmetic at all.
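
A short sketch of the right-end half on its own, including the one case that looks like it ought to break: a string made entirely of spaces. The helper strip_right() is hypothetical; it is simply the right-hand branch of strip_spaces() working on a copy.

```cpp
#include <cassert>
#include <string>
using std::string;

// strips from the right end only; returns the trimmed copy
// (hypothetical helper built from the right-hand half of strip_spaces())
string strip_right(string from)
{
    const string space_chars = " \t\n\r\v";
    string::size_type at = from.find_last_not_of(space_chars);
    from.erase(at + 1);           // count defaults to string::npos
    return from;
}

void check_right_strip()
{
    string name = "Ignacious   Harvey    ";
    assert(name.find_last_not_of(" \t\n\r\v") == 17);   // the 'y'
    assert(strip_right(name) == "Ignacious   Harvey");

    // an explicit "to the end" count gives the same result
    string same = name;
    same.erase(18, string::npos);
    assert(same == strip_right(name));

    // all spaces: find_last_not_of() returns npos, and npos + 1
    // wraps around to 0, so the whole string is erased
    assert(strip_right("   ") == "");
}
```

The all-spaces case works only because size_type is an unsigned type, so adding 1 to npos (its largest value) wraps around to 0. It is worth knowing that the code leans on this.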

The next function is the replace_all() function:

// replaces all occurrences of this_stuff in in_here with
// with_this.  will *NOT* replace occurrences of this_stuff
// that are in with_this (or created by with_this abutting
// the regular string contents after replacement) unless
// forced to!
void replace_all(string & in_here, const string & this_stuff,
                 const string & with_this, bool force_within_with)
{
    string::size_type at, this_len = this_stuff.length(),
                      pos_jump = force_within_with ? 0 : with_this.length();
    at = in_here.find(this_stuff);
    while (at != string::npos)      // while it is not a non-position
    {
        in_here.replace(at, this_len, with_this);
        at = in_here.find(this_stuff, at + pos_jump);
    }
    return;
}

This one turns out to be relatively simple (well, compared to other code in this program). We simply find where this_stuff occurs in in_here and call replace() with that position and the length of this_stuff and the replacement text (with_this). Then we repeat that cycle until we can't find this_stuff anymore.

We use 3 variables of 'string::size_type': the position we found this_stuff at, the length of this_stuff, and how far to jump past that position before searching again (0 if the caller forced replacement within replaced text, the length of with_this otherwise). We store the length and the jump so that we don't have to continually call those functions. Since the arguments are by constant reference, we know the lengths won't change so this 'caching' is safe.

Next we look for this_stuff in in_here with the find() function. True to its name, it locates the argument string in the string that called it. If it cannot find the argument string at all, it returns the guaranteed invalid position string::npos (short for 'not-a-position' or 'no-position'). This constant value is defined within the string class and so the screwy notation again.
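
A tiny sketch of find() and npos by themselves, in case the 'non-position' idea is still fuzzy. The function name check_find() is only for this example.

```cpp
#include <cassert>
#include <string>
using std::string;

void check_find()
{
    const string name = "Jacob Miller";

    assert(name.find("Miller") == 6);            // offset of the 'M'
    assert(name.find("Smith") == string::npos);  // not present anywhere

    // npos is the largest value a size_type can hold, so it can
    // never be mistaken for a real position in a string
    assert(string::npos == static_cast<string::size_type>(-1));
    assert(string::npos > name.length());
}
```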

Anyway, while the last found position *does* exist (is not equal to the 'non-position'), we do the replacement requested. (Recall that c.replace(p,n,a) will replace the n characters of the string c starting at position p with the string a.) Once we've replaced as desired, we look for another occurrence of the text-to-be-replaced (this_stuff).

But did the caller desire replacements that happen [partially] within the just replaced text? We decided that once, up front, when the ternary/decision operator (?:) set pos_jump. Here we use it as we call find() with a second argument (which otherwise defaults to 0) to tell it where to start looking. That second argument is a position from which find() will search for the specified text. If the caller wanted replacement-within-replaced-text, pos_jump is 0 and we tell find() to search from the same position it last found the to-be-replaced text (this_stuff). (Note that a new occurrence cannot possibly be before this position or find() would have returned that position instead.) On the other hand, if the caller did *NOT* want replace-within-replaced-text behavior, pos_jump is the length of the replacement text. This way, new searches start just past the text we just 'inserted'.

And finally, that name-case'ing stuff. The prototype and typedef:

// changes next word of string s (in position range [start_at..end_before) )
// to 'name-case' (first capitalized and rest lower-case).  returns
// position after word.
// REQUIRED:  start_at is assumed to be the first character of the word!
//            if called with start_at pointing to white-space, we will
//            *NOT* be held responsible!!!
typedef string::size_type str_sz;
str_sz name_case(string & s, str_sz end_before, str_sz start_at = 0);

The 'typedef' is another way for programmers to introduce 'new' data types. In fact, it can really only specify extra names for existing types. Here, we use it to specify that the name 'str_sz' will represent the data type 'string::size_type'.
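
A brief sketch to show that the typedef really is just another name: a variable declared with str_sz and one declared with string::size_type are interchangeable. The variable and function names here are invented for the example.

```cpp
#include <cassert>
#include <string>
using std::string;

typedef string::size_type str_sz;     // extra name, same type

void check_typedef()
{
    string name = "Jacob";
    str_sz via_typedef = name.size();              // short spelling
    string::size_type via_full_name = name.size(); // long spelling

    // both variables have exactly the same type, so they mix freely
    assert(via_typedef == via_full_name);
    assert(sizeof(str_sz) == sizeof(string::size_type));

    str_sz none = string::npos;                    // npos fits, too
    assert(none == string::npos);
}
```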

But I digress... So, when using the replace(), insert(), find(), erase(), etc. family of functions, we use size_type positions to indicate offsets from the beginning of the string where actions should occur, to receive offsets where certain text is located, and even to indicate an invalid or nonexistent position (string::npos is of the data type string::size_type).

When, however, we need to process each character of a string in turn, we use a string::size_type in conjunction with the subscript operator to affect a character, in conjunction with the increment or decrement operator to move on to the next character, and then tested against .size() to stop just before the end of the string (at the last character in the string). Exactly how we do this is shown below.
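
As a warm-up for that, here is about the simplest walk along a string one could write: start at 0, stop at size(), and use the subscript operator to reach each character. The function shout() is a made-up example, not part of the program, and the cast to unsigned char is a defensive habit of this sketch rather than something the notes' code does.

```cpp
#include <cassert>
#include <cctype>
#include <string>
using std::string;

// hypothetical example: upper-cases every character of s in place
void shout(string & s)
{
    for (string::size_type on = 0; on != s.size(); ++on)
    {
        // the cast keeps toupper() safe for characters with negative values
        s[on] = static_cast<char>(
                    std::toupper(static_cast<unsigned char>(s[on])));
    }
}

void check_shout()
{
    string name = "Jacob Miller";
    shout(name);
    assert(name == "JACOB MILLER");
}
```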

Using it in main:

    str_sz last;

    //...other stuff...

    // name-case each 'name' in their name...
    last = name_case(whole_name, whole_name.size());
    while (last != whole_name.size())
    {
        ++last;
        last = name_case(whole_name, whole_name.size(), last);
    }

Here we declare a string::size_type via the 'str_sz' typedef from earlier (you must admit, it is *way* less typing). This size_type is to indicate where the last position for the last word of the name was (kind-of a double-meaning thing). All strings' first size_type (aka str_sz) position is 0 (hence the default on the name_case function's last argument). And the size() function reports the position just past the end of the string (since the positions are numbered from 0 -- a distance/offset from the beginning, remember -- the number of characters in the string is one more than the last possible position/subscript; in general, the number of characters that precede the one at position 'p' in a string is 'p').

On our first call to name_case(), we use size() and the default'ed 0 to tell it to start with the first word (first name) in the string and turn it to name-case. It returns a size_type of the character just after the first word (the space between names or size() if the user only entered a one-word name). Next we check if the 'last' position is the size() position of our string. If it isn't, we can do the next word (name). If it is, our loop ends and we proceed to print their fixed-up name.

To process the next word of our string, we increment the position we received from the previous call to name_case(). This is because that size_type position was to the space between the names and we need to tell name_case() where the next name starts -- which is one position past that space! (This is because, of course, our prior processing made sure there was one and only one space between the user's names. So incrementing from the space will place us at the first character of the next name.) We know that a next name exists because we wouldn't be here if we'd previously reached the end of the string (the loop would have ended). When that next word (name) is name-case'd, we store the end of it in 'last' again. Then we return to the top of the while to see if we've finished the whole string or not. Note how the size() of the string never changes and so is always the second argument to name_case(). (We could 'cache' this size_type value in a variable, but it is safer to ask for the size() again since we do not know if name_case() might change the content of the string in such a way as to change how long the string might be.)

So, here we've seen how to get size_type's of a string in the first place: 0 for the beginning and size() for the 'one past the end'. And, we've seen how to move a size_type from one character to the next (++; both pre and post work, by-the-way). When we look at the code for name_case() itself, we'll see how to affect the character at the position the size_type 'refers' to.

And, finally, the name_case() function's code:

// changes next word of string s (in position range [start_at..end_before) )
// to 'name-case' (first capitalized and rest lower-case).  returns
// position after word.
// REQUIRED:  start_at is assumed to be the first character of the word!
//            if called with start_at pointing to white-space, we will
//            *NOT* be held responsible!!!
str_sz name_case(string & s, str_sz end_before, str_sz start_at)
{
    str_sz on = start_at;
    if (start_at != end_before)
    {
        s[on] = toupper(s[on]);
        ++on;
        while (on != end_before && !isspace(s[on]))
        {
            s[on] = tolower(s[on]);
            on++;
        }
    }
    return on;
}

First we declare a local size_type to keep track of which character we are 'on' during our walk along the string. If the range of characters we've been asked to walk is not an empty range, we can upper-case the first character and then lower-case any remaining characters in the word. The word is considered to end with a space (or tab or...). Thus, while we haven't either reached the end of the string or found a space character, we lower-case the current character and then move to the next one.
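
To see the function and the loop from main working together, here is a sketch that wraps that loop in a helper. The helper name_case_all() is hypothetical; like the loop in main, it assumes the name has already been cleaned up so that words are separated by exactly one space with none at either end.

```cpp
#include <cassert>
#include <cctype>
#include <string>
using namespace std;

typedef string::size_type str_sz;

// name_case() as given above
str_sz name_case(string & s, str_sz end_before, str_sz start_at = 0)
{
    str_sz on = start_at;
    if (start_at != end_before)
    {
        s[on] = toupper(s[on]);
        ++on;
        while (on != end_before && !isspace(s[on]))
        {
            s[on] = tolower(s[on]);
            on++;
        }
    }
    return on;
}

// the loop from main, wrapped in a hypothetical helper; assumes words
// are separated by exactly one space and there are no spaces at the ends
string name_case_all(string whole_name)
{
    str_sz last = name_case(whole_name, whole_name.size());
    while (last != whole_name.size())
    {
        ++last;    // step over the single space between names
        last = name_case(whole_name, whole_name.size(), last);
    }
    return whole_name;
}

void check_name_case()
{
    string one = "jACOB";
    str_sz after = name_case(one, one.size());
    assert(after == 5);                        // position just past the word
    assert(one == "Jacob");

    assert(name_case_all("jacob mILLER") == "Jacob Miller");
    assert(name_case_all("") == "");           // empty range: nothing touched
}
```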

Using just the variable name 'on' is essentially saying 'the size_type position of the character we are working on'. To actually deal with the character itself, a new syntax is introduced: 's[on]'. The '[]' here is a binary operator that essentially is saying 'go to the thing that is right-hand/internal operand distant from the beginning of left-hand/external operand'. So, 's[on]' is saying 'the character of s that is on away from the beginning of s'. Or, more meaningfully, 'the character of s we are working on'.

So, when we initialize on to the beginning position, compare it with the stopping position, and increment it to the next position (see, I told you both pre and post increments would work!), we just use the name of the size_type variable. Filling in size_type's values, comparing two size_type's, and moving to the next size_type position are all actions on the size_type variable itself. However, checking if the character is a space and changing it to upper/lower-case are operations on the character at the position to which the size_type 'refers'. So, in those parts of the code we use '[]' with the size_type and the string within which it represents a position to indicate that we need to reach the actual object indicated by the size_type and not the object's relative position within the container.

We must make sure the word isn't empty before we process it. This is because it is an error to try to access the object that the size() position 'refers' to since it is just past the last element of the collection. Doing so would crash our program -- hopefully. (If our program didn't crash, it might keep running with stranger things happening as the result of this mis-step.)
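
If 'hopefully it crashes' feels too much like gambling, the string class also offers the at() function, which does the same job as '[]' but checks the position first and throws an out_of_range exception when it is not a valid one. The program above doesn't use it; this is only a sketch of the idea, and is_valid_position() is an invented helper.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
using std::string;

// returns true if position p can safely be subscripted in s
// (hypothetical helper using at(), which checks its argument)
bool is_valid_position(const string & s, string::size_type p)
{
    try
    {
        (void)s.at(p);            // throws out_of_range when p >= s.size()
        return true;
    }
    catch (const std::out_of_range &)
    {
        return false;
    }
}

void check_positions()
{
    const string name = "Jacob";
    assert( is_valid_position(name, 0));
    assert( is_valid_position(name, 4));             // the 'b'
    assert(!is_valid_position(name, name.size()));   // one past the end
}
```

The check costs a little time on every access, which is why name_case() can reasonably stick with '[]' once the empty-range test has been made.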