Notes for Name Cleaning string class Example

First, a side trip not related to strings at all: enumerations. The strip_spaces() function uses an argument of type 'EndType'. This is not a built-in type or a 'class' type (like cin's and cout's types or the string type). It is, however, a data type. It is one defined by an enumeration:

// for use with strip_spaces() (see).
enum EndType { Left, Right, Both, Neither };

This statement defines 'EndType' to be an 'enum'eration data type. It says that the type has four possible values: 'Left', 'Right', 'Both', and 'Neither'. So if we declare a variable, constant, or function argument of this data type, the only values that are valid to be stored or compared with would be these four. (It is kind of like the bool type and its two values true and false, except that we, the programmers, have defined it.)

Its use here is to allow the caller of strip_spaces() to specify whether they want the spaces stripped from the left end, right end, both ends, or neither end of the string. (I don't know why they would call a function for stripping spaces if they didn't want to strip any spaces, but it finishes out the complete set. It is also good style/form to provide all possible options for such a generically useful function.) Since we believe that the caller will most likely want spaces stripped from both ends of the string, we default the argument to 'Both'.
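Here is a minimal sketch of an enumeration used as a function argument with a default value. The enum is the one from the notes; the describe() helper is hypothetical and exists only for illustration.

```cpp
#include <string>

// Same enumeration as in the notes.
enum EndType { Left, Right, Both, Neither };

// Hypothetical helper (not part of the original program): turns an
// EndType value into a readable description.  The argument defaults
// to Both, just as strip_spaces()'s which_end argument does.
inline std::string describe(EndType which_end = Both)
{
    switch (which_end)
    {
        case Left:    return "left end";
        case Right:   return "right end";
        case Both:    return "both ends";
        case Neither: return "neither end";
    }
    return "unknown";    // not reached for the four values above
}
```

Calling describe() with no argument behaves like describe(Both), and a variable declared as 'EndType e = Left;' can be compared with == and != against the four names.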

Now the prototype of the strip_spaces function itself:

// removes spaces (' ', '\t', '\n') from the left end,
// right end, or both of s.  (or neither if you are weird
// enough to ask it to.)
void strip_spaces(string & s, EndType which_end = Both);

It takes a string by reference and an argument to tell it which_end the caller wishes spaces removed from. This argument, as mentioned above, defaults to Both.

Next a prototype for replace_all():

// for replace_all() (see).  a variety to assuage various
// programmers' tastes.
const bool REPEATED = true, WITHIN = true, FORCE = true,
           ONE_PASS = false, NO_CROSS = false;
// replaces all occurrences of this_stuff in in_here with
// with_this.  will *NOT* replace occurrences of this_stuff
// that are in with_this unless forced to!
void replace_all(string & in_here, const string & this_stuff,
                 const string & with_this, bool force_within_with = false);

This function exists because the string's own replace() function only replaces a single stretch of characters (identified by a position and a length) with another string. We will need (see below) to replace all occurrences of a certain thing with another thing (all tabs with a single space; all double'd spaces with a single space).

The last argument is a little confusing at first glance. Its purpose is to let the caller say what should happen when a replacement produces a new occurrence of this_stuff: should that new occurrence be replaced too, or left alone? The with_this string may actually contain this_stuff outright. Okay, so that probably doesn't happen very often, but the replacement may also abut against something in the original string to form the string to be replaced.

Still confused, eh? Let's see an example:

    in_here:         "this is my string and it has stuff in it"
    this_stuff:      "his "
    with_this:       "h"

Notice how when the "his " there at the beginning of in_here is replaced with "h", we end up with: "this my string...". Thus, the replacement has produced a new occurrence of the string to be replaced within the overall string. Technically we would not be doing our duty to replace *ALL* occurrences of "his " if we didn't replace this new one, as well. But, this isn't the behavior most programmers would expect. Most programmers I've worked with/taught would expect that replacement would be autonomous/atomic. That is, once replaced, that replaced text would not be considered again. But who are we to say that *NO* programmer will ever desire such repeated in-place replacement? Unqualified, that's who.

Remember, examples are there for clarity and may not make real-world sense. We are not here to come up with every possibility that could ever happen, but to give people options when we think something may prove useful someday. (We defaulted it to off, however, so that it won't get in anyone's way should they not need this feature.)

We'll see exactly how this should work below in the function definition, but at least we know what it means, now. (Don't worry if you don't. It only occurs once in this program, after all.)

Oh, also note the several constants defined above the prototype. These give the calling programmer different ways to state that they do/do not desire the 'repeated in-place replacement' feature. They could simply pass true/false, but words are often more clear than plain data values. (Hence the idea of the [named] constant and its solution to the 'magic number' dilemma.)

Next come a pair of functions for reading in a whole line of text into a string (not just space-delimited as >> would allow):

// reads a whole line into the string (including leading,
// internal, and trailing spaces/tabs).  avoids grabbing
// an empty line if called after extraction!
inline void get_line(string & s)
{
    cout.flush();    // peek doesn't print prompts on some systems
    if (cin.peek() == '\n')   // stray \n from prior extraction
    {
        cin.ignore();    // toss it!
    }
    getline(cin, s);   // now get the whole (fresh) line
    return;
}

This first version takes in the string variable to fill in from the caller. Then it makes sure cin is ready to get a whole-line string and that cout has prompted properly. Finally, it calls a function from the string library to actually read the whole line from cin into the string.

Whew! That was a chunk. Let's start at the bottom: getline(cin, s). This calls a library function that will get an entire line (from current reading position within the buffer to the next new-line character) and store it in the string object specified. The first argument is the input stream to read from. We want to read from cin, so that is what we specify. (We'll see next semester that there are other kinds of input streams that could be used -- files, for instance.)

But, there is a potential problem that creeps in when getline() is mixed with extraction (>>). Extraction leaves new-line characters within the buffer when it sees that as the end of the data that was requested. That is, when we ask cin to read a double and the user types:

    4.2<Enter>
    _

at the keyboard, the buffer will look like this:

    +---+---+---+----+---+---+---
    | 4 | . | 2 | \n |   |   | ...
    +---+---+---+----+---+---+---

Then cin will translate the double and remove its characters, leaving the buffer like this:

    +----+---+---+---+---+---+---
    | \n |   |   |   |   |   | ...
    +----+---+---+---+---+---+---

Note that the '4.2' part has been translated and stored in a variable, but the new-line character is just sitting there. Having served its purpose of ending translation of the double, it waits to be skipped before the next extraction.

However, getline() doesn't behave like extraction: it doesn't skip over spacing before real data because spaces are part of its valid data. It also considers the new-line character a special value and will simply declare its job done when it finds one.

So, if the above extraction were followed by our getline() call, we wouldn't get the information the user wanted to type, but rather an empty string! getline() would see the new-line character as the first data and say, "Hey, a new-line! I'm done. Here's your empty string." That isn't the behavior we desired. We wanted the program to stop and let the user type a string -- not skip it and move on with no real data!

To fix this, we ask cin what the next character is (peek(), remember?) and check if it is a new-line. If it is, then we must have been preceded by an extraction (because getline() will remove the new-line character that ends it). So we know bad things are about to happen and we tell cin to ignore that new-line. Now, if we were preceded by an extraction, we'll ignore() the '\n' character and if we weren't, we'll do nothing. Finally, we'll getline() the user's actual data!

Um...but what about that first statement? Well, some compilers (okay, most) take the idea of 'cout shall print its buffer when cin reads' quite literally. They only tell cout to print the current buffer contents when cin actually reads something (takes something out of *its* buffer). Since peek() doesn't remove anything from the buffer, it doesn't 'read' according to this definition. So, peek() won't force cout to print any waiting prompt. Eek! We must do so. The flush() function owned by cout fits our needs precisely. We can't read because we don't know what to read. We can't emit an endl because we don't want to ruin the caller's user interface with extra new-lines being printed. We certainly can't wait until cout is full or the end of the program! flush() will tell cout to print its current buffer -- NOW! (I'll leave the imagery to you...*bleah*)
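The stray new-line problem and the peek()/ignore() fix can be tried without a keyboard by reading from a string stream instead of cin. What follows is only a sketch: get_line_from() and the two line_after_number_...() helpers are hypothetical names, and the cout.flush() step is left out because no prompt is involved here.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical variant of get_line() that takes the stream to read as
// an argument (an assumption made so the idea can be tested).
inline void get_line_from(std::istream & in, std::string & s)
{
    if (in.peek() == '\n')    // stray \n from prior extraction
    {
        in.ignore();          // toss it!
    }
    std::getline(in, s);      // now get the whole (fresh) line
}

// Extracts a double and then reads a line with plain getline().
// The left-over new-line makes the line come back empty.
inline std::string line_after_number_naive(const std::string & typed)
{
    std::istringstream in(typed);
    double d = 0.0;
    std::string line;
    in >> d;
    std::getline(in, line);
    return line;
}

// Same sequence, but using the peek()/ignore() guard first.
inline std::string line_after_number_fixed(const std::string & typed)
{
    std::istringstream in(typed);
    double d = 0.0;
    std::string line;
    in >> d;
    get_line_from(in, line);
    return line;
}
```

With the simulated input "4.2\nJacob Miller\n", the naive version hands back an empty string while the guarded version hands back "Jacob Miller".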

This second version of our line-getting function doesn't require the caller to have a variable of their own to store the line into. (Note the previous incarnation took a reference argument.)

// companion overload if you don't need to store it...or just
// like the function return style of call...
inline string get_line(void)
{
    string s;
    get_line(s);
    return s;
}

This might prove useful if the caller simply needs to 'echo' the user's input back:

    cout << get_line() << '\n';

(I don't know why this might be needed; we are just providing the caller with options! Overloading with a string reference argument list and a void argument list, respectively, does this here.)

Note how the cout needs to print a new-line character separately since getline() doesn't store the new-line character that stopped its reading.

And in main:

    string whole_name;

    cout << "Enter your name:  ";
    whole_name = get_line();
    // or we could do:  get_line(whole_name);

We make a string variable and call one of the get_line() functions. (We even show how we could have called the other one, if we'd so chosen.)

Next, to process the user's name:

    strip_spaces(whole_name);  // remove stray spaces from both ends
    replace_all(whole_name, "\t", " ");  // replace any tabs with single
                                         // spaces
    replace_all(whole_name, "  ", " ", REPEATED);  // replace double spaces
                                                   // with single spaces --
                                                   // until they are all gone!

We remove all extra spaces from both the left and right ends of the string (note we used the default for the EndType argument). Next we replace any tab characters they've typed with a single space. Now for the fun line: we replace all double'd spaces with single spaces -- even ones we 'create' with our replacement!

Notice the following scenario:

    whole_name:  "Jacob     Miller"

The first name and last name are separated by 5 spaces. If we just called replace_all() normally (like we did for tabs), we'd end up with:

    whole_name:  "Jacob   Miller"

There are still 3 spaces in there! We wanted just one space between the names! Watch how this happens:

    whole_name:  "Jacob     Miller"
    found:             --
    ...replace...
    whole_name:  "Jacob    Miller"
    replacement:       _

Two became one -- so far so good. Next:

    whole_name:  "Jacob    Miller"
    found:              --
    ...replace...
    whole_name:  "Jacob   Miller"
    replacement:        _

It found the double space after the first replacement text of a single space -- not the double space formed with the replacement text and the following space. And on the next look for a double space, it'll find none because the only remaining double space is before the place it'll begin to look (the space just before Miller).

By requesting replacement within replaced text, we get this sequence of events instead:

    I    whole_name:  "Jacob     Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob    Miller"
         replacement:       _

   II    whole_name:  "Jacob    Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob   Miller"
         replacement:       _

  III    whole_name:  "Jacob   Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob  Miller"
         replacement:       _

   IV    whole_name:  "Jacob  Miller"
         found:             --
         ...replace...
         whole_name:  "Jacob Miller"
         replacement:       _

A total of four replacements are made and we've removed all the extra spacing between the names! (Hopefully the replace within replaced text thing is becoming a little clearer now...keep thinking about it...)

Now for some function code (we'll see those last few lines in a bit):

// removes spaces (' ', '\t', '\n') from the left end,
// right end, or both of from.  (or neither if you are weird
// enough to ask it to.)
void strip_spaces(string & from, EndType which_end)
{
    string::size_type at;
    if (which_end == Left || which_end == Both)
    {
        at = from.find_first_not_of(" \t\n");
        from.erase(0, at);
    }
    if (which_end == Right || which_end == Both)
    {
        at = from.find_last_not_of(" \t\n");
        from.erase(at+1);            // erase to end of string
    }
    return;
}

If the caller requested we strip either at the left end or at both ends (which includes the left end, you see), we ask the string to find the first character it contains that isn't a standard spacing character (space, tab, and new-line). This is done with the find_first_not_of() function, oddly enough. (*grin*) Notice that 'at' has that funny type 'string::size_type'. Recall that all values that refer to either the size of a string or a position within a string are of this 'guaranteed to be big enough' type defined within the string class. That's where the string:: bit came from -- to say that size_type is inside of the string class (read it from right to left).

Once we've found the position of the first non-whitespace character in the string, we have that many characters removed from the beginning of the string by calling the erase() function. Note that the beginning of the string is position 0. I suppose they aren't really positions, after all. Truth-be-told, the 'positions' that the string functions deal with are really distances from the beginning of the string (sometimes also called offsets from the beginning of the string). So, the first 'position' is 0 away from the beginning of the string.

Here's an example to clarify a bit:

                             1111111111222222
    positions:     01234567890123456789012345
    string:       "    Ignacious   Harvey    "

This string contains 26 characters. The first four are spaces. So the position of the first non-whitespace character ('I') is 4. Since it is 4 places distant from the beginning, there must be 4 characters that precede it. Hence the position does double duty: it is an offset and a count of preceding characters! (This wouldn't have been possible if we'd counted positions from 1 like most people do.)

So, the call to erase() says to remove from position 0 (the very first character) the next 4 characters (since 'at' will hold the 4 for the position of 'I' here). We end up with:

                             111111111122
    positions:     0123456789012345678901
    string:       "Ignacious   Harvey    "

Now the string only has 22 characters and the leading spaces are gone!

If the caller didn't ask us to remove from the left (or both), we'll skip this bit of code and proceed to the next check: stripping from the right end.

If the caller asked us to remove spaces from the right end (or both ends), we need to find the last character in the string that isn't one of the standard white-space characters (space, tab, and new-line). Again, we find that the oddly named function 'find_last_not_of()' comes in handy here. It finds the last character in the string that called it that is not present in the string passed! Just what we needed...

For our example, find_last_not_of() will return 17 (the position of the 'y' in the last name). Now we call erase() again to request the part of the string following this position be removed. Note that this is done differently than the other call:

        from.erase(at+1);            // erase to end of string

erase()'s second argument (the count of characters to remove) defaults to string::npos, which means 'everything up to the end of the string'. So, given just a starting position, erase() will erase from that position until the end of the string. We could have specified the count directly:

        from.erase(at+1, from.length()-(at+1));    // erase to end of string

But that would have been extra work on our part and no better than letting erase() realize, via its default, that we wanted the entire remainder of the string gone. (Note that the second argument is a count of characters, not an ending position.)
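The positions and lengths quoted in this walk-through can be checked with a small sketch. strip_spaces() is repeated from the notes so the block compiles on its own; stripped() is a hypothetical convenience wrapper that works on a copy.

```cpp
#include <string>

enum EndType { Left, Right, Both, Neither };

// Same logic as strip_spaces() in the notes.
void strip_spaces(std::string & from, EndType which_end = Both)
{
    std::string::size_type at;
    if (which_end == Left || which_end == Both)
    {
        at = from.find_first_not_of(" \t\n");
        from.erase(0, at);           // 'at' doubles as a count here
    }
    if (which_end == Right || which_end == Both)
    {
        at = from.find_last_not_of(" \t\n");
        from.erase(at+1);            // erase to end of string
    }
}

// Hypothetical helper: strips a copy and returns it, leaving the
// caller's string untouched.
inline std::string stripped(std::string s, EndType which_end = Both)
{
    strip_spaces(s, which_end);
    return s;
}
```

For "    Ignacious   Harvey    ", find_first_not_of() reports 4, stripping the left end leaves 22 characters, and find_last_not_of() on that result reports 17.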

The next function is the replace_all() function:

// replaces all occurrences of this_stuff in in_here with
// with_this.  will *NOT* replace occurrences of this_stuff
// that are in with_this unless forced to!
void replace_all(string & in_here, const string & this_stuff,
                 const string & with_this, bool force_within_with)
{
    string::size_type at, this_len = this_stuff.length(),
                      with_len = with_this.length();
    at = in_here.find(this_stuff);
    while (at != string::npos)      // while it is not a non-position
    {
        in_here.replace(at, this_len, with_this);
        at = in_here.find(this_stuff, at + (force_within_with?0:with_len));
    }
    return;
}

This one turns out to be relatively simple (well, compared to other code in this program). We simply find where this_stuff occurs in in_here and call replace() with that position and the length of this_stuff and the replacement text (with_this). Then we repeat that cycle until we can't find this_stuff anymore.

We use 3 variables of 'string::size_type': the position we found this_stuff at, the length of this_stuff, and the length of with_this. We store the lengths of these arguments so that we don't have to continually call those functions. Since the arguments are by constant reference, we know the lengths won't change so this 'caching' is safe.

Next we look for this_stuff in in_here with the find() function. True to its name, it locates the argument string in the string that called it. If it cannot find the argument string at all, it returns the guaranteed invalid position string::npos (short for 'not-a-position' or 'no-position'). This constant value is defined within the string class and so the screwy notation again.

Anyway, while the last found position *does* exist (is not equal to the 'non-position'), we do the replacement requested. (Recall that c.replace(p,n,a) will replace the n characters of the string c starting at position p with the string a.) Once we've replaced as desired, we look for another occurrence of the text-to-be-replaced (this_stuff).

But did the caller desire replacements that happen [partially] within the just replaced text? We check as we call find() with a second argument that tells it where to start looking. (That second argument has a default value of 0, which is why our first call could leave it off.) If the caller wanted replacement-within-replaced-text, we tell find() to search from the same position it last found the to-be-replaced text (this_stuff). (Note that it cannot possibly be before this position or find() would have returned that position instead.) We do this by adding 0 to at from the ternary/decision operator (?:). On the other hand, if the caller did *NOT* want replace-within-replaced-text behavior, we add the length of the replacement text to at from the ternary/decision operator (?:). This way, new searches start strictly after the text we just 'inserted'.
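To see the two search-start choices side by side, here is a sketch that runs both examples from these notes through each mode. replace_all() is repeated from above so the block stands alone; replaced() is a hypothetical wrapper that works on a copy.

```cpp
#include <string>

const bool REPEATED = true, ONE_PASS = false;

// Same loop as replace_all() in the notes.  Like the original, it
// assumes this_stuff is not empty; also, REPEATED never finishes if
// with_this itself contains this_stuff.
void replace_all(std::string & in_here, const std::string & this_stuff,
                 const std::string & with_this,
                 bool force_within_with = false)
{
    std::string::size_type at, this_len = this_stuff.length(),
                           with_len = with_this.length();
    at = in_here.find(this_stuff);
    while (at != std::string::npos)
    {
        in_here.replace(at, this_len, with_this);
        // search again from 'at' itself (forced) or from just past
        // the text we put in (not forced)
        at = in_here.find(this_stuff,
                          at + (force_within_with ? 0 : with_len));
    }
}

// Hypothetical helper: applies replace_all() to a copy and returns it.
inline std::string replaced(std::string in_here,
                            const std::string & this_stuff,
                            const std::string & with_this,
                            bool force_within_with = false)
{
    replace_all(in_here, this_stuff, with_this, force_within_with);
    return in_here;
}
```

With five spaces between the names, ONE_PASS leaves three of them and REPEATED leaves exactly one. For the "his " example, ONE_PASS stops at "this my string..." while REPEATED goes one step further and also eats the occurrence that the first replacement created.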

And finally, that name-case'ing stuff. The prototype and typedef:

// changes next word of string (represented by [start_at..end_before) )
// to 'name-case' (first capitalized and rest lower-case).  returns
// iterator to position after word.
typedef string::iterator str_itr;
str_itr name_case(str_itr start_at, str_itr end_before);

The 'typedef' is another way for programmers to introduce 'new' data types. In fact, it can really only specify extra names for existing types. Here, we use it to specify that the name 'str_itr' will represent the data type 'string::iterator'.
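As a quick check that a typedef only adds a name, the sketch below asks the compiler to confirm that the two names denote the same type. It assumes a C++11 (or later) compiler for static_assert and std::is_same; char_at() is a hypothetical helper.

```cpp
#include <string>
#include <type_traits>

typedef std::string::iterator str_itr;

// Compile-time check: str_itr is just another name for the same type.
static_assert(std::is_same<str_itr, std::string::iterator>::value,
              "str_itr should be an alias for std::string::iterator");

// Hypothetical helper declared with the alias.  Callers may pass a
// plain std::string::iterator, since it is the very same type.
inline char & char_at(str_itr it)
{
    return *it;
}
```

A variable declared as std::string::iterator can therefore be handed to char_at() with no conversion at all.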

So, what is a 'string::iterator', anyway? First, we note from the syntax that 'iterator' is defined inside the string class (like size_type was, remember?). As to its meaning, an iterator is used to iterate over a collection. In our case, a string is a collection of characters. So, when we want to loop through each character of the string for some reason, we use an iterator to indicate each character in turn.

The most common use of iterators in C++ is to iterate over the elements in a container (a string, again, contains many characters). The C++ convention is that when we specify iterator control of code, we must tell what iterated position to begin with and the iterated position just after where we want to end. In math terms, we'd be specifying [begin, end). So, our function name_case(), takes two iterator arguments which indicate, respectively, where we are being asked to start and the position before which we *must* end. We even return an iterator to the position before which we actually finish processing. (As the comments state, a call to name_case() changes the contents of only the very next word to be made to look like a name: first character capitalized and rest lower-case. So, we may not reach the ultimate stopping point at the end of the whole string, but we'll end after the next word is done. We return this iterator position so the caller can know where the end of that word was -- if they care.)

How do iterator positions differ from size_type positions? Well, positions stored in a size_type are some kind of integer indicating a distance from the beginning of the string (which is position 0 in this system). On the other hand, positions stored in iterator form can actually be used to affect the character at that position directly. It is sort-of like having a reference to the character within the string, but not exactly. It is *not*, however, an integer kind of thing. It is, in fact, a class type of its own (defined inside the string class! wild, no?). (At least it is on most compilers...it is left to the compiler how best to implement iterator types. Using them must adhere to a strict set of rules, but they can be defined data-wise in a variety of ways.)

But I digress... So, when using the replace(), insert(), find(), erase(), etc. family of functions, we use size_type positions to indicate offsets from the beginning of the string where actions should occur, to receive offsets where certain text is located, and even to indicate an invalid or nonexistent position (string::npos is of the data type string::size_type).

When, however, we need to process each character of a string in turn, we use a string::iterator to affect a character, move on to the next character, and then stop once we've handled the last character in the string (that is, when the iterator reaches the position just past it). Exactly how we do this is shown below.
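The two kinds of position can be mixed in one small task. This sketch uses a size_type offset to locate something and then iterators to change characters. mask_first_word() is a hypothetical function made up for this illustration; it is not part of the name-cleaning program.

```cpp
#include <string>

// Hypothetical helper: replaces every character of the first word
// with '*'.  find() reports where the word ends as a size_type
// offset; the characters are then visited and changed via iterators.
inline std::string mask_first_word(std::string s)
{
    std::string::size_type space_at = s.find(' ');
    if (space_at == std::string::npos)
    {
        space_at = s.length();       // no space: the word is everything
    }

    // offset -> iterator: move that many places from begin()
    std::string::iterator stop = s.begin() +
        static_cast<std::string::difference_type>(space_at);

    for (std::string::iterator it = s.begin(); it != stop; ++it)
    {
        *it = '*';                   // change the character itself
    }
    return s;
}
```

Here mask_first_word("Jacob Miller") gives "***** Miller": the offset told us where to stop, and the iterator did the actual changing.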

Using it in main:

    str_itr last;

    //...other stuff...

    // name-case each 'name' in their name...
    last = name_case(whole_name.begin(), whole_name.end());
    while (last != whole_name.end())
    {
        ++last;
        last = name_case(last, whole_name.end());
    }

Here we declare a string::iterator via the 'str_itr' typedef from earlier (you must admit, it is *way* less typing). This iterator is to indicate where the last position for the last name was (kind-of a double-meaning thing). How do we ask our string (whole_name) for iterators to its characters in the first place? We use the begin() and end() functions. begin() returns an iterator to the very first character in the string (or to end() if the string is empty). end() returns an iterator to the position just past the end of the string (recall our [begin, end) processing model).

On our first call to name_case(), we use begin() and end() to tell it to start with the first word (first name) in the string and turn it to name-case. It returns an iterator to the character just after the first word (the space between names or the end() if the user only entered a one-word name). Next we check if the 'last' position is the end() of our string. If it isn't, we can do the next word (name). If it is, our loop ends and we proceed to print their fixed-up name.

To process the next word of our string, we increment the iterator we received from the previous call to name_case(). This is because that iterator was to the space between the names and we need an iterator to the first character of the next name. Since our prior processing made sure there was one and only one space between the user's names, incrementing from the space will place us at the first character of the next name. We know that a next name exists because we wouldn't be here if we'd previously reached the end of the string (the loop would have ended). When that next word (name) is name-case'd, we store the end of it in 'last' again. Then we return to the top of the while to see if we've finished the whole string or not. Note how the end() of the string never changes and so is always the second argument to name_case(). (We could 'cache' this iterator in a variable, but it is safer to ask for the end() again since we do not know if name_case() might change the content of the string in such a way as to change where the end() iterator would be.)

So, here we've seen how to get iterators into a string in the first place: begin() and end(). And, we've seen how to move an iterator from one character to the next (++; both pre- and post-increment work, by the way. Pre-increment is just more commonly seen, since it doesn't need to hand back a copy of the old position). When we look at the code for name_case() itself, we'll see how to affect the character the iterator 'refers' to.

And, finally, the name_case() function's code:

// changes next word of string (represented by [start_at..end_before) )
// to 'name-case' (first capitalized and rest lower-case).  returns
// iterator to position after word.
// REQUIRED:  start_at is assumed to be the first character of the word!
//            if called with start_at pointing to white-space, we'll *not*
//            be held responsible!!!
str_itr name_case(str_itr start_at, str_itr end_before)
{
    str_itr on = start_at;
    if (start_at != end_before)
    {
        *on = toupper(*on);
        ++on;
        while (on != end_before && !isspace(*on))
        {
            *on = tolower(*on);
            on++;
        }
    }
    return on;
}

First we declare a local iterator to keep track of which character we are 'on' during our walk along the string. If the range of characters we've been asked to walk is not an empty range, we can upper-case the first character and then lower-case any remaining characters in the word. The word is considered to end with a space (or tab or...). Thus, while we haven't either reached the end of the string or found a space character, we lower-case the current character and then move to the next one.

Using just the variable name 'on' is essentially saying 'the iterator for/to the character we are working on'. To actually deal with the character itself, a new syntax is introduced: '*on'. The '*' here is a unary operator that essentially is saying 'go to the thing that this iterator references'. So, '*on' is saying 'the character we are working on'.

So, when we initialize on to the beginning position, compare it with the stopping position, and increment it to the next position (see, I told you both pre and post increments would work!), we just use the name of the iterator. Filling in iterator values, comparing two iterators, and iterating to the next iterator are all actions on the iterator itself. However, checking if the character is a space and changing it to upper/lower-case are operations on the character to which the iterator 'refers'. So, in those parts of the code we use '*' before the iterator name to indicate that we need to reach the actual object indicated by the iterator and not the object's relative position within the container.

We must make sure the word isn't empty before we process it. This is because it is an error to try to access the object that end() 'refers' to since it is just past the last element of the collection. Doing so would crash our program -- hopefully. (If our program didn't crash, it might keep running with stranger things happening as the result of this mis-step.)

The really interesting thing about this function is that it can change the contents of the caller's string without having a reference to the string itself. In fact, it has only copies of the iterators to the first position to work on and the position before which to stop working. Yet, still, the caller's string's contents are changed. This is because iterators are able to 'refer' to particular elements within a container (here particular characters within a string). The 'reference' is hidden inside the iterator itself (inside the string::iterator object -- or the str_itr object here). This can be confusing when you first start out, but try to get used to it. You could have passed the iterators by reference, but that would have caused problems back in main() (the temporary iterators returned by begin() and end() cannot be bound to non-constant reference arguments) and been needless since all iterator types are meant to be small and efficient to copy.
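Putting the pieces together, here is a sketch of the whole name-casing pass as one function. name_case() follows the logic shown above, with one addition that is my assumption rather than part of the original: the characters are cast to unsigned char before being handed to toupper(), tolower(), and isspace(), which guards against negative char values. name_case_all() is a hypothetical wrapper around the loop from main.

```cpp
#include <cctype>
#include <string>

typedef std::string::iterator str_itr;

// Logic of name_case() from the notes; the unsigned char casts are an
// added precaution (an assumption, not in the original listing).
str_itr name_case(str_itr start_at, str_itr end_before)
{
    str_itr on = start_at;
    if (start_at != end_before)
    {
        *on = static_cast<char>(
                  std::toupper(static_cast<unsigned char>(*on)));
        ++on;
        while (on != end_before &&
               !std::isspace(static_cast<unsigned char>(*on)))
        {
            *on = static_cast<char>(
                      std::tolower(static_cast<unsigned char>(*on)));
            ++on;
        }
    }
    return on;
}

// Hypothetical wrapper: the loop from main, applied to a copy.  As in
// the notes, it expects words separated by single spaces.
inline std::string name_case_all(std::string whole_name)
{
    str_itr last = name_case(whole_name.begin(), whole_name.end());
    while (last != whole_name.end())
    {
        ++last;    // step over the space to the next word
        last = name_case(last, whole_name.end());
    }
    return whole_name;
}
```

Given "jACOB mILLER", name_case_all() returns "Jacob Miller"; an empty string comes back unchanged because the empty-range check in name_case() keeps us from touching end().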