Input More In Depth

How cin Sees the World

cin has a fairly simple view of the world. Characters enter its world from the keyboard (typed by the user) and leave to enter the program in some form. Sometimes the characters are to stay characters. Most times, however, they are to be translated into numeric values. For all this to work, cin has three basic structures: a buffer of characters, a position within that buffer, and an accumulator to build numeric values from characters.

The buffer is a sequence of characters which is retrieved from the 'boat' each time the user hits <Enter>. When that happens, the position is set to the first character the user typed. The accumulator is only used when the extraction is for a numeric type (double, short, etc.).

Let's look at a couple of situations:

    cin Statement           User Types    buffer                 position
   ----------------------------------------------------------------------
    cin >> c1                ant           'a','n','t','\n'       0
        >> c2
        >> c3;
   ----------------------------------------------------------------------
    cin >> num;              42            '4','2','\n'           0
   ----------------------------------------------------------------------
    cin >> flNum;            3.4           '3','.','4','\n'       0

Note how each example's buffer ends in a new-line character. This represents the user having typed Enter to end their input. Also note that the starting position is 0. This is the computer's way of saying, "I'm 0 positions away from the beginning of the buffer." So the first character of input is actually in position 0. The second is in position 1. Etc.

In the first situation, cin sees that it is reading 3 char values (although it actually just sees this one char at a time). It will perform the following actions:

    extracting c1 (char):     buffer at position:  'a'
                              store 'a' in c1
                              increment position:  1
    extracting c2 (char):     buffer at position:  'n'
                              store 'n' in c2
                              increment position:  2
    extracting c3 (char):     buffer at position:  't'
                              store 't' in c3
                              increment position:  3

Each time a char is read, the position advances to remove that char from further consideration.
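
To make the buffer-and-position idea concrete, here is a toy model of it. The names ToyInput and extract_char are made up for this sketch, and the real cin does more (it skips leading whitespace, for one), so treat it as a picture of the idea rather than how your library is written.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for cin's buffer and position (hypothetical names, not the real istream).
struct ToyInput {
    std::string buffer;        // characters delivered when the user hits <Enter>
    std::size_t position = 0;  // how far we are from the beginning of the buffer
};

// Extract one char: take the character at the current position, then advance
// so that character is removed from further consideration.
char extract_char(ToyInput& in) {
    assert(in.position < in.buffer.size());  // sketch only: no refill from the keyboard
    return in.buffer[in.position++];
}
```

Calling extract_char three times on a buffer holding 'a','n','t','\n' hands back 'a', 'n', and 't' in turn and leaves the position at 3, just like the trace above.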

Let's look at the second situation, where we are reading an integer-typed number:

    extracting num (short):   reset accumulator:        0
                              in buffer at position:    '4'
                              change '4' to 4
                              shift accumulator by 10:  0
                              add 4 to accumulator:     4
                              increment position:       1
                              in buffer at position:    '2'
                              change '2' to 2
                              shift accumulator by 10:  40
                              add 2 to accumulator:     42
                              increment position:       2
                              in buffer at position:    '\n'
                              store accumulator to num: 42

This one was much more complicated! However, the basic idea is that cin reads digit by digit until it reaches something that is not a digit — the '\n' here. At each character, it changes the digit to a number (more on that later) and adds it to the accumulator after multiplying the current value by 10 (to represent the positions within the base 10 number it is reading).
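
The accumulator steps above can be sketched in a few lines. The function name extract_int is hypothetical, and the sketch assumes no leading sign, no leading whitespace, and no overflow, all of which the real cin has to worry about. It also uses the subtract-'0' trick to change a digit into a number, which is one common way to do it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy version of the digit-by-digit accumulation described above.
// Assumes an unsigned value with no leading whitespace; ignores overflow.
int extract_int(const std::string& buffer, std::size_t& position) {
    int accumulator = 0;                                    // reset accumulator
    while (position < buffer.size() &&
           std::isdigit(static_cast<unsigned char>(buffer[position]))) {
        int digit = buffer[position] - '0';                 // change '4' to 4
        accumulator = accumulator * 10 + digit;             // shift by 10, then add
        ++position;                                         // increment position
    }
    return accumulator;                                     // caller stores this in num
}
```

With the buffer '4','2','\n' and a starting position of 0, this returns 42 and leaves the position at 2, sitting on the '\n'.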

The process on the third example is similar, but when it reaches the '.', it begins to count places and then shift back at the end. (This is one way...others abound. Perhaps you could work out a couple of approaches to increase your understanding.)
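
Here is a sketch of the count-places-and-shift-back approach, in case you want something to compare your own attempt against. The name extract_double is hypothetical, and the sketch assumes no sign, no exponent, and no leading whitespace. It is only one of the several possible approaches, not necessarily the one your library uses.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy floating-point extraction: accumulate every digit as if the '.' were
// not there, count the digits seen after the '.', then shift back at the end.
double extract_double(const std::string& buffer, std::size_t& position) {
    double accumulator = 0.0;
    int places = 0;            // digits seen after the decimal point
    bool seen_point = false;
    while (position < buffer.size()) {
        char ch = buffer[position];
        if (std::isdigit(static_cast<unsigned char>(ch))) {
            accumulator = accumulator * 10 + (ch - '0');
            if (seen_point) ++places;
        } else if (ch == '.' && !seen_point) {
            seen_point = true; // start counting places
        } else {
            break;             // anything else ends the number
        }
        ++position;
    }
    for (int i = 0; i < places; ++i) {
        accumulator /= 10.0;   // shift back one place per counted digit
    }
    return accumulator;
}
```

For the buffer '3','.','4','\n' the accumulator reaches 34 with one place counted, so a single division by 10 gives 3.4.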

When Good cin Statements Read Bad Data

Problems Reading Characters?

What happens when you are expecting a '(' or a letter and instead get a number? Nothing. Anything you can type at the keyboard can be stored in a char.

Problems Reading Numbers? (Part I)

What happens when you are expecting a number (integer or floating-point) and the user types a non-numeric character? cin gets VERY upset! And when cin gets upset, no one gets any more data from the keyboard!

The variable you were attempting to extract into is left unaltered. This could leave a previous value in it or it could be the garbage left there from before your program loaded into memory. (Compilers that follow the C++11 standard or later store 0 in the variable instead.) Whatever the case, the data you expected is NOT there.

Any further attempts to read into variables will be instantly ignored — more unaltered variables.

This is a terrible state of affairs, but it just can't be helped (until later).

(Oh, and the offending character(s) are left in the input buffer (aka cin's boat) to be read later — if you dare!)
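
You can watch the fit happen without touching the keyboard. The sketch below uses an istringstream as a stand-in for cin, and the names BadReadReport and demo_bad_read are made up for illustration. It calls clear() only so we can peek at what was left behind; dealing with failures properly is the "later" topic.

```cpp
#include <sstream>
#include <string>

// What we observed while trying to read two ints from bad input.
struct BadReadReport {
    bool first_failed;     // did the first extraction upset the stream?
    bool second_failed;    // was the second extraction ignored too?
    std::string leftover;  // first word still sitting in the buffer afterwards
};

BadReadReport demo_bad_read(const std::string& typed) {
    std::istringstream in(typed);  // stands in for cin
    int first = -1, second = -1;
    BadReadReport report;
    in >> first;                   // meets a non-numeric character: fails
    report.first_failed = in.fail();
    in >> second;                  // stream is still upset: ignored
    report.second_failed = in.fail();
    in.clear();                    // calm the stream down, just to look around
    in >> report.leftover;         // the offending characters are still there
    return report;
}
```

Feeding it "oops 42" reports both reads as failed and finds "oops" still waiting in the buffer, even though a perfectly good 42 was sitting right behind it.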

So what's a non-numeric character? Well, anything that isn't valid.

Okay, smarty, what's valid? For integers, there can be a leading sign (+ or -) and then a sequence of digits (0-9). Floating-point values also can have a leading sign and a sequence of digits. But after that, they can have a decimal point followed by more digits. They can also have scientific notation (e or E followed by a power assumed to be of base 10). And the power for the scientific notation can also have a leading sign.

The rules are:

    integers:  [+-]?[0-9]+

    floating-point:  [+-]?([0-9]+(\.[0-9]*)?|[0-9]*(\.[0-9]+))([eE][+-]?[0-9]+)?

Say what?!? Oh...sorry...some explanation of notation. The above are known as regular expressions. They are prevalent on *nix systems in many forms. They are used in searching through text for patterns. The [] enclose groups of characters any one of which can be present. A - between two characters inside [] indicates a range (so you don't have to type out all of the values; so [a-z] would be any lower-case letter). A ? indicates that the pattern preceding it may be present or missing. () are used to group together sequences of patterns. A + after a pattern means that there must be at least 1 of this, but there can be as many more as we like. Normally a . matches any single character, so \. is used to mean we want to match an actual . in the text. A * after a pattern is like a +, except there can be 0 or more instead of at least 1. A | between patterns means either the left one or the right one can match.

Unh-hunh...er... Say what?!? Well, like we said, an integer can have a leading + or - sign or not and then has a sequence of digits (at least one or there isn't much of an integer here, eh?). So the pattern says: [+-]?[0-9]+. That is, either a +, a -, or neither of those followed by 1 or more of the digits (0, 1, 2, ... 9).

Admittedly, the second pattern for the floating point values is a tad more tricky. But it follows the same principles. Let's break it down:

    floating-point:  [+-]?            # optional sign (like on integer)
                     (                # group for number itself
                        [0-9]+        # either 1 or more digits
                        (\.[0-9]*)?   #        possibly followed by a . and
                                      #                 0 or more digits
                      |               #   OR
                        [0-9]*        #        0 or more digits
                        (\.[0-9]+)    #        followed by a . and
                                      #                    1 or more digits
                     )                # one of these has to be there!
                     (                # group for scientific notation
                        [eE]          # either an e or an E
                        [+-]?         # optional sign (as int and above)
                        [0-9]+        # 1 or more digits
                     )?               # sci-not is optional

Here I've split up the overall pattern into its components. The text after each # is a comment on what that piece is matching. (Normally a regex, short for regular expression, cannot contain spaces or comments. This is for your edification.)

Some examples:

    123456        integer
    -123          integer  (leading -)
    +123          integer  (leading +)
    123.456       floating point  (fraction)
    123.456e2     floating point  (fraction and e)
    123.456e-2    floating point  (fraction and signed e)
    123.456e+2    floating point  (fraction and signed e)
    +123.456      floating point  (sign +)
    -123.456      floating point  (sign -)
    +123.456e2    floating point  (signed, fraction, and e)
    +123.456e-2   floating point  (signed, fraction, and signed e)
    -123.456e+2   floating point  (signed, fraction, and signed e)
    123e2         floating point  (e)
    123e-2        floating point  (signed e)
    123e+2        floating point  (signed e)
    -123e2        floating point  (signed and e)
    +123e-2       floating point  (signed and signed e)
    -123e+2       floating point  (signed and signed e)
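
If you'd like to try the two patterns on your own strings, the standard <regex> library can run them as written. To be clear, cin does not use a regex engine to read numbers; this sketch (with the hypothetical name classify) just tests whether a whole string fits one of the patterns above.

```cpp
#include <regex>
#include <string>

// Classify a string using the two patterns from the text.
// The integer pattern is tried first because every integer also fits
// the floating-point pattern (its fraction and exponent are optional).
std::string classify(const std::string& text) {
    static const std::regex integer_re(R"([+-]?[0-9]+)");
    static const std::regex floating_re(
        R"([+-]?([0-9]+(\.[0-9]*)?|[0-9]*(\.[0-9]+))([eE][+-]?[0-9]+)?)");
    if (std::regex_match(text, integer_re)) return "integer";
    if (std::regex_match(text, floating_re)) return "floating point";
    return "neither";
}
```

For example, classify("-123") gives "integer", classify("123.456e-2") gives "floating point", and classify("abc") gives "neither".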

Problems Reading Numbers? (Part II)

Please also note that newer (but not the newest) compilers will have a similar fit of failure when they encounter a number too large (or too small) to fit into a particular type of variable. For instance, trying to read 40000 into a short integer will result in a little cin fit and no data being stored in your variable. (Compilers that follow the C++11 standard or later still throw the fit, but they also store the closest possible value to the one entered. In the 40000 example, cin would be upset but also store 32767 in the variable. Similarly, -40000 would result in a fit and -32768 being stored. These limits assume the usual 16-bit short.)
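
Here is a small way to see which camp your compiler falls into. An istringstream stands in for cin again, and the names ShortRead and read_short are hypothetical. The variable starts at -1 so you can tell "left alone" apart from "closest value stored".

```cpp
#include <sstream>
#include <string>

// Result of trying to read one short from a line of text.
struct ShortRead {
    bool failed;   // did the stream have a fit?
    short value;   // what ended up in the variable
};

ShortRead read_short(const std::string& typed) {
    std::istringstream in(typed);  // stands in for cin
    short value = -1;              // sentinel: still -1 means "left unaltered"
    in >> value;
    ShortRead result;
    result.failed = in.fail();
    result.value = value;
    return result;
}
```

read_short("40000") should report a failure either way. What sits in value afterwards depends on the compiler: the largest short on one following the newer rules, or the -1 sentinel on an older one.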

Not All Numbers Were Created in Decimal (Base 10)

Older (less compliant) compilers may support an older input standard. This section does not apply to our compiler, but may to your home compiler (Visual, etc.).

Unfortunately for us, the computer sees the world in binary: base 2. And base 2 translates easily into base 8 and base 16; each of which has proved useful to various parts of the computer industry. Base 10 is not used much by hardcore computer technology people. So, cin supports this view by making any numeric value beginning with a leading 0 a candidate for being base 8 or base 16. If the next char is an 'x', then conversion is done to base 16, otherwise conversion is done to base 8.

This all seems very complicated and difficult. Yes, it is. However, most people won't run into this trouble because very few end-users type leading 0's. Some people do, though: they want their numbers to all line up nice and neat and type in things like:

In this case, things aren't too awful. But instead of their third value being 23, it will be 19 (base 8, remember). Things can be worse:

Here the third value is translated as 2! Translation stops at the digit '9' because that cannot be part of an octal (base 8) number. Then the fourth value is 9 and the fifth value is 103. Hmm...and they thought they only entered four numbers. *sigh*
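
If your compiler does not do this detection by default, you can still see the effect. The sketch below assumes a standard-conforming library, where clearing the basefield flags asks the stream to detect the base from the prefix. The function name read_autodetect is made up, and an istringstream stands in for cin.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Read one int with base detection switched on.
// Assumption: with no dec/oct/hex flag set, a standard stream picks the
// base from the prefix (leading 0 means octal, leading 0x means hex).
int read_autodetect(const std::string& typed) {
    std::istringstream in(typed);         // stands in for cin
    in.unsetf(std::ios_base::basefield);  // clear the dec/oct/hex flags
    int value = 0;
    in >> value;
    return value;
}
```

Here read_autodetect("23") is 23, but read_autodetect("023") is 19 because the leading 0 makes it octal, and read_autodetect("0x17") is 23 because the 0x makes it hexadecimal.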

To alleviate this minor difficulty, you can use a manipulator to force cin to always translate into decimal (base 10). Here is a code snippet for doing this:

    #include <iomanip>   // manipulators that take arguments; dec itself comes with <iostream>

    cin >> dec;          // sets up decimal conversion only

The new #include should go at the top of your program with any others (like for iostream). The cin line should be placed in your main — right after the variable declarations — before any other code has been seen.

In case you want the octal or hexadecimal conversion someday, you can turn them back on by using the manipulators oct and hex, respectively. However, you can't go back to the automatic detection scheme without a bit more effort. (That would be a topic for CSC122. *shrug*)
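
To see the three manipulators side by side, here is a sketch that reads the same kind of text under each one. The helper name read_with_base is hypothetical, and an istringstream stands in for cin so the input is fixed.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Read one int after applying a base manipulator (std::dec, std::oct, or std::hex).
int read_with_base(const std::string& typed,
                   std::ios_base& (*base)(std::ios_base&)) {
    std::istringstream in(typed);  // stands in for cin
    int value = 0;
    in >> base >> value;           // same shape as: cin >> dec >> value;
    return value;
}
```

With std::dec, "023" is read as plain old 23. With std::oct the same text becomes 19, and with std::hex the text "ff" becomes 255.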