Searching vectors

NOTE:

These are a little sketchy...I'll fill in details 'soon'.

Where'd That Data Go?

Sometimes you have a vector with data in it, but you want to know either if a value is in the vector or where it is within the vector. When such a situation comes up, you need to search for (find, locate) the value among the vector's contents.

The easiest way is to simply start at the beginning of the vector and move along until you either find the value or you run out of elements to look at. This is termed linear search (since you are searching in a 'line').

Linear Search

As stated, you start looking for your target value at the beginning of the vector, moving to subsequent values until you either find it or you run off the end of the vector:

    i = 0;
    while (haven't found it and still more to look through)
    {
        i++;
    }

(Note the logical negation of the stopping condition.) Now to make it more C++-like:

    i = 0;
    while (vec[i] != value && i != vec.size())
    {
        i++;
    }

So, when this loop ends, either the value is located at position i or it simply isn't there. When it isn't there, i will have the value vec.size() to indicate a position outside of (just after) the vector.

The only other thing we should do is reverse the order of the &&'s operands. As they are, we might accidentally subscript by the size() to access that position's [lack of a] data value -- [possibly] crashing the program. If we reverse them, the != size() test would fail and short-circuit the && -- saving us from certain doom!

    i = 0;
    while (i != vec.size() && vec[i] != value)
    {
        i++;
    }

Even better! Now we can simply encapsulate this search in a function! Our function's prototype:

    // Finds the indicated value within the given vector between
    // the specified positions.  Note that the end position is
    // *NOT* included in the search!  The bounds are as has become
    // C++ standard:  [start_at..end_before).
    //
    // Returns either the size_type position at which find_this
    // was located or end_before to indicate the value wasn't
    // present.
    vector<short>::size_type
    linsearch(const vector<short> & vec, short find_this,
              vector<short>::size_type end_before,
              vector<short>::size_type start_at = 0);

    // (overloaded helper function to linear search an entire vector)
    inline
    vector<short>::size_type
    linsearch(const vector<short> & vec, short find_this)
    {
        return linsearch(vec, find_this, vec.size());
    }

(Note the inline helper function! This provides flexibility for our caller at little outlay of effort on our part.)

And its definition:

    // Finds the indicated value within the given vector between
    // the specified positions.  Note that the end position is
    // *NOT* included in the search!  The bounds are as has become
    // C++ standard:  [start_at..end_before).
    //
    // Returns either the size_type position at which find_this
    // was located or end_before to indicate the value wasn't
    // present.
    vector<short>::size_type
    linsearch(const vector<short> & vec, short find_this,
              vector<short>::size_type end_before,
              vector<short>::size_type start_at)
    {
        vector<short>::size_type k = start_at;
        while (k != end_before && vec[k] != find_this)
        {
            k++;
        }
        return k;
    }

Nice...now it's ready to be placed in a library. (See the notes at the end of the list management section for more on making vector libraries more generic.)

But...

This algorithm works fine. But it is terribly slow. On the vectors you are likely to use this semester, it will perform from admirably to reasonably well. And, in general, it is the only way to locate a value in a vector. But, let's take a look at its performance, anyway:

end_before-start_at (i.e. vec.size())	Average Iterations Until Found	Iterations Until NOT Found
8	4	8
64	32	64
1024	512	1024
4096	2048	4096
1048576	524288	1048576

When you have more data this algorithm takes lots and lots more time. In general, if you have n pieces of data, it will take n/2 iterations before you find the value you seek (or n until you realize it isn't there).

Can We Do Better? (aka Binary Search)

After some research, we found that when a vector of values is in order (ascending, descending: it doesn't matter which way), we can do MUCH better:

end_before-start_at (i.e. vec.size())	Average Iterations Until Found	Iterations Until NOT Found
8	4	4
64	7	7
1024	11	11
4096	13	13
1048576	21	21

That's a phenomenal decrease in time!!! In general, for a vector of n elements, it will take:

    |log  n| + 1
    |_  2 _|

Or floor(log₂(n))+1 iterations to find the value (or realize it isn't there). (No, there isn't a log2 function in the cmath library, please see the college algebra notes if you need a refresher of logarithm base conversions.)

How does it work?

Let's assume for now that the vector is in ascending order (lowest to highest). By looking first in the middle of the vector, we can immediately tell whether the value we seek is in the first or second half of the vector -- it must be either higher or lower than the middle value (if they aren't already equal, that is!).

Basically this is the one dimensional hotter-colder game (aka higher-lower) applied to locating a value in an ordered list. But, being computer scientists, we couldn't very well call it "hotter-colder search" or "higher-lower search", now could we? It just didn't sound very 'professional'. *grin*

Since we cut the vector in half each time, we called it binary search. (It kinda went with our whole base-2 computer thing, too. *grin*)

So how do we code it? We start with the entire vector as our list:

    bottom = 0;
    top    = vec.size()-1;
    middle = (top+bottom)/2;

Note how the middle is found with integer division -- truncating any decimal (0.5) away. This is on purpose. If our vector has 5 items, they are [0..4] and the middle is 2. But if our vector has 6 items, they are [0..5]. This list technically has 2 middles -- positions 2 and 3! We've arbitrarily chosen the first of them -- position 2. This doesn't split the list equally, but that's okay. This still ends up being close enough. (As the size of the list grows, this off-by-one split becomes more and more negligible ... or less and less significant, if you prefer.)

If the middle element is not the value we seek, we decide which half of the list it is in:

    if (vec[middle] < value)
    {
        // value must be in top half!
    }
    else
    {
        // value must be in bottom half!
    }

If the value is in the top half of the list, then our new bounds will be:

    bottom = middle+1;
    top    = vec.size()-1;
    middle = (top+bottom)/2;

Note how top doesn't move but bottom does. bottom even moves above the current middle -- since that value wasn't the one we wanted, we can exclude it from further consideration!

What about the other case -- when the value is less than the middle element? That places our further searching in the bottom half. This gives us new bounds of:

    bottom = 0;
    top    = middle-1;
    middle = (top+bottom)/2;

Again, the bottom stays the same, but the top has shifted down below the old middle.

This process continues until we either find our target value or... Hmm, how can we tell we're done? Well, as long as top-bottom+1 -- the number of things in the range we are searching, remember? -- is non-negative, we still have a place to search:

    top-bottom+1 >= 0
    top-bottom >= 1
    top-bottom > 0
    top > bottom

So when the top drops below the bottom, we are out of elements to search through. And when the top equals the bottom, we have only one element left to search.

So, here's our algorithm:

    bottom = 0;
    top = vec.size()-1;
    middle = (top+bottom)/2;
    while (top > bottom && vec[middle] != value)
    {
        if (vec[middle] < value)
        {
            bottom = middle+1;
        }
        else
        {
            top = middle-1;
        }
        middle = (top+bottom)/2;
    }

Note that we only need to adjust the end boundary that changes within the if of the while. But at the bottom of the while, we always update the middle position -- since one of its determining ends must have changed.

At the end of this loop, either middle element is the value we sought or the value is not in the vector.

Let's encapsulate this into a function for easier re-use! Here's the prototype:

    // find the indicated value within the given vector between
    // the start_at and end_before (end_before not included).
    // returns either the index at which find_this was located
    // or end_before to indicate the value wasn't present.
    vector<char>::size_type
    binsearch(const vector<char> & vec, char find_this,
              vector<char>::size_type end_before,
              vector<char>::size_type start_at = 0);
    inline
    vector<char>::size_type
    binsearch(const vector<char> & vec, char find_this)
    {
        return binsearch(vec, find_this, vec.size());
    }

(Also note the inline helper function to provide potential callers with extra flexibility!)

And definition:

    // find the indicated value within the given vector between
    // the start_at and end_before (end_before not included).
    // returns either the index at which find_this was located
    // or end_before to indicate the value wasn't present.
    // (uses binary search algorithm.)
    vector<char>::size_type
    binsearch(const vector<char> & vec, char find_this,
              vector<char>::size_type end_before,
              vector<char>::size_type start_at)
    {
        vector<char>::size_type top = end_before-1, bottom = start_at,
                                      middle = (top+bottom)/2;
        if (end_before != 0)
        {
            while (top > bottom && vec[middle] != find_this)
            {
                if (vec[middle] < find_this)
                {
                    bottom = middle+1;
                }
                else
                {
                    top = middle;
                }
                middle = (top+bottom)/2;
            }
        }
        return end_before == 0 ||        // empty vector
               top < bottom ||           // not == to find final element
               vec[middle] != find_this  // didn't find it
                 ? end_before            // error return
                 : middle;               // found it!
    }

The if is there to make sure we don't try to search an empty vector. We could have just as easily used an vec.empty() check.

(The top < bottom check may seem un-necessary, but it is there to make a cheaper determination than comparing the middle element to the search target. Cheaper than comparing characters?! Well, no one said we could only search vectors of char, now did they? Other types of data may be more expensive to compare -- string, for instance... Or almost any of the class types we may have created -- comparing that two Dates are/not equal can be fairly expensive as well...)

Again, you should check out the notes on making generic vector libraries under the list management section for more information on how to place these functions in a library.

And How Did That Data Get In Order?

Well, now that we've searched out our value -- and done it really quickly, we might want to not rely on the end-user to enter their data in order in the first place. What we need is an algorithm (or two...) to sort values into order. Maybe that's a story for another day...

Stray Bits

BTW, the number of iterations it takes to find a value in a vector on average is called the "average case time complexity". The number of iterations it takes to realize a value is not in a vector is called the "worst case time complexity".
Linear search can be performed from 'tail' to 'head', too. This finds the last value in a vector that matches (when a vector can contain duplicate entries, this might be important). (Just watch out that size()-1 is the position of the last element...not size()!)
If the sort order of the vector for binary search is reversed (largest to smallest), you can simply change the < to a > in the if statement (the one inside the loop there). Everything else can remain the same!
The logarithm's base is 2 in the binary search because of the fact that we are dividing the list in two parts!

Fine Tuning

But wait! There's more!

We can actually shave a few microseconds off the binary search and gain immense re-usability bonuses all in one sweeping stroke. But how?!

Simply following this 2043 part recipe:

    0 1 2 3 4 5 6 7
    A B C D E F G H

  f B
    b A   t H   m E
    ...   t E   m C
    ...   t C   m B
  done -- fixed: but b still < t...eek!

  f H                        f H    (t-b)/2+b
    b A   t H   m E            b A   t x   m E
    b E   ...   m G            b E   ...   m G
    b G   ...   m H            b G   ...   m H
  done                       cool...

  f F
    b A   t H   m E
    b E   ...   m G
    ...   t G   m F
  done

  f A                        f A    (t-b)/2+b
    b A   t H   m E            b A   t x   m E
    ...   t E   m C            ...   t E   m C
    ...   t C   m B            ...   t C   m B
    ...   t B   m B            ...   t B   m A  !!! yea !!!
  never moves to find it...*shiver*

    0 1 2 3 4 5
    B C D F G x

  f B
    b B   t x   m D
    ...   t D   m C
    ...   t C   m B
  done

  f A
    b B   t x   m D
    ...   t D   m C
    ...   t C   m B
    ...   t B   m B
  b !< t  yea!

  f E
    b B   t x   m D
    b D   ...   m F
    ...   t F   m D
    b D   ...   m D
  fixed -- never moves further!!! b < t still!  eek!

  f H
    b B   t x   m D
    b D   ...   m F
    b F   ...   m G
    b G   ...   m G
  fixed -- never moves further!!! b < t still!  eek!

We can re-write our binary search algorithm to look like this:

    // locate the desired value within the position bounds
    vector<double>::size_type
    binsearch(const vector<double> & vec, double value,
              vector<double>::size_type bottom,
              vector<double>::size_type top)
    {
        vector<double>::size_type middle = (top + bottom) / 2;
        double direc = vec[middle] - value;
        while (bottom < top && fabs(direc) < 1e-6)
        {
            if (direc > 0.0)
            {
                top = middle;
            }
            else
            {
                bottom = middle+1;
            }
            middle = (top + bottom) / 2;
            direc = vec[middle] - value;
        }
        return bottom;
    }

WOW! I want that one! Let's order now!