Sorting vectors

NOTE:

These are a little sketchy...I'll fill in details 'soon'.

Putting It All In Order

There comes a time in every programmer's career when s/he must place items of data into order. Sometimes this is ascending (smallest to largest). Sometimes it is descending (largest to smallest). But whichever way it goes, it must.

One of the earliest sorts known is bubble sort. Later programmers sought to make their mark by sorting in new and unique ways. Thus came selection sort and insertion sort. After these quick and dirty algorithms came a period of reflection and study. At which time arrived on the scene much more powerful sorts of sorting algorithms: quick sort, heap sort, and merge sort. Let's look at each of these six algorithms in turn.

Bubble Sort

This sort is named after the way soda bubbles rise to the top of a glass -- just sort of meandering their way from the base to the rim:

                   |___________|
                   |       ,   |
                   |     .     |
                   |       *   |
                   |         o |
                   |        0  |
                   |          O|
                   +-----------+

(Okay, so I'm not an artist. *sigh*)

The idea of the algorithm, however, is that data items will migrate their way toward their proper (ordered) position within the vector -- slowly but surely. This is accomplished by comparing adjacent pairs of data within the vector and swapping them (if necessary) to be in the correct (desired) order. If we were trying to sort the following vector into descending order, for instance, watch the steps unfold (and watch how the a element 'bubbles' its way toward the opposite end of the vector):

    +-----+-----+-----+-----+-----+-----+
    |  a  |  d  |  c  |  q  |  b  |  e  |      original
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  a  |  c  |  q  |  b  |  e  |      compare/swap 0 and 1
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  c  |  a  |  q  |  b  |  e  |      compare/swap 1 and 2
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  c  |  q  |  a  |  b  |  e  |      compare/swap 2 and 3
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  c  |  q  |  b  |  a  |  e  |      compare/swap 3 and 4
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  c  |  q  |  b  |  e  |  a  |      compare/swap 4 and 5
    +-----+-----+-----+-----+-----+-----+

This completes phase one. Next, we begin again at position 0 and move anything else that needs adjusting:

    +-----+-----+-----+-----+-----+-----+
    |  d  |  c  |  q  |  b  |  e  |  a  |      'original'
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  c  |  q  |  b  |  e  |  a  |      compare      0 and 1
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  q  |  c  |  b  |  e  |  a  |      compare/swap 1 and 2
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  q  |  c  |  b  |  e  |  a  |      compare      2 and 3
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  q  |  c  |  e  |  b  |  a  |      compare/swap 3 and 4
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  d  |  q  |  c  |  e  |  b  |  a  |      compare      4 and 5
    +-----+-----+-----+-----+-----+-----+

Note how this time we only had to swap two pairs of elements! Things seem to be looking up! (Some may ask, did we really have to compare those elements that didn't need swapping? We didn't, but the computer does. It doesn't know until it has checked them whether two items are in order. The computer doesn't have the more global vision of the vector that we do. It sees just a single element at a time -- or two when comparing.)

But, our job isn't done yet! We have to keep going until all the elements are in order. Pass three:

    +-----+-----+-----+-----+-----+-----+
    |  d  |  q  |  c  |  e  |  b  |  a  |      'original'
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  c  |  e  |  b  |  a  |      compare/swap 0 and 1
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  c  |  e  |  b  |  a  |      compare      1 and 2
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  e  |  c  |  b  |  a  |      compare/swap 2 and 3
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  e  |  c  |  b  |  a  |      compare      3 and 4
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  e  |  c  |  b  |  a  |      compare      4 and 5
    +-----+-----+-----+-----+-----+-----+

Again, just two swaps, but 5 comparisons needed to determine if swaps were necessary. Pass four:

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  e  |  c  |  b  |  a  |      'original'
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  d  |  e  |  c  |  b  |  a  |      compare      0 and 1
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare/swap 1 and 2
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      2 and 3
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      3 and 4
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      4 and 5
    +-----+-----+-----+-----+-----+-----+

There! Now we're done, right? Nope. Again, the computer doesn't see the 'whole picture'. It only knows that on this last pass, it swapped two things. Therefore, things were still out of order. It must take another stab at it. Pass five:

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      'original'
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      0 and 1
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      1 and 2
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      2 and 3
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      3 and 4
    +-----+-----+-----+-----+-----+-----+

    +-----+-----+-----+-----+-----+-----+
    |  q  |  e  |  d  |  c  |  b  |  a  |      compare      4 and 5
    +-----+-----+-----+-----+-----+-----+

Finally! The computer, having made it through an entire pass without a single swap, feels that it is safe to stop. *whew*

This, fairly obviously, takes quite a bit of time and effort. Let's look at the performance in a table:

Number of Elements	Average Comparisons to Ordered
8	64
64	4096
1024	1048576
4096	16777216
1048576	1099511627776

Yeow! That's a lot of comparisons! In fact, if you have n items of data, it will take about n² (that's n-squared) comparisons to place the items in order. After watching that work for a while, we decided there had to be a better way! Although the number of comparisons seemed as small as possible (okay, we were naive), we thought we could at least get rid of some of those swaps. Thus was born:

Selection Sort

Since our goal with selection sort was to reduce the number of swaps performed during the sort, the target seemed clear. To place 6 things in order, we should, at worst, have to move 6 things. (Well, 6 pairs of things, maybe...)

So, we decided that we should first determine what piece to move into the next position and then move it there. So, if we are looking to do an ascending sort, we'd be looking for the smallest piece to place first. Let's try with our previous vector:

    +-----+-----+-----+-----+-----+-----+
    |  a  |  d  |  c  |  q  |  b  |  e  |      original
    +-----+-----+-----+-----+-----+-----+

First we try to find the smallest. (Again, this seems obvious to us because we have a global view. However, the computer can see at most two elements at a time and doesn't realize that a really is the smallest until it has compared the a to all of the others.)

    position |  at this position  |  smallest so far
   ----------+--------------------+------------------
       0     |        'a'         |         'a'
       1     |        'd'         |         'a'
       2     |        'c'         |         'a'
       3     |        'q'         |         'a'
       4     |        'b'         |         'a'
       5     |        'e'         |         'a'

Now, we swap the a with the ___ position. Um...better remember where it was instead of what it was, eh? *smile* Trying that again:

    position |  at this position  |  position of smallest so far
   ----------+--------------------+------------------------------
       0     |        'a'         |         0
       1     |        'd'         |         0
       2     |        'c'         |         0
       3     |        'q'         |         0
       4     |        'b'         |         0
       5     |        'e'         |         0

    +-----+-----+-----+-----+-----+-----+
    |  a  |  d  |  c  |  q  |  b  |  e  |      swap 0 and 0
    +-----+-----+-----+-----+-----+-----+

Okay, so this first one was a little redundant. Again, the computer doesn't realize that -- unless we tell it so!

Let's look at finding the next smallest, though. A novice might start again at position 0. However, this won't work out since we just placed the smallest piece of data in that position! Because of that, we should now eliminate the first position from our search. Let's start at 1 and work our way along:

    +-----+-----+-----+-----+-----+-----+
    |  a  |  d  |  c  |  q  |  b  |  e  |      'original'
    +-----+-----+-----+-----+-----+-----+

    position |  at this position  |  position of smallest so far
   ----------+--------------------+------------------------------
       1     |        'd'         |         1
       2     |        'c'         |         2
       3     |        'q'         |         2
       4     |        'b'         |         4
       5     |        'e'         |         4

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  q  |  d  |  e  |      swap 1 and 4
    +-----+-----+-----+-----+-----+-----+

Nifty! Now, we can eliminate positions 0 and 1 from our search for the third smallest item:

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  q  |  d  |  e  |      'original'
    +-----+-----+-----+-----+-----+-----+

    position |  at this position  |  position of smallest so far
   ----------+--------------------+------------------------------
       2     |        'c'         |         2
       3     |        'q'         |         2
       4     |        'd'         |         2
       5     |        'e'         |         2

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  q  |  d  |  e  |      swap 2 and 2
    +-----+-----+-----+-----+-----+-----+

*sigh* Once again our computer's narrow vision can make a silly move. *shrug* Oh, well. Let's find the fourth smallest (now not having to look through 0, 1, or 2):

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  q  |  d  |  e  |      'original'
    +-----+-----+-----+-----+-----+-----+

    position |  at this position  |  position of smallest so far
   ----------+--------------------+------------------------------
       3     |        'q'         |         3
       4     |        'd'         |         4
       5     |        'e'         |         4

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  d  |  q  |  e  |      swap 3 and 4
    +-----+-----+-----+-----+-----+-----+

And the fifth smallest (searching only positions 4 and 5 now):

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  d  |  q  |  e  |      'original'
    +-----+-----+-----+-----+-----+-----+

    position |  at this position  |  position of smallest so far
   ----------+--------------------+------------------------------
       4     |        'q'         |         4
       5     |        'e'         |         5

    +-----+-----+-----+-----+-----+-----+
    |  a  |  b  |  c  |  d  |  e  |  q  |      swap 4 and 5
    +-----+-----+-----+-----+-----+-----+

And...we're done! Why don't we go through it again? Well, since we were swapping pairs, we actually only needed 5 swaps to place 6 items in order. (In fact, to do n, we'd only need n-1 swaps.)

So, has this really helped? Actually, yes and no. We have reduced the number of swaps (which can be a slow thing to do -- depending on the size of the data). We even reduced the absolute number of comparisons (a good side-effect!). However, analysis shows that the number of comparisons is still 'in the range of' n-squared.

(To better understand this, you can take out your graphing calculator and graph x^2 and x^2-4x+3. Make the viewing window [0,4000,0,20000]. Not much difference between the two lines, eh? That's the difference we just made in the number of comparisons. Well, not exactly, but close. If you want to see the reduction we made in the number of swaps, leave x^2 and replace the second function with x-1. Leave the window the same, and look at the graph. Now that's a significant decrease!)

Insertion Sort

Still trying to reduce the number of comparisons, someone came up with insertion sort. Bubble sort tried to work with the whole vector (so to speak). Selection sort eliminated one element at a time from consideration.

Insertion sort takes another approach. It starts by viewing the 'vector' as just the first element. Since there is only one element, it says to itself, "There, that's sorted!"

Next it brings another element into consideration, finds where the new one should go within the already sorted 'vector', inserts it, and says, "There, now those are sorted!"

And so on... An example:

    Pass 1:
       |-view-|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  d  |  c  |  q  |  b  |  e  |      original
        +-----+-----+-----+-----+-----+-----+
        done
       |-view-|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  d  |  c  |  q  |  b  |  e  |      sorted (within view)
        +-----+-----+-----+-----+-----+-----+

    Pass 2:
       |----view----|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  d  |  c  |  q  |  b  |  e  |      'original'
        +-----+-----+-----+-----+-----+-----+
        new item:  'd'
        before 'a'?  no
        done
       |----view----|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  d  |  c  |  q  |  b  |  e  |      sorted (within view)
        +-----+-----+-----+-----+-----+-----+

    Pass 3:
       |-------view-------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  d  |  c  |  q  |  b  |  e  |      'original'
        +-----+-----+-----+-----+-----+-----+
        new item:  'c'
        before 'a'?  no
        before 'd'?  yes
        hold 'c'
        move 'd' toward end
        place 'c'
        done
       |-------view-------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  c  |  d  |  q  |  b  |  e  |      sorted (within view)
        +-----+-----+-----+-----+-----+-----+

    Pass 4:
       |----------view----------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  c  |  d  |  q  |  b  |  e  |      'original'
        +-----+-----+-----+-----+-----+-----+
        new item:  'q'
        before 'a'?  no
        before 'c'?  no
        before 'd'?  no
        done
       |----------view----------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  c  |  d  |  q  |  b  |  e  |      sorted (within view)
        +-----+-----+-----+-----+-----+-----+

    Pass 5:
       |-------------view-------------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  c  |  d  |  q  |  b  |  e  |      'original'
        +-----+-----+-----+-----+-----+-----+
        new item:  'b'
        before 'a'?  no
        before 'c'?  yes
        hold 'b'
        move 'q' toward end
        move 'd' toward end
        move 'c' toward end
        place 'b'
        done
       |-------------view-------------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  b  |  c  |  d  |  q  |  e  |      sorted (within view)
        +-----+-----+-----+-----+-----+-----+

    Pass 6:
       |----------------view----------------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  b  |  c  |  d  |  q  |  e  |      'original'
        +-----+-----+-----+-----+-----+-----+
        new item:  'e'
        before 'a'?  no
        before 'b'?  no
        before 'c'?  no
        before 'd'?  no
        before 'q'?  yes
        hold 'e'
        move 'q' toward end
        place 'e'
        done
       |----------------view----------------|
        +-----+-----+-----+-----+-----+-----+
        |  a  |  b  |  c  |  d  |  e  |  q  |      sorted
        +-----+-----+-----+-----+-----+-----+

When sorting an existing vector, this doesn't perform very well. It still does in the neighborhood (on the order of) n-squared comparisons. And, the number of movements has grown because sometimes you have to shift many items over to insert the 'new' item in its place!

(It is back up to around (x^2)/2. Check out this graph versus x^2 in the same window as before to see the difference.)

However, when you are trying to build a sorted vector as the user originally enters their data, this algorithm is quite handy. After all, that is it's view of the world: one more element, where does it go? (And the speed or lack thereof is but a grain of sand in the hour-glass compared to the typing 'speed' of the user!)

Can't We Go Any Faster?

Why yes, Virginia, we can! So far, we have seen three sorts which have performance characteristics like this:

Algorithm	Average Comparisons	Average Swaps
bubble	2 n	2 n --- 2
selection	2 n	n-1
insertion	2 n	2 n --- 2

As the size of the vector (represented in the table by n) grows, these algorithms take much more time.

We'd mentioned three other algorithms, however. How do they do? Well, they take considerably less time (graphing is left to the interested reader *grin*):

Algorithm	Average Comparisons
quick	n*(log n) 2
heap	n*(log n) 2
merge	n*(log n) 2

Wow! After looking at the graphs for n-log-n versus n-squared, you'll easily see the improvement. However, given that these three sorts are so much alike, how can we tell which algorithm we should use?

When the average performance is equal, we have to consider other algorithmic properties:

Algorithm	Worst Comparisons	Extra Memory	Stable
quick	2 n	0	no
heap	n*(log n) 2	0	no
merge (on a vector)	n*(log n) 2	n	yes

Worst performance of heap and merge is still n-log-n. Quick sort can degrade to n-squared performance if care isn't taken.

Quick and heap sort never take any extra memory. However, when used on vectors, merge sort requires a second vector for processing -- another n memory locations.

The stability of a sort indicates that, when sorting on multiple attributes -- multiple columns in a spreadsheet, for example, any previous order is preserved. For instance, consider if you had parallel vectors containing the id number of sales persons, the month during which sales were made, and the sales figures (for that person in that month):

    Original Data:
           id       month       sales
        +-----+    +-----+     +-----+
        | 432 |    |  8  |     | 189 |
        +-----+    +-----+     +-----+
        | 123 |    |  8  |     | 116 |
        +-----+    +-----+     +-----+
        | 555 |    |  8  |     | 203 |
        +-----+    +-----+     +-----+
        | 432 |    |  9  |     | 105 |
        +-----+    +-----+     +-----+
        | 123 |    |  9  |     |  43 |
        +-----+    +-----+     +-----+
        | 555 |    |  9  |     | 187 |
        +-----+    +-----+     +-----+

If we sort first based on the sales figures:

    Sorted by sales:
           id       month       sales
        +-----+    +-----+     +-----+
        | 123 |    |  9  |     |  43 |
        +-----+    +-----+     +-----+
        | 432 |    |  9  |     | 105 |
        +-----+    +-----+     +-----+
        | 123 |    |  8  |     | 116 |
        +-----+    +-----+     +-----+
        | 555 |    |  9  |     | 187 |
        +-----+    +-----+     +-----+
        | 432 |    |  8  |     | 189 |
        +-----+    +-----+     +-----+
        | 555 |    |  8  |     | 203 |
        +-----+    +-----+     +-----+

And then sort based on the ids:

    Re-Sorted by id:
           id       month       sales
        +-----+    +-----+     +-----+
        | 123 |    |  9  |     |  43 |
        +-----+    +-----+     +-----+
        | 123 |    |  8  |     | 116 |
        +-----+    +-----+     +-----+
        | 432 |    |  9  |     | 105 |
        +-----+    +-----+     +-----+
        | 432 |    |  8  |     | 189 |
        +-----+    +-----+     +-----+
        | 555 |    |  9  |     | 187 |
        +-----+    +-----+     +-----+
        | 555 |    |  8  |     | 203 |
        +-----+    +-----+     +-----+

The second sorting would preserve the order of sales figures within the data about the same person. (Note above how sales person 123's entries are still in order relative to one another even though 432's 105 has moved down to join their 189 -- also relatively in order to one another!)

Only merge sort exhibits this property, whereas both quick and heap sorts completely disregard any information about prior order contained in the data's current positions.

So, by looking at these other properties of the algorithms (how badly they do in the worst circumstances, if they are stable or not, and whether or not they require extra memory during execution), we could decide which of these three advanced sorts to use.

[Un]Fortunately for you, you'll have to wait a semester or two before you get to actually use them. All three of them are a bit too complicated for this semester. (We still have other things to worry about/learn!)

However, you will still be expected to know their basic characteristics (their average behavior and the properties that distinguish them).

Stray Bits

Notice how, in the bubble sort, the letter q bubbles its way toward the beginning of the vector. It takes it a few passes, but it gets there. This bubble isn't as obvious as as passage -- which happens all in a single pass. (Try to follow all the letters to their destinations. Do they all go straight to their eventual place or do some flitter about like a bubble might?)
There is pseudocode algorithm for bubble sort elsewhere amongst these notes.
Some do not find it so obvious, but selection sort is so named because we are selecting the next item to place in order.
The code for selection sort can be found in your book (Chapter 20, Section 3).
The key part of insertion sort -- inserting into an already ordered list -- is described in our notes. (Other than that, you just need to search for the proper place to put the 'new' data...in the already sorted portion of the vector...)
The basic idea of heap sort is that you throw all the data in a 'heap', placing the smallest (or largest -- depending on whether you want descending or ascending order at the end) on top. Now just set it 'aside' -- i.e. swap it with the last item in the vector. Then, take the remaining items and fluff the 'heap' to get the next smallest item on top again. Repeat... This is similar to selection sort, but the heaping process is much more efficient than the linear search.
The idea behind merge sort is that if you had two sorted lists, you could merge them together fairly quickly. (n total items would take n steps to merge, in fact.) So, we split the vector into two halves and sort each half separately -- then merge them back together! What makes this more effective than your quadratic sorts is that we sort each of the halves with ...merge sort! Thus we're splitting the halves in half, too. This process continues until a 'half' only has one or two ...well, a very few... items in it. Then it either is already sorted or takes so little effort to sort that we just use bubble or insertion sort on it. Finally we just merge back up toward the entire vector.

(Thinking about it you realize that, even if you didn't continue to merge sort each half and just bubble sorted each of the first two halves, you'd reduce from n² comparisons to 2*(n/2)²==n²/2 comparisons. i.e. When n==100, we'd be looking at 5000 comparisons instead of 10000. By continuing to merge sort down each halving, we end up with an even better improvement.)

This is similar to binary search -- cut your problem in half at each step. But with the merge sort, you have to retrace your steps and merge the lists back together.
Quick sort is pretty twisted, but involves a divide-and-conquer approach like those of merge sort and binary search.

Lecture Notes

Here are the lecture notes about bubble sort, selection sort, and insertion sort's main and its library.