CpSc 360
Chapter 7 (continued)
· Sorting Files on Tape
  · The choice of a tape sort algorithm depends heavily on the hardware configuration and the data to be sorted
  · The basic steps for sorting on tape are the same as for sorting on disk:
    · Distribute the unsorted file into sorted runs
    · Merge the runs into a single sorted file
  · Use the replacement selection algorithm to produce long runs
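The notes name replacement selection without spelling it out. The following is a minimal sketch of one common heap-based formulation, not necessarily the version in the text. It assumes records are plain comparable keys and that `memory_size` (a hypothetical parameter) is the number of records that fit in memory at once. A record smaller than the one just written cannot join the current run, so it is tagged for the next run.

```python
import heapq
from itertools import islice


def replacement_selection(records, memory_size):
    """Split records into sorted runs, holding memory_size records at a time."""
    source = iter(records)
    # Heap entries are (run number, key), so keys belonging to the current
    # run always come out before keys held back for the next run.
    heap = [(0, key) for key in islice(source, memory_size)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):
            runs.append([])  # first key of a new run
        runs[run_no].append(key)
        # Replace the key just written with the next input key, if any.
        for incoming in islice(source, 1):
            # A key smaller than the one just written must wait for the next run.
            next_run = run_no if incoming >= key else run_no + 1
            heapq.heappush(heap, (next_run, incoming))
    return runs


keys = [33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16]
runs = replacement_selection(keys, memory_size=3)
for run in runs:
    print(run)
```

With room for only three keys in memory, the thirteen sample keys come out as three runs, each longer than the three-key runs a plain in-memory sort would give.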
· The balanced merge
  · 2-way balanced merge
    · Use four tapes and put the initial runs on only 2 of the tapes
    · Merge runs from these 2 tapes onto the other 2 tapes, creating half as many, but longer, runs
    · Continue merging between the pairs of tapes until all runs have been merged into a single run on a single tape
| Step | Tape | Contains runs       |
|------|------|---------------------|
| 1    | T1   | R1 R3 R5 R7 R9      |
| 1    | T2   | R2 R4 R6 R8 R10     |
| 1    | T3   | -                   |
| 1    | T4   | -                   |
| 2    | T1   | -                   |
| 2    | T2   | -                   |
| 2    | T3   | R1-R2 R5-R6 R9-R10  |
| 2    | T4   | R3-R4 R7-R8         |
| 3    | T1   | R1-R4 R9-R10        |
| 3    | T2   | R5-R8               |
| 3    | T3   | -                   |
| 3    | T4   | -                   |
| 4    | T1   | -                   |
| 4    | T2   | -                   |
| 4    | T3   | R1-R8               |
| 4    | T4   | R9-R10              |
| 5    | T1   | R1-R10              |
| 5    | T2   | -                   |
| 5    | T3   | -                   |
| 5    | T4   | -                   |
    · There is no seeking on tape, so performance is measured by transmission time (number of blocks read, or total passes over the data)
    · The total number of passes for n runs is $p = \lceil \log_2 n \rceil$
    · If we were to sort an 800-megabyte file with 200 runs, 8 passes would be required. Using a 6,250 bpi tape that moves at 200 inches per second (1,250 kilobytes/sec), the time to sort using this algorithm would be about 1 hour 28 minutes. This is substantially slower than a comparable disk sort.
  · The k-way balanced merge
    · Improvements can be made if we can reduce the number of passes
    · We can make p smaller if we change the base of the logarithm in the equation: $p = \lceil \log_2 n \rceil$
    · If we were to use 20 tapes (10 for input and 10 for output), then we would have the equation: $p = \lceil \log_{10} n \rceil$
    · Similarly, if we were to use 2k tapes (k for input and k for output), then we would have the equation: $p = \lceil \log_k n \rceil$
    · In the example above of sorting the 800-megabyte file, the total time would be reduced to about 42 minutes with 20 tape drives.
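The pass counts above can be checked with a short sketch. The function name `merge_passes` is illustrative, not from the text. It counts passes by repeated ceiling division rather than a floating-point logarithm, which gives the same value as the ceiling of the base-k logarithm of n without rounding surprises.

```python
def merge_passes(n_runs, k):
    """Number of k-way balanced merge passes needed to reduce n_runs to one run."""
    if k < 2:
        raise ValueError("a merge needs at least 2 input tapes")
    passes = 0
    while n_runs > 1:
        # Each pass merges groups of up to k runs into one run apiece.
        n_runs = -(-n_runs // k)  # ceiling division
        passes += 1
    return passes


for k in (2, 10):
    print(f"k = {k}: {merge_passes(200, k)} passes for 200 runs")
```

For the 200-run example this gives 8 passes with k = 2 and 3 passes with k = 10.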
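    · Compact notation for the above example (each number is the length of one run, counted in initial runs):
|        | T1        | T2        | T3    | T4  |
|--------|-----------|-----------|-------|-----|
| Step 1 | 1 1 1 1 1 | 1 1 1 1 1 | -     | -   |
| Step 2 | -         | -         | 2 2 2 | 2 2 |
| Step 3 | 4 2       | 4         | -     | -   |
| Step 4 | -         | -         | 8     | 2   |
| Step 5 | 10        | -         | -     | -   |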
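The compact notation lends itself to a small simulation. The sketch below tracks run lengths only (no actual records) and assumes the initial runs are dealt alternately onto the first pair of tapes, as in the example. The function name is illustrative. Each step is a pair of lists, one per tape of the pair currently holding data; the physical pair alternates between (T1, T2) and (T3, T4).

```python
def two_way_balanced_merge(n_runs):
    """Return the run lengths on the active tape pair after each step."""
    # Step 1: deal the initial runs alternately onto the first pair of tapes.
    pair = ([1] * ((n_runs + 1) // 2), [1] * (n_runs // 2))
    steps = [pair]
    while len(pair[0]) + len(pair[1]) > 1:
        first, second = pair
        out = ([], [])
        for i in range(len(first)):
            # Merge the i-th run of each tape; a run with no partner is copied.
            merged = first[i] + (second[i] if i < len(second) else 0)
            # Output runs alternate between the two tapes of the other pair.
            out[i % 2].append(merged)
        steps.append(out)
        pair = out
    return steps


steps = two_way_balanced_merge(10)
for number, (a, b) in enumerate(steps, start=1):
    print(f"Step {number}: {a} {b}")
```

For ten runs the five steps match the run lengths in the compact table, and the four merge steps agree with the pass formula for n = 10.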
· Multiphase merges
  · Consider this technique:
|        | T1        | T2        | T3    | T4  |
|--------|-----------|-----------|-------|-----|
| Step 1 | 1 1 1 1 1 | 1 1 1 1 1 | -     | -   |
| Step 2 | -         | -         | 2 2 2 | 2 2 |
| Step 3 | 4         | 4         | . . 2 | -   |
| Step 4 | -         | -         | -     | 10  |
  · Use a higher-order merge. For example, in the 2-way merge example, in place of two 2-way merges, use one 3-way merge.
  · Extend the merging of runs from one tape over several steps. For example, in the 2-way merge example, merge some of the runs from Tape 3 in step 3 and some in step 4.
  · Runs are thus merged in phases. These ideas are the basis for two well-known approaches to merging called the polyphase merge and the cascade merge.
  · These 2 techniques share these common characteristics:
    · The initial distribution of runs is such that at least the initial merge is a (J-1)-way merge, where J is the number of available tape drives
    · The distribution of the runs across the tapes is such that the tapes often contain different numbers of runs.
  · Consider the following example with ten runs:
|        | T1        | T2    | T3  | T4  |
|--------|-----------|-------|-----|-----|
| Step 1 | 1 1 1 1 1 | 1 1 1 | 1 1 | -   |
| Step 2 | 1 1 1     | 1     | -   | 3 3 |
| Step 3 | 1 1       | -     | 5   | 3   |
| Step 4 | 1         | 4     | 5   | -   |
| Step 5 | -         | -     | -   | 10  |
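The ten-run example can be replayed step by step. The sketch below does not derive a merge pattern. It takes the pattern read off the table as a hand-written schedule (the tuple encoding and the name `replay_merge` are assumptions made for illustration) and tracks run lengths only, counting how many initial-run-lengths of data are read.

```python
def replay_merge(tapes, schedule):
    """Replay a merge schedule; return (tape contents per step, runs read)."""
    tapes = {name: list(runs) for name, runs in tapes.items()}
    history = [{name: list(runs) for name, runs in tapes.items()}]
    runs_read = 0
    for inputs, output, merges in schedule:
        for _ in range(merges):
            # Take the next run from every input tape and merge them into one.
            merged = sum(tapes[name].pop(0) for name in inputs)
            tapes[output].append(merged)
            runs_read += merged  # every merged record is read once
        history.append({name: list(runs) for name, runs in tapes.items()})
    return history, runs_read


initial = {"T1": [1] * 5, "T2": [1] * 3, "T3": [1] * 2, "T4": []}
# (input tapes, output tape, number of merges), one entry per step 2 to 5.
schedule = [
    (("T1", "T2", "T3"), "T4", 2),
    (("T1", "T2", "T4"), "T3", 1),
    (("T1", "T4"), "T2", 1),
    (("T1", "T2", "T3"), "T4", 1),
]
history, runs_read = replay_merge(initial, schedule)
for number, state in enumerate(history, start=1):
    print(f"Step {number}: {state}")
print("initial-run-lengths read:", runs_read)
```

Under this schedule the merge reads 25 initial-run-lengths of data in total, compared with 40 for the four full passes of the 2-way balanced merge over the same ten runs.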
  · Questions:
    · How does one choose an initial distribution that leads readily to an efficient merge pattern?
    · Are there algorithmic descriptions of the merge patterns, given an initial distribution?
    · Given N runs and J tape drives, is there some way to compute the optimal merging performance so we have a yardstick against which to compare the performance of any specific algorithm?
· Tapes Versus Disks for External Sorting
  · Disk and tape used to be expensive, but not so today
  · Suppose we wish to sort an 8GB file with only 1 megabyte of memory instead of 10 megabytes.
  · The calculations are in the text; the bottom line is that a tenfold increase in the number of runs results in a hundredfold increase in the number of seeks.
  · The time to sort changes from 3 hours with 10MB of memory to 300 hours with 1MB of memory, just for the seeks!
  · In this case the seek time greatly outweighs the transfer time with tapes.
  · With more memory (and thus larger but fewer runs) and more files available on disk, the transmission time (due to multiple passes over the data) outweighs the seek time.
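The hundredfold figure can be reproduced with a rough model. This is a sketch under stated assumptions, not the calculation from the text: each run is as large as memory, all K runs are merged in a single K-way merge, memory is split evenly into K input buffers, each buffer refill costs one seek, and 8GB is taken as 8,000 megabytes. Under these assumptions each run needs K refills, so the merge needs about K × K seeks. The function name `merge_seeks` is illustrative.

```python
import math


def merge_seeks(file_mb, memory_mb):
    """Approximate seeks for a single K-way disk merge of memory-sized runs."""
    runs = math.ceil(file_mb / memory_mb)  # K runs, each as large as memory
    # Each input buffer holds 1/K of memory, so reading one run takes K refills.
    refills_per_run = runs
    return runs * refills_per_run


seeks_10mb = merge_seeks(8000, 10)
seeks_1mb = merge_seeks(8000, 1)
print(seeks_10mb, seeks_1mb, seeks_1mb // seeks_10mb)
```

Cutting memory from 10 megabytes to 1 megabyte multiplies the number of runs by 10 and the number of seeks by 100 in this model.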