CpSc 360
Chapter 7 (continued)
· Merging as a Way of Sorting Large Files on Disk
· Opportunities for improvement
  · Hardware improvements
    · Increase the amount of RAM
    · Increase the number of disk drives
    · Increase the number of I/O channels
  · Decrease the number of seeks using multiple-step merges
    · Merge runs together in groups rather than all at once
    · Merge the results of the groups of runs, etc.
    · Construct a merge tree with dispersion factors of 2^n, where 2^n is the number of files to be merged during each pass
    · E.g., if n = 2 then, starting with the final pass and working backwards, we would merge 4, 16, 64, … runs
    · Would some base other than 2 work better?
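To see why merging in groups can pay off, the rough sketch below counts input seeks under a deliberately simple model: memory holds a fixed number of records, it is split evenly among the k input buffers of a k-way merge, and every buffer refill costs one seek. The function names and the figures (800 runs of 80,000 records, groups of 32) are illustrative assumptions rather than values from these notes, and output seeks are ignored.

```python
from math import ceil

def merge_seeks(run_lengths, memory_records):
    """Estimate input seeks for one k-way merge of the given runs.

    Assumed model: memory is divided evenly into k input buffers and
    each buffer refill costs exactly one seek.
    """
    k = len(run_lengths)
    buffer_records = memory_records // k
    return sum(ceil(length / buffer_records) for length in run_lengths)

def two_step_seeks(run_lengths, memory_records, group_size):
    """Estimate input seeks when runs are first merged in groups,
    and the resulting longer runs are then merged together."""
    groups = [run_lengths[i:i + group_size]
              for i in range(0, len(run_lengths), group_size)]
    first_step = sum(merge_seeks(g, memory_records) for g in groups)
    second_step = merge_seeks([sum(g) for g in groups], memory_records)
    return first_step + second_step

MEMORY = 80_000            # records that fit in RAM (assumed)
runs = [80_000] * 800      # 800 initial runs, each one memory-load long (assumed)

single = merge_seeks(runs, MEMORY)
two_step = two_step_seeks(runs, MEMORY, group_size=32)
print(single, two_step)
```

Under these assumptions the single 800-way merge needs 640,000 seeks, while 25 merges of 32 runs followed by one 25-way merge need 45,600, at the price of reading every record twice.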
  · Algorithmically increase the lengths of the initial sorted runs
    · Longer runs → fewer total runs
    · Fewer total runs → a lower-order merge
    · Lower-order merge → bigger buffers
    · Bigger buffers → fewer seeks
    · But how do we get bigger buffers without buying more memory?
    · Use a replacement selection algorithm
      · Read a collection of records and arrange them into a heap, as in the first phase of heapsort. This is the primary heap, with the smallest key value at its root.
      · Don’t write out the whole heap; write out only the record in the heap with the smallest key value
      · Bring in a new record and compare the value of its key with that of the key just output
      · If the new key value >= the key value just output, insert the new record into its proper place in the primary heap and again output the record with the smallest key value
      · If the new key value < the key value just output, place the record into a secondary heap of records with key values lower than those already written out
      · Continue until there are no more records in the primary heap or the secondary heap reaches capacity. If the secondary heap reaches capacity, output the remainder of the primary heap, which completes the current run. Then swap roles, so that the secondary heap becomes the primary heap and the emptied primary heap becomes the secondary heap, and continue the process.
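The steps above can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not a disk-based implementation: the function name `replacement_selection`, the use of bare keys in place of records, and the `capacity` parameter (the number of records that fit in memory, shared by both heaps) are assumptions made for the example. Because the two heaps share one pool of `capacity` slots, the primary heap runs empty at exactly the moment the secondary heap fills up, so the code only needs to test for an empty primary heap.

```python
import heapq
from itertools import islice

def replacement_selection(keys, capacity):
    """Split an input sequence into sorted runs using two heaps that
    together never hold more than `capacity` keys."""
    source = iter(keys)
    primary = list(islice(source, capacity))   # initial memory load
    heapq.heapify(primary)
    secondary = []                             # keys held back for the next run
    runs, current_run = [], []
    done = object()                            # sentinel marking end of input

    while primary:
        smallest = heapq.heappop(primary)      # output the smallest key
        current_run.append(smallest)
        new_key = next(source, done)
        if new_key is not done:
            if new_key >= smallest:
                heapq.heappush(primary, new_key)    # still fits in this run
            else:
                heapq.heappush(secondary, new_key)  # must wait for the next run
        if not primary:
            # Current run is complete; the secondary heap becomes primary.
            runs.append(current_run)
            current_run = []
            primary, secondary = secondary, []
    return runs

example = [16, 47, 5, 12, 67, 21, 7, 17, 14, 58, 24, 18, 33]
for run in replacement_selection(example, capacity=3):
    print(run)
```

With room for only three keys, the thirteen example keys come out as two runs of six and seven keys, noticeably longer than the three-key runs that sorting one memory load at a time would produce.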
    · Replacement selection will output runs of approximately 2*P records, where P is the number of memory locations available for records.
    · Since buffering of input and output is required (to reduce seek and rotational latency times) and a secondary heap is required, the output runs will not actually reach 2*P.
    · If the number of records is sufficiently large, multi-step merging provides far more benefit than replacement selection.
    · Both techniques used together will generally be advantageous.
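The 2*P estimate can be checked empirically with a small simulation. The sketch below uses a common single-heap formulation in which each key is tagged with the number of the run it belongs to, so one heap plays the role of both the primary and the secondary heap; this is a substitute technique chosen for brevity, not the two-heap description given above. The function name `run_lengths`, the uniformly random input, the seed, and the sizes are all assumptions for illustration.

```python
import heapq
import random
from itertools import islice

def run_lengths(keys, capacity):
    """Return the length of each run replacement selection would produce,
    using one heap of (run_number, key) pairs."""
    source = iter(keys)
    heap = [(0, key) for key in islice(source, capacity)]
    heapq.heapify(heap)
    lengths = []
    done = object()                       # sentinel marking end of input
    while heap:
        run, key = heapq.heappop(heap)    # smallest key of the earliest run
        if run == len(lengths):
            lengths.append(0)             # first key of a new run
        lengths[run] += 1
        new_key = next(source, done)
        if new_key is not done:
            # A key smaller than the one just output must wait for the next run.
            next_run = run if new_key >= key else run + 1
            heapq.heappush(heap, (next_run, new_key))
    return lengths

P = 100                                   # memory capacity in records (assumed)
rng = random.Random(0)
data = [rng.random() for _ in range(50_000)]
lengths = run_lengths(data, P)
average = sum(lengths) / len(lengths)
print(len(lengths), average)
```

On randomly ordered input the average run length should come out close to 2*P; on input that is already partly ordered the runs tend to be longer still.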
  · Find ways to overlap I/O operations
    · Use multiple disk drives
      · Use one drive for input and the other for output
      · I/O can be overlapped, so transmission time can be reduced by as much as 50%
      · Seeking is virtually eliminated
    · Use double buffering for both input and output
    · Use additional disks (3, 4, …)
    · Use additional processors for the sort phase
    · Multiprogramming is likely to reduce the performance of an external sort (and of any other job in the system), but overall system performance may increase
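Double buffering on the input side might look something like the sketch below: a background thread fills one buffer while the caller is still processing the other. The function name `double_buffered_chunks` and the two-queue hand-off are assumptions made for this illustration; a production sort would also double-buffer its output and handle errors and early termination.

```python
import queue
import threading

def double_buffered_chunks(stream, chunk_size):
    """Yield successive chunks of a binary stream while a reader thread
    fills the other of two fixed buffers in the background.

    Each yielded memoryview is only valid until the next chunk is
    requested, because its buffer is then handed back to the reader.
    """
    buffers = [bytearray(chunk_size), bytearray(chunk_size)]
    free = queue.Queue()      # indices of buffers the reader may fill
    filled = queue.Queue()    # (index, byte count) pairs ready to process
    free.put(0)
    free.put(1)

    def reader():
        while True:
            index = free.get()                    # wait for an empty buffer
            count = stream.readinto(buffers[index])
            filled.put((index, count))
            if count == 0:                        # end of input
                break

    threading.Thread(target=reader, daemon=True).start()

    while True:
        index, count = filled.get()               # wait for a full buffer
        if count == 0:
            return
        yield memoryview(buffers[index])[:count]
        free.put(index)                           # caller is done with it
```

A caller would open the run file in binary mode, iterate over the chunks, and finish with each chunk (or copy it) before asking for the next one.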
· General rules for external sorts:
  · For in-RAM sorting, use heapsort to form the original list of sorted elements in a run. With it and double buffering, we can overlap input and output with internal processing.
  · Use as much RAM as possible. It makes the runs longer and provides bigger and/or more buffers during the merge phase.
  · If the number of initial runs is so large that total seek and rotation time is much greater than total transmission time, use a multistep merge. It increases the amount of transmission time but can decrease the number of seeks enormously.
  · Consider using replacement selection for initial run formation, especially if there is a possibility that the runs will be partially ordered.
  · Use more than one disk drive and I/O channel so that reading and writing can overlap. This is especially worthwhile if there are no other users on the system.
  · Keep in mind the fundamental elements of external sorting and their relative costs, and look for ways to take advantage of new architectures and systems, such as parallel processing and high-speed local area networks.
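The multistep merge recommended in the rules above can be sketched as follows. In-memory lists stand in for run files on disk, and the function name `multistep_merge` and its `order` parameter (the maximum number of runs merged at once) are assumptions for the example. The standard library's `heapq.merge` performs each individual k-way merge.

```python
import heapq

def multistep_merge(runs, order):
    """Merge sorted runs in groups of at most `order`, repeating on the
    merged results until a single sorted run remains."""
    if order < 2:
        raise ValueError("merge order must be at least 2")
    runs = [list(run) for run in runs]
    while len(runs) > 1:
        # One step: merge each group of up to `order` runs into a longer run.
        runs = [list(heapq.merge(*runs[i:i + order]))
                for i in range(0, len(runs), order)]
    return runs[0] if runs else []

initial_runs = [[5, 12, 16], [7, 14, 17], [1, 99], [2, 3], [8]]
print(multistep_merge(initial_runs, order=2))
```

A smaller `order` means more steps, so each record is read and written more often, but every merge then works with fewer and larger buffers, which is the trade-off between transmission time and seeks described in the rules.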