CpSc 360

Lecture 14

 

Chapter 7 (continued)

 

·       Sorting Files on Tape

·        The choice of tape sort algorithms is highly dependent on hardware configurations and data to be sorted

·        The basic steps for sorting on tape are the same as for sorting on disk

·        Distribute the unsorted file into sorted runs

·        Merge the runs into a single sorted file

·        Use the replacement selection algorithm to produce long runs
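The run-creation step can be sketched as follows. This is a minimal illustration of replacement selection, assuming integer keys and in-memory lists standing in for tapes; the function name `replacement_selection` and the parameter `memory_size` (the number of records that fit in memory) are hypothetical names chosen for this sketch, not taken from the text.

```python
import heapq
from typing import Iterable, List


def replacement_selection(records: Iterable[int], memory_size: int) -> List[List[int]]:
    """Split `records` into sorted runs using a heap of at most `memory_size` keys."""
    source = iter(records)
    done = object()  # sentinel marking the end of the input

    # Fill memory with the first `memory_size` records.
    heap: List[int] = []
    for _ in range(memory_size):
        item = next(source, done)
        if item is done:
            break
        heap.append(item)
    heapq.heapify(heap)

    runs: List[List[int]] = []
    current: List[int] = []   # the run being written
    pending: List[int] = []   # keys too small for the current run

    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)

        incoming = next(source, done)
        if incoming is not done:
            if incoming >= smallest:
                heapq.heappush(heap, incoming)  # still fits in the current run
            else:
                pending.append(incoming)        # must wait for the next run

        if not heap:
            # The current run is finished; start the next one from the pending keys.
            runs.append(current)
            current = []
            heap, pending = pending, []
            heapq.heapify(heap)

    return runs


runs = replacement_selection([5, 1, 9, 2, 8, 3, 7], memory_size=3)
```

With room for only three keys, the seven input records end up in two runs, the first of which holds five records, longer than the memory itself.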

·        The balanced merge

·        2-way balanced merge

·        Use four tapes and put the initial runs on only 2 of the tapes

·        Merge runs from these 2 tapes onto the other 2 tapes, creating half as many, but longer runs

·        Continue merging between the pairs of tapes until all runs have been merged into a single run on a single tape

 

 

Step     Tape   Contains runs

Step 1   T1     R1      R3      R5      R7      R9
         T2     R2      R4      R6      R8      R10
         T3     -
         T4     -

Step 2   T1     -
         T2     -
         T3     R1-R2   R5-R6   R9-R10
         T4     R3-R4   R7-R8

Step 3   T1     R1-R4   R9-R10
         T2     R5-R8
         T3     -
         T4     -

Step 4   T1     -
         T2     -
         T3     R1-R8
         T4     R9-R10

Step 5   T1     R1-R10
         T2     -
         T3     -
         T4     -
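A single pass of the 2-way balanced merge can be sketched as follows. This is an illustration only: each tape is modeled as a Python list of sorted runs, and the name `merge_pass` is a hypothetical helper, not something defined in the text. The standard-library `heapq.merge` performs the merge of two sorted runs.

```python
import heapq
from typing import List, Tuple

Run = List[int]


def merge_pass(tape_a: List[Run], tape_b: List[Run]) -> Tuple[List[Run], List[Run]]:
    """Merge the i-th runs of two input tapes, alternating between two output tapes."""
    out_a: List[Run] = []
    out_b: List[Run] = []
    for i in range(max(len(tape_a), len(tape_b))):
        # A tape that has run out of runs contributes an empty run.
        run_a = tape_a[i] if i < len(tape_a) else []
        run_b = tape_b[i] if i < len(tape_b) else []
        merged = list(heapq.merge(run_a, run_b))
        (out_a if i % 2 == 0 else out_b).append(merged)
    return out_a, out_b


t1 = [[1, 9], [4, 6], [0, 7]]
t2 = [[2, 3], [5, 8]]
t3, t4 = merge_pass(t1, t2)
```

Five input runs become three longer runs, spread over the two output tapes; repeating the pass with the tape roles swapped eventually leaves one run on one tape.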

 

 

 

 

 

 

·        There is no seeking on tape so performance is measured in the transmission time (number of blocks read or total passes over the data)

·        The total passes for n runs is $p = \lceil \log_2 n \rceil$

·        If we were to sort an 800-megabyte file with 200 runs, 8 passes would be required.  Using a 6,250 bpi tape that moves at 200 inches per second (1,250 kilobytes/sec), the time to sort using this algorithm would be about 1 hour 28 minutes.  This is substantially slower than a comparable disk sort.

·        The k-way balanced merge

·        Improvements can be made if we can reduce the number of passes

·        We can make p smaller if we change the base of the logarithm in the equation:

 

$p = \lceil \log_2 n \rceil$

 

·        If we were to use 20 tapes (10 for input and 10 for output) then we would have the equation:

 

$p = \lceil \log_{10} n \rceil$

 

·        Similarly if we were to use 2k tapes (k for input and k for output) then we would have the equation:

 

$p = \lceil \log_k n \rceil$

 

·        In the example above of sorting the 800-megabyte file, the total time would be reduced to about 42 minutes with 20 tape drives.

·        Compact notation for the 2-way balanced merge example above (each number is the length of one run, measured in initial runs):

 

 

          T1          T2          T3      T4

Step 1    1 1 1 1 1   1 1 1 1 1   -       -

Step 2    -           -           2 2 2   2 2

Step 3    4 2         4           -       -

Step 4    -           -           8       2

Step 5    10          -           -       -
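The balanced merge and its pass count can be checked with a small simulation. This is a sketch under stated assumptions: a tape is modeled as a list of run lengths (in units of initial runs), the initial runs are dealt round-robin onto the first k tapes, and `balanced_merge` is a hypothetical name introduced here.

```python
from typing import List

Tape = List[int]  # run lengths; the front of the list is read first


def balanced_merge(initial_runs: int, k: int) -> List[List[Tape]]:
    """Simulate a k-way balanced merge on 2k tapes; return the tape state after each step."""
    tapes: List[Tape] = [[] for _ in range(2 * k)]
    # Step 1: distribute runs of length 1 round-robin over the first k tapes.
    for i in range(initial_runs):
        tapes[i % k].append(1)
    history = [[list(t) for t in tapes]]

    inputs, outputs = list(range(k)), list(range(k, 2 * k))
    while sum(len(t) for t in tapes) > 1:
        out = 0
        while any(tapes[i] for i in inputs):
            # Merge the front run of every non-empty input tape into one run.
            merged = sum(tapes[i].pop(0) for i in inputs if tapes[i])
            tapes[outputs[out % k]].append(merged)
            out += 1
        inputs, outputs = outputs, inputs  # swap roles for the next pass
        history.append([list(t) for t in tapes])
    return history


steps = balanced_merge(10, 2)
passes_2way = len(balanced_merge(200, 2)) - 1
passes_10way = len(balanced_merge(200, 10)) - 1
```

For 10 runs and k = 2 the recorded states match the five steps in the compact notation. For 200 runs the simulation needs 8 passes with k = 2 and 3 passes with k = 10, in agreement with $\lceil \log_k n \rceil$.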

 

·        Multiphase merges

·        Consider this technique

 

 

 

 

          T1          T2          T3      T4

Step 1    1 1 1 1 1   1 1 1 1 1   -       -

Step 2    -           -           2 2 2   2 2

Step 3    4           4           . . 2   -

Step 4    -           -           -       10

 

 

·        Use a higher-order merge.  For example, in the 2-way merge example, in place of two 2-way merges, use one 3-way merge.

·        Extend the merging of runs from one tape over several steps.  For example, in the 2-way merge example, merge some of the runs from Tape 3 in step 3 and some in step 4.

·        Runs are thus merged in phases.  These ideas are the basis for two well-known approaches to merging called the polyphase merge and the cascade merge.

·        These 2 techniques share these common characteristics:

·        The initial distribution of runs is such that at least the initial merge is a J-1 way merge, where J is the number of available tape drives.

·        The distribution of the runs across the tapes is such that the tapes often contain different numbers of runs.

·        Consider the following example with ten runs:

 

 

 

          T1          T2      T3    T4

Step 1    1 1 1 1 1   1 1 1   1 1   -

Step 2    1 1 1       1       -     3 3

Step 3    1 1         -       5     3

Step 4    1           4       5     -

Step 5    -           -       -     10
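The ten-run example can be replayed with a small schedule-driven simulation. This is a sketch, not a general polyphase algorithm: the merge schedule is read off the table by hand, tapes are lists of run lengths, and the names `run_schedule`, `Step`, and `cost` are assumptions made for illustration. The cost counts how many initial-run units are read and rewritten over all merges.

```python
from typing import List, Sequence, Tuple

Tape = List[int]  # run lengths; the front of the list is read first
# (input tape indexes, output tape index, number of merges in this step)
Step = Tuple[Sequence[int], int, int]


def run_schedule(tapes: List[Tape], schedule: Sequence[Step]) -> Tuple[List[List[Tape]], int]:
    """Apply each merge step in place; return the state after every step and the total cost."""
    history = [[list(t) for t in tapes]]
    cost = 0
    for inputs, output, merges in schedule:
        for _ in range(merges):
            # One merge consumes the front run of every input tape.
            merged = sum(tapes[i].pop(0) for i in inputs)
            tapes[output].append(merged)
            cost += merged
        history.append([list(t) for t in tapes])
    return history, cost


# Step 1 of the table: 5, 3 and 2 initial runs on T1, T2, T3; T4 is empty.
initial = [[1] * 5, [1] * 3, [1] * 2, []]
schedule = [
    ((0, 1, 2), 3, 2),  # Step 2: two 3-way merges onto T4
    ((0, 1, 3), 2, 1),  # Step 3: one 3-way merge onto T3
    ((0, 3), 1, 1),     # Step 4: one 2-way merge onto T2
    ((0, 1, 2), 3, 1),  # Step 5: final 3-way merge onto T4
]
history, cost = run_schedule(initial, schedule)
```

Under this accounting the pattern moves 25 run units in total, compared with 40 for the four passes of the 2-way balanced merge over the same ten runs.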

 

 

·        Questions:

·        How does one choose an initial distribution that leads readily to an efficient merge pattern?

·        Are there algorithmic descriptions of the merge patterns, given an initial distribution?

·        Given N runs and J tape drives, is there some way to compute the optimal merging performance so we have a yardstick against which to compare the performance of any specific algorithm?

·        Tapes Versus Disks for External Sorting

·        Disk and tape used to be expensive – but not so today

·        Suppose we wish to sort an 8GB file with only 1 megabyte of memory instead of 10 megabytes.

·        The calculations are in the text; the bottom line is that a tenfold increase in the number of runs results in a hundredfold increase in the number of seeks.

·        The time to sort changes from 3 hours with 10MB of memory to 300 hours with 1MB of memory, just for the seeks!

·        In this case the seek time on disk greatly outweighs the transmission time on tape.

·        With more memory (and thus larger but fewer runs) and more files available on disk, the transmission time (due to multiple passes over the data) outweighs the seek time.
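The tenfold/hundredfold relationship for a disk merge can be checked with a few lines of arithmetic. This sketch rests on assumptions not spelled out in these notes: every initial run is exactly as large as memory, all runs are merged in a single k-way step, memory is divided into k equal buffers, and each buffer refill costs one seek. The name `merge_seeks` is hypothetical, and 8GB is taken to be 8,000 MB.

```python
def merge_seeks(file_mb: int, memory_mb: int) -> int:
    """Estimate seeks for a single-step merge of memory-sized runs (assumptions above)."""
    runs = -(-file_mb // memory_mb)  # ceiling division: number of initial runs
    # Each buffer holds 1/runs of a run, so every run needs `runs` refills,
    # and there are `runs` runs in total.
    return runs * runs


seeks_10mb = merge_seeks(8000, 10)  # 800 runs
seeks_1mb = merge_seeks(8000, 1)    # 8,000 runs
ratio = seeks_1mb // seeks_10mb
```

Under these assumptions the seek count grows with the square of the number of runs, so ten times as many runs means one hundred times as many seeks.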
