Chapter 4 (continued)
·
A
Sequential Search
·
Finding
a particular record in a file based upon a primary key
·
Performance
·
Usually
expressed as a function of the number of comparisons
·
With files this is often a poor indicator; the number of I/O operations is usually a better measure
·
Unblocked record searching (with n records) requires an average of n/2 I/O operations
·
Blocked records (with b records per block) require an average of n/(2*b) I/O operations
·
Blocked
reads also require fewer seeks – big savings
·
Both
are O(n) in time complexity
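
The I/O counts above can be checked with a small simulation. This is only a sketch: the file is modelled as a Python list of keys, each slice of b records stands in for one I/O operation, and the names sequential_search and average_reads are invented for illustration.

```python
def sequential_search(records, key, b=1):
    """Scan records in blocks of b; return (index, block_reads), index -1 if absent."""
    block_reads = 0
    for start in range(0, len(records), b):
        block = records[start:start + b]   # one simulated I/O operation
        block_reads += 1
        for offset, rec_key in enumerate(block):
            if rec_key == key:
                return start + offset, block_reads
    return -1, block_reads


def average_reads(n, b):
    """Average number of block reads over a successful search for every key."""
    records = list(range(n))
    total = sum(sequential_search(records, k, b)[1] for k in records)
    return total / n


n = 1000
print(average_reads(n, 1))    # 500.5, close to n/2
print(average_reads(n, 10))   # 50.5, close to n/(2*b)
```

Blocking divides the number of reads by roughly b, but the count still grows linearly with n, which is why both cases remain O(n).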
·
Sequential
searches are good for many applications.
E.g.
·
When
searching for a particular pattern (and there is no specific record key)
·
Very
small files (e.g. 30 records)
·
Files
that only rarely need searching
·
Files
in which searches yield a large number of hits
·
Common
search utilities
·
wc (word count in UNIX)
·
counts
the number of lines, words and characters in a file
·
grep (named after the ed editor command g/re/p: globally search for a regular expression and print)
·
searches for all lines in a file that match a particular string or pattern
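
Both utilities are sequential scans over the whole file. The following rough Python imitation works on an in-memory string; the function names and the sample data are made up for illustration, and the real tools have many options not modelled here.

```python
import re


def wc(text):
    """Return (lines, words, characters) for a string, in the spirit of wc."""
    return text.count("\n"), len(text.split()), len(text)


def grep(pattern, text):
    """Return the lines of text that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


sample = "Ames Mary 123 Maple\nMason Alan 90 Eastgate\nAmes John 5 Oak\n"
print(wc(sample))
print(grep(r"^Ames", sample))
```

Neither function needs a record key: every character is read once from start to end, which is the situation where a sequential search is the natural choice.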
·
Skip
sequential search
·
Assume
a sorted file has n records blocked b records per block. What is the average number of blocks that
must be read in order to find a record?
What is the average number of comparisons that must be made? What is the optimal blocking factor required
to minimize the number of comparisons?
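
One way to explore these questions is to count comparisons directly. The sketch below assumes a particular variant of the method: the search compares the target with the last key of each block until it finds the block that could hold the target, then scans that block record by record. Under that assumption, a successful search examines about n/(2*b) blocks and makes about n/(2*b) + b/2 comparisons, which is smallest when b is near the square root of n. The function names are hypothetical.

```python
def skip_sequential_search(keys, target, b):
    """Search sorted keys in blocks of b; return (index, comparisons)."""
    comparisons = 0
    n = len(keys)
    for start in range(0, n, b):
        end = min(start + b, n)
        comparisons += 1                  # compare with the last key of the block
        if target <= keys[end - 1]:
            for i in range(start, end):   # scan inside the candidate block
                comparisons += 1
                if keys[i] == target:
                    return i, comparisons
            return -1, comparisons
    return -1, comparisons


def average_comparisons(n, b):
    """Average comparisons over a successful search for every key."""
    keys = list(range(n))
    return sum(skip_sequential_search(keys, k, b)[1] for k in keys) / n


n = 100
for b in (1, 5, 10, 20, 100):
    print(b, average_comparisons(n, b))
```

For n = 100 the average is lowest around b = 10. In practice the blocking factor is usually chosen to suit the device and buffer sizes, since I/O cost tends to dominate comparison cost.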
·
Direct
Access
·
Can
go to any specific record (using relative record number, RRN, in the file)
·
If record lengths are fixed then the RRN can be used to compute the record's CCHHR (cylinder, head, record address) in the file
·
Cannot
be used if records are of varying length
·
Can
go to any specific byte (using relative byte offset, RBO, in the file)
·
Can
go to a record with a specific key value
·
Requires
an index to determine the RBO in the file
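
The access paths above can be sketched in a few lines. The example uses in-memory byte streams in place of disk files and works at the level of byte offsets rather than disk addresses. The 16-byte record length, the newline-delimited "key|data" layout, and the function names are all assumptions made for the illustration.

```python
import io

RECORD_LEN = 16  # assumed fixed record length in bytes


def read_by_rrn(f, rrn, record_len=RECORD_LEN):
    """Fixed-length records: byte offset = RRN * record length."""
    f.seek(rrn * record_len)
    return f.read(record_len)


def build_index(f):
    """Scan newline-delimited 'key|data' records once, mapping key -> RBO."""
    index = {}
    f.seek(0)
    while True:
        rbo = f.tell()
        line = f.readline()
        if not line:
            break
        index[line.split(b"|", 1)[0]] = rbo
    return index


def read_by_key(f, index, key):
    """Variable-length records: look up the RBO in the index, then seek to it."""
    f.seek(index[key])
    return f.readline().rstrip(b"\n")


names = ["Ames", "Mason", "Zed"]
fixed = io.BytesIO(b"".join(n.ljust(RECORD_LEN).encode("ascii") for n in names))
print(read_by_rrn(fixed, 1).rstrip())               # b'Mason'

variable = io.BytesIO(b"17|Ames\n4|Mason, Alan\n99|Zed\n")
index = build_index(variable)
print(read_by_key(variable, index, b"4"))           # b'4|Mason, Alan'
```

The RRN computation needs no extra storage but relies on every record having the same length. The index costs one pass over the file to build, after which any record can be reached with a single seek.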
·
File
Organization vs File Access
·
File organization choices can be made independently of access
·
File
Access choices depend on what choices have been made for file organization
·
Abstract
Data Models
·
Application’s
view of the data
·
Removed
from the physical organization & device specific issues
·
Headers
and Self-Describing Files
·
Information about the physical characteristics of a file is kept in the file itself, e.g.
·
Names
of fields, offset and length of each field, fields per record, records per
block
·
Programs
(or access methods) must run in an interpretive mode and thus may experience
poorer performance
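
A minimal sketch of a self-describing file follows. The layout is invented for this example and is not a standard format: a 4-byte header length, a JSON header listing each field's name and length, then fixed-length records padded with spaces.

```python
import io
import json
import struct


def write_file(f, fields, rows):
    """fields: list of (name, length) pairs; rows: list of dicts of ASCII strings."""
    header = json.dumps({"fields": fields}).encode("utf-8")
    f.write(struct.pack(">I", len(header)))   # 4-byte big-endian header length
    f.write(header)
    for row in rows:
        for name, length in fields:
            f.write(row[name].encode("ascii").ljust(length)[:length])


def read_file(f):
    """Read the header first, then interpret every record according to it."""
    (header_len,) = struct.unpack(">I", f.read(4))
    fields = json.loads(f.read(header_len))["fields"]
    record_len = sum(length for _, length in fields)
    rows = []
    while True:
        record = f.read(record_len)
        if not record or len(record) < record_len:
            break
        row, pos = {}, 0
        for name, length in fields:
            row[name] = record[pos:pos + length].rstrip().decode("ascii")
            pos += length
        rows.append(row)
    return rows


buf = io.BytesIO()
write_file(buf, [("last", 10), ("first", 8)],
           [{"last": "Ames", "first": "Mary"}, {"last": "Mason", "first": "Alan"}])
buf.seek(0)
print(read_file(buf))
```

The reader contains no knowledge of the record layout. It looks up field names and lengths at run time for every record, which is the interpretive overhead mentioned above.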
·
Metadata
·
Data
that is related to the primary data of study
·
Can
be stored within the file in special locations or formats
·
Examples: scaling factor, offset, source, date, etc.
·
Indexes
are often maintained to locate primary data with certain metadata
characteristics
·
Standard
conventions concerning the format of commonly stored data often reduce the
amount of metadata required in a file.
·
Often
a reference code is stored in the file to denote a class of metadata (standard
type) and the details are assumed to be available from other sources.
·
Object-oriented
File Access
·
Programs
can process data as though they were always stored in RAM
·
An object-oriented file system performs the transformation from external formats (in files) to internal formats (in RAM).
·
Pointers
must change
·
Standards
in representation may change
·
Programmers
should not be responsible for the transformations
·
Portability
and Standardization
·
Factors
affecting portability
·
Operating
System Differences
·
Language
Differences
·
Machine
Architecture Differences
·
Support
Library Differences
·
Version
Differences
·
Achieving
Portability
·
Stick
with a Standard for Physical Record Format
·
Use
a Standard Binary Encoding for Data Elements
·
When
conversion is required, convert through a common standard
·
File
Structure Conversion
·
File
System Differences
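
As an illustration of a standard binary encoding, the sketch below fixes the byte order and field sizes explicitly instead of relying on the machine's native layout. The record layout (a 4-byte integer id followed by an 8-byte IEEE 754 double, both big-endian) is an assumption chosen for the example.

```python
import struct

# Assumed layout: 4-byte signed int, 8-byte double, big-endian, no padding.
RECORD = struct.Struct(">id")


def encode(record_id, value):
    """Pack one record into the agreed external format."""
    return RECORD.pack(record_id, value)


def decode(data):
    """Unpack one record from the agreed external format."""
    return RECORD.unpack(data)


data = encode(42, 2.5)
print(data.hex())                     # 0000002a4004000000000000
print(decode(data))                   # (42, 2.5)
print(struct.pack("<i", 42).hex())    # 2a000000: same integer, little-endian
```

The last line shows why the encoding has to be agreed upon: the same integer has a different byte sequence under the other byte order, so a file written in one machine's native format may be misread on another.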