Storing Data: Disks and Files


Low Level Data Storage

Because a database generally stores huge amounts of data, a database engine pays careful attention to the low-level aspects of memory management. The memory management policies are key to a DBMS, for reasons of efficiency, portability and overall control. Therefore, most commercial database systems take care to implement policies which would otherwise be handled by the operating system.

Memory Hierarchy

The typical memory hierarchy has multiple layers. A relatively simple example of such a hierarchy is the following:
CPU
 | 
CACHE
 | 
MAIN MEMORY
 | 
MAGNETIC DISK
 |
TAPE ARCHIVE
We will focus most of our attention on the interactions between neighboring levels of this hierarchy, and in particular between main memory and the magnetic disk.

Data is predominantly stored on the magnetic disk, for several reasons:

  • The amount of data stored in a typical database cannot be expected to fit in main memory.
  • An individual file may be so large that it could not be fully addressed by a 32-bit computer, even if it could fit in main memory.
  • For crash recovery, much of the data must be stored using non-volatile memory, and the disk drive generally serves this purpose.
  • At the same time, for the CPU to operate on any piece of data, that data must first be brought into main memory, if not already there. Because the access time to read/write a block of data from/to disk is orders of magnitude longer than most CPU operations, the number of disk I/O's is generally the bottleneck in terms of efficiency for database operations.


    Disk Space Management

    The major (software) components within a DBMS that involve access to physical storage are the following:
         [rest of DBMS]
               |
    +------------------------+
    | FILES & ACCESS METHODS |
    |          |             |
    |   BUFFER MANAGER       |
    |          |             |
    |   DISK SPACE MANAGER   |
    +------------------------+
               |
         [physical data]
    

    In short,

  • DISK SPACE MANAGER

    Manages the precise use of space on the disk, keeping track of which "pages" have been allocated, and handling requests to read data from or write data to those pages.
  • BUFFER MANAGER

    Manages the control of pages which are currently residing in main memory, as well as the transfer of those pages back and forth between main memory and the disk.
  • FILES & ACCESS METHODS

    Regardless of the low-level memory increments, much of the database software will want to view data as logically organized into files, each of which may be stored underneath using a large number of low-level data pages.

  • Let's examine each of these components in more detail:
  • DISK SPACE MANAGER

    The disk space manager will manage the space on the disk. It will create an abstraction of the disk as a collection of pages, on which the rest of the DBMS will rely. Typically, a page size will be equivalent to a disk block.

    Typical operations which it will support are:

  • Reading a page of data from the disk
  • Writing a page of data to the disk
  • Allocating or Deallocating a page of the disk for use
  • possibly allocating a group of "consecutive" pages for use
  • To manage the disk space, it must keep track of all of the currently free blocks. This is generally done in one of two ways (see the sketch following this list):

  • via a "free list"
  • via a "bitmap"
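
    As a rough illustration, the following Python sketch tracks free pages with a bitmap (one entry, here a boolean, per page); a free list could be maintained analogously by chaining free page numbers together. The class and method names are our own and not those of any particular DBMS.

      # Minimal sketch of bitmap-based free-space tracking (illustrative only).

      class BitmapDiskSpaceManager:
          def __init__(self, num_pages):
              # One entry per disk page: True means the page is free.
              self.free = [True] * num_pages

          def allocate_page(self):
              """Return the number of some free page, marking it as used."""
              for page_num, is_free in enumerate(self.free):
                  if is_free:
                      self.free[page_num] = False
                      return page_num
              raise RuntimeError("disk is full")

          def allocate_consecutive(self, count):
              """Find and allocate 'count' consecutive free pages."""
              run = 0
              for page_num, is_free in enumerate(self.free):
                  run = run + 1 if is_free else 0
                  if run == count:
                      start = page_num - count + 1
                      for p in range(start, start + count):
                          self.free[p] = False
                      return start
              raise RuntimeError("no such run of free pages")

          def deallocate_page(self, page_num):
              """Mark a previously allocated page as free again."""
              self.free[page_num] = True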

  • BUFFER MANAGER

    The CPU can only operate on data which exists in main memory. The buffer manager will be responsible for transferring pages between main memory and the underlying disk.

    The buffer manager organizes main memory into a collection of frames, where each frame has the ability to hold one page. The overall collection of these frames is referred to as the buffer pool. When a higher-level portion of the DBMS needs access to a page (referenced by a pageID), it will explicitly request that page from the Buffer Manager. Furthermore, that portion of the DBMS is expected to explicitly "release" the page, informing the Buffer Manager when it is no longer needed in main memory for the time being.

    When a portion of the DBMS submits a request to the Buffer Manager for a particular page, the manager must first determine whether or not this page is already in the current buffer pool. Generally, this can be accomplished by keeping a pair (pageID, frameNum) for each page which is currently in the pool. By storing this information in a hash table, the buffer manager can look up a given pageID to find whether it is already in the pool, and if so, in which frame.

    When a requested page is not in the buffer pool, the Buffer Manager will need to send a request to the Disk Manager to read that page from the disk, and it will need to determine which frame to store it in, and thus which existing page of the buffer pool to evict.

    The decision of which page to evict is complicated by several factors:

  • Several current processes may have requested a particular page at the same time, and that page can only be released from memory after all of the requesting processes have released the page.

    To accomplish this, a pin_count is kept for each page currently in the buffer. The count is initially zero; it is incremented each time a request for the page is served (a.k.a. "pinning"); and it is decremented each time a process subsequently releases the page (a.k.a. "unpinning").

    Thus, the evicted page must be chosen from those pages with a current pin count of zero. (If no such pages exist, then the request cannot be satisfied until some process unpins a page.) A small sketch of this bookkeeping appears after this list.

  • There may be several candidate pages for eviction. There are many factors that might influence our choice; we can adopt a particular "replacement policy" for such decisions. (we defer the discussion of such policies for the moment).
  • When a page is going to be evicted, we must be concerned as to whether the contents of that page in main memory were altered since the time it was brought in from the disk. If so, we must make sure to write the contents of the page back to the disk (via the Disk Manager). Conversely, if the page was only read, then we can remove it from main memory, knowing that the contents are still accurate on disk.

    To accomplish this, a boolean value known as the dirty bit is kept for each page in the buffer pool. When read from disk, the dirty bit is initially set to false. However, when each process releases the page, it must also inform the buffer manager of whether or not it had changed any of the memory contents while it was checked out. If so, then the dirty bit is set to true, ensuring that the contents will later be written to disk should this page be evicted.
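
    Putting these pieces together, a minimal sketch (in Python, with names of our own choosing) of the buffer manager's bookkeeping might look as follows; a real system would add locking, error handling, and a genuine replacement policy.

      # Illustrative sketch of buffer-pool bookkeeping; not any real DBMS's code.

      class Frame:
          def __init__(self):
              self.page_id = None      # which disk page occupies this frame (if any)
              self.pin_count = 0       # number of outstanding requests for the page
              self.dirty = False       # has the in-memory copy been modified?
              self.data = None         # the page contents themselves

      class BufferPool:
          def __init__(self, num_frames, disk):
              self.frames = [Frame() for _ in range(num_frames)]
              self.page_table = {}     # hash table: page_id -> frame number
              self.disk = disk         # assumed to provide read_page/write_page

          def pin(self, page_id):
              """Return the frame holding page_id, reading it from disk if needed."""
              if page_id in self.page_table:                 # already in the pool?
                  frame = self.frames[self.page_table[page_id]]
              else:
                  frame_num = self._choose_victim()          # pick a page to evict
                  frame = self.frames[frame_num]
                  if frame.dirty:                            # write back if modified
                      self.disk.write_page(frame.page_id, frame.data)
                  if frame.page_id is not None:
                      del self.page_table[frame.page_id]
                  frame.page_id = page_id
                  frame.data = self.disk.read_page(page_id)
                  frame.dirty = False
                  self.page_table[page_id] = frame_num
              frame.pin_count += 1
              return frame

          def unpin(self, page_id, modified):
              """Release one request for the page, noting whether it was changed."""
              frame = self.frames[self.page_table[page_id]]
              frame.pin_count -= 1
              frame.dirty = frame.dirty or modified

          def _choose_victim(self):
              # Any frame with pin_count == 0 is a candidate; choosing among
              # them is the job of the replacement policy discussed next.
              for frame_num, frame in enumerate(self.frames):
                  if frame.pin_count == 0:
                      return frame_num
              raise RuntimeError("all pages are currently pinned")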

  • Buffer Replacement Policies

  • LRU (Least Recently Used)

  • FIFO (First In First Out)

  • CLOCK
    This is meant to have behavior in the style of LRU yet with less overhead. Associated with each page is a referenced bit. Whenever the pin count is decremented to zero, the referenced bit is turned on.

    When looking for a page to evict, a counter current is used to scan all candidate pages (a sketch of this scan appears below, after the list of policies). When current reaches a page:

  • If the pin count is non-zero, the current page is left alone, and the current variable cycles to the next page.
  • If the pin count is zero, but the referenced bit is on, then the current page is left alone but the referenced bit is turned off, after which the current variable cycles to the next page.
  • If the pin count is zero and the referenced bit is off, then this page is evicted.
  • MRU (Most Recently Used)

  • RANDOM
  • Which policy to use depends on the access pattern of the database. Fortunately, the access pattern is often predictable and so the DBMS can take advantage of this knowledge.

    There are also times where a DBMS needs to be able to force a particular page to be written to the disk immediately, and so the buffer manager must support this type of request.
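
    As a rough illustration of the CLOCK scan described above, assuming each frame carries the pin_count and referenced fields already discussed, the eviction choice might be made as follows (the function and variable names are ours):

      # Illustrative sketch of the CLOCK replacement scan.
      # Assumes that unpinning a page down to a pin count of zero sets
      # frame.referenced = True, as described above.

      def clock_choose_victim(frames, state):
          """frames: list of frames with pin_count and referenced attributes.
             state:  dict holding the persistent position of the clock hand."""
          num_frames = len(frames)
          # In the worst case each frame is visited twice: once to clear its
          # referenced bit and once more to evict it.
          for _ in range(2 * num_frames):
              i = state["current"]
              state["current"] = (i + 1) % num_frames    # advance the hand
              frame = frames[i]
              if frame.pin_count > 0:
                  continue                               # pinned: leave it alone
              if frame.referenced:
                  frame.referenced = False               # give it a second chance
                  continue
              return i                                   # unpinned, unreferenced: evict
          raise RuntimeError("all pages are currently pinned")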


  • FILES & ACCESS METHODS

    As we work with higher-level portions of the DBMS, we must consider how data will be stored on the disk. We will consider all data to be represented as files, each of which is a collection of records. If all of the records of a file cannot fit on a single page of the disk, then multiple pages will be used to represent that file.

    For example, for a typical table in a relational database, each tuple would be a record, and the (unordered) set of tuples would be stored in a single file. Of course, other internal data for the DBMS can also be viewed as records and files.

    Each record has a unique identifier called a record id (rid). Among other things, this will identify the disk address of the page which contains the record.

    The file and access layer will manage the abstraction of a file of records. It will support the creation and destruction of files, as well as the insertion and deletion of records to and from the file. It will also support the retrieval of a particular record identified by rid, or a scan operation to step through all records of the file, one at a time.
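
    In outline, the interface offered by this layer might resemble the following sketch (the class and method names are our own, not those of any particular system):

      from collections import namedtuple

      # A record id names the page holding the record plus a position on that page.
      Rid = namedtuple("Rid", ["page_id", "slot"])

      class HeapFile:
          """Illustrative interface for a file of unordered records."""

          def insert(self, record_bytes):
              """Store the record on some page with room and return its Rid."""

          def delete(self, rid):
              """Remove the record identified by rid."""

          def get(self, rid):
              """Return the bytes of the record identified by rid."""

          def scan(self):
              """Yield (rid, record_bytes) for every record, one page at a time."""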

    Implementation

    The file layer will need to keep track of what pages are being used in a particular file, as well as how the records of the file are organized on those pages.

    There are several issues to address:

  • Whether the records in a file are to be maintained as an ordered collection or unordered.
  • Whether the records of a given file are of fixed size or of variable size.
  • We will consider three major issues:

  • Format of a single record
  • Format of a single page
  • Format of a single file

  • Format of a single record

  • Fixed-Length Records
    This is an easy scenario. If each field of a record has a fixed length, then the underlying data can be stored directly one field after another. Based on the known structure of the record, offsets can be calculated for accessing any given field.
  • Variable-Length Records
    If these records represent tuples from a relation, then each record must have the same number of fields. However, some domains may be used which result in fields that are variable in length.

    There are two general approaches to handling such records:

  • Separate fields with a chosen delimiter (control character). Then, the fields can be identified by scanning the entire record.
  • Reserve some space at the beginning of the record to provide offsets to the start of each field of the record. This allows you to jump to any particular field of the record.
  • The second of these approaches is generally preferred, as it offers minimal overhead and gives more efficient access to an arbitrary field (a small sketch of this layout appears at the end of this discussion).

    In general, working with variable length fields introduces some other subtle complexities:

  • Modifying a field may cause the record to grow, which may affect the page's placement of the record.
  • In fact, a record's growth may mean that it no longer fits in the space remaining on its current page.
  • A single record could potentially be so large that it does not even fit on a single page by itself.
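
    To make the offset-based approach concrete, the following Python sketch packs a record's fields behind a small array of field offsets; the header layout and the 16-bit offset width are choices of our own, made only for illustration.

      import struct

      # Header: a 16-bit field count followed by (count + 1) 16-bit offsets;
      # field i occupies bytes [offset[i], offset[i+1]) of the record.

      def pack_record(fields):
          """fields: a list of bytes objects; returns the serialized record."""
          count = len(fields)
          header_size = 2 + 2 * (count + 1)
          offsets = [header_size]
          for f in fields:
              offsets.append(offsets[-1] + len(f))
          header = struct.pack("<H", count)
          header += b"".join(struct.pack("<H", o) for o in offsets)
          return header + b"".join(fields)

      def get_field(record, i):
          """Jump directly to field i without scanning the earlier fields."""
          start = struct.unpack_from("<H", record, 2 + 2 * i)[0]
          end = struct.unpack_from("<H", record, 2 + 2 * (i + 1))[0]
          return record[start:end]

      # Example: a three-field tuple whose fields vary in length.
      rec = pack_record([b"Alice", b"Saint Louis", b"MO"])
      assert get_field(rec, 1) == b"Saint Louis"
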
  • Format of a single page

  • If Fixed-Length Records

    Consider the page to be broken into uniform slots, where the slot size is equal to the record size. An rid will be represented as < pageID, slot#>

    How to handle insertions and deletions depends on whether such rid's are held externally. If we are allowed to arbitrarily reorganize records, then we can efficiently ensure that all N records are kept in the first N slots of the page. However, if a record's rid must remain consistent, then we will have to leave "holes" after a deletion, and will have to scan for open slots upon an insertion.

  • If Variable-Length Records

    With variable-length records, we can no longer consider the page to be broken into fixed slots, because we do not know what size slots to use. Instead, we will have to find enough available space in the page to store a newly inserted record, if possible.

    Again, our approach will depend greatly on whether or not we are allowed to rearrange the order and placement of the records, at risk of redefining a record's rid. If we are allowed to adjust the placement of existing records, then we can always ensure that all of the records are kept compactly at one end of the page, and that all remaining freespace is contiguous at the other end of the page.

    However, if the validity of rid's must be preserved over time, we must adjust this approach. Our solution will be to add one level of indirection in the rid. Rather than have the rid directly reference the true placement of the record on the page, it can reference an entry in a slot directory maintained on the page, where that entry contains information about the true placement.

    Though this change of approach may seem insignificant, it allows us to internally rearrange the placement of the records of a page, so long as we update the slot directory accordingly (a small sketch of such a page appears below).

    One additional subtlety is that we still must manage the use of available entries in the slot directory as insertions and deletions of records take place. An existing page's record ID is now represented as < pageID, slotDirectoryEntry >.
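
    The slot-directory idea can be sketched as follows, with record data packed at one end of the page and a directory of (offset, length) entries describing where each record currently lives; the specific representation is ours, chosen only for illustration.

      # Illustrative slotted page with a slot directory; layout details are ours.

      class SlottedPage:
          def __init__(self, page_id):
              self.page_id = page_id
              self.records = bytearray()   # record data packed together
              self.slots = []              # directory: slot -> (offset, length) or None

          def insert(self, record_bytes):
              """Store the record and return its rid = (page_id, slot number)."""
              entry = (len(self.records), len(record_bytes))
              self.records += record_bytes
              for slot, existing in enumerate(self.slots):
                  if existing is None:     # reuse a freed directory entry
                      self.slots[slot] = entry
                      return (self.page_id, slot)
              self.slots.append(entry)     # otherwise extend the directory
              return (self.page_id, len(self.slots) - 1)

          def delete(self, rid):
              """Mark the directory entry free; its space can be reclaimed later."""
              _, slot = rid
              self.slots[slot] = None

          def get(self, rid):
              _, slot = rid
              offset, length = self.slots[slot]
              return bytes(self.records[offset:offset + length])

          def compact(self):
              """Rearrange the record data without invalidating any rid:
              only the directory entries are rewritten to new offsets."""
              packed = bytearray()
              for slot, entry in enumerate(self.slots):
                  if entry is not None:
                      offset, length = entry
                      self.slots[slot] = (len(packed), length)
                      packed += self.records[offset:offset + length]
              self.records = packed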

  • Format of a single file

  • Linked List of pages
    A small bit of space on each page can be used to represent the links to the previous or following pages in the file.

    If desired, the pages of a file can be kept in one of two separate linked lists:

  • One list contains those pages that are completely full
  • Another list contains those pages that have some remaining free space
  • Of course, this approach will not be very helpful with variable-length records, because it is quite unlikely that any pages will be completely full.

    In either case, finding an existing page with sufficient space for a new record may require walking through many pages from the list (and thus many I/Os, one per page).

  • Directory of Pages
    Another approach is to separately maintain a directory of pages (using some additional pages for the directory itself). The directory can contain an entry for each page of the file, representing whether that page has any free space or perhaps how much free space.

    To locate a page with enough free space for a new record may still require scanning the directory to find a suitable page. The advantage is that far fewer I/Os will be spent scanning the directory, as many directory entries will fit on a single page.
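
    As a small sketch, the search through such a directory could look roughly like this; the form of a directory entry (a page id and its remaining free space) is an assumption on our part.

      # Illustrative search of a directory of pages; entry format is our assumption.
      # Each directory page holds many (page_id, free_bytes) entries, so only a
      # few directory pages need to be read to examine the whole file.

      def find_page_with_space(directory_pages, needed_bytes):
          """directory_pages: list of lists of (page_id, free_bytes) entries.
             Returns the id of a data page with enough room, or None."""
          for dir_page in directory_pages:
              for page_id, free_bytes in dir_page:
                  if free_bytes >= needed_bytes:
                      return page_id
          return None

      # Example: two directory pages describing four data pages.
      directory = [[(0, 12), (1, 300)], [(2, 0), (3, 96)]]
      assert find_page_with_space(directory, 100) == 1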


  • Michael Goldwasser