
Saint Louis University

Computer Science 150
Introduction to Object-Oriented Programming

Michael Goldwasser

Spring 2013

Dept. of Math & Computer Science

Programming Assignment 10

Computing Disk Usage

Due: 11:59pm, Friday, 3 May 2013




Collaboration Policy

For this assignment you must work individually in regard to the design and implementation of your project.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

A computer's file system is another example of a structure that is organized recursively: a root directory contains a collection of files and perhaps other directories, which themselves contain files and possibly further directories. As a result, most algorithms that an operating system uses to examine or process portions of a file system can be implemented most elegantly with recursion.

In this assignment, you will implement one such algorithm: computing the overall amount of disk space used to store a directory and all contents recursively stored within it. As an example, there is a directory on turing with path /Public/goldwasser/150/engine. The contents of that directory are diagrammed as follows. (If you'd like that same structure on your own machine, feel free to download engine.zip.)

The engine directory contains four regular files as well as a subdirectory named corpus, which itself contains six files. Underneath each filename is the number of bytes of storage used by that file (for example, Alice in Wonderland uses 166887 bytes of storage). Note that the 4096 bytes listed for the corpus directory do not include the storage of the items within that directory; they are only the bytes the operating system uses to maintain the directory entry itself. The overall amount of disk space used by the engine directory together with all of its recursive contents is 5975508 bytes, which is the total of all numbers in the above diagram.

An algorithm for computing the overall disk usage can be implemented quite easily with recursion. In general, given a particular directory, the total space used is the nominal space used for representing that directory plus the space used for each entry within that directory (including a recursive computation of disk space for any subdirectories). So the overall space usage for the engine directory is its own 4096 bytes, plus the sum of the disk usage of its five entries. Four of those five entries are standard files with known sizes. But corpus is itself a directory, so to find its total disk usage we apply recursion: we take the sum of its own 4096 bytes and the 5958616 bytes used by the six files within, for a total of 5962712. Therefore, the overall usage for the engine directory is its own 4096 bytes, together with the 1663+681+5962712+2654+3702 bytes represented by the five entries within, for a grand total of 5975508 bytes.
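The arithmetic above can be checked with a few lines of Python. This is only a sanity check of the sums quoted in this assignment, not part of the required program; the variable names are made up for illustration.

```python
# Nominal bytes used by each directory entry itself (value quoted above).
DIRECTORY_OVERHEAD = 4096

# Sizes in bytes of the six files inside corpus and the four files in engine.
corpus_files = [166887, 716940, 3322004, 609492, 594238, 549055]
engine_files = [1663, 681, 2654, 3702]

# A directory's total is its own overhead plus everything stored within it.
corpus_total = DIRECTORY_OVERHEAD + sum(corpus_files)
engine_total = DIRECTORY_OVERHEAD + sum(engine_files) + corpus_total

print(corpus_total)  # 5962712
print(engine_total)  # 5975508
```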


Unix's du Command

There is a classic tool on Unix (and Linux) operating systems for calculating the overall amount of disk space used by a given directory. The program, named du (short for "disk usage"), is available on turing (and on any Mac OS X system, although that version seems to report disk usage in kilobytes by default).

The tool uses the recursive algorithm to compute the total usage. There are many options for running the tool. The variant we wish to consider uses a syntax such as the following:

du -ba /Public/goldwasser/150/engine

On turing, this command produces the following output:

1663    /Public/goldwasser/150/engine/ourStrip.py
681     /Public/goldwasser/150/engine/reverseDictionary.py
166887  /Public/goldwasser/150/engine/corpus/AliceInWonderland.txt
716940  /Public/goldwasser/150/engine/corpus/PrideAndPrejudice.txt
3322004 /Public/goldwasser/150/engine/corpus/LesMiserables.txt
609492  /Public/goldwasser/150/engine/corpus/HuckFinn.txt
594238  /Public/goldwasser/150/engine/corpus/SherlockHolmes.txt
549055  /Public/goldwasser/150/engine/corpus/GrimmFairyTales.txt
5962712 /Public/goldwasser/150/engine/corpus
2654    /Public/goldwasser/150/engine/TextIndex.py
3702    /Public/goldwasser/150/engine/Engine.py
5975508 /Public/goldwasser/150/engine

In the end, du has computed the 5975508 total bytes within /Public/goldwasser/150/engine, but along the way it reports all intermediate calculations. Notice that it is unable to report the grand total for a (sub)directory until after it has examined everything within that directory. For example, the total of 5962712 for engine/corpus is not known until the totals are known for all files within that directory. Similarly, the final total for engine is the last thing reported.


Your Task

You are to implement a version of the recursive disk usage algorithm in Python, writing a self-contained script named diskUsage.py. That script should ask the user for the starting path, and produce a report identical to what is given by the standard du tool. There should be one line for each entry in the file system at or below the starting path, with the number of bytes, followed by a single tab character ('\t' in Python), and then the path for the entry.

The high-level algorithm can be described as a recursive function that returns the total disk space used by a given path (and its contents).

Algorithm diskUsage(entry):
  total = immediate disk space used by the given entry
  if the entry is itself a directory:
    for each child within the directory do:
      total = total + diskUsage(child)
  print summary line for current entry
  return total
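
To see the order in which the summary lines appear, the pseudocode can be traced on a small in-memory model of the engine directory. The sketch below is an illustration only: it does not touch the real file system, and the tuple layout (name, size, children) and the function name disk_usage are assumptions made for this example, written for Python 3.

```python
def disk_usage(entry, prefix, lines):
    """Return the total bytes for entry, appending one report line per entry."""
    name, size, children = entry
    path = prefix + "/" + name if prefix else name
    total = size                      # immediate space used by this entry
    if children is not None:          # None marks a regular file in this model
        for child in children:
            total += disk_usage(child, path, lines)
    # The line is added only after all children have been processed.
    lines.append("{}\t{}".format(total, path))
    return total

# Hypothetical in-memory model using the sizes quoted in this assignment.
engine = ("engine", 4096, [
    ("ourStrip.py", 1663, None),
    ("reverseDictionary.py", 681, None),
    ("corpus", 4096, [
        ("AliceInWonderland.txt", 166887, None),
        ("PrideAndPrejudice.txt", 716940, None),
        ("LesMiserables.txt", 3322004, None),
        ("HuckFinn.txt", 609492, None),
        ("SherlockHolmes.txt", 594238, None),
        ("GrimmFairyTales.txt", 549055, None),
    ]),
    ("TextIndex.py", 2654, None),
    ("Engine.py", 3702, None),
])

report = []
grand_total = disk_usage(engine, "", report)
print("\n".join(report))
```

Because each line is produced after the recursive calls return, the line for corpus follows the six files inside it, and the line for engine comes last, matching the order of the du output.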
In the next section, we provide relevant information about how to interact with the operating system and file system in Python.


Python's os Module

The os module in Python provides many tools for interacting, in a portable way, with all major operating systems from within a Python program. Complete documentation on the module is available in the official Python documentation, but we will provide you with a summary of the particular tools that you will need to complete this assignment.
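
As a hedged illustration, the following Python 3 snippet exercises four standard calls that are likely to be useful here: os.listdir, os.path.join, os.path.isdir and os.path.getsize. Whether these are exactly the tools the course summary intends is an assumption. The temporary directory and the file names are invented for the example so that it runs anywhere.

```python
import os
import tempfile

# Build a tiny throwaway directory so the example does not depend on turing.
with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, "corpus")          # join path pieces portably
    os.mkdir(sub)
    with open(os.path.join(root, "notes.txt"), "w") as f:
        f.write("hello")                        # a 5-byte file

    # os.listdir returns bare names (in no guaranteed order), not full paths.
    names = sorted(os.listdir(root))

    # os.path.isdir distinguishes directories from regular files.
    kinds = {name: os.path.isdir(os.path.join(root, name)) for name in names}

    # os.path.getsize reports the bytes used by a file; for a directory it
    # reports the size of the directory entry itself, which varies by platform.
    file_size = os.path.getsize(os.path.join(root, "notes.txt"))
    dir_size = os.path.getsize(sub)

print(names)
print(kinds)
print(file_size)
```

Note that os.listdir gives only the names of the children, so each name must be joined to the parent path before it is passed to another function.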


Submitting Your Assignment

Your source code should be contained in a file named diskUsage.py. This file must be submitted electronically.

You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments.

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.


Extra Credit

If you run du starting at /Public/goldwasser/150 you may notice a line that reads

du: cannot read directory `/Public/goldwasser/150/contest/.contest': Permission denied
within the output. The reason is that the hierarchy contains a directory that you do not have permission to read. If you test your Python version of the disk usage program starting at /Public/goldwasser/150, your program may crash because of an uncaught exception.

The extra credit challenge is to design your program so that it gracefully handles such a situation, printing a line of output similar to the one shown by the du program.
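
One possible approach, sketched here for Python 3 under stated assumptions, is to wrap the directory listing in a try/except block. The helper name safe_listdir and the exact wording of the message are hypothetical; the only library facts relied on are that os.listdir raises OSError (PermissionError is a subclass) and that the exception carries a strerror attribute.

```python
import os

def safe_listdir(path):
    """Return the names inside path, or an empty list if it cannot be read."""
    try:
        return os.listdir(path)
    except OSError as err:
        # Covers PermissionError as well as other failures to read the directory.
        print("diskUsage: cannot read directory '{}': {}".format(path, err.strerror))
        return []

# A directory that can be read yields its list of names as usual.
print(safe_listdir("."))
```

Returning an empty list lets the recursive function carry on as if the unreadable directory had no children, so it still contributes its own nominal size to the total. The real du tool sends its warning to standard error rather than standard output; the sketch uses a plain print for simplicity.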


Michael Goldwasser
Last modified: Friday, 03 May 2013