Saint Louis University | Computer Science 1300/5001 | Computer Science Department
For this assignment you must work individually in regard to the design and implementation of your project.
Please make sure you adhere to the policies on academic integrity in this regard.
A computer's file system is another example of a structure that is organized recursively, as there is a root directory that contains a collection of files and perhaps other directories, that themselves contain a collection of files and possibly further directories. As a result, most algorithms that an operating system uses to examine or process portions of a computer's file system can be most elegantly implemented with recursion.
In this assignment, you will implement one such algorithm: computing the overall amount of disk space being used to store a directory and all contents recursively stored within. As an example, there is a directory on hopper with path /public/goldwasser/1300/engine. A representation of the contents of that directory is diagrammed as follows. (If you'd like that same structure on your own machine, feel free to download engine.zip.)
The engine directory contains four regular files as well as a
subdirectory named corpus, that itself contains six files.
Underneath each filename is the number of bytes of storage used by
that file (for example, Alice in Wonderland is using
An algorithm for computing the overall disk usage can be implemented quite easily with recursion. In general, given a particular directory, the total space used is the nominal space used for representing that directory plus the space used for each entry within that directory (including a recursive computation of disk space for any subdirectories). So the overall space usage for the engine directory is its own 7 bytes, plus the sum of the disk usage of its five entries. Four of those five entries are standard files with known sizes. But corpus is itself a directory, so to know its total disk usage we apply recursion; that results in taking the sum of its own 8 bytes together with the 5958616 bytes contained in the six files within, for a total of 5958624. Therefore, the overall usage for the engine directory is its own 7 bytes, together with the 1663+681+5958624+2654+3702 bytes represented by the five entries within, for a grand total of 5967331 bytes.
There exists a classic tool on Unix (and Linux) operating systems for calculating the overall amount of disk usage stored within a given directory. The program, named du (short for "disk usage"), is available on hopper (and on any Mac OS X system, although that version seems to report disk usage in kilobytes by default).
The tool uses the recursive algorithm to compute the total usage. There are many options for running the tool. The variant we wish to consider uses a syntax such as the following:
du -ba /public/goldwasser/1300/engine
On hopper, this command produces the following output:
1663 /public/goldwasser/1300/engine/ourStrip.py
3702 /public/goldwasser/1300/engine/Engine.py
3322004 /public/goldwasser/1300/engine/corpus/LesMiserables.txt
166887 /public/goldwasser/1300/engine/corpus/AliceInWonderland.txt
716940 /public/goldwasser/1300/engine/corpus/PrideAndPrejudice.txt
609492 /public/goldwasser/1300/engine/corpus/HuckFinn.txt
594238 /public/goldwasser/1300/engine/corpus/SherlockHolmes.txt
549055 /public/goldwasser/1300/engine/corpus/GrimmFairyTales.txt
5958624 /public/goldwasser/1300/engine/corpus
2654 /public/goldwasser/1300/engine/TextIndex.py
681 /public/goldwasser/1300/engine/reverseDictionary.py
5967331 /public/goldwasser/1300/engine
The end result is that it has computed the 5967331 total bytes within /public/goldwasser/1300/engine. But during the process it reports all intermediate calculations. Notice that it is unable to report the grand total for a (sub)directory until after it has examined all things within the directory. For example, the total of 5958624 for engine/corpus is not known until after all of the totals are known for the files within that directory. Similarly, the final total for engine is the last thing that is reported.
Note: if you copy this example on your own machine, you might see some variance, both because the order in which contents of a directory are reported might vary, and because different operating systems may require a different number of bytes for representing the directories themselves.
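As a cross-check of the report above, the following sketch copies the twelve reported figures into a Python list and recovers, by subtraction, how many bytes each directory contributes on its own. The variable and function names (BASE, report, own_size) are illustrative choices, not part of the assignment, and the numbers are specific to hopper.

```python
import posixpath

BASE = "/public/goldwasser/1300/engine"

# (bytes, path) pairs copied from the du report shown above.
report = [
    (1663, BASE + "/ourStrip.py"),
    (3702, BASE + "/Engine.py"),
    (3322004, BASE + "/corpus/LesMiserables.txt"),
    (166887, BASE + "/corpus/AliceInWonderland.txt"),
    (716940, BASE + "/corpus/PrideAndPrejudice.txt"),
    (609492, BASE + "/corpus/HuckFinn.txt"),
    (594238, BASE + "/corpus/SherlockHolmes.txt"),
    (549055, BASE + "/corpus/GrimmFairyTales.txt"),
    (5958624, BASE + "/corpus"),
    (2654, BASE + "/TextIndex.py"),
    (681, BASE + "/reverseDictionary.py"),
    (5967331, BASE),
]

sizes = {path: size for size, path in report}

# A reported path is a directory if some other reported path lies directly inside it.
directories = {posixpath.dirname(path) for path in sizes} & set(sizes)

def own_size(directory):
    """Reported total for the directory minus the totals of its direct children."""
    children = sum(size for path, size in sizes.items()
                   if posixpath.dirname(path) == directory)
    return sizes[directory] - children

for directory in sorted(directories):
    print(own_size(directory), directory, sep="\t")
```

On these figures the sketch attributes 7 bytes to engine itself and 8 bytes to corpus; as the note says, other systems may use different amounts for the directories themselves.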
You are to implement a version of the recursive disk usage algorithm in Python, writing a self-contained script named diskUsage.py. That script should ask the user for the starting path, and produce a report identical to what is given by the standard du tool. There should be one line for each entry in the file system at or below the starting path, with the number of bytes, followed by a single tab character ('\t' in Python), and then the path for the entry.
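The line format described above can be illustrated with a small helper. This is only a sketch; the name summary_line is a hypothetical choice and is not required by the assignment.

```python
def summary_line(num_bytes, path):
    """Format one report line: the byte count, a single tab, then the path."""
    return str(num_bytes) + "\t" + path

# Example using one of the entries from the du report.
print(summary_line(1663, "/public/goldwasser/1300/engine/ourStrip.py"))
```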
The high-level algorithm can be described as a recursive function that returns the total disk space used by a given path (and its contents).
Algorithm diskUsage(path):
    total = immediate disk space used by the entry at given path
    if the path is itself a directory:
        for each child within the directory do:
            total = total + diskUsage(child)
    print summary line for current path entry
    return total
In the next section, we provide relevant information about how to interact with the operating system and file system in Python.
The os module in Python provides many tools for interacting, in a portable way, with all major operating systems from within a Python program. Complete documentation on the module is available here, but we will provide you with a summary of the particular tools that you will need to complete this assignment.
os.path.getsize(path)
returns the immediate disk
usage (measured in bytes) for the file or directory that is
identified by the string path
(e.g.,
os.path.isdir(path)
returns True if the entry designated by the string
path is a directory; False otherwise.
os.listdir(path)
returns a list of strings that are the names of all entries
within a directory designated by string path. For
example, in the above case the call
os.listdir('/public/goldwasser/1300/engine') returns the
list
os.path.join(path, filename)
returns the string that results from joining the string
path and the string filename, using a
separator between the two that is appropriate for the given
operating system (e.g., the / character for Unix/Linux
systems, and the \ character for Windows).
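To show how the four tools above fit the high-level algorithm, here is a minimal sketch, not a reference solution. It assumes Python 3, and the function name disk_usage and the throwaway demo directory are illustrative choices. It also assumes the tree contains no symbolic links that loop back on themselves, since os.path.getsize and os.path.isdir follow links.

```python
import os
import tempfile

def disk_usage(path):
    """Return the total bytes used by path and everything below it.

    Prints one summary line per entry; a directory's line appears only
    after the lines for all of its contents.
    """
    total = os.path.getsize(path)              # immediate usage of this entry
    if os.path.isdir(path):
        for name in os.listdir(path):          # names only, not full paths
            child = os.path.join(path, name)   # build the child's full path
            total += disk_usage(child)         # recurse into files and subdirectories
    print(str(total) + "\t" + path)
    return total

# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, "corpus")
    os.mkdir(sub)
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("x" * 10)
    with open(os.path.join(sub, "b.txt"), "w") as f:
        f.write("y" * 25)
    grand_total = disk_usage(root)
```

Note that os.listdir returns bare names, so each one must be combined with the parent path via os.path.join before it can be passed to the other functions. In the actual script the starting path would come from the user rather than from a temporary directory.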
Your source code should be contained in a file named diskUsage.py. This file must be submitted electronically.
You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments.
Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.
The assignment is worth 40 points, distributed as follows:
If you run du starting at /public/goldwasser/1300 you may notice a line within the output that reads
du: cannot read directory `/public/goldwasser/1300/contest/.contest': Permission denied
The reason is that there exists a directory within that hierarchy that you do not have permission to read. If you test your Python version of the disk usage program starting at /public/goldwasser/1300, your program may crash because of an uncaught exception.
The extra credit challenge is to design your program so that it gracefully handles such a situation, printing a line of output similar to the one shown by the du program.
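One possible approach, sketched here under the assumption that the failure arises when os.listdir is called on the unreadable directory and that Python 3 is in use (where that failure is raised as PermissionError), is to wrap the listing step in a try/except. The helper name list_children and the exact wording of the message are hypothetical.

```python
import os
import sys

def list_children(path):
    """Return the full paths of the entries within directory `path`.

    If the directory cannot be read, print a du-style message to
    standard error and return an empty list so the caller can carry on.
    """
    try:
        names = os.listdir(path)
    except PermissionError as err:
        print("diskUsage: cannot read directory '" + path + "': " + str(err.strerror),
              file=sys.stderr)
        return []
    return [os.path.join(path, name) for name in names]

# Readable directories behave as usual.
print(len(list_children(os.curdir)), "entries in the current directory")
```

With this design an unreadable directory is still reported with its own immediate size, while its contents are simply left out of the total.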