Saint Louis University | Computer Science 1300 | Computer Science Department
In order to practice forms of file processing, we will take inspiration from a number of standard Unix/Linux tools that were created precisely to provide some common functionality. While you should certainly take time to learn those tools if you're going to spend time in a Unix-like environment, we'll reinvent the wheel, so to speak, and try to implement some of those same basic behaviors using Python.
Note: at the bottom of this document we provide some additional discussion of how you can write Python scripts that make use of command-line arguments in order to better emulate the Unix tools that we are exploring. You don't need to use those techniques, but they're great to know.
Note: If you are interested in learning more about any of these Unix tools, there is another tool named man that shows you the user manual for a given tool. For example, the command
man wc
will tell you much more about the wc tool.
The wc command name stands for "word count", as this program is used to count the number of lines, words, and characters of a text file. It is typically invoked from the terminal using a syntax such as
wc myfile.txt
and it produces output that may look something like the following
38 204 1161 myfile.txt
with the three numbers indicating the total number of lines, words, and characters in the file. Lines are defined based on separation by newline characters. Words are defined as sequences of non-white-space characters that are separated by any form of white space. Characters are characters, white space and newlines included.
Your task: Recreate this behavior as a Python script named wc.py. Note that you might do this either by processing a line at a time and keeping running totals for the number of lines, words, and characters that you encounter, or you might choose to read the entire file into one string and then perform various splits to see how many pieces you get as a result. Line-by-line processing is better for really big files, since it never holds the whole file in memory, but either will suffice. (Spoiler: we use this example in the text, with several implementations given on page 277.)
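The line-by-line approach described above might be sketched as follows. This is only one possible arrangement, not the textbook's implementation. The function names count_lines and wc are hypothetical, and the code assumes Python 3.

```python
def count_lines(lines):
    """Return (lines, words, characters) for any iterable of text lines."""
    num_lines = num_words = num_chars = 0
    for line in lines:
        num_lines += 1
        num_words += len(line.split())  # split() with no argument splits on any white space
        num_chars += len(line)          # includes the trailing newline, if present
    return num_lines, num_words, num_chars


def wc(filename):
    """Format the totals for one file in the style of the wc tool."""
    with open(filename) as f:           # a file object yields one line at a time
        totals = count_lines(f)
    return "%d %d %d %s" % (totals + (filename,))


# Small in-memory demonstration; an open file could be passed instead.
print(count_lines(["one two\n", "three\n"]))  # prints (2, 3, 14)
```

One difference to be aware of: this sketch counts a final line even when it lacks a trailing newline, whereas the Unix tool counts newline characters, so the two can disagree by one on such files.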
The head command is used to quickly examine the beginning of a file. By default, a command such as
head myfile.txt
displays the first 10 lines of the given file (more technically, at most 10 lines, since some files will not have 10 lines). But this command can be used to show the head of many different files sequentially, such as
head myfile.txt another.txt yetmore.txt
Also, there is a way to provide additional command-line arguments to change the value 10 to some other number of lines to show. In Unix, this might be done with syntax
head -n 20 myfile.txt
Your task: Recreate the basic version of head to show the first 10 lines of an indicated file. If you wish to read ahead, you can even allow a syntax where command-line arguments can be used to change the number of lines that are shown.
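As a rough sketch of the basic behavior, the following stops reading as soon as enough lines have been seen. The names head and print_head and the count parameter are hypothetical choices for illustration, and the code assumes Python 3.

```python
import sys
from itertools import islice


def head(lines, count=10):
    """Return at most `count` leading lines from any iterable of lines."""
    return list(islice(lines, count))  # islice stops early; the rest is never read


def print_head(filename, count=10):
    """Echo the first `count` lines of the named file."""
    with open(filename) as f:
        for line in head(f, count):
            sys.stdout.write(line)     # each line already ends with its own newline
```

A file with fewer than 10 lines simply produces fewer lines of output, matching the "at most 10" behavior described above.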
The tail command is essentially symmetric to head, except with focus on showing the last so many lines of a file. You're welcome to try to re-create this tool if you are inspired.
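One way to approach tail, sketched here under the assumption of Python 3, is to read the file once while remembering only the most recent lines. The function name tail is hypothetical. The bounded deque from the standard collections module discards the oldest entry whenever a new one would exceed its maximum length.

```python
from collections import deque


def tail(lines, count=10):
    """Return at most `count` trailing lines from any iterable of lines."""
    # A deque with maxlen keeps only the newest `count` items it is given.
    return list(deque(lines, maxlen=count))


# Small in-memory demonstration; an open file could be passed instead.
print(tail(["a\n", "b\n", "c\n", "d\n"], 2))  # prints ['c\n', 'd\n']
```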
In basic form, grep allows you to search for any given string within a file, and it displays all lines of the file in which a match was found. The Unix tool can be invoked using a syntax such as
grep needle haystack.txt
in which case it tries to find the string "needle" within the file haystack.txt. By default, this is a case-sensitive search, in that the string "Needle" is not a match. Also, the pattern need not be a full word to be a match. There are many other forms of the Unix tool syntax, but a few examples are:
grep -i needle haystack.txt # does a case-INSENSITIVE search
grep "if so" document.txt # if pattern includes spaces, must quote the full pattern
and there's much more functionality to allow you to use regular expressions to give more intricate patterns that you want to find.
Your task: Recreate the most basic form of grep. Ask the user for a filename and a string pattern, and echo all lines of the file that contain that pattern within.
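The matching step could be sketched as a plain substring test on each line. The function name grep and the ignore_case parameter (loosely mirroring the -i option) are hypothetical, the code assumes Python 3, and the prompts for the filename and pattern are left out so that the core logic stands on its own.

```python
def grep(lines, pattern, ignore_case=False):
    """Yield each line that contains `pattern` as a substring."""
    if ignore_case:
        pattern = pattern.lower()
    for line in lines:
        candidate = line.lower() if ignore_case else line
        if pattern in candidate:  # substring test, so partial words match too
            yield line            # yield the original line, not the lowered copy


# Small in-memory demonstration; an open file could be passed instead.
sample = ["a needle here\n", "Needle\n", "hay\n"]
print(list(grep(sample, "needle")))  # prints ['a needle here\n']
```

Unlike the real tool, this sketch treats the pattern as literal text and does not interpret regular expressions.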
The sort command echoes the contents of a file, yet with the lines of that file sorted alphabetically (at least, that's the default order). Its default syntax is
sort document.txt
but there are indeed a variety of other command-line arguments that are typically used to control how sorting is done (alphabetically vs. numerically, reverse sorting, extracting some portion of a line to use as the sort key).
Also, by default the result is printed to the standard output, but there are command-line arguments that allow you to specify a new filename in which the results should be saved.
Your task: Recreate the most basic form of sort, perhaps with an option to write the results to some other file rather than to print them.
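A minimal sketch of this task, assuming Python 3, reads all lines, sorts them with the built-in sorted function, and then either prints the result or saves it. The names sorted_lines and sort_file and the optional output parameter are hypothetical.

```python
import sys


def sorted_lines(lines):
    """Return the lines in sorted order, with trailing newlines removed."""
    return sorted(line.rstrip("\n") for line in lines)


def sort_file(filename, output=None):
    """Sort the named file; print the result, or save it if `output` is given."""
    with open(filename) as source:
        ordered = sorted_lines(source)
    # Re-adding the newline to every line keeps a final unterminated line
    # from running into whichever line ends up after it.
    text = "".join(line + "\n" for line in ordered)
    if output is None:
        sys.stdout.write(text)
    else:
        with open(output, "w") as destination:
            destination.write(text)
```

Note that sorted compares strings character by character using their underlying character codes, so all uppercase letters come before lowercase ones. The Unix tool's default order may differ depending on the system's locale settings.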
The uniq command echoes the contents of a file, except that it omits any line that duplicates the line immediately before it. Its default syntax is
uniq scores.txt
As an example, if scores were originally
5
5
5
7
8
8
8
5
7
then the output would be
5
7
8
5
7
There are additional command-line options that allow for variants of this task, for example to display only those lines that were duplicated or to give counts of the number of duplicates that were found.
Admittedly, it is common for someone to first use the sort tool on a data set to force duplicates to be consecutive, and then to use the uniq tool as a secondary step.
Your task: Give a basic implementation of the default uniq behavior, echoing all lines of a file except omitting any consecutively duplicated lines.
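The default behavior only requires remembering the previous line, as in the following sketch. The function name uniq is a hypothetical choice, and the code assumes Python 3.

```python
def uniq(lines):
    """Yield each line unless it equals the line immediately before it."""
    previous = None           # None never equals a string, so the first line is kept
    for line in lines:
        if line != previous:
            yield line
        previous = line


# The scores example from above, one entry per line of the file.
scores = ["5", "5", "5", "7", "8", "8", "8", "5", "7"]
print(list(uniq(scores)))     # prints ['5', '7', '8', '5', '7']
```

Because only adjacent lines are compared, the later 5 and 7 still appear in the output, just as in the example.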
The more command is a more interactive tool for viewing the contents of a file, with commands that allow you to advance/retreat a line at a time or a page at a time, or to do basic forward searches for patterns. It's harder to recreate purely in Python, but still a great Unix command worth knowing about!
Note: Command-line arguments in Python
The Unix tools that we are emulating are typically meant to be started in the operating system's shell (also known as a terminal or command prompt). For example, the wc program that we will start with is typically invoked as
wc myfile.txt
with the filename indicated as a separate argument after the command. In Python, you could choose to instead make a more traditional interactive program where you wait to ask the user what the filename is when the program is running. But it is also possible to write Python programs that can take such "command-line arguments" from the terminal. For example, if your script is named wc.py then you normally might start the process in a shell with the command
python wc.py
but if you'd like, you can allow the user to specify extra arguments when starting the program, for example with a command such as:
python wc.py myfile.txt
From within your Python source code, you have access to all such command-line arguments through a module named sys. In particular, if you start with
import sys
then the variable sys.argv will be a list of the strings that were provided as command-line arguments when the program was started. The first string on that list, sys.argv[0], will actually be the name of the Python source file (e.g., wc.py in the above example). But if the program was executed as
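python wc.py myfile.txt
then sys.argv[1] will be the string 'myfile.txt'. A script can use that argument when it is given and otherwise fall back to asking the user:
import sys
if len(sys.argv) >= 2:  # they gave additional argument
    filename = sys.argv[1]
else:
    # raw_input is the Python 2 name; in Python 3, use input instead
    filename = raw_input("Enter filename: ")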
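The same list can be examined for options, such as the -n syntax shown earlier for head. The following is a minimal sketch, assuming Python 3 and a hypothetical helper named parse_head_args. It handles only the "-n NUMBER" form at the start of the arguments and does no error checking.

```python
def parse_head_args(argv):
    """Return (count, filenames) from an argument list shaped like sys.argv."""
    args = argv[1:]               # skip the script name stored in argv[0]
    count = 10                    # default number of lines, as with head
    if len(args) >= 2 and args[0] == "-n":
        count = int(args[1])      # command-line arguments always arrive as strings
        args = args[2:]           # whatever remains is treated as filenames
    return count, args


# In a real script, the call would be parse_head_args(sys.argv).
print(parse_head_args(["head.py", "-n", "20", "myfile.txt"]))  # prints (20, ['myfile.txt'])
```

For scripts with many options, Python's standard argparse module is usually a better fit than inspecting the list by hand.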