Unix Tools

Computer Science 1300
Introduction to Object-Oriented Programming

In order to practice forms of file processing, we will take inspiration from a number of standard Unix/Linux tools that were created precisely to provide some common functionality. While you should certainly take time to learn to use those tools if you're going to spend time in a Unix-like environment, we'll reinvent the wheel, so to speak, and try to implement some of those same basic behaviors using Python.

Note: at the bottom of this document we provide some additional discussion of how you can write Python scripts that make use of command-line arguments in order to better emulate the Unix tools that we are exploring. You don't need to use those techniques, but they're great to know.

Note: If you are interested in learning more about any of these Unix tools, there is another tool named man that shows you the user manual for a given tool. For example, the command

man wc

will tell you much more about the wc tool.

wc

The wc command name stands for "word count", as this program is used to count the number of lines, words, and characters of a text file. It is typically invoked from the terminal using a syntax such as

wc myfile.txt

and it produces output that may look something like the following

 38   204  1161   myfile.txt

with the three numbers indicating the total number of lines, words, and characters in the file. Lines are defined based on separation by newline characters. Words are defined as sequences of non-whitespace characters separated by any form of whitespace. Every character counts toward the character total, including spaces and the newline characters themselves.

Your task: recreate this behavior as a Python script named wc.py. Note that you might do this either by processing a line at a time and keeping running totals for the number of lines, words, and characters that you encounter, or you might choose to read the entire file into one string and then perform various splits to see how many pieces you get as a result. Clearly, the line-by-line processing is better for really big files, but either will suffice. (Spoiler: we use this example in the text, with several implementations given on page 277.)
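As one illustration of the line-by-line approach with running totals, here is a minimal sketch written for Python 3. The function name count_file and the sample data are hypothetical, and this is not necessarily how the textbook implementations are written. One known simplification: a final line that lacks a trailing newline is still counted as a line here, whereas the real wc counts newline characters.

```python
def count_file(lines):
    """Return (lines, words, characters) totals for an iterable of text lines."""
    num_lines = num_words = num_chars = 0
    for line in lines:
        num_lines += 1
        num_words += len(line.split())   # split() with no argument splits on any whitespace
        num_chars += len(line)           # includes the trailing newline, if present
    return num_lines, num_words, num_chars

# Hypothetical sample data standing in for the lines of a file.
sample = ["the quick brown fox\n", "jumps over\n", "\n", "the lazy dog\n"]
print(count_file(sample))                # prints (4, 9, 45)
```

Because an open file can be iterated a line at a time, the same function could be given a file object, for example the result of open("myfile.txt"), without reading the whole file into memory.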

head

The head command is used to quickly examine the beginning of a file. By default, a command such as

head myfile.txt

displays the first 10 lines of the given file (more technically, at most 10 lines, since some files will not have 10 lines). But this command can be used to show the head of many different files sequentially, such as

head myfile.txt another.txt yetmore.txt

Also, there is a way to provide additional command-line arguments to change the value 10 to some other number of lines to show. In Unix, this might be done with the syntax

head -n 20 myfile.txt

Your task: Recreate the basic version of head to show the first 10 lines of an indicated file. If you wish to read ahead, you can even allow a syntax where command-line arguments can be used to change the number of lines that are shown.
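The "at most 10 lines" behavior can be sketched as a loop that stops as soon as enough lines have been collected. This is only one possible approach, written for Python 3; the function name head, its count parameter, and the sample data are assumptions made for illustration.

```python
def head(lines, count=10):
    """Return at most the first `count` items of an iterable of lines."""
    result = []
    for line in lines:
        if len(result) >= count:
            break                # stop early; the rest of the input is never read
        result.append(line)
    return result

# Hypothetical sample data: twelve numbered lines.
sample = ["line %d\n" % n for n in range(1, 13)]
print("".join(head(sample, 3)), end="")   # shows only the first three lines
```

Short inputs need no special case: the loop simply ends when the lines run out.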

tail

The tail command is essentially symmetric to head, except that it shows the last several lines of a file. You're welcome to try to re-create this tool if you are inspired.
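If you do try it, one possible sketch (Python 3) keeps only the most recent lines while reading, using collections.deque from the standard library with a maxlen limit. The function name tail, the default of 10 (chosen by symmetry with head), and the sample data are assumptions for illustration.

```python
from collections import deque

def tail(lines, count=10):
    """Return at most the last `count` items of an iterable of lines."""
    # A deque with maxlen discards its oldest item whenever a new one is
    # appended to a full deque, so only the final `count` lines survive.
    return list(deque(lines, maxlen=count))

# Hypothetical sample data: twelve numbered lines.
sample = ["line %d\n" % n for n in range(1, 13)]
print("".join(tail(sample, 3)), end="")   # shows only the last three lines
```

This still reads the whole input once, but it never holds more than count lines in memory at a time.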

grep

In its basic form, grep allows you to search for any given string within a file, and it displays all lines of the file in which a match was found. The Unix tool can be invoked using a syntax such as

grep needle haystack.txt

in which case it tries to find the string "needle" within the file haystack.txt. By default, this is a case-sensitive search, in that the string "Needle" is not a match. Also, the pattern need not be a full word to be a match. There are many other forms of the Unix tool syntax, but a few examples are:

grep -i needle haystack.txt                # does a case-INSENSITIVE search
grep "if so"  document.txt                 # if pattern includes spaces, must quote the full pattern

and there's much more functionality to allow you to use regular expressions to give more intricate patterns that you want to find.

Your task: Recreate the most basic form of grep. Ask the user for a filename and a string pattern, and echo all lines of the file that contain that pattern within.
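The matching logic itself can be sketched with Python's in operator, which tests for a substring and so does not require the pattern to be a full word. This is a hedged sketch for Python 3 rather than a complete solution: the function name grep, the ignore_case parameter (loosely mirroring the -i option), and the sample lines are all assumptions, and asking the user for the filename and pattern is left out.

```python
def grep(lines, pattern, ignore_case=False):
    """Return the lines that contain `pattern` as a plain substring."""
    if ignore_case:
        pattern = pattern.lower()
    matches = []
    for line in lines:
        # Compare lowercase copies when ignoring case, but keep the original line.
        candidate = line.lower() if ignore_case else line
        if pattern in candidate:
            matches.append(line)
    return matches

# Hypothetical sample data standing in for the lines of haystack.txt.
haystack = ["a needle here\n", "Needle again\n", "nothing\n", "needles\n"]
print("".join(grep(haystack, "needle")), end="")
```

With the default case-sensitive search, the line beginning with "Needle" is skipped, while "needles" matches because the pattern appears inside the longer word.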

sort

The sort command echoes the contents of a file, but with the lines of that file sorted alphabetically (at least, that's the default order). Its default syntax is

sort document.txt

but there are indeed a variety of other command-line arguments that are typically used to control how sorting is done (alphabetically vs. numerically, reverse sorting, extracting some portion of a line to use as the sort key).

Also, by default the result is printed to the standard output, but there are command-line arguments that allow you to specify a new filename in which the results should be saved.

Your task: Recreate the most basic form of sort, perhaps with an option to write the results to some other file rather than to print them.
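A minimal sketch of this idea (Python 3) relies on the built-in sorted function and writes to whatever file-like object it is given, falling back to standard output. The names sort_lines and write_sorted and the sample data are hypothetical. Note also that sorted compares strings by character code, so uppercase letters come before lowercase ones, which may differ from the order the Unix tool produces.

```python
import io
import sys

def sort_lines(lines):
    """Return a new list with the lines in sorted order."""
    return sorted(lines)

def write_sorted(lines, out=None):
    """Write the sorted lines to `out`, or to standard output if none is given."""
    if out is None:
        out = sys.stdout
    for line in sort_lines(lines):
        out.write(line)

# Hypothetical sample data standing in for the lines of document.txt.
sample = ["pear\n", "apple\n", "fig\n"]
write_sorted(sample)             # prints the sorted lines

buffer = io.StringIO()           # stands in for a file opened with open(name, "w")
write_sorted(sample, buffer)
```

Passing a file opened for writing in place of buffer would save the results to that file rather than printing them.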

uniq

The uniq command processes a file and removes any consecutive duplicate lines. It can be invoked using syntax

uniq scores.txt

As an example, if scores were originally

then the output would be

There are additional command-line options that allow for variants of this task, for example to display only those lines that were duplicated or to give counts of the number of duplicates that were found.

Admittedly, it is common for someone to first use the sort tool on a data set to force duplicates to be consecutive, and then to use the uniq tool as a second step.

Your task: Give a basic implementation of the default uniq behavior, echoing all lines of a file except omitting any consecutively duplicated lines.
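One way to sketch the default behavior (Python 3) is to remember the previous line and echo a line only when it differs. The function name uniq and the scores below are hypothetical illustration data, not the example from this page.

```python
def uniq(lines):
    """Return the lines with consecutive duplicates collapsed to one copy."""
    result = []
    previous = None              # no line has been seen yet
    for line in lines:
        if line != previous:
            result.append(line)
        previous = line
    return result

# Hypothetical data: 90 appears in two separate runs.
scores = ["90\n", "90\n", "85\n", "90\n", "90\n", "90\n", "72\n"]
print("".join(uniq(scores)), end="")   # prints 90, 85, 90, 72 on separate lines
```

The value 90 appears twice in the output because its two runs are not adjacent; only consecutive duplicates are removed.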

more

The more command is a more interactive tool for viewing the contents of a file, with commands that allow you to advance or retreat a line at a time or a page at a time, or to do basic forward searches for patterns. It's harder to recreate purely in Python, but it is still a great Unix command worth knowing about!

Note: Command-line arguments in Python

The Unix tools that we are emulating are typically meant to be started in the operating system's shell (also known as the terminal or command prompt). For example, the wc program that we will start with is typically invoked as

wc myfile.txt

with the filename indicated as a separate argument after the command. In Python, you could instead write a more traditional interactive program that asks the user for the filename while the program is running. But it is also possible to write Python programs that take such "command-line arguments" from the terminal. For example, if your script is named wc.py then you normally might start the process in a shell with the command

python wc.py

but if you'd like, you can allow the user to specify extra arguments when starting the program, for example with a command such as:

python wc.py myfile.txt

From within your Python source code, you have access to all such command-line arguments through a module named sys. In particular if you start with

import sys

then the variable sys.argv will be a list of the strings that were provided as command-line arguments when the program was started. The first string in that list, sys.argv[0], will actually be the name of the Python source file (e.g., wc.py in the above example). But if the program was executed as

python wc.py myfile.txt

then sys.argv will be ['wc.py', 'myfile.txt']. You can then write logic to test whether the user supplied command-line arguments and, if so, how many. For example, to accept an optional filename in the script wc.py, I might start it as follows:

import sys

if len(sys.argv) >= 2:      # the user gave at least one additional argument
    filename = sys.argv[1]
else:                       # no argument given, so ask interactively
    filename = raw_input("Enter filename: ")   # in Python 3, use input() instead
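The same list-based reasoning extends to an option such as the -n 20 form of head described earlier. The sketch below (Python 3) is only one hedged possibility: the function name parse_head_args and the script name head.py are hypothetical, and it handles only the case where -n and its number come before the filenames.

```python
def parse_head_args(argv):
    """Return (count, filenames) from a list shaped like sys.argv."""
    args = argv[1:]              # skip the script name stored in argv[0]
    count = 10                   # default number of lines, as with head
    if len(args) >= 2 and args[0] == "-n":
        count = int(args[1])     # arguments arrive as strings, so convert
        args = args[2:]          # whatever remains is treated as filenames
    return count, args

# A hand-built list is used here in place of the real sys.argv.
print(parse_head_args(["head.py", "-n", "20", "myfile.txt"]))   # prints (20, ['myfile.txt'])
```

Taking the list as a parameter, rather than reading sys.argv directly inside the function, makes the logic easy to try out interactively; in a real script you would call parse_head_args(sys.argv).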

Michael Goldwasser

CSCI 1300, Fall 2017
Last modified: Tuesday, 10 October 2017

Saint Louis University

Computer Science Department
