Saint Louis University | Computer Science 1300 | Computer Science Department
In order to practice use of loops and conditionals, we wanted to play with an easily accessible data set. Our solution is to use a system-wide list of words and phrases that supports spell-checking and other language tools. On hopper, this list can be found at location /usr/share/dict/words, but if you wish to download this same word set on your own computer, here is a copy.
We should note that this list is almost comically extensive (as we will see), and it includes some proper words that are capitalized, some numerals, and some punctuated terms with apostrophes, hyphens, periods, etc. The list is alphabetized, but in a case-insensitive way.
To load this data set into a standard list of strings, we ask that you use the following command, even though we are not yet ready to explain how this command works:
words = [line.strip() for line in open('/usr/share/dict/words') if line.strip()]

or with a more local filename if you are using the saved word file on your own machine.
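For readers curious about what the loading command does, here is a rough long-form sketch that builds what should be the same list using only a for loop and a conditional. The function name load_words is hypothetical (it is not part of the course materials), and the with statement is used here only so that the file gets closed afterwards.

```python
def load_words(path):
    """Return the non-blank lines of the file at `path`, stripped of whitespace."""
    words = []
    with open(path) as source:      # the file is closed automatically afterwards
        for line in source:
            entry = line.strip()    # drop the trailing newline and surrounding spaces
            if entry:               # skip lines that are blank after stripping
                words.append(entry)
    return words
```

Assuming the word file is available, a call such as words = load_words('/usr/share/dict/words') should produce the list used in the rest of the activity.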
For the rest of the day, we are creating a version of a "word scavenger hunt" by having you write Python code to answer the following questions. All of these can be answered by using combinations of for loops, conditional statements, and many of the convenient behaviors of Python's str class (see online documentation).
There is an entry of the list that has 7 occurrences of the character 'i'. What is that entry?
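As a hint for this kind of search, the following sketch combines a for loop, a conditional, and the count method of the str class. The function name entries_with_count and the tiny sample list are made up for illustration; they are not the real data set.

```python
def entries_with_count(words, character, target):
    """Return every entry containing exactly `target` copies of `character`."""
    matches = []
    for entry in words:
        if entry.count(character) == target:
            matches.append(entry)
    return matches

# A made-up sample list, standing in for the real word list.
sample = ['mississippi', 'indivisibility', 'banana']
print(entries_with_count(sample, 'i', 4))
```

On the real list, the analogous call would be entries_with_count(words, 'i', 7).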
There are a number of hyphenated phrases in the data set of the form "[blank]-to-[blank]" for some word (e.g., hand-to-hand). Determine precisely how many such phrases exist.
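One possible approach, sketched below, splits each entry around the text '-to-' and compares the two pieces. This assumes the question means the same word on both sides (as in hand-to-hand); the function name and the sample list are hypothetical.

```python
def count_word_to_word(words):
    """Count entries of the form X-to-X, where X is the same non-empty text."""
    total = 0
    for entry in words:
        parts = entry.split('-to-')
        # Exactly two pieces means '-to-' occurred once; both must match.
        if len(parts) == 2 and parts[0] != '' and parts[0] == parts[1]:
            total += 1
    return total

# A made-up sample list, standing in for the real word list.
sample = ['face-to-face', 'hand-to-hand', 'hand-to-mouth', 'tomato', 'up-to-date']
print(count_word_to_word(sample))
```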
What is the longest entry in the data set, and what is its length? (In case of a tie, you may report any such longest word, although it turns out to be unique for this data set.)
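A common pattern for this kind of question is to remember the best candidate seen so far while looping. The sketch below uses a hypothetical function name and a made-up sample list; when there is a tie it keeps the first of the longest entries.

```python
def longest_entry(words):
    """Return the first entry of maximum length (or '' for an empty list)."""
    longest = ''
    for entry in words:
        if len(entry) > len(longest):   # strictly longer, so ties keep the earlier entry
            longest = entry
    return longest

# A made-up sample list, standing in for the real word list.
sample = ['cat', 'elephant', 'hippopotamus', 'dog']
best = longest_entry(sample)
print(best, len(best))
```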
If we consider only alphabetic characters in the wordlist (use the c.isalpha() method to test character c), and if we ignore their case, what percentage of those letters are vowels (i.e., AEIOU)? Compute the answer to the nearest hundredth of a percent.
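The counting described here needs a nested loop: one over entries and one over the characters of each entry. The following is a minimal sketch with a hypothetical function name and a made-up sample list; it assumes the list contains at least one letter, since otherwise the division would fail.

```python
def vowel_percentage(words):
    """Percentage of alphabetic characters that are vowels, ignoring case."""
    letters = 0
    vowels = 0
    for entry in words:
        for c in entry:
            if c.isalpha():             # ignore digits, hyphens, apostrophes, etc.
                letters += 1
                if c.lower() in 'aeiou':
                    vowels += 1
    return 100 * vowels / letters       # assumes letters > 0

# A made-up sample: 11 letters in total, 4 of them vowels.
sample = ['Apple', "can't", 'O.K.']
print(round(vowel_percentage(sample), 2))
```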
There is a classic spelling rule
We say that a word is proper if it appears capitalized in the word list (e.g., Missouri). There are some entries in this list that appear in both proper form and improper form (e.g., Apple and apple). Because of the way the list is alphabetized, when this happens those words must be neighbors of each other in the list, and the capitalized occurrence will be the first of the two. Given this knowledge, determine how many such proper/improper pairs exist.
Note: Some of these pairs are strange because I don't see why the improper form is included. For example, this data set includes both Missouri and missouri. But we'll count them nonetheless.
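Because the two forms are neighbors, one way to look for them is to loop over indices and compare each entry with the one after it. The sketch below assumes that "proper" means the first character is uppercase and that the improper form differs only by lowercasing that first character; the function name and sample list are hypothetical.

```python
def count_proper_improper_pairs(words):
    """Count neighbors such as 'Apple' followed immediately by 'apple'."""
    total = 0
    for i in range(len(words) - 1):     # stop one early so words[i + 1] exists
        first = words[i]
        second = words[i + 1]
        # Assumes no entry is empty (the loading command skips blank lines).
        if first[0].isupper() and first[0].lower() + first[1:] == second:
            total += 1
    return total

# A made-up sample list, standing in for the real word list.
sample = ['Apple', 'apple', 'banana', 'Missouri', 'missouri', 'Zebra']
print(count_proper_improper_pairs(sample))
```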
We say a word has a repeated pair if two consecutive letters are the same. Since our data set includes some bizarre entries such as "AAAAA", we will say that the pair must not be part of three consecutive occurrences of the same character. What is the largest number of repeated pairs contained by any entry? Give an example of such an entry that begins with the letter 'p'.
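One way to honor the "not part of three in a row" condition is to measure each run of identical characters and count only the runs of length exactly two. The sketch below is one such reading of the definition, with a hypothetical function name; it compares characters exactly, so it treats 'A' and 'a' as different, and it uses while loops rather than for loops.

```python
def count_repeated_pairs(entry):
    """Count runs of exactly two identical consecutive characters."""
    pairs = 0
    i = 0
    while i < len(entry):
        j = i
        # Advance j to the end of the run of characters equal to entry[i].
        while j < len(entry) and entry[j] == entry[i]:
            j += 1
        if j - i == 2:          # runs of length 3 or more are not counted
            pairs += 1
        i = j                   # continue from the start of the next run
    return pairs

print(count_repeated_pairs('bookkeeper'))   # runs: oo, kk, ee
print(count_repeated_pairs('AAAAA'))        # one run of five, so no pairs
```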
The word abjectedness has occurrences of the letters a, b, c, d, e appearing in that order, although not necessarily consecutively (a, b, then c after "je", d after "te", and e after "n"). Including this word, how many entries in the data set have this property (using their original case)?
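The "in that order, but not necessarily consecutively" test can be sketched with the find method of the str class, which accepts a starting index for the search. The function name has_in_order is hypothetical, and this is only one of several reasonable approaches.

```python
def has_in_order(entry, letters):
    """True if the characters of `letters` occur in `entry` in that order."""
    position = 0
    for ch in letters:
        position = entry.find(ch, position)   # search from where we left off
        if position == -1:                    # this letter never appears later
            return False
        position += 1                         # next search starts after the match
    return True

print(has_in_order('abjectedness', 'abcde'))
print(has_in_order('decade', 'abcde'))
```

Counting the qualifying entries is then a matter of looping over the list and adding one whenever has_in_order(entry, 'abcde') is true.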
These are listed as super challenges because they require much more care to do properly, especially if only using techniques we've seen thus far.
We may often have both American and British forms of words, such as theater and theatre, with one ending in er and the other ending in re. While it is clear the er form will appear earlier on the list, they might not be immediate neighbors in the list. With that warning, and considering only the improper words on the list, determine how many such pairs exist which differ only because of an er vs. re ending. (Not that we can be sure those will all be American/British pairs.)
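Since the two forms need not be neighbors, one option is to build the re spelling from each er word and check whether it appears anywhere in the list. The sketch below uses a Python set for fast lookups, which goes beyond loops and conditionals; testing membership directly on the list would also work, only more slowly. It assumes "improper" means the first character is not uppercase, and the function name and sample list are hypothetical.

```python
def count_er_re_pairs(words):
    """Count improper words ending in 'er' whose 're' spelling is also listed."""
    known = set(words)                  # a set makes each 'in' test fast
    total = 0
    for entry in words:
        # Assumes no entry is empty (the loading command skips blank lines).
        if entry.endswith('er') and not entry[0].isupper():
            other = entry[:-2] + 're'   # e.g. 'theater' becomes 'theatre'
            if other in known:
                total += 1
    return total

# A made-up sample list, standing in for the real word list.
sample = ['acre', 'center', 'centre', 'Meter', 'Metre', 'theater', 'theatre', 'water']
print(count_er_re_pairs(sample))
```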
We will post our solutions after class: (click here)