CLOP Pools Package
1 Introduction
2 Football pools
  2.1 Football pool data file format
  2.2 Football picks format
  2.3 Football programs
    2.3.1 fall
    2.3.2 fsmooth
    2.3.3 fseek
    2.3.4 fcanon
    2.3.5 fstats
3 Tournament pools
  3.1 Tournament pool data file formats
    3.1.1 Names file format
    3.1.2 Head-to-head probability file format
    3.1.3 Winround probability file format
  3.2 Tournament picks format
  3.3 Tournament programs
    3.3.1 tseek
    3.3.2 tcanon
    3.3.3 trandom
    3.3.4 tstats
    3.3.5 tscore
    3.3.6 tsim
    3.3.7 dumph2h
    3.3.8 dumpwinround
    3.3.9 pcalc
    3.3.10 pix2tex
CLOP Pools Package
******************

This manual is for the CLOP Pools Package (version 1.4, 29 September
2016), which calculates optimal picks for sports betting pools.

   Copyright (C) 2006 Bryan Clair

   Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.

1 Introduction
**************

The CLOP Pools Package is a suite of command line utilities for working
with sports betting pools.  The package generally implements algorithms
from the article 'Optimal Strategies for Sports Betting Pools'
(Clair, Letscher 2005), but also includes the main algorithm from 'March
Madness and the Office Pool' (Kaplan, Garstka 2001).  The package and
supporting information are maintained at
<http://math.slu.edu/~clair/pools>.

   The model used for a betting pool requires three inputs:
   * Size of the pool (the number of participants).
   * Actual probabilities, used to model the actual outcomes of the
     games.
   * Perceived probabilities, used to model the behavior of the other
     participants.  These are known as "pool probabilities" in the
     Optimal Strategies paper.  Pool is a better name than perceived,
     but the code predates the name change.
   The size of the pool is given (when needed) via the command line
argument -n.  The actual and perceived probabilities are stored in ASCII
text files and are passed as command line arguments.

   Pick sets are coded as single lines of ASCII text, and are passed to
the pools utilities on standard input, and produced on standard output.

   Generally, the package is intended to play well with Unix utilities
such as cut, paste, and sort, and the programs are designed to fit
nicely into pipelines.  The programs 'fall', 'fseek', 'fsmooth',
'fcanon', 'tseek', and 'tcanon' all generate pick sets on stdout.  The
programs 'fstats' and 'tstats' calculate interesting statistics for pick
sets provided on stdin.  For example, the command:
     fcanon -qd nfl_data | fstats -n50 nfl_data
   calculates the expected return for a bet on all the underdogs in a 50
player football pool described in file nfl_data.

   This package uses the GNU Scientific Library
(<http://www.gnu.org/software/gsl/>) for numeric computations.  The
programs 'fall', 'fseek', and 'fsmooth' are all threaded for speed on
multiprocessor machines.

2 Football pools
****************

Football pools consist of g games which are assumed to be independent.
The number of games is limited by the number of bits in an unsigned int.
Running a 16 game pool on a machine with 16 bit ints has not been
tested, and could potentially cause problems.

2.1 Football pool data file format
==================================

A football pool data file contains all actual and perceived probabilities
needed to model the pool.  A line beginning with '#' is a comment and is
ignored, as are blank lines.  The file begins with a header record on a
single line which is followed by one game record line for each game in
the pool.

   The header record gives text names for each column of fields.

   Each game record has from 3 to 16 whitespace separated fields:
   * Field 1: Team 1 Name
   * Field 2: Team 2 Name
   * Remaining fields: Probability for team 1 winning or being picked.

   Here is a sample data file:
     # Data from week 3 of the 2005 NFL season

     HOME    AWAY    SAGARIN ESPN    YAHOO
     BUF     ATL     0.60    .459    .448
     CHI     CIN     0.48    .214    .228
     DEN     KC.     0.42    .278    .274
     GB.     TB.     0.41    .341    .334
     IND     CLE     0.87    .975    .970
     MIA     CAR     0.63    .206    .152
     MIN     NO.     0.51    .509    .533
     NYJ     JAX     0.64    .527    .480
     PHI     OAK     0.82    .940    .935
     PIT     NE.     0.61    .709    .619
     SD.     NYG     0.47    .635    .720
     SEA     ARZ     0.80    .877    .860
     SF.     DAL     0.67    .140    .125
     STL     TEN     0.54    .756    .763
   In this file, the columns are the probability of the Home team
winning as predicted by Sagarin ratings, the probability that a given
participant in ESPN's football pool chose the Home team, and the
probability that a given participant in Yahoo's football pool chose the
Home team.

   By default, the 3rd and 4th columns are used as the actual and
perceived probability of Team 1 winning.  To choose different columns,
use the -A and -P options to the football programs.  For example,
     fseek -n150 nfl05_week3_data -PYAHOO
   would search for the best picks in a 150 participant pool, using
SAGARIN data as actual probabilities and YAHOO data as perceived
probabilities.
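
   As an illustration of the file layout and of the default column
choice, here is a minimal Python sketch that parses pool data held in a
string.  It is not part of CLOP: the function name 'parse_pool_data' and
its keyword arguments are invented for this example, and the defaults
only mirror the 3rd/4th column rule described above.

```python
def parse_pool_data(text, actual=None, perceived=None):
    """Return a list of (team1, team2, a, p) tuples from pool data text.

    `actual` and `perceived` are column header names (hypothetical
    stand-ins for the -A and -P options).
    """
    rows = [ln.split() for ln in text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    header, games = rows[0], rows[1:]
    # Defaults follow the manual: 3rd column actual, 4th column perceived.
    a_col = header.index(actual) if actual else 2
    p_col = header.index(perceived) if perceived else 3
    return [(g[0], g[1], float(g[a_col]), float(g[p_col])) for g in games]


SAMPLE = """\
# Data from week 3 of the 2005 NFL season

HOME    AWAY    SAGARIN ESPN    YAHOO
BUF     ATL     0.60    .459    .448
CHI     CIN     0.48    .214    .228
"""

# SAGARIN as actual (default), YAHOO as perceived.
games = parse_pool_data(SAMPLE, perceived="YAHOO")
print(games[0])
```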

2.2 Football picks format
=========================

Picks are read and written as a whitespace separated list of winners.
Names must match exactly and be in the same order as the team names in
the associated pool data file.  For example:
     BUF CHI DEN TB. IND MIA MIN NYJ PHI NE. NYG SEA SF. TEN
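
   The matching rule can be sketched in Python as follows.  This is an
illustrative check only, not the parser CLOP uses; the helper name
'picks_to_bits' and the True/False encoding are assumptions made for the
example.

```python
def picks_to_bits(picks_line, games):
    """Map a picks line to booleans: True means team 1 was picked.

    `games` is a list of (team1, team2) pairs in data file order.
    """
    picks = picks_line.split()
    if len(picks) != len(games):
        raise ValueError("expected %d picks, got %d" % (len(games), len(picks)))
    bits = []
    for pick, (team1, team2) in zip(picks, games):
        if pick == team1:
            bits.append(True)
        elif pick == team2:
            bits.append(False)
        else:
            # Names must match exactly, in the same order as the data file.
            raise ValueError("%r is not playing in %s vs %s" % (pick, team1, team2))
    return bits


games = [("BUF", "ATL"), ("CHI", "CIN"), ("DEN", "KC."), ("GB.", "TB.")]
bits = picks_to_bits("BUF CHI DEN TB.", games)
print(bits)
```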

2.3 Football programs
=====================

2.3.1 fall
----------

Calculate expected return for all possible picks.  Writes 2^g lines of
output.  Each line shows the expected return for a set of picks, a tab
character, and then the picks in the above format.

   It is useful to pipe the output of 'fall' to 'sort -nr' to get a list
sorted in descending order of quality.

   Usage: 'fall [-tqnAP] datafile'

'-q'
     Quiet.  Suppress display of header.
'-t<threads>'
     Threads.  Specify number of computation threads (default 2).
'-n<competitors>'
     Number of competitors.
'-A<actuals>'
'-P<perceiveds>'
     Actual or perceived probabilities.  Specify column header from
     datafile.

2.3.2 fsmooth
-------------

Calculate expected return for all possible picks.  Writes 2^g lines of
output.  Each line shows the expected return for a set of picks, a tab
character, and then the picks in the above format.

   'fsmooth' operates exactly the same as 'fall', except that the normal
approximation is used to calculate the expected return for each set of
picks.  'fsmooth' is considerably faster than 'fall'.

   It is useful to pipe the output of 'fsmooth' to 'sort -nr' to get a
list sorted in descending order of quality.

   Usage: 'fsmooth [-tqnAP] datafile'

'-q'
     Quiet.  Suppress display of header.
'-t<threads>'
     Threads.  Specify number of computation threads (default 2).
'-n<competitors>'
     Number of competitors.
'-A<actuals>'
'-P<perceiveds>'
     Actual or perceived probabilities.  Specify column header from
     datafile.

2.3.3 fseek
-----------

Hill climbing search for a pickset which is a local maximum for expected
return.  The search begins with a good guess based on typical
results.  The search ends at a pick set which has larger expected return
than any other set which differs by at most two games.

   'fseek' is very effective at finding the best pickset quickly.
However, it might in theory become stuck at a local maximum which is not
the global maximum.
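
   The stopping rule can be illustrated with a generic hill climb over
0/1 pick vectors.  The Python sketch below is not the 'fseek' source: it
takes an arbitrary 'score' function (here a toy stand-in for expected
return, labeled as such) and it starts from a caller-supplied pick set
rather than the informed guess 'fseek' uses.

```python
from itertools import combinations


def hill_climb(start, score):
    """Climb from `start` (a tuple of 0/1 picks) to a local maximum.

    On return, no pick set differing from the result in one or two
    games has a strictly higher score.
    """
    best = tuple(start)
    best_val = score(best)
    n = len(best)
    flips = [(i,) for i in range(n)] + list(combinations(range(n), 2))
    improved = True
    while improved:
        improved = False
        for flip in flips:
            cand = list(best)
            for i in flip:
                cand[i] ^= 1          # switch the pick for game i
            cand = tuple(cand)
            val = score(cand)
            if val > best_val:
                best, best_val, improved = cand, val, True
                break                 # restart the scan from the new point
    return best, best_val


# Toy objective (an assumption, NOT expected return): count of games
# that disagree with a fixed target, negated so that higher is better.
target = (1, 0, 1, 1)
toy_score = lambda picks: -sum(p != t for p, t in zip(picks, target))

result = hill_climb((0, 0, 0, 0), toy_score)
print(result)
```

   With a real expected return function in place of 'toy_score', a
search of this kind can stop at a local maximum that is not global,
which is the caveat noted above.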

   Usage: 'fseek [-tqnAP] datafile'

'-q'
     Quiet.  Suppress display of header.
'-t<threads>'
     Threads.  Specify number of computation threads (default 2).
'-n<competitors>'
     Number of competitors.
'-A<actuals>'
'-P<perceiveds>'
     Actual or perceived probabilities.  Specify column header from
     datafile.

2.3.4 fcanon
------------

Calculate canonical picks for a football pool.  Canonical picks include
the favorites, the underdogs, and the edge picks (and will display in
that order if more than one are requested).  Favorites and underdogs use
the actual values, so to see perceived favorites/underdogs, pass the
perceived column to the -A option.  The edge picks are optimal for a
sufficiently large
number of competitors, and maximize the ratio A/P.

   Usage: 'fcanon [-qfdeAP] datafile'

'-q'
     Quiet.  Display only the picks.
'-A<actuals>'
'-P<perceiveds>'
     Actual or perceived probabilities.  Specify column header from
     datafile.  Note that '-P' is only useful in conjunction with '-e'.
'-f'
     Calculate actual favorites.
'-d'
     Calculate actual underdogs.
'-e'
     Calculate edge picks.
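
   A minimal Python sketch of the three canonical pick sets follows.  It
rests on two assumptions that the text above does not spell out: that
A/P means the actual probability of the whole pick set divided by its
perceived probability, and that games are independent so the ratio can
be maximized one game at a time.  Tie handling (a = 0.5, or a = p) is
also an arbitrary choice here.

```python
def canonical_picks(games):
    """Return (favorites, underdogs, edge) for (team1, team2, a, p) rows.

    a and p are the actual and perceived probabilities for team 1.
    """
    favorites, underdogs, edge = [], [], []
    for team1, team2, a, p in games:
        fav, dog = (team1, team2) if a >= 0.5 else (team2, team1)
        favorites.append(fav)
        underdogs.append(dog)
        # Picking team 1 contributes a/p to the ratio, team 2 contributes
        # (1-a)/(1-p).  For 0 < p < 1, a/p > (1-a)/(1-p) exactly when a > p.
        edge.append(team1 if a > p else team2)
    return favorites, underdogs, edge


# First three games of the sample file (SAGARIN actual, ESPN perceived).
games = [
    ("BUF", "ATL", 0.60, 0.459),
    ("CHI", "CIN", 0.48, 0.214),
    ("DEN", "KC.", 0.42, 0.278),
]
favorites, underdogs, edge = canonical_picks(games)
print(favorites, underdogs, edge)
```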

2.3.5 fstats
------------

Calculate statistics for picksets read on standard input.  The default
behavior is to print the expected return followed by the picks.

   Usage: 'fstats [-qnAPsdgvw] datafile'

'-q'
     Quiet.  Suppress display of header.
'-n<competitors>'
     Number of competitors.
'-A<actuals>'
'-P<perceiveds>'
     Actual or perceived probabilities.  Specify column header from
     datafile.
'-s'
     Smooth.  Use the normal approximation to calculate expected return.
'-d'
     Detailed.  Show detailed statistics.  Shows the expected return
     (exp), and expected return with smooth approximation (sexp).  For
     both the actual and perceived data, it shows the probability that
     these picks occur exactly (prob), the mean and variance of the
     number of games these picks will agree with (mean, var), and the
     number of underdogs picked (upsets).
'-g'
     Game-by-game.  Displays five columns of data for each game.  The
     first (Pick) is 1 or 0 depending on whether the actual favorite or
     actual underdog was chosen by the pickset.  The next columns give
     the actual and perceived probabilities for the chosen team to win.
     The final two (which take some time to compute) give numeric
     calculations of the partial derivative of expected return with
     respect to a change in the input variables a_i or p_i for that
     game.  These can be thought of as a measure of the sensitivity of
     expected return to the data for that particular game.  Keep in mind
     that probabilities range from 0 to 1, and that a change of .01 makes
     a much bigger difference to a probability of .98 than it does to a
     probability of .5.
'-v<actual spread>'
     Vary actuals.  This option is intended to test robustness of the
     expected return value.  This option calculates the expected return
     for the given set of picks 200 times while varying the actual
     probabilities used to model the pool.  For each calculation, each
     value a_i is chosen uniformly randomly from an interval centered at
     the original a_i with width 2*<actual spread>.  Statistics are
     calculated for the 200 values of expected return, and displayed.
'-w<perceived spread>'
     Vary perceiveds.  Same as '-v', but varies p_i.  Using both '-v'
     and '-w' will vary both at the same time.
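
   Because games are assumed independent, the quantities reported by
'-d' (prob, mean, var, upsets) have simple closed forms.  The Python
sketch below shows one way to compute them for a single probability
column; the function name and dictionary keys are chosen for the
example and the code is not taken from 'fstats'.

```python
from math import prod


def pick_stats(picks, probs):
    """picks[i] is True if team 1 is picked in game i;
    probs[i] is the probability that team 1 wins game i."""
    # q[i] = probability that the picked team wins game i
    q = [p if pick else 1.0 - p for pick, p in zip(picks, probs)]
    return {
        "prob": prod(q),                        # all picks correct
        "mean": sum(q),                         # expected number correct
        "var": sum(x * (1.0 - x) for x in q),   # sum of Bernoulli variances
        "upsets": sum(1 for x in q if x < 0.5), # underdogs picked
    }


# Two games: pick team 1 at 0.60, pick team 2 against a 0.48 team 1.
stats = pick_stats([True, False], [0.60, 0.48])
print(stats)
```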

3 Tournament pools
******************

A tournament pool involves picking all games of an R round single
elimination tournament with 2^R teams.  Currently, the maximum number of
allowable rounds is 14 (which is ridiculously large).

   With tournament pools, the scoring method is variable.  In this
version of CLOP, only two scoring methods are implemented: power-of-two
scoring and ESPN scoring.  In power-of-two scoring, correct picks are
worth 1,2,4,8,... in increasing rounds.  In ESPN scoring, correct picks
are worth 10,20,40,80,120, and 160 points in increasing rounds.  Any
tournament program that uses scoring will use power-of-two scoring by
default and accept the '-E' option to switch to ESPN scoring.
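
   The two scoring tables can be written down in a few lines of Python.
This is a sketch for illustration: the function names are invented, and
truncating the ESPN table for tournaments with fewer than six rounds is
an assumption, since the text above only lists the six round values.

```python
ESPN_POINTS = [10, 20, 40, 80, 120, 160]


def round_points(rounds, espn=False):
    """Points for a correct pick in each round, first round first."""
    if espn:
        if rounds > len(ESPN_POINTS):
            raise ValueError("ESPN values are listed for six rounds only")
        return ESPN_POINTS[:rounds]      # assumption for rounds < 6
    return [2 ** r for r in range(rounds)]


def max_score(rounds, espn=False):
    """Score of a perfect bracket: round r (0-based) has 2**(rounds-1-r) games."""
    pts = round_points(rounds, espn)
    return sum(p * 2 ** (rounds - 1 - r) for r, p in enumerate(pts))


print(round_points(6), max_score(6))
print(round_points(6, espn=True), max_score(6, espn=True))
```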

3.1 Tournament pool data file formats
=====================================

A tournament pool is described by three collections of data: team names,
actual probabilities, and perceived probabilities.  A collection of
probabilities can be given in one of two ways, as head-to-head data or
as winround data.

   Within data files, team order is important, because it determines
which teams play in which round (using the usual single elimination
bracket) and it must remain consistent for all files used in a given
pool.

   In the sections below, T is the number of teams in the tournament and
R is the number of rounds.

3.1.1 Names file format
-----------------------

A team names file begins with a header line containing the keyword
'names' followed by the number of teams (T) in the tournament, followed
by an optional comment to the end of the line.  Each subsequent line
contains a team name, which may optionally be in double quotes.  Quotes
are useful to include whitespace in the team name, which makes ASCII
picks output much nicer.

   Here is an example tournament with four teams.  The first round
matchups are Aardvarks-Bison and Chihuahuas-Ducks.
     names 4 Bryan's Imaginary Playoffs
     "Aardvarks "
     "Bison     "
     "Chihuahuas"
     "Ducks     "

3.1.2 Head-to-head probability file format
------------------------------------------

A head-to-head data file begins with a header line containing the
keyword 'h2h' followed by the number of teams T in the tournament,
followed by an optional comment to the end of the line.

   Data follows as T*T floating point numbers, in order:

        P(0 beats 0) P(0 beats 1) .. P(0 beats T-1)
              ...
        P(T-1 beats 0) ..            P(T-1 beats T-1)

   The data is redundant since P(i beats j) = 1 - P(j beats i).  Values
for P(x beats x) are required but ignored.

   Here is an example that goes with Bryan's Imaginary Playoffs:
     h2h 4   Close. Team 2 (Bison) have an edge.
     .5 .4 .4 .7
     .6 .5 .7 .6
     .6 .3 .5 .6
     .3 .4 .4 .5
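
   To show how head-to-head data determines round-by-round winning
probabilities, here is a Python sketch of the standard bracket
recursion, assuming adjacent teams meet in the first round and game
outcomes depend only on the h2h table.  It is an illustration, not
necessarily the method used by the CLOP programs; 'solo_from_h2h' is a
name made up for the example.

```python
def solo_from_h2h(h2h):
    """Return solo[i][r] = P(team i wins round r), with solo[i][0] = 1."""
    T = len(h2h)
    R = T.bit_length() - 1               # T is assumed to be a power of two
    solo = [[1.0] + [0.0] * R for _ in range(T)]
    for r in range(1, R + 1):
        size = 1 << (r - 1)              # teams per half-block entering round r
        for i in range(T):
            other = (i // size) ^ 1      # the opposing half-block
            opponents = range(other * size, (other + 1) * size)
            # Sum over possible opponents: they got here AND team i beats them.
            meet = sum(solo[j][r - 1] * h2h[i][j] for j in opponents)
            solo[i][r] = solo[i][r - 1] * meet
    return solo


H2H = [
    [.5, .4, .4, .7],
    [.6, .5, .7, .6],
    [.6, .3, .5, .6],
    [.3, .4, .4, .5],
]
solo = solo_from_h2h(H2H)
for row in solo:
    print(" ".join("%.3f" % x for x in row))
```

   For the example data, the championship probabilities in the last
column sum to 1, as they must.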

3.1.3 Winround probability file format
--------------------------------------

A winround data file begins with a header line containing the keyword
'winround' followed by the number of teams T in the tournament, followed
by an optional comment to the end of the line.

   Data in the file comes in two series, the solo and pair series.  The
solo series begins with the keyword 'solo' followed by the probabilities
of team i winning round r for all i,r.  The pair series begins with the
keyword 'pair' followed by the probabilities of team i winning round r
and team j winning round s for all i,j,r,s.

   The pair series is optional.  If it is omitted, the data no longer
contains enough information for the theoretical model of the pool.  In
that case, CLOP will estimate the pair data and print a warning message
to stderr.  See the 'Optimal Strategies' paper for details.

   The solo series is size (T * (R+1)), in order:

   P(0->0) P(0->1) ... P(0->R) P(1->0) ... P((T-1)->R)

   The pair series is size (T * T * (R+1) * (R+1)), in order:

   P(0->0 & 0->0) P(0->0 & 0->1) .. P(0->0 & 0->R)
   P(0->1 & 0->0) ..                P(0->1 & 0->R)
    ...
   P(0->R & 0->0) ..                P(0->R & 0->R)
   P(0->0 & 1->0) P(0->0 & 1->1) .. P(0->0 & 1->R)
    ...
   P(0->R & 1->0) ..                P(0->R & 1->R)
    ...
   P(0->R & (T-1)->0) ..        P(0->R & (T-1)->R)
   P(1->0 & 0->0) ...
                            P((T-1)->R & (T-1)->R)

   Here is an example that goes with Bryan's Imaginary Playoffs:
     winround 4   Team 2 (Bison) very strong.
     solo
       1.000 0.300 0.180
       1.000 0.700 0.525
       1.000 0.500 0.125
       1.000 0.500 0.170
     pair
       1.000 0.300 0.180
       0.300 0.300 0.180
       0.180 0.180 0.180

       1.000 0.700 0.525
       0.300 0.000 0.000
       0.180 0.000 0.000

       (14x9 more floats)...
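
   Reading the orderings above as flat arrays, the position of a given
entry can be computed as in this Python sketch.  The index formulas are
inferred from the listed order (team i slowest, then team j, then round
r, with round s fastest) and checked against the first 18 pair values
of the example; the function names are illustrative only.

```python
def solo_index(i, r, R):
    """Offset of P(i->r) in the solo series."""
    return i * (R + 1) + r


def pair_index(i, r, j, s, T, R):
    """Offset of P(i->r & j->s) in the pair series."""
    return ((i * T + j) * (R + 1) + r) * (R + 1) + s


T, R = 4, 2
# The pair values printed in the example (blocks i=0,j=0 and i=0,j=1).
PAIR_HEAD = [
    1.000, 0.300, 0.180,
    0.300, 0.300, 0.180,
    0.180, 0.180, 0.180,
    1.000, 0.700, 0.525,
    0.300, 0.000, 0.000,
    0.180, 0.000, 0.000,
]

# P(0->1 & 1->0): team 0 wins round 1, team 1 trivially "wins" round 0.
print(PAIR_HEAD[pair_index(0, 1, 1, 0, T, R)])
# P(0->1 & 1->1): impossible, since teams 0 and 1 meet in round 1.
print(PAIR_HEAD[pair_index(0, 1, 1, 1, T, R)])
```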

3.2 Tournament picks format
===========================

A set of picks for a tournament is stored in "depth format" as a list of
integers in the range [1...R+1], one for each team.  The number for each
team indicates which round that team will reach.

   In Bryan's Imaginary Playoffs, here is a bracket in which the Bison
beat the Chihuahuas in the finals:
     1 3 2 1
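
   Depth format can be unpacked into a list of winners per round with a
short Python sketch like the one below.  It assumes the usual bracket
layout described earlier (adjacent teams meet first) and treats a team
with depth d as having won the games of rounds 1 through d-1; the
function name is made up for the example.

```python
def winners_by_round(depths, names):
    """Return one list of winning team names per round, first round first."""
    T = len(depths)
    R = T.bit_length() - 1
    rounds = []
    for r in range(1, R + 1):
        size = 1 << r                    # teams feeding each round-r game
        winners = []
        for start in range(0, T, size):
            alive = [i for i in range(start, start + size) if depths[i] > r]
            if len(alive) != 1:          # every game needs exactly one winner
                raise ValueError("invalid bracket in round %d" % r)
            winners.append(names[alive[0]])
        rounds.append(winners)
    return rounds


names = ["Aardvarks", "Bison", "Chihuahuas", "Ducks"]
bracket = winners_by_round([1, 3, 2, 1], names)
print(bracket)
```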

   The tstats program can display brackets in a human readable ASCII
format.  The pix2tex utility can create a TeX file that displays the
bracket graphically.

3.3 Tournament programs
=======================

3.3.1 tseek
-----------

Performs a hill-climbing search for picks that maximize expected return.
Each trial chooses a random starting pick (uniformly distributed over
the set of all possible brackets) and hill climbs to a local maximum.
The process is repeated for the specified number of trials.  Picks that
improve on previous results are displayed when found.

   Usage: 'tseek [-nEqvts] namesfile actualfile perceivedfile'

'-n<competitors>'
     Number of competitors.
'-E'
     Use ESPN scoring.
'-q'
     Quiet.  Display only one set of picks (the best found) when all
     trials are finished.
'-v'
     Verbose.  Display all intermediate picks for each trial.  Using
     'tseek -v -t1 ...' is a good way to get a feel for the hill
     climbing process.
'-t<trials>'
     Trials.  Specify number of trials (default is to run trials
     forever).
'-s<seed>'
     Seed.  Specify (long integer) seed for random number generator
     (default seeds with the current time).

3.3.2 tcanon
------------

Display canonical statistics and picks for a tournament pool.  The
statistics (shown unless '-q' is used) describe opponent scoring.  The
six sets of picks are:
   * Picks giving the maximum expected score.
   * The actual favorites.
   * The perceived favorites.
   * The result most likely to actually occur.
   * The picks most likely to be made by an opponent.
   * The picks that optimize expected return in the limit as the number
     of competitors approaches infinity.

   Usage: 'tcanon [-Eq] namesfile actualfile perceivedfile'

'-E'
     Use ESPN scoring.
'-q'
     Quiet.  Suppress headers.

3.3.3 trandom
-------------

Generate random picks.  Each game is 50-50 unless the optional datafile
is given to specify the probabilities.

   Usage: 'trandom [-nR] [datafile]'

'-n<count>'
     Generate <count> sets of picks.  Default is 1.
'-R<rounds>'
     Specify number of rounds in the tournament.  If datafile is given,
     uses the rounds for that datafile.  If unspecified, defaults to 6.
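
   The 50-50 case can be sketched in Python as follows.  This is an
illustration of producing a random bracket in depth format, not the
'trandom' source; the function name and the optional 'rng' argument are
assumptions of the example.

```python
import random


def random_bracket(rounds, rng=random):
    """Return depth-format picks where every game is a fair coin flip."""
    alive = list(range(1 << rounds))     # team indices still in the bracket
    depths = [1] * len(alive)            # every team reaches round 1
    for _ in range(rounds):
        survivors = []
        for k in range(0, len(alive), 2):
            winner = rng.choice((alive[k], alive[k + 1]))
            depths[winner] += 1          # winner reaches the next round
            survivors.append(winner)
        alive = survivors
    return depths


picks = random_bracket(6, random.Random(1))
print(" ".join(str(d) for d in picks))
```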

3.3.4 tstats
------------

Calculate statistics for picksets read on standard input.  After reading
input, tstats produces a header with the comments from the input files
and statistics describing opponent scores.  Then, for each set of picks
on stdin, tstats displays the picks in a human readable ASCII form and
displays statistics for the picks.  The statistics are:
'expected return'
     The expected return on a bet of 1 on these picks.
'actual probability'
     The probability these picks actually occur.
'actual mean score, actual score standard deviation'
     The mean score and SD for these picks.
'perceived probability'
     The probability one opponent will make these picks exactly.
'perceived mean score, perceived score standard deviation'
     The mean score and SD for these picks if the tournament games were
     played using the perceived probabilities.
'correlation with opponents'
     The correlation (in [-1,1]) between the score of these picks and
     the score of one opponent.

   Usage: 'tstats [-nEqsetP] namesfile actualfile perceivedfile'

'-n<competitors>'
     Number of competitors.
'-E'
     Use ESPN scoring.
'-q'
     Quiet.  Suppress headers.
'-s'
     Don't show stats.
'-e'
     Don't show expected return.
'-t'
     Don't show teams.
'-P'
     Show perceived probability only.  (This was useful, once.)

3.3.5 tscore
------------

Quick and dirty program to calculate the scores of picks on stdin, given
a set of picks as the input file 'outcome'.

   Usage: 'tscore [-E] [-r rounds] outcome'

'-E'
     Use ESPN scoring.
'-r rounds'
     Specify number of rounds.  Default is 6.
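
   Scoring one set of picks against an outcome, both in depth format,
can be sketched in Python as below.  The idea is that a pick for a
round is correct exactly when the team was picked to win that round
and actually did.  The function is illustrative, not the 'tscore'
source, and the 'points' list is supplied by the caller.

```python
def score_picks(picks, outcome, points):
    """Score depth-format `picks` against depth-format `outcome`.

    points[r] is the value of a correct pick in round r+1.
    """
    total = 0
    for pick_depth, real_depth in zip(picks, outcome):
        # A team picked to depth d was picked to win d-1 games; it earns
        # credit only for the games it was picked to win AND actually won.
        credited = min(pick_depth, real_depth) - 1
        total += sum(points[:credited])
    return total


POWER_OF_TWO = [1, 2]                    # two rounds, power-of-two scoring
perfect = score_picks([1, 3, 2, 1], [1, 3, 2, 1], POWER_OF_TWO)
partial = score_picks([1, 3, 2, 1], [1, 3, 1, 2], POWER_OF_TWO)
print(perfect, partial)
```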

3.3.6 tsim
----------

Simulate tournaments.  Computes results for each set of picks Y read on
standard input.  Each trial chooses n competitor picks, either randomly
using perceived probabilities or by selecting from the opponentpicks file
if provided.  Each trial chooses winners using actual probabilities and
calculates the score and winnings for picks Y.  After all trials are
finished, the summary results for picks Y are displayed.

   Usage: 'tsim [-nEqst] namesfile actualfile perceivedfile
[opponentpicks]'

'-n<competitors>'
     Number of competitors.
'-E'
     Use ESPN scoring.
'-q'
     Quiet.  Suppress headers.
'-s<seed>'
     Seed.  Seed random number generator with seed.  Default is to use
     current time.
'-t<trials>'
     Number of tournaments to simulate.  Default is 10000.

3.3.7 dumph2h
-------------

Utility program to read in a probability file and dump a correctly
formatted probability file in h2h format.  Use for converting winround
to h2h.

   Usage: 'dumph2h probfile'

3.3.8 dumpwinround
------------------

Utility program to read in a probability file and dump a correctly
formatted probability file in winround format.  Useful for converting
h2h to winround (because the solo information is interesting for
computer ranking generated h2h files).

   Usage: 'dumpwinround [-p] probfile'

'-p'
     Only dump solo data.

3.3.9 pcalc
-----------

Calculate a table of winround data from a list of picks.  Given a
series of picks on either stdin or in 'picksfile', computes solo and
pair data by counting occurrences of teams reaching rounds.  Dumps
results to stdout as a winround format file.  This is how you get
perceived probabilities if you have a large collection of opponent
picksets.

   Usage: 'pcalc [-r<rounds>] [picksfile]'

'-r<rounds>'
     Specify number of rounds in tournament.  Default is 6.
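
   The counting idea can be sketched in Python for the solo series
alone.  The sketch assumes that a team with depth d counts as having
won rounds 0 through d-1 (so that column 0 is always 1, as in the
winround example earlier); this reading is an inference from the file
format, and the function is not the 'pcalc' source.

```python
def solo_from_picks(picksets, rounds):
    """Estimate solo[i][r] as the fraction of picksets in which
    team i wins round r.  Each pickset is a depth-format list."""
    T = 1 << rounds
    counts = [[0] * (rounds + 1) for _ in range(T)]
    for depths in picksets:
        for i, d in enumerate(depths):
            for r in range(d):           # depth d: rounds 0 .. d-1 were won
                counts[i][r] += 1
    n = len(picksets)
    return [[c / n for c in row] for row in counts]


picksets = [
    [1, 3, 2, 1],    # Bison beat the Chihuahuas in the final
    [3, 1, 1, 2],    # Aardvarks beat the Ducks in the final
]
solo = solo_from_picks(picksets, rounds=2)
for row in solo:
    print(" ".join("%.3f" % x for x in row))
```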

3.3.10 pix2tex
--------------

From a set of picks and a tournament names file, 'pix2tex' generates
LaTeX output to draw a filled in bracket.

   Width and height are specified as floating point numbers and are used
to position the elements of the bracket.  LaTeX will interpret these as
points, by default, although you could change '\unitlength' in your
document to adjust this.

   Usage: 'pix2tex [-h<height>] [-w<width>] namesfile'

