--------------------------------------------------------------------------------

                               cogteams

                  Michael H. Goldwasser  and  Xin He
                (goldwamh@slu.edu, xhe2@email.unc.edu)

                             Version 1.0
                               May 2004


Copyright (C) 2004  Michael H. Goldwasser and Xin He

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


---------------------------TABLE OF CONTENTS------------------------------------

1.      Introduction
2.      Overview of Distribution
3.      cogteams software
  3.1     compilation
  3.2     usage
4.      File formats
5.      Utility scripts
6.      An Example
7.      Contact
8.      Reference

--------------------------------------------------------------------------------

1.      Introduction

COGTeams is a bioinformatics program used to identify conserved gene
clusters.  The paper [1] details the concept, model and algorithm of
COGTeams; we give a brief overview here.  It is believed that genes
which remain close to each other on the chromosome during evolution
tend to encode proteins with related functions.  Such groups of
neighboring genes are called conserved gene clusters.  Though it is
not clear how, or to what extent, position and function are
correlated, it is helpful to identify these conserved gene clusters
across genomes.  Doing so may allow further study of genome
organization, and may help to identify the function of unknown genes.

COGTeams is software that serves this purpose.  In general, the
homologs of a gene in two species, including both orthologs and
paralogs, form a Cluster of Orthologous Groups (COG) [2].  A COG team
is essentially a set of COGs that are neighboring in the chromosomes
of different species.  Presenting the concept of conserved gene
clusters as a well-defined combinatorial construct makes rigorous
algorithmic design and analysis possible.  This software performs a
pairwise chromosomal comparison to identify such teams.  Though these
techniques might be generalized to three or more chromosomes, it is
an open question whether this could be done without significant
degradation of the performance bounds.

--------------------------------------------------------------------------------

2.      Overview of Distribution


The top-level directory of the distribution contains the following:

README.txt    this file
gpl.txt       GNU General Public License
src/          The source code for the cogteams software
scripts/      Additional utility scripts to aid in preparing and
              interpreting data in the file formats used by the
              cogteams software
example/      Example of the data and results involving the comparison
              of two bacterial genomes


Further details are provided in the following sections of this
documentation.


--------------------------------------------------------------------------------

3.      cogteams software

3.1  compilation
----------------

The source code is written in the C programming language.  The source
code, together with a makefile, is contained in the "src/" subdirectory
of this distribution.  In most cases, the software can be compiled
by simply typing 'make', though there may be some system dependencies
in the compilation process.  The compilation process should result
in an executable named 'cogteams'.


3.2  usage
----------------

The cogteams software is run using the following command line:

   cogteams [options] inputfile

The specified input file defines the two chromosomes to be analyzed,
using a file format described in section 4 of this document.

Available options are as follows:

   -d val   This is used to specify the 'delta' value, which is the
            maximum allowable separation between two genes of a
            chromosome that are to be considered neighboring, in the
            context of a COG team.    The units are relative to the
            given positions, and the default value is 1000.

   -W file  By default, all output generated by the program is
            delivered to stdout.  However, this option can be used to
            send output to a given file.

   -F rule  use filtering rule for selecting relevant teams
   -P param additional parameter for chosen filtering rule

            By default, all teams of cardinality two or greater will
            be reported by the software.  In fact, some teams will be
            reported more than once because they are witnessed by
            multiple sections of the chromosomes.   However, it may
            often be convenient to automatically filter the results
            using some criterion.  For this reason, the software
            supports several common filtering rules, some of which
            are further controlled through an additional parameter.
            The current set of filtering rules is:

            rule   parameter    description
            ------ ----------   ---------------------------
            all                 all teams  (DEFAULT)
            size   <val>        teams with at least given cardinality
            cog    <COG name>   teams which contain the specified COG
            max                 for each COG, the containing team with
                                max cardinality
            maxcog <COG name>   for given COG, the containing team
                                with max cardinality

   -O style In addition to controlling which teams are reported, via
            the above filtering rules, the user can also control the
            style of output used for each reported team.  The current
            set of styles is:

            The default style (cogs) will report the list of COGs
            which make up the team, though not the positions of the
            actual genes which form witnesses to the team in the given
            chromosomes.

            The (witness) style reports the list of COGs together with
            such a witness in each chromosome.

            Finally, the (concise) style is used in conjunction with
            the -T option, discussed next.

   -T file  By default, the original chromosomal data is read from the
            input file and then teams are identified through the
            primary algorithm discussed in reference [1].  However,
            because of the variety of options in filtering rules and
            in output styles, the software allows for the set of
            filtered teams to be reported in an intermediate, concise
            format.    This format is generated via the "-O concise" 
            option above.

            Such a concise summary of previously computed teams can
            then be re-analyzed using this option.  Please note that
            the concise summary file can only be used in
            conjunction with the original inputfile specifying the
            chromosomal data.  Furthermore, the delta value cannot be
            adjusted in this way, nor can any previously filtered
            teams be revisited.  What can be done is to vary the
            output style, as well as to apply additional filtering
            rules to the intermediate results.
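
The filtering rules listed under -F can be read as simple predicates
over the set of reported teams.  The following Python sketch
illustrates that reading only; it is not the C implementation.  The
function name filter_teams, the representation of a team as a set of
COG names, and the handling of ties for "max" and "maxcog" are all
assumptions made for illustration.

```python
def filter_teams(teams, rule="all", param=None):
    """Illustrative reading of the -F/-P rules (not the cogteams source).

    teams: list of frozensets of COG names (hypothetical representation).
    """
    teams = list(teams)
    if rule == "all":
        return teams
    if rule == "size":
        # teams with at least the given cardinality
        return [t for t in teams if len(t) >= int(param)]
    if rule == "cog":
        # teams which contain the specified COG
        return [t for t in teams if param in t]
    if rule == "maxcog":
        # for the given COG, one containing team of maximum cardinality
        containing = [t for t in teams if param in t]
        return [max(containing, key=len)] if containing else []
    if rule == "max":
        # for each COG, one containing team of maximum cardinality
        chosen = []
        for cog in sorted(set().union(*teams)):
            best = max((t for t in teams if cog in t), key=len)
            if best not in chosen:
                chosen.append(best)
        return chosen
    raise ValueError(f"unknown rule: {rule}")
```

With this reading, "-F size -P 3" corresponds to
filter_teams(teams, "size", 3), and "-F maxcog -P COG0140" to
filter_teams(teams, "maxcog", "COG0140").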


--------------------------------------------------------------------------------

4.      File formats

The original sequence files are downloaded from a sequence database
such as GenBank.  These files must be processed to meet the format
required by cogteams.  In addition, the output of cogteams can be
hard to read, as COGs are represented by their ID numbers rather than
by names or descriptions, so the output file usually needs further
processing as well.  This section is primarily intended for those who
need or want to write their own routines for data format processing.

Input file (chromosomes)
------------------------

The first line of a cogteams input file may look like this:
        COG_ID  CHROMOSOME1     CHROMOSOME2
Note that the first word must be "COG_ID" to ensure proper
formatting.  The names of the two chromosomes can be customized.

Each subsequent line starts with the name/ID of a COG, followed by
the positions of this COG on the first and second chromosome,
respectively.  For example:

        COG0589 640662:1395696:1433209:1977777:3637741:4110873  4030586

These three fields are whitespace-separated.  Furthermore, if multiple
genes from the COG occur in the first (or second) chromosome, the
positions of those genes are colon-separated.
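
For those writing their own routines, the input format described
above can be read with a few lines of code.  The following Python
sketch is one possible reader, not part of the distribution.  The
function name parse_cog_file is hypothetical, and the sketch assumes
that every data line has exactly three fields and that positions can
be parsed as floating-point numbers.

```python
def parse_cog_file(lines):
    """Parse cogteams input lines (illustrative sketch).

    Returns ((name1, name2), {COG: (positions1, positions2)}).
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    header, body = rows[0], rows[1:]
    if len(header) != 3 or header[0] != "COG_ID":
        raise ValueError("first line must be: COG_ID <chr1> <chr2>")
    table = {}
    for fields in body:
        if len(fields) != 3:
            raise ValueError(f"expected three fields, got: {fields}")
        cog, first, second = fields
        # positions within one chromosome are colon-separated
        table[cog] = (
            [float(p) for p in first.split(":")],
            [float(p) for p in second.split(":")],
        )
    return (header[1], header[2]), table


sample = [
    "COG_ID  CHROMOSOME1     CHROMOSOME2",
    "COG0589 640662:1395696:1433209:1977777:3637741:4110873  4030586",
]
names, table = parse_cog_file(sample)
```

Here table["COG0589"] holds six positions for the first chromosome
and one position for the second.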


Output styles (teams)
---------------------

As discussed in section 3.2, there are several styles for reporting a
known team.  The formats are largely self-explanatory, so we discuss
them only briefly.  For the default style (cogs), a typical team is
reported as a set of COGs, separated by colons:

COG0118:COG0106:COG0107:COG0140

If using the (witness) style, the team would be reported as:

COG0118:COG0106:COG0107:COG0140
Witness in the 1st chromosome:
   (COG0118:2092557.00),(COG0106:2093144.00),(COG0107:2093866.00),(COG0140:2094636.00)
Witness in the 2nd chromosome:
   (COG0140:3581996.00),(COG0107:3582622.00),(COG0106:3583377.00),(COG0118:3584111.00)
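
The witness lines above can be checked against the delta value with a
short script.  The following Python sketch parses a witness line and
computes the largest gap between consecutive genes.  It rests on the
assumption that the delta bound applies to consecutive genes of a
witness when sorted by position; the function names parse_witness and
max_gap are hypothetical and not part of the distribution.

```python
import re

# one "(COG:position)" pair, as printed by the witness style
PAIR = re.compile(r"\(([^:()]+):([^()]+)\)")


def parse_witness(line):
    """Return a list of (COG, position) pairs from one witness line."""
    return [(cog, float(pos)) for cog, pos in PAIR.findall(line)]


def max_gap(witness):
    """Largest distance between position-adjacent genes of a witness."""
    positions = sorted(pos for _, pos in witness)
    return max((b - a for a, b in zip(positions, positions[1:])), default=0.0)


first = ("(COG0118:2092557.00),(COG0106:2093144.00),"
         "(COG0107:2093866.00),(COG0140:2094636.00)")
second = ("(COG0140:3581996.00),(COG0107:3582622.00),"
          "(COG0106:3583377.00),(COG0118:3584111.00)")
gaps = (max_gap(parse_witness(first)), max_gap(parse_witness(second)))
```

For the team shown above, the largest gaps are 770 and 755 position
units in the two chromosomes, both within the default delta of 1000.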


For those interested, the (concise) style for this same team may
appear as:
     1066      1069         0      1965
These four numbers are simply the start/end indices of the witnesses
in the two chromosomes, expressed not in the original positional
units, but rather by the rank of the gene in terms of position (i.e.,
the index of the gene in the chromosome sorted by position).
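
To make the notion of rank concrete, the following Python sketch
sorts the genes of one chromosome by position and recovers the genes
covered by a start/end pair of ranks.  It is an illustration only.
The function names are hypothetical, and the sketch assumes that
ranks are counted from zero and that the end index is inclusive;
neither detail is specified in this document.

```python
def genes_by_rank(cog_positions):
    """Return (position, COG) pairs for one chromosome, sorted by position.

    cog_positions: {COG: [positions]} for a single chromosome.
    """
    return sorted(
        (pos, cog)
        for cog, positions in cog_positions.items()
        for pos in positions
    )


def genes_in_rank_range(ranked, start, end):
    """Genes whose rank lies in [start, end] (assumed 0-based, inclusive)."""
    return ranked[start:end + 1]


# toy chromosome built from positions quoted elsewhere in this document
chromosome = {
    "COG0589": [640662.0],
    "COG0118": [2092557.0],
    "COG0106": [2093144.0],
    "COG0107": [2093866.0],
    "COG0140": [2094636.0],
}
ranked = genes_by_rank(chromosome)
span = genes_in_rank_range(ranked, 1, 4)
```

In this toy chromosome, ranks 1 through 4 cover the genes of COG0118,
COG0106, COG0107 and COG0140, in that order.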

--------------------------------------------------------------------------------

5.      Utility scripts

The "scripts/" subdirectory contains several utility scripts that might
be used in converting data to the expected inputfile format for
cogteams, or in taking the standard output formats and converting
those to alternate views.

Please note that the C program cogteams is the core of this software;
the perl and shell scripts used for data processing are included only
for the convenience of users.  These scripts might not run properly
on every user's machine.  In particular, users should be prepared to
adjust the paths of the perl executable and the shell (bash is used in
the included script).  Windows users will probably need to develop
their own routines for preparing and interpreting the data in the
formats specified in section 4 of this document.


scripts/gene2COG.pl: the perl script that prepares the data in a format that can be read by cogteams
scripts/COG2desc.sh: the shell script that processes the output of cogteams
scripts/COG2desc.pl: the perl script that is invoked by COG2desc.sh
scripts/rm_dup.pl:   the perl script to remove duplicates from the list of teams
                     (duplicates will appear when a given team is witnessed
                      in more than one place in the chromosomes)
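
For users who cannot run the perl scripts, duplicate removal is easy
to reproduce.  The following Python sketch shows one way to do it for
output in the default (cogs) style.  It is not a translation of
rm_dup.pl: the function name is hypothetical, and the sketch assumes
one team per line and treats two lines as the same team whenever they
list the same COGs, regardless of order.

```python
def remove_duplicate_teams(lines):
    """Keep the first occurrence of each team (illustrative sketch).

    lines: team lines in the cogs style, e.g. "COG0118:COG0106".
    """
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key = frozenset(line.split(":"))  # order of COGs is ignored
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique
```

The order of the remaining teams is the order in which they first
appear in the input.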


--------------------------------------------------------------------------------

6.      An Example

The "example/" subdirectory includes various files related to a sample
experiment, namely the pairwise comparison between E. coli and
B. subtilis.

Input Files:
example/COG.lst: list of the textual descriptions for COGs from NCBI GenBank
example/EC.ptt: the sequence file for E. coli, downloaded from NCBI GenBank
example/BS.ptt: the sequence file for B. subtilis, downloaded from NCBI GenBank
example/EC-BS.cog: the input file of cogteams

Output Files:
example/EC-BS-1000.team: the output of cogteams, which is a list of COG teams (-d 1000 -F maxcog)
example/EC-BS-1000.desc: easier to interpret version of EC-BS-1000.team, processed with COG2desc.sh script

example/EC-BS-2000.team: the output of cogteams, which is a list of COG teams (-d 2000 -F maxcog)
example/EC-BS-2000.desc: easier to interpret version of EC-BS-2000.team, processed with COG2desc.sh script

example/EC-BS-3000.team: the output of cogteams, which is a list of COG teams (-d 3000 -F maxcog)
example/EC-BS-3000.desc: easier to interpret version of EC-BS-3000.team, processed with COG2desc.sh script


The overall procedure for using this software to discover conserved
gene clusters is illustrated below.  The procedure assumes a standard
UNIX/Linux environment and uses the included cogteams software and
scripts.

-       download the sequence files of two genomes of interest from NCBI GenBank. 
                ftp://ftp.ncbi.nih.gov/genbank/genomes/
        These should be in a .ptt format.

-       create the input file for the program "cogteams" with this command:
                perl gene2COG.pl <chr1.ptt> <chr2.ptt> > <chr1-chr2.cog>
        where chr1 and chr2 refer to the two chromosomes and
        "chr1-chr2.cog" is the output file (the users can choose
         whatever name they want). Example: 
                perl gene2COG.pl EC.ptt BS.ptt > EC-BS.cog 

-       run the program to find COG teams with a command of the form:
                cogteams [options] EC-BS.cog -W EC-BS.team

-       replace the COG ids in the output file with the associated
        textual description to facilitate the interpretation of the
        result with:  
                COG2desc.sh <chr1-chr2.team> COG.lst > <chr1-chr2.desc>
        where "chr1-chr2.desc" is the name of the output file. Example: 
                COG2desc.sh EC-BS.team COG.lst > EC-BS.desc
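
The steps above can also be driven from a single script.  The
following Python sketch builds the three commands and runs them in
order.  It is a convenience wrapper only, under several assumptions:
the function names are hypothetical, the executable cogteams and the
scripts gene2COG.pl and COG2desc.sh are assumed to be reachable from
the working directory or the search path, and the options are placed
before the input file as in the synopsis of section 3.2.

```python
import subprocess


def pipeline_commands(chr1_ptt, chr2_ptt, cog_list, stem, delta=1000):
    """Return (argv, stdout_file) pairs for the three steps above."""
    cog_file = f"{stem}.cog"
    team_file = f"{stem}.team"
    desc_file = f"{stem}.desc"
    return [
        # step 1: build the cogteams input file from two .ptt files
        (["perl", "gene2COG.pl", chr1_ptt, chr2_ptt], cog_file),
        # step 2: find COG teams; -W sends the output to team_file
        (["cogteams", "-d", str(delta), "-W", team_file, cog_file], None),
        # step 3: replace COG ids with textual descriptions
        (["COG2desc.sh", team_file, cog_list], desc_file),
    ]


def run_pipeline(commands):
    """Run each command, redirecting stdout to a file where requested."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

For the example data, the commands would be obtained with
pipeline_commands("EC.ptt", "BS.ptt", "COG.lst", "EC-BS") and then
passed to run_pipeline.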

--------------------------------------------------------------------------------

7.      Contact

To report problems or bugs in this software, please contact:

Xin He
Department of Biology
University of North Carolina, Chapel Hill
Email: xhe2@email.unc.edu

Or:
Michael Goldwasser
Department of Mathematics and Computer Science
Saint Louis University
Email: goldwamh@slu.edu

--------------------------------------------------------------------------------

8.      Reference

[1] Xin He, Michael Goldwasser, Identifying conserved gene clusters in
the presence of orthologous groups, Eighth Annual International
Conference on Research in Computational Molecular Biology (RECOMB),
San Diego, USA, March 2004, pp. 272-280.

[2] National Center for Biotechnology Information. Phylogenetic
classification of proteins encoded in complete genomes, 2003.
http://www.ncbi.nih.gov/COG.