--------------------------------------------------------------------------------
cogteams

Michael H. Goldwasser and Xin He
(goldwamh@slu.edu, xhe2@email.unc.edu)

Version 1.0
May 2004

Copyright (C) 2004 Michael H. Goldwasser and Xin He

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

---------------------------TABLE OF CONTENTS------------------------------------

1. Introduction
2. Overview of Distribution
3. cogteams software
   3.1 compilation
   3.2 usage
4. File formats
5. Utility scripts
6. An Example
7. Contact
8. Reference

--------------------------------------------------------------------------------
1. Introduction

COGTeams is a bioinformatics program used to identify conserved gene
clusters. The paper [1] details the concept, model and algorithm of
COGTeams; we give a brief overview here.

Genes that remain close to each other on the chromosome during evolution
are believed to have a tendency to encode proteins with related functions.
Such groups of neighboring genes are called conserved gene clusters. Though
it is not clear how, and to what extent, position and function are
correlated, it is helpful to identify these conserved gene clusters across
genomes. Doing so may allow further study of genome organization and may
help to identify the function of unknown genes. COGTeams is software that
serves this purpose.

In general, the homologs of a gene in two species, including both orthologs
and paralogs, form a Cluster of Orthologous Groups (COG) [2]. A COG team is
essentially a set of COGs that are neighboring in the chromosomes of
different species. Casting the concept of conserved gene clusters as a
well-defined combinatorial construct makes rigorous algorithmic design and
analysis possible. This software performs a pairwise chromosomal comparison
to identify such teams. Though these techniques might be generalized to
three or more chromosomes, it is an open question whether this could be done
without significant degradation of the performance bounds.

--------------------------------------------------------------------------------
2. Overview of Distribution

The top-level directory of the distribution contains the following:

README.txt   this file
gpl.txt      GNU General Public License
src/         The source code for the cogteams software
scripts/     Additional utility scripts to aid in preparing and
             interpreting data in the file formats used by the cogteams
             software
example/     Example of the data and results involving the comparison
             of two bacterial genomes

Further details are provided in the following sections of this
documentation.

--------------------------------------------------------------------------------
3. cogteams software

3.1 compilation
----------------
The source code is written in the C programming language. The source code,
together with a makefile, is contained in the "src/" subdirectory of this
distribution. In most cases, the software can be compiled by simply typing
'make', though there may be some system dependencies in the compilation
process. Compilation should produce an executable named 'cogteams'.

3.2 usage
----------------
The cogteams software is run using the following command line:

    cogteams [options] inputfile

The specified input file defines the two chromosomes to be analyzed, using
the file format described in Section 4 of this document.

Available options are as follows:

-d val    Specifies the 'delta' value, which is the maximum allowable
          separation between two genes of a chromosome that are to be
          considered neighboring, in the context of a COG team. The units
          are those of the positions given in the input file, and the
          default value is 1000.

-W file   By default, all output generated by the program is delivered to
          stdout. This option can be used to send the output to the given
          file instead.

-F rule   use filtering rule for selecting relevant teams
-P param  additional parameter for chosen filtering rule

          By default, all teams of cardinality two or greater are reported
          by the software. In fact, some teams are reported more than once
          because they are witnessed by multiple sections of the
          chromosomes. It is often convenient to filter the results
          automatically using some criterion. For this reason, the software
          supports several common filtering rules, some of which are
          further controlled through an additional parameter. The current
          set of filtering rules is:

          rule     parameter     description
          ------   ----------    ---------------------------
          all      (none)        all teams (DEFAULT)
          size     cardinality   teams with at least given cardinality
          cog      COG           teams which contain the specified COG
          max      (none)        for each COG, the containing team with
                                 max cardinality
          maxcog   COG           for given COG, the containing team with
                                 max cardinality

-O style  In addition to controlling which teams are reported, via the
          above filtering rules, the user can also control the style of
          output used for each reported team. The current set of styles is
          cogs (DEFAULT), witness, and concise.

          The default style (cogs) reports the list of COGs which make up
          the team, though not the positions of the actual genes which form
          witnesses to the team in the given chromosomes. The (witness)
          style reports the list of COGs together with such a witness in
          each chromosome. Finally, the (concise) style is used in
          conjunction with the -T option, discussed next.

-T file   By default, the original chromosomal data is read from the input
          file and then teams are identified through the primary algorithm
          discussed in reference [1]. However, because of the variety of
          options in filtering rules and in output styles, the software
          allows the set of filtered teams to be reported in an
          intermediate, concise format. This format is generated via the
          "-O concise" option above. Such a concise summary of previously
          computed teams can then be re-analyzed using this option.

          Please note that the concise summary file can only be used in
          conjunction with the original inputfile specifying the
          chromosomal data. Furthermore, the delta value cannot be adjusted
          in this way, nor can any previously filtered-out teams be
          recovered. What can be done is to vary the output style, as well
          as to apply additional filtering rules to the intermediate
          results.

--------------------------------------------------------------------------------
4. File formats

The original sequence files are downloaded from a sequence database, for
instance GenBank. These files must be processed to meet the format required
by cogteams. In addition, the output of cogteams can be hard to read because
COGs are represented by their ID numbers rather than by names or
descriptions, so the output usually needs to be post-processed as well. This
section is primarily intended for those who need or want to write their own
routines for data format processing.

Input file (chromosomes)
------------------------
The first line of a cogteams input file may look like this:

    COG_ID CHROMOSOME1 CHROMOSOME2

Note that the first word must be "COG_ID" to ensure proper formatting. The
names of the two chromosomes can be customized.

Each of the following lines starts with the name/ID of a COG, followed by
the positions of this COG on the first and second chromosome, respectively.
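
For those writing their own routines, the input layout can be read with a
few lines of code. The following is only a minimal sketch in Python (it is
not part of this distribution, and the function name parse_cog_input is
hypothetical). It assumes that every data line has exactly three
whitespace-separated fields, that multiple positions within a field are
colon-separated, and that positions are numeric.

```python
def parse_cog_input(text):
    """Parse cogteams-style input text.

    Returns (names, cogs) where names is the pair of chromosome names from
    the header line and cogs maps each COG ID to a pair of position lists,
    one list per chromosome.
    """
    # Ignore blank lines (an assumption; the format description does not
    # say whether they are allowed).
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")

    header = lines[0].split()
    if len(header) != 3 or header[0] != "COG_ID":
        raise ValueError('first line must start with "COG_ID" '
                         "followed by two chromosome names")
    names = (header[1], header[2])

    cogs = {}
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 3:
            raise ValueError("expected three fields: " + ln)
        cog_id, first, second = fields
        # Several genes of one COG on a chromosome are colon-separated.
        cogs[cog_id] = (
            [float(p) for p in first.split(":")],
            [float(p) for p in second.split(":")],
        )
    return names, cogs


sample = """COG_ID CHROMOSOME1 CHROMOSOME2
COG0589 640662:1395696:1433209:1977777:3637741:4110873 4030586
"""
names, cogs = parse_cog_input(sample)
print(names)
print(len(cogs["COG0589"][0]), "genes on the first chromosome")
```

Positions are stored as floats here only because the witness output style
prints positions with a decimal part; integers would serve equally well for
the sample line.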
For example:

    COG0589 640662:1395696:1433209:1977777:3637741:4110873 4030586

These three fields are whitespace-separated. Furthermore, if several genes
from the COG occur in the first (or second) chromosome, the positions of
those genes are colon-separated.

Output styles (teams)
---------------------
As discussed in Section 3.2, there are several styles for reporting a team.
The formats are largely self-explanatory, so we discuss them only briefly.

For the default style (cogs), a typical team is reported as a set of COGs,
separated by colons:

    COG0118:COG0106:COG0107:COG0140

If using the (witness) style, the team would be reported as:

    COG0118:COG0106:COG0107:COG0140
    Witness in the 1st chromosome:
    (COG0118:2092557.00),(COG0106:2093144.00),(COG0107:2093866.00),(COG0140:2094636.00)
    Witness in the 2nd chromosome:
    (COG0140:3581996.00),(COG0107:3582622.00),(COG0106:3583377.00),(COG0118:3584111.00)

For those interested, the (concise) style for this same team may appear as:

    1066 1069 0 1965

These four numbers are simply the start/end indices of the witnesses in the
two chromosomes, expressed not in the original positional units, but rather
by the rank of the gene in terms of position (i.e., the index of the gene in
the chromosome sorted by position).

--------------------------------------------------------------------------------
5. Utility scripts

The "scripts/" subdirectory contains several utility scripts that might be
used in converting data to the inputfile format expected by cogteams, or in
converting the standard output formats to alternate views.

The C program cogteams is the core of this software; the Perl and shell
scripts used for data processing are included only for the convenience of
users. These scripts might not run properly on every user's machine.
Users should be prepared to adjust the paths of the perl executable and of
the shell (bash is used in the included script). Windows users will probably
need to develop their own routines for preparing and interpreting the data
in the formats specified in Section 4 of this document.

scripts/gene2COG.pl:  the perl script that prepares the data in a format
                      that can be read by cogteams
scripts/COG2desc.sh:  the shell script that processes the output of cogteams
scripts/COG2desc.pl:  the perl script that is invoked by COG2desc.sh
scripts/rm_dup.pl:    the perl script to remove duplicates from the list of
                      teams (duplicates appear when a given team is
                      witnessed in more than one place in the chromosomes)

--------------------------------------------------------------------------------
6. An Example

The "example/" subdirectory includes various files related to a sample
experiment, namely the pairwise comparison between E. coli and B. subtilis.

Input Files:

example/COG.lst:    list of the textual descriptions for COGs from NCBI
                    GenBank
example/EC.ptt:     the sequence file for E. coli, downloaded from NCBI
                    GenBank
example/BS.ptt:     the sequence file for B. subtilis, downloaded from NCBI
                    GenBank
example/EC-BS.cog:  the input file of cogteams

Output Files:

example/EC-BS-1000.team:  the output of cogteams, which is a list of COG
                          teams (-d 1000 -F maxcog)
example/EC-BS-1000.desc:  easier to interpret version of EC-BS-1000.team,
                          processed with the COG2desc.sh script
example/EC-BS-2000.team:  the output of cogteams, which is a list of COG
                          teams (-d 2000 -F maxcog)
example/EC-BS-2000.desc:  easier to interpret version of EC-BS-2000.team,
                          processed with the COG2desc.sh script
example/EC-BS-3000.team:  the output of cogteams, which is a list of COG
                          teams (-d 3000 -F maxcog)
example/EC-BS-3000.desc:  easier to interpret version of EC-BS-3000.team,
                          processed with the COG2desc.sh script

The overall procedure of using this software for discovering conserved gene
clusters can be illustrated as follows.
This procedure is applied in a standard UNIX/Linux environment using the
included cogteams software and scripts.

- Download the sequence files of the two genomes of interest from NCBI
  GenBank:

      ftp://ftp.ncbi.nih.gov/genbank/genomes/

  These should be in the .ptt format.

- Create the input file for the program "cogteams" with this command:

      perl gene2COG.pl <chr1> <chr2> > <chr1-chr2.cog>

  where chr1 and chr2 refer to the two chromosomes and "chr1-chr2.cog" is
  the output file (users can choose whatever name they want).

  Example: perl gene2COG.pl EC.ptt BS.ptt > EC-BS.cog

- Run the program to find COG teams with a command such as:

      cogteams [options] EC-BS.cog -W EC-BS.team

- Replace the COG IDs in the output file with the associated textual
  descriptions, to facilitate interpretation of the result, with:

      COG2desc.sh <chr1-chr2.team> COG.lst > <chr1-chr2.desc>

  where "chr1-chr2.desc" is the name of the output file.

  Example: COG2desc.sh EC-BS.team COG.lst > EC-BS.desc

--------------------------------------------------------------------------------
7. Contact

For problems with and bugs in this software, please contact:

Xin He
Department of Biology
University of North Carolina, Chapel Hill
Email: xhe2@email.unc.edu

Or:

Michael Goldwasser
Department of Mathematics and Computer Science
Saint Louis University
Email: goldwamh@slu.edu

--------------------------------------------------------------------------------
8. Reference

[1] Xin He and Michael Goldwasser. Identifying conserved gene clusters in
    the presence of orthologous groups. Eighth Annual International
    Conference on Research in Computational Molecular Biology (RECOMB),
    San Diego, USA, March 2004, pp. 272-280.

[2] National Center for Biotechnology Information. Phylogenetic
    classification of proteins encoded in complete genomes, 2003.
    http://www.ncbi.nih.gov/COG.