TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs.
Bioinformatics. 2005 Jul 15;21(14):3164-5.
- Click here for the package overview.
- Click here for short descriptions of each file.
- Click here for an introductory tutorial.
- Download the package. (Last updated March 21st, 2012).
- Browse the automatically generated documentation (via pydoc).
- For each command-line program that can be executed from the Unix shell (e.g. Sitemap.py, AlignAce.py, MotifMetrics.py, UPGMA.py), documentation is obtained by executing the program without any arguments.
- If you're looking for a place to start, the core data structure is the Motif object (see MotifTools.py). That file also includes tools for constructing motifs from different data sources.
- Installation instructions are here.
License:
- The TAMO package is free for academic use. Please contact Ernest Fraenkel for commercial licensing.
Package Overview
Motifs - TAMO is developed around a unified motif representation, the position-specific scoring matrix (PSSM) (see MotifTools.py). Motif objects may be assembled from IUPAC ambiguity codes, multiple sequence alignments, averages of other motifs, and matrices of frequencies or log-likelihood values. Motifs can be printed, concatenated, indexed and sliced like text strings, or rendered as sequence logos. They can also be randomized, reverse-complemented, and recomputed using different assumptions about background base frequencies. Motifs can also store and report information about their origin, information content, and score. Finally, motifs can scan DNA sequences for instances of matching sites. We have included a command-line program, Sitemap.py, that uses the motif object to produce text-based maps of motif occurrences within a set of input sequences.
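As a rough illustration of the ideas in this paragraph, the sketch below builds a log-odds matrix from IUPAC ambiguity codes, reverse-complements it, and scans a sequence on both strands. It is plain Python written for this overview and does not use or reproduce the MotifTools.py interface; the function names (pssm_from_iupac, revcomp, scan), the pseudocount, and the uniform default background are all assumptions made for the example.

```python
import math

# IUPAC ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pssm_from_iupac(text, background=None, pseudocount=0.01):
    """Return a list of {base: log2-odds score} dicts, one per motif position."""
    background = background or {b: 0.25 for b in "ACGT"}
    pssm = []
    for code in text.upper():
        allowed = IUPAC[code]
        column = {}
        for base in "ACGT":
            freq = 1.0 / len(allowed) if base in allowed else 0.0
            # A small pseudocount (an assumption) keeps disallowed bases finite.
            freq = (freq + pseudocount) / (1.0 + 4 * pseudocount)
            column[base] = math.log2(freq / background[base])
        pssm.append(column)
    return pssm


def revcomp(pssm):
    """Reverse the columns and swap A<->T and C<->G within each column."""
    return [{COMPLEMENT[b]: s for b, s in col.items()} for col in reversed(pssm)]


def scan(pssm, seq, threshold):
    """Yield (position, strand, score) for windows scoring at or above threshold."""
    seq = seq.upper()
    width = len(pssm)
    for strand, matrix in (("+", pssm), ("-", revcomp(pssm))):
        for i in range(len(seq) - width + 1):
            window = seq[i:i + width]
            if any(b not in "ACGT" for b in window):
                continue  # skip windows containing N or other symbols
            score = sum(col[b] for col, b in zip(matrix, window))
            if score >= threshold:
                yield i, strand, score
```

Passing a different background dictionary to pssm_from_iupac corresponds to recomputing the matrix under different assumptions about base frequencies. Positions are reported as 0-based offsets of the window start on the forward strand, for both strands.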
Motif Discovery - The TAMO package contains interfaces to publicly available motif discovery programs as well as its own internal motif discovery programs. Interfaces to MDscan, AlignACE, and MEME (which must be installed separately) can be used to read the output generated by each program. The interfaces can also be used to invoke the programs, so that their results can be manipulated as data structures. In addition, we include a command-line program that uses the interface to invoke MDscan repeatedly with several different motif widths and rank all the resulting motifs (see MD/MDscan.py). Another included program uses the interface to AlignACE to increase its sampling capacity by invoking it repeatedly with different random number seeds (see MD/AlignAce.py).
TAMO also includes two motif discovery programs of its own. The first (MD/TAMO_EM.py) is a Python/C++ implementation of an EM-based motif search program very similar to MEME (Bailey and Elkan, 1995). The second (MotifMetrics.py) is a program for exhaustively scoring all k-mer words found in a set of co-regulated promoters (or bound intergenic regions) using any of the metrics provided in the motif scoring module, which is described below.
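The exhaustive k-mer approach can be sketched in a few lines. The example below is an illustration only, not the MotifMetrics.py implementation: it counts how many sequences contain each k-mer on either strand and hands those counts to a caller-supplied scoring function. The names canonical, kmer_sequence_counts, and score_all_kmers, and the four-argument signature of the metric callback, are assumptions made for the example.

```python
from collections import Counter

_COMP = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)


def kmer_sequence_counts(seqs, k):
    """Count, per canonical k-mer, how many sequences contain it on either strand."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        seen = set()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # ignore words with N or other symbols
                seen.add(canonical(word))
        counts.update(seen)  # each sequence contributes at most once per k-mer
    return counts


def score_all_kmers(foreground, background, k, metric):
    """Score every k-mer found in the foreground; best-scoring k-mers come first.

    metric(fg_hits, fg_total, bg_hits, bg_total) is supplied by the caller.
    """
    fg = kmer_sequence_counts(foreground, k)
    bg = kmer_sequence_counts(background, k)
    scored = [
        (kmer, metric(fg[kmer], len(foreground), bg[kmer], len(background)))
        for kmer in fg
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A simple metric for experimentation is the difference in the fraction of sequences hit, for example `lambda f, nf, b, nb: f / nf - b / nb`; a statistical test would normally be used instead.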
External Data Sources - Several TAMO modules provide access to public repositories of genomic information, with an emphasis on yeast and human data. For example, there are modules that provide access to SGD (DataSources/SGD.py) feature maps, including feature coordinates, and functions for translating between different types of identifiers (e.g. gene to ORF to Swiss-Prot identifier). The GO module (DataSources/GO.py) uses GO-slim annotations for gene annotation, and has facilities for finding functional categories that are statistically over-represented within a set of genes. TAMO also provides an interface to the human Gene Atlas (Su et al., 2004) (DataSources/Novartis.py), to data about yeast transcription rates (Holstege et al., 1998) (DataSources/Holstege.py), and to yeast protein localization data (Huh et al., 2003). Finally, TAMO provides fast, random-access interfaces to human and yeast (Saccharomyces cerevisiae) genome sequences (through seq/Human.py and DataSources/SGD.py).
Motif Scoring - Several different metrics have been described for evaluating motif quality. Generally, these methods use a statistical test to assign a score to the relative frequency of occurrence of a motif within an input set of sequences compared to the entire genome. TAMO includes metrics such as the "group specificity score" (Hughes et al., 2000), the enrichment score (Harbison et al., 2004), the ROC AUC metric (Clarke and Granek, 2003) and several others (See MotifMetrics.py).
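To make the idea of comparing motif frequency in an input set against the whole genome concrete, the sketch below computes a hypergeometric upper-tail probability. It is a generic over-representation test written for illustration; it is not claimed to match the exact definition of any of the cited scores or the code in MotifMetrics.py, and the function names and argument order are assumptions.

```python
from math import comb, log10


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from N, K of which are hits.

    Here N = sequences in the genome, K = genome sequences containing the motif,
    n = sequences in the input set, k = input sequences containing the motif.
    """
    upper = min(K, n)
    total = comb(N, n)
    # comb() returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def enrichment_score(k, N, K, n):
    """Report the tail probability as -log10(P); larger means more enriched."""
    p = hypergeom_sf(k, N, K, n)
    return float("inf") if p == 0 else -log10(p)
```

With exact integer arithmetic the tail probability is computed without overflow for genome-sized inputs, at the cost of speed; a production implementation would more likely work with log-factorials.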
Motif Comparison and Clustering - There are several ways of measuring the similarity between motifs. TAMO includes routines to find the optimal alignment of two motifs and to quantitatively report their similarity (or divergence) with several choices of distance metrics (Clustering/MotifCompare.py). TAMO also includes implementations of the k-medoids clustering algorithm (as described in Harbison et al., 2004) (Clustering/Kmedoids.py) and a UPGMA-based hierarchical clustering algorithm (Clustering/UPGMA.py) in which motifs in sub-trees are averaged with weightings based on their connectivities.
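One way to picture the alignment step is as an ungapped slide of one motif along the other, on both strands, keeping the offset with the smallest average column distance. The sketch below does that with Euclidean distance between probability columns. It is an illustration under stated assumptions, not the Clustering/MotifCompare.py code: the choice of metric, the minimum overlap of 4 columns, and the names one_hot, column_distance, and best_alignment are all hypothetical.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_hot(text):
    """Build a toy motif (list of probability columns) from a plain DNA string."""
    return [{b: 1.0 if b == ch else 0.0 for b in BASES} for ch in text.upper()]


def column_distance(p, q):
    """Euclidean distance between two probability columns."""
    return math.sqrt(sum((p[b] - q[b]) ** 2 for b in BASES))


def revcomp(motif):
    """Reverse the columns and swap complementary bases."""
    return [{COMPLEMENT[b]: v for b, v in col.items()} for col in reversed(motif)]


def best_alignment(m1, m2, min_overlap=4):
    """Return (distance, offset, strand) for the best ungapped alignment.

    offset is the index in m1 at which column 0 of m2 (or of its reverse
    complement, for strand "-") is placed; it may be negative.
    """
    best = None
    for strand, other in (("+", m2), ("-", revcomp(m2))):
        for offset in range(-(len(other) - min_overlap), len(m1) - min_overlap + 1):
            start = max(0, offset)
            stop = min(len(m1), offset + len(other))
            if stop - start < min_overlap:
                continue
            # Average over the overlapping columns only.
            dist = sum(
                column_distance(m1[i], other[i - offset]) for i in range(start, stop)
            ) / (stop - start)
            if best is None or dist < best[0]:
                best = (dist, offset, strand)
    return best
```

A pairwise distance of this kind is the input that clustering methods such as k-medoids or UPGMA operate on. Averaging over the overlap alone favors short overlaps, which is why a minimum overlap is enforced.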
Sequence Data - The TAMO package has fast routines for scanning collections of sequences with motifs and for reading and writing collections of sequences in FASTA format. Routines are also included for generating sets of sequences picked at random from a large collection (e.g. from all the promoters in a genome) for the purpose of collecting statistics on the performance of motif discovery programs on random data. Finally, TAMO provides tools for computing Markov models to represent background sequences.
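The sequence utilities described here can be approximated in plain Python. The sketch below parses and writes FASTA text, draws a random subset of sequences, and estimates a fixed-order Markov background model. None of these functions belong to TAMO; the names, the default line width of 60, and the pseudocount are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def parse_fasta(text):
    """Parse FASTA-formatted text into an {identifier: sequence} dict."""
    parts, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # identifier = first word of header
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def format_fasta(seqs, width=60):
    """Render an {identifier: sequence} dict as FASTA text with wrapped lines."""
    lines = []
    for name, seq in seqs.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def random_subset(seqs, n, seed=None):
    """Pick n sequences at random (without replacement) from a collection."""
    rng = random.Random(seed)
    return {name: seqs[name] for name in rng.sample(sorted(seqs), n)}


def markov_model(seqs, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from a set of sequences."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in "ACGT"}
    return model
```

Drawing many random subsets of the same size as a real input set, and running a motif finder on each, gives an empirical picture of the scores expected by chance.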
Microarray Data - A general purpose "dataset" object stores collections of microarray experiments. For each experiment, the object provides the ability to identify sets of genes (or probes) that have enrichments (or P-values) that satisfy user-supplied thresholds. Similarly, it is possible to identify experiments in which a specific gene (or probe) has a value above or below a particular threshold. The dataset object is easily integrated with other modules by using consistent identifiers (e.g. promoter or probe names). As an example, we include a command-line program which, when given a collection of genome-wide chromatin immunoprecipitation experiments for a large number of transcription factors, can be used to generate individual FASTA-formatted files containing the sequences of the intergenic regions bound by each factor.
An additional array-specific module is included that provides access to details regarding the yeast microarray we have used in previous work (Harbison et al., 2004).
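The threshold queries described above amount to filtering a table of values by experiment or by probe. The sketch below shows one minimal way to do that and to produce per-factor FASTA text from the result. The Dataset class here is a hypothetical stand-in, not TAMO's dataset object, and its method names, the nested-dictionary layout, and the bound_fasta helper are assumptions.

```python
class Dataset:
    """Hypothetical container mapping experiment -> {probe name -> P-value}."""

    def __init__(self, pvalues):
        self.pvalues = pvalues

    def probes_below(self, experiment, max_p):
        """Probes whose P-value in one experiment is at or below max_p."""
        row = self.pvalues[experiment]
        return sorted(probe for probe, p in row.items() if p <= max_p)

    def experiments_below(self, probe, max_p):
        """Experiments in which one probe has a P-value at or below max_p."""
        return sorted(
            name for name, row in self.pvalues.items()
            if probe in row and row[probe] <= max_p
        )


def bound_fasta(dataset, sequences, max_p):
    """Build one FASTA-formatted string per experiment from probes passing max_p.

    `sequences` maps probe names to DNA sequences; probes without a sequence
    are skipped. The shared probe names are what tie the two tables together.
    """
    out = {}
    for experiment in dataset.pvalues:
        records = [
            ">%s\n%s\n" % (probe, sequences[probe])
            for probe in dataset.probes_below(experiment, max_p)
            if probe in sequences
        ]
        out[experiment] = "".join(records)
    return out
```

Writing each value of the returned dictionary to its own file would give one FASTA file per transcription factor, in the spirit of the command-line program mentioned above.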
Statistics - TAMO includes a set of useful statistical routines for computing P-values for normal, binomial, Poisson, and hypergeometric distributions. The Shapiro-Wilk normality test and the Wilcoxon-Mann-Whitney rank sum test are also provided.
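For readers who want to see what such P-value routines compute, the sketch below gives upper-tail probabilities for the normal, binomial, and Poisson distributions using only the Python standard library. These are textbook formulas written for illustration, not TAMO's routines, and the function names are assumptions.

```python
from math import comb, erfc, exp, factorial, sqrt


def normal_sf(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2.0))


def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    return 1.0 - sum(exp(-lam) * lam ** i / factorial(i) for i in range(k))
```

The direct sums are adequate for small counts; for very small tail probabilities or large n, subtracting from 1 loses precision and a log-space formulation is preferable.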
References
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29.
Bailey, T. L., and Elkan, C. (1995). The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 3, 21-29.
Cherry, J. M., Ball, C., Weng, S., Juvik, G., Schmidt, R., Adler, C., Dunn, B., Dwight, S., Riles, L., Mortimer, R. K., and Botstein, D. (1997). Genetic and physical maps of Saccharomyces cerevisiae. Nature 387, 67-73.
Clarke, N. D., and Granek, J. A. (2003). Rank order metrics for quantifying the association of sequence features with gene regulation. Bioinformatics 19, 212-218.
Crooks, G. E., Hon, G., Chandonia, J. M., and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res 14, 1188-1190.
Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., Hannett, N. M., Tagne, J. B., Reynolds, D. B., Yoo, J., et al. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104.
Holstege, F. C., Jennings, E. G., Wyrick, J. J., Lee, T. I., Hengartner, C. J., Green, M. R., Golub, T. R., Lander, E. S., and Young, R. A. (1998). Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717-728.
Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296, 1205-1214.
Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S., Howson, R. W., Weissman, J. S., and O'Shea, E. K. (2003). Global analysis of protein localization in budding yeast. Nature 425, 686-691.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., et al. (2003). The UCSC Genome Browser Database. Nucleic Acids Res 31, 51-54.
Liu, X. S., Brutlag, D. L., and Liu, J. S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol 20, 835-839.
Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004). A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062-6067.
Tompa, M., Li, N., Bailey, T. L., Church, G. M., De Moor, B., Eskin, E., Favorov, A. V., Frith, M. C., Fu, Y., Kent, W. J., et al. (2005). Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23, 137-144.