Python: module TAMO.MotifMetrics

TAMO.MotifMetrics

index
/home/David_Gordon/docs/TAMO/MotifMetrics.py

MotifMetrics -- Kitchen sink of routines for evaluating motifs on genomes This module is used both as a command-line script and as a repository for motif metrics. For a summary of the command-line usage, just type python MotifMetrics.py for help. -= Summary =- The MotifMetrics module performs fast motif evaluation by converting python representations of motifs (PSSMs or regular expressions) and DNA sequences into efficient C++ data structures. -= Motif Metrics =- Motif metrics represent different ways to assign a score to the number of binding sites that match a motif model that can be found within a subset of genomic sequences (such as those identified by microarray or functional annotation) as compared to the occurences in the entire genome (or at least the entire sequence represented in the microrray.) Historically, these metrics have generally been limited to intergenic sequences, but they can be extended to complete sequences as microarray technology progresses. This module provides an interface to more than a dozen different metrics. These include: The "Enrichment" metric used in Harbison et al. Nature 2004 The "Group Specificity Metric" used by AlignACE (Hughes et al.) The "ROC auc" and "MNCP" metrics described by Clarke et al. and several other variants and experimental metrics. Although many metrics are based on approaches from probability theory, the scores themselves may not always be interpreted directly in terms of significance, particularly if 1) the motif is learned from the same data on from which the score(s) is computed or 2) the motif is one of many motifs being evaluated on the same data. -= Input =- The metrics used in the module (generally) require these items as input: 1) A motif (specified as a regular expression, a IUPAC ambiguity code, or as a "tamo"-formatted PSSM). 2) A genome (Usually specified as a fasta-file with individual entries representing each intergenic region or UTR region.) 3) A list (or fasta-formatted file) of "interesting" genes or intergenic regions, using the same names as in the genome fasta-formatted file. For example, to compute the "Enrichment" metric of the GCN4 motif among intergenic probes identified in a GCN4 ChIP-chip assay in yeast, the user must supply: 1) The motif ("TGASTCA") 2) A genome (say, the probe sequences on the intergenic array from the ChIP experiment: "yeast6k_probes.fsa") 3) A list of the "bound" probes. (Say, "GCN4_YPD.fsa") NOTE: The first time a fasta-formated file is loaded into a MotifMetrics.ProbeSet object (often given the variable name "genome"), the object is cached to disk. This means that the first user to load a fasta-formatted file needs to have write permission in the same directory. -= Quick and Dirty motif discvery =- Given a motif metric, it is possible to evaluate many candidate motifs to find the ones with the best scores. This is essentially a naive motif discvery, that is limited only by the available computer time. One built-in form of this available with the "-n" option, which computes the specified metric for all k-mers (without ambiguity codes) in the set of "bound" or "differentially regulated" sequences. On a P4-2000, about a thousand such motifs can be evaluated in about five numutes. -= Examples =- At the command line: % MotifMetrics.py GCN4_YPD.fsa -g $TAMOdata/yeast/yesat6k_probes.fsa TGASGCA In a python file: Genome = MotifMetrics.ProbeSet('../yeast/yeast6k_probes.fsa') ids = Fasta.ids('GCN4_YPD.fsa') e = Genome.Enrichment('TGASTCA',ids) print e, -math.log(e)/math.log(10) Copyright (2005) Whitehead Institute for Biomedical Research All Rights Reserved Author: David Benjamin Gordon

Modules

TAMO.util.Arith
TAMO.seq.Fasta
TAMO.MD.MDsupport
TAMO.MotifTools
TAMO
math
os
pickle
re
shelve
string
sys

Classes



ProbeSet

class ProbeSet

    class Probeset -- Class that represents groups of discrete genome sequences (such as may be represented                   on a microarray, or the upstream regions of all genes) in such a way is to facilitate                   computing motif scores and metrics.  See module documentation and documentation of                   member functions.

Methods defined here:

E(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.E      (ambig, NMlist,verbose='',factor=None,omitlist=[]) --          Synonym for ProbeSet.Enrichment.  See its docstring

E_chi2(self, ambig, NMlist, bins=None, verbose='', factor=None, omitlist=[])
ProbeSet.E_chi2(ambig, NMlist,bins=None,verbose='',factor=None,omitlist=[]) --          Metric that compares the distribution of number of probes with multiple occurrences          as compared to the distribution seen the the genome at large.  Not completely stable,          because counts are often too small for reliable chi2.

E_seq(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.E_seq(ambig, NMlist,verbose='',factor=None,omitlist=[]) -- [Unstable]          Compute the Enrichment score, with these two changes:          The number of "draws" is the number of bases (both in the genome and          in the NMlist probe/gene sequences) and the motif counts are the number          of matching, but non-ovelapping binding sites.  In theory, this should          behave similarly to the overrep score with MAXVALUE set very high.

E_site(self, ambig, NMlist, ceil=1, verbose='', factor=None, omitlist=[])
ProbeSet.E_site(ambig, NMlist,ceil=1,verbose='',factor=None,omitlist=[]) --          Like the enrichment score, but computes the hypergrometric on the number of sites.          Not recommended, because the "total" and "numpicked" are based on number of probes,          so the metric intrinsically contains an apple vs. orange comparison.

E_sitef(self, ambig, NMlist, k, verbose='', factor=None, omitlist=[])
ProbeSet.E_sitef(ambig, NMlist,k, verbose='',factor=None,omitlist=[]) --          Another variant of the Enrichment score, which requires at least          k occurences of a motif to occur on a probe for the probe to be          counted.  Each probe is counted k times.

Enrichment(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.Enrichment(ambig, NMlist,verbose='',factor=None,omitlist=[]) --          Computes Enrichment score as described in Harbison & Gordon et al. Nature 2004          Optional factor is expressed in terms of fraction of the maxiumum possible          PSSM ("ambig") score.  Result is computed by using the hypergeometric distribution          to compute the probability of observing the number entries in NMlist that          match the motif by considering them has drawn randomly from the ProbeSet without          replacement.

GO(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.GO(ambig,NMlist,verbose='',factor=None,omitlist=[]) --          Check whether the subset of NMlist ids that match the motif have significant          categorization according to the Gene Ontology (GO) database.

MNCP(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.MNCP(ambig, NMlist,verbose='',factor=None,omitlist=[]) --          Compute MNCP as decribed in Clarke ND, Granek JA  Bioinformatics 2003

ROC_AUC(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.ROC_AUC(ambig, NMlist,verbose='',factor=None,omitlist=[]) --          Compute ROC_AUC as decribed in Clarke ND, Granek JA  Bioinformatics 2003

__init__(self, genome='YEAST', factor=0.69999999999999996)
ProbeSet.__init__(genome='YEAST',factor=0.7) -- 'genome' can be "YEAST" or a fasta-formatted file.                                                 'factor' specifies the PSSM threshold in terms of                                                 its maximum possible value.

best_Enrichment(self, ambig, NMlist, verbose='')
ProbeSet.best_Enrichment(ambig, NMlist,verbose='') --          Find the scoring threshold at which the motif ("ambig") has the optimal Enrichment score          (see ProbeSet.Enrichment).  Returns (best_enrichment_score, best_factor) tuple.

best_p_value(self, ambig, NMlist, verbose='')
ProbeSet.best_p_value(ambig, NMlist,verbose='') --          Synonym for best_Enrichment.  See its docstring.

best_pvalue(self, ambig, NMlist, verbose='')
ProbeSet.best_pvalue(ambig, NMlist,verbose='') --          Find the scoring threshold at which the motif ("ambig") has the optimal Enrichment score          (see ProbeSet.Enrichment)  Returns (best_enrichment_score, best_factor) tuple.  (Synonym          for best_Enrichment)

binomial(self, ambig, NMlist, verbose='', factor=None, omitlist=[], MAXVALUE=4)
ProbeSet.binomial(ambig,NMlist, verbose='',factor=None,omitlist=[],MAXVALUE=4) --          Synonym for ProbeSet.overrep.  See its docstring.

church(self, ambig, preNMlist, verbose='', topcount=100, omitlist=[])
ProbeSet.church(ambig,preNMlist,verbose='',topcount=None,omitlist=[]) --          The "Group Specificity Score" as used by AlignACE (Hughes et al. JMB 2000)          Optional topcount is used as "100" in AlignACE.  A negative fractional value for          topcount is used to specify the count in terms of the number of sequences          in NMlist.

countD_matching_sites(self, ambig, NMlist=[], thresh=None, Nooverlap=None)
ProbeSet.countD_matching_sites( ambig, NMlist=[], thresh=None,Nooverlap=None) -- [Utility, Dispatch]          Confirm that ambig is a motif object (as opposed to a regular expression) and invoke          _countD_matching_sites_motif_fast.  See its documentation.

count_matching_probes(self, ambig, NMlist=[], thresh=None)
ProbeSet.count_matching_probes( ambig, NMlist=[], thresh=None): --  [Utility, Dispatch]          Check whether ambig is a Motif object or a regular expression, and invoke the          appropriate routine to use ambig to count the number of matching probes.

count_matching_sites(self, ambig, ceil=1, NMlist=[], thresh=None)
ProbeSet.count_matching_sites( ambig, ceil=1,NMlist=[], thresh=None) -- [Utility, Dispatch]          Confirm that ambig is a motif object (as opposed to a regular expression) and invoke          _count_matching_sites_motif_fast().   See its docstring.

distribution(self, motif, NMlist)
ProbeSet.distribution(motif,NMlist,) --          Prints the "id   best_score, bestscore/maxscore" information for all probes in NMlist,          followed by some general statistical information.  If Stats.stats is installed,          also attempts KS test of distribution of scores within bound probes versus the genome.

filter(self, probelist)
ProbeSet.filter(probelist) --          Return list of ids that are also represented in the ProbeSet object.

frac(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.frac(ambig, NMlist,verbose='',factor=None,omitlist=[]) --          Report and return the fraction of the probes/gene sequences in NMlist          that contain matches to the motif ("ambig") under the default          threshold score or the specified theshold according to factor (which          indicates the fraction of the maximum possible PSSM score to be used          as a threshold.

fsa_string_from_ids(self, ids, comments=[])
ProbeSet.fsa_string_from_ids( ids, comments=[]) --          Prepare a text string from ids for          printing in Fasta format.  Similar functionality in Fasta module.

ids_from_file(self, filename, QUIET='')
ProbeSet.ids_from_file(filename,QUIET='') -- [Deprecated]          Fasta and txt file loader.  Use Fasta.keys() or Fasta.ids() instead.

matches_with_features(self, ambig, NMlist, featurelist, featurename='fE', verbose='', factor=None)
ProbeSet.matches_with_features(ambig,NMlist,featurelist,featurename='fE',verbose='',factor=None) --          Check Enrichment of motif among ids in featurelist that have something "interesting" about          them.  For example, you would use this function if you wanted to ask if a motif has a good          Enrichment score among probes identified in a microarray experiment that are, say,          telomeric, or perhaps known to be important during cell-cycle.          NMlist and featurelist are both lists of gene/probe ids.  Featurename is used only for          visual reference when printing the results.  Optional factor is expressed in terms          of fraction of the maxiumum possible PSSM score.

matching_ids(self, ambig, NMlist=[], factor=None)
ProbeSet.matching_ids(ambig,NMlist,factor=None) --          Return a list of sequence ids among NMlist (probe or gene ids) that match the motif.          Optional factor is expressed in terms of fraction of the maxiumum possible PSSM score.

overrep(self, ambig, NMlist, verbose='', factor=None, omitlist=[], MAXVALUE=4)
ProbeSet.overrep(ambig,NMlist, verbose='',factor=None,omitlist=[],MAXVALUE=4) --          Computes Over-representation score.          Optional factor is expressed in terms of fraction of the maxiumum possible          PSSM ("ambig") score.          The score is computed by using the binomial distribution to compute the          probability of observing the number sites that match the motif among the sequences          in NMlist by considering them has drawn randomly from the ProbeSet with          replacement.  The expected fractions are based on the number of bases in the NMlist          probe/gene sequences and the total number of bases in the genome.          MAXVALUE is the maximum number of binding sites to count within a probe/gene          sequence.  A good value for 500-1000bp sequences is 4, but MAXVALUE should          be made very high if analyzing longer sequences.

p_value(self, ambig, NMlist, verbose='', factor=None, omitlist=[])
ProbeSet.p_value(ambig, NMlist,verbose='',factor=None,omitlist=[]) --          Synonym for ProbeSet.Enrichment.  See its docstring

p_value_noomit(self, ambig, NMlist, verbose='', factor=None)
ProbeSet.p_value_noomit(ambig, NMlist,verbose='',factor=None) -- [Deprecated]          Synonym for ProbeSet.Enrichment.  See its docstring

print_fsa_from_idfile(self, filename)

print_fsa_from_ids(self, ids)
ProbeSet.print_fsa_from_ids(ids) --          Print Fasta-formatted sequences to standard output.          Similar functionality also exists in the Fasta module.

reduce_probelist(self, ids)
ProbeSet.reduce_probelist(ids) --          Change the the internal probelist.  Keep only sequences associated with the supplied IDs.

seqs_from_idfile(self, filename, QUIET='')

seqs_from_ids(self, ids)

Functions


AmbigToRegExp(expression)
AmbigToRegExp(expression)          Converts a string IUPAC ambiguity codes into a regular expressions.

gapped_motifs(seq, maxgap=12)
gapped_motifs(seq,maxgap=12) --      Takes input text string, say CGG, and builds a list of all tandem, inverted, and everted repeats      with "N" gaps ranging from 0 to maxgap (12). [CGGCCG CGGCGG CCGCGG, CGGnCCG CGGnCGG CCGnCGG, ....]

main()

permute(letters, depth, seqs=[''], curdepth=0)
permute(letters, depth, seqs=[''],curdepth=0) --        Generate all possible sequences from the alphabet        "letters" of length "depth" (sorry about the        variable names).  For example, "permute('ACGT',4)"        generates all 256 possible 4-letter DNA sequences.        Duplicated in TAMO.util.PermuteTools

revcomp(dna)
revcomp(dna) --        return reverse complement of a DNA sequence

top_nmers_fsa(N, filename)
top_nmers_fsa(N,filename) --         Return the n-mers (or k-mers) of length N in sequencs in the fasta-formatted         input file.

top_nmers_seqs(N, seqs)
top_nmers_seqs(N,seqs) --         Return the n-mers (or k-mers) that occur at in at least approximately 10% of the input         sequences.

Data

revcomp_memo = {}
revcomp_trans = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Data
		revcomp_memo = {} revcomp_trans = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'