Python: module TAMO.MotifTools

TAMO.MotifTools

index
/home/David_Gordon/docs/TAMO/MotifTools.py

MotifTools.py -- Kitchen Sink Motif Objects and Operations CORE: Motif() A pssm model, with scanning, storing, loading, and other operations There are a large number of functions and member fucntions here. To get started, a motif can be instantiated by providing an ambiguity code, a set of aligned DNA sequences, or from matrices of counts, probabilities or log-likelihoods (aka PSSMs). >>> m = MotifTools.Motif_from_text('TGAAACANNSYWT') >>> print m.oneletter() TGAAACA..sywT Lower case reflects lower information content. For a more detailed view of the distribution of information, try this: >>> m.textlogo() # -- 2.30 bits # # TGAAACA T # TGAAACA T # TGAAACA T # TGAAACA T # TGAAACA CCAT # TGAAACA CCAT # TGAAACA GTTT # TGAAACA GTTT -- 0.23 bits # ------------- # TGAAACA..sywT Motif objects may be manipulated largely like text strings (with pythonish indexing) >>> print m[4:5].oneletter A >>> print m[4:7].oneletter ACA >>> print (m[4:7] + m[1:2]).oneletter ACAG >>> print (m[4:7] + m[1:7]).oneletter ACAGAAACA ...and even padded with blanks: >>> print m[-4:7] ....TGAAACA See source for further documentation. Copyright (2005) Whitehead Institute for Biomedical Research All Rights Reserved Author: David Benjamin Gordon

Modules

copy
TAMO
math
os
pickle
re
string
sys
tempfile

Classes



Motif

class Motif

    CORE: Motif()   A pssm model, with scanning, storing, loading, and other operations There are a large number of functions and member fucntions here.  To get started, a motif can be instantiated by providing an ambiguity code, a set of aligned DNA sequences, or from matrices of counts, probabilities or log-likelihoods (aka PSSMs). >>> m =  MotifTools.Motif_from_text('TGAAACANNSYWT') >>> print m.oneletter() TGAAACA..sywT See code (or documentation) for details

Methods defined here:

__add__(self, other)
m.__add__(other) -- Overload  '+' for concatenating motifs

__getitem__(self, tup)
m.__getitem__(tup) -- Overload m[a,b] to submotif.  Less pythonish than [:], but more reliable

__getslice__(self, beg, end)
m.__getslice__(,beg,end) -- Overload m[a:b] to submotif.

__init__(self, list_of_seqs_or_text=[], backgroundD=None)

__len__(self)
m.__len__()  -- Overload len(m) to return width

__repr__(self)

__sub__(self, other)
m.__sub__(other)  --- Overloads the '-' operator to compute the Euclidean distance between probability matrices                       motifs of equal width.  Consult TAMO.Clustering.MotifCompare for metrics to compare motifs                       of different widths

addpseudocounts(self, beta=0)
m.addpseudocounts(,beta=0) -- Add pseudocounts uniformly across the matrix

bestscan(self, seq)
m.bestscan(seq) -- Return the score of the best match to the motif in the supplied sequence

bestscanseq(self, seq)
m.bestscanseq(seq) -- Return score,sequence of the best match to the motif in the supplied sequence

bestscore(self, seq, fwd='')
m.bestscore(seq, fwd='') -- Returns the score of the best matching subsequence in seq.

bestseqs(self, thresh=None)
m.bestseqs(,thresh=None)  -- Return all k-mers that match motif with a score >= thresh

bogus_kmers(self, count=200)
m.bogus_kmers(count=200) --  Generate a faked multiple sequence alignment that will reproduce                              the probability matrix.

compute_from_counts(self, countmat, beta=0)
m.compute_from_counts(,countmat,beta=0) -- Utility function to build a motif object from a matrix of letter counts.

compute_from_ll(self, ll)
m.compute_from_ll(ll) -- Build motif from an inputed log-likelihood matrix (This function reverse-calculates the probability matrix and background frequencies that were used to construct the log-likelihood matrix)

compute_from_nmer(self, nmer, beta=0.001)
m.compute_from_nmer(nmer,beta=0.001):  See compute_from_text.  Here for reverse compatibility

compute_from_text(self, text, beta=0.001)
m.compute_from_text(,text,beta=0.001)  -- Compute a matrix values from a text string of ambiguity codes.                                           Use Motif_from_text utility instead to build motifs on the fly.

copy(self)
m.copy() -- Return a 'deep' copy of the motif

denoise(self, bitthresh=0.5)
m.denoise(bitthresh=0.5) -- Set low-information positions (below bitthresh) to Ns

emit(self, prob_min=0.0, prob_max=1.0)
m.emit(,prob_min=0.0,prob_max=1.0) -- Consider motif as a generative model, and have it emit a sequence

giflogo(self, id, title=None, scale=0.80000000000000004, info_str='')
m.giflogo(id,title=None,scale=0.8) -- (Requires seqlogo package) Make a gif sequence logo

mask_seq(self, seq)
m.mask_seq(seq) -- Return a copy of input sequence in which any regions matching m are replaced with strings of N's

maskdiff(self, other)
m.maskdiff(other) -- A different kind of motif comparison metric.  See THEME paper for details

masked_neighborhoods(self, seq, flanksize)
m.masked_neighborhoods(seq,flanksize) -- Chop up the input sequence into regions surrounding matches to m.  Replace the                                          subsequences that match the motif with N's.

matchstartorient(self, seq, factor=0.69999999999999996)
m.matchstartorient(,seq, factor=0.7) -- Returns list of (start,orientation) coordinate pairs of                                         matches to the motif in the supplied sequence.  Factor                                         is multiplied by m.maxscore to get a match threshold.

maxdiff(self)
m.maxdiff()   -- Compute maximum possible Euclidean distance to another motif.  (For normalizing?)

minimal_raw_seqs(self)
m.minimal_raw_seqs() -- Return minimal list of seqs that represent consensus

new_bg(self, bg)
m.new_bg(,bg)  -- Change the ACGT background frequencies to those in the supplied dictionary.                   Recompute log-likelihood, etc. with new background.

print_textlogo(self, norm=2.2999999999999998, height=8.0)
m.print_textlogo(,norm=2.3, height=8.0) -- Print a text-rendering of the Motif Logo norm   -- maximum number of bits to show height -- number of lines of text to use to render logo

printlogo(self, norm=2.2999999999999998, height=10.0)
m.printlogo(,norm=2.3, height=10.0) -- Print a text-rendering of the Motif Logo norm   -- maximum number of bits to show height -- number of lines of text to use to render logo

random_diff_avestd(self, iters=5000)
m.random_diff_avestd(iters=5000) -- See modules' random_diff_avestd

random_kmer(self)
m.random_kmer() -- Generate one of the many k-mers that matches the motif.  See m.emit() for a more probabilistic generator

revcomp(self)
m.revcomp() -- Return reverse complement of motif

scan(self, seq, threshold='', factor=0.69999999999999996)
m.scan(seq, threshold = '', factor=0.7) -- Scan the sequence.  Returns three lists:                                            matching sequences, endpoints, and scores.  The value of                                            'factor' is multiplied by m.maxscore to get a match threshold                                            if none is supplied

scansum(self, seq, threshold=-1000)
m.scansum(seq,threshold = -1000) -- Sum of scores over every window in the sequence.  Returns                                     total, number of matches above threshold, average score, sum of exp(score)

score(self, seq, fwd='Y')
m.score(seq, fwd='Y') -- Returns the score of the first w-bases of the sequence, where w is the motif width.

seq_neighborhoods(self, seq, flanksize)
m.seq_neighborhoods(seq,flanksize) -- Chop up the input sequence into regions surrounding matches to the motif.

shuffledP(self)
m.shuffledP() -- Generate motif in which probability matrix has been shuffled.

summary(self)
m.summary() -- Return a text string one-line summary of motif and its metrics

trimmed(self, thresh=0.10000000000000001)
m.trimmed(,thresh=0.1)  -- Return motif with low-information flanks removed.  'thresh' is in bits.

Functions


Motif_from_counts(countmat, beta=0.01, bg={'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})
m = Motif_from_counts(countmat,beta=0.01,bg={'A':.25,'C':.25,'G':.25,'T':.25}) Construct a Motif object from a matrix of counts (or probabilities or frequencies). A default set of uniform background frequencies may be overridden. beta refers to the number of pseudocounts that should be distributed over each position of the PSSM.

Motif_from_ll(ll)
m = Motif_from_ll(ll) Constructs a motif object from a log-likelihood matrix, which is in the form of a list of dictionaries.

Motif_from_text(text, beta=0.050000000000000003, source='', bg=None)
m = Motif_from_text(text,beta=0.05,source='',bg=None) Construct a Motif object from a text string constructed from IUPAC ambiguity codes. A default set of uniform background frequencies may be overridden with a dictionary of the form {'A':.25,'C':.25,'G':.25,'T':.25}). beta refers to the number of pseudocounts that should be distributed over each position of the PSSM.

Random_motif(w)
Random_motif(w) -- Generate a random motif of width w.  Each position will                    have a dominant letter with probability around 0.91.

avestd(vals)
avestd(vals) -- [Utility function] Return an (average, stddev) tuple computed from the supplied list of values

bestseqs(motif, thresh, seq='', score=0, depth=0, bestcomplete=None, SEQS=[])
bestseqs(motif,threshold) This function returns a list of all sequences that a motif could match match with a sum(log-odds) score greater than thresh.

compare_seqs(s1, s2)
compare_seqs(s1, s2)

copy(motif)
m = copy(n) Utility routine for copying motifs

diff(self, other)
diff(m1,m2)        - psuedo-Euclidean (sum_col(sqrt(norm(sum_row)))/#col

diverge(self, other)
Yet another distance metric

giflogo(motif, id, title=None, scale=0.80000000000000004)
giflogo(motif,id,title=None,scale=0.8) -- Interface to the 'weblogo/seqlogo' perl                                      scripts that generate colorful sequence logos

infomaskdiff(self, other)
infomaskdiff(m1,m2) -- Return pseudo-Euclidean distance, but scale column distance by                        information content of "other".  Used by THEME

load(filename)
load(filename) -- Load a 'TAMO'-formatted motif file.

m_matches(seqs, wmer, m)
m_matches(seqs,wmer,m)  -- Returns list of all kmers among sequences that have at most                            m mismatches to the supplied wmer (kmer).

maskdiff(self, other)
maskdiff(m1,m2)    - diff, but excluding positions with 'N' in m2                      Return pseudo-Euclidean distance, but only include columns that are not background

merge(A, B, overlap=0)
merge(A,B,overlap=0)  -- [Utility function] Use the '+' operator instead.   Used for concatenating motifs                          into a new motif, allowing for the averaging of overlapping bases between them.

minaligndiff(M1, M2, overlap=5, diffmethod='diff')

minwindowdiff(M1, M2, overlap=5, diffmethod='diff')
m.minwindowdiff(M1,M2,overlap=5,diffmethod='diff')

nlog10(x, min=9.8813129168249309e-324)
nlog10(x,min=1e-323) -- returns -log10(x) with a maximum default value of 323.

pickletxt2motifs(toks)
pickletxt2motifs(toks) -- [Utility function] See txt2motifs documentation.

print_motif(motif, kmer_count=20, istart=0)
print_motif(motif,kmer_count=20,istart=0) -- Print a motif in the 'TAMO'-format.  istart specificies the motif number, and                                              optional kmer_count specificies how many sequences to include in the printed                                              multiple sequence alignment that recapitulates the probability matrix.

print_motifs(motifs, kmer_count=20, istart=0)
print_motifs(motifs,kmer_count=20,istart=0) -- Print list of motifs as a 'TAMO'-formatted motif file to the specificied file.                                                Optional kmer_count specificies how many sequences to include in the printed                                                multiple sequence alignment that recapitulates the probability matrix.                                                istart specifies number from which to begin motif ids.

random(...)
random() -> x in the interval [0, 1).

random_diff_avestd(motif, iters=5000)
random_diff_avestd(motif,iters=5000) -- Return the average & stddev distance ('diff') between a                                         motif and "iters" random motifs of the same width.

revcomplement(seq)
revcomplement(seq) A quick reverse-complement routine that memo-izes queries, understands IUPAC ambiguity codes, and preserves case.

revcompmotif(self)
revcompmotif(self)  -- [Utility function] Construct the reverse complement of the motif.  Use m.revcomp() member function instead.

save_motifs(motifs, filename, kmer_count=20)
save_motifs(motifs,filename,kmer_count=20) -- Save list of motifs as a 'TAMO'-formatted motif file to the specificied file.                                               optional kmer_count specificies how many sequences to include in the printed                                               multiple sequence alignment that recapitulates the probability matrix.

seqs2fasta(seqs, fasta_file='')
seqs2fasta(seqs,fasta_file = '') -- Dumps a Fasta formatted file of sequences, keyed by the sequence itself: >ACTTTTTGTCCCA ACTTTTTGTCCCA >ACTTTTGGGGCCA ACTTTTGGGGCCA ...

shuffle_bases(m)
shuffle_bases(m) -- Return a new motif object in which the probabilities are                     randomly re-assigned to different letters at the same position.

shuffledP(self)
shuffledP(self) -- Construct a motif in which the letter distributions are preserved but                    are reassigned to rondom positions in the motif.

sortby(motiflist, property, REV=0)
sortby(motiflist, property, REV=0) -- Sort a motif list according to a particular property

submotif(self, beg, end)
submotif(self,beg,end) -- Utility function for extracting sub-motifs and padding motifs.                           Use slice functionality (m[2:4]) instead.

sum(motifs, weights=[])
sum(motifs,weights=[]) -- Perhaps better called 'average'.  Constructs a motif by averaging the                           probabilities at each position of the (pre-aligned) input motifs.  Optional                           weights can be assigned, and must be in the same order as the motifs.

toDict(M)
toDict(M) -- Convert a 2D array to a list of dictionaries (which is how the motif object              stores information internally).  Assumes M entries are in alphabetical order (ACGT)

toDictVect(V)
toDictVect(V) -- Convert a 1D vector to a dictionary of DNA letters.  Assumes values in V are in alphabetical order (ACGT).

top_nmers(N, seqs, with_counts=0, purge_Ns='')
top_nmers(N,seqs,with_counts = 0,purge_Ns = '') -- Assemble list of all nmers (kmers) with width 'N' from                                                    supplied sequences.  Option with_counts returns list                                                    of (kmer, count) tuples instead.  Purge N's ignores                                                    kmers containing N's.

txt2motifs(txt, VERBOSE=1)
txt2motifs(txt,VERBOSE=1) -- Convert a text string into a list of motifs:                              Examples:                              'TGASTCA,GAATC'      --> 2 motifs from ambiguity codes                              'results.tamo'       --> All motifs in TAMO-format file                              'results.tamo:34,45' --> Motifs 34 and 45 in TAMO-format file                              'results.pickle'     --> All motifs in pickle (list or dict of Motifs)                              'results.pickle%GAL4 --> 'GAL4' entry in results.pickle dictionary                              'results.pickle:34,45 -> Motifs 34 and 45 in results.pickle list

Data

ACGT = ['A', 'C', 'G', 'T']
YEAST_BG = {'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31}
one2two = {'K': 'GT', 'M': 'AC', 'R': 'AG', 'S': 'CG', 'W': 'AT', 'Y': 'CT'}
revcomp = {' ': 'N', 'A': 'T', 'B': 'N', 'C': 'G', 'D': 'N', 'G': 'C', 'H': 'N', 'K': 'M', 'M': 'K', 'N': 'N', ...}
revcompTBL = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
revcomplement_memo = {'A': 'T'}
two2one = {'AC': 'M', 'AG': 'R', 'AT': 'W', 'CG': 'S', 'CT': 'Y', 'GT': 'K'}

Data
		ACGT = ['A', 'C', 'G', 'T'] YEAST_BG = {'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31} one2two = {'K': 'GT', 'M': 'AC', 'R': 'AG', 'S': 'CG', 'W': 'AT', 'Y': 'CT'} revcomp = {' ': 'N', 'A': 'T', 'B': 'N', 'C': 'G', 'D': 'N', 'G': 'C', 'H': 'N', 'K': 'M', 'M': 'K', 'N': 'N', ...} revcompTBL = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()+,-./...\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff' revcomplement_memo* = {'A': 'T'} two2one = {'AC': 'M', 'AG': 'R', 'AT': 'W', 'CG': 'S', 'CT': 'Y', 'GT': 'K'}