Simple motif operations: "A Taste of TAMO"
Type "python" at the command line and you should get the python prompt
Python 2.4 (#1, Dec 4 2004, 20:10:33)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Load the module that contains the Motif class definition
>>> from TAMO import MotifTools
Now you can create a motif. In the absence of other data, you can build one from a text string:
>>> m = MotifTools.Motif_from_text("TGASTCA")
Several commands are available to display the motif. For example, it has a text string representation (stored in a member variable called "oneletter").
>>> print m.oneletter
TGAsTCA
...and it is possible to print a text "logo."
>>> m.printlogo()
# -------
# -- 2.30 bits
#
# TGA TCA
# TGA TCA
# TGA TCA
# TGA TCA
# TGACTCA
# TGAGTCA -- 0.29 bits
# -------
# TGAsTCA
The logo command has two formatting parameters, which you can inspect by using "help."
>>> help(m.printlogo)
Help on method printlogo in module TAMO.MotifTools:
printlogo(self, norm=2.3, height=10.0) method of TAMO.MotifTools.Motif instance
m.printlogo(,norm=2.3, height=10.0) -- Print a text-rendering of the Motif Logo
norm -- maximum number of bits to show
height -- number of lines of text to use to render logo
>>> m.printlogo(norm=2.0, height=5)
# -------
# -- 2.00 bits
# TGA TCA
# TGA TCA
# TGACTCA
# TGAGTCA -- 0.00 bits
# -------
# TGAsTCA
You can also print the various matrix representations: log-likelihood, and probability, and counts. Counts will be the same as probability unless the motif is derived from a multiple-sequence alignment or from a matrix of letter counts.
>>> m._print_ll()
# 0 1 2 3 4 5 6
#A -4.392 -4.392 1.948 -4.392 -4.392 -4.392 1.948
#C -4.392 -4.392 -4.392 0.965 -4.392 1.948 -4.392
#T 1.948 -4.392 -4.392 -4.392 1.948 -4.392 -4.392
#G -4.392 1.948 -4.392 0.965 -4.392 -4.392 -4.392
>>> m._print_p()
# 0 1 2 3 4 5 6
#A 0.012 0.012 0.964 0.012 0.012 0.012 0.964
#C 0.012 0.012 0.012 0.488 0.012 0.964 0.012
#T 0.964 0.012 0.012 0.012 0.964 0.012 0.012
#G 0.012 0.964 0.012 0.488 0.012 0.012 0.012
>>> m._print_counts()
# 0 1 2 3 4 5 6
#A 0.012 0.012 0.964 0.012 0.012 0.012 0.964
#C 0.012 0.012 0.012 0.488 0.012 0.964 0.012
#T 0.964 0.012 0.012 0.012 0.964 0.012 0.012
#G 0.012 0.964 0.012 0.488 0.012 0.012 0.012
>>> msa = ['TGACTCA',
... 'TGACTCA',
... 'TGAGTCA',
... 'TGAGTCA']
>>> m_msa = MotifTools.Motif(msa)
>>> print m_msa
TGAsTCA (4)
>>> m_msa._print_counts()
# 0 1 2 3 4 5 6
#A 0.000 0.000 4.000 0.000 0.000 0.000 4.000
#C 0.000 0.000 0.000 2.000 0.000 4.000 0.000
#T 4.000 0.000 0.000 0.000 4.000 0.000 0.000
#G 0.000 4.000 0.000 2.000 0.000 0.000 0.000
Note that when a text string or multiple sequence alignment contains "N's," that TAMO uses pseudocounts based on the motif's background model to populate the missing letters, and renormalizes:
>>> msa = ['TGACTCA',
... 'TGANTCA',
... 'TGAGTCA',
... 'TGANTCA']
>>> m_msa = MotifTools.Motif(msa, {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3})
>>> m._print_counts()
# 0 1 2 3 4 5 6
#A 0.012 0.012 0.964 0.012 0.012 0.012 0.964
#C 0.012 0.012 0.012 0.488 0.012 0.964 0.012
#T 0.964 0.012 0.012 0.012 0.964 0.012 0.012
#G 0.012 0.964 0.012 0.488 0.012 0.012 0.012
If you have installed the weblogo package (see installation instructions), you can generate those colorful sequence logos described by Schneider and Stephens in 1990.
>>> filename = m.giflogo('gcn4')
>>> print filename
gcn4.gif
Motifs can be manipluated much like text strings:
>>> print m[0:3].oneletter
TGA
>>> print m[0:-1].oneletter
TGAsTC
>>> print (m[0:-1] + m[0:-1]).oneletter
TGAsTCTGAsTC
>>> m_long = m[0:-1] + m[0:3]
>>> print m_long.oneletter
TGAsTCTGA
Picking bounds "outside" of the motif generates "Ns" scaled according to the backround model:
>>> print m[-2,10].oneletter
..TGAsTCA...
For motifs of the same width, some comparison routines are provided:
>>> left = m[0:3]
>>> right = m[4:7]
>>> print left.oneletter, right.oneletter
TGA TCA
>>> print left - right
1.3468700594
>>> print right - left
1.3468700594
>>> help(m.__sub__)
Help on method __sub__ in module TAMO.MotifTools:
__sub__(self, other) method of TAMO.MotifTools.Motif instance
m.__sub__(other) --- Overloads the '-' operator to compute the Euclidean
distance between probability matrices for motifs of
equal width. Consult TAMO.Clustering.MotifCompare
for metrics to compare motifs of different widths
Motif are primarily useful for scanning sequences. TAMO provides both "fast" and "slow" scanning functions. The "fast" ones are accessed through the Probeset object in MotifMetrics.py. For less demanding scanning, motifs methods can be used directly.
>>> m.scan('GGTGTAGACGACTATCCGTGACTCACGAAAACGCG')
(['TGACTCA'], [(18, 24)], [12.650430062474509])
>>> m.scan('GGTGTAGACGACTATCCGTTACTCACGAAAACGCG')
([], [], [])
The format ([matching strings], [start/stop, start/stop], [score, score,....]) is convenient for some, but not all applications. Another variant is:
>>> m.bestscan('GGTGTAGACGACTATCCGTGACTCACGAAAACGCG')
12.650430062474509
>>> m.bestscan('GGTGTAGACGACTATCCGTTACTCACGAAAACGCG')
6.3105800595898849
A typical match threshold to use is 70% of the best possible score:
>>> print m.maxscore, m.maxscore * 0.7
12.6504300625 8.85530104373
>>> sequence = 'GGTGTAGACGACTATCCGTGACTCACGAAAACGCG'
>>> if m.bestscan(sequence) > m.maxscore *0.7: print "MATCH!"
MATCH!
It is common to work with collections of motifs. Collections can be stored to disk using the built-in python serialization methods ("pickle" and "shelve,") but TAMO also supports its own user-readable text format. When stored in a list, collections of motifs can be stored and reloaded using the "load" and "save_motifs" functions:
>>> motifs = [MotifTools.Motif_from_text(x) for x in 'TGAsTCA ACCCCG CCGNNNGCC'.split()]
>>> print motifs
[TGAsTCA (1), ACCCCG (1), CCG...GCC (1)]
>>> MotifTools.save_motifs(motifs,'motifs.tamo')
>>> newmotifs = MotifTools.load('motifs.tamo')
>>> print newmotifs
[TGAsTCA (1), ACCCCG (1), CCG...GCC (1)]
Inspect the file "motifs.tamo" to view the format. The ".tamo" format is described at:
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/index.html#MotifFormat
Simple Sequence Scanning
The installation instructions suggested that you set the "TAMO" environmental variable to be the head of the TAMO installation directory. It should have a value similar to:
/usr/lib/python2.4/site-packages/TAMO/
For the remainder of this example, we'll assume that the variable is set correctly, and can be accessed using "$TAMO" syntax.
First let's download some sample files:
In a new directory, download:
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/final_set/Final_InTableS2_v24.motifs
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/fsafiles/GCN4_YPD.fsa
The sequences in the GCN4_YPD.fsa are a subset of the intergenic regions that were determined to be bound by the GCN4 when cells were grown in rich media ("YPD").
First we'll scan for occurences of a typical GCN4 motif:
% python $TAMO/Sitemap.py -f GCN4_YPD.fsa -L G -m TGASTCA
#/usr/lib/python2.4/site-packages/TAMO//Sitemap.py -f GCN4_YPD.fsa -m TGASTCA -L G
# 1: [0]
G : TGAsTCA (1) 8.86 (0.70)
iYJR016C ------G-------- G TGAGTCA 132-138 12.65
iYOR301W ------G---------------- G TGACTCA 133-139 12.65
iYJL200C -----------G--- G TGACTCA 226-232 12.65
<~35 additional lines...>
iYOL064C -------G-- G TGAGTCA 150-156 12.65
iYGL184C -------------------G------ G TGAGTCA 386-392 12.65
iYBR248C ----------G------- G TGACTCA 200-206 12.65
Found matches in 40 of 59 input probes
The output shows graphically the sequences (as strings of "-'s") and the placement of the matches to the motif (as "G's", because we specified "-L G" on the command line.) To the right each of each "sequence" is a summary of the match, including the ID, the matching sequence, it's coordinates, and its score.
We can look for more than one motif at once:
% python $TAMO/Sitemap.py -f GCN4_YPD.fsa -L GM -m TGASTCA,AGGGG
#/usr/lib/python2.4/site-packages/TAMO//Sitemap.py -f GCN4_YPD.fsa -L GM -m TGASTCA,AGGGG
# 2: [0, 1]
G : TGAsTCA (1) 8.86 (0.70)
M : AGGGG (1) 6.82 (0.70)
iYJR016C ------G-------- G TGAGTCA 132-138 12.65
iYOR301W M--M--G-------M-------- G TGACTCA 133-139 12.65 | M CCCCT 5-9 9.74 | M AGGGG 71-75 9.74 | M CCCCT 298-302 9.74
iYJL200C -----------G--- G TGACTCA 226-232 12.65
iYOL064C -------G-- G TGAGTCA 150-156 12.65
<~40 additional lines...>
Occurences of the first motif are noted with an "G," and occurrences of the second with a "M" because we specified "-L GM" on the command line.
The "scale" (how many base per "-" symbol) and the threshold for match (as a fraction of the motif maximum) are paramters that can be adjusted via the "-S" and "-t" command line flags.
Simple Motif Scoring
Download all the same files as described in "Simple Sequence Scanning" above, and copy the file Yeast6kArray/yeast_Young_6k.fsa from your "TAMOdata" directory. This file should have been downloaded during installation of the TAMO package. If it wasn't, type
$TAMO/GetDataFiles.py --Whitehead
and then copy the file to the current directory. The sequences in the GCN4_YPD.fsa are a subset of the intergenic regions in yeast_Young_6k.fsa that were determined to be bound by the GCN4 when cells were grown in rich media ("YPD"). You must have write permission in whichever directory the yeast_Young_k6.fsa file resides.
NOTE: To demonstrate the generality of the approach, we'll use the explicity "all sequences" file (yeast_Young_6k.fsa), which can be replaced by whichever file suits your experiment. TAMO has a built in alias for this file, "YEAST," that grabs this file from the TAMOdata directory.
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa TGASTCA
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa TGASTCA
Writing yeast_Young_6k.fsa.pickle...
#Searching 1 motif
E: 233 / 6725, 40 / 59 found. p: 1.01e-45 TGASTCA
This file computes the "enrichment" score of the motif, figureed as the significance of the number of sequences that contain the motif computed according to a hypergeometric distribution.
Other available scores include the "Group Specificity Score" from the Church lab and a new metric that accounts for both the numbers of occurrences within each sequence and the lengths of the sequence, using a bionomial distribution. For these metrics, it will be necessary to force the motif to be a motif object instead of a regular expression, using the "-M" option:
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -spec
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -spec
#Forming PSSMs from text
#Searching 1 motifs
Church 59 / 6725, 40 / 233 top-ranked. p: 1.01e-45 TGAsTCA (1)
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -binomial
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -binomial
#Forming PSSMs from text
#Searching 1 motifs
Ov: Observed 42 / 238, expecting 0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCA (1)
The scores are 1e-45 and 0, respectively. Reported as -log10(score), the Enrichment, Specificity, and binomial "Over-representation" scores are 45, 45, and >50.
The MotifMetrics.py script can be used to score several motifs at once:
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -g yeast_Young_6k.fsa -M -binomial TGASTCA CGGNNCCG AGGGG
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M -binomial TGASTCA CGGNNCCG AGGGG
#Forming PSSMs from text
#Searching 3 motifs
Ov: Observed 42 / 238, expecting 0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCA (1)
Ov: Observed 4 / 151, expecting 0.0104424 ( 1.6 ) p: 7.46e-02 CGG..CCG (1)
Ov: Observed 26 / 2699, expecting 0.0104424 ( 28.2 ) p: 6.86e-01 AGGGG (1)
Or, better yet, it can be used to score all motifs in a file of motifs (notice no need for "-M" here.)
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -g yeast_Young_6k.fsa -binomial -t Final_InTableS2_v24.motifs > All_Ov_Scores.out
% grep Ov All_Ov_Scores.out | sort -gk 12 | head
Ov: Observed 42 / 238, expecting 0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCa (1) GCN4 GCN4_SM 0 121-GCN4
Ov: Observed 58 / 771, expecting 0.0104424 ( 8.1 ) p: 1.05e-30 TGACTC (1) BAS1 BAS1_SM 0 2-BAS1
Ov: Observed 27 / 1705, expecting 0.0104424 ( 17.8 ) p: 2.45e-02 GATAAG (1) GZF3 Literature
Ov: Observed 5 / 160, expecting 0.0104424 ( 1.7 ) p: 2.70e-02 CCG....CGG (1) UGA3 Literature
Ov: Observed 35 / 2355, expecting 0.0104424 ( 24.6 ) p: 2.71e-02 yTGACT (1) ASH1 Literature
Ov: Observed 8 / 338, expecting 0.0104424 ( 3.5 ) p: 2.72e-02 tCACGTG (1) CBF1 CBF1_SM 0 72-CBF1
Here we see that when the lines containing "Ov" scores are sorted according to the scores which are in the 12 column ("sort -gk 12"). The GCN4 motif is the most significantly enriched, followed by BAS1 (which is a similar motif.) The following motifs, with -log10(score) around "2," are not significantly enriched, especially if you correct for multiple hypothesis testing.
A similar approach can be used to search for all k-mers that with size "k" or larger, but only if they occur at least once in the input sequences (GCN4_YPD.fsa). The result is a kind of "quick and dirty" motif discovery. Any of the metrics can be used, including -spec, -binomial, -ROC_AUC, and the default, enrichment (no arguments.) See the code for the MotifMetrics.py module for more information.
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -binomial -n 7 > gcn4_7mers+.out
% grep Ov gcn4_7mers+.out | sort -gk 12 | head
Ov: Observed 42 / 238, expecting 0.0104424 ( 2.5 ) p: 0.00e+00 TGACTCA (1)
Ov: Observed 19 / 61, expecting 0.0104424 ( 0.6 ) p: 4.45e-23 ATGACTCA (1)
Ov: Observed 15 / 44, expecting 0.0104424 ( 0.5 ) p: 3.31e-19 GTGACTCA (1)
Ov: Observed 26 / 227, expecting 0.0104424 ( 2.4 ) p: 4.10e-19 ATGACTC (1)
Ov: Observed 15 / 59, expecting 0.0104424 ( 0.6 ) p: 4.96e-17 ATGAGTCA (1)
Ov: Observed 18 / 158, expecting 0.0104424 ( 1.6 ) p: 1.17e-13 GAGTCAC (1)
Ov: Observed 10 / 47, expecting 0.0104424 ( 0.5 ) p: 5.61e-11 GTGAGTCA (1)
Ov: Observed 6 / 10, expecting 0.0104424 ( 0.1 ) p: 2.63e-10 GTGACTCAC (1)
Ov: Observed 7 / 18, expecting 0.0104424 ( 0.2 ) p: 3.90e-10 AATGACTCA (1)
Ov: Observed 17 / 227, expecting 0.0104424 ( 2.4 ) p: 4.50e-10 ATGAGTC (1)
The calculation produces a few thousand lines (one for each k-mer evaluated) and takes about 10 minutes to run on a ~1.5 Ghz machine. The top-ranked motifs are all incarnations of the correct specificity for this dataset, TGAsTCA. The search scores k-mers that occur in high numbers first, so that if aborted prematurely due to time restrictions, the best scoring motifs are still likely to be detected.