Simple motif operations: "A Taste of TAMO"
Type "python" at the command line and you should get the python prompt
Python 2.4 (#1, Dec 4 2004, 20:10:33)
[GCC 3.3.3 (cygwin special)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Load the module that contains the Motif class definition
>>> from
TAMO import MotifTools
Now you can create a motif. In the absence of other data, you can
build one from a text string:
>>> m =
MotifTools.Motif_from_text("TGASTCA")
Several commands are available to display the
motif. For example, it has a text string representation (stored
in a member variable called "oneletter").
>>> print m.oneletter
TGAsTCA
...and it is possible to print a text
"logo."
>>> m.printlogo()
# -------
# -- 2.30 bits
#
# TGA TCA
# TGA TCA
# TGA TCA
# TGA TCA
# TGACTCA
# TGAGTCA -- 0.29 bits
# -------
# TGAsTCA
The logo command has two formatting
parameters, which you can inspect by using "help."
>>> help(m.printlogo)
Help on method printlogo in module TAMO.MotifTools:
printlogo(self, norm=2.3, height=10.0) method of TAMO.MotifTools.Motif
instance
m.printlogo(,norm=2.3, height=10.0) -- Print a
text-rendering of the Motif Logo
norm -- maximum number of bits to show
height -- number of lines of text to use to render
logo
>>> m.printlogo(norm=2.0, height=5)
# -------
# -- 2.00 bits
# TGA TCA
# TGA TCA
# TGACTCA
# TGAGTCA -- 0.00 bits
# -------
# TGAsTCA
You can also print the various matrix
representations: log-likelihood, and probability, and
counts. Counts will be the same as probability unless the motif
is derived from a multiple-sequence alignment or from a matrix of
letter counts.
>>> m._print_ll()
#
0
1
2
3
4
5 6
#A -4.392
-4.392 1.948
-4.392 -4.392
-4.392 1.948
#C -4.392 -4.392
-4.392 0.965
-4.392 1.948 -4.392
#T 1.948
-4.392 -4.392
-4.392 1.948
-4.392 -4.392
#G -4.392
1.948 -4.392
0.965 -4.392
-4.392 -4.392
>>> m._print_p()
#
0
1
2
3
4
5 6
#A 0.012
0.012 0.964
0.012 0.012
0.012 0.964
#C 0.012
0.012 0.012
0.488 0.012
0.964 0.012
#T 0.964
0.012 0.012
0.012 0.964
0.012 0.012
#G 0.012
0.964 0.012
0.488 0.012
0.012 0.012
>>> m._print_counts()
#
0
1
2
3
4
5 6
#A 0.012
0.012 0.964
0.012 0.012
0.012 0.964
#C 0.012
0.012 0.012
0.488 0.012
0.964 0.012
#T 0.964
0.012 0.012
0.012 0.964
0.012 0.012
#G 0.012
0.964 0.012
0.488 0.012
0.012 0.012
>>> msa = ['TGACTCA',
...
'TGACTCA',
...
'TGAGTCA',
...
'TGAGTCA']
>>> m_msa
= MotifTools.Motif(msa)
>>> print
m_msa
TGAsTCA (4)
>>> m_msa._print_counts()
#
0
1
2
3
4
5 6
#A 0.000
0.000 4.000
0.000 0.000
0.000 4.000
#C 0.000
0.000 0.000
2.000 0.000
4.000 0.000
#T 4.000
0.000 0.000
0.000 4.000
0.000 0.000
#G 0.000
4.000 0.000
2.000 0.000
0.000 0.000
Note that when a text string or multiple sequence
alignment contains "N's," that TAMO uses pseudocounts based on the
motif's background model to populate the missing letters, and
renormalizes:
>>> msa =
['TGACTCA',
...
'TGANTCA',
...
'TGAGTCA',
...
'TGANTCA']
>>> m_msa
= MotifTools.Motif(msa, {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3})
>>>
m._print_counts()
#
0
1
2
3
4
5 6
#A 0.012
0.012 0.964
0.012 0.012
0.012 0.964
#C 0.012
0.012 0.012
0.488 0.012
0.964 0.012
#T 0.964
0.012 0.012
0.012 0.964
0.012 0.012
#G 0.012
0.964 0.012
0.488 0.012
0.012 0.012
If you have installed the weblogo package (see installation
instructions), you can generate those colorful sequence logos described
by Schneider and Stephens in 1990.
>>> filename
= m.giflogo('gcn4')
>>> print filename
gcn4.gif
Motifs can be manipluated much like text strings:
>>> print
m[0:3].oneletter
TGA
>>> print m[0:-1].oneletter
TGAsTC
>>> print (m[0:-1] + m[0:-1]).oneletter
TGAsTCTGAsTC
>>> m_long = m[0:-1] + m[0:3]
>>> print m_long.oneletter
TGAsTCTGA
Picking bounds "outside" of the motif generates "Ns" scaled according
to the backround model:
>>> print
m[-2,10].oneletter
..TGAsTCA...
For motifs of the same width, some comparison routines are provided:
>>> left = m[0:3]
>>> right = m[4:7]
>>> print left.oneletter,
right.oneletter
TGA TCA
>>> print left - right
1.3468700594
>>> print right - left
1.3468700594
>>> help(m.__sub__)
Help on method __sub__ in module TAMO.MotifTools:
__sub__(self, other) method of TAMO.MotifTools.Motif instance
m.__sub__(other) --- Overloads the '-'
operator to compute the Euclidean
distance between probability matrices for motifs of
equal width. Consult TAMO.Clustering.MotifCompare
for metrics to compare motifs of different widths
Motif are primarily useful for scanning sequences. TAMO provides
both "fast" and "slow" scanning functions. The "fast" ones are
accessed through the Probeset object in MotifMetrics.py.
For less demanding scanning, motifs methods can be used directly.
>>> m.scan('GGTGTAGACGACTATCCGTGACTCACGAAAACGCG')
(['TGACTCA'], [(18, 24)], [12.650430062474509])
>>> m.scan('GGTGTAGACGACTATCCGTTACTCACGAAAACGCG')
([], [], [])
The format
([matching strings], [start/stop, start/stop], [score, score,....]) is
convenient for some, but not all applications. Another variant is:
>>> m.bestscan('GGTGTAGACGACTATCCGTGACTCACGAAAACGCG')
12.650430062474509
>>>
m.bestscan('GGTGTAGACGACTATCCGTTACTCACGAAAACGCG')
6.3105800595898849
A typical match threshold to use is 70% of the best possible score:
>>> print
m.maxscore, m.maxscore * 0.7
12.6504300625 8.85530104373
>>> sequence = 'GGTGTAGACGACTATCCGTGACTCACGAAAACGCG'
>>> if
m.bestscan(sequence) > m.maxscore *0.7: print "MATCH!"
MATCH!
It is common to
work with collections of motifs. Collections can be stored to
disk using the built-in python serialization methods ("pickle" and
"shelve,") but TAMO also supports its own user-readable text
format. When stored in a list, collections of motifs can be
stored and reloaded using the "load" and "save_motifs" functions:
>>>
motifs = [MotifTools.Motif_from_text(x) for x in 'TGAsTCA ACCCCG
CCGNNNGCC'.split()]
>>> print motifs
[TGAsTCA (1), ACCCCG (1), CCG...GCC (1)]
>>> MotifTools.save_motifs(motifs,'motifs.tamo')
>>> newmotifs
= MotifTools.load('motifs.tamo')
>>> print
newmotifs
[TGAsTCA (1), ACCCCG (1), CCG...GCC (1)]
Inspect the file "motifs.tamo" to view the format.
The ".tamo" format is described at:
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/index.html#MotifFormat
Simple Sequence Scanning
The installation instructions suggested that you set the "TAMO"
environmental variable to be the head of the TAMO installation
directory. It should have a value similar to:
/usr/lib/python2.4/site-packages/TAMO/
For the remainder of this example, we'll assume that the variable is
set correctly, and can be accessed using "$TAMO" syntax.
First let's download some sample files:
In a new directory, download:
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/final_set/Final_InTableS2_v24.motifs
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/fsafiles/GCN4_YPD.fsa
The sequences in the GCN4_YPD.fsa are a subset of the intergenic
regions that were determined to be bound by the GCN4 when cells were
grown in rich media ("YPD").
First we'll scan for occurences of a typical GCN4 motif:
% python
$TAMO/Sitemap.py -f GCN4_YPD.fsa -L G -m TGASTCA
#/usr/lib/python2.4/site-packages/TAMO//Sitemap.py -f
GCN4_YPD.fsa -m TGASTCA -L G
# 1: [0]
G : TGAsTCA (1) 8.86 (0.70)
iYJR016C
------G--------
G TGAGTCA 132-138 12.65
iYOR301W
------G----------------
G TGACTCA 133-139 12.65
iYJL200C
-----------G---
G TGACTCA 226-232 12.65
<~35 additional lines...>
iYOL064C
-------G--
G TGAGTCA 150-156 12.65
iYGL184C
-------------------G------
G TGAGTCA 386-392 12.65
iYBR248C
----------G-------
G TGACTCA 200-206 12.65
Found matches in 40 of 59 input probes
The output shows graphically the sequences (as strings of "-'s") and
the placement of the matches to the motif (as "G's", because we
specified "-L G" on the command
line.) To the right each of each "sequence" is a summary of
the match, including the ID, the matching sequence, it's coordinates,
and its score.
We can look for more than one motif at once:
% python
$TAMO/Sitemap.py -f GCN4_YPD.fsa -L GM -m TGASTCA,AGGGG
#/usr/lib/python2.4/site-packages/TAMO//Sitemap.py -f GCN4_YPD.fsa -L
GM -m TGASTCA,AGGGG
# 2: [0, 1]
G : TGAsTCA (1) 8.86 (0.70)
M : AGGGG (1) 6.82 (0.70)
iYJR016C
------G--------
G TGAGTCA 132-138 12.65
iYOR301W
M--M--G-------M--------
G TGACTCA 133-139 12.65 | M CCCCT 5-9 9.74 | M
AGGGG 71-75 9.74 | M CCCCT 298-302
9.74
iYJL200C
-----------G---
G TGACTCA 226-232 12.65
iYOL064C
-------G--
G TGAGTCA 150-156 12.65
<~40 additional lines...>
Occurences of the first motif are noted with an "G," and
occurrences of the second with a "M" because we specified "-L GM" on the command line.
The "scale" (how many base per "-" symbol) and the threshold for match
(as a fraction of the motif maximum) are paramters that can be adjusted
via the "-S" and "-t" command line flags.
Simple Motif Scoring
Download all the same
files as described in "Simple Sequence Scanning" above, and copy
the file Yeast6kArray/yeast_Young_6k.fsa from your
"TAMOdata" directory. This file should have been downloaded
during
installation of the TAMO package. If it wasn't, type
$TAMO/GetDataFiles.py --Whitehead
and then copy the file to the current directory. The sequences in
the GCN4_YPD.fsa are a subset of the intergenic
regions in yeast_Young_6k.fsa that were determined to be bound by the
GCN4 when cells were grown in rich media ("YPD"). You
must have write permission in whichever directory the yeast_Young_k6.fsa
file resides.
NOTE: To demonstrate the generality of the approach, we'll use the
explicity "all sequences" file (yeast_Young_6k.fsa), which can be
replaced by whichever file suits your experiment. TAMO has a
built in alias for this file, "YEAST," that grabs this file from the
TAMOdata directory.
% python
$TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa TGASTCA
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa
-genome yeast_Young_6k.fsa TGASTCA
Writing yeast_Young_6k.fsa.pickle...
#Searching 1 motif
E: 233 / 6725, 40 / 59
found. p: 1.01e-45 TGASTCA
This file computes the "enrichment" score of the motif, figureed as the
significance of the number of sequences that contain the motif computed
according to a hypergeometric distribution.
Other available scores include the "Group Specificity Score" from the
Church lab and a new metric that accounts for both the numbers of
occurrences within each sequence and the lengths of the sequence, using
a bionomial distribution. For these metrics, it will be necessary
to force the motif to be a motif object instead of a regular
expression, using the "-M" option:
% python
$TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M
TGASTCA -spec
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa
-genome yeast_Young_6k.fsa -M TGASTCA -spec
#Forming PSSMs from text
#Searching 1 motifs
Church 59 / 6725, 40 / 233 top-ranked. p:
1.01e-45 TGAsTCA (1)
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa
-genome yeast_Young_6k.fsa -M TGASTCA -binomial
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa
-genome yeast_Young_6k.fsa -M TGASTCA -binomial
#Forming PSSMs from text
#Searching 1 motifs
Ov: Observed 42
/ 238, expecting
0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCA (1)
The scores are 1e-45 and 0, respectively. Reported as
-log10(score), the Enrichment, Specificity, and binomial
"Over-representation" scores are 45, 45, and >50.
The MotifMetrics.py script can be used to score several motifs at once:
% python
$TAMO/MotifMetrics.py GCN4_YPD.fsa -g yeast_Young_6k.fsa -M -binomial
TGASTCA CGGNNCCG AGGGG
#/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py
GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M -binomial TGASTCA CGGNNCCG
AGGGG
#Forming PSSMs from text
#Searching 3 motifs
Ov: Observed 42
/ 238, expecting
0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCA (1)
Ov: Observed 4
/ 151, expecting
0.0104424 ( 1.6 ) p: 7.46e-02 CGG..CCG (1)
Ov: Observed 26 /
2699, expecting 0.0104424 ( 28.2 )
p: 6.86e-01 AGGGG (1)
Or, better yet, it can be used to score all motifs in a file of motifs
(notice no need for "-M" here.)
% python
$TAMO/MotifMetrics.py GCN4_YPD.fsa -g yeast_Young_6k.fsa -binomial -t
Final_InTableS2_v24.motifs > All_Ov_Scores.out
% grep Ov
All_Ov_Scores.out | sort -gk 12 | head
Ov: Observed 42 / 238,
expecting 0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCa (1)
GCN4 GCN4_SM 0 121-GCN4
Ov: Observed 58 / 771, expecting 0.0104424 (
8.1 ) p: 1.05e-30 TGACTC (1) BAS1 BAS1_SM 0 2-BAS1
Ov: Observed 27 / 1705, expecting 0.0104424 ( 17.8 )
p: 2.45e-02 GATAAG (1) GZF3 Literature
Ov: Observed 5 / 160, expecting 0.0104424
( 1.7 ) p: 2.70e-02 CCG....CGG (1) UGA3 Literature
Ov: Observed 35 / 2355, expecting 0.0104424 ( 24.6 )
p: 2.71e-02 yTGACT (1) ASH1 Literature
Ov: Observed 8 / 338, expecting 0.0104424
( 3.5 ) p: 2.72e-02 tCACGTG (1) CBF1 CBF1_SM 0 72-CBF1
Here we see that when the lines containing "Ov" scores are sorted
according to the scores which are in the 12 column ("sort -gk 12"). The GCN4 motif is
the most significantly enriched, followed by BAS1 (which is a similar
motif.) The following motifs, with -log10(score) around "2," are
not significantly enriched, especially if you correct for multiple
hypothesis testing.
A similar approach can be used to search for all k-mers that with size
"k" or larger, but only if they occur at least once in the input
sequences (GCN4_YPD.fsa). The result is a kind of "quick and
dirty" motif discovery. Any of the metrics can be used, including
-spec, -binomial, -ROC_AUC, and the default, enrichment (no
arguments.) See the code for the MotifMetrics.py module for more
information.
% python
$TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -binomial
-n 7 > gcn4_7mers+.out
% grep Ov gcn4_7mers+.out | sort -gk 12 | head
Ov: Observed 42
/ 238, expecting
0.0104424 ( 2.5 ) p: 0.00e+00 TGACTCA (1)
Ov: Observed 19
/ 61,
expecting 0.0104424 ( 0.6 )
p: 4.45e-23 ATGACTCA (1)
Ov: Observed 15
/ 44,
expecting 0.0104424 ( 0.5 )
p: 3.31e-19 GTGACTCA (1)
Ov: Observed 26
/ 227, expecting
0.0104424 ( 2.4 ) p: 4.10e-19 ATGACTC (1)
Ov: Observed 15
/ 59,
expecting 0.0104424 ( 0.6 )
p: 4.96e-17 ATGAGTCA (1)
Ov: Observed 18
/ 158, expecting
0.0104424 ( 1.6 ) p: 1.17e-13 GAGTCAC (1)
Ov: Observed 10
/ 47,
expecting 0.0104424 ( 0.5 )
p: 5.61e-11 GTGAGTCA (1)
Ov: Observed 6
/ 10,
expecting 0.0104424 ( 0.1 )
p: 2.63e-10 GTGACTCAC (1)
Ov: Observed 7
/ 18,
expecting 0.0104424 ( 0.2 )
p: 3.90e-10 AATGACTCA (1)
Ov: Observed 17
/ 227, expecting
0.0104424 ( 2.4 ) p: 4.50e-10 ATGAGTC (1)
The calculation
produces a few thousand lines (one for each k-mer evaluated) and takes
about 10 minutes to run on a ~1.5 Ghz machine. The top-ranked
motifs are all incarnations of the correct specificity for this
dataset, TGAsTCA. The search scores k-mers that occur in high
numbers first, so that if aborted prematurely due to time restrictions,
the best scoring motifs are still likely to be detected.