Systems Biology of Disease

This page introduces some of the basic functionality of the TAMO package. More information can be obtained from browing the source code and documentation. Printer friendly version is here.
Simple motif operations: "A Taste of TAMO"

Type "python" at the command line and you should get the python prompt

Python 2.4 (#1, Dec 4 2004, 20:10:33) [GCC 3.3.3 (cygwin special)] on cygwin Type "help", "copyright", "credits" or "license" for more information. >>>
Load the module that contains the Motif class definition

>>> from TAMO import MotifTools
Now you can create a motif. In the absence of other data, you can build one from a text string:

>>> m = MotifTools.Motif_from_text("TGASTCA")
Several commands are available to display the motif. For example, it has a text string representation (stored in a member variable called "oneletter").>>> print m.oneletter TGAsTCA...and it is possible to print a text "logo."
>>> m.printlogo() # ------- #          -- 2.30 bits #         # TGA TCA # TGA TCA # TGA TCA # TGA TCA # TGACTCA # TGAGTCA -- 0.29 bits # ------- # TGAsTCAThe logo command has two formatting parameters, which you can inspect by using "help."
>>> help(m.printlogo) Help on method printlogo in module TAMO.MotifTools: printlogo(self, norm=2.3, height=10.0) method of TAMO.MotifTools.Motif instance     m.printlogo(,norm=2.3, height=10.0) -- Print a text-rendering of the Motif Logo         norm   -- maximum number of bits to show     height -- number of lines of text to use to render logo >>> m.printlogo(norm=2.0, height=5) # ------- #          -- 2.00 bits # TGA TCA # TGA TCA # TGACTCA # TGAGTCA -- 0.00 bits # ------- # TGAsTCAYou can also print the various matrix representations: log-likelihood, and probability, and counts. Counts will be the same as probability unless the motif is derived from a multiple-sequence alignment or from a matrix of letter counts.
>>> m._print_ll() #        0         1         2         3         4         5         6   #A    -4.392    -4.392     1.948    -4.392    -4.392    -4.392     1.948 #C    -4.392    -4.392    -4.392     0.965    -4.392     1.948    -4.392 #T     1.948    -4.392    -4.392    -4.392     1.948    -4.392    -4.392 #G    -4.392     1.948    -4.392     0.965    -4.392    -4.392    -4.392 >>> m._print_p() #        0         1         2         3         4         5         6   #A     0.012     0.012     0.964     0.012     0.012     0.012     0.964 #C     0.012     0.012     0.012     0.488     0.012     0.964     0.012 #T     0.964     0.012     0.012     0.012     0.964     0.012     0.012 #G     0.012     0.964     0.012     0.488     0.012     0.012     0.012 >>> m._print_counts() #        0         1         2         3         4         5         6   #A     0.012     0.012     0.964     0.012     0.012     0.012     0.964 #C     0.012     0.012     0.012     0.488     0.012     0.964     0.012 #T     0.964     0.012     0.012     0.012     0.964     0.012     0.012 #G     0.012     0.964     0.012     0.488     0.012     0.012     0.012
>>> msa = ['TGACTCA', ...        'TGACTCA', ...        'TGAGTCA', ...        'TGAGTCA'] >>> m_msa = MotifTools.Motif(msa) >>> print m_msa TGAsTCA (4) >>> m_msa._print_counts() #        0         1         2         3         4         5         6   #A     0.000     0.000     4.000     0.000     0.000     0.000     4.000 #C     0.000     0.000     0.000     2.000     0.000     4.000     0.000 #T     4.000     0.000     0.000     0.000     4.000     0.000     0.000 #G     0.000     4.000     0.000     2.000     0.000     0.000     0.000
Note that when a text string or multiple sequence alignment contains "N's," that TAMO uses pseudocounts based on the motif's background model to populate the missing letters, and renormalizes:

>>> msa = ['TGACTCA', ...        'TGANTCA', ...        'TGAGTCA', ...        'TGANTCA'] >>> m_msa = MotifTools.Motif(msa, {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3})>>> m._print_counts() #        0         1         2         3         4         5         6   #A     0.012     0.012     0.964     0.012     0.012     0.012     0.964 #C     0.012     0.012     0.012     0.488     0.012     0.964     0.012 #T     0.964     0.012     0.012     0.012     0.964     0.012     0.012 #G     0.012     0.964     0.012     0.488     0.012     0.012     0.012
If you have installed the weblogo package (see installation instructions), you can generate those colorful sequence logos described by Schneider and Stephens in 1990.

>>> filename = m.giflogo('gcn4') >>> print filename gcn4.gif
Motifs can be manipluated much like text strings:

>>> print m[0:3].oneletter TGA >>> print m[0:-1].oneletter TGAsTC >>> print (m[0:-1] + m[0:-1]).oneletter TGAsTCTGAsTC >>> m_long = m[0:-1] + m[0:3] >>> print m_long.oneletter TGAsTCTGA

Picking bounds "outside" of the motif generates "Ns" scaled according to the backround model:

>>> print m[-2,10].oneletter ..TGAsTCA...
For motifs of the same width, some comparison routines are provided:
>>> left = m[0:3] >>> right = m[4:7] >>> print left.oneletter, right.oneletter TGA TCA >>> print left - right 1.3468700594 >>> print right - left 1.3468700594 >>> help(m.__sub__) Help on method __sub__ in module TAMO.MotifTools: __sub__(self, other) method of TAMO.MotifTools.Motif instance     m.__sub__(other) --- Overloads the '-' operator to compute the Euclidean                           distance between probability matrices for motifs of                           equal width. Consult TAMO.Clustering.MotifCompare                           for metrics to compare motifs of different widths
Motif are primarily useful for scanning sequences. TAMO provides both "fast" and "slow" scanning functions. The "fast" ones are accessed through the Probeset object in MotifMetrics.py.   For less demanding scanning, motifs methods can be used directly.

>>> m.scan('GGTGTAGACGACTATCCGTGACTCACGAAAACGCG') (['TGACTCA'], [(18, 24)], [12.650430062474509]) >>> m.scan('GGTGTAGACGACTATCCGTTACTCACGAAAACGCG') ([], [], [])The format ([matching strings], [start/stop, start/stop], [score, score,....]) is convenient for some, but not all applications. Another variant is:
>>> m.bestscan('GGTGTAGACGACTATCCGTGACTCACGAAAACGCG') 12.650430062474509>>> m.bestscan('GGTGTAGACGACTATCCGTTACTCACGAAAACGCG') 6.3105800595898849
A typical match threshold to use is 70% of the best possible score:

>>> print m.maxscore, m.maxscore * 0.7 12.6504300625 8.85530104373 >>> sequence = 'GGTGTAGACGACTATCCGTGACTCACGAAAACGCG'
>>> if m.bestscan(sequence) > m.maxscore *0.7: print "MATCH!" MATCH!It is common to work with collections of motifs. Collections can be stored to disk using the built-in python serialization methods ("pickle" and "shelve,") but TAMO also supports its own user-readable text format. When stored in a list, collections of motifs can be stored and reloaded using the "load" and "save_motifs" functions:

>>> motifs = [MotifTools.Motif_from_text(x) for x in 'TGAsTCA ACCCCG CCGNNNGCC'.split()] >>> print motifs [TGAsTCA (1), ACCCCG (1), CCG...GCC (1)]>>> MotifTools.save_motifs(motifs,'motifs.tamo') >>> newmotifs = MotifTools.load('motifs.tamo') >>> print newmotifs [TGAsTCA (1), ACCCCG (1), CCG...GCC (1)]

Inspect the file "motifs.tamo" to view the format. The ".tamo" format is described at:

https://jura.wi.mit.edu/fraenkel/regcode/release_v24/index.html#MotifFormat

Simple Sequence Scanning

The installation instructions suggested that you set the "TAMO" environmental variable to be the head of the TAMO installation directory. It should have a value similar to:

/usr/lib/python2.4/site-packages/TAMO/

For the remainder of this example, we'll assume that the variable is set correctly, and can be accessed using "$TAMO" syntax.

First let's download some sample files:

In a new directory, download:

https://jura.wi.mit.edu/fraenkel/regcode/release_v24/final_set/Final_InTableS2_v24.motifs
https://jura.wi.mit.edu/fraenkel/regcode/release_v24/fsafiles/GCN4_YPD.fsa

The sequences in the GCN4_YPD.fsa are a subset of the intergenic regions that were determined to be bound by the GCN4 when cells were grown in rich media ("YPD").

First we'll scan for occurences of a typical GCN4 motif:

% python $TAMO/Sitemap.py -f GCN4_YPD.fsa-L G-m TGASTCA #/usr/lib/python2.4/site-packages/TAMO//Sitemap.py -f GCN4_YPD.fsa -m TGASTCA -L G # 1: [0] G   : TGAsTCA (1) 8.86 (0.70) iYJR016C       ------G--------                           G TGAGTCA 132-138 12.65    iYOR301W       ------G----------------                   G TGACTCA 133-139 12.65    iYJL200C       -----------G---                           G TGACTCA 226-232 12.65   <~35 additional lines...>iYOL064C       -------G--                                G TGAGTCA 150-156 12.65    iYGL184C       -------------------G------                G TGAGTCA 386-392 12.65    iYBR248C       ----------G-------                        G TGACTCA 200-206 12.65    Found matches in 40 of 59 input probes
The output shows graphically the sequences (as strings of "-'s") and the placement of the matches to the motif (as "G's", because we specified "-L G" on the command line.) To the right each of each "sequence" is a summary of the match, including the ID, the matching sequence, it's coordinates, and its score.

We can look for more than one motif at once:

% python $TAMO/Sitemap.py -f GCN4_YPD.fsa -L GM -m TGASTCA,AGGGG #/usr/lib/python2.4/site-packages/TAMO//Sitemap.py -f GCN4_YPD.fsa -L GM -m TGASTCA,AGGGG # 2: [0, 1] G   : TGAsTCA (1) 8.86 (0.70) M   : AGGGG (1) 6.82 (0.70) iYJR016C       ------G--------                           G TGAGTCA 132-138 12.65    iYOR301W       M--M--G-------M--------                   G TGACTCA 133-139 12.65    | M CCCCT 5-9 9.74 | M AGGGG 71-75 9.74 | M CCCCT 298-302 9.74       iYJL200C       -----------G---                           G TGACTCA 226-232 12.65    iYOL064C       -------G--                                G TGAGTCA 150-156 12.65   <~40 additional lines...>

Occurences of the first motif are noted with an "G," and occurrences of the second with a "M" because we specified "-L GM" on the command line.

The "scale" (how many base per "-" symbol) and the threshold for match (as a fraction of the motif maximum) are paramters that can be adjusted via the "-S" and "-t" command line flags.

Simple Motif Scoring

Download all the same files as described in "Simple Sequence Scanning" above, and copy the file Yeast6kArray/yeast_Young_6k.fsa from your "TAMOdata" directory. This file should have been downloaded during installation of the TAMO package. If it wasn't, type

$TAMO/GetDataFiles.py --Whitehead
and then copy the file to the current directory. The sequences in the GCN4_YPD.fsa are a subset of the intergenic regions in yeast_Young_6k.fsa that were determined to be bound by the GCN4 when cells were grown in rich media ("YPD").   You must have write permission in whichever directory the yeast_Young_k6.fsa file resides.

NOTE: To demonstrate the generality of the approach, we'll use the explicity "all sequences" file (yeast_Young_6k.fsa), which can be replaced by whichever file suits your experiment. TAMO has a built in alias for this file, "YEAST," that grabs this file from the TAMOdata directory.

% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa TGASTCA #/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa TGASTCA Writing yeast_Young_6k.fsa.pickle... #Searching 1 motif E:      233 / 6725, 40 /   59 found.    p: 1.01e-45 TGASTCA

This file computes the "enrichment" score of the motif, figureed as the significance of the number of sequences that contain the motif computed according to a hypergeometric distribution.

Other available scores include the "Group Specificity Score" from the Church lab and a new metric that accounts for both the numbers of occurrences within each sequence and the lengths of the sequence, using a bionomial distribution. For these metrics, it will be necessary to force the motif to be a motif object instead of a regular expression, using the "-M" option:

% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -spec #/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -spec #Forming PSSMs from text #Searching 1 motifs Church   59 / 6725, 40 / 233 top-ranked. p: 1.01e-45 TGAsTCA (1) % python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -binomial #/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M TGASTCA -binomial #Forming PSSMs from text #Searching 1 motifs Ov: Observed     42 /      238, expecting     0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCA (1)
The scores are 1e-45 and 0, respectively. Reported as -log10(score), the Enrichment, Specificity, and binomial "Over-representation" scores are 45, 45, and >50.

The MotifMetrics.py script can be used to score several motifs at once:
% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -g yeast_Young_6k.fsa -M -binomial TGASTCA CGGNNCCG AGGGG #/usr/lib/python2.4/site-packages/TAMO//MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -M -binomial TGASTCA CGGNNCCG AGGGG #Forming PSSMs from text #Searching 3 motifs Ov: Observed     42 /      238, expecting     0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCA (1) Ov: Observed      4 /      151, expecting     0.0104424 ( 1.6 ) p: 7.46e-02 CGG..CCG (1) Ov: Observed     26 /     2699, expecting     0.0104424 ( 28.2 ) p: 6.86e-01 AGGGG (1)
Or, better yet, it can be used to score all motifs in a file of motifs (notice no need for "-M" here.)

% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -g yeast_Young_6k.fsa -binomial -t Final_InTableS2_v24.motifs > All_Ov_Scores.out % grep Ov All_Ov_Scores.out | sort -gk 12 | headOv: Observed 42 / 238, expecting 0.0104424 ( 2.5 ) p: 0.00e+00 TGAsTCa (1) GCN4 GCN4_SM 0 121-GCN4 Ov: Observed 58 / 771, expecting 0.0104424 ( 8.1 ) p: 1.05e-30 TGACTC (1) BAS1 BAS1_SM 0 2-BAS1 Ov: Observed 27 / 1705, expecting 0.0104424 ( 17.8 ) p: 2.45e-02 GATAAG (1) GZF3 Literature Ov: Observed 5 / 160, expecting 0.0104424 ( 1.7 ) p: 2.70e-02 CCG....CGG (1) UGA3 Literature Ov: Observed 35 / 2355, expecting 0.0104424 ( 24.6 ) p: 2.71e-02 yTGACT (1) ASH1 Literature Ov: Observed 8 / 338, expecting 0.0104424 ( 3.5 ) p: 2.72e-02 tCACGTG (1) CBF1 CBF1_SM 0 72-CBF1
Here we see that when the lines containing "Ov" scores are sorted according to the scores which are in the 12 column ("sort -gk 12"). The GCN4 motif is the most significantly enriched, followed by BAS1 (which is a similar motif.) The following motifs, with -log10(score) around "2," are not significantly enriched, especially if you correct for multiple hypothesis testing.

A similar approach can be used to search for all k-mers that with size "k" or larger, but only if they occur at least once in the input sequences (GCN4_YPD.fsa). The result is a kind of "quick and dirty" motif discovery. Any of the metrics can be used, including -spec, -binomial, -ROC_AUC, and the default, enrichment (no arguments.) See the code for the MotifMetrics.py module for more information.

% python $TAMO/MotifMetrics.py GCN4_YPD.fsa -genome yeast_Young_6k.fsa -binomial -n 7 > gcn4_7mers+.out%grep Ov gcn4_7mers+.out | sort -gk 12 | head Ov: Observed     42 /      238, expecting     0.0104424 ( 2.5 ) p: 0.00e+00 TGACTCA (1) Ov: Observed     19 /       61, expecting     0.0104424 ( 0.6 ) p: 4.45e-23 ATGACTCA (1) Ov: Observed     15 /       44, expecting     0.0104424 ( 0.5 ) p: 3.31e-19 GTGACTCA (1) Ov: Observed     26 /      227, expecting     0.0104424 ( 2.4 ) p: 4.10e-19 ATGACTC (1) Ov: Observed     15 /       59, expecting     0.0104424 ( 0.6 ) p: 4.96e-17 ATGAGTCA (1) Ov: Observed     18 /      158, expecting     0.0104424 ( 1.6 ) p: 1.17e-13 GAGTCAC (1) Ov: Observed     10 /       47, expecting     0.0104424 ( 0.5 ) p: 5.61e-11 GTGAGTCA (1) Ov: Observed      6 /       10, expecting     0.0104424 ( 0.1 ) p: 2.63e-10 GTGACTCAC (1) Ov: Observed      7 /       18, expecting     0.0104424 ( 0.2 ) p: 3.90e-10 AATGACTCA (1) Ov: Observed     17 /      227, expecting     0.0104424 ( 2.4 ) p: 4.50e-10 ATGAGTC (1)The calculation produces a few thousand lines (one for each k-mer evaluated) and takes about 10 minutes to run on a ~1.5 Ghz machine. The top-ranked motifs are all incarnations of the correct specificity for this dataset, TGAsTCA. The search scores k-mers that occur in high numbers first, so that if aborted prematurely due to time restrictions, the best scoring motifs are still likely to be detected.