Online Motif Discovery with WebMOTIFS

Clustering

When one does de novo motif discovery with multiple motifs, often many very similar motifs are discovered. To make it easier to analyze these motifs, WebMOTIFS clusters and averages motifs from basic, de novo motif discovery. This clustering step occurs after the motifs have been scored and filtered for significance.

WebMOTIFS' clustering and averaging methods are adapted from the methods used by Harbison et al.

Harbison et al. "Transcriptional regulatory code of a eukaryotic genome." Nature. 431(7004) (2004 Sept. 3):99-104.

The explanations below are based on the supplementary methods for that paper, which can be found here (PDF, external link).

Clustering

Motif discovery programs often discover a number of very similar motifs. In addition, if more than one motif discovery program is run, several programs may identify the same basic motif. To identify these common patterns, we cluster the significant motifs and average all members of the same cluster. The average motif can be represented and scored much like a single motif.

K-medoids Clustering: The set of significant motifs is clustered with k-medoids clustering, using the inter-motif distance metric described below. For a given number of clusters, we run the k-medoids algorithm 5 times to find a clustering with the minimal sum of within-cluster distances. To find the optimal number of clusters, we begin with 10 clusters, and then decrease the number of clusters until the average distances between members each medoid and the members of other clusters is sufficiently large (greater or equal to 0.15). This avoids splitting very similar sets of motifs into an arbitrarily large number of similar clusters.

For more information on the k-medoids clustering algorithm, see:

Hastie, T., Tibshirani, R. & Friedman, J. The elements of Statistical Learning; Data mining, inference and prediction. Springer-Verlag: New York, 2001.

Please note that WebMOTIFS' clustering algorithm treats dimers as single motifs. This is a very basic approach. An alternative approach would be to separate dimers into parts and cluster them into monomers, as in Jensen and Liu's hierarchical Bayesian clustering method. (Shane Jensen, Lei Shen, and Jun Liu. "Combining phylogenetic motif discovery and motif clustering topredict co-regulated genes." Bioinformatics 21(20) 2005: pp. 3832-9.)

Inter-Motif Distance: We use a distance metric to aid in the comparison of motifs. The distance D between two aligned motifs "a" and "b" is defined as, where w is the motif width, and ai,L and bi,L are the estimated probabilities of observing base L at position i of motifs a and b/ respectively.

In practice, the optimal alignment of motifs is not known. We therefore use the minimum distance between motifs among all alignments in which the motifs overlap by at least seven bases, or when the motifs are shorter, by 2 bases fewer than the shortest motif length. Alignments to the reverse complements of the motifs are included.

Motif Averaging

We compute a single motif representing each cluster. This motif is basically an average of the motifs in the cluster. We find the optimal alignment of the motifs in the cluster, then average the probabilities at each matrix position of the aligned motifs making up the cluster. [Low-information positions on the flanks of the averaged motifs are removed.] When computing a single motif to represent each cluster, we need to consider that each cluster can contain motifs from multiple motif discovery programs. Some of these programs tend to discover many nearly identical motifs, while others return only one motif of each type. To give both types of programs equal weight when computing the representative motif for a cluster, we include in our average only one motif of each type from each program.

Each average motif is then scored. For information on the metrics used, see the scoring and significance filtering page.

We provide the enrichment score, median enrichment z-score, bits, and group specificity scores for each average motif: these are calculated as described on the scoring and significance filtering page.