JASPAR is a popular open-access database for matrix models describing DNA-binding preferences for transcription factors and other DNA patterns. With its third major release, JASPAR has been expanded and equipped with additional functions aimed at both casual and power users. The heart of the JASPAR database-the JASPAR CORE sub-database-has increased by 12% in size, and three new specialized sub-databases have been added. New functions include clustering of matrix models by similarity, generation of random matrices by sampling from selected sets of existing models and a language-independent Web Service applications programming interface for matrix retrieval. JASPAR is available at http://jaspar.genereg.net.
The coiled-coil protein domain is a widespread structural motif known to be involved in a wealth of key interactions in cells and organisms. Coiled-coil recognition and prediction of their location in a protein sequence are important steps for modeling protein structure and function. Nowadays, thanks to the increasing number of experimentally determined protein structures, a significant number of coiled-coil protein domains is available. This enables the development of methods suited to predict the coiled-coil structural motifs starting from the protein sequence. Several methods have been developed to predict classical heptads using manually annotated coiled-coil domains. In this paper we focus on the prediction structurally-determined coiled-coil segments. We introduce a new method based on hidden Markov models that complement the existing methods and outperforms them in the task of locating structurallydefined coiled-coil segments.
MicroRNAs are emerging as important regulators of cancer-related processes. The miR-21 microRNA is over-expressed in a wide variety of cancers and has been causally linked to cellular proliferation, apoptosis and migration. Inhibition of mir-21 in MCF-7 breast cancer cells causes reduced cell growth. Using array expression analysis of MCF-7 cells depleted of miR-21 we identify mRNA targets of mir-21 and show a link between miR-21 and the p53 tumor suppressor protein. We furthermore find that the tumor suppressor protein Programmed Cell Death 4 (PDCD4) is regulated by miR-21 and demonstrate that PDCD4 is a functionally important target for miR-21 in breast cancer cells.
Localized mRNAs are transported to sites of local protein synthesis in large RNP granules, but their molecular composition is incompletely understood. IMP zipcode-binding proteins participate in mRNA localization, and in motile cells IMP-containing granules are dispersed around the nucleus and in cellular protrusions. We isolated the IMP1-containing RNP granules, and found that they represent a unique RNP entity, distinct from neuronal hStaufen and/or FMRP granules, P-bodies and stress-granules. Granules were 100-300 nm in diameter and consisted of IMPs, 40S ribosomal subunits, shuttling hnRNPs, poly(A)-binding proteins, and mRNAs. Moreover, granules contained CBP80 and factors belonging to the exon-junction complex, and lacked eIF4E, eIF4G and 60S ribosomal subunits, indicating that embodied mRNAs are not translated. Granules embodied mRNAs corresponding to about 3 of the HEK293 mRNA transcriptome. Messenger RNAs encoding proteins participating in the secretory pathway and ER-associated quality control, as well as ubiquitin-dependent metabolism, were enriched in the granules, reinforcing the concept of RNP granules as post-transcriptional operons.
When using conventional transmembrane topology and signal peptide predictors, such as TMHMM and SignalP, there is a substantial overlap between these two types of predictions. Applying these methods to five complete proteomes, we found that 30-65% of all predicted signal peptides and 25-35 of all predicted transmembrane topologies overlap. This impairs predictions of 5-10% of the proteome, hence this is an important issue in protein annotation. To address this problem, we previously designed a hidden Markov model, Phobius, that combines transmembrane topology and signal peptide predictions. The method makes an optimal choice between transmembrane segments and signal peptides, and also allows constrained and homology-enriched predictions. We here present a web interface (http://phobius.cgb.ki.se and http://phobius.binf.ku.dk) to access Phobius.
MOTIVATION: As more non-coding RNAs are discovered, the importance of methods for RNA analysis increases. Since the structure of ncRNA is intimately tied to the function of the molecule, programs for RNA structure prediction are necessary tools in this growing field of research. Furthermore, it is known that RNA structure is often evolutionarily more conserved than sequence. However, few existing methods are capable of simultaneously considering multiple sequence alignment and structure prediction. RESULTS: We present a novel solution to the problem of simultaneous structure prediction and multiple alignment of RNA sequences. Using Markov chain Monte Carlo in a simulated annealing framework, the algorithm MASTR (Multiple Alignment of STructural RNAs) iteratively improves both sequence alignment and structure prediction for a set of RNA sequences. This is done by minimizing a combined cost function that considers sequence conservation, covariation and basepairing probabilities. The results show that the method is very competitive to similar programs available today, both in terms of accuracy and computational efficiency. AVAILABILITY: Source code available from http://mastr.binf.ku.dk/ CONTACT: stinus@binf.ku.dk.
ABSTRACT: BACKGROUND: The prediction of the secondary structure of proteins is one of the most studied problems in bioinformatics. Despite their success in many problems of biological sequence analysis, Hidden Markov Models (HMMs) have not been used much for this problem, as the complexity of the task makes manual design of HMMs difficult. Therefore, we have developed a method for evolving the structure of HMMs automatically, using Genetic Algorithms (GAs). RESULTS: In the GA procedure, populations of HMMs are assembled from biologically meaningful building blocks. Mutation and crossover operators were designed to explore the space of such block-HMMs. After each step of the GA, the standard HMM estimation algorithm (the Baum-Welch algorithm) was used to update model parameters. The final HMM captures several features of protein sequence and structure, with its own HMM grammar. In contrast to neural network based predictors, the evolved HMM also calculates the probabilities associated with the predictions. We carefully examined the performance of the HMM based predictor, both under the multiple- and single-sequence condition. CONCLUSIONS: We have shown that the proposed evolutionary method can automatically design the topology of HMMs. The method reads the grammar of protein sequences and converts it into the grammar of an HMM. It improved previously suggested volutionary methods and increased the prediction quality. Especially, it shows good performance under the single-sequence condition and provides probabilistic information on the prediction result. The protein secondary structure predictor using HMMs (P.S.HMM) is on-line available (http://www.binf.ku.dk/ won/pshmm.htm). It runs under the single-sequence condition.
The annotation efforts of the BIOSAPIENS European Network of Excellence have generated several distributed annotation systems (DAS) with the aim of integrating Bioinformatics resources and annotating metazoan genomes (http://www.biosapiens.info). In this context, the PONGO DAS server (http://pongo.biocomp.unibo.it) provides the annotation on predictive basis for the all-alpha membrane proteins in the human genome, not only through DAS queries, but also directly using a simple web interface. In order to produce a more comprehensive analysis of the sequence at hand, this annotation is carried out with four selected and high scoring predictors: TMHMM2.0, MEMSAT, PRODIV and ENSEMBLE1.0. The stored and pre-computed predictions for the human proteins can be searched and displayed in a graphical view. However the web service allows the prediction of the topology of any kind of putative membrane proteins, regardless of the organism and more importantly with the same sequence profile for a given sequence when required. Here we present a new web server that incorporates the state-of-the-art topology predictors in a single framework, so that putative users can interactively compare and evaluate four predictions simultaneously for a given sequence. Together with the predicted topology, the server also displays a signal peptide prediction determined with SPEP. The PONGO web server is available at http://pongo.biocomp.unibo.it/pongo.
The prediction of protein structure from sequence remains a major unsolved problem in biology. The most successful protein structure prediction methods make use of a divide-and-conquer strategy to attack the problem: a conformational sampling method generates plausible candidate structures, which are subsequently accepted or rejected using an energy function. Conceptually, this often corresponds to separating local structural bias from the long-range interactions that stabilize the compact, native state. However, sampling protein conformations that are compatible with the local structural bias encoded in a given protein sequence is a long-standing open problem, especially in continuous space. We describe an elegant and mathematically rigorous method to do this, and show that it readily generates native-like protein conformations simply by enforcing compactness. Our results have far-reaching implications for protein structure prediction, determination, simulation, and design.
Chlamydia trachomatis is the aetiological agent of trachoma and sexually transmitted infections. The C. trachomatis genome sequence revealed an organism adapted to the intracellular habitat with a high coding ratio and a small genome consisting of 1.042-kilobase (kb) with 895 annotated protein coding genes. Here, we repredict the protein-coding genes of the C. trachomatis genome using the gene-finder EasyGene that was trained specifically for C. trachomatis, and compare it with the primary C. trachomatis annotation. Our work predicts 15 genes not listed in the primary annotation and 853 that are in agreement with the primary annotation. Forty two genes from the primary annotation are not predicted by EasyGene. The majority of these genes are listed as hypothetical in the primary annotation. The 15 novel predicted genes all overlap with genes on the complementary strand. We find homologues of several of the novel genes in C. trachomatis Serovar A and Chlamydia muridarum. Several of the genes have typical gene-like and protein-like features. Furthermore, we confirm transcriptional activity from 10 of the putative genes. The combined evidence suggests that at least seven of the 15 are protein coding genes. The data suggest the presence of overlapping active genes in C. trachomatis.
MOTIVATION: The importance of non-coding RNAs is becoming increasingly evident, and often the function of these molecules depends on the structure. It is common to use alignments of related RNA sequences to deduce the consensus secondary structure by detecting patterns of co-evolution. A central part of such an analysis is to measure covariation between two positions in an alignment. Here, we rank various measures ranging from simple mutual information to more advanced covariation measures. RESULTS: Mutual information is still used for secondary structure prediction, but the results of this study indicate which measures are useful. Incorporating more structural information by considering e.g. indels and stacking improves accuracy, suggesting that physically realistic measures yield improved predictions. This can be used to improve both current and future programs for secondary structure prediction. The best measure tested is the RNAalifold covariation measure modified to include stacking. AVAILABILITY: Scripts, data and supplementary material can be found at http://www.binf.ku.dk/Stinus_covariation
ABSTRACT: BACKGROUND: The number of sequenced eukaryotic genomes is rapidly increasing. This means that over time it will be hard to keep supplying customised gene finders for each genome. This calls for procedures to automatically generate species-specific gene finders and to re-train them as the quantity and quality of reliable gene annotation grows. RESULTS: We present a procedure, Agene, that automatically generates a species-specific gene predictor from a set of reliable mRNA sequences and a genome. We apply a Hidden Markov model (HMM) that implements explicit length distribution modelling for all gene structure blocks using acyclic discrete phase type distributions. The state structure of the each HMM is generated dynamically from an array of sub-models to include only gene features represented in the training set. CONCLUSIONS: Acyclic discrete phase type distributions are well suited to model sequence length distributions. The performance of each individual gene predictor on each individual genome is comparable to the best of the manually optimised species-specific gene finders. It is shown that species-specific gene finders are superior to gene finders trained on other species.
ABSTRACT: BACKGROUND: Genomic tiling micro arrays have great potential for identifying previously undiscovered coding as well as non-coding transcription. To-date, however, analyses of these data have been performed in an ad hoc fashion. RESULTS: We present a probabilistic procedure, ExpressHMM, that adaptively models tiling data prior to predicting expression on genomic sequence. A hidden Markov model (HMM) is used to model the distributions of tiling array probe scores in expressed and non-expressed regions. The HMM is trained on sets of probes mapped to regions of annotated expression and non-expression. Subsequently, prediction of transcribed fragments is made on tiled genomic sequence. The prediction is accompanied by an expression probability curve for visual inspection of the supporting evidence. We test ExpressHMM on data from the Cheng et al. (2005) tiling array experiments on ten Human chromosomes. Results can be downloaded and viewed from our web site. CONCLUSIONS: The value of adaptive modelling of fluorescence scores prior to categorisation into expressed and non-expressed probes is demonstrated. Our results indicate that our adaptive approach is superior to the previous analysis in terms of nucleotide sensitivity and transfrag specificity.
A genetic algorithm (GA) is proposed for finding the structure of hidden Markov Models (HMMs) used for biological sequence analysis. The GA is designed to preserve biologically meaningful building blocks. The search through the space of HMM structures is combined with optimization of the emission and transition probabilities using the classic Baum u 2013Welch algorithm. The system is tested on the problem of finding the promoter and coding region of C. jejuni. The resulting HMM has a superior discrimination ability to a handcrafted model that has been published in the literature.
MOTIVATION: When predicting sequence features like transmembrane topology, signal peptides, coil-coil structures, protein secondary structure or genes, extra support can be gained from homologs. RESULTS: We present here a general hidden Markov model (HMM) decoding algorithm that combines probabilities for sequence features of homologs by considering the average of the posterior label probability of each position in a global sequence alignment. The algorithm is an extension of the previously described 'optimal accuracy' decoder, allowing homology information to be used. It was benchmarked using an HMM for transmembrane topology and signal peptide prediction, Phobius. We found that the performance was substantially increased when incorporating information from homologs. AVAILABILITY: A prediction server for transmembrane topology and signal peptides that uses the algorithm is available at http://phobius.cgb.ki.se/poly.html. An implementation of the algorithm is available on request from the authors.
BACKGROUND: MicroRNAs (miRNA) are small (20-25 nt) non-coding RNA molecules that regulate gene expression through interaction with mRNA in plants and metazoans. A few hundred miRNAs are known or predicted, and most of those are evolutionarily conserved. In general plant miRNA are different from their animal counterpart: most plant miRNAs show near perfect complementarity to their targets. Exploiting this complementarity we have developed a method for identification plant miRNAs that does not rely on phylogenetic conservation. RESULTS: Using the presumed targets for the known miRNA as positive controls, we list and filter all segments of the genome of length approximately 20 that are complementary to a target mRNA-transcript. From the positive control we recover 41 (of 92 possible) of the already known miRNA-genes (representing 14 of 16 families) with only four false positives.Applying the procedure to find possible new miRNAs targeting any annotated mRNA, we predict of 592 new miRNA genes, many of which are not conserved in other plant genomes. A subset of our predicted miRNAs is additionally supported by having more than one target that are not homologues. CONCLUSION: These results indicate that it is possible to reliably predict miRNA-genes without using genome comparisons. Furthermore it suggests that the number of plant miRNAs have been underestimated and points to the existence of recently evolved miRNAs in Arabidopsis.
MOTIVATION: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. RESULTS: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to approximately 60 of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation. AVAILABILITY: The EasyGene 1.2 predictions and statistics can be accessed at http://www.binf.ku.dk/cgi-bin/easygene/search CONTACT: pern@binf.ku.dk.
An inherent problem in transmembrane protein topology prediction and signal peptide prediction is the high similarity between the hydrophobic regions of a transmembrane helix and that of a signal peptide, leading to cross-reaction between the two types of predictions. To improve predictions further, it is therefore important to make a predictor that aims to discriminate between the two classes. In addition, topology information can be gained when successfully predicting a signal peptide leading a transmembrane protein since it dictates that the N terminus of the mature protein must be on the non-cytoplasmic side of the membrane. Here, we present Phobius, a combined transmembrane protein topology and signal peptide predictor. The predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. Training was done on a newly assembled and curated dataset. Compared to TMHMM and SignalP, errors coming from cross-prediction between transmembrane segments and signal peptides were reduced substantially by Phobius. False classifications of signal peptides were reduced from 26.1% to 3.9% and false classifications of transmembrane helices were reduced from 19.0% to 7.7%. Phobius was applied to the proteomes of Homo sapiens and Escherichia coli. Here we also noted a drastic reduction of false classifications compared to TMHMM/SignalP, suggesting that Phobius is well suited for whole-genome annotation of signal peptides and transmembrane regions. The method is available at as well as at
A new general algorithm for optimization of potential functions for protein folding is introduced. It is based upon gradient optimization of the thermodynamic stability of native folds of a training set of proteins with known structure. The iterative update rule contains two thermodynamic averages which are estimated by (generalized ensemble) Monte Carlo. We test the learning algorithm on a Lennard-Jones (LJ) force field with a torsional angle degrees-of-freedom and a single-atom side-chain. In a test with 24 peptides of known structure, none folded correctly with the initial potential functions, but two-thirds came within 3 Aa to their native fold after optimizing the potential functions.
SUMMARY: Hidden Markov models (HMMs) are widely used for biological sequence analysis because of their ability to incorporate biological information in their structure. An automatic means of optimizing the structure of HMMs would be highly desirable. However, this raises two important issues; first, the new HMMs should be biologically interpretable, and second, we need to control the complexity of the HMM so that it has good generalization performance on unseen sequences. In this paper, we explore the possibility of using a genetic algorithm (GA) for optimizing the HMM structure. GAs are sufficiently flexible to allow incorporation of other techniques such as Baum-Welch training within their evolutionary cycle. Furthermore, operators that alter the structure of HMMs can be designed to favour interpretable and simple structures. In this paper, a training strategy using GAs is proposed, and it is tested on finding HMM structures for the promoter and coding region of the bacterium Campylobacter jejuni. The proposed GA for hidden Markov models (GA-HMM) allows, HMMs with different numbers of states to evolve. To prevent over-fitting, a separate dataset is used for comparing the performance of the HMMs to that used for the Baum-Welch training. The GA-HMM was capable of finding an HMM comparable to a hand-coded HMM designed for the same task, which has been published previously.
A method to predict lipoprotein signal peptides in Gram-negative Eubacteria, LipoP, has been developed. The hidden Markov model (HMM) was able to distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins. This predictor was able to predict 96.8% of the lipoproteins correctly with only 0.3% false positives in a set of SPaseI-cleaved, cytoplasmic, and transmembrane proteins. The results obtained were significantly better than those of previously developed methods. Even though Gram-positive lipoprotein signal peptides differ from Gram-negatives, the HMM was able to identify 92.9% of the lipoproteins included in a Gram-positive test set. A genome search was carried out for 12 Gram-negative genomes and one Gram-positive genome. The results for Escherichia coli K12 were compared with new experimental data, and the predictions by the HMM agree well with the experimentally verified lipoproteins. A neural network-based predictor was developed for comparison, and it gave very similar results. LipoP is available as a Web server at www.cbs.dtu.dk/services/LipoP/.
BACKGROUND: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs annotated as genes. RESULTS: In this paper, we present a new automated gene-finding method, EasyGene, which estimates the statistical significance of a predicted gene. The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. Using extensions of similarities in Swiss-Prot, a high quality training set of genes is automatically extracted from the genome and used to estimate the HMM. Putative genes are then scored with the HMM, and based on score and length of an ORF, the statistical significance is calculated. The measure of statistical significance for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance level or better, where the random sequence has the same statistics as the genome in the sense of a third order Markov chain. CONCLUSIONS: The result is a flexible gene finder whose overall performance matches or exceeds other methods. The entire pipeline of computer processing from the raw input of a genome or set of contigs to a list of putative genes with significance is automated, making it easy to apply EasyGene to newly sequenced organisms. EasyGene with pre-trained models can be accessed at http://www.cbs.dtu.dk/services/EasyGene.
We have developed reliability scores for five widely used membrane protein topology prediction methods, and have applied them both on a test set of 92 bacterial plasma membrane proteins with experimentally determined topologies and on all predicted helix bundle membrane proteins in three fully sequenced genomes: Escherichia coli, Saccharomyces cerevisiae and Caenorhabditis elegans. We show that the reliability scores work well for the TMHMM and MEMSAT methods, and that they allow the probability that the predicted topology is correct to be estimated for any protein. We further show that the available test set is biased towards high-scoring proteins when compared to the genome-wide data sets, and provide estimates for the expected prediction accuracy of TMHMM across the three genomes. Finally, we show that the performance of TMHMM is considerably better when limited experimental information (such as the in/out location of a protein's C terminus) is available, and estimate that at least ten percentage points in overall accuracy in whole-genome predictions can be gained in this way.
We have used a hidden Markov model (HMM) to identify the consensus sequence of the RpoD promoters in the genome of Campylobacter jejuni. The identified promoter consensus sequence is unusual compared to other bacteria, in that the region upstream of the TATA-box does not contain a conserved -35 region, but shows a very strong periodic variation in the AT-content and semi-conserved T-stretches, with a period of 10-11 nucleotides. The TATA-box is in some, but not all cases, preceded by a TGx, similar to an extended -10 promoter.We predicted a total of 764 presumed RpoD promoters in the C.jejuni genome, of which 654 were located upstream of annotated genes. A similar promoter was identified in Helicobacter pylori, a close phylogenetic relative of Campylobacter, but not in Escherichia coli, Vibrio cholerae, or six other Proteobacterial genomes, or in Staphylococcus aureus. We used upstream regions of high confidence genes as training data (n=529, for the C.jejuni genome). We found it necessary to limit the training set to genes that are preceded by an intergenic region of >100bp or by a gene oriented in the opposite direction to be able to identify a conserved sequence motif, and ended up with a training set of 175 genes. This leads to the conclusion that the remaining genes (354) are more rarely preceded by a (RpoD) promoter, and consequently that operon structure may be more widespread in C.jejuni than has been assumed by others.Structural predictions of the regions upstream of the TATA-box indicates a region of highly curved DNA, and we assume that this facilitates the wrapping of the DNA around the RNA polymerase holoenzyme, and offsets the absence of a conserved -35 binding motif.
We have developed an entirely sequence-based method that identifies and integrates relevant features that can be used to assign proteins of unknown function to functional classes, and enzyme categories for enzymes. We show that strategies for the elucidation of protein function may benefit from a number of functional attributes that are more directly related to the linear sequence of amino acids, and hence easier to predict, than protein structure. These attributes include features associated with post-translational modifications and protein sorting, but also much simpler aspects such as the length, isoelectric point and composition of the polypeptide chain.
Motivation: Membrane proteins are an abundant and functionally relevant subset of proteins that putatively include from about 15 up to 30 of the proteome of organisms fully sequenced. These estimates are mainly computed on the basis of sequence comparison and membrane protein prediction. It is therefore urgent to develop methods capable of selecting membrane proteins especially in the case of outer membrane proteins, barely taken into consideration when proteome wide analysis is performed. This will also help protein annotation when no homologous sequence is found in the database. Outer membrane proteins solved so far at atomic resolution interact with the external membrane of bacteria with a characteristic beta barrel structure comprising different even numbers of beta strands (beta barrel membrane proteins). In this they differ from the membrane proteins of the cytoplasmic membrane endowed with alpha helix bundles (all alpha membrane proteins) and need specialised predictors. Results: We develop a HMM model, which can predict the topology of beta barrel membrane proteins using, as input, evolutionary information. The model is cyclic with 6 types of states: two for the beta strand transmembrane core, one for the beta strand cap on either side of the membrane, one for the inner loop, one for the outer loop and one for the globular domain state in the middle of each loop. The development of a specific input for HMM based on multiple sequence alignment is novel. The accuracy per residue of the model is 83% when a jack knife procedure is adopted. With a model optimisation method using a dynamic programming algorithm seven topological models out of the twelve proteins included in the testing set are also correctly predicted. When used as a discriminator, the model is rather selective. At a fixed probability value, it retains 84% of a non-redundant set comprising 145 sequences of well-annotated outer membrane proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins including about 1200 chains with low sequence identity (<30%) and 90% of a set of all alpha membrane proteins, including 188 chains. Availability:The program will be available on request from the authors.
We examined more than 700 DNA sequences (full length chromosomes and plasmids) for stretches of purines (R) or pyrimidines (Y) and alternating YR stretches; such regions will likely adopt structures which are different from the canonical B-form. Since one turn of the DNA helix is roughly 10 bp, we measured the fraction of each genome which contains purine (or pyrimidine) tracts of lengths of 10 bp or longer (hereafter referred to as 'purine tracts'), as well as stretches of alternating pyrimidines/purine (pyr/pur tracts') of the same length. Using this criteria, a random sequence would be expected to contain 1.0% of purine tracts and also 1.0% of the alternating pyr/pur tracts. In the vast majority of cases, there are more purine tracts than would be expected from a random sequence, with an average of 3.5%, significantly larger than the expectation value. The fraction of the chromosomes containing pyr/pur tracts was slightly less than expected, with an average of 0.8%. One of the most surprising findings is a clear difference in the length distributions of the regions studied between prokaryotes and eukaryotes. Whereas short-range correlations can explain the length distributions in prokaryotes, in eukaryotes there is an abundance of long stretches of purines or alternating purine/pyrimidine tracts, which cannot be explained in this way; these sequences are likely to play an important role in eukaryotic chromosome organisation.
A hidden Markov model of sigma(A) RNA polymerase cofactor recognition sites in Bacillus subtilis, containing either the common or the extended -10 motifs, has been constructed based on experimentally verified sigma(A) recognition sites. This work suggests that more information exists at the initiation site of transcription in both types of promoters than previously thought. When tested on the entire B. subtilis genome, the model predicts that approximately half of the sigma(A) recognition sites are of the extended type. Some of the response-regulator aspartate phosphatases were among the predictions of promoters containing extended sites. The expression of rapA and rapB was confirmed by site-directed mutagenesis to depend on the extended -10 region.
We describe and validate a new membrane protein topology prediction method, TMHMM, based on a hidden Markov model. We present a detailed analysis of TMHMM's performance, and show that it correctly predicts 97-98% of the transmembrane helices. Additionally, TMHMM can discriminate between soluble and membrane proteins with both specificity and sensitivity better than 99%, although the accuracy drops when signal peptides are present. This high degree of accuracy allowed us to predict reliably integral membrane proteins in a large collection of genomes. Based on these predictions, we estimate that 20-30% of all genes in most genomes encode membrane proteins, which is in agreement with previous estimates. We further discovered that proteins with N(in)-C(in) topologies are strongly preferred in all examined organisms, except Caenorhabditis elegans, where the large number of 7TM receptors increases the counts for N(out)-C(in) topologies. We discuss the possible relevance of this finding for our understanding of membrane protein assembly mechanisms. A TMHMM prediction service is available at http://www.cbs.dtu.dk/services/TMHMM/.
Salmonella enterica serovar Typhi (S. typhi) is the aetiological agent of typhoid fever, a serious invasive bacterial disease of humans with an annual global burden of approximately 16 million cases, leading to 600,000 fatalities. Many S. enterica serovars actively invade the mucosal surface of the intestine but are normally contained in healthy individuals by the local immune defence mechanisms. However, S. typhi has evolved the ability to spread to the deeper tissues of humans, including liver, spleen and bone marrow. Here we have sequenced the 4,809,037-base pair (bp) genome of a S. typhi (CT18) that is resistant to multiple drugs, revealing the presence of hundreds of insertions and deletions compared with the Escherichia coli genome, ranging in size from single genes to large islands. Notably, the genome sequence identifies over two hundred pseudogenes, several corresponding to genes that are known to contribute to virulence in Salmonella typhimurium. This genetic degradation may contribute to the human-restricted host range for S. typhi. CT18 harbours a 218,150-bp multiple-drug-resistance incH1 plasmid (pHCM1), and a 106,516-bp cryptic plasmid (pHCM2), which shows recent common ancestry with a virulence plasmid of Yersinia pestis.
In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length distribution of the annotated genes with the length distribution of those matching a known protein reveals that too many short genes are annotated in many genomes. Here we estimate the true number of protein-coding genes for sequenced genomes. Although it is often claimed that Escherichia coli has about 4300 genes, we show that it probably has only approximately 3800 genes, and that a similar discrepancy exists for almost all published genomes.
We have analysed the complete sequence of the Escherichia coli K12 isolate MG1655 genome for chromatin-associated protein binding sites, and compared the predicted location of predicted sites with experimental expression data from 'DNA chip' experiments. Of the dozen proteins associated with chromatin in E. coli, only three have been shown to have significant binding preferences: integration host factor (IHF) has the strongest binding site preference, and FIS sites show a weak consensus, and there is no clear consensus site for binding of the H-NS protein. Using hidden Markov models (HMMs), we predict the location of 608 IHF sites, scattered throughout the genome. A subset of the IHF sites associated with repeats tends to be clustered around the origin of replication. We estimate there could be roughly 6000 FIS sites in E. coli, and the sites tend to be localised in two regions flanking the replication termini. We also show that the regions upstream of genes regulated by H-NS are more curved and have a higher AT content than regions upstream of other genes. These regions in general would also be localised near the replication terminus.
Ensemble methods, which combine several classifiers, have been successfully applied to decrease generalization error of machine learning methods. For most ensemble methods the ensemble members are combined by weighted summation of the output, called the linear average predictor. The logarithmic opinion pool ensemble method uses a multiplicative combination of the ensemble members, which treats the outputs of the ensemble members as independent probabilities. The advantage of the logarithmic opinion pool is the connection to the Kullback-Leibler error function, which can be decomposed into two terms: An average of the error of the ensemble members, and the ambiguity. The ambiguity is independent of the target function, and can be estimated using unlabeled data. The advantage of the decomposition is that an unbiased estimate of the generalization error of the ensemble can be obtained, while training still is on the full training set. These properties can be used to improve classification. The logarithmic opinion pool ensemble method is tested on the prediction of protein secondary structure. The focus is on how much improvement the general ensemble method can give rather than on outperforming existing methods, because that typically involves several more steps of refinement.
The application of the gene finder to the Adh region of the Drosophila melanogaster is described, and the prediction results are analyzed. is based on a probabilistic model called a hidden Markov model, and the probabilistic framework facilitates the inclusion of database matches of varying degrees of certainty. It is shown that database matches clearly improve the performance of the gene finder. For instance, the sensitivity for coding exons predicted with both ends correct grows from 62% to 70% on a high-quality test set, when matches to proteins, cDNAs, repeats, and transposons are included. The specificity drops more than the sensitivity increases when ESTs are used. This is due to the high noise level in EST matches, and it is discussed in more detail why this is and how it might be improved.
A general framework for hybrids of hidden Markov models (HMMs) and neural networks (NNs) called hidden neural networks (HNNs) is described. The article begins by reviewing standard HMMs and estimation by conditional maximum likelihood, which is used by the HNN. In the HNN, the usual HMM probability parameters are replaced by the outputs of state-specific neural networks. As opposed to many other hybrids, the HNN is normalized globally and therefore has a valid probabilistic interpretation. All parameters in the HNN are estimated simultaneously according to the discriminative conditional maximum likelihood criterion. The HNN can be viewed as an undirected probabilistic independence network (a graphical model), where the neural networks provide a compact representation of the clique functions. An evaluation of the HNN on the task of recognizing broad phoneme classes in the TIMIT database shows clear performance gains compared to standard HMMs tested on the same task.
This work investigates whether mRNA has a lower estimated folding free energy than random sequences. The free energy estimates are calculated by the mfold program for prediction of RNA secondary structures. For a set of 46 mRNAs it is shown that the predicted free energy is not significantly different from random sequences with the same dinucleotide distribution. For random sequences with the same mononucleotide distribution it has previously been shown that the native mRNA sequences have a lower predicted free energy, which indicates a more stable structure than random sequences. However, dinucleotide content is important when assessing the significance of predicted free energy as the physical stability of RNA secondary structure is known to depend on dinucleotide base stacking energies. Even known RNA secondary structures, like tRNAs, can be shown to have predicted free energies indistinguishable from randomized sequences. This suggests that the predicted free energy is not always a good determinant for RNA folding.
Countless millions of people have died from tuberculosis, a chronic infectious disease caused by the tubercle bacillus. The complete genome sequence of the best-characterized strain of Mycobacterium tuberculosis, H37Rv, has been determined and analysed in order to improve our understanding of the biology of this slow-growing pathogen and to help the conception of new prophylactic and therapeutic interventions. The genome comprises 4,411,529 base pairs, contains around 4,000 genes, and has a very high guanine + cytosine content that is reflected in the blased amino-acid content of the proteins. M. tuberculosis differs radically from other bacteria in that a very large portion of its coding capacity is devoted to the production of enzymes involved in lipogenesis and lipolysis, and to two new families of glycine-rich proteins with a repetitive structure that may represent a source of antigenic variation.
A hidden Markov model of signal peptides has been developed. It contains submodels for the N-terminal part, the hydrophobic region, and the region around the cleavage site. For known signal peptides, the model can be used to assign objective boundaries between these three regions. Applied to our data, the length distributions for the three regions are significantly different from expectations. For instance, the assigned hydrophobic region is between 8 and 12 residues long in almost all eukaryotic signal peptides. This analysis also makes obvious the difference between eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. The model can be used to predict the location of the cleavage site, which it finds correctly in nearly 70% of signal peptides in a cross-validated test-almost the same accuracy as the best previous method. One of the problems for existing prediction methods is the poor discrimination between signal peptides and uncleaved signal anchors, but this is substantially improved by the hidden Markov model when expanding it with a very simple signal anchor model.
A novel method to model and predict the location and orientation of alpha helices in membrane-spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the non-cytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the non-cytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertions mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and a discriminative method, and a method for reassignment of the membrane helix boundaries were developed. In a cross validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
The conventional linear back-propagation algorithm is replaced by a non-linear version, which avoids the necessity for calculating the derivative of the activation function. This may be exploited in hardware realizations of neural processors. In this paper we derive the non-linear back-propagation algorithms in the framework of recurrent back-propagation and present some numerical simulations of feed-forward networks on the NetTalk problem. A discussion of implementation in analog VLSI electronics concludes the paper.
A hidden Markov model for gene finding consists of submodels for coding regions, splice sites, introns, intergenic regions and possibly more. It is described how to estimate the model as a whole from labeled sequences instead of estimating the individual parts independently from subsequences. It is argued that the standard maximum likelihood estimation criterion is not optimal for training such a model. Instead of maximizing the probability of the DNA sequence, one should maximize the probability of the correct prediction. Such a criterion, called conditional maximum likelihood, is used for the gene finder 'HMM-gene'. A new (approximative) algorithm is described, which finds the most probable prediction summed over all paths yielding the same prediction. We show that these methods contribute significantly to the high performance of HMMgene.
Within the context of learning a rule from examples, we study the general characteristics of learning with ensembles. The generalization performance achieved by a simple model ensemble of linear students is calculated exactly in the thermodynamic limit of a large number of input components, and shows a surprisingly rich behaviour. Our main findings are: For learning in large ensembles, it is advantageous to use under-regularized students, which actually over-fit the training data. Globally optimal generalization performance can be obtained by choosing the training set sizes of the students optimally. For smaller ensembles, optimization of the ensemble weights can yield significant improvements in ensemble generalization performance, in particular if the individual students are subject to noise in the training process. Choosing students with a wide range of regularization parameters makes this improvement robust against changes in the unknown level of corruption of the training data.
This paper presents a general framework for hybrids of Hidden Markov models (HMM) and neural networks (NN). In the new framework called Hidden Neural Networks (HNN) the usual HMM probability parameters are replaced by neural network outputs. To ensure a probabilistic interpretation the HNN is normalized globally as opposed to the local normalization enforced on parameters in standard HMMs. Furthermore, all parameters in the HNN are estim ated simultaneously according to the discriminative conditional maximum likelihood (CML) criterion. The HNNs show clear performance gains compared to standard HMMs on TIMIT continuous speech recognition benchmarks. On the task of recognizing five broad phoneme classes an accuracy of 84% is obtained compared to 76% for a standard HMM. Additionally, we report a preliminary result of 69% accuracy on the TIMIT 39 phoneme task.
We describe the structural implications of a periodic pattern found in human exons and introns by hidden Markov models. We show that exons (besides the reading frame) have a specific sequential structure in the form of a pattern with triplet consensus non-T(A/T)G, and a minimal periodicity of roughly ten nucleotides. The periodic pattern is also present in intron sequences, although the strength per nucleotide is weaker. Using two independent profile methods based on triplet bendability parameters from DNase I experiments and nucleosome positioning data, we show that the pattern in multiple alignments of internal exon and intron sequences corresponds to a periodic "in phase" bending potential towards the major groove of the DNA. The nucleosome positioning data show that the consensus triplets (and their complements) have a preference for locations on a bent double helix where the major groove faces inward and is compressed. The in-phase triplets are located adjacent to GCC/GGC triplets known to have the strongest bias in their positioning on the nuclesome. Analysis of mRNA sequences encoding proteins with known tertiary structure exclude the possibility that the pattern is a consequence of the previously well-known periodicity caused by the encoding of alpha-helices in proteins. Finally, we discuss the relation between the bending potential of coding and non-coding regions and its impact on the translational positioning of nucleosomes and the recognition of genes by the transcriptional machinery.
Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectation-maximization training procedure is relatively straightforward. In this paper, we review the mathematical extensions and heuristics that move the method from the theoretical to the practical. We then experimentally analyze the effectiveness of model regularization, dynamic model modification and optimization strategies. Finally it is demonstrated on the SH2 domain how a domain can be found from unaligned sequences using a special model type. The experimental work was completed with the aid of the Sequence Alignment and Modeling software suite.
Most current methods for prediction of protein secondary structure use a small window of the protein sequence to predict the structure of the central amino acid. We describe a new method for prediction of the non-local structure called beta-sheet, which consists of two or more beta-strands that are connected by hydrogen bonds. Since beta-strands are often widely separated in the protein chain, a network with two windows is introduced. After training on a set of proteins the network predicts the sheets well, but there are many false positives. By using a global energy function the beta-sheet prediction is combined with a local prediction of the three secondary structures alpha-helix, beta-strand and coil. The energy function is minimized using simulated annealing to give a final prediction.
The prediction of protein secondary structure by use of carefully structured neural networks and multiple sequence alignments has been investigated. Separate networks are used for predicting the three secondary structures alpha-helix, beta-strand, and coil. The networks are designed using a priori knowledge of amino acid properties with respect to the secondary structure and the characteristic periodicity in alpha-helices. Since these single-structure networks all have less than 600 adjustable weights, overfitting is avoided. To obtain a three-state prediction of alpha-helix, beta-strand, or coil, ensembles of single-structure networks are combined with another neural network. This method gives an overall prediction accuracy of 66.3% when using 7-fold cross-validation on a database of 126 nonhomologous globular proteins. Applying the method to multiple sequence alignments of homologous proteins increases the prediction accuracy significantly to 71.3% with corresponding Matthew's correlation coefficients C alpha = 0.59, C beta = 0.52, and Cc = 0.50. More than 72% of the residues in the database are predicted with an accuracy of 80%. It is shown that the network outputs can be interpreted as estimated probabilities of correct prediction, and, therefore, these numbers indicate which residues are predicted with high confidence.
We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
We study the characteristics of learning with ensembles. Solving exactly the simple model of an ensemble of linear students, we find surprisingly rich behaviour. For learning in large ensembles, it is advantageous to use under-regularized students, which actually over-fit the training data. Globally optimal performance can be obtained by choosing the training set sizes of the students appropriately. For smaller ensembles, optimization of the ensemble weights can yield significant improvements in ensemble generalization performance, in particular if the individual students are subject to noise in the training process. Choosing students with a wide range of regularization parameters makes this improvement robust against changes in the unknown level of noise in the training data.
We analyse the sequential structure of human exons and their flanking introns by hidden Markov models. Together, models of donor site regions, acceptor site regions and flanked internal exons, show that exons-besides the reading frame-hold a specific periodic pattern. The pattern, which has the consensus: non-T(A/T)G and a minimal periodicity of roughly 10 nucleotides, is not a consequence of the nucleotide statistics in the three codon positions, nor of the well known nucleosome positioning signal. We discuss the relation between the pattern and other known sequence elements responsible for the intrinsic bending or curvature of DNA.
In a family of proteins or other biological sequences like DNA the various subfamilies are often very unevenly represented. For this reason a scheme for assigning weights to each sequence can greatly improve performance at tasks such as database searching with profiles or other consensus models based on multiple alignments. A new weighting scheme for this type of database search is proposed. In a statistical description of the searching problem it is derived from the maximum entropy principle. It can be proved that, in a certain sense, it corrects for uneven representation. It is shown that finding the maximum entropy weights is an easy optimization problem for which standard techniques are applicable.
Learning of continuous valued functions using neural network ensembles (committees) can give improved accuracy, reliable estimation of the generalization error, and active learning. The ambiguity is defined as the variation of the output of ensemble members averaged over unlabeled data, so it quantifies the disagreement among the networks. It is discussed how to use the ambiguity in combination with cross-validation to give a reliable estimate of the ensemble generalization error, and how this type of ensemble cross-validation can sometimes improve performance. It is shown how to estimate the optimal weights of the ensemble members using unlabeled data. By a generalization of query by committee, it is finally shown how the ambiguity can be used to select new training data to be labeled in an active learning scheme.
An approximate steepest descent strategy converging, in families of regular exponential densities, to maximum likelihood estimates of density functions is described. These density estimates are also obtained by an application of the principle of minimum relative entropy subject to empirical constraints. We prove tight bounds on the increase of the log-likelihood at each iteration of our strategy for families of exponential densities whose log-densities are spanned by a set of bounded basis functions.
A hidden Markov model for labeled observations, called a CHMM, is introduced and a maximum likelihood method is developed for estimating the parameters of the model. Instead of training it to model the statistics of the training sequences it is trained to optimize recognition. It resembles MMI training, but is more general, and has MMI as a special case. The standard forward-backward procedure for estimating the model cannot be generalized directly, but an ``incremental EM'' method is proposed.
Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false negatives and false positives, even though the HMM is trained using only unaligned sequences, whereas PROFILESEARCH requires aligned training sequences. Our results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionary preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels which play an important role in excitation-contraction coupling. This region has been suggested to contain the functional domains that are typical or essential for all L-type calcium channels regardless of whether they couple to ryanodine receptors, conduct ions or both.
A hidden Markov model (HMM) has been developed to find protein coding genes in E. coli DNA using E. coli genome DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model the codons and their frequencies in E. coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Delgarno motif. To account for potential sequencing errors and or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6 and the model tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E. coli genes, and approximate locations for about 10%. It also finds several potentially new genes, and locates several places were insertion or deletion errors/and or frameshifts may be present in the contigs.
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It is shown that this Bayesian method can improve the quality of HMMs produced from small training sets. Specific experiments on the EF-hand motif are reported, for which these priors are shown to produce HMMs with higher likelihood on unseen data, and fewer false positives and false negatives in a database search task.
The optimal brain damage (OBD) scheme of Le Cun, Denker and Solla for pruning of feedforward networks has been implemented and applied to the contiguity classification problem. It is shown that OBD improves the learning curve (the test error as a function of the number of examples). By inspecting the architectures obtained through pruning, it is found that the networks with less parameters have the smallest test error in agreement with "Ockhams Razor". Based on this, we propose a heuristic which selects the smallest successful architecture among a group of pruned networks and we show that it leads to very efficient optimization of the architecture. The validity of the approximations involved in OBD are discussed and it is found that they are surprisingly accurate for the problem studied.
We apply Hidden Markov Models (HMMs) to the problem of statistical modeling and multiple alignment of protein families. A variant of the Expectation Maximization (EM) algorithm known as the Viterbi algorithm is used to obtain the statistical model from the unaligned sequences. In a detailed series of experiments, we have taken 400 unaligned globin sequences, and produced a statistical model entirely automatically from the primary (unaligned) sequences using no prior knowledge of globin structure. The produced model includes amino acid distributions for all the known positions in the 7 major alpha-helices, as well as the probability of and average length of insertions between these positions, and the probability that each position is not present at all. Using this model, we obtained a multiple alignment of the 400 sequences and 225 other globin sequences, that agrees almost perfectly with a structural alignment by Bashford et al. This model can also discriminate all these 625 globins from nonglobin protein sequences with greater than 99% accuracy, and can thus be used for database searches.
It has been observed in numerical simulations that a weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and non-linear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk.