Publications by Anders Krogh
- Available papers are in pdf or compressed postscript (gzip).
- There may be slight differences between the available papers and
the published versions.
- J. C. Bryne, E. Valen,
M. H. Tang, T. Marstrand, O. Winther, I. da Piedade, A. Krogh,
B. Lenhard, and A. Sandelin.
JASPAR, the open access database of transcription
factor-binding profiles: new content and tools in the 2008 update.
Nucleic Acids Research, 2007.
Epub ahead of print.
JASPAR is a popular open-access database for matrix models
describing DNA-binding preferences for transcription factors and other DNA
patterns. With its third major release, JASPAR has been expanded and equipped
with additional functions aimed at both casual and power users. The heart of
the JASPAR database, the JASPAR CORE sub-database, has increased by 12% in
size, and three new specialized sub-databases have been added. New functions
include clustering of matrix models by similarity, generation of random
matrices by sampling from selected sets of existing models and a
language-independent Web Service applications programming interface for
matrix retrieval. JASPAR is available at
http://jaspar.genereg.net.
- C. Bystroff and
A. Krogh.
Hidden Markov models for prediction of protein
features.
In M. Zaki and C. Bystroff, editors, Protein Structure Prediction.
Springer, Humana Press, 2007.
- P. Fariselli,
D. Molinini, R. Casadio, and A. Krogh.
Prediction of structurally-determined coiled-coil domains
with hidden Markov models.
In S. Hochreiter and R. Wagner, editors, Bioinformatics Research and
Development, pages 292-302. Springer-Verlag, March 2007.
(PDF, 192671 bytes)
The coiled-coil protein domain is a widespread structural motif
known to be involved in a wealth of key interactions in cells and organisms.
Coiled-coil recognition and prediction of their location in a protein
sequence are important steps for modeling protein structure and function.
Nowadays, thanks to the increasing number of experimentally determined
protein structures, a significant number of coiled-coil protein domains is
available. This enables the development of methods suited to predict the
coiled-coil structural motifs starting from the protein sequence. Several
methods have been developed to predict classical heptads using manually
annotated coiled-coil domains. In this paper we focus on the prediction of
structurally-determined coiled-coil segments. We introduce a new method based
on hidden Markov models that complements the existing methods and outperforms
them in the task of locating structurally-defined coiled-coil
segments.
- L. B. Frankel, N. R.
Christoffersen, A. Jacobsen, M. Lindow, A. Krogh, and A. H. Lund.
Programmed cell death 4 (PDCD4) is an important
functional target of the microRNA miR-21 in breast cancer
cells.
J. Biol. Chem., 2007.
Epub ahead of print.
MicroRNAs are emerging as important regulators of cancer-related
processes. The miR-21 microRNA is over-expressed in a wide variety of cancers
and has been causally linked to cellular proliferation, apoptosis and
migration. Inhibition of miR-21 in MCF-7 breast cancer cells causes reduced
cell growth. Using array expression analysis of MCF-7 cells depleted of
miR-21 we identify mRNA targets of miR-21 and show a link between miR-21 and
the p53 tumor suppressor protein. We furthermore find that the tumor
suppressor protein Programmed Cell Death 4 (PDCD4) is regulated by miR-21 and
demonstrate that PDCD4 is a functionally important target for miR-21 in
breast cancer cells.
- M. C. Frith, E. Valen,
A. Krogh, Y. Hayashizaki, P. Carninci, and A. Sandelin.
A code for transcription initiation in mammalian
genomes.
Genome Research, 2007.
To appear.
- L. Jonson, J. Vikesaa,
A. Krogh, L. K. Nielsen, T. V. Hansen, R. Borup, A. H. Johnsen,
J. Christiansen, and F. C. Nielsen.
Molecular composition of IMP1 RNP
granules.
Mol. & Cell. Proteomics, 6(5), 2007.
Localized mRNAs are transported to sites of local protein synthesis
in large RNP granules, but their molecular composition is incompletely
understood. IMP zipcode-binding proteins participate in mRNA localization,
and in motile cells IMP-containing granules are dispersed around the nucleus
and in cellular protrusions. We isolated the IMP1-containing RNP granules,
and found that they represent a unique RNP entity, distinct from neuronal
hStaufen and/or FMRP granules, P-bodies and stress-granules. Granules were
100-300 nm in diameter and consisted of IMPs, 40S ribosomal subunits,
shuttling hnRNPs, poly(A)-binding proteins, and mRNAs. Moreover, granules
contained CBP80 and factors belonging to the exon-junction complex, and
lacked eIF4E, eIF4G and 60S ribosomal subunits, indicating that embodied
mRNAs are not translated. Granules embodied mRNAs corresponding to about 3% of
the HEK293 mRNA transcriptome. Messenger RNAs encoding proteins
participating in the secretory pathway and ER-associated quality control, as
well as ubiquitin-dependent metabolism, were enriched in the granules,
reinforcing the concept of RNP granules as post-transcriptional
operons.
- L. Kall, A. Krogh,
and E. L. Sonnhammer.
Advantages of combined transmembrane topology and signal
peptide prediction - the Phobius web server.
Nucleic Acids Research, 35(Web Server issue):W429-432, July
2007.
When using conventional transmembrane topology and signal peptide
predictors, such as TMHMM and SignalP, there is a substantial overlap between
these two types of predictions. Applying these methods to five complete
proteomes, we found that 30-65% of all predicted signal peptides and 25-35% of
all predicted transmembrane topologies overlap. This impairs predictions
of 5-10% of the proteome, hence this is an important issue in protein
annotation. To address this problem, we previously designed a hidden Markov
model, Phobius, that combines transmembrane topology and signal peptide
predictions. The method makes an optimal choice between transmembrane
segments and signal peptides, and also allows constrained and
homology-enriched predictions. We here present a web interface
(http://phobius.cgb.ki.se and http://phobius.binf.ku.dk) to access
Phobius.
- S. Lindgreen,
P. P. Gardner, and A. Krogh.
MASTR: Multiple alignment and structure prediction of
non-coding RNAs using simulated annealing.
Bioinformatics, 2007.
Epub ahead of print.
MOTIVATION: As more non-coding RNAs are discovered, the importance
of methods for RNA analysis increases. Since the structure of ncRNA is
intimately tied to the function of the molecule, programs for RNA structure
prediction are necessary tools in this growing field of research.
Furthermore, it is known that RNA structure is often evolutionarily more
conserved than sequence. However, few existing methods are capable of
simultaneously considering multiple sequence alignment and structure
prediction. RESULTS: We present a novel solution to the problem of
simultaneous structure prediction and multiple alignment of RNA sequences.
Using Markov chain Monte Carlo in a simulated annealing framework, the
algorithm MASTR (Multiple Alignment of STructural RNAs) iteratively improves
both sequence alignment and structure prediction for a set of RNA sequences.
This is done by minimizing a combined cost function that considers sequence
conservation, covariation and basepairing probabilities. The results show
that the method is very competitive to similar programs available today, both
in terms of accuracy and computational efficiency. AVAILABILITY: Source code
available from http://mastr.binf.ku.dk/ CONTACT:
stinus@binf.ku.dk.
- M. Lindow, A. Jacobsen,
S. Nygaard, Y. Mang, and A. Krogh.
Intragenomic matching reveals a huge potential for
miRNA mediated regulation in plants.
PLoS Computational Biology, 2007.
To appear.
- K. J. Won, T. Hamelryck,
A. Prugel-Bennett, and A. Krogh.
An
evolving method for learning HMM structure: prediction of protein secondary
structure.
BMC Bioinformatics, 8(1):357, 2007.
Epub ahead of print.
BACKGROUND: The prediction of the secondary structure of
proteins is one of the most studied problems in bioinformatics. Despite their
success in many problems of biological sequence analysis, Hidden Markov
Models (HMMs) have not been used much for this problem, as the complexity of
the task makes manual design of HMMs difficult. Therefore, we have developed
a method for evolving the structure of HMMs automatically, using Genetic
Algorithms (GAs). RESULTS: In the GA procedure, populations of HMMs are
assembled from biologically meaningful building blocks. Mutation and
crossover operators were designed to explore the space of such block-HMMs.
After each step of the GA, the standard HMM estimation algorithm (the
Baum-Welch algorithm) was used to update model parameters. The final HMM
captures several features of protein sequence and structure, with its own HMM
grammar. In contrast to neural network based predictors, the evolved HMM also
calculates the probabilities associated with the predictions. We carefully
examined the performance of the HMM based predictor, both under the multiple-
and single-sequence condition. CONCLUSIONS: We have shown that the proposed
evolutionary method can automatically design the topology of HMMs. The method
reads the grammar of protein sequences and converts it into the grammar of an
HMM. It improved previously suggested evolutionary methods and increased the
prediction quality. Especially, it shows good performance under the
single-sequence condition and provides probabilistic information on the
prediction result. The protein secondary structure predictor using HMMs
(P.S.HMM) is available on-line (http://www.binf.ku.dk/~won/pshmm.htm). It
runs under the single-sequence condition.
- M. Amico, M. Finelli,
I. Rossi, A. Zauli, A. Elofsson, H. Viklund, G. von Heijne, D. Jones,
A. Krogh, P. Fariselli, P. L. Martelli, and R. Casadio.
PONGO: a web server for multiple predictions of
all-alpha transmembrane proteins.
Nucleic Acids Research, 34:W169-W172, 2006.
(PDF, 585631 bytes)
The annotation efforts of the BIOSAPIENS European Network of
Excellence have generated several distributed annotation systems (DAS) with
the aim of integrating Bioinformatics resources and annotating metazoan
genomes (http://www.biosapiens.info). In this context, the PONGO DAS server
(http://pongo.biocomp.unibo.it) provides the annotation on predictive basis
for the all-alpha membrane proteins in the human genome, not only through DAS
queries, but also directly using a simple web interface. In order to produce
a more comprehensive analysis of the sequence at hand, this annotation is
carried out with four selected and high scoring predictors: TMHMM2.0, MEMSAT,
PRODIV and ENSEMBLE1.0. The stored and pre-computed predictions for the human
proteins can be searched and displayed in a graphical view. However the web
service allows the prediction of the topology of any kind of putative
membrane proteins, regardless of the organism and more importantly with the
same sequence profile for a given sequence when required. Here we present a
new web server that incorporates the state-of-the-art topology predictors in
a single framework, so that putative users can interactively compare and
evaluate four predictions simultaneously for a given sequence. Together with
the predicted topology, the server also displays a signal peptide prediction
determined with SPEP. The PONGO web server is available at
http://pongo.biocomp.unibo.it/pongo.
- T. Hamelryck,
J. T. Kent, and A. Krogh.
Sampling realistic
protein conformations using local structural bias.
PLoS Comput. Biol., 2(9):e131, 2006.
(PDF, 2447369 bytes)
The prediction of protein structure from sequence remains a major
unsolved problem in biology. The most successful protein structure prediction
methods make use of a divide-and-conquer strategy to attack the problem: a
conformational sampling method generates plausible candidate structures,
which are subsequently accepted or rejected using an energy function.
Conceptually, this often corresponds to separating local structural bias from
the long-range interactions that stabilize the compact, native state.
However, sampling protein conformations that are compatible with the local
structural bias encoded in a given protein sequence is a long-standing open
problem, especially in continuous space. We describe an elegant and
mathematically rigorous method to do this, and show that it readily generates
native-like protein conformations simply by enforcing compactness. Our
results have far-reaching implications for protein structure prediction,
determination, simulation, and design.
- K. T. Jensen,
L. Petersen, S. Falk, P. Iversen, P. Andersen, M. Theisen, and
A. Krogh.
Novel overlapping coding sequences in Chlamydia
trachomatis.
FEMS Microbiol. Lett., 265(1):106-117, 2006.
(PDF, 195664 bytes)
Chlamydia trachomatis is the aetiological agent of trachoma and
sexually transmitted infections. The C. trachomatis genome sequence revealed
an organism adapted to the intracellular habitat with a high coding ratio and
a small genome consisting of 1,042 kilobases (kb) with 895 annotated protein
coding genes. Here, we repredict the protein-coding genes of the C.
trachomatis genome using the gene-finder EasyGene that was trained
specifically for C. trachomatis, and compare it with the primary C.
trachomatis annotation. Our work predicts 15 genes not listed in the primary
annotation and 853 that are in agreement with the primary annotation.
Forty-two genes from the primary annotation are not predicted by EasyGene. The
majority of these genes are listed as hypothetical in the primary annotation.
The 15 novel predicted genes all overlap with genes on the complementary
strand. We find homologues of several of the novel genes in C. trachomatis
Serovar A and Chlamydia muridarum. Several of the genes have typical
gene-like and protein-like features. Furthermore, we confirm transcriptional
activity from 10 of the putative genes. The combined evidence suggests that
at least seven of the 15 are protein coding genes. The data suggest the
presence of overlapping active genes in C. trachomatis.
- S. Lindgreen,
P. P. Gardner, and A. Krogh.
Measuring covariation in RNA alignments: physical
realism improves information measures.
Bioinformatics, 22(24):2988-2995, 2006.
(PDF, 134720 bytes)
MOTIVATION: The importance of non-coding RNAs is becoming
increasingly evident, and often the function of these molecules depends on
the structure. It is common to use alignments of related RNA sequences to
deduce the consensus secondary structure by detecting patterns of
co-evolution. A central part of such an analysis is to measure covariation
between two positions in an alignment. Here, we rank various measures ranging
from simple mutual information to more advanced covariation measures.
RESULTS: Mutual information is still used for secondary structure prediction,
but the results of this study indicate which measures are useful.
Incorporating more structural information by considering e.g. indels and
stacking improves accuracy, suggesting that physically realistic measures
yield improved predictions. This can be used to improve both current and
future programs for secondary structure prediction. The best measure tested
is the RNAalifold covariation measure modified to include stacking.
AVAILABILITY: Scripts, data and supplementary material can be found at
http://www.binf.ku.dk/Stinus_covariation
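The simplest measure ranked in the abstract above, mutual information between two alignment columns, can be illustrated with a minimal sketch. This is not the authors' code (their scripts are at the URL given in the abstract); the function name and the choice to skip rows containing a gap are assumptions made here for illustration only.

```python
import math
from collections import Counter

def column_mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i and col_j are equal-length strings holding one character per
    aligned sequence. Rows with a gap ('-') in either column are skipped,
    which is a simplifying assumption of this sketch.
    """
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                  # counts of (base_i, base_j)
    left = Counter(a for a, _ in pairs)     # marginal counts, column i
    right = Counter(b for _, b in pairs)    # marginal counts, column j
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = left[a] / n
        p_b = right[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Two columns that vary but always pair the same way score high, while two fully conserved columns score zero even if they form a base pair. That blind spot is one reason a plain information measure can miss structure that more physically motivated measures recover.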
- K. Munch and
A. Krogh.
Automatic generation of gene finders for eukaryotic
species.
BMC Bioinformatics, 7(1):263, 2006.
(PDF, 409817 bytes)
BACKGROUND: The number of sequenced eukaryotic genomes is
rapidly increasing. This means that over time it will be hard to keep
supplying customised gene finders for each genome. This calls for procedures
to automatically generate species-specific gene finders and to re-train them
as the quantity and quality of reliable gene annotation grows. RESULTS: We
present a procedure, Agene, that automatically generates a species-specific
gene predictor from a set of reliable mRNA sequences and a genome. We apply a
Hidden Markov model (HMM) that implements explicit length distribution
modelling for all gene structure blocks using acyclic discrete phase type
distributions. The state structure of each HMM is generated dynamically
from an array of sub-models to include only gene features represented in the
training set. CONCLUSIONS: Acyclic discrete phase type distributions are well
suited to model sequence length distributions. The performance of each
individual gene predictor on each individual genome is comparable to the best
of the manually optimised species-specific gene finders. It is shown that
species-specific gene finders are superior to gene finders trained on other
species.
- K. Munch, P. P. Gardner,
P. Arctander, and A. Krogh.
A hidden Markov model approach for determining expression from
genomic tiling micro arrays.
BMC Bioinformatics, 7(1):239, 2006.
(PDF, 568372 bytes)
BACKGROUND: Genomic tiling micro arrays have great
potential for identifying previously undiscovered coding as well as
non-coding transcription. To date, however, analyses of these data have been
performed in an ad hoc fashion. RESULTS: We present a probabilistic
procedure, ExpressHMM, that adaptively models tiling data prior to predicting
expression on genomic sequence. A hidden Markov model (HMM) is used to model
the distributions of tiling array probe scores in expressed and non-expressed
regions. The HMM is trained on sets of probes mapped to regions of annotated
expression and non-expression. Subsequently, prediction of transcribed
fragments is made on tiled genomic sequence. The prediction is accompanied by
an expression probability curve for visual inspection of the supporting
evidence. We test ExpressHMM on data from the Cheng et al. (2005) tiling
array experiments on ten Human chromosomes. Results can be downloaded and
viewed from our web site. CONCLUSIONS: The value of adaptive modelling of
fluorescence scores prior to categorisation into expressed and non-expressed
probes is demonstrated. Our results indicate that our adaptive approach is
superior to the previous analysis in terms of nucleotide sensitivity and
transfrag specificity.
- K.-J. Won, A. Prugel-Bennett,
and A. Krogh.
Evolving the structure of hidden Markov
models.
IEEE Transactions on Evolutionary Computation, 10(1):39-49,
2006.
(PDF, 597786 bytes)
A genetic algorithm (GA) is proposed for finding the structure of
hidden Markov Models (HMMs) used for biological sequence analysis. The GA is
designed to preserve biologically meaningful building blocks. The search
through the space of HMM structures is combined with optimization of the
emission and transition probabilities using the classic Baum-Welch
algorithm. The system is tested on the problem of finding the promoter and
coding region of C. jejuni. The resulting HMM has a superior discrimination
ability to a handcrafted model that has been published in the
literature.
- L. Käll,
A. Krogh, and E. L. Sonnhammer.
An HMM posterior decoder for sequence feature
prediction that includes homology information.
Bioinformatics, 21(Suppl. 1):i251-i257, 2005.
(PDF, 83426 bytes)
MOTIVATION: When predicting sequence features like transmembrane
topology, signal peptides, coiled-coil structures, protein secondary structure
or genes, extra support can be gained from homologs. RESULTS: We present here
a general hidden Markov model (HMM) decoding algorithm that combines
probabilities for sequence features of homologs by considering the average of
the posterior label probability of each position in a global sequence
alignment. The algorithm is an extension of the previously described 'optimal
accuracy' decoder, allowing homology information to be used. It was
benchmarked using an HMM for transmembrane topology and signal peptide
prediction, Phobius. We found that the performance was substantially
increased when incorporating information from homologs. AVAILABILITY: A
prediction server for transmembrane topology and signal peptides that uses
the algorithm is available at http://phobius.cgb.ki.se/poly.html. An
implementation of the algorithm is available on request from the
authors.
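The averaging step described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the data layout (one dict of label probabilities per aligned position, `None` for a gap) and both function names are hypothetical, and the per-column argmax at the end is a stand-in for the paper's 'optimal accuracy' decoder, which it does not reproduce.

```python
def average_posteriors(alignment_posteriors):
    """Average posterior label probabilities over homologs, column by column.

    alignment_posteriors has one entry per sequence; each entry is a list
    with one item per alignment column, either a dict mapping label to
    posterior probability or None where that sequence has a gap.
    """
    n_columns = len(alignment_posteriors[0])
    averaged = []
    for col in range(n_columns):
        # Only sequences that are not gapped in this column contribute.
        present = [seq[col] for seq in alignment_posteriors if seq[col] is not None]
        if not present:
            averaged.append({})
            continue
        labels = {label for dist in present for label in dist}
        averaged.append({
            label: sum(dist.get(label, 0.0) for dist in present) / len(present)
            for label in labels
        })
    return averaged

def most_probable_labels(averaged):
    """Pick the highest-scoring label per column (a simplification)."""
    return [max(dist, key=dist.get) if dist else None for dist in averaged]
```

A real decoder would also have to respect the label grammar of the HMM, which an independent choice per column does not guarantee.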
- M. Lindow and
A. Krogh.
Computational evidence for hundreds of non-conserved plant
microRNAs.
BMC Genomics, 6:119, 2005.
(PDF, 317514 bytes)
BACKGROUND: MicroRNAs (miRNA) are small (20-25 nt) non-coding RNA
molecules that regulate gene expression through interaction with mRNA in
plants and metazoans. A few hundred miRNAs are known or predicted, and most
of those are evolutionarily conserved. In general plant miRNA are different
from their animal counterpart: most plant miRNAs show near perfect
complementarity to their targets. Exploiting this complementarity we have
developed a method for identification of plant miRNAs that does not rely on
phylogenetic conservation. RESULTS: Using the presumed targets for the known
miRNA as positive controls, we list and filter all segments of the genome of
length approximately 20 that are complementary to a target mRNA-transcript.
From the positive control we recover 41 (of 92 possible) of the already known
miRNA-genes (representing 14 of 16 families) with only four false
positives. Applying the procedure to find possible new miRNAs targeting any
annotated mRNA, we predict 592 new miRNA genes, many of which are not
conserved in other plant genomes. A subset of our predicted miRNAs is
additionally supported by having more than one target that are not
homologues. CONCLUSION: These results indicate that it is possible to
reliably predict miRNA-genes without using genome comparisons. Furthermore it
suggests that the number of plant miRNAs has been underestimated and points
to the existence of recently evolved miRNAs in Arabidopsis.
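The core idea in the abstract above, searching for short segments that are nearly perfectly complementary to a target transcript, can be sketched roughly as below. This is an illustrative assumption-laden example, not the published pipeline: the RNA alphabet, the function names, and the `max_mismatches` threshold are all hypothetical, and the filtering steps mentioned in the abstract are not modelled.

```python
# Watson-Crick complement table for RNA (assumed alphabet: A, C, G, U).
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]

def complementary_sites(candidate, target, max_mismatches=2):
    """Find sites in `target` that `candidate` could pair with.

    Slides the reverse complement of the candidate along the target and
    returns (start, mismatches) for every window with at most
    `max_mismatches` mismatched positions.
    """
    site = reverse_complement(candidate)
    k = len(site)
    hits = []
    for start in range(len(target) - k + 1):
        window = target[start:start + k]
        mismatches = sum(1 for a, b in zip(site, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

For candidates of about 20 nucleotides, as in the abstract, a small mismatch budget keeps chance matches rare. A genome-scale search would need an indexed approach rather than this quadratic scan.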
- P. Nielsen and
A. Krogh.
Large-scale prokaryotic gene prediction and comparison to
genome annotation.
Bioinformatics, 21(24):4322-4329, 2005.
(PDF, 295624 bytes)
MOTIVATION: Prokaryotic genomes are sequenced and annotated at an
increasing rate. The methods of annotation vary between sequencing groups. It
makes genome comparison difficult and may lead to propagation of errors when
questionable assignments are adapted from one genome to another. Genome
comparison either on a large or small scale would be facilitated by using a
single standard for annotation, which incorporates a transparency of why an
open reading frame (ORF) is considered to be a gene. RESULTS: A total of 143
prokaryotic genomes were scored with an updated version of the prokaryotic
genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with
the EasyGene predictions reveals that in some genomes up to approximately 60%
of the genes may have been annotated with a wrong start codon, especially in
the GC-rich genomes. The fractional difference between annotated and
predicted confirms that too many short genes are annotated in numerous
organisms. Furthermore, genes might be missing in the annotation of some of
the genomes. We predict 41 of 143 genomes to be over-annotated by >5%,
meaning that too many ORFs are annotated as genes. We also predict that 12 of
143 genomes are under-annotated. These results are based on the difference
between the number of annotated genes not found by EasyGene and the number of
predicted genes that are not annotated in GenBank. We argue that the average
performance of our standardized and fully automated method is slightly better
than the annotation. AVAILABILITY: The EasyGene 1.2 predictions and
statistics can be accessed at http://www.binf.ku.dk/cgi-bin/easygene/search
CONTACT: pern@binf.ku.dk.
- L. Käll,
A. Krogh, and E. L. L. Sonnhammer.
A combined transmembrane topology and signal peptide
prediction method.
Journal of Molecular Biology, 338(5):1027-1036, 2004.
(PDF, 615083 bytes)
An inherent problem in transmembrane protein topology prediction
and signal peptide prediction is the high similarity between the hydrophobic
regions of a transmembrane helix and that of a signal peptide, leading to
cross-reaction between the two types of predictions. To improve predictions
further, it is therefore important to make a predictor that aims to
discriminate between the two classes. In addition, topology information can
be gained when successfully predicting a signal peptide leading a
transmembrane protein since it dictates that the N terminus of the mature
protein must be on the non-cytoplasmic side of the membrane. Here, we present
Phobius, a combined transmembrane protein topology and signal peptide
predictor. The predictor is based on a hidden Markov model (HMM) that models
the different sequence regions of a signal peptide and the different regions
of a transmembrane protein in a series of interconnected states. Training was
done on a newly assembled and curated dataset. Compared to TMHMM and SignalP,
errors coming from cross-prediction between transmembrane segments and signal
peptides were reduced substantially by Phobius. False classifications of
signal peptides were reduced from 26.1% to 3.9% and false classifications of
transmembrane helices were reduced from 19.0% to 7.7%. Phobius was applied to
the proteomes of Homo sapiens and Escherichia coli. Here we also noted a
drastic reduction of false classifications compared to TMHMM/SignalP,
suggesting that Phobius is well suited for whole-genome annotation of signal
peptides and transmembrane regions. The method is available at as well as
at
- O. Winther and
A. Krogh.
Teaching computers to fold
proteins.
Phys. Rev. E, 70:030903(R), 2004.
(PDF, 294280 bytes)
A new general algorithm for optimization of potential functions for
protein folding is introduced. It is based upon gradient optimization of the
thermodynamic stability of native folds of a training set of proteins with
known structure. The iterative update rule contains two thermodynamic
averages which are estimated by (generalized ensemble) Monte Carlo. We test
the learning algorithm on a Lennard-Jones (LJ) force field with a torsional
angle degrees-of-freedom and a single-atom side-chain. In a test with 24
peptides of known structure, none folded correctly with the initial potential
functions, but two-thirds came within 3 Å to their native fold after
optimizing the potential functions.
- K.-J. Won, A. Prugel-Bennett,
and A. Krogh.
Training HMM structure with genetic algorithm for
biological sequence analysis.
Bioinformatics, 20(18):3613-3619, 2004.
(PDF, 122027 bytes)
SUMMARY: Hidden Markov models (HMMs) are widely used for biological
sequence analysis because of their ability to incorporate biological
information in their structure. An automatic means of optimizing the
structure of HMMs would be highly desirable. However, this raises two
important issues; first, the new HMMs should be biologically interpretable,
and second, we need to control the complexity of the HMM so that it has good
generalization performance on unseen sequences. In this paper, we explore the
possibility of using a genetic algorithm (GA) for optimizing the HMM
structure. GAs are sufficiently flexible to allow incorporation of other
techniques such as Baum-Welch training within their evolutionary cycle.
Furthermore, operators that alter the structure of HMMs can be designed to
favour interpretable and simple structures. In this paper, a training
strategy using GAs is proposed, and it is tested on finding HMM structures
for the promoter and coding region of the bacterium Campylobacter jejuni. The
proposed GA for hidden Markov models (GA-HMM) allows HMMs with different
numbers of states to evolve. To prevent over-fitting, a separate dataset is
used for comparing the performance of the HMMs to that used for the
Baum-Welch training. The GA-HMM was capable of finding an HMM comparable to a
hand-coded HMM designed for the same task, which has been published
previously.
- A. S. Juncker,
H. Willenbrock, G. von Heijne, S. Brunak, H. Nielsen, and
A. Krogh.
Prediction of lipoprotein signal peptides in
Gram-negative bacteria.
Protein Sci., 12(8):1652-1662, 2003.
(PDF, 338425 bytes)
A method to predict lipoprotein signal peptides in Gram-negative
Eubacteria, LipoP, has been developed. The hidden Markov model (HMM) was able
to distinguish between lipoproteins (SPaseII-cleaved proteins),
SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.
This predictor was able to predict 96.8% of the lipoproteins correctly with
only 0.3% false positives in a set of SPaseI-cleaved, cytoplasmic, and
transmembrane proteins. The results obtained were significantly better than
those of previously developed methods. Even though Gram-positive lipoprotein
signal peptides differ from Gram-negatives, the HMM was able to identify
92.9% of the lipoproteins included in a Gram-positive test set. A genome
search was carried out for 12 Gram-negative genomes and one Gram-positive
genome. The results for Escherichia coli K12 were compared with new
experimental data, and the predictions by the HMM agree well with the
experimentally verified lipoproteins. A neural network-based predictor was
developed for comparison, and it gave very similar results. LipoP is
available as a Web server at www.cbs.dtu.dk/services/LipoP/.
- T. S. Larsen and
A. Krogh.
EasyGene - a prokaryotic
gene finder that ranks ORFs by statistical significance.
BMC Bioinformatics, 4(1):21, 2003.
(PDF, 832391 bytes)
BACKGROUND: Contrary to other areas of sequence analysis, a measure
of statistical significance of a putative gene has not been devised to help
in discriminating real genes from the masses of random Open Reading Frames
(ORFs) in prokaryotic genomes. Therefore, many genomes have too many short
ORFs annotated as genes. RESULTS: In this paper, we present a new automated
gene-finding method, EasyGene, which estimates the statistical significance
of a predicted gene. The gene finder is based on a hidden Markov model (HMM)
that is automatically estimated for a new genome. Using extensions of
similarities in Swiss-Prot, a high quality training set of genes is
automatically extracted from the genome and used to estimate the HMM.
Putative genes are then scored with the HMM, and based on score and length of
an ORF, the statistical significance is calculated. The measure of
statistical significance for an ORF is the expected number of ORFs in one
megabase of random sequence at the same significance level or better, where
the random sequence has the same statistics as the genome in the sense of a
third order Markov chain. CONCLUSIONS: The result is a flexible gene finder
whose overall performance matches or exceeds other methods. The entire
pipeline of computer processing from the raw input of a genome or set of
contigs to a list of putative genes with significance is automated, making it
easy to apply EasyGene to newly sequenced organisms. EasyGene with
pre-trained models can be accessed at
http://www.cbs.dtu.dk/services/EasyGene.
- K. Melén,
A. Krogh, and G. von Heijne.
Reliability measures for membrane protein topology
prediction algorithms.
Journal of Molecular Biology, 327(3):735-744, 2003.
(PDF, 298308 bytes)
We have developed reliability scores for five widely used membrane
protein topology prediction methods, and have applied them both on a test set
of 92 bacterial plasma membrane proteins with experimentally determined
topologies and on all predicted helix bundle membrane proteins in three fully
sequenced genomes: Escherichia coli, Saccharomyces cerevisiae and
Caenorhabditis elegans. We show that the reliability scores work well for the
TMHMM and MEMSAT methods, and that they allow the probability that the
predicted topology is correct to be estimated for any protein. We further
show that the available test set is biased towards high-scoring proteins when
compared to the genome-wide data sets, and provide estimates for the expected
prediction accuracy of TMHMM across the three genomes. Finally, we show that
the performance of TMHMM is considerably better when limited experimental
information (such as the in/out location of a protein's C terminus) is
available, and estimate that at least ten percentage points in overall
accuracy in whole-genome predictions can be gained in this
way.
- L. Petersen, T. S.
Larsen, D. W. Ussery, S. L. W. On, and A. Krogh.
RpoD promoters in Campylobacter jejuni exhibit a
strong periodic signal instead of a -35 box.
Journal of Molecular Biology, 326(5):1361-1372, 2003.
(PDF, 630427 bytes)
We have used a hidden Markov model (HMM) to identify the consensus
sequence of the RpoD promoters in the genome of Campylobacter jejuni. The
identified promoter consensus sequence is unusual compared to other bacteria,
in that the region upstream of the TATA-box does not contain a conserved -35
region, but shows a very strong periodic variation in the AT-content and
semi-conserved T-stretches, with a period of 10-11 nucleotides. The TATA-box
is in some, but not all cases, preceded by a TGx, similar to an extended -10
promoter. We predicted a total of 764 presumed RpoD promoters in the C. jejuni
genome, of which 654 were located upstream of annotated genes. A similar
promoter was identified in Helicobacter pylori, a close phylogenetic relative
of Campylobacter, but not in Escherichia coli, Vibrio cholerae, or six other
Proteobacterial genomes, or in Staphylococcus aureus. We used upstream
regions of high confidence genes as training data (n=529, for the C. jejuni
genome). We found it necessary to limit the training set to genes that are
preceded by an intergenic region of >100bp or by a gene oriented in the
opposite direction to be able to identify a conserved sequence motif, and
ended up with a training set of 175 genes. This leads to the conclusion that
the remaining genes (354) are more rarely preceded by a (RpoD) promoter, and
consequently that operon structure may be more widespread in C. jejuni than
has been assumed by others. Structural predictions of the regions upstream of
the TATA-box indicate a region of highly curved DNA, and we assume that this
facilitates the wrapping of the DNA around the RNA polymerase holoenzyme, and
offsets the absence of a conserved -35 binding motif.
- L. J. Jensen, R. Gupta,
N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H. H. Staerfeldt,
K. Rapacki, C. Workman, C. A. Andersen, S. Knudsen, A. Krogh,
A. Valencia, and S. Brunak.
Prediction of human protein function from
post-translational modifications and localization features.
Journal of Molecular Biology, 319(5):1257-1265, 2002.
(PDF, 441510 bytes)
We have developed an entirely sequence-based method that identifies
and integrates relevant features that can be used to assign proteins of
unknown function to functional classes, and enzyme categories for enzymes. We
show that strategies for the elucidation of protein function may benefit from
a number of functional attributes that are more directly related to the
linear sequence of amino acids, and hence easier to predict, than protein
structure. These attributes include features associated with
post-translational modifications and protein sorting, but also much simpler
aspects such as the length, isoelectric point and composition of the
polypeptide chain.
- P. L. Martelli,
P. Fariselli, A. Krogh, and R. Casadio.
A sequence-profile-based HMM for predicting and
discriminating beta barrel membrane proteins.
Bioinformatics, 18:S46-S53, 2002.
Proceedings of ISMB 2002.
(PDF, 147765 bytes)
Motivation: Membrane proteins are an abundant and functionally
relevant subset of proteins that putatively include from about 15 up to 30%
of the proteome of organisms fully sequenced. These estimates are mainly
computed on the basis of sequence comparison and membrane protein prediction.
It is therefore urgent to develop methods capable of selecting membrane
proteins especially in the case of outer membrane proteins, barely taken into
consideration when proteome wide analysis is performed. This will also help
protein annotation when no homologous sequence is found in the database.
Outer membrane proteins solved so far at atomic resolution interact with the
external membrane of bacteria with a characteristic beta barrel structure
comprising different even numbers of beta strands (beta barrel membrane
proteins). In this they differ from the membrane proteins of the cytoplasmic
membrane endowed with alpha helix bundles (all alpha membrane proteins) and
need specialised predictors. Results: We develop a HMM model, which can
predict the topology of beta barrel membrane proteins using, as input,
evolutionary information. The model is cyclic with 6 types of states: two for
the beta strand transmembrane core, one for the beta strand cap on either
side of the membrane, one for the inner loop, one for the outer loop and one
for the globular domain state in the middle of each loop. The development of
a specific input for HMM based on multiple sequence alignment is novel. The
accuracy per residue of the model is 83% when a jack knife procedure is
adopted. With a model optimisation method using a dynamic programming
algorithm seven topological models out of the twelve proteins included in the
testing set are also correctly predicted. When used as a discriminator, the
model is rather selective. At a fixed probability value, it retains 84% of a
non-redundant set comprising 145 sequences of well-annotated outer membrane
proteins. Concomitantly, it correctly rejects 90% of a set of globular
proteins including about 1200 chains with low sequence identity (<30%) and
90% of a set of all alpha membrane proteins, including 188 chains.
Availability: The program will be available on request from the
authors.
- D. Ussery, D. M.
Soumpasis, S. Brunak, H. H. Staerfeldt, P. Worning, and A. Krogh.
Bias of purine stretches in sequenced
chromosomes.
Computers and Chemistry, 26(5):531-541, 2002.
(PDF, 593359 bytes)
We examined more than 700 DNA sequences (full length chromosomes
and plasmids) for stretches of purines (R) or pyrimidines (Y) and alternating
YR stretches; such regions will likely adopt structures which are different
from the canonical B-form. Since one turn of the DNA helix is roughly 10 bp,
we measured the fraction of each genome which contains purine (or pyrimidine)
tracts of lengths of 10 bp or longer (hereafter referred to as 'purine
tracts'), as well as stretches of alternating pyrimidines/purine ('pyr/pur
tracts') of the same length. Using these criteria, a random sequence would be
expected to contain 1.0% of purine tracts and also 1.0% of the alternating
pyr/pur tracts. In the vast majority of cases, there are more purine tracts
than would be expected from a random sequence, with an average of 3.5%,
significantly larger than the expectation value. The fraction of the
chromosomes containing pyr/pur tracts was slightly less than expected, with
an average of 0.8%. One of the most surprising findings is a clear difference
in the length distributions of the regions studied between prokaryotes and
eukaryotes. Whereas short-range correlations can explain the length
distributions in prokaryotes, in eukaryotes there is an abundance of long
stretches of purines or alternating purine/pyrimidine tracts, which cannot be
explained in this way; these sequences are likely to play an important role
in eukaryotic chromosome organisation.
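The tract measure described in this abstract can be illustrated with a minimal sketch. It assumes a plain DNA string over A, C, G and T, and the function name `tract_fraction` is hypothetical; it is not the authors' code. It reports the fraction of a sequence covered by runs of at least 10 consecutive purines (A/G) or at least 10 consecutive pyrimidines (C/T), since a pyrimidine run on one strand is a purine run on the other.

```python
import re

def tract_fraction(seq, min_len=10):
    """Fraction of seq covered by purine-only or pyrimidine-only runs
    of at least min_len bases (hypothetical helper for illustration)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # [AG]{n,} matches a purine run, [CT]{n,} a pyrimidine run.
    pattern = re.compile(r"[AG]{%d,}|[CT]{%d,}" % (min_len, min_len))
    covered = sum(m.end() - m.start() for m in pattern.finditer(seq))
    return covered / len(seq)

# Ten purines followed by ten strictly alternating bases: half is covered.
example = tract_fraction("AGGAAGAGGA" + "CACACACACA")
```

Alternating pyr/pur tracts would need a separate pattern; they are left out here to keep the sketch short.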
- S. Brunak and
A. Krogh.
ISMB 2001 (editorial).
Bioinformatics, 17(Suppl. 1):S1-S4, 2001.
Proceedings of ISMB 2001.
(PDF, 51658 bytes)
- H. Jarmer, T. S.
Larsen, A. Krogh, H. H. Saxild, S. Brunak, and S. Knudsen.
Sigma A recognition sites in the Bacillus subtilis
genome.
Microbiology, 147:2417-2424, 2001.
(PDF, 579429 bytes)
A hidden Markov model of sigma(A) RNA polymerase cofactor
recognition sites in Bacillus subtilis, containing either the common or the
extended -10 motifs, has been constructed based on experimentally verified
sigma(A) recognition sites. This work suggests that more information exists
at the initiation site of transcription in both types of promoters than
previously thought. When tested on the entire B. subtilis genome, the model
predicts that approximately half of the sigma(A) recognition sites are of the
extended type. Some of the response-regulator aspartate phosphatases were
among the predictions of promoters containing extended sites. The expression
of rapA and rapB was confirmed by site-directed mutagenesis to depend on the
extended -10 region.
- A. Krogh,
B. Larsson, G. von Heijne, and E. L. L. Sonnhammer.
Predicting transmembrane protein topology with a hidden
Markov model: Application to complete genomes.
Journal of Molecular Biology, 305(3):567-580, January 2001.
(PDF, 959503 bytes)
We describe and validate a new membrane protein topology prediction
method, TMHMM, based on a hidden Markov model. We present a detailed analysis
of TMHMM's performance, and show that it correctly predicts 97-98% of the
transmembrane helices. Additionally, TMHMM can discriminate between soluble
and membrane proteins with both specificity and sensitivity better than 99%,
although the accuracy drops when signal peptides are present. This high
degree of accuracy allowed us to predict reliably integral membrane proteins
in a large collection of genomes. Based on these predictions, we estimate
that 20-30% of all genes in most genomes encode membrane proteins, which is
in agreement with previous estimates. We further discovered that proteins
with N(in)-C(in) topologies are strongly preferred in all examined organisms,
except Caenorhabditis elegans, where the large number of 7TM receptors
increases the counts for N(out)-C(in) topologies. We discuss the possible
relevance of this finding for our understanding of membrane protein assembly
mechanisms. A TMHMM prediction service is available at
http://www.cbs.dtu.dk/services/TMHMM/.
- J. Parkhill,
G. Dougan, K. D. James, N. R. Thomson, D. Pickard, J. Wain, C. Churcher,
K. L. Mungall, S. D. Bentley, M. T. G. Holden, M. Sebaihia, S. Baker,
D. Basham, K. Brooks, T. Chillingworth, P. Connerton, A. Cronin, P. Davis,
R. M. Davies, L. Dowd, N. White, J. Farrar, T. Feltwell, N. Hamlin, A. Haque,
T. T. Hien, S. Holroyd, K. Jagels, A. Krogh, T. S. Larsen, S. Leather,
S. Moule, P. O'Gaora, C. Parry, M. Quail, K. Rutherford, M. Simmonds,
J. Skelton, K. Stevens, S. Whitehead, and B. G. Barrell.
Complete genome sequence of
a multiple drug resistant Salmonella enterica serovar Typhi
CT18.
Nature, 413(6858):848-852, 2001.
(PDF, 1044232 bytes)
Salmonella enterica serovar Typhi (S. typhi) is the aetiological
agent of typhoid fever, a serious invasive bacterial disease of humans with
an annual global burden of approximately 16 million cases, leading to 600,000
fatalities. Many S. enterica serovars actively invade the mucosal surface of
the intestine but are normally contained in healthy individuals by the local
immune defence mechanisms. However, S. typhi has evolved the ability to
spread to the deeper tissues of humans, including liver, spleen and bone
marrow. Here we have sequenced the 4,809,037-base pair (bp) genome of a S.
typhi (CT18) that is resistant to multiple drugs, revealing the presence of
hundreds of insertions and deletions compared with the Escherichia coli
genome, ranging in size from single genes to large islands. Notably, the
genome sequence identifies over two hundred pseudogenes, several
corresponding to genes that are known to contribute to virulence in
Salmonella typhimurium. This genetic degradation may contribute to the
human-restricted host range for S. typhi. CT18 harbours a 218,150-bp
multiple-drug-resistance incH1 plasmid (pHCM1), and a 106,516-bp cryptic
plasmid (pHCM2), which shows recent common ancestry with a virulence plasmid
of Yersinia pestis.
- M. Skovgaard,
L. J. Jensen, S. Brunak, D. Ussery, and A. Krogh.
On the total number of genes and their length
distribution in complete microbial genomes.
Trends in Genetics, 17(8):425-428, August 2001.
(PDF, 44091 bytes)
In sequenced microbial genomes, some of the annotated genes are
actually not protein-coding genes, but rather open reading frames that occur
by chance. Therefore, the number of annotated genes is higher than the actual
number of genes for most of these microbes. Comparison of the length
distribution of the annotated genes with the length distribution of those
matching a known protein reveals that too many short genes are annotated in
many genomes. Here we estimate the true number of protein-coding genes for
sequenced genomes. Although it is often claimed that Escherichia coli has
about 4300 genes, we show that it probably has only approximately 3800 genes,
and that a similar discrepancy exists for almost all published
genomes.
- D. Ussery, T. S.
Larsen, K. T. Wilkes, C. Friis, P. Worning, A. Krogh, and
S. Brunak.
Genome organisation and chromatin structure in
Escherichia coli.
Biochimie, 83(2):201-212, 2001.
(PDF, 489822 bytes)
We have analysed the complete sequence of the Escherichia coli K12
isolate MG1655 genome for chromatin-associated protein binding sites, and
compared the location of the predicted sites with experimental
expression data from 'DNA chip' experiments. Of the dozen proteins associated
with chromatin in E. coli, only three have been shown to have significant
binding preferences: integration host factor (IHF) has the strongest binding
site preference, and FIS sites show a weak consensus, and there is no clear
consensus site for binding of the H-NS protein. Using hidden Markov models
(HMMs), we predict the location of 608 IHF sites, scattered throughout the
genome. A subset of the IHF sites associated with repeats tends to be
clustered around the origin of replication. We estimate there could be
roughly 6000 FIS sites in E. coli, and the sites tend to be localised in two
regions flanking the replication termini. We also show that the regions
upstream of genes regulated by H-NS are more curved and have a higher AT
content than regions upstream of other genes. These regions in general would
also be localised near the replication terminus.
- J. V. Hansen and
A. Krogh.
A general method for combining predictors tested on
protein secondary structure prediction.
In H. Malmgren, M. Borga, and L. Niklasson, editors, Proceedings of
Artificial Neural Networks in Medicine and Biology, pages 259-264,
London, May 2000. Springer-Verlag.
(Gzipped PostScript, 6 pages, 31321 bytes)
(PDF, 63819 bytes)
Ensemble methods, which combine several classifiers, have been
successfully applied to decrease generalization error of machine learning
methods. For most ensemble methods the ensemble members are combined by
weighted summation of the output, called the linear average predictor. The
logarithmic opinion pool ensemble method uses a multiplicative combination of
the ensemble members, which treats the outputs of the ensemble members as
independent probabilities. The advantage of the logarithmic opinion pool is
the connection to the Kullback-Leibler error function, which can be
decomposed into two terms: An average of the error of the ensemble members,
and the ambiguity. The ambiguity is independent of the target function, and
can be estimated using unlabeled data. The advantage of the decomposition is
that an unbiased estimate of the generalization error of the ensemble can be
obtained, while training still is on the full training set. These properties
can be used to improve classification. The logarithmic opinion pool ensemble
method is tested on the prediction of protein secondary structure. The focus
is on how much improvement the general ensemble method can give rather than
on outperforming existing methods, because that typically involves several
more steps of refinement.
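The decomposition mentioned in this abstract can be checked numerically with a small sketch. It assumes the logarithmic opinion pool is the normalized weighted geometric mean of the member distributions, with weights summing to one; the function names and the toy numbers are illustrative and not taken from the paper. Under that assumption the Kullback-Leibler error of the pool equals the weighted average of the member errors minus the ambiguity, and the ambiguity does not involve the target.

```python
import math

def log_opinion_pool(members, weights):
    """Normalized weighted geometric mean of probability vectors.
    Returns the pooled distribution and the normalizer Z."""
    n = len(members[0])
    unnorm = [
        math.exp(sum(w * math.log(p[i]) for p, w in zip(members, weights)))
        for i in range(n)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy three-class example (made-up numbers).
members = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
weights = [0.5, 0.3, 0.2]
target = [1.0, 0.0, 0.0]

pool, z = log_opinion_pool(members, weights)
avg_error = sum(w * kl(target, p) for p, w in zip(members, weights))
ambiguity = sum(w * kl(pool, p) for p, w in zip(members, weights))
ensemble_error = kl(target, pool)
# ensemble_error equals avg_error - ambiguity up to rounding.
```

Because the ambiguity is computed from the pool and its members only, it can be estimated on unlabeled inputs, which is the property the abstract highlights.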
- A. Krogh.
Using database matches with HMMgene for automated gene
detection in Drosophila.
Genome Research, 10(4):523-528, 2000.
(PDF, 345831 bytes)
The application of the gene finder HMMgene to the Adh region of
Drosophila melanogaster is described, and the prediction results are
analyzed. HMMgene is based on a probabilistic model called a hidden Markov model, and
the probabilistic framework facilitates the inclusion of database matches of
varying degrees of certainty. It is shown that database matches clearly
improve the performance of the gene finder. For instance, the sensitivity for
coding exons predicted with both ends correct grows from 62% to 70% on a
high-quality test set, when matches to proteins, cDNAs, repeats, and
transposons are included. The specificity drops more than the sensitivity
increases when ESTs are used. This is due to the high noise level in EST
matches, and it is discussed in more detail why this is and how it might be
improved.
- A. Krogh and
S. K. Riis.
Hidden neural networks.
Neural Computation, 11(2):541-563, 1999.
(PDF, 308921 bytes)
A general framework for hybrids of hidden Markov models (HMMs) and
neural networks (NNs) called hidden neural networks (HNNs) is described. The
article begins by reviewing standard HMMs and estimation by conditional
maximum likelihood, which is used by the HNN. In the HNN, the usual HMM
probability parameters are replaced by the outputs of state-specific neural
networks. As opposed to many other hybrids, the HNN is normalized globally
and therefore has a valid probabilistic interpretation. All parameters in the
HNN are estimated simultaneously according to the discriminative conditional
maximum likelihood criterion. The HNN can be viewed as an undirected
probabilistic independence network (a graphical model), where the neural
networks provide a compact representation of the clique functions. An
evaluation of the HNN on the task of recognizing broad phoneme classes in the
TIMIT database shows clear performance gains compared to standard HMMs tested
on the same task.
- C. Workman and
A. Krogh.
No evidence that mRNAs have lower folding free energies
than random sequences with the same dinucleotide distribution.
Nucleic Acids Research, 27(24):4816-4822, 1999.
(PDF, 322485 bytes)
This work investigates whether mRNA has a lower estimated folding
free energy than random sequences. The free energy estimates are calculated
by the mfold program for prediction of RNA secondary structures. For a set of
46 mRNAs it is shown that the predicted free energy is not significantly
different from random sequences with the same dinucleotide distribution. For
random sequences with the same mononucleotide distribution it has previously
been shown that the native mRNA sequences have a lower predicted free energy,
which indicates a more stable structure than random sequences. However,
dinucleotide content is important when assessing the significance of
predicted free energy as the physical stability of RNA secondary structure is
known to depend on dinucleotide base stacking energies. Even known RNA
secondary structures, like tRNAs, can be shown to have predicted free
energies indistinguishable from randomized sequences. This suggests that the
predicted free energy is not always a good determinant for RNA
folding.
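One simple way to generate random sequences that respect dinucleotide statistics is sketched below. It is a first-order Markov resampling and an assumption on my part: the abstract does not say how the random sequences were produced, and this approach preserves dinucleotide frequencies only in expectation, not exactly as a dinucleotide shuffle would. The function name `markov_randomize` is hypothetical.

```python
import random
from collections import defaultdict

def markov_randomize(seq, rng=None):
    """Sample a sequence of the same length from a first-order Markov
    chain whose transitions follow the dinucleotide counts of seq."""
    if not seq:
        return ""
    rng = rng or random.Random()
    # successors[a] lists every symbol observed directly after a,
    # so a uniform draw from it follows the observed dinucleotide counts.
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    out = [seq[0]]
    while len(out) < len(seq):
        choices = successors.get(out[-1])
        if choices:
            out.append(rng.choice(choices))
        else:
            # Dead end: the symbol only occurs at the very end of seq.
            # Fall back to a symbol drawn from the overall composition.
            out.append(rng.choice(seq))
    return "".join(out)

original = "ACGUACGGAUCCGAUA"
randomized = markov_randomize(original, random.Random(0))
```

Folding free energies of the original and of many such randomized sequences could then be compared with an external folding program; that step is outside this sketch.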
- S. T. Cole, R. Brosch,
J. Parkhill, T. Garnier, C. Churcher, D. Harris, S. V. Gordon, K. Eiglmeier,
S. Gas, C. E. Barry III, F. Tekaia, K. Badcock, D. Basham, D. Brown,
T. Chillingworth, R. Connor, R. Davies, K. Devlin, T. Feltwell, S. Gentles,
N. Hamlin, S. Holroyd, T. Hornsby, K. Jagels, A. Krogh, J. McLean,
S. Moule, L. Murphy, K. Oliver, J. Osborne, M. A. Quail, M.-A. Rajandream,
J. Rogers, S. Rutter, K. Seeger, J. Skelton, R. Squares, S. Squares, J. E.
Sulston, K. Taylor, S. Whitehead, and B. G. Barrell.
Deciphering the biology of Mycobacterium
tuberculosis from the complete genome sequence.
Nature, 393:537-544, 1998.
(PDF, 2784947 bytes)
Countless millions of people have died from tuberculosis, a chronic
infectious disease caused by the tubercle bacillus. The complete genome
sequence of the best-characterized strain of Mycobacterium tuberculosis,
H37Rv, has been determined and analysed in order to improve our understanding
of the biology of this slow-growing pathogen and to help the conception of
new prophylactic and therapeutic interventions. The genome comprises
4,411,529 base pairs, contains around 4,000 genes, and has a very high
guanine + cytosine content that is reflected in the biased amino-acid content
of the proteins. M. tuberculosis differs radically from other bacteria in
that a very large portion of its coding capacity is devoted to the production
of enzymes involved in lipogenesis and lipolysis, and to two new families of
glycine-rich proteins with a repetitive structure that may represent a source
of antigenic variation.
- R. M. Durbin, S. R.
Eddy, A. Krogh, and G. Mitchison.
Biological Sequence Analysis.
Cambridge University Press, Cambridge, UK, 1998.
- A. Krogh.
Gene finding: putting the parts together.
In Martin J. Bishop, editor, Guide to Human Genome Computing,
chapter 11, pages 261-274. Academic Press, San Diego, CA, 2nd edition,
1998.
(Gzipped PostScript, 12 pages, 42693 bytes)
(PDF, 945426 bytes)
- A. Krogh.
An introduction to hidden Markov models for biological
sequences.
In S. L. Salzberg, D. B. Searls, and S. Kasif, editors, Computational
Methods in Molecular Biology, chapter 4, pages 45-63. Elsevier,
Amsterdam, 1998.
(Gzipped PostScript, 24 pages, 67831 bytes)
(PDF, 1483458 bytes)
- H. Nielsen and
A. Krogh.
Prediction of signal peptides and signal anchors by a
hidden Markov model.
In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen,
editors, Proceedings of the Sixth International Conference on
Intelligent Systems for Molecular Biology, pages 122-130, Menlo Park,
CA, 1998. AAAI Press.
(Gzipped PostScript, 9 pages, 58984 bytes)
(PDF, 1081707 bytes)
A hidden Markov model of signal peptides has been developed. It
contains submodels for the N-terminal part, the hydrophobic region, and the
region around the cleavage site. For known signal peptides, the model can be
used to assign objective boundaries between these three regions. Applied to
our data, the length distributions for the three regions are significantly
different from expectations. For instance, the assigned hydrophobic region is
between 8 and 12 residues long in almost all eukaryotic signal peptides. This
analysis also makes obvious the difference between eukaryotes, Gram-positive
bacteria, and Gram-negative bacteria. The model can be used to predict the
location of the cleavage site, which it finds correctly in nearly 70% of
signal peptides in a cross-validated test, almost the same accuracy as the
best previous method. One of the problems for existing prediction methods is
the poor discrimination between signal peptides and uncleaved signal anchors,
but this is substantially improved by the hidden Markov model when expanding
it with a very simple signal anchor model.
- E. L. L.
Sonnhammer, G. von Heijne, and A. Krogh.
A hidden Markov model for predicting transmembrane
helices in protein sequences.
In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen,
editors, Proceedings of the Sixth International Conference on
Intelligent Systems for Molecular Biology, pages 175-182, Menlo Park,
CA, 1998. AAAI Press.
(Gzipped PostScript, 8 pages, 42470 bytes)
(PDF, 844205 bytes)
A novel method to model and predict the location and orientation of
alpha helices in membrane-spanning proteins is presented. It is based on a
hidden Markov model (HMM) with an architecture that corresponds closely to
the biological system. The model is cyclic with 7 types of states for helix
core, helix caps on either side, loop on the cytoplasmic side, two loops for
the non-cytoplasmic side, and a globular domain state in the middle of each
loop. The two loop paths on the non-cytoplasmic side are used to model short
and long loops separately, which corresponds biologically to the two known
different membrane insertion mechanisms. The close mapping between the
biological and computational states allows us to infer which parts of the
model architecture are important to capture the information that encodes the
membrane topology, and to gain a better understanding of the mechanisms and
constraints involved. Models were estimated both by maximum likelihood and a
discriminative method, and a method for reassignment of the membrane helix
boundaries was developed. In a cross validated test on single sequences, our
transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of
the sequences in a standard dataset of 83 proteins with known topology. The
same accuracy was achieved on a larger dataset of 160 proteins. These results
compare favourably with existing methods.
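How an HMM assigns a state to each residue can be illustrated with a toy Viterbi decoder. The two-state model below (M for membrane, L for loop, emitting H for hydrophobic or P for polar residues) and all of its probabilities are invented for illustration; it is far simpler than the seven-state cyclic architecture described in the abstract and is not the TMHMM model.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, computed in log space.
    State names are single characters so the path is returned as a string."""
    if not obs:
        return ""
    # score[s]: log-probability of the best path ending in state s.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one back-pointer table per position after the first
    for symbol in obs[1:]:
        prev_score = score
        score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev_score[r] + math.log(trans_p[r][s]))
            score[s] = (prev_score[best_prev] + math.log(trans_p[best_prev][s])
                        + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# Toy parameters (assumptions, not estimated from data).
states = ["M", "L"]
start_p = {"M": 0.5, "L": 0.5}
trans_p = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit_p = {"M": {"H": 0.8, "P": 0.2}, "L": {"H": 0.2, "P": 0.8}}

path = viterbi("HHHHPPPP", states, start_p, trans_p, emit_p)  # "MMMMLLLL"
```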
- J. A. Hertz,
A. Krogh, B. Lautrup, and T. Lehmann.
Non-linear backpropagation: Doing back-propagation
without derivatives of the activation function.
IEEE Transactions on Neural Networks, 8(6):1321-1327, 1997.
(PDF, 170227 bytes)
The conventional linear back-propagation algorithm is replaced by a
non-linear version, which avoids the necessity for calculating the derivative
of the activation function. This may be exploited in hardware realizations of
neural processors. In this paper we derive the non-linear back-propagation
algorithms in the framework of recurrent back-propagation and present some
numerical simulations of feed-forward networks on the NetTalk problem. A
discussion of implementation in analog VLSI electronics concludes the
paper.
- A. Krogh.
Two methods for improving performance of a HMM and
their application for gene finding.
In T. Gaasterland, P. Karp, K. Karplus, C. Ouzounis, C. Sander, and
A. Valencia, editors, Proceedings of the Fifth International Conference
on Intelligent Systems for Molecular Biology, pages 179-186, Menlo
Park, CA, 1997. AAAI Press.
(Gzipped PostScript, 8 pages, 41953 bytes)
(PDF, 145092 bytes)
A hidden Markov model for gene finding consists of submodels for
coding regions, splice sites, introns, intergenic regions and possibly more.
It is described how to estimate the model as a whole from labeled sequences
instead of estimating the individual parts independently from subsequences.
It is argued that the standard maximum likelihood estimation criterion is not
optimal for training such a model. Instead of maximizing the probability of
the DNA sequence, one should maximize the probability of the correct
prediction. Such a criterion, called conditional maximum likelihood, is used
for the gene finder 'HMMgene'. A new (approximative) algorithm is described,
which finds the most probable prediction summed over all paths yielding the
same prediction. We show that these methods contribute significantly to the
high performance of HMMgene.
- A. Krogh
and P. Sollich.
Statistical mechanics of ensemble learning.
Physical Review E, 55(1):811-825, 1997.
(PDF, 297146 bytes)
Within the context of learning a rule from examples, we study the
general characteristics of learning with ensembles. The generalization
performance achieved by a simple model ensemble of linear students is
calculated exactly in the thermodynamic limit of a large number of input
components, and shows a surprisingly rich behaviour. Our main findings are:
For learning in large ensembles, it is advantageous to use under-regularized
students, which actually over-fit the training data. Globally optimal
generalization performance can be obtained by choosing the training set sizes
of the students optimally. For smaller ensembles, optimization of the
ensemble weights can yield significant improvements in ensemble
generalization performance, in particular if the individual students are
subject to noise in the training process. Choosing students with a wide range
of regularization parameters makes this improvement robust against changes in
the unknown level of corruption of the training data.
- S. K. Riis and
A. Krogh.
Hidden neural networks: A framework for HMM/NN
hybrids.
In Proceedings of IEEE International Conference on Acoustics, Speech, and
Signal Processing, pages 3233-3236, New York, USA, 1997. IEEE.
(PDF, 446055 bytes)
This paper presents a general framework for hybrids of Hidden
Markov models (HMM) and neural networks (NN). In the new framework called
Hidden Neural Networks (HNN) the usual HMM probability parameters are
replaced by neural network outputs. To ensure a probabilistic interpretation
the HNN is normalized globally as opposed to the local normalization enforced
on parameters in standard HMMs. Furthermore, all parameters in the HNN are
estimated simultaneously according to the discriminative conditional maximum
likelihood (CML) criterion. The HNNs show clear performance gains compared to
standard HMMs on TIMIT continuous speech recognition benchmarks. On the task
of recognizing five broad phoneme classes an accuracy of 84% is obtained
compared to 76% for a standard HMM. Additionally, we report a preliminary
result of 69% accuracy on the TIMIT 39 phoneme task.
- P. Baldi, S. Brunak,
Y. Chauvin, and A. Krogh.
Hidden Markov models for human genes: Periodic patterns
in exon sequences.
In S. Suhai, editor, Theoretical and Computational Methods in Genome
Research, N.Y., 1996. Plenum Press.
- P. Baldi, S. Brunak,
Y. Chauvin, and A. Krogh.
Naturally occurring nucleosome positioning signals in
human exons.
Journal of Molecular Biology, 263:503-510, 1996.
We describe the structural implications of a periodic pattern found
in human exons and introns by hidden Markov models. We show that exons
(besides the reading frame) have a specific sequential structure in the form
of a pattern with triplet consensus non-T(A/T)G, and a minimal periodicity of
roughly ten nucleotides. The periodic pattern is also present in intron
sequences, although the strength per nucleotide is weaker. Using two
independent profile methods based on triplet bendability parameters from
DNase I experiments and nucleosome positioning data, we show that the pattern
in multiple alignments of internal exon and intron sequences corresponds to a
periodic "in phase" bending potential towards the major groove of the DNA.
The nucleosome positioning data show that the consensus triplets (and their
complements) have a preference for locations on a bent double helix where the
major groove faces inward and is compressed. The in-phase triplets are
located adjacent to GCC/GGC triplets known to have the strongest bias in
their positioning on the nucleosome. Analysis of mRNA sequences encoding
proteins with known tertiary structure excludes the possibility that the
pattern is a consequence of the previously well-known periodicity caused by
the encoding of alpha-helices in proteins. Finally, we discuss the relation
between the bending potential of coding and non-coding regions and its impact
on the translational positioning of nucleosomes and the recognition of genes
by the transcriptional machinery.
- R. Hughey and
A. Krogh.
Hidden Markov models for sequence analysis: extension
and analysis of the basic method.
CABIOS, 12:95-107, 1996.
(Gzipped PostScript, 23 pages, 272331 bytes)
(PDF, 710025 bytes)
Hidden Markov models (HMMs) are a highly effective means of
modeling a family of unaligned sequences or a common motif within a set of
unaligned sequences. The trained HMM can then be used for discrimination or
multiple alignment. The basic mathematical description of an HMM and its
expectation-maximization training procedure is relatively straightforward. In
this paper, we review the mathematical extensions and heuristics that move
the method from the theoretical to the practical. We then experimentally
analyze the effectiveness of model regularization, dynamic model modification
and optimization strategies. Finally it is demonstrated on the SH2 domain how
a domain can be found from unaligned sequences using a special model type.
The experimental work was completed with the aid of the Sequence Alignment
and Modeling software suite.
- A. Krogh and
S. K. Riis.
Prediction of beta sheets in proteins.
In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in
Neural Information Processing Systems, volume 8, pages 917-923,
Cambridge, MA, USA, 1996. MIT Press.
(Gzipped PostScript, 7 pages, 124377 bytes)
(PDF, 1364585 bytes)
Most current methods for prediction of protein secondary structure
use a small window of the protein sequence to predict the structure of the
central amino acid. We describe a new method for prediction of the non-local
structure called beta-sheet, which consists of two or more beta-strands that
are connected by hydrogen bonds. Since beta-strands are often widely
separated in the protein chain, a network with two windows is introduced.
After training on a set of proteins the network predicts the sheets well, but
there are many false positives. By using a global energy function the
beta-sheet prediction is combined with a local prediction of the three
secondary structures alpha-helix, beta-strand and coil. The energy function
is minimized using simulated annealing to give a final
prediction.
- S. K. Riis and
A. Krogh.
Improving prediction of protein secondary structure using
structured neural networks and multiple sequence alignments.
Journal of Computational Biology, 3:163-183, 1996.
(Gzipped PostScript, 27 pages, 112928 bytes)
(PDF, 308269 bytes)
The prediction of protein secondary structure by use of carefully
structured neural networks and multiple sequence alignments has been
investigated. Separate networks are used for predicting the three secondary
structures alpha-helix, beta-strand, and coil. The networks are designed
using a priori knowledge of amino acid properties with respect to the
secondary structure and the characteristic periodicity in alpha-helices.
Since these single-structure networks all have less than 600 adjustable
weights, overfitting is avoided. To obtain a three-state prediction of
alpha-helix, beta-strand, or coil, ensembles of single-structure networks are
combined with another neural network. This method gives an overall prediction
accuracy of 66.3% when using 7-fold cross-validation on a database of 126
nonhomologous globular proteins. Applying the method to multiple sequence
alignments of homologous proteins increases the prediction accuracy
significantly to 71.3% with corresponding Matthews correlation coefficients
C_alpha = 0.59, C_beta = 0.52, and C_c = 0.50. More than 72% of the residues
in the database are predicted with an accuracy of 80%. It is shown that the
network outputs can be interpreted as estimated probabilities of correct
prediction, and, therefore, these numbers indicate which residues are
predicted with high confidence.
- S. K. Riis and
A. Krogh.
Joint estimation of parameters in hidden neural
networks.
In Proceedings of IEEE Nordic Signal Processing Symposium, pages
431-434. IEEE, 1996.
(Gzipped PostScript, 4 pages, 64324 bytes)
(PDF, 169200 bytes)
- K. Sjölander,
K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian, and
D. Haussler.
Dirichlet mixtures: A method for improved detection of
weak but significant protein sequence homology.
CABIOS, 12(4):327-345, 1996.
(Gzipped PostScript, 33 pages, 105486 bytes)
(PDF, 364242 bytes)
We present a method for condensing the information in multiple
alignments of proteins into a mixture of Dirichlet densities over amino acid
distributions. Dirichlet mixture densities are designed to be combined with
observed amino acid frequencies to form estimates of expected amino acid
probabilities at each position in a profile, hidden Markov model or other
statistical model. These estimates give a statistical model greater
generalization capacity, so that remotely related family members can be more
reliably recognized by the model. This paper corrects the previously
published formula for estimating these expected probabilities, and contains
complete derivations of the Dirichlet mixture formulas, methods for
optimizing the mixtures to match particular databases, and suggestions for
efficient implementation.
- P. Sollich and
A. Krogh.
Learning with ensembles: How over-fitting can be
useful.
In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in
Neural Information Processing Systems, volume 8, pages 190-196,
Cambridge, MA, USA, 1996. MIT Press.
(Gzipped PostScript, 7 pages, 57751 bytes)
(PDF, 155044 bytes)
We study the characteristics of learning with ensembles. Solving
exactly the simple model of an ensemble of linear students, we find
surprisingly rich behaviour. For learning in large ensembles, it is
advantageous to use under-regularized students, which actually over-fit the
training data. Globally optimal performance can be obtained by choosing the
training set sizes of the students appropriately. For smaller ensembles,
optimization of the ensemble weights can yield significant improvements in
ensemble generalization performance, in particular if the individual students
are subject to noise in the training process. Choosing students with a wide
range of regularization parameters makes this improvement robust against
changes in the unknown level of noise in the training data.
- P. Baldi, S. Brunak,
Y. Chauvin, J. Engelbrecht, and A. Krogh.
Periodic sequence patterns in human exons.
In C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and S. Wodak,
editors, Proceedings of the Third International Conference on
Intelligent Systems for Molecular Biology, pages 30-38, Menlo Park,
CA, July 1995. AAAI Press.
(Gzipped PostScript, 9 pages, 75028 bytes)
(PDF, 465226 bytes)
We analyse the sequential structure of human exons and their
flanking introns by hidden Markov models. Together, models of donor site
regions, acceptor site regions and flanked internal exons show that
exons, besides the reading frame, hold a specific periodic pattern. The
pattern, which has the consensus: non-T(A/T)G and a minimal periodicity of
roughly 10 nucleotides, is not a consequence of the nucleotide statistics in
the three codon positions, nor of the well known nucleosome positioning
signal. We discuss the relation between the pattern and other known sequence
elements responsible for the intrinsic bending or curvature of
DNA.
- A. Krogh and G. Mitchison.
Maximum entropy weighting of aligned sequences of
proteins or DNA.
In C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and S. Wodak,
editors, Proceedings of the Third International Conference on
Intelligent Systems for Molecular Biology, pages 215-221, Menlo Park,
CA, July 1995. AAAI Press.
(Gzipped PostScript, 7 pages, 46748 bytes)
(PDF, 168255 bytes)
In a family of proteins or other biological sequences like DNA the
various subfamilies are often very unevenly represented. For this reason a
scheme for assigning weights to each sequence can greatly improve performance
at tasks such as database searching with profiles or other consensus models
based on multiple alignments. A new weighting scheme for this type of
database search is proposed. In a statistical description of the searching
problem it is derived from the maximum entropy principle. It can be proved
that, in a certain sense, it corrects for uneven representation. It is shown
that finding the maximum entropy weights is an easy optimization problem for
which standard techniques are applicable.
- A. Krogh
and J. Vedelsby.
Neural network ensembles, cross validation and active
learning.
In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural
Information Processing Systems, pages 231-238, Cambridge MA, 1995.
MIT Press.
(Gzipped PostScript, 8 pages, 66341 bytes)
(PDF, 237918 bytes)
Learning of continuous valued functions using neural network
ensembles (committees) can give improved accuracy, reliable estimation of the
generalization error, and active learning. The ambiguity is defined as the
variation of the output of ensemble members averaged over unlabeled data, so
it quantifies the disagreement among the networks. It is discussed how to use
the ambiguity in combination with cross-validation to give a reliable
estimate of the ensemble generalization error, and how this type of ensemble
cross-validation can sometimes improve performance. It is shown how to
estimate the optimal weights of the ensemble members using unlabeled data. By
a generalization of query by committee, it is finally shown how the ambiguity
can be used to select new training data to be labeled in an active learning
scheme.
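The role of the ambiguity can be illustrated with a small numerical check. The following is a minimal sketch, not the authors' code: it assumes the ambiguity of a member is its squared deviation from the weighted ensemble output, and it uses made-up predictions, weights and targets. Under that assumption, and with weights summing to one, the ensemble squared error equals the weighted average member error minus the weighted average ambiguity. The function name `ensemble_decomposition` is hypothetical.

```python
def ensemble_decomposition(predictions, weights, targets):
    """Return (ensemble error, average member error, average ambiguity).

    predictions[a][i] is the output of ensemble member a on input i.
    weights are assumed non-negative and to sum to one.
    """
    n = len(targets)
    # Weighted ensemble output on each input.
    ens = [sum(w * p[i] for w, p in zip(weights, predictions)) for i in range(n)]
    # Mean squared error of the ensemble output.
    ens_err = sum((e - t) ** 2 for e, t in zip(ens, targets)) / n
    # Weighted average of the members' own mean squared errors.
    avg_err = sum(
        w * sum((p[i] - targets[i]) ** 2 for i in range(n)) / n
        for w, p in zip(weights, predictions)
    )
    # Weighted average ambiguity: spread of members around the ensemble output.
    # Note that this term does not use the targets at all.
    ambiguity = sum(
        w * sum((p[i] - ens[i]) ** 2 for i in range(n)) / n
        for w, p in zip(weights, predictions)
    )
    return ens_err, avg_err, ambiguity


# Illustrative (made-up) data: three members, three inputs.
predictions = [[1.0, 2.0, 3.0], [1.5, 1.5, 2.5], [0.5, 2.5, 3.5]]
weights = [0.5, 0.25, 0.25]
targets = [1.0, 2.0, 3.0]

ens_err, avg_err, ambiguity = ensemble_decomposition(predictions, weights, targets)
print(ens_err, avg_err - ambiguity)
```

Because the ambiguity term needs no targets, it can be computed on unlabeled inputs, which is what makes it usable for error estimation and for choosing new points to label.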
- P. Baldi, S. Brunak,
Y. Chauvin, J. Engelbrecht, and A. Krogh.
Hidden Markov models for human genes.
In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural
Information Processing Systems, volume 6, pages 761-768, San
Francisco, CA, 1994. Morgan Kaufmann Publishers.
- N. Cesa-Bianchi, A. Krogh, and M. K. Warmuth.
Bounds on approximate steepest descent for likelihood
maximization in exponential families.
IEEE Transactions on Information Theory, 40(4):1215-1220, July
1994.
(PDF, 344865 bytes)
An approximate steepest descent strategy converging, in families of
regular exponential densities, to maximum likelihood estimates of density
functions is described. These density estimates are also obtained by an
application of the principle of minimum relative entropy subject to empirical
constraints. We prove tight bounds on the increase of the log-likelihood at
each iteration of our strategy for families of exponential densities whose
log-densities are spanned by a set of bounded basis
functions.
- A. Krogh.
Hidden Markov models for labeled sequences.
In Proceedings of the 12th IAPR International Conference on Pattern
Recognition, pages 140-144, Los Alamitos, California, October 1994.
IEEE Computer Society Press.
(PDF, 431509 bytes)
A hidden Markov model for labeled observations, called a CHMM, is
introduced and a maximum likelihood method is developed for estimating the
parameters of the model. Instead of training it to model the statistics of
the training sequences it is trained to optimize recognition. It resembles
MMI training, but is more general, and has MMI as a special case. The
standard forward-backward procedure for estimating the model cannot be
generalized directly, but an "incremental EM" method is
proposed.
- A. Krogh,
M. Brown, I. S. Mian, K. Sjölander, and D. Haussler.
Hidden Markov models in computational biology:
Applications to protein modeling.
Journal of Molecular Biology, 235:1501-1531, Feb. 1994.
(PDF, 1961051 bytes)
Hidden Markov Models (HMMs) are applied to the problems of
statistical modeling, database searching and multiple sequence alignment of
protein families and protein domains. These methods are demonstrated on the
globin family, the protein kinase catalytic domain, and the EF-hand calcium
binding motif. In each case the parameters of an HMM are estimated from a
training set of unaligned sequences. After the HMM is built, it is used to
obtain a multiple alignment of all the training sequences. It is also used to
search the SWISS-PROT 22 database for other sequences that are members of the
given protein family, or contain the given domain. The HMM produces multiple
alignments of good quality that agree closely with the alignments produced by
programs that incorporate three-dimensional structural information. When
employed in discrimination tests (by examining how closely the sequences in a
database fit the globin, kinase and EF-hand HMMs), the HMM is able to
distinguish members of these families from non-members with a high degree of
accuracy. Both the HMM and PROFILESEARCH (a technique used to search for
relationships between a protein sequence and multiply aligned sequences)
perform better in these tests than PROSITE (a dictionary of sites and
patterns in proteins). The HMM appears to have a slight advantage over
PROFILESEARCH in terms of lower rates of false negatives and false positives,
even though the HMM is trained using only unaligned sequences, whereas
PROFILESEARCH requires aligned training sequences. Our results suggest the
presence of an EF-hand calcium binding motif in a highly conserved and
evolutionarily preserved putative intracellular region of 155 residues in the
alpha-1 subunit of L-type calcium channels which play an important role in
excitation-contraction coupling. This region has been suggested to contain
the functional domains that are typical or essential for all L-type calcium
channels regardless of whether they couple to ryanodine receptors, conduct
ions or both.
- A. Krogh, I. S.
Mian, and D. Haussler.
A hidden Markov model that finds genes in E.
coli DNA.
Nucleic Acids Research, 22:4768-4778, 1994.
(Gzipped PostScript, 22 pages, 104898 bytes)
(PDF, 1425261 bytes)
A hidden Markov model (HMM) has been developed to find protein
coding genes in E. coli DNA using E. coli genome DNA sequence from the
EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model
the codons and their frequencies in E. coli genes, as well as the patterns
found in the intergenic region, including repetitive extragenic palindromic
sequences and the Shine-Dalgarno motif. To account for potential sequencing
errors and/or frameshifts in raw genomic DNA sequence, it allows for the
(very unlikely) possibility of insertions and deletions of individual
nucleotides within a codon. The parameters of the HMM are estimated using
approximately one million nucleotides of annotated DNA in EcoSeq6 and the
model tested on a disjoint set of contigs containing about 325,000
nucleotides. The HMM finds the exact locations of about 80% of the known E.
coli genes, and approximate locations for about 10%. It also finds several
potentially new genes, and locates several places where insertion or deletion
errors and/or frameshifts may be present in the contigs.
- M. Brown, R. Hughey,
A. Krogh, I. S. Mian, K. Sjölander, and D. Haussler.
Using Dirichlet mixture priors to derive hidden
Markov models for protein families.
In L. Hunter, D. Searls, and J. Shavlik, editors, Proceedings of the
First International Conference on Intelligent Systems for Molecular
Biology, pages 47-55, Menlo Park, CA, July 1993. AAAI/MIT Press.
A Bayesian method for estimating the amino acid distributions in
the states of a hidden Markov model (HMM) for a protein family or the columns
of a multiple alignment of that family is introduced. This method uses
Dirichlet mixture densities as priors over amino acid distributions. These
mixture densities are determined from examination of previously constructed
HMMs or multiple alignments. It is shown that this Bayesian method can
improve the quality of HMMs produced from small training sets. Specific
experiments on the EF-hand motif are reported, for which these priors are
shown to produce HMMs with higher likelihood on unseen data, and fewer false
positives and false negatives in a database search task.
- J. Gorodkin, L. K.
Hansen, A. Krogh, C. Svarer, and O. Winther.
A quantitative study of pruning by optimal brain
damage.
International Journal of Neural Systems, 4(2):159-169, 1993.
The optimal brain damage (OBD) scheme of Le Cun, Denker and Solla
for pruning of feedforward networks has been implemented and applied to the
contiguity classification problem. It is shown that OBD improves the learning
curve (the test error as a function of the number of examples). By inspecting
the architectures obtained through pruning, it is found that the networks
with fewer parameters have the smallest test error, in agreement with "Ockham's
Razor". Based on this, we propose a heuristic which selects the smallest
successful architecture among a group of pruned networks and we show that it
leads to very efficient optimization of the architecture. The validity of the
approximations involved in OBD is discussed and it is found that they are
surprisingly accurate for the problem studied.
- D. Haussler,
A. Krogh, I. S. Mian, and K. Sjölander.
Protein modeling using hidden Markov models: Analysis
of globins.
In T. N. Mudge, V. Milutinovic, and L. Hunter, editors, Proceedings of
the Twenty-Sixth Annual Hawaii International Conference on System
Sciences, volume 1, pages 792-802, Los Alamitos, California, 1993.
IEEE Computer Society Press.
(PDF, 935916 bytes)
We apply Hidden Markov Models (HMMs) to the problem of statistical
modeling and multiple alignment of protein families. A variant of the
Expectation Maximization (EM) algorithm known as the Viterbi algorithm is
used to obtain the statistical model from the unaligned sequences. In a
detailed series of experiments, we have taken 400 unaligned globin sequences,
and produced a statistical model entirely automatically from the primary
(unaligned) sequences using no prior knowledge of globin structure. The
produced model includes amino acid distributions for all the known positions
in the 7 major alpha-helices, as well as the probability of and average
length of insertions between these positions, and the probability that each
position is not present at all. Using this model, we obtained a multiple
alignment of the 400 sequences and 225 other globin sequences, that agrees
almost perfectly with a structural alignment by Bashford et al. This model
can also discriminate all these 625 globins from nonglobin protein sequences
with greater than 99% accuracy, and can thus be used for database
searches.
- A. Krogh.
Learning with noise in a linear perceptron.
Journal of Physics A, 25:1119-1133, 1992.
- A. Krogh and
J. A. Hertz.
Generalization in a linear perceptron in the presence of
noise.
Journal of Physics A, 25:1135-1147, 1992.
- A. Krogh and
J. A. Hertz.
A simple weight decay can improve
generalization.
In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in
Neural Information Processing Systems, volume 4, pages 950-957, San
Mateo, CA, 1992. Morgan Kaufmann Publishers.
(Gzipped PostScript, 8 pages, 36458 bytes)
(PDF, 137373 bytes)
It has been observed in numerical simulations that a weight decay
can improve generalization in a feed-forward neural network. This paper
explains why. It is proven that a weight decay has two effects in a linear
network. First, it suppresses any irrelevant components of the weight vector
by choosing the smallest vector that solves the learning problem. Second, if
the size is chosen right, a weight decay can suppress some of the effects of
static noise on the targets, which improves generalization quite a lot. It is
then shown how to extend these results to networks with hidden layers and
non-linear units. Finally the theory is confirmed by some numerical
simulations using the data from NetTalk.
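The first effect described above can be seen in a toy linear problem. This is a minimal sketch under stated assumptions, not the paper's experiment: a two-weight linear unit is trained by plain gradient descent on a single made-up example whose second input is always zero, so the second weight is irrelevant to the learning problem. The function `train` and all parameter values are hypothetical.

```python
def train(xs, ys, w, lr=0.1, decay=0.0, steps=5000):
    """Gradient descent on 0.5*sum((w.x - y)^2) + 0.5*decay*|w|^2."""
    w = list(w)
    for _ in range(steps):
        # Weight-decay term of the gradient.
        grad = [decay * wi for wi in w]
        # Squared-error term of the gradient, summed over the examples.
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# One training example; the second input component is always zero,
# so the data say nothing about the second weight.
xs = [(1.0, 0.0)]
ys = [1.0]
w_init = (0.0, 5.0)

w_plain = train(xs, ys, w_init, decay=0.0)
w_decay = train(xs, ys, w_init, decay=0.1)
print(w_plain, w_decay)
```

Without decay the irrelevant weight keeps its initial value, since the data exert no gradient on it. With decay it shrinks toward zero, at the cost of a small bias in the relevant weight, which settles near 1/(1 + decay) instead of 1.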
- J. A. Hertz and
A. Krogh.
Statistical dynamics of learning.
In T. Kohonen et al., editors, Artificial Neural Networks,
volume 1, pages 125-131. Elsevier, 1991.
Proceedings ICANN-91, Helsinki.
- J. A. Hertz, A. Krogh,
and R. Palmer.
Introduction to the Theory of Neural
Computation.
Addison-Wesley, Redwood City, 1991.
- A. Krogh.
Learning and Generalization in Neural
Networks.
PhD thesis, The Niels Bohr Institute, University of Copenhagen, Denmark,
1991.
- A. Krogh and
J. A. Hertz.
Dynamics of generalization in linear
perceptrons.
In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in
Neural Information Processing Systems, volume 3, pages 897-903, San
Mateo, CA, 1991. Morgan Kaufmann Publishers.
- A. Krogh and
J. A. Hertz.
Hebbian learning of principal components.
In R. Eckmiller, G. Hartmann, and G. Hauske, editors, Parallel Processing
in Neural Systems and Computers, pages 183-186. Elsevier,
1990.
- A. Krogh, G. I.
Thorbergsson, and J. A. Hertz.
A cost function for internal representations.
In D. S. Touretzky, editor, Advances in Neural Information Processing
Systems, volume 2, pages 733-740, San Mateo, 1990. Morgan
Kaufmann.
- J. A. Hertz,
A. Krogh, and G. I. Thorbergsson.
Phase transitions in simple learning.
Journal of Physics A, 22:2133-2150, 1989.
- J. A. Hertz, G. I.
Thorbergsson, and A. Krogh.
Dynamics of learning in simple perceptrons.
Physica Scripta, T25:149-151, 1989.
- A. Krogh and
J. A. Hertz.
Hierarchical associative networks.
In L. Personnaz and G. Dreyfus, editors, Neural Networks from Models to
Applications, pages 336-342, Paris, 1989. I.D.S.E.T.
Proceedings of nEuro'88, Paris, June 1988.
- A. Krogh and
J. A. Hertz.
Mean field analysis of hierarchical associative networks
with magnetization.
Journal of Physics A, 21:2211-2224, 1988.
- C. Cortes,
A. Krogh, and J. A. Hertz.
Hierarchical associative networks.
Journal of Physics A, 20:4449-4455, 1987.
- A. Krogh.
Neurale netværk: En hierarkisk model.
Master's thesis, The Niels Bohr Institute, University of Copenhagen, Denmark,
1987.
In Danish.