An incomplete, personal collection of literature on the Maximum Entropy approach to a few biological problems.
That correlation does not imply causation is well known. Correlation between variables in data, e.g. frequencies of residue co-occurrence in protein families or correlation of gene expression levels over many experiments, does not mean that they are causally linked. This is because correlation between non-causal variables may be induced by the chaining of correlations through a set of intervening, directly causal variables. Such "non-causal correlation" is well known in statistical physics, and much has been written about it.
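The chaining effect can be seen in a small numerical sketch. The three-variable Gaussian chain below (A influences B, B influences C, with a coupling of 0.8 chosen arbitrarily) is a toy assumption for illustration, not a model taken from any of the papers listed here. It only shows that A and C end up correlated although they are linked solely through B, and that a global quantity such as the partial correlation, read off the inverse of the correlation matrix, removes the indirect part.

```python
import numpy as np


def chain_demo(n=200_000, seed=0):
    """Simulate a toy chain A -> B -> C and compare plain vs. partial correlation."""
    rng = np.random.default_rng(seed)

    # A and C are coupled only through B; the weights keep every variance at 1.
    a = rng.normal(size=n)
    b = 0.8 * a + 0.6 * rng.normal(size=n)
    c = 0.8 * b + 0.6 * rng.normal(size=n)

    data = np.column_stack([a, b, c])
    corr = np.corrcoef(data, rowvar=False)

    # Partial correlations come from the inverse of the correlation matrix:
    # rho_ij|rest = -P_ij / sqrt(P_ii * P_jj)
    prec = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(prec))
    partial = -prec / np.outer(scale, scale)
    np.fill_diagonal(partial, 1.0)
    return corr, partial


corr, partial = chain_demo()
print("corr(A, C)        =", round(corr[0, 2], 3))     # indirect, yet clearly nonzero
print("partial(A, C | B) =", round(partial[0, 2], 3))  # close to zero
```

For this chain the expected correlation between A and C is 0.8 x 0.8 = 0.64, while the partial correlation given B is zero up to sampling noise. The direct pairs A-B and B-C keep a clearly nonzero partial correlation.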
The problem of weeding out non-causal correlation in observed biological data is especially important when we are dealing with highly coupled networks of variables. For instance, a protein residue might contact many other residues in 3D, and genes similarly may have an effect on very many other genes. Mutual information, where each pair of variables is looked at in isolation from all other pairs, will lead to many false high correlations, as one is treating the pairs as independent of each other. In some problems, e.g. predicting RNA secondary structure, this is not important. In contrast, for residue contacts in and across proteins and for gene expression networks it will be crucial. As yet, the maximum entropy approach is not that widespread in computational biology, but I think it will become so, especially as the amount of data increases. There are formal mathematical links from MaxEnt to "graphical models" and to the Bayesian network approach, but in my hands the latter is not as powerful.
This may have to do with the fact that the maximum entropy approach has an additional feature beyond being global: it does not need to assume that the lack of observation of a correlation means it could never happen. In that sense it is 'maximum entropy', since one is assuming the flattest possible distribution over all possible combinations, except where the data say otherwise. Or rather, the data are consistent with many probability distributions, and maximizing the entropy ensures that the global probability model is the flattest, least biased one for the data in hand.
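To make "the flattest distribution consistent with the data" concrete, here is one standard way of writing the problem for aligned sequences. The notation is an assumption of this sketch rather than that of any particular paper below: a sequence \sigma = (\sigma_1, ..., \sigma_L), observed single-column frequencies f_i(a), observed pair frequencies f_{ij}(a,b), and Lagrange multipliers h_i and J_{ij}.

```latex
\begin{aligned}
&\text{maximize}\quad S[P] = -\sum_{\sigma} P(\sigma)\,\ln P(\sigma) \\
&\text{subject to}\quad \sum_{\sigma} P(\sigma) = 1, \qquad
  \sum_{\sigma} P(\sigma)\,\delta_{\sigma_i,a} = f_i(a), \qquad
  \sum_{\sigma} P(\sigma)\,\delta_{\sigma_i,a}\,\delta_{\sigma_j,b} = f_{ij}(a,b) \\
&\Longrightarrow\quad P(\sigma) = \frac{1}{Z}\,
  \exp\Big( \sum_{i} h_i(\sigma_i) + \sum_{i<j} J_{ij}(\sigma_i,\sigma_j) \Big)
\end{aligned}
```

Here Z normalizes the distribution, and the h_i and J_{ij} are fixed by requiring that the model reproduce the observed frequencies. Every combination \sigma, including those never observed, keeps a nonzero probability.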
I can offer much more literature if you would like; below is a very limited set. Books are listed at the end.
1. Papers: Applications
This is a succinct exposition of the use of the maximum entropy formalism to predict 3D contacts in proteins and stability.
This paper was written in 2000/2001 but then buried at LANL. After we found out about the work (via Gary Stormo), we called Alan Lapedes and encouraged him to put his 1999/2000 work into arXiv, which he then did, in July 2012. The original, unpublished paper dates from 2001, which makes it the first successful use of MaxEnt for prediction of residue contacts in proteins. It is also worth listening to a recorded talk he gave on the subject; it can be downloaded from the archives at http://online.itp.ucsb.edu/online/infobio01/lapedes/
Protein 3D Structure Computed from Evolutionary Sequence Variation.
Debora S. Marks, Lucy J. Colwell, Robert Sheridan, Thomas A. Hopf, Andrea Pagnani, Riccardo Zecchina, Chris Sander.
PLoS One. 2011.
Read the supplement too.
Direct-coupling analysis of residue coevolution captures native contacts across many protein families.
Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M.
Proc Natl Acad Sci U S A. 2011 Dec 6;108(49):E1293-301. doi: 10.1073/pnas.1111471108.
Read the supplement too.
Three-dimensional structures of membrane proteins from genomic sequencing.
Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS.
Cell. 2012 Jun 22;149(7):1607-21. doi: 10.1016/j.cell.2012.04.012. Epub 2012 May 10.
Weak pairwise correlations imply strongly correlated network states in a neural population.
Schneidman E, Berry MJ 2nd, Segev R, Bialek W.
Nature. 2006.
Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns.
Lezon TR, Banavar JR, Cieplak M, Maritan A, Fedoroff NV.
Proc Natl Acad Sci U S A. 2006 Dec 12;103(50)
Correlated Mutations in Models of Protein Sequences: Phylogenetic and Structural Effects.
Alan S. Lapedes, Bertrand G. Giraud, LonChang Liu and Gary D. Stormo.
Statistics in Molecular Biology, IMS Lecture Notes - Monograph Series (1999) Volume 33.
Nice toy model to show the effect and the solution.
2. Papers: Theory
Information Theory and Statistical Mechanics. E. T. Jaynes, 1957, part one.
Information Theory and Statistical Mechanics. E. T. Jaynes, 1957, part two.
These two papers will give you all you need to understand the use of the maximum entropy formalism, but may have to be read more than once.
Superadditive correlation.
Giraud BG, Heumann JM, Lapedes AS.
Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1999 May;59(5 Pt A):4983-91.
The very nice figure on page 2 says it all intuitively.
Information Theory and Statistical Physics.
E. T. Jaynes.
Brandeis Summer Institute Lectures in Theoretical Physics, 1962. Vol 3, page 181.
Nice background.
3. Recent
Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models.
Magnus Ekeberg, Cecilia Lövkvist, Yueheng Lan, Martin Weigt, Erik Aurell.
arXiv:1211.1281 [q-bio.QM]
This is still MaxEnt, but with an alternative approach to solving the equations for the Lagrange multipliers once the objective function is set up. It also plugs the computed probabilities into the norm of the matrix, rather than into a corrected MI formulation, to find correlated columns, if that is what one wants to do. We have an implementation of our code that uses this method of solving, which Erik shared with us, and we are also testing it. One thing is that it is much slower. Not sure yet whether that matters.
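A minimal sketch of what a norm-based score for column pairs can look like is given below. It assumes the fitted couplings are available as an array `J[i, j, a, b]` of shape (L, L, q, q); that layout, the function names, and the toy numbers are hypothetical and not taken from the paper's code. The zero-sum gauge shift and the average product correction are included as one common recipe; check the paper for the exact variant used there.

```python
import numpy as np


def frobenius_scores(J):
    """Frobenius norm of each q x q coupling block, after a zero-sum gauge shift.

    J is a hypothetical array of fitted couplings with shape (L, L, q, q).
    """
    # Remove row and column means of every block so that the score does not
    # depend on the arbitrary gauge in which the couplings were fitted.
    Jz = (J
          - J.mean(axis=2, keepdims=True)
          - J.mean(axis=3, keepdims=True)
          + J.mean(axis=(2, 3), keepdims=True))
    fn = np.sqrt((Jz ** 2).sum(axis=(2, 3)))
    np.fill_diagonal(fn, 0.0)  # a column is not scored against itself
    return fn


def average_product_correction(fn):
    """Subtract the background expected from each column's average score."""
    L = fn.shape[0]
    row_mean = fn.sum(axis=1) / (L - 1)
    col_mean = fn.sum(axis=0) / (L - 1)
    total_mean = fn.sum() / (L * (L - 1))
    corrected = fn - np.outer(row_mean, col_mean) / total_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected


# Toy usage: 5 columns, 3 states, weak random couplings plus one strong pair.
rng = np.random.default_rng(1)
L, q = 5, 3
J = np.zeros((L, L, q, q))
for i in range(L):
    for j in range(i + 1, L):
        J[i, j] = 0.01 * rng.normal(size=(q, q))
        J[j, i] = J[i, j].T
J[1, 3] += 2.0 * np.eye(q)
J[3, 1] = J[1, 3].T

scores = average_product_correction(frobenius_scores(J))
top_pair = np.unravel_index(np.argmax(scores), scores.shape)
print("highest-scoring column pair:", top_pair)
```

In this toy setup the strongly coupled pair of columns (1, 3) should come out on top. A block that is constant across all states contributes nothing after the gauge shift.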
4. Papers: Connection to Bayes
Bayesian Methods: General Background. An Introductory Tutorial.
E. T. Jaynes, 1996 (he died in 98).
Lovely philosophical background, from Herodotus, Bernoulli, Bayes, Laplace, Jeffreys, Cox and Shannon.
Nice list
http://bayes.wustl.edu/etj/articles/
5. Books
E T Jaynes. Probability Theory: The Logic of Science.
Buhlman.
Murphy