%% LyX 1.6.9 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[english]{article} \usepackage[T1]{fontenc} \usepackage[latin9]{inputenc} \usepackage[letterpaper]{geometry} \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm} \usepackage{babel} \usepackage{amsbsy} \usepackage{amssymb} \usepackage[numbers]{natbib} \usepackage[unicode=true,pdfusetitle, bookmarks=true,bookmarksnumbered=false,bookmarksopen=false, breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false] {hyperref} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. \usepackage{Sweave} \newenvironment{wrapsweave}{\par}{\par} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{Protein abundances vs. equilibrium activities} % so DOIs in bibliography show up as hyperlinks \newcommand*{\doi}[1]{\href{http://dx.doi.org/#1}{doi: #1}} \makeatother \begin{document} \title{Protein abundances vs. equilibrium activities} \author{Jeffrey M. Dick} \maketitle \section{Introduction} The \texttt{diagram()} function serves multiple purposes that might be confusing to the new user. From its name, we know that it produces diagrams of some sort. These are equilibrium chemical activity diagrams -- that is the primary purpose of the function. However, inspecting the arguments to the function reveals that the input values are the affinities of formation reactions of species in the system. How do we go from chemical affinities to chemical activities? This problem defines the purpose of two auxiliary functions (\texttt{equil.react()} and \texttt{equil.boltz()}) whose algorithms are described below. Some explanation of terminology is in order. By chemical activity we mean the quantity $a_{i}$ that appears in the expression \begin{equation} \mu_{i}=\mu_{i}^{\circ}+RT\ln a_{i}\,,\label{eq:mu}\end{equation} where $\mu_{i}$ and $\mu_{i}^{\circ}$ stand for the chemical potential and the standard chemical potential of the $i$th species, and $R$ and $T$ represent the gas constant and the temperature in Kelvin. Chemical activity is related to molality ($m_{i}$) by\begin{equation} a_{i}=\gamma_{i}m_{i}\,,\label{eq:activity}\end{equation} where $\gamma_{i}$ stands for the activity coefficient of the $i$th species. For this discussion, we take $\gamma_{i}=1$ for all species, so chemical activity is assumed to be numerically equivalent to molality. Since molality is a measure of concentration, calculating the equilibrium chemical activities can be a theoretical tool to help understand the relative abundances of species, including proteins. After going over the methods used in CHNOSZ for equilibrium activity calculations, some comparisons with experimental protein abundance data are made. \section{Calculations at a single point} Here we discuss two procedures for calculating equilibrium activities of species. The first is a reaction-matrix approach and the second takes advantage of the Boltzmann distribution. We show (by example) that the two approaches are equivalent when the formation reactions of residue equivalents of proteins are used. The example system here has also been described in a paper \citep{Dic08}. \subsection{Reaction-matrix approach} The next two sections give examples of calculating the equilibrium activities of two proteins using a matrix of equations representing reactions to form the proteins. Although the examples below include only two proteins, each additional protein introduces one more equation and unknown, so this procedure can be carried out for any number of proteins given the necessary computational requirements. \subsubsection{Whole proteins} Let us calculate the equilibrium activities of two proteins in metastable equilibrium. To do this we start by writing the formation reactions of each protein as\begin{equation} stuff_{3}\rightleftharpoons\mathrm{CSG\_METVO}\,\label{react:CSG_METVO}\end{equation} and\begin{equation} stuff_{4}\rightleftharpoons\mathrm{CSG\_METJA}\,.\label{react:CSG_METJA}\end{equation} The basis species in the reactions are collectively symbolized by $stuff$; the subscripts simply refer to the reaction number in this document. In these examples, $stuff$ consists of $\mathrm{CO_{2}}$, $\mathrm{H_{2}O}$, $\mathrm{NH_{3}}$, $\mathrm{O_{2}}$, $\mathrm{H_{2}S}$ and $\mathrm{H^{+}}$ in different molar proportions. To see what $stuff$ is, try out these commands in CHNOSZ: \begin{wrapsweave} <>= library(CHNOSZ) basis("CHNOS+") species("CSG",c("METVO","METJA")) @ \end{wrapsweave} Although the basis species are defined, the temperature is not yet specified, so it is not immediately possible to calculate the ionization states of the proteins. That is why the coefficient on $\mathrm{H^{+}}$ is zero in the output above. To see what the computed protein charges are at 25 $^{\circ}$C and 1 bar and at pH 7 (which is the opposite of the logarithm of activity of $\mathrm{H^{+}}$ in the basis species), try this: \begin{wrapsweave} <>= protein.info() @ \end{wrapsweave} Note that \texttt{affinity()} is called twice by \texttt{protein.info()}; this so that both charges and standard Gibbs energies of ionization of the proteins can be calculated. The \texttt{Z} values in the table are the charges of the proteins computed using the ionization constants of sidechain and terminal groups, and the \texttt{G.Z} values are the calculated Gibbs energies of formation of the ionized proteins \citep{DLH06}. The \texttt{ZC} values are the average oxidation states of carbon of the proteins. Let us now calculate the chemical affinities of formation of the ionized proteins: \begin{wrapsweave} <>= a <- affinity() a$values @ \end{wrapsweave} Since \texttt{affinity()} returns a list with a lot of information (such as the basis species and species definitions) the last command was written to only print the \texttt{values} part of that list. The values are actually dimensionless, i.e. $\boldsymbol{A}/2.303RT$. The affinities of the formation reactions above were calculated for a \emph{reference value of activity of the proteins, which is not the equilibrium value}. Those non-equilibrium activities were $10^{-3}$. How do we calculate the equilibrium values? Let us write specific statements of the expression for chemical affinity (2.303 is used here to stand for $\ln10$), \begin{equation} \boldsymbol{A}=2.303RT\log(K/Q)\,,\label{eq:affinity}\end{equation} for Reactions \ref{react:CSG_METVO} and \ref{react:CSG_METJA} as\begin{equation} \begin{array}{rl} \boldsymbol{A}_{3}/2.303RT & =\log K_{3}-\log Q_{3}\\ & =\log K_{3}+\log a_{stuff,3}-\log a_{\mathrm{CSG\_METVO}}\\ & =\boldsymbol{A}_{3}^{*}/2.303RT-\log a_{\mathrm{CSG\_METVO}}\end{array}\label{eq:A3_METVO}\end{equation} and \begin{equation} \begin{array}{rl} \boldsymbol{A}_{4}/2.303RT & =\log K_{4}-\log Q_{4}\\ & =\log K_{4}+\log a_{stuff,4}-\log a_{\mathrm{CSG\_METJA}}\\ & =\boldsymbol{A}_{4}^{*}/2.303RT-\log a_{\mathrm{CSG\_METJA}}\,.\end{array}\label{eq:A4_METJA}\end{equation} The $\boldsymbol{A}^{*}$ denote the affinities of the formation reactions when the activities of the proteins are zero. From the output above it follows that $\boldsymbol{A}_{3}^{*}/2.303RT=104.6774$ and $\boldsymbol{A}_{4}^{*}/2.303RT=314.1877$. Next we must specify how reactions are balanced in this system: what is conserved during transformations between species (let us call it the immobile component)? For proteins, one possibility is to use the repeating protein backbone group. Let us use $n_{i}$ to designate the number of residues in the $i$th protein, which is equal to the number of backbone groups, which is equal to the length of the sequence. If $\gamma_{i}=1$ in Eq. (\ref{eq:activity}), the relationship between the activity of the $i$th protein ($a_{i}$) and the activity of the residue equivalent of the $i$th protein ($a_{residue,i}$) is\begin{equation} a_{residue,i}=n_{i}\times a_{i}\,.\label{eq:a_residue_i}\end{equation} We can use this to write a statement of mass balance: \begin{equation} 553\times a_{\mathrm{CSG\_METVO}}+530\times a_{\mathrm{CSG\_METJA}}=1.083\,.\label{eq:a_residue}\end{equation} At equilibrium, the affinities of the formation reactions, per conserved quantity (in this case protein backbone groups) are equal. Therefore $\boldsymbol{A}=\boldsymbol{A}_{3}/553=\boldsymbol{A}_{4}/530$ is a condition for equilibrium. Combining this with Eqs. (\ref{eq:A3_METVO}) and (\ref{eq:A4_METJA}) gives\begin{equation} \boldsymbol{A}/2.303RT=\left(104.6774-\log a_{\mathrm{CSG\_METVO}}\right)/553\label{eq:A_METVO}\end{equation} and\begin{equation} \boldsymbol{A}/2.303RT=\left(314.1877-\log a_{\mathrm{CSG\_METJA}}\right)/530\,.\label{eq:A_METJA}\end{equation} Now we have three equations (\ref{eq:a_residue}--\ref{eq:A_METJA}) with three unknowns. The solution can be displayed in CHNOSZ as follows. The argument \texttt{residue=FALSE} overrides the default setting for \texttt{diagram} when proteins are the species of interest and instructs it to use the function named \texttt{equil.react()}, which implements the equation-solving strategy described in the next section. Here we retrieve the equilibrium activities using \texttt{diagram()} without letting it actually do any plotting. \begin{wrapsweave} <>= d <- diagram(a,residue=FALSE,do.plot=FALSE) d$logact @ \end{wrapsweave} Those are the logarithms of the equilibrium activities of the proteins. Combining these values with either Eqs. (\ref{eq:A_METVO}) or (\ref{eq:A_METJA}) gives us the same value for affinity of the formation reactions per residue (or per protein backbone group), $\boldsymbol{A}/2.303RT=0.5978817$. Equilibrium activities that differ by such great magnitudes make it appear that the proteins are very unlikely to coexist in metastable equilibrium. Later we explain the concept of using residue equivalents of the proteins to achieve a different result. \subsubsection{Implementing the reaction-matrix approach} The implementation used in CHNOSZ for finding a solution to the system of equations relies on a difference function for the activity of the immobile component. The steps to obtain this difference function are: \begin{enumerate} \item Set the total activity of the immobile (conserved) component as $a_{\mathrm{ic}}$ (e.g., the 1.083 in Eqn. \ref{eq:a_residue}). \item Write a function for the logarithm of activity of each of the species of interest: $\boldsymbol{A}=\left(\boldsymbol{A}_{i}^{*}-2.303RT\log a_{i}\right)/n_{\mathrm{ic},i}$, where $n_{\mathrm{ic},i}$ stands for the number of moles of the immobile component that react in the formation of one mole of the $i$th species. (e.g., for systems of proteins where the backbone group is conserved, $n_{\mathrm{ic},i}$ is the same as $n_{i}$ in Eq. \ref{eq:a_residue_i}). Calculate values for each of the $\boldsymbol{A}_{i}^{*}$. Metastable equilibrium is implied by the identity of $\boldsymbol{A}$ in all of the equations. \item Write a function for the total activity of the immobile component: $a_{\mathrm{ic}}^{'}=\sum n_{\mathrm{ic},i}a_{i}$. \item The difference function is now $\delta a_{\mathrm{ic}}=a_{\mathrm{ic}}^{'}-a_{\mathrm{ic}}$. \end{enumerate} Now all we have to do is solve for the value of $\boldsymbol{A}$ where $\delta a_{\mathrm{ic}}=0$. This is achieved in the code by first looking for a range of values of $\boldsymbol{A}$ where at one end $\delta a_{\mathrm{ic}}<0$ and at the other end $\delta a_{\mathrm{ic}}>0$, then using the \texttt{uniroot()} function that is part of R to find the solution. This approach is subject to failure if for all trial ranges of $\boldsymbol{A}$ the $\delta a_{\mathrm{ic}}$ are of the same sign, which gives an error message like {}``i tried it 1000 times but can't make it work''. Even if values of $\delta a_{\mathrm{ic}}$ on either side of zero can be located, the algorithm does not guarantee an accurate solution and may give a warning about poor convergence if a certain (currently hard-coded) tolerance is not reached. \subsubsection{Residue equivalents} Let us consider the formation reactions of the \emph{residue equivalents} of proteins, for example\begin{equation} stuff_{12}\rightleftharpoons\mathrm{CSG\_METVO(residue)}\,\label{react:CSG_METVO_residue}\end{equation} and\begin{equation} stuff_{13}\rightleftharpoons\mathrm{CSG\_METJA(residue)}\,.\label{react:CSG_METJA_residue}\end{equation} The formulas of the residue equivalents are those of the proteins divided by the number of residues in each protein. With the \texttt{residue.info()} function it is possible to see the coefficients on the basis species in these reactions: \begin{wrapsweave} <>= residue.info() @ \end{wrapsweave} Let us denote by $\boldsymbol{A}_{12}$ and $\boldsymbol{A}_{13}$ the chemical affinities of Reactions \ref{react:CSG_METVO_residue} and \ref{react:CSG_METJA_residue}. We can write \begin{equation} \boldsymbol{A}_{12}/2.303RT=\log K_{12}+\log a_{stuff,12}-\log a_{\mathrm{CSG\_METVO(residue)}}\label{eq:A3_METVO_residue}\end{equation} and \begin{equation} \boldsymbol{A}_{13}/2.303RT=\log K_{13}+\log a_{stuff,13}-\log a_{\mathrm{CSG\_METJA(residue)}}\,,\label{eq:A4_METJA_residue}\end{equation} For metastable equilibrium we have $\boldsymbol{A}_{12}/1=\boldsymbol{A}_{13}/1$. The $1$'s in the denominators are there as a reminder that we are still conserving residues, and that each reaction now is written for the formation of a single residue equivalent. So, let us write $\boldsymbol{A}$ for $\boldsymbol{A}_{12}=\boldsymbol{A}_{13}$ and also define $\boldsymbol{A}_{12}^{*}=\boldsymbol{A}_{12}+2.303RT\log a_{\mathrm{CSG\_METVO(residue)}}$ and $\boldsymbol{A}_{13}^{*}=\boldsymbol{A}_{13}+2.303RT\log a_{\mathrm{CSG\_METJA(residue)}}$. At the same temperature, pressure and activities of basis species and proteins as shown in the previous section, we can write $\boldsymbol{A}_{12}^{*}=\boldsymbol{A}_{3}^{*}/553=2.303RT\times0.1892901$ and $\boldsymbol{A}_{13}^{*}=\boldsymbol{A}_{4}^{*}/530=2.303RT\times0.5928069$ to give \begin{equation} \boldsymbol{A}/2.303RT=0.1892901-\log a_{\mathrm{CSG\_METVO(residue)}}\label{eq:A_METVO_residue}\end{equation} and\begin{equation} \boldsymbol{A}/2.303RT=0.5928069-\log a_{\mathrm{CSG\_METJA(residue)}}\,,\label{eq:A_METJA_residue}\end{equation} which are equivalent to Equations 12 and 13 in the paper \citep{Dic08} but with more decimal places shown. A third equation arises from the conservation of amino acid residues:\begin{equation} a_{\mathrm{CSG\_METVO(residue)}}+a_{\mathrm{CSG\_METJA(residue)}}=1.083\,.\label{eq:a_residue_2}\end{equation} The solution to these equations is $a_{\mathrm{CSG\_METVO(residue)}}=0.3065982$, $a_{\mathrm{CSG\_METJA(residue)}}=0.7764018$ and $\boldsymbol{A}/2.303RT=0.7027204$. The corresponding logarithms of activities of the proteins are $\log\left(0.307/553\right)=-3.256$ and $\log\left(0.776/530\right)=-2.834$. These activities of the proteins are much closer to each other than those calculated using formation reactions for whole protein formulas, so this result seems more compatible with the actual coexistence of proteins in nature. The approach just described is not used by \texttt{diagram()} when \texttt{residue=TRUE} (which is the default setting). Instead, the Boltzmann distribution, described next, is implemented for that situation. \subsection{Boltzmann distribution} An expression for Boltzmann distribution, relating equilibrium activities of species to the affinities of their formation reactions, can be written as (using the same definitions of the symbols above)\begin{equation} \frac{a_{i}}{\sum a_{i}}=\frac{e^{\boldsymbol{A}_{i}^{*}/RT}}{\sum e^{\boldsymbol{A}_{i}^{*}/RT}}\,.\label{eq:Boltzmann}\end{equation} Using this equation, we can very quickly (without setting up a system of equations) calculate the equilibrium activities of proteins using their residue equivalents. Above, we saw $\boldsymbol{A}_{12}^{*}/2.303RT=0.1892901$ and $\boldsymbol{A}_{13}^{*}/2.303RT=0.5928069$. Multiplying by $\ln10=2.302585$ gives $\boldsymbol{A}_{12}^{*}/RT=0.4358565$ and $\boldsymbol{A}_{13}^{*}/RT=1.364988$. We then have $e^{\boldsymbol{A}_{12}^{*}/RT}=1.546287$ and $e^{\boldsymbol{A}_{13}/RT}=3.915678$. This gives us $\sum e^{A_{i}^{*}/RT}=5.461965$, $a_{12}/\sum a_{i}=0.2831009$ and $a_{13}/\sum a_{i}=0.7168991$. Since $\sum a_{i}=1.083$, we arrive at $a_{12}=0.3065982$ and $a_{13}=0.7764018$, the same result as above. This example was also described in a recent paper \citep{DS11}. This computation can be carried out in CHNOSZ using the following commands, which implies \texttt{residue=TRUE} as the default setting for systems of proteins. This setting signifies to consider the formation reactions of the residue equivalents instead of the whole proteins, AND consequently to make a call to \texttt{equil.boltz()} rather than \texttt{equil.react()}. \begin{wrapsweave} <>= d <- diagram(a,do.plot=FALSE) as.numeric(d$logact) @ \end{wrapsweave} We can also specify \texttt{as.residue=TRUE} (which means to return the logarithms of activities of the residue equivalents rather than converting them to logarithms of activities of the proteins): \begin{wrapsweave} <>= d <- diagram(a,as.residue=TRUE,do.plot=FALSE) 10^as.numeric(d$logact) @ \end{wrapsweave} Although this example includes only two proteins, this procedure is suitable for calculating the metastable equilibrium activities of any number of proteins. \section{Calculations as a function of a single variable} A comparison of the outcomes of equilibrium calculations that do and do not use the residue equivalents for proteins was given in a publication \citep{Dic08}. An expanded version of a diagram in that paper is below (though, without labels on the figures). \begin{wrapsweave} <>= organisms <- c("METSC","METJA","METFE","HALJP","METVO","METBU","ACEKI","BACST","BACLI","AERSA") proteins <- c(rep("CSG",6),rep("SLAP",4)) basis("CHNOS+") species(proteins,organisms) a <- affinity(O2=c(-100,-65)) par(mfrow=c(2,1)) diagram(a,ylim=c(-5,-1),legend.x=NULL,residue=FALSE) title(main="Equilibrium activities of proteins, whole formulas") diagram(a,ylim=c(-5,-1),legend.x=NULL) title(main="Equilibrium activities of proteins, residue equivalents") @ \end{wrapsweave} The reaction-matrix approach described above can also be applied to systems having conservation coefficients that differ from unity, such as many mineral and inorganic systems, where the immobile component has different molar coefficients in the formulas. For example, consider a system like that described by Seewald, 1997 \citep{See96}: \begin{wrapsweave} <>= basis('CHNOS+') basis('pH',5) species(c('H2S','S2-2','S3-2','S2O3-2','S2O4-2','S3O6-2','S5O6-2','S2O6-2','HSO3-','SO2','HSO4-')) a <- affinity(O2=c(-50,-15),T=325,P=350) par(mfrow=c(2,1)) diagram(a,logact=-2,ylim=c(-30,0),legend.x="topleft",cex.names=0.8) title(main="Aqueous sulfur speciation, whole formulas") diagram(a,logact=-2,ylim=c(-30,0),legend.x="topleft",cex.names=0.8,residue=TRUE) title(main="Aqueous sulfur speciation, 'residue' equivalents") @ \end{wrapsweave} The first diagram is quantitatively similar to the one shown by Seewald, 1997, but in the second (where we have set \texttt{residue=TRUE}) the range of activities is lower at any given $\log f_{\mathrm{O_{2}}_{\left(g\right)}}$. There, the function was told to rewrite the formation reactions of the aqueous sulfur species for their residue equivalents in the same way the formation reactions for the proteins were rewritten above. The number of {}``residues'' in each species is the coefficient of the immobile component, in this case $\mathrm{H_{2}S}$, in the formation reaction. Maybe \texttt{residue=TRUE} doesn't make sense for systems where the formulas of species are similar in size to those of the basis species. For molecules as large as proteins it might be a useful concept. It is now (since CHNOSZ version 0.9) the defaut mode for \texttt{diagram()} when working with proteins. With the potential for calculating equilibrium activities of proteins comes the desire to compare these calculations to actual measurements! \section{Becoming Human} Let's look at some protein abundance levels in human blood plasma. First get going with the experimental data. In CHNOSZ is a table listing the upper limits of the intervals, or ranges, of protein abundances taken from figures available in Anderson and Anderson, 2002, 2003 \citep{AA02,AA03}. The protein abundances in the tables are in log10(pg/ml); let's convert that to molality. First locate the file with the abundance data. Then read it. Then identify the protein named {}``INS.C'' and drop it from the list. The reason for doing so is that preliminary calculations show it is much more stable than any other protein in the list. It is therefore an interesting outlier in terms of relative stabilities of the proteins. Then get the species indices of the proteins for thermodynamic calculations (with parameters based on amino acid compositions of the proteins listed in \texttt{thermo\$protein} ... and doing so quietly, without console messages that would fill up a whole page here). Then calculate the masses of the proteins. Then convert log10(pg/ml) to log10(mol/L) (logarithm of molarity). The conversion from pg/ml to g/L involves a factor of $10^{-9}$ ($\frac{10^{0}\mathrm{g}}{10^{12}\mathrm{pg}}\times\frac{10^{3}\mathrm{ml}}{10^{0}\mathrm{L}}$); then to get molarity we divide by mass ($\frac{\mathrm{g}}{\mathrm{mol}}$). \begin{wrapsweave} <>= f <- system.file("extdata/abundance/AA03.csv",package="CHNOSZ") pdata <- read.csv(f) pdrop <- which(pdata$name == "INS.C") pname <- pdata$name[-pdrop] iip <- info(paste(pname,"HUMAN",sep="_"),quiet=TRUE) pmass <- element(thermo$obigt$formula[iip])$mass loga.expt <- logm <- log10( 10^pdata$log10.pg.ml.[-pdrop] / 10^9 / pmass ) @ \end{wrapsweave} As implied by the {}``loga'', we are assuming for the comparisons offered below that molarity (derived from the published abundance data) can be taken to be equal to molality and that molality can equated with chemical activity. The latter equality (the assumption of ideal behavior) especially should be subject to more scrutiny. We'll go ahead anyway and calculate, for ideality, the equilibrium activities of the proteins. First we need to calculate the total activity of residues from the experimental data, but to do that we need even more firstly the lengths of the proteins. \begin{wrapsweave} <>= pl <- protein.length(paste(pname,"HUMAN",sep="_")) logares.tot <- sum(10^loga.expt * pl) @ \end{wrapsweave} Our total activity (\emph{not} the logarithm of it) of residues turns out to be about 200, which for our average protein length of 637 works out to about 0.3 for the average protein, if the total activity of residues could be attributed to that single average protein. Now let's get down to the stuff CHNOSZ is made for. First define the basis species. Then define the species, being the proteins; for some reason (most likely the existence of factors in R) the {}``as.character'' is needed to avoid an error. Then calculate the affinities of the formation reactions of the proteins. Then calculate the equilibrium activities, but don't plot them by themselves. Instead, use \texttt{revisit} to compare the equilibrium activities to the experimental abundances. \begin{wrapsweave} \setkeys{Gin}{width=0.6\textwidth} <>= basis("CHNOS+") species(as.character(pname),"HUMAN",quiet=TRUE) a <- affinity() d <- diagram(a,logact=logares.tot,do.plot=FALSE) pch <- as.numeric(pdata$type) revisit(d,"rmsd",loga.ref=loga.expt,pch=pch) legend("bottomright",pch=unique(pch),legend=unique(pdata$type)) @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} There seems to be almost no relation between the reference values and the calculated ones. But what if we increase the oxygen fugacity? $\log f_{\mathrm{O_{2}}_{\left(g\right)}}=-80$ might be appropriate for some subcellular conditions, or reduced hydrothermal systems. Blood is exposed to oxygen after all... let's try $\log f_{\mathrm{O_{2}}_{\left(g\right)}}=-60$. \begin{wrapsweave} \setkeys{Gin}{width=0.6\textwidth} <>= basis("O2",-60) a <- affinity() d <- diagram(a,logact=logares.tot,do.plot=FALSE) revisit(d,"rmsd",loga.ref=loga.expt,pch=pch) legend("topleft",pch=unique(pch),legend=unique(pdata$type)) @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} Well it's still quite scattered. However, the RMSD has decreased considerably, the loess fit has a positive slope, and the dynamic ranges of the calculations and observations are more similar. \section{Comparison with expression profile in \emph{E. coli}} Amino acid compositions of proteins in \emph{Escherichia coli} are provided with CHNOSZ at \texttt{extdata/protein/ECO.csv.xz}. Protein abundances in the cytosol of \emph{E. coli} reported by Ishihama et al., 2008 \citep{ISR+08} are provided with CHNOSZ at \texttt{extdata/abundances/ISR+08.csv.xz}. We can use \texttt{get.expr()} to retrieve the abundance data for all or only selected proteins, and also add these proteins to CHNOSZ's inventory (\texttt{thermo\$protein}) based on amino acid compositions from the \texttt{ECO.csv} file. First though we use \texttt{data(thermo)} to clear out the settings from the previous calculations. \begin{wrapsweave} <>= data(thermo) file <- system.file("extdata/abundance/ISR+08.csv",package="CHNOSZ") expr <- get.expr(file,"ID","emPAI","ECO",list(description="kinase")) range(expr$loga.ref) @ \end{wrapsweave} The result (\texttt{expr}) lists data for proteins where the \texttt{description} column of ISR+08.csv contains the term \texttt{kinase}. The list has elements named \texttt{id} (corresponding to the \texttt{ID} column of ISR+08.csv), \texttt{iprotein} (corresponding to the rownumber of the proteins in thermo\$protein) and \texttt{loga.ref} (logarithm of activity, corresponding to the \texttt{emPAI} column of ISR+08.csv, scaled so that total activity of residues is unity). Note that the ID's of five of the 36 proteins that are described as {}``kinase'' are not found in ECO.csv, so only 31 proteins are returned by the above call to \texttt{get.expr()}. The minimum and maximum values of the reference abundances are separated by over five orders of magnitude. Now we can calculate the metastable equilibrium activities of the proteins, setting the total activity of residues to unity. We then use \texttt{revisit()} to make a plot and compute the root mean square deviation between the experimental and calculated relative abundances. Since the equilibrium activities of the proteins were only calculated at a single point, \texttt{revisit()} here makes a scatter plot. The colors reflect the average oxidation state of carbon of the proteins (red -- more reduced, blue -- more oxidized). \begin{wrapsweave} \setkeys{Gin}{width=0.6\textwidth} <>= basis("CHNOS+") a <- affinity(iprotein=expr$iprotein) d <- diagram(a,logact=0,do.plot=FALSE) tp <- thermo$protein[expr$iprotein,] z <- ZC(protein.formula(tp)) col <- rgb(max(z)-z, 0, z-min(z), max=diff(range(z))) revisit(d,"rmsd",loga.ref=expr$loga.ref,pch=16,col=col) @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} How can the correlation be improved? We can find where the RMSD minimizes as a function of a single variable. Or let's go for two variables ... note that we have to specify \texttt{mam=FALSE} in the call to \texttt{diagram()} in this case: \begin{wrapsweave} \setkeys{Gin}{width=0.6\textwidth} <>= a <- affinity(O2=c(-90,-60),NH3=c(-35,0),iprotein=expr$iprotein) d <- diagram(a,logact=0,do.plot=FALSE,mam=FALSE) r <- revisit(d,"rmsd",loga.ref=expr$loga.ref) @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} Now set the activities of the basis species where the minimum RMSD was found, calculate the affinities and equilibrium activities, and compare the results with the reference abundances. \begin{wrapsweave} \setkeys{Gin}{width=0.6\textwidth} <>= basis(c("O2","NH3"),c(r$x,r$y)) a <- affinity(iprotein=expr$iprotein) d <- diagram(a,logact=0,do.plot=FALSE) revisit(d,"rmsd",loga.ref=expr$loga.ref,pch=16,col=col) @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} \section{Summary} Using default settings, equilibrium activities of proteins are calculated in CHNOSZ by converting formation reactions of proteins to their per-residue equivalents, then using the Boltzmann distribution to transform the affinities of the formation reactions (in an equal-activity reference state) to equilibrium activities (an equal-affinity reference state). The construction of 2-D predominance diagrams (for proteins or any other type of system) by default avoids calculating the equilibrium activities of species and instead identifies predominant species based on maximum affinity (after normalizing by the conservation coefficients). For systems of proteins, set\texttt{ mam=FALSE} in \texttt{diagram()} to run the activity calculations if these values are needed, such as in the \emph{E. coli} example above. If oxygen fugacity is raised from its default nominal setting in CHNOSZ, the dynamic range of equilibrium activities calculated for proteins in human plasma becomes similar to the observed reference abundances of the proteins, and a slight positive correlation emerges. Equilibrium activities of kinases in \emph{E. coli} cytosol have a dynamic range that is also similar to the observed abundances, but our findings so far imply going to a very low chemical potential of nitrogen (in terms of $\log a_{\mathrm{NH_{3}}_{\left(aq\right)}}$) to minimize the overall deviation. \section{Document Information} Revision history \begin{itemize} \item 2009-11-29 Initial version (Calculating relative abundances of proteins) \item 2011-06-20 Add human and \emph{E. coli} comparisons \end{itemize} R session information \begin{wrapsweave} <>= sessionInfo() @ \end{wrapsweave} \bibliographystyle{plainnat} \bibliography{vig} \end{document}