%% LyX 1.6.9 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[english]{article} \usepackage{lmodern} \usepackage[T1]{fontenc} \usepackage[latin9]{inputenc} \usepackage[letterpaper]{geometry} \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm} \usepackage{babel} \usepackage{amsbsy} \usepackage{amssymb} \usepackage[numbers]{natbib} \usepackage[unicode=true,pdfusetitle, bookmarks=true,bookmarksnumbered=false,bookmarksopen=false, breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false] {hyperref} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. \usepackage{Sweave} \newenvironment{wrapsweave}{\par}{\par} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{Hot-spring chemistry and model proteins from a metagenome} %\SweaveOpts{width=3,height=3} % so DOIs in bibliography show up as hyperlinks \newcommand*{\doi}[1]{\href{http://dx.doi.org/#1}{doi: #1}} \makeatother \begin{document} \title{Hot-spring chemistry and model proteins from a metagenome} \author{Jeffrey M. Dick and Everett L. Shock} \maketitle \begin{wrapsweave} <>= options(width=85,continue=" ") @ \end{wrapsweave} \section{Introduction} This vignette for the CHNOSZ package demonstrates calculations of the relative stabilities of proteins with amino acid compositions derived from metagenomic data in a hot spring. The calculations are described in more detail in a recent paper \citep{DS11}. The purpose of this document is to present a minimal amount of code to show how most of the figures in the paper can be generated. The code actually used to make the figures in the paper is available online as supporting information for the paper. {}``Bison Pool'' is a hot spring in Sentinel Meadows, in the Lower Geyser Basin of Yellowstone National Park. Environmental DNA sequences from biofilm samples at five sites along the outflow are available from the Joint Genome Institute's IMG/M system \citep{MIS+08}. IMG/M also provides the sequences of proteins inferred from the metagenome, with predicted functional annotations. The sample IDs used in the field, and also in IMG/M, have the letters N, S, R, Q, P, which refer to sites 1, 2, 3, 4, 5, in that order; site 1 is the boiling source of the hot spring and site 5 is $\sim$22 meters downstream where the water on some days was in the range of $\sim$55-60 $^{\circ}$C \citep{HRM+11}. The amino acid compositions of the model proteins are derived from the metagenome. The {}``overall'' model proteins have amino acid compositions that are the average of all protein sequences at any sampling site. There are also 20 different {}``classified'' model proteins for subsets of the metagenome that share keywords in the annotations. \section{General Setup} Load CHNOSZ. \begin{wrapsweave} <>= library(CHNOSZ) @ \end{wrapsweave} \begin{wrapsweave} <>= # there might be a stale version of thermo from running # previous vignettes during package build. reset it to be safe. data(thermo) @ \end{wrapsweave} After loading the package, the amino acid compositions of the model proteins are present in \texttt{thermo\$protein}. Assemble the classification terms using one of the site names. \begin{wrapsweave} <>= tp <- thermo$protein classes <- tp$protein[tp$organism=="bisonN"] classes @ \end{wrapsweave} Associate the sites with their letter codes. \begin{wrapsweave} <>= sites <- c("N","S","R","Q","P") sitenames <- paste("bison",sites,sep="") @ \end{wrapsweave} \section{Average Oxidation State of Carbon in Model Proteins} Plot the average oxidation state of carbon ($\overline{Z}_{\mathrm{C}}$) of each of the classes of model proteins. Initialize the plot, labeling the \emph{$x$}-axis at sites 1 to 5. Then set up colors for the lines (the {}``overall'' model proteins in green) and identify the classes of proteins to label on the plot. For model proteins of all classes and sites, get the rownumbers of the proteins (\texttt{ip}), then extract the rows from the \texttt{thermo\$protein} data frame (\texttt{p}), then calculate the chemical formulas of the proteins (\texttt{pf}), and finally calculate values of $\overline{Z}_{\mathrm{C}}$. Loop over the classes, plotting the lines and labels. After Fig. 4 in Ref. \citep{DS11}. \begin{wrapsweave} \setkeys{Gin}{width=0.7\textwidth} <>= plot(0,0,xlim=c(-1,6),ylim=c(-0.33,-0.11),xlab="site",ylab=expression(bar(italic(Z))[C]),axes=FALSE) axis(1,at=1:5) axis(2) box() col <- c("green",rep("black",20)) lwd <- c(3,rep(1,20)) clab <- c("hydrolase","overall","protease","oxidoreductase","transport","membrane","permease") ip <- protein(rep(classes,each=5),rep(sitenames,20)) p <- thermo$protein[ip,] pf <- protein.formula(p) ZC.p <- ZC(pf) for(i in 1:length(classes)) { lines(1:5,ZC.p[(1:5)+5*(i-1)],col=col[i],lwd=lwd[i]) if(classes[i] %in% clab) text(0.8,ZC.p[1+5*(i-1)],classes[i],adj=1) } @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} \section{Chemical Setting} \subsection{Basis Species} Use CHNOSZ's \texttt{basis} function to define the basis species. The basis species consist of $\mathrm{HCO_{3}^{-}}$, $\mathrm{H_{2}O}$, $\mathrm{NH_{3}}$, $\mathrm{HS^{-}}$, $\mathrm{H_{2}}$ and $\mathrm{H^{+}}$ (all aqueous species except for $\mathrm{H_{2}O}$ liquid). Note: the functions \texttt{basis} and \texttt{species} are important for their side effect of setting up the thermodynamic system used by other functions in CHNOSZ. \begin{wrapsweave} <>= basis(c("HCO3-","H2O","NH3","HS-","H2","H+")) @ \end{wrapsweave} \subsection{Hot-Spring Chemistry} To assess the relative stabilities of the proteins, parameters related to the chemistry of their environment are needed. The temperature (in $^{\circ}$C) and pH measured in the field at the time the biofilm samples were collected for sequence analysis are defined below as a function of distance (in meters) from the source of the hot spring. \begin{wrapsweave} <>= distance <- c(0,6,11,14,22) T.bison <- c(93.3,79.4,67.5,65.3,57.1) pH.bison <- c(7.350,7.678,7.933,7.995,8.257) @ \end{wrapsweave} The plot shows the temperature and pH as a function of distance. Use \texttt{splinefun}ctions to draw smooth lines between the points (a line formed by segments ending at the intervals in \texttt{xpoints}). Use the \texttt{mfrow} option to \texttt{par} to put two plots next to each other, and CHNOSZ's \texttt{axis.label} function to make a formatted label for temperature. After Fig. 5a,c in Ref. \citep{DS11}. \begin{wrapsweave} <>= Tfun <- splinefun(distance,T.bison,method="mono") pHfun <- splinefun(distance,pH.bison,method="mono") xpoints <- seq(0,22,0.25) par(mfrow=c(1,2)) plot(distance,T.bison,xlab="distance, m",ylab=axis.label("T")) lines(xpoints,Tfun(xpoints)) plot(distance,pH.bison,xlab="distance, m",ylab="pH") lines(xpoints,pHfun(xpoints)) @ \end{wrapsweave} Besides temperature and pH, other chemical properties of the hot spring water change as it cools. Oxidation-reduction (redox) state of the water is left for exploration (see below), but the activities of the other basis species are set to values that approximate analytical observations. Set pH to the value for site 3; the pH gradient will be considered later. \begin{wrapsweave} <>= basis(c("HCO3-","NH3","HS-","H+"),c(-3,-4,-7,-7.933)) @ \end{wrapsweave} \section{Relative Stabilities of Proteins} \subsection{Formation Reactions of the Proteins} Use \texttt{species} to display the coefficients in formation reactions of the overall model proteins and set the species definition for CHNOSZ. \begin{wrapsweave} %\begin{small} <>= species("overall",sitenames) @ %\end{small} \end{wrapsweave} After sequencing and assembly, many gene fragments remain, so the number of amino acid residues in the model proteins is the average length of mostly partial protein sequences and does not necessarily reflect the genetic sequence lengths. The first six columns of the output indicate the stoichiometries of the formation reactions. Although the reactions are written for non-ionized proteins, $\mathrm{H^{+}}$ has a non-zero coefficient because of the presence of other charged basis species. The reactions can be divided by the lengths (number of amino acid residues) of the proteins to write per-residue formation reactions. First get the rownumbers of the overall model proteins in \texttt{thermo\$protein} (\texttt{ip}), then calculate the protein {}``length'' (number of amino acid residues, not integers for the model proteins), then divide the protein formation reactions by the respective protein lengths. \begin{wrapsweave} <>= ip <- protein("overall",sitenames) pl <- protein.length(-ip) mys <- species() mys[,1:6]/pl @ \end{wrapsweave} The above can be used to deduce for example that the mass-action effect of increasing the activity of $\mathrm{H_{2}}$ is to drive formation of the model protein for site 1 more than the others. \subsection{Predominances as a Function of Temperature and Hydrogen Activity} First set the temperature limits (\texttt{Tlim}) over which to perform the calculations. Then define a function \texttt{H2prot}, which gives the logarithm of activity of hydrogen ($\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$) as a linear function of temperature. \begin{wrapsweave} <>= Tlim <- c(50,100) H2prot <- function(T) -11 + T * 3/40 @ \end{wrapsweave} The metastable equilibrium activity diagram shows predominance fields of the overall model proteins as a function of temperature and logarithm of activity of aqueous hydrogen. After Fig. 5b in Ref. \citep{DS11}. \begin{wrapsweave} \setkeys{Gin}{width=0.5\textwidth} <>= a <- affinity(T=Tlim,H2=c(-7,-4)) diagram(a,color=NULL,names=1:5) lines(Tlim,H2prot(Tlim),lty=2) title(main="overall") @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} The stability fields for the proteins from the higher temperatures are at higher activities of hydrogen in the diagram. The dashed line enters the stability fields the model proteins at approximately the actual environmental temperatures. Write a function to load one or more of the classes of model proteins. \begin{wrapsweave} <>= loadclass <- function(class) { species(delete=TRUE) species(rep(class,each=5),rep(sitenames,length(class))) } @ \end{wrapsweave} Then use this function to make equilibrium activity diagrams for a few different classes, afterwards restoring the species definition to the overall model proteins. To save space, the messages printed by CHNOSZ during the calculations are hidden. After Fig. 6 in Ref. \citep{DS11}. \begin{wrapsweave} <>= forclasses <- c("transferase","transport","dehydrogenase","synthase") par(mfrow=c(2,2)) for(i in 1:4) { loadclass(forclasses[i]) a <- affinity(T=Tlim,H2=c(-7,-4)) diagram(a,color=NULL,names=1:5) title(main=forclasses[i]) lines(Tlim,H2prot(Tlim),lty=2) } loadclass("overall") @ \end{wrapsweave} \subsection{Chemical Affinities} Calculate the residue-normalized chemical affinities of the formation reactions of the overall model proteins. First set activities of the proteins equal to unity (logarithm of activity equal to zero). Then calculate affinities per mole of protein for the temperature, pH and $\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$ of each site. Use the lengths of the model proteins, defined above in \texttt{pl}, to calculate the affinities per residue. After Table 5 in Ref. \citep{DS11}. \begin{wrapsweave} <>= species(1:5,0) a <- affinity(T=T.bison,pH=pH.bison,H2=H2prot(T.bison)) a.res <- t(as.data.frame(a$values))/pl a.res @ \end{wrapsweave} The affinities are expressed as dimensionless values, i.e. $\boldsymbol{A}/2.303RT$ where $\boldsymbol{A}$, $R$ and $T$ stand for chemical affinity, gas constant, and temperature in Kelvin. The affinities are all negative, but highest at the conditions of sites 1 and 2 (columns 1 and 2) for the model proteins from site 1 (row 1), and highest at the conditions of sites 3 to 5 (columns 3 to 5) for the model protein from site 4 (row 4). \begin{wrapsweave} <>= sapply(1:5,function(x) which.max(a.res[,x])) @ \end{wrapsweave} This observation is mostly consistent with the predominance diagram shown before. However, the model protein for site 2 appears in the diagram but, per-residue, is not more stable than the others at the conditions of site 2. The disconnect is caused by the different numbers of residues in the model proteins, and the fact that the predominance diagram shows the relative stabilities of the proteins (based on per-residue formation reactions). Calculate size-adjusted per-residue affinities by subtracting the logarithm of the protein length from the per-residue affinities. (A related operation, for calculating activities of proteins, is described following Eq. 19 in Ref. \citep{DS11}.) \begin{wrapsweave} <>= a.res <- a.res - log10(pl) a.res sapply(1:5,function(x) which.max(a.res[,x])) @ \end{wrapsweave} Now there is a 1--2--4 progression in relative stabilities, which matches the stability fields in the equilibrium predominance diagram shown above. \subsection{Activities Along a Combined Chemical Gradient} The predominance diagrams above only shows the most stable protein at any temperature-$\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$ combination. Combine the gradients of temperature, pH and hydrogen activity to plot the metastable equilibrium activities of the proteins. The total activity of residues is set by reference activities of the proteins equal to $10^{-3}$. Run the calculation for the distance intervals in \texttt{xpoints}. \begin{wrapsweave} <>= species(1:5,-3) xT <- Tfun(xpoints) xpH <- pHfun(xpoints) xH2 <- H2prot(xT) a <- affinity(T=xT,pH=xpH,H2=xH2) @ \end{wrapsweave} Label the $x$-axis {}``distance''. To do so modify a few entries in the list returned by \texttt{affinity} (\texttt{xname}, \texttt{xlim}, \texttt{xvals}); otherwise the $x$-axis would represent temperature (the first variable in the argument list to \texttt{affinity}). After Fig. 5f in Ref. \citep{DS11}. \begin{wrapsweave} \setkeys{Gin}{width=0.5\textwidth} <>= a$xname <- "distance, m" a$xlim <- c(range(xpoints),length(xpoints)) a$xvals <- xpoints diagram(a,legend.x=NULL) legend("bottom",lty=1:5,legend=paste(1:5," "),bty="n") @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} Plot the equilibrium degrees of formation as a function of distance for six different classes of proteins. Calculate the affinities for all of the proteins. For each group of three, make strip charts using the \texttt{strip} function in CHNOSZ. The heights of the bars represent the relative abundances of the model proteins. The five steps of the color code go from site 1 (red) to site 5 (blue). The messages printed by CHNOSZ during the calculations are not shown here. \begin{wrapsweave} <>= xclasses <- c("overall","transferase","transport","synthetase","membrane","permease") loadclass(xclasses) a <- affinity(T=xT,pH=xpH,H2=xH2) a$xname <- "distance, m" a$xlim <- c(range(xpoints),length(xpoints)) a$xvals <- xpoints col <- c("red","orange","yellow","green","blue") par(mfrow=c(1,2)) for(i in 1:2) { ispecies <- lapply((1:3)+(i-1)*3,function(x) {1:5+(x-1)*5} ) names(ispecies) <- xclasses[(1:3)+(i-1)*3] strip(a = a, ispecies = ispecies, col = col, xticks = distance, cex.names = 1) } @ \end{wrapsweave} \section{From Protein Stability to Redox Measurements} The temperature-hydrogen activity relationship depicted by the dashed line in the figures above was derived by considering the relative stabilities of the model proteins. How does it compare with measurements of redox chemistry in the hot spring? At least three types of field and analytical measurements are available for comparison: oxidation-reduction potential (ORP), dissolved oxygen, and sulfide/sulfate concentrations. \subsection{Oxidation-Reduction Potential} ORP, measured in the field using a portable probe and pH/voltmeter, can be converted to Eh by adding the potential of the reference electrode, in this case silver-silver chloride (Ag/AgCl) in saturated KCl. Lacking the high-temperature potentials of this electrode, the following equation from Ref. \citep{BPJ85} for Ag/AgCl (1M KCl) is used, with temperature in degrees Celsius and potential in volts. \begin{wrapsweave} <>= E.AgAgCl <- function(T) { 0.23737 - 5.3783e-4 * T - 2.3728e-6 * T^2 - 2.2671e-9 * (T+273) } @ \end{wrapsweave} Together with ORP, temperature and pH will all be needed. Data from Bison Pool and another hot spring shown in Fig. 9 of Ref. \citep{DS11} are both included, but for simplicity here they are lumped together. The ORP values have units of mV. \begin{wrapsweave} <>= T.ORP <- c(93.9,87.7,75.7,70.1,66.4,66.2) pH.ORP <- c(8.28,8.31,7.82,7.96,8.76,8.06) ORP <- c(-258,-227,-55,-58,-98,-41) @ \end{wrapsweave} Convert ORP to effective values of $\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$. First convert to Eh (in volts). Then convert Eh to pe. Then combine Eh and pH with the law of mass action for\[ 2e^{-}+2\mathrm{H^{+}}\rightleftharpoons\mathrm{H_{2}}_{\left(aq\right)}\] to calculate $\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$. The law of mass action is the condition where the activity quotient ($Q$) of the reaction is equal to the equilibrium constant ($K$). \begin{wrapsweave} <>= Eh <- ORP/1000 + E.AgAgCl(T.ORP) pe <- convert(Eh,"pe",T=convert(T.ORP,"K")) logK.ORP <- subcrt(c("e-","H+","H2"),c(-2,-2,1),c("aq","aq","aq"),T=T.ORP)$out$logK logaH2.ORP <- logK.ORP - 2*pe - 2*pH.ORP @ \end{wrapsweave} \subsection{Sulfide/Sulfate} For sulfide/sulfate, assign activities that are equal to concentrations (in molal units) measured in the field season of 2005, when the biofilm samples were collected for metagenomic sequencing, and use the law of mass action for\[ \mathrm{HS^{-}}+4\mathrm{H_{2}O}\rightleftharpoons\mathrm{SO_{4}}^{-2}+\mathrm{H^{+}}+4\mathrm{H_{2}}_{\left(aq\right)}\] to calculate an effective activity of hydrogen. Sulfide was determined spectrophotometrically at the hot spring, and sulfate was determined using ion chromatography on water samples returned from the field. \begin{wrapsweave} <>= loga.HS <- log10(c(4.77e-6,2.03e-6,3.12e-7,4.68e-7,2.18e-7)) loga.SO4 <- log10(c(2.10e-4,2.03e-4,1.98e-4,2.01e-4,1.89e-4)) logK.S <- subcrt(c("HS-","H2O","SO4-2","H+","H2"),c(-1,-4,1,1,4),state=c("aq","liq","aq","aq","aq"),T=T.bison)$out$logK logaH2.S <- (logK.S + pH.bison - loga.SO4 + loga.HS) / 4 @ \end{wrapsweave} \subsection{Dissolved Oxygen} Calculate the effective activity of hydrogen corresponding to the dissolved oxygen concentration using the law of mass action for\[ 0.5\mathrm{O_{2}}_{\left(aq\right)}+\mathrm{H_{2}}_{\left(aq\right)}\rightleftharpoons\mathrm{H_{2}O}\,.\] Convert the dissolved oxygen concentrations, in mg/L, to molarities (mol/L) and use these values to set the activity of $\mathrm{O_{2}}_{\left(aq\right)}$. \begin{wrapsweave} <>= DO <- c(0.173,0.776,0.9,1.6,2.8) logaO2 <- log10(DO/1000/32) logK <- subcrt(c("O2","H2","H2O"),c(-0.5,-1,1),c("aq","aq","liq"),T=T.bison)$out$logK logaH2.O <- 0 - 0.5*logaO2 - logK @ \end{wrapsweave} \subsection{Comparison of Effective Hydrogen Activities} Make a plot showing the activities of hydrogen as a function of temperature derived from the relative stabilities of the proteins, and the ORP, sulfur and oxygen measurements. Connect the data points with dashed lines and add labels for the lines using specified coordinates. After Fig. 9 of Ref. \citep{DS11}. \begin{wrapsweave} \setkeys{Gin}{width=0.5\textwidth} <>= plot(Tlim,H2prot(Tlim),xlim=Tlim,ylim=c(-45,0),xlab=axis.label("T"),ylab=axis.label("H2"),type="l") points(T.ORP,logaH2.ORP,pch=15) lines(T.ORP,logaH2.ORP,lty=2) points(T.bison,logaH2.O,pch=16) lines(T.bison,logaH2.O,lty=2) points(T.bison,logaH2.S,pch=17) lines(T.bison,logaH2.S,lty=2) llab <- c("H2prot","ORP","oxygen","sulfur") text(c(72,80,72,74),c(-4,-25,-34,-10.5),llab) # legend("bottomright",lty=c(1,NA,NA,NA),pch=c(NA,15,16,17),legend=llab) @ \setkeys{Gin}{width=1.0\textwidth} \end{wrapsweave} \section{Conclusions} The calculations and figures presented here involve the average oxidation state of carbon in proteins, temperature and pH of the hot spring, relative stabilities of model proteins as a function of temperature, $\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$ and pH, and comparison of effective values of $\log a_{\mathrm{H_{2}}_{\left(aq\right)}}$ from different redox reactions. \bibliographystyle{unsrtnat} \bibliography{vig} \end{document}