REVIEW
Center for Cell Analysis and Modeling, Department of Cell Biology, University of Connecticut School of Medicine, Farmington, Connecticut
moraru{at}panda.uchc.edu
| Abstract |
|---|
| Introduction |
|---|
This review explores the promises and pitfalls of computational modeling in cell biology, focusing on paradigms and examples from the world of intracellular signaling, with special emphasis on the role of morphology and spatial simulations. However, the scope of intracellular signaling networks is increasingly difficult to delineate, given the multiple molecular interactions present inside the cell. Thus it is not clear where to draw the boundary between signaling networks on the one hand and either metabolic networks or gene-regulatory networks on the other. We do not attempt to make such clear distinctions here, and we will mainly use the more global view of modeling and simulating intracellular molecular networks. We conclude that the field of cell biology is ripe for widespread use of computational approaches and that the best avenue for progress is to combine top-down, systems-level modeling with bottom-up, detailed quantitative modeling.
We will review both of these classes of methodologies (with more detail and emphasis on the second) in an attempt to guide the reader as to their relative roles and appropriateness for tackling particular modeling needs related to intracellular signaling.
| The Top-Down View: Reverse Engineering Molecular Networks |
|---|
Recently, as "signaling pathways" became "signaling networks," a flurry of interest in applying network theory and related techniques to the study of intracellular signaling developed. The premise that the analysis of a complex network as a single large functional ensemble will allow us to infer its emergent properties and behavior is very attractive.
A classical example of a well-developed map of intracellular molecular networks is the metabolic network. Mathematical and engineering tools have been applied to the study of metabolism, and they have been very successful, both at predicting the behavior of the system and at identifying biological motifs and modules (63). Universal principles such as robustness and control mechanisms have been extracted, and metabolic control analysis has been a very well established methodology for quite some time, with its own set of laws and practical applications (15). However, at present, studies of metabolism are not a good template for approaching signaling (and gene) networks. As a result of decades of detailed biochemical studies, the individual molecules participating in the metabolic network are very well characterized, and essentially all links between the nodes are known, in most cases at a very good level of quantitative detail (rate laws, binding, and regulation parameters). Although a few subcomponents of signaling pathways are characterized to a similar extent [good examples are Ca2+ signaling in cardiomyocytes (23, 52) and EGF receptor signaling in fibroblasts (51, 62)], many of them are not, and a large number of interactions (links in the network) are not known at all; this lack of knowledge is even more prevalent for gene-regulatory networks. Additionally, metabolic networks are probably at least an order of magnitude smaller in terms of numbers of components and interactions.
It is almost a cliché to say that the most important characteristic of a network is its architecture, but what does architecture actually mean? It is the collection of features present in the map of the network. Therefore, one of the current challenges in the case of signaling networks is to first discover the actual map of the network (1, 10). The question of what exactly to include (genes? proteins? small molecules? all of the above?) and at what granularity (modules? whole cell?) is subject to debate. Traditionally, network maps have developed slowly as the result of painstakingly combining and reconciling the results of many separate studies of individual molecules and their intracellular functions and interactions. However, given the large number of components involved and the high level of interactivity, and given the recent availability of high-throughput experimental data for both molecular interactions and functional analyses, deciphering maps of intracellular networks has recently fallen under the domain of systems biology and computational biology.
One school of thought advocates the use of statistical methods developed in the field of complex systems studies (31). These methods promise that we can infer the structure of the network, i.e., "learn the model," with little detailed and accurate information about all of the nodes and their interactions. Large-scale datasets appropriate for this purpose are being generated by recent high-throughput technologies (34). This approach has been variously called inference, system identification, or reverse engineering. The mathematical tools that have proven to be very useful for this purpose are probabilistic graphical models, such as Bayesian networks, Markov networks, or chain graphs (18). The details of applying these methods to biological network reverse engineering are beyond the scope of this review. However, we must mention that, particularly in the study of gene networks and genetic regulatory mechanisms, there have been a number of recent examples where this approach was successfully applied to identify new regulatory links or to infer modules of related genes from microarray expression data; a prototypical example (53) is illustrated in FIGURE 1. The individual network motifs and critical nodes identified would become the subject of traditional experimental research, which is often needed to eventually separate epiphenomenon from causation (e.g., when is coexpression actually coregulation?). As new hypotheses are tested, confidence in the network structure grows and the model can be refined by targeted experiments using perturbation analysis. Additionally, once the map is known with some certainty, perturbation analysis can be even more useful, since it can generate additional testable predictions of network behavior without intimate quantitative knowledge of individual interactions (20).
| Reductionism Redux: The Devil is in the Details |
|---|
Network modules
A common feature of intracellular networks is that the connections between nodes are neither random nor regular. Often, there are numerous localized interactions among groups of nodes and few distant connections traversing many such groups of nodes, thus forming functional "modules." Such large, sparse, scale-free, and modular networks are common in natural or designed complex systems (e.g., the market economy, the internet, VLSI chips), but they are not ubiquitous (e.g., neural nets, either biological or artificial, are neither sparse nor modular). However, metabolic, signaling, and gene-regulatory networks are most likely to turn out to be sparse, scale-free, and modular in all cases, because such networks are best suited to balance adaptability and robustness (24, 58), and thus this is the most likely architecture in the evolutionary design of systems that are critical for species survival. Detailed studies in many different organisms of the best-characterized intracellular network, the metabolic network, have confirmed the ubiquity of a pattern of hierarchical modularity (50). It is therefore appropriate to ask whether we should rather focus on the detailed analysis, modeling, and validation of the individual modules, which can then be used as building blocks for more comprehensive quantitative models that could eventually become very accurate.
This is in essence the traditional reductionist approach to science, and the application of mathematical and computational tools to characterize individual modules has been very successful. Some functional modules can be very simple and are recognizable after a cursory look at a wiring diagram, such as the well-known feedback or feed-forward loops. However, experiments have shown that, even for such simple modules, careful mathematical analysis is necessary to predict the signal-processing characteristics of the module (the dynamic behavior encoded by it), because the quantitative details of the kinetic parameters governing the interactions usually make the system highly nonlinear. For example, a negative feedback loop can generate either homeostasis (adaptation) or an oscillatory response, and a positive feedback loop can generate hypersensitivity (thresholding) or bistability (5, 45).
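To make the dependence on quantitative detail concrete, the following is a minimal sketch of a positive feedback loop that is bistable. The rate law (a Hill-type self-activation with coefficient 2 and a linear decay) and every parameter value are illustrative assumptions, not taken from the cited studies. With a decay constant of 0.4 the model has stable steady states at x = 0 and x = 2 separated by an unstable one at x = 0.5, so two nearby starting points end up in different states.

```python
def feedback_rate(x, degradation=0.4):
    """Net rate of change of x: self-activating Hill term minus linear decay.

    Both the functional form and the default parameter are illustrative
    assumptions chosen so that the steady states are easy to compute by hand.
    """
    production = x**2 / (1.0 + x**2)  # positive feedback, Hill coefficient 2
    return production - degradation * x


def simulate(x0, t_end=200.0, dt=0.01):
    """Integrate dx/dt = feedback_rate(x) with the forward Euler method."""
    x = x0
    for _ in range(int(t_end / dt)):
        x += dt * feedback_rate(x)
    return x


if __name__ == "__main__":
    # Two initial conditions on either side of the unstable state at x = 0.5.
    for x0 in (0.4, 0.6):
        print(f"x(0) = {x0:.1f} -> x(end) = {simulate(x0):.3f}")
```

Raising the decay constant above 0.5 in this toy model removes the upper steady state altogether, which is one way to see that the same wiring diagram can encode qualitatively different behavior.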
It is therefore more appropriate to characterize a module not by its wiring principles but rather by the type of response it generates. A simple analogy is that in an electrical circuit it is more important to know whether a specific circuit component functions as a transistor or as a diode than to know whether it is a semiconductor or a vacuum tube (35). When building such a "parts list" of a cellular network, we find repeated occurrences of a relatively small number of different types of modules, among the most common types being amplifiers, attenuators, oscillators, "push buttons" (when stimulus needs to be maintained to maintain the response), "toggle switches" (when response remains stable after the stimulus disappears; often responsible for irreversible responses and checkpoint controls), and "sniffers" (short-lived response only when stimulus changes in intensity; often responsible for adaptation/desensitization) (60). Predictive models (based on the detailed understanding of the underlying biology and mathematics) of such basic functional modules allow us to choose appropriate strategies to target control mechanisms in the network, and these models raise the prospect of being able to alter the network architecture by rewiring or constructing new modules (25). Recent proofs of concept have been the successful design and expression of synthetic genetic toggle switches and oscillators (12, 19).
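The "sniffer" behavior can be sketched with a two-variable model in which the stimulus S drives both a response R and an intermediate X that removes R: dR/dt = k1·S − k2·X·R and dX/dt = k3·S − k4·X. At steady state R = k1·k4/(k2·k3), independent of S, so R responds only transiently to a change in stimulus. The rate constants and the step stimulus below are assumptions chosen for illustration, and the function name is hypothetical.

```python
def simulate_sniffer(stimulus, t_end=40.0, dt=0.001,
                     k1=2.0, k2=2.0, k3=1.0, k4=1.0):
    """Forward Euler integration of a perfectly adapting ("sniffer") module.

    dR/dt = k1*S - k2*X*R
    dX/dt = k3*S - k4*X
    `stimulus` is a function of time returning S(t); it must be positive
    at t = 0 so that the initial steady state is defined.
    Returns a list of (time, R) samples.
    """
    x = k3 * stimulus(0.0) / k4      # steady-state X for the initial stimulus
    r = k1 * k4 / (k2 * k3)          # steady-state R (does not depend on S)
    trajectory = [(0.0, r)]
    for step in range(1, int(t_end / dt) + 1):
        t = step * dt
        s = stimulus(t - dt)         # stimulus at the start of this step
        dr = k1 * s - k2 * x * r
        dx = k3 * s - k4 * x
        r += dt * dr
        x += dt * dx
        trajectory.append((t, r))
    return trajectory


if __name__ == "__main__":
    # Assumed stimulus: a step from 1 to 2 at t = 10.
    traj = simulate_sniffer(lambda t: 1.0 if t < 10.0 else 2.0)
    peak = max(r for _, r in traj)
    print(f"peak R = {peak:.2f}, final R = {traj[-1][1]:.3f}")
```

With these values R rises transiently after the step and then returns to its pre-stimulus level even though the stimulus stays high.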
Stochastic variability
One problem that we often face with generic models of intracellular networks is created by the fact that many molecules in the network are present in low copy number; this is especially true of mRNA and proteins but also of small molecules (e.g., in a volume of 0.5 µm diameter, which is comparable with the size of the lumen of a mitochondrial crista invagination or to the area surrounding a sarcoplasmic reticulum cistern, a concentration of 100 nM Ca2+ turns out to be only 4 Ca2+ ions). Some of the methods discussed above employ probabilistic techniques and thus can be adapted to capture noise and stochasticity, but the problem lies in the fact that individual modules can dramatically alter their encoded behavior when only small numbers of molecules are involved. FIGURE 2 shows examples (3, 4) where such effects can be easily grasped intuitively. If a module that exhibits bistability operates with small numbers of molecules so that stochastic noise is significant, random fluctuations can overcome hysteresis; instead of bistability, such a system will oscillate between the normally stable steady states according to some function of the noise frequency! Similarly, large stochastic fluctuations will alter the behavior of a module that encodes hypersensitivity by increasing the sensitivity and decreasing the sharpness of thresholding up to the point that, when measured in large populations, hypersensitivity is transformed into a graded response. Therefore, in such cases we must resort to kinetic models using Monte Carlo simulations or to stochastic differential equation approximations to properly describe the system (see the references for the examples above and also the section on methods and tools below).
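The copy-number estimate quoted above can be checked with a few lines of arithmetic. This sketch assumes the "volume of 0.5 µm diameter" is a sphere; the function name is hypothetical. It also reports 1/√N, the relative fluctuation expected if the count were Poisson distributed, as a rough indication of how noisy such a small pool would be.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_in_sphere(diameter_um, concentration_nM):
    """Mean number of molecules in a sphere at a given concentration.

    Assumes a spherical volume; diameter in micrometers, concentration
    in nanomolar.
    """
    radius_m = 0.5 * diameter_um * 1e-6
    volume_liters = (4.0 / 3.0) * math.pi * radius_m**3 * 1000.0  # 1 m^3 = 1000 L
    return concentration_nM * 1e-9 * AVOGADRO * volume_liters


if __name__ == "__main__":
    n = molecules_in_sphere(0.5, 100.0)
    print(f"mean count: {n:.2f}")
    # For a Poisson-distributed count the coefficient of variation is 1/sqrt(mean).
    print(f"relative fluctuation (Poisson assumption): {1.0 / math.sqrt(n):.2f}")
```

Under the spherical assumption the mean count comes out just under 4, in line with the figure in the text, and the Poisson estimate of the relative fluctuation is about 50%.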
Spatial models
The third problem arises from the fact that in many cases we do not deal with a simple homogeneous bag of molecules. Even prokaryotes, which usually are devoid of compartmentalization, appear to have numerous supramolecular complexes that define regions of specialized function. Calculations from first principles of molecular collision probabilities due to normal random motion in volumes on the order of 1 µm³ or larger suggest that restricting the spatial mobility of enzymes may be the only mechanism to achieve reaction rates that are not close to zero (A. Slepoy, Sandia National Laboratory, personal communication). Additionally, intracellular concentration gradients and spatial organization of molecular complexes can play a significant role in regulating microbial activities such as chemotaxis and quorum sensing and are more common than originally thought (40, 57).
Mammalian cells have multiple internal compartments, and the possibility of interactions between different molecules is tightly controlled. Processes such as translocation of proteins to and from membranes are commonplace, and fluxes across membranes need to be specifically taken into account in many models. Internalization and endocytosis of surface receptors frequently play an important role in shaping the response to signals; EGF signaling is a good example (26). An extreme example of exquisite spatial organization that regulates signaling is found in cardiomyocytes. There, excitation-contraction coupling and force generation are characterized in good quantitative detail and have seen many successful comprehensive modeling efforts across different scales (these are beyond the scope of this review and have been recently discussed in the first issue of the new Physiology; see Ref. 46). Eukaryotic cells can also develop intracellular concentration gradients of active participants in signaling cascades such as the MAPK pathway (33). Such gradients can also occur in metabolites and even at very small spatial scales, such as the submicrometer intracristae sacs of mitochondria (I. I. Moraru, unpublished data, and Ref. 38). Some recent theoretical work aimed to include diffusion in the mathematical analysis of network control (48), but the applicability of such approaches may be limited to parts of signaling networks that do not span multiple compartments.
How spatial organization controls signal transduction (and even just the role of simple processes such as diffusion) is often difficult to determine without exact knowledge of the morphology and of some of the critical kinetic parameters. Therefore, studies of well-known spatial phenomena (such as intracellular Ca2+ waves and transients) have prompted (and required) the development of detailed models with quantitative descriptions of all of the critical components. FIGURE 3 illustrates an example of the subtleties involved in spatial modeling: Ca2+ transients operated by the same signal-transduction mechanism [inositol trisphosphate (IP3)-induced Ca2+ release from the endoplasmic reticulum] show different responses and different mechanisms of regulation in three systems with different morphologies and spatial scales (L. M. Loew, unpublished data, and Refs. 14, 16, and 61). In large Xenopus oocytes, bistability determines a tidal wave, and its propagation is dependent on the propagation of an "IP3 wave." In differentiated neuroblastoma cells, cell morphology and the uneven distribution of the endoplasmic reticulum shape the characteristics of the transient Ca2+ wave, which may be cell-wide, abortive, or limited to the soma, depending on the site of the extracellular stimulus. In Purkinje cell spiny dendrites, the kinetics conspire to provide coincidence detection of repetitive stimuli and generate Ca2+ transients that are either localized to a spine or propagating to the dendrite lumen.
| The Tools of the Trade: Data, Software, Languages |
|---|
Data gathering
There are two fundamental requirements for computational approaches: 1) comprehensive experimental data and 2) a proper analytical framework. The required data are simply concentrations and distributions of molecules (state parameters of the system), their mobility constants, and the binding kinetics and rate laws of their interactions (dynamic parameters of the system). The first type of data is now readily obtainable for gene and protein expression on a large scale (microarrays and mass spectrometry techniques: genome, transcriptome, proteome), or, when necessary, complemented by traditional methods focused on individual molecules.
Much has been made of the uncertainty present in these data, as well as of the sheer lack of quantitative information regarding the dynamic parameters. If history is any guide, however, we believe that technological advances and the combination of directed, large-scale approaches with the work of thousands of individual researchers will make it only a matter of time until the data are available. After all, as little as 15 years ago the (then fledgling) Human Genome Project was argued by some to be intractable, and today we are discussing the future possibility of routine genome sequencing of individual persons for diagnostic purposes (27)! Also, new techniques have been perfected to obtain in vivo data with full spatial information. Confocal microscopy can now obtain complete four-dimensional (that is, three-dimensional over time) protein expression data not only in single cells but during the full course of embryogenesis of an entire organism such as Caenorhabditis elegans (43) and can do it so efficiently that a small group of collaborating laboratories could record it for all of the open reading frames of the organism's genome in a reasonable amount of time (W. Mohler, University of Connecticut Health Center, personal communication). Moreover, modeling approaches help direct experiments for acquiring high-quality, quantitative, small-scale data for the critical components of the studied system. Techniques such as fluorescence recovery after photobleaching (FRAP), fluorescence loss in photobleaching (FLIP), and fluorescence correlation spectroscopy (FCS) today allow accurate measurements of real diffusion coefficients and binding constants in vivo (49).
Data and model exchange
A possibly larger concern is the limited set and relative lack of maturity of available computer tools to analyze and use these data to build models and run simulations (even though there has been an explosion of development efforts in this regard). We can distinguish four broad categories of features of these tools: data handling, model building, simulation, and analysis. Many public data repositories and databases exist; at the time of this writing, the Pathway Resource List (http://www.cbio.mskcc.org/prl/) had links to 154 internet pathway resources, most of which are databases containing information such as protein-protein interactions or metabolic reactions. However, we need more flexible information management systems to query, annotate, curate, and transform collected data into a format appropriate for modeling, and we are still far from the real biological information system that some researchers have called for (13), one that could come close to the coverage and functionality of the information management systems that exist in other physical sciences (e.g., GIS, the Geographical Information System).
We have to overcome many problems such as data curation, standards, and even accessibility; for example, the most comprehensive curated database of biological networks at present, Pathways Knowledge Base (Ingenuity, Mountain View, CA) is a commercial product available at only a few major academic institutions. The issue of standards is most critical due to the large variety of public and custom software tools used for modeling and simulation (see below). To facilitate the persistence, comparison, and reuse of computational models among different software platforms, we need a common vocabulary and a common language to describe them. Beyond unique identifiers for the various molecules (which already exist in public databases such as KEGG, SwissProt, GenBank, etc.), we need a comprehensive public ontology that comprises molecules, structures, and biological functions. From the many efforts underway (see Open Biological Ontologies at http://obo.sourceforge.net for a current listing), the Gene Ontology and BioCyc projects are quite promising for intracellular networks.
The next level is creating a platform-independent language that would describe the abstractions of a model, replete with quantitative detail, mathematical formulations, and possibly input and output data. Extensible markup language (XML) dialects have been developed in recent years for this purpose, most notably CellML (37) and SBML (17, 28). These two differ both in syntax and in scope: CellML has better support for metadata, spatial information, and mathematical descriptions (using the MathML standard), whereas SBML was originally designed to be more specific for pathway and network models. The latter, which was developed as a community effort, has been broadening its scope and seems to be rapidly gaining acceptance as the current standard (currently, more than 60 software packages claim to provide at least some level of SBML support; see http://www.sbml.org). Finally, such ontologies and languages provide the fundamentals for more comprehensive pathway information exchange frameworks such as BioPax (http://www.biopax.org), which uses a new standard developed by the W3C consortium, the Web Ontology Language (OWL), that goes beyond XML and Resource Description Framework (RDF, a general standard for, among other uses, metadata representation); it was specifically designed for use by applications that need not only to present but also to process the content of information, thus facilitating machine interpretability.
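To give a flavor of what such an XML description looks like to a program, the sketch below reads a small, hand-written fragment in the style of SBML Level 2 using only the Python standard library. The fragment is a simplified assumption made for illustration: it omits kinetic laws and units, it has not been validated against the SBML schema, and the model, species, and function names are hypothetical. A real exchange workflow would rely on a dedicated SBML library rather than a generic XML parser.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for SBML Level 2 documents.
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level2"}

# Hand-written toy model: reversible binding of a ligand L to a receptor R.
MODEL_XML = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_binding">
    <listOfCompartments>
      <compartment id="cytosol" size="1.0"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="L" compartment="cytosol" initialConcentration="1.0"/>
      <species id="R" compartment="cytosol" initialConcentration="0.5"/>
      <species id="LR" compartment="cytosol" initialConcentration="0.0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="binding" reversible="true">
        <listOfReactants>
          <speciesReference species="L"/>
          <speciesReference species="R"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="LR"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def read_model(xml_text):
    """Return (initial concentrations, reaction participants) from the XML text."""
    root = ET.fromstring(xml_text.strip())
    species = {
        s.get("id"): float(s.get("initialConcentration"))
        for s in root.iterfind(".//sbml:species", SBML_NS)
    }
    reactions = {}
    for rxn in root.iterfind(".//sbml:reaction", SBML_NS):
        reactants = [ref.get("species") for ref in
                     rxn.iterfind("sbml:listOfReactants/sbml:speciesReference", SBML_NS)]
        products = [ref.get("species") for ref in
                    rxn.iterfind("sbml:listOfProducts/sbml:speciesReference", SBML_NS)]
        reactions[rxn.get("id")] = (reactants, products)
    return species, reactions


if __name__ == "__main__":
    species, reactions = read_model(MODEL_XML)
    print(species)
    print(reactions)
```

The point of the exercise is that the network map and its initial state are recoverable by any tool that understands the shared vocabulary, independently of the software that wrote the file.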
Model building and simulations
We will skip a detailed review of model inference methods (discussed in THE TOP-DOWN VIEW, above), since most of that type of research uses dedicated software to match the mathematical approaches chosen by the particular group of researchers involved, and we refer the reader to reviews in bioinformatics journals for more technical details.
We will focus now on model building and simulation methods based on a known/hypothesized network connectivity. This was also often achieved as a customized development, but there are an increasing number of "generic" tools that biologists other than the original developers or their collaborators can use (and have used). The fundamental role of these tools is to facilitate a qualitative and quantitative description of the network map (using graphical interfaces or table-based reaction lists), and some will automatically create its mathematical representation (a system of equations). Whether or not created by hand, the latter is the input to simulation engines that use numerical methods to generate data such as the predicted time course of the network's state variables given a set of initial conditions. Early entries [e.g., Gepasi (42) and GENESIS/KinetiKit (2)] were focused on simple biochemical reaction networks under conditions in which uniform concentrations of molecules are expected (such as bacterial metabolic reactions). Thus the spatial representation of the system was limited to simple compartmental distributions, which translate in mathematical terms into a system of ordinary differential equations (ODEs). Systems of ODEs have many well-known numerical solvers available, are not very computationally expensive, and allow relatively easy model validation and analysis by methods like flux control analysis and parameter sensitivity calculations. However, they are not sufficient when spatial and stochastic aspects are involved.
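The step from a table-based reaction list to a system of ODEs can be sketched in a few lines. This is not how any of the tools named above are implemented; it is a minimal illustration that assumes mass-action kinetics, a single well-mixed compartment, hypothetical species names and rate constants, and a simple forward Euler integrator in place of a production-quality solver.

```python
# Each reaction: (reactant names, product names, mass-action rate constant).
# Species names and constants are illustrative assumptions.
REACTIONS = [
    (("L", "R"), ("LR",), 1.0),  # L + R -> LR  (binding)
    (("LR",), ("L", "R"), 0.5),  # LR -> L + R  (unbinding)
]


def derivatives(state, reactions):
    """Assemble d[species]/dt from the reaction list using mass-action kinetics."""
    rates = {name: 0.0 for name in state}
    for reactants, products, k in reactions:
        flux = k
        for name in reactants:
            flux *= state[name]       # rate = k * product of reactant concentrations
        for name in reactants:
            rates[name] -= flux       # each reactant is consumed
        for name in products:
            rates[name] += flux       # each product is formed
    return rates


def integrate(state, reactions, t_end=50.0, dt=0.001):
    """Forward Euler time stepping; returns the state at t_end."""
    state = dict(state)
    for _ in range(int(t_end / dt)):
        rates = derivatives(state, reactions)
        for name in state:
            state[name] += dt * rates[name]
    return state


if __name__ == "__main__":
    final = integrate({"L": 1.0, "R": 0.5, "LR": 0.0}, REACTIONS)
    print({name: round(value, 4) for name, value in final.items()})
```

For these assumed values the complex should settle where [L][R]/[LR] equals the ratio of the unbinding to the binding constant, which provides a simple check on the generated equations.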
When diffusion and actual specific morphologies have to be taken into account, if the system is modeled in a deterministic way, the mathematical representation will include partial differential equations (PDEs). Numerical solution of systems of PDEs can easily become computationally intractable at physiologically relevant spatial resolution and time courses, and analysis methods such as parameter sensitivity are essentially an open problem. However, since systems of PDEs are a common mathematical formulation for many other processes in physics or engineering (flow, reactors, explosions, etc.), a number of simulation engines for PDEs have been available for quite some time, ranging from relatively simple commercial products, such as FEMLAB (Comsol, Burlington, MA), to high-end versions that can take advantage of massively parallel supercomputers, such as Sandia National Labs' MPSalsa (http://www.cs.sandia.gov/CRF/MPSalsa) or Pacific Northwest National Labs' NWPhys (http://www.emsl.pnl.gov/nwphys). Even so, very few tools have been developed that are able to simulate deterministic spatial models and that were specifically designed for modeling biological processes, and most are rather specific (e.g., Continuity from UCSD Cardiac Mechanics Research Group for cardiac modeling). One exception, which is generic and already relatively mature, is The Virtual Cell (http://www.vcell.org), developed at the National Center for Cell Analysis and Modeling (36, 56). It is an integrated framework for modeling cell-biological processes that is deployed as a freely accessible distributed application to be used over the Internet, and it includes graphical model-building tools; automated math generation; numerical solvers for simulating both compartmental and spatial problems covering reactions, fluxes, diffusion, advection, electrical potential, and currents on analytic or experimental geometries; and model and results export and analysis tools.
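As a minimal sketch of what a deterministic spatial simulation involves, the code below integrates the one-dimensional diffusion equation ∂c/∂t = D ∂²c/∂x² with an explicit finite-difference scheme and no-flux (closed) boundaries. The scheme is a textbook one chosen for brevity and is not claimed to be the method used by any package named above; the geometry, diffusion coefficient, and step sizes are illustrative assumptions. The stability check (D·Δt/Δx² ≤ 0.5) hints at why fine spatial resolution quickly becomes expensive: halving Δx forces a fourfold smaller time step.

```python
def diffuse_1d(conc, diffusion, dx, dt, n_steps):
    """Explicit finite-difference update of dc/dt = D * d2c/dx2.

    `conc` is a list of concentrations on a uniform grid; the walls at both
    ends are closed (no flux), so total mass is conserved.
    """
    alpha = diffusion * dt / dx**2
    if alpha > 0.5:
        raise ValueError("unstable step: need D*dt/dx^2 <= 0.5")
    c = list(conc)
    n = len(c)
    for _ in range(n_steps):
        new = c[:]
        for i in range(n):
            left = c[i - 1] if i > 0 else c[i]       # mirror value at the left wall
            right = c[i + 1] if i < n - 1 else c[i]  # mirror value at the right wall
            new[i] = c[i] + alpha * (left - 2.0 * c[i] + right)
        c = new
    return c


if __name__ == "__main__":
    # Assumed setup: 10 um long domain, 50 grid cells, D = 10 um^2/s,
    # all material initially in the first cell, 2 s of simulated time.
    initial = [1.0] + [0.0] * 49
    final = diffuse_1d(initial, diffusion=10.0, dx=0.2, dt=0.001, n_steps=2000)
    print(f"near wall: {final[0]:.4f}, far wall: {final[-1]:.4f}")
```

Extending this to three dimensions on a realistic cell geometry, with reactions coupled at every grid point, multiplies the cost by the number of grid points and species, which is the intractability referred to in the text.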
Stochastic simulations have been used for a long time in simulating the dynamics of systems with a small number of states (e.g., modeling the kinetics of membrane channels), usually by using pseudorandom number generation (so-called Monte Carlo simulations) and the Gillespie algorithm (22). Many ad hoc implementations have been recently used to simulate intracellular networks under stochastic regimes, particularly gene-regulatory networks (39). Solving stochastic differential equations is an acceptable approximation when some of the components of the process are abundant enough to be treated using a continuous formulation (i.e., solving the Langevin equation). Stochastic simulations have the additional advantage that they can also solve spatial problems; even though they become computationally expensive when the number of molecules treated discretely is large, they have recently become a realistic alternative to PDE simulations. This is due to a combination of more efficient software implementations, the exponential growth in computer CPU performance, and the fact that stochastic numerical algorithms lend themselves more easily to parallel processing (thus being able to leverage today's large computer clusters or grid server farms); recent developments in numerical algorithms (8, 21) promise to further expand the range of practical applicability.
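The core of the Gillespie approach fits in a short function. The sketch below applies the direct method to the simplest possible network, a single species that is produced at a constant rate and degraded in proportion to its copy number; the reaction scheme, rate constants, and function names are assumptions made for illustration. Each iteration draws an exponentially distributed waiting time from the total propensity and then picks which reaction fires with probability proportional to its propensity.

```python
import random


def gillespie_birth_death(k_make, k_decay, n0, t_end, rng):
    """Gillespie direct method for 0 -> X (rate k_make) and X -> 0 (rate k_decay * n).

    Returns the event times and the copy number after each event.
    """
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_make = k_make
        a_decay = k_decay * n
        a_total = a_make + a_decay
        if a_total == 0.0:
            break                          # nothing can happen any more
        t += rng.expovariate(a_total)      # waiting time to the next event
        if t > t_end:
            break
        if rng.random() * a_total < a_make:
            n += 1                         # production event
        else:
            n -= 1                         # degradation event (only possible if n > 0)
        times.append(t)
        counts.append(n)
    return times, counts


def time_average(times, counts, t_end):
    """Time-weighted mean copy number over [0, t_end]."""
    total = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += n * (t_next - times[i])
    return total / t_end


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is reproducible
    times, counts = gillespie_birth_death(10.0, 1.0, 0, 200.0, rng)
    print(f"events: {len(times) - 1}, "
          f"time-averaged copy number: {time_average(times, counts, 200.0):.2f}")
```

The deterministic ODE for the same scheme predicts a steady level of k_make/k_decay; the stochastic trajectory fluctuates around that value, and it is these fluctuations that can change the behavior of a module when copy numbers are small.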
MCell (59) is one of the first software platforms that is a general-purpose full stochastic simulator, and although it was initially practical only for small problem domains (e.g., signaling at one neuromuscular junction) and did not allow for direct treatment of binding interactions, the newest version (to be released soon) does not have these limitations (J. Stiles, Pittsburgh Supercomputing Institute, personal communication). Other recently developed stochastic simulators show scalability to the point of metabolic network module simulations, including explicit tracking of all small and large molecules involved (except water and electrolytes) over an entire average-sized bacterial cell (e.g., ChemCell; S. Plimpton, Sandia National Labs, personal communication). Additionally, several large simulation platforms such as E-Cell, Virtual Cell, BioSpice, etc. now include stochastic capabilities.
| Conclusion |
|---|
To elaborate, let us take modern drug development as an example. Traditionally, the bottleneck was the lack of compounds to modulate the activity of a specific target (receptor, enzyme, gene, etc.). Advances in chemistry and high-throughput screening techniques have alleviated this, but even when we have a highly selective and effective modulator for the target of interest, we are often faced with the disappointing result that the final effect on the cell is either not significant or is different from what we expect. Many times this is due to the fact that the target is part of a complex network, which may either be robust with respect to that particular node or have poorly understood emergent properties. Many of the drugs that were found to be effective in practice by trial and error appear to be quite promiscuous, influencing many cellular targets. Therefore, today the questions most difficult to answer seem to be: 1) which are the best targets to choose and 2) how should they be influenced (inhibit, activate, knockout, overexpress, etc.) to achieve a desired effect? One recent approach has been to identify drug "signatures," a collection of the effects of a known chemical on many network components (e.g., in the search for kinase inhibitors as potential anticancer drugs), which can be used both to probe the characteristics of the network and to guide in the search for better drugs. Obviously, the ideal answer would be to have a well-characterized model of the pathway(s) of interest through which simulation predictions can identify which is the best combination of "buttons to push" to obtain the desired effect.
There is hope (and fear) that in the not-too-distant future we will be able to reengineer our bodies, to quickly identify and fix whatever goes wrong in most disease states with minimal collateral damage, and even to grow new biological replacement parts that can be as good as, or better than, the (young) original. Whenever that happens, it will ultimately be due to the development of high-quality tools for analyzing, designing, modeling, and simulating cells and tissues. The components that deal with signaling networks will likely be among the most important, but also the most difficult, to develop.
| References |
|---|
1 signaling pathway. J Biol Chem 27: 8958-8965, 1999.