Past Research

2012 Research Interest Statement

I believe plants are the most important life forms on earth. They give us the air we breathe, clothes we wear, houses we live in, energy we burn and drugs that cure our diseases or make us feel better. Plants will give us all of these for as long as we last and then outlive us all. Despite all this, we know very little about how they do what they do. Even for the best-studied species, Arabidopsis thaliana (a wild mustard), we know less than 20% of what its genes do and how or why they do it.

We want to uncover the molecular mechanisms underlying adaptive traits in plants to understand how these traits evolved. A bottleneck in achieving our goals is the limited understanding of the functions of most genes in plant genomes. With a sequenced genome as a starting point, we are building genome-wide molecular networks of genes and proteins using a combination of computational and empirical approaches. Using these networks, we want to elucidate functions of uncharacterized genes rapidly and systematically. Ultimately we are interested in finding patterns of network evolution to identify the evolutionary paths of functional innovation for adaptation.

The questions that we are pursuing are:

  • Why are plants so robust to genetic and environmental perturbations and how do they express this resilience?
  • How is plant metabolism wired and how does it evolve?

The approaches and projects we are developing to answer the questions are:

  • Computational framework to predict metabolic networks of plants
  • Reconstruction of co-function networks in plants
  • Identification of genome-wide genetic interaction network of plants
  • Empirical testing of plant metabolic networks using genetic and metabolomic approaches
  • Novel method of measuring functional similarities
  • Identification of all genes involved in complex traits such as salt tolerance
  • Computational and empirical identification of signaling pathways and complexes
  • Identification of novel classes of transcription factor regulators
  • Characterization of novel gene families

We employ several methods in our quest: 1) combination of computational modeling and targeted experimental testing in the lab; 2) systematic collection of large-scale data needed for the modeling through collaboration with other labs; 3) robust, quantitative analysis of the data and the models. Our work is inherently embedded in collaborations with other labs both at Carnegie and other institutions. This is my vision of our mode of operation. Our own lab is most invested in the synthesis aspect of the ‘research engine’ with a growing component on experimentation. But we collaborate with many excellent labs in all three aspects.

2009 Research Interest Statement

Two things drive my research these days. One is a desire to uncover the mysteries of how plants process the myriad of information from their environment and reprogram their growth and development. I am fascinated with how plants decide. The other is a desire to uncover the potential for greatness in emerging scientists. I am fascinated with how humans, in particular scientists, decide.

I am trying to engage these two drivers in designing projects to foster the diverse interests of individual members of my group, ranging from evolution of pathways to mechanisms of gene regulation, and to capitalize on the diverse training backgrounds in the group, ranging from molecular biology to physics.
Naturally, this network of inspiration lends itself to projects that can be described as a series of Venn diagrams where the intersections represent collaborative and integrative projects among the members. I believe that the union will help satisfy the two driving forces of my research program. Also, turnover of lab members over time serves as a natural checkpoint and selective force on the evolution of our collective knowledge and expertise.

We employ several methods in our quest: 1) combination of computational modeling and targeted experimental testing in the lab; 2) systematic collection of large-scale data needed for the modeling through collaboration with other labs; 3) robust, quantitative analysis of the data and the models.

Here are examples of the ‘intersecting’ projects in our group to illustrate the types of questions we are asking and approaches we are taking.

Questions and approaches…

  • The quest for novel biological processes
  • Systematic discovery of the role of protein degradation in response to the environment
  • Exploring the power of metabolomics in dissecting genetic interactions
  • Systematic discovery of signaling pathways and complexes (Lalonde et al, 2010)

Methodologies and tools…

The next five to ten years… I expect that many of the current projects will leave with the lab members as they establish their own groups. I am humbled by so many intriguing, unsolved problems in plant biology and am currently brewing some concrete ideas about the following topics, which may or may not become major projects in my group in the next several years.

  • Systematic discovery of novel reactions and pathways
  • Determining the “key” players in functional modules
  • Mechanism of cross-talk between exogenous and endogenous signals for growth and development
  • Mechanism of genome-wide homeostasis against genetic and environmental variation

2005 Research Interest Statement

I have been involved in several ongoing projects that address some of the needs stated above. The projects can be grouped into three categories: biological databases, bio-ontologies, and systems approaches in biology. Biological databases include a database for all information of a single organism, a database for a specific type of information (metabolism) in many species, and a database for managing and exploring literature data for any type of system of interest. Bio-ontologies include designing and building ontologies specific for particular domains of biological knowledge such as biological processes, molecular functions, cellular components of all organisms and anatomical parts and developmental stages for flowering plants. Systems approaches include two small projects in collaboration with plant biologists to address questions about specific aspects of Arabidopsis biology such as deciphering the transcriptional regulatory circuit for cold acclimation in plants and systematic determination of subcellular and tissue localization of proteins of unknown function in planta.

In addition to the projects described above, I have a personal mission to mobilize the research community to contribute to biological databases and share knowledge and expertise, to bridge the gaps of information dissemination between traditional scientific journals and biological databases, and to bridge the gap between biologists and computer scientists. I believe that the plant biology community is not taking full advantage of the recent advances in communications and technology. Through TAIR, we are creating and testing mechanisms for researchers to provide data and expertise directly to a database. I am communicating with publishers of major plant journals to share data and establish cross-references between journal websites and databases. I am also in communication with an open-access publisher to create a joint journal devoted to publishing papers that are not suitable for traditional journals such as functional genomics like microarray data, methods, and reproducible negative results. Finally, I believe that major breakthroughs in bioinformatics will come from in-depth collaborations between biology experts and computer science experts rather than from people who know a little bit of both. As an editor for Plant Physiology, I am managing the publication of bioinformatics papers in this journal in order to educate plant biologists about bioinformatics. I would be very interested in doing the converse: bringing biology papers into a computer science journal.

For most biological databases, the literature is one of the main data sources, and significant resources are devoted to capturing this information. Our long-term goal is to develop a set of systematic procedures and tools for integrating knowledge from the confined context of a research article into the dynamic, broad context of a biological database. We have developed a literature curation tool called PubSearch (www.pubsearch.org), which stores literature, gene, functional annotation, and keyword data in a stand-alone database and allows curators to establish associations between these data types using a web browser. In collaboration with Simon Twigger’s group at the Medical College of Wisconsin, we are extending PubSearch to include a literature fetching function (PubFetch) and work-tracking function (PubTrack) to create a comprehensive environment to manage the literature data.

In an effort to systematically characterize Arabidopsis proteins with unknown function, we are collaborating with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory, David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook, and Natasha Raikhel at UC Riverside) to identify the subcellular localization of the products of approximately 800 genes that have no known function, working in planta (real-time images of live cells in intact plants). In addition to discovering localization patterns of these novel proteins, we are already identifying potential novel organelles and suborganelles.

Future Plans (Next Five Years)

In the next five years, I would like to continue the three categories of the projects (biological databases, bio-ontologies, and systems approaches) but make a transition from developing infrastructure and tools to creating applications that use the infrastructure to infer new information or identify patterns. However, I value the critical importance of maintaining and updating the resources, which will be done by professional curators and software developers. Personally, I would like to develop programs that can, for example, predict function based on the knowledge and information embedded in TAIR. Also, I am interested in analyzing the bio-ontologies and their annotations to identify any novel patterns, both regular and irregular. In addition to continuing the existing projects, I intend to initiate a couple of new projects, one on building an infrastructure for metabolomics and the other on analyzing the correlation between networking and scientific success in collaboration with social scientists.

Biological Databases
I would like to transform TAIR into a discovery environment for all plant researchers, educators, and students. The proposed work will include a comprehensive annotation of the genome, transcriptome and proteome, including regulation and phenotype information. TAIR will provide access to all public data resulting from large-scale ‘omics’ research and traditional ‘hypothesis-driven’ research in intuitive, powerful, and highly integrated views capable of facilitating new discoveries about plant development and physiology. The project will continue to develop controlled vocabularies and standardized data exchange mechanisms for maximal interoperability with other biological databases and will provide data in explicitly defined and structured formats to facilitate programmatic data retrieval. TAIR’s strong support within the plant research community will be utilized to create networks of information connecting TAIR to other plant databases, web resources for specific types of Arabidopsis information, and traditional scientific journals. In addition, TAIR’s role as an essential resource in the plant research community requires that a mechanism for long-term support of the project be established. To that end, several potential ways to generate revenues will be explored.

In the next five years, we will focus on completing the plant metabolism information in MetaCyc to a gold standard such that it will effectively have replaced all the textbooks. Towards that end, we will actively solicit collaboration from the classical biochemists and other colleagues from the Society of Phytochemistry in addition to curating the data from the primary literature, reviews, and textbooks and results from functional genomics and proteomics experiments. Once the known information is complete and updated in the database, we can start to ask questions about missing information (e.g. missing enzymes, compounds, and pathways in an organism as compared to another). In addition, we should be able to ask questions about the differences and similarities between strategies taken by different organisms.
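The question about missing enzymes can be pictured as a comparison of sets. The following is only a minimal sketch in Python: the pathway, reaction, and organism identifiers are invented for illustration and do not come from MetaCyc, and a real analysis would draw these sets from the curated database.

```python
# Hypothetical data: the reactions that make up a pathway, and the reactions
# for which each organism has an annotated enzyme. All identifiers are invented.
PATHWAY_REACTIONS = {
    "pathway_1": ["rxn_a", "rxn_b", "rxn_c"],
}
ANNOTATED_REACTIONS = {
    "organism_1": {"rxn_a", "rxn_b", "rxn_c"},
    "organism_2": {"rxn_a", "rxn_c"},
}


def missing_reactions(pathway, organism):
    """Reactions of the pathway with no annotated enzyme in the organism."""
    annotated = ANNOTATED_REACTIONS[organism]
    return [r for r in PATHWAY_REACTIONS[pathway] if r not in annotated]


def missing_relative_to(pathway, organism, reference):
    """Reactions missing in `organism` that `reference` does have an enzyme for."""
    in_reference = ANNOTATED_REACTIONS[reference]
    return [r for r in missing_reactions(pathway, organism) if r in in_reference]


holes = missing_relative_to("pathway_1", "organism_2", "organism_1")
print(holes)
```

A reaction that turns up in such a list is either truly absent from the organism or catalyzed by an enzyme that has not yet been identified, which is the kind of question the completed dataset would let us ask.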

For PubSearch, I am interested in collaborating with computer scientists to incorporate methods such as Natural Language Processing for more automated literature curation. Our experience of manual and semi-manual extraction of knowledge from literature would provide a good baseline for such a collaborative project.

Bio-ontologies
One of the immediate applications of bio-ontologies is in associating biological objects such as genes with ontology terms. This allows quantitative comparison of genes and can facilitate interoperability (querying of one database by another) if multiple databases use the same ontologies to annotate data objects. I am interested in analyzing the ontologies and their annotated data objects in TAIR, GOC, and POC databases to determine the global organization patterns of the ontologies and the genes using graph theoretical calculations. I am also interested in creating and using ontologies for complex information and will focus on describing phenotype information using multiple ontologies.
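As one illustration of what a quantitative comparison of genes could look like, the sketch below scores two genes by the overlap of their ontology terms after each term set is extended with its ancestors. The Jaccard index is an assumption chosen for simplicity, not a method named in this statement, and the small ontology fragment and gene annotations are invented.

```python
# Hypothetical ontology fragment: each term maps to its direct parent terms.
PARENTS = {
    "response to cold": {"response to temperature stimulus"},
    "response to heat": {"response to temperature stimulus"},
    "response to temperature stimulus": {"response to stimulus"},
    "response to stimulus": set(),
    "photosynthesis": set(),
}


def with_ancestors(terms):
    """Return the terms together with every ancestor reachable in PARENTS."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, ()))
    return seen


def similarity(terms_a, terms_b):
    """Jaccard index of the two ancestor-extended term sets (0.0 to 1.0)."""
    extended_a = with_ancestors(terms_a)
    extended_b = with_ancestors(terms_b)
    union = extended_a | extended_b
    return len(extended_a & extended_b) / len(union) if union else 0.0


# Hypothetical annotations for three genes.
gene_1 = {"response to cold"}
gene_2 = {"response to heat"}
gene_3 = {"photosynthesis"}

print(similarity(gene_1, gene_2))
print(similarity(gene_1, gene_3))
```

Extending each annotation with its ancestors lets two genes annotated to sibling terms score as partly similar, which a comparison of the directly assigned terms alone would miss.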

Systems approaches
I am particularly interested in the preliminary results from the projects in this category. From the cold acclimation project, we found that transcripts are turned on in a series of waves as a function of time in cold-treated Arabidopsis. In order to group the genes into more discrete regulons (genes that are regulated by the same transcription factor(s)), I feel that we need to learn more about the potential promoter regions. Towards that end, we have gathered and curated all the experimentally verified cis-elements from biological databases. Using PubSearch, we can efficiently extract all other experimentally verified cis-elements. We will use this dataset to map the non-coding sequences of genes and intergenic regions and ask if there are any high-level patterns of cis-element compositions in the non-coding genic and intergenic sequences. If we can define promoters more precisely, then we should be able to develop algorithms to compare promoters. Using better-defined promoter information, we want to analyze the microarray data. In addition, we intend to curate all of the known cis-element/transcription factor relationships. I would like to collaborate with computer scientists interested in developing heuristic algorithms that could predict transcription factor/cis-element relationships based on the curated dataset.
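The mapping step described above can be sketched as an exact-match scan of each non-coding sequence for each curated cis-element. This is a simplified illustration under stated assumptions: the element names, motif strings, and sequences are toy placeholders, only the forward strand is scanned, and degenerate motifs are not handled.

```python
# Hypothetical inputs: a few motif strings standing in for the curated
# cis-element set, and toy sequences standing in for non-coding regions.
CIS_ELEMENTS = {
    "element_1": "CCGAC",
    "element_2": "ACGTG",
}
REGIONS = {
    "gene_1_upstream": "AACCGACTTACGTGGCCCGACAA",
    "gene_2_upstream": "TTTTACGTGTTTT",
}


def find_motif(sequence, motif):
    """Return every start position (0-based) of motif in sequence, overlaps included."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def element_composition(regions, elements):
    """For each region, count the occurrences of each cis-element."""
    return {
        region: {name: len(find_motif(seq, motif)) for name, motif in elements.items()}
        for region, seq in regions.items()
    }


composition = element_composition(REGIONS, CIS_ELEMENTS)
print(composition)
```

The resulting table of counts per region is the kind of input one could then examine for shared patterns of cis-element composition among co-expressed genes.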

The unknown protein localization project also has several avenues we want to pursue. First, we want to use the experimental data as a training set to determine if we can identify any new targeting/localization signals and motifs by either using existing algorithms (e.g., TargetP) or developing new algorithms in collaboration with computer scientists. In addition, this project is just starting to produce results, and the first 1% of the unknown proteins not only revealed interesting localization patterns such as cell-type and tissue specificity, but also uncovered some novel localization patterns. Some of the novel localization patterns may be novel organelles or suborganelles previously undetected. We are set to capture localization images of 800 genes this year and have submitted a renewal to do another 4000 genes. Even within the current grant period, we will produce about 8000 images. In order to group the novel patterns into categories and analyze all of the images efficiently, we need content-based image searching as well as the ability to cluster the images. I would be very interested in collaborating with computer scientists to develop such programs and use them to analyze these localization patterns.

2002 Research Interest Statement

My goal is to build an infrastructure that allows researchers to share information and knowledge in order to identify new insights and facilitate the process of generating new paradigms in biology. A long-term goal is to systematically delineate what is known and unknown in order to mobilize the research community to solve the rules underlying the workings of an organism.

One of the most efficient ways of solving problems in biology lies in the use of model organisms or systems in which the basic rules are uncovered and applied to more diverse sets of organisms and problems. For higher plants, Arabidopsis thaliana has been adopted as a model organism due to its small genome size, self-compatibility, and short generation time. Since its adoption as a model organism, many tools have been developed for this plant, including facile and efficient methods of transformation, complete genome sequence, and high-density genetic maps. Capturing and representing biological knowledge from studies using Arabidopsis thaliana is the subject of my research. More specifically, my group has developed a computer-based infrastructure to capture the research community information and the knowledge generated in the research literature and developed a query/analysis/visualization system to allow researchers to identify correlations in the information. In the future, we would like to develop a knowledge-capture system to bring the research findings directly into the computer infrastructure, and develop a simulation system that can accurately predict the outcome of any scenario that may occur in the plant.

In order to capture the knowledge from this large research community, we need to develop an infrastructure that allows researchers to find and share the information and knowledge generated. Advancement of computer science and communications technology has established the internet as the most efficient medium for exchanging knowledge. In addition, advancement of high-throughput technology such as sequencing and microarray methods has allowed biologists to produce large quantities of data. Developing an infrastructure to house and make accessible these large quantities of data has been a problem for many research communities. In collaboration with information technology scientists at the National Center for Genome Resources in Santa Fe, New Mexico, my group has been engaged in developing an infrastructure to house the vast quantities of information for Arabidopsis. The infrastructure is called The Arabidopsis Information Resource (TAIR, http://arabidopsis.org), which is accessible via commonly used web browsers and can be searched and downloaded in a number of ways. For example, researchers can identify genes or proteins of interest based on many parameters (e.g. subcellular localization, expression patterns, or mutant phenotypes) from the text-based search forms, sequence analysis tools, or bulk query forms. SeqViewer (http://arabidopsis.org/servlets/sv) allows visualization of these genes on the genome decorated with clones, transcripts, genetic markers and polymorphisms. The SeqViewer interactively displays the genome from the whole chromosome down to 10 kb of nucleotide sequence. Alternatively, researchers can visualize these genes mapped on metabolic pathways from the whole cell level down to individual reactions along with metabolic compound structures using AraCyc (http://arabidopsis.org/tools/aracyc).
Upon finding relevant information about genes, researchers can order associated DNA or seed stocks from the Arabidopsis Biological Resource Center (ABRC, http://arabidopsis.org/arbrc). Detailed and up-to-date information about the database content as well as its usage statistics can be found online (http://arabidopsis.org/about).

TAIR uses an object-oriented approach to data representation and software architecture. The underlying database is implemented in a relational database management system (Sybase version 11.9.2). The data is organized in a hierarchical structure where a parent table groups a set of child tables with similar attributes and each node can be linked to other nodes and tables. At the top of the data hierarchy is the TairObject class, which is linked to other top parent classes such as Attribution (source of the data), Reference (experimental evidence source), and Annotation (descriptive information). Thus, the Attribution, Reference and Annotation classes constitute the metadata of all TAIR objects. This design has the advantage of allowing easy addition of new data types as well as flexibility and minimization of linking tables. More detailed information about the database schemas and documentation can be found online (http://arabidopsis.org/search/schemas.html).
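The hierarchy described above can be pictured with a small object model. This is a sketch in Python for illustration only: the class names TairObject, Attribution, Reference and Annotation come from the text, but every field name, the Gene subclass, and the example values are assumptions, and the actual implementation is a relational schema in Sybase rather than Python classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribution:
    source: str  # who provided the data (field name is an assumption)


@dataclass
class Reference:
    citation: str  # experimental evidence source (field name is an assumption)


@dataclass
class Annotation:
    text: str  # descriptive information (field name is an assumption)


@dataclass
class TairObject:
    """Top of the hierarchy: every data type inherits the same metadata links."""
    name: str
    attributions: List[Attribution] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Gene(TairObject):
    """Hypothetical child type; a new data type is added by subclassing."""
    chromosome: str = ""


gene = Gene(name="example gene", chromosome="1")
gene.attributions.append(Attribution("example laboratory"))
gene.references.append(Reference("example article"))
gene.annotations.append(Annotation("protein of unknown function"))
print(gene)
```

Because the metadata links live on the top-level class, a new child type inherits them without needing its own linking tables, which is the advantage the design claims.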

TAIR software is developed in a client-server mode using the JAVA Servlet technology. All applications are accessible to users by common web browsers to accommodate maximum user platform and software (operating system) diversity. Software for accessing the database is developed using an object-oriented architecture. A set of JAVA classes called TAIR Foundation Classes serves a number of functions to the front-end applications that use JAVA Server Pages. Documentation of the TAIR Application Program Interface can be found in the ‘About TAIR’ section of the home page. A set of bulk download tools based on flat files uses CGI scripts written in Perl. Finally, a number of weekly updated, static HTML pages serve relevant Arabidopsis and external links information to the community.

This project, in its third year, is accessed from about 20,000 unique internet addresses per month. Researchers around the globe generate approximately 2.5 million hits and access 500,000 web pages every month. TAIR is currently the most visible Arabidopsis project. For example, a search for the word ‘Arabidopsis’ on Google (http://google.com) returns TAIR at the top of the list.

We have developed a literature curation tool called PubSearch, which stores literature, gene, functional annotation, and keyword data in a stand-alone database and allows curators to establish associations between these data types using a web browser. In PubSearch, first-pass associations between terms (gene names and keywords) and articles are made automatically by a string matching program that indexes terms to articles. Commonly occurring words such as AND, THE, IF (stop words) are filtered out to prevent meaningless associations from being stored. For terms with a higher signal-to-noise ratio, curators verify the matches via the web browser user interface.
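The first-pass indexing step can be illustrated with a short sketch. It is written in Python for brevity and is not the PubSearch code, which is Java; the function name, the whole-word matching rule, and the example terms and article texts are all assumptions, and the stop-word list contains only the three examples given above.

```python
import re
from collections import defaultdict

# Only the three stop words named in the text; a real list would be longer.
STOP_WORDS = {"and", "the", "if"}


def index_terms(terms, articles):
    """Map each non-stop-word term to the IDs of articles that mention it.

    Matching is case-insensitive and restricted to whole words or phrases,
    which is an assumption made for this sketch.
    """
    hits = defaultdict(set)
    for term in terms:
        if term.lower() in STOP_WORDS:
            continue  # skip terms that would match almost every article
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for article_id, text in articles.items():
            if pattern.search(text):
                hits[term].add(article_id)
    return dict(hits)


# Hypothetical articles and terms.
articles = {
    "a1": "CBF1 is induced during cold acclimation and regulates COR genes.",
    "a2": "The phytochrome mutant shows altered flowering time.",
}
terms = ["CBF1", "cold acclimation", "flowering time", "THE", "if"]

candidate_links = index_terms(terms, articles)
print(candidate_links)
```

The output is a set of candidate term-to-article links; in the workflow described above, these are what curators then confirm or reject through the web interface.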

PubSearch uses a simple database schema in a MySQL database management system (DBMS) (version 3.21), which can be queried and updated using a password-protected login mechanism via the internet using a web browser. The middleware is written in Java (version 1.3) and uses Java Servlet and Java Server Page (JSP) technology. The system is currently running on a Red Hat Linux 7.2 system with Tomcat (version 4.0) as the servlet engine. The tool has been used and refined for the past 6 months by 7 curators at TAIR and 5 Arabidopsis curators at The Institute for Genomic Research (TIGR) to curate over 12,000 articles. The tool is much more convenient and user-friendly than our old system involving flat files, and our curation work has become much more efficient as a result.

In addition to providing curators with a sophisticated tool to facilitate literature curation, this project impacts three bodies of the research community significantly. First, the Arabidopsis research community benefits from access to accurate and consistent annotations of data objects from the literature, which are produced in a fast, efficient manner. Second, researchers engaged in high throughput genomic projects benefit by having access to reliable, high quality annotations that can be used to enhance automated annotations. Often sequence comparison is used to predict the potential function of genes and gene products in a newly sequenced organism; accurate and detailed descriptions of a model genome and its complements will improve the accuracy of the newly sequenced organism’s annotation. Third, members of the computer science research community can use the rules, methods and curated data to develop more sophisticated and accurate algorithms to extract and analyze data from the literature. The set of human-curated data along with explicit rules used for the annotations will provide much-needed test data sets for developing and improving algorithms based on methods such as natural language processing and machine learning. This final application of the tool raises the possibility that manual curation of literature can be greatly reduced, allowing our curation teams the freedom to use their scientific training to explore and question the data collected in model organism databases (MODs), leading to new hypotheses and potential discoveries.

The establishment and usage of these shared, controlled vocabularies will allow researchers to query across all organisms for knowledge and begin to address correlations between structure and function in explicit, systematic ways.

Future Plans in the Next Few Years

In addition, we will develop a set of data entry and update tools to allow researchers to add and update any information in the database. Currently, we have an interactive data entry system only for person or organization profile information. We plan on expanding this to allow researchers to add information about genetic markers, genes, proteins, microarray experiments, etc. In addition, we will implement a system to allow a researcher to attach his or her own comments to any information at TAIR. Our long-term goal is to establish TAIR as an essential communication and research tool whereby it is the first place a researcher should go to find out about any aspect of Arabidopsis biology. Some aspect of in-house curation will always be essential but we hope to disperse some of the curation responsibilities to those researchers that have generated the data and thus create a co-operative resource.

Ultimately, our goal is to provide the common vocabulary, visualization tools, and information retrieval mechanisms that permit integration of all knowledge about Arabidopsis into a seamless whole that can be queried from any perspective. Of equal importance for plant biologists, the ideal TAIR will permit a user to use information about one organism to develop hypotheses about less well-studied organisms. In the next few years, we hope to develop user-friendly tools that permit an individual working outside this model species to formulate a query based on their organism of interest, have that query directed to the relevant knowledge in Arabidopsis, and present the information in a way that can be understood by any plant biologist. We will be making efforts to cross-link information in TAIR with information about other plants and organisms in other databases. In addition, we will develop a more comprehensive help system to allow researchers not familiar with Arabidopsis to use the information in TAIR more effectively.

In an effort to systematically characterize the unknown, we are collaborating with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory, David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook, and Natasha Raikhel at UC Riverside) to identify subcellular localization of approximately 800 genes that have no known function, are not similar to any known genes, and have no localization information. The selected genes with their 5’ and 3’ intergenic regions will be PCR-amplified, fused to GFP, and the transgenic plants harboring the clones will be examined for subcellular localization. Our role will be to develop a Laboratory Information Management System (LIMS) to store and prioritize the candidate genes for cloning based on a number of criteria (including annotation download from TAIR, existence of full-length cDNA, etc.), track the status of the cloning, upload the preliminary results for internal discussions, and export the data to TAIR and other public repositories. In addition, the experimental results from this study will be used to identify potential novel signal peptides and improve subcellular localization prediction algorithms.
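The selection and prioritization role of the LIMS can be sketched as a filter followed by a sort. This is a minimal Python illustration under stated assumptions: the record fields, the status values, the ranking rule (candidates with a full-length cDNA first), and the gene identifiers are all hypothetical, and the actual system may weigh further criteria.

```python
from dataclasses import dataclass
from enum import Enum


class CloningStatus(Enum):
    # Hypothetical stages loosely following the workflow described above.
    CANDIDATE = "candidate"
    AMPLIFIED = "PCR-amplified"
    FUSED_TO_GFP = "fused to GFP"
    IMAGED = "imaged"


@dataclass
class Candidate:
    gene_id: str
    has_known_function: bool
    similar_to_known_gene: bool
    has_localization_data: bool
    has_full_length_cdna: bool
    status: CloningStatus = CloningStatus.CANDIDATE


def is_eligible(c):
    """Keep genes with no known function, no similarity, and no localization data."""
    return not (c.has_known_function or c.similar_to_known_gene or c.has_localization_data)


def prioritize(candidates):
    """Eligible genes, those with a full-length cDNA first, then by identifier."""
    pool = [c for c in candidates if is_eligible(c)]
    return sorted(pool, key=lambda c: (not c.has_full_length_cdna, c.gene_id))


# Hypothetical candidate records.
candidates = [
    Candidate("gene_3", False, False, False, False),
    Candidate("gene_1", False, False, False, True),
    Candidate("gene_2", True, False, False, True),
    Candidate("gene_4", False, False, False, True),
]

queue = prioritize(candidates)
queue[0].status = CloningStatus.AMPLIFIED  # record progress for the first gene
print([c.gene_id for c in queue])
```

Keeping the status on the same record as the selection criteria means the cloning queue and the progress report can both be produced from a single list.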