by Wayne Smith
Before the 1950's, geneticists studied the mechanisms of how genes are passed from one generation to the next by using cross-breeding experiments. However, the discovery of the molecular structure of the genetic material and its chemical properties opened up a fertile field for understanding biology through the fields of chemistry and physics, without using live organisms. As scientists probed into the chemical makeup of genes, their growing understanding helped them develop well-defined processes for extracting genetic information. This produced volumes of data growing far more quickly than could be analyzed. Computational science has come alongside these efforts to decipher the genetic machinery of humans.
The next section provides an overview of what is involved in deciphering a genome. After defining the problem, this paper surveys significant areas where computational science has helped to solve the problem. Though scientists regularly use computers to assist in laboratory methods for data collection and verification, the main developments from the computational science perspective are in data organization, access and analysis. The last section concludes by discussing emerging areas for future research.
Mapping and sequencing a genome consists of discovering how biological traits are encoded in the genetic material of an organism. Many biological characteristics can be traced to one or more genes. A gene is a portion of a DNA (Deoxyribonucleic Acid) strand. This strand is made up of millions (sometimes billions) of smaller molecules called nucleotides. Mapping gives an estimate of the location of a gene in the DNA. Sequencing identifies the exact nucleotides that make up a gene.
Organisms are made out of cells, and cell behavior is governed by its genetic makeup. Various chemicals act as agents to activate genes in a process called gene expression. Once a gene has been ``turned on'', it undergoes transcription and translation. Transcription uses the DNA as a template to create a mRNA molecule, which is then translated into proteins made up of amino acids. As proteins are formed from the DNA, they spontaneously fold into three dimensional shapes that dictate their chemical function. This process, from encoded gene to expressed gene, frames the problem of deciphering the genome. Scientists want to associate a particular portion of DNA with a particular biological function. By decoding the DNA, they hope to define a unified theory of biology that explains all biological functions.
As scientists break down the gene expression process, several key sub-problems have emerged, as illustrated in Figure 1. While trying to find direct associations between DNA and biological functions, several questions arose:
DNA and amino acid arrangements represent sequence data, and protein structures (called conformations) represent the structure data. Functional data that relates the two exist in medical and pharmaceutical databases, but functional analysis is still relatively new (see this paper's conclusion). The whole problem can be summarized as trying to relate patterns among the sequence data to patterns among the structures that, in turn, map to patterns among the protein functions. Scientists are hunting for the basic principles and mechanisms that effectively explain the process of gene expression. The hunt for principles has ranged from knowledge-intensive methods that use chemistry and quantum mechanics, to data-intensive methods that perform statistical analysis of existing sequence and structure data.

As scientists continue to collect data, a multitude of databases have emerged ([7, 10] survey many current databases). Some databases hold information on a particular organism while others specialize in a specific type of information (e.g. only sequences or only bibliographies). One of the primary sequence databases is the Genome Database at Johns Hopkins (found at http://gdbwww.gdb.org, and it is replicated at nine international sites. Two important structure databases are Brookhaven's Protein Data Bank (PDB) (found at http://pdb.pdb.bnl.gov and Cambridge's Crystallographic Data Bank (found at http://csdvx2.ccdc.cam.ac.uk. Apart from simply collecting the information on-line, efforts to improve infrastructure have focused on finding common schemas for the many databases and on making databases more accessible (e.g. via the web) and more usable. It is critical that the genetic data be intuitively presented to its users: geneticists, molecular biologist, etc.
Between transcription and translation, portions of the DNA are spliced out and never translated into protein. Some of the earliest programs, therefore, searched DNA sequences for areas that would not be spliced out. Statistical models of coding (areas not spliced out) versus non-coding regions have attained 95% accuracy while models using neural networks have attained 99% accuracy [3]. The current emphasis for site analysis is on finding boundaries between coding and non-coding regions.
Once the sequence databases grew to a significant size, programs for sequence matching emerged in the 1980's to find similarities among the collection. Today the main programs used for sequence matching are BLAST (Basic Local Alignment Search Tool) (found at http://www.ncbi.nlm.nih.gov/BLAST and FASTA (see, for example http://www.genome.ad.jp/SIT/FASTA.html). Each uses a scoring mechanism to track matches of subsequences, and the intermediate scores narrow the search for larger subsequence matching [14]. These techniques and their variations are capable of classifying approximately 40% to 50% of the new sequences. The remainder falls into what is commonly called the Twilight Zone of sequence similarities because their relationships are too subtle for pure sequence comparison methods to determine. To make further classifications, scientists use structure data or rely on principles from chemistry and physics.
Quantum mechanics provides a foundation for deriving the shape of a molecule based on its constituent atoms. However, the calculations required for a molecule of even only six atoms are enormous and requires many assumptions. Nevertheless, simulations of molecular dynamics have been used for years to understand protein structures [4]. Scientists have also formulated ways to compare and group protein structures using the two- and three-dimensional coordinates of the constituent atoms. These and other techniques are described in more detail in [2].
A protein has a primary, secondary, tertiary and quaternary structure. Some recent analyses use six levels that add super-secondary structure (between secondary and tertiary) and domains (between super-secondary and tertiary).
Around 1961 researchers began to understand that protein sequence largely determines protein structure. Scientists have tried a multitude of techniques to associate the primary structure with the secondary and tertiary structures. Early attempts from the 1960's, 1970's and 1980's used statistical correlation, dynamic programming and other techniques, all without significant progress. Meanwhile, in the laboratories, the determination of protein structures continues to rely on expensive and time-consuming processes such as X-ray crystallography. There are many undiscovered sequences, but there are far more unknown structures. Sequence comparison and classification allow a method for structure prediction. Given a known sequence with an unknown structure, one may find its structure using known structures of similar sequences (see Figure 2). This basic approach is used in almost all of the recent predictive methods.

Scientists have pursued sequence-to-structure prediction using molecular
dynamics simulations, simulated annealing, Markov Models, stochastic grammars,
neural networks and heuristic searches.
Part of the difficulty in predicting structure from sequence is that the data is sparse and unevenly distributed across all possible sequences and structures. For example, there are an estimated 90,000 protein structures associated with the human genome, but there are only about 3,000 captured in databases [2]. Predictive algorithms are constrained by the sufficiency of past associations. Sparse history leads to weak predictive power.
We have toured the landscape of mapping and sequencing the human genome. The basic problem is summarized as building associations between genes encoded in long DNA sequences and biological functions dictated by activating those genes. Geneticists have developed numerous techniques to collect, validate and distribute data for DNA sequences, protein sequences and protein structures. Computational scientists have developed tools for enhancing these tasks.
Working with geneticists, computational scientists will enable widely accessible databases of genetic data. Standards are being developed for common schemas, but tools will be needed to enforce those schemas and to migrate existing data to the new formats. Data warehousing technology for database migration and transformation should be considered for these tasks.
As the sequence and structure databases become better populated, classification and prediction algorithms will improve. The hybrid approaches appear to be the most promising. In the near- to mid-term, though, computational scientists should consider alternative approaches for predicting protein structure. The governing rule that sequence determines structure may be better applied when the sequence data is more complete. Moreover, data from protein functions and metabolic pathways should be probed, compared, classified and used to aid in structure determination. There is a wealth of information on protein and synthetic drug functions hidden in medical and pharmaceutical databases that must be tapped to build structure-to-function associations. Work is in progress, but structure-to-function associations could become the next big hurdle in deciphering genomes.
Last Modified:
Location: www.acm.org/crossroads/xrds4-1/genome.html