Tuesday, February 17, 2009

Base-ial Profiling

Hidden Markov Models in Bioinformatics
Contributed editorial appearing in
Scientific Computing & Instrumentation 18:11, November 2001, pg. 15.

For an analytical scientist browsing the meat aisle at the local supermarket, the cheerful little sticker on a package of hamburger proudly announcing “Ground Lean Chuck, 2.45 LBS” is quite a gift. It is rare for sample quantities to be presented so conspicuously. Most often, data such as composition and mass must be extracted through the careful use of measuring instruments and data analysis. Even then, the instrument’s transducer response merely serves as a surrogate that varies in proportion to the quantity actually desired. For example, a resistance temperature detector (RTD) has a resistance that increases approximately linearly with temperature. By recording the RTD resistance at known calibration temperatures, a calibration plot can be constructed. The response of the RTD to temperature is modeled by a linear least-squares fit to the data, yielding the parameters of slope and offset. In this fashion the RTD response is “trained” to communicate the “hidden” value of sample temperature through the transducer’s “visible” change in resistance.
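
The calibrate-and-measure step can be sketched in a few lines of Python. This is a minimal illustration, not a laboratory procedure: the calibration readings are made-up values chosen to resemble a nominal 100-ohm RTD, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + offset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


def resistance_to_temperature(resistance, slope, offset):
    """Invert the calibration line to recover the 'hidden' temperature."""
    return (resistance - offset) / slope


# Hypothetical calibration data: known temperatures (deg C) and the
# resistances (ohms) recorded at each one.
cal_temps_c = [0.0, 25.0, 50.0, 75.0, 100.0]
cal_ohms = [100.0, 109.7, 119.4, 129.1, 138.5]

slope, offset = fit_line(cal_temps_c, cal_ohms)

# A new "visible" reading of 115.0 ohms is mapped back to temperature.
measured_temp = resistance_to_temperature(115.0, slope, offset)
print(f"slope={slope:.4f} ohm/degC, offset={offset:.2f} ohm, T={measured_temp:.1f} degC")
```

The fit is done with resistance as the dependent variable, matching the calibration plot described above, and the line is then inverted to read temperature from a measured resistance.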

State values including sample temperature, pressure, mass, density, and pH can be obtained using this pointwise “calibrate and measure” technique. Dynamic processes like speech recognition require a similar, albeit more sophisticated, approach. Rather than simply calibrating a microphone membrane to measure the amplitude and frequency of a human voice, the instrumental recognition of spoken words requires the calibration model to accommodate temporal patterns of tone, harmonics, percussion, and modulation that coalesce to form words. Variable inflections, accents, tempo, enunciation, and dialects dictate the construction of a probability-based library of calibration models. The acquired pattern or temporal “profile” of a recorded word is compared against this library of models in search of highly probable matches. As the recognition library becomes more developed, entire sentences can be profiled in order to reduce homophone errors. An early IBM advertising campaign for its speech recognition software showcased the correct automated dictation of the phrase, “Write a letter to Mr. Wright, right now.” The probabilistic algorithm used by temporal speech recognition software is based on the early 1900s work of Russian mathematician Andrei Andreyevich Markov, who pioneered the investigation of sequences of random variables in which the future variable depends only on the present variable and is independent of the way in which the present state arose from its predecessors. Known as hidden Markov models (HMMs), these model libraries also have been used to profile and study the music of specific composers. Using the profiles generated, new compositions can be synthesized in the style of Mozart or Bach.
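
As a rough sketch of how an observed profile might be compared against a library of models, the Python below scores a short symbol sequence against two toy HMMs using the standard forward algorithm and picks the more probable one. Everything specific here is an assumption made for illustration: the two-symbol alphabet ("L" for low tone, "H" for high tone), the model names, and all of the probabilities. Real speech recognizers use far richer features and models.

```python
def forward_likelihood(obs, start, trans, emit):
    """Forward algorithm: P(obs | model), summed over all hidden state paths."""
    states = list(start)
    # Initialise with the start probabilities and the first emission.
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        # Markov property: the next state depends only on the current one.
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Both toy models share the same left-to-right transition structure.
start = {"s1": 1.0, "s2": 0.0}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.0, "s2": 1.0}}

# Hypothetical model library: each entry is (start, transitions, emissions).
library = {
    "low_then_high": (start, trans,
                      {"s1": {"L": 0.9, "H": 0.1}, "s2": {"L": 0.1, "H": 0.9}}),
    "high_then_low": (start, trans,
                      {"s1": {"L": 0.1, "H": 0.9}, "s2": {"L": 0.9, "H": 0.1}}),
}

observed = "LLHH"  # the "visible" profile of a recorded word
scores = {name: forward_likelihood(observed, *model)
          for name, model in library.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The hidden states are never observed directly; only the emitted symbols are, which is what makes the model "hidden." For long sequences the products underflow, so practical implementations work with log probabilities or rescale at each step.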

HMM profiles can also be generated for the spatial patterns of written words. The probability that a specific letter follows another can be modeled through an examination of example vocabulary. In the English language, the probability that the letter “u” will follow a “q” is high, while the probability that a “g” will follow a “z” is low. The insertion, deletion, or mutation of letters between similar terms from related languages can be analyzed using HMMs. Red, rot, rojo, and rouge exhibit a common ancestry or “homology” among the English, German, Spanish, and French languages.
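
The letter-following-letter idea can be illustrated by counting adjacent letter pairs in a word list and normalizing the counts into transition probabilities. The vocabulary below is a tiny hypothetical sample, so the resulting numbers say nothing about English as a whole; the function name is likewise illustrative.

```python
from collections import Counter, defaultdict


def letter_transition_probs(words):
    """Estimate P(next letter | current letter) from example vocabulary."""
    counts = defaultdict(Counter)
    for word in words:
        w = word.lower()
        for current, following in zip(w, w[1:]):
            counts[current][following] += 1
    probs = {}
    for current, following_counts in counts.items():
        total = sum(following_counts.values())
        probs[current] = {b: c / total for b, c in following_counts.items()}
    return probs


# Hypothetical training vocabulary.
vocab = ["queen", "quick", "quiet", "quote", "zebra", "zone", "gaze", "size"]
probs = letter_transition_probs(vocab)

p_qu = probs["q"].get("u", 0.0)  # "u" after "q"
p_zg = probs["z"].get("g", 0.0)  # "g" after "z"
print(f"P(u|q)={p_qu:.2f}  P(g|z)={p_zg:.2f}")
```

A transition that never appears in the sample gets probability zero here; with a real corpus one would normally add a small pseudocount so that rare pairs are not ruled out entirely.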

The late 1980s dissertation research of Dr. Gary A. Churchill, presently with the Jackson Laboratory in Bar Harbor, Maine, demonstrated the elegant utility of HMMs to the then-nascent area of biological statistics. Concomitant with the appearance of analytical instrumentation for the sequencing of DNA base pairs and protein amino acids, Churchill’s work likened the linear, seemingly random sequences identified by genomics and proteomics studies to the vocabulary words of varied human languages. Profile HMMs constructed from known patterns permit newly sequenced gene and protein fragments to be classified. Collections of sequences obtained from different organisms and having unknown functions can be used to train HMM libraries, which in turn sort the sequences into homologous families. Studying the structure and function of these families is more likely to succeed than examining the sequences individually.
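
To make the classification step concrete, the sketch below trains a heavily simplified profile from pre-aligned DNA sequences and scores a new fragment against each family. It keeps only the position-specific emission probabilities of a profile HMM and omits the insert and delete states a real profile HMM would have, so it amounts to a position weight matrix rather than a full HMM. The family names, sequences, pseudocount, and uniform background are all assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_profile(aligned, pseudocount=1.0):
    """Per-position base probabilities from equal-length aligned sequences."""
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(aligned[0])):
        column = Counter(seq[i] for seq in aligned)
        profile.append({b: (column[b] + pseudocount) / total for b in ALPHABET})
    return profile


def log_odds(fragment, profile, background=0.25):
    """Log-odds of the fragment under the profile versus a uniform background."""
    return sum(math.log(col[b] / background) for b, col in zip(fragment, profile))


# Hypothetical families of already-aligned sequences.
families = {
    "family_1": ["ACGTAC", "ACGTTC", "ACGAAC", "ACGTAC"],
    "family_2": ["TTGCAA", "TTGCGA", "TAGCAA", "TTGCAA"],
}
library = {name: train_profile(seqs) for name, seqs in families.items()}

fragment = "ACGTAC"  # newly sequenced fragment to classify
scores = {name: log_odds(fragment, p) for name, p in library.items()}
best_family = max(scores, key=scores.get)
print(best_family, scores)
```

A positive score means the fragment looks more like the family than like random sequence; the pseudocount keeps a base that was never seen at a position from forcing the score to negative infinity.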

The next generation of HMMs being developed for bioinformatics is incorporating information about the secondary and tertiary structure of proteins. In much the same way that sentence context helps the IBM voice recognition software avoid homophone errors, 3-D structure and intermolecular forces between separate protein domains influence the composition of the sequence. Armed with HMM libraries that include these contextual effects, researchers can predict protein folding and macromolecular structure from the sequence information gathered instrumentally.