United States Patent6961664
Selifonov , ; et al.November 1, 2005

Title

Methods of populating data structures for use in evolutionary simulations

Abstract

In particular, this invention provides novel methods of populating data structures for use in evolutionary modeling. In particular, this invention provides methods of populating a data structure with a plurality of character strings. The methods involve encoding two or more a biological molecules into character strings to provide a collection of two or more different initial character strings; selecting at least two substrings from the pool of character strings; concatenating the substrings to form one or more product strings about the same length as one or more of the initial character strings; adding the product strings to a collection of strings; and optionally repeating this process using one or more of the product strings as an initial string in the collection of initial character strings.


Inventors:Selifonov; Sergey A. (Los Altos, CA), Stemmer; Willem P. C.  (Los Gatos, CA)
Assignee:Maxygen (Redwood City, CA)
Appl. No.:495668
Filed:February 1, 2000

Current U.S. Class:702/19 702/27 435/6 536/23.1 
Field of Search:435/6 536/23.1 702/19,27 706/47 712/200

U.S. Patent Documents
4959312September 1990Sirokin
4994379February 1991Chang
5023171June 1991Ho et al.
5043272August 1991Hartly et al.
5066584November 1991Gyllensten et al.
5264563November 1993Huse
5319308June 1994Dechene
5408181April 1995Dechene
5447839September 1995Manos et al.
5512463April 1996Stemmer
5514588May 1996Varadaraj et al.
5519319May 1996Smith
5521077May 1996Kholsa et al.
5565350October 1996Kmiec
5580730December 1996Okamoto
5603793February 1997Yoshida et al.
5605793February 1997Stemmer
5612205March 1997Kay et al.
5650722July 1997Smith
5675253October 1997Smith
5717085February 1998Little et al.
5760012June 1998Kmiec et al.
5763239June 1998Short et al.
5789228August 1998Lam et al.
5789577August 1998Geysen
5811238September 1998Stemmer et al.
5814473September 1998Warren et al.
5824469October 1998Horwitz et al.
5830696November 1998Short
5830721November 1998Stemmer et al.
5834252November 1998Stemmer et al.
5837458November 1998Minshull et al.
5866363February 1999Pieczenik
5869644February 1999Shortle et al.
5876997March 1999Kretz
5912119June 1999Radman et al.
5925749July 1999Mathur et al.
5928905July 1999Stemmer et al.
5939250August 1999Short
5939300August 1999Robertson et al.
5942430August 1999Robertson et al.
5948666September 1999Callen et al.
5958672September 1999Short
5958751September 1999Murphy et al.
5962258October 1999Mathur et al.
5962283October 1999Warren et al.
5965408October 1999Short
5985646November 1999Murphy et al.
6001574December 1999Short et al.
6004788December 1999Short
6030779February 2000Short
6054267April 2000Short
6096548August 2000Stemmer
6117679September 2000Stemmer
6125331September 2000Toh
6132970October 2000Stemmer
6153410November 2000Arnold et al.
6159690December 2000Borrebaeck et al.
6188965February 2001Mayo et al.
6269312July 2001Mayo et al.
6403312June 2002Dahiyat et al.
6455254September 2002Short
6537776March 2003Short
6605449August 2003Short
Foreign Patent Documents
00/37684Jun., 2000WO
0911396Apr., 1999EP
0911396May., 1999EP
0934999Aug., 1999EP
97/07205Feb., 1997WO
97/20078Jun., 1997WO
97/25410Jul., 1997WO
97/35957Oct., 1997WO
97/35966Oct., 1997WO
97/44361Nov., 1997WO
97/48416Dec., 1997WO
97/48717Dec., 1997WO
97/48794Dec., 1997WO
98/00526Jan., 1998WO
98/01581Jan., 1998WO
98/13485Apr., 1998WO
98/13487Apr., 1998WO
98/24799Jun., 1998WO
98/27230Jun., 1998WO
98/28416Jul., 1998WO
98/31387Jul., 1998WO
98/36080Aug., 1998WO
98/41622Sep., 1998WO
98/41623Sep., 1998WO
98/41653Sep., 1998WO
98/42832Oct., 1998WO
98/48034Oct., 1998WO
98/58085Dec., 1998WO
99/07837Feb., 1999WO
99/08539Feb., 1999WO
99/10472Mar., 1999WO
99/10539Mar., 1999WO
99/19518Apr., 1999WO
99/21979May., 1999WO
99/23107May., 1999WO
99/23236May., 1999WO
99/41368Aug., 1999WO
99/41369Aug., 1999WO
99/41383Aug., 1999WO
99/41402Aug., 1999WO
99/45154Sep., 1999WO
99/57128Nov., 1999WO
99/65927Dec., 1999WO
WO 00/00632Jan., 2000WO
WO 00/23564Apr., 2000WO
WO 00/42560Jul., 2000WO
WO 00/42561Jul., 2000WO
WO 00/53744Sep., 2000WO
WO 00/58517Oct., 2000WO
WO 91/06643May., 1991WO
WO 92/06176Apr., 1992WO
WO 95/15972Jun., 1995WO
WO 95/22625Aug., 1995WO
WO 96/16073May., 1996WO
WO 96/33207Oct., 1996WO
WO 97/33957Sep., 1997WO
WO 97/35966Oct., 1997WO
WO 98/49350Nov., 1998WO
WO00/42559Jul., 2000WO
WO00/47612Aug., 2000WO
WO01/61344Aug., 2001WO
WO01/75767Oct., 2001WO
Other References
Yokomori et al., "A similarity measure for DNA sequence analysis based on locality," Genome Inf. Ser. (1993), vol. 4, pp. 283-292. .
Sun, Fengzhu, "Modeling DNA Shuffling," Journal of Computational Biology, 1999, vol. 6, No. 1, pp. 77-89. .
Merriam-Webster On-line Dictionary, 2002, www.m-w.com. .
Dahiyat et al., "Automated Design of the Surface Positions of Protein Helices." Protein Science 1997, pp. 1333-1336. .
Dahiyat et al., "De Novo Protein Design: Fully Automated Sequence Selection," Science vol. 278 Oct. 3, 1997, pp. 82-87. .
Dahiyat et al., "De Novo Protein Design: Towards Fully Automated Sequence Selection," J. Mol. Biol. 1997 273, pp. 789-796. .
Dahiyat et al., "Probing the Role of Packing Specificity in Protein Design," Proc. Natl. Acad. Sci. USA, Sep. 1997, vol. 94, pp. 10172-10177. .
Gordon and Mayo, "Branch-and-Terminate: A Combinatorial Optimization Algorithm for Protein Design," Structure, Sep. 1999, pp. 1089-1098. .
Gordon et al., "Energy Functions for Protein Design," Current Opinion in Structural Biology, 1999, pp. 509-513. .
Gordon and Mayo, "Radical Performance Enhancements for Combinatorial Optimization Algorithms Based on the Dead-End Elimination Theorem," Journal of Computational Chemistry, 1998, vol. 19, pp. 1505-1514. .
Haney et al., "Structural Basis for Thermostability and Identification of Potential Active Site Residues for Adenylate Kinases From the Archaeal Genus Methanococcus," Proteins: Structure, Function, and Genetics, 1997, pp. 117-130. .
Malakauskas and Mayo, "Design, Structure and Stability of a Hyperthermophilic Protein Variant," Natural Structural Biology, Jun. 1998, vol. 5 No. 6, pp. 470-475. .
Pollock et al., "Coevolving Protein Residues: Maximum Likelihood Identification and Relationship to Structure," J. Mol. Biol, 1999, pp. 187-198. .
Street and Mayo, "Computation Protein Design," Structure, May 1999, pp. R105-R109. .
Street and Mayo, "Intrinsic .beta.-sheet Propensities Result from van der Walls Interactions Between Side Chains and the Local Backbone," Proc. Natl. Acad. Sci. USA, Aug. 1999, vol. 96, pp. 9074-9076. .
Street and Mayo, "Pairwise Calculation of Protein Solvent-Accessible Surface Areas," Folding & Design, Jun. 3, 1998, pp. 253-258. .
Strop and Mayo, "Rubredoxin Variant Folds without Iron," American Chemical Society, Mar. 24, 1999, vol. 121, No. 11, pp. 2341-2345. .
Wollenberg and Atchley, "Separation of Phylogenetic and Functional Associations in Biological Sequences by Using the Parametric Bootstrap," PNAS, Mar. 28, 2000, vol. 97, pp. 3288-3291. .
Stemmer, "DNA Shuffling by Random Fragmentation and Reassembly: In Vitro Recombination for Moleclar Evolution," Proc. Natl. Acad. Sci. USA, Oct. 1994, vol. 91, pp. 10747-10751. .
Venkatasubramanian et al., "Evolutionary Design of Molecules with Desired Properties Using the Genetic Algorithm," J. Chem. Inf. Comput. Sci., 1995, vol. 35pp. 188-195. .
Whipple et al., "Application of Genetic Algorithms to Combinatorial Synthesis: A Computational Approach to Lead Identification and Lead Optimization," J. AM. Chem. Soc., 1996, vol. 118, pp. 1669-1676. .
Harayama, Shigeaki, "Artificial Evolution by DNA Shuffling," Tibtech, Feb. 1998, vol. 16, pp. pp. 16-19 and pp. 80-82. .
Zhang, Ching, "A Genetic Algorithm for Molecular Sequence Comparison," Proceedings of the International Conference on Systems, Man, and Cybernetics, 1994, pp. 1926-1931. .
Stemmer et al., "Single-Step assembly of a Gene and Entire Plasmid from Large Numbers of Oligodeoxyribonucleotides," Gene, 1995, vol. 164, pp. 49-53. .
CS 262, Computational Genomics, Handout .TM.1: Course Information, printed from: http://www.stanford.edu/class/cs, Spring 2003, 11 pages. .
Corpet et al., Browsing Protein Families Via the Rich Family Description Format, Bioinformatics, vol. 15, No. 12, 1999, pp. 1020-1027. .
Mironov et al., "Computer Analysis of Transcription Regulatory Patterns in Completely Sequenced Bacterial Genomes," Nucleic Acids Research, vol. 27, No. 14, 1999, pp. 2981-2989. .
Moore et al., Modeling and Optimization of DNA Recombination, Computer and Chemical Engineering 2000, Department of Chemical Engineering, The Pennsylvania State University, University Park .COPYRGT. 2000. .
Gregory L. Moore, Costas D. Maranas, Modeling DNA Mutation and Recombination for Directed Evolution Experiments, Department of Chemical Engineering, The Pennsylvania State University, University Park, Received Oct. 28, 1999, Accepted in revised form Apr. 15, 2000 .COPYRGT. 2000 Academic Press. .
Young et al., "Characterization of Receptor Binding Determinants of Granulocyte Colony Stimulating Factor," Protein Science 6:1228-1236, 1997. .
Dahiyat and Mayo, "Protein Design Automation," Protein Science, 5:895-903, (1996). .
Su et al., "Coupling Backbone Flexibility and Amino Acid Sequence Selection in Protein Design," Protein Science, 6:1701-1707, (1997). .
Voigt et al., "Computationally Focusing the Directed Evolution of Proteins," Journal of Cellular Biochemistry Supplement, 37:58-63 (2001). .
Hellberg et al., "Minimum Analogue Peptide Sets (MAPS) for Quantitative Structure-Activity Relationships," Int. J. Peptide Protein Res. 37:414-427 (1991). .
Martin van Heel, "A New Family of Powerful Multivariates Statistical Sequence Analysis Techniques," J. Mol. Biol, 220:877-887 (1991). .
Goldman et al., "Estimating Protein Function From Combinatorial Sequence Data Using Decision Algorithms and Neural Networks," Drug Dev. Research 33:125-132 (1994). .
Gustafsson et al., "Exploration of Sequence Space for Protein Engineering," J. Mol. Recognit. 14:308-314 (2001). .
Miyazawa et al., "Residue-Residue Potentials with a Favorable Contact Pair Term and an Unfavorable High Packing Density Term, for Simulation and Threading," J. Mol. Biol., 256:623-644 (1996). .
Chao Zhang, "Extracting Contact Energies From Protein Structures: A Study Using a Simplified Model," Proteins: Structure, Function, and Genetics, 31:299-308 (1998). .
Miyazawa et al., "Self-Consistent Estimation of Inter-Residue Protein Contact Energies Based on an Equilibrium Mixture Approximation of Residues," Proteins: Structure, Function, and Genetics, 34:49-68 (1999). .
Miyazawa et al., "An Empirical Energy Potential With a References State for Protein Fold and Sequence Recognition," Proteins: Structure, Function, and Genetics, 36:357-369 (1999). .
Moore et al., "Predicting Crossover Generation in DNS Shuffling," PNAS, vol. 98, No. 6, 3226-3231 (2001). .
Lehman et al., "Engineering Proteins for Thermostability: the Use of Sequence Alignments Versus Rational Design and Directed Evolution," Current Opinion in Biotechnology, 13:371-375 (2001). .
Colleen Kelly, "A Test of the Markovian Model of DNA Evolution," Biometrics 50, 653-664, (1994). .
H.W. Hellinga, "Rational Protein Design: Combining Theory and Experiment," Proc. Natl. Acad. Sci. USA, vol. 94, pp. 10015-10017, (1997). .
William F. DeGrado, "Proteins from Scratch," Science, vol. 278, 80-81 (1997). .
Jonsson, et al, "Quaintitative Sequence-Activity Modeils (QSAM)- Tool For Sequence Design", Nuclear Acid Research vol. 21, No. 3, pp. 733-739 (1993). .
Sjostrom, et al, "Signal Peptide Amino Acid Sequences in Escheruchua coli Contain Information Related To Final Protein Localization. A Multivariate Data Analysis", The CMBO Journal vol. 6, No. 3, pp 823-831, (1987). .
Patel, et al, "Patenting Computer-Designed Peptides", Journal Of Computer-Acid Molecular Design 12 pp543-556, (1998). .
Schneider, et al, "Peptide Design by Artificial Neural Networks and Computer-Based Evolutionary Search", Proc. Natl. Acad. Sci. USA, vol. 95, pp. 12179-121184, Oct. 1998. .
Mee, et al, "Design of Active Analogues of a 15-Residue Peptide Using D-Optimal Design QSAR and a Combinatorial Search Algorithm", J Peptide Res. 49, pp. 89-102, (1997). .
Bogarad, et al, "A Hierarchical Approach to Protein Molecular Evolution", Proc. Natl. Acad. Sci. USA, vol. 96, pp. 2597-2595, Mar. 1999. .
Darius, et al, "Simulated Molecular Evolution" Or Computer-Generated Artifacts?, Biophysical Journal, vol. 67, pp. 2120-2122, Nov. 1994. .
Martin et al., "Measuring Diversity: Experimental Design of Combinatorial Libraries for Drug Discovery," J. Med. Chem. 38, 1431-1436, 1995. .
Sheridan et al., " Using a Genetic Algorithm to Suggest Combinatorial Libraries," J. Chem. Inf. Compu. Sci., 35, 310-320, 1995. .
D.K. Agrafiotis, "Multiobjective Optimization of Combinatorial Libraries," IBM J. Res & Dev., vol., 45, No. 3, 545-566, 2001. .
Selifonov, Sergey A. et al., "Methods For Making Characteristics Strings, Polynucleotides And Polypeptides Having Desired Characteristics", U.S. Appl. No. 09/416,375, filed Oct. 12, 1999. .
Selifonov, Sergey A. et al, "Methods For Making Character Strings, Polynucleotides And Polypeptides Having Desired Characteristics", U.S. Appl. No. 09/494,282, filed Jan. 18, 2000. .
Selifonov, Sergey A. et al., "Methods For Making Characteristics Strings, Polynucleotides And Polypeptides Having Desired Characteristics", U.S. Appl. No. 09/539,486, filed Mar. 30, 200. .
Selifonov, Sergey A. et al., "Methods For Making Character Strings, Polynucleotides And Polypeptides Having Desired Characteristics", U.S. Appl. No. 09/618,579, filed Jul. 18, 2000. .
Selifonov, Sergey A. et al. "Methods For Making Characteristics Strings, Polynucleotides And Polypeptides Having Desired Characteristics", PCT Application No. PCT/US00/01202, Publication No. WO 00/02560. .
Selifonov, Sergey A. et al. "In Silico Cross-Over Selection", PCT Application No. PCT/US01/10231, Publication No. WO 01/75767 A3. .
Alexeev et al. "Stable and Inheritable changes in genotype and phenotype of albino melanocytes induced by an RNA-DNA oligonucleotide." Nature Biotechnology 1343-1346 (1998). .
Giver and Arnold "Combinatorial protein design by in vitro recombination" Current Opinion in Chemical Biology (1998) 2:335-338. .
Zhao et al., "Molecular evolution by staggered extension process (StEP) in vitro recombination" Nature Biotechnology vol. 16 (1998) pp. 258-261. .
Crameri and Stemmer "10.sup.20 --Fold aptamer library amplification without gel purification" Nucleic Acid Research (1993) vol. 21, no. 18 pp. 4110. .
Feng and Doolittle "Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees" J. Mol. Evol. 35:351-360 (1987). .
Higgins and Sharp "Fast and sensitive multiple sequence alignments on a microcomputer" CABIOS 5:151-153 (1989). .
Sun "Modeling DNA Shuffling" Journal of Computational Biology 6(1):77-90. .
Altschul et al., "Basic Local Alignment Search Tool" J. Mol. Biol. 215:403-410 (1990). .
Boehnke et al., "Statistical Methods for Multipoint Radiation Hybrid Mapping" Am. J. Hum. Genet. (1987) 49:1174-1188. .
Brunner and Bujard "Promoter recognition and promoter strength in the Escherichia coli system" EMBO J. 6:3139-3144. .
Chang et al., "Evolution of a cytokine using DNA family shuffling" Nature Biotechnology (1999) 17:793-797. .
Christians et al., "Directed evolution of thymidine kinase for AZT phosphorylation using DNA family shuffling" Nature Biotechnology (1999) 17:259-264. .
Crameri and Stemmer "Combinatorial multiple cassette mutagenesis creates all the permutations of mutant and wildtype cassettes" BioTechniques (1995) 18:194-195. .
Crameri et al., "Construction and evolution of antibody-phage libraries by DNA shuffling" Nature Medicine (1996) 2:100-103. .
Crameri et al., "Improved Green Fluorescent Protein by Molecular Evolution Using DNA Shuffling" Nature Biotechnology (1996) 14:315-319. .
Crameri, A. et al., (1997) "Molecular evolution of an arsenate detoxification pathway by DNA shuffling." Nature Biotechnology 15:436-438. .
Crameri, A. et al., (1998) "DNA shuffling of a family of genes from diverse species accelerates directed evolution." Nature 391:288-291. .
Gates, C.M. et al., (1995) "Affinity selective isolation of ligands from peptide libraries through display on a lac repressor 'headpiece dimer". Journal of Molecular Biology 255:373-386. .
Irvine et al., "SELEXION: Systematic Evolution of Ligands by Exponential Enrichment with Integrated Optimization by Non-linear Analysis" J. Mol. Biol. (1991) 222:739-761. .
Josson et al., "Quantitative sequence-activity models (QSAM)-tools for sequence design" Nucleic Acids Res. (1993) 21(3):733-739. .
Kelly et al., "A Test of the Markovian Model of DNA Evolution" Biometrics (1994) 50(3):653-664. .
Knaus and Bujard "P.sub.L of coliphage lambda: an alternative solution for an efficient promoter" EMBO (1998) 7(9):2919-2923. .
Lander and Waterman "Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis" Genomics (1988) 2:231-239. .
Lanzer and Bujard "Promoters largely determine the efficiency of repressor action" Proc. Natl. Acad. Sci. (1988) 85:8973-8977. .
Minshull and Stemmer "Protein evolution by molecular breeding" Current opinion in Chemical Biology (1999) 3:284-290. .
Ness, J. et al., (1999) "DNA shuffling of subgenominc sequences of subtilisin." Nature Biotechnology 17:893-896. .
Stemmer "Rapid evolution of a protein in vitro by DNA shuffling" Nature (1994) 370-389-391. .
Stemmer (1995) "Searching Sequence Space" Bio/Technology 13:549-553. .
Stemmer (1995) "The Evolution of Molecular Computation." Science 270:1510. .
Stemmer (1996) "Sexual PCR and Assembly PCR." In: The Encyclopedia of Molecular Biology. VCH Publishers, New York. pp. 447-457. .
Stemmer and Soong (1999) "Molecular Breeding of viruses for targeting and other clinical properties." Tumor Targeting 4:1-4. .
Sun and Waterman "A Mathematical Analysis of in vitro Molecular Selection-Amplification" J. Mol. Biol. (1996) 258:650-660. .
Patten, P.A. et al., (1997) "Application of DNA Shuffling to Pharmaceuticals and Vaccines." Current Opinion in Biotechnology 8:724-733. .
Zhang, J. et al., "Directed evolution of an effective fucosidase from a galactosidase by DNA shuffling and screening" Proceedings of the National Academy of Sciences, USA (1997) 64:4504-4509. .
Alberts et al. "Recombinant DNA Technology." New York Publishing, Inc. 291-312 (1994). .
Arkin "Optimizing Nucleotide Mixtures to Encode Specific Subsets of Smino Acids for Semi-Random Mutagenesis." Biotechnology (N Y). Mar. 1992;10(3):297-300. .
Arkin "An algorithm for protein engineering: Simulations of recursive ensemble mutagenesis." Proc Natl Acad Sci U S A. Aug. 15, 1992;89(16):7811-5. .
Barlett "Long-lasting gene repair." Nature Biotechnology 16:1312 (1998). .
Botstein et al. "Strategies and Applications of in Vitro Mutagenesis." Science 229:1193-1201 (1985). .
Carter "Site-directed mutagenesis." Biochem J. 237:1-7 (1986). .
Carter "Improved Oligonucleotide-Directed Mutagenesis Using M13 Vectors." Methods in Enzymol 154:382-403 (1987). .
Carter "Improved oligonucleotide site-directed mutagenesis using M13 vectors." Nucleic Acids Res. 13:4431-4443 (1985). .
Chee et al. "Accessing Genetic Information with High-Density DNA Arrays." Science 274:610-614 (1996). .
Cole-Strauss et al. "Correction of the Mutation Responsible for Sickle Cell Anemia by an RNA-DNA Oligonucleotide." Science 1386-1389 (1996). .
Dale et al. "Oligonucleotide-directed Random Mutagenesis Using the Phosporothioate Method." Methods Mol. Biol. 57:369-374 (1996). .
Delagrave "Searching Sequence Space to Engineer Proteins: Exponential Ensemble Mutagenesis." Biotechnology (N Y). Dec. 1993;11(13):1548-52. .
Delagrave "Recursive ensemble mutagenesis." Protein Eng. Apr. 1993;6(3):327-31. .
Delagrave "Context dependence of phenotype prediction and diversity in combinatorial mutagenesis." Protein Eng. Mar. 1995;8(3):237-42. .
Egea Biosciences, Inc. "Pathway to the Proteome". .
Fodor et al. "Light-Directed, Spatially Addressable Parallel Chemical Synthesis." Science 251:767-777 (1991). .
Fodor "Genome Analysis: Genome Technology." FASEB Journal 11:121 (1997). .
Fodor "Massively Parallel Genomics." Science 227:393-395 (1997). .
Goldman "An Algorithmically Optimized Combinatorial Library Screened by Digital Imaging Spectroscopy." Biotechnology (N Y). Dec. 1992;10(12):1557-61. .
Hill et al. "The structure of granulocyte-colony-stimulating factor and its relationship to other growth factors." PNAS 90:5167 (1993). .
Kaczorowski et al. "Co-operatively of hexamer ligation." Gene 179(1):189-193 (1996). .
Karlin et al. "Applications and statistics for multiple high-scoring segments in molecular sequences." PNAS USA 90:5873-5787 (1993). .
Kayushin et al. "A convenient approach to the synthesis of trinucleotide phosphoroamidites-synthons for the generation of oligonucleotide/peptide libraries." Nucleic Acids Res. 24:3748-3755 (1996). .
Kren et al. "Targeted Nucleotide Exchange in the Alkaline Phosphates Gene of HuH-7 Cells Mediated by a Chimeric RNA/DNA Oligonucleotide." Hepatology 25(6):1462-1468 (1997). .
Kren et al. "In Vivo site-directed mutagenesis of the factor IX gene by chimeric RNA/DNA oligonucleotides." Nature Medicine4(3):285-290 (1998). .
Kunkel "The Efficiency of Oligonucleotide-Directed Mutagenesis." Nucleic Acids & Mol. Biol (1997). .
Ling et al. "Approaches to DNA Mutagenesis: An Overview." Anal Biochem. 254(2):157-178 (1997). .
Mehling et al. "Nucleotide sequences of streptomycete 16S ribosomal DNA: towards a specific identification system for streptomycetes using PCR." Microbiology 141:2139-2147 (1995). .
Mintz et al. "EHD1-An EH-Domain-Containing Protein with a Specific Expression Pattern." Genomics 59(1):66-76 (1999). .
Muerhoff et al. Identification of conserved nucleotide sequences within the GB virus c 5'-untranslated region: Design of PCR primers for detection of viral RNA J. Virological Methods 62:55-62 (1996). .
Nakamaye et al. "Inhibition of restriction endonuclease Nci I Cleavage by phoshorothioate groups and its applications to oligonucleotide-directed mutagenesis." Nucleic Acids. Res. 14:9679-9698 (1986). .
Needleman et al. "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins." J. Mol. Biol. 48:443 (1970). .
Pathy "Modular exchange principles in proteins." Curr Opinion in Structural Biology 1:351-361 (1991). .
Pathy "Introns and exons." Curr. Opinion in Structural Biology 4:383-392 (1994). .
Pearson et al. "Improved tools for biological sequence comparison." PNAS 85:2444 (1988). .
Porter et al. "Direct PCR Sequencing with boronated Nucleotides." Nucleic Acids Res. 25(8):1611-1617 (1997). .
Robles "Hydropathy and Molar Volume Constraints on Combinatorial Mutants of the Photosynthetic Reaction Center." J Mol Biol. Jul. 5, 1993;232(1):242-52. .
Sayers et al. "5'-3' Exonucleases in phosphorothiote-biased oligonucleotide-directed mutagenesis." Nucleic Acids Res. 16:791-802(1988). .
Sayers et al. "Strand specific cleavage of phosphorothioate-containing DNA by reaction with restriction endonucleases in the presence of ethidium bromide." Nucleic Acids Res. 16:803-814 (1988). .
Singh et al. "Application of Genetic Algorithms to Combinatorial Synthesis: A Computational Approach to Lead Identification and Lead Optimization." J. Chem Inf. Comput. Sci 118:1669-1679 (1996). .
Smith "In Vitro Mutagenesis." Ann. Rev. Genet. 19:423-462 (1985). .
Smith et al. "Comparison of Biosequences." Adv. Appl. Math 2:482 (1981). .
Strauss "The Site-Specific Correction of Genetic Defects." Nature Medicine 4:274-275 (1998). .
Taylor et al. "The use of phosphorothioate-modified DNA in restriction enzyme reactions to prepare nicked DNA." Nucleic Acids Res. 13:8749-8764 (1985). .
Taylor et al. "The rapid generation of oligonucleotide-directed mutations at high frequency using phosphorothioate-modified DNA." Nucleic Acids Res. 13:8765-8787 (1985). .
Virnekas et al. "Trinucleotide phisphoramidites: ideal reagents for the synthesis of mixed oligonucleotides for random mutagenesis." Nucleic Acids Res. 22:5600-5607 (1994). .
Von Der Osten et al. "Protein engineering of subtilisins to improve stability in detergent formulations." J. Biotechnol. 28:55-68 (1993). .
Wells "Importance of hydrogen-bond formation in stabilizing the transition state of subtilisin." Trans. R. Soc. Lond. A317:415-423 (1986). .
Xiang et al. "Targeted gene conversion in a mammalian CD34+-enriched cell population using a chimeric RNA/DNA oligonucleotide." J. Mol. Med. 75:829-835 (1997). .
Yoon et al. "Targeted gene correction of episomal DNA in mammalian cells mediated by a chimeric RNA-DNA oligonucleotide." PNAS 93:2071-2076 (1996). .
Youvan "Imaging Sequence Space." Nature. May 5, 1994;369(6475):79-80. .
Zang et al. "Directed evolution of a fucosidase from a galactosidase by DNA shuffling and screening." PNAS USA 94:4504-4509 (1997). .
Zoller "New recombinant DNA methodology for protein engineering." Curr. Opinion in Biotechnology 3:348-354 (1992). .
Sun "Modeling DNA Shuffline." Pro. @nd. Inter. Conf Computational Molecular Biol. 251-257 (1998). .
Dewey et al. (1998) "Non-equilibrum Thermodynamics of Molecular Evolution." J. Theor Biol.193:593-599. .
Diatchenko et al. (1996) "Suppression subtractive hybridization: A method for generating different regulated or tissue-specific cDNA probes and libraries." PNAS USA 93:6025-6030. .
Morchio et al. (1997) "Simulation of Protein Evolution: Evidence for a Non-linear Aminoacidic Substitution Rate." Riv Biol 83-94. .
Shao et al. (1998) "Random-priming in vitro recombination: an effective tool for directed evolution." Nucleic Acids Research 26(2):681-683. .
Singapore Examination Report dated Jul. 25, 2001 for SG 0004677-1 Filed Jan. 18, 2000. .
Singapore Examination Report dated Jul. 27, 2001 for SG 0004561-7 Filed Jan. 18, 2000. .
Chamberlain et al. (1990) "Multiplex PCR for the Diagnosis of Duchenne Muscular Dystrophy." PCR Protocols 272-281. .
Hayashi et al. (1994) "Simultaneous Mutagenesis of Antibody CDR Regions by Overlap Extension and PCR." BioTechniques 17(2):310-315. .
Jirholt et al. (1998) "Exploiting sequence space: shuffling in vivo formed complementarity determining regions into a master framework." Gene 215:471-476. .
Kobayashi et al (1997) "Analysis of Assembly of Synthetic Antibody Fragments: Expression of Functional scFv with Predefined Specificity." BioTechniques 23(3):500-503. .
Prodromou et al (1992) "Recursive PCR: a novel technique for total gene synthesis." Protein Engineering 5(8):827-829. .
Soderlind et al.(1999) "Complementarity-determining region (CDR) implantatio: a theme or recombination." Immunothnology 4:279-285. .
Soderlind et al.((2000) "Recombining germlineOderived CDR sequences for creating diverse single-framework antibody libraries." Nature Biothecnology 18:852-856. .
Soderlind et al. (1995) "Domain libraries: Synthetic diversity for de novo design of antibody V-regions." Gene 160:269-272. .
Aita et al., "Analysis of Local Fitness Landscape with a Model of the Rough Mt. Fuji-Type Landscape: Application to Prolyl Endopeptidase and Thermolysin," Biopolymers. vol. 54, pp. 64-79, Accepted Jan. 14, 2000. .
Hellberg et al., "The Prediction of Bradykinin Potentiating Potency of Pentapeptides. An Example of a Peptide Quantitative Structure-Activity Relationship," Acia Chemica Scandinaviea B 40, pp. 135-140, 1988. .
Bucht et al., "Optimising the Signal Peptide for Glycosyl Phosphatidylinositol Modification of Human Acetylcholenesterase Using Mutational Analysis and Peptide-Quantitative Structure-Activity Relationships," Biochimica et Biophysica Acta 1431, pp. 471-482, 1999. .
Sandberg et al., "Engineering Multiple Properties of a Protein by Combinatorial Mutagenesis," Proc. Natl. Acad. Sci. USA, vol. 90, pp. 8367-8371, Sep. 1993. .
Wrede et al., "Peptide Design Aided by Neural Networks: Biological Activity of Artificial Signal Peptidase I Cleavage Sites," Biochemistry, 37, pp. 3588-3593, 1998. .
Jill Damborsky, "Quantitative Structure-Function and Structure-Stability Relationship of Purposely Modified Proteins," Protein Engineering, vol. 11, No. 1, pp. 21-30, 1998. .
Hellberg, et al., "Peptide Quantitative Structure--Activity Relationships, a Multivariate Approach," J. Med Chem, 30: pp 1126-1195, 1987. .
Sandberg et al., "New Chemical Descriptors Relevant for the Design of Biologically Active Peptides. A Multivariate Characterization of 87 Amino Acids," J. Med Chem., 41, pp. 2481-2491, 1998. .
Casari et al., "A Method to Predict Functional Residues in Proteins," Nat. Struct Biol., 2, pp. 171-178, 1995. .
Gogos et al., "Assignment of Enzyme Substrate Specificity by Principal Component Analysis of Aligned Protein Sequences: An Experimental Test Using DNA Glycosylase Homologs," Proteins: Structure, Function, and Genetics, 40, pp. 98-105, 2000. .
Suzuki et al., "A Method for Detecting Positive Selection at Single Amino Acid Sites," Mol. Biol. Evol. 16 (10): pp. 1315-1328, 1999. .
Benner et al., "Amino Acid Substitution During Functionally Constrained Divergent Evolution of Protein Sequences," Protein Engineering, vol. 7, No. 11, pp. 1323-1332, 1994. .
Wu et al., "Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families," Proc. Int. Conf. Intell. Syst. Mol. Biol., 4, pp. 230-240, 1996. .
Adenot et al., "Peptides Quantitative Structure-Function Relationships: An Automated Mutation Strategy to Design Peptides and Pseudopeptides from Substitution Matrices," Journal of Molecular Graphics and Modelling, 17, pp. 292-309, 1999. .
Norinder et al., "A Quantitative Structure-Activity Relationship Study of Some Substance P-Related Peptides," J. Peptide Res., 49, pp. 155-162, 1997. .
Sandberg, "Deciphering Sequence Data a Multivariate Approach," Ph.D Thesis, Umea: Umea University, 78 pages, 1997. .
Eriksson et al., "Peptide QSAR on Substance P Analogues, Enkephalins and Bradykinins Containing L-and D-Amino Acids," Acta Chemica Scandinavica, 44, pp. 50-56, 1990. .
Ufkes eta l., "Further Studies on the Structure-Activity Relationships of Bradykinin-Potentiating Peptides," European Journal of Pharmacology, 79, pp. 155-158, 1982. .
Dobrynin et al., "Synthesis of Model Promoter for Gene Expression in Escherichia coli," Symposium Series No. 7, pp. 365-376, 1980. .
Skinner et al., "Potential Use of Additivity of Mutational Effects in Simplifying Protein Engineering," Proc. Natl. Acad. Sci., vol. 93, pp. 10753-10757, 1996. .
Aita et al., "Theory of Evolutionary Molecular Engineering Through Simultaneous Accumulation of Advantageous Mutations," J. Theor. Biol., 207, pp/ 543-556, 2000. .
Lathrop et al., "Global Optimum Protein Threading with Gapped Alignment and Empirical Pair Score Functions," J. Mol. Biol., 255, pp. 641-665, 1996. .
Hellberg et al., "A Multivariate Approach to QSAR," Ph.D. Thesis, Umea, Sweden: University of Umea: 1986. .
Vector NTI Suite 7.0 User's Manual (portion) describing software believed to be available prior to Feb. 1, 2000..~
Primary Examiner: Horlick; Kenneth R.
Assistant Examiner: Kim; Young J.
Attorney, Agent or Firm:Fujita; Sharon M. Weaver; Jeffrey K. Kruse; Norman J.

Parent Case Text



CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of "METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS" by Selifonov and Stemmer, U.S. Ser. No. PCT/US00/01138, Filed Jan. 18, 2000, and is a continuation-in-part of "METHODS OF POPULATING DATA STRUCTURES FOR USE N EVOLUTIONARY SIMULATIONS" by Selifonov and Stemmer, U.S. Ser. No. 09/416,837 (now abandoned), filed Oct. 12, 1999.

This application is also a continuation-in-part of "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed Jan. 18, 2000, U.S. Ser. No. 09/494,282, and is a continuation-in-part of "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., filed Jan. 18, 2000, U.S. Ser. No. PCT/US00/1202; which are continuation-in-part applications of "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov et al., U.S. Ser. No. 09/416,375 (now abandoned), filed Oct. 12, 1999, which is a non provisional of "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov and Stemmer, U.S. Ser. No. 60/116,447, filed Jan. 19, 1999 and a non-provisional of "METHODS FOR MAKING CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED CHARACTERISTICS" by Selifonov and Stemmer, U.S. Ser. No. 60/118,854, filed Feb. 5, 1999.

This application is also a continuation-in-part of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., filed Jan. 18, 2000, U.S. Ser. No. 09/484,850 (now U.S. Pat. No. 6,368,861) and a continuation-in-part of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., filed Jan. 18, 2000, U.S. Ser. No. PCT/US00/01203, which are continuation-in-part applications of"OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., U.S. Ser. No. 09/408,392 (now U.S. Pat. No. 6,376,246), filed Sep. 28, 1999, which is a non-provisional of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., U.S. Ser. No. 60/118,813, filed Feb. 5, 1999 and a non-provisional of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID RECOMBINATION" by Crameri et al., U.S. Ser. No. 60/141,049, filed Jun. 24, 1999.

Claims


What is claimed is:
1. A method of identifying molecules for production, wherein the molecules are represented by concatenated strings, said method comprising: i) encoding two or more biological molecules into a data structure of initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from said initial character strings; iii) concatenating said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adding the product strings to a data structure to populate a data structure of product strings; v) determining sequence identities of at least one of the product strings relative to at least one initial character string; and vi) selecting one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings having greater than 30% sequence identity with the at least one initial character string.

2. The method of claim 1, wherein said encoding comprises encoding two or more nucleic acid sequences into said character strings.

3. The method of claim 2, wherein said two or more nucleic acid sequences comprise a nucleic acid sequence encoding a naturally occurring protein.

4. The method of claim 1, wherein said encoding comprises encoding two or more amino acid sequences into said character strings.

5. The method of claim 4, wherein said two or more amino acid sequences comprise an amino acid sequence encoding a naturally occurring protein.

6. The method of claim 1, wherein said initial character strings have at least 30% sequence identity with each other.

7. The method of claim 1, wherein said selecting in (ii) comprises selecting at least one substring from an initial character string such that the ends of said substring occur in string regions of about 3 to about 20 characters in the initial character string that have higher sequence identity with the corresponding region of another of said initial character strings than the overall sequence identity between the two initial character strings.

8. The method of claim 1, wherein said selecting in (ii) comprises selecting substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters.

9. The method of claim 1, wherein said selecting in (ii) comprises aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.

10. The method of claim 1, wherein said method further comprises randomly altering one or more characters of said initial or product character strings.

11. The method of claim 10, wherein said method further comprises randomly selecting and altering one or more occurrences of a particular preselected character in said initial or product character strings.

12. The method of claim 1, wherein said encoding, selecting, or concatenating is performed on an internet site.

13. The method of claim 1, wherein said encoding, selecting, or concatenating is performed on a server.

14. The method of claim 1, wherein said encoding, selecting, or concatenating is performed on a client linked to a network.

15. The method of claim 1, wherein the initial character strings of (i) are related in that they encode the same gene or protein family but differ in sequence.

16. The method of claim 1, further comprising determining a computationally predicted property for molecules represented by the product strings.

17. The method of claim 1, wherein the molecules represented by the product strings are made in parallel in an array of vessels.

18. The method of claim 1, wherein the molecules represented by the product strings are made by assembly of oligonucleotides.

19. The method of claim 1, wherein the one or more product strings of (vi) have greater than 50% sequence identity with the at least one initial character string.

20. The method of claim 1, wherein the one or more product strings of (vi) have greater than 75% sequence identity with the at least one initial character string.

21. The method of claim 1, wherein the one or more product strings of (vi) have greater than 85% sequence identity with the at least one initial character string.

22. The method of claim 1, wherein the one or more product strings of (vi) have greater than 90% sequence identity with the at least one initial character string.

23. The method of claim 1, wherein the one or more product strings of (vi) have greater than 95% sequence identity with the at least one initial character string.

24. The method of claim 1, wherein adding the product strings to a data structure comprises adding more than one product string to the data structure.

25. The method of claim 1, wherein selecting at least two substrings from said initial character strings comprises random substring selection.

26. The method of claim 1, wherein selecting at least two substrings from said initial character strings comprises uniform substring selection.

27. The method of claim 1, wherein selecting at least two substrings from said initial character strings comprises motif-based selection.

28. The method of claim 1, wherein selecting at least two substrings from said initial character strings comprises alignment-based selection.

29. The method of claim 1, wherein selecting at least two substrings from said initial character strings comprises frequency-biased selection.

30. A computer program product on a computer readable media comprising computer code that: i) encodes two or more biological molecules into initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about ten subunits; ii) selects at least two substrings from said initial character strings; iii) concatenates said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adds the product strings to a data structure to populate a data structure of product strings; v) determines sequence identities of at least one of the product strings relative to at least one initial character string; and vi) selects one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings having greater than 30% sequence identity with the at least one initial character string.

31. The computer program product of claim 30, wherein said two or more biological molecules are nucleic acid sequences.

32. The computer program product of claim 30, wherein said two or more biological molecules are nucleic acid sequences encoding naturally occurring proteins.

33. The computer program product of claim 30, wherein said two or more biological molecules are amino acid sequences.

34. The computer program product of claim 30, wherein said initial character strings have at least 30% sequence identity with each other.

35. The computer program product of claim 30, wherein said computer code selects in (ii) at least one substring from an initial character string such that the ends of said substring occur in string regions of about three to about twenty characters in the initial character string that have higher sequence identity with a corresponding region of another of said initial character strings than the overall sequence identity between the two initial character substrings.

36. The computer program product of claim 30, wherein said computer code selects substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters.

37. The computer program product of claim 30, wherein the computer code selects substrings by aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selecting a character that is a member of an aligned pair for the end of one substring.

38. The computer program product of claim 30, wherein said computer code additionally randomly alters one or more characters of said initial or product character strings.

39. The computer program product of claim 38, wherein said computer code additionally randomly selects and alters one or more occurrences of a particular preselected character in said initial or product character strings.

40. The computer program product of claim 30, wherein said computer code is stored on media selected from the group consisting of magnetic media, optical media, and optomagnetic media.

41. The computer program product of claim 30, wherein said computer code is in dynamic or static memory of a computer.

42. The computer program product of claim 30, wherein the initial character strings of (i) are related in that they encode the same gene or protein family but differ in sequence.

43. The computer program product of claim 30, wherein the code instructs physical screening of the molecule(s) represented by the product strings for one or more desired properties.

44. The computer program product of claim 30, wherein the code instructs determination of a computationally predicted property for molecules represented by the product strings.

45. The computer program product of claim 30, wherein the code tests members of the data structure of product strings for a particular property and determines sequence differences responsible for differences in the particular property using multi-variate analysis.

46. The computer program product of claim 30, wherein the one or more product strings of (vi) having greater than 50% sequence identity with the at least one initial character string.

47. The computer program product of claim 30, wherein the one or more product strings of (vi) having greater than 75% sequence identity with the at least one initial character string.

48. The computer program product of claim 30, wherein the one or more product strings of(vi) having greater than 95% sequence identity with the at least one initial character string.

49. The computer program product of claim 30, wherein the computer code adds the product strings to a data structure by adding more than one product string to the data structure.

50. The computer program product of claim 30, wherein the computer code selects at least two substrings from said initial character strings by a random substring selection.

51. The computer program product of claim 30, wherein the computer code selects at least two substrings from said initial character strings by a uniform substring selection.

52. The computer program product of claim 30, wherein the computer code selects at least two substrings from said initial character strings by a motif-based selection.

53. The computer program product of claim 30, wherein the computer code selects at least two substrings from said initial character strings by an alignment-based selection.

54. The computer program product of claim 30, wherein the computer code selects at least two substrings from said initial character strings by a frequency-biased selection.

55. A method of identifying molecules for production, wherein the molecules are represented by concatenated strings, said method comprising: i) encoding two or more related biological molecules into a data structure of initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from said initial character strings; iii) concatenating said substrings to form one or more product strings; iv) adding the product strings to a data structure to populate a data structure of product strings; and v) determining whether at least one of the product strings have at least a predefined measure of similarity with at least one initial character string; and vi) selecting one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings determined to have greater than the predefined value of sequence identity with at least one initial string.

56. A method of identifying molecules for production, wherein the molecules are represented by concatenated strings, said method comprising: i) encoding two or more biological molecules into a data structure of initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from said initial character strings; iii) concatenating said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adding the product strings to a data structure to populate a data structure of product strings; v) providing an alignment of at least one of the product strings relative to at least one initial character string; and vi) selecting one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings having greater than 30% sequence identity with the at least one initial character string.

57. The method of claim 56, wherein said encoding comprises encoding two or more amino acid sequences into said character strings, and wherein said two or more amino acid sequences comprise an amino acid sequence encoding a naturally occurring protein.

58. The method of claim 56, wherein said initial character strings have at least 30% sequence identity with each other.

59. The method of claim 56, wherein said selecting in (ii) comprises selecting at least one substring from an initial character string such that the ends of said substring occur in string regions of about 3 to about 20 characters in the initial character string that have higher sequence identity with the corresponding region of another of said initial character strings than the overall sequence identity between the two initial character strings.

60. The method of claim 56, wherein said selecting in (ii) comprises selecting substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters.

61. The method of claim 56, wherein said selecting in (ii) comprises aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.

62. The method of claim 56, wherein said method further comprises randomly altering one or more characters of said initial or product character strings.

63. The method of claim 56, wherein the one or more product strings of (vi) having greater than 50% sequence identity with the at least one initial character string.

64. The method of claim 56, wherein the one or more product strings of (vi) having greater than 75% sequence identity with the at least one initial character string.

65. The method of claim 56, wherein the one or more product strings of (vi) having greater than 85% sequence identity with the at least one initial character string.

66. The method of claim 56, wherein the one or more product strings of (vi) having greater than 90% sequence identity with the at least one initial character string.

67. The method of claim 56, wherein the one or more product strings of(vi) having greater than 95% sequence identity with the at least one initial character string.

68. The method of claim 56, wherein adding the product strings to a data structure comprises adding more than one product string to the data structure.

69. The method of claim 56, wherein selecting at least two substrings from said initial character strings comprises random substring selection.

70. The method of claim 56, wherein selecting at least two substrings from said initial character strings comprises uniform substring selection.

71. The method of claim 56, wherein selecting at least two substrings from said initial character strings comprises motif-based selection.

72. The method of claim 56, wherein selecting at least two substrings from said initial character strings comprises alignment-based selection.

73. The method of claim 56, wherein selecting at least two substrings from said initial character strings comprises frequency-biased selection.

74. A computer program product on a computer readable media comprising computer code that: i) encodes two or more biological molecules into initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about ten subunits; ii) selects at least two substrings from said initial character strings; iii) concatenates said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adds the product strings to a data structure to populate a data structure of product strings; v) provides an alignment of at least one of the product strings relative to at least one initial character string; and vi) selects one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings having greater than 30% sequence identity with the at least one initial character string.

75. The computer program product of claim 74, wherein said computer code encodes two or more amino acid sequences into said character strings, and wherein said two or more amino acid sequences comprise an amino acid sequence encoding a naturally occurring protein.

76. The computer program product of claim 74, wherein said initial character strings have at least 30% sequence identity with each other.

77. The computer program product of claim 74, wherein said computer code selects in (ii) at least one substring from an initial character string such that the ends of said substring occur in string regions of about three to about twenty characters in the initial character string that have higher sequence identity with a corresponding region of another of said initial character strings than the overall sequence identity between the two initial character substrings.

78. The computer program product of claim 74, wherein said computer code selects in (ii) by selecting substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters.

79. The computer program product of claim 74, wherein said computer code selects in (ii) by aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.

80. The computer program product of claim 74, wherein said computer code further randomly alters one or more characters of said initial or product character strings.

81. The computer program product of claim 74, wherein the one or more product strings of (vi) having greater than 50% sequence identity with the at least one initial character string.

82. The computer program product of claim 74, wherein the one or more product strings of (vi) having greater than 75% sequence identity with the at least one initial character string.

83. The computer program product of claim 74, wherein the one or more product strings of (vi) having greater than 85% sequence identity with the at least one initial character string.

84. The computer program product of claim 74, wherein the one or more product strings of (vi) having greater than 90% sequence identity with the at least one initial character string.

85. The computer program product of claim 74, wherein the one or more product strings of (vi) having greater than 95% sequence identity with the at least one initial character string.

86. The computer program product of claim 74, wherein the computer code adds the product strings to a data structure by adding more than one product string to the data structure.

87. The computer program product of claim 74, wherein the computer code selects at least two substrings from said initial character strings by a random substring selection.

88. The computer program product of claim 74, wherein the computer code selects at least two substrings from said initial character strings by a uniform substring selection.

89. The computer program product of claim 74, wherein the computer code selects at least two substrings from said initial character strings by a motif-based selection.

90. The computer program product of claim 74, wherein the computer code selects at least two substrings from said initial character strings by an alignment-based selection.

91. The computer program product of claim 74, wherein the computer code selects at least two substrings from said initial character strings by a frequency-biased selection.

92. A method of identifying molecules for production, wherein the molecules are represented by concatenated strings, said method comprising: i) encoding two or more naturally occurring biological molecules into a data structure of initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from said initial character strings; iii) concatenating said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adding the product strings to a data structure to populate a data structure of product strings; and v) selecting one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings having greater than 30% sequence identity with the at least one initial character string.

93. The method of claim 92, wherein said encoding comprises encoding two or more nucleic acid sequences into said character strings.

94. The method of claim 92, wherein said encoding comprises encoding two or more amino acid sequences into said character strings, and wherein said two or more amino acid sequences comprise an amino acid sequence encoding a naturally occurring protein.

95. The method of claim 92, wherein said initial character strings have at least 30% sequence identity with each other.

96. The method of claim 92, wherein said selecting in (ii) comprises selecting at least one substring from an initial character string such that the ends of said substring occur in string regions of about 3 to about 20 characters in the initial character string that have higher sequence identity with the corresponding region of another of said initial character strings than the overall sequence identity between the two initial character strings.

97. The method of claim 92, wherein said selecting in (ii) comprises selecting substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters.

98. The method of claim 92, wherein said selecting in (ii) comprises aligning two or more of said initial character strings to maximize pairwise identity between two or more sub strings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.

99. The method of claim 92, wherein said method further comprises randomly altering one or more characters of said initial or product character strings.
100. The method of claim 92, wherein the one or more product strings of (v) having greater than 50% sequence identity with the at least one initial character string.
101. The method of claim 92, wherein the one or more product strings of (v) having greater than 75% sequence identity with the at least one initial character string.
102. The method of claim 92, wherein the one or more product strings of (v) having greater than 85% sequence identity with the at least one initial character string.
103. The method of claim 92, wherein the one or more product strings of (v) having greater than 90% sequence identity with the at least one initial character string.
104. The method of claim 92, wherein the one or more product strings of (v) having greater than 95% sequence identity with the at least one initial character string.
105. The method of claim 92, wherein adding the product strings to a data structure comprises adding more than one product string to the data structure.
106. The method of claim 92, wherein selecting at least two substrings from said initial character strings comprises random substring selection.
107. The method of claim 92, wherein selecting at least two substrings from said initial character strings comprises uniform substring selection.
108. The method of claim 92, wherein selecting at least two substrings from said initial character strings comprises motif-based selection.
109. The method of claim 92, wherein selecting at least two substrings from said initial character strings comprises alignment-based selection.
110. The method of claim 92, wherein selecting at least two substrings from said initial character strings comprises frequency-biased selection.
111. A computer program product on a computer readable media comprising computer code that: i) encodes two or more naturally occurring biological molecules into initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about ten subunits; ii) selects at least two substrings from said initial character strings; iii) concatenates said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adds the product strings to a data structure to populate a data structure of product strings; and v) selects one or more product biological molecules for production, wherein the one or more product biological molecules correspond to one or more of the product strings having greater than 30% sequence identity with the at least one initial character string.
112. The computer program product of claim 111, wherein said computer code encodes by encoding two or more nucleic acid sequences into said character strings.
113. The computer program product of claim 111, wherein said computer code encodes two or more amino acid sequences into said character strings, and wherein said two or more amino acid sequences comprise an amino acid sequence encoding a naturally occurring protein.
114. The computer program product of claim 111, wherein said initial character strings have at least 30% sequence identity with each other.
115. The computer program product of claim 111, wherein said computer code selects in (ii) at least one substring from an initial character string such that the ends of said substring occur in string regions of about three to about twenty characters in the initial character string that have higher sequence identity with a corresponding region of another of said initial character strings than the overall sequence identity between the two initial character substrings.
116. The computer program product of claim 111, wherein said computer code selects in (ii) by selecting substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters.
117. The computer program product of claim 111, wherein said computer code selects in (ii) by aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.
118. The computer program product of claim 111, wherein said computer code further randomly alters one or more characters of said initial or product character strings.
119. The computer program product of claim 111, wherein the one or more product strings of (v) having greater than 50% sequence identity with the at least one initial character string.
120. The computer program product of claim 111, wherein the one or more product strings of (v) having greater than 75% sequence identity with the at least one initial character string.
121. The computer program product of claim 111, wherein the one or more product strings of (v) having greater than 85% sequence identity with the at least one initial character string.
122. The computer program product of claim 111, wherein the one or more product strings of (v) having greater than 90% sequence identity with the at least one initial character string.
123. The computer program product of claim 111, wherein the one or more product strings of (v) having greater than 95% sequence identity with the at least one initial character string.
124. The computer program product of claim 111, wherein the computer code adds the product strings to a data structure by adding more than one product string to the data structure.
125. The computer program product of claim 111, wherein the computer code selects at least two substrings from said initial character strings by a random substring selection.
126. The computer program product of claim 111, wherein the computer code selects at least two substrings from said initial character strings by a uniform substring selection.
127. The computer program product of claim 111, wherein the computer code selects at least two substrings from said initial character strings by a motif-based selection.
128. The computer program product of claim 111, wherein the computer code selects at least two substrings from said initial character strings by an alignment-based selection.
129. The computer program product of claim 111, wherein the computer code selects at least two substrings from said initial character strings by a frequency-biased selection.
130. A method of identifying molecules for production, wherein the molecules are represented by concatenated strings, said method comprising: i) encoding two or more biological molecules into a data structure of initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from said initial character strings; iii) concatenating said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adding the product strings to a data structure to populate a data structure of product strings; v) obtaining one or more computationally predicted properties for at least one of the product strings in the data structure; and vi) selecting one or more product biological molecules for production on the basis of the one or more computationally predicted properties.
131. The method of claim 130, wherein the computationally predicted properties comprise one or more of a maximum or minimum molecular weight, a maximum or minimum free energy, a maximum or minimum contact surface with a target molecule or surface, a specified net charge, a predicted pK, a predicted pI, a binding avidity, secondary form, and tertiary form.
132. The method of claim 130, wherein said encoding comprises encoding two or more amino acid sequences into said character strings.
133. The method of claim 130, wherein said selecting in (ii) comprises aligning two or more of said initial character strings to maximize pairwise identity between two or more sub strings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.
134. The method of claim 130, wherein said method further comprises randomly altering one or more characters of said initial or product character strings.
135. The method of claim 130, wherein the one or more product biological molecules of (vi) having greater than 50% sequence identity with the at least one initial character string.
136. The method of claim 130, wherein the one or more product biological molecules of (vi) having greater than 75% sequence identity with the at least one initial character string.
137. The method of claim 130, wherein the one or more product biological molecules of (vi) having greater than 90% sequence identity with the at least one initial character string.
138. The method of claim 130, wherein adding the product strings to a data structure comprises adding more than one product string to the data structure.
139. The method of claim 130, wherein selecting at least two substrings from said initial character strings comprises random substring selection.
140. The method of claim 130, wherein selecting at least two substrings from said initial character strings comprises uniform substring selection.
141. The method of claim 130, wherein selecting at least two substrings from said initial character strings comprises motif-based selection.
142. The method of claim 130, wherein selecting at least two substrings from said initial character strings comprises alignment-based selection.
143. The method of claim 130, wherein selecting at least two substrings from said initial character strings comprises frequency-biased selection.
144. A computer program product on a computer readable media comprising computer code that: i) encodes two or more biological molecules into initial character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about ten subunits; ii) selects at least two substrings from said initial character strings; iii) concatenates said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adds the product strings to a data structure to populate a data structure of product strings; v) obtains one or more computationally predicted properties for at least one of the product strings in the data structure; and vi) selects one or more product biological molecules for production on the basis of the one or more computationally predicted properties.
145. The computer program product of claim 144, wherein the computationally predicted properties comprise one or more of a maximum or minimum molecular weight, a maximum or minimum free energy, a maximum or minimum contact surface with a target molecule or surface, a specified net charge, a predicted pK, a predicted pI, a binding avidity, secondary form, and tertiary form.
146. The computer program product of claim 144, wherein the computer code encodes in (i) by encoding two or more amino acid sequences into said character strings.
147. The computer program product of claim 144, wherein the computer code selects in (ii) by aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the initial character strings, and selecting a character that is a member of an aligned pair for the end of one of the two or more substrings.
148. The computer program product of claim 144, wherein the computer code further randomly alters one or more characters of said initial or product character strings.
149. The computer program product of claim 144, wherein the one or more product biological molecules of (vi) having greater than 50% sequence identity with the at least one initial character string.
150. The computer program product of claim 144, wherein the one or more product biological molecules of (vi) having greater than 75% sequence identity with the at least one initial character string.
151. The computer program product of claim 144, wherein the one or more product biological molecules of (vi) having greater than 90% sequence identity with the at least one initial character string.
152. The computer program product of claim 144, wherein the computer code adds the product strings to a data structure by adding more than one product string to the data structure.
153. The computer program product of claim 144, wherein the computer code selects at least two substrings from said initial character strings by a random substring selection.
154. The computer program product of claim 144, wherein the computer code selects at least two substrings from said initial character strings by a uniform substring selection.
155. The computer program product of claim 144, wherein the computer code selects at least two substrings from said initial character strings by a motif-based selection.
156. The computer program product of claim 144, wherein the computer code selects at least two substrings from said initial character strings by an alignment-based selection.
157. The computer program product of claim 144, wherein the computer code selects at least two substrings from said initial character strings by a frequency-biased selection.

Description

COPYRIGHT STATEMENT

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyrightrights whatsoever.

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT

Not Applicable

FIELD OF THE INVENTION

This invention relates to the field of computer modeling and simulations. In particular, this invention provides novel methods of populating data structures for use in evolutionary modeling.

BACKGROUND OF THE INVENTION

There is an extensive history of the use of computers to simulate and/or investigate the evolution of life, of individual genetic systems and/or population genetic/phenotypic systems. The motor propelling most artificial life (Alife) simulations is an algorithm which allows artificial creatures to evolve and/or adapt to their environment. The fundamental algorithms fall into two dominant categories: learning algorithms (e.g., algorithms typified by neural networks) and evolutionary algorithms, typified, for example, by genetic algorithms.

Many artificial life researchers, especially those concerned with higher-order processes such as learning and adaptation, endow their organisms with a neural net which serves as an artificial brain (see, e.g., Touretzky (1088-1991). Neural Information Processing Systems, volume 1-4. Morgan Kaufmann, 1988-1991. Neural networks are learning algorithms. They may be trained e.g. to classify images into categories. A typical task is to recognize to which letter a given hand-written character corresponds.

A neural net is composed of a collection of input-output devices, called neurons, which are organized in a (highly connected) network. Normally the network is organized into layers: an input layer which receives sensory input, any number of so-called hidden layers which perform the actual computations, and an output layer which reports the results of these computations. Training a neural network involves adjusting the strengths of the connections between the neurons in the net.

The other major type of biologically inspired fundamental algorithms are the evolutionary algorithms. While learning processes (e.g., neural networks) are metaphorically based on learning processes in individual organisms, evolutionary algorithms are inspired by evolutionary change in populations of individuals. Relative to neural nets, evolutionary algorithms have only recently gained wide acceptance in academic and industrial circles.

Evolutionary algorithms are generally iterative. An iteration is typically referred to as a "generation". The basic evolutionary algorithm traditionally begins with a population of randomly chosen individuals. In each generation, the individuals "compete" among themselves to solve a posed problem. Individuals which perform relatively well are more likely to "survive" to the next generation. Those surviving to the next generation may be subject to a small, random modifications. If the algorithm is correctly set up, and the problem is indeed one subject to solution in this manner, then as the iteration proceeds the population will contain solutions of increasing quality.

The most popular evolutionary algorithm is the genetic algorithm of J. Holland (J. H. Holland (1992) Adaptation in Natural and Artificial Systems. University of Michigan Press 1975, Reprinted by MIT Press.). The genetic algorithm is widely used in practical contexts (e.g., financial forecasting, management science, etc.). It is particularly well-adapted to multivariate problems whose solution space is discontinuous ("rugged") and poorly understood. To apply the genetic algorithm, one defines
1) a mapping from the set of parameter values into the set of (0-1) bit strings (e.g. character strings), and 2) a mapping from bit strings into the reals, the so-called fitness function.

In most evolutionary algorithms, a set of randomly-chosen bit strings constitutes the initial population. In the basic genetic algorithm, a cycle is repeated during which: the fitness of each individual in the population is evaluated; copies of individuals are made in proportion to their fitness; and the cycle is repeated. The typical starting point for such evolutionary algorithms is a set of randomly chosen bit strings. The use of an "arbitrary", random or haphazard starting population can strongly bias the evolutionary algorithm away from an efficient, accurate or concise solution to the problem at hand, particularly where the algorithm is used to model or analyze a biological history or process. Indeed, the only "force" driving the evolutionary algorithm to any solution whatsoever is a fitness determination and associated selection pressure. While a solution may eventually be reached, because the process starts from a random (e.g. arbitrary) initial state in which the population members bear no relationship to each other, the population dynamics as the algorithm proceeds reveals little or no information reflecting the dynamics of the simulated system.

In addition, evolutionary algorithms are typically relatively high order simulations and provide population level information. Specific genetic information, if it is present at all, typically exists as an abstract representation of an allele (typically as a single character) or allele frequency. Consequently evolutionary algorithms provide little or no information regarding events on a molecular level.

Similarly, neural nets and/or cellular automata, take as their starting point, essentially artificial constructs and utilize internal rules (algorithms) to approximate biological processes. As a consequence such models generally mimic processes or metaprocesses, but again afford little or no information or insight regarding events at the molecular level.

SUMMARY OF THE INVENTION

This invention provides novel methods of generating "initial" populations suitable for further computational manipulation, e.g. via genetic/evolutionary algorithms. The members of populations generated by the methods of this invention possess varying degrees of "relatedness" or "similarity" to each other reflective of the degrees of covariance found in naturally occurring populations. In addition, unlike the populations used as input in typical evolutionary algorithms, the populations generated by the methods provided herein typically contain detailed information about individual members and the information is typically of sufficient complexity to provide a "continuous" (rather than binary) measure of intermember variability and/or relatedness. Indeed the methods of this invention provide detailed coding of molecular information in the individuals comprising the populations created according to the methods of this invention.

Thus, in one embodiment, this invention provides methods of populating a data structure with (e.g. generating a collection or library of) character strings. The method preferably involve i) encoding two or more a biological molecules into character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about 10 subunits; ii) selecting at least two substrings from said character strings; iii) concatenating said substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adding the product strings to a collection of strings (a datastructure); and v) optionally repeating steps (i) or (ii) through (iv) using one or more of said product strings as an initial string in the collection of initial character strings. In particularly preferred embodiments, the "encoding" comprises encoding one or more nucleic acid sequences and/or one or more amino acid sequences into the character strings. The nucleic acid and/or amino acid sequences can be unknown and/or haphazardly selected, but preferably encode known protein(s). In one preferred embodiment, biological molecules are selected such that they have at least about 30%, preferably at least about 50%, more preferably at least about 75%, and most preferably at least about 85%, 90%, or even 95% sequence identity with each other.

In one embodiment, the substring(s) are selected such that the ends of the substrings occur in character string regions of about 3 to about 300, preferably about 6 to about 20, more preferably about 10 to about 100 and most preferably about 20 to about 50 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings. In another embodiment, the selecting can involve selecting substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 100, preferably from about 4 to about 50, even more preferably from about 4 to about 10, still more preferably from about 6 to about 30 and most preferably from about 6 to about 20 characters.

In one embodiment, the selecting and concatenating can compromise concatenating substrings from two different initial strings such that the concatenation occurs in a region of about three to about twenty characters having higher sequence identity between two different initial strings than the overall sequence identity between the two different initial strings. The selecting can also comprise aligning two or more of said initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selecting a character that is a member of an aligned pair for the end of one substring.

In certain embodiments, the "adding" step involves calculating the theoretical PI, PK, molecular weight, hydrophobicity, secondary structure and/or other properties of a protein encoded by the character string. In one preferred embodiment, the product strings are added to the collection (datastructure) only if they have greater than 30%, preferably greater than 50%, more preferably greater than 75% or 85% sequence identity with the initial strings.

The method can further involve randomly altering one or more characters of the character strings. This can be accomplished according to a number of methods including, but not limited to introducing a random string into the initial string collection and/or utilizing a stochastic operator as described herein. In a particularly preferred embodiment, the operations described above are performed in a computer.

In another embodiment, this invention provides a computer program product comprising computer code that i) encodes two or more a biological molecules into character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises at least about ten subunits; ii) selects at least two substrings from the character strings; iii) concatenates the substrings to form one or more product strings about the same length as one or more of the initial character strings; iv) adds the product strings to a collection of strings (i.e., populates a datastructure); and v) optionally repeats steps (i) or (ii) through (iv) using one or more of the product strings as an initial string in the collection of initial character strings. In other words, the computer program product comprising computer code that performs the operations described herein. The program code can be provided in compiled form, as source code, as object code, as an executable, etc. The program can be provided on any convenient medium, e.g., magnetic media, optical media, electronic media, optomagnetic media, etc. The code can also be present on a computer, e.g. in memory (dynamic or static memory) on a hard drive, etc.

In another embodiment, this invention provides a system for generating labels (tags) and/or music derived from the sequences of biological molecules. The system comprises an encoder for encoding two or more initial strings from biological molecules (e.g. nucleic acid and/or proteins); an isolator for identifying and selecting substrings from the two or more strings; a concatenator for concatenating the substrings; a data structure for storing the concatenated substrings as a collection of strings; a comparator for measuring the number and/or variability of the collection of strings and determining that sufficient strings exist in the collection of strings; and a command writer for writing the collection of strings into a raw string file. In a preferred embodiment, the isolator comprises a comparator for aligning and determining regions of identity between two or more initial strings. Similarly the comparator may comprise a means for calculating sequence identity and the isolator and comparator may optionally share this means. In preferred embodiments, the isolator selects substrings such that the ends of said substrings occur in string regions of about three to about 100 characters that have higher sequence identity with the corresponding region of another of the initial character strings than the overall sequence identity between the same two strings.

In another embodiment, the isolator selects substrings such that the ends of said substrings occur in predefined motifs of about 4 to about 100, preferably from about 4 to about 50, even more preferably from about 4 to about 10, still more preferably from about 6 to about 30 and most preferably from about 6 to about 20 characters. In one embodiment, the isolator and concatenator individually or in combination concatenate substrings from two different initial strings such that the concatenation occurs in a region of about 3 to about 300, more preferably about 5 to about 200, most preferably from about 10 to about 100 characters having higher sequence identity between said two different initial strings than the overall sequence identity between said two different initial strings. In one preferred implementation, the isolator aligns two or more of the initial character strings to maximize pairwise identity between two or more substrings of the character strings, and selects a character that is a member of an aligned pair for the end of one substring.

The comparator can impose any of a wide variety of selection criteria. Thus, in various embodiments, the comparator can calculate theoretical PI, PK, molecular weight, hydrophobicity, secondary structure and/or other properties of an encoded protein. In one preferred embodiment, the comparator adds strings to the data structure only if they have greater than 30% identity with the initial strings.

The system can, optionally, comprising an operator that randomly alters one or more characters of the character strings. In certain embodiments, such an operator can randomly select and alter one or more occurrences of a particular preselected character in said character strings. Preferred datastructures in this system stores encoded (or deconvolved) nucleic acid sequences and/or encoded or deconvolved amino acid sequences.

A further understanding of the invention can be had from the detailed discussion of specific embodiments below. For purposes of clarity, this discussion refers to devices, methods, and concepts in terms of specific examples. However, the method of the present invention may operate within a variety of types of logical devices. It is therefore intended that the invention not be limited except as provided in the attached claims (as interpreted under the doctrine of equivalents).

Furthermore, it is recognized that logic systems can include a wide variety of different components and different functions in a modular fashion. Different embodiments of a system can include different mixtures of elements and functions and may group various functions as parts of various elements. For purposes of clarity, the invention is described in terms of systems that include many different innovative components and innovative combinations of components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification.

Definitions

The terms "character string" "word", "binary string" or "encoded string" represent any entity capable of storing sequence information (e.g. the subunit structure of a biological molecule such as the nucleotide sequence of a nucleic acid, the amino acid sequence of a protein, the sugar sequence of a polysaccharide, etc.). In one embodiment, the character string can be a simple sequence of characters (letters, numbers, or other symbols) or it can be numeric representation of such information in tangible or intangible (e.g. electronic, magnetic, etc.) form. The character string need not be "linear", but can also exist in a number of other forms, e.g. a linked list, etc.

A "character" when used in reference to a character of a character string refers to a subunit of the string. In a preferred embodiment, the character of a character string encodes one subunit of the encoded biological molecule. Thus, for example, in a preferred embodiment, where the encoded biological molecule is a protein, a character of the string encodes a single amino acid.

A "motif" refers to a pattern of subunits comprising a biological molecule. The motif can refer to a subunit pattern of the unencoded biological molecule or to a subunit pattern of an encoded representation of a biological molecule.

The term substring refers to a string that is found within another string. The substring can include the full length "parent" string, but typically, the substring represents a substring of the full-length string.

The term "data structure" refers to the organization and optionally associated device for the storage of information, typically multiple "pieces" of information. The data structure can be a simple recordation of the information (e.g., a list) or the data structure can contain additional information (e.g., annotations) regarding the information contained therein, can establish relationships between the various "members" (information "pieces") of the data structure, and can provide pointers or linked to resources external to the data structure. The data structure can be intangible but is rendered tangible when stored/represented in tangible medium. The data structure can represent various information architectures including, but not limited to simple lists, linked lists, indexed lists, data tables, indexes, hash indices, flat file databases, relational databases, local databases, distributed databases, thin client databases, and the like. In preferred embodiments, the data structure provides fields sufficient for the storage of one or more character strings. The data structure is preferably organized to permit alignment of the character strings and, optionally, to store information regarding the alignment and/or string similarities and/or string differences. In one embodiment this information is in the form of alignment "scores" (e.g., similarity indices) and/or alignment maps showing individual subunit (e.g., nucleotide in the case of nucleic acid) alignments. The term "encoded character string" refers to a representation of a biological molecule that preserves desired sequence/structural information regarding that molecule.

Similarity, when used herein can refer to a similarity measurement between the encoded representation(s) of a molecule (e.g., the initial character strings) or between the molecules represented by the encoded character strings.

When referring to operations on strings (e.g. insertions, deletions, transformations, etc.) it will be appreciated that the operation can be performed on the encoded representation of a biological molecule or on the "molecule" prior to encoding so that the encoded representation captures the operation.

The term "subunit" when used in reference to a biological molecule refers to the characteristic "monomer" of which a biological is composed. Thus, for example, the subunit of a nucleic acid is a nucleotide, the subunit of a polypeptide is an amino acid, the subunit of a polysaccharide is a sugar, etc.

The terms "pool" or "collection" are used interchangeably when used to refer to strings.

A "biological molecule" refers to a molecule typically found in a biological organism. Preferred biological molecules include biological macromolecules that are typically polymeric in nature being composed of multiple subunits. Typical biological molecules include, but are not limited to nucleic acids (formed of nucleotide subunits) proteins (formed of amino acid subunits), polysaccharides (formed of sugar subunits), etc.

The phrase "encoding a biological molecule" refers to the generation of a representation of that biological molecule that preferably contains and can therefore be used to recreate the information content of the original biological molecule.

The term "nucleic acid" refers to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise limited, encompasses known analogs of natural nucleotides that can function in a similar manner as naturally occurring nucleotides.

A "nucleic acid sequence" refers to the order and identity of the nucleotides comprising a nucleic acid.

The terms "polypeptide", "peptide" and "protein" are used interchangeably herein to refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which one or more amino acid residue is an artificial chemical analogue of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers.

A "polypeptide sequence" refers to the order and identity of the amino acids comprising a polypeptide.

The phrase "adding the product strings to a collection of strings" as used herein does not require a mathematical addition. Rather it refers to a process of identifying one or more strings as included within a set of strings. This can be accomplished by a variety of means including, but not limited to copying or moving the string(s) in question into a data structure that is a collection of strings, setting or providing a pointer from the string to a data structure that represents a collection of strings, setting a flag associated with the string indicating its inclusion in a particular set, or simply designating a rule that the string(s) so produced are included in the collection.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a flow chart depicting one embodiment of the methods of this invention.

FIG. 2 illustrates a selection and concatenation of subsequences according to the method(s) of this invention.

FIG. 3 illustrates a selection and concatenation of subsequences according to the method(s) of this invention where the concatenation utilizes an alignment algorithm to fix the order of substrings.

FIG. 4 illustrates a representational digital device 700 according to the present invention.

FIG. 5 is a chart and relational tree showing percent similarity for different subtilisins (an exemplar set of initial character strings).

FIG. 6 is a pairwise dot-plot alignment showing homology areas for different subtilisins.

FIG. 7 is a pairwise dot-plot alignment showing homology areas for seven different parental subtilisins.

DETAILED DESCRIPTION

I. Generating Populations of Character Strings.

This invention provides novel computational methods to generate representations of actual or theoretical populations of entities suitable for use as initial (or mature/processed) populations in evolutionary models more preferably in evolutionary models typified by genetic algorithms. When initialized to reflect features of particular biological organisms, the entities generated by the methods of this invention each contain significant information regarding underlying molecular biology (e.g. representative amino acid or nucleic acid sequence(s)) and thereby permit models based on genetic or other algorithms to provide unprecedented level so information regarding evolutionary processes at the molecular level.

In particularly preferred embodiments, the methods of this invention generate populations of character strings where each character string represents one or more biological molecules. Using only a few strings as "seeds" the methods generate large populations of strings bearing an "evolutionary" relationship to the initial seed members. In contrast to traditional genetic algorithms in which initial member sets are arbitrary, random/haphazard, or selected for mathematical or representational convenience, the populations generated by the methods of this invention are, in preferred embodiments, derived from known existing biological "precursors" (e.g., particular nucleic acid sequences and/or polypeptide sequences).

In a preferred embodiment, the methods of this invention involve:

1) Identifying/selecting two or more biological molecules;

2) Encoding the biological molecules into character strings;

3) Selecting at least two substrings from the character strings;

4) Concatenating said substrings to form one or more product strings about the same length as one or more of the initial character strings;

5) Adding the product strings to a collection of strings which can be the set of initial strings or a separate set; and

6) Optionally introducing additional variation into the a resulting string set;

7) Optionally adding selection pressure to the resulting string set.

8) Optionally repeating steps (2) or (3) through (7) using one or more of the product strings as an initial string in the collection of initial character strings.

Each of these operations is described in more detail below.

II. Encoding One or More Biological Molecules into Character Strings.

The methods of this invention typically utilize one or more "seed" members. The "seed" members are preferably representations of one or more biological molecules. Thus, the initial steps of preferred embodiments of this invention involve selecting two or more biological molecules and encoding the biological molecules into one or more character strings.

A Identifying/Selecting "Seed/Initial" Biological Molecule(s).

Virtually any biological molecule can be used in the methods of this invention. However, preferred biological molecules are "polymeric" biological macromolecules comprising a multiplicity of "subunits". Biological macromolecules particularly well suited to the methods of this invention include, but are not limited to nucleic acids (e.g. DNA, RNA, etc.), proteins, glycoproteins, carbohydrates, polysaccharides, certain fatty acids, and the like.

When nucleic acids are selected, the nucleic acid can be single stranded or double stranded, although it will be appreciated that a single strand is sufficient to represent/encode a double stranded nucleic acid. The nucleic acids are preferably known nucleic acids. Such nucleic acid sequences can be readily determined from a number of sources including, but not limited to public databases (e.g., GenBank), proprietary databases (e.g. Incyte databases), scientific publications, commercial or private sequencing laboratories, in-house sequencing laboratories, etc.

The nucleic acids can include genomic nucleic acids, cDNAs, mRNAs, artificial sequences, natural sequences having modified nucleotides, and the like.

In one preferred embodiment, the two or more biological molecules are "related", but not identical. Thus, the nucleic acids may represent the same gene or genes but differ in the strain, species, genus, family, order, phylum or kingdom from which they are derived. Similarly, in one embodiment, the protein, polysaccharide, or other molecule(s) are the same protein, polysaccharide, or other molecule(s) with differences between the molecules resulting from the fact that they are selected from different strains, species, genus, families, orders, phyla or kingdoms.

The biological molecules can represent a single gene product (e.g. an MRNA, a cDNA, a protein, etc.) or they can represent a collection of gene products and/or non-coding nucleic acids. In certain preferred embodiments, the biological molecules will represent members of one or more particular metabolic pathways (e.g. regulatory, signaling or synthetic pathways). Thus, for example, the biological molecules can include members comprising an entire operon, or a complete biosynthetic pathway (e.g., the lac operon, Protein: B-DNA gal operon, the colicin A operon, the lux operon, polyketide synthesis pathways, etc.).

In certain preferred embodiments, the biological molecules can include any number of different, genes, proteins, etc. Thus, in certain embodiments, the biological molecules could include the total nucleic acid (e.g. genomic DNA, cDNA, or MRNA) or total protein, or total lipid, etc., of an individual, or multiple individuals of the same or different species.

In certain embodiments, the biological molecules can reflect a "representation" of the total population of that species' molecules. High order representation of populations of molecules is accomplished in the laboratory and, according to the methods of this invention can be performed in silico. Methods of representing complex molecules or populations of molecules are seen in Representational Difference Analysis (RDA) and related techniques (see, e.g., Lisitsyn (1995) Trends Genet., 11(8):
303-307, Risinger et al. (1994) Mol Carcinog. 11(1): 13-18, and Michiels et al. (1998) Nucleic Acids Res. 26: 15 3608-3610, and references cited therein).

Particular preferred biological molecules for encoding and manipulation in the methods of this invention include proteins and/or the nucleic acids encoding the proteins of various classes of molecules such as therapeutic proteins such as erythropoietin (EPO), insulin, peptide hormones such as human growth hormone; growth factors and cytokines such as Neutrophil activating peptide-78, GRO.alpha./MGSA, Gro.beta., GRO.gamma., MIP-1.alpha., MIP-16, MCP-1, epidermal growth factor, fibroblast growth factor, hepatocyhte growth factor, insulin-like growth factor, the interferons, the interleukins, keratinocyte growth factor, leukemia inhibitory factor, oncostatin M, PD-ECSF, PDGF, pleiotropin, SCF, c-kit ligand, angiogenesis factors (e.g. vascular endothelial growth factors VEGF-A, VEGF-B, VEGF-C, VEGF-D, placental growth factor (PLGF), etc.), growth factors (e.g. G-CSF, GM-CSF), soluble receptors (e.g. IL4R, IL-13R, IL-10R, soluble T-cell receptors, etc.), and the like.

Other preferred molecules of encoding include, but are not limited to transcription and expression activators. The transcription and expression activators include genes and/or proteins that modulate cell growth, differentiation, regulation and the like an are found in prokaryotes viruses, and eukaryotes including fungi, plants and animals. Expression activators include, but are not limited to cytokines, inflammatory molecules, growth factors, growth factor receptors, and oncogene products, interleukins (e.g., IL-1, IL2, IL-8, etc.) interferons, FGF, IGF-I, IGF-II, FF, PDGF, TNF, TGF-.alpha., TGF-.beta., EGK, KGF, SCR/c-kit, CD40L/CD40, VLA-4VCAM-1, ICAM-1/LFA-1, and hyalurin/CD44, signal transduction molecules and corresponding oncogene products, e.g., Mos, RAS, Raf, and Met; and transcriptional activators and supressors, e.g., p53, Tat, Fos, Myc, Jun, Myb, Rel, and steroid hormone receptors such as those for estrogen, progesterone, testosterone, aldosterone, the LDL receptor ligand and corticosterone.

Preferred molecules for encoding in the methods of this invention also include proteins from infectious or otherwise pathogenic organisms e.g. proteins characteristic of Aspergillus sp., Candida sp., E. coli, Staphyloccoi sp, Streptocci sp., Clostridia sp., Neisseria sp., Enterobacteriacea sp., Helicobacter sp., Vibrio sp., Capylobacter sp., Pseudomonas sp., Ureaplasma Sp., Legionella sp., Spirochetes, Mycobacteria sp., Actnomyces sp., Nocardia sp., Chlamydia sp., Rickettsia sp., Coxiella sp., Ehrilichia sp., Rochalimaea, Brucella, Yersinia, Fracisel