The genome sequence of the facultative intracellular pathogen Brucella melitensis.

DelVecchio VG, Kapatral V, Redkar RJ, Patra G, Mujer C, Los T, Ivanova N,
Anderson I, Bhattacharyya A, Lykidis A, Reznik G, Jablonski L, Larsen N, D'Souza
M, Bernal A, Mazur M, Goltsman E, Selkov E, Elzer PH, Hagius S, O'Callaghan D,
Letesson JJ, Haselkorn R, Kyrpides N, Overbeek R.

Brucella melitensis is a facultative intracellular bacterial pathogen that causes
abortion in goats and sheep and Malta fever in humans. The genome of B.
melitensis strain 16M was sequenced and found to contain 3,294,935 bp distributed
over two circular chromosomes of 2,117,144 bp and 1,177,787 bp encoding 3,197
ORFs. By using the bioinformatics suite ERGO, 2,487 (78%) ORFs were assigned
functions. The origins of replication of the two chromosomes are similar to those
of other alpha-proteobacteria. Housekeeping genes, including those involved in
DNA replication, transcription, translation, core metabolism, and cell wall
biosynthesis, are distributed on both chromosomes. Type I, II, and III secretion
systems are absent, but genes encoding sec-dependent, sec-independent, and
flagella-specific type III, type IV, and type V secretion systems as well as
adhesins, invasins, and hemolysins were identified. Several features of the B.
melitensis genome are similar to those of the symbiotic Sinorhizobium meliloti.

Proc Natl Acad Sci U S A. 2002 Jan 8;99(1):443-8. Epub 2001 Dec 26.

Genomes OnLine Database (GOLD): a monitor of genome projects world-wide.

Bernal A, Ear U, Kyrpides N.

GOLD is a comprehensive resource for accessing information related to completed and ongoing genome projects world-wide. The database currently provides information on 350 genome projects, of which 48 have been completely sequenced and their analysis published. GOLD was created in 1997 and since April 2000 it has been licensed to Integrated Genomics. The database is freely available through the URL: http://igweb.integratedgenomics.com/GOLD/.

Nucleic Acids Res. 2001 Jan 1;29(1):126-7.

Archaeal shikimate kinase, a new member of the GHMP-kinase family.

Daugherty M, Vonstein V, Overbeek R, Osterman A.

Shikimate kinase (EC 2.7.1.71) is a committed enzyme in the seven-step biosynthesis of chorismate, a major precursor of aromatic amino acids and many other aromatic compounds. Genes for all enzymes of the chorismate pathway except shikimate kinase are found in archaeal genomes by sequence homology to their bacterial counterparts. In this study, a conserved archaeal gene (gi1500322 in Methanococcus jannaschii) was identified as the best candidate for the missing shikimate kinase gene by the analysis of chromosomal clustering of chorismate biosynthetic genes. The encoded hypothetical protein, with no sequence similarity to bacterial and eukaryotic shikimate kinases, is distantly related to homoserine kinases (EC 2.7.1.39) of the GHMP-kinase superfamily. The latter functionality in M. jannaschii is assigned to another gene (gi591748), in agreement with sequence similarity and chromosomal clustering analysis. Both archaeal proteins, overexpressed in Escherichia coli and purified to homogeneity, displayed activity of the predicted type, with steady-state kinetic parameters similar to those of the corresponding bacterial kinases: K(m,shikimate) = 414 +/- 33 microM, K(m,ATP) = 48 +/- 4 microM, and k(cat) = 57 +/- 2 s(-1) for the predicted shikimate kinase and K(m,homoserine) = 188 +/- 37 microM, K(m,ATP) = 101 +/- 7 microM, and k(cat) = 28 +/- 1 s(-1) for the homoserine kinase. No overlapping activity could be detected between shikimate kinase and homoserine kinase, both revealing a >1,000-fold preference for their own specific substrates. The case of archaeal shikimate kinase illustrates the efficacy of techniques based on reconstruction of metabolism from genomic data and analysis of gene clustering on chromosomes in finding missing genes.

J Bacteriol. 2001 Jan;183(1):292-300.
doi: 10.1128/JB.183.1.292-300.2001
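
The kinetic constants quoted in the abstract above can be compared through the catalytic efficiency k(cat)/K(m). The following is a minimal sketch of that arithmetic, not a calculation taken from the paper: the numbers are the ones reported in the abstract, while the dictionary layout and the function names are illustrative assumptions.

```python
# Kinetic parameters as reported in the abstract (K_m in micromolar, k_cat in 1/s).
# The data layout and the names below are illustrative, not from the paper.
KINETICS = {
    "shikimate kinase": {"km_substrate_uM": 414.0, "km_atp_uM": 48.0, "kcat_per_s": 57.0},
    "homoserine kinase": {"km_substrate_uM": 188.0, "km_atp_uM": 101.0, "kcat_per_s": 28.0},
}


def catalytic_efficiency(kcat_per_s: float, km_uM: float) -> float:
    """Return k_cat / K_m in M^-1 s^-1, converting K_m from micromolar to molar."""
    return kcat_per_s / (km_uM * 1e-6)


def substrate_efficiencies() -> dict:
    """Catalytic efficiency of each enzyme toward its own (non-ATP) substrate."""
    return {
        name: catalytic_efficiency(p["kcat_per_s"], p["km_substrate_uM"])
        for name, p in KINETICS.items()
    }
```

Under these numbers both enzymes come out on the order of 10^5 M^-1 s^-1 toward their own substrates, which is one way to read the abstract's statement that the parameters resemble those of the bacterial kinases.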

Analysis of the Thermotoga maritima genome combining a variety of sequence similarity and genome context tools.

Kyrpides NC, Ouzounis CA, Iliopoulos I, Vonstein V, Overbeek R.

The proliferation of genome sequence data has led to the development of a number of tools and strategies that facilitate computational analysis. These methods include the identification of motif patterns, membership of the query sequences in family databases, metabolic pathway involvement and gene proximity. We re-examined the completely sequenced genome of Thermotoga maritima by employing the combined use of the above methods. By analyzing all 1877 proteins encoded in this genome, we identified 193 cases of conflicting annotations (10%), of which 164 are new function predictions and 29 are amendments of previously proposed assignments. These results suggest that the combined use of existing computational tools can resolve inconclusive sequence similarities and significantly improve the prediction of protein function from genome sequence.

Nucleic Acids Res. 2000 Nov 15;28(22):4573-6.

WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction.

Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E.

The WIT (What Is There) (http://wit.mcs.anl.gov/WIT2/) system has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes. Sequence homologies, various ORF-clustering algorithms, relative gene positions on the chromosome and placement of gene products in metabolic pathways (metabolic reconstruction) can be used for the assignment of gene functions and for development of overviews of genomes within WIT. The integration of a large number of phylogenetically diverse genomes in WIT facilitates the understanding of the physiology of different organisms.

Nucleic Acids Res. 2000 Jan 1;28(1):123-5.

Genomics: what is realistically achievable?

Overbeek R.

We now have a large and growing number of sequenced genomes. It is widely understood that this presents research opportunities and promises to change the way biology advances, but the magnitude and nature of the opportunities are, for the most part, poorly understood. In this short piece, I wish to examine the following two questions: First, how quickly will sequence data be produced? Second, what impact will this have on our understanding of the sequenced organisms?

Since I am a computer scientist by training, I tend to think of the current situation, in which the field of genomics is being driven forward by rapid technological advances, as quite analogous to the sequence of events in computing that were triggered by advances in microcomputer and network technologies. I distinctly remember the early period in which it seemed clear to most computer scientists (including myself) that technical advances were very desirable and interesting, but could have little impact on either the fundamental research issues or the overall advance of the field. Most of us completely underestimated the impact of exponential price improvements in key enabling technologies. Certainly no one that I know of foresaw in any detail the current world of computing (although a few had rare insights into the potential). As we face the world generated by the web, we should remember that as late as the early 1990s common wisdom indicated that 'movies on demand' would be the application that drove increased network bandwidth.

Genome Biol. 2000;1(2):comment2002.1-comment2002.3.
Published online 2000 Jul 28. doi: 10.1186/gb-2000-1-2-comment2002

Protein interaction maps for complete genomes based on gene fusion events.

Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA.

A large-scale effort to measure, detect and analyse protein-protein interactions using experimental methods is under way. These include biochemistry such as co-immunoprecipitation or crosslinking, molecular biology such as the two-hybrid system or phage display, and genetics such as unlinked noncomplementing mutant detection. Using the two-hybrid system, an international effort to analyse the complete yeast genome is in progress. Evidently, all these approaches are tedious, labour intensive and inaccurate. From a computational perspective, the question is how can we predict that two proteins interact from structure or sequence alone. Here we present a method that identifies gene-fusion events in complete genomes, solely based on sequence comparison. Because there must be selective pressure for certain genes to be fused over the course of evolution, we are able to predict functional associations of proteins. We show that 215 genes or proteins in the complete genomes of Escherichia coli, Haemophilus influenzae and Methanococcus jannaschii are involved in 64 unique fusion events. The approach is general, and can be applied even to genes of unknown function.

Nature. 1999 Nov 4;402:86-90. doi: 10.1038/47056
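
The gene-fusion idea in the abstract above can be illustrated with a small sketch. It assumes that sequence-similarity hits are already available as (query, subject, aligned region on subject) records: two query proteins that match different, non-overlapping regions of the same composite protein in another genome, and that are not similar to each other, are reported as a candidate functional association. The `Hit` record, the function names, and the overlap threshold are hypothetical; the abstract does not give the authors' actual pipeline or parameters.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    query: str    # protein in the query genome (candidate component)
    subject: str  # protein in a reference genome (candidate composite)
    s_start: int  # first aligned residue on the subject (1-based, inclusive)
    s_end: int    # last aligned residue on the subject (inclusive)


def overlap(a: Hit, b: Hit) -> int:
    """Number of subject residues covered by both alignments."""
    return max(0, min(a.s_end, b.s_end) - max(a.s_start, b.s_start) + 1)


def predict_fusion_pairs(hits, similar_pairs=frozenset(), max_overlap=0):
    """Return (query_a, query_b, composite) triples suggesting a fusion event.

    similar_pairs: set of frozenset({a, b}) for query proteins that are similar
    to each other; such pairs are skipped (assumed filter, to avoid paralogs).
    max_overlap: tolerated overlap, in residues, between the two aligned regions.
    """
    by_subject = defaultdict(list)
    for hit in hits:
        by_subject[hit.subject].append(hit)

    predictions = set()
    for subject, group in by_subject.items():
        for a, b in combinations(group, 2):
            if a.query == b.query:
                continue
            pair = frozenset((a.query, b.query))
            if pair in similar_pairs:
                continue
            if overlap(a, b) <= max_overlap:
                predictions.add((*sorted(pair), subject))
    return predictions
```

Because the test uses only sequence comparison, it applies equally to proteins of unknown function, which is the generality the abstract points out.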

Universal protein families and the functional content of the last universal common ancestor.

Kyrpides N, Overbeek R, Ouzounis C.

The phylogenetic distribution of Methanococcus jannaschii proteins can provide, for the first time, an estimate of the genome content of the last common ancestor of the three domains of life. Relying on annotation and comparison with reference to the species distribution of sequence similarities results in 324 proteins forming the universal family set. This set is very well characterized and relatively small and nonredundant, containing 301 biochemical functions, of which 246 are unique. This universal function set contains mostly genes coding for energy metabolism or information processing. It appears that the Last Universal Common Ancestor was an organism with metabolic networks and genetic machinery similar to those of extant unicellular organisms.

J Mol Evol. 1999 Oct;49(4):413-23.
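
As a rough illustration of the selection step described in the abstract above, the sketch below keeps only those proteins whose sequence similarities span all three domains of life. The input format (protein identifier mapped to the set of domains with a similar sequence) and the function name are assumptions made for illustration; the study itself also relied on annotation, which this sketch does not model.

```python
# The three domains of life; a protein counts as "universal" here only if it
# has a similar sequence in every one of them (an assumed, simplified criterion).
DOMAINS = frozenset({"Archaea", "Bacteria", "Eukarya"})


def universal_proteins(distribution):
    """Return, sorted, the proteins whose similarity hits cover all three domains.

    distribution: dict mapping a protein identifier to an iterable of domain
    names in which a similar sequence was found.
    """
    return sorted(
        protein
        for protein, domains in distribution.items()
        if DOMAINS <= set(domains)
    )
```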

The use of gene clusters to infer functional coupling.

Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N.

Previously, we presented evidence that it is possible to predict functional coupling between genes based on conservation of gene clusters between genomes. With the rapid increase in the availability of prokaryotic sequence data, it has become possible to verify and apply the technique. In this paper, we extend our characterization of the parameters that determine the utility of the approach, and we generalize the approach in a way that supports detection of common classes of functionally coupled genes (e.g., transport and signal transduction clusters). Now that the analysis includes over 30 complete or nearly complete genomes, it has become clear that this approach will play a significant role in supporting efforts to assign functionality to the remaining uncharacterized genes in sequenced genomes.

Proc Natl Acad Sci U S A. 1999 Mar 16;96(6):2896-901.
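
The clustering idea in the abstract above can be sketched as follows, assuming each genome has already been reduced to an ordered list of gene-family labels. Two families are scored by the number of genomes in which they lie close together on the chromosome. The `window` parameter, the input format, and the function names are illustrative assumptions; the paper's own definition of closeness and its scoring are not reproduced here.

```python
from collections import Counter


def close_family_pairs(gene_order, window=2):
    """Return unordered family pairs at most `window` positions apart in one genome."""
    pairs = set()
    for i, fam_a in enumerate(gene_order):
        # Look only at the next `window` genes along the chromosome.
        for fam_b in gene_order[i + 1 : i + 1 + window]:
            if fam_a != fam_b:
                pairs.add(frozenset((fam_a, fam_b)))
    return pairs


def coupling_scores(genomes, window=2):
    """Count, for every family pair, the number of genomes in which it is close.

    genomes: dict mapping a genome name to its ordered list of family labels.
    """
    scores = Counter()
    for gene_order in genomes.values():
        # Each genome contributes at most one count per pair.
        scores.update(close_family_pairs(gene_order, window))
    return scores
```

Pairs that stay close in many phylogenetically diverse genomes are the candidates for functional coupling; pairs seen together in a single genome carry little evidence.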