Future plans for the CUB-DB

What's in and out of the DB

Currently, I have only genome-wide data in the database. By this I mean counts and metrics that characterize an entire genome, such as "number of coding sequences," "ribosomal criteria," and "GC content." All sequence- and gene-specific data is on the file system. I am contemplating moving this information into the database. This would be a fairly significant change to my code, and would increase the size of the DB to about 40 GB (at least, that is how big the data is on the file system).
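As a rough illustration of what moving gene-level data into the database might look like, here is a minimal sketch of a per-gene table keyed to a genome-wide table. Everything in it is an assumption made for illustration: the table and column names (genome, gene, locus_tag, and so on) are hypothetical, and Python's built-in sqlite3 module stands in for the real database server so that the example is self-contained.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical two-table layout: one row per genome, one row per gene."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS genome (
            genome_id  INTEGER PRIMARY KEY,
            organism   TEXT NOT NULL,
            num_cds    INTEGER,   -- genome-wide count of coding sequences
            gc_content REAL       -- genome-wide GC fraction
        );
        CREATE TABLE IF NOT EXISTS gene (
            gene_id    INTEGER PRIMARY KEY,
            genome_id  INTEGER NOT NULL REFERENCES genome(genome_id),
            locus_tag  TEXT NOT NULL,
            sequence   TEXT NOT NULL  -- coding sequence, currently kept on the file system
        );
        CREATE INDEX IF NOT EXISTS gene_by_genome ON gene(genome_id);
        """
    )


def add_gene(conn, genome_id, locus_tag, sequence):
    """Insert one gene row and return its generated id."""
    cur = conn.execute(
        "INSERT INTO gene (genome_id, locus_tag, sequence) VALUES (?, ?, ?)",
        (genome_id, locus_tag, sequence),
    )
    return cur.lastrowid


def genes_for_genome(conn, genome_id):
    """Return (locus_tag, sequence) pairs for one genome, ordered by locus tag."""
    rows = conn.execute(
        "SELECT locus_tag, sequence FROM gene WHERE genome_id = ? ORDER BY locus_tag",
        (genome_id,),
    )
    return rows.fetchall()
```

With a layout along these lines, the genome-wide metrics stay where they are and the gene-level rows simply reference them, so existing queries against the genome table would not need to change.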

The main thing slowing down the process is making sure we have a good backup process for the database. It doesn't seem very efficient to shut down the DB every night and spend an hour or more dumping all of the data into a monolithic backup. When the DB is only 50 MB, a mysqldump works just fine. But 40 GB?


I now have code that can scrape the phylogenetic relationships from NCBI's taxonomy browser. I would like to include this information in the database. Currently, I include class and phylum only, but these are retrieved from fields in the annotated sequence files and do not always agree with the more formal listing on the organism's phylogeny page.

Strand Criteria

I want to add a strand criterion to my measures. This will require knowing where the chromosomal origin of replication is. Once this is known, the genes that reside on the leading and lagging strands can be identified, and differences in their codon composition can be assessed.
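The leading/lagging assignment can be sketched as follows. This is only a sketch under stated assumptions, not the code used for the CUB-DB: it assumes a single circular chromosome, a gene that does not wrap across coordinate 0, and, when no terminus is supplied, a terminus placed directly opposite the origin. The Gene class and the classify_strand function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 0-based start coordinate on the chromosome
    end: int     # end coordinate (exclusive); assumed greater than start
    strand: str  # "+" or "-" as given in the annotation


def classify_strand(gene, origin, genome_length, terminus=None):
    """Return "leading" or "lagging" for a gene on a circular chromosome."""
    if terminus is None:
        # Assumption: the terminus lies diametrically opposite the origin.
        terminus = (origin + genome_length // 2) % genome_length

    midpoint = ((gene.start + gene.end) // 2) % genome_length

    # Distances measured from the origin in the direction of increasing coordinates.
    offset = (midpoint - origin) % genome_length
    arc = (terminus - origin) % genome_length

    # On the replichore running from origin to terminus in increasing coordinates,
    # the fork moves in the "+" direction, so "+" genes are co-directional with it.
    # On the other replichore the fork moves in the "-" direction.
    on_forward_replichore = offset < arc
    codirectional = (gene.strand == "+") == on_forward_replichore
    return "leading" if codirectional else "lagging"
```

For example, with a 1000-base chromosome and an origin at position 100, a "+" gene at 200-300 is classified as leading, while a "+" gene at 700-800 falls on the other replichore and is classified as lagging. Once genes are split this way, codon counts can be tallied separately for each group and compared.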

Fitness Landscapes

At some point, I want to add a visualization of the fitness landscapes for all of the organisms. The fitness landscape visualization technique is described in my mSCCI paper (Raiford et al.). It currently shows how self-consistent reference sets are in the codon usage space. I may try to find a way to alter it to show the fitness of weight-solutions (as in the search-based approach: Raiford et al.). This would show only translational efficiency ridges, so I may not pursue that particular aspect of fitness landscapes. The current method shows how self-consistent the fitness landscapes are, so if the dominant bias is strand or content, it will present as a ridge as long as it manifests itself in self-consistent reference sets.
