My area of research is bioinformatics, which, briefly defined, is the application of computational methods to biomedical problems. I focus on developing methods that let computers play a greater role in automated knowledge discovery. In other words, beyond using computers to solve specific problems, I am interested in getting them to first establish what is known and then condense large amounts of diverse data to infer what is not yet known but is both statistically significant and scientifically interesting. As one might suspect, defining what is scientifically interesting turns out to be harder than defining statistical significance, but that’s what makes it fun.
In general, I am interested in integrating and mining large biomedical databases for patterns that can accelerate our understanding of the genetic causes of disease onset and progression. Although we have known the physical locations of the 25,000 human genes for almost a decade, approximately one-third of them still have no known function. For the genes we do know something about, the amount of information per gene is heavily skewed towards those of commercial importance and, for reasons unknown, the rate of new gene discovery has slowed noticeably over the past five years. Emerging data indicate that many, if not most, of these uncharacterized genes are just as important, biologically speaking, as the ones we do know about: they consistently appear in genome-wide association searches for mutations that cause human disease. Thus, there is a growing need to accurately predict gene function.
My current research focuses on refining and testing an algorithm I’ve developed to infer gene function by integrating and modeling the information contained both in the scientific literature (over 19 million records in MEDLINE, growing by around 750,000 new papers per year) and in experimental resources such as gene expression and protein-protein interaction databases. Together with collaborators, most of them local, we are experimentally testing the predicted gene functions, and the algorithm has performed very accurately so far. We have now discovered approximately 37 new genes involved in important biological processes and diseases, including coagulation, immune cell movement, cell division, brain cancer growth, endometriosis and Alzheimer’s disease. These discoveries matter because, for many of the genes, they open up the possibility of creating more accurate diagnostics, predicting disease outcome, and identifying new targets for pharmaceutical intervention.