Selected research projects for participating students are listed below.
Modern Information Infrastructure
Modern grid technology represents an emerging and expanding instrumentation, computing, information and storage platform that allows geographically distributed resources, which are under distinct control, to be linked together in a transparent fashion. The power of grids lies not only in the aggregate computing ability, data storage, and network bandwidth that can readily be brought to bear on a particular problem, but also in their ease of use. After a decade of research effort, grids are moving out of research laboratories into early-adopter production systems, such as the Computational Grid for certain computation-intensive applications, the Data Grid for distributed and optimized storage of large amounts of accessible data, and the Knowledge Grid, which makes intelligent use of the Data Grid to create knowledge and provide tools to all users.
Students working on this topic will be introduced to the basic computing and data infrastructure of the grid, the basic administration of the grid, and potential applications that may benefit from the grid. Project staff will guide the participating students through the overall infrastructure set-up process and through the migration of data and computation jobs via a web portal. Students will also use the web portal to monitor job status, user statistics, system load and bandwidth consumption.
Protein Function Studies
The complete sequencing of numerous genomes raises the next major challenge in biology: understanding how these genes function. So far, scientists have unraveled the functions of only a small percentage of the proteins encoded by these genes. Protein functions are often attributable to recurring sub-structural motifs (or mini-motifs, as they often contain fewer than 15 amino acids), which usually show specific positional characteristics in the protein sequences. Devising efficient computational or stochastic models to identify potentially functional motifs is the essential first step toward understanding protein functions.
This research project has yielded MnM, a web-based program that computationally searches the proteome database for the presence of mini-motifs in protein queries. MnM downloads several NCBI databases (RefSeq, LocusLink, HomoloGene, Taxonomy, Pfam, dbSNP, etc.) as input and analyzes them for potentially functional mini-motifs. Participating students will be involved in a few components that are currently under development.
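As a rough illustration of what scanning a protein sequence for a mini-motif involves, the sketch below looks for a short pattern written as a regular expression. It is not taken from MnM: the function name, the pattern and the sequence are all made up for this example, and a real search would also weigh the positional and database evidence described above.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based start position, matched text) for every occurrence
    of `pattern` in `sequence`, including overlapping occurrences."""
    # Wrapping the pattern in a lookahead makes each match zero-width,
    # so overlapping occurrences are all reported.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Illustrative pattern (an assumption for this sketch): proline, any two
# residues, proline. "." stands for any single amino acid.
ILLUSTRATIVE_MOTIF = "P..P"

# Made-up sequence in one-letter amino acid code, not a real protein.
EXAMPLE_SEQUENCE = "MKPAAPLPSTPGGK"

hits = find_motif(EXAMPLE_SEQUENCE, ILLUSTRATIVE_MOTIF)
for position, text in hits:
    print(f"motif {text} found at position {position}")
```

Reporting 1-based positions follows the usual convention for numbering residues in a protein sequence.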
Genomic Knowledge Inference
It is crucial that the massive genomic data being produced are well represented so that useful biological information may be efficiently extracted or inferred. A useful tool for effective knowledge representation is the semantic network. A semantic network is a conceptual model for knowledge representation in which the knowledge entities are represented by nodes (or vertices), while the edges (or arcs) are the relations between entities. It can serve as the backbone knowledge representation system for genomic, clinical and medical data. These knowledge bases are usually stored at geographically distributed locations, which highlights the importance of an efficient distributed semantic network system that enables distributed knowledge integration and inference.
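The node-and-edge model described above can be sketched in a few lines. This is a minimal, single-machine illustration only: the class name, the relation names and the example entities are assumptions made for this sketch and are not drawn from the UMLS or from the project's own system.

```python
from collections import defaultdict


class SemanticNetwork:
    """Entities are nodes; each labeled edge is a relation between two entities."""

    def __init__(self):
        # Maps (source entity, relation name) to the set of target entities.
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        """Record the fact: source --relation--> target."""
        self._edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Return the entities directly linked to `source` by `relation`."""
        return set(self._edges.get((source, relation), set()))

    def ancestors(self, node, relation="is_a"):
        """Infer every entity reachable by following `relation` transitively."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self._edges.get((current, relation), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Illustrative facts (hypothetical labels chosen for this sketch).
net = SemanticNetwork()
net.add("BRCA1", "is_a", "Gene")
net.add("Gene", "is_a", "Biological Entity")
net.add("BRCA1", "associated_with", "Breast Cancer")
net.add("Breast Cancer", "is_a", "Disease")

# Simple inference: BRCA1 is a Gene, and a Gene is a Biological Entity,
# so BRCA1 is also a Biological Entity.
print(net.ancestors("BRCA1"))
```

In a distributed setting, the edges would live in several knowledge bases at different sites, and a traversal such as `ancestors` would have to query each of them; that part is not shown here.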
The semantic network is a key component of the Unified Medical Language System (UMLS) project initiated in 1986 by the U.S. National Library of Medicine (NLM). The goal of the UMLS is to facilitate associative retrieval and integration of biomedical information so researchers and health professionals can use such information from different (readable) sources. Students participating in this project will learn about semantic networks, the UMLS, biomedical knowledge representation and basic concepts of distributed knowledge reasoning. Students will also participate in the design of the distributed UMLS.
Study of Ethical and Legal Issues
The deployment of grid technologies will inevitably foster the sharing of information at every level, from the molecular and the individual to the population. Releasing personal genomic data, even with consent, implies a de facto release of information pertaining to related individuals. Generally accepted protocols have yet to be worked out. In addition, the uniqueness of a personal genotype often makes it difficult to keep the source of the information anonymous. Strict regulations need to be devised to keep such information from being abused.
Determining liability for medical accidents or errors that result from the use of a Bio/Health-Grid while providing health care to a patient is crucial. For an international virtual organization enabled by this infrastructure, such issues become far more complicated. As an initial step toward determining the correct jurisdiction, the European Union has adopted Council Regulation (EC) No 44/2001 of 22 December 2000 on jurisdiction and the recognition and enforcement of judgments in civil and commercial matters. Further details will be discussed based on the Health-Grid White Paper by the European Health-Grid Association. A few colleagues will be invited to give guest lectures.
The cancer Biomedical Informatics Grid (caBIG) is sponsored by the National Cancer Institute (NCI), and its activities are supervised by the National Cancer Institute Center for Bioinformatics (NCICB). The initiative operates through an open development community made up of a wide spectrum of cancer researchers. Anyone can participate in caBIG and there is no cost to join. The caBIG community includes over 50 cancer centers, numerous other NCI-supported research endeavors, 30 federal, academic, not-for-profit and industry organizations, and over 900 individuals altogether.
The Biomedical Informatics Research Network (BIRN), another National Institutes of Health-initiated large-scale data grid, has deployed tools and infrastructure that are fostering a new biomedical collaborative culture and infrastructure. By intertwining concurrent revolutions occurring in biomedicine and information technology, BIRN is enabling researchers to participate in large-scale, cross-institutional research studies where they are able to acquire, share, analyze, mine and interpret data acquired at multiple sites using advanced processing and visualization tools. Some core components of this infrastructure, designed around a flexible large-scale grid model, include: a scalable and powerful data integration environment that allows users to access multiple databases as if they were a single database; the use and development of ontologies and data exchange standards; a user portal that provides a common user interface, encouraging greater collaboration among researchers and offering access to a powerful suite of biomedical tools.
Additional research projects are to be added shortly.
The schedule of training seminars and joint activities with other REU programs hosted at the university will be announced shortly.
Based on the selected research topic, students will prepare a mid-term (one-page) project summary and a proposal presentation. Students will also prepare an oral presentation and a written report (8-10 pages) based on their research results at the end of the program.