Approaches and Mechanisms for Analyzing Genomic Research Data by Jong-eun Jun

Recent advances in DNA microarray technology make it possible to measure the expression levels of thousands of genes simultaneously. The resulting data are usually organized as a matrix in which the columns represent genes (usually the genes of the whole genome) and the rows correspond to samples (e.g., various tissues, experimental conditions, or time points). Given this wealth of gene expression data, it is essential to extract the knowledge hidden in the matrix.

One of the key steps in gene expression analysis is clustering genes that show similar expression patterns. Having identified a set of gene clusters, we can hypothesize that genes clustered together tend to be functionally related. Gene expression clustering may therefore help identify mechanisms of gene regulation and interaction, which in turn can be used to understand the function of a cell.

Because gene expression data consist of measurements across many conditions (or time points), they are high-dimensional, large in volume, and noisy. Clustering algorithms must therefore address and exploit these features of the datasets. Recent database mining research has proposed density-based clustering algorithms, which are well suited to multi-dimensional, noisy datasets. Addressing the limitations of previous density-based clustering methods, we present a KNN (k-nearest neighbor) density estimation clustering algorithm suited to producing co-expressed gene clusters. In addition, to reduce the computational complexity of the presented technique, we explored different optimization schemes. Preliminary experimental results indicate that the proposed method successfully identifies co-expressed gene clusters in yeast time-series datasets.
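To make the idea of clustering on a k-nearest-neighbor density estimate concrete, here is a minimal sketch. It is not the algorithm presented above, whose details are not given here. It is a generic kNN mode-seeking scheme in which the function name `knn_density_clusters`, the use of Euclidean distance, and the brute-force neighbor search are all assumptions made for illustration. Each expression profile gets a density estimate that is inversely proportional to the distance to its k-th nearest neighbor, and each profile is then linked uphill to the densest profile in its neighborhood.

```python
import math


def knn_density_clusters(points, k):
    """Cluster points by hill-climbing on a kNN density estimate.

    Illustrative sketch only (generic kNN mode seeking), not the
    algorithm described in the text. Returns one label per point;
    the label is the index of the density peak the point climbs to.
    """
    n = len(points)
    neighbors = []
    density = []
    for i in range(n):
        # Brute-force k nearest neighbors of point i (excluding itself).
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        neighbors.append([j for _, j in dists])
        # Density is high when the k-th neighbor is close.
        density.append(1.0 / (dists[-1][0] + 1e-12))

    # Link each point to the densest point among itself and its neighbors.
    # Ties are broken toward the smaller index, so links never form a cycle.
    parent = [
        max([i] + neighbors[i], key=lambda j: (density[j], -j))
        for i in range(n)
    ]

    # Follow the links uphill until a local density peak is reached.
    labels = []
    for i in range(n):
        root = i
        while parent[root] != root:
            root = parent[root]
        labels.append(root)
    return labels


# Hypothetical toy data: two well-separated groups of 2-point "profiles".
profiles = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
labels = knn_density_clusters(profiles, k=2)
```

For real expression profiles, a correlation-based distance is often preferred over Euclidean distance, and the quadratic neighbor search would be the first candidate for optimization.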


The above figure shows sample co-expressed gene clusters identified by our clustering algorithm. As shown, the algorithm captures co-expressed genes successfully. In addition, to verify that the produced clusters are biologically meaningful, we examined the GO annotations of genes within the same cluster. The following figure shows the GO annotation of the sample genes in a co-expressed gene cluster, generated by GoTermFinder. As illustrated, the concepts that correspond to the genes within the same cluster are semantically similar; thus, our clustering algorithm can also capture biologically meaningful clusters.
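One common way to quantify whether a GO term is over-represented in a cluster is a hypergeometric test. The sketch below assumes that kind of scoring and does not describe how the figure was actually produced. The function name `go_enrichment_pvalue` and all counts are made up for illustration.

```python
from math import comb


def go_enrichment_pvalue(total_genes, annotated, cluster_size, hits):
    """Probability of seeing at least `hits` annotated genes in a cluster.

    Assumes the cluster is a random draw of `cluster_size` genes from
    `total_genes`, of which `annotated` carry the GO term
    (hypergeometric upper tail).
    """
    upper = min(annotated, cluster_size)
    favorable = sum(
        comb(annotated, i) * comb(total_genes - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(total_genes, cluster_size)


# Hypothetical counts: 10 genes in total, 3 annotated with the term,
# and a cluster of 3 genes that are all annotated.
p_value = go_enrichment_pvalue(10, 3, 3, 3)
```

A small p-value suggests that the shared annotation is unlikely to be a chance effect. In practice the values are also corrected for testing many GO terms at once.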

Finally, based on the observation that a priori biological knowledge is available in bioinformatics, we developed a constrained clustering algorithm that incorporates background knowledge early in the clustering process. The Gene Ontology is used to obtain relevant background knowledge, which is expressed as pairwise instance-level constraints (e.g., two genes must belong to the same cluster). Given a small amount of such background knowledge, we perform gene expression clustering to produce gene clusters that are biologically meaningful as well as co-expressed. Brief information can be found in our poster.
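Pairwise must-link constraints of the kind mentioned above can be represented and checked as in the following sketch. It illustrates the constraint representation only, not the constrained clustering algorithm itself. The helper names `must_link_groups` and `satisfies_must_link` and the gene identifiers are hypothetical. Because must-link is transitive, the constraints are first closed with a union-find structure, so that genes linked directly or indirectly end up in one group.

```python
def must_link_groups(genes, must_link):
    """Group genes by the transitive closure of must-link pairs."""
    parent = {g: g for g in genes}

    def find(g):
        # Walk to the representative, halving the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return list(groups.values())


def satisfies_must_link(assignment, must_link):
    """Check that every must-link pair shares a cluster label."""
    return all(assignment[a] == assignment[b] for a, b in must_link)


# Hypothetical gene identifiers and constraints.
genes = ["g1", "g2", "g3", "g4"]
constraints = [("g1", "g2"), ("g2", "g3")]
groups = must_link_groups(genes, constraints)
ok = satisfies_must_link({"g1": 0, "g2": 0, "g3": 0, "g4": 1}, constraints)
bad = satisfies_must_link({"g1": 0, "g2": 1, "g3": 1, "g4": 1}, constraints)
```

A constrained clustering method could treat each group as a single unit during assignment, which guarantees that the must-link constraints hold in the final clusters.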



© 2000-2013 Semantic Information Research Laboratory. All Rights Reserved.