Research

 People

 Publications

 Sponsors & Affiliates

 News

 Related Sites

 Offered Courses



Subspace Clustering of High Dimensional Data Based on Domain Transformation by Jong-eun Jun

Motivation and Technology Background
With the rapid progress in data acquisition, sensing, and storage technologies, the size of available datasets is increasing at a rapid rate. Given this overwhelming amount of data, data mining, which transforms raw datasets into useful higher-level knowledge, becomes a must in analyzing and understanding the information. The raw dataset is consists of objects that are usually represented as a vector of numeric values in multidimensional space. Higher-level of knowledge can be extracted by discovering groups (clusters) of similar objects from the raw dataset. And, traditional clustering methods, which consider all of the dimensions (attributes) for measuring similarity between objects, have taken a leading role in data mining. However, there are several weaknesses for dealing with high dimensional data with traditional clustering frameworks. This is caused by inherent characteristics of high dimensional raw data, which are the increasing irrelevance of some dimensions, and the tendency to be nearly equidistant between objects as the number of dimensions in the dataset increases. For example, in Figure1, same objects that don’t show strong correlation in left side graph can have correlation in right side graph. Feature selection methods have been used for discovering relevant dimensions from high dimensional space, and clusters on them, however the feature selection methods have a critical limit that uncovers relations between objects in multiple, overlapping sub-dimensional spaces.


Figure 1. Full dimension vs. Subspace


Subspace clustering algorithms are suggested to overcome inherent problems in full-dimensional clustering frameworks and feature selection methods. The main purpose of subspace clustering algorithms is to find meaningful groups of objects in multiple sub-dimensions that may be overlapped, because clusters may be embedded in lower dimensional subspaces. In addition, different sets of dimensions may be relevant for different sets of objects, and an object can have multiple characteristics in different sub-spaces.

There are two main streams of previous approaches in subspace clustering. Top-down algorithms (e.g., PROCLUS, FINDIT, etc) find full-dimensional clusters first, and then refine them to find subspace clusters. Bottom-up strategies (e.g., CLIQUE, MAFIA, etc.) start from finding low dimensional dense regions, and then use them to make relevant subspace clusters. However, both of them have a major weakness on scalability of dimensions. For example, the number of dimensions (m) for gene expression data usually varies from 20 to 100, and the number of dimensions of feature vectors for documents is well over 100,000. When m is equal to 50, the number of possible subspaces is 250 – 1. It is computationally expensive to search all subspaces to identify clusters. To cope with this problem, bottom-up type of algorithms first identify clusters in low dimensional spaces, and use them to find clusters in higher dimensional spaces based on the a priori algorithm. However, this approach is not scalable with the number of dimensions in general. This motivates the necessity of a subspace clustering algorithm that is scalable with the number of dimensions. To this end, we develop a novel subspace clustering algorithm that utilizes domain transformation from objects to object-object relations.

Current efforts and capabilities
Our efforts on data mining are mainly focused on gene expression analysis [1, 2]. This is because there can be valuable hidden high-level knowledge in gene expression datasets that consist of the expression levels of thousands of genes under hundreds of conditions. The other reason that we use gene expression data is that most genes are expected to be strongly correlated under certain sub-conditions (e.g. Figure1), and these genes have similar biological functions. In addition to gene expression analysis, we also focus our research on word subspace clustering.

Figure 2. Overview of SCLUDOT

Our current capabilities on data mining include the following:

  • Identifying coherent patterns on subsets of gene expression datasets [1]: We propose a novel Subspace clustering algorithm, referred to as SCLUDOT (Subspace CLUstering based on DOmain Transformation), for identifying coherent sub-patterns of gene expression datasets. The overview of our proposed subspace clustering algorithm is shown on Figure2. By performing discretization (step1 in Figre2) to gene expression profiles, each gene can be expressed as a sequence of symbols instead of a sequence of numeric values.
    As gene pairs may be similar in certain conditions and not in other conditions, the maximum number of same symbols in gene pairs would be less than or equal to total number of conditions, even though all genes have same length of sequence of symbols (total number of conditions). Now, the similarity between two genes is transformed as a sequence of symbols that represents the maximal subspace cluster for the gene pair (step2 in Figure2).
    This domain transformation (from genes into gene-gene relations) allows us to make the number of possible subspace clusters dependent on the number of genes (objects) instead of number of conditions (dimensions).
    Based on the symbolized genes, we present an efficient subspace clustering algorithm that is linearly scalable to the number of dimensions. In addition, the running time can be drastically reduced by utilizing inverted indexing and pruning non-interesting subspaces.
    Experimental results so far indicate that the proposed method efficiently identifies co-expressed gene subspace clusters for a yeast cell cycle dataset. However, SCLUDOT is not limited to gene expression mining, but applicable to subspace clustering in general for high-dimensional massive datasets.

  • SCLUDOT-based metadata management: To strengthen our work in terms of generality, we plan to apply SCLUDOT to word subspace clustering. In order to retrieve rich semantic information from the overwhelming amount of unstructured data, metadata (e.g., ontologies) can be employed. Since manually building and maintaining such metadata is nearly impossible, an automated dynamic metadata management technique will play an important role.
    We proposed a novel algorithm called WebSim (Web-based term Similarity metric) [4,5], whose feature extraction and similarity model is based on a conventional Web search engine.
    There are two main benefits of utilizing a Web search engine. First, we can obtain the freshest content for each term that represents the up-to-date knowledge on the term. Second, in comparison with the approaches that use the certain amount of crawled Web documents as a corpus, our method is less sensitive to the problem of data sparseness because we access as much content as possible using a search engine.
    A new approach that combines WebSim with SCLUDOT gives a promising dynamic metadata management model. Even though WebSim provides more reliable metrics for relationships between terms or concepts, it is still suffering from dealing with ambiguity of terms.
    Because SCLUDOT finds clusters in multiple, overlapped subspaces in the manner of being scalable to the number of dimensions, it has the capability to extract multiple relations between terms or concepts under different subsets of features in very high dimensional feature vector space extracted by WebSim. And, these subspace clusters reveal hidden term relationships that cannot be found by full-dimensional term similarity metrics.

At the core of our current research efforts, SCLUDOT plays a key role. An efficient subspace clustering algorithm promises huge benefits on diverse problem domains, and SCLUDOT may be a stepping stone to the successful subspace clustering framework.

Possible problem domains of subspace clustering
Although all the high dimensional data can be considered as input data of subspace clustering, datasets in certain problem domains have recently received scientific attention:

  • Gene expression analysis: Since gene expression datasets consist of measurements across various conditions (or time points), they are characterized by high dimensional, huge size volume, and noisy data. It is well known that genes can manifest a coherent pattern under subsets of experimental conditions, and these patterns are closely related with certain functionality of genes. Therefore, It is essential to identify such local patterns in gene expression datasets, which is a key to revealing biologically meaningful clusters.
  • E-commerce: With the fast growing online market, finding user groups who show similar interests in products is more important than any other era in history. Also, users can be identified by diverse characteristics, and current technologies allow substantial user information to be gathered. However, it is difficult to analyze the user information data because of its high dimensionality and the heterogeneity of users’ interests. A subspace clustering with the ability of finding locally correlated patterns is a promising method for recommendation systems and target marketing in e-commerce.
    ? Besides gene expression analysis and e-commerce, there are several problem domains that are suitable for subspace clustering. For example, creating a hierarchy of data sources in an information integration system and organizing unstructured web search results are possible problem domains where subspace clustering can be utilized.

Research Plan
Regarding our subspace clustering algorithm (SCLUDOT), we plan to investigate the following:

  • Improving discritization step: Discritization on each condition is the first step in SCLUDOT, and plays a key role in improving the quality of subspace clusters. Currently, we utilize a k-means algorithm for this step, but it limits number of possible symbols in each condition (dimension) to a pre-defined fixed number. But, there can be different characteristics (e.g. density distribution of values, pre-determined importance, etc.) of each condition: this means the algorithm should have an ability to freely determine the number of symbols in each dimension. So, we plan to apply different known algorithms for this step, and conduct experiments to find the most efficient method.
  • Finding concrete subspace clusters: There are still more issues vis-a-vis SCLUDOT – how we can systematically determine the bound of number of objects in a subspace cluster, and what is relevant minimum number of dimensions that provides meaningful clusters? Currently, a probabilistic model is used to determine the minimum number of objects in a cluster. And, the minimum number of dimensions is determined experimentally. These two issues strongly affect the average running time and quality of result, because the threshold for the minimum number of objects in a cluster is related with the pruning step and total number of obtained subspace clusters, and the minimum number of dimensions determines the size of search space. We plan to investigate these issues in diverse ways.
  • Extending SCLUDOT to pattern-based subspace clustering: Our current subspace clustering algorithm utilizes value similarity for finding subspace clusters, but there is high possibility to extent it to pattern-based clustering. Pattern-based clustering is performed by pattern similarity, and it has been widely studied for the purpose of dealing with time-series data. By adding numeric orders of symbols as another factor for the clustering, we can achieve pattern-based subspace clustering.
  • Efficient analysis for the result: In order to interpret obtained clusters, external knowledge needs to be involved. It depends on problem domain, because available external knowledge will be different domain by domain. For gene expression analysis, we plan to explore the methodology to quantify the relationship between subspace clusters and known biological knowledge by utilizing a Gene Ontology.
  • Comprehensive evaluation of the proposed algorithm on different datasets: For the purpose of qualitatively testing SCLUDOT, we plan to conduct more comprehensive experiments on diverse datasets as well as compare our approach with other competitive algorithms.

Besides improving our subspace clustering algorithm itself, we plan to extend our research to apply SCLUDOT to other application domains. SCLUDOT is a general-purpose subspace clustering algorithm, therefore it may be suitable for e-commerce and association rule mining by combining SCUDOT with a classification method. So, we plan to investigate whether SCLUDOT is suitable to domains such as association rule mining, and recommendation systems.


Home | Research | People | Publications | Sponsors & Affiliates | News | Offered Courses

© 2000-2013 Semantic Information Research Laboratory. All Rights Reserved.