Ting Su

Clustering High-Dimensional Data

Date: April 9 2007 (Monday)

Abstract:

Several technologies, such as the Internet, microarrays, and increased storage capacities, have made data analysis imperative. One popular technique for making sense of and organizing these data is clustering. Clustering is the process of grouping 'similar' objects together. Similarity is typically defined based on some metric, which in turn depends on the features describing each data sample. Clustering is challenging because we do not have labeled training data, and is even more difficult when not all of the features are important in grouping `natural' clusters. In this research, we address two issues in clustering: (1) feature reduction, and (2) initialization.

There are two approaches to reduce dimensionality: feature transformation and feature selection. We propose a feature transformation algorithm called auto_PPCA that can build a hierarchical mixture of local principal component analyzers automatically. Sometimes, people would prefer feature selection rather than transformation, because they want to keep a subset of the original features. Thus, we also explore feature selection for clustering problems. We develop a method to simultaneously cluster and select features on text data using a mixture of multinomial models. We, then, explore feature selection for clustering in a general framework. We provide a formal definition of relevant features for clustering and validate this definition through experimental evaluation.

Many clustering algorithms, such as K-means and Gaussian mixture clustering, strongly depend on the initial guess of the partitions. Currently, random re-start methods are the standard methods for initialization. The problem with random methods is that they are not repeatable, and they may still lead to a solution with bad quality unless we allow the number of re-starts to be very large (thereby, making the clustering time-consuming for large data sets). In this thesis, we investigate deterministic techniques for initializing clustering algorithms that can compete with the classical random methods.

Thesis Committee:
Prof. Jennifer Dy (advisor)
Prof. Dana Brooks
Prof. Masoud Salehi