Semantic Clustering Research

Semantic Clustering Research

Funded by a gift from Microsoft Research, Redmond, Washington

This research was built on Seer, an existing predictive file-hoarding system for mobile computing operations. Our focus was on investigating incremental clustering (possibly using idle cycles to pre-build clusters), examining the Seer data-gathering parameter space, and determining the contributions of various inputs into Seer.

Currently, system software makes little use of much of information that is available. The system can observe everything that happens on a machine throughout its entire life-span, and can use those observations to improve caching and page replacement algorithms, to optimize placement of objects in distributed systems, and to prefetch information over a slow network.

Many attempts to use introspection to improve system performance require identifying groups of entities (objects, pages, files, etc.) that are related to one another in performance-meaningful ways. Once such relationships are discovered, the system can use that information to deal with the collection of related entities as a whole. For example, reference to one entity in the collection can be used as a strong hint that others are likely to be referenced. At UCLA, we have used this general principle to great effect in the Seer file-hoarding system for disconnected mobile computers. This system builds clusters of related files based on observing past activity in the file system, and hoards entire clusters on the disk of a portable computer before the machine is disconnected from the net. It is able to deduce correctly the relationships between files and builds clusters sufficiently well so that 96% of all disconnections experience no missing files (Seer has properly hoarded everything required). In most of the remaining 4% of the cases, the missing file was not sufficiently important to prevent the user from continuing to work. Further, simulations based on real file traces show that Seer uses, at most, 75% more disk than the true working set of files used (the minimum required size), as opposed to methods like LRU that use a factor of ten more spaces to achieve the same results.

Seer has shed light on the proper way to build clusters for file hoarding and related applications. Many of the clustering techniques pioneered here can be used to cluster other introspection information. However, since Seer is a research system, and our first attempt in the general area, not all aspects of its operation are perfect. Many of these imperfections are tolerable for Seer's purposes, but limit more general use of its clustering algorithms. For example, Seer always builds new clusters every time it needs to decide what to hoard on a portable computer. It makes no use of clusters built previously, at the cost of spending large amounts of processing time to build the clusters anew. While it can afford a delay of a minute or two to build clusters, a page replacement algorithm or network prefetching utility cannot afford such delays. Since we expect the clusters built each time to be largely the same as those built the last time, this heavy cost need not be paid. Instead, old clusters could be saved and new information used to adjust them to new circumstances. However, changing the clustering algorithms to work this way would not be trivial. Entirely new algorithms would be required.

Another important shortcoming preventing the general use of Seer's algorithms is that they use a great deal of memory to preserve information required to perform clustering. We currently do not fully understand how much of that data is key to creating good clusters. Answering that question will require (among other things) searching a parameter space for the best settings for some of Seer's internal parameters. Also, the system uses a number of secondary inputs to make its cluster and hoarding decisions. We know the Seer works much less well without all of them, but we do not know which are vital to good operation.

Principal Investigator: Dr. Peter Reiher

Back to: Travler main page
Last modified: July 6, 1998