Pattern and Concept/Topic Discovery and Mining in Large-scale Streamed Information by Seokkyung Chung

As the Web continues to grow as a vehicle for the distribution of information, many news organizations are providing newswire services through the Internet. Given the popularity of Web news services, the primary goal of this research is to develop methodologies for mining news streams. Toward this end, we have been developing topic mining, which identifies useful patterns (e.g., metadata, topics, and events that are instances of topics) in news streams. Although we have worked on the news domain, the topic mining framework can be extended to other kinds of information streams (e.g., emails).

Web news articles are composed of hyperlinks, audio, video, images, and text. However, since not all news stories have corresponding multimedia data, text is the most consistently available source of information about the news. Because text is unstructured data, efficient text mining and access methods are required to obtain the valuable knowledge embedded in text documents. Therefore, to build a novel framework for intelligent news database management and navigation, we utilize techniques from information retrieval, data mining, machine learning, and natural language processing.

The above figure illustrates the main parts of the proposed framework. Topic mining is composed of four components: information gathering, information preprocessing, information analysis, and information presentation. In the information gathering stage, a Web crawler retrieves a set of news documents from a news Web site (e.g., CNN). Because developing an intelligent Web crawler is a separate research area and not our main focus, we implement a simple Web spider that downloads news articles from a news Web site on a daily basis. The retrieved documents are processed by data mining tools to produce useful higher-level knowledge (e.g., a document hierarchy and a topic ontology), which is stored in a content description database. Instead of interacting with a Web news service directly, an information delivery agent can exploit the knowledge in the database to answer a user request.
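The four stages can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the prototype's code: the names `gather`, `preprocess`, `analyze`, `present`, and `ContentDatabase` are hypothetical, the "crawl" is replaced by two made-up headlines, and the analysis step is reduced to keeping each document's most frequent terms.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Tiny illustrative stopword list (an assumption, not from the project).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "for"}

@dataclass
class ContentDatabase:
    """Hypothetical stand-in for the content description database."""
    top_terms: dict = field(default_factory=dict)  # doc id -> key terms

def gather():
    """Information gathering: a real spider would download articles daily.
    Here two invented headlines stand in for the crawl result."""
    return {
        "d1": "Election results: election officials certify the count",
        "d2": "Storm damage reported as storm moves along the coast",
    }

def preprocess(text):
    """Information preprocessing: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def analyze(docs, k=3):
    """Information analysis: keep the k most frequent terms per document."""
    db = ContentDatabase()
    for doc_id, text in docs.items():
        counts = Counter(preprocess(text))
        db.top_terms[doc_id] = [term for term, _ in counts.most_common(k)]
    return db

def present(db, query):
    """Information presentation: answer from the database,
    without contacting the news Web site again."""
    wanted = set(preprocess(query))
    return sorted(d for d, terms in db.top_terms.items() if wanted & set(terms))

db = analyze(gather())
print(present(db, "storm"))     # ['d2']
print(present(db, "election"))  # ['d1']
```

The point of the design is the separation in the last function: once the analysis results are stored, a request is answered from the database alone.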

Current capabilities on topic mining from news stream datasets include the following:

• Efficient incremental hierarchical news document clustering. Because several hundred news articles are published every day at a single Web news site, coping with such a dynamic environment requires efficient incremental data mining algorithms. Despite the large body of research on document clustering, little work has addressed incremental hierarchical clustering of news documents. Our clustering algorithm, which is based on neighborhood search, has several key advantages: scalability with high dimensionality, the capability to discover clusters of different shapes and sizes, and the ability to provide succinct descriptions of clusters.

• Topic detection and tracking. Due to the overwhelming amount of information involved, it is crucial to provide an intelligent agent that can identify novel information and track related information for a user. Given a stream of news articles, topic mining identifies whether a new document belongs to an existing topic or starts a new one. Topic mining also tracks events of interest based on a sample news story. For example, it associates incoming news stories with related stories that were discussed earlier, and it can monitor the news stream for further stories on the same topic.

• Topic ontology learning from a news stream. To achieve rich semantic information retrieval, metadata (e.g., ontological information) should be employed. Since manually building and maintaining such metadata is nearly impossible, we developed a prototype system for learning topic ontologies. A topic ontology is a collection of concepts and relations. A concept can be viewed as a set of terms that characterize a topic. We employ two generic kinds of relations, specialization and generalization. The former is useful when refining a query, while the latter can be used to generalize the query in order to increase recall or broaden the search.
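The detection step described in the list above (does an incoming story belong to an existing topic or start a new one?) can be sketched as follows. This is a deliberately simplified substitute and not the neighborhood-search algorithm developed in this project: it is a flat, single-pass scheme that compares each new document with topic centroids by cosine similarity. The class name `TopicTracker`, the threshold value, and the sample headlines are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TopicTracker:
    """Single-pass incremental topic assignment (illustrative only)."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value, would need tuning
        self.centroids = []         # one summed term-count vector per topic

    def add(self, text):
        """Return (topic_index, is_new_topic) for an incoming document."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Existing topic: fold the document into its centroid.
            self.centroids[best].update(vec)
            return best, False
        # No topic is similar enough: the document starts a new topic.
        self.centroids.append(vec)
        return len(self.centroids) - 1, True

tracker = TopicTracker()
print(tracker.add("earthquake hits coastal city"))             # (0, True)
print(tracker.add("earthquake damage in coastal city grows"))  # (0, False)
print(tracker.add("central bank raises interest rates"))       # (1, True)
```

Each document is processed once and only the centroids are kept, which is what makes the scheme incremental. It does not build a hierarchy and, unlike the algorithm described above, it favors roughly spherical clusters.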

The above figure shows a possible outcome of topic mining. Each node in a document cluster hierarchy can be associated with a set of terms, which is referred to as a topic ontology node. As shown, topic ontologies can characterize a news topic at multiple levels of generality.

An experimental prototype system has been developed, implemented, and tested to demonstrate the effectiveness of the topic mining framework. The results show that the proposed clustering algorithm produces a high-quality document cluster hierarchy, and that the obtained topic ontology provides an interpretation of the news topics at different levels of abstraction.

One possible application of topic mining is Web search. For example, the incremental document clustering algorithm can be applied to a stream of Web pages returned by a search engine. Since topic mining can build a document cluster hierarchy incrementally, a user can browse the hierarchy instead of examining a flat list of documents. In addition, topic ontologies can be used to suggest alternative query terms to refine the query.
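As a rough illustration of query-term suggestion, the sketch below models a topic ontology as a tree in which each node holds a set of terms. Specialization follows child nodes and generalization follows the parent node. The data structure, the function names, and the sample terms are hypothetical and chosen only for this example. They are not taken from the prototype.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class TopicNode:
    """One topic ontology node: a set of terms characterizing a topic."""
    terms: Set[str]
    parent: Optional["TopicNode"] = None
    children: List["TopicNode"] = field(default_factory=list)

    def add_child(self, terms):
        child = TopicNode(set(terms), parent=self)
        self.children.append(child)
        return child

def find(node, term):
    """Depth-first search for the first node whose term set contains term."""
    if term in node.terms:
        return node
    for child in node.children:
        hit = find(child, term)
        if hit is not None:
            return hit
    return None

def specialize(root, term):
    """Suggest narrower terms: the terms of the matching node's children."""
    node = find(root, term)
    if node is None:
        return set()
    narrower = set()
    for child in node.children:
        narrower |= child.terms
    return narrower - node.terms

def generalize(root, term):
    """Suggest broader terms: the terms of the matching node's parent."""
    node = find(root, term)
    if node is None or node.parent is None:
        return set()
    return node.parent.terms - node.terms

# Invented example hierarchy.
root = TopicNode({"politics"})
election = root.add_child({"election", "vote"})
election.add_child({"primary", "caucus"})
election.add_child({"recount", "ballot"})

print(sorted(specialize(root, "election")))  # ['ballot', 'caucus', 'primary', 'recount']
print(sorted(generalize(root, "election")))  # ['politics']
```

A search interface could offer the narrower terms when a query returns too many results and the broader terms when it returns too few.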

Please see our recent presentation at ODBASE, 2003.

© 2000-2013 Semantic Information Research Laboratory. All Rights Reserved.