Auto-detection of Hot Topics in Mass Chinese Internet Information
Abstract
To overcome the weaknesses of traditional clustering strategies for topic detection and to enable automatic discovery of hot topics, we re-examine density-based clustering algorithms and propose a sub-cluster relation-based, multi-resolution density clustering algorithm (SRBMRClustering) that considers both the adjacency information of sub-clusters and the concept of relative density. To reduce computational complexity, we also propose a Web structure-based method for calculating text feature weights and a concept-feature extraction method, and we use a feature-based vector representation of news text to improve the textual representation and reduce the dimensionality of the feature space. Finally, we verify the algorithm on a Chinese news corpus from June-July 2012. The experimental results show that both the algorithm’s performance and its clustering quality improve notably.
Keywords
Hot topic auto-detection, Text preprocessing, Text clustering, Web structure-based text feature weight calculation, Concept feature extraction, Multi-resolution density clustering, Subcluster relation-based clustering, Feature-based news text vector representation, Feature space, Advanced Chinese word segmentation ICTCLAS system
DOI
10.12783/dtcse/cmsms2018/25261