Lexico-Semantic Model for Text Clustering Based on Hybrid Semantics and Graph Similarity
Abstract
This paper proposes a lexico-semantic model for text clustering that combines a lexical TF-IDF representation, semantic components at the local and global levels, and a hybrid measure of inter-document similarity. The semantic components are the ICAN model, which captures local associative relations within a document, and the PCAN model, which builds a global, corpus-based semantic space from word co-occurrence statistics. Clustering is performed by the iterative ICIC algorithm, which takes into account both the similarity of a document to the cluster centroid and its similarity to the nearest documents within the cluster. The experimental evaluation was carried out on two balanced corpora of news texts containing 7,200 and 4,800 documents. Baseline clustering algorithms were compared with the hybrid TF-IDF + ICAN and TF-IDF + PCAN models and with different operating modes of the ICIC algorithm. The results show that hybrid semantics improves clustering quality in terms of Accuracy, ARI, and NMI, with the best results achieved by combining PCAN with ICIC. The findings confirm the effectiveness of integrating lexical, semantic, and structural characteristics within a unified text clustering model.
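The abstract does not give the formulas behind the hybrid similarity or the ICIC assignment rule, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that the hybrid similarity is a weighted sum of two cosine similarities (lexical and semantic) and that a document is scored against a cluster by mixing its mean similarity to the cluster members (a stand-in for centroid similarity) with its similarity to the k nearest members. The function names and the parameters `alpha`, `beta`, and `k` are hypothetical.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return Xn @ Xn.T

def hybrid_similarity(lexical, semantic, alpha=0.5):
    """Assumed hybrid measure: weighted sum of lexical and semantic cosine."""
    return alpha * cosine_matrix(lexical) + (1.0 - alpha) * cosine_matrix(semantic)

def assignment_score(sim, doc, members, beta=0.5, k=1):
    """Assumed cluster score for `doc`: mean similarity to all other members
    (proxy for centroid similarity) mixed with the mean similarity to the
    k most similar members."""
    others = [m for m in members if m != doc]
    if not others:
        return 0.0  # an empty cluster offers no evidence in this sketch
    sims = np.sort(sim[doc, others])[::-1]  # descending order
    return beta * sims.mean() + (1.0 - beta) * sims[:k].mean()

def reassign(sim, labels, n_clusters, n_iter=10, beta=0.5, k=1):
    """Iteratively move each document to its best-scoring cluster."""
    labels = np.array(labels, dtype=int)
    for _ in range(n_iter):
        changed = False
        for doc in range(sim.shape[0]):
            scores = [
                assignment_score(sim, doc, np.flatnonzero(labels == c), beta, k)
                for c in range(n_clusters)
            ]
            best = int(np.argmax(scores))
            if best != labels[doc]:
                labels[doc] = best
                changed = True
        if not changed:  # stop once no document changes cluster
            break
    return labels

# Toy data (invented for illustration): four documents, two topics.
lexical = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 1]], dtype=float)
semantic = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
sim = hybrid_similarity(lexical, semantic, alpha=0.5)
print(reassign(sim, [0, 0, 0, 1], n_clusters=2))
```

In this toy run the third document starts in the wrong cluster and is moved next to the fourth one, because its nearest-neighbour similarity to that document outweighs its weak average similarity to the first two. In practice the lexical matrix would come from TF-IDF weighting and the semantic matrix from a co-occurrence model such as the ones named in the abstract.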
ISSN: 2307-8162