Abstract
This paper explores the application of recurrent convolutional neural networks (RCNN) to text clustering, an unsupervised task in natural language processing (NLP). The RCNN is trained with pseudo-labels that are generated by pre-clustering on unsupervised document representations. To enhance the quality of pseudo-labels, the K-Nearest Neighbors (KNN) algorithm is used to select training samples for the neural network. After the deep feature representations of all documents have been obtained using the trained RCNN, the agglomerative hierarchical clustering (AHC) algorithm is used to cluster them. The experimental results on two public databases show that the proposed approach significantly boosts the performance of text clustering.
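The KNN-based sample selection described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name `knn_filter`, the agreement threshold, and the toy feature vectors are all hypothetical, and a real pipeline would run this filter on unsupervised document representations before training the RCNN.

```python
import numpy as np

def knn_filter(features, pseudo_labels, k=3, agreement=0.5):
    """Keep samples whose k nearest neighbours (self excluded) agree with
    the sample's own pseudo-label at least `agreement` fraction of the
    time. Returns the indices of the retained samples."""
    n = len(features)
    # pairwise Euclidean distance matrix
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour
    keep = []
    for i in range(n):
        nn = np.argsort(dist[i])[:k]  # indices of the k nearest neighbours
        frac = np.mean(pseudo_labels[nn] == pseudo_labels[i])
        if frac >= agreement:
            keep.append(i)
    return np.array(keep)

# Two well-separated groups; sample 2 carries a noisy pseudo-label.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 1, 1, 1, 1])
kept = knn_filter(X, y, k=2)
print(kept)  # sample 2 is filtered out of the training set
```

The filtered subset would then serve as the pseudo-supervised training data for the RCNN, after which all documents are re-embedded and clustered with AHC.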
© 2020 Springer Nature Switzerland AG
Chen, Z., Guo, W. (2020). KNN-Based Pseudo-supervised RCNN Framework for Text Clustering. In: Liu, Y., Wang, L., Zhao, L., Yu, Z. (eds) Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery. ICNC-FSKD 2019. Advances in Intelligent Systems and Computing, vol 1075. Springer, Cham. https://doi.org/10.1007/978-3-030-32591-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32590-9
Online ISBN: 978-3-030-32591-6