
KNN-Based Pseudo-supervised RCNN Framework for Text Clustering

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1075)

Abstract

This paper explores the application of recurrent convolutional neural networks (RCNN) to text clustering, an unsupervised task in natural language processing (NLP). The RCNN is trained with pseudo-labels that are generated by pre-clustering on unsupervised document representations. To enhance the quality of pseudo-labels, the K-Nearest Neighbors (KNN) algorithm is used to select training samples for the neural network. After the deep feature representations of all documents have been obtained using the trained RCNN, the agglomerative hierarchical clustering (AHC) algorithm is used to cluster them. The experimental results on two public databases show that the proposed approach significantly boosts the performance of text clustering.
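
As a rough illustration of the pipeline described above, the following is a minimal sketch in Python. It is not the authors' implementation: scikit-learn TF-IDF and K-means stand in for the paper's unsupervised pre-clustering representation, a simple majority-vote rule over the k nearest neighbours stands in for the paper's KNN sample-selection step, and a placeholder replaces the trained RCNN feature extractor. All function and parameter names here are illustrative assumptions.

```python
# Sketch of the pseudo-supervised clustering pipeline (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors


def select_reliable_samples(features, pseudo_labels, k=10):
    """Keep a sample only if most of its k nearest neighbours share its
    pseudo-label (an assumed KNN filtering rule, not the paper's exact one)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)
    keep = []
    for i, neighbours in enumerate(idx):
        votes = pseudo_labels[neighbours[1:]]  # skip the sample itself
        if np.mean(votes == pseudo_labels[i]) > 0.5:
            keep.append(i)
    return np.array(keep)


def pseudo_supervised_clustering(documents, n_clusters):
    # 1. Unsupervised document representation (TF-IDF stands in for
    #    whatever representation the pre-clustering actually uses).
    X = TfidfVectorizer(max_features=5000).fit_transform(documents).toarray()

    # 2. Pre-clustering produces pseudo-labels.
    pseudo = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)

    # 3. KNN filtering keeps only samples whose neighbourhood agrees.
    reliable = select_reliable_samples(X, pseudo)
    train_docs = [documents[i] for i in reliable]  # would train the RCNN
    train_labels = pseudo[reliable]                # on these pseudo-labels

    # 4. The trained RCNN would embed every document; here the TF-IDF
    #    features are reused as a placeholder for the deep features.
    deep_features = X

    # 5. Final clustering with agglomerative hierarchical clustering (AHC).
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(deep_features)
```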

Author information

Correspondence to Zhi Chen or Wu Guo.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, Z., Guo, W. (2020). KNN-Based Pseudo-supervised RCNN Framework for Text Clustering. In: Liu, Y., Wang, L., Zhao, L., Yu, Z. (eds) Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery. ICNC-FSKD 2019. Advances in Intelligent Systems and Computing, vol 1075. Springer, Cham. https://doi.org/10.1007/978-3-030-32591-6_10
