Korean-Optimized Word Representations for Out-of-Vocabulary Problems Caused by Misspelling Using Sub-character Information

Kim, Seonhghyun; Kim, Jai-Eun; Hawang, Seokhyun; Ivan, Berlocher; Yang, Seung-Won

doi:10.1007/978-3-030-12385-7_3

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 70)

Included in the following conference series: Future of Information and Communication Conference

Abstract

In this paper, we propose Korean-optimized word representations that better address the out-of-vocabulary (OOV) problem caused by misspelling, an important issue in many natural language processing applications. Previous models do not fully consider the representations of misspelled OOV words. To overcome this limitation, we use sub-character information obtained from Korean Jamo units and adopt additional sub-character information to make the representations more robust to misspelling. Experimental results show that our model is about 2.3 times more accurate than the conventional model on misspelled words while still maintaining the semantic relationships between words.
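To illustrate why Jamo-level units can help with misspellings, the following is a minimal sketch and not the authors' implementation. It assumes that sub-character information means decomposing each Hangul syllable into its Jamo via Unicode NFD normalization, and that subword features are character n-grams with boundary markers, as in common subword embedding models. The function names (`to_jamo`, `char_ngrams`, `jaccard`), the n-gram range, and the example word pair are hypothetical choices made for illustration.

```python
import unicodedata


def to_jamo(word: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining Jamo.

    Assumption: Unicode NFD normalization stands in for the paper's
    Jamo decomposition step.
    """
    return unicodedata.normalize("NFD", word)


def char_ngrams(seq: str, n_min: int = 3, n_max: int = 5) -> list[str]:
    """Return all n-grams of length n_min..n_max over '<' + seq + '>'."""
    padded = "<" + seq + ">"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard overlap between two n-gram collections (as sets)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical example: a word and a one-vowel misspelling of it.
correct, misspelled = "안녕", "안넝"

# Syllable-level n-grams: the changed syllable breaks every n-gram.
syllable_overlap = jaccard(char_ngrams(correct), char_ngrams(misspelled))

# Jamo-level n-grams: only n-grams touching the changed vowel differ.
jamo_overlap = jaccard(
    char_ngrams(to_jamo(correct)), char_ngrams(to_jamo(misspelled))
)

print(f"syllable-level overlap: {syllable_overlap:.2f}")
print(f"jamo-level overlap:     {jamo_overlap:.2f}")
```

In this sketch the misspelled word shares no syllable-level n-grams with the correct word, while several Jamo-level n-grams survive, which is the kind of shared evidence a subword model could use to place the misspelled word near the correct one.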

Notes

1. http://www.adams.ai/apiPage?tms=pos

Acknowledgements

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2013-0-00109, WiseKB: Big data based self-evolving knowledge base and reasoning platform).

Author information

Authors and Affiliations

AI Labs, Saltlux Inc., Seoul, Republic of Korea
Seonhghyun Kim, Jai-Eun Kim, Seokhyun Hawang, Berlocher Ivan & Seung-Won Yang

Corresponding author

Correspondence to Seonhghyun Kim.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai
The Science and Information (SAI) Organization, Bradford, UK
Rahul Bhatia

About this paper

Cite this paper

Kim, S., Kim, JE., Hawang, S., Ivan, B., Yang, SW. (2020). Korean-Optimized Word Representations for Out-of-Vocabulary Problems Caused by Misspelling Using Sub-character Information. In: Arai, K., Bhatia, R. (eds) Advances in Information and Communication. FICC 2019. Lecture Notes in Networks and Systems, vol 70. Springer, Cham. https://doi.org/10.1007/978-3-030-12385-7_3

DOI: https://doi.org/10.1007/978-3-030-12385-7_3
Published: 02 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12384-0
Online ISBN: 978-3-030-12385-7
eBook Packages: Intelligent Technologies and Robotics (R0)
