Abstract
Many mining algorithms have been proposed for business big data such as market baskets, but they are neither effective nor efficient for mining DNA sequences, each of which typically has a small alphabet but a very long length. This paper designs a compact data structure called the Association Matrix and gives an algorithm specifically for mining long DNA sequences. The Association Matrix is a novel in-memory data structure that is compact enough to handle very long DNA sequences in limited memory space. Based on the Association Matrix structure, we design algorithms for efficiently mining key segments from DNA sequences. We also report our related experiments and results.
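The abstract does not define the Association Matrix, so the following is only an illustrative sketch of why a small alphabet permits a compact in-memory summary of a very long sequence. It assumes, purely for illustration, a 4 x 4 table counting how often one base is immediately followed by another; the name pair_count_matrix and this adjacency-count definition are hypothetical and are not taken from the paper.

```python
# Illustrative only: NOT the paper's Association Matrix definition.
# Assumption: summarize a DNA sequence by counting adjacent base pairs.
ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def pair_count_matrix(bases):
    """Return a 4x4 table where entry [x][y] counts base x directly followed by base y.

    `bases` may be any iterable of characters (for example a generator reading
    a file in chunks), so the full sequence never has to be held in memory.
    """
    size = len(ALPHABET)
    matrix = [[0] * size for _ in range(size)]
    prev = None
    for base in bases:
        if base not in INDEX:
            # Unknown symbol (e.g. 'N'): break the adjacency chain.
            prev = None
            continue
        if prev is not None:
            matrix[INDEX[prev]][INDEX[base]] += 1
        prev = base
    return matrix


matrix = pair_count_matrix("ACGTACGT")
for base, row in zip(ALPHABET, matrix):
    print(base, row)
```

The table has 16 entries whether the input has eight bases or several billion, which is the kind of size independence the abstract attributes to a compact structure over a small alphabet.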
Acknowledgements
I am deeply indebted to the National Natural Science Foundation of China (NSFC), whose funding under Grant No. 61773415 supported the research presented in this paper.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Mao, G. (2020). Association Matrix Method and Its Applications in Mining DNA Sequences. In: Ahram, T. (eds) Advances in Artificial Intelligence, Software and Systems Engineering. AHFE 2019. Advances in Intelligent Systems and Computing, vol 965. Springer, Cham. https://doi.org/10.1007/978-3-030-20454-9_15
DOI: https://doi.org/10.1007/978-3-030-20454-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20453-2
Online ISBN: 978-3-030-20454-9
eBook Packages: Engineering, Engineering (R0)