Abstract
Big data refers to data that is large in volume, velocity, and variety. As the concept has expanded, these characteristics have been inflated to as many as 42 V’s. This survey focuses on feature selection in big data, since feature selection is one of the most widely used dimensionality reduction techniques. Feature selection eliminates irrelevant and redundant features from a dataset to improve classification performance. The paper covers big data characteristics, different feature selection methods, and current research challenges in feature selection. We observed that swarm intelligence techniques are the most popular methods among researchers for feature selection in big data. Further, we conclude that gray wolf optimization and particle swarm optimization are the algorithms researchers prefer most.
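As a rough illustration of what eliminating irrelevant and redundant features means in practice, the following is a minimal sketch of a correlation-based filter. It is not one of the swarm intelligence methods the survey highlights, and it is not taken from the paper: the function names, the toy dataset, and both thresholds (`min_relevance`, `max_redundancy`) are assumptions chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, min_relevance=0.3, max_redundancy=0.9):
    """Greedy filter: keep relevant features, skip those redundant with kept ones.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Relevance = absolute correlation between a feature and the class label.
    relevance = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    ranked = sorted(columns, key=lambda name: relevance[name], reverse=True)

    selected = []
    for name in ranked:
        if relevance[name] < min_relevance:
            continue  # irrelevant: carries little information about the label
        if any(abs(pearson(columns[name], columns[kept])) > max_redundancy
               for kept in selected):
            continue  # redundant: nearly duplicates an already selected feature
        selected.append(name)
    return selected


# Hypothetical toy dataset: one informative feature, one near-duplicate, one noise column.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "f_signal": [1.0, 2.0, 1.5, 5.0, 6.0, 5.5],
    "f_near_copy": [1.0, 2.0, 1.5, 5.0, 6.0, 4.0],
    "f_noise": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}

selected = select_features(columns, labels)
print(selected)
```

On this toy data only `f_signal` is kept: `f_noise` is uncorrelated with the label, and `f_near_copy` is dropped because it is highly correlated with the already selected `f_signal`. A swarm-based method would instead search over feature subsets and score each subset with a classifier, which this sketch does not attempt.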
References
Sudhakar Ilango S, Vimal S, Kaliappan M, Subbulakshmi P (2018) Optimization using artificial bee colony based clustering approach for big data. Cluster Comput 1–9
Devi DR, Sasikala SJ (2019) J Big Data 103. https://doi.org/10.1186/s40537-019-0267-3
Qiu J, Wu Q, Ding G, Xu Y, Feng S (2016) A survey of machine learning for big data processing. EURASIP J Adv Sig Process 1687–6180
Enterprise Big Data Framework. https://www.bigdataframework.org/data-types-structured-vs-unstructured-data/
Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2017) Centralized vs distributed feature selection methods based on data complexity measures. Knowl Based Syst 117:27–45
Brezočnik L, Fister I, Podgorelec V (2018) Swarm intelligence algorithms for feature selection: a review. Appl Sci 8(9):1521
Jena B, Gourisaria MK, Rautaray SS, Pandey M (2017) A survey work on optimization techniques utilizing map reduce framework in hadoop cluster. Int J Intell Syst Appl 9(4):61
Shafer T. The 42 v’s of big data and data science. https://www.elderresearch.com/company/blog/42-v-of-big-data
Marr B. How much data do we create every day? The mind-blowing stats everyone should read. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#24ee8aaa60ba
Zhang Y, Gong D, Cheng J (2017) Multi-objective particle swarm optimization approach for cost-based feature selection in classification. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 14:64–75
Gupta SL, Baghel A, Iqbal A (2019) Big data classification using scale-free binary particle swarm optimization. In: Harmony search and nature inspired optimization algorithms. Springer, Singapore, pp 1177–1187
Li J, Liu H (2017) Challenges of feature selection for big data analytics. IEEE Intell Syst 32:9–15
Ramírez-Gallego S, Lastra I, Martínez-Rego D, Bolón-Canedo V, Benítez JM, Herrera F, Alonso-Betanzos A (2017) Fast-mRMR: fast minimum redundancy maximum relevance algorithm for high-dimensional big data. Int J Intell Syst 32:134–152
Emary E, Zawbaa HM, Grosan C, Hassanien AE (2015) Feature subset selection approach by gray-wolf optimization. In: Afro-European conference for industrial advancement. Springer, Cham, pp 1–13
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2015) Distributed feature selection: an application to microarray data classification. Appl Soft Comput 30:136–150
Kawamura A, Chakraborty B (2017) A hybrid approach for optimal feature subset selection with evolutionary algorithms. In: IEEE 8th international conference on awareness science and technology (iCAST), pp 564–568
Emary E, Yamany W, Hassanien AE, Snasel V (2015) Multi-objective gray-wolf optimization for attribute reduction. Procedia Comput Sci 65:623–632
Majdi MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260:302–312
Mirjalili SZ, Mirjalili S, Saremi S, Faris H, Aljarah I (2018) Grasshopper optimization algorithm for multi-objective optimization problems. Appl Intell 48(4):805–820
Gu S, Cheng R, Jin Y (2018) Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Comput 22:811–822
Fong S, Yang X-S, Deb S (2013) Swarm search for feature selection in classification. In: 2013 IEEE 16th international conference on computational science and engineering, pp 902–909
Tripathi AK, Sharma K, Bala M (2018) A novel clustering method using enhanced grey wolf optimizer and mapreduce. Big Data Res 14:93–100
Weng J, Young D (2017) Some dimension reduction strategies for the analysis of survey data. J Big Data 4(1):1–19
Boyle T. Dealing with imbalanced data. https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM (2017) Machine learning with big data: challenges and approaches. IEEE Access 5:7776–7797
Rencberoglu E. Fundamental techniques of feature engineering for machine learning. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
Wang L (2017) Heterogeneous data and big data analytics. Autom Control Inf Sci 3(1):8–15
Devi SG, Sabrigiriraj M (2017) Swarm intelligent based online feature selection (OFS) and weighted entropy frequent pattern mining (WEFPM) algorithm for big data analysis. Cluster Comput 1–13
Rong M, Gong D, Gao X (2019) Feature selection and its use in big data: challenges, methods, and trends. IEEE Access 7:19709–19725
Seijo-Pardo B, Porto-Díaz I, Bolón-Canedo V, Alonso-Betanzos A (2017) Ensemble feature selection: homogeneous and heterogeneous approaches. Knowl Based Syst 118:124–139
Hossain MA, Jia X, Benediktsson JA (2016) One-class oriented feature selection and classification of heterogeneous remote sensing images. IEEE J Sel Top Appl Earth Obs Remote Sens 9(4):1606–1612
Chen L, Zhang D, Pan G, Ma X, Yang D, Kushlev K, Zhang W, Li S (2015) Bike sharing station placement leveraging heterogeneous urban open data. In: Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing. ACM, pp 571–575
Oliver J. Is big data enough for machine learning in cyber security? https://www.trendmicro.com/vinfo/us/security/news/security-technology/is-big-data-big-enough-for-machine-learning-in-cybersecurity
Faris H, Aljarah I, Al-Betar MA, Mirjalili S (2018) Grey wolf optimizer: a review of recent variants and applications. Neural Comput Appl 1–23
Hodge VJ, O’Keefe S, Austin J (2016) Hadoop neural network for parallel and distributed feature selection. Neural Netw 78:24–35
Emary E, Zawbaa HM, Hassanien AE (2016) Binary grey wolf optimization approaches for feature selection. Neurocomputing 172:371–381
Fong S, Wong R, Vasilakos A (2015) Accelerated PSO swarm search feature selection for data stream mining big data. IEEE Trans Serv Comput 9:33–45
Rezek IA, Roberts SJ (1998) Stochastic complexity measures for physiological signal analysis. IEEE Trans Biomed Eng 45(9):1186–1191
Cornejo FM, Zunino A, Murazzo M (2018) Job schedulers for machine learning and data mining algorithms distributed in hadoop. In: VI Jornadas de cloud computing & big data (JCC&BD), La Plata
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tiwari, S.R., Rana, K.K. (2021). Feature Selection in Big Data: Trends and Challenges. In: Kotecha, K., Piuri, V., Shah, H., Patel, R. (eds) Data Science and Intelligent Applications. Lecture Notes on Data Engineering and Communications Technologies, vol 52. Springer, Singapore. https://doi.org/10.1007/978-981-15-4474-3_9
DOI: https://doi.org/10.1007/978-981-15-4474-3_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-4473-6
Online ISBN: 978-981-15-4474-3
eBook Packages: Intelligent Technologies and Robotics (R0)