Feature Selection in Big Data: Trends and Challenges

Tiwari, Suman R.; Rana, Kaushik K.

doi:10.1007/978-981-15-4474-3_9

Conference paper
First Online: 18 June 2020

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 52)

Abstract

Big data is a term for data that is large in volume, velocity, and variety. Over time, this list of characteristics has been inflated to as many as 42 V’s. We focus our survey on feature selection in big data, because feature selection is one of the most widely used dimensionality reduction techniques. Feature selection eliminates irrelevant and redundant features from a dataset to improve classification performance. This paper covers big data characteristics, different feature selection methods, and current research challenges in feature selection. We observed that swarm intelligence techniques are the most popular methods among researchers for feature selection in big data. Further, we conclude that gray wolf optimization and particle swarm optimization are the algorithms researchers prefer most.
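
To make the idea of eliminating irrelevant and redundant features concrete, the following is a minimal sketch of a simple correlation-based filter. It is not the swarm intelligence approach surveyed in the paper, and it is not taken from it: the function names (`pearson`, `select_features`), the thresholds (`min_relevance`, `max_redundancy`), and the toy data are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(
    columns: List[Sequence[float]],
    labels: Sequence[float],
    min_relevance: float = 0.3,   # hypothetical threshold
    max_redundancy: float = 0.9,  # hypothetical threshold
) -> List[int]:
    """Return the indices of the features that are kept.

    A feature is treated as irrelevant if its absolute correlation with
    the labels is below min_relevance, and as redundant if its absolute
    correlation with an already kept feature exceeds max_redundancy.
    """
    relevance = [abs(pearson(col, labels)) for col in columns]
    # Visit the most relevant features first so they win over duplicates.
    order = sorted(range(len(columns)), key=lambda i: relevance[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if relevance[i] < min_relevance:
            continue  # irrelevant: carries little information about the class
        if any(abs(pearson(columns[i], columns[j])) > max_redundancy for j in kept):
            continue  # redundant: duplicates a feature that is already kept
        kept.append(i)
    return sorted(kept)


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    columns = [
        [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # informative feature
        [2.0, 2.4, 1.8, 6.2, 5.8, 6.0],  # scaled copy of the first feature
        [5.0, 1.0, 3.0, 3.0, 1.0, 5.0],  # uncorrelated with the labels
    ]
    print(select_features(columns, labels))
```

On the toy data, the scaled copy is dropped as redundant and the uncorrelated column as irrelevant. A filter like this scores features independently of any classifier; the swarm-based methods the abstract refers to instead search over feature subsets, typically at a higher computational cost.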

Author information

Authors and Affiliations

Computer Department, R.C. Technical Institute, Ahmedabad, Gujarat, India
Suman R. Tiwari
Vishwakarma Government Engineering College, Ahmedabad, Gujarat, India
Kaushik K. Rana

Corresponding author

Correspondence to Suman R. Tiwari.

Editor information

Editors and Affiliations

Faculty of Engineering, Symbiosis Institute of Technology, Pune, India
Ketan Kotecha
Department of Computer Science, Università degli Studi di Milano, Milan, Italy
Vincenzo Piuri
Gandhinagar Institute of Technology, Gandhinagar, Gujarat, India
Hetalkumar N. Shah
Department of Computer Engineering, Gandhinagar Institute of Technology, Gandhinagar, Gujarat, India
Rajan Patel

Cite this paper

Tiwari, S.R., Rana, K.K. (2021). Feature Selection in Big Data: Trends and Challenges. In: Kotecha, K., Piuri, V., Shah, H., Patel, R. (eds) Data Science and Intelligent Applications. Lecture Notes on Data Engineering and Communications Technologies, vol 52. Springer, Singapore. https://doi.org/10.1007/978-981-15-4474-3_9

DOI: https://doi.org/10.1007/978-981-15-4474-3_9
Published: 18 June 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-4473-6
Online ISBN: 978-981-15-4474-3
eBook Packages: Intelligent Technologies and Robotics (R0)
