Feature Selection in Big Data: Trends and Challenges

  • Conference paper

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 52)

Abstract

Big data refers to data that is large in volume, velocity, and variety. As the concept has broadened, these characteristics have been inflated to as many as 42 V’s. This survey focuses on feature selection in big data, since feature selection is one of the most widely used dimensionality reduction techniques. Feature selection eliminates irrelevant and redundant features from a dataset to improve classification performance. The paper covers big data characteristics, different feature selection methods, and current research challenges in feature selection. We observed that swarm intelligence techniques are the most popular methods among researchers for feature selection in big data. Further, we conclude that gray wolf optimization and particle swarm optimization are the algorithms researchers prefer most.
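
To make the idea concrete, the following is a minimal sketch of swarm-based feature selection, not a method taken from the paper. It assumes a binary particle swarm optimizer with a sigmoid transfer function, and it swaps in a correlation-based merit score (in the style of CFS) as a cheap stand-in for classifier accuracy. The toy data, the function names (`pearson`, `merit`, `binary_pso`), and the constants (inertia 0.7, acceleration 1.5, velocity clamp 4.0) are all illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def merit(mask, columns, labels):
    """Assumed fitness: rewards relevance to the label, penalizes redundancy."""
    chosen = [c for c, keep in zip(columns, mask) if keep]
    k = len(chosen)
    if k == 0:
        return -1.0  # an empty subset is always the worst choice
    r_cf = sum(abs(pearson(c, labels)) for c in chosen) / k
    pairs = [(a, b) for i, a in enumerate(chosen) for b in chosen[i + 1:]]
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def binary_pso(columns, labels, n_particles=10, n_iters=30, seed=0):
    """Search over keep/drop masks; returns (best_mask, best_score)."""
    rng = random.Random(seed)
    dim = len(columns)
    pos = [[rng.random() < 0.5 for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [merit(p, columns, labels) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (0.7 * vel[i][d]
                     + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                     + 1.5 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp the velocity
                # Sigmoid transfer: velocity becomes the probability of keeping the feature
                pos[i][d] = rng.random() < 1.0 / (1.0 + math.exp(-vel[i][d]))
            f = merit(pos[i], columns, labels)
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Toy data (hypothetical): one informative feature, a rescaled copy of it
# (redundant), and pure noise (irrelevant).
data_rng = random.Random(1)
labels = [i % 2 for i in range(40)]
informative = [y + data_rng.gauss(0, 0.3) for y in labels]
duplicate = [2 * x for x in informative]
noise = [data_rng.gauss(0, 1) for _ in labels]
columns = [informative, duplicate, noise]

best_mask, best_score = binary_pso(columns, labels)
print(best_mask, round(best_score, 3))
```

In a realistic setting the merit function would typically be replaced by the cross-validated accuracy of a classifier trained on the selected features, which is far more expensive to evaluate and is one reason distributed frameworks come up in this literature.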

Author information

Corresponding author

Correspondence to Suman R. Tiwari.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tiwari, S.R., Rana, K.K. (2021). Feature Selection in Big Data: Trends and Challenges. In: Kotecha, K., Piuri, V., Shah, H., Patel, R. (eds) Data Science and Intelligent Applications. Lecture Notes on Data Engineering and Communications Technologies, vol 52. Springer, Singapore. https://doi.org/10.1007/978-981-15-4474-3_9
