Identifying Forensic Interesting Files in Digital Forensic Corpora by Applying Topic Modelling

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 127)

Abstract

Cyber forensics is an emerging field concerned with identifying the culprits behind a cyber-attack. To carry out an investigation, the investigator must identify the device, back up its data, and analyse it. As cybercrime increases, so do the number of seized devices and the volume of data they hold, and this mass of data delays investigations significantly. Many forensic investigators still rely on the traditional approach of regular expressions and keyword search to find evidence. In such traditional analysis, a query returns only the results that match it exactly and disregards everything else. The main disadvantage is that some sensitive files may not be returned by a query; in addition, all the data must be indexed before a query can be run, which takes considerable manual effort and time. To overcome this, this research proposes a two-tier forensic framework that introduces topic modelling to identify latent topics and words. Existing approaches used latent semantic indexing (LSI), which has a synonymy problem. To overcome this, this research introduces latent semantic analysis (LSA) to the digital forensics field and applies it to the authors' corpora, which contain 29.8 million files. The research yielded satisfactory results in terms of time and in identifying both uninteresting and interesting files. This paper also gives a fair comparison of forensic search techniques on digital corpora and shows that the proposed methodology outperforms them.
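
The contrast the abstract draws between exact keyword matching and latent-topic retrieval can be illustrated with a minimal sketch. This is not the authors' two-tier framework: the four-document toy corpus, the function names (keyword_search, build_term_doc_matrix, lsa_similarities), and the choice of k = 2 are all assumptions made for illustration, and the LSA step is a plain truncated SVD of a raw term-count matrix computed with NumPy.

```python
import numpy as np

# Toy corpus (hypothetical): two finance-like files and two holiday-like files.
docs = [
    "transfer funds offshore account",  # contains the query term
    "wire money offshore account",      # same topic, but no query term
    "beach holiday photos family",
    "family holiday photos album",
]


def keyword_search(query, docs):
    """Exact-match search: return indices of documents containing the token."""
    return [i for i, d in enumerate(docs) if query in d.split()]


def build_term_doc_matrix(docs):
    """Build a terms x documents matrix of raw term counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0
    return A, index


def lsa_similarities(query, docs, k=2):
    """Cosine similarity between the query and each document in a k-dim latent space."""
    A, index = build_term_doc_matrix(docs)
    # Truncated SVD: keep the k left singular vectors with the largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]

    # Represent the query as a term-count vector over the same vocabulary.
    q = np.zeros(A.shape[0])
    for w in query.split():
        if w in index:
            q[index[w]] += 1.0

    # Project the query and every document onto the latent axes.
    q_lat = Uk.T @ q
    d_lat = Uk.T @ A  # shape: k x number of documents

    sims = []
    for j in range(A.shape[1]):
        denom = np.linalg.norm(q_lat) * np.linalg.norm(d_lat[:, j])
        sims.append(float(q_lat @ d_lat[:, j] / denom) if denom > 0 else 0.0)
    return sims


if __name__ == "__main__":
    print("keyword hits:", keyword_search("funds", docs))
    print("LSA similarities:", lsa_similarities("funds", docs))
```

On this toy corpus the keyword search for "funds" returns only the first document, while the latent-space similarity also scores the second document highly because it shares vocabulary ("offshore", "account") with the first; the two holiday documents score near zero. A real forensic corpus would need tokenisation, term weighting, and a far larger k, none of which are specified in the abstract.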

Author information

Corresponding author

Correspondence to D. Paul Joseph.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Joseph, D.P., Norman, J. (2021). Identifying Forensic Interesting Files in Digital Forensic Corpora by Applying Topic Modelling. In: Tripathy, A., Sarkar, M., Sahoo, J., Li, KC., Chinara, S. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 127. Springer, Singapore. https://doi.org/10.1007/978-981-15-4218-3_40
