Action Recognition in Real-World Videos

Living reference work entry in Computer Vision

Synonyms

Detection and Localization; Event Recognition

Definition

The goal of human action recognition is to temporally or spatially localize a human action of interest in a video sequence. Temporal localization, i.e., identifying the start and end frames of the action in a video, is referred to as frame-level detection. Spatial localization, which is more challenging, means identifying the pixels within each action frame that correspond to the action; this setting is usually referred to as pixel-level detection. In this chapter, the terms action, activity, and event are used interchangeably.
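Both settings are commonly evaluated with intersection-over-union (IoU) between a prediction and the ground truth: over frame intervals for frame-level detection, and over binary pixel masks for pixel-level detection. The following is a minimal illustrative sketch of these two metrics (not code from this chapter); the function names and the list-based mask representation are our own choices for exposition.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) frame intervals (frame-level detection)."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0, min(pe, ge) - max(ps, gs))  # overlapping frames
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred_mask, gt_mask):
    """IoU of two equally sized binary pixel masks (pixel-level detection),
    each given as a 2D list of 0/1 values."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += p & g
            union += p | g
    return inter / union if union > 0 else 0.0


# A predicted action span of frames 10-50 against ground truth 30-70
# overlaps in 20 of 60 covered frames, giving an IoU of 1/3.
print(temporal_iou((10, 50), (30, 70)))
```

A detection is then typically counted as correct when its IoU with the ground truth exceeds a threshold (e.g., 0.5), which is how benchmark challenges such as THUMOS and ActivityNet score localization results.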

Background

The three main ingredients of action recognition research are visual features, machine learning methodology, and datasets. Recent years have witnessed tremendous growth in all three areas. Several new visual features have been proposed, ranging from handcrafted local and global...

Notes

  1. http://www.thumos.info/home.html

  2. https://research.google.com/ava/challenge.html

  3. https://epic-kitchens.github.io/2020

  4. http://vuchallenge.org/charades.html

Author information

Correspondence to Chen Chen.

Copyright information

© 2020 Springer Nature Switzerland AG

About this entry

Cite this entry

Sultani, W., Arshad, Q.A., Chen, C. (2020). Action Recognition in Real-World Videos. In: Computer Vision. Springer, Cham. https://doi.org/10.1007/978-3-030-03243-2_846-1

  • Print ISBN: 978-3-030-03243-2

  • Online ISBN: 978-3-030-03243-2