Action Recognition in Real-World Videos

Living reference work entry in Computer Vision

Synonyms

Detection and Localization; Event Recognition

Definition

The goal of human action recognition is to temporally or spatially localize a human action of interest in a video sequence. Temporal localization, i.e., identifying the start and end frames of the action in a video, is referred to as frame-level detection. Spatial localization, which is more challenging, means identifying the pixels within each action frame that correspond to the action; this setting is usually referred to as pixel-level detection. In this chapter, the terms action, activity, and event are used interchangeably.
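Both settings are commonly evaluated with intersection-over-union (IoU) between a prediction and the ground truth: over frame intervals for frame-level detection, and over binary pixel masks for pixel-level detection. The following is a minimal illustrative sketch of these two metrics (not code from this chapter); the function names and the list-based mask representation are our own choices for exposition.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) frame intervals (frame-level detection)."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0, min(pe, ge) - max(ps, gs))  # overlapping frames
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred_mask, gt_mask):
    """IoU of two equally sized binary pixel masks (pixel-level detection),
    each given as a 2D list of 0/1 values."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += p & g
            union += p | g
    return inter / union if union > 0 else 0.0


# A predicted action span of frames 10-50 against ground truth 30-70
# overlaps in 20 of 60 covered frames, giving an IoU of 1/3.
print(temporal_iou((10, 50), (30, 70)))
```

A detection is then typically counted as correct when its IoU with the ground truth exceeds a threshold (e.g., 0.5), which is how benchmark challenges such as THUMOS and ActivityNet score localization results.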

Background

The three main ingredients of action recognition research are visual features, machine learning methodology, and datasets. Recent years have witnessed tremendous growth in all three areas. Several new visual features have been proposed, ranging from handcrafted local and global...

Notes

  1. http://www.thumos.info/home.html

  2. https://research.google.com/ava/challenge.html

  3. https://epic-kitchens.github.io/2020

  4. http://vuchallenge.org/charades.html

Author information

Correspondence to Chen Chen.

Copyright information

© 2020 Springer Nature Switzerland AG

About this entry

Cite this entry

Sultani, W., Arshad, Q.A., Chen, C. (2020). Action Recognition in Real-World Videos. In: Computer Vision. Springer, Cham. https://doi.org/10.1007/978-3-030-03243-2_846-1

  • Print ISBN: 978-3-030-03243-2

  • Online ISBN: 978-3-030-03243-2