Definition
The goal of human action recognition is to temporally or spatially localize the human action of interest in video sequences. Temporal localization (i.e., indicating the start and end frames of the action in a video) is referred to as frame-level detection. Spatial localization, which is more challenging, means identifying the pixels within each action frame that correspond to the action; this setting is usually referred to as pixel-level detection. In this chapter, we use the terms action, activity, and event interchangeably.
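To make the frame-level setting concrete: a temporal detection is usually represented as a (start frame, end frame) interval, and a prediction is scored against the ground truth by temporal intersection-over-union (IoU). The following is a minimal sketch of that standard metric; the function name and the example interval values are illustrative, not from this entry.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal intervals (start_frame, end_frame)."""
    start = max(pred[0], gt[0])          # latest start of the two intervals
    end = min(pred[1], gt[1])            # earliest end of the two intervals
    inter = max(0, end - start)          # overlap length (0 if disjoint)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted segment is typically counted as correct when its IoU with a
# ground-truth segment of the same class exceeds a threshold such as 0.5.
print(temporal_iou((30, 90), (50, 120)))
```

Pixel-level detection generalizes this idea from 1-D intervals to per-frame spatial masks, with IoU computed over pixel sets instead of frame ranges.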
Background
Three main ingredients of action recognition research are visual features, machine learning methodology, and datasets. Recent years have witnessed tremendous research and development in all three areas. Several new visual features have been proposed, ranging from handcrafted local and global...
© 2020 Springer Nature Switzerland AG
Cite this entry
Sultani, W., Arshad, Q.A., Chen, C. (2020). Action Recognition in Real-World Videos. In: Computer Vision. Springer, Cham. https://doi.org/10.1007/978-3-030-03243-2_846-1
Print ISBN: 978-3-030-03243-2
Online ISBN: 978-3-030-03243-2
eBook Packages: Springer Reference Computer Sciences, Reference Module Computer Science and Engineering