
3M2RNet: Multi-Modal Multi-Resolution Refinement Network for Semantic Segmentation

Conference paper. In: Advances in Computer Vision (CVC 2019).

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 944).

Abstract

Semantic segmentation of images is one of the most important steps toward 3D scene understanding, which in turn is a crucial requirement in computer vision and robotic applications. With the availability of RGB-D cameras, the accuracy of scene understanding can be improved by exploiting depth alongside appearance features. A central problem in RGB-D semantic segmentation is how to fuse the two modalities so as to benefit from both their common and their modality-specific features. Recently, methods based on deep convolutional neural networks have achieved state-of-the-art results in dense prediction; such networks typically serve as both feature extractors and classifiers within an end-to-end training procedure. In this paper, an efficient multi-modal multi-resolution refinement network is proposed to exploit the two modalities (RGB and depth) as fully as possible. The network is an encoder-decoder architecture with two separate encoder branches and one decoder stream. The down-sampling operations that produce abstract feature representations in the encoder branches cause a loss of spatial resolution, which the decoder must compensate for. In the modality fusion process, a weighted fusion of the “clean” information paths at each resolution level of the two encoders is carried out via skip connections with the aid of the identity mapping function. Extensive experiments on three challenging datasets, NYU-V2, SUN RGB-D, and Stanford 2D-3D-S, show that the proposed network achieves state-of-the-art results.
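To make the fusion scheme concrete, below is a minimal PyTorch sketch of the two-encoder, one-decoder idea with a weighted, identity-mapped skip connection at each resolution level. It is not the authors' implementation: the `TwoBranchSegNet` class is hypothetical, and the encoder depth and channel widths, the single-channel depth input, the learnable scalar fusion weights, and the bilinear upsampling in the decoder are all illustrative assumptions.

```python
# Minimal sketch of a two-encoder / one-decoder RGB-D fusion network.
# NOT the paper's architecture: stage widths, scalar fusion weights,
# and the bilinear-upsampling decoder are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    """3x3 conv + BN + ReLU; a stand-in for a real encoder/decoder stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TwoBranchSegNet(nn.Module):
    def __init__(self, num_classes, widths=(64, 128, 256)):
        super().__init__()
        # Two separate encoder branches: RGB (3 channels) and depth (1 channel).
        self.rgb_stages = nn.ModuleList()
        self.depth_stages = nn.ModuleList()
        in_rgb, in_depth = 3, 1
        for w in widths:
            self.rgb_stages.append(conv_block(in_rgb, w))
            self.depth_stages.append(conv_block(in_depth, w))
            in_rgb = in_depth = w
        # One learnable scalar per resolution level for the RGB/depth
        # weighting (an assumption; the paper's weighting may differ).
        self.alphas = nn.Parameter(torch.full((len(widths),), 0.5))
        # Decoder stages that refine coarse features back to full resolution.
        self.dec_stages = nn.ModuleList(
            conv_block(widths[i], widths[i - 1])
            for i in range(len(widths) - 1, 0, -1)
        )
        self.classifier = nn.Conv2d(widths[0], num_classes, 1)

    def forward(self, rgb, depth):
        fused = []  # weighted fusion of the two "clean" paths per level
        x_r, x_d = rgb, depth
        for i, (sr, sd) in enumerate(zip(self.rgb_stages, self.depth_stages)):
            x_r, x_d = sr(x_r), sd(x_d)
            a = torch.sigmoid(self.alphas[i])  # keep the weight in (0, 1)
            fused.append(a * x_r + (1.0 - a) * x_d)
            if i < len(self.rgb_stages) - 1:
                # Down-sampling causes the resolution loss the decoder
                # later has to compensate for.
                x_r = F.max_pool2d(x_r, 2)
                x_d = F.max_pool2d(x_d, 2)
        # Decoder: refine and upsample, then add the fused skip connection
        # as an untransformed (identity) path.
        y = fused[-1]
        for dec, skip in zip(self.dec_stages, reversed(fused[:-1])):
            y = F.interpolate(dec(y), size=skip.shape[-2:],
                              mode="bilinear", align_corners=False)
            y = y + skip  # identity-mapping skip connection
        return self.classifier(y)


# Usage: one RGB image and its aligned depth map.
net = TwoBranchSegNet(num_classes=40)  # e.g. a 40-class indoor labeling task
logits = net(torch.randn(1, 3, 240, 320), torch.randn(1, 1, 240, 320))
print(logits.shape)  # torch.Size([1, 40, 240, 320])
```

In this sketch the fused skip is added to the upsampled decoder features without any further transformation, mirroring the role of the identity mapping function mentioned in the abstract; the refinement blocks of the actual network are more elaborate.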



Author information

Correspondence to Fahimeh Fooladgar.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Fooladgar, F., Kasaei, S. (2020). 3M2RNet: Multi-Modal Multi-Resolution Refinement Network for Semantic Segmentation. In: Arai, K., Kapoor, S. (eds) Advances in Computer Vision. CVC 2019. Advances in Intelligent Systems and Computing, vol 944. Springer, Cham. https://doi.org/10.1007/978-3-030-17798-0_44
