The Development Trends and Applications of Image Caption

MA Qianxia, LI Pinjie, SONG Jingyan, ZHANG Tao (Tsinghua University)

Contents
1 Introduction
2 Overview
3 Datasets for Research
4 Research Methods
4.1 Methods Based on Object Recognition and Attribute Detection
4.2 Models Based on Multiple-Instance Learning
4.3 Encoder-Decoder Models
4.4 Attention Models
4.5 Reinforcement Learning Models
4.6 Generative Adversarial Models
4.7 Hybrid Models
5 Applications
6 Development Trends
6.1 End-to-End Learning
6.2 High-Level Semantics
6.3 From Chain Structures to Hierarchical Structures
6.4 Attention Mechanisms
6.5 Unified Architectures
7 Conclusion

Cite this article: MA Qianxia, LI Pinjie, SONG Jingyan, et al. The Development Trends and Applications of Image Caption[J]. Unmanned Systems Technology, 2020, 3(6): 25-35.

About the authors:
MA Qianxia (b. 1992), female, PhD candidate. Research interests: computer vision and multimodal machine learning.
LI Pinjie (b. 1995), female, master's student. Research interests: complex systems and intelligent architectures.
SONG Jingyan (b. 1964), male, PhD, professor. Research interests: control theory and control models for intelligent transportation systems, intelligent control, and space robotics.
ZHANG Tao (b. 1969), male, PhD, professor. Research interests: intelligent control theory and applications, robotics, and intelligent system modeling. Corresponding author of this article.