Abstract

Multi-modal scene parsing is an active topic in robotics and autonomous driving, since knowledge from different modalities can complement each other. Recently, the success of self-attention-based methods has demonstrated the effectiveness of capturing long-range dependencies. However, the high computational cost of self-attention severely limits its application to multi-modal fusion. To alleviate this problem, this paper designs a multi-modal additive-attention cross-fusion block (AC) and an efficient AC variant (EAC) to effectively capture global awareness among different modalities. Moreover, a simple yet efficient transformer-based trans-context block (TC) is also presented to incorporate contextual information. Based on the above components, we propose a lightweight hybrid cross-fusion network (HCFNet), which captures long-range dependencies across modalities while preserving local details. Finally, we conduct comprehensive experiments and analyses on both indoor (NYUv2-13, -40) and outdoor (Cityscapes-11) datasets. Experimental results show that the proposed HCFNet outperforms current state-of-the-art methods with mIoU scores of 66.9% and 51.5% on the NYUv2-13 and -40 class settings, respectively. Our model also achieves a competitive mIoU of 80.6% on the Cityscapes-11 dataset.
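
To make the efficiency motivation concrete, below is a minimal PyTorch sketch of additive-attention cross-fusion between two spatially aligned modality feature maps (e.g., RGB and depth). The module name, tensor shapes, and scoring layout are illustrative assumptions for exposition only, not the paper's exact AC/EAC design; the point is that MLP-based additive scoring with a pooled global context avoids the quadratic query-key product of standard self-attention.

```python
# Hypothetical sketch: additive-attention cross-fusion of two modalities.
# Assumed shapes and names are NOT taken from the paper.
import torch
import torch.nn as nn


class AdditiveCrossFusion(nn.Module):
    """Fuses modality B into modality A with additive (MLP-scored) attention.

    Scores come from a small linear head instead of pairwise query-key dot
    products, so the cost grows linearly with the number of tokens.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)   # projects modality A (query role)
        self.proj_b = nn.Linear(dim, dim)   # projects modality B (context role)
        self.score = nn.Linear(dim, 1)      # additive scoring head
        self.out = nn.Linear(dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (batch, tokens, dim), spatially aligned modalities.
        q = self.proj_a(feat_a)
        k = self.proj_b(feat_b)
        # Token-wise additive scores over modality B, normalized to weights.
        weights = torch.softmax(self.score(torch.tanh(k)), dim=1)   # (B, N, 1)
        # Pooled global context of modality B, broadcast onto modality A tokens.
        context = (weights * k).sum(dim=1, keepdim=True)            # (B, 1, dim)
        fused = feat_a + self.out(torch.tanh(q + context))
        return fused


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 128)    # e.g., flattened RGB feature map
    depth = torch.randn(2, 64, 128)  # e.g., flattened depth feature map
    fusion = AdditiveCrossFusion(dim=128)
    print(fusion(rgb, depth).shape)  # torch.Size([2, 64, 128])
```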
