Virtual Guidance as a Mid-level Representation for Navigation

Hsuan-Kung Yang1, Tsung-Chih Chiang1*, Ting-Ru Liu1*, Chun-Wei Huang1*, Jou-Min Liu1*, and Chun-Yi Lee1

1 Elsa Lab, Department of Computer Science, National Tsing Hua University, Hsinchu City, Taiwan.

Abstract

In the context of autonomous navigation, effectively conveying abstract navigational cues to agents in dynamic environments poses challenges, particularly when the navigation information is multimodal. To address this issue, the paper introduces a novel technique termed "Virtual Guidance," which is designed to visually represent non-visual instructional signals. These visual cues, rendered as colored paths or spheres, are overlaid onto the agent's camera view, serving as easily comprehensible navigational instructions. We evaluate our proposed method through experiments in both simulated and real-world settings. In the simulated environments, our virtual guidance outperforms baseline hybrid approaches in several metrics, including adherence to planned routes and obstacle avoidance. Furthermore, we extend the concept of virtual guidance to transform text-prompt-based instructions into a visually intuitive format for real-world experiments. Our results validate the adaptability of virtual guidance and its efficacy in enabling policy transfer from simulated scenarios to real-world ones.

Virtual Guidance in Simulated Environments

To investigate the feasibility of virtual guidance as a form of mid-level representation, we developed a flexible framework based on the Unity engine and the Unity ML-Agents Toolkit. This framework, illustrated below, is fully configurable and generates the guidance signals. The agent's inputs are rendered as semantic segmentation maps, an effective mid-level representation that serves as the observation of a Deep Reinforcement Learning (DRL) agent. This design enables the exploration of diverse ways of presenting guidance signals to the agent, including the proposed virtual guidance schemes as well as vector-based approaches. The agent receives the guidance signals together with the semantic segmentation maps, and learns a policy that processes these inputs to reach its intended destination.
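Below is a minimal sketch, not the authors' code, of how a guidance signal can be composed with a semantic segmentation map into a single visual observation for the policy; all names and values (e.g., GUIDANCE_COLOR, the 96x96 resolution) are illustrative assumptions.

```python
# Hypothetical sketch: overlay a rendered guidance mask onto a semantic
# segmentation image to form the agent's observation.
import numpy as np

GUIDANCE_COLOR = np.array([255, 0, 255], dtype=np.uint8)  # assumed overlay color

def compose_observation(seg_map: np.ndarray, guidance_mask: np.ndarray) -> np.ndarray:
    """Paint guidance pixels (HxW boolean mask) over an HxWx3 segmentation image."""
    obs = seg_map.copy()
    obs[guidance_mask] = GUIDANCE_COLOR  # guidance pixels replace the underlying class colors
    return obs

# Example: a 96x96 segmentation frame with a straight guidance path down the middle.
seg = np.zeros((96, 96, 3), dtype=np.uint8)
mask = np.zeros((96, 96), dtype=bool)
mask[:, 46:50] = True
observation = compose_observation(seg, mask)  # fed to the DRL policy as an image-like tensor
```

In this form, the policy consumes a single image-like tensor, so no separate non-visual channel is required for the guidance signal.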

Virtual Guidance Schemes

  • Navigation Path

    In the first scheme, the navigation line obtained from the planning module is represented as a colored path on the semantic segmentation map. An example visualization is illustrated in the corresponding figure. Specifically, the navigation path is implemented as a 3D mesh in the simulated environments and projected onto the camera view plane. This rendered navigation path can be regarded as a rich and informative signal that carries both semantic and guidance information: it highlights the regions the DRL agent is permitted to traverse as well as the route leading to the target location.

  • Waypoint

    The second scheme generates a set of waypoints \( \mathcal{W} \) by segmenting the planned navigation trajectory. The waypoints serve as hints that direct the agent toward the destination. They are visualized as 3D virtual balls in the virtual environments and projected onto the camera image plane, as shown in the corresponding figure and in the projection sketch after this list. Unlike the first scheme, which uses a navigation path to provide dense and informative signals, this scheme presents the waypoints as sparse signals that the DRL agent must locate.
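Both schemes rely on projecting 3D guidance geometry (a path mesh or waypoint balls) onto the camera image plane. The following is a hedged sketch of that projection using a standard pinhole camera model; the intrinsic matrix, extrinsics, and image size are assumed values, not those used in the simulator.

```python
# Hypothetical sketch: project 3D waypoints (or sampled path vertices) expressed in
# world coordinates onto the camera image plane with a pinhole model.
import numpy as np

def project_points(points_world, K, R, t, width, height):
    """Return the pixel coordinates of the points that fall inside the image."""
    pts = np.asarray(points_world, dtype=float)
    cam = (R @ pts.T + t.reshape(3, 1)).T          # world frame -> camera frame
    cam = cam[cam[:, 2] > 0]                       # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                    # perspective divide
    in_view = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return uv[in_view]

# Example with an identity extrinsic pose and an assumed intrinsic matrix.
K = np.array([[80.0, 0.0, 48.0],
              [0.0, 80.0, 48.0],
              [0.0, 0.0, 1.0]])
waypoints = [[0.0, 0.0, 5.0], [1.0, 0.0, 10.0], [2.0, 0.0, 15.0]]
pixels = project_points(waypoints, K, np.eye(3), np.zeros(3), 96, 96)
```

The projected pixels are then drawn as a colored path (first scheme) or as spheres (second scheme) on top of the segmentation map.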

An Overview of the Real-World Framework

To further validate the applicability of the proposed virtual guidance scheme in real-world scenarios, we designed a dedicated real-world navigation task. The objectives of this task are twofold. First, it verifies that the pre-trained DRL agent's policy transfers effectively to real-world settings while following the virtual guidance. Second, it demonstrates that the virtual guidance scheme is flexible enough to accommodate not only trajectories generated by specific planning algorithms but also instructions from diverse sources, provided that these instructions can be translated into visual representations. This adaptability highlights a key advantage of virtual guidance: it eliminates the need for the agent to interpret inputs from multiple modalities.
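As an illustration of how instructions from arbitrary sources can be turned into the same visual format, the sketch below overlays guidance onto a real camera frame with OpenCV; it is an assumed rendering step, not the authors' real-world pipeline, and the colors, thicknesses, and pixel coordinates are placeholders.

```python
# Hypothetical sketch: draw virtual guidance (a path or waypoint spheres) onto a
# real RGB camera frame once the instructions have been converted to pixel coordinates.
import numpy as np
import cv2  # opencv-python

def draw_virtual_guidance(frame, pixel_points, as_path=True):
    """Overlay guidance on an HxWx3 BGR frame given Nx2 pixel coordinates."""
    pts = np.asarray(pixel_points, dtype=np.int32).reshape(-1, 1, 2)
    if as_path:
        cv2.polylines(frame, [pts], False, (255, 0, 255), 4)            # colored path
    else:
        for (u, v) in pts.reshape(-1, 2):
            cv2.circle(frame, (int(u), int(v)), 6, (255, 0, 255), -1)   # filled spheres
    return frame

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stands in for a live camera image
guided = draw_virtual_guidance(frame, [[320, 470], [330, 380], [360, 300]])
```

Because the overlaid frame has the same format as the simulated observations, the same policy can consume it without any modality-specific input head.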

Experimental Results

This section presents a comparison of different guidance schemes. The evaluation results indicate that the schemes incorporating virtual guidance (denoted as \( VG \)) consistently outperform the baseline scheme (denoted as \( Hybrid \)) in terms of SPL (Success weighted by Path Length), success rate, line following rate, and waypoint collecting rate, in both \( seen \) and \( unseen \) scenarios. This suggests that our vision-based guidance strategies can effectively offer informative navigational cues, and thereby alleviate the agent's burden of learning the correlation between visual observations and navigational instructions derived from different non-visual modalities.
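For reference, SPL here refers to Success weighted by Path Length; we assume the standard formulation of this metric rather than a paper-specific variant:

\[
\mathrm{SPL} \;=\; \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\;\ell_i)},
\]

where \( S_i \) is the binary success indicator of episode \( i \), \( \ell_i \) is the shortest-path distance from the start to the goal, and \( p_i \) is the length of the path the agent actually traveled.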

Performance metrics comprise SPL, success rate, line following rate, and waypoint collecting rate; failure cases comprise collision rate and out-of-bound rate.

| Environment | Guidance Scheme | Representation Form | SPL (%) | Success Rate (%) | Line Following Rate (%) | Waypoint Collecting Rate (%) | Collision Rate (%) | Out-of-Bound (%) |
|---|---|---|---|---|---|---|---|---|
| Seen | Hybrid (one-time) | {RGB, (r, θ)} | 44.88 | 46.93 | 36.35 | 26.36 | 26.65 | 26.43 |
| Seen | \( VG_{waypoint} \) (one-time) | RGB | 73.16 | 73.51 | 69.54 | 73.53 | 19.41 | 6.95 |
| Seen | \( VG_{path} \) (one-time) | RGB | 72.68 | 72.75 | 89.46 | N/A | 26.26 | 0.96 |
| Seen | \( VG_{path} \) (real-time) | RGB | 88.85 | 89.54 | N/A | N/A | 9.02 | 1.44 |
| Unseen | Hybrid (one-time) | {RGB, (r, θ)} | 18.96 | 20.86 | 35.33 | 20.33 | 37.02 | 42.12 |
| Unseen | \( VG_{waypoint} \) (one-time) | RGB | 57.69 | 58.18 | 67.77 | 66.06 | 30.11 | 11.13 |
| Unseen | \( VG_{path} \) (one-time) | RGB | 57.46 | 57.54 | 89.19 | N/A | 36.94 | 5.45 |
| Unseen | \( VG_{path} \) (real-time) | RGB | 81.48 | 82.74 | N/A | N/A | 15.08 | 1.99 |
@misc{yang2023virtual,
  title={Virtual Guidance as a Mid-level Representation for Navigation},
  author={Hsuan-Kung Yang and Tsung-Chih Chiang and Ting-Ru Liu and Chun-Wei Huang and Jou-Min Liu and Chun-Yi Lee},
  year={2023},
  eprint={2303.02731},
  archivePrefix={arXiv},
}