In the context of autonomous navigation, effectively conveying abstract navigational cues to agents in dynamic environments poses challenges, particularly when the navigation information is multimodal. To address this issue, the paper introduces a novel technique termed "Virtual Guidance," which is designed to visually represent non-visual instructional signals. These visual cues, rendered as colored paths or spheres, are overlaid onto the agent's camera view, serving as easily comprehensible navigational instructions. We evaluate our proposed method through experiments in both simulated and real-world settings. In the simulated environments, our virtual guidance outperforms baseline hybrid approaches in several metrics, including adherence to planned routes and obstacle avoidance. Furthermore, we extend the concept of virtual guidance to transform text-prompt-based instructions into a visually intuitive format for real-world experiments. Our results validate the adaptability of virtual guidance and its efficacy in enabling policy transfer from simulated scenarios to real-world ones.
To investigate the feasibility of virtual guidance as a form of mid-level representation, we developed a flexible framework based on the Unity engine and the Unity ML-Agents Toolkit. This framework, illustrated below, is fully configurable and enables the generation of guidance signals. The agent's input observations are rendered as semantic segmentation maps, an effective mid-level representation that can serve as the input observation of a Deep Reinforcement Learning (DRL) agent. This design enables the exploration of diverse ways of presenting guidance signals to the agent, including both virtual guidance and vector-based approaches. The agent receives the guidance signals along with the semantic segmentation maps, and is tasked with processing these inputs to learn a policy that reaches its intended destination.
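As a concrete illustration, one way to combine the two inputs is to render the guidance signal directly into the segmentation map, so the agent consumes a single image-like observation. The sketch below is a hypothetical helper, not code from the framework: the function name `build_observation` and the reserved guidance class id are assumptions for illustration.

```python
import numpy as np

def build_observation(seg_map: np.ndarray, guidance_mask: np.ndarray,
                      guidance_class: int = 255) -> np.ndarray:
    """Overlay a rendered guidance mask onto a semantic segmentation map.

    seg_map:       (H, W) array of semantic class indices.
    guidance_mask: (H, W) boolean array, True wherever the projected
                   guidance (path or waypoint) covers a pixel.
    guidance_class: class id reserved for the guidance signal (assumed).
    """
    obs = seg_map.copy()
    obs[guidance_mask] = guidance_class  # guidance gets its own class id
    return obs

# Toy example: a 4x4 segmentation map with a short guidance segment.
seg = np.zeros((4, 4), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[2, 1:3] = True
obs = build_observation(seg, mask)
```

Folding the guidance into the same pixel grid as the segmentation map is what keeps the agent's input unimodal, which is the central design choice of the framework.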
In the first scheme, the navigation line obtained from the planning module is represented as a colored path on the semantic segmentation map, as illustrated in the corresponding figure. Specifically, the navigation path is implemented as a 3D mesh in the simulated environments and projected onto the camera view plane. The rendered path serves as a rich and informative signal that carries both semantic and guidance information: it highlights the regions the DRL agent is permitted to traverse as well as the route leading to the target location.
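Projecting the 3D navigation mesh onto the camera view plane amounts to a standard pinhole-camera projection. The sketch below is a minimal illustration, not the framework's rendering code (Unity's own camera performs this step in practice); it assumes path points already expressed in camera coordinates and hypothetical intrinsics `fx`, `fy`, `cx`, `cy`.

```python
import numpy as np

def project_points(points_cam: np.ndarray, fx: float, fy: float,
                   cx: float, cy: float) -> np.ndarray:
    """Project 3D points (camera frame, z forward) onto the image plane.

    points_cam: (N, 3) array of points in camera coordinates.
    Returns an (N, 2) array of (u, v) pixel coordinates.
    """
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * x / z + cx  # perspective divide plus principal-point offset
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# Two sample points on a path, 2 m in front of the camera.
path_cam = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
pixels = project_points(path_cam, fx=100.0, fy=100.0, cx=64.0, cy=64.0)
```

Rasterizing the projected segments onto the segmentation map then yields the colored path the agent observes.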
The second scheme generates a set of waypoints \( \mathcal{W} \) by segmenting the planned navigation trajectory. These waypoints serve as hints that guide the agent toward the destination. They are visualized as 3D virtual balls in the virtual environments and projected onto the camera image plane, as shown in the corresponding figure. Unlike the first scheme, which employs a navigation path to provide dense and informative signals, the second scheme presents waypoints as 3D virtual balls, which constitute sparse signals that the DRL agent must locate.
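Segmenting a planned trajectory into waypoints can be done, for instance, by resampling the planned polyline at a fixed arc-length interval. The helper below is a hypothetical sketch (the text does not specify the exact segmentation rule); the `spacing` parameter controls how sparse the resulting signal is.

```python
import numpy as np

def sample_waypoints(path: np.ndarray, spacing: float) -> np.ndarray:
    """Resample a planned trajectory into waypoints ~`spacing` apart.

    path:    (N, 2) or (N, 3) polyline from the planning module.
    spacing: desired arc-length distance between consecutive waypoints.
    """
    seg = np.diff(path, axis=0)                    # per-segment vectors
    seglen = np.linalg.norm(seg, axis=1)           # per-segment lengths
    s = np.concatenate([[0.0], np.cumsum(seglen)]) # cumulative arc length
    targets = np.arange(spacing, s[-1], spacing)
    waypoints = [path[0]]
    for t in targets:
        i = np.searchsorted(s, t) - 1              # segment containing t
        alpha = (t - s[i]) / seglen[i]             # interpolation factor
        waypoints.append(path[i] + alpha * seg[i])
    waypoints.append(path[-1])                     # always keep the goal
    return np.array(waypoints)

# A straight 10 m path resampled every 2.5 m yields 5 waypoints.
wps = sample_waypoints(np.array([[0.0, 0.0], [10.0, 0.0]]), spacing=2.5)
```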
To further validate the applicability of the proposed virtual guidance scheme in real-world scenarios, we have designed a specific task. The objectives of this task are twofold. First, the task seeks to verify the effective transferability of the pre-trained DRL agent's policy to real-world settings, where it follows the virtual guidance. Second, it aims to demonstrate that the virtual guidance scheme possesses the flexibility to adapt not only to trajectories generated by specific planning algorithms but also to instructions from diverse methods, provided these instructions can be translated into visual representations. This adaptability highlights the advantage of the virtual guidance scheme in eliminating the necessity for the agent to interpret inputs from multiple modalities.
This section presents a comparison of different guidance schemes. The evaluation results indicate that the schemes incorporating virtual guidance (denoted as \( VG \)) consistently outperform the baseline scheme (denoted as \( Hybrid \)) in terms of SPL, success rate, line following rate, and waypoint collection rate, in both \( seen \) and \( unseen \) scenarios. This suggests that our vision-based guidance strategies can effectively offer informative navigational cues, and thereby alleviate the agent's burden of learning the correlation between visual observations and navigational instructions derived from different non-visual modalities.
The first four metrics (SPL, success rate, line following rate, waypoint collecting rate) measure performance; the last two (collision rate, out-of-bound rate) characterize failure cases.

| Environment | Guidance Scheme | Representation Form | SPL (%) | Success Rate (%) | Line Following Rate (%) | Waypoint Collecting Rate (%) | Collision Rate (%) | Out-of-Bound (%) |
|---|---|---|---|---|---|---|---|---|
| Seen | Hybrid (one-time) | {RGB, (r, θ)} | 44.88 | 46.93 | 36.35 | 26.36 | 26.65 | 26.43 |
| Seen | \( VG_{waypoint} \) (one-time) | RGB | 73.16 | 73.51 | 69.54 | 73.53 | 19.41 | 6.95 |
| Seen | \( VG_{path} \) (one-time) | RGB | 72.68 | 72.75 | 89.46 | N/A | 26.26 | 0.96 |
| Seen | \( VG_{path} \) (real-time) | RGB | 88.85 | 89.54 | N/A | N/A | 9.02 | 1.44 |
| Unseen | Hybrid (one-time) | {RGB, (r, θ)} | 18.96 | 20.86 | 35.33 | 20.33 | 37.02 | 42.12 |
| Unseen | \( VG_{waypoint} \) (one-time) | RGB | 57.69 | 58.18 | 67.77 | 66.06 | 30.11 | 11.13 |
| Unseen | \( VG_{path} \) (one-time) | RGB | 57.46 | 57.54 | 89.19 | N/A | 36.94 | 5.45 |
| Unseen | \( VG_{path} \) (real-time) | RGB | 81.48 | 82.74 | N/A | N/A | 15.08 | 1.99 |
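For reference, the SPL metric reported above is Success weighted by Path Length, commonly computed as \( \mathrm{SPL} = \frac{1}{N}\sum_{i} S_i \frac{l_i}{\max(p_i, l_i)} \), where \( S_i \) is the binary success indicator, \( l_i \) the shortest-path length, and \( p_i \) the length of the agent's actual path in episode \( i \). A minimal implementation of this standard formula:

```python
def spl(successes, shortest_lens, actual_lens):
    """Success weighted by Path Length over a batch of N episodes.

    successes:     list of 0/1 success indicators S_i.
    shortest_lens: list of shortest-path lengths l_i.
    actual_lens:   list of traversed path lengths p_i.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lens, actual_lens):
        total += s * l / max(p, l)  # efficiency term, capped at 1
    return total / len(successes)

# Two episodes: one success that took twice the optimal path, one failure.
score = spl([1, 0], [10.0, 10.0], [20.0, 15.0])
```

Note that SPL penalizes inefficient successes, which is why it sits slightly below the raw success rate in every row of the table.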
We validated the proposed virtual guidance concept in real-world scenarios, using text prompts to derive virtual guidance paths. In our experimental configurations, several different compositions of objects were arranged. Each composition contained multiple objects, with at least two objects from the same category but with distinct appearances. For example, the second composition featured two differently designed umbrellas along with a green box. The agent received a text prompt describing a particular composition of objects.
Each experiment was performed over five independent runs. Owing to the capabilities of GroundingDINO, all objects described in the text prompts were correctly detected. The agent's task was to identify and approach the correct object(s) based on the provided text prompt, using the transformed virtual guidance. In the text prompts, the symbol & requires the agent to reach all of the described objects, whereas \( | \) requires it to reach only one of them. The results, presented in the table below, indicate that policies trained in virtual environments can be effectively transferred to real-world settings, and that the agent can follow virtual guidance paths generated from text-prompt-based navigation objectives.
| Object Composition | Text Prompt | Approach | Success Rate |
|---|---|---|---|
| Composition 1 (images omitted) | `purple umbrella' & `blue umbrella' | \( VG_{waypoints} \) | 80.00% |
| Composition 1 (images omitted) | `purple umbrella' & `blue umbrella' | \( VG_{path} \) + \( VG_{waypoints} \) | 80.00% |
| Composition 2 (images omitted) | `purple umbrella' \| `blue umbrella' \| `green box' | \( VG_{waypoints} \) | 80.00% |
| Composition 2 (images omitted) | `purple umbrella' \| `blue umbrella' \| `green box' | \( VG_{path} \) + \( VG_{waypoints} \) | 60.00% |
| Composition 3 (images omitted) | `purple umbrella' \| `blue box' \| `green box' | \( VG_{waypoints} \) | 60.00% |
| Composition 3 (images omitted) | `purple umbrella' \| `blue box' \| `green box' | \( VG_{path} \) + \( VG_{waypoints} \) | 80.00% |
| Composition 4 (images omitted) | `traffic cone' \| `blue box' \| `green box' | \( VG_{waypoints} \) | 60.00% |
| Composition 4 (images omitted) | `traffic cone' \| `blue box' \| `green box' | \( VG_{path} \) + \( VG_{waypoints} \) | 60.00% |
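The & / \( | \) prompt semantics described above can be captured by a small parser that maps a prompt to a goal mode and a list of target phrases. This is a hypothetical sketch, not the paper's code, and it assumes each prompt uses a single connective throughout (as in the table above).

```python
def parse_prompt(prompt: str):
    """Parse a text prompt into (mode, targets).

    '&'-joined prompts require reaching every target ('all');
    '|'-joined prompts require reaching any one target ('any').
    A single-object prompt falls through to mode 'any' with one target.
    """
    if '&' in prompt:
        mode, parts = 'all', prompt.split('&')
    else:
        mode, parts = 'any', prompt.split('|')
    # Strip whitespace and the quote characters used in the prompts.
    targets = [p.strip().strip("'`") for p in parts]
    return mode, targets

mode, targets = parse_prompt("'purple umbrella' & 'blue umbrella'")
```

Each target phrase would then be handed to the open-vocabulary detector (GroundingDINO in the experiments), and the detections converted into virtual guidance toward the required object(s).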
```bibtex
@misc{yang2023virtual,
      title={Virtual Guidance as a Mid-level Representation for Navigation},
      author={Hsuan-Kung Yang and Tsung-Chih Chiang and Ting-Ru Liu and Chun-Wei Huang and Jou-Min Liu and Chun-Yi Lee},
      year={2023},
      eprint={2303.02731},
      archivePrefix={arXiv},
}
```