In the context of autonomous navigation, effectively conveying abstract navigational cues to agents in dynamic environments poses challenges, particularly when the navigation information is multimodal. To address this issue, the paper introduces a novel technique termed "Virtual Guidance," which is designed to visually represent non-visual instructional signals. These visual cues, rendered as colored paths or spheres, are overlaid onto the agent's camera view, serving as easily comprehensible navigational instructions. We evaluate our proposed method through experiments in both simulated and real-world settings. In the simulated environments, our virtual guidance outperforms baseline hybrid approaches in several metrics, including adherence to planned routes and obstacle avoidance. Furthermore, we extend the concept of virtual guidance to transform text-prompt-based instructions into a visually intuitive format for real-world experiments. Our results validate the adaptability of virtual guidance and its efficacy in enabling policy transfer from simulated scenarios to real-world ones.
To investigate the feasibility of virtual guidance as a form of mid-level representation, we developed a flexible framework based on the Unity engine and the Unity ML-Agents Toolkit. This framework, illustrated below, is fully configurable and enables the generation of guidance signals. The agent's input observations are rendered as semantic segmentation maps, an effective mid-level representation that can serve as the input observation of a Deep Reinforcement Learning (DRL) agent. This design enables the exploration of diverse ways of presenting guidance signals to the agent, including both virtual guidance and vector-based approaches. The agent receives the guidance signals along with the semantic segmentation maps, and is tasked with processing these inputs to learn a policy that reaches its intended destination.
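As a concrete illustration, one way to combine the two inputs is to render the guidance signal directly into the segmentation map, so the agent consumes a single image-like observation. The sketch below is a hypothetical helper, not code from the framework: the function name `build_observation` and the reserved guidance class id are assumptions for illustration.

```python
import numpy as np

def build_observation(seg_map: np.ndarray, guidance_mask: np.ndarray,
                      guidance_class: int = 255) -> np.ndarray:
    """Overlay a rendered guidance mask onto a semantic segmentation map.

    seg_map:       (H, W) array of semantic class indices.
    guidance_mask: (H, W) boolean array, True wherever the projected
                   guidance (path or waypoint) covers a pixel.
    guidance_class: class id reserved for the guidance signal (assumed).
    """
    obs = seg_map.copy()
    obs[guidance_mask] = guidance_class  # guidance gets its own class id
    return obs

# Toy example: a 4x4 segmentation map with a short guidance segment.
seg = np.zeros((4, 4), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[2, 1:3] = True
obs = build_observation(seg, mask)
```

Folding the guidance into the same pixel grid as the segmentation map is what keeps the agent's input unimodal, which is the central design choice of the framework.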
In the first scheme, the navigation line obtained from the planning module is represented as a colored path on the semantic segmentation map, as illustrated in the corresponding figure. Specifically, the navigation path is implemented as a 3D mesh in the simulated environments and projected onto the camera view plane. The rendered path serves as a rich and informative signal that carries both semantic and guidance information: it highlights the regions the DRL agent is permitted to traverse as well as the route leading to the target location.
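Projecting the 3D navigation mesh onto the camera view plane amounts to a standard pinhole-camera projection. The sketch below is a minimal illustration, not the framework's rendering code (Unity's own camera performs this step in practice); it assumes path points already expressed in camera coordinates and hypothetical intrinsics `fx`, `fy`, `cx`, `cy`.

```python
import numpy as np

def project_points(points_cam: np.ndarray, fx: float, fy: float,
                   cx: float, cy: float) -> np.ndarray:
    """Project 3D points (camera frame, z forward) onto the image plane.

    points_cam: (N, 3) array of points in camera coordinates.
    Returns an (N, 2) array of (u, v) pixel coordinates.
    """
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * x / z + cx  # perspective divide plus principal-point offset
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# Two sample points on a path, 2 m in front of the camera.
path_cam = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
pixels = project_points(path_cam, fx=100.0, fy=100.0, cx=64.0, cy=64.0)
```

Rasterizing the projected segments onto the segmentation map then yields the colored path the agent observes.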
The second scheme generates a set of waypoints \( \mathcal{W} \) by segmenting the planned navigation trajectory. These waypoints serve as hints that guide the agent toward the destination. They are visualized as 3D virtual balls in the virtual environments and projected onto the camera image plane, as shown in the corresponding figure. Unlike the first scheme, which employs a navigation path to provide dense and informative signals, the second scheme presents waypoints as 3D virtual balls, which constitute sparse signals that the DRL agent must locate.
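Segmenting a planned trajectory into waypoints can be done, for instance, by resampling the planned polyline at a fixed arc-length interval. The helper below is a hypothetical sketch (the text does not specify the exact segmentation rule); the `spacing` parameter controls how sparse the resulting signal is.

```python
import numpy as np

def sample_waypoints(path: np.ndarray, spacing: float) -> np.ndarray:
    """Resample a planned trajectory into waypoints ~`spacing` apart.

    path:    (N, 2) or (N, 3) polyline from the planning module.
    spacing: desired arc-length distance between consecutive waypoints.
    """
    seg = np.diff(path, axis=0)                    # per-segment vectors
    seglen = np.linalg.norm(seg, axis=1)           # per-segment lengths
    s = np.concatenate([[0.0], np.cumsum(seglen)]) # cumulative arc length
    targets = np.arange(spacing, s[-1], spacing)
    waypoints = [path[0]]
    for t in targets:
        i = np.searchsorted(s, t) - 1              # segment containing t
        alpha = (t - s[i]) / seglen[i]             # interpolation factor
        waypoints.append(path[i] + alpha * seg[i])
    waypoints.append(path[-1])                     # always keep the goal
    return np.array(waypoints)

# A straight 10 m path resampled every 2.5 m yields 5 waypoints.
wps = sample_waypoints(np.array([[0.0, 0.0], [10.0, 0.0]]), spacing=2.5)
```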
To further validate the applicability of the proposed virtual guidance scheme in real-world scenarios, we have designed a specific task. The objectives of this task are twofold. First, the task seeks to verify the effective transferability of the pre-trained DRL agent's policy to real-world settings, where it follows the virtual guidance. Second, it aims to demonstrate that the virtual guidance scheme possesses the flexibility to adapt not only to trajectories generated by specific planning algorithms but also to instructions from diverse methods, provided these instructions can be translated into visual representations. This adaptability highlights the advantage of the virtual guidance scheme in eliminating the necessity for the agent to interpret inputs from multiple modalities.
This section presents a comparison of different guidance schemes. The evaluation results indicate that the schemes incorporating virtual guidance (denoted as \( VG \)) consistently outperform the baseline scheme (denoted as \( Hybrid \)) in terms of SPL, success rate, line following rate, and waypoint collection rate, in both \( seen \) and \( unseen \) scenarios. This suggests that our vision-based guidance strategies can effectively offer informative navigational cues, and thereby alleviate the agent's burden of learning the correlation between visual observations and navigational instructions derived from different non-visual modalities.
The first four metrics (SPL, success rate, line following rate, waypoint collecting rate) measure performance; the last two (collision rate, out-of-bound rate) characterize failure cases.

| Environment | Guidance Scheme | Representation Form | SPL (%) | Success Rate (%) | Line Following Rate (%) | Waypoint Collecting Rate (%) | Collision Rate (%) | Out-of-Bound (%) |
|---|---|---|---|---|---|---|---|---|
| Seen | Hybrid (one-time) | {RGB, (r, θ)} | 44.88 | 46.93 | 36.35 | 26.36 | 26.65 | 26.43 |
| Seen | \( VG_{waypoint} \) (one-time) | RGB | 73.16 | 73.51 | 69.54 | 73.53 | 19.41 | 6.95 |
| Seen | \( VG_{path} \) (one-time) | RGB | 72.68 | 72.75 | 89.46 | N/A | 26.26 | 0.96 |
| Seen | \( VG_{path} \) (real-time) | RGB | 88.85 | 89.54 | N/A | N/A | 9.02 | 1.44 |
| Unseen | Hybrid (one-time) | {RGB, (r, θ)} | 18.96 | 20.86 | 35.33 | 20.33 | 37.02 | 42.12 |
| Unseen | \( VG_{waypoint} \) (one-time) | RGB | 57.69 | 58.18 | 67.77 | 66.06 | 30.11 | 11.13 |
| Unseen | \( VG_{path} \) (one-time) | RGB | 57.46 | 57.54 | 89.19 | N/A | 36.94 | 5.45 |
| Unseen | \( VG_{path} \) (real-time) | RGB | 81.48 | 82.74 | N/A | N/A | 15.08 | 1.99 |
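For reference, the SPL metric reported above is Success weighted by Path Length, commonly computed as \( \mathrm{SPL} = \frac{1}{N}\sum_{i} S_i \frac{l_i}{\max(p_i, l_i)} \), where \( S_i \) is the binary success indicator, \( l_i \) the shortest-path length, and \( p_i \) the length of the agent's actual path in episode \( i \). A minimal implementation of this standard formula:

```python
def spl(successes, shortest_lens, actual_lens):
    """Success weighted by Path Length over a batch of N episodes.

    successes:     list of 0/1 success indicators S_i.
    shortest_lens: list of shortest-path lengths l_i.
    actual_lens:   list of traversed path lengths p_i.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lens, actual_lens):
        total += s * l / max(p, l)  # efficiency term, capped at 1
    return total / len(successes)

# Two episodes: one success that took twice the optimal path, one failure.
score = spl([1, 0], [10.0, 10.0], [20.0, 15.0])
```

Note that SPL penalizes inefficient successes, which is why it sits slightly below the raw success rate in every row of the table.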
We validated the proposed virtual guidance concept in real-world scenarios, using text prompts to derive virtual guidance paths. In our experimental configurations, several different compositions of objects were arranged. Each composition contained multiple objects, with at least two objects from the same category but with distinct appearances. For example, the second composition featured two differently designed umbrellas along with a green box. The agent received a text prompt describing a particular composition of objects.
Each experiment was performed over five independent runs. Owing to the capabilities of GroundingDINO, all objects described in the text prompts were correctly detected. The agent's task was to identify and approach the correct object(s) based on the provided text prompt, using the transformed virtual guidance. In the text prompts, the symbol & requires the agent to reach all of the described objects, whereas \( | \) requires it to reach only one of them. The results, presented in the table below, indicate that policies trained in virtual environments can be effectively transferred to real-world settings, and that the agent can follow virtual guidance paths generated from text-prompt-based navigation objectives.
| Object Composition | Text Prompt | Approach | Success Rate |
|---|---|---|---|
| Composition 1 (images omitted) | `purple umbrella' & `blue umbrella' | \( VG_{waypoints} \) | 80.00% |
| Composition 1 (images omitted) | `purple umbrella' & `blue umbrella' | \( VG_{path} \) + \( VG_{waypoints} \) | 80.00% |
| Composition 2 (images omitted) | `purple umbrella' \| `blue umbrella' \| `green box' | \( VG_{waypoints} \) | 80.00% |
| Composition 2 (images omitted) | `purple umbrella' \| `blue umbrella' \| `green box' | \( VG_{path} \) + \( VG_{waypoints} \) | 60.00% |
| Composition 3 (images omitted) | `purple umbrella' \| `blue box' \| `green box' | \( VG_{waypoints} \) | 60.00% |
| Composition 3 (images omitted) | `purple umbrella' \| `blue box' \| `green box' | \( VG_{path} \) + \( VG_{waypoints} \) | 80.00% |
| Composition 4 (images omitted) | `traffic cone' \| `blue box' \| `green box' | \( VG_{waypoints} \) | 60.00% |
| Composition 4 (images omitted) | `traffic cone' \| `blue box' \| `green box' | \( VG_{path} \) + \( VG_{waypoints} \) | 60.00% |
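The & / \( | \) prompt semantics described above can be captured by a small parser that maps a prompt to a goal mode and a list of target phrases. This is a hypothetical sketch, not the paper's code, and it assumes each prompt uses a single connective throughout (as in the table above).

```python
def parse_prompt(prompt: str):
    """Parse a text prompt into (mode, targets).

    '&'-joined prompts require reaching every target ('all');
    '|'-joined prompts require reaching any one target ('any').
    A single-object prompt falls through to mode 'any' with one target.
    """
    if '&' in prompt:
        mode, parts = 'all', prompt.split('&')
    else:
        mode, parts = 'any', prompt.split('|')
    # Strip whitespace and the quote characters used in the prompts.
    targets = [p.strip().strip("'`") for p in parts]
    return mode, targets

mode, targets = parse_prompt("'purple umbrella' & 'blue umbrella'")
```

Each target phrase would then be handed to the open-vocabulary detector (GroundingDINO in the experiments), and the detections converted into virtual guidance toward the required object(s).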
```bibtex
@misc{yang2023virtual,
      title={Virtual Guidance as a Mid-level Representation for Navigation},
      author={Hsuan-Kung Yang and Tsung-Chih Chiang and Ting-Ru Liu and Chun-Wei Huang and Jou-Min Liu and Chun-Yi Lee},
      year={2023},
      eprint={2303.02731},
      archivePrefix={arXiv},
}
```