Hands-On Exercises
These hands-on exercises will guide you through implementing key components of a Vision-Language-Action (VLA) system. You will need a Python environment with Whisper installed and access to an LLM API (such as OpenAI's GPT-3.5/GPT-4). A simulated ROS 2 environment (e.g., Isaac Sim with a humanoid robot) will be beneficial for Exercise 2.
Exercise 1: Build a Voice-to-Action Demo using Whisper + ROS 2
Objective: To create a basic pipeline that converts a spoken command into a simple ROS 2 action.
You will need:
- Python 3 environment.
- The `openai-whisper` library installed (`pip install -U openai-whisper`); a quick transcription check follows this list.
- An active ROS 2 installation.
- A simple ROS 2 action server (e.g., a `SayHello` action with a `message` field).
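Before wiring anything into ROS 2, it is worth confirming that Whisper transcribes audio on your machine. A quick sanity check, using any short `.wav` file you have on hand (the filename below is just a placeholder), might look like this:

```python
# whisper_check.py -- sanity check that openai-whisper works locally
import whisper

model = whisper.load_model("base")          # downloads the model weights on first use
result = model.transcribe("test_clip.wav")  # placeholder path; any short recording works
print(result["text"])
```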
Steps:
1. Create a ROS 2 Action Definition:
   - Define a simple ROS 2 action, for example, `SayHello.action`:

     ```
     string message
     ---
     bool success
     ---
     float32 feedback_percentage
     ```

   - Build your ROS 2 package to generate the action interfaces.
2. Implement a `SayHello` Action Server:
   - Write a Python ROS 2 node that implements the `SayHello` action server.
   - When it receives a goal (a `message`), it should print the message to the console, publish `feedback_percentage` updates, and then return `success=True`.
```python
# simple_action_server.py
import time

import rclpy
from rclpy.action import ActionServer
from rclpy.action.server import ServerGoalHandle
from rclpy.node import Node

# Import the SayHello action from the interface package you built in step 1
# (replace 'your_robot_interfaces' with your own package name).
from your_robot_interfaces.action import SayHello


class SimpleActionServer(Node):
    def __init__(self):
        super().__init__('simple_action_server')
        self._action_server = ActionServer(
            self,
            SayHello,
            'say_hello',
            self.execute_callback
        )
        self.get_logger().info("SayHello Action Server started.")

    def execute_callback(self, goal_handle: ServerGoalHandle):
        self.get_logger().info(f"Received goal: {goal_handle.request.message}")
        for i in range(1, 11):  # Simulate progress in 10% increments
            feedback = SayHello.Feedback()
            feedback.feedback_percentage = float(i) * 10.0
            goal_handle.publish_feedback(feedback)
            self.get_logger().info(f"Feedback: {feedback.feedback_percentage}%")
            time.sleep(0.1)  # Simulate work; do not spin inside an executing callback
        goal_handle.succeed()
        result = SayHello.Result()
        result.success = True
        self.get_logger().info("Goal succeeded!")
        return result


def main(args=None):
    rclpy.init(args=args)
    action_server = SimpleActionServer()
    rclpy.spin(action_server)
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```
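To launch the server with `ros2 run` (as in step 4 below), register it as a console script in your Python package. A sketch of the relevant `setup.py` excerpt, assuming an `ament_python` package named `vla_demo` (a placeholder name; use your own), could be:

```python
# setup.py (excerpt) -- 'vla_demo' is a placeholder package name
from setuptools import setup

setup(
    name='vla_demo',
    version='0.0.1',
    packages=['vla_demo'],
    install_requires=['setuptools'],
    entry_points={
        'console_scripts': [
            # enables: ros2 run vla_demo simple_action_server
            'simple_action_server = vla_demo.simple_action_server:main',
        ],
    },
)
```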
3. Implement the Voice-to-Action Client:
   - Write a Python script that:
     - Records a short audio clip (you can use `sounddevice` or `pyaudio` for this, or simply use a pre-recorded `.wav` file); a recording sketch follows the client code below.
     - Uses OpenAI Whisper to transcribe the audio.
     - Uses a simple rule-based approach (or an LLM if you have API access) to extract the message for the `SayHello` action. For example, if the transcription contains "hello robot", extract "hello robot".
     - Creates a ROS 2 action client for `SayHello` and sends the extracted message as a goal.
```python
# voice_action_client.py
import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node

# Import the SayHello action from the same interface package as the server.
from your_robot_interfaces.action import SayHello

import whisper  # pip install -U openai-whisper


def get_voice_command_text():
    """Return the text of a spoken command.

    This placeholder returns hardcoded text. In a real run you would capture
    microphone audio (or load a pre-recorded file) and pass it to Whisper:

        model = whisper.load_model("base")
        result = model.transcribe("path/to/your/recorded_audio.wav")
        return result["text"]
    """
    print("Please say a command (e.g., 'hello robot')...")
    return "hello robot"  # Placeholder


class VoiceActionClient(Node):
    def __init__(self):
        super().__init__('voice_action_client')
        self._action_client = ActionClient(self, SayHello, 'say_hello')
        self.get_logger().info("VoiceActionClient created.")

    def send_goal(self, message):
        self.get_logger().info("Waiting for action server...")
        self._action_client.wait_for_server()
        goal_msg = SayHello.Goal()
        goal_msg.message = message
        self.get_logger().info(f"Sending goal: {message}")
        self._send_goal_future = self._action_client.send_goal_async(
            goal_msg, feedback_callback=self.feedback_callback)
        self._send_goal_future.add_done_callback(self.goal_response_callback)

    def goal_response_callback(self, future):
        goal_handle = future.result()
        if not goal_handle.accepted:
            self.get_logger().info('Goal rejected :(')
            return
        self.get_logger().info('Goal accepted :)')
        self._get_result_future = goal_handle.get_result_async()
        self._get_result_future.add_done_callback(self.get_result_callback)

    def get_result_callback(self, future):
        result = future.result().result
        self.get_logger().info(f'Result: {result.success}')
        rclpy.shutdown()  # Stop spinning once the result arrives

    def feedback_callback(self, feedback_msg):
        feedback = feedback_msg.feedback
        self.get_logger().info(f'Received feedback: {feedback.feedback_percentage}%')


def main(args=None):
    rclpy.init(args=args)
    client = VoiceActionClient()
    text_command = get_voice_command_text()
    client.send_goal(text_command)
    rclpy.spin(client)


if __name__ == '__main__':
    main()
```
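To replace the hardcoded placeholder with live microphone input, a minimal sketch using `sounddevice` (assuming a working default microphone and the `base` Whisper model; this is illustrative, not part of the exercise scaffold) could be:

```python
# record_and_transcribe.py -- illustrative sketch of live voice capture
import sounddevice as sd  # pip install sounddevice
import whisper


def get_voice_command_text(duration_s: float = 4.0, sample_rate: int = 16000) -> str:
    """Record a short clip from the default microphone and transcribe it with Whisper."""
    print(f"Recording for {duration_s} seconds...")
    audio = sd.rec(int(duration_s * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes
    model = whisper.load_model("base")
    # Whisper accepts a mono float32 NumPy array sampled at 16 kHz
    result = model.transcribe(audio.flatten(), fp16=False)
    return result["text"].strip()
```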
4. Run the Demo:
   - In one terminal, run the ROS 2 action server: `ros2 run <your_package> simple_action_server`.
   - In another terminal, run the client: `python voice_action_client.py`.
   - Observe the action server receiving and processing the command.
Success Criteria: A spoken command (or hardcoded text) is transcribed and successfully triggers the ROS 2 action server, printing the message and feedback.
Exercise 2: LLM-Based Task Planner Controlling a Simulated Humanoid
Objective: To use an LLM to generate a sequence of actions for a simulated humanoid robot based on a natural language command.
You will need:
- Access to an LLM API (e.g., OpenAI API key).
- A simulated humanoid robot in an environment with known objects (e.g., in Isaac Sim from Module 3).
- A ROS 2 setup with action servers for navigation, grasping, and placing.
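Before wiring the LLM into ROS 2, it can help to confirm that your API key and the `openai` Python package work. A minimal connectivity check (the model name is an assumption; the client reads `OPENAI_API_KEY` from your environment) might look like:

```python
# llm_check.py -- quick check that the OpenAI chat API is reachable
import openai  # pip install openai

client = openai.OpenAI()  # picks up OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption; use any chat model you have access to
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)
```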
Steps:
1. Set Up Robot Action Servers:
   - Ensure your simulated humanoid has ROS 2 action servers for fundamental capabilities like:
     - `navigate_to_pose` (from Nav2 or a custom controller)
     - `grasp_object`
     - `place_object`
     - `find_object` (could integrate with Isaac ROS perception)
   - Each action server should define its own `.action` file and implement its logic (a sample interface is sketched below).
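For reference, one of these interfaces, say `GraspObject.action`, could be as minimal as the sketch below; the field names are assumptions, so match them to whatever your grasp controller expects:

```
# GraspObject.action (hypothetical field layout)
string object_id
---
bool success
---
float32 progress
```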
2. Develop an LLM Orchestrator Node:
   - Write a Python ROS 2 node that acts as the central orchestrator.
   - This node will:
     - Receive a high-level natural language command (e.g., from a web interface, or by transcribing with Whisper).
     - Send this command as a prompt to the LLM API. The prompt should instruct the LLM to output a JSON-formatted action graph, similar to the "Clean the room" example in Section 4.2.
     - Parse the LLM's JSON response to get the sequence of actions.
     - For each action in the sequence, create the appropriate ROS 2 action goal and send it to the relevant action client.
     - Monitor the execution of each action and handle success or failure before proceeding to the next.
```python
# llm_orchestrator.py (conceptual)
import json

import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node

# Import your custom action interfaces
from your_robot_interfaces.action import NavigateToPose, GraspObject, PlaceObject  # etc.

import openai  # pip install openai


class LLMOrchestrator(Node):
    def __init__(self):
        super().__init__('llm_orchestrator')
        self.openai_client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")
        # Action clients for each robot capability
        self.nav_client = ActionClient(self, NavigateToPose, 'navigate_to_pose')
        self.grasp_client = ActionClient(self, GraspObject, 'grasp_object')
        self.place_client = ActionClient(self, PlaceObject, 'place_object')
        # ... other clients
        self.get_logger().info("LLM Orchestrator ready.")

    def get_llm_plan(self, command):
        prompt = f"""
You are a robot task planner. Given a high-level command,
generate a sequence of robot actions in JSON format.
Each action should have an 'action' type and 'parameters'.
Available actions: 'navigate_to_pose', 'grasp_object', 'place_object'.
Example: {json.dumps({"actions": [{"action": "navigate_to_pose", "parameters": {"x": 1.0, "y": 2.0}}, {"action": "grasp_object", "parameters": {"object_id": "cup_01"}}]})}
Command: "{command}"
"""
        try:
            response = self.openai_client.chat.completions.create(
                model="gpt-3.5-turbo",  # or gpt-4
                messages=[{"role": "user", "content": prompt}]
            )
            plan_str = response.choices[0].message.content
            return json.loads(plan_str)
        except Exception as e:
            self.get_logger().error(f"Error getting plan from LLM: {e}")
            return None

    def _send_goal_and_wait(self, client, goal):
        """Send a goal, block until the result arrives, and return the result message."""
        client.wait_for_server()
        send_future = client.send_goal_async(goal)
        rclpy.spin_until_future_complete(self, send_future)
        goal_handle = send_future.result()
        if not goal_handle.accepted:
            return None
        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(self, result_future)
        return result_future.result().result

    def execute_plan(self, llm_plan):
        if not llm_plan or "actions" not in llm_plan:
            self.get_logger().error("Invalid LLM plan received.")
            return False
        for action_step in llm_plan["actions"]:
            action_type = action_step["action"]
            parameters = action_step.get("parameters", {})
            self.get_logger().info(f"Executing: {action_type} with {parameters}")
            if action_type == "navigate_to_pose":
                goal = NavigateToPose.Goal()
                goal.pose.position.x = float(parameters.get("x", 0.0))
                goal.pose.position.y = float(parameters.get("y", 0.0))
                # ... populate other pose fields
                result = self._send_goal_and_wait(self.nav_client, goal)
                if result is None or not result.success:
                    self.get_logger().error("Navigation failed!")
                    return False
            elif action_type == "grasp_object":
                goal = GraspObject.Goal()
                goal.object_id = parameters.get("object_id", "")
                result = self._send_goal_and_wait(self.grasp_client, goal)
                if result is None or not result.success:
                    self.get_logger().error(f"Grasp failed for {goal.object_id}!")
                    return False
            # ... handle other action types
        return True


def main(args=None):
    rclpy.init(args=args)
    orchestrator = LLMOrchestrator()
    # Example command - in a real system, this could come from Whisper
    command = "Go to the charging station and pick up the power pack."
    plan = orchestrator.get_llm_plan(command)
    orchestrator.execute_plan(plan)
    orchestrator.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```
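LLMs do not always return bare JSON; the reply may be wrapped in Markdown fences or extra prose, in which case the `json.loads` call in `get_llm_plan` will raise. A minimal defensive parser you could drop in front of `execute_plan` (the function and constant names here are illustrative, not part of the exercise scaffold) might look like:

```python
# plan_parsing.py -- defensive parsing of an LLM-generated plan (illustrative sketch)
import json
import re
from typing import Optional

ALLOWED_ACTIONS = {"navigate_to_pose", "grasp_object", "place_object"}


def parse_plan(reply: str) -> Optional[dict]:
    """Extract and validate a JSON action plan from a raw LLM reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # grab the outermost {...} span
    if match is None:
        return None
    try:
        plan = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    actions = plan.get("actions")
    if not isinstance(actions, list):
        return None
    for step in actions:
        if not isinstance(step, dict) or step.get("action") not in ALLOWED_ACTIONS:
            return None  # reject plans that request unknown capabilities
    return plan
```

In `get_llm_plan`, you could call `parse_plan(response.choices[0].message.content)` instead of `json.loads` directly, so malformed or out-of-scope plans are rejected before any action goals are sent.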
3. Run the Simulation:
   - Launch your simulated humanoid in Isaac Sim (or Gazebo).
   - Ensure all necessary ROS 2 action servers (navigation, grasping, etc.) are running (e.g., check with `ros2 action list`).
   - Run your `llm_orchestrator.py` node.
   - Observe the simulated robot executing the sequence of actions planned by the LLM.
Success Criteria: A high-level natural language command is processed by the LLM, converted into a series of ROS 2 action calls, and the simulated humanoid robot successfully performs the corresponding actions in the virtual environment.
These exercises provide a strong foundation for building sophisticated VLA capabilities, pushing your humanoid robot towards truly intelligent and autonomous operation.