
Using YOLO to track objects

This example provides a use case for Object Detection Components. It shows how to create a custom component that converts bounding boxes into a 3D pose. The example works with any RGB sensor_msgs::msg::Image signal, such as the one provided by the Camera Streamer component from components/core-vision. The final application tracks an object with a robot based on a fixed camera position.

Moving the robot towards an object in RViz

Setup

To run a YOLO executor, you need a YOLO model file in ONNX format and a YAML class file. YOLO models are widely available in PyTorch (.pt) format; for example, download the lightweight yolo12n.pt from Ultralytics here.

To convert the .pt file to ONNX, run the following Python code with ultralytics installed (pip install ultralytics):

from ultralytics import YOLO

model = YOLO("yolo12n.pt")
model.export(format="onnx")  # creates 'yolo12n.onnx' alongside the .pt file

For this example, also download the standard coco.yaml class file here. Move the ONNX model file and the YAML class file into a new directory.

In AICA Launcher, create a configuration with the following core version and packages:

  • AICA Core v4.4.2
  • collections/object-detection for the YOLO Executor component (TODO: check package name post-release and specify version)
  • components/core-vision v1.0.0 for the Camera Streamer component

Under Advanced Settings, add a volume containing the folder where you stored your YOLO model files and link it to /files.

Using the YOLO Executor

The YOLO Executor component runs the YOLO segmentation model on an image. It takes a camera stream and outputs the segmented image as well as the locations of bounding boxes on the image. The Object Class parameter causes the component to set the predicate Object Detected to true when the specified object is found in the image. The Confidence Threshold parameter is the minimum score a predicted bounding box must have to be considered a valid detection. The IOU Threshold is used during Non-Maximum Suppression (NMS) to decide whether two bounding boxes represent the same object; for example, with an IOU threshold of 0.5, any box that overlaps more than 50% with a higher-scoring box is discarded. For now, leave these parameters at their default values.
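
For intuition, the following sketch shows how the two thresholds act: boxes below the Confidence Threshold are dropped, and greedy NMS discards any remaining box whose IoU with an already-kept, higher-scoring box exceeds the IOU Threshold. This is an illustration only, not the YOLO Executor's internal implementation.

import numpy as np

# illustrative IoU and greedy non-maximum suppression; the parameter names mirror the
# component's Confidence Threshold and IOU Threshold, but the code itself is a sketch
def iou(box_a, box_b):
    # boxes given as (x_min, y_min, x_max, y_max)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

def nms(boxes, scores, confidence_threshold=0.25, iou_threshold=0.5):
    # keep the highest-scoring boxes and discard lower-scoring boxes that overlap them too much
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= confidence_threshold]
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept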

To test the YOLO executor:

  • Create a new application
  • Remove the default Hardware Interface node
  • Add the Camera Streamer component from the core vision package
    • Set the Source parameter to a video device or file accordingly.
    • Turn on the auto-configure and auto-activate switches
  • Add the YOLO Executor component
    • Set the Model Path parameter to the .onnx file, e.g., /files/yolo12n.onnx
    • Set the Classes Path parameter to the yaml label file, e.g., /files/coco.yaml
    • Set the Rate parameter to 3 (works on most machines); on a machine with a GPU, this can be set higher.
    • Turn on the auto-configure and auto-activate switches
  • Connect the output of the start node to each component to load them when the application is started
  • Connect the Image output of the Camera Streamer to the RGB Image input of the YOLO Executor

The complete application is shown below:

Graph

The YOLO Executor's parameters are as follows:

Graph

Running the application

Start the application in AICA Studio. Then open RViz: click the bottom-right gear icon → "Launch RViz", then in RViz select Add → By topic → /yolo_executor/annotated_image/Image to view the YOLO model's annotated output. It should show the camera images with bounding boxes drawn around detected objects. The bounding boxes are published on the /yolo_executor/bounding_boxes topic as std_msgs/msg/String and can be viewed in AICA Studio in the "ROS Topics" tab. Changing the Confidence Threshold and IOU Threshold parameters changes how many bounding boxes the model outputs and how often.
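
The exact message content depends on the model and package version, but based on the fields parsed by the custom component later in this guide, the JSON string carried on the bounding_boxes topic contains a detections list that can be read roughly like this:

import json

# illustrative payload: only x_min, y_min, x_max, y_max and class_name are used later in this
# guide; any additional fields in the real message are ignored here
message_data = '{"detections": [{"class_name": "person", "x_min": 120, "y_min": 80, "x_max": 360, "y_max": 420}]}'

for detection in json.loads(message_data)["detections"]:
    centre = ((detection["x_min"] + detection["x_max"]) / 2, (detection["y_min"] + detection["y_max"]) / 2)
    print(detection["class_name"], centre)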

note

Only users with a Linux host can visualize the image stream with RViz. On macOS, AICA Launcher will not show the RViz option.

Tracking an object with YOLO

The bounding boxes generated by YOLO can be used to move a robot towards an object.

Creating a custom component to estimate position from bounding boxes

We first need to create a custom component which can estimate a 3D pose in the camera frame from a bounding box position in the image. More information about custom components can be found here. We make three simplifying assumptions (a sketch of the resulting projection math follows the list):

  1. Camera is fixed and angled downward
  2. Known camera position pointing down: (x=0, y=0.6, z=0.6)
  3. Objects are on a flat surface at z=0
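
Under these assumptions, a pixel location can be projected to a 3D position using only the image size, the camera field of view, and the assumed distance to the surface. Below is a minimal sketch of the math that the component implements, using the component's default values:

import numpy as np

fov = np.radians([69.0, 42.0])       # camera field of view [horizontal, vertical]
image_size = np.asarray([480, 840])  # image size in pixels
object_distance = 0.6                # assumed distance from the camera to the surface (m)

def pixel_to_position(xy):
    # normalised offset of the pixel from the image centre, clipped to [-1, 1] per axis
    ratio = np.clip((np.asarray(xy) - image_size / 2) / (image_size / 2), -1, 1)
    # scale by the half-extent visible at the given distance, then append the depth
    return np.append(ratio * object_distance * np.tan(fov / 2), object_distance)

print(pixel_to_position([240, 420]))  # the image centre maps to [0, 0, 0.6], straight ahead of the camera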

Set up the repository

  • Create a git repository from the component-template

  • Clone the repository, enter the directory, and run:

    ./initialize_package.sh

    Name it component_utils and include a Python Lifecycle component

  • Rename py_lifecycle_component.py to yolo_to_marker.py in source/component_utils/component_utils/

  • Rename py_lifecycle_component.json to yolo_to_marker.json in source/component_utils/component_descriptions/

  • Register the component in source/component_utils/setup.cfg under [options.entry_points] like this:

    component_utils::YoloToMarker = component_utils.yolo_to_marker:YoloToMarker

Component code

Below is the core implementation, which can be copied into the respective files. The component expects a JSON string as input provided by the YOLO Executor component, projects bounding boxes into 3D based on camera height and field of view, and outputs a CartesianState for robot control.

yolo_to_marker.py
import json

import numpy as np
import state_representation as sr
from modulo_components.lifecycle_component import LifecycleComponent
from modulo_core import EncodedState
from std_msgs.msg import String


class YoloToMarker(LifecycleComponent):
    def __init__(self, node_name: str, *args, **kwargs):
        super().__init__(node_name, *args, **kwargs)
        # inputs
        self.json_input = ""
        self.add_input("json_input", "json_input", String)

        # outputs
        self._marker_pose = sr.CartesianPose("object", "camera_frame")  # pose named "object", referenced to the camera frame
        self.add_output("marker_pose", "_marker_pose", EncodedState)

        # parameters
        self.to_find = "person"
        self.fov = [69.0, 42.0]
        self.image_size = [480, 840]
        self.object_distance = 0.6

        self.add_parameter(sr.Parameter("to_find", "person", sr.ParameterType.STRING), "Object to find and track")
        self.add_parameter(sr.Parameter("fov", [69.0, 42.0], sr.ParameterType.DOUBLE_ARRAY), "Camera field of view in degrees [horizontal, vertical]")
        self.add_parameter(sr.Parameter("image_size", [480, 840], sr.ParameterType.DOUBLE_ARRAY), "Size of the image in pixels")
        self.add_parameter(sr.Parameter("object_distance", 0.6, sr.ParameterType.DOUBLE), "Assumed distance from the camera to the object (m)")

    def on_configure_callback(self) -> bool:
        # read the parameters once before running
        self.fov = np.radians(np.asarray(self.get_parameter_value("fov")))
        self.image_size = np.asarray(self.get_parameter_value("image_size"))
        self.object_distance = self.get_parameter_value("object_distance")
        return True

    def __centre_pt_to_position(self, xy):
        # find the 3D position of an object from its pixel position in the image
        image_size = np.asarray(self.image_size)
        ratio = np.clip((xy - image_size / 2) / (image_size / 2), -1, 1)

        position = ratio * self.object_distance * np.tan(self.fov / 2)
        position = np.append(position, self.object_distance).reshape((3, 1))
        return position

    def on_step_callback(self):
        if not self.json_input:
            # no detections received yet
            return
        try:
            data = json.loads(self.json_input)["detections"]

            position = {}
            for detection in data:
                centre_of_bbox = np.asarray([
                    (detection["x_min"] + detection["x_max"]) / 2,
                    (detection["y_min"] + detection["y_max"]) / 2
                ])
                position[detection["class_name"]] = self.__centre_pt_to_position(centre_of_bbox)

            self.get_logger().info(f"{position}")

            # output the pose of the target object
            to_find = self.get_parameter_value("to_find")
            if to_find in position:
                self._marker_pose.set_position(position[to_find])

        except Exception as e:
            self.get_logger().error(f"Error in on_step_callback: {e}")

yolo_to_marker.json
{
  "$schema": "https://docs.aica.tech/schemas/1-1-1/component.schema.json",
  "name": "YOLO to marker",
  "description": {
    "brief": "Reads bounding boxes and outputs an interactive marker",
    "details": "This component receives YOLO object detection outputs in JSON format, identifies a specified target object, and converts its bounding box coordinates into a 3D Cartesian pose relative to the camera frame. The pose is computed assuming a fixed distance from the camera and using the camera's field of view and image resolution. If the target object is detected, its 3D position is output as an interactive marker pose."
  },
  "registration": "component_utils::YoloToMarker",
  "inherits": "modulo_components::LifecycleComponent",
  "inputs": [
    {
      "display_name": "Json Input",
      "description": "Bounding boxes from YOLO model",
      "signal_name": "json_input",
      "signal_type": "string"
    }
  ],
  "outputs": [
    {
      "display_name": "Pose Command",
      "description": "The pose command",
      "signal_name": "marker_pose",
      "signal_type": "cartesian_state"
    }
  ],
  "parameters": [
    {
      "display_name": "Object to Track",
      "description": "The name of the object to track, matching a class from the YOLO model.",
      "parameter_name": "to_find",
      "parameter_type": "string",
      "default_value": "person"
    },
    {
      "display_name": "Field of View (FoV)",
      "description": "Camera field of view in degrees [horizontal, vertical]",
      "parameter_name": "fov",
      "parameter_type": "double_array",
      "default_value": [
        69.0,
        42.0
      ]
    },
    {
      "display_name": "Image Size",
      "description": "Image resolution [width, height] in pixels",
      "parameter_name": "image_size",
      "parameter_type": "double_array",
      "default_value": [
        480,
        840
      ]
    },
    {
      "display_name": "Object Distance",
      "description": "Assumed distance from the camera to the object in meters",
      "parameter_name": "object_distance",
      "parameter_type": "double",
      "default_value": 0.6,
      "dynamic": true
    }
  ]
}

Enter the component folder in a terminal and run:

docker build -f aica-package.toml -t objectdetection .

Next, edit the AICA Launcher configuration and enter objectdetection under Custom Packages. After launching, you should see the Component Utils package listed in the "Add Component" menu and be able to add the custom YOLO to Marker component.

Application setup

  • Add the YOLO to Marker component to the previously configured application
    • Set the Rate parameter to match the rate of the YOLO Executor
    • Set the Object to Track parameter to person to track yourself in the frame
    • Set the remaining parameters based on your camera configuration
    • Turn on the auto-configure and auto-activate switches
  • Connect the Bounding Boxes output of the YOLO Executor component to the JSON Input input of the YOLO to Marker component

With the application running, the logs should show the 3D positions of detected objects in the camera_frame.

Creating a frame and converting it to a signal

Add a Hardware Interface node to the application and select the Generic six-axis robot arm in the URDF selection, then run the application. When it is running, select "3D Viz" at the top right. Record a frame at tool0 and name it camera_frame; this represents the position of the camera. The frame can be moved by dragging the axis markers, with its Z axis being the direction in which the camera is pointing. Note that the robot will move to where objects are detected, which may be unreachable depending on the camera position and cause the robot controller to stop.

note

tool0 is the end effector frame of the Generic six-axis robot arm. If using a different robot, change the frame accordingly.

Stop the application, go back to "Graph", open the YAML code, and check that the recorded frame is there, for example:

frames:
  camera_frame:
    reference_frame: world
    position:
      x: 0
      y: 0.3
      z: 0.4
    orientation:
      w: 0
      x: 0
      y: 1
      z: 0
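
The recorded orientation (w=0, x=0, y=1, z=0) is a 180° rotation about the Y axis, which points the camera frame's Z axis straight down at the surface. A quick way to check this, assuming scipy is available:

from scipy.spatial.transform import Rotation

# scipy expects quaternions in (x, y, z, w) order
camera_orientation = Rotation.from_quat([0, 1, 0, 0])

# the camera's local Z axis (its viewing direction) expressed in the world frame
print(camera_orientation.apply([0, 0, 1]))  # approximately [0, 0, -1]: pointing straight down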

To convert the frame to a signal:

  • Add TF to Signal from AICA Core Components package
    • Connect the load node to the On Start block (purple square with a black play symbol)
    • Turn on the auto-configure and auto-activate switches
    • Set the Frame parameter to camera_frame and the Reference Frame to world. With the application running, the camera frame can be observed in the "ROS Topics" tab under /tf_to_signal/pose (modulo_interfaces/msg/EncodedState).

Transforming a frame to world

To drive the robot towards the object, its position needs to be expressed in the world frame. The Cartesian Transformation component can convert poses from camera_frame to world. Add it to the application, connect the load node, connect the Pose output of TF to Signal to Input 1, and connect the output of YOLO to Marker to Input 2. The output of Cartesian Transformation is then the object position in the world frame.
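
Conceptually, this is a composition of two poses: the camera pose in the world frame chained with the object pose in the camera frame. A minimal numpy sketch of that composition with illustrative values (not the component's implementation):

import numpy as np

def make_transform(rotation, translation):
    # build a 4x4 homogeneous transform from a 3x3 rotation matrix and a translation vector
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

# camera_frame in world (as provided by TF to Signal): 180 deg about Y, above the surface
T_world_camera = make_transform(np.diag([-1.0, 1.0, -1.0]), [0.0, 0.3, 0.4])

# object in camera_frame (as provided by YOLO to Marker)
T_camera_object = make_transform(np.eye(3), [0.05, -0.1, 0.4])

# object in world: chain the two transforms
T_world_object = T_world_camera @ T_camera_object
print(T_world_object[:3, 3])  # object position in the world frame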

Moving a robot towards an object

We can use a Signal Point Attractor to move the robot end effector towards the object; a conceptual sketch of the attractor behaviour follows the steps below.

  • Add the Signal Point Attractor to the application
    • Configure the component to auto-configure and auto-activate.
    • Connect the load node to the On Activate transition in YOLO to Marker. This means that the robot will not move until after YOLO to Marker has started.
    • Connect the Cartesian State output under Robot State Broadcaster to the Input pose of the Signal Point Attractor.
    • Connect the output from Cartesian Transformation to the Attractor pose.
    • Connect the Output twist to an IK Velocity Controller on the Hardware Interface; this controller can be added under Controllers.
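
Conceptually, the point attractor outputs a twist that pulls the end effector towards the attractor pose. The sketch below shows the idea on the translational part only, with an illustrative proportional law and gain rather than the component's actual implementation:

import numpy as np

def attractor_twist(current_position, attractor_position, gain=1.0, max_speed=0.25):
    # linear velocity pointing from the current end-effector position towards the attractor
    velocity = gain * (np.asarray(attractor_position) - np.asarray(current_position))
    speed = np.linalg.norm(velocity)
    if speed > max_speed:
        velocity = velocity / speed * max_speed  # clamp for safety
    return velocity

# example: end effector at [0.4, 0.0, 0.5], detected object at [0.2, 0.3, 0.0]
print(attractor_twist([0.4, 0.0, 0.5], [0.2, 0.3, 0.0]))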

The final graph is shown below:

Graph

Application code

yaml application
schema: 2-0-4
dependencies:
  core: v4.2.0
frames:
  camera_frame:
    reference_frame: world
    position:
      x: 0
      y: 0.3
      z: 0.4
    orientation:
      w: 0
      x: 0
      y: 1
      z: 0
on_start:
  load:
    - hardware: hardware
    - component: yolo_executor
    - component: camera_streamer
    - component: tf_to_signal
    - component: yolo_to_marker
components:
  yolo_to_marker:
    component: component_utils::YoloToMarker
    display_name: YOLO to marker
    events:
      transitions:
        on_load:
          lifecycle:
            component: yolo_to_marker
            transition: configure
        on_configure:
          lifecycle:
            component: yolo_to_marker
            transition: activate
        on_activate:
          load:
            component: signal_point_attractor
    parameters:
      rate: !!float 5.0
      object_distance: 0.4
    inputs:
      json_input: /yolo_executor/bounding_boxes
    outputs:
      marker_pose: /yolo_to_marker/state_command
  signal_point_attractor:
    component: aica_core_components::motion::SignalPointAttractor
    display_name: Signal Point Attractor
    events:
      transitions:
        on_configure:
          lifecycle:
            component: signal_point_attractor
            transition: activate
        on_load:
          lifecycle:
            component: signal_point_attractor
            transition: configure
    parameters:
      rate: !!float 2.0
    inputs:
      state: /hardware/robot_state_broadcaster/cartesian_state
      attractor: /cartesian_transformation/output
    outputs:
      twist: /signal_point_attractor/twist
  cartesian_transformation:
    component: aica_core_components::signal::CartesianTransformation
    display_name: Cartesian Transformation
    inputs:
      input_1: /tf_to_signal/pose
      input_2: /yolo_to_marker/state_command
    outputs:
      output: /cartesian_transformation/output
  tf_to_signal:
    component: aica_core_components::ros::TfToSignal
    display_name: TF to Signal
    events:
      transitions:
        on_configure:
          lifecycle:
            component: tf_to_signal
            transition: activate
        on_load:
          lifecycle:
            component: tf_to_signal
            transition: configure
          load:
            component: cartesian_transformation
    parameters:
      frame: camera_frame
    outputs:
      pose: /tf_to_signal/pose
  camera_streamer:
    component: core_vision_components::image_streaming::CameraStreamer
    display_name: Camera Streamer
    events:
      transitions:
        on_load:
          lifecycle:
            component: camera_streamer
            transition: configure
        on_configure:
          lifecycle:
            component: camera_streamer
            transition: activate
    parameters:
      source: /files/videos/hand.mp4
      undistorted_image_cropping: false
    outputs:
      image: /camera_streamer/image
  yolo_executor:
    component: object_detection_components::YOLOExecutor
    display_name: YOLO Executor
    events:
      transitions:
        on_load:
          lifecycle:
            component: yolo_executor
            transition: configure
        on_configure:
          lifecycle:
            component: yolo_executor
            transition: activate
    parameters:
      rate: !!float 1.0
      model_file: /files/onnx/best.onnx
      classes_file: /files/onnx/coco8.yaml
    inputs:
      rgb_image: /camera_streamer/image
    outputs:
      bounding_boxes: /yolo_executor/bounding_boxes
hardware:
  hardware:
    display_name: Hardware Interface
    urdf: Generic six-axis robot arm
    rate: 100
    events:
      transitions:
        on_load:
          load:
            - controller: robot_state_broadcaster
              hardware: hardware
            - controller: ik_velocity_controller
              hardware: hardware
    controllers:
      robot_state_broadcaster:
        plugin: aica_core_controllers/RobotStateBroadcaster
        outputs:
          cartesian_state: /hardware/robot_state_broadcaster/cartesian_state
        events:
          transitions:
            on_load:
              switch_controllers:
                hardware: hardware
                activate: robot_state_broadcaster
      ik_velocity_controller:
        plugin: aica_core_controllers/velocity/IKVelocityController
        inputs:
          command: /signal_point_attractor/twist
        events:
          transitions:
            on_load:
              switch_controllers:
                hardware: hardware
                activate: ik_velocity_controller
graph:
  positions:
    on_start:
      x: 300
      y: 0
    stop:
      x: 300
      y: 100
    components:
      yolo_to_marker:
        x: 1440
        y: 60
      signal_point_attractor:
        x: 1460
        y: 520
      cartesian_transformation:
        x: 960
        y: 600
      tf_to_signal:
        x: 460
        y: 560
      camera_streamer:
        x: 520
        y: 160
      yolo_executor:
        x: 1020
        y: 100
    hardware:
      hardware:
        x: 1960
        y: -20
  edges:
    yolo_to_marker_state_command_cartesian_transformation_input_2:
      path:
        - x: 1860
          y: 280
        - x: 1860
          y: 580
        - x: 940
          y: 580
        - x: 940
          y: 860
    yolo_executor_on_activate_yolo_to_marker_yolo_to_marker:
      path:
        - x: 1380
          y: 200
        - x: 1380
          y: 120
    on_start_on_start_yolo_executor_yolo_executor:
      path:
        - x: 680
          y: 40
        - x: 680
          y: 160
    on_start_on_start_camera_streamer_camera_streamer:
      path:
        - x: 480
          y: 40
        - x: 480
          y: 220
    hardware_hardware_robot_state_broadcaster_cartesian_state_signal_point_attractor_state:
      path:
        - x: 1400
          y: 520
        - x: 1400
          y: 780
    camera_streamer_image_yolo_executor_rgb_image:
      path:
        - x: 980
          y: 420
        - x: 980
          y: 360
    on_start_on_start_tf_to_signal_tf_to_signal:
      path:
        - x: 440
          y: 40
        - x: 440
          y: 620
    on_start_on_start_yolo_to_marker_yolo_to_marker:
      path:
        - x: 920
          y: 40
        - x: 920
          y: 120
    yolo_to_marker_on_activate_signal_point_attractor_signal_point_attractor:
      path:
        - x: 1900
          y: 240
        - x: 1900
          y: 540
        - x: 1420
          y: 540
        - x: 1420
          y: 580
    tf_to_signal_on_load_cartesian_transformation_cartesian_transformation:
      path:
        - x: 880
          y: 740
        - x: 880
          y: 660
    yolo_executor_bounding_boxes_yolo_to_marker_json_input:
      path:
        - x: 1420
          y: 360
        - x: 1420
          y: 320
    yolo_to_marker_marker_pose_cartesian_transformation_input_2:
      path:
        - x: 1860
          y: 320
        - x: 1860
          y: 500
        - x: 920
          y: 500
        - x: 920
          y: 860