Visual Understanding

Module Description

Performs visual language comprehension tasks, such as answering visual questions, understanding scenes, and making advanced deductions.

Module ID: visual_understanding

Module Parameters

Name Type Default Description
prompt string Describe the content This algorithm needs an additional prompt in order to perform the analysis. You can use any prompt; the longer the expected result, the longer the analysis will take. See the examples below.
model_key string qwen2.5_vl_3b_instruct The underlying model. Supported values are qwen2.5_vl_3b_instruct, qwen2.5_vl_7b_instruct and smolvlm_instruct.
enable_shot_detection boolean false With shot detection enabled, the module attempts to separate each shot in the video into its own segment and applies the prompt to each segment individually, giving you, for example, per-shot descriptions with timecodes. If disabled, the prompt is applied to the entire video.
shot_detection_threshold number 30 A shot break is triggered when the difference between two adjacent frames exceeds this threshold. A higher value means a stricter threshold, resulting in fewer shots; a lower value means a more lenient threshold, resulting in more shots.
shot_detection_method string content The shot detection algorithm to use. Supported values are content and adaptive; see the shot detection methods below.
structured_output_schema object {} A JSON object that defines the exact structure you expect the model to return. It should mirror the keys and nesting of the output. See examples for structured output.
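
For reference, here is a module configuration sketch that sets every parameter explicitly. The values are illustrative only, not recommendations:

{
  "modules": {
    "visual_understanding": {
      "prompt": "Describe the content",
      "model_key": "qwen2.5_vl_3b_instruct",
      "enable_shot_detection": true,
      "shot_detection_threshold": 30,
      "shot_detection_method": "content",
      "structured_output_schema": {}
    }
  }
}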

Example

Send the following JSON as the request body via POST to the /jobs/ endpoint:

{
  "sources": [
    "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
  ],
  "modules": {
    "visual_understanding": {
      "prompt": "Describe the actions happening in this video scene.",
      "model_key": "qwen2.5_vl_7b_instruct"
    }
  }
}
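
As a sketch, the same request can be submitted from Python with the requests library. The base URL, the authentication header, and the id field read from the job response are assumptions; substitute whatever your deployment actually uses:

import requests

# Placeholder values; replace with your actual API host and credentials.
API_BASE = "https://api.example.com"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

payload = {
    "sources": [
        "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
    ],
    "modules": {
        "visual_understanding": {
            "prompt": "Describe the actions happening in this video scene.",
            "model_key": "qwen2.5_vl_7b_instruct"
        }
    }
}

# Submit the job; the id field name is an assumption, check your job response.
response = requests.post(f"{API_BASE}/jobs/", json=payload, headers=HEADERS)
response.raise_for_status()
job_id = response.json().get("id")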

To get the results, request the /jobs/{JOB_ID}/detailed-results/ endpoint; the response looks like this:

{
  "data": [
    {
      "detections": [],
      "frame_end": 300,
      "frame_start": 0,
      "id": "1d27611a-fc62-4e31-b6a3-cf1df4f3a9e9",
      "media_type": "video",
      "meta": {
        "indexed_identity": null,
        "prompt": "Describe the actions happening in this video scene.",
        "response": "A tree grows out of a grassy mound with a hole in it.",
        "structured_output_schema": {},
        "structured_response": null
      },
      "module": "visual_understanding",
      "source": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
      "tc_end": "00:00:10:00",
      "tc_start": "00:00:00:00",
      "thumbnail": null,
      "time_end": 10,
      "time_start": 0
    }
  ],
  "limit": 100,
  "next": null,
  "offset": 0,
  "prev": null,
  "total": 1
}

Each detailed result element contains the response text inside the meta field.
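
Below is a minimal sketch for reading that response text out of the detailed results, assuming the same placeholder base URL and headers as above and that the job has already finished:

import requests

API_BASE = "https://api.example.com"                   # placeholder
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # placeholder
job_id = "YOUR_JOB_ID"                                 # placeholder

results = requests.get(
    f"{API_BASE}/jobs/{job_id}/detailed-results/", headers=HEADERS
).json()

for element in results["data"]:
    if element["module"] != "visual_understanding":
        continue
    # The model's answer lives in the meta field of each result element.
    print(element["tc_start"], element["tc_end"], element["meta"]["response"])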

With enable_shot_detection disabled (the default), the results consist of a single segment describing the whole video. To get one result per segment (shot), set enable_shot_detection to true.

Shot Detection Methods

Method Description
content Detects shot changes using a weighted average of pixel changes in the HSV colorspace.
adaptive Performs a rolling average on frame-to-frame differences in the HSV colorspace. In some cases, this can improve handling of fast motion.
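
The detector is built into the module, so you never implement it yourself. Purely as an illustration of the content method, the Python sketch below (assuming OpenCV and NumPy are available) flags a shot break whenever the mean HSV difference between adjacent frames exceeds a threshold; a higher threshold yields fewer breaks, matching the shot_detection_threshold behaviour described above. This is a conceptual approximation, not the algorithm the module runs.

import cv2
import numpy as np

def rough_content_shot_breaks(path, threshold=30.0):
    """Illustrative content-style detection: report a break whenever the
    mean per-pixel HSV difference between adjacent frames exceeds threshold."""
    cap = cv2.VideoCapture(path)
    breaks, prev_hsv, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
        if prev_hsv is not None and np.mean(np.abs(hsv - prev_hsv)) > threshold:
            breaks.append(index)
        prev_hsv = hsv
        index += 1
    cap.release()
    return breaks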

Structured Output

You can define a structured output schema to ensure the model returns results in a specific format. The schema should be a JSON object that mirrors the expected output structure. For example:

{
  "sources": [
    "storage://ubXZFeryA7zoF0N0hDgr"
  ],
  "modules": {
    "visual_understanding": {
      "enable_shot_detection": true,
      "model_key": "qwen2.5_vl_7b_instruct",
      "prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
      "structured_output_schema": {
        "camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
        "daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
        "location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
        "persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
        "scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
        "scene_tags": "List relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
        "text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
        "weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
      }
    }
  }
}

Example response from the /jobs/{JOB_ID}/detailed-results/ endpoint:

{
    "data": [
        {
            "detections": [],
            "frame_end": 62,
            "frame_start": 0,
            "id": "5ea16764-35f9-4799-ade5-8facb10dbd3e",
            "media_type": "video",
            "meta": {
                "indexed_identity": null,
                "prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
                "response": null,
                "structured_output_schema": {
                    "camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
                    "daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
                    "location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
                    "persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
                    "scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
                    "scene_tags": "List 5\u201315 relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
                    "text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
                    "weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
                },
                "structured_response": {
                    "camera_setting": "Close Up / Extreme Close Up Shot",
                    "daytime": "Day",
                    "location": "Interior",
                    "persons_appearing": "No",
                    "scene_description": "A modern kitchen with an open refrigerator displaying neatly arranged food items, illuminated by overhead lights.",
                    "scene_tags": [
                        "kitchen",
                        "refrigerator",
                        "food storage",
                        "modern design",
                        "overhead lighting",
                        "organized",
                        "interior",
                        "daylight",
                        "clean",
                        "contemporary"
                    ],
                    "text_appearing": "LIEBHERR",
                    "weather": "interior"
                }
            },
            "module": "visual_understanding",
            "source": "storage://ubXZFeryA7zoF0N0hDgr",
            "tc_end": "00:00:02:12",
            "tc_start": "00:00:00:00",
            "thumbnail": null,
            "time_end": 2.48,
            "time_start": 0.0
        },
        ...
    ]
}
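
To consume per-shot structured results, a short sketch along the lines of the earlier examples (again with placeholder base URL, headers, and job ID) can read the structured_response of each segment and pick out fields such as scene_description and scene_tags:

import requests

API_BASE = "https://api.example.com"                   # placeholder
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # placeholder
job_id = "YOUR_JOB_ID"                                 # placeholder

results = requests.get(
    f"{API_BASE}/jobs/{job_id}/detailed-results/", headers=HEADERS
).json()

for shot in results["data"]:
    # structured_response is null when no schema was supplied, hence the fallback.
    structured = shot["meta"].get("structured_response") or {}
    print(f"{shot['tc_start']} - {shot['tc_end']}: {structured.get('scene_description')}")
    print("  tags:", ", ".join(structured.get("scene_tags", [])))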