
Text Recognition

Module Description

Text Recognition detects and extracts printed text from images and video frames. It supports multiple languages, outputs structured OCR results (including layout and positional metadata), and can be used for indexing, searching and analytics of visual documents.

Module ID: text_recognition

Warning

Slow page loads on the DeepVA Frontend can occur during result visualization when processing long videos with the Text Recognition module.

DeepVA’s Text Recognition samples video at 4 frames per second, and the frontend fetches the full detailed-results payload, which includes bounding boxes, timestamps, and metadata for every detected text segment. For long videos this payload can become very large, which currently leads to very slow page loads.

We are currently addressing this issue by introducing an improved lazy-loading technique in the frontend. API users are not affected, since the endpoints support pagination.

Module Parameters

| Name | Type | Default | Description |
|------|------|---------|-------------|
| language | string | latin | The language or category of languages to be recognized. |
| threshold | number | 0.5 | Confidence threshold in the range 0.0 to 1.0; only predictions with a confidence higher than this value are returned. |
| mode | string | quality | Quality vs. performance mode for accuracy and speed optimization. Available values: quality, performance. |

Example

The following example demonstrates processing only the final portion of the provided video (seconds 490 to 596) by specifying a processing range, in order to extract the text of the credits.

Send the following JSON as the request body via POST to the /jobs/ endpoint:

{
  "sources": [
    "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
  ],
  "modules": {
    "text_recognition": {
      "language": "latin",
      "threshold": 0.8,
      "range": {
        "time_start": 490,
        "time_end": 596
      }
    }
  }
}
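
As a sketch, the request can also be submitted with a few lines of Python. The base URL, the authentication scheme, and the response field holding the job ID below are placeholders for illustration; use the values from your DeepVA account and the API reference.

import requests

# Placeholder values for illustration; replace with your actual
# DeepVA API endpoint and credentials.
API_BASE = "https://api.deepva.com/api/v1"  # assumed base URL
API_KEY = "YOUR_API_KEY"

job_request = {
    "sources": [
        "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
    ],
    "modules": {
        "text_recognition": {
            "language": "latin",
            "threshold": 0.8,
            "range": {"time_start": 490, "time_end": 596}
        }
    }
}

response = requests.post(
    f"{API_BASE}/jobs/",
    json=job_request,
    headers={"Authorization": f"Key {API_KEY}"}  # assumed auth scheme
)
response.raise_for_status()
job_id = response.json().get("id")  # assumed response field holding the job ID
print("Created job:", job_id)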

To retrieve the results, request the /jobs/{JOB_ID}/detailed-results/ endpoint. The response looks like this:

{
  "data": [
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "61f49b68-81cc-4bc6-b4e1-75efa993cea7",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.05416666666666667,
          "w": 0.365625,
          "x": 0.31953125,
          "y": 0.23194444444444445
        },
        "confidence": 0.97308,
        "indexed_identity": null,
        "text": "Peach Open Movie Team"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "9cd3117e-6f60-48ca-b1d7-827fab72ed51",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.03888888888888889,
          "w": 0.26796875,
          "x": 0.23984375,
          "y": 0.37222222222222223
        },
        "confidence": 0.98117,
        "indexed_identity": null,
        "text": "SACHA GOEDEGEBURE"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "c070a711-f993-431a-81eb-755af9e7929d",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.041666666666666664,
          "w": 0.09921875,
          "x": 0.55703125,
          "y": 0.37222222222222223
        },
        "confidence": 0.998,
        "indexed_identity": null,
        "text": "Director"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "ba1b3abb-7774-4074-9057-2f581c428a71",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.03888888888888889,
          "w": 0.26875,
          "x": 0.2390625,
          "y": 0.4375
        },
        "confidence": 0.9822,
        "indexed_identity": null,
        "text": "ANDREAS GORALCZYK"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    ...
  ]
}
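
As a minimal sketch, the recognized credits can be extracted from such a response with a few lines of Python, assuming the JSON has already been fetched and parsed (e.g. via response.json()). For long videos, the endpoint's pagination should be used to fetch the results page by page; see the API reference for the exact query parameters.

def print_recognized_text(results, min_confidence=0.9):
    """Print each recognized text segment with its start timecode and confidence."""
    for item in results["data"]:
        meta = item["meta"]
        if meta["confidence"] >= min_confidence:
            print(f'{item["tc_start"]}  {meta["confidence"]:.3f}  {meta["text"]}')

# Example output for the response above:
# 00:08:10:12  0.973  Peach Open Movie Team
# 00:08:10:12  0.981  SACHA GOEDEGEBURE
# 00:08:10:12  0.998  Director
# 00:08:10:12  0.982  ANDREAS GORALCZYK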

Each detailed result element contains the text recognition results inside its meta field, which includes the following:

| Field | Description |
|-------|-------------|
| text | The detected text string. |
| bounding_box | The text region (x, y, width, height) in relative coordinates (0.0 - 1.0). |
| confidence | The confidence score (0.0 - 1.0) of the prediction. |
| indexed_identity | Not used by the Text Recognition module; the expected value is null. |
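
Because the bounding box uses relative coordinates, it must be scaled by the source frame dimensions to obtain pixel values. The following sketch shows the conversion; the 1280x720 resolution is only an assumed example, use the actual resolution of your source video.

def to_pixels(bounding_box, frame_width, frame_height):
    """Convert a relative bounding box (0.0 - 1.0) to pixel coordinates."""
    return {
        "x": round(bounding_box["x"] * frame_width),
        "y": round(bounding_box["y"] * frame_height),
        "w": round(bounding_box["w"] * frame_width),
        "h": round(bounding_box["h"] * frame_height),
    }

# Example with the first detection above, assuming a 1280x720 source frame:
box = {"h": 0.0542, "w": 0.3656, "x": 0.3195, "y": 0.2319}
print(to_pixels(box, 1280, 720))  # {'x': 409, 'y': 167, 'w': 468, 'h': 39}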

Also see the definition of the Detailed Results object.