
Text Recognition

Module Description

Text Recognition detects and extracts printed text from images and video frames. It supports multiple languages, outputs structured OCR results (including layout and positional metadata), and can be used for indexing, searching and analytics of visual documents.

Module ID: text_recognition

Warning

Slow page loads on the DeepVA Frontend can occur during result visualization when processing long videos with the Text Recognition module.

DeepVA’s Text Recognition samples video at 4 frames per second, and the frontend fetches the full detailed-results payload, which includes bounding boxes, timestamps, and metadata for every detected text segment. For long videos this payload can become very large, which currently leads to very slow page loads.

We are currently addressing this issue by introducing an improved lazy-loading technique in the frontend. API users are not affected, since the endpoints support pagination.

Module Parameters

| Name | Type | Default | Description |
|------|------|---------|-------------|
| language | string | latin | The language or category of languages to be recognized. |
| threshold | number | 0.5 | Confidence threshold in the range 0.0 to 1.0; only predictions with a confidence higher than this value are returned. |
| mode | string | quality | Quality vs. performance mode for accuracy and speed optimization. Available values: quality, performance. |

Example

The following example demonstrates processing only the final portion of the provided video (seconds 490 to 596) by specifying a processing range, in order to extract the text of the credits.

Send the following JSON as the request body via POST to the /jobs/ endpoint:

{
  "sources": [
    "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
  ],
  "modules": {
    "text_recognition": {
      "language": "latin",
      "threshold": 0.8,
      "range": {
        "time_start": 490,
        "time_end": 596
      }
    }
  }
}
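
As a sketch, the request can also be submitted with a few lines of Python. The base URL, the authentication scheme, and the response field holding the job ID below are placeholders for illustration; use the values from your DeepVA account and the API reference.

import requests

# Placeholder values for illustration; replace with your actual
# DeepVA API endpoint and credentials.
API_BASE = "https://api.deepva.com/api/v1"  # assumed base URL
API_KEY = "YOUR_API_KEY"

job_request = {
    "sources": [
        "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
    ],
    "modules": {
        "text_recognition": {
            "language": "latin",
            "threshold": 0.8,
            "range": {"time_start": 490, "time_end": 596}
        }
    }
}

response = requests.post(
    f"{API_BASE}/jobs/",
    json=job_request,
    headers={"Authorization": f"Key {API_KEY}"}  # assumed auth scheme
)
response.raise_for_status()
job_id = response.json().get("id")  # assumed response field holding the job ID
print("Created job:", job_id)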

To retrieve the results, request the /jobs/{JOB_ID}/detailed-results/ endpoint. The response looks like this:

{
  "data": [
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "61f49b68-81cc-4bc6-b4e1-75efa993cea7",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.05416666666666667,
          "w": 0.365625,
          "x": 0.31953125,
          "y": 0.23194444444444445
        },
        "confidence": 0.97308,
        "indexed_identity": null,
        "text": "Peach Open Movie Team"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "9cd3117e-6f60-48ca-b1d7-827fab72ed51",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.03888888888888889,
          "w": 0.26796875,
          "x": 0.23984375,
          "y": 0.37222222222222223
        },
        "confidence": 0.98117,
        "indexed_identity": null,
        "text": "SACHA GOEDEGEBURE"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "c070a711-f993-431a-81eb-755af9e7929d",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.041666666666666664,
          "w": 0.09921875,
          "x": 0.55703125,
          "y": 0.37222222222222223
        },
        "confidence": 0.998,
        "indexed_identity": null,
        "text": "Director"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    {
      "detections": [],
      "frame_end": 11772,
      "frame_start": 11772,
      "id": "ba1b3abb-7774-4074-9057-2f581c428a71",
      "media_type": "video",
      "meta": {
        "bounding_box": {
          "h": 0.03888888888888889,
          "w": 0.26875,
          "x": 0.2390625,
          "y": 0.4375
        },
        "confidence": 0.9822,
        "indexed_identity": null,
        "text": "ANDREAS GORALCZYK"
      },
      "module": "text_recognition",
      "source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
      "tc_end": "00:08:10:12",
      "tc_start": "00:08:10:12",
      "thumbnail": null,
      "time_end": 490.5,
      "time_start": 490.5
    },
    ...
  ]
}
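
As a minimal sketch, the recognized credits can be extracted from such a response with a few lines of Python, assuming the JSON has already been fetched and parsed (e.g. via response.json()). For long videos, the endpoint's pagination should be used to fetch the results page by page; see the API reference for the exact query parameters.

def print_recognized_text(results, min_confidence=0.9):
    """Print each recognized text segment with its start timecode and confidence."""
    for item in results["data"]:
        meta = item["meta"]
        if meta["confidence"] >= min_confidence:
            print(f'{item["tc_start"]}  {meta["confidence"]:.3f}  {meta["text"]}')

# Example output for the response above:
# 00:08:10:12  0.973  Peach Open Movie Team
# 00:08:10:12  0.981  SACHA GOEDEGEBURE
# 00:08:10:12  0.998  Director
# 00:08:10:12  0.982  ANDREAS GORALCZYK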

Each detailed result element contains the text recognition results inside its meta field, which includes the following:

| Field | Description |
|-------|-------------|
| text | The detected text string. |
| bounding_box | The text region (x, y, width, height) in relative coordinates (0.0 - 1.0). |
| confidence | The confidence score (0.0 - 1.0) of the prediction. |
| indexed_identity | Not used by the Text Recognition module; the expected value is null. |
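
Because the bounding box uses relative coordinates, it must be scaled by the source frame dimensions to obtain pixel values. The following sketch shows the conversion; the 1280x720 resolution is only an assumed example, use the actual resolution of your source video.

def to_pixels(bounding_box, frame_width, frame_height):
    """Convert a relative bounding box (0.0 - 1.0) to pixel coordinates."""
    return {
        "x": round(bounding_box["x"] * frame_width),
        "y": round(bounding_box["y"] * frame_height),
        "w": round(bounding_box["w"] * frame_width),
        "h": round(bounding_box["h"] * frame_height),
    }

# Example with the first detection above, assuming a 1280x720 source frame:
box = {"h": 0.0542, "w": 0.3656, "x": 0.3195, "y": 0.2319}
print(to_pixels(box, 1280, 720))  # {'x': 409, 'y': 167, 'w': 468, 'h': 39}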

Also see the definition of the Detailed Results object.