Text Recognition
Module Description
Text Recognition detects and extracts printed text from images and video frames. It supports multiple languages, outputs structured OCR results (including layout and positional metadata), and can be used for indexing, searching and analytics of visual documents.
Module ID: text_recognition
Warning
Slow page loads on the DeepVA Frontend can occur during result visualization when processing long videos with the Text Recognition module.
DeepVA’s Text Recognition samples video at 4 frames/second, and the frontend fetches the full detailed-results payload, which includes bounding boxes, timestamps, and metadata for every detected text segment. For long videos this payload can become very large, which currently causes very slow page loads.
We are addressing this issue by introducing lazy loading in the frontend. API users are not affected, since the endpoints support pagination.
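To get a feeling for the payload size described above, the number of detailed-result entries can be estimated from the sampling rate. This is only a back-of-the-envelope sketch: the 4 frames/second rate comes from the text above, while `estimate_result_count` and the average number of text segments per sampled frame are hypothetical and depend entirely on the content.

```python
SAMPLE_RATE_FPS = 4  # sampling rate stated in the warning above


def estimate_result_count(duration_s: float, segments_per_frame: float) -> int:
    """Rough number of detailed-result entries for a video.

    segments_per_frame is an assumed average, not a value reported by the API.
    """
    sampled_frames = duration_s * SAMPLE_RATE_FPS
    return round(sampled_frames * segments_per_frame)


# A two-hour video with an assumed average of 5 text segments per sampled frame
print(estimate_result_count(2 * 60 * 60, 5))  # 144000
```

Every one of these entries carries its own bounding box, timestamps and metadata, which is why fetching all of them at once is slow.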
Module Parameters
Name | Type | Default | Description |
---|---|---|---|
language | string | latin | The language or category of languages |
threshold | number | 0.5 | A confidence threshold in the range 0.0 to 1.0. Predictions with lower confidence are not returned. |
mode | string | quality | Quality vs. performance mode for accuracy and speed optimization. Available values: quality, performance. |
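The constraints in the table above can be checked on the client before a job is submitted. The following is a minimal sketch; `text_recognition_params` is a hypothetical helper, not part of any DeepVA client library, and it only encodes the defaults and value ranges listed in the table.

```python
ALLOWED_MODES = {"quality", "performance"}  # values listed in the table above


def text_recognition_params(language="latin", threshold=0.5, mode="quality"):
    """Return a parameter dict for the text_recognition module, or raise ValueError."""
    if not isinstance(language, str) or not language:
        raise ValueError("language must be a non-empty string")
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a number between 0.0 and 1.0")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    return {"language": language, "threshold": threshold, "mode": mode}


print(text_recognition_params(threshold=0.8))
```

The helper does not validate the `language` value itself, because the list of supported languages is not given here.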
Example
The following example uses the processing range to restrict processing to the final part of the provided video (seconds 490 to 596) in order to extract the text of the credits.
Send the following JSON as the request body via POST to the /jobs/ endpoint:
{
"sources": [
"https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
],
"modules": {
"text_recognition": {
"language": "latin",
"threshold": 0.8,
"range": {
"time_start": 490,
"time_end": 596
}
}
}
}
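The request body above could be assembled and sent from Python roughly as follows. This is a sketch under stated assumptions: `API_BASE`, `API_KEY` and the `Authorization` header format are placeholders that are not given in this section, and `build_job_request` is a hypothetical helper. Only the `/jobs/` path and the body structure are taken from the example.

```python
import json
import urllib.request

API_BASE = "https://example.invalid/api/v1"  # placeholder, replace with your API base URL
API_KEY = "YOUR_API_KEY"                     # placeholder credential


def build_job_request(source_url, time_start, time_end, threshold=0.8, language="latin"):
    """Build (but do not send) a POST request for a text_recognition job."""
    body = {
        "sources": [source_url],
        "modules": {
            "text_recognition": {
                "language": language,
                "threshold": threshold,
                "range": {"time_start": time_start, "time_end": time_end},
            }
        },
    }
    return urllib.request.Request(
        url=f"{API_BASE}/jobs/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Key {API_KEY}",  # assumed scheme, check your account's auth docs
        },
        method="POST",
    )


request = build_job_request(
    "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
    time_start=490,
    time_end=596,
)
# To actually submit the job:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect or log before any processing is triggered.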
To get the results, request the /jobs/{JOB_ID}/detailed-results/ endpoint. The response looks like this:
{
"data": [
{
"detections": [],
"frame_end": 11772,
"frame_start": 11772,
"id": "61f49b68-81cc-4bc6-b4e1-75efa993cea7",
"media_type": "video",
"meta": {
"bounding_box": {
"h": 0.05416666666666667,
"w": 0.365625,
"x": 0.31953125,
"y": 0.23194444444444445
},
"confidence": 0.97308,
"indexed_identity": null,
"text": "Peach Open Movie Team"
},
"module": "text_recognition",
"source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
"tc_end": "00:08:10:12",
"tc_start": "00:08:10:12",
"thumbnail": null,
"time_end": 490.5,
"time_start": 490.5
},
{
"detections": [],
"frame_end": 11772,
"frame_start": 11772,
"id": "9cd3117e-6f60-48ca-b1d7-827fab72ed51",
"media_type": "video",
"meta": {
"bounding_box": {
"h": 0.03888888888888889,
"w": 0.26796875,
"x": 0.23984375,
"y": 0.37222222222222223
},
"confidence": 0.98117,
"indexed_identity": null,
"text": "SACHA GOEDEGEBURE"
},
"module": "text_recognition",
"source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
"tc_end": "00:08:10:12",
"tc_start": "00:08:10:12",
"thumbnail": null,
"time_end": 490.5,
"time_start": 490.5
},
{
"detections": [],
"frame_end": 11772,
"frame_start": 11772,
"id": "c070a711-f993-431a-81eb-755af9e7929d",
"media_type": "video",
"meta": {
"bounding_box": {
"h": 0.041666666666666664,
"w": 0.09921875,
"x": 0.55703125,
"y": 0.37222222222222223
},
"confidence": 0.998,
"indexed_identity": null,
"text": "Director"
},
"module": "text_recognition",
"source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
"tc_end": "00:08:10:12",
"tc_start": "00:08:10:12",
"thumbnail": null,
"time_end": 490.5,
"time_start": 490.5
},
{
"detections": [],
"frame_end": 11772,
"frame_start": 11772,
"id": "ba1b3abb-7774-4074-9057-2f581c428a71",
"media_type": "video",
"meta": {
"bounding_box": {
"h": 0.03888888888888889,
"w": 0.26875,
"x": 0.2390625,
"y": 0.4375
},
"confidence": 0.9822,
"indexed_identity": null,
"text": "ANDREAS GORALCZYK"
},
"module": "text_recognition",
"source": "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4",
"tc_end": "00:08:10:12",
"tc_start": "00:08:10:12",
"thumbnail": null,
"time_end": 490.5,
"time_start": 490.5
},
...
]
}
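The timing fields in the response above are consistent with each other, which can be verified with a small conversion. The sketch below assumes that `tc_start` uses the `HH:MM:SS:FF` layout without drop-frame counting and that the video runs at 24 frames/second; the frame rate is not stated in the response and is inferred from `frame_start` divided by `time_start`. `timecode_to_seconds` is a hypothetical helper.

```python
def timecode_to_seconds(tc: str, fps: float) -> float:
    """Convert an HH:MM:SS:FF timecode (assumed non-drop-frame) to seconds."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps


# Values taken from the first element of the response above
frame_start = 11772
time_start = 490.5
fps = frame_start / time_start  # inferred frame rate of the source video

print(fps)                                       # 24.0
print(timecode_to_seconds("00:08:10:12", fps))   # 490.5
```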
Each detailed result element contains the module-specific text recognition results in its meta field, which has the following fields:
Field | Description |
---|---|
text | The detected text string |
bounding_box | The text region (x, y, width, height) in relative coordinates (0.0 - 1.0). |
confidence | The confidence score (0.0 - 1.0) of the prediction |
indexed_identity | Not used by the text recognition module. The value is expected to be null. |
Also see the definition of the Detailed Results object.
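As an illustration of how the meta fields might be consumed, the sketch below collects the recognized text per element and converts a relative bounding box into pixel coordinates. The helpers `extract_texts` and `bbox_to_pixels` are hypothetical, the frame size of 1280x720 is an assumption about the sample video, and treating x and y as the top-left corner of the box is also an assumption that this section does not confirm.

```python
def extract_texts(response: dict, min_confidence: float = 0.0) -> list:
    """Return (time_start, text) pairs for text_recognition results."""
    return [
        (item["time_start"], item["meta"]["text"])
        for item in response["data"]
        if item["module"] == "text_recognition"
        and item["meta"]["confidence"] >= min_confidence
    ]


def bbox_to_pixels(bounding_box: dict, frame_width: int, frame_height: int) -> tuple:
    """Scale a relative bounding box to pixels as (x, y, width, height).

    Assumes x and y describe the top-left corner of the box.
    """
    return (
        round(bounding_box["x"] * frame_width),
        round(bounding_box["y"] * frame_height),
        round(bounding_box["w"] * frame_width),
        round(bounding_box["h"] * frame_height),
    )


# Shortened copy of the first element from the example response
sample = {
    "data": [
        {
            "module": "text_recognition",
            "time_start": 490.5,
            "meta": {
                "bounding_box": {
                    "h": 0.05416666666666667,
                    "w": 0.365625,
                    "x": 0.31953125,
                    "y": 0.23194444444444445,
                },
                "confidence": 0.97308,
                "indexed_identity": None,
                "text": "Peach Open Movie Team",
            },
        }
    ]
}

print(extract_texts(sample, min_confidence=0.9))
box = sample["data"][0]["meta"]["bounding_box"]
print(bbox_to_pixels(box, 1280, 720))  # (409, 167, 468, 39)
```

Because the coordinates are relative, the same result can be drawn onto a thumbnail or a full-resolution frame by passing the respective frame size.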