Using the OpenAI Transcription Engine to Generate Subtitles

Integrating the OpenAI speech-to-text engine enables automatically generated subtitles

Author: Filip Milovanovic, Post-production Expert, ELEMENTS
Category: Workflow

Having access to transcribed media can prove invaluable. Not only can you automatically create subtitles, but you can also make the content of your footage easily searchable. Instead of having to watch an entire video to find the term “healthy food”, you can simply search for it and then easily jump to the corresponding timecode. Transcribing your archives gives you unprecedented access to specific phrases in a matter of seconds.

The flexibility of the ELEMENTS Automation Engine means that the integration of the OpenAI Transcription Engine is extremely easy. With the powerful Search and Subtitle functions of the Media Library, you’ll be able to elevate your workflows to the next level. In this blog, we’ll highlight our integration of the OpenAI Transcription Engine, hopefully inspiring you to add it to your workflows.

About the OpenAI Whisper Model

Whisper is an automatic speech recognition (ASR) system from OpenAI, trained on 680,000 hours of multilingual and multitask supervised data. OpenAI states that Whisper “approaches human-level robustness and accuracy in English speech recognition,” and our own tests have produced very encouraging results.

Languages currently supported for the transcription endpoint:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Māori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

Integration

Integrating the OpenAI transcription engine into the ELEMENTS environment can be divided into three main parts: the OpenAI Client, the Automation Engine, and the workflow integration.

OpenAI Client

To run an on-premise transcription based on the OpenAI Whisper model, a Linux Client running Docker is required. The role of the client can be assumed by either an ELEMENTS Worker or a computer of your choice connected to the network.
Using the on-premise engine has the following benefits:

  • Footage never leaves the company
  • There is no need to pay for cloud credits for in-cloud processing
  • There is no need for an internet connection
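To give a rough idea of what the on-premise engine does under the hood, here is a minimal sketch built on the open-source whisper Python package (assumptions: it is installed via pip install openai-whisper, ffmpeg is on the PATH, and the file names are placeholders). The Dockerised client that ships with ELEMENTS handles this for you, and its internals may differ.

```python
import whisper

def to_srt_timestamp(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Load a Whisper model; larger models trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a proxy file; Whisper returns timed segments.
result = model.transcribe("interview_proxy.mp4")

# Write the segments out as a standard .srt file.
with open("interview_proxy.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n")
        srt.write(f"{seg['text'].strip()}\n\n")
```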

Alternatively, the transcription can be realised through OpenAI’s cloud-based service. To use this method, the user needs an OpenAI account with sufficient credit; the API key from that account connects it to the ELEMENTS Automation Engine.

Automation Engine

The ELEMENTS Automation Engine is an integral part of every ELEMENTS system and helps users easily build and execute any number of chained tasks. It includes a wide range of pre-implemented tasks such as filesystem operations, storage management, cloud actions, and notifications, as well as the option to execute custom scripts.

When an on-premise transcription job is triggered for one or more assets, the Automation Engine passes the asset proxies to the OpenAI Client for analysis. Once the OpenAI Client has finished the transcription, the Automation Engine automatically places the resulting .srt file in the same location as the original video asset. A Media Library scan is then triggered, and upon completion, the .srt file is automatically linked to the original asset.
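Conceptually, the on-premise sequence reduces to something like the sketch below. Every helper in it (find_proxy, transcribe_on_client, trigger_media_library_scan) is a hypothetical placeholder for behaviour the Automation Engine provides internally, not part of any ELEMENTS API.

```python
from pathlib import Path

def find_proxy(asset: Path) -> Path:
    # Assumption: proxies sit next to the asset with a "_proxy" suffix.
    return asset.with_name(f"{asset.stem}_proxy{asset.suffix}")

def transcribe_on_client(proxy: Path) -> str:
    # Placeholder: the proxy would be handed to the Dockerised Whisper
    # client (see the sketch above), which returns SRT text.
    raise NotImplementedError

def trigger_media_library_scan(folder: Path) -> None:
    # Placeholder: ask the Media Library to rescan the folder so the
    # new .srt file is picked up and linked.
    pass

def run_onprem_job(asset: Path) -> None:
    srt_text = transcribe_on_client(find_proxy(asset))
    # The .srt shares the asset's base name and location, which is what
    # lets the Media Library link it to the original video after the scan.
    asset.with_suffix(".srt").write_text(srt_text, encoding="utf-8")
    trigger_media_library_scan(asset.parent)
```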

When an in-cloud transcription job is triggered for one or multiple assets, a temporary audio-only file is created. This audio file is sent to the OpenAI API for analysis. The Automation Engine then downloads the resulting .srt file to the location of the original video asset, the temporary audio file is deleted, and a Media Library scan is triggered.
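For the in-cloud variant, the round trip might look roughly like the following sketch (assumptions: ffmpeg is on the PATH, the openai Python package is installed, the API key is exported as OPENAI_API_KEY, and interview.mov is a placeholder file name). The Automation Engine performs these steps for you.

```python
import subprocess
from pathlib import Path
from openai import OpenAI

asset = Path("interview.mov")
temp_audio = asset.with_suffix(".mp3")

# Extract a temporary audio-only file; uploading audio instead of the
# full video keeps the transfer small. Assumes ffmpeg is on the PATH.
subprocess.run(
    ["ffmpeg", "-y", "-i", str(asset), "-vn", "-acodec", "libmp3lame", str(temp_audio)],
    check=True,
)

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Request the transcript directly in SRT format from the hosted Whisper model.
with temp_audio.open("rb") as audio_file:
    srt_text = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt",
    )

# Place the .srt next to the original asset, then clean up the temp file.
asset.with_suffix(".srt").write_text(srt_text, encoding="utf-8")
temp_audio.unlink()
```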

Workflow Integration

The transcription automation job can be triggered in one of the following three ways:

1. Through the Media Library
Select any number of assets in the Media Library and start the transcription job from the drop-down menu. Alternatively, the Automation can be displayed as a button in the Media Library’s button bar, making the job easier to locate.

2. Through the context menu in macOS Finder and Windows Explorer
Start the transcription job straight from the context menu of macOS Finder or Windows Explorer, from any workspace mounted via the ELEMENTS Client, using the Advanced tab of the Automation Jobs settings.

3. Through the scheduler
Transcription jobs can also be scheduled to start at fixed intervals: every hour, day, month, etc. Crontab support allows users to specify further when to run the job, for instance only on specific days of the week (e.g. the Crontab expression 0 6 * * 1-5 runs the job at 6 a.m., Monday to Friday).

Subtitle Functionality

The Transcription Automation job returns an .srt file with the same name as the original transcribed file (interview.mov, for example, yields interview.srt). Because the names match, the Media Library automatically recognises the subtitle and the video file as a set and links them together.

Subtitle Submenu

In the Subtitle submenu, a list of all linked subtitles is displayed. Click an entry in the list to display the chosen subtitles as an overlay in the preview window. Clicking a line in the displayed subtitles makes the video playback jump to the corresponding timecode.

Subtitle Editing

The Media Library in WebUI version 23.10 includes expanded subtitle support. Opening a subtitle file in the Media Library displays the content of the file and allows the user to edit both the subtitle lines and their corresponding timecodes.

Subtitle Search

The Media Library search function finds subtitles based on one or more search criteria. Subtitle files can be found by searching for either the file name or the file content, as well as by using other Media Library search criteria such as custom metadata fields, modification dates, modifying users, etc. If a subtitle file matches the search criteria, both the subtitle and the corresponding video asset are displayed in the search result list.

Conclusion

Being able to create automatic transcriptions can be a true game-changer for your workflow. It allows you to make the content of your footage searchable and provides, in most cases, far more information about the footage than metadata tags ever could.

When paired with the Media Library, the automatic transcription feature opens up a world of possibilities. For example, even after footage has been taken offline and moved into the archive, the Media Library still allows you to preview it in the proxy quality of your choice, and it automatically connects the footage with its corresponding transcript. To find any footage on the system, the powerful Search function lets you combine any number of search criteria to pinpoint exactly what you need. This allows you to search your entire pool of transcribed media for any phrase that may have been mentioned, e.g. all mentions of Mary Poppins in the last six months.

The transcription can be done on an on-premise Linux server that is connected to the network and is running the OpenAI Whisper model. Alternatively, the model can also be deployed on an ELEMENTS Worker. Doing on-premise transcription offers two major benefits. The first is that your footage won’t need to leave your company to be transcribed, and the second is that no internet connection is required.

