Premiere Pro’s awesome new Speech to Text tool

Examining an exciting new addition to Adobe Premiere Pro’s toolkit

Author: Filip Milovanovic, Post-production expert, ELEMENTS
Category: Collaboration

The July 2021 release of Premiere Pro 15.4, much like typical minor version updates, features a small selection of new and very useful functions and fixes, which include:

  • Premiere now allows you to easily convert Legacy Titles into Source Graphics
  • Text layers can now have multiple shadows
  • Automatic audio device switching on Windows
  • Two to three times faster Scene Edit Detection
  • Team projects get faster saving, relinking and a progressive project loading function
  • Colour improvements

Apart from these updates, version 15.4 also introduces one extremely interesting feature: Speech to Text.

Adobe Sensei

Introduced in 2016, Adobe Sensei is the artificial intelligence and machine learning platform developed to support the ecosystem of Adobe’s applications. The capabilities of the Adobe Sensei platform are continuously being expanded to include many valuable tools, such as:

  • The After Effects native Content-Aware Fill can make selected areas of the video “disappear” by filling them with the results of the sampled surroundings.
  • In Premiere Pro, the Auto Reframe tool lets you automatically reframe the video when the aspect ratio of the sequence is changed. This feature makes it easy to modify videos for the unconventional aspect ratios used in social media without losing the focus of the action.
  • In Audition, the Autoducking feature allows you to automatically adjust the volume of one audio track according to the level of another. This makes it easy to mix music and dialogue.

Interested in taking a look under the hood of Adobe Sensei ML Framework? Then read this interesting Adobe Tech Blog, which explains the inner workings of Adobe Sensei.

Speech to Text

This latest Adobe Sensei implementation looks set to greatly enhance transcribing and captioning workflows whilst simplifying interview-filled projects. The easy-to-use tool can be found under Window > Text and is included in the Creative Cloud all apps and Premiere Pro single app subscriptions. The “Transcribe sequence” button opens a settings menu where you set the parameters for the speech recognition. The user can decide whether a mix of all audio tracks or a specific audio track should be used, and whether the whole sequence or only a marked area should be analysed.

Presently, a user can choose from 13 different languages: English, UK English, Simplified Chinese, Traditional Chinese, Spanish, German, French, Japanese, Portuguese, Korean, Italian, Russian and Hindi. Clicking the Transcribe button prepares the audio and uploads it to the cloud for analysis. Once the process is complete, the transcription results are downloaded from the cloud and displayed in the Transcript window. This means that an internet connection is needed to use this tool.

Speakers

The tool can differentiate between speakers and lets the user name them in the transcript. When enabling this feature, the user is first presented with a legal notice.

Once the audio transcript has been completed, the user can easily assign names to different speakers.

Generate captions

In the Transcript tab, the newly created transcription can be used to create captions for the sequence in question. The captions can be exported into an SRT or simple text file. This export capability allows you to add the generated captions to YouTube and Facebook videos or any other external player that supports subtitles.
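To make the SRT export more concrete, here is a minimal Python sketch that reads SRT-formatted text and looks up the cues containing a search term. It relies only on the general SubRip layout (a cue number, a timing line in the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, then one or more text lines, with cues separated by a blank line). The names `Cue`, `parse_srt` and `find_term` are made up for this illustration, and the sample timings are invented; the exact file Premiere Pro writes may differ in details such as line breaks.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    index: int
    start: str  # e.g. "00:00:01,000"
    end: str
    text: str


# Timing line of a SubRip cue: start --> end
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(content: str) -> List[Cue]:
    """Split SRT text into cues; malformed blocks are skipped."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = TIMING.match(lines[1].strip())
        if not match:
            continue
        # Multi-line caption text is joined into a single string.
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0]), match.group(1), match.group(2), text))
    return cues


def find_term(cues: List[Cue], term: str) -> List[Cue]:
    """Return every cue whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return [cue for cue in cues if needle in cue.text.lower()]


# Hypothetical sample: the timings are invented for illustration.
SAMPLE = """1
00:00:01,000 --> 00:00:04,200
Keeping hard drive based storage
at optimal performance

2
00:00:04,200 --> 00:00:07,500
can be quite a challenge.
"""

if __name__ == "__main__":
    for cue in find_term(parse_srt(SAMPLE), "challenge"):
        print(cue.start, cue.text)
```

Because the format is this simple, an exported caption file is easy to check or post-process with a short script before it is uploaded to a video platform.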

Accuracy

As with any other AI transcription engine, 100% accuracy can’t be expected. The results are, however, quite precise. Here is a transcription example of the ELEMENTS BOLT presentation video, read by a native English speaker, with the corrections shown in parentheses.

Keeping harddrive (hard drive) based storage at optimal performance can be quite a challenge, and that is a blast from the past with the bent (brand) new elements gold Aleesha (. Unleash the) performance with a power of me (NVMe), a revolutionary storage technology that will take your workflow to the next level. With 200 terabytes of high performance shared storage in a single two USAC (2U chassis) experience, efficiency breakthroughs with technology 10 times faster than traditional SSD, ultra high bandwidth and Modie fragmentation (no defragmentation) need it (needed) ever, even when working in 8-K and higher while using only around one tenth of the power compared to a hydrant (Hard drive) system at the same performance, (.) introducing a game changer purpose built for media and entertainment, delivering unmatched performance in a shared storage infrastructure while decreasing your operational footprint substantially. The next milestone in our mission to provide human centered media storage. This cover (Discover) the brand new aliments Volt (ELEMENTS BOLT).

What is evident here is the engine’s inability to always use the correct punctuation, sometimes adding unnecessary commas and failing to recognise the end of a sentence. Compound words such as ultra-high and brand-new are often missing their hyphen, and, as is most often the case with speech-to-text engines, esoteric words such as NVMe and names are understandably often wrongly interpreted. The accuracy diminishes when an accent is introduced or when the voices of multiple speakers overlap.
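If you want to put a number on errors like these, one common measure is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the transcript into the correct text, divided by the number of words in the correct text. The Python sketch below is only an illustration of that general metric, not something Adobe provides or that was used for this article. The function name `word_error_rate` is made up, and the two sample strings are one fragment of the transcript above in its corrected and its transcribed form.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0

    # Classic dynamic-programming edit distance, computed row by row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing
            insertion = current[j - 1] + 1  # extra word in the transcript
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Fragment from the example above: corrected text vs. engine output.
REFERENCE = "compared to a hard drive system at the same performance"
HYPOTHESIS = "compared to a hydrant system at the same performance"

if __name__ == "__main__":
    print(f"WER: {word_error_rate(REFERENCE, HYPOTHESIS):.2f}")
```

Note that this simple version ignores punctuation and hyphenation, so the comma and hyphen issues described above would have to be compared separately.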

Summary

The new Speech to Text toolset is an amazing addition to the capabilities of Adobe Premiere Pro. The highest transcription accuracy is achieved with native speakers who have clear pronunciation, and the resulting errors in the transcription can be manually corrected with ease. Currently, 13 different languages are supported, with Adobe promising more to come. This new function can be extremely useful for projects containing many interviews, as a simple text search brings the editor to the precise timecode at which the searched term has been detected. It is just as useful for quick and easy caption generation, and the captions can even be exported to an SRT file. This Speech to Text function is included in the Creative Cloud all apps and the Premiere Pro single app subscriptions and, according to Adobe, doesn’t have any set limits for fair and reasonable usage by individual subscribers for their own projects.

However, there are some potential concerns when it comes to productions in which the footage needs to be tightly controlled. One of these is that the footage isn’t analysed locally, instead having to be uploaded to Adobe’s cloud services. Adobe is aware of these potential deal-breakers and addresses them in the following statements:

“Speech to Text has been developed with security in mind. User files are encrypted in transit and during the transcription process. As soon as a transcription is completed, the user files are deleted.”
As well as: “Speech to Text enables users to remain GDPR-compliant, as the transcription service is hosted on servers based in the European Union or the United States, depending on the user’s location.”

The second thing to keep in mind is that the editing station running Premiere Pro needs access to the internet in order to upload the audio for analysis. This is something that many post-production facilities try to avoid, as it can make the infrastructure vulnerable to attacks from the internet.

However, despite the above-mentioned drawbacks, this amazing free function of Adobe Premiere Pro can be extremely helpful during the editing process and will therefore undoubtedly find itself in the middle of many post-production workflows.
