A 1991 study by the BBC was the first to document regular complaints about hard-to-understand speech in movies. Loud background sound and music were identified as the cause of the intelligibility problems. However, the results of the study were not conclusive; only later did it emerge that personal preference and listening effort play a major role in speech intelligibility. The study also pointed out that the broadcast system in use at the time was incapable of transmitting an additional soundtrack with the speech at a higher volume. A lot has changed since then. As early as 2011, the BBC and Fraunhofer IIS carried out a public field test during the Wimbledon Championships that allowed viewers to personalize the dialogue level. This was the birth of dialogue enhancement as an “object-based” service facilitated by the broadcaster (see info box on next page).
Although object-based sound production is becoming increasingly important worldwide, most content today is still produced, transmitted and archived in channel-based formats. Until a few years ago, traditional model-based signal processing methods were used for dialogue separation in TV content; today, these approaches are being roundly outperformed by deep neural networks (DNNs). MPEG-H Dialog+ likewise uses state-of-the-art DNNs for dialogue separation and thus delivers the highest-quality dialogue personalization, including for legacy material. This approach has already proved itself: the technology was selected for the first national field tests in Germany in which DNNs enabled the dialogue personalization of TV content.
MPEG-H Dialog+ is a file-based dialogue separation technology developed at Fraunhofer IIS. At its core is a neural network that separates out the dialogue – in this case, a deep convolutional neural network. The network is trained on a specially prepared audio database containing data derived from real broadcast content supplied to Fraunhofer IIS by TV networks and production companies. The DNN training works on the basis of stems – that is, the component submixes of a production. Two stems are prepared: one for the dialogue and another for music and effects (M&E). These audio stems are edited manually to exclude all parts in which non-speech sounds are present in the dialogue stem or dialogue is present in the M&E stem. This prevents training errors whereby, for example, sounds could later be misinterpreted as speech and separated.
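The stem-based data preparation described above can be sketched in a few lines of Python. This is a minimal illustration, not Fraunhofer's actual pipeline; the function name, the RMS-based scaling and the SNR parameter are assumptions made for the example.

```python
import numpy as np

def make_training_pair(dialogue: np.ndarray, m_and_e: np.ndarray,
                       snr_db: float = 0.0):
    """Combine a clean dialogue stem and an M&E stem into one training
    example: the mixture is the network input, the two stems are the
    separation targets. (Illustrative sketch, not the real pipeline.)"""
    rms = lambda x: np.sqrt(np.mean(x ** 2) + 1e-12)
    # Scale the background so the dialogue-to-background ratio
    # matches the requested SNR (simple RMS-based scaling).
    gain = rms(dialogue) / (rms(m_and_e) * 10 ** (snr_db / 20))
    background = m_and_e * gain
    mixture = dialogue + background
    return mixture, {"dialogue": dialogue, "m_and_e": background}
```

Varying the mixing SNR across examples is one simple way to broaden the training distribution, in the spirit of the genre variety the text describes.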
Having received the mix of the components as an input signal, the neural network automatically separates them so that they are available as individual elements at the output and can be remixed. The goal is for these elements to resemble the original stems as closely as possible. For the quality and robustness of the model, it is essential to have a very broad variety of training data covering the full spectrum of broadcast content. Both female and male speakers are represented in the data, which comes from all kinds of genres – from nature documentaries to sports programs to movies. The training data has been predominantly German to date, but initial projects in other languages indicate that Dialog+ can deliver good results there as well.
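To make the separation idea concrete, the following sketch computes an "oracle" spectral mask from known stems and applies it to the mixture – the kind of quantity a separation DNN must learn to predict from the mixture alone. It is a toy illustration in plain NumPy, not the MPEG-H Dialog+ network; the single-window FFT and the ratio-mask formulation are simplifying assumptions.

```python
import numpy as np

def oracle_separate(mixture, dialogue, m_and_e):
    """Oracle version of source separation: with the true stems known,
    compute a per-frequency soft mask and apply it to the mixture to
    recover each component. A trained DNN has to estimate such a mask
    (or the components directly) from the mixture alone."""
    M = np.fft.rfft(mixture)
    D = np.fft.rfft(dialogue)
    B = np.fft.rfft(m_and_e)
    mask = np.abs(D) / (np.abs(D) + np.abs(B) + 1e-12)  # ideal ratio mask
    est_dialogue = np.fft.irfft(mask * M, n=len(mixture))
    est_background = np.fft.irfft((1 - mask) * M, n=len(mixture))
    return est_dialogue, est_background
```

During training, the network's outputs are compared against the stems (e.g. with a mean-squared-error loss), so minimizing that loss is exactly what "resemble the stems as closely as possible" means in practice.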
Dialogue separation makes it possible to separate the dialogue and non-dialogue signals in existing mixes. But what do you do with the separated components to get a new audio mix that is easier to understand? This is where the automatic remixing of MPEG-H Dialog+ comes into play: it can combine static background noise reduction with dynamic, time-variant background noise reduction.
The static reduction lowers the level of the separated background sound by a set decibel value across the entire signal. This has several advantages: one is that the general sound design and music, which many people find intrusively loud, become quieter; another is that it makes it possible to distinguish clearly and quickly between the original mix and the Clear Speech version. However, background noise reduction isn't strictly necessary in the absence of dialogue. Indeed, it can even spoil the esthetics and artistic intent or suppress sounds with narrative significance. In such cases, it makes sense to lower the background sound level only when the dialogue signal is present, and only as much as is absolutely necessary. Helpfully, Fraunhofer IIS also has a solution for this: the Adaptive Background Attenuation algorithm, which automatically generates a dynamic new mix by means of a few easily adjustable parameters.
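A minimal sketch of such a remix stage might look as follows. The parameter names, attenuation values, frame-based speech-activity test and hard gain switching are all illustrative assumptions; the actual Adaptive Background Attenuation algorithm is not described here, and a real implementation would smooth the gain over time to avoid audible steps.

```python
import numpy as np

def remix(dialogue, background, static_db=-6.0, adaptive_db=-9.0,
          frame=1024, activity_threshold=1e-3):
    """Sketch of the remixing step: a constant attenuation is always
    applied to the separated background, and an extra, time-variant
    attenuation is applied only in frames where dialogue is active.
    All parameters are illustrative."""
    db_to_lin = lambda db: 10 ** (db / 20)
    gain = np.full(len(background), db_to_lin(static_db))
    for start in range(0, len(dialogue), frame):
        seg = dialogue[start:start + frame]
        # Crude speech-activity check on the separated dialogue signal.
        if np.sqrt(np.mean(seg ** 2)) > activity_threshold:
            gain[start:start + frame] *= db_to_lin(adaptive_db)
    return dialogue + background * gain
```

Setting `adaptive_db` to 0 recovers a purely static Clear Speech remix; setting `static_db` to 0 attenuates the background only while dialogue is present, as the text suggests for esthetically sensitive material.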
Is speech intelligibility really such a big problem? The short answer is: yes! Many people complain about not understanding dialogue. For this reason, the German public broadcaster Westdeutscher Rundfunk (WDR) and Fraunhofer IIS cooperated on an online test in 2020. While watching content in the ARD Player, over 2,000 participants were able to switch between the original mix and a version with reduced background noise – Clear Speech, as ARD calls it. Afterward, the participants answered an online survey. It turned out that 68 percent of all participants had problems understanding TV dialogue either often or very often. This problem intensifies with age: a full 90 percent of participants over the age of 60 reported difficulty understanding TV dialogue. The option of switching to a Clear Speech mix appealed to 83 percent of participants – even those who said they had no or few problems with speech intelligibility. This shows that clearer speech is not a fringe concern: the desire for it, or at least for the option, permeates the entire audience.
Subsequently, the Fraunhofer IIS partners WDR and Bayerischer Rundfunk (BR) carried out field tests with MPEG-H Dialog+, in which Clear Speech soundtracks were produced and made available for various popular German TV shows. WDR transmitted Clear Speech via DVB-S as an additional audio signal, while BR provided Clear Speech via HbbTV, synchronized with the existing broadcast signal. Some of the WDR productions were then offered in the on-demand service of the ARD Mediathek, with further TV shows added since. Providing an additional Clear Speech soundtrack involves no significant additional cost or effort: the Clear Speech mix is generated automatically from the original mix and fed into the ARD Mediathek within the existing content workflows.
In the future, broadcasting and streaming will increasingly make use of object-based formats, known as Next Generation Audio (NGA). In addition to producing the channel-based Clear Speech stereo mix, MPEG-H Dialog+ can automatically generate a file that combines the separated audio objects and the metadata that are required for NGA. These files are suitable for use as a production format for NGA distribution processes and can be encoded directly into MPEG-H Audio. Such a workflow was implemented on a trial basis at WDR, including encoding and playback in an MPEG-H-capable app.
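To illustrate what "separated audio objects plus metadata" could look like as a deliverable, here is a purely hypothetical, JSON-style description. The field names, file names and gain ranges are invented for this example and do not reflect the actual MPEG-H Audio metadata schema or the file format MPEG-H Dialog+ produces.

```python
import json

# Hypothetical object-based deliverable: the two separated objects plus
# metadata that lets a receiver offer presets and user gain control.
# Field names are illustrative, NOT the real MPEG-H metadata schema.
nga_item = {
    "objects": [
        {"name": "dialogue", "file": "dialogue.wav",
         "user_gain_db": {"min": -6, "max": 9}},
        {"name": "music_and_effects", "file": "m_and_e.wav",
         "user_gain_db": {"min": -12, "max": 0}},
    ],
    "presets": [
        {"name": "Original",
         "gains_db": {"dialogue": 0, "music_and_effects": 0}},
        {"name": "Clear Speech",
         "gains_db": {"dialogue": 0, "music_and_effects": -9}},
    ],
}

print(json.dumps(nga_item, indent=2))
```

The point is the shape of the data, not the names: because the objects stay separate until playback, the "Clear Speech" preset becomes a metadata choice rather than a second, pre-rendered stereo mix.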
Film and TV production is making increasing use of cloud-based services. These facilitate the easy and rapid scaling of production workflows and can be consumed online by a broad range of users. They can also greatly reduce initialization and maintenance costs in the software-as-a-service domain. Fraunhofer IIS designed its NGA technologies to meet these requirements and for integration into state-of-the-art workflows. This means they are not only ready for immediate use, but also fit for the future.