24 languages, seven billion parameters
Teuken-7B was trained from the ground up in all 24 official languages of the European Union and comprises seven billion parameters. “What truly sets it apart is the nearly 50 percent share of non-English pretraining data,” Küch explains. This multilingual foundation ensures stable and consistent performance across a wide range of languages. The model also features a specially developed multilingual tokenizer, optimized for energy and cost efficiency and designed to work equally well across all languages. Tokenizers break words down into smaller units, called tokens, which the AI model can then process. Thanks to this multilingual approach, Teuken-7B handles complex linguistic structures, such as those found in German, with ease, and it was trained more efficiently than comparable multilingual models.
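To make the tokenizer's role concrete, the following minimal sketch shows how text in different languages is split into tokens using the Hugging Face transformers library. The model identifier, the need for trust_remote_code, and the sample sentences are assumptions for illustration, not confirmed specifics of the released checkpoint; consult the official model card for the exact loading instructions.

```python
# Minimal sketch: inspecting multilingual tokenization with Hugging Face transformers.
# The model id below is an assumed identifier for the Teuken-7B instruct checkpoint.
from transformers import AutoTokenizer

MODEL_ID = "openGPT-X/Teuken-7B-instruct-research-v0.4"  # assumed model id

# trust_remote_code=True is assumed to be needed because the tokenizer may ship
# custom code; only enable it for sources you trust.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

samples = {
    "en": "Artificial intelligence is transforming European industry.",
    "de": "Donaudampfschifffahrtsgesellschaft",  # a long German compound word
    "fr": "L'intelligence artificielle transforme l'industrie européenne.",
}

# A well-balanced multilingual tokenizer splits non-English text into roughly
# as few tokens as comparable English text, which keeps training and inference
# cheaper for those languages.
for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{lang}: {len(tokens)} tokens -> {tokens[:10]}")
```

Fewer tokens per sentence in a given language translates directly into lower compute cost when processing text in that language, which is where the efficiency gains of the multilingual tokenizer come from.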
At its core, Teuken-7B is a technology ready to be put into practice, with a wide range of potential use cases. “By training the model on application-specific data, companies can develop tailored AI solutions that operate without black-box components,” explains Prof. Dr.-Ing. Bernhard Grill, Director of Fraunhofer IIS. The most obvious use case is chat applications, for which Teuken-7B has already been adapted through a process known as instruction tuning: the OpenGPT-X partners have specifically trained the model to understand and follow user instructions.
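As a rough illustration of how such an instruction-tuned checkpoint is typically queried in a chat setting, the sketch below uses the standard Hugging Face chat interface. The model identifier, chat-template details, and generation settings are assumptions for illustration; the checkpoint's model card is the authoritative reference.

```python
# Minimal sketch: a chat-style query against an instruction-tuned checkpoint
# via Hugging Face transformers. Model id and template details are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openGPT-X/Teuken-7B-instruct-research-v0.4"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # assumes a GPU with bfloat16 support
    device_map="auto",
    trust_remote_code=True,
)

# Instruction tuning is what lets the model respond to messages like this
# instead of merely continuing raw text.
messages = [
    # German: "Summarize the advantages of multilingual language models in three sentences."
    {"role": "user", "content": "Fasse die Vorteile mehrsprachiger Sprachmodelle in drei Sätzen zusammen."}
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```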