Howto for TTS with Emacs-Reveal

Jens Lechtenbörger

Press “play” once in audio-controls below (or type “a”) to start the presentation, which advances automatically afterwards.

1. General thoughts

All of this builds on emacs-reveal (Lechtenbörger 2019a, 2019b)
- Check out its howto first
Text-To-Speech (TTS) should read notes (#+begin_notes ... #+end_notes)
- Controlled by option reveal-with-tts
  - Use customization for available speakers
- Audio is played with the audio slideshow plugin for Reveal.js
If slides with audio advance automatically, this is a video mode
- Then, notes are required for every slide
- Reveal.js “fragments” (animations) are still possible

1.1. Technical Idea

Implement TTS as a two-stage process
- First, extract notes from presentation
  
  Speech is generated from text in a process with two stages. First, usual presentation notes serve as text input. These notes may embed SSML break elements to specify a break with a given duration in seconds between sentences. See the source code of this slide for examples.
  - Generate a text file for each note
    - Its name is a hash value of the contents
    While processing the org source code to generate a presentation, each note is extracted into a text file (with some preprocessing as revisited on a later slide). The name of such a text file is the hash value of its contents. Thus, changed contents lead to changed names.
  - Generate one index file that stores names (and other information) for all text files
    
    In addition, another text file serves as index, collecting the names and positions of texts in a presentation. Besides, this index file also records configuration information, such as the speaker to be used.
  - This happens during export/publication of Org files into reveal.js presentations
    
    This text processing happens automatically in the background.
- Second, run TTS software on index file to generate audio
  - Implemented in Docker image emacs-reveal/tts
    - Image includes TTS implementations SpeechT5 and SpeechBrain
  - StyleTTS2 available in Docker image emacs-reveal/tts-styletts2
    - Activate with default voice: #+OPTIONS: reveal_with_tts:StyleTTS2
    - Or with target audio for voice cloning: #+OPTIONS: reveal_with_tts:StyleTTS2:/oer/target.wav
  - Generated audio shares hash value of its text as part of its name, enabling caching of unchanged audio
Second, the index serves as input for the text-to-speech implementation, which is available as a Docker image. Here, names of generated audio again embed the hash values of their input texts, enabling caching of unchanged audio.
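The two stages can be illustrated with a minimal Python sketch. It is not the code of org-re-reveal or of the Docker image: the function names, the use of SHA-1, the JSON layout of the index, and the `.ogg` suffix are all assumptions made for illustration, and the `synthesize` parameter stands in for a real TTS call.

```python
import hashlib
import json
from pathlib import Path


def note_name(text: str) -> str:
    """Derive a file name stem from the note's contents (SHA-1 is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def export_notes(notes: list[str], out_dir: Path, voice: str) -> Path:
    """Stage one: write one text file per note plus one index file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for position, text in enumerate(notes):
        name = note_name(text)
        (out_dir / f"{name}.txt").write_text(text, encoding="utf-8")
        entries.append({"position": position, "name": name})
    # Hypothetical index layout: configuration plus names and positions.
    index = out_dir / "index.json"
    index.write_text(json.dumps({"voice": voice, "notes": entries}), encoding="utf-8")
    return index


def generate_audio(index: Path, audio_dir: Path, synthesize) -> list[str]:
    """Stage two: synthesize only notes whose audio file does not exist yet."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    config = json.loads(index.read_text(encoding="utf-8"))
    generated = []
    for entry in config["notes"]:
        target = audio_dir / f"{entry['name']}.ogg"
        if target.exists():
            continue  # Same hash means same text, so the cached audio is reused.
        text = (index.parent / f"{entry['name']}.txt").read_text(encoding="utf-8")
        target.write_bytes(synthesize(text, config["voice"]))
        generated.append(entry["name"])
    return generated
```

Calling `generate_audio` twice on an unchanged index produces no new audio the second time, because every target file already exists. Editing one note changes its hash, so only that note is synthesized again.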

Use audio slideshow plugin to play audio

1.2. Docker image `emacs-reveal/tts`

Contains two free/libre and open TTS implementations
- SpeechBrain
- Microsoft SpeechT5
For size reasons, without GPU support
Small wrapper package tts.py
- Sample invocation shown in .gitlab-ci.yml of this presentation

2. Slide with notes and fragments

These notes are transformed to audio by TTS and read by the audio plugin (if it is enabled). Org-re-reveal rewrites the text so that each sentence is on a single line; a Docker image of emacs-reveal then converts that text to audio.
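The idea of placing each sentence on its own line can be sketched in Python as follows. Org-re-reveal itself is written in Emacs Lisp, so this is only an illustration of the concept; the function name and the splitting rule are assumptions, and the rule is deliberately naive (it would also split after abbreviations such as "e.g.").

```python
import re


def one_sentence_per_line(text: str) -> str:
    """Collapse whitespace, then break after sentence-ending punctuation."""
    collapsed = " ".join(text.split())
    sentences = re.split(r"(?<=[.!?])\s+", collapsed)
    return "\n".join(s for s in sentences if s)
```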

Note that hyphenated words and abbreviations may not be pronounced correctly. However, org-re-reveal contains a customizable set of translation rules for preprocessing.
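A table of translation rules could work roughly like the following Python sketch. The rules shown here are hypothetical examples, not the defaults of org-re-reveal, and the representation as pairs of regular expression and replacement is an assumption.

```python
import re

# Hypothetical rules: (regular expression, spoken replacement).
RULES = [
    (r"\be\.g\.", "for example"),
    (r"\bTTS\b", "text to speech"),
    (r"(?<=\w)-(?=\w)", " "),  # Read hyphenated words as separate words.
]


def normalize(text: str, rules=RULES) -> str:
    """Apply each rule in order to the whole text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

For example, `normalize("TTS reads well-known terms, e.g. names.")` yields `"text to speech reads well known terms, for example names."`.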

Notes can contain Org markup, such as hyperlinks, bold, emphasis, code, verbatim.

Such markup is removed for TTS in org-re-reveal.
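As a rough illustration of what removing markup means, the following Python sketch strips a few common Org constructs with regular expressions. It is an assumption-laden simplification, not the approach used by org-re-reveal, and it ignores many corner cases of Org syntax.

```python
import re


def strip_org_markup(text: str) -> str:
    """Remove links and simple emphasis markers, keeping the readable text."""
    # [[target][description]] -> description
    text = re.sub(r"\[\[[^\]]+\]\[([^\]]+)\]\]", r"\1", text)
    # [[target]] -> target
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)
    # *bold*, /emphasis/, ~code~, =verbatim= -> inner text
    text = re.sub(r"(?<!\w)([*/~=])(\S(?:.*?\S)?)\1(?!\w)", r"\2", text)
    return text
```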

Lists can be used in notes as well:

- This is a first item in a list.
- Second item.

As we aim for text to speech, notes should consist of full sentences, including full stops, question marks etc. Warnings are shown upon export if the code detects this not to be the case.
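A check of this kind might look like the following Python sketch. The function name, the set of accepted end characters, and the warning text are assumptions; the actual check in org-re-reveal may differ.

```python
import warnings

SENTENCE_END = (".", "!", "?")


def check_notes(lines: list[str]) -> list[str]:
    """Warn about non-empty lines that do not end like a full sentence."""
    incomplete = [
        line for line in lines
        if line.strip() and not line.strip().endswith(SENTENCE_END)
    ]
    for line in incomplete:
        warnings.warn(f"Note does not end like a full sentence: {line!r}")
    return incomplete
```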

Notes on this slide clarify some aspects of the text generated by org-re-reveal as basis for TTS. To pronounce numbers, abbreviations, and “complicated” words, see variable org-re-reveal-tts-normalize-table.

Besides, for demonstration purposes, this slide contains fragments with separate notes:

- First appearing point, with notes
  
  Each fragment has its own notes. These ones are meant for the first bullet point.
- Second appearing point
  
  Explanations continue with this second bullet point.

3. A real example

Next slide is part of a course on IT Systems

3.0.1. Offset as Pointer into Range

“Address translation with offset in covered address range” by Max Lütkemeyer and Jens Lechtenbörger under CC BY-SA 4.0; from GitLab

4. The End

Person taking steps to top

The road ahead …

“Figure” under CC0 1.0; converted from Pixabay

https://gitlab.com/oer/

4.1. Bibliography

Lechtenbörger, Jens. 2019a. “Emacs-reveal: A software bundle to create OER presentations.” Journal of Open Source Education (JOSE) 2 (18). https://doi.org/10.21105/jose.00050.

———. 2019b. “Simplifying license attribution for OER with emacs-reveal.” In 17. Fachtagung Bildungstechnologien (DELFI 2019), edited by Niels Pinkwart and Johannes Konert, 205–16. Bonn: Gesellschaft für Informatik e.V. https://doi.org/10.18420/delfi2019_280.

License Information

No warranties are given. The license may not give you all of the permissions necessary for your intended use.

In particular, trademark rights are not licensed under this license. Thus, rights concerning third party logos (e.g., on the title slide) and other (trade-) marks (e.g., “Creative Commons” itself) remain with their respective holders.