vmixtts

Mix voices, accents, dialects and languages for a unique experience

Initial idea and inspiration

This project is inspired by the evolution of Text-to-Speech (TTS) systems since the 1980s and 1990s. Early speech synthesis methods, such as formant synthesis (e.g., eSpeak), were resource-light but lacked the natural intonation and richness of modern AI-based TTS systems. The goal is to combine the strengths of these older, efficient methods with modern AI techniques to generate unique voices quickly and flexibly, keeping them resource-efficient but rich in variety. I hope the project is useful both to those with low-powered devices and to those who want to generate high-quality voices for their projects.

  • Recombine these approaches by using modern AI to modify phonemes in real time, while keeping the speech generation lightweight.
  • Generate new accents or voices on the fly by leveraging AI models to rewrite dictionary files quickly and test them.

Key Objectives:

  • Produce custom accents, intonations, and speech styles by recombining TTS methods from the last decades with AI.
  • Enhance these voices using linguistic data from Wikidata and BabelNet for diverse, unique and unusual voices.
  • Allow for fast, easy and dynamic generation of new voices and accents and quick testing of new ideas for creative purposes.

How do we modify the voices?

  • Use AI such as small and large language models to automatically generate and modify phoneme dictionary files that adapt speech to new accents, intonations, or even "recombinant" voices.
  • Dynamically create phoneme dictionaries for different pairs of languages based on linguistic data (e.g., Wikidata, BabelNet).

How do we generate and test new voices?

  • Use espeak-ng as a foundation for phoneme-based synthesis, enhancing it with phoneme adjustments generated by AI models.
  • For TTS voices, leverage older TTS technologies (e.g., eSpeak) for their efficiency, while using modern AI to enhance voice generation with more natural intonation, accents, and speech styles.
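As a rough illustration of using espeak-ng as the phoneme layer, the sketch below asks the espeak-ng command-line tool for a phoneme transcription instead of audio. It relies on the standard `-q` (no audio), `-x` (phoneme mnemonics), `--ipa`, and `-v` (voice) options; the function names are hypothetical and not taken from this project's code.

```python
import shutil
import subprocess


def espeak_phoneme_command(text: str, voice: str = "en", ipa: bool = False) -> list[str]:
    """Build an espeak-ng command that prints phonemes instead of playing audio."""
    # -q suppresses audio; -x prints espeak-ng phoneme mnemonics,
    # while --ipa prints International Phonetic Alphabet symbols instead.
    phoneme_flag = "--ipa" if ipa else "-x"
    return ["espeak-ng", "-q", phoneme_flag, "-v", voice, text]


def text_to_phonemes(text: str, voice: str = "en", ipa: bool = False) -> str:
    """Run espeak-ng and return its phoneme transcription (hypothetical helper)."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng was not found on PATH")
    result = subprocess.run(
        espeak_phoneme_command(text, voice, ipa),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if espeak-ng exits with an error
    )
    return result.stdout.strip()
```

With espeak-ng installed, `text_to_phonemes("hello world", voice="fr")` would return the transcription produced by the French voice, which is one simple way to hear or inspect how a second language's rules treat English text.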

Dependencies

To run this project, you'll need the following dependencies, which allow for phoneme generation, Wikidata querying, and LLM integration:

Software:

  • Ollama: For making API calls to LLMs to modify phoneme files dynamically.
  • espeak-ng: For phoneme-based TTS synthesis.
  • Python: For running the code and installing the dependencies.

Python Libraries:

  • py-espeak-ng: For using espeak-ng.
  • WDQS Python: For querying Wikidata to find shared senses between languages.
  • Pandas: For handling data and saving results in Parquet format.
  • PyArrow: For efficient data storage.
  • SentenceTransformers: For vector-based similarity comparisons in post-processing.
  • LlamaIndex: To manage the workflow of calling LLMs and handling the generation of phoneme dictionaries.

conda create -n vmixtts python=3.11
conda activate vmixtts

Install packages

conda install -c conda-forge pandas pyarrow sentence-transformers llama-index ollama wikidataintegrator pip
python -m pip install py-espeak-ng wikidataintegrator

Install espeak-ng

On Debian/Ubuntu: sudo apt-get install espeak-ng

Install Ollama

Go to: https://ollama.com/docs/guide/install

Approach

Step 1: Phoneme-Based Synthesis Using espeak-ng and llama-index

The foundation of the project is espeak-ng, which provides phoneme-based TTS synthesis. We use espeak-ng's phoneme dictionaries as a starting point and adjust them with LLM-generated modifications to produce custom accents or speech styles. The LLMs are reached through Ollama API calls, so phoneme files can be modified dynamically and new accents created on the fly.
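To make the Ollama step more concrete, here is a minimal sketch that sends a prompt to a locally running Ollama server through its REST endpoint `/api/generate`. The prompt wording, the helper names, and the model name in the usage note are assumptions for illustration; the project itself may route these calls through LlamaIndex instead.

```python
import json
import urllib.request

# Default address of a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_accent_prompt(base_lang: str, accent_lang: str, rules: str) -> str:
    """Compose a prompt asking the LLM to adapt phoneme rules (wording is hypothetical)."""
    return (
        f"Rewrite the following {base_lang} phoneme rules so that the speech "
        f"sounds like a native {accent_lang} speaker. Return only the rules.\n\n"
        f"{rules}"
    )


def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body for /api/generate; stream=False asks for one complete reply."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the prompt to Ollama and return the generated text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With streaming disabled, the generated text is in the "response" field.
    return body["response"]
```

A call such as `ask_ollama("llama3", build_accent_prompt("English", "French", rules_text))` would return the model's proposed rules as plain text, where `"llama3"` stands for whichever model has been pulled locally and `rules_text` for the rules being adapted.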

Step 2: Dynamic Phoneme Dictionary Generation

We use llama-index to dynamically generate phoneme dictionaries for different pairs of languages based on linguistic data (e.g., Wikidata, BabelNet). This allows for the creation of unique, diverse voices and accents.
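One possible way to gather linguistic data for a language pair is to ask the Wikidata Query Service for lexemes in both languages whose senses point at the same concept. The sketch below builds such a SPARQL query and parses the JSON result. The query shape, the use of property P5137 ("item for this sense"), and the helper names are assumptions for illustration, not the project's actual query.

```python
import json
import re
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def shared_sense_query(lang_a_qid: str, lang_b_qid: str, limit: int = 50) -> str:
    """Build a SPARQL query for lemma pairs in two languages that share a sense item."""
    for qid in (lang_a_qid, lang_b_qid):
        if not re.fullmatch(r"Q[0-9]+", qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return f"""
SELECT ?concept ?lemmaA ?lemmaB WHERE {{
  ?lexemeA dct:language wd:{lang_a_qid} ;
           wikibase:lemma ?lemmaA ;
           ontolex:sense ?senseA .
  ?senseA wdt:P5137 ?concept .
  ?lexemeB dct:language wd:{lang_b_qid} ;
           wikibase:lemma ?lemmaB ;
           ontolex:sense ?senseB .
  ?senseB wdt:P5137 ?concept .
}}
LIMIT {int(limit)}
""".strip()


def parse_bindings(result: dict) -> list[tuple[str, str]]:
    """Extract (lemmaA, lemmaB) pairs from a SPARQL JSON result document."""
    rows = result["results"]["bindings"]
    return [(row["lemmaA"]["value"], row["lemmaB"]["value"]) for row in rows]


def fetch_shared_senses(lang_a_qid: str, lang_b_qid: str, limit: int = 50) -> list[tuple[str, str]]:
    """Run the query against the public endpoint (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": shared_sense_query(lang_a_qid, lang_b_qid, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"User-Agent": "vmixtts-example/0.1"},  # placeholder user agent
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

For example, `fetch_shared_senses("Q1860", "Q150")` asks for English (Q1860) and French (Q150) lemma pairs. The resulting word pairs could then be transcribed with espeak-ng in both languages to compare their phonemes.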

Step 3: Testing and Evaluation

We test the generated phoneme dictionaries and adjust them as needed using vector-based similarity comparisons. We also evaluate the quality of the generated voices using subjective tests.
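The vector-based comparison can be pictured as ranking candidate outputs by cosine similarity to a reference vector. The sketch below uses plain Python lists so that it runs without any model; in practice the vectors would presumably be embeddings from SentenceTransformers. The function names and the toy vectors in the usage note are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_candidates(
    reference: list[float], candidates: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Sort candidate names by similarity to the reference, most similar first."""
    scored = [(name, cosine_similarity(reference, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_candidates([1.0, 0.0], {"variant_a": [0.0, 1.0], "variant_b": [0.9, 0.1]})` places `variant_b` first, since its vector points in nearly the same direction as the reference.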

Process Flow of Functions

  1. Generate Accent Function (generate_accent):

    • Input: Takes a scenario (e.g., "French tourist speaking English") as input.
    • LLM Language Selection: Calls the LLMPredictor to determine the base language (e.g., English) and the secondary influencing language (e.g., French) using the context of the scenario.
    • Phoneme File Parsing: Loads the phoneme files for both languages using parse_phoneme_file.
    • Phoneme Index Creation: Creates a vector search index from the phoneme files using LlamaIndex.
    • Relevant Phoneme Query: Queries the LLM to retrieve the most influential phonemes from the secondary language that affect the accent in the base language.
    • Accent Rule Generation: Based on the identified phonemes, it generates new phoneme modification rules to reflect the accent of the secondary language.
    • Compilation: Compiles the phoneme rules and validates them. If any rules fail during compilation, it runs a binary search to identify the faulty rules and then uses the LLM to fix them.
    • Final Output: Generates and saves a new phoneme file that reflects the requested accent modification.
  2. Helper Functions:

    • compile_phoneme_file: Uses subprocess.run to compile phoneme files.
    • generate_safe_filename: Creates a filename that reflects the base language and scenario in a safe format.
    • get_base_and_secondary_phoneme_files: Loads the phoneme files for both the base and secondary languages.
    • binary_search_for_faulty_rules: Identifies faulty phoneme rules that fail during compilation using a binary search method.
    • attempt_to_fix_rules: Uses the LLM to correct faulty phoneme rules identified during binary search.
    • save_and_compile_final_phoneme_file: Saves the final set of valid and fixed phoneme rules and compiles the final phoneme file.
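The binary search over failing rules can be sketched as a recursive bisection: if a group of rules compiles, none of them is faulty; if a single rule fails, it is faulty; otherwise the group is split in half and each half is checked. This is a minimal sketch, not the project's `binary_search_for_faulty_rules`. The `compiles` callback is a hypothetical stand-in for a real compile step, and the approach assumes that failures are mostly caused by individual rules rather than by interactions between them.

```python
from typing import Callable, Sequence


def find_faulty_rules(
    rules: Sequence[str], compiles: Callable[[Sequence[str]], bool]
) -> list[str]:
    """Return the rules that make compilation fail, found by recursive bisection."""
    if not rules or compiles(rules):
        return []  # nothing to check, or the whole group is fine
    if len(rules) == 1:
        return list(rules)  # a single failing rule is faulty by definition
    mid = len(rules) // 2
    faulty = find_faulty_rules(rules[:mid], compiles) + find_faulty_rules(rules[mid:], compiles)
    if not faulty:
        # Each half compiles alone but the combination does not, so the
        # conflict spans both halves; report the whole group conservatively.
        return list(rules)
    return faulty


def split_rules(
    rules: Sequence[str], compiles: Callable[[Sequence[str]], bool]
) -> tuple[list[str], list[str]]:
    """Partition rules into (valid, faulty) while preserving their original order."""
    faulty = find_faulty_rules(rules, compiles)
    valid = [rule for rule in rules if rule not in faulty]
    return valid, faulty
```

When only a few rules are broken, this needs far fewer compile runs than testing every rule individually. The faulty rules could then be handed to the LLM for correction, and the valid ones kept unchanged.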