Initial generated code and readme based on the concept (chatgpt and cursor/claude3.5-sonnet supported)

This commit is contained in:
Nicolas Dickinson 2024-09-10 13:11:41 +02:00
parent 4ed1c63171
commit d0a7663440
2 changed files with 315 additions and 1 deletion


@ -1,3 +1,86 @@
# vmixtts
Mix voices, accents, dialects and languages for a unique experience
## Initial idea and inspiration
This project is inspired by the evolution of Text-to-Speech (TTS) systems from the 80s and 90s, when early speech-synthesis methods such as formant synthesis (e.g., eSpeak) were resource-light but often lacked the natural intonation and richness of modern AI-based TTS systems. The goal is to combine the strengths of older, efficient methods with modern AI techniques to quickly and flexibly generate unique voices that are resource-efficient yet rich in variety. I hope the project can be useful both to those with low-powered devices and to those who want to generate high-quality voices for their projects.
- Recombine these approaches by using modern AI to modify phonemes in real time, while keeping the speech generation lightweight.
- Generate new accents or voices on the fly by leveraging AI models to rewrite dictionary files quickly and test them.
## Key Objectives:
- Produce custom accents, intonations, and speech styles by recombining TTS methods from the last decades with AI.
- Enhance these voices using linguistic data from Wikidata and BabelNet for diverse, unique and unusual voices.
- Allow for fast, easy and dynamic generation of new voices and accents and quick testing of new ideas for creative purposes.
### How do we modify the voices?
- Use AI such as small and large language models to automatically generate and modify phoneme dictionary files that adapt speech to new accents, intonations, or even "recombinant" voices.
- Dynamically create phoneme dictionaries for different pairs of languages based on linguistic data (e.g., Wikidata, BabelNet).
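The kind of edit an LLM would produce can be illustrated with a toy, pure-Python substitution pass. The phoneme symbols and the "French-influenced" rules below are invented for illustration and are not real espeak-ng dictionary rules:

```python
# Illustrative only: toy phoneme substitutions standing in for LLM-generated
# dictionary edits; these are NOT real espeak-ng rules.
def apply_accent_rules(phonemes, rules):
    """Apply a mapping of phoneme substitutions to a list of phonemes."""
    return [rules.get(p, p) for p in phonemes]

# A crude "French-influenced English" sketch: 'th' sounds become 'z',
# and 'h' is dropped (mapped to the empty string, then filtered out).
french_influence = {"T": "z", "D": "z", "h": ""}
utterance = ["h", "E", "l", "oU", " ", "D", "E", "r"]  # "hello there", roughly
accented = [p for p in apply_accent_rules(utterance, french_influence) if p]
print("".join(accented))
```

In the real pipeline the mapping would come from an LLM rewriting entries in an espeak-ng phoneme file rather than from a hand-written dict.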
### How do we generate and test new voices?
- Use espeak-ng as a foundation for phoneme-based synthesis, enhancing it with phoneme adjustments generated by AI models.
- For TTS voices, leverage older TTS technologies (e.g., eSpeak) for their efficiency, while using modern AI to enhance voice generation with more natural intonation, accents, and speech styles.
## Dependencies
To run this project, you'll need the following dependencies, which allow for phoneme generation, Wikidata querying, and LLM integration:
### Software:
- Ollama: For making API calls to LLMs to modify phoneme files dynamically.
- espeak-ng: For phoneme-based TTS synthesis.
- Python: For running the code and installing the dependencies.
### Python Libraries:
- py-espeak-ng: For using espeak-ng.
- wikidataintegrator: For querying Wikidata (via the Wikidata Query Service) to find shared senses between languages.
- Pandas: For handling data and saving results in Parquet format.
- PyArrow: For efficient data storage.
- SentenceTransformers: For vector-based similarity comparisons in post-processing.
- LlamaIndex: To manage the workflow of calling LLMs and handling the generation of phoneme dictionaries.
### Create and activate a conda environment (optional but recommended)
`conda create -n vmixtts python=3.11`
`conda activate vmixtts`
### Install packages
`conda install -c conda-forge pandas pyarrow sentence-transformers llama-index ollama pip`
`python -m pip install py-espeak-ng wikidataintegrator`
### Install espeak-ng
On Debian/Ubuntu: `sudo apt-get install espeak-ng`
### Install Ollama
Follow the installation instructions at https://ollama.com/download
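After installing, a quick stdlib-only check can confirm that the main dependencies are visible (a sketch; adjust the tool and package names to your setup):

```python
import shutil
import importlib.util

def check_dependencies():
    """Report which external tools are on PATH and which packages are importable."""
    tools = {t: shutil.which(t) is not None for t in ("espeak-ng", "ollama")}
    packages = {p: importlib.util.find_spec(p) is not None
                for p in ("pandas", "pyarrow", "sentence_transformers", "llama_index")}
    return {"tools": tools, "packages": packages}

for kind, found in check_dependencies().items():
    for name, ok in found.items():
        print(f"{kind:10} {name:25} {'OK' if ok else 'MISSING'}")
```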
## Approach
**Step 1: Phoneme-based synthesis using espeak-ng and llama-index**
The foundation of the project is espeak-ng, which provides phoneme-based TTS synthesis. We use espeak-ng's phoneme dictionaries as a starting point and adjust these phonemes with LLM-based modifications to generate custom accents or speech styles, using Ollama to make the API calls to the LLMs.
**Step 2: Dynamic phoneme dictionary generation**
We use llama-index to dynamically generate phoneme dictionaries for different pairs of languages based on linguistic data (e.g., Wikidata, BabelNet). This allows for the creation of unique, diverse voices and accents.
**Step 3: Testing and evaluation**
We test the generated phoneme dictionaries and adjust them as needed using vector-based similarity comparisons. We also evaluate the quality of the generated voices with subjective listening tests.
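The vector comparison in Step 3 boils down to cosine similarity between embeddings. A minimal sketch with toy vectors (in the real pipeline the vectors would come from SentenceTransformers, and the values below are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" of an original and a modified phoneme rule
original_rule = [0.9, 0.1, 0.3]
modified_rule = [0.8, 0.2, 0.35]
print(round(cosine_similarity(original_rule, modified_rule), 3))
```

A similarity close to 1.0 suggests the modified rule stayed semantically close to the original; a low score flags a rewrite that drifted too far.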
## Process Flow of Functions
1. **Generate Accent Function (`generate_accent`)**:
   - **Input**: Takes a scenario (e.g., "French tourist speaking English") as input.
   - **LLM Language Selection**: Calls the **LLMPredictor** to determine the base language (e.g., English) and the secondary influencing language (e.g., French) using the context of the scenario.
   - **Phoneme File Parsing**: Loads the phoneme files for both languages using `parse_phoneme_file`.
   - **Phoneme Index Creation**: Creates a vector search index from the phoneme files using **LlamaIndex**.
   - **Relevant Phoneme Query**: Queries the **LLM** to retrieve the most influential phonemes from the secondary language that affect the accent in the base language.
   - **Accent Rule Generation**: Based on the identified phonemes, it generates new phoneme modification rules to reflect the accent of the secondary language.
   - **Compilation**: Compiles the phoneme rules and validates them. If any rules fail during compilation, it runs a binary search to identify and fix the faulty rules using the LLM.
   - **Final Output**: Generates and saves a new phoneme file that reflects the requested accent modification.
2. **Helper Functions**:
   - **`compile_phoneme_file`**: Uses `subprocess.run` to compile phoneme files.
   - **`generate_safe_filename`**: Creates a filename that reflects the base language and scenario in a safe format.
   - **`get_base_and_secondary_phoneme_files`**: Loads the phoneme files for both the base and secondary languages.
   - **`binary_search_for_faulty_rules`**: Identifies faulty phoneme rules that fail during compilation using a binary search method.
   - **`attempt_to_fix_rules`**: Uses the LLM to correct faulty phoneme rules identified during binary search.
   - **`save_and_compile_final_phoneme_file`**: Saves the final set of valid and fixed phoneme rules and compiles the final phoneme file.
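The binary-search validation described above can be exercised without espeak-ng by stubbing the compiler. This is a condensed sketch of the idea (not the script's exact implementation): a batch that compiles is accepted wholesale, and a failing batch is split until each faulty rule is isolated, taking O(log n) compile attempts per fault instead of one per rule.

```python
def find_faulty_rules(rules, compiles):
    """Split `rules` until each faulty rule is isolated.
    `compiles` is a predicate: does this batch of rules compile?"""
    valid, faulty = [], []

    def search(batch):
        if not batch:
            return
        if compiles(batch):
            valid.extend(batch)
        elif len(batch) == 1:
            faulty.append(batch[0])
        else:
            mid = len(batch) // 2
            search(batch[:mid])
            search(batch[mid:])

    search(rules)
    return valid, faulty

# Stub compiler: any batch containing the bad rule fails to "compile"
rules = ["phoneme a ...", "phoneme b ...", "BROKEN", "phoneme d ..."]
valid, faulty = find_faulty_rules(rules, lambda batch: "BROKEN" not in batch)
print(valid, faulty)
```

In the real script the predicate shells out to the espeak-ng build, and the isolated faulty rules are then handed to the LLM for repair.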

generate-accent.py Normal file

@ -0,0 +1,231 @@
import os
import re
import subprocess
from llama_index import (
    LLMPredictor,
    GPTVectorStoreIndex,
    SimpleDirectoryReader,
    PromptHelper
)
from llama_index.node_parser import SimpleNodeParser
from llama_index.prompts import PromptTemplate
import ollama

# Function to compile phoneme file
def compile_phoneme_file(filename):
    result = subprocess.run(["make", filename], capture_output=True, text=True)
    if result.returncode != 0:
        raise Exception(result.stderr)

# Function to generate a safe filename
def generate_safe_filename(language, scenario):
    safe_scenario = re.sub(r'[^a-zA-Z0-9_]', '_', scenario)
    safe_filename = f"{language}_{safe_scenario}.txt"
    return safe_filename

# Function to identify the base language using LLM
def identify_base_language(scenario, llm_predictor):
    prompt = f"Based on the scenario '{scenario}', identify the base language that the speaker is using. Provide the corresponding language code, ensuring that the language exists in ISO language codes."
    response = llm_predictor.predict(prompt)
    return response  # The LLM's language code

# Function to get phoneme files for base and secondary languages
def get_base_and_secondary_phoneme_files(base_language_code, secondary_language_code):
    base_phoneme_file = f"ph_{base_language_code}.txt"
    secondary_phoneme_file = f"ph_{secondary_language_code}.txt"
    try:
        with open(base_phoneme_file, 'r') as base_file:
            base_content = base_file.read()
        with open(secondary_phoneme_file, 'r') as sec_file:
            secondary_content = sec_file.read()
        return base_content, secondary_content
    except FileNotFoundError as e:
        raise Exception(f"Phoneme file not found: {e}")

# Function to generate phoneme modifications with LLM
def generate_phoneme_modifications_with_llm(scenario, base_file, secondary_file, llm_predictor):
    prompt = f"Based on the scenario '{scenario}', please create modifications to the base phoneme file using characteristics derived from the secondary language phoneme file. Make changes only in the base phoneme file, such as adjusting vowels or consonants, to reflect the influence of the secondary language."
    response = llm_predictor.predict(prompt)
    return response  # The LLM's generated phoneme file

# Function to compile phoneme file with all rules
def compile_with_all_rules(rules, base_file):
    combined_rules = "\n".join(rules)
    temp_file = "temp_combined_phoneme_file.txt"
    with open(temp_file, 'w') as file:
        file.write(base_file + "\n" + combined_rules)
    try:
        compile_phoneme_file(temp_file)
        print("All rules compiled successfully.")
        return True  # Compilation success
    except Exception as e:
        print(f"Compilation failed with error: {e}")
        return False  # Consistent return type, so callers can test truthiness

# Binary search to identify faulty rules
def binary_search_for_faulty_rules(rules, base_file):
    valid_rules = []
    faulty_rules = []

    def compile_batch(rules_batch):
        combined_rules = "\n".join(rules_batch)
        temp_file = "temp_batch_phoneme_file.txt"
        with open(temp_file, 'w') as file:
            file.write(base_file + "\n" + combined_rules)
        try:
            compile_phoneme_file(temp_file)
            return True  # Batch compiled successfully
        except Exception:
            return False  # Compilation failed

    def binary_search(rules_batch):
        if not rules_batch:
            return  # Guard against empty batches to avoid infinite recursion
        if len(rules_batch) == 1:
            if compile_batch(rules_batch):
                valid_rules.append(rules_batch[0])
            else:
                faulty_rules.append(rules_batch[0])
        else:
            mid = len(rules_batch) // 2
            left_batch = rules_batch[:mid]
            right_batch = rules_batch[mid:]
            if not compile_batch(left_batch):
                binary_search(left_batch)
            else:
                valid_rules.extend(left_batch)
            if not compile_batch(right_batch):
                binary_search(right_batch)
            else:
                valid_rules.extend(right_batch)

    binary_search(rules)
    return valid_rules, faulty_rules

# Function to attempt to fix faulty rules using LLM
def attempt_to_fix_rules(faulty_rules, base_file, llm_predictor):
    fixed_rules = []
    for rule in faulty_rules:
        prompt = f"The following phoneme rule '{rule}' failed to compile. Please suggest a corrected version based on the base phoneme file."
        response = llm_predictor.predict(prompt)
        fixed_rule = response.strip()
        if fixed_rule:
            fixed_rules.append(fixed_rule)
    return fixed_rules

# Function to save and compile the final phoneme file
def save_and_compile_final_phoneme_file(base_file, valid_rules, fixed_rules):
    all_rules = valid_rules + fixed_rules
    final_file_content = base_file + "\n" + "\n".join(all_rules)
    output_filename = "final_phoneme_file.txt"
    with open(output_filename, 'w') as file:
        file.write(final_file_content)
    compile_final_phoneme_file(output_filename)

# Final compile step with error logging
def compile_final_phoneme_file(filename):
    try:
        compile_phoneme_file(filename)
        print(f"Successfully compiled {filename}")
    except Exception as e:
        print(f"Error during compilation: {e}")
        with open("compilation_errors.log", 'a') as log_file:
            log_file.write(f"Compilation error for {filename}: {str(e)}\n")

def parse_phoneme_file(file_path):
    with open(file_path, 'r') as f:
        content = f.read()
    phonemes = {}
    for match in re.finditer(r'phoneme\s+(\w+)\s*//\s*(.+?)\n(.*?)endphoneme', content, re.DOTALL):
        name, description, body = match.groups()
        phonemes[name] = {'description': description.strip(), 'body': body.strip()}
    return phonemes

def get_available_languages():
    espeak_path = os.environ.get('ESPEAK_PATH')
    if not espeak_path:
        raise ValueError("ESPEAK_PATH environment variable not set")
    phsource_dir = os.path.join(espeak_path, 'phsource')
    return [f[3:] for f in os.listdir(phsource_dir) if f.startswith('ph_')]

def create_index_from_phonemes(phonemes):
    # Wrap each phoneme entry in a Document; the node parser expects
    # Document objects, not raw strings
    from llama_index import Document
    documents = [Document(text=f"{k}: {v['description']}\n{v['body']}") for k, v in phonemes.items()]
    parser = SimpleNodeParser()
    nodes = parser.get_nodes_from_documents(documents)
    return GPTVectorStoreIndex(nodes)

def generate_accent(scenario):
    # Initialize Ollama through LLMPredictor
    llm_predictor = LLMPredictor(llm=ollama.Ollama(model="llama2"))
    prompt_helper = PromptHelper(max_input_size=4096, num_output=256, max_chunk_overlap=20)
    available_languages = get_available_languages()
    # Select languages
    language_selection_prompt = PromptTemplate(
        "Given the scenario: '{scenario}' and the following available languages: {languages}, "
        "please select the most appropriate base language and secondary language for accent generation. "
        "Respond in the format: 'Base: [language], Secondary: [language]'."
    )
    language_selection_query = language_selection_prompt.format(scenario=scenario, languages=', '.join(available_languages))
    language_selection_response = llm_predictor.predict(language_selection_query)
    match = re.search(r'Base: (\w+), Secondary: (\w+)', language_selection_response)
    if not match:
        raise ValueError(f"Could not parse language selection from LLM response: {language_selection_response}")
    base_lang, secondary_lang = match.groups()
    # Parse phoneme files
    espeak_path = os.environ.get('ESPEAK_PATH')
    base_phoneme_path = os.path.join(espeak_path, 'phsource', f'ph_{base_lang}')
    base_phonemes = parse_phoneme_file(base_phoneme_path)
    secondary_phonemes = parse_phoneme_file(os.path.join(espeak_path, 'phsource', f'ph_{secondary_lang}'))
    # Create indices
    base_index = create_index_from_phonemes(base_phonemes)
    secondary_index = create_index_from_phonemes(secondary_phonemes)
    # Get relevant phonemes
    relevant_phonemes_prompt = PromptTemplate(
        "Given the phonemes for the base language ({base_lang}) and secondary language ({secondary_lang}), "
        "list the phonemes from the secondary language that are most likely to influence the accent when applied to the base language. "
        "Provide your response as a comma-separated list of phoneme names."
    )
    relevant_phonemes_query = relevant_phonemes_prompt.format(base_lang=base_lang, secondary_lang=secondary_lang)
    relevant_phonemes_response = llm_predictor.predict(relevant_phonemes_query)
    relevant_phonemes = [p.strip() for p in relevant_phonemes_response.split(',')]
    # Generate accent rules
    accent_rules_prompt = PromptTemplate(
        "Based on the relevant phonemes identified ({relevant_phonemes}), "
        "generate accent modification rules for the base language ({base_lang}). "
        "Use the following format for each rule:\n\n"
        "phoneme [name]\n"
        "  [modification instructions]\n"
        "endphoneme\n\n"
        "Here are the relevant secondary language phonemes for reference:\n\n"
        "{secondary_phonemes}"
    )
    secondary_phonemes_str = "\n\n".join([f"{p}:\n{secondary_phonemes[p]['body']}" for p in relevant_phonemes if p in secondary_phonemes])
    accent_rules_query = accent_rules_prompt.format(
        relevant_phonemes=', '.join(relevant_phonemes),
        base_lang=base_lang,
        secondary_phonemes=secondary_phonemes_str
    )
    accent_rules = llm_predictor.predict(accent_rules_query)
    # Compile and validate rules; pass the raw file content, not the parsed dict
    with open(base_phoneme_path, 'r') as f:
        base_file_content = f.read()
    rules = accent_rules.split('\n')
    compile_success = compile_with_all_rules(rules, base_file_content)
    if not compile_success:
        valid_rules, faulty_rules = binary_search_for_faulty_rules(rules, base_file_content)
        fixed_rules = attempt_to_fix_rules(faulty_rules, base_file_content, llm_predictor)
        save_and_compile_final_phoneme_file(base_file_content, valid_rules, fixed_rules)
    else:
        save_and_compile_final_phoneme_file(base_file_content, rules, [])
    return generate_safe_filename(base_lang, scenario)