Text-to-Music (TTM) is the process of mapping musical semantics expressed in language to physical audio. Understanding each stage of this pipeline is key to optimizing your prompts.
Text Embedding
Mapping human language (e.g., 'melancholic jazz') into a vector space so the model can relate words to musical features.
- Optimization point: precision of prompt engineering
- Input requirement: structured text input
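As a deliberately simplified illustration of the embedding step, the sketch below hashes prompt tokens into a fixed-size vector (the "hashing trick"). The function name `embed_text` and the hashing approach are assumptions for illustration only; production TTM systems use pretrained neural text encoders (e.g., T5- or CLAP-style models) rather than hashing.

```python
import hashlib
import math

def embed_text(prompt: str, dim: int = 16) -> list[float]:
    """Toy embedder: hash each token into one of `dim` buckets,
    then L2-normalize. Stands in for a learned text encoder."""
    vec = [0.0] * dim
    for token in prompt.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

e1 = embed_text("melancholic jazz")
e2 = embed_text("melancholic jazz piano")
# Shared tokens land in shared buckets, so the vectors overlap.
cos = sum(a * b for a, b in zip(e1, e2))
```

Even this toy version shows the property the pipeline relies on: prompts with shared vocabulary map to nearby vectors, which is what lets the downstream generator condition on text.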
Latent Feature Generation
Generating musical features (melody, rhythm) in a compressed latent space using autoregressive or diffusion models.
- Autoregressive: predicts the next latent token conditioned on the tokens generated so far
- Diffusion: iteratively denoises random latents toward the target features
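The two generation strategies above can be caricatured in a few lines each. This is a sketch under heavy assumptions: the token table `NEXT`, the vocabulary names, and the fixed 0.3 denoising step are all invented for illustration. In a real system, a trained network supplies the next-token distribution (autoregressive) or the noise estimate (diffusion); only the loop structure carries over.

```python
import random

# Toy "latent vocabulary" with hand-written transitions. A real
# autoregressive model replaces NEXT with a learned distribution.
NEXT = {
    "<start>": ["chord_Am"],
    "chord_Am": ["note_C", "note_E"],
    "note_C": ["note_E"],
    "note_E": ["chord_Am", "<end>"],
}

def generate(max_len: int = 8, seed: int = 0) -> list[str]:
    """Autoregressive loop: sample the next token given the last one."""
    rng = random.Random(seed)
    seq, tok = [], "<start>"
    while len(seq) < max_len:
        tok = rng.choice(NEXT[tok])
        if tok == "<end>":
            break
        seq.append(tok)
    return seq

def denoise(noisy: list[float], estimate: list[float],
            steps: int = 50) -> list[float]:
    """Diffusion caricature: repeatedly nudge noisy latents toward a
    (here, fixed) estimate of the clean features."""
    x = list(noisy)
    for _ in range(steps):
        x = [xi + 0.3 * (ei - xi) for xi, ei in zip(x, estimate)]
    return x

seq = generate()
clean = denoise([1.0, -1.0], [0.0, 0.0])
```

The practical difference for users: autoregressive models commit to earlier material and extend it, while diffusion models refine the whole clip at once over many denoising steps.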
Waveform Synthesis
Decoding intermediate representations such as Mel-spectrograms into audible waveforms; this stage largely determines audio quality and spatial character.
- Optimization point: vocoder performance and pre-training data
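To make the synthesis step concrete, here is a toy sinusoidal "vocoder": each frame lists (frequency, amplitude) pairs, and the frame is rendered as a sum of sinusoids. This is an illustrative assumption, not how neural vocoders work; learned vocoders (HiFi-GAN-style) map Mel-spectrograms to waveforms with a trained network, and this toy even restarts phase at every frame boundary.

```python
import math

def toy_vocoder(frames, sr: int = 8000, frame_len: int = 200) -> list[float]:
    """Render each frame of (freq_hz, amplitude) partials as summed
    sinusoids. Phase resets per frame, so frame joins click; a real
    vocoder learns to produce continuous, artifact-free audio."""
    wave = []
    for frame in frames:
        for n in range(frame_len):
            t = n / sr
            wave.append(sum(a * math.sin(2 * math.pi * f * t)
                            for f, a in frame))
    return wave

# Two frames: a pure A4, then A4 plus a quieter E5 partial.
frames = [[(440.0, 0.5)], [(440.0, 0.5), (660.0, 0.25)]]
wave = toy_vocoder(frames)
```

The takeaway matches the optimization point above: everything the listener hears passes through this decoder, so vocoder quality (and the data it was pre-trained on) caps the fidelity of the whole pipeline.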
Technical View
Understanding the stages helps you troubleshoot:
Embedding determines 'understanding' of the prompt.
Latent generation determines musical structure.
Vocoder determines 'audio fidelity'.