Rafael Valle
Email: aef@daeealrllyubke.elver

I'm a polymath research scientist and manager at NVIDIA, where I represent ADLR's (Applied Deep Learning Research) audio team. ADLR–Audio focuses on generative models with intelligence in audio understanding and synthesis, with occasional explorations in vision.

I am passionate about generative modeling, machine perception and machine improvisation. Over the years, I have had the opportunity to collaborate with fantastic researchers and co-invent Fugatto, Audio Flamingo, OMCAT, ETTA, Koel-TTS, P-Flow, the RAD* family of models with the One Aligner To Rule Them All, Flowtron and WaveGlow.

During my PhD at UC Berkeley, I was advised mainly by Prof. Sanjit Seshia and Prof. Edmund Campion, and my research focused on machine listening and improvisation. I was also part of the TerraSwarm Research Center, where I worked on problems related to adversarial attacks and verified artificial intelligence.

During Fall 2016, I was a Research Intern at Gracenote in Emeryville, where I worked on audio classification using deep learning. Previously, I was a Scientist Intern at Pandora in Oakland, where I investigated segments and scores that describe novelty-seeking behavior in listeners.

Before coming to Berkeley, I completed a master's in Computer Music from HMDK Stuttgart in Germany and a bachelor's in Orchestral Conducting from UFRJ in Brazil.

Publications

[NEW] Fugatto: Foundational Generative Audio Transformer Opus 1
Rafael Valle, Rohan Badlani, Zhifeng Kong, Sang-gil Lee, Arushi Goel, Joao Felipe Santos, Aya Aljafari, Sungwon Kim, Shuqi Dai, Siddharth Gururani, Alexander H. Liu, Kevin J. Shih, Ryan Prenger, Wei Ping, Chao-Han Huck Yang, Bryan Catanzaro
ICLR 2025

paper | website | abstract | bibtex

        @misc{fugatto2025,
          title={Fugatto: Foundational Generative Audio Transformer Opus 1},
          author={{Fugatto Team}},
          year={2025},
          note={ICLR 2025, available at \url{https://fugatto.github.io/}}
        }
        
Omcat

[NEW] OMCAT: Omni Context Aware Transformer
Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
arXiv preprint 2024

arXiv | abstract | bibtex

  @article{goel2024omcat,
    title={OMCAT: Omni context aware transformer},
    author={Goel, A and Sapra, K and Le, M and Valle, R and Tao, A and Catanzaro, B},
    journal={arXiv preprint arXiv:2410.12109},
    year={2024}
  }
        
Audio Flamingo 2

[NEW] Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
under review 2025

arXiv | abstract | bibtex

  @misc{ghosh2025audio,
          title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
          author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
          note={Under review},
          year={2025}
        }
        
Koel-TTS

[NEW] Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li
arXiv preprint 2025

arXiv | abstract | bibtex

        @article{hussain2025koelt,
          title={Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance},
          author={Hussain, S and Neekhara, P and Yang, X and Casanova, E and Ghosh, S and Desta, MT and others},
          journal={arXiv preprint arXiv:2502.05236},
          year={2025}
        }
        
UniWav

[NEW] UniWav
Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro
ICLR 2025

website | abstract | bibtex

        @misc{uniwav2025,
          title={UniWav},
          author={{UniWav Team}},
          year={2025},
          note={ICLR 2025, available at \url{https://research.nvidia.com/labs/twn/publication/iclr_2025_uniwav/}}
        }
        
A2SB

[NEW] A2SB: Audio-to-Audio Schrodinger Bridges
Zhifeng Kong*, Kevin J. Shih*, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro
arXiv preprint 2025

arXiv | abstract | bibtex

        @article{kong2025a2sb,
          title={A2SB: Audio-to-Audio Schrodinger Bridges},
          author={Kong, Z and Shih, KJ and Nie, W and Vahdat, A and Lee, S and Santos, JF and Jukic, A and Valle, R and others},
          journal={arXiv preprint arXiv:2501.11311},
          year={2025}
        }
        
ETTA
"A hip-hop track using sounds from a construction site—hammering nails as the beat, drilling sounds as scratches, and metal clanks as rhythm accents."

[NEW] ETTA: Elucidating the Design Space of Text-to-Audio Models
Sanggil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
arXiv preprint 2024

arXiv | abstract | bibtex

  @article{lee2024etta,
    title={ETTA: Elucidating the Design Space of Text-to-Audio Models},
    author={Lee, S and Kong, Z and Goel, A and Kim, S and Valle, R and Catanzaro, B},
    journal={arXiv preprint arXiv:2412.19351},
    year={2024}
  }
        
TangoFlux

[NEW] TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria
arXiv preprint 2024

arXiv | abstract | bibtex

  @article{hung2024tangoflux,
    title={TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization},
    author={Hung, CY and Majumder, N and Kong, Z and Mehrish, A and Valle, R and Catanzaro, B and Poria, S},
    journal={arXiv preprint},
    year={2024}
  }
        
Expressivesinger

[NEW] ExpressiveSinger: Multilingual and multi-style score-based singing voice synthesis with expressive performance control
Shuqi Dai, Ming-Yu Liu, Rafael Valle, Siddharth Gururani
ACM Multimedia 2024

pdf | abstract | bibtex

  @inproceedings{dai2024expressivesinger,
    title={ExpressiveSinger: Multilingual and multi-style score-based singing voice synthesis with expressive performance control},
    author={Dai, S and Liu, MY and Valle, R and Gururani, S},
    booktitle={Proc. 32nd ACM Multimedia},
    pages={3229--3238},
    year={2024}
  }
        
Synthio

[NEW] Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
ICLR 2025

arXiv | abstract | bibtex

  @article{ghosh2024synthio,
    title={Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data},
    author={Ghosh, S and Kumar, S and Kong, Z and Valle, R and Catanzaro, B and Manocha, D},
    journal={arXiv preprint arXiv:2410.02056},
    year={2024}
  }
        
Robust Alignment

[NEW] Improving robustness of LLM-based speech synthesis by learning monotonic alignment
Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg
arXiv preprint arXiv:2406.17957, 2024

arXiv | abstract | bibtex

  @article{neekhara2024robust,
    title={Improving robustness of {LLM}-based speech synthesis by learning monotonic alignment},
    author={Neekhara, P and Hussain, S and Ghosh, S and Li, J and Valle, R and Badlani, R and Ginsburg, B},
    journal={arXiv preprint arXiv:2406.17957},
    year={2024}
  }
        

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro
under review 2024

arXiv | abstract | bibtex

  @article{kong2024audio,
          title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
          author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
          journal={arXiv preprint arXiv:2402.01831},
          year={2024}
        }
        
P-Flow

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
Sungwon Kim, Kevin Shih, Rohan Badlani, Joao Felipe Santos, Evelina Bakhturina, Mikyas Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro
NeurIPS 2023

pdf | abstract | bibtex

        

RAD-MMM: Multilingual Multiaccented Multispeaker Text-to-Speech
Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
Interspeech 2023

pdf | abstract | bibtex | arXiv | code

      @inproceedings{badlani23_interspeech,
        author={Rohan Badlani and Rafael Valle and Kevin J. Shih and João Felipe Santos and Siddharth Gururani and Bryan Catanzaro},
        title={{RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech}},
        year=2023,
        booktitle={Proc. INTERSPEECH 2023},
        pages={626--630},
        doi={10.21437/Interspeech.2023-2330}
      }
      

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations
Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
Under Review 2024

pdf | abstract | bibtex | arXiv

        

SPACE: Speech-driven Portrait Animation with Controllable Expression
Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, Ming-Yu Liu
ICCV 2023

pdf | abstract | bibtex | arXiv

      @inproceedings{gururani2023space,
        title={SPACE: Speech-driven Portrait Animation with Controllable Expression},
        author={Gururani, Siddharth and Mallya, Arun and Wang, Ting-Chun and Valle, Rafael and Liu, Ming-Yu},
        booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
        pages={20914--20923},
        year={2023}
      }
      

High-Acoustic Fidelity Text To Speech Synthesis With Fine-Grained Control Of Speech Attributes
Rafael Valle, João Felipe Santos, Kevin J. Shih, Rohan Badlani, Bryan Catanzaro
ICASSP 2023

pdf | abstract | bibtex | code

        @inproceedings{valle2023high,
          title={High-Acoustic Fidelity Text To Speech Synthesis With Fine-Grained Control Of Speech Attributes},
          author={Valle, Rafael and Santos, Jo{\~a}o Felipe and Shih, Kevin J and Badlani, Rohan and Catanzaro, Bryan},
          booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={1--5},
          year={2023},
          organization={IEEE}
        }
        

Any-to-Any Voice Conversion with F0 and Timbre Disentanglement and Novel Timbre Conditioning
Sudheer Kovela, Rafael Valle, Ambrish Dantrey, Bryan Catanzaro
ICASSP 2023

pdf | abstract | bibtex

        @inproceedings{kovela2023any,
          title={Any-to-Any Voice Conversion with F0 and Timbre Disentanglement and Novel Timbre Conditioning},
          author={Kovela, Sudheer and Valle, Rafael and Dantrey, Ambrish and Catanzaro, Bryan},
          booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={1--5},
          year={2023},
          organization={IEEE}
        }
        

VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation
Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro
ICASSP 2023

pdf | abstract | bibtex | arXiv | code

        @inproceedings{badlani2023vani,
          title={VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation},
          author={Badlani, Rohan and Arora, Akshit and Ghosh, Subhankar and Valle, Rafael and Shih, Kevin J and Santos, Jo{\~a}o Felipe and Ginsburg, Boris and Catanzaro, Bryan},
          booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={1--2},
          year={2023},
          organization={IEEE}
        }
        

One TTS Alignment to Rule Them All
Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro
ICASSP 2022

pdf | abstract | bibtex | arXiv | code

        @inproceedings{badlani2022one,
          title={One TTS alignment to rule them all},
          author={Badlani, Rohan and {\L}a{\'n}cucki, Adrian and Shih, Kevin J and Valle, Rafael and Ping, Wei and Catanzaro, Bryan},
          booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={6092--6096},
          year={2022},
          organization={IEEE}
        }
        

Generative modeling for low dimensional speech attributes with neural spline flows
Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro
arXiv 2022

pdf | abstract | bibtex | arXiv | code

        @article{shih2022generative,
          title={Generative modeling for low dimensional speech attributes with neural spline flows},
          author={Shih, Kevin J and Valle, Rafael and Badlani, Rohan and Santos, Jo{\~a}o Felipe and Catanzaro, Bryan},
          journal={arXiv preprint arXiv:2203.01786},
          year={2022}
        }
        

RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis
Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, Bryan Catanzaro
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021

pdf | abstract | bibtex | code

        @inproceedings{shih2021rad,
          title={RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis},
          author={Shih, Kevin J and Valle, Rafael and Badlani, Rohan and {\L}a{\'n}cucki, Adrian and Ping, Wei and Catanzaro, Bryan},
          booktitle={ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models},
          year={2021}
        }
        

Character-based handwritten text transcription with attention networks
Jason Poulos and Rafael Valle
Neural Computing and Applications 2021

pdf | abstract | bibtex | arXiv

        @article{poulos2021character,
          title={Character-based handwritten text transcription with attention networks},
          author={Poulos, Jason and Valle, Rafael},
          journal={Neural Computing and Applications},
          volume={33},
          number={16},
          pages={10563--10573},
          year={2021},
          publisher={Springer}
        }
        

Improving Keyword Spotting with Synthetic Speech
U. Vaidya, Rafael Valle, M. Jain, U. Ahmed, V. Karandikar, S. S. Chauhan, Bryan Catanzaro

abstract

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro
arXiv 2019 - ICLR 2020

pdf | samples | abstract

Neural ODEs for Image Segmentation with Level Sets
Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro
arXiv 2019

pdf | abstract

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens
Rafael Valle*, Jason Li*, Ryan Prenger, Bryan Catanzaro
arXiv 2019 - ICASSP 2020

pdf | samples | abstract

WaveGlow: a Flow-based Generative Network for Speech Synthesis
Ryan Prenger, Rafael Valle, Bryan Catanzaro
ICASSP 2019

pdf | samples | abstract


TequilaGAN: How to easily identify GAN samples
Rafael Valle, Wilson Cai and Anish Doshi
arXiv 2018

pdf | abstract


Attacking Speaker Recognition with Deep Generative Models
Anish Doshi, Wilson Cai and Rafael Valle
arXiv 2017

pdf | abstract | code

Sequence Generation with GANs
Rafael Valle
2017

github | abstract | audio


Audio-Based Room Occupancy Analysis using Gaussian Mixtures and Hidden Markov Models
Rafael Valle
Future Technologies Conference (FTC) 2016
Detection and Classification of Acoustic Scenes and Events 2016

pdf | abstract | bibtex | arXiv | code

        @article{valle2016abroa,
          title={ABROA: Audio-Based Room-Occupancy Analysis using Gaussian Mixtures and Hidden Markov Models},
          author={Valle, Rafael},
          journal={arXiv preprint arXiv:1607.07801},
          year={2016}
        }
        

Missing Data Imputation for Supervised Learning
Jason Poulos and Rafael Valle
Applied Artificial Intelligence 2018

pdf | abstract | bibtex | arXiv | code

      @article{poulos2016missing,
        title={Missing Data Imputation for Supervised Learning},
        author={Poulos, Jason and Valle, Rafael},
        journal={arXiv preprint arXiv:1610.09075},
        year={2016}
      }
      

Learning and Visualizing Music Specifications using Pattern Graphs
Rafael Valle, Daniel Fremont, Ilge Akkaya, Alexandre Donze, Adrian Freed and Sanjit Seshia
ISMIR 2016

pdf | abstract | bibtex | code

      @inproceedings{valle2016learning,
        title={Learning and Visualizing Music Specifications using Pattern Graphs},
        author={Valle, Rafael and Fremont, Daniel J and Akkaya, Ilge and Donze, Alexandre and Freed, Adrian and Seshia, Sanjit S},
        booktitleaddon={Proceedings of the Seventeenth ISMIR Conference},
        booktitle={ISMIR},
        year={2016}
      }