I'm a polymath senior research scientist at NVIDIA
focusing on audio and computer vision applications. I am passionate about generative modeling,
machine perception and machine improvisation.
Normalizing Flows >>> VAEs >>> GANs
During Fall 2016 I was a Research Intern at Gracenote in Emeryville, where I
worked on audio classification using Deep Learning. Previously I was a
Scientist Intern at Pandora in Oakland, where I investigated
segments and scores that describe novelty-seeking behavior in listeners.
Before coming to Berkeley, I completed a master's in Computer Music from HMDK Stuttgart in Germany and a bachelor's in Orchestral Conducting from UFRJ in Brazil.
News
Paper about improving keyword spotting with synthetic speech.
Paper about a text-to-spectrogram model with more realism and expressivity than the current state of the art.
Paper about a novel approach for image segmentation combining Neural ODEs and the Level Set method.
Publications
Improving Keyword Spotting with Synthetic Speech
U. Vaidya, Rafael Valle, M. Jain, U. Ahmed, V. Karandikar, S. S. Chauhan, Bryan Catanzaro
In this paper we describe a method that uses text-to-speech (TTS) synthesis models to improve the quality of keyword spotting models and to reduce the time and money required to train them. We synthesize varied data from different speakers by combining Flowtron, a multispeaker text-to-mel-spectrogram synthesis model producing speech with high variance, and WaveGlow, a universal mel-spectrogram to audio model. We filter the synthetic data by using QuartzNet, an automatic speech recognition model, to find and remove samples with skipped, repeated and mispronounced words. With this filtered synthetic data and 10% of the human data we are able to achieve keyword spotting scores (accuracy and F1) that are comparable to using the full human dataset. We provide results on binary and multiclass Wake-up-Word datasets, including the Speech Commands Dataset.
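Below is a minimal sketch of the ASR-based filtering step, assuming a hypothetical transcribe() callable that wraps an ASR model such as QuartzNet; the word-error-rate threshold is an illustrative choice, not the paper's.

# Sketch: filter synthetic keyword-spotting samples with an ASR model.
# transcribe is a hypothetical callable mapping audio -> text.

def word_error_rate(ref: str, hyp: str) -> float:
    # Word-level Levenshtein distance, normalized by reference length.
    r, h = ref.lower().split(), hyp.lower().split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def filter_synthetic(samples, transcribe, max_wer=0.0):
    # Keep only (audio, text) pairs whose transcript matches the target text.
    return [(audio, text) for audio, text in samples
            if word_error_rate(text, transcribe(audio)) <= max_wer]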
In our recent paper, we
propose Flowtron: an autoregressive flow-based generative network for
text-to-speech synthesis with control over speech variation and style
transfer. Flowtron combines insights from IAF and optimizes Tacotron 2
in order to provide high-quality and controllable mel-spectrogram synthesis.
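As a rough illustration of the underlying flow mechanics (not Flowtron's actual architecture), an affine autoregressive step maps mel-spectrogram frames to a latent variable with a tractable log-determinant, so the exact log-likelihood can be maximized directly.

import torch

def affine_flow_step(x, mu, log_sigma):
    # One affine flow step: z = (x - mu) * exp(-log_sigma).
    # mu and log_sigma would come from an autoregressive network conditioned
    # on text and previous frames (hypothetical here).
    z = (x - mu) * torch.exp(-log_sigma)
    log_det = -log_sigma.flatten(1).sum(dim=1)  # log |det dz/dx| per sample
    return z, log_det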
We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method. Our approach parametrizes the evolution of an initial contour with a NODE that implicitly learns from data a speed function describing the evolution. In addition, for cases where an initial contour is not available and to alleviate the need for careful choice or design of contour embedding functions, we propose a NODE-based method that evolves an image embedding into a dense per-pixel semantic label space. We evaluate our methods on kidney segmentation (KiTS19) and on salient object detection (PASCAL-S, ECSSD and HKU-IS). In addition to improving initial contours provided by deep learning models while using a fraction of their number of parameters, our approach achieves F-scores that are higher than those of several state-of-the-art deep learning algorithms.
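A minimal sketch of evolving a per-pixel embedding with a Neural ODE is shown below, assuming the torchdiffeq package; the small convolutional dynamics function, tensor sizes and integration times are illustrative, not the configuration used in the paper.

import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumes the torchdiffeq package is installed

class ConvDynamics(nn.Module):
    # dy/dt = f(t, y): a small conv net standing in for the learned speed function.
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Tanh(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, t, y):
        return self.net(y)

func = ConvDynamics()
y0 = torch.randn(1, 16, 64, 64)        # initial embedding (e.g. from a CNN encoder)
t = torch.linspace(0.0, 1.0, 5)        # integration times
y_t = odeint(func, y0, t)              # (5, 1, 16, 64, 64): embedding along the evolution
logits = nn.Conv2d(16, 1, 1)(y_t[-1])  # per-pixel segmentation logits at the final time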
Mellotron is a multispeaker
voice synthesis model based on Tacotron 2 GST that can make a
voice emote and sing without emotive or singing training data. By
explicitly conditioning on rhythm and continuous pitch contours
from an audio signal or music score, Mellotron is able to generate
speech in a variety of styles ranging from read speech to
expressive speech, from slow drawls to rap and from monotonous
voice to singing voice.
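A sketch of extracting a continuous pitch contour from a reference recording with librosa's pYIN implementation; the file name, sampling rate and hop size are placeholders, and Mellotron's actual feature pipeline is not reproduced here.

import numpy as np
import librosa

# Extract a continuous F0 contour from a reference recording (pYIN).
y, sr = librosa.load("reference.wav", sr=22050)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, hop_length=256,
)
f0 = np.nan_to_num(f0)  # unvoiced frames -> 0, giving a dense contour to condition on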
We propose WaveGlow: a
flow-based network capable of generating high quality speech from
mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in
order to provide fast, efficient and high-quality audio synthesis,
without the need for auto-regression. WaveGlow is implemented using
only a single network, trained using only a single cost function:
maximizing the likelihood of the training data, which makes the
training procedure simple and stable.
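A minimal sketch of that single cost function, assuming a flow that maps audio x to a latent z and returns the accumulated log-determinants; the spherical Gaussian prior and its variance are generic choices here, not WaveGlow's exact settings.

import torch

def flow_nll(z, log_det_sum, sigma=1.0):
    # Negative log-likelihood of a normalizing flow with a spherical Gaussian prior:
    # -log p(x) = ||z||^2 / (2 * sigma^2) - sum(log |det J|), up to an additive constant.
    prior = z.flatten(1).pow(2).sum(dim=1) / (2 * sigma ** 2)
    return (prior - log_det_sum).mean()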
In this paper we show
strategies to easily identify fake samples generated with the
Generative Adversarial Network framework. One strategy is based on
the statistical analysis and comparison of raw pixel values and
features extracted from them. The other strategy learns formal
specifications from the real data and shows that fake samples
violate the specifications of the real data. We show that fake
samples produced with GANs have a universal signature that can be
used to identify fake samples. We provide results on MNIST,
CIFAR10, music and speech data.
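An illustrative version of the first strategy, comparing the distributions of raw pixel values from real and generated images with a two-sample Kolmogorov-Smirnov test; the choice of statistic is an assumption, not the paper's exact procedure.

import numpy as np
from scipy.stats import ks_2samp

def pixel_distribution_test(real_images, fake_images):
    # Compare raw pixel-value distributions of real vs. generated images.
    # Inputs: arrays of shape (N, H, W[, C]) with values in [0, 1].
    # A large KS statistic suggests a distinguishable pixel-level signature.
    real = np.asarray(real_images).ravel()
    fake = np.asarray(fake_images).ravel()
    stat, p_value = ks_2samp(real, fake)
    return stat, p_value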
In this paper we investigate the
ability of generative adversarial networks (GANs) to synthesize
spoofing attacks on modern speaker recognition systems. We first show
that samples generated with SampleRNN and WaveNet are unable to fool a
CNN-based speaker recognition system. We propose a modification of the
Wasserstein GAN objective function to make use of data that is real
but not from the class being learned. Our semi-supervised learning
method is able to perform both targeted and untargeted attacks,
raising questions related to security in speaker authentication systems.
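One plausible reading of the modified critic objective, written as a sketch rather than the paper's exact formulation: the critic scores target-class data above both generated samples and real samples from other classes, with the weight on the extra term as an assumed hyperparameter.

import torch

def critic_loss(d_target, d_fake, d_other, other_weight=1.0):
    # Wasserstein-style critic loss with an extra term for data that is real
    # but not from the target class. The exact form and weight of the third
    # term are assumptions for illustration only.
    return -d_target.mean() + d_fake.mean() + other_weight * d_other.mean()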
In this paper we investigate the generation of sequences using generative adversarial networks (GANs). We open the paper by providing a brief introduction to sequence generation and challenges in GANs. We briefly describe encoding strategies for text and MIDI data in light of their use with convolutional architectures. In our experiments we consider the unconditional generation of polyphonic and monophonic piano rolls as well as short sequences. For each data type, we provide sonic or text examples of generated data, interpolation in the latent space and vector arithmetic.
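A sketch of the piano-roll encoding of MIDI for use with convolutional architectures, assuming the pretty_midi package; the sampling rate and binarization are illustrative choices.

import numpy as np
import pretty_midi  # assumes the pretty_midi package is installed

def midi_to_piano_roll(path, fs=16):
    # Encode a MIDI file as a binary (128 pitches, T frames) piano-roll matrix,
    # suitable as a 2-D input to convolutional networks.
    roll = pretty_midi.PrettyMIDI(path).get_piano_roll(fs=fs)
    return (roll > 0).astype(np.float32)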
This paper outlines preliminary steps towards the development of an audio-based room-occupancy analysis model. Our approach borrows from the speech recognition tradition and is based on Gaussian Mixtures and Hidden Markov Models. We analyze possible challenges encountered in the development of such a model, and offer several solutions including feature design and prediction strategies. We provide results obtained from experiments with audio data from a retail store in Palo Alto, California. Model assessment is done via leave-two-out bootstrap, and the converged model achieves good accuracy, thus representing a contribution to multimodal people-counting algorithms.
@article{valle2016abroa,
title={ABROA: Audio-Based Room-Occupancy Analysis using Gaussian Mixtures and Hidden Markov Models},
author={Valle, Rafael},
journal={arXiv preprint arXiv:1607.07801},
year={2016}
}
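A minimal sketch of the room-occupancy modeling approach, assuming the hmmlearn and librosa packages: one GMM-HMM per occupancy class is trained on MFCC features, and a recording is classified by likelihood. Feature choices and model sizes are illustrative, not the paper's configuration.

import numpy as np
import librosa
from hmmlearn.hmm import GMMHMM  # assumes the hmmlearn package is installed

def mfcc_features(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def train_class_model(feature_list, n_states=3, n_mix=4):
    # Fit one GMM-HMM on all recordings of a single occupancy class.
    X = np.vstack(feature_list)
    lengths = [len(f) for f in feature_list]
    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(features, models):
    # Pick the occupancy class whose model assigns the highest log-likelihood.
    return max(models, key=lambda label: models[label].score(features))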
This paper compares methods for imputing missing categorical data for supervised learning tasks. The ability of researchers to accurately fit a model and yield unbiased estimates may be compromised by missing data, which are prevalent in survey-based social science research. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on non-imputed (i.e., one-hot encoded) or imputed data with different degrees of missing-data perturbation. The results show that imputation methods can increase predictive accuracy in the presence of missing-data perturbation. Additionally, we find that for imputed models, missing-data perturbation can improve prediction accuracy by regularizing the classifier.
@article{poulos2016missing,
title={Missing Data Imputation for Supervised Learning},
author={Poulos, Jason and Valle, Rafael},
journal={arXiv preprint arXiv:1610.09075},
year={2016}
}
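A small sketch of the imputation comparison with scikit-learn, assuming categorical features X containing missing values and labels y (placeholders): one pipeline keeps missingness as its own one-hot category, the other imputes the most frequent category first. The classifier is a placeholder, not the one used in the paper.

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Baseline: treat missingness as its own category before one-hot encoding.
baseline = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

# Imputed variant: replace missing entries with the most frequent category.
imputed = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

# scores_baseline = cross_val_score(baseline, X, y, cv=5)
# scores_imputed = cross_val_score(imputed, X, y, cv=5)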
We describe a system to learn and visualize specifications from song(s) in symbolic and audio formats. The core of our approach is based on a software engineering procedure called specification mining. Our procedure extracts patterns from feature vectors and uses them to build pattern graphs. The feature vectors are created by segmenting song(s) and extracting time- and frequency-domain features from them, such as chromagrams, chord degree and interval classification. The pattern graphs built on these feature vectors provide the likelihood of a pattern between nodes, as well as start and ending nodes. The pattern graphs learned from song(s) describe formal specifications that can be used for human-interpretable quantitative and qualitative song comparison or to perform supervisory control in machine improvisation. We offer results in song summarization, song and style validation and machine improvisation with formal specifications.
@inproceedings{valle2016learning,
title={Learning and Visualizing Music Specifications using Pattern Graphs},
author={Valle, Rafael and Fremont, Daniel J and Akkaya, Ilge and Donze, Alexandre and Freed, Adrian and Seshia, Sanjit S},
booktitle={Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR)},
year={2016}
}
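Below is an illustrative pattern graph built from symbolic feature sequences: nodes are feature values, edges carry transition likelihoods, and start and end nodes are tracked. It is a simplified stand-in for the specification-mining procedure, not the implementation used in the paper.

from collections import defaultdict

def build_pattern_graph(sequences):
    # Build a pattern graph from symbolic feature sequences (e.g. chord degrees
    # or interval classes). Returns per-node transition probabilities plus the
    # sets of observed start and end nodes.
    counts = defaultdict(lambda: defaultdict(int))
    starts, ends = set(), set()
    for seq in sequences:
        if not seq:
            continue
        starts.add(seq[0])
        ends.add(seq[-1])
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    graph = {
        node: {nxt: c / sum(succ.values()) for nxt, c in succ.items()}
        for node, succ in counts.items()
    }
    return graph, starts, ends

# Example: mine transitions over chord-degree sequences.
graph, starts, ends = build_pattern_graph([["I", "IV", "V", "I"], ["I", "V", "I"]])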