Rafael Valle
Email:

I'm a polymath senior research scientist and manager at NVIDIA, where I represent the audio team of ADLR (Applied Deep Learning Research). ADLR–Audio focuses on generative models in audio, text and vision, with an emphasis on audio understanding and audio synthesis.

I am passionate about generative modeling, machine perception and machine improvisation. Over the years, I have had the opportunity to collaborate with fantastic researchers and co-invent Audio Flamingo, P-Flow, the RAD* family of models with its One TTS Alignment To Rule Them All, Flowtron, and WaveGlow.

During my PhD at UC Berkeley, I was advised mainly by Prof. Sanjit Seshia and Prof. Edmund Campion, and my research focused on machine listening and improvisation. I was also part of the TerraSwarm Research Center, where I worked on problems related to adversarial attacks and verified artificial intelligence.

During Fall 2016 I was a Research Intern at Gracenote in Emeryville, where I worked on audio classification using deep learning. Previously, I was a Scientist Intern at Pandora in Oakland, where I investigated segments and scores that describe novelty-seeking behavior in listeners.

Before coming to Berkeley, I completed a master's in Computer Music from HMDK Stuttgart in Germany and a bachelor's in Orchestral Conducting from UFRJ in Brazil.

Publications

[NEW] Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro
under review 2024

arXiv | abstract | bibtex

In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks.

	@article{kong2024audio,
          title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
          author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
          journal={arXiv preprint arXiv:2402.01831},
          year={2024}
        }
        

[NEW] P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
Sungwon Kim, Kevin Shih, Rohan Badlani, João Felipe Santos, Evelina Bakhturina, Mikyas Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro
NeurIPS 2023

pdf | abstract | bibtex

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate a speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than in real-time. Unlike the neural codec language models, we specifically train P-Flow on the LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of the large-scale zero-shot TTS models with two orders of magnitude less training data and has more than 20× faster sampling speed. Our results show that P-Flow has better pronunciation and is preferred in human likeness and speaker similarity to its recent state-of-the-art counterparts, thus defining P-Flow as an attractive and desirable alternative.
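
For readers curious about the decoder, the core of flow matching training fits in a few lines. The sketch below is a generic conditional flow-matching step in PyTorch, not the actual P-Flow code; velocity_model and the tensor shapes are hypothetical placeholders.

    import torch

    def flow_matching_step(velocity_model, mel, cond):
        # mel: (batch, n_mels, frames) target mel-spectrograms
        # cond: speaker-conditional text encoding from the speech-prompted text encoder
        x1 = mel
        x0 = torch.randn_like(x1)                    # noise endpoint of the probability path
        t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
        x_t = (1 - t) * x0 + t * x1                  # linear interpolation between noise and data
        target_velocity = x1 - x0                    # constant velocity along that path
        pred = velocity_model(x_t, t, cond)          # hypothetical velocity network
        return torch.mean((pred - target_velocity) ** 2)

At inference, integrating dx/dt = v(x, t, cond) from t = 0 to 1 with a handful of ODE solver steps yields the mel-spectrogram, which is where the fast sampling comes from.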

        

[NEW] RADMMM: Multilingual Multiaccented Multispeaker Text-to-Speech
Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
Interspeech 2023

pdf | abstract | bibtex | arXiv | code

We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, leading to poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.

      @inproceedings{badlani23_interspeech,
        author={Rohan Badlani and Rafael Valle and Kevin J. Shih and João Felipe Santos and Siddharth Gururani and Bryan Catanzaro},
        title={{RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech}},
        year=2023,
        booktitle={Proc. INTERSPEECH 2023},
        pages={626--630},
        doi={10.21437/Interspeech.2023-2330}
      }
      

[NEW] SelfVC: Voice Conversion With Iterative Refinement using Self Transformations
Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
Under Review 2024

pdf | abstract | bibtex | arXiv

We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on explicitly disentangling speech representations to separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss by discarding finer nuances of the original signal. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. In this training approach, the current state of the synthesis model is used to generate voice-converted variations of an utterance, which serve as inputs for the reconstruction task, ensuring a continuous and purposeful refinement of the model. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. SelfVC is trained without any text and is applicable to a range of tasks such as zero-shot voice conversion, cross-lingual voice conversion, and controllable speech synthesis with pitch and pace modifications. SelfVC achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
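
The self-transformation loop can be sketched as follows; vc_model, content_encoder and speaker_encoder are hypothetical callables standing in for the synthesis model, the SSL feature extractor and the speaker verification model, and the reconstruction loss is illustrative.

    import random
    import torch

    def training_step(vc_model, content_encoder, speaker_encoder,
                      utterance, other_utterances, p_self=0.5):
        target_emb = speaker_encoder(utterance)
        source = utterance
        if random.random() < p_self:
            # use the current model to voice-convert the utterance to a random other speaker;
            # the converted audio then serves as input for the reconstruction task
            with torch.no_grad():
                other_emb = speaker_encoder(random.choice(other_utterances))
                source = vc_model(content_encoder(utterance), other_emb)
        recon = vc_model(content_encoder(source), target_emb)
        return torch.mean(torch.abs(recon - utterance))  # e.g., L1 reconstruction loss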

        

[NEW] SPACE: Speech-driven Portrait Animation with Controllable Expression
Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, Ming-Yu Liu
ICCV 2023

pdf | abstract | bibtex | arXiv

Animating portraits using speech has received growing attention in recent years, with various creative and practical use cases. An ideal generated video should have good lip sync with the audio, natural facial expressions and head motions, and high frame quality. In this work, we present SPACE, which uses speech and a single image to generate high-resolution and expressive videos with realistic head pose, without requiring a driving video. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACE also allows for the control of emotions and their intensities. Our method outperforms prior methods in objective metrics for image quality and facial motions and is strongly preferred by users in pair-wise comparisons.

      @inproceedings{gururani2023space,
        title={SPACE: Speech-driven Portrait Animation with Controllable Expression},
        author={Gururani, Siddharth and Mallya, Arun and Wang, Ting-Chun and Valle, Rafael and Liu, Ming-Yu},
        booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
        pages={20914--20923},
        year={2023}
      }
      

[NEW] High-Acoustic Fidelity Text To Speech Synthesis With Fine-Grained Control Of Speech Attributes
Rafael Valle, João Felipe Santos, Kevin J. Shih, Rohan Badlani, Bryan Catanzaro
ICASSP 2023

pdf | abstract | bibtex | code

Recently developed neural-based TTS models have focused on robustness and finer control over acoustic features such as phoneme duration, energy, and F0, allowing users to have some degree of control over the prosody of the generated speech. We propose a model with fine-grained attribute control, which also has better acoustic fidelity (attributes of the output which we want to control do not deviate from the control signals) than previously proposed models, as shown in our experiments. Unlike other models, our proposed model does not require fine-tuning the vocoder on its outputs, indicating that it generates higher quality mel-spectrograms that are closer to the ground-truth distribution than that of other models.

        @inproceedings{valle2023high,
          title={High-Acoustic Fidelity Text To Speech Synthesis With Fine-Grained Control Of Speech Attributes},
          author={Valle, Rafael and Santos, Jo{\~a}o Felipe and Shih, Kevin J and Badlani, Rohan and Catanzaro, Bryan},
          booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={1--5},
          year={2023},
          organization={IEEE}
        }
        

[NEW] Any-to-Any Voice Conversion with F0 and Timbre Disentanglement and Novel Timbre Conditioning
Sudheer Kovela, Rafael Valle, Ambrish Dantrey, Bryan Catanzaro
ICASSP 2023

pdf | abstract | bibtex

Despite recent advances in voice conversion (VC), it is still challenging to do real-time one-shot voice conversion with good control over timbre and F0. In this work, we present a PPG-based VC model that directly decodes waveforms. We designed a speaker conditioned decoder based on HiFi-GAN, along with a new discriminator that produces high quality audio. Using an F0 prenet and F0 augmented speaker encoder, we are able to control F0 and timbre independently with high fidelity. Our objective and subjective evaluations show that our method is preferred over others in terms of audio quality, timbre similarity and prosody retention.

        @inproceedings{kovela2023any,
          title={Any-to-Any Voice Conversion with F0 and Timbre Disentanglement and Novel Timbre Conditioning},
          author={Kovela, Sudheer and Valle, Rafael and Dantrey, Ambrish and Catanzaro, Bryan},
          booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={1--5},
          year={2023},
          organization={IEEE}
        }
        

VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation
Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro
ICASSP 2023

pdf | abstract | bibtex | arXiv | code

We introduce VANI, a very lightweight multi-lingual accent controllable speech synthesis system. Our model builds upon disentanglement strategies proposed in RADMMM and supports explicit control of accent, language, speaker and fine-grained F0 and energy features for speech synthesis. We utilize the Indic languages dataset, released for LIMMITS 2023 as part of the ICASSP Signal Processing Grand Challenge, to synthesize speech in 3 different languages. Our model supports transferring the language of a speaker while retaining their voice and the native accent of the target language. We utilize the large-parameter RADMMM model for Track 1 and the lightweight VANI model for Tracks 2 and 3 of the competition.

        @inproceedings{badlani2023vani,
          title={VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation},
          author={Badlani, Rohan and Arora, Akshit and Ghosh, Subhankar and Valle, Rafael and Shih, Kevin J and Santos, Jo{\~a}o Felipe and Ginsburg, Boris and Catanzaro, Bryan},
          booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={1--2},
          year={2023},
          organization={IEEE}
        }
        

One TTS Alignment to Rule Them All
Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro
ICASSP 2022

pdf | abstract | bibtex | arXiv | code

Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper we leverage the alignment mechanism proposed in RAD-TTS and demonstrate its applicability to a wide variety of neural TTS models. The alignment learning framework combines the forward-sum algorithm, the Viterbi algorithm, and an efficient static prior. In our experiments, the framework improves all tested TTS architectures, both autoregressive (Flowtron, Tacotron 2) and non-autoregressive (FastPitch, FastSpeech 2, RAD-TTS). Specifically, it improves alignment convergence speed, simplifies the training pipeline by eliminating the need for external aligners, enhances robustness to errors on long utterances, and improves the perceived speech synthesis quality, as judged by human evaluators.
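
The forward-sum term of the alignment objective is a small dynamic program; below is a minimal PyTorch sketch (the static 2D prior and the Viterbi step are omitted), assuming a precomputed matrix of per-frame token log-probabilities.

    import torch

    def forward_sum_loss(log_probs):
        # log_probs: (n_text, n_mel) with log P(token i | frame j), e.g. a log-softmax over
        # tokens of a learned text-mel similarity matrix (optionally combined with a static prior)
        n_text, n_mel = log_probs.shape
        neg_inf = log_probs.new_full((1,), float("-inf"))
        # alpha[i]: log-probability mass of monotonic alignments ending on token i at the current frame
        alpha = torch.cat([log_probs[:1, 0], neg_inf.expand(n_text - 1)])
        for j in range(1, n_mel):
            stay = alpha                                # frame j stays on token i
            advance = torch.cat([neg_inf, alpha[:-1]])  # frame j advances to token i + 1
            alpha = log_probs[:, j] + torch.logsumexp(torch.stack([stay, advance]), dim=0)
        return -alpha[-1]  # all tokens must be consumed by the last frame

At convergence, the Viterbi path through the same matrix provides hard durations for the non-autoregressive models.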

        @inproceedings{badlani2022one,
          title={One TTS alignment to rule them all},
          author={Badlani, Rohan and {\L}a{\'n}cucki, Adrian and Shih, Kevin J and Valle, Rafael and Ping, Wei and Catanzaro, Bryan},
          booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={6092--6096},
          year={2022},
          organization={IEEE}
        }
        

Generative modeling for low dimensional speech attributes with neural spline flows
Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro
arXiv 2022

pdf | abstract | bibtex | arXiv | code

Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability as pitch-conditioned deterministic models such as FastPitch and FastSpeech2. Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting. Our work explores several techniques for handling the aforementioned issues in the context of Normalizing Flow models. We also find this problem to be very well suited for Neural Spline Flows, which are a highly expressive alternative to the more common affine-coupling mechanism in Normalizing Flows.

        @article{shih2022generative,
          title={Generative modeling for low dimensional speech attributes with neural spline flows},
          author={Shih, Kevin J and Valle, Rafael and Badlani, Rohan and Santos, Jo{\~a}o Felipe and Catanzaro, Bryan},
          journal={arXiv preprint arXiv:2203.01786},
          year={2022}
        }
        

RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis
Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, Bryan Catanzaro
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021

pdf | abstract | bibtex | code

This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution to facilitate variable token duration during inference. We further propose a robust framework for the on-line extraction of speech-text alignments, a critical yet highly unstable learning problem in end-to-end TTS frameworks. Our experiments demonstrate that our proposed techniques yield improved alignment quality and better output diversity compared to controlled baselines.

        @inproceedings{shih2021rad,
          title={RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis},
          author={Shih, Kevin J and Valle, Rafael and Badlani, Rohan and Lancucki, Adrian and Ping, Wei and Catanzaro, Bryan},
          booktitle={ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models},
          year={2021}
        }
        

Character-based handwritten text transcription with attention networks
Jason Poulos and Rafael Valle
Neural Computing and Applications 2021

pdf | abstract | bibtex | arXiv |

The paper approaches the task of handwritten text recognition (HTR) with attentional encoder-decoder networks trained on sequences of characters, rather than words. We experiment on lines of text from popular handwriting datasets and compare different activation functions for the attention mechanism used for aligning image pixels and target characters. We find that softmax attention focuses heavily on individual characters, while sigmoid attention focuses on multiple characters at each step of the decoding. When the sequence alignment is one-to-one, softmax attention is able to learn a more precise alignment at each step of the decoding, whereas the alignment generated by sigmoid attention is much less precise. When a linear function is used to obtain attention weights, the model predicts a character by looking at the entire sequence of characters and performs poorly because it lacks a precise alignment between the source and target. Future research may explore HTR in natural scene images, since the model is capable of transcribing handwritten text without the need for producing segmentations or bounding boxes of text in images.
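
The three attention variants compared in the paper differ only in how the alignment scores are normalized; a minimal sketch, with the additive scorer that produces the scores assumed:

    import torch

    def attention_weights(scores, kind="softmax"):
        # scores: (batch, source_len) alignment scores for the current decoding step
        if kind == "softmax":
            return torch.softmax(scores, dim=-1)  # peaky; tends to focus on a single character
        if kind == "sigmoid":
            return torch.sigmoid(scores)          # independent gates; can cover several characters
        return scores                             # "linear": raw scores, no precise alignment

    # context for the decoder: torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)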

        @article{poulos2021character,
          title={Character-based handwritten text transcription with attention networks},
          author={Poulos, Jason and Valle, Rafael},
          journal={Neural Computing and Applications},
          volume={33},
          number={16},
          pages={10563--10573},
          year={2021},
          publisher={Springer}
        }
        

Improving Keyword Spotting with Synthetic Speech
U. Vaidya, Rafael Valle, M. Jain, U. Ahmed, V. Karandikar, S. S. Chauhan, Bryan Catanzaro

abstract

In this paper we describe a method that uses text-to-speech (TTS) synthesis models to improve the quality of keyword spotting models and to reduce the time and money required to train them. We synthesize varied data from different speakers by combining Flowtron, a multispeaker text-to-mel-spectrogram synthesis model producing speech with high variance, and WaveGlow, a universal mel-spectrogram to audio model. We fine-tune the synthetic data by using QuartzNet, an automatic speech recognition model, to find and remove samples with skipped, repeated and mispronounced words. With this fine-tuned synthetic data and 10% of human data we are able to achieve keyword spotting scores (accuracy and F1) that are comparable to using the full human dataset. We provide results on binary and multiclass Wake-up-Word datasets, including the Speech Commands Dataset.
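
A sketch of the ASR-based filtering step: synthesize(text, speaker) and transcribe(audio) are hypothetical callables wrapping the Flowtron + WaveGlow pipeline and a QuartzNet-style ASR model, and jiwer is used to compute the word error rate.

    import jiwer  # word error rate metric

    def filter_synthetic_samples(keyword, speakers, synthesize, transcribe, max_wer=0.0):
        # keep only synthetic utterances whose transcript matches the target keyword,
        # discarding samples with skipped, repeated or mispronounced words
        kept = []
        for speaker in speakers:
            audio = synthesize(keyword, speaker)
            if jiwer.wer(keyword, transcribe(audio)) <= max_wer:
                kept.append((audio, keyword))
        return kept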

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro
arXiv 2020 - ICLR 2021

pdf | samples | abstract

In our recent paper, we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron combines insights from IAF and optimizes Tacotron 2 in order to provide high-quality and controllable mel-spectrogram synthesis.

Neural ODEs for Image Segmentation with Level Sets
Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro
arXiv 2019

pdf | abstract

We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method. Our approach parametrizes the evolution of an initial contour with a NODE that implicitly learns from data a speed function describing the evolution. In addition, for cases where an initial contour is not available and to alleviate the need for careful choice or design of contour embedding functions, we propose a NODE-based method that evolves an image embedding into a dense per-pixel semantic label space. We evaluate our methods on kidney segmentation (KiTS19) and on salient object detection (PASCAL-S, ECSSD and HKU-IS). In addition to improving initial contours provided by deep learning models while using a fraction of their number of parameters, our approach achieves F scores that are higher than several state-of-the-art deep learning algorithms.
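
A minimal sketch of the idea using torchdiffeq: a small network parametrizes the time derivative of the contour embedding and an ODE solver evolves it. The architecture and shapes below are placeholders, not the models from the paper.

    import torch
    from torchdiffeq import odeint  # pip install torchdiffeq

    class ContourDynamics(torch.nn.Module):
        # learns a speed function describing the evolution of the level-set embedding phi
        def __init__(self, channels=1, hidden=16):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Conv2d(channels, hidden, 3, padding=1), torch.nn.ReLU(),
                torch.nn.Conv2d(hidden, channels, 3, padding=1))

        def forward(self, t, phi):
            return self.net(phi)  # d(phi)/dt

    dynamics = ContourDynamics()
    phi0 = torch.randn(1, 1, 64, 64)   # initial contour embedding (e.g., a signed distance map)
    t = torch.linspace(0.0, 1.0, 5)
    phi_t = odeint(dynamics, phi0, t)  # phi_t[-1] holds the evolved embedding; threshold at 0 for the mask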

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens
Rafael Valle*, Jason Li*, Ryan Prenger, Bryan Catanzaro
arXiv 2019 - ICASSP 2020

pdf | samples | abstract

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice.

WaveGlow: a Flow-based Generative Network for Speech Synthesis
Ryan Prenger, Rafael Valle, Bryan Catanzaro
ICASSP 2019

pdf | samples | abstract

We propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
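
The training criterion is simply the change-of-variables log-likelihood; a sketch, up to the per-element normalization used in the released implementation:

    import torch

    def waveglow_loss(z, log_det_sum, sigma=1.0):
        # z: latent obtained by pushing audio through the invertible network
        # log_det_sum: accumulated log-determinants of the affine coupling and 1x1 convolution layers
        return torch.sum(z * z) / (2 * sigma * sigma) - log_det_sum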


TequilaGAN: How to easily identify GAN samples
Rafael Valle, Wilson Cai and Anish Doshi
arXiv 2018

pdf | abstract

In this paper we show strategies to easily identify fake samples generated with the Generative Adversarial Network framework. One strategy is based on the statistical analysis and comparison of raw pixel values and features extracted from them. The other strategy learns formal specifications from the real data and shows that fake samples violate the specifications of the real data. We show that fake samples produced with GANs have a universal signature that can be used to identify fake samples. We provide results on MNIST, CIFAR10, music and speech data.
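
One concrete instance of the raw-value analysis is a two-sample Kolmogorov-Smirnov test on flattened values; the sketch below shows one possible statistic, not necessarily the exact one used in the paper.

    import numpy as np
    from scipy.stats import ks_2samp

    def compare_raw_values(real_samples, fake_samples):
        # compare the empirical distributions of raw values (pixels or waveform samples)
        result = ks_2samp(np.ravel(real_samples), np.ravel(fake_samples))
        return result.statistic, result.pvalue  # a large statistic / tiny p-value flags the fakes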


Attacking Speaker Recognition with Deep Generative Models
Anish Doshi, Wilson Cai and Rafael Valle
arXiv 2017

pdf | abstract | code

In this paper we investigate the ability of generative adversarial networks (GANs) to synthesize spoofing attacks on modern speaker recognition systems. We first show that samples generated with SampleRNN and WaveNet are unable to fool a CNN-based speaker recognition system. We propose a modification of the Wasserstein GAN objective function to make use of data that is real but not from the class being learned. Our semi-supervised learning method is able to perform both targeted and untargeted attacks, raising questions related to security in speaker authentication systems.

Sequence Generation with GANs
Rafael Valle
2017

github | abstract | audio

In this paper we investigate the generation of sequences using generative adversarial networks (GANs). We open the paper by providing a brief introduction to sequence generation and challenges in GANs. We briefly describe encoding strategies for text and MIDI data in light of their use with convolutional architectures. In our experiments we consider the unconditional generation of polyphonic and monophonic piano rolls as well as short sequences. For each data type, we provide sonic or text examples of generated data, interpolation in the latent space and vector arithmetic.


Audio-Based Room Occupancy Analysis using Gaussian Mixtures and Hidden Markov Models
Rafael Valle
Future Technologies Conference (FTC) 2016
Detection and Classification of Acoustic Scenes and Events 2016

pdf | abstract | bibtex | arXiv | code

This paper outlines preliminary steps towards the development of an audio-based room-occupancy analysis model. Our approach borrows from the speech recognition tradition and is based on Gaussian Mixtures and Hidden Markov Models. We analyze possible challenges encountered in the development of such a model, and offer several solutions including feature design and prediction strategies. We provide results obtained from experiments with audio data from a retail store in Palo Alto, California. Model assessment is done via a leave-two-out bootstrap, and the converged model achieves good accuracy, thus representing a contribution to multimodal people-counting algorithms.
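
The general recipe can be sketched with librosa and hmmlearn: extract frame-level features and fit one Gaussian-mixture HMM per occupancy class. The features, topology and mixture counts below are placeholders, not the exact configuration from the paper.

    import librosa
    import numpy as np
    from hmmlearn import hmm

    def train_occupancy_model(wav_paths, n_states=3, n_mix=4):
        feats, lengths = [], []
        for path in wav_paths:
            y, sr = librosa.load(path, sr=None)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)
            feats.append(mfcc)
            lengths.append(len(mfcc))
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag")
        model.fit(np.vstack(feats), lengths)
        return model

    # prediction: score a new recording under each class model and pick the highest-scoring class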

        @article{valle2016abroa,
          title={ABROA: Audio-Based Room-Occupancy Analysis using Gaussian Mixtures and Hidden Markov Models},
          author={Valle, Rafael},
          journal={arXiv preprint arXiv:1607.07801},
          year={2016}
        }
        

Missing Data Imputation for Supervised Classification
Jason Poulos and Rafael Valle
Applied Artificial Intelligence 2018

pdf | abstract | bibtex | arXiv | code

This paper compares methods for imputing missing categorical data for supervised learning tasks. The ability of researchers to accurately fit a model and yield unbiased estimates may be compromised by missing data, which are prevalent in survey-based social science research. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on non-imputed (i.e., one-hot encoded) or imputed data with different degrees of missing data perturbation. The results show imputation methods can increase predictive accuracy in the presence of missing-data perturbation. Additionally, we find that for imputed models, missing data perturbation can improve prediction accuracy by regularizing the classifier.
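
The experimental setup can be illustrated with scikit-learn: train one classifier on mode-imputed features and one where "missing" is kept as its own one-hot category, then compare cross-validated accuracy. This is a simplified stand-in for the paper's benchmarks and imputation methods.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    def compare_imputation(X, y):
        # X: categorical feature matrix (object dtype) with np.nan marking missing entries
        imputed = make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore"),
            RandomForestClassifier(n_estimators=200, random_state=0))
        non_imputed = make_pipeline(
            SimpleImputer(strategy="constant", fill_value="missing"),
            OneHotEncoder(handle_unknown="ignore"),
            RandomForestClassifier(n_estimators=200, random_state=0))
        return (cross_val_score(imputed, X, y).mean(),
                cross_val_score(non_imputed, X, y).mean())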

      @article{poulos2016missing,
        title={Missing Data Imputation for Supervised Learning},
        author={Poulos, Jason and Valle, Rafael},
        journal={arXiv preprint arXiv:1610.09075},
        year={2016}
      }
      

Learning and Visualizing Music Specifications using Pattern Graphs
Rafael Valle, Daniel Fremont, Ilge Akkaya, Alexandre Donze, Adrian Freed and Sanjit Seshia
ISMIR 2016

pdf | abstract | bibtex | code

We describe a system to learn and visualize specifications from song(s) in symbolic and audio formats. The core of our approach is based on a software engineering procedure called specification mining. Our procedure extracts patterns from feature vectors and uses them to build pattern graphs. The feature vectors are created by segmenting song(s) and extracting time- and frequency-domain features from them, such as chromagrams, chord degree and interval classification. The pattern graphs built on these feature vectors provide the likelihood of a pattern between nodes, as well as start and ending nodes. The pattern graphs learned from song(s) describe formal specifications that can be used for human-interpretable quantitative and qualitative song comparison or to perform supervisory control in machine improvisation. We offer results in song summarization, song and style validation, and machine improvisation with formal specifications.
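
A toy version of a pattern graph can be built with networkx: nodes are feature symbols (e.g., chord degrees) and edges carry empirical transition likelihoods, with designated start and end nodes. This is a simplified stand-in for the specification-mining procedure in the paper.

    from collections import Counter
    import networkx as nx

    def build_pattern_graph(symbols):
        # symbols: sequence of feature symbols extracted from song segments, e.g. chord degrees
        counts = Counter(zip(symbols[:-1], symbols[1:]))
        totals = Counter(symbols[:-1])
        graph = nx.DiGraph()
        for (src, dst), c in counts.items():
            graph.add_edge(src, dst, weight=c / totals[src])
        graph.graph["start"], graph.graph["end"] = symbols[0], symbols[-1]
        return graph

    g = build_pattern_graph(["I", "IV", "V", "I", "vi", "IV", "V", "I"])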

      @inproceedings{valle2016learning,
        title={Learning and Visualizing Music Specifications using Pattern Graphs},
        author={Valle, Rafael and Fremont, Daniel J and Akkaya, Ilge and Donze, Alexandre and Freed, Adrian and Seshia, Sanjit S},
        booktitleaddon={Proceedings of the Seventeenth ISMIR Conference},
        booktitle={ISMIR},
        year={2016}
      }