# Speech synthesis project description and first attempt at a regression MLP

Today, for my first blog post, I’ll introduce the Speech Synthesis project I’ll be working on, as well as my first attempts at a model using a vanilla feedforward neural net.

The project: This semester’s project consists of using TIMIT, a well-known speech dataset, to build a speech synthesis model. The dataset essentially contains sound files (.wav) and transcription files (.txt). For example, a model could, at a basic level, be trained with a sequence of phonemes as input and sound waves as target, and then generate sound samples given a new list of phonemes as input. Information about the speaker, such as gender, age and dialect, could also be introduced in the model to produce specific voices or reproduce those of some of the dataset speakers. Attempting to create this sort of generative speech model, using the deep learning algorithms discussed in class, will thus be at the core of this project.

The dataset: The TIMIT dataset, published in 1993, is composed of recordings from 630 speakers speaking 8 different American English dialects. Each speaker read 10 sentences, 2 of which are the same for every speaker. It is important to note that the number of speakers for each dialect varies, and that approximately 70% of the speakers are male. Each sentence is documented by 4 files: a .wav file containing the speech waveform (sampled at 16 kHz), and 3 .txt transcription files (orthographic, word and phonetic transcriptions). The following table, taken from the TIMIT 1993 release document, summarizes the speaker statistics.

# First experiment: Predicting the next acoustic sample

For my first experiment, and as suggested by Prof. Bengio (http://ift6266h14.wordpress.com/experimenting/), I will aim at training a feedforward neural network to predict the next acoustic sample based on a fixed window of previous samples. In this post I will go through all the steps I had to take to get there: obtaining the acoustic sample data in a Python-readable format, extracting the features for the task, and building and using the NN model. I will conclude by discussing the results.

## Obtaining acoustic sample data in readable format for Python

Laurent has already figured out what to do and explains the process in his blog. I’ll recapitulate briefly here to document my first experiment.

Apparently, the preprocessed data that we had at first has gone through a frequency transform called MFCC. This is problematic because this type of data is not easily transformable back into sound samples. For this reason, it was said in class (Jan 30) that we should not use this format for speech synthesis.

Moreover, the version of TIMIT with sound samples only, stored as NIST .wav files, apparently isn’t directly readable by standard audio players.

Laurent suggests using the Linux “sox” command to convert that format into a readable .wav format. He then notes that the scipy.io.wavfile module can be used to convert between .wav files and data vectors. This data has already been converted and is available on the LISA server. Two file types will be sufficient for my first exploratory experiment: readable .wav files (.wav) and the corresponding numpy arrays (.npy).

I’ll begin my experiment with the very first subject of the first dialect, in /DR1/FCJF0. Importing and visualizing the first utterance is straightforward.
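As a rough sketch of what the loading step could look like, the snippet below reads either file type into a numpy array. The helper `load_utterance` and the file names are hypothetical, and a synthetic tone stands in for a real TIMIT utterance since the actual paths on the LISA server are not given here. The 16 kHz rate returned for .npy files is an assumption based on the dataset description above.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile


def load_utterance(path):
    """Return (sample_rate, samples) for a .wav or .npy file (hypothetical helper)."""
    if path.endswith(".npy"):
        # A bare .npy array carries no sample rate; 16 kHz is assumed (TIMIT's rate).
        return 16000, np.load(path)
    rate, samples = wavfile.read(path)
    return rate, samples


# Stand-in for a real utterance: 0.1 s of a 440 Hz tone stored as 16-bit integers.
rate = 16000
t = np.arange(int(0.1 * rate)) / rate
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# Write the stand-in data in both formats so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
wav_path = os.path.join(tmp_dir, "example.wav")
npy_path = os.path.join(tmp_dir, "example.npy")
wavfile.write(wav_path, rate, tone)
np.save(npy_path, tone)

wav_rate, wav_samples = load_utterance(wav_path)
npy_rate, npy_samples = load_utterance(npy_path)
print(wav_rate, wav_samples.shape, wav_samples.dtype)
```

To visualize an utterance, plotting `wav_samples` against `np.arange(len(wav_samples)) / wav_rate` with a plotting library such as matplotlib gives the waveform over time in seconds.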

## Extracting features for predicting the next acoustic sample

Now that the vectors are understandable, we need to generate the actual table that will be used for training the regression algorithm. This means having an n x (d+1) matrix of n examples of d dimensions (plus the target). To do so, let’s start from the fact that the sampling frequency is 16 kHz, and that a phoneme lasts 10-20 ms (as mentioned in class). To begin with, we could use 15 ms windows with “number of samples – 1” overlap, that is, we shift the window by one sample to extract the next example. The sample that follows each window is extracted as the target for that window. That would give:

number of samples in 1 frame = 16 000 Hz * 15 ms = 240 samples

For the first sentence of the first subject, consisting of 24679 samples, this gives a 24438 x 241 matrix. We can apply this on the 10 sentences of the first subject, and concatenate everything in one matrix. This gives a matrix of size 382 721 x 241, which can be used for training the model. The code used to extract the features can be found here.
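The windowing described above can be sketched with numpy index arithmetic. This is an illustration, not the code linked in the post: `extract_examples` is a hypothetical name, and this version produces `len(signal) - window` rows, which may differ by one from the counts quoted above depending on how the last window is handled.

```python
import numpy as np


def extract_examples(signal, window=240):
    """Build an (n, window + 1) matrix: `window` inputs followed by the target."""
    signal = np.asarray(signal)
    n_examples = len(signal) - window
    if n_examples <= 0:
        raise ValueError("signal is shorter than window + 1 samples")
    # Row i holds signal[i : i + window + 1]; the last column is the target.
    rows = np.arange(n_examples)[:, None]
    cols = np.arange(window + 1)[None, :]
    return signal[rows + cols]


# Window length from the sampling rate and the chosen window duration.
sample_rate_hz = 16000
window_ms = 15
window = sample_rate_hz * window_ms // 1000

# Tiny example with a window of 3 so the rows are easy to read.
toy_signal = np.arange(10)
toy = extract_examples(toy_signal, window=3)
print(window)
print(toy)

# Several utterances can be stacked into one training table.
utterances = [np.arange(10), np.arange(100, 108)]
table = np.vstack([extract_examples(u, window=3) for u in utterances])
```

Stacking per utterance, rather than concatenating the raw signals first, avoids windows that straddle the boundary between two sentences.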

## Building and using the Vanilla Neural Network

Next, for the implementation of the neural net, I chose to revisit the code I wrote with William in the prerequisite for this class, IFT6390. In that course, we implemented a multiclass classification net with one hidden layer, using cross-entropy as the loss function, tanh for the hidden layer and softmax for the output layer non-linearity, with L2 regularization. Going through the Python/numpy code again, I reviewed the net’s implementation and changed it to use the squared error as the loss function together with a linear output layer. The code can be seen here.
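For readers who want a concrete picture of this kind of net, here is a minimal numpy sketch of a one-hidden-layer regression MLP with a tanh hidden layer, a linear output, squared error and an L2 penalty. It is not the code linked above: the class name `RegressionMLP`, the weight initialisation and the toy data are all assumptions made for illustration.

```python
import numpy as np


class RegressionMLP:
    """One tanh hidden layer, linear output, squared error plus L2 penalty."""

    def __init__(self, n_in, n_hidden, n_out=1, l2=1e-6, lr=0.14, seed=0):
        rng = np.random.default_rng(seed)
        # Uniform initialisation scaled by fan-in (an assumed choice).
        self.W1 = rng.uniform(-1, 1, (n_in, n_hidden)) / np.sqrt(n_in)
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-1, 1, (n_hidden, n_out)) / np.sqrt(n_hidden)
        self.b2 = np.zeros(n_out)
        self.l2, self.lr = l2, lr

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        return H, H @ self.W2 + self.b2  # linear output layer

    def predict(self, X):
        return self.forward(X)[1]

    def loss(self, X, Y):
        err = self.predict(X) - Y
        penalty = self.l2 * (np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2))
        return 0.5 * np.mean(np.sum(err ** 2, axis=1)) + penalty

    def train_step(self, X, Y):
        """One gradient descent step on a minibatch."""
        n = X.shape[0]
        H, out = self.forward(X)
        d_out = (out - Y) / n  # gradient of the mean squared error term
        gW2 = H.T @ d_out + 2 * self.l2 * self.W2
        gb2 = d_out.sum(axis=0)
        d_hid = (d_out @ self.W2.T) * (1 - H ** 2)  # tanh'(a) = 1 - tanh(a)^2
        gW1 = X.T @ d_hid + 2 * self.l2 * self.W1
        gb1 = d_hid.sum(axis=0)
        self.W1 -= self.lr * gW1
        self.b1 -= self.lr * gb1
        self.W2 -= self.lr * gW2
        self.b2 -= self.lr * gb2


# Toy regression problem standing in for the real (window, next sample) table.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
Y = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]

# A smaller learning rate than the class default, chosen for this toy problem.
net = RegressionMLP(n_in=5, n_hidden=20, lr=0.05)
loss_before = net.loss(X, Y)
for epoch in range(50):
    for start in range(0, len(X), 50):
        net.train_step(X[start:start + 50], Y[start:start + 50])
loss_after = net.loss(X, Y)
print(loss_before, loss_after)
```

For the task in this post, `n_in` would be the window length and `Y` the single next sample, so the output layer has one unit.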

The code, though vectorized, is not optimized, and doesn’t use Theano yet (this is the next step). For the moment, running 100 epochs with 450 hidden units and minibatches of 200 examples takes around 45 min on my CPU. Therefore, to choose the hyperparameters, I manually tried different combinations on a subset of the training set, and picked one that led to a visible decrease in training and validation errors over the first few epochs. For the results presented here, I used:

• Regularization parameter: $\lambda = 10^{-6}$
• Learning rate: $\eta = 0.14$
• minibatch size = 200
• epochs number = 100
• number of hidden units = 450

Furthermore, through simple testing of the algorithm with a toy data set, I realized how crucial normalization is for the good behavior of the net. Inputs and targets are thus normalized column-wise by subtracting their mean and dividing by their standard deviation. To reconstruct the synthesized speech in a .wav file, the inverse transformation is applied.
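The column-wise normalization and its inverse can be sketched as follows. The function names are hypothetical, the random table is a stand-in for the real training matrix, and the guard against zero standard deviation is an added safeguard that the post does not mention.

```python
import numpy as np


def fit_standardizer(data):
    """Column-wise mean and standard deviation of a 2-D array."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns (added safeguard)
    return mean, std


def standardize(data, mean, std):
    """Subtract the column means and divide by the column standard deviations."""
    return (data - mean) / std


def unstandardize(data, mean, std):
    """Inverse transformation, e.g. to map predictions back to sample units."""
    return data * std + mean


# Stand-in for a table whose columns are the inputs followed by the target.
rng = np.random.default_rng(0)
table = rng.normal(loc=3.0, scale=500.0, size=(1000, 4))

mean, std = fit_standardizer(table)
scaled = standardize(table, mean, std)
restored = unstandardize(scaled, mean, std)
```

When mapping a predicted sample back to the original scale, only the mean and standard deviation of the target column (the last one in this layout) are needed.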

Once trained, the net is used to predict the next value of an array of 240 samples extracted from the dataset. This predicted value is then added to the original array, and the now last 240 samples are again used to predict the next value. This is repeated until the array has reached a size of 50 000 samples. Finally, the array is converted and saved in a .wav format.
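The generation loop described above can be sketched like this. `generate`, `to_int16` and `toy_predictor` are hypothetical names; the toy predictor simply repeats the sample from one window earlier and stands in for the trained net, and the 16-bit integer output format is an assumption about how the samples are stored.

```python
import numpy as np


def generate(predict_next, seed, total_length, window=240):
    """Extend `seed` one sample at a time until it holds `total_length` samples."""
    seed = np.asarray(seed, dtype=float)
    if len(seed) < window:
        raise ValueError("seed must contain at least `window` samples")
    out = np.empty(max(total_length, len(seed)))
    out[:len(seed)] = seed
    for i in range(len(seed), total_length):
        # Feed the most recent `window` samples, including earlier predictions.
        out[i] = predict_next(out[i - window:i])
    return out


def to_int16(samples):
    """Clip and cast to 16-bit integers (assumed sample format for the .wav file)."""
    return np.clip(np.round(samples), -32768, 32767).astype(np.int16)


def toy_predictor(frame):
    """Stand-in for the trained net: repeat the sample from one window earlier."""
    return frame[0]


seed = 1000.0 * np.sin(2 * np.pi * np.arange(240) / 240)
signal = generate(toy_predictor, seed, total_length=1000)
pcm = to_int16(signal)
```

With a real model, `predict_next` would normalize the frame, call the net, and undo the normalization on the output. The resulting `pcm` array could then be saved with `scipy.io.wavfile.write(path, 16000, pcm)`.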

## Results and discussion

The training of the neural network yielded the following error curves over 100 epochs:

We see that 100 epochs were not enough to produce a minimum in the validation error curve that could lead to early stopping; the optimal number of epochs has not been reached yet.

The synthesized speech waveform is presented in the following graphs:

We see that it looks like a combination of sine waves that stabilizes after approximately 6000 samples. Listening to the .wav file, it’s actually a minor seventh (something like C4+Bb4). The reason behind this specific output is not clear to me right now. It may have something to do with the window length used to extract the features (a wave with a period of 240 samples at 16 kHz is pretty close to a C2…). This result is obviously very far from the expected output of speech.
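The arithmetic behind the C2 remark can be checked with a few lines of Python. This assumes standard equal temperament with A4 = 440 Hz and the MIDI convention C4 = 60; the helper names are made up for the example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(freq_hz):
    """Nearest equal-tempered note name, assuming A4 = 440 Hz and C4 = MIDI 60."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def note_frequency(midi):
    """Frequency in Hz of a MIDI note number under the same assumptions."""
    return 440.0 * 2 ** ((midi - 69) / 12)


window_freq = 16000 / 240             # one period per 240-sample window, in Hz
c2 = note_frequency(36)               # MIDI note 36 is C2
minor_seventh_ratio = 2 ** (10 / 12)  # ten semitones, e.g. C4 to Bb4
print(round(window_freq, 2), nearest_note(window_freq), round(c2, 2))
```

A wave with a 240-sample period sits at about 66.7 Hz, roughly a third of a semitone above C2 (about 65.4 Hz). The notes heard in the output are two octaves higher, so any link to the window length would have to go through harmonics of that frequency, which remains speculation at this point.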

In any case, there are still a lot of things I want to work on with this specific model. First, I started “Theanizing” the code, which should make it much quicker to run. This will enable the use of a hyperparameter optimization strategy such as grid or random search. It will also allow training over a larger number of epochs (as will running on the LISA cluster instead of my machine). Second, Pylearn2’s MLP model could be used for comparison purposes. Finally, feature extraction parameters could be varied: increasing or decreasing the size of the extracted windows, using data from more than one subject (or just trying with another subject’s data), predicting more than only the next sample, etc. Maybe this can shed light on a potential link between the frequencies in the synthesized signal and the features’ length.

I shall go over this in the coming days.

## 15 thoughts on “Speech synthesis project description and first attempt at a regression MLP”

1. Hi Hubert,
this is actually very similar to what I got with a multilayer net with 2 sigmoid layers as hidden layers and a linear output layer. My approach uses the Pylearn2 library (which uses Theano) and the code can be found here: https://github.com/jfsantos/ift6266h14/blob/master/exp_next_sample.py (this is the code I use to generate speech from the learned model: https://github.com/jfsantos/ift6266h14/blob/master/gen_babble.py). Did you use a single frame as the first input when using the model as a generator?

2. We talked after class today about normalization and that it probably makes sense to assume that samples already have mean zero, and to normalize each sample by the same value (instead of normalizing each column of the matrix X separately). We also talked about whether simply normalizing by 2^15, to put the samples in the range -1 to 1, would be good enough, rather than normalizing by the empirical standard deviation.

Just wanted to record that I tried normalizing by 2^15 and this works much worse than normalizing by ~560, which is roughly the empirical standard deviation. With the latter I’m now able to reproduce your results, but with the former the learning procedure behaves weirdly.

So indeed, the right normalization is important.

3. Hi Hubert, I just wanted to ask where you got the 240 test samples from. They are probably from one of the validation or test files, but note that these come from different speakers, because the train/valid/test split is based on speakers as far as I know, so you are training on one specific speaker and then testing on another. You might try training on different speakers, and probably including speaker information in your input.