Results with ReLUs and different subjects

I ran some more experiments based on my previous post:

  1. Using ReLUs on the hidden layers
  2. Using more hidden units in the NSNN
  3. Training/generating with other speakers

1. Using ReLUs on the hidden layers

I changed the code from my last post to use rectified linear units instead of tanh units for the hidden layers. I manually tuned the learning rate by running many instances of the model with different learning rates over 12 epochs, and kept the one yielding the most consistent drop in validation error. The model was then launched on FCJF0’s data, with:

  • NSNN: 2 hidden layers (200 and 150 units)
  • WNN: 1 hidden layer (250 units)
  • Learning rate: starts at 0.0025, then multiplied by 0.7 every time the validation error stalls for a few epochs
  • No weight decay
  • Minibatch size: 1
  • Number of epochs: 500
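
The learning rate rule in the list above can be sketched as follows. This is only an illustration: the class name StallDecay and the patience of 3 epochs are assumptions, since "a few epochs" is not specified more precisely here.

```python
class StallDecay:
    """Multiply the learning rate by `factor` when the validation error stalls."""

    def __init__(self, lr=0.0025, factor=0.7, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value for "a few epochs"
        self.best = float("inf")
        self.stalled = 0

    def step(self, valid_error):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if valid_error < self.best:
            self.best = valid_error
            self.stalled = 0
        else:
            self.stalled += 1
            if self.stalled >= self.patience:
                self.lr *= self.factor
                self.stalled = 0
        return self.lr


schedule = StallDecay()
# Made-up validation errors: three epochs without improvement trigger one decay.
for err in [10.0, 9.0, 9.5, 9.2, 9.1, 8.0]:
    lr = schedule.step(err)
```

With these made-up errors the rate is decayed once, after the third epoch without a new best validation error.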

When generating speech, using a multiplicative factor of the standard deviation smaller than 2 (for sampling from the NSNN) typically yields audio clips that exponentially increase and explode. For example, at epoch 45, using a factor of 1, the model generated this:


This effect was found to disappear after around 75 epochs.

The training procedure stopped improving the training and validation errors around epoch 168 and was thus killed at that point (as compared to 366 epochs for tanh units). The best-sounding generated clip was found at epoch 165, with a factor of 2 (training error: 607 (0.0022), validation error: 1460 (0.0054)). It tries to replicate the sentence “The emperor had a mean temper.”, as in the previous posts.


There are some “contained” bursts, but the prosody and some sounds actually match the original sentence. This shows that using ReLUs can yield similar results to tanh units with the exact same architecture. The ReLU model converged faster, but this may simply be due to the fine-tuning of its learning rate, so the effect of the activation function on convergence speed cannot be assessed with this experiment.

2. Using more hidden units in the NSNN

A second experiment using ReLUs was to increase the number of hidden units on each hidden layer. I chose to only increase the number of hidden units of the NSNN, for a start. The model was therefore almost the same as above:

  • NSNN: 2 hidden layers (250 and 200 units)
  • WNN: 1 hidden layer (250 units)
  • Learning rate: starts at 0.00255, then multiplied by 0.7 every time the validation error stalls for a few epochs
  • No weight decay
  • Minibatch size: 1
  • Number of epochs: 500

The same burst effect as above was observed for audio clips generated after fewer than 60 epochs and for factors smaller than 2. The training procedure stopped improving the training and validation errors around epoch 94. The best-sounding generated clip was found at epoch 90, with a factor of 2 (training error: 566 (0.0021), validation error: 1459 (0.0054)):


Similarly to the above result, the prosody and some of the right sounds can be heard, though it is very noisy. The signal is more stable as well.

This experiment is not enough to prove the usefulness of adding hidden units in this context, but shows that similar results can be obtained with a bigger version of the model. Perhaps the number of hidden units that were added was insignificant compared to what would make a discernible difference, or maybe the model would have benefited more from an added hidden layer. Moreover, adding more hidden units (or layers) to the WNN could also have impacted those results.

3. Training/generating with other speakers

As was suggested by Joao, I aimed at training the previous models with data from more than one speaker. For this, I used this script that extracts acoustic samples and phone information from different speakers, and saves everything in one file that is later loaded by the models. For the data set to be loaded in memory at once, and for the training time to be reasonable with stochastic gradient descent, data was extracted from two female subjects of dialect 1 (FCJF0 and FVFB0) only.

I thus used the exact same model as in part 1 of the current post, but with this new data set. For a change, the original sentence to replicate is now “We’ll serve rhubarb pie after Rachel’s talk.”:


The model stopped improving around epoch 150, at which point the training error was 1675 (0.0062), and the validation error 4117 (0.015). The generation with a factor of 2 then yielded this:


As before, the speech is very noisy, and some remnants of the original sentence’s rhythm and sounds can be heard. As opposed to previous experiments on subject FCJF0 only (especially in the previous post with tanh units), though, the phones are not as clear and the original speakers’ voices are less recognizable.

Another interesting point is the “burst” sound, which still appears with factors smaller than 2, and which this time continues to appear even after 150 epochs. This effect thus probably arises with ReLUs when the network still hasn’t seen enough examples and is still a poor fit to the data set’s distribution. In practical speech synthesis applications, this might be a problem since a lot of noise (multiplicative factor greater than or equal to 2) needs to be added in order to avoid such bursts.

Finally, the same experiment with 2 speakers was replicated with the previous model (tanh units). The initial learning rate was chosen to be 0.001. The training is still ongoing as I’m writing, but here are the latest results I got, from epoch 120:


The training and validation errors are still improving (they are now at 1842 (0.0059) and 3792 (0.0121)), but already at this point we can hear some parts of the original sentence. The voice itself sounds a lot more artificial and noisy than previously seen with completed trainings.


 Conclusion of the project

My work on this speech synthesis project mainly concerned the use of MLPs on one speaker. I started with an implementation in Python/Numpy of a simple neural network predicting the next sample from the previous 240 samples. I later switched to Theano, which provides more flexibility and power, as well as a better chance at understanding implementation details. Then, a modified neural network architecture with 2 MLPs was implemented, with multiple hidden layers and dynamic learning rate, and using the previous, current and next phones for predicting the next acoustic sample.

The very first synthesis experiments only yielded sine waves that didn’t seem connected to the input data. By adding phone information, and after tuning the learning rate and modifying the speech generation by sampling from the network as from a Gaussian, better results (unintelligible and very noisy babbles) were obtained with the second architecture (see Table 1 for a summary of those experiments). As noted by Joao, using different multiplicative factors in front of the standard deviation sometimes yielded more structured results. Finally, the second architecture was improved with a unified implementation under a single Theano graph, and the now faster model could be used with different configurations: tanh units, ReLUs, different numbers of hidden units, training on one or two speakers. The best results were obtained with tanh units on one speaker, and sound similar to the desired sentence in terms of rhythm and sound. This architecture thus seems like a promising one for speech synthesis.

Table 1 - Summary of experiments with second architecture


Future work should more thoroughly ascertain the effect of different hyperparameters of the model, namely the weight decay (as well as the type of regularization – L1 or L2), the number of hidden units and layers (and to which part of the model they should be added), and the learning rate. There’s probably a way to get even more out of the model that gave the best results by tuning those hyperparameters correctly. An implementation using Theano’s “scan” function might also come in handy to train the model efficiently on more than two subjects (that would also necessitate on-the-fly extraction of features, as in Vincent’s TIMIT class). Adding depth to this approach and combining it with unsupervised pretraining might eventually bring us closer to realistic and flexible human speech synthesis.


Results with 2-layer MLP on 1 speaker

I used an improved implementation of the previous unified architecture to generate speech on subject FCJF0. The first 9 utterances were used for train/valid/test, and the 10th utterance was kept for generation. The hyperparameters were:

  • 2 hidden layers (200 and 150 units) for the network outputting the next sample (NSNN)
  • 1 hidden layer (250 units) for the network outputting the biases of NSNN (WNN)
  • Learning rate: starts at 0.001, then is divided by 2 every time the validation error stalls for a few epochs
  • No weight decay
  • Minibatch size: 1
  • Number of epochs: 500

Here are the training curves:


The training stopped improving the validation error at epoch 366, and started overfitting at that point. The testing error was then 1282.9 unnormalized (0.0041 normalized).
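
The two error figures are related by a simple rescaling if the targets were standardized (mean subtracted, divided by their standard deviation) before training, which is an assumption here: the unnormalized MSE is then the normalized MSE times the variance of the raw targets. A minimal sketch with made-up numbers, where the helper names mse and denormalize_mse are hypothetical:

```python
import statistics


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def denormalize_mse(mse_normalized, target_std):
    """Rescale an MSE computed on standardized targets back to raw units."""
    return mse_normalized * target_std ** 2


# Made-up raw targets and standardized predictions, for illustration only.
raw_targets = [120.0, -340.0, 560.0, -80.0, 300.0]
mean = statistics.mean(raw_targets)
std = statistics.pstdev(raw_targets)

norm_targets = [(t - mean) / std for t in raw_targets]
norm_preds = [t + 0.05 for t in norm_targets]      # pretend network outputs
raw_preds = [p * std + mean for p in norm_preds]   # inverse transform

mse_norm = mse(norm_preds, norm_targets)
mse_raw = mse(raw_preds, raw_targets)
```

Here denormalize_mse(mse_norm, std) agrees with mse_raw up to floating-point rounding.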

The best-sounding clips were obtained with a factor of 1 for the standard deviation used in sampling from the network. The clips with a factor of 5, though very noisy, also gave interesting results.

Here is the original clip (the classic sentence, “The emperor had a mean temper.”):


And here is the generation with a factor of 1, using the sequence of phones from that previous sentence:


And with a factor of 5 (more like a whisper):

The generation at epoch 330 sounded a bit better:


The results are much closer to speech than in my previous posts. We can now actually hear some kind of prosody, and with some imagination we can even hear the original words.

For the remaining time, I want to try using ReLU activations on the hidden layers, increase the capacity by adding more hidden units in the NSNN, and finally train on other speakers.

Second MLP architecture for next sample prediction – Part III

This time, I used the code of my previous post to perform a hyperparameter search on a very restricted set of possible values. I varied the learning rate and the number of hidden units of both networks (WNN, the network that outputs the biases of NSNN given the phone information; NSNN, the network that outputs the next sample given the current samples), on 30 epochs. I used the first 9 utterances of subject FCJF0 for training/validation/test, and the 10th utterance for reconstruction. See Thomas’s post for a comparison of results on FCJF0.

I found out that the learning rates I was using were too high. The optimal value seemed to be around 0.0025 (instead of 0.01) for the architecture of interest. Furthermore, increasing the number of hidden units dramatically increases the training time, so a reasonable configuration yielding good results was 250 (WNN) and 200 (NSNN) hidden units.

Additionally, as compared to the previous post, the training curves are much smoother using this learning rate. The best test error (mean-squared error) for this run was 2809.36 unnormalized, or 0.00896 normalized.
I ran the training a second time using this optimal configuration:

  • Learning rate: 0.0025
  • Number of hidden units: 250 (WNN), 200 (NSNN)
  • Weight decay: 0

The training goes on until the validation error stops decreasing over 3 epochs. Moreover, speech generation is attempted every 15 epochs. As was done by João, different multiplicative factors were used to modify the amount of noise when sampling from the output distribution: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10].
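
Sampling with a multiplicative factor can be sketched as below. The noise scale sigma and the predicted mean are placeholder values; how the standard deviation was actually estimated is not stated here, so treat both as assumptions.

```python
import random
import statistics

FACTORS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]


def sample_next(mean, sigma, factor, rng):
    """Draw the next sample from a Gaussian centred on the network output."""
    return rng.gauss(mean, factor * sigma)


rng = random.Random(0)   # seeded for reproducibility
predicted_mean = 0.2     # hypothetical network output (normalized units)
sigma = 0.08             # hypothetical base standard deviation

# Empirical spread of the draws for each factor.
spread = {}
for factor in FACTORS:
    draws = [sample_next(predicted_mean, sigma, factor, rng) for _ in range(2000)]
    spread[factor] = statistics.pstdev(draws)
```

The spread of the draws grows in proportion to the factor, so small factors stay close to the deterministic prediction while large ones are dominated by noise.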


The training went on for 260 epochs, and the best test error was achieved at epoch 257 (2042.68 unnormalized, 0.00651 normalized).
Here is the original audio clip:


Here are the audio clips generated at epoch 257:









A factor of 2 produces a very short sound somehow close to speech (and a factor of 5 creates cricket sounds…).

In general, throughout training, a multiplicative factor of 2 produced the best results. Listening to the clips produced at epoch 240, I found one that sounded as if a word was being said:


I’m still working on a better implementation of this architecture with Theano. Using one graph for the model should speed up training a lot, and allow the use of more hidden units.

Second MLP architecture for next sample prediction – Part II

In my last post, I explained the concept of another neural network architecture that could be used to tackle our speech generation problem. This architecture is composed of two neural networks: the first one (NSNN) generates the next acoustic sample from a window of previous samples, while the second one (WNN) generates the parameters of the first neural network.

In class, Prof. Bengio mentioned a few things that we should consider when working on such an architecture:

  • Instead of outputting every parameter of the NSNN, the WNN could output only the biases. This dramatically reduces the size of the network, and thus the training time.
  • Instead of performing purely stochastic gradient descent, a faster way would be to proceed in a hybrid-minibatch-like mode. For this, examples with the same previous/current/next phones configuration have to be grouped in the dataset. It is then possible to train the model in two steps: first, backpropagate on NSNN with many examples of the same phone configuration in a minibatch; second, use this single new update of NSNN’s parameters to backpropagate on WNN.
  • We must draw our predicted values by adding Gaussian noise to the values outputted by the model (see Laurent’s post and William’s quick implementation).
  • Other suggestions included: using rectified linear activations and adding more hidden layers, not using Pickle for saving/loading data and denormalizing the MSE when reporting errors.
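
The grouping required by the hybrid-minibatch idea in the list above can be sketched as follows. The tuple layout of an example and the function name group_by_phone_context are assumptions made for illustration, and the phone symbols are only sample values.

```python
from collections import defaultdict


def group_by_phone_context(examples):
    """Group (window, target) pairs by their (previous, current, next) phones."""
    groups = defaultdict(list)
    for window, target, context in examples:
        groups[context].append((window, target))
    return groups


# Toy examples: (window of samples, next sample, phone context).
examples = [
    ([0.1, 0.2], 0.3, ("h#", "sh", "iy")),
    ([0.2, 0.3], 0.4, ("h#", "sh", "iy")),
    ([0.5, 0.4], 0.2, ("sh", "iy", "hv")),
]
minibatches = group_by_phone_context(examples)
```

Each value of minibatches could then serve as one NSNN minibatch, followed by a single WNN update for that phone configuration.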

Following those recommendations, the WNN now outputs biases only and predicted acoustic samples are drawn by adding Gaussian noise. The hybrid-minibatch gradient descent wasn’t implemented as I foresaw problems with dataset organization and minibatch sizes (this could probably be avoided with smart feature extraction on the complete dataset).

In this post, I will thus present the implementation I converged to, and the results I got by training this architecture a few times.

Theano Implementation of the Second Architecture
It was hinted at during class that training an architecture where one model outputs the parameters of a second one could be implemented using one single Theano graph. However, I couldn’t figure out what would be the way of doing this; I thus went for something simpler, probably slower, but that makes sense to me.

As can be seen here, I used three main Theano functions for training the whole model: train_model_NSNN(), train_model_WNN() and update_params_train_NSNN(). The first one is essentially the same as the training function in the original Theano tutorial. The second one is also very similar, except that it directly takes the targets it needs for backpropagation by concatenating the reshaped biases of the NSNN. The third one is simply updating the biases of the NSNN by replacing them with the values obtained from forward-propagating in the WNN. The rest is mainly a rehash of previous ideas for a vanilla neural network.

Training procedure
I’m still working with data from one subject (it’s time I move on, I know!). That’s mainly because I really wanted to get to know Theano better instead of diving into Pylearn2 for that model (and also because other people’s work on a TIMIT wrapper with on-the-fly extraction of data has been for Pylearn2 only). I thus rely on a short class I wrote a few weeks ago to extract windows of samples and phone information before building the model.

In the case of the following experiment, I used data from subject MCPM0 only. The first 9 utterances are used for training, validation and testing, and the 10th utterance is used for the reconstruction/generation part.

Before tuning the hyperparameters, I realized how slow training in a stochastic fashion really is (at least with my implementation, on an i7-3770 CPU). This limited the number of hidden units to use for both networks, and made the configuration where the WNN outputs weights in addition to biases obviously impossible to train in a few hours.

Thus, the number of hidden units was chosen so the model could be run within around one hour.

  • WNN: 250 hidden units
  • NSNN: 150 hidden units

Different learning rates were manually tried until it was found that the training error decreased without diverging in the first few epochs. This gave a learning rate of 0.01 (used for both networks).

  • Learning rate: 0.01

Finally, only L2 regularization was used, and was left untouched at 0.0001.

  • L2 weight decay: 0.0001

The training was performed over 30 epochs.
As for generation, as João did, I used a primer sequence of zeros. I then constructed on top of that using the phone sequence of the 10th utterance for that subject.

This is what I got from training over 30 epochs, with 150 NSNN hidden units and 250 WNN hidden units:

The best validation error was 11386 (denormalized), for which the test error was 22583. Speech generation gave noise, with no apparent structure that would suggest it is speech.

This experiment still wasn’t close to producing something similar to speech. As mentioned by David Belius and João, the scale of noise in the reconstruction might have to be empirically adjusted for better results. For João, a multiplicative factor of 5 gave something close to speech. Again, this result might also come from the fact that he trained his network with way more data than me.

To improve my results, I need to find a way to accelerate training, especially if I am to train on the complete set or try different numbers of hidden layers/units per layer. Rectified linear units might improve the convergence of the training procedure, and so might the hybrid-minibatch scheme.

My goal is to spend a bit more time trying to do that, and then I’ll switch to Pylearn2 for the end of the project, to get a sense of what I’ve missed so far.

Second MLP architecture for next sample prediction

Over the last few days, I’ve worked towards an implementation of a second neural network architecture, based on a suggestion Prof. Bengio made in class. This model consists of two networks, with the peculiar feature that the first network outputs the weights of the second network. This model also makes use of information on phones, which was not covered in my first post. Here, as a first post on this topic, I describe my understanding of the model (the second post will come as soon as I finally get my code to work properly!).

The model

For this experiment, we need to implement two neural networks. The first one (which I call “Weights NN” – WNN) will receive the following inputs: one-hot encoding of previous, current and next phones, and the duration of the current phone; it will then output the weights and biases for the second network. The second network (“Next-sample NN” – NSNN), on the other hand, will be exactly like the model I trained for my first post, except for the fact that its parameters are actually the output of the first network. It will nonetheless receive a window of samples as input, and output the predicted next sample. Here is a visual description of what I have in mind:


Thus, to train this model, we need data points containing information on both phones and samples. For the samples, it is the same as what was done before, that is, windows of N samples (let’s say 240) extracted with a certain overlap, and their next sample (the 241st) used as the target. The phone information requires the extraction of phones, which can be found in the .phn files of the TIMIT dataset. Those files actually contain more than phones, since they also have a few symbols that represent pauses, silences and start/end of an utterance. In total, there are 61 such symbols. One way to extract them would be to set the current phone as the most frequent phone in the current window of N samples (that is, its mode). We can do the same thing for the previous and next phones and set them to be the mode of the previous or next windows of length N. Then, we can encode those values in a one-hot representation of length 61.
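
The mode-and-one-hot encoding could be sketched like this. It assumes the .phn annotations have already been expanded into one integer symbol index (0 to 60) per sample; the function names are hypothetical.

```python
from collections import Counter

N_SYMBOLS = 61  # phones plus pause/silence/boundary symbols


def window_mode(labels):
    """Most frequent symbol index in a window of per-sample labels."""
    return Counter(labels).most_common(1)[0][0]


def one_hot(index, size=N_SYMBOLS):
    """One-hot list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_context(prev_labels, cur_labels, next_labels):
    """Concatenate one-hot codes of the previous, current and next phones."""
    vec = []
    for labels in (prev_labels, cur_labels, next_labels):
        vec.extend(one_hot(window_mode(labels)))
    return vec


# Toy windows of per-sample symbol indices (real windows would hold N = 240).
prev_w = [3, 3, 3, 7]
cur_w = [7, 7, 7, 7]
next_w = [7, 12, 12, 12]
features = encode_context(prev_w, cur_w, next_w)
```

The resulting vector has 3 * 61 = 183 entries with exactly three ones; the phone duration would be appended as one extra input.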

The WNN will receive this “phone list” as input, and will output a specific set of parameters for every sample to be predicted. Then, the NSNN, with a particular weight configuration for every sample to predict, will receive the current window of samples as input and will output the next sample. This should hopefully be closer to speech than the stationary sine waves from last time!


We could use a slightly modified back propagation algorithm for this model. It would work in a loop of three steps:

  1. Perform forward propagation on the WNN, with previous/current/next phones and current phone duration as inputs;
  2. Perform back propagation on the NSNN using the output of WNN as weights, the window of current samples as input and the actual next sample as target;
  3. Perform back propagation on the WNN with the weights updated in step 2 as targets.

This way, we can update the WNN parameters despite the lack of related targets in the dataset.
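
To make the three steps concrete, here is a deliberately tiny toy version, not the actual model. The NSNN is reduced to a single linear unit with a fixed weight, the WNN is a linear map from a one-hot phone to the NSNN bias, and all constants are made up. It only illustrates how the bias updated in step 2 serves as the target in step 3.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


N_PHONES = 3
TRUE_BIAS = [0.5, -1.0, 2.0]  # per-phone offsets the toy WNN should recover
W_NSNN = 0.8                  # NSNN weight, held fixed in this toy
LR_NSNN = 0.5
LR_WNN = 0.5

# Toy data: (phone index, input sample, target next sample).
data = [(p, x, W_NSNN * x + TRUE_BIAS[p])
        for p in range(N_PHONES) for x in (-1.0, 0.0, 1.0)]

wnn_weights = [0.0] * N_PHONES

for epoch in range(50):
    for phone, x, target in data:
        phone_vec = one_hot(phone, N_PHONES)
        # Step 1: forward propagation on the WNN gives the NSNN bias.
        bias = dot(wnn_weights, phone_vec)
        # Step 2: one gradient step on the NSNN bias for 0.5 * (pred - target)^2.
        error = (W_NSNN * x + bias) - target
        updated_bias = bias - LR_NSNN * error
        # Step 3: gradient step on the WNN, using the updated bias as its target.
        wnn_error = bias - updated_bias
        wnn_weights = [w - LR_WNN * wnn_error * p
                       for w, p in zip(wnn_weights, phone_vec)]
```

On this noise-free toy data the WNN weights converge to the per-phone offsets, which shows that the indirect targets carry the right training signal even though the dataset contains no targets for the WNN itself.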

As a side note, I think for the moment that this procedure would mainly make sense with a stochastic gradient descent. Indeed, using a minibatch of size K on the NSNN would only produce one updated set of parameters for a total of K data points; thus we would get only one opportunity of applying back propagation to the WNN in K data points.

Validation/Testing/Speech generation

Similarly to training, validation or testing can be done by forward propagation using a new window of samples and its corresponding set of previous/current/next phones and phone duration. However, since only the next sample targets are available, the cost function could be assessed on the NSNN only.

Speech generation will work as validation/training, but would only necessitate a “phones list” and some initial values for the NSNN input. This phones list might be the hand-engineered sequence of phones we would like to hear, or a sequence extracted from a specific utterance of the dataset. As for the initial window of samples, this may be either random inputs, or a “primer” from a real utterance. Once the next sample is predicted, it can be concatenated with the original primer before going through the process all over again.


My goal here was originally to post a description of the model and some results at the same time. To do so, I started modifying William’s implementation of a vanilla MLP, itself based on this Theano tutorial. However, I’m still not as confident as I would like to be with Theano, and I haven’t been able to run my implementation so far. I suspect this should be a matter of a few hours.

In any case, the model I’m trying to train uses tanh activation for the hidden units of both networks, and mean squared error as cost. The NSNN should still receive around 240 samples as input (~15-ms frames), but this time should have a much smaller number of hidden units, like 10 (as opposed to 450 in my last post). This gives a total of 2421 parameters for the NSNN. For its part, the WNN should receive 184 inputs (61*3 + 1) and, as dictated by the NSNN, output 2421 real values.
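
The parameter counts above can be checked with a few lines. The helper mlp_param_count is hypothetical and simply counts weights and biases of fully connected layers.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# NSNN: 240 inputs, 10 hidden units, 1 output.
nsnn_params = mlp_param_count([240, 10, 1])

# WNN inputs: three one-hot phone codes of length 61, plus the phone duration.
wnn_inputs = 61 * 3 + 1
```

This gives 240 * 10 + 10 parameters for the hidden layer and 10 + 1 for the output layer, matching the 2421 values the WNN has to output.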

Furthermore, I’m still trying to train my model with data points extracted from one subject’s set of 10 utterances. This way of extracting and pickling data will soon turn out to be impractical when training has to be performed on more than one or two subjects. For this reason, work on “on-the-fly” minibatch like what was done by Vincent and Laurent is very important and I hope to integrate such a feature shortly.

As soon as my implementation works, I will push the code on github and post my results.

Speech synthesis project description and first attempt at a regression MLP

Today, for my first blog post, I’ll introduce the Speech Synthesis project I’ll be working on, as well as my first attempts at a model using a vanilla feedforward neural net.

The project: This semester’s project consists in using TIMIT, a well-known speech dataset, to produce a model of speech synthesis. This dataset essentially contains sound (.wav) and transcription files (.txt). For example, a model could, at a basic level, be trained with a sequence of phonemes as input and sound waves as target; and then eventually generate sound samples given a new list of phonemes as input. Information on the speaker could be introduced in the model as well, such as gender, age and dialect, to produce specific voices or reproduce those of some of the dataset speakers. Using the deep learning algorithms discussed in class, attempting to create this sort of speech generative model will thus be at the core of this project.

The dataset: The TIMIT dataset, published in 1993, is composed of recordings from 630 speakers speaking 8 different American English dialects. Each speaker read 10 sentences, 2 of which are the same for every speaker. It is important to note that the number of speakers for each dialect varies, and that approximately 70% of the speakers are male. Each sentence is documented by 4 files: a .wav file containing the speech waveform (sampled at 16 kHz), and 3 .txt transcription files (orthographic, word and phonetic transcriptions). The following table, taken from the TIMIT 1993 release document, summarizes the speaker statistics.

Speaker statistics

First experiment: Predicting the next acoustic sample

For my first experiment, and as suggested by Prof. Bengio, I will aim at training a feedforward neural network to predict the next acoustic sample based on a fixed window of previous samples. In this post I will go through all the steps I had to take to get there: obtaining the acoustic sample data in a Python-readable format, extracting the features for the considered task, and building and using the NN model. I will conclude by discussing the results.

Obtaining acoustic sample data in readable format for Python

Laurent has already figured out what to do and explains the process in his blog. I’ll recapitulate briefly here to document my first experiment.

Apparently, the preprocessed data that we had at first has gone through a frequency transform called MFCC. This is problematic because this type of data is not easily transformable back into sound samples. For this reason, it was said in class (Jan 30) that we should not use this format for speech synthesis.

Moreover, the version of TIMIT with sound samples only, in a NIST .wav file, apparently isn’t straightforwardly readable by standard audio players.

Laurent suggests the use of the Linux “sox” command to convert that format into a readable .wav format. Then, he notes that the package can be used to convert a data vector into a .wav file. This data has already been converted and is available on the LISA server. There are 2 file types which will be sufficient for my first exploratory experiment: readable .wav files (.wav) and corresponding numpy arrays (.npy).

I’ll begin my experiment with the very first subject of the first dialect, in /DR1/FCJF0. Importing and visualizing the first utterance is straightforward.

Waveform for FCJF0-SA1

Extracting features for predicting the next acoustic sample

Now that the vectors are understandable, we need to generate the actual table that will be used for training the regression algorithm. This means having an (n x (d+1)) matrix of n examples of d dimensions (+ the target). To do so, let’s first start with the fact that the sampling frequency is 16 kHz, and that a phoneme is 10-20 ms (as mentioned in class). To begin with, we could use 15 ms windows with “number of samples – 1” overlap, that is, we shift the window by one sample to extract the next example. The following sample is also extracted as the target for that window. That would give:

number of samples in 1 frame = 16 000 Hz * 15 ms = 240 samples

For the first sentence of the first subject, consisting of 24679 samples, this gives a 24438 x 241 matrix. We can apply this on the 10 sentences of the first subject, and concatenate everything in one matrix. This gives a matrix of size 382 721 x 241, which can be used for training the model. The code used to extract the features can be found here.
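
A minimal pure-Python sketch of this extraction is given below; it is not the linked code, and the function name extract_examples is an assumption. Note that the exact row count depends on how the final window is handled: this version produces len(signal) - window rows.

```python
SAMPLE_RATE_HZ = 16000
WINDOW_MS = 15
WINDOW = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # 240 samples per frame


def extract_examples(signal, window=WINDOW):
    """Slide a window one sample at a time; append the next sample as target."""
    rows = []
    for start in range(len(signal) - window):
        features = list(signal[start:start + window])
        target = signal[start + window]
        rows.append(features + [target])
    return rows


# Toy signal and a short window so the result is easy to inspect.
toy_signal = list(range(10))
rows = extract_examples(toy_signal, window=4)
```

Each row holds the window followed by its target, so with the real window of 240 samples every row has 241 columns.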

Building and using the Vanilla Neural Network

Next, for the implementation of the neural net, I chose to revisit the code I wrote with William in the prerequisite for this class, IFT6390. In that course, we implemented a multiclass classification net with one hidden layer, using cross-entropy as loss function, tanh for the hidden layer and softmax for the output layer non-linearity, with L2 regularization. Going through the python/numpy code again, I reviewed the net’s implementation, and made changes to use the squared error as the loss function as well as a linear output layer. The code can be seen here.

The code, though vectorized, is not optimized, and doesn’t use Theano yet (this is the next step). For the moment, running 100 epochs with 450 hidden units and minibatches of 200 examples takes around 45 min on my CPU. Therefore, to choose the hyperparameters, I manually tried different combinations on a subset of the training set, and picked one that led to a visible decrease in training and validation errors over the first few epochs. For the results presented here, I used:

  • Regularization parameter: \lambda = 10^{-6}
  • Learning rate: \eta = 0.14
  • minibatch size = 200
  • epochs number = 100
  • number of hidden units = 450

Furthermore, through simple testing of the algorithm with a toy data set, I realized how crucial normalization is for the good behavior of the net. Inputs and targets are thus normalized column-wise by subtracting their mean and dividing by their standard deviation. For reconstruction of the synthesized speech in a .wav file, the inverse transformation is applied.
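
The column-wise normalization and its inverse could look like the sketch below. It uses the population standard deviation, which is an assumption, and the function names are hypothetical. A constant column would need a guard against division by zero.

```python
import statistics


def fit_columns(rows):
    """Per-column mean and (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return means, stds


def normalize(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def denormalize(rows, means, stds):
    """Inverse transform, as used before writing samples back to a .wav file."""
    return [[v * s + m for v, m, s in zip(row, means, stds)] for row in rows]


# Toy matrix with two columns.
rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
means, stds = fit_columns(rows)
normed = normalize(rows, means, stds)
restored = denormalize(normed, means, stds)
```

The statistics should be computed on the training set only and reused unchanged for validation, test and generation.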

Once trained, the net is used to predict the next value of an array of 240 samples extracted from the dataset. This predicted value is then added to the original array, and the now last 240 samples are again used to predict the next value. This is repeated until the array has reached a size of 50 000 samples. Finally, the array is converted and saved in a .wav format.
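
This generation loop can be sketched as follows, with a stand-in predictor in place of the trained net. Both generate and toy_predictor are hypothetical names, and the toy predictor is only there to make the example runnable.

```python
def generate(predict_next, primer, n_total, window=240):
    """Extend `primer` one sample at a time until it holds `n_total` samples."""
    samples = list(primer)
    while len(samples) < n_total:
        # Only the most recent `window` samples are fed to the predictor.
        samples.append(predict_next(samples[-window:]))
    return samples


def toy_predictor(window):
    """Stand-in for the trained net: average of the last two samples."""
    return 0.5 * (window[-1] + window[-2])


out = generate(toy_predictor, primer=[0.0, 1.0, 0.0, 1.0], n_total=10, window=4)
```

In the real setting, predict_next would normalize the window, run the network's forward pass and denormalize the output, and n_total would be 50 000.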

Results and discussion

The training of the neural network yielded the following error curves over 100 epochs:

Training curves

We see that 100 epochs were not enough to produce a minimum in the validation error curve that could lead to early stopping; the optimal number of epochs has not been reached yet.

The synthesized speech waveform is presented in the following graphs:

Predicted output waveform

Predicted output waveform (zoomed in)

We see that it looks like a combination of sine waves that stabilizes after approximately 6000 samples. Listening to the .wav file, it’s actually a minor seventh (something like C4+Bb4). The reason behind this specific output is not clear to me right now. It may have something to do with the window length used to extract the features (a wave with a period of 240 samples at 16 kHz is pretty close to a C2…). This result is obviously very far from the expected output of speech.
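
The pitch arithmetic can be checked quickly, assuming twelve-tone equal temperament with A4 = 440 Hz and the usual MIDI numbering (C2 = 36, C4 = 60, Bb4 = 70). The function midi_to_hz is a hypothetical helper.

```python
import math


def midi_to_hz(midi_note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


# Frequency of a wave whose period is one 240-sample window at 16 kHz.
window_hz = 16000 / 240

c2_hz = midi_to_hz(36)
cents_off = 1200 * math.log2(window_hz / c2_hz)

# Frequency ratio of the minor seventh C4 to Bb4.
seventh_ratio = midi_to_hz(70) / midi_to_hz(60)
```

The window frequency comes out at about 66.7 Hz against roughly 65.4 Hz for C2, that is, about a third of a semitone sharp.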

In any case, there are still a lot of things I want to work on with this specific model. First, I started “Theanizing” the code, which should make it much quicker to run. This will enable the use of a hyperparameter optimization strategy such as grid or random search. It will also allow training over a larger number of epochs (as will running on the LISA cluster instead of my machine). Second, Pylearn2’s MLP model could be used for comparison purposes. Finally, feature extraction parameters could be varied: increasing or decreasing the size of the extracted windows, using data from more than one subject (or just trying with another subject’s data), predicting more than only the next sample, etc. Maybe this can shed light on a potential link between the frequencies in the synthesized signal and the features’ length.

I shall go over this in the next days.