Towards Utterance Copy via Deep NeuroEvolution: Do GAs Perform Well in Evolving DNN Weights for Copying Human Speech?
Utterance Copy, Neuroevolution, Deep Learning, Speech Synthesis, Genetic Algorithm.
Utterance copy, also known as speech imitation, is the task of estimating the parameters of a target input speech signal so that another signal with the same properties can be artificially reconstructed at the output. After achieving some success with a supervised-trained long short-term memory (LSTM) deep neural network (DNN) that learns to estimate the input parameters of the Klatt formant-based speech synthesizer, we investigate whether applying traditional optimization methods to update the weights of a multivariate regressor in an unsupervised fashion can improve results. This work proposes a genetic algorithm (GA) as an alternative to conventional DNN training algorithms such as back-propagation, or even high-level optimizers such as Adam; this combination of GAs and DNNs is called deep neuroevolution (DNE). Compared with the baseline software WinSnoori, LSTMs trained only on synthetic data proved effective, as expected, only at estimating the parameters of synthetic voices, according to similarity measures such as PESQ, RMSE, SNR and DLE. We therefore conjecture that combining a GA with a deep feed-forward network would yield better scores when dealing with natural voices.
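To make the deep neuroevolution idea concrete, the sketch below shows one minimal way a GA can evolve the weights of a small feed-forward regressor without back-propagation. It is only an illustration of the general technique, not the paper's implementation: the network sizes, GA operators, feature extraction, and fitness function here are assumptions, with a negative RMSE on toy data standing in for the speech-similarity measures (PESQ, RMSE, SNR, DLE) mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: spectral features in, Klatt-style synthesizer parameters out.
N_IN, N_HID, N_OUT = 40, 64, 20
POP, GENS, ELITE, SIGMA = 50, 200, 5, 0.05

SHAPES = [(N_IN, N_HID), (N_HID,), (N_HID, N_OUT), (N_OUT,)]
N_PARAMS = sum(int(np.prod(s)) for s in SHAPES)

def forward(theta, x):
    """Feed-forward regressor whose weights come from a flat GA genome."""
    i, params = 0, []
    for s in SHAPES:
        size = int(np.prod(s))
        params.append(theta[i:i + size].reshape(s))
        i += size
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)
    return h @ w2 + b2

def fitness(theta, x, y_ref):
    """Negative RMSE between predicted and reference parameters (higher is better)."""
    y_hat = forward(theta, x)
    return -np.sqrt(np.mean((y_hat - y_ref) ** 2))

# Toy data standing in for (acoustic features, target synthesizer parameters) pairs.
x = rng.normal(size=(128, N_IN))
y_ref = rng.normal(size=(128, N_OUT))

# Simple elitist GA: keep the best genomes, fill the rest with mutated copies.
population = [rng.normal(scale=0.1, size=N_PARAMS) for _ in range(POP)]
for gen in range(GENS):
    ranked = sorted(population, key=lambda t: fitness(t, x, y_ref), reverse=True)
    elites = ranked[:ELITE]
    population = elites + [
        elites[rng.integers(ELITE)] + rng.normal(scale=SIGMA, size=N_PARAMS)
        for _ in range(POP - ELITE)
    ]

best = max(population, key=lambda t: fitness(t, x, y_ref))
print("best fitness:", fitness(best, x, y_ref))
```

In a full utterance-copy pipeline, the fitness would instead be computed by running the predicted parameters through the Klatt synthesizer and scoring the resynthesized signal against the target speech, which is what makes the gradient-free GA update attractive.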