Speech Enhancement Based on Combined DNN Structure
In this paper, we propose a speech enhancement method that stacks a long short-term memory (LSTM) network and a
deep neural network (DNN). Previous studies have shown that a DNN can model the complex relation between input
feature vectors and their desired target vectors through its multiple nonlinear hidden layers. Accordingly, DNN-based
speech enhancement methods for tasks such as de-noising and de-reverberation have been proposed that learn the mapping
from noisy input feature vectors to clean target feature vectors. However, although adjacent frames are correlated, a DNN
cannot model the temporal context of sequential data such as speech or video, since the mapping between the input
feature vectors and the target feature vectors is performed independently for each frame. Recently, LSTMs have been
successfully applied to various sequence prediction and sequence labeling tasks, such as speech recognition and speech
enhancement, thanks to the recurrent connections in their hidden layers. We therefore stack an LSTM and a DNN to
exploit their complementary strengths. The LSTM models the temporal properties of speech using long-range history;
the DNN is then trained on the LSTM's input and output, i.e., the noisy speech signals and the LSTM-enhanced speech
signals. The proposed method is evaluated in terms of objective measures and shows a significant improvement over
conventional single-DNN and stacked-DNN speech enhancement methods.
Keywords- Speech Enhancement, De-noising, Long Short-Term Memory, Deep Neural Network.
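The two-stage structure described in the abstract can be sketched in plain NumPy. Everything here is an illustrative assumption, not the paper's system: the dimensions, the untrained random weights, and the helper names `lstm_forward` and `dnn_forward` are invented for the sketch. It only demonstrates the data flow: noisy features pass through an LSTM to produce enhanced features, and the DNN then consumes the LSTM's input and output concatenated frame by frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, W, U, b):
    """Run a single LSTM layer over a (T, d_in) feature sequence."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outs = []
    for x in x_seq:
        z = W @ x + U @ h + b              # stacked gate pre-activations: i, f, o, g
        i = sigmoid(z[0 * hidden:1 * hidden])  # input gate
        f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate
        o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
        g = np.tanh(z[3 * hidden:])            # candidate cell state
        c = f * c + i * g                      # update cell state
        h = o * np.tanh(c)                     # hidden state carries temporal context
        outs.append(h)
    return np.stack(outs)                      # (T, hidden)

def dnn_forward(x, layers):
    """Feed-forward MLP: tanh hidden layers, linear output layer."""
    for k, (Wk, bk) in enumerate(layers):
        x = x @ Wk + bk
        if k < len(layers) - 1:
            x = np.tanh(x)
    return x

# assumed dimensions (illustrative only)
T, d_feat, d_hid = 100, 40, 64

# stage 1: LSTM maps noisy features to enhanced features, using long-range history
W = rng.standard_normal((4 * d_hid, d_feat)) * 0.1
U = rng.standard_normal((4 * d_hid, d_hid)) * 0.1
b = np.zeros(4 * d_hid)
W_out = rng.standard_normal((d_hid, d_feat)) * 0.1  # project back to feature dim

noisy = rng.standard_normal((T, d_feat))
lstm_enh = lstm_forward(noisy, W, U, b) @ W_out     # (T, d_feat)

# stage 2: DNN is fed the LSTM's input and output, concatenated per frame
dnn_in = np.concatenate([noisy, lstm_enh], axis=1)  # (T, 2 * d_feat)
layers = [
    (rng.standard_normal((2 * d_feat, d_hid)) * 0.1, np.zeros(d_hid)),
    (rng.standard_normal((d_hid, d_feat)) * 0.1, np.zeros(d_feat)),
]
enhanced = dnn_forward(dnn_in, layers)              # (T, d_feat)
print(enhanced.shape)
```

In a real system both stages would be trained (the LSTM toward clean targets, then the DNN on the frozen LSTM's input/output pairs); the sketch above only fixes the tensor shapes and the frame-wise concatenation that the stacking relies on.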