Leveraging Jointly Spatial, Temporal And Modulation Enhancement In Creating Noise-Robust Features For Speech Recognition
This paper proposes combining spatial-, temporal-, and modulation-domain speech feature enhancement techniques
in various ways in order to achieve superior speech recognition performance in noise-corrupted environments.
With the mel-frequency cepstral coefficients (MFCC) as the standard speech feature representation, the spatial-domain
techniques involve the short-time intra-frame feature enhancement, while the temporal-domain techniques compensate for
the noise distortion that exists in the long-term inter-frame MFCC time stream. Furthermore, the modulation-domain
techniques operate on the Fourier transform of an MFCC time stream. The evaluation experiments conducted on the
connected-digit Aurora-2 database reveal that each of the spatial/temporal enhancement techniques adopted here performs
better than the unprocessed MFCC baseline, and that integrating the methods for spatial-, temporal-, and
modulation-domain features can yield even better recognition accuracy than any individual component method under a
wide range of noise-corrupted environments. These results demonstrate that the methods in the three domains address
different aspects of noise and are therefore complementary to one another.
Keywords- Noise Robustness, Speech Recognition, Spatial Processing, Temporal Processing, Modulation Domain.
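
To make the distinction between the domains concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an MFCC matrix of shape (frames, coefficients), here filled with synthetic random values. Mean and variance normalization is used only as an illustrative stand-in for a temporal-domain method, since the abstract does not name the specific techniques; the function names `temporal_mvn` and `modulation_spectrum` are hypothetical. Temporal processing acts along the frame axis of each coefficient stream, and the modulation-domain representation is the Fourier transform of that same stream. A spatial-domain method would instead act across the coefficients within one frame (axis 1).

```python
import numpy as np


def temporal_mvn(mfcc):
    """Illustrative temporal-domain step (an assumption, not the paper's method):
    normalize each coefficient's time stream to zero mean and unit variance."""
    mean = mfcc.mean(axis=0, keepdims=True)
    std = mfcc.std(axis=0, keepdims=True)
    return (mfcc - mean) / np.maximum(std, 1e-8)  # guard against constant streams


def modulation_spectrum(mfcc):
    """Fourier transform of each coefficient's time stream (along the frame axis).
    Returns a complex array of shape (n_frames // 2 + 1, n_coeffs)."""
    return np.fft.rfft(mfcc, axis=0)


# Synthetic stand-in for real features: 200 frames of 13 coefficients.
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=2.0, scale=3.0, size=(200, 13))

normalized = temporal_mvn(mfcc)
spectrum = modulation_spectrum(normalized)

# A modulation-domain method would modify `spectrum` here; unmodified,
# the inverse transform simply recovers the time-domain streams.
reconstructed = np.fft.irfft(spectrum, n=normalized.shape[0], axis=0)
```

Because the two steps operate on the same time stream in different representations, they can be chained as shown, which is the kind of integration the abstract evaluates.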