Convolutional Neural Networks (CNNs) have been shown their performance in speech recognition systems for extracting features, and also acoustic modeling. In addition, CNNs have been used for robust speech recognition and competitive results have been reported. Convolutive Bottleneck Network (CBN) is a kind of CNNs which has a bottleneck layer among its fully connected layers. The bottleneck features extracted by CBNs contain discriminative and rich context information. In this paper, we discuss these bottleneck features from an information theory viewpoint and use them as robust features for noisy speech recognition. In the proposed method, CBN inputs are the noisy logarithm of Mel filter bank energies (LMFBs) in a number of neighbor frames and its outputs are corresponding phone labels. In such a system, we showed that the mutual information between the bottleneck layer and labels are higher than the mutual information between noisy input features and labels. Thus, the bottleneck features are a denoised compressed form of input features which are more representative than input features for discriminating phone classes. Experimental results on the Aurora2 database show that bottleneck features extracted by CBN outperform some conventional speech features and also robust features extracted by CNN.
Type of Study:
Research Paper |
Subject:
Image Processing Received: 2019/08/12 | Revised: 2020/10/10 | Accepted: 2020/10/17