Showing 6 results for Speech

M. H. Sedaaghi,
Volume 5, Issue 1 (3-2009)
Abstract

Accurate gender classification is useful in speech and speaker recognition as well as speech emotion classification, because better performance has been reported when separate acoustic models are employed for males and females. Gender classification also arises in face recognition, video summarization, human-robot interaction, etc. Although gender classification is rather mature in applications dealing with images, it is still in its infancy in speech processing. Age classification, in turn, is a useful tool in various applications, such as issuing different permission levels to different age groups. This paper presents a comparative study of gender and age classification algorithms applied to speech signals. Experimental results are reported for the Danish Emotional Speech database (DES) and the English Language Speech Database for Speaker Recognition (ELSDSR). The Bayes classifier with sequential floating forward selection (SFFS) for feature selection, probabilistic neural networks (PNNs), support vector machines (SVMs), K-nearest neighbor (K-NN), and Gaussian mixture models (GMMs) are empirically compared to determine the best classifier for gender and age classification of speech. It is shown that gender classification can be performed with an accuracy of approximately 95%, using speech signals either from both genders jointly or from males and females separately. The accuracy for age classification is about 88%.
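
A minimal sketch of this kind of classifier comparison, using scikit-learn with synthetic stand-ins for the acoustic feature vectors (the study itself extracts features from the DES and ELSDSR recordings, and also evaluates PNN and GMM classifiers, which need more scaffolding than shown here):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Placeholder data: 200 utterances x 13 acoustic features (e.g., MFCC means);
    # labels 0 = male, 1 = female.
    X = rng.normal(size=(200, 13))
    y = rng.integers(0, 2, size=200)

    for name, clf in [("Bayes (Gaussian naive Bayes)", GaussianNB()),
                      ("SVM (RBF kernel)", SVC(kernel="rbf")),
                      ("K-NN (k=5)", KNeighborsClassifier(n_neighbors=5))]:
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{name}: mean cross-validated accuracy {acc:.3f}")
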
M. Geravanchizadeh, S. Ghalami Osgouei,
Volume 10, Issue 4 (12-2014)
Abstract

This paper presents new adaptive filtering techniques for speech enhancement systems. Adaptive filtering schemes are subject to trade-offs among steady-state misadjustment, speed of convergence, and tracking performance. The fractional least-mean-square (FLMS) algorithm is a new adaptive algorithm that performs better than the conventional LMS algorithm, and normalization of LMS also improves the adaptive filter's performance. Furthermore, a convex combination of two adaptive filters improves overall performance. In this paper, new convex combinational adaptive filtering methods are proposed in the framework of a speech enhancement system. The proposed methods utilize the ideas of normalization and fractional derivatives, both in the design of different convex mixing strategies and in their component filters. To assess the proposed methods, different LMS-based algorithms are compared in terms of their convergence behavior (i.e., MSE plots) and various objective and subjective criteria, including SNR improvement, the PESQ test, and listening tests for dual-channel speech enhancement. The strengths of the proposed methods are their low complexity, as expected of LMS-based methods, together with a high convergence rate.
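
A minimal sketch of the general convex-combination scheme that the proposed methods build on: two LMS component filters with different step sizes run in parallel, and a sigmoid-mapped mixing weight is adapted on the combined error (this is the standard scheme; the authors' normalized and fractional variants differ in the component-filter updates).

    import numpy as np

    def convex_lms(x, d, L=8, mu_fast=0.1, mu_slow=0.01, mu_a=10.0):
        """Convex combination of a fast and a slow LMS filter of length L."""
        w1, w2 = np.zeros(L), np.zeros(L)        # component filter weights
        a = 0.0                                  # mixing state; lambda = sigmoid(a)
        y = np.zeros(len(x))
        for n in range(L, len(x)):
            u = x[n - L:n][::-1]                 # input regressor
            y1, y2 = w1 @ u, w2 @ u
            lam = 1.0 / (1.0 + np.exp(-a))       # mixing weight in (0, 1)
            y[n] = lam * y1 + (1.0 - lam) * y2   # combined output
            e1, e2 = d[n] - y1, d[n] - y2
            e = d[n] - y[n]
            w1 += mu_fast * e1 * u               # each component adapts on its
            w2 += mu_slow * e2 * u               # own error, independently
            a += mu_a * e * (y1 - y2) * lam * (1.0 - lam)  # adapt the mixture
            a = np.clip(a, -4.0, 4.0)            # keep the sigmoid from saturating
        return y

The combination inherits the fast filter's convergence speed and the slow filter's low steady-state misadjustment, which is exactly the trade-off discussed above.
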
M. Bashirpour, M. Geravanchizadeh,
Volume 12, Issue 3 (9-2016)
Abstract

Automatic recognition of emotional states from speech in noisy conditions has become an important research topic in emotional speech recognition in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ power-normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate their performance in emotion recognition using clean and noisy speech materials and compare them with the well-known MFCC, LPCC, RASTA-PLP, and TEMFCC features. Speech samples are extracted from the Berlin emotional speech database (Emo-DB) and the Persian emotional speech database (Persian ESD), and are corrupted with four different noise types at various SNR levels. The experiments are conducted in clean-train/noisy-test scenarios to simulate practical conditions with noise sources. Simulation results show that higher recognition rates are achieved with PNCC than with the conventional features under noisy conditions.
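
A minimal sketch of the clean-train/noisy-test protocol described above, with librosa MFCCs standing in for the compared frame-level features (librosa has no built-in PNCC) and additive white noise at a target SNR as the corruption model:

    import numpy as np
    import librosa

    def utterance_features(y, sr):
        # Mean MFCC vector over the utterance; the paper compares PNCC, MFCC,
        # LPCC, RASTA-PLP, and TEMFCC in this role.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    def add_noise(y, snr_db, rng=None):
        # Corrupt a clean signal with white noise at the requested SNR (dB).
        rng = np.random.default_rng(0) if rng is None else rng
        noise = rng.normal(size=y.shape)
        scale = np.sqrt((y ** 2).mean() / (10 ** (snr_db / 10) * (noise ** 2).mean()))
        return y + scale * noise

    # Hypothetical usage on a synthetic tone (a real test would load Emo-DB or
    # Persian ESD audio): train a classifier on clean_feats, test on noisy_feats.
    sr = 16000
    y = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)
    clean_feats = utterance_features(y, sr)
    noisy_feats = utterance_features(add_noise(y, snr_db=5), sr)
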


G. Alipoor,
Volume 13, Issue 4 (12-2017)
Abstract

The performance of the linear models widely used within the framework of adaptive line enhancement (ALE) deteriorates dramatically in the presence of non-Gaussian noises. On the other hand, adaptive implementations of nonlinear models, e.g. Volterra filters, suffer from a large number of parameters and slow convergence. Kernel methods are emerging solutions that can tackle these problems by nonlinearly mapping the original input space to a reproducing kernel Hilbert space. The aim of the current paper is to exploit kernel adaptive filters within the ALE structure for speech signal enhancement. The performance of these nonlinear algorithms is compared with that of their linear as well as nonlinear Volterra counterparts in the presence of various types of noise. Simulation results show that the kernel LMS algorithm, compared to its counterparts, leads to a greater improvement in the quality of the enhanced speech. This improvement is more significant for non-Gaussian noises.
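A minimal sketch of kernel LMS (KLMS) with a Gaussian kernel in an adaptive-line-enhancer configuration, where the filter predicts the current sample from a delayed regressor so that the prediction tracks the correlated speech component; parameter values are illustrative, not taken from the paper.

    import numpy as np

    def klms_ale(x, L=10, delay=1, mu=0.5, sigma=1.0):
        """KLMS line enhancer: returns the predicted (enhanced) signal."""
        gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))
        centers, alphas = [], []                 # growing KLMS dictionary
        y = np.zeros(len(x))
        for n in range(L + delay, len(x)):
            u = x[n - delay - L + 1 : n - delay + 1]   # delayed input regressor
            y[n] = sum(a * gauss(u, c) for a, c in zip(alphas, centers))
            e = x[n] - y[n]                      # prediction error
            centers.append(u.copy())             # KLMS update: store new center
            alphas.append(mu * e)                # with coefficient mu * e
        return y

The dictionary grows with every sample here; practical kernel adaptive filters bound it with sparsification or quantization.
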

B. Nasersharif, N. Naderi,
Volume 17, Issue 2 (6-2021)
Abstract

Convolutional neural networks (CNNs) have demonstrated their performance in speech recognition systems, both for feature extraction and for acoustic modeling. CNNs have also been used for robust speech recognition, where competitive results have been reported. A convolutive bottleneck network (CBN) is a kind of CNN that has a bottleneck layer among its fully connected layers. The bottleneck features extracted by CBNs contain discriminative and rich contextual information. In this paper, we discuss these bottleneck features from an information-theoretic viewpoint and use them as robust features for noisy speech recognition. In the proposed method, the CBN inputs are the noisy logarithmic Mel filter bank energies (LMFBs) of a number of neighboring frames, and its outputs are the corresponding phone labels. In such a system, we show that the mutual information between the bottleneck layer and the labels is higher than the mutual information between the noisy input features and the labels. Thus, the bottleneck features are a denoised, compressed form of the input features that is more discriminative of phone classes. Experimental results on the Aurora2 database show that bottleneck features extracted by the CBN outperform some conventional speech features as well as robust features extracted by a CNN.
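A minimal PyTorch sketch (layer sizes are hypothetical, not the paper's) of a convolutive bottleneck network: a convolutional front-end over a window of LMFB frames, a narrow bottleneck among the fully connected layers, and phone-label outputs.

    import torch
    import torch.nn as nn

    class CBN(nn.Module):
        def __init__(self, n_mels=24, n_frames=11, n_phones=40, bottleneck=39):
            super().__init__()
            self.conv = nn.Sequential(           # LMFB window as a 1-channel image
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Flatten())
            flat = 64 * (n_mels // 2) * (n_frames // 2)
            self.pre = nn.Sequential(nn.Linear(flat, 512), nn.ReLU())
            self.bottleneck = nn.Linear(512, bottleneck)   # the feature layer
            self.post = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck, n_phones))

        def forward(self, x):                    # x: (batch, 1, n_mels, n_frames)
            z = self.bottleneck(self.pre(self.conv(x)))
            return self.post(z), z               # phone logits, bottleneck features

After training against the phone labels, the second output (the bottleneck activations) is what would be fed to the recognizer as the robust feature vector.
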

Mohammad Hasheminejad,
Volume 19, Issue 4 (12-2023)
Abstract

This study presents the Nonparametric Speech Kernel (NSK), a nonparametric kernel technique, as a novel way to improve speech emotion recognition (SER). The method aims to effectively reduce the dimensionality of speech features in order to improve recognition accuracy, addressing the need for efficient, compact, low-dimensional features for speech emotion recognition. Acknowledging the intrinsic differences between speech and image data, we have refined the kernel nonparametric weighted feature extraction (KNWFE) formulation to propose NSK, which is specifically intended for speech emotion recognition. The output of NSK can be used as input features for deep learning models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or hybrid architectures. NSK can also be used as a kernel function for kernel-based methods such as kernelized support vector machines (SVMs) or kernelized neural networks. Our tests demonstrate that NSK outperforms current techniques, beating the best-tested approach by 5.02% and 3.05% on the Persian and Berlin speech emotion datasets, respectively, with average accuracies of 96.568% and 82.56%.
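
A minimal sketch of plugging a custom kernel into a kernelized SVM through scikit-learn's precomputed-kernel interface; the nsk() body below is only a placeholder (a plain RBF kernel), since the actual NSK formulation is defined in the paper, and the feature matrices are synthetic stand-ins:

    import numpy as np
    from sklearn.svm import SVC

    def nsk(A, B, sigma=1.0):
        # Placeholder kernel matrix between feature sets A (n, d) and B (m, d);
        # a real implementation would evaluate the Nonparametric Speech Kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 20))          # toy speech feature vectors
    y_train = rng.integers(0, 4, size=100)        # toy emotion labels
    X_test = rng.normal(size=(10, 20))

    clf = SVC(kernel="precomputed").fit(nsk(X_train, X_train), y_train)
    pred = clf.predict(nsk(X_test, X_train))      # rows: test, columns: train
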

