A New VAD Algorithm using Sparse Representation and Updated Dictionary in Spectrogram Domain

Document Type : Original Article


Department of Electrical Engineering, Nowshahr Branch, Islamic Azad University, Nowshahr, Iran


This article proposes a new voice activity detection (VAD) method that operates in the spectrogram domain (spectro-temporal response field space) and is based on sparse representation. Spectrogram-domain components have two dimensions, time and frequency. Using sparse representation to learn speech and noise dictionaries, and updating those dictionaries, yields better separation of speech and noise segments. In the proposed algorithm, updated dictionaries with different atom sizes are constructed from the auditory spectrogram using the K-SVD (a generalization of k-means clustering) and NMF (non-negative matrix factorization) learning methods. For example, at SNRs greater than 0 dB the proposed VAD achieves a performance above 92.71% in white noise and above 91.21% in car noise, which compares favorably with other methods. Comparing the NDS and MSC evaluation parameters with those of other methods also shows better performance for the proposed method.
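The dictionary-based idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NMF with plain multiplicative updates, a simple energy-ratio decision rule, and a noise-dictionary update that relearns atoms from frames labelled as noise. The function names (`nmf`, `encode`, `detect_speech`, `update_noise_dictionary`), the 0.5 threshold, and the synthetic data, in which speech and noise occupy disjoint frequency bins, are all hypothetical choices made only to keep the example small.

```python
import numpy as np

def nmf(V, n_atoms, n_iter=200, seed=0, eps=1e-9):
    """Learn a non-negative dictionary W with V ~ W @ H (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_atoms)) + eps
    H = rng.random((n_atoms, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Return atoms scaled to unit L2 norm; the activations H are discarded.
    return W / (np.linalg.norm(W, axis=0) + eps)

def encode(V, D, n_iter=200, eps=1e-9):
    """Non-negative activations of the frames V on a fixed dictionary D."""
    H = np.full((D.shape[1], V.shape[1]), 0.5)
    for _ in range(n_iter):
        H *= (D.T @ V) / (D.T @ D @ H + eps)
    return H

def detect_speech(V, D_speech, D_noise, threshold=0.5):
    """Flag a frame as speech when speech atoms explain most of its energy.

    The energy-ratio rule and the threshold are assumptions of this sketch.
    """
    k = D_speech.shape[1]
    H = encode(V, np.hstack([D_speech, D_noise]))
    speech_energy = np.sum((D_speech @ H[:k]) ** 2, axis=0)
    noise_energy = np.sum((D_noise @ H[k:]) ** 2, axis=0)
    return speech_energy / (speech_energy + noise_energy + 1e-12) > threshold

def update_noise_dictionary(V, is_speech, D_noise, min_frames=5):
    """Relearn the noise atoms from frames labelled non-speech (assumed rule)."""
    noise_frames = V[:, ~is_speech]
    if noise_frames.shape[1] < min_frames:
        return D_noise  # too few noise frames: keep the old dictionary
    return nmf(noise_frames, D_noise.shape[1])

# Synthetic stand-in for spectrogram frames (rows: frequency bins, columns: frames).
rng = np.random.default_rng(1)
n_freq, half = 32, 16
speech_basis = np.zeros((n_freq, 3))
speech_basis[:half] = rng.random((half, 3))   # toy "speech" lives in the low bins
noise_basis = np.zeros((n_freq, 3))
noise_basis[half:] = rng.random((half, 3))    # toy "noise" lives in the high bins

D_speech = nmf(speech_basis @ rng.random((3, 200)), n_atoms=3)
D_noise = nmf(noise_basis @ rng.random((3, 200)), n_atoms=3)

noise_only = noise_basis @ rng.random((3, 10))
speech_plus_noise = (speech_basis @ (rng.random((3, 10)) + 0.5)
                     + 0.2 * (noise_basis @ rng.random((3, 10))))
frames = np.hstack([noise_only, speech_plus_noise])

decisions = detect_speech(frames, D_speech, D_noise)
D_noise = update_noise_dictionary(frames, decisions, D_noise)
```

In this toy setup the first ten frames contain only noise and the last ten contain speech with weaker additive noise, so `decisions` separates the two groups. Real auditory spectrograms overlap heavily in frequency, which is where dictionary size, the learning method (K-SVD or NMF), and the update strategy would be expected to matter.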

