Deep Neural Network for Speaker Identification Using Static and Dynamic Prosodic Feature for Spontaneous and Dictated Data

Arifan Rahman, Wahyu Catur Wibowo

Abstract


We can recognize a person by voice alone; in principle, each person's voice has a distinctive pitch. This study measures the performance of a Deep Neural Network (DNN) on static and dynamic prosodic features. Prosody is the information in speech related to tone, intonation, stress, duration, and rhythm of a person's pronunciation. The data consist of dictated and spontaneous voice recordings taken from YouTube, covering three male voices and one female voice. The recordings are segmented into durations of 3 seconds, 5 seconds, and 10 seconds. From each segment, static prosodic features with 103 dimensions and dynamic prosodic features with 13 dimensions are extracted. Each feature set and their combination is trained and tested using a DNN with a 90:10 split. The results show that the 10-second segments achieve higher accuracy than the shorter ones, and that static prosodic features outperform dynamic prosodic features. The average DNN accuracy is 87.02% for static prosodic features, 72.97% for dynamic prosodic features, and 87.72% for the combined static and dynamic prosodic features.
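The evaluation pipeline described above (per-segment feature vectors, a combined static+dynamic feature set, a 90:10 train/test split, and a fully connected network) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random arrays stand in for real prosodic feature extraction, and the network's layer sizes and scikit-learn itself are assumptions not stated in the abstract.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder features: in the paper each voice segment yields a
# 103-dim static prosodic vector and a 13-dim dynamic prosodic vector.
# Random data stands in for the actual feature extraction step.
n_segments, static_dim, dynamic_dim = 200, 103, 13
static_feats = rng.normal(size=(n_segments, static_dim))
dynamic_feats = rng.normal(size=(n_segments, dynamic_dim))
labels = rng.integers(0, 4, size=n_segments)  # 4 speakers (3 male, 1 female)

# Combined feature set: concatenate static and dynamic vectors (116 dims).
combined = np.hstack([static_feats, dynamic_feats])

# 90:10 train/test split, matching the ratio reported in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    combined, labels, test_size=0.1, random_state=0)

# A small fully connected network as a stand-in for the paper's DNN;
# the hidden-layer sizes here are illustrative assumptions.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

With real prosodic features the same scaffold would be run separately on the static set, the dynamic set, and their concatenation to reproduce the three accuracy comparisons reported above.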

Keywords


voice; signal processing; deep neural network; prosody


DOI: http://dx.doi.org/10.58258/jisip.v5i4.2279




Copyright (c) 2021 Arifan Rahman, Wahyu Catur Wibowo



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

JISIP (Jurnal Ilmu Sosial dan Pendidikan)
p-ISSN: 2598-9944, e-ISSN: 2656-6753
This journal is published by Lembaga Penelitian dan Pendidikan (LPP) Mandala.