Nepali Speech Emotion Detection Using Deep Learning
DOI:
https://doi.org/10.3126/injet.v3i2.95516
Keywords:
Speech Emotion Recognition, Nepali Language, Deep Learning, MFCC, CNN
Abstract
Emotionally intelligent human-computer interaction depends on Speech Emotion Recognition (SER), which aims to recognize emotional states from speech. Research on SER for low-resource languages such as Nepali remains limited. This work presents a deep learning-based Nepali speech emotion detection system that uses Mel-Frequency Cepstral Coefficients (MFCCs) as features and a one-dimensional Convolutional Neural Network (1D-CNN) as the classifier. A dedicated Nepali emotional speech dataset of 1,810 audio samples (632 happy, 560 neutral, and 618 sad utterances) was collected from studio recordings, mobile recordings, podcasts, and broadcast sources. Every audio sample underwent preprocessing, resampling to 16 kHz, and conversion to mono. MFCC features were then extracted and fed to the 1D-CNN model. Experimental results show that the proposed model achieves an overall accuracy of 88% on the Nepali dataset. The Sad class performed best, with a precision of 0.96, recall of 0.92, and F1-score of 0.94. The Neutral class achieved a precision of 0.89 and an F1-score of 0.81, while the Happy class achieved a recall of 0.98 and an F1-score of 0.89. ROC analysis showed strong discrimination, with AUC values of 0.97 for Neutral and 0.99 for both Happy and Sad.
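The per-class figures in the abstract can be cross-checked with the standard F1 definition, F1 = 2PR / (P + R). The sketch below is illustrative only: the helper names are hypothetical, and the Neutral recall and Happy precision it derives are rough estimates back-calculated from the rounded values in the abstract, not figures reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def solve_missing_metric(f1, known):
    """Solve F1 = 2*P*R / (P + R) for whichever of P or R is not given.

    The formula is symmetric in P and R, so the same expression works
    for either one. Inputs are rounded, so the result is approximate.
    """
    return f1 * known / (2 * known - f1)


# Class counts as stated in the abstract.
counts = {"happy": 632, "neutral": 560, "sad": 618}
total = sum(counts.values())

# Sad: precision and recall are both given, so F1 can be recomputed.
sad_f1 = f1_score(0.96, 0.92)

# Neutral: only precision and F1 are given; recall is an estimate.
neutral_recall_est = solve_missing_metric(0.81, 0.89)

# Happy: only recall and F1 are given; precision is an estimate.
happy_precision_est = solve_missing_metric(0.89, 0.98)

print(f"total samples: {total}")
print(f"Sad F1 recomputed: {sad_f1:.2f}")
print(f"Neutral recall (estimated): {neutral_recall_est:.2f}")
print(f"Happy precision (estimated): {happy_precision_est:.2f}")
```

The class counts sum to the stated 1,810 samples, and the recomputed Sad F1 rounds to the reported 0.94. The two estimated values suggest that the lower Neutral F1 comes mainly from recall, although this reading depends on rounded inputs.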
License
Copyright (c) 2026 International Journal on Engineering Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.
This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.