Music Emotion Recognition

Dasuni G
Nov 19, 2020

A 2020 Guide

One way or another, we all have a music clip or playlist we love to play depending on our mood!

And now, Music Streaming Services (MSS) are personalizing music playlists based on your mood!

MSSs are doing this through a technology called Music Emotion Recognition (MER).

MER has yet to become a disruptive technology, not only for streaming services but also for affective computing applications such as affective gaming, and it is still being developed in the computational research arena.

What is MER?

MER is an interdisciplinary area, combining music, psychology, and computational analysis, that automatically detects and classifies music based on the emotions humans perceive when they listen.

MER Technology: How it began!

Human brains are naturally sensitive to the audio dimension: the human auditory system has evolved to be sensitive to the properties of sound waves.

Simply put, the auditory system mainly comprises the human ear and the auditory cortex, the part of the brain where audio information is processed.

So it is undeniably true that if computers are to be emotionally sensitive to music the way humans are, MER researchers need to clearly understand how the auditory system functions.

This post is not about how to carry out an MER project such as mood classification in music. Instead, it is a timely guide to MER research, in which I discuss the challenges and opportunities in the field today.

Challenges in 2020

A typical MER project looks much like a Machine Learning classification project: you select and extract audio features related to human emotion (arousal and valence) and then train a supervised classifier.
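
To make this concrete, here is a minimal sketch of such a pipeline, assuming librosa for feature extraction and scikit-learn for the classifier; the clip paths, mood labels, and chosen feature set are hypothetical placeholders, not a prescribed recipe.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(path):
    """Extract a few emotion-related acoustic features from one audio clip."""
    y, sr = librosa.load(path, duration=30)                           # first 30 seconds
    rms = librosa.feature.rms(y=y).mean()                             # energy, linked to arousal
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)                    # speed, also linked to arousal
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()   # brightness
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)   # timbre
    return np.hstack([rms, tempo, centroid, mfcc])

# Hypothetical annotated dataset: clip paths and mood labels from human raters
paths = ["clip_happy.wav", "clip_sad.wav", "clip_calm.wav"]
labels = ["happy", "sad", "calm"]

X = np.vstack([extract_features(p) for p in paths])
clf = RandomForestClassifier(n_estimators=200).fit(X, labels)

print(clf.predict([extract_features("new_clip.wav")]))   # predicted mood of an unseen clip
```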

The acoustic feature selection process is the biggest challenge for MER.

Acoustic features for music in a given genre are selected through a manual evaluation of the relationship between the extracted features and emotions, using human annotations. For example, amplitude, measured as the RMS energy of pop songs, is evaluated against human arousal: subjects listen to pop songs with varying RMS energy levels, without being told what differs between them, and are asked to rate how aroused they felt.
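
As a rough illustration of how such an evaluation might be scripted (the clips and arousal ratings below are hypothetical placeholders, not real annotation data), one could correlate each clip's RMS energy with the mean arousal rating it received:

```python
import librosa
from scipy.stats import pearsonr

clips = ["pop_clip_01.wav", "pop_clip_02.wav", "pop_clip_03.wav"]   # hypothetical files
arousal_ratings = [2.1, 3.8, 4.5]                                   # hypothetical mean ratings, 1-5 scale

rms_values = []
for path in clips:
    y, sr = librosa.load(path)
    rms_values.append(librosa.feature.rms(y=y).mean())              # mean RMS energy per clip

r, p = pearsonr(rms_values, arousal_ratings)
print(f"RMS vs. arousal: r={r:.2f}, p={p:.3f}")                     # strength of the feature-emotion link
```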

This is a huge challenge, as such human experiments involve high levels of uncertainty and subjectivity, and they are also time-consuming and laborious.

This challenge leads to another: model interpretability, that is, how well the trained model produces realistic results, suffers, meaning the mood classifications the model produces contradict how humans actually perceive music emotions.

Opportunities

Researchers have now found new approaches to emotion recognition, thanks to advances in algorithms.

Acoustic feature selection is now being transformed by Representation Learning, which is already being applied to feature selection in Speech Emotion Recognition.

Representation Learning belongs more to Deep Learning than to classical Machine Learning. It works similarly to how brain cells, aka neurons, process information: incremental information processing and incremental learning through abstraction and representation. Representation Learning algorithms build representations of raw input data and learn from them, much like neurons do, uncovering the factors behind the variation in the data.

So how do you apply Representation Learning in MER research? It is not simple to explain fully, and I will take it up in another post shortly. In brief, music clips are converted into audio spectrograms and then fed into Representation Learning algorithms such as autoencoders, Convolutional Neural Networks (CNNs), and so on.
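
As a minimal sketch of the first step, assuming librosa, a clip can be turned into a log-mel spectrogram that a Representation Learning model can consume (the file path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("music_clip.wav", duration=30)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # mel-scaled spectrogram
log_mel = librosa.power_to_db(mel, ref=np.max)                  # convert power to decibels
print(log_mel.shape)   # (n_mels, time_frames): the input "image" for the network
```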

Algorithms like these can learn representations that capture the temporal dynamics of the audio spectrograms, thereby identifying features on their own. A classic example of an open-source Representation Learning toolkit for audio feature extraction is auDeep.
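
And here is a hedged sketch of the learning step itself: a tiny autoencoder, written with PyTorch, that compresses spectrogram frames into a low-dimensional code. auDeep itself uses recurrent sequence-to-sequence autoencoders; this dense version, trained here on random placeholder frames, is only meant to illustrate the idea.

```python
import torch
import torch.nn as nn

n_mels, code_dim = 128, 16

# Encoder maps a spectrogram frame to a compact code; decoder reconstructs the frame.
encoder = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(), nn.Linear(64, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(), nn.Linear(64, n_mels))

frames = torch.randn(1000, n_mels)    # placeholder for real log-mel spectrogram frames
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    codes = encoder(frames)                      # learned representations
    reconstruction = decoder(codes)              # attempt to reproduce the input
    loss = loss_fn(reconstruction, frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the 16-dimensional codes act as learned feature vectors
# that can be fed to an emotion classifier instead of hand-picked features.
```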

For an MER researcher, Representation Learning and other Deep Learning techniques may transform the future of Emotion AI. This is because Deep Learning loosely resembles how the human brain works, and this helps machines perform more reliably and more closely to human behavior.
