DATA101 Final Project: An Exploration of Music Listening Habits and Self-Reported Mental Health Scores

Dataset Information

This dataset contains survey results on music taste, music listening habits, and self-reported mental health scores for anxiety, depression, insomnia, and OCD. The dataset has 736 observations (participants) and 31 variables. The survey was created, distributed, and formatted into a data frame by computer science professor Catherine Rasgaitis at the University of Washington. The variables are as follows:

Variable Name	Description
Age	Respondent’s age
Primary Streaming Service	The streaming app the respondent uses the most to listen to music
Hours per day	Average hours the respondent spends listening to music per day
While working	Whether or not the respondent listens to music while working
Instrumentalist	Whether or not the respondent plays an instrument regularly
Composer	Whether or not the respondent composes music
Fav genre	The respondent’s favorite genre
Exploratory	Whether or not the respondent actively explores new artists/genres
Foreign languages	Whether or not the respondent regularly listens to music with lyrics in a language they are not fluent in
BPM	Average BPM of some of their favorite songs
Frequency[Classical, Country, EDM, Folk, Gospel, Hip Hop, Jazz, K Pop, Latin, Lofi, Metal, Pop, R&B, Rap, Rock, Video Game Music] (each in its own column)	How often the respondent listens to each music genre, chosen from: Never, Rarely, Sometimes, or Very frequently
Anxiety, Depression, Insomnia, OCD (each in its own column)	Respondents’ self-reported ratings of each condition, where 0 = “I do not experience this.” and 10 = “I experience this regularly, constantly, or extreme.”
Music effects	Whether or not music improves/worsens respondent’s mental health conditions

Loading the data

library(tidyverse)
surveyraw <- read_csv("mxmhsurvey.csv")
surveyraw <- surveyraw %>% select(-c(Timestamp, Permissions))
names(surveyraw) <- gsub(" ", "_", names(surveyraw))

Examining missing values

sum(is.na(surveyraw))

## [1] 129

A more detailed look:

colSums(is.na(surveyraw))

##                          Age    Primary_streaming_service 
##                            1                            1 
##                Hours_per_day                While_working 
##                            0                            3 
##              Instrumentalist                     Composer 
##                            4                            1 
##                    Fav_genre                  Exploratory 
##                            0                            0 
##            Foreign_languages                          BPM 
##                            4                          107 
##        Frequency_[Classical]          Frequency_[Country] 
##                            0                            0 
##              Frequency_[EDM]             Frequency_[Folk] 
##                            0                            0 
##           Frequency_[Gospel]          Frequency_[Hip_hop] 
##                            0                            0 
##             Frequency_[Jazz]            Frequency_[K_pop] 
##                            0                            0 
##            Frequency_[Latin]             Frequency_[Lofi] 
##                            0                            0 
##            Frequency_[Metal]              Frequency_[Pop] 
##                            0                            0 
##              Frequency_[R&B]              Frequency_[Rap] 
##                            0                            0 
##             Frequency_[Rock] Frequency_[Video_game_music] 
##                            0                            0 
##                      Anxiety                   Depression 
##                            0                            0 
##                     Insomnia                          OCD 
##                            0                            0 
##                Music_effects 
##                            8

BPM is the column with the most missing data, which is no surprise. The survey anticipated that many respondents would not go through the hassle of researching song data, so this question was left optional.

Examining percentage of missing values

mean(is.na(surveyraw) * 100)

## [1] 0.5653927

The result is already expressed as a percentage, so about 0.57% of all cells in the data frame are missing, which is a very small portion of the dataset. Onto a more detailed look:

colMeans(is.na(surveyraw) * 100)

##                          Age    Primary_streaming_service 
##                    0.1358696                    0.1358696 
##                Hours_per_day                While_working 
##                    0.0000000                    0.4076087 
##              Instrumentalist                     Composer 
##                    0.5434783                    0.1358696 
##                    Fav_genre                  Exploratory 
##                    0.0000000                    0.0000000 
##            Foreign_languages                          BPM 
##                    0.5434783                   14.5380435 
##        Frequency_[Classical]          Frequency_[Country] 
##                    0.0000000                    0.0000000 
##              Frequency_[EDM]             Frequency_[Folk] 
##                    0.0000000                    0.0000000 
##           Frequency_[Gospel]          Frequency_[Hip_hop] 
##                    0.0000000                    0.0000000 
##             Frequency_[Jazz]            Frequency_[K_pop] 
##                    0.0000000                    0.0000000 
##            Frequency_[Latin]             Frequency_[Lofi] 
##                    0.0000000                    0.0000000 
##            Frequency_[Metal]              Frequency_[Pop] 
##                    0.0000000                    0.0000000 
##              Frequency_[R&B]              Frequency_[Rap] 
##                    0.0000000                    0.0000000 
##             Frequency_[Rock] Frequency_[Video_game_music] 
##                    0.0000000                    0.0000000 
##                      Anxiety                   Depression 
##                    0.0000000                    0.0000000 
##                     Insomnia                          OCD 
##                    0.0000000                    0.0000000 
##                Music_effects 
##                    1.0869565
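As a quick sanity check, the overall and BPM percentages can be reproduced by hand from the counts reported earlier (129 missing cells in total, 107 missing BPM values, 736 rows, 31 columns). This is only a sketch of the arithmetic; the variable names are illustrative and not part of the original analysis.

```r
# Counts taken from the sum(is.na()) and colSums(is.na()) outputs
n_missing_total <- 129
n_missing_bpm   <- 107
n_rows <- 736
n_cols <- 31

# Share of all cells that are missing, in percent
pct_missing_total <- 100 * n_missing_total / (n_rows * n_cols)

# Share of the BPM column that is missing, in percent
pct_missing_bpm <- 100 * n_missing_bpm / n_rows

round(c(overall = pct_missing_total, BPM = pct_missing_bpm), 4)
```

Both values agree with the `mean()` and `colMeans()` outputs, which confirms that those outputs are percentages rather than proportions.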

Questions formulated:

1. What is the makeup of preferred genres among those who participated in the survey? (Pie Chart)
2. What are the average self-reported anxiety scores of each genre based on age group? (Top Genres) (Bar Chart)
3. Is there a difference in mean self-reported anxiety scores between age groups for the most popular genre? (Rock)
4. Does listening to more music throughout the day affect self-reported depression levels in young adults? (ages 18 to 22)

EDA

str(surveyraw)

## tibble [736 × 31] (S3: tbl_df/tbl/data.frame)
##  $ Age                         : num [1:736] 18 63 18 61 18 18 18 21 19 18 ...
##  $ Primary_streaming_service   : chr [1:736] "Spotify" "Pandora" "Spotify" "YouTube Music" ...
##  $ Hours_per_day               : num [1:736] 3 1.5 4 2.5 4 5 3 1 6 1 ...
##  $ While_working               : chr [1:736] "Yes" "Yes" "No" "Yes" ...
##  $ Instrumentalist             : chr [1:736] "Yes" "No" "No" "No" ...
##  $ Composer                    : chr [1:736] "Yes" "No" "No" "Yes" ...
##  $ Fav_genre                   : chr [1:736] "Latin" "Rock" "Video game music" "Jazz" ...
##  $ Exploratory                 : chr [1:736] "Yes" "Yes" "No" "Yes" ...
##  $ Foreign_languages           : chr [1:736] "Yes" "No" "Yes" "Yes" ...
##  $ BPM                         : num [1:736] 156 119 132 84 107 86 66 95 94 155 ...
##  $ Frequency_[Classical]       : chr [1:736] "Rarely" "Sometimes" "Never" "Sometimes" ...
##  $ Frequency_[Country]         : chr [1:736] "Never" "Never" "Never" "Never" ...
##  $ Frequency_[EDM]             : chr [1:736] "Rarely" "Never" "Very frequently" "Never" ...
##  $ Frequency_[Folk]            : chr [1:736] "Never" "Rarely" "Never" "Rarely" ...
##  $ Frequency_[Gospel]          : chr [1:736] "Never" "Sometimes" "Never" "Sometimes" ...
##  $ Frequency_[Hip_hop]         : chr [1:736] "Sometimes" "Rarely" "Rarely" "Never" ...
##  $ Frequency_[Jazz]            : chr [1:736] "Never" "Very frequently" "Rarely" "Very frequently" ...
##  $ Frequency_[K_pop]           : chr [1:736] "Very frequently" "Rarely" "Very frequently" "Sometimes" ...
##  $ Frequency_[Latin]           : chr [1:736] "Very frequently" "Sometimes" "Never" "Very frequently" ...
##  $ Frequency_[Lofi]            : chr [1:736] "Rarely" "Rarely" "Sometimes" "Sometimes" ...
##  $ Frequency_[Metal]           : chr [1:736] "Never" "Never" "Sometimes" "Never" ...
##  $ Frequency_[Pop]             : chr [1:736] "Very frequently" "Sometimes" "Rarely" "Sometimes" ...
##  $ Frequency_[R&B]             : chr [1:736] "Sometimes" "Sometimes" "Never" "Sometimes" ...
##  $ Frequency_[Rap]             : chr [1:736] "Very frequently" "Rarely" "Rarely" "Never" ...
##  $ Frequency_[Rock]            : chr [1:736] "Never" "Very frequently" "Rarely" "Never" ...
##  $ Frequency_[Video_game_music]: chr [1:736] "Sometimes" "Rarely" "Very frequently" "Never" ...
##  $ Anxiety                     : num [1:736] 3 7 7 9 7 8 4 5 2 2 ...
##  $ Depression                  : num [1:736] 0 2 7 7 2 8 8 3 0 2 ...
##  $ Insomnia                    : num [1:736] 1 2 10 3 5 7 6 5 0 5 ...
##  $ OCD                         : num [1:736] 0 1 2 3 9 7 0 3 0 1 ...
##  $ Music_effects               : chr [1:736] NA NA "No effect" "Improve" ...

Looking at the top and bottom of the data shows that the data frame has no alphabetical ordering; observations appear in the order in which they were submitted through the survey.

head(surveyraw)

## # A tibble: 6 × 31
##     Age Primary_…¹ Hours…² While…³ Instr…⁴ Compo…⁵ Fav_g…⁶ Explo…⁷ Forei…⁸   BPM
##   <dbl> <chr>        <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <dbl>
## 1    18 Spotify        3   Yes     Yes     Yes     Latin   Yes     Yes       156
## 2    63 Pandora        1.5 Yes     No      No      Rock    Yes     No        119
## 3    18 Spotify        4   No      No      No      Video … No      Yes       132
## 4    61 YouTube M…     2.5 Yes     No      Yes     Jazz    Yes     Yes        84
## 5    18 Spotify        4   Yes     No      No      R&B     Yes     No        107
## 6    18 Spotify        5   Yes     Yes     Yes     Jazz    Yes     Yes        86
## # … with 21 more variables: `Frequency_[Classical]` <chr>,
## #   `Frequency_[Country]` <chr>, `Frequency_[EDM]` <chr>,
## #   `Frequency_[Folk]` <chr>, `Frequency_[Gospel]` <chr>,
## #   `Frequency_[Hip_hop]` <chr>, `Frequency_[Jazz]` <chr>,
## #   `Frequency_[K_pop]` <chr>, `Frequency_[Latin]` <chr>,
## #   `Frequency_[Lofi]` <chr>, `Frequency_[Metal]` <chr>,
## #   `Frequency_[Pop]` <chr>, `Frequency_[R&B]` <chr>, …

tail(surveyraw)

## # A tibble: 6 × 31
##     Age Primary_…¹ Hours…² While…³ Instr…⁴ Compo…⁵ Fav_g…⁶ Explo…⁷ Forei…⁸   BPM
##   <dbl> <chr>        <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <dbl>
## 1    21 Spotify          2 Yes     No      No      R&B     Yes     Yes       147
## 2    17 Spotify          2 Yes     Yes     No      Rock    Yes     Yes       120
## 3    18 Spotify          1 Yes     Yes     No      Pop     Yes     Yes       160
## 4    19 Other str…       6 Yes     No      Yes     Rap     Yes     No        120
## 5    19 Spotify          5 Yes     Yes     No      Classi… No      No        170
## 6    29 YouTube M…       2 Yes     No      No      Hip hop Yes     Yes        98
## # … with 21 more variables: `Frequency_[Classical]` <chr>,
## #   `Frequency_[Country]` <chr>, `Frequency_[EDM]` <chr>,
## #   `Frequency_[Folk]` <chr>, `Frequency_[Gospel]` <chr>,
## #   `Frequency_[Hip_hop]` <chr>, `Frequency_[Jazz]` <chr>,
## #   `Frequency_[K_pop]` <chr>, `Frequency_[Latin]` <chr>,
## #   `Frequency_[Lofi]` <chr>, `Frequency_[Metal]` <chr>,
## #   `Frequency_[Pop]` <chr>, `Frequency_[R&B]` <chr>, …

Looking at distribution of favorite music genre amongst the participants (Pie Chart will only show genres that have 30 or more votes)

surveyraw$Fav_genre <- gsub(" ", "_", surveyraw$Fav_genre)
surveytopgenres <- surveyraw %>% filter(Fav_genre %in% c("Metal", "Pop", "Classical", "EDM", "Folk", "Hip_hop", "R&B", "Rock", "Video_game_music"))
genre_counts <- table(surveytopgenres$Fav_genre)
piecol <- rainbow(length(genre_counts))
pie(genre_counts, labels = paste(names(genre_counts), sep = ""), main = "Favorite Music Genres", col = piecol, cex = 0.5, radius = 1)
legend("bottomleft", legend = paste(names(genre_counts), formatC(100 * genre_counts/sum(genre_counts), digits = 1, format = "f"), "%"),
       cex = 0.5, fill = piecol, title = "Genres")
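The genre list in the filter above is typed out by hand. As a hedged sketch, the same "30 or more votes" rule could instead be derived from the counts themselves. The toy vector `fav` and the names `counts`, `top_genres`, and `top_pct` are hypothetical and do not contain survey data.

```r
# Toy favorite-genre votes for illustration only (not survey data)
fav <- c(rep("Rock", 35), rep("Pop", 31), rep("Jazz", 4))

# Count votes per genre, then keep genres with at least 30 votes
counts <- table(fav)
top_genres <- names(counts[counts >= 30])

# Percentages among the kept genres only, as in the pie chart legend
top_pct <- round(100 * counts[top_genres] / sum(counts[top_genres]), 1)

top_genres
top_pct
```

With the real data, `top_genres` could be passed to `filter(Fav_genre %in% top_genres)` so the chart stays in sync if the threshold changes.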

We can see from this chart that Rock listeners make up the largest share of participants. Rock is a very versatile genre and is known by many, which may explain its prevalence.

Question 2. What are the average self-reported anxiety scores of each genre based on age group?

The age groups are as follows:

Description	Age Range
Kid	10 to 17
Young Adult	18 to 22
Adult	23 to 40
Mid-Life Adult	41 to 64
Elderly	65 +

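age_intervals <- c(0, 17, 22, 40, 64, Inf)
age_labels <- c("Kid", "Young Adult", "Adult", "Mid-Life Adult", "Elderly")
# Drop every row that has a missing value in any column
surveyraw <- na.omit(surveyraw)
# right = FALSE makes the bins left-closed: [0,17), [17,22), [22,40), [40,64), [64,Inf),
# so the boundary ages 17, 22, 40, and 64 fall into the older group
surveyraw$age_range <- cut(surveyraw$Age, breaks = age_intervals, labels = age_labels, right = FALSE)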
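Because `cut()` treats the break points as interval edges, it is worth checking where the boundary ages end up. The sketch below compares the breaks used above with a hypothetical alternative set (`alt_breaks`, not part of the original analysis) that would reproduce the age table exactly. All names here are illustrative.

```r
age_labels <- c("Kid", "Young Adult", "Adult", "Mid-Life Adult", "Elderly")
boundary_ages <- c(17, 18, 22, 23, 40, 41, 64, 65)

# Breaks as used in the analysis: [0,17), [17,22), [22,40), [40,64), [64,Inf)
used_breaks <- c(0, 17, 22, 40, 64, Inf)
as_used <- cut(boundary_ages, breaks = used_breaks, labels = age_labels, right = FALSE)

# Hypothetical alternative: [0,18), [18,23), [23,41), [41,65), [65,Inf)
alt_breaks <- c(0, 18, 23, 41, 65, Inf)
as_table <- cut(boundary_ages, breaks = alt_breaks, labels = age_labels, right = FALSE)

# Side-by-side view of how each boundary age is labelled
data.frame(age = boundary_ages, as_used = as_used, as_table = as_table)
```

Under the breaks used in the analysis, a 17-year-old is labelled "Young Adult", while the alternative breaks label the same respondent "Kid", matching the table. Switching breaks would change the group sizes and therefore the results reported below.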

Visual per age group

(The x-axis labels were causing a lot of trouble, so I color-coded the bars by favorite genre instead of listing the genres on the x-axis.)

mean_anx <- surveyraw %>% 
  group_by(Fav_genre, age_range) %>% 
  summarise(Mean_Anxiety = mean(Anxiety, na.rm = TRUE))

anx_kid <- mean_anx %>% filter(age_range == "Kid")
barplot(height = anx_kid$Mean_Anxiety,
        beside = TRUE,
        xlab = "Favorite Genre",
        ylab = "Mean Self-Reported Anxiety Scores",
        main = "Mean Self-Reported Anxiety Scores of Kids (10 to 17)",
        col = rainbow(length(unique(anx_kid$Fav_genre))),
        legend.text = unique(anx_kid$Fav_genre),
        args.legend = list(x = "topright", bty = "n", cex = 0.35))

anx_ya <- mean_anx %>% filter(age_range == "Young Adult")
barplot(height = anx_ya$Mean_Anxiety,
        beside = TRUE,
        ylim = c(0, 10),
        xlab = "Favorite Genre",
        ylab = "Mean Self-Reported Anxiety Scores",
        main = "Mean Self-Reported Anxiety Scores of Young Adults (18 to 22)",
        col = rainbow(length(unique(anx_ya$Fav_genre))),
        legend.text = unique(anx_ya$Fav_genre),
        args.legend = list(x = "topleft", bty = "n", cex = 0.3))

anx_adult <- mean_anx %>% filter(age_range == "Adult")
barplot(height = anx_adult$Mean_Anxiety,
        beside = TRUE,
        ylim = c(0, 10),
        xlab = "Favorite Genre",
        ylab = "Mean Self-Reported Anxiety Scores",
        main = "Mean Self-Reported Anxiety Scores of Adults (23 to 40)",
        col = rainbow(length(unique(anx_adult$Fav_genre))),
        legend.text = unique(anx_adult$Fav_genre),
        args.legend = list(x = "topleft", bty = "n", cex = 0.4))

anx_mla <- mean_anx %>% filter(age_range == "Mid-Life Adult")
barplot(height = anx_mla$Mean_Anxiety,
        beside = TRUE,
        ylim = c(0, 10),
        xlab = "Favorite Genre",
        ylab = "Mean Self-Reported Anxiety Scores",
        main = "Mean Self-Reported Anxiety Scores of Mid-Life Adults (41 to 64)",
        col = rainbow(length(unique(anx_mla$Fav_genre))),
        legend.text = unique(anx_mla$Fav_genre),
        args.legend = list(x = "topright", bty = "n", cex = 0.4))

anx_eld <- mean_anx %>% filter(age_range == "Elderly")
barplot(height = anx_eld$Mean_Anxiety,
        beside = TRUE,
        xlab = "Favorite Genre",
        ylab = "Mean Self-Reported Anxiety Scores",
        main = "Mean Self-Reported Anxiety Scores of the Elderly (65+)",
        col = rainbow(length(unique(anx_eld$Fav_genre))),
        legend.text = unique(anx_eld$Fav_genre),
        args.legend = list(x = "topright", bty = "n", cex = 1))
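One possible way to get the genre names onto the x-axis, instead of relying on the color legend, is to pass them through `names.arg` and rotate them with `las = 2`. The helper `plot_mean_anxiety()` and the toy data frame below are hypothetical; the values are made up for illustration and are not survey results.

```r
# Toy values for illustration only; these are NOT survey results
toy <- data.frame(
  Fav_genre    = c("Classical", "Hip_hop", "Rock", "Video_game_music"),
  Mean_Anxiety = c(5.0, 6.5, 6.0, 5.5)
)

# Hypothetical helper: one bar per genre with rotated x-axis labels
plot_mean_anxiety <- function(df, title) {
  old_par <- par(mar = c(9, 4, 4, 1))  # larger bottom margin for long labels
  on.exit(par(old_par))                # restore margins when the function exits
  barplot(height = df$Mean_Anxiety,
          names.arg = df$Fav_genre,    # genre names under each bar
          las = 2,                     # draw labels perpendicular to the axis
          cex.names = 0.8,
          ylim = c(0, 10),
          ylab = "Mean Self-Reported Anxiety Scores",
          main = title,
          col = rainbow(nrow(df)))
}

# barplot() returns the bar midpoints, one per genre
mids <- plot_mean_anxiety(toy, "Toy example")
```

The same helper could be called once per age group (for example with `anx_kid`), which would also replace the five near-identical `barplot()` calls.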

Data Exploration Analysis

What do our findings tell us about the data? There is plenty of information to dissect from our graphs about the distribution of favorite genres and about how mean self-reported anxiety scores vary by favorite genre within each age group. Rock, Pop, and Metal make up the majority of the selected favorites in the pie chart: among respondents whose favorite genre received 30 or more votes, 30.1% chose Rock, 18.3% chose Pop, and 14.1% chose Metal. This also gives us (to a very minimal extent) some insight into the listening habits of Washingtonians, given that the majority of people who completed the survey were Washington State locals. The latter portion of the analysis also gave us rich insight into mean self-reported anxiety scores by both age group and favorite genre. Of course, these means may not be representative of the population, since each one aggregates however many respondents happen to fall into that genre and age group, which may be very few. Nonetheless, the interpretation is an interesting one. Let us discuss the highest and lowest means in each age group.

Kids - The highest mean anxiety score was found among R&B fans, and the lowest among Rap fans. This trend follows my personal experience with music growing up as a teen in the early 2010s, as rap was getting more and more creative and unorthodox, while R&B had yet to innovate and had a “stuck in the past” feel to it that didn’t resonate with young crowds.

Young Adults - The highest mean anxiety score was found among Folk fans, and the lowest among Gospel fans. Gospel music is associated with Christianity, and many young adults find themselves in religion, which may contribute to bettering themselves and having more mental clarity. Folk music transcends generations and originated from tradition. It may be the case that some young adults are heavily surrounded by this music and, while they enjoy it, it somehow affects their anxiety levels. This would be a great topic for further study.

Adults - The highest mean anxiety scores were found among both K-Pop and Lofi fans, and the lowest among R&B fans. Adults can be reluctant to open their ears and hearts to newer styles of music, and K-Pop and Lofi are novel, semi-mainstream genres. R&B, however, is incredibly popular among this age group, especially for those who found love in the 2000s, as is the case for the latter half of this age group.

Mid-Life Adults - The highest mean anxiety scores were found among EDM fans, and the lowest among R&B fans. This one in particular made me chuckle. EDM is my favorite genre, and the number of times I’ve been told it’s obnoxious, loud, irritating, etc. is too many to count. R&B, however? A classic, and the genre that most closely resembles the cadences of songs from their childhood.

Elderly - The highest mean anxiety scores were found among Gospel fans, and the lowest among Rap fans. It’s difficult to even try to understand this one, as I am confident in saying that very few would have guessed it to be this way.

Question 3. Is there a difference between mean self-reported anxiety scores between age groups for the most popular genre? (Rock)

mean_anx_rock <- surveyraw %>% 
  filter(Fav_genre == "Rock")

Given that we are testing for differences in means across more than two groups, a one-way ANOVA is the most appropriate test.

$H_0: \mu_{\text{kid}} = \mu_{\text{young adult}} = \mu_{\text{adult}} = \mu_{\text{mid-life adult}} = \mu_{\text{elderly}}$

$H_a$: Not all the means are equal

The sample sizes for the different age ranges are not equal, which makes the classic ANOVA sensitive to unequal variances between the groups. We’ll therefore be using Welch’s ANOVA test, which does not assume equal variances.

Creating a visual for our statistical analysis

# Each box shows the spread of individual anxiety scores, not group means
boxplot(Anxiety ~ age_range, data = mean_anx_rock, 
        main = "Distribution of Anxiety Scores by Age Range",
        xlab = "Age Range",
        ylab = "Self-Reported Anxiety Score")

oneway.test(Anxiety ~ age_range, data = mean_anx_rock, var.equal = FALSE)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Anxiety and age_range
## F = 6.3068, num df = 4.000, denom df = 13.548, p-value = 0.004359

Our p-value is 0.004359, which is very small. This gives us evidence to reject the null hypothesis and conclude that, among respondents whose favorite genre is Rock, the mean self-reported anxiety score differs significantly for at least one age range. The test alone does not tell us which age ranges differ.
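To illustrate why the Welch version reports a fractional denominator degrees of freedom (13.548 in the output above), the sketch below runs both versions of `oneway.test()` on simulated data with unequal group sizes and spreads. The data frame `sim` is simulated purely for illustration and has nothing to do with the survey.

```r
set.seed(101)  # simulated data for illustration only, not survey results

# Three groups with unequal sizes (40, 12, 5) and unequal spreads
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(40, 12, 5))),
  score = c(rnorm(40, mean = 6, sd = 1),
            rnorm(12, mean = 5, sd = 3),
            rnorm(5,  mean = 4, sd = 2))
)

# Group sizes and standard deviations
sapply(split(sim$score, sim$group), function(v) c(n = length(v), sd = sd(v)))

# Classic one-way ANOVA (assumes equal variances) vs. Welch's version
classic <- oneway.test(score ~ group, data = sim, var.equal = TRUE)
welch   <- oneway.test(score ~ group, data = sim, var.equal = FALSE)

# Denominator degrees of freedom: n - k for the classic test,
# a variance-weighted (usually fractional) value for Welch's test
c(classic = unname(classic$parameter[2]), welch = unname(welch$parameter[2]))
```

The classic test uses 57 - 3 = 54 denominator degrees of freedom here, while Welch's test adjusts that number according to each group's size and variance.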

Question 4. Does listening to more music throughout the day affect self-reported depression levels in young adults? (ages 18 to 22)

survey_ya <- surveyraw %>% filter(age_range == "Young Adult")
plot(survey_ya$Hours_per_day, survey_ya$Depression, 
     main = "Depression Levels vs. Daily Hours Spent Listening to Music for Young Adults",
     xlab = "Hours Spent Listening to Music (Per Day)",
     ylab = "Self-Reported Depression Level")
abline(lm(survey_ya$Depression ~ survey_ya$Hours_per_day))

Looking at the linear model

# Store the fit as dep_model so the object does not mask the lm() function
dep_model <- lm(survey_ya$Depression ~ survey_ya$Hours_per_day, data = survey_ya)
summary(dep_model)

## 
## Call:
## lm(formula = survey_ya$Depression ~ survey_ya$Hours_per_day, 
##     data = survey_ya)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2319 -2.4711  0.0217  2.2753  5.5289 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.2176     0.2888  14.603   <2e-16 ***
## survey_ya$Hours_per_day   0.1268     0.0611   2.075    0.039 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.874 on 253 degrees of freedom
## Multiple R-squared:  0.01673,    Adjusted R-squared:  0.01285 
## F-statistic: 4.306 on 1 and 253 DF,  p-value: 0.039

Linear Model for our variables: Depression Score = 0.1268x + 4.2176, where x is the number of hours spent listening to music throughout the day.

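The p-value of this linear model is 0.039, so at an alpha level of 0.05, the relationship is statistically significant. Our R-squared value, however, is 0.01673 (adjusted: 0.01285), which indicates that hours of listening per day explains only about 1.67% of the variation in self-reported depression scores. This is very weak, but does not mean that there is no correlation at all. Because this is observational survey data, the model also cannot show that listening to more music causes higher depression scores.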
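To make the size of the effect concrete, the fitted line can be evaluated by hand using the intercept and slope from the coefficient table in the summary output. This is a small sketch; the function `predict_depression()` and the chosen hour values are illustrative only.

```r
# Coefficients copied from the Estimate column of the summary output
intercept <- 4.2176
slope     <- 0.1268   # change in depression score per extra hour of music

# Hypothetical helper: predicted depression score for a given number of hours
predict_depression <- function(hours) intercept + slope * hours

hours <- c(1, 5, 10)
data.frame(hours = hours, predicted = predict_depression(hours))

# With one predictor, the correlation is the square root of multiple
# R-squared, with the sign of the slope (positive here)
r_squared <- 0.01673
correlation <- sqrt(r_squared)
round(correlation, 3)
```

Going from 1 to 10 hours per day raises the predicted score by only about 1.14 points on the 0 to 10 scale, and the correlation of roughly 0.13 reflects how much scatter remains around the line.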

Ethical Concerns

The first ethical concern that struck me was that participants under the age of 18 submitted their information. They signed off on the “permissions” tab, which is their consent in allowing the school to use their information. While they remained anonymous, the lack of any safeguard is still a concern; there should be at least some effort to ensure that those who give their consent are of age to do so.

Another thing I would like to consider is how exactly this data is going to be used. While it was a fascinating exploration, it could cause some social damage if someone attempted to use this data to sway opinion about a particular music genre or listening habit. Lastly, I personally would have included a column on whether or not respondents had received a mental health diagnosis in the past. My concern is that, without it, we discount those who are perhaps better informed about how to rate their own levels; accounting for them could reduce the variance in the scores and sharpen our statistical calculations.