This dataset looks at survey results on music taste, music listening habits, and self-reported mental health scores for anxiety, depression, insomnia, and OCD. The dataset has 736 observations (participants) and, once the Timestamp and Permissions columns are dropped, 31 variables. The survey was created, distributed, and formatted into a data frame by computer science professor Catherine Rasgaitis at the University of Washington. The variables are as follows:
Variable Name | Description |
---|---|
Age | Respondent’s age |
Primary Streaming Service | The streaming app the respondent uses the most to listen to music |
Hours per day | Average hours the respondent spends listening to music per day |
While working | Whether or not the respondent listens to music while working |
Instrumentalist | Whether or not the respondent plays an instrument regularly |
Composer | Whether or not the respondent composes music |
Fav genre | The respondent’s favorite genre |
Exploratory | Whether or not the respondent actively explores new artists/genres |
Foreign languages | Whether or not the respondent regularly listens to music with lyrics in a language they are not fluent in |
BPM | Average BPM of some of their favorite songs |
Frequency [Classical, Country, EDM, Folk, Gospel, Hip Hop, Jazz, K Pop, Latin, Lofi, Metal, Pop, R&B, Rap, Rock, Video Game Music] (each in its own column) | Respondent’s rating of how often they listen to each music genre, chosen from: Never, Rarely, Sometimes, or Very frequently |
Anxiety, Depression, Insomnia, OCD (each in its own column) | Respondent’s self-reported rating of each condition on a scale from 0 to 10, where 0 = “I do not experience this.” and 10 = “I experience this regularly, constantly, or to an extreme.” |
Music effects | Whether or not music improves/worsens respondent’s mental health conditions |
library(tidyverse)
surveyraw <- read_csv("mxmhsurvey.csv")
surveyraw <- surveyraw %>% select(-c(Timestamp, Permissions))
names(surveyraw) <- gsub(" ", "_", names(surveyraw))
sum(is.na(surveyraw))
## [1] 129
A more detailed look:
colSums(is.na(surveyraw))
## Age Primary_streaming_service
## 1 1
## Hours_per_day While_working
## 0 3
## Instrumentalist Composer
## 4 1
## Fav_genre Exploratory
## 0 0
## Foreign_languages BPM
## 4 107
## Frequency_[Classical] Frequency_[Country]
## 0 0
## Frequency_[EDM] Frequency_[Folk]
## 0 0
## Frequency_[Gospel] Frequency_[Hip_hop]
## 0 0
## Frequency_[Jazz] Frequency_[K_pop]
## 0 0
## Frequency_[Latin] Frequency_[Lofi]
## 0 0
## Frequency_[Metal] Frequency_[Pop]
## 0 0
## Frequency_[R&B] Frequency_[Rap]
## 0 0
## Frequency_[Rock] Frequency_[Video_game_music]
## 0 0
## Anxiety Depression
## 0 0
## Insomnia OCD
## 0 0
## Music_effects
## 8
BPM is the column with the most missing data, which is no surprise. It was anticipated that many respondents would not go through the hassle of researching song data, so the question was left optional on the survey.
mean(is.na(surveyraw) * 100)
## [1] 0.5653927
This is about 0.57% of all cells (the value above is already expressed as a percentage), which is a very small portion of the dataset. Onto a more detailed look:
colMeans(is.na(surveyraw) * 100)
## Age Primary_streaming_service
## 0.1358696 0.1358696
## Hours_per_day While_working
## 0.0000000 0.4076087
## Instrumentalist Composer
## 0.5434783 0.1358696
## Fav_genre Exploratory
## 0.0000000 0.0000000
## Foreign_languages BPM
## 0.5434783 14.5380435
## Frequency_[Classical] Frequency_[Country]
## 0.0000000 0.0000000
## Frequency_[EDM] Frequency_[Folk]
## 0.0000000 0.0000000
## Frequency_[Gospel] Frequency_[Hip_hop]
## 0.0000000 0.0000000
## Frequency_[Jazz] Frequency_[K_pop]
## 0.0000000 0.0000000
## Frequency_[Latin] Frequency_[Lofi]
## 0.0000000 0.0000000
## Frequency_[Metal] Frequency_[Pop]
## 0.0000000 0.0000000
## Frequency_[R&B] Frequency_[Rap]
## 0.0000000 0.0000000
## Frequency_[Rock] Frequency_[Video_game_music]
## 0.0000000 0.0000000
## Anxiety Depression
## 0.0000000 0.0000000
## Insomnia OCD
## 0.0000000 0.0000000
## Music_effects
## 1.0869565
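The percentages above come from averaging a logical matrix, which is easy to misread by a factor of 100. The following is a minimal sketch of the same calculation on a made-up data frame (`toy` and its values are assumptions for illustration, not survey data), keeping the fraction and the percentage in separate variables.

```r
# Toy data frame standing in for the survey (values are made up).
toy <- data.frame(
  Age = c(18, NA, 25, 40),
  BPM = c(120, NA, NA, 98),
  Fav_genre = c("Rock", "Pop", NA, "Jazz")
)

# is.na() returns a logical matrix; its mean is the fraction of missing cells.
frac_missing <- mean(is.na(toy))   # 4 missing cells out of 12
pct_missing  <- 100 * frac_missing # the same quantity as a percentage

# Per-column percentages, sorted so the column with the most gaps comes first.
pct_by_col <- sort(100 * colMeans(is.na(toy)), decreasing = TRUE)

print(round(pct_missing, 2))
print(round(pct_by_col, 2))
```

Naming the two quantities separately makes it harder to report a percentage as if it were a fraction.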
str(surveyraw)
## tibble [736 × 31] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:736] 18 63 18 61 18 18 18 21 19 18 ...
## $ Primary_streaming_service : chr [1:736] "Spotify" "Pandora" "Spotify" "YouTube Music" ...
## $ Hours_per_day : num [1:736] 3 1.5 4 2.5 4 5 3 1 6 1 ...
## $ While_working : chr [1:736] "Yes" "Yes" "No" "Yes" ...
## $ Instrumentalist : chr [1:736] "Yes" "No" "No" "No" ...
## $ Composer : chr [1:736] "Yes" "No" "No" "Yes" ...
## $ Fav_genre : chr [1:736] "Latin" "Rock" "Video game music" "Jazz" ...
## $ Exploratory : chr [1:736] "Yes" "Yes" "No" "Yes" ...
## $ Foreign_languages : chr [1:736] "Yes" "No" "Yes" "Yes" ...
## $ BPM : num [1:736] 156 119 132 84 107 86 66 95 94 155 ...
## $ Frequency_[Classical] : chr [1:736] "Rarely" "Sometimes" "Never" "Sometimes" ...
## $ Frequency_[Country] : chr [1:736] "Never" "Never" "Never" "Never" ...
## $ Frequency_[EDM] : chr [1:736] "Rarely" "Never" "Very frequently" "Never" ...
## $ Frequency_[Folk] : chr [1:736] "Never" "Rarely" "Never" "Rarely" ...
## $ Frequency_[Gospel] : chr [1:736] "Never" "Sometimes" "Never" "Sometimes" ...
## $ Frequency_[Hip_hop] : chr [1:736] "Sometimes" "Rarely" "Rarely" "Never" ...
## $ Frequency_[Jazz] : chr [1:736] "Never" "Very frequently" "Rarely" "Very frequently" ...
## $ Frequency_[K_pop] : chr [1:736] "Very frequently" "Rarely" "Very frequently" "Sometimes" ...
## $ Frequency_[Latin] : chr [1:736] "Very frequently" "Sometimes" "Never" "Very frequently" ...
## $ Frequency_[Lofi] : chr [1:736] "Rarely" "Rarely" "Sometimes" "Sometimes" ...
## $ Frequency_[Metal] : chr [1:736] "Never" "Never" "Sometimes" "Never" ...
## $ Frequency_[Pop] : chr [1:736] "Very frequently" "Sometimes" "Rarely" "Sometimes" ...
## $ Frequency_[R&B] : chr [1:736] "Sometimes" "Sometimes" "Never" "Sometimes" ...
## $ Frequency_[Rap] : chr [1:736] "Very frequently" "Rarely" "Rarely" "Never" ...
## $ Frequency_[Rock] : chr [1:736] "Never" "Very frequently" "Rarely" "Never" ...
## $ Frequency_[Video_game_music]: chr [1:736] "Sometimes" "Rarely" "Very frequently" "Never" ...
## $ Anxiety : num [1:736] 3 7 7 9 7 8 4 5 2 2 ...
## $ Depression : num [1:736] 0 2 7 7 2 8 8 3 0 2 ...
## $ Insomnia : num [1:736] 1 2 10 3 5 7 6 5 0 5 ...
## $ OCD : num [1:736] 0 1 2 3 9 7 0 3 0 1 ...
## $ Music_effects : chr [1:736] NA NA "No effect" "Improve" ...
A look at the top and bottom of the data shows that the data frame is not sorted alphabetically; observations appear in the order in which the survey responses were submitted.
head(surveyraw)
## # A tibble: 6 × 31
## Age Primary_…¹ Hours…² While…³ Instr…⁴ Compo…⁵ Fav_g…⁶ Explo…⁷ Forei…⁸ BPM
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 18 Spotify 3 Yes Yes Yes Latin Yes Yes 156
## 2 63 Pandora 1.5 Yes No No Rock Yes No 119
## 3 18 Spotify 4 No No No Video … No Yes 132
## 4 61 YouTube M… 2.5 Yes No Yes Jazz Yes Yes 84
## 5 18 Spotify 4 Yes No No R&B Yes No 107
## 6 18 Spotify 5 Yes Yes Yes Jazz Yes Yes 86
## # … with 21 more variables: `Frequency_[Classical]` <chr>,
## # `Frequency_[Country]` <chr>, `Frequency_[EDM]` <chr>,
## # `Frequency_[Folk]` <chr>, `Frequency_[Gospel]` <chr>,
## # `Frequency_[Hip_hop]` <chr>, `Frequency_[Jazz]` <chr>,
## # `Frequency_[K_pop]` <chr>, `Frequency_[Latin]` <chr>,
## # `Frequency_[Lofi]` <chr>, `Frequency_[Metal]` <chr>,
## # `Frequency_[Pop]` <chr>, `Frequency_[R&B]` <chr>, …
tail(surveyraw)
## # A tibble: 6 × 31
## Age Primary_…¹ Hours…² While…³ Instr…⁴ Compo…⁵ Fav_g…⁶ Explo…⁷ Forei…⁸ BPM
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 21 Spotify 2 Yes No No R&B Yes Yes 147
## 2 17 Spotify 2 Yes Yes No Rock Yes Yes 120
## 3 18 Spotify 1 Yes Yes No Pop Yes Yes 160
## 4 19 Other str… 6 Yes No Yes Rap Yes No 120
## 5 19 Spotify 5 Yes Yes No Classi… No No 170
## 6 29 YouTube M… 2 Yes No No Hip hop Yes Yes 98
## # … with 21 more variables: `Frequency_[Classical]` <chr>,
## # `Frequency_[Country]` <chr>, `Frequency_[EDM]` <chr>,
## # `Frequency_[Folk]` <chr>, `Frequency_[Gospel]` <chr>,
## # `Frequency_[Hip_hop]` <chr>, `Frequency_[Jazz]` <chr>,
## # `Frequency_[K_pop]` <chr>, `Frequency_[Latin]` <chr>,
## # `Frequency_[Lofi]` <chr>, `Frequency_[Metal]` <chr>,
## # `Frequency_[Pop]` <chr>, `Frequency_[R&B]` <chr>, …
Looking at the distribution of favorite music genre among the participants (the pie chart only shows genres that have 30 or more votes):
# Replace spaces in genre names so they match the underscore style of the column names
surveyraw$Fav_genre <- gsub(" ", "_", surveyraw$Fav_genre)
# Keep only the genres that received 30 or more votes
surveytopgenres <- surveyraw %>% filter(Fav_genre %in% c("Metal", "Pop", "Classical", "EDM", "Folk", "Hip_hop", "R&B", "Rock", "Video_game_music"))
genre_counts <- table(surveytopgenres$Fav_genre)
piecol <- rainbow(length(genre_counts))
pie(genre_counts, labels = names(genre_counts), main = "Favorite Music Genres", col = piecol, cex = 0.5, radius = 1)
# Legend shows each genre's share of the genres plotted, to one decimal place
legend("bottomleft", legend = paste0(names(genre_counts), " ", formatC(100 * genre_counts/sum(genre_counts), digits = 1, format = "f"), "%"),
cex = 0.5, fill = piecol, title = "Genres")
We can see from this chart that respondents who favor Rock dominate the sample. Rock is a very versatile genre and is known by many, which may explain its prevalence.
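The 30-vote cutoff for the pie chart can also be computed from the counts rather than typed out by hand. The sketch below shows one way to do that on a made-up vector of favorite genres; `fav`, `min_votes`, and the counts are hypothetical and only illustrate the filtering and the percentage labels.

```r
# Made-up favorite-genre votes (not survey data).
fav <- rep(c("Rock", "Pop", "Jazz", "Latin"), times = c(50, 35, 12, 3))

counts    <- table(fav)  # votes per genre, in alphabetical order
min_votes <- 30          # hypothetical cutoff, matching the rule described above

# Names of the genres that reach the cutoff, and their counts.
top_genres <- names(counts)[counts >= min_votes]
top_counts <- counts[top_genres]

# Share of each kept genre among the kept genres only, to one decimal place.
pct <- as.vector(round(100 * top_counts / sum(top_counts), 1))
legend_labels <- paste0(top_genres, " ", formatC(pct, digits = 1, format = "f"), "%")

print(legend_labels)
```

Note that these shares are relative to the genres that pass the cutoff, not to all respondents, which is also how the legend percentages in the chart are computed.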
The age groups are as follows:
Description | Age Range |
---|---|
Kid | 10 to 17 |
Young Adult | 18 to 22 |
Adult | 23 to 40 |
Mid-Life Adult | 41 to 64 |
Elderly | 65 + |
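age_intervals <- c(0, 17, 22, 40, 64, Inf)
age_labels <- c("Kid", "Young Adult", "Adult", "Mid-Life Adult", "Elderly")
# Drop every row with a missing value in any column (BPM accounts for most missing values)
surveyraw <- na.omit(surveyraw)
# right = FALSE makes each bin left-closed: [0, 17), [17, 22), [22, 40), [40, 64), [64, Inf),
# so ages of exactly 17, 22, 40, or 64 land in the older group; right = TRUE would match the table exactly.
# cut() has no na.omit argument, so it is not passed; rows with NA were already removed above.
surveyraw$age_range <- cut(surveyraw$Age, breaks = age_intervals, labels = age_labels, right = FALSE)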
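Because the bin edges decide which group a boundary age such as 17 falls into, it may help to see how `cut()` treats them. The sketch below reuses the same breaks and labels on a made-up vector of ages (`ages` is an assumption for illustration) and compares `right = FALSE` with `right = TRUE`.

```r
# Hypothetical ages chosen to sit on and around each bin edge.
ages <- c(10, 17, 18, 22, 23, 40, 41, 64, 65)

age_intervals <- c(0, 17, 22, 40, 64, Inf)
age_labels <- c("Kid", "Young Adult", "Adult", "Mid-Life Adult", "Elderly")

# right = FALSE: bins are [lower, upper), so an age equal to a break moves up a group.
left_closed <- cut(ages, breaks = age_intervals, labels = age_labels, right = FALSE)

# right = TRUE: bins are (lower, upper], so an age equal to a break stays in the lower group.
right_closed <- cut(ages, breaks = age_intervals, labels = age_labels, right = TRUE)

comparison <- data.frame(age = ages, right_FALSE = left_closed, right_TRUE = right_closed)
print(comparison)
```

With these breaks, `right = TRUE` reproduces the ranges in the table (17 is a Kid, 64 is a Mid-Life Adult), while `right = FALSE` shifts each boundary age into the next group.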
Visual per age group
(x-axis labels were causing a lot of trouble so I color coded them as opposed to listing favorite genre on the x-axis)
mean_anx <- surveyraw %>%
group_by(Fav_genre, age_range) %>%
summarise(Mean_Anxiety = mean(Anxiety, na.rm = TRUE))
anx_kid <- mean_anx %>% filter(age_range == "Kid")
barplot(height = anx_kid$Mean_Anxiety,
beside = TRUE,
xlab = "Favorite Genre",
ylab = "Mean Self-Reported Anxiety Scores",
main = "Mean Self-Reported Anxiety Scores of Kids (10 to 17)",
col = rainbow(length(unique(anx_kid$Fav_genre))),
legend.text = unique(anx_kid$Fav_genre),
args.legend = list(x = "topright", bty = "n", cex = 0.35))
anx_ya <- mean_anx %>% filter(age_range == "Young Adult")
barplot(height = anx_ya$Mean_Anxiety,
beside = TRUE,
ylim = c(0, 10),
xlab = "Favorite Genre",
ylab = "Mean Self-Reported Anxiety Scores",
main = "Mean Self-Reported Anxiety Scores of Young Adults (18 to 22)",
col = rainbow(length(unique(anx_ya$Fav_genre))),
legend.text = unique(anx_ya$Fav_genre),
args.legend = list(x = "topleft", bty = "n", cex = 0.3))
anx_adult <- mean_anx %>% filter(age_range == "Adult")
barplot(height = anx_adult$Mean_Anxiety,
beside = TRUE,
ylim = c(0, 10),
xlab = "Favorite Genre",
ylab = "Mean Self-Reported Anxiety Scores",
main = "Mean Self-Reported Anxiety Scores of Adults (23 to 40)",
col = rainbow(length(unique(anx_adult$Fav_genre))),
legend.text = unique(anx_adult$Fav_genre),
args.legend = list(x = "topleft", bty = "n", cex = 0.4))
anx_mla <- mean_anx %>% filter(age_range == "Mid-Life Adult")
barplot(height = anx_mla$Mean_Anxiety,
beside = TRUE,
ylim = c(0, 10),
xlab = "Favorite Genre",
ylab = "Mean Self-Reported Anxiety Scores",
main = "Mean Self-Reported Anxiety Scores of Mid-Life Adults (41 to 64)",
col = rainbow(length(unique(anx_mla$Fav_genre))),
legend.text = unique(anx_mla$Fav_genre),
args.legend = list(x = "topright", bty = "n", cex = 0.4))
anx_eld <- mean_anx %>% filter(age_range == "Elderly")
barplot(height = anx_eld$Mean_Anxiety,
beside = TRUE,
xlab = "Favorite Genre",
ylab = "Mean Self-Reported Anxiety Scores",
main = "Mean Self-Reported Anxiety Scores of the Elderly (65+)",
col = rainbow(length(unique(anx_eld$Fav_genre))),
legend.text = unique(anx_eld$Fav_genre),
args.legend = list(x = "topright", bty = "n", cex = 1))
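One possible way around the x-axis label trouble is to pass the genre names through `names.arg`, rotate them with `las = 2`, and widen the bottom margin. This is only a sketch on made-up means (`anx_demo` and its values are hypothetical), not the plots shown above.

```r
pdf(NULL)  # null graphics device so the sketch runs without opening a window

# Made-up mean anxiety scores for a handful of genres (not survey results).
anx_demo <- data.frame(
  Fav_genre = c("Classical", "Hip_hop", "Rock", "Video_game_music"),
  Mean_Anxiety = c(5.2, 6.1, 6.0, 6.8)
)

# Enlarge the bottom margin so rotated labels are not clipped.
op <- par(mar = c(9, 4, 3, 1))

# names.arg labels each bar; las = 2 draws the labels perpendicular to the axis.
bar_mids <- barplot(height = anx_demo$Mean_Anxiety,
                    names.arg = anx_demo$Fav_genre,
                    las = 2, cex.names = 0.8, ylim = c(0, 10),
                    ylab = "Mean Self-Reported Anxiety Scores",
                    main = "Mean Anxiety by Favorite Genre (made-up data)")

par(op)               # restore the previous margins
invisible(dev.off())  # close the null device
```

Inside an R Markdown chunk, the `pdf(NULL)` and `dev.off()` lines can be dropped so the figure renders normally.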
What do our findings tell us about the data? There is plenty to dissect in our graphs about how favorite genres are distributed and how self-reported anxiety scores vary by favorite genre across age groups. Rock, Pop, and Metal make up the majority of the selected favorites among the genres shown in the pie chart (those with 30 or more votes): 30.1% chose Rock, 18.3% chose Pop, and 14.1% chose Metal. This also gives us (to a very minimal extent) some insight into the listening habits of Washingtonians, given that the majority of people who completed the survey were Washington State locals. The latter portion of the analysis gave us insight into mean self-reported anxiety scores by both age group and favorite genre. Of course, these means may not be representative of the population, since each one aggregates however many respondents happen to fall into that genre and age group, which can be very few. Nonetheless, the interpretation is an interesting one. Let us discuss the highest and lowest mean for each age group.
Kids - The highest mean anxiety score belonged to those whose favorite genre is R&B, and the lowest to those who favor Rap. This trend follows my personal experience with music growing up as a teen in the early 2010s, as rap was getting more and more creative and unorthodox, while R&B had yet to innovate and had a “stuck in the past” feel to it that didn’t resonate with young crowds.
Young Adults - The highest mean anxiety score belonged to those who favor Folk, and the lowest to those who favor Gospel. Gospel music is associated with Christianity, and many young adults find themselves in religion, which may contribute to bettering themselves and having more mental clarity. Folk music transcends generations and originated from tradition. It may be the case that some young adults are heavily surrounded by this music and, while they enjoy it, it somehow affects their anxiety levels. This would make a great study.
Adults - The highest mean anxiety scores belonged to those who favor K-Pop and Lofi, and the lowest to those who favor R&B. Adults can be reluctant to open their ears and hearts to newer styles of music, and K-Pop and Lofi are novel, semi-mainstream genres. R&B, however, is incredibly popular among this age group, especially for those who found love in the 2000s, as is the case for the latter half of this age group.
Mid-Life Adults - The highest mean anxiety score belonged to those who favor EDM, and the lowest to those who favor R&B. This one in particular made me chuckle. EDM is my favorite genre, and the number of times I’ve been told it’s obnoxious, loud, irritating, etc. is too many to count. R&B, however? A classic, and the genre that most closely resembles the cadences of songs from their childhood.
Elderly - The highest mean anxiety score belonged to those who favor Gospel, and the lowest to those who favor Rap. It’s difficult to even try to understand this one, as I am confident in saying that very few would have guessed it to be this way.
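# Despite its name, mean_anx_rock holds the individual responses (not means) of respondents whose favorite genre is Rock
mean_anx_rock <- surveyraw %>%
filter(Fav_genre == "Rock")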
Given that we are testing statistical differences in means between more than 2 groups, an ANOVA test is the most appropriate test.
$H_0: \mu_{\text{kid}} = \mu_{\text{young adult}} = \mu_{\text{adult}} = \mu_{\text{mid-life adult}} = \mu_{\text{elderly}}$
$H_a$: Not all the means are equal
The sample sizes for the different age ranges are not equal, which makes the classic one-way ANOVA more sensitive to unequal variances between groups, so we’ll be using Welch’s ANOVA test, which does not assume equal variances.
Creating a visual for our statistical analysis
boxplot(Anxiety ~ age_range, data = mean_anx_rock,
main = "Distribution of Anxiety Scores by Age Range (Favorite Genre: Rock)",
xlab = "Age Range",
ylab = "Self-Reported Anxiety Score")
oneway.test(Anxiety ~ age_range, data = mean_anx_rock, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: Anxiety and age_range
## F = 6.3068, num df = 4.000, denom df = 13.548, p-value = 0.004359
Our p-value is 0.004359, which is well below an alpha level of 0.05. This gives us evidence to reject the null hypothesis and conclude that, among respondents whose favorite genre is Rock, the mean self-reported anxiety scores are not all equal across age ranges; the differences are statistically significant.
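A quick way to see why the equal-variance assumption matters is to compare group sizes and variances, then run `oneway.test()` both ways. The sketch below does this on simulated data; the groups, sizes, and spreads are assumptions chosen for illustration and have nothing to do with the survey.

```r
set.seed(42)  # make the simulated data reproducible

# Three simulated groups with unequal sizes and unequal spreads.
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(80, 25, 10))),
  score = c(rnorm(80, mean = 5.0, sd = 1.0),
            rnorm(25, mean = 5.5, sd = 2.5),
            rnorm(10, mean = 6.0, sd = 4.0))
)

group_n   <- table(sim$group)                    # observations per group
group_var <- tapply(sim$score, sim$group, var)   # sample variance per group

# Classic one-way ANOVA assumes equal variances; Welch's version does not.
classic <- oneway.test(score ~ group, data = sim, var.equal = TRUE)
welch   <- oneway.test(score ~ group, data = sim, var.equal = FALSE)

print(group_n)
print(round(group_var, 2))
print(c(classic = classic$p.value, welch = welch$p.value))
```

When small groups also have the largest variances, the two p-values can differ noticeably, which is the situation Welch's test is meant to handle.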
survey_ya <- surveyraw %>% filter(age_range == "Young Adult")
plot(survey_ya$Hours_per_day, survey_ya$Depression,
main = "Depression Levels vs. Daily Hours Spent Listening to Music for Young Adults",
xlab = "Hours Spent Listening to Music (Per Day)",
ylab = "Self-Reported Depression Level")
abline(lm(survey_ya$Depression ~ survey_ya$Hours_per_day))
Looking at the linear model
# Store the fit under its own name so that it does not mask the lm() function
depr_fit <- lm(survey_ya$Depression ~ survey_ya$Hours_per_day, data = survey_ya)
summary(depr_fit)
##
## Call:
## lm(formula = survey_ya$Depression ~ survey_ya$Hours_per_day,
## data = survey_ya)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2319 -2.4711 0.0217 2.2753 5.5289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2176 0.2888 14.603 <2e-16 ***
## survey_ya$Hours_per_day 0.1268 0.0611 2.075 0.039 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.874 on 253 degrees of freedom
## Multiple R-squared: 0.01673, Adjusted R-squared: 0.01285
## F-statistic: 4.306 on 1 and 253 DF, p-value: 0.039
Linear model for our variables: Depression Score = 0.1268x + 4.2176, where x is the number of hours spent listening to music per day.
The p-value of this linear model is 0.039, so at an alpha level of 0.05, it is statistically significant. Our R-squared value, however, is 0.01673 (adjusted: 0.01285), which indicates that hours spent listening per day explains only about 1.7% of the variation in self-reported depression scores among young adults. This is a very weak relationship, but it does not mean that there is no correlation at all.
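To make the fitted line concrete, the coefficients can be read straight from the summary output: the `(Intercept)` estimate is 4.2176 and the `Hours_per_day` estimate is 0.1268 (0.0611 is its standard error). The sketch below plugs a few hypothetical listening times into that line; `predict_depression` and the chosen hours are illustrative names and values, not part of the original analysis.

```r
# Estimates copied from the summary output above.
intercept <- 4.2176   # (Intercept)
slope     <- 0.1268   # survey_ya$Hours_per_day
r_squared <- 0.01673  # Multiple R-squared

# Fitted depression score for a given number of listening hours per day.
predict_depression <- function(hours) intercept + slope * hours

hours     <- c(1, 3, 6)  # hypothetical listening times
predicted <- predict_depression(hours)
print(data.frame(hours = hours, predicted_depression = predicted))

# With a single predictor, the size of the correlation is sqrt(R^2).
r_magnitude <- sqrt(r_squared)
print(round(r_magnitude, 3))
```

Going from 1 to 6 hours per day moves the fitted score by only about 0.63 points on the 0 to 10 scale, which matches the weak relationship suggested by the R-squared.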
The first ethical concern that struck me was that participants under the age of 18 submitted their information. They signed off on the “permissions” tab, which is their consent to allow the school to use their information. While they remained anonymous, it still raises a concern that there was no safeguard showing at least an effort to ensure that those giving consent were of age to do so.
Also, something I would like to consider is how exactly this data is going to be used. While it was a fascinating exploration, it could cause some social damage if someone attempted to use this data to sway opinion about a particular music genre or listening habit. Lastly, I personally would have included a column on whether or not respondents have received a mental health diagnosis in the past. My concern is that, without it, we cannot distinguish respondents who are perhaps better informed about how to rate their own levels; accounting for this could reduce the variance in the scores and sharpen our statistical calculations.