read.csv("mxmh_survey_results.csv") -> music_mentalhealth
This data set was found on Kaggle. It was created by Catherine Rasgaitis, from the University of Washington in 2022. This data set is a voluntary sample taken from a Google Form. The form was posted across a variety of media including Reddit forums, different social media applications and posters/cards distributed and hung up in a myriad of public locations. The data on the form was collected for a period of approximately three months. Respondents were not restricted by age. Since the form was voluntary, there is definitely bias. On Kaggle, it explains the majority of respondents range from late teens to early twenties possibly because the survey was advertised on social media sites that attract this demographic. This is an observational study as subjects were not assigned to treatments. Questions were asked through an online survey using a Google Form. In the survey there were three blocks of questions. The survey first asked participants some general background questions in regard to music background/listening habits. Then participants were asked some questions regarding their music use (frequency of different genres). Lastly, they had to rate there mental health on a scale from 1-10.
In recent years, mental health has been a growing concern and there has been an increase in discussions on how best to combat mental health. As people who enjoy listening to music, we are interested in knowing if listening to music (a certain genre or specific amount of time) promotes positive mental health. Also, the percentage of college students experiencing mental illnesses such as anxiety and depression are staggering. We feel the class and anyone interested in promoting positive mental health should be interested in this data set.
Our hope in analyzing this data set through data visualizations and machine learning is to determine the effect in which music has on someones mental health.
1: Background - generic questions on musical background and listening habits.
2: Music genres - ranking of how often they listen to 16 different genres options include: Never, Rarely, Sometimes and Very frequently’
3: Mental Health - ranking Anxiety, Depression, Insomnia, and OCD on a scale 0 - “I do not experience this.” 10 - “I experience this regularly, constantly/or to an extreme”
Our first couple graphs give an overview of the data collected in the survey such as ages of participants, favorite streaming service and favorite genre of participants. The second set of graphs are used to look at the relationship of mental health and music.
Looking at age demographic of people surveyed
music_mentalhealth|>
group_by(Age) |>
summarize(counts = n()) -> count_age
count_age |>
arrange(-counts)
## # A tibble: 62 × 2
## Age counts
## <int> <int>
## 1 18 85
## 2 19 61
## 3 17 59
## 4 21 52
## 5 16 44
## 6 20 40
## 7 22 39
## 8 23 37
## 9 25 22
## 10 26 22
## # ℹ 52 more rows
ggplot(
data = music_mentalhealth,
mapping = aes(x = Age)
) +
# histogram with bin widths of 0.025 wide
geom_histogram(
bins = 30,
fill = "orchid",
color = "black"
) +
scale_y_continuous(
expand = c(0, 0, 0.05, 0)
) +
labs(title = "Age Demographic") -> age_demographic
The table above shows that the majority of the people surveyed were in their late teens/early twenties, which may skew the results for most popular music genre, hours spent listening to music, mental health ratings, etc.
Looking at the majority age group of people surveyed
music_mentalhealth |>
filter(Age == 18 | Age == 19 | Age == 17 | Age == 21 | Age == 16 | Age == 20 | Age == 22 | Age == 23) |>
select(-Timestamp) -> majority_surveyed
tibble(majority_surveyed)
## # A tibble: 417 × 32
## Age Primary.streaming.service Hours.per.day While.working Instrumentalist
## <int> <chr> <dbl> <chr> <chr>
## 1 18 Spotify 3 Yes Yes
## 2 18 Spotify 4 No No
## 3 18 Spotify 4 Yes No
## 4 18 Spotify 5 Yes Yes
## 5 18 YouTube Music 3 Yes Yes
## 6 21 Spotify 1 Yes No
## 7 19 Spotify 6 Yes No
## 8 18 I do not use a streaming s… 1 Yes No
## 9 18 Spotify 3 Yes Yes
## 10 19 YouTube Music 8 Yes No
## # ℹ 407 more rows
## # ℹ 27 more variables: Composer <chr>, Fav.genre <chr>, Exploratory <chr>,
## # Foreign.languages <chr>, BPM <int>, Frequency..Classical. <chr>,
## # Frequency..Country. <chr>, Frequency..EDM. <chr>, Frequency..Folk. <chr>,
## # Frequency..Gospel. <chr>, Frequency..Hip.hop. <chr>,
## # Frequency..Jazz. <chr>, Frequency..K.pop. <chr>, Frequency..Latin. <chr>,
## # Frequency..Lofi. <chr>, Frequency..Metal. <chr>, Frequency..Pop. <chr>, …
ggplot(data = majority_surveyed,
mapping = aes(x = Hours.per.day,
y = Age,
color = Age)) +
# adding points to create a scatter plot
geom_jitter(mapping = aes(color = factor(Age))) +
# removing the legend
theme(legend.position = "none") +
# adding a title
labs(title = "Hours of Music Listened to Per day by \n the Largest Age Group Surveyed",
x = "Hours Listened to Per Day"
) -> age_graph
# age_demographic + age_graph
age_demographic + age_graph
Based on the graph above, we know that the data is right skewed and the
majority of the people who participated in the survey were between the
ages of 16 and 23-years-old. We can also identify an outlier on the
graph. There was one 89-year-old surveyed and it’s important to note
that he claimed to listen to music 24 hours a day, listens to music
while working, and is an instrumentalist and composer. We then took a
closer look at the hours per day of the majority age group surveyed.
Respondents claim to whether or not music improved their mental health
Each participant was asked whether music improved, worsened, or had no effect on their mental health
music_mentalhealth |>
select(Age, Music.effects) -> effects_of_music
# graph the proportion
effects_of_music |>
filter(Music.effects != "") |> # disregard the people who did not answer the question
group_by(Music.effects) |>
summarize(counts = n()) |>
mutate(prop = counts/sum(counts)) -> # create new column in data set
Music.effects.prop
Music.effects.prop
## # A tibble: 3 × 3
## Music.effects counts prop
## <chr> <int> <dbl>
## 1 Improve 542 0.745
## 2 No effect 169 0.232
## 3 Worsen 17 0.0234
ggplot(data = Music.effects.prop,
mapping = aes(x = Music.effects,
y = prop,
fill = Music.effects)) +
geom_col(color = "black",
linewidth = 0.5) +
# changing the colors of the bars
scale_fill_manual(values = c("Improve" = "darkgreen", # green indicates improvement (positive)
"No effect" = "#800080", # purple indicates no effect (neutral)
"Worsen" = "tomato")) + # red indicates worsen (negative)
# And display percentages on the y-axis and remove the extra space on the bottom:
scale_y_continuous(labels = scales::label_percent(accuracy = 1),
expand = c(0, 0, 0.05, 0))+
labs(x = NULL,
y = NULL,
fill = NULL,
title = "Music's Impact on Mental Health Among proportion of Participant's Responses")+
# removing the legend from the graph
theme(legend.position = "none") +
# adding the percent to each bar on the graph
geom_text(mapping = aes(label = scales::percent(prop, accuracy = 1)),
size = 3,
color = "black",
vjust = -0.5)
Based on the bar graph above, the majority of people (74%) who were
surveyed found that listening to music improved their mental health
(positive impact). Only 2% of people who participated in the survey said
that listening to music worsens their mental health, and 23% claim that
listening to music had no effect on their mental health. Based off of
these results, we could assume that their was a positive association
with listening to music and an improvement in mental health, however,
the scatterplot that compared the strength rating of four mental
illnesses and time spent listening to music showed the opposite
association. The scatterplot shown below that the more time spent
listening to music, the worse the mental illness got (the
stronger/higher the rating of the mental illness got)
What is the most popular genre of music (favorite genre) among the people surveyed?
music_mentalhealth |>
filter(Primary.streaming.service != "")|>
select (Age, Primary.streaming.service) -> favorite_genre
# create a boxplot showing the favorite genre among the people surveyed
ggplot(data = favorite_genre,
mapping = aes(x = factor(Primary.streaming.service, levels = c("Spotify",
"YouTube Music",
"Apple Music",
"Pandora",
"Other streaming service",
"I do not use a streaming service")),
y = Age,
fill = Primary.streaming.service)) +
# Adding a boxplot with geom_boxplot
geom_boxplot(color = "black",
show.legend = F,
outlier.shape = NA)+
# Removing the x and y-axis titles
labs(x = "Primary Streaming Service",
y = "Age",
title = "Streaming Service by Age")+
scale_x_discrete(labels = c("Spotify",
"YouTube",
"Apple Music",
"Pandora",
"Other",
"None"))+
scale_y_continuous(breaks = seq(from = 0,
to = 80,
by = 10)) +
# changing the theme of the graph to black and white
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle=45, hjust=1)) -> streaming_service_age
# Question: What is the most popular streaming service among the people surveyed?
# determine the counts for favorite genre among the people surveyed
favorite_genre|>
group_by(Primary.streaming.service) |>
summarize(counts = n()) |>
arrange(-counts) -> favorite.genre.by.age
favorite.genre.by.age
## # A tibble: 6 × 2
## Primary.streaming.service counts
## <chr> <int>
## 1 Spotify 458
## 2 YouTube Music 94
## 3 I do not use a streaming service. 71
## 4 Apple Music 51
## 5 Other streaming service 50
## 6 Pandora 11
ggplot(data = favorite.genre.by.age,
mapping = aes(x = factor(Primary.streaming.service, levels = c("Spotify",
"YouTube Music",
"Apple Music",
"Pandora",
"Other streaming service",
"I do not use a streaming service")),
y = counts,
fill = Primary.streaming.service)) +
geom_bar(stat = "identity",
position = "dodge",
color = "black")+
# Changing the x and y-axis titles and adding a title to the graph
labs(x = "Primary Streaming Service",
y = "Count",
title = "Favorite Streaming Service \n Among All Who Were Surveyed",
labeller = labeller(Primary.streaming.service = c("I do not use a streaming service" = "None",
"Other streaming service" = "Other")))+
# changing the theme of the graph to black and white
theme_bw() +
# change column names
scale_x_discrete(labels = c("Spotify",
"YouTube",
"Apple Music",
"Pandora",
"Other",
"None")) +
# removing the legend from the graph
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none",
axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(expand = c(0, 0, 0.05, 0)) -> bar.graph1
favorite_genre |>
filter(Primary.streaming.service == "Pandora") |>
arrange(-Age)-> favorite.pandora
favorite.pandora
## Age Primary.streaming.service
## 1 73 Pandora
## 2 69 Pandora
## 3 68 Pandora
## 4 63 Pandora
## 5 61 Pandora
## 6 60 Pandora
## 7 53 Pandora
## 8 36 Pandora
## 9 32 Pandora
## 10 22 Pandora
## 11 15 Pandora
# since spotify was the most popular primary streaming service (greatest number of counts), look at the ages of the participants who chose spotify
favorite_genre |>
filter(Primary.streaming.service == "Spotify") |>
arrange(Age) -> favorite.spotify
tibble(favorite.spotify)
## # A tibble: 458 × 2
## Age Primary.streaming.service
## <int> <chr>
## 1 12 Spotify
## 2 12 Spotify
## 3 13 Spotify
## 4 13 Spotify
## 5 13 Spotify
## 6 13 Spotify
## 7 13 Spotify
## 8 14 Spotify
## 9 14 Spotify
## 10 14 Spotify
## # ℹ 448 more rows
# look at the counts for each age that chose spotify as the primary streaming service (favorite streaming service)
favorite.spotify|>
group_by(Age) |>
summarize(counts = n())|>
arrange(-counts) -> counts.spotify
counts.spotify
## # A tibble: 42 × 2
## Age counts
## <int> <int>
## 1 18 58
## 2 19 44
## 3 17 43
## 4 16 33
## 5 21 32
## 6 22 25
## 7 23 24
## 8 20 22
## 9 26 17
## 10 15 15
## # ℹ 32 more rows
# streaming_service_age + bar.graph1
streaming_service_age + bar.graph1
Observations:
Based on the bar chart, a majority of the participants who were surveyed chose Spotify as their favorite streaming service (458/735 = 62.3%). We looked deeper into the data and among all who voted for Spotify as their favorite streaming service, 58 were 18-year-olds, 44 were 19-year-olds, 43 were 17-year-olds, 33 were 16-year-olds, and 32 were 21-year-olds. The boxplot showing the relationship between age and favorite streaming service also shows that the majority of people who chose Spotify had a median age of 20. It’s also important to note that the majority of participants who were surveyed were among this age demographic. So if an older demographic was surveyed, the graphs could potentially look different.
Pandora had the lowest count for favorite streaming service (11 total people). Based on the boxplot, Pandora was chosen as primary streaming service among older participants. We looked closer at the data, and among the 11 total people surveyed 4 were under the age of 40 (15, 22, 32, and 36) and the other 7 were over the age of 50 (53, 60, 61, 63, 68, 69, 73). Pandora has the greatest age spread among streaming services.
music_mentalhealth |>
select(Age, Fav.genre)|>
group_by(Fav.genre) |>
summarize(counts = n()) |>
arrange(desc(counts)) -> counts.fav.genre
counts.fav.genre
## # A tibble: 16 × 2
## Fav.genre counts
## <chr> <int>
## 1 Rock 188
## 2 Pop 114
## 3 Metal 88
## 4 Classical 53
## 5 Video game music 44
## 6 EDM 37
## 7 Hip hop 35
## 8 R&B 35
## 9 Folk 30
## 10 K pop 26
## 11 Country 25
## 12 Rap 22
## 13 Jazz 20
## 14 Lofi 10
## 15 Gospel 6
## 16 Latin 3
# creating the bar graph to display favorite genre of music
ggplot(data = counts.fav.genre,
mapping = aes(x = fct_reorder(Fav.genre, counts),
y = counts)) +
geom_col(fill = "seagreen",
color = "black") +
# adding a label to the x-axis and a title to the graph
labs(x = "Favorite Genre",
fill = NULL,
title = "Favorite Genre Among People Surveyed")+
# removing the legend from the graph
theme(legend.position = "none",
axis.text.x = element_text(angle= 45, hjust=1)) +
scale_y_continuous(breaks = seq(from = 0,
to = 300,
by = 50),
expand = c(0, 0, 0.05, 0))
Observations: Based on the graph, Rock was the most popular genre
(favorite genre) of music among the people surveyed (188 people chose
rock).
The code below contains cleaning original data set:
useable_data <-
music_mentalhealth |>
# Cleaning all the column names
clean_names() |>
# selecting columns to keep
select(age, bpm, hours_per_day, fav_genre, contains("frequency"), anxiety, depression, insomnia, ocd)
The code below re-formats the useable_data to be used for a set of bar charts which looks at the frequency of different genres of music listened to for different mental illnesses
clean_data <-
useable_data |>
# using pivot longer to create a column with all of the frequencies together
pivot_longer(
cols = starts_with("frequency"),
names_to = "music_genre",
values_to = "frequency_rating") |>
# using pivot longer to create a column with all of the ratings of mental illnesses together
pivot_longer(
cols = c(anxiety, depression, insomnia, ocd),
names_to = "mental_illness_type",
values_to = "strength_rating") |>
# only keeping mental illnesses that are "severe" (people have ranked them 7 through 10)
filter(strength_rating >= 7)
tibble(useable_data)
## # A tibble: 736 × 24
## age bpm hours_per_day fav_genre frequency_classical frequency_country
## <int> <int> <dbl> <chr> <chr> <chr>
## 1 18 156 3 Latin Rarely Never
## 2 63 119 1.5 Rock Sometimes Never
## 3 18 132 4 Video game m… Never Never
## 4 61 84 2.5 Jazz Sometimes Never
## 5 18 107 4 R&B Never Never
## 6 18 86 5 Jazz Rarely Sometimes
## 7 18 66 3 Video game m… Sometimes Never
## 8 21 95 1 K pop Never Never
## 9 19 94 6 Rock Never Very frequently
## 10 18 155 1 R&B Rarely Rarely
## # ℹ 726 more rows
## # ℹ 18 more variables: frequency_edm <chr>, frequency_folk <chr>,
## # frequency_gospel <chr>, frequency_hip_hop <chr>, frequency_jazz <chr>,
## # frequency_k_pop <chr>, frequency_latin <chr>, frequency_lofi <chr>,
## # frequency_metal <chr>, frequency_pop <chr>, frequency_r_b <chr>,
## # frequency_rap <chr>, frequency_rock <chr>,
## # frequency_video_game_music <chr>, anxiety <dbl>, depression <dbl>, …
Since making facets for 16 different genres would not look neat, I used the code below to create two different data sets, each of eight genres to then create two separate graphs for.
# creating a data set with the first 8 genres
genre1 <-
clean_data |>
filter(music_genre %in% c("frequency_classical", "frequency_country", "frequency_edm", "frequency_folk", "frequency_gospel", "frequency_hip_hop", "frequency_jazz", "frequency_k_pop"))
# creating a data set with the second 8 genres
genre2 <-
clean_data |>
filter(music_genre %in% c("frequency_latin", "frequency_lofi", "frequency_metal", "frequency_pop", "frequency_r_b", "frequency_rap", "frequency_rock", "frequency_video_game_music"))
The next set of code and graphs examine the relationship between mental health and different aspects of the participants music listening behaviors.
The code chunk below contains the code used to create the graph that looks at music frequency by type of mental illness
# Creating a plot of the data
ggplot(data = genre1,
mapping = aes(x = mental_illness_type,
fill = as.factor(frequency_rating))) +
# creating the bar charts
geom_bar(position = "fill", # using the proportions instead of counts
color = "black") + # outlining the bars in black
# creating facets for each genre of music
facet_wrap(facets = ~ music_genre,
ncol = 4,
scales = "fixed",
labeller = labeller(music_genre = c( # changing titles of facets
"frequency_classical" = "Classical",
"frequency_country" = "Country",
"frequency_edm" = "EDM",
"frequency_folk" = "Folk",
"frequency_gospel" = "Gospel",
"frequency_hip_hop" = "Hip Hop",
"frequency_jazz" = "Jazz",
"frequency_k_pop" = "K-Pop",
"frequency_latin" = "Latin",
"frequency_lofi" = "Lo-fi",
"frequency_metal" = "Metal",
"frequency_pop" = "Pop",
"frequency_r_b" = "R&B",
"frequency_rap" = "Rap",
"frequency_rock" = "Rock",
"frequency_video_game_music" = "Video Game"))) +
# changing the color of the values
scale_fill_manual(values = c("seagreen", "steelblue", "darkorange", "darkolivegreen2")) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # angling the axis text
panel.grid.major.x = element_blank(), # removing unnecessary grid lines
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank()) +
# adding and removing labels
labs(title = "Music Frequency by Type of Mental Illness",
x = "",
y = NULL,
fill = "") +
# expanding the length of the bars and adding percents on y-axis
scale_y_continuous(expand = c(0, 0, 0.05, 0),
labels = scales::percent,
limits = c(0, 1) # Adjust the limits to exclude 0
) -> graph1
graph1
# creating the second graph with the other 8 genres
graph1 %+% genre2 -> graph2
graph2
These bar graphs demonstrate that the type of mental illness someone has
ranked as experiencing severely (7-10), does not determine the frequency
in which they listen to a certain genre of music. This is displayed by
the proportions being relatively the same for each mental illness within
the different genres of music. However, from the graphs, we are able to
tell which genres are more popular than others. For example, gospel and
Latin music seemed to be listened to the least frequent whereas rock and
pop appear to be listened to with the most frequency.
The code below is used to clean the data to make a scatter plot with trendlines that compares the strength ratings of the four mental illness with the number of hours of music listened to
clean_data2 <-
useable_data |>
# using pivot longer to create a column with all of the frequencies together
pivot_longer(
cols = starts_with("frequency"),
names_to = "music_genre",
values_to = "frequency_rating") |>
pivot_longer(
cols = c(anxiety, depression, insomnia, ocd),
names_to = "mental_illness_type",
values_to = "strength_rating") |>
# removing certain columns
select(-age, -fav_genre, -music_genre, -frequency_rating) |>
# calculating the mean hours of music listened to per day for each mental illness and strength rating
group_by(mental_illness_type, strength_rating) |>
summarise(hours_per_day = mean(hours_per_day)) |>
filter(hours_per_day <= 4.75)
clean_data2
## # A tibble: 46 × 3
## # Groups: mental_illness_type [4]
## mental_illness_type strength_rating hours_per_day
## <chr> <dbl> <dbl>
## 1 anxiety 0 3.9
## 2 anxiety 1 2.80
## 3 anxiety 2 3.91
## 4 anxiety 3 3.16
## 5 anxiety 4 3.14
## 6 anxiety 5 3.77
## 7 anxiety 6 3.18
## 8 anxiety 7 3.52
## 9 anxiety 7.5 4
## 10 anxiety 8 3.91
## # ℹ 36 more rows
The code below creates a graph that compares the number of hours of music listened to per day with the strength ratings of mental illness. All four mental illnesses are examined to determine if there is an association.
# creating the plot
ggplot(data = clean_data2,
mapping = aes(x = hours_per_day,
y = strength_rating,
color = mental_illness_type)) +
# adding points to create a scatter plot
geom_point() +
# adding a trend line to the graph
geom_smooth(method = "lm",
se = FALSE,
linewidth = 0.5,
fullrange = T) +
# changing the color choice
scale_color_manual(values = c("orangered", "deepskyblue3", "navy", "plum")) +
# changing the labels on the x-axis and y-axis and adding a title to the graph and changing the title of the legend
labs(
y = "Strength Rating of Mental Illness (0-10)",
x = "Hours of Music Listened To (per day)",
title = "Comparing the Strength Rating of Four Different Mental Illnesses \n to the Number of Hours of Music Listened to",
color = "Type of Mental Illness") +
scale_y_continuous(breaks = c(0:10)) +
theme(plot.title = element_text(size = 14),
legend.position = "bottom",
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank())
The data produced from the graph demonstrates that with more hours of
music listened to per day there is a slight increase in strength of a
mental illness. It is important to keep in mind that there is only about
a two hour difference between the minimum and maximum of the average
hours listened to for each mental illness. It appears as though the four
mental illnesses rankings increase at about the same rate.
While our data set is a survey, we wanted to determine if there is an association between the effects of music and mental health (music therapy). While the survey focuses on four types of mental illnesses (anxiety, depression, insomnia, ocd), for our machine learning method we decided to just focus on predicting anxiety levels among participants, due to anxiety being the most prevalent mental illness among those surveyed. We first created a multiple linear regression, followed by a regression tree in hopes of determining an association between anxiety and several variables relating with music.
Below, is the code used to clean the data for the machine learning methods
useable_data |>
select(age, bpm, hours_per_day, fav_genre, anxiety) |>
# combining Hip Hop and R&B into one genre to maximize the data
mutate(
fav_genre = if_else(fav_genre %in% c("Hip Hop", "R&B"),
"Hip/R&B",
fav_genre)) |>
filter(fav_genre %in% c("Rock", "Pop", "Metal", "Classical", "Hip/R&B")) -> mlr_data
tibble(mlr_data)
## # A tibble: 478 × 5
## age bpm hours_per_day fav_genre anxiety
## <int> <int> <dbl> <chr> <dbl>
## 1 63 119 1.5 Rock 7
## 2 18 107 4 Hip/R&B 7
## 3 19 94 6 Rock 2
## 4 18 155 1 Hip/R&B 2
## 5 17 NA 2 Pop 7
## 6 19 118 5 Hip/R&B 6
## 7 18 79 2 Pop 3
## 8 16 84 3 Rock 10
## 9 18 169 2 Pop 7
## 10 14 136 12 Rock 8
## # ℹ 468 more rows
For the machine learning we selected age, bpm, hours_per_day, and anxiety levels among the most popular genres selected by the participants. Based on the bar graph displaying “Favorite Genre Among People Surveyed”, we selected Rock, Pop, Metal, Classical, and combined the participants who chose either Hip or R&B.
pacman::p_load(GGally)
# using ggpairs to create a scatter plot matrix
mlr_data|>
ggpairs()
When examining the outputs in which show the correlation for anxiety, it
is evident that there is not really a predictor of how someone will rank
their anxiety on a scale from 1-10. Based on the multiple plots, there
was no correlation between age, bpm, hours_per_day, or fav_genre and
anxiety ranking.
# Fitting the linear regression model and save it as anxiety_lm
anxiety_lm <-
lm(
formula = anxiety ~ .,
data = mlr_data
)
summary(anxiety_lm)
##
## Call:
## lm(formula = anxiety ~ ., data = mlr_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1108 -1.8814 0.3447 1.8861 5.8618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.334e+00 7.144e-01 8.866 < 2e-16 ***
## age -5.656e-02 1.110e-02 -5.097 5.4e-07 ***
## bpm -2.538e-05 4.047e-03 -0.006 0.9950
## hours_per_day 5.302e-02 5.034e-02 1.053 0.2928
## fav_genreHip/R&B 1.293e-01 6.478e-01 0.200 0.8419
## fav_genreMetal 4.441e-01 5.316e-01 0.835 0.4040
## fav_genrePop 9.192e-01 5.039e-01 1.824 0.0689 .
## fav_genreRock 1.154e+00 4.802e-01 2.403 0.0167 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.677 on 388 degrees of freedom
## (82 observations deleted due to missingness)
## Multiple R-squared: 0.08533, Adjusted R-squared: 0.06883
## F-statistic: 5.171 on 7 and 388 DF, p-value: 1.205e-05
pacman::p_load(regclass)
# calculating the variance inflation factor
VIF(anxiety_lm)
## GVIF Df GVIF^(1/(2*Df))
## age 1.025079 1 1.012462
## bpm 1.064309 1 1.031653
## hours_per_day 1.020989 1 1.010440
## fav_genre 1.098462 4 1.011808
All of the VIFs were low, therefore there is not an issue with multicollinearity.
pacman::p_load(broom)
# creating three linear models with different complexities
# least complex
anxiety_lm1 <-
lm(formula = anxiety ~ fav_genre,
data = mlr_data)
# Middle Model
anxiety_lm2 <-
lm(formula = anxiety ~ fav_genre + age,
data = mlr_data)
# Most complicated model
anxiety_lm3 <-
lm(formula = anxiety ~ fav_genre + age + hours_per_day + bpm,
data = mlr_data)
# use augment_columns() in the broom package to calculate the fit stats ourselves (add the predictions and residuals)
music_lm1 <-
augment_columns(x = anxiety_lm1,
data = mlr_data)
music_lm1
## # A tibble: 478 × 12
## age bpm hours_per_day fav_genre anxiety .fitted .se.fit .resid .hat
## <int> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 119 1.5 Rock 7 6.12 0.203 0.878 0.00532
## 2 18 107 4 Hip/R&B 7 5.17 0.470 1.83 0.0286
## 3 19 94 6 Rock 2 6.12 0.203 -4.12 0.00532
## 4 18 155 1 Hip/R&B 2 5.17 0.470 -3.17 0.0286
## 5 17 NA 2 Pop 7 6.07 0.260 0.925 0.00877
## 6 19 118 5 Hip/R&B 6 5.17 0.470 0.829 0.0286
## 7 18 79 2 Pop 3 6.07 0.260 -3.07 0.00877
## 8 16 84 3 Rock 10 6.12 0.203 3.88 0.00532
## 9 18 169 2 Pop 7 6.07 0.260 0.925 0.00877
## 10 14 136 12 Rock 8 6.12 0.203 1.88 0.00532
## # ℹ 468 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
music_lm1 |>
dplyr::select(fav_genre, anxiety, .fitted, .resid)
## # A tibble: 478 × 4
## fav_genre anxiety .fitted .resid
## <chr> <dbl> <dbl> <dbl>
## 1 Rock 7 6.12 0.878
## 2 Hip/R&B 7 5.17 1.83
## 3 Rock 2 6.12 -4.12
## 4 Hip/R&B 2 5.17 -3.17
## 5 Pop 7 6.07 0.925
## 6 Hip/R&B 6 5.17 0.829
## 7 Pop 3 6.07 -3.07
## 8 Rock 10 6.12 3.88
## 9 Pop 7 6.07 0.925
## 10 Rock 8 6.12 1.88
## # ℹ 468 more rows
The .fitted column gives us the predicted value and the .resid gives us the residuals.
# combining the fit statistics together
bind_rows(
glance(anxiety_lm1),
glance(anxiety_lm2),
glance(anxiety_lm3)
) |>
mutate(
explanatories = c(as.character(formula(anxiety_lm1))[3],
as.character(formula(anxiety_lm2))[3],
as.character(formula(anxiety_lm3))[3])
) |>
dplyr::select(explanatories, r.squared, adj.r.squared, sigma)
## # A tibble: 3 × 4
## explanatories r.squared adj.r.squared sigma
## <chr> <dbl> <dbl> <dbl>
## 1 fav_genre 0.0229 0.0146 2.78
## 2 fav_genre + age 0.0773 0.0675 2.70
## 3 fav_genre + age + hours_per_day + bpm 0.0853 0.0688 2.68
All three model fit statistics have extremely low r.squared values. Anxiety_lm3 had the “largest” r.squared value (R2 = 0.085). Sigma is how much we are off by (so our predictions are ~2.68 off which is not good).
We want to find the simplest model that still fits well. Although none of the models exactly fit the data well (all have a very low r.squared value), the best option would be the second model using favorite genre and age to predict anxiety levels. We can rule out the first model because the r.squared is noticeably lower than the other two models. The second and third models fit about equally well (similar r.squared values). Since the second model is simpler than the third model (2 variables vs 4 variables), we should use the second model.
augment_columns(
x = anxiety_lm2,
data = mlr_data
) |>
# Making a residual plot
ggplot(mapping = aes(x = .fitted, # .fitted is our predicted values
y = .resid)) + # .resid is our residual values
# creating a scatter plot
geom_point() +
# adding a horizontal line
geom_hline(yintercept = 0,
color = "red",
linewidth = 1) +
# Making an r-squared (R2) plot: actual (y) vs predicted (y-hat)
augment_columns(
x = anxiety_lm2,
data = mlr_data
) |>
ggplot(
mapping = aes(
x = .fitted, # predicted values
y = anxiety # observed (actual) values
)
) +
# making a scatter plot
geom_point() +
# adding the trend line
geom_smooth(method = "lm",
color = "red",
se = F,
linewidth = 1)
The residual plot shows the correlation between the predicted values
(.fitted) and the residual values. The r-squared (R2) plot shows the
correlation between the predicted values (.fitted) and observed/actual
values (anxiety). Both plots show a non-linear association. The points
do not align with the horizontal line for the residual plot, nor do the
data points align with the linear trendline for the r-squared plot.
pacman::p_load(rpart)
pacman::p_load(rpart.plot)
#Fit the full regression tree using function `rpart()` in the eponymous package, `rpart`.
anxiety_full <-
rpart(
formula = anxiety ~ .,
data = mlr_data,
method = "anova", # use anova for regression tree instead of "class" ("class" used for categorical response)
minsplit = 2,
minbucket = 1,
cp = -1
)
# Looking at the cp table
cptable_anxiety <-
anxiety_full$cptable |>
data.frame()
tibble(cptable_anxiety)
## # A tibble: 149 × 5
## CP nsplit rel.error xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0869 0 1 1.00 0.0506
## 2 0.0233 1 0.913 0.934 0.0517
## 3 0.0146 2 0.890 0.942 0.0520
## 4 0.0126 3 0.875 0.978 0.0569
## 5 0.0110 5 0.850 0.999 0.0586
## 6 0.00993 7 0.828 1.02 0.0606
## 7 0.00913 11 0.788 1.04 0.0639
## 8 0.00841 12 0.779 1.09 0.0681
## 9 0.00805 13 0.771 1.14 0.0712
## 10 0.00780 14 0.763 1.16 0.0723
## # ℹ 139 more rows
# finding the CP value
plotcp(anxiety_full)
The Complexity Parameter of a decision tree measures how much the tree
is over fitting the data. A small dip in cp value means that the tree is
less complex and more generalization, which is good.
# finding the xerror cut off
anxiety_full$cptable |>
data.frame() |>
# tree with the lowest xerror
slice_min(xerror, n = 1, with_ties = F) |>
# adding xerror and xstd to create our xcutoff
mutate(xcutoff = xerror + xstd) |>
# saving the xcuttoff value
pull(xcutoff) -> xcutoff
# finding the cp value to prune the tree:
anxiety_full$cptable |>
data.frame() |>
# Finding all the trees with an xerror below our xcutoff
filter(xerror < xcutoff) |>
# picking the simplist tree
slice(1) |>
# saving the cp value as a single number
pull(CP) -> cp_prune
c(xcutoff, cp_prune)
## [1] 0.98557300 0.02328773
# pruning the tree
anxiety_tree <-
prune(
tree = anxiety_full,
cp = cp_prune
)
# creating a basic decision tree diagram
rpart.plot(anxiety_tree,
type = 5,
digits = 3)
# a few adjustments to the diagram
rpart.plot(anxiety_tree,
digits = 4,
fallen.leaves = FALSE,
type = 5,
extra = 101)
# Other adjustments to the diagram
rpart.plot(anxiety_tree,
digits = 4,
fallen.leaves = TRUE,
type = 5,
box.palette = 'BlGnYl',
shadow.col = 'gray')
# decipher each variables importance
caret::varImp(anxiety_tree) |>
arrange(desc(Overall))
## Overall
## age 0.086913596
## fav_genre 0.020241566
## bpm 0.012369585
## hours_per_day 0.007956438
caret::varImp(anxiety_tree) |>
arrange(desc(Overall)) |>
rownames_to_column(var = "variable") |>
# showing the variables importance as a bar chart
ggplot(mapping = aes(x = fct_reorder(variable, -Overall),
y = Overall)) +
geom_col(fill = "steelblue",
color = "black") +
labs(x = NULL,
y = "Variable Importance") +
scale_y_continuous(expand = c(0, 0, 0.05, 0))
Based on the graph and the table, the most important variable appears to
be age.
# generating predictions
pred_anxiety <-
predict(object = anxiety_tree,
newdata = mlr_data)
# comparing the distribution of predicted values vs. actual values
bind_rows("predicted" = summary(pred_anxiety),
"actual" = summary(mlr_data$anxiety),
.id = "anxiety")
## # A tibble: 2 × 7
## anxiety Min. `1st Qu.` Median Mean `3rd Qu.` Max.
## <chr> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[>
## 1 predicted 3.340426 6.110209 6.110209 5.837866 6.110209 6.1102…
## 2 actual 0.000000 4.000000 6.000000 5.837866 8.000000 10.0000…
# Looking at the correlation of outcomes with predictions
c(
"R2" = cor(pred_anxiety, mlr_data$anxiety)^2,
"MAE" = mean(abs(pred_anxiety - mlr_data$anxiety)),
"MAE_mean" = mean(abs(mlr_data$anxiety - mean(mlr_data$anxiety)))
)
## R2 MAE MAE_mean
## 0.0869136 2.1920682 2.3410830
The goal of music therapy (MT) is to improve an individual’s stress, mood, and overall mental health. The study asked participants questions regarding to their relationship with music and other questions regarding their mental health. The goal of our project was to determine if their was a relationship (association) between music and mental health ratings. We were hoping that there was a positive relationship with a specific genre or amount of time spent listening to music and low mental health ratings. However, since this study was a survey, the answers were biased (based on each individual person).
While 74% of participants claimed that music improved their mental health, there was no relationship found between the participants’ mental health scores and music frequency. Nor was their a relationship between participants’ mental health scores and hours spent listening to music.
Lastly, we were unable to predict anxiety levels based on age, bpm, hours_per_day, or fav_genre. Each plot showed extremely low correlation values.
Since the data is coming from an online survey, participation is voluntary. The majority of participants surveyed were in their late teens/early 20s, so in order to get less biased results, the study should use a different method for collecting data. As for the mental illness and frequency sections of the survey, the way in which people ranked their responses is subject to their own understanding of what the rankings mean. We believe there should be a column that states the participant’s mental health score before listening to music and a column that states the participant’s mental health score after listening to music. Then we will create a column comparing the difference in order to see if their is a quantitative value associated with improvement in mental health, versus just a qualitative value (stating if it improved, worsened, or had no effect on mental health).