read.csv("mxmh_survey_results.csv") -> music_mentalhealth

Introduction

This data set was found on Kaggle. It was created by Catherine Rasgaitis, from the University of Washington in 2022. This data set is a voluntary sample taken from a Google Form. The form was posted across a variety of media including Reddit forums, different social media applications and posters/cards distributed and hung up in a myriad of public locations. The data on the form was collected for a period of approximately three months. Respondents were not restricted by age. Since the form was voluntary, there is definitely bias. On Kaggle, it explains the majority of respondents range from late teens to early twenties possibly because the survey was advertised on social media sites that attract this demographic. This is an observational study as subjects were not assigned to treatments. Questions were asked through an online survey using a Google Form. In the survey there were three blocks of questions. The survey first asked participants some general background questions in regard to music background/listening habits. Then participants were asked some questions regarding their music use (frequency of different genres). Lastly, they had to rate there mental health on a scale from 1-10.

In recent years, mental health has been a growing concern and there has been an increase in discussions on how best to combat mental health. As people who enjoy listening to music, we are interested in knowing if listening to music (a certain genre or specific amount of time) promotes positive mental health. Also, the percentage of college students experiencing mental illnesses such as anxiety and depression are staggering. We feel the class and anyone interested in promoting positive mental health should be interested in this data set.

Our hope in analyzing this data set through data visualizations and machine learning is to determine the effect in which music has on someones mental health.

Survey Question Blocks

1: Background - generic questions on musical background and listening habits.

2: Music genres - ranking of how often they listen to 16 different genres options include: Never, Rarely, Sometimes and Very frequently’

3: Mental Health - ranking Anxiety, Depression, Insomnia, and OCD on a scale 0 - “I do not experience this.” 10 - “I experience this regularly, constantly/or to an extreme”

Data Visualizations

Our first couple graphs give an overview of the data collected in the survey such as ages of participants, favorite streaming service and favorite genre of participants. The second set of graphs are used to look at the relationship of mental health and music.

Looking at age demographic of people surveyed

music_mentalhealth|>
  group_by(Age) |>
  summarize(counts = n()) -> count_age

count_age |>
  arrange(-counts)
## # A tibble: 62 × 2
##      Age counts
##    <int>  <int>
##  1    18     85
##  2    19     61
##  3    17     59
##  4    21     52
##  5    16     44
##  6    20     40
##  7    22     39
##  8    23     37
##  9    25     22
## 10    26     22
## # ℹ 52 more rows
ggplot(
  data = music_mentalhealth,
  mapping = aes(x = Age)
) + 
  # histogram with bin widths of 0.025 wide
  geom_histogram(
    bins = 30,
    fill = "orchid",
    color = "black"
  ) +
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) +
  
  labs(title = "Age Demographic") -> age_demographic

The table above shows that the majority of the people surveyed were in their late teens/early twenties, which may skew the results for most popular music genre, hours spent listening to music, mental health ratings, etc.

Looking at the majority age group of people surveyed

music_mentalhealth |>
  filter(Age == 18 | Age == 19 | Age == 17 | Age == 21 | Age == 16 | Age == 20 | Age == 22 | Age == 23) |>
  select(-Timestamp) -> majority_surveyed

tibble(majority_surveyed)
## # A tibble: 417 × 32
##      Age Primary.streaming.service   Hours.per.day While.working Instrumentalist
##    <int> <chr>                               <dbl> <chr>         <chr>          
##  1    18 Spotify                                 3 Yes           Yes            
##  2    18 Spotify                                 4 No            No             
##  3    18 Spotify                                 4 Yes           No             
##  4    18 Spotify                                 5 Yes           Yes            
##  5    18 YouTube Music                           3 Yes           Yes            
##  6    21 Spotify                                 1 Yes           No             
##  7    19 Spotify                                 6 Yes           No             
##  8    18 I do not use a streaming s…             1 Yes           No             
##  9    18 Spotify                                 3 Yes           Yes            
## 10    19 YouTube Music                           8 Yes           No             
## # ℹ 407 more rows
## # ℹ 27 more variables: Composer <chr>, Fav.genre <chr>, Exploratory <chr>,
## #   Foreign.languages <chr>, BPM <int>, Frequency..Classical. <chr>,
## #   Frequency..Country. <chr>, Frequency..EDM. <chr>, Frequency..Folk. <chr>,
## #   Frequency..Gospel. <chr>, Frequency..Hip.hop. <chr>,
## #   Frequency..Jazz. <chr>, Frequency..K.pop. <chr>, Frequency..Latin. <chr>,
## #   Frequency..Lofi. <chr>, Frequency..Metal. <chr>, Frequency..Pop. <chr>, …
ggplot(data = majority_surveyed,
       mapping = aes(x = Hours.per.day,
                     y = Age,
                     color = Age)) +
  
  # adding points to create a scatter plot 
  geom_jitter(mapping = aes(color = factor(Age))) +
  
  # removing the legend
  theme(legend.position = "none") +
  
  # adding a title
  labs(title = "Hours of Music Listened to Per day by \n the Largest Age Group Surveyed",
       x = "Hours Listened to Per Day"
  ) -> age_graph

# age_demographic + age_graph

age_demographic + age_graph

Based on the graph above, we know that the data is right skewed and the majority of the people who participated in the survey were between the ages of 16 and 23-years-old. We can also identify an outlier on the graph. There was one 89-year-old surveyed and it’s important to note that he claimed to listen to music 24 hours a day, listens to music while working, and is an instrumentalist and composer. We then took a closer look at the hours per day of the majority age group surveyed.

Respondents claim to whether or not music improved their mental health

Each participant was asked whether music improved, worsened, or had no effect on their mental health

music_mentalhealth |>
  select(Age, Music.effects) -> effects_of_music

# graph the proportion
effects_of_music |>
  filter(Music.effects != "") |>                   # disregard the people who did not answer the question
  group_by(Music.effects) |>
  summarize(counts = n()) |>
  mutate(prop = counts/sum(counts))   ->              # create new column in data set
  
  Music.effects.prop

Music.effects.prop
## # A tibble: 3 × 3
##   Music.effects counts   prop
##   <chr>          <int>  <dbl>
## 1 Improve          542 0.745 
## 2 No effect        169 0.232 
## 3 Worsen            17 0.0234
ggplot(data = Music.effects.prop,
       mapping = aes(x = Music.effects,
                     y = prop,
                     fill = Music.effects)) +
  
  geom_col(color = "black",
           linewidth = 0.5) +
  
  # changing the colors of the bars
  scale_fill_manual(values = c("Improve" = "darkgreen",    # green indicates improvement (positive)
                               "No effect" = "#800080",    # purple indicates no effect (neutral)
                               "Worsen" = "tomato")) +     # red indicates worsen (negative)
  
  # And display percentages on the y-axis and remove the extra space on the bottom:
  scale_y_continuous(labels = scales::label_percent(accuracy = 1),
                     expand = c(0, 0, 0.05, 0))+
  
  labs(x = NULL,
       y = NULL,
       fill = NULL,
       title = "Music's Impact on Mental Health Among proportion of Participant's Responses")+
  
  # removing the legend from the graph
  theme(legend.position = "none") +
  
  # adding the percent to each bar on the graph
  geom_text(mapping = aes(label = scales::percent(prop, accuracy = 1)),
            size = 3, 
            color = "black",
            vjust = -0.5)

Based on the bar graph above, the majority of people (74%) who were surveyed found that listening to music improved their mental health (positive impact). Only 2% of people who participated in the survey said that listening to music worsens their mental health, and 23% claim that listening to music had no effect on their mental health. Based off of these results, we could assume that their was a positive association with listening to music and an improvement in mental health, however, the scatterplot that compared the strength rating of four mental illnesses and time spent listening to music showed the opposite association. The scatterplot shown below that the more time spent listening to music, the worse the mental illness got (the stronger/higher the rating of the mental illness got)

What is the most popular genre of music (favorite genre) among the people surveyed?

music_mentalhealth |>
  filter(Primary.streaming.service != "")|>
  select (Age, Primary.streaming.service) -> favorite_genre



# create a boxplot showing the favorite genre among the people surveyed
ggplot(data = favorite_genre,
       mapping = aes(x = factor(Primary.streaming.service, levels = c("Spotify",
                                                                      "YouTube Music",
                                                                      "Apple Music",
                                                                      "Pandora",
                                                                      "Other streaming service",
                                                                      "I do not use a streaming service")),
                     y = Age,
                     fill = Primary.streaming.service)) +
  
  # Adding a boxplot with geom_boxplot
  geom_boxplot(color = "black",
               show.legend = F,
               outlier.shape = NA)+
  
  
  # Removing the x and y-axis titles
  labs(x = "Primary Streaming Service",
       y = "Age",
       title = "Streaming Service by Age")+
  
  scale_x_discrete(labels = c("Spotify",
                              "YouTube",
                              "Apple Music",
                              "Pandora",
                              "Other",
                              "None"))+
  
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 80,
                                  by = 10)) +
  
  # changing the theme of the graph to black and white
  theme_bw() +
  
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle=45, hjust=1))    -> streaming_service_age




# Question: What is the most popular streaming service among the people surveyed?
# determine the counts for favorite genre among the people surveyed
favorite_genre|>
  group_by(Primary.streaming.service) |>
  summarize(counts = n()) |>
  arrange(-counts) -> favorite.genre.by.age

favorite.genre.by.age
## # A tibble: 6 × 2
##   Primary.streaming.service         counts
##   <chr>                              <int>
## 1 Spotify                              458
## 2 YouTube Music                         94
## 3 I do not use a streaming service.     71
## 4 Apple Music                           51
## 5 Other streaming service               50
## 6 Pandora                               11
ggplot(data = favorite.genre.by.age,
       mapping = aes(x = factor(Primary.streaming.service, levels = c("Spotify",
                                                                      "YouTube Music",
                                                                      "Apple Music",
                                                                      "Pandora",
                                                                      "Other streaming service",
                                                                      "I do not use a streaming service")),
                     y = counts,
                     fill = Primary.streaming.service)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           color = "black")+
  
  
  # Changing the x and y-axis titles and adding a title to the graph
  labs(x = "Primary Streaming Service",
       y = "Count",
       title = "Favorite Streaming Service \n Among All Who Were Surveyed",
       labeller = labeller(Primary.streaming.service = c("I do not use a streaming service" = "None",
                                                         "Other streaming service" = "Other")))+
  
  # changing the theme of the graph to black and white
  theme_bw() +
  
  # change column names
  scale_x_discrete(labels = c("Spotify",
                              "YouTube",
                              "Apple Music",
                              "Pandora",
                              "Other",
                              "None")) +
  
  
  # removing the legend from the graph
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        axis.text.x = element_text(angle=45, hjust=1)) +
  
  scale_y_continuous(expand = c(0, 0, 0.05, 0)) -> bar.graph1


favorite_genre |>
  filter(Primary.streaming.service == "Pandora") |>
  arrange(-Age)-> favorite.pandora

favorite.pandora
##    Age Primary.streaming.service
## 1   73                   Pandora
## 2   69                   Pandora
## 3   68                   Pandora
## 4   63                   Pandora
## 5   61                   Pandora
## 6   60                   Pandora
## 7   53                   Pandora
## 8   36                   Pandora
## 9   32                   Pandora
## 10  22                   Pandora
## 11  15                   Pandora
# since spotify was the most popular primary streaming service (greatest number of counts), look at the ages of the participants who chose spotify
favorite_genre |>
  filter(Primary.streaming.service == "Spotify") |>
  arrange(Age) -> favorite.spotify

tibble(favorite.spotify)
## # A tibble: 458 × 2
##      Age Primary.streaming.service
##    <int> <chr>                    
##  1    12 Spotify                  
##  2    12 Spotify                  
##  3    13 Spotify                  
##  4    13 Spotify                  
##  5    13 Spotify                  
##  6    13 Spotify                  
##  7    13 Spotify                  
##  8    14 Spotify                  
##  9    14 Spotify                  
## 10    14 Spotify                  
## # ℹ 448 more rows
# look at the counts for each age that chose spotify as the primary streaming service (favorite streaming service)
favorite.spotify|>
  group_by(Age) |>
  summarize(counts = n())|>
  arrange(-counts) -> counts.spotify

counts.spotify
## # A tibble: 42 × 2
##      Age counts
##    <int>  <int>
##  1    18     58
##  2    19     44
##  3    17     43
##  4    16     33
##  5    21     32
##  6    22     25
##  7    23     24
##  8    20     22
##  9    26     17
## 10    15     15
## # ℹ 32 more rows
# streaming_service_age + bar.graph1

streaming_service_age + bar.graph1

Observations:

Based on the bar chart, a majority of the participants who were surveyed chose Spotify as their favorite streaming service (458/735 = 62.3%). We looked deeper into the data and among all who voted for Spotify as their favorite streaming service, 58 were 18-year-olds, 44 were 19-year-olds, 43 were 17-year-olds, 33 were 16-year-olds, and 32 were 21-year-olds. The boxplot showing the relationship between age and favorite streaming service also shows that the majority of people who chose Spotify had a median age of 20. It’s also important to note that the majority of participants who were surveyed were among this age demographic. So if an older demographic was surveyed, the graphs could potentially look different.

Pandora had the lowest count for favorite streaming service (11 total people). Based on the boxplot, Pandora was chosen as primary streaming service among older participants. We looked closer at the data, and among the 11 total people surveyed 4 were under the age of 40 (15, 22, 32, and 36) and the other 7 were over the age of 50 (53, 60, 61, 63, 68, 69, 73). Pandora has the greatest age spread among streaming services.

music_mentalhealth |>
  select(Age, Fav.genre)|>
  group_by(Fav.genre) |>
  summarize(counts = n()) |>
  arrange(desc(counts)) -> counts.fav.genre

counts.fav.genre
## # A tibble: 16 × 2
##    Fav.genre        counts
##    <chr>             <int>
##  1 Rock                188
##  2 Pop                 114
##  3 Metal                88
##  4 Classical            53
##  5 Video game music     44
##  6 EDM                  37
##  7 Hip hop              35
##  8 R&B                  35
##  9 Folk                 30
## 10 K pop                26
## 11 Country              25
## 12 Rap                  22
## 13 Jazz                 20
## 14 Lofi                 10
## 15 Gospel                6
## 16 Latin                 3
# creating the bar graph to display favorite genre of music
ggplot(data = counts.fav.genre,
       mapping = aes(x = fct_reorder(Fav.genre, counts),
                     y = counts)) +
  
  geom_col(fill = "seagreen",
           color = "black") +
  
  # adding a label to the x-axis and a title to the graph
  labs(x = "Favorite Genre",
       fill = NULL,
       title = "Favorite Genre Among People Surveyed")+
  
  # removing the legend from the graph
  theme(legend.position = "none",
        axis.text.x = element_text(angle= 45, hjust=1)) +
  
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 300, 
                                  by = 50),
                     expand = c(0, 0, 0.05, 0))

Observations: Based on the graph, Rock was the most popular genre (favorite genre) of music among the people surveyed (188 people chose rock).

The code below contains cleaning original data set:

useable_data <-
  music_mentalhealth |> 
  # Cleaning all the column names
  clean_names() |> 
  
  # selecting columns to keep
  select(age, bpm, hours_per_day, fav_genre, contains("frequency"), anxiety, depression, insomnia, ocd)

The code below re-formats the useable_data to be used for a set of bar charts which looks at the frequency of different genres of music listened to for different mental illnesses

clean_data <-
  
  useable_data |> 
  
  # using pivot longer to create a column with all of the frequencies together
  pivot_longer(
    cols = starts_with("frequency"), 
    names_to = "music_genre",
    values_to = "frequency_rating") |> 
  
  # using pivot longer to create a column with all of the ratings of mental illnesses together
  pivot_longer(
    cols = c(anxiety, depression, insomnia, ocd),
    names_to = "mental_illness_type",
    values_to = "strength_rating") |> 
  
  # only keeping mental illnesses that are "severe" (people have ranked them 7 through 10)
  filter(strength_rating >= 7) 

tibble(useable_data)
## # A tibble: 736 × 24
##      age   bpm hours_per_day fav_genre     frequency_classical frequency_country
##    <int> <int>         <dbl> <chr>         <chr>               <chr>            
##  1    18   156           3   Latin         Rarely              Never            
##  2    63   119           1.5 Rock          Sometimes           Never            
##  3    18   132           4   Video game m… Never               Never            
##  4    61    84           2.5 Jazz          Sometimes           Never            
##  5    18   107           4   R&B           Never               Never            
##  6    18    86           5   Jazz          Rarely              Sometimes        
##  7    18    66           3   Video game m… Sometimes           Never            
##  8    21    95           1   K pop         Never               Never            
##  9    19    94           6   Rock          Never               Very frequently  
## 10    18   155           1   R&B           Rarely              Rarely           
## # ℹ 726 more rows
## # ℹ 18 more variables: frequency_edm <chr>, frequency_folk <chr>,
## #   frequency_gospel <chr>, frequency_hip_hop <chr>, frequency_jazz <chr>,
## #   frequency_k_pop <chr>, frequency_latin <chr>, frequency_lofi <chr>,
## #   frequency_metal <chr>, frequency_pop <chr>, frequency_r_b <chr>,
## #   frequency_rap <chr>, frequency_rock <chr>,
## #   frequency_video_game_music <chr>, anxiety <dbl>, depression <dbl>, …

Since making facets for 16 different genres would not look neat, I used the code below to create two different data sets, each of eight genres to then create two separate graphs for.

# creating a data set with the first 8 genres
genre1 <-
  clean_data |> 
  filter(music_genre %in% c("frequency_classical", "frequency_country", "frequency_edm", "frequency_folk", "frequency_gospel", "frequency_hip_hop", "frequency_jazz", "frequency_k_pop"))

# creating a data set with the second 8 genres
genre2 <-
  clean_data |> 
  filter(music_genre %in% c("frequency_latin", "frequency_lofi", "frequency_metal", "frequency_pop", "frequency_r_b", "frequency_rap", "frequency_rock", "frequency_video_game_music"))

The next set of code and graphs examine the relationship between mental health and different aspects of the participants music listening behaviors.

The code chunk below contains the code used to create the graph that looks at music frequency by type of mental illness

# Creating a plot of the data 
ggplot(data = genre1,                        
       mapping =  aes(x = mental_illness_type,  
                      fill = as.factor(frequency_rating))) + 
  
  # creating the bar charts 
  geom_bar(position = "fill",     # using the proportions instead of counts     
           color = "black") +       # outlining the bars in black
  
  # creating facets for each genre of music 
  facet_wrap(facets = ~ music_genre,  
             ncol = 4,                   
             scales = "fixed",
             labeller = labeller(music_genre = c(   # changing titles of facets
               "frequency_classical" = "Classical", 
               "frequency_country" = "Country", 
               "frequency_edm" = "EDM", 
               "frequency_folk" =  "Folk", 
               "frequency_gospel" =  "Gospel", 
               "frequency_hip_hop" =  "Hip Hop", 
               "frequency_jazz" = "Jazz", 
               "frequency_k_pop" =  "K-Pop", 
               "frequency_latin" = "Latin", 
               "frequency_lofi" = "Lo-fi", 
               "frequency_metal" = "Metal", 
               "frequency_pop" = "Pop", 
               "frequency_r_b" = "R&B", 
               "frequency_rap" = "Rap", 
               "frequency_rock" = "Rock", 
               "frequency_video_game_music" = "Video Game"))) +          
  
  # changing the color of the values 
  scale_fill_manual(values = c("seagreen", "steelblue",  "darkorange", "darkolivegreen2")) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # angling the axis text
    panel.grid.major.x = element_blank(),     # removing unnecessary grid lines
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank()) +
  
  # adding and removing labels
  labs(title = "Music Frequency by Type of Mental Illness",
       x = "",
       y = NULL,
       fill = "") +
  
  # expanding the length of the bars and adding percents on y-axis
  scale_y_continuous(expand = c(0, 0, 0.05, 0),
                     labels = scales::percent,
                     limits = c(0, 1)          # Adjust the limits to exclude 0
  ) -> graph1

graph1

# creating the second graph with the other 8 genres 
graph1 %+% genre2 -> graph2

graph2 

These bar graphs demonstrate that the type of mental illness someone has ranked as experiencing severely (7-10), does not determine the frequency in which they listen to a certain genre of music. This is displayed by the proportions being relatively the same for each mental illness within the different genres of music. However, from the graphs, we are able to tell which genres are more popular than others. For example, gospel and Latin music seemed to be listened to the least frequent whereas rock and pop appear to be listened to with the most frequency.

The code below is used to clean the data to make a scatter plot with trendlines that compares the strength ratings of the four mental illness with the number of hours of music listened to

clean_data2 <-
  
  useable_data |> 
  
  # using pivot longer to create a column with all of the frequencies together
  pivot_longer(
    cols = starts_with("frequency"), 
    names_to = "music_genre",
    values_to = "frequency_rating") |> 
  
  pivot_longer(
    cols = c(anxiety, depression, insomnia, ocd),
    names_to = "mental_illness_type",
    values_to = "strength_rating") |> 
  
  # removing certain columns
  select(-age, -fav_genre, -music_genre, -frequency_rating) |>
  
  # calculating the mean hours of music listened to per day for each mental illness and strength rating
  group_by(mental_illness_type, strength_rating) |> 
  summarise(hours_per_day = mean(hours_per_day)) |> 
  
  filter(hours_per_day <= 4.75)


clean_data2
## # A tibble: 46 × 3
## # Groups:   mental_illness_type [4]
##    mental_illness_type strength_rating hours_per_day
##    <chr>                         <dbl>         <dbl>
##  1 anxiety                         0            3.9 
##  2 anxiety                         1            2.80
##  3 anxiety                         2            3.91
##  4 anxiety                         3            3.16
##  5 anxiety                         4            3.14
##  6 anxiety                         5            3.77
##  7 anxiety                         6            3.18
##  8 anxiety                         7            3.52
##  9 anxiety                         7.5          4   
## 10 anxiety                         8            3.91
## # ℹ 36 more rows

The code below creates a graph that compares the number of hours of music listened to per day with the strength ratings of mental illness. All four mental illnesses are examined to determine if there is an association.

# creating the plot
ggplot(data = clean_data2,
       mapping = aes(x = hours_per_day,
                     y = strength_rating,
                     color = mental_illness_type)) +
  
  # adding points to create a scatter plot 
  geom_point() + 
  
  # adding a trend line to the graph
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 0.5,
              fullrange = T) +
  
  # changing the color choice
  scale_color_manual(values = c("orangered", "deepskyblue3",  "navy", "plum")) +
  
  # changing the labels on the x-axis and y-axis and adding a title to the graph and changing the title of the legend
  labs(
    y = "Strength Rating of Mental Illness (0-10)",
    x = "Hours of Music Listened To (per day)",
    title = "Comparing the Strength Rating of Four Different Mental Illnesses \n to the Number of Hours of Music Listened to",
    color = "Type of Mental Illness") + 
  
  scale_y_continuous(breaks = c(0:10)) +
  
  theme(plot.title = element_text(size = 14),
        legend.position = "bottom",
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank())

The data produced from the graph demonstrates that with more hours of music listened to per day there is a slight increase in strength of a mental illness. It is important to keep in mind that there is only about a two hour difference between the minimum and maximum of the average hours listened to for each mental illness. It appears as though the four mental illnesses rankings increase at about the same rate.

Machine Learning Methods

While our data set is a survey, we wanted to determine if there is an association between the effects of music and mental health (music therapy). While the survey focuses on four types of mental illnesses (anxiety, depression, insomnia, ocd), for our machine learning method we decided to just focus on predicting anxiety levels among participants, due to anxiety being the most prevalent mental illness among those surveyed. We first created a multiple linear regression, followed by a regression tree in hopes of determining an association between anxiety and several variables relating with music.

Below, is the code used to clean the data for the machine learning methods

useable_data |> 
  select(age, bpm, hours_per_day, fav_genre, anxiety) |> 
  
  # combining Hip Hop and R&B into one genre to maximize the data
  mutate(
    fav_genre = if_else(fav_genre %in% c("Hip Hop", "R&B"),
                        "Hip/R&B",
                        fav_genre)) |> 
  
  
  filter(fav_genre %in% c("Rock", "Pop", "Metal", "Classical", "Hip/R&B")) -> mlr_data

tibble(mlr_data)
## # A tibble: 478 × 5
##      age   bpm hours_per_day fav_genre anxiety
##    <int> <int>         <dbl> <chr>       <dbl>
##  1    63   119           1.5 Rock            7
##  2    18   107           4   Hip/R&B         7
##  3    19    94           6   Rock            2
##  4    18   155           1   Hip/R&B         2
##  5    17    NA           2   Pop             7
##  6    19   118           5   Hip/R&B         6
##  7    18    79           2   Pop             3
##  8    16    84           3   Rock           10
##  9    18   169           2   Pop             7
## 10    14   136          12   Rock            8
## # ℹ 468 more rows

For the machine learning we selected age, bpm, hours_per_day, and anxiety levels among the most popular genres selected by the participants. Based on the bar graph displaying “Favorite Genre Among People Surveyed”, we selected Rock, Pop, Metal, Classical, and combined the participants who chose either Hip or R&B.

pacman::p_load(GGally)

# using ggpairs to create a scatter plot matrix 

mlr_data|> 
  ggpairs()

When examining the outputs in which show the correlation for anxiety, it is evident that there is not really a predictor of how someone will rank their anxiety on a scale from 1-10. Based on the multiple plots, there was no correlation between age, bpm, hours_per_day, or fav_genre and anxiety ranking.

# Fitting the linear regression model and save it as anxiety_lm

anxiety_lm <-
  lm(
    formula = anxiety ~ .,  
    data = mlr_data
  ) 

summary(anxiety_lm)
## 
## Call:
## lm(formula = anxiety ~ ., data = mlr_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1108 -1.8814  0.3447  1.8861  5.8618 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.334e+00  7.144e-01   8.866  < 2e-16 ***
## age              -5.656e-02  1.110e-02  -5.097  5.4e-07 ***
## bpm              -2.538e-05  4.047e-03  -0.006   0.9950    
## hours_per_day     5.302e-02  5.034e-02   1.053   0.2928    
## fav_genreHip/R&B  1.293e-01  6.478e-01   0.200   0.8419    
## fav_genreMetal    4.441e-01  5.316e-01   0.835   0.4040    
## fav_genrePop      9.192e-01  5.039e-01   1.824   0.0689 .  
## fav_genreRock     1.154e+00  4.802e-01   2.403   0.0167 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.677 on 388 degrees of freedom
##   (82 observations deleted due to missingness)
## Multiple R-squared:  0.08533,    Adjusted R-squared:  0.06883 
## F-statistic: 5.171 on 7 and 388 DF,  p-value: 1.205e-05
pacman::p_load(regclass)

# calculating the variance inflation factor 

VIF(anxiety_lm)
##                   GVIF Df GVIF^(1/(2*Df))
## age           1.025079  1        1.012462
## bpm           1.064309  1        1.031653
## hours_per_day 1.020989  1        1.010440
## fav_genre     1.098462  4        1.011808

All of the VIFs were low, therefore there is not an issue with multicollinearity.

pacman::p_load(broom)

# creating three linear models with different complexities

# least complex

anxiety_lm1 <-
  lm(formula = anxiety ~ fav_genre,
     data = mlr_data)

# Middle Model

anxiety_lm2 <-
  lm(formula = anxiety ~ fav_genre + age,
     data = mlr_data)

# Most complicated model

anxiety_lm3 <-
  lm(formula = anxiety ~ fav_genre + age + hours_per_day + bpm,
     data = mlr_data)
# use augment_columns() in the broom package to calculate the fit stats ourselves (add the predictions and residuals)

music_lm1 <-
  augment_columns(x = anxiety_lm1,
                  data = mlr_data)
music_lm1
## # A tibble: 478 × 12
##      age   bpm hours_per_day fav_genre anxiety .fitted .se.fit .resid    .hat
##    <int> <int>         <dbl> <chr>       <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1    63   119           1.5 Rock            7    6.12   0.203  0.878 0.00532
##  2    18   107           4   Hip/R&B         7    5.17   0.470  1.83  0.0286 
##  3    19    94           6   Rock            2    6.12   0.203 -4.12  0.00532
##  4    18   155           1   Hip/R&B         2    5.17   0.470 -3.17  0.0286 
##  5    17    NA           2   Pop             7    6.07   0.260  0.925 0.00877
##  6    19   118           5   Hip/R&B         6    5.17   0.470  0.829 0.0286 
##  7    18    79           2   Pop             3    6.07   0.260 -3.07  0.00877
##  8    16    84           3   Rock           10    6.12   0.203  3.88  0.00532
##  9    18   169           2   Pop             7    6.07   0.260  0.925 0.00877
## 10    14   136          12   Rock            8    6.12   0.203  1.88  0.00532
## # ℹ 468 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
music_lm1 |> 
  dplyr::select(fav_genre, anxiety, .fitted, .resid)
## # A tibble: 478 × 4
##    fav_genre anxiety .fitted .resid
##    <chr>       <dbl>   <dbl>  <dbl>
##  1 Rock            7    6.12  0.878
##  2 Hip/R&B         7    5.17  1.83 
##  3 Rock            2    6.12 -4.12 
##  4 Hip/R&B         2    5.17 -3.17 
##  5 Pop             7    6.07  0.925
##  6 Hip/R&B         6    5.17  0.829
##  7 Pop             3    6.07 -3.07 
##  8 Rock           10    6.12  3.88 
##  9 Pop             7    6.07  0.925
## 10 Rock            8    6.12  1.88 
## # ℹ 468 more rows

The .fitted column gives us the predicted value and the .resid gives us the residuals.

# combining the fit statistics together 

bind_rows(
  glance(anxiety_lm1),
  glance(anxiety_lm2),
  glance(anxiety_lm3)
) |> 
  
  mutate(
    explanatories = c(as.character(formula(anxiety_lm1))[3],
                      as.character(formula(anxiety_lm2))[3],
                      as.character(formula(anxiety_lm3))[3])
  ) |> 
  
  dplyr::select(explanatories, r.squared, adj.r.squared, sigma)
## # A tibble: 3 × 4
##   explanatories                         r.squared adj.r.squared sigma
##   <chr>                                     <dbl>         <dbl> <dbl>
## 1 fav_genre                                0.0229        0.0146  2.78
## 2 fav_genre + age                          0.0773        0.0675  2.70
## 3 fav_genre + age + hours_per_day + bpm    0.0853        0.0688  2.68

All three model fit statistics have extremely low r.squared values. Anxiety_lm3 had the “largest” r.squared value (R2 = 0.085). Sigma is how much we are off by (so our predictions are ~2.68 off which is not good).

We want to find the simplest model that still fits well. Although none of the models exactly fit the data well (all have a very low r.squared value), the best option would be the second model using favorite genre and age to predict anxiety levels. We can rule out the first model because the r.squared is noticeably lower than the other two models. The second and third models fit about equally well (similar r.squared values). Since the second model is simpler than the third model (2 variables vs 4 variables), we should use the second model.

augment_columns(
  x = anxiety_lm2,
  data = mlr_data
) |> 
  
  # Making a residual plot
  ggplot(mapping = aes(x = .fitted,     # .fitted is our predicted values 
                       y = .resid)) +   # .resid is our residual values 
  
  # creating a scatter plot
  geom_point() +
  
  # adding a horizontal line
  geom_hline(yintercept = 0,
             color = "red",
             linewidth = 1) +

  # Making an r-squared (R2) plot: actual (y) vs predicted (y-hat)
  augment_columns(
    x = anxiety_lm2,
    data = mlr_data
  ) |> 
  
  ggplot(
    mapping = aes(
      x = .fitted,    # predicted values
      y = anxiety     # observed (actual) values
    )
  ) +
  
  # making a scatter plot
  geom_point() +
  
  # adding the trend line
  geom_smooth(method = "lm",
              color = "red",
              se = F,
              linewidth = 1)

The residual plot shows the correlation between the predicted values (.fitted) and the residual values. The r-squared (R2) plot shows the correlation between the predicted values (.fitted) and observed/actual values (anxiety). Both plots show a non-linear association. The points do not align with the horizontal line for the residual plot, nor do the data points align with the linear trendline for the r-squared plot.

pacman::p_load(rpart)
pacman::p_load(rpart.plot)


#Fit the full regression tree using function `rpart()` in the eponymous package, `rpart`.
anxiety_full <- 
  rpart(
    formula = anxiety ~ ., 
    data = mlr_data,
    method = "anova",   # use anova for regression tree instead of "class" ("class" used for categorical response)
    minsplit = 2, 
    minbucket = 1,
    cp = -1
  )

# Looking at the cp table
cptable_anxiety <-
anxiety_full$cptable |> 
  data.frame()

tibble(cptable_anxiety)
## # A tibble: 149 × 5
##         CP nsplit rel.error xerror   xstd
##      <dbl>  <dbl>     <dbl>  <dbl>  <dbl>
##  1 0.0869       0     1      1.01  0.0510
##  2 0.0233       1     0.913  0.937 0.0522
##  3 0.0146       2     0.890  0.942 0.0522
##  4 0.0126       3     0.875  0.976 0.0587
##  5 0.0110       5     0.850  0.984 0.0589
##  6 0.00993      7     0.828  1.04  0.0640
##  7 0.00913     11     0.788  1.06  0.0651
##  8 0.00841     12     0.779  1.08  0.0663
##  9 0.00805     13     0.771  1.10  0.0686
## 10 0.00780     14     0.763  1.14  0.0705
## # ℹ 139 more rows
# finding the CP value
plotcp(anxiety_full)

The Complexity Parameter of a decision tree measures how much the tree is over fitting the data. A small dip in cp value means that the tree is less complex and more generalization, which is good.

# finding the xerror cut off
anxiety_full$cptable |> 
  data.frame() |> 
  # tree with the lowest xerror
  slice_min(xerror, n = 1, with_ties = F) |> 
  # adding xerror and xstd to create our xcutoff
  mutate(xcutoff = xerror + xstd) |> 
  # saving the xcuttoff value
  pull(xcutoff) -> xcutoff



# finding the cp value to prune the tree:
anxiety_full$cptable |> 
  data.frame() |> 
  # Finding all the trees with an xerror below our xcutoff
  filter(xerror < xcutoff) |> 
  # picking the simplist tree 
  slice(1) |> 
  # saving the cp value as a single number
  pull(CP) -> cp_prune

c(xcutoff, cp_prune)
## [1] 0.98965545 0.02328773
# pruning the tree
anxiety_tree <-
  prune(
    tree = anxiety_full,
    cp = cp_prune
  )
# creating a basic decision tree diagram
rpart.plot(anxiety_tree, 
           type = 5,
           digits = 3)

# a few adjustments to the diagram
rpart.plot(anxiety_tree, 
           digits = 4,
           fallen.leaves = FALSE,
           type = 5, 
           extra = 101)

# Other adjustments to the diagram
rpart.plot(anxiety_tree, 
           digits = 4,
           fallen.leaves = TRUE,
           type = 5,
           box.palette = 'BlGnYl',
           shadow.col = 'gray')

# decipher each variables importance 
caret::varImp(anxiety_tree) |> 
  arrange(desc(Overall))
##                   Overall
## age           0.086913596
## fav_genre     0.020241566
## bpm           0.012369585
## hours_per_day 0.007956438
caret::varImp(anxiety_tree) |> 
  arrange(desc(Overall)) |> 
  rownames_to_column(var = "variable") |> 
  
  # showing the variables importance as a bar chart
  
  ggplot(mapping = aes(x = fct_reorder(variable, -Overall),
                       y = Overall)) + 
  
  geom_col(fill = "steelblue",
           color = "black") + 
  
  labs(x = NULL,
       y = "Variable Importance") + 
  
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

Based on the graph and the table, the most important variable appears to be age.

# generating predictions 

pred_anxiety <-
  predict(object = anxiety_tree, 
          newdata = mlr_data)


# comparing the distribution of predicted values vs. actual values
bind_rows("predicted" = summary(pred_anxiety),
          "actual" = summary(mlr_data$anxiety),
          .id = "anxiety")
## # A tibble: 2 × 7
##   anxiety   Min.        `1st Qu.`   Median      Mean        `3rd Qu.`   Max.    
##   <chr>     <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[>
## 1 predicted 3.340426    6.110209    6.110209    5.837866    6.110209     6.1102…
## 2 actual    0.000000    4.000000    6.000000    5.837866    8.000000    10.0000…
# Looking at the correlation of outcomes with predictions
c(
  "R2" = cor(pred_anxiety, mlr_data$anxiety)^2,
  "MAE" = mean(abs(pred_anxiety - mlr_data$anxiety)),
  "MAE_mean" = mean(abs(mlr_data$anxiety - mean(mlr_data$anxiety)))
)
##        R2       MAE  MAE_mean 
## 0.0869136 2.1920682 2.3410830

Conclusion

The goal of music therapy (MT) is to improve an individual’s stress, mood, and overall mental health. The study asked participants questions regarding to their relationship with music and other questions regarding their mental health. The goal of our project was to determine if their was a relationship (association) between music and mental health ratings. We were hoping that there was a positive relationship with a specific genre or amount of time spent listening to music and low mental health ratings. However, since this study was a survey, the answers were biased (based on each individual person).

While 74% of participants claimed that music improved their mental health, there was no relationship found between the participants’ mental health scores and music frequency. Nor was their a relationship between participants’ mental health scores and hours spent listening to music.

Lastly, we were unable to predict anxiety levels based on age, bpm, hours_per_day, or fav_genre. Each plot showed extremely low correlation values.

Limitation / Recommendations:

Since the data is coming from an online survey, participation is voluntary. The majority of participants surveyed were in their late teens/early 20s, so in order to get less biased results, the study should use a different method for collecting data. As for the mental illness and frequency sections of the survey, the way in which people ranked their responses is subject to their own understanding of what the rankings mean. We believe there should be a column that states the participant’s mental health score before listening to music and a column that states the participant’s mental health score after listening to music. Then we will create a column comparing the difference in order to see if their is a quantitative value associated with improvement in mental health, versus just a qualitative value (stating if it improved, worsened, or had no effect on mental health).