library(tidyverse)  # provides glimpse(), %>%, select(), mutate() and ggplot()
song <- read.csv("spotify-2023.csv")

Data Overview

Introduction

In today’s world, music isn’t just about catchy tunes; it’s about how it affects us. We’ve all felt its power to change our mood, ease our worries, and even help us express ourselves. And now, researchers are digging deep to understand why some songs become hits while others fade into the background. This project is all about diving into the data behind music. We’re breaking down song samples, looking at all the little details, and trying to figure out what makes a song popular based on the total number of streams on Spotify.

Ethical Consideration

As we delve into our research on song popularity prediction, it’s essential to prioritize ethical principles. This includes obtaining consent, protecting data privacy, and ensuring transparency in our methods and findings. We must also be mindful of potential societal impacts, striving to promote diversity and avoid perpetuating biases. By upholding these standards, we can conduct our study responsibly and contribute positively to the field of music research.

Data Sources

Nidula Elgiriyewithana. “Most Streamed Spotify Songs 2023.” Kaggle, 26 Aug. 2023, www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.

Data Exploration

This dataset consists of 24 variables and 953 observations. We will examine the following variables to build a model that predicts song popularity.

  • artist_count: Number of artists contributing to the song
  • released_year: Year when the song was released
  • released_month: Month when the song was released
  • released_day: Day of the month when the song was released
  • in_spotify_playlists: Number of Spotify playlists the song is included in
  • in_spotify_charts: Presence and rank of the song on Spotify charts
  • streams: Total number of streams on Spotify
  • in_apple_playlists: Number of Apple Music playlists the song is included in
  • in_apple_charts: Presence and rank of the song on Apple Music charts
  • bpm: Beats per minute, a measure of song tempo
  • danceability_%: Percentage indicating how suitable the song is for dancing
  • valence_%: Positivity of the song’s musical content
  • energy_%: Perceived energy level of the song
  • acousticness_%: Amount of acoustic sound in the song
  • instrumentalness_%: Amount of instrumental content in the song
  • liveness_%: Presence of live performance elements
  • speechiness_%: Amount of spoken words in the song

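A glimpse of the data is shown below. Note that read.csv() replaces characters such as % and parentheses in column names with dots, so danceability_% appears as danceability_. in the output.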

glimpse(song)
## Rows: 953
## Columns: 24
## $ track_name           <chr> "Seven (feat. Latto) (Explicit Ver.)", "LALA", "v…
## $ artist.s._name       <chr> "Latto, Jung Kook", "Myke Towers", "Olivia Rodrig…
## $ artist_count         <int> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1…
## $ released_year        <int> 2023, 2023, 2023, 2019, 2023, 2023, 2023, 2023, 2…
## $ released_month       <int> 7, 3, 6, 8, 5, 6, 3, 7, 5, 3, 4, 7, 1, 4, 3, 12, …
## $ released_day         <int> 14, 23, 30, 23, 18, 1, 16, 7, 15, 17, 17, 7, 12, …
## $ in_spotify_playlists <int> 553, 1474, 1397, 7858, 3133, 2186, 3090, 714, 109…
## $ in_spotify_charts    <int> 147, 48, 113, 100, 50, 91, 50, 43, 83, 44, 40, 55…
## $ streams              <chr> "141381703", "133716286", "140003974", "800840817…
## $ in_apple_playlists   <int> 43, 48, 94, 116, 84, 67, 34, 25, 60, 49, 41, 37, …
## $ in_apple_charts      <int> 263, 126, 207, 207, 133, 213, 222, 89, 210, 110, …
## $ in_deezer_playlists  <chr> "45", "58", "91", "125", "87", "88", "43", "30", …
## $ in_deezer_charts     <int> 10, 14, 14, 12, 15, 17, 13, 13, 11, 13, 12, 5, 58…
## $ in_shazam_charts     <chr> "826", "382", "949", "548", "425", "946", "418", …
## $ bpm                  <int> 125, 92, 138, 170, 144, 141, 148, 100, 130, 170, …
## $ key                  <chr> "B", "C#", "F", "A", "A", "C#", "F", "F", "C#", "…
## $ mode                 <chr> "Major", "Major", "Major", "Major", "Minor", "Maj…
## $ danceability_.       <int> 80, 71, 51, 55, 65, 92, 67, 67, 85, 81, 57, 78, 7…
## $ valence_.            <int> 89, 61, 32, 58, 23, 66, 83, 26, 22, 56, 56, 52, 6…
## $ energy_.             <int> 83, 74, 53, 72, 80, 58, 76, 71, 62, 48, 72, 82, 6…
## $ acousticness_.       <int> 31, 7, 17, 11, 14, 19, 48, 37, 12, 21, 23, 18, 6,…
## $ instrumentalness_.   <int> 0, 0, 0, 0, 63, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17,…
## $ liveness_.           <int> 8, 10, 31, 11, 11, 8, 8, 11, 28, 8, 27, 15, 3, 9,…
## $ speechiness_.        <int> 4, 4, 6, 15, 6, 24, 3, 4, 9, 33, 5, 7, 7, 3, 6, 4…

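Here is a summary of the dataset. Note that streams, in_deezer_playlists and in_shazam_charts are read as character columns rather than numeric ones.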

summary(song)
##   track_name        artist.s._name      artist_count   released_year 
##  Length:953         Length:953         Min.   :1.000   Min.   :1930  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:2020  
##  Mode  :character   Mode  :character   Median :1.000   Median :2022  
##                                        Mean   :1.556   Mean   :2018  
##                                        3rd Qu.:2.000   3rd Qu.:2022  
##                                        Max.   :8.000   Max.   :2023  
##  released_month    released_day   in_spotify_playlists in_spotify_charts
##  Min.   : 1.000   Min.   : 1.00   Min.   :   31        Min.   :  0.00   
##  1st Qu.: 3.000   1st Qu.: 6.00   1st Qu.:  875        1st Qu.:  0.00   
##  Median : 6.000   Median :13.00   Median : 2224        Median :  3.00   
##  Mean   : 6.034   Mean   :13.93   Mean   : 5200        Mean   : 12.01   
##  3rd Qu.: 9.000   3rd Qu.:22.00   3rd Qu.: 5542        3rd Qu.: 16.00   
##  Max.   :12.000   Max.   :31.00   Max.   :52898        Max.   :147.00   
##    streams          in_apple_playlists in_apple_charts  in_deezer_playlists
##  Length:953         Min.   :  0.00     Min.   :  0.00   Length:953         
##  Class :character   1st Qu.: 13.00     1st Qu.:  7.00   Class :character   
##  Mode  :character   Median : 34.00     Median : 38.00   Mode  :character   
##                     Mean   : 67.81     Mean   : 51.91                      
##                     3rd Qu.: 88.00     3rd Qu.: 87.00                      
##                     Max.   :672.00     Max.   :275.00                      
##  in_deezer_charts in_shazam_charts        bpm            key           
##  Min.   : 0.000   Length:953         Min.   : 65.0   Length:953        
##  1st Qu.: 0.000   Class :character   1st Qu.:100.0   Class :character  
##  Median : 0.000   Mode  :character   Median :121.0   Mode  :character  
##  Mean   : 2.666                      Mean   :122.5                     
##  3rd Qu.: 2.000                      3rd Qu.:140.0                     
##  Max.   :58.000                      Max.   :206.0                     
##      mode           danceability_.    valence_.        energy_.    
##  Length:953         Min.   :23.00   Min.   : 4.00   Min.   : 9.00  
##  Class :character   1st Qu.:57.00   1st Qu.:32.00   1st Qu.:53.00  
##  Mode  :character   Median :69.00   Median :51.00   Median :66.00  
##                     Mean   :66.97   Mean   :51.43   Mean   :64.28  
##                     3rd Qu.:78.00   3rd Qu.:70.00   3rd Qu.:77.00  
##                     Max.   :96.00   Max.   :97.00   Max.   :97.00  
##  acousticness_.  instrumentalness_.   liveness_.    speechiness_.  
##  Min.   : 0.00   Min.   : 0.000     Min.   : 3.00   Min.   : 2.00  
##  1st Qu.: 6.00   1st Qu.: 0.000     1st Qu.:10.00   1st Qu.: 4.00  
##  Median :18.00   Median : 0.000     Median :12.00   Median : 6.00  
##  Mean   :27.06   Mean   : 1.581     Mean   :18.21   Mean   :10.13  
##  3rd Qu.:43.00   3rd Qu.: 0.000     3rd Qu.:24.00   3rd Qu.:11.00  
##  Max.   :97.00   Max.   :91.000     Max.   :97.00   Max.   :64.00

Data Cleaning

song$streams <- as.numeric(as.character(song$streams))
## Warning: NAs introduced by coercion
song$in_shazam_charts <- as.numeric(as.character(song$in_shazam_charts))
## Warning: NAs introduced by coercion
song <- na.omit(song)
song <- song %>% 
  select(-track_name, -artist.s._name, -in_deezer_playlists, -in_deezer_charts, -in_shazam_charts, -key, -mode)
glimpse(song)
## Rows: 895
## Columns: 17
## $ artist_count         <int> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 3…
## $ released_year        <int> 2023, 2023, 2023, 2019, 2023, 2023, 2023, 2023, 2…
## $ released_month       <int> 7, 3, 6, 8, 5, 6, 3, 7, 5, 3, 4, 7, 12, 2, 3, 3, …
## $ released_day         <int> 14, 23, 30, 23, 18, 1, 16, 7, 15, 17, 17, 7, 8, 2…
## $ in_spotify_playlists <int> 553, 1474, 1397, 7858, 3133, 2186, 3090, 714, 109…
## $ in_spotify_charts    <int> 147, 48, 113, 100, 50, 91, 50, 43, 83, 44, 40, 55…
## $ streams              <dbl> 141381703, 133716286, 140003974, 800840817, 30323…
## $ in_apple_playlists   <int> 43, 48, 94, 116, 84, 67, 34, 25, 60, 49, 41, 37, …
## $ in_apple_charts      <int> 263, 126, 207, 207, 133, 213, 222, 89, 210, 110, …
## $ bpm                  <int> 125, 92, 138, 170, 144, 141, 148, 100, 130, 170, …
## $ danceability_.       <int> 80, 71, 51, 55, 65, 92, 67, 67, 85, 81, 57, 78, 6…
## $ valence_.            <int> 89, 61, 32, 58, 23, 66, 83, 26, 22, 56, 56, 52, 4…
## $ energy_.             <int> 83, 74, 53, 72, 80, 58, 76, 71, 62, 48, 72, 82, 7…
## $ acousticness_.       <int> 31, 7, 17, 11, 14, 19, 48, 37, 12, 21, 23, 18, 5,…
## $ instrumentalness_.   <int> 0, 0, 0, 0, 63, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0,…
## $ liveness_.           <int> 8, 10, 31, 11, 11, 8, 8, 11, 28, 8, 27, 15, 16, 3…
## $ speechiness_.        <int> 4, 4, 6, 15, 6, 24, 3, 4, 9, 33, 5, 7, 4, 3, 16, …

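The glimpse above shows the data after cleaning: streams is now numeric, rows with missing values have been removed (895 of the 953 rows remain), and the track name, artist name, Deezer, Shazam, key and mode columns have been dropped, leaving 17 variables.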
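The coercion warnings in the cleaning step mean that some entries could not be parsed as numbers. Below is a minimal sketch of how one might inspect such entries before dropping them. The vector `raw` is made up for illustration and its values are not taken from the dataset; stripping commas is only one possible repair and is not what the report does.

```r
# Hypothetical character column with two entries that are not plain numbers
raw <- c("141381703", "133716286", "not-a-number", "1,021")

# Convert, silencing the expected coercion warning
parsed <- suppressWarnings(as.numeric(raw))

# Entries that became NA only because of the conversion
bad <- raw[is.na(parsed) & !is.na(raw)]
print(bad)

# One possible repair (an assumption, not the report's method):
# strip thousands separators, then convert again
repaired <- suppressWarnings(as.numeric(gsub(",", "", raw, fixed = TRUE)))
print(repaired)
```

Looking at `bad` first makes it possible to tell whether rows are being dropped because of formatting or because the value is genuinely unusable.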

Data Analysis

Splitting the Data Set

First, we separate the data into a training set and a test set.

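# Draw 447 row indices at random from all rows for the test set;
# the remaining rows form the training set
S <- sample(seq_len(nrow(song)), 447)
songtest_sample <- song[S,]
songtrain_sample <- song[-S, ]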

Export CSV files

#write.csv(songtest_sample, "songtest.csv")
#write.csv(songtrain_sample, "songtrain.csv")

songtest <- read.csv("songtest.csv")
songtrain <- read.csv("songtrain.csv")

songtest <- songtest %>% 
  select(-X)

songtrain2 <- songtrain %>% 
  select(-X)

Because a random split produces a different data set on every run, I exported the training and test sets as CSV files and based the report on those files only. The export calls are commented out with a hash so that re-running the code does not overwrite the files with a new split.
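An alternative way to keep a random split stable across runs is to fix the random seed before sampling. The sketch below illustrates the idea on a toy data frame. The seed value, the object names and the 50/50 split size are assumptions made for illustration, not what this report used.

```r
# Toy data frame standing in for the cleaned data (hypothetical values)
toy <- data.frame(id = 1:20, y = rnorm(20))

set.seed(2023)  # arbitrary seed, chosen only for illustration
test_idx <- sample(seq_len(nrow(toy)), size = nrow(toy) %/% 2)

toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

# Re-seeding and sampling again reproduces exactly the same indices
set.seed(2023)
same_idx <- sample(seq_len(nrow(toy)), size = nrow(toy) %/% 2)
print(identical(test_idx, same_idx))
```

Exporting the split to CSV, as done above, has the extra benefit that the files do not depend on the R version's random number generator.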

Examination of variables

For each variable, we draw a histogram to check whether it is approximately normally distributed, and we plot it against song popularity to see whether a quadratic curve fits better than a straight line.
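Since the same scatter plot with linear and quadratic fits is drawn for every variable, the plotting code could be wrapped in a small helper. The function `plot_linear_vs_quadratic` and the toy data below are hypothetical and only sketch the idea; the sections that follow keep the explicit code.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of `yvar` against `xvar` with a linear (red)
# and a quadratic (orange) least-squares fit overlaid
plot_linear_vs_quadratic <- function(data, xvar, yvar = "Logstreams") {
  ggplot(data, aes(.data[[xvar]], .data[[yvar]])) +
    geom_point(color = "royalblue1") +
    stat_smooth(method = "lm", formula = y ~ poly(x, 1), color = "red") +
    stat_smooth(method = "lm", formula = y ~ poly(x, 2), color = "orange") +
    labs(x = xvar, y = yvar)
}

# Toy data for illustration only (not the Spotify data)
set.seed(1)
toy <- data.frame(bpm = runif(100, 65, 206))
toy$Logstreams <- 19 + rnorm(100)

p <- plot_linear_vs_quadratic(toy, "bpm")
```

Printing `p` draws the plot. Passing the column name as a string lets the same function be reused for each predictor.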

streams

hist(songtrain2$streams, col = 'skyblue', xlab = "Streams", main = "Histogram of Streams")

The histogram is skewed to the right, so streams is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain2 %>% 
  mutate(Logstreams = log(streams)) %>% 
  select(-streams)


hist(songtrain3$Logstreams, col = "royalblue", xlab = "Log of Streams", main = "Histogram of Log Streams")

After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.
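The visual impression from the two histograms can be backed up with a number. The sketch below computes the sample skewness before and after a log transformation on simulated right-skewed data. The `skewness` helper and the simulated values are assumptions for illustration and do not reproduce the Spotify data.

```r
# Sample skewness (third standardized moment); a small hypothetical helper
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
streams_sim <- rlnorm(1000, meanlog = 19, sdlog = 1)  # simulated, right-skewed

skew_raw <- skewness(streams_sim)
skew_log <- skewness(log(streams_sim))
cat("raw:", round(skew_raw, 2), " log:", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation supports the reading of the second histogram as roughly symmetric.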

artist_count

hist(songtrain2$artist_count, col = "skyblue", xlab = "Artist Count", main = "Histogram of Artist Count")

The histogram is skewed to the right, so this variable is not approximately normally distributed. This is because most songs are by a single artist. However, we will still consider this variable for the model.

songtrain3 %>%
  ggplot(aes(artist_count, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against artist_count, the quadratic and linear fits are almost identical. We will therefore not consider I(artist_count^2) as a candidate term for the best model.

released_year

hist(songtrain2$released_year, col = "skyblue", xlab = "Released Year", main = "Histogram of Released Year")

The histogram is skewed to the left, so this variable is not approximately normally distributed. This is because most of the songs are recent releases, since this is a Spotify chart from 2023. However, we will still consider this variable for the model.

songtrain3 %>%
  ggplot(aes(released_year, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against released_year, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly more closely. We will therefore consider I(released_year^2) as a candidate term for the best model.
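When the two fitted curves look similar, a nested-model F test gives a more objective comparison than the plot alone. The sketch below runs it on simulated data with a known quadratic effect. The variable names and coefficients are made up for illustration, and this test is not part of the original analysis.

```r
set.seed(1)
x <- runif(200, -3, 3)
y <- 1 + 0.5 * x - 0.3 * x^2 + rnorm(200)  # simulated with a true quadratic effect
toy <- data.frame(x = x, y = y)

fit_linear    <- lm(y ~ x, data = toy)
fit_quadratic <- lm(y ~ x + I(x^2), data = toy)

# Nested-model F test: a small p-value favours keeping the squared term
cmp <- anova(fit_linear, fit_quadratic)
p_value <- cmp[["Pr(>F)"]][2]
print(cmp)
```

The same comparison could be applied to Logstreams and any predictor whose quadratic fit looks only marginally better.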

released_month

hist(songtrain2$released_month, col = "skyblue", xlab = "Released Month", main = "Histogram of Released Month")

The histogram of released_month shows that this variable is approximately normally distributed, so we do not need to take its log.

songtrain3 %>%
  ggplot(aes(released_month, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against released_month, the quadratic and linear fits are almost identical. We will therefore not consider I(released_month^2) as a candidate term for the best model.

released_day

hist(songtrain2$released_day, col = "skyblue", xlab = "Released day of the song", main = "Histogram of Released Day")

The histogram of released_day shows that this variable is approximately normally distributed, so we do not need to take its log.

songtrain3 %>%
  ggplot(aes(released_day, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against released_day, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly more closely. We will therefore consider I(released_day^2) as a candidate term for the best model.

in_spotify_playlists

hist(songtrain2$in_spotify_playlists, col = "skyblue", xlab = "Number of Spotify playlists the song is included in", main = "Histogram of Spotify Playlists")

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Login_spotify_playlists = log(in_spotify_playlists)) %>% 
  select(-in_spotify_playlists)

hist(songtrain3$Login_spotify_playlists, col ="royalblue", xlab ="Log of the number of Spotify playlists the song is included in", main = "Histogram of Log Spotify Playlists")

After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.

songtrain3 %>%
  ggplot(aes(Login_spotify_playlists, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against Login_spotify_playlists, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly more closely. We will therefore consider I(Login_spotify_playlists^2) as a candidate term for the best model.

in_spotify_charts

hist(songtrain2$in_spotify_charts, col = "skyblue", xlab = "Rank of the song on Spotify charts", main = "Histogram of Spotify Ranking")

The histogram is skewed to the right because most of the songs in this dataset have not appeared on the chart. However, we will still consider this variable as a predictor since, in theory, the more often a song appears on the charts, the more streams it should have.

songtrain3 %>%
  ggplot(aes(in_spotify_charts, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange") 

In the plot of Logstreams against in_spotify_charts, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly more closely. We will therefore consider I(in_spotify_charts^2) as a candidate term for the best model.

in_apple_playlists

hist(songtrain2$in_apple_playlists, col = "skyblue", xlab = "Number of Apple Music playlists the song is included in", main = "Histogram of Apple Music Playlists")

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Login_apple_playlists = log(in_apple_playlists + 1)) %>% 
  select(-in_apple_playlists)

hist(songtrain3$Login_apple_playlists, col ="royalblue", xlab = "Log of the number of Apple Music playlists the song is included in", main = "Histogram of Log Apple Music Playlists")

We add one before taking the log to avoid infinite values when the count is zero. After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.
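The effect of adding one before taking the log can be seen on a few made-up counts. The vector `counts` below is hypothetical. `log1p()` is a base R function that computes log(1 + x) and is shown as an equivalent alternative to the `log(x + 1)` form used in this report.

```r
counts <- c(0, 1, 13, 34, 672)  # hypothetical playlist counts, including a zero

plain   <- log(counts)      # the zero maps to -Inf
shifted <- log(counts + 1)  # form used in this report: the zero maps to 0
builtin <- log1p(counts)    # built-in equivalent of log(x + 1)

print(plain)
print(shifted)
```

Any model term built from `shifted` should be interpreted on the log(x + 1) scale rather than the plain log scale.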

songtrain3 %>%
  ggplot(aes(Login_apple_playlists, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against Login_apple_playlists, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly more closely. We will therefore consider I(Login_apple_playlists^2) as a candidate term for the best model.

in_apple_charts

hist(songtrain2$in_apple_charts, col = "skyblue", xlab = "Rank of the song on Apple Music charts", main = "Histogram of Apple Music Chart Rank")

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Login_apple_charts = log(in_apple_charts + 1)) %>% 
  select(-in_apple_charts)

hist(songtrain3$Login_apple_charts, col ="royalblue", xlab = "Log of the rank of the song on Apple Music charts", main = "Histogram of Log Apple Music Chart Rank")

We add one before taking the log to avoid infinite values when the value is zero. After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.

songtrain3 %>%
  ggplot(aes(Login_apple_charts, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against Login_apple_charts, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly more closely. We will therefore consider I(Login_apple_charts^2) as a candidate term for the best model.

bpm

hist(songtrain2$bpm, col = "skyblue", xlab = "Beats per Minute", main = "Histogram of BPM")

The histogram of bpm shows that this variable is approximately normally distributed, so we do not need to take its log.

songtrain3 %>%
  ggplot(aes(bpm, Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against bpm, the quadratic and linear fits are almost identical. We will therefore not consider I(bpm^2) as a candidate term for the best model.

danceability_.

hist(songtrain2$danceability_., col = "skyblue", xlab = "Danceability", main = "Histogram of Danceability")

The histogram of danceability_. shows that this variable is approximately normally distributed, so we do not need to take its log.

songtrain3 %>%
  ggplot(aes(danceability_., Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against danceability_., the quadratic and linear fits are almost identical. We will therefore not consider I(danceability_.^2) as a candidate term for the best model.

valence_.

hist(songtrain2$valence_., col = "skyblue", xlab = "Positivity of the song's musical content", main = "Histogram of Valence")

The histogram of valence_. shows that this variable is approximately normally distributed, so we do not need to take its log.

songtrain3 %>%
  ggplot(aes(valence_., Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against valence_., the quadratic and linear fits are almost identical. We will therefore not consider I(valence_.^2) as a candidate term for the best model.

energy_.

hist(songtrain2$energy_., col = "skyblue", xlab = "Energy of the song", main = "Histogram of energy")

The histogram of energy_. shows that this variable is approximately normally distributed, so we do not need to take its log.

songtrain3 %>%
  ggplot(aes(energy_., Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against energy_., the quadratic and linear fits are almost identical. We will therefore not consider I(energy_.^2) as a candidate term for the best model.

acousticness_.

hist(songtrain2$acousticness_., col = "skyblue", xlab = "Acousticness", main = "Histogram of Acousticness")

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Logacousticness_. = log(acousticness_. + 1)) %>% 
  select(-acousticness_.)

hist(songtrain3$Logacousticness_., col ="royalblue", xlab = "Log of acousticness", main = "Histogram of log acousticness")

We add one before taking the log to avoid infinite values when the value is zero. After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.

songtrain3 %>%
  ggplot(aes(Logacousticness_., Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against Logacousticness_., the quadratic and linear fits are almost identical. We will therefore not consider I(Logacousticness_.^2) as a candidate term for the best model.

instrumentalness_.

hist(songtrain2$instrumentalness_., col = "skyblue", xlab = "Instrumentalness", main = "Histogram of instrumentalness")

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Loginstrumentalness_. = log(instrumentalness_. + 1)) %>% 
  select(-instrumentalness_.)

hist(songtrain3$Loginstrumentalness_., col ="royalblue", xlab = "Log of instrumentalness", main = "Log of instrumentalness")

We add one before taking the log to avoid infinite values when the value is zero. After the log transformation, the histogram is still not approximately normal, so taking the log does not fix the skew. We will therefore not consider this variable for the model.
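One likely reason the transformation does not help is the large share of exact zeros: in the summary of the full dataset, the third quartile of instrumentalness_. is 0, so at least three quarters of the songs have a value of zero. A monotone transformation such as log(x + 1) cannot spread out a spike of identical values. The sketch below shows this on simulated data; the vector `instr` and the 90% zero share are assumptions for illustration.

```r
set.seed(1)
# Simulated column in which most values are exactly zero (hypothetical)
instr <- ifelse(runif(500) < 0.9, 0, sample(1:91, 500, replace = TRUE))

share_zero           <- mean(instr == 0)
share_zero_after_log <- mean(log1p(instr) == 0)

cat("share of zeros before:", share_zero,
    " after log(x + 1):", share_zero_after_log, "\n")
```

Because the two shares are identical, the histogram keeps its spike at zero after the transformation.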

liveness_.

hist(songtrain2$liveness_., col = "skyblue", xlab = "Presence of live performance elements (%)", main = "Histogram of Live Elements")

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Logliveness_. = log(liveness_.)) %>% 
  select(-liveness_.)

hist(songtrain3$Logliveness_., col ="royalblue", xlab = "Log of the presence of live performance elements (%)", main = "Histogram of Log Live Elements") 

After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.

songtrain3 %>%
  ggplot(aes(Logliveness_., Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against Logliveness_., the quadratic and linear fits are almost identical. We will therefore not consider I(Logliveness_.^2) as a candidate term for the best model.

speechiness_.

hist(songtrain2$speechiness_., col = "skyblue", xlab = "Speechiness", main = "Histogram of speechiness" )

The histogram is skewed to the right, so this variable is not approximately normally distributed. We therefore take the log of this variable.

songtrain3 <- songtrain3 %>% 
  mutate(Logspeechiness_. = log(speechiness_.)) %>% 
  select(-speechiness_.)

hist(songtrain3$Logspeechiness_., col ="royalblue", xlab = "Log speechiness", main = "Histogram of log speechiness")

After the log transformation, the histogram is approximately normal, so taking the log fixes the skew. We will therefore use the log of this variable in the model.

songtrain3 %>%
  ggplot(aes(Logspeechiness_., Logstreams))+
  geom_point(color = "royalblue1")+
  stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
  stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")

In the plot of Logstreams against Logspeechiness_., the quadratic and linear fits are almost identical. We will therefore not consider I(Logspeechiness_.^2) as a candidate term for the best model.

Interaction Plot

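To check for an interaction between released_day and Login_spotify_playlists, we split the data into three groups by Login_spotify_playlists (at most 7, above 7 up to 9, and above 9) and plot Logstreams against released_day within each group. If the two variables interact, the fitted slope should differ noticeably between groups.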

songtrain3 %>%
  filter(Login_spotify_playlists >= 0 & Login_spotify_playlists <= 7) %>%
  ggplot(aes(released_day, Logstreams)) +
  geom_point(color = "skyblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Interaction Plot of released_day and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'

songtrain3 %>%
  filter(Login_spotify_playlists > 7 & Login_spotify_playlists <= 9) %>%
  ggplot(aes(released_day, Logstreams)) +
  geom_point(color = "royalblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Interaction Plot of released_day and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'

songtrain3 %>%
  filter(Login_spotify_playlists > 9) %>%
  ggplot(aes(released_day, Logstreams)) +
  geom_point(color = "navy") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Interaction Plot of released_day and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'

These plots show no clear trend: the slope of Logstreams against released_day does not change systematically across the Login_spotify_playlists groups. We therefore conclude that the effect of released_day does not depend on Login_spotify_playlists, and we will not include this interaction term in the best model.

Next, we check one more candidate interaction, between energy_. and Login_spotify_playlists.

songtrain3 %>%
  filter(Login_spotify_playlists >= 0 & Login_spotify_playlists <= 7) %>%
  ggplot(aes(energy_., Logstreams)) +
  geom_point(color = "skyblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Interaction Plot of energy_. and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'

songtrain3 %>%
  filter(Login_spotify_playlists > 7 & Login_spotify_playlists <= 9) %>%
  ggplot(aes(energy_., Logstreams)) +
  geom_point(color = "royalblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Interaction Plot of energy_. and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'

songtrain3 %>%
  filter(Login_spotify_playlists > 9) %>%
  ggplot(aes(energy_., Logstreams)) +
  geom_point(color = "navy") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Interaction Plot of energy_. and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'

Again, these plots show no clear trend: the slope of Logstreams against energy_. does not change systematically across the Login_spotify_playlists groups. We therefore conclude that the effect of energy_. does not depend on Login_spotify_playlists, and we will not include this interaction term in the best model.
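The grouped plots are a visual check. A complementary numerical check is to fit a model that includes the interaction term and inspect its coefficient. The sketch below does this on simulated data generated without any interaction. The column names and coefficients are hypothetical, and this model is not one fitted in the report.

```r
set.seed(1)
n <- 300
toy <- data.frame(
  energy   = runif(n, 9, 97),
  log_play = runif(n, 3, 11)
)
# Simulated response with NO interaction between the two predictors
toy$log_streams <- 10 + 0.9 * toy$log_play + 0.001 * toy$energy + rnorm(n, sd = 0.5)

# `energy * log_play` expands to both main effects plus energy:log_play
fit <- lm(log_streams ~ energy * log_play, data = toy)
coefs <- summary(fit)$coefficients

# Estimate, standard error, t value and p-value of the interaction term
interaction_row <- coefs["energy:log_play", ]
print(interaction_row)
```

A large p-value for the `energy:log_play` row would agree with the conclusion drawn from the plots that the interaction term can be left out.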

Correlation Test

cor(songtrain3)
##                         artist_count released_year released_month released_day
## artist_count             1.000000000   0.106810521   -0.008180603  0.056057079
## released_year            0.106810521   1.000000000    0.061312795  0.078709598
## released_month          -0.008180603   0.061312795    1.000000000  0.075768333
## released_day             0.056057079   0.078709598    0.075768333  1.000000000
## in_spotify_charts       -0.044775331  -0.033404658   -0.046296838 -0.013569012
## bpm                     -0.015177062   0.029475277   -0.047385342 -0.059158876
## danceability_.           0.267606806   0.113874864   -0.056125166  0.031512743
## valence_.                0.181556082  -0.071527723   -0.186141454  0.046435367
## energy_.                 0.190884266   0.002169269   -0.059985458  0.048946540
## Logstreams              -0.058810263  -0.275761046    0.043743792  0.107736939
## Login_spotify_playlists -0.026381520  -0.434432988   -0.007532317  0.007167030
## Login_apple_playlists    0.101961289  -0.260879144    0.003830782  0.064866354
## Login_apple_charts      -0.151447141  -0.143568654    0.060185043 -0.035998023
## Logacousticness_.       -0.058880378   0.013540150   -0.008621915  0.002118181
## Loginstrumentalness_.   -0.128318562  -0.104007321    0.065836377  0.015342311
## Logliveness_.            0.059694527   0.051878154   -0.016946051 -0.016049702
## Logspeechiness_.         0.159523734   0.118884400    0.015485453 -0.016884250
##                         in_spotify_charts          bpm danceability_.
## artist_count                  -0.04477533 -0.015177062     0.26760681
## released_year                 -0.03340466  0.029475277     0.11387486
## released_month                -0.04629684 -0.047385342    -0.05612517
## released_day                  -0.01356901 -0.059158876     0.03151274
## in_spotify_charts              1.00000000  0.016017589     0.01998765
## bpm                            0.01601759  1.000000000    -0.10224367
## danceability_.                 0.01998765 -0.102243670     1.00000000
## valence_.                      0.08373595 -0.015341812     0.39487445
## energy_.                       0.06107525  0.007069839     0.18059295
## Logstreams                     0.32427195 -0.006475977     0.02739416
## Login_spotify_playlists        0.19343294 -0.035757248    -0.02942905
## Login_apple_playlists          0.19039098 -0.030208978     0.13793751
## Login_apple_charts             0.30247000 -0.083644732     0.01569308
## Logacousticness_.             -0.03801276 -0.076463396    -0.15953767
## Loginstrumentalness_.         -0.07537085 -0.048405969    -0.17970892
## Logliveness_.                 -0.02628857 -0.029554319    -0.06622832
## Logspeechiness_.              -0.04677227  0.097613871     0.20909260
##                           valence_.     energy_.   Logstreams
## artist_count             0.18155608  0.190884266 -0.058810263
## released_year           -0.07152772  0.002169269 -0.275761046
## released_month          -0.18614145 -0.059985458  0.043743792
## released_day             0.04643537  0.048946540  0.107736939
## in_spotify_charts        0.08373595  0.061075249  0.324271949
## bpm                     -0.01534181  0.007069839 -0.006475977
## danceability_.           0.39487445  0.180592952  0.027394160
## valence_.                1.00000000  0.373863941 -0.001801740
## energy_.                 0.37386394  1.000000000  0.054954160
## Logstreams              -0.00180174  0.054954160  1.000000000
## Login_spotify_playlists -0.03850706  0.040532652  0.762942174
## Login_apple_playlists    0.06726161  0.114414061  0.698380539
## Login_apple_charts       0.02198265  0.124607248  0.497448939
## Logacousticness_.       -0.03100504 -0.448604051 -0.101385863
## Loginstrumentalness_.   -0.12783397 -0.101127695  0.020105996
## Logliveness_.           -0.01432861  0.063322383 -0.043590364
## Logspeechiness_.         0.08631690  0.111212690 -0.158019883
##                         Login_spotify_playlists Login_apple_playlists
## artist_count                       -0.026381520           0.101961289
## released_year                      -0.434432988          -0.260879144
## released_month                     -0.007532317           0.003830782
## released_day                        0.007167030           0.064866354
## in_spotify_charts                   0.193432944           0.190390983
## bpm                                -0.035757248          -0.030208978
## danceability_.                     -0.029429048           0.137937508
## valence_.                          -0.038507061           0.067261607
## energy_.                            0.040532652           0.114414061
## Logstreams                          0.762942174           0.698380539
## Login_spotify_playlists             1.000000000           0.724623847
## Login_apple_playlists               0.724623847           1.000000000
## Login_apple_charts                  0.345094228           0.456850561
## Logacousticness_.                  -0.145580828          -0.200347131
## Loginstrumentalness_.               0.065869484          -0.039075810
## Logliveness_.                      -0.035924257          -0.050530857
## Logspeechiness_.                   -0.083378273          -0.129618095
##                         Login_apple_charts Logacousticness_.
## artist_count                   -0.15144714      -0.058880378
## released_year                  -0.14356865       0.013540150
## released_month                  0.06018504      -0.008621915
## released_day                   -0.03599802       0.002118181
## in_spotify_charts               0.30247000      -0.038012756
## bpm                            -0.08364473      -0.076463396
## danceability_.                  0.01569308      -0.159537667
## valence_.                       0.02198265      -0.031005043
## energy_.                        0.12460725      -0.448604051
## Logstreams                      0.49744894      -0.101385863
## Login_spotify_playlists         0.34509423      -0.145580828
## Login_apple_playlists           0.45685056      -0.200347131
## Login_apple_charts              1.00000000      -0.144498524
## Logacousticness_.              -0.14449852       1.000000000
## Loginstrumentalness_.          -0.03320082       0.059981137
## Logliveness_.                   0.03014205      -0.009913390
## Logspeechiness_.               -0.13558022       0.002784047
##                         Loginstrumentalness_. Logliveness_. Logspeechiness_.
## artist_count                      -0.12831856    0.05969453      0.159523734
## released_year                     -0.10400732    0.05187815      0.118884400
## released_month                     0.06583638   -0.01694605      0.015485453
## released_day                       0.01534231   -0.01604970     -0.016884250
## in_spotify_charts                 -0.07537085   -0.02628857     -0.046772270
## bpm                               -0.04840597   -0.02955432      0.097613871
## danceability_.                    -0.17970892   -0.06622832      0.209092601
## valence_.                         -0.12783397   -0.01432861      0.086316903
## energy_.                          -0.10112769    0.06332238      0.111212690
## Logstreams                         0.02010600   -0.04359036     -0.158019883
## Login_spotify_playlists            0.06586948   -0.03592426     -0.083378273
## Login_apple_playlists             -0.03907581   -0.05053086     -0.129618095
## Login_apple_charts                -0.03320082    0.03014205     -0.135580220
## Logacousticness_.                  0.05998114   -0.00991339      0.002784047
## Loginstrumentalness_.              1.00000000   -0.07352980     -0.186365272
## Logliveness_.                     -0.07352980    1.00000000     -0.013093706
## Logspeechiness_.                  -0.18636527   -0.01309371      1.000000000

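From the result, we can see that no pair of predictors is correlated strongly enough to justify removing one of them. The largest correlation between two predictors is about 0.72 (Login_spotify_playlists and Login_apple_playlists). So we do not remove any variable.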
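Reading a 17 by 17 matrix by eye is error-prone, so the strongly correlated pairs can also be listed programmatically. This is a minimal sketch on simulated data: the data frame `demo`, its columns, and the `cutoff` of 0.8 are assumptions chosen for illustration, not values used in this project.

```r
# Simulated stand-in data (hypothetical; not the Spotify dataset)
set.seed(2)
n <- 300
a <- rnorm(n)
demo <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # built to be strongly correlated with a
  c = rnorm(n)
)

cm <- cor(demo)

# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

cutoff <- 0.8  # assumed threshold, for illustration only
idx <- which(abs(cm) > cutoff, arr.ind = TRUE)

# One row per pair whose absolute correlation exceeds the cutoff
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Replacing `demo` with songtrain3 would list any predictor pairs above the chosen cutoff, including pairs that involve the response Logstreams.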

Best Subsets

best.subset <- regsubsets(Logstreams ~ . + I(Login_apple_charts^2) + I(Login_apple_playlists ^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), songtrain3,nvmax = 22)
sum <- summary(best.subset)
sum$outmat
##           artist_count released_year released_month released_day
## 1  ( 1 )  " "          " "           " "            " "         
## 2  ( 1 )  " "          " "           " "            " "         
## 3  ( 1 )  " "          " "           " "            " "         
## 4  ( 1 )  " "          " "           " "            " "         
## 5  ( 1 )  " "          " "           " "            " "         
## 6  ( 1 )  " "          " "           " "            " "         
## 7  ( 1 )  " "          "*"           " "            " "         
## 8  ( 1 )  " "          "*"           " "            " "         
## 9  ( 1 )  " "          "*"           " "            " "         
## 10  ( 1 ) " "          "*"           " "            " "         
## 11  ( 1 ) " "          "*"           " "            " "         
## 12  ( 1 ) " "          "*"           " "            " "         
## 13  ( 1 ) " "          "*"           " "            " "         
## 14  ( 1 ) " "          "*"           " "            "*"         
## 15  ( 1 ) "*"          "*"           " "            "*"         
## 16  ( 1 ) " "          "*"           "*"            "*"         
## 17  ( 1 ) "*"          "*"           "*"            "*"         
## 18  ( 1 ) "*"          "*"           "*"            "*"         
## 19  ( 1 ) "*"          "*"           "*"            "*"         
## 20  ( 1 ) "*"          "*"           "*"            "*"         
## 21  ( 1 ) "*"          "*"           "*"            "*"         
## 22  ( 1 ) "*"          "*"           "*"            "*"         
##           in_spotify_charts bpm danceability_. valence_. energy_.
## 1  ( 1 )  " "               " " " "            " "       " "     
## 2  ( 1 )  " "               " " " "            " "       " "     
## 3  ( 1 )  " "               " " " "            " "       " "     
## 4  ( 1 )  "*"               " " " "            " "       " "     
## 5  ( 1 )  "*"               " " " "            " "       " "     
## 6  ( 1 )  "*"               " " " "            " "       " "     
## 7  ( 1 )  "*"               " " " "            " "       " "     
## 8  ( 1 )  "*"               " " " "            " "       " "     
## 9  ( 1 )  "*"               " " " "            " "       " "     
## 10  ( 1 ) "*"               "*" " "            " "       " "     
## 11  ( 1 ) "*"               "*" " "            " "       " "     
## 12  ( 1 ) "*"               "*" "*"            " "       " "     
## 13  ( 1 ) "*"               "*" "*"            " "       " "     
## 14  ( 1 ) "*"               "*" "*"            " "       " "     
## 15  ( 1 ) "*"               "*" "*"            " "       " "     
## 16  ( 1 ) "*"               "*" "*"            " "       " "     
## 17  ( 1 ) "*"               "*" "*"            " "       " "     
## 18  ( 1 ) "*"               "*" "*"            "*"       " "     
## 19  ( 1 ) "*"               "*" "*"            "*"       "*"     
## 20  ( 1 ) "*"               "*" "*"            "*"       "*"     
## 21  ( 1 ) "*"               "*" "*"            "*"       "*"     
## 22  ( 1 ) "*"               "*" "*"            "*"       "*"     
##           Login_spotify_playlists Login_apple_playlists Login_apple_charts
## 1  ( 1 )  " "                     " "                   " "               
## 2  ( 1 )  " "                     " "                   "*"               
## 3  ( 1 )  " "                     "*"                   "*"               
## 4  ( 1 )  " "                     "*"                   "*"               
## 5  ( 1 )  " "                     "*"                   "*"               
## 6  ( 1 )  " "                     "*"                   "*"               
## 7  ( 1 )  " "                     "*"                   "*"               
## 8  ( 1 )  " "                     "*"                   "*"               
## 9  ( 1 )  " "                     "*"                   "*"               
## 10  ( 1 ) " "                     "*"                   "*"               
## 11  ( 1 ) " "                     "*"                   "*"               
## 12  ( 1 ) " "                     "*"                   "*"               
## 13  ( 1 ) " "                     "*"                   "*"               
## 14  ( 1 ) " "                     "*"                   "*"               
## 15  ( 1 ) " "                     "*"                   "*"               
## 16  ( 1 ) "*"                     "*"                   "*"               
## 17  ( 1 ) "*"                     "*"                   "*"               
## 18  ( 1 ) "*"                     "*"                   "*"               
## 19  ( 1 ) "*"                     "*"                   "*"               
## 20  ( 1 ) "*"                     "*"                   "*"               
## 21  ( 1 ) "*"                     "*"                   "*"               
## 22  ( 1 ) "*"                     "*"                   "*"               
##           Logacousticness_. Loginstrumentalness_. Logliveness_.
## 1  ( 1 )  " "               " "                   " "          
## 2  ( 1 )  " "               " "                   " "          
## 3  ( 1 )  " "               " "                   " "          
## 4  ( 1 )  " "               " "                   " "          
## 5  ( 1 )  " "               " "                   " "          
## 6  ( 1 )  " "               " "                   " "          
## 7  ( 1 )  " "               " "                   " "          
## 8  ( 1 )  " "               " "                   " "          
## 9  ( 1 )  "*"               " "                   " "          
## 10  ( 1 ) "*"               " "                   " "          
## 11  ( 1 ) "*"               " "                   " "          
## 12  ( 1 ) "*"               " "                   " "          
## 13  ( 1 ) "*"               " "                   " "          
## 14  ( 1 ) "*"               " "                   " "          
## 15  ( 1 ) "*"               " "                   " "          
## 16  ( 1 ) "*"               " "                   " "          
## 17  ( 1 ) "*"               " "                   " "          
## 18  ( 1 ) "*"               " "                   " "          
## 19  ( 1 ) "*"               " "                   " "          
## 20  ( 1 ) "*"               "*"                   " "          
## 21  ( 1 ) "*"               "*"                   " "          
## 22  ( 1 ) "*"               "*"                   "*"          
##           Logspeechiness_. I(Login_apple_charts^2) I(Login_apple_playlists^2)
## 1  ( 1 )  " "              " "                     " "                       
## 2  ( 1 )  " "              " "                     " "                       
## 3  ( 1 )  " "              " "                     " "                       
## 4  ( 1 )  " "              " "                     " "                       
## 5  ( 1 )  " "              " "                     " "                       
## 6  ( 1 )  " "              " "                     " "                       
## 7  ( 1 )  " "              " "                     " "                       
## 8  ( 1 )  " "              " "                     " "                       
## 9  ( 1 )  " "              " "                     " "                       
## 10  ( 1 ) " "              " "                     " "                       
## 11  ( 1 ) "*"              " "                     " "                       
## 12  ( 1 ) "*"              " "                     " "                       
## 13  ( 1 ) "*"              " "                     "*"                       
## 14  ( 1 ) "*"              " "                     "*"                       
## 15  ( 1 ) "*"              " "                     "*"                       
## 16  ( 1 ) "*"              " "                     "*"                       
## 17  ( 1 ) "*"              " "                     "*"                       
## 18  ( 1 ) "*"              " "                     "*"                       
## 19  ( 1 ) "*"              " "                     "*"                       
## 20  ( 1 ) "*"              " "                     "*"                       
## 21  ( 1 ) "*"              "*"                     "*"                       
## 22  ( 1 ) "*"              "*"                     "*"                       
##           I(in_spotify_charts^2) I(Login_spotify_playlists^2) I(released_day^2)
## 1  ( 1 )  " "                    "*"                          " "              
## 2  ( 1 )  " "                    "*"                          " "              
## 3  ( 1 )  " "                    "*"                          " "              
## 4  ( 1 )  " "                    "*"                          " "              
## 5  ( 1 )  " "                    "*"                          "*"              
## 6  ( 1 )  "*"                    "*"                          "*"              
## 7  ( 1 )  " "                    "*"                          "*"              
## 8  ( 1 )  "*"                    "*"                          "*"              
## 9  ( 1 )  "*"                    "*"                          "*"              
## 10  ( 1 ) "*"                    "*"                          "*"              
## 11  ( 1 ) "*"                    "*"                          "*"              
## 12  ( 1 ) "*"                    "*"                          "*"              
## 13  ( 1 ) "*"                    "*"                          "*"              
## 14  ( 1 ) "*"                    "*"                          "*"              
## 15  ( 1 ) "*"                    "*"                          "*"              
## 16  ( 1 ) "*"                    "*"                          "*"              
## 17  ( 1 ) "*"                    "*"                          "*"              
## 18  ( 1 ) "*"                    "*"                          "*"              
## 19  ( 1 ) "*"                    "*"                          "*"              
## 20  ( 1 ) "*"                    "*"                          "*"              
## 21  ( 1 ) "*"                    "*"                          "*"              
## 22  ( 1 ) "*"                    "*"                          "*"              
##           I(released_year^2)
## 1  ( 1 )  " "               
## 2  ( 1 )  " "               
## 3  ( 1 )  " "               
## 4  ( 1 )  " "               
## 5  ( 1 )  " "               
## 6  ( 1 )  " "               
## 7  ( 1 )  "*"               
## 8  ( 1 )  "*"               
## 9  ( 1 )  "*"               
## 10  ( 1 ) "*"               
## 11  ( 1 ) "*"               
## 12  ( 1 ) "*"               
## 13  ( 1 ) "*"               
## 14  ( 1 ) "*"               
## 15  ( 1 ) "*"               
## 16  ( 1 ) "*"               
## 17  ( 1 ) "*"               
## 18  ( 1 ) "*"               
## 19  ( 1 ) "*"               
## 20  ( 1 ) "*"               
## 21  ( 1 ) "*"               
## 22  ( 1 ) "*"

With all 22 variables, the model is:

lm22 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_. + Loginstrumentalness_. + Logliveness_. + Logspeechiness_. + I(Login_apple_charts^2) + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm22)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month + 
##     released_day + in_spotify_charts + bpm + danceability_. + 
##     valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + 
##     Login_apple_charts + Logacousticness_. + Loginstrumentalness_. + 
##     Logliveness_. + Logspeechiness_. + I(Login_apple_charts^2) + 
##     I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41806 -0.30688  0.02724  0.30363  2.10042 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.504e+03  7.687e+02  -3.258 0.001214 ** 
## artist_count                 -3.426e-02  2.817e-02  -1.216 0.224646    
## released_year                 2.514e+00  7.699e-01   3.266 0.001179 ** 
## released_month                9.661e-03  6.900e-03   1.400 0.162168    
## released_day                 -1.484e-02  1.080e-02  -1.374 0.170026    
## in_spotify_charts             4.246e-02  1.089e-02   3.900 0.000112 ***
## bpm                           2.094e-03  8.489e-04   2.467 0.014025 *  
## danceability_.                2.172e-03  1.841e-03   1.180 0.238754    
## valence_.                     8.956e-04  1.236e-03   0.725 0.469001    
## energy_.                     -6.987e-04  1.729e-03  -0.404 0.686370    
## Login_spotify_playlists      -3.478e-01  2.601e-01  -1.337 0.181807    
## Login_apple_playlists         2.742e-01  7.781e-02   3.524 0.000470 ***
## Login_apple_charts            1.091e-01  5.155e-02   2.117 0.034824 *  
## Logacousticness_.             5.097e-02  2.126e-02   2.398 0.016922 *  
## Loginstrumentalness_.        -1.254e-02  3.202e-02  -0.392 0.695586    
## Logliveness_.                -1.150e-02  3.749e-02  -0.307 0.759111    
## Logspeechiness_.             -6.735e-02  3.042e-02  -2.214 0.027382 *  
## I(Login_apple_charts^2)      -3.388e-03  1.043e-02  -0.325 0.745492    
## I(Login_apple_playlists^2)   -2.450e-02  1.332e-02  -1.840 0.066527 .  
## I(in_spotify_charts^2)       -1.058e-03  3.967e-04  -2.667 0.007949 ** 
## I(Login_spotify_playlists^2)  5.045e-02  1.730e-02   2.916 0.003734 ** 
## I(released_day^2)             7.830e-04  3.499e-04   2.238 0.025749 *  
## I(released_year^2)           -6.267e-04  1.928e-04  -3.251 0.001243 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4832 on 425 degrees of freedom
## Multiple R-squared:  0.7331, Adjusted R-squared:  0.7192 
## F-statistic: 53.05 on 22 and 425 DF,  p-value: < 2.2e-16

The adjusted R-squared of this model is fairly high (0.7192). However, several coefficients have p-values above 0.05 (the largest is Logliveness_. at 0.759), so these variables are not statistically significant predictors of Logstreams. Therefore, this is not the best model.
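Whether a term with a large p-value can be dropped can also be checked with a partial F-test between the two nested models. The following is a minimal sketch on simulated data; the data frame `demo` and its columns `x1`, `x2`, `noise`, and `y` are hypothetical. When a single term is dropped, the F-test p-value equals the t-test p-value that summary() reports for that term.

```r
# Simulated stand-in data (hypothetical; not the Spotify dataset)
set.seed(3)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - demo$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise, data = demo)
reduced <- update(full, . ~ . - noise)  # same model without `noise`

# Partial F-test: does removing `noise` significantly worsen the fit?
cmp <- anova(reduced, full)
print(cmp)

# For one dropped term the two p-values coincide
p_f <- cmp[2, "Pr(>F)"]
p_t <- coef(summary(full))["noise", "Pr(>|t|)"]
print(c(F_test = p_f, t_test = p_t))
```

The F-test becomes more useful than individual t-tests when several terms are removed in one step.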

With 21 variables (Logliveness_. removed), the model is:

lm21 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_. + Loginstrumentalness_.  + Logspeechiness_. + I(Login_apple_charts^2) + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm21)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month + 
##     released_day + in_spotify_charts + bpm + danceability_. + 
##     valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + 
##     Login_apple_charts + Logacousticness_. + Loginstrumentalness_. + 
##     Logspeechiness_. + I(Login_apple_charts^2) + I(Login_apple_playlists^2) + 
##     I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) + 
##     I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41143 -0.30965  0.02599  0.29760  2.09039 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.498e+03  7.676e+02  -3.254 0.001227 ** 
## artist_count                 -3.496e-02  2.805e-02  -1.246 0.213351    
## released_year                 2.509e+00  7.688e-01   3.263 0.001191 ** 
## released_month                9.725e-03  6.889e-03   1.412 0.158784    
## released_day                 -1.458e-02  1.075e-02  -1.356 0.175846    
## in_spotify_charts             4.253e-02  1.087e-02   3.911 0.000107 ***
## bpm                           2.103e-03  8.475e-04   2.482 0.013468 *  
## danceability_.                2.224e-03  1.831e-03   1.214 0.225289    
## valence_.                     8.968e-04  1.234e-03   0.727 0.467902    
## energy_.                     -7.286e-04  1.725e-03  -0.422 0.672894    
## Login_spotify_playlists      -3.542e-01  2.590e-01  -1.368 0.172095    
## Login_apple_playlists         2.754e-01  7.763e-02   3.548 0.000431 ***
## Login_apple_charts            1.083e-01  5.142e-02   2.106 0.035771 *  
## Logacousticness_.             5.089e-02  2.123e-02   2.397 0.016960 *  
## Loginstrumentalness_.        -1.173e-02  3.188e-02  -0.368 0.712980    
## Logspeechiness_.             -6.703e-02  3.037e-02  -2.207 0.027853 *  
## I(Login_apple_charts^2)      -3.300e-03  1.042e-02  -0.317 0.751537    
## I(Login_apple_playlists^2)   -2.461e-02  1.330e-02  -1.850 0.064936 .  
## I(in_spotify_charts^2)       -1.059e-03  3.962e-04  -2.672 0.007834 ** 
## I(Login_spotify_playlists^2)  5.085e-02  1.723e-02   2.951 0.003341 ** 
## I(released_day^2)             7.747e-04  3.485e-04   2.223 0.026735 *  
## I(released_year^2)           -6.253e-04  1.925e-04  -3.248 0.001256 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4827 on 426 degrees of freedom
## Multiple R-squared:  0.733,  Adjusted R-squared:  0.7198 
## F-statistic: 55.69 on 21 and 426 DF,  p-value: < 2.2e-16

The adjusted R-squared of this model is fairly high (0.7198). However, several coefficients have p-values above 0.05 (the largest is I(Login_apple_charts^2) at 0.752), so these variables are not statistically significant predictors of Logstreams. Therefore, this is not the best model.

With 20 variables (I(Login_apple_charts^2) also removed), the model is:

lm20 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_. + Loginstrumentalness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm20)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month + 
##     released_day + in_spotify_charts + bpm + danceability_. + 
##     valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + 
##     Login_apple_charts + Logacousticness_. + Loginstrumentalness_. + 
##     Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) + 
##     I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), 
##     data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41393 -0.31178  0.02887  0.29274  2.09638 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.490e+03  7.664e+02  -3.249 0.001248 ** 
## artist_count                 -3.493e-02  2.802e-02  -1.247 0.213221    
## released_year                 2.501e+00  7.676e-01   3.258 0.001212 ** 
## released_month                9.761e-03  6.881e-03   1.419 0.156767    
## released_day                 -1.471e-02  1.073e-02  -1.371 0.171030    
## in_spotify_charts             4.206e-02  1.076e-02   3.909 0.000108 ***
## bpm                           2.097e-03  8.464e-04   2.477 0.013618 *  
## danceability_.                2.243e-03  1.828e-03   1.227 0.220493    
## valence_.                     8.764e-04  1.231e-03   0.712 0.477038    
## energy_.                     -7.124e-04  1.722e-03  -0.414 0.679310    
## Login_spotify_playlists      -3.463e-01  2.575e-01  -1.345 0.179341    
## Login_apple_playlists         2.786e-01  7.689e-02   3.624 0.000325 ***
## Login_apple_charts            9.300e-02  1.759e-02   5.288 1.98e-07 ***
## Logacousticness_.             5.139e-02  2.115e-02   2.430 0.015506 *  
## Loginstrumentalness_.        -1.272e-02  3.169e-02  -0.401 0.688325    
## Logspeechiness_.             -6.704e-02  3.034e-02  -2.210 0.027665 *  
## I(Login_apple_playlists^2)   -2.514e-02  1.318e-02  -1.907 0.057227 .  
## I(in_spotify_charts^2)       -1.045e-03  3.935e-04  -2.656 0.008205 ** 
## I(Login_spotify_playlists^2)  5.034e-02  1.714e-02   2.937 0.003489 ** 
## I(released_day^2)             7.791e-04  3.478e-04   2.240 0.025612 *  
## I(released_year^2)           -6.233e-04  1.922e-04  -3.242 0.001279 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4822 on 427 degrees of freedom
## Multiple R-squared:  0.7329, Adjusted R-squared:  0.7204 
## F-statistic: 58.59 on 20 and 427 DF,  p-value: < 2.2e-16

The adjusted R-squared of this model is fairly high (0.7204). However, several coefficients have p-values above 0.05 (the largest is Loginstrumentalness_. at 0.688), so these variables are not statistically significant predictors of Logstreams. Therefore, this is not the best model.

With 19 variables (Loginstrumentalness_. also removed), the model is:

lm19 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm19)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month + 
##     released_day + in_spotify_charts + bpm + danceability_. + 
##     valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists + 
##     Login_apple_charts + Logacousticness_. + Logspeechiness_. + 
##     I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.40949 -0.30902  0.02315  0.29568  2.09761 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.460e+03  7.619e+02  -3.229 0.001340 ** 
## artist_count                 -3.430e-02  2.795e-02  -1.227 0.220446    
## released_year                 2.470e+00  7.630e-01   3.237 0.001301 ** 
## released_month                9.649e-03  6.868e-03   1.405 0.160788    
## released_day                 -1.469e-02  1.072e-02  -1.370 0.171384    
## in_spotify_charts             4.225e-02  1.074e-02   3.934 9.73e-05 ***
## bpm                           2.108e-03  8.452e-04   2.494 0.013014 *  
## danceability_.                2.300e-03  1.821e-03   1.263 0.207302    
## valence_.                     8.810e-04  1.230e-03   0.716 0.474293    
## energy_.                     -6.993e-04  1.720e-03  -0.407 0.684536    
## Login_spotify_playlists      -3.537e-01  2.566e-01  -1.378 0.168817    
## Login_apple_playlists         2.776e-01  7.678e-02   3.616 0.000335 ***
## Login_apple_charts            9.313e-02  1.757e-02   5.301 1.85e-07 ***
## Logacousticness_.             5.113e-02  2.112e-02   2.421 0.015890 *  
## Logspeechiness_.             -6.514e-02  2.994e-02  -2.176 0.030126 *  
## I(Login_apple_playlists^2)   -2.482e-02  1.315e-02  -1.888 0.059675 .  
## I(in_spotify_charts^2)       -1.048e-03  3.931e-04  -2.667 0.007952 ** 
## I(Login_spotify_playlists^2)  5.075e-02  1.709e-02   2.969 0.003152 ** 
## I(released_day^2)             7.778e-04  3.475e-04   2.238 0.025707 *  
## I(released_year^2)           -6.156e-04  1.911e-04  -3.221 0.001373 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4817 on 428 degrees of freedom
## Multiple R-squared:  0.7328, Adjusted R-squared:  0.721 
## F-statistic: 61.79 on 19 and 428 DF,  p-value: < 2.2e-16

The adjusted R-squared of this model is fairly high (0.721). However, several coefficients have p-values above 0.05 (the largest is energy_. at 0.685), so these variables are not statistically significant predictors of Logstreams. Therefore, this is not the best model.
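Because the same comparison is repeated for every model size, it can help to collect the fit statistics in one table instead of reading each summary separately. This is a minimal sketch on simulated data; the data frame `demo`, the model names `m1` to `m3`, and the choice to report AIC and BIC next to the adjusted R-squared are assumptions for illustration.

```r
# Simulated stand-in data (hypothetical; not the Spotify dataset)
set.seed(4)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - demo$x2 + rnorm(n)

# Candidate models of decreasing size
fits <- list(
  m3 = lm(y ~ x1 + x2 + x3, data = demo),
  m2 = lm(y ~ x1 + x2, data = demo),
  m1 = lm(y ~ x1, data = demo)
)

# One row per model: number of predictors and fit statistics
comparison <- data.frame(
  model  = names(fits),
  n_pred = sapply(fits, function(f) length(coef(f)) - 1),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  AIC    = sapply(fits, AIC),
  BIC    = sapply(fits, BIC)
)
print(comparison)
```

With the real fits, the list would hold lm22, lm21, lm20, and so on, which puts their adjusted R-squared values side by side.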

With 18 variables (energy_. also removed), the model is:

lm18 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + valence_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm18)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month + 
##     released_day + in_spotify_charts + bpm + danceability_. + 
##     valence_. + Login_spotify_playlists + Login_apple_playlists + 
##     Login_apple_charts + Logacousticness_. + Logspeechiness_. + 
##     I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42605 -0.30542  0.02699  0.30007  2.09362 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.423e+03  7.558e+02  -3.206 0.001446 ** 
## artist_count                 -3.588e-02  2.765e-02  -1.298 0.195055    
## released_year                 2.433e+00  7.569e-01   3.215 0.001404 ** 
## released_month                9.729e-03  6.859e-03   1.418 0.156799    
## released_day                 -1.457e-02  1.071e-02  -1.361 0.174087    
## in_spotify_charts             4.203e-02  1.072e-02   3.922 0.000102 ***
## bpm                           2.115e-03  8.441e-04   2.506 0.012580 *  
## danceability_.                2.369e-03  1.811e-03   1.308 0.191620    
## valence_.                     6.925e-04  1.138e-03   0.608 0.543294    
## Login_spotify_playlists      -3.526e-01  2.563e-01  -1.375 0.169721    
## Login_apple_playlists         2.776e-01  7.670e-02   3.620 0.000330 ***
## Login_apple_charts            9.248e-02  1.748e-02   5.291 1.94e-07 ***
## Logacousticness_.             5.500e-02  1.883e-02   2.922 0.003664 ** 
## Logspeechiness_.             -6.656e-02  2.971e-02  -2.241 0.025569 *  
## I(Login_apple_playlists^2)   -2.482e-02  1.313e-02  -1.890 0.059473 .  
## I(in_spotify_charts^2)       -1.039e-03  3.920e-04  -2.650 0.008352 ** 
## I(Login_spotify_playlists^2)  5.073e-02  1.707e-02   2.971 0.003133 ** 
## I(released_day^2)             7.730e-04  3.469e-04   2.228 0.026391 *  
## I(released_year^2)           -6.064e-04  1.896e-04  -3.199 0.001482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4813 on 429 degrees of freedom
## Multiple R-squared:  0.7327, Adjusted R-squared:  0.7215 
## F-statistic: 65.34 on 18 and 429 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.7215). However, the p-values for artist_count, released_month, released_day, danceability_., valence_., Login_spotify_playlists and I(Login_apple_playlists^2) are greater than 0.05, so we cannot conclude that these terms affect the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 17 variables, the model looks like this:

lm17 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm17)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month + 
##     released_day + in_spotify_charts + bpm + danceability_. + 
##     Login_spotify_playlists + Login_apple_playlists + Login_apple_charts + 
##     Logacousticness_. + Logspeechiness_. + I(Login_apple_playlists^2) + 
##     I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) + 
##     I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43203 -0.29333  0.02828  0.29477  2.09569 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.384e+03  7.525e+02  -3.168 0.001643 ** 
## artist_count                 -3.430e-02  2.751e-02  -1.247 0.213051    
## released_year                 2.395e+00  7.537e-01   3.177 0.001594 ** 
## released_month                9.039e-03  6.760e-03   1.337 0.181861    
## released_day                 -1.366e-02  1.059e-02  -1.290 0.197786    
## in_spotify_charts             4.261e-02  1.066e-02   3.996 7.58e-05 ***
## bpm                           2.123e-03  8.434e-04   2.518 0.012179 *  
## danceability_.                2.777e-03  1.681e-03   1.652 0.099339 .  
## Login_spotify_playlists      -3.541e-01  2.561e-01  -1.382 0.167565    
## Login_apple_playlists         2.713e-01  7.594e-02   3.573 0.000393 ***
## Login_apple_charts            9.249e-02  1.747e-02   5.296 1.89e-07 ***
## Logacousticness_.             5.509e-02  1.881e-02   2.928 0.003589 ** 
## Logspeechiness_.             -6.614e-02  2.968e-02  -2.229 0.026351 *  
## I(Login_apple_playlists^2)   -2.356e-02  1.296e-02  -1.818 0.069778 .  
## I(in_spotify_charts^2)       -1.051e-03  3.912e-04  -2.685 0.007525 ** 
## I(Login_spotify_playlists^2)  5.064e-02  1.706e-02   2.968 0.003165 ** 
## I(released_day^2)             7.468e-04  3.440e-04   2.171 0.030487 *  
## I(released_year^2)           -5.969e-04  1.888e-04  -3.162 0.001679 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4809 on 430 degrees of freedom
## Multiple R-squared:  0.7325, Adjusted R-squared:  0.7219 
## F-statistic: 69.26 on 17 and 430 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.7219). However, the p-values for artist_count, released_month, released_day, danceability_., Login_spotify_playlists and I(Login_apple_playlists^2) are greater than 0.05, so we cannot conclude that these terms affect the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 16 variables, the model looks like this:

lm16 <- lm(Logstreams ~ released_year + released_month + released_day + in_spotify_charts + bpm + danceability_. + Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm16)
## 
## Call:
## lm(formula = Logstreams ~ released_year + released_month + released_day + 
##     in_spotify_charts + bpm + danceability_. + Login_spotify_playlists + 
##     Login_apple_playlists + Login_apple_charts + Logacousticness_. + 
##     Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) + 
##     I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), 
##     data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42231 -0.29779  0.01975  0.28636  2.05080 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.387e+03  7.530e+02  -3.170 0.001632 ** 
## released_year                 2.398e+00  7.542e-01   3.179 0.001582 ** 
## released_month                9.005e-03  6.764e-03   1.331 0.183797    
## released_day                 -1.409e-02  1.059e-02  -1.330 0.184119    
## in_spotify_charts             4.239e-02  1.067e-02   3.973 8.32e-05 ***
## bpm                           2.152e-03  8.437e-04   2.550 0.011106 *  
## danceability_.                2.382e-03  1.652e-03   1.442 0.150109    
## Login_spotify_playlists      -3.541e-01  2.563e-01  -1.382 0.167787    
## Login_apple_playlists         2.660e-01  7.587e-02   3.506 0.000502 ***
## Login_apple_charts            9.688e-02  1.712e-02   5.659 2.78e-08 ***
## Logacousticness_.             5.574e-02  1.882e-02   2.962 0.003224 ** 
## Logspeechiness_.             -7.037e-02  2.950e-02  -2.385 0.017495 *  
## I(Login_apple_playlists^2)   -2.386e-02  1.297e-02  -1.840 0.066448 .  
## I(in_spotify_charts^2)       -1.042e-03  3.914e-04  -2.661 0.008074 ** 
## I(Login_spotify_playlists^2)  5.087e-02  1.707e-02   2.980 0.003048 ** 
## I(released_day^2)             7.589e-04  3.441e-04   2.206 0.027934 *  
## I(released_year^2)           -5.977e-04  1.889e-04  -3.164 0.001665 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4812 on 431 degrees of freedom
## Multiple R-squared:  0.7315, Adjusted R-squared:  0.7216 
## F-statistic:  73.4 on 16 and 431 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.7216). However, the p-values for released_month, released_day, danceability_., Login_spotify_playlists and I(Login_apple_playlists^2) are greater than 0.05, so we cannot conclude that these terms affect the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 15 variables, the model looks like this:

lm15 <- lm(Logstreams ~ artist_count + released_year + released_day + in_spotify_charts + bpm + danceability_. + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm15)
## 
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_day + 
##     in_spotify_charts + bpm + danceability_. + Login_apple_playlists + 
##     Login_apple_charts + Logacousticness_. + Logspeechiness_. + 
##     I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48393 -0.31104  0.02565  0.30241  2.12722 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.862e+03  6.950e+02  -4.118 4.58e-05 ***
## artist_count                 -3.418e-02  2.755e-02  -1.241 0.215388    
## released_year                 2.873e+00  6.961e-01   4.128 4.40e-05 ***
## released_day                 -1.474e-02  1.052e-02  -1.400 0.162098    
## in_spotify_charts             4.399e-02  1.064e-02   4.134 4.28e-05 ***
## bpm                           2.073e-03  8.439e-04   2.456 0.014434 *  
## danceability_.                2.776e-03  1.677e-03   1.655 0.098625 .  
## Login_apple_playlists         2.364e-01  7.082e-02   3.338 0.000917 ***
## Login_apple_charts            9.589e-02  1.739e-02   5.515 6.01e-08 ***
## Logacousticness_.             5.627e-02  1.882e-02   2.990 0.002945 ** 
## Logspeechiness_.             -6.616e-02  2.967e-02  -2.230 0.026263 *  
## I(Login_apple_playlists^2)   -1.729e-02  1.204e-02  -1.436 0.151657    
## I(in_spotify_charts^2)       -1.096e-03  3.910e-04  -2.804 0.005279 ** 
## I(Login_spotify_playlists^2)  2.720e-02  2.582e-03  10.534  < 2e-16 ***
## I(released_day^2)             7.877e-04  3.419e-04   2.304 0.021695 *  
## I(released_year^2)           -7.169e-04  1.743e-04  -4.114 4.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 432 degrees of freedom
## Multiple R-squared:  0.7305, Adjusted R-squared:  0.7211 
## F-statistic: 78.06 on 15 and 432 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.7211). However, the p-values for artist_count, released_day, danceability_. and I(Login_apple_playlists^2) are greater than 0.05, so we cannot conclude that these terms affect the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 14 variables, the model looks like this:

lm14 <- lm(Logstreams ~ released_year  + released_day + in_spotify_charts + bpm + danceability_. + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm14)
## 
## Call:
## lm(formula = Logstreams ~ released_year + released_day + in_spotify_charts + 
##     bpm + danceability_. + Login_apple_playlists + Login_apple_charts + 
##     Logacousticness_. + Logspeechiness_. + I(Login_apple_playlists^2) + 
##     I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) + 
##     I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47407 -0.30273  0.02022  0.30182  2.08239 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.865e+03  6.954e+02  -4.120 4.55e-05 ***
## released_year                 2.876e+00  6.966e-01   4.129 4.37e-05 ***
## released_day                 -1.517e-02  1.052e-02  -1.441  0.15022    
## in_spotify_charts             4.377e-02  1.065e-02   4.111 4.71e-05 ***
## bpm                           2.101e-03  8.441e-04   2.489  0.01318 *  
## danceability_.                2.383e-03  1.648e-03   1.446  0.14887    
## Login_apple_playlists         2.311e-01  7.074e-02   3.267  0.00117 ** 
## Login_apple_charts            1.003e-01  1.704e-02   5.884 8.03e-09 ***
## Logacousticness_.             5.692e-02  1.882e-02   3.024  0.00264 ** 
## Logspeechiness_.             -7.039e-02  2.949e-02  -2.387  0.01744 *  
## I(Login_apple_playlists^2)   -1.759e-02  1.205e-02  -1.460  0.14498    
## I(in_spotify_charts^2)       -1.087e-03  3.912e-04  -2.780  0.00568 ** 
## I(Login_spotify_playlists^2)  2.743e-02  2.577e-03  10.642  < 2e-16 ***
## I(released_day^2)             7.999e-04  3.420e-04   2.339  0.01978 *  
## I(released_year^2)           -7.177e-04  1.744e-04  -4.115 4.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4819 on 433 degrees of freedom
## Multiple R-squared:  0.7295, Adjusted R-squared:  0.7208 
## F-statistic: 83.42 on 14 and 433 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.7208). However, the p-values for released_day, danceability_. and I(Login_apple_playlists^2) are greater than 0.05, so we cannot conclude that these terms affect the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 13 variables, the model looks like this:

lm13 <- lm(Logstreams ~ released_year + released_day + in_spotify_charts + bpm + danceability_. + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm13)
## 
## Call:
## lm(formula = Logstreams ~ released_year + released_day + in_spotify_charts + 
##     bpm + danceability_. + Login_apple_playlists + Login_apple_charts + 
##     Logacousticness_. + Logspeechiness_. + I(in_spotify_charts^2) + 
##     I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), 
##     data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46240 -0.28277  0.01942  0.29691  2.08856 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.949e+03  6.940e+02  -4.249 2.63e-05 ***
## released_year                 2.961e+00  6.951e-01   4.260 2.51e-05 ***
## released_day                 -1.421e-02  1.052e-02  -1.351  0.17738    
## in_spotify_charts             4.380e-02  1.066e-02   4.109 4.75e-05 ***
## bpm                           2.202e-03  8.423e-04   2.614  0.00925 ** 
## danceability_.                2.484e-03  1.649e-03   1.506  0.13269    
## Login_apple_playlists         1.368e-01  2.887e-02   4.737 2.94e-06 ***
## Login_apple_charts            9.773e-02  1.697e-02   5.758 1.61e-08 ***
## Logacousticness_.             5.671e-02  1.885e-02   3.009  0.00277 ** 
## Logspeechiness_.             -6.871e-02  2.951e-02  -2.328  0.02036 *  
## I(in_spotify_charts^2)       -1.076e-03  3.916e-04  -2.749  0.00623 ** 
## I(Login_spotify_playlists^2)  2.574e-02  2.307e-03  11.157  < 2e-16 ***
## I(released_day^2)             7.798e-04  3.421e-04   2.279  0.02314 *  
## I(released_year^2)           -7.390e-04  1.740e-04  -4.247 2.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4825 on 434 degrees of freedom
## Multiple R-squared:  0.7282, Adjusted R-squared:   0.72 
## F-statistic: 89.44 on 13 and 434 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.72). However, the p-values for released_day and danceability_. are greater than 0.05, so we cannot conclude that these terms affect the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 12 variables, the model looks like this:

lm12 <- lm(Logstreams ~ released_year + in_spotify_charts + bpm + danceability_. + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),songtrain3)
summary(lm12)
## 
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts + 
##     bpm + danceability_. + Login_apple_playlists + Login_apple_charts + 
##     Logacousticness_. + Logspeechiness_. + I(in_spotify_charts^2) + 
##     I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), 
##     data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47451 -0.28725  0.02535  0.29404  2.11136 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.954e+03  6.946e+02  -4.253 2.59e-05 ***
## released_year                 2.967e+00  6.957e-01   4.265 2.46e-05 ***
## in_spotify_charts             4.442e-02  1.066e-02   4.167 3.72e-05 ***
## bpm                           2.179e-03  8.430e-04   2.585 0.010056 *  
## danceability_.                2.690e-03  1.643e-03   1.637 0.102293    
## Login_apple_playlists         1.382e-01  2.888e-02   4.783 2.37e-06 ***
## Login_apple_charts            9.768e-02  1.699e-02   5.750 1.69e-08 ***
## Logacousticness_.             5.480e-02  1.881e-02   2.913 0.003759 ** 
## Logspeechiness_.             -6.885e-02  2.954e-02  -2.331 0.020215 *  
## I(in_spotify_charts^2)       -1.077e-03  3.920e-04  -2.748 0.006242 ** 
## I(Login_spotify_playlists^2)  2.575e-02  2.309e-03  11.152  < 2e-16 ***
## I(released_day^2)             3.323e-04  8.577e-05   3.874 0.000124 ***
## I(released_year^2)           -7.408e-04  1.742e-04  -4.253 2.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.483 on 435 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7195 
## F-statistic: 96.55 on 12 and 435 DF,  p-value: < 2.2e-16

From the model, we can see that the Adjusted R-squared is still fairly high (0.7195). However, the p-value for danceability_. (0.102) is greater than 0.05, so we cannot conclude that it affects the predicted Logstreams once the other variables are in the model. Therefore, this is not the best model that we want to find.

With 11 variables, the model looks like this:

lm11 <- lm(Logstreams ~ released_year + in_spotify_charts + bpm + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), songtrain3)
summary(lm11)
## 
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts + 
##     bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. + 
##     Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54108 -0.29015  0.00861  0.28965  2.10930 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.874e+03  6.942e+02  -4.139 4.18e-05 ***
## released_year                 2.886e+00  6.953e-01   4.151 3.98e-05 ***
## in_spotify_charts             4.418e-02  1.068e-02   4.137 4.22e-05 ***
## bpm                           1.967e-03  8.345e-04   2.357 0.018878 *  
## Login_apple_playlists         1.484e-01  2.825e-02   5.255 2.32e-07 ***
## Login_apple_charts            9.646e-02  1.701e-02   5.672 2.57e-08 ***
## Logacousticness_.             4.993e-02  1.861e-02   2.683 0.007571 ** 
## Logspeechiness_.             -5.751e-02  2.877e-02  -1.999 0.046230 *  
## I(in_spotify_charts^2)       -1.059e-03  3.926e-04  -2.698 0.007246 ** 
## I(Login_spotify_playlists^2)  2.532e-02  2.298e-03  11.016  < 2e-16 ***
## I(released_day^2)             3.363e-04  8.590e-05   3.915 0.000105 ***
## I(released_year^2)           -7.206e-04  1.741e-04  -4.140 4.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4839 on 436 degrees of freedom
## Multiple R-squared:  0.7254, Adjusted R-squared:  0.7184 
## F-statistic: 104.7 on 11 and 436 DF,  p-value: < 2.2e-16

In this model, every term has a p-value below 0.05 (the largest is 0.046, for Logspeechiness_.), so every term is significant at the 5% level. In addition, the Adjusted R-squared (0.7184) is only slightly below that of the larger models above and is higher than that of the base model. Therefore, this is the best model that we get from Best Subsets.
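The model-by-model comparison above can also be collected into one table. This is a minimal sketch under assumptions: the three models below are stand-ins fitted on the built-in mtcars data, and in this report the list would instead hold lm19 through lm11.

```r
# Stand-in candidate models; replace with list(lm19 = lm19, ..., lm11 = lm11).
models <- list(
  m3 = lm(mpg ~ wt + hp + qsec, data = mtcars),
  m2 = lm(mpg ~ wt + hp, data = mtcars),
  m1 = lm(mpg ~ wt, data = mtcars)
)

# One row per model: number of terms, adjusted R-squared,
# and the largest p-value among the non-intercept terms.
comparison <- data.frame(
  model = names(models),
  n_terms = sapply(models, function(m) length(coef(m)) - 1),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  max_p = sapply(models, function(m) {
    max(summary(m)$coefficients[-1, "Pr(>|t|)"])
  }),
  row.names = NULL
)
comparison
```

A model whose max_p is below 0.05 has only significant terms, which is the criterion applied to lm11 here.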

Validate the Mathematical Assumptions

Assumptions for Linear Models:

  1. ε|xi is independent of ε|xj for any xi ≠ xj (independence)

  2. ε|x has standard deviation σ that does not depend on x. (homoscedasticity)

  3. ε|x is normally distributed for each x (normality)
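The three assumptions can be written compactly as one model statement. The notation below is an assumption added for illustration (index i for songs, p for the number of terms in the model, and the β coefficients do not appear in the original text); y stands for Logstreams.

```latex
\[
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad
\varepsilon_i \mid x_i \;\overset{\text{ind.}}{\sim}\; N(0, \sigma^2)
\]
```

Independence corresponds to "ind.", homoscedasticity to the single σ that does not depend on x, and normality to the N(0, σ²) distribution.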

Independence Assumptions

cor(songtrain3)
##                         artist_count released_year released_month released_day
## artist_count             1.000000000   0.106810521   -0.008180603  0.056057079
## released_year            0.106810521   1.000000000    0.061312795  0.078709598
## released_month          -0.008180603   0.061312795    1.000000000  0.075768333
## released_day             0.056057079   0.078709598    0.075768333  1.000000000
## in_spotify_charts       -0.044775331  -0.033404658   -0.046296838 -0.013569012
## bpm                     -0.015177062   0.029475277   -0.047385342 -0.059158876
## danceability_.           0.267606806   0.113874864   -0.056125166  0.031512743
## valence_.                0.181556082  -0.071527723   -0.186141454  0.046435367
## energy_.                 0.190884266   0.002169269   -0.059985458  0.048946540
## Logstreams              -0.058810263  -0.275761046    0.043743792  0.107736939
## Login_spotify_playlists -0.026381520  -0.434432988   -0.007532317  0.007167030
## Login_apple_playlists    0.101961289  -0.260879144    0.003830782  0.064866354
## Login_apple_charts      -0.151447141  -0.143568654    0.060185043 -0.035998023
## Logacousticness_.       -0.058880378   0.013540150   -0.008621915  0.002118181
## Loginstrumentalness_.   -0.128318562  -0.104007321    0.065836377  0.015342311
## Logliveness_.            0.059694527   0.051878154   -0.016946051 -0.016049702
## Logspeechiness_.         0.159523734   0.118884400    0.015485453 -0.016884250
##                         in_spotify_charts          bpm danceability_.
## artist_count                  -0.04477533 -0.015177062     0.26760681
## released_year                 -0.03340466  0.029475277     0.11387486
## released_month                -0.04629684 -0.047385342    -0.05612517
## released_day                  -0.01356901 -0.059158876     0.03151274
## in_spotify_charts              1.00000000  0.016017589     0.01998765
## bpm                            0.01601759  1.000000000    -0.10224367
## danceability_.                 0.01998765 -0.102243670     1.00000000
## valence_.                      0.08373595 -0.015341812     0.39487445
## energy_.                       0.06107525  0.007069839     0.18059295
## Logstreams                     0.32427195 -0.006475977     0.02739416
## Login_spotify_playlists        0.19343294 -0.035757248    -0.02942905
## Login_apple_playlists          0.19039098 -0.030208978     0.13793751
## Login_apple_charts             0.30247000 -0.083644732     0.01569308
## Logacousticness_.             -0.03801276 -0.076463396    -0.15953767
## Loginstrumentalness_.         -0.07537085 -0.048405969    -0.17970892
## Logliveness_.                 -0.02628857 -0.029554319    -0.06622832
## Logspeechiness_.              -0.04677227  0.097613871     0.20909260
##                           valence_.     energy_.   Logstreams
## artist_count             0.18155608  0.190884266 -0.058810263
## released_year           -0.07152772  0.002169269 -0.275761046
## released_month          -0.18614145 -0.059985458  0.043743792
## released_day             0.04643537  0.048946540  0.107736939
## in_spotify_charts        0.08373595  0.061075249  0.324271949
## bpm                     -0.01534181  0.007069839 -0.006475977
## danceability_.           0.39487445  0.180592952  0.027394160
## valence_.                1.00000000  0.373863941 -0.001801740
## energy_.                 0.37386394  1.000000000  0.054954160
## Logstreams              -0.00180174  0.054954160  1.000000000
## Login_spotify_playlists -0.03850706  0.040532652  0.762942174
## Login_apple_playlists    0.06726161  0.114414061  0.698380539
## Login_apple_charts       0.02198265  0.124607248  0.497448939
## Logacousticness_.       -0.03100504 -0.448604051 -0.101385863
## Loginstrumentalness_.   -0.12783397 -0.101127695  0.020105996
## Logliveness_.           -0.01432861  0.063322383 -0.043590364
## Logspeechiness_.         0.08631690  0.111212690 -0.158019883
##                         Login_spotify_playlists Login_apple_playlists
## artist_count                       -0.026381520           0.101961289
## released_year                      -0.434432988          -0.260879144
## released_month                     -0.007532317           0.003830782
## released_day                        0.007167030           0.064866354
## in_spotify_charts                   0.193432944           0.190390983
## bpm                                -0.035757248          -0.030208978
## danceability_.                     -0.029429048           0.137937508
## valence_.                          -0.038507061           0.067261607
## energy_.                            0.040532652           0.114414061
## Logstreams                          0.762942174           0.698380539
## Login_spotify_playlists             1.000000000           0.724623847
## Login_apple_playlists               0.724623847           1.000000000
## Login_apple_charts                  0.345094228           0.456850561
## Logacousticness_.                  -0.145580828          -0.200347131
## Loginstrumentalness_.               0.065869484          -0.039075810
## Logliveness_.                      -0.035924257          -0.050530857
## Logspeechiness_.                   -0.083378273          -0.129618095
##                         Login_apple_charts Logacousticness_.
## artist_count                   -0.15144714      -0.058880378
## released_year                  -0.14356865       0.013540150
## released_month                  0.06018504      -0.008621915
## released_day                   -0.03599802       0.002118181
## in_spotify_charts               0.30247000      -0.038012756
## bpm                            -0.08364473      -0.076463396
## danceability_.                  0.01569308      -0.159537667
## valence_.                       0.02198265      -0.031005043
## energy_.                        0.12460725      -0.448604051
## Logstreams                      0.49744894      -0.101385863
## Login_spotify_playlists         0.34509423      -0.145580828
## Login_apple_playlists           0.45685056      -0.200347131
## Login_apple_charts              1.00000000      -0.144498524
## Logacousticness_.              -0.14449852       1.000000000
## Loginstrumentalness_.          -0.03320082       0.059981137
## Logliveness_.                   0.03014205      -0.009913390
## Logspeechiness_.               -0.13558022       0.002784047
##                         Loginstrumentalness_. Logliveness_. Logspeechiness_.
## artist_count                      -0.12831856    0.05969453      0.159523734
## released_year                     -0.10400732    0.05187815      0.118884400
## released_month                     0.06583638   -0.01694605      0.015485453
## released_day                       0.01534231   -0.01604970     -0.016884250
## in_spotify_charts                 -0.07537085   -0.02628857     -0.046772270
## bpm                               -0.04840597   -0.02955432      0.097613871
## danceability_.                    -0.17970892   -0.06622832      0.209092601
## valence_.                         -0.12783397   -0.01432861      0.086316903
## energy_.                          -0.10112769    0.06332238      0.111212690
## Logstreams                         0.02010600   -0.04359036     -0.158019883
## Login_spotify_playlists            0.06586948   -0.03592426     -0.083378273
## Login_apple_playlists             -0.03907581   -0.05053086     -0.129618095
## Login_apple_charts                -0.03320082    0.03014205     -0.135580220
## Logacousticness_.                  0.05998114   -0.00991339      0.002784047
## Loginstrumentalness_.              1.00000000   -0.07352980     -0.186365272
## Logliveness_.                     -0.07352980    1.00000000     -0.013093706
## Logspeechiness_.                  -0.18636527   -0.01309371      1.000000000

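Based on the correlation matrix, most pairs of predictors are only weakly correlated. The largest correlation among predictors is about 0.72, between Login_spotify_playlists and Login_apple_playlists. We treat this as consistent with the independence assumption, although strictly speaking the matrix describes collinearity among the predictors rather than independence of the errors.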
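Pairwise correlations only look at two variables at a time, so a variance inflation factor (VIF) check is a common complement. The sketch below is an illustration and not part of the original analysis: it computes VIFs by hand in base R on stand-in predictors from the built-in mtcars data. A common rule of thumb treats values above 5 or 10 as a sign of strong collinearity.

```r
# Stand-in predictors; in the report these would be the predictors of lm11.
predictors <- mtcars[, c("wt", "hp", "qsec", "drat")]

# VIF_j = 1 / (1 - R^2_j), where R^2_j is the R-squared from
# regressing predictor j on all of the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```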

Homoscedasticity and Normality Assumption

songtrain4 <- songtrain3 %>% 
  mutate(res = residuals(lm11), fit = fitted.values(lm11))
shapiro.test(songtrain4$res)
## 
##  Shapiro-Wilk normality test
## 
## data:  songtrain4$res
## W = 0.99053, p-value = 0.005632
ncvTest(lm11)     
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 11.12461, Df = 1, p = 0.0008519

The Shapiro-Wilk test gives a p-value of 0.005632 and the ncvTest gives a p-value of 0.0008519. Both are smaller than 0.05, so the tests reject normality and homoscedasticity. However, with a sample this large (448 observations), such tests have enough power to flag even small departures that have little practical effect on the model. So we plot the histogram and scatterplot of the residuals to reach a better conclusion about the assumptions.

ggplot(songtrain4, aes(res)) + geom_histogram(fill = "royalblue1", color = "black") + labs(title = "Histogram of residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(songtrain4, aes(fit, res)) + geom_point(color = "royalblue1") + labs(title = "Scatterplot of residuals")

In the first plot, the residual histogram is approximately normal, which means that this model is consistent with the normality assumption.

In the second plot, the residual scatterplot does not show any clear trend, which means the model is also consistent with the homoscedasticity assumption. In addition, the correlation matrix above suggested that the dataset is consistent with the independence assumption.
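A normal Q-Q plot is another common visual check of normality and is often easier to read than a histogram. The following is a minimal base R sketch on a stand-in model fitted to the built-in mtcars data. In this report the residuals would come from songtrain4$res.

```r
# Stand-in model; replace res with songtrain4$res for the song data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Points close to the reference line suggest approximately normal residuals.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Correlation between theoretical and sample quantiles;
# values near 1 indicate a nearly straight Q-Q plot.
cor(qq$x, qq$y)
```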

Therefore we conclude that this model is valid for predicting Logstreams as a function of released_year, in_spotify_charts, bpm, Login_apple_playlists, Login_apple_charts, Logacousticness_., Logspeechiness_., I(in_spotify_charts^2), I(Login_spotify_playlists^2), I(released_day^2), I(released_year^2).

Coefficient Interpretation

summary(lm11)
## 
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts + 
##     bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. + 
##     Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtrain3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54108 -0.29015  0.00861  0.28965  2.10930 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.874e+03  6.942e+02  -4.139 4.18e-05 ***
## released_year                 2.886e+00  6.953e-01   4.151 3.98e-05 ***
## in_spotify_charts             4.418e-02  1.068e-02   4.137 4.22e-05 ***
## bpm                           1.967e-03  8.345e-04   2.357 0.018878 *  
## Login_apple_playlists         1.484e-01  2.825e-02   5.255 2.32e-07 ***
## Login_apple_charts            9.646e-02  1.701e-02   5.672 2.57e-08 ***
## Logacousticness_.             4.993e-02  1.861e-02   2.683 0.007571 ** 
## Logspeechiness_.             -5.751e-02  2.877e-02  -1.999 0.046230 *  
## I(in_spotify_charts^2)       -1.059e-03  3.926e-04  -2.698 0.007246 ** 
## I(Login_spotify_playlists^2)  2.532e-02  2.298e-03  11.016  < 2e-16 ***
## I(released_day^2)             3.363e-04  8.590e-05   3.915 0.000105 ***
## I(released_year^2)           -7.206e-04  1.741e-04  -4.140 4.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4839 on 436 degrees of freedom
## Multiple R-squared:  0.7254, Adjusted R-squared:  0.7184 
## F-statistic: 104.7 on 11 and 436 DF,  p-value: < 2.2e-16

From this model, we can see that

Intercept:

  • Estimate: The estimated intercept is -2.874e+03, meaning that when all predictor variables are zero, the predicted value of Logstreams is approximately -2874. This has no practical meaning, because a released_year of zero is far outside the range of the data.
  • Std. Error: The standard error of the intercept estimate is 6.942e+02.
  • t value: The t-value associated with the intercept is -4.139, with a p-value of 4.18e-05, so the intercept is significantly different from zero. Because the intercept is an extrapolation, it should not be read as a base level of Logstreams.

released_year:

  • Estimate: The estimated coefficient for released_year is 2.886. Because the model also includes I(released_year^2), this coefficient cannot be read on its own as the change in Logstreams per year. The change for one more year is approximately 2.886 - 2 * 7.206e-04 * released_year, which depends on the year.
  • Std. Error: The standard error of the coefficient estimate for released_year is 6.953e-01.
  • t value: The t-value associated with released_year is 4.151, with a p-value of 3.98e-05, indicating statistical significance. Thus, the year of release significantly affects Logstreams, but the direction of the effect depends on the year because of the squared term.

in_spotify_charts:

  • Estimate: The estimated coefficient for in_spotify_charts is 0.04418. Because the model also includes I(in_spotify_charts^2), the change in Logstreams for a one-unit increase in in_spotify_charts is approximately 0.04418 - 2 * 0.001059 * in_spotify_charts, so it is close to 0.04418 only when in_spotify_charts is near zero.
  • Std. Error: The standard error of the coefficient estimate for in_spotify_charts is 1.068e-02.
  • t value: The t-value associated with in_spotify_charts is 4.137, with a p-value of 4.22e-05, indicating statistical significance. This suggests that a higher in_spotify_charts value is associated with higher Logstreams, at least at low values of the variable.

bpm:

  • Estimate: The estimated coefficient for bpm is 0.001967, suggesting that for each one-unit increase in beats per minute, the Logstreams increase by approximately 0.001967.
  • Std. Error: The standard error of the coefficient estimate for bpm is 8.345e-04.
  • t value: The t-value associated with bpm is 2.357, with a p-value of 0.018878, indicating statistical significance. Thus, songs with higher beats per minute tend to have higher Logstreams.
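
Because the response is Logstreams rather than streams, a coefficient on an untransformed predictor such as bpm can be converted into an approximate percentage change in streams. The sketch below does this for the printed bpm estimate. It assumes that Logstreams is the natural log of streams (as in the test-set code) and that the other predictors are held fixed; pct_change is a helper name introduced here for illustration.

```r
# Coefficient for bpm, copied from the summary(lm11) output
b_bpm <- 1.967e-03

# In a log-level model, a change of d units in x multiplies the
# response by exp(b * d); convert that factor to a percent change
pct_change <- function(b, d = 1) (exp(b * d) - 1) * 100

pct_change(b_bpm)      # percent change in streams for +1 bpm
pct_change(b_bpm, 10)  # percent change in streams for +10 bpm
```

With the printed estimate, this works out to roughly 0.2% more streams per extra beat per minute, or about 2% for a 10 bpm difference.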

Login_apple_playlists:

  • Estimate: The estimated coefficient for Login_apple_playlists is 1.484e-01, indicating that for each one-unit increase in the log-transformed number of Apple Music playlists, the Logstreams increase by approximately 0.1484.
  • Std. Error: The standard error of the coefficient estimate for Login_apple_playlists is 2.825e-02.
  • t value: The t-value associated with Login_apple_playlists is 5.255, with a p-value of 2.32e-07, suggesting statistical significance.

Login_apple_charts:

  • Estimate: The estimated coefficient for Login_apple_charts is 9.646e-02, implying that for each one-unit increase in the log-transformed Apple Music charts value, the Logstreams increase by approximately 0.09646.
  • Std. Error: The standard error of the coefficient estimate for Login_apple_charts is 1.701e-02.
  • t value: The t-value associated with Login_apple_charts is 5.672, with a p-value of 2.57e-08, indicating statistical significance.
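
When both the response and the predictor are log-transformed, the coefficient behaves like an elasticity. The sketch below turns the two Apple Music coefficients into the effect of doubling the underlying count. It assumes the training columns were built the same way as in the test-set code, that is, log(in_apple_playlists + 1) and log(in_apple_charts + 1), so the doubling applies to the count plus one; stream_factor is a helper name introduced here.

```r
# Coefficients copied from the summary(lm11) output
b_playlists <- 1.484e-01
b_charts    <- 9.646e-02

# In a log-log model, multiplying (x + 1) by a factor k multiplies
# the predicted streams by k^b, holding the other predictors fixed
stream_factor <- function(b, k) k^b

(stream_factor(b_playlists, 2) - 1) * 100  # percent change when (playlists + 1) doubles
(stream_factor(b_charts, 2) - 1) * 100     # percent change when (charts + 1) doubles
```

Under these assumptions, doubling the Apple Music playlist count goes with roughly 11% more streams, and doubling the Apple Music charts value with roughly 7% more.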

Logacousticness_:

  • Estimate: The estimated coefficient for Logacousticness_ is 4.993e-02, suggesting that for each one-unit increase in the log-transformed acousticness, the Logstreams increase by approximately 0.04993.
  • Std. Error: The standard error of the coefficient estimate for Logacousticness_ is 1.861e-02.
  • t value: The t-value associated with Logacousticness_ is 2.683, with a p-value of 0.007571, indicating statistical significance.

Logspeechiness_:

  • Estimate: The estimated coefficient for Logspeechiness_ is -5.751e-02, indicating that for each one-unit increase in the log-transformed speechiness, the Logstreams decrease by approximately 0.05751.
  • Std. Error: The standard error of the coefficient estimate for Logspeechiness_ is 2.877e-02.
  • t value: The t-value associated with Logspeechiness_ is -1.999, with a p-value of 0.046230, suggesting statistical significance.

I(in_spotify_charts^2):

  • Estimate: The estimated coefficient for I(in_spotify_charts^2) is -1.059e-03, so the positive effect of in_spotify_charts on Logstreams weakens as in_spotify_charts grows (a concave relationship).
  • Std. Error: The standard error of the coefficient estimate for I(in_spotify_charts^2) is 3.926e-04.
  • t value: The t-value associated with I(in_spotify_charts^2) is -2.698, with a p-value of 0.007246, indicating statistical significance.

I(Login_spotify_playlists^2):

  • Estimate: The estimated coefficient for I(Login_spotify_playlists^2) is 2.532e-02, implying that the squared log-transformed number of Spotify playlists has a positive effect on Logstreams.
  • Std. Error: The standard error of the coefficient estimate for I(Login_spotify_playlists^2) is 2.298e-03.
  • t value: The t-value associated with I(Login_spotify_playlists^2) is 11.016, with a p-value of < 2e-16, indicating statistical significance.

I(released_day^2):

  • Estimate: The estimated coefficient for I(released_day^2) is 3.363e-04, suggesting that the squared term of released_day has a positive effect on Logstreams.
  • Std. Error: The standard error of the coefficient estimate for I(released_day^2) is 8.590e-05.
  • t value: The t-value associated with I(released_day^2) is 3.915, with a p-value of 0.000105, indicating statistical significance.

I(released_year^2):

  • Estimate: The estimated coefficient for I(released_year^2) is -7.206e-04, indicating that the relationship between released_year and Logstreams is concave: the effect of one more year becomes smaller, and eventually negative, as the year increases.
  • Std. Error: The standard error of the coefficient estimate for I(released_year^2) is 1.741e-04.
  • t value: The t-value associated with I(released_year^2) is -4.140, with a p-value of 4.18e-05, indicating statistical significance.
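
For the predictors that enter with a squared term, the linear and squared coefficients have to be read together. The sketch below computes the slope at a given value and the turning point of the fitted curve for released_year and in_spotify_charts. The coefficients are copied from the printed output (using the extra digits shown in the coeftest table), so the results are only as precise as that rounding; marginal_effect and turning_point are helper names introduced here.

```r
# Coefficients copied from the printed model output
b_year  <- 2.8864e+00; b_year2  <- -7.2056e-04
b_chart <- 4.4185e-02; b_chart2 <- -1.0592e-03

# For b1 * x + b2 * x^2, the slope at x is b1 + 2 * b2 * x
marginal_effect <- function(b1, b2, x) b1 + 2 * b2 * x

# The fitted curve turns where that slope equals zero
turning_point <- function(b1, b2) -b1 / (2 * b2)

turning_point(b_year, b_year2)          # year at which the fitted curve peaks
marginal_effect(b_year, b_year2, 2023)  # slope of Logstreams in year at 2023
turning_point(b_chart, b_chart2)        # in_spotify_charts value at the peak
```

With these rounded coefficients the released_year curve peaks near 2003, so the fitted slope is slightly negative for a 2023 release, and the in_spotify_charts curve peaks at a value of about 21.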

Overall, the Adjusted R-squared is 0.7184, which means that about 71.84% of the variation in Logstreams is explained by these variables, after adjusting for the number of predictors. The p-value of < 2.2e-16 for the F-statistic indicates that the model as a whole is statistically significant in predicting Logstreams.
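
As a quick check on how the Adjusted R-squared relates to the Multiple R-squared, the sketch below recomputes it from the numbers printed in the summary. The sample size is inferred from the residual degrees of freedom plus the 11 predictors and the intercept.

```r
# Values copied from the summary(lm11) output
r2       <- 0.7254  # Multiple R-squared
df_resid <- 436     # residual degrees of freedom
p        <- 11      # number of predictors (excluding the intercept)

# Number of observations used in the fit
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
adj_r2
```

The result agrees with the printed 0.7184 up to the rounding of the Multiple R-squared.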

Robust Standard Errors

coeftest(lm11,vcov = vcovHC(lm11,type = "HC1"))
## 
## t test of coefficients:
## 
##                                 Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)                  -2.8736e+03  7.2158e+02 -3.9824 7.991e-05 ***
## released_year                 2.8864e+00  7.2363e-01  3.9888 7.786e-05 ***
## in_spotify_charts             4.4185e-02  8.6840e-03  5.0881 5.386e-07 ***
## bpm                           1.9668e-03  8.8963e-04  2.2108 0.0275702 *  
## Login_apple_playlists         1.4844e-01  3.3847e-02  4.3855 1.453e-05 ***
## Login_apple_charts            9.6457e-02  1.6211e-02  5.9501 5.510e-09 ***
## Logacousticness_.             4.9930e-02  1.6900e-02  2.9544 0.0033020 ** 
## Logspeechiness_.             -5.7509e-02  2.9788e-02 -1.9306 0.0541844 .  
## I(in_spotify_charts^2)       -1.0592e-03  2.9528e-04 -3.5870 0.0003723 ***
## I(Login_spotify_playlists^2)  2.5320e-02  2.6517e-03  9.5487 < 2.2e-16 ***
## I(released_day^2)             3.3633e-04  8.0689e-05  4.1682 3.704e-05 ***
## I(released_year^2)           -7.2056e-04  1.8138e-04 -3.9726 8.316e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We use robust standard errors to check whether heteroscedasticity changes our conclusions. Based on the result, the robust standard errors are larger for some variables and smaller for others, and the p-values of almost all the variables are still smaller than 0.05, so almost all the variables remain statistically significant. The exception is Logspeechiness_., whose p-value rises to 0.054. Since the residual plot above looks consistent with homoscedasticity, the robust standard errors do not change the overall conclusions for this model.
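
To see how the robust standard errors differ from the classical ones, the two columns can be compared term by term. The sketch below copies the standard errors from the summary(lm11) and coeftest outputs above and takes their ratio; a ratio below 1 means the HC1 standard error is smaller than the classical one.

```r
terms <- c("(Intercept)", "released_year", "in_spotify_charts", "bpm",
           "Login_apple_playlists", "Login_apple_charts",
           "Logacousticness_.", "Logspeechiness_.",
           "I(in_spotify_charts^2)", "I(Login_spotify_playlists^2)",
           "I(released_day^2)", "I(released_year^2)")

# Classical standard errors from summary(lm11)
se_ols <- c(6.942e+02, 6.953e-01, 1.068e-02, 8.345e-04, 2.825e-02, 1.701e-02,
            1.861e-02, 2.877e-02, 3.926e-04, 2.298e-03, 8.590e-05, 1.741e-04)

# HC1 robust standard errors from coeftest()
se_hc1 <- c(7.2158e+02, 7.2363e-01, 8.6840e-03, 8.8963e-04, 3.3847e-02, 1.6211e-02,
            1.6900e-02, 2.9788e-02, 2.9528e-04, 2.6517e-03, 8.0689e-05, 1.8138e-04)

ratio <- setNames(se_hc1 / se_ols, terms)
round(ratio, 2)

# Terms whose robust standard error is smaller than the classical one
names(ratio)[ratio < 1]
```

Five of the twelve robust standard errors are smaller than their classical counterparts, including both in_spotify_charts terms.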

Prediction

newdata <- data.frame(released_year = 2023, in_spotify_charts = 0, bpm = 115, Login_apple_playlists = 4, Login_apple_charts = 4, Logacousticness_. = 3.7, Logspeechiness_. = 3.2, Login_spotify_playlists = 7, released_day = 21)
predict(lm11, newdata, interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 19.30162 18.03766 20.56559

Using the values released_year = 2023, in_spotify_charts = 0, bpm = 115, Login_apple_playlists = 4, Login_apple_charts = 4, Logacousticness_. = 3.7, Logspeechiness_. = 3.2, Login_spotify_playlists = 7, and released_day = 21, the model predicts a Logstreams of 19.30162, with a 99% prediction interval from 18.03766 to 20.56559.
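
The prediction is on the log scale, so it is easier to read after converting back to streams. The sketch below exponentiates the printed fit and interval bounds, assuming Logstreams is the natural log of streams as in the test-set code. Note that exp() of a predicted log value is closer to a typical (median) value than to the mean of streams.

```r
# Fit and 99% prediction bounds copied from the predict() output
pred_log <- c(fit = 19.30162, lwr = 18.03766, upr = 20.56559)

# Undo the natural log to get back to the streams scale
pred_streams <- exp(pred_log)

# Report in millions of streams
round(pred_streams / 1e6, 1)
```

On the original scale this is a prediction of roughly 240 million streams, with an interval running from about 68 million to about 850 million, which shows how wide the interval is in practical terms.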

Check the model on the test set

To check the model, we fit the same model on a different dataset, songtest2. But first, we need to create the same log-transformed columns for the test set and drop the original columns.

# Create the log-transformed columns used by the model, then drop the originals
songtest2 <- songtest %>% 
  mutate(
    Login_apple_playlists = log(in_apple_playlists + 1),
    Login_apple_charts = log(in_apple_charts + 1),
    Logspeechiness_. = log(speechiness_.),
    Logacousticness_. = log(acousticness_. + 1),
    Login_spotify_playlists = log(in_spotify_playlists),
    Logstreams = log(streams)
  ) %>% 
  select(-in_apple_playlists, -in_apple_charts, -speechiness_., -acousticness_., -in_spotify_playlists, -streams)
  
lm_test <- lm(Logstreams ~ released_year + in_spotify_charts + bpm + Login_apple_playlists + Login_apple_charts +  Logacousticness_.  + Logspeechiness_. + I(in_spotify_charts^2) +  I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), songtest2)
summary(lm_test)
## 
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts + 
##     bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. + 
##     Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + 
##     I(released_day^2) + I(released_year^2), data = songtest2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8234  -0.3074   0.0174   0.4138   2.4398 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -1.093e+03  6.082e+02  -1.798   0.0729 .  
## released_year                 1.112e+00  6.115e-01   1.819   0.0697 .  
## in_spotify_charts             5.880e-03  4.517e-03   1.302   0.1937    
## bpm                          -1.241e-03  1.401e-03  -0.886   0.3760    
## Login_apple_playlists        -1.107e-01  4.957e-02  -2.233   0.0261 *  
## Login_apple_charts            3.701e-02  3.453e-02   1.072   0.2844    
## Logacousticness_.             5.553e-02  3.146e-02   1.765   0.0783 .  
## Logspeechiness_.             -4.133e-02  5.510e-02  -0.750   0.4536    
## I(in_spotify_charts^2)       -6.395e-05  4.745e-05  -1.348   0.1784    
## I(Login_spotify_playlists^2)  4.859e-02  3.173e-03  15.310   <2e-16 ***
## I(released_day^2)             2.044e-04  1.357e-04   1.506   0.1327    
## I(released_year^2)           -2.785e-04  1.537e-04  -1.812   0.0707 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8049 on 435 degrees of freedom
## Multiple R-squared:  0.6252, Adjusted R-squared:  0.6157 
## F-statistic: 65.97 on 11 and 435 DF,  p-value: < 2.2e-16

Based on the result, we can see that most of the variables become insignificant. Possible reasons are:

  • Multicollinearity: The dataset may suffer from multicollinearity, where predictor variables are highly correlated with each other, which makes individual coefficient estimates unstable.
  • Small Sample Size: After splitting, the training dataset only has ~450 data points, which could have affected the generalizability of findings.
  • Overfitting: The "best" model selected on the training set does not carry over to the test set, which suggests that it captured noise or random fluctuations in the training data.
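
Refitting the formula on the test set shows whether the same terms stay significant, but it does not measure how well the training fit predicts new songs. A complementary check is to keep the training coefficients and score them on the held-out rows. The sketch below illustrates the idea on simulated data; every name and number in it is made up for illustration, and in this report the analogous step would be something like predict(lm11, newdata = songtest2).

```r
set.seed(1)

# Simulated data standing in for the real dataset (illustration only)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n, sd = 1)

# Hypothetical 50/50 split into training and test rows
train_idx <- sample(n, n / 2)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training rows only
fit <- lm(y ~ x, data = train)

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$y, fitted(fit))
rmse_test  <- rmse(test$y, predict(fit, newdata = test))
c(train = rmse_train, test = rmse_test)
```

A test RMSE that is much larger than the training RMSE would point to overfitting; similar values would suggest that the model generalizes reasonably well.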

Confidence Intervals

confint(lm11)
##                                      2.5 %        97.5 %
## (Intercept)                  -4.238091e+03 -1.509205e+03
## released_year                 1.519896e+00  4.252981e+00
## in_spotify_charts             2.319561e-02  6.517406e-02
## bpm                           3.265649e-04  3.606960e-03
## Login_apple_playlists         9.291790e-02  2.039547e-01
## Login_apple_charts            6.303465e-02  1.298796e-01
## Logacousticness_.             1.335562e-02  8.650406e-02
## Logspeechiness_.             -1.140517e-01 -9.656708e-04
## I(in_spotify_charts^2)       -1.830745e-03 -2.876047e-04
## I(Login_spotify_playlists^2)  2.080298e-02  2.983789e-02
## I(released_day^2)             1.674987e-04  5.051604e-04
## I(released_year^2)           -1.062669e-03 -3.784500e-04

We interpret each coefficient along with its corresponding 95% confidence interval:

(Intercept):

  • Estimate: The estimated intercept is between -4.238091e+03 and -1.509205e+03 with 95% confidence.
  • Interpretation: We are 95% confident that when all predictor variables are zero, the expected Logstreams is between approximately -4238 and -1509. This is an extrapolation far outside the data, so it has no practical meaning.

released_year:

  • Estimate: The estimated coefficient for released_year is between 1.519896e+00 and 4.252981e+00 with 95% confidence.
  • Interpretation: We are 95% confident that the coefficient on released_year lies between approximately 1.52 and 4.25. Because of the squared term, this is not by itself the change in Logstreams per year.

in_spotify_charts:

  • Estimate: The estimated coefficient for in_spotify_charts is between 2.319561e-02 and 6.517406e-02 with 95% confidence.
  • Interpretation: We are 95% confident that the coefficient on in_spotify_charts lies between approximately 0.0232 and 0.0652. Because of the squared term, this is the change in Logstreams per unit only when in_spotify_charts is near zero.

bpm:

  • Estimate: The estimated coefficient for bpm is between 3.265649e-04 and 3.606960e-03 with 95% confidence.
  • Interpretation: We are 95% confident that for each one-unit increase in beats per minute (bpm), the Logstreams increase by an amount between approximately 0.000327 and 0.00361.

Login_apple_playlists:

  • Estimate: The estimated coefficient for Login_apple_playlists is between 9.291790e-02 and 2.039547e-01 with 95% confidence.
  • Interpretation: We are 95% confident that for each one-unit increase in the log-transformed number of Apple Music playlists, the Logstreams increase by an amount between approximately 0.0929 and 0.204.

Login_apple_charts:

  • Estimate: The estimated coefficient for Login_apple_charts is between 6.303465e-02 and 1.298796e-01 with 95% confidence.
  • Interpretation: We are 95% confident that for each one-unit increase in the log-transformed Apple Music charts value, the Logstreams increase by an amount between approximately 0.063 and 0.130.

Logacousticness_:

  • Estimate: The estimated coefficient for Logacousticness_ is between 1.335562e-02 and 8.650406e-02 with 95% confidence.
  • Interpretation: We are 95% confident that for each one-unit increase in the log-transformed acousticness, the Logstreams increase by an amount between approximately 0.0134 and 0.0865.

Logspeechiness_:

  • Estimate: The estimated coefficient for Logspeechiness_ is between -1.140517e-01 and -9.656708e-04 with 95% confidence.
  • Interpretation: We are 95% confident that for each one-unit increase in the log-transformed speechiness, the Logstreams decrease by an amount between approximately 0.001 and 0.114.

I(in_spotify_charts^2):

  • Estimate: The estimated coefficient for I(in_spotify_charts^2) is between -1.830745e-03 and -2.876047e-04 with 95% confidence.
  • Interpretation: We are 95% confident that the coefficient on the squared in_spotify_charts term lies between approximately -0.00183 and -0.000288, so the curvature is negative.

I(Login_spotify_playlists^2):

  • Estimate: The estimated coefficient for I(Login_spotify_playlists^2) is between 2.080298e-02 and 2.983789e-02 with 95% confidence.
  • Interpretation: We are 95% confident that the coefficient on the squared log-transformed number of Spotify playlists lies between approximately 0.0208 and 0.0298.

I(released_day^2):

  • Estimate: The estimated coefficient for I(released_day^2) is between 1.674987e-04 and 5.051604e-04 with 95% confidence.
  • Interpretation: We are 95% confident that the coefficient on the squared released_day term lies between approximately 0.000167 and 0.000505.

I(released_year^2):

  • Estimate: The estimated coefficient for I(released_year^2) is between -1.062669e-03 and -3.784500e-04 with 95% confidence.
  • Interpretation: We are 95% confident that the coefficient on the squared released_year term lies between approximately -0.00106 and -0.000378, so the curvature is negative.
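
These intervals follow the usual form of estimate plus or minus a t critical value times the standard error. The sketch below reproduces the bpm interval from the numbers printed in summary(lm11); small differences from the confint output come from the rounding of the printed estimate and standard error.

```r
# Values for bpm copied from the summary(lm11) output
est      <- 1.967e-03
se       <- 8.345e-04
df_resid <- 436

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df_resid)

# Lower and upper bounds of the confidence interval
ci <- est + c(-1, 1) * t_crit * se
ci
```

The same calculation applies to every other row of the coefficient table.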

Limitations and Future Questions

Limitations:

  • Multicollinearity: The dataset may suffer from multicollinearity, where predictor variables are highly correlated with each other, which makes individual coefficient estimates unstable.
  • Small Sample Size: After splitting, the training dataset only has ~450 data points, which could have affected the generalizability of findings.
  • Overfitting: When the training model is checked against the test set, overfitting seems evident, as the "best" model fails to generalize to new data because it captured too much noise and/or random fluctuations.
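
One likely source of multicollinearity in this model is the pairing of released_year with I(released_year^2): over a narrow range of large values, a variable and its square are almost perfectly correlated. The sketch below illustrates this with a hypothetical range of release years (the range is an assumption, not taken from the dataset) and shows that centering the variable before squaring, a technique not used in this report, removes most of that correlation.

```r
# Hypothetical range of release years, for illustration only
years <- 1930:2023

# Correlation between the raw year and its square
cor(years, years^2)

# Centering the year first (not done in the report) breaks the near-perfect link
years_centered <- years - mean(years)
cor(years_centered, years_centered^2)
```

Near-perfect correlation of this kind may help explain the very large intercept and the released_year coefficients that almost cancel each other.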

We want to explore the question:

  • How does collaboration between artists influence the popularity of songs on streaming platforms like Spotify? Answering this could make the prediction of a song's popularity more accurate.