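library(tidyverse)  # assumed: provides %>%, select, mutate, glimpse and ggplot used below
library(leaps)      # provides regsubsets
song <- read.csv("spotify-2023.csv")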
In today’s world, music isn’t just about catchy tunes; it’s about how it affects us. We’ve all felt its power to change our mood, ease our worries, and even help us express ourselves. And now, researchers are digging deep to understand why some songs become hits while others fade into the background. This project is all about diving into the data behind music. We’re breaking down song samples, looking at all the little details, and trying to figure out what makes a song popular based on the total number of streams on Spotify.
As we delve into our research on song popularity prediction, it’s essential to prioritize ethical principles. This includes obtaining consent, protecting data privacy, and ensuring transparency in our methods and findings. We must also be mindful of potential societal impacts, striving to promote diversity and avoid perpetuating biases. By upholding these standards, we can conduct our study responsibly and contribute positively to the field of music research.
Nidula Elgiriyewithana. “Most Streamed Spotify Songs 2023.” Kaggle, 26 Aug. 2023, www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.
This dataset consists of 24 variables and 953 observations. We will examine these variables to build a model that predicts song popularity.
A glimpse of the data is shown below:
glimpse(song)
## Rows: 953
## Columns: 24
## $ track_name <chr> "Seven (feat. Latto) (Explicit Ver.)", "LALA", "v…
## $ artist.s._name <chr> "Latto, Jung Kook", "Myke Towers", "Olivia Rodrig…
## $ artist_count <int> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1…
## $ released_year <int> 2023, 2023, 2023, 2019, 2023, 2023, 2023, 2023, 2…
## $ released_month <int> 7, 3, 6, 8, 5, 6, 3, 7, 5, 3, 4, 7, 1, 4, 3, 12, …
## $ released_day <int> 14, 23, 30, 23, 18, 1, 16, 7, 15, 17, 17, 7, 12, …
## $ in_spotify_playlists <int> 553, 1474, 1397, 7858, 3133, 2186, 3090, 714, 109…
## $ in_spotify_charts <int> 147, 48, 113, 100, 50, 91, 50, 43, 83, 44, 40, 55…
## $ streams <chr> "141381703", "133716286", "140003974", "800840817…
## $ in_apple_playlists <int> 43, 48, 94, 116, 84, 67, 34, 25, 60, 49, 41, 37, …
## $ in_apple_charts <int> 263, 126, 207, 207, 133, 213, 222, 89, 210, 110, …
## $ in_deezer_playlists <chr> "45", "58", "91", "125", "87", "88", "43", "30", …
## $ in_deezer_charts <int> 10, 14, 14, 12, 15, 17, 13, 13, 11, 13, 12, 5, 58…
## $ in_shazam_charts <chr> "826", "382", "949", "548", "425", "946", "418", …
## $ bpm <int> 125, 92, 138, 170, 144, 141, 148, 100, 130, 170, …
## $ key <chr> "B", "C#", "F", "A", "A", "C#", "F", "F", "C#", "…
## $ mode <chr> "Major", "Major", "Major", "Major", "Minor", "Maj…
## $ danceability_. <int> 80, 71, 51, 55, 65, 92, 67, 67, 85, 81, 57, 78, 7…
## $ valence_. <int> 89, 61, 32, 58, 23, 66, 83, 26, 22, 56, 56, 52, 6…
## $ energy_. <int> 83, 74, 53, 72, 80, 58, 76, 71, 62, 48, 72, 82, 6…
## $ acousticness_. <int> 31, 7, 17, 11, 14, 19, 48, 37, 12, 21, 23, 18, 6,…
## $ instrumentalness_. <int> 0, 0, 0, 0, 63, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17,…
## $ liveness_. <int> 8, 10, 31, 11, 11, 8, 8, 11, 28, 8, 27, 15, 3, 9,…
## $ speechiness_. <int> 4, 4, 6, 15, 6, 24, 3, 4, 9, 33, 5, 7, 7, 3, 6, 4…
Here is a summary of the dataset:
summary(song)
## track_name artist.s._name artist_count released_year
## Length:953 Length:953 Min. :1.000 Min. :1930
## Class :character Class :character 1st Qu.:1.000 1st Qu.:2020
## Mode :character Mode :character Median :1.000 Median :2022
## Mean :1.556 Mean :2018
## 3rd Qu.:2.000 3rd Qu.:2022
## Max. :8.000 Max. :2023
## released_month released_day in_spotify_playlists in_spotify_charts
## Min. : 1.000 Min. : 1.00 Min. : 31 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 6.00 1st Qu.: 875 1st Qu.: 0.00
## Median : 6.000 Median :13.00 Median : 2224 Median : 3.00
## Mean : 6.034 Mean :13.93 Mean : 5200 Mean : 12.01
## 3rd Qu.: 9.000 3rd Qu.:22.00 3rd Qu.: 5542 3rd Qu.: 16.00
## Max. :12.000 Max. :31.00 Max. :52898 Max. :147.00
## streams in_apple_playlists in_apple_charts in_deezer_playlists
## Length:953 Min. : 0.00 Min. : 0.00 Length:953
## Class :character 1st Qu.: 13.00 1st Qu.: 7.00 Class :character
## Mode :character Median : 34.00 Median : 38.00 Mode :character
## Mean : 67.81 Mean : 51.91
## 3rd Qu.: 88.00 3rd Qu.: 87.00
## Max. :672.00 Max. :275.00
## in_deezer_charts in_shazam_charts bpm key
## Min. : 0.000 Length:953 Min. : 65.0 Length:953
## 1st Qu.: 0.000 Class :character 1st Qu.:100.0 Class :character
## Median : 0.000 Mode :character Median :121.0 Mode :character
## Mean : 2.666 Mean :122.5
## 3rd Qu.: 2.000 3rd Qu.:140.0
## Max. :58.000 Max. :206.0
## mode danceability_. valence_. energy_.
## Length:953 Min. :23.00 Min. : 4.00 Min. : 9.00
## Class :character 1st Qu.:57.00 1st Qu.:32.00 1st Qu.:53.00
## Mode :character Median :69.00 Median :51.00 Median :66.00
## Mean :66.97 Mean :51.43 Mean :64.28
## 3rd Qu.:78.00 3rd Qu.:70.00 3rd Qu.:77.00
## Max. :96.00 Max. :97.00 Max. :97.00
## acousticness_. instrumentalness_. liveness_. speechiness_.
## Min. : 0.00 Min. : 0.000 Min. : 3.00 Min. : 2.00
## 1st Qu.: 6.00 1st Qu.: 0.000 1st Qu.:10.00 1st Qu.: 4.00
## Median :18.00 Median : 0.000 Median :12.00 Median : 6.00
## Mean :27.06 Mean : 1.581 Mean :18.21 Mean :10.13
## 3rd Qu.:43.00 3rd Qu.: 0.000 3rd Qu.:24.00 3rd Qu.:11.00
## Max. :97.00 Max. :91.000 Max. :97.00 Max. :64.00
song$streams <- as.numeric(as.character(song$streams))
## Warning: NAs introduced by coercion
song$in_shazam_charts <- as.numeric(as.character(song$in_shazam_charts))
## Warning: NAs introduced by coercion
song <- na.omit(song)
song <- song %>%
select(-track_name, -artist.s._name, -in_deezer_playlists, -in_deezer_charts, -in_shazam_charts, -key, -mode)
glimpse(song)
## Rows: 895
## Columns: 17
## $ artist_count <int> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 3…
## $ released_year <int> 2023, 2023, 2023, 2019, 2023, 2023, 2023, 2023, 2…
## $ released_month <int> 7, 3, 6, 8, 5, 6, 3, 7, 5, 3, 4, 7, 12, 2, 3, 3, …
## $ released_day <int> 14, 23, 30, 23, 18, 1, 16, 7, 15, 17, 17, 7, 8, 2…
## $ in_spotify_playlists <int> 553, 1474, 1397, 7858, 3133, 2186, 3090, 714, 109…
## $ in_spotify_charts <int> 147, 48, 113, 100, 50, 91, 50, 43, 83, 44, 40, 55…
## $ streams <dbl> 141381703, 133716286, 140003974, 800840817, 30323…
## $ in_apple_playlists <int> 43, 48, 94, 116, 84, 67, 34, 25, 60, 49, 41, 37, …
## $ in_apple_charts <int> 263, 126, 207, 207, 133, 213, 222, 89, 210, 110, …
## $ bpm <int> 125, 92, 138, 170, 144, 141, 148, 100, 130, 170, …
## $ danceability_. <int> 80, 71, 51, 55, 65, 92, 67, 67, 85, 81, 57, 78, 6…
## $ valence_. <int> 89, 61, 32, 58, 23, 66, 83, 26, 22, 56, 56, 52, 4…
## $ energy_. <int> 83, 74, 53, 72, 80, 58, 76, 71, 62, 48, 72, 82, 7…
## $ acousticness_. <int> 31, 7, 17, 11, 14, 19, 48, 37, 12, 21, 23, 18, 5,…
## $ instrumentalness_. <int> 0, 0, 0, 0, 63, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0,…
## $ liveness_. <int> 8, 10, 31, 11, 11, 8, 8, 11, 28, 8, 27, 15, 16, 3…
## $ speechiness_. <int> 4, 4, 6, 15, 6, 24, 3, 4, 9, 33, 5, 7, 4, 3, 16, …
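The glimpse above shows the data after cleaning: 895 rows and 17 columns remain. We converted streams and in_shazam_charts to numeric, dropped rows with missing values, and removed the text columns (track name, artist name, key, mode) together with the Deezer and Shazam columns.
First, we separate the data into a training set and a test set.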
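The coercion warnings above mean that some entries in the character columns are not plain numbers. A minimal sketch of how one might inspect such entries before dropping them is shown below. The vector `raw` is made up for illustration and is not taken from the dataset.

```r
# Hypothetical character column mixing numeric strings with bad entries
raw <- c("141381703", "1,021", "", "800840817", "n/a")

# suppressWarnings() hides the "NAs introduced by coercion" warning
num <- suppressWarnings(as.numeric(raw))

# Entries that were present as text but failed to convert
bad <- raw[is.na(num) & !is.na(raw)]
print(bad)

# Number of values that na.omit() would later drop because of this column
n_bad <- sum(is.na(num))
print(n_bad)
```

Looking at `bad` first shows whether the failing values are genuinely missing or only badly formatted, which affects how many rows are lost.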
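# Draw 447 row indices at random from all rows of song for the test set;
# the remaining rows form the training set
N <- seq_len(nrow(song))
S <- sample(N, 447)
songtest_sample <- song[S, ]
songtrain_sample <- song[-S, ]
Export the CSV files: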
#write.csv(songtest_sample, "songtest.csv")
#write.csv(songtrain_sample, "songtrain.csv")
songtest <- read.csv("songtest.csv")
songtrain <- read.csv("songtrain.csv")
songtest <- songtest %>%
select(-X)
songtrain2 <- songtrain %>%
select(-X)
Because the split is random, rerunning it would produce different training and test sets every time. I therefore exported the training and test sets as CSV files once and based the report on those files only. The export calls are commented out with a hash so that rerunning the code does not overwrite the files with a new split.
We will plot a histogram of each variable to check whether it is approximately normally distributed, and plot each variable against song popularity to see whether a quadratic curve fits better than a straight line.
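Another common way to keep a random split stable across runs is to fix the random seed before sampling. The sketch below illustrates this on a toy data frame. The seed value, the data frame `toy` and its columns are assumptions for illustration, not part of the original analysis.

```r
# Toy data frame standing in for `song` (hypothetical values)
toy <- data.frame(id = 1:20, streams = (1:20) * 1000)

# Fixing the seed makes sample() return the same indices on every run
set.seed(2023)  # the seed value itself is arbitrary
test_idx <- sample(seq_len(nrow(toy)), size = 10)

toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

# Resetting the same seed reproduces the same split
set.seed(2023)
test_idx_again <- sample(seq_len(nrow(toy)), size = 10)
identical(test_idx, test_idx_again)
```

Exporting the split to CSV, as done in this report, has the extra benefit that the files stay the same even if the sampling code or the R version changes.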
hist(songtrain2$streams, col = 'skyblue', xlab = "Streams", main = "Histogram of Streams")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain2 %>%
mutate(Logstreams = log(streams)) %>%
select(-streams)
hist(songtrain3$Logstreams, col = "royalblue", xlab = "Log of Streams", main = "Histogram of Log Streams")
After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
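The effect of the log transform on skew can also be checked numerically instead of only by eye. The sketch below uses simulated log-normal data as a stand-in for stream counts. The helper `skewness()` is defined here for illustration and is not a base R function.

```r
# Sample skewness: third central moment divided by the 1.5 power of the variance
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Simulated right-skewed data (log-normal), standing in for stream counts
x <- rlnorm(1000, meanlog = 18, sdlog = 1)

skew_raw <- skewness(x)       # clearly positive for right-skewed data
skew_log <- skewness(log(x))  # close to zero after the transform
print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero after the transform supports what the histogram shows, while a value that stays far from zero suggests the log did not help.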
hist(songtrain2$artist_count, col = "skyblue", xlab = "Artist Count", main = "Histogram of Artist Count")
The histogram is skewed to the right, so this variable is not approximately normally distributed. This is because most songs are credited to a single artist. However, we will still consider this variable for the model.
songtrain3 %>%
ggplot(aes(artist_count, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against artist_count, the quadratic and linear fits are almost identical. Therefore, we will not consider I(artist_count^2) as a candidate term for the best model.
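Comparing the two curves by eye can be backed up with a formal test of the nested models. The sketch below fits a linear and a quadratic model to simulated data and compares them with `anova()`. The data and the variable names `x` and `y` are hypothetical, and this check is an addition rather than part of the original analysis.

```r
set.seed(42)
# Simulated data with a purely linear relationship (hypothetical)
x <- runif(200, 1, 8)
y <- 19 - 0.1 * x + rnorm(200)

fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ x + I(x^2))

# F-test for whether the added quadratic term improves the fit
cmp <- anova(fit_linear, fit_quadratic)
print(cmp)

# Adjusted R-squared of both fits, for a penalized comparison
adj <- c(linear = summary(fit_linear)$adj.r.squared,
         quadratic = summary(fit_quadratic)$adj.r.squared)
print(adj)
```

A large p-value in the second row of the `anova()` table indicates that the quadratic term adds little, which corresponds to the two curves nearly overlapping in the plot.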
hist(songtrain2$released_year, col = "skyblue", xlab = "Released Year", main = "Histogram of Released Year")
The histogram is skewed to the left, so this variable is not approximately normally distributed. This is because most of the songs on the chart were released recently (the median release year in the full dataset is 2022), since this is a 2023 Spotify chart. However, we will still consider this variable for the model.
songtrain3 %>%
ggplot(aes(released_year, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against released_year, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly better. Therefore, we will consider I(released_year^2) as a candidate term for the best model.
hist(songtrain2$released_month, col = "skyblue", xlab = "Released Month", main = "Histogram of Released Month")
The histogram of released_month shows that this variable is approximately normally distributed, so we do not need to take its log.
songtrain3 %>%
ggplot(aes(released_month, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against released_month, the quadratic and linear fits are almost identical. Therefore, we will not consider I(released_month^2) as a candidate term for the best model.
hist(songtrain2$released_day, col = "skyblue", xlab = "The released day of the song", main = "Histogram of songs' released day")
The histogram of released_day shows that this variable is approximately normally distributed, so we do not need to take its log.
songtrain3 %>%
ggplot(aes(released_day, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against released_day, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly better. Therefore, we will consider I(released_day^2) as a candidate term for the best model.
hist(songtrain2$in_spotify_playlists, col = "skyblue", xlab = "Number of Spotify playlists a song is included in", main = "Histogram of Spotify playlists a song is included in")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Login_spotify_playlists = log(in_spotify_playlists)) %>%
select(-in_spotify_playlists)
hist(songtrain3$Login_spotify_playlists, col = "royalblue", xlab = "Log number of playlists a song is included in", main = "Histogram of log number of playlists")
After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
songtrain3 %>%
ggplot(aes(Login_spotify_playlists, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against Login_spotify_playlists, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly better. Therefore, we will consider I(Login_spotify_playlists^2) as a candidate term for the best model.
hist(songtrain2$in_spotify_charts, col = "skyblue", xlab = "Rank of the song on Spotify charts", main = "Histogram of Spotify Ranking")
The histogram is skewed to the right because many of the songs in this dataset have not been on the chart. However, we will still use this variable as a predictor since, in theory, the more often a song appears on the chart, the more streams it should have.
songtrain3 %>%
ggplot(aes(in_spotify_charts, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against in_spotify_charts, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly better. Therefore, we will consider I(in_spotify_charts^2) as a candidate term for the best model.
hist(songtrain2$in_apple_playlists, col = "skyblue", xlab = "Number of Apple playlists a song is included in", main = "Histogram of Apple playlists a song is included in")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Login_apple_playlists = log(in_apple_playlists + 1)) %>%
select(-in_apple_playlists)
hist(songtrain3$Login_apple_playlists, col = "royalblue", xlab = "Log of the number of Apple playlists a song is included in", main = "Histogram of log number of Apple playlists")
We add one before taking the log to avoid infinite values when the variable is zero. After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
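The reason for adding one is that the log of zero is negative infinity, which would break the histogram and any later model fit. A short sketch with made-up counts is shown below. The vector `counts` is hypothetical.

```r
# Hypothetical playlist counts that include a zero
counts <- c(0, 1, 13, 34, 88, 672)

# Without the shift, the zero becomes -Inf
unshifted <- log(counts)
print(unshifted)

# With the shift, every value is finite and zero maps to zero
shifted <- log(counts + 1)
print(shifted)

# Base R's log1p() computes log(1 + x) directly
print(all.equal(shifted, log1p(counts)))
```

Using `log1p()` is an equivalent alternative to `log(x + 1)` and is slightly more accurate for values very close to zero.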
songtrain3 %>%
ggplot(aes(Login_apple_playlists, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against Login_apple_playlists, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly better. Therefore, we will consider I(Login_apple_playlists^2) as a candidate term for the best model.
hist(songtrain2$in_apple_charts, col = "skyblue", xlab = "Rank of song on Apple charts", main = "Histogram of song on Apple Charts")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Login_apple_charts = log(in_apple_charts + 1)) %>%
select(-in_apple_charts)
hist(songtrain3$Login_apple_charts, col = "royalblue", xlab = "Log of the rank of a song on Apple charts", main = "Histogram of log rank of a song on Apple charts")
We add one before taking the log to avoid infinite values when the variable is zero. After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
songtrain3 %>%
ggplot(aes(Login_apple_charts, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against Login_apple_charts, the quadratic and linear fits are similar, but the quadratic curve appears to follow the points slightly better. Therefore, we will consider I(Login_apple_charts^2) as a candidate term for the best model.
hist(songtrain2$bpm, col = "skyblue", xlab = "Beats per Minute", main = "Histogram of BPM")
The histogram of bpm shows that this variable is approximately normally distributed, so we do not need to take its log.
songtrain3 %>%
ggplot(aes(bpm, Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against bpm, the quadratic and linear fits are almost identical. Therefore, we will not consider I(bpm^2) as a candidate term for the best model.
hist(songtrain2$danceability_., col = "skyblue", xlab = "Danceability", main = "Histogram of Danceability")
The histogram of danceability_. shows that this variable is approximately normally distributed, so we do not need to take its log.
songtrain3 %>%
ggplot(aes(danceability_., Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against danceability_., the quadratic and linear fits are almost identical. Therefore, we will not consider I(danceability_.^2) as a candidate term for the best model.
hist(songtrain2$valence_., col = "skyblue", xlab = "Positivity of the song's musical content", main = "Histogram of Valence")
The histogram of valence_. shows that this variable is approximately normally distributed, so we do not need to take its log.
songtrain3 %>%
ggplot(aes(valence_., Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against valence_., the quadratic and linear fits are almost identical. Therefore, we will not consider I(valence_.^2) as a candidate term for the best model.
hist(songtrain2$energy_., col = "skyblue", xlab = "Energy of the song", main = "Histogram of energy")
The histogram of energy_. shows that this variable is approximately normally distributed, so we do not need to take its log.
songtrain3 %>%
ggplot(aes(energy_., Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against energy_., the quadratic and linear fits are almost identical. Therefore, we will not consider I(energy_.^2) as a candidate term for the best model.
hist(songtrain2$acousticness_., col = "skyblue", xlab = "Acousticness", main = "Histogram of Acousticness")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Logacousticness_. = log(acousticness_. + 1)) %>%
select(-acousticness_.)
hist(songtrain3$Logacousticness_., col ="royalblue", xlab = "Log of acousticness", main = "Histogram of log acousticness")
We add one before taking the log to avoid infinite values when the variable is zero. After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
songtrain3 %>%
ggplot(aes(Logacousticness_., Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against Logacousticness_., the quadratic and linear fits are almost identical. Therefore, we will not consider I(Logacousticness_.^2) as a candidate term for the best model.
hist(songtrain2$instrumentalness_., col = "skyblue", xlab = "Instrumentalness", main = "Histogram of instrumentalness")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Loginstrumentalness_. = log(instrumentalness_. + 1)) %>%
select(-instrumentalness_.)
hist(songtrain3$Loginstrumentalness_., col ="royalblue", xlab = "Log of instrumentalness", main = "Log of instrumentalness")
We add one before taking the log to avoid infinite values when the variable is zero. After the log transform, the histogram is still not approximately normal, so the transform does not fix the skew. Therefore, we will not consider this variable for the model.
hist(songtrain2$liveness_., col = "skyblue", xlab = "Presence of live performance elements (%)", main = "Histogram of Live Elements")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Logliveness_. = log(liveness_.)) %>%
select(-liveness_.)
hist(songtrain3$Logliveness_., col ="royalblue", xlab = "Log of the presence of live performance elements (%)", main = "Histogram of Log Live Elements")
After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
songtrain3 %>%
ggplot(aes(Logliveness_., Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against Logliveness_., the quadratic and linear fits are almost identical. Therefore, we will not consider I(Logliveness_.^2) as a candidate term for the best model.
hist(songtrain2$speechiness_., col = "skyblue", xlab = "Speechiness", main = "Histogram of speechiness")
The histogram is skewed to the right, so this variable is not approximately normally distributed, and we take its log.
songtrain3 <- songtrain3 %>%
mutate(Logspeechiness_. = log(speechiness_.)) %>%
select(-speechiness_.)
hist(songtrain3$Logspeechiness_., col ="royalblue", xlab = "Log speechiness", main = "Histogram of log speechiness")
After the log transform, the histogram is approximately normal, so the transform fixes the skew. We will therefore use the log of this variable in the model.
songtrain3 %>%
ggplot(aes(Logspeechiness_., Logstreams))+
geom_point(color = "royalblue1")+
stat_smooth(method = "lm",formula = y ~ poly(x,1), color = "red")+
stat_smooth(method = "lm",formula = y ~ poly(x,2), color = "orange")
In the plot of Logstreams against Logspeechiness_., the quadratic and linear fits are almost identical. Therefore, we will not consider I(Logspeechiness_.^2) as a candidate term for the best model.
To check for an interaction between released_day and Login_spotify_playlists, we split the data into groups by Login_spotify_playlists and plot Logstreams against released_day within each group.
songtrain3 %>% filter(Login_spotify_playlists >= 0 & Login_spotify_playlists <= 7) %>% ggplot(aes(released_day, Logstreams)) + geom_point(color = "skyblue")+geom_smooth(method = "lm", color = "red") + labs(title = "Interaction Plot of released_day and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'
songtrain3 %>% filter(Login_spotify_playlists > 7 & Login_spotify_playlists <= 9) %>% ggplot(aes(released_day, Logstreams)) + geom_point(color = "royalblue")+geom_smooth(method = "lm", color = "red")+ labs(title = "Interaction Plot of released_day and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'
songtrain3 %>% filter(Login_spotify_playlists > 9) %>% ggplot(aes(released_day, Logstreams)) + geom_point(color = "navy")+geom_smooth(method = "lm", color = "red") + labs(title = "Interaction Plot of released_day and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'
These plots show no clear trend in any group, so the slope of Logstreams against released_day does not appear to depend on Login_spotify_playlists. Therefore, we do not include this interaction term among the candidate variables for the best model.
Let’s check one more interaction term, this time between energy_. and Login_spotify_playlists.
songtrain3 %>% filter(Login_spotify_playlists >= 0 & Login_spotify_playlists <= 7) %>% ggplot(aes(energy_., Logstreams)) + geom_point(color = "skyblue")+geom_smooth(method = "lm", color = "red") + labs(title = "Interaction Plot of energy_. and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'
songtrain3 %>% filter(Login_spotify_playlists > 7 & Login_spotify_playlists <= 9) %>% ggplot(aes(energy_., Logstreams)) + geom_point(color = "royalblue")+geom_smooth(method = "lm", color = "red") + labs(title = "Interaction Plot of energy_. and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'
songtrain3 %>% filter(Login_spotify_playlists > 9) %>% ggplot(aes(energy_., Logstreams)) + geom_point(color = "navy")+geom_smooth(method = "lm", color = "red") + labs(title = "Interaction Plot of energy_. and Login_spotify_playlists")
## `geom_smooth()` using formula = 'y ~ x'
Again, these plots show no clear trend in any group, so the slope of Logstreams against energy_. does not appear to depend on Login_spotify_playlists. Therefore, we do not include this interaction term among the candidate variables for the best model.
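The grouped plots are a visual check. A complementary numeric check is to fit a model that includes the interaction term and look at its coefficient. The sketch below does this on simulated data with no true interaction. The variables `energy`, `playlists` and `logstreams` are hypothetical stand-ins, and this check is an addition rather than part of the original analysis.

```r
set.seed(7)
n <- 300
# Simulated predictors and response with no true interaction (hypothetical)
energy     <- runif(n, 10, 95)
playlists  <- rnorm(n, mean = 8, sd = 1.5)
logstreams <- 10 + 1.1 * playlists + rnorm(n, sd = 0.5)

# energy * playlists expands to both main effects plus energy:playlists
fit <- lm(logstreams ~ energy * playlists)
coefs <- summary(fit)$coefficients

# Estimate, standard error, t value and p-value of the interaction term
print(coefs["energy:playlists", ])
```

An interaction estimate that is small relative to its standard error is consistent with the similar slopes seen across the three groups in the plots.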
cor(songtrain3)
## artist_count released_year released_month released_day
## artist_count 1.000000000 0.106810521 -0.008180603 0.056057079
## released_year 0.106810521 1.000000000 0.061312795 0.078709598
## released_month -0.008180603 0.061312795 1.000000000 0.075768333
## released_day 0.056057079 0.078709598 0.075768333 1.000000000
## in_spotify_charts -0.044775331 -0.033404658 -0.046296838 -0.013569012
## bpm -0.015177062 0.029475277 -0.047385342 -0.059158876
## danceability_. 0.267606806 0.113874864 -0.056125166 0.031512743
## valence_. 0.181556082 -0.071527723 -0.186141454 0.046435367
## energy_. 0.190884266 0.002169269 -0.059985458 0.048946540
## Logstreams -0.058810263 -0.275761046 0.043743792 0.107736939
## Login_spotify_playlists -0.026381520 -0.434432988 -0.007532317 0.007167030
## Login_apple_playlists 0.101961289 -0.260879144 0.003830782 0.064866354
## Login_apple_charts -0.151447141 -0.143568654 0.060185043 -0.035998023
## Logacousticness_. -0.058880378 0.013540150 -0.008621915 0.002118181
## Loginstrumentalness_. -0.128318562 -0.104007321 0.065836377 0.015342311
## Logliveness_. 0.059694527 0.051878154 -0.016946051 -0.016049702
## Logspeechiness_. 0.159523734 0.118884400 0.015485453 -0.016884250
## in_spotify_charts bpm danceability_.
## artist_count -0.04477533 -0.015177062 0.26760681
## released_year -0.03340466 0.029475277 0.11387486
## released_month -0.04629684 -0.047385342 -0.05612517
## released_day -0.01356901 -0.059158876 0.03151274
## in_spotify_charts 1.00000000 0.016017589 0.01998765
## bpm 0.01601759 1.000000000 -0.10224367
## danceability_. 0.01998765 -0.102243670 1.00000000
## valence_. 0.08373595 -0.015341812 0.39487445
## energy_. 0.06107525 0.007069839 0.18059295
## Logstreams 0.32427195 -0.006475977 0.02739416
## Login_spotify_playlists 0.19343294 -0.035757248 -0.02942905
## Login_apple_playlists 0.19039098 -0.030208978 0.13793751
## Login_apple_charts 0.30247000 -0.083644732 0.01569308
## Logacousticness_. -0.03801276 -0.076463396 -0.15953767
## Loginstrumentalness_. -0.07537085 -0.048405969 -0.17970892
## Logliveness_. -0.02628857 -0.029554319 -0.06622832
## Logspeechiness_. -0.04677227 0.097613871 0.20909260
## valence_. energy_. Logstreams
## artist_count 0.18155608 0.190884266 -0.058810263
## released_year -0.07152772 0.002169269 -0.275761046
## released_month -0.18614145 -0.059985458 0.043743792
## released_day 0.04643537 0.048946540 0.107736939
## in_spotify_charts 0.08373595 0.061075249 0.324271949
## bpm -0.01534181 0.007069839 -0.006475977
## danceability_. 0.39487445 0.180592952 0.027394160
## valence_. 1.00000000 0.373863941 -0.001801740
## energy_. 0.37386394 1.000000000 0.054954160
## Logstreams -0.00180174 0.054954160 1.000000000
## Login_spotify_playlists -0.03850706 0.040532652 0.762942174
## Login_apple_playlists 0.06726161 0.114414061 0.698380539
## Login_apple_charts 0.02198265 0.124607248 0.497448939
## Logacousticness_. -0.03100504 -0.448604051 -0.101385863
## Loginstrumentalness_. -0.12783397 -0.101127695 0.020105996
## Logliveness_. -0.01432861 0.063322383 -0.043590364
## Logspeechiness_. 0.08631690 0.111212690 -0.158019883
## Login_spotify_playlists Login_apple_playlists
## artist_count -0.026381520 0.101961289
## released_year -0.434432988 -0.260879144
## released_month -0.007532317 0.003830782
## released_day 0.007167030 0.064866354
## in_spotify_charts 0.193432944 0.190390983
## bpm -0.035757248 -0.030208978
## danceability_. -0.029429048 0.137937508
## valence_. -0.038507061 0.067261607
## energy_. 0.040532652 0.114414061
## Logstreams 0.762942174 0.698380539
## Login_spotify_playlists 1.000000000 0.724623847
## Login_apple_playlists 0.724623847 1.000000000
## Login_apple_charts 0.345094228 0.456850561
## Logacousticness_. -0.145580828 -0.200347131
## Loginstrumentalness_. 0.065869484 -0.039075810
## Logliveness_. -0.035924257 -0.050530857
## Logspeechiness_. -0.083378273 -0.129618095
## Login_apple_charts Logacousticness_.
## artist_count -0.15144714 -0.058880378
## released_year -0.14356865 0.013540150
## released_month 0.06018504 -0.008621915
## released_day -0.03599802 0.002118181
## in_spotify_charts 0.30247000 -0.038012756
## bpm -0.08364473 -0.076463396
## danceability_. 0.01569308 -0.159537667
## valence_. 0.02198265 -0.031005043
## energy_. 0.12460725 -0.448604051
## Logstreams 0.49744894 -0.101385863
## Login_spotify_playlists 0.34509423 -0.145580828
## Login_apple_playlists 0.45685056 -0.200347131
## Login_apple_charts 1.00000000 -0.144498524
## Logacousticness_. -0.14449852 1.000000000
## Loginstrumentalness_. -0.03320082 0.059981137
## Logliveness_. 0.03014205 -0.009913390
## Logspeechiness_. -0.13558022 0.002784047
## Loginstrumentalness_. Logliveness_. Logspeechiness_.
## artist_count -0.12831856 0.05969453 0.159523734
## released_year -0.10400732 0.05187815 0.118884400
## released_month 0.06583638 -0.01694605 0.015485453
## released_day 0.01534231 -0.01604970 -0.016884250
## in_spotify_charts -0.07537085 -0.02628857 -0.046772270
## bpm -0.04840597 -0.02955432 0.097613871
## danceability_. -0.17970892 -0.06622832 0.209092601
## valence_. -0.12783397 -0.01432861 0.086316903
## energy_. -0.10112769 0.06332238 0.111212690
## Logstreams 0.02010600 -0.04359036 -0.158019883
## Login_spotify_playlists 0.06586948 -0.03592426 -0.083378273
## Login_apple_playlists -0.03907581 -0.05053086 -0.129618095
## Login_apple_charts -0.03320082 0.03014205 -0.135580220
## Logacousticness_. 0.05998114 -0.00991339 0.002784047
## Loginstrumentalness_. 1.00000000 -0.07352980 -0.186365272
## Logliveness_. -0.07352980 1.00000000 -0.013093706
## Logspeechiness_. -0.18636527 -0.01309371 1.000000000
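From the correlation matrix, no pair of predictors is correlated strongly enough to justify dropping one of them. For example, Login_spotify_playlists and Login_apple_playlists correlate at about 0.72, and energy_. and Logacousticness_. at about -0.45. We therefore keep all variables.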
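Reading a large correlation matrix by eye is error-prone, so the same check can be done programmatically. The following is a minimal sketch on simulated data (not the Spotify data); the variable names and the 0.8 cutoff are assumptions chosen for illustration only.

```r
# Simulated stand-in for the numeric predictors (not the Spotify data)
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # built to be almost collinear with x1
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

cor_mat <- cor(predictors)

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff
cutoff <- 0.8   # hypothetical threshold, not taken from the original analysis
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

If the resulting table is empty for the real predictors at the chosen cutoff, that supports keeping every variable.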
best.subset <- regsubsets(Logstreams ~ . + I(Login_apple_charts^2) + I(Login_apple_playlists^2) +
                            I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
                            I(released_day^2) + I(released_year^2),
                          data = songtrain3, nvmax = 22)
sum <- summary(best.subset)
sum$outmat
## artist_count released_year released_month released_day
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " "*" " " " "
## 8 ( 1 ) " " "*" " " " "
## 9 ( 1 ) " " "*" " " " "
## 10 ( 1 ) " " "*" " " " "
## 11 ( 1 ) " " "*" " " " "
## 12 ( 1 ) " " "*" " " " "
## 13 ( 1 ) " " "*" " " " "
## 14 ( 1 ) " " "*" " " "*"
## 15 ( 1 ) "*" "*" " " "*"
## 16 ( 1 ) " " "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*"
## in_spotify_charts bpm danceability_. valence_. energy_.
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " "
## 9 ( 1 ) "*" " " " " " " " "
## 10 ( 1 ) "*" "*" " " " " " "
## 11 ( 1 ) "*" "*" " " " " " "
## 12 ( 1 ) "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" " " " "
## 15 ( 1 ) "*" "*" "*" " " " "
## 16 ( 1 ) "*" "*" "*" " " " "
## 17 ( 1 ) "*" "*" "*" " " " "
## 18 ( 1 ) "*" "*" "*" "*" " "
## 19 ( 1 ) "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*" "*"
## Login_spotify_playlists Login_apple_playlists Login_apple_charts
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) " " "*" "*"
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) " " "*" "*"
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) " " "*" "*"
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) " " "*" "*"
## 14 ( 1 ) " " "*" "*"
## 15 ( 1 ) " " "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
## 21 ( 1 ) "*" "*" "*"
## 22 ( 1 ) "*" "*" "*"
## Logacousticness_. Loginstrumentalness_. Logliveness_.
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) "*" " " " "
## 10 ( 1 ) "*" " " " "
## 11 ( 1 ) "*" " " " "
## 12 ( 1 ) "*" " " " "
## 13 ( 1 ) "*" " " " "
## 14 ( 1 ) "*" " " " "
## 15 ( 1 ) "*" " " " "
## 16 ( 1 ) "*" " " " "
## 17 ( 1 ) "*" " " " "
## 18 ( 1 ) "*" " " " "
## 19 ( 1 ) "*" " " " "
## 20 ( 1 ) "*" "*" " "
## 21 ( 1 ) "*" "*" " "
## 22 ( 1 ) "*" "*" "*"
## Logspeechiness_. I(Login_apple_charts^2) I(Login_apple_playlists^2)
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) " " " " " "
## 10 ( 1 ) " " " " " "
## 11 ( 1 ) "*" " " " "
## 12 ( 1 ) "*" " " " "
## 13 ( 1 ) "*" " " "*"
## 14 ( 1 ) "*" " " "*"
## 15 ( 1 ) "*" " " "*"
## 16 ( 1 ) "*" " " "*"
## 17 ( 1 ) "*" " " "*"
## 18 ( 1 ) "*" " " "*"
## 19 ( 1 ) "*" " " "*"
## 20 ( 1 ) "*" " " "*"
## 21 ( 1 ) "*" "*" "*"
## 22 ( 1 ) "*" "*" "*"
## I(in_spotify_charts^2) I(Login_spotify_playlists^2) I(released_day^2)
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
## 21 ( 1 ) "*" "*" "*"
## 22 ( 1 ) "*" "*" "*"
## I(released_year^2)
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) "*"
## 8 ( 1 ) "*"
## 9 ( 1 ) "*"
## 10 ( 1 ) "*"
## 11 ( 1 ) "*"
## 12 ( 1 ) "*"
## 13 ( 1 ) "*"
## 14 ( 1 ) "*"
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
## 17 ( 1 ) "*"
## 18 ( 1 ) "*"
## 19 ( 1 ) "*"
## 20 ( 1 ) "*"
## 21 ( 1 ) "*"
## 22 ( 1 ) "*"
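Besides the star table, the summary of a `regsubsets` fit also stores one value per model size for several selection criteria, which can make the choice of size less manual. The sketch below uses the built-in `mtcars` data as a stand-in (with `mpg` playing the role of Logstreams); the object names `fit`, `fit_summary` and `criteria` are hypothetical.

```r
library(leaps)

# Small built-in dataset as a stand-in; mpg plays the role of Logstreams
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
fit_summary <- summary(fit)

# One value per model size: adjusted R-squared (higher is better),
# BIC and Mallows' Cp (lower is better)
criteria <- data.frame(
  size  = seq_along(fit_summary$adjr2),
  adjr2 = fit_summary$adjr2,
  bic   = fit_summary$bic,
  cp    = fit_summary$cp
)
print(criteria)

best_size <- which.max(fit_summary$adjr2)

# Names of the terms in the chosen model (first entry is the intercept)
print(names(coef(fit, id = best_size)))
```

The different criteria do not always agree on the size; BIC tends to favour smaller models than adjusted R-squared.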
With all 22 variables, the model is:
lm22 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. + valence_. + energy_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Loginstrumentalness_. + Logliveness_. + Logspeechiness_. +
             I(Login_apple_charts^2) + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm22)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month +
## released_day + in_spotify_charts + bpm + danceability_. +
## valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists +
## Login_apple_charts + Logacousticness_. + Loginstrumentalness_. +
## Logliveness_. + Logspeechiness_. + I(Login_apple_charts^2) +
## I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41806 -0.30688 0.02724 0.30363 2.10042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.504e+03 7.687e+02 -3.258 0.001214 **
## artist_count -3.426e-02 2.817e-02 -1.216 0.224646
## released_year 2.514e+00 7.699e-01 3.266 0.001179 **
## released_month 9.661e-03 6.900e-03 1.400 0.162168
## released_day -1.484e-02 1.080e-02 -1.374 0.170026
## in_spotify_charts 4.246e-02 1.089e-02 3.900 0.000112 ***
## bpm 2.094e-03 8.489e-04 2.467 0.014025 *
## danceability_. 2.172e-03 1.841e-03 1.180 0.238754
## valence_. 8.956e-04 1.236e-03 0.725 0.469001
## energy_. -6.987e-04 1.729e-03 -0.404 0.686370
## Login_spotify_playlists -3.478e-01 2.601e-01 -1.337 0.181807
## Login_apple_playlists 2.742e-01 7.781e-02 3.524 0.000470 ***
## Login_apple_charts 1.091e-01 5.155e-02 2.117 0.034824 *
## Logacousticness_. 5.097e-02 2.126e-02 2.398 0.016922 *
## Loginstrumentalness_. -1.254e-02 3.202e-02 -0.392 0.695586
## Logliveness_. -1.150e-02 3.749e-02 -0.307 0.759111
## Logspeechiness_. -6.735e-02 3.042e-02 -2.214 0.027382 *
## I(Login_apple_charts^2) -3.388e-03 1.043e-02 -0.325 0.745492
## I(Login_apple_playlists^2) -2.450e-02 1.332e-02 -1.840 0.066527 .
## I(in_spotify_charts^2) -1.058e-03 3.967e-04 -2.667 0.007949 **
## I(Login_spotify_playlists^2) 5.045e-02 1.730e-02 2.916 0.003734 **
## I(released_day^2) 7.830e-04 3.499e-04 2.238 0.025749 *
## I(released_year^2) -6.267e-04 1.928e-04 -3.251 0.001243 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4832 on 425 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.7192
## F-statistic: 53.05 on 22 and 425 DF, p-value: < 2.2e-16
The adjusted R-squared of this model is fairly high (0.7192). However, the p-values of several variables are above 0.05 (for example Logliveness_., p = 0.759, and I(Login_apple_charts^2), p = 0.745), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
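Individual t-tests look at one coefficient at a time. A complementary check is a partial F-test, which asks whether a group of variables can be dropped together. This is a minimal sketch on simulated data, where `x2` and `x3` are noise by construction; none of the names come from the Spotify dataset.

```r
# Simulated data: y depends on x1 only; x2 and x3 are pure noise
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = sim)
reduced <- lm(y ~ x1, data = sim)

# Partial F-test: H0 is that the coefficients of x2 and x3 are both zero
f_test <- anova(reduced, full)
print(f_test)

# Adjusted R-squared of both fits, for comparison
print(c(full = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```

A large p-value in the second row of the `anova` table means the data give no evidence against the smaller model.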
With 21 variables, the model is:
lm21 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. + valence_. + energy_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Loginstrumentalness_. + Logspeechiness_. +
             I(Login_apple_charts^2) + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm21)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month +
## released_day + in_spotify_charts + bpm + danceability_. +
## valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists +
## Login_apple_charts + Logacousticness_. + Loginstrumentalness_. +
## Logspeechiness_. + I(Login_apple_charts^2) + I(Login_apple_playlists^2) +
## I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) +
## I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41143 -0.30965 0.02599 0.29760 2.09039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.498e+03 7.676e+02 -3.254 0.001227 **
## artist_count -3.496e-02 2.805e-02 -1.246 0.213351
## released_year 2.509e+00 7.688e-01 3.263 0.001191 **
## released_month 9.725e-03 6.889e-03 1.412 0.158784
## released_day -1.458e-02 1.075e-02 -1.356 0.175846
## in_spotify_charts 4.253e-02 1.087e-02 3.911 0.000107 ***
## bpm 2.103e-03 8.475e-04 2.482 0.013468 *
## danceability_. 2.224e-03 1.831e-03 1.214 0.225289
## valence_. 8.968e-04 1.234e-03 0.727 0.467902
## energy_. -7.286e-04 1.725e-03 -0.422 0.672894
## Login_spotify_playlists -3.542e-01 2.590e-01 -1.368 0.172095
## Login_apple_playlists 2.754e-01 7.763e-02 3.548 0.000431 ***
## Login_apple_charts 1.083e-01 5.142e-02 2.106 0.035771 *
## Logacousticness_. 5.089e-02 2.123e-02 2.397 0.016960 *
## Loginstrumentalness_. -1.173e-02 3.188e-02 -0.368 0.712980
## Logspeechiness_. -6.703e-02 3.037e-02 -2.207 0.027853 *
## I(Login_apple_charts^2) -3.300e-03 1.042e-02 -0.317 0.751537
## I(Login_apple_playlists^2) -2.461e-02 1.330e-02 -1.850 0.064936 .
## I(in_spotify_charts^2) -1.059e-03 3.962e-04 -2.672 0.007834 **
## I(Login_spotify_playlists^2) 5.085e-02 1.723e-02 2.951 0.003341 **
## I(released_day^2) 7.747e-04 3.485e-04 2.223 0.026735 *
## I(released_year^2) -6.253e-04 1.925e-04 -3.248 0.001256 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4827 on 426 degrees of freedom
## Multiple R-squared: 0.733, Adjusted R-squared: 0.7198
## F-statistic: 55.69 on 21 and 426 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7198). However, the p-values of several variables are still above 0.05 (for example I(Login_apple_charts^2), p = 0.752, and Loginstrumentalness_., p = 0.713), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
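Since each model in this sequence differs from the previous one by a single term, retyping the full formula is not strictly necessary. As a sketch, base R's `update()` can remove one term from an existing fit; the example uses the built-in `mtcars` data and hypothetical object names rather than `songtrain3`.

```r
# Built-in data as a stand-in for songtrain3
full <- lm(mpg ~ wt + hp + qsec + drat + I(wt^2), data = mtcars)

# Drop one term without retyping the formula
smaller <- update(full, . ~ . - drat)

print(formula(smaller))
print(setdiff(names(coef(full)), names(coef(smaller))))
```

This keeps the chain of models consistent and avoids accidentally dropping or adding a term when copying a long formula.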
With 20 variables, the model is:
lm20 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. + valence_. + energy_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Loginstrumentalness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm20)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month +
## released_day + in_spotify_charts + bpm + danceability_. +
## valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists +
## Login_apple_charts + Logacousticness_. + Loginstrumentalness_. +
## Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
## I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
## data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41393 -0.31178 0.02887 0.29274 2.09638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.490e+03 7.664e+02 -3.249 0.001248 **
## artist_count -3.493e-02 2.802e-02 -1.247 0.213221
## released_year 2.501e+00 7.676e-01 3.258 0.001212 **
## released_month 9.761e-03 6.881e-03 1.419 0.156767
## released_day -1.471e-02 1.073e-02 -1.371 0.171030
## in_spotify_charts 4.206e-02 1.076e-02 3.909 0.000108 ***
## bpm 2.097e-03 8.464e-04 2.477 0.013618 *
## danceability_. 2.243e-03 1.828e-03 1.227 0.220493
## valence_. 8.764e-04 1.231e-03 0.712 0.477038
## energy_. -7.124e-04 1.722e-03 -0.414 0.679310
## Login_spotify_playlists -3.463e-01 2.575e-01 -1.345 0.179341
## Login_apple_playlists 2.786e-01 7.689e-02 3.624 0.000325 ***
## Login_apple_charts 9.300e-02 1.759e-02 5.288 1.98e-07 ***
## Logacousticness_. 5.139e-02 2.115e-02 2.430 0.015506 *
## Loginstrumentalness_. -1.272e-02 3.169e-02 -0.401 0.688325
## Logspeechiness_. -6.704e-02 3.034e-02 -2.210 0.027665 *
## I(Login_apple_playlists^2) -2.514e-02 1.318e-02 -1.907 0.057227 .
## I(in_spotify_charts^2) -1.045e-03 3.935e-04 -2.656 0.008205 **
## I(Login_spotify_playlists^2) 5.034e-02 1.714e-02 2.937 0.003489 **
## I(released_day^2) 7.791e-04 3.478e-04 2.240 0.025612 *
## I(released_year^2) -6.233e-04 1.922e-04 -3.242 0.001279 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4822 on 427 degrees of freedom
## Multiple R-squared: 0.7329, Adjusted R-squared: 0.7204
## F-statistic: 58.59 on 20 and 427 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7204). However, the p-values of several variables are still above 0.05 (for example Loginstrumentalness_., p = 0.688, and energy_., p = 0.679), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
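One possible reason (an interpretation, not something established in this analysis) why a linear term such as released_day or Login_spotify_playlists looks non-significant while its square is significant is that a positive variable and its own square are strongly correlated. The sketch below uses a simulated day-of-month predictor to show that centering before squaring removes that correlation without changing the fitted values; all names and numbers are hypothetical.

```r
# Simulated day-of-month predictor (hypothetical data, not the Spotify set)
set.seed(7)
day <- rep(1:31, each = 4)
y   <- 5 + 0.002 * (day - 16)^2 + rnorm(length(day), sd = 0.5)

day_c <- day - mean(day)   # centered copy

# Correlation between a variable and its own square
print(cor(day, day^2))       # high for strictly positive values
print(cor(day_c, day_c^2))   # essentially 0 after centering here

raw      <- lm(y ~ day + I(day^2))
centered <- lm(y ~ day_c + I(day_c^2))

# Same fitted values, but a smaller standard error on the linear term
print(all.equal(unname(fitted(raw)), unname(fitted(centered))))
print(summary(raw)$coefficients[, "Std. Error"])
print(summary(centered)$coefficients[, "Std. Error"])
```

Note that the linear coefficient means something different in the two fits: the slope at zero in the raw version and the slope at the mean in the centered one, so the two t-tests are not testing the same hypothesis.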
With 19 variables, the model is:
lm19 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. + valence_. + energy_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm19)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month +
## released_day + in_spotify_charts + bpm + danceability_. +
## valence_. + energy_. + Login_spotify_playlists + Login_apple_playlists +
## Login_apple_charts + Logacousticness_. + Logspeechiness_. +
## I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40949 -0.30902 0.02315 0.29568 2.09761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.460e+03 7.619e+02 -3.229 0.001340 **
## artist_count -3.430e-02 2.795e-02 -1.227 0.220446
## released_year 2.470e+00 7.630e-01 3.237 0.001301 **
## released_month 9.649e-03 6.868e-03 1.405 0.160788
## released_day -1.469e-02 1.072e-02 -1.370 0.171384
## in_spotify_charts 4.225e-02 1.074e-02 3.934 9.73e-05 ***
## bpm 2.108e-03 8.452e-04 2.494 0.013014 *
## danceability_. 2.300e-03 1.821e-03 1.263 0.207302
## valence_. 8.810e-04 1.230e-03 0.716 0.474293
## energy_. -6.993e-04 1.720e-03 -0.407 0.684536
## Login_spotify_playlists -3.537e-01 2.566e-01 -1.378 0.168817
## Login_apple_playlists 2.776e-01 7.678e-02 3.616 0.000335 ***
## Login_apple_charts 9.313e-02 1.757e-02 5.301 1.85e-07 ***
## Logacousticness_. 5.113e-02 2.112e-02 2.421 0.015890 *
## Logspeechiness_. -6.514e-02 2.994e-02 -2.176 0.030126 *
## I(Login_apple_playlists^2) -2.482e-02 1.315e-02 -1.888 0.059675 .
## I(in_spotify_charts^2) -1.048e-03 3.931e-04 -2.667 0.007952 **
## I(Login_spotify_playlists^2) 5.075e-02 1.709e-02 2.969 0.003152 **
## I(released_day^2) 7.778e-04 3.475e-04 2.238 0.025707 *
## I(released_year^2) -6.156e-04 1.911e-04 -3.221 0.001373 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4817 on 428 degrees of freedom
## Multiple R-squared: 0.7328, Adjusted R-squared: 0.721
## F-statistic: 61.79 on 19 and 428 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.721). However, the p-values of several variables are still above 0.05 (for example energy_., p = 0.685, and valence_., p = 0.474), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
With 18 variables, the model is:
lm18 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. + valence_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm18)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month +
## released_day + in_spotify_charts + bpm + danceability_. +
## valence_. + Login_spotify_playlists + Login_apple_playlists +
## Login_apple_charts + Logacousticness_. + Logspeechiness_. +
## I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42605 -0.30542 0.02699 0.30007 2.09362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.423e+03 7.558e+02 -3.206 0.001446 **
## artist_count -3.588e-02 2.765e-02 -1.298 0.195055
## released_year 2.433e+00 7.569e-01 3.215 0.001404 **
## released_month 9.729e-03 6.859e-03 1.418 0.156799
## released_day -1.457e-02 1.071e-02 -1.361 0.174087
## in_spotify_charts 4.203e-02 1.072e-02 3.922 0.000102 ***
## bpm 2.115e-03 8.441e-04 2.506 0.012580 *
## danceability_. 2.369e-03 1.811e-03 1.308 0.191620
## valence_. 6.925e-04 1.138e-03 0.608 0.543294
## Login_spotify_playlists -3.526e-01 2.563e-01 -1.375 0.169721
## Login_apple_playlists 2.776e-01 7.670e-02 3.620 0.000330 ***
## Login_apple_charts 9.248e-02 1.748e-02 5.291 1.94e-07 ***
## Logacousticness_. 5.500e-02 1.883e-02 2.922 0.003664 **
## Logspeechiness_. -6.656e-02 2.971e-02 -2.241 0.025569 *
## I(Login_apple_playlists^2) -2.482e-02 1.313e-02 -1.890 0.059473 .
## I(in_spotify_charts^2) -1.039e-03 3.920e-04 -2.650 0.008352 **
## I(Login_spotify_playlists^2) 5.073e-02 1.707e-02 2.971 0.003133 **
## I(released_day^2) 7.730e-04 3.469e-04 2.228 0.026391 *
## I(released_year^2) -6.064e-04 1.896e-04 -3.199 0.001482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4813 on 429 degrees of freedom
## Multiple R-squared: 0.7327, Adjusted R-squared: 0.7215
## F-statistic: 65.34 on 18 and 429 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7215). However, the p-values of several variables are still above 0.05 (for example valence_., p = 0.543, and artist_count, p = 0.195), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
With 17 variables, the model is:
lm17 <- lm(Logstreams ~ artist_count + released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm17)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_month +
## released_day + in_spotify_charts + bpm + danceability_. +
## Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
## Logacousticness_. + Logspeechiness_. + I(Login_apple_playlists^2) +
## I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) +
## I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43203 -0.29333 0.02828 0.29477 2.09569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.384e+03 7.525e+02 -3.168 0.001643 **
## artist_count -3.430e-02 2.751e-02 -1.247 0.213051
## released_year 2.395e+00 7.537e-01 3.177 0.001594 **
## released_month 9.039e-03 6.760e-03 1.337 0.181861
## released_day -1.366e-02 1.059e-02 -1.290 0.197786
## in_spotify_charts 4.261e-02 1.066e-02 3.996 7.58e-05 ***
## bpm 2.123e-03 8.434e-04 2.518 0.012179 *
## danceability_. 2.777e-03 1.681e-03 1.652 0.099339 .
## Login_spotify_playlists -3.541e-01 2.561e-01 -1.382 0.167565
## Login_apple_playlists 2.713e-01 7.594e-02 3.573 0.000393 ***
## Login_apple_charts 9.249e-02 1.747e-02 5.296 1.89e-07 ***
## Logacousticness_. 5.509e-02 1.881e-02 2.928 0.003589 **
## Logspeechiness_. -6.614e-02 2.968e-02 -2.229 0.026351 *
## I(Login_apple_playlists^2) -2.356e-02 1.296e-02 -1.818 0.069778 .
## I(in_spotify_charts^2) -1.051e-03 3.912e-04 -2.685 0.007525 **
## I(Login_spotify_playlists^2) 5.064e-02 1.706e-02 2.968 0.003165 **
## I(released_day^2) 7.468e-04 3.440e-04 2.171 0.030487 *
## I(released_year^2) -5.969e-04 1.888e-04 -3.162 0.001679 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4809 on 430 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.7219
## F-statistic: 69.26 on 17 and 430 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7219). However, the p-values of several variables are still above 0.05 (for example artist_count, p = 0.213, and released_day, p = 0.198), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
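Deciding which term to remove next can also be done with base R's `drop1()`, which tests the removal of each term in turn while keeping all the others. This is a sketch on the built-in `mtcars` data with hypothetical object names; it illustrates the idea of backward elimination and is not the procedure used to produce the models in this report.

```r
# Built-in data as a stand-in for songtrain3
fit <- lm(mpg ~ wt + hp + qsec + drat + gear, data = mtcars)

# F-test for removing each term in turn, keeping all the others
drops <- drop1(fit, test = "F")
print(drops)

# Term with the largest p-value: the usual first candidate for removal
p_values  <- drops[["Pr(>F)"]][-1]          # first row is "<none>"
candidate <- rownames(drops)[-1][which.max(p_values)]
print(candidate)
```

For single-coefficient terms these F-tests are equivalent to the t-tests in `summary()`, so the candidate is the term with the largest p-value in the coefficient table.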
With 16 variables, the model is:
lm16 <- lm(Logstreams ~ released_year + released_month + released_day +
             in_spotify_charts + bpm + danceability_. +
             Login_spotify_playlists + Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm16)
##
## Call:
## lm(formula = Logstreams ~ released_year + released_month + released_day +
## in_spotify_charts + bpm + danceability_. + Login_spotify_playlists +
## Login_apple_playlists + Login_apple_charts + Logacousticness_. +
## Logspeechiness_. + I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
## I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
## data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42231 -0.29779 0.01975 0.28636 2.05080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.387e+03 7.530e+02 -3.170 0.001632 **
## released_year 2.398e+00 7.542e-01 3.179 0.001582 **
## released_month 9.005e-03 6.764e-03 1.331 0.183797
## released_day -1.409e-02 1.059e-02 -1.330 0.184119
## in_spotify_charts 4.239e-02 1.067e-02 3.973 8.32e-05 ***
## bpm 2.152e-03 8.437e-04 2.550 0.011106 *
## danceability_. 2.382e-03 1.652e-03 1.442 0.150109
## Login_spotify_playlists -3.541e-01 2.563e-01 -1.382 0.167787
## Login_apple_playlists 2.660e-01 7.587e-02 3.506 0.000502 ***
## Login_apple_charts 9.688e-02 1.712e-02 5.659 2.78e-08 ***
## Logacousticness_. 5.574e-02 1.882e-02 2.962 0.003224 **
## Logspeechiness_. -7.037e-02 2.950e-02 -2.385 0.017495 *
## I(Login_apple_playlists^2) -2.386e-02 1.297e-02 -1.840 0.066448 .
## I(in_spotify_charts^2) -1.042e-03 3.914e-04 -2.661 0.008074 **
## I(Login_spotify_playlists^2) 5.087e-02 1.707e-02 2.980 0.003048 **
## I(released_day^2) 7.589e-04 3.441e-04 2.206 0.027934 *
## I(released_year^2) -5.977e-04 1.889e-04 -3.164 0.001665 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4812 on 431 degrees of freedom
## Multiple R-squared: 0.7315, Adjusted R-squared: 0.7216
## F-statistic: 73.4 on 16 and 431 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7216). However, the p-values of several variables are still above 0.05 (for example released_day and released_month, both p = 0.184, and Login_spotify_playlists, p = 0.168), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
With 15 variables, the model is:
lm15 <- lm(Logstreams ~ artist_count + released_year + released_day +
             in_spotify_charts + bpm + danceability_. +
             Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm15)
##
## Call:
## lm(formula = Logstreams ~ artist_count + released_year + released_day +
## in_spotify_charts + bpm + danceability_. + Login_apple_playlists +
## Login_apple_charts + Logacousticness_. + Logspeechiness_. +
## I(Login_apple_playlists^2) + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48393 -0.31104 0.02565 0.30241 2.12722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.862e+03 6.950e+02 -4.118 4.58e-05 ***
## artist_count -3.418e-02 2.755e-02 -1.241 0.215388
## released_year 2.873e+00 6.961e-01 4.128 4.40e-05 ***
## released_day -1.474e-02 1.052e-02 -1.400 0.162098
## in_spotify_charts 4.399e-02 1.064e-02 4.134 4.28e-05 ***
## bpm 2.073e-03 8.439e-04 2.456 0.014434 *
## danceability_. 2.776e-03 1.677e-03 1.655 0.098625 .
## Login_apple_playlists 2.364e-01 7.082e-02 3.338 0.000917 ***
## Login_apple_charts 9.589e-02 1.739e-02 5.515 6.01e-08 ***
## Logacousticness_. 5.627e-02 1.882e-02 2.990 0.002945 **
## Logspeechiness_. -6.616e-02 2.967e-02 -2.230 0.026263 *
## I(Login_apple_playlists^2) -1.729e-02 1.204e-02 -1.436 0.151657
## I(in_spotify_charts^2) -1.096e-03 3.910e-04 -2.804 0.005279 **
## I(Login_spotify_playlists^2) 2.720e-02 2.582e-03 10.534 < 2e-16 ***
## I(released_day^2) 7.877e-04 3.419e-04 2.304 0.021695 *
## I(released_year^2) -7.169e-04 1.743e-04 -4.114 4.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 432 degrees of freedom
## Multiple R-squared: 0.7305, Adjusted R-squared: 0.7211
## F-statistic: 78.06 on 15 and 432 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7211). However, the p-values of several variables are still above 0.05 (for example artist_count, p = 0.215, and released_day, p = 0.162), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
With 14 variables, the model is:
lm14 <- lm(Logstreams ~ released_year + released_day +
             in_spotify_charts + bpm + danceability_. +
             Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(Login_apple_playlists^2) + I(in_spotify_charts^2) +
             I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm14)
##
## Call:
## lm(formula = Logstreams ~ released_year + released_day + in_spotify_charts +
## bpm + danceability_. + Login_apple_playlists + Login_apple_charts +
## Logacousticness_. + Logspeechiness_. + I(Login_apple_playlists^2) +
## I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) +
## I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47407 -0.30273 0.02022 0.30182 2.08239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.865e+03 6.954e+02 -4.120 4.55e-05 ***
## released_year 2.876e+00 6.966e-01 4.129 4.37e-05 ***
## released_day -1.517e-02 1.052e-02 -1.441 0.15022
## in_spotify_charts 4.377e-02 1.065e-02 4.111 4.71e-05 ***
## bpm 2.101e-03 8.441e-04 2.489 0.01318 *
## danceability_. 2.383e-03 1.648e-03 1.446 0.14887
## Login_apple_playlists 2.311e-01 7.074e-02 3.267 0.00117 **
## Login_apple_charts 1.003e-01 1.704e-02 5.884 8.03e-09 ***
## Logacousticness_. 5.692e-02 1.882e-02 3.024 0.00264 **
## Logspeechiness_. -7.039e-02 2.949e-02 -2.387 0.01744 *
## I(Login_apple_playlists^2) -1.759e-02 1.205e-02 -1.460 0.14498
## I(in_spotify_charts^2) -1.087e-03 3.912e-04 -2.780 0.00568 **
## I(Login_spotify_playlists^2) 2.743e-02 2.577e-03 10.642 < 2e-16 ***
## I(released_day^2) 7.999e-04 3.420e-04 2.339 0.01978 *
## I(released_year^2) -7.177e-04 1.744e-04 -4.115 4.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4819 on 433 degrees of freedom
## Multiple R-squared: 0.7295, Adjusted R-squared: 0.7208
## F-statistic: 83.42 on 14 and 433 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.7208). However, the p-values of three variables are still above 0.05 (released_day, p = 0.150, danceability_., p = 0.149, and I(Login_apple_playlists^2), p = 0.145), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
With 13 variables, the model is shown below. Note that it keeps released_day and drops I(Login_apple_playlists^2), whereas the 13-variable row of the best-subset table marks I(Login_apple_playlists^2) and not released_day, so this is not the 13-variable model selected by regsubsets:
lm13 <- lm(Logstreams ~ released_year + released_day +
             in_spotify_charts + bpm + danceability_. +
             Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
             I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm13)
##
## Call:
## lm(formula = Logstreams ~ released_year + released_day + in_spotify_charts +
## bpm + danceability_. + Login_apple_playlists + Login_apple_charts +
## Logacousticness_. + Logspeechiness_. + I(in_spotify_charts^2) +
## I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
## data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46240 -0.28277 0.01942 0.29691 2.08856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.949e+03 6.940e+02 -4.249 2.63e-05 ***
## released_year 2.961e+00 6.951e-01 4.260 2.51e-05 ***
## released_day -1.421e-02 1.052e-02 -1.351 0.17738
## in_spotify_charts 4.380e-02 1.066e-02 4.109 4.75e-05 ***
## bpm 2.202e-03 8.423e-04 2.614 0.00925 **
## danceability_. 2.484e-03 1.649e-03 1.506 0.13269
## Login_apple_playlists 1.368e-01 2.887e-02 4.737 2.94e-06 ***
## Login_apple_charts 9.773e-02 1.697e-02 5.758 1.61e-08 ***
## Logacousticness_. 5.671e-02 1.885e-02 3.009 0.00277 **
## Logspeechiness_. -6.871e-02 2.951e-02 -2.328 0.02036 *
## I(in_spotify_charts^2) -1.076e-03 3.916e-04 -2.749 0.00623 **
## I(Login_spotify_playlists^2) 2.574e-02 2.307e-03 11.157 < 2e-16 ***
## I(released_day^2) 7.798e-04 3.421e-04 2.279 0.02314 *
## I(released_year^2) -7.390e-04 1.740e-04 -4.247 2.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4825 on 434 degrees of freedom
## Multiple R-squared: 0.7282, Adjusted R-squared: 0.72
## F-statistic: 89.44 on 13 and 434 DF, p-value: < 2.2e-16
The adjusted R-squared is again fairly high (0.72). However, the p-values of two variables are still above 0.05 (released_day, p = 0.177, and danceability_., p = 0.133), so their coefficients are not significantly different from zero given the other variables in the model. Therefore, this is not yet the model we are looking for.
With 12 variables, the model is:
lm12 <- lm(Logstreams ~ released_year +
             in_spotify_charts + bpm + danceability_. +
             Login_apple_playlists + Login_apple_charts +
             Logacousticness_. + Logspeechiness_. +
             I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
             I(released_day^2) + I(released_year^2),
           data = songtrain3)
summary(lm12)
##
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts +
## bpm + danceability_. + Login_apple_playlists + Login_apple_charts +
## Logacousticness_. + Logspeechiness_. + I(in_spotify_charts^2) +
## I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2),
## data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47451 -0.28725 0.02535 0.29404 2.11136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.954e+03 6.946e+02 -4.253 2.59e-05 ***
## released_year 2.967e+00 6.957e-01 4.265 2.46e-05 ***
## in_spotify_charts 4.442e-02 1.066e-02 4.167 3.72e-05 ***
## bpm 2.179e-03 8.430e-04 2.585 0.010056 *
## danceability_. 2.690e-03 1.643e-03 1.637 0.102293
## Login_apple_playlists 1.382e-01 2.888e-02 4.783 2.37e-06 ***
## Login_apple_charts 9.768e-02 1.699e-02 5.750 1.69e-08 ***
## Logacousticness_. 5.480e-02 1.881e-02 2.913 0.003759 **
## Logspeechiness_. -6.885e-02 2.954e-02 -2.331 0.020215 *
## I(in_spotify_charts^2) -1.077e-03 3.920e-04 -2.748 0.006242 **
## I(Login_spotify_playlists^2) 2.575e-02 2.309e-03 11.152 < 2e-16 ***
## I(released_day^2) 3.323e-04 8.577e-05 3.874 0.000124 ***
## I(released_year^2) -7.408e-04 1.742e-04 -4.253 2.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.483 on 435 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7195
## F-statistic: 96.55 on 12 and 435 DF, p-value: < 2.2e-16
From the model, we can see that the adjusted R-squared is still fairly high (0.7195). However, the p-value for danceability_. (0.102) is above 0.05, so we have no evidence that this term helps predict Logstreams once the other variables are in the model. Therefore, this is not yet the model we are looking for.
With 11 variables (dropping danceability_.), the model is as follows:
lm11 <- lm(Logstreams ~ released_year + in_spotify_charts + bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. + Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), songtrain3)
summary(lm11)
##
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts +
## bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. +
## Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54108 -0.29015 0.00861 0.28965 2.10930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.874e+03 6.942e+02 -4.139 4.18e-05 ***
## released_year 2.886e+00 6.953e-01 4.151 3.98e-05 ***
## in_spotify_charts 4.418e-02 1.068e-02 4.137 4.22e-05 ***
## bpm 1.967e-03 8.345e-04 2.357 0.018878 *
## Login_apple_playlists 1.484e-01 2.825e-02 5.255 2.32e-07 ***
## Login_apple_charts 9.646e-02 1.701e-02 5.672 2.57e-08 ***
## Logacousticness_. 4.993e-02 1.861e-02 2.683 0.007571 **
## Logspeechiness_. -5.751e-02 2.877e-02 -1.999 0.046230 *
## I(in_spotify_charts^2) -1.059e-03 3.926e-04 -2.698 0.007246 **
## I(Login_spotify_playlists^2) 2.532e-02 2.298e-03 11.016 < 2e-16 ***
## I(released_day^2) 3.363e-04 8.590e-05 3.915 0.000105 ***
## I(released_year^2) -7.206e-04 1.741e-04 -4.140 4.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4839 on 436 degrees of freedom
## Multiple R-squared: 0.7254, Adjusted R-squared: 0.7184
## F-statistic: 104.7 on 11 and 436 DF, p-value: < 2.2e-16
In this model, every term has a p-value below 0.05 (the largest is 0.046, for Logspeechiness_.). In addition, the adjusted R-squared for this model is higher than that of the base model. Therefore, this is the best model that we get from Best Subsets.
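The sequence lm13, lm12, lm11 above drops one non-significant term at a time and refits. That procedure can be sketched as a small loop. This is only an illustration on simulated data, not the code used for this report: the helper name backward_eliminate, the 0.05 threshold, and the toy variables x1 to x4 are assumptions made for the example. It also assumes numeric predictors, so that coefficient names match term names.

```r
# Hypothetical helper: repeatedly drop the term with the largest p-value
# until every remaining term has p < alpha.
backward_eliminate <- function(formula, data, alpha = 0.05) {
  fit <- lm(formula, data = data)
  repeat {
    coefs <- summary(fit)$coefficients
    keep  <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, 4], rownames(coefs)[keep])
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- names(pvals)[which.max(pvals)]
    # Refit without the least significant term
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Toy data: x3 and x4 are pure noise by construction
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

final <- backward_eliminate(y ~ x1 + x2 + x3 + x4, toy)
formula(final)
```

Selecting terms by p-value on the same data that is later used for inference tends to make the surviving p-values look better than they are, which is one reason to check the final model on held-out data.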
Assumptions for Linear Models:
ε|xi is independent of ε|xj for any xi ≠ xj (independence)
ε|x has standard deviation σ that does not depend on x. (homoscedasticity)
ε|x is normally distributed for each x (normality)
cor(songtrain3)
## artist_count released_year released_month released_day
## artist_count 1.000000000 0.106810521 -0.008180603 0.056057079
## released_year 0.106810521 1.000000000 0.061312795 0.078709598
## released_month -0.008180603 0.061312795 1.000000000 0.075768333
## released_day 0.056057079 0.078709598 0.075768333 1.000000000
## in_spotify_charts -0.044775331 -0.033404658 -0.046296838 -0.013569012
## bpm -0.015177062 0.029475277 -0.047385342 -0.059158876
## danceability_. 0.267606806 0.113874864 -0.056125166 0.031512743
## valence_. 0.181556082 -0.071527723 -0.186141454 0.046435367
## energy_. 0.190884266 0.002169269 -0.059985458 0.048946540
## Logstreams -0.058810263 -0.275761046 0.043743792 0.107736939
## Login_spotify_playlists -0.026381520 -0.434432988 -0.007532317 0.007167030
## Login_apple_playlists 0.101961289 -0.260879144 0.003830782 0.064866354
## Login_apple_charts -0.151447141 -0.143568654 0.060185043 -0.035998023
## Logacousticness_. -0.058880378 0.013540150 -0.008621915 0.002118181
## Loginstrumentalness_. -0.128318562 -0.104007321 0.065836377 0.015342311
## Logliveness_. 0.059694527 0.051878154 -0.016946051 -0.016049702
## Logspeechiness_. 0.159523734 0.118884400 0.015485453 -0.016884250
## in_spotify_charts bpm danceability_.
## artist_count -0.04477533 -0.015177062 0.26760681
## released_year -0.03340466 0.029475277 0.11387486
## released_month -0.04629684 -0.047385342 -0.05612517
## released_day -0.01356901 -0.059158876 0.03151274
## in_spotify_charts 1.00000000 0.016017589 0.01998765
## bpm 0.01601759 1.000000000 -0.10224367
## danceability_. 0.01998765 -0.102243670 1.00000000
## valence_. 0.08373595 -0.015341812 0.39487445
## energy_. 0.06107525 0.007069839 0.18059295
## Logstreams 0.32427195 -0.006475977 0.02739416
## Login_spotify_playlists 0.19343294 -0.035757248 -0.02942905
## Login_apple_playlists 0.19039098 -0.030208978 0.13793751
## Login_apple_charts 0.30247000 -0.083644732 0.01569308
## Logacousticness_. -0.03801276 -0.076463396 -0.15953767
## Loginstrumentalness_. -0.07537085 -0.048405969 -0.17970892
## Logliveness_. -0.02628857 -0.029554319 -0.06622832
## Logspeechiness_. -0.04677227 0.097613871 0.20909260
## valence_. energy_. Logstreams
## artist_count 0.18155608 0.190884266 -0.058810263
## released_year -0.07152772 0.002169269 -0.275761046
## released_month -0.18614145 -0.059985458 0.043743792
## released_day 0.04643537 0.048946540 0.107736939
## in_spotify_charts 0.08373595 0.061075249 0.324271949
## bpm -0.01534181 0.007069839 -0.006475977
## danceability_. 0.39487445 0.180592952 0.027394160
## valence_. 1.00000000 0.373863941 -0.001801740
## energy_. 0.37386394 1.000000000 0.054954160
## Logstreams -0.00180174 0.054954160 1.000000000
## Login_spotify_playlists -0.03850706 0.040532652 0.762942174
## Login_apple_playlists 0.06726161 0.114414061 0.698380539
## Login_apple_charts 0.02198265 0.124607248 0.497448939
## Logacousticness_. -0.03100504 -0.448604051 -0.101385863
## Loginstrumentalness_. -0.12783397 -0.101127695 0.020105996
## Logliveness_. -0.01432861 0.063322383 -0.043590364
## Logspeechiness_. 0.08631690 0.111212690 -0.158019883
## Login_spotify_playlists Login_apple_playlists
## artist_count -0.026381520 0.101961289
## released_year -0.434432988 -0.260879144
## released_month -0.007532317 0.003830782
## released_day 0.007167030 0.064866354
## in_spotify_charts 0.193432944 0.190390983
## bpm -0.035757248 -0.030208978
## danceability_. -0.029429048 0.137937508
## valence_. -0.038507061 0.067261607
## energy_. 0.040532652 0.114414061
## Logstreams 0.762942174 0.698380539
## Login_spotify_playlists 1.000000000 0.724623847
## Login_apple_playlists 0.724623847 1.000000000
## Login_apple_charts 0.345094228 0.456850561
## Logacousticness_. -0.145580828 -0.200347131
## Loginstrumentalness_. 0.065869484 -0.039075810
## Logliveness_. -0.035924257 -0.050530857
## Logspeechiness_. -0.083378273 -0.129618095
## Login_apple_charts Logacousticness_.
## artist_count -0.15144714 -0.058880378
## released_year -0.14356865 0.013540150
## released_month 0.06018504 -0.008621915
## released_day -0.03599802 0.002118181
## in_spotify_charts 0.30247000 -0.038012756
## bpm -0.08364473 -0.076463396
## danceability_. 0.01569308 -0.159537667
## valence_. 0.02198265 -0.031005043
## energy_. 0.12460725 -0.448604051
## Logstreams 0.49744894 -0.101385863
## Login_spotify_playlists 0.34509423 -0.145580828
## Login_apple_playlists 0.45685056 -0.200347131
## Login_apple_charts 1.00000000 -0.144498524
## Logacousticness_. -0.14449852 1.000000000
## Loginstrumentalness_. -0.03320082 0.059981137
## Logliveness_. 0.03014205 -0.009913390
## Logspeechiness_. -0.13558022 0.002784047
## Loginstrumentalness_. Logliveness_. Logspeechiness_.
## artist_count -0.12831856 0.05969453 0.159523734
## released_year -0.10400732 0.05187815 0.118884400
## released_month 0.06583638 -0.01694605 0.015485453
## released_day 0.01534231 -0.01604970 -0.016884250
## in_spotify_charts -0.07537085 -0.02628857 -0.046772270
## bpm -0.04840597 -0.02955432 0.097613871
## danceability_. -0.17970892 -0.06622832 0.209092601
## valence_. -0.12783397 -0.01432861 0.086316903
## energy_. -0.10112769 0.06332238 0.111212690
## Logstreams 0.02010600 -0.04359036 -0.158019883
## Login_spotify_playlists 0.06586948 -0.03592426 -0.083378273
## Login_apple_playlists -0.03907581 -0.05053086 -0.129618095
## Login_apple_charts -0.03320082 0.03014205 -0.135580220
## Logacousticness_. 0.05998114 -0.00991339 0.002784047
## Loginstrumentalness_. 1.00000000 -0.07352980 -0.186365272
## Logliveness_. -0.07352980 1.00000000 -0.013093706
## Logspeechiness_. -0.18636527 -0.01309371 1.000000000
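Based on the correlation matrix, no pair of predictors used in the model is very highly correlated; the strongest pair is Login_spotify_playlists and Login_apple_playlists (0.72). Strictly speaking, this matrix checks for multicollinearity among the predictors rather than independence of the errors, but it is the evidence we have, and we take the model to be consistent with the independence assumption.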
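Pairwise correlations can miss collinearity that involves several predictors at once, so a variance inflation factor (VIF) is a common complement. The sketch below computes it with base R on simulated data. The helper name vif_manual and the toy columns a, b, c are assumptions for illustration; for a fitted model, car::vif() reports the same quantity.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

# Toy data: b is built from a, so both should show inflated VIFs; c is unrelated
set.seed(2)
n <- 300
a <- rnorm(n)
toy <- data.frame(a = a, b = 0.9 * a + rnorm(n, sd = 0.3), c = rnorm(n))

v <- vif_manual(toy)
round(v, 2)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though the cutoff is a convention rather than a test.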
songtrain4 <- songtrain3 %>%
mutate(res = residuals(lm11), fit = fitted.values(lm11))
shapiro.test(songtrain4$res)
##
## Shapiro-Wilk normality test
##
## data: songtrain4$res
## W = 0.99053, p-value = 0.005632
ncvTest(lm11)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 11.12461, Df = 1, p = 0.0008519
From the tests, the p-values are 0.005632 for the Shapiro-Wilk test and 0.0008519 for ncvTest. Both are smaller than 0.05, which suggests that the residuals are not consistent with normality and homoscedasticity. However, when the sample size is large, tests like these become very powerful and reject the null hypothesis even for small departures that matter little in practice. So we also plot the histogram and scatterplot of the residuals to reach a better conclusion about the assumptions.
ggplot(songtrain4, aes(res)) + geom_histogram(fill = "royalblue1", color = "black") + labs(title = "Histogram of residuals")
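One way to see how sample size drives these tests is a small simulation: draw errors that are only mildly non-normal and record how often the Shapiro-Wilk test rejects at different sample sizes. The sketch below is illustrative only. The t distribution with 8 degrees of freedom, the sample sizes, and the helper name reject_rate are assumptions for the demo, not part of this analysis.

```r
set.seed(3)

# Hypothetical helper: share of simulated samples of size n for which
# shapiro.test() rejects normality at the 5% level.
reject_rate <- function(n, reps = 200) {
  mean(replicate(reps, shapiro.test(rt(n, df = 8))$p.value < 0.05))
}

# Same mildly heavy-tailed distribution, three sample sizes
rates <- sapply(c(30, 450, 3000), reject_rate)
names(rates) <- c("n=30", "n=450", "n=3000")
rates
```

The underlying distribution is identical in all three cases; only the amount of data changes, which is why a graphical check of the residuals is a useful complement to the formal test.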
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(songtrain4, aes(fit, res)) + geom_point(color = "royalblue1") + labs(title = "Scatterplot of residuals")
In the first plot, the residual histogram is approximately normal, which means that this model is consistent with the normality assumption.
In the second plot, the residuals show no trend against the fitted values, which means the model is also consistent with the homoscedasticity assumption. In addition, we judged the model to be consistent with the independence assumption from the correlation matrix above.
Therefore we conclude that this model is valid for predicting Logstreams as a function of released_year, in_spotify_charts, bpm, Login_apple_playlists, Login_apple_charts, Logacousticness_., Logspeechiness_., I(in_spotify_charts^2), I(Login_spotify_playlists^2), I(released_day^2), I(released_year^2).
summary(lm11)
##
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts +
## bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. +
## Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtrain3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54108 -0.29015 0.00861 0.28965 2.10930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.874e+03 6.942e+02 -4.139 4.18e-05 ***
## released_year 2.886e+00 6.953e-01 4.151 3.98e-05 ***
## in_spotify_charts 4.418e-02 1.068e-02 4.137 4.22e-05 ***
## bpm 1.967e-03 8.345e-04 2.357 0.018878 *
## Login_apple_playlists 1.484e-01 2.825e-02 5.255 2.32e-07 ***
## Login_apple_charts 9.646e-02 1.701e-02 5.672 2.57e-08 ***
## Logacousticness_. 4.993e-02 1.861e-02 2.683 0.007571 **
## Logspeechiness_. -5.751e-02 2.877e-02 -1.999 0.046230 *
## I(in_spotify_charts^2) -1.059e-03 3.926e-04 -2.698 0.007246 **
## I(Login_spotify_playlists^2) 2.532e-02 2.298e-03 11.016 < 2e-16 ***
## I(released_day^2) 3.363e-04 8.590e-05 3.915 0.000105 ***
## I(released_year^2) -7.206e-04 1.741e-04 -4.140 4.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4839 on 436 degrees of freedom
## Multiple R-squared: 0.7254, Adjusted R-squared: 0.7184
## F-statistic: 104.7 on 11 and 436 DF, p-value: < 2.2e-16
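From this model, we can see the following estimated coefficients:
Intercept: -2.874e+03
released_year: 2.886e+00
in_spotify_charts: 4.418e-02
bpm: 1.967e-03
Login_apple_playlists: 1.484e-01
Login_apple_charts: 9.646e-02
Logacousticness_.: 4.993e-02
Logspeechiness_.: -5.751e-02
I(in_spotify_charts^2): -1.059e-03
I(Login_spotify_playlists^2): 2.532e-02
I(released_day^2): 3.363e-04
I(released_year^2): -7.206e-04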
Overall, the adjusted R-squared is 0.7184, which means that about 71.84% of the variation in Logstreams is explained by these variables, after adjusting for the number of predictors. The p-value of < 2.2e-16 for the F-statistic indicates that the model as a whole is statistically significant in predicting Logstreams.
coeftest(lm11,vcov = vcovHC(lm11,type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8736e+03 7.2158e+02 -3.9824 7.991e-05 ***
## released_year 2.8864e+00 7.2363e-01 3.9888 7.786e-05 ***
## in_spotify_charts 4.4185e-02 8.6840e-03 5.0881 5.386e-07 ***
## bpm 1.9668e-03 8.8963e-04 2.2108 0.0275702 *
## Login_apple_playlists 1.4844e-01 3.3847e-02 4.3855 1.453e-05 ***
## Login_apple_charts 9.6457e-02 1.6211e-02 5.9501 5.510e-09 ***
## Logacousticness_. 4.9930e-02 1.6900e-02 2.9544 0.0033020 **
## Logspeechiness_. -5.7509e-02 2.9788e-02 -1.9306 0.0541844 .
## I(in_spotify_charts^2) -1.0592e-03 2.9528e-04 -3.5870 0.0003723 ***
## I(Login_spotify_playlists^2) 2.5320e-02 2.6517e-03 9.5487 < 2.2e-16 ***
## I(released_day^2) 3.3633e-04 8.0689e-05 4.1682 3.704e-05 ***
## I(released_year^2) -7.2056e-04 1.8138e-04 -3.9726 8.316e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We compute robust (HC1) standard errors to see whether they address the possible heteroscedasticity in our model. Based on the result, the standard errors change (some increase and some decrease), but the p-values of almost all the variables are still smaller than 0.05. The exception is Logspeechiness_., whose p-value rises to 0.054. So almost all the variables remain statistically significant. From the residual plot above, the model already appears consistent with homoscedasticity, so using robust standard errors makes little difference for this model.
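To make clear what the HC1 option computes, here is a minimal base R sketch of the sandwich formula on simulated data. The helper name hc1_vcov and the toy variables are assumptions for illustration; the result is intended to correspond to vcovHC(fit, type = "HC1"), which is the HC0 sandwich scaled by n / (n - k).

```r
# Hypothetical helper: HC1 robust covariance matrix for an lm fit.
hc1_vcov <- function(fit) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  n <- nrow(X)
  k <- ncol(X)
  bread <- solve(crossprod(X))   # (X'X)^-1
  meat  <- crossprod(X * e)      # X' diag(e^2) X: row i of X is scaled by e_i
  (n / (n - k)) * bread %*% meat %*% bread
}

# Toy data where the error spread grows with x (heteroscedastic by construction)
set.seed(4)
n <- 400
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

V <- hc1_vcov(fit)
robust_se  <- sqrt(diag(V))
classic_se <- sqrt(diag(vcov(fit)))
cbind(classic_se, robust_se)
```

The coefficient estimates are unchanged by this correction; only the standard errors, and therefore the t values and p-values, differ.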
newdata = data.frame(released_year = 2023, in_spotify_charts = 0, bpm = 115, Login_apple_playlists = 4, Login_apple_charts = 4, Logacousticness_. = 3.7, Logspeechiness_. = 3.2 , Login_spotify_playlists = 7 , released_day = 21)
predict(lm11,newdata,interval = "predict", level = 0.99)
## fit lwr upr
## 1 19.30162 18.03766 20.56559
Using the values released_year = 2023, in_spotify_charts = 0, bpm = 115, Login_apple_playlists = 4, Login_apple_charts = 4, Logacousticness_. = 3.7, Logspeechiness_. = 3.2, Login_spotify_playlists = 7, released_day = 21, the model predicts a Logstreams of 19.30162, with a 99% prediction interval from 18.03766 to 20.56559.
To check the model, we try the same model on a different dataset, songtest2. But first, we need to add the log-transformed columns to the test set and drop the original columns.
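Because the response is on the log scale, the prediction is easier to read after transforming it back to streams. The sketch below assumes Logstreams is the natural log of streams, which matches the log() call used when the test set is transformed; the variable names log_pred and streams_pred are made up for this example.

```r
# Values copied from the predict() output above (log scale)
log_pred <- c(fit = 19.30162, lwr = 18.03766, upr = 20.56559)

# Back-transform to the original scale (natural log assumed)
streams_pred <- exp(log_pred)

# Express in millions of streams for readability
round(streams_pred / 1e6, 1)
```

The back-transformed point prediction is better read as a typical (median-like) value than as a mean, since exponentiating a predicted mean of logs does not give the mean of the original variable.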
songtest2 <- songtest %>%
  # add the log-transformed columns used by the model
  mutate(
    Login_apple_playlists = log(in_apple_playlists + 1),
    Login_apple_charts = log(in_apple_charts + 1),
    Logspeechiness_. = log(speechiness_.),
    Logacousticness_. = log(acousticness_. + 1),
    Login_spotify_playlists = log(in_spotify_playlists),
    Logstreams = log(streams)
  ) %>%
  # drop the original, untransformed columns
  select(-in_apple_playlists, -in_apple_charts, -speechiness_., -acousticness_., -in_spotify_playlists, -streams)
lm_test <- lm(Logstreams ~ released_year + in_spotify_charts + bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. + Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) + I(released_day^2) + I(released_year^2), songtest2)
summary(lm_test)
##
## Call:
## lm(formula = Logstreams ~ released_year + in_spotify_charts +
## bpm + Login_apple_playlists + Login_apple_charts + Logacousticness_. +
## Logspeechiness_. + I(in_spotify_charts^2) + I(Login_spotify_playlists^2) +
## I(released_day^2) + I(released_year^2), data = songtest2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8234 -0.3074 0.0174 0.4138 2.4398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.093e+03 6.082e+02 -1.798 0.0729 .
## released_year 1.112e+00 6.115e-01 1.819 0.0697 .
## in_spotify_charts 5.880e-03 4.517e-03 1.302 0.1937
## bpm -1.241e-03 1.401e-03 -0.886 0.3760
## Login_apple_playlists -1.107e-01 4.957e-02 -2.233 0.0261 *
## Login_apple_charts 3.701e-02 3.453e-02 1.072 0.2844
## Logacousticness_. 5.553e-02 3.146e-02 1.765 0.0783 .
## Logspeechiness_. -4.133e-02 5.510e-02 -0.750 0.4536
## I(in_spotify_charts^2) -6.395e-05 4.745e-05 -1.348 0.1784
## I(Login_spotify_playlists^2) 4.859e-02 3.173e-03 15.310 <2e-16 ***
## I(released_day^2) 2.044e-04 1.357e-04 1.506 0.1327
## I(released_year^2) -2.785e-04 1.537e-04 -1.812 0.0707 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8049 on 435 degrees of freedom
## Multiple R-squared: 0.6252, Adjusted R-squared: 0.6157
## F-statistic: 65.97 on 11 and 435 DF, p-value: < 2.2e-16
Based on the result, most of the variables become insignificant when the same model is fitted to the test set. Possible reasons include:
- Multicollinearity: The dataset may suffer from multicollinearity, where predictor variables are highly correlated with each other.
- Small sample size: After splitting, the training dataset has only about 450 observations, which could limit the generalizability of the findings.
- Overfitting: The “best” model may have captured noise or random fluctuations in the training data, so it fails to generalize to new data.
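The check above refits the model on the test set. A complementary check is to keep the coefficients estimated on the training set and measure how well they predict held-out rows. The sketch below shows that pattern on simulated data; the toy variables and the helper name rmse are assumptions for illustration, and in this report the corresponding call would presumably be predict(lm11, newdata = songtest2).

```r
# Toy data with one linear and one quadratic effect
set.seed(5)
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.8 * toy$x1 + 0.3 * toy$x2^2 + rnorm(n, sd = 0.5)

# Random 50/50 split into training and test rows
idx   <- sample(n, size = n / 2)
train <- toy[idx, ]
test  <- toy[-idx, ]

fit  <- lm(y ~ x1 + I(x2^2), data = train)   # fitted on training rows only
pred <- predict(fit, newdata = test)         # coefficients are not re-estimated

# Hypothetical helper: root mean squared error
rmse <- function(obs, est) sqrt(mean((obs - est)^2))

metrics <- c(train_rmse = rmse(train$y, fitted(fit)),
             test_rmse  = rmse(test$y, pred),
             test_r2    = cor(test$y, pred)^2)
metrics
```

A test RMSE much larger than the training RMSE would point toward overfitting, while similar values suggest the model generalizes reasonably well.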
confint(lm11)
## 2.5 % 97.5 %
## (Intercept) -4.238091e+03 -1.509205e+03
## released_year 1.519896e+00 4.252981e+00
## in_spotify_charts 2.319561e-02 6.517406e-02
## bpm 3.265649e-04 3.606960e-03
## Login_apple_playlists 9.291790e-02 2.039547e-01
## Login_apple_charts 6.303465e-02 1.298796e-01
## Logacousticness_. 1.335562e-02 8.650406e-02
## Logspeechiness_. -1.140517e-01 -9.656708e-04
## I(in_spotify_charts^2) -1.830745e-03 -2.876047e-04
## I(Login_spotify_playlists^2) 2.080298e-02 2.983789e-02
## I(released_day^2) 1.674987e-04 5.051604e-04
## I(released_year^2) -1.062669e-03 -3.784500e-04
We interpret each coefficient along with its corresponding 95% confidence interval:
(Intercept):
- 95% CI: -4.238091e+03 to -1.509205e+03.
- Interpretation: We are 95% confident that the mean Logstreams when all predictor variables are zero is between approximately -4238 and -1509. This is an extrapolation far outside the data (for example, released_year = 0), so it has no practical meaning.
released_year:
- 95% CI: 1.519896e+00 to 4.252981e+00.
- Interpretation: We are 95% confident that the linear coefficient of released_year is between approximately 1.52 and 4.25. Because the model also includes I(released_year^2), the effect of a one-year increase depends on the year and must be read together with the squared term.
in_spotify_charts:
- 95% CI: 2.319561e-02 to 6.517406e-02.
- Interpretation: We are 95% confident that the linear coefficient of in_spotify_charts is between approximately 0.0232 and 0.0652. Because the model also includes I(in_spotify_charts^2), the effect of a one-unit increase depends on the current value of in_spotify_charts.
bpm:
- 95% CI: 3.265649e-04 to 3.606960e-03.
- Interpretation: We are 95% confident that for each one-unit increase in beats per minute (bpm), Logstreams increases by between approximately 0.000327 and 0.00361, holding the other variables fixed.
Login_apple_playlists:
- 95% CI: 9.291790e-02 to 2.039547e-01.
- Interpretation: We are 95% confident that for each one-unit increase in the log-transformed number of Apple playlists, Logstreams increases by between approximately 0.0929 and 0.204, holding the other variables fixed.
Login_apple_charts:
- 95% CI: 6.303465e-02 to 1.298796e-01.
- Interpretation: We are 95% confident that for each one-unit increase in the log-transformed Apple charts value, Logstreams increases by between approximately 0.063 and 0.130, holding the other variables fixed.
Logacousticness_.:
- 95% CI: 1.335562e-02 to 8.650406e-02.
- Interpretation: We are 95% confident that for each one-unit increase in the log-transformed acousticness, Logstreams increases by between approximately 0.0134 and 0.0865, holding the other variables fixed.
Logspeechiness_.:
- 95% CI: -1.140517e-01 to -9.656708e-04.
- Interpretation: We are 95% confident that for each one-unit increase in the log-transformed speechiness, Logstreams decreases by between approximately 0.001 and 0.114, holding the other variables fixed.
I(in_spotify_charts^2):
- 95% CI: -1.830745e-03 to -2.876047e-04.
- Interpretation: We are 95% confident that the coefficient of the squared term is between approximately -0.00183 and -0.000288. The negative sign means the positive effect of in_spotify_charts on Logstreams weakens as in_spotify_charts grows.
I(Login_spotify_playlists^2):
- 95% CI: 2.080298e-02 to 2.983789e-02.
- Interpretation: We are 95% confident that the coefficient of the squared log-transformed number of Spotify playlists is between approximately 0.0208 and 0.0298. The model has no linear term for this variable, so this is its only contribution to Logstreams.
I(released_day^2):
- 95% CI: 1.674987e-04 to 5.051604e-04.
- Interpretation: We are 95% confident that the coefficient of the squared released_day is between approximately 0.000167 and 0.000505. The final model has no linear term for released_day, so this is its only contribution to Logstreams.
I(released_year^2):
- 95% CI: -1.062669e-03 to -3.784500e-04.
- Interpretation: We are 95% confident that the coefficient of the squared released_year is between approximately -0.00106 and -0.000378. It must be read together with the linear released_year term.
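For a variable that enters with both a linear and a squared term, the change in Logstreams per unit depends on where the variable currently sits. The sketch below works this out for in_spotify_charts using the point estimates from summary(lm11); the names b1, b2, marginal_effect, and turning_point are made up for this example, and the calculation ignores the uncertainty in the estimates.

```r
# Point estimates copied from summary(lm11)
b1 <- 4.418e-02    # in_spotify_charts
b2 <- -1.059e-03   # I(in_spotify_charts^2)

# Slope of Logstreams with respect to in_spotify_charts at a given value x:
# d/dx (b1 * x + b2 * x^2) = b1 + 2 * b2 * x
marginal_effect <- function(x) b1 + 2 * b2 * x

# Slope at a few values of in_spotify_charts
marginal_effect(c(0, 10, 20, 30))

# Value of in_spotify_charts at which the fitted curve peaks
turning_point <- -b1 / (2 * b2)
turning_point
```

Below the turning point the fitted relationship is increasing, and above it the fitted relationship bends downward, which is what the negative squared coefficient implies.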
Limitations:
We want to explore the question: